MID TERM ASSESSMENT – BIA 5302

SECTION A (CONCEPTS)

1. A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so). Construct the confusion matrix and calculate the overall error rate. Please answer in Word/pen and paper.

Answer: Classification confusion matrix (rows = actual class, columns = predicted class):

                          Predicted Fraudulent    Predicted Non-Fraudulent
Actual Fraudulent                  30                         32
Actual Non-Fraudulent              58                        920

Error rate = (32 + 58) / (30 + 32 + 920 + 58) = 90 / 1040 = 0.0865, or 8.65%.

2. Suppose that this routine has an adjustable cutoff (threshold) mechanism by which you can alter the proportion of records classified as fraudulent. Describe how moving the cutoff up or down would affect:

(a) the classification error rate for records that are truly fraudulent.

Answer: Raising the cutoff increases this error rate: more truly fraudulent records fall below the threshold and are wrongly classified as non-fraudulent (false negatives). Lowering the cutoff decreases it: more truly fraudulent records are correctly flagged (true positives).

(b) the classification error rate for records that are truly non-fraudulent.

Answer: Raising the cutoff decreases this error rate: fewer truly non-fraudulent records are incorrectly flagged as fraudulent, i.e., fewer false positives. Lowering the cutoff increases it: more truly non-fraudulent records are mistakenly labeled fraudulent (false positives).
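For reference, a minimal Python sketch of the arithmetic above (the counts are taken directly from the question):

```python
import numpy as np

# Confusion matrix from the question: rows = actual, columns = predicted.
#                     pred. fraud  pred. non-fraud
confusion = np.array([[30,  32],    # actual fraudulent
                      [58, 920]])   # actual non-fraudulent

# The off-diagonal cells are the misclassified records.
errors = confusion[0, 1] + confusion[1, 0]
total = confusion.sum()
print(f"Overall error rate: {errors / total:.4f}")  # 90 / 1040 = 0.0865
```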
SECTION B (IMPLEMENTATION)

3. This question requires students to apply various data visualization techniques to represent and analyze datasets effectively.

a) Dataset Selection: Choose any publicly available dataset of your choice. It could be related to any domain or topic (e.g., finance, sports, healthcare, etc.).

Dataset selected: the Iris dataset from Kaggle.

b) Tasks to Perform: Implement the following tasks using Python's data visualization libraries (e.g., Matplotlib, Seaborn, Plotly, etc.).

Data Exploration:
I. Load the dataset into your Python environment.
II. Conduct initial data exploration to understand the dataset's structure, content, and statistical summary.
III. Identify relevant variables/features for visualization.

Univariate Visualization:
i. Create at least two appropriate visualizations to represent the distribution of individual variables/features. Examples: histograms, box plots, bar plots, etc. (A sketch covering the exploration and univariate steps follows below.)
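A minimal sketch of the exploration and univariate steps. It loads seaborn's bundled copy of Iris for reproducibility; the Kaggle CSV is equivalent, though its column names differ slightly from the snake_case names assumed here:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris data (seaborn ships a copy; the Kaggle CSV is equivalent).
iris = sns.load_dataset("iris")

# Initial exploration: structure, content, statistical summary.
iris.info()
print(iris.describe())
print(iris["species"].value_counts())

# Univariate visualizations: a histogram and a box plot.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(iris["sepal_length"], bins=20)
axes[0].set_title("Sepal length distribution")
sns.boxplot(data=iris, y="petal_length", ax=axes[1])
axes[1].set_title("Petal length")
plt.tight_layout()
plt.show()
```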
Bivariate Visualization:
i. Analyze relationships between two variables/features using at least two appropriate visualizations. Examples: scatter plots, line plots, stacked bar plots, etc. (See the sketch below.)
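A sketch of two bivariate views, under the same column-name assumptions as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Scatter plot: petal length vs. petal width, colored by species.
sns.scatterplot(data=iris, x="petal_length", y="petal_width",
                hue="species", ax=axes[0])
# Bar plot: mean sepal length per species.
sns.barplot(data=iris, x="species", y="sepal_length", ax=axes[1])
plt.tight_layout()
plt.show()
```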
Multivariate Visualization:
i. Represent relationships involving multiple variables/features using at least two suitable visualizations. Examples: heatmaps, parallel coordinates, tree maps, etc. (See the sketch below.)
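A sketch of two multivariate views (a correlation heatmap and a parallel-coordinates plot), same assumptions:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

iris = sns.load_dataset("iris")

# Heatmap of pairwise correlations among the four numeric features.
sns.heatmap(iris.drop(columns="species").corr(), annot=True, cmap="coolwarm")
plt.title("Feature correlations")
plt.show()

# Parallel coordinates: one line per flower, colored by species.
parallel_coordinates(iris, class_column="species")
plt.show()
```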
Time-Series Visualization:
i. If your dataset contains temporal data, visualize the trends and patterns over time using at least one chart. Examples: line plots, area plots, seasonal decompositions, etc.

Since the Iris dataset contains no date or time field, this visualization cannot be performed here.

Interactive Visualization:
i. Implement at least one interactive visualization using Plotly or any other interactive visualization library. Examples: interactive scatter plots, choropleth maps, animated plots, etc. (See the sketch below.)
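A sketch of an interactive scatter plot, using Plotly Express and its bundled copy of the Iris data:

```python
import plotly.express as px

iris = px.data.iris()  # Plotly ships a copy of the Iris data

# Interactive scatter: hovering shows exact values; the legend toggles species.
fig = px.scatter(iris, x="sepal_width", y="sepal_length",
                 color="species", hover_data=["petal_length", "petal_width"])
fig.show()
```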
4. Use the breakfast cereals data provided to explore and summarize the data as follows.

a. Plot a histogram for each of the quantitative variables. Based on the histograms and summary statistics, answer the following questions:

Which variables have the largest variability?
Answer: Based on the summary statistics, the Calories variable appears to have the largest variability.

Which variables seem skewed?
Answer: Potass appears positively (right) skewed.

Are there any values that seem extreme?
Answer: Several variables contain a handful of extreme values, for example fat, vitamins, and weight.
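A minimal sketch of part (a); the file name Cereals.csv and the presence of standard numeric columns are assumptions about your copy of the data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the cereals data (file name assumed; adjust to your copy).
cereals = pd.read_csv("Cereals.csv")

# Histogram for each quantitative variable, plus supporting statistics.
numeric = cereals.select_dtypes(include="number")
numeric.hist(bins=15, figsize=(12, 10))
plt.tight_layout()
plt.show()

print(numeric.describe())  # compare the std row to judge variability
print(numeric.skew())      # positive values indicate right skew
```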
b. Plot a side-by-side boxplot of consumer rating as a function of shelf height. If we were to predict consumer rating from shelf height, does it appear that we need to keep all three categories of shelf height?

Answer: The boxplot reveals minimal differences in consumer rating across shelf heights, suggesting a weak influence. The shelf-height categories could therefore be reduced without materially hurting the prediction of consumer rating.
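A sketch of part (b), under the same file-name assumption and assuming the columns are named rating and shelf:

```python
import pandas as pd
import matplotlib.pyplot as plt

cereals = pd.read_csv("Cereals.csv")  # file name assumed

# Side-by-side boxplots of consumer rating for each shelf height.
cereals.boxplot(column="rating", by="shelf")
plt.title("Consumer rating by shelf height")
plt.suptitle("")  # drop pandas' automatic super-title
plt.xlabel("Shelf")
plt.ylabel("Rating")
plt.show()
```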
c. Compute the correlation table for the quantitative variables (using the corr() method). In addition, generate a matrix plot for these variables.
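A sketch of part (c), same assumptions; the matrix plot here is pandas' scatter matrix:

```python
import pandas as pd
import matplotlib.pyplot as plt

cereals = pd.read_csv("Cereals.csv")  # file name assumed
numeric = cereals.select_dtypes(include="number")

# Correlation table for the quantitative variables.
print(numeric.corr().round(2))

# Matrix (pairwise scatter) plot of the same variables.
pd.plotting.scatter_matrix(numeric, figsize=(14, 14))
plt.show()
```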
Which pair of variables is most strongly correlated?
Answer: Weight and calories are the most strongly correlated pair.

How can we reduce the number of variables based on these correlations?
Answer: Identify pairs of highly correlated variables and drop one variable from each pair; this removes redundancy and mitigates multicollinearity.

How would the correlations change if we normalized the data first?
Answer: They would not change. Pearson correlation is invariant to linear rescaling of the variables, so normalizing (e.g., converting to z-scores) leaves the correlation matrix exactly as it was. Normalization matters for scale-sensitive methods such as distance-based clustering and PCA, not for correlations.

d. Consider the first principal component of the PCA on the 13 numerical variables. Describe briefly what this PCA represents.
Answer: The first principal component is the linear combination of the 13 numerical variables that captures the largest share of their total variance, i.e., the single dominant pattern of variation in the data. (A sketch of the computation follows below.)
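A minimal sketch of part (d) using scikit-learn, under the same Cereals.csv assumption; the variables are standardized first, which is conventional for PCA when the variables are on different scales:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cereals = pd.read_csv("Cereals.csv")                       # file name assumed
numeric = cereals.select_dtypes(include="number").dropna() # drop missing rows

# PCA on the standardized (z-scored) variables.
scaled = StandardScaler().fit_transform(numeric)
pca = PCA()
pca.fit(scaled)

# Share of total variance captured by the first principal component,
# and its loadings (the weight of each variable in that component).
print(f"PC1 explains {pca.explained_variance_ratio_[0]:.1%} of the variance")
print(pd.Series(pca.components_[0], index=numeric.columns).round(2))
```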