MID TERM ASSESSMENT – BIA 5302

SECTION A (CONCEPTS)

1. A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so). Construct the confusion matrix and calculate the overall error rate. Please answer in Word/pen and paper.

Answer: Classification confusion matrix (rows = actual class, columns = predicted class):

                          Predicted Fraudulent    Predicted Non-Fraudulent
Actual Fraudulent                  30                         32
Actual Non-Fraudulent              58                        920

Error rate = (32 + 58) / (30 + 32 + 920 + 58) = 90 / 1040 = 0.0865, or 8.65%.

2. Suppose that this routine has an adjustable cutoff (threshold) mechanism by which you can alter the proportion of records classified as fraudulent. Describe how moving the cutoff up or down would affect:

(a) the classification error rate for records that are truly fraudulent.

Answer: Raising the cutoff increases this error rate: more truly fraudulent records fall below the threshold and are wrongly classified as non-fraudulent (false negatives). Lowering the cutoff decreases it: more truly fraudulent records are correctly flagged (true positives).

(b) the classification error rate for records that are truly non-fraudulent.

Answer: Raising the cutoff decreases this error rate: fewer truly non-fraudulent records are incorrectly flagged as fraudulent, i.e., fewer false positives. Lowering the cutoff increases it: more truly non-fraudulent records are mistakenly labeled fraudulent (false positives).
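For reference, a minimal Python sketch of the arithmetic above (the counts are taken directly from the question):

```python
import numpy as np

# Confusion matrix from the question: rows = actual, columns = predicted.
#                     pred. fraud  pred. non-fraud
confusion = np.array([[30,  32],    # actual fraudulent
                      [58, 920]])   # actual non-fraudulent

# The off-diagonal cells are the misclassified records.
errors = confusion[0, 1] + confusion[1, 0]
total = confusion.sum()
print(f"Overall error rate: {errors / total:.4f}")  # 90 / 1040 = 0.0865
```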
SECTION B (IMPLEMENTATION)

3. This question requires students to apply various data visualization techniques to represent and analyze datasets effectively.

a) Dataset Selection: Choose any publicly available dataset of your choice. It could be related to any domain or topic (e.g., finance, sports, healthcare, etc.).

Dataset selected: the Iris dataset from Kaggle.

b) Tasks to Perform: Implement the following tasks using Python's data visualization libraries (e.g., Matplotlib, Seaborn, Plotly, etc.).

Data Exploration:
I. Load the dataset into your Python environment.
II. Conduct initial data exploration to understand the dataset's structure, content, and statistical summary.
III. Identify relevant variables/features for visualization.

Univariate Visualization:
i. Create at least two appropriate visualizations to represent the distribution of individual variables/features. Examples: histograms, box plots, bar plots, etc. (A sketch covering the exploration and univariate steps follows below.)
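A minimal sketch of the exploration and univariate steps. It loads seaborn's bundled copy of Iris for reproducibility; the Kaggle CSV is equivalent, though its column names differ slightly from the snake_case names assumed here:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris data (seaborn ships a copy; the Kaggle CSV is equivalent).
iris = sns.load_dataset("iris")

# Initial exploration: structure, content, statistical summary.
iris.info()
print(iris.describe())
print(iris["species"].value_counts())

# Univariate visualizations: a histogram and a box plot.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(iris["sepal_length"], bins=20)
axes[0].set_title("Sepal length distribution")
sns.boxplot(data=iris, y="petal_length", ax=axes[1])
axes[1].set_title("Petal length")
plt.tight_layout()
plt.show()
```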
Bivariate Visualization:
i. Analyze relationships between two variables/features using at least two appropriate visualizations. Examples: scatter plots, line plots, stacked bar plots, etc. (See the sketch below.)
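A sketch of two bivariate views, under the same column-name assumptions as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Scatter plot: petal length vs. petal width, colored by species.
sns.scatterplot(data=iris, x="petal_length", y="petal_width",
                hue="species", ax=axes[0])
# Bar plot: mean sepal length per species.
sns.barplot(data=iris, x="species", y="sepal_length", ax=axes[1])
plt.tight_layout()
plt.show()
```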
Multivariate Visualization:
i. Represent relationships involving multiple variables/features using at least two suitable visualizations. Examples: heatmaps, parallel coordinates, tree maps, etc. (See the sketch below.)
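A sketch of two multivariate views (a correlation heatmap and a parallel-coordinates plot), same assumptions:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

iris = sns.load_dataset("iris")

# Heatmap of pairwise correlations among the four numeric features.
sns.heatmap(iris.drop(columns="species").corr(), annot=True, cmap="coolwarm")
plt.title("Feature correlations")
plt.show()

# Parallel coordinates: one line per flower, colored by species.
parallel_coordinates(iris, class_column="species")
plt.show()
```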
Time-Series Visualization:
i. If your dataset contains temporal data, visualize the trends and patterns over time using at least one chart. Examples: line plots, area plots, seasonal decompositions, etc.

Since the Iris dataset contains no date or time field, this visualization cannot be performed here.

Interactive Visualization:
i. Implement at least one interactive visualization using Plotly or any other interactive visualization library. Examples: interactive scatter plots, choropleth maps, animated plots, etc. (See the sketch below.)
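A sketch of an interactive scatter plot, using Plotly Express and its bundled copy of the Iris data:

```python
import plotly.express as px

iris = px.data.iris()  # Plotly ships a copy of the Iris data

# Interactive scatter: hovering shows exact values; the legend toggles species.
fig = px.scatter(iris, x="sepal_width", y="sepal_length",
                 color="species", hover_data=["petal_length", "petal_width"])
fig.show()
```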
4. Use the breakfast cereals data provided to explore and summarize the data as follows.

a. Plot a histogram for each of the quantitative variables. Based on the histograms and summary statistics, answer the following questions:

Which variables have the largest variability?
Answer: Based on the summary statistics, the Calories variable appears to have the largest variability.

Which variables seem skewed?
Answer: Potass appears positively (right) skewed.

Are there any values that seem extreme?
Answer: Several variables contain a handful of extreme values, for example fat, vitamins, and weight.
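A minimal sketch of part (a); the file name Cereals.csv and the presence of standard numeric columns are assumptions about your copy of the data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the cereals data (file name assumed; adjust to your copy).
cereals = pd.read_csv("Cereals.csv")

# Histogram for each quantitative variable, plus supporting statistics.
numeric = cereals.select_dtypes(include="number")
numeric.hist(bins=15, figsize=(12, 10))
plt.tight_layout()
plt.show()

print(numeric.describe())  # compare the std row to judge variability
print(numeric.skew())      # positive values indicate right skew
```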
b. Plot a side-by-side boxplot of consumer rating as a function of shelf height. If we were to predict consumer rating from shelf height, does it appear that we need to keep all three categories of shelf height?

Answer: The boxplot reveals minimal differences in consumer rating across shelf heights, suggesting a weak influence. The shelf-height categories could therefore be reduced without materially hurting the prediction of consumer rating.
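A sketch of part (b), under the same file-name assumption and assuming the columns are named rating and shelf:

```python
import pandas as pd
import matplotlib.pyplot as plt

cereals = pd.read_csv("Cereals.csv")  # file name assumed

# Side-by-side boxplots of consumer rating for each shelf height.
cereals.boxplot(column="rating", by="shelf")
plt.title("Consumer rating by shelf height")
plt.suptitle("")  # drop pandas' automatic super-title
plt.xlabel("Shelf")
plt.ylabel("Rating")
plt.show()
```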
c. Compute the correlation table for the quantitative variables (using the corr() method). In addition, generate a matrix plot for these variables.
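A sketch of part (c), same assumptions; the matrix plot here is pandas' scatter matrix:

```python
import pandas as pd
import matplotlib.pyplot as plt

cereals = pd.read_csv("Cereals.csv")  # file name assumed
numeric = cereals.select_dtypes(include="number")

# Correlation table for the quantitative variables.
print(numeric.corr().round(2))

# Matrix (pairwise scatter) plot of the same variables.
pd.plotting.scatter_matrix(numeric, figsize=(14, 14))
plt.show()
```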
Which pair of variables is most strongly correlated?
Answer: Weight and calories are the most strongly correlated pair.

How can we reduce the number of variables based on these correlations?
Answer: Identify pairs of highly correlated variables and drop one variable from each pair; this removes redundancy and mitigates multicollinearity.

How would the correlations change if we normalized the data first?
Answer: They would not change. Pearson correlation is invariant to linear rescaling of the variables, so normalizing (e.g., converting to z-scores) leaves the correlation matrix exactly as it was. Normalization matters for scale-sensitive methods such as distance-based clustering and PCA, not for correlations.

d. Consider the first principal component of the PCA on the 13 numerical variables. Describe briefly what this PCA represents.
Answer: The first principal component is the linear combination of the 13 numerical variables that captures the largest share of their total variance, i.e., the single dominant pattern of variation in the data. (A sketch of the computation follows below.)
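A minimal sketch of part (d) using scikit-learn, under the same Cereals.csv assumption; the variables are standardized first, which is conventional for PCA when the variables are on different scales:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cereals = pd.read_csv("Cereals.csv")                       # file name assumed
numeric = cereals.select_dtypes(include="number").dropna() # drop missing rows

# PCA on the standardized (z-scored) variables.
scaled = StandardScaler().fit_transform(numeric)
pca = PCA()
pca.fit(scaled)

# Share of total variance captured by the first principal component,
# and its loadings (the weight of each variable in that component).
print(f"PC1 explains {pca.explained_variance_ratio_[0]:.1%} of the variance")
print(pd.Series(pca.components_[0], index=numeric.columns).round(2))
```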