Predictive Assignment-3

School

University Of Connecticut

Course

5604

Subject

Statistics

Date

Feb 20, 2024

Type

docx

Pages

8

Uploaded by EarlMusicCrab33

Assignment-3

Part 1: Breakfast Cereals. Use the data for the breakfast cereals example in Section 4.8 (Cereals.jmp) to explore and summarize the data as follows:

a. Which variables are continuous/numerical? Which are ordinal? Which are nominal?
A) Nominal: Type, since it represents unordered categories (A, B, C).
Continuous/numerical: Alcohol, Malic_Acid, Ash, Ash_Alcalinity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins, Color_Intensity, Hue, OD280_OD315, Proline.
Ordinal: none. No variable has ordered categories with an inherent ranking.

b. Calculate the following summary statistics: mean, median, min, max, and standard deviation for each of the continuous variables, and the count for each categorical variable. This can be done using Cols > Columns Viewer.
i. Is there any evidence of extreme values?
A) There is no direct evidence of extreme values in the summary statistics: the minimum and maximum of every variable look reasonable.
ii. Which, if any, of the variables is missing values?
A) None of the variables has missing values.

c. Use Analyze > Distribution to plot a histogram for each of the continuous variables and create summary statistics. Based on the histograms and summary statistics, answer the following questions:
ii. Which variables seem skewed?
A) Malic_Acid, Ash_Alcalinity, Magnesium, Nonflavanoid_Phenols, Proanthocyanins, Color_Intensity, Proline, OD280_OD315.
iii. Are there any values that seem extreme?
A) Yes. Some values fall outside the box plots and look extreme in the following variables:
Malic_Acid: 5.51, 5.65, 5.8
Ash: 3.22, 3.23, 1.36
Ash_Alcalinity: 30.28, 10.6
Magnesium: 162, 151, 139
Proanthocyanins: 3.58, 3.28
Color_Intensity: 13, 11.75, 10.8
Hue: 1.71

f. Compute the correlation table, and generate a scatterplot matrix for the continuous variables (use Analyze > Multivariate Methods > Multivariate).
i. Which pair of variables is most strongly correlated?
A) Total_Phenols and Flavanoids are the most strongly correlated pair (0.86). The next strongest pairs are OD280_OD315-Flavanoids (0.78), OD280_OD315-Total_Phenols (0.69), Proanthocyanins-Flavanoids (0.65), Alcohol-Proline (0.64), Proanthocyanins-Total_Phenols (0.61), OD280_OD315-Hue (0.56), Alcohol-Color_Intensity (0.54), Flavanoids-Hue (0.54), and so on.
iii. To support your answer to this question, standardize a few variables and recreate your correlations on those variables and compare your results.
A) The patterns in the scatterplot matrix and the correlations do not change after standardization. Standardizing transforms each variable to have a mean of 0 and a standard deviation of 1, which is only a shift and a positive rescaling of each variable. That changes the units of the variables but not the linear relationship between them, so the correlation coefficients keep the same sign and magnitude.

Part 2 – Continue working in the Wine dataset to answer the following questions. You may complete this part in either JMP or Python.

1. Conduct a principal components analysis using all the continuous variables in the dataset. Include a screen shot of the Eigenvectors and Eigenvalues (with cumulative percents).
A)
2. How much variability can you maintain if you keep 8 principal components?
A) If we keep 8 principal components, we retain 92% of the variability. Reducing from 13 components to 8 therefore loses only 8% of the information.
3. While PC1 is quite balanced with info from many of the predictors, which three variables contribute the least to PC1?
A) Most of the information in principal component 1 comes from the Total_Phenols and Flavanoids variables. Ash, Magnesium, and Color_Intensity are the three variables that contribute the least to PC1.
4. Which three variables contribute the most to PC2?
A) Alcohol, Color_Intensity, and Proline are the three variables that contribute the most to PC2.
5. If a wine is high in malic acid, will it be high or low in PC2 (all else being equal)?
A) Malic_Acid has little influence on PC2: its loading is close to 0. So it is low in PC2.
6. If a wine is high in malic acid, will it be high or low in PC4 (all else being equal)?
A) Malic_Acid has a strong influence on PC4: its loading is close to 1. So it is high in PC4.

Part 3 – Practice the math manually to solve the problem below.

1. Imagine a dataset with 30,000 rows. The target variable has 3,000 Yes rows and 27,000 No rows. If you were to partition the data with a 60/20/20 split and 50/50 oversampling in Training and Validation, how many No and Yes records would be in each of the three partitions?
A) Total rows = 30,000
Yes rows = 3,000
No rows = 27,000

Partition:
Training set = 60%
Validation set = 20%
Testing set = 20%

Yes rows in each partition:
Training Yes rows = Yes rows * 60% = 3,000 * 0.60 = 1,800
Validation Yes rows = Yes rows * 20% = 3,000 * 0.20 = 600
Testing Yes rows = Yes rows * 20% = 3,000 * 0.20 = 600

Oversampling is 50/50 in Training and Validation, so each of those partitions gets as many No rows as Yes rows.
For Training:
Training Yes rows = 1,800
Training No rows = 1,800
For Validation:
Validation Yes rows = 600
Validation No rows = 600
For Testing: the test set keeps the original 1:9 ratio, so Yes is 10% and No is 90%.
Testing Yes rows = 600
Testing No rows = 5,400
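
The arithmetic in Part 3 can be checked with a short Python sketch. It follows the approach used in the answer above: split the Yes rows 60/20/20, match them one-to-one with No rows in Training and Validation, and keep the original Yes:No ratio in Testing. The function name partition_counts and its parameters are hypothetical, chosen only for this illustration.

```python
def partition_counts(yes_rows, no_rows, train_pct=60, valid_pct=20):
    """Return {partition: (yes, no)} under 50/50 oversampling in
    Training and Validation, with Testing left at the original ratio.

    Assumption: the rare class (Yes) is split by the partition
    percentages, and Testing receives whatever Yes rows remain.
    """
    # Split the Yes rows across partitions (integer math avoids float error).
    yes_train = yes_rows * train_pct // 100
    yes_valid = yes_rows * valid_pct // 100
    yes_test = yes_rows - yes_train - yes_valid

    # 50/50 oversampling: one No row for every Yes row.
    no_train = yes_train
    no_valid = yes_valid

    # Testing keeps the original class ratio (here 9 No rows per Yes row).
    no_test = yes_test * no_rows // yes_rows

    return {
        "training": (yes_train, no_train),
        "validation": (yes_valid, no_valid),
        "testing": (yes_test, no_test),
    }


counts = partition_counts(yes_rows=3000, no_rows=27000)
for name, (yes, no) in counts.items():
    print(f"{name}: Yes={yes}, No={no}")
```

Under these assumptions the three partitions use 7,800 of the 27,000 No rows; the remaining No rows are left out of the oversampled design.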