Assignment1 - Jupyter Notebook

10/22/23, 11:30 PM — Assignment1 - Jupyter Notebook (localhost:8888/notebooks/Assignment1.ipynb)

In [6]:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

In [10]:

```python
data = pd.read_csv('./Downloads/food-texture.csv')
data.head()
```

Out[10]:

```text
  Unnamed: 0   Oil  Density  Crispy  Fracture  Hardness
0       B110  16.5     2955      10        23        97
1       B136  17.7     2660      14         9       139
2       B171  16.2     2870      12        17       143
3       B192  16.7     2920      10        31        95
4       B225  16.3     2975      11        26       143
```

1. Calculate the mean centering vector (a 5 x 1 vector)

In [11]:

```python
mean_centering_vector = data[['Oil', 'Density', 'Crispy', 'Fracture', 'Hardness']].mean()
mean_centering_vector
```

Out[11]:

```text
Oil           17.202
Density     2857.600
Crispy        11.520
Fracture      20.860
Hardness     128.180
dtype: float64
```

2. Calculate the scaling vector (a 5 x 1 vector)

In [12]:

```python
scaling_vector = data[['Oil', 'Density', 'Crispy', 'Fracture', 'Hardness']].std()
scaling_vector
```

Out[12]:

```text
Oil           1.592007
Density     124.499980
Crispy        1.775571
Fracture      5.466073
Hardness     31.127578
dtype: float64
```

3. What steps would you take to apply the centering and scaling vectors to the X matrix?

a. Centering: subtract the mean centering vector from each row of the X matrix.
b. Scaling: divide each column of the centered matrix by its corresponding entry in the scaling vector (the standard deviation).
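As a sanity check on the recipe in Q3 (subtract the mean vector, then divide by the standard-deviation vector), the standardized result should have column means of zero and sample standard deviations of one. A minimal sketch, using the five `head()` rows of two columns as stand-in data (not the full 50-row dataset):

```python
import pandas as pd

# Stand-in data: the five head() rows for two of the columns.
df = pd.DataFrame({
    'Oil':     [16.5, 17.7, 16.2, 16.7, 16.3],
    'Density': [2955, 2660, 2870, 2920, 2975],
})

center = df.mean()            # mean centering vector (step a)
scale = df.std()              # scaling vector, sample std with ddof=1 (step b)
z = (df - center) / scale     # pandas broadcasts both Series across the rows

# Each standardized column now has mean ~0 and sample std ~1.
print(z.mean().abs().max() < 1e-12)
print((z.std() - 1).abs().max() < 1e-12)
```

Note that pandas' `.std()` uses the sample standard deviation (`ddof=1`), which matches the scaling vector computed in Q2; `sklearn.preprocessing.StandardScaler` would divide by the population standard deviation (`ddof=0`) instead.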
4. Draw a scatter plot of Crispy vs. Fracture using all 50 observations from the raw data table.

In [16]:

```python
plt.figure(figsize=(10, 6))
plt.scatter(data['Crispy'], data['Fracture'], color='blue')
plt.title('Scatter Plot of Crispy vs. Fracture (Raw Data)')
plt.xlabel('Crispy')
plt.ylabel('Fracture')
plt.grid(True)
plt.show()
```
5. Draw a scatter plot of Crispy vs. Fracture after you have centered and scaled the data. What observations can you make comparing the two scatter plots?

In [17]:

```python
centered_data = data[['Oil', 'Density', 'Crispy', 'Fracture', 'Hardness']] - mean_centering_vector
scaled_data = centered_data / scaling_vector

plt.figure(figsize=(10, 6))
plt.scatter(scaled_data['Crispy'], scaled_data['Fracture'], color='red')
plt.title('Scatter Plot of Crispy vs. Fracture (Centered and Scaled Data)')
plt.xlabel('Crispy (Centered and Scaled)')
plt.ylabel('Fracture (Centered and Scaled)')
plt.grid(True)
plt.show()
```

Observations comparing the two scatter plots:

1. The shape and distribution of the data points are the same in both plots; centering and scaling do not change the inherent relationships between the variables.
2. In the raw-data plot the points are spread over the original scales of the variables, while in the centered and scaled plot they are concentrated around the origin, since both axes now have mean zero and standard deviation one.
3. The centered and scaled plot makes it easier to identify potential patterns or clusters, because the scale is consistent across both axes.
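Observation 1 can be verified directly: centering and scaling are affine transformations of each column, so the Pearson correlation between any two variables is unchanged. A minimal sketch with synthetic stand-in columns (the distributions and coefficients below are made up, not fitted to the real data):

```python
import numpy as np

rng = np.random.default_rng(0)
crispy = rng.normal(11.5, 1.8, size=50)                  # synthetic "Crispy"
fracture = 30.0 - 1.5 * crispy + rng.normal(0, 3, 50)    # synthetic "Fracture"

def standardize(x):
    """Center to mean zero and scale by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

r_raw = np.corrcoef(crispy, fracture)[0, 1]
r_std = np.corrcoef(standardize(crispy), standardize(fracture))[0, 1]

# Affine rescaling of each axis leaves the correlation unchanged.
assert np.isclose(r_raw, r_std)
```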
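The R^2 values reported for the PCA model in the next question are scikit-learn's `explained_variance_ratio_`, which equal the eigenvalues of the sample covariance matrix of the standardized data divided by their sum. A minimal sketch confirming that equivalence on synthetic standardized data (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # centered and scaled

pca = PCA(n_components=2).fit(X)

# Eigenvalues of the sample covariance matrix, sorted largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
ratio = eigvals / eigvals.sum()

# sklearn's per-component R^2 matches eigenvalue / total variance.
assert np.allclose(pca.explained_variance_ratio_, ratio[:2])
```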
6. Use Aspen ProMV (or a software tool of your choice) to construct a PCA model on this data. What is the R^2 for the first and second components? What is the total R^2 using 2 components?

In [19]:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(scaled_data)
r2_first_component = pca.explained_variance_ratio_[0]
r2_second_component = pca.explained_variance_ratio_[1]
total_r2_two_components = sum(pca.explained_variance_ratio_)
r2_first_component, r2_second_component, total_r2_two_components
```

Out[19]: (0.6062426334584964, 0.2591411527599619, 0.8653837862184584)

1. R^2 for the first component: 60.62%.
2. R^2 for the second component: 25.91%.
3. Total R^2 using the first two components: 86.54%.

Together, the first two components explain approximately 86.54% of the variance in the centered and scaled data.

7. Report the R^2 value for each of the 5 variables after adding (a) one component and (b) two components.

In [20]:

```python
loadings = pca.components_.T
squared_loadings_one_component = loadings[:, 0] ** 2
squared_loadings_two_components = squared_loadings_one_component + (loadings[:, 1] ** 2)
squared_loadings_one_component, squared_loadings_two_components
```

Out[20]: (array([0.20933684, 0.22919725, 0.28343663, 0.25449692, 0.02353236]),
 array([0.34656178, 0.35646779, 0.32250651, 0.30344402, 0.6710199 ]))

The R^2 values (squared loadings) for each of the 5 variables are:

(a) After adding one component:
1. Oil: 20.93%
2. Density: 22.92%
3. Crispy: 28.34%
4. Fracture: 25.45%
5. Hardness: 2.35%

(b) After adding two components:
1. Oil: 34.66%
2. Density: 35.65%
3. Crispy: 32.25%
4. Fracture: 30.34%
5. Hardness: 67.10%

8. Write down the values of the p1 loading vector. Also, create a bar plot of these values.

In [21]:

```python
p1_loading_vector = loadings[:, 0]
variables = ['Oil', 'Density', 'Crispy', 'Fracture', 'Hardness']

plt.figure(figsize=(10, 6))
plt.bar(variables, p1_loading_vector, color='green')
plt.title('p1 Loading Vector')
plt.xlabel('Variables')
plt.ylabel('Loading Value')
plt.grid(axis='y')
plt.show()

p1_loading_vector
```

Out[21]: array([-0.45753343, 0.4787455, -0.53238767, 0.50447688, -0.15340262])

The values of the p1 loading vector for each variable are:
Oil: −0.4575
Density: 0.4787
Crispy: −0.5324
Fracture: 0.5045
Hardness: −0.1534

9. What are the characteristics of pastries with a large negative t1 value?

The t1 value is the score on the first principal component. A large negative t1 score means the pastry sits opposite to the direction of the p1 loading vector: variables with negative loadings tend to be above their means, and variables with positive loadings tend to be below theirs. Concretely, a large negative t1 suggests a pastry with high oil content (the oil loading is negative), low density (the density loading is positive), high crispiness (the crispy loading is negative), and a low fracture angle (the fracture loading is positive). Hardness has the smallest loading magnitude, so it is the least influential variable in this component, but since its loading is also negative, a large negative t1 suggests a slightly harder pastry.

10. Replicate the calculation of t1 for pastry B554. Show each of the 5 terms that make up this linear combination.

In [23]:

```python
pastry_b554 = scaled_data[data['Unnamed: 0'] == 'B554'].values[0]
t1_score_b554 = pastry_b554.dot(p1_loading_vector)
terms_b554 = pastry_b554 * p1_loading_vector
t1_score_b554, terms_b554
```

Out[23]: (1.5423623149809962,
 array([ 0.71906014, 0.47067035, 0.45575714, -0.07937144, -0.02375388]))

The t1 score for pastry B554 is approximately 1.5424, the sum of these five terms:

1. Oil term: 0.7191
2. Density term: 0.4707
3. Crispy term: 0.4558
4. Fracture term: −0.0794
5. Hardness term: −0.0238

This breakdown shows how each variable contributes to the t1 score for pastry B554 through its loading on the first principal component.
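The term-by-term replication in Q10 is just the dot product t1 = z · p1 expanded per variable. A minimal sketch using the p1 vector reported above and a hypothetical standardized row (the values of `z` are illustrative, not the actual B554 observation):

```python
import numpy as np

# p1 loading vector as reported in Out[21].
p1 = np.array([-0.45753343, 0.4787455, -0.53238767, 0.50447688, -0.15340262])

# Hypothetical centered-and-scaled row: (Oil, Density, Crispy, Fracture, Hardness).
z = np.array([-1.2, 0.8, -0.5, 0.9, 0.1])

terms = z * p1        # the five per-variable terms of the linear combination
t1 = terms.sum()      # the t1 score is the sum of the terms, i.e. z . p1

assert np.isclose(t1, z.dot(p1))
```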