Module3Assignment

School

Washington State University

Course

319

Subject

Statistics

Date

Apr 3, 2024

Module 3 Assignment
DATA 319
Courtney Wilkinson
WSU ID: 011685226
10/2/2023

1. (a) Evaluate T^2 for testing H0: μ = (7, 11)′ using the data

X = [[2 12] [8 9] [6 9] [8 10]]

(b) What is the distribution of T^2 for the situation in (a)?
(c) Using (a) and (b), test H0 at the α = 0.01 level.

In [52]:
    import numpy as np
    from scipy.stats import f

    # data
    x = np.array([[2, 12], [8, 9], [6, 9], [8, 10]])
    # hypothesized mean vector
    mu = np.array([7, 11])
    # sample mean and covariance matrix
    xbar = np.mean(x, axis=0)
    S = np.cov(x, rowvar=False)
    # number of observations and number of variables
    n = x.shape[0]
    p = x.shape[1]
    # (a) Hotelling's T^2 statistic
    diff = xbar - mu
    Tsquared = n * diff @ np.linalg.inv(S) @ diff
    # (b) under H0, T^2 ~ [(n - 1) p / (n - p)] F(p, n - p), here 3 F(2, 2),
    # so T^2 is rescaled before it is compared with the F distribution
    df1 = p
    df2 = n - p
    F_stat = Tsquared * (n - p) / (p * (n - 1))
    # (c) critical value from the F distribution at alpha = 0.01
    alpha = 0.01
    critical_value = f.ppf(1 - alpha, df1, df2)
    # p-value
    p_value = f.sf(F_stat, df1, df2)
    # print
    print(f"Hotelling's T^2 statistic: {Tsquared:.4f}")
    print(f"F statistic: {F_stat:.4f}")
    print(f"Critical value: {critical_value:.4f}")
    print(f"P-value: {p_value:.4f}")
    # perform hypothesis test
    if F_stat >= critical_value:
        print("Reject H0: the mean vector differs from (7, 11) at the 0.01 significance level.")
    else:
        print("Fail to reject H0: no evidence at the 0.01 significance level that the mean vector differs from (7, 11).")

Hotelling's T^2 statistic: 13.6364
F statistic: 4.5455
Critical value: 99.0000
P-value: 0.1803
Fail to reject H0: no evidence at the 0.01 significance level that the mean vector differs from (7, 11).

2. For the following two variables

[X Y] [2 2] [0 0] [−1 3] [0 1] [0 1] [0 1] [1 −1] [1 0]

perform a hypothesis test to check whether the population means of the two variables are the same.

In [53]:
    import scipy.stats as stats

    # data
    x = [2, 0, -1, 0, 0, 0, 1, 1]
    y = [2, 0, 3, 1, 1, 1, -1, 0]
    # perform two-sample t-test
    t_stat, p_value = stats.ttest_ind(x, y)
    # significance level
    alpha = 0.05
    # print
    print("t-stat:", t_stat)
    print("p-value:", p_value)
    # compare p-value to alpha and make decision
    if p_value < alpha:
        print("Reject H0. The means are different.")
    else:
        print("Fail to reject H0. The means are not different.")

t-stat: -0.9142324078276749
p-value: 0.37607193633458214
Fail to reject H0. The means are not different.
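Each row of the table in question 2 pairs an X value with a Y value, so a paired comparison on the row-wise differences is a natural alternative to the independent two-sample test used above. The following is a minimal sketch under the assumption that the rows are matched pairs (the assignment does not say so explicitly). It uses only the standard library; `T_CRIT` is the tabulated two-sided 5% critical value of the t distribution with 7 degrees of freedom, entered by hand rather than computed.

```python
import math
import statistics

# Data from question 2; rows are assumed to be matched (x_i, y_i) pairs.
x = [2, 0, -1, 0, 0, 0, 1, 1]
y = [2, 0, 3, 1, 1, 1, -1, 0]

# Row-wise differences d_i = x_i - y_i.
diffs = [xi - yi for xi, yi in zip(x, y)]
n = len(diffs)

# Paired t statistic: mean difference divided by its standard error.
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
t_paired = mean_diff / (sd_diff / math.sqrt(n))

# Tabulated two-sided critical value for alpha = 0.05 and n - 1 = 7 df.
T_CRIT = 2.365

print(f"mean difference: {mean_diff}")
print(f"paired t statistic: {t_paired:.4f}")
if abs(t_paired) > T_CRIT:
    print("Reject H0: the mean difference is nonzero.")
else:
    print("Fail to reject H0: no evidence of a nonzero mean difference.")
```

Under this assumption the paired statistic is about -0.80, which also falls short of the critical value, so the decision matches the two-sample result.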
3. In the first phase of a study of the cost of transporting milk from farms to the dairy plant, a survey was taken of firms engaged in milk transportation. Cost data on X1 = fuel, X2 = repair, and X3 = capital, all measured on a per-mile basis, are presented in the attached data dairy.csv, for n1 = 36 gasoline and n2 = 23 diesel trucks.

(a) Perform Hotelling's T^2 test on the dairy data, for the three vectors:
μ0 = (12, 8, 10)
μ1 = (10, 18, 10)
μ2 = (11, 9, 13)
(b) Compute individual confidence intervals for the mean of each of the three variables at a confidence level of 95%. Then compute the simultaneous T^2 confidence intervals and summarize the results of these tests in your own words.
(c) Using the Bonferroni correction, compute confidence intervals for the three variables so that the simultaneous coverage of these intervals is 95%. Describe in your own words how these intervals differ from those computed in Part (b).
(d) Check for normality of the data set and comment on what you observe. Are there any concerning outliers? What if you distinguish between the gasoline and diesel trucks?

In [54]:
    import numpy as np
    import pandas as pd
    from scipy.stats import t, f, shapiro
    import matplotlib.pyplot as plt

    # data from dairy.csv
    data = pd.read_csv('dairy.csv')
    # hypothesized mean vectors
    mu0 = np.array([12, 8, 10])
    mu1 = np.array([10, 18, 10])
    mu2 = np.array([11, 9, 13])
    # separate data by truck type (the last column holds the type label)
    gasoline_data = data[data['Type'] == 'gasoline'].iloc[:, :-1].values
    diesel_data = data[data['Type'] == 'diesel'].iloc[:, :-1].values
    # sample means and covariance matrices
    sample_means_gasoline = np.mean(gasoline_data, axis=0)
    sample_means_diesel = np.mean(diesel_data, axis=0)
    sample_cov_matrix_gasoline = np.cov(gasoline_data, rowvar=False)
    sample_cov_matrix_diesel = np.cov(diesel_data, rowvar=False)

    ## Step A
    # Hotelling's T^2 statistic
    d = len(mu0)
    n1 = len(gasoline_data)
    n2 = len(diesel_data)
    T2_gasoline = (n1 - d) / ((n1 + n2 - 2) * d) * np.matmul(np.matmul((sample_means_gasolin
    T2_diesel = (n2 - d) / ((n1 + n2 - 2) * d) * np.matmul(np.matmul((sample_means_diesel -
    # critical F-value for alpha = 0.05 and degrees of freedom
    F_critical = f.ppf(0.95, d, n1 + n2 - d - 1)
    print("Step A")
    print("Hotelling's T^2 statistic (Gasoline):", T2_gasoline)
    print("Hotelling's T^2 statistic (Diesel):", T2_diesel)
    print("Critical F-value:", F_critical)
    print()

    ## Step B
    # individual confidence intervals at a confidence level of 95%
    alpha = 0.05
    conf_intervals_gasoline = []
    conf_intervals_diesel = []
    for i in range(d):
        std_error_gasoline = np.sqrt(sample_cov_matrix_gasoline[i][i] / n1)
        margin_error_gasoline = t.ppf(1 - alpha / 2, n1 - 1) * std_error_gasoline
        conf_intervals_gasoline.append((sample_means_gasoline[i] - margin_error_gasoline, sa
        std_error_diesel = np.sqrt(sample_cov_matrix_diesel[i][i] / n2)
        margin_error_diesel = t.ppf(1 - alpha / 2, n2 - 1) * std_error_diesel
        conf_intervals_diesel.append((sample_means_diesel[i] - margin_error_diesel, sample_m
    # simultaneous T^2 confidence intervals
    conf_intervals_simultaneous_gasoline = []
    conf_intervals_simultaneous_diesel = []
    for i in range(d):
        radius_gasoline = np.sqrt(F_critical * (sample_cov_matrix_gasoline[i][i] / n1 + samp
        conf_intervals_simultaneous_gasoline.append((sample_means_gasoline[i] - radius_gasol
        radius_diesel = np.sqrt(F_critical * (sample_cov_matrix_gasoline[i][i] / n1 + sample
        conf_intervals_simultaneous_diesel.append((sample_means_diesel[i] - radius_diesel, s
    print("Step B")
    print("Individual Confidence Intervals (Gasoline):")
    for i in range(d):
        print(f"Variable {i + 1}: {conf_intervals_gasoline[i]}")
    print()
    print("Simultaneous T^2 Confidence Intervals (Gasoline):")
    for i in range(d):
        print(f"Variable {i + 1}: {conf_intervals_simultaneous_gasoline[i]}")
    print()
    print("Individual Confidence Intervals (Diesel):")
    for i in range(d):
        print(f"Variable {i + 1}: {conf_intervals_diesel[i]}")
    print()
    print("Simultaneous T^2 Confidence Intervals (Diesel):")
    for i in range(d):
        print(f"Variable {i + 1}: {conf_intervals_simultaneous_diesel[i]}")
    print()

    ## Step C
    # Bonferroni correction
    conf_intervals_bonferroni_gasoline = []
    conf_intervals_bonferroni_diesel = []
    for i in range(d):
        alpha_bonferroni = alpha / d
        std_error_gasoline = np.sqrt(sample_cov_matrix_gasoline[i][i] / len(gasoline_data))
        margin_error_gasoline = t.ppf(1 - alpha_bonferroni / 2, len(gasoline_data) - 1) * st
        conf_intervals_bonferroni_gasoline.append((sample_means_gasoline[i] - margin_error_g
        std_error_diesel = np.sqrt(sample_cov_matrix_diesel[i][i] / len(diesel_data))
        margin_error_diesel = t.ppf(1 - alpha_bonferroni / 2, len(diesel_data) - 1) * std_er
        conf_intervals_bonferroni_diesel.append((sample_means_diesel[i] - margin_error_diese
    print("Step C")
    print("Bonferroni Confidence Intervals (Gasoline):")
    for i in range(d):
        print(f"Variable {i + 1}: {conf_intervals_bonferroni_gasoline[i]}")
    print()
    print("Bonferroni Confidence Intervals (Diesel):")
    for i in range(d):
        print(f"Variable {i + 1}: {conf_intervals_bonferroni_diesel[i]}")
    print()

    print("Step D")
    # check normality and outliers
    for col in data.columns[:-1]:
        _, p_gasoline = shapiro(gasoline_data[:, data.columns.get_loc(col)])
        _, p_diesel = shapiro(diesel_data[:, data.columns.get_loc(col)])
        print(f"Shapiro-Wilk Test {col} (Gasoline):")
        print(f"p-value: {p_gasoline}")
        print(f"Shapiro-Wilk Test {col} (Diesel):")
        print(f"p-value: {p_diesel}")
        print()
Step A
Hotelling's T^2 statistic (Gasoline): 0.0032994325402454757
Hotelling's T^2 statistic (Diesel): 0.32655143827401384
Critical F-value: 2.772536907836251

Step B
Individual Confidence Intervals (Gasoline):
Variable 1: (10.59546393420547, 13.84175828801675)
Variable 2: (6.695292126144336, 9.529707873855669)
Variable 3: (8.325941954456459, 10.854613601099098)

Simultaneous T^2 Confidence Intervals (Gasoline):
Variable 1: (10.702620402883486, 13.734601819338735)
Variable 2: (5.998878093440173, 10.226121906559833)
Variable 3: (7.001968782535609, 12.178586773019948)

Individual Confidence Intervals (Diesel):
Variable 1: (9.202466906896884, 11.008837440929202)
Variable 2: (8.563509689529011, 12.960838136557948)
Variable 3: (15.21413851159266, 21.121513662320382)

Simultaneous T^2 Confidence Intervals (Diesel):
Variable 1: (8.58966146568542, 11.621642882140666)
Variable 2: (8.648552006483651, 12.875795819603308)
Variable 3: (15.57951709171435, 20.75613508219869)

Step C
Bonferroni Confidence Intervals (Gasoline):
Variable 1: (10.208140888536283, 14.229081333685938)
Variable 2: (6.357111289644602, 9.867888710355402)
Variable 3: (8.024240168687994, 11.156315386867563)

Bonferroni Confidence Intervals (Diesel):
Variable 1: (8.977162616266776, 11.23414173155931)
Variable 2: (8.015041394813283, 13.509306431273677)
Variable 3: (14.477325691218773, 21.858326482694267)

Step D
Shapiro-Wilk Test Fuel (Gasoline):
p-value: 9.55536961555481e-05
Shapiro-Wilk Test Fuel (Diesel):
p-value: 0.5117290019989014

Shapiro-Wilk Test Repair (Gasoline):
p-value: 0.2623240351676941
Shapiro-Wilk Test Repair (Diesel):
p-value: 0.5000005960464478

Shapiro-Wilk Test Capital (Gasoline):
p-value: 0.45318958163261414
Shapiro-Wilk Test Capital (Diesel):
p-value: 0.6583071947097778

Question #3 Write Up

Step A:
For gasoline trucks the T^2 statistic is very low (0.0033); for diesel trucks it is higher (0.3266). The critical F-value (2.7725) is the threshold for significance at α = 0.05: a statistic above this value indicates a significant difference. Neither statistic exceeds the critical value, so H0 is not rejected for either group.
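For reference, the textbook one-sample form of the statistic and its F calibration can be written as below. Here n is the size of the sample being tested, p = 3 is the number of cost variables, and x̄ and S are that sample's mean vector and covariance matrix; the relation assumes multivariate normal data. This is the standard relation, not necessarily the scaling used in the cell above.

```latex
T^2 = n\,(\bar{x} - \mu_0)^{\prime} S^{-1} (\bar{x} - \mu_0),
\qquad
\frac{n - p}{p\,(n - 1)}\, T^2 \sim F_{p,\,n-p} \quad \text{under } H_0 .
```

With this form, H0 is rejected at level α when the rescaled statistic exceeds the upper-α quantile of the F distribution with p and n − p degrees of freedom.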
Step B:
Individual Confidence Intervals: Each interval is a 95% confidence range for the mean of one cost variable, computed separately for each type of truck.
Simultaneous T^2 Confidence Intervals: These intervals are constructed to hold jointly, so that all three cost means for a truck type are covered at the same time with 95% confidence.

Step C:
With the Bonferroni correction, each interval is computed at level α/3 so that the three intervals together have simultaneous coverage of at least 95%. This guards against the increased chance of a false positive when several intervals are examined at once. As a result, the intervals in Part C are wider than the individual intervals in Part B.

Step D:
Outliers are more apparent in the gasoline truck group. The Shapiro-Wilk test rejects normality for gasoline Fuel (p ≈ 9.6e-05), which is consistent with outliers in that variable, while gasoline Repair (p ≈ 0.26) and Capital (p ≈ 0.45) show no significant departure from normality. The diesel truck group is more consistent with a normal distribution, as indicated by the higher p-values (0.50 to 0.66) in its Shapiro-Wilk tests.

4. Consider the following multivariate dataset

X = [[2 2 3] [0 0 2] [−1 3 2] [0 1 1] [0 1 5] [0 1 3] [1 −1 3] [1 0 5]]

and test the hypothesis H0: (μ1 − μ2, μ2 − μ3)′ = (0, 0)′ using differences.

In [51]:
    import numpy as np
    from scipy import stats

    # data
    x = np.array([[2, 2, 3], [0, 0, 2], [-1, 3, 2], [0, 1, 1],
                  [0, 1, 5], [0, 1, 3], [1, -1, 3], [1, 0, 5]])
    # split the data by column
    group1 = x[:, 0]
    group2 = x[:, 1]
    group3 = x[:, 2]
    # differences between means
    diff_mean_1_2 = np.mean(group1) - np.mean(group2)
    diff_mean_2_3 = np.mean(group2) - np.mean(group3)
    # t-tests on the differences
    t_stat_1_2, p_value_1_2 = stats.ttest_ind(group1, group2)
    t_stat_2_3, p_value_2_3 = stats.ttest_ind(group2, group3)
    # significance level
    alpha = 0.05
    # print
    print("Group 1 & Group 2:")
    print(f'Difference of means: {diff_mean_1_2}')
    print(f't-stat: {t_stat_1_2}')
    print(f'p-value: {p_value_1_2}')
    if p_value_1_2 < alpha:
        print('Reject H0 for Group 1 & Group 2')
    else:
        print('Fail to reject H0 for Group 1 & Group 2')
    print()
    print('Group 2 & Group 3:')
    print(f'Difference of means: {diff_mean_2_3}')
    print(f't-stat: {t_stat_2_3}')
    print(f'p-value: {p_value_2_3}')
    if p_value_2_3 < alpha:
        print('Reject H0 for Group 2 & Group 3')
    else:
        print('Fail to reject H0 for Group 2 & Group 3')

Group 1 & Group 2:
Difference of means: -0.5
t-stat: -0.9142324078276749
p-value: 0.37607193633458214
Fail to reject H0 for Group 1 & Group 2

Group 2 & Group 3:
Difference of means: -2.125
t-stat: -3.1883897418177476
p-value: 0.006570509597450871
Reject H0 for Group 2 & Group 3
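The hypothesis in question 4 is a joint statement about two differences, so one way to test it in a single step is Hotelling's T^2 applied to the row-wise difference vectors, in place of the two separate t-tests above. The sketch below is one possible approach, not the method used in the submission. It assumes the rows are repeated measurements on the same units and relies only on the standard library, inverting the 2×2 covariance matrix by hand. The closed-form p-value is valid only because the numerator degrees of freedom equal 2; all variable names are illustrative.

```python
import statistics

# Data from question 4: eight units, three measurements each.
X = [[2, 2, 3], [0, 0, 2], [-1, 3, 2], [0, 1, 1],
     [0, 1, 5], [0, 1, 3], [1, -1, 3], [1, 0, 5]]

# Row-wise differences: d1 = x1 - x2 and d2 = x2 - x3.
d1 = [r[0] - r[1] for r in X]
d2 = [r[1] - r[2] for r in X]
n, q = len(X), 2                      # q = number of differences tested

# Mean vector and sample covariance matrix of the differences.
m1, m2 = statistics.mean(d1), statistics.mean(d2)
s11 = statistics.variance(d1)
s22 = statistics.variance(d2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(d1, d2)) / (n - 1)

# Quadratic form dbar' S^{-1} dbar using the explicit 2x2 inverse.
det = s11 * s22 - s12 ** 2
quad = (s22 * m1 ** 2 - 2 * s12 * m1 * m2 + s11 * m2 ** 2) / det
T2 = n * quad

# Rescale to an F statistic with (q, n - q) degrees of freedom.
df2 = n - q
F_stat = df2 / ((n - 1) * q) * T2

# Survival function of F(2, df2): P(F > x) = (1 + 2x/df2)^(-df2/2).
p_value = (1 + 2 * F_stat / df2) ** (-df2 / 2)

print(f"T^2 = {T2:.4f}, F = {F_stat:.4f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the difference vector is not (0, 0).")
else:
    print("Fail to reject H0.")
```

Under these assumptions the joint test gives T^2 of about 27.81 and F of about 11.92 on (2, 6) degrees of freedom, so H0 is rejected at the 0.05 level.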