Lab KX - Jupyter Notebook

School: University of California, Berkeley
Course: 88
Subject: Statistics
Date: Feb 20, 2024

Lab KX: Chi Squared Tests and AB Tests

Setup

In [99]:
    # Import some useful functions
    from numpy import *
    from numpy.random import *
    from datascience import *

    # Customize look of graphics
    import matplotlib.pyplot as plt
    plt.style.use('fivethirtyeight')
    %matplotlib inline

    # Force display of all values
    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"

    # Handle some obnoxious warning messages
    import warnings
    warnings.filterwarnings("ignore")

T-shirt sales

Business Decision

You have four different t-shirt designs and need a demand forecast so you know how many t-shirts to print for your next production run. Having already sold to Unit 1, you wonder whether Unit 2 will ultimately express similar preferences.

Is this a Chi Squared "Goodness of Fit" test or a Chi Squared "Test of Independence"?

Assuming that students can purchase multiple designs and multiple quantities, we can construct the total distribution of sales to Unit 1 as:

Style 1: 20%, Style 2: 35%, Style 3: 30%, Style 4: 15%

Data

Construct a table from your Unit 1 sample.

In [100]:
    unit1 = Table().with_columns(
        "Style", make_array(1, 2, 3, 4),
        "Demand Forecast", make_array(0.2, 0.35, 0.3, 0.15))
    unit1

Out[100]:
    Style | Demand Forecast
    1     | 0.2
    2     | 0.35
    3     | 0.3
    4     | 0.15

Suppose that current sales to Unit 2 look like the following:

Style 1: 102, Style 2: 121, Style 3: 120, Style 4: 57

Show the sample information for Unit 2 as a table.

In [101]:
    unit2 = Table().with_columns(
        "Style", make_array(1, 2, 3, 4),
        "Actual Sales", make_array(102, 121, 120, 57))
    unit2

Out[101]:
    Style | Actual Sales
    1     | 102
    2     | 121
    3     | 120
    4     | 57

Analysis

Knowing how many t-shirts were actually sold in Unit 2, add a column to your Unit 2 data that contains the "expected sales": the sales you would have seen if the Unit 2 total had been distributed in the same proportions (the same percentages) as in Unit 1.
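
The expected-sales step can be sketched in plain Python, independent of the `datascience` Table API the lab uses. This is only an illustration, assuming the Unit 1 proportions and Unit 2 counts stated above; the names `unit1_share`, `unit2_actual`, and `expected_sales` are hypothetical.

```python
# Unit 1 demand proportions and Unit 2 sales to date, keyed by style (from the lab text)
unit1_share = {1: 0.20, 2: 0.35, 3: 0.30, 4: 0.15}
unit2_actual = {1: 102, 2: 121, 3: 120, 4: 57}

# Total Unit 2 sales so far
total_unit2 = sum(unit2_actual.values())

# Expected Unit 2 sales if Unit 2 followed Unit 1's proportions exactly
expected_sales = {style: share * total_unit2 for style, share in unit1_share.items()}

for style in sorted(expected_sales):
    print(style, unit2_actual[style], round(expected_sales[style], 2))
```

Each expected count is just a Unit 1 proportion scaled by the Unit 2 total, so the expected counts add up to the same total as the actual counts.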

In [102]:
    unit2 = unit2.with_column(
        "Expected Sales",
        unit1.column("Demand Forecast") * sum(unit2.column("Actual Sales")))
    unit2

Out[102]:
    Style | Actual Sales | Expected Sales
    1     | 102          | 80
    2     | 121          | 140
    3     | 120          | 120
    4     | 57           | 60

Compute the Chi-squared Statistic

In [103]:
    # compute chi-squared
    # this is a Goodness of Fit test
    unit2 = unit2.with_column("diff", unit2.column("Actual Sales") - unit2.column("Expected Sales"))
    unit2 = unit2.with_columns('diff^2', unit2.column("diff") ** 2)
    unit2 = unit2.with_columns('relative', unit2.column('diff^2') / unit2.column('Expected Sales'))
    unit2
    chi_s = sum(unit2.column('relative'))
    chi_s

Out[103]:
    Style | Actual Sales | Expected Sales | diff | diff^2 | relative
    1     | 102          | 80             | 22   | 484    | 6.05
    2     | 121          | 140            | -19  | 361    | 2.57857
    3     | 120          | 120            | 0    | 0      | 0
    4     | 57           | 60             | -3   | 9      | 0.15

Out[103]: 8.778571428571428
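
The column-by-column computation above amounts to summing (observed - expected)^2 / expected over the categories. A minimal plain-Python sketch of that sum follows; the function name `chi_squared_statistic` is hypothetical and not part of the lab's libraries.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

actual = [102, 121, 120, 57]    # Unit 2 sales by style
expected = [80, 140, 120, 60]   # Unit 2 sales expected under Unit 1 proportions

chi_s = chi_squared_statistic(actual, expected)
print(round(chi_s, 4))  # 8.7786
```

Style 1 contributes most of the total (6.05 of about 8.78), which the `relative` column in the table also shows.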

Generate the chi-squared distribution for the appropriate degrees of freedom.

In [104]:
    df = unit2.num_rows - 1   # 4 styles - 1
    df
    dist_array = chisquare(df, 1000000)
    dist = Table().with_column('chisquared', dist_array)
    dist.hist(bins=50, range=make_array(0, 25))

Out[104]: 3

Calculate and show the critical value at significance level (α) = 0.05 based on the chi-squared distribution.
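
As a rough cross-check on the simulation approach, here is a sketch that uses only the Python standard library instead of `numpy.random.chisquare`. It relies on the fact that a chi-squared variable with `df` degrees of freedom is a sum of `df` squared standard normal draws. The helper names `simulate_chi_squared` and `empirical_percentile`, the seed, and the smaller sample size of 100,000 are all assumptions made for this illustration.

```python
import math
import random

def simulate_chi_squared(df, n, seed=0):
    """Draw n values, each the sum of df squared standard normal draws."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(n)]

def empirical_percentile(values, pct):
    """Smallest value that is at least as large as pct percent of the values."""
    ordered = sorted(values)
    index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[index]

alpha = 0.05
draws = simulate_chi_squared(df=3, n=100_000)
cv = empirical_percentile(draws, (1 - alpha) * 100)
print(cv)  # should land near the tabled critical value for 3 degrees of freedom
```

Because the critical value is estimated from random draws, it varies slightly from run to run and with the sample size; a lookup table gives about 7.815 for 3 degrees of freedom at the 5% level.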

In [105]:
    alpha = 0.05
    cv = percentile((1 - alpha) * 100, dist.column('chisquared'))
    cv

Out[105]: 7.823004259600657

Compute the P-value (calculate and show the probability that the chi-squared statistic is greater than or equal to the computed value of the statistic).

In [106]:
    p_value = dist.where('chisquared', are.above_or_equal_to(chi_s)).num_rows / dist.num_rows
    p_value
    dist.hist(bins=50, range=make_array(0, 35), left_end=cv, right_end=35)
    plt.axvline(cv, color="red")        # critical value
    plt.axvline(chi_s, color='green')   # sample chi-squared statistic
    plt.legend   # referenced without parentheses, so no legend is drawn
    plt.show()

Out[106]: 0.032371
Out[106]: <matplotlib.lines.Line2D at 0x7f0034c25790>
Out[106]: <matplotlib.lines.Line2D at 0x7f0034c25880>
Out[106]: <function matplotlib.pyplot.legend(*args, **kwargs)>
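
The simulated p-value can be sanity-checked against a closed form. For exactly 3 degrees of freedom, the upper-tail probability of the chi-squared distribution is erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2). The sketch below assumes that special case only; the function name `chi2_sf_df3` is hypothetical, and the formula does not apply to other degrees of freedom.

```python
import math

def chi2_sf_df3(x):
    """P(X >= x) for a chi-squared variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

chi_s = 8.778571428571428   # sample statistic computed in the lab
p_exact = chi2_sf_df3(chi_s)
print(round(p_exact, 3))
```

The result should agree with the simulated 0.032371 to about three decimal places; the small remaining gap is sampling noise in the one million simulated draws.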

Conclusion

What do you conclude about using Unit 1 to estimate demand for Unit 2?

In [107]:
    p_value > alpha
    chi_s < cv

Out[107]: False
Out[107]: False

Quiz

What type of test are we conducting?
  • Goodness of Fit
  • Test of Independence

How many degrees of freedom are there? ___

How many total sales are there in Unit 2 so far? _____

Assuming that Unit 2 sales were to follow the same proportions as those of Unit 1, how many sales-to-date of Style 2 would you have expected? ____

What was your computed p_value? _____

What was your critical value based upon a significance level of 5% using the lookup table from in-class? _____

What was the value of your sample chi squared statistic? _____

What do you conclude about using Unit 1 to estimate demand for Unit 2? _____
  • You can reject the null hypothesis and conclude that Unit 1 is a good guideline for Unit 2, because the p-value is large.
  • Unit 2 sales are not consistent with Unit 1 sales because the statistic is more extreme than the cv.
  • Unit 2 sales are not consistent with Unit 1 sales because the cv is greater than the significance level.
  • Unit 1 sales are a good guideline for Unit 2 because the p-value is less than the significance level.

Financial Advice

Business Decision

A financial advisor wants to determine the relationship between the type of fund and client satisfaction across all its clients. A fund can be made up of either stocks or bonds. Client satisfaction can be high, medium, or low.

Data

Here are the numbers of clients reporting each satisfaction level, according to the type of fund the client owns:

stocks: 15 high, 12 medium, 3 low
bonds: 24 high, 4 medium, 2 low

Show the count of each fund type-client satisfaction pair.

In [108]:
    data = Table().with_columns(
        "Fund Type", make_array("stocks", "stocks", "stocks", "bonds", "bonds", "bonds"),
        "Satisfaction Level", make_array("high", "medium", "low", "high", "medium", "low"),
        "Count", make_array(15, 12, 3, 24, 4, 2))
    data

Out[108]:
    Fund Type | Satisfaction Level | Count
    stocks    | high               | 15
    stocks    | medium             | 12
    stocks    | low                | 3
    bonds     | high               | 24
    bonds     | medium             | 4
    bonds     | low                | 2

Analysis

Calculate and show the count of each fund type.

In [109]:
    fund_type_freq = data.select("Fund Type", "Count").group("Fund Type", sum)
    fund_type_freq

Out[109]:
    Fund Type | Count sum
    bonds     | 30
    stocks    | 30

Calculate and show the count of each client satisfaction level.

In [110]:
    satisfaction_freq = data.select("Satisfaction Level", "Count").group("Satisfaction Level", sum)
    satisfaction_freq

Out[110]:
    Satisfaction Level | Count sum
    high               | 39
    low                | 5
    medium             | 16

Add a column to your table for the expected frequencies for each pairwise combination.

In [111]:
    # expected frequencies for each pairwise combination
    total_obs = sum(fund_type_freq.column("Count sum"))
    total_obs
    pairwise_freq = data.join('Fund Type', fund_type_freq)
    pairwise_freq = pairwise_freq.relabeled("Count sum", "Fund Type Count")
    pairwise_freq = pairwise_freq.join("Satisfaction Level", satisfaction_freq)
    pairwise_freq = pairwise_freq.relabeled("Count sum", "Satisfaction Level Count")
    pairwise_freq = pairwise_freq.with_columns(
        'Expected',
        pairwise_freq.column('Fund Type Count') * pairwise_freq.column('Satisfaction Level Count') / total_obs)
    pairwise_freq

Out[111]: 60

Out[111]:
    Satisfaction Level | Fund Type | Count | Fund Type Count | Satisfaction Level Count | Expected
    high               | bonds     | 24    | 30              | 39                       | 19.5
    high               | stocks    | 15    | 30              | 39                       | 19.5
    low                | bonds     | 2     | 30              | 5                        | 2.5
    low                | stocks    | 3     | 30              | 5                        | 2.5
    medium             | bonds     | 4     | 30              | 16                       | 8
    medium             | stocks    | 12    | 30              | 16                       | 8

Calculate and show the sample chi-squared.
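
Under independence, each expected count is (fund type total) x (satisfaction level total) / (grand total). The joins above build exactly that product; the sketch below reproduces it with plain dictionaries and then sums (observed - expected)^2 / expected over the six cells. The variable names are hypothetical and the `datascience` library is deliberately not used.

```python
# Observed client counts from the lab text, keyed by (fund type, satisfaction level)
observed = {
    ("stocks", "high"): 15, ("stocks", "medium"): 12, ("stocks", "low"): 3,
    ("bonds", "high"): 24, ("bonds", "medium"): 4, ("bonds", "low"): 2,
}

# Marginal totals for each fund type and each satisfaction level
fund_totals, level_totals = {}, {}
for (fund, level), count in observed.items():
    fund_totals[fund] = fund_totals.get(fund, 0) + count
    level_totals[level] = level_totals.get(level, 0) + count
total = sum(observed.values())

# Expected count per cell if fund type and satisfaction were independent
expected = {
    (fund, level): fund_totals[fund] * level_totals[level] / total
    for (fund, level) in observed
}

# Chi-squared statistic summed over all cells
chi_s = sum((observed[cell] - expected[cell]) ** 2 / expected[cell] for cell in observed)
print(expected[("bonds", "high")], round(chi_s, 4))  # 19.5 6.2769
```

Because both fund types have 30 clients, each cell's expected count is simply half of its satisfaction level's total.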

In [112]:
    # Compute Chi-Squared statistic
    # add column for difference between observed and expected
    pairwise_freq = pairwise_freq.with_column(
        'diff', pairwise_freq.column('Count') - pairwise_freq.column('Expected'))
    # square the difference
    pairwise_freq = pairwise_freq.with_column('diff^2', pairwise_freq.column('diff') ** 2)
    # find relative difference by dividing squared differences by 'expected'
    pairwise_freq = pairwise_freq.with_column(
        'rel diff', pairwise_freq.column('diff^2') / pairwise_freq.column('Expected'))
    pairwise_freq
    chi_s = sum(pairwise_freq.column('rel diff'))
    chi_s

Out[112]:
    Satisfaction Level | Fund Type | Count | Fund Type Count | Satisfaction Level Count | Expected | diff | diff^2 | rel diff
    high               | bonds     | 24    | 30              | 39                       | 19.5     | 4.5  | 20.25  | 1.03846
    high               | stocks    | 15    | 30              | 39                       | 19.5     | -4.5 | 20.25  | 1.03846
    low                | bonds     | 2     | 30              | 5                        | 2.5      | -0.5 | 0.25   | 0.1
    low                | stocks    | 3     | 30              | 5                        | 2.5      | 0.5  | 0.25   | 0.1
    medium             | bonds     | 4     | 30              | 16                       | 8        | -4   | 16     | 2
    medium             | stocks    | 12    | 30              | 16                       | 8        | 4    | 16     | 2

Out[112]: 6.276923076923078

Get 1,000,000 values from the chi-squared distribution for the appropriate degrees of freedom. Show the degrees of freedom, a few of the values, and a histogram of all the values (50 bins, range 0 to 25).

In [113]:
    df = 2 * 1   # (3 satisfaction levels - 1) * (2 fund types - 1)
    df
    dist_array = chisquare(df, 1000000)
    dist = Table().with_column('chisquared', dist_array)
    dist
    dist.hist(bins=50, range=make_array(0, 25))

Out[113]: 2

Out[113]:
    chisquared
    0.0544898
    0.386128
    0.390457
    2.5022
    0.615491
    0.388123
    0.492173
    1.57597
    0.272731
    1.53969
    ... (999990 rows omitted)
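
The two tests in this lab count degrees of freedom differently: a goodness-of-fit test uses (number of categories - 1), while a test of independence uses (rows - 1) x (columns - 1). A small sketch of both rules follows; the function names are hypothetical.

```python
def goodness_of_fit_df(n_categories):
    """Degrees of freedom for a chi-squared goodness-of-fit test."""
    return n_categories - 1

def independence_df(n_rows, n_cols):
    """Degrees of freedom for a chi-squared test of independence."""
    return (n_rows - 1) * (n_cols - 1)

tshirt_df = goodness_of_fit_df(4)   # four t-shirt styles
fund_df = independence_df(3, 2)     # three satisfaction levels by two fund types
print(tshirt_df, fund_df)  # 3 2
```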

Calculate and show the probability of the sample chi-squared (or above) if the hypothesis is correct (this is the p-value). Also show the sample chi-squared and a histogram of the chi-squared distribution with the area corresponding to the probability highlighted.

In [114]:
    p_value = dist.where('chisquared', are.above_or_equal_to(chi_s)).num_rows / dist.num_rows
    p_value

Out[114]: 0.04351

Calculate and show the critical value at significance level 0.05 based on the chi-squared distribution. Also show the significance level and a histogram of the chi-squared distribution with the area corresponding to the significance level highlighted.
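
With exactly 2 degrees of freedom the chi-squared distribution is an exponential distribution with mean 2, so both the p-value and the critical value have simple closed forms: P(X >= x) = exp(-x/2), and the critical value at level alpha is -2 ln(alpha). The sketch below uses those formulas as a cross-check on the simulation. It assumes df = 2 only, and the function names are hypothetical.

```python
import math

def chi2_sf_df2(x):
    """P(X >= x) for a chi-squared variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2)

def chi2_critical_value_df2(alpha):
    """Value exceeded with probability alpha when df is exactly 2."""
    return -2 * math.log(alpha)

chi_s = 6.276923076923078   # sample statistic computed in the lab
p_exact = chi2_sf_df2(chi_s)
cv_exact = chi2_critical_value_df2(0.05)
print(round(p_exact, 3), round(cv_exact, 2))  # 0.043 5.99
```

These closed-form values are close to, but not identical with, the simulated figures, which carry a little sampling noise from the one million draws.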

In [115]:
    alpha = 0.05
    cv = percentile((1 - alpha) * 100, dist.column('chisquared'))
    cv
    alpha
    cv
    dist.hist(bins=50, range=make_array(0, 35), left_end=cv, right_end=35)
    plt.axvline(cv, color="red")        # critical value
    plt.axvline(chi_s, color='green')   # sample chi-squared statistic
    plt.legend   # referenced without parentheses, so no legend is drawn
    plt.show()

Out[115]: 6.0029736001240765
Out[115]: 0.05
Out[115]: 6.0029736001240765
Out[115]: <matplotlib.lines.Line2D at 0x7f0034ba97f0>
Out[115]: <matplotlib.lines.Line2D at 0x7f0034bda220>
Out[115]: <function matplotlib.pyplot.legend(*args, **kwargs)>

Calculate and show whether you should conclude that the hypothesis is correct, at significance level 0.05.

In [116]:
    p_value > alpha
    chi_s < cv

Out[116]: False
Out[116]: False

Quiz

The financial advisory firm has reports from ___ of its clients.

___ of its clients own funds that comprise bonds.

___ of its clients are highly satisfied.

If type of fund were independent of satisfaction level, then we would expect ____ of clients to be highly satisfied bond fund owners. In other words, we would expect a client to have high satisfaction and own a bond fund with this probability.

The sample chi squared is ____.

The p-value is ____.

The critical value is ____.

Based on this analysis and assuming a 5% significance level, the financial advisor should conclude that a client's satisfaction level ____
  • depends on whether it owns stocks or bonds, because the sample chi squared is less than the critical value
  • does not depend on whether it owns stocks or bonds, because the sample chi squared is less than the critical value
  • depends on whether it owns stocks or bonds, because the sample chi squared is greater than the critical value
  • does not depend on whether it owns stocks or bonds, because the sample chi squared is greater than the critical value

Document revised 10 April 2023
Copyright (c) Huntsinger and Lee