HW7-solutions

Uploaded by sjobs3121 on Apr 3, 2024
Homework 7 for ISEN 355 (System Simulation)
(C) 2023 David Eckman
Due Date: Upload to Canvas by 9:00pm (CDT) on Friday, March 24.

In [ ]:
# Import some useful Python packages.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import pandas as pd

Problem 1. (50 points) The file canvas_total_times.csv contains the total time (in hours) each student has spent on the ISEN 355 Canvas page. The instructor, TA, and graders were excluded from the data.

In [ ]:
# Import the dataset.
mydata = pd.read_csv('canvas_total_times.csv', header=None)
# Convert the data to a list.
times = mydata[mydata.columns[0]].values.tolist()

(a) (2 points) Do you think it is reasonable to assume that the data are independent and identically distributed? Why or why not?

If we think of the distribution as reflecting the Canvas habits of the "typical" ISEN 355 student, then it is reasonable to assume that the data are identically distributed. It would be reasonable to assume the data are independent if we believe that each student visits the Canvas webpage on their own device on their own time. There are, however, a number of factors that could make the Canvas times dependent. For example, all students must log on to Canvas to take the timed quizzes during the labs. In addition, some students complete the homework assignments in pairs, so if they are working off of only one computer, then Canvas is open on only one of their accounts; likewise for the group project assignments. Given that total times are in the tens of hours, and students may not have Canvas open most of the time when working on the homework and the project, these dependencies may be minimal.

(b) (3 points) Plot a histogram of the times using 10 bins. Label the axes of your plot. Comment on the shape of the data.

In [ ]:
plt.hist(times, bins=10)
plt.xlabel("Time Spent on Canvas (hours)")
plt.ylabel("Frequency")
plt.title("Histogram of Time Spent on Canvas")
plt.show()
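As a numeric complement to the histogram, the right skew could be quantified with scipy.stats.skew. This is a sketch, not part of the assignment; since canvas_total_times.csv is not reproduced here, a right-skewed gamma sample stands in for the notebook's `times` variable.

```python
import numpy as np
import scipy.stats

# Stand-in for `times`: a right-skewed sample of the same size (n = 111)
# as the dataset in the notebook. The real data comes from the CSV.
rng = np.random.default_rng(0)
times = rng.gamma(shape=1.5, scale=8.0, size=111) + 4.5

# A positive sample skewness indicates a right tail, consistent
# with the shape seen in the histogram.
skewness = scipy.stats.skew(times)
print(f"Sample skewness: {skewness:.2f}")
```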
The histogram shows that most students are on Canvas for less than 30 hours, while only a few are on Canvas for more than 50 hours. The histogram has a right tail and does not look symmetric.

(c) (4 points) How many outliers do you see in the data? What is a possible explanation for these high values? In the rest of the problem, we will fit various distributions to the time data and assess their fit. What do you think we should do about the outliers? Explain your reasoning.

From the histogram, it appears that 3 students spent more than 50 hours on Canvas, so there are 3 outliers. It is possible that these 3 students simply left Canvas open in one of their browser tabs. If this is indeed the case, then we might consider the observations to be erroneous, in which case it would be advisable to remove the outliers. If instead we were to discover that these students intentionally viewed the Canvas page for 50+ hours, then we should leave the outliers in the data.

In [ ]:
# Although not necessary, we could use the following piece of code to get the number of outliers:
outliers = [value for value in times if value > 50]
num_outliers = len(outliers)
print(num_outliers)

3

For the rest of the assignment, use ALL of the data in canvas_total_times.csv.

(d) (3 points) Use scipy.stats.rv_continuous.fit() to compute maximum likelihood estimators (MLEs) for the mean ( loc ) and standard deviation ( scale ) of a normal distribution fitted to the data. See the accompanying documentation to see how the function is called, as well as the documentation about scipy.stats.norm . The term rv_continuous is not part of the function call; instead, you would specifically use scipy.stats.norm.fit() . Report the MLEs. (Note: This .fit method is different from
the one we used in the previous homework to fit a Poisson distribution. You should not need to update SciPy to use this function.) (Note: For functions like .fit() that return multiple outputs, you will want to define multiple variables on the left-hand side of the equals sign. For example, a, b = myfunction(c) would be a way to separate two variables ( a and b ) returned by one call of the function myfunction() .)

In [ ]:
loc_mle, scale_mle = scipy.stats.norm.fit(times)
print(f"The MLEs of the fitted normal distribution are a mean of {round(loc_mle, 2)} and a standard deviation of {round(scale_mle, 2)}.")

The MLEs of the fitted normal distribution are a mean of 17.17 and a standard deviation of 10.7.

(e) (2 points) If you were to generate a single time from the fitted normal distribution, what is the probability it would be negative? Based on the mean and standard deviation of the fitted normal distribution, should you be concerned about using this input distribution? Explain your reasoning.

In [ ]:
prob_negative = scipy.stats.norm.cdf(x=0, loc=loc_mle, scale=scale_mle)
print(f"The probability of generating a negative value is {round(prob_negative, 3)}.")

The probability of generating a negative value is 0.054.

We should be concerned, because the chance of generating a negative value is about 1/20, which is quite high. If the simulation model needed to generate many Canvas times, it would very likely generate a negative value at some point, which is an impossible amount of time to be on Canvas.

(f) (3 points) Reproduce your histogram from part (b) with density=True . Superimpose the pdf of the fitted normal distribution over the interval [0, 70]. Use x = np.linspace(0, 70, 71) . Comment on the visual fit of the normal distribution.
In [ ]:
plt.hist(times, bins=10, density=True)
x = np.linspace(0, 70, 71)
plt.plot(x, scipy.stats.norm.pdf(x, loc_mle, scale_mle))
plt.xlabel("Time Spent on Canvas (hours)")
plt.ylabel("Frequency / PDF")
plt.title("Histogram / Normal PDF")
plt.show()
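If a normal-like model were still wanted despite the negative-value concern raised in part (e), one option (not required by the assignment) is a normal distribution truncated at zero via scipy.stats.truncnorm. This sketch reuses the MLEs reported above; whether refitting the truncated model to the data would give better parameters is a separate question.

```python
import scipy.stats

# MLEs of the fitted (untruncated) normal distribution reported above.
loc_mle, scale_mle = 17.17, 10.7

# truncnorm takes its bounds in standard-normal units: (bound - loc) / scale.
a = (0 - loc_mle) / scale_mle   # truncate at 0 hours
b = float('inf')                # no upper bound
trunc_dist = scipy.stats.truncnorm(a, b, loc=loc_mle, scale=scale_mle)

# The truncated model assigns zero probability to nonpositive times.
print(trunc_dist.cdf(0.0))
```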
The normal distribution does not fit the data well. Even if the outliers were excluded, the fit for the rest of the data looks poor. In particular, the normal distribution puts too much probability density on times <= 5 hours (including negative values!) and not enough on the range 5-15 hours, where most of the data is concentrated. The misalignment between the mode (peak) of the normal distribution and the tallest bars of the histogram is possibly explained by the outliers shifting the fitted normal distribution to the right.

(g) (4 points) Use scipy.stats.rv_continuous.fit() to compute maximum likelihood estimators (MLEs) for the shape parameter ( a ), location parameter ( loc ), and scale parameter ( scale ) of a gamma distribution fitted to the data. See the documentation about scipy.stats.gamma . Report the MLEs.

In [ ]:
a_mle, loc_mle, scale_mle = scipy.stats.gamma.fit(times)
print(f"The MLEs of the fitted gamma distribution are a = {round(a_mle, 3)}, loc = {round(loc_mle, 3)}, and scale = {round(scale_mle, 3)}.")

The MLEs of the fitted gamma distribution are a = 1.567, loc = 4.484, and scale = 8.096.

(h) (4 points) You are about to conduct a Chi-Square goodness-of-fit test for the fitted gamma distribution using 10 equiprobable bins. In preparation, you will need to find the breakpoints of the bins, count the observed number of data points in each bin, and compute the expected number of data points in each bin.

1. To get the endpoints of the bins, use scipy.stats.gamma.ppf() to calculate quantiles of the fitted gamma distribution at q = np.linspace(0, 1, 11) .
2. To get the observed number in each bin, use np.histogram() (documentation found here). (Note: The function np.histogram() returns multiple outputs.)
3. To get the expected number in each bin, note that you have 10 equiprobable bins. The expected numbers need to sum up to the total number of observations
in the dataset.

Print the breakpoints, the observed numbers (the $O_i$'s), and the expected numbers (the $E_i$'s).

In [ ]:
q = np.linspace(0, 1, 11)
breakpoints = scipy.stats.gamma.ppf(q, a_mle, loc_mle, scale_mle)
print(f"The breakpoints of the 10 equiprobable bins are {[round(breakpoint, 2) for breakpoint in breakpoints]}.")
observed, _ = np.histogram(times, breakpoints)
print(f"The observed numbers of data points in each bin are {list(observed)}.")
expected = [len(times) / 10] * 10
print(f"The expected numbers of data points in each bin are {expected}.")

The breakpoints of the 10 equiprobable bins are [4.48, 7.09, 8.89, 10.66, 12.53, 14.6, 17.01, 19.99, 24.02, 30.65, inf].
The observed numbers of data points in each bin are [13, 6, 14, 11, 14, 11, 10, 10, 13, 9].
The expected numbers of data points in each bin are [11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1].

(i) (5 points) Use scipy.stats.chisquare() to perform a Chi-Square goodness-of-fit test; see the documentation for more details. The function has three important arguments: f_obs , which is the array of $O_i$'s; f_exp , which is the array of $E_i$'s; and ddof , which corresponds to $s$ from our class notes. (Note: The function scipy.stats.chisquare() returns multiple outputs.) Suppose you are testing with a significance level of $\alpha=0.1$. Report the test statistic, $D^2$, and compute the associated cutoff value $\chi^2_{k-s-1, 1-\alpha}$ using scipy.stats.chi2.ppf() . Report the $p$-value as well. Interpret the conclusion of the goodness-of-fit test in words. Don't just say "reject the null hypothesis" or "fail to reject the null hypothesis."

In [ ]:
# For the fitted gamma distribution, the number of estimated parameters was s = 3.
chisq, p_value = scipy.stats.chisquare(f_obs=observed, f_exp=expected, ddof=3)
# Cutoff value is the 1 - alpha = 1 - 0.1 = 0.90 quantile of a Chi-Squared distribution with k - s - 1 = 10 - 3 - 1 = 6 degrees of freedom.
print(f"The test statistic is {round(chisq, 3)} while the cutoff value is {round(scipy.stats.chi2.ppf(q=0.9, df=6), 3)}.")
print(f"The p-value of the test is {round(p_value, 3)}.")

The test statistic is 5.126 while the cutoff value is 10.645.
The p-value of the test is 0.528.

The test statistic is less than the cutoff value, and the p-value is higher than $\alpha=0.1$. Both indicate that we fail to reject the null hypothesis. This means that we cannot rule out that the data came from a gamma distribution; in other words, a gamma distribution may be a good fit. As we said in class, we should look at other plots (e.g., Q-Q plots) and not blindly trust a goodness-of-fit test.

(j) (2 points) Plot the pdf and histogram the Chi-Square test is working off of. Plot the pdf of the fitted gamma distribution over the interval $[0, 70]$, but this time plot the histogram of the data using the 10 equiprobable bins you found in part (h). Again, use the density=True argument to rescale the histogram.

In [ ]:
# Optional: replace the rightmost breakpoint (+infinity) with a large number so that plt.hist can plot the rightmost bar.
breakpoints[-1] = 70
plt.hist(times, bins=breakpoints, density=True)
x = np.linspace(0, 70, 71)
plt.plot(x, scipy.stats.gamma.pdf(x, a_mle, loc_mle, scale_mle))
plt.xlabel("Time Spent on Canvas (hours)")
plt.ylabel("Frequency / PDF")
plt.title("Histogram / Gamma PDF")
plt.show()

(k) (4 points) Produce a Q-Q plot of the data versus quantiles from the fitted gamma distribution. Do NOT use the scipy.stats.probplot() function. Interpret the Q-Q plot.

In [ ]:
sorted_data = np.sort(times)
invert_pts = (np.arange(1, len(sorted_data) + 1, 1) - 0.5) / len(sorted_data)
# Calculate the theoretical quantiles.
gamma_quantiles = scipy.stats.gamma.ppf(q=invert_pts, a=a_mle, loc=loc_mle, scale=scale_mle)
plt.scatter(sorted_data, gamma_quantiles)
plt.plot([0, 70], [0, 70], color='red')
plt.xlabel('Observed Values')
plt.ylabel('Theoretical Quantiles')
plt.title('Q-Q Plot for Fitted Gamma Distribution')
plt.show()
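Although the assignment forbids scipy.stats.probplot() for the answer itself, it can serve as an independent cross-check of a hand-built Q-Q plot. This sketch uses simulated stand-in data (the CSV is not reproduced here) and the fitted-gamma parameters as shape arguments; the correlation coefficient it returns summarizes how close the Q-Q points lie to a straight line.

```python
import numpy as np
import scipy.stats

# Stand-in for `times`: a sample actually drawn from the gamma model,
# so the Q-Q correlation should be close to 1.
rng = np.random.default_rng(1)
data = rng.gamma(shape=1.5, scale=8.0, size=111) + 4.5

# probplot returns ((theoretical_quantiles, ordered_values), (slope, intercept, r)).
# sparams are passed through to gamma.ppf as (a, loc, scale).
(osm, osr), (slope, intercept, r) = scipy.stats.probplot(
    data, dist=scipy.stats.gamma, sparams=(1.5, 4.5, 8.0))
print(f"Correlation of the Q-Q points: {r:.3f}")
```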
With the exception of the 3 outliers, the Q-Q plot shows a very good fit, with all other values falling close to the 45-degree line through the origin.

(l) (5 points) Use scipy.stats.kstest() to conduct a K-S test for the fitted gamma distribution; see the documentation for more details. The rvs argument should be your data and the cdf argument should be a callable function to evaluate the cdf of the theoretical distribution. The documentation gives some examples. Report the $p$-value of the test and state what your conclusion would be if testing with a significance level of $\alpha=0.1$.

In [ ]:
# To pass a callable cdf function into kstest(), we create a gamma distribution object having the fitted parameters.
gamma_dist = scipy.stats.gamma(a_mle, loc_mle, scale_mle)
_, pvalue = scipy.stats.kstest(rvs=times, cdf=gamma_dist.cdf)
print(f"The p-value of the test is {round(pvalue, 3)}.")

The p-value of the test is 0.996.

Since the p-value is larger than $\alpha=0.1$, we fail to reject the null hypothesis. Again, this means the gamma distribution is a good candidate distribution.

(m) (3 points) Plot the empirical cdf of the data and overlay the cdf of the fitted gamma distribution. (This will resemble the plot from the lecture slide introducing the K-S test.)

In [ ]:
plt.plot(x, scipy.stats.gamma.cdf(x, a=a_mle, loc=loc_mle, scale=scale_mle), label='Fitted')
y_ecdf = [(i + 1) / len(sorted_data) for i in range(len(sorted_data))]
plt.step(sorted_data, y_ecdf, where='post', label='Empirical')
plt.xlabel('Time Spent on Canvas (hours)')
plt.title('Empirical and Fitted CDFs')
plt.legend()
plt.show()

(n) (6 points) Pick another continuous distribution that you think might fit the data well. See the "Probability Distributions" section of the scipy.stats documentation for a comprehensive list of continuous distributions supported by scipy.stats . For your chosen distribution, 1) use MLE to estimate parameters for your distribution, 2) produce a plot of the histogram of the data with the pdf of your fitted distribution superimposed, and 3) perform one goodness-of-fit test (either Chi-Square or K-S) and state the conclusion for a reasonable significance level $\alpha$.

In [ ]:
# Fitting a lognormal distribution.
s_mle, loc_mle, scale_mle = scipy.stats.lognorm.fit(times)
print(f"The MLEs of the fitted lognormal distribution are s = {round(s_mle, 3)}, loc = {round(loc_mle, 3)}, and scale = {round(scale_mle, 3)}.")

The MLEs of the fitted lognormal distribution are s = 0.69, loc = 2.68, and scale = 11.49.

In [ ]:
# Choosing to use 10 equiprobable bins, but could just use default bins.
q = np.linspace(0, 1, 11)
breakpoints = scipy.stats.lognorm.ppf(q, s_mle, loc_mle, scale_mle)
breakpoints[-1] = 70
plt.hist(times, bins=breakpoints, density=True)
x = np.linspace(0, 70, 71)
plt.plot(x, scipy.stats.lognorm.pdf(x, s_mle, loc_mle, scale_mle))
plt.xlabel("Time Spent on Canvas (hours)")
plt.ylabel("Frequency / PDF")
plt.title("Histogram / Lognorm PDF")
plt.show()

In [ ]:
# Conducting a Chi-Square test. With equiprobable bins, the default f_exp (equal counts in every bin) matches the expected counts, so it can be omitted.
observed, _ = np.histogram(times, breakpoints)
chisq, p_value = scipy.stats.chisquare(f_obs=observed, ddof=3)
print(f"The p-value of the test is {round(p_value, 3)}.")

The p-value of the test is 0.646.

In [ ]:
# Conducting a K-S test.
lognorm_dist = scipy.stats.lognorm(s_mle, loc_mle, scale_mle)
statistic, pvalue = scipy.stats.kstest(rvs=times, cdf=lognorm_dist.cdf)
print(f"The p-value of the test is {round(pvalue, 3)}.")

The p-value of the test is 0.997.

For both tests, the p-value is much higher than any reasonable value of $\alpha$, so we fail to reject the null hypothesis and conclude that the lognormal distribution is a good choice.
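Since both the gamma and lognormal distributions pass their goodness-of-fit tests, one could compare the candidate families more systematically by maximized log-likelihood or AIC. This is a sketch beyond the assignment's scope, using simulated stand-in data (the real `times` comes from the CSV) and fixing loc at 0 for the skewed families as a simplification.

```python
import numpy as np
import scipy.stats

# Stand-in for `times`: a right-skewed sample of the same size as the dataset.
rng = np.random.default_rng(2)
times = rng.gamma(shape=1.5, scale=8.0, size=111) + 4.5

candidates = {
    "norm": scipy.stats.norm,
    "gamma": scipy.stats.gamma,
    "lognorm": scipy.stats.lognorm,
}

aic = {}
for name, dist in candidates.items():
    if name == "norm":
        params = dist.fit(times)
        k = 2  # mean and standard deviation estimated
    else:
        # Fixing loc at 0 keeps the fit numerically stable (a simplification).
        params = dist.fit(times, floc=0)
        k = 2  # shape and scale estimated; loc is fixed
    loglik = np.sum(dist.logpdf(times, *params))  # maximized log-likelihood
    aic[name] = 2 * k - 2 * loglik                # AIC = 2k - 2 ln L

# Lower AIC indicates a better trade-off between fit and parameter count.
for name in sorted(aic, key=aic.get):
    print(f"{name}: AIC = {aic[name]:.1f}")
```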