HW6-solutions

School: University of Texas
Course: 230
Subject: Statistics
Date: Apr 3, 2024
Type: html
Pages: 9
Uploaded by sjobs3121
Homework 6 for ISEN 355 (System Simulation)
(C) 2023 David Eckman
Due Date: Upload to Canvas by 9:00pm (CST) on Friday, March 10.

In [ ]: # Import some useful Python packages.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import pandas as pd
import random

Problem 1. (50 points) This question asks you to analyze data I collected on the number of emails concerning ISEN 355 I received each day during the month of February. The daily counts are listed in chronological order below in a list called email_counts.

In [ ]: email_counts = [3, 5, 9, 2, 1, 5, 2, 7, 14, 13, 2, 2, 10, 2, 1, 5, 9, 6, 6, 4, 3, 4, 11, 4, 7, 1, 2, 2]

(a) (2 points) Do you think it is reasonable to assume that the data are independent and identically distributed? Why or why not?

It is not reasonable to assume the data are identically distributed because I am likely to receive more emails on certain days of the week, like Fridays, when homework assignments are due. From looking at the data (arranged in weekly rows of seven days), we can see that the email counts for Fridays (the third column) are much higher than those for Sundays (the fifth column). As for independence, one could argue either way. It would be reasonable to say that the email counts from day to day are not independent because a single exchange (thread) can produce multiple emails stretched over several days. For example, if I received many emails on Wednesday but didn't reply until Thursday, I might receive more follow-up emails on Thursday. If the data instead counted the number of email threads initiated on each day in February, then one could argue the counts are independent.

(b) (4 points) Plot a histogram of the data with one bin for each integer between 0 and 14 (inclusive). Do this by using the argument bins=np.linspace(-0.5, 14.5, 16). Label the axes of your plot. Comment on the shape of the data.
In [ ]: plt.hist(email_counts, bins=np.linspace(-0.5, 14.5, 16))
plt.xlabel('Number of Emails Received')
plt.ylabel('Frequency')
plt.title("Histogram of Daily Email Counts")
plt.show()
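The claim in part (a) about Fridays versus Sundays can be checked directly. Below is a minimal sketch, assuming the 28 counts begin on a Wednesday so that the third column of each weekly row is a Friday and the fifth is a Sunday (this assumption is consistent with February 2023 but is not stated in the data itself):

```python
import numpy as np

email_counts = [3, 5, 9, 2, 1, 5, 2, 7, 14, 13, 2, 2, 10, 2, 1,
                5, 9, 6, 6, 4, 3, 4, 11, 4, 7, 1, 2, 2]

# Arrange the 28 daily counts into 4 weekly rows of 7 days each.
weeks = np.array(email_counts).reshape(4, 7)

# Under the assumed Wednesday start, column index 2 (third column) holds
# the Fridays and column index 4 (fifth column) holds the Sundays.
friday_mean = weeks[:, 2].mean()
sunday_mean = weeks[:, 4].mean()
print(friday_mean, sunday_mean)  # → 8.75 2.5
```

The Friday average is more than three times the Sunday average, which supports the claim that the counts are not identically distributed across days of the week.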
(c) (2 points) We mentioned in class that the Poisson distribution is commonly used to model the number of arrivals during a fixed time period when arrival events occur independently. Based on this interpretation and the shape of the data plotted in part (b), do you think a Poisson distribution would be a good choice for modeling the number of ISEN 355 emails I receive on a daily basis? (Ignore your answer for part (a) when answering this part.)

If we think of the emails in my inbox as arriving from different people, we might view the arrival events as being independent, in which case a Poisson distribution could be a decent candidate for modeling the number of emails received in a day. (See part (a) for a counter-argument to the independence assumption.) In the histogram, however, the distribution of the email counts looks too spread out (too high a variance) to be Poisson distributed. We confirm this later in the problem.

(d) (2 points) In the rest of the problem we will fit a Poisson distribution to the email count data. Recall that the Poisson distribution has a single parameter, $\lambda > 0$. Refer to the slides on special distributions (#4.2) to see how $\lambda$ is related to the mean (expected value) of a Poisson random variable. Based on the data in email_counts and this observation, what do you think would be a good guess as to the value of the parameter $\lambda$ that best fits the data?

Because $\lambda$ is the mean of a Poisson($\lambda$) random variable, a good guess for the value of $\lambda$ would be the sample mean. This value is $\lambda = 5.071$ emails. (Note: the parameter $\lambda$ is real-valued, so we should not round it to 5.)

In [ ]: print(f"The sample mean is: {round(np.mean(email_counts), 3)} emails.")
The sample mean is: 5.071 emails.

(e) (4 points) We discussed in class how, for a discrete random variable, the likelihood function is the product of the probability mass function evaluated at each data point $x_i$. Write a function that calculates the likelihood of seeing i.i.d. data x_data under a Poisson distribution with parameter lambd. (Note: I'm intentionally using lambd here because lambda is a reserved keyword in Python.) You will want to use the function np.prod(), which calculates the product of the terms in an array (list), and scipy.stats.poisson.pmf(). Documentation on the latter function can be found here; note that scipy.stats uses mu when referring to the parameter we are calling $\lambda$. Hint: It may be helpful to structure your code similar to that in Question 1(c) from Homework 5, which asked you to write a function to evaluate the KDE $\hat{f}$ at a given $x$.

In [ ]: def evaluate_likelihood(x_data, lambd):
    ell_lambda = np.prod(scipy.stats.poisson.pmf(x_data, lambd))
    return ell_lambda

# Check your work.
print(f"ell(lambda=5.0) = {evaluate_likelihood(x_data=email_counts, lambd=5.0)} and should be 1.8887e-35.")
ell(lambda=5.0) = 1.8887029762592377e-35 and should be 1.8887e-35.

(f) (4 points) Using your code from part (e), evaluate the likelihood function at a grid of values lambdas = np.linspace(0.1, 10, 100). Plot the resulting likelihood function using plt.plot. (The values appearing on the y-axis of your plot should be very small.) Label the axes of your plot. Comment on the shape of your plot.

In [ ]: lambdas = np.linspace(0.1, 10, 100)
likelihoods = [evaluate_likelihood(email_counts, lambd) for lambd in lambdas]
plt.plot(lambdas, likelihoods)
plt.xlabel(r'$\lambda$')
plt.ylabel(r'$\ell(\lambda)$')
plt.title('Plot of the Likelihood Function')
plt.show()
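Since the likelihood values here are on the order of $10^{-35}$, the product of pmfs can underflow to zero for larger datasets. As an aside (not part of the assignment), a minimal sketch showing that summing scipy.stats.poisson.logpmf() gives the same log-likelihood as taking the log of the product, while staying in a numerically safe range:

```python
import numpy as np
import scipy.stats

email_counts = [3, 5, 9, 2, 1, 5, 2, 7, 14, 13, 2, 2, 10, 2, 1,
                5, 9, 6, 6, 4, 3, 4, 11, 4, 7, 1, 2, 2]

# Log-likelihood via the product of pmfs (can underflow for large n).
ll_via_prod = np.log(np.prod(scipy.stats.poisson.pmf(email_counts, 5.0)))

# Log-likelihood via a sum of log-pmfs (numerically stable).
ll_via_sum = np.sum(scipy.stats.poisson.logpmf(email_counts, 5.0))

print(np.isclose(ll_via_prod, ll_via_sum))  # → True
```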
The likelihood function $\ell(\cdot)$ takes non-negative values and is unimodal with a maximizer around $\lambda = 5$.

(g) (5 points) Use the np.log() function to evaluate the log-likelihood function $L(\lambda) = \log(\ell(\lambda))$ for the same range of values of $\lambda$ as in part (f). Plot the log-likelihood function and label the axes of your plot. (The values appearing on the y-axis of your plot should be negative.)

In [ ]: loglikelihood = np.log(likelihoods)
plt.plot(lambdas, loglikelihood)
plt.xlabel(r'$\lambda$')
plt.ylabel(r'$L(\lambda)$')
plt.title('Plot of the Log-Likelihood Function')
plt.show()
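That the maximizer sits at the sample mean can also be verified analytically (a short derivation, not required by the assignment). For i.i.d. observations $x_1, \ldots, x_n$ from a Poisson($\lambda$) distribution,

$$L(\lambda) = \sum_{i=1}^{n} \log\left(\frac{e^{-\lambda}\lambda^{x_i}}{x_i!}\right) = -n\lambda + \left(\sum_{i=1}^{n} x_i\right)\log\lambda - \sum_{i=1}^{n}\log(x_i!),$$

$$L'(\lambda) = -n + \frac{\sum_{i=1}^{n} x_i}{\lambda} = 0 \quad\Longrightarrow\quad \hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$

Since $L''(\lambda) = -\left(\sum_{i=1}^{n} x_i\right)/\lambda^2 < 0$, this stationary point is a maximum, and because $\log$ is increasing, the same $\hat{\lambda}$ maximizes $\ell(\lambda)$.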
(h) (2 points) Based on your plots in parts (f) and (g), what values of $\lambda$ appear to maximize the likelihood and log-likelihood functions? Are the maximizers the same? (There's no need to specify the exact maximizer; we are about to get the answer from scipy.stats.)

Both the likelihood and log-likelihood functions appear to have maximizers around $\lambda = 5$.

(i) (6 points) Use scipy.stats.fit() to find and report the maximum likelihood estimator (MLE) for $\lambda$. To check your work, the MLE for $\lambda$ should be equal to the sample mean of the data. Read the documentation to see how this function is called. You will need to provide three arguments: dist, data, and bounds. dist should be a generic scipy.stats Poisson distribution object: scipy.stats.poisson. bounds should be restrictions on the parameters we want to optimize over; use bounds={"mu": (0, 15), "loc": (0, 0)}. The parameter loc is a shift parameter scipy.stats uses for many distributions to shift the distribution left or right. We don't want to do that in this case, so we fix loc to be zero. Note: scipy.stats.fit() was released in SciPy version 1.9.0. You can check which version you are running by running the following cell. If you have an outdated version, you can update it by running the command pip install scipy --upgrade from your terminal, not within this notebook.

In [ ]: # To check your version of scipy, run this cell.
scipy.__version__
Out[ ]: '1.10.1'

In [ ]: maximizer = scipy.stats.fit(scipy.stats.poisson, email_counts, bounds={"mu": (0, 15), "loc": (0, 0)})
mle = maximizer.params[0]
print(f"The maximum likelihood estimate (per scipy.stats.fit) for lambda is {round(mle, 3)} emails.")
print(f"The sample mean is {round(np.mean(email_counts), 3)} emails.")
The maximum likelihood estimate (per scipy.stats.fit) for lambda is 5.071 emails.
The sample mean is 5.071 emails.

(j) (4 points) Reproduce your histogram from part (b), but with the additional argument density=True. Superimpose a plot of the probability mass function (pmf) of the fitted Poisson distribution (the one with the MLE for $\lambda$). Plot the pmf for values of $x = 0, 1, 2, \ldots, 15$. Use scipy.stats.poisson.pmf() and plt.stem(). Label the axes of your plot. Comment on the visual fit of the Poisson distribution.

In [ ]: plt.hist(email_counts, bins=np.linspace(-0.5, 14.5, 16), density=True, label="Histogram", color="red")
x = np.arange(0, 16)
pmf = scipy.stats.poisson.pmf(x, mle)
plt.stem(x, pmf, label="Fitted PMF")
plt.legend()
plt.xlabel('Number of Emails Received')
plt.ylabel('Probability')
plt.title("Histogram and PMF of Fitted Poisson Distribution")
plt.show()

The pmf of the fitted Poisson distribution does not fit the histogram well. The pmf concentrates too much probability mass around its mean, while the histogram is more spread out.
(k) (5 points) Plot the empirical cdf of your data. Superimpose the cdf of the fitted Poisson distribution for values of $x = 0, 1, 2, \ldots, 15$. Use scipy.stats.poisson.cdf() and plt.step(). Label the axes of your plot. Comment on the visual fit of the Poisson distribution.

In [ ]: x_ecdf = np.sort(email_counts)
n = len(email_counts)
y_ecdf = [(i + 1) / n for i in range(n)]  # equivalently: np.arange(1, n + 1) / n
cdf = scipy.stats.poisson.cdf(x, mle)
plt.step(x_ecdf, y_ecdf, where="post", label="ECDF")
plt.step(x, cdf, where="post", label="Fitted CDF")
plt.legend()
plt.xlabel('Number of Emails Received')
plt.ylabel('Probability')
plt.title("Empirical CDF and CDF of Fitted Poisson Distribution")
plt.show()

Here too, the plot shows a poor fit. The cdf of the fitted Poisson distribution misses (seen here as vertical gaps between the curves) in both tails.

(l) (7 points) Plot a Q-Q plot of the data vs. the (theoretical) Poisson distribution with the MLE value of $\lambda$. Comment on the visual fit in the Q-Q plot. (Note: Because a Poisson distribution is discrete, the points in the Q-Q plot cannot be expected to fall exactly along a 45-degree line.) Although scipy.stats.probplot() can make a Q-Q plot, it is less than ideal. In particular, it will plot a line of best fit instead of a 45-degree line through the origin. For this problem, make a Q-Q plot from scratch by following these instructions:

1. Sort the email_counts data in ascending order.
2. Use the scipy.stats.poisson.ppf() function (with appropriate arguments) to compute the corresponding quantiles of the Poisson distribution (fitted with the MLE of $\lambda$) at $q = (i - 0.5)/n$, where $i$ is the index ($i = 1, 2, \ldots, n$) and $n$ is the number of observations in the data.
3. Use plt.scatter() to produce a scatter plot of the quantile-quantile pairs.
4. Use plt.plot([0, 15], [0, 15], color='red') to plot a 45-degree red line through the origin.

In [ ]: # Sort the sample data in ascending order.
sorted_email_counts = np.sort(email_counts)

# Create an object for the fitted Poisson distribution.
n = len(email_counts)
fitted_poisson = scipy.stats.poisson(mu=mle, loc=0)

# Compute the theoretical q-quantiles for q = (i - 0.5)/n for i = 1, ..., n.
poisson_quantiles = fitted_poisson.ppf((np.arange(1, n + 1) - 0.5) / n)

# Make a Q-Q plot from scratch (as in the class notes).
plt.scatter(sorted_email_counts, poisson_quantiles)
plt.xlabel('Observed values')
plt.ylabel('Theoretical quantiles')
plt.title('Quantile-Quantile Plot')
plt.plot([0, 15], [0, 15], color='red')
plt.show()

(m) (3 points) For a Poisson random variable, its mean is equal to its variance. Compute the sample mean and sample variance of the data. How do they compare? If you had checked this at the start of your analysis (before doing MLE and Q-Q plots), would you have considered the Poisson distribution to be a good candidate distribution?

In [ ]: print(f"The sample mean is {round(np.mean(email_counts), 3)} emails.")
print(f"The sample variance is {round(np.var(email_counts, ddof=1), 3)} emails^2.")
The sample mean is 5.071 emails.
The sample variance is 13.698 emails^2.

Because the sample variance is much larger than the sample mean (more than double), it is unsurprising that the Poisson distribution turned out to be a poor candidate distribution.
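The mean-variance comparison in part (m) is sometimes summarized by the dispersion index (sample variance divided by sample mean), which equals 1 for Poisson data. A minimal sketch of this quick pre-check, using the same data:

```python
import numpy as np

email_counts = [3, 5, 9, 2, 1, 5, 2, 7, 14, 13, 2, 2, 10, 2, 1,
                5, 9, 6, 6, 4, 3, 4, 11, 4, 7, 1, 2, 2]

mean = np.mean(email_counts)        # sample mean, about 5.071
var = np.var(email_counts, ddof=1)  # sample variance (n - 1 denominator)

# Dispersion index: ~1 for Poisson data; values well above 1 signal
# overdispersion, ruling out the Poisson before any fitting is done.
dispersion = var / mean
print(round(dispersion, 3))  # → 2.701
```

A dispersion index near 2.7 is a strong early warning that a one-parameter Poisson model cannot match both the center and the spread of these counts.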