Homework 5 for ISEN 355 (System Simulation)
(C) 2023 David Eckman
Due Date: Upload to Canvas by 9:00pm (CST) on Friday, March 3.

In [ ]:
# Import some useful Python packages.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import pandas as pd
import random

Problem 1. (20 points)

In this problem, you will revisit the topic of applying KDE to the quiz3_times.csv dataset from Homework 4. This dataset contains the times (measured in minutes) it took each student to complete Quiz 3.

In [ ]:
# Import the dataset.
mydata = pd.read_csv('quiz3_times.csv', header=None)
# Convert the data to a list.
times = mydata[mydata.columns[0]].values.tolist()

(a) (2 points) Recall from the lecture slides that the KDE $\hat{f}$ can be viewed as the average of $n$ normal pdfs, each one centered at an observation $x_i$ for $i = 1, 2, \ldots, n$. Explain why $\hat{f}$ is itself a probability density function. In particular, why is $\hat{f}(x) \geq 0$ for all $x$ and why does $\int_{-\infty}^{\infty} \hat{f}(x) dx = 1$?

Nonnegativity: Each normal pdf is a function taking only nonnegative values, so their average is also a function taking only nonnegative values.

Integrates to 1: Each normal pdf integrates to 1 when integrated from $-\infty$ to $\infty$. The integral (from $-\infty$ to $\infty$) of the average of the normal pdfs is equal to the average of the integrals (over the same range). The average of a bunch of 1s is 1.

To put some more math on this last point, define $\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n g_i(x)$ where $g_i$ is the normal pdf with mean $x_i$ and standard deviation $h$. Then
$$ \int_{-\infty}^{\infty} \hat{f}(x) dx = \int_{-\infty}^{\infty} \frac{1}{n} \sum_{i=1}^n g_i(x) dx = \frac{1}{n} \sum_{i=1}^n \int_{-\infty}^{\infty} g_i(x) dx = \frac{1}{n} \sum_{i=1}^n 1 = 1. $$
Collectively, we have shown that $\hat{f}$ is a probability density function.

(b) (3 points) When using the normal kernel, it can be shown that $\hat{f}(x) > 0$ for all $x$. In other words, the support of a random variable having pdf $\hat{f}$ is unbounded. In the case of the quiz time example, this means that the KDE places positive probability on values less than 0 minutes and greater than 15 minutes, both of which are impossible. Describe how you could modify $\hat{f}$ so that it obeys the real-world bounds (greater than 0 and less than 15) while still being a probability density function.

The best approach is to set $\hat{f}(x)$ to 0 for values of $x$ that are $< 0$ or $> 15$. Because we have cut off a strictly positive amount of probability density in the two tails, we need to renormalize the truncated form of $\hat{f}$ so that it integrates to 1. This is accomplished by rescaling it, namely, dividing by the integral of $\hat{f}(x)$ from 0 to 15. Mathematically, define
$$ \tilde{f}(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{\hat{f}(x)}{\int_{0}^{15} \hat{f}(u) du} & \text{if } 0 \leq x \leq 15 \\ 0 & \text{if } x > 15. \end{cases} $$
The function $\tilde{f}$ is a probability density with support on the interval $[0, 15]$.
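As a quick numerical check of this truncation-and-renormalization idea, the sketch below builds a small KDE by hand, computes the normalizing constant $\int_0^{15} \hat{f}(u) du$ with scipy.integrate.quad, and confirms that the resulting $\tilde{f}$ integrates to 1. The helper names fhat and ftilde, the toy observations, and the bandwidth are illustrative choices, not part of the assignment.

In [ ]:
import scipy.integrate

# A small KDE built by hand: the average of normal pdfs centered at the data.
x_obs = [2.0, 5.5, 7.0, 12.0]  # toy observations; `times` could be used instead
h = 1.0                        # illustrative bandwidth

def fhat(x):
    return np.mean([scipy.stats.norm.pdf(x, loc=xi, scale=h) for xi in x_obs])

# Normalizing constant: the probability mass fhat places on [0, 15].
mass, _ = scipy.integrate.quad(fhat, 0, 15)

def ftilde(x):
    # Truncated-and-renormalized KDE from part (b).
    return fhat(x) / mass if 0 <= x <= 15 else 0.0

total, _ = scipy.integrate.quad(ftilde, 0, 15)
print(f"Mass of fhat on [0, 15]: {mass:.4f}; integral of ftilde: {total:.4f}")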
(c) (3 points) Write a function that evaluates the KDE $\hat{f}$ at a given value of $x$. Your function should take as inputs a vector of observations x_data and a bandwidth parameter h and output a value fhatx. As stated in part (a), $\hat{f}$ is the average of $n$ normal pdfs, each of which has mean $x_i$ for $i = 1, 2, \ldots, n$ and standard deviation $h$. A skeleton function is provided for you to complete. Read the documentation for scipy.stats.norm.pdf to learn what the arguments x, loc, and scale refer to. A print statement is provided to check that the output of your function matches a known case. You ONLY need to fill in the question marks. No other coding is needed. (Do NOT use the scipy.stats.gaussian_kde() function as we did in Homework 4. It turns out that that function does not use the bandwidth argument in the same way we saw in class, which is why its built-in Silverman's option differed from the equation we saw in class.)

In [ ]:
def evaluate_fhat(x, x_data, h):
    fhatx = np.mean(scipy.stats.norm.pdf(x=x_data, loc=x, scale=h))
    return fhatx

# Check your work.
print(f"fhat(x=3) = {round(evaluate_fhat(x=3, x_data=times, h=1.0), 3)} and should be 0.147.")

fhat(x=3) = 0.147 and should be 0.147.

(d) (3 points) Fix the bandwidth parameter to be $h = 1$. Evaluate the KDE $\hat{f}$ at a grid of points x = np.linspace(0, 15, 151). Plot the resulting kernel density estimate $\hat{f}$ using plt.plot(). Label the axes of your plot.

In [ ]:
x = np.linspace(0, 15, 151)
fhatx = [evaluate_fhat(x=xi, x_data=times, h=1.0) for xi in x]
plt.plot(x, fhatx)
plt.xlabel(r"$x$")
plt.ylabel(r"$\hat{f}(x)$")
plt.title(r"Kernel Density Estimate with $h = 1.0$")
plt.show()
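Two quick sanity checks on the cells above (sketches, reusing times, x, and fhatx from the preceding cells): first, the normal pdf depends on its argument and its mean only through their squared difference, which is why evaluate_fhat can pass the whole data vector as x and the single evaluation point as loc in one call; second, the grid values of $\hat{f}$ should integrate to just under 1, since the normal kernels place a little mass outside $[0, 15]$.

In [ ]:
# Symmetry check: swapping the roles of x and loc leaves the pdf unchanged.
swapped = np.mean(scipy.stats.norm.pdf(x=3, loc=np.array(times), scale=1.0))
print(f"Swapped arguments: {round(swapped, 3)} (should also be 0.147).")

# Trapezoidal rule on the grid: close to 1, minus the mass outside [0, 15].
print(f"Integral of fhat over [0, 15] ~= {np.trapz(fhatx, x):.4f}")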
(e) (5 points) In this part, you will generate random variates drawn from the KDE $\hat{f}$. Because $\hat{f}$ is non-parametric, the other methods we will soon learn in class for random-variate generation cannot be applied straightforwardly. There is instead a simple algorithm for drawing a random variate $X$ having distribution $\hat{f}$:

1. Pick one of the observations $x_1, x_2, \ldots, x_n$ uniformly at random. Let $j$ be the index of the selected observation.
2. Generate a normal random variate having mean $x_j$ and standard deviation $h$, where $h > 0$ is the bandwidth parameter used for KDE.
3. Set $X$ to be the value of the normal random variate.

Write a function to implement this sampling algorithm. Your function should return a single random variate X. Read the documentation for the random package, specifically the sections on the random.randint() and random.normalvariate() functions.

In [ ]:
def sample_from_kde(x_data, h):
    n = len(x_data)
    j = random.randint(1, n)  # Generate a random integer between 1 and n inclusive.
    X = random.normalvariate(mu=x_data[j - 1], sigma=h)  # -1 because Python indexes from 0.
    return X

print(f"X = {sample_from_kde(x_data=times, h=1)}.")

X = 8.286396370621407.
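For large sample sizes, the same two-stage algorithm can be vectorized with numpy: draw all the observation indices at once, then add normal noise with standard deviation $h$. This is a sketch of an equivalent alternative, not part of the required solution; it assumes a modern numpy with np.random.default_rng available.

In [ ]:
rng = np.random.default_rng()

def sample_from_kde_vectorized(x_data, h, size):
    # Stage 1: pick `size` observations uniformly at random (with replacement).
    centers = rng.choice(x_data, size=size, replace=True)
    # Stage 2: jitter each chosen center with N(0, h^2) noise.
    return centers + rng.normal(loc=0.0, scale=h, size=size)

X_vec = sample_from_kde_vectorized(times, h=1.0, size=10000)
print(X_vec[:5])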
(f) (4 points) Use your algorithm to generate 10,000 random variates $X_1, X_2, \ldots, X_{10000}$. Plot a histogram of $X_1, X_2, \ldots, X_{10000}$ using 100 bins and the argument density=True. Superimpose your plot of the KDE $\hat{f}$ from part (d). Comment on the shapes of the histogram and the KDE.

In [ ]:
plt.plot(x, fhatx)
X = [sample_from_kde(x_data=times, h=1.0) for _ in range(10000)]
plt.hist(X, bins=100, density=True)
plt.xlabel(r"$x$")
plt.ylabel(r"$\hat{f}(x)$ and density")
plt.title("KDE and Histogram")
plt.show()

The histogram closely matches the fitted KDE.

Problem 2. (30 points)

The dataset service_times.csv contains 25 service times (measured in minutes).

In [ ]:
# Import the dataset.
mydata2 = pd.read_csv('service_times.csv', header=None)
# Convert the data to a list.
service_times = mydata2[mydata2.columns[0]].values.tolist()
print(service_times)

[1.88, 0.54, 1.9, 0.15, 0.02, 2.81, 1.5, 0.53, 2.62, 2.67, 3.53, 0.53, 1.8, 0.79, 0.21, 0.8, 0.26, 0.63, 0.36, 2.03, 1.42, 1.28, 0.82, 2.16, 0.05]

(a) (3 points) Plot a histogram of the service times with an appropriate number of bins. Label your axes. Comment on the shape of the data.

In [ ]:
plt.hist(service_times, bins=5)
plt.title("Histogram of Service Times")
plt.xlabel("Service Times (min)")
plt.ylabel("Frequency")
plt.show()
plt.ylabel("Frequency") plt.show() Most services are completed within fewer than 3 minutes. The data is positively skewed. (b) (5 points) Plot the empirical cumulative distribution function (ecdf) of the service time data. Recall that the ecdf is a piecewise constant function with jumps of size $1/n$ where $n$ is the number of observations. You will want to use a function like np.sort() (see documentation ) to sort the data first. Label your axes. In [ ]: times = np.sort(service_times) n = len(times) ecdf = [(i + 1) / n for i in range(n)] plt.step(times, ecdf, where="post") plt.ylabel(r"$\hat{F}(x)$") plt.xlabel(r"Service Time ($x$)") plt.title("Empirical CDF") plt.show()
(c) (5 points) Generate 1000 random variates from the ecdf by sampling with replacement 1000 times from the service time data. You will want to use a function like random.randint() (to sample a random (integer) index) or random.choices() (to directly resample). Documentation for these functions can be found here. Plot the empirical cdf of the 1000 random variates; this will be another piecewise constant function, but with jumps of size 1/1000. Superimpose the empirical cdf (of the original service time data) you plotted in part (b). If you did things correctly, the two curves should look very similar.

In [ ]:
new_times = random.choices(times, k=1000)
new_times = np.sort(new_times)
n_new = 1000
new_ecdf = [(i + 1) / n_new for i in range(n_new)]
plt.step(new_times, new_ecdf, where="post", label="1000 RVs")
plt.step(times, ecdf, color="red", where="post", label="Original")
plt.ylabel(r"$\hat{F}(x)$")
plt.xlabel(r"Service Time ($x$)")
plt.title("Empirical CDFs of original and generated data")
plt.legend()
plt.show()
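Beyond eyeballing the two curves, the agreement can be quantified: the largest vertical gap between the two ecdfs is the two-sample Kolmogorov-Smirnov statistic. A quick check using scipy.stats.ks_2samp (a sketch; a small statistic and a large p-value are expected here, since the resample is drawn from the original data):

In [ ]:
# The KS statistic is the maximum vertical distance between the two ecdfs.
ks_stat, p_value = scipy.stats.ks_2samp(new_times, times)
print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.4f}")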
(d) (5 points) Plot the interpolated empirical cumulative distribution function (iecdf) of the original service time data. Recall that the iecdf is a piecewise linear and continuous function with breakpoints at heights of $0, \frac{1}{n-1}, \frac{2}{n-1}, \ldots, \frac{n-2}{n-1}, 1$. Label your axes.

In [ ]:
iecdf = [i / (n - 1) for i in range(n)]
plt.plot(times, iecdf)
plt.ylabel(r"$\tilde{F}(x)$")
plt.xlabel(r"Service Time ($x$)")
plt.title("Interpolated Empirical CDF")
plt.show()
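Because the iecdf is piecewise linear between the sorted observations, it can be evaluated at any point by linear interpolation, which is exactly what np.interp does. A sketch (the evaluation points are arbitrary):

In [ ]:
# Evaluate the piecewise-linear iecdf at arbitrary points by interpolating
# between the sorted observations (x-values) and the breakpoint heights.
for x0 in [0.5, 1.0, 2.0]:
    print(f"F_tilde({x0}) = {np.interp(x0, times, iecdf):.4f}")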
(e) (7 points) Generate 1000 random variates from the iecdf using the two-stage procedure we discussed in class. To generate a random variate $X$:

1. Choose one of the intervals between successive (sorted) service times uniformly at random.
2. Generate a continuous uniform random variable with lower and upper bounds corresponding to the left and right endpoints of the selected interval. Set $X$ to be the value of the generated uniform random variable.

You will want to use a function like random.randint() (to sample a random (integer) index) and random.uniform() (to sample uniformly from within an interval). Documentation for these functions can be found here. Plot the empirical cdf of the 1000 random variates; this will be a piecewise constant function with jumps of size 1/1000. Superimpose the interpolated empirical cdf (of the original service time data) you plotted in part (d). If you did things correctly, the two curves should look very similar.

In [ ]:
new_iecdf_times = []
n_new = 1000
for i in range(n_new):
    interval_idx = random.randint(1, n - 1)
    X = random.uniform(times[interval_idx - 1], times[interval_idx])
    new_iecdf_times.append(X)
new_iecdf_times = np.sort(new_iecdf_times)
new_ecdf = [(i + 1) / n_new for i in range(n_new)]
plt.step(new_iecdf_times, new_ecdf, where="post", label="1000 RVs")
plt.plot(times, iecdf, color="red", label="Original")
plt.ylabel(r"$\hat{F}(x)$ and $\tilde{F}(x)$")
plt.xlabel(r"Service Times ($x$)")
plt.title("ECDF of generated data and IECDF of original data")
plt.legend()
plt.show()
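An equivalent way to sample from the iecdf is inverse transform: draw $U \sim \text{Uniform}(0, 1)$ and return $\tilde{F}^{-1}(U)$, which for a piecewise-linear cdf is again linear interpolation, now from heights back to times. A vectorized sketch of this alternative, not the required two-stage code:

In [ ]:
# Inverse-transform sampling from the iecdf: interpolate from uniform
# heights back to service times. Because the heights are equally spaced,
# this is equivalent in distribution to the two-stage procedure above.
U = np.random.uniform(size=1000)
X_inv = np.interp(U, iecdf, times)
print(X_inv[:5])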
plt.title("ECDF and ECDF from sampling from IECDF") plt.legend(["1000 RVs", "Original"]) plt.show() (f) (2 points) Given this service time data, why might we prefer to sample from the interpolated empirical cdf instead of the empirical cdf? In [ ]: print(np.sort(times)) [0.02 0.05 0.15 0.21 0.26 0.36 0.53 0.53 0.54 0.63 0.79 0.8 0.82 1.28 1.42 1.5 1.8 1.88 1.9 2.03 2.16 2.62 2.67 2.81 3.53] Sampling from the interpolated empirical cdf allows us to generate intermediate values. For instance, we would generate service times between 0.82 and 1.28 and between 2.16 and 2.62. (g) (3 points) Give an example of a situation where you have data and would prefer to sample from the empirical cdf (i.e., resample your data with replacement) as opposed to fitting a parametric distribution? Explain your reasoning. We might prefer to sample from the empirical cdf if (1) the stochastic input is modeled as a discrete random variable, (2) we have obtained a large amount of data such that we have observed most of the plausible values the random variable can take, and (3) this data does not follow the shape of a known parametric distribution. In this case, the data can be believed to resemble the unknown underlying pmf better than any common parametric distribution.