HW8-solutions

html

School

University of Texas *

*We aren’t endorsed by this school

Course

230

Subject

Statistics

Date

Apr 3, 2024

Type

html

Pages

Uploaded by sjobs3121

Homework 8 for ISEN 355 (System Simulation) ¶ (C) 2023 David Eckman Due Date: Upload to Canvas by 9:00pm (CDT) on Friday, March 31. In [ ]: # Import some useful Python packages. import numpy as np import matplotlib.pyplot as plt import scipy.stats import pandas as pd Problem 1. (50 points) ¶ In this problem, we will study the arrival process of vehicles to a large parking garage. (a) (2 points) Recall the assumptions of a Poisson process. Do you think it is reasonable to assume that arrivals occur one at a time in the parking garage setting? Yes, it is reasonable to assume that arrivals occur one at a time. Even if two cars are in a line there will be a small lag between their arrivals. A possible exception is if the garage has multiple entrance gates, though there too there will be a small lag between arrivals of different cars. (b) (3 points) Recall the assumptions of a Poisson process. Do you think it is reasonable to assume that increments are independent in the parking garage setting? Explain/interpret what this assumption means in this context in words. Explain your reasoning. This assumption means that during non-overlapping intervals, the numbers of vehicles arriving to the parking garage are independent. For example, if we were told how many vehicles arrived between 8-9am, it would not cause us to update our belief about how many cars will arrive between 9-10am. In other words, the arrivals of cars during different time intervals are not related to each other. This assumption might be reasonable if the parking garage were very large and there were an infinite number of drivers who could possibly wish to park in the garage. This assumption is less reasonable if the parking garage has limited capacity, because if a large number of vehicles arrive in the morning, fewer vehicles will be able to enter and find a parking spot in the afternoon. This assumption is also less reasonable if there is a limited number of vehicles wishing to park, because again, if a large number of vehicles arrive in the morning, fewer will attempt to arrive in the afternoon. This would be the situation if the parking garage had reserved parking spots. (c) (2 points) We will assume that the arrival counts from different days are independent. Provide an example of a situation (in the parking garage context) for which this assumption would not hold. The arrival counts on different days would be dependent if vehicles can stay in the garage overnight. In such a case, if a large number of vehicles arrived on a given day, we might expect fewer arrivals the next day because some spots are already occupied by vehicles that did not leave. (d) (2 points) We will assume that the arrival counts from different days are identically distributed. Provide an example of a situation (in the parking garage context) for which this assumption would not hold. The assumption of identically distributed arrival counts would not hold if the schedule of events for which people choose to park in the parking garage varied by day. At the Polo garage, for example, the distribution of arrival counts on Mondays and Wednesdays is likely different from that of Tuesdays and Thursdays and from that of Fridays because of the class schedules. The file arrival_counts.csv contains counts of the number of vehicles that arrived in 1-hour periods from 8am-4pm over the course of 30 days. The code below loads the data and stores it in a numpy ndarray , which you can think of as a matrix. Each row in

the matrix corresponds to a day and each column corresponds to a 1-hour period, starting with 8am-9am in Column 0 (remember that Python indexes from 0) and 3-4pm in Column 7. In [ ]: # Import the dataset. mydata = pd.read_csv('arrival_counts.csv') arrivalcounts = np.array(mydata) print(arrivalcounts) [[253 380 307 179 118 174 162 77] [248 350 329 155 121 169 134 62] [285 384 310 161 103 169 135 64] [291 406 301 167 124 171 164 67] [246 396 302 169 128 155 135 77] [264 383 314 185 115 153 132 43] [256 352 337 166 133 164 137 64] [255 381 304 192 120 178 152 53] [267 396 318 176 117 166 137 48] [253 390 316 181 119 174 176 70] [266 356 322 195 118 165 142 60] [263 346 305 174 118 185 123 61] [284 360 322 166 114 159 141 74] [255 385 293 167 121 169 125 61] [259 400 313 198 108 183 133 56] [249 397 331 177 123 175 124 59] [269 367 322 175 112 145 123 54] [293 384 320 165 115 188 146 59] [256 393 328 186 122 134 125 60] [261 343 331 202 121 156 139 52] [269 376 314 194 135 154 158 71] [270 388 295 155 141 144 155 58] [239 367 311 177 116 177 148 61] [264 384 314 177 130 154 139 57] [250 382 332 175 132 192 150 60] [278 367 327 192 124 165 128 77] [262 366 339 197 124 173 151 64] [261 362 303 186 128 180 150 74] [262 345 348 169 126 173 136 66] [238 345 316 179 135 186 154 55]] In [ ]: # If you need, say, Column 6, use print(f"Column 6 is {arrivalcounts[:, 6]}.") # If you need, say, Row 3, use print(f"Row 3 is {arrivalcounts[3, :]}.") Column 6 is [162 134 135 164 135 132 137 152 137 176 142 123 141 125 133 124 123 146 125 139 158 155 148 139 150 128 151 150 136 154]. Row 3 is [291 406 301 167 124 171 164 67]. (e) (4 points) Test the assumption of independent increments mentioned in part (b) by calculating the sample correlation matrix for the arrival counts in the 8 one-hour periods. This matrix will be of size 8 x 8. Use the function np.corrcoef() to compute the sample correlation matrix; read the function's documentation and pay special

attention to the rowvar variable. Call the function np.round() with your correlation matrix as an argument to display the sample correlation matrix with each element rounded to the nearest 3 digits. In [ ]: corr_matrix = np.corrcoef(arrivalcounts, rowvar=False) np.round(corr_matrix, 3) Out[ ]: array([[ 1. , 0.172, -0.069, -0.155, -0.283, -0.154, 0.004, 0.1 ], [ 0.172, 1. , -0.423, -0.075, -0.132, -0.125, 0.129, -0.038], [-0.069, -0.423, 1. , 0.182, 0.03 , 0.008, -0.215, -0.064], [-0.155, -0.075, 0.182, 1. , -0.078, 0.011, 0.067, -0.115], [-0.283, -0.132, 0.03 , -0.078, 1. , -0.119, 0.307, 0.168], [-0.154, -0.125, 0.008, 0.011, -0.119, 1. , 0.259, 0.07 ], [ 0.004, 0.129, -0.215, 0.067, 0.307, 0.259, 1. , 0.282], [ 0.1 , -0.038, -0.064, -0.115, 0.168, 0.07 , 0.282, 1. ]]) (f) (2 points) We have not discussed correlation matrices in lecture. A correlation matrix contains the correlation coefficient ($\rho$) between each pair of random variables. In this problem, we have 8 random variables, each one corresponding to the number of arrivals in a given one-hour period. A correlation coefficient takes values between -1 and 1 with larger absolute values (i.e., values closer to -1 or 1) corresponding to strong correlation. Values closer to zero indicate weak correlation. Comment on the values in the sample correlation matrix and what they mean in terms of the assumption of independent increments mentioned in part (b). How comfortable are you with making this assumption for this data set? The diagonal of the matrix is all 1s, which is expected because any random variable is perfectly correlated with itself. All other values are close to zero (few have absolute value greater than 0.3), which means there is not strong dependence between pairs. The data thus supports the assumption from part (b) that increments are independent. (g) (4 points) Plot 30 curves of arrival counts over time, with one curve for each day. Use plt.plot() to superimpose the 30 curves. (You will need to use np.transpose() to transpose your data, otherwise plt.plot() will plot 8 curves, one for each column.) Use plt.xticks() to change the tick labels along the x-axis to show the times of day; documentation found here . Comment on the shape of the arrival counts over the course of a typical 8-hour day. In [ ]: plt.plot(np.transpose(arrivalcounts)) plt.xticks(ticks=[0, 1, 2, 3, 4, 5, 6, 7], labels=["8a-9a", "9a-10a", "10a-11a", "11a- 12p","12p-1p","1p-2p","2p-3p","3p-4p"]) plt.xlabel("Time of Day") plt.ylabel("Arrival Count") plt.title("Arrival Curves for 30 days") plt.show()

Your preview ends here

Eager to read complete document? Join bartleby learn and gain access to the full version

Access to all documents
Unlimited textbook solutions
24/7 expert homework help

The curves of arrival counts from all 30 days show similar patterns. There are two peaks of arrivals: one at 9-10am and another at 1-2pm. There is also an off-peak period around lunch time. (h) (3 points) For a Poisson process, we mentioned that the mean number of arrivals in each period should be equal to the variance of the number of arrivals in that period. If the data were truly coming from a Poisson process, what would you expect to see in your plot from part (g) in terms of this mean=variance condition? Based on only your plot from part (g), do you believe it is plausible that the arrival process of cars satisfies this mean=variance condition? If the mean=variance condition held for each period, we would expect to see high variability for the 9-10am interval and low variability for the 1-2pm interval. We can see variability in the plot from part (g) by looking at the vertical spread of the 30 curves at any given time. From the plot in part (g), it does not appear that the mean=variance assumption holds because the variability at 1-2pm appears to be higher than that at 9-10am. (i) (5 points) For each 1-hour period, use np.mean() and np.var() to compute the sample mean and sample variance of the 30 observed arrival counts. (For np.var() , pay special attention to the ddof argument from the documentation .) Use plt.scatter() to produce a scatter plot of (sample mean, sample variance) for a total of 8 points. Superimpose a 45-degree line passing through the origin. Based on only your scatter plot, do you believe it is plausible that the arrival process of cars satisfies the mean=variance condition? In [ ]: arr_count_means = np.mean(arrivalcounts, axis=0) arr_count_vars = np.var(arrivalcounts, axis=0, ddof=1) plt.plot([0, 400], [0, 400], 'r')

plt.scatter(arr_count_means, arr_count_vars) plt.xlabel("Sample Means") plt.ylabel("Sample Variances") plt.title("Mean vs Variances") plt.show() For many of the 1-hour periods, the mean and variance of the arrival counts differ. The data indicate that a Poisson process may not be an appropriate model for the arrival process. Nevertheless, we will continue to use it in the rest of this assignment. (j) (5 points) Produce a histogram of the 30 counts from the 8am-9am period. Fit a Poisson distribution to this data using MLE. (There is no need to use the .fit() function like you did in Homework 6. Simply use the fact that the MLE for $\lambda$ for a Poisson random variable is the sample mean of the data.) Superimpose the pmf of the fitted Poisson distribution onto the plot of the histogram and comment on the visual fit. In [ ]: sample_89 = arrivalcounts[:,0] plt.hist(sample_89, bins=6, density=True, color='r') fitted_poisson = scipy.stats.poisson(mu=arr_count_means[0]) x = np.linspace(225, 300, 76) plt.stem(x, fitted_poisson.pmf(k=x)) plt.title("PMF vs Histogram of 8am-9am Arrival Counts") plt.xlabel("Arrival Count") plt.ylabel("Frequency") plt.show()

The pmf of the fitted Poisson distribution is centered slightly to the right of the peak in the histogram, because of the unexpectedly high count in the rightmost bin. With only 30 observations, it is hard to be confident about how good of a fit the Poisson distribution is. (k) (5 points) Conduct 8 K-S tests (one for each one-hour period) to determine whether the number of arrivals during each period could plausibly come from a Poisson distribution. Use a for loop over the 8 one-hour periods. On each iteration of the loop, calculate the MLE of $\lambda$ (from the data for that period), conduct the appropriate K-S test, and print out the $p$-value. Interpret the conclusions of the goodness-of-fit tests at a significance level $\alpha = 0.05$. In [ ]: for idx in range(8): fitted_poisson = scipy.stats.poisson(mu=arr_count_means[idx]) _, ks_pvalue = scipy.stats.kstest(arrivalcounts[:, idx], cdf=fitted_poisson.cdf) print(f"For Period {idx + 1} the K-S test for the fitted Poisson distribution has a p-value of {round(ks_pvalue, 3)}.") For Period 1 the K-S test for the fitted Poisson distribution has a p-value of 0.599. For Period 2 the K-S test for the fitted Poisson distribution has a p-value of 0.38. For Period 3 the K-S test for the fitted Poisson distribution has a p-value of 0.746. For Period 4 the K-S test for the fitted Poisson distribution has a p-value of 0.92. For Period 5 the K-S test for the fitted Poisson distribution has a p-value of 0.463. For Period 6 the K-S test for the fitted Poisson distribution has a p-value of 0.663. For Period 7 the K-S test for the fitted Poisson distribution has a p-value of 0.864. For Period 8 the K-S test for the fitted Poisson distribution has a p-value of 0.861. All $p$-values are greater than $\alpha=0.05$, so we fail to reject the null hypothesis and conclude that the Poisson distribution is a decent fit for all time periods. The high p-values may be a reflection of the fact that we only have 30 observations, thus it is

Your preview ends here

Eager to read complete document? Join bartleby learn and gain access to the full version

Access to all documents
Unlimited textbook solutions
24/7 expert homework help

hard to rule out the Poisson distribution. (l) (5 points) Fit a non-stationary Poisson process to the data using the hourly counts. Your fitted rate function will take 8 different values, one for each one-hour period. Use plt.step() or plt.bar() to plot the fitted rate function as a piecewise constant function from 8AM to 4PM. Change the tick labels along the x-axis to show the times of day. In [ ]: plt.bar([0, 1, 2, 3, 4, 5, 6, 7], arr_count_means) plt.xticks(ticks=[0, 1, 2, 3, 4, 5, 6, 7], labels=["8a-9a", "9a-10a", "10a-11a", "11a- 12p","12p-1p","1p-2p","2p-3p","3p-4p"]) plt.xlabel(r"Time of Day ($t$)") plt.ylabel(r"Arrival Rate $\lambda(t)$") plt.title("Fitted Arrival Rate Function (1-Hour Periods)") plt.show() (m) (6 points) Fit a non-stationary Poisson process to the data using the counts over consecutive 2-hour intervals; i.e., the combined intervals should be 8am-10am, 10am- 12pm, 12pm-2pm, and 2pm-4pm. This time your fitted rate function will take only 4 different values. Plot the fitted rate function as a piecewise constant function from 8AM to 4PM. Change the tick labels along the x-axis to show the times of day. In [ ]: # Combined arrival rate counts in consecutive 1-hour periods. combined_means = [(arr_count_means[2 * i] + arr_count_means[2 * i + 1]) / 2 for i in range(4)] # Division by 2 is necessary to insure that arrival rates are still measured in arrivals per hour. plt.bar([0, 1, 2, 3], combined_means)

plt.xticks(ticks=[0, 1, 2, 3], labels=["8a-10a", "10a-12p", "12p-2p", "2p-4p"]) plt.xlabel(r"Time of Day ($t$)") plt.ylabel(r"Arrival Rate $\lambda(t)$") plt.title("Fitted Arrival Rate Function (2-Hour Periods)") plt.show() (n) (2 points) Of the two arrival rate functions you fitted in parts (l) and (m), which would you choose to use as input to a simulation model of the parking garage? Explain your reasoning. The answer depends on whether we believe the two peaks (between 9-11AM and between 1-3PM) shown in the hourly arrival rate function reflect the true non- stationarity of the arrival process. The bihourly arrival rate function instead shows a simpler trend of decreasing arrival rates over the course of the day. The bihourly arrival rate function also averages some pairs of rates that are very different (8-9AM and 9-10AM as well as 10-11AM and 11AM-12PM). The hourly arrival rate function is preferable.