MGSC 416 Winter 2024
Data-driven Models for Operations Analytics
Problem Set 1 - Individual Assignment
Handout date: January 12, 2024
Due date: January 25, 2024
Please submit your Python code with comments, and paste the necessary supporting graphs/tables from Python into your document. You may also submit a Word/PDF document in which you summarize your findings and answer each question, with the appropriate code and graphs (for example, your Jupyter notebook exported as a PDF).
Problem 1.
(13 pts)
We replicate and analyze a regression equation developed by Orley Ashenfelter to predict the quality
of Bordeaux wines.
The data and original model can be found at http://www.liquidasset.com/orley.htm.
“WineData-HW1.csv” is a modified, tidied-up version; it is available on myCourses. The variables are:
LPRICE2 - logged price index with base year 1961
WRAIN - rainfall in mm over preceding winter
DEGREES - average temperature (Apr-Sep)
HRAIN - rainfall in mm over harvest period
TIME_SV - age of wine.
1. In the original model, Ashenfelter modelled the logged price index as a linear regression with the other four variables as independent predictor variables. Reconstruct this model and report its summary statistics (a minimal fitting sketch is given after this question list). What do you think of this model? (3 pts)
2. We want to see if Ashenfelter’s model can be simplified, and we want to assess its in-sample and out-of-sample performance. Propose a simplified model for predicting wine quality (explain your reasoning for how you chose it). Construct this model and contrast it with Ashenfelter’s model. Which model do you think is better, and why? In making your assessment, make sure to consider an array of metrics and criteria. (7 pts)
3. By examining the residuals of Ashenfelter’s model, are you able to validate the linear model? (3 pts)
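For illustration, a minimal sketch of fitting such a model with pandas and statsmodels is shown below. The column names are assumed to match the variable list above; in particular, the wine-age variable is assumed to appear as TIME_SV in the CSV.

# Minimal sketch (assumed column names) for reconstructing Ashenfelter's regression.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Load the tidied dataset posted on myCourses.
wine = pd.read_csv("WineData-HW1.csv")

# Question 1: regress the logged price index on the four candidate predictors.
ashenfelter = smf.ols("LPRICE2 ~ WRAIN + DEGREES + HRAIN + TIME_SV", data=wine).fit()
print(ashenfelter.summary())

# Question 3: residual diagnostics, e.g. residuals against fitted values.
plt.scatter(ashenfelter.fittedvalues, ashenfelter.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()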
Problem 2.
(10 pts) Let’s consider the linear model titled "linreg", which was estimated on the Boston housing dataset presented in the code for Lecture 2 (Lecture 2 - Regression.ipynb). This model uses the 11 features shown in the table below:
Figure 1: Dataset Variable Descriptions
Figure 2 shows the summary statistics of this model on the training set generated in the code:
In the following, we want to contrast two regression models with "linreg". The first regression model, "Modelreg1", selects its coefficients by minimizing the following metric:
\[
\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + 0.15 \cdot \sum_{j=0}^{11} \beta_j^2 \qquad (1)
\]

The second model, "Modelreg2", selects its coefficients by minimizing the following metric:

\[
\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + 0.51 \cdot \sum_{j=0}^{11} |\beta_j| \qquad (2)
\]
where $n$ is the number of houses in the training set, $(y_i)_{i=1,\dots,n}$ is the outcome vector (MEDV), $(\hat{y}_i)_{i=1,\dots,n}$ are the predicted values, and $(\beta_j)_{j=1,\dots,11}$ are the coefficients of the regression associated with the features above ($\beta_0$ is the intercept).
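Objective (1) adds a squared (L2) penalty to the mean squared error and objective (2) an absolute-value (L1) penalty, i.e. ridge and lasso regression. As an illustration only, a minimal sketch of fitting analogous models with scikit-learn follows; the training arrays are random placeholders, and the alpha values are indicative only, since scikit-learn scales the error term differently than equations (1) and (2) and does not penalize the intercept.

# Illustrative sketch only: ridge (L2-penalized) and lasso (L1-penalized) regressions.
# X_train and y_train are random placeholders standing in for the Boston housing
# training data built in "Lecture 2 - Regression.ipynb".
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 11))   # placeholder for the 11 features
y_train = rng.normal(size=100)         # placeholder for MEDV

ridge_model = Ridge(alpha=0.15).fit(X_train, y_train)   # analogue of objective (1)
lasso_model = Lasso(alpha=0.51).fit(X_train, y_train)   # analogue of objective (2)

# Compare the estimated coefficients with those of the unpenalized "linreg" model.
print("Ridge coefficients:", ridge_model.coef_)
print("Lasso coefficients:", lasso_model.coef_)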
Output from both models (fitted values, predicted values) is provided in the files “Modelreg1.pkl” and “Modelreg2.pkl”, which you can access using the code in “HW1-template.ipynb”; both are also available on myCourses.
Figure 2: Model "linreg" output summary

1. Calculate the in-sample $R^2$ for both models and contrast with "linreg". Are you surprised by the result? Why? (3 pts)
2. Table 3 shows the coefficients obtained from both models in contrast to those of "linreg". What do you observe? Do you have an explanation for it? (3 pts)
3. Evaluate the out-of-sample performance of the three models together (a metric-computation sketch follows this list). What is your conclusion? Which model would you choose, and for what reason? (4 pts)
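For reference, a minimal sketch of the kind of metric computation these questions require, assuming arrays y_true of observed values and y_pred of model predictions (for example, read from the provided .pkl files), could look like this:

# Generic performance metrics; the arrays below are illustrative placeholders.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([24.0, 21.6, 34.7, 33.4])   # placeholder observed MEDV values
y_pred = np.array([25.1, 20.9, 33.2, 35.0])   # placeholder model predictions

r2 = r2_score(y_true, y_pred)             # R^2 (in-sample or out-of-sample)
mse = mean_squared_error(y_true, y_pred)  # mean squared error
print(f"R^2 = {r2:.3f}, MSE = {mse:.3f}")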
Problem 3.
(22 pts)
We want to construct a model that both depends on causal predictor variables and captures time-series dynamics. For this question, we use the dataset "ApolloData-HW1.csv", which is available on myCourses, posted with HW1.
1. Dr. Rao has suggested that food demand should be related to occupancy, so we investigate this belief. Using simple linear regression, decide for which food item we would most want to build a model that depends on breakfast occupancy. (4 pts)
2. For the demand of the food item you have chosen, fit the following models and compare their MAD, MSE, and MAPE (see the sketch after this question list): (6 pts)
(a) Simple moving average with N = 10.
(b) Exponential smoothing with α = 0.3.
(c) AR(1).
3. From the first part, you should have a regression model for which breakfast occupancy is the only
independent variable, and the selected food item is the dependent variable. Show that the regression
errors of this model exhibit autocorrelation. (3 pts)
4. Because breakfast occupancy is a significant predictor of the food item’s demand, we want to incorporate it into our model. To do this, we first take the regression errors from the simple regression model and treat these residuals as a new time series. Is this a stationary time series? Using what we learned in the Time Series lectures, which time series model fits best? Explain why. (6 pts)
5. By adding the predicted values from the simple regression model to the predicted values from the time series model you chose, we can now forecast demand as a linear regression on occupancy with autocorrelated errors. What are the MAD, MSE and MAPE of this new model on the dataset? (3 pts)
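The following is a rough sketch, assuming hypothetical column names Occupancy and FoodItem in ApolloData-HW1.csv, of the kind of tooling these questions call for: the three benchmark forecasts, the error metrics, a Durbin-Watson/ADF check on the regression residuals, and an AR(1) model fitted on those residuals.

# Rough sketch; the column names "Occupancy" and "FoodItem" are assumed placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.stattools import adfuller

def mad(e):
    return np.mean(np.abs(e))                      # mean absolute deviation

def mse(e):
    return np.mean(e ** 2)                         # mean squared error

def mape(e, y):
    return np.mean(np.abs(e) / np.abs(y)) * 100    # mean absolute percentage error

apollo = pd.read_csv("ApolloData-HW1.csv")
y = apollo["FoodItem"]        # demand of the chosen food item (assumed column name)
occ = apollo["Occupancy"]     # breakfast occupancy (assumed column name)

# (a) Simple moving average with N = 10: forecast = mean of the previous 10 observations.
sma = y.rolling(window=10).mean().shift(1)

# (b) Exponential smoothing with a fixed alpha = 0.3 (not optimized).
ses = SimpleExpSmoothing(y).fit(smoothing_level=0.3, optimized=False).fittedvalues

# (c) AR(1) fitted directly on the demand series.
ar1 = AutoReg(y, lags=1).fit().fittedvalues
# The same mad/mse/mape functions can be applied to y - sma, y - ses, and y - ar1.

# Simple regression of demand on occupancy, plus residual diagnostics (questions 3 and 4).
ols = sm.OLS(y, sm.add_constant(occ)).fit()
resid = ols.resid
print("Durbin-Watson:", durbin_watson(resid))   # values far from 2 indicate autocorrelation
print("ADF p-value:", adfuller(resid)[1])       # small p-value suggests stationarity

# AR(1) on the residuals, then add it back to the regression fit (question 5).
resid_ar1 = AutoReg(resid, lags=1).fit().fittedvalues
combined = ols.fittedvalues + resid_ar1

# Error metrics for the combined forecast (NaNs from the first lag are dropped).
e = (y - combined).dropna()
print("MAD:", mad(e), "MSE:", mse(e), "MAPE:", mape(e, y.loc[e.index]))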