ITSS 4381 - Group Project
School: University of Texas, Dallas
Course: 4381
Date: Jan 9, 2024
Pages: 9
ITSS 4381.001 23Fall
Instructor: Hong Zhang
1
Final Group Project: Data Collection, Preprocessing, Visualization, Machine Learning
Due Date: 12/10/2023, 11:59 pm
Please read the following guidelines before you proceed with this project. These
guidelines should be followed and will be used to grade your homework:
• All the code should be included in one single Jupyter Notebook file (.ipynb file).
• This is a group project. It counts for 15 points with 5 questions.
• Using GPT is allowed for this project, but please make sure that the AI-generated code is logically correct and runs as intended.
• For the report:
  o Clearly identify your group number and all group members on the cover page of the group report.
  o A professional-quality report is expected - messy or hard-to-read reports will be penalized!
  o Clearly identify which question each part of the code is for, and what it is supposed to do. Also, provide clear explanations of the results in the report.
  o Write comments for each question in the Jupyter Notebook.
  o Put clear screenshots of the code and the outputs in the report.
• Please upload three things to eLearning:
  o The group report with CLEAR result interpretations and COMPLETE screenshots of the Jupyter Notebook.
  o The .ipynb file of the program.
  o The scraped movies.csv file of Q1.
Q1 (2 points): Data Scraping

Box Office Mojo (http://boxofficemojo.com) is a website that tracks a variety of information on movies released each year (such as genre, studio, box office revenues, etc.). You will write a script that collects data on all movies for the year 2023 that are listed on the site. Information on movies that were released in 2023 is available at: http://www.boxofficemojo.com/yearly/chart/?yr=2023&p=.htm.
You will parse this webpage with the default movie sorting to collect the information for each movie. Collect the following information for each movie and save it in a csv file named movies.csv.
i. Movie Name
ii. Gross Revenues
iii. Total Gross Revenues
iv. Number of Theatres
v. Release Date
vi. Distributor
Note: You need to further work on the in-class example by considering the relevant data type for each attribute. For example, Movie Name should be defined as a string variable, and Gross Revenues should be defined as a number. If no information is provided for the Distributor of a movie, the value stored will be Null.
Finally, determine which Distributor had the highest number of movies in 2023, and provide a list of all the movies distributed by that distributor.
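The scraping workflow could be sketched as below. The HTML snippet is a hypothetical stand-in for the Box Office Mojo chart table (the real page would be fetched with requests.get(url).text, and its actual column layout may differ), but the parsing, type conversion, Null handling, and distributor count follow the steps above.

```python
import csv
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure of the yearly chart table;
# the real page must be fetched first, e.g. requests.get(url).text.
html = """
<table>
  <tr><th>Release</th><th>Gross</th><th>Total Gross</th>
      <th>Theaters</th><th>Release Date</th><th>Distributor</th></tr>
  <tr><td>Movie A</td><td>$500,000</td><td>$1,200,000</td>
      <td>2,100</td><td>Mar 3</td><td>Studio X</td></tr>
  <tr><td>Movie B</td><td>$300,000</td><td>$900,000</td>
      <td>1,800</td><td>Jun 9</td><td>Studio X</td></tr>
  <tr><td>Movie C</td><td>$250,000</td><td>$700,000</td>
      <td>1,500</td><td>Jul 21</td><td>-</td></tr>
</table>
"""

def to_number(text):
    """Convert a string like '$1,200,000' to 1200000; None if not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None

movies = []
soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all("tr")[1:]:          # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    name, gross, total, theaters, date, dist = cells
    movies.append({
        "Movie Name": name,                   # string
        "Gross Revenues": to_number(gross),   # number
        "Total Gross Revenues": to_number(total),
        "Number of Theatres": to_number(theaters),
        "Release Date": date,
        "Distributor": dist if dist != "-" else None,  # Null if missing
    })

with open("movies.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=movies[0].keys())
    writer.writeheader()
    writer.writerows(movies)

# Distributor with the highest number of movies (missing values ignored)
counts = Counter(m["Distributor"] for m in movies if m["Distributor"])
top_distributor, _ = counts.most_common(1)[0]
top_movies = [m["Movie Name"] for m in movies
              if m["Distributor"] == top_distributor]
print(top_distributor, top_movies)
```

Storing a missing Distributor as Python's None makes it land as an empty cell (i.e., Null) in the csv file.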
Q2 (2 points): Data Preprocessing and Visualization
(a) The carprices.txt file contains a single column of car prices. Read this data into your program. The first line is the heading (i.e., column name). For any line containing non-numeric data, raise a ValueError exception and print a message saying, "This data is invalid and will be ignored." You should divide each number by 10 and round it down to the nearest integer. Then plot a histogram that looks like the one shown below. The number of bins to use is 50.
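A minimal sketch of part (a), using a small inline sample in place of carprices.txt (in the actual solution the lines would come from open("carprices.txt")):

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical sample standing in for carprices.txt; first line is the heading.
lines = ["Price", "1523", "987", "abc", "2044", "1761"]

values = []
for line in lines[1:]:                    # skip the heading line
    try:
        price = float(line)               # non-numeric lines raise ValueError
        values.append(int(price // 10))   # divide by 10 and round down
    except ValueError:
        print("This data is invalid and will be ignored.")

plt.hist(values, bins=50)                 # histogram with 50 bins
plt.savefig("carprices_hist.png")
print(values)
```

Here float() raises the ValueError naturally for a line like "abc", so the except block prints the required message and the line is skipped.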
(b) The attached cartype.txt contains a single column listing the type of car (van, compact, etc.). Read this data into your program. The first line is the heading (i.e., column name). Then plot a pie chart after computing the frequency for each car type. Your chart should look like the one shown below (no need to match the exact color or position). Each slice should be separated from the next as shown below. (Note: You can use AI to help you solve this question since we did not cover pie charts in class.)
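Part (b) could be sketched as follows, again with a hypothetical inline sample in place of cartype.txt. The explode argument of plt.pie is what separates each slice from the next:

```python
from collections import Counter
import matplotlib
matplotlib.use("Agg")          # render off-screen; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical sample standing in for cartype.txt; first line is the heading.
lines = ["Type", "van", "compact", "van", "sedan", "compact", "van"]

freq = Counter(lines[1:])                 # frequency of each car type
labels = list(freq.keys())
sizes = list(freq.values())

# A small explode value for every slice pulls each slice away from the center,
# separating it from its neighbors.
explode = [0.05] * len(labels)
plt.pie(sizes, labels=labels, explode=explode, autopct="%1.1f%%")
plt.savefig("cartype_pie.png")
print(freq)
```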
ITSS 4381.001 23Fall
Instructor: Hong Zhang
4
Q3 (3 points): Linear Regression

In his undergraduate honors thesis, Versaci (2009) investigated what he termed "internet buzz variables" to see whether they provide any additional predictive information towards a movie's box office revenue (beyond movie characteristics like genre, actors, budget, etc.). boxOffice.csv contains this data involving 62 movies (all widely released movies between November 7, 2008 and April 3, 2009); the variables available (along with their descriptions) are in the table below. We will conduct the analysis ignoring the "buzz" variables (addict, cmngsoon, fandango, and cntwait3) first.
Note: You should show the screenshots of the entire regression tables in your group report.
1. Plot histograms of the continuous variables (box, budget, starpowr) to see if any transformations are needed. Are any of them skewed? Apply a log-transformation to all the skewed variables.
2. Run a linear regression of box office revenues on the "traditional" variables (i.e., using all the independent variables except the last four "buzz" variables). If any variables were transformed, be sure to use the transformed versions of those variables. What are the R2 and adjusted-R2 values? Which variables (if any) are significant at the 0.10 level, based on the t-statistics and associated probabilities (P>|t|)?
3. Plot histograms of the four "buzz" variables. Are any of them skewed? Apply a log-transformation to all the skewed variables.
4. Run a linear regression of box office revenues on all the independent variables including the "buzz" variables (transformed as needed). What are the R2 and adjusted-R2 values? Which variables (if any) are significant at the 0.10 level, based on the t-statistics and associated probabilities (P>|t|)?
5. Are the "buzz" variables helping build a better model? Do they increase the adjusted-R2 substantially? (Note: you could use adjusted R2 to compare linear regression models; a higher adjusted-R2 value indicates a better model that explains the underlying pattern better. For this question, you don't have to split the data into training and testing. Just use the entire dataset to fit the linear regression line.)
Q4 (3 points): Spam Classification using kNN

One of the earliest and most successful applications of machine learning has been in developing spam detection algorithms. For this problem, you are going to implement the kNN algorithm on a dataset containing thousands of spam and non-spam emails. The dataset you are going to use is spam_data.csv. The "Test_Indicator" column in the .csv file specifies which data points are part of the test set (1 indicates that the data point belongs to the test set).
Train a kNN classifier to classify each email as spam or non-spam using the training set, and then evaluate the classifier on the test set. There are many valid distance metrics you can implement for the classifier, and you should explore how different distance metrics ('euclidean', 'chebyshev', 'manhattan') impact the misclassification rate, i.e., the fraction of emails that are incorrectly classified on the testing set. Keep in mind that smaller distances should imply some notion of similarity between the emails. Furthermore, you should try out different values of k (between 1 and 10) to see what neighborhood size leads to the best accuracy on the testing data.
Plot the misclassification rate of three different distance metrics (i.e., 'euclidean', 'chebyshev', 'manhattan') and four different values of k (you can decide the values of k). For example, classifier = KNeighborsClassifier(n_neighbors=4, metric='euclidean'). Using your results, comment on the behavior of the kNN algorithm with varying distance metrics and varying values of k.
For this question, you don't need to use grid search and cross-validation. For each distance and k combination, just train the corresponding kNN model on the training set and evaluate the model performance on the test set. So, you will use a for loop to train 12 models (3*4) and then compare the performance of these 12 models on the test set in order to select the best model among these 12 models.
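The 12-model loop could be sketched as below. Since spam_data.csv is not available here, a synthetic DataFrame stands in for it (the Test_Indicator column name comes from the assignment; the feature names f1 and f2 are hypothetical). Plotting would then be one plt.plot line per metric, with k on the x-axis and the misclassification rate on the y-axis:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for spam_data.csv: numeric features, a 0/1 label,
# and a Test_Indicator column (1 = test set), mirroring the file's layout.
rng = np.random.default_rng(1)
n = 400
X = rng.normal(0, 1, (n, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)
df = pd.DataFrame(X, columns=["f1", "f2"])
df["label"] = y
df["Test_Indicator"] = (rng.random(n) < 0.25).astype(int)

train = df[df["Test_Indicator"] == 0]
test = df[df["Test_Indicator"] == 1]
features = ["f1", "f2"]

# One model per (metric, k) combination: 3 metrics * 4 k values = 12 models.
results = {}
for metric in ["euclidean", "chebyshev", "manhattan"]:
    for k in [1, 3, 5, 7]:
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(train[features], train["label"])
        acc = clf.score(test[features], test["label"])
        results[(metric, k)] = 1 - acc        # misclassification rate

best = min(results, key=results.get)
print("Best (metric, k):", best, "error:", results[best])
```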
Q5 (5 points): Predicting Customer Response

Data Sets
- dmtrain.csv
- dmtest.csv

Question Description
A direct marketing firm mails catalogs to its customer base of about 5 million households. Customers respond either by ordering items from the catalog, or do not respond. The firm distinguishes itself by mailing expensive catalogs, and - while the response rates to the firm's mailers are higher than the industry average (30% vs 22%) - they incur considerable printing and mailing costs. They are trying to improve their performance by identifying and targeting profitable customers, i.e., customers who are likely to respond (and order items that would justify the printing and mailing costs).
A preliminary study shows that customers seem to make their buying decision in two phases - they first decide whether or not to respond and, if they decide to respond, make a follow-up decision on what to order. dmtrain.csv contains information about 2,000 customers from the last mailing campaign. Everyone included has made at least one purchase from the firm in the past. The variables involved are:
The broad objective is to classify customers who are likely to respond to mailers and customers who are not; i.e., response is the outcome variable Y.
1. Read in the data and review all the continuous variables to see if any are skewed and need to be
transformed. If so, transform them and drop the non-transformed versions of the variables. Use the
transformed values for the following analyses. Make sure that you do not include the customer
identifier id in your calculations.
2. Generate a decision tree on the entire dataset, without any limitations on the depth of the tree. Use
entropy as the metric (DecisionTreeClassifier(criterion = 'entropy')). What is the depth of the tree that
is generated? Provide a plot of the tree.
3. We will focus on decision trees first and try to identify the best decision tree classifier by pruning the tree at different depths. Use 10-fold cross validation and identify the best tree depth (again, using entropy as the metric), by trying as many possible depths as you deem necessary (please use Grid Search to determine the optimal depth of the decision tree). What is the accuracy associated with this tree depth on the training dataset?
4. Next, we will consider random forests. Develop a random forest classifier with 100 trees, again
using 10-fold cross validation and different tree-depths to find the optimal tree-depth with grid search.
Which tree-depth results in the best random forest classifier with the highest accuracy? How does it
perform relative to the best decision tree?
5. Repeat the previous experiment with 50 trees. Provide all relevant results. Does your
recommendation change?
6. Develop a logistic regression model using 10-fold cross validation. What is the associated accuracy?
7. Among these four models (i.e., decision tree, random forest-100, random forest-50, logistic
regression), which one would you recommend, and why?
8. Read the file dmtest.csv and use the final best model among the three categories to make further predictions on which customers are likely to respond, and which are not, on the test set. The prediction labels have to be 0 or 1 - if the model you selected naturally gives a probability score, use 0.5 as the threshold to determine whether your prediction will be 0 or 1. For example, if you use a logistic
regression model that gives you a probability estimate of 0.51, the prediction label should be 1, and if
it gives a probability estimate of 0.49, the prediction should be 0.
Compare your predicted labels against the actual response indicator in dmtest.csv. Show the quality of your classifier's predictions on the test set by printing the confusion matrix, precision, recall, accuracy, and F-score.
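The grid-search and evaluation steps above could be sketched as below. Since dmtrain.csv and dmtest.csv are not available here, make_classification generates a synthetic stand-in, and a train/test split stands in for the two files; the depth range searched is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             accuracy_score, f1_score)

# Synthetic stand-in for dmtrain.csv / dmtest.csv (2,000 customers).
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 3: grid search over tree depth, entropy criterion, 10-fold CV.
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    {"max_depth": list(range(2, 11))}, cv=10)
grid.fit(X_train, y_train)
print("Best depth:", grid.best_params_, "CV accuracy:", grid.best_score_)

# Steps 4-6 would repeat the same grid search with
# RandomForestClassifier(n_estimators=100), n_estimators=50,
# and a LogisticRegression model.

# Step 8: predict() returns 0/1 labels directly; for a probability score,
# use predict_proba and a 0.5 threshold instead.
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred),
      "recall:", recall_score(y_test, y_pred),
      "accuracy:", accuracy_score(y_test, y_pred),
      "F-score:", f1_score(y_test, y_pred))
```

For step 2, tree.get_depth() reports the depth of the unrestricted tree, and sklearn.tree.plot_tree produces the required plot.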
Deliverables:
When writing your code, pay attention to (1) using meaningful variable names that reflect what value is stored in the variable, (2) using comments to explain your code, and (3) printing the results in the required format.
Upload the .ipynb file, the scraped .csv file from Q1, and the group report with proper interpretations to eLearning.