ITSS 4381 - Group Project
School: University of Texas, Dallas
Course: 4381
Date: Jan 9, 2024
Pages: 9
ITSS 4381.001 23Fall
Instructor: Hong Zhang
1
Final Group Project: Data Collection, Preprocessing, Visualization, Machine Learning
Due Date: 12/10/2023, 11:59 pm
Please read the following guidelines before you proceed with this project. These
guidelines should be followed and will be used to grade your homework:
• All the code should be included in one single Jupyter Notebook file (.ipynb file).
• This is a group project. It counts for 15 points with 5 questions.
• Using GPT is allowed for this project, but please make sure that the AI-generated code is logically correct and runs as intended.
• For the report:
  o Clearly identify your group number and all group members on the cover page of the group report.
  o A professional-quality report is expected - messy or hard-to-read reports will be penalized!
  o Clearly identify which question each part of the code is for, and what it is supposed to do. Also, provide clear explanations of the results in the report.
  o Write comments for each question in the Jupyter Notebook.
  o Put clear screenshots of the code and the outputs in the report.
• Please upload three things to eLearning:
  o The group report with CLEAR result interpretations and COMPLETE screenshots of the Jupyter Notebook.
  o The .ipynb file of the program.
  o The scraped movies.csv file of Q1.
Q1 (2 points): Data Scraping

Box Office Mojo (http://boxofficemojo.com) is a website that tracks a variety of information on movies released each year (such as genre, studio, box office revenues, etc.). You will write a script that collects data on all movies for the year 2023 that are listed on the site. Information on movies that were released in 2023 is available at: http://www.boxofficemojo.com/yearly/chart/?yr=2023&p=.htm.
You will parse this webpage with the default movie sorting to collect the information for each movie. Collect the following information for each movie and save it in a csv file named movies.csv.
i. Movie Name
ii. Gross Revenues
iii. Total Gross Revenues
iv. Number of Theatres
v. Release Date
vi. Distributor
Note: You need to further work on the in-class example by considering the relevant data type for each attribute. For example, Movie Name should be defined as a string variable, and Gross Revenues should be defined as a number. If no information is provided for the Distributor of a movie, the value stored will be Null.
Finally, determine which Distributor had the highest number of movies in 2023, and provide a list of all the movies distributed by that distributor.
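The scraping workflow could be sketched as below. The HTML snippet is a hypothetical stand-in for the Box Office Mojo chart table (the real page would be fetched with requests.get(url).text, and its actual column layout may differ), but the parsing, type conversion, Null handling, and distributor count follow the steps above.

```python
import csv
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure of the yearly chart table;
# the real page must be fetched first, e.g. requests.get(url).text.
html = """
<table>
  <tr><th>Release</th><th>Gross</th><th>Total Gross</th>
      <th>Theaters</th><th>Release Date</th><th>Distributor</th></tr>
  <tr><td>Movie A</td><td>$500,000</td><td>$1,200,000</td>
      <td>2,100</td><td>Mar 3</td><td>Studio X</td></tr>
  <tr><td>Movie B</td><td>$300,000</td><td>$900,000</td>
      <td>1,800</td><td>Jun 9</td><td>Studio X</td></tr>
  <tr><td>Movie C</td><td>$250,000</td><td>$700,000</td>
      <td>1,500</td><td>Jul 21</td><td>-</td></tr>
</table>
"""

def to_number(text):
    """Convert a string like '$1,200,000' to 1200000; None if not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None

movies = []
soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all("tr")[1:]:          # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    name, gross, total, theaters, date, dist = cells
    movies.append({
        "Movie Name": name,                   # string
        "Gross Revenues": to_number(gross),   # number
        "Total Gross Revenues": to_number(total),
        "Number of Theatres": to_number(theaters),
        "Release Date": date,
        "Distributor": dist if dist != "-" else None,  # Null if missing
    })

with open("movies.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=movies[0].keys())
    writer.writeheader()
    writer.writerows(movies)

# Distributor with the highest number of movies (missing values ignored)
counts = Counter(m["Distributor"] for m in movies if m["Distributor"])
top_distributor, _ = counts.most_common(1)[0]
top_movies = [m["Movie Name"] for m in movies
              if m["Distributor"] == top_distributor]
print(top_distributor, top_movies)
```

Storing a missing Distributor as Python's None makes it land as an empty cell (i.e., Null) in the csv file.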
Q2 (2 points): Data Preprocessing and Visualization
(a) The carprices.txt file contains a single column of car prices. Read this data into your program. The first line is the heading (i.e., column name). For any line containing non-numeric data, raise a ValueError exception and print a message saying, "This data is invalid and will be ignored." You should divide each number by 10 and round it down to the nearest integer. Then plot a histogram that looks like the one shown below. The number of bins to use is 50.
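A minimal sketch of part (a), using a small inline sample in place of carprices.txt (in the actual solution the lines would come from open("carprices.txt")):

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical sample standing in for carprices.txt; first line is the heading.
lines = ["Price", "1523", "987", "abc", "2044", "1761"]

values = []
for line in lines[1:]:                    # skip the heading line
    try:
        price = float(line)               # non-numeric lines raise ValueError
        values.append(int(price // 10))   # divide by 10 and round down
    except ValueError:
        print("This data is invalid and will be ignored.")

plt.hist(values, bins=50)                 # histogram with 50 bins
plt.savefig("carprices_hist.png")
print(values)
```

Here float() raises the ValueError naturally for a line like "abc", so the except block prints the required message and the line is skipped.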
(b) The attached cartype.txt contains a single column listing the type of car (van, compact, etc.). Read this data into your program. The first line is the heading (i.e., column name). Then plot a pie chart after computing the frequency for each car type. Your chart should look like the one shown below (no need to match the exact color or position). Each slice should be separated from the next as shown below. (Note: You can use AI to help you solve this question since we did not cover pie charts in class.)
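Part (b) could be sketched as follows, again with a hypothetical inline sample in place of cartype.txt. The explode argument of plt.pie is what separates each slice from the next:

```python
from collections import Counter
import matplotlib
matplotlib.use("Agg")          # render off-screen; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical sample standing in for cartype.txt; first line is the heading.
lines = ["Type", "van", "compact", "van", "sedan", "compact", "van"]

freq = Counter(lines[1:])                 # frequency of each car type
labels = list(freq.keys())
sizes = list(freq.values())

# A small explode value for every slice pulls each slice away from the center,
# separating it from its neighbors.
explode = [0.05] * len(labels)
plt.pie(sizes, labels=labels, explode=explode, autopct="%1.1f%%")
plt.savefig("cartype_pie.png")
print(freq)
```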
ITSS 4381.001 23Fall
Instructor: Hong Zhang
4
Q3 (3 points): Linear Regression

In his undergraduate honors thesis, Versaci (2009) investigated what he termed "internet buzz variables" to see whether they provide any additional predictive information towards a movie's box office revenue (beyond movie characteristics like genre, actors, budget, etc.). boxOffice.csv contains this data involving 62 movies (all widely released movies between November 7, 2008 and April 3, 2009); the variables available (along with their descriptions) are in the table below. We will conduct the analysis ignoring the "buzz" variables (addict, cmngsoon, fandango, and cntwait3) first.
Note: You should show the screenshots of the entire regression tables in your group report.
1. Plot histograms of the continuous variables (box, budget, starpowr) to see if any transformations are needed. Are any of them skewed? Apply a log-transformation to all the skewed variables.
2. Run a linear regression of box office revenues on the "traditional" variables (i.e., using all the independent variables except the last four "buzz" variables). If any variables were transformed, be sure to use the transformed versions of those variables. What are the R2 and adjusted-R2 values? Which variables (if any) are significant at the 0.10 level, based on the t-statistics and associated probabilities (P>|t|)?
3. Plot histograms of the four "buzz" variables. Are any of them skewed? Apply a log-transformation to all the skewed variables.
4. Run a linear regression of box office revenues on all the independent variables including the "buzz" variables (transformed as needed). What are the R2 and adjusted-R2 values? Which variables (if any) are significant at the 0.10 level, based on the t-statistics and associated probabilities (P>|t|)?
5. Are the "buzz" variables helping build a better model? Do they increase the adjusted-R2 substantially? (Note: you could use adjusted R2 to compare linear regression models; a higher adjusted-R2 value indicates a better model that explains the underlying pattern better. For this question, you don't have to split the data into training and testing. Just use the entire dataset to fit the linear regression line.)
Q4 (3 points): Spam Classification using kNN

One of the earliest and most successful applications of machine learning has been in developing spam detection algorithms. For this problem, you are going to implement the kNN algorithm on a dataset containing thousands of spam and non-spam emails. The dataset you are going to use is spam_data.csv. The "Test_Indicator" column in the .csv file specifies which data points are part of the test set (1 indicates that the data point belongs to the test set).
Train a kNN classifier to classify each email as spam or non-spam using the training set, and then evaluate the classifier on the test set. There are many valid distance metrics you can implement for the classifier, and you should explore how different distance metrics ('euclidean', 'chebyshev', 'manhattan') impact the misclassification rate, i.e., the fraction of emails that are incorrectly classified on the testing set. Keep in mind that smaller distances should imply some notion of similarity between the emails. Furthermore, you should try out different values of k (between 1 and 10) to see what neighborhood size leads to the best accuracy on the testing data.
Plot the misclassification rate of three different distance metrics (i.e., 'euclidean', 'chebyshev', 'manhattan') and four different values of k (you can decide the values of k). For example, classifier = KNeighborsClassifier(n_neighbors=4, metric='euclidean'). Using your results, comment on the behavior of the kNN algorithm with varying distance metrics and varying values of k.
For this question, you don't need to use grid search and cross-validation. For each distance and k combination, just train the corresponding kNN model on the training set and evaluate the model performance on the test set. So, you will use a for loop to train 12 models (3*4) and then compare the performance of these 12 models on the test set in order to select the best model among these 12 models.
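The 12-model loop could be sketched as below. Since spam_data.csv is not available here, a synthetic DataFrame stands in for it (the Test_Indicator column name comes from the assignment; the feature names f1 and f2 are hypothetical). Plotting would then be one plt.plot line per metric, with k on the x-axis and the misclassification rate on the y-axis:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for spam_data.csv: numeric features, a 0/1 label,
# and a Test_Indicator column (1 = test set), mirroring the file's layout.
rng = np.random.default_rng(1)
n = 400
X = rng.normal(0, 1, (n, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)
df = pd.DataFrame(X, columns=["f1", "f2"])
df["label"] = y
df["Test_Indicator"] = (rng.random(n) < 0.25).astype(int)

train = df[df["Test_Indicator"] == 0]
test = df[df["Test_Indicator"] == 1]
features = ["f1", "f2"]

# One model per (metric, k) combination: 3 metrics * 4 k values = 12 models.
results = {}
for metric in ["euclidean", "chebyshev", "manhattan"]:
    for k in [1, 3, 5, 7]:
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(train[features], train["label"])
        acc = clf.score(test[features], test["label"])
        results[(metric, k)] = 1 - acc        # misclassification rate

best = min(results, key=results.get)
print("Best (metric, k):", best, "error:", results[best])
```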
Q5 (5 points): Predicting Customer Response

Data Sets
- dmtrain.csv
- dmtest.csv

Question Description
A direct marketing firm mails catalogs to its customer base of about 5 million households. Customers respond either by ordering items from the catalog, or do not respond. The firm distinguishes itself by mailing expensive catalogs, and - while the response rates to the firm's mailers are higher than the industry average (30% vs 22%) - they incur considerable printing and mailing costs. They are trying to improve their performance by identifying and targeting profitable customers, i.e., customers who are likely to respond (and order items that would justify the printing and mailing costs).
A preliminary study shows that customers seem to make their buying decision in two phases - they first decide whether or not to respond and, if they decide to respond, make a follow-up decision on what to order. dmtrain.csv contains information about 2,000 customers from the last mailing campaign. Everyone included has made at least one purchase from the firm in the past. The variables involved are:
The broad objective is to classify customers who are likely to respond to mailers and customers who are not; i.e., response is the outcome variable Y.
1. Read in the data and review all the continuous variables to see if any are skewed and need to be
transformed. If so, transform them and drop the non-transformed versions of the variables. Use the
transformed values for the following analyses. Make sure that you do not include the customer
identifier id in your calculations.
2. Generate a decision tree on the entire dataset, without any limitations on the depth of the tree. Use
entropy as the metric (DecisionTreeClassifier(criterion = 'entropy')). What is the depth of the tree that
is generated? Provide a plot of the tree.
3. We will focus on decision trees first and try to identify the best decision tree classifier by pruning the tree at different depths. Use 10-fold cross validation and identify the best tree depth (again, using entropy as the metric), by trying as many possible depths as you deem necessary (please use Grid Search to determine the optimal depth of the decision tree). What is the accuracy associated with this tree depth on the training dataset?
4. Next, we will consider random forests. Develop a random forest classifier with 100 trees, again
using 10-fold cross validation and different tree-depths to find the optimal tree-depth with grid search.
Which tree-depth results in the best random forest classifier with the highest accuracy? How does it
perform relative to the best decision tree?
5. Repeat the previous experiment with 50 trees. Provide all relevant results. Does your
recommendation change?
6. Develop a logistic regression model using 10-fold cross validation. What is the associated accuracy?
7. Among these four models (i.e., decision tree, random forest-100, random forest-50, logistic
regression), which one would you recommend, and why?
8. Read the file dmtest.csv and use the final best model among the three categories to make further predictions on which customers are likely to respond, and which are not, on the test set. The prediction labels have to be 0 or 1 - if the model you selected naturally gives a probability score, use 0.5 as the threshold to determine whether your prediction will be 0 or 1. For example, if you use a logistic
regression model that gives you a probability estimate of 0.51, the prediction label should be 1, and if
it gives a probability estimate of 0.49, the prediction should be 0.
Compare your predicted labels against the actual response indicator in dmtest.csv. Show the quality of your classifier's predictions on the test set by printing the confusion matrix, precision, recall, accuracy, and F-score.
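The grid-search and evaluation steps above could be sketched as below. Since dmtrain.csv and dmtest.csv are not available here, make_classification generates a synthetic stand-in, and a train/test split stands in for the two files; the depth range searched is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             accuracy_score, f1_score)

# Synthetic stand-in for dmtrain.csv / dmtest.csv (2,000 customers).
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 3: grid search over tree depth, entropy criterion, 10-fold CV.
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    {"max_depth": list(range(2, 11))}, cv=10)
grid.fit(X_train, y_train)
print("Best depth:", grid.best_params_, "CV accuracy:", grid.best_score_)

# Steps 4-6 would repeat the same grid search with
# RandomForestClassifier(n_estimators=100), n_estimators=50,
# and a LogisticRegression model.

# Step 8: predict() returns 0/1 labels directly; for a probability score,
# use predict_proba and a 0.5 threshold instead.
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred),
      "recall:", recall_score(y_test, y_pred),
      "accuracy:", accuracy_score(y_test, y_pred),
      "F-score:", f1_score(y_test, y_pred))
```

For step 2, tree.get_depth() reports the depth of the unrestricted tree, and sklearn.tree.plot_tree produces the required plot.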
Deliverables:
When writing your code, pay attention to (1) using meaningful variable names that reflect what value is stored in the variable, (2) using comments to explain your code, and (3) printing the results in the required format.
Upload the .ipynb file, the scraped .csv file from Q1, and the group report with proper interpretations to eLearning.