Final Report Exploratory Analysis of movies from IMDb and Prediction of Movie Rating

School

George Mason University

Course

MISC

Subject

Statistics

Date

Apr 3, 2024

Report
EXPLORATORY ANALYSIS OF MOVIES DATA FROM IMDB
Authors: Diana Duarte Guedes Sai Aishwarya Namrata Reddy Palle Syeda Yumma Batool Zaidi Laysa Priya Rudroju K. Paola C. Reyes Sydney | CS-504 | December 2022
December 7, 2022
Professor:
I. ABSTRACT
Paragraph based on slide .

II. INTRODUCTION
Movies are not only a source of entertainment these days but also a key source of international trade and marketing. People, especially young people, get caught up in new movie trends. The success of a movie concerns everyone, not just movie directors and box office executives. IMDB (the Internet Movie Database) is an online database of information related to movies. It keeps track of movie data such as genres, stars, directors, movie rating, meta-score, gross, votes, etc. Platforms like this are growing in popularity day by day because they provide individuals with frank reviews, so a huge amount of information on movie reviews and ratings is available online. In this project, this information is used for predictions, visualizations, and modeling. IMDB has become so popular that most people choose which movies to watch by looking at the ratings it provides. The aim of this project is therefore to scrape data from IMDB, find the attributes that contribute most to a movie's rating and the attributes that contribute most to its gross, and give some insights to movie makers.

III. METHODOLOGY

IV. TOOLS
The tools used in this project are Excel, Jupyter Notebook, RStudio, and Tableau. MS Excel is used to store the data scraped from the IMDB site. Jupyter Notebook with Python is used for scraping the data, exploratory analysis, correlation, and predictions with different models. RStudio is also used for predictions with different models, and Tableau is used for the visualizations in the exploratory analysis.
V. DATA SCRAPING

VI. SENTIMENT ANALYSIS

VII. EXPLORATORY ANALYSIS
The steps for this section are shown in the Exploratory Analysis diagram.
Dataset Description
The dataset that we scraped from IMDB has 1000 rows and 18 columns, with 12 numeric columns and 6 categorical columns.

Column Name         Description                                 Range
Id                  Movie Id                                    0-999
Movie_Name          Name of the movie                           -
Year_of_Release     The year in which the movie is released     1920-2022
Watch_Time          The total time the movie runs               45-321
Movie Rating        Rating for a movie given by IMDB            7.6-9.3
Meatscore_of_movie  Score given to a movie by average reviews   28-100
Votes               Number of votes a movie got                 25547-2655610
Gross               An amount that a movie earned               0-936.66
Description         A short description of movies               -
Rank                It is the rank given to each movie          1-1000
No_Of_Reviews       Total number of reviews a movie got         14-25
Positive_Reviews    Total number of positive reviews            4-25
Negative_Reviews    Total number of negative reviews            0-21
Certificate         Certificate given to each movie             -
Top250              Ranked the top 250 movies                   1-250
Director            Name of the directors                       -
Stars               Name of the stars                           -
Genre               Different genres names                      -

Predicted Attribute
As seen in the Rating distribution plot, 6.10% of the movies have a rating greater than or equal to 8.5. Most of the movies in the dataset have a rating lower than 8.5.

Distribution of Attributes (From slide ~15-23)

Numerical attributes
The following three boxplots show the distribution of runtime (watch time), gross, and number of votes for the two rating groups. As seen in the plots, movies with a rating greater than or equal to 8.5 (green) have a higher average for each of these attributes. The movies with the highest rating have on average around 150 minutes of runtime, $100,000 gross and around 1,000,000 votes. The bar plot shows the top 5 meta scores of the movies and their corresponding ratings. Meta score and movie rating need not be proportional: from the plot we can see that Meta Score 96 has a lower rating than Meta Score 94.
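The split at a rating of 8.5 used above can be sketched in Python as follows. This is only a minimal illustration, not the code used for the report: the function names `high_rating_share` and `group_means`, the dictionary keys, and the sample rows are all assumptions made up for the example.

```python
from statistics import mean


def high_rating_share(ratings, threshold=8.5):
    """Return the fraction of ratings that are >= threshold."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(r >= threshold for r in ratings) / len(ratings)


def group_means(rows, column, threshold=8.5):
    """Average `column` separately for high-rated and all other movies."""
    high = [row[column] for row in rows if row["rating"] >= threshold]
    low = [row[column] for row in rows if row["rating"] < threshold]
    return {
        "high": mean(high) if high else None,
        "low": mean(low) if low else None,
    }


# Made-up rows for illustration only; these are not values from the scraped data.
movies = [
    {"rating": 9.0, "watch_time": 150},
    {"rating": 8.6, "watch_time": 170},
    {"rating": 8.0, "watch_time": 120},
    {"rating": 7.8, "watch_time": 100},
]

print(high_rating_share([m["rating"] for m in movies]))
print(group_means(movies, "watch_time"))
```

Applied to the real dataset, the first function would reproduce the share of movies at or above 8.5, and the second would give the per-group averages that the boxplots summarize.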
Categorical Attributes
For the distribution of certificate and director, we find that most of the movies with a rating greater than or equal to 8.5 are directed by Christopher Nolan and are certified R. In the top 10 directors visualization, we identify other directors, such as Francis Ford Coppola and Peter Jackson, with 3 movies with a rating greater than or equal to 8.5.

Categorical and Numerical Attributes
The bar graphs below show the top 5 genres with the highest rating and the average watch time of movies by certificate. Among the top 5 genres by movie rating, Western, Mystery, and Sci-Fi take the top 3 places, with ratings greater than 8. The second bar graph, the average watch time of movies by certificate, shows that TV-MA and M/PG have the highest rating, close to 8.2, with an average watch time of around 125 minutes. The bubble chart above shows the gross revenue collected by movies of different genres, and the line graph shows the yearly trend in the number of movies with ratings. From the bubble chart we can see that action movies collect more revenue than any other genre, followed by adventure and animation. From the yearly trend we can say that the years 2004 and 2014 have the highest number of movies with good ratings.
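A ranking such as "top 5 genres by rating" comes down to a group-wise average followed by a sort. The sketch below shows one way to do this in plain Python. The function name `top_groups_by_mean`, the keys `genre` and `rating`, and the sample rows are hypothetical, and the sketch assumes one genre per row, which may not match how the scraped Genre column is stored.

```python
from collections import defaultdict


def top_groups_by_mean(rows, group_key, value_key, n=5):
    """Return the n groups with the highest mean of value_key, best first."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    means = {group: sums[group] / counts[group] for group in sums}
    # Sort by the mean value, highest first, and keep the first n groups.
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up rows for illustration only; not taken from the scraped dataset.
sample = [
    {"genre": "Western", "rating": 8.5},
    {"genre": "Western", "rating": 8.0},
    {"genre": "Drama", "rating": 8.0},
    {"genre": "Comedy", "rating": 7.5},
    {"genre": "Comedy", "rating": 7.5},
]

print(top_groups_by_mean(sample, "genre", "rating", n=2))
```

The same function could be reused for the average watch time per certificate by changing the two key arguments.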
From the following charts of directors and their average movie ratings, we can see that most directors have an average rating of less than 8.2 (shown in red), around 90+ directors have a rating between 8.2 and 8.7 (shown in white), and around 8 directors have a rating greater than 8.7. Frank Darabont, Irvin Kershner, Lana Wachowski, Lilly Wachowski, Madhavan, Rishab Shetty, Sudha Kongara and T.J. Gnanavel are the top directors whose movie ratings are greater than 8.7.

Normality
We performed a Gaussian test and plotted a density graph to check whether the data is normally distributed. The normal distribution, also called the Gaussian distribution, describes a numeric variable whose values are distributed symmetrically around the mean, i.e., values nearer to the mean are more likely to occur than values farther from it. When the p-value is greater than 0.05, the data is said to be probably Gaussian; when the p-value is less than 0.05, it is said to be probably not Gaussian. For the scraped IMDB data, the p-value of the Gaussian test is 0.444, which is greater than 0.05, so the data is probably Gaussian. The density curve also suggests that the data is approximately normal, with some skewness towards the left.

Linear Regression & variables behavior with rating
We performed linear regression on the data with movie rating as the dependent variable and watch time, gross, votes, rank, positive reviews, negative reviews, and meta-score as the independent variables. In the linear regression analysis, variables with a p-value less than 0.05 have a statistically significant relationship with the dependent variable, whereas variables with a p-value greater than 0.05 show no significant relationship.
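Both checks described above rely on comparing a p-value with 0.05, and they can be sketched with the Python standard library alone. The report does not say which Gaussian test was used, so the Jarque-Bera test here is a swapped-in assumption. The regression part handles a single predictor and uses a large-sample normal approximation for the slope's p-value instead of the exact t distribution. The function names and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def jarque_bera(values):
    """Jarque-Bera normality test (an assumed stand-in for the report's test).

    Returns (statistic, p_value). The p-value uses the asymptotic chi-square
    distribution with 2 degrees of freedom, whose tail is exp(-x / 2).
    """
    n = len(values)
    m = fmean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("values must not all be equal")
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    return statistic, math.exp(-statistic / 2)


def slope_p_value(x, y):
    """Fit y = a + b*x by least squares and return (b, two-sided p-value).

    The p-value uses a normal approximation, reasonable only for large samples.
    """
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    if std_err == 0:
        return slope, 0.0  # perfect fit: the slope is exactly determined
    z = slope / std_err
    return slope, 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up data for illustration only; not values from the scraped dataset.
x = [0, 1, 2, 3, 4, 5, 6, 7]
related = [1, 1, 5, 5, 9, 9, 13, 13]        # rises with x
unrelated = [(v - 3.5) ** 2 for v in x]     # symmetric, no linear trend

for name, y in [("related", related), ("unrelated", unrelated)]:
    slope, p = slope_p_value(x, y)
    verdict = "significant" if p < 0.05 else "not significant"
    print(name, round(slope, 3), verdict)
```

For a multiple regression like the one in this report, a statistics package would report one such p-value per predictor; the decision rule at 0.05 is the same.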
So, positive reviews and negative reviews have no relation with the rating, whereas watch time, gross, votes, and rank are related to, or contribute to, the movie rating. The most important variables are Watch Time and Gross, which have the strongest relation with the rating.

VIII. MODELING AND KEY FEATURES

Regression Analysis | Movie Rating

Decision Tree | Movie Rating

Regression Analysis | Gross Prediction
Like rating, gross is another measure of a movie's success. We build a model that predicts movie gross to identify the variables that contribute the most in the movie industry. The models used are linear regression, decision tree, and Random Forest. For linear regression, we run the model with all the variables, identify the most important predictors (those with a p-value < 0.01), and re-run the model with the most significant variables. The summary for the regression is shown in the next figure. The best predictors for the model are watch time, Certificate (R), and Genre (Action, Adventure & Sci-Fi), and the R-squared value is 0.24, which is low. On the test set, the model shows a Root Mean Squared Error (RMSE) of $3,212.

Decision Tree | Gross Prediction
The next method is the decision tree. It was modeled by first identifying the best tree size from the following plot, which shows the error versus the tree size. As seen, the best number of leaves for the tree is 6. After this, the tree model was run.
In the Tree Model figure, we can identify the 6 gross predictions based on the values of the most important variables, which are genre (Adventure, Animation & Sci-Fi) and watch time. For the test performance, the RMSE is $3,177.

IX.
The last method is Random Forest, where we identify the best number of trees from the following plot, which shows the error versus the number of trees. As seen from the plot, 300 trees gives the lowest error. The model was run with this value and mtry=4, which is the number of variables tried at each split. In the variable importance plot, we identify the attributes that contribute the most to the model. For example, if we removed the genre Adventure from the model, the MSE would increase by more than 30%. The RMSE is $3,141.

After running the 3 methods, we can identify linear regression and Random Forest as the worst and best, respectively, based on the RMSE. Finally, we calculate the Pearson correlation between the predicted and actual gross as another measure of model performance. The predictions from Random Forest have a correlation of 0.52, and those from the tree model have a correlation of 0.47.
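The two performance measures used to compare the models, RMSE and the Pearson correlation between predicted and actual gross, can be sketched as follows. This is a minimal illustration in plain Python; the function names `rmse` and `pearson` and the sample numbers are made up and are not the report's results.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up gross values for illustration only.
actual_gross = [10.0, 20.0, 30.0, 40.0]
predicted_gross = [12.0, 18.0, 33.0, 37.0]

print(rmse(actual_gross, predicted_gross))
print(pearson(actual_gross, predicted_gross))
```

A lower RMSE and a higher correlation both point to better predictions, which is why the two measures are read together when ranking the models.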
X. RESULTS AND INSIGHTS

XI. CONCLUSION

XII. FUTURE WORK

XIII. REFERENCES

Guideline to follow:
We should work on at least 6 slides of the presentation for the report. Please pick yours and let the others know here. Paola: I am taking some of my viz and the slides for the Gross Prediction (15, 16, 17) + Gross Prediction (+4).