Project 3 Report
Northeastern University, ALY 6000 (Computer Science)
Introduction to Analytics: Module 3 Project
Vidhi Naik
ALY 6000 Introduction to Analytics
October 13, 2023
Introduction

Project 3 involves data analysis and visualization using R. This report outlines our exploration of a dataset from Goodreads, which we cleaned and analyzed, covering book ratings, publishing trends, and statistical insights. It presents our methodology, findings, and recommendations. The report was generated with R Markdown, providing an organized presentation of key findings, remarks and comments that explain the code, conclusions drawn from the analysis, and actionable recommendations for future exploration. Please note that some output has been excluded from the report due to its substantial size, to prevent excessive pagination.

Key Findings

The data preparation stage cleaned and refined the dataset for analysis by standardizing column names, handling date formats, and filtering to books published between 1990 and 2020; page-count visualizations were limited to books under 1,200 pages. Descriptive statistics characterized the distribution of book ratings, and population statistics, including the mean, population variance, and population standard deviation, provided insight into rating patterns. Publisher profiles were built from the number of books published: publishers with fewer than 250 books were excluded, and a Pareto chart displayed how books were distributed among the remaining publishers. The project generated key visualizations, including a histogram of book ratings, a box plot of page-count variation, and a scatter plot relating page count to rating. Finally, three random samples of 100 books each were drawn and compared with the population statistics to assess the impact of sample size on parameter estimation.
The data-driven recommendations presented in this report offer practical guidance for book-industry stakeholders, drawing on the statistical patterns and trends identified in the dataset.
Naik_Project 3 – R Markdown Report
Vidhi Naik, 2023-10-13

```r
# VidhiNaik_Project3
# cat("\014")       # clears console
# rm(list = ls())   # clears global environment
try(dev.off(dev.list()["RStudioGD"]), silent = TRUE)  # clears plots
# try(p_unload(p_loaded(), character.only = TRUE), silent = TRUE)  # clears packages
# options(scipen = 100)  # disables scientific notation for the entire R session

library(pacman)
p_load(tidyverse)

# Q1. Download the file books.csv from Canvas and read the dataset into R.
books_data <- read.csv("/Users/vidhinaik/Desktop/MS Project Management Syllabus/Intro to Analytics ALY6000/Assignment 3/books.csv")
head(books_data)

# Q2. The janitor package contains helpful functions that perform basic
# maintenance of your data frame. Use the clean_names function to standardize
# the names in your data frame.
# install.packages("janitor")
library(janitor)
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##     chisq.test, fisher.test
books_data <- clean_names(books_data)
head(books_data)

# Q3. The lubridate package contains helpful functions to convert dates
# represented as strings to date objects. Convert the first_publish_date
# column to type Date using the mdy function.
# install.packages("lubridate")
library(lubridate)
books_data$first_publish_date <- mdy(books_data$first_publish_date)
## Warning: 1186 failed to parse.
```
```r
class(books_data$first_publish_date)
books_data

# Q4. Reduce your dataset to only include books published between 1990 and
# 2020 (inclusive).
books_subset1 <- books_data[books_data$first_publish_date >= as.Date("1990-01-01") &
                            books_data$first_publish_date <= as.Date("2020-12-31"), ]
books_subset1$first_publish_date

# Q5. Remove the following columns from the dataset: publish_date, edition,
# characters, price, genres, setting, and isbn.
# install.packages("dplyr")
library(dplyr)
books_subset1 <- books_subset1 %>%
  select(-publish_date, -edition, -characters, -price, -genres, -setting, -isbn)
books_subset1

# ---------------------------- Data Analysis ------------------------------

# Q1. Use the glimpse function to produce a long view of the dataset.
glimpse(books_subset1)

# Q2. Use the summary function to produce a breakdown of the statistics of
# the dataset.
summary(books_subset1)

# Q3. Create a rating histogram with the following criteria:
# – The y-axis is labeled "Number of Books."
# – The x-axis is labeled "Rating."
# – The title of the graph is "Histogram of Book Ratings."
# – The graph is filled with the color "red."
# – Set a binwidth of .25.
# – Use theme_bw().
library(ggplot2)
ggplot(data = books_subset1, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "red") +
  labs(y = "Number of Books", x = "Rating", title = "Histogram of Book Ratings") +
  theme_bw()
## Warning: Removed 22495 rows containing non-finite values (`stat_bin()`).
```
```r
# Q4. Create a boxplot of the number of pages per book in the dataset with
# the following requirements:
# – The boxplot is horizontal.
# – The x-axis is labeled "Pages."
# – The title is "Box Plot of Page Counts."
# – Fill the boxplot with the color magenta.
# – Use the theme theme_economist from the ggthemes package.
# install.packages("ggthemes")
library(ggthemes)
ggplot(data = books_subset1, aes(x = pages)) +
  geom_boxplot(fill = "magenta") +
  labs(x = "Pages", title = "Box Plot of Page Counts") +
  coord_cartesian(xlim = c(0, 1200)) +
  theme_economist()
## Warning: Removed 23294 rows containing non-finite values (`stat_boxplot()`).
```
```r
# Q5. Group the data by publisher and produce a summary data frame containing
# each publisher and their associated number of books in the dataset. With
# that data frame, make the following refinements:
# – Remove any rows that contain NAs.
# – Remove any publishers with fewer than 250 books.
# – Order the data frame by the total number of books in descending order.
# – Make the publisher into a factor with levels defined by the current
#   ordering of the publisher.
# – Add columns with the cumulative count, relative frequency, and cumulative
#   relative frequency of books.
library(dplyr)
publisher_summary <- books_subset1 %>%
  group_by(publisher) %>%
  summarise(total_books = n()) %>%
  filter(complete.cases(publisher)) %>%
  filter(total_books >= 250) %>%
  arrange(desc(total_books)) %>%
  mutate(publisher = factor(publisher, levels = publisher)) %>%
  mutate(cumulative_count = cumsum(total_books)) %>%
  mutate(relative_frequency = total_books / sum(total_books)) %>%
  mutate(cumulative_relative_frequency = cumsum(relative_frequency))
publisher_summary

# Q6. Using the data frame constructed in the prior problem, create a Pareto
# Chart with an ogive of cumulative counts formatted with the following
# additional criteria:
# – The bars are filled with the color cyan.
# – The x-axis label is "Publisher."
# – The y-axis label is "Number of Books."
# – The title is "Pareto and Ogive of Publisher Book Counts (1990 - 2020)."
# – Use the theme theme_clean().
# – Rotate the x-axis labels by 45 degrees (consider the ggeasy package).
ggplot(data = publisher_summary, aes(x = publisher, y = total_books)) +
  geom_bar(stat = "identity", fill = "cyan") +
  labs(x = "Publisher", y = "Number of Books",
       title = "Pareto and Ogive of Publisher Book Counts (1990 - 2020)") +
  theme_clean() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_line(aes(x = publisher, y = cumulative_count, group = 1), color = "black") +
  geom_point(aes(x = publisher, y = cumulative_count), color = "black", size = 3)
```
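One limitation of drawing the ogive on the same y-axis as the bars is that the cumulative counts dwarf the per-publisher bars. A possible refinement, not part of the assignment, is to rescale the ogive to the bar axis and expose its true values on a secondary axis with `sec_axis()`. The sketch below uses a hypothetical toy data frame standing in for `publisher_summary`:

```r
library(ggplot2)
library(ggthemes)

# Toy stand-in for publisher_summary (hypothetical values)
df <- data.frame(publisher   = factor(c("A", "B", "C"), levels = c("A", "B", "C")),
                 total_books = c(500, 300, 250))
df$cumulative_count <- cumsum(df$total_books)

# Rescale the ogive onto the bar axis, then map it back on a secondary axis
scale_factor <- max(df$cumulative_count) / max(df$total_books)
p <- ggplot(df, aes(x = publisher, y = total_books)) +
  geom_col(fill = "cyan") +
  geom_line(aes(y = cumulative_count / scale_factor, group = 1), color = "black") +
  geom_point(aes(y = cumulative_count / scale_factor), color = "black") +
  scale_y_continuous(name = "Number of Books",
                     sec.axis = sec_axis(~ . * scale_factor, name = "Cumulative Count")) +
  theme_clean()
p
```

The same rescaling applied to `publisher_summary` would keep the bars readable while still showing the cumulative curve.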
```r
# Q7. Create a scatter plot of pages vs. rating for the books data frame with
# the following requirements:
# – Color the points based on the year of publication.
# – The x-axis is labeled "Pages."
# – The y-axis is labeled "Rating."
# – The graph is titled "Scatter Plot of Pages vs. Rating."
# – Use the theme theme_tufte().
library(ggplot2)
library(lubridate)
# Note: first_publish_date was already converted to Date in Q3; re-running
# mdy() on a Date column fails to parse and would overwrite the column with
# NAs, so the conversion is not repeated here.
books_subset1$year <- year(books_subset1$first_publish_date)
ggplot(books_subset1, aes(x = pages, y = rating, color = year)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 1250, by = 250)) +
  labs(x = "Pages", y = "Rating", title = "Scatter Plot of Pages vs. Rating") +
  scale_color_gradient(low = "lightblue", high = "navyblue") +
  coord_cartesian(xlim = c(0, 1250)) +  # limit x-axis range
  theme_tufte()
## Warning: Removed 23294 rows containing missing values (`geom_point()`).
```
```r
# Q8. Create a data frame from the books data frame that contains a count of
# the number of books by year and the average rating for each year.
books_by_year <- books_subset1 %>%
  group_by(year) %>%
  summarize(Count = n(), Average_Rating = mean(rating, na.rm = TRUE))
books_by_year

# Q9. Create a line plot from this data frame with points representing the
# counts per year from 1990–2020. Color the points for each year with the
# average rating. Format with the following specifications:
# – The graph is titled "Total Number of Books Rated Per Year."
# – The theme is theme_excel_new().
ggplot(books_by_year, aes(x = year, y = Count, color = Average_Rating)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Count", title = "Total Number of Books Rated Per Year") +
  scale_color_viridis_c(option = "magma") +  # "eruptions" is not a valid
                                             # viridis option; "magma" is used
  scale_y_continuous(breaks = seq(0, 1000, by = 500)) +  # custom y-axis breaks
  coord_cartesian(ylim = c(0, 1500)) +
  theme_excel_new()
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
```
```r
# Q10. R has built-in functions to compute the sample mean (mean), sample
# variance (var), and sample standard deviation (sd). Create your own
# functions to compute the average, the population variance (pop_var), and
# the population standard deviation (sd_var). You may not use the three
# built-in functions listed above, but may use other built-in functions. All
# three functions should accept a single vector of values and return the
# corresponding computed result.
library(purrr)

# sample mean
custom_mean <- function(x) {
  total <- sum(x)
  n <- length(x)
  return(total / n)
}
custom_mean(books_by_year$Count)

# population variance
pop_var <- function(x) {
  m <- custom_mean(x)
  n <- length(x)
  variance <- sum((x - m)^2) / n
  return(variance)
}
pop_var(books_by_year$Count)

# population standard deviation
sd_var <- function(x) {
  variance <- pop_var(x)
  return(sqrt(variance))
}
sd_var(books_by_year$Count)

# Q11. Consider the complete dataset of books to be the population you are
# analyzing. Compute population stats for the average, variance, and standard
# deviation of the book rating.
library(dplyr)
# rating contains missing values, which would propagate as NA through the
# custom functions, so they are dropped before computing the statistics
ratings_book <- books_data %>%
  summarise(avg_rating = custom_mean(na.omit(rating)),
            variance   = pop_var(na.omit(rating)),
            sd         = sd_var(na.omit(rating)))
ratings_book

# Q12. Create three samples of size 100 from the books data frame. Compute
# sample statistics for mean, variance, and standard deviation of the book
# rating. Compare these results with the population stats in your report.
set.seed(123)  # for reproducible samples
sample_statistics <- map_dfr(1:3, function(i) {
  books_data %>%
    filter(!is.na(rating)) %>%
    sample_n(100) %>%
    summarize(sample = i,
              sample_mean = mean(rating),
              sample_variance = var(rating),
              sample_standard_deviation = sd(rating))
})
sample_statistics

# Q13. Create one or more additional visualizations based on the existing
# data or additional analysis that you perform.
library(ggplot2)
# books_subset1 is used here so the plot matches its 1990-2020 subtitle
ggplot(books_subset1, aes(y = rating)) +
  geom_boxplot(fill = "pink", color = "purple") +
  labs(y = "Rating",
       title = "Box Plot of Book Ratings",
       subtitle = "Books Published (1990 - 2020)") +
  theme_minimal()
```
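As a cross-check on the custom estimators from Q10, the population variance and the built-in sample variance differ only by the factor n/(n - 1). A small sketch with made-up ratings (not the books data) verifies the relationship:

```r
# Illustrative check with made-up values, not the books data:
# the sample variance var(x) equals the population variance times n / (n - 1)
x <- c(4.2, 3.9, 4.5, 4.0, 3.8)
n <- length(x)
pop_v  <- sum((x - mean(x))^2) / n   # population variance, as in pop_var()
samp_v <- var(x)                     # built-in sample variance
stopifnot(isTRUE(all.equal(samp_v, pop_v * n / (n - 1))))
```

This factor is why the sample statistics in Q12 sit slightly above the population variance on average, especially for small samples.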
Conclusion

In brief, this project, emphasizing data analysis and visualization in R, has yielded significant insights into book ratings, publishing trends, and data handling. The statistical analysis and visualizations have clarified book-rating distributions and uncovered publisher patterns. Comparing sample statistics with population statistics has highlighted the influence of sample size on parameter estimates.

Recommendations

1. Implement stringent data-quality checks and automated data-cleaning processes to ensure data accuracy and consistency.
2. Explore advanced statistical techniques, such as hypothesis testing and regression analysis, for deeper insights.
3. Use time-series analysis to track evolving book trends over time.
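Recommendation 2 can be sketched concretely with a simple linear regression of rating on page count. The snippet below uses synthetic stand-in data rather than the actual `books_subset1`, so the fitted coefficients are illustrative only:

```r
# Synthetic stand-in data (illustrative only, not the Goodreads dataset)
set.seed(42)
demo <- data.frame(pages = sample(100:1200, 200, replace = TRUE))
demo$rating <- 3.5 + 0.0002 * demo$pages + rnorm(200, sd = 0.3)

# Fit rating ~ pages; the slope estimates the pages-rating association
model <- lm(rating ~ pages, data = demo)
summary(model)$coefficients
```

Applied to the real data, the slope's sign, size, and p-value would quantify the pages-rating relationship that the Q7 scatter plot only shows visually.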
References

1. Wickham, H., & Grolemund, G. (2016). R for Data Science. O'Reilly Media.
2. R Core Team. (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
3. Stack Overflow. (Various). Online community for programming and data analysis. Retrieved from https://stackoverflow.com/