Vyas_Project3_Report

School

Northeastern University

Course

6000

Subject

Computer Science

Date

Feb 20, 2024

Report
Project 3 - Exploring Visualizations
ALY6000 71053 Introduction to Analytics SEC 27
Module 3
Prepared by: Anvita Vyas (NUID: 002962386)
For: Prof. Herath Gedara, Chinthaka Pathum Dinesh
Submission Date: 10 October 2023

Introduction
The third project of the course is an opportunity to visualize data with R. It works with a dataset of books and involves data cleaning followed by data analysis through visualization. It culminates in a comprehensive analysis of the dataset, giving us the chance to examine the data visually and understand the relationships between the variables.

Overview
In this project, we analyze a dataset about books that was collected from Goodreads. The dataset includes details such as book titles, authors, ratings, page counts, and more. The objective of the assignment is to explore functions for data cleaning and exploratory data analysis and to create compelling visualizations. We also explore functions for basic statistics by computing population and sample statistics. The aim is to draw insights from the data through visualization and statistics.

Key Findings
1. Data Processing
a. This step involves loading the packages and the data. We loaded tidyverse with p_load(tidyverse), where p_load() comes from the pacman package, and read the CSV file with read.csv(), a base R function that is available without loading any package. To work with dates and times we loaded library(lubridate). For data cleaning we loaded dplyr with library(dplyr) (it is also attached as part of tidyverse) and janitor with p_load(janitor). For data visualization we loaded library(ggplot2), library(plotly), library(ggQC), and library(ggthemes).
2. Data Cleaning
a. This assignment placed a strong focus on data cleaning so that the data was easy to work with. We first cleaned the column names with clean_names(). To parse the dates and extract the publication year we used the mdy() and year() functions.
3. Data Manipulation
a. For data manipulation, we used functions such as c(), which combines values into a vector, to pick out the data we require. We also used the select(), filter(), arrange(), group_by(), and mutate() functions. Data manipulation made the analysis clearer and kept the data focused on what was needed.
4. Statistical Analysis
a. This assignment uses descriptive statistics such as the mean, computed with mean(), which was introduced previously. It also introduced several new ways to conduct statistical analysis, such as writing custom R functions with function() to compute the average, population variance, and population standard deviation of book ratings. Additionally, we explored sample statistics by drawing three random samples of 100 books from the dataset and computing the mean, variance, and standard deviation of each sample. This allowed us to compare the samples with the population.
Fig 1. Input to create custom functions
Fig 2. Output for creating own functions
5. Data Visualization
a. Many functions in this assignment supported data visualization. We used glimpse() to view the dataset in a transposed layout, with one column per line, and ggplot() to draw scatter plots and histograms to the required specifications. These were effective tools for visualizing the data and gaining insights to support recommendations.
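The cleaning and manipulation steps listed in the key findings can be sketched as follows. This is a minimal illustration, not the report's actual code: the small inline data frame stands in for the Goodreads CSV, and the column names (Book Title, Average Rating, Publication Date) are assumptions.

```r
library(dplyr)
library(janitor)
library(lubridate)

# Hypothetical stand-in for the Goodreads file; real column names may differ.
books_raw <- data.frame(
  "Book Title" = c("A", "B", "C"),
  "Average Rating" = c(4.1, 3.6, 3.9),
  "Publication Date" = c("9/16/2006", "11/1/1987", "3/4/2015"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

books <- books_raw %>%
  clean_names() %>%                              # "Book Title" becomes book_title
  mutate(
    publication_date = mdy(publication_date),    # parse month/day/year strings
    year = year(publication_date)                # extract the publication year
  ) %>%
  filter(year >= 1990, year <= 2020) %>%         # keep books from 1990 to 2020
  arrange(desc(average_rating)) %>%              # highest rated first
  select(book_title, average_rating, year)       # keep only the needed columns

print(books)
```

Chaining the verbs with the pipe keeps each step on its own line, which makes it easy to see where rows or columns are dropped.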
The key findings demonstrate how various operations and functions in R programming can help with data analysis to extract meaningful information from datasets.

Key Visualizations

Question 3 (Data Analysis)
Create a rating histogram with the following criteria:
– The y-axis is labeled “Number of Books.”
– The x-axis is labeled “Rating.”
– The title of the graph is “Histogram of Book Ratings.”
– The graph is filled with the color “red.”
– Set a binwidth of .25.
– Use theme_bw().
Fig 3. Input to create a histogram
Fig 4. Output, the histogram
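A sketch of how the Question 3 criteria might translate into ggplot2 code. The data frame name books and its rating column are assumptions, and the ratings here are simulated only so the example runs on its own; the report's actual input is the one shown in Fig 3.

```r
library(ggplot2)

# Simulated ratings as a stand-in for the real data (assumption).
set.seed(1)
books <- data.frame(rating = pmin(pmax(rnorm(500, mean = 4, sd = 0.3), 0), 5))

rating_hist <- ggplot(books, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "red") +   # bin width and fill color
  labs(
    title = "Histogram of Book Ratings",
    x = "Rating",
    y = "Number of Books"
  ) +
  theme_bw()
```

Calling print(rating_hist) draws the plot.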

Question 4 (Data Analysis)
Create a boxplot of the number of pages per book in the dataset with the following requirements:
– The boxplot is horizontal.
– The x-axis is labeled “Pages.”
– The title is “Box Plot of Page Counts.”
– Fill the boxplot with the color magenta.
– Use the theme theme_economist from the ggthemes package.
Fig 5. Input to create a box plot
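One way the Question 4 requirements could be met is sketched here. The pages column and its values are hypothetical. Mapping the variable to the x aesthetic is what makes the boxplot horizontal in recent versions of ggplot2.

```r
library(ggplot2)
library(ggthemes)

# Hypothetical page counts standing in for the real dataset.
books <- data.frame(pages = c(120, 250, 310, 336, 400, 452, 512, 640, 900, 1200))

page_box <- ggplot(books, aes(x = pages)) +   # pages on x gives a horizontal box
  geom_boxplot(fill = "magenta") +
  labs(title = "Box Plot of Page Counts", x = "Pages") +
  theme_economist()
```

Calling print(page_box) draws the plot.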
Fig 6. Output - the box plot

Question 6 (Data Analysis)
Using the data frame constructed in the prior problem, create a Pareto Chart with an ogive of cumulative counts formatted with the following additional criteria:
– The bars are filled with the color cyan.
– The x-axis label is “Publisher.”
– The y-axis label is “Number of Books.”
– The title is “Pareto and Ogive of Publisher Book Counts (1990 - 2020).”
– Use the theme theme_clean().
– Rotate the x-axis labels by 45 degrees (consider the ggeasy package).
Fig 7. Input to create a Pareto Chart
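The idea behind a Pareto chart with an ogive can be sketched with plain ggplot2 layers: bars sorted by count, plus a line of cumulative counts. This is not the ggQC approach the report's package list suggests, and it rotates the labels with theme() rather than ggeasy. The publisher_counts data frame and its values are hypothetical.

```r
library(ggplot2)
library(ggthemes)

# Hypothetical publisher counts standing in for the prior problem's data frame.
publisher_counts <- data.frame(
  publisher = c("Pub A", "Pub B", "Pub C", "Pub D"),
  n_books = c(120, 310, 75, 200),
  stringsAsFactors = FALSE
)

# Sort by descending count and fix that order on the x-axis.
publisher_counts <- publisher_counts[order(-publisher_counts$n_books), ]
publisher_counts$publisher <- factor(
  publisher_counts$publisher,
  levels = publisher_counts$publisher
)

# Cumulative counts form the ogive.
publisher_counts$cum_count <- cumsum(publisher_counts$n_books)

pareto <- ggplot(publisher_counts, aes(x = publisher)) +
  geom_col(aes(y = n_books), fill = "cyan") +
  geom_line(aes(y = cum_count, group = 1)) +
  geom_point(aes(y = cum_count)) +
  labs(
    title = "Pareto and Ogive of Publisher Book Counts (1990 - 2020)",
    x = "Publisher",
    y = "Number of Books"
  ) +
  theme_clean() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

The bars and the cumulative line share one count axis here, so the ogive ends at the total number of books.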
Fig 8. Output - the Pareto chart

Question 7 (Data Analysis)
Create a scatter plot of pages vs. rating for the books data frame with the following requirements:
– Color the points based on the year of publication.
– The x-axis is labeled “Pages.”
– The y-axis is labeled “Rating.”
– The graph is titled “Scatter Plot of Pages vs. Rating.”
– Use the theme theme_tufte().
Fig 9. Input to create a scatter plot
Fig 10. Output - the scatter plot

Question 9 (Data Analysis)
Create a line plot from this data frame with points representing the counts per year from 1990 - 2020. Color the points for each year with the average rating. Format with the following specifications:
– The graph is titled “Total Number of Books Rated Per Year.”
– The theme is theme_excel_new().
Fig 11. Input to create a line plot
Fig 12. Output - the line plot

Highlighted Question

Question 11
Consider the complete dataset of books to be the population you are analyzing. Compute population stats for the average, variance, and standard deviation of the book rating.
Fig 13. Input
Fig 14. Output

Question 12
Create three samples of size 100 from the books data frame. Compute sample statistics for mean, variance, and standard deviation of the book rating. Compare these results with the population stats in your report.
Fig 15. Input
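A sketch of what the custom functions and the three samples might look like. The function names and the simulated ratings vector are assumptions. The population variance divides by N, whereas R's built-in var() and sd() divide by n - 1, which suits the sample statistics.

```r
# Custom population statistics (hypothetical names).
pop_mean <- function(x) sum(x) / length(x)
pop_var <- function(x) sum((x - pop_mean(x))^2) / length(x)   # divide by N
pop_sd <- function(x) sqrt(pop_var(x))

# Simulated ratings standing in for the full books dataset (assumption).
set.seed(42)
ratings <- round(runif(1000, min = 3, max = 5), 2)

population_stats <- c(
  mean = pop_mean(ratings),
  var = pop_var(ratings),
  sd = pop_sd(ratings)
)

# Three random samples of 100 ratings, summarized with the built-in
# sample statistics (var() and sd() divide by n - 1).
sample_stats <- sapply(1:3, function(i) {
  s <- sample(ratings, 100)
  c(mean = mean(s), var = var(s), sd = sd(s))
})

print(population_stats)
print(sample_stats)
```

Each column of sample_stats holds one sample, so it can be read side by side with population_stats.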
Fig 16. Output

Comparison of Sample Stats with Population Stats

Statistic            Population   Sample 1   Sample 2   Sample 3
Average              3.98         3.99       3.97       3.97
Variance             0.0963       0.103      0.0955     0.0728
Standard Deviation   0.31         0.321      0.309      0.270

Average: The averages of all three samples are very close to the population average, within 0.01 of it.
Variance: Samples 1 and 2 are close to the population variance, whereas Sample 3 is noticeably lower. Its ratings are less spread out than those of the population, a difference that random sampling can produce and one worth keeping in mind.
Standard Deviation: As with the variance, Samples 1 and 2 have a standard deviation similar to the population's, while the standard deviation of Sample 3 is somewhat lower than the population standard deviation. This again shows that the ratings in Sample 3 are clustered more tightly around the mean.
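As a quick consistency check on the figures reported above, the standard deviation should equal the square root of the variance in every column. The sketch below uses only the reported values; the data frame and column names are illustrative.

```r
# Values copied from the comparison table in this report.
reported <- data.frame(
  group = c("Population", "Sample 1", "Sample 2", "Sample 3"),
  variance = c(0.0963, 0.103, 0.0955, 0.0728),
  sd = c(0.31, 0.321, 0.309, 0.270),
  stringsAsFactors = FALSE
)

# Standard deviation implied by each reported variance.
reported$sd_from_variance <- round(sqrt(reported$variance), 3)

# How far each variance sits from the population variance.
reported$var_diff_from_pop <- reported$variance - reported$variance[1]

print(reported)
```

The implied standard deviations agree with the reported ones to three decimal places, and Sample 3 shows the largest gap from the population variance.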

Question 13
Create one or more additional visualizations based on the existing data or additional analysis that you perform. For this
Fig 17. Input for two visualizations
Fig 18. Output - Visualization 1
Fig 19. Output - Visualization 2
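The two additional visualizations rest on a per-author summary of book counts and average ratings. A minimal sketch of that summary and one bar chart follows. The author and rating column names and the placeholder values are assumptions; the real input is the one shown in Fig 17.

```r
library(dplyr)
library(ggplot2)

# Placeholder data; the real analysis uses the Goodreads books data frame.
books <- data.frame(
  author = c("Author A", "Author A", "Author A", "Author B", "Author B", "Author C"),
  rating = c(4.2, 4.0, 4.4, 3.8, 3.6, 4.5),
  stringsAsFactors = FALSE
)

# One row per author with the number of books and the mean rating.
author_stats <- books %>%
  group_by(author) %>%
  summarise(book_count = n(), avg_rating = mean(rating)) %>%
  arrange(desc(book_count))

# Bar chart of book counts, with the largest count at the top.
count_plot <- ggplot(author_stats, aes(x = reorder(author, book_count), y = book_count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top Authors with Most Books", x = "Author", y = "Number of Books")
```

Swapping book_count for avg_rating in the plot gives the second chart from the same summary table.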

Executive Summary
The Analysis of Top Authors' Book Count and Average Rating

Introduction
This analysis focuses on gaining insight into the contribution of authors and the quality of their work. To conduct it, we created two visualizations based on two metrics: the number of books per author and the corresponding average rating.

Findings
The first graph, "Top Authors with Most Books," shows the top authors by the number of books they have published. Karen Kingsbury is at the top with the most books published, followed by Nora Roberts and Mercedes Lackey.
The second graph, "Top Authors Average Rating," focuses on the perceived quality of the authors' work, as reflected by the average rating of their books. Karen Kingsbury is at the top again with the highest rating. However, CLAMP William Flanagan, although producing fewer works, also has a high average rating, as does Terry Pratchett, indicating a strong reader response and consistent quality in their writing.

Conclusion
Comparing both visualizations, it is evident that Karen Kingsbury has not only delivered books consistently but has also maintained great quality and received high ratings. We can also infer that publishing more books does not guarantee high ratings. Although some authors achieve both high volume and high quality, others earn high ratings even though they have published fewer books.

Conclusion
The third project has given us a better understanding of R programming, statistics, and, most importantly, data visualization. The assignment gave us a chance to perform data manipulation and build visualizations, developing in-depth knowledge of data analysis. The analysis reveals various patterns and relationships within the data, providing a foundation for drawing conclusions, making decisions, and carrying out further analysis.