Duvuru-Project3-Report-1

Introduction to Problem Solving with R
Module 3 (Project 3)
Sudha Duvuru
MPS Analytics, College of Professional Studies, Northeastern University
ALY6000: Introduction to Analytics
Professor Kayal Chandrasekaran
October 7th, 2023
Introduction:

In this project, I analyzed data from www.goodreads.com, cleaned it, and created various visualizations in R. I drew conclusions, explored statistical concepts, and submitted the required files 'Lastname_Project3.R' and 'Lastname_Project3_Report.pdf' following the instructions. I set up the project, downloaded the 'books.csv' file, and cleaned the dataset. For the analysis, I used the 'glimpse' and 'summary' functions and created a rating histogram and a boxplot. I also grouped the data by 'publisher' and refined it. I created a Pareto chart and a scatter plot, and visualized book counts per year. I calculated custom statistics and compared population statistics with sample statistics. I explored additional visualizations, documented my findings in an executive summary, and ensured the code was clean before submission.

Key findings:

- Data Source: I analyzed data from www.goodreads.com, obtained from www.kaggle.com and adapted for our project.
- Data Cleaning: I cleaned the data by standardizing column names, converting date formats, and removing irrelevant columns.
- Analysis Overview: I used R to create various visualizations and explore statistical concepts such as samples, populations, and measures of dispersion and central tendency.
- Data Visualization: I generated a rating histogram, a horizontal boxplot, a Pareto chart, and a scatter plot to visualize relationships between data points.

[Figures: Histogram; Boxplot; Pareto chart]
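As a minimal sketch of how the publisher grouping and Pareto chart mentioned in the introduction could be reproduced, the example below counts books per publisher, computes relative and cumulative frequencies, and draws the bars with a cumulative line. The toy data, the column names `book_count`, `rel_freq`, and `cum_freq`, and the use of base R graphics are assumptions made for illustration so the sketch has no package dependencies; the report's own charts may have been built differently.

```r
# Hypothetical toy data standing in for the cleaned books data frame.
# Publisher names and counts are made up for illustration only.
books <- data.frame(
  publisher = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
  stringsAsFactors = FALSE
)

# Count books per publisher and sort from most to fewest books.
counts <- sort(table(books$publisher), decreasing = TRUE)

book_counts <- data.frame(
  publisher  = names(counts),
  book_count = as.integer(counts)
)

# Relative frequency per publisher and its running (cumulative) total.
book_counts$rel_freq <- book_counts$book_count / sum(book_counts$book_count)
book_counts$cum_freq <- cumsum(book_counts$rel_freq)

print(book_counts)

# Pareto chart: bars show counts, the line shows cumulative counts.
bar_mids <- barplot(
  book_counts$book_count,
  names.arg = book_counts$publisher,
  ylim = c(0, sum(book_counts$book_count)),
  xlab = "Publisher",
  ylab = "Number of books",
  main = "Book counts by publisher (toy data)"
)
lines(bar_mids, cumsum(book_counts$book_count), type = "b")
```

Sorting before computing the cumulative frequency is what makes the chart a Pareto chart: the line then shows how quickly the largest publishers account for most of the books.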
[Figure: Scatter plot]

- Publisher Insights: I grouped the data by publisher and summarized each publisher's book count and relative frequency.
- Yearly Trends: I examined book counts and average ratings per year to identify trends over time.
- Custom Functions: I created custom functions to calculate the average, population variance, and population standard deviation.
- Comparison: I compared population statistics with sample statistics obtained from three samples of 100 books each.
- Additional Visualizations: I explored extra visualizations to gain deeper insights into the data.

Package Setup: I installed and used the 'ggplot2' package to create graphs from the data.
Counting Authors: I counted how many books each author wrote and stored the result in a data frame called 'author_counts'.
Naming Columns: I gave the data frame's columns more meaningful names.
Sorting: I sorted the authors by the number of books they have written, with the most prolific first.
Creating a Graph: I used 'ggplot' to make a bar graph showing the top 10 authors and how many books they have written. The bars are red.
Labels and Title: I labeled the graph with "Author" on the x-axis and "Number of Books" on the y-axis, and titled it "Top 10 Most Common Authors."
Styling: I gave the graph a minimal design.
Improving Readability: I flipped the graph so it is easier to read.

Key Statistics:

Data Loading and Cleaning: I loaded the dataset from a CSV file named 'books.csv' and standardized the column names. Then I converted a date column to the correct date format and extracted the year from it.
Data Filtering: I filtered the dataset to include books published between 1990 and 2020, removed unnecessary columns, and kept books with fewer than 1200 pages.
Data Exploration: I used the 'glimpse' function to understand the dataset's structure and 'summary' to get an overview of its statistics.
Data Visualization: I created a histogram of book ratings and a boxplot of page counts. I also generated a Pareto chart to show the top publishers.
Scatter Plot: I made a scatter plot of pages vs. ratings, color-coded by publication year.
Yearly Analysis: I summarized the dataset by year, calculating the total number of books and the average rating for each year, and visualized the result with a line plot.
Custom Functions: I created custom functions to calculate the average, population variance, and population standard deviation of book ratings.
Statistical Analysis: I computed population statistics for book ratings and compared them with sample statistics from three random samples.
Additional Visualizations: I generated a bar chart of the top 10 most common authors based on the number of books they have written.
Author Analysis: I identified the top 10 most common authors based on the number of books they have written and visualized this information in a bar chart.
This analysis provides insights into prolific authors in the dataset.

These statistics provide a concise summary of the data preparation, analysis, and visualization steps in the R code. They offer insights into data cleanliness, filtering effectiveness, analytical findings, and the graphical representation of information, making it easier to grasp the key outcomes of the data analysis.
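The custom functions and the population-versus-sample comparison summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the submitted script: the function names `my_average`, `pop_variance`, and `pop_sd` are hypothetical, the ratings are simulated rather than read from 'books.csv', and the sample statistics use R's built-in `var()` and `sd()`, which divide by n - 1.

```r
# Hypothetical helper names; written without mean(), var(), or sd().
my_average <- function(x) sum(x) / length(x)

# Population variance divides by N, unlike var(), which divides by N - 1.
pop_variance <- function(x) sum((x - my_average(x))^2) / length(x)

# Population standard deviation is the square root of the population variance.
pop_sd <- function(x) sqrt(pop_variance(x))

# Simulated ratings standing in for the rating column of books.csv
# (assumption: the real values come from the cleaned dataset).
set.seed(42)
ratings <- round(runif(5000, min = 1, max = 5), 2)

population_stats <- c(
  mean = my_average(ratings),
  var  = pop_variance(ratings),
  sd   = pop_sd(ratings)
)

# Three random samples of 100 ratings each, summarized with the
# built-in sample formulas.
sample_stats <- sapply(1:3, function(i) {
  s <- sample(ratings, size = 100)
  c(mean = mean(s), var = var(s), sd = sd(s))
})
colnames(sample_stats) <- paste0("sample_", 1:3)

# Side-by-side table: one column for the population, one per sample.
print(round(cbind(population = population_stats, sample_stats), 3))
```

Each sample's statistics will differ slightly from the population values and from each other; that sampling variability is what the comparison is meant to show.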
Conclusion:

In conclusion, I analyzed a dataset of books, cleaned and prepared the data, and performed various data visualizations and statistical calculations. Key findings include insights into book ratings, page counts, popular authors, and publishing trends. This analysis helps us understand the dataset better and can guide further exploration or decision-making related to books and publishing. I also compared different regions and challenged myself to see how wealth relates to happiness.

References:

Google:
- Data analysis with R: https://www.coursera.org/learn/data-analysis-r
- Data analysis using R: https://www.geeksforgeeks.org/data-analysis-using-r/

Books:
- Roger D. Peng, "Exploratory Data Analysis with R" (pages 1-70)
- Hadley Wickham, "ggplot2: Elegant Graphics for Data Analysis" (pages 1-50)