Duvuru-Project3-Report-1
Introduction to Problem solving with R
Module 3(project 3)
Sudha Duvuru
MPS Analytics, College of Professional Studies, Northeastern University
ALY6000 : Introduction to Analytics
Professor Kayal Chandrasekaran
October 7th,2023
Introduction:
In this project, I analyzed data from www.goodreads.com, cleaned it, and created several
visualizations in R. I drew conclusions, explored statistical concepts, and submitted the
required files, 'Lastname_Project3.R' and 'Lastname_Project3_Report.pdf', following the
instructions.
I set up the project, downloaded the 'books.csv' file, and cleaned the dataset. For the analysis,
I used the 'glimpse' and 'summary' functions and created a rating histogram and a boxplot. I
also grouped the data by 'publisher' and refined it.
I built a Pareto Chart and a scatter plot, and visualized book counts per year. I calculated
custom statistics and compared population statistics with sample statistics.
I explored additional visualizations, documented my findings in an executive summary, and
cleaned up the code before submission.
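The cleaning steps described above can be sketched roughly as follows. This is a minimal base-R illustration, not the submitted script: the small inline data frame stands in for 'books.csv', the column names and the month/day/year date format are assumptions, and 'str' is used here in place of 'glimpse' so that no extra package is needed.

```r
# Toy stand-in for books.csv; column names and date format are assumptions.
books <- data.frame(
  "Title" = c("A", "B", "C", "D"),
  "Average Rating" = c(4.1, 3.8, 4.5, 3.2),
  "Num Pages" = c(320, 1500, 210, 450),
  "Publication Date" = c("9/16/2006", "11/1/2004", "4/30/1985", "8/3/2010"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Standardize column names: lower case, runs of other characters -> underscore.
names(books) <- gsub("[^a-z0-9]+", "_", tolower(names(books)))

# Convert the date column to Date and extract the publication year.
books$publication_date <- as.Date(books$publication_date, format = "%m/%d/%Y")
books$year <- as.integer(format(books$publication_date, "%Y"))

# Keep books published 1990-2020 (treated as inclusive here) with fewer than 1200 pages.
books_clean <- subset(books, year >= 1990 & year <= 2020 & num_pages < 1200)

# Inspect the structure and the rating distribution.
str(books_clean)
summary(books_clean$average_rating)
```

With the real file, the inline data frame would be replaced by a call such as read.csv("books.csv"), and the date format string would need to match the file.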
Key findings:
-Data Source: I analyzed data from www.goodreads.com, obtained from www.kaggle.com and
adapted for our project.
-Data Cleaning: I cleaned the data by standardizing column names, converting date
formats, and removing irrelevant columns.
-Analysis Overview: I used R to create various visualizations and explore statistical concepts
like samples, populations, and measures of dispersion and central tendency.
-Data Visualization: I generated a rating histogram, a horizontal boxplot, a Pareto Chart, and
a scatter plot to visualize relationships between data points.
Histogram:
Boxplot:
Pareto chart:
Scatter plot:
-Publisher Insights: I grouped data by publisher, summarizing their book counts and relative
frequencies.
-Yearly Trends: I examined book counts and average ratings per year, identifying trends over
time.
-Custom Functions: I created custom functions for average, population variance, and
population standard deviation calculations.
-Comparison: I compared population statistics with sample statistics obtained from three
samples of 100 books each.
-Additional Visualizations: I explored extra visualizations to gain deeper insights into the
data.
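The custom functions and the population-versus-sample comparison mentioned above could look something like the sketch below. The function names, the seed, and the simulated ratings are illustrative assumptions; the report's actual code is not shown in this document. The key point is that the population variance divides by N, whereas R's built-in 'var' and 'sd' divide by N - 1.

```r
# Arithmetic mean.
my_average <- function(x) sum(x) / length(x)

# Population variance: divide by N (var() divides by N - 1).
pop_variance <- function(x) sum((x - my_average(x))^2) / length(x)

# Population standard deviation.
pop_sd <- function(x) sqrt(pop_variance(x))

# Simulated ratings standing in for the cleaned books data (assumption).
set.seed(42)
ratings <- round(runif(5000, min = 1, max = 5), 2)

population_stats <- c(
  mean = my_average(ratings),
  var  = pop_variance(ratings),
  sd   = pop_sd(ratings)
)

# Three random samples of 100 ratings each, summarized with the sample formulas.
sample_stats <- t(sapply(1:3, function(i) {
  s <- sample(ratings, 100)
  c(mean = mean(s), var = var(s), sd = sd(s))
}))
rownames(sample_stats) <- paste0("sample_", 1:3)

print(population_stats)
print(sample_stats)
```

Each sample row should land near the population values without matching them exactly, which is the comparison the report describes.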
Package Setup: I installed and used the 'ggplot2' package to plot our data.
Counting Authors: I counted how many books each author wrote and stored the result in a
data frame called 'author_counts'.
Naming Columns: I gave the data frame's columns more meaningful names.
Sorting: I sorted the authors by the number of books they have written, in descending
order.
Creating a Graph: I used 'ggplot' to make a bar graph of the top 10 authors and the number
of books each has written. The bars are red.
Labels and Title: I labeled the x-axis "Author" and the y-axis "Number of Books", and titled
the graph "Top 10 Most Common Authors."
Styling: I applied a minimal theme.
Improving Readability: I flipped the axes so that the author names are easier to read.
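The steps above can be sketched as follows. This is an illustrative reconstruction under assumptions, not the original script: the 'authors' vector and the column names 'author' and 'book_count' are hypothetical, and the plot is drawn only if 'ggplot2' is installed.

```r
# Hypothetical author column; the real values would come from the books data.
authors <- c("Author A", "Author B", "Author A", "Author C", "Author A", "Author B")

# Count books per author and store the result in a data frame.
author_counts <- as.data.frame(table(authors), stringsAsFactors = FALSE)

# Give the columns more meaningful names (these names are assumptions).
names(author_counts) <- c("author", "book_count")

# Sort by book count, largest first, and keep at most the top 10.
author_counts <- author_counts[order(-author_counts$book_count), ]
top_authors <- head(author_counts, 10)

# Red bar chart with labels, a minimal theme, and flipped axes.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top_authors, aes(x = reorder(author, book_count), y = book_count)) +
    geom_col(fill = "red") +
    labs(x = "Author", y = "Number of Books", title = "Top 10 Most Common Authors") +
    theme_minimal() +
    coord_flip()
  print(p)
}
```

Reordering the authors by count inside 'aes' keeps the bars sorted after the flip, so the most prolific author appears at the top.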
Key Statistics:
Data Loading and Cleaning: I loaded the dataset from a CSV file named 'books.csv' and
standardized the column names. Then I converted the date column to a proper date format and
extracted the year from it.
Data Filtering: I filtered the dataset to include only books published between 1990 and 2020,
removed unnecessary columns, and kept books with fewer than 1200 pages.
Data Exploration: I used the 'glimpse' function to inspect the dataset's structure and
'summary' to get an overview of its statistics.
Data Visualization: I created a histogram of book ratings and a boxplot of page
counts. I also generated a Pareto Chart of the top publishers.
Scatter Plot: I made a scatter plot of pages vs. ratings, color-coded by publication year.
Yearly Analysis: I summarized the dataset by year, calculating the total number of books and
the average rating for each year, and visualized the result with a line plot.
Custom Functions: I created custom functions to calculate the average, population variance,
and population standard deviation of book ratings.
Statistical Analysis: I computed population statistics for book ratings and compared them
with sample statistics from three random samples.
Additional Visualizations: I generated a bar chart of the top 10 most common
authors based on the number of books they have written.
Author Analysis: I identified the top 10 most common authors based on the number of books
they have written and visualized the result in a bar chart. This analysis shows which authors
are the most prolific in the dataset.
These statistics provide a concise summary of the data preparation, analysis, and visualization
processes in the R code. They offer insights into data cleanliness, filtering effectiveness,
analytical findings, and the graphical representation of information, making it easier to grasp
the key outcomes of the data analysis.
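As one concrete illustration of the publisher summary behind the Pareto Chart, the sketch below computes book counts, relative frequencies, and cumulative totals per publisher. The publisher names and the column names are made up for the example; only the general approach (count, sort descending, accumulate) is taken from the report.

```r
# Hypothetical publisher column standing in for the cleaned books data.
publishers <- c(rep("Pub X", 5), rep("Pub Y", 3), rep("Pub Z", 2))

# Count books per publisher, largest first.
counts <- sort(table(publishers), decreasing = TRUE)

publisher_stats <- data.frame(
  publisher = names(counts),
  book_count = as.integer(counts),
  stringsAsFactors = FALSE
)

# Relative frequency of each publisher and running totals for the Pareto line.
publisher_stats$rel_freq <- publisher_stats$book_count / sum(publisher_stats$book_count)
publisher_stats$cum_count <- cumsum(publisher_stats$book_count)
publisher_stats$cum_freq <- cumsum(publisher_stats$rel_freq)

print(publisher_stats)
```

A Pareto chart then plots 'book_count' as bars in this order and 'cum_count' (or 'cum_freq') as a line over them.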
Conclusion:
In conclusion, I analyzed a dataset of books, cleaned and prepared the data, and
produced several visualizations and statistical calculations. Key findings include
insights into book ratings, page counts, popular authors, and publishing trends. This analysis
helps us understand the dataset better and can guide further exploration or decision-making
related to books and publishing.