Project3_Report

School: Northeastern University
Course: 6000
Subject: Information Systems
Date: Jun 10, 2024
Pages: 19

ALY 6000: Introduction to Analytics
Module 3 Project Report: Exploring Visualization

INTRODUCTION

In Module 3 of ALY 6000 we use a data set from goodreads.com. In this project we learn how to clean the data set by removing null values and columns that are not needed for the analysis, which makes the data set concise and readable. We then build a number of visualizations to draw meaningful insights from the data.

1. Download the file books.csv from Canvas, read the dataset into R, and store it in a variable called books.

OUTPUT:

Cleaning the data set

1. The janitor package contains helpful functions that perform basic maintenance of your data frame. Use the clean_names function to standardize the names in your data frame.

OUTPUT:
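
The original output for these two steps is not reproduced here. A minimal sketch of reading a CSV file and standardizing its column names might look like the following. It assumes the janitor package is installed, and it writes a tiny hypothetical stand-in for books.csv to a temporary file so that it runs on its own; the column names in the stand-in are invented for illustration.

```r
library(janitor)

# Hypothetical two-row stand-in for books.csv, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Book Title,firstPublishDate,Num Pages",
  "Example A,01/15/1995,320",
  "Example B,11/02/2008,210"
), csv_path)

# check.names = FALSE keeps the raw headers so clean_names has work to do
books <- read.csv(csv_path, check.names = FALSE)

# Convert every column name to lower snake_case
books <- clean_names(books)
names(books)
```

With the real file, csv_path would simply be the location of the downloaded books.csv.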

2. The lubridate package contains helpful functions to convert dates stored as strings into Date objects. Convert the first_publish_date column to type Date using the mdy function.

OUTPUT:

    books$first_publish_date <- mdy(books$first_publish_date)

3. Using the year function in lubridate, extract the year from the first_publish_date column and place it in a new column named year.

OUTPUT:

    books$year <- year(books$first_publish_date)

4. Filter your dataset to only include books published between 1990 and 2020 (inclusive).

OUTPUT:
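
The code for the filtering step is not shown in the report. As a sketch only, assuming dplyr and lubridate are installed and using a small hypothetical data frame in place of books, the date conversion, year extraction, and inclusive filter could be chained like this:

```r
library(lubridate)
library(dplyr)

# Hypothetical stand-in for the books data frame
books <- data.frame(
  title = c("A", "B", "C", "D"),
  first_publish_date = c("06/01/1989", "01/01/1990", "12/31/2020", "03/15/2021")
)

# Parse month/day/year strings into Date values, then pull out the year
books$first_publish_date <- mdy(books$first_publish_date)
books$year <- year(books$first_publish_date)

# Keep 1990 through 2020; both endpoints are included
books <- filter(books, year >= 1990, year <= 2020)
books
```

Inside filter, year refers to the column rather than the lubridate function, so the comparison applies row by row.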

5. Remove the following columns from the data set: publish_date, edition, characters, price, genres, setting, and isbn.

OUTPUT:
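
The code for this step is not shown in the report. One possible approach, assuming dplyr is installed, is to pass the unwanted names to select with a minus sign. The data frame below is a hypothetical stand-in that contains the seven columns named above plus two that should survive.

```r
library(dplyr)

# Hypothetical stand-in with the seven columns to drop and two to keep
books <- data.frame(
  title = c("A", "B"),
  publish_date = c("01/01/2000", "02/02/2010"),
  edition = c("First", "Second"),
  characters = c("x", "y"),
  price = c(9.99, 14.5),
  genres = c("Fantasy", "Fiction"),
  setting = c("London", "Paris"),
  isbn = c("111", "222"),
  pages = c(320, 210)
)

# A negative selection drops the listed columns and keeps everything else
books <- select(books, -c(publish_date, edition, characters, price,
                          genres, setting, isbn))
names(books)
```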

6. Keep only books that are fewer than 1200 pages.

OUTPUT:

7. Remove any rows that contain NAs.

OUTPUT:

    books <- na.omit(books)
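
The code for step 6 is not shown in the report. A minimal sketch of the page filter followed by the NA removal, assuming dplyr is installed and using a hypothetical stand-in data frame, might be:

```r
library(dplyr)

# Hypothetical stand-in for the books data frame
books <- data.frame(
  title  = c("A", "B", "C", "D", "E"),
  pages  = c(350, 1199, 1200, 1500, 400),
  rating = c(4.1, NA, 3.9, 4.5, 4.0)
)

# "Fewer than 1200" is a strict comparison, so a 1200-page book is dropped
books <- filter(books, pages < 1200)

# Drop any remaining row that has an NA in any column
books <- na.omit(books)
books
```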

Data Analysis

8. Use the glimpse function to produce a long view of the dataset.

OUTPUT:

9. Use the summary function to produce a breakdown of the statistics of the dataset.

OUTPUT:
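
The outputs of these two steps are not reproduced here. As an illustration, assuming dplyr is installed (it provides glimpse), both calls can be tried on a small hypothetical data frame:

```r
library(dplyr)

# Hypothetical stand-in for the cleaned books data frame
books <- data.frame(
  title  = c("A", "B", "C"),
  pages  = c(100, 200, 300),
  rating = c(3.5, 4.0, 4.5)
)

# One line per column: name, type, and the first few values
glimpse(books)

# Minimum, quartiles, median, mean, and maximum for each numeric column
summary(books)
```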

10. Create a rating histogram with the following criteria.
- The y-axis is labeled “Number of Books.”
- The x-axis is labeled “Rating.”
- The title of the graph is “Histogram of Book Ratings.”
- The graph is filled with the color “red.”
- Make the width of the bins .25.

OUTPUT:

    ggplot(data = books, aes(x = rating)) +
      geom_histogram(fill = "red", binwidth = 0.25) +
      labs(title = "Histogram of Book Ratings", x = "Rating", y = "Number of Books")

11. Create a boxplot of the number of pages per book in the dataset with the following requirements.
- The boxplot is horizontal.
- The x-axis is labeled “Pages.”
- The title is “Box Plot of Page Counts.”
- Fill the boxplot with the color red.

OUTPUT:

    ggplot(data = books, aes(x = pages)) +
      geom_boxplot(fill = "red") +
      labs(title = "Box Plot of Page Counts", x = "Pages")

12. Create a data frame from the books data frame that contains a count of the number of books by year in a column called total_books. Store this data frame in a variable named by_year.

OUTPUT:
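
The code for this step is not shown in the report. One way to build such a table, assuming dplyr is installed, is the count function with a custom name for the count column. The data frame below is a hypothetical stand-in for books.

```r
library(dplyr)

# Hypothetical stand-in for the books data frame
books <- data.frame(
  title = c("A", "B", "C", "D", "E", "F"),
  year  = c(1990, 1990, 1995, 2020, 2020, 2020)
)

# One row per year, with the number of books stored in total_books
by_year <- count(books, year, name = "total_books")
by_year
```

An equivalent form is group_by(year) followed by summarise(total_books = n()).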

13. Create a line plot with points from the by_year data frame, with the points representing the counts per year from 1990 to 2020.
- The graph is titled “Total Number of Books Rated Per Year.”

OUTPUT:

    ggplot(data = by_year, aes(x = year, y = total_books)) +
      geom_line() +
      geom_point() +
      labs(title = "Total Number of Books Rated Per Year", x = "Year")

14. Create a new data frame named book_publisher. This data frame should have a column named publisher with the names of each unique publisher and a column named book_count that contains the number of titles in the books data frame for each publisher.

OUTPUT:

15. Remove any publisher with fewer than 125 books.

OUTPUT:

    book_publisher <- book_publisher[book_publisher$book_count >= 125, ]
    View(book_publisher)
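
The code for step 14 is not shown in the report. As a sketch, assuming dplyr is installed and using a hypothetical stand-in for books, the book_publisher data frame that step 15 filters could be built like this:

```r
library(dplyr)

# Hypothetical stand-in for the books data frame
books <- data.frame(
  title     = c("A", "B", "C", "D", "E", "F"),
  publisher = c("Alpha", "Alpha", "Beta", "Alpha", "Gamma", "Beta")
)

# One row per unique publisher, with its number of titles in book_count
book_publisher <- count(books, publisher, name = "book_count")
book_publisher
```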

16. Order book_publisher by the total number of books in descending order.

OUTPUT:

    book_publisher <- arrange(book_publisher, desc(book_count))
    View(book_publisher)

17. Add a column to the book_publisher data frame named cum_counts with the cumulative sum of the book_count column.

OUTPUT:

    book_publisher <- mutate(book_publisher, cum_counts = cumsum(book_count))
    View(book_publisher)

18. Add a column to the book_publisher data frame named rel_freq with the relative frequency of the values in the book_count column.

OUTPUT:

    # Step 18: each publisher's share of the total book count
    book_publisher <- mutate(book_publisher, rel_freq = book_count / sum(book_count))
    View(book_publisher)

19. Add a column to the book_publisher data frame named cum_freq with the cumulative sum of the rel_freq column.

OUTPUT:

    book_publisher <- mutate(book_publisher, cum_freq = cumsum(rel_freq))
    View(book_publisher)

20. Make the publisher column into a factor with the levels defined by the current ordering of the publisher column:

OUTPUT:

21. Using the data frame constructed in the prior problem, create a Pareto Chart with an ogive of cumulative counts formatted with the following additional criteria:

OUTPUT:
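
Neither the factor conversion nor the Pareto chart code is shown in the report, and the formatting criteria for the chart are not listed. The following is therefore only a sketch under stated assumptions: dplyr and ggplot2 are installed, the data frame is a hypothetical stand-in, and the fill color, title, and axis labels are placeholders rather than the assignment's requirements.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in, sorted in descending order with cumulative counts
book_publisher <- data.frame(
  publisher  = c("Alpha", "Beta", "Gamma"),
  book_count = c(200, 300, 150)
) %>%
  arrange(desc(book_count)) %>%
  mutate(cum_counts = cumsum(book_count))

# Fix the factor levels to the current row order so bars are not re-sorted
# alphabetically when plotted
book_publisher$publisher <- factor(book_publisher$publisher,
                                   levels = book_publisher$publisher)

# Bars show the per-publisher counts; the line and points form the ogive
pareto <- ggplot(book_publisher, aes(x = publisher)) +
  geom_col(aes(y = book_count), fill = "steelblue") +
  geom_line(aes(y = cum_counts, group = 1)) +
  geom_point(aes(y = cum_counts)) +
  labs(title = "Pareto Chart of Books by Publisher (placeholder title)",
       x = "Publisher", y = "Number of Books")
print(pareto)
```

Setting group = 1 tells geom_line to connect the points across the discrete x-axis as a single series.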

22. Create one or more additional visualizations based on the existing data or additional analysis that you perform.

OUTPUT:

23. Write an executive summary report that contains an overview of your analysis, the visualizations you created with textual descriptions of key takeaways, and any key statistics that were computed in your analysis.

Visualization and Analysis of the Data Set “Books”: Finding the Top 10 Authors by Number of Books

EXECUTIVE SUMMARY REPORT

INTRODUCTION:

In this analysis we identify the top 10 authors with the most books published between 1990 and 2020.

DATA OVERVIEW:
- The data set I have used is a subset of the original data set “Books”.
- I have named the subset “total_books_by_author”. It has 2 columns: the authors and the number of books published by each, as seen in Table 1.
- From the data set “total_books_by_author” I have created a data frame that contains only the top 10 authors by number of books published, as seen in Table 2.

Table 1. All the Authors and the Number of Books Published by Each
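
The code that produced these tables is not included in the report. A minimal sketch of how total_books_by_author and a top 10 subset could be derived, assuming dplyr is installed, is shown here. The books data, the column names author and book_count, and the variable name top_10_authors are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in: 12 authors, where "Author 01" has 12 books,
# "Author 02" has 11, and so on down to "Author 12" with 1
author_names <- sprintf("Author %02d", 1:12)
books <- data.frame(
  title  = paste("Book", seq_len(sum(12:1))),
  author = rep(author_names, times = 12:1)
)

# Table 1 analogue: one row per author with the number of books
total_books_by_author <- count(books, author, name = "book_count")

# Table 2 analogue: the ten authors with the highest counts
top_10_authors <- total_books_by_author %>%
  arrange(desc(book_count)) %>%
  head(10)
top_10_authors
```

When such a table is plotted with ggplot2, mapping x to reorder(author, -book_count) instead of author would draw the bars in descending order of count rather than alphabetically.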

Table 2. Top 10 Authors with Maximum Number of Books

DATA VISUALIZATION:

Using a bar graph, I have presented the top 10 authors, with the authors on the x-axis and the number of books they have published on the y-axis.

Bar Graph of Top 10 Authors

KEY FINDINGS:
- From the above bar graph, we can conclude that Terry Pratchett is the author with the most publications, at 28.
- The author placed 10th overall in number of publications is Bonnie Bryant.
- The 10th-placed author has published a total of 22 books.
- The authors Orson Scott Card and R.A. Salvatore are placed 8th and 9th even though they have also published 22 books.
- The 2nd, 3rd and 4th ranking authors have each published 27 books, which is just 1 fewer than the 1st-placed author.
- Even though authors with the same number of publications are placed consecutively in the data frame, they are not placed consecutively in the bar graph, because R orders them alphabetically by default.

CONCLUSION:

In conclusion, the above analysis shows the top authors and the number of books they have published. The bar graph illustrates the book output of these authors, with the x-axis representing author names and the y-axis indicating the total number of books they have published.

CONCLUSION

This module has taught me the different ways in which data can be presented in visual form. I have learnt how to factorise columns, how the R language assigns levels, and the different kinds of graphs. I have learnt the difference between qualitative and quantitative data and the different and effective ways to present each. I have also learnt the different ways in which the graphs

CITATIONS

1. Kabacoff, Robert I. R in Action, Third Edition.
2. Bluman, Allan G. Elementary Statistics: A Step by Step Approach, Eighth Edition. McGraw-Hill Education.