Project 3 - Exploring Visualizations
ALY6000 71053 Introduction to Analytics SEC 27
Module 3
Prepared by: Anvita Vyas (NUID: 002962386)
For: Prof. Herath Gedara, Chinthaka Pathum Dinesh
Submission Date: 10 October 2023

Introduction

The third project of the course is an opportunity to visualize data using R. The project works with a dataset of books and involves data cleaning followed by analysis through visualization. It culminates in a comprehensive analysis of the dataset, giving us the chance to examine the data visually and understand the relationships between its variables.

Overview

In this project, we analyze a dataset about books collected from Goodreads. The dataset includes details such as book titles, authors, ratings, page counts, and more. The objective of the assignment is to practice functions for data cleaning and exploratory data analysis and to create compelling visualizations. We also explore functions for basic statistics by computing population and sample statistics. The overall aim is to draw insights from the data through visualization and statistics.

Key Findings

1. Data processing

a. This step covers loading the packages and the data. We loaded the tidyverse with p_load(tidyverse) (p_load() comes from the pacman package) and read the CSV file with read.csv(), which is part of base R. To work with dates and times we loaded lubridate with library(lubridate). For cleaning we loaded dplyr with library(dplyr) and janitor with p_load(janitor). For visualization we loaded library(ggplot2), library(plotly), library(ggQC), and library(ggthemes).

2. Data Cleaning

a. This assignment placed a strong focus on data cleaning so that the data would be easy to work with. We first standardized the column names with clean_names(). To parse the dates and extract the year we used the mdy() and year() functions.

3. Data Manipulation

a. For data manipulation, we used c() to combine the values we required into vectors, along with the select(), filter(), arrange(), group_by(), and mutate() functions. These steps made the analysis clearer and kept the data focused on what was needed.

4. Statistical Analysis

a. This assignment uses descriptive statistics such as the mean, computed with mean(), which was introduced previously. It also introduced several new ways to conduct statistical analysis. We wrote custom R functions with function() to compute the average, population variance, and population standard deviation of book ratings. We then explored sample statistics by drawing three random samples of 100 books from the dataset and computing the mean, variance, and standard deviation of each. This allowed us to compare the sample statistics with the population values.

Fig 1. Input to create custom functions

Fig 2. Output for creating own functions

5. Data Visualization

a. Several functions supported data visualization during the assignment. We used glimpse() to view the dataset in a transposed layout, with each column listed on its own row. We also used ggplot() to draw scatter plots and histograms with customized settings. These were useful tools for visualizing the data and gaining insights to support recommendations.
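As an illustration of the custom functions described under Statistical Analysis, the following is a minimal sketch in base R. It is not the code from the report, which appears only in Fig 1. The function names pop_mean, pop_var, and pop_sd and the simulated ratings vector are assumptions made for this example; the actual analysis would use the rating column of the books dataset.

```r
# Population statistics divide by N, the number of values.
# (Hypothetical helper names; the report's own functions are not shown.)
pop_mean <- function(x) sum(x) / length(x)
pop_var  <- function(x) sum((x - pop_mean(x))^2) / length(x)
pop_sd   <- function(x) sqrt(pop_var(x))

# Simulated stand-in for the book ratings (assumption: 1000 ratings from 1 to 5).
set.seed(42)
ratings <- round(runif(1000, min = 1, max = 5), 2)

# Statistics for the whole population of ratings.
population_stats <- c(
  mean = pop_mean(ratings),
  var  = pop_var(ratings),
  sd   = pop_sd(ratings)
)

# Three random samples of 100 ratings each, summarized with the
# built-in sample statistics (var() and sd() divide by n - 1).
sample_stats <- sapply(1:3, function(i) {
  s <- sample(ratings, size = 100)
  c(mean = mean(s), var = var(s), sd = sd(s))
})
colnames(sample_stats) <- paste0("sample_", 1:3)

print(population_stats)
print(sample_stats)
```

The population versions are written by hand because R's built-in var() and sd() use the n - 1 denominator, which is the appropriate choice for the three samples but not for the full population. Placing the two sets of results side by side shows how closely each sample of 100 approximates the population values.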
