Vyas_Project2_Report

School: Northeastern University
Course: 6000
Subject: Computer Science
Date: Feb 20, 2024

Project 2 - Exploratory Data Analysis (EDA) of Two Data Sets
ALY6000 71053 Introduction to Analytics SEC 27
Module 2
Prepared by: Anvita Vyas (NUID: 002962386)
For: Prof. Herath Gedara, Chinthaka Pathum Dinesh
Submission Date: 03 October 2023

Introduction
This is the second project of the course and an opportunity to explore R programming in more depth. The project involves working with two distinct datasets and carrying out a range of data manipulation and visualization tasks. It culminates in an analysis of both datasets, from which we draw recommendations.

Overview
In the first part of the assignment, we explore the 2015 dataset. It includes metrics such as happiness scores, economic indicators (GDP), and social factors (Freedom) that are used to assess quality of life in countries across the globe. The analysis aims to explore and understand the dataset, prepare it for further analysis, and derive insights on happiness and freedom, their correlation, and GDP differences between regions.
In the second part of the assignment, we examine batting statistics from the 1986 Major League Baseball season, with attributes such as home runs, hits, runs scored, and games played. The analysis explores player performance, calculates key statistics, and identifies insights.

Key Findings
The assignment focused on understanding how data can be analyzed to measure freedom, trust, and other aspects of human life. The key findings were:
1. Data processing
a. This step covers loading and cleaning the data. We first loaded the tidyverse package with p_load(tidyverse) and read the CSV file with read.csv(). To clean the data, we loaded dplyr with library(dplyr) and janitor with p_load(janitor), then standardized the column names with clean_names(). Loading and cleaning the data this way made the later analysis simpler.
2. Data manipulation
a. We used c() to build vectors of the values we needed, and slice() together with the colon operator ':' to get the first 10 rows of the dataset. We also used select(), filter(), arrange(), group_by(), and mutate(). These manipulations made the data easier to analyze.
3. Comparative analysis
a. The summarise() function is well suited to comparative analysis because it condenses the required data into a concise table, which makes it easier to compare variables.
Example: [Input and output screenshots for Assignment Part 1 and Part 2 are not reproduced here.]
4. Descriptive statistics
a. The assignment also uses descriptive statistics (mean, median, maximum, and minimum), computed with mean(), median(), max(), and min(), to support the analysis and the decisions based on it.
5. Data visualization
a. Several functions supported inspecting and visualizing the data. glimpse() gave a compact, transposed view of the dataset's columns, and ggplot() was used to draw scatter plots and histograms with specific settings. These tools helped us visualize the data and gain the insights behind our recommendations.
Example: [Input and output screenshots for Assignment Part 1 and Part 2 are not reproduced here.]
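The processing, manipulation, and summary steps listed in the key findings can be sketched as one short dplyr pipeline. This is only an illustration under stated assumptions: the toy data frame, its column names, and the threshold of 5 are hypothetical and are not taken from the assignment data, and the column names are cleaned with base R here as a stand-in for janitor's clean_names().

```r
library(dplyr)

# Hypothetical toy data standing in for the happiness dataset.
happiness <- data.frame(
  Country = c("A", "B", "C", "D", "E", "F"),
  Region = c("North", "North", "South", "South", "East", "East"),
  Happiness.Score = c(7.1, 6.4, 5.2, 4.8, 6.0, 5.5),
  Freedom = c(0.62, 0.55, 0.41, 0.38, 0.50, 0.47)
)

# Stand-in for janitor::clean_names(): lower-case names, dots to underscores.
names(happiness) <- tolower(gsub(".", "_", names(happiness), fixed = TRUE))

# slice() with the colon operator returns the first rows (3 here, 10 in the report).
top_rows <- happiness %>% slice(1:3)

# select, filter, mutate, group_by, summarise, and arrange in one chain.
regional_stats <- happiness %>%
  select(country, region, happiness_score, freedom) %>%
  filter(happiness_score > 5) %>%            # hypothetical threshold
  mutate(freedom_pct = freedom * 100) %>%
  group_by(region) %>%
  summarise(
    mean_happiness = mean(happiness_score),
    max_happiness = max(happiness_score),
    min_happiness = min(happiness_score),
    median_freedom = median(freedom),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_happiness))

print(regional_stats)
```

Grouping before summarise() is what turns the descriptive statistics into a per-region comparison table.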
The key findings demonstrate how R's operations and functions support data analysis and help extract meaningful information from diverse datasets.

Highlighted Questions

Question 13 [Challenge Problem]
Compare the average GDP per capita of the ten least happy Western European countries with that of the ten happiest Sub-Saharan African countries. For testing, you can store the resulting data.frame or table as gdp_df.
[Input and output screenshots are not reproduced here.]
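One possible way to approach Question 13 is sketched below. It is not the report's own code, which is not reproduced in this copy. The toy data, the column names (region, happiness_score, gdp_per_capita), and the group labels are assumptions, and n_keep is set to 2 only because the toy data is small; the assignment asks for 10.

```r
library(dplyr)

# Hypothetical toy data; the real dataset has many more countries per region.
happiness <- data.frame(
  country = paste0("c", 1:8),
  region = rep(c("Western Europe", "Sub-Saharan Africa"), each = 4),
  happiness_score = c(7.5, 7.0, 6.5, 6.0, 5.0, 4.5, 4.0, 3.5),
  gdp_per_capita = c(1.4, 1.3, 1.2, 1.1, 0.6, 0.5, 0.4, 0.3)
)

n_keep <- 2  # the assignment uses 10

# Least happy Western European countries: sort ascending, keep the first rows.
europe_gdp <- happiness %>%
  filter(region == "Western Europe") %>%
  arrange(happiness_score) %>%
  slice(1:n_keep) %>%
  summarise(mean_gdp = mean(gdp_per_capita))

# Happiest Sub-Saharan African countries: sort descending, keep the first rows.
africa_gdp <- happiness %>%
  filter(region == "Sub-Saharan Africa") %>%
  arrange(desc(happiness_score)) %>%
  slice(1:n_keep) %>%
  summarise(mean_gdp = mean(gdp_per_capita))

# Combine both averages into the comparison table.
gdp_df <- data.frame(
  group = c("Least happy Western Europe", "Happiest Sub-Saharan Africa"),
  mean_gdp = c(europe_gdp$mean_gdp, africa_gdp$mean_gdp)
)

print(gdp_df)
```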
Question 14 [Challenge Problem]
From your regional_stats_df, create a scatterplot of mean_happiness vs. mean_freedom. Draw a line segment from the smallest of these values to the largest.
[Input and output screenshots are not reproduced here.]
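A minimal ggplot2 sketch of Question 14 follows. The values in regional_stats_df are hypothetical, and two readings of the question are assumed rather than confirmed by the source: mean_happiness is placed on the y-axis, and the segment runs from the point (minimum mean_freedom, minimum mean_happiness) to the point (maximum mean_freedom, maximum mean_happiness).

```r
library(ggplot2)

# Hypothetical regional summary; the report's actual values are not shown.
regional_stats_df <- data.frame(
  region = c("North", "East", "South"),
  mean_happiness = c(6.75, 5.75, 5.00),
  mean_freedom = c(0.585, 0.485, 0.395)
)

# Endpoints of the segment: smallest values to largest values (assumed reading).
segment_df <- data.frame(
  x = min(regional_stats_df$mean_freedom),
  y = min(regional_stats_df$mean_happiness),
  xend = max(regional_stats_df$mean_freedom),
  yend = max(regional_stats_df$mean_happiness)
)

# Scatter plot of the regional means with the segment drawn on top.
p <- ggplot(regional_stats_df, aes(x = mean_freedom, y = mean_happiness)) +
  geom_point(aes(color = region)) +
  geom_segment(
    data = segment_df,
    aes(x = x, y = y, xend = xend, yend = yend),
    inherit.aes = FALSE
  ) +
  labs(x = "Mean freedom", y = "Mean happiness")

print(p)
```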
Question 19
Make a recommendation for the league's most valuable player (MVP). Keep in mind that the dataset completely ignores pitchers. You can decide whether a pitcher should be eligible for the MVP. Base your decision on the data you have analyzed. You may choose to do additional analysis at your discretion. You should produce a concise, written executive summary that, in addition to the title page and citations, contains an introduction, a presentation of written key findings supported by visualizations, and a conclusion that contains your recommendations as supported by the data. Your executive summary should adhere to basic APA guidelines.
Code for analyzing the data to recommend the league's most valuable player (MVP)
[Input and output screenshots are not reproduced here.]
Executive Summary: The Most Valuable Player

Introduction
This analysis identifies the league's Most Valuable Player (MVP). The code considers the following metrics: number of games played (G), home run ranking (RankHR), runs batted in ranking (RankRBI), on-base percentage ranking (RankOBP), and the sum of these rankings (TotalRank). Here OBP is calculated as (H + BB) / (AB + BB), where H is the number of hits, BB is bases on balls, and AB is the number of at bats.

Process & Findings
To select the MVP candidate, the data was first filtered to players who appeared in at least 100 games during the season. A new variable, 'MVP_Ranking', was then created by dividing 'G' by 'TotalRank', so that both a lower cumulative rank and more games played raise the score. The players were arranged in descending order of this ranking to show the top candidates. A scatter plot was also created to illustrate the relationship between 'TotalRank' and 'G' for the top 10 MVP candidates. Under this approach, the MVP is the player with the highest MVP_Ranking.

Conclusion
After analyzing and visualizing the data, the Most Valuable Player is Don Mattingly, who has the highest MVP_Ranking, 8.1, and is the highest point on the scatter plot.
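The ranking procedure described in the executive summary can be sketched as follows. The report's actual code is not reproduced in this copy, so this is an illustration under assumptions: the players and their statistics are invented, rank 1 is taken to be the best value in each category, ties share the lowest rank, and the ranks are computed before the 100-game filter is applied.

```r
library(dplyr)

# Hypothetical players and statistics, not taken from the 1986 dataset.
batting <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  G = c(160, 150, 90, 120),
  AB = c(600, 550, 300, 400),
  H = c(200, 150, 100, 100),
  BB = c(50, 50, 20, 40),
  HR = c(30, 35, 10, 15),
  RBI = c(110, 100, 40, 60)
)

mvp <- batting %>%
  mutate(
    OBP = (H + BB) / (AB + BB),                 # parentheses matter here
    RankHR = rank(-HR, ties.method = "min"),    # rank 1 = most home runs (assumed)
    RankRBI = rank(-RBI, ties.method = "min"),
    RankOBP = rank(-OBP, ties.method = "min"),
    TotalRank = RankHR + RankRBI + RankOBP
  ) %>%
  filter(G >= 100) %>%                          # at least 100 games played
  mutate(MVP_Ranking = G / TotalRank) %>%       # more games, lower rank sum: higher score
  arrange(desc(MVP_Ranking))

print(mvp[, c("Player", "G", "TotalRank", "MVP_Ranking")])
```

Because TotalRank sits in the denominator, a player who leads several categories and plays a full season scores highest, which matches the selection rule stated in the summary.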
Conclusion
The second project gave us deeper insight into R programming and statistics. The assignment was a chance to practice data manipulation, data analysis, and data visualization. The analysis revealed various patterns and relationships within the data, providing a foundation for our conclusions and decisions.