Milestone One

School: Northeastern University
Course: 6010
Subject: Mathematics
Date: Jan 9, 2024
Type: pdf
Pages: 10
Uploaded by KidDragonfly3195

Module 2 R Practice
Sagar Rao
College of Professional Studies, Northeastern University
ALY6010: Probability Theory and Introductory Statistics
Dr. Marco MONTES DE OCA
Introduction To Data Set
This report analyzes a dataset of wine reviews and ratings to uncover insights into the wine industry. The data contains 10,851 observations across 14 variables, such as country, variety, winery, price, and points, for wines reviewed by different tasters. We import, clean, and transform the raw data into a usable format in R. We then apply descriptive statistics and data visualization to analyze wine prices, ratings, the number of reviews per taster, and differences across geographical regions. The main goal is to examine the relationship between wine price and rating and to understand how both vary across regions and tasters.
The dataset includes both categorical data (country, variety, region) and numeric data (points, price). The fields include:
Country: Categorical
Description: Text
Designation: Text
Points: Numeric (wine rating out of 100)
Price: Numeric (price of wine bottle)
Province: Categorical (wine region)
Region_1: Categorical
Region_2: Categorical
Taster_name: Categorical
Taster_twitter_handle: Text
Title: Text
Variety: Categorical (grape variety)
Winery: Text
Data Analysis
Load the Data - First, we import the raw wine tasting CSV dataset into R using the read.csv() function and store it as a dataframe named wine_data. This dataframe contains all the columns and observations from the original dataset.
R Code - Output
Cleaning the Data - The next step in preparing the data is handling missing values. R's na.omit() function removes every row in wine_data that contains an NA in any column, since missing values could distort the analysis. The region_2 variable had the most missing values. After this step, every remaining row is complete.
R Code
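The load and clean steps described above can be sketched as follows. This is a minimal illustration, not the report's original code: the three-row CSV is a hypothetical stand-in for the real file, the name wine_clean is introduced here only so the row counts can be compared, and the na.strings argument is an assumption about how blank cells should be treated.

```r
# Hypothetical stand-in for the wine reviews CSV: three rows, one with gaps.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,points,price,province,region_2,taster_name",
  "US,90,35,California,Sonoma,Virginie Boone",
  "France,88,,Alsace,,Roger Voss",
  "US,86,20,Oregon,Willamette Valley,Paul Gregutt"
), csv_path)

# Assumption: na.strings = c("", "NA") makes blank cells count as NA.
# By default read.csv() keeps blanks in text columns as empty strings.
wine_data <- read.csv(csv_path, na.strings = c("", "NA"),
                      stringsAsFactors = FALSE)

# Count missing values per column before dropping anything.
missing_per_column <- colSums(is.na(wine_data))
print(missing_per_column)

# na.omit() drops every row that has an NA in any column.
wine_clean <- na.omit(wine_data)
print(nrow(wine_data))
print(nrow(wine_clean))
```

Checking the per-column NA counts first shows how many rows a single sparse column such as region_2 will cost before na.omit() is applied.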
Statistics - With the cleaned data in place, we compute descriptive statistics for the two key numeric variables, price and points. The summary() function in R reports the minimum, quartiles, median, mean, and maximum, which describe the center and range of each variable. The standard deviation is not part of summary() and comes from sd(). The average price is $35.38 and the average rating is about 88.38 points.
R Code - Output
Number of Reviews by Taster - To understand the mix of reviewers, we use R's table() function to count the reviews per taster and store the resulting frequency distribution as taster_reviews. A barplot of these counts shows which reviewers are most prolific. Virginie Boone has the most reviews of any taster.
R Code
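A short sketch of the descriptive statistics and per-taster counts follows. The five-row data frame is a hypothetical sample standing in for the cleaned wine_data, and the sort() call and plot labels are illustrative choices rather than the report's own code.

```r
# Hypothetical sample standing in for the cleaned wine_data frame.
wine_data <- data.frame(
  price = c(20, 35, 50, 15, 60),
  points = c(86, 90, 92, 85, 93),
  taster_name = c("Virginie Boone", "Virginie Boone", "Paul Gregutt",
                  "Virginie Boone", "Roger Voss")
)

# summary() reports min, quartiles, median, mean and max for each column.
print(summary(wine_data[, c("price", "points")]))

# Standard deviation is not part of summary(); sd() computes it per column.
spread <- sapply(wine_data[, c("price", "points")], sd)
print(spread)

# table() counts reviews per taster; sorting puts the most prolific first.
taster_reviews <- sort(table(wine_data$taster_name), decreasing = TRUE)
print(taster_reviews)

# las = 2 rotates the taster names so they stay readable under the bars.
barplot(taster_reviews, las = 2, ylab = "Number of reviews",
        main = "Reviews per taster")
```

Sorting the table before plotting orders the bars from most to fewest reviews, which makes the leading and trailing tasters easy to read off.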
Output - The barplot shows the number of reviews contributed by each taster. Virginie Boone has the most reviews, while Roger Voss has the fewest. The bar heights make it easy to compare the top reviewers with those who contributed less. Review counts vary widely, from just 10 reviews up to nearly 150 by Boone. Review volume is a rough indicator of experience, so high contributors like Boone likely provide credible ratings.
Price vs Points Scatterplot - Next, we visualize the relationship between price and rating points, which is central to the analysis. The plot() function builds the scatterplot with price on the x-axis and points on the y-axis. The scatterplot shows a broadly positive relationship between the two variables, with more expensive wines receiving marginally better ratings. Most wines cluster between $10 and $100 in price and between 85 and 95 points in rating.
R Code - Output
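The scatterplot step might look like the sketch below. The six data points are hypothetical, and the cor() call is an addition not described in the report: it is included only as one way to put a number on the visual impression of a positive relationship.

```r
# Hypothetical sample standing in for the cleaned wine_data frame.
wine_data <- data.frame(
  price = c(15, 20, 35, 50, 60, 90),
  points = c(85, 86, 90, 89, 92, 94)
)

# Scatterplot with price on the x-axis and points on the y-axis.
plot(wine_data$price, wine_data$points,
     xlab = "Price (USD)", ylab = "Points",
     main = "Price vs points")

# Pearson correlation as a numeric summary of the linear relationship.
# (Not part of the original analysis; added here for illustration.)
price_points_cor <- cor(wine_data$price, wine_data$points)
print(price_points_cor)
```

A coefficient near zero would contradict the visual reading, so it is a cheap check to run alongside the plot.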
Reviews by Province - We then group the reviews by province using table() to analyze regional variation. The counts show that California wines dominate the dataset. Plotting the frequency distribution with barplot() makes California's lead visible and marks it as the region to analyze more closely.
R Code - Output
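The province counts can be sketched as follows. The data frame is a hypothetical sample, the name province_reviews is introduced here for illustration, and the prop.table() step is an addition that expresses each province's count as a share of all reviews.

```r
# Hypothetical sample standing in for the cleaned wine_data frame.
wine_data <- data.frame(
  province = c("California", "California", "Oregon",
               "Washington", "California", "Oregon")
)

# Count reviews per province and sort so the largest comes first.
province_reviews <- sort(table(wine_data$province), decreasing = TRUE)
print(province_reviews)

# Share of all reviews per province (an addition to the described steps).
province_share <- prop.table(province_reviews)
print(round(province_share, 2))

# Bar chart of the counts, with rotated labels for readability.
barplot(province_reviews, las = 2, ylab = "Number of reviews",
        main = "Reviews per province")
```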
California Subset Statistics - Since California has the highest representation, we extract the California wines from the full dataset into ca_wines. This focuses the analysis on the state with the most wine reviews. Within this subset, we again check summary statistics for price and points to describe California wines specifically. On average, California wines are slightly more expensive and more highly rated than the overall dataset.
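A minimal sketch of the California subset and its comparison with the full data follows. The five-row data frame is hypothetical, and the side-by-side comparison table is an illustrative way to check the "more expensive and more highly rated" claim, not the report's original code.

```r
# Hypothetical sample standing in for the cleaned wine_data frame.
wine_data <- data.frame(
  province = c("California", "Oregon", "California",
               "Washington", "California"),
  price = c(40, 25, 55, 20, 45),
  points = c(91, 87, 93, 86, 90)
)

# Keep only the rows whose province is California.
ca_wines <- subset(wine_data, province == "California")

# Summary statistics for the subset.
print(summary(ca_wines[, c("price", "points")]))

# Mean price and points for all wines versus California wines.
comparison <- rbind(
  all_wines = colMeans(wine_data[, c("price", "points")]),
  california = colMeans(ca_wines[, c("price", "points")])
)
print(comparison)
```

Placing the two sets of means in one table makes the size of the difference visible, which a pair of separate summary() printouts does not.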
Summary
This report presented a preliminary exploratory data analysis of the wine reviews dataset. We summarized the price and rating distributions, analyzed review volume by critic, compared regions, and took a focused look at Californian wines.
The main takeaways are:
Most wines fall in the mid-market $10-100 price range and receive strong ratings of 85-95 points.
Virginie Boone, the most prolific critic, could be an authority based on her high review count.
California has the highest review count in the dataset, suggesting its prominence in wine production.
Some aspects to explore further:
Examine relationships between price, rating, and region to uncover market trends.
Analyze wine variety preferences across consumer segments.
Relate critic experience levels to the reliability of average ratings.
Key questions to answer next:
Which price segments have seen the most growth over the past 5 years?
Do very high or very low prices skew ratings relative to the mid-range?
Are there critic-variety interactions driving ratings?
Deeper statistical analysis of pricing, ratings, and critic bias could yield insights with business impact from this dataset.