Milestone One
School: Northeastern University
Course: 6010
Subject: Mathematics
Date: Jan 9, 2024
Module 2 R Practice
Sagar Rao
College of Professional Studies, Northeastern University
ALY6010: Probability Theory and Introductory Statistics
Dr. Marco MONTES DE OCA
Introduction To Data Set
This report analyzes a dataset of wine reviews and ratings to uncover insights into the wine
industry. The data contains 10,851 observations across 14 variables like country, variety, winery, price,
points, etc. for wines reviewed by different tasters. We import, clean, and transform the raw data into a
usable format in R. Then, we apply descriptive statistics and data visualization techniques to analyze
wine prices, ratings, number of reviews per taster, and differences across geographical regions. The key
goal is to analyze relationships between wine price and rating and understand variations across regions
and tasters.
The dataset includes categorical data such as country, variety, and region, as well as numeric data
such as points and price. The fields are:
• Country: Categorical
• Description: Text
• Designation: Text
• Points: Numeric (wine rating out of 100)
• Price: Numeric (price of wine bottle)
• Province: Categorical (wine region)
• Region_1: Categorical
• Region_2: Categorical
• Taster_name: Categorical
• Taster_twitter_handle: Text
• Title: Text
• Variety: Categorical (grape variety)
• Winery: Text
Data Analysis
Load the Data -
First, we import the raw wine tasting CSV dataset into R using the read.csv() function and store
it as a dataframe named wine_data. This dataframe will contain all the columns and observations from
the original dataset for analysis.
R Code -
Output -
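The original code for this step is not shown, so the following is only a minimal sketch of the import described above. The file contents are a made-up three-row sample written to a temporary file, and the na.strings argument is an assumption about how empty cells should be treated; only read.csv() and the name wine_data come from the report.

```r
# Hypothetical three-row sample standing in for the real wine reviews CSV.
# Column names follow the field list above; the values are made up.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,points,price,province,region_2,taster_name",
  "US,90,35,California,Napa,Virginie Boone",
  "US,87,,Oregon,Willamette Valley,Taster B",
  "France,88,24,Bordeaux,,Roger Voss"
), csv_path)

# Assumption: empty cells should count as missing, so "" is added to na.strings.
wine_data <- read.csv(csv_path, stringsAsFactors = FALSE,
                      na.strings = c("", "NA"))

str(wine_data)  # column names and types
dim(wine_data)  # number of rows and columns
```

With the real file, csv_path would simply be the path to the downloaded CSV.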
Cleaning the Data -
As the next step in preparing the data for analysis, we clean it by handling missing
values. R's na.omit() function removes every row in wine_data that contains an NA value in any
column, since missing values could distort the analysis. The region_2 variable had the most
missing values. After this step, every remaining row is complete and ready for exploration.
R Code -
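As a rough illustration of the cleaning step, the sketch below counts missing values per column and then applies na.omit(). The small data frame is made up for the example, and wine_clean and wine_price_points are hypothetical names not used in the report. The last line shows one possible alternative that keeps rows whenever the columns under analysis are complete.

```r
# Made-up sample with missing values in price and region_2
wine_data <- data.frame(
  country  = c("US", "US", "France", "Italy"),
  points   = c(90, 87, 88, 91),
  price    = c(35, NA, 24, 40),
  region_2 = c("Napa", "Sonoma", NA, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column before dropping anything
colSums(is.na(wine_data))

# na.omit() drops every row that has an NA in any column
wine_clean <- na.omit(wine_data)
nrow(wine_data) - nrow(wine_clean)  # number of rows removed

# Alternative: require completeness only in the columns being analyzed
keep <- complete.cases(wine_data[, c("points", "price")])
wine_price_points <- wine_data[keep, ]
```

Because na.omit() checks every column, a sparsely filled field such as region_2 can remove many rows that have valid price and points values, which is worth checking before relying on the reduced data.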
Statistics -
With the cleaned data available, we compute descriptive statistics for the two key
numeric variables, price and points. The summary() function in R reports the minimum, quartiles,
median, mean, and maximum of each variable; it does not report the standard deviation, which sd()
provides separately. Together these describe the center and spread of each variable. The average
price is $35.38 and the average rating is about 88.38 points.
R Code -
Output -
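A minimal sketch of this step, using made-up values rather than the report's data, might look as follows. The name spread is hypothetical.

```r
# Made-up sample; the real analysis would use the cleaned wine_data
wine_data <- data.frame(
  points = c(85, 88, 90, 92, 87),
  price  = c(15, 30, 45, 80, 20)
)

# Minimum, quartiles, median, mean, and maximum for both variables
summary(wine_data[, c("price", "points")])

# summary() does not include the standard deviation, so compute it separately
spread <- sapply(wine_data[, c("price", "points")], sd)
spread

# Range (minimum and maximum) of price
range(wine_data$price)
```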
Number of Reviews by Taster -
To better understand the mix of reviewers, we use R's table() function to count the
reviews per taster and store the resulting frequency table as taster_reviews. A barplot of this
table shows which reviewers are most prolific. Virginie Boone has the most reviews
among all tasters.
R Code -
Output -
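The counting and plotting described above can be sketched roughly as below. The review counts are made up and "Taster B" is a placeholder name; sorting the table before plotting is an added choice, not something the report states.

```r
# Made-up sample of reviewer names; "Taster B" is a placeholder
wine_data <- data.frame(
  taster_name = c("Virginie Boone", "Virginie Boone", "Virginie Boone",
                  "Taster B", "Taster B", "Roger Voss"),
  stringsAsFactors = FALSE
)

# Frequency of reviews per taster, sorted so the tallest bar comes first
taster_reviews <- sort(table(wine_data$taster_name), decreasing = TRUE)
taster_reviews

# las = 2 rotates the axis labels so long names stay readable
barplot(taster_reviews, las = 2, cex.names = 0.8,
        ylab = "Number of reviews", main = "Reviews per taster")
```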
This barplot shows the number of wine reviews contributed by each taster. Virginie
Boone has the most reviews, while Roger Voss has the fewest.
The bar heights make it easy to compare the top reviewers with those who contributed
little. Review counts vary widely among tasters, from just 10
reviews up to nearly 150 by Boone.
The plot therefore identifies both the most prolific critic and the least active ones, which is
useful context for the rest of the analysis. Review volume is a rough proxy for experience, so high
contributors like Boone may provide more consistent ratings, although volume alone does not establish that.
Price vs Points Scatterplot -
Next, we visualize the relationship between price and rating points, which is central to
the analysis. The plot() function builds a scatterplot with price on the x-axis and
points on the y-axis. The scatterplot shows a broadly positive relationship
between the variables, with more expensive wines receiving marginally better ratings. Most
wines cluster between $10 and $100 in price and between 85 and 95 points in rating.
R Code -
Output -
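The following sketch shows one way to draw the scatterplot with plot(). The data values are made up, and the cor() call is an addition that is not part of the report; it is included only as a simple numeric check on the direction of the relationship.

```r
# Made-up sample in which higher prices come with slightly higher ratings
wine_data <- data.frame(
  price  = c(12, 18, 25, 40, 60, 95, 150),
  points = c(85, 86, 88, 89, 91, 92, 94)
)

# Price on the x-axis, points on the y-axis
plot(wine_data$price, wine_data$points,
     xlab = "Price (USD)", ylab = "Points",
     main = "Price vs Points", pch = 19)

# Pearson correlation as a one-number summary of the linear relationship
price_points_cor <- cor(wine_data$price, wine_data$points,
                        use = "complete.obs")
price_points_cor
```

A positive value agrees with the upward trend in the plot, though a single coefficient cannot describe how the relationship changes at very high prices.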
Reviews by province -
We then count the reviews by province using table() to analyze regional variation. The
counts show that California wines dominate the dataset. Plotting the frequency
distribution with barplot() makes California's lead visible and suggests that California should be
analyzed more closely.
R Code -
Output -
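A rough sketch of the province counts, again on made-up rows, is shown below. Limiting the plot to the ten most frequent provinces and computing shares with prop.table() are assumptions added for readability.

```r
# Made-up sample of provinces
wine_data <- data.frame(
  province = c("California", "California", "California", "Oregon",
               "Washington", "California", "Oregon"),
  stringsAsFactors = FALSE
)

# Reviews per province, most frequent first
province_reviews <- sort(table(wine_data$province), decreasing = TRUE)

# Plot only the ten most frequent provinces to keep the chart legible
barplot(head(province_reviews, 10), las = 2, cex.names = 0.8,
        ylab = "Number of reviews", main = "Reviews by province")

# Share of all reviews accounted for by each province
round(prop.table(province_reviews), 2)
```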
California Subset Statistics -
Since California had the highest representation, we extract just the California wines
from the full dataset into ca_wines. This focuses the analysis on the state with the most
wine reviews. Within this subset, we again check summary statistics of the price and points
distributions to describe California wines specifically. We find that, on average,
California wines are slightly more expensive and more highly rated than the overall dataset.
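The subsetting and comparison can be sketched as follows. Only the name ca_wines comes from the report; the data values and the comparison table are made up for illustration, and subset() is one of several ways to filter the rows.

```r
# Made-up sample with three California wines and two others
wine_data <- data.frame(
  province = c("California", "California", "Oregon", "Washington", "California"),
  price    = c(50, 40, 25, 30, 45),
  points   = c(91, 90, 87, 88, 92),
  stringsAsFactors = FALSE
)

# Keep only the California rows
ca_wines <- subset(wine_data, province == "California")
summary(ca_wines[, c("price", "points")])

# Side-by-side means for the full data and the California subset
comparison <- rbind(
  all_wines  = colMeans(wine_data[, c("price", "points")]),
  california = colMeans(ca_wines[, c("price", "points")])
)
comparison
```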
Summary
This report presented a preliminary exploratory data analysis of the wine reviews dataset. We
summarized the price and ratings distribution, analyzed reviews volume by critic, compared regional
variations, and took a focused look at Californian wines.
The main takeaways are:
• Most wines fall in the mid-market $10-100 price range and receive ratings in the 85-95 point range
• Virginie Boone is the most prolific critic, and her high review count suggests she could be an authority
• California has the highest review count, which may reflect its large wine production
Some aspects to explore further:
• Surface relationships between price, rating and region to uncover market trends
• Analyze wine variety preferences across consumer segments
• Relate critic experience levels with average ratings reliability
Key questions to answer next:
• Which price segments have seen most growth over past 5 years?
• Do very high/low prices skew ratings relative to mid-range?
• Are there critic-variety interactions driving ratings?
Deeper statistical analysis around pricing, ratings and critic bias could provide business-impactful
insights from this dataset.