week 3 recitation

School: Michigan State University
Course: 231
Subject: Statistics
Date: Feb 20, 2024

STT 231 – Statistics for Scientists
Spring 2024
Name: __________________________

Week 3 recitation

Introduction

Background to the study

The study you’ll explore involves a large dataset recording information on 53940 diamonds. There are ten variables measured for each diamond, including price (in USD), cut (the quality of the cut), clarity (a measurement of how clear the diamond is), color, and carat (weight from 0.2-5.01 carats). For a more detailed description of the dataset, you can open RStudio and run the command help(diamonds).

Your main goal of this week’s activity will be to explore the factors influencing the price of a diamond. Historically, diamond prices have been thought to be a function of “The Four C’s”: cut, clarity, color, and carat. But are all four of these dimensions equally important, or are some more influential than others?

Task 1: Univariate analyses

a. Two common graphs used to describe a data distribution are a histogram (in R, hist()) and a barplot (in R, barplot()). Create an appropriate graph for each of the variables we plan to explore. To get you started, the commands below will create a histogram of price and a barplot of cut.

    hist(diamonds$price)
    barplot(table(diamonds$cut))

b. In the space below, describe any interesting features of the distribution of each variable.

    Variable    Notes on the distribution
    Price       Right skewed
    Cut         Left skewed; data seems to vary
    Clarity     Symmetric
    Color       Symmetric
    Carat       Right skewed
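A numeric check can back up the visual impressions recorded in the table above. The sketch below is only an illustration. It assumes the diamonds data come from the ggplot2 package (the worksheet does not say how the data are loaded), and the names num_summary and cut_counts are made up for this example.

```r
# Assumption: diamonds ships with the ggplot2 package, which must be installed.
library(ggplot2)
data(diamonds)

# Numeric variables: a mean well above the median suggests right skew.
num_summary <- sapply(diamonds[, c("price", "carat")], function(x) {
  c(mean = mean(x), median = median(x))
})
print(num_summary)

# Categorical variables: counts per level, in the factor's built-in order.
cut_counts <- table(diamonds$cut)
print(cut_counts)
```

A mean well above the median points to right skew. For an ordered categorical variable such as cut, the counts show where observations pile up rather than skew in the usual numeric sense.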
Task 2: Bivariate analyses

As mentioned in the introduction, our main goal is to explore the relationship between a diamond’s selling price and its other attributes (including cut, clarity, color, and carat).

c. Create side-by-side boxplots that show the association between a diamond’s price and its cut. In the space below, describe what you see. Is there anything surprising about this distribution?

Prices vary widely within every cut. Surprisingly, ideal cut diamonds appear to have lower prices than the other cuts.

d. In (c) above, you should have noticed that ideal cut diamonds look like they have lower prices than any other cut. To verify this, we can ask RStudio to compute summary statistics for the price of diamonds, according to what type of cut quality they had. The main function we can use to compute a summary statistic across subgroups is the aggregate command. The code below lets us compute the median price of diamonds, separated into subgroups by their cut. Run the command below and then amend it so that it will instead give us the mean price by group.

    aggregate(diamonds$price, by = list(diamonds$cut), median)

Compare the subgroup means and medians. Given their values, do you think the price distributions within each subgroup are left-skewed, right-skewed, or symmetric? Explain.

The mean is higher than the median in each subgroup, so the price distributions are positively skewed (right-skewed).

e. Now we’ve verified that diamonds with better cuts tend to be cheaper. This doesn’t make sense because ‘ideal’ cut is the highest quality of cut possible. Brainstorm with your peers some ideas explaining why this might be true.

- Supply and demand
- Trends and different aesthetics
- Market dynamics

Task 3: Multivariable analyses

f. Now that you’ve brainstormed some ideas, let’s see if the data can help us explore their veracity. Using the command below, create a scatterplot of a diamond’s price against its carat (i.e., the size of the stone). In the space below, describe what you see.

    plot(diamonds$carat, diamonds$price)
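To put a number on the pattern in the price-versus-carat scatterplot, one option is a correlation coefficient. This is a rough sketch rather than part of the assigned work. It assumes diamonds is available from the ggplot2 package, and the name r_price_carat is hypothetical. The formula price ~ carat places price on the vertical axis and carat on the horizontal axis.

```r
# Assumption: diamonds ships with the ggplot2 package, which must be installed.
library(ggplot2)
data(diamonds)

# Formula interface: response (price) on the y-axis, carat on the x-axis.
# pch = "." draws small points, which helps with ~54,000 observations.
plot(price ~ carat, data = diamonds, pch = ".",
     xlab = "Carat", ylab = "Price (USD)")

# Pearson correlation: measures linear association only.
r_price_carat <- cor(diamonds$carat, diamonds$price)
print(r_price_carat)
```

A value near 1 indicates a strong positive linear association, although the scatterplot should still be checked for curvature that a single number cannot show.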
The data are clumped together and appear to trend toward the right.

g. Now, make an appropriate graph to explore the relationship between carat and cut. Describe what type of association these variables display.

h. Below is another graph that displays the price of a diamond according to its cut, but only for diamonds that are exactly 1 carat. (Prof. Keane has also provided the code he used to restrict the plot to just 1 carat diamonds.) What relationship can you observe between price and cut when the carat (i.e., the size) of the diamond is controlled for?

    one.carat <- subset(diamonds, diamonds$carat == 1)
    boxplot(one.carat$price ~ one.carat$cut)

i. Now let us take our 1 carat diamonds and only look at those of color = G or H and clarity = VS1 or VS2. There are 22 Fair, 53 Good, 64 Very Good, 82 Premium, and 27 Ideal diamonds meeting these conditions. What do you see in the plot?
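The subset described in part (i) could be built along the following lines. This is a sketch under assumptions: diamonds is taken from the ggplot2 package, the object names one.carat.sub and sub_medians are hypothetical, and the filtering shown is one reasonable reading of the stated conditions, not necessarily the code used for the worksheet's plot.

```r
# Assumption: diamonds ships with the ggplot2 package, which must be installed.
library(ggplot2)
data(diamonds)

# Keep 1-carat diamonds with color G or H and clarity VS1 or VS2.
one.carat.sub <- subset(
  diamonds,
  carat == 1 & color %in% c("G", "H") & clarity %in% c("VS1", "VS2")
)

# Number of diamonds in each cut category within the subset.
print(table(one.carat.sub$cut))

# Price by cut, with carat fixed and color and clarity restricted.
boxplot(price ~ cut, data = one.carat.sub,
        xlab = "Cut", ylab = "Price (USD)")

# Median price per cut in the subset.
sub_medians <- aggregate(price ~ cut, data = one.carat.sub, FUN = median)
print(sub_medians)
```

Comparing the table() output with the counts quoted in part (i) is a quick way to check that the filter matches the one intended.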
j. Finally, let’s create a plot that shows the relationship between a diamond’s carat (i.e., its size) and its selling price, but that colors each data point according to the quality of its cut. The code below will create such a graph and add in a legend to describe how each observational point was colored. What do you notice?

    plot(diamonds$carat, diamonds$price, col = diamonds$cut)
    legend("bottomright", legend = levels(diamonds$cut), fill = seq_along(levels(diamonds$cut)))
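One way to keep the legend in step with the point colors is to build the color vector explicitly. The sketch below is an illustration, not the worksheet's own code. It assumes diamonds comes from the ggplot2 package, colors the points by cut as the text of part (j) describes, and uses the hypothetical names cut_levels, cut_cols, and point_cols.

```r
# Assumption: diamonds ships with the ggplot2 package, which must be installed.
library(ggplot2)
data(diamonds)

# One color per cut level, taken from the current palette.
cut_levels <- levels(diamonds$cut)
cut_cols <- palette()[seq_along(cut_levels)]

# Look up each diamond's color from the integer code of its cut level.
point_cols <- cut_cols[as.integer(diamonds$cut)]

plot(diamonds$carat, diamonds$price, col = point_cols, pch = ".",
     xlab = "Carat", ylab = "Price (USD)")

# Reusing cut_cols means the legend shows exactly the colors that were plotted.
legend("bottomright", legend = cut_levels, fill = cut_cols, title = "Cut")
```

Because the same cut_cols vector feeds both the points and the legend, changing the colors in one place keeps the two consistent.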