Week6_Assignment
2023-12-10

Sections: Introduction, Prerequisites, Variation, Visualizing Distributions, Typical Values, Unusual Values
Exercises: 1, 2, 3, 4

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from
##   +.gg   ggplot2

library(tidyverse) #calling the "tidyverse" library
## Warning: package 'tidyverse' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.4     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v lubridate 1.9.3     v tibble    3.2.1
## v purrr     1.0.1     v tidyr     1.3.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
#solution:
diamonds %>%
  gather(key = dist, vals, x, y, z) %>%
  ggplot(aes(vals, colour = dist)) +
  geom_freqpoly(bins = 100)

#What is hard to see at first is that the distributions of x and y are almost the same: the same graph with `bins = 30` won't show the x distribution at all, because it overlaps the y distribution almost perfectly. The correlation between the two is `cor(diamonds$x, diamonds$y)`. If we round each measurement to the nearest mm, `mean(with(diamonds, round(x, 0) == round(y, 0)))` gives the share of diamonds whose x and y round to the same number. So far, the length (x) appears directly proportional to the y value.
diamonds %>%
  filter(y < 30) %>%
  select(x, y, z) %>%
  ggpairs()
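#A minimal sketch that evaluates the two inline expressions quoted above, assuming only that ggplot2 (which ships the diamonds data) is installed. The names xy_cor and same_mm are illustrative and not part of the original solution.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation between the x and y dimensions (both in mm)
xy_cor <- cor(diamonds$x, diamonds$y)

# Share of diamonds whose x and y round to the same whole millimetre
same_mm <- mean(with(diamonds, round(x, 0) == round(y, 0)))

print(xy_cor)
print(same_mm)
```

#Both numbers lie between 0 and 1; values close to 1 would support the reading that x and y are nearly interchangeable.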
#Yet the relationship of x and y with z looks almost flat in the plot, as expected. This is after excluding the 2 diamonds with unreasonable y values (the `y < 30` filter).

#2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
#solution:
#graph <- map(seq(50, 1000, 100),
#             ~ ggplot(diamonds, aes(x = price)) +
#               geom_histogram(bins = .x) +
#               labs(x = NULL, y = NULL) +
#               scale_x_continuous(labels = NULL) +
#               scale_y_continuous(labels = NULL))
#multiplot(plotlist = graph)
#The distribution decreases as price rises, as expected, but there is a gap in the distribution, showing that prices sit either above or below a certain threshold.

#3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
#solution:
diamonds %>%
  filter(between(carat, .96, 1.05)) %>%
  group_by(carat) %>%
  summarize(count = n())
## # A tibble: 10 x 2
##    carat count
##    <dbl> <int>
##  1  0.96   103
##  2  0.97    59
##  3  0.98    31
##  4  0.99    23
##  5  1     1558
##  6  1.01  2242
##  7  1.02   883
##  8  1.03   523
##  9  1.04   475
## 10  1.05   361

#The data shows that there are far more 1ct diamonds (1558) than .99ct diamonds (23).

#4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
#solution:
ggplot(data = diamonds, mapping = aes(x = x)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, mapping = aes(x = x)) +
  geom_histogram() +
  coord_cartesian(xlim = c(2, 10), ylim = c(0, 5000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = x)) +
  xlim(3, 9) +
  ylim(0, 5000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 50 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 1 rows containing missing values (`geom_bar()`).
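#A hedged sketch of how the two zooming approaches could be compared numerically rather than by eye. It assumes ggplot2's layer_data() helper to read back the binned counts; binwidth = 0.5 and the names base, zoomed, limited, n_zoomed and n_limited are illustrative choices, not part of the original solution.

```r
library(ggplot2)

# One histogram definition, reused for both zooming approaches
base <- ggplot(diamonds, aes(x = x)) +
  geom_histogram(binwidth = 0.5)

# coord_cartesian() only crops the view; every diamond is still binned
zoomed <- base + coord_cartesian(xlim = c(3, 9))

# xlim() sets scale limits; values outside [3, 9] are dropped before binning
limited <- base + xlim(3, 9)

# Total number of observations that ended up in a bin under each approach
n_zoomed  <- sum(layer_data(zoomed)$count)
n_limited <- sum(suppressWarnings(layer_data(limited))$count)

print(c(zoomed = n_zoomed, limited = n_limited, rows = nrow(diamonds)))
```

#If the totals differ, the difference is the number of rows that xlim() discarded, which is what the "Removed ... rows" warning reports.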
#Because coord_cartesian() is applied after the histogram has been computed, it zooms in on the coordinates but does not change the shape of the histogram. However, xlim() and ylim() remove out-of-range values before the histogram is computed and thus influence the shape of the histogram that is produced.

Sections: Missing Values
Exercises: 1, 2

#1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
#solution:
diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y)) %>%
  ggplot(mapping = aes(x = y)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (`stat_bin()`).
#A histogram removes the missing values. This is likely because a histogram relies on continuous (numeric) data to count observations per bin. When a value is unknown, there is no way to tell which bin it belongs in, so it is dropped.
diamonds %>%
  mutate(cut = ifelse(runif(n()) < .1, NA_character_, as.character(cut))) %>%
  ggplot(mapping = aes(x = cut)) +
  geom_bar()
#A bar chart relies on categorical data and treats NA as just another category, so it gets its own bar.

#2. What does na.rm = TRUE do in mean() and sum()?
#solution:
diamonds %>%
  mutate(xyz = ifelse(z == 0, NA, z)) %>%
  select(x, y, z, xyz) %>%
  arrange(z) %>%
  summarise(
    total_xyz = sum(xyz, na.rm = TRUE),
    mean_xyz = mean(xyz, na.rm = TRUE)
  )
## # A tibble: 1 x 2
##   total_xyz mean_xyz
##       <dbl>    <dbl>
## 1   190879.     3.54

#It removes the `NA` values before the calculation.

Sections: Covariation, A categorical and Continuous Variable
Exercises: 2, 3, 5, 6

#2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

#3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
#solution:
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = carat)) +
  coord_flip()

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = carat, y = cut))
#Although they accomplish essentially the same task, the ggstance package presupposes a horizontal graph. In ggplot2 itself, all you need to do is swap the axes in aes() or set the orientation argument.

#5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
#solution:
ggplot(data = diamonds) +
  geom_violin(mapping = aes(x = price, y = cut))
ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = price, y = ..density.., color = cut))
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## i Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
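#The deprecation warning above points to after_stat(). A minimal sketch of the same frequency polygon in that notation follows; binwidth = 500 and the name density_poly are illustrative assumptions, added only to silence the bins message.

```r
library(ggplot2)

# Same plot as above, with after_stat(density) in place of ..density..
density_poly <- ggplot(data = diamonds) +
  geom_freqpoly(
    mapping = aes(x = price, y = after_stat(density), color = cut),
    binwidth = 500  # assumed bin width, in price units
  )

print(density_poly)
```

#Plotting density rather than count puts each cut on the same scale, so the shapes can be compared even though the cuts differ greatly in size.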
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  facet_wrap(~ cut, nrow = 5, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#The facetted histogram is the most effective at comparing the raw counts. It makes it quite obvious that "ideal" diamonds are the most frequent in the dataset. The frequency polygon can be hard to read, especially when comparing groups that are similar, like premium and very good diamonds in this case. However, it is quite helpful for tracking how price varies and for seeing which cut has the highest density at a specific price. For instance, it is easy to see how the curves for all cuts converge where the number of diamonds drops. Additionally, the violin plot is excellent for spotting irregularities in the data.

#6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
#solution:
library(ggbeeswarm)
?ggbeeswarm
## starting httpd help server ... done

#The first two, geom_beeswarm and geom_quasirandom, produce plots that are a mix of point and violin. Functionally, the violins are built out of points, and the difference between the two methods is whether to randomize points within or across categories. Within each geom you can then specify randomization methods.
#geom_beeswarm: works similarly to geom_jitter
#geom_quasirandom: works similarly to geom_jitter, but it randomizes points within categories to reduce overplotting.
#position_beeswarm: violin point-style plots to show overlapping points. x must be discrete
#position_quasirandom: violin point-style plots to show overlapping points. x must be discrete

Sections: Two Categorical Variables
Exercises: 1, 2

#1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
#solution:
diamonds %>%
  group_by(color) %>%
  count(color, cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop))

diamonds %>%
  group_by(cut) %>%
  count(cut, color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop))

#2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
#solution:
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.1.3

flights %>%
  filter(!is.na(arr_delay), arr_delay > 0) %>%
  group_by(dest, month) %>%
  mutate(avg_delay = mean(arr_delay)) %>%
  ggplot(mapping = aes(x = factor(month), y = reorder(dest, distance))) +
  geom_tile(mapping = aes(fill = avg_delay)) +
  labs(x = "Month", y = "Destination", fill = "Average Delay")
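#One possible variation on the plot above, offered as a sketch rather than as the intended answer: summarise to one row per destination and month, and keep only destinations with delayed arrivals in all 12 months so the grid has no holes. The name delays and the "all 12 months" rule are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# One row per destination/month instead of one row per flight
delays <- flights %>%
  filter(!is.na(arr_delay), arr_delay > 0) %>%
  group_by(dest, month) %>%
  summarise(avg_delay = mean(arr_delay), .groups = "drop") %>%
  # keep destinations that appear in every month (assumed rule)
  group_by(dest) %>%
  filter(n() == 12) %>%
  ungroup()

ggplot(delays, aes(x = factor(month), y = dest, fill = avg_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Average Delay")
```

#Summarising first means each tile is drawn once, and dropping sparse destinations shortens the y axis.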
#There are too many destinations for all the labels to fit comfortably on either axis. The month variable is stored as an integer even though it only takes the discrete values 1 to 12 and is really a nominal label. Grouping destinations by state and making month a true discrete variable might address these problems. The latter problem is simpler to resolve; all we need to do is use factor(month).

Sections: Two Continuous Variables
Exercises: 2, 4, 5

#2. Visualize the distribution of carat, partitioned by price.
#solution:
#Using geom_density() and partitioning by price with cut_width, it is not surprising to see that diamonds of higher carat are associated with higher price in general.
diamonds %>%
  ggplot() +
  geom_density(mapping = aes(x = carat, color = cut_width(price, 5000, boundary = 0)))
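#cut_width() makes bins of equal price range, so the number of diamonds per bin can differ a lot. A hedged alternative sketch uses ggplot2's cut_number(), which aims for bins with roughly equal counts; the choice of 4 bins and the name price_bins are assumptions for illustration.

```r
library(ggplot2)

# Split price into 4 bins holding roughly the same number of diamonds
price_bins <- cut_number(diamonds$price, 4)
print(table(price_bins))

# Density of carat within each equal-count price bin
ggplot(diamonds) +
  geom_density(mapping = aes(x = carat, color = cut_number(price, 4)))
```

#With equal-count bins, each density curve is estimated from a similar amount of data.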
#4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
#solution:
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_hex() +
  facet_wrap(~ cut)
## Warning: Computation failed in `stat_binhex()`
## Computation failed in `stat_binhex()`
## Computation failed in `stat_binhex()`
## Computation failed in `stat_binhex()`
## Computation failed in `stat_binhex()`
## Caused by error in `compute_group()`:
## ! The package "hexbin" is required for `stat_binhex()`
ggplot(data = diamonds, mapping = aes(x = cut_number(carat, 5), y = price)) +
  geom_boxplot(aes(color = cut))
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot(aes(color = cut_number(carat, 5)))

#5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
#solution:
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
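#A rough sketch of how the unusual x/y combinations in the scatterplot above could be listed as rows. The 0.5 mm cutoff on |x - y| and the name unusual are illustrative assumptions, not values taken from the exercise.

```r
library(dplyr)
library(ggplot2)

# Diamonds inside the plotted window whose x and y differ by more than 0.5 mm
unusual <- diamonds %>%
  filter(between(x, 4, 11), between(y, 4, 11)) %>%
  filter(abs(x - y) > 0.5) %>%   # assumed cutoff for "unusual"
  select(carat, cut, x, y, z) %>%
  arrange(desc(abs(x - y)))

print(unusual)
```

#Each of these rows has x and y values that look ordinary on their own; only the pairing is odd, which is why a one dimensional plot would not flag them.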