ProblemSet3_sampleanswers

School

University of Toronto

Course

130

Subject

Statistics

Date

Feb 20, 2024

STA130H1S – Fall 2022 Problem Set 3 () and STA130 Professors

Instructions

Complete the exercises in this .Rmd file and submit your .Rmd and .pdf output through Quercus on Thursday, September 29 by 5:00 p.m. ET.

Part 1: More Olympics Data

The code below loads the VGAMdata package (so you can access the data sets it contains) and the tidyverse package (so you can use the functions it contains) and glimpses the oly12 data set, which you will use for this question. Do not use the olympics data set from class to answer the prompts in this question.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.2 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(VGAMdata) # install.packages("VGAMdata")
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
names(oly12) # convenient function to quickly glance at data set column names
## [1] "Name" "Country" "Age" "Height" "Weight" "Sex" "DOB"
## [8] "PlaceOB" "Gold" "Silver" "Bronze" "Total" "Sport" "Event"
glimpse(oly12)
## Rows: 10,384
## Columns: 14
## $ Name    <fct> Lamusi A, A G Kruger, Jamale Aarrass, Abdelhak Aatakni, Maria ~
## $ Country <fct> "People s Republic of China", "United States of America", "Fra~
## $ Age     <int> 23, 33, 30, 24, 26, 27, 30, 23, 27, 19, 37, 28, 28, 28, 22, 19~
## $ Height  <dbl> 1.70, 1.93, 1.87, NA, 1.78, 1.82, 1.82, 1.87, 1.90, 1.70, NA, ~
## $ Weight  <int> 60, 125, 76, NA, 85, 80, 73, 75, 80, NA, NA, NA, 60, 64, 62, N~
## $ Sex     <fct> M, M, M, M, F, M, F, M, M, M, M, M, F, F, M, F, M, M, M, M, F,~
## $ DOB     <date> 1989-02-06, NA, NA, 1988-09-02, NA, 1984-06-09, NA, 1989-03-0~
## $ PlaceOB <fct> "NEIMONGGOL (CHN)", "Sheldon (USA)", "BEZONS (FRA)", "AIN SEBA~
## $ Gold    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Silver  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Bronze  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Total   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Sport   <fct> "Judo", "Athletics", "Athletics", "Boxing", "Athletics", "Hand~
## $ Event   <fct> "Men s -60kg", "Men s Hammer Throw", "Men s 1500m", "Men s Lig~

Question 1: Practice with filter()

(a) In this week’s class, we looked at data for each country which participated in the 2012 Olympics (e.g. size of each country’s Olympic team, number of medals won, etc.), and there was one observation (i.e. one row) for each participating country. What does each row in the oly12 dataset represent?

In the oly12 dataset, each row corresponds to one athlete who participated in the 2012 Olympic Games.

Hint: Type ?oly12 or help(oly12) in the console (on the bottom left corner) to view the help file for the oly12 dataset in the Help tab (on the bottom right corner) of RStudio; or just search for “oly12” in the Help tab.

(b) Determine the number of athletes who represented Canada ( Canada ) or the United States ( United States of America ) in the 2012 Olympic Games.
# Using filter to keep only Canadian athletes,
# then glimpse to view the number of observations
oly12 %>% filter(Country == "Canada") %>% glimpse()
## Rows: 274
## Columns: 14
## $ Name    <fct> Jennifer Abel, Natalie Achonwa, Mohammed Ahmed, Dylan Armstron~
## $ Country <fct> "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "C~
## $ Age     <int> 20, 19, 21, 31, 28, 24, 20, 28, 23, 22, 21, 56, 29, 24, 23, 25~
## $ Height  <dbl> 1.60, 1.92, 1.90, 1.93, 1.85, 1.83, 1.68, 1.86, 1.86, 1.68, 1.~
## $ Weight  <int> 62, 83, 60, 139, 82, 78, 150, 90, 80, 58, 75, 78, 98, 48, 69, ~
## $ Sex     <fct> F, F, M, M, F, F, M, M, M, F, M, M, M, F, F, F, M, M, F, F, M,~
## $ DOB     <date> NA, NA, 1991-05-01, NA, NA, 1988-06-05, 1992-11-03, NA, NA, 1~
## $ PlaceOB <fct> "Montreal (CAN)", "", "Mogadishu (SOM)", "Kamloops (CAN)", "",~
## $ Gold    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Silver  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
## $ Bronze  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,~
## $ Total   <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,~
## $ Sport   <fct> "Diving", "Basketball", "Athletics", "Athletics", "Basketball"~
## $ Event   <fct> "Women s 3m Springboard, Women s Synchronised 3m Springboard",~
oly12 %>% filter(Country == "United States of America") %>% glimpse()
## Rows: 518
## Columns: 14
## $ Name    <fct> A G Kruger, Abdihakem Abdirahman, Amy Acuff, Cammile Adams, Na~
## $ Country <fct> "United States of America", "United States of America", "Unite~
## $ Age     <int> 33, 35, 37, 20, 23, 24, 27, 23, 21, 20, 25, 28, 29, 38, 28, 30~
## $ Height  <dbl> 1.93, 1.80, 1.88, 1.73, 2.01, 1.91, 1.85, 1.80, 1.73, 1.78, 2.~
## $ Weight  <int> 125, 61, 66, 65, 102, 79, 74, 70, 64, 68, 93, 104, 77, 58, 75,~
## $ Sex     <fct> M, M, F, F, M, F, M, F, F, F, M, M, F, F, F, M, F, F, F, M, F,~
## $ DOB     <date> NA, 1977-01-01, NA, 1991-11-09, 1988-07-12, 1987-05-10, NA, N~
## $ PlaceOB <fct> "Sheldon (USA)", "HARGISA (SOM)", "Port Arthur (USA)", "Housto~
## $ Gold    <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,~
## $ Silver  <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Bronze  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Total   <int> 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,~
## $ Sport   <fct> "Athletics", "Athletics", "Athletics", "Swimming", "Swimming",~
## $ Event   <fct> "Men s Hammer Throw", "Men s Marathon", "Women s High Jump", "~
# add the above 2 numbers together

# Using filter to keep only Canadian or USA athletes,
# then count the number of rows in the resulting data frame
oly12 %>% filter(Country == "Canada" | Country == "United States of America") %>% nrow()
## [1] 792

# Use summarise to calculate the number of athletes for each country,
# then filter to keep only the rows for Canada and the United States
oly12 %>%
  group_by(Country) %>%
  summarise(team_size = n()) %>%
  filter(Country == "Canada" | Country == "United States of America")
## # A tibble: 2 x 2
##   Country                  team_size
##   <fct>                        <int>
## 1 Canada                         274
## 2 United States of America       518
274 + 518
## [1] 792

274 athletes represented Canada and 518 athletes represented the USA at the 2012 Olympic Games, so 792 athletes represented either Canada or the USA at the 2012 Olympic Games.

Hint: Apply the filter() function to the Country column of the oly12 dataset.

(c) Determine the number of female athletes who competed in classical gymnastics ( Gymnastics - Artistic and Gymnastics - Rhythmic ) or classical pool sports ( Diving and Swimming ).
oly12_FemaleClassicalGymPool <- oly12 %>%
  filter(Sex == "F") %>%
  filter(Sport == "Gymnastics - Rhythmic" | Sport == "Gymnastics - Artistic" | Sport == "Diving" | Sport == "Swimming")
oly12_FemaleClassicalGymPool %>% summarise(n = n())
##     n
## 1 685

Hint: You can see all the possible values for the Sport variable with levels(oly12$Sport) , and count the number of possible levels with nlevels(oly12$Sport) .
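The hint's levels() and nlevels() calls, and the chained == comparisons used in the answer, can be tried out on a small made-up factor. The sketch below assumes a hypothetical vector toy_sport (it is not the real oly12 data) and also suggests that %in% is an equivalent, shorter way to write a chain of | comparisons.

```r
# Hypothetical toy data standing in for oly12$Sport (not the real data set)
toy_sport <- factor(c("Diving", "Judo", "Swimming", "Diving", "Boxing"))

levels(toy_sport)   # the distinct values of the factor
nlevels(toy_sport)  # how many distinct values there are

# Two equivalent ways to flag the pool sports
pool_chain <- toy_sport == "Diving" | toy_sport == "Swimming"
pool_in    <- toy_sport %in% c("Diving", "Swimming")

identical(pool_chain, pool_in)  # do both approaches flag the same rows?
sum(pool_in)                    # number of matching rows
```

The same logical vectors could be passed to filter(); the %in% form tends to be easier to read once more than two or three values are involved.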
(d) Determine the number of athletes who competed in ANY gymnastic ( Gymnastics - Artistic , Gymnastics - Rhythmic , Trampoline ) or ANY pool sports ( Diving , Swimming , Synchronised Swimming , and Water Polo ).

oly12_GymnastsPoolers <- oly12 %>%
  filter(Sport %in% c("Gymnastics - Rhythmic", "Gymnastics - Artistic", "Diving", "Trampoline", "Swimming", "Synchronised Swimming", "Water Polo"))
oly12_GymnastsPoolers %>% summarise(n = n())
##      n
## 1 1695

Hint: As indicated on stackoverflow, the %in% comparison operator could be useful here with allGymnastics <- c("Gymnastics - Artistic", "Gymnastics - Rhythmic", "Trampoline") and allWaterPool <- c("Diving", "Swimming", "Synchronised Swimming", "Water Polo") and filter(Sport %in% allGymnastics | Sport %in% allWaterPool) .

(e) Create the data subset oly12_FemaleArtisticRhythmicGymnasts which contains all female Olympic athletes who competed in artistic gymnastics or rhythmic gymnastics.

oly12_FemaleArtisticRhythmicGymnasts <- oly12 %>%
  filter(Sex == "F") %>%
  filter(Sport == "Gymnastics - Rhythmic" | Sport == "Gymnastics - Artistic")

Hint: names(oly12) shows all the column names of the data set.

(f) Use oly12_FemaleArtisticRhythmicGymnasts and ggplot2 to compare the age distribution of female Olympic athletes competing in artistic gymnastics to the age distribution of female Olympic athletes competing in rhythmic gymnastics using both boxplots and histograms.

oly12_FemaleArtisticRhythmicGymnasts %>%
  ggplot(aes(x = Sport, y = Age)) +
  geom_boxplot()
[Figure: boxplots of Age (y-axis) by Sport (x-axis: Gymnastics - Artistic, Gymnastics - Rhythmic)]

oly12_FemaleArtisticRhythmicGymnasts %>%
  filter(Sport == "Gymnastics - Artistic") %>%
  ggplot(aes(x = Age)) +
  geom_histogram(bins = 12, color = "black", fill = "gray")
[Figure: histogram of Age (x-axis) against count (y-axis) for female artistic gymnasts]

oly12_FemaleArtisticRhythmicGymnasts %>%
  filter(Sport == "Gymnastics - Rhythmic") %>%
  ggplot(aes(x = Age)) +
  geom_histogram(bins = 12, color = "black", fill = "gray")
[Figure: histogram of Age (x-axis) against count (y-axis) for female rhythmic gymnasts]

Hint: don’t forget aes() and to use + rather than %>% .

(g) Answer the following questions based on the plots you created in (f).

Are the age distributions of female rhythmic gymnasts and female artistic gymnasts symmetrical or skewed?

From the boxplots, we can see that the age distribution of female artistic gymnasts appears to be roughly symmetric, with a slight right skew (based on outliers), and the age distribution of female rhythmic gymnasts appears to be right skewed. This can also be seen in the histograms of the age distributions.

How do the medians, 25th percentiles, and 75th percentiles for ages of female rhythmic gymnasts and female artistic gymnasts compare?

From the boxplots, we can see that the median ages of female rhythmic gymnasts and female artistic gymnasts are similar (~18). The 25th percentile age of female rhythmic gymnasts is slightly higher than the 25th percentile age of female artistic gymnasts (~18 and ~17, respectively). Lastly, the 75th percentiles of ages of female rhythmic gymnasts and female artistic gymnasts are similar (~21).

Based only on the histogram and boxplots, predict whether the standard deviation of the ages is similar or different. Justify your answer in 1-2 sentences.

I predict that the standard deviation of ages for female rhythmic gymnasts will be slightly smaller than the standard deviation of ages for female artistic gymnasts, because the IQR and range are smaller for the rhythmic gymnast group.
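The quantities read off the boxplots in (g) can also be computed directly. The following is a minimal sketch using two made-up age vectors, ages_a and ages_b (hypothetical values, not taken from oly12), to show how the percentiles, the IQR, and the standard deviation relate to each other.

```r
# Hypothetical ages for two groups (illustrative values only)
ages_a <- c(15, 16, 17, 18, 19, 21, 24, 30)
ages_b <- c(16, 17, 18, 18, 19, 20, 21, 22)

# 25th percentile, median, and 75th percentile: the box and its middle line
quantile(ages_a, probs = c(0.25, 0.5, 0.75))
quantile(ages_b, probs = c(0.25, 0.5, 0.75))

# Two measures of spread: the IQR ignores the tails, the sd does not
IQR(ages_a)
sd(ages_a)
IQR(ages_b)
sd(ages_b)
```

In this toy example the group with the longer right tail has both the larger IQR and the larger standard deviation, which is the same reasoning used for the prediction above.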
Question 2: Practice with summarise() , group_by() , and mutate()

(a) Create a summary table of oly12_FemaleArtisticRhythmicGymnasts reporting the minimum ( min ), maximum ( max ), mean , median , and standard deviation ( sd ) of ages for female rhythmic gymnasts and female artistic gymnasts. Were you correct in your guess about the standard deviation in part (g) of the last question?

oly12_FemaleArtisticRhythmicGymnasts %>%
  group_by(Sport) %>%
  summarise(min = min(Age), max = max(Age), mean = mean(Age), median = median(Age), sd = sd(Age))
## # A tibble: 2 x 6
##   Sport                   min   max  mean median    sd
##   <fct>                 <int> <int> <dbl>  <dbl> <dbl>
## 1 Gymnastics - Artistic    15    37  19.7     19  3.66
## 2 Gymnastics - Rhythmic    16    27  19.5     19  2.68

As predicted, the standard deviation of ages is slightly higher for female artistic gymnasts than for female rhythmic gymnasts (3.66 vs 2.68), but they are very similar.

(b) Create a new variable called total_medals and create a new tibble called oly12_OneMedalClub that contains athletes who won exactly one medal at the 2012 Olympics.

oly12_OneMedalClub <- oly12 %>%
  mutate(total_medals = Gold + Silver + Bronze) %>%
  filter(total_medals == 1)

(c) Uncomment the code below and run the glimpse of the data created in part (b).

# glimpse(oly12_OneMedalClub)

Question 3: Practice with select() , arrange() , desc() , and filter()

(a) Find the Name and Age of the 6 oldest athletes who competed in the 2012 Olympics.

oly12 %>% arrange(desc(Age)) %>% head() %>% select(Name, Age)
##                Name Age
## 1   Hiroshi Hoketsu  71
## 2 Afanasijs Kuzmins  65
## 3        Ian Millar  65
## 4    Carl Bouckaert  58
## 5  Andrei Kavalenka  57
## 6        Mary Hanna  57

(b) Find the Name , Age and Sport of the 6 youngest female athletes who competed in the 2012 Olympics.

oly12 %>% filter(Sex == "F") %>%
  arrange(Age) %>% head() %>% select(Name, Age, Sport)
##                       Name Age    Sport
## 1              Adzo Kpossi  13 Swimming
## 2        Aurelie Fanchette  14 Swimming
## 3                 Suji Kim  14   Diving
## 4 Nafissatou Moussa Adamou  14 Swimming
## 5  Lea Melissa Moutoussamy  14  Fencing
## 6                Yuhan Qiu  14 Swimming

(c) Find the Name , Age , Sport , and Event for the 6 youngest and 6 oldest competitors who won gold medals at the 2012 Olympics. [This can be run as two pieces of code rather than one piece of combined code].

oly12 %>% filter(Gold > 0) %>% arrange(Age) %>% head() %>% select(Name, Age, Sport, Event)
##                Name Age                 Sport
## 1    Ruta Meilutyte  15              Swimming
## 2         Kyla Ross  15 Gymnastics - Artistic
## 3 Gabrielle Douglas  16 Gymnastics - Artistic
## 4      Yolane Kukla  16              Swimming
## 5  Mc Kayla Maroney  16 Gymnastics - Artistic
## 6         Shiwen Ye  16              Swimming
##   Event
## 1 Women s 50m Freestyle, Women s 100m Freestyle, Women s 100m Breaststroke
## 2 Women s Team, Women s Qualification
## 3 Women s Individual All-Around, Women s Team, Women s Qualification
## 4 Women s 4x100m Freestyle Relay
## 5 Women s Team, Women s Qualification
## 6 Women s 200m Individual Medley, Women s 400m Individual Medley, Women s 4x200m Freestyle Relay
oly12 %>% filter(Gold > 0) %>% arrange(desc(Age)) %>% head() %>% select(Name, Age, Sport, Event)
##                 Name Age          Sport
## 1      Peter Thomsen  51     Equestrian
## 2      Ingrid Klimke  44     Equestrian
## 3    Sergei Martynov  44       Shooting
## 4  Kristin Armstrong  38 Cycling - Road
## 5  Valentina Vezzali  38        Fencing
## 6 Alexandr Vinokurov  38 Cycling - Road
##   Event
## 1 Individual Eventing, Team Eventing, BARNY
## 2 Individual Eventing, Team Eventing, BUTTS ABRAXXAS
## 3 Men s 50m Rifle Prone
## 4 Women s Individual Time Trial, Women s Road Race
## 5 Women s Individual Foil, Women s Team Foil
## 6 Men s Individual Time Trial, Men s Road Race

# google "tidy get the first and last rows"
# https://stackoverflow.com/questions/31528981/select-first-and-last-row-from-grouped-data
# Note: this returns 13 rows, because >= (n() - 6) keeps the last 7 rows
oly12 %>% filter(Gold > 0) %>% arrange(Age) %>%
  filter(row_number() <= 6 | row_number() >= (n() - 6))
##                  Name                    Country Age Height Weight Sex
## 1      Ruta Meilutyte                  Lithuania  15   1.72     64   F
## 2           Kyla Ross   United States of America  15   1.57     NA   F
## 3   Gabrielle Douglas   United States of America  16   1.50     NA   F
## 4        Yolane Kukla                  Australia  16   1.68     61   F
## 5    Mc Kayla Maroney   United States of America  16   1.60     NA   F
## 6           Shiwen Ye People s Republic of China  16   1.72     64   F
## 7           Chris Hoy              Great Britain  36   1.85     93   M
## 8   Kristin Armstrong   United States of America  38   1.73     58   F
## 9   Valentina Vezzali                      Italy  38   1.64     53   F
## 10 Alexandr Vinokurov                 Kazakhstan  38   1.76     69   M
## 11      Ingrid Klimke                    Germany  44   1.72     59   F
## 12    Sergei Martynov                    Belarus  44   1.72     70   M
## 13      Peter Thomsen                    Germany  51   1.83     73   M
##           DOB            PlaceOB Gold Silver Bronze Total                 Sport
## 1        <NA>       Kaunas (LTU)    1      0      0     1              Swimming
## 2        <NA>     Honolulu (USA)    1      0      0     1 Gymnastics - Artistic
## 3        <NA> Newport News (USA)    2      0      0     2 Gymnastics - Artistic
## 4        <NA> AUCHENFLOWER (AUS)    1      0      0     1              Swimming
## 5  1995-09-12  ALISO VIEJO (USA)    1      0      0     1 Gymnastics - Artistic
## 6  1996-01-03     Zhejiang (CHN)    2      0      0     2              Swimming
## 7        <NA>    Edinburgh (GBR)    1      0      0     1       Cycling - Track
## 8  1973-11-08      Memphis (USA)    1      0      0     1        Cycling - Road
## 9        <NA>                       1      0      1     2               Fencing
## 10       <NA>     Pavlodar (KAZ)    1      0      0     1        Cycling - Road
## 11 1968-01-04      MUNSTER (GER)    1      0      0     1            Equestrian
## 12       <NA>       VEREIA (RUS)    1      0      0     1              Shooting
## 13 1961-04-04    Flensburg (GER)    1      0      0     1            Equestrian
##    Event
## 1  Women s 50m Freestyle, Women s 100m Freestyle, Women s 100m Breaststroke
## 2  Women s Team, Women s Qualification
## 3  Women s Individual All-Around, Women s Team, Women s Qualification
## 4  Women s 4x100m Freestyle Relay
## 5  Women s Team, Women s Qualification
## 6  Women s 200m Individual Medley, Women s 400m Individual Medley, Women s 4x200m Freestyle Relay
## 7  Men s Keirin, Men s Team Sprint
## 8  Women s Individual Time Trial, Women s Road Race
## 9  Women s Individual Foil, Women s Team Foil
## 10 Men s Individual Time Trial, Men s Road Race
## 11 Individual Eventing, Team Eventing, BUTTS ABRAXXAS
## 12 Men s 50m Rifle Prone
## 13 Individual Eventing, Team Eventing, BARNY
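The row_number() pattern for taking rows from both ends of a sorted table is easy to get off by one. As a minimal sketch, assuming a hypothetical ten-row tibble toy with a made-up column x, the comparison below contrasts the inclusive and strict forms of the second condition.

```r
library(dplyr)

# Hypothetical toy data: ten rows, already sorted by x
toy <- tibble(x = 1:10)
k <- 3

# Inclusive form: keeps k rows at the top but k + 1 rows at the bottom
too_many <- toy %>%
  filter(row_number() <= k | row_number() >= (n() - k))

# Strict form: keeps exactly k rows at each end
exact <- toy %>%
  filter(row_number() <= k | row_number() > n() - k)

nrow(too_many)  # 7
nrow(exact)     # 6
```

Applied to the gold medallists, the strict form with k = 6 would be expected to return 12 rows; running head() and tail() separately, as in the two-step answer, is another way to avoid the issue.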
Question 4: The Data Consultant

You have just been hired by a consultancy company. Congratulations! They are doing a report on each Olympics for the past 10 years. Given your recent experience in STA130, you ask to be responsible for the 2012 summary. Write a short report to your boss on information that can be gleaned about the ages of the athletes across sports. As it turns out, you happen to know that your new boss’ favourite sports are badminton and weightlifting, so addressing these sports specifically might be an easy way to capture their attention; but other features of athletes’ ages which can be learned from your plots and tables will of course be appreciated, too. The more interesting the better!

Question Constraints

This is a quick report for your boss, so use full sentences and communicate in a clear and professional manner. Grammar isn’t the main focus of the assessment, but don’t use slang or emojis.

Avoid Analysis Paralysis : this is envisioned as a 30 minute exercise, so you don’t have time to exhaustively explore every aspect of the data set.

Avoid Writer’s Block : this is envisioned as a 200-400 word exercise, so quickly find something you can communicate and write about.

(a) Watch this 7-minute video introduction to hedging. Hedging is helpful whenever you can’t say something is 100% one way or another, as is often the case. In statistics, hedging should always be used with respect to the limitations of data and the strength and generalizability of the conclusions.

(b) Provide a small introduction of one or two sentences to draw your reader in and then explain what you’ll be discussing. Be definitive about what your data is, and use hedging to caveat the limitations of the data.

(c) Provide one or two clearly titled and labeled figures addressing interesting features of athletes’ ages.

(d) Provide one or two clearly labeled summary tables addressing interesting features of athletes’ ages.

(e) Watch this 8-minute video introduction to plagiarism. You don’t need to cite any outside references for your report to your boss, but you will be referring to your own created figures and tables. We’ll use this as an excuse to get started early thinking about this important topic, and also use it as an exercise to start getting into the right referencing habits. It’s easy and natural and makes your writing better (not to mention that it avoids potential serious academic integrity violations...).

(f) Describe the interesting features of athletes’ ages that you’ve found, referencing the figures and summary tables created in (c) and (d) just above. Use at least two of the vocabulary words listed below; but your boss isn’t a statistician, so make sure to clearly define and explain the vocabulary you use.

(g) Finish with a conclusion to remind your boss of the key take home points from your summary about the athletes’ ages. Be definitive about what your findings are, but use hedging to caveat the limitations of the conclusion more generally.

Vocabulary

- Cleaning data
- Tidy data
- Handling missing values (NAs)
- Removing a column
- Extracting a subset of variables
- Filtering a tibble based on a condition (e.g. based on the values in one or more of the variables/columns)
- Sorting data based on the values of a variable
- Defining new variables
- Renaming the variables
- Producing new data frames
- Grouping categories
- Creating summary tables

You may also find these vocabulary words from last week useful with your writing this week

- location/center (mean, median, mode) and scale/spread (range, IQR, var, sd)
  note: interpreting center and spread relative to each other can be helpful
- shape (symmetric, left-skewed, right-skewed, unimodal, bimodal, multimodal, uniform)
- outliers/extreme values
  note: this can be related to the tails of a distribution (heavy-tailed, thin-tailed)
- frequency (most, least, pattern tendencies)

Part 2: OPTIONAL but Recommended

You may complete these questions for practice if you wish. You are not required to complete these questions as they ARE NOT included as part of your mark.

Question 5: Amazon Books

The code below reads in data about books sold on Amazon.
- Note that the height ( Height ), width ( Width ), and thickness ( Thick ) of books in this data frame are measured in inches.

library(tidyverse) # Load the tidyverse package so it is available to use
books <- read.csv("amazonbooks.csv")

(a) What is the name of the book(s) with the smallest number of pages in this sample of books, and how many pages does it have?

books %>% arrange(NumPages) %>% select(Title, NumPages) %>% head()
##                                                               Title NumPages
## 1                                           Big Dog . . . Little Dog       24
## 2                          The Berenstain Bears He Bear, She Bear       24
## 3 The Shape of Me and Other Stuff: Dr. Seuss s Surprising Word Book       24
## 4                                Cloudy With a Chance of Meatballs       32
## 5                                               Go the F**k Asleep       32
## 6                                                         Madeline       54

(b) Create a summary table which reports the total number of books written by each author and the mean and variance of the number of pages per book for each author, for the books represented in this sample of books.

books %>%
  group_by(Author) %>%
  summarise(n = n(),
            mean_pages = sum(NumPages) / n,
            var_pages = var(NumPages))
## # A tibble: 256 x 4
##    Author                 n mean_pages var_pages
##    <chr>              <int>      <dbl>     <dbl>
##  1 ""                     1        432        NA
##  2 "Abraham Verghese"     1        667        NA
##  3 "Adam Goodheart"       1        460        NA
##  4 "Adam Hochschild"      1        480        NA
##  5 "Adam Mansbach"        1         32        NA
##  6 "Alaa Aswany"          1        255        NA
##  7 "Alice Munro"          2        320      2048
##  8 "Alice Schroeder"      1        832        NA
##  9 "Allen, Toorawa"       1        200        NA
## 10 "Andrea Warren"        1        160        NA
## # ... with 246 more rows

(c) Modify your code from (b) so as to create a new summary table which contains only information for authors who wrote more than 2 books, and sort them in decreasing order of number of books written.

books %>%
  group_by(Author) %>%
  summarise(n = n(),
            mean_pages = sum(NumPages) / n,
            var_pages = var(NumPages)) %>%
  filter(n > 2) %>%
  arrange(desc(n))
## # A tibble: 16 x 4
##    Author               n mean_pages var_pages
##    <chr>            <int>      <dbl>     <dbl>
##  1 Jodi Picoult         7       414.     1658.
##  2 Vladimir Nabokov     7       316     20528
##  3 Lewis                4       266.    18820.
##  4 Murakami             4       354.     9838.
##  5 Ben Mezrich          3       299       571
##  6 Bruce Ballenger      3       448      9472
##  7 Christensen          3       245.    24917.
##  8 Collins              3       370.     1920.
##  9 Drucker              3       304     11008
## 10 Ha Jin               3       300      5232
## 11 James Patterson      3       438.     1408.
## 12 John Steinbeck       3       392.    63632.
## 13 S.E. Hinton          3       181.      341.
## 14 Seuss                3        56       768
## 15 Shel Silverstein     3       149.     5461.
## 16 William Faulker      3       339       763

Part 3: OPTIONAL for Additional Practice

You may complete these questions for practice if you wish. You are not required to complete these questions as they ARE NOT included as part of your mark.
Question 6: Titanic Data

At the time it departed from England in April 1912, the RMS Titanic was the largest ship in the world. In the night of April 14th to April 15th, the Titanic struck an iceberg and sank approximately 600km south of Newfoundland (a province in eastern Canada). Many people perished in this accident. The code below loads data about the passengers who were on board the Titanic at the time of the accident.

titanic <- read_csv("titanic.csv")
## Rows: 2208 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (12): Name, Survived, Boarded, Class, MWC, Adut_or_Chld, Sex, Ticket_No,...
## dbl (2): Age, Paid
##
## i Use spec() to retrieve the full column specification for this data.
## i Specify the column types or set show_col_types = FALSE to quiet this message.
glimpse(titanic)
## Rows: 2,208
## Columns: 14
## $ Name         <chr> "ABBING, Mr Anthony", "ABBOTT, Mr Ernest Owen", "ABBOTT, ~
## $ Survived     <chr> "Dead", "Dead", "Dead", "Dead", "Alive", "Alive", "Alive"~
## $ Boarded      <chr> "Southampton", "Southampton", "Southampton", "Southampton~
## $ Class        <chr> "3", "Crew", "3", "3", "3", "3", "3", "2", "2", "3", "3",~
## $ MWC          <chr> "Man", "Man", "Child", "Man", "Woman", "Woman", "Man", "M~
## $ Age          <dbl> 42.00, 21.00, 14.00, 16.00, 39.00, 16.00, 25.00, 30.00, 2~
## $ Adut_or_Chld <chr> "Adult", "Adult", "Child", "Adult", "Adult", "Adult", "Ad~
## $ Sex          <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Male~
## $ Paid         <dbl> 7.550000, NA, 20.250000, 20.250000, 20.250000, 7.650000, ~
## $ Ticket_No    <chr> "5547", NA, "CA2673", "CA2673", "CA2673", "348125", "3481~
## $ Boat_or_Body <chr> NA, NA, NA, "[190]", "A", "16", "A", NA, "10", "15", "C",~
## $ Job          <chr> "Blacksmith", "Lounge Pantry Steward", "Scholar", "Jewell~
## $ Class_Dept   <chr> "3rd Class Passenger", "Victualling Crew", "3rd Class Pas~
## $ Class_Full   <chr> "3", "V", "3", "3", "3", "3", "3", "2", "2", "3", "3", "E~

(a) Often, before you start working with a dataset you need to clean it. The variable Adut_or_Chld indicates which passengers were adults and which were children. Use the rename() function to change the name of this variable to Adult_or_Child . The variable MWC records whether the passenger was a man, woman or child. Use the rename() function to change the name of this variable to Man_Woman_or_Child to make this clear.

titanic <- titanic %>%
  rename(Adult_or_Child = Adut_or_Chld, Man_Woman_or_Child = MWC)

Hint: Unless the transformed tibble is saved into a new object or overwrites the original tibble, like oly12 <- oly12 %>% rename(Place_of_birth = PlaceOB) , the changes won’t be permanent.

Since many of their values are missing or unclear, modify the titanic data frame by removing the following variables: Ticket_No , Boat_or_Body , Class_Dept , Class_Full .

titanic <- titanic %>%
  select(Name, Survived, Boarded, Class, Man_Woman_or_Child, Age, Adult_or_Child, Sex, Paid, Job)
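The answer above removes columns by listing every column to keep. An alternative, sketched here on a hypothetical tibble toy (the column names are borrowed from the titanic data, but the values are made up), is to name only the columns to drop by negating them inside select().

```r
library(dplyr)

# Hypothetical toy tibble: column names borrowed from titanic, values made up
toy <- tibble(
  Name = c("A", "B"),
  Age = c(30, 40),
  Ticket_No = c("1", NA),
  Boat_or_Body = c(NA, "C")
)

# Keep everything except the listed columns
trimmed <- toy %>% select(-c(Ticket_No, Boat_or_Body))

names(trimmed)
```

Both approaches should give the same result; the negated form is shorter when only a few columns are removed from a wide table, while the explicit keep-list documents exactly what remains.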
(b) Create a summary table reporting the number of passengers on the Titanic (n), the number of passengers who survived (n_surv), and the proportion of passengers who survived (prop_surv).

titanic %>%
  summarise(n = n(), n_surv = sum(Survived != "Dead"), prop_surv = n_surv / n)
## # A tibble: 1 x 3
##       n n_surv prop_surv
##   <int>  <int>     <dbl>
## 1  2208    712     0.322

(c) Calculate the proportion of deaths for the following groups of passengers.

For men, women, and children:

titanic %>%
  group_by(Man_Woman_or_Child) %>%
  summarise(n = n(), n_died = sum(Survived == "Dead"), proportion = n_died / n)
## # A tibble: 3 x 4
##   Man_Woman_or_Child     n n_died proportion
##   <chr>              <int>  <int>      <dbl>
## 1 Child                124     60      0.484
## 2 Man                 1652   1331      0.806
## 3 Woman                432    105      0.243

For passengers aged between 25-40 years of age:

titanic %>%
  filter(Age >= 25 & Age <= 40) %>%
  summarise(n = n(), n_died = sum(Survived == "Dead"), proportion = n_died / n)
## # A tibble: 1 x 3
##       n n_died proportion
##   <int>  <int>      <dbl>
## 1  1067    739      0.693

For men, women, and children among the passengers who paid more than 50 British pounds for their tickets:

titanic %>%
  filter(Paid > 50) %>%
  group_by(Man_Woman_or_Child) %>%
  summarise(n = n(), n_died = sum(Survived == "Dead"), proportion = n_died / n)
## # A tibble: 3 x 4
##   Man_Woman_or_Child     n n_died proportion
##   <chr>              <int>  <int>      <dbl>
## 1 Child                 13      7      0.538
## 2 Man                  100     70      0.7
## 3 Woman                126      4     0.0317

Write several sentences interpreting the summary tables created for the three groups above.

Survival rates on the Titanic were associated with whether the passenger was a man, woman or child and with the cost of their ticket. About 24% of all women passengers on the Titanic died. Unfortunately, men and children passengers had considerably higher death rates (0.81 and 0.48 respectively). Among the passengers who paid more for their tickets, death rates were lower for the adult passengers, since only 3% of these women and 70% of these men died, but higher for children (54% of these children died).

(d) What was the most common job among passengers of the Titanic? Write 1-2 sentences explaining your answer.

titanic %>% group_by(Job) %>% summarise(n = n()) %>% arrange(desc(n))
## # A tibble: 358 x 2
##    Job                            n
##    <chr>                      <int>
##  1 <NA>                         631
##  2 General Labourer             162
##  3 Fireman                      161
##  4 Trimmer                       73
##  5 Saloon Steward                56
##  6 Farm Labourer                 49
##  7 Farmer                        48
##  8 Saloon Steward (1st class)    48
##  9 Greaser                       33
## 10 Able Seaman                   28
## # ... with 348 more rows

631 of the passengers do not have a job listed (NA). The job recorded for the largest number of passengers is “General Labourer” (162), although there were also 161 firemen.

(e) Plot the age distribution for passengers with the job “General Labourer”, and describe this distribution in 1-2 sentences.

titanic %>%
  filter(Job == "General Labourer") %>%
  ggplot(aes(x = "", y = Age)) +
  geom_boxplot()
[Figure: boxplot of Age (y-axis) for general labourers]

titanic %>%
  filter(Job == "General Labourer") %>%
  ggplot(aes(x = Age)) +
  geom_histogram(bins = 30, color = "black", fill = "gray")
[Figure: histogram of Age (x-axis) against count (y-axis) for general labourers]

General labourers on the Titanic ranged in age from under 15 to just over 50. The age distribution is slightly right skewed, with a few outliers in the right tail corresponding to older individuals (over age 43). The median age of general labourers on the Titanic is close to 25 years, with an interquartile range of approximately 9 years (21 to 30 years).

(f) Were any of the general labourers on the Titanic women? If so, how many?

# there are several ways to do this
titanic %>%
  filter(Job == "General Labourer") %>%
  ggplot(aes(x = Man_Woman_or_Child)) +
  geom_bar()
[Figure: bar chart of count (y-axis) by Man_Woman_or_Child (x-axis: Child, Man, Woman) for general labourers]

titanic %>%
  filter(Job == "General Labourer") %>%
  group_by(Man_Woman_or_Child) %>%
  summarise(n = n())
## # A tibble: 3 x 2
##   Man_Woman_or_Child     n
##   <chr>              <int>
## 1 Child                  1
## 2 Man                  160
## 3 Woman                  1
titanic %>% filter(Job == "General Labourer" & Sex == "Female")
## # A tibble: 1 x 10
##   Name             Survi~1 Boarded Class Man_W~2   Age Adult~3 Sex    Paid Job
##   <chr>            <chr>   <chr>   <chr> <chr>   <dbl> <chr>   <chr> <dbl> <chr>
## 1 HAAS, Miss Aloi~ Dead    Southa~ 3     Woman      24 Adult   Fema~  8.85 Gene~
## # ... with abbreviated variable names 1: Survived, 2: Man_Woman_or_Child,
## #   3: Adult_or_Child

Of the 162 general labourers on the Titanic, 160 were men, 1 was a child and 1 was a woman.

(g) What are the names of the passengers with the top 4 most expensive tickets? Did these passengers survive the accident?

titanic %>% arrange(desc(Paid)) %>% select(Name, Paid, Survived)
## # A tibble: 2,208 x 3
##    Name                               Paid Survived
##    <chr>                             <dbl> <chr>
##  1 CARDEZA, Mr Thomas Drake Martinez  512. Alive
##  2 CARDEZA, Mrs Charlotte Wardle      512. Alive
##  3 LESUEUR, Mr Gustave J.             512. Alive
##  4 WARD, Miss Annie Moore             512. Alive
##  5 FORTUNE, Miss Alice Elizabeth      263  Alive
##  6 FORTUNE, Miss Ethel Flora          263  Alive
##  7 FORTUNE, Miss Mabel Helen          263  Alive
##  8 FORTUNE, Mr Charles Alexander      263  Dead
##  9 FORTUNE, Mr Mark                   263  Dead
## 10 FORTUNE, Mrs Mary                  263  Alive
## # ... with 2,198 more rows

The most expensive tickets were sold to:
- Mr Thomas Drake Martinez CARDEZA
- Mrs Charlotte Wardle CARDEZA
- Mr Gustave J. LESUEUR
- Miss Annie Moore WARD

All four of these passengers paid 512.32 British pounds for their tickets and they all survived the accident.

(h) In this question, you will compare the distribution of ticket prices for survivors and non-survivors of the Titanic using both visualizations and summary tables.

Construct two histograms to visualize the distribution of ticket prices for survivors and non-survivors (i.e. one histogram for survivors and one for non-survivors). Write 2-3 sentences comparing the two distributions based on these plots.

titanic %>%
  filter(Survived == "Alive") %>%
  ggplot(aes(x = Paid)) +
  geom_histogram(color = "black", fill = "gray", bins = 30)

[Figure: histogram of Paid (x-axis) against count (y-axis) for survivors]

titanic %>%
  filter(Survived == "Dead") %>%
  ggplot(aes(x = Paid)) +
  geom_histogram(color = "black", fill = "gray", bins = 30)
[Figure: histogram of Paid (x-axis) against count (y-axis) for non-survivors]

The distribution of ticket prices is very right-skewed for both the survivors and those who perished; while most of the tickets cost less than 100 pounds, some of the survivors paid over 500 pounds for their tickets. The first bar in the histogram (corresponding to the lowest range of fares) is much taller in the distribution of non-survivors than survivors, so we see that most of the individuals who bought these low-cost tickets did not survive the accident.

Construct a pair of boxplots (in the same figure) to visualize the distribution of ticket prices for survivors and non-survivors. Write 2-3 sentences comparing the two distributions based on these plots.

titanic %>%
  ggplot(aes(x = Survived, y = Paid)) +
  geom_boxplot()
[Figure: side-by-side boxplots of Paid (y-axis) by Survived (Alive, Dead)]

Again, we see that both distributions are highly right-skewed. From the boxplots, it is clear that the median fare paid by surviving passengers was higher than that paid by the non-survivors, and the interquartile range of ticket prices is much wider among survivors than non-survivors (IQR of approximately 50 pounds for survivors and less than 25 pounds for non-survivors). The distribution of ticket prices is more right-skewed among non-survivors than among survivors, as the median appears to be very close to the first quartile.

Construct a summary table with the minimum, first quartile, median, mean, third quartile, and maximum ticket price for survivors and non-survivors.

titanic %>%
  group_by(Survived) %>%
  summarise(n = n(),
            min = min(Paid, na.rm = TRUE),
            first_quartile = quantile(Paid, 0.25, na.rm = TRUE),
            median = median(Paid, na.rm = TRUE),
            mean = mean(Paid, na.rm = TRUE),
            third_quartile = quantile(Paid, 0.75, na.rm = TRUE),
            max = max(Paid, na.rm = TRUE))

## # A tibble: 2 x 8
##   Survived     n   min first_quartile median  mean third_quartile   max
##   <chr>    <int> <dbl>          <dbl>  <dbl> <dbl>          <dbl> <dbl>
## 1 Alive      712     0          11.3    26    49.6           57.9  512.
## 2 Dead      1496     0           7.85   10.5  22.9           26    263

Write 2-3 sentences comparing the two distributions based on this summary table.

The minimum ticket price among both survivors and non-survivors is 0, which is strange; more investigation is required to determine whether this is an error in the data or if some passengers in fact received complimentary tickets. From the summary table, we see that the median ticket price among survivors was more than twice as high as the median ticket price among non-survivors. Among survivors, 75% paid less than 58 pounds
for their tickets, while 75% of the non-survivors paid less than 26 pounds. The mean ticket price is also much higher among survivors, but this is pulled up by the particularly high prices paid by a small number of passengers.

As a side note, we can take a closer look at the passengers with 0-pound tickets:

titanic %>%
  filter(Paid == 0) %>%
  group_by(Class) %>%
  summarise(n())

## # A tibble: 3 x 2
##   Class `n()`
##   <chr> <int>
## 1 1         8
## 2 2        14
## 3 3         6

titanic %>%
  filter(Paid == 0) %>%
  group_by(Boarded) %>%
  summarise(n())

## # A tibble: 4 x 2
##   Boarded     `n()`
##   <chr>       <int>
## 1 Belfast         9
## 2 Cherbourg       1
## 3 Queenstown      1
## 4 Southampton    17

titanic %>%
  filter(Paid == 0) %>%
  group_by(Survived) %>%
  summarise(n())

## # A tibble: 2 x 2
##   Survived `n()`
##   <chr>    <int>
## 1 Alive        4
## 2 Dead        24

There is no obvious pattern connecting individuals with recorded 0-pound tickets. It is not clear whether this is an error or not, but since only 28 out of 2208 observations are affected, these are not expected to have a large impact on the comparison.

Comment on the strengths and weaknesses of each of the visualizations and summary table constructed above.

Histograms: The paired histograms give us a good overall impression of the distribution of ticket prices among survivors and non-survivors, but it is difficult to extract estimates of the mean, median, and quantiles, as well as the bounds of each bin.

Boxplots: The boxplots make it easy for us to compare the medians, quartiles, IQR (and outliers) of ticket prices across the two groups, although we cannot easily extract exact values for these. Also, since boxplots only display a small number of summary statistics, we lose information about the shape of the distributions.

Summary table: The summary table makes it easy to compare numerical values of key statistics. It is only from the summary table that we noticed that some passengers were recorded to have paid 0 pounds for their tickets.
However, it is more difficult to get a quick sense of the overall shape of the distributions from these summary statistics alone, although these could be used to sketch a pair of boxplots.
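
The remark that the summary statistics could be used to sketch a pair of boxplots can be made concrete. The following is a minimal base R sketch and is not part of the original solutions: the data frame `fare_summary` and the columns `iqr` and `upper_fence` are names introduced here for illustration, and the numbers are retyped from the rounded summary table, so the results are approximate. The 1.5 × IQR fence is the usual boxplot convention (and the default used by `geom_boxplot()`), assumed here rather than stated in the problem set.

```r
# Quartiles retyped from the rounded summary table (fares in pounds).
# Assumption: these rounded values stand in for the exact ones in `titanic`.
fare_summary <- data.frame(
  Survived       = c("Alive", "Dead"),
  min            = c(0, 0),
  first_quartile = c(11.3, 7.85),
  median         = c(26, 10.5),
  third_quartile = c(57.9, 26),
  max            = c(512, 263)
)

# Interquartile range: the height of each box in the boxplot.
fare_summary$iqr <- fare_summary$third_quartile - fare_summary$first_quartile

# Upper fence under the 1.5 * IQR rule; fares above it would be
# drawn as individual outlier points rather than as part of the whisker.
fare_summary$upper_fence <- fare_summary$third_quartile + 1.5 * fare_summary$iqr

# Ratio of the median fare of survivors to that of non-survivors.
median_ratio <- fare_summary$median[1] / fare_summary$median[2]

print(fare_summary)
print(round(median_ratio, 2))
```

With the rounded table values, the IQRs come out to about 46.6 pounds for survivors and about 18.2 pounds for non-survivors, in line with the rough readings taken from the boxplots, and the maximum fare in each group lies far above its upper fence, which is consistent with the strong right skew.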