R-lab-03-assignment

docx

School

Truman State University

Course

190

Subject

Computer Science

Date

Feb 20, 2024

Type

docx

Pages

22

Uploaded by BarristerCrabMaster699

RLab 03 - Transforming with dplyr
Samuel Park
today (change this, too)

The Story so Far…

As you already know, Dr. Hyun-Joo Kim has given a short survey to her STAT 190 classes for the past few years, and she then uses that data throughout the semester. The dataset found here has the information across six semesters that she has used the survey. Her grader or student worker types the information from the paper sheets in by hand. Over time, this leads to dirty data. There was an especially big change in recording between rows 194 and 195.

# Before you start, save this RMarkdown file into your project folder.
# Load the tidyverse package with dplyr commands.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Load the data file from the U: drive - U:\_MT Student File Area\tberegovska\Stat220
Clean_KimData <- read_csv("U:/_MT Student File Area/tberegovska/Stat220/Clean-KimData.csv")
## Rows: 377 Columns: 25
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (6): Gender, Birth Order, dog vs cat, Handed, On/Off Campus, Phone
## dbl (19): Semester, Siblings, Shoe Size, Height, Weight, Calories per day, S...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Then, you might want to save your raw dataset into your project folder
write_csv(Clean_KimData, "Clean_KimData.csv")
# after you've done this, you can load it next time from your Y: Drive
# by putting a hashtag # at the front of line 23 and removing the # from line 31
# Once you've saved it into your project folder last time, get it from here instead.
#Clean_KimData <- read_csv("Clean_KimData.csv")
# Finally, create a new copy of the data to keep the "clean" version clean.
KimData <- Clean_KimData

As an aside, note that there are two commands: read.csv and read_csv. The first is built in, and the second comes from the tidyverse package. Base R reads data in as a data frame, while the tidyverse version of read_csv reads data in as a tibble, which is a tidyverse version of a data frame with some small, but fancy upgrades. We used the tidyverse read_csv in this lab because it makes the output a bit prettier.

Data Transformations with dplyr

In this section, we’ll talk about the six dplyr commands you need to know:
filter: pick individuals by their values.
arrange: reorder the rows.
select: pick variables by their names.
mutate: create new variables as functions of existing variables.
group_by and summarize: these two go together.
    group_by: mark data points as falling in different groups according to some variable.
    summarise or summarize: collapse many values down to a single summary.

For more Information

You will be happy that you know how to use dplyr, and it has even more capabilities than we will discuss here. You can get a nice cheatsheet from https://www.rstudio.com/resources/cheatsheets/ and our textbook also has a lot about it in Chapter 5: https://r4ds.had.co.nz/transform.html

Brief Aside: Pipes

A pipe is a way to connect multiple lines of code.
It basically means, “take the result of this line down to the next line.” ggplot uses + to chain its layers, much like a pipe. That’s easy to remember (because + means “add something to your graph”) but is not good coding practice (because + more commonly means “add these up”). dplyr and other tidyverse packages use a unique pipe that has no other meaning: %>% No, really, that’s what it looks like. Yes, that’s weird. But you have to admit that you aren’t going to use %>% for anything else.
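To make the pipe idea concrete, here is a minimal sketch that writes the same two-step transformation twice: once as nested function calls and once with %>%. The toy tibble and its values are made up for illustration and are not the survey data; filter, arrange, and desc are the dplyr commands covered later in this lab.

```r
library(dplyr)

# Made-up toy data, only for illustration (not the survey data).
toy <- tibble(
  Gender   = c("F", "M", "F", "M"),
  Semester = c(2, 4, 6, 1)
)

# Nested form: you have to read it from the inside out.
nested <- arrange(filter(toy, Gender == "F"), desc(Semester))

# Piped form: each step sits on its own line and reads top to bottom.
piped <- toy %>%
  filter(Gender == "F") %>%
  arrange(desc(Semester))

# Both forms should produce the same tibble.
print(identical(nested, piped))
```

The piped version passes the result of each line in as the first argument of the next command, which is why the data set name appears only once.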
Filter Basic Syntax The filter command selects certain observations. So, if we just want those who identified themselves as women or as only children in the dataset: filter (KimData, Gender == "F" ) # you need the quotes for characters or factors. ## # A tibble: 214 x 25 ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 6 F 5 Middle 11 71 195 cat ## 2 4 F 0 Only 10 64 187 cat ## 3 6 F 1 Last 9.5 69 150 dog ## 4 7 F 3 First 9.5 64 193 dog ## 5 2 F 3 Middle 11 69.5 180 dog ## 6 4 F 0 Only 7 64 135 neither ## 7 4 F 1 Last 7.5 65 130 dog ## 8 4 F 1 Last 6.5 67 128 cat ## 9 2 F 2 First 8 65 124 both ## 10 2 F 3 Middle 8 65 145 dog ## # ... with 204 more rows, and 17 more variables: Handed <chr>, ## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>, ## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>, ## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>, ## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>, ## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>, ## # Phone <chr>, Hrs per day on phone <dbl>, ... filter (KimData, Siblings == 0 ) # you don’t need the “quotes” for numbers
## # A tibble: 39 x 25 ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 4 F 0 Only 10 64 187 cat ## 2 4 F 0 Only 7 64 135 neither ## 3 10 M 0 Only 13 72 210 both ## 4 7 M 0 Only 11 73 163 dog ## 5 2 M 0 Only 10 65 130 dog ## 6 4 F 0 Only 7 62 135 both ## 7 4 F 0 Only 7 64 105 cat ## 8 1 M 0 Only 8 65 135 cat ## 9 3 M 0 Only 9.5 62 120 cat ## 10 4 F 0 Only 7 63 110 cat ## # ... with 29 more rows, and 17 more variables: Handed <chr>, ## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>, ## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>, ## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>, ## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>, ## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>, ## # Phone <chr>, Hrs per day on phone <dbl>, ... The “pipe” syntax is a great way to combine more than one transformation, and it’s really the preferred format for these commands. For clarity, we typically put each piped command on its own line. KimData %>% filter (Gender == "F" ) #you need the quotes for characters or factors. ## # A tibble: 214 x 25 ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat ## 2 4 F 0 Only 10 64 187 cat ## 3 6 F 1 Last 9.5 69 150 dog ## 4 7 F 3 First 9.5 64 193 dog ## 5 2 F 3 Middle 11 69.5 180 dog ## 6 4 F 0 Only 7 64 135 neither ## 7 4 F 1 Last 7.5 65 130 dog ## 8 4 F 1 Last 6.5 67 128 cat ## 9 2 F 2 First 8 65 124 both ## 10 2 F 3 Middle 8 65 145 dog ## # ... with 204 more rows, and 17 more variables: Handed <chr>, ## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>, ## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>, ## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>, ## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>, ## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>, ## # Phone <chr>, Hrs per day on phone <dbl>, ... KimData %>% filter (Siblings == 0 ) #you don’t need the “quotes” for numbers ## # A tibble: 39 x 25 ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 4 F 0 Only 10 64 187 cat ## 2 4 F 0 Only 7 64 135 neither ## 3 10 M 0 Only 13 72 210 both ## 4 7 M 0 Only 11 73 163 dog ## 5 2 M 0 Only 10 65 130 dog
## 6 4 F 0 Only 7 62 135 both
## 7 4 F 0 Only 7 64 105 cat
## 8 1 M 0 Only 8 65 135 cat
## 9 3 M 0 Only 9.5 62 120 cat
## 10 4 F 0 Only 7 63 110 cat
## # ... with 29 more rows, and 17 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...

When you use one of these dplyr commands, the result is printed but is not, by default, saved under any name. If you want to keep the results, you need to explicitly assign the result of the transformation to a variable. For example:
StinkyBoys <- KimData %>% filter(Gender == "M")
View(StinkyBoys)

Note: Dr. Thatcher would like it to be known that Dr. Alberts named this variable. Dr. Beregovska would never use this name for a variable.

More seriously, using View() causes a popup window which is not included in your actual knitted document. We almost always want our output in our knitted document. So, if you want something in your output, use head() or summary() instead.

More on “Relational Operators”

How do you select individuals of interest? Maybe you want to select individuals because a variable exactly matches some value. That’s what the double equals sign, ==, means. But there are other comparisons you might want to do. What they all have in common is that they’re expressions that end up giving you a TRUE or FALSE value. [Note: TRUE and FALSE are reserved values in R that stand for the results of these kinds of computations.]
Equal to: == (more properly “Logically Equal to”)
Less than: <
Less than or equal to: <=
Greater than: >
Greater than or equal to: >=
Not equal to: != (That’s an exclamation point!)

As an example, you might want to pick only students who have been at Truman at least 4 semesters:
UpperClass <- KimData %>% filter(Semester >= 4)
View(UpperClass)

You can also combine multiple comparisons using the “and” and “or” connectors.
Or: A | B True if at least one of A or B is true.
And: A & B True if both A and B are true.

The example below shows code that finds all male students who are also first-year students.
FreshMales <- KimData %>% filter(Semester < 2 & Gender == "M")

Question 0
Edit the command below so that you’re specifying either male or first-year. Are more or fewer individuals selected? Give a brief explanation of why that’s the case.
FreshOrMales <- KimData %>% filter(Semester < 2 | Gender == "M")

Q0 explanation
More individuals are selected, because “or” keeps every student who is a first-year student, male, or both, while “and” keeps only the students who are both first-year and male.

Finally, there are special logical commands that detect specific characteristics of your data. One you should know about is is.na. This command checks to see if the value of a variable is NA (i.e., a missing value). The code below returns all individuals who did not give their number of siblings.
NASibs <- KimData %>% filter(is.na(Siblings))

Arrange

The arrange command sorts your data into a certain order. So, if we want the data in order from those with the fewest semesters to those with the most, we could run
KimData %>% arrange(Semester)
## # A tibble: 377 x 25 ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 0 M 1 First 10 70 165 dog ## 2 0 F 4 Middle 7 60 105 neither ## 3 0 F 2 Middle 10 68 155 cat ## 4 0 F 2 Middle 10 68 155 cat ## 5 0 F 1 Last 8 67 140 dog ## 6 0 F 2 First 6 NA NA neither ## 7 0 F 1 First 8 69 150 dog ## 8 0 F 2 First 12 67 180 dog ## 9 0 M 1 Last 13 77 193 dog ## 10 1 F 3 Last 9.5 66 145 cat ## # ... with 367 more rows, and 17 more variables: Handed <chr>, ## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>, ## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>, ## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>, ## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>, ## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>, ## # Phone <chr>, Hrs per day on phone <dbl>, ... On the other hand, if we wanted to order by decreasing number of semesters, the desc() command could be put around Semester . KimData %>% arrange ( desc (Semester)) ## # A tibble: 377 x 25 ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 10 M 0 Only 13 72 210 both
## 2 9 M 1 Last 12 74 195 dog
## 3 8 M 2 Last 9 70 130 neither
## 4 8 F NA Last 7 63.0 122. neither
## 5 8 M 1 First 8.5 70 130 dog
## 6 8 M 2 First 10.5 68 178 both
## 7 8 M 2 Last 10 71 170 both
## 8 8 F 2 Last 7 65 135 cat
## 9 7 F 3 First 9.5 64 193 dog
## 10 7 M 0 Only 11 73 163 dog
## # ... with 367 more rows, and 17 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...

Select

The select command selects only certain variables. This is especially helpful for giant datasets, so you can make something smaller to work with. Let’s make a new data set with only the variables that describe something “physical” about each student. Note that Shoe Size needs funny back quotes (like the triple-backquotes for chunks) around it because it has a space in the name. The quotes will appear when you TAB-complete the name as long as you’ve already run a command that loads the data set in.
KimDataPhysical <- KimData %>% select(Semester, Gender, `Shoe Size`, Height, Weight, Handed)
head(KimDataPhysical)
## # A tibble: 6 x 6
##   Semester Gender `Shoe Size` Height Weight Handed
##      <dbl> <chr>        <dbl>  <dbl>  <dbl> <chr>
## 1 6 F 11 71 195 Right
## 2 4 F 10 64 187 Right
## 3 6 F 9.5 69 150 Right
## 4 7 F 9.5 64 193 Right
## 5 6 M 13 73 181 Right
## 6 6 M 10 68 167 Right

If you want to have “all but” a certain number of variables, give a list with minus signs in front. If we decided we didn’t want the Handed variable, for example, we could get rid of it:
KimDataPhysical <- KimDataPhysical %>% select(-Handed)
head(KimDataPhysical)
## # A tibble: 6 x 5
##   Semester Gender `Shoe Size` Height Weight
##      <dbl> <chr>        <dbl>  <dbl>  <dbl>
## 1 6 F 11 71 195
## 2 4 F 10 64 187
## 3 6 F 9.5 69 150
## 4 7 F 9.5 64 193
## 5 6 M 13 73 181
## 6 6 M 10 68 167

We could also select particular columns by their column number. This can be problematic or annoying, but may also be easier for datasets that do not have variable names included. Remember that the c() command makes a vector of numbers. So, this will select numerical variables, but I did it the hard way, by going through and counting by hand which variables those are.
KimNumData <- select(KimData, c(1, 3, 5:7, 12, 13:22))
View(KimNumData)

The rename command is a variation of select that simply changes the name of a variable.
KimDataPhys2 <- rename(KimData, Shoe.Size = `Shoe Size`)
KimDataPhys3 <- select(KimData, Shoe.Size = `Shoe Size`, everything())

You can see how the name changes in the environment window on the right. It is possible, but annoying, to rename variables directly with the select command. You can see in KimDataPhys3 that it weirdly moved Shoe.Size to the first column. The everything() command, as you might guess, keeps all of the remaining columns. Renaming variables can be especially nice when the original dataset has really long or confusing names, but it makes it harder to connect back to the original later.

Mutate

The mutate command creates a new variable by applying a function to an existing variable or variables. This is especially helpful for data cleaning, or changing units or data types. Suppose we wanted to know the number of years that a student had been at Truman, rather than the number of semesters.
Using mutate, we could create the Year variable by dividing Semester by 2, then rounding to the nearest whole number with the round command.
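Before running that command, it may help to see what round does to the half-values that odd semester counts produce. The short sketch below uses only base R; the four semester counts are picked for illustration. Base R's round sends an exact half to the nearest even whole number, so it does not always round up, whereas ceiling does.

```r
# Half-years produced by Semester / 2 for Semester = 1, 3, 5, 7
halves <- c(1, 3, 5, 7) / 2

rounded <- round(halves)    # nearest whole number; exact halves go to the even neighbor
up      <- ceiling(halves)  # always rounds up

print(rbind(halves, rounded, up))
```

If the intent were "a student in semester 3 is in year 2," either function agrees; the two differ for semesters 1 and 5, so the choice matters when reading the Year groups later in the lab.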
KimDataPhysical <- KimData %>% mutate ( Year = round (Semester / 2 )) head (KimDataPhysical) ## # A tibble: 6 x 26 ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 6 F 5 Middle 11 71 195 cat ## 2 4 F 0 Only 10 64 187 cat ## 3 6 F 1 Last 9.5 69 150 dog ## 4 7 F 3 First 9.5 64 193 dog ## 5 6 M 2 Middle 13 73 181 dog ## 6 6 M 2 First 10 68 167 dog ## # ... with 18 more variables: Handed <chr>, On/Off Campus <chr>, ## # Calories per day <dbl>, Servings of Fruit <dbl>, Cups of Water <dbl>, ## # Cups of Coffee <dbl>, Hours of Sleep <dbl>, ## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>, ## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>, ## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>, ## # Phone <chr>, Hrs per day on phone <dbl>, ... The mutate command can also be helpful when you want to create a “logical” variable that’s TRUE when a certain condition is true and FALSE otherwise. For example, maybe we want to label all seniors who have had at least 6 prior semesters: KimDataPhysical <- KimDataPhysical %>% mutate ( Senior = Semester >= 6 ) head (KimDataPhysical) ## # A tibble: 6 x 27 ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 6 F 5 Middle 11 71 195 cat ## 2 4 F 0 Only 10 64 187 cat ## 3 6 F 1 Last 9.5 69 150
dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog
## 6 6 M 2 First 10 68 167 dog
## # ... with 19 more variables: Handed <chr>, On/Off Campus <chr>,
## # Calories per day <dbl>, Servings of Fruit <dbl>, Cups of Water <dbl>,
## # Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...

The mutate command can get tricky pretty quickly. Maybe we want to convert Gender into a factor, and turn the “other” entry into an NA.
KimDataP4 <- mutate(KimDataPhysical, Gender = as.factor(sub("other", NA, Gender)))

Question 1
A person’s “Body Mass Index” is calculated by taking mass divided by height squared. If you’re measuring in metric (kg and m), you’re done. If you’re measuring in pounds and inches (as we’re doing here), you then have to multiply by 703. In other words, BMI = 703*(Weight/Height^2). See the knit version of the lab for the nicely-typeset version. Use the mutate command twice to first create the BMI variable, and then create a variable Obese whose value is TRUE for anyone whose BMI is greater than or equal to 30. Answer below by adding to the code in the next code block.
KimDataPhysical <- KimDataPhysical %>% mutate(BMI = 703 * (Weight / Height^2))
KimDataPhysical <- KimDataPhysical %>% mutate(Obese = BMI >= 30)
head(KimDataPhysical)
## # A tibble: 6 x 29
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150
dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog
## 6 6 M 2 First 10 68 167 dog
## # ... with 21 more variables: Handed <chr>, On/Off Campus <chr>,
## # Calories per day <dbl>, Servings of Fruit <dbl>, Cups of Water <dbl>,
## # Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...

Note: Any results here should be viewed with caution. If we really wanted to explore the “freshman 15,” we would need to do a “matched design” where we measure the same students over multiple years. And, don’t even get us started on the limitations of using BMI by itself as an indicator of obesity.

Our textbook has a list of helpful commands to include in mutate functions: http://r4ds.had.co.nz/transform.html#mutate-funs

Grouping and Summarizing

The group_by command doesn’t do much by itself. It merely tells R that some of the individuals in your data set are grouped together according to the values of one or more of the categorical variables. Here’s what grouping by Gender looks like. Can you see the slight difference in output between the regular data set and the grouped version?
KimDataPhysical
## # A tibble: 377 x 29
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog ## 6 6 M 2 First 10 68 167 dog ## 7 6 M 3 First 13 73 190 dog ## 8 9 M 1 Last 12 74 195 dog ## 9 2 F 3 Middle 11 69.5 180 dog ## 10 4 M 3 Last 11 72.5 175 dog ## # ... with 367 more rows, and 21 more variables: Handed <chr>, ## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>, ## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>, ## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>, ## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>, ## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>, ## # Phone <chr>, Hrs per day on phone <dbl>, ... KimDataPhysical <- KimDataPhysical %>% group_by (Gender) KimDataPhysical ## # A tibble: 377 x 29 ## # Groups: Gender [3] ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 6 F 5 Middle 11 71 195 cat ## 2 4 F 0 Only 10 64 187 cat ## 3 6 F 1 Last 9.5 69 150 dog ## 4 7 F 3 First 9.5 64 193 dog ## 5 6 M 2 Middle 13 73 181 dog ## 6 6 M 2 First 10 68 167 dog ## 7 6 M 3 First 13 73 190 dog ## 8 9 M 1 Last 12 74 195 dog
## 9 2 F 3 Middle 11 69.5 180 dog
## 10 4 M 3 Last 11 72.5 175 dog
## # ... with 367 more rows, and 21 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...

But, combined with summarize, the two commands become a “magical machine” (according to Dr. Alberts). The summarize command collapses each group down to one or more descriptive statistics that are calculated from that group. It has the following form:
summarize(summary.variable = function(old.variables))
where you replace each of the variables with actual variable names and function with an actual function. You can create multiple summary variables in the same command.
KimDataPhysical %>% group_by(Gender) %>% summarize(MHeight = mean(Height, na.rm = TRUE), MWeight = mean(Weight, na.rm = TRUE))
## # A tibble: 3 x 3
##   Gender MHeight MWeight
##   <chr>    <dbl>   <dbl>
## 1 F         64.4    136.
## 2 M         69.9    173.
## 3 other     62      115

Note that these commands need the na.rm = TRUE option. Otherwise, the missing values (NA) would keep R from calculating a mean. Here’s what you get without na.rm:
KimDataPhysical %>% group_by(Gender) %>% summarize(MHeight = mean(Height), MWeight = mean(Weight))
## # A tibble: 3 x 3
##   Gender MHeight MWeight
##   <chr>    <dbl>   <dbl>
## 1 F         NA       NA
## 2 M         NA      173.
## 3 other     62      115

Question 2
We’ve heard of the “freshman 15,” the weight that many college students gain after their first year of all-you-can-eat dorm food. Use group_by and summarize to group students by Year and calculate mean BMI for each group. Does BMI seem to be higher for students who have been here more years? Answer Below.
KimDataPhysical <- KimDataPhysical %>% mutate(Year = round(Semester / 2))
KimDataPhysical <- KimDataPhysical %>% mutate(BMI = 703 * (Weight / Height^2))
Freshman <- KimDataPhysical %>% filter(Year <= 1)
NFreshman <- KimDataPhysical %>% filter(Year > 1)
Freshman <- Freshman$BMI
NFreshman <- NFreshman$BMI
KimDataPhysical %>% group_by(Year) %>% summarize(MFreshman = mean(Freshman, na.rm = TRUE), MNFreshman = mean(NFreshman, na.rm = TRUE))
## # A tibble: 6 x 3
##    Year MFreshman MNFreshman
##   <dbl>     <dbl>      <dbl>
## 1     0      24.1       24.0
## 2     1      24.1       24.0
## 3     2      24.1       24.0
## 4     3      24.1       24.0
## 5     4      24.1       24.0
## 6     5      24.1       24.0

Q2 Explanation
The mean BMI of freshmen (Year <= 1, 24.1) is slightly higher than the mean BMI of non-freshmen (Year > 1, 24.0), so BMI doesn’t seem to be higher for students who have been here more years. Note that every Year row repeats the same two numbers because Freshman and NFreshman are vectors computed outside the grouped data; a separate mean for each year would instead come from summarize(MeanBMI = mean(BMI, na.rm = TRUE)) after group_by(Year).

Calculating Counts and Percentages

The n() command without any arguments inside the parentheses will count the number of individuals within a group. Another specific use of grouping and summarizing is in calculating percentages, or proportions. Here’s the code to calculate the percentage of each gender that are seniors:
KimDataPhysical %>% group_by(Gender) %>% summarize(n = n(), Senior.Pct = mean(Senior == TRUE, na.rm = TRUE))
## # A tibble: 3 x 3
##   Gender     n Senior.Pct
##   <chr>  <int>      <dbl>
## 1 F        214     0.0841
## 2 M        162     0.160
## 3 other      1     0

This code works because a comparison like Senior == TRUE creates a TRUE or FALSE value for each individual.
If you try to do calculations with these TRUE and FALSE values, R converts them to 1’s and 0’s, where 1 stands for being a senior. So, the
mean of that list of 0’s and 1’s becomes (Number TRUE)/(Total Observations), which is a proportion (multiply by 100 to get a percent). Pretty nifty!

Question 3
Write the code to calculate the percentage of obese people in each year. Does this percentage show any clear trend? Answer Below
KimDataPhysical %>% group_by(Year) %>% summarize(n = n(), obese = mean(BMI >= 30, na.rm = TRUE))
## # A tibble: 6 x 3
##    Year     n  obese
##   <dbl> <int>  <dbl>
## 1     0   110 0.186
## 2     1   111 0.0784
## 3     2   112 0.0952
## 4     3    20 0.05
## 5     4    23 0.182
## 6     5     1 0

Q3 Explanation
The obesity rate is highest in Year 0 (0.186) and drops sharply in Year 1 (0.0784). It then rises slightly in Year 2 (0.0952), falls again in Year 3 (0.05), and jumps back up in Year 4 (0.182). Year 5 has only one student, so its rate of 0 says very little. Overall, the rate goes up and down without a clear trend.

Multiple Groups

You can create multiple groups and summary statistics. For example, you could break the sample down by both Gender and Year just by listing both of these variables in the group_by command.

Question 4
Complete the code below to break KimDataPhysical down by both Gender and Year, then calculate the mean BMI and the number of individuals in each group. The first line should save your results to the variable KimBMI, so don’t change that part (although you’ll add on to it). The last line of the code chunk displays the results, so don’t change that either. Do you see any trends within genders? Based on the number of data points in each group, in which groups are the statistics most suspect?
KimBMI <- KimDataPhysical
KimBMI %>% group_by(Gender, Year) %>% summarize(MKimBMI = mean(BMI, na.rm = TRUE))
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups: Gender [3]
## Gender Year MKimBMI
## <chr> <dbl> <dbl>
## 1 F 0 25.4
## 2 F 1 22.2
## 3 F 2 22.8
## 4 F 3 22.4 ## 5 F 4 23.6 ## 6 M 0 24.6 ## 7 M 1 24.6 ## 8 M 2 25.5 ## 9 M 3 24.7 ## 10 M 4 25.0 ## 11 M 5 28.5 ## 12 other 1 21.0 KimBMI ## # A tibble: 377 x 29 ## # Groups: Gender [3] ## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat` ## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> ## 1 6 F 5 Middle 11 71 195 cat ## 2 4 F 0 Only 10 64 187 cat ## 3 6 F 1 Last 9.5 69 150 dog ## 4 7 F 3 First 9.5 64 193 dog ## 5 6 M 2 Middle 13 73 181 dog ## 6 6 M 2 First 10 68 167 dog ## 7 6 M 3 First 13 73 190 dog ## 8 9 M 1 Last 12 74 195 dog ## 9 2 F 3 Middle 11 69.5 180 dog ## 10 4 M 3 Last 11 72.5 175 dog ## # ... with 367 more rows, and 21 more variables: Handed <chr>, ## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>, ## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>, ## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>, ## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>, ## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>, ## # Phone <chr>, Hrs per day on phone <dbl>, ...
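The two-variable grouping above prints a message about the `.groups` argument and does not report how many students fall in each cell. The sketch below shows, on a small made-up tibble (not the survey data), one way to get both the count and the mean per Gender-by-Year cell while also asking summarize to return an ungrouped result. The column names mirror the lab's, but the values are invented.

```r
library(dplyr)

# Made-up toy data, only for illustration (not the survey data).
toy <- tibble(
  Gender = c("F", "F", "F", "M", "M", "M"),
  Year   = c(0, 0, 1, 0, 1, 1),
  BMI    = c(22, 24, 21, 25, NA, 27)
)

by_group <- toy %>%
  group_by(Gender, Year) %>%
  summarize(
    n       = n(),                      # rows in each Gender-by-Year cell
    MeanBMI = mean(BMI, na.rm = TRUE),  # mean of the non-missing BMI values
    .groups = "drop"                    # return an ungrouped tibble, no message
  )

print(by_group)
```

Having n next to each mean makes it easy to spot cells whose statistics rest on only one or two individuals. Note that n() counts rows, including rows where BMI is missing.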
Q4 Explanation
Mean BMI for women drops from Year 0 (25.4) to Year 1 (22.2), while for men it stays at 24.6. After that, both genders alternate between small increases and decreases, and both end with an increase in their last year. The Year 5 value for men (28.5) is much larger than any other value, and Year 5 contains only one student, so that statistic is the most suspect, along with the single-student “other” group.

Return of pipes

tall_Kim <- KimData %>% #This line just declares the dataset
  group_by(Gender, Semester) %>% #what shall we group by? Line for Q5
  summarize(count = n(), #this counts up how many in each thing
            aveht = mean(Height, na.rm = TRUE)) %>% #this finds the average height
  filter(count > 2) %>% #this gets rid of low-n categories
  arrange(aveht) #this sorts them from shortest to tallest
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
tall_Kim #See what you made?
## # A tibble: 16 x 4
## # Groups: Gender [2]
##    Gender Semester count aveht
##    <chr>     <dbl> <int> <dbl>
##  1 F             1    60  62.0
##  2 F             5     5  62.9
##  3 F             4    30  64.3
##  4 F             7     5  65
##  5 F             2    66  65.2
##  6 F             6    11  65.8
##  7 F             3    28  66.5
##  8 F             0     7  66.5
##  9 M             3    25  68.6
## 10 M             2    44  69.5
## 11 M             8     4  69.8
## 12 M             1    41  69.8
## 13 M             4    17  70.5
## 14 M             7    11  71
## 15 M             6     9  71.2
## 16 M             5     7  71.4

Question 5
In the code chunk above, check the line marked ‘Line for Q5’ and explain what the output is after executing this line of code. Be as specific as possible. ** Answer Below **
If I delete the marked line, the grouping variables, Gender and Semester, also disappear from the result. This means that only the count and the average height appear in the output.

This script groups individuals by Gender and Semester, counts the number in each cell, then computes the average height. It eliminates low-n cells, then sorts the result from smallest to largest. That could be handy, right? Also notice that this script is long and wordy, but easy to understand.

When you mix dplyr and ggplot, you have to be careful to get the pipes correct. That can be annoying, but you should keep your graphs away from your data management anyway. How about this? A chart of the average height for each gender, by semester (excluding small groups). By keeping your dplyr transformation command far away from your ggplot visualization command, it’s super clear and easy.
ggplot(data = tall_Kim, mapping = aes(x = Semester, y = aveht, color = Gender)) + geom_point()

Data Cleaning

Commands in the dplyr package are often useful for data cleaning. We’ll talk more about that later, but let’s do one example.

Question 6
Complete the ggplot command below to create a histogram of the Shoe Size variable. Remember the single back quotes around Shoe Size because it’s a variable with a space in it. After making the histogram, do you see any data point that stands out?
** Answer Below **
ggplot(KimDataPhysical, aes(x = `Shoe Size`)) +
  geom_histogram()

Q6 Explanation
The histogram shows the counts of each shoe size. One value stands out far to the right of the rest: the summary of men’s shoe sizes under Question 7 reports a maximum of 113.

When you find an outlier, you shouldn’t just throw it out. However, if you investigate, can’t find a good reason for it, can’t find the correct value for it, and it’s clearly some sort of mistake, then removing the entire individual is likely a reasonable thing to do. Let’s see the effect of removing the individual with the “giant” feet.

Question 7
Write two R commands using dplyr. One should calculate the average size of men’s and women’s feet using all data points. The second should calculate the average size of men’s and women’s feet after removing the person with the giant feet. (Think about how you’ll remove that person using a dplyr command.) How much of a difference in average shoe size did removing the outlier make?
** Answer Below **
# Split the data by gender
MKimDataPhysical <- filter(KimData, Gender == 'M')
FKimDataPhysical <- filter(KimData, Gender == 'F')
# Average shoe size using all data points (missing values ignored)
MShKimDataPhysical <- mean(MKimDataPhysical$`Shoe Size`, na.rm = TRUE)
FShKimDataPhysical <- mean(FKimDataPhysical$`Shoe Size`, na.rm = TRUE)
head(MShKimDataPhysical)
## [1] 11.51398
head(FShKimDataPhysical)
## [1] 7.992991
summary(MKimDataPhysical$`Shoe Size`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    7.00   10.00   11.00   11.51   12.00  113.00       1
summary(FKimDataPhysical$`Shoe Size`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   5.000   7.000   8.000   7.993   9.000  12.000
# Keep only plausible shoe sizes; filter() also drops rows where the size is NA
MSKimDataPhysical <- MKimDataPhysical %>%
  filter(`Shoe Size` >= 7 & `Shoe Size` <= 15)
FSKimDataPhysical <- FKimDataPhysical %>%
  filter(`Shoe Size` >= 4, `Shoe Size` <= 12)
# Average shoe size after removing the outlier
MShoeKimDataPhysical <- mean(MSKimDataPhysical$`Shoe Size`, na.rm = TRUE)
FShoeKimDataPhysical <- mean(FSKimDataPhysical$`Shoe Size`, na.rm = TRUE)
head(MShoeKimDataPhysical)
## [1] 10.84748
head(FShoeKimDataPhysical)
## [1] 7.992991

Q7 Explanation
Before removing the outlier, the average shoe size is 11.51398 for men and 7.992991 for women. After removing it, the men’s average drops to 10.84748, a difference of about 0.67, while the women’s average does not change.
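The same before-and-after comparison can also be written as two group_by and summarize pipelines, in the style of the tall_Kim chunk. The sketch below is only an illustration under stated assumptions: `toy_shoes` is a hypothetical stand-in for KimData with invented values, and the cutoff of 20 is an arbitrary choice for this toy data, not a rule from the lab.

```r
library(dplyr)

# Hypothetical toy data standing in for KimData. The values are invented
# for illustration, with one implausible shoe size (113) as the outlier.
toy_shoes <- tibble(
  Gender = c("M", "M", "M", "F", "F", "F"),
  `Shoe Size` = c(10, 12, 113, 7, 8, 9)
)

# Command 1: average shoe size by gender using every row.
avg_all <- toy_shoes %>%
  group_by(Gender) %>%
  summarize(avg_shoe = mean(`Shoe Size`, na.rm = TRUE))

# Command 2: the same summary after dropping the implausible value.
# The cutoff of 20 is an assumption chosen for this toy data.
# Note that filter() also drops rows where the comparison is NA.
avg_trimmed <- toy_shoes %>%
  filter(`Shoe Size` < 20) %>%
  group_by(Gender) %>%
  summarize(avg_shoe = mean(`Shoe Size`, na.rm = TRUE))

# Put the two summaries side by side and compute the change per gender.
comparison <- avg_all %>%
  inner_join(avg_trimmed, by = "Gender", suffix = c("_all", "_trimmed")) %>%
  mutate(difference = avg_shoe_all - avg_shoe_trimmed)

print(comparison)
```

Grouping by Gender keeps both averages in one table, so there is no need to split the data into separate male and female data frames first.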