R-lab-03-assignment
School
Truman State University
Course
190
Subject
Computer Science
Date
Feb 20, 2024
RLab 03 - Transforming with dplyr
Samuel Park
today (change this, too)
The Story so Far…
As you already know, Dr. Hyun-Joo Kim has given a short survey to her STAT 190 classes for the past few years, and she then uses that data throughout the semester. The dataset found here has the information across six semesters that she has used the survey. Her grader or student worker types the information from the paper sheets in by hand. Over time, this leads to dirty data. There was an especially big change in recording between rows
194 and 195.
# Before you start, save this RMarkdown file into your project folder.
# Load the tidyverse package with dplyr commands.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Load the data file from the U: drive - U:\_MT Student File Area\tberegovska\Stat220
Clean_KimData <- read_csv("U:/_MT Student File Area/tberegovska/Stat220/Clean-KimData.csv")
## Rows: 377 Columns: 25
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): Gender, Birth Order, dog vs cat, Handed, On/Off Campus, Phone
## dbl (19): Semester, Siblings, Shoe Size, Height, Weight, Calories per day, S...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Then, you might want to save your raw dataset into your project folder
write_csv(Clean_KimData, "Clean_KimData.csv")
# After you've done this, you can load it next time from your Y: Drive
# by putting a hashtag # at the front of line 23 and removing the # from line 31
# Once you've saved it into your project folder last time, get it from here instead.
# Clean_KimData <- read_csv("Clean_KimData.csv")
# Finally, create a new copy of the data to keep the "clean" version clean.
KimData <- Clean_KimData
As an aside, note that there are two commands: read.csv and read_csv. The first is built in, and the second comes from the tidyverse package. Base R reads data in as a data frame, while the tidyverse version, read_csv, reads data in as a tibble, which is a tidyverse version of a data frame with some small, but fancy upgrades. We used the tidyverse read_csv in this lab because it makes the output a bit prettier.
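The difference between the two readers can be checked directly. The sketch below is only an illustration: it writes a small made-up CSV to a temporary file (the file and its three rows are assumptions, not part of Dr. Kim's data) and reads it both ways.

```r
library(readr)  # provides read_csv(); also attached by library(tidyverse)

# Hypothetical three-row file, written to a temporary location for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Gender,Shoe Size", "F,9.5", "M,11", "F,7"), tmp)

base_version <- read.csv(tmp)                          # base R: plain data frame
tidy_version <- read_csv(tmp, show_col_types = FALSE)  # readr: tibble

class(base_version)   # a plain data.frame
class(tidy_version)   # includes "tbl_df", the tibble class

names(base_version)   # base R rewrites the column name to Shoe.Size
names(tidy_version)   # readr keeps the original name, Shoe Size
```

Note that the tibble keeps the space in the column name, which is why back quotes are needed around names like Shoe Size later in the lab.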
Data Transformations with dplyr
In this section, we’ll talk about the six dplyr commands you need to know:
• filter: pick individuals by their values.
• arrange: reorder the rows.
• select: pick variables by their names.
• mutate: create new variables as functions of existing variables.
• group_by and summarize: these two go together.
  • group_by: mark data points as falling in different groups according to some variable.
  • summarise or summarize: collapse many values down to a single summary.
For More Information
You will be happy that you know how to use dplyr, and it has even more capabilities than we will discuss here. You can get a nice cheatsheet from https://www.rstudio.com/resources/cheatsheets/ and our textbook also has a lot about it in Chapter 5 – https://r4ds.had.co.nz/transform.html
Brief Aside: Pipes
A pipe is a way to connect multiple lines of code. It basically means, “take the result of this line down to the next line.” ggplot uses + as a pipe. That’s easy to remember (because + means “add something to your graph”), but it is not good coding practice (because + more commonly means “add these up”). dplyr and other tidyverse packages use a unique pipe that has no other meaning: %>%. No, really, that’s what it looks like. Yes, that’s weird. But, you have to admit that you aren’t going to use %>% for anything else.
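To make the idea concrete, here is a minimal sketch comparing a nested call with the same computation written with %>%. The heights vector is made up for illustration.

```r
library(dplyr)  # attaches the %>% pipe

heights <- c(71, 64, 69, NA)   # made-up heights in inches

# Nested call: read from the inside out.
nested <- round(mean(heights, na.rm = TRUE), 1)

# Piped call: read top to bottom. x %>% f(y) means f(x, y).
piped <- heights %>%
  mean(na.rm = TRUE) %>%
  round(1)

identical(nested, piped)   # both versions compute the same value
```

The piped version does exactly the same work; it only changes the order in which you read the steps.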
Filter
Basic Syntax
The filter command selects certain observations. So, if we just want those who identified themselves as women or as only children in the dataset:
filter(KimData, Gender == "F") # you need the quotes for characters or factors.
## # A tibble: 214 x 25
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 2 F 3 Middle 11 69.5 180 dog
## 6 4 F 0 Only 7 64 135 neither
## 7 4 F 1 Last 7.5 65 130 dog
## 8 4 F 1 Last 6.5 67 128 cat
## 9 2 F 2 First 8 65 124 both
## 10 2 F 3 Middle 8 65 145 dog
## # ... with 204 more rows, and 17 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
filter(KimData, Siblings == 0) # you don’t need the “quotes” for numbers
## # A tibble: 39 x 25
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 4 F 0 Only 10 64 187 cat
## 2 4 F 0 Only 7 64 135 neither
## 3 10 M 0 Only 13 72 210 both
## 4 7 M 0 Only 11 73 163 dog
## 5 2 M 0 Only 10 65 130 dog
## 6 4 F 0 Only 7 62 135 both
## 7 4 F 0 Only 7 64 105 cat
## 8 1 M 0 Only 8 65 135 cat
## 9 3 M 0 Only 9.5 62 120 cat
## 10 4 F 0 Only 7 63 110 cat
## # ... with 29 more rows, and 17 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
The “pipe” syntax is a great way to combine more than one transformation, and it’s really the preferred format for these commands. For clarity, we typically put each piped command on its own line.
KimData %>% filter(Gender == "F") # you need the quotes for characters or factors.
## # A tibble: 214 x 25
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 2 F 3 Middle 11 69.5 180 dog
## 6 4 F 0 Only 7 64 135 neither
## 7 4 F 1 Last 7.5 65 130 dog
## 8 4 F 1 Last 6.5 67 128 cat
## 9 2 F 2 First 8 65 124 both
## 10 2 F 3 Middle 8 65 145 dog
## # ... with 204 more rows, and 17 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
KimData %>% filter(Siblings == 0) # you don’t need the “quotes” for numbers
## # A tibble: 39 x 25
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 4 F 0 Only 10 64 187 cat
## 2 4 F 0 Only 7 64 135 neither
## 3 10 M 0 Only 13 72 210 both
## 4 7 M 0 Only 11 73 163 dog
## 5 2 M 0 Only 10 65 130 dog
## 6 4 F 0 Only 7 62 135 both
## 7 4 F 0 Only 7 64 105 cat
## 8 1 M 0 Only 8 65 135 cat
## 9 3 M 0 Only 9.5 62 120 cat
## 10 4 F 0 Only 7 63 110 cat
## # ... with 29 more rows, and 17 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
When you use one of these dplyr commands, the result is not, by default, stored in memory. If you want to save the results, you need to explicitly assign the results of the transformation to a variable. For example:
StinkyBoys <- KimData %>% filter(Gender == "M")
View(StinkyBoys)
Note: Dr. Thatcher would like it to be known that Dr. Alberts named this variable. Dr. Beregovska would never use this name for a variable. More seriously, using View( ) opens a popup window that is not included in your knitted document. We almost always want our output in our knitted document. So, if you want something in your output, use head( ) or summary( ) instead.
More on “Relational Operators”
How do you select individuals of interest? Maybe you want to select individuals because a variable exactly matches some value. That’s what the double equals sign, ==, means. But there are other comparisons you might want to do. What they all have in common is that they’re expressions that end up giving you a TRUE or FALSE value. [Note: TRUE and FALSE are reserved values in R that stand for the results of these kinds of computations.]
• Equal to: == (more properly “Logically Equal to”)
• Less than: <
• Less than or equal to: <=
• Greater than: >
• Greater than or equal to: >=
• Not equal to: != (That’s an exclamation point!)
As an example, you might want to pick only students who have been at Truman at least 4 semesters:
UpperClass <- KimData %>% filter(Semester >= 4)
View(UpperClass)
You can also combine multiple comparisons using the “and” and “or” connectors.
• Or: A | B is true if either of A or B is true.
• And: A & B is true if both of A and B are true.
The example below shows code that finds all male students who are also first-year students.
FreshMales <- KimData %>% filter(Semester < 2 & Gender == "M")
Question 0
Edit the command below so that you’re specifying either male or first-year. Are more or fewer individuals selected? Give a brief explanation of why that’s the case.
FreshOrMales <- KimData %>% filter(Semester < 2 | Gender == "M")
Q0 explanation
More individuals are selected because the filter now keeps students who are first-year or male. That includes everyone who meets only one of the two conditions, whereas “and” keeps only students who are both first-year and male. Therefore, more individuals are selected.
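The same effect can be seen on a tiny example. The five-row tibble below is made up for illustration (it is not the survey data), so the counts are easy to check by hand.

```r
library(dplyr)

# Hypothetical five-student data set, for illustration only.
toy <- tibble(
  Semester = c(1, 1, 4, 6, 2),
  Gender   = c("M", "F", "M", "F", "M")
)

both   <- toy %>% filter(Semester < 2 & Gender == "M")  # must meet both conditions
either <- toy %>% filter(Semester < 2 | Gender == "M")  # must meet at least one

nrow(both)    # 1: only the first-semester male
nrow(either)  # 4: everyone except the sixth-semester female
```

Every row that passes the “and” filter also passes the “or” filter, so the “or” result can never be smaller.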
Finally, there are special logical commands that detect specific characteristics of your data. One you should know about is is.na. This command checks to see if the value of a variable is NA (i.e., a missing value).
The code below returns all individuals who did not give their number of siblings.
NASibs <- KimData %>% filter(is.na(Siblings))
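A common mistake is to test for missing values with == NA. The sketch below uses a made-up three-row tibble to show why is.na is needed: comparing anything to NA gives NA rather than TRUE, and filter drops rows whose condition is NA.

```r
library(dplyr)

# Hypothetical data: one student did not report their siblings.
toy <- tibble(Name = c("A", "B", "C"), Siblings = c(2, NA, 0))

wrong    <- toy %>% filter(Siblings == NA)    # comparison is NA for every row
missing  <- toy %>% filter(is.na(Siblings))   # rows with a missing value
complete <- toy %>% filter(!is.na(Siblings))  # rows with a recorded value

nrow(wrong)     # 0
nrow(missing)   # 1
nrow(complete)  # 2
```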
Arrange
The arrange command sorts your data into a certain order. So, if we want the data in order from those with the fewest semesters to those with the most, we could run
KimData %>% arrange(Semester)
## # A tibble: 377 x 25
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 0 M 1 First 10 70 165 dog
## 2 0 F 4 Middle 7 60 105 neither
## 3 0 F 2 Middle 10 68 155 cat
## 4 0 F 2 Middle 10 68 155 cat
## 5 0 F 1 Last 8 67 140 dog
## 6 0 F 2 First 6 NA NA neither
## 7 0 F 1 First 8 69 150 dog
## 8 0 F 2 First 12 67 180 dog
## 9 0 M 1 Last 13 77 193 dog
## 10 1 F 3 Last 9.5 66 145 cat
## # ... with 367 more rows, and 17 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
On the other hand, if we wanted to order by decreasing number of semesters, the desc() command could be put around Semester.
KimData %>% arrange(desc(Semester))
## # A tibble: 377 x 25
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 10 M 0 Only 13 72 210 both
## 2 9 M 1 Last 12 74 195 dog
## 3 8 M 2 Last 9 70 130 neither
## 4 8 F NA Last 7 63.0 122. neither
## 5 8 M 1 First 8.5 70 130 dog
## 6 8 M 2 First 10.5 68 178 both
## 7 8 M 2 Last 10 71 170 both
## 8 8 F 2 Last 7 65 135 cat
## 9 7 F 3 First 9.5 64 193 dog
## 10 7 M 0 Only 11 73 163 dog
## # ... with 367 more rows, and 17 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
Select
The select command selects only certain variables. This is especially helpful for giant datasets, so you can make something smaller to work with. Let’s make a new data set with only the variables that describe something “physical” about each student.
Note that Shoe Size needs funny back quotes (like the triple-backquotes for chunks) around it because it has a space in the name. The quotes will appear when you TAB-complete the name as long as you’ve already run a command that loads the data set in.
KimDataPhysical <- KimData %>%
  select(Semester, Gender, `Shoe Size`, Height, Weight, Handed)
head(KimDataPhysical)
## # A tibble: 6 x 6
## Semester Gender `Shoe Size` Height Weight Handed
## <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 11 71 195 Right
## 2 4 F 10 64 187 Right
## 3 6 F 9.5 69 150 Right
## 4 7 F 9.5 64 193 Right
## 5 6 M 13 73 181 Right
## 6 6 M 10 68 167 Right
If you want to have “all but” a certain number of variables, give a list with minus signs in front. If we decided we didn’t want the Handed variable, for example, we could get rid of it:
KimDataPhysical <- KimDataPhysical %>%
  select(-Handed) # the minus sign drops Handed; select(Handed) would keep only Handed
head(KimDataPhysical)
## # A tibble: 6 x 5
## Semester Gender `Shoe Size` Height Weight
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 6 F 11 71 195
## 2 4 F 10 64 187
## 3 6 F 9.5 69 150
## 4 7 F 9.5 64 193
## 5 6 M 13 73 181
## 6 6 M 10 68 167
We could also select particular columns by their column number. This can be problematic or annoying, but may also be easier for datasets that do not have variable names included. Remember that the c() command makes a vector of numbers. So, this will select all of the numerical variables, but I did it the hard way, by going through and counting by hand which variables those are.
KimNumData <- select(KimData, c(1, 3, 5:7, 12, 13:22))
View(KimNumData)
The rename command is a variation of select that simply changes the name of a variable.
KimDataPhys2 <- rename(KimData, Shoe.Size = `Shoe Size`)
KimDataPhys3 <- select(KimData, Shoe.Size = `Shoe Size`, everything())
You can see how the name changes in the environment window on the right. It is possible, but annoying, to rename variables directly with the select command. You can see in KimDataPhys3 that it weirdly moved Shoe.Size to the first column. The everything( ) command, as you might guess, literally keeps everything.
Renaming variables can be especially nice when the original dataset has really long or confusing names, but it makes it harder to connect back to the original later.
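The difference in column order between the two approaches is easy to verify on a small example. The three-column tibble below is made up for illustration.

```r
library(dplyr)

# Hypothetical data with a column name that contains a space.
toy <- tibble(Gender = c("F", "M"), `Shoe Size` = c(9.5, 11), Height = c(69, 73))

renamed   <- toy %>% rename(Shoe.Size = `Shoe Size`)                # keeps column order
reordered <- toy %>% select(Shoe.Size = `Shoe Size`, everything())  # moves it to the front

names(renamed)    # Gender, Shoe.Size, Height
names(reordered)  # Shoe.Size, Gender, Height
```

So rename is the simpler choice when only the name should change, while select with everything() is a way to rename and reorder in one step.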
Mutate
The mutate command creates a new variable by applying a function to an existing variable or variables. This is especially helpful for data cleaning, or changing units or data types.
Suppose we wanted to know the number of years that a student had been at Truman, rather than the number of semesters. Using mutate, we could create the Year variable by dividing Semester by 2, then rounding to the nearest whole number with the round command.
KimDataPhysical <- KimData %>% mutate(Year = round(Semester / 2))
head(KimDataPhysical)
## # A tibble: 6 x 26
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog
## 6 6 M 2 First 10 68 167 dog
## # ... with 18 more variables: Handed <chr>, On/Off Campus <chr>,
## # Calories per day <dbl>, Servings of Fruit <dbl>, Cups of Water <dbl>,
## # Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
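It is worth checking what round actually does with half-semesters. The sketch below uses only base R: round goes to the nearest whole number, and R sends exact halves to the even neighbor, so it is not the same as always rounding up. The ceiling function is shown as the alternative that always rounds up; whether that is the better definition of Year is a judgment call, not something the lab specifies.

```r
semesters <- 0:6

comparison <- data.frame(
  Semester = semesters,
  rounded  = round(semesters / 2),    # nearest whole number, halves go to even
  ceiling  = ceiling(semesters / 2)   # always rounds up
)
comparison
# rounded: 0 0 1 2 2 2 3
# ceiling: 0 1 1 2 2 3 3
```

With round, a student in semester 1 lands in Year 0 and a student in semester 5 lands in Year 2.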
The mutate command can also be helpful when you want to create a “logical” variable that’s TRUE when a certain condition is true and FALSE otherwise. For example, maybe we want to label all seniors who have had at least 6 prior semesters:
KimDataPhysical <- KimDataPhysical %>% mutate(Senior = Semester >= 6)
head(KimDataPhysical)
## # A tibble: 6 x 27
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog
## 6 6 M 2 First 10 68 167 dog
## # ... with 19 more variables: Handed <chr>, On/Off Campus <chr>,
## # Calories per day <dbl>, Servings of Fruit <dbl>, Cups of Water <dbl>,
## # Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
The mutate command can get tricky pretty quickly. Maybe we want to convert Gender into a factor, and turn the “other” response into an NA.
KimDataP4 <- mutate(KimDataPhysical, Gender = as.factor(sub("other", NA, Gender)))
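As an alternative sketch (not the approach used in the lab), dplyr’s na_if function states the same intent more directly: replace one specific value with NA. The four-row tibble is made up for illustration.

```r
library(dplyr)

# Hypothetical data with one "other" response.
toy <- tibble(Gender = c("F", "M", "other", "F"))

cleaned <- toy %>%
  mutate(Gender = as.factor(na_if(Gender, "other")))  # "other" becomes NA, then factor

levels(cleaned$Gender)      # F and M only
sum(is.na(cleaned$Gender))  # 1
```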
Question 1
A person’s “Body Mass Index” is calculated by taking mass divided by height squared. If you’re measuring in metric (kg and m), you’re done. If you’re measuring in pounds and inches (as we’re doing here), you then have to multiply by 703. In other words, BMI = 703*(Weight/Height^2). See the knit version of the lab for the nicely-typeset version.
Use the mutate command twice to first create the BMI variable, and then create a variable Obese whose value is TRUE for anyone whose BMI is greater than or equal to 30. Answer below by adding to the code in the next code block.
KimDataPhysical <- KimDataPhysical %>%
  mutate(BMI = 703 * (Weight / Height^2))
KimDataPhysical <- KimDataPhysical %>%
  mutate(Obese = BMI >= 30)
head(KimDataPhysical)
## # A tibble: 6 x 29
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog
## 6 6 M 2 First 10 68 167 dog
## # ... with 21 more variables: Handed <chr>, On/Off Campus <chr>,
## # Calories per day <dbl>, Servings of Fruit <dbl>, Cups of Water <dbl>,
## # Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
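The BMI and Obese columns are cut off in the printed output, so a quick hand check of the formula is useful. The sketch below applies it to the heights and weights of the first two rows shown above, using plain vectors rather than the full data set.

```r
# Height (inches) and Weight (pounds) from the first two printed rows.
weight <- c(195, 187)
height <- c(71, 64)

bmi   <- 703 * (weight / height^2)  # same formula as in the mutate call
obese <- bmi >= 30

round(bmi, 1)  # 27.2 32.1
obese          # FALSE TRUE
```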
Note:
Any results here should be viewed with caution. If we really wanted to explore the “freshman 15,” we would need to do a “matched design” where we measure the same students over multiple years. And, don’t even get us started on the limitations of using BMI by itself as an indicator of obesity.
Our textbook has a list of helpful commands to include in mutate functions: http://r4ds.had.co.nz/transform.html#mutate-funs
Grouping and Summarizing
The group_by command doesn’t do much by itself. It merely tells R that some of the individuals in your data set are grouped together according to the values of one or more of the categorical variables. Here’s what grouping by Gender looks like. Can you see the slight difference in output between the regular data set and the grouped version?
KimDataPhysical
## # A tibble: 377 x 29
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog
## 6 6 M 2 First 10 68 167 dog
## 7 6 M 3 First 13 73 190 dog
## 8 9 M 1 Last 12 74 195 dog
## 9 2 F 3 Middle 11 69.5 180 dog
## 10 4 M 3 Last 11 72.5 175 dog
## # ... with 367 more rows, and 21 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
KimDataPhysical <- KimDataPhysical %>% group_by(Gender)
KimDataPhysical
## # A tibble: 377 x 29
## # Groups: Gender [3]
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog
## 6 6 M 2 First 10 68 167 dog
## 7 6 M 3 First 13 73 190 dog
## 8 9 M 1 Last 12 74 195 dog
## 9 2 F 3 Middle 11 69.5 180 dog
## 10 4 M 3 Last 11 72.5 175 dog
## # ... with 367 more rows, and 21 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
But, combined with summarize, the two commands become a “magical machine” (according to Dr. Alberts). The summarize command collapses groups down to one or more descriptive statistics that are calculated from each group. It has the following form:
summarize(summary.variable = function(old.variables))
where you replace each of the variables with actual variable names and function with an actual function. You can create multiple summary variables in the same command.
KimDataPhysical %>%
  group_by(Gender) %>%
  summarize(MHeight = mean(Height, na.rm = TRUE), MWeight = mean(Weight, na.rm = TRUE))
## # A tibble: 3 x 3
## Gender MHeight MWeight
## <chr> <dbl> <dbl>
## 1 F 64.4 136.
## 2 M 69.9 173.
## 3 other 62 115
Note that these commands need the na.rm=TRUE option. Otherwise, the missing values (NA) would keep R from calculating a mean. Here’s what you get without `na.rm`:
KimDataPhysical %>%
  group_by(Gender) %>%
  summarize(MHeight = mean(Height), MWeight = mean(Weight))
## # A tibble: 3 x 3
## Gender MHeight MWeight
## <chr> <dbl> <dbl>
## 1 F NA NA
## 2 M NA 173.
## 3 other 62 115
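The behavior is the same for a plain vector, which makes it easy to see in isolation. The four heights below are made up for illustration.

```r
heights <- c(71, 64, NA, 69)   # made-up heights with one missing value

with_na    <- mean(heights)                # NA: one missing value spoils the mean
without_na <- mean(heights, na.rm = TRUE)  # 68: mean of the three recorded values
n_missing  <- sum(is.na(heights))          # 1: how many values were dropped

with_na
without_na
n_missing
```

Counting the missing values alongside the mean is a simple way to keep track of how much data each summary is really based on.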
Question 2
We’ve heard of the “freshman 15,” the weight that many college students gain after their first year of all-you-can-eat dorm food. Use group_by and summarize to group students by Year and calculate mean BMI for each group. Does BMI seem to be higher for students who have been here more years? Answer Below.
KimDataPhysical <- KimDataPhysical %>% mutate(Year = round(Semester / 2))
KimDataPhysical <- KimDataPhysical %>% mutate(BMI = 703 * (Weight / Height^2))
Freshman <- KimDataPhysical %>% filter(Year <= 1)
NFreshman <- KimDataPhysical %>% filter(Year > 1)
Freshman <- Freshman$BMI
NFreshman <- NFreshman$BMI
# Freshman and NFreshman are now free-standing vectors, not columns of KimDataPhysical,
# so every Year row below reports the same two overall means.
KimDataPhysical %>%
  group_by(Year) %>%
  summarize(MFreshman = mean(Freshman, na.rm = TRUE), MNFreshman = mean(NFreshman, na.rm = TRUE))
## # A tibble: 6 x 3
## Year MFreshman MNFreshman
## <dbl> <dbl> <dbl>
## 1 0 24.1 24.0
## 2 1 24.1 24.0
## 3 2 24.1 24.0
## 4 3 24.1 24.0
## 5 4 24.1 24.0
## 6 5 24.1 24.0
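The table above shows the same two numbers in every row because Freshman and NFreshman are stand-alone vectors, so the grouping by Year has no effect on them. A minimal sketch of the per-group calculation the question asks for is below. It runs on a made-up five-row tibble, so its numbers say nothing about the survey data.

```r
library(dplyr)

# Hypothetical data, for illustration only.
toy <- tibble(
  Year = c(0, 0, 1, 1, 2),
  BMI  = c(22, 26, 23, NA, 25)
)

by_year <- toy %>%
  group_by(Year) %>%
  summarize(n = n(), MeanBMI = mean(BMI, na.rm = TRUE))  # BMI is a column, so it is split by Year

by_year
# MeanBMI is 24 for Year 0, 23 for Year 1, and 25 for Year 2
```

Because BMI is referenced as a column inside summarize, each row of the result uses only the students in that Year.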
Q2 Explanation
The mean BMI of freshmen (24.1) is slightly higher than that of non-freshmen (24.0). So BMI does not seem to be higher for students who have been here more years.
Calculating Counts and Percentages
The n() command without any arguments inside the parentheses will count the number of individuals within a group.
Another specific use of grouping and summarizing is in calculating percentages, or proportions. Here’s the code to calculate the percentage of each gender that are seniors:
KimDataPhysical %>%
  group_by(Gender) %>%
  summarize(n = n(), Senior.Pct = mean(Senior == TRUE, na.rm = TRUE))
## # A tibble: 3 x 3
## Gender n Senior.Pct
## <chr> <int> <dbl>
## 1 F 214 0.0841
## 2 M 162 0.160
## 3 other 1 0
This code works because a comparison like `Senior == TRUE` creates a list of TRUE and FALSE values, one for each individual. If you try to do calculations with these TRUE and FALSE values, R converts them to 1’s and 0’s, where 1 stands for being a senior. So, the mean of that list of 0’s and 1’s becomes (Number TRUE)/(Total Observations), which is a proportion (multiply by 100 to get a percent). Pretty nifty!
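The conversion can be seen directly on a short logical vector. The values below are made up for illustration.

```r
senior <- c(TRUE, FALSE, FALSE, TRUE, NA)   # made-up values, one missing

as_numbers <- as.numeric(senior)            # 1 0 0 1 NA
count_true <- sum(senior, na.rm = TRUE)     # 2 seniors
proportion <- mean(senior, na.rm = TRUE)    # 2 of the 4 known values: 0.5

as_numbers
count_true
proportion
```

With na.rm = TRUE the missing value is left out of both the count and the denominator.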
Question 3
Write the code to calculate the percentage of obese people in each year. Does this percentage show any clear trend? Answer Below
KimDataPhysical %>%
  group_by(Year) %>%
  summarize(n = n(), obese = mean(BMI >= 30, na.rm = TRUE))
## # A tibble: 6 x 3
## Year n obese
## <dbl> <int> <dbl>
## 1 0 110 0.186
## 2 1 111 0.0784
## 3 2 112 0.0952
## 4 3 20 0.05
## 5 4 23 0.182
## 6 5 1 0
Q3 Explanation
The data show that students in Year 0 have the highest obesity rate (about 19%). The rate drops sharply in Year 1, rises slightly in Year 2, falls again in Year 3, and then rises sharply in Year 4. Because the rate goes up and down, it does not show a clear trend.
Multiple Groups
You can create multiple groups and summary statistics. For example, you could break the sample down by both Gender and Year just by listing both of these variables in the group_by command.
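A minimal sketch of a two-variable grouping is below. It uses a made-up five-row tibble, and the `.groups = "drop"` argument is an optional choice (not something the lab requires) that returns an ungrouped result and silences the message summarize prints about leftover grouping.

```r
library(dplyr)

# Hypothetical data, for illustration only.
toy <- tibble(
  Gender = c("F", "F", "F", "M", "M"),
  Year   = c(0, 0, 1, 0, 1),
  BMI    = c(22, 24, 21, 25, 27)
)

by_both <- toy %>%
  group_by(Gender, Year) %>%                                 # one group per Gender-Year pair
  summarize(n = n(), MeanBMI = mean(BMI), .groups = "drop")  # count and mean for each pair

by_both
# four rows: F/0 (n = 2, mean 23), F/1, M/0 and M/1 (n = 1 each)
```

Keeping n next to each mean makes it easy to spot groups that are too small to trust.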
Question 4
Complete the code below to break KimDataPhysical down by both Gender and Year, then calculate the mean BMI and the number of individuals in each group. The first line should save your results to the variable KimBMI, so don’t change that part (although you’ll add on to it). The last line of the code chunk displays the results, so don’t change that either. Do you see any trends within genders? Based on the number of data points in each group, in which groups are the statistics most suspect?
KimBMI <- KimDataPhysical
KimBMI %>%
  group_by(Gender, Year) %>%
  summarize(MKimBMI = mean(BMI, na.rm = TRUE))
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups: Gender [3]
## Gender Year MKimBMI
## <chr> <dbl> <dbl>
## 1 F 0 25.4
## 2 F 1 22.2
## 3 F 2 22.8
## 4 F 3 22.4
## 5 F 4 23.6
## 6 M 0 24.6
## 7 M 1 24.6
## 8 M 2 25.5
## 9 M 3 24.7
## 10 M 4 25.0
## 11 M 5 28.5
## 12 other 1 21.0
KimBMI
## # A tibble: 377 x 29
## # Groups: Gender [3]
## Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6 F 5 Middle 11 71 195 cat
## 2 4 F 0 Only 10 64 187 cat
## 3 6 F 1 Last 9.5 69 150 dog
## 4 7 F 3 First 9.5 64 193 dog
## 5 6 M 2 Middle 13 73 181 dog
## 6 6 M 2 First 10 68 167 dog
## 7 6 M 3 First 13 73 190 dog
## 8 9 M 1 Last 12 74 195 dog
## 9 2 F 3 Middle 11 69.5 180 dog
## 10 4 M 3 Last 11 72.5 175 dog
## # ... with 367 more rows, and 21 more variables: Handed <chr>,
## # On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
## # Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
## # Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
## # Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
## # Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C
or L <dbl>,
## # Phone <chr>, Hrs per day on phone <dbl>, ...
Q4 Explanation
The mean BMI for females decreases from Year 0 to Year 1 (25.4 to 22.2), while the mean for males stays at about 24.6. After that, both the female and male means rise and fall slightly, and both increase at the end. The male Year 5 mean (28.5) is much larger than the other values. Year 5 contains only one student (n = 1 in the Question 3 table), so that statistic is the most suspect, along with the single-person “other” group.
Return of Pipes
tall_Kim <- KimData %>% # This line just declares the dataset
  group_by(Gender, Semester) %>% # what shall we group by? Line for Q5
  summarize(count = n(), # this counts up how many in each thing
            aveht = mean(Height, na.rm = TRUE)) %>% # this finds the average height
  filter(count > 2) %>% # this gets rid of low-n categories
  arrange(aveht) # this sorts them from shortest to tallest
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
tall_Kim # See what you made?
## # A tibble: 16 x 4
## # Groups: Gender [2]
## Gender Semester count aveht
## <chr> <dbl> <int> <dbl>
## 1 F 1 60 62.0
## 2 F 5 5 62.9
## 3 F 4 30 64.3
## 4 F 7 5 65
## 5 F 2 66 65.2
## 6 F 6 11 65.8
## 7 F 3 28 66.5
## 8 F 0 7 66.5
## 9 M 3 25 68.6
## 10 M 2 44 69.5
## 11 M 8 4 69.8
## 12 M 1 41 69.8
## 13 M 4 17 70.5
## 14 M 7 11 71
## 15 M 6 9 71.2
## 16 M 5 7 71.4
Question 5
In the code chunk above, check the line marked ‘Line for Q5’ and explain what the output is after executing this line of code. Be as specific as possible. ** Answer Below **
If I delete the marked line, the grouping variables, Gender and Semester, also disappear from the result. This means that only the count and the average height appear in the result.
This script groups individuals by Gender and Semester, counts the number in each cell, then computes the average height. It eliminates low-n cells, then sorts it smallest to largest. That could be handy, right?
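To see what the grouping line contributes, here is a minimal sketch on a small made-up tibble. The `toy` data and the names `overall` and `by_cell` are invented for illustration and are not part of the survey dataset; only the dplyr verbs match the script above.

```r
library(dplyr)

# Toy data, invented for illustration (not the survey data)
toy <- tibble(
  Gender   = c("F", "F", "F", "M", "M", "M", "M"),
  Semester = c(1, 1, 2, 1, 1, 1, 2),
  Height   = c(62, 64, 66, 70, 71, 69, 72)
)

# Without group_by(): one summary row for the whole table
overall <- toy %>%
  summarize(count = n(), aveht = mean(Height, na.rm = TRUE))

# With group_by(): one summary row per Gender/Semester combination
by_cell <- toy %>%
  group_by(Gender, Semester) %>%
  summarize(count = n(), aveht = mean(Height, na.rm = TRUE), .groups = "drop")

overall
by_cell
```

In this toy case `overall` has a single row, while `by_cell` has four rows and keeps the Gender and Semester columns alongside count and aveht.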
Also notice that this script is long and wordy, but easy to understand. When you mix dplyr and ggplot, you have to be careful to get the pipes correct. That can be annoying, but you should keep your graphs away from your data management anyway.
How about this? A chart of the average height of each gender, by semester (excluding small groups). By keeping your dplyr transformation command far away from your ggplot visualization command, it’s super clear and easy.
ggplot(data = tall_Kim, mapping = aes(x = Semester, y = aveht, color = Gender)) +
  geom_point()
Data Cleaning
Commands in the dplyr package are often useful for data cleaning. We’ll talk more about that later, but let’s do one example.
Question 6
Complete the ggplot command below to create a histogram of the Shoe Size variable. Remember the single back quotes around Shoe Size because it’s a variable with a space in it. After making the histogram, do you see any data point that stands out? ** Answer Below **
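ggplot(KimDataPhysical, aes(x = `Shoe Size`)) +   #back quotes, not quotes: 'Shoe Size' in quotes is a constant string, not the column
  geom_histogram()                                  #a histogram, as the question asks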
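The back-quote reminder in Question 6 matters because R reads a quoted name as a character string rather than as a column, and inside `aes()` that string is mapped as a constant. The sketch below shows the same difference with dplyr on a made-up one-column tibble; `toy`, `with_backticks`, `n_quoted`, and `n_backtick` are hypothetical names used only here.

```r
library(dplyr)

# Toy data, invented for illustration (not the survey data)
toy <- tibble(`Shoe Size` = c(7, 8, 9, 113))

# Back quotes refer to the column, so this averages the four numbers
with_backticks <- toy %>% summarize(m = mean(`Shoe Size`))

# Quotes make a character string: there is only one distinct value, the string itself
n_quoted <- toy %>% summarize(n = n_distinct('Shoe Size'))

# Back quotes count the distinct values stored in the column
n_backtick <- toy %>% summarize(n = n_distinct(`Shoe Size`))
```

Here `n_quoted$n` is 1 and `n_backtick$n` is 4, which is why a plot built on the quoted name cannot show individual shoe sizes.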
Q6 Explanation
I can only see the counts of the shoe sizes. Other than that, I don’t see any data points that stand out.
When you find an outlier, you shouldn’t just throw it out. However, if you investigate, can’t find a good reason for it, can’t find the correct value for it, and it’s clearly some sort of mistake, then removing the entire individual is likely a reasonable thing to do. Let’s see the effect of removing the individual with the “giant” feet.
Question 7
Write two R commands using dplyr. One should calculate the average size of men’s and women’s feet using all data points. The second should calculate the average size of men’s and women’s feet after removing the person with the giant feet. (Think about how you’ll remove that person using a dplyr command.) How much of a difference in average shoe size did removing the outlier make? ** Answer Below **
#filter() has no na.rm argument; rows where the condition is NA are dropped automatically
MKimDataPhysical <- filter(KimData, Gender == 'M')
FKimDataPhysical <- filter(KimData, Gender == 'F')
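MKimDataPhysical <-
  MKimDataPhysical %>%
  group_by(Gender, `Shoe Size`) %>%            #back quotes so the column, not a string, is used
  mutate(r = min_rank(desc(`Shoe Size`))) %>%  #rank within each group; rows sharing a shoe size tie at rank 1
  filter(r %in% range(r))                      #keeps the lowest and highest ranks, which here is every row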
MShKimDataPhysical <- mean(MKimDataPhysical$`Shoe Size`, na.rm = TRUE)
FShKimDataPhysical <- mean(FKimDataPhysical$`Shoe Size`, na.rm = TRUE)
head(MShKimDataPhysical)
## [1] 11.51398
head(FShKimDataPhysical)
## [1] 7.992991
summary(MKimDataPhysical$`Shoe Size`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    7.00   10.00   11.00   11.51   12.00  113.00       1
summary(FKimDataPhysical$`Shoe Size`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   5.000   7.000   8.000   7.993   9.000  12.000
#keep only plausible shoe sizes; filter() drops NA rows on its own, so na.rm is not needed there
MSKimDataPhysical <-
  MKimDataPhysical %>%
  filter(`Shoe Size` >= 7 & `Shoe Size` <= 15)
FSKimDataPhysical <-
  FKimDataPhysical %>%
  filter(`Shoe Size` >= 4, `Shoe Size` <= 12)
MShoeKimDataPhysical <- mean(MSKimDataPhysical$`Shoe Size`, na.rm = TRUE)
FShoeKimDataPhysical <- mean(FSKimDataPhysical$`Shoe Size`, na.rm = TRUE)
head(MShoeKimDataPhysical)
## [1] 10.84748
head(FShoeKimDataPhysical)
## [1] 7.992991
Q7 Explanation
Before excluding the outlier, the average male shoe size is 11.51398 and the average female shoe size is 7.992991. After removing the outlier, the male average decreases to 10.84748, a difference of about 0.67, while the female average does not change.
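The same before-and-after comparison can be written as two short grouped pipelines, one per average table. This is only a sketch on made-up numbers: the `toy` tibble, the cutoff of 20, and the names `avg_all`, `avg_trimmed`, and `difference` are assumptions for illustration, not values or names from the lab.

```r
library(dplyr)

# Toy data, invented for illustration (not the survey data)
toy <- tibble(
  Gender      = c("F", "F", "F", "M", "M", "M", "M"),
  `Shoe Size` = c(7, 8, 9, 10, 11, 12, 113)
)

# Command 1: average shoe size per gender, all data points
avg_all <- toy %>%
  group_by(Gender) %>%
  summarize(avg = mean(`Shoe Size`, na.rm = TRUE))

# Command 2: same, after dropping the implausible value
# (the cutoff of 20 is an assumption chosen for this toy data)
avg_trimmed <- toy %>%
  filter(`Shoe Size` < 20) %>%
  group_by(Gender) %>%
  summarize(avg = mean(`Shoe Size`, na.rm = TRUE))

# Change in each gender's average caused by the removal (rows are ordered F, M)
difference <- avg_all$avg - avg_trimmed$avg
```

Grouping by Gender inside one pipeline avoids building separate male and female data frames, and the subtraction gives the size of the outlier's effect directly.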