R-lab-03-assignment

RLab 03 - Transforming with dplyr

Samuel Park

today (change this, too)

# The Story so Far…

As you already know, Dr. Hyun-Joo Kim has given a short survey to her STAT 190 classes for the past few years, and she then uses that data throughout the semester. The dataset found here has the information from the six semesters in which she has used the survey. Her grader or student worker types the information in by hand from the paper sheets. Over time, this leads to dirty data. There was an especially big change in recording between rows 194 and 195.

    # Before you start, save this RMarkdown file into your project folder.
    # Load the tidyverse package with dplyr commands.
    library(tidyverse)

    ## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
    ## v ggplot2 3.3.5     v purrr   0.3.4
    ## v tibble  3.1.3     v dplyr   1.0.7
    ## v tidyr   1.1.3     v stringr 1.4.0
    ## v readr   2.0.0     v forcats 0.5.1
    ## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
    ## x dplyr::filter() masks stats::filter()
    ## x dplyr::lag()    masks stats::lag()

    # Load the data file from the U: drive - U:\_MT Student File Area\tberegovska\Stat220
    Clean_KimData <- read_csv("U:/_MT Student File Area/tberegovska/Stat220/Clean-KimData.csv")

    ## Rows: 377 Columns: 25
    ## -- Column specification --------------------------------------------------------
    ## Delimiter: ","
    ## chr  (6): Gender, Birth Order, dog vs cat, Handed, On/Off Campus, Phone
    ## dbl (19): Semester, Siblings, Shoe Size, Height, Weight, Calories per day, S...

    ##
    ## i Use `spec()` to retrieve the full column specification for this data.
    ## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

    # Then, you might want to save your raw dataset into your project folder.
    write_csv(Clean_KimData, "Clean_KimData.csv")
    # After you've done this, you can load it next time from your Y: drive
    # by putting a hashtag # at the front of line 23 and removing the # from line 31.

    # Once you've saved it into your project folder, get it from there instead.
    # Clean_KimData <- read_csv("Clean_KimData.csv")

    # Finally, create a new copy of the data to keep the "clean" version clean.
    KimData <- Clean_KimData

As an aside, note that there are two commands: read.csv and read_csv. The first is built into base R, and the second comes from the readr package, which is loaded as part of the tidyverse. Base R's read.csv reads data in as a data frame, while the tidyverse read_csv reads data in as a tibble, which is a tidyverse version of a data frame with some small, but fancy upgrades. We used the tidyverse read_csv in this lab because it makes the output a bit prettier.

# Data Transformations with dplyr

In this section, we’ll talk about the six dplyr commands you need to know:

- filter: pick individuals by their values.
- arrange: reorder the rows.
- select: pick variables by their names.
- mutate: create new variables as functions of existing variables.
- group_by and summarize: these two go together.
    - group_by: mark data points as falling in different groups according to some variable.
    - summarise or summarize: collapse many values down to a single summary.

# For more Information

You will be happy that you know how to use dplyr, and it has even more capabilities than we will discuss here. You can get a nice cheatsheet from https://www.rstudio.com/resources/cheatsheets/ and our textbook also has a lot about it in Chapter 5 – https://r4ds.had.co.nz/transform.html

# Brief Aside: Pipes

A pipe is a way to connect multiple lines of code. It basically means, “take the result of this line down to the next line.”

ggplot uses + as a pipe. That’s easy to remember (because + means “add something to your graph”) but is not good coding practice (because + more commonly means “add these up”).

dplyr and other tidyverse packages use a unique pipe that has no other meaning.

%>%

No, really, that’s what it looks like. Yes, that’s weird. But, you have to admit that you aren’t going to use %>% for anything else.
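To make the pipe idea concrete, here is a minimal sketch. It uses R's built-in mtcars dataset rather than the survey data (an assumption made only so the example runs anywhere), and the names nested and piped are made up for illustration. The pipe hands the result on its left to the function on its right as that function's first argument, so the two versions below should give the same answer.

```r
library(dplyr)  # attached automatically when you run library(tidyverse)

# Nested version: you have to read it from the inside out.
nested <- nrow(filter(mtcars, cyl == 4))

# Piped version: read it top to bottom, one step per line.
piped <- mtcars %>%
  filter(cyl == 4) %>%  # keep only the 4-cylinder cars
  nrow()                # then count the rows that are left

# Both approaches count the same rows.
identical(nested, piped)
```

The piped version is longer here, but it gets easier to read than the nested one as more steps are added.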

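As a rough preview of how the six dplyr commands listed earlier fit together, the sketch below chains them with pipes. The toy data frame is hypothetical: the column names echo the survey, but the values are invented, and treating Height as inches is an assumption that the lab does not state.

```r
library(dplyr)

# Hypothetical toy data: invented values, not the real survey.
toy <- data.frame(
  Gender   = c("F", "M", "F", "M", "F"),
  Siblings = c(0, 2, 1, 0, 3),
  Height   = c(64, 70, 66, 72, 65)  # assumed to be in inches
)

result <- toy %>%
  filter(Siblings > 0) %>%                # pick rows: at least one sibling
  select(Gender, Height) %>%              # pick columns by name
  mutate(Height_cm = Height * 2.54) %>%   # new variable from an existing one
  group_by(Gender) %>%                    # mark the groups
  summarize(mean_cm = mean(Height_cm),    # collapse each group to one row
            n = n()) %>%
  arrange(desc(mean_cm))                  # reorder rows, tallest group first

result
```

Each command gets its own section in the lab, so treat this only as a map of where things are headed.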
# Filter

## Basic Syntax

The filter command keeps only the observations (rows) that meet a condition. So, if we just want those who identified themselves as women, or those who are only children, in the dataset:

    filter(KimData, Gender == "F") # you need the quotes for characters or factors.

    ## # A tibble: 214 x 25
    ##    Semester Gender Siblings `Birth Order` `Shoe Size` Height Weight `dog vs cat`
    ##       <dbl> <chr>     <dbl> <chr>               <dbl>  <dbl>  <dbl> <chr>
    ##  1        6 F             5 Middle               11     71      195 cat
    ##  2        4 F             0 Only                 10     64      187 cat
    ##  3        6 F             1 Last                  9.5   69      150 dog
    ##  4        7 F             3 First                 9.5   64      193 dog
    ##  5        2 F             3 Middle               11     69.5    180 dog
    ##  6        4 F             0 Only                  7     64      135 neither
    ##  7        4 F             1 Last                  7.5   65      130 dog
    ##  8        4 F             1 Last                  6.5   67      128 cat
    ##  9        2 F             2 First                 8     65      124 both
    ## 10        2 F             3 Middle                8     65      145 dog
    ## # ... with 204 more rows, and 17 more variables: Handed <chr>,
    ## #   On/Off Campus <chr>, Calories per day <dbl>, Servings of Fruit <dbl>,
    ## #   Cups of Water <dbl>, Cups of Coffee <dbl>, Hours of Sleep <dbl>,
    ## #   Hours spent studying per week <dbl>, Hours spent working per week <dbl>,
    ## #   Hours spent workingout per wee <dbl>, Hours socializing per w <dbl>,
    ## #   Politically Liberal <dbl>, Religiously C or L <dbl>, Socially C or L <dbl>,
    ## #   Phone <chr>, Hrs per day on phone <dbl>, ...

    filter(KimData, Siblings == 0) # you don't need the quotes for numbers
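The two filters above run separately. As a small sketch of how the same conditions could be combined, the example below uses a hypothetical four-row data frame (invented values, with column names borrowed from the survey) so that it runs without the U: drive file. Within filter, a comma between conditions means "and", while | means "or".

```r
library(dplyr)

# Hypothetical toy data: invented values, not the real survey.
toy <- data.frame(
  Gender   = c("F", "M", "F", "M"),
  Siblings = c(0, 0, 2, 1)
)

women      <- filter(toy, Gender == "F")                  # character value: quotes
only_child <- filter(toy, Siblings == 0)                  # number: no quotes
either     <- filter(toy, Gender == "F" | Siblings == 0)  # women OR only children
both       <- filter(toy, Gender == "F", Siblings == 0)   # women AND only children

# Compare how many rows each version keeps.
c(women = nrow(women), only_child = nrow(only_child),
  either = nrow(either), both = nrow(both))
```

Swapping toy for KimData should apply the same conditions to the real survey, assuming the Gender and Siblings columns are coded as shown in the output above.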
