Wk2 - P2 - Introduction to Tidyverse

Date

Apr 3, 2024

WK2 - P2 - MIS 431 - Introduction to Tidyverse
Jingyuan Yang - George Mason University, School of Business

Introduction to the Tidyverse

This section covers the basics of data manipulation using the tidyverse package. Before we can use the package, we must first install it. Use the following code to install it in RStudio:

install.packages("tidyverse")

Once installed, load it into the environment with library(tidyverse). This imports the functions available in the tidyverse package into our environment. The tidyverse is a collection of 8 packages designed specifically for data science tasks. For more details about the tidyverse package, see the documentation at https://www.tidyverse.org/

We will also load the skimr package, which is used for exploring the structure of a data frame.

# This will load all 8 of the tidyverse packages
suppressPackageStartupMessages(library(tidyverse)) # suppress the start-up messages

# Load skimr package
library(skimr)

Tibbles

The first package we will explore is tibble. The tibble package is used for creating special types of data frames called tibbles. Tibbles are data frames with added properties and functionality. Many of the core functions in the tidyverse take tibbles as arguments and return them as results.

Creating tibbles

R has many built-in datasets that can be loaded as data frames. One example is the iris data frame. To load this data, you just have to type iris in the R console. Each row in iris represents a flower, with measurements of the length and width of its sepal and petal.

Tibbles are designed so that you don't accidentally overwhelm your console when you print large data frames. Tibbles have a refined print method that shows only the first 10 rows and all the columns that fit on screen. This makes it much easier to work with large data.
Type iris in the console and look at the output. It will print all 150 rows.

Converting Data Frames to Tibbles

To convert any R data frame into a tibble, we can use the as_tibble() function from the tibble package. In the code below, we create a tibble named iris_tbl.
iris_tbl <- as_tibble(iris)
iris_tbl

## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          5.1         3.5          1.4         0.2 setosa
##  2          4.9         3            1.4         0.2 setosa
##  3          4.7         3.2          1.3         0.2 setosa
##  4          4.6         3.1          1.5         0.2 setosa
##  5          5           3.6          1.4         0.2 setosa
##  6          5.4         3.9          1.7         0.4 setosa
##  7          4.6         3.4          1.4         0.3 setosa
##  8          5           3.4          1.5         0.2 setosa
##  9          4.4         2.9          1.4         0.2 setosa
## 10          4.9         3.1          1.5         0.1 setosa
## # ... with 140 more rows

# As described earlier, it prints only the first 10 rows
# Use the following code to validate that you have created a tibble
str(iris_tbl)

## tibble [150 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Converting Tibbles to Data Frames

In general, tibbles are much easier to work with than data frames. However, not all R functions are able to work with them. If you ever encounter this situation, it is easy to convert a tibble back to a data frame with the as.data.frame() function. The code below converts our iris_tbl back to a data frame.

# Convert tibble to data frame
iris_df <- as.data.frame(iris_tbl)

# Use str() to validate that you have converted it back to a data frame
str(iris_df)

## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
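One concrete reason some functions expect a plain data frame is that single-column subsetting with [ behaves differently: a data frame drops to a vector, while a tibble stays a tibble. The following is a minimal sketch of that difference, assuming the tibble package is installed; the object names iris_tbl2, iris_df2, col_from_df and col_from_tbl are illustrative and not part of the handout.

```r
library(tibble)

# Convert the built-in iris data frame to a tibble and back again
iris_tbl2 <- as_tibble(iris)
iris_df2  <- as.data.frame(iris_tbl2)

# A tibble carries extra classes on top of "data.frame"
class(iris_tbl2)
class(iris_df2)

# Single-column subsetting with [ , j]:
# a data frame drops to a plain vector, a tibble stays a tibble
col_from_df  <- iris_df2[, "Sepal.Length"]
col_from_tbl <- iris_tbl2[, "Sepal.Length"]

is.numeric(col_from_df)   # the data frame result is a numeric vector
is_tibble(col_from_tbl)   # the tibble result is still a one-column tibble
dim(col_from_tbl)
```

Code that relies on the vector result is the kind that may need as.data.frame() first, or can use iris_tbl2$Sepal.Length or iris_tbl2[["Sepal.Length"]], which return a vector for both types.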
Introduction to Data Analysis

Loading Data into R

Before we can perform data analysis, we must import data into our R environment. The tidyverse package loads the readr package, which contains a number of functions for importing data into R.

The read_delim() function is used to import flat files such as comma-delimited (.csv) or tab-delimited (.txt) files. The read_delim() function takes many arguments, but the 3 most important are:

• file - the first argument is the path to a file on your computer or the website address of the data file
• delim - the type of delimiter in the data file (either ',' for comma, '\t' for tab, or any other character)
• col_names - TRUE or FALSE to indicate whether the file has column names

To see how this function works, let's import the Wine Dataset from the UCI Machine Learning Repository. If there are no column names in a dataset and col_names is set to FALSE, read_delim() will auto-generate names that begin with an X followed by the column position (X1, X2, and so on). The read_delim() function will also print a message to the R console about the data types it has assigned to each column.
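As a small, self-contained sketch of the three arguments described above, the code below writes a tiny comma-delimited file to a temporary location and reads it back. The file contents and the column names class, alcohol and malic_acid are made up for illustration; passing a character vector to col_names is an additional readr option not covered in the handout text.

```r
library(readr)

# Write a tiny comma-delimited file with no header row (made-up values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,14.2,1.71", "2,13.2,1.78", "3,12.9,2.36"), tmp)

# file: path to the data; delim: the separator; col_names = FALSE: no header row
small <- suppressMessages(
  read_delim(tmp, delim = ",", col_names = FALSE)
)

names(small)   # auto-generated names starting with X
dim(small)     # rows and columns read from the file

# The same file read with illustrative column names supplied as a character vector
named <- suppressMessages(
  read_delim(tmp, delim = ",", col_names = c("class", "alcohol", "malic_acid"))
)
names(named)
```

suppressMessages() is used here only to keep the column-specification message out of the way; dropping it shows the same kind of message that appears when reading the wine data.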
wine_data <- read_delim('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                        delim = ',',
                        col_names = FALSE)

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   X2 = col_double(),
##   X3 = col_double(),
##   X4 = col_double(),
##   X5 = col_double(),
##   X6 = col_double(),
##   X7 = col_double(),
##   X8 = col_double(),
##   X9 = col_double(),
##   X10 = col_double(),
##   X11 = col_double(),
##   X12 = col_double(),
##   X13 = col_double(),
##   X14 = col_double()
## )

# In this instance there were no column names, and R assigned the data types
wine_data

## # A tibble: 178 x 14
##       X1    X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1  14.2  1.71  2.43  15.6   127   2.8  3.06  0.28  2.29  5.64  1.04  3.92
##  2     1  13.2  1.78  2.14  11.2   100  2.65  2.76  0.26  1.28  4.38  1.05   3.4
##  3     1  13.2  2.36  2.67  18.6   101   2.8  3.24   0.3  2.81  5.68  1.03  3.17
##  4     1  14.4  1.95   2.5  16.8   113  3.85  3.49  0.24  2.18   7.8  0.86  3.45
##  5     1  13.2  2.59  2.87    21   118   2.8  2.69  0.39  1.82  4.32  1.04  2.93
##  6     1  14.2  1.76  2.45  15.2   112  3.27  3.39  0.34  1.97  6.75  1.05  2.85
##  7     1  14.4  1.87  2.45  14.6    96   2.5  2.52   0.3  1.98  5.25  1.02  3.58
##  8     1  14.1  2.15  2.61  17.6   121   2.6  2.51  0.31  1.25  5.05  1.06  3.58
##  9     1  14.8  1.64  2.17    14    97   2.8  2.98 0.290  1.98   5.2  1.08  2.85
## 10     1  13.9  1.35  2.27    16    98  2.98  3.15  0.22  1.85  7.22  1.01  3.55
## # ... with 168 more rows, and 1 more variable: X14 <dbl>

# Print Wine data

Flights Data

This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics and is documented in ?flights.

Install the nycflights13 package using the install.packages("nycflights13") command. Once installed, we have to load nycflights13 along with tidyverse.

library(nycflights13)

# Open/print the flights tibble data
flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Exploring Data Frames with skimr

The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.

The skim() function can be used to accomplish all of this. It takes your data frame or tibble as an argument. In the output below, we first get the number of rows and columns along with the data types present in our data. The results are then grouped by the type of variables in our data.
For flights, which has no factor columns, the first group is the character variables: the number of missing observations, the completion rate, the minimum and maximum string lengths, the number of empty strings, the count of unique values, and the number of whitespace-only values. (For factor variables, skim() instead reports whether the factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels.) Then we get a summary of our numeric variables, which includes the number of missing observations, the mean and standard deviation, a five-number summary (p0, p25, p50, p75, p100), and, in the console output, a small histogram of the distribution of values. Finally, the POSIXct variable is summarized by its minimum, maximum, median, and number of unique values.

# View data frame properties and summary statistics
skim(flights)

Table 1: Data summary

Name                      flights
Number of rows            336776
Number of columns         19
_______________________
Column type frequency:
  character               4
  numeric                 14
  POSIXct                 1
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
carrier                0           1.00    2    2      0        16           0
tailnum             2512           0.99    5    6      0      4043           0
origin                 0           1.00    3    3      0         3           0
dest                   0           1.00    3    3      0       105           0

Variable type: numeric

skim_variable   n_missing  complete_rate     mean       sd    p0   p25   p50   p75  p100
year                    0           1.00  2013.00     0.00  2013  2013  2013  2013  2013
month                   0           1.00     6.55     3.41     1     4     7    10    12
day                     0           1.00    15.71     8.77     1     8    16    23    31
dep_time             8255           0.98  1349.11   488.28     1   907  1401  1744  2400
sched_dep_time          0           1.00  1344.25   467.34   106   906  1359  1729  2359
dep_delay            8255           0.98    12.64    40.21   -43    -5    -2    11  1301
arr_time             8713           0.97  1502.05   533.26     1  1104  1535  1940  2400
sched_arr_time          0           1.00  1536.38   497.46     1  1124  1556  1945  2359
arr_delay            9430           0.97     6.90    44.63   -86   -17    -5    14  1272
flight                  0           1.00  1971.92  1632.47     1   553  1496  3465  8500
air_time             9430           0.97   150.69    93.69    20    82   129   192   695
distance                0           1.00  1039.91   733.23    17   502   872  1389  4983
hour                    0           1.00    13.18     4.66     1     9    13    17    23
minute                  0           1.00    26.23    19.30     0     8    29    44    59

Variable type: POSIXct

skim_variable  n_missing  complete_rate  min                  max                  median               n_unique
time_hour              0              1  2013-01-01 05:00:00  2013-12-31 23:00:00  2013-07-03 10:00:00      6936

- End of Part 2 -
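The n_missing and complete_rate columns in the skim() output can be reproduced by hand, which helps in reading them: complete_rate is 1 minus the share of missing values, so tailnum with 2512 missing out of 336776 rows rounds to 0.99. The following is a rough sketch in base R using the built-in airquality data frame, chosen here only because it ships with R and contains missing values; it is not used in the handout, and missing_summary is an illustrative name.

```r
# Built-in airquality data frame, which has NA values in some columns
n_rows <- nrow(airquality)

# n_missing: count of NA values per column
n_missing <- colSums(is.na(airquality))

# complete_rate: share of non-missing values per column
complete_rate <- 1 - n_missing / n_rows

# Collect the results in one small data frame, one row per variable
missing_summary <- data.frame(
  variable      = names(airquality),
  n_missing     = as.integer(n_missing),
  complete_rate = round(complete_rate, 2),
  row.names     = NULL
)
missing_summary
```

Running skim(airquality) afterwards should show the same missing counts per variable alongside the other summary statistics.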