WK2 - P2 - MIS 431 - Introduction to Tidyverse
Jingyuan Yang - George Mason University, School of Business
Introduction to the Tidyverse
This section will cover the basics of data manipulation using the tidyverse package. Before we can use the package, we must first install it. Use the following code to install it in RStudio:
install.packages("tidyverse")
Once installed, you need to load it into the environment with library(tidyverse). This attaches the core tidyverse packages and makes their functions available in our environment.
The tidyverse is a collection of packages designed specifically for data science tasks. Loading it attaches 8 core packages (9 as of tidyverse 2.0.0, which added lubridate to the core). For more details about the tidyverse package, see the tidyverse documentation at the following link: https://www.tidyverse.org/
We will also load the skimr package, which is used for exploring the structure of a data frame.
# Load the core tidyverse packages and suppress the startup messages
suppressPackageStartupMessages(library(tidyverse))
# Load skimr package
library(skimr)
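The install step only needs to run once per machine, so it can be convenient to check first which packages are missing. The sketch below is one possible way to do that in base R; the helper name missing_packages is hypothetical, not part of any package.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "skimr"))
to_install

# Only install from an interactive session, and only what is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this, library(tidyverse) and library(skimr) can be called as shown above.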
Tibbles
The first package we will explore is tibble. The tibble package is used for creating special types of data frames called tibbles.
Tibbles are data frames with added properties and functionality. Many of the core functions in the tidyverse take tibbles as arguments and return them as results after execution.
Creating tibbles
R has many built-in datasets that can be loaded as data frames. One example is the iris data frame. To load this data, you just have to type iris in the R console.
Each row in iris represents a flower with corresponding measurements of the length and width of its sepals and petals.
Tibbles are designed so that you don't accidentally overwhelm your console when you print large data frames. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. By contrast, type iris in the console and look at the output. It will print all 150 rows.
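As a small base R sketch (no extra packages assumed), the size and structure of the plain iris data frame can be inspected without printing every row:

```r
# Number of rows and columns in the built-in iris data frame
iris_dim <- dim(iris)
iris_dim

# A plain data frame has the class "data.frame"
class(iris)

# head() limits the printed output to the first few rows
head(iris, 10)
```

With a plain data frame, limiting the output this way is a manual step; the tibble print method described above does it by default.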
Converting Data Frames to Tibbles
To convert any R data frame into a tibble, we can use the as_tibble() function from the tibble package.
In the code below, we create a tibble named iris_tbl.
iris_tbl <- as_tibble(iris)
iris_tbl
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          5.1         3.5          1.4         0.2 setosa
##  2          4.9           3          1.4         0.2 setosa
##  3          4.7         3.2          1.3         0.2 setosa
##  4          4.6         3.1          1.5         0.2 setosa
##  5            5         3.6          1.4         0.2 setosa
##  6          5.4         3.9          1.7         0.4 setosa
##  7          4.6         3.4          1.4         0.3 setosa
##  8            5         3.4          1.5         0.2 setosa
##  9          4.4         2.9          1.4         0.2 setosa
## 10          4.9         3.1          1.5         0.1 setosa
## # ... with 140 more rows
# As described earlier it prints only the first 10 rows
# Use the following code to validate that you have created a tibble
str(iris_tbl)
## tibble [150 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
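Besides str(), there are more direct ways to confirm that an object is a tibble and to control how much of it is printed. The following is a minimal sketch, assuming only that the tibble package is installed:

```r
library(tibble)

iris_tbl <- as_tibble(iris)

# is_tibble() returns TRUE for tibbles and FALSE for plain data frames
is_tibble(iris_tbl)
is_tibble(iris)

# A tibble keeps the data.frame class and adds two more
class(iris_tbl)

# print() accepts n (rows) and width (characters) to override the defaults
print(iris_tbl, n = 15, width = Inf)
```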
Converting Tibbles to Data Frames
In general, tibbles are much easier to work with than data frames. However, not all R functions are able to work with them. If you ever encounter this situation, it is easy to convert a tibble back to a data frame with the as.data.frame() function.
The code below converts our iris_tbl back to a data frame.
# Convert tibble to data frame
iris_df <- as.data.frame(iris_tbl)
# Use str() to validate that you have converted it back to a data frame
str(iris_df)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
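One common reason older functions stumble over tibbles is that single-column subsetting behaves differently. The sketch below illustrates that difference; it is one example of an incompatibility, not a complete list.

```r
library(tibble)

iris_tbl <- as_tibble(iris)
iris_df <- as.data.frame(iris_tbl)

# For a data frame, selecting one column with [ , ] drops to a plain vector
df_col <- iris_df[, "Sepal.Length"]
is.numeric(df_col)

# For a tibble, the same call returns a one-column tibble
tbl_col <- iris_tbl[, "Sepal.Length"]
is_tibble(tbl_col)

# [[ ]] (or $) extracts the underlying vector from either object
same_values <- identical(iris_tbl[["Sepal.Length"]], df_col)
same_values
```

Code that expects a vector from [ , ] will therefore need either as.data.frame() or an explicit [[ ]] when given a tibble.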
Introduction to Data Analysis
Loading Data into R
Before we are able to perform data analysis, we must import data into our R environment. The tidyverse
package loads the readr package which contains a number of functions for importing data into R.
The read_delim() function is used to import flat files such as comma-delimited (.csv) or tab-delimited (.txt) files. The read_delim() function takes many arguments, but the 3 most important are:
• file - the first argument is the path to a file on your computer or website address of the data file
• delim - the type of delimiter in the data file (either ',' for comma, '\t' for tab, or any other character)
• col_names - TRUE or FALSE to indicate whether a file has column names
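The three arguments above can be tried on a small local file. The sketch below writes a tab-delimited file with made-up values to a temporary location and reads it back; the file contents and the object name scores are purely illustrative.

```r
library(readr)

# Write a small tab-delimited file with a header row (made-up values)
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tscore", "a\t1.5", "b\t2.5"), tmp)

# file = path, delim = tab, col_names = TRUE because the first row is a header
# suppressMessages() hides the column specification message
scores <- suppressMessages(
  read_delim(file = tmp, delim = "\t", col_names = TRUE)
)
scores
```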
To see how this function works, let's import the Wine Dataset from the UCI Machine Learning Repository. This file has no header row, so we set col_names = FALSE. In that case, read_delim() will auto-generate names that begin with an X followed by a sequence of integers (X1, X2, and so on). The read_delim() function will also print a message to the R console about the data types it has assigned to each column.
wine_data <- read_delim('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                        delim = ',',
                        col_names = FALSE)
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   X2 = col_double(),
##   X3 = col_double(),
##   X4 = col_double(),
##   X5 = col_double(),
##   X6 = col_double(),
##   X7 = col_double(),
##   X8 = col_double(),
##   X9 = col_double(),
##   X10 = col_double(),
##   X11 = col_double(),
##   X12 = col_double(),
##   X13 = col_double(),
##   X14 = col_double()
## )
# In this instance there were no column names, and R assigned the data types
wine_data  # Print wine data
## # A tibble: 178 x 14
##       X1    X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1  14.2  1.71  2.43  15.6   127   2.8  3.06  0.28  2.29  5.64  1.04  3.92
##  2     1  13.2  1.78  2.14  11.2   100  2.65  2.76  0.26  1.28  4.38  1.05   3.4
##  3     1  13.2  2.36  2.67  18.6   101   2.8  3.24   0.3  2.81  5.68  1.03  3.17
##  4     1  14.4  1.95   2.5  16.8   113  3.85  3.49  0.24  2.18   7.8  0.86  3.45
##  5     1  13.2  2.59  2.87    21   118   2.8  2.69  0.39  1.82  4.32  1.04  2.93
##  6     1  14.2  1.76  2.45  15.2   112  3.27  3.39  0.34  1.97  6.75  1.05  2.85
##  7     1  14.4  1.87  2.45  14.6    96   2.5  2.52   0.3  1.98  5.25  1.02  3.58
##  8     1  14.1  2.15  2.61  17.6   121   2.6  2.51  0.31  1.25  5.05  1.06  3.58
##  9     1  14.8  1.64  2.17    14    97   2.8  2.98 0.290  1.98   5.2  1.08  2.85
## 10     1  13.9  1.35  2.27    16    98  2.98  3.15  0.22  1.85  7.22  1.01  3.55
## # ... with 168 more rows, and 1 more variable: X14 <dbl>
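Auto-generated names such as X1 and X2 are hard to work with. As a sketch, col_names also accepts a character vector of names in place of TRUE or FALSE. The example uses a small temporary file holding the first three values of the first two wine rows shown above, and the names class, alcohol, and malic_acid are assumptions made for illustration, not names taken from these notes.

```r
library(readr)

# Small headerless comma-delimited file (first three values of two wine rows)
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,14.2,1.71", "1,13.2,1.78"), tmp)

# col_names = FALSE: names are generated as X1, X2, X3
auto_named <- suppressMessages(
  read_delim(tmp, delim = ",", col_names = FALSE)
)
names(auto_named)

# col_names as a character vector: the supplied names are used instead
# (illustrative names; check the dataset documentation for the real ones)
named <- suppressMessages(
  read_delim(tmp, delim = ",", col_names = c("class", "alcohol", "malic_acid"))
)
names(named)
```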
Flights Data
This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.
Install the package nycflights13 using the install.packages("nycflights13") command.
Once installed, we have to load the package nycflights13 along with tidyverse.
library(nycflights13)
#Open/print the flights tibble data
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
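Because the default print shows only the columns that fit on screen, 11 of the 19 variables are listed but not displayed. As a sketch, glimpse() from dplyr (attached by library(tidyverse)) shows every column, one per line, and print() with width = Inf widens the normal output:

```r
library(dplyr)
library(nycflights13)

# One line per column: name, type, and the first few values
glimpse(flights)

# Standard tibble print, limited to 5 rows but showing all columns
print(flights, n = 5, width = Inf)

# Row and column counts
flights_dim <- dim(flights)
flights_dim
```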
Exploring Data Frames with skimr
The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.
The skim() function can be used to accomplish all of this. It takes your data frame or tibble as an argument. In the output below, we first get the number of rows and columns along with the data types present in our data. The results are then grouped by the type of variables in our data. First we get a summary of our character variables, including the number of missing observations, the proportion of complete values, the minimum and maximum string lengths, and the counts of empty, unique, and whitespace-only values. Then we get a summary of our numeric variables, which includes the number of missing observations, the mean and standard deviation, a five-number summary, and a plot of the distribution of values. Finally, the date-time (POSIXct) variable is summarized by its minimum, maximum, median, and number of unique values. For data that contains factor variables, skim() also reports whether the factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels.
# View data frame properties and summary statistics
skim(flights)
Table 1: Data summary

Name                     flights
Number of rows           336776
Number of columns        19
_______________________
Column type frequency:
character                4
numeric                  14
POSIXct                  1
________________________
Group variables          None

Variable type: character

skim_variable   n_missing  complete_rate  min  max  empty  n_unique  whitespace
carrier                 0           1.00    2    2      0        16           0
tailnum              2512           0.99    5    6      0      4043           0
origin                  0           1.00    3    3      0         3           0
dest                    0           1.00    3    3      0       105           0

Variable type: numeric

skim_variable   n_missing  complete_rate     mean       sd    p0   p25   p50   p75  p100  hist
year                    0           1.00  2013.00     0.00  2013  2013  2013  2013  2013
month                   0           1.00     6.55     3.41     1     4     7    10    12
day                     0           1.00    15.71     8.77     1     8    16    23    31
dep_time             8255           0.98  1349.11   488.28     1   907  1401  1744  2400
sched_dep_time          0           1.00  1344.25   467.34   106   906  1359  1729  2359
dep_delay            8255           0.98    12.64    40.21   -43    -5    -2    11  1301
arr_time             8713           0.97  1502.05   533.26     1  1104  1535  1940  2400
sched_arr_time          0           1.00  1536.38   497.46     1  1124  1556  1945  2359
arr_delay            9430           0.97     6.90    44.63   -86   -17    -5    14  1272
flight                  0           1.00  1971.92  1632.47     1   553  1496  3465  8500
air_time             9430           0.97   150.69    93.69    20    82   129   192   695
distance                0           1.00  1039.91   733.23    17   502   872  1389  4983
hour                    0           1.00    13.18     4.66     1     9    13    17    23
minute                  0           1.00    26.23    19.30     0     8    29    44    59

Variable type: POSIXct

skim_variable   n_missing  complete_rate  min                  max                  median               n_unique
time_hour               0              1  2013-01-01 05:00:00  2013-12-31 23:00:00  2013-07-03 10:00:00      6936
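The n_missing and complete_rate columns above can be cross-checked with a few lines of base R. The sketch below is not how skimr computes them internally; it reproduces the same counts from the flights tibble and then shows skim() restricted to two columns.

```r
library(nycflights13)
library(skimr)

# Count missing values in every column
n_missing <- colSums(is.na(flights))
n_missing[n_missing > 0]

# Share of non-missing values per column, rounded as in the skim() output
complete_rate <- round(1 - n_missing / nrow(flights), 2)
complete_rate[c("dep_delay", "arr_delay")]

# skim() also accepts column names to summarize only part of the data
skim(flights, dep_delay, arr_delay)
```

The counts for dep_delay (8255) and arr_delay (9430) match the n_missing values reported in the numeric summary.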
— End of Part 2 —