worksheet_wrangling
School: University of British Columbia
Course: DSCI100
Subject: Statistics
Date: Feb 20, 2024
Pages: 22
Uploaded by CountKuduMaster478
Worksheet 3: Cleaning and Wrangling
Data
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
distinguish vectors from data frames in R, and describe how they relate to each other
define the term "tidy data"
discuss the advantages and disadvantages of storing data in a tidy data
format
recall and use the following tidyverse functions and operators for their
intended data wrangling tasks:
select
filter
|>
map
mutate
summarize
group_by
pivot_longer
separate
%in%
This worksheet covers parts of the Wrangling chapter of the online textbook. You
should read this chapter before attempting the worksheet.
```r
### Run this cell before continuing.
library(tidyverse)
library(repr)
source("tests.R")
source("cleanup.R")
options(repr.matrix.max.rows = 6)
```
Question 0.0
Multiple Choice:
{points: 1}
Which statement below is incorrect about vectors and data frames in R?
A. the columns of data frames are vectors
B. data frames can have columns of different types (e.g., a column of numeric
data, and a column of character data)
C. vectors can have elements of different types (e.g., element one can be numeric,
and element 2 can be a character)
D. data frames are a special kind of list
Assign your answer to an object called answer0.0. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.0 <- "C"
### END SOLUTION
```

```r
test_0.0()
```
Question 0.1
Multiple Choice:
{points: 1}
Which of the following does not
characterize a tidy dataset?
A. each row is a single observation
B. each value should not be in a single cell
C. each column is a single variable
D. each value is a single cell
Assign your answer to an object called answer0.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.1 <- "B"
### END SOLUTION
```

```r
test_0.1()
```
Question 0.2
Multiple Choice:
{points: 1}
For which scenario would using group_by() + summarize() be appropriate?
A. To apply the same function to every row.
B. To apply the same function to every column.
C. To apply the same function to groups of rows.
D. To apply the same function to groups of columns.
Assign your answer to an object called answer0.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.2 <- "C"
### END SOLUTION
```

```r
test_0.2()
```
Question 0.3
Multiple Choice:
{points: 1}
For which scenario would using one of purrr's map_* functions be appropriate?
A. To apply the same function to groups of rows.
B. To apply the same function to every column.
C. To apply the same function to groups of columns.
D. All of the above.
Assign your answer to an object called answer0.3. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.3 <- "B"
### END SOLUTION
```

```r
test_0.3()
```
1. Assessing avocado prices to inform
restaurant menu planning
It is well known that millennials LOVE avocado toast (joking... well, mostly), and so many restaurants offer menu items that centre around this delicious food! Like many food items, avocado prices fluctuate, so a restaurant that wants to maximize profits on avocado-containing dishes might ask whether there are times when avocados are less expensive to purchase. If such times exist, those are when the restaurant should put avocado-containing dishes on the menu to maximize its profits on those dishes.
Source: https://www.averiecooks.com/egg-hole-avocado-toast/
To answer this question we will analyze a data set of avocado sales from multiple US markets. This data was downloaded from the Hass Avocado Board website in May of 2018 and compiled into a single CSV. Each row in the data set contains weekly sales data for a region. The data set spans the years 2015-2018.
Some relevant columns in the dataset:
Date - The date in year-month-day format
average_price - The average price of a single avocado
type - conventional or organic
yr - The year
region - The city or region of the observation
small_hass_volume - in pounds (lbs)
large_hass_volume - in pounds (lbs)
extra_l_hass_volume - in pounds (lbs)
wk - integer number for the calendar week in the year (e.g., the first week of January is 1, and the last week of December is 52)
To answer our question of whether there are times in the year when avocados are typically less expensive (and thus we can make more profitable menu items with them at a restaurant), we will want to create a scatter plot of average_price (y-axis) versus Date (x-axis).
Question 1.1
Multiple Choice:
{points: 1}
Which of the following is not included in the csv
file?
A. Average price of a single avocado.
B. The farming practice (production with/without the use of chemicals).
C. Average price of a bag of avocados.
D. All options are included in the data set.
Assign your answer to an object called answer1.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.1 <- "C"
### END SOLUTION
```

```r
test_1.1()
```
Question 1.2
Multiple Choice:
{points: 1}
The rows in the data frame represent:
A. daily avocado sales data for a region
B. weekly avocado sales data for a region
C. bi-weekly avocado sales data for a region
D. yearly avocado sales data for a region
Assign your answer to an object called answer1.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.2 <- "B"
### END SOLUTION
```

```r
test_1.2()
```
Question 1.3
{points: 1}
The first step to plotting total volume against average price is to read the file
avocado_prices.csv
using the shortest relative path. The data file was given to
you along with this worksheet, but you will have to look to see where it is in the
worksheet_03
directory to correctly load it. When you do this, you should also
preview the file to help you choose an appropriate read_*
function to read the
data.
Assign your answer to an object called avocado.

```r
#... <- ...("...")
### BEGIN SOLUTION
avocado <- read_csv("data/avocado_prices.csv")
### END SOLUTION
avocado
```

```r
test_1.3()
```
Question 1.4
Multiple Choice:
{points: 1}
Why are the 2nd to 5th columns col_double
instead of col_integer
?
A. They aren't "real" numbers.
B. They contain decimals.
C. They are numbers created using text/letters.
D. They are col_integer
...
Assign your answer to an object called answer1.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.4 <- "B"
### END SOLUTION
```

```r
test_1.4()
```
Before we get started doing our analysis, let's learn about the pipe operator, |>
,
as it can be very helpful when doing data analysis in R!
Pipe Operators: |>
The pipe operator allows you to chain together different functions: it takes the output of one statement and makes it the input of the next statement. A chain of processing functions is known as a pipeline.
If we wanted to subset the avocado data to obtain just the average prices for organic avocados, we would first need to use the filter() function to keep the rows where the type is organic. Then we would need to use the select() function to get just the average price column.
Below we illustrate how to do this using the pipe operator, |>, instead of creating an intermediate object as we have in past worksheets:

Note: the indentation on the second line of the pipeline is not required, but added for readability.

```r
# run this cell
filter(avocado, type == "organic") |>
    select(average_price)
```
We can even start off a pipeline by passing the data frame into the first function. This is convenient and aids readability; you will see this used often in this course going forward. Below we show an example doing the same task we just completed above (subsetting the average price data for organic avocados).

```r
avocado |>
    filter(type == "organic") |>
    select(average_price)
```
Question 1.5
{points: 1}
To answer our question, let's now create the scatter plot where we plot average_price on the y-axis versus Date on the x-axis. Fill in the ... in the cell below. Copy and paste your finished answer in place of fail(). Assign your answer to an object called avocado_plot. Don't forget to create proper English axis labels.
```r
options(repr.plot.width = 14, repr.plot.height = 7) # Modifies the size of the plot
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) +
#    geom_...() +
#    xlab("...") +
#    ylab("...") +
#    theme(text = element_text(size = 20))
### BEGIN SOLUTION
avocado_plot <- avocado |>
    ggplot(aes(x = Date, y = average_price)) +
    geom_point() +
    xlab("Date") +
    ylab("Average Price (in US Dollars)") +
    theme(text = element_text(size = 20))
### END SOLUTION
avocado_plot
```

```r
test_1.5()
```
We might be able to squint and start to see some pattern in the data above, but what we see in the plot above is not very informative. Why? Because there is a lot of overplotting (data points sitting on top of other data points). What can we do? One solution is to reduce/aggregate the data in a meaningful way to help answer our question. Remember that we are interested in determining if there are times when avocados are less expensive, so that we can recommend when restaurants should put avocado-containing dishes on the menu to maximize their profits on those dishes.
In the data we plotted above, each row is the avocado sales for a region for a given week. Let's use group_by + summarize to calculate the average price for each week across years and regions. We can then plot that aggregated price against the week and perhaps get a clearer picture.
Question 1.6
{points: 1}
Create a reduced/aggregated version of the avocado data set and name it avocado_aggregate. To do this you will want to group_by the wk column and then use summarize to calculate the average price (name that column average_price).
```r
#... <- ... |>
#    group_by(...) |>
#    summarize(... = mean(average_price, na.rm = TRUE))
### BEGIN SOLUTION
avocado_aggregate <- avocado |>
    group_by(wk) |>
    summarize(average_price = mean(average_price, na.rm = TRUE))
### END SOLUTION
avocado_aggregate
```

```r
test_1.6()
```
Question 1.7
{points: 1}
Now let's take the avocado_aggregate data frame and use it to create a scatter plot where we plot average_price on the y-axis versus wk on the x-axis. Assign your answer to an object called avocado_aggregate_plot. Don't forget to create proper English axis labels.
```r
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) +
#    ...() +
#    ...("...") +
#    ...("...") +
#    theme(text = element_text(size = 20))
### BEGIN SOLUTION
avocado_aggregate_plot <- avocado_aggregate |>
    ggplot(aes(x = wk, y = average_price)) +
    geom_point() +
    xlab("Week") +
    ylab("Average Price (in US Dollars)") +
    theme(text = element_text(size = 20))
### END SOLUTION
avocado_aggregate_plot
```

```r
test_1.7()
```
We can now see that the prices of avocados do indeed fluctuate throughout the year. We could use this information to recommend that restaurants, if they want to maximize profit from menu items that contain avocados, should only offer them on the menu roughly between December and May.
Why might this happen? Perhaps price has something to do with supply? We can also use this data set to get some insight into that question by plotting total avocado volume (y-axis) versus week (x-axis). To do this, we will first have to create a column called total_volume whose value is the sum of the small, large and extra-large-sized avocado volumes. To do this we will have to go back to the original avocado data frame we loaded.
Question 1.8
{points: 1}
Our next step to plotting total_volume per week against week is to use mutate to create a new column in the avocado data frame called total_volume which is equal to the sum of all three volume columns.

Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail().
```r
#... <- ... |>
#    mutate(... = ... + ... + ...)
### BEGIN SOLUTION
avocado <- avocado |>
    mutate(total_volume = small_hass_volume + large_hass_volume + extra_l_hass_volume)
### END SOLUTION
avocado
```

```r
test_1.8()
```
Question 1.9
{points: 1}
Now, create another reduced/aggregated version of the avocado data frame and name it avocado_aggregate_2. To do this you will want to group_by the wk column and then use summarize to calculate the average total volume (name that column total_volume).
```r
#... <- ... |>
#    group_by(...) |>
#    summarize(...)
### BEGIN SOLUTION
avocado_aggregate_2 <- avocado |>
    group_by(wk) |>
    summarize(total_volume = mean(total_volume, na.rm = TRUE))
### END SOLUTION
avocado_aggregate_2
```

```r
test_1.9()
```
Question 1.10
{points: 1}
Now let's take the avocado_aggregate_2 data frame and use it to create a scatter plot where we plot average total_volume (in pounds, lbs) on the y-axis versus wk on the x-axis. Assign your answer to an object called avocado_aggregate_plot_2. Don't forget to create proper English axis labels.

Hint: don't forget to include the units for volume in your data visualization.
```r
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) +
#    ...() +
#    ...("...") +
#    ...("...") +
#    theme(text = element_text(size = 20))
### BEGIN SOLUTION
avocado_aggregate_plot_2 <- avocado_aggregate_2 |>
    ggplot(aes(x = wk, y = total_volume)) +
    geom_point() +
    xlab("Week") +
    ylab("Average total volume (lbs)") +
    theme(text = element_text(size = 20))
### END SOLUTION
avocado_aggregate_plot_2
```

```r
test_1.10()
```
We can see from the above plot of the average total volume versus the week that there are more avocados sold (and perhaps this reflects what is available for sale) roughly between January and May. This time period of increased volume corresponds with the lower avocado prices. We can hypothesize (but not conclude, of course) that the lower prices may be due to an increased availability of avocados during this time period.
2. Sea Surface Temperatures in Departure Bay
The next data set that we will be looking at contains environmental data from
1914 to 2018. The data was collected by the DFO (Canada's Department of
Fisheries and Oceans) at the Pacific Biological Station (Departure Bay). Daily sea
surface temperature (in degrees Celsius) and salinity (in practical salinity units,
PSU) observations have been carried out at several locations on the coast of
British Columbia. The number of stations reporting at any given time has varied as
sampling has been discontinued at some stations, and started or resumed at
others.
Presently termed the British Columbia Shore Station Oceanographic Program
(BCSOP), there are 12 participating stations; most of these are staffed by Fisheries
and Oceans Canada. You can look at data from other stations at http://www.pac.dfo-mpo.gc.ca/science/oceans/data-donnees/lightstations-phares/index-eng.html
Further information from the Government of Canada's website indicates:
Observations are made daily using seawater collected in a bucket
lowered into the surface water at or near the daytime high tide. This
sampling method was designed long ago by Dr. John P. Tully and has
not been changed in the interests of a homogeneous data set. This
means, for example, that if an observer starts sampling one day at 6
a.m., and continues to sample at the daytime high tide on the
second day the sample will be taken at about 06:50 the next day,
07:40 the day after etc. When the daytime high-tide gets close to 6
p.m. the observer will then begin again to sample early in the
morning, and the cycle continues. Since there is a day/night
variation in the sea surface temperatures the daily time series will
show a signal that varies with the 14-day tidal cycle. This artifact does
not affect the monthly sea surface temperature data.
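The drifting sampling schedule described in the passage can be sketched numerically. Below is a small illustration (not part of the graded worksheet; the 6 a.m. start and the roughly 50-minute daily tidal delay are taken from the passage's own example):

```r
# Successive daytime high-tide sampling times, assuming each day's
# sample is taken about 50 minutes later than the previous day's.
# We only track the time of day here, not the calendar date.
start <- as.POSIXct("2018-06-01 06:00", tz = "UTC")
sample_times <- start + (0:3) * 50 * 60   # 50 minutes = 50 * 60 seconds
format(sample_times, "%H:%M")             # "06:00" "06:50" "07:40" "08:30"
```

This matches the quoted description: a 6 a.m. sample drifts to about 06:50 the next day and 07:40 the day after, until the daytime high tide approaches evening and the observer restarts the cycle in the early morning.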
In this worksheet, we want to see if the sea surface temperature has been
changing over time.
Question 2.1
True or False:
{points: 1}
The sampling of surface water occurs at the same time each day.
Assign your answer to an object called answer2.1. Make sure your answer is lowercase "true" or lowercase "false".

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer2.1 <- "false"
### END SOLUTION
```

```r
test_2.1()
```
Question 2.2
Multiple Choice:
{points: 1}
If high tide occurred at 9am today, what time would the scientist collect data
tomorrow?
A. 11:10 am
B. 9:50 am
C. 10:00 pm
D. Trick question... you skip days when collecting data.
Assign your answer to an object called answer2.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer2.2 <- "B"
### END SOLUTION
```

```r
test_2.2()
```
Question 2.3
{points: 1}
To begin working with this data, read the file departure_bay_temperature.csv
using a relative path. Note, this file (just like the avocado data set) is found within
the worksheet_03
directory.
Assign your answer to an object called sea_surface.

```r
### BEGIN SOLUTION
sea_surface <- read_csv("data/departure_bay_temperature.csv", skip = 2)
### END SOLUTION
sea_surface
```

```r
test_2.3()
```
Question 2.3.1
{points: 1}
The data above in Question 2.3 is not tidy. Which of the reasons listed below explain why?
A. There are NA's in the data set
B. The variable temperature is split across more than one column
C. Values for the variable month are stored as column names
D. A and C
E. B and C
F. All of the above
Assign your answer to an object called answer2.3.1.

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer2.3.1 <- "E"
### END SOLUTION
```

```r
test_2.3.1()
```
Question 2.4
{points: 1}
Given that ggplot expects tidy data, we need to convert our data into that format. To do this we will use the pivot_longer() function. We would like our data to end up looking like this:
Year  Month  Temperature
1914  Jan    7.2
1914  Feb    NA
1914  Mar    NA
...   ...    ...
2018  Oct    NA
2018  Nov    NA
2018  Dec    NA
Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail().

Assign your answer to an object called tidy_temp.
```r
#... <- sea_surface |>
#    ...(cols = Jan:Dec,
#        names_to = "...",
#        values_to = "Temperature")
### BEGIN SOLUTION
tidy_temp <- sea_surface |>
    pivot_longer(cols = Jan:Dec,
                 names_to = "Month",
                 values_to = "Temperature")
### END SOLUTION
tidy_temp
```

```r
test_2.4()
```
Question 2.5
{points: 1}
Now that we have our data in a tidy format, we can create our plot that compares the average monthly sea surface temperatures (in degrees Celsius) to the year they were recorded. To make our plots more informative, we should plot each month separately. We can use filter to do this before we pipe our data into the ggplot function. Let's start out by just plotting the data for the month of November. As usual, use proper English to label your axes :)

Assign your answer to an object called nov_temp_plot.

Hint: don't forget to include the units for temperature in your data visualization.
```r
options(repr.plot.width = 12, repr.plot.height = 7)
#... <- ... |>
#    filter(... == ...) |>
#    ggplot(aes(x = ..., y = ...)) +
#    geom_point() +
#    xlab(...) +
#    ylab(...) +
#    theme(text = element_text(size = 20))
### BEGIN SOLUTION
nov_temp_plot <- tidy_temp |>
    filter(Month == "Nov") |>
    ggplot(aes(x = Year, y = Temperature)) +
    geom_point() +
    xlab("Year") +
    ylab("Temperature (Celsius)") +
    theme(text = element_text(size = 20))
### END SOLUTION
nov_temp_plot
```

```r
test_2.5()
```
We can see that there may be fewer cold temperatures in recent years, and/or that temperatures in recent years look less variable compared to years before 1975. What about other months? Let's plot them!
Instead of repeating the code above for the 11 other months, we'll take advantage of a ggplot2 function that we haven't met yet, facet_wrap. This function is used to create many plots side-by-side, wrapped around to new lines if there are too many plots. You tell ggplot2 how to split up the plots by specifying the argument facets = vars(...), where ... represents the variable that is used to split the plots. We will learn more about this function next week; this week we will give you the code for it.
Question 2.6
{points: 1}
Fill in the missing code below to plot the average monthly sea surface temperatures against the year they were recorded for all months. Assign your answer to an object called all_temp_plot.

Hint: don't forget to include the units for temperature in your data visualization.
```r
options(repr.plot.width = 14, repr.plot.height = 8)
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) +
#    geom_point() +
#    facet_wrap(facets = vars(factor(Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
#                                                      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))) +
#    xlab(...) +
#    ylab(...) +
#    theme(text = element_text(size = 20))
### BEGIN SOLUTION
all_temp_plot <- tidy_temp |>
    ggplot(aes(x = Year, y = Temperature)) +
    geom_point() +
    facet_wrap(facets = vars(factor(Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                                      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))) +
    xlab("Year") +
    ylab("Temperature (Celsius)") +
    theme(text = element_text(size = 20))
### END SOLUTION
all_temp_plot
```

```r
test_2.6()
```
We can see above that some months show a small but general increase in temperatures, whereas others don't. Some months show a change in variability and others do not. From this it is clear that if we are trying to understand temperature changes over time, we had best keep data from different months separate.
3. Pollution in Madrid
We're working with a data set from Kaggle once again! This data was collected under the instructions of Madrid's City Council and is publicly available on their website. In recent years, high levels of pollution during certain dry periods have forced the authorities to take measures against the use of cars, and have been used as justification for proposing certain regulations. This data includes daily and hourly measurements of air quality from 2001 to 2008. Pollutants are categorized based on their chemical properties.
There are a number of stations set up around Madrid, and each station's data frame contains all particle measurements that the station registered from 01/2001 to 04/2008. Not every station has the same equipment, therefore each station can measure only a certain subset of particles. The complete list of possible measurements and their explanations are given by the website:
SO_2: sulphur dioxide level measured in μg/m³. High levels can produce irritation of the skin and membranes, and worsen asthma or heart disease in sensitive groups.
CO: carbon monoxide level measured in mg/m³. Carbon monoxide poisoning causes headaches, dizziness and confusion with short exposures, and can result in loss of consciousness, arrhythmias, seizures or even death.
NO_2: nitrogen dioxide level measured in μg/m³. Long-term exposure is a cause of chronic lung disease and is harmful to vegetation.
PM10: particles smaller than 10 μm. Even though they cannot penetrate the alveolus, they can still penetrate the lungs and affect other organs. Long-term exposure can result in lung cancer and cardiovascular complications.
NOx: nitrous oxides level measured in μg/m³. These affect the human respiratory system, worsening asthma and other diseases, and are responsible for the yellowish-brown colour of photochemical smog.
O_3: ozone level measured in μg/m³. High levels can produce asthma, bronchitis or other chronic pulmonary diseases in sensitive groups or outdoor workers.
TOL: toluene (methylbenzene) level measured in μg/m³. Long-term exposure to this substance (present in tobacco smoke as well) can result in kidney complications or permanent brain damage.
BEN: benzene level measured in μg/m³. Benzene is an eye and skin irritant, and long exposures may result in several types of cancer, leukaemia and anaemia. Benzene is considered a group 1 carcinogen in humans.
EBE: ethylbenzene level measured in μg/m³. Long-term exposure can cause hearing or kidney problems, and the IARC has concluded that long-term exposure can produce cancer.
MXY: m-xylene level measured in μg/m³. Xylenes can affect not only air but also water and soil, and long exposure to high levels of xylenes can result in diseases affecting the liver, kidney and nervous system.
PXY: p-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
OXY: o-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
TCH: total hydrocarbons level measured in mg/m³. This group of substances can be responsible for different blood, immune system, liver, spleen, kidney or lung diseases.
NMHC: non-methane hydrocarbons (volatile organic compounds) level measured in mg/m³. Long exposure to some of these substances can result in damage to the liver, kidney, and central nervous system. Some of them are suspected to cause cancer in humans.
The goal of this assignment is to see if pollutants are decreasing (is air quality
improving) and also compare which pollutant has decreased the most over the
span of 5 years (2001 - 2006).
1. First do a plot of one of the pollutants (EBE).
2. Next, group it by month and year; calculate the maximum value and plot it (to
see the trend through time).
3. Now we will look at which pollutant decreased the most. Repeat the same
thing for every column - to speed up the process, use the map()
function.
First we will look at pollution in 2001 (get the maximum value for each of the
pollutants). And then do the same for 2006.
Question 3.1
Multiple Choice:
{points: 1}
What big picture question are we trying to answer?
A. Did EBE decrease in Madrid between 2001 and 2006?
B. Of all the pollutants, which decreased the most between 2001 and 2006?
C. Of all the pollutants, which decreased the least between 2001 and 2006?
D. Did EBE increase in Madrid between 2001 and 2006?
Assign your answer to an object called answer3.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer3.1 <- "B"
### END SOLUTION
```

```r
test_3.1()
```
Question 3.2
{points: 1}
To begin working with this data, read the file madrid_pollution.csv. Note, this file (just like the avocado and sea surface data sets) is found in the worksheet_wrangling directory.

Assign your answer to an object called madrid.

```r
### BEGIN SOLUTION
madrid <- read_tsv("data/madrid_pollution.csv")
### END SOLUTION
madrid
```

```r
test_3.2()
```
Question 3.3
{points: 1}
Now that the data is loaded in R, create a scatter plot that compares ethylbenzene (EBE) values against the date they were recorded. This graph will showcase the concentration of ethylbenzene in Madrid over time. As usual, label your axes:

x = Date
y = Ethylbenzene (μg/m³)

Assign your answer to an object called EBE_pollution.
```r
options(repr.plot.width = 13, repr.plot.height = 7)
### BEGIN SOLUTION
EBE_pollution <- madrid |>
    ggplot(aes(x = date, y = EBE)) +
    geom_point(alpha = 0.15) +
    xlab("Date") +
    ylab("Ethylbenzene (μg/m³)") +
    theme(text = element_text(size = 20))
### END SOLUTION
EBE_pollution
# Are levels increasing or decreasing?
```

```r
test_3.3()
```
We can see from this plot that over time, there are fewer and fewer high (> 25 μg/m³) EBE values.
Question 3.4
{points: 1}
The question above asked you to write code visualizing all EBE recordings, which are taken every hour of every day. Consequently the graph consists of many points and appears densely plotted. In this question, we are going to clean up the graph and focus on the maximum EBE reading from each month. To further investigate whether this trend is changing over time, we will use group_by and summarize to create a new data set.

Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail().

Assign your answer to an object called madrid_pollution.
```r
# ... <- ... |>
#     group_by(year, ...) |>
#     ...(max_ebe = max(EBE, na.rm = TRUE))
### BEGIN SOLUTION
madrid_pollution <- madrid |>
    group_by(year, mnth) |>
    summarize(max_ebe = max(EBE, na.rm = TRUE))
### END SOLUTION
madrid_pollution
```

```r
test_3.4()
```
Question 3.5
{points: 1}
Plot the new maximum EBE values versus the month they were recorded, split into side-by-side plots for each year. Again, we will use facetting (this time with facet_grid; more on this next week) to plot each year side-by-side. We will also use the theme function to rotate the axis labels to make them more readable (more on this is coming next week too!).

Assign your answer to an object called madrid_plot. Remember to label your axes.
```r
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) +
#    geom_point() +
#    xlab(...) +
#    ylab(...) +
#    facet_grid(~ year) +
#    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
#    theme(text = element_text(size = 20))
### BEGIN SOLUTION
madrid_plot <- madrid_pollution |>
    ggplot(aes(x = mnth, y = max_ebe)) +
    geom_point() +
    xlab("Month") +
    ylab("Max Ethylbenzene (μg/m³)") +
    facet_grid(~ year) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    theme(text = element_text(size = 20))
### END SOLUTION
madrid_plot
```

```r
test_3.5()
```
Question 3.6
{points: 1}
Now we want to see which of the pollutants has decreased the most. Therefore, we must repeat the same thing that we did in the questions above, but for every pollutant (using the original data set)! This is where purrr's map_* functions can be really helpful!

First we will look at Madrid pollution in 2001 (filter for this year). Next we have to select out the columns that should be excluded (such as the date). Lastly, use the map_df() function to compute max values for all columns.

Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail().

Assign your answer to an object called pollution_2001.
# ... <- madrid |>
#   ...(year == 2001) |>
#   select(-..., -year, -mnth) |>
#   map_df(..., na.rm = TRUE)
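Before filling in the skeleton, it may help to see what map_df() does on its own. Below is a minimal sketch using a made-up toy data frame (the column names and values are invented for illustration, not taken from the madrid data): map_df() applies a function to every column and binds the results into a one-row data frame.

```r
library(purrr)

# toy data frame standing in for the pollutant columns (made-up values)
toy <- data.frame(BEN = c(1.2, 3.4, NA), TOL = c(10, 25, 7))

# apply max() to every column; extra arguments (na.rm = TRUE) are passed along
toy_max <- map_df(toy, max, na.rm = TRUE)
toy_max  # a 1-row data frame: BEN = 3.4, TOL = 25
```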
### BEGIN SOLUTION
pollution_2001 <- madrid |>
    filter(year == 2001) |>
    select(-date, -year, -mnth) |>
    map_df(max, na.rm = TRUE)
### END SOLUTION
pollution_2001
test_3.6()
Question 3.7
{points: 1}
Now repeat what you did for Question 3.6, but filter for 2006 instead.

Assign your answer to an object called pollution_2006.
### BEGIN SOLUTION
pollution_2006 <- madrid |>
    filter(year == 2006) |>
    select(-date, -year, -mnth) |>
    map_df(~ max(., na.rm = TRUE))
### END SOLUTION
pollution_2006
test_3.7()
Question 3.8
{points: 1}
Which pollutant decreased by the greatest magnitude between 2001 and 2006?

Given that the two objects you just created, pollution_2001 and pollution_2006, are data frames with the same columns, you should be able to subtract one from the other to find which pollutant decreased by the greatest magnitude between the two years.

Assign your answer to an object called answer3.8. Make sure to write the answer exactly as it is given in the data set.

Example:
answer3.8 <- "BEN"
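If it's unclear why subtracting two data frames works here, a minimal sketch with made-up one-row data frames (hypothetical values, not the real pollution data) shows that subtraction happens element-wise, column by column:

```r
# two one-row data frames with matching columns (made-up values)
before <- data.frame(BEN = 30, TOL = 90)
after  <- data.frame(BEN = 25, TOL = 40)

# subtraction is element-wise, so each column is subtracted pairwise
diffs <- after - before
diffs  # BEN = -5, TOL = -50
```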
# Replace the fail() with your answer.
### BEGIN SOLUTION
pollution_2006 - pollution_2001
answer3.8 <- "TOL"
### END SOLUTION

test_3.8()
Question 3.9
{points: 1}
Given that there were only 14 columns in the data frame above, you could use your eyes to pick out which pollutant decreased by the greatest magnitude between 2001 and 2006. But what would you do if you had 100 columns? Or 1000 columns? It would take A LONG TIME for your human eyeballs to find the biggest difference. Maybe you could use the min function:

# run this cell
pollution_2006 - pollution_2001
min(pollution_2006 - pollution_2001)
This is a step in the right direction, but you get the value and not the column name... What are we to do? Tidy our data! Our data is not in tidy format, and so it's difficult to access the values for the variable pollutant because they are stuck as column headers. Let's use pivot_longer to tidy our data and make it look like this:

pollutant   value
BEN         -33.04
CO          -6.91
...         ...
To answer this question, fill in the ... in the cell below. Copy and paste your finished answer and replace the fail().

Assign your answer to an object called pollution_diff and ensure it has the same column names as the table pictured above.
pollution_diff <- pollution_2006 - pollution_2001

# pollution_diff <- ... |>
#   pivot_longer(cols = everything(),
#                names_to = ...,
#                values_to = ...)
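As a minimal sketch of how pivot_longer reshapes a wide one-row data frame (made-up values for illustration): the column names become entries of a new pollutant column, and the cell values move into a value column.

```r
library(tidyr)

# toy wide data frame: variable names are stuck in the column headers
wide <- data.frame(BEN = -5, TOL = -50)

long <- wide |>
    pivot_longer(cols = everything(),
                 names_to = "pollutant",
                 values_to = "value")
long  # 2 rows: pollutant = "BEN", "TOL"; value = -5, -50
```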
### BEGIN SOLUTION
pollution_diff <- pollution_diff |>
    pivot_longer(cols = everything(),
                 names_to = "pollutant",
                 values_to = "value")
### END SOLUTION
pollution_diff
test_3.9()
Question 3.10
{points: 1}
Now that you have tidy data, you can use arrange and desc to order the data in descending order. Each element of the value column corresponds to an amount of decrease in a pollutant, so the largest decrease should be the most negative entry, i.e., the last row in the resulting data frame. Therefore, we can take the sorted data frame and pipe it to tail (with the argument n = 1) to return only the last row of the data frame.
(The function tail is just like head, except it returns the last rows of the data frame instead of the first rows.)
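A quick sketch of the difference, on a made-up toy data frame:

```r
# toy data frame (made-up values)
d <- data.frame(x = 1:5)

head(d, n = 2)  # first two rows: x = 1, 2
tail(d, n = 1)  # last row only:  x = 5
```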
To answer this question, fill in the ... in the cell below. Copy and paste your finished answer and replace the fail().

Assign your answer to an object called max_pollution_diff.
# ... <- ... |>
#   arrange(desc(...)) |>
#   tail(n = 1)
### BEGIN SOLUTION
max_pollution_diff <- pollution_diff |>
    arrange(desc(value)) |>
    tail(n = 1)
### END SOLUTION
max_pollution_diff
test_3.10()
At the end of this data wrangling worksheet, we'll leave you with a couple of quotes to ponder:
“Happy families are all alike; every unhappy family is unhappy in its
own way.” –– Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its
own way.” –– Hadley Wickham
source: Tidy data chapter from R for Data Science by Garrett Grolemund & Hadley
Wickham
source("cleanup.R")