Worksheet 3: Cleaning and Wrangling Data

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

- distinguish vectors and data frames in R, and describe how they relate to each other
- define the term "tidy data"
- discuss the advantages and disadvantages of storing data in a tidy data format
- recall and use the following tidyverse functions and operators for their intended data wrangling tasks:
  - select
  - filter
  - |>
  - map
  - mutate
  - summarize
  - group_by
  - pivot_longer
  - separate
  - %in%

This worksheet covers parts of the Wrangling chapter of the online textbook. You should read that chapter before attempting the worksheet.

```r
### Run this cell before continuing.
library(tidyverse)
library(repr)
source("tests.R")
source("cleanup.R")
options(repr.matrix.max.rows = 6)
```

Question 0.0 Multiple Choice: {points: 1}

Which statement below is incorrect about vectors and data frames in R?

A. the columns of data frames are vectors
B. data frames can have columns of different types (e.g., a column of numeric data and a column of character data)
C. vectors can have elements of different types (e.g., element one can be numeric, and element two can be a character)
D. data frames are a special kind of list
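If the distinction is fuzzy, the following toy snippet (ours, not part of the worksheet) illustrates the key facts: a vector coerces all of its elements to a single type, while a data frame is a special kind of list whose columns are vectors that may differ in type from one another.

```r
# Toy illustration (not part of the worksheet):
v <- c(1, "a")   # mixing types in a vector forces coercion...
typeof(v)        # ...to a single type: "character"

df <- data.frame(num = c(1, 2), chr = c("a", "b"))
str(df)          # a data frame: a list of vectors, each with its own type
```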
Assign your answer to an object called answer0.0. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.0 <- "C"
### END SOLUTION
test_0.0()
```

Question 0.1 Multiple Choice: {points: 1}

Which of the following does not characterize a tidy dataset?

A. each row is a single observation
B. each value should not be in a single cell
C. each column is a single variable
D. each value is a single cell

Assign your answer to an object called answer0.1. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.1 <- "B"
### END SOLUTION
test_0.1()
```

Question 0.2 Multiple Choice: {points: 1}

For which scenario would using group_by() + summarize() be appropriate?

A. To apply the same function to every row.
B. To apply the same function to every column.
C. To apply the same function to groups of rows.
D. To apply the same function to groups of columns.

Assign your answer to an object called answer0.2. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").
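If you want to see the group_by() + summarize() pattern in action before answering, here is a toy illustration with made-up data (not part of the worksheet):

```r
# Made-up data: apply the same function (mean) to groups of rows.
toy <- tibble(species = c("cat", "cat", "dog", "dog"),
              height  = c(23, 25, 60, 40))
toy |>
  group_by(species) |>                   # split rows into groups
  summarize(avg_height = mean(height))   # one summary row per group
```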
```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.2 <- "C"
### END SOLUTION
test_0.2()
```

Question 0.3 Multiple Choice: {points: 1}

For which scenario would using one of the purrr map_* functions be appropriate?

A. To apply the same function to groups of rows.
B. To apply the same function to every column.
C. To apply the same function to groups of columns.
D. All of the above.

Assign your answer to an object called answer0.3. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.3 <- "B"
### END SOLUTION
test_0.3()
```
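Similarly, here is a toy illustration (made-up data, not part of the worksheet) of what the purrr map_* functions do: they apply the same function to every column of a data frame.

```r
# purrr is loaded as part of the tidyverse.
toy <- tibble(x = c(1, 2, 3), y = c(10, 20, 30))
map_dbl(toy, mean)   # returns a named numeric vector: x = 2, y = 20
```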
1. Assessing avocado prices to inform restaurant menu planning

It is well known that millennials LOVE avocado toast (joking... well, mostly), and so many restaurants offer menu items that centre around this delicious food! Like many food items, avocado prices fluctuate, so a restaurant that wants to maximize profits on avocado-containing dishes might ask whether there are times when avocados are less expensive to purchase. If such times exist, that is when the restaurant should put avocado-containing dishes on the menu to maximize its profits on those dishes.

Image: egg-in-a-hole avocado toast. Source: https://www.averiecooks.com/egg-hole-avocado-toast/

To answer this question we will analyze a data set of avocado sales from multiple US markets. This data was downloaded from the Hass Avocado Board website in May 2018 and compiled into a single CSV. Each row in the data set contains weekly sales data for a region. The data set spans the years 2015 to 2018.

Some relevant columns in the dataset:

- Date - the date in year-month-day format
- average_price - the average price of a single avocado
- type - conventional or organic
- yr - the year
- region - the city or region of the observation
- small_hass_volume - in pounds (lbs)
- large_hass_volume - in pounds (lbs)
- extra_l_hass_volume - in pounds (lbs)
- wk - integer number of the calendar week in the year (e.g., the first week of January is 1, and the last week of December is 52)

To answer our question of whether there are times in the year when avocados are typically less expensive (and thus we can make more profitable menu items with them at a restaurant), we will want to create a scatter plot of average_price (y-axis) versus Date (x-axis).

Question 1.1 Multiple Choice: {points: 1}

Which of the following is not included in the csv file?

A. Average price of a single avocado.
B. The farming practice (production with/without the use of chemicals).
C. Average price of a bag of avocados.
D. All options are included in the data set.
Assign your answer to an object called answer1.1. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.1 <- "C"
### END SOLUTION
test_1.1()
```

Question 1.2 Multiple Choice: {points: 1}

The rows in the data frame represent:

A. daily avocado sales data for a region
B. weekly avocado sales data for a region
C. bi-weekly avocado sales data for a region
D. yearly avocado sales data for a region

Assign your answer to an object called answer1.2. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.2 <- "B"
### END SOLUTION
test_1.2()
```

Question 1.3 {points: 1}

The first step to plotting total volume against average price is to read the file avocado_prices.csv using the shortest relative path. The data file was given to you along with this worksheet, but you will have to look to see where it is in the worksheet_03 directory to load it correctly. When you do this, you should also preview the file to help you choose an appropriate read_* function to read the data.

Assign your answer to an object called avocado.

```r
#... <- ...("...")

### BEGIN SOLUTION
avocado <- read_csv("data/avocado_prices.csv")
### END SOLUTION
avocado
```
```r
test_1.3()
```

Question 1.4 Multiple Choice: {points: 1}

Why are the 2nd to 5th columns col_double instead of col_integer?

A. They aren't "real" numbers.
B. They contain decimals.
C. They are numbers created using text/letters.
D. They are col_integer...

Assign your answer to an object called answer1.4. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").

```r
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.4 <- "B"
### END SOLUTION
test_1.4()
```

Before we get started on our analysis, let's learn about the pipe operator, |>, as it can be very helpful when doing data analysis in R!

Pipe Operators: |>

Pipe operators allow you to chain together different functions: the pipe takes the output of one statement and makes it the input of the next. A chain of processing functions is known as a pipeline.

If we wanted to subset the avocado data to obtain just the average prices for organic avocados, we would first use filter() to keep the rows where type is "organic", and then use select() to get just the average_price column. Below we illustrate how to do this using the pipe operator, |>, instead of creating an intermediate object as we have in past worksheets.

Note: the indentation on the second line of the pipeline is not required, but is added for readability.

```r
# run this cell
filter(avocado, type == "organic") |>
  select(average_price)
```
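For comparison, here is the equivalent code written without the pipe (shown only to illustrate what the pipe replaces): the same two function calls, but nested inside-out rather than read left to right.

```r
# Equivalent to the pipeline above, without |>:
select(filter(avocado, type == "organic"), average_price)
```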
We can even start a pipeline by passing the data frame into the first function. This is convenient and aids readability; you will see it used often in this course going forward. Below is an example that performs the same task we just completed above (subsetting the average price data for organic avocados).

```r
avocado |>
  filter(type == "organic") |>
  select(average_price)
```

Question 1.5 {points: 1}

To answer our question, let's now create the scatter plot where we plot average_price on the y-axis versus Date on the x-axis. Fill in the ... in the cell below. Copy and paste your finished answer in place of fail(). Assign your answer to an object called avocado_plot. Don't forget to create proper English axis labels.

```r
options(repr.plot.width = 14, repr.plot.height = 7)  # Modifies the size of the plot

#... <- ... |>
#  ggplot(aes(x = ..., y = ...)) +
#  geom_...() +
#  xlab("...") +
#  ylab("...") +
#  theme(text = element_text(size = 20))

### BEGIN SOLUTION
avocado_plot <- avocado |>
  ggplot(aes(x = Date, y = average_price)) +
  geom_point() +
  xlab("Date") +
  ylab("Average Price (in US Dollars)") +
  theme(text = element_text(size = 20))
### END SOLUTION
avocado_plot
```

```r
test_1.5()
```

We might be able to squint and start to see some pattern in the data above, but what we see in the plot is not very informative. Why? Because there is a lot of overplotting (data points sitting on top of other data points). What can we do? One solution is to reduce/aggregate the data in a meaningful way to help answer our question. Remember that we are interested in determining whether there are times when avocados are less expensive, so that we can recommend when restaurants should put avocado-containing dishes on the menu to maximize their profits on those dishes.

In the data we plotted above, each row is the weekly avocado sales for a region. Let's use group_by + summarize to calculate the average
price for each week across years and regions. We can then plot that aggregated price against the week and perhaps get a clearer picture.

Question 1.6 {points: 1}

Create a reduced/aggregated version of the avocado data set and name it avocado_aggregate. To do this you will want to group_by the wk column and then use summarize to calculate the average price (name that column average_price).

```r
#... <- ... |>
#  group_by(...) |>
#  summarize(... = mean(average_price, na.rm = TRUE))

### BEGIN SOLUTION
avocado_aggregate <- avocado |>
  group_by(wk) |>
  summarize(average_price = mean(average_price, na.rm = TRUE))
### END SOLUTION
avocado_aggregate
```

```r
test_1.6()
```

Question 1.7 {points: 1}

Now let's take the avocado_aggregate data frame and use it to create a scatter plot where we plot average_price on the y-axis versus wk on the x-axis.

Assign your answer to an object called avocado_aggregate_plot. Don't forget to create proper English axis labels.

```r
#... <- ... |>
#  ggplot(aes(x = ..., y = ...)) +
#  ...() +
#  ...("...") +
#  ...("...") +
#  theme(text = element_text(size = 20))

### BEGIN SOLUTION
avocado_aggregate_plot <- avocado_aggregate |>
  ggplot(aes(x = wk, y = average_price)) +
  geom_point() +
  xlab("Week") +
  ylab("Average Price (in US Dollars)") +
  theme(text = element_text(size = 20))
### END SOLUTION
avocado_aggregate_plot
```

```r
test_1.7()
```

We can now see that the price of avocados does indeed fluctuate throughout the year, and we could use this information to recommend to restaurants that if they
want to maximize profit from menu items that contain avocados, they should only offer them on the menu roughly between December and May.

Why might this happen? Perhaps price has something to do with supply? We can also use this data set to get some insight into that question by plotting total avocado volume (y-axis) versus week. To do this, we will first have to create a column called total_volume whose value is the sum of the small, large, and extra-large avocado volumes. For that we will have to go back to the original avocado data frame we loaded.

Question 1.8 {points: 1}

Our next step to plotting total_volume per week against week is to use mutate to create a new column in the avocado data frame called total_volume, which is equal to the sum of all three volume columns.

Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail().

```r
#... <- ... |>
#  mutate(... = ... + ... + ...)

### BEGIN SOLUTION
avocado <- avocado |>
  mutate(total_volume = small_hass_volume + large_hass_volume + extra_l_hass_volume)
### END SOLUTION
avocado
```

```r
test_1.8()
```

Question 1.9 {points: 1}

Now, create another reduced/aggregated version of the avocado data frame and name it avocado_aggregate_2. To do this you will want to group_by the wk column and then use summarize to calculate the average total volume (name that column total_volume).

```r
#... <- ... |>
#  group_by(...) |>
#  summarize(...)

### BEGIN SOLUTION
avocado_aggregate_2 <- avocado |>
  group_by(wk) |>
  summarize(total_volume = mean(total_volume, na.rm = TRUE))
### END SOLUTION
avocado_aggregate_2
```

```r
test_1.9()
```
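As an aside on Question 1.8: the three-column sum can also be written with rowSums() and across(). This is an alternative sketch, not the answer the worksheet expects, but it scales better when there are many columns to add.

```r
# Alternative to the mutate in Question 1.8 (a sketch):
avocado_alt <- avocado |>
  mutate(total_volume = rowSums(across(c(small_hass_volume,
                                         large_hass_volume,
                                         extra_l_hass_volume))))
```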
Question 1.10 {points: 1}

Now let's take the avocado_aggregate_2 data frame and use it to create a scatter plot where we plot average total_volume (in pounds, lbs) on the y-axis versus wk on the x-axis.

Assign your answer to an object called avocado_aggregate_plot_2. Don't forget to create proper English axis labels.

Hint: don't forget to include the units for volume in your data visualization.

```r
#... <- ... |>
#  ggplot(aes(x = ..., y = ...)) +
#  ...() +
#  ...("...") +
#  ...("...") +
#  theme(text = element_text(size = 20))

### BEGIN SOLUTION
avocado_aggregate_plot_2 <- avocado_aggregate_2 |>
  ggplot(aes(x = wk, y = total_volume)) +
  geom_point() +
  xlab("Week") +
  ylab("Average total volume (lbs)") +
  theme(text = element_text(size = 20))
### END SOLUTION
avocado_aggregate_plot_2
```

```r
test_1.10()
```

We can see from the plot of average total volume versus week that more avocados are sold (which perhaps reflects what is available for sale) roughly between January and May. This period of increased volume corresponds with the lower avocado prices. We can hypothesize (but not conclude, of course) that the lower prices may be due to an increased availability of avocados during this time period.

2. Sea Surface Temperatures in Departure Bay

The next data set that we will be looking at contains environmental data from 1914 to 2018. The data was collected by the DFO (Canada's Department of Fisheries and Oceans) at the Pacific Biological Station (Departure Bay). Daily sea surface temperature (in degrees Celsius) and salinity (in practical salinity units, PSU) observations have been carried out at several locations on the coast of British Columbia. The number of stations reporting at any given time has varied, as sampling has been discontinued at some stations and started or resumed at others.
Presently termed the British Columbia Shore Station Oceanographic Program (BCSOP), there are 12 participating stations; most of these are staffed by Fisheries and Oceans Canada. You can look at data from other stations at http://www.pac.dfo-mpo.gc.ca/science/oceans/data-donnees/lightstations-phares/index-eng.html

Further information from the Government of Canada's website indicates:

Observations are made daily using seawater collected in a bucket lowered into the surface water at or near the daytime high tide. This sampling method was designed long ago by Dr. John P. Tully and has not been changed in the interests of a homogeneous data set. This means, for example, that if an observer starts sampling one day at 6 a.m. and continues to sample at the daytime high tide, the sample will be taken at about 06:50 the next day, 07:40 the day after, etc. When the daytime high tide gets close to 6 p.m., the observer will begin again to sample early in the morning, and the cycle continues. Since there is a day/night variation in sea surface temperatures, the daily time series will show a signal that varies with the 14-day tidal cycle. This artifact does not affect the monthly sea surface temperature data.

In this worksheet, we want to see if the sea surface temperature has been changing over time.

Question 2.1 True or False: {points: 1}

The sampling of surface water occurs at the same time each day.

Assign your answer to an object called answer2.1. Make sure your answer is lowercase "true" or lowercase "false".

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer2.1 <- "false"
### END SOLUTION
test_2.1()
```

Question 2.2 Multiple Choice: {points: 1}

If high tide occurred at 9am today, what time would the scientist collect data tomorrow?

A. 11:10 am
B. 9:50 am
C. 10:00 pm
D. Trick question... you skip days when collecting data.

Assign your answer to an object called answer2.2. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer2.2 <- "B"
### END SOLUTION
test_2.2()
```

Question 2.3 {points: 1}

To begin working with this data, read the file departure_bay_temperature.csv using a relative path. Note, this file (just like the avocado data set) is found within the worksheet_03 directory.

Assign your answer to an object called sea_surface.

```r
### BEGIN SOLUTION
sea_surface <- read_csv("data/departure_bay_temperature.csv", skip = 2)
### END SOLUTION
sea_surface
```

The skip = 2 argument tells read_csv() to ignore the first two lines of the file, which are not part of the data; this is the kind of detail you discover by previewing the file first.

```r
test_2.3()
```

Question 2.3.1 {points: 1}

The data above in Question 2.3 is not tidy. Which of the reasons listed below explain why?

A. There are NA's in the data set
B. The variable temperature is split across more than one column
C. Values for the variable month are stored as column names
D. A and C
E. B and C
F. All of the above

Assign your answer to an object called answer2.3.1.
```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer2.3.1 <- "E"
### END SOLUTION
test_2.3.1()
```

Question 2.4 {points: 1}

Given that ggplot expects tidy data, we need to convert our data into that format. To do this we will use the pivot_longer() function. We would like our data to end up looking like this:

Year  Month  Temperature
1914  Jan    7.2
1914  Feb    NA
1914  Mar    NA
...   ...    ...
2018  Oct    NA
2018  Nov    NA
2018  Dec    NA

Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail(). Assign your answer to an object called tidy_temp.

```r
#... <- sea_surface |>
#  ...(cols = Jan:Dec,
#      names_to = "...",
#      values_to = "Temperature")

### BEGIN SOLUTION
tidy_temp <- sea_surface |>
  pivot_longer(cols = Jan:Dec,
               names_to = "Month",
               values_to = "Temperature")
### END SOLUTION
tidy_temp
```

```r
test_2.4()
```
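As an aside, pivot_wider() is the inverse of pivot_longer(). A quick sketch (not required for the worksheet) showing how you could recover the original wide layout from tidy_temp:

```r
# Round-trip back to one column per month (a sketch, not required):
tidy_temp |>
  pivot_wider(names_from = Month, values_from = Temperature)
```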
Question 2.5 {points: 1}

Now that we have our data in a tidy format, we can create our plot comparing the average monthly sea surface temperatures (in degrees Celsius) to the year they were recorded. To make our plots more informative, we should plot each month separately. We can use filter to do this before we pipe our data into the ggplot function. Let's start out by just plotting the data for the month of November. As usual, use proper English to label your axes :)

Assign your answer to an object called nov_temp_plot.

Hint: don't forget to include the units for temperature in your data visualization.

```r
options(repr.plot.width = 12, repr.plot.height = 7)

#... <- ... |>
#  filter(... == ...) |>
#  ggplot(aes(x = ..., y = ...)) +
#  geom_point() +
#  xlab(...) +
#  ylab(...) +
#  theme(text = element_text(size = 20))

### BEGIN SOLUTION
nov_temp_plot <- tidy_temp |>
  filter(Month == "Nov") |>
  ggplot(aes(x = Year, y = Temperature)) +
  geom_point() +
  xlab("Year") +
  ylab("Temperature (Celsius)") +
  theme(text = element_text(size = 20))
### END SOLUTION
nov_temp_plot
```

```r
test_2.5()
```

We can see that there may be fewer cold temperatures in recent years, and/or that temperatures in recent years look less variable compared to years before 1975. What about other months? Let's plot them!

Instead of repeating the code above for the 11 other months, we'll take advantage of a ggplot2 function that we haven't met yet: facet_wrap. This function is used to create many plots side by side, wrapped around to new lines if there are too many plots. You tell ggplot2 how to split up the plots by specifying the argument facets = vars(...), where ... represents the variable used to split the plots. We will learn more about this function next week; this week we will give you the code for it.

Question 2.6 {points: 1}

Fill in the missing code below to plot the average monthly sea surface temperatures against the year they were recorded, for all months. Assign your answer to an object called all_temp_plot.

Hint: don't forget to include the units for temperature in your data visualization.
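One detail in the scaffold below is worth a closer look: Month is wrapped in factor(Month, levels = ...). Here is a standalone illustration (toy code, not part of the worksheet) of why; without explicit levels, the panels would be ordered alphabetically rather than in calendar order.

```r
month_names <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
sort(month_names)                                  # alphabetical: "Apr" "Aug" "Dec" ...
levels(factor(month_names, levels = month_names))  # calendar order: "Jan" "Feb" ...
```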
```r
options(repr.plot.width = 14, repr.plot.height = 8)

#... <- ... |>
#  ggplot(aes(x = ..., y = ...)) +
#  geom_point() +
#  facet_wrap(facets = vars(factor(Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
#                                                    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))) +
#  xlab(...) +
#  ylab(...) +
#  theme(text = element_text(size = 20))

### BEGIN SOLUTION
all_temp_plot <- tidy_temp |>
  ggplot(aes(x = Year, y = Temperature)) +
  geom_point() +
  facet_wrap(facets = vars(factor(Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                                    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))) +
  xlab("Year") +
  ylab("Temperature (Celsius)") +
  theme(text = element_text(size = 20))
### END SOLUTION
all_temp_plot
```

```r
test_2.6()
```

We can see above that some months show a small but general increase in temperatures, whereas others don't. Some months also show a change in variability while others do not. From this it is clear that if we are trying to understand temperature changes over time, we had best keep data from different months separate.

3. Pollution in Madrid

We're working with a data set from Kaggle once again! This data was collected under the instruction of Madrid's City Council and is publicly available on their website. In recent years, high levels of pollution during certain dry periods have forced the authorities to take measures against the use of cars and have served as grounds for proposing certain regulations. This data includes daily and hourly measurements of air quality from 2001 to 2008. Pollutants are categorized based on their chemical properties. There are a number of stations set up around Madrid, and each station's data frame contains all particle measurements that the station registered from 01/2001 to 04/2008. Not every station has the same equipment, so each station can measure only a certain subset of particles. The complete list of possible measurements and their explanations, as given by the website:

- SO_2: sulphur dioxide level measured in μg/m³. High levels can produce irritation in the skin and membranes, and worsen asthma or heart diseases in sensitive groups.
- CO: carbon monoxide level measured in mg/m³. Carbon monoxide poisoning involves headaches, dizziness, and confusion in short exposures, and can result in loss of consciousness, arrhythmias, seizures, or even death.
- NO_2: nitrogen dioxide level measured in μg/m³. Long-term exposure is a cause of chronic lung disease and is harmful to vegetation.
- PM10: particles smaller than 10 μm. Even though they cannot penetrate the alveolus, they can still penetrate through the lungs and affect other organs. Long-term exposure can result in lung cancer and cardiovascular complications.
- NOx: nitrous oxides level measured in μg/m³. These affect the human respiratory system, worsening asthma and other diseases, and are responsible for the yellowish-brown color of photochemical smog.
- O_3: ozone level measured in μg/m³. High levels can produce asthma, bronchitis, or other chronic pulmonary diseases in sensitive groups or outdoor workers.
- TOL: toluene (methylbenzene) level measured in μg/m³. Long-term exposure to this substance (present in tobacco smoke as well) can result in kidney complications or permanent brain damage.
- BEN: benzene level measured in μg/m³. Benzene is an eye and skin irritant, and long exposures may result in several types of cancer, leukaemia, and anaemias. Benzene is considered a Group 1 carcinogen to humans.
- EBE: ethylbenzene level measured in μg/m³. Long-term exposure can cause hearing or kidney problems, and the IARC has concluded that long-term exposure can produce cancer.
- MXY: m-xylene level measured in μg/m³. Xylenes can affect not only air but also water and soil, and long exposure to high levels of xylenes can result in diseases affecting the liver, kidney, and nervous system.
- PXY: p-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
- OXY: o-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
- TCH: total hydrocarbons level measured in mg/m³. This group of substances can be responsible for different blood, immune system, liver, spleen, kidney, or lung diseases.
- NMHC: non-methane hydrocarbons (volatile organic compounds) level measured in mg/m³. Long exposure to some of these substances can result in damage to the liver, kidney, and central nervous system. Some of them are suspected to cause cancer in humans.

The goal of this assignment is to see if pollutants are decreasing (i.e., is air quality improving?) and also to compare which pollutant has decreased the most over the span of five years (2001-2006).

1. First, plot one of the pollutants (EBE).
2. Next, group it by month and year; calculate the maximum value and plot it (to see the trend through time).
3. Finally, we will look at which pollutant decreased the most. Repeat the same thing for every column; to speed up the process, use the map() function. First we will look at pollution in 2001 (get the maximum value for each of the pollutants), and then do the same for 2006.

Question 3.1 Multiple Choice: {points: 1}

What big-picture question are we trying to answer?

A. Did EBE decrease in Madrid between 2001 and 2006?
B. Of all the pollutants, which decreased the most between 2001 and 2006?
C. Of all the pollutants, which decreased the least between 2001 and 2006?
D. Did EBE increase in Madrid between 2001 and 2006?

Assign your answer to an object called answer3.1. Make sure your answer is an uppercase letter surrounded by quotation marks (e.g., "F").

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer3.1 <- "B"
### END SOLUTION
test_3.1()
```

Question 3.2 {points: 1}

To begin working with this data, read the file madrid_pollution.csv. Note, this file (just like the avocado and sea surface data sets) is found in the worksheet_wrangling directory.

Assign your answer to an object called madrid.

```r
### BEGIN SOLUTION
madrid <- read_tsv("data/madrid_pollution.csv")
### END SOLUTION
madrid
```

Despite the .csv extension, previewing this file shows its values are tab-separated, which is why the solution uses read_tsv() rather than read_csv().

```r
test_3.2()
```

Question 3.3 {points: 1}

Now that the data is loaded in R, create a scatter plot that compares ethylbenzene (EBE) values against the date they were recorded. This graph will showcase the concentration of ethylbenzene in Madrid over time. As usual, label your axes:

x = Date
y = Ethylbenzene (μg/m³)

Assign your answer to an object called EBE_pollution.

```r
options(repr.plot.width = 13, repr.plot.height = 7)

### BEGIN SOLUTION
EBE_pollution <- madrid |>
  ggplot(aes(x = date, y = EBE)) +
  geom_point(alpha = 0.15) +
  xlab("Date") +
  ylab("Ethylbenzene (μg/m³)") +
  theme(text = element_text(size = 20))
### END SOLUTION
EBE_pollution  # Are levels increasing or decreasing?
```

```r
test_3.3()
```

We can see from this plot that, over time, there are fewer and fewer high (> 25 μg/m³) EBE values.

Question 3.4 {points: 1}

The question above asks you to write code that visualizes all EBE recordings, which are taken every single hour of every day. Consequently the graph consists of many points and appears densely plotted. In this question, we are going to clean up the graph and focus on the maximum EBE reading from each month. To further investigate whether this trend is changing over time, we will use group_by and summarize to create a new data set.

Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail(). Assign your answer to an object called madrid_pollution.

```r
# ... <- ... |>
#   group_by(year, ...) |>
#   ...(max_ebe = max(EBE, na.rm = TRUE))

### BEGIN SOLUTION
madrid_pollution <- madrid |>
  group_by(year, mnth) |>
  summarize(max_ebe = max(EBE, na.rm = TRUE))
### END SOLUTION
madrid_pollution
```

```r
test_3.4()
```
Question 3.5 {points: 1}

Plot the new maximum EBE values versus the month they were recorded, split into side-by-side plots for each year. Again we will use facetting (this time with facet_grid; more on this next week) to plot each year side by side. We will also use the theme function to rotate the axis labels to make them more readable (more on this is coming next week too!).

Assign your answer to an object called madrid_plot. Remember to label your axes.

```r
#... <- ... |>
#  ggplot(aes(x = ..., y = ...)) +
#  geom_point() +
#  xlab(...) +
#  ylab(...) +
#  facet_grid(~ year) +
#  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
#  theme(text = element_text(size = 20))

### BEGIN SOLUTION
madrid_plot <- madrid_pollution |>
  ggplot(aes(x = mnth, y = max_ebe)) +
  geom_point() +
  xlab("Month") +
  ylab("Max Ethylbenzene (μg/m³)") +
  facet_grid(~ year) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(text = element_text(size = 20))
### END SOLUTION
madrid_plot
```

```r
test_3.5()
```

Question 3.6 {points: 1}

Now we want to see which of the pollutants has decreased the most. Therefore, we must repeat what we did in the questions above, but for every pollutant (using the original data set)! This is where purrr's map_* functions can be really helpful! First we will filter for Madrid pollution in 2001. Next we have to drop the columns that should be excluded (such as the date), using select() with a minus sign. Lastly, use the map_df() function to compute the maximum value of every remaining column.

Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail(). Assign your answer to an object called pollution_2001.

```r
# ... <- madrid |>
#   ...(year == 2001) |>
#   select(-..., -year, -mnth) |>
#   map_df(..., na.rm = TRUE)
```
```r
### BEGIN SOLUTION
pollution_2001 <- madrid |>
  filter(year == 2001) |>
  select(-date, -year, -mnth) |>
  map_df(max, na.rm = TRUE)
### END SOLUTION
pollution_2001
```

```r
test_3.6()
```

Question 3.7 {points: 1}

Now repeat what you did for Question 3.6, but filter for 2006 instead. Assign your answer to an object called pollution_2006.

```r
### BEGIN SOLUTION
pollution_2006 <- madrid |>
  filter(year == 2006) |>
  select(-date, -year, -mnth) |>
  map_df(~ max(., na.rm = TRUE))
### END SOLUTION
pollution_2006
```

```r
test_3.7()
```

Question 3.8 {points: 1}

Which pollutant decreased by the greatest magnitude between 2001 and 2006? Given that the two objects you just created, pollution_2001 and pollution_2006, are data frames with the same columns, you should be able to subtract the two objects to find which pollutant decreased by the greatest magnitude between the two years.

Assign your answer to an object called answer3.8. Make sure to write the answer exactly as it is given in the data set. Example: answer3.8 <- "BEN"

```r
# Replace the fail() with your answer.
### BEGIN SOLUTION
pollution_2006 - pollution_2001
answer3.8 <- "TOL"
### END SOLUTION
test_3.8()
```
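Why does subtracting two data frames work here? Arithmetic between data frames with matching dimensions and column names is elementwise, column by column. A toy illustration (not part of the worksheet):

```r
df1 <- data.frame(a = 10, b = 5)
df2 <- data.frame(a = 4,  b = 7)
df1 - df2   # a data frame with a = 6, b = -2
```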
Question 3.9 {points: 1}

Given that there were only 14 columns in the data frame above, you could use your eyes to pick out which pollutant decreased by the greatest magnitude between 2001 and 2006. But what would you do if you had 100 columns? Or 1000 columns? It would take A LONG TIME for your human eyeballs to find the biggest difference. Maybe you could use the min function:

```r
# run this cell
pollution_2006 - pollution_2001
min(pollution_2006 - pollution_2001)
```

This is a step in the right direction, but you get the value and not the column name... What are we to do? Tidy our data! Our data is not in tidy format, so it's difficult to access the values for the variable pollutant because they are stuck as column headers. Let's use pivot_longer to tidy our data and make it look like this:

pollutant  value
BEN        -33.04
CO         -6.91
...        ...

To answer this question, fill in the ... in the cell below. Copy and paste your finished answer and replace the fail(). Assign your answer to an object called pollution_diff and ensure it has the same column names as the table pictured above.

```r
pollution_diff <- pollution_2006 - pollution_2001

#pollution_diff <- ... |>
#  pivot_longer(cols = everything(),
#               names_to = ...,
#               values_to = ...)

### BEGIN SOLUTION
pollution_diff <- pollution_diff |>
  pivot_longer(cols = everything(),
               names_to = "pollutant",
               values_to = "value")
### END SOLUTION
pollution_diff
```

```r
test_3.9()
```

Question 3.10 {points: 1}

Now that you have tidy data, you can use arrange and desc to order the data in descending order. Each element of the value column corresponds to the change in a pollutant between 2001 and 2006, so the largest decrease should be the most negative entry, i.e., the last row in the resulting data frame. Therefore, we can
take the sorted data frame and pipe it to tail (with the argument n = 1) to return only its last row. (The function tail is just like head, except it returns the last rows of the data frame instead of the first rows.)

To answer this question, fill in the ... in the cell below. Copy and paste your finished answer and replace the fail(). Assign your answer to an object called max_pollution_diff.

```r
#... <- ... |>
#  arrange(desc(...)) |>
#  tail(n = 1)

### BEGIN SOLUTION
max_pollution_diff <- pollution_diff |>
  arrange(desc(value)) |>
  tail(n = 1)
### END SOLUTION
max_pollution_diff
```

```r
test_3.10()
```

At the end of this data wrangling worksheet, we'll leave you with a couple of quotes to ponder:

"Happy families are all alike; every unhappy family is unhappy in its own way." -- Leo Tolstoy

"Tidy datasets are all alike, but every messy dataset is messy in its own way." -- Hadley Wickham

source: Tidy data chapter from R for Data Science by Garrett Grolemund & Hadley Wickham

```r
source("cleanup.R")
```