tutorial_reading
pdf
keyboard_arrow_up
School
University of British Columbia *
*We aren’t endorsed by this school
Course
DSCI100
Subject
Statistics
Date
Feb 20, 2024
Type
Pages
17
Uploaded by CountKuduMaster478
Tutorial 2: Introduction to Reading Data
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
define the following:
absolute file path
relative file path
url
read data into R using a relative path and a url
compare and contrast the following functions:
read_csv
read_tsv
read_csv2
read_delim
read_excel
match the following tidyverse
read_*
function arguments to their
descriptions:
file
delim
col_names
skip
choose the appropriate tidyverse
read_*
function and function
arguments to load a given plain text tabular data set into R
use readxl
library's read_excel
function and arguments to load a sheet
from an excel file into R
connect to a database using the DBI
library's dbConnect
function
list the tables in a database using the DBI
library's dbListTables function
create a reference to a database table that is queriable using the tbl
from
the dbplyr
library
retrieve data from a database query and bring it into R using the collect
function from the dbplyr
library
use write_csv
to save a data frame to a csv file
optional:
scrape data from the web
read/scrape data from an internet URL using the rvest
html_nodes
and html_text
functions
compare downloading tabular data from a plain text file (e.g. *.csv
)
from the web versus scraping data from a .html
file
Any place you see ...
, you must fill in the function, variable, or data to complete
the code. Replace fail()
with your completed code and run the cell!
This worksheet covers parts of the Reading chapter of the online textbook. You
should read this chapter before attempting the worksheet.
### Run this cell before continuing. library
(
tidyverse
)
library
(
repr
)
library
(
rvest
)
library
(
stringr
)
library
(
janitor
)
options
(
repr.matrix.max.rows =
6
)
source
(
"tests.R"
)
source
(
"cleanup.R"
)
1. Happiness Report
As you might remember from worksheet_reading
, we practised loading data
from the Sustainable Development Solutions Network's
World Happiness Report
.
That data was the output of their analysis that calculated each country's
happiness score and how much each variable contributed to it. In this tutorial, we
are going to look at the data at an earlier stage of the study - the
aggregated/averaged values (per country and year) for many different social and
health aspects that the researchers anticipated might contribute to happiness
(Table2.1 from this Excel spreadsheet
).
The goal for today is to produce a plot of 2017's positive affect scores against
healthy life expectancy at birth, with healthy life expectancy at birth on the x-axis
and positive affect on the y-axis. For this study, positive affect was defined as the
average of three positive affect measures: happiness, laughter and enjoyment. We
would also like to convert the positive affect score
from a scale of 0 - 1 to a scale
from 0 - 10.
1. use filter
to subset the rows where the year is equal to 2017
2. use mutate
to convert the "Positive affect" score from a scale of 0 - 1 to a
scale from 0 - 10
3. use select
to choose the "Healthy life expectancy at birth" column and the
scaled "Positive affect" column
4. use ggplot
to create our plot of "Healthy life expectancy at birth" (x - axis)
and scaled "Positive affect" (y - axis)
Tips for success:
Try going through all of the steps on your own, but don't forget
to discuss with others (classmates, TAs, or an instructor) if you get stuck. If
something is wrong and you can't spot the issue, be sure to read the error
message carefully
. Since there are a lot of steps involved in working with data
and modifying it, feel free to look back at worksheet_reading
.
Question 1.1
Multiple Choice:
{points: 1}
What is the maximum value for the "Positive affect" score (in the original data file
that you read into R)?
In [ ]:
A. 100
B. 10
C. 1
D. 0.1
E. 5
Assign your answer to an object called answer1.1
. Make sure your answer is an
uppercase letter and is surrounded by quotation marks (e.g. "F"
).
# Replace the fail() with your answer. ### BEGIN SOLUTION
answer1.1 <-
"C"
### END SOLUTION
test_1.1
()
Question 1.2
Multiple Choice:
{points: 1}
Which column's values will be used to filter the data?
A. countries
B. generosity
C. positive affect
D. year
Assign your answer to an object called answer1.2
. Make sure your answer is an
uppercase letter and is surrounded by quotation marks (e.g. "F"
).
# Replace the fail() with your answer. ### BEGIN SOLUTION
answer1.2 <-
"D"
### END SOLUTION
test_1.2
()
Question 1.3.0
{points: 1}
Use the appropriate read_*
function to read in the
WHR2018Chapter2OnlineData
(look in the tutorial_02
directory to ensure you
use the correct relative path to read it in).
Assign the data frame to an object called happy_df_csv
.
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
### BEGIN SOLUTION
happy_df_csv <-
read_csv
(
file =
"data/WHR2018Chapter2OnlineData.csv"
) ### END SOLUTION
happy_df_csv
test_1.3.0
()
Question 1.3.1
{points: 1}
Above, you loaded the data from a file we already downloaded and converted to
a .csv
for you. But you can also use the readxl
R package to directly load in
Excel files into R. Given that the data we loaded above
(
WHR2018Chapter2OnlineData.csv
) was originally sourced from an Excel file on
the web, let's now directly read that Excel file into R using the read_excel
function from that package. This Excel file has multiple sheets, the data we want is
on the first one.
Note:
read_excel
does not support putting a URL as the file path
argument. So we need to first download the file and write it to disk
using R's download.file
function, and then we can read that saved
Excel file into R using read_excel
.
To answer the question, fill in the blanks in the code below. If you are unsure, try
reading the documentation for the new functions and ask others for help!
Assign the data into an object called happy_df
.
library
(
readxl
)
url <-
"https://s3.amazonaws.com/happiness-report/2018/WHR2018Chapter2OnlineData
# download.file(..., destfile = "data/WHR2018Chapter2OnlineData.xls")
#... <- read_excel(path = ..., sheet = ...)
### BEGIN SOLUTION
download.file
(
url
, destfile =
"data/WHR2018Chapter2OnlineData.xls"
)
happy_df <-
read_excel
(
path =
"data/WHR2018Chapter2OnlineData.xls"
, sheet =
1
)
### END SOLUTION
happy_df
test_1.3.1
()
Look at the column names - they contain spaces! This is not a best practice and
will make it difficult to use our tidyverse functions... Run the cell below to use the
clean_names
function from the janitor
library that will replace all the spaces
with an underscore (
_
) and make all characters lowercase so that our column
names are in a standard format.
### Run this cell before continuing. happy_df <-
happy_df |>
clean_names
()
happy_df
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Question 1.3.2
{points: 1}
Using the scaffolding given in the cell below, filter
, mutate
, and select
the
happy_df
data frame as needed to get it ready to create our desired scatterplot.
Recall that we wanted to rescale the "Positive affect" scores so that they fall in the
range 0-10 instead of 0-1. Call the new, re-scaled column
positive_affect_scaled
.
Assign the data frame containing only the columns we need to create our plot to
an object called reduced_happy_df
.
# happy_step1 <- ...(happy_df, year == ...)
# happy_step2 <- mutate(happy_step1, positive_affect_scaled = ...)
# reduced_happy_df <- ...(happy_step2, ..., ...)
### BEGIN SOLUTION
happy_step1 <-
filter
(
happy_df
, year ==
2017
)
happy_step2 <-
mutate
(
happy_step1
, positive_affect_scaled =
positive_affect *
10
reduced_happy_df <-
select
(
happy_step2
, healthy_life_expectancy_at_birth
, positi
### END SOLUTION reduced_happy_df
test_1.3.2
()
Question 1.4
{points: 1}
Using the modified data set, reduced_happy_df
, generate the scatterplot
described above and make sure to label the axes in proper written English.
Assign your plot to an object called happy_plot
.
options
(
repr.plot.width =
8
, repr.plot.height =
8
)
#... <- ggplot(reduced_happy_df, ...(x = ..., y = ...)) + # geom_...() + # ...("...") + # ylab("Positive affect score (out of ...)")
### BEGIN SOLUTION happy_plot <-
ggplot
(
reduced_happy_df
, aes
(
x =
healthy_life_expectancy_at_birth
,
geom_point
() +
xlab
(
"Healthy life expectancy at birth (years)"
) +
ylab
(
"Positive affect score (out of 10)"
) +
theme
(
text =
element_text
(
size =
20
))
### END SOLUTION happy_plot
test_1.4
()
Question 1.5
{points: 3}
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In one sentence or two, describe what you see in the scatterplot above. Does
there appear to be a relationship between life expectancy at birth and postive
affect? If so, describe it.
BEGIN SOLUTION
In the scatterplot above, we observe no relationship between life expectancy at
birth and postive affect.
END SOLUTION
Question 1.6
{points: 3}
Choose any variable (column) in the data set happy_df
other than
positive_affect
to plot against healthy life expectancy at birth. You should
NOT scale whichever variable you choose.
Ensure that healthy life expectancy at
birth is on the x-axis and that you give your axes human-readable labels.
Assign your plot to an object called happy_plot_2
.
### BEGIN SOLUTION
options
(
repr.plot.width =
8
, repr.plot.height =
8
)
happy_plot_2 <-
ggplot
(
happy_df
, aes
(
x =
healthy_life_expectancy_at_birth
, y =
p
geom_point
() +
xlab
(
"Healthy life expectancy at birth (years)"
) +
ylab
(
"Perceptions of corruption \n (average of binary answers to two questio
theme
(
text =
element_text
(
size =
20
))
happy_plot_2
### END SOLUTION
# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding # when you have the correct answer.
test_that
(
"Did not create an object named happy_plot_2"
, {
expect_true
(
exists
(
"happy_plot_2"
)) })
### BEGIN HIDDEN TESTS
properties <-
c
(
happy_plot_2
$
layers
[[
1
]]
$
mapping
, happy_plot_2
$
mapping
)
possible_columns <-
select
(
happy_df
, -
healthy_life_expectancy_at_birth
, -
positiv
test_that
(
"healthy_life_expectancy_at_birth should be on the x-axis."
, {
expect_true
(
"healthy_life_expectancy_at_birth" ==
rlang
::
get_expr
(
properties
$
x
})
test_that
(
"healthy_life_expectancy_at_birth should not be on the y-axis."
, {
expect_true
(
"healthy_life_expectancy_at_birth" !=
rlang
::
get_expr
(
properties
$
y
})
test_that
(
"positive_affect should not be on the y-axis."
, {
expect_true
(
"positive_affect" !=
rlang
::
get_expr
(
properties
$
y
))
})
test_that
(
"x-axis should be some other column from happ_df"
, {
expect_true
(
as.character
(
rlang
::
get_expr
(
properties
$
y
)) %in%
possible_columns
)
})
test_that
(
'happy_plot_2 should be a scatter plot.'
, {
In [ ]:
In [ ]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
expect_true
(
"GeomPoint" %in%
c
(
class
(
happy_plot_2
$
layers
[[
1
]]
$
geom
)))
})
test_that
(
'Labels on the axes should be descriptive and human readable.'
, {
expect_false
(
happy_plot_2
$
labels
$
y %in%
possible_columns
[
!
possible_columns %in
expect_false
(
happy_plot_2
$
labels
$
x ==
'healthy_life_expectancy_at_birth'
)
})
print
(
"Success!"
)
### END HIDDEN TESTS
Question 1.7
{points: 3}
In a sentence or two, describe what you see in the scatterplot above. Does there
appear to be a relationship between healthy life expectancy at birth and the other
variable you plotted? If so, describe it.
BEGIN SOLUTION
Any reasonable answer accepted. For example, in the plot in the answer key, there
is a weak negative relationship between life expectancy at birth and perception of
corruption.
END SOLUTION
2. Whistler Snow
Skiing and snowboarding are huge in British Columbia. Some of the best slopes
for snow sports are quite close. In fact, the famous mountain-bearing city of
Whistler is just two hours north of Vancouver. With cold weather and plenty of
snowfall, Whistler is an ideal destination for winter sports fanatics.
One thing skiers and snowboarders want is fresh snow! When are they most likely
to find this? In the data
directory, we have two-year-long data sets from
Environment Canada from the Whistler Roundhouse Station (on Whistler
mountain). This weather station is located 1,835 m above sea level.
To answer the question of "When are skiers and snowboarders most likely to find
fresh snow at Whistler?" you will create a line plot with the date is on the x-axis
and the total snow per day in centimetres (the column named Total Snow cm
in
the data file) on the y-axis. Given that we have data for two years (2017 & 2018),
we will create one plot for each year to see if there is a trend we can observe
across the two years.
Question 2.1
Multiple Choice:
{points: 1}
What are we going to plot on the y-axis?
A. total precipitation per day in centimetres
B. total snow on the ground in centimetres
C. total snow per day in centimetres
D. total rain per day in centimetres
Assign your answer to an object called answer2.1
. Make sure your answer is an
uppercase letter and is surrounded by quotation marks (e.g. "F"
).
# Replace the fail() with your answer. ### BEGIN SOLUTION
answer2.1 <-
"C"
### END SOLUTION
test_2.1
()
Question 2.2.0
{points: 1}
Read in the file named eng-daily-01012018-12312018.csv
from the data
directory. Make sure you preview the file to choose the correct read_*
function and argument values to get the data into R.
Assign your data frame to an object called whistler_2018
.
Note: You'll see a lot of entries of the form NA
. This is the symbol R uses to
denote missing data. Interestingly, you can do math and make comparisons with
NA
: for example,
NA + 1 = NA
, NA * 3 = NA
, NA > 3 = NA
. Most operations
on NA
return NA
. This may seem a bit weird, but it makes things much simpler in
R since it removes the need to write any special code to handle missing data!
### BEGIN SOLUTION whistler_2018 <-
read_csv
(
file =
"data/eng-daily-01012018-12312018.csv"
, skip =
### END SOLUTION whistler_2018
test_2.2.0
()
Question 2.2.1
{points: 1}
Looking at the column names of the whistler_2018
data frame, you can see we
have white space in our column names again. Use clean_names
to remove the
whitespace to make it easier to use our tidyverse
functions. Store the result
with the same name, whistler_2018
.
### BEGIN SOLUTION
whistler_2018 <-
whistler_2018 |>
clean_names
()
### END SOLUTION
whistler_2018
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
test_2.2.1
()
Question 2.3
{points: 1}
Create a line plot with the date on the x-axis and the total snow per day (in cm)
on the y-axis by filling in the ...
in the code below. Ensure you give your axes
human-readable labels.
Assign your plot to an object called whistler_2018_plot
.
options
(
repr.plot.width =
12
, repr.plot.height =
5
)
# ... <- ggplot(..., aes(x = ..., y = ...)) + # geom_line() +
# xlab(...) +
# ylab(...) +
# scale_x_date(date_breaks = "1 month") + # labels every month
# theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotates x axis ### BEGIN SOLUTION
whistler_2018_plot <-
ggplot
(
whistler_2018
, aes
(
x =
date_time
, y =
total_snow_c
geom_line
() +
xlab
(
"Date"
) +
ylab
(
"Total snow per day (cm)"
) +
scale_x_date
(
date_breaks =
"1 month"
) +
theme
(
axis.text.x =
element_text
(
angle =
90
, hjust =
1
)) +
theme
(
text =
element_text
(
size =
20
))
### END SOLUTION
whistler_2018_plot
test_2.3
()
Question 2.4
{points: 3}
Looking at the line plot above, for 2018, of the months when it snowed, which 2
months had the most
fresh snow?
BEGIN SOLUTION
In 2018, December and January had the most fresh snow.
END SOLUTION
Question 2.5
{points: 3}
Repeat the data loading and plot creation using the file eng-daily-01012017-
12312017.csv
located in the data
directory to visualize the same data for the
year 2017.
Assign your plot to an object called whistler_2017_plot
.
In [ ]:
In [ ]:
In [ ]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
# whistler_2017 <- ...
# whistler_2017 <- whistler_2017 |> ...()
# ... <- ggplot(..., aes(x = ..., y = ...)) + # geom_line() + # xlab("...") + # ylab("...") +
# scale_x_date(date_breaks = "1 month") +
# theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # theme(text = element_text(size = 20))
### BEGIN SOLUTION whistler_2017 <-
read_csv
(
file =
"data/eng-daily-01012017-12312017.csv"
, skip =
whistler_2017 <-
whistler_2017 |>
clean_names
()
whistler_2017_plot <-
ggplot
(
whistler_2017
, aes
(
x =
date_time
, y =
total_snow_cm
geom_line
() +
xlab
(
"Date"
) +
ylab
(
"Total snow per day (cm)"
) +
scale_x_date
(
date_breaks =
"1 month"
) +
theme
(
axis.text.x =
element_text
(
angle =
90
, hjust =
1
)) +
theme
(
text =
element_text
(
size =
20
))
whistler_2017_plot
### END SOLUTION
# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding # when you have the correct answer.
test_that
(
"Did not create an object named whistler_2017_plot"
, {
expect_true
(
exists
(
"whistler_2017_plot"
)) })
### BEGIN HIDDEN TESTS
properties <-
c
(
whistler_2017_plot
$
layers
[[
1
]]
$
mapping
, whistler_2017_plot
$
mappi
test_that
(
"date_time should be on the x-axis."
, {
expect_true
(
"date_time" ==
rlang
::
get_expr
(
properties
$
x
))
})
test_that
(
"total_snow_cm should be on the y-axis."
, {
expect_true
(
"total_snow_cm" ==
rlang
::
get_expr
(
properties
$
y
))
})
test_that
(
"whistler_2017_plot should be a line plot."
, {
expect_true
(
"GeomLine" %in%
c
(
class
(
whistler_2017_plot
$
layers
[[
1
]]
$
geom
)))
})
test_that
(
"Labels on the axes should be descriptive and human readable."
, {
expect_false
(
whistler_2017_plot
$
labels
$
y ==
"total_snow_cm"
)
expect_false
(
whistler_2017_plot
$
labels
$
x ==
"date_time"
)
})
print
(
"Success!"
)
### END HIDDEN TESTS
Question 2.6
{points: 3}
Looking at the line plot above, for 2017, of the months when it snowed, which 2
months had the most
fresh snow?
BEGIN SOLUTION
In [ ]:
In [ ]:
In 2017, March and November had the most fresh snow.
END SOLUTION
Question 2.7
{points: 3}
Are the months with the most fresh snow the same in 2017 as they were in 2018?
Hint:
you might want to add a code cell where you plot the two plots right after
each other so you can easily compare them in one screen view.
You can combine two plots, one atop the other, by using the plot_grid
function
from the cowplot
package:
library(cowplot)
plot_grid(plot1, plot2, ncol = 1)
Is there any advantage of looking at 2 years worth of data? Why or why not?
BEGIN SOLUTION
The months with the most fresh snow are different across the two years (March
and November in 2017 vs. December and January in 2018).
Yes, there is an advantage of looking at 2 years worth of data. Doing this tells us
that we cannot use the month(s) that had the most fresh snow in the previous
year alone to base our predictions off for what month(s) will have the most fresh
snow in the upcoming year.
END SOLUTION
3. Reading from a Database
In worksheet_reading
, you'll recall that we opened a database stored in a .db
file. This involved a lot more effort than just opening a .csv
, .tsv
, or any of the
other plaintext / Excel formats. It was a bit of a pain to use a database in that
setting since we had to use dbplyr
to translate tidyverse
-like commands
(
filter
, select
, etc.) into SQL commands that the database understands. We
didn't run into this problem in the worksheet, but not all
tidyverse
commands
can currently be translated with SQLite databases. For example, with an SQLite
database, we can compute a mean, but can't easily compute a median.
Why should we bother with databases at all then?
Databases become really useful in a large-scale setting:
they enable storing large datasets across multiple computers with automatic
redundancy and backups
they enable multiple users to access them simultaneously and remotely
without conflicts and errors
they provide mechanisms for ensuring data integrity and validating input
they provide security to keep data safe
For example: there are around 4 billion Google searches conducted daily as of
2019. Can you imagine if Google stored all of the data from those queries in a
single .csv
file!? Chaos would ensue.
To reap the real benefits of databases, we'll need to move to a more fully-
powered one: PostgreSQL
. We'll begin by loading the DBI
and dbplyr
packages
that R uses to talk to databases, as well as the RPostgres
package that provides
the interface between these packages and PostgreSQL databases (note the
similarity to the RSQLite
package from worksheet_02
).
### Run this cell before continuing. library
(
dbplyr
)
library
(
DBI
)
library
(
RPostgres
)
library
(
lubridate
) # This package is used to convert different time/date formats
Investigating Trends in Crowdfunding
Kickstarter is an online crowd-funding site where people can post projects they
want to do, but don't have the financial resources required to fund the project on
their own. Other users of Kickstarter can pledge money to the project (also called
"backing" a project) to help the project become a reality. To persuade people to
back a project, the project owner usually offers rewards to the "backers" for their
help with funding, which they receive once funding reaches a particular amount.
In this section, we'll investigate how the amount of funding successful projects get
has changed over time. We consider a project to be successful if the amount of
funds pledged exceeded the goal.
Question 3.0
{points: 1}
Databases are often stored remotely
(i.e., not on your computer or on this
JupyterHub). Your first task is to load the Kickstarter data from a PostgreSQL
database stored remotely on the UBC statistics network.
URL: "dsci-100-student.stat.ubc.ca"
Port: 5432
Username: "dsci100"
Password: "dsci100"
Database Name: "kickstarter"
In [ ]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Table Name: "projects"
We've provided the code to do this below. Replace each ...
with one of the 5
above items.
Note 1: Due to the UBC firewall, to get this to work you'll need to be connected to
the UBC network or use the UBC VPN. For instructions on how to connect to
UBC's VPN service, see this webpage on UBC's IT website
.
Note 2: As this database will be used by the entire class, you will only have read
access (no write permissions).
Assign the resulting database connection object to
connection
and the project
table data to
project_data
.
# ... <- dbConnect(RPostgres::Postgres(), dbname = ...,
# host = ..., port = 5432,
# user = ..., password = ...)
# ... <- tbl(connection, ...)
### BEGIN SOLUTION
connection <-
dbConnect
(
RPostgres
::
Postgres
(), dbname =
"kickstarter"
,
host =
"dsci-100-student.stat.ubc.ca"
, port =
5432
,
user =
"dsci100"
, password =
"dsci100"
)
project_data <-
tbl
(
connection
, "projects"
)
### END SOLUTION
test_3.0
()
We can now use the colnames
function to see what columns are available in the
project_data
table.
colnames
(
project_data
)
Question 3.1
{points: 1}
If we want to plot compare pledged and goal amounts of funding over time for
successful projects in the United States, which columns should we select
from
the table?
A. id
, slug
, pledged
B. pledged
, goal
, deadline
, country
C. pledged
, usd_pledged
, location_id
D. currency
, state
, country
, goal
Assign your answer to an object called answer3.1
. Make sure your answer is an
uppercase letter and is surrounded by quotation marks (e.g. "F"
).
In [ ]:
In [ ]:
In [ ]:
### BEGIN SOLUTION
answer3.1 <-
'B'
### END SOLUTION
test_3.1
()
Question 3.2
{points: 1}
Now we'll visualize the data. In order to do this, we need to take the correct
subset of data from the table and use ggplot
to plot the result. Note that we
make the scatter plot slightly transparent (using alpha = 0.01
in the code
below) because there is so much data that it would otherwise be hard to see
anything (
overplotting
).
In the below cell, you'll see some lines of code (currently commented out with #
characters). Remove the comments and rearrange these lines of code
to plot
the ratio of pledged and goal funding as a function of project deadline date for all
successful (where pledged funding is greater than goal funding) projects in the
United States in the dataset. You don't need to add any new code, just reorder the
lines we have given you.
Note: there is a lot of data to plot here, so give it a moment to display!
Hint: you'll want to put all the dataframe manipulation functions first, and then
the plotting functions afterward. Also note that some lines have a +
at the end,
meaning they're in the middle of the plotting code! To not be overwhelmed trying
to solve all the code at once, focus on one step at a time and uncomment only
the code needed to run that one step. When that step works, move on to the
next.
In [ ]:
In [ ]:
# geom_point(alpha = 0.01) +
# funding_over_time_plot <- ggplot(prj, aes(x = as_datetime(deadline), y = pledg
# ylab('Pledged Funding / Goal Funding')
# prj <- filter(prj_unfiltered, pledged > goal & country == "US")
# scale_y_continuous(trans = 'log10', breaks = c(1, 10, 100, 1000)) +
# xlab('Date') +
# theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank
# prj_unfiltered <- select(project_data, 'deadline', 'pledged', 'goal', 'country
### BEGIN SOLUTION
prj_unfiltered <-
select
(
project_data
, 'deadline'
, 'pledged'
, 'goal'
, 'country'
)
prj <-
filter
(
prj_unfiltered
, pledged >
goal &
country ==
"US"
)
funding_over_time_plot <-
ggplot
(
prj
, aes
(
x =
as_datetime
(
deadline
), y =
pledged
geom_point
(
alpha =
0.01
) +
theme
(
panel.grid.major =
element_blank
(), panel.grid.minor =
element_blank
()
scale_y_continuous
(
trans =
'log10'
, breaks =
c
(
1
, 10
, 100
, 1000
)) +
xlab
(
'Date'
) +
ylab
(
'Pledged Funding / Goal Funding'
) +
theme
(
text =
element_text
(
size =
20
))
funding_over_time_plot
### END SOLUTION
test_3.2
()
Question 3.3
{points: 3}
Is there a relationship between the ratio of pledged/goal funding and time? If so,
describe it.
Additionally, mention a pattern in the data or a characteristic of it that you may
not have expected in advance.
BEGIN SOLUTION
Any reasonable description here. For example, there is a slight increase in the bulk
of funding ratios over time, but the high end (i.e. high quantiles) increase
dramatically. There is a weird periodic blip happening each year.
END SOLUTION
Question 3.4
{points: 1}
Finally, we'll save the project data to a local file in the data/
folder called
project_data.csv
. Recall that we don't want to try to download and save the
entire dataset
(way too much data!) from the database, but only the tbl
object
named prj
. So you will need to use the collect
function followed by the
appropriate write_*
function.
Assign the output of collect to an object called project_df
In [ ]:
In [ ]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
### BEGIN SOLUTION
project_df <-
collect
(
prj
)
write_csv
(
project_df
, 'data/project_data.csv'
)
### END SOLUTION
test_3.4
()
4 (Optional). Reading Data from the Internet
Question 4.0
{points: 0}
More practice scraping! To keep ourselves out of legal hot water, we will get more
practice scraping data using a website that was created for that purpose:
http://books.toscrape.com/
Your task here is to scrape the prices of the science fiction novels on this page
and determine the maximum, minimum and average price of science fiction
novels at this bookstore. Tidy up and nicely present your results by creating a data
frame called sci_fi_stats
that has 2 columns, one called stats
that contains
the words max
, min
and mean
and once called value
that contains the
calculated value for each of these.
The functions for maximum, minimum and average in R are listed in the table
below:
Calculation to perform
Function in R
maximum
max
minimum
min
average
mean
Some other helpful hints:
If you end up scraping some characters other than numbers you will have to
use str_replace_all
from the stringr
library to remove them (similar to
what we did with the commas in worksheet_02).
Use as.numeric
to convert your character type numbers to numeric type
numbers before you pass them into the max
, min
and mean
functions.
If you have NA
values in your objects that you need to pass into the max
,
min
and mean
functions, you will need to set the na.rm
argument in these
functions to TRUE
.
use the function c
to create the vectors that will go in your data frame, for
example, to create a vector with the values 10, 16 and 13 named ages, we
would type: ages <- c(10, 16, 13)
.
use the function tibble
to create the data frame from your vectors.
In [ ]:
In [ ]:
### BEGIN SOLUTION
url <-
"http://books.toscrape.com/catalogue/category/books/science-fiction_16/in
sci_fi_page <-
read_html
(
url
)
sci_fi_price <-
sci_fi_page |>
html_nodes
(
".price_color"
) |>
html_text
() |>
str_replace_all
(
pattern =
"£"
, replacement =
""
) |>
as.numeric
()
sci_fi_max <-
max
(
sci_fi_price
, na.rm =
TRUE
)
sci_fi_min <-
min
(
sci_fi_price
, na.rm =
TRUE
)
sci_fi_mean <-
mean
(
sci_fi_price
, na.rm =
TRUE
)
sci_fi_stats <-
tibble
(
stats =
c
(
"max"
, "min"
, "mean"
),
value =
c
(
sci_fi_max
, sci_fi_min
, sci_fi_mean
))
### END SOLUTION
sci_fi_stats
Question 4.1
{points: 0}
In worksheet_reading
you had practice scraping data from the web. Now that
you have the skills, should you scrape that website you have been dreaming of
harvesting data from? Maybe, maybe not... You should check the website's Terms
of Service first and consider the application you have planned for the data after
you scrape it.
List 3 websites you might be interested in scraping data from (for fun, profit, or
research/education). List their URLs as part of your answer. For each website,
search for their Terms of Service page. Take note if such a page exists, and if it
does, provide the link to it and tell us whether or not they allow web scraping of
their website.
You can list them in this cell! Double click to edit.
Bonus/optional additional readings on legalities of web
scraping:
Here are two recent news stories about web scraping and their legal implications:
D.C. Court: Accessing Public Information is Not a Computer Crime
Dear Canada: Accessing Publicly Available Information on the Internet Is Not
a Crime
source
(
"cleanup.R"
)
In [ ]:
In [ ]:
Related Questions
Give 2 characteristics that indicate a linear model may be appropriate to model a data set
arrow_forward
Block: _
Date.
Five year-old Susie's parents are concerned that she seems short for her age. Their doctor has the followng rer
of Susie' height:
Age (months): 20
30
45
50
57
60
Height (cm):
80
86
87
90
91
94
(a) Make a scatterplot of these data:
arrow_forward
Plz help asap 40 need crit value as well
arrow_forward
Draw a scatter diagram with square feet of living space as the independent variable and selling price as the dependent variable and describe variable and describe the relationship between the size of a house and the selling price.
arrow_forward
Model the data witha linear function based on the data from 1995 and 1997. Let the inputvaraible be the number of years after 1995 and the output varaible represent the number of twin births in thousands. Thus two points would be (0, 96.7),(2, 104.1)
arrow_forward
Note: For discrete data set, use this format {#,#,#...#}
for continuous data set, write answer using inequality notation
arrow_forward
Define the terms dependent and independent variable and explain their relationship.
arrow_forward
Use the scatterplot of Vehicle Registrations below to answer the questions
Vehicle Registrations in the United States, 1925-
2011 Vehicles millions
300
y = 3.0161x - 5819.5
R² = 0.9695
250
200
150
100
50
1920
1940
1960
1980
2000
2020
-50
Year
What is the dependent variable and what does it represent?
number of vehicle registrations
year
y =3.0161 x -5819.5
R^2 = 0.9695
not enough information to determine
Registrations (in millions)
arrow_forward
Note: For discrete data set, use this format {#,#,#...#}
for continuous data set, write answer using inequality notation
arrow_forward
Tech Company Employees Construct a vertical and horizontal bar graph for the number of employees (in thousands) of a sample of the largest tech
companies as of 2014.
Company
Employees
Xerox
379
Microsoft
310
Apple
146
Samsung
121
Hewlett Packard
103
Send data to Excel
Part: 0/ 2
Part 1 of 2
Construct a vertical bar graph for the data.
Tech Company Employees
400
320
240
160
80
Apple Samsung Hewlett
Packard
Drag bar to change value.
Numb er of employees
(in thousands)
arrow_forward
Which of the following data would best be represented by a pie chart?
The number of students who have taken MA321 in the last 10 semesters
The percentage of QCC students in each major
The percentage of QCC students currently taking MA321
The number of courses each QCC student is taking
arrow_forward
Perform the processing of dynamic rows, process the prediction
arrow_forward
Move data from Microsoft Excel output window to MS Powerpoint using the copy/paste function,
True or false
arrow_forward
View Options
Lecturelntros - Goog X
O Lumen OHM- MAT120.7681FA x
Post Attendee- Zoom
6 Faculty and Staff Portal My Le
Nd3solWKTTX44 hz6umEOSRdg6TCnthhrGhE/edit
Help
Last edit was 9 minutes ago
13.5
E===1ヨ
E E X
Let's put this rules in practice:
a. In a survey of hospital nursing from a Washington state hospital it was found that 92% of the nurses
were female, that 74% had received their nursing degree in Washington state, and 70% were both
female and had received their nursing degree in Washington state. (Smith, 2007). If a nurse is chosen
at random from those sampled, find the probability of the following events.
wy did not receive their nursing degree in Washington state/ /
• Getting a nurse
• Getting a nurse who is a female and received their nursing degree in Washington state.
• Getting a nurse who is a female or received their nursing degree in Washington state.
arrow_forward
regression is done on data that represent ordered pairs true or false
arrow_forward
Determine whether the scatter diagram indicates that a linear relation may exist between the two variables. If the relation is linear,
determine whether it indicates a positive or negative association between the variables.
Use this information to answer the following.
arrow_forward
Find a real data on the internet that can be modeled by a linear function. Find the linear modeling function. Is your model exact or approximate?
arrow_forward
Based on a grade 11 math student, answer the following question:
Are the following coordinates part of a function?
(1,2), (3, 5), (2, 1), (1, 3), (5, 5)
Question options:
True
False
arrow_forward
Continue monitoring the process. A second ten days of data have been collected, see table labeled “2nd 10 Days of Monitoring Reservation Processing Time” in the Data File.
Develop Xbar and R charts for the 2nd 10 days of monitoring. Plot the data for the 2nd 10 days on the Xbar and R charts.
Is the reservation process for the 2nd 10 days of monitoring in control? If the control chart indicates an out-of-control process, note which days, the pattern, and whether it is the Xbar or R chart.
Based on the X-bar and R Charts that you developed for the 2nd 10 days of data, is the process in control?
Group of answer choices
No. The X-bar and R Charts are both out of control.
No. The X-bar Chart is in control, but the R Chart is out of control.
No. The R Chart is in control, but the X-bar Chart is out of control.
Yes. The X-bar and R Charts are both in control.
arrow_forward
SEE MORE QUESTIONS
Recommended textbooks for you

Glencoe Algebra 1, Student Edition, 9780079039897...
Algebra
ISBN:9780079039897
Author:Carter
Publisher:McGraw Hill

Holt Mcdougal Larson Pre-algebra: Student Edition...
Algebra
ISBN:9780547587776
Author:HOLT MCDOUGAL
Publisher:HOLT MCDOUGAL

Elementary Linear Algebra (MindTap Course List)
Algebra
ISBN:9781305658004
Author:Ron Larson
Publisher:Cengage Learning
Related Questions
- Give 2 characteristics that indicate a linear model may be appropriate to model a data setarrow_forwardBlock: _ Date. Five year-old Susie's parents are concerned that she seems short for her age. Their doctor has the followng rer of Susie' height: Age (months): 20 30 45 50 57 60 Height (cm): 80 86 87 90 91 94 (a) Make a scatterplot of these data:arrow_forwardPlz help asap 40 need crit value as wellarrow_forward
- Draw a scatter diagram with square feet of living space as the independent variable and selling price as the dependent variable and describe variable and describe the relationship between the size of a house and the selling price.arrow_forwardModel the data witha linear function based on the data from 1995 and 1997. Let the inputvaraible be the number of years after 1995 and the output varaible represent the number of twin births in thousands. Thus two points would be (0, 96.7),(2, 104.1)arrow_forwardNote: For discrete data set, use this format {#,#,#...#} for continuous data set, write answer using inequality notationarrow_forward
- Define the terms dependent and independent variable and explain their relationship.arrow_forwardUse the scatterplot of Vehicle Registrations below to answer the questions Vehicle Registrations in the United States, 1925- 2011 Vehicles millions 300 y = 3.0161x - 5819.5 R² = 0.9695 250 200 150 100 50 1920 1940 1960 1980 2000 2020 -50 Year What is the dependent variable and what does it represent? number of vehicle registrations year y =3.0161 x -5819.5 R^2 = 0.9695 not enough information to determine Registrations (in millions)arrow_forwardNote: For discrete data set, use this format {#,#,#...#} for continuous data set, write answer using inequality notationarrow_forward
- Tech Company Employees Construct a vertical and horizontal bar graph for the number of employees (in thousands) of a sample of the largest tech companies as of 2014. Company Employees Xerox 379 Microsoft 310 Apple 146 Samsung 121 Hewlett Packard 103 Send data to Excel Part: 0/ 2 Part 1 of 2 Construct a vertical bar graph for the data. Tech Company Employees 400 320 240 160 80 Apple Samsung Hewlett Packard Drag bar to change value. Numb er of employees (in thousands)arrow_forwardWhich of the following data would best be represented by a pie chart? The number of students who have taken MA321 in the last 10 semesters The percentage of QCC students in each major The percentage of QCC students currently taking MA321 The number of courses each QCC student is takingarrow_forwardPerform the processing of dynamic rows, process the predictionarrow_forward
arrow_back_ios
SEE MORE QUESTIONS
arrow_forward_ios
Recommended textbooks for you
- Glencoe Algebra 1, Student Edition, 9780079039897...AlgebraISBN:9780079039897Author:CarterPublisher:McGraw HillHolt Mcdougal Larson Pre-algebra: Student Edition...AlgebraISBN:9780547587776Author:HOLT MCDOUGALPublisher:HOLT MCDOUGALElementary Linear Algebra (MindTap Course List)AlgebraISBN:9781305658004Author:Ron LarsonPublisher:Cengage Learning

Glencoe Algebra 1, Student Edition, 9780079039897...
Algebra
ISBN:9780079039897
Author:Carter
Publisher:McGraw Hill

Holt Mcdougal Larson Pre-algebra: Student Edition...
Algebra
ISBN:9780547587776
Author:HOLT MCDOUGAL
Publisher:HOLT MCDOUGAL

Elementary Linear Algebra (MindTap Course List)
Algebra
ISBN:9781305658004
Author:Ron Larson
Publisher:Cengage Learning