---
title: 'BAS 240 (I01) Data Structures for Analytics (2023FA) TE #4'
author: "Amaria Frost"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(nycflights13)
alaska_flights <- flights %>%
filter(carrier == "AS")
portland_flights <- flights %>%
filter(dest == "PDX")
View(portland_flights)
btv_sea_flights_fall <- flights %>%
filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >= 10)
View(btv_sea_flights_fall)
btv_sea_flights_fall <- flights %>%
filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
View(btv_sea_flights_fall)
not_BTV_SEA <- flights %>%
filter(!(dest == "BTV" | dest == "SEA"))
View(not_BTV_SEA)
# Incorrect without the parentheses: ! applies only to dest == "BTV"
flights %>% filter(!dest == "BTV" | dest == "SEA")
many_airports <- flights %>%
filter(dest == "SEA" | dest == "SFO" | dest == "PDX" |
dest == "BTV" | dest == "BDL")
many_airports <- flights %>%
filter(dest %in% c("SEA", "SFO", "PDX", "BTV", "BDL"))
View(many_airports)
summary_temp <- weather %>%
summarize(mean = mean(temp), std_dev = sd(temp))
summary_temp
summary_temp <- weather %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_temp
summary_monthly_temp <- weather %>%
group_by(month) %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
diamonds
diamonds %>%
group_by(cut)
diamonds %>%
group_by(cut) %>%
summarize(avg_price = mean(price))
diamonds %>%
group_by(cut) %>%
ungroup()
by_origin <- flights %>%
group_by(origin) %>%
summarize(count = n())
by_origin
by_origin_monthly <- flights %>%
group_by(origin, month) %>%
summarize(count = n())
by_origin_monthly
# Incorrect: the second group_by() replaces the first, so this counts by month only
by_origin_monthly_incorrect <- flights %>%
group_by(origin) %>%
group_by(month) %>%
summarize(count = n())
by_origin_monthly_incorrect
```
3.5 mutate existing variables
```{r}
weather <- weather %>%
mutate(temp_in_C = (temp - 32) / 1.8)
summary_monthly_temp <- weather %>%
group_by(month) %>%
summarize(mean_temp_in_F = mean(temp, na.rm = TRUE),
mean_temp_in_C = mean(temp_in_C, na.rm = TRUE))
summary_monthly_temp
flights <- flights %>%
mutate(gain = dep_delay - arr_delay)
gain_summary <- flights %>%
summarize(
min = min(gain, na.rm = TRUE),
q1 = quantile(gain, 0.25, na.rm = TRUE),
median = quantile(gain, 0.5, na.rm = TRUE),
q3 = quantile(gain, 0.75, na.rm = TRUE),
max = max(gain, na.rm = TRUE),
mean = mean(gain, na.rm = TRUE),
sd = sd(gain, na.rm = TRUE),
missing = sum(is.na(gain))
)
gain_summary
ggplot(data = flights, mapping = aes(x = gain)) +
geom_histogram(color = "white", bins = 20)
flights <- flights %>%
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
(LC3.10) What do positive values of the gain variable in flights correspond to?
What about negative values? And what about a zero value?
- Suppose a flight departed 20 minutes late, i.e. dep_delay = 20.
- It then arrived 10 minutes late, so arr_delay = 10.
- Then gain = dep_delay - arr_delay = 20 - 10 = 10 is positive, indicating that
time was made up (gained) in the air.
- A negative gain means the arrival delay was larger than the departure delay, so
the flight lost additional time in the air.
- A gain of 0 means the departure delay and the arrival delay were equal, so no
time was made up in the air. For most flights, the gain is close to zero minutes.
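A minimal sketch of the three cases, using a small made-up data frame. The
`toy_flights` table, its values, and the `outcome` labels are hypothetical and are
not rows from `flights`.

```{r}
library(dplyr)

# Hypothetical delays in minutes; not taken from nycflights13
toy_flights <- data.frame(
  dep_delay = c(20, 5, 15),
  arr_delay = c(10, 25, 15)
)

# Same definition of gain as above, plus a label for its sign
toy_gain <- toy_flights %>%
  mutate(
    gain = dep_delay - arr_delay,
    outcome = case_when(
      gain > 0 ~ "made up time in the air",
      gain < 0 ~ "lost more time in the air",
      TRUE     ~ "no change"
    )
  )
toy_gain
```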
(LC3.11) Could we create the dep_delay and arr_delay columns by simply subtracting
dep_time from sched_dep_time and similarly for arrivals? Try the code out and
explain any differences between the result and what actually appears in flights.
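- No, because dep_time and sched_dep_time are stored as clock times in HHMM format
(for example, 1203 means 12:03), so they cannot be subtracted directly. The time
difference between 12:03 and 11:59 is four minutes, but 1203 - 1159 = 44.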
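The arithmetic can be checked with a short sketch. The helper `hhmm_to_minutes()`
is a hypothetical name introduced here for illustration, and it assumes both times
fall on the same day (it does not handle flights that cross midnight).

```{r}
# Convert an HHMM integer (e.g. 1203 for 12:03) to minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Subtracting the raw values treats clock times as plain integers
naive_diff <- 1203 - 1159

# Converting first gives the elapsed minutes
true_diff <- hhmm_to_minutes(1203) - hhmm_to_minutes(1159)

c(naive = naive_diff, minutes = true_diff)
```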
(LC3.12) What can we say about the distribution of gain? Describe it in a few
sentences using the plot and the gain_summary data frame values.
- Most gains are close to zero, and the large majority fall between -50 and 50
minutes. However, there are a few extreme cases.
3.6 arrange and sort rows
```{r}
freq_dest <- flights %>%
group_by(dest) %>%
summarize(num_flights = n())
freq_dest
freq_dest %>%
arrange(num_flights)
freq_dest %>%
arrange(desc(num_flights))
```
3.7 join data frames
```{r}
View(airlines)
flights_joined <- flights %>%
inner_join(airlines, by = "carrier")
View(flights)
View(flights_joined)
View(airports)
flights_with_airport_names <- flights %>%
inner_join(airports, by = c("dest" = "faa"))
View(flights_with_airport_names)
named_dests <- flights %>%
group_by(dest) %>%
summarize(num_flights = n()) %>%
arrange(desc(num_flights)) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
rename(airport_name = name)
named_dests
flights_weather_joined <- flights %>%
inner_join(weather, by = c("year", "month", "day", "hour", "origin"))
View(flights_weather_joined)
```
(LC3.13) Looking at Figure 3.7, when joining flights and weather (or, in other
words, matching the hourly weather values with each flight), why do we need to join
by all of year, month, day, hour, and origin, and not just hour?
- Because hour is just a number between 0 and 23, it does not identify a specific
hour on its own. We also need the year, month, day, and origin airport to pin down
which hourly weather record belongs to a flight.
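A small sketch of what goes wrong when joining on hour alone. `toy_flights` and
`toy_weather` are hypothetical miniature tables with made-up values, not subsets of
the real data.

```{r}
library(dplyr)

# Hypothetical miniature versions of flights and weather
toy_flights <- data.frame(
  year = 2013, month = 1, day = 1, hour = 6,
  origin = "JFK", flight = 101
)
toy_weather <- data.frame(
  year = 2013, month = c(1, 1, 7), day = c(1, 2, 4), hour = 6,
  origin = c("JFK", "JFK", "LGA"),
  temp = c(39, 41, 80)
)

# Joining on hour alone matches every 6 a.m. reading, whatever the date or airport
by_hour_only <- toy_flights %>%
  inner_join(toy_weather, by = "hour")

# Joining on the full key matches only the reading for that date, hour, and airport
by_full_key <- toy_flights %>%
  inner_join(toy_weather, by = c("year", "month", "day", "hour", "origin"))

nrow(by_hour_only)
nrow(by_full_key)
```

With the hour-only key, the single toy flight is paired with weather from other
days and another airport, which is why the full key is needed.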
(LC3.14) What surprises you about the top 10 destinations from NYC in 2013?
- The number of flights to Boston, especially given that taking the train may be
easier and faster.
3.7.4 Normal forms
```{r}
joined_flights <- flights %>%
inner_join(airlines, by = "carrier")
View(joined_flights)
```
(LC3.15) What are some advantages of data in normal forms? What are some
disadvantages?
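- When datasets are in normal form, we can easily join them with other datasets!
For example, we can join the flights data with the planes data. A disadvantage is
that related information is spread across several tables, so a join is needed
before the values can be viewed together.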
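As a rough illustration of that idea, the sketch below keeps carrier names in a
separate lookup table and attaches them only when needed. `toy_flights` and
`toy_airlines` are hypothetical miniature tables built for this example.

```{r}
library(dplyr)

# Hypothetical miniature tables: carrier codes in one, full names in the other
toy_flights <- data.frame(
  flight = c(1, 2, 3, 4),
  carrier = c("AS", "AS", "UA", "AS")
)
toy_airlines <- data.frame(
  carrier = c("AS", "UA"),
  name = c("Alaska Airlines Inc.", "United Air Lines Inc.")
)

# Each full name is stored once per carrier and attached through the shared key
toy_joined <- toy_flights %>%
  inner_join(toy_airlines, by = "carrier")
toy_joined
```

The names are not repeated in every flight row of the stored data, at the cost of
an extra join step whenever both pieces are wanted in one table.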
3.8 Other verbs
```{r}
glimpse(flights)
flights %>%
select(carrier, flight)
flights_no_year <- flights %>% select(-year)
flight_arr_times <- flights %>% select(month:day, arr_time:sched_arr_time)
flight_arr_times
flights_reorder <- flights %>%
select(year, month, day, hour, minute, time_hour, everything())
glimpse(flights_reorder)
flights %>% select(starts_with("a"))
flights %>% select(ends_with("delay"))
flights %>% select(contains("time"))
flights_time_new <- flights %>%
select(dep_time, arr_time) %>%
rename(departure_time = dep_time, arrival_time = arr_time)
glimpse(flights_time_new)
named_dests %>% top_n(n = 10, wt = num_flights)
named_dests %>%
  top_n(n = 10, wt = num_flights) %>%
  arrange(desc(num_flights))
```
(LC3.16) What are some ways to select all three of the dest, air_time, and distance
variables from flights? Give the code showing how to do this in at least three
different ways.
```{r}
# The regular way:
flights %>%
select(dest, air_time, distance)
```
```{r}
# Since they are sequential columns in the dataset
flights %>%
select(dest:distance)
```
```{r}
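# Not as efficient: remove everything else, including the gain, hours, and
# gain_per_hour columns added to flights earlier with mutate()
flights %>%
  select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time,
         -sched_arr_time, -arr_delay, -carrier, -flight, -tailnum, -origin,
         -hour, -minute, -time_hour, -gain, -hours, -gain_per_hour)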
```
(LC3.17) How could one use starts_with(), ends_with(), and contains() to select
columns from the flights data frame? Provide three different examples in total: one
for starts_with(), one for ends_with(), and one for contains().
```{r}
# Anything that starts with "d"
flights %>%
select(starts_with("d"))
```
```{r}
# Anything related to delays:
flights %>%
select(ends_with("delay"))
```
```{r}
# Anything related to departures:
flights %>%
select(contains("dep"))
```
(LC3.18) Why might we want to use the select function on a data frame?
- To narrow the data frame down to the variables of interest, which makes it easier
to look at.
(LC3.19) Create a new data frame that shows the top 5 airports with the largest
arrival delays from NYC in 2013.
```{r}
top_five <- flights %>%
  group_by(dest) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay)) %>%
  # Name the ranking column explicitly instead of relying on the last column
  top_n(n = 5, wt = avg_delay)
top_five
```
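In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which takes the
ordering column directly and returns the rows sorted from largest to smallest. A
minimal sketch of the same idea follows. The `toy_delays` table and its destination
codes and values are made up, not computed from `flights`.

```{r}
library(dplyr)

# Hypothetical average arrival delays in minutes
toy_delays <- data.frame(
  dest = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF"),
  avg_delay = c(12.5, 41.8, 3.2, 30.1, 25.0, 18.7)
)

# Keep the five rows with the largest avg_delay
top_five_toy <- toy_delays %>%
  slice_max(avg_delay, n = 5)
top_five_toy
```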