---
title: 'BAS 240 (I01) Data Structures for Analytics (2023FA) TE #4'
author: "Amaria Frost"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(nycflights13)

# Keep only Alaska Airlines flights, then only flights to Portland
alaska_flights <- flights %>%
  filter(carrier == "AS")
portland_flights <- flights %>%
  filter(dest == "PDX")
View(portland_flights)

# Flights from JFK to BTV or SEA in October through December
btv_sea_flights_fall <- flights %>%
  filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >= 10)
View(btv_sea_flights_fall)

# Same filter, with commas in place of &
btv_sea_flights_fall <- flights %>%
  filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
View(btv_sea_flights_fall)

# Flights that did not go to BTV or SEA
not_BTV_SEA <- flights %>%
  filter(!(dest == "BTV" | dest == "SEA"))
View(not_BTV_SEA)

# Without the parentheses, ! applies only to dest == "BTV",
# so flights to SEA are still returned
flights %>%
  filter(!dest == "BTV" | dest == "SEA")

# Two equivalent ways to match several destinations
many_airports <- flights %>%
  filter(dest == "SEA" | dest == "SFO" | dest == "PDX" |
           dest == "BTV" | dest == "BDL")
many_airports <- flights %>%
  filter(dest %in% c("SEA", "SFO", "PDX", "BTV", "BDL"))
View(many_airports)

# Returns NA because temp contains missing values
summary_temp <- weather %>%
  summarize(mean = mean(temp), std_dev = sd(temp))
summary_temp

# na.rm = TRUE drops the missing values before summarizing
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_temp

# Mean and standard deviation of temperature by month
summary_monthly_temp <- weather %>%
  group_by(month) %>%
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp

# group_by() only adds grouping metadata; the data itself is unchanged
diamonds
diamonds %>%
  group_by(cut)
diamonds %>%
  group_by(cut) %>%
  summarize(avg_price = mean(price))
diamonds %>%
  group_by(cut) %>%
  ungroup()

# Number of flights per origin airport
by_origin <- flights %>%
  group_by(origin) %>%
  summarize(count = n())
by_origin
# Number of flights per origin airport and month
by_origin_monthly <- flights %>%
  group_by(origin, month) %>%
  summarize(count = n())
by_origin_monthly

# A second group_by() replaces the first, so this groups by month only
by_origin_monthly_incorrect <- flights %>%
  group_by(origin) %>%
  group_by(month) %>%
  summarize(count = n())
by_origin_monthly_incorrect
```

3.5 mutate existing variables

```{r}
# Convert temperature from Fahrenheit to Celsius
weather <- weather %>%
  mutate(temp_in_C = (temp - 32) / 1.8)

summary_monthly_temp <- weather %>%
  group_by(month) %>%
  summarize(mean_temp_in_F = mean(temp, na.rm = TRUE),
            mean_temp_in_C = mean(temp_in_C, na.rm = TRUE))
summary_monthly_temp

# gain: minutes of delay made up in the air
flights <- flights %>%
  mutate(gain = dep_delay - arr_delay)

gain_summary <- flights %>%
  summarize(
    min = min(gain, na.rm = TRUE),
    q1 = quantile(gain, 0.25, na.rm = TRUE),
    median = quantile(gain, 0.5, na.rm = TRUE),
    q3 = quantile(gain, 0.75, na.rm = TRUE),
    max = max(gain, na.rm = TRUE),
    mean = mean(gain, na.rm = TRUE),
    sd = sd(gain, na.rm = TRUE),
    missing = sum(is.na(gain))
  )
gain_summary

ggplot(data = flights, mapping = aes(x = gain)) +
  geom_histogram(color = "white", bins = 20)

# Several new variables can be created in a single mutate()
flights <- flights %>%
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours
  )
```

(LC3.10) What do positive values of the gain variable in flights correspond to? What about negative values? And what about a zero value?

- Suppose a flight departed 20 minutes late, i.e. dep_delay = 20.
- It then arrived 10 minutes late, so arr_delay = 10.
- Then gain = dep_delay - arr_delay = 20 - 10 = 10 is positive, indicating that time was made up (gained) in the air.
- A negative gain means the opposite: the arrival delay exceeded the departure delay, so the flight lost time in the air.
- A gain of 0 means the departure and arrival delays were equal, so no time was made up in the air. In most cases, the gain is close to zero minutes.

(LC3.11) Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights.

- No, because the times are stored as clock readings with the hours and minutes run together, so they cannot be subtracted directly. The difference between 12:03 and 11:59 is 4 minutes, but 1203 - 1159 = 44.

(LC3.12) What can we say about the distribution of gain? Describe it in a few sentences using the plot and the gain_summary data frame values.

- The gain is centered close to zero, and the bulk of the values fall between -50 and 50 minutes. There are, however, some extreme values.

3.6 arrange and sort rows

```{r}
# Number of flights per destination
freq_dest <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n())
freq_dest

# Ascending order is the default; desc() reverses it
freq_dest %>%
  arrange(num_flights)
freq_dest %>%
  arrange(desc(num_flights))
```

3.7 join data frames

```{r}
# The key column has the same name in both data frames
View(airlines)
flights_joined <- flights %>%
  inner_join(airlines, by = "carrier")
View(flights)
View(flights_joined)

# The key columns have different names: dest in flights, faa in airports
View(airports)
flights_with_airport_names <- flights %>%
  inner_join(airports, by = c("dest" = "faa"))
View(flights_with_airport_names)

named_dests <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n()) %>%
  arrange(desc(num_flights)) %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  rename(airport_name = name)
named_dests

# Several key columns are needed to match each flight to its hourly weather
flights_weather_joined <- flights %>%
  inner_join(weather, by = c("year", "month", "day", "hour", "origin"))
View(flights_weather_joined)
```

(LC3.13) Looking at Figure 3.7, when joining flights and weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year, month, day, hour, and origin, and not just hour?

- Because hour is just a number between 0 and 23, it does not identify a specific hour on its own. We also need the year, month, day, and origin airport.

(LC3.14) What surprises you about the top 10 destinations from NYC in 2013?

- The number of flights to Boston, especially since it can be easier and faster to take the train.

3.7.4 Normal forms

```{r}
joined_flights <- flights %>%
  inner_join(airlines, by = "carrier")
View(joined_flights)
```

(LC3.15) What are some advantages of data in normal forms? What are some disadvantages?

- When datasets are in normal form, we can easily join them with other datasets. For example, we can join the flights data with the planes data.

3.8 Other verbs

```{r}
glimpse(flights)

# Keep or drop columns by name
flights %>%
  select(carrier, flight)
flights_no_year <- flights %>%
  select(-year)

# Select ranges of adjacent columns
flight_arr_times <- flights %>%
  select(month:day, arr_time:sched_arr_time)
flight_arr_times

# Move the time-related columns to the front
flights_reorder <- flights %>%
  select(year, month, day, hour, minute, time_hour, everything())
glimpse(flights_reorder)

# Helper functions match columns by name pattern
flights %>%
  select(starts_with("a"))
flights %>%
  select(ends_with("delay"))
flights %>%
  select(contains("time"))

flights_time_new <- flights %>%
  select(dep_time, arr_time) %>%
  rename(departure_time = dep_time, arrival_time = arr_time)
glimpse(flights_time_new)

# The ten destinations with the most flights
named_dests %>%
  top_n(n = 10, wt = num_flights)
named_dests %>%
  top_n(n = 10, wt = num_flights) %>%
  arrange(desc(num_flights))
```

(LC3.16) What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways.

```{r}
# The regular way:
flights %>%
  select(dest, air_time, distance)
```

```{r}
# Since they are sequential columns in the dataset
flights %>%
  select(dest:distance)
```

```{r}
# Not as effective: removing everything else, including the gain, hours,
# and gain_per_hour columns added to flights in section 3.5
flights %>%
  select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay,
         -arr_time, -sched_arr_time, -arr_delay, -carrier, -flight,
         -tailnum, -origin, -hour, -minute, -time_hour,
         -gain, -hours, -gain_per_hour)
```

(LC3.17) How could one use starts_with(), ends_with(), and contains() to select columns from the flights data frame? Provide three different examples in total: one for starts_with(), one for ends_with(), and one for contains().

```{r}
# Anything that starts with "d"
flights %>%
  select(starts_with("d"))
```

```{r}
# Anything related to delays:
flights %>%
  select(ends_with("delay"))
```

```{r}
# Anything related to departures:
flights %>%
  select(contains("dep"))
```

(LC3.18) Why might we want to use the select function on a data frame?

- To narrow the data frame down to the columns of interest, which makes it easier to inspect.

(LC3.19) Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.

```{r}
# Average arrival delay per destination, largest first;
# wt names the ranking column explicitly
top_five <- flights %>%
  group_by(dest) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay)) %>%
  top_n(n = 5, wt = avg_delay)
top_five
```
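
The ranking step in LC3.19 can also be written with `slice_max()`, which dplyr 1.0.0 and later recommend in place of the superseded `top_n()`. The sketch below is only an illustration: `toy_delays` and its values are hypothetical and are not taken from `flights`, and it keeps the top 2 rather than the top 5 so that the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical per-flight arrival delays (invented values, for illustration only)
toy_delays <- tibble(
  dest      = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "DDD"),
  arr_delay = c(10, 30, 5, NA, 60, 40, -5)
)

# Average delay per destination, then keep the two largest averages.
# slice_max() returns the rows sorted from largest to smallest.
top_two <- toy_delays %>%
  group_by(dest) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  slice_max(avg_delay, n = 2)
top_two
```

Swapping `toy_delays` for `flights` and `n = 2` for `n = 5` should give the same five airports as `top_five`, without a separate `arrange()` step.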