Week3_Assignment

School

New England College

Course

CRN129

Subject

Electrical Engineering

Date

Jan 9, 2024

Week3_Assignment
2023-11-22

Sections: Introduction, Prerequisites, nycflights13, dplyr Basics, Filter Rows with filter(), Comparisons, Logical Operators, Missing Values.
Exercises: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 3

library(tidyverse) #calling the "tidyverse" library
## Warning: package 'tidyverse' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## Warning: package 'lubridate' was built under R version 4.1.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.0
## v ggplot2   3.4.3     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(nycflights13) #calling the "nycflights13" library
## Warning: package 'nycflights13' was built under R version 4.1.3

nycflights13::flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # i 336,766 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

#1.1 Had an arrival delay of two or more hours
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # i 10,190 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

#1.2 Flew to Houston (IAH or HOU)
filter(flights, dest == 'IAH' | dest == 'HOU')
## # A tibble: 9,313 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # i 9,303 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

#1.3 Were operated by United, American, or Delta
filter(flights, carrier == 'AA' | carrier == 'DL' | carrier == 'UA')
## # A tibble: 139,504 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # i 139,494 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

#1.4 Departed in summer (July, August, and September)
filter(flights, month %in% c(7, 8, 9))
#The output recorded below came from a run with c(7, 8, 12), so it covers July, August, and December rather than September.
## # A tibble: 86,887 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     1       13           2359        14      446            445
##  2  2013    12     1       17           2359        18      443            437
##  3  2013    12     1      453            500        -7      636            651
##  4  2013    12     1      520            515         5      749            808
##  5  2013    12     1      536            540        -4      845            850
##  6  2013    12     1      540            550       -10     1005           1027
##  7  2013    12     1      541            545        -4      734            755
##  8  2013    12     1      546            545         1      826            835
##  9  2013    12     1      549            600       -11      648            659
## 10  2013    12     1      550            600       -10      825            854
## # i 86,877 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

#1.5 Arrived more than two hours late, but didn’t leave late
filter(flights, arr_delay > 120 & dep_delay <= 0)
## # A tibble: 29 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    27     1419           1420        -1     1754           1550
##  2  2013    10     7     1350           1350         0     1736           1526
##  3  2013    10     7     1357           1359        -2     1858           1654
##  4  2013    10    16      657            700        -3     1258           1056
##  5  2013    11     1      658            700        -2     1329           1015
##  6  2013     3    18     1844           1847        -3       39           2219
##  7  2013     4    17     1635           1640        -5     2049           1845
##  8  2013     4    18      558            600        -2     1149            850
##  9  2013     4    18      655            700        -5     1213            950
## 10  2013     5    22     1827           1830        -3     2217           2010
## # i 19 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

#1.6 Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
#The output recorded below came from the narrower condition dep_delay >= 60 & arr_delay <= 30, which only keeps flights that also arrived no more than 30 minutes late.
## # A tibble: 239 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     3     1850           1745        65     2148           2120
##  2  2013     1     3     1950           1845        65     2228           2227
##  3  2013     1     3     2015           1915        60     2135           2111
##  4  2013     1     6     1019            900        79     1558           1530
##  5  2013     1     7     1543           1430        73     1758           1735
##  6  2013     1    11     1020            920        60     1311           1245
##  7  2013     1    12     1706           1600        66     1949           1927
##  8  2013     1    12     1953           1845        68     2154           2137
##  9  2013     1    19     1456           1355        61     1636           1615
## 10  2013     1    21     1531           1430        61     1843           1815
## # i 229 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

#3 How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
count(filter(flights, is.na(dep_time)))
## # A tibble: 1 x 1
##       n
##   <int>
## 1  8255
colSums(!is.na(flights)) == 0
##           year          month            day       dep_time sched_dep_time
##          FALSE          FALSE          FALSE          FALSE          FALSE
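The conditions in exercises 1.1 to 1.5 are ordinary logical vectors, and filter() keeps only the rows where the condition is TRUE, dropping both FALSE and NA. The following is a minimal base R sketch of how `%in%` and a chain of `==` comparisons treat a missing value; the `month` vector is made-up illustration data, not taken from flights.

```r
# Hypothetical month values, including one missing entry (not from flights).
month <- c(7, 8, 9, 12, NA)

# %in% never returns NA: a missing value simply fails the membership test.
in_summer_in <- month %in% c(7, 8, 9)

# Chained == comparisons propagate NA: NA == 7 is NA, and NA | NA is NA.
in_summer_or <- month == 7 | month == 8 | month == 9

print(in_summer_in)
print(in_summer_or)

# Mimic what filter() keeps: only the positions where the condition is TRUE.
kept <- month[which(in_summer_or)]
print(kept)
```

Both forms keep the same rows once passed to filter(), because an NA condition is treated like FALSE there; the difference only shows when the logical vector is inspected or negated directly.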
##      dep_delay       arr_time sched_arr_time      arr_delay        carrier
##          FALSE          FALSE          FALSE          FALSE          FALSE
##         flight        tailnum         origin           dest       air_time
##          FALSE          FALSE          FALSE          FALSE          FALSE
##       distance           hour         minute      time_hour
##          FALSE          FALSE          FALSE          FALSE
#Every entry is FALSE, so no column is missing for all flights.
#Total number of flights with a missing dep_time: 8,255.
#Other variables missing in these rows include dep_delay and arr_time.
#These rows most likely represent cancelled flights: they were scheduled but never departed.

Sections: Arrange Rows with arrange()
Exercises: 1 and 2

#1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
arrange(flights, dep_time) %>% tail()
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA           1842        NA       NA           2019
## 2  2013     9    30       NA           1455        NA       NA           1634
## 3  2013     9    30       NA           2200        NA       NA           2312
## 4  2013     9    30       NA           1210        NA       NA           1330
## 5  2013     9    30       NA           1159        NA       NA           1344
## 6  2013     9    30       NA            840        NA       NA           1020
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#arrange() always places missing values at the end, as tail() shows. To bring them to the start, sort on is.na() first: arrange(flights, desc(is.na(dep_time)), dep_time).

#2. Sort flights to find the most delayed flights. Find the flights that left earliest
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # i 336,766 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#The flights that left earliest relative to schedule come from the ascending sort, arrange(flights, dep_delay).

Sections: Select columns with select()
Exercises: 3

#3. What does the any_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, any_of(vars))
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # i 336,766 more rows
#any_of() selects the columns whose names appear in a character vector and silently skips names that do not exist in the data. It is helpful because a long list of variable names can be stored once in a vector such as vars and passed to select().

Sections: Add new variables with mutate(), Useful Creation Functions
Exercises: 2, 4, 5

#2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
flights_difference <- mutate(flights,
  dep_time_min = (dep_time %/% 100) * 60 + dep_time %% 100,
  arr_time_min = (arr_time %/% 100) * 60 + arr_time %% 100,
  difference_time = arr_time_min - dep_time_min)
nrow(filter(flights_difference, air_time != difference_time)) / nrow(flights) * 100
## [1] 97.14172
#Logically, we expect air_time to equal arr_time - dep_time.
#However, 97.14% of all flights do not match. arr_time and dep_time are clock times in HHMM format (517 means 5:17), not continuous numbers, so both are converted to minutes since midnight above before subtracting. The mismatch that remains after the conversion likely comes from arr_time being recorded in the destination's local time zone, from flights that land after midnight, and from air_time excluding time spent taxiing. A full fix would put both times in the same time zone and handle overnight arrivals before recalculating difference_time.

#4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
head(
  select(
    arrange(mutate(flights, total_delay = dep_delay + arr_delay), desc(total_delay)),
    total_delay, carrier, flight, origin, dest, time_hour
  ),
  10
)
## # A tibble: 10 x 6
##    total_delay carrier flight origin dest  time_hour
##          <dbl> <chr>    <int> <chr>  <chr> <dttm>
##  1        2573 HA          51 JFK    HNL   2013-01-09 09:00:00
##  2        2264 MQ        3535 JFK    CMH   2013-06-15 19:00:00
##  3        2235 MQ        3695 EWR    ORD   2013-01-10 16:00:00
##  4        2021 AA         177 JFK    SFO   2013-09-20 18:00:00
##  5        1994 MQ        3075 JFK    CVG   2013-07-22 16:00:00
##  6        1891 DL        2391 JFK    TPA   2013-04-10 19:00:00
##  7        1826 DL        2119 LGA    MSP   2013-03-17 08:00:00
##  8        1793 DL        2047 LGA    ATL   2013-07-22 07:00:00
##  9        1774 AA         172 EWR    MIA   2013-12-05 17:00:00
## 10        1753 MQ        3744 EWR    ORD   2013-05-03 20:00:00
#The ten largest total delays are all distinct, so there are no ties to resolve and a plain sort is enough. If ties did occur, a ranking function such as min_rank(desc(total_delay)) gives tied flights the same rank; dplyr also offers dense_rank(), percent_rank(), and others.

#5. What does 1:3 + 1:10 return? Why?
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
## length
##  [1]  2  4  6  5  7  9  8 10 12 11
#R recycles the shorter vector 1:3 until it matches the length of 1:10, so the sum is computed as c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1) + 1:10. Because 10 is not a multiple of 3, the last pass through 1:3 is incomplete and R issues a warning. This is a warning, not an error: the result is still returned.

Sections: Grouped summaries with summarize(), Combining multiple operations with the Pipe, Missing Values, Counts, Useful Summary Functions, Grouping by Multiple Variables, Ungrouping
Exercises: 5, 6

#5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
worst_delays <- flights %>%
  group_by(carrier) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE),
            avg_dep_delay = mean(dep_delay, na.rm = TRUE))
arrange(worst_delays, desc(avg_arr_delay))
## # A tibble: 16 x 3
##    carrier avg_arr_delay avg_dep_delay
##    <chr>           <dbl>         <dbl>
##  1 F9               21.9          20.2
##  2 FL               20.1          18.7
##  3 EV               15.8          20.0
##  4 YV               15.6          19.0
##  5 OO               11.9          12.6
##  6 MQ               10.8          10.6
##  7 WN               9.65          17.7
##  8 B6               9.46          13.0
##  9 9E               7.38          16.7
## 10 UA               3.56          12.1
## 11 US               2.13          3.78
## 12 VX               1.76          12.9
## 13 DL               1.64          9.26
## 14 AA              0.364          8.59
## 15 HA              -6.92          4.90
## 16 AS              -9.93          5.80
arrange(worst_delays, desc(avg_dep_delay))
## # A tibble: 16 x 3
##    carrier avg_arr_delay avg_dep_delay
##    <chr>           <dbl>         <dbl>
##  1 F9               21.9          20.2
##  2 EV               15.8          20.0
##  3 YV               15.6          19.0
##  4 FL               20.1          18.7
##  5 WN               9.65          17.7
##  6 9E               7.38          16.7
##  7 B6               9.46          13.0
##  8 VX               1.76          12.9
##  9 OO               11.9          12.6
## 10 UA               3.56          12.1
## 11 MQ               10.8          10.6
## 12 DL               1.64          9.26
## 13 AA              0.364          8.59
## 14 AS              -9.93          5.80
## 15 HA              -6.92          4.90
## 16 US               2.13          3.78
#Carrier F9 has the highest average arrival delay and the highest average departure delay.

#Average departure delay by origin airport:
flights %>%
  group_by(origin) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 3 x 2
##   origin avg_dep_delay
##   <chr>          <dbl>
## 1 EWR             15.1
## 2 JFK             12.1
## 3 LGA             10.3

#Average arrival delay by origin airport:
flights %>%
  group_by(origin) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE))
## # A tibble: 3 x 2
##   origin avg_arr_delay
##   <chr>          <dbl>
## 1 EWR             9.11
## 2 JFK             5.55
## 3 LGA             5.78

#6. What does the sort argument to count() do? When might you use it?
#When set to TRUE, the sort argument to count() sorts the output in descending order of n. This is useful when the most frequent groups matter most and should appear at the top of the list.

Sections: Grouped Mutates (and Filters)
Exercises: 2, 4, 7

#2. Which plane (tailnum) has the worst on-time record?
arrange(
  select(filter(flights, !is.na(arr_delay)), tailnum, arr_delay) %>%
    group_by(tailnum) %>%
    summarise(sum(arr_delay)),
  desc(`sum(arr_delay)`)
)
## # A tibble: 4,037 x 2
##    tailnum `sum(arr_delay)`
##    <chr>              <dbl>
##  1 N15910              7317
##  2 N15980              7134
##  3 N16919              6904
##  4 N228JB              6778
##  5 N14998              6087
##  6 N192JB              5810
##  7 N292JB              5804
##  8 N12921              5788
##  9 N13958              5620
## 10 N10575              5566
## # i 4,027 more rows
#Measured by total arrival delay, the plane with the worst record is tailnum N15910.

#4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
flights %>%
  select(dest, arr_delay) %>%
  group_by(dest) %>%
  filter(arr_delay > 0) %>%
  mutate(total_delay = sum(arr_delay, na.rm = TRUE),
         prop_delay = arr_delay / total_delay)
## # A tibble: 133,004 x 4
## # Groups:   dest [103]
##    dest  arr_delay total_delay prop_delay
##    <chr>     <dbl>       <dbl>      <dbl>
##  1 IAH          11       99391  0.000111
##  2 IAH          20       99391  0.000201
##  3 MIA          33      140424  0.000235
##  4 ORD          12      283046  0.0000424
##  5 FLL          19      202605  0.0000938
##  6 ORD           8      283046  0.0000283
##  7 LAX           7      203226  0.0000344
##  8 DFW          31      110009  0.000282
##  9 ATL          12      300299  0.0000400
## 10 DTW          16      138258  0.000116
## # i 132,994 more rows

#7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
flights %>%
  group_by(dest) %>%
  filter(n_distinct(carrier) >= 2) %>%
  group_by(carrier) %>%
  summarise(number_of_transfers = n_distinct(dest)) %>%
  arrange(desc(number_of_transfers))
## # A tibble: 16 x 2
##    carrier number_of_transfers
##    <chr>                 <int>
##  1 EV                       51
##  2 9E                       48
##  3 UA                       42
##  4 DL                       39
##  5 B6                       35
##  6 AA                       19
##  7 MQ                       19
##  8 WN                       10
##  9 OO                        5
## 10 US                        5
## 11 VX                        4
## 12 YV                        3
## 13 FL                        2
## 14 AS                        1
## 15 F9                        1
## 16 HA                        1
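The grouped filter in exercise 7 can be easier to follow on a handful of rows. The sketch below applies the same pipeline to a made-up table; the `toy` data and the column name `n_dest` are assumptions for illustration and are not part of the assignment. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data: four destinations, some served by several carriers.
toy <- tibble(
  dest    = c("ATL", "ATL", "ATL", "BOS", "BOS", "HNL", "ORD", "ORD"),
  carrier = c("DL",  "EV",  "DL",  "B6",  "DL",  "HA",  "UA",  "UA")
)

ranked <- toy %>%
  group_by(dest) %>%
  # Grouped filter: keep every row of a destination served by 2+ carriers.
  filter(n_distinct(carrier) >= 2) %>%
  # Regrouping by carrier replaces the earlier grouping by dest.
  group_by(carrier) %>%
  # Count how many of the remaining destinations each carrier serves.
  summarise(n_dest = n_distinct(dest)) %>%
  # Break ties alphabetically so the order is reproducible.
  arrange(desc(n_dest), carrier)

print(ranked)
```

HNL and ORD drop out because each is served by a single carrier, even though ORD has two rows; n_distinct() counts distinct carriers, not flights.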