HW3a.knit

School

Syracuse University

Course

687

Subject

Computer Science

Date

Feb 20, 2024

9/20/23, 3:49 PM HW3a.knit file:///C:/Users/morea/Downloads/HW3a.html 1/7

Intro to Data Science HW 3

Copyright 2023, Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

    # Enter your name here: Adesh More

Attribution statement: (choose only one and delete the rest)

    # 2. I did this homework with help from the book and the professor and these Internet sources: W3Schools

Reminders of things to practice from last week:
Make a data frame: data.frame( )
Row index of max/min: which.max( ) which.min( )
Sort value or order rows: arrange( ) sort( ) order( )
Descriptive statistics: mean( ) sum( ) max( )
Conditional statement: if (condition) true stuff else false stuff

This Week: Often, when you get a dataset, it is not in the format you want. You can (and should) use code to refine the dataset to become more useful. As Chapter 6 of Introduction to Data Science mentions, this is called data munging. In this homework, you will read in a dataset from the web and work on it (in a data frame) to improve its usefulness.

Part 1: Use read_csv( ) to read a CSV file from the web into a data frame:

A. Use R code to read directly from a URL on the web. Store the dataset into a new dataframe, called testDF. The URL is: "https://data-science-intro.s3.us-east-2.amazonaws.com/NYS_COVID_Testing.csv" Hint: use read_csv( ), not read.csv( ). This is from the tidyverse package. Check the help to compare them.

    library(tidyverse)
    ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ## dplyr     1.1.2     readr     2.1.4
    ## forcats   1.0.0     stringr   1.5.0
    ## ggplot2   3.4.3     tibble    3.2.1
    ## lubridate 1.9.2     tidyr     1.3.0
    ## purrr     1.0.2
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## dplyr::filter() masks stats::filter()
    ## dplyr::lag()    masks stats::lag()
    ## Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

    testDF <- read_csv("https://data-science-intro.s3.us-east-2.amazonaws.com/NYS_COVID_Testing.csv")

    ## Rows: 7383 Columns: 5
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## chr (3): TestDate, AgeGroup, AgeCategory
    ## dbl (2): PositiveCases, TotalTests
    ##
    ## Use `spec()` to retrieve the full column specification for this data.
    ## Specify the column types or set `show_col_types = FALSE` to quiet this message.

B. Use View( ), head( ), and tail( ) to examine the testDF dataframe. Add a block comment that briefly describes what you see.

    # view(testDF)
    head(testDF)

    ## # A tibble: 6 × 5
    ##   TestDate AgeGroup PositiveCases TotalTests AgeCategory
    ##   <chr>    <chr>            <dbl>      <dbl> <chr>
    ## 1 3/2/2020 45 to 54             1          1 middle-aged_adults
    ## 2 3/3/2020 25 to 34             0          2 young_adults
    ## 3 3/3/2020 35 to 44             0          1 middle-aged_adults
    ## 4 3/3/2020 45 to 54             0          1 middle-aged_adults
    ## 5 3/3/2020 55 to 64             0          2 senior_citizens
    ## 6 3/3/2020 65 to 74             0          2 senior_citizens

    tail(testDF)
    ## # A tibble: 6 × 5
    ##   TestDate AgeGroup PositiveCases TotalTests AgeCategory
    ##   <chr>    <chr>            <dbl>      <dbl> <chr>
    ## 1 1/3/2022 5 to 19           9923      38977 children
    ## 2 1/3/2022 55 to 64          5739      27019 senior_citizens
    ## 3 1/3/2022 65 to 74          2759      14498 senior_citizens
    ## 4 1/3/2022 75 to 84          1141       6519 senior_citizens
    ## 5 1/3/2022 85 +               680       4028 senior_citizens
    ## 6 1/3/2022 < 1                717       2074 children

Part 2: Create new data frames based on a condition:

A. Use the table( ) command to summarize the contents of the AgeCategory variable in testDF. Write a comment interpreting what you see: how many age categories are there in the dataset and what is the proportion of observations in each?

    table(testDF$AgeCategory)

    ##
    ##           children middle-aged_adults    senior_citizens       young_adults
    ##               2010               1345               2686               1342

    # There are 4 age categories. senior_citizens has the most observations
    # (2686 of 7383, about 36%) and young_adults the fewest (1342, about 18%).
    # children account for 2010 (about 27%) and middle-aged_adults for 1345 (about 18%).

B. Terms like "senior citizens" can function as othering language which demeans the people it seeks to describe. We can use the str_replace_all() function from tidyverse to find all instances of senior_citizens in the AgeCategory variable and replace them with older_adults. In this case, we want to search for senior_citizens and replace it with older_adults in testDF$AgeCategory. How can you use this information to overwrite the AgeCategory in the function below?

    testDF$AgeCategory <- str_replace_all(testDF$AgeCategory, "senior_citizens", "older_adults")
    head(testDF$AgeCategory)

    ## [1] "middle-aged_adults" "young_adults"       "middle-aged_adults"
    ## [4] "middle-aged_adults" "older_adults"       "older_adults"

C. Create a dataframe (called olderAdults) that contains only the rows (observations) for which the value in the AgeCategory variable (column) is older_adults. Hint: Use subsetting.

    olderAdults <- testDF[testDF$AgeCategory == "older_adults", ]
    head(olderAdults)
    ## # A tibble: 6 × 5
    ##   TestDate AgeGroup PositiveCases TotalTests AgeCategory
    ##   <chr>    <chr>            <dbl>      <dbl> <chr>
    ## 1 3/3/2020 55 to 64             0          2 older_adults
    ## 2 3/3/2020 65 to 74             0          2 older_adults
    ## 3 3/3/2020 85 +                 0          1 older_adults
    ## 4 3/4/2020 55 to 64             0          2 older_adults
    ## 5 3/4/2020 65 to 74             0          3 older_adults
    ## 6 3/4/2020 85 +                 0          1 older_adults

D. Use the dim() command on olderAdults to confirm that the data frame contains 2,686 observations and 5 columns/variables.

    dim(olderAdults)

    ## [1] 2686    5

E. Use subsetting to create a new dataframe that contains only the observations for which the value in the AgeCategory variable is young_adults. The name of this new df should be youngAdults.

    youngAdults <- testDF[testDF$AgeCategory == "young_adults", ]
    head(youngAdults)

    ## # A tibble: 6 × 5
    ##   TestDate AgeGroup PositiveCases TotalTests AgeCategory
    ##   <chr>    <chr>            <dbl>      <dbl> <chr>
    ## 1 3/3/2020 25 to 34             0          2 young_adults
    ## 2 3/4/2020 25 to 34             0          5 young_adults
    ## 3 3/5/2020 20 to 24             1          4 young_adults
    ## 4 3/5/2020 25 to 34             0         10 young_adults
    ## 5 3/6/2020 20 to 24             1          4 young_adults
    ## 6 3/6/2020 25 to 34             1          8 young_adults

F. Create one last data frame which only contains the observations for children in the AgeCategory variable of testDF. Call this new df childrenDF.

    childrenDF <- testDF[testDF$AgeCategory == "children", ]
    head(childrenDF)
    ## # A tibble: 6 × 5
    ##   TestDate AgeGroup PositiveCases TotalTests AgeCategory
    ##   <chr>    <chr>            <dbl>      <dbl> <chr>
    ## 1 3/4/2020 1 to 4               0          1 children
    ## 2 3/4/2020 5 to 19              0          3 children
    ## 3 3/4/2020 < 1                  0          1 children
    ## 4 3/5/2020 1 to 4               0          7 children
    ## 5 3/5/2020 5 to 19              6         17 children
    ## 6 3/5/2020 < 1                  0          2 children

Part 3: Analyze the numeric variables in the testDF dataframe.

A. How many numeric variables does the dataframe have? Show the code you used to answer the question.

    sum(sapply(testDF, is.numeric))

    ## [1] 2

B. What is the average number of total daily tests?

    mean(testDF$TotalTests)

    ## [1] 11724.96

C. How many tests were performed in the row with the highest number of total daily tests? What age category do they correspond to?

    max_row <- testDF[which.max(testDF$TotalTests), ]
    count <- max_row$TotalTests
    age_category <- max_row$AgeCategory
    count

    ## [1] 88904

    age_category

    ## [1] "young_adults"

D. How many positive cases were registered in the row with the highest number of positive cases? What age category do they correspond to?
    max_positive_row <- testDF[which.max(testDF$PositiveCases), ]
    highest_positive <- max_positive_row$PositiveCases
    age_category1 <- max_positive_row$AgeCategory
    highest_positive

    ## [1] 17486

    age_category1

    ## [1] "young_adults"

E. What is the total number of positive cases in testDF?

    sum(testDF$PositiveCases)

    ## [1] 3722542

F. Create a new variable in testDF which is the ratio of PositiveCases to TotalTests. Call this variable PositivityRate and explain in a comment what information it gives us.

    testDF$PositivityRate <- testDF$PositiveCases / testDF$TotalTests
    # It gives the proportion of tests in each row that came back positive
    # (positive cases divided by total tests).

G. What is the average positivity rate in testDF?

    mean(testDF$PositivityRate)

    ## [1] 0.05363633

Part 4: Create a function to automate the process from F-G:

A. The following function should work most of the time. Make sure to run this code before trying to test it. That is how you make the new function known to R. Add comments to each line explaining what it does:

    calculatePositivity <- function(dataset) {
      # Add a PositivityRate column to the local copy: positives divided by tests, row by row
      dataset$PositivityRate <- dataset$PositiveCases / dataset$TotalTests
      # Average the row-level positivity rates
      avePositivity <- mean(dataset$PositivityRate)
      # Return the average to the caller
      return(avePositivity)
    }
    # This function first computes the positivity rate (positive cases divided by
    # total tests) for every row, then returns the average of those rates.

B. Run your new function on the testDF dataframe. Is the output of the function consistent with the output in Step G above? Explain.
    calculatePositivity(testDF)

    ## [1] 0.05363633

    # Yes, the output is consistent: the function returns the same value as Step G (0.05363633).

C. Run the function on the olderAdults df you created earlier.

    calculatePositivity(olderAdults)

    ## [1] 0.05386202

D. Run the function on the youngAdults df.

    calculatePositivity(youngAdults)

    ## [1] 0.05435627

E. Lastly, run the positivity function on the childrenDF dataframe.

    calculatePositivity(childrenDF)

    ## [1] 0.05034919

F. In a comment, describe what you observe across these 3 datasets - which age group exhibits the highest positivity rate? How do these numbers compare to the baseline positivity rate in testDF? What does this mean to you?

    # Young adults have the highest average positivity rate (0.0544), followed by
    # older adults (0.0539); children have the lowest (0.0503). Young and older adults
    # sit slightly above the testDF baseline of 0.0536 and children sit below it, so a
    # test on a child was the least likely to come back positive. The differences are small.
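
The per-group comparison in Part 4 can also be run for every age category in one pass. The sketch below is a minimal base R illustration on made-up numbers, not the homework data: toyDF and pooledPositivity are hypothetical names introduced here, and calculatePositivity is a condensed version of the function from Part 4. It also contrasts the average of row-level ratios (what calculatePositivity returns) with a pooled rate (total positives over total tests), since the two can differ when rows have very different test counts.

```r
# Toy data with invented values (NOT taken from NYS_COVID_Testing.csv).
toyDF <- data.frame(
  AgeCategory   = c("children", "children", "young_adults", "young_adults"),
  PositiveCases = c(1, 30, 2, 50),
  TotalTests    = c(10, 1000, 10, 1000)
)

# Condensed version of calculatePositivity() from Part 4:
# the average of the row-level positivity rates.
calculatePositivity <- function(dataset) {
  mean(dataset$PositiveCases / dataset$TotalTests)
}

# Hypothetical helper: pooled rate, i.e. all positives divided by all tests.
pooledPositivity <- function(dataset) {
  sum(dataset$PositiveCases) / sum(dataset$TotalTests)
}

# split() makes one data frame per age category; sapply() applies the
# function to each and returns a named numeric vector.
byCategory   <- split(toyDF, toyDF$AgeCategory)
meanOfRatios <- sapply(byCategory, calculatePositivity)
pooled       <- sapply(byCategory, pooledPositivity)

print(round(meanOfRatios, 4))
print(round(pooled, 4))
```

In this toy data the small rows (10 tests each) pull the average of ratios well above the pooled rate, which is one reason to state which of the two an "average positivity rate" refers to.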