IST 687 HW5.knit

School

Syracuse University

Course

687

Subject

Computer Science

Date

Dec 6, 2023

11/29/23, 8:11 PM HW5.knit file:///C:/Users/empet/OneDrive/Desktop/Intro to Data Science/Elyse_Peterson_HW5.knit.html 1/17

Intro to Data Science - HW 5

Copyright Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Elyse Peterson

Attribution statement: (choose only one and delete the rest)

# 1. I did this homework by myself, with help from the book and the professor.

This module: Data visualization is important because many people can make sense of data more easily when it is presented in graphic form. As a data scientist, you will have to present complex data to decision makers in a form that makes the data interpretable for them. From your experience with Excel and other tools, you know that there are a variety of common data visualizations (e.g., pie charts). How many of them can you name?

The most powerful tool for data visualization in R is called ggplot. Written by computer/data scientist Hadley Wickham, this “graphics grammar” tool builds visualizations in layers. This method provides immense flexibility, but takes a bit of practice to master.

Step 1: Make a copy of the data

A. Read the who dataset from this URL: https://intro-datascience.s3.us-east-2.amazonaws.com/who.csv into a new dataframe called tb.

Your new dataframe, tb, contains a so-called multivariate time series: a sequence of measurements on 23 Tuberculosis-related (TB) variables captured repeatedly over time (1980-2013). Familiarize yourself with the nature of the 23 variables by consulting the dataset’s codebook, which can be found here: https://intro-datascience.s3.us-east-2.amazonaws.com/TB_data_dictionary_2021-02-06.csv
library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## dplyr     1.1.3     purrr     1.0.2
## forcats   1.0.0     stringr   1.5.0
## ggplot2   3.4.4     tibble    3.2.1
## lubridate 1.9.3     tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## dplyr::filter() masks stats::filter()
## dplyr::lag()    masks stats::lag()
## Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

urlToRead <- "https://intro-datascience.s3.us-east-2.amazonaws.com/who.csv"
tb <- read_csv(url(urlToRead))
## Rows: 5769 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): iso2
## dbl (22): year, new_sp, new_sp_m04, new_sp_m514, new_sp_m014, new_sp_m1524, ...
##
## Use `spec()` to retrieve the full column specification for this data.
## Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(tb)
## spc_tbl_ [5,769 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ iso2        : chr [1:5769] "AD" "AD" "AD" "AD" ...
##  $ year        : num [1:5769] 1989 1990 1991 1992 1993 ...
##  $ new_sp      : num [1:5769] NA NA NA NA 15 24 8 17 1 4 ...
##  $ new_sp_m04  : num [1:5769] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_m514 : num [1:5769] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_m014 : num [1:5769] NA NA NA NA NA NA 0 0 0 0 ...
##  $ new_sp_m1524: num [1:5769] NA NA NA NA NA NA 0 0 0 0 ...
##  $ new_sp_m2534: num [1:5769] NA NA NA NA NA NA 0 1 0 0 ...
##  $ new_sp_m3544: num [1:5769] NA NA NA NA NA NA 4 2 1 1 ...
##  $ new_sp_m4554: num [1:5769] NA NA NA NA NA NA 1 2 0 1 ...
##  $ new_sp_m5564: num [1:5769] NA NA NA NA NA NA 0 1 0 0 ...
##  $ new_sp_m65  : num [1:5769] NA NA NA NA NA NA 0 6 0 0 ...
##  $ new_sp_mu   : num [1:5769] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_f04  : num [1:5769] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_f514 : num [1:5769] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_f014 : num [1:5769] NA NA NA NA NA NA 0 0 NA 0 ...
##  $ new_sp_f1524: num [1:5769] NA NA NA NA NA NA 1 1 NA 0 ...
##  $ new_sp_f2534: num [1:5769] NA NA NA NA NA NA 1 2 NA 0 ...
##  $ new_sp_f3544: num [1:5769] NA NA NA NA NA NA 0 3 NA 1 ...
##  $ new_sp_f4554: num [1:5769] NA NA NA NA NA NA 0 0 NA 0 ...
##  $ new_sp_f5564: num [1:5769] NA NA NA NA NA NA 1 0 NA 0 ...
##  $ new_sp_f65  : num [1:5769] NA NA NA NA NA NA 0 1 NA 0 ...
##  $ new_sp_fu   : num [1:5769] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   iso2 = col_character(),
##   ..   year = col_double(),
##   ..   new_sp = col_double(),
##   ..   new_sp_m04 = col_double(),
##   ..   new_sp_m514 = col_double(),
##   ..   new_sp_m014 = col_double(),
##   ..   new_sp_m1524 = col_double(),
##   ..   new_sp_m2534 = col_double(),
##   ..   new_sp_m3544 = col_double(),
##   ..   new_sp_m4554 = col_double(),
##   ..   new_sp_m5564 = col_double(),
##   ..   new_sp_m65 = col_double(),
##   ..   new_sp_mu = col_double(),
##   ..   new_sp_f04 = col_double(),
##   ..   new_sp_f514 = col_double(),
##   ..   new_sp_f014 = col_double(),
##   ..   new_sp_f1524 = col_double(),
##   ..   new_sp_f2534 = col_double(),
##   ..   new_sp_f3544 = col_double(),
##   ..   new_sp_f4554 = col_double(),
##   ..   new_sp_f5564 = col_double(),
##   ..   new_sp_f65 = col_double(),
##   ..   new_sp_fu = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
codes <- read_csv(url("https://intro-datascience.s3.us-east-2.amazonaws.com/TB_data_dictionary_2021-02-06.csv"))

## Rows: 22 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): variable_name, dataset, definition
## lgl (1): code_list
##
## Use `spec()` to retrieve the full column specification for this data.
## Specify the column types or set `show_col_types = FALSE` to quiet this message.

view(codes)

B. How often were these measurements taken (in other words, at what frequency were the variables measured)? Put your answer in a comment.

years <- tb %>% group_by(year) %>% tally()
years

## # A tibble: 29 × 2
##     year     n
##    <dbl> <int>
##  1  1980   191
##  2  1981   194
##  3  1982   194
##  4  1983   196
##  5  1984   193
##  6  1985   198
##  7  1986   197
##  8  1987   199
##  9  1988   201
## 10  1989   197
## # 19 more rows

# These measurements were taken annually

Step 2: Clean-up the NAs and create a subset

A. Let’s clean up the iso2 attribute in tb

Hint: use is.na() – well, use !is.na()

tb <- tb[!(is.na(tb$iso2)), ]
tb
## # A tibble: 5,746 × 23
##    iso2   year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524
##    <chr> <dbl>  <dbl>      <dbl>       <dbl>       <dbl>        <dbl>
##  1 AD     1989     NA         NA          NA          NA           NA
##  2 AD     1990     NA         NA          NA          NA           NA
##  3 AD     1991     NA         NA          NA          NA           NA
##  4 AD     1992     NA         NA          NA          NA           NA
##  5 AD     1993     15         NA          NA          NA           NA
##  6 AD     1994     24         NA          NA          NA           NA
##  7 AD     1996      8         NA          NA           0            0
##  8 AD     1997     17         NA          NA           0            0
##  9 AD     1998      1         NA          NA           0            0
## 10 AD     1999      4         NA          NA           0            0
## # 5,736 more rows
## # 16 more variables: new_sp_m2534 <dbl>, new_sp_m3544 <dbl>,
## #   new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>,
## #   new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sp_fu <dbl>

B. Create a subset of tb containing only the records for Canada (“CA” in the iso2 variable). Save it in a new dataframe called tbCan. Make sure this new df has 29 observations and 23 variables.

tbCan <- tb[tb$iso2 == "CA", ]
str(tbCan)

## tibble [29 × 23] (S3: tbl_df/tbl/data.frame)
##  $ iso2        : chr [1:29] "CA" "CA" "CA" "CA" ...
##  $ year        : num [1:29] 1980 1981 1982 1983 1984 ...
##  $ new_sp      : num [1:29] 951 803 812 771 811 791 752 668 682 652 ...
##  $ new_sp_m04  : num [1:29] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_m514 : num [1:29] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_m014 : num [1:29] 12 8 6 9 3 11 9 9 4 10 ...
##  $ new_sp_m1524: num [1:29] 54 49 52 47 44 42 58 40 43 45 ...
##  $ new_sp_m2534: num [1:29] 75 61 66 63 75 70 73 71 73 56 ...
##  $ new_sp_m3544: num [1:29] 83 64 69 62 58 59 62 60 62 60 ...
##  $ new_sp_m4554: num [1:29] 100 87 90 90 68 77 59 49 52 54 ...
##  $ new_sp_m5564: num [1:29] 108 103 91 92 83 81 73 64 68 62 ...
##  $ new_sp_m65  : num [1:29] 186 141 150 123 169 168 147 129 131 122 ...
##  $ new_sp_mu   : num [1:29] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_f04  : num [1:29] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_f514 : num [1:29] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_sp_f014 : num [1:29] 18 6 7 11 9 5 10 8 6 6 ...
##  $ new_sp_f1524: num [1:29] 62 46 51 50 51 30 33 39 38 37 ...
##  $ new_sp_f2534: num [1:29] 51 57 57 50 59 56 54 48 56 51 ...
##  $ new_sp_f3544: num [1:29] 34 26 30 29 28 19 33 29 27 23 ...
##  $ new_sp_f4554: num [1:29] 31 28 25 24 28 28 20 17 16 24 ...
##  $ new_sp_f5564: num [1:29] 33 35 38 35 36 48 26 26 26 21 ...
##  $ new_sp_f65  : num [1:29] 104 92 80 86 100 97 95 79 80 81 ...
##  $ new_sp_fu   : num [1:29] NA NA NA NA NA NA NA NA NA NA ...
C. A simple method for dealing with small amounts of missing data in a numeric variable is to substitute the mean of the variable in place of each missing datum. This expression locates (and reports to the console) all the missing data elements in the variable measuring the number of positive pulmonary smear tests for male children 0-4 years old (there are 26 data points missing):

tbCan$new_sp_m04[is.na(tbCan$new_sp_m04)]

##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA

Error in eval(expr, envir, enclos): object 'tbCan' not found
Traceback:

D. Write a comment describing how that statement works.

## The statement above uses is.na() to flag the missing entries in the selected column, then indexes the column with those flags, so it returns every NA element (26 here) ##

E. Write 4 more statements to check if there is missing data for the number of positive pulmonary smear tests for: male and female children 0-14 years old (new_sp_m014 and new_sp_f014), and male and female citizens 65 years of age and older, respectively. What does empty output suggest about the number of missing observations?

youngM <- tbCan$new_sp_m014[is.na(tbCan$new_sp_m014)]
youngM

## numeric(0)

youngF <- tbCan$new_sp_f014[is.na(tbCan$new_sp_f014)]
youngF

## numeric(0)

oldM <- tbCan$new_sp_m65[is.na(tbCan$new_sp_m65)]
oldM

## numeric(0)

oldF <- tbCan$new_sp_f65[is.na(tbCan$new_sp_f65)]
oldF

## numeric(0)
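The indexing pattern from parts C to E, and the mean substitution mentioned in part C, can be sketched on a toy vector. This is only an illustration: the vector x and its values are made up and stand in for a column such as tbCan$new_sp_m04.

```r
# Toy vector standing in for one column of tbCan (values are made up).
x <- c(NA, 4, NA, 10, NA)

missing <- is.na(x)         # logical vector: TRUE where a value is missing
x[missing]                  # the missing elements themselves
n_missing <- sum(missing)   # number of missing values (TRUE counts as 1)

# Mean substitution: replace each NA with the mean of the observed values.
x_filled <- x
x_filled[missing] <- mean(x, na.rm = TRUE)
x_filled
```

When the logical index selects nothing, R prints an empty vector such as numeric(0), which is why the empty output in part E indicates that those four columns have no missing values for Canada.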
There is an R package called imputeTS specifically designed to repair missing values in time series data. We will use this instead of mean substitution. The na_interpolation() function in this package takes advantage of a unique characteristic of time series data: neighboring points in time can be used to “guess” about a missing value in between.

F. Install the imputeTS package (if needed) and use na_interpolation() on the variable from part C. Don’t forget that you need to save the results back to the tbCan dataframe. Also update any attribute discussed in part E (if needed).

library(imputeTS)

## Warning: package 'imputeTS' was built under R version 4.3.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

tbCan$new_sp_m04 <- na_interpolation(tbCan$new_sp_m04)
tbCan$new_sp_m014 <- na_interpolation(tbCan$new_sp_m014)
tbCan$new_sp_m65 <- na_interpolation(tbCan$new_sp_m65)
tbCan$new_sp_f014 <- na_interpolation(tbCan$new_sp_f014)
tbCan$new_sp_f65 <- na_interpolation(tbCan$new_sp_f65)

G. Rerun the code from C and E above to check that all missing data have been fixed.

youngM2 <- tbCan$new_sp_m014[is.na(tbCan$new_sp_m014)]
youngM2

## numeric(0)

youngF2 <- tbCan$new_sp_f014[is.na(tbCan$new_sp_f014)]
youngF2

## numeric(0)

oldM2 <- tbCan$new_sp_m65[is.na(tbCan$new_sp_m65)]
oldM2

## numeric(0)

oldF2 <- tbCan$new_sp_f65[is.na(tbCan$new_sp_f65)]
oldF2

## numeric(0)
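To see what “guessing from neighboring points” means, here is a minimal base R sketch of linear interpolation using approx(). It is a stand-in for illustration, not the imputeTS implementation: the series y is made up, and the edge handling (rule = 2, which repeats the nearest observed value) is an assumption chosen for this sketch.

```r
# Toy yearly series with gaps (values are made up, not taken from the WHO data).
y <- c(NA, 10, NA, NA, 16, NA)
idx <- seq_along(y)
observed <- !is.na(y)

# Linear interpolation between observed neighbours. rule = 2 repeats the
# nearest observed value for gaps at the start or end of the series
# (an assumption made for this sketch).
y_filled <- approx(x = idx[observed], y = y[observed], xout = idx, rule = 2)$y
y_filled
```

Interior gaps are filled along the straight line between their neighbours, while leading and trailing gaps copy the closest observation. For a column like new_sp_m04, where 26 of the 29 Canadian values are missing, any interpolated series rests on only three observations and is probably best read with caution.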
Step 3: Use ggplot to explore the distribution of each variable

Don’t forget to install and library the ggplot2 package. Then:

H. Create a histogram for new_sp_m014. Be sure to add a title and briefly describe what the histogram means in a comment.

library(ggplot2)
hist(tbCan$new_sp_m014, main="Histogram for Positive Cases Males 14y or younger ", xlab="Cases")

I. Create histograms (using ggplot) of each of the other three variables from E with ggplot(). Which parameter do you need to adjust to make the other histograms look right?

tbCan %>% ggplot() +
  geom_histogram(binwidth = 1, fill="red", color="white", aes(x=new_sp_m014)) +
  ggtitle('Histogram for Positive Cases Males 14y or younger in CA')
tbCan %>% ggplot() +
  geom_histogram(binwidth = 1, fill="orange", color="white", aes(x=new_sp_f014)) +
  ggtitle('Histogram for Positive Cases Females 14y or younger in CA')
tbCan %>% ggplot() +
  geom_histogram(binwidth = 1, fill="blue", color="white", aes(x=new_sp_m65)) +
  ggtitle('Histogram for Positive Cases Male 65+ in CA')
tbCan %>% ggplot() +
  geom_histogram(binwidth = 1, fill="grey", color="white", aes(x=new_sp_f65)) +
  ggtitle('Histogram for Positive Cases Female 65+ in CA')
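Part I asks which parameter to adjust; binwidth (or bins) in geom_histogram() is the usual candidate. The base R sketch below gives a rough feel for why: it uses the first ten Canadian values printed by str(tbCan) earlier and a hypothetical helper, bins_needed(), that estimates how many bins of a given width are needed to span the data. It does not reproduce ggplot2's exact binning.

```r
# First ten Canadian values of two variables, copied from the str(tbCan) output.
m014 <- c(12, 8, 6, 9, 3, 11, 9, 9, 4, 10)
m65  <- c(186, 141, 150, 123, 169, 168, 147, 129, 131, 122)

# Hypothetical helper: rough number of bins needed to span x at a given width.
bins_needed <- function(x, binwidth) ceiling(diff(range(x)) / binwidth) + 1

bins_needed(m014, binwidth = 1)    # narrow range: a handful of bins
bins_needed(m65,  binwidth = 1)    # wide range: many bins, most of them empty
bins_needed(m65,  binwidth = 10)   # wider bins: fewer, fuller bars
```

With only 29 observations per variable, a binwidth of 1 on the 65+ counts spreads the data over far more bins than there are points, so a larger binwidth is likely to show the shape of the distribution more clearly.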
Step 4: Explore how the data changes over time

J. These data were collected in a period of several decades (1980-2013). You can thus observe changes over time with the help of a line chart. Create a line chart, with year on the X-axis and new_sp_m014 on the Y-axis.

tbCan %>% ggplot() +
  geom_line(color = 'green', aes(x=year, y=new_sp_m014)) +
  ggtitle('Line Graph for Positive Cases Males 14y or younger in CA')
K. Next, create similar graphs for each of the other three variables. Change the color of the line plots (any color you want).

tbCan %>% ggplot() +
  geom_line(color = 'blue', aes(x=year, y=new_sp_f014)) +
  ggtitle('Line Graph for Positive Cases Females 14y or younger in CA')
tbCan %>% ggplot() +
  geom_line(color = 'purple', aes(x=year, y=new_sp_m65)) +
  ggtitle('Line Graph for Positive Cases Males 65+ in CA')
tbCan %>% ggplot() +
  geom_line(color = 'black', aes(x=year, y=new_sp_f65)) +
  ggtitle('Line Graph for Positive Cases Females 65+ in CA')
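Parts J and K repeat the same geom_line() call with a different column, color, and title each time. One way to cut the repetition, sketched here under assumptions, is a small helper function. The name plot_cases_by_year() and the axis labels are hypothetical and not part of the assignment; the demo data frame holds the first five Canadian rows shown in the str(tbCan) output.

```r
library(ggplot2)

# Hypothetical helper (not part of the assignment): one line chart per column.
# `column` is the column name as a string; .data[[column]] looks it up in df.
plot_cases_by_year <- function(df, column, line_color, title) {
  ggplot(df, aes(x = year, y = .data[[column]])) +
    geom_line(color = line_color) +
    labs(title = title, x = "Year", y = "Positive smear tests")
}

# Small stand-in data frame: first five Canadian rows from str(tbCan).
demo <- data.frame(
  year = 1980:1984,
  new_sp_f014 = c(18, 6, 7, 11, 9),
  new_sp_m65 = c(186, 141, 150, 123, 169)
)

p1 <- plot_cases_by_year(demo, "new_sp_f014", "blue", "Females 14y or younger in CA")
p2 <- plot_cases_by_year(demo, "new_sp_m65", "purple", "Males 65+ in CA")
```

Printing p1 or p2 draws the chart; passing tbCan instead of demo would plot the full series.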
L. Using vector math, create a new variable by combining the numbers from new_sp_m014 and new_sp_f014. Save the resulting vector as a new variable in the tbCan df called new_sp_combined014. This new variable represents the number of positive pulmonary smear tests for male AND female children between the ages of 0 and 14 years of age. Do the same for SP tests among citizens 65 years of age and older and save the resulting vector in the tbCan variable called new_sp_combined65.

tbCan$new_sp_combined014 <- tbCan$new_sp_m014 + tbCan$new_sp_f014
tbCan$new_sp_combined65 <- tbCan$new_sp_m65 + tbCan$new_sp_f65

M. Finally, create a scatter plot, showing new_sp_combined014 on the x axis, new_sp_combined65 on the y axis, and having the color and size of the point represent year.

tbCan %>% ggplot() +
  geom_point() +
  aes(x=new_sp_combined014, y=new_sp_combined65, size=year, color=year) +
  ggtitle('Scatter Plot of Positive Test for M/F 0-14 and 65 +')
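The vector math in part L adds the two columns element by element, one year at a time. A quick hand check on the first ten Canadian values printed by str(tbCan) illustrates this; the short variable names below are used only for this sketch.

```r
# First ten Canadian values (1980 onward), copied from the str(tbCan) output.
m014 <- c(12, 8, 6, 9, 3, 11, 9, 9, 4, 10)
f014 <- c(18, 6, 7, 11, 9, 5, 10, 8, 6, 6)
m65  <- c(186, 141, 150, 123, 169, 168, 147, 129, 131, 122)
f65  <- c(104, 92, 80, 86, 100, 97, 95, 79, 80, 81)

# "+" on two equal-length vectors adds matching positions.
combined014 <- m014 + f014
combined65  <- m65 + f65

combined014
combined65
```

In every one of these ten years the combined 65+ count is several times larger than the combined 0-14 count.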
N. Interpret this visualization – what insight does it provide?

## The older population (65+) had more positive TB tests than children (0-14), and the number of positive tests is decreasing over time