IST 687 HW6.knit

School

Syracuse University

Course

687

Subject

Computer Science

Date

Dec 6, 2023

11/29/23, 8:11 PM Elyse_Peterson_HW6.knit file:///C:/Users/empet/OneDrive/Desktop/Intro to Data Science/Elyse_Peterson_HW6.html 1/8

Intro to Data Science - HW 6

Copyright Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Elyse Peterson

Attribution statement: (choose only one and delete the rest)

# 3. I did this homework with help from Sarah Morris but did not cut and paste any code.

Last assignment we explored data visualization in R using the ggplot2 package. This homework continues to use ggplot, but this time, with maps. In addition, we will merge datasets using the built-in merge( ) function, which provides a similar capability to a JOIN in SQL (don’t worry if you do not know SQL). Many analytical strategies require joining data from different sources based on a “key”: a field that two datasets have in common.

Step 1: Load the population data

A. Read the following JSON file, https://intro-datascience.s3.us-east-2.amazonaws.com/cities.json, and store it in a variable called pop. Examine the resulting pop dataframe and add comments explaining what each column contains.

    library(jsonlite)
    library(tidyverse)

    ## Warning: package 'ggplot2' was built under R version 4.3.2
    ## Warning: package 'readr' was built under R version 4.3.2
    ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ## dplyr     1.1.3     readr     2.1.4
    ## forcats   1.0.0     stringr   1.5.0
    ## ggplot2   3.4.4     tibble    3.2.1
    ## lubridate 1.9.3     tidyr     1.3.0
    ## purrr     1.0.2
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## dplyr::filter()  masks stats::filter()
    ## purrr::flatten() masks jsonlite::flatten()
    ## dplyr::lag()     masks stats::lag()
    ## Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

    library(ggplot2)
    library(maps)
    ## Warning: package 'maps' was built under R version 4.3.2
    ##
    ## Attaching package: 'maps'
    ##
    ## The following object is masked from 'package:purrr':
    ##
    ##     map

    pop <- fromJSON("https://intro-datascience.s3.us-east-2.amazonaws.com/cities.json")
    head(pop)

    ##           city growth_from_2000_to_2013 latitude  longitude population rank
    ## 1     New York                     4.8% 40.71278  -74.00594    8405837    1
    ## 2  Los Angeles                     4.8% 34.05223 -118.24368    3884307    2
    ## 3      Chicago                    -6.1% 41.87811  -87.62980    2718782    3
    ## 4      Houston                    11.0% 29.76043  -95.36980    2195914    4
    ## 5 Philadelphia                     2.6% 39.95258  -75.16522    1553165    5
    ## 6      Phoenix                    14.0% 33.44838 -112.07404    1513367    6
    ##          state
    ## 1     New York
    ## 2   California
    ## 3     Illinois
    ## 4        Texas
    ## 5 Pennsylvania
    ## 6      Arizona

B. Calculate the average population in the dataframe. Why is using mean() directly not working? Find a way to correct the data type of this variable so you can calculate the average (and then calculate the average). Hint: use str(pop) or glimpse(pop) to help understand the dataframe.

    glimpse(pop)

    ## Rows: 1,000
    ## Columns: 7
    ## $ city                     <chr> "New York", "Los Angeles", "Chicago", "Housto…
    ## $ growth_from_2000_to_2013 <chr> "4.8%", "4.8%", "-6.1%", "11.0%", "2.6%", "14…
    ## $ latitude                 <dbl> 40.71278, 34.05223, 41.87811, 29.76043, 39.95…
    ## $ longitude                <dbl> -74.00594, -118.24368, -87.62980, -95.36980, …
    ## $ population               <chr> "8405837", "3884307", "2718782", "2195914", "…
    ## $ rank                     <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
    ## $ state                    <chr> "New York", "California", "Illinois", "Texas"…
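The glimpse() output shows population stored as <chr>. As a minimal sketch of why that matters, the toy vector below holds the first three population values from the head(pop) output, typed in by hand as strings. The names pop_chr, pop_num, bad_mean, and good_mean are illustrative only and are not part of the assignment.

```r
# Toy stand-in for pop$population: numbers stored as text
# (values copied by hand from the head(pop) output).
pop_chr <- c("8405837", "3884307", "2718782")

# mean() on a character vector returns NA and emits a warning;
# the warning is suppressed here only to keep the output quiet.
bad_mean <- suppressWarnings(mean(pop_chr))

# Convert to numeric first, then average.
pop_num <- as.numeric(pop_chr)
good_mean <- mean(pop_num)

print(bad_mean)
print(good_mean)
```

The same as.numeric() conversion would apply to rank, which glimpse() also reports as character.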
    # mean() isn't working because the population column is stored as character, not numeric.
    # Converting the column to numeric should take care of this.
    pop$population <- as.numeric(pop$population)
    mean(pop$population)

    ## [1] 131132.4

C. What is the population of the smallest city in the dataframe? Which state is it in?

    min(pop$population)

    ## [1] 36877

    pop[pop$population == min(pop$population), ]

    ##             city growth_from_2000_to_2013 latitude longitude population rank
    ## 1000 Panama City                     0.1% 30.15881 -85.66021      36877 1000
    ##        state
    ## 1000 Florida

    # The smallest city in the list is Panama City, located in Florida.

Step 2: Merge the population data with the state name data

D. Read in the state name .csv file from the URL below into a dataframe named abbr (for “abbreviation”); make sure to use the read_csv() function from the tidyverse package: https://intro-datascience.s3.us-east-2.amazonaws.com/statesInfo.csv

    abbr <- read_csv("https://intro-datascience.s3.us-east-2.amazonaws.com/statesInfo.csv")

    ## Rows: 51 Columns: 2
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## chr (2): State, Abbreviation
    ##
    ## Use `spec()` to retrieve the full column specification for this data.
    ## Specify the column types or set `show_col_types = FALSE` to quiet this message.

    abbr
    ## # A tibble: 51 × 2
    ##    State                Abbreviation
    ##    <chr>                <chr>
    ##  1 Alabama              AL
    ##  2 Alaska               AK
    ##  3 Arizona              AZ
    ##  4 Arkansas             AR
    ##  5 California           CA
    ##  6 Colorado             CO
    ##  7 Connecticut          CT
    ##  8 Delaware             DE
    ##  9 District of Columbia DC
    ## 10 Florida              FL
    ## # 41 more rows

E. To successfully merge the dataframe pop with the abbr dataframe, we need to identify a column they have in common which will serve as the “key” to merge on. One column both dataframes have is the state column. The only problem is the slight column name discrepancy: in pop, the column is called “state” and in abbr “State.” These names need to be reconciled for the merge() function to work. Find a way to rename abbr’s “State” to match the state column in pop.

    colnames(pop)

    ## [1] "city"                     "growth_from_2000_to_2013"
    ## [3] "latitude"                 "longitude"
    ## [5] "population"               "rank"
    ## [7] "state"

    names(abbr)[names(abbr) == "State"] <- "state"
    abbr

    ## # A tibble: 51 × 2
    ##    state                Abbreviation
    ##    <chr>                <chr>
    ##  1 Alabama              AL
    ##  2 Alaska               AK
    ##  3 Arizona              AZ
    ##  4 Arkansas             AR
    ##  5 California           CA
    ##  6 Colorado             CO
    ##  7 Connecticut          CT
    ##  8 Delaware             DE
    ##  9 District of Columbia DC
    ## 10 Florida              FL
    ## # 41 more rows

F. Merge the two dataframes (using the ‘state’ column from both dataframes), storing the resulting dataframe in dfNew.
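As a rough illustration of what merging on a shared key does, the sketch below uses two small hand-made data frames. The names toy_pop, toy_abbr, merged, and merged_all are hypothetical, and the rows are copied from the outputs above. By default merge() keeps only rows whose key appears in both inputs, much like an inner join in SQL, so a state missing from the lookup table would drop its cities unless all.x = TRUE is set.

```r
# Toy version of pop: three cities, keyed by state name.
toy_pop <- data.frame(
  city  = c("New York", "Los Angeles", "Chicago"),
  state = c("New York", "California", "Illinois"),
  stringsAsFactors = FALSE
)

# Toy version of abbr: Illinois is deliberately left out.
toy_abbr <- data.frame(
  state        = c("California", "New York"),
  Abbreviation = c("CA", "NY"),
  stringsAsFactors = FALSE
)

# Default merge: only states present in both frames survive.
merged <- merge(toy_pop, toy_abbr, by = "state")

# all.x = TRUE keeps every row of toy_pop and fills gaps with NA.
merged_all <- merge(toy_pop, toy_abbr, by = "state", all.x = TRUE)

print(merged)
print(merged_all)
```

Comparing nrow() before and after a merge is a quick way to check whether any rows were lost to unmatched keys.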
    dfNew <- merge(pop, abbr, by = "state")
    head(dfNew)

    ##     state        city growth_from_2000_to_2013 latitude longitude population
    ## 1 Alabama      Auburn                    26.4% 32.60986 -85.48078      58582
    ## 2 Alabama    Florence                    10.2% 34.79981 -87.67725      40059
    ## 3 Alabama  Huntsville                    16.3% 34.73037 -86.58610     186254
    ## 4 Alabama      Dothan                    16.6% 31.22323 -85.39049      68001
    ## 5 Alabama  Birmingham                   -12.3% 33.52066 -86.80249     212113
    ## 6 Alabama Phenix City                    31.9% 32.47098 -85.00077      37498
    ##   rank Abbreviation
    ## 1  615           AL
    ## 2  922           AL
    ## 3  126           AL
    ## 4  502           AL
    ## 5  101           AL
    ## 6  983           AL

G. Review the structure of dfNew and explain the columns (aka attributes) in that dataframe.

    str(dfNew)

    ## 'data.frame':    1000 obs. of  8 variables:
    ##  $ state                   : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
    ##  $ city                    : chr  "Auburn" "Florence" "Huntsville" "Dothan" ...
    ##  $ growth_from_2000_to_2013: chr  "26.4%" "10.2%" "16.3%" "16.6%" ...
    ##  $ latitude                : num  32.6 34.8 34.7 31.2 33.5 ...
    ##  $ longitude               : num  -85.5 -87.7 -86.6 -85.4 -86.8 ...
    ##  $ population              : num  58582 40059 186254 68001 212113 ...
    ##  $ rank                    : chr  "615" "922" "126" "502" ...
    ##  $ Abbreviation            : chr  "AL" "AL" "AL" "AL" ...

    # These are all of the columns in pop, plus the state abbreviation.

Step 3: Visualize the data

H. Plot points (on top of a map of the US) for each city. Have the color represent the population.

    us <- map_data("state")
    ggplot() +
      geom_polygon(data = us, color = "lightblue", fill = "white",
                   aes(x = long, y = lat, group = group)) +
      coord_map() +
      geom_point(data = dfNew, aes(x = longitude, y = latitude, color = population))
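map_data("state") covers only the contiguous states, so any cities in Alaska or Hawaii are drawn far from the polygons and stretch the plot. One possible refinement, sketched here on a toy data frame, is to drop those two states before plotting. The names toy_cities and lower48 are hypothetical, and the Anchorage and Honolulu coordinates are approximate values typed in by hand for illustration, not rows taken from the assignment data.

```r
# Toy stand-in for dfNew with two contiguous-state cities and two that are not.
toy_cities <- data.frame(
  city      = c("New York", "Chicago", "Anchorage", "Honolulu"),
  state     = c("New York", "Illinois", "Alaska", "Hawaii"),
  latitude  = c(40.71278, 41.87811, 61.22, 21.31),     # last two are approximate
  longitude = c(-74.00594, -87.62980, -149.90, -157.86),
  stringsAsFactors = FALSE
)

# Keep only cities whose state is drawn by the "state" map.
lower48 <- toy_cities[!(toy_cities$state %in% c("Alaska", "Hawaii")), ]

print(lower48)
```

A frame filtered this way could be passed to geom_point() in place of dfNew so the points stay within the drawn map.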
I. Add a block comment that criticizes the resulting map. It’s not very good.

    # Although the data points are on the map, Hawaii and Alaska skew the view
    # and make it hard to actually see the data points.

Step 4: Group by State

J. Use group_by and summarise to make a dataframe of state-by-state population. Store the result in dfSimple.

    dfSimple <- dfNew %>%
      group_by(state) %>%
      summarize(population = sum(population))
    head(dfSimple)
    ## # A tibble: 6 × 2
    ##   state      population
    ##   <chr>           <dbl>
    ## 1 Alabama       1279813
    ## 2 Alaska         300950
    ## 3 Arizona       4691466
    ## 4 Arkansas       787011
    ## 5 California   27910620
    ## 6 Colorado      3012284

K. Name the most and least populous states in dfSimple and show the code you used to determine them.

    # The state with the smallest population is Vermont.
    dfSimple[dfSimple$population == min(dfSimple$population), ]

    ## # A tibble: 1 × 2
    ##   state   population
    ##   <chr>        <dbl>
    ## 1 Vermont      42284

    # The state with the largest population is California.
    dfSimple[dfSimple$population == max(dfSimple$population), ]

    ## # A tibble: 1 × 2
    ##   state      population
    ##   <chr>           <dbl>
    ## 1 California   27910620

Step 5: Create a map of the U.S., with the color of the state representing the state population

L. Make sure to expand the limits correctly and that you have used coord_map appropriately.

    # map_data() stores state names in lowercase in its region column,
    # so build a matching lowercase key before merging.
    dfSimple$region <- tolower(dfSimple$state)
    # merge() has no region argument; name the key column with by.
    usMap <- merge(us, dfSimple, by = "region")
    usMap %>%
      ggplot() +
      geom_polygon(aes(x = long, y = lat, group = group, fill = population)) +
      coord_map()
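merge() may return rows in a different order than the map data supplied them, and geom_polygon() connects vertices in row order, so it can help to re-sort by the order column that map_data() provides before plotting. The sketch below uses a tiny hand-made stand-in for the map data to show the three steps together: lowercasing the state name into a region key, merging on it, and restoring vertex order. The names toy_us, toy_simple, and toy_map are hypothetical and the coordinates are made up; the two population values are taken from the outputs above.

```r
# Toy stand-in for map_data("state"): two "states" with three vertices each.
toy_us <- data.frame(
  long   = c(0, 1, 1, 5, 6, 6),
  lat    = c(0, 0, 1, 5, 5, 6),
  group  = c(2, 2, 2, 1, 1, 1),
  order  = 1:6,
  region = c("vermont", "vermont", "vermont", "alabama", "alabama", "alabama"),
  stringsAsFactors = FALSE
)

# Toy stand-in for dfSimple.
toy_simple <- data.frame(
  state      = c("Alabama", "Vermont"),
  population = c(1279813, 42284),
  stringsAsFactors = FALSE
)

# Build the lowercase key, merge on it, then restore the drawing order.
toy_simple$region <- tolower(toy_simple$state)
toy_map <- merge(toy_us, toy_simple, by = "region")
toy_map <- toy_map[order(toy_map$order), ]

print(toy_map)
```

Applied to the real data, the same re-sort would be usMap[order(usMap$order), ] before the ggplot() call. This is one possible safeguard, not necessarily what the assignment intends.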