IST 687 HW6.knit

School

Syracuse University

Course

687

Subject

Computer Science

Date

Dec 6, 2023

11/29/23, 8:11 PM Elyse_Peterson_HW6.knit file:///C:/Users/empet/OneDrive/Desktop/Intro to Data Science/Elyse_Peterson_HW6.html

Intro to Data Science - HW 6

Copyright Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Elyse Peterson

Attribution statement: (choose only one and delete the rest)

# 3. I did this homework with help from Sarah Morris but did not cut and paste any code.

Last assignment we explored data visualization in R using the ggplot2 package. This homework continues to use ggplot, but this time, with maps. In addition, we will merge datasets using the built-in merge() function, which provides a similar capability to a JOIN in SQL (don’t worry if you do not know SQL). Many analytical strategies require joining data from different sources based on a “key” – a field that two datasets have in common.

Step 1: Load the population data

A. Read the following JSON file, https://intro-datascience.s3.us-east-2.amazonaws.com/cities.json, and store it in a variable called pop. Examine the resulting pop dataframe and add comments explaining what each column contains.

    library(jsonlite)
    library(tidyverse)
    ## Warning: package 'ggplot2' was built under R version 4.3.2
    ## Warning: package 'readr' was built under R version 4.3.2
    ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ## ✔ dplyr     1.1.3     ✔ readr     2.1.4
    ## ✔ forcats   1.0.0     ✔ stringr   1.5.0
    ## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
    ## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
    ## ✔ purrr     1.0.2
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter()  masks stats::filter()
    ## ✖ purrr::flatten() masks jsonlite::flatten()
    ## ✖ dplyr::lag()     masks stats::lag()
    ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    library(ggplot2)
    library(maps)

    ## Warning: package 'maps' was built under R version 4.3.2
    ##
    ## Attaching package: 'maps'
    ##
    ## The following object is masked from 'package:purrr':
    ##
    ##     map
    pop <- fromJSON("https://intro-datascience.s3.us-east-2.amazonaws.com/cities.json")
    head(pop)
    ##           city growth_from_2000_to_2013 latitude  longitude population rank
    ## 1     New York                     4.8% 40.71278  -74.00594    8405837    1
    ## 2  Los Angeles                     4.8% 34.05223 -118.24368    3884307    2
    ## 3      Chicago                    -6.1% 41.87811  -87.62980    2718782    3
    ## 4      Houston                    11.0% 29.76043  -95.36980    2195914    4
    ## 5 Philadelphia                     2.6% 39.95258  -75.16522    1553165    5
    ## 6      Phoenix                    14.0% 33.44838 -112.07404    1513367    6
    ##          state
    ## 1     New York
    ## 2   California
    ## 3     Illinois
    ## 4        Texas
    ## 5 Pennsylvania
    ## 6      Arizona

B. Calculate the average population in the dataframe. Why is using mean() directly not working? Find a way to correct the data type of this variable so you can calculate the average (and then calculate the average). Hint: use str(pop) or glimpse(pop) to help understand the dataframe.

    glimpse(pop)
    ## Rows: 1,000
    ## Columns: 7
    ## $ city                     <chr> "New York", "Los Angeles", "Chicago", "Housto…
    ## $ growth_from_2000_to_2013 <chr> "4.8%", "4.8%", "-6.1%", "11.0%", "2.6%", "14…
    ## $ latitude                 <dbl> 40.71278, 34.05223, 41.87811, 29.76043, 39.95…
    ## $ longitude                <dbl> -74.00594, -118.24368, -87.62980, -95.36980, …
    ## $ population               <chr> "8405837", "3884307", "2718782", "2195914", "…
    ## $ rank                     <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
    ## $ state                    <chr> "New York", "California", "Illinois", "Texas"…
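The glimpse() output shows population, rank, and growth_from_2000_to_2013 stored as <chr>, which is why mean() cannot be applied directly. The following is a minimal sketch of the conversion on a small hypothetical data frame (toy is made up for illustration, with values copied from the first rows shown above). Stripping the percent sign with sub() is an assumed extra step for the growth column, not something the assignment asks for.

```r
# Hypothetical toy data mirroring the character columns shown by glimpse(pop)
toy <- data.frame(
  city = c("New York", "Los Angeles", "Chicago"),
  growth_from_2000_to_2013 = c("4.8%", "4.8%", "-6.1%"),
  population = c("8405837", "3884307", "2718782"),
  stringsAsFactors = FALSE
)

# mean() on a character vector returns NA (with a warning) rather than an average
bad_mean <- suppressWarnings(mean(toy$population))

# Convert the digit strings to numbers, then average
toy$population <- as.numeric(toy$population)
good_mean <- mean(toy$population)

# Assumed extra step: drop the trailing "%" so growth can also be used as a number
toy$growth_pct <- as.numeric(sub("%", "", toy$growth_from_2000_to_2013, fixed = TRUE))

print(bad_mean)
print(good_mean)
print(toy$growth_pct)
```

The same as.numeric() call could be applied to rank if it were needed for sorting or arithmetic.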

    ## mean() isn't working because the population column is stored as character, not numeric.
    ## Converting the column to numeric should take care of this.
    pop$population <- as.numeric(pop$population)
    mean(pop$population)
    ## [1] 131132.4

C. What is the population of the smallest city in the dataframe? Which state is it in?

    min(pop$population)
    ## [1] 36877
    pop[pop$population == min(pop$population),]
    ##             city growth_from_2000_to_2013 latitude longitude population rank
    ## 1000 Panama City                     0.1% 30.15881 -85.66021      36877 1000
    ##        state
    ## 1000 Florida
    ## The smallest city in the list is Panama City, located in Florida.

Step 2: Merge the population data with the state name data

D. Read in the state name .csv file from the URL below into a dataframe named abbr (for “abbreviation”) – make sure to use the read_csv() function from the tidyverse package: https://intro-datascience.s3.us-east-2.amazonaws.com/statesInfo.csv

    abbr <- read_csv("https://intro-datascience.s3.us-east-2.amazonaws.com/statesInfo.csv")
    ## Rows: 51 Columns: 2
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## chr (2): State, Abbreviation
    ##
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    abbr

    ## # A tibble: 51 × 2
    ##    State                Abbreviation
    ##    <chr>                <chr>
    ##  1 Alabama              AL
    ##  2 Alaska               AK
    ##  3 Arizona              AZ
    ##  4 Arkansas             AR
    ##  5 California           CA
    ##  6 Colorado             CO
    ##  7 Connecticut          CT
    ##  8 Delaware             DE
    ##  9 District of Columbia DC
    ## 10 Florida              FL
    ## # ℹ 41 more rows

E. To successfully merge the dataframe pop with the abbr dataframe, we need to identify a column they have in common which will serve as the “key” to merge on. One column both dataframes have is the state column. The only problem is the slight column name discrepancy – in pop, the column is called “state” and in abbr – “State.” These names need to be reconciled for the merge() function to work. Find a way to rename abbr’s “State” to match the state column in pop.

    colnames(pop)
    ## [1] "city"                     "growth_from_2000_to_2013"
    ## [3] "latitude"                 "longitude"
    ## [5] "population"               "rank"
    ## [7] "state"
    names(abbr)[names(abbr) == "State"] <- "state"
    abbr
    ## # A tibble: 51 × 2
    ##    state                Abbreviation
    ##    <chr>                <chr>
    ##  1 Alabama              AL
    ##  2 Alaska               AK
    ##  3 Arizona              AZ
    ##  4 Arkansas             AR
    ##  5 California           CA
    ##  6 Colorado             CO
    ##  7 Connecticut          CT
    ##  8 Delaware             DE
    ##  9 District of Columbia DC
    ## 10 Florida              FL
    ## # ℹ 41 more rows

F. Merge the two dataframes (using the ‘state’ column from both dataframes), storing the resulting dataframe in dfNew.
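Renaming the column, as in step E, is one way to reconcile the keys. As a hedged alternative, base R's merge() also accepts by.x and by.y, so the two key columns can keep different names. The data frames below (cities, states) are hypothetical toy stand-ins for pop and abbr, used only to show the behaviour.

```r
# Hypothetical stand-ins for pop and abbr, with differently named key columns
cities <- data.frame(
  city = c("Auburn", "Phoenix", "Houston"),
  state = c("Alabama", "Arizona", "Texas"),
  stringsAsFactors = FALSE
)
states <- data.frame(
  State = c("Alabama", "Arizona", "Texas", "Vermont"),
  Abbreviation = c("AL", "AZ", "TX", "VT"),
  stringsAsFactors = FALSE
)

# Join on "state" (left) and "State" (right); the result keeps the left-hand name
joined <- merge(cities, states, by.x = "state", by.y = "State")

# merge() defaults to an inner join: Vermont has no matching city, so it is dropped
print(joined)
```

Passing all.y = TRUE would instead keep unmatched rows from the second data frame, filling the missing city with NA.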

    dfNew <- merge(pop, abbr, by = "state")
    head(dfNew)
    ##     state        city growth_from_2000_to_2013 latitude longitude population
    ## 1 Alabama      Auburn                    26.4% 32.60986 -85.48078      58582
    ## 2 Alabama    Florence                    10.2% 34.79981 -87.67725      40059
    ## 3 Alabama  Huntsville                    16.3% 34.73037 -86.58610     186254
    ## 4 Alabama      Dothan                    16.6% 31.22323 -85.39049      68001
    ## 5 Alabama  Birmingham                   -12.3% 33.52066 -86.80249     212113
    ## 6 Alabama Phenix City                    31.9% 32.47098 -85.00077      37498
    ##   rank Abbreviation
    ## 1  615           AL
    ## 2  922           AL
    ## 3  126           AL
    ## 4  502           AL
    ## 5  101           AL
    ## 6  983           AL

G. Review the structure of dfNew and explain the columns (aka attributes) in that dataframe.

    str(dfNew)
    ## 'data.frame':    1000 obs. of  8 variables:
    ##  $ state                   : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
    ##  $ city                    : chr  "Auburn" "Florence" "Huntsville" "Dothan" ...
    ##  $ growth_from_2000_to_2013: chr  "26.4%" "10.2%" "16.3%" "16.6%" ...
    ##  $ latitude                : num  32.6 34.8 34.7 31.2 33.5 ...
    ##  $ longitude               : num  -85.5 -87.7 -86.6 -85.4 -86.8 ...
    ##  $ population              : num  58582 40059 186254 68001 212113 ...
    ##  $ rank                    : chr  "615" "922" "126" "502" ...
    ##  $ Abbreviation            : chr  "AL" "AL" "AL" "AL" ...
    # These are all of the columns in pop, plus the state abbreviation from abbr.

Step 3: Visualize the data

H. Plot points (on top of a map of the US) for each city. Have the color represent the population.

    us <- map_data("state")
    ggplot() +
      geom_polygon(data = us, color = "lightblue", fill = "white", aes(x = long, y = lat, group = group)) +
      coord_map() +
      geom_point(data = dfNew, aes(x = longitude, y = latitude, color = population))
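map_data("state") covers only the contiguous states and the District of Columbia, so any cities in Alaska or Hawaii are drawn far outside the base map. One possible refinement, sketched here on a hypothetical toy data frame rather than on dfNew itself, is to drop those rows before plotting. The Anchorage and Honolulu coordinates are approximate values supplied for illustration.

```r
# Hypothetical toy rows; New York and Chicago coordinates come from head(pop),
# Anchorage and Honolulu are approximate illustrative values
toy <- data.frame(
  city = c("New York", "Chicago", "Anchorage", "Honolulu"),
  state = c("New York", "Illinois", "Alaska", "Hawaii"),
  latitude = c(40.71278, 41.87811, 61.2181, 21.3069),
  longitude = c(-74.00594, -87.62980, -149.9003, -157.8583),
  stringsAsFactors = FALSE
)

# Keep only cities that fall on the lower-48 base map
contiguous <- toy[!(toy$state %in% c("Alaska", "Hawaii")), ]

print(contiguous$city)
print(range(contiguous$longitude))
```

The same subsetting could be applied to dfNew before it is passed to the geom_point() layer.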

I. Add a block comment that criticizes the resulting map. It’s not very good.

    ## Although the data points are on the map, Hawaii and Alaska skew the view and make it hard to actually see the data points.

Step 4: Group by State

J. Use group_by and summarise to make a dataframe of state-by-state population. Store the result in dfSimple.

    dfSimple <- dfNew %>% group_by(state) %>% summarize(population = sum(population))
    head(dfSimple)

    ## # A tibble: 6 × 2
    ##   state      population
    ##   <chr>           <dbl>
    ## 1 Alabama       1279813
    ## 2 Alaska         300950
    ## 3 Arizona       4691466
    ## 4 Arkansas       787011
    ## 5 California   27910620
    ## 6 Colorado      3012284

K. Name the most and least populous states in dfSimple and show the code you used to determine them.

    dfSimple[dfSimple$population == min(dfSimple$population),] # the state with the smallest total (summed over its listed cities) is Vermont
    ## # A tibble: 1 × 2
    ##   state   population
    ##   <chr>        <dbl>
    ## 1 Vermont      42284
    dfSimple[dfSimple$population == max(dfSimple$population),] # the state with the largest total is California
    ## # A tibble: 1 × 2
    ##   state      population
    ##   <chr>           <dbl>
    ## 1 California   27910620

Step 5: Create a map of the U.S., with the color of the state representing the state population

L. Make sure to expand the limits correctly and that you have used coord_map appropriately.

    dfSimple$region <- tolower(dfSimple$state)
    # merge() has no region argument; the shared key column is named with by
    usMap <- merge(us, dfSimple, by = "region")
    usMap %>% ggplot() + geom_polygon(aes(x = long, y = lat, group = group, fill = population)) + coord_map()
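merge() sorts its result by the key column and does not promise to preserve the original row order, while geom_polygon() connects vertices in row order. A common safeguard is to re-sort the merged data by the order column that map_data() supplies. The sketch below uses a hypothetical two-square table (shapes) in place of the real map data, with the Alabama and Arizona totals taken from head(dfSimple); it is an illustration under those assumptions, not the assignment's required solution.

```r
# Hypothetical stand-in for map_data("state"): two square "states",
# each with four vertices numbered by a global drawing order
shapes <- data.frame(
  long   = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat    = c(0, 0, 1, 1, 0, 0, 1, 1),
  group  = c(1, 1, 1, 1, 2, 2, 2, 2),
  order  = 1:8,
  region = rep(c("arizona", "alabama"), each = 4),
  stringsAsFactors = FALSE
)

# State totals copied from head(dfSimple), keyed by a lower-case region column
totals <- data.frame(
  state = c("Alabama", "Arizona"),
  population = c(1279813, 4691466),
  stringsAsFactors = FALSE
)
totals$region <- tolower(totals$state)

# Merging sorts by region, which moves the "alabama" vertices ahead of "arizona"
merged <- merge(shapes, totals, by = "region")

# Restore the original vertex sequence before handing the data to geom_polygon()
merged <- merged[order(merged$order), ]

print(merged[, c("region", "order", "population")])
```

In the assignment's terms, the equivalent step would be usMap <- usMap[order(usMap$order), ] before the ggplot() call.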
