IST 687 HW3.knit
School: Syracuse University
Course: IST 687
Subject: Computer Science
Date: Dec 6, 2023
11/29/23, 8:10 PM HW3.knit

Intro to Data Science - HW 3
Copyright Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Elyse Peterson

Attribution statement: (choose only one and delete the rest)
# 1. I did this homework by myself, with help from the book and the professor.

Reminders of things to practice from last week:
  • Make a data frame: data.frame()
  • Row index of max/min: which.max(), which.min()
  • Sort values or order rows: sort(), order()
  • Descriptive statistics: mean(), sum(), max()
  • Conditional statement: if (condition) "true stuff" else "false stuff"

This Week: Often, when you get a dataset, it is not in the format you want. You can (and should) use code to refine the dataset to make it more useful. As Chapter 6 of Introduction to Data Science mentions, this is called "data munging." In this homework, you will read a dataset from the web and work on it (in a data frame) to improve its usefulness.

Part 1: Use read_csv() to read a CSV file from the web into a data frame

A. Use R code to read directly from a URL on the web. Store the dataset in a new dataframe, called dfComps. The URL is:
https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv

Hint: use read_csv(), not read.csv(). read_csv() comes from the readr package, part of the tidyverse; check the help pages to compare the two functions.

```r
library(readr)

urlToRead <- "https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv"
dfcomps <- read_csv(url(urlToRead))
```
```
## Rows: 47758 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): permalink, name, homepage_url, category_list, market, funding_tota...
## dbl  (2): funding_rounds, founded_year
##
## Use `spec()` to retrieve the full column specification for this data.
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```r
head(dfcomps)
```

```
## # A tibble: 6 × 18
##   permalink     name  homepage_url category_list market funding_total_usd status
##   <chr>         <chr> <chr>        <chr>         <chr>  <chr>             <chr>
## 1 /organizatio… #way… http://www.… |Entertainme… News   1 750 000         acqui…
## 2 /organizatio… &TV … http://enjo… |Games|       Games  4 000 000         opera…
## 3 /organizatio… 'Roc… http://www.… |Publishing|… Publi… 40 000            opera…
## 4 /organizatio… (In)… http://www.… |Electronics… Elect… 1 500 000         opera…
## 5 /organizatio… #NAM… http://plus… |Software|    Softw… 1 200 000         opera…
## 6 /organizatio… -R- … <NA>         |Entertainme… Games  10 000            opera…
## # 11 more variables: country_code <chr>, state_code <chr>, region <chr>,
## #   city <chr>, funding_rounds <dbl>, founded_at <chr>, founded_month <chr>,
## #   founded_quarter <chr>, founded_year <dbl>, first_funding_at <chr>,
## #   last_funding_at <chr>
```

Part 2: Create a new data frame that only contains companies with a homepage URL

E. Use subsetting to create a new dataframe that contains only the companies with homepage URLs (store that dataframe in urlComps).
```r
urlcomps <- subset(dfcomps, complete.cases(dfcomps$homepage_url))
head(urlcomps)
```

```
## # A tibble: 6 × 18
##   permalink     name  homepage_url category_list market funding_total_usd status
##   <chr>         <chr> <chr>        <chr>         <chr>  <chr>             <chr>
## 1 /organizatio… #way… http://www.… |Entertainme… News   1 750 000         acqui…
## 2 /organizatio… &TV … http://enjo… |Games|       Games  4 000 000         opera…
## 3 /organizatio… 'Roc… http://www.… |Publishing|… Publi… 40 000            opera…
## 4 /organizatio… (In)… http://www.… |Electronics… Elect… 1 500 000         opera…
## 5 /organizatio… #NAM… http://plus… |Software|    Softw… 1 200 000         opera…
## 6 /organizatio… .Clu… http://nic.… |Software|    Softw… 7 000 000         <NA>
## # 11 more variables: country_code <chr>, state_code <chr>, region <chr>,
## #   city <chr>, funding_rounds <dbl>, founded_at <chr>, founded_month <chr>,
## #   founded_quarter <chr>, founded_year <dbl>, first_funding_at <chr>,
## #   last_funding_at <chr>
```
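The same complete.cases()/subset() pattern can be sketched on a tiny, made-up data frame (the company names and URLs below are hypothetical, not drawn from the companies dataset):

```r
# Hypothetical toy data: two of four companies lack a homepage URL
toy <- data.frame(name = c("A", "B", "C", "D"),
                  homepage_url = c("http://a.example", NA, "http://c.example", NA))

# Keep only the rows where homepage_url is present
toyUrl <- subset(toy, complete.cases(toy$homepage_url))

nrow(toyUrl)              # 2 companies kept
nrow(toy) - nrow(toyUrl)  # 2 companies were missing a URL
```

Note that subset(toy, !is.na(homepage_url)) would do the same thing here, since complete.cases() applied to a single column reduces to !is.na() on that column.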
D. How many companies are missing a homepage URL?

```r
library(tidyverse)
```

```
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## dplyr     1.1.3     purrr     1.0.2
## forcats   1.0.0     stringr   1.5.0
## ggplot2   3.4.4     tibble    3.2.1
## lubridate 1.9.3     tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## dplyr::filter() masks stats::filter()
## dplyr::lag()    masks stats::lag()
## Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

```r
count(dfcomps)
```

```
## # A tibble: 1 × 1
##       n
##   <int>
## 1 47758
```

```r
count(urlcomps)
```

```
## # A tibble: 1 × 1
##       n
##   <int>
## 1 44435
```

```r
count(dfcomps) - count(urlcomps)
```

```
##      n
## 1 3323
```

Part 3: Analyze the numeric variables in the dataframe

G. How many numeric variables does the dataframe have? You can figure that out by looking at the output of str(urlComps).

H. What is the average number of funding rounds for the companies in urlComps?

```r
str(urlcomps)
```
```
## tibble [44,435 × 18] (S3: tbl_df/tbl/data.frame)
##  $ permalink        : chr [1:44435] "/organization/waywire" "/organization/tv-communications" "/organization/rock-your-paper" "/organization/in-touch-network" ...
##  $ name             : chr [1:44435] "#waywire" "&TV Communications" "'Rock' Your Paper" "(In) Touch Network" ...
##  $ homepage_url     : chr [1:44435] "http://www.waywire.com" "http://enjoyandtv.com" "http://www.rockyourpaper.org" "http://www.InTouchNetwork.com" ...
##  $ category_list    : chr [1:44435] "|Entertainment|Politics|Social Media|News|" "|Games|" "|Publishing|Education|" "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|" ...
##  $ market           : chr [1:44435] "News" "Games" "Publishing" "Electronics" ...
##  $ funding_total_usd: chr [1:44435] "1 750 000" "4 000 000" "40 000" "1 500 000" ...
##  $ status           : chr [1:44435] "acquired" "operating" "operating" "operating" ...
##  $ country_code     : chr [1:44435] "USA" "USA" "EST" "GBR" ...
##  $ state_code       : chr [1:44435] "NY" "CA" NA NA ...
##  $ region           : chr [1:44435] "New York City" "Los Angeles" "Tallinn" "London" ...
##  $ city             : chr [1:44435] "New York" "Los Angeles" "Tallinn" "London" ...
##  $ funding_rounds   : num [1:44435] 1 2 1 1 2 1 1 1 1 1 ...
##  $ founded_at       : chr [1:44435] "1/6/12" NA "26/10/2012" "1/4/11" ...
##  $ founded_month    : chr [1:44435] "2012-06" NA "2012-10" "2011-04" ...
##  $ founded_quarter  : chr [1:44435] "2012-Q2" NA "2012-Q4" "2011-Q2" ...
##  $ founded_year     : num [1:44435] 2012 NA 2012 2011 2012 ...
##  $ first_funding_at : chr [1:44435] "30/06/2012" "4/6/10" "9/8/12" "1/4/11" ...
##  $ last_funding_at  : chr [1:44435] "30/06/2012" "23/09/2010" "9/8/12" "1/4/11" ...
```

```r
# There are two numeric fields: funding_rounds and founded_year.
mean(urlcomps$funding_rounds)
```

```
## [1] 1.725194
```

I. What year was the oldest company in the dataframe founded?
Hint: If you get a value of "NA," most likely there are missing values in this variable, which prevent R from properly calculating the min and max values. You can ignore NAs in basic math calculations. For example, instead of running mean(urlComps$founded_year), something like the commented line below will work for determining the average (note that this question needs a different function than mean()):

```r
# mean(urlComps$founded_year, na.rm=TRUE)
min(na.omit(dfcomps$founded_year))
```

```
## [1] 1900
```

Part 4: Use string operations to clean the data

K. The permalink variable in urlComps contains the name of each company, but the names are currently preceded by the prefix "/organization/". We can use str_replace() in tidyverse or gsub() to clean the values of this variable:

```r
view(urlcomps$permalink)
urlcomps$name <- gsub("/organization/", "", urlcomps$permalink)
head(urlcomps$name)
```

```
## [1] "waywire"           "tv-communications" "rock-your-paper"
## [4] "in-touch-network"  "n-plusn"           "club-domains"
```

L. Can you identify another variable which should be numeric but is currently coded as character? Use the as.numeric() function to add a new variable to urlComps which contains the values from the char variable as numbers. Do you notice anything about the number of NA values in this new column compared to the original "char" one?

```r
glimpse(urlcomps$funding_total_usd)
```

```
## chr [1:44435] "1 750 000" "4 000 000" "40 000" "1 500 000" "1 200 000" ...
```

```r
# Funding should be numeric (USD), and there are no blank cells in the
# original character column:
funding_null <- is.na(dfcomps$funding_total_usd)
table(funding_null)
```

```
## funding_null
## FALSE
## 47758
```

```r
# Converting the character column directly introduces NAs, because values
# like "1 750 000" contain spaces that as.numeric() cannot parse:
urlcomps$funding_new <- as.numeric(urlcomps$funding_total_usd)
```

```
## Warning: NAs introduced by coercion
```

```r
glimpse(urlcomps$funding_new)
```

```
## num [1:44435] NA NA NA NA NA NA NA NA NA NA ...
```

M. To ensure the char values are converted correctly, we first need to remove the spaces between the digits in the variable. Check if this works, and explain what it is doing:

```r
library(stringi)

# Remove every whitespace character so the digits run together
urlcomps$funding_new <- stri_replace_all_charclass(urlcomps$funding_total_usd, "\\p{WHITE_SPACE}", "")
glimpse(urlcomps$funding_new)
```

```
## chr [1:44435] "1750000" "4000000" "40000" "1500000" "1200000" "7000000" ...
```

```r
head(as.numeric(urlcomps$funding_new))
```

```
## Warning: NAs introduced by coercion
## [1] 1750000 4000000   40000 1500000 1200000 7000000
```

N. You are now ready to convert urlComps$funding_new to numeric using as.numeric(). Calculate the average funding amount for urlComps. If you get "NA," try using the na.rm=TRUE argument from problem I.

```r
urlcomps$funding_new <- as.numeric(urlcomps$funding_new)
```

```
## Warning: NAs introduced by coercion
```

```r
mean(urlcomps$funding_new, na.rm = TRUE)
```

```
## [1] 18321551
```

Sample three unique observations from urlComps$funding_rounds and store the results in the vector observations.

```r
observations <- sample(urlcomps$funding_rounds, 3, replace = FALSE)
```

Take the mean of those observations.

```r
mean(observations)
```

```
## [1] 1
```

Do the two steps (sampling and taking the mean) in one line of code.
```r
mean(sample(urlcomps$funding_rounds, 3, replace = FALSE))
```

```
## [1] 1.666667
```

```r
urlFundingMean <- mean(sample(urlcomps$funding_rounds, 3, replace = FALSE))
urlFundingMean
```

```
## [1] 1.666667
```

Explain why the two means are (or might be) different.

```r
# The means are different because each time the code runs, it draws a new
# random sample and computes the mean of that new sample.
```

Use the replicate() function to repeat your sampling of three observations of urlComps$funding_rounds five times. The first argument to replicate() is the number of repeats you want. The second argument is the chunk of code you want repeated.

```r
replicate(5, sample(urlcomps$funding_rounds, 3, replace = FALSE))
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    1    1    2    4
## [2,]    1    1    2    1    1
## [3,]    2    2    1    4    1
```

Rerun your replication, this time doing 20 replications and storing the output of replicate() in a variable called values.

```r
values <- replicate(20, mean(sample(urlcomps$funding_rounds, 3, replace = FALSE)))
values
```

```
##  [1] 1.000000 1.666667 1.666667 2.666667 1.666667 2.333333 3.333333 4.000000
##  [9] 1.333333 2.000000 1.333333 1.000000 1.666667 1.000000 1.333333 2.666667
## [17] 1.333333 2.000000 3.000000 1.000000
```

Generate a histogram of the means stored in values.

```r
hist(values)
```
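A quick way to see, and to control, this run-to-run variability is set.seed(), which fixes the state of R's random number generator. A minimal sketch on a made-up vector (the values 1:10 are hypothetical, not from the dataset):

```r
set.seed(1)               # fix the random number generator's state
a <- mean(sample(1:10, 3))

set.seed(1)               # reset to the same state
b <- mean(sample(1:10, 3))

a == b  # TRUE: identical seeds reproduce the same sample, hence the same mean
```

Without set.seed(), each call to sample() continues from the generator's current state, which is why repeated runs of the homework's one-liner give different means.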
Rerun your replication, this time doing 1000 replications and storing the output of replicate() in a variable called values, and then generate a histogram of values.

```r
values <- replicate(1000, mean(sample(urlcomps$funding_rounds, 3, replace = FALSE)))
hist(values)
```
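The center of that histogram is not random, even though each individual sample is: with many replications, the average of the sample means settles near the population mean. A sketch using a made-up, skewed "funding rounds" population (all values hypothetical):

```r
set.seed(7)

# Skewed toy population: mostly 1s, a few larger values; mean = 1.43
pop <- rep(1:4, times = c(70, 20, 7, 3))

# 1000 sample means, each from a sample of 3
vals <- replicate(1000, mean(sample(pop, 3, replace = FALSE)))

mean(pop)                    # the population mean
abs(mean(vals) - mean(pop))  # the average of the sample means lands close to it
```

This is why the histograms from repeated sampling pile up around the same center even though every individual sample mean bounces around.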
Repeat the replicated sampling, but this time, raise your sample size from 3 to 22. How does that affect your histogram? Explain in a comment.

```r
values <- replicate(1000, mean(sample(urlcomps$funding_rounds, 22, replace = FALSE)))
hist(values)
```
Explain in a comment below why the last three histograms look different.

```r
# As the sample size increases, the histogram of sample means begins to look
# more like a bell curve (a normal distribution), and it gets narrower.
```
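The narrowing can be checked directly rather than just eyeballed: the standard deviation of the sample means shrinks as the sample size grows. A sketch on the same kind of made-up, skewed population used above (values hypothetical, not from the companies data):

```r
set.seed(42)

# Skewed toy "funding rounds" population of 100 values
pop <- rep(1:4, times = c(70, 20, 7, 3))

# 1000 sample means at two sample sizes
means3  <- replicate(1000, mean(sample(pop, 3,  replace = FALSE)))
means22 <- replicate(1000, mean(sample(pop, 22, replace = FALSE)))

sd(means3)                # spread of means from samples of 3
sd(means22)               # noticeably smaller: larger samples average out noise
sd(means3) > sd(means22)  # TRUE
```

This is the central limit theorem in miniature: the spread of the sample mean falls roughly like 1/sqrt(n), so the n = 22 histogram is both more bell-shaped and tighter than the n = 3 one.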