ANguyen - Drills with R Week 1

docx

School

University of South Florida

Course

6217

Subject

Statistics

Date

Apr 3, 2024

Type

docx

Pages

10

Uploaded by qinhann

Week 1 – Drills with R
An Nguyen
University of the Cumberlands
Statistics for Data Science (MSDS-531-M30) – Full Term
Dr. Ora Denton
January 21st, 2024
1. The student directory for a large university has 400 pages with 130 names per page, a total of 52,000 names. Using software, show how to select a simple random sample of 10 names.
- I assume every student has an ID number, and that if the directory is saved as a data frame it has an index column running from 1 to 52000, one value per student. The easiest way to pick 10 random names is then to draw a random sample of 10 index numbers from 1:52000.
- R code to do so:
    # Create a variable to store the index numbers from 1 to 52000
    studentDirectory <- 1:52000
    # Take a random sample of 10 (without replacement), assign it to a variable and print it
    randomSample <- sample(studentDirectory, 10)
    print(randomSample)
- Result: [output not shown]
2. From the Murder data file, use the variable murder, which is the murder rate (per 100,000 population) for each state in the U.S. in 2017 according to the FBI Uniform Crime Reports. At first, do not use the observation for D.C. (DC). Using software:
a) Find the mean and standard deviation and interpret their values.
- R code to do so:
    # Question 2a
    # Read the murder data into murderAll
    murderAll <- read.table("https://stat4ds.rwth-aachen.de/data/Murder.dat", header = TRUE)
    # Keep every row except DC in murderNoDC
    murderNoDC <- murderAll[murderAll$state != "DC", ]
    # Mean murder rate of all states except DC
    meanNoDC <- mean(murderNoDC$murder)
    # Standard deviation of the murder rate of all states except DC
    sdNoDC <- sd(murderNoDC$murder)
    # Print meanNoDC and sdNoDC
    print(meanNoDC)
    print(sdNoDC)
- Results: [output not shown]
- Mean & standard deviation interpretation: Excluding DC, the state murder rates have a mean of 4.874 per 100,000 population, so the rates are centered around 4.874. The standard deviation of 2.586291 measures how widely the rates are spread around that mean. However, without further details about the distribution of the murder rates, it’s hard to say whether that standard deviation is high or low. What I can say is that a low standard deviation would mean most states have murder rates relatively close to the mean of 4.874, while a high standard deviation would indicate otherwise.
b) Find the five-number summary and construct the corresponding boxplot.
- R code to do so:
    # Question 2b
    # Get the five-number summary of the murder rate by state, without DC
    # (summary() also reports the mean alongside the five numbers)
    sumNoDC <- summary(murderNoDC$murder)
    # Print the summary
    print(sumNoDC)
    # Generate a boxplot of the murder rate by state, without DC
    boxplot(murderNoDC$murder, ylab = "Murder Rate")
- Result: [output not shown]
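As a rough sketch of what the five-number summary and the boxplot in part b encode, the snippet below uses a small made-up vector (not the Murder data; the name `rates` and its values are assumptions for illustration only). It computes the five numbers with `fivenum()` and lists the points that the default boxplot rule would draw as outliers.

```r
# Illustrative values only; these are NOT the Murder data
rates <- c(1.0, 2.2, 2.5, 3.1, 4.8, 4.9, 5.3, 6.0, 7.8, 12.4)

# Minimum, lower hinge, median, upper hinge, maximum
fiveNum <- fivenum(rates)
print(fiveNum)

# boxplot.stats() applies the default boxplot rule: points lying more than
# 1.5 times the hinge spread beyond the hinges are flagged as outliers
flagged <- boxplot.stats(rates)$out
print(flagged)
```

Note that `summary()` computes its quartiles with `quantile()`, so its 1st and 3rd quartiles can differ slightly from the hinges returned by `fivenum()`.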
c) Now include the observation for D.C. What is affected more by this outlier: The mean or the median?
- R code to do so:
    # Question 2c
    # Mean murder rate of all states, including DC
    meanWDC <- mean(murderAll$murder)
    # Median murder rate of all states, including DC
    medianWDC <- median(murderAll$murder)
    # Print meanWDC and medianWDC
    print(meanWDC)
    print(medianWDC)
    # To understand the statistics better, generate a boxplot of the murder rate by state
    boxplot(murderAll$murder, ylab = "Murder Rate")
- Result: [output not shown]
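The different sensitivity of the mean and the median to one extreme value can be seen in a minimal sketch with made-up numbers (the vectors below are assumptions for illustration, not the Murder data). Adding a single large value shifts the mean far more than the median.

```r
# Illustrative values only; these are NOT the Murder data
withoutOutlier <- c(2, 3, 4, 5, 6)
withOutlier <- c(withoutOutlier, 30)   # add one extreme value

# How far each statistic moves once the extreme value is included
meanShift <- mean(withOutlier) - mean(withoutOutlier)
medianShift <- median(withOutlier) - median(withoutOutlier)

print(c(meanShift = meanShift, medianShift = medianShift))
```

The mean uses the magnitude of every observation, so one extreme value pulls it strongly; the median depends only on the order of the middle observations.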
- Without DC, the mean and median were 4.874 and 4.85, respectively. Once DC is included, the mean increases by 0.378941 to 5.252941, while the median increases by only 0.15. The outlier, DC’s murder rate, therefore affects the mean more than the median.
3. The Houses data file lists the selling price (thousands of dollars), size (square feet), tax bill (dollars), number of bathrooms, number of bedrooms, and whether the house is new (1 = yes, 0 = no) for 100 home sales in Gainesville, Florida. Let’s analyze the selling prices.
a) Construct a frequency distribution and a histogram.
- R code to do so:
    # Question 3a
    # Read the house data into houses
    houses <- read.table("https://stat4ds.rwth-aachen.de/data/Houses.dat", header = TRUE)
    # Construct a frequency table: 10 equally spaced break points give 9 price classes
    priceSeq <- seq(min(houses$price), max(houses$price), length.out = 10)
    housesFreqTab <- table(cut(houses$price, breaks = priceSeq, include.lowest = TRUE))
    print(housesFreqTab)
    # Construct a histogram using the same break points
    hist(houses$price, breaks = priceSeq, xlab = "Selling Prices", ylab = "Frequency")
- Result: [output not shown]
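A frequency distribution is often easier to read with round class boundaries and a percentage column. The sketch below shows one way to build such a table; the `prices` vector and the class boundaries are made up for illustration and are not taken from the Houses data.

```r
# Illustrative prices (thousands of dollars); these are NOT the Houses data
prices <- c(95, 120, 135, 150, 150, 180, 210, 240, 310, 400)

# Equal-width class boundaries chosen by hand: 50, 150, 250, 350, 450
classBreaks <- seq(50, 450, by = 100)

# right = FALSE makes each class include its lower bound: [50,150), [150,250), ...
priceClass <- cut(prices, breaks = classBreaks, right = FALSE)

# Counts per class and the corresponding relative frequencies
freq <- table(priceClass)
relFreq <- prop.table(freq)

freqTable <- data.frame(
  class = names(freq),
  count = as.vector(freq),
  percent = round(100 * as.vector(relFreq), 1)
)
print(freqTable)
```

Passing the same `classBreaks` to `hist()` together with `right = FALSE` would keep the histogram bars consistent with the table.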
b) Find the percentage of observations that fall within one standard deviation of the mean.
- R code to do so:
    # Question 3b
    # Mean and standard deviation of the selling price
    meanPrice <- mean(houses$price)
    sdPrice <- sd(houses$price)
    # Lower and upper bounds of the interval mean ± 1 standard deviation
    lowerRange <- meanPrice - sdPrice
    upperRange <- meanPrice + sdPrice
    # Number of observations that fall inside the interval
    withinRange <- sum(houses$price >= lowerRange & houses$price <= upperRange)
    # Percentage of observations that fall within 1 standard deviation of the mean
    percWithinRange <- (withinRange / nrow(houses)) * 100
    print(percWithinRange)
- Result: [output not shown]
c) Construct a boxplot.
- R code to do so:
    # Question 3c
    boxplot(houses$price, ylab = "Selling Price")
- Result: [output not shown]
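The calculation in part b can be wrapped in a small helper so that the same check works for any number of standard deviations. This is only a sketch: the function name `pctWithinSd` and the `prices` vector are hypothetical and not part of the original assignment or the Houses data.

```r
# Percentage of values lying within k standard deviations of the mean
# (hypothetical helper, written for illustration)
pctWithinSd <- function(x, k = 1) {
  m <- mean(x)
  s <- sd(x)
  100 * mean(x >= m - k * s & x <= m + k * s)
}

# Illustrative prices (thousands of dollars); these are NOT the Houses data
prices <- c(95, 120, 135, 150, 150, 180, 210, 240, 310, 400)

pct1 <- pctWithinSd(prices, k = 1)
pct2 <- pctWithinSd(prices, k = 2)
print(c(within1sd = pct1, within2sd = pct2))
```

For a roughly bell-shaped distribution, the empirical rule suggests about 68% of observations within one standard deviation and about 95% within two, so a clearly different percentage hints at skew or outliers.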