Week 1 Assignment: Drills with R

School: Cumberland University
Course: 441
Subject: Statistics
Date: Apr 3, 2024

Week 1 Assignment
Manisha Reddy Yerla
Master's in Data Science, University of Cumberland’s
2023 Fall - Statistics for Data Science (MSDS-531-B02) - Second Bi-term
Dr. Mina Richards
October 26, 2023
Drills with R on importing and plotting data, and finding the distribution measures

Question 1: The student directory for a large university has 400 pages with 130 names per page, a total of 52,000 names. Using software, show how to select a simple random sample of 10 names.

# Calculate the total number of names
total_names <- 400 * 130  # 400 pages with 130 names per page
# Define the sample size
sample_size <- 10
# Generate a simple random sample of 10 names (sampling without replacement)
random_sample <- sample(1:total_names, sample_size)
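The sampled numbers only identify positions in the directory, so it can help to translate each one back to a page and a line on that page. The following is a minimal sketch of that step; the seed value and the names idx, page, line, and directory_sample are illustrative assumptions, not part of the original answer.

```r
set.seed(531)  # hypothetical seed, used only so the draw is reproducible

pages <- 400
names_per_page <- 130
total_names <- pages * names_per_page

# Draw 10 distinct positions (sample() draws without replacement by default)
idx <- sample(total_names, 10)

# Convert each position to a page number and a line on that page
page <- (idx - 1) %/% names_per_page + 1
line <- (idx - 1) %% names_per_page + 1

directory_sample <- data.frame(index = idx, page = page, line = line)
print(directory_sample)
```

Each row then says which page to open and which name on that page to select.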
Question 2: From the Murder data file, use the variable murder, which is the murder rate (per 100,000 population) for each state in the U.S. in 2017 according to the FBI Uniform Crime Reports. At first, do not use the observation for D.C. (DC). Using software:
a. Find the mean and standard deviation and interpret their values.
b. Find the five-number summary and construct the corresponding boxplot.
c. Now include the observation for D.C. What is affected more by this outlier: The mean or the median? Find the mean and standard deviation and interpret their values.

# Read the dataset into a data frame
data <- read.table("https://stat4ds.rwth-aachen.de/data/Murder.dat", header=TRUE)
# Exclude the observation for DC
data <- data[data$state != "DC", ]
# Calculate the mean and standard deviation of the murder rate
mean_murder <- mean(data$murder)
sd_murder <- sd(data$murder)
Interpretation: The mean murder rate is approximately 4.874, meaning that, averaged across the states (D.C. excluded), there were about 4.87 murders per 100,000 population in 2017 according to the FBI Uniform Crime Reports. The standard deviation is approximately 2.586, so a typical state's murder rate differs from the mean by roughly 2.6 murders per 100,000.

Find the five-number summary and construct the corresponding boxplot.

# Five-number summary (summary() also reports the mean alongside it)
summary(data$murder)
# Construct the boxplot
boxplot(data$murder, main="Murder Rate Boxplot", ylab="Murder Rate (per 100,000 population)")
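For reference, base R also provides fivenum(), which returns exactly five values (minimum, lower hinge, median, upper hinge, maximum). Its hinges can differ slightly from the quartiles that summary() reports, because the two functions use different quartile rules. The sketch below illustrates this on a small made-up vector x, which is an assumption for illustration and not the Murder data.

```r
# Hypothetical data, used only to compare the two functions
x <- 1:10

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
print(five)  # 1.0 3.0 5.5 8.0 10.0

# Quartiles as computed by summary() and quantile() (default type 7)
quartiles <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
print(quartiles)  # 3.25 7.75
```

On larger data sets the two versions are usually very close, so either is reasonable as long as the method is stated.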
Now include the observation for D.C. What is affected more by this outlier: The mean or the median?

# Build the row for D.C.
new_row <- data.frame(state = "DC", murder = 24.2)
# Add the row for D.C. to the data frame
data <- rbind(data, new_row)
# Calculate the new mean and median
mean_murder_dc <- mean(data$murder)
median_murder_dc <- median(data$murder)
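How far a single extreme value moves the mean can be checked by hand with the update formula (n * old_mean + new_value) / (n + 1). The sketch below applies it to the figures reported in this assignment; the count of 50 states is an assumption about the data file, and the variable names are illustrative.

```r
# Figures reported in this assignment
mean_without_dc <- 4.874  # mean murder rate with D.C. excluded
dc_rate <- 24.2           # murder rate for D.C.
n_states <- 50            # assumption: 50 states remain once D.C. is dropped

# Mean after adding one observation: (n * old_mean + new_value) / (n + 1)
mean_with_dc <- (n_states * mean_without_dc + dc_rate) / (n_states + 1)
print(round(mean_with_dc, 4))  # 5.2529
```

The median, by contrast, depends only on the middle-ranked values, so adding one observation at the top of the range shifts it by at most half a rank position.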
Observations: Including D.C. (DC) raises the mean to about 5.253, up from 4.874 without it, but has a smaller impact on the median (5). The median is a robust statistic that depends only on the middle-ranked values, so it is less influenced by extreme values, while the mean uses every value and is sensitive to outliers. In this case, the very high murder rate in D.C. pulls the mean up, but the median remains relatively stable.

Question 3: The Houses data file lists the selling price (thousands of dollars), size (square feet), tax bill (dollars), number of bathrooms, number of bedrooms, and whether the house is new (1 = yes, 0 = no) for 100 home sales in Gainesville, Florida. Let’s analyze the selling prices.
a. Construct a frequency distribution and a histogram.
b. Find the percentage of observations that fall within one standard deviation of the mean.
c. Construct a boxplot.
Construct a frequency distribution and a histogram.

# Read the dataset into a data frame
data <- read.table("https://stat4ds.rwth-aachen.de/data/Houses.dat", header=TRUE)
# Frequency distribution (counts of each distinct price)
freq_table <- table(data$price)
print(freq_table)
# Histogram
hist(data$price, main = "Histogram of Selling Prices", xlab = "Price (in thousands of dollars)", ylab = "Frequency", col = "lightblue")

Find the percentage of observations that fall within one standard deviation of the mean.

# Calculate the mean and standard deviation of the selling price
mean_price <- mean(data$price)
std_dev_price <- sd(data$price)
# Calculate the range within one standard deviation
lower_bound <- mean_price - std_dev_price
upper_bound <- mean_price + std_dev_price
# Flag the observations within one standard deviation of the mean
obs_within_sd <- data$price >= lower_bound & data$price <= upper_bound
# Calculate the percentage of observations within one standard deviation
per_within_sd <- sum(obs_within_sd) / length(data$price) * 100
print(paste("Percentage within one standard deviation:", per_within_sd, "%"))

Construct a boxplot.

# Boxplot for selling price
boxplot(data$price, main = "Boxplot for Selling Prices", ylab = "Price (in thousands of dollars)", col = "lightblue")
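Because selling price is a continuous variable, a frequency distribution is often easier to read when prices are grouped into intervals rather than counted value by value. The sketch below shows one way to do that with cut(), together with a compact version of the within-one-standard-deviation calculation. The price vector and the 50-thousand-dollar bin width are made-up assumptions for illustration, not values from the Houses data.

```r
# Hypothetical selling prices (thousands of dollars), not the Houses data
price <- c(95, 120, 135, 150, 162, 175, 190, 210, 260, 340)

# Bin edges every 50 thousand dollars, wide enough to cover every value
breaks <- seq(50, 350, by = 50)

# Count the observations in each interval (right = FALSE gives [a, b) bins)
freq_table <- table(cut(price, breaks = breaks, right = FALSE))
print(freq_table)

# Share of observations within one standard deviation of the mean:
# the mean of a logical vector is the proportion of TRUE values
within_one_sd <- abs(price - mean(price)) <= sd(price)
pct_within <- mean(within_one_sd) * 100
print(pct_within)
```

The same two steps apply to data$price once suitable break points are chosen for the real price range.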