STAT 405 Homework_3

txt

School

University of Pennsylvania *

*We aren’t endorsed by this school

Course

405

Subject

Statistics

Date

Feb 20, 2024

Type

txt

Pages

10

Uploaded by BaronNarwhal3261

Report
#### STAT 4050/7050 Homework 3, Spring 2023 #### Name: #### Instructions: #### 1. If a question refers to a function not seen in class, #### use the help facility in R to learn about it. #### 2. Insert your answers and the R code you used to generate them, beneath each question. #### Even though for some questions you could find the answer "by inspection", for this homework you need #### to write code to get the answers, unless it is explicitly stated that no code is required. #### 3. When you are asked to print an answer, it is NOT ENOUGH to simply write a print statement, #### for example print(2*3) as an answer, you also need to #### cut and paste into this document the value(s) that is actually printed (i.e.,6). On the other hand, pasting #### long lines of code from the console when you are not asked to do so is unnecessary. Please ask if you have questions on this. #### 4. When submitting to Canvas, please change the file extension to .txt, otherwise it cannot be submitted. ## ------------------------------------------------------------------------ #### Q1 (10 points) #a. 7pts. Write a function 'NA_Stats' that takes a numeric vector as input and returns the mean, variance, and a variable # named NAstatus of the non-NA values in the vector as a list. NAstatus is a "Yes" or "No" value depending on whether # there is at least one NA value in the input or not respectively. # The function should have a stop message if all values are NA or if the vector is all non-numeric. The stop message should # read "This function wont work for this vector." It should also have a warning message if there is at least one NA value, # which should then be ignored in the calculation of the mean and variance. The warning message should be "There are some missing # values. I am ignoring them." NA_Stats <- function(x) { if (!is.numeric(x)){ stop("This function won't work for this vector.") } if (all(is.na(x))) { stop("This function won't work for this vector.") } na_status <- ifelse(any(is.na(x)), "Yes", "No") x_no_na <- x[!is.na(x)] if (length(x_no_na) == 0) { stop("This function won't work for this vector.") } if (any(is.na(x))) { warning("There are some missing values. I am ignoring them.")
} list(mean = mean(x_no_na), variance = var(x_no_na), NAstatus = na_status) } #b. 3pts. Apply this function to the following numeric vectors: # a) c(rep(1,7),NA,seq(2,8,2)) # b) c(rep(1,7),seq(2,8,2)) # c) c("Krishna", "Davis", "Joanne", "Yash") NA_Stats(c(rep(1,7),NA,seq(2,8,2))) # $mean # [1] 2.454545 # # $variance # [1] 6.072727 # # $NAstatus # [1] "Yes" # Warning message: # In NA_Stats(c(rep(1, 7), NA, seq(2, 8, 2))) : # There are some missing values. I am ignoring them. NA_Stats(c(rep(1,7),seq(2,8,2))) # $mean # [1] 2.454545 # # $variance # [1] 6.072727 # # $NAstatus # [1] "No" NA_Stats(c("Krishna", "Davis", "Joanne", "Yash")) # Error in NA_Stats(c("Krishna", "Davis", "Joanne", "Yash")) : # This function won't work for this vector. #### Q2 (20 points) #a. 10pts. Write a function called "kurtosis.transform" that takes a numerical vector input (x), and if the input has # outlier values, it returns the kurtosis of the natural log of the input vector. Otherwise, it should return the kurtosis # of the original vector. Use the magnitude of the kurtosis value of the original input vector to determine whether there # are outliers. # The function should start by printing the kurtosis value calculated from the data. Use the paste/print functions to display the # kurtosis value using a sentence that says: "The calculated kurtosis value of the input data is x.x" where the number x.x is the # result of the calculation." # The function should also have an argument in addition to x, with a default value. This additional argument should be called # "thresh" which takes a single non-negative numeric value. Default its value to 100. The role of "thresh" is so that the user # can decide for themselves the cut-off level of the sample's kurtosis at which you will transform the data. If the kurtosis of
# the data is > the thresh value, then simply transform the data to the log values and return the kurtosis of the log transformed x # values; if not, just return the kurtosis of the original vector x. # Hint: the kurtosis function you created in homework 2 may be # helpful # In either case, you need to print a second message saying one of the following: # Data has been transformed - New Kurtosis = y.y OR # Data has not been transformed - Kurtosis = x.x # with x.x and y.y coming from the calculations. kurtosis <- function(x) { n <- length(x) m <- mean(x) diff_mean_x <- x - m num <- sum(diff_mean_x^4) diff_2 <- x - (m^2) denom <- sum(diff_2^2) kurt <- (n*num) /(denom) return(kurt) } kurtosis.transform <- function(x, thresh=100){ orig_kurt <- kurtosis(x) print(paste("The calculated kurtosis value of the input data is", round(orig_kurt,2))) if (orig_kurt > thresh){ x_trans <- log(x) trans_kurt <- kurtosis(x_trans) print(paste("Data has been transformed - New Kurtosis =", round(trans_kurt,2))) return(trans_kurt) } else { print(paste("Data has not been transformed - Kurtosis =", round(orig_kurt,2))) return(orig_kurt) } } #b. 5pts. Run the 'kurtosis.transform' function on the salary variable (default value of thresh) and report the results after # loading the following data at http://mathmba.com/richardw/HR.csv. hr.data <- read.csv(file = "http://mathmba.com/richardw/HR.csv") kurtosis.transform(hr.data$salary,100) # [1] "The calculated kurtosis value of the input data is 373.01" # [1] "Data has been transformed - New Kurtosis = 0.04" # [1] 0.03851951 #c. 5pts. Expand the 'kurtosis.transform' function to have a third argument called 'transf' which takes one of two values: # either "LOG" or "SQRT". If you choose "LOG" then the kurtosis of the logarithm of x will be returned (as in Q2a) # But if you choose "SQRT", the then the kurtosis of the square root of x will be returned. Then, re-run the 'kurtosis.transform' # function on the Salary variable using the "SQRT" option in the function using the default thresh value. kurtosis.transform <- function(x, thresh = 100, transf = "LOG") {
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
orig_kurt <- kurtosis(x) print(paste("The calculated kurtosis value of the input data is", round(orig_kurt, 2))) if (orig_kurt > thresh) { if (transf == "LOG") { x_trans <- log(x) trans_kurt <- kurtosis(x_trans) print(paste("Data has been transformed - New Kurtosis =", round(trans_kurt, 2))) return(trans_kurt) } else if (transf == "SQRT") { x_trans <- sqrt(x) trans_kurt <- kurtosis(x_trans) print(paste("Data has been transformed - New Kurtosis =", round(trans_kurt, 2))) return(trans_kurt) } } print(paste("Data has not been transformed - Kurtosis =", round(orig_kurt, 2))) return(orig_kurt) } kurtosis.transform(hr.data$salary,100,"SQRT") # [1] "The calculated kurtosis value of the input data is 373.01" # [1] "Data has been transformed - New Kurtosis = 21.96" # [1] 21.9591 #### Q3 (20 points) #a. 10pts. Write a function called "is.any.prime" that takes a numeric vector of data and reports whether any of the input values # are prime (i.e. a whole number greater than 1 that cannot be exactly divided by any whole number other than itself and 1). This # means that negative numbers and non-whole numbers are technically not prime numbers. # The "is.any.prime" function should return a single TRUE if the vector contains a prime number or FALSE if not. Hint: to check # whether a number is a whole number, you can simply check to see if it is equal to its rounded value is.any.prime <- function(x) { for (i in x) { if (i == round(i) && i > 1) { isprime <- TRUE for (j in 2:(i-1)) { if (i %% j == 0) { isprime <- FALSE break } } if (isprime) { return(TRUE) } } } return(FALSE)
} #b. 5pts. Paste the output of the "is.any.prime" function on the following arrays: # (i) c(12, 6, 14, 9, 44) # (ii) c(1, 4.8, 24, 18.1) # (iii) c(4, 6, 8, -18, 22) # (iv) c(1, 4, 12, 13, 20) # (v) c(1, 1, 1, 1, 1, 0) is.any.prime(c(12, 6, 14, 9, 44)) # [1] FALSE is.any.prime(c(1, 4.8, 24, 18.1)) # [1] FALSE is.any.prime(c(4, 6, 8, -18, 22)) # [1] FALSE is.any.prime(c(1, 4, 12, 13, 20)) # [1] TRUE is.any.prime(c(1, 1, 1, 1, 1, 0)) # [1] FALSE #c. 5pts. Write a new function "first.prime.num" that takes a numeric vector of data and outputs only the first prime # number in the vector. If there are no prime numbers in the vector, the function should display a message that says: # "No prime numbers were found in the input vector." Hint: ?break and ?next may be useful here. first.prime.num <- function(x) { for (i in x) { if (i == round(i) & i > 1) { isprime <- TRUE for (j in 2:(i-1)) { if (i %% j == 0) { isprime <- FALSE break } } if (isprime) { return(i) break } } } return(print("No prime numbers were found in the input vector.")) } #d. 5pts. Paste the output of this new function on the following arrays: # (i) c(-20, 100, 41, 38, 29, 15, 22) # (ii) c(18, 1, 24, 14, 45, 27, 49, 38) first.prime.num(c(-20, 100, 41, 38, 29, 15, 22)) # [1] 41 first.prime.num(c(18, 1, 24, 14, 45, 27, 49, 38)) # [1] "No prime numbers were found in the input vector." #### Q4 (10 points) # Recall that the sum of the first N natural numbers is given by sum(N) = (N * (N +
1)) / 2). Example: sum(5) = 1+2+3+4+5 = 15. # Some numbers like sum(6) = 36 can also be expressed as perfect squares (i.e. 36 = 6*6). Lets call these numbers like 36 # "Sum-Square Twins". That is, they are perfect squares (N^2 for some N) and also the sum of the first M digits, for some other M. # Write a function to find all "Sum-Square Twins" between 1 and sum(1 Million) - which is approximately 500 Billion. is.perfect.square <- function(n) { root <- sqrt(n) if (root==round(root)){ return(TRUE) } else { return(FALSE) } } find.sum.square.twins <- function(N) { twins <- c() for (m in 1:N) { sum.m <- (m * (m + 1)) / 2 for (n in 1:(sum.m - 1)) { if (is.perfect.square(sum.m)==TRUE && is.perfect.square(n)==TRUE) { twins <- c(twins, sum.m) break } } } return(twins) } find.sum.square.twins(1000000) #Laptop could not compute- emailed TA about it. Have done it for 100 below: find.sum.square.twins(100) # [1] 1 36 1225 #### Q5 (20 points) # A deck of playing cards contains four suits: spades (S), hearts (H), diamonds (D), and clubs (C). In each # suit there are thirteen cards: 10, 9, 8, 7, 6, 5, 4, 3, 2, ace, and three face cards (i.e. king, queen, jack). In total, there # are 52 distinct cards. To keep things simple, assign ace a numeric value of one, and remember that an ace is not a face card. # Additionally, assign a numeric value of thirteen to king, twelve to queen, and eleven to jack. # Now, imagine you're playing a new card game in which you draw six cards at random each turn. If you draw a face card (see definition # of face card above) of any suit or at least one diamond (D) of any value, you win. Otherwise, you lose. #a 5pts. Set seed to 2023, then create a full set of the playing deck using the grep and paste commands. Sample six random cards from this set # and paste the output. The format of the hand should be "value-suit" (e.g. an ace of spades should be denoted by "1-S"). You should be sure # to use the sort() command when creating your playing deck. set.seed(2023) CARDS <- paste( rep(c("1","2","3","4","5","6","7","8","9","10","11","12","13"),4),
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
rep(c("H","C","D","S"),13), sep="_") CARDS <- sort(CARDS) hand <- sample(CARDS, 6) print(hand) # [1] "12_S" "8_H" "7_C" "10_S" "3_D" "7_S" #b. 5pts. Write a function "win.outcome" that calculates whether you won a turn or not. It should take in a hand (i.e. six cards) and return # TRUE (win) or FALSE (lost). Hint: use the grep function win.outcome <- function(hand){ has_face_card <- any(grep("11|12|13", hand)) has_diamond <- any(grep("D", hand)) return(has_face_card | has_diamond) } set.seed(2023) hand <- sample(CARDS, 6) hand win.outcome(hand) # [1] TRUE #c. 3pts. Show the function and the results of applying the function on the following examples: # (i) c("7-C","9-S","2-H","3-C") # (ii) c("3-D","1-C","10-D","4-S") # (iii) c("2-S","6-H","9-C","1-S") win.outcome(c("7-C","9-S","2-H","3-C")) # [1] FALSE win.outcome(c("3-D","1-C","10-D","4-S")) # [1] TRUE win.outcome(c("2-S","6-H","9-C","1-S")) # [1] FALSE #e. 7pts. Reset the seed to 2023. Next, write code that simulates playing this imaginary card game, and calculate the probability of # winning across 100K simulation runs. Print the resulting probability as part of statement that includes the number of turns won, # number of games played, and the win probability (rounded to two decimals places). set.seed(2023) num_wins <- 0 num_games <- 100000 for(i in 1:num_games) { hand <- sample(CARDS, 6, replace = FALSE) if(win.outcome(hand)) { num_wins <- num_wins + 1 } } win_prob <- round(num_wins/num_games, 2) print(paste("Number of turns won:", num_wins)) print(paste("Number of games played:", num_games)) print(paste("Win probability:", win_prob)) # [1] "Number of turns won: 97090" # [1] "Number of games played: 1e+05" # [1] "Win probability: 0.97" #### Q6 (20 points)
# You are a data analysis consultant for a multinational retail company struggling with its sales. # They have hired you to analyze their sales data and identify areas of improvement. Using the # 'sales.csv' file in the HW3 folder, answer the following: #a. 8pts. The data your client has provided contains the following columns: # Product ID (character) # Region (character) # Date (character in the format "YYYY-MM-DD") # Units Sold (numeric) # Revenue (numeric) # Write a function called "sales.report" that takes the following inputs: # 1. path.to.csv: a character string representing the path to the CSV file # 2. product.id: a character string representing the ID of the product to report on # 3. region: a charcter string representing the region to report on # 4. date.range: a character vector of length 2 representing the ranges of dates # to report on in the format "YYYY-MM-DD" # This function should read the CSV file, filter the data to only include a specified # product, region, and date range, and retrun a data frame with the following columns: # 1. date (character): the date of the sales # 2. units sold (numeric): the number of units sold on the date # 3. revenue (numeric): the revenue generated on the date # The data frame should be sorted by date in ascending order. sales.csv <- read.csv(file="C:\\Users\\Varyam\\Downloads\\sales.csv") head(sales.csv) sales.csv <- "C:\\Users\\Varyam\\Downloads\\sales.csv" sales.report <- function(path.to.csv, product.id, region, date.range) { sales_data <- read.csv(file=path.to.csv) filtered_data <- sales_data[sales_data$Product.ID == product.id & sales_data$Region == region & sales_data$Date >= date.range[1] & sales_data$Date <= date.range[2], ] report <- filtered_data[, c("Date", "Units.Sold", "Revenue")] report <- report[order(report$Date), ] return(report) } #b. 2pts. Test your sales.report function below. Print the output. sales.report(sales.csv, "A100", "West", c("2018-01-02", "2018-01-04")) # Date Units.Sold Revenue # 6 2018-01-02 98 2134.76 # 8 2018-01-02 73 1967.32 # 10 2018-01-02 90 2356.87 # 11 2018-01-03 67 1889.45
# 15 2018-01-03 101 2657.12 # 18 2018-01-04 67 1745.78 #c. 4pts. Now modify your sales.report with a new function called sales.report.v2 to allow multiple # product IDs to be reported at once. sales.report.v2 <- function(path.to.csv, product.ids, region, date.range) { sales_data <- read.csv(file=path.to.csv) filtered_data <- sales_data[sales_data$Product.ID %in% product.ids & sales_data$Region == region & sales_data$Date >= date.range[1] & sales_data$Date <= date.range[2], ] reports_list <- list() for (product_id in product.ids) { report <- filtered_data[filtered_data$Product.ID == product_id, c("Date", "Units.Sold", "Revenue")] report <- report[order(report$Date), ] reports_list[[product_id]] <- report } final_report <- do.call(rbind, reports_list) return(final_report) } #d. 1pts. Test your sales.report.v2 function below. Print the output. sales.report.v2(sales.csv, c("A100", "B200"), "East", c("2018-01-02", "2018-01- 07")) # Date Units.Sold Revenue # A100.13 2018-01-03 78 2123.32 # A100.16 2018-01-04 123 2460.34 # A100.20 2018-01-04 112 3056.43 # A100.21 2018-01-05 78 2198.21 # A100.28 2018-01-06 78 2123.32 # A100.33 2018-01-07 123 2460.34 # B200.9 2018-01-02 78 2198.21 # B200.17 2018-01-04 56 1786.45 # B200.19 2018-01-04 89 2098.54 # B200.24 2018-01-05 56 1786.45 # B200.32 2018-01-07 89 2098.54 #e. 4pts. Now alter the sales.report.v2 function to include a column for the total revenue # generated over the specified date range. Call the new function sales.report.v3 sales.report.v3 <- function(path.to.csv, product.ids, region, date.range) { sales_data <- read.csv(file=path.to.csv) filtered_data <- sales_data[sales_data$Product.ID %in% product.ids & sales_data$Region == region & sales_data$Date >= date.range[1] & sales_data$Date <= date.range[2], ] reports_list <- list() for (product_id in product.ids) {
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
report <- filtered_data[filtered_data$Product.ID == product_id, c("Date", "Units.Sold", "Revenue")] report <- report[order(report$Date), ] total_revenue <- sum(report$Revenue) report$total_revenue <- total_revenue reports_list[[product_id]] <- report } final_report <- do.call(rbind, reports_list) return(final_report) } #f. 1pts. Test your sales.report.v3 function below. Print the output. sales.report.v3(sales.csv, c("A100", "B200"), "West", c("2018-01-06", "2018-01- 08")) # Date Units.Sold Revenue total_revenue # A100.26 2018-01-06 45 1203.23 8295.56 # A100.30 2018-01-06 90 2356.87 8295.56 # A100.31 2018-01-07 56 1786.45 8295.56 # A100.35 2018-01-07 67 1745.78 8295.56 # A100.38 2018-01-08 45 1203.23 8295.56 # B200.27 2018-01-06 123 2460.34 7006.91 # B200.29 2018-01-06 67 1889.45 7006.91 # B200.34 2018-01-07 101 2657.12 7006.91