STAT 405 Homework_3
txt
keyboard_arrow_up
School
University of Pennsylvania *
*We aren’t endorsed by this school
Course
405
Subject
Statistics
Date
Feb 20, 2024
Type
txt
Pages
10
Uploaded by BaronNarwhal3261
#### STAT 4050/7050 Homework 3, Spring 2023
#### Name: #### Instructions:
#### 1. If a question refers to a function not seen in class, #### use the help facility in R to learn about it.
#### 2. Insert your answers and the R code you used to generate them, beneath each question.
#### Even though for some questions you could find the answer "by inspection", for
this homework you need
#### to write code to get the answers, unless it is explicitly stated that no code
is required.
#### 3. When you are asked to print an answer, it is NOT ENOUGH to simply write a print statement, #### for example print(2*3) as an answer, you also need to #### cut and paste into this document the value(s) that is actually printed (i.e.,6). On the other hand, pasting
#### long lines of code from the console when you are not asked to do so is unnecessary. Please ask if you have questions on this.
#### 4. When submitting to Canvas, please change the file extension to .txt, otherwise it cannot be submitted.
## ------------------------------------------------------------------------
#### Q1 (10 points)
#a. 7pts. Write a function 'NA_Stats' that takes a numeric vector as input and returns the mean, variance, and a variable
# named NAstatus of the non-NA values in the vector as a list. NAstatus is a "Yes" or "No" value depending on whether
# there is at least one NA value in the input or not respectively.
# The function should have a stop message if all values are NA or if the vector is all non-numeric. The stop message should
# read "This function wont work for this vector." It should also have a warning message if there is at least one NA value,
# which should then be ignored in the calculation of the mean and variance. The warning message should be "There are some missing
# values. I am ignoring them." NA_Stats <- function(x) {
if (!is.numeric(x)){ stop("This function won't work for this vector.") }
if (all(is.na(x))) { stop("This function won't work for this vector.")
}
na_status <- ifelse(any(is.na(x)), "Yes", "No")
x_no_na <- x[!is.na(x)]
if (length(x_no_na) == 0) {
stop("This function won't work for this vector.")
}
if (any(is.na(x))) {
warning("There are some missing values. I am ignoring them.")
}
list(mean = mean(x_no_na), variance = var(x_no_na), NAstatus = na_status)
}
#b. 3pts. Apply this function to the following numeric vectors:
# a) c(rep(1,7),NA,seq(2,8,2))
# b) c(rep(1,7),seq(2,8,2))
# c) c("Krishna", "Davis", "Joanne", "Yash")
NA_Stats(c(rep(1,7),NA,seq(2,8,2)))
# $mean
# [1] 2.454545
# # $variance
# [1] 6.072727
# # $NAstatus
# [1] "Yes"
# Warning message:
# In NA_Stats(c(rep(1, 7), NA, seq(2, 8, 2))) :
# There are some missing values. I am ignoring them.
NA_Stats(c(rep(1,7),seq(2,8,2)))
# $mean
# [1] 2.454545
# # $variance
# [1] 6.072727
# # $NAstatus
# [1] "No"
NA_Stats(c("Krishna", "Davis", "Joanne", "Yash"))
# Error in NA_Stats(c("Krishna", "Davis", "Joanne", "Yash")) : # This function won't work for this vector.
#### Q2 (20 points)
#a. 10pts. Write a function called "kurtosis.transform" that takes a numerical vector input (x), and if the input has # outlier values, it returns the kurtosis of the natural log of the input vector. Otherwise, it should return the kurtosis
# of the original vector. Use the magnitude of the kurtosis value of the original input vector to determine whether there
# are outliers.
# The function should start by printing the kurtosis value calculated from the data. Use the paste/print functions to display the
# kurtosis value using a sentence that says: "The calculated kurtosis value of the input data is x.x" where the number x.x is the
# result of the calculation."
# The function should also have an argument in addition to x, with a default value.
This additional argument should be called
# "thresh" which takes a single non-negative numeric value. Default its value to 100. The role of "thresh" is so that the user
# can decide for themselves the cut-off level of the sample's kurtosis at which you
will transform the data. If the kurtosis of
# the data is > the thresh value, then simply transform the data to the log values and return the kurtosis of the log transformed x
# values; if not, just return the kurtosis of the original vector x. # Hint: the kurtosis function you created in homework 2 may be
# helpful
# In either case, you need to print a second message saying one of the following:
# Data has been transformed - New Kurtosis = y.y OR
# Data has not been transformed - Kurtosis = x.x
# with x.x and y.y coming from the calculations.
kurtosis <- function(x) {
n <- length(x)
m <- mean(x)
diff_mean_x <- x - m
num <- sum(diff_mean_x^4)
diff_2 <- x - (m^2)
denom <- sum(diff_2^2)
kurt <- (n*num) /(denom)
return(kurt)
}
kurtosis.transform <- function(x, thresh=100){
orig_kurt <- kurtosis(x)
print(paste("The calculated kurtosis value of the input data is", round(orig_kurt,2)))
if (orig_kurt > thresh){
x_trans <- log(x)
trans_kurt <- kurtosis(x_trans)
print(paste("Data has been transformed - New Kurtosis =", round(trans_kurt,2)))
return(trans_kurt)
}
else {
print(paste("Data has not been transformed - Kurtosis =", round(orig_kurt,2)))
return(orig_kurt)
}
}
#b. 5pts. Run the 'kurtosis.transform' function on the salary variable (default value of thresh) and report the results after
# loading the following data at http://mathmba.com/richardw/HR.csv.
hr.data <- read.csv(file = "http://mathmba.com/richardw/HR.csv")
kurtosis.transform(hr.data$salary,100)
# [1] "The calculated kurtosis value of the input data is 373.01"
# [1] "Data has been transformed - New Kurtosis = 0.04"
# [1] 0.03851951
#c. 5pts. Expand the 'kurtosis.transform' function to have a third argument called 'transf' which takes one of two values:
# either "LOG" or "SQRT". If you choose "LOG" then the kurtosis of the logarithm of x will be returned (as in Q2a)
# But if you choose "SQRT", the then the kurtosis of the square root of x will be
returned. Then, re-run the 'kurtosis.transform'
# function on the Salary variable using the "SQRT" option in the function using the default thresh value.
kurtosis.transform <- function(x, thresh = 100, transf = "LOG") {
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
orig_kurt <- kurtosis(x)
print(paste("The calculated kurtosis value of the input data is", round(orig_kurt, 2)))
if (orig_kurt > thresh) {
if (transf == "LOG") {
x_trans <- log(x)
trans_kurt <- kurtosis(x_trans)
print(paste("Data has been transformed - New Kurtosis =", round(trans_kurt, 2)))
return(trans_kurt)
} else if (transf == "SQRT") {
x_trans <- sqrt(x)
trans_kurt <- kurtosis(x_trans)
print(paste("Data has been transformed - New Kurtosis =", round(trans_kurt, 2)))
return(trans_kurt)
}
}
print(paste("Data has not been transformed - Kurtosis =", round(orig_kurt, 2)))
return(orig_kurt)
}
kurtosis.transform(hr.data$salary,100,"SQRT")
# [1] "The calculated kurtosis value of the input data is 373.01"
# [1] "Data has been transformed - New Kurtosis = 21.96"
# [1] 21.9591
#### Q3 (20 points) #a. 10pts. Write a function called "is.any.prime" that takes a numeric vector of data and reports whether any of the input values
# are prime (i.e. a whole number greater than 1 that cannot be exactly divided by
any whole number other than itself and 1). This
# means that negative numbers and non-whole numbers are technically not prime numbers.
# The "is.any.prime" function should return a single TRUE if the vector contains a prime number or FALSE if not. Hint: to check
# whether a number is a whole number, you can simply check to see if it is equal to its rounded value
is.any.prime <- function(x) {
for (i in x) {
if (i == round(i) && i > 1) {
isprime <- TRUE
for (j in 2:(i-1)) {
if (i %% j == 0) {
isprime <- FALSE
break
}
}
if (isprime) {
return(TRUE)
}
}
}
return(FALSE)
}
#b. 5pts. Paste the output of the "is.any.prime" function on the following arrays:
# (i) c(12, 6, 14, 9, 44)
# (ii) c(1, 4.8, 24, 18.1)
# (iii) c(4, 6, 8, -18, 22)
# (iv) c(1, 4, 12, 13, 20)
# (v) c(1, 1, 1, 1, 1, 0)
is.any.prime(c(12, 6, 14, 9, 44))
# [1] FALSE
is.any.prime(c(1, 4.8, 24, 18.1))
# [1] FALSE
is.any.prime(c(4, 6, 8, -18, 22))
# [1] FALSE
is.any.prime(c(1, 4, 12, 13, 20))
# [1] TRUE
is.any.prime(c(1, 1, 1, 1, 1, 0))
# [1] FALSE
#c. 5pts. Write a new function "first.prime.num" that takes a numeric vector of data and outputs only the first prime
# number in the vector. If there are no prime numbers in the vector, the function
should display a message that says:
# "No prime numbers were found in the input vector." Hint: ?break and ?next may be useful here.
first.prime.num <- function(x) {
for (i in x) {
if (i == round(i) & i > 1) {
isprime <- TRUE
for (j in 2:(i-1)) {
if (i %% j == 0) {
isprime <- FALSE
break
}
}
if (isprime) {
return(i)
break
}
}
}
return(print("No prime numbers were found in the input vector."))
}
#d. 5pts. Paste the output of this new function on the following arrays:
# (i) c(-20, 100, 41, 38, 29, 15, 22)
# (ii) c(18, 1, 24, 14, 45, 27, 49, 38)
first.prime.num(c(-20, 100, 41, 38, 29, 15, 22))
# [1] 41
first.prime.num(c(18, 1, 24, 14, 45, 27, 49, 38))
# [1] "No prime numbers were found in the input vector."
#### Q4 (10 points)
# Recall that the sum of the first N natural numbers is given by sum(N) = (N * (N +
1)) / 2). Example: sum(5) = 1+2+3+4+5 = 15.
# Some numbers like sum(6) = 36 can also be expressed as perfect squares (i.e. 36 =
6*6). Lets call these numbers like 36
# "Sum-Square Twins". That is, they are perfect squares (N^2 for some N) and also the sum of the first M digits, for some other M.
# Write a function to find all "Sum-Square Twins" between 1 and sum(1 Million) - which is approximately 500 Billion.
is.perfect.square <- function(n) {
root <- sqrt(n)
if (root==round(root)){
return(TRUE)
} else {
return(FALSE)
}
}
find.sum.square.twins <- function(N) {
twins <- c()
for (m in 1:N) {
sum.m <- (m * (m + 1)) / 2
for (n in 1:(sum.m - 1)) {
if (is.perfect.square(sum.m)==TRUE && is.perfect.square(n)==TRUE) {
twins <- c(twins, sum.m)
break
}
}
}
return(twins)
}
find.sum.square.twins(1000000)
#Laptop could not compute- emailed TA about it. Have done it for 100 below:
find.sum.square.twins(100)
# [1] 1 36 1225
#### Q5 (20 points)
# A deck of playing cards contains four suits: spades (S), hearts (H), diamonds (D), and clubs (C). In each
# suit there are thirteen cards: 10, 9, 8, 7, 6, 5, 4, 3, 2, ace, and three face cards (i.e. king, queen, jack). In total, there
# are 52 distinct cards. To keep things simple, assign ace a numeric value of one, and remember that an ace is not a face card.
# Additionally, assign a numeric value of thirteen to king, twelve to queen, and eleven to jack.
# Now, imagine you're playing a new card game in which you draw six cards at random
each turn. If you draw a face card (see definition
# of face card above) of any suit or at least one diamond (D) of any value, you win. Otherwise, you lose.
#a 5pts. Set seed to 2023, then create a full set of the playing deck using the grep and paste commands. Sample six random cards from this set
# and paste the output. The format of the hand should be "value-suit" (e.g. an ace of spades should be denoted by "1-S"). You should be sure
# to use the sort() command when creating your playing deck.
set.seed(2023)
CARDS <- paste(
rep(c("1","2","3","4","5","6","7","8","9","10","11","12","13"),4),
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
rep(c("H","C","D","S"),13),
sep="_")
CARDS <- sort(CARDS)
hand <- sample(CARDS, 6)
print(hand)
# [1] "12_S" "8_H" "7_C" "10_S" "3_D" "7_S" #b. 5pts. Write a function "win.outcome" that calculates whether you won a turn or not. It should take in a hand (i.e. six cards) and return
# TRUE (win) or FALSE (lost). Hint: use the grep function
win.outcome <- function(hand){
has_face_card <- any(grep("11|12|13", hand))
has_diamond <- any(grep("D", hand))
return(has_face_card | has_diamond)
}
set.seed(2023)
hand <- sample(CARDS, 6)
hand
win.outcome(hand)
# [1] TRUE
#c. 3pts. Show the function and the results of applying the function on the following examples:
# (i) c("7-C","9-S","2-H","3-C")
# (ii) c("3-D","1-C","10-D","4-S")
# (iii) c("2-S","6-H","9-C","1-S")
win.outcome(c("7-C","9-S","2-H","3-C"))
# [1] FALSE
win.outcome(c("3-D","1-C","10-D","4-S"))
# [1] TRUE
win.outcome(c("2-S","6-H","9-C","1-S"))
# [1] FALSE
#e. 7pts. Reset the seed to 2023. Next, write code that simulates playing this imaginary card game, and calculate the probability of
# winning across 100K simulation runs. Print the resulting probability as part of
statement that includes the number of turns won,
# number of games played, and the win probability (rounded to two decimals places). set.seed(2023)
num_wins <- 0
num_games <- 100000
for(i in 1:num_games) {
hand <- sample(CARDS, 6, replace = FALSE)
if(win.outcome(hand)) {
num_wins <- num_wins + 1
}
}
win_prob <- round(num_wins/num_games, 2)
print(paste("Number of turns won:", num_wins))
print(paste("Number of games played:", num_games))
print(paste("Win probability:", win_prob))
# [1] "Number of turns won: 97090"
# [1] "Number of games played: 1e+05"
# [1] "Win probability: 0.97"
#### Q6 (20 points)
# You are a data analysis consultant for a multinational retail company struggling with its sales.
# They have hired you to analyze their sales data and identify areas of improvement. Using the
# 'sales.csv' file in the HW3 folder, answer the following:
#a. 8pts. The data your client has provided contains the following columns:
# Product ID (character)
# Region (character)
# Date (character in the format "YYYY-MM-DD")
# Units Sold (numeric)
# Revenue (numeric)
# Write a function called "sales.report" that takes the following inputs:
# 1. path.to.csv: a character string representing the path to the CSV file
# 2. product.id: a character string representing the ID of the product to report on
# 3. region: a charcter string representing the region to report on
# 4. date.range: a character vector of length 2 representing the ranges of dates
# to report on in the format "YYYY-MM-DD"
# This function should read the CSV file, filter the data to only include a specified
# product, region, and date range, and retrun a data frame with the following columns:
# 1. date (character): the date of the sales
# 2. units sold (numeric): the number of units sold on the date
# 3. revenue (numeric): the revenue generated on the date
# The data frame should be sorted by date in ascending order.
sales.csv <- read.csv(file="C:\\Users\\Varyam\\Downloads\\sales.csv")
head(sales.csv)
sales.csv <- "C:\\Users\\Varyam\\Downloads\\sales.csv"
sales.report <- function(path.to.csv, product.id, region, date.range) {
sales_data <- read.csv(file=path.to.csv)
filtered_data <- sales_data[sales_data$Product.ID == product.id & sales_data$Region == region & sales_data$Date >= date.range[1] & sales_data$Date <= date.range[2], ]
report <- filtered_data[, c("Date", "Units.Sold", "Revenue")]
report <- report[order(report$Date), ]
return(report)
}
#b. 2pts. Test your sales.report function below. Print the output.
sales.report(sales.csv, "A100", "West", c("2018-01-02", "2018-01-04"))
# Date Units.Sold Revenue
# 6 2018-01-02 98 2134.76
# 8 2018-01-02 73 1967.32
# 10 2018-01-02 90 2356.87
# 11 2018-01-03 67 1889.45
# 15 2018-01-03 101 2657.12
# 18 2018-01-04 67 1745.78
#c. 4pts. Now modify your sales.report with a new function called sales.report.v2 to allow multiple
# product IDs to be reported at once.
sales.report.v2 <- function(path.to.csv, product.ids, region, date.range) {
sales_data <- read.csv(file=path.to.csv)
filtered_data <- sales_data[sales_data$Product.ID %in% product.ids & sales_data$Region == region & sales_data$Date >= date.range[1] & sales_data$Date <= date.range[2], ]
reports_list <- list()
for (product_id in product.ids) {
report <- filtered_data[filtered_data$Product.ID == product_id, c("Date", "Units.Sold", "Revenue")]
report <- report[order(report$Date), ]
reports_list[[product_id]] <- report
}
final_report <- do.call(rbind, reports_list)
return(final_report)
}
#d. 1pts. Test your sales.report.v2 function below. Print the output.
sales.report.v2(sales.csv, c("A100", "B200"), "East", c("2018-01-02", "2018-01-
07"))
# Date Units.Sold Revenue
# A100.13 2018-01-03 78 2123.32
# A100.16 2018-01-04 123 2460.34
# A100.20 2018-01-04 112 3056.43
# A100.21 2018-01-05 78 2198.21
# A100.28 2018-01-06 78 2123.32
# A100.33 2018-01-07 123 2460.34
# B200.9 2018-01-02 78 2198.21
# B200.17 2018-01-04 56 1786.45
# B200.19 2018-01-04 89 2098.54
# B200.24 2018-01-05 56 1786.45
# B200.32 2018-01-07 89 2098.54
#e. 4pts. Now alter the sales.report.v2 function to include a column for the total revenue
# generated over the specified date range. Call the new function sales.report.v3
sales.report.v3 <- function(path.to.csv, product.ids, region, date.range) {
sales_data <- read.csv(file=path.to.csv)
filtered_data <- sales_data[sales_data$Product.ID %in% product.ids & sales_data$Region == region & sales_data$Date >= date.range[1] & sales_data$Date <= date.range[2], ]
reports_list <- list()
for (product_id in product.ids) {
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
report <- filtered_data[filtered_data$Product.ID == product_id, c("Date", "Units.Sold", "Revenue")]
report <- report[order(report$Date), ]
total_revenue <- sum(report$Revenue)
report$total_revenue <- total_revenue
reports_list[[product_id]] <- report
}
final_report <- do.call(rbind, reports_list)
return(final_report)
}
#f. 1pts. Test your sales.report.v3 function below. Print the output.
sales.report.v3(sales.csv, c("A100", "B200"), "West", c("2018-01-06", "2018-01-
08"))
# Date Units.Sold Revenue total_revenue
# A100.26 2018-01-06 45 1203.23 8295.56
# A100.30 2018-01-06 90 2356.87 8295.56
# A100.31 2018-01-07 56 1786.45 8295.56
# A100.35 2018-01-07 67 1745.78 8295.56
# A100.38 2018-01-08 45 1203.23 8295.56
# B200.27 2018-01-06 123 2460.34 7006.91
# B200.29 2018-01-06 67 1889.45 7006.91
# B200.34 2018-01-07 101 2657.12 7006.91
Related Documents
Recommended textbooks for you
Glencoe Algebra 1, Student Edition, 9780079039897...
Algebra
ISBN:9780079039897
Author:Carter
Publisher:McGraw Hill
Big Ideas Math A Bridge To Success Algebra 1: Stu...
Algebra
ISBN:9781680331141
Author:HOUGHTON MIFFLIN HARCOURT
Publisher:Houghton Mifflin Harcourt
Algebra & Trigonometry with Analytic Geometry
Algebra
ISBN:9781133382119
Author:Swokowski
Publisher:Cengage
Algebra for College Students
Algebra
ISBN:9781285195780
Author:Jerome E. Kaufmann, Karen L. Schwitters
Publisher:Cengage Learning
Functions and Change: A Modeling Approach to Coll...
Algebra
ISBN:9781337111348
Author:Bruce Crauder, Benny Evans, Alan Noell
Publisher:Cengage Learning
Recommended textbooks for you
- Glencoe Algebra 1, Student Edition, 9780079039897...AlgebraISBN:9780079039897Author:CarterPublisher:McGraw HillBig Ideas Math A Bridge To Success Algebra 1: Stu...AlgebraISBN:9781680331141Author:HOUGHTON MIFFLIN HARCOURTPublisher:Houghton Mifflin HarcourtAlgebra & Trigonometry with Analytic GeometryAlgebraISBN:9781133382119Author:SwokowskiPublisher:Cengage
- Algebra for College StudentsAlgebraISBN:9781285195780Author:Jerome E. Kaufmann, Karen L. SchwittersPublisher:Cengage LearningFunctions and Change: A Modeling Approach to Coll...AlgebraISBN:9781337111348Author:Bruce Crauder, Benny Evans, Alan NoellPublisher:Cengage Learning
Glencoe Algebra 1, Student Edition, 9780079039897...
Algebra
ISBN:9780079039897
Author:Carter
Publisher:McGraw Hill
Big Ideas Math A Bridge To Success Algebra 1: Stu...
Algebra
ISBN:9781680331141
Author:HOUGHTON MIFFLIN HARCOURT
Publisher:Houghton Mifflin Harcourt
Algebra & Trigonometry with Analytic Geometry
Algebra
ISBN:9781133382119
Author:Swokowski
Publisher:Cengage
Algebra for College Students
Algebra
ISBN:9781285195780
Author:Jerome E. Kaufmann, Karen L. Schwitters
Publisher:Cengage Learning
Functions and Change: A Modeling Approach to Coll...
Algebra
ISBN:9781337111348
Author:Bruce Crauder, Benny Evans, Alan Noell
Publisher:Cengage Learning