ALY-6015 – Intermediate Analytics
Module 3 Assignment – GLM and Logistic Regression
Bhavana Bhariday
College of Professional Studies, Northeastern University
Mykhaylo Trubskyy
January 30, 2024
Introduction:

In this analysis we examine the College dataset with the goal of developing a logistic regression model that classifies a university as private or public. We begin with exploratory data analysis, using descriptive statistics and visualizations to find patterns. To train and evaluate the model, we split the dataset into train and test sets in R. We then predict university classifications with logistic regression and evaluate model performance with ROC curves and classification metrics. As a summary metric, the Area Under the Curve (AUC) measures the discriminatory strength of the model. This analysis offers insight into how well logistic regression predicts a college's type from its key characteristics.

Analysis:
1) Loading the Dataset

Exploratory Data Analysis: The College dataset contains 777 observations describing characteristics of both public and private universities. Important variables include application, acceptance, and enrollment counts, in addition to financial measures such as room and board costs and out-of-state tuition. Descriptive statistics draw attention to important distinctions between the two groups: private universities far outnumber public ones in this dataset, while public institutions tend to receive more applications and enroll more students. The dataset's mean values are also informative, showing an average enrollment of about 780 students and an average of about 3,002 applications per institution. Furthermore, the average cost of room and board is about $4,358, and the average instructional expenditure per student is about $9,660. These numbers highlight the costs associated with pursuing higher education. Overall, the dataset provides researchers and policymakers with a wealth of information about the characteristics and financial profiles of colleges.

A) Barplot of Private and Public Universities
For the first plot we created a bar chart with the count of universities on the y-axis and the type of university on the x-axis. From the results we can conclude that there are more private universities than public universities.

B) Barplot of Elite vs. Non-Elite Universities

For the second visualization we plotted the frequency of elite vs. non-elite universities (defined in the appendix code as Top10perc >= 50). We can conclude that non-elite universities far outnumber elite ones.

C) Boxplot of Universities and Their Out-of-State Fees
We generated a boxplot with university type on the x-axis and out-of-state tuition on the y-axis. We can see that public universities' tuition is significantly lower than private universities' tuition, with more outliers among the public institutions and comparatively fewer among the private ones.
2) Splitting the Data into Train and Test Sets

We use the caret library, which provides a function for data partitioning. For reproducibility, we set the seed with set.seed(123). We then generate a random split of the data stratified on the "Private" variable using the createDataPartition function; the p = 0.8 argument assigns eighty percent of the data to the training set. Finally, we divide the original dataset into the training and test sets using the generated index, giving train_data with 80% of the observations and test_data with the remaining 20% for evaluation. Setting list = FALSE makes the function return the indices of the chosen samples combined into a single numeric vector rather than a list.
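The corresponding code, as given in the appendix (step 2), is:

library(caret)
set.seed(123)
# Stratified 80/20 split on the Private variable
train_index <- createDataPartition(College$Private, p = 0.8, list = FALSE)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]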
3) Fitting the Logistic Regression Model

The formula we used is Private ~ Apps + Enroll, in which Apps and Enroll are the predictor variables and Private is the response variable. The data argument receives the training dataset train_data, and we pass family = binomial to indicate that we wish to fit a logistic regression model. The key results are as follows. Intercept: the intercept coefficient is roughly 2.973. Apps has a coefficient of about 0.0002549: holding all other predictors fixed, the log-odds of the response falling into the "Private" category rise by about 0.0002549 for every additional application. Enroll has a coefficient of about -0.003569, so the log-odds of being private fall as enrollment grows. With p-values under 0.05, every coefficient is a statistically significant predictor of the response variable. Deviance: the null deviance, i.e., the deviance of the model without any predictors, is 729.64; the residual deviance of the fitted model is 461.05, and the drop between the two indicates the improvement in fit gained by adding the predictors. AIC: the AIC is 467.05, and lower AIC values indicate better models. Taken together, these figures suggest that Apps and Enroll are both important predictors of Private: the deviance falls substantially relative to the null model and the AIC is comparatively low.
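The model was fit with the following call (appendix, step 3):

# Logistic regression: Private as response, Apps and Enroll as predictors
glm_model <- glm(Private ~ Apps + Enroll, data = train_data, family = binomial)
summary(glm_model)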
4) Confusion Matrix for the Train Set
Based on the confusion matrix for the train set:

Interpretation:
- The confusion matrix provides a breakdown of model predictions compared to the actual classes in the train set.
- True negatives (TN) are 97, true positives (TP) are 435, false negatives (FN) are 17, and false positives (FP) are 73.
- Accuracy, precision, recall, and specificity can be calculated from these values.
- False negatives (17) occur when the model predicts a university as public when it is actually private. False positives (73) occur when the model predicts a university as private when it is actually public.
- In this context, false negatives (misclassifying private universities as public) may be more damaging, as they could lead to missed opportunities or misallocation of resources such as financial aid.
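The matrix is built by thresholding the fitted probabilities at 0.5, as in the appendix (step 4, using the glm_model object fit above):

# Predicted probabilities on the training set
train_predictions <- predict(glm_model, train_data, type = "response")
# Converting predicted probabilities to binary predictions (1 = private)
train_predicted_classes <- ifelse(train_predictions > 0.5, 1, 0)
# Confusion matrix: actual vs. predicted classes
train_conf_matrix <- table(Actual = train_data$Private,
                           Predicted = train_predicted_classes)
print(train_conf_matrix)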
5) Classification Metrics on the Train Set

Interpretation: The logistic regression model achieves an accuracy of 85.53%, a precision of 85.63%, and a recall of 96.24% on the train set. However, its specificity is lower, at 57.06%, indicating potential difficulty in correctly identifying negative (public) cases.
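These values follow directly from the confusion matrix counts above: accuracy = (435 + 97) / 622 ≈ 0.8553, precision = 435 / (435 + 73) ≈ 0.8563, recall = 435 / (435 + 17) ≈ 0.9624, and specificity = 97 / (97 + 73) ≈ 0.5706. In R (appendix, step 5):

# Extracting counts from the train confusion matrix
TP <- train_conf_matrix[2, 2]   # private correctly predicted private
TN <- train_conf_matrix[1, 1]   # public correctly predicted public
FP <- train_conf_matrix[1, 2]   # public predicted private
FN <- train_conf_matrix[2, 1]   # private predicted public
accuracy    <- (TP + TN) / sum(train_conf_matrix)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)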
6) Confusion Matrix for the Test Set

The confusion matrix for the test set can be summarized as follows: true positives (TP) 106, true negatives (TN) 24, false positives (FP) 7, and false negatives (FN) 18. This matrix summarizes the model's predictions against the actual classes in the test set, and it can be used to evaluate the model's performance and assess its strengths and weaknesses in classifying universities as private or public.
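The test-set predictions are generated the same way (appendix, step 6):

# Predicted probabilities on the held-out test set
probabilities.test <- predict(glm_model, test_data, type = "response")
test_predicted_classes <- ifelse(probabilities.test > 0.5, "Yes", "No")
test_conf_matrix <- table(test_predicted_classes, test_data$Private)
print("Confusion Matrix for Test Set:")
print(test_conf_matrix)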
7) ROC Curve

The Receiver Operating Characteristic (ROC) curve is a graphical depiction of how well a binary classification model performs across different threshold values: it plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the classification threshold varies between 0 and 1. In the plot produced here, specificity appears on the x-axis (reversed) and sensitivity on the y-axis, which is equivalent to the usual TPR-vs-FPR presentation.
8) Area Under the Curve (AUC)

In R, we can obtain the Area Under the Curve (AUC) with the auc function. The obtained AUC of 0.980 suggests that the model is good at binary classification: it excels at differentiating between positive and negative examples. In the context of a ROC curve, an AUC near 1 indicates a strong classifier, with a high True Positive Rate (sensitivity) and a low False Positive Rate (1 - specificity) across a range of threshold values.
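The curve and the AUC are computed from the training probabilities (appendix, steps 7-8; note that roc() and auc() come from the pROC package, not ROCR):

library(pROC)
probabilities.train <- predict(glm_model, train_data, type = "response")
# ROC object from actual classes and fitted probabilities
ROC <- roc(train_data$Private, probabilities.train)
plot(ROC, main = "ROC Curve for Logistic Regression Model",
     col = "brown", lwd = 6)
auc_value <- auc(ROC)
print(paste("Area Under the Curve (AUC):", auc_value))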
APPENDIX:

library(dplyr)
library(tidyverse)
library(pacman)
library(ggplot2)
library(ISLR)

#1 ----
data("College")
glimpse(College)
summary(College)
view(College)

# Bar chart of the frequency of private and public universities
ggplot(College, aes(x = Private, fill = Private)) +
  geom_bar() +
  labs(x = "Type of University", y = "Count",
       title = "Count of Private and Public Universities") +
  theme_minimal()

# Boxplot of out-of-state tuition by the type of university
ggplot(College, aes(x = Private, y = Outstate, fill = Private)) +
  geom_boxplot() +
  labs(x = "University type", y = "Outstate Tuition",
       title = "Outstate Tuition by Private/Public") +
  theme_minimal()

# Barplot of elite vs. non-elite universities using the Top10perc variable
ggplot(College, aes(x = Top10perc >= 50, fill = factor(Top10perc >= 50))) +
  geom_bar() +
  scale_fill_manual(values = c("skyblue", "salmon"),
                    labels = c("Non-Elite", "Elite")) +
  labs(x = "Elite", y = "Count",
       title = "Count of Elite and Non-Elite Universities") +
  theme_minimal()

# Descriptive statistics of numerical variables
summary(College[, c("Apps", "Accept", "Enroll", "Top10perc", "Top25perc",
                    "F.Undergrad", "P.Undergrad", "Outstate", "Room.Board",
                    "Books", "Personal", "PhD", "Terminal", "S.F.Ratio",
                    "perc.alumni", "Expend", "Grad.Rate")])

#2 ----
library(caret)
set.seed(123)
# Splitting the data into train and test sets
train_index <- createDataPartition(College$Private, p = 0.8, list = FALSE)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

#3 ----
library(stats)
# 'Private' is the response variable and 'Apps' and 'Enroll' are predictors
glm_model <- glm(Private ~ Apps + Enroll, data = train_data, family = binomial)
summary(glm_model)

#4 ----
train_predictions <- predict(glm_model, train_data, type = "response")
# Converting predicted probabilities to binary predictions (0 or 1)
train_predicted_classes <- ifelse(train_predictions > 0.5, 1, 0)
# Creating confusion matrix for train_data
train_conf_matrix <- table(Actual = train_data$Private,
                           Predicted = train_predicted_classes)
print(train_conf_matrix)

#5 ----
# Extracting TP, TN, FP, FN from the train confusion matrix
TP <- train_conf_matrix[2, 2]
TN <- train_conf_matrix[1, 1]
FP <- train_conf_matrix[1, 2]
FN <- train_conf_matrix[2, 1]
accuracy <- (TP + TN) / sum(train_conf_matrix)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
specificity <- TN / (TN + FP)
# Printing metrics
print(paste("Accuracy:", accuracy))
print(paste("Precision:", precision))
print(paste("Recall:", recall))
print(paste("Specificity:", specificity))

#6 ----
probabilities.test <- predict(glm_model, test_data, type = "response")
test_predicted_classes <- ifelse(probabilities.test > 0.5, "Yes", "No")
test_conf_matrix <- table(test_predicted_classes, test_data$Private)
print("Confusion Matrix for Test Set:")
print(test_conf_matrix)

#7 ----
# roc() and auc() are provided by the pROC package
library(pROC)
probabilities.train <- predict(glm_model, train_data, type = "response")
ROC <- roc(train_data$Private, probabilities.train)
plot(ROC, main = "ROC Curve for Logistic Regression Model",
     col = "brown", lwd = 6)

#8 ----
auc_value <- auc(ROC)
print(paste("Area Under the Curve (AUC):", auc_value))
The ANOVA test fails to reject the null hypothesis (p = 0.543 > 0.05), indicating no significant difference in mean expenditures among regions. The Tukey HSD results support this, as none of the pairwise comparisons have adjusted p-values below the chosen significance level. In summary, there is insufficient evidence to conclude a significant difference in mean expenditures among regions at the 0.05 significance level.

12-3.10. Increasing Plant Growth

A two-way ANOVA can be used to assess whether there is an interaction between the two factors (plant food and grow light) and whether the mean growth varies with each factor; here the analysis is run as a one-way ANOVA on the four light-food combinations, as shown below.
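The appendix fits this by combining each light-food pair into a single group factor and running a one-way ANOVA across the four groups:

# Each light/food combination becomes one level of a single factor
light1_foodA <- data.frame(growth = c(9.2, 9.4, 8.9), group = "Light 1 - Food A")
light1_foodB <- data.frame(growth = c(8.5, 9.2, 8.9), group = "Light 1 - Food B")
light2_foodA <- data.frame(growth = c(7.1, 7.2, 8.5), group = "Light 2 - Food A")
light2_foodB <- data.frame(growth = c(5.5, 5.8, 7.6), group = "Light 2 - Food B")
plant_growth <- rbind(light1_foodA, light1_foodB, light2_foodA, light2_foodB)
plant_growth$group <- as.factor(plant_growth$group)
anv <- aov(growth ~ group, data = plant_growth)
summary(anv)   # p ≈ 0.00455 for the group effect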
Since the p-value of 0.00455 is less than the alpha value, we reject the null hypothesis: mean growth differs across the light-food groups.

ON YOUR OWN
1. Importing the Baseball Dataset
2. Descriptive Statistics

The baseball dataset contains 1,232 rows and 15 columns with data on numerous baseball teams across various years. Runs Scored (RS) and Runs Allowed (RA): runs scored range from a minimum of 463 to a maximum of 1,009, and the mean of runs allowed is quite similar to the mean of runs scored, with a comparable range. Wins (W): wins range from a minimum of 40 to a maximum of 116, with roughly 81 wins on average. Offensive measures (OBP, SLG, BA): the distributions of the offensive metrics, such as on-base percentage, slugging percentage, and batting average, follow typical patterns.

Visualizations to support EDA:
Histogram of Runs Scored

In the above plot, we generated a histogram of Runs Scored (RS) from the dataset, with the number of runs scored on the x-axis and the associated frequency on the y-axis. The distribution is approximately normal, with a peak frequency above 150.
Scatter Plot of Runs Allowed vs. Wins

In the above scatter plot, Runs Allowed (RA) is on the x-axis and Wins (W) is on the y-axis. The figure shows a strong negative correlation between RA and W: as runs allowed increase, wins decrease in a clear downward trend.
Scatter Plot of OOBP vs. OSLG

In the above scatter plot, opponents' on-base percentage (OOBP) is on the x-axis and opponents' slugging percentage (OSLG) is on the y-axis. It shows a strong positive, upward-trending correlation, along with quite a few outliers.
3. Wins by Decade

We conducted a Chi-Square goodness-of-fit test to determine whether the number of wins differs by decade.

(H0): The distribution of wins by decade is equal.
(H1): The distribution of wins by decade is not equal.

The analysis involved creating a 'Decade' variable, summarizing total wins by decade, and applying the Chi-Square goodness-of-fit test. The test statistic far exceeds the critical value and the p-value is negligible (near zero), so the null hypothesis is rejected: the number of wins varies significantly by decade.
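The test compares the observed decade totals against equal expected counts; this is a condensed version of the appendix code ("ON YOUR OWN" step 3):

library(dplyr)
# Total wins per decade
wins <- baseb %>%
  mutate(Decade = Year - (Year %% 10)) %>%
  group_by(Decade) %>%
  summarize(TotalWins = sum(W, na.rm = TRUE))
k <- nrow(wins)
observedf <- wins$TotalWins
# Equal expected counts under H0
expectedf <- rep(sum(observedf) / k, k)
chisquare <- sum((observedf - expectedf)^2 / expectedf)
p_val <- 1 - pchisq(chisquare, df = k - 1)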
4. Crop Data Analysis
5. To analyze the crop dataset, we first imported the data, converted the relevant variables to factors, conducted a two-way ANOVA of yield on density and fertilizer, and evaluated the results.
Using an alpha of 0.05 as the significance level, the p-value comparisons are as follows. Density: the null hypothesis is rejected since the p-value (0.000186) is less than 0.05, so density has a significant impact on yield. Fertilizer: we reject the null hypothesis since the p-value (0.000273) is less than 0.05, so fertilizer has a significant impact on yield. Interaction (Density x Fertilizer): we fail to reject the null hypothesis since the p-value (0.5325) is greater than 0.05, so the interaction effect is negligible. In conclusion, yield is strongly affected by density and fertilizer individually, but not by their interaction.
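The model with the interaction term is fit as follows (appendix, step 4):

cropd <- read.csv("crop_data-1.csv")
cropd$density <- as.factor(cropd$density)
cropd$fertilizer <- as.factor(cropd$fertilizer)
# Two-way ANOVA with interaction term
aov_result <- aov(yield ~ density * fertilizer, data = cropd)
summary(aov_result)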
APPENDIX:

library(ggplot2)
library(dplyr)
library(pacman)
library(tidyverse)
library(correlation)

#11-1.6. ----
ap <- 0.10
obs <- c(12, 8, 24, 6)
df <- length(obs) - 1
# Critical value
cv <- qchisq(1 - ap, df)
cv
p <- c(0.2, 0.28, 0.36, 0.16)
result <- chisq.test(x = obs, p = p)
ifelse(result$p.value > ap, "Accept (H0)", "Reject (H0)")
print(result)

#11-1.8. ----
ap <- 0.05
obs <- c(125, 10, 25, 40)
df <- length(obs) - 1
cv <- qchisq(1 - ap, df)
cv
# Hypothesized proportions (already sum to 1)
p <- c(70.8, 8.2, 9.0, 12.0) / 100
result_01 <- chisq.test(x = obs, p = p)
ifelse(result_01$p.value > ap, "Accept (H0)", "Reject (H0)")
print(result_01)

#11-2.8. ----
ap <- 0.05
# Creating vectors
cau <- c(724, 370)
his <- c(335, 292)
afam <- c(174, 152)
oth <- c(107, 140)
mtx <- matrix(c(cau, his, afam, oth), nrow = 4, byrow = TRUE)
rownames(mtx) <- c("Caucasian", "Hispanic", "African American", "Other")
colnames(mtx) <- c("2013", "2014")
result_02 <- chisq.test(mtx)
cv <- qchisq(1 - ap, df = result_02$parameter)
cv
ifelse(result_02$p.value > ap, "Accept (H0)", "Reject (H0)")
print(result_02)

#11-2.10. ----
ap <- 0.05
# Create vectors of rows
a_rmy <- c(10791, 62491)
nav_y <- c(7816, 42750)
marcorps <- c(932, 9525)
aforce <- c(11819, 54344)
mtx_01 <- matrix(c(a_rmy, nav_y, marcorps, aforce), nrow = 4, byrow = TRUE)
rownames(mtx_01) <- c("Army", "Navy", "Marine Corps", "Air Force")
colnames(mtx_01) <- c("Officers", "Enlisted")
result_03 <- chisq.test(mtx_01)
cv <- qchisq(1 - ap, df = result_03$parameter)
cv
ifelse(result_03$p.value > ap, "Accept (H0)", "Reject (H0)")
print(result_03)

#12-1.8. ----
ap <- 0.05
cond <- data.frame('sodium' = c(270, 130, 230, 180, 80, 70, 200),
                   'food' = rep('condiments', 7),
                   stringsAsFactors = FALSE)
cer <- data.frame('sodium' = c(260, 220, 290, 290, 200, 320, 140),
                  'food' = rep('cereals', 7),
                  stringsAsFactors = FALSE)
dess <- data.frame('sodium' = c(100, 180, 250, 250, 300, 360, 300, 160),
                   'food' = rep('desserts', 8),
                   stringsAsFactors = FALSE)
# Binding the groups into one data frame
sodium <- rbind(cond, cer, dess)
sodium$food <- as.factor(sodium$food)
anovares <- aov(sodium$sodium ~ sodium$food)
summary(anovares)
a.summary <- summary(anovares)
# Degrees of freedom
# k - 1
df.numerator <- a.summary[[1]][1, "Df"]
df.numerator
# N - k
df.denominator <- a.summary[[1]][2, "Df"]
df.denominator
# F test value from summary
F.value <- a.summary[[1]][[1, "F value"]]
F.value
# p-value from summary
p.value <- a.summary[[1]][[1, "Pr(>F)"]]
p.value
ifelse(p.value > ap, "Accept (H0)", "Reject (H0)")
cv <- qf(1 - ap, df.numerator, df.denominator)
cv
TukeyHSD(anovares)

#12-2.10. Sales for Leading Companies ----
ap <- 0.01
cer <- data.frame("Sales" = c(578, 320, 264, 249, 237),
                  "Company" = rep("Cereal", 5),
                  stringsAsFactors = FALSE)
choco <- data.frame("Sales" = c(311, 106, 109, 125, 173),
                    "Company" = rep("Chocolate Candy", 5),
                    stringsAsFactors = FALSE)
cof <- data.frame("Sales" = c(261, 185, 302, 689),
                  "Company" = rep("Coffee", 4),
                  stringsAsFactors = FALSE)
sales <- rbind(cer, choco, cof)
sales$Company <- as.factor(sales$Company)
anova <- aov(Sales ~ Company, data = sales)
summary(anova)
a.summary <- summary(anova)
df_num <- a.summary[[1]][1, "Df"]
df_num
df_denom <- a.summary[[1]][2, "Df"]
df_denom
F.value <- a.summary[[1]][[1, "F value"]]
F.value
p.value <- a.summary[[1]][[1, "Pr(>F)"]]
p.value
cvalue <- qf(1 - ap, df_num, df_denom)
cvalue
TukeyHSD(anova)

#12-2.12. ----
ap <- 0.05
easthir <- data.frame('expenditure' = c(4946, 5953, 6202, 7243, 6113),
                      'region' = rep('Eastern', 5),
                      stringsAsFactors = FALSE)
midthir <- data.frame('expenditure' = c(6149, 7451, 6000, 6479),
                      'region' = rep('Middle', 4),
                      stringsAsFactors = FALSE)
westhir <- data.frame('expenditure' = c(5282, 8605, 6528, 6911),
                      'region' = rep('Western', 4),
                      stringsAsFactors = FALSE)
expenditures <- rbind(easthir, midthir, westhir)
expenditures$region <- as.factor(expenditures$region)
av <- aov(expenditure ~ region, data = expenditures)
summary(av)
a.su <- summary(av)
# k - 1
df_num <- a.su[[1]][1, "Df"]
df_num
# N - k
df_denom <- a.su[[1]][2, "Df"]
df_denom
# F test value
F.value <- a.su[[1]][[1, "F value"]]
F.value
# p-value from the summary
p.value <- a.su[[1]][[1, "Pr(>F)"]]
p.value
ifelse(p.value > ap, "Accept (H0)", "Reject (H0)")
cvalue <- qf(1 - ap, df_num, df_denom)
cvalue
TukeyHSD(av)

#12-3.10. ----
ap <- 0.05
# Each light/food combination is treated as one group level
light1_foodA <- data.frame('growth' = c(9.2, 9.4, 8.9),
                           'group' = rep('Light 1 - Food A', 3),
                           stringsAsFactors = FALSE)
light1_foodB <- data.frame('growth' = c(8.5, 9.2, 8.9),
                           'group' = rep('Light 1 - Food B', 3),
                           stringsAsFactors = FALSE)
light2_foodA <- data.frame('growth' = c(7.1, 7.2, 8.5),
                           'group' = rep('Light 2 - Food A', 3),
                           stringsAsFactors = FALSE)
light2_foodB <- data.frame('growth' = c(5.5, 5.8, 7.6),
                           'group' = rep('Light 2 - Food B', 3),
                           stringsAsFactors = FALSE)
plant_growth <- rbind(light1_foodA, light1_foodB, light2_foodA, light2_foodB)
plant_growth$group <- as.factor(plant_growth$group)
# One-way ANOVA across the four light/food groups
anv <- aov(growth ~ group, data = plant_growth)
summary(anv)
a.suy <- summary(anv)
df.num <- a.suy[[1]][1, "Df"]
df.num
df.den <- a.suy[[1]][2, "Df"]
df.den
F.val <- a.suy[[1]][[1, "F value"]]
F.val
p.val <- a.suy[[1]][[1, "Pr(>F)"]]
p.val
ifelse(p.val > ap, "Accept (H0)", "Reject (H0)")
TukeyHSD(anv)

#self ----
# 1 ----
baseb <- read.csv("baseball-1.csv")
glimpse(baseb)
summary(baseb)
View(baseb)

# 2 ----
# EDA
mean(baseb$W)
# Histogram of Runs Scored (RS)
ggplot(baseb, aes(x = RS)) +
  geom_histogram(binwidth = 30, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Runs Scored", x = "Runs Scored", y = "Frequency") +
  theme_classic()

# Scatter plot for Runs Allowed (RA) vs. Wins (W)
ggplot(baseb, aes(x = RA, y = W)) +
  geom_point(color = "red", alpha = 0.7) +
  labs(title = "Scatter Plot of Runs Allowed vs. Wins",
       x = "Runs Allowed", y = "Wins") +
  theme_minimal()

# Scatter plot for opponents' OBP (OOBP) vs. opponents' SLG (OSLG)
ggplot(baseb, aes(x = OOBP, y = OSLG)) +
  geom_point(color = "green", alpha = 0.7) +
  labs(title = "Scatter Plot: OOBP vs. OSLG",
       x = "Opponent's OBP", y = "Opponent's SLG") +
  theme_minimal()

#view(baseb)
# 3 ----
# Wins by decade
library(dplyr)
baseb$Decade <- baseb$Year - (baseb$Year %% 10)
print(unique(baseb$Decade))
# Total wins per decade (Decade recomputed here via cut())
wins <- baseb %>%
  mutate(Decade = cut(Year,
                      breaks = seq(min(Year), max(Year) + 10, by = 10),
                      labels = FALSE)) %>%
  group_by(Decade) %>%
  summarize(TotalWins = sum(W, na.rm = TRUE))
k <- nrow(wins)
df <- k - 1
cv <- qchisq(0.95, df)
cv
observedf <- wins$TotalWins
# Equal expected counts under H0
expectedf <- rep(sum(observedf) / k, k)
chisquare <- sum((observedf - expectedf)^2 / expectedf)
if (chisquare > cv) {
  decision_critical <- "Reject (H0)"
} else {
  decision_critical <- "Accept (H0)"
}
p_val <- 1 - pchisq(chisquare, df)
if (p_val <= 0.05) {
  conclusion <- "Reject (H0)"
} else {
  conclusion <- "Accept (H0)"
}
cat("Chi-Square Statistic:", chisquare, "\n")
cat("Degrees of Freedom:", df, "\n")
cat("Critical Value:", cv, "\n")
cat("Decision based on Critical Value:", decision_critical, "\n")
cat("P-Value:", p_val, "\n")
cat("Decision based on P-Value:", conclusion, "\n")
# 4 ----
ap <- 0.05   # significance level
cropd <- read.csv("crop_data-1.csv")
# View(cropd)
summary(cropd)
glimpse(cropd)
cropd$density <- as.factor(cropd$density)
cropd$fertilizer <- as.factor(cropd$fertilizer)
cropd$block <- as.factor(cropd$block)
# Two-way ANOVA with interaction
aov_res <- aov(yield ~ density * fertilizer, data = cropd)
summary(aov_res)
suy_table <- summary(aov_res)

# Hypotheses for Density:
h0_density <- "There is no significant difference in yield between different density levels."
h1_density <- "There is a significant difference in yield between at least two density levels."

# Hypotheses for Fertilizer:
h0_fertilizer <- "There is no significant difference in yield between different fertilizer levels."
h1_fertilizer <- "There is a significant difference in yield between fertilizer levels."

# Hypotheses for Interaction (Density x Fertilizer):
h0_interaction <- "The effect of density on yield is the same for all levels of fertilizer."
h1_interaction <- "The effect of density on yield is not the same for all levels of fertilizer."

# p-values: rows of the ANOVA table are density, fertilizer,
# density:fertilizer, and residuals, in that order
p_density <- suy_table[[1]][[1, "Pr(>F)"]]
p_fertilizer <- suy_table[[1]][[2, "Pr(>F)"]]
p_interaction <- suy_table[[1]][[3, "Pr(>F)"]]

# Critical values: df1 is the effect's Df, df2 is the residual Df
critival_density <- qf(1 - ap, df1 = suy_table[[1]][1, "Df"], df2 = suy_table[[1]][4, "Df"])
critival_density
critival_fertilizer <- qf(1 - ap, df1 = suy_table[[1]][2, "Df"], df2 = suy_table[[1]][4, "Df"])
critival_fertilizer
critval_interaction <- qf(1 - ap, df1 = suy_table[[1]][3, "Df"], df2 = suy_table[[1]][4, "Df"])
critval_interaction

result_density <- ifelse(p_density > ap,
                         "Accept (H0) for the effect of density on yield",
                         "Reject (H0) for the effect of density on yield")
result_fertilizer <- ifelse(p_fertilizer > ap,
                            "Accept (H0) for the effect of fertilizer on yield",
                            "Reject (H0) for the effect of fertilizer on yield")
result_interaction <- ifelse(p_interaction > ap,
                             "Accept (H0) for the interaction",
                             "Reject (H0) for the interaction")