# HM1

Yuchen Zou

2024-04-09

## Q1

During modeling, convert the variable “default” to numeric, i.e., between 0 and 1. Fit the linear probability model (“lm” in R), setting predicted probabilities less than 0 to some very small number (e.g., 1e-5) and probabilities greater than 1 to just below 1. Predict the binary variable “pred_default” as Yes/No using the modified fitted values from the previous step. Using the actual outcomes in the “Default” dataset, compute the confusion matrix. Hint: table(pred_default, Default$default)

```r
library(ISLR2)
```

```
## Warning: package 'ISLR2' was built under R version 4.2.3
```

```r
View(Default)
attach(Default)

# Recode the No/Yes factor to 0/1 (as.numeric() on the factor gives 1/2)
Default$default <- as.numeric(Default$default) - 1

# Linear probability model on all predictors
fit <- lm(default ~ ., data = Default)
predicted_probs <- predict(fit, type = "response")

# Clamp the fitted values into the [0, 1] probability range
predicted_probs[predicted_probs < 0] <- 1e-5
predicted_probs[predicted_probs > 1] <- 1 - 1e-5

pred_default <- ifelse(predicted_probs > 0.5, "Yes", "No")
confusion_matrix <- table(pred_default, Default$default)
confusion_matrix
```

```
##             
## pred_default    0    1
##           No 9667  333
```

## Q2

Now instead of “lm,” run a weighted least squares model using “gls.” You’ll need to run library(nlme) in order to do gls. The weights are specified as a linear function of the modified fitted values from the previous step. Again, adjust the predicted probabilities from the gls model to reasonable values, predict the binary response as Yes/No, and compute the confusion matrix.

```r
library(nlme)

# The weights are a linear function of the modified fitted values:
# varFixed() makes the error variance proportional to the given covariate
gls_data <- Default
gls_data$wt <- predicted_probs

gls_fit <- gls(default ~ student + balance + income,
               weights = varFixed(~ wt), data = gls_data)
predict_probs_gls <- fitted(gls_fit)

# Clamp the gls fitted values the same way as in Q1
predict_probs_gls[predict_probs_gls < 0] <- 1e-5
predict_probs_gls[predict_probs_gls > 1] <- 1 - 1e-5

pred_default_gls <- ifelse(predict_probs_gls > 0.5, "Yes", "No")
confusion_matrix_2 <- table(pred_default_gls, Default$default)
confusion_matrix_2
```

```
##                 
## pred_default_gls    0    1
##               No 9667  333
```
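As a cross-check on the gls fit, the same weighted least squares estimates can be obtained with base R’s lm(), whose weights argument takes inverse variances. A minimal sketch under the same variance assumption (wls_fit and wls_probs are illustrative names, not part of the assignment):

```r
# WLS by hand: lm() weights are inverse variances, so weight by the
# reciprocal of the clamped fitted values used in varFixed() above
wls_fit <- lm(default ~ student + balance + income,
              data = Default, weights = 1 / predicted_probs)

# Clamp and threshold the fitted values exactly as before
wls_probs <- pmin(pmax(predict(wls_fit), 1e-5), 1 - 1e-5)
table(ifelse(wls_probs > 0.5, "Yes", "No"), Default$default)
```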
## Q3

Randomly partition the data into training, validation and test sets (proportions 80% / 10% / 10%). Set a seed of your choice for reproducibility. Using the naive Bayes classifier (“naiveBayes” in R library “e1071”) fit on the training set, predict on the training, validation and test sets and compute the 3 confusion matrices. The levels in the confusion matrices should be Yes/No. Note that the response variable should be of class “factor.” Hint for naiveBayes: ISLR textbook, section 4.7.5.

```r
library(e1071)
```

```
## Warning: package 'e1071' was built under R version 4.2.3
```

```r
set.seed(123)
n <- nrow(Default)

# 80% of rows for training; split the remaining 20% evenly
# between validation and test
train_index <- sample(seq_len(n), size = 0.8 * n)
remaining_index <- setdiff(seq_len(n), train_index)
valid_index <- sample(remaining_index, 0.1 * n)
test_index <- setdiff(remaining_index, valid_index)

train_data <- Default[train_index, ]
valid_data <- Default[valid_index, ]
test_data <- Default[test_index, ]

# naiveBayes needs a factor response; label the 0/1 levels No/Yes
train_data$default <- factor(train_data$default, levels = 0:1, labels = c("No", "Yes"))
valid_data$default <- factor(valid_data$default, levels = 0:1, labels = c("No", "Yes"))
test_data$default  <- factor(test_data$default,  levels = 0:1, labels = c("No", "Yes"))

fit_nb <- naiveBayes(default ~ ., data = train_data)

train_pred <- predict(fit_nb, train_data)
valid_pred <- predict(fit_nb, valid_data)
test_pred  <- predict(fit_nb, test_data)

confusion_matrix_train <- table(train_pred, train_data$default)
confusion_matrix_valid <- table(valid_pred, valid_data$default)
confusion_matrix_test  <- table(test_pred, test_data$default)

confusion_matrix_train
```

```
##           
## train_pred   No  Yes
##        No  7692  190
##        Yes   41   77
```

```r
confusion_matrix_valid
```

```
##           
## valid_pred  No Yes
##        No  962  23
##        Yes   5  10
```

```r
confusion_matrix_test
```

```
##          
## test_pred  No Yes
##       No  960  25
##       Yes   7   8
```
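The three matrices are easier to compare on a single number. A minimal sketch using an accuracy helper of our own (not part of e1071):

```r
# Overall accuracy: correct predictions (the diagonal) over the total
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

accuracy(confusion_matrix_train)  # training
accuracy(confusion_matrix_valid)  # validation
accuracy(confusion_matrix_test)   # test
```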
## Q4

Using the same training, validation and test sets, run a logistic regression (“glm” in R for binary response) on the training data. Note that the response variable should be of class “factor.” Again, predict on the training, validation and test sets and compute the 3 confusion matrices. The levels in the confusion matrices should be Yes/No.

```r
# Logistic regression on the training set; "Yes" is the second factor
# level, so glm models the probability of default
logistic_model <- glm(default ~ ., data = train_data, family = binomial)

train_pred_glm <- ifelse(predict(logistic_model, type = "response") > 0.5,
                         "Yes", "No")
valid_pred_glm <- ifelse(predict(logistic_model, newdata = valid_data,
                                 type = "response") > 0.5, "Yes", "No")
test_pred_glm  <- ifelse(predict(logistic_model, newdata = test_data,
                                 type = "response") > 0.5, "Yes", "No")

confusion_matrix_train_glm <- table(train_pred_glm, train_data$default)
confusion_matrix_valid_glm <- table(valid_pred_glm, valid_data$default)
confusion_matrix_test_glm  <- table(test_pred_glm, test_data$default)

confusion_matrix_train_glm
```

```
##               
## train_pred_glm   No  Yes
##            No  7698  180
##            Yes   35   87
```

```r
confusion_matrix_valid_glm
```

```
##               
## valid_pred_glm  No Yes
##            No  964  23
##            Yes   3  10
```

```r
confusion_matrix_test_glm
```

```
##              
## test_pred_glm  No Yes
##           No  964  23
##           Yes   3  10
```

## Q5

The linear probability model rests on the weakest assumptions: it fits a straight line to a binary outcome, so its fitted values need not lie in [0, 1] and must be clamped by hand. The WLS model refines it by weighting observations according to the error variance implied by the fitted values. Naive Bayes assumes the predictors are independent given the class, while logistic regression models the log-odds of the binary outcome directly and handles both categorical and continuous predictors. Based on the confusion matrices above, logistic regression performs best here: it misclassifies 26 observations on both the validation and test sets, versus 28 and 32 for naive Bayes, and unlike the linear models it produces well-behaved probabilities without any ad hoc clamping.
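To make the comparison concrete, the accuracy helper from Q3 can be applied to the validation and test matrices above (the figures in the comments are read off the printed tables):

```r
# Validation and test accuracy, naive Bayes vs. logistic regression
accuracy(confusion_matrix_valid)      # NB:  (962 + 10) / 1000 = 0.972
accuracy(confusion_matrix_valid_glm)  # GLM: (964 + 10) / 1000 = 0.974
accuracy(confusion_matrix_test)       # NB:  (960 +  8) / 1000 = 0.968
accuracy(confusion_matrix_test_glm)   # GLM: (964 + 10) / 1000 = 0.974
```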
## Q6

```r
options(repos = "https://cran.rstudio.com/")

# cmc.data ships without a header row, so read.csv()'s default
# header = TRUE consumes the first record as column names (hence the
# "X24", "X2", ... names below); header = FALSE with col.names would
# keep every observation
cmc <- read.csv("E:/econ484/cmc.data")
View(cmc)

names(cmc)[names(cmc) == 'X24']  <- 'wife_age'
names(cmc)[names(cmc) == 'X2']   <- 'wife_edu'
names(cmc)[names(cmc) == 'X3']   <- 'husband_edu'
names(cmc)[names(cmc) == 'X3.1'] <- 'num_children'
names(cmc)[names(cmc) == 'X1']   <- 'wife_religion'
names(cmc)[names(cmc) == 'X1.1'] <- 'wife_working'
names(cmc)[names(cmc) == 'X2.1'] <- 'husband_occupation'
names(cmc)[names(cmc) == 'X3.2'] <- 'standard_of_living_index'
names(cmc)[names(cmc) == 'X0']   <- 'media_exposure'
names(cmc)[names(cmc) == 'X1.2'] <- 'contraceptive_method'

# One-time package installation
install.packages("mlogit")
install.packages("zoo")

library(zoo)
```

```
## Warning: package 'zoo' was built under R version 4.2.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
```

```r
library(mlogit)
```

```
## Warning: package 'mlogit' was built under R version 4.2.3
## Loading required package: dfidx
## Warning: package 'dfidx' was built under R version 4.2.3
## 
## Attaching package: 'dfidx'
## 
## The following object is masked from 'package:stats':
## 
##     filter
```

```r
# Reshape to the long ("one row per alternative") format mlogit expects
long_data <- mlogit.data(cmc, choice = "contraceptive_method", shape = "wide")
View(long_data)

# Multinomial logit: individual-specific covariates go after the "|"
mlogit_model <- mlogit(contraceptive_method ~ 1 | wife_age + wife_edu +
                         husband_edu + num_children + wife_religion +
                         wife_working + husband_occupation +
                         standard_of_living_index + media_exposure,
                       data = long_data)
summary(mlogit_model)
```
```
## 
## Call:
## mlogit(formula = contraceptive_method ~ 1 | wife_age + wife_edu + 
##     husband_edu + num_children + wife_religion + wife_working + 
##     husband_occupation + standard_of_living_index + media_exposure, 
##     data = long_data, method = "nr")
## 
## Frequencies of alternatives:choice
##       1       2       3 
## 0.42663 0.22622 0.34715 
## 
## nr method
## 5 iterations, 0h:0m:0s 
## g'(-H)^-1g = 0.000362 
## successive function values within tolerance limits 
## 
## Coefficients :
##                              Estimate Std. Error z-value  Pr(>|z|)    
## (Intercept):2               -3.229492   0.759486 -4.2522 2.117e-05 ***
## (Intercept):3               -0.099123   0.627502 -0.1580 0.8744843    
## wife_age:2                  -0.045815   0.012060 -3.7988 0.0001454 ***
## wife_age:3                  -0.105979   0.011256 -9.4150 < 2.2e-16 ***
## wife_edu:2                   0.881807   0.114534  7.6991 1.377e-14 ***
## wife_edu:3                   0.366726   0.087486  4.1918 2.767e-05 ***
## husband_edu:2               -0.084184   0.133938 -0.6285 0.5296593    
## husband_edu:3                0.055833   0.100691  0.5545 0.5792400    
## num_children:2               0.344962   0.042878  8.0453 8.882e-16 ***
## num_children:3               0.351056   0.038624  9.0891 < 2.2e-16 ***
## wife_religion:2             -0.478887   0.200415 -2.3895 0.0168722 *  
## wife_religion:3             -0.325347   0.198101 -1.6423 0.1005213    
## wife_working:2               0.032842   0.168364  0.1951 0.8453437    
## wife_working:3               0.168330   0.150681  1.1171 0.2639382    
## husband_occupation:2        -0.081099   0.097250 -0.8339 0.4043236    
## husband_occupation:3         0.179077   0.084399  2.1218 0.0338566 *  
## standard_of_living_index:2   0.342624   0.096783  3.5401 0.0003999 ***
## standard_of_living_index:3   0.228939   0.072548  3.1557 0.0016013 ** 
## media_exposure:2            -0.445152   0.388015 -1.1473 0.2512759    
## media_exposure:3            -0.479717   0.272739 -1.7589 0.0785969 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -1390.4
## McFadden R^2:  0.11469
## Likelihood ratio test : chisq = 360.25 (p.value = < 2.22e-16)
```

Question: Which variable(s) significantly affect the current contraceptive method choice?

From the summary, the wife’s age, the wife’s education, the number of children and the standard-of-living index are significant for both alternatives at the 5% level; the wife’s religion is significant only for alternative 2 and the husband’s occupation only for alternative 3. Among these, the wife’s education has the largest coefficients (0.88 and 0.37 relative to the base alternative), so it appears to influence the choice of contraceptive method the most.
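Because multinomial-logit coefficients are log-odds relative to the base alternative (method 1), exponentiating them yields relative risk ratios that are easier to read; a short sketch:

```r
# exp(coef) = relative risk ratio vs. the base alternative; e.g. a
# one-unit increase in wife_edu multiplies the odds of alternative 2
# (relative to alternative 1) by exp(0.881807), roughly 2.42
exp(coef(mlogit_model))
```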