Logistic Regression

School

University of California, Berkeley

Course

MISC

Subject

Statistics

Date

Apr 3, 2024

Uploaded by jenniezhu

1 Objective
2 Credit Card Default Data
3 Logistic Regression
3.1 Train a logistic regression model with all variables
3.1.1 (Optional) Two-way contingency table and Chi-square test
3.2 Get some criteria of model fitting
3.3 Prediction
4 Summary
4.1 Things to remember
4.2 Guide for Assignment

1 Objective

The objective of this case is to help you understand logistic regression (binary classification) and some important ideas such as cross-validation, the ROC curve, and the cut-off probability.

2 Credit Card Default Data

We will use the Credit Card Default data for this lab and illustration. The details of the data can be found at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Think about what kinds of factors could cause people to fail to pay their credit balance.

We first load the credit scoring data. It is easy to load comma-separated values (CSV).

    credit_data <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/credit_default.csv", header = TRUE)

Let's look at what information we have.

    colnames(credit_data)

    ##  [1] "LIMIT_BAL"                  "SEX"
    ##  [3] "EDUCATION"                  "MARRIAGE"
    ##  [5] "AGE"                        "PAY_0"
    ##  [7] "PAY_2"                      "PAY_3"
    ##  [9] "PAY_4"                      "PAY_5"
    ## [11] "PAY_6"                      "BILL_AMT1"
    ## [13] "BILL_AMT2"                  "BILL_AMT3"
    ## [15] "BILL_AMT4"                  "BILL_AMT5"
    ## [17] "BILL_AMT6"                  "PAY_AMT1"
    ## [19] "PAY_AMT2"                   "PAY_AMT3"
    ## [21] "PAY_AMT4"                   "PAY_AMT5"
    ## [23] "PAY_AMT6"                   "default.payment.next.month"

Let's look at what proportion of people actually defaulted in this sample.

    mean(credit_data$default.payment.next.month)
    ## [1] 0.2193333

The name of the response variable is too long! We can make it shorter by renaming it. Recall the rename() function.

    library(dplyr)
    credit_data <- rename(credit_data, default = default.payment.next.month)

How about the variable types and summary statistics?

    str(credit_data)     # structure - see variable type
    summary(credit_data) # summary statistics

We see that all variables are int, but we know that SEX, EDUCATION, and MARRIAGE are categorical, so we convert them to factor.

    credit_data$SEX <- as.factor(credit_data$SEX)
    credit_data$EDUCATION <- as.factor(credit_data$EDUCATION)
    credit_data$MARRIAGE <- as.factor(credit_data$MARRIAGE)

We omit other EDA here, but you shouldn't whenever you are doing data analysis.

3 Logistic Regression

Randomly split the data into training (80%) and testing (20%) datasets:

    # no seed is set, so the split (and the numbers shown below) will differ between runs
    index <- sample(nrow(credit_data), nrow(credit_data)*0.80)
    credit_train <- credit_data[index,]
    credit_test <- credit_data[-index,]

3.1 Train a logistic regression model with all variables

    credit_glm0 <- glm(default~., family=binomial, data=credit_train)
    summary(credit_glm0)

    ##
    ## Call:
    ## glm(formula = default ~ ., family = binomial, data = credit_train)
    ##
    ## Deviance Residuals:
    ##     Min       1Q   Median       3Q      Max
    ## -3.2047  -0.6919  -0.5367  -0.2797   3.0897
    ##
    ## Coefficients:
    ##               Estimate Std. Error z value Pr(>|z|)
    ## (Intercept) -1.126e+00  1.540e-01  -7.310 2.68e-13 ***
    ## LIMIT_BAL   -1.051e-06  2.878e-07  -3.654 0.000259 ***
    ## SEX2        -1.401e-01  5.470e-02  -2.562 0.010412 *
    ## EDUCATION2  -1.173e-01  6.373e-02  -1.840 0.065770 .
    ## EDUCATION3  -1.513e-01  8.559e-02  -1.767 0.077170 .
    ## EDUCATION4  -1.319e+00  3.333e-01  -3.959 7.54e-05 ***
    ## MARRIAGE2   -1.600e-01  6.175e-02  -2.591 0.009582 **
    ## MARRIAGE3    3.789e-02  2.398e-01   0.158 0.874436
    ## AGE          9.005e-03  3.326e-03   2.708 0.006774 **
    ## PAY_0        5.925e-01  3.197e-02  18.533  < 2e-16 ***
    ## PAY_2        7.969e-02  3.648e-02   2.185 0.028919 *
    ## PAY_3        1.312e-02  4.051e-02   0.324 0.746049
    ## PAY_4        9.896e-02  4.472e-02   2.213 0.026901 *
    ## PAY_5        6.529e-02  4.751e-02   1.374 0.169386
    ## PAY_6       -1.356e-02  3.999e-02  -0.339 0.734658
    ## BILL_AMT1   -7.046e-06  2.034e-06  -3.465 0.000531 ***
    ## BILL_AMT2    4.992e-06  2.507e-06   1.992 0.046416 *
    ## BILL_AMT3    2.279e-06  2.191e-06   1.040 0.298208
    ## BILL_AMT4   -4.197e-06  2.397e-06  -1.751 0.079955 .
    ## BILL_AMT5    2.595e-06  2.757e-06   0.941 0.346575
    ## BILL_AMT6    7.560e-07  2.197e-06   0.344 0.730752
    ## PAY_AMT1    -1.031e-05  3.302e-06  -3.124 0.001786 **
    ## PAY_AMT2    -7.732e-06  3.287e-06  -2.353 0.018646 *
    ## PAY_AMT3    -6.155e-07  2.996e-06  -0.205 0.837236
    ## PAY_AMT4    -3.381e-06  2.809e-06  -1.203 0.228842
    ## PAY_AMT5    -1.002e-06  2.961e-06  -0.338 0.735155
    ## PAY_AMT6    -1.400e-06  2.104e-06  -0.666 0.505723
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for binomial family taken to be 1)
    ##
    ##     Null deviance: 10060.6  on 9599  degrees of freedom
    ## Residual deviance:  8770.4  on 9573  degrees of freedom
    ## AIC: 8824.4

    ##
    ## Number of Fisher Scoring iterations: 5

You have seen glm() before. In this lab, it is the main function used to build the logistic regression model, because logistic regression is a member of the generalized linear model family. In glm(), the only new argument is family. It specifies the distribution of your response variable. You may also specify the link function after the name of the distribution, for example family=binomial(logit) (the default link is logit). You can also specify family=binomial(link = "probit") to run probit regression. glm() can be used to build many other generalized linear models as well.

3.1.1 (Optional) Two-way contingency table and Chi-square test

A two-way contingency table is a very useful tool for exploring the relationship between categorical variables. It is essentially the simplest pivot table (see the example below). Often, after you create a two-way contingency table, a Chi-square test is used to test whether X and Y are associated. The null hypothesis is that X and Y are independent (e.g., MARRIAGE has nothing to do with the likelihood of default). The test statistic is defined as

$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}},$$

where the sum runs over all cells of the table and the expected count is calculated by assuming the row variable has nothing to do with the column variable. Here is a very good tutorial for the Chi-square test: https://www.youtube.com/watch?v=WXPBoFDqNVk.

    table_edu <- table(credit_data$EDUCATION, credit_data$default)
    table_edu
    ##
    ##        0    1
    ##   1 3399  829
    ##   2 4298 1298
    ##   3 1501  491
    ##   4  170   14

    chisq.test(table_edu)

    ##
    ##  Pearson's Chi-squared test
    ##
    ## data:  table_edu
    ## X-squared = 49.19, df = 3, p-value = 1.189e-10

What we see from the test result above is that the p-value < 0.05. What is your conclusion?

3.2 Get some criteria of model fitting

You can simply extract some criteria of the model fit, for example the residual deviance (the analogue of SSE in a linear regression model), the mean residual deviance, AIC and BIC. Unlike linear regression models, there is no $R^2$ in logistic regression.

    # in-sample residual deviance
    credit_glm0$deviance
    ## [1] 8770.363

    # in-sample mean residual deviance using df
    credit_glm0$deviance/credit_glm0$df.residual
    ## [1] 0.9161562

    AIC(credit_glm0)
    ## [1] 8824.363

    BIC(credit_glm0)
    ## [1] 9017.94

3.3 Prediction

Similar to linear regression, we use the predict() function for prediction. To get predictions from a logistic regression model, there are several steps you need to understand. Refer to the textbook/slides for the detailed math.

1. The fitted model gives you the estimated linear predictor $\hat{\eta} = b_0 + b_1 x_1 + b_2 x_2 + \dots$, that is, the value before the inverse of the link function is applied (logit in the case of logistic regression). In logistic regression the $\hat{\eta}$ are called the log odds, which is $\log\left(\frac{P(y=1)}{1-P(y=1)}\right)$. In R you use the predict() function to get a vector of all in-sample $\hat{\eta}$ (one for each training observation).

    hist(predict(credit_glm0))

2. For each $\hat{\eta}$, in order to get P(y=1), we apply the inverse of the link function (logit here) to $\hat{\eta}$. The equation is $P(y=1) = \frac{1}{1+\exp(-\hat{\eta})}$. In R you use the fitted() function or predict(, type = "response") to get the predicted probability for each training observation.

    pred_resp <- predict(credit_glm0, type="response")
    hist(pred_resp)
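The relationship between steps 1 and 2 can be checked numerically. The following is a minimal sketch on simulated data (the data frame `sim`, the true coefficients -1 and 2, and the seed are assumptions for illustration, not part of the credit data). Applying the inverse logit to the output of predict() should reproduce both predict(, type = "response") and fitted().

```r
# Simulated binary data (hypothetical; stands in for the credit data)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))
sim <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = binomial, data = sim)

# Step 1: linear predictor (log odds); type = "link" is the default
eta_hat <- predict(fit, type = "link")

# Step 2: inverse logit, written out by hand and via plogis()
p_manual <- 1 / (1 + exp(-eta_hat))
p_plogis <- plogis(eta_hat)

# Probabilities returned directly by R
p_resp <- predict(fit, type = "response")

# Each comparison should report TRUE (up to floating-point tolerance)
all.equal(p_manual, p_resp)
all.equal(p_plogis, p_resp)
all.equal(p_resp, fitted(fit))
```

plogis() is base R's logistic distribution function, so it can serve as a ready-made inverse logit instead of typing the formula.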

3. Last but not least, you want a binary classification decision rule. The default rule is: if the fitted $P(y=1) > 0.5$ then $\hat{y} = 1$. The value 0.5 is called the cut-off probability. You can choose the cut-off probability based on the misclassification rate, a cost function, etc. In this case, the cost function can express the trade-off between the risk of giving a loan to someone who cannot pay (predict 0, truth 1) and the risk of rejecting someone who qualifies (predict 1, truth 0).

These tables illustrate the impact of choosing different cut-off probabilities. Choosing a large cut-off probability will result in few cases being predicted as 1, and choosing a small cut-off probability will result in many cases being predicted as 1.

    table(credit_train$default, (pred_resp > 0.5)*1, dnn=c("Truth","Predicted"))
    ##      Predicted
    ## Truth    0    1
    ##     0 7308  202
    ##     1 1556  534

    table(credit_train$default, (pred_resp > 0.2)*1, dnn=c("Truth","Predicted"))
    ##      Predicted
    ## Truth    0    1
    ##     0 4653 2857
    ##     1  609 1481

    table(credit_train$default, (pred_resp > 0.0001)*1, dnn=c("Truth","Predicted"))
    ##      Predicted
    ## Truth    0    1
    ##     0    1 7509
    ##     1    0 2090

4 Summary

4.1 Things to remember

Know how to use glm() to build logistic regression;

4.2 Guide for Assignment

EDA
Train logistic model
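As a numeric companion to the three confusion tables in Section 3.3, the sketch below recomputes the in-sample misclassification rate, false positive rate and false negative rate from the counts shown there, plus a weighted cost. The helper names `make_table` and `summarise_cutoff` and the 5:1 cost weights are hypothetical choices for illustration; this lab does not specify a cost function.

```r
# Build a 2x2 confusion table (rows = truth, columns = prediction)
# from the four counts printed in Section 3.3.
make_table <- function(tn, fp, fn, tp) {
  matrix(c(tn, fn, fp, tp), nrow = 2,
         dimnames = list(Truth = c("0", "1"), Predicted = c("0", "1")))
}

# Counts copied from the tables for cut-offs 0.5, 0.2 and 0.0001
tables <- list(
  "0.5"    = make_table(tn = 7308, fp = 202,  fn = 1556, tp = 534),
  "0.2"    = make_table(tn = 4653, fp = 2857, fn = 609,  tp = 1481),
  "0.0001" = make_table(tn = 1,    fp = 7509, fn = 0,    tp = 2090)
)

# w_fn and w_fp are assumed weights: a missed default (predict 0, truth 1)
# is treated as five times as costly as a wrongly rejected customer.
summarise_cutoff <- function(tab, w_fn = 5, w_fp = 1) {
  n  <- sum(tab)
  fp <- tab["0", "1"]
  fn <- tab["1", "0"]
  c(misclass = (fp + fn) / n,
    fpr      = fp / sum(tab["0", ]),
    fnr      = fn / sum(tab["1", ]),
    cost     = (w_fn * fn + w_fp * fp) / n)
}

# One row per cut-off, one column per criterion
res <- t(sapply(tables, summarise_cutoff))
round(res, 3)
```

With these assumed weights, the 0.2 cut-off has the lowest cost of the three, even though 0.5 has the lowest misclassification rate (1758/9600, about 0.183). Different weights can change that ranking, so the cost function should reflect the actual trade-off. These are training-set figures; the same calculation applies to predictions on credit_test.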
