HOMEWORK 3

School: Saint Paul College
Course: 2420
Subject: Computer Science
Date: Apr 3, 2024
Uploaded by: Aishaomar922

HOMEWORK 3 -- Classification
Business Context: Banks/Financial Companies

1) Load the data
• Load the data. Then, use table() to obtain information on how many customers defaulted and did not default in this month, respectively.

2) Use ggplot2 to create a boxplot of Age grouped by DEFAULT.
• Your plot should have a proper title and axis labels.
• Based on the plot, do the defaulters and non-defaulters differ by age?
  o Based on the medians shown in the boxplot, defaulters appear to differ in age from non-defaulters: the median age of defaulters is lower than that of non-defaulters.

3) Use ggplot2 to create a boxplot of LIMIT_BAL by DEFAULT.
• Your plot should have a proper title and axis labels.
• Please set y = LIMIT_BAL/1000 to make the numbers on the Y axis more readable.
• Based on the plot, do the defaulters and non-defaulters differ in credit limit?
  o The credit limit does not differ by a lot: the median values are close to each other, the boxes are about the same size, and both groups show outer whiskers and extended points.

4) Split the data into 80% training data and 20% test data.
• To increase the reproducibility of the results, set the random seed to 123456 using set.seed(123456) before random partitioning.
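The seeded 80/20 split in step 4 can be sketched as follows. This is a minimal illustration in plain Python rather than the assignment's R workflow; the function name train_test_split and the stand-in rows are hypothetical, and Python's random generator will not reproduce the partition that set.seed(123456) gives in R.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=123456):
    """Shuffle row indices with a fixed seed, then cut them into train and test."""
    rng = random.Random(seed)          # private generator, so the seed is local
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical stand-in for the customer records (one integer per customer).
rows = list(range(1000))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Because the seed is fixed, calling the function again on the same rows returns the same partition, which is the point of setting the seed before partitioning.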

B. k-NN

1) Next, you decide to train a k-NN model
• First, assess whether you need to standardize the data and explain why (or why not).
• If yes, apply the standardization to the chosen fields, making sure not to override the original dataset.

2) Train a k-NN model
• Train the model for 5 different values of k: 5, 10, 20, 30, 40
• Which k is the best according to the model summary?
  o If you prioritize accuracy, k = 40 is the best. If you prioritize Kappa, k = 5 is the best.
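On the standardization question in B.1: k-NN compares customers by distance, so a field on a large scale such as LIMIT_BAL would dominate a field such as AGE unless both are rescaled. The sketch below shows z-score standardization written to a new object so the original data is not overridden. It is plain Python rather than R, the toy values are made up, and the helper name standardize is hypothetical; only the column names come from the assignment.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]


# Hypothetical toy records; the numbers are invented for illustration.
customers = {
    "AGE": [24, 35, 46],
    "LIMIT_BAL": [20000, 120000, 220000],
}

# Build a new dictionary so the original dataset is left untouched.
customers_std = {name: standardize(col) for name, col in customers.items()}

print(customers_std["AGE"])
print(customers_std["LIMIT_BAL"])
print(customers["AGE"])  # original values are still available
```

After rescaling, both columns are on the same unitless scale, so neither dominates the distance calculation.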

3) Plot the accuracy of kNN as a function of k
• Explain the concept of accuracy.
• Explain how the model accuracy changes with the number of neighbors k.
Note: since the model is trained on a random subset of data, we may get slightly different results.

- Accuracy increases as the number of neighbors increases: over the values of k tested, letting more neighbors vote on each prediction produced more correct classifications.

4) Make (class) predictions on the test set using the k-NN model

5) Create the confusion matrix
• In doing so, please use "YES" as the positive class.
• The chosen mode of the confusion matrix should allow you to answer the questions below.
• Use the results to answer:
  o Among 100 customers predicted to default on their debt, how many are expected to truly default?
  o For every 100 default customers, how many are expected to be caught by this algorithm (i.e. classified as "YES")?
  o What does the accuracy you obtained in the result tell us?
• Report the accuracy, recall, precision, and F1-Score of the positive class (DEFAULT=YES)

Accuracy indicates the overall correctness of the model's predictions. In this case, the model is correct approximately 78.46% of the time.
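The questions in step 5 map directly onto precision and recall for the positive class "YES". The sketch below computes the four requested metrics from confusion-matrix counts. It is plain Python rather than R, the function name classification_metrics is hypothetical, and the counts are made up for illustration; they are not the results of this homework.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute positive-class metrics from confusion-matrix counts.

    tp: predicted YES, actually YES    fp: predicted YES, actually NO
    fn: predicted NO,  actually YES    tn: predicted NO,  actually NO
    """
    precision = tp / (tp + fp)                  # share of predicted YES that are truly YES
    recall = tp / (tp + fn)                     # share of actual YES that are caught
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=30, fp=20, fn=70, tn=380)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts, precision is 0.60, so about 60 of every 100 customers predicted to default would truly default, and recall is 0.30, so about 30 of every 100 defaulters would be caught. Accuracy is 0.82 even though most defaulters are missed, which is why accuracy alone can mislead when non-defaulters far outnumber defaulters.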

C. Decision Trees

1) Next, you decide to train a Decision Tree model
• Decide whether to use the standardized data or not and explain your choice.
• Use the data partitions you have created earlier, if possible. Otherwise, create the partitions.

The standardized dataset is used because of the presence of default instances and concerns about overlapping variables in the non-standardized data. Standardizing ensures proportional feature contribution, mitigates scale-related issues, and maintains consistency across models. Note that tree splits depend only on the ordering of each feature's values, so standardization should not change the splits themselves; the main benefit here is consistency with the k-NN inputs.

2) Train the decision tree
• It is ok if your tree is small or very large.
• If the tree obtained does not seem meaningful (for example, it only shows one node and no branches), you may run it again.

3) Based on the tree you obtain, answer the following questions
• What attributes are used in your decision tree? Did it use all the attributes available?
• Interpret the label on one of the nodes.
• Pick the first two leaf nodes, identify the conditions leading to these nodes and the prediction made for these nodes.

First Leaf Node:
Conditions:
  Marriage: No
  Age: Less than 27
  Limit balance: Less than 125,000
  Pay amount: Less than 855
Prediction: No default (0.31)

Second Leaf Node:
Conditions:
  Marriage: No
  Age: Less than 27
  Limit balance: Less than 125,000
  Pay amount: Greater than or equal to 855
  Gender: 2 (unclear representation)
Prediction: Default (0.74)

These leaf nodes represent specific conditions under which predictions are made regarding loan default. For example, the first leaf node predicts a lower probability of default (0.31) for borrowers who are not married, younger than 27, have a limit balance less than 125,000, and have made a payment of less than 855. The second leaf node, under similar demographic conditions, predicts a higher probability of default (0.74) when the payment amount is greater than or equal to 855, with an additional condition related to an unclear representation of gender.

4) Make the class predictions on the test set and produce the confusion matrix
• Save the predictions in DT_predictions.
• Produce a confusion matrix.
• Use "YES" as the positive class.
• Use the same mode as specified in kNN.
• Identify accuracy, recall, precision, and F1-Score of the positive class ("YES")
• Compare the performance of the kNN and DT models. Which one would you prefer and why?

Based on the provided metrics, the Decision Tree model generally outperforms the kNN model across the evaluation criteria. Therefore, if you prioritize overall performance, precision, recall, and balanced accuracy, you may prefer the Decision Tree model.
