Module 3 Assignment
College of Professional Studies, Northeastern University
ALY6015, 21626
Harpreet Sharma
January 29th, 2024
Table of Contents
Introduction
Analysis
  Figure 1 Scatterplot for Enrollment vs. Applicants
  Figure 2 Boxplot comparing Private and Public Schools
  Figure 3 Descriptive statistics for Enrollment numbers
  Figure 4 Descriptive Statistics for Out-of-state Tuition
  Figure 5 Summary of Logistic Regression Model
  Figure 6 Exponentiated Coefficients of the Logistic Regression Model
  Figure 7 Confusion Matrix for Train Set
  Figure 8 Model Metrics
  Figure 9 Confusion Matrix for Test set
  Figure 10 ROC Curve
  Figure 11 The area under the Curve
Conclusion/Interpretation
References
Appendices
  Appendix A Scatterplot of Enrollments vs. Applications
  Appendix B Boxplot of Private and Public Schools
  Appendix C Descriptive Statistics of the number of Enrollments
  Appendix D Descriptive Statistics for Out-of-state Tuition
  Appendix E Train and Test set
  Appendix F Logistic Regression Model
  Appendix G Confusion Matrix of Train Set
  Appendix H Confusion Matrix for Test Set
  Appendix I ROC Curve
Introduction
This report analyzes and predicts university classification (private or public) using logistic regression on the College dataset from the ISLR package, which comprises 777 records and 18 variables. The process involves exploratory data analysis, a train-test split, and model training. The model is evaluated with confusion matrices and with accuracy, precision, recall, and specificity on both the training and test sets. The ROC curve is used to assess how well the model discriminates between the two classes, and the AUC summarizes overall model performance.

Analysis
1. Exploratory Data Analysis
A scatterplot was created to explore the relationship between the number of applicants and enrollments (see Appendix A).
Figure 1
Scatterplot for Enrollment vs. Applicants
Figure 1 indicates a positive correlation between the two variables: as the number of applications increases, the number of enrollments tends to increase as well.
A boxplot was created to compare enrollments in private and non-private (public) schools (see Appendix B).
Figure 2
Boxplot comparing Private and Public Schools
Figure 2 indicates a large difference between the two groups: the median enrollment of public schools lies outside the box for private schools.
Descriptive statistics were calculated for enrollment numbers (see Appendix C).
Figure 3
Descriptive statistics for Enrollment numbers
Descriptive statistics were calculated for out-of-state tuition (see Appendix D).
Figure 4
Descriptive Statistics for Out-of-state Tuition
2. Split the data into train and test sets for model training and evaluation
The dataset was split into training (70%) and test (30%) sets, using a random seed for reproducibility (see Appendix E).
3. Fit a logistic regression model to the training set
A logistic regression model was fitted to the training set to predict whether a college is private or not, based on selected predictor variables (see Appendix F).
Figure 5
Summary of Logistic Regression Model
Figure 5 shows the summary of the logistic regression model. The intercept is highly significant. Apps is not statistically significant (p-value = 0.1803). Outstate and Accept are both statistically significant at the 0.05 significance level.
Figure 6
Exponentiated Coefficients of the Logistic Regression Model
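For reference when reading exponentiated coefficients, the standard form of a logistic regression model is sketched below. It assumes that the model contains only the three predictors named in the summary (Apps, Outstate, Accept) and that Private is the positive class; the symbols p and the beta coefficients are notation introduced here, not taken from the report's R output. The `\text` command assumes the amsmath package.

```latex
% Log-odds of a college being private as a linear function of the predictors
\[
\log\frac{p}{1-p}
  = \beta_0 + \beta_1\,\text{Apps} + \beta_2\,\text{Outstate} + \beta_3\,\text{Accept},
\qquad p = \Pr(\text{college is private})
\]
% Raising predictor x_j by one unit, with the others held fixed,
% multiplies the odds by the exponentiated coefficient
\[
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j},
\qquad \text{odds} = \frac{p}{1-p}
\]
```

Under this form, an exponentiated coefficient below 1 means the odds shrink as the predictor grows, and a value above 1 means they grow; the exponentiated intercept is the odds when every predictor equals zero.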
As shown in Figure 6, the exponentiated intercept is approximately 0.013, which is the estimated odds of a college being private (versus public) when all predictors equal zero. For each additional application received, the odds of a college being private (versus public) are multiplied by approximately 0.9997 (a decrease of about 0.03%), holding all other variables constant. For each one-unit increase in out-of-state tuition, the odds of a college being private are multiplied by approximately 1.0009 (an increase of about 0.09%), holding all other variables constant. For each additional accepted application, the odds of a college being private are multiplied by approximately 0.9993 (a decrease of about 0.07%), holding all other variables constant.
4. Confusion Matrix for Train Set
Figure 7
Confusion Matrix for Train Set
Figure 7 shows the confusion matrix of the logistic regression model for the training set (see Appendix G).
Interpretation
True Positives (TP): 377 (correctly predicted as Private colleges).
False Positives (FP): 21 (incorrectly predicted as Private colleges when they are Public).
True Negatives (TN): 128 (correctly predicted as Public colleges).
False Negatives (FN): 19 (incorrectly predicted as Public colleges when they are Private).
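The four training-set counts are enough to compute the usual classification metrics. The following is a minimal Python sketch, not the report's R code; the function name `classification_metrics` and the variable `train_metrics` are illustrative, and Private is assumed to be the positive class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and specificity as fractions."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all colleges classified correctly
        "precision": tp / (tp + fp),        # share of predicted Private that are Private
        "recall": tp / (tp + fn),           # share of actual Private that are found
        "specificity": tn / (tn + fp),      # share of actual Public that are found
    }


# Training-set counts reported for Figure 7 (Private = positive class).
train_metrics = classification_metrics(tp=377, fp=21, tn=128, fn=19)

for name, value in train_metrics.items():
    print(f"{name}: {value:.2%}")
```

The same function can be applied to any other set of four confusion-matrix counts, for example those of a held-out test set.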
False Positives vs. False Negatives
False Positives (predicting a college as Private when it is Public) can lead to inefficient use of resources, misallocation of funds, and wasted effort on colleges that do not require additional support or attention.
False Negatives (predicting a college as Public when it is Private) matter most where identifying Private colleges accurately is crucial for strategic planning or interventions. In those scenarios they can lead to missed opportunities, inadequate support, and underestimation of factors that require targeted interventions.
Therefore, in contexts where targeted interventions are critical, False Negatives might have the more significant impact, because they mean missed opportunities and inadequate support for colleges in need.
5. Interpret metrics for Accuracy, Precision, Recall, and Specificity
Figure 8
Model Metrics
As shown in Figure 8, the model metrics are as follows:
Accuracy: The model correctly predicts the college type approximately 92.66% of the time.
Precision: When the model predicts a college to be Private, it is correct approximately 94.72% of the time.
Recall: The model identifies approximately 95.20% of actual Private colleges.
Specificity: The model identifies approximately 85.91% of actual Public colleges.
6. Confusion Matrix for Test Set
Figure 9
Confusion Matrix for Test set
Figure 9 shows the confusion matrix of the logistic regression model for the test set (see Appendix H).
Interpretation
True Positives (TP): 156 (correctly identifies 156 Private colleges)
False Positives (FP): 10 (incorrectly classifies 10 Public colleges as Private)
True Negatives (TN): 53 (correctly identifies 53 Public colleges)
False Negatives (FN): 13 (incorrectly identifies 13 Private colleges as Public)
Model Metrics:
Accuracy: The overall accuracy of the model on the test set is approximately 90.09%.
Precision (Positive Predictive Value): Of all colleges predicted to be Private, approximately 93.98% are Private.
Recall (Sensitivity): The model correctly identifies approximately 92.31% of actual Private colleges.
Specificity: The model correctly identifies approximately 84.13% of actual Public colleges.
7. ROC Curve
Figure 10
ROC Curve
As shown in Figure 10, the ROC curve (see Appendix I) shows that the classifier achieves a high true positive rate while keeping the false positive rate low. This indicates that the model has good sensitivity (true positive rate) and specificity (true negative rate), suggesting it can effectively distinguish between the two classes.
8. AUC
Figure 11
The area under the Curve
As shown in Figure 11, the Area Under the Curve (AUC) for the ROC curve is 0.9598. This indicates that the classifier has a high ability to discriminate between the positive and negative classes across different threshold values. AUC values closer to 1 indicate better classifier performance, so a value of 0.9598 indicates excellent performance.

Conclusion/Interpretation
Based on the analysis, the logistic regression model demonstrates strong predictive performance in classifying colleges as private or non-private. The model's high accuracy, sensitivity, specificity, precision, and AUC highlight its robust performance on this classification task.
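As background for the AUC value discussed in section 8, the AUC can be read as the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The Python sketch below illustrates that definition on hypothetical toy data; it is not the report's R code, the function name `auc_pairwise` is illustrative, and the labels and scores are made up rather than taken from the College dataset.

```python
def auc_pairwise(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = positive class), NOT from the College dataset.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc_pairwise(toy_labels, toy_scores)
print(round(toy_auc, 4))
```

Comparing every pair is quadratic in the number of examples, which is fine for an illustration; library implementations typically use a sort-based computation instead.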
References
DataCamp. (2020). R Tutorial: Evaluating classification model performance [YouTube Video]. In YouTube. https://www.youtube.com/watch?v=Q-eB-DBeLqE
Epps, T. (2020, December). Private vs. Public Colleges: What’s the Difference? BestColleges.com. https://www.bestcolleges.com/blog/private-vs-public-colleges/
Kabacoff, R. (2015). R in action: Data analysis and graphics with R. Manning.
Appendices
Appendix A
Scatterplot of Enrollments vs. Applications
Appendix A details the R code used to plot a scatterplot exploring the relationship between the number of enrollments and applicants.
Appendix B
Boxplot of Private and Public Schools
Appendix B details the R code for the boxplot of private and public schools.
Appendix C
Descriptive Statistics of the number of Enrollments
Appendix C details the R code for the descriptive statistics of the number of enrollments.
Appendix D
Descriptive Statistics for Out-of-state Tuition
Appendix D details the R code for the descriptive statistics of out-of-state tuition.
Appendix E
Train and Test set
Appendix E details the R code used to split the data into train and test sets.
Appendix F
Logistic Regression Model
Appendix F details the R code for the Logistic Regression Model.
Appendix G
Confusion Matrix of Train Set
Appendix G details the R code for the Confusion Matrix of the Train Set.
Appendix H
Confusion Matrix for Test Set
Appendix H details the R code for the Confusion Matrix of the Test Set.
Appendix I
ROC Curve