Beteta - Project Two MAT 303

MAT 303 Project Two Summary Report
Diego Beteta
diego.beteta@snhu.edu
Southern New Hampshire University
1. Introduction

In this project, we explore a dataset related to heart disease that includes health indicators such as age, sex, type of chest pain, blood pressure, cholesterol level, fasting blood sugar, and maximum heart rate achieved. The primary goal is to use statistical models to predict the risk of heart disease and certain cardiovascular metrics such as maximum heart rate. To achieve this, we employ logistic regression for binary classification, estimating the likelihood that heart disease is present, and random forest models both for classifying heart disease risk and for predicting continuous variables such as maximum heart rate. The results of these analyses could be used in a clinical setting to help healthcare professionals identify individuals at higher risk for heart disease, enabling earlier and more targeted interventions.

2. Data Preparation

Key variables in this heart disease dataset include age, sex, type of chest pain (cp), resting blood pressure (trestbps), cholesterol level (chol), fasting blood sugar (fbs), resting electrocardiographic results (restecg), maximum heart rate achieved (thalach), exercise-induced angina (exang), ST depression (oldpeak), the slope of the peak exercise ST segment (slope), the number of major vessels (ca), and thalassemia status (thal). Each variable provides insight into an individual's cardiovascular health and risk factors for heart disease. The dataset comprises 303 rows, each representing an individual patient record, and 14 columns, one for each of the variables above plus the target variable indicating the presence or absence of heart disease.

3.
Model #1 - First Logistic Regression Model

Reporting Results

The general form and the prediction equation of the logistic multiple regression model for heart disease (target) using age (x1 = age), resting blood pressure (x2 = trestbps), exercise-induced angina (x3 = exang1), and maximum heart rate achieved (x4 = thalach) are as follows:

General form:

E(y) = e^(β0 + β1x1 + β2x2 + β3x3 + β4x4) / (1 + e^(β0 + β1x1 + β2x2 + β3x3 + β4x4))

Prediction equation, in terms of the natural log of the odds, which expresses the beta terms in linear form:

ln(odds) = β0 + β1x1 + β2x2 + β3x3 + β4x4

Fitted model:

E(y) = e^(−1.0211 − 0.0175x1 − 0.0149x2 − 1.625x3 + 0.0311x4) / (1 + e^(−1.0211 − 0.0175x1 − 0.0149x2 − 1.625x3 + 0.0311x4))
In the context of the logistic regression model for predicting heart disease, the terms π and π/(1 − π) have specific meanings related to the probability of an individual having heart disease.

π: the probability that an individual has heart disease. It is a value between 0 and 1, where 0 means no chance of heart disease and 1 means certain presence of heart disease. In our model, π is what we are trying to predict from the input variables (age, resting blood pressure, exercise-induced angina, and maximum heart rate achieved).

π/(1 − π): the odds of heart disease. The odds compare the probability of having heart disease (π) to the probability of not having it (1 − π). For example, if π = 0.5, the odds π/(1 − π) equal 1, meaning the odds of having heart disease are equal to the odds of not having it. If π > 0.5, the odds are greater than 1, indicating a higher likelihood of heart disease; conversely, if π < 0.5, the odds are less than 1, suggesting a lower likelihood. In logistic regression, we use the natural logarithm of the odds as the outcome variable, and the model estimates these log odds as a linear combination of the predictor variables.

Prediction model equation:

ln(odds) = −1.0211 − 0.0175·age − 0.0149·trestbps − 1.625·exang1 + 0.0311·thalach

The estimated coefficient for maximum heart rate achieved (thalach) in our logistic regression model is 0.031095. This means that, holding the other predictors fixed, each one-unit increase in maximum heart rate (measured in beats per minute) increases the log odds of having heart disease by 0.031095. This positive coefficient suggests that higher maximum heart rates are associated with a greater likelihood of heart disease in our dataset. It is crucial to interpret this finding within the context of the model and other influencing factors; a higher heart rate is not a direct cause of heart disease but shows an association in the population we studied.
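To make the link between the log odds, the odds, and π concrete, here is a minimal Python sketch (the analysis itself was presumably done in another environment such as R) that evaluates the fitted Model #1 equation for one of the profiles discussed later in this report. Because the coefficients are rounded, the results only approximate the reported values.

```python
import math

# Rounded coefficients from the fitted Model #1
# (intercept, age, trestbps, exang1, thalach)
COEF = (-1.0211, -0.0175, -0.0149, -1.625, 0.031095)

def model1(age, trestbps, exang1, thalach):
    """Return (log_odds, odds, probability) under the fitted Model #1."""
    b0, b1, b2, b3, b4 = COEF
    log_odds = b0 + b1 * age + b2 * trestbps + b3 * exang1 + b4 * thalach
    odds = math.exp(log_odds)     # odds = pi / (1 - pi)
    pi = odds / (1.0 + odds)      # invert the log odds to recover pi
    return log_odds, odds, pi

# Profile: age 50, resting blood pressure 122, angina present, max heart rate 140
log_odds, odds, pi = model1(50, 122, 1, 140)
print(f"odds = {odds:.4f}, probability = {pi:.4f}")
```

Running this reproduces the reported figures for that profile to within rounding (odds near 0.373, probability near 27.2%).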
This coefficient interpretation should be integrated with clinical insight and other relevant variables for a comprehensive understanding.

Evaluating Model Significance

The Hosmer-Lemeshow test statistic follows a chi-square distribution, with degrees of freedom typically calculated as the number of groups minus 2. The test produced a chi-square value of 44.622 with 48 degrees of freedom, for a P-value of 0.612. Because this P-value is well above the 0.05 (5%) significance level, we do not reject the null hypothesis: there is no substantial evidence that the model fails to fit the data. According to the Hosmer-Lemeshow goodness-of-fit test, our model is a suitable fit for the data.
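The Hosmer-Lemeshow P-value can be checked from the chi-square statistic alone. As a standard-library sketch, the following Python uses the identity that for an even number of degrees of freedom 2m, P(χ² > x) equals the Poisson probability P(N ≤ m − 1) with mean x/2 (the test itself was presumably run with a dedicated statistics routine):

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function P(X > x) for a chi-square variable with even df,
    via the identity P(chi2 with 2m df > x) = P(Poisson(x/2) <= m - 1)."""
    if df % 2 != 0:
        raise ValueError("this shortcut requires an even number of degrees of freedom")
    m = df // 2
    lam = x / 2.0
    term = math.exp(-lam)     # Poisson pmf at 0
    total = term
    for i in range(1, m):
        term *= lam / i       # pmf at i computed from pmf at i - 1
        total += term
    return total

# Hosmer-Lemeshow statistic for Model #1: chi-square 44.622 on 48 df
print(f"p-value = {chi2_sf_even_df(44.622, 48):.3f}")
```

This reproduces the reported P-value of about 0.612, and the same function gives roughly 0.321 for the second model's statistic (chi-square 52 on 48 df).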
Such a result is favorable in the context of logistic regression analysis, as it indicates that the model's predictions agree well with the observed data across subgroups. To determine which terms in the logistic regression model are significant, we examine the p-values from Wald's test, which evaluates the significance of each coefficient. At a 5% level of significance, a term is considered significant if its p-value is less than 0.05.

  • Age (age): p-value = 0.3060. This is greater than 0.05, so age is not statistically significant at the 5% level.
  • Resting blood pressure (trestbps): p-value = 0.0741. This is also greater than 0.05, so resting blood pressure is not statistically significant at the 5% level.
  • Exercise-induced angina (exang1): p-value ≈ 1.07e-07. This is much less than 0.05, so exercise-induced angina is statistically significant at the 5% level.
  • Maximum heart rate achieved (thalach): p-value ≈ 1.92e-05. This is also much less than 0.05, so maximum heart rate achieved is statistically significant at the 5% level.

Based on Wald's test at the 5% significance level, exercise-induced angina and maximum heart rate achieved are significant in the model, while age and resting blood pressure are not.

Based on the confusion matrix from our model, the counts of true positives, true negatives, false positives, and false negatives are:

  • True Negatives (TN): 89 - the model correctly predicted 'no heart disease' (target = 0) in 89 cases.
  • False Positives (FP): 49 - the model incorrectly predicted 'heart disease' (target = 1) in 49 cases without heart disease.
  • False Negatives (FN): 31 - the model incorrectly predicted 'no heart disease' in 31 cases with heart disease.
  • True Positives (TP): 134 - the model correctly predicted 'heart disease' (target = 1) in 134 cases.
Based on the confusion matrix of our model, the following metrics are calculated:

  • Accuracy: approximately 73.60%. The model correctly predicts whether heart disease is present in about 73.60% of cases.
  • Precision: about 73.22%. When the model predicts heart disease, it is correct 73.22% of the time.
  • Recall (sensitivity): approximately 81.21%. The model correctly identifies 81.21% of the actual cases of heart disease.
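These figures follow directly from the confusion-matrix counts. A quick Python check:

```python
# Confusion-matrix counts reported for Model #1
tn, fp, fn, tp = 89, 49, 31, 134

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of positive predictions that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found

print(f"accuracy = {accuracy:.2%}, precision = {precision:.2%}, recall = {recall:.2%}")
```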
The ROC curve depicts the true positive rate (TPR, or sensitivity) versus the false positive rate (FPR, or 1 − specificity) across various thresholds. A curve that approaches the top-left corner reflects a desirable combination of high TPR and low FPR, indicating that the model is highly adept at differentiating between the classes of interest, in this case those with and without heart disease. The AUC for this model is approximately 0.8007, which is considered very good. The AUC is the probability that the model will rank a randomly chosen positive instance (a case with heart disease) higher in terms of risk than a randomly chosen negative instance (a case without heart disease); an AUC near 1.0 indicates excellent predictive ability. The ROC curve is instrumental in evaluating the performance of a classification model, especially when the balance between sensitivity and specificity is critical. A high AUC value, as seen with this model, suggests that it can distinguish between individuals with and without heart disease.

Making Predictions Using the Model

1. The probability of heart disease for a 50-year-old individual with a resting blood pressure of 122, exercise-induced angina, and a maximum heart rate of 140 is 27.16%. With a probability of 27.16%, the model suggests that individuals with these characteristics are moderately likely to have heart disease. This indicates a moderate risk for individuals with this profile and highlights the importance of considering these factors in assessing heart disease risk.
The odds of heart disease for an individual with these characteristics are approximately 0.3729, meaning such an individual is about 0.37 times as likely to have heart disease as not to have it. Odds below 1 indicate that, for this profile, heart disease is less likely than its absence, though the 27.16% probability is still high enough to be clinically relevant. This information can be valuable for medical assessments and risk evaluations.

2. The probability of heart disease for a 50-year-old individual with a resting blood pressure of 130, no exercise-induced angina, and a maximum heart rate of 165 is 94.19%. With a probability of 94.19%, the model indicates that individuals with these characteristics are highly likely to have heart disease, emphasizing the importance of these factors in assessing risk. The odds of heart disease for this profile are approximately 16.41, meaning such an individual is about 16.41 times as likely to have heart disease as not to have it. These odds underscore the substantial likelihood of heart disease for this combination of attributes, making it crucial information for medical evaluations and risk assessments.

4.
Model #2 - Second Logistic Regression Model

Reporting Results

This model uses age (x1 = age), resting blood pressure (x2 = trestbps), the chest pain type dummies (x3 = cp1, x4 = cp2, x5 = cp3), maximum heart rate achieved (x6 = thalach), a quadratic age term (x1²), and an age-by-heart-rate interaction (x1·x6).

General form:

E(y) = e^(β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x1² + β8x1x6) / (1 + e^(β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x1² + β8x1x6))

Prediction equation:

ln(odds) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x1² + β8x1x6

Fitted model:

E(y) = e^(−15.56 + 0.1744x1 − 0.0196x2 + 1.913x3 + 2.037x4 + 1.777x5 + 0.1363x6 + 0.0008x1² − 0.0019x1x6) / (1 + e^(−15.56 + 0.1744x1 − 0.0196x2 + 1.913x3 + 2.037x4 + 1.777x5 + 0.1363x6 + 0.0008x1² − 0.0019x1x6))

Prediction model equation:

ln(odds) = −15.56 + 0.1744·age − 0.0196·trestbps + 1.913·cp1 + 2.037·cp2 + 1.777·cp3 + 0.1363·thalach + 0.0008·age² − 0.0019·age·thalach
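As a sketch, the fitted Model #2 equation can be evaluated in Python using the rounded coefficients above. Note that the heavily rounded interaction coefficient (−0.0019) makes the result drift noticeably from the probabilities reported later in this section, so this illustrates the mechanics of the equation rather than reproducing the exact figures.

```python
import math

def model2_probability(age, trestbps, cp1, cp2, cp3, thalach):
    """Evaluate the fitted Model #2 log odds and convert them to a probability."""
    log_odds = (-15.56 + 0.1744 * age - 0.0196 * trestbps
                + 1.913 * cp1 + 2.037 * cp2 + 1.777 * cp3
                + 0.1363 * thalach
                + 0.0008 * age ** 2          # quadratic term I(age^2)
                - 0.0019 * age * thalach)    # interaction term age:thalach
    return 1.0 / (1.0 + math.exp(-log_odds))

# Profile discussed later: age 50, trestbps 115, no chest pain (all cp dummies 0),
# maximum heart rate 133
print(f"approximate probability = {model2_probability(50, 115, 0, 0, 0, 133):.3f}")
```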
Evaluating Model Significance

The Hosmer-Lemeshow test statistic follows a chi-square distribution, typically with degrees of freedom equal to the number of groups minus 2. Here, with a chi-square value of 52 and 48 degrees of freedom, the P-value is 0.3209. Because this P-value is well above the 0.05 (5%) significance level, we do not reject the null hypothesis: there is no substantial evidence that the model does not adequately fit the data. According to the Hosmer-Lemeshow goodness-of-fit test, the logistic regression model appears to be a suitable fit for the data. Such a result is favorable in logistic regression analysis, indicating that the model's predictions agree well with the observed data across segments.

Based on Wald's test at a 5% level of significance, we assess the significance of the terms in our logistic regression model by examining their p-values; a term is statistically significant if its p-value is less than 0.05. The intercept is not significant (p-value = 0.13988). Age is not statistically significant (p-value = 0.51357), indicating that age alone does not significantly influence the likelihood of heart disease in this model. Resting blood pressure (trestbps) is significant (p-value = 0.02916), suggesting that it is a meaningful predictor of heart disease risk. The chest pain types (cp1, cp2, cp3) are highly significant (cp1: p-value = 1.61e-05; cp2: p-value = 4.45e-09; cp3: p-value = 0.00117), indicating a strong relationship between chest pain type and the likelihood of heart disease. Maximum heart rate (thalach) is also significant (p-value = 0.00775), showing its importance in predicting heart disease.
The quadratic term for age (I(age^2)) is not significant (p-value = 0.63025), suggesting that the squared term of age does not significantly affect the risk of heart disease in this model. The interaction term between age and maximum heart rate (age:thalach) is significant (p-value = 0.03616), indicating that the combined effect of age and maximum heart rate significantly predicts heart disease. Overall, the model indicates that resting blood pressure, specific types of chest pain, maximum heart rate, and the interaction between age and maximum heart rate are significant predictors of heart disease, while age alone and the quadratic term for age are not.
Based on the confusion matrix from our model, the counts of true positives, true negatives, false positives, and false negatives are:

  • True Negatives (TN): 102 - the model correctly predicted 'no heart disease' (target = 0) in 102 cases.
  • False Positives (FP): 36 - the model incorrectly predicted 'heart disease' (target = 1) in 36 cases without heart disease.
  • False Negatives (FN): 36 - the model incorrectly predicted 'no heart disease' in 36 cases with heart disease.
  • True Positives (TP): 129 - the model correctly predicted 'heart disease' (target = 1) in 129 cases.

Based on this confusion matrix, the calculated metrics are:

  • Accuracy: approximately 76.24%. The model correctly predicts whether heart disease is present in about 76.24% of cases.
  • Precision: about 78.18%. When the model predicts heart disease, it is correct approximately 78.18% of the time.
  • Recall (sensitivity): approximately 78.18%. The model correctly identifies about 78.18% of the actual cases of heart disease.

These metrics provide a comprehensive overview of the model's performance in predicting heart disease, considering both its ability to identify true cases and its accuracy in ruling out the disease when it is absent. The ROC curve illustrates the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 − specificity) at various threshold settings. The curve rises quickly toward the upper-left corner, which indicates an excellent true positive rate at a low false positive rate. This shape suggests that the model discriminates strongly between the classes of interest, here individuals with and without heart disease. The AUC value is 0.8478, which is considered very good. This value represents the probability that the model will correctly rank a randomly chosen positive instance (a case with heart disease) higher than a randomly chosen negative instance (a case without heart disease) in terms of risk; an AUC closer to 1.0 indicates excellent predictive accuracy. The ROC curve is crucial for assessing a classification model's performance, particularly when the trade-off between sensitivity and specificity matters. The high AUC observed for this model suggests it effectively distinguishes between patients with and without heart disease.
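The ranking interpretation of AUC can be demonstrated directly: compare every positive-negative pair of predicted scores and count how often the positive case is ranked higher. The scores below are invented for illustration and are not taken from either model.

```python
from itertools import product

def auc_by_ranking(pos_scores, neg_scores):
    """AUC = P(random positive outranks random negative); ties count as 1/2."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy predicted probabilities (hypothetical, for illustration only)
pos = [0.9, 0.8, 0.7, 0.35]   # patients with heart disease
neg = [0.6, 0.4, 0.3, 0.2]    # patients without heart disease
print(auc_by_ranking(pos, neg))   # 0.875
```

A perfectly separating model scores 1.0 on this measure, and a model that cannot distinguish the classes scores 0.5.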
Making Predictions Using the Model

The probability of heart disease for an individual who is 50 years old, has a resting blood pressure of 115, does not experience chest pain, and has a maximum heart rate of 133 is 21.88%. A probability of 21.88% places this individual in a moderate-risk category: according to the model, there is a noticeable likelihood of heart disease. The absence of chest pain combined with the individual's age, blood pressure, and heart rate contributes to this level of risk. While not high-risk, this probability is significant enough to warrant attention and potentially further medical evaluation. Given a 21.88% probability, the odds of heart disease are approximately 0.28, meaning the individual is about 0.28 times as likely to have heart disease as not to have it. In a clinical setting, this information could guide further diagnostic testing or interventions.

The probability of heart disease for an individual who is 50 years old, has a resting blood pressure of 125, experiences typical angina, and has a maximum heart rate of 155 is 80.07%. A probability of 80.07% places this individual in a high-risk category: according to the model, such an individual is very likely to have heart disease. The combination of age, blood pressure, the presence of typical angina, and this heart rate level appears to be a significant risk factor, and the model suggests these characteristics strongly predict the likelihood of heart disease. In a clinical context, a prediction this high would likely influence medical decisions significantly; patients with this profile might be considered for aggressive risk management strategies, including lifestyle changes and possibly medical treatment.
Given an 80.07% probability of heart disease, the odds are approximately 4.01, meaning the individual is about 4 times as likely to have heart disease as not to have it.

5. Random Forest Classification Model

Reporting Results

The original dataset contains 303 rows. After splitting the dataset with 85% for training and 15% for testing using the specified seed, the training set contains 257 rows and the testing (validation) set contains 46 rows.
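The split sizes follow from taking the floor of 303 × 0.85 = 257.55. A standard-library Python sketch of such a seeded shuffle-and-split (the seed value below is hypothetical, and the report's analysis presumably used a different tool):

```python
import random

def split_indices(n_rows, train_fraction, seed):
    """Shuffle row indices reproducibly and split them into train/test lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(n_rows * train_fraction)   # floor: 303 * 0.85 -> 257
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(303, 0.85, seed=1)
print(len(train_idx), len(test_idx))   # 257 46
```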
Based on the graphic, the classification error stabilizes around the point where the model includes 30 trees; beyond this number, adding more trees yields no significant error reduction. Hence, a random forest model with 30 trees is the best choice for this dataset.

Evaluating the Utility of the Model

Based on the confusion matrix for the training set using a random forest model built with 30 trees, the calculated metrics are:

  • Accuracy: approximately 99.22% (the model correctly predicted almost all instances).
  • Precision: approximately 99.27% (when the model predicted a positive case, it was correct nearly all the time).
  • Recall: approximately 99.27% (it correctly identified almost all actual positive cases).

These metrics are exceptionally high and suggest the model has learned the training data very well. For the testing set, the metrics calculated from the confusion matrix are:

  • Accuracy: approximately 65.22% (the model correctly predicted around two-thirds of the instances).
  • Precision: approximately 73.08% (of the instances the model predicted as positive, around 73% were correct).
  • Recall: approximately 67.86% (the model correctly identified around 68% of all positive cases).
These metrics show that the model performs worse on the testing set than on the training set, which is expected since the testing set represents new, unseen data. There is room for improvement, particularly in reducing false positives and false negatives to raise accuracy, precision, and recall.

6. Random Forest Regression Model

Reporting Results

The original dataset contains 303 rows. After splitting with an 80% training and 20% testing ratio using the specified seed, the training set contains 242 rows and the testing set contains 61 rows. Based on the graphic, the mean squared error (MSE) for the random forest regression model stabilizes around the point where the model includes 20 trees; beyond this number, adding more trees yields no significant error reduction. Hence, a random forest model with 20 trees is the optimal choice for this dataset, balancing model complexity against accuracy.

Evaluating the Utility of the Random Forest Regression Model

1. Root mean squared error for the training set (11.6282):
  o This relatively low value indicates that the random forest model predicts the maximum heart rate quite accurately on the training data.
  o A lower RMSE on the training set suggests a good fit to the training data, but it is important to be cautious about overfitting, where the model is too tailored to the training data and may generalize poorly to new, unseen data.

2. Root mean squared error for the testing set (21.1139):
  o The RMSE is higher on the testing set than on the training set. This is typical, as the model was trained on the training data and is now evaluated on new, unseen data.
  o A higher RMSE on the testing set indicates that the model's predictions are less accurate on data it has not seen, which is normal. However, the gap between the training and testing RMSE should not be too large, as a large gap can indicate overfitting.

7. Conclusion

Both logistic regression models were compared in our scenario of predicting heart disease. The first model, which used basic predictors such as age, resting blood pressure, exercise-induced angina, and maximum heart rate, had good accuracy (73.60%) and was especially effective at identifying true cases of heart disease (recall of 81.21%). The second model, which added variables such as chest pain type and an interaction term between age and maximum heart rate, performed slightly better in accuracy (76.24%), precision (78.18%), and recall (78.18%). This suggests the second model is more adept at correctly classifying individuals with and without heart disease. Considering its higher performance metrics and its inclusion of additional relevant factors such as chest pain type, I recommend the second model for predicting heart disease: it balances complexity and predictive accuracy, making it a more robust choice for identifying heart disease risk. In this scenario, I also recommend caution in using the random forest classification model over the logistic regression models for predicting heart disease.
While the random forest model showed exceptional performance on the training set (99.22% accuracy), it significantly underperformed on the testing set with only 65.22% accuracy. This large discrepancy suggests overfitting, meaning the model may not generalize well to new, unseen data. In contrast, the logistic regression models, especially the second one, demonstrated more consistent and reliable performance across training and testing sets. Logistic regression also offers the advantage of interpretability, which is crucial in medical settings for understanding risk factors and making informed decisions. Therefore, despite the high training accuracy of the random forest model, its lack of generalizability and interpretability makes the logistic regression models a more prudent choice in this context. The analyses performed using logistic regression and random forest models are of significant practical importance in the medical field, particularly for predicting heart disease. These models help identify key risk factors and their relationships with heart disease, enabling healthcare professionals to make more informed decisions about patient care. The ability to accurately predict heart disease risk can lead to earlier interventions, better patient management, and potentially improved outcomes. For instance, understanding the significance of variables like chest pain type, exercise-induced angina, and maximum heart rate provides valuable insights for screening and preventive strategies. Furthermore, these analyses contribute to the broader goal of personalized medicine, where treatments and interventions can be tailored based on individual risk profiles derived from such predictive models.