Date
Feb 20, 2024
MAT 303 Project Two Summary Report
Melissa Galvan
melissa.galvan@snhu.edu
Southern New Hampshire University
1. Introduction
I will examine the heart disease data set. The findings will be used to predict the likelihood of heart disease based on several factors. For this analysis, I will build two logistic regression models, a random forest classification model, and a random forest regression model.
2. Data Preparation
The variables of interest from this data set are sex, age, chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), resting electrocardiographic results (restecg), exercise-induced angina (exang; true or false), the slope of the peak exercise ST segment (slope), the number of major vessels (ca), and the target variable (target). The data set consists of 303 rows and 14 columns.
3. Model #1 - First Logistic Regression Model
Reporting Results
The general form of the logistic regression model is:
$$E(Y) = \pi = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}}$$
The intercept is $\beta_0$. The regression coefficients for age ($X_1$), resting blood pressure ($X_2$), and maximum heart rate achieved ($X_3$) are $\beta_1$, $\beta_2$, and $\beta_3$. With the estimated coefficients, the fitted linear predictor is:
$$-3.576198 - 0.009424 X_1 - 0.016019 X_2 + 0.042697 X_3$$
The model can be transformed so that it is linear in the beta terms:
$$\ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$
Written in terms of the natural logarithm of the odds, the model equation is:
$$\ln(\text{odds}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$
Here $\pi$ is the probability that the event occurs; in this case, the event is the presence of heart disease. The odds of having heart disease are $\pi / (1-\pi)$.
Evaluating Model Significance
The fitted equation for this regression model is:
$$E(Y) = \frac{e^{-3.576198 - 0.009424 X_1 - 0.016019 X_2 + 0.042697 X_3}}{1 + e^{-3.576198 - 0.009424 X_1 - 0.016019 X_2 + 0.042697 X_3}}$$
In terms of the natural logarithm of the odds, the equation for this model is:
$$\ln(\text{odds}) = -3.576198 - 0.009424 X_1 - 0.016019 X_2 + 0.042697 X_3$$
The estimated coefficient for maximum heart rate achieved is 0.042697. This means that, when all other variables are held constant, a one-unit increase in maximum heart rate achieved changes the log odds of having heart disease by 0.042697 on average.
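The log-odds interpretation above can be made more concrete by exponentiating the coefficient, which turns it into a multiplicative change in the odds. The short Python sketch below is only an illustration: the report itself contains no code, and the variable names are hypothetical. It uses the thalach coefficient reported above.

```python
import math

# Estimated coefficient for maximum heart rate achieved (thalach), as reported above.
beta_thalach = 0.042697

# Exponentiating a log-odds coefficient gives the factor by which the odds change
# for a one-unit increase in the predictor, with the other predictors held constant.
odds_ratio = math.exp(beta_thalach)
percent_change = (odds_ratio - 1.0) * 100.0

print(f"odds ratio per one-unit increase in thalach: {odds_ratio:.4f}")
print(f"approximate percent change in the odds: {percent_change:.2f}%")
```

Under these assumptions, each additional unit of maximum heart rate multiplies the odds of heart disease by roughly 1.04.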
The parameter for resting blood pressure (trestbps) is $\beta_2$, and the parameter for maximum heart rate achieved (thalach) is $\beta_3$. Based on Wald's test at a 5% level of significance, the null and alternative hypotheses for determining whether maximum heart rate achieved (thalach) is significant are:
$H_0: \beta_3 = 0$
$H_a: \beta_3 \neq 0$
The P-value for age is 0.5578. The P-value for resting blood pressure (trestbps) is 0.0392. The P-value for maximum heart rate achieved (thalach) is 8.06e-10. At a 5% level of significance, the terms for maximum heart rate achieved and resting blood pressure are statistically significant because their P-values are below 0.05. The age term is not statistically significant.
The general form of a confusion matrix is shown below:
This model's confusion matrix is:
The results of the confusion matrix are:
True Positives (TP): 127
True Negatives (TN): 83
False Positives (FP): 55
False Negatives (FN): 38
True Positive (TP): the predicted value is 1 (target = 1) and the actual value is 1 (target = 1).
True Negative (TN): the predicted value is 0 (target = 0) and the actual value is 0 (target = 0).
False Positive (FP): the predicted value is 1 (target = 1), but the actual value is 0 (target = 0). This is also known as a Type I error.
False Negative (FN): the predicted value is 0 (target = 0), but the actual value is 1 (target = 1). This is also known as a Type II error.
Accuracy is the proportion of all observations that are predicted correctly.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (127 + 83) / (127 + 83 + 55 + 38) = 0.6931, or 69.31%.
Precision is the ratio of correct positive predictions to all predicted positives.
Precision = TP / (TP + FP) = 127 / (127 + 55) = 0.6978, or 69.78%.
Recall is the ratio of correct positive predictions to all actual positives.
Recall = TP / (TP + FN) = 127 / (127 + 38) = 0.7697, or 76.97%.
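The three metrics can be checked with a few lines of Python. This is a minimal sketch rather than the code used for the report; the function name `classification_metrics` is hypothetical, and the counts are the ones reported for Model #1.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total      # share of all observations predicted correctly
    precision = tp / (tp + fp)        # share of predicted positives that are correct
    recall = tp / (tp + fn)           # share of actual positives that are found
    return accuracy, precision, recall


# Confusion-matrix counts reported for Model #1.
accuracy, precision, recall = classification_metrics(tp=127, tn=83, fp=55, fn=38)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```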
The area under the curve (AUC) is 0.7575, or 75.75%. This value indicates how well the model discriminates between Y = 0 and Y = 1. Because the AUC measures how well a model separates the two classes, a larger AUC is generally better.
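One way to read the AUC is as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison. The labels and scores are hypothetical toy values, not the heart disease data, and the function name is an assumption made for illustration.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: two negative cases followed by two positive cases.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.2f}")
```

In the toy data, three of the four positive-negative pairs are ordered correctly. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.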
Making Predictions Using Model #1
For a person who is 50 years old with a maximum heart rate of 140 and a resting blood pressure of 122, the predicted probability of heart disease is 0.4939, or 49.39%. The odds are calculated by dividing the probability by 1 minus the probability: 0.4939 / (1 - 0.4939). In this case, the odds are 0.4939 to 0.5061 (about 0.98 to 1). For a person with these characteristics, the odds of having heart disease are very close to 1:1.
#2
For a person who is 50 years old with a maximum heart rate of 170 and a resting blood pressure of 140, the predicted probability of heart disease is 0.7248, or 72.48%. The odds are again the probability divided by 1 minus the probability: 0.7248 / (1 - 0.7248). In this case, the odds are 0.7248 to 0.2752 (about 2.63 to 1), or roughly 3:1.
The probability is the chance, expressed as a percentage, that a person with the given characteristics has heart disease, while the odds compare that chance with the chance of not having heart disease. The odds for the first prediction are close to 1:1, so about one of every two people with those characteristics would be expected to have heart disease. For the second prediction, the probability is about 72%, so roughly three of every four people with those characteristics would be expected to have heart disease.
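Both predictions can be reproduced from the estimated coefficients of Model #1. The Python sketch below is an illustration under stated assumptions: the function and constant names are hypothetical, and the coefficients are the rounded values reported earlier, so the results may differ from the report in the last decimal place.

```python
import math

# Estimated coefficients of Model #1, as reported above.
INTERCEPT = -3.576198
B_AGE = -0.009424
B_TRESTBPS = -0.016019
B_THALACH = 0.042697


def predict_probability(age, trestbps, thalach):
    """Predicted probability of heart disease from the fitted logistic model."""
    log_odds = INTERCEPT + B_AGE * age + B_TRESTBPS * trestbps + B_THALACH * thalach
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds(probability):
    """Convert a probability into odds in favor of the event."""
    return probability / (1.0 - probability)


p1 = predict_probability(age=50, trestbps=122, thalach=140)
p2 = predict_probability(age=50, trestbps=140, thalach=170)
print(f"prediction 1: probability={p1:.4f}, odds={odds(p1):.2f} to 1")
print(f"prediction 2: probability={p2:.4f}, odds={odds(p2):.2f} to 1")
```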
4. Model #2 - Second Logistic Regression Model
Reporting Results
The regression coefficients $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ correspond to the person's age (age), the maximum heart rate achieved (thalach), the dummy term for sex1 (male), and the dummy term for exercise-induced angina (exang1). $\beta_5$, $\beta_6$, and $\beta_7$ are the coefficients for the dummy terms cp1, cp2, and cp3. $\beta_8$ and $\beta_9$ are the coefficients for the interaction between thalach and age and for age squared. General form: $E(Y) = \beta_1 X_1 + \beta_2 X_2 + \beta_3 \dots$
Evaluating Model Significance
The Hosmer-Lemeshow goodness of fit (GOF) test evaluates how closely the model's predictions match the observed values of Y, which can be either 0 or 1. It is used here to evaluate whether the model fits the data. The null and alternative hypotheses are:
$H_0$: The model fits the data.
$H_a$: The model does not fit the data.
The value of the test statistic ($\chi^2$) is 60.596, and the P-value is 0.1048. At a 5% level of significance, the P-value of 0.1048 is greater than 0.05, so we fail to reject the null hypothesis. The conclusion is that the model fits the data.
Based on Wald's test at a 5% level of significance, the null and alternative hypotheses for determining whether maximum heart rate achieved (thalach) is significant are:
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
The parameter for maximum heart rate achieved (thalach) is $\beta_1$. Its P-value is 0.014760. At the 5% level of significance, this term is statistically significant.
Based on Wald's test at a 5% level of significance, the null and alternative hypotheses for determining whether age is significant are:
$H_0: \beta_2 = 0$
$H_a: \beta_2 \neq 0$
The parameter for age is $\beta_2$. Its P-value is 0.510325. At the 5% level of significance, this term is not statistically significant.
Based on Wald's test at a 5% level of significance, the null and alternative hypotheses for determining whether the dummy variable for sex (male) is significant are:
$H_0: \beta_3 = 0$
$H_a: \beta_3 \neq 0$
The parameter for this dummy variable is $\beta_3$.
Making Predictions Using Model
#1
For a 30-year-old man with a maximum heart rate of 145 who has exercise-induced angina and no chest pain related to typical angina, atypical angina, or non-anginal pain, the probability of heart disease is 0.2654, or 26.54%. The odds are slightly higher than 1 to 3.
#2
For a 30-year-old man with a maximum heart rate of 145 who does not have exercise-induced angina but does have typical angina, the probability of heart disease is 0.8432, or 84.32%. The odds are slightly more than 5 to 1 (about 5.4 to 1). The odds rose from about 1 to 3 in the first prediction to a little over 5 to 1 in the second.
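The odds quoted for the two Model #2 predictions follow directly from the reported probabilities. The Python sketch below shows the conversion; the function name `probability_to_odds` is hypothetical, and the two probabilities are taken from the text above rather than recomputed from the model.

```python
def probability_to_odds(probability):
    """Convert a probability in [0, 1) into odds in favor of the event."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in the interval [0, 1)")
    return probability / (1.0 - probability)


# Probabilities reported for the two Model #2 predictions.
odds_first = probability_to_odds(0.2654)
odds_second = probability_to_odds(0.8432)

# How many times larger the odds are in the second scenario than in the first.
odds_multiple = odds_second / odds_first

print(f"first prediction:  odds = {odds_first:.4f} to 1")
print(f"second prediction: odds = {odds_second:.4f} to 1")
print(f"second odds are about {odds_multiple:.1f} times the first")
```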
5. Random Forest Classification Model
Reporting Results
The heart disease data set is divided into training and testing sets using an 80% and 20% split, respectively. With set.seed(511038), the original data set contains 303 rows, the training set contains 242 rows, and the testing set contains 61 rows. A random forest classification model for the presence of heart disease (target) is used to graph the training and testing error against the number of trees. The predictors are age, sex, chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), exercise-induced angina (exang), slope of the peak exercise ST segment (slope), and number of major vessels (ca). The following graph is produced using set.seed(511038) with a maximum of 200 trees:
Evaluating the Utility of the Model
A random forest classification model for the presence of heart disease (target) is created using the following variables: age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), resting electrocardiographic results (restecg), slope of the peak exercise ST segment (slope), and number of major vessels (ca). The model is built using the optimal number of trees. After that, a confusion matrix is generated, as shown below.
For the training set, the confusion matrix results are:
True Positives (TP): 130
True Negatives (TN): 111
False Positives (FP): 1
False Negatives (FN): 0 (the four counts sum to the 242 rows of the training set)
As before, a true positive (TP) is a case where both the predicted and actual values are 1, a true negative (TN) is a case where both are 0, a false positive (FP) is a predicted 1 with an actual 0 (a Type I error), and a false negative (FN) is a predicted 0 with an actual 1 (a Type II error).
For the testing set, the confusion matrix results are:
True Positives (TP): 28
True Negatives (TN): 18
False Positives (FP): 8
False Negatives (FN): 7
The same definitions of TP, TN, FP, and FN apply: a false positive is a Type I error, and a false negative is a Type II error.
Accuracy is the proportion of all observations that are predicted correctly.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (28 + 18) / (28 + 18 + 8 + 7) = 0.7541, or 75.41%.
Precision is the ratio of correct positive predictions to all predicted positives.
Precision = TP / (TP + FP) = 28 / (28 + 8) = 0.7778, or 77.78%.
Recall is the ratio of correct positive predictions to all actual positives.
Recall = TP / (TP + FN) = 28 / (28 + 7) = 0.80, or 80%.
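To put the training and testing results side by side, the same formulas can be applied to both confusion matrices. The Python sketch below is illustrative only, and its names are hypothetical. It rests on one assumption: the training false-negative count is not legible in the report, so it is inferred here as whatever remains after the other three counts are subtracted from the 242 training rows.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, and recall as a dictionary."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


TRAIN_ROWS = 242
train_tp, train_tn, train_fp = 130, 111, 1
# Assumption: inferred so that the four counts sum to the training set size.
train_fn = TRAIN_ROWS - (train_tp + train_tn + train_fp)

train_metrics = classification_metrics(train_tp, train_tn, train_fp, train_fn)
test_metrics = classification_metrics(tp=28, tn=18, fp=8, fn=7)

for name in ("accuracy", "precision", "recall"):
    print(f"{name:9s} train={train_metrics[name]:.4f} test={test_metrics[name]:.4f}")
```

Under that assumption the training metrics are all close to 1, while the testing metrics match the values above. The testing values are the ones that indicate how the model performs on unseen data.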
6. Random Forest Regression Model
Reporting Results
The heart disease data set was split into training and testing sets using an 80% and 20% split, respectively. Using set.seed(511038), the original data set contained 303 rows, the training set contained 242 rows, and the testing set contained 61 rows. Using set.seed(511038) with a maximum of 80 trees, the error was graphed against the number of trees. The graph is shown below:
The optimal number of trees is 10, which is where the curve starts to flatten.
Evaluating the Utility of the Random Forest Regression Model
Using the optimal number of trees, a random forest regression model is built to predict maximum heart rate achieved (thalach). For the training set, the root mean squared error (RMSE) is 9.9028. For the testing set, the RMSE is 17.387.
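The root mean squared error is the square root of the average squared difference between actual and predicted values. The Python sketch below shows the calculation on hypothetical heart rate values (they are not taken from the data set) and then compares the two RMSE figures reported above. The function and variable names are assumptions made for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted maximum heart rates, for illustration only.
actual_thalach = [150, 172, 131, 160]
predicted_thalach = [145, 165, 140, 158]
example_rmse = rmse(actual_thalach, predicted_thalach)

# Ratio of the testing RMSE to the training RMSE reported in the text.
test_to_train_ratio = 17.387 / 9.9028

print(f"example RMSE: {example_rmse:.4f}")
print(f"testing RMSE is {test_to_train_ratio:.2f} times the training RMSE")
```

Because RMSE is in the same units as the response, the reported testing value means predictions are off by roughly 17 units of heart rate on average, compared with about 10 on the training set.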
7. Conclusion
Of the two logistic regression models examined here, I would pick the second. It has a greater area under the curve and includes more variables, so it discriminates between the two classes better. It also performs better on the confusion matrix in terms of accuracy, precision, and recall. In my opinion, it therefore gives a more accurate indication of a person's risk of heart disease. However, because the random forest classification model has considerably higher accuracy, precision, and recall than either logistic regression model, I would recommend using it instead of the logistic regression models.