Statistical Data Mining
Logistic Regression on Congenital Heart Defects
Nicholas Arndt-Kohlway
DATA630: Machine Learning (2215)
Professor Ami Gates
Objective
The goal of this analysis is to identify the factors that contribute to congenital heart defects by using logistic regression along with the Naïve Bayes method. The specific question to be answered is: which variables contribute to an increased risk of having a congenital heart defect? The logistic regression will provide estimates of how much each variable contributes to the overall chance of having a congenital heart defect. The Naïve Bayes method will identify conditional probabilities for congenital heart defects.
The prevalence of certain congenital heart defects (CHDs) has been increasing, while other types of CHDs have remained stable over the past few years. “CHDs affect nearly 1% of—or about 40,000—births per year in the United States” (Reller et al., 2009). Treatment of CHDs increases the survival rate of people with CHDs. However, there is no U.S. population-based tracking system to identify adolescents and adults with CHDs. “During 1999-2006, there were 41,494 deaths related to CHDs in the United States…Nearly half (48%) of the deaths due to CHDs occurred during infancy (younger than 1 year of age)” (Gilboa et al., 2010). By using the data provided by KEEL, it may be possible to track and identify likely CHD cases and treat the condition before serious illness or death occurs.
For this dataset, logistic regression is the best way to identify possible indicators of increased CHD risk. Since having a CHD is a yes-or-no outcome, logistic regression is more useful than linear regression, which is better suited to a numerical dependent variable than to a binary one. The Naïve Bayes method will help identify the variable values that are more likely to contribute to a CHD. It goes beyond the prior probability of a CHD by giving the probabilities associated with a CHD under given circumstances. Together, logistic regression and Naïve Bayes should make tracking CHDs among the older population more efficient and effective.
Analysis
The congenital heart disease dataset was kindly provided by Knowledge Extraction based on Evolutionary Learning (KEEL). The dataset was produced from a sample of men in a heart-disease high-risk region of the Western Cape in South Africa. It contains 462 observations of 10 variables:
1. Sbp (systolic blood pressure) with values ranging from 101 to 218.
2. Tobacco (cumulative tobacco in kilograms) with values ranging from 0 to 31.2.
3. Ldl (low density lipoprotein cholesterol) with values ranging from 0.98 to 15.33.
4. Adiposity with values ranging from 6.74 to 42.49.
5. Famhist (family history of heart disease) with values Present or Absent.
6. Typea (type-A behavior) with values ranging from 13 to 78.
7. Obesity with values ranging from 14.7 to 46.58.
8. Alcohol (current alcohol consumption) with values ranging from 0 to 147.19.
9. Age (age at onset) with values ranging from 15 to 64.
10. Chd (congenital heart defect) with values 0 = no and 1 = yes.
Looking at the summary in Figure 1, the maximum systolic blood pressure is 218, which is alarming. Quite frankly, I did not even know it was possible to reach an SBP of 218. A significant number of men in this dataset have an SBP above the normal level of 120. For alcohol, obesity, adiposity, and typea, it is difficult to identify the units of measurement. It was assumed that obesity is measured by Body Mass Index (BMI); however, adiposity is typically measured by BMI as well. Nothing on KEEL identifies the measures of these variables, so it is difficult to provide an exploratory analysis of them. The age variable is well distributed, with a median age of 45. The people in this dataset are typically the ones ignored when tracking cases of CHD, since CHD is mainly tracked in infants until they reach adolescence. Figure 2 shows that around 33% of the dataset was positive for having a CHD. There is no data on whether each CHD is mild or severe, which could have helped improve the analysis.
Little data preprocessing was needed for the confusion matrix, ROC curve, residuals plot, and minimal adequate models. However, for the Naïve Bayes method it was necessary to discretize the tobacco, ldl, adiposity, obesity, and alcohol variables. The sbp, typea, age, and chd variables were converted to factors. For the discretized variables, 4 bins of equal frequency were created. No rows or variables were deleted, since there were no missing values and all the variables were critical for the analysis.
A logistic regression model was the best choice for this regression because the dependent variable is binary. Therefore, after creating the train and test sets, it was crucial to specify in the model parameters that the regression is binomial. Predictions were requested with type "response" so that they are on the scale of the response variable, that is, probabilities. Otherwise they would be on the scale of the linear predictor (log-odds), and treating those values as probabilities would give less accurate classifications. For the residuals plot, a Locally Weighted Scatterplot Smoothing (LOWESS) line was used to help identify relationships between variables and trends. A minimal adequate model was also created to identify which variables had the greatest impact on the chances of having a CHD.
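The discretization step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the analysis: the helper names `equal_frequency_bins` and `conditional_table` are hypothetical, the eight tobacco values and chd labels are made up, and ties are split by rank rather than by quantile cut points. It shows how 4 equal-frequency bins are formed and how Naïve Bayes would then estimate the probability of each bin given the class.

```python
from collections import Counter, defaultdict


def equal_frequency_bins(values, n_bins=4):
    """Assign each value a bin index 0..n_bins-1 so bins hold (nearly) equal counts.

    Simplification: values are binned by rank, so tied values may land in
    different bins. Quantile cut points would keep ties together.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def conditional_table(bins, labels):
    """Estimate P(bin | class) by relative frequency within each class."""
    counts = defaultdict(Counter)
    for b, y in zip(bins, labels):
        counts[y][b] += 1
    return {
        y: {b: c / sum(counter.values()) for b, c in sorted(counter.items())}
        for y, counter in counts.items()
    }


# Hypothetical toy data: eight tobacco values (kg) with made-up chd labels.
tobacco = [0.0, 0.5, 1.2, 2.0, 4.1, 6.0, 9.5, 12.0]
chd = [0, 0, 0, 1, 0, 1, 1, 1]

tobacco_bins = equal_frequency_bins(tobacco)
table = conditional_table(tobacco_bins, chd)

print("bins:", tobacco_bins)
for label, probabilities in sorted(table.items()):
    print(f"P(bin | chd={label}):", probabilities)
```

With real data the same two steps would be applied to each discretized variable, and the resulting tables are what a Naïve Bayes summary reports.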
Results
From the logistic regression summary in Figure 3, it is apparent that systolic blood pressure, adiposity, obesity, and alcohol are not significant contributors to CHD. The figure also shows that adiposity, obesity, and alcohol had z-values close to 0, meaning their estimated coefficients are small relative to their standard errors. The summary also suggests that obesity and alcohol have a negative association with CHD; however, with p-values this large the association cannot be confirmed. Family history has the strongest influence, with a coefficient of 0.867. Its p-value of 0.0017 indicates that the association between family history and CHD is statistically significant. However, family history had one of the highest standard errors among the variables, so this estimate may not be representative of the entire population.
In Figure 4, the number of CHD cases correctly predicted to have CHD was 76. The number of CHD cases not predicted to have CHD was 27 (the same as the number of true negative cases). The number of non-CHD cases predicted to have CHD was 10. This means that the logistic model correctly predicted 103 cases out of 140 (73.57%). The ROC curve in Figure 5 shows that the logistic model is not the best predictor of CHD cases. The steep initial portion of the curve shows a true positive rate of only about 0.5 at a false positive rate of 0.1. The curve is not entirely worthless, as it does not lie extremely close to the 45-degree line. The residual plot in Figure 6 shows that the residuals are not badly off, given how closely the LOWESS line stays around 0. While a few of the values have large residuals, the overall model maintains accuracy throughout the dataset, owing to the approximately normal distribution of the residuals.
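The accuracy figure can be checked directly from the four confusion-matrix counts quoted in the text. The short Python sketch below does that arithmetic; the function name `classification_metrics` is only illustrative. Sensitivity and specificity are not reported in the original analysis; they are simply derived here from the same counts.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute basic metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,      # share of all cases predicted correctly
        "sensitivity": tp / (tp + fn),      # share of CHD cases that were detected
        "specificity": tn / (tn + fp),      # share of non-CHD cases correctly cleared
    }


# Counts as stated in the text for Figure 4 (test set of 140 cases).
metrics = classification_metrics(tp=76, fn=27, fp=10, tn=27)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The accuracy works out to 103/140, matching the 73.57% reported above.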
The minimal adequate model in Figure 7 shows that, of the original 9 independent variables, only systolic blood pressure, tobacco, low density lipoprotein cholesterol, family history, type-A behavior, and age were significant in determining CHD. Adiposity, obesity, and alcohol consumption were not significant indicators. In the family history plot there is a notable increase in the probability of CHD when the condition is present in a family member: the chance of having CHD nearly doubles. From these 6 plots it is apparent that the significant variables have a relatively strong impact on the chances of having a CHD. Systolic blood pressure has a strong positive association with CHD; this can be attributed to CHD causing a rise in blood pressure as a person gets older and the condition worsens. However, systolic blood pressure has a wide confidence interval, which makes it difficult to judge how strongly this factor is associated with CHD. Many people have high blood pressure because of extraneous circumstances such as stress, which is not measured in this dataset. Cholesterol and tobacco also have wide confidence intervals for similar reasons. Cholesterol can be high because of several factors, and tobacco use might increase the chances of having a CHD, but plenty of people smoke a pack a day and do not have symptoms of a CHD. The summary in Figure 8 shows that blood pressure and cholesterol are not statistically significant indicators of having a CHD. The minimal adequate model decreased the Akaike Information Criterion (AIC) by less than 6, so the original model and the minimal adequate model are comparable in goodness of fit. The p-values and standard errors also did not change significantly. Figure 9 shows that the accuracy on the test data stayed the same at 73.57%.
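As a rough check on how a logistic coefficient translates into risk, the sketch below converts the family-history coefficient of 0.867 reported for the full model into an odds ratio, and then into probabilities. It is an illustration only: the baseline probabilities of 10%, 30%, and 50% are hypothetical, the helper names are made up, and all other predictors are assumed to be held fixed. The point is that the coefficient multiplies the odds by a constant factor, while the change in probability depends on the starting point.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def inverse_logit(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


FAMHIST_COEF = 0.867  # family-history coefficient quoted in the Results

# exp(coefficient) is the multiplicative change in the odds of CHD.
odds_ratio = math.exp(FAMHIST_COEF)


def probability_with_family_history(baseline_probability, coef=FAMHIST_COEF):
    """Shift a baseline probability by the coefficient on the log-odds scale."""
    return inverse_logit(logit(baseline_probability) + coef)


print(f"odds ratio: {odds_ratio:.2f}")
for baseline in (0.10, 0.30, 0.50):  # hypothetical baselines, not from the data
    shifted = probability_with_family_history(baseline)
    print(f"baseline {baseline:.2f} -> with family history {shifted:.2f}")
```

Under these assumptions the odds rise by a factor of about 2.4, and a low baseline probability roughly doubles, which is in line with the plot-based observation above.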
The Naïve Bayes summary in Figure 10 identifies old age as strongly associated with CHD. If someone has a CHD, the probability that the person is 55 to 64 years old is 45.28%, compared to only 4.72% for 15- to 30-year-olds. The conditional probability is even higher for family history: given a CHD, the probability of a family history of CHDs is 59.43%. The third highest rate is among smokers who have consumed 5.5 to 31.2 kilograms of tobacco. Among people without a CHD, the highest conditional probability is having no family history of CHDs, at nearly 67%. The second highest is smoking 0 to 0.0525 kilograms of tobacco, at 33.8%. The third highest is being 15 to 30 years old, at 31.94%. The Naïve Bayes confusion matrix in Figure 11 shows that this model is not as accurate as the original logistic regression, with only 94 correct predictions out of 140 instances (67.14%). The ROC curve in Figure 12 confirms that the Naïve Bayes method is less accurate, since the curve is less steep and closer to the 45-degree line.
Conclusion
From the findings it is important to note that family history is the strongest indicator of having a CHD. Cholesterol, type A behavior, tobacco consumption, and age were also significant indicators of CHDs. Type A behavior can indicate stress and a desire to control one’s own environment, which can contribute to higher cholesterol. People who exhibit this behavior might also be more willing to smoke to calm down and destress when they are unable to control the things going on around them. As people get older, they tend to become more stressed and have higher cholesterol, naturally creating a higher likelihood of having a CHD. These 4 variables could be considered confounding variables: when age increases, the other 3 variables tend to increase with it, causing the changes we see in the dataset. This can also be considered a limitation of this dataset, since it is difficult to measure the impact that one of these independent variables has on the others.
Another limitation is that this dataset is a very small subset of the population. It only includes men in the Western Cape in South Africa, which makes it difficult to generalize to the entire population. It is also important to note that the subjects in this dataset have higher rates of CHDs than people in other places in the world. Why that is the case cannot be answered through the analysis of this dataset. Most of the potential improvements to this analysis would come from accounting for the confounding variables.
Another limitation that made this analysis difficult was the lack of descriptors for some of the variables. Variables such as adiposity, type A behavior, obesity, and alcohol did not state the exact measures used to obtain their numeric values. When it came to alcohol, I had no idea how much 97.2 was or over what time period the consumption spanned. When it came to adiposity and obesity, I figured one had to be measured by BMI, but then what is the measure for the other variable? I tried to find out how type A behavior is measured, but never found the numeric scale that the dataset used. While the analysis provided interesting insights, it is hard to fully interpret what they mean when we do not have units of measurement attached to the values.
Appendix
Figure 1: Summary shows over half of the dataset had a family member with CHD.
Figure 2: Approximately 33% of the dataset has a CHD.
Figure 3: sbp, obesity, and alcohol are not significant indicators of CHD.
Figure 4: Matrix shows 76 true positives out of 140 instances in the test set.
Figure 5: ROC curve shows that the model can accurately predict CHDs.
Figure 6: Lowess line shows that the model accurately represents the test data.
Figure 7: All impacting variables have a positive correlation to CHD.
Figure 8: sbp and ldl are not statistically significant.
Figure 9: True positive increased by 1 in the Minimal Adequate Model.
Figure 10: If the person does have CHD, the chance they have a SBP over 148 is about 39%
Figure 11: True positive rate drops by 12 from the original model.
Figure 12: Naive Bayes does not predict CHD as accurately as the logistical model.
References
Gilboa, S. M., Salemi, J. L., Nembhard, W. N., Fixler, D. E., & Correa, A. (2010). Mortality Resulting From Congenital Heart Disease Among Children and Adults in the United States, 1999 to 2006. Circulation, 122(22), 2254-2263. doi:10.1161/circulationaha.110.947002
Reller, M. D., Strickland, M. J., Riehle-Colarusso, T., Mahle, W. T., & Correa, A. (2009). Prevalence of Congenital Heart Defects in Metropolitan Atlanta, 1998–2005. Obstetrical & Gynecological Survey, 64(3), 156-157. doi:10.1097/01.ogx.0000344391.78229.08