Final Project.ipynb - Colaboratory
School: Pennsylvania State University
Course: 200
Subject: Statistics
Date: Jan 9, 2024
Pages: 5
This dataset was originally obtained from the CDC's website. It consisted of more than 300 variables; fortunately, I found a "cleaned/trimmed"
version of the same dataset that keeps only the most important variables.
The data come from the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data
on the health status of U.S. residents. The final dataset consists of 18 variables (4 quantitative and 14 categorical) and 319,795 cases,
with the following variables and descriptions:
Heart Disease (Categorical, response): Respondents who have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
BMI (Continuous quantitative): Body mass index
Smoking (Categorical): Have you smoked at least 100 cigarettes in your entire life?
Alcohol Drinking (Categorical): Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
Stroke (Categorical): Ever had a stroke?
Physical Health (Discrete quantitative): For how many days during the past 30 days was your physical health not good?
Mental Health (Discrete quantitative): For how many days during the past 30 days was your mental health not good?
Difficulty walking (Categorical): Do you have serious difficulty walking or climbing stairs?
Sex (Categorical): Are you male or female?
Age (Categorical): Age group of the respondent (for example 55-59 or "80 or older")
Race/Ethnicity (Categorical): Black, White, Asian, American Indian/Alaskan Native, Hispanic
Diabetic (Categorical): Have you been diagnosed with diabetes?
Physical Activity (Categorical): Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
General Health (Categorical): How would you classify your general health?
Sleep time (Discrete quantitative): Average sleep time you get per night
Asthma (Categorical): Do you have asthma?
Kidney Disease (Categorical): Do you suffer from kidney disease?
Skin Cancer (Categorical): Do you have, or have you had, any type of skin cancer?
Heart Disease Indicator-Introduction
Heart disease is one of the leading causes of death in modern society. I got interested in this dataset after reading an article that stated that
BMI was a flawed and misleading way of assessing one's overall "healthy weight", yet it is widely used in this type of indicator and prediction.
For these reasons, I chose this dataset to see for myself whether BMI works for this type of prediction.
Description of Research Question
import warnings

import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
from scipy.stats import ttest_ind
from google.colab import drive

%matplotlib inline

drive.mount('/content/drive')  # Mount Google Drive so the CSV can be read from Colab

warnings.filterwarnings("ignore")  # Silence library warnings in the notebook output
rcParams['figure.figsize'] = 20, 10  # Default figure size
rcParams['font.size'] = 30
sns.set()  # Apply the default seaborn theme
np.random.seed(8)  # Fix the seed so the random subsamples are reproducible
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
%cd /content/drive/My Drive/Colab Notebooks/
/content/drive/My Drive/Colab Notebooks
data = pd.read_csv('heart_2020_cleaned.csv')
data.head()
   HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  MentalHealth  DiffWalking     Sex  AgeCategory   Race  Diabetic  PhysicalActivity
0            No  16.60      Yes               No      No             3.0          30.0           No  Female        55-59  White       Yes               Yes
1            No  20.34       No               No     Yes             0.0           0.0           No  Female  80 or older  White        No               Yes
2            No  26.58      Yes               No      No            20.0          30.0           No    Male        65-69  White       Yes               Yes
3            No  24.21       No               No      No             0.0           0.0           No  Female        75-79  White        No                No
4            No  23.71       No               No      No            28.0           0.0          Yes  Female        40-44  White        No               Yes
A quick inspection of the dataset shows that, to make EDA and classification easier or more "feasible",
we need to convert the Yes/No variables to binary (1, 0). The next line of code does this.
# Replace all the categorical "binary" values with 1/0.
# Note that 'No, borderline diabetes' is coded as 0 and 'Yes (during pregnancy)' as 1.
data = data.replace({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0, 'No, borderline diabetes': 0, 'Yes (during pregnancy)': 1})
data.head()
   HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  MentalHealth  DiffWalking  Sex  AgeCategory   Race  Diabetic  PhysicalActivity  G
0             0  16.60        1                0       0             3.0          30.0            0    0        55-59  White         1                 1
1             0  20.34        0                0       1             0.0           0.0            0    0  80 or older  White         0                 1
2             0  26.58        1                0       0            20.0          30.0            0    1        65-69  White         1                 1
3             0  24.21        0                0       0             0.0           0.0            0    0        75-79  White         0                 0
4             0  23.71        0                0       0            28.0           0.0            1    0        40-44  White         0                 1
Now with this updated version of the dataset, we won't get any errors when plotting our graphs
Make a boxplot and a density function
fig, ax = plt.subplots(figsize=(13, 5))
# Kernel density estimate of BMI for each group (fill replaces the deprecated shade argument)
sns.kdeplot(data[data["HeartDisease"] == 1]["BMI"], alpha=0.5, fill=True, color="red", label="HeartDisease", ax=ax)
sns.kdeplot(data[data["HeartDisease"] == 0]["BMI"], alpha=0.5, fill=True, color="blue", label="Normal", ax=ax)
plt.title('Distribution of BMI', fontsize=18)
ax.set_xlabel("BMI")
ax.set_ylabel("Density")  # A KDE curve shows density, not frequency
ax.legend()
plt.show()
Comparing the BMI distributions of the two groups, the "Normal" (no heart disease) curve is shifted slightly to the left of the heart disease curve,
which suggests that its mean BMI is lower. We cannot tell from the plot alone whether this difference is statistically significant, even with a
dataset this large.
sns.set(style="darkgrid")
sns.set(rc={'figure.figsize':(16,8)})
# creating a figure composed of 3 matplotlib.Axes objects
f, (ax_box1, ax_box2, ax_hist) = plt.subplots(3, sharex=True, gridspec_kw={"height_ratios": (.15, .15, .85)})
colours = ['red', 'blue', '#fbbc05', '#34a853'] #some random colors
# assigning a graph to each ax_box
sns.boxplot(x=data[data['HeartDisease']==1]["BMI"], ax=ax_box1, color="#ea4335")
sns.histplot(data[data['HeartDisease']==1], x="BMI", ax=ax_hist, kde=True, color="#ea4335")
sns.boxplot(x=data[data['HeartDisease']==0]["BMI"], ax=ax_box2, color='#4285f4')
sns.histplot(data[data['HeartDisease']==0], x="BMI", ax=ax_hist, kde=True, color='#4285f4')
# Remove x axis name for the boxplots, would get a formatting error otherwise
ax_box1.set(xlabel='')
ax_box2.set(xlabel='')
plt.legend(title='', loc=2, labels=['Heart Disease', 'No HeartDisease'],bbox_to_anchor=(1.02, 1), borderaxespad=0.)
plt.show()
By simple inspection, people with heart disease appear to have a higher BMI than those without. The boxplots also reveal a large number of
outliers, so the mean will be pulled to the right in both groups. A difference in medians might be a better option, but it is outside the scope of this
class and would be harder to interpret.
Null: There is no difference in the mean BMI of those with heart disease compared to those without heart disease
Alternative: There is a difference in the mean BMI of those without heart disease compared to those with heart disease
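The two hypotheses can also be written in symbols. The notation here is introduced only for illustration and is not part of the original notebook: \mu_1 and \mu_0 are the population mean BMI with and without heart disease, \bar{x}, s and n are the corresponding sample means, standard deviations and sizes. The statistic shown is the pooled two-sample t statistic, which is what scipy's ttest_ind computes with its default setting (equal_var=True).

```latex
H_0:\ \mu_1 - \mu_0 = 0
\qquad\text{versus}\qquad
H_a:\ \mu_1 - \mu_0 \neq 0

t = \frac{\bar{x}_1 - \bar{x}_0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_0}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_0 - 1)\,s_0^2}{n_1 + n_0 - 2}
```

Under the null hypothesis, t follows a t distribution with n_1 + n_0 - 2 degrees of freedom, and the two-sided p-value is the probability of a value at least as extreme as the observed |t|.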
Hypothesis test-Difference in means
def plot_distribution(inp):
    # Plot the distribution of inp and mark where its mean is located
    plt.figure()
    ax = sns.distplot(inp)
    plt.axvline(np.mean(inp), color="k", linestyle="dashed", linewidth=5)  # Dashed line to represent the mean
    _, max_ = plt.ylim()
    plt.text(
        inp.mean() + inp.mean() / 10,
        max_ - max_ / 10,
        "Mean: {:.2f}".format(inp.mean()),  # Write the mean of the distribution
    )
    return plt.figure  # Returns the plt.figure function itself, hence the <function matplotlib.pyplot.figure> cell output
updatedTable = data[['HeartDisease', 'BMI']]  # Keep only the two variables we are interested in
arr = updatedTable.to_numpy()
arr  # 2-D array: column 0 is the HeartDisease flag (0 or 1), column 1 is the BMI
array([[ 0.  , 16.6 ],
       [ 0.  , 20.34],
       [ 0.  , 26.58],
       ...,
       [ 0.  , 24.24],
       [ 0.  , 32.81],
       [ 0.  , 46.56]])
len(arr) #Making sure we have the correct amount of cases
319795
def separating_0(arr):  # Collect the BMI values of those without heart disease in a numpy array
    group_0 = []
    for i in range(len(arr)):
        # The first value of each row is the HeartDisease flag (0 or 1), so we loop over the rows and keep the BMI when the flag is 0
        if arr[i][0] == 0:
            group_0.append(arr[i][1])
    arr_group_0 = np.array(group_0)
    return arr_group_0

def separating_1(arr):  # Same as the other function, but it collects the BMI of those with heart disease
    group_1 = []
    for i in range(len(arr)):
        if arr[i][0] == 1:
            group_1.append(arr[i][1])
    arr_group_1 = np.array(group_1)
    return arr_group_1
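The two loops above can also be expressed with NumPy boolean masks, which avoids iterating over roughly 320,000 rows in Python. This is a minimal sketch, not the notebook's own method: the function name split_bmi_by_flag and the toy values are made up for illustration, and it assumes the same layout as arr (column 0 is the flag, column 1 is the BMI).

```python
import numpy as np

def split_bmi_by_flag(arr):
    """Return (bmi_without, bmi_with) from a 2-D array of [flag, BMI] rows."""
    flags = arr[:, 0]  # column 0: HeartDisease flag (0 or 1)
    bmi = arr[:, 1]    # column 1: BMI
    # A boolean mask selects every matching row in one vectorized step
    return bmi[flags == 0], bmi[flags == 1]

# Toy rows in the same [flag, BMI] layout (made-up values, not from the dataset)
toy = np.array([[0.0, 22.0], [1.0, 31.5], [0.0, 25.0], [1.0, 29.0]])
bmi_without, bmi_with = split_bmi_by_flag(toy)
print(bmi_without)
print(bmi_with)
```

Called on the notebook's arr, the two returned arrays should hold the same values as separating_0(arr) and separating_1(arr).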
<function matplotlib.pyplot.figure>
plot_distribution(separating_0(arr)) #Plotting a more exact distribution of the BMI of those without Heart Disease
<function matplotlib.pyplot.figure>
plot_distribution(separating_1(arr)) #Plotting a more exact distribution of the BMI of those with Heart Disease
Looking at the means, we see a difference of 1.18 (29.40 - 28.22). This agrees with what we observed in the first
distribution plot: the mean BMI of those without heart disease is lower.
plt.figure()
ax1 = sns.distplot(separating_0(arr))
ax2 = sns.distplot(separating_1(arr))
plt.axvline(np.mean(separating_0(arr)), color='black', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(separating_1(arr)), color='black', linestyle='dashed', linewidth=2)
#Plotting both distributions again but this time with the dashed line
<matplotlib.lines.Line2D at 0x7fb7e32ae750>
Plotting both distributions again with the dashed line for reference purposes to make it easier to visualize the difference in both means
def compare_2_groups(arr_1, arr_2, alpha, sample_size):
    stat, p = ttest_ind(arr_1, arr_2)  # ttest_ind is a scipy function that performs a two-sample t-test for a difference in means
    print('p=%.3f' % (p))  # Print the p-value to 3 decimals
    if p > alpha:
        print('fail to reject H0')
    else:
        print('reject H0')
sample_size = len(separating_1(arr)) # Make the sub_sample size the same as the smallest array that we have, if it gets any bigger, it would throw a Value error. This is because
Heart_Disease_1 = np.random.choice(separating_1(arr), sample_size, replace = False)
Heart_Disease_0 = np.random.choice(separating_0(arr), sample_size, replace = False)
compare_2_groups(Heart_Disease_1, Heart_Disease_0 , 0.05, sample_size)
p=0.000
reject H0
np.mean(Heart_Disease_1) #Making sure that the mean of the subsamples are the same as the original sample
29.40159207978665
np.mean(Heart_Disease_0) #Making sure that the mean of the subsamples are the same as the original sample
28.254150805538305
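A p-value printed as 0.000 with the '%.3f' format only tells us that it is below 0.0005; it does not say how large the gap between the groups is. One common way to express the size of the gap is a standardized effect size such as Cohen's d (the difference in means divided by the pooled standard deviation). The sketch below is an illustration only: the function name cohens_d and the sample values are made up, and it uses the standard library rather than the notebook's arrays.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Made-up BMI-like values, for illustration only (not taken from the dataset)
with_disease = [31.0, 27.5, 33.2, 29.8, 26.4, 30.1]
without_disease = [28.0, 25.5, 31.0, 27.3, 24.9, 29.2]

d = cohens_d(with_disease, without_disease)
print(f"Cohen's d = {d:.2f}")
```

With the notebook's data, the same function could be called as cohens_d(list(Heart_Disease_1), list(Heart_Disease_0)) to report the effect size next to the p-value.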
Implement a k-nearest neighbors classifier
data = data.drop(columns = ['AgeCategory', 'Race', 'GenHealth'], axis = 1) #Dropping the useless variables
features = data.drop(columns =['HeartDisease'], axis = 1) #We can't add our target prediction to the model, would be silly, that's why I dropped it
#Select Target predictor
target = data['HeartDisease']
# Set Training and Testing Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, shuffle = True, test_size = .2, random_state = 44) #The shuffling doesnt make any diff
print('Shape of training feature:', X_train.shape)
print('Shape of testing feature:', X_test.shape)
print('Shape of training label:', y_train.shape)
print('Shape of testing label:', y_test.shape)
Shape of training feature: (255836, 14)
Shape of testing feature: (63959, 14)
Shape of training label: (255836,)
Shape of testing label: (63959,)
def evaluate_model(model, x_test, y_test):
    from sklearn import metrics
    # Predict Test Data
    y_pred = model.predict(x_test)
    # Calculate accuracy
    acc = metrics.accuracy_score(y_test, y_pred)
    return {'acc': acc}
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 11)
knn.fit(X_train, y_train)
# Evaluate Model
knn_eval = evaluate_model(knn, X_test, y_test)
# Print result
print('Accuracy:', knn_eval['acc'])
Accuracy: 0.9141324911271284
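An accuracy figure is easier to interpret next to two companion numbers: the accuracy obtained by always predicting the most common class, and the recall on the positive class. Whether the classes are imbalanced in this dataset is not shown in the notebook, so the sketch below uses toy labels only; the function names majority_baseline_accuracy and recall are hypothetical helpers written for this illustration.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model labelled as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else float("nan")

# Toy labels, for illustration only: 9 negatives and 1 positive
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10  # a "model" that never predicts the positive class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("baseline:", majority_baseline_accuracy(y_true))
print("recall:", recall(y_true, y_pred))
```

In this toy case the accuracy equals the baseline while the recall is zero, which is the situation these checks are meant to expose. For the notebook's model, the same helpers could be applied to list(y_test) and list(knn.predict(X_test)).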
Conclusion
Even though Body Mass Index (BMI) is often criticized as a faulty way of assessing "healthy" and "unhealthy" weights, the t-test
showed a statistically significant difference between the mean BMI of those who have heart disease and those who don't.
Furthermore, using all of our other variables (aside from Race, GenHealth, and AgeCategory, which were not significant), we built a k-nearest neighbors
classifier that predicted heart disease with 91% accuracy.
Overall, BMI proved itself to be accurate enough to be used as an indicator of heart disease.