Final Project.ipynb - Colaboratory

This dataset was originally obtained from the CDC's website. It consisted of more than 300 variables; fortunately, I found a "cleaned/trimmed" version of the same dataset that keeps only the most important ones. The data come from the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. The final dataset consists of 18 variables (4 quantitative and 14 categorical) and 319,795 cases, with the following variables and descriptions:

Heart Disease (categorical, response): respondents who have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
BMI (quantitative, continuous): body mass index
Smoking (categorical): have you smoked at least 100 cigarettes in your entire life?
Alcohol Drinking (categorical): heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
Stroke (categorical): ever had a stroke?
Physical Health (quantitative, discrete): for how many days during the past 30 days was your physical health not good?
Mental Health (quantitative, discrete): for how many days during the past 30 days was your mental health not good?
Difficulty Walking (categorical): do you have serious difficulty walking or climbing stairs?
Sex (categorical): are you male or female?
Age (categorical): respondent age, binned into categories such as 55-59 or 80 or older
Race/Ethnicity (categorical): Black, White, Asian, American Indian/Alaskan Native, Hispanic
Diabetic (categorical): have you been diagnosed with diabetes?
Physical Activity (categorical): adults who reported doing physical activity or exercise during the past 30 days other than their regular job
General Health (categorical): how would you classify your general health?
Sleep Time (quantitative, discrete): average sleep time you get per night
Asthma (categorical): do you have asthma?
Kidney Disease (categorical): do you suffer from a kidney disease?
Skin Cancer (categorical): do you have, or have you ever had, any type of skin cancer?

Heart Disease Indicator - Introduction

Heart disease is one of the leading causes of death in modern society. I became interested in this dataset after reading an article arguing that BMI is a flawed and misleading way of assessing one's overall "healthy weight", yet it is widely used in exactly this kind of indicator and prediction. For these reasons I chose this dataset: to see for myself whether BMI works for this type of prediction.

Description of Research Question

Is there a statistically significant difference in mean BMI between respondents with and without heart disease, and can the remaining variables be used to predict heart disease?

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import scipy as sp
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
%matplotlib inline
import warnings
from pylab import rcParams
warnings.filterwarnings("ignore")
rcParams['figure.figsize'] = 20,10
rcParams['font.size'] = 30
sns.set()
np.random.seed(8)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
%cd /content/drive/My Drive/Colab Notebooks/

/content/drive/My Drive/Colab Notebooks

data = pd.read_csv('heart_2020_cleaned.csv')
data.head()

  HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  MentalHealth  DiffWalking  Sex     AgeCategory  Race   Diabetic  PhysicalActivity
0 No            16.60  Yes      No               No       3.0            30.0          No           Female  55-59        White  Yes       Yes
1 No            20.34  No       No               Yes      0.0             0.0          No           Female  80 or older  White  No        Yes
2 No            26.58  Yes      No               No      20.0            30.0          No           Male    65-69        White  Yes       Yes
3 No            24.21  No       No               No       0.0             0.0          No           Female  75-79        White  No        No
4 No            23.71  No       No               No      28.0             0.0          Yes          Female  40-44        White  No        Yes

A simple inspection of this dataset shows that, to make the EDA and the classification easier (or more feasible), we need to change the yes/no variables to binary (1/0). The next line of code does exactly that.

data = data[data.columns].replace({'Yes':1, 'No':0, 'Male':1, 'Female':0, 'No, borderline diabetes':0, 'Yes (during pregnancy)':1})  # Replacing all the categorical "binary" values
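As a quick sanity check (not part of the original notebook), one might confirm the dimensions stated in the data description and verify that the recoded columns now contain only 0 and 1. A minimal sketch, assuming the data frame and column names above:

print(data.shape)                  # expect (319795, 18) per the description above
print(data.dtypes.value_counts())  # how many numeric vs. string (object) columns remain
# The recoded yes/no columns should now hold only 0 and 1
for col in ['HeartDisease', 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking',
            'Sex', 'Diabetic', 'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer']:
    assert set(data[col].unique()) <= {0, 1}, col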
data.head()

  HeartDisease    BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  MentalHealth  DiffWalking  Sex  AgeCategory  Race   Diabetic  PhysicalActivity  GenHealth ...
0 0             16.60  1        0                0        3.0            30.0          0            0    55-59        White  1         1
1 0             20.34  0        0                1        0.0             0.0          0            0    80 or older  White  0         1
2 0             26.58  1        0                0       20.0            30.0          0            1    65-69        White  1         1
3 0             24.21  0        0                0        0.0             0.0          0            0    75-79        White  0         0
4 0             23.71  0        0                0       28.0             0.0          1            0    40-44        White  0         1

With this updated version of the dataset we won't get any errors when plotting our graphs.

Make a boxplot and a density function

fig, ax = plt.subplots(figsize=(13,5))
sns.kdeplot(data[data["HeartDisease"]==1]["BMI"], alpha=0.5, shade=True, color="red", label="HeartDisease", ax=ax)
sns.kdeplot(data[data["HeartDisease"]==0]["BMI"], alpha=0.5, shade=True, color="blue", label="Normal", ax=ax)
plt.title('Distribution of BMI', fontsize=18)
ax.set_xlabel("BMI")
ax.set_ylabel("Frequency")
ax.legend();
plt.show()

In this plot of the two BMI distributions there is a slight difference: the "normal" (no heart disease) curve is shifted a little to the left, which means its mean BMI will be lower. We don't yet know whether that difference is significant; the dataset is very large, so a shift this small may or may not turn out to be meaningful.

sns.set(style="darkgrid")
sns.set(rc={'figure.figsize':(16,8)})
# creating a figure composed of 3 matplotlib.Axes objects
f, (ax_box1, ax_box2, ax_hist) = plt.subplots(3, sharex=True, gridspec_kw={"height_ratios": (.15, .15, .85)})
colours = ['red', 'blue', '#fbbc05', '#34a853']  # some random colors
# assigning a graph to each ax_box
sns.boxplot(x=data[data['HeartDisease']==1]["BMI"], ax=ax_box1, color="#ea4335")
sns.histplot(data[data['HeartDisease']==1], x="BMI", ax=ax_hist, kde=True, color="#ea4335")
sns.boxplot(x=data[data['HeartDisease']==0]["BMI"], ax=ax_box2, color='#4285f4')
sns.histplot(data[data['HeartDisease']==0], x="BMI", ax=ax_hist, kde=True, color='#4285f4')
# Remove x axis name for the boxplots, would get a formatting error otherwise
ax_box1.set(xlabel='')
ax_box2.set(xlabel='')
plt.legend(title='', loc=2, labels=['Heart Disease', 'No Heart Disease'], bbox_to_anchor=(1.02, 1), borderaxespad=0.)
plt.show()

By simple inspection it seems that people with heart disease have a higher BMI than those who don't. The boxplots also reveal a ton of outliers, so the mean will be pulled to the right in both groups. A difference in medians might be a better option, but it is out of the scope of this class and would be harder to interpret.

Hypothesis test - Difference in means

Null: there is no difference in the mean BMI of those with heart disease compared to those without heart disease.
Alternative: there is a statistically significant difference in the mean BMI of those with heart disease compared to those without heart disease.

def plot_distribution(inp):
    # Function that plots the distribution and marks where the mean is located
    plt.figure()
    ax = sns.distplot(inp)
    plt.axvline(np.mean(inp), color="k", linestyle="dashed", linewidth=5)  # Dashed line to represent the mean
    _, max_ = plt.ylim()
    plt.text(
        inp.mean() + inp.mean() / 10,
        max_ - max_ / 10,
        "Mean: {:.2f}".format(inp.mean()),  # write the mean of the distribution
    )
    return plt.figure

updatedTable = data[['HeartDisease','BMI']]  # Keep only the 2 variables we are interested in
arr = updatedTable.to_numpy()
arr  # Our new table is a 2-D array of heart-disease indicators (0/1) and BMI values

array([[ 0.  , 16.6 ],
       [ 0.  , 20.34],
       [ 0.  , 26.58],
       ...,
       [ 0.  , 24.24],
       [ 0.  , 32.81],
       [ 0.  , 46.56]])

len(arr)  # Making sure we have the correct number of cases

319795

def separating_0(arr):
    # Separates those WITHOUT heart disease and stores their BMI values in a numpy array
    group_0 = []
    for i in range(len(arr)):
        if arr[i][0] == 0:  # The first value of each row is 0 or 1, so traverse the 2-D array and keep the BMI whenever that value is 0
            group_0.append(arr[i][1])
    arr_group_0 = np.array(group_0)
    return arr_group_0

def separating_1(arr):
    # Same as the other function, but it collects those WITH heart disease
    group_1 = []
    for i in range(len(arr)):
        if arr[i][0] == 1:
            group_1.append(arr[i][1])
    arr_group_1 = np.array(group_1)
    return arr_group_1

plot_distribution(separating_0(arr))  # Plotting a more exact distribution of the BMI of those without heart disease

plot_distribution(separating_1(arr))  # Plotting a more exact distribution of the BMI of those with heart disease

Looking at the means, we see a difference of 1.18 (29.40 - 28.22). This is consistent with our observation in the first distribution plot: the mean BMI of those without heart disease is lower.

plt.figure()
ax1 = sns.distplot(separating_0(arr))
ax2 = sns.distplot(separating_1(arr))
plt.axvline(np.mean(separating_0(arr)), color='black', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(separating_1(arr)), color='black', linestyle='dashed', linewidth=2)
# Plotting both distributions again, this time with the dashed mean lines

Plotting both distributions again with the dashed lines for reference, to make it easier to visualize the difference between the two means.
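As an aside (not part of the original notebook), the two helper loops above could also be written with pandas boolean indexing. A minimal sketch, assuming the recoded data frame from earlier:

# Equivalent to separating_0(arr) / separating_1(arr), using boolean indexing
bmi_no_hd = data.loc[data['HeartDisease'] == 0, 'BMI'].to_numpy()
bmi_hd = data.loc[data['HeartDisease'] == 1, 'BMI'].to_numpy()
print(bmi_no_hd.mean(), bmi_hd.mean())  # should match the means shown above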
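For reference (not part of the original notebook), ttest_ind with its default settings computes the pooled two-sample t statistic, where $\bar{x}_1, \bar{x}_2$ are the group means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the group sizes:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

Under the null hypothesis of equal means, $t$ follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom, which is where the p-value below comes from.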
def compare_2_groups(arr_1, arr_2, alpha, sample_size):
    stat, p = ttest_ind(arr_1, arr_2)  # ttest_ind is a scipy function that performs a two-sample test for a difference in means
    print('p=%.3f' % (p))  # Print the p-value to 3 decimals
    if p > alpha:
        print('fail to reject H0')
    else:
        print('reject H0')

sample_size = len(separating_1(arr))  # Make the sub-sample size the same as the smallest group; anything bigger would throw a ValueError, because we sample without replacement
Heart_Disease_1 = np.random.choice(separating_1(arr), sample_size, replace=False)
Heart_Disease_0 = np.random.choice(separating_0(arr), sample_size, replace=False)
compare_2_groups(Heart_Disease_1, Heart_Disease_0, 0.05, sample_size)

p=0.000
reject H0

np.mean(Heart_Disease_1)  # Making sure the mean of the sub-sample is close to that of the original sample

29.40159207978665

np.mean(Heart_Disease_0)  # Making sure the mean of the sub-sample is close to that of the original sample

28.254150805538305

Implement a k-neighbors classifier

data = data.drop(columns=['AgeCategory', 'Race', 'GenHealth'], axis=1)  # Dropping the variables we won't use
features = data.drop(columns=['HeartDisease'], axis=1)  # We can't feed our prediction target to the model as a feature, so it is dropped here

# Select target predictor
target = data['HeartDisease']

# Set training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, shuffle=True, test_size=.2, random_state=44)  # The shuffling doesn't make any difference here
print('Shape of training features:', X_train.shape)
print('Shape of testing features:', X_test.shape)
print('Shape of training labels:', y_train.shape)
print('Shape of testing labels:', y_test.shape)

Shape of training features: (255836, 14)
Shape of testing features: (63959, 14)
Shape of training labels: (255836,)
Shape of testing labels: (63959,)

def evaluate_model(model, x_test, y_test):
    from sklearn import metrics
    # Predict test data
    y_pred = model.predict(x_test)
    # Calculate accuracy
    acc = metrics.accuracy_score(y_test, y_pred)
    return {'acc': acc}

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)

# Evaluate model
knn_eval = evaluate_model(knn, X_test, y_test)

# Print result
print('Accuracy:', knn_eval['acc'])

Accuracy: 0.9141324911271284
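Accuracy on its own can hide how a classifier behaves on the minority class if, as is common in this kind of survey data, only a small fraction of respondents report heart disease. A minimal sketch, not part of the original notebook, reusing the fitted knn and the X_test/y_test split from above to also report a confusion matrix, recall, and precision:

from sklearn import metrics

y_pred = knn.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))                      # rows: true class, columns: predicted class
print('Recall (heart disease):', metrics.recall_score(y_test, y_pred))
print('Precision (heart disease):', metrics.precision_score(y_test, y_pred))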
Conclusion

Even though Body Mass Index (BMI) is often criticized for its faultiness in assessing "healthy" and "unhealthy" weights, we were able to show that there is in fact a significant difference in the mean BMI of those who have heart disease compared to those who don't. Furthermore, by using all of our other variables (aside from AgeCategory, Race, and GenHealth, which were not significant), we created a k-neighbors classifier model that could predict heart disease with a 91% accuracy rate. Overall, BMI proved itself accurate enough to be used as an indicator of heart disease.