D212_JBT_Task2

Western Governors University, Course D212 (Computer Science)
In [1]:
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score

# Import the warnings filter and ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# The CSV's first column is an index; without index_col=0, pandas would
# duplicate it and create an extra column
df = pd.read_csv('./medical_clean.csv', index_col=0)

In [2]:
# Check data types and count of values
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 49 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Customer_id         10000 non-null  object
 1   Interaction         10000 non-null  object
 2   UID                 10000 non-null  object
 3   City                10000 non-null  object
 4   State               10000 non-null  object
 5   County              10000 non-null  object
 6   Zip                 10000 non-null  int64
 7   Lat                 10000 non-null  float64
 8   Lng                 10000 non-null  float64
 9   Population          10000 non-null  int64
 10  Area                10000 non-null  object
 11  TimeZone            10000 non-null  object
 12  Job                 10000 non-null  object
 13  Children            10000 non-null  int64
 14  Age                 10000 non-null  int64
 15  Income              10000 non-null  float64
 16  Marital             10000 non-null  object
 17  Gender              10000 non-null  object
 18  ReAdmis             10000 non-null  object
 19  VitD_levels         10000 non-null  float64
 20  Doc_visits          10000 non-null  int64
 21  Full_meals_eaten    10000 non-null  int64
 22  vitD_supp           10000 non-null  int64
 23  Soft_drink          10000 non-null  object
 24  Initial_admin       10000 non-null  object
 25  HighBlood           10000 non-null  object
 26  Stroke              10000 non-null  object
 27  Complication_risk   10000 non-null  object
 28  Overweight          10000 non-null  object
 29  Arthritis           10000 non-null  object
 30  Diabetes            10000 non-null  object
 31  Hyperlipidemia      10000 non-null  object
 32  BackPain            10000 non-null  object
 33  Anxiety             10000 non-null  object
 34  Allergic_rhinitis   10000 non-null  object
 35  Reflux_esophagitis  10000 non-null  object
 36  Asthma              10000 non-null  object
 37  Services            10000 non-null  object
 38  Initial_days        10000 non-null  float64
 39  TotalCharge         10000 non-null  float64
 40  Additional_charges  10000 non-null  float64
 41  Item1               10000 non-null  int64
 42  Item2               10000 non-null  int64
 43  Item3               10000 non-null  int64
 44  Item4               10000 non-null  int64
 45  Item5               10000 non-null  int64
 46  Item6               10000 non-null  int64
 47  Item7               10000 non-null  int64
 48  Item8               10000 non-null  int64
dtypes: float64(7), int64(15), object(27)
memory usage: 3.8+ MB

In [3]:
# Data cleaning code from D206
# Replace the city-specific values with the standard time zones
df.TimeZone.replace({'America/Chicago': 'America - Central',
                     'America/New_York': 'America - Eastern',
                     'America/Los_Angeles': 'America - Pacific',
                     'America/Denver': 'America - Mountain',
                     'America/Detroit': 'America - Eastern',
                     'America/Indiana/Indianapolis': 'America - Eastern',
                     'America/Phoenix': 'America - Mountain',
                     'America/Boise': 'America - Mountain',
                     'America/Anchorage': 'America - Alaskan',
                     'America/Puerto_Rico': 'America - Atlantic',
                     'Pacific/Honolulu': 'America - Hawaii-Aleutian',
                     'America/Menominee': 'America - Central',
                     'America/Nome': 'America - Alaskan',
                     'America/Indiana/Vincennes': 'America - Eastern',
                     'America/Sitka': 'America - Alaskan',
                     'America/Kentucky/Louisville': 'America - Eastern',
                     'America/Toronto': 'America - Eastern',
                     'America/Indiana/Tell_City': 'America - Central',
                     'America/Indiana/Marengo': 'America - Eastern',
                     'America/North_Dakota/Beulah': 'America - Central',
                     'America/Indiana/Winamac': 'America - Eastern',
                     'America/Indiana/Vevay': 'America - Eastern',
                     'America/North_Dakota/New_Salem': 'America - Central',
                     'America/Indiana/Knox': 'America - Central',
                     'America/Yakutat': 'America - Alaskan',
                     'America/Adak': 'America - Hawaii-Aleutian'},
                    inplace=True)

# Convert the Zip column from int to string, then front-fill with zeroes
df['Zip'] = df['Zip'].astype("str").str.zfill(5)

# Convert the string object columns to category
for col in ["TimeZone", "Area", "Marital", "Gender", "Initial_admin",
            "Complication_risk", "Services"]:
    df[col] = df[col].astype("category")

# Convert Initial_days from float to integer
df["Initial_days"] = df["Initial_days"].astype("int64")

# Recast the Yes/No object columns to boolean-style values by mapping Yes to 1 and No to 0
bool_mapping = {"Yes": 1, "No": 0}
for col in ["ReAdmis", "Soft_drink", "HighBlood", "Stroke", "Overweight",
            "Arthritis", "Diabetes", "Hyperlipidemia", "BackPain", "Anxiety",
            "Allergic_rhinitis", "Reflux_esophagitis", "Asthma"]:
    df[col] = df[col].map(bool_mapping)

# Round the charge, income, and vitamin D columns from 6 decimal places to 2
for col in ["TotalCharge", "Additional_charges", "Income", "VitD_levels"]:
    df[col] = df[col].round(2)

# Establish a map for reversing the survey responses so that the scale runs from low (1) to high (8)
survey_score = {1: 8, 2: 7, 3: 6, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1}

# Reverse-map each survey item, then convert from int to float64
for col in ["Item1", "Item2", "Item3", "Item4", "Item5", "Item6", "Item7", "Item8"]:
    df[col] = df[col].map(survey_score).astype('float64')

# Replace header names with a snake_case convention
snake_case_column = ['customer_id', 'interaction', 'uid', 'city', 'state', 'county',
                     'zip_code', 'latitude', 'longitude', 'population', 'area_type',
                     'timezone', 'job', 'children', 'age', 'income', 'marital_status',
                     'gender', 'readmission', 'vit_d_level', 'doctor_visits',
                     'num_full_meals', 'vit_d_supp', 'soft_drink', 'initial_admin',
                     'highblood', 'stroke', 'complication_risk', 'overweight',
                     'arthritis', 'diabetes', 'hyperlipidemia', 'back_pain', 'anxiety',
                     'allergic_rhinitis', 'reflux_esophagitis', 'asthma', 'service_type',
                     'initial_stay', 'daily_charge', 'additional_charge',
                     'surv1_timely_admission', 'surv2_timely_treatment',
                     'surv3_timely_visits', 'surv4_reliability', 'surv5_options',
                     'surv6_hours_of_treatment', 'surv7_courteous_staff',
                     'surv8_active_listening_dr']
df.set_axis(snake_case_column, axis=1, inplace=True)

# Visually inspect the cleaned dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 49 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   customer_id                10000 non-null  object
 1   interaction                10000 non-null  object
 2   uid                        10000 non-null  object
 3   city                       10000 non-null  object
 4   state                      10000 non-null  object
 5   county                     10000 non-null  object
 6   zip_code                   10000 non-null  object
 7   latitude                   10000 non-null  float64
 8   longitude                  10000 non-null  float64
 9   population                 10000 non-null  int64
 10  area_type                  10000 non-null  category
 11  timezone                   10000 non-null  category
 12  job                        10000 non-null  object
 13  children                   10000 non-null  int64
 14  age                        10000 non-null  int64
 15  income                     10000 non-null  float64
 16  marital_status             10000 non-null  category
 17  gender                     10000 non-null  category
 18  readmission                10000 non-null  int64
 19  vit_d_level                10000 non-null  float64
 20  doctor_visits              10000 non-null  int64
 21  num_full_meals             10000 non-null  int64
 22  vit_d_supp                 10000 non-null  int64
 23  soft_drink                 10000 non-null  int64
 24  initial_admin              10000 non-null  category
 25  highblood                  10000 non-null  int64
 26  stroke                     10000 non-null  int64
 27  complication_risk          10000 non-null  category
 28  overweight                 10000 non-null  int64
 29  arthritis                  10000 non-null  int64
 30  diabetes                   10000 non-null  int64
 31  hyperlipidemia             10000 non-null  int64
 32  back_pain                  10000 non-null  int64
 33  anxiety                    10000 non-null  int64
 34  allergic_rhinitis          10000 non-null  int64
 35  reflux_esophagitis         10000 non-null  int64
 36  asthma                     10000 non-null  int64
 37  service_type               10000 non-null  category
 38  initial_stay               10000 non-null  int64
 39  daily_charge               10000 non-null  float64
 40  additional_charge          10000 non-null  float64
 41  surv1_timely_admission     10000 non-null  float64
 42  surv2_timely_treatment     10000 non-null  float64
 43  surv3_timely_visits        10000 non-null  float64
 44  surv4_reliability          10000 non-null  float64
 45  surv5_options              10000 non-null  float64
 46  surv6_hours_of_treatment   10000 non-null  float64
 47  surv7_courteous_staff      10000 non-null  float64
 48  surv8_active_listening_dr  10000 non-null  float64
dtypes: category(7), float64(14), int64(20), object(8)
memory usage: 3.3+ MB
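Beyond the visual inspection above, the cleaning steps can also be sanity-checked programmatically. A minimal sketch, assuming the cleaned df from In [3] (these assertions are an illustrative addition, not part of the original notebook):

# Illustrative sanity checks (not in the original notebook): confirm no
# missing values survived cleaning and that every mapped Yes/No column
# now holds only 0s and 1s.
assert df.isnull().sum().sum() == 0, "unexpected nulls after cleaning"

binary_cols = ['readmission', 'soft_drink', 'highblood', 'stroke', 'overweight',
               'arthritis', 'diabetes', 'hyperlipidemia', 'back_pain', 'anxiety',
               'allergic_rhinitis', 'reflux_esophagitis', 'asthma']
for col in binary_cols:
    assert df[col].isin([0, 1]).all(), f"non-binary value found in {col}"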
In [4]:
# Create X dataframe with the continuous variables we are interested in
X = df[["age", "income", "vit_d_level", "doctor_visits", "num_full_meals",
        "vit_d_supp", "initial_stay", "daily_charge", "additional_charge"]]

# Create list of column headers
X_cols = list(X.columns)

# Set y to back_pain, the variable we are interested in predicting
y = df["back_pain"]

In [5]:
# Create array of X values that are standardized
X_std = StandardScaler().fit_transform(df[["age", "income", "vit_d_level",
        "doctor_visits", "num_full_meals", "vit_d_supp", "initial_stay",
        "daily_charge", "additional_charge"]])

# Verify standardization
print(f"Verifying if mean is 0 and standard deviation is 1")

# Stick the standardized values into a temporary dataframe that we'll use for verification
X_std_df = pd.DataFrame(X_std, columns=X_cols)

# Print out the mean and the standard deviation for each of the 9 standardized columns
for column in X_cols:
    col_mean = round(X_std_df.loc[:, column].mean(), 4)
    col_std = round(X_std_df.loc[:, column].std(), 4)
    print(f"For column '{column}', the mean is {col_mean} and the standard deviation is {col_std}.")

Verifying if mean is 0 and standard deviation is 1
For column 'age', the mean is 0.0 and the standard deviation is 1.0001.
For column 'income', the mean is 0.0 and the standard deviation is 1.0001.
For column 'vit_d_level', the mean is -0.0 and the standard deviation is 1.0001.
For column 'doctor_visits', the mean is 0.0 and the standard deviation is 1.0001.
For column 'num_full_meals', the mean is 0.0 and the standard deviation is 1.0001.
For column 'vit_d_supp', the mean is -0.0 and the standard deviation is 1.0001.
For column 'initial_stay', the mean is -0.0 and the standard deviation is 1.0001.
For column 'daily_charge', the mean is 0.0 and the standard deviation is 1.0001.
For column 'additional_charge', the mean is 0.0 and the standard deviation is 1.0001.

In [6]:
# Define colors for the conditional formatting that we'll apply to our covariance matrix
def highlight_cells(val):
    if val > 0.9:
        color = 'red'
    else:
        color = ''
    return f"background: {color}"

# Generate covariance matrix (much quicker and more precise than an sns pairplot)
covariance_matrix = pd.DataFrame.cov(X_std_df)

# Apply the styling defined in the function above to flag very closely correlated pairs
covariance_matrix.style.applymap(highlight_cells)

Out[6]: [styled covariance matrix; the rendered table did not survive extraction. Cells with covariance above 0.9 are highlighted in red.]
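For a matrix this small the red cells are easy to spot by eye, but the same check can be done programmatically. A minimal sketch, assuming the covariance_matrix from In [6] and the same 0.9 threshold used in highlight_cells (this loop is an illustrative addition, not from the original notebook):

# Illustrative helper: list every distinct pair of columns whose
# covariance exceeds the 0.9 threshold used in highlight_cells.
import itertools

for col_a, col_b in itertools.combinations(covariance_matrix.columns, 2):
    cov_val = covariance_matrix.loc[col_a, col_b]
    if cov_val > 0.9:
        print(f"{col_a} and {col_b} covary strongly: {round(cov_val, 4)}")

Any column flagged here is a candidate for removal before PCA, which is exactly what happens to daily_charge in In [7] below.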
In [7]:
# Re-create X and X_cols to reflect this change
X = df[["age", "income", "vit_d_level", "doctor_visits", "num_full_meals",
        "vit_d_supp", "initial_stay", "additional_charge"]].copy()
X_cols = list(X.columns)

# Re-create array of standardized X values, omitting 'daily_charge'
X_std = StandardScaler().fit_transform(df[["age", "income", "vit_d_level",
        "doctor_visits", "num_full_meals", "vit_d_supp", "initial_stay",
        "additional_charge"]].copy())

# Re-verify that everything has been standardized to a mean of 0 and a standard deviation of 1
print(f"Verifying if mean is 0 and standard deviation is 1")

# Stick the standardized values into a temporary dataframe that we'll use for verification
X_std_df = pd.DataFrame(X_std, columns=X_cols)

# Print out the mean and the standard deviation for each of the 8 remaining columns
for column in X_cols:
    col_mean = round(X_std_df.loc[:, column].mean(), 4)
    col_std = round(X_std_df.loc[:, column].std(), 4)
    print(f"For column '{column}', the mean is {col_mean} and the standard deviation is {col_std}.")

Verifying if mean is 0 and standard deviation is 1
For column 'age', the mean is 0.0 and the standard deviation is 1.0001.
For column 'income', the mean is 0.0 and the standard deviation is 1.0001.
For column 'vit_d_level', the mean is -0.0 and the standard deviation is 1.0001.
For column 'doctor_visits', the mean is 0.0 and the standard deviation is 1.0001.
For column 'num_full_meals', the mean is 0.0 and the standard deviation is 1.0001.
For column 'vit_d_supp', the mean is -0.0 and the standard deviation is 1.0001.
For column 'initial_stay', the mean is -0.0 and the standard deviation is 1.0001.
For column 'additional_charge', the mean is 0.0 and the standard deviation is 1.0001.
In [8]:
# Save dataframe to CSV, ignoring the index (if included, this would create an additional column)
X_std_df.to_csv('task2_clean.csv', index=False)

In [9]:
# X_std is the array created by the StandardScaler for us to perform PCA with
# Instantiate our PCA object
pca = PCA(n_components=8, random_state=369)

# Fit the PCA to the standardized X data, then transform
X_pca = pca.fit_transform(X_std)

# Generate the matrix of PCA loadings
X_pca_loadings = pd.DataFrame(pca.components_.T,
                              columns=["PC1", "PC2", "PC3", "PC4",
                                       "PC5", "PC6", "PC7", "PC8"],
                              index=X_cols)
X_pca_loadings

Out[9]: [8-by-8 loadings matrix, one row per standardized predictor and one column per principal component; the rendered table did not survive extraction]

In [10]:
# Generate the 8 PCs' explained variance
print(f"These 8 principal components explain {round(sum(pca.explained_variance_ratio_) * 100, 3)}% of variance.")

# Break this down to show the individual contribution of each PC to the whole
print(f"The contribution of each principal component to the total can be seen here:")
pc_contributions = list(pca.explained_variance_ratio_)
pc_names = list(X_pca_loadings.columns)
for i in range(len(pc_names)):
    print(f"For {pc_names[i]}, the contribution is {round(pc_contributions[i] * 100, 3)}%")

These 8 principal components explain 100.0% of variance.
The contribution of each principal component to the total can be seen here:
For PC1, the contribution is 21.484%
For PC2, the contribution is 13.082%
For PC3, the contribution is 12.779%
For PC4, the contribution is 12.583%
For PC5, the contribution is 12.251%
For PC6, the contribution is 12.176%
For PC7, the contribution is 12.108%
For PC8, the contribution is 3.537%

In [11]:
# Use a scree plot to visualize cumulative explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of Principal Components")
plt.ylabel("Percentage of Explained Variance")
plt.show();

[Scree plot: cumulative explained variance versus number of principal components, rising toward 1.0]
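The scree plot and the per-PC contributions both suggest PC8 adds little (3.537%). The cut at 7 components can also be made programmatically with a cumulative-variance threshold. A minimal sketch assuming the fitted pca object from In [9]; the 0.95 cutoff is an illustrative assumption, not a value stated in the notebook:

# Illustrative helper: find the smallest number of components whose
# cumulative explained variance reaches a chosen threshold.
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
threshold = 0.95  # assumed cutoff for illustration
n_components_needed = int(np.searchsorted(cumulative_variance, threshold) + 1)
print(f"{n_components_needed} components reach {threshold:.0%} of total variance")

With these contributions, a 0.95 cutoff lands on 7 components (PC1 through PC7 cover 96.463%), which matches the choice made in In [12].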
In [12]:
# Re-run the PCA with 7 components, dropping PC8
final_pca = PCA(n_components=7, random_state=369)

# Fit the PCA to the standardized X data, then transform
final_pca.fit(X_std)
final_X_pca = final_pca.transform(X_std)

# Generate PCA loadings for the final_pca
final_X_pca_loadings = pd.DataFrame(final_pca.components_.T,
                                    columns=["PC1", "PC2", "PC3", "PC4",
                                             "PC5", "PC6", "PC7"],
                                    index=X_cols)
final_X_pca_loadings

Out[12]: [8-by-7 loadings matrix for the final seven components; the rendered table did not survive extraction]

In [13]:
# Show the individual contribution of each PC to the whole
print(f"The amount of variance accounted for by each principal component can be seen here:")

# Doing this in a pretty way, rather than printing an unlabelled list of explained variance ratios
pc_contributions = list(final_pca.explained_variance_ratio_)
pc_names = list(final_X_pca_loadings.columns)
for i in range(len(pc_names)):
    print(f"For {pc_names[i]}, the contribution is {round(pc_contributions[i] * 100, 3)}%")

The amount of variance accounted for by each principal component can be seen here:
For PC1, the contribution is 21.484%
For PC2, the contribution is 13.082%
For PC3, the contribution is 12.779%
For PC4, the contribution is 12.583%
For PC5, the contribution is 12.251%
For PC6, the contribution is 12.176%
For PC7, the contribution is 12.108%
In [14]:
print(f"These 7 principal components explain {round(sum(final_pca.explained_variance_ratio_) * 100, 3)}% of variance in the data.")

These 7 principal components explain 96.463% of variance in the data.

In [15]:
# Split the data into train and test sets, 80% train, 20% test; use stratify to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(final_X_pca, y, train_size=0.8, stratify=y)

# Verify that each of the X sets is shaped as expected, reflecting the 7 PCs
print(f"The shape of the X_train set is: {X_train.shape}")
print(f"The shape of the X_test set is: {X_test.shape}")

The shape of the X_train set is: (8000, 7)
The shape of the X_test set is: (2000, 7)

In [16]:
# Instantiate and fit our classification model
classification_model = DecisionTreeClassifier(random_state=369).fit(X_train, y_train)
y_predictions = classification_model.predict(X_test)

# Generate accuracy report for this model
test_accuracy = accuracy_score(y_test, y_predictions)
print(f'Decision tree accuracy: {test_accuracy}')

# Predict the test set probabilities of the positive class
y_pred_proba = classification_model.predict_proba(X_test)[:, 1]

# Generate confusion matrix
final_matrix = confusion_matrix(y_test, y_predictions)
print("\nThe confusion matrix for this Decision Tree model:")
print("Predicted No Back Pain | Predicted Back Pain")
print(f"          {final_matrix[0]}            Actual No Back Pain")
print(f"          {final_matrix[1]}            Actual Back Pain")

Decision tree accuracy: 0.518

The confusion matrix for this Decision Tree model:
Predicted No Back Pain | Predicted Back Pain
          [675 502]            Actual No Back Pain
          [462 361]            Actual Back Pain
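The imports in In [1] include roc_auc_score and roc_curve, and In [16] already computes y_pred_proba, but the preview ends before either is used. A sketch of how the evaluation presumably continues, using the names defined above (the plot styling and the baseline line are illustrative assumptions):

# Illustrative continuation: score the positive-class probabilities with
# AUC and draw the ROC curve, using y_pred_proba from In [16].
auc = roc_auc_score(y_test, y_pred_proba)
print(f"Decision tree AUC: {round(auc, 3)}")

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f"Decision Tree (AUC = {round(auc, 3)})")
plt.plot([0, 1], [0, 1], linestyle='--', label="No-skill baseline")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show();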