HW 10: Predicting medical costs for an insurance company

Section 1: Business problem and data

You work for an insurance company as a data scientist. You have demographic and health-related information about the customers that purchase your insurance policies. You want to build a model to predict how many medical bills a future customer will incur. In particular, you want to build a model that predicts the precise amount (charges), as well as a model that predicts whether the customer will be in the high-cost category or not (high_charge).

Descriptions of the variables

- age: Age of primary beneficiary
- sex: Gender of primary beneficiary (female/male)
- bmi: Body mass index
- children: Number of children (dependents) covered by health insurance
- smoker: Smoker status (yes/no)
- region: Beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
- charges: Individual medical costs billed to health insurance
- high_charge: Whether the medical costs billed to health insurance are high or not (based on the value of charges)

Section 2: Understanding the data (1.5 pts)

First, install the dmba package and import all required packages.

Install the packages

```python
!pip install dmba
```

```
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: dmba in /usr/local/lib/python3.10/dist-packages (0.2.2)
```

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from dmba import classificationSummary, regressionSummary, plotDecisionTree
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve  # public import; the private _plot module path is unnecessary
import matplotlib.pylab as plt

%matplotlib inline
```

Read the dataset and inspect the dataset (1.5 pts)

1. Read the data from this link: https://raw.githubusercontent.com/irenebao2020/badm211/main/insurance.csv
2. Show the data types of each column (0.5 pt)
3. Show the distributions of the four categorical variables (1 pt)

Note: there is no need to check for missing values or duplicate rows. Each row in the dataset corresponds to an individual consumer, but there is no column with the consumer ID. Due to the small number of predictor columns and the lack of a consumer ID column, some rows may appear to be duplicates, but they actually refer to different consumers.
```python
# Your code here
df = pd.read_csv("https://raw.githubusercontent.com/irenebao2020/badm211/main/insurance.csv")
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   age          1338 non-null   int64
 1   sex          1338 non-null   object
 2   bmi          1338 non-null   float64
 3   children     1338 non-null   int64
 4   smoker       1338 non-null   object
 5   region       1338 non-null   object
 6   charges      1338 non-null   float64
 7   high_charge  1338 non-null   object
dtypes: float64(2), int64(2), object(4)
memory usage: 83.8+ KB
```

```python
df["sex"].value_counts(normalize=True)
```

```
male      0.505232
female    0.494768
Name: sex, dtype: float64
```

```python
df["smoker"].value_counts(normalize=True)
```

```
no     0.795217
yes    0.204783
Name: smoker, dtype: float64
```

```python
df["region"].value_counts(normalize=True)
```

```
southeast    0.272048
southwest    0.242900
northwest    0.242900
northeast    0.242152
Name: region, dtype: float64
```

```python
df["high_charge"].value_counts(normalize=True)
```

```
No     0.749626
Yes    0.250374
Name: high_charge, dtype: float64
```

Q1. What percentage of consumers are in the high charge category?
A. 70
B. 30
C. 75
D. 25

Section 3: Data preprocessing (2 pts)

1. Define X (Prediction Matrix)
2. Dummy code the categorical variables in X (Hint: no need to drop a dummy in decision trees and random forests.)

Note: in this notebook, you will be predicting both a classification and a regression problem. The variable high_charge is a categorical variable that is determined by the numerical variable charges. Therefore, when we create X, we should exclude both the categorical variable high_charge and the numerical variable charges.

```python
# Your code here
X = df.drop(columns=['high_charge', 'charges'])
X = pd.get_dummies(X)
```
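Before modeling, it can help to confirm what get_dummies produced. The sanity check below is a sketch, not part of the original notebook; the expected column list is an assumption based on pandas' default "<column>_<level>" naming, under which every categorical level gets its own indicator column — which is why the hint says no dummy needs to be dropped for trees and forests.

```python
# Sketch of a sanity check on the dummy-coded design matrix.
# Expected columns (an assumption from pandas' default get_dummies
# behavior: numeric columns first, then one indicator per level):
print(list(X.columns))
# ['age', 'bmi', 'children', 'sex_female', 'sex_male',
#  'smoker_no', 'smoker_yes', 'region_northeast', 'region_northwest',
#  'region_southeast', 'region_southwest']
```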
Section 4: Regression Analysis (4 pts + 2 bonus pts)

In this section, you will perform regression analysis using both a regression tree and a random forest. Your goal is to predict charges using all of the predictor variables. Before modeling, take the following two steps.

1. Define y (Outcome Variable) (0.25 pts)
2. Partition the data into a training set and test set. Specify test_size as 0.2 and random_state as 511. (0.25 pts)

```python
# Your code here
y = df['charges']
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=511)
```

Regression Tree (2 pts)

1. Fit a decision tree to the training set. Make sure to include the stopping criteria max_depth=5 and random_state=1.
2. Plot the decision tree (only show the top 3 levels)
3. Make predictions on the training and the test sets
4. Evaluate the performance of the decision tree on the training set and test set

```python
# Your code here
reg_tree = DecisionTreeRegressor(max_depth=5, random_state=1)
reg_tree.fit(train_X, train_y)
```

```
DecisionTreeRegressor(max_depth=5, random_state=1)
```

```python
plotDecisionTree(reg_tree, feature_names=train_X.columns, max_depth=3)
```

[Decision tree plot, top 3 levels. Root node: smoker_no ≤ 0.5, samples = 1070, value = 13267.072. True branch (smokers): bmi ≤ 30.01, 220 samples, value 32010.944, splitting into nodes of 104 samples (value 21020.26) and 116 samples (value 41864.662). False branch (non-smokers): age ≤ 42.5, 850 samples, value 8415.717, with a child node children ≤ 0.5 (478 samples, value 5457.676) that splits on age ≤ 32.5 (203 samples, value 4032.285) and a region_southwest condition (275 samples, value 6509.874).]

```python
train_pred_y_dt = reg_tree.predict(train_X)
test_pred_y_dt = reg_tree.predict(test_X)

regressionSummary(train_y, train_pred_y_dt)
```
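Only the training-set summary for the regression tree survives in this preview (its output is shown below). To complete step 4, a matching test-set evaluation would presumably be one more call; this companion cell is an assumption, mirroring the random-forest cells that follow.

```python
# Assumed companion cell (not visible in the preview): evaluate the
# fitted regression tree on the held-out test set as well.
regressionSummary(test_y, test_pred_y_dt)
```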
```
Regression statistics (training set)

Mean Error (ME) : -0.0000
Root Mean Squared Error (RMSE) : 4094.7903
Mean Absolute Error (MAE) : 2332.5808
Mean Percentage Error (MPE) : -17.7097
Mean Absolute Percentage Error (MAPE) : 28.8514
```

Q2. What is the splitting condition in the root decision node?
A. smoker_no <= 0.5
B. age <= 42.5
C. children <= 0.5
D. bmi <= 35.4

Q3. How many non-smokers are in the training set?
A. 1070
B. 220
C. 850
D. 104

Q4. What is the average charge for smokers in the training set?
A. 13267
B. 8415.7
C. 32011
D. 21020

Random Forest (1.5 pts)

1. Fit a random forest to the training set. The forest should include 500 trees. Use the stopping criteria min_impurity_decrease=0.001 and max_depth=10, as well as random_state=1.
2. Make predictions on the training and the test sets
3. Evaluate the performance of the random forest on the training set and test set

```python
# Your code here
reg_rf = RandomForestRegressor(n_estimators=500, min_impurity_decrease=0.001, max_depth=10, random_state=1)
reg_rf = reg_rf.fit(train_X, train_y)

train_pred_rf = reg_rf.predict(train_X)
test_pred_rf = reg_rf.predict(test_X)

regressionSummary(train_y, train_pred_rf)
regressionSummary(test_y, test_pred_rf)
```

```
Regression statistics (training set)

Mean Error (ME) : -74.1901
Root Mean Squared Error (RMSE) : 2125.1206
Mean Absolute Error (MAE) : 1163.8475
Mean Percentage Error (MPE) : -10.3977
Mean Absolute Percentage Error (MAPE) : 15.4118

Regression statistics (test set)

Mean Error (ME) : -192.5143
Root Mean Squared Error (RMSE) : 4921.7697
Mean Absolute Error (MAE) : 2663.4850
Mean Percentage Error (MPE) : -19.1123
Mean Absolute Percentage Error (MAPE) : 28.7115
```

Q5. Based on RMSE, which model performs better?
A. Regression Tree
B. Random Forest

Bonus: Grid Search (2 pts)

Rerun the above models using GridSearchCV with at least 3 stopping criteria and at least 3 possible values for each criterion. (You do not have to visualize any trees.) After fitting the model using grid search, add a markdown cell commenting on whether this is an improvement on the above models.

Grid search Decision Tree

```python
# Your code here
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, 10],
    'min_impurity_decrease': [0, 0.00005, 0.0005, 0.001, 0.003],
    'min_samples_split': [3, 5, 7, 10],
    'random_state': [1],
}

regTree = GridSearchCV(DecisionTreeRegressor(), param_grid)
regTree.fit(train_X, train_y)
```

```python
regTree.best_estimator_
```

```
DecisionTreeRegressor(max_depth=3, min_impurity_decrease=0, min_samples_split=3, random_state=1)
```

```python
regTree_train_predictions = regTree.predict(train_X)
regTree_test_predictions = regTree.predict(test_X)

regressionSummary(train_y, regTree_train_predictions)
regressionSummary(test_y, regTree_test_predictions)
```

```
Regression statistics (training set)

Mean Error (ME) : 0.0000
Root Mean Squared Error (RMSE) : 4537.2213
Mean Absolute Error (MAE) : 2751.8003
Mean Percentage Error (MPE) : -24.0686
Mean Absolute Percentage Error (MAPE) : 37.7458

Regression statistics (test set)

Mean Error (ME) : 20.7248
Root Mean Squared Error (RMSE) : 4839.4350
Mean Absolute Error (MAE) : 2893.3887
Mean Percentage Error (MPE) : -25.0872
Mean Absolute Percentage Error (MAPE) : 37.8368
```

Grid search Random Forest
```python
# Your code here
param_grid = {
    'n_estimators': [400, 700, 1000],
    'min_impurity_decrease': [0.0001, 0.001, 0.01],
    'min_samples_split': [3, 5, 10],
    'max_depth': [30],
    'max_features': ['sqrt'],
}

RF_reg = GridSearchCV(RandomForestRegressor(), param_grid)
RF_reg.fit(train_X, train_y)
```

```python
RF_reg.best_estimator_
```

```
RandomForestRegressor(max_depth=30, max_features='sqrt', min_impurity_decrease=0.001, min_samples_split=5, n_estimators=700)
```

```python
RF_train_predictions = RF_reg.predict(train_X)
RF_test_predictions = RF_reg.predict(test_X)

regressionSummary(train_y, RF_train_predictions)
regressionSummary(test_y, RF_test_predictions)
```

```
Regression statistics (training set)

Mean Error (ME) : -49.2991
Root Mean Squared Error (RMSE) : 2846.1973
Mean Absolute Error (MAE) : 1640.2170
Mean Percentage Error (MPE) : -14.6292
Mean Absolute Percentage Error (MAPE) : 21.1756

Regression statistics (test set)

Mean Error (ME) : 68.5006
Root Mean Squared Error (RMSE) : 4791.1305
Mean Absolute Error (MAE) : 2745.1475
Mean Percentage Error (MPE) : -20.9043
Mean Absolute Percentage Error (MAPE) : 31.3503
```

By test RMSE, the grid-searched random forest (4791.13) is a modest improvement over the base random forest (4921.77).

Section 5: Classification Analysis (5.5 pts + 2 bonus pts)

In this section, you will perform classification analysis using both a classification tree and a random forest. Your goal is to predict high_charge using all of the predictor variables. Before modeling, take the following three steps.

1. Convert the categorical outcome variable into a numeric variable (0/1) (0.5 pts)
2. Define y (Outcome Variable) (0.25 pts)
3. Partition the data into a training set and test set. Specify test_size as 0.2 and random_state as 511. (0.25 pts)

```python
# Your code here
df['high_charge'] = np.where(df['high_charge'] == 'Yes', 1, 0)
y = df['high_charge']
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=511)
```

Decision Tree (2 pts)
1. Fit a decision tree to the training set. Make sure to include the stopping criteria max_depth=10 and min_samples_leaf=10, as well as random_state=1.
2. Plot the decision tree (only show the top 3 levels)
3. Make predictions on the test set
4. Evaluate the performance of the decision tree on the training set and test set

```python
# Your code here
class_tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=10, random_state=1)
class_tree = class_tree.fit(train_X, train_y)

plotDecisionTree(class_tree, feature_names=train_X.columns, max_depth=3)
```

[Classification tree plot, top 3 levels. Root node: smoker_no ≤ 0.5, samples = 1070, value = [802, 268]. True branch (smokers): bmi ≤ 22.943, 220 samples, value [16, 204], splitting into nodes of 25 samples [11, 14] and 195 samples [5, 190]. False branch (non-smokers): age ≤ 47.5, 850 samples, value [786, 64], splitting into bmi ≤ 29.62 (559 samples, [525, 34]) and bmi ≤ 23.865 (291 samples, [261, 30]).]

```python
train_pred_y_dt = class_tree.predict(train_X)
test_pred_y_dt = class_tree.predict(test_X)

classtree_train_predictions_prob = class_tree.predict_proba(train_X)
classtree_test_predictions_prob = class_tree.predict_proba(test_X)

classificationSummary(train_y, train_pred_y_dt)
```

```
Confusion Matrix (Accuracy 0.9327)

       Prediction
Actual   0   1
     0 796   6
     1  66 202
```

```python
classificationSummary(test_y, test_pred_y_dt)
```

```
Confusion Matrix (Accuracy 0.9366)

       Prediction
Actual   0   1
     0 200   1
     1  16  51
```

Q6. How many non-smokers are in the high charge category?
A. 64
B. 16
C. 204
D. 268

Q7. How many non-smokers are older than 47.5 years old?
A. 291
B. 559
C. 850
D. 220

Random Forest (1.5 pts)

1. Fit a random forest to the training set. The forest should include 500 trees. Use the stopping criteria min_impurity_decrease=0.001 and max_depth=10, as well as random_state=1.
2. Make predictions on the training and the test sets
3. Evaluate the performance of the random forest on the training set and test set

```python
# Your code here
class_rf = RandomForestClassifier(n_estimators=500, min_impurity_decrease=0.001, max_depth=10, random_state=1)
class_rf = class_rf.fit(train_X, train_y)

train_pred_rf = class_rf.predict(train_X)
test_pred_rf = class_rf.predict(test_X)

train_pred_prob_rf = class_rf.predict_proba(train_X)  # was assigned to test_pred_prob_rf by mistake
test_pred_prob_rf = class_rf.predict_proba(test_X)

classificationSummary(train_y, train_pred_rf)
classificationSummary(test_y, test_pred_rf)
```

```
Confusion Matrix (Accuracy 0.9280)

       Prediction
Actual   0   1
     0 789  13
     1  64 204

Confusion Matrix (Accuracy 0.9291)

       Prediction
Actual   0   1
     0 198   3
     1  16  51
```

ROC Curve (1 pt)

Compare the decision tree and random forest models by using ROC curves.

```python
# Your code here
# specify plot size
plt.figure(figsize=(8, 6))

# generate values and plot for decision tree
fpr, tpr, threshold = roc_curve(test_y, classtree_test_predictions_prob[:, 1])
plt.plot(fpr, tpr, color='darkorange', label="Decision Tree")

# generate values and plot for random forest
fpr, tpr, threshold = roc_curve(test_y, test_pred_prob_rf[:, 1])
plt.plot(fpr, tpr, color='indigo', label="Random Forest")

# generate line for performance of a random classifier
plt.plot([0, 1], [0, 1], color='magenta', lw=1, linestyle='--')

# other settings
plt.legend(loc="lower right")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Pos Rate')
plt.ylabel('True Pos Rate')
plt.show()
```

[ROC plot: test-set ROC curves for the decision tree and random forest, with the dashed diagonal reference line for a random classifier.]

Bonus: Grid Search (2 pts)

Rerun the above models using GridSearchCV with at least 3 stopping criteria and at least 3 possible values for each criterion. (You do not have to visualize any trees.) After fitting the model using grid search, add a markdown cell commenting on whether this is an improvement on the above models using accuracy.

Grid Search Decision Tree

```python
# Your code here
param_grid = {
    'max_depth': [10, 20, 30],
    'min_impurity_decrease': [0, 0.0001, 0.001, 0.01],
    'min_samples_split': [10, 15, 20, 40, 50],
    'random_state': [1],
}

classTree = GridSearchCV(DecisionTreeClassifier(), param_grid)
classTree.fit(train_X, train_y)
```

```python
classTree.best_estimator_
```

```
DecisionTreeClassifier(max_depth=10, min_impurity_decrease=0.001, min_samples_split=20, random_state=1)
```
```python
classTree_train_predictions = classTree.predict(train_X)
classTree_test_predictions = classTree.predict(test_X)

classTree_train_predictions_prob = classTree.predict_proba(train_X)
classTree_test_predictions_prob = classTree.predict_proba(test_X)

classificationSummary(train_y, classTree_train_predictions)
classificationSummary(test_y, classTree_test_predictions)
```

```
Confusion Matrix (Accuracy 0.9327)

       Prediction
Actual   0   1
     0 796   6
     1  66 202

Confusion Matrix (Accuracy 0.9366)

       Prediction
Actual   0   1
     0 200   1
     1  16  51
```

Grid Search Random Forest

```python
# Your code here
param_grid = {
    'n_estimators': [100, 300, 500],
    'min_impurity_decrease': [0.0001, 0.001, 0.01],
    'min_samples_split': [10, 15, 20],
    'max_depth': [20],
    'max_features': ['sqrt'],
}

RF_class = GridSearchCV(RandomForestClassifier(), param_grid)
RF_class.fit(train_X, train_y)
```

```python
RF_class.best_estimator_
```

```
RandomForestClassifier(max_depth=20, min_impurity_decrease=0.0001, min_samples_split=10, n_estimators=500)
```

```python
RF_train_predictions = RF_class.predict(train_X)
RF_test_predictions = RF_class.predict(test_X)

RF_train_predictions_prob = RF_class.predict_proba(train_X)
RF_test_predictions_prob = RF_class.predict_proba(test_X)

classificationSummary(train_y, RF_train_predictions)
classificationSummary(test_y, RF_test_predictions)
```

```
Confusion Matrix (Accuracy 0.9336)

       Prediction
Actual   0   1
     0 795   7
     1  64 204

Confusion Matrix (Accuracy 0.9328)

       Prediction
Actual   0   1
     0 199   2
     1  16  51
```

By test accuracy, the grid-searched tree matches the base decision tree (0.9366 in both cases), while the grid-searched forest (0.9328) edges out the base random forest (0.9291).
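As a numeric companion to the ROC comparison above, the area under each test-set ROC curve can be reported as well. This sketch is not part of the original notebook; it reuses the probability arrays computed earlier and sklearn's roc_auc_score.

```python
from sklearn.metrics import roc_auc_score

# AUC on the test set for the base decision tree and random forest,
# using each model's predicted probability of the positive (high-charge) class.
auc_tree = roc_auc_score(test_y, classtree_test_predictions_prob[:, 1])
auc_rf = roc_auc_score(test_y, test_pred_prob_rf[:, 1])
print(f"Decision tree AUC: {auc_tree:.4f}")
print(f"Random forest AUC: {auc_rf:.4f}")
```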