HW1-Decision+Trees+and+Random+Forests-jc12818

February 13, 2024

Please submit an electronic version of your Python Jupyter notebook on NYU Brightspace. Remember that this assignment is to be done individually. Solutions will be posted a few days after the due date (on Feb 20th), so assignments submitted until that day will receive a late penalty, but no late assignments will be accepted after the solutions are posted. Total points for this HW: 10

Please note: Copying and pasting other people's work is absolutely prohibited. Any such cases will be reported to CUSP's education team and severely punished. Discussion is encouraged, and feel free to exchange ideas with your classmates, but please write your own code and do your own work.

0.0.1 Question 1: Accuracy and interpretability (10%)

a) Describe a real-world prediction problem using urban data for which interpretability of your models and results is essential, and for which it might be preferable to use decision trees rather than random forests. Argue why this is the case. (3%)

In my opinion, decision trees are well suited to urban prediction problems that reduce to a small set of explicit conditions, such as flagging areas with high concentrations of hospitals, neighborhoods with the highest crime rates, or places with poor air quality, because the learned rules are intuitive to read. A single tree can be traced from root to leaf, which makes it ideal when the decision-making process has to be explained to non-specialists, and it is also a good choice when a simple model is needed for rapid development and decision making. Thus, decision trees may be preferred for their ability to quickly produce a model whose logic can be explained to people without a technical background, which is exactly what such problems demand.

b) Describe a real-world prediction problem using urban data for which accuracy is paramount and interpretability may be less important, and for which it might be preferable to use random forests rather than decision trees. Argue why this is the case. (3%)

Last semester I worked on a GIS class project to identify "hospital deserts," and in a similar context I believe random forests would be appropriate for predicting the optimal destination hospital for an ambulance, based on efficient travel distances, hourly traffic patterns, nearby hospital specialties, and bed counts. Combining these complex features can help determine the best destination hospital for each patient's location. Additionally, when identifying areas with a high likelihood of emergency incidents, random forests are beneficial because they can assess the importance of each feature, helping to identify the factors that most strongly drive emergencies. Here the accuracy of the prediction matters more than a readable rule set: better predictions improve the efficiency of emergency medical services and allow for effective allocation of the medical workforce.
c) Let's imagine that you want to try to get the best of both worlds (accuracy and interpretability). So you decide to start by learning a random forest classifier. Describe at least one way of getting some interpretability out of the model by post-processing. You could either pick a method from the literature (e.g., Domingos's work on combining multiple models or some method of computing variable importance), or come up with your own approach (doesn't have to be ground-breaking, but feel free to be creative!) (4%)

Random forests enable sophisticated analysis, but they are hard to interpret. To address this, we can post-process the model with LIME (Local Interpretable Model-agnostic Explanations). LIME provides locally faithful explanations for predictions made by complex machine learning models: it generates perturbed samples around a specific data point, obtains the complex model's predictions on those samples, and then fits a simple linear model to them; the weights of that linear model show how much each feature contributed to that particular prediction. This lets us deliver understandable explanations for the complex model one prediction at a time (a minimal usage sketch is included after the Question 2 data setup below).

reference: https://deeplearningofpython.blogspot.com/2023/05/LIME-XAI-example-python.html

0.0.2 Question 2: Build a decision tree for classification, step by step, following the lecture notes. Note that the dataset has been modified, so you may get a different tree than the one shown in the lecture notes. (30%)

[241]: import pandas as pd
       import numpy as np

[242]: import io
       # 20-row MPG dataset; the rows match the dataframe printed below
       thefile = io.StringIO('MPG,cylinders,HP,weight\ngood,4,75,light\nbad,6,90,medium\nbad,4,110,medium\nbad,8,175,weighty\nbad,6,95,medium\nbad,4,94,light\nbad,4,95,light\nbad,8,139,weighty\nbad,8,190,weighty\nbad,8,145,weighty\nbad,6,100,medium\ngood,4,92,medium\nbad,6,100,weighty\nbad,8,170,weighty\ngood,4,89,medium\ngood,4,65,light\nbad,6,85,medium\ngood,4,81,light\nbad,6,95,medium\ngood,4,93,light\n')
       df = pd.read_csv(thefile)
       df

[242]:      MPG  cylinders   HP   weight
       0   good          4   75    light
       1    bad          6   90   medium
       2    bad          4  110   medium
       3    bad          8  175  weighty
       4    bad          6   95   medium
       5    bad          4   94    light
       6    bad          4   95    light
       7    bad          8  139  weighty
       8    bad          8  190  weighty
       9    bad          8  145  weighty
       10   bad          6  100   medium
       11  good          4   92   medium
       12   bad          6  100  weighty
       13   bad          8  170  weighty
       14  good          4   89   medium
       15  good          4   65    light
       16   bad          6   85   medium
       17  good          4   81    light
       18   bad          6   95   medium
       19  good          4   93    light
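As promised in Question 1(c) above, here is a minimal, hypothetical sketch of applying LIME to a fitted random forest on tabular data. The objects rf, X_train_rf, X_test_rf, feature_names, and class_names are placeholders for whatever model and data you would actually explain; they are not defined anywhere in this notebook, and the lime package must be installed separately.

# Hypothetical LIME sketch (not part of the graded solution).
# Assumes: a fitted sklearn RandomForestClassifier `rf`, numeric feature arrays
# X_train_rf / X_test_rf, and lists feature_names / class_names -- all placeholders.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train_rf),
    feature_names=list(feature_names),
    class_names=list(class_names),
    mode='classification')

# LIME perturbs this row, queries rf.predict_proba on the perturbations, and fits a
# local linear model; the returned (feature, weight) pairs explain this one prediction.
exp = explainer.explain_instance(np.asarray(X_test_rf)[0], rf.predict_proba, num_features=5)
print(exp.as_list())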
0.0.3 Please use numpy and pandas to do the computation for parts a) through f). Do not use an existing decision tree implementation like sklearn for this question.

a) Start with the entire dataset and find the most common MPG value. (2%)

[243]: most_common_mpg = df['MPG'].mode()[0]
       most_common_mpg

[243]: 'bad'

[244]: mpg_counts = df['MPG'].value_counts()
       mpg_counts

[244]: MPG
       bad     14
       good     6
       Name: count, dtype: int64

b) Enumerate all the possible binary questions you could ask for each discrete-valued variable. For each such split, compute the numbers of "good" and "bad" MPG vehicles in each of the two child nodes, and compute the information gain using the provided function above. (5%)

[245]: def InformationGain(goodY, badY, goodN, badN):
           def F(X, Y):
               val1 = X * np.log2(1.0 * (X + Y) / X) if X > 0 else 0
               val2 = Y * np.log2(1.0 * (X + Y) / Y) if Y > 0 else 0
               return val1 + val2
           total = goodY + goodN + badY + badN
           return (F(goodY + goodN, badY + badN) - F(goodY, badY) - F(goodN, badN)) / total if total > 0 else 0

       # Function to compute information gain for each binary split of a discrete-valued variable
       def compute_information_gain(dataframe, columns):
           gain_dict = {}
           for column in columns:
               unique_values = dataframe[column].unique()
               for value in unique_values:
                   df_yes = dataframe[dataframe[column] == value]
                   df_no = dataframe[dataframe[column] != value]
                   goodY = len(df_yes[df_yes['MPG'] == 'good'])
                   badY = len(df_yes[df_yes['MPG'] == 'bad'])
                   goodN = len(df_no[df_no['MPG'] == 'good'])
                   badN = len(df_no[df_no['MPG'] == 'bad'])
                   gain = InformationGain(goodY, badY, goodN, badN)
                   gain_dict[f"{column} == {value}"] = gain
           return gain_dict

       # Compute and display the information gains
       gain_dict = compute_information_gain(df, ['cylinders', 'weight'])
       gain_dict

[245]: {'cylinders == 4': 0.4680577739061723,
        'cylinders == 6': 0.1916312040067166,
        'cylinders == 8': 0.15307795338969116,
        'weight == light': 0.1916312040067166,
        'weight == medium': 0.0058021490143458365,
        'weight == weighty': 0.1916312040067166}

c) Enumerate all the possible binary questions you could ask for the real-valued variable HP. For each such split, compute the numbers of "good" and "bad" MPG vehicles in each of the two child nodes, and compute the information gain using the provided function above. (5%)

NOTE: if you'd like, you can just use all midpoints between consecutive values of the sorted HP attribute. You are not required to exclude provably suboptimal questions like we did in the lecture.

[246]: # candidate thresholds: midpoints between consecutive values of the sorted (unique) HP attribute
       sorted_hp = np.sort(df['HP'].unique())
       midpoints = [(sorted_hp[i] + sorted_hp[i + 1]) / 2 for i in range(len(sorted_hp) - 1)]

[247]: def entropy(labels):
           # class-label entropy in bits; entropy() is called below but its original
           # definition is not visible in this excerpt, so a standard one is supplied here
           counts = labels.value_counts()
           probs = counts / counts.sum()
           return -np.sum(probs * np.log2(probs))

       def information_gain(data, attribute, split_value, target_name="MPG"):
           # entropy of the parent node
           total_entropy = entropy(data[target_name])

           # split the data at the threshold
           left_split = data[data[attribute] <= split_value]
           right_split = data[data[attribute] > split_value]

           # entropy of each child node (0 for an empty child)
           left_entropy = entropy(left_split[target_name]) if not left_split.empty else 0
           right_entropy = entropy(right_split[target_name]) if not right_split.empty else 0
           weighted_entropy = (len(left_split) / len(data)) * left_entropy + (len(right_split) / len(data)) * right_entropy

           # information gain of the split
           info_gain = total_entropy - weighted_entropy
           return info_gain

[248]: # information gain for each candidate HP threshold
       for midpoint in midpoints:
           info_gain = information_gain(df, 'HP', midpoint)
print ( f"Information gain for HP <= { midpoint } : { info_gain } " ) Information gain for HP <= 70.0: 0.09139023062144991 Information gain for HP <= 78.0: 0.19350684337293445 Information gain for HP <= 83.0: 0.30984030471640056 Information gain for HP <= 87.0: 0.1620654662387495 Information gain for HP <= 89.5: 0.2759267455941732 Information gain for HP <= 91.0: 0.19163120400671674 Information gain for HP <= 92.5: 0.32489038387335567 Information gain for HP <= 93.5: 0.5567796494470396 Information gain for HP <= 94.5: 0.46805777390617237 Information gain for HP <= 97.5: 0.2812908992306927 Information gain for HP <= 105.0: 0.19163120400671663 Information gain for HP <= 124.5: 0.15307795338969132 Information gain for HP <= 142.0: 0.11774369689072062 Information gain for HP <= 157.5: 0.08512362463476453 Information gain for HP <= 172.5: 0.054824648581652036 Information gain for HP <= 182.5: 0.02653432846734327 d) Based on your results for parts b and c, what is the optimal binary split of the data? Of the two child nodes created by this split, which (if any) would require further partitioning? (4%) [249]: # ￿￿ ￿￿ ￿￿￿ ￿￿ ￿￿ ￿￿ def compute_information_gain (data, attribute, target_name = "MPG" ): # ￿￿￿￿ ￿￿￿￿ ￿￿ ￿￿￿ ￿￿￿ ￿￿￿ ￿￿ unique_values = np . sort(data[attribute] . unique()) split_values = (unique_values[: -1 ] + unique_values[ 1 :]) / 2 # ￿ ￿￿￿ ￿￿ ￿￿ ￿￿ ￿￿ ￿ ￿￿ ￿￿ ￿￿ ￿￿ max_info_gain = - np . inf best_split = None for split_value in split_values: info_gain = information_gain(data, attribute, split_value, target_name) if info_gain > max_info_gain: max_info_gain = info_gain best_split = split_value return best_split, max_info_gain # 'HP'￿ ￿￿ ￿￿￿ ￿￿ ￿￿ best_split_hp, max_info_gain_hp = compute_information_gain(df, 'HP' ) print ( f"Best split for 'HP': { best_split_hp } with information gain of { max_info_gain_hp } " ) Best split for 'HP': 93.5 with information gain of 0.5567796494470396 The optimal binary split occurs at the threshold of 93.5, dividing the data into two nodes. The node with HP values less than or equal to 93.5 requires further splitting for more refined classification. 5
e) Repeat parts a through d until all training data points are perfectly classified by the resulting tree. (6%)

[250]: # Splitting the dataset based on the best 'HP' threshold
       left_split = df[df['HP'] <= 93.5]
       right_split = df[df['HP'] > 93.5]

       def find_best_split_for_node(data):
           """
           Find the best split for a given subset of the dataset by evaluating all
           possible splits for each attribute other than 'HP', and calculating the
           information gain.
           """
           attributes = ['cylinders', 'weight']  # Excluding 'HP' because it's already used for the initial split.
           best_gain = 0
           best_split = None
           best_attribute = None

           for attribute in attributes:
               unique_values = np.unique(data[attribute])
               for value in unique_values:
                   # Calculate the information gain for a binary split on the attribute
                   current_gain = information_gain(data, attribute, value, 'MPG')
                   if current_gain > best_gain:
                       best_gain = current_gain
                       best_split = value
                       best_attribute = attribute

           return best_attribute, best_split, best_gain

       # Finding the best splits for the left and right subsets
       left_attribute, left_value, left_gain = find_best_split_for_node(left_split)
       right_attribute, right_value, right_gain = find_best_split_for_node(right_split)

       # Printing the results
       print(f"Best split for left node: {left_attribute} == {left_value}, Information Gain: {left_gain}")
       print(f"Best split for right node: {right_attribute} == {right_value}, Information Gain: {right_gain}")

Best split for left node: cylinders == 4, Information Gain: 0.8112781244591328
Best split for right node: None == None, Information Gain: 0

[251]: print('Original dataset stat:\n', df['MPG'].value_counts())
       print('\nHP <= 93.5\n', left_split['MPG'].value_counts())
       print('\nHP > 93.5\n', right_split['MPG'].value_counts())
Original dataset stat:
 MPG
bad     14
good     6
Name: count, dtype: int64

HP <= 93.5
 MPG
good    6
bad     2
Name: count, dtype: int64

HP > 93.5
 MPG
bad    12
Name: count, dtype: int64

[252]: # Splitting the dataset based on 'HP' value
       left_split = df[df['HP'] <= 93.5]
       right_split = df[df['HP'] > 93.5]

       # Define the function to find the best split for nodes
       def find_best_split_for_node(data):
           attributes = ['cylinders', 'weight']  # Excluding 'HP'
           best_gain = 0
           best_split = None
           best_attribute = None
           for attribute in attributes:
               unique_values = np.unique(data[attribute])
               for value in unique_values:
                   current_gain = information_gain(data, attribute, value, 'MPG')
                   if current_gain > best_gain:
                       best_gain = current_gain
                       best_split = value
                       best_attribute = attribute
           return best_attribute, best_split, best_gain

       # Finding the best splits for the left and right subsets
       left_attribute, left_value, left_gain = find_best_split_for_node(left_split)
       right_attribute, right_value, right_gain = find_best_split_for_node(right_split)

       # Print the 'MPG' distribution for each set
       print('Original dataset MPG distribution:\n', df['MPG'].value_counts(), '\n')
       print('Left split (HP <= 93.5) MPG distribution:\n', left_split['MPG'].value_counts(), '\n')
       print('Right split (HP > 93.5) MPG distribution:\n', right_split['MPG'].value_counts(), '\n')

       # Print the best split results
       print(f"Best split for left node: {left_attribute} == {left_value}, Information Gain: {left_gain}")
       print(f"Best split for right node: {right_attribute} == {right_value}, Information Gain: {right_gain}")

       # To print the MPG distribution for subsets resulting from further splits, you would
       # first need to apply these splits and then print the distribution similarly.

Original dataset MPG distribution:
 MPG
bad     14
good     6
Name: count, dtype: int64

Left split (HP <= 93.5) MPG distribution:
 MPG
good    6
bad     2
Name: count, dtype: int64

Right split (HP > 93.5) MPG distribution:
 MPG
bad    12
Name: count, dtype: int64

Best split for left node: cylinders == 4, Information Gain: 0.8112781244591328
Best split for right node: None == None, Information Gain: 0

f) Draw or show the final decision tree in a format of your choice. The decision to make at each step and the predicted value at each leaf node must be clear. (4%)

[253]: import matplotlib.pyplot as plt  # needed for drawing; this import is not visible earlier in this excerpt

       def draw_decision_tree():
           # set up a blank canvas for the tree diagram
           fig, ax = plt.subplots()
           ax.set_xlim(0, 10)
           ax.set_ylim(0, 6)

           # helper to draw a labelled node box
           def draw_node(x, y, text):
               ax.text(x, y, text, horizontalalignment='center',
                       verticalalignment='center', fontsize=12,
                       bbox=dict(facecolor='white', edgecolor='black'))
           # helper to draw a labelled edge between two nodes
           def draw_link(x1, y1, x2, y2, text):
               ax.plot([x1, x2], [y1, y2], 'k--')
               mid_x = (x1 + x2) / 2
               mid_y = (y1 + y2) / 2
               ax.text(mid_x, mid_y, text, fontsize=10,
                       bbox=dict(facecolor='white', edgecolor='none'))

           # root node: the HP split found in part d
           draw_node(5, 5, 'HP <= 93.5?')

           # left branch (HP <= 93.5): split on cylinders
           draw_link(5, 5, 3, 4, 'Yes')
           draw_node(3, 4, f'{left_attribute} == {left_value}?')

           # right branch (HP > 93.5): pure node
           draw_link(5, 5, 7, 4, 'No')
           draw_node(7, 4, 'Predict: Bad MPG')

           # children of the cylinders split
           draw_link(3, 4, 2, 3, 'Yes')
           draw_node(2, 3, 'Predict: Good MPG')
           draw_link(3, 4, 4, 3, 'No')
           draw_node(4, 3, 'Predict: Bad MPG')

           plt.axis('off')
           plt.show()

       draw_decision_tree()
In text form, the tree learned in parts d and e (and drawn by the cell above) is:

                 [HP <= 93.5?]
                 /           \
             yes/             \no
               /               \
    [cylinders == 4?]     [Predict: bad]
         /        \
     yes/          \no
       /            \
[Predict: good]  [Predict: bad]

g) Classify each of the following four vehicles as having "good" or "bad" fuel efficiency (miles per gallon). Do this by hand using the tree structure learned in part f. (4%)

MPG,cylinders,HP,weight
good,4,93,light
bad,6,113,medium
good,4,83,weighty
bad,6,70,weighty

Tracing each vehicle through the tree from part f: the first vehicle has HP = 93 <= 93.5 and 4 cylinders, so it is classified "good"; the second has HP = 113 > 93.5, so "bad"; the third has HP = 83 <= 93.5 and 4 cylinders, so "good"; and the fourth has HP = 70 <= 93.5 but 6 cylinders, so "bad".

0.0.4 Question 3, Predicting burden of disease (40%)

[254]: data = pd.read_csv("Burden of diarrheal illness by country.csv")
       data.head(3)

[254]:        Country  FrxnPeaceIn10  ODA4H2OPcptaDol  RenewResm3PcptaYr  \
       0  Afghanistan            0.1             0.16               2986
       1      Albania            1.0             5.58              13306
       2      Algeria            0.0             0.33                473

          SustAccImprWatRur  SustAccImprWatUrb  SustAccImprSanRur  SustAccImprSanUrb  \
       0            0.10891            0.18812           0.049505            0.15842
       1            0.94059            0.98020           0.801980            0.98020
       2            0.79208            0.91089           0.811880            0.98020

          TotHlthExpPctofGDP  GenGovtPctofTotHlthExp  ExtResHlthPctTotExpHlth  \
       0               0.065                   0.395                   0.4560
       1               0.065                   0.417                   0.0340
       2               0.041                   0.808                   0.0005

          PCptaGovtExpHlthAvgExcRt  GDPPCptaIntDol  AdultLtrcyRate  FemaleLtrcyRate  \
       0                         4             430         0.35644          0.20792
       1                        49            6158         0.85644          0.78713
       2                        71            4860         0.69307          0.60396

         BurdenOfDisease
       0           awful
       1             low
       2            high

0.0.5 Data dictionary

NAME: Burden of diarrheal illness by country
SIZE: 130 Countries, 16 Variables

VARIABLE DESCRIPTIONS:
Country: Country name
FrxnPeaceIn10: Fraction of the past ten years in which a country has been at peace
ODA4H2OPcptaDol: Per capita Official Development Assistance for water projects
RenewResm3PcptaYr: Renewable water resources in cubic meters per capita per year
SustAccImprWatRur: Fraction of rural population with sustainable access to improved water
SustAccImprWatUrb: Fraction of urban population with sustainable access to improved water
SustAccImprSanRur: Fraction of rural population with sustainable access to improved sanitation
SustAccImprSanUrb: Fraction of urban population with sustainable access to improved sanitation
TotHlthExpPctofGDP: Fraction of a country's GDP devoted to health spending
GenGovtPctofTotHlthExp: The fraction of total health expenditures for a country which is provided by the government
ExtResHlthPctTotExpHlth: The fraction of total health expenditures for a country which comes from sources external to the country
PCptaGovtExpHlthAvgExcRt: Per capita government health expenditures at the average exchange rate
GDPPCptaIntDol: Gross Domestic Product per capita in international dollars
AdultLtrcyRate: Adult literacy rate
FemaleLtrcyRate: Female literacy rate
BurdenOfDisease: Our target variable for classification. The burden of disease due to diarrheal illness, categorized into "low", "medium", "high", and "awful" quartiles. For each country, we have estimates of the number of Disability-Adjusted Life Years lost per 1000 persons per year (DALYs) due to diarrheal illness. Countries with "low" burden of disease have up to 2.75345 DALYs; countries with "medium" burden of disease have between 2.75345 and 8.2127 DALYs; countries with "high" burden of disease have between 8.2127 and 26.699 DALYs; and countries with "awful" burden of disease have more than 26.699 DALYs.

0.0.6 Your goal is to train a decision tree classifier for the attribute "BurdenOfDisease" using all other variables (except country name) as features with sklearn.tree.DecisionTreeClassifier.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

a) Please choose a train/test split and choose a hyper-parameter governing model simplicity, for example, the maximum tree depth or maximum number of leaf nodes. Then, fit your decision tree classifier (using the training set) for different values of this parameter and for each such value, record the corresponding classification accuracy on the test set. (10%)

[255]: # imports for the sklearn-based decision tree experiments
       from sklearn.model_selection import train_test_split
       from sklearn.tree import DecisionTreeClassifier
       from sklearn.preprocessing import LabelEncoder
       from sklearn.metrics import accuracy_score  # used below; this import is not visible in this excerpt

       X = data.drop(['Country', 'BurdenOfDisease'], axis=1)
       y = data['BurdenOfDisease']

       # encode the string class labels as integers
       le = LabelEncoder()
       y_encoded = le.fit_transform(y)
       # train/test split
       X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(
           X, y_encoded, test_size=0.2, random_state=42)

       # fit a tree for each max_depth value and record the test accuracy
       max_depths = range(1, 11)  # depths 1 through 10
       accuracies = []

       for depth in max_depths:
           dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
           dt_clf.fit(X_train, y_train_encoded)
           y_pred = dt_clf.predict(X_test)
           accuracy = accuracy_score(y_test_encoded, y_pred)
           accuracies.append(accuracy)
           print(f"Max depth: {depth}, Test accuracy: {accuracy}")

       # report the max_depth with the highest test accuracy
       best_accuracy_index = accuracies.index(max(accuracies))
       best_depth = max_depths[best_accuracy_index]
       print(f"\nBest max_depth: {best_depth} with accuracy: {accuracies[best_accuracy_index]}")

Max depth: 1, Test accuracy: 0.5384615384615384
Max depth: 2, Test accuracy: 0.5769230769230769
Max depth: 3, Test accuracy: 0.5769230769230769
Max depth: 4, Test accuracy: 0.5769230769230769
Max depth: 5, Test accuracy: 0.5384615384615384
Max depth: 6, Test accuracy: 0.5
Max depth: 7, Test accuracy: 0.6153846153846154
Max depth: 8, Test accuracy: 0.6538461538461539
Max depth: 9, Test accuracy: 0.6538461538461539
Max depth: 10, Test accuracy: 0.6538461538461539

Best max_depth: 8 with accuracy: 0.6538461538461539

b) Make a plot of accuracy vs. simplicity for different values of the hyper-parameter chosen in part a). That is, the x-axis should be hyper-parameter value (e.g. tree depth) and the y-axis should be accuracy. (10%)

[256]: import matplotlib.pyplot as plt

       # plot test accuracy against tree depth (the simplicity hyper-parameter)
       plt.figure(figsize=(10, 6))
       plt.plot(max_depths, accuracies, marker='o', linestyle='-', color='blue')
       plt.title('Accuracy vs. Tree Depth')
       plt.xlabel('Max Depth')
       plt.ylabel('Accuracy')
       plt.grid(True)
       plt.xticks(max_depths)
       plt.show()

max depth: 8 (highest test accuracy)

c) Tune the hyper-parameter you choose in part a) by cross-validation using the training data. You can choose to use the GridSearchCV package from sklearn or write your own code to do cross-validation by splitting the training data into training and validation data. What is the out of sample accuracy after tuning the hyper-parameter? (10%)

[257]: from sklearn.model_selection import GridSearchCV  # used below; this import is not visible in this excerpt

       # re-create the train/test split
       X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(
           X, y_encoded, test_size=0.2, random_state=42)

       # hyper-parameter grid for max_depth
       thresholds = np.linspace(1, 2, 50)  # candidate values; after int() this grid only contains depths 1 and 2
       param_grid = {'max_depth': [int(x) for x in thresholds]}

       # 6-fold cross-validated grid search over max_depth
       gs = GridSearchCV(DecisionTreeClassifier(random_state=2019), param_grid, cv=6)

       # fit the grid search on the training data
       model = gs.fit(X_train, y_train_encoded)

       # report the best parameters and the out-of-sample (test) accuracy
print ( "best_params: {} \n out of sample accuracy: {} " . format(model . best_params_, model . score(X_test, y_test_encoded))) best_params: {'max_depth': 2} out of sample accuracy: 0.5769230769230769 d) Visualize a simple decision tree (e.g., with max_depth = 2 or 3) learned from the data. To do so, given your decision tree dt, you can use the code below, then copy and paste the resulting output into http://www.webgraphviz.com. Alternatively, if you have graphviz installed on your machine, you can use that. (10%) [258]: from sklearn.tree import export_graphviz import graphviz # ￿￿ ￿￿ ￿￿￿ ￿￿ ￿ ￿￿ dt = DecisionTreeClassifier(max_depth =2 , random_state =2019 ) dt . fit(X_train, y_train_encoded) # export_graphviz ￿￿￿ ￿￿￿￿ ￿￿ ￿￿￿ DOT ￿￿￿￿ ￿￿￿￿, ￿￿￿ ￿￿￿ ￿￿ dot_data = export_graphviz( dt, out_file = None , feature_names = X . columns . values, # ￿￿ ￿￿ class_names = le . classes_, # ￿￿￿ ￿￿ filled = True , rounded = True , special_characters = True , impurity = False ) . replace( "<br/>" , ", " ) . replace( "&le;" , "<=" ) . replace( "=<" , "= \" " ) . replace( ">," , " \" , " ) # ￿￿￿ DOT ￿￿￿ ￿￿ print (dot_data) digraph Tree { node [shape=box, style="filled, rounded", color="black", fontname="helvetica"] ; edge [fontname="helvetica"] ; 0 [label="GDPPCptaIntDol <= 2978.5, samples = 104, value = [24, 29, 26, 25], class = high", fillcolor="#f8fef7"] ; 1 [label="SustAccImprWatUrb <= 0.842, samples = 45, value = [23, 20, 0, 2], class = awful", fillcolor="#fcf0e7"] ; 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ; 2 [label="samples = 22, value = [18, 4, 0, 0], class = awful", fillcolor="#eb9d65"] ; 1 -> 2 ; 3 [label="samples = 23, value = [5, 16, 0, 2], class = high", fillcolor="#8fef86"] ; 1 -> 3 ; 4 [label="SustAccImprSanRur <= 0.866, samples = 59, value = [1, 9, 26, 23], 15
class = low", fillcolor="#eff7fd"] ; 0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="False"] ; 5 [label="samples = 38, value = [1, 9, 7, 21], class = medium", fillcolor="#eeadf4"] ; 4 -> 5 ; 6 [label="samples = 21, value = [0, 0, 19, 2], class = low", fillcolor="#4ea7e8"] ; 4 -> 6 ; } Question 4, Fit a random forest to the data from question 3 (20%) a) Please use the same test/train split from previous question and feel free to tune the hyper-parameters for Random Forest model us- ing training data. The package from sklearn is here: http://scikit- learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. Then please report your out of sample prediction result and compare this model’s performance with 3c). (10%) [259]: # np.linspace￿ ￿￿￿ ￿￿ ￿￿ ￿￿ (1￿￿ 10￿￿ 10￿￿ ￿￿ ￿￿￿ ￿￿) thresholds = np . linspace( 1 , 2 , 50 , dtype = int ) # ￿￿￿￿￿￿￿ ￿￿￿ ￿￿ param_grid = { 'max_depth' : thresholds} # RandomForestClassifier￿ GridSearchCV ￿￿￿ rfc = GridSearchCV( estimator = RandomForestClassifier(random_state =42 ), param_grid = param_grid, cv =6 , scoring = 'accuracy' , n_jobs =-1 ) # GridSearchCV￿ ￿￿￿ ￿￿ ￿￿￿￿ ￿￿ gs = rfc . fit(X_train, y_train_encoded) # y_train_encoded ￿￿ # ￿￿￿ ￿￿￿￿￿￿￿￿ ￿￿￿ ￿￿￿ ￿￿￿ ￿￿ ￿￿￿ ￿￿ print ( f"best_params: { gs . best_params_ } \n out of sample accuracy: { gs . score(X_test, y_test_encoded) } " ) # y_test_encoded ￿￿ best_params: {'max_depth': 2} out of sample accuracy: 0.6538461538461539 b) Write one paragraph comparing the results from those two models (Random Forest vs Decision Tree) in terms of both accuracy and interpretability. (10%) When we compare between the two methods in terms of accuracy and interpretability, it shows that random forest have better result with an accuracy of 0.61 compared to 0.57 for trees, in- dicating superior performance. However, interpretability with the plot shows that decision trees 16
indicates clearer features by showing distinct characteristics that were previously only encountered theoretically. 17
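As one further way to recover some interpretability from the random forest, in the spirit of Question 1c), here is a minimal sketch, assuming the fitted GridSearchCV object gs from part a) and the feature matrix X from Question 3 are still in scope, of ranking the input variables by the forest's impurity-based feature importances:

# Sketch only: `gs` is the fitted GridSearchCV over RandomForestClassifier from 4a),
# X is the Question 3 feature dataframe, and pandas is already imported as pd above.
best_rf = gs.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))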