HW1-Decision+Trees+and+Random+Forests-jc12818
February 13, 2024
Please submit an electronic version of your Python Jupyter notebook on NYU Brightspace. Remember
that this assignment is to be done individually. Solutions will be posted a few days after the due
date (on Feb 20th), so assignments submitted up until that day will receive a late penalty, but no
late assignments will be accepted after the solutions are posted.
Total points for this HW: 10
Please note: Copying and pasting other people’s work is absolutely prohibited. Any such cases will
be reported to CUSP’s education team and severely punished. Discussion is encouraged, and feel
free to exchange ideas with your classmates, but please write your own code and do your own work.
0.0.1 Question 1: Accuracy and interpretability (10%)
a) Describe a real-world prediction problem using urban data for which interpretability of your
models and results is essential, and for which it might be preferable to use decision trees rather
than random forests. Argue why this is the case. (3%)
In my opinion, decision trees are better suited to prediction problems built around explicit conditions,
such as identifying areas with high concentrations of hospitals, regions with the highest crime rates,
or places with poor air quality, because they provide intuitive, rule-based insights. Each prediction
can be traced through a short sequence of yes/no questions, which makes decision trees ideal when the
decision-making process has to be explained to non-specialists, and a single tree is also quick to build
when a simple model is needed for rapid development and decision making. Thus, decision trees may be
preferred for their ability to quickly generate models and facilitate decision making, especially where
explaining the logic behind decisions to those without a background in the field is crucial.
b) Describe a real-world prediction problem using urban data for which accuracy is paramount
and interpretability may be less important, and for which it might be preferable to use random
forests rather than decision trees. Argue why this is the case. (3%)
Last semester, I worked on a GIS class project to identify “hospital deserts,” and in a similar context,
I believe random forests would be appropriate for predicting the optimal destination hospital for
ambulance transports, based on efficient travel distances, hourly traffic patterns, nearby hospital
specialties, and bed counts. Analyzing these complex, interacting features can help determine the best
destination hospital for each patient’s location. Random forests are also beneficial when identifying
areas with a high likelihood of emergency situations, because they can assess the importance of each
feature, helping to identify the factors that most significantly impact emergency demand. Accurate
predictions here matter more than an easily explained model: they can enhance the efficiency of
emergency medical services and allow for the effective allocation of the medical workforce.
c) Let’s imagine that you want to try to get the best of both worlds (accuracy and interpretability).
So you decide to start by learning a random forest classifier. Describe at least one way of getting
some interpretability out of the model by post-processing. You could either pick a method from the
literature (e.g., Domingos’s work on combining multiple models or some method of computing variable
importance), or come up with your own approach (doesn’t have to be ground-breaking, but feel free to
be creative!) (4%)
Random forests enable sophisticated analysis, but they are difficult to interpret. To address this, we
can use LIME (Local Interpretable Model-agnostic Explanations). LIME provides locally faithful
explanations for the predictions of a complex machine learning model: it generates perturbed samples
around a specific data point, obtains the complex model’s predictions on those samples, and then fits a
simple linear model to them. The weights of that local linear model show how much each feature
contributed to the prediction, which lets us deliver understandable explanations for individual
predictions of the complex model.

Reference: https://deeplearningofpython.blogspot.com/2023/05/LIME-XAI-example-python.html
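
For concreteness, here is a minimal sketch of how LIME could be applied post hoc to a fitted random forest (assuming the lime package is installed); the fitted model rf, the training matrix X_train, and the feature/class name lists are placeholders rather than objects defined in this notebook.

import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def explain_one_prediction(rf, X_train, feature_names, class_names, x_row):
    # Build a tabular explainer from the training data distribution.
    explainer = LimeTabularExplainer(
        training_data=np.asarray(X_train),
        feature_names=feature_names,
        class_names=class_names,
        mode="classification",
    )
    # Perturb points around x_row, query the forest, and fit a local linear surrogate.
    exp = explainer.explain_instance(np.asarray(x_row), rf.predict_proba, num_features=5)
    return exp.as_list()  # list of (feature condition, local weight) pairs

Each returned weight indicates how strongly that feature pushed this particular prediction, which is the kind of case-by-case explanation a single decision tree gives for free.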
0.0.2 Question 2: Build a decision tree for classification, step by step, following the lecture notes.
Note that the dataset has been modified, so you may get a different tree than the one shown in the
lecture notes. (30%)
[241]: import pandas as pd
       import numpy as np
[242]: import io
       # Full 20-row MPG toy dataset (the same rows shown in the DataFrame output below)
       thefile = io.StringIO(
           'MPG,cylinders,HP,weight\n'
           'good,4,75,light\nbad,6,90,medium\nbad,4,110,medium\nbad,8,175,weighty\n'
           'bad,6,95,medium\nbad,4,94,light\nbad,4,95,light\nbad,8,139,weighty\n'
           'bad,8,190,weighty\nbad,8,145,weighty\nbad,6,100,medium\ngood,4,92,medium\n'
           'bad,6,100,weighty\nbad,8,170,weighty\ngood,4,89,medium\ngood,4,65,light\n'
           'bad,6,85,medium\ngood,4,81,light\nbad,6,95,medium\ngood,4,93,light')
       df = pd.read_csv(thefile)
       df
[242]:      MPG  cylinders   HP   weight
       0   good          4   75    light
       1    bad          6   90   medium
       2    bad          4  110   medium
       3    bad          8  175  weighty
       4    bad          6   95   medium
       5    bad          4   94    light
       6    bad          4   95    light
       7    bad          8  139  weighty
       8    bad          8  190  weighty
       9    bad          8  145  weighty
       10   bad          6  100   medium
       11  good          4   92   medium
       12   bad          6  100  weighty
       13   bad          8  170  weighty
       14  good          4   89   medium
       15  good          4   65    light
       16   bad          6   85   medium
       17  good          4   81    light
       18   bad          6   95   medium
       19  good          4   93    light
0.0.3 Please use numpy and pandas to do the computation for parts a) through f). Do not use an
existing decision tree implementation like sklearn for this question.
a) Start with the entire dataset and find the most common MPG value. (2%)
[243]: most_common_mpg = df['MPG'].mode()[0]
       most_common_mpg

[243]: 'bad'
[244]: mpg_counts = df['MPG'].value_counts()
       mpg_counts

[244]: MPG
       bad     14
       good     6
       Name: count, dtype: int64
b) Enumerate all the possible binary questions you could ask for each discrete-valued variable.
For each such split, compute the numbers of “good” and “bad” MPG vehicles in each of the
two child nodes, and compute the information gain using the provided function above. (5%)
[245]: def InformationGain(goodY, badY, goodN, badN):
           def F(X, Y):
               val1 = X * np.log2(1.0 * (X + Y) / X) if X > 0 else 0
               val2 = Y * np.log2(1.0 * (X + Y) / Y) if Y > 0 else 0
               return val1 + val2
           total = goodY + goodN + badY + badN
           return (F(goodY + goodN, badY + badN) - F(goodY, badY) - F(goodN, badN)) / total if total > 0 else 0

       # Function to compute information gain for each binary split
       def compute_information_gain(dataframe, columns):
           gain_dict = {}
           for column in columns:
               unique_values = dataframe[column].unique()
               for value in unique_values:
                   df_yes = dataframe[dataframe[column] == value]
                   df_no = dataframe[dataframe[column] != value]
                   goodY = len(df_yes[df_yes['MPG'] == 'good'])
                   badY = len(df_yes[df_yes['MPG'] == 'bad'])
                   goodN = len(df_no[df_no['MPG'] == 'good'])
                   badN = len(df_no[df_no['MPG'] == 'bad'])
                   gain = InformationGain(goodY, badY, goodN, badN)
                   gain_dict[f"{column} == {value}"] = gain
           return gain_dict

       # Compute and display the information gains
       gain_dict = compute_information_gain(df, ['cylinders', 'weight'])
       gain_dict

[245]: {'cylinders == 4': 0.4680577739061723,
        'cylinders == 6': 0.1916312040067166,
        'cylinders == 8': 0.15307795338969116,
        'weight == light': 0.1916312040067166,
        'weight == medium': 0.0058021490143458365,
        'weight == weighty': 0.1916312040067166}
c) Enumerate all the possible binary questions you could ask for the real-valued variable HP.
For each such split, compute the numbers of “good” and “bad” MPG vehicles in each of the
two child nodes, and compute the information gain using the provided function above. (5%)
NOTE: if you’d like, you can just use all midpoints between consecutive values of the sorted HP
attribute. You are not required to exclude provably suboptimal questions like we did in the lecture.
[246]: # Candidate split points for 'HP': midpoints between consecutive sorted unique values
       sorted_hp = np.sort(df['HP'].unique())
       midpoints = [(sorted_hp[i] + sorted_hp[i + 1]) / 2 for i in range(len(sorted_hp) - 1)]
[247]: # Shannon entropy helper -- the original definition is not shown in this preview,
       # so a standard implementation is assumed here.
       def entropy(labels):
           probs = labels.value_counts(normalize=True)
           return -np.sum(probs * np.log2(probs))

       def information_gain(data, attribute, split_value, target_name="MPG"):
           # Entropy of the parent node
           total_entropy = entropy(data[target_name])
           # Split into the two child nodes
           left_split = data[data[attribute] <= split_value]
           right_split = data[data[attribute] > split_value]
           # Weighted entropy of the two children
           left_entropy = entropy(left_split[target_name]) if not left_split.empty else 0
           right_entropy = entropy(right_split[target_name]) if not right_split.empty else 0
           weighted_entropy = ((len(left_split) / len(data)) * left_entropy
                               + (len(right_split) / len(data)) * right_entropy)
           # Information gain = parent entropy minus weighted child entropy
           info_gain = total_entropy - weighted_entropy
           return info_gain
[248]: # Information gain for every candidate 'HP' threshold
       for midpoint in midpoints:
           info_gain = information_gain(df, 'HP', midpoint)
           print(f"Information gain for HP <= {midpoint}: {info_gain}")
Information gain for HP <= 70.0: 0.09139023062144991
Information gain for HP <= 78.0: 0.19350684337293445
Information gain for HP <= 83.0: 0.30984030471640056
Information gain for HP <= 87.0: 0.1620654662387495
Information gain for HP <= 89.5: 0.2759267455941732
Information gain for HP <= 91.0: 0.19163120400671674
Information gain for HP <= 92.5: 0.32489038387335567
Information gain for HP <= 93.5: 0.5567796494470396
Information gain for HP <= 94.5: 0.46805777390617237
Information gain for HP <= 97.5: 0.2812908992306927
Information gain for HP <= 105.0: 0.19163120400671663
Information gain for HP <= 124.5: 0.15307795338969132
Information gain for HP <= 142.0: 0.11774369689072062
Information gain for HP <= 157.5: 0.08512362463476453
Information gain for HP <= 172.5: 0.054824648581652036
Information gain for HP <= 182.5: 0.02653432846734327
d) Based on your results for parts b and c, what is the optimal binary split of the data? Of the
two child nodes created by this split, which (if any) would require further partitioning? (4%)
[249]: # Find the best threshold for a real-valued attribute by maximizing information gain
       def compute_information_gain(data, attribute, target_name="MPG"):
           # Candidate split values: midpoints between consecutive sorted unique values
           unique_values = np.sort(data[attribute].unique())
           split_values = (unique_values[:-1] + unique_values[1:]) / 2
           # Track the best split found so far
           max_info_gain = -np.inf
           best_split = None
           for split_value in split_values:
               info_gain = information_gain(data, attribute, split_value, target_name)
               if info_gain > max_info_gain:
                   max_info_gain = info_gain
                   best_split = split_value
           return best_split, max_info_gain

       # Best split for 'HP'
       best_split_hp, max_info_gain_hp = compute_information_gain(df, 'HP')
       print(f"Best split for 'HP': {best_split_hp} with information gain of {max_info_gain_hp}")
Best split for 'HP': 93.5 with information gain of 0.5567796494470396
The optimal binary split is on HP at the threshold of 93.5, which has the highest information gain
(about 0.557) among all candidate questions from parts b and c. Of the two child nodes, the node with
HP > 93.5 contains only “bad” vehicles and is already pure, while the node with HP <= 93.5 still mixes
“good” and “bad” vehicles and therefore requires further partitioning.
e) Repeat parts a through d until all training data points are perfectly classified by the resulting
tree. (6%)
[250]: # Splitting the dataset based on the 'HP' value chosen in part d
       left_split = df[df['HP'] <= 93.5]
       right_split = df[df['HP'] > 93.5]

       def find_best_split_for_node(data):
           """
           Find the best split for a given subset of the dataset by evaluating all possible splits
           for each attribute other than 'HP', and calculating the information gain.
           """
           attributes = ['cylinders', 'weight']  # Excluding 'HP' because it's already used for the initial split.
           best_gain = 0
           best_split = None
           best_attribute = None
           for attribute in attributes:
               unique_values = np.unique(data[attribute])
               for value in unique_values:
                   # Calculate the information gain for a binary split on the attribute
                   current_gain = information_gain(data, attribute, value, 'MPG')
                   if current_gain > best_gain:
                       best_gain = current_gain
                       best_split = value
                       best_attribute = attribute
           return best_attribute, best_split, best_gain

       # Finding the best splits for the left and right subsets
       left_attribute, left_value, left_gain = find_best_split_for_node(left_split)
       right_attribute, right_value, right_gain = find_best_split_for_node(right_split)

       # Printing the results
       print(f"Best split for left node: {left_attribute} == {left_value}, Information Gain: {left_gain}")
       print(f"Best split for right node: {right_attribute} == {right_value}, Information Gain: {right_gain}")
Best split for left node: cylinders == 4, Information Gain: 0.8112781244591328
Best split for right node: None == None, Information Gain: 0
[251]:
print
(
'Original dataset stat:
\n
'
, df[
'MPG'
]
.
value_counts())
print
(
'
\n
HP <= 93.5
\n
'
, left_split[
'MPG'
]
.
value_counts())
print
(
'
\n
HP > 93.5
\n
'
, right_split[
'MPG'
]
.
value_counts())
Original dataset stat:
 MPG
bad     14
good     6
Name: count, dtype: int64

HP <= 93.5
 MPG
good    6
bad     2
Name: count, dtype: int64

HP > 93.5
 MPG
bad    12
Name: count, dtype: int64
[252]: # Splitting the dataset based on 'HP' value
       left_split = df[df['HP'] <= 93.5]
       right_split = df[df['HP'] > 93.5]

       # Define the function to find the best split for nodes
       def find_best_split_for_node(data):
           attributes = ['cylinders', 'weight']  # Excluding 'HP'
           best_gain = 0
           best_split = None
           best_attribute = None
           for attribute in attributes:
               unique_values = np.unique(data[attribute])
               for value in unique_values:
                   current_gain = information_gain(data, attribute, value, 'MPG')
                   if current_gain > best_gain:
                       best_gain = current_gain
                       best_split = value
                       best_attribute = attribute
           return best_attribute, best_split, best_gain

       # Finding the best splits for the left and right subsets
       left_attribute, left_value, left_gain = find_best_split_for_node(left_split)
       right_attribute, right_value, right_gain = find_best_split_for_node(right_split)

       # Print the 'MPG' distribution for each set
       print('Original dataset MPG distribution:\n', df['MPG'].value_counts(), '\n')
       print('Left split (HP <= 93.5) MPG distribution:\n', left_split['MPG'].value_counts(), '\n')
       print('Right split (HP > 93.5) MPG distribution:\n', right_split['MPG'].value_counts(), '\n')

       # Print the best split results
       print(f"Best split for left node: {left_attribute} == {left_value}, Information Gain: {left_gain}")
       print(f"Best split for right node: {right_attribute} == {right_value}, Information Gain: {right_gain}")

       # To print the MPG distribution for subsets resulting from further splits, you would
       # first need to apply these splits and then print the distribution similarly.

Original dataset MPG distribution:
 MPG
bad     14
good     6
Name: count, dtype: int64

Left split (HP <= 93.5) MPG distribution:
 MPG
good    6
bad     2
Name: count, dtype: int64

Right split (HP > 93.5) MPG distribution:
 MPG
bad    12
Name: count, dtype: int64

Best split for left node: cylinders == 4, Information Gain: 0.8112781244591328
Best split for right node: None == None, Information Gain: 0
f) Draw or show the final decision tree in a format of your choice. The decision to make at each
step and the predicted value at each leaf node must be clear. (4%)
[253]: import matplotlib.pyplot as plt  # needed for the drawing below

       def draw_decision_tree():
           # Set up a blank canvas
           fig, ax = plt.subplots()
           ax.set_xlim(0, 10)
           ax.set_ylim(0, 6)

           # Draw a boxed text node
           def draw_node(x, y, text):
               ax.text(x, y, text, horizontalalignment='center', verticalalignment='center',
                       fontsize=12, bbox=dict(facecolor='white', edgecolor='black'))

           # Draw a dashed link between two nodes with a label at its midpoint
           def draw_link(x1, y1, x2, y2, text):
               ax.plot([x1, x2], [y1, y2], 'k--')
               mid_x = (x1 + x2) / 2
               mid_y = (y1 + y2) / 2
               ax.text(mid_x, mid_y, text, fontsize=10, bbox=dict(facecolor='white', edgecolor='none'))

           # Root node: the HP split from part d
           draw_node(5, 5, 'HP <= 93.5?')

           # Left branch (Yes): split on cylinders
           draw_link(5, 5, 3, 4, 'Yes')
           draw_node(3, 4, f'{left_attribute} == {left_value}?')

           # Right branch (No): pure "bad" leaf
           draw_link(5, 5, 7, 4, 'No')
           draw_node(7, 4, 'Predict: Bad MPG')

           # Children of the cylinders split
           draw_link(3, 4, 2, 3, 'Yes')
           draw_node(2, 3, 'Predict: Good MPG')
           draw_link(3, 4, 4, 3, 'No')
           draw_node(4, 3, 'Predict: Bad MPG')

           plt.axis('off')
           plt.show()

       draw_decision_tree()
Is HP <= 93?
├─ Yes (o): Is HP <= 81?
│  ├─ Yes (o): Predict: good
│  └─ No (x):  Is HP <= 90?
│     ├─ Yes (o): Predict: bad
│     └─ No (x):  Predict: good
└─ No (x): Predict: bad
g) Classify each of the following four vehicles as having “good” or “bad” fuel efficiency (miles
per gallon). Do this by hand using the tree structure learned in part f. (4%)
MPG,cylinders,HP,weight
good,4,93,light
bad,6,113,medium
good,4,83,weighty
bad,6,70,weighty
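
As a quick sanity check of this hand classification, the tree from part f (split on HP <= 93.5, then on cylinders == 4) can be transcribed into a few lines of code; the rule below is just that tree, and its predictions match the labels listed above.

import pandas as pd

# The four test vehicles from part g (without their MPG labels).
test = pd.DataFrame({
    "cylinders": [4, 6, 4, 6],
    "HP":        [93, 113, 83, 70],
    "weight":    ["light", "medium", "weighty", "weighty"],
})

def predict_mpg(row):
    # Transcription of the part-f tree: HP <= 93.5 ? (cylinders == 4 ? good : bad) : bad
    if row["HP"] <= 93.5:
        return "good" if row["cylinders"] == 4 else "bad"
    return "bad"

print(test.assign(predicted_MPG=test.apply(predict_mpg, axis=1)))
# Expected predictions: good, bad, good, bad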
0.0.4 Question 3: Predicting burden of disease (40%)
[254]: data = pd.read_csv("Burden of diarrheal illness by country.csv")
       data.head(3)

[254]:        Country  FrxnPeaceIn10  ODA4H2OPcptaDol  RenewResm3PcptaYr  \
       0  Afghanistan            0.1             0.16               2986
       1      Albania            1.0             5.58              13306
       2      Algeria            0.0             0.33                473

          SustAccImprWatRur  SustAccImprWatUrb  SustAccImprSanRur  SustAccImprSanUrb  \
       0            0.10891            0.18812           0.049505            0.15842
       1            0.94059            0.98020           0.801980            0.98020
       2            0.79208            0.91089           0.811880            0.98020

          TotHlthExpPctofGDP  GenGovtPctofTotHlthExp  ExtResHlthPctTotExpHlth  \
       0               0.065                   0.395                   0.4560
       1               0.065                   0.417                   0.0340
       2               0.041                   0.808                   0.0005

          PCptaGovtExpHlthAvgExcRt  GDPPCptaIntDol  AdultLtrcyRate  FemaleLtrcyRate  \
       0                         4             430         0.35644          0.20792
       1                        49            6158         0.85644          0.78713
       2                        71            4860         0.69307          0.60396

         BurdenOfDisease
       0           awful
       1             low
       2            high
0.0.5 Data dictionary
NAME: Burden of diarrheal illness by country
SIZE: 130 Countries, 16 Variables
VARIABLE DESCRIPTIONS:
Country: Country name
FrxnPeaceIn10: Fraction of the past ten years in which a country has been at peace
ODA4H2OPcptaDol: Per Capita Official Developmental Assistance for water projects
RenewResm3PcptaYr: Renewable Water Resources in cubic meters per capita per year
SustAccImprWatRur: Fraction of rural population with sustainable access to improved water
SustAccImprWatUrb: Fraction of urban population with sustainable access to improved water
SustAccImprSanRur: Fraction of rural population with sustainable access to improved sanitation
SustAccImprSanUrb: Fraction of urban population with sustainable access to improved sanitation
TotHlthExpPctofGDP: Fraction of a country’s GDP devoted to health spending
GenGovtPctofTotHlthExp: The fraction of total health expenditures for a country which is provided by
the government
ExtResHlthPctTotExpHlth: The fraction of total health expenditures for a country which comes from
sources external to the country
PCptaGovtExpHlthAvgExcRt: Per Capita Government Health Expenditures at the average exchange rate
GDPPCptaIntDol: Gross Domestic Product per capita in international dollars
AdultLtrcyRate: Adult Literacy rate
FemaleLtrcyRate: Female Literacy rate
BurdenOfDisease: Our target variable for classification. The burden of disease due to diarrheal
illness, categorized into “low”, “medium”, “high”, and “awful” quartiles. For each country, we have
estimates of the number of Disability-Adjusted Life Years lost per 1000 persons per year (DALYs)
due to diarrheal illness. Countries with “low” burden of disease have up to 2.75345 DALYs; countries
with “medium” burden of disease have between 2.75345 and 8.2127 DALYs; countries with “high”
burden of disease have between 8.2127 and 26.699 DALYs; and countries with “awful” burden of
disease have more than 26.699 DALYs.
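
For reference, the DALY cutoffs quoted above could be applied directly with pandas to recover the four burden categories from raw DALY values; this is only an illustration, since the provided CSV already contains the categorical BurdenOfDisease label and the column of raw DALY values (called DALYs below) is hypothetical.

import numpy as np
import pandas as pd

def burden_category(dalys: pd.Series) -> pd.Series:
    # Thresholds from the data dictionary: 2.75345, 8.2127, and 26.699 DALYs.
    bins = [-np.inf, 2.75345, 8.2127, 26.699, np.inf]
    labels = ["low", "medium", "high", "awful"]
    return pd.cut(dalys, bins=bins, labels=labels)

# Example: burden_category(pd.Series([1.0, 5.0, 20.0, 50.0])) -> low, medium, high, awful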
0.0.6 Your goal is to train a decision tree classifier for the attribute “BurdenOfDisease” using all
other variables (except country name) as features with sklearn.tree.DecisionTreeClassifier.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
a) Please choose a train/test split and choose a hyper-parameter governing model simplicity, for
example, the maximum tree depth or maximum number of leaf nodes. Then, fit your decision
tree classifier (using the training set) for different values of this parameter and for each such
value, record the corresponding classification accuracy on the test set. (10%)
[255]: # sklearn utilities for splitting, modeling, encoding, and scoring
       from sklearn.model_selection import train_test_split
       from sklearn.tree import DecisionTreeClassifier
       from sklearn.preprocessing import LabelEncoder
       from sklearn.metrics import accuracy_score

       X = data.drop(['Country', 'BurdenOfDisease'], axis=1)
       y = data['BurdenOfDisease']

       # Encode the four burden categories as integers
       le = LabelEncoder()
       y_encoded = le.fit_transform(y)

       # Train/test split (80/20)
       X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(
           X, y_encoded, test_size=0.2, random_state=42)

       # Hyper-parameter governing model simplicity: max_depth from 1 to 10
       max_depths = range(1, 11)
       accuracies = []

       for depth in max_depths:
           dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
           dt_clf.fit(X_train, y_train_encoded)
           y_pred = dt_clf.predict(X_test)
           accuracy = accuracy_score(y_test_encoded, y_pred)
           accuracies.append(accuracy)
           print(f"Max depth: {depth}, Test accuracy: {accuracy}")

       # Report the max_depth with the highest test accuracy
       best_accuracy_index = accuracies.index(max(accuracies))
       best_depth = max_depths[best_accuracy_index]
       print(f"\nBest max_depth: {best_depth} with accuracy: {accuracies[best_accuracy_index]}")
Max depth: 1, Test accuracy: 0.5384615384615384
Max depth: 2, Test accuracy: 0.5769230769230769
Max depth: 3, Test accuracy: 0.5769230769230769
Max depth: 4, Test accuracy: 0.5769230769230769
Max depth: 5, Test accuracy: 0.5384615384615384
Max depth: 6, Test accuracy: 0.5
Max depth: 7, Test accuracy: 0.6153846153846154
Max depth: 8, Test accuracy: 0.6538461538461539
Max depth: 9, Test accuracy: 0.6538461538461539
Max depth: 10, Test accuracy: 0.6538461538461539
Best max_depth: 8 with accuracy: 0.6538461538461539
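
For comparison, the other simplicity hyper-parameter suggested in part a), the maximum number of leaf nodes, could be swept in exactly the same way; a minimal sketch reusing the train/test objects defined above (its results are not shown here).

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Sweep max_leaf_nodes instead of max_depth (max_leaf_nodes must be at least 2).
for n_leaves in range(2, 11):
    clf = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=42)
    clf.fit(X_train, y_train_encoded)
    acc = accuracy_score(y_test_encoded, clf.predict(X_test))
    print(f"Max leaf nodes: {n_leaves}, Test accuracy: {acc}")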
b) Make a plot of accuracy vs. simplicity for different values of the hyper-parameter chosen in
part a). That is, the x-axis should be hyper-parameter value (e.g. tree depth) and the y-axis
should be accuracy. (10%)
[256]: import matplotlib.pyplot as plt

       # Plot accuracy vs. simplicity (maximum tree depth)
       plt.figure(figsize=(10, 6))
       plt.plot(max_depths, accuracies, marker='o', linestyle='-', color='blue')
       plt.title('Accuracy vs. Tree Depth')
       plt.xlabel('Max Depth')
       plt.ylabel('Accuracy')
       plt.grid(True)
       plt.xticks(max_depths)
       plt.show()
The plot shows that test accuracy is highest at max_depth = 8 (and stays flat through depth 10).
c) Tune the hyper-parameter you chose in part a) by cross-validation using the training data.
You can choose to use the GridSearchCV package from sklearn or write your own code to do
cross-validation by splitting the training data into training and validation data. What is the
out of sample accuracy after tuning the hyper-parameter? (10%)
[257]:
#
X_train, X_test, y_train_encoded, y_test_encoded
=
train_test_split(X,
␣
↪
y_encoded, test_size
=0.2
, random_state
=42
)
#
thresholds
=
np
.
linspace(
1
,
2
,
50
)
# ; max_depth
param_grid
=
{
'max_depth'
: [
int
(x)
for
x
in
thresholds]}
# max_depth
# GridSearchCV
gs
=
GridSearchCV(DecisionTreeClassifier(random_state
=2019
), param_grid, cv
=6
)
#
model
=
gs
.
fit(X_train, y_train_encoded)
#
14
print
(
"best_params:
{}
\n
out of sample accuracy:
{}
"
.
format(model
.
best_params_,
␣
↪
model
.
score(X_test, y_test_encoded)))
best_params: {'max_depth': 2}
out of sample accuracy: 0.5769230769230769
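
The question also allows hand-rolled cross-validation instead of GridSearchCV; a minimal sketch of that alternative, reusing X_train and y_train_encoded from part a), might look like the following (the selected depth would then be refit on the full training set and scored on the test set).

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def tune_max_depth(X_train, y_train, depths=range(1, 11), n_splits=6, seed=42):
    # Average validation accuracy over K folds for each candidate depth.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    X_arr, y_arr = np.asarray(X_train), np.asarray(y_train)
    mean_scores = {}
    for depth in depths:
        fold_scores = []
        for train_idx, val_idx in kf.split(X_arr):
            clf = DecisionTreeClassifier(max_depth=depth, random_state=seed)
            clf.fit(X_arr[train_idx], y_arr[train_idx])
            fold_scores.append(accuracy_score(y_arr[val_idx], clf.predict(X_arr[val_idx])))
        mean_scores[depth] = np.mean(fold_scores)
    best_depth = max(mean_scores, key=mean_scores.get)
    return best_depth, mean_scores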
d) Visualize a simple decision tree (e.g., with max_depth = 2 or 3) learned from the data. To do
so, given your decision tree dt, you can use the code below, then copy and paste the resulting
output into http://www.webgraphviz.com.
Alternatively, if you have graphviz installed on
your machine, you can use that. (10%)
[258]:
from
sklearn.tree
import
export_graphviz
import
graphviz
#
dt
=
DecisionTreeClassifier(max_depth
=2
, random_state
=2019
)
dt
.
fit(X_train, y_train_encoded)
# export_graphviz DOT ,
dot_data
=
export_graphviz(
dt,
out_file
=
None
,
feature_names
=
X
.
columns
.
values,
#
class_names
=
le
.
classes_,
#
filled
=
True
,
rounded
=
True
,
special_characters
=
True
,
impurity
=
False
)
.
replace(
"<br/>"
,
", "
)
.
replace(
"≤"
,
"<="
)
.
replace(
"=<"
,
"=
\"
"
)
.
↪
replace(
">,"
,
"
\"
, "
)
# DOT
print
(dot_data)
digraph Tree {
node [shape=box, style="filled, rounded", color="black", fontname="helvetica"] ;
edge [fontname="helvetica"] ;
0 [label="GDPPCptaIntDol <= 2978.5, samples = 104, value = [24, 29, 26, 25], class = high", fillcolor="#f8fef7"] ;
1 [label="SustAccImprWatUrb <= 0.842, samples = 45, value = [23, 20, 0, 2], class = awful", fillcolor="#fcf0e7"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="samples = 22, value = [18, 4, 0, 0], class = awful", fillcolor="#eb9d65"] ;
1 -> 2 ;
3 [label="samples = 23, value = [5, 16, 0, 2], class = high", fillcolor="#8fef86"] ;
1 -> 3 ;
4 [label="SustAccImprSanRur <= 0.866, samples = 59, value = [1, 9, 26, 23], class = low", fillcolor="#eff7fd"] ;
0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
5 [label="samples = 38, value = [1, 9, 7, 21], class = medium", fillcolor="#eeadf4"] ;
4 -> 5 ;
6 [label="samples = 21, value = [0, 0, 19, 2], class = low", fillcolor="#4ea7e8"] ;
4 -> 6 ;
}
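
Since the cell above already imports graphviz, the same DOT string can also be rendered directly in the notebook rather than pasted into webgraphviz; a minimal sketch (this assumes the Graphviz system binaries are installed).

import graphviz

# In a notebook, the returned Source object displays the rendered tree diagram.
graphviz.Source(dot_data)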
Question 4: Fit a random forest to the data from question 3 (20%)
a) Please use the same test/train split from the previous question and feel free to tune the
hyper-parameters for the Random Forest model using training data. The package from sklearn is here:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
Then please report your out of sample prediction result and compare this model’s performance
with 3c). (10%)
[259]:
# np.linspace (1 10 10 )
thresholds
=
np
.
linspace(
1
,
2
,
50
, dtype
=
int
)
#
param_grid
=
{
'max_depth'
: thresholds}
# RandomForestClassifier GridSearchCV
rfc
=
GridSearchCV(
estimator
=
RandomForestClassifier(random_state
=42
),
param_grid
=
param_grid,
cv
=6
,
scoring
=
'accuracy'
,
n_jobs
=-1
)
# GridSearchCV
gs
=
rfc
.
fit(X_train, y_train_encoded)
# y_train_encoded
#
print
(
f"best_params:
{
gs
.
best_params_
}
\n
out of sample accuracy:
{
gs
.
↪
score(X_test, y_test_encoded)
}
"
)
# y_test_encoded
best_params: {'max_depth': 2}
out of sample accuracy: 0.6538461538461539
b) Write one paragraph comparing the results from those two models (Random Forest vs Decision
Tree) in terms of both accuracy and interpretability. (10%)
When we compare between the two methods in terms of accuracy and interpretability, it shows
that random forest have better result with an accuracy of 0.61 compared to 0.57 for trees, in-
dicating superior performance.
However, interpretability with the plot shows that decision trees
16
indicates clearer features by showing distinct characteristics that were previously only encountered
theoretically.
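
One simple way to recover some interpretability from the tuned random forest in 4a), in the spirit of question 1c), is to inspect its impurity-based feature importances; a minimal sketch reusing the fitted GridSearchCV object gs and the feature matrix X from above (the resulting ranking is not shown here).

import pandas as pd

# The refit random forest with the selected max_depth.
best_rf = gs.best_estimator_

# One importance value per predictor, sorted from most to least influential.
importances = pd.Series(best_rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))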