HW1-Decision+Trees+and+Random+Forests-jc12818

February 13, 2024

Please submit an electronic version of your Python Jupyter notebook on NYU Brightspace. Remember that this assignment is to be done individually. Solutions will be posted a few days after the due date (on Feb 20th), so assignments submitted until that day will receive a late penalty, but no late assignments will be accepted after the solutions are posted. Total points for this HW: 10

Please note: Copying and pasting other people's work is absolutely prohibited. Any such cases will be reported to CUSP's education team and severely punished. Discussion is encouraged, and feel free to exchange ideas with your classmates, but please write your own code and do your own work.

0.0.1 Question 1: Accuracy and interpretability (10%)

a) Describe a real-world prediction problem using urban data for which interpretability of your models and results is essential, and for which it might be preferable to use decision trees rather than random forests. Argue why this is the case. (3%)

In my opinion, decision trees are well suited to urban prediction problems that reduce to a small set of explicit conditions, such as flagging areas with high concentrations of hospitals, neighborhoods with the highest crime rates, or places with poor air quality, because the learned rules are intuitive to read. A single tree can be traced from root to leaf, which makes it ideal when the decision-making process has to be explained to non-specialists, and it is also a good choice when a simple model is needed for rapid development and decision making. Thus, decision trees may be preferred for their ability to quickly produce a model whose logic can be explained to people without a technical background, which is exactly what such problems demand.

b) Describe a real-world prediction problem using urban data for which accuracy is paramount and interpretability may be less important, and for which it might be preferable to use random forests rather than decision trees. Argue why this is the case. (3%)

Last semester I worked on a GIS class project to identify "hospital deserts," and in a similar context I believe random forests would be appropriate for predicting the optimal destination hospital for an ambulance, based on efficient travel distances, hourly traffic patterns, nearby hospital specialties, and bed counts. Combining these complex features can help determine the best destination hospital for each patient's location. Additionally, when identifying areas with a high likelihood of emergency incidents, random forests are beneficial because they can assess the importance of each feature, helping to identify the factors that most strongly drive emergencies. Here the accuracy of the prediction matters more than a readable rule set: better predictions improve the efficiency of emergency medical services and allow for effective allocation of the medical workforce.
c) Let's imagine that you want to try to get the best of both worlds (accuracy and interpretability). So you decide to start by learning a random forest classifier. Describe at least one way of getting some interpretability out of the model by post-processing. You could either pick a method from the literature (e.g., Domingos's work on combining multiple models or some method of computing variable importance), or come up with your own approach (doesn't have to be ground-breaking, but feel free to be creative!) (4%)

Random forests enable sophisticated analysis, but they are hard to interpret. To address this, we can post-process the model with LIME (Local Interpretable Model-agnostic Explanations). LIME provides locally faithful explanations for predictions made by complex machine learning models: it generates perturbed samples around a specific data point, obtains the complex model's predictions on those samples, and then fits a simple linear model to them; the weights of that linear model show how much each feature contributed to that particular prediction. This lets us deliver understandable explanations for the complex model one prediction at a time (a minimal usage sketch is included after the Question 2 data setup below).

reference: https://deeplearningofpython.blogspot.com/2023/05/LIME-XAI-example-python.html

0.0.2 Question 2: Build a decision tree for classification, step by step, following the lecture notes. Note that the dataset has been modified, so you may get a different tree than the one shown in the lecture notes. (30%)

[241]: import pandas as pd
       import numpy as np

[242]: import io
       # 20-row MPG dataset; the rows match the dataframe printed below
       thefile = io.StringIO('MPG,cylinders,HP,weight\ngood,4,75,light\nbad,6,90,medium\nbad,4,110,medium\nbad,8,175,weighty\nbad,6,95,medium\nbad,4,94,light\nbad,4,95,light\nbad,8,139,weighty\nbad,8,190,weighty\nbad,8,145,weighty\nbad,6,100,medium\ngood,4,92,medium\nbad,6,100,weighty\nbad,8,170,weighty\ngood,4,89,medium\ngood,4,65,light\nbad,6,85,medium\ngood,4,81,light\nbad,6,95,medium\ngood,4,93,light\n')
       df = pd.read_csv(thefile)
       df

[242]:      MPG  cylinders   HP   weight
       0   good          4   75    light
       1    bad          6   90   medium
       2    bad          4  110   medium
       3    bad          8  175  weighty
       4    bad          6   95   medium
       5    bad          4   94    light
       6    bad          4   95    light
       7    bad          8  139  weighty
       8    bad          8  190  weighty
       9    bad          8  145  weighty
       10   bad          6  100   medium
       11  good          4   92   medium
       12   bad          6  100  weighty
       13   bad          8  170  weighty
       14  good          4   89   medium
       15  good          4   65    light
       16   bad          6   85   medium
       17  good          4   81    light
       18   bad          6   95   medium
       19  good          4   93    light
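As promised in Question 1(c) above, here is a minimal, hypothetical sketch of applying LIME to a fitted random forest on tabular data. The objects rf, X_train_rf, X_test_rf, feature_names, and class_names are placeholders for whatever model and data you would actually explain; they are not defined anywhere in this notebook, and the lime package must be installed separately.

# Hypothetical LIME sketch (not part of the graded solution).
# Assumes: a fitted sklearn RandomForestClassifier `rf`, numeric feature arrays
# X_train_rf / X_test_rf, and lists feature_names / class_names -- all placeholders.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train_rf),
    feature_names=list(feature_names),
    class_names=list(class_names),
    mode='classification')

# LIME perturbs this row, queries rf.predict_proba on the perturbations, and fits a
# local linear model; the returned (feature, weight) pairs explain this one prediction.
exp = explainer.explain_instance(np.asarray(X_test_rf)[0], rf.predict_proba, num_features=5)
print(exp.as_list())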
0.0.3 Please use numpy and pandas to do the computation for parts a) through f). Do not use an existing decision tree implementation like sklearn for this question.

a) Start with the entire dataset and find the most common MPG value. (2%)

[243]: most_common_mpg = df['MPG'].mode()[0]
       most_common_mpg

[243]: 'bad'

[244]: mpg_counts = df['MPG'].value_counts()
       mpg_counts

[244]: MPG
       bad     14
       good     6
       Name: count, dtype: int64

b) Enumerate all the possible binary questions you could ask for each discrete-valued variable. For each such split, compute the numbers of "good" and "bad" MPG vehicles in each of the two child nodes, and compute the information gain using the provided function above. (5%)

[245]: def InformationGain(goodY, badY, goodN, badN):
           def F(X, Y):
               val1 = X * np.log2(1.0 * (X + Y) / X) if X > 0 else 0
               val2 = Y * np.log2(1.0 * (X + Y) / Y) if Y > 0 else 0
               return val1 + val2
           total = goodY + goodN + badY + badN
           return (F(goodY + goodN, badY + badN) - F(goodY, badY) - F(goodN, badN)) / total if total > 0 else 0

       # Function to compute information gain for each binary split of a discrete-valued variable
       def compute_information_gain(dataframe, columns):
           gain_dict = {}
           for column in columns:
               unique_values = dataframe[column].unique()
               for value in unique_values:
                   df_yes = dataframe[dataframe[column] == value]
                   df_no = dataframe[dataframe[column] != value]
                   goodY = len(df_yes[df_yes['MPG'] == 'good'])
                   badY = len(df_yes[df_yes['MPG'] == 'bad'])
                   goodN = len(df_no[df_no['MPG'] == 'good'])
                   badN = len(df_no[df_no['MPG'] == 'bad'])
                   gain = InformationGain(goodY, badY, goodN, badN)
                   gain_dict[f"{column} == {value}"] = gain
           return gain_dict

       # Compute and display the information gains
       gain_dict = compute_information_gain(df, ['cylinders', 'weight'])
       gain_dict

[245]: {'cylinders == 4': 0.4680577739061723,
        'cylinders == 6': 0.1916312040067166,
        'cylinders == 8': 0.15307795338969116,
        'weight == light': 0.1916312040067166,
        'weight == medium': 0.0058021490143458365,
        'weight == weighty': 0.1916312040067166}

c) Enumerate all the possible binary questions you could ask for the real-valued variable HP. For each such split, compute the numbers of "good" and "bad" MPG vehicles in each of the two child nodes, and compute the information gain using the provided function above. (5%)

NOTE: if you'd like, you can just use all midpoints between consecutive values of the sorted HP attribute. You are not required to exclude provably suboptimal questions like we did in the lecture.

[246]: # candidate thresholds: midpoints between consecutive values of the sorted (unique) HP attribute
       sorted_hp = np.sort(df['HP'].unique())
       midpoints = [(sorted_hp[i] + sorted_hp[i + 1]) / 2 for i in range(len(sorted_hp) - 1)]

[247]: def entropy(labels):
           # class-label entropy in bits; entropy() is called below but its original
           # definition is not visible in this excerpt, so a standard one is supplied here
           counts = labels.value_counts()
           probs = counts / counts.sum()
           return -np.sum(probs * np.log2(probs))

       def information_gain(data, attribute, split_value, target_name="MPG"):
           # entropy of the parent node
           total_entropy = entropy(data[target_name])

           # split the data at the threshold
           left_split = data[data[attribute] <= split_value]
           right_split = data[data[attribute] > split_value]

           # entropy of each child node (0 for an empty child)
           left_entropy = entropy(left_split[target_name]) if not left_split.empty else 0
           right_entropy = entropy(right_split[target_name]) if not right_split.empty else 0
           weighted_entropy = (len(left_split) / len(data)) * left_entropy + (len(right_split) / len(data)) * right_entropy

           # information gain of the split
           info_gain = total_entropy - weighted_entropy
           return info_gain

[248]: # information gain for each candidate HP threshold
       for midpoint in midpoints:
           info_gain = information_gain(df, 'HP', midpoint)
print ( f"Information gain for HP <= { midpoint } : { info_gain } " ) Information gain for HP <= 70.0: 0.09139023062144991 Information gain for HP <= 78.0: 0.19350684337293445 Information gain for HP <= 83.0: 0.30984030471640056 Information gain for HP <= 87.0: 0.1620654662387495 Information gain for HP <= 89.5: 0.2759267455941732 Information gain for HP <= 91.0: 0.19163120400671674 Information gain for HP <= 92.5: 0.32489038387335567 Information gain for HP <= 93.5: 0.5567796494470396 Information gain for HP <= 94.5: 0.46805777390617237 Information gain for HP <= 97.5: 0.2812908992306927 Information gain for HP <= 105.0: 0.19163120400671663 Information gain for HP <= 124.5: 0.15307795338969132 Information gain for HP <= 142.0: 0.11774369689072062 Information gain for HP <= 157.5: 0.08512362463476453 Information gain for HP <= 172.5: 0.054824648581652036 Information gain for HP <= 182.5: 0.02653432846734327 d) Based on your results for parts b and c, what is the optimal binary split of the data? Of the two child nodes created by this split, which (if any) would require further partitioning? (4%) [249]: # ￿￿ ￿￿ ￿￿￿ ￿￿ ￿￿ ￿￿ def compute_information_gain (data, attribute, target_name = "MPG" ): # ￿￿￿￿ ￿￿￿￿ ￿￿ ￿￿￿ ￿￿￿ ￿￿￿ ￿￿ unique_values = np . sort(data[attribute] . unique()) split_values = (unique_values[: -1 ] + unique_values[ 1 :]) / 2 # ￿ ￿￿￿ ￿￿ ￿￿ ￿￿ ￿￿ ￿ ￿￿ ￿￿ ￿￿ ￿￿ max_info_gain = - np . inf best_split = None for split_value in split_values: info_gain = information_gain(data, attribute, split_value, target_name) if info_gain > max_info_gain: max_info_gain = info_gain best_split = split_value return best_split, max_info_gain # 'HP'￿ ￿￿ ￿￿￿ ￿￿ ￿￿ best_split_hp, max_info_gain_hp = compute_information_gain(df, 'HP' ) print ( f"Best split for 'HP': { best_split_hp } with information gain of { max_info_gain_hp } " ) Best split for 'HP': 93.5 with information gain of 0.5567796494470396 The optimal binary split occurs at the threshold of 93.5, dividing the data into two nodes. The node with HP values less than or equal to 93.5 requires further splitting for more refined classification. 5
e) Repeat parts a through d until all training data points are perfectly classified by the resulting tree. (6%)

[250]: # Splitting the dataset based on the best 'HP' threshold
       left_split = df[df['HP'] <= 93.5]
       right_split = df[df['HP'] > 93.5]

       def find_best_split_for_node(data):
           """
           Find the best split for a given subset of the dataset by evaluating all
           possible splits for each attribute other than 'HP', and calculating the
           information gain.
           """
           attributes = ['cylinders', 'weight']  # Excluding 'HP' because it's already used for the initial split.
           best_gain = 0
           best_split = None
           best_attribute = None

           for attribute in attributes:
               unique_values = np.unique(data[attribute])
               for value in unique_values:
                   # Calculate the information gain for a binary split on the attribute
                   current_gain = information_gain(data, attribute, value, 'MPG')
                   if current_gain > best_gain:
                       best_gain = current_gain
                       best_split = value
                       best_attribute = attribute

           return best_attribute, best_split, best_gain

       # Finding the best splits for the left and right subsets
       left_attribute, left_value, left_gain = find_best_split_for_node(left_split)
       right_attribute, right_value, right_gain = find_best_split_for_node(right_split)

       # Printing the results
       print(f"Best split for left node: {left_attribute} == {left_value}, Information Gain: {left_gain}")
       print(f"Best split for right node: {right_attribute} == {right_value}, Information Gain: {right_gain}")

Best split for left node: cylinders == 4, Information Gain: 0.8112781244591328
Best split for right node: None == None, Information Gain: 0

[251]: print('Original dataset stat:\n', df['MPG'].value_counts())
       print('\nHP <= 93.5\n', left_split['MPG'].value_counts())
       print('\nHP > 93.5\n', right_split['MPG'].value_counts())
Original dataset stat:
 MPG
bad     14
good     6
Name: count, dtype: int64

HP <= 93.5
 MPG
good    6
bad     2
Name: count, dtype: int64

HP > 93.5
 MPG
bad    12
Name: count, dtype: int64

[252]: # Splitting the dataset based on 'HP' value
       left_split = df[df['HP'] <= 93.5]
       right_split = df[df['HP'] > 93.5]

       # Define the function to find the best split for nodes
       def find_best_split_for_node(data):
           attributes = ['cylinders', 'weight']  # Excluding 'HP'
           best_gain = 0
           best_split = None
           best_attribute = None
           for attribute in attributes:
               unique_values = np.unique(data[attribute])
               for value in unique_values:
                   current_gain = information_gain(data, attribute, value, 'MPG')
                   if current_gain > best_gain:
                       best_gain = current_gain
                       best_split = value
                       best_attribute = attribute
           return best_attribute, best_split, best_gain

       # Finding the best splits for the left and right subsets
       left_attribute, left_value, left_gain = find_best_split_for_node(left_split)
       right_attribute, right_value, right_gain = find_best_split_for_node(right_split)

       # Print the 'MPG' distribution for each set
       print('Original dataset MPG distribution:\n', df['MPG'].value_counts(), '\n')
       print('Left split (HP <= 93.5) MPG distribution:\n', left_split['MPG'].value_counts(), '\n')
       print('Right split (HP > 93.5) MPG distribution:\n', right_split['MPG'].value_counts(), '\n')

       # Print the best split results
       print(f"Best split for left node: {left_attribute} == {left_value}, Information Gain: {left_gain}")
       print(f"Best split for right node: {right_attribute} == {right_value}, Information Gain: {right_gain}")

       # To print the MPG distribution for subsets resulting from further splits, you would
       # first need to apply these splits and then print the distribution similarly.

Original dataset MPG distribution:
 MPG
bad     14
good     6
Name: count, dtype: int64

Left split (HP <= 93.5) MPG distribution:
 MPG
good    6
bad     2
Name: count, dtype: int64

Right split (HP > 93.5) MPG distribution:
 MPG
bad    12
Name: count, dtype: int64

Best split for left node: cylinders == 4, Information Gain: 0.8112781244591328
Best split for right node: None == None, Information Gain: 0

f) Draw or show the final decision tree in a format of your choice. The decision to make at each step and the predicted value at each leaf node must be clear. (4%)

[253]: import matplotlib.pyplot as plt  # needed for drawing; this import is not visible earlier in this excerpt

       def draw_decision_tree():
           # set up a blank canvas for the tree diagram
           fig, ax = plt.subplots()
           ax.set_xlim(0, 10)
           ax.set_ylim(0, 6)

           # helper to draw a labelled node box
           def draw_node(x, y, text):
               ax.text(x, y, text, horizontalalignment='center',
                       verticalalignment='center', fontsize=12,
                       bbox=dict(facecolor='white', edgecolor='black'))
           # helper to draw a labelled edge between two nodes
           def draw_link(x1, y1, x2, y2, text):
               ax.plot([x1, x2], [y1, y2], 'k--')
               mid_x = (x1 + x2) / 2
               mid_y = (y1 + y2) / 2
               ax.text(mid_x, mid_y, text, fontsize=10,
                       bbox=dict(facecolor='white', edgecolor='none'))

           # root node: the HP split found in part d
           draw_node(5, 5, 'HP <= 93.5?')

           # left branch (HP <= 93.5): split on cylinders
           draw_link(5, 5, 3, 4, 'Yes')
           draw_node(3, 4, f'{left_attribute} == {left_value}?')

           # right branch (HP > 93.5): pure node
           draw_link(5, 5, 7, 4, 'No')
           draw_node(7, 4, 'Predict: Bad MPG')

           # children of the cylinders split
           draw_link(3, 4, 2, 3, 'Yes')
           draw_node(2, 3, 'Predict: Good MPG')
           draw_link(3, 4, 4, 3, 'No')
           draw_node(4, 3, 'Predict: Bad MPG')

           plt.axis('off')
           plt.show()

       draw_decision_tree()
In text form, the tree learned in parts d and e (and drawn by the cell above) is:

                 [HP <= 93.5?]
                 /           \
             yes/             \no
               /               \
    [cylinders == 4?]     [Predict: bad]
         /        \
     yes/          \no
       /            \
[Predict: good]  [Predict: bad]

g) Classify each of the following four vehicles as having "good" or "bad" fuel efficiency (miles per gallon). Do this by hand using the tree structure learned in part f. (4%)

MPG,cylinders,HP,weight
good,4,93,light
bad,6,113,medium
good,4,83,weighty
bad,6,70,weighty

Tracing each vehicle through the tree from part f: the first vehicle has HP = 93 <= 93.5 and 4 cylinders, so it is classified "good"; the second has HP = 113 > 93.5, so "bad"; the third has HP = 83 <= 93.5 and 4 cylinders, so "good"; and the fourth has HP = 70 <= 93.5 but 6 cylinders, so "bad".

0.0.4 Question 3, Predicting burden of disease (40%)

[254]: data = pd.read_csv("Burden of diarrheal illness by country.csv")
       data.head(3)

[254]:        Country  FrxnPeaceIn10  ODA4H2OPcptaDol  RenewResm3PcptaYr  \
       0  Afghanistan            0.1             0.16               2986
       1      Albania            1.0             5.58              13306
       2      Algeria            0.0             0.33                473

          SustAccImprWatRur  SustAccImprWatUrb  SustAccImprSanRur  SustAccImprSanUrb  \
       0            0.10891            0.18812           0.049505            0.15842
       1            0.94059            0.98020           0.801980            0.98020
       2            0.79208            0.91089           0.811880            0.98020

          TotHlthExpPctofGDP  GenGovtPctofTotHlthExp  ExtResHlthPctTotExpHlth  \
       0               0.065                   0.395                   0.4560
       1               0.065                   0.417                   0.0340
       2               0.041                   0.808                   0.0005

          PCptaGovtExpHlthAvgExcRt  GDPPCptaIntDol  AdultLtrcyRate  FemaleLtrcyRate  \
       0                         4             430         0.35644          0.20792
       1                        49            6158         0.85644          0.78713
       2                        71            4860         0.69307          0.60396

         BurdenOfDisease
       0           awful
       1             low
       2            high

0.0.5 Data dictionary

NAME: Burden of diarrheal illness by country
SIZE: 130 Countries, 16 Variables

VARIABLE DESCRIPTIONS:
Country: Country name
FrxnPeaceIn10: Fraction of the past ten years in which a country has been at peace
ODA4H2OPcptaDol: Per capita Official Development Assistance for water projects
RenewResm3PcptaYr: Renewable water resources in cubic meters per capita per year
SustAccImprWatRur: Fraction of rural population with sustainable access to improved water
SustAccImprWatUrb: Fraction of urban population with sustainable access to improved water
SustAccImprSanRur: Fraction of rural population with sustainable access to improved sanitation
SustAccImprSanUrb: Fraction of urban population with sustainable access to improved sanitation
TotHlthExpPctofGDP: Fraction of a country's GDP devoted to health spending
GenGovtPctofTotHlthExp: The fraction of total health expenditures for a country which is provided by the government
ExtResHlthPctTotExpHlth: The fraction of total health expenditures for a country which comes from sources external to the country
PCptaGovtExpHlthAvgExcRt: Per capita government health expenditures at the average exchange rate
GDPPCptaIntDol: Gross Domestic Product per capita in international dollars
AdultLtrcyRate: Adult literacy rate
FemaleLtrcyRate: Female literacy rate
BurdenOfDisease: Our target variable for classification. The burden of disease due to diarrheal illness, categorized into "low", "medium", "high", and "awful" quartiles. For each country, we have estimates of the number of Disability-Adjusted Life Years lost per 1000 persons per year (DALYs) due to diarrheal illness. Countries with "low" burden of disease have up to 2.75345 DALYs; countries with "medium" burden of disease have between 2.75345 and 8.2127 DALYs; countries with "high" burden of disease have between 8.2127 and 26.699 DALYs; and countries with "awful" burden of disease have more than 26.699 DALYs.

0.0.6 Your goal is to train a decision tree classifier for the attribute "BurdenOfDisease" using all other variables (except country name) as features with sklearn.tree.DecisionTreeClassifier.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

a) Please choose a train/test split and choose a hyper-parameter governing model simplicity, for example, the maximum tree depth or maximum number of leaf nodes. Then, fit your decision tree classifier (using the training set) for different values of this parameter and for each such value, record the corresponding classification accuracy on the test set. (10%)

[255]: # imports for the sklearn-based decision tree experiments
       from sklearn.model_selection import train_test_split
       from sklearn.tree import DecisionTreeClassifier
       from sklearn.preprocessing import LabelEncoder
       from sklearn.metrics import accuracy_score  # used below; this import is not visible in this excerpt

       X = data.drop(['Country', 'BurdenOfDisease'], axis=1)
       y = data['BurdenOfDisease']

       # encode the string class labels as integers
       le = LabelEncoder()
       y_encoded = le.fit_transform(y)
       # train/test split
       X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(
           X, y_encoded, test_size=0.2, random_state=42)

       # fit a tree for each max_depth value and record the test accuracy
       max_depths = range(1, 11)  # depths 1 through 10
       accuracies = []

       for depth in max_depths:
           dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
           dt_clf.fit(X_train, y_train_encoded)
           y_pred = dt_clf.predict(X_test)
           accuracy = accuracy_score(y_test_encoded, y_pred)
           accuracies.append(accuracy)
           print(f"Max depth: {depth}, Test accuracy: {accuracy}")

       # report the max_depth with the highest test accuracy
       best_accuracy_index = accuracies.index(max(accuracies))
       best_depth = max_depths[best_accuracy_index]
       print(f"\nBest max_depth: {best_depth} with accuracy: {accuracies[best_accuracy_index]}")

Max depth: 1, Test accuracy: 0.5384615384615384
Max depth: 2, Test accuracy: 0.5769230769230769
Max depth: 3, Test accuracy: 0.5769230769230769
Max depth: 4, Test accuracy: 0.5769230769230769
Max depth: 5, Test accuracy: 0.5384615384615384
Max depth: 6, Test accuracy: 0.5
Max depth: 7, Test accuracy: 0.6153846153846154
Max depth: 8, Test accuracy: 0.6538461538461539
Max depth: 9, Test accuracy: 0.6538461538461539
Max depth: 10, Test accuracy: 0.6538461538461539

Best max_depth: 8 with accuracy: 0.6538461538461539

b) Make a plot of accuracy vs. simplicity for different values of the hyper-parameter chosen in part a). That is, the x-axis should be hyper-parameter value (e.g. tree depth) and the y-axis should be accuracy. (10%)

[256]: import matplotlib.pyplot as plt

       # plot test accuracy against tree depth (the simplicity hyper-parameter)
       plt.figure(figsize=(10, 6))
       plt.plot(max_depths, accuracies, marker='o', linestyle='-', color='blue')
       plt.title('Accuracy vs. Tree Depth')
       plt.xlabel('Max Depth')
       plt.ylabel('Accuracy')
       plt.grid(True)
       plt.xticks(max_depths)
       plt.show()

max depth: 8 (highest test accuracy)

c) Tune the hyper-parameter you choose in part a) by cross-validation using the training data. You can choose to use the GridSearchCV package from sklearn or write your own code to do cross-validation by splitting the training data into training and validation data. What is the out of sample accuracy after tuning the hyper-parameter? (10%)

[257]: from sklearn.model_selection import GridSearchCV  # used below; this import is not visible in this excerpt

       # re-create the train/test split
       X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(
           X, y_encoded, test_size=0.2, random_state=42)

       # hyper-parameter grid for max_depth
       thresholds = np.linspace(1, 2, 50)  # candidate values; after int() this grid only contains depths 1 and 2
       param_grid = {'max_depth': [int(x) for x in thresholds]}

       # 6-fold cross-validated grid search over max_depth
       gs = GridSearchCV(DecisionTreeClassifier(random_state=2019), param_grid, cv=6)

       # fit the grid search on the training data
       model = gs.fit(X_train, y_train_encoded)

       # report the best parameters and the out-of-sample (test) accuracy
print ( "best_params: {} \n out of sample accuracy: {} " . format(model . best_params_, model . score(X_test, y_test_encoded))) best_params: {'max_depth': 2} out of sample accuracy: 0.5769230769230769 d) Visualize a simple decision tree (e.g., with max_depth = 2 or 3) learned from the data. To do so, given your decision tree dt, you can use the code below, then copy and paste the resulting output into http://www.webgraphviz.com. Alternatively, if you have graphviz installed on your machine, you can use that. (10%) [258]: from sklearn.tree import export_graphviz import graphviz # ￿￿ ￿￿ ￿￿￿ ￿￿ ￿ ￿￿ dt = DecisionTreeClassifier(max_depth =2 , random_state =2019 ) dt . fit(X_train, y_train_encoded) # export_graphviz ￿￿￿ ￿￿￿￿ ￿￿ ￿￿￿ DOT ￿￿￿￿ ￿￿￿￿, ￿￿￿ ￿￿￿ ￿￿ dot_data = export_graphviz( dt, out_file = None , feature_names = X . columns . values, # ￿￿ ￿￿ class_names = le . classes_, # ￿￿￿ ￿￿ filled = True , rounded = True , special_characters = True , impurity = False ) . replace( "<br/>" , ", " ) . replace( "&le;" , "<=" ) . replace( "=<" , "= \" " ) . replace( ">," , " \" , " ) # ￿￿￿ DOT ￿￿￿ ￿￿ print (dot_data) digraph Tree { node [shape=box, style="filled, rounded", color="black", fontname="helvetica"] ; edge [fontname="helvetica"] ; 0 [label="GDPPCptaIntDol <= 2978.5, samples = 104, value = [24, 29, 26, 25], class = high", fillcolor="#f8fef7"] ; 1 [label="SustAccImprWatUrb <= 0.842, samples = 45, value = [23, 20, 0, 2], class = awful", fillcolor="#fcf0e7"] ; 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ; 2 [label="samples = 22, value = [18, 4, 0, 0], class = awful", fillcolor="#eb9d65"] ; 1 -> 2 ; 3 [label="samples = 23, value = [5, 16, 0, 2], class = high", fillcolor="#8fef86"] ; 1 -> 3 ; 4 [label="SustAccImprSanRur <= 0.866, samples = 59, value = [1, 9, 26, 23], 15
class = low", fillcolor="#eff7fd"] ; 0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="False"] ; 5 [label="samples = 38, value = [1, 9, 7, 21], class = medium", fillcolor="#eeadf4"] ; 4 -> 5 ; 6 [label="samples = 21, value = [0, 0, 19, 2], class = low", fillcolor="#4ea7e8"] ; 4 -> 6 ; } Question 4, Fit a random forest to the data from question 3 (20%) a) Please use the same test/train split from previous question and feel free to tune the hyper-parameters for Random Forest model us- ing training data. The package from sklearn is here: http://scikit- learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. Then please report your out of sample prediction result and compare this model’s performance with 3c). (10%) [259]: # np.linspace￿ ￿￿￿ ￿￿ ￿￿ ￿￿ (1￿￿ 10￿￿ 10￿￿ ￿￿ ￿￿￿ ￿￿) thresholds = np . linspace( 1 , 2 , 50 , dtype = int ) # ￿￿￿￿￿￿￿ ￿￿￿ ￿￿ param_grid = { 'max_depth' : thresholds} # RandomForestClassifier￿ GridSearchCV ￿￿￿ rfc = GridSearchCV( estimator = RandomForestClassifier(random_state =42 ), param_grid = param_grid, cv =6 , scoring = 'accuracy' , n_jobs =-1 ) # GridSearchCV￿ ￿￿￿ ￿￿ ￿￿￿￿ ￿￿ gs = rfc . fit(X_train, y_train_encoded) # y_train_encoded ￿￿ # ￿￿￿ ￿￿￿￿￿￿￿￿ ￿￿￿ ￿￿￿ ￿￿￿ ￿￿ ￿￿￿ ￿￿ print ( f"best_params: { gs . best_params_ } \n out of sample accuracy: { gs . score(X_test, y_test_encoded) } " ) # y_test_encoded ￿￿ best_params: {'max_depth': 2} out of sample accuracy: 0.6538461538461539 b) Write one paragraph comparing the results from those two models (Random Forest vs Decision Tree) in terms of both accuracy and interpretability. (10%) When we compare between the two methods in terms of accuracy and interpretability, it shows that random forest have better result with an accuracy of 0.61 compared to 0.57 for trees, in- dicating superior performance. However, interpretability with the plot shows that decision trees 16
indicates clearer features by showing distinct characteristics that were previously only encountered theoretically. 17
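As one further way to recover some interpretability from the random forest, in the spirit of Question 1c), here is a minimal sketch, assuming the fitted GridSearchCV object gs from part a) and the feature matrix X from Question 3 are still in scope, of ranking the input variables by the forest's impurity-based feature importances:

# Sketch only: `gs` is the fitted GridSearchCV over RandomForestClassifier from 4a),
# X is the Question 3 feature dataframe, and pandas is already imported as pd above.
best_rf = gs.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))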