ML-COMP3020-Lecture 2-Decision Trees

Decision Trees
COMP3020 - Machine Learning
Khoa D. Doan (khoa.dd@vinuni.edu)
Slides adapted from Soheil Kolouri
Previously
- What does "learning by examples" mean?
  - Classification tasks
  - Learning requires examples + inductive bias
  - Memorization is not enough; we need generalization
  - Formalizing the learning problem
    - Learning as function approximation
    - Learning as minimizing expected loss
Today: Decision Trees
- What is a decision tree?
- How to learn a decision tree from data?
- What is inductive bias?
- What exactly is generalization?
A training set: Playing Badminton

| Day | Weather | Temperature | Humidity | Wind   | Play? |
|-----|---------|-------------|----------|--------|-------|
| 1   | Sunny   | Hot         | High     | Weak   | No    |
| 2   | Cloudy  | Hot         | High     | Weak   | Yes   |
| 3   | Sunny   | Mild        | Normal   | Strong | Yes   |
| 4   | Cloudy  | Mild        | High     | Strong | Yes   |
| 5   | Rainy   | Mild        | High     | Strong | No    |
| 6   | Rainy   | Cool        | Normal   | Strong | No    |
| 7   | Rainy   | Mild        | High     | Weak   | Yes   |
| 8   | Sunny   | Hot         | High     | Strong | No    |
| 9   | Cloudy  | Hot         | Normal   | Weak   | Yes   |
| 10  | Rainy   | Mild        | High     | Strong | No    |
A decision tree for playing badminton

- Each node tests a feature (attribute)
- Each branch corresponds to a feature value (outcome)
- Each leaf node assigns a final decision

Weather?
  - Sunny  -> Humidity?
      - High   -> No
      - Normal -> Yes
  - Cloudy -> Yes
  - Rainy  -> Wind?
      - Strong -> No
      - Weak   -> Yes
Decision Trees

- Representation
  - Each node tests a feature (attribute)
  - Each branch corresponds to a feature value (outcome)
  - Each leaf node assigns a final decision (classification)
    - Or a distribution over decisions
- A decision tree represents a function that maps examples in X to a class in Y:
  f: <Weather, Temperature, Humidity, Wind> -> PlayBadminton?
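To make the "tree as a function" view concrete, here is a minimal sketch (my own, not from the slides) that hard-codes the badminton tree above as nested feature tests; the function name play_badminton is just an illustrative choice.

```python
# A hand-coded version of the badminton decision tree: a function mapping
# <Weather, Temperature, Humidity, Wind> to a Yes/No decision.
def play_badminton(weather: str, temperature: str, humidity: str, wind: str) -> str:
    if weather == "Sunny":
        # Sunny days are decided by humidity.
        return "No" if humidity == "High" else "Yes"
    elif weather == "Cloudy":
        # Cloudy days are always Yes in this tree.
        return "Yes"
    else:  # Rainy
        # Rainy days are decided by wind.
        # Note: Temperature is never tested anywhere in this particular tree.
        return "No" if wind == "Strong" else "Yes"

# Example: Day 1 from the training set (Sunny, Hot, High, Weak) -> "No".
print(play_badminton("Sunny", "Hot", "High", "Weak"))
```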
Exercise: represent Boolean functions with decision trees: AND, OR, XOR
Exercise: represent Boolean functions with decision trees (AND)

| X1 | X2 | X1 AND X2 |
|----|----|-----------|
| T  | T  | T         |
| T  | F  | F         |
| F  | T  | F         |
| F  | F  | F         |
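One possible solution sketch (my own, not the slide's answer): AND can stop after testing a single feature on one of its branches, whereas XOR has to test both features on every path.

```python
# Decision trees for Boolean functions, written as nested tests on x1, x2.
def and_tree(x1: bool, x2: bool) -> bool:
    if not x1:          # left branch: x1 = False -> leaf False immediately
        return False
    return x2           # right branch: x1 = True -> the answer depends on x2

def xor_tree(x1: bool, x2: bool) -> bool:
    # XOR has no one-feature shortcut: both branches of x1 must test x2.
    if x1:
        return not x2
    return x2

# Truth-table check against Python's built-in operators.
for a in (True, False):
    for b in (True, False):
        assert and_tree(a, b) == (a and b)
        assert xor_tree(a, b) == (a != b)
```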
Today: Decision Trees
- What is a decision tree?
- How to learn a decision tree from data?
- What is inductive bias?
- What exactly is generalization?
Function approximation with decision trees

Problem Setting
- Set of possible instances X
  - feature vector x = [x_1, ..., x_d] ∈ X
- Unknown target function f: X -> Y
  - Y is discrete-valued
- Set of function hypotheses H = {h | h: X -> Y}
  - h is a decision tree

Input
- Training examples {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))} of the unknown target function f

Output
- Hypothesis h ∈ H that best approximates the target function f
Decision Tree Learning

- Find the hypothesis h ∈ H
  - such that it minimizes the training error
  - or, equivalently, maximizes the training accuracy
- Challenges?
  - H is too large for exhaustive search
  - We opt for heuristic (greedy) search algorithms that
    - pick questions to ask, in order,
    - such that classification accuracy is maximized
Top-down Induction of Decision Trees

DTtrain(examples for CurrentNode, features at CurrentNode):
    Find F, the best decision feature for CurrentNode
    For each value of F, create a new child node
    Assign the training examples to the child nodes
    If the training examples are perfectly classified:
        Stop
    Else:
        Recursively apply DTtrain to each new child node
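A minimal Python sketch of this recursive procedure (my paraphrase of the pseudocode, not the lecture's code); it uses a simple accuracy-based criterion for "best decision feature", whereas the ID3 variant discussed later uses information gain.

```python
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def split_accuracy(examples, f):
    """Training accuracy if we split on feature f and each child predicts its majority label."""
    correct = 0
    for value in {x[f] for x, _ in examples}:
        child_labels = [y for x, y in examples if x[f] == value]
        correct += child_labels.count(majority(child_labels))
    return correct / len(examples)

def dt_train(examples, features):
    """Greedy top-down induction; `examples` is a list of (feature_dict, label) pairs."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not features:   # pure node, or nothing left to ask
        return majority(labels)                  # leaf: final decision
    f = max(features, key=lambda g: split_accuracy(examples, g))  # best feature
    children = {}
    for value in {x[f] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[f] == value]
        children[value] = dt_train(subset, [g for g in features if g != f])
    return {"feature": f, "children": children}
```

Calling dt_train on the badminton table (with each row given as a ({"Weather": ..., "Humidity": ..., ...}, "Yes"/"No") pair) produces a tree like the one shown earlier, up to tie-breaking between equally good features.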
How to select the best feature?

- A good feature is one that lets us make "correct" classification decisions
- One way: select the feature that results in the best classification accuracy
- Let's try it out: build a decision tree with Humidity and Wind only
Build a decision tree: partition with Wind?

Root node: [Wind], with branches Weak and Strong; the training examples above are assigned to the two branches.
Build a decision tree: partition with Wind? (continued)

[Wind]
- Weak branch:   Acc = 0.75
- Strong branch: Acc = 0.66
- Avg = 0.71
Build a decision tree: partition with Humidity?

- Splitting on Wind (Weak | Strong): Avg = 0.71
- Splitting on Humidity (High | Normal): Avg = 0.67
Build a decision tree: partition with Humidity and Wind?

[Wind]
- Weak branch
- Strong branch -> [Humidity] with branches High and Normal

Acc = 0.75, Acc = 1.00, Avg = 0.875
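The single-feature branch accuracies above can be checked with a short script (my own, not from the slides), assuming each branch predicts the majority label of its training examples.

```python
from collections import Counter

# The 10-example training set: (Weather, Temperature, Humidity, Wind, Play?)
data = [
    ("Sunny",  "Hot",  "High",   "Weak",   "No"),
    ("Cloudy", "Hot",  "High",   "Weak",   "Yes"),
    ("Sunny",  "Mild", "Normal", "Strong", "Yes"),
    ("Cloudy", "Mild", "High",   "Strong", "Yes"),
    ("Rainy",  "Mild", "High",   "Strong", "No"),
    ("Rainy",  "Cool", "Normal", "Strong", "No"),
    ("Rainy",  "Mild", "High",   "Weak",   "Yes"),
    ("Sunny",  "Hot",  "High",   "Strong", "No"),
    ("Cloudy", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rainy",  "Mild", "High",   "Strong", "No"),
]
FEATURES = {"Weather": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def branch_accuracies(feature):
    """Majority-vote accuracy of each branch when splitting on `feature`."""
    col = FEATURES[feature]
    accs = {}
    for value in {row[col] for row in data}:
        labels = [row[4] for row in data if row[col] == value]
        accs[value] = Counter(labels).most_common(1)[0][1] / len(labels)
    return accs

print(branch_accuracies("Wind"))      # Weak branch 0.75, Strong branch ~0.67
print(branch_accuracies("Humidity"))
```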
Another feature selection criterion: Entropy

- Used in the ID3 algorithm [Quinlan, 1986]
  - Pick the feature with the smallest (conditional) entropy to split the examples at each iteration
- Entropy measures the impurity of a sample of examples
Sample Entropy

H(S) = −p₊ log₂ p₊ − p₋ log₂ p₋

- S is a sample of training examples
- p₊ is the proportion of positive examples in S
- p₋ is the proportion of negative examples in S
- Entropy measures the impurity of S
Entropy of a random variable

- Entropy H(X) of a random variable X:
  H(X) = −Σᵢ P(X = i) log₂ P(X = i)
- H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
Entropy of a random variable (continued)

Why entropy? Information theory:
- To encode a message X = i, the most efficient code assigns −log₂ P(X = i) bits
- So the expected number of bits needed to encode one randomly drawn value of X is
  −Σᵢ P(X = i) log₂ P(X = i)
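A small sketch (mine, not from the slides) of the two-class sample entropy as a function of the positive proportion p₊:

```python
import math

def entropy(p_pos: float) -> float:
    """H(S) = -p+ log2(p+) - p- log2(p-), treating 0 * log2(0) as 0."""
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(entropy(0.5))   # 1.0 bit: a fair coin is maximally impure/uncertain
print(entropy(1.0))   # 0.0 bits: a pure sample needs no bits to encode
print(entropy(5/6))   # ~0.65 bits: the label column of the toy table used next
```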
Conditional Entropy

Conditional entropy H(Y | X) of a variable Y conditioned on a random variable X:

H(Y | X) = Σⱼ P(X = j) H(Y | X = j)

H(Y | X = j) = −Σᵢ P(Y = i | X = j) log₂ P(Y = i | X = j)
Conditional Entropy (example)

H(Y | X) = −Σⱼ P(X = j) Σᵢ P(Y = i | X = j) log₂ P(Y = i | X = j)

Example data:

| X1 | X2 | Y |
|----|----|---|
| T  | T  | T |
| T  | F  | T |
| T  | T  | T |
| T  | F  | T |
| F  | T  | T |
| F  | F  | F |
Conditional Entropy (example, continued)

Splitting the six examples on X1:
- P(X1 = t) = 4/6, with Y = t: 4 and Y = f: 0
- P(X1 = f) = 2/6, with Y = t: 1 and Y = f: 1
Conditional Entropy (example, continued)

H(Y | X1) = −4/6 (1 log₂ 1 + 0 log₂ 0) − 2/6 (1/2 log₂ 1/2 + 1/2 log₂ 1/2) = 2/6 ≈ 0.33
Information Gain

- Decrease in entropy (uncertainty) after splitting

IG(X) = H(Y) − H(Y | X)
Information Gain (example)

In our example (the six-row table above):

IG(X1) = H(Y) − H(Y | X1)
Information Gain (example, continued)

IG(X1) = H(Y) − H(Y | X1) = 0.65 − 0.33 ≈ 0.32

IG(X1) > 0, so we prefer the split on X1.
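A short sketch (mine, not from the slides) that reproduces these numbers on the six-row toy table:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Toy table: (X1, X2, Y)
rows = [("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"),
        ("T", "F", "T"), ("F", "T", "T"), ("F", "F", "F")]
Y = [y for *_, y in rows]

def conditional_entropy(col):
    """H(Y | X_col) = sum_j P(X = j) * H(Y | X = j)."""
    h = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[2] for r in rows if r[col] == value]
        h += (len(subset) / len(rows)) * entropy(subset)
    return h

h_y = entropy(Y)                   # ~0.65
h_y_x1 = conditional_entropy(0)    # 4/6 * 0 + 2/6 * 1 = ~0.33
print(h_y, h_y_x1, h_y - h_y_x1)   # IG(X1) ~ 0.32 > 0, so prefer the split
```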
Today: Decision Trees
- What is a decision tree?
- How to learn a decision tree from data?
- What is inductive bias?
- What exactly is generalization?
Inductive bias in decision tree learning

- The learning algorithm performs a heuristic search through the space of all decision trees
- It stops at the smallest acceptable tree
- Occam's Razor: prefer the simplest hypothesis that fits the data
Why a preference for short hypotheses?

- There are fewer short hypotheses than long ones
  - Therefore, a short hypothesis that fits the data is less likely to be a statistical coincidence
- But what is so special about short hypotheses?
Today: Decision Trees
- What is a decision tree?
- How to learn a decision tree from data?
- What is inductive bias?
- What exactly is generalization?
Evaluating the learned hypothesis

- Assume that
  - we have learned a tree h using the top-down induction algorithm
  - it fits the training data perfectly
- Are we done?
  - Can we guarantee we have found a good hypothesis?
Recall: training metrics do not tell us anything about generalization!

[Figure: examples of Class A and examples of Class B; the caption notes that the background is in focus]
Recall: Formalizing Induction

- Given
  - a loss function ℓ
  - samples from some unknown data distribution D
- Our task is to compute a function f that has low expected error over D with respect to ℓ:

  E_{(x,y)~D}[ ℓ(y, f(x)) ] = Σ_{(x,y)} D(x, y) ℓ(y, f(x))
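As a small illustration (mine, not the lecture's), the expected loss can be estimated by averaging the loss over a finite sample; the sample and classifier below are hypothetical placeholders.

```python
# Empirical estimate of the expected loss E_{(x,y)~D}[ l(y, f(x)) ].
def zero_one_loss(y_true, y_pred):
    return 0.0 if y_true == y_pred else 1.0

def average_loss(f, samples):
    """Average loss of f over a finite sample {(x_n, y_n)} drawn from D."""
    return sum(zero_one_loss(y, f(x)) for x, y in samples) / len(samples)

# Hypothetical sample and classifier, just to make the function runnable.
samples = [((1, 2), 0), ((3, 4), 1), ((5, 6), 1)]
f = lambda x: int(x[0] > 2)
print(average_loss(f, samples))  # 0.0 on this tiny sample
```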
Training error is not sufficient

- We care about generalization to new (unseen) and related examples
- A tree can classify the training data perfectly, yet classify new examples incorrectly. How?
Training error is not sufficient (continued)

- A tree can classify the training data perfectly, yet classify new examples incorrectly. How?
  - The training data is only a sample from the data-generating distribution
    - A feature might correlate with the class by coincidence
  - Training examples can be noisy
    - e.g., labeling mistakes
Adding a noisy training example

A new example, Day 11 (Sunny, Mild, High, Strong), is added to the training set of Days 1-10 with a noisy label. The tree learned earlier (Weather at the root, then Humidity or Wind) is shown alongside; it no longer classifies the training data perfectly.
Overfitting

- Given a hypothesis h and its:
  - training error over the training data, error_train(h)
  - true error over all data, error_true(h)
Overfitting (continued)

- We say h overfits the training data if:
  error_train(h) < error_true(h)
- Amount of overfitting:
  error_true(h) − error_train(h)
The need for test data

- We don't know error_true(h)
The need for test data (continued)

- We don't know error_true(h)
- Solution:
  - We set aside a test set: some examples that will be used only for evaluation
  - We never look at them during training
  - After learning, we calculate error_test(h)
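As an illustration of this protocol (mine, not the lecture's; scikit-learn and the iris dataset are stand-ins), a held-out split gives an estimate of the true error:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset
# Hold out a test set that is never touched during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier().fit(X_train, y_train)
print("error_train:", 1 - tree.score(X_train, y_train))  # often ~0 for a fully grown tree
print("error_test: ", 1 - tree.score(X_test, y_test))    # estimate of the true error
```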
Measuring the effect of overfitting in decision trees

[Figure: accuracy (y-axis, roughly 0.5 to 0.9) versus size of tree in number of nodes (x-axis, 0 to 100), with one curve for accuracy on training data and one for accuracy on test data]
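A rough sketch (mine) of how a curve like this could be produced by growing trees of increasing size on a stand-in dataset; the exact shape will differ from the slide's figure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow trees of increasing size and watch train vs. test accuracy diverge.
for n_leaves in (2, 4, 8, 16, 32, 64):
    h = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0).fit(X_tr, y_tr)
    print(n_leaves, round(h.score(X_tr, y_tr), 3), round(h.score(X_te, y_te), 3))
```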
Underfitting vs. Overfitting

- When do we underfit?
  - The learning algorithm had the opportunity to learn more from the training data, but didn't
- When do we overfit?
  - The learning algorithm paid too much attention to the noisy part of the training data, resulting in a tree that does not generalize well to test data
- What we want:
  - a decision tree that neither underfits nor overfits
  - because it is expected to do best in the future
Recap: Decision Trees

- What is a decision tree?
- How do we learn a decision tree from data?
  - Top-down induction to minimize (training) classification error
- What is inductive bias?
  - Occam's Razor: shorter trees are preferred
- What exactly is generalization?
  - Overfitting can be an issue
