CS311 – Fall 2023
Introduction to Artificial Intelligence
HW Assignment #3: Due Tuesday, November 7th, 11:10am (via the submission link on canvas)
There is an accompanying video linked from our canvas site.
Building a Decision Tree
In this assignment, you will implement a binary decision tree classifier by writing the DecisionTree class according to the specifications described below. Since the tree is binary, it can learn data in which every feature has exactly two possible values (true or false).
Provided Code
Your class should interact with the following classes that are already provided for you:
TreeNode.java – represents a node in the decision tree.
Example.java – represents a boolean feature vector for one example.
TestClassifier.java – loads positive and negative training and testing examples, runs your
decision tree class, and outputs the results.
You won't need to edit these files, but you should read and familiarize yourself with them so you understand how to use them. The code is located here:
https://middlebury.instructure.com/courses/13367/files/folder/HWs/HW3/Part%201
Part 1: Build the Decision Tree Classifier
You will write the DecisionTree class according to the following specifications:
DecisionTree – Fields
TreeNode root
Represents the root of a binary decision tree.
Rules for group work:
I suggest working in a GROUP on this assignment. You may work in groups of 2 students.
You will submit only ONE assignment for the group. You may not “divide and conquer” this
assignment. That is, all students must be present (virtually is ok) and contributing while any
work on the project is being done. In other words, you must “pair-program” to complete the
assignment. Along with the honor code, you must include, in a comment at the top of your assignment, the following statement: "All group members were present and contributing
during all work on this project.” This statement will be treated in the same manner as the
honor code, so please make sure it is present and true.
DecisionTree – Methods
void train(ArrayList<Example> examples)
Builds a decision tree on the training examples in examples. This is the ConstructTree() function we discussed in class.
void createChildren(TreeNode n, int numFeatures)
Creates the true and false child nodes for node n.
double getRemainingEntropy(int f, TreeNode n)
Computes and returns the amount of remaining entropy if feature f is chosen as the next split.
double getEntropy(int numPos, int numNeg)
Computes and returns the entropy of a node given its number of positive and negative examples.
boolean classify(Example e)
Uses the tree to classify the given example as positive (true) or negative (false).
void print() // Already provided for you
Prints a description of the decision tree, with indentations indicating tree levels.
For example, a decision tree whose root tests color (red or blue), and whose children test num (one or two) and size (big or tiny), should be printed as:
color = red:
   num = one: T
   num = two: F
color = blue:
   size = big: T
   size = tiny: F
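For orientation, here is one way the pieces above could fit together as a starting skeleton. This is only a sketch assembled from the specifications in this handout, not required code: the bodies other than getEntropy are left as TODOs, getEntropy uses the standard binary-entropy formula (an assumption, since the handout does not restate it), and, as noted in the next paragraph, you may rename or restructure things as long as TestClassifier.java still runs.

import java.util.ArrayList;

// Starting-point sketch assembled from the field/method specs above (not required code).
public class DecisionTree {

    private TreeNode root;  // root of the binary decision tree

    // Builds a decision tree on the training examples (the ConstructTree() from class).
    public void train(ArrayList<Example> examples) {
        // TODO: create the root, attach the examples, and recurse (see the pseudocode below).
    }

    // Creates the true and false child nodes for node n.
    public void createChildren(TreeNode n, int numFeatures) {
        // TODO
    }

    // Remaining entropy if feature f is chosen as the next split at node n.
    public double getRemainingEntropy(int f, TreeNode n) {
        // TODO: weighted average of getEntropy over the true and false sides of the split on f.
        return 0.0;
    }

    // Standard binary entropy from positive/negative counts (assumed formula:
    // H = -p*log2(p) - n*log2(n) over the positive/negative fractions).
    public double getEntropy(int numPos, int numNeg) {
        int total = numPos + numNeg;
        if (total == 0 || numPos == 0 || numNeg == 0) {
            return 0.0;  // an empty or pure node has zero entropy
        }
        double p = (double) numPos / total;
        double n = (double) numNeg / total;
        return -p * (Math.log(p) / Math.log(2)) - n * (Math.log(n) / Math.log(2));
    }

    // Classifies the given example as positive (true) or negative (false).
    public boolean classify(Example e) {
        // TODO: walk from the root, following the branch matching e's value for each node's feature.
        return false;
    }

    // print() is described above and is already provided.
}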
You are welcome to create any other fields or methods you find useful in your DecisionTree class and in the given TreeNode class (do not modify the Example class). You are also welcome to change the return and parameter types of the methods. The only requirement is that your classes must run with the given driver program TestClassifier.java. Remember that it is always good programming practice to keep methods as concise and simple as possible by using helper methods to do meaningful chunks of work.
***If you have implemented the tree correctly, you are done with the programming components of the assignment (i.e., the challenging part), so congratulations! Read below to figure out how to know if you have implemented the tree correctly.***
Part 2a): Test Your Classifier
To make sure your tree is working correctly, you will train and test it on four randomly generated
datasets (set1-small, set1-big, set2-small, and set2-big).
Train and test your decision tree on the datasets by running:
TestClassifier with set1-small as the program (command line) argument
TestClassifier with set1-big as the program (command line) argument
TestClassifier with set2-small as the program (command line) argument
TestClassifier with set2-big as the program (command line) argument
Each of these commands will train the tree on a set of positive and negative training examples (from the train datasets), test the learned tree on a set of positive and negative testing examples (from the test data files), and report the results.
The datasets:
Each dataset is split into two separate files: one for positive examples and one for negative examples. Each training example has a value for each feature of the dataset. The features are numbered starting from 0 (so, for example, feature 3 is the 4th feature). Here are the first few lines of the positive dataset for set1-small:
false false true false true true true true
false false false true true false true false
false true false true false false false true
…
The first line shows the values (either true or false) for each of the 8 features (feature 0 to feature
7) for example 1. The second line shows these values for example 2; the third line for example 3,
and so on.
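Just to make the file format concrete, the following self-contained snippet shows how one such line maps to a boolean feature vector. The provided TestClassifier and Example classes already handle the real loading; this is purely illustrative and uses only standard Java.

import java.util.Arrays;

// Illustration of the data-file format only; the provided classes do the actual loading.
public class LineFormatDemo {
    public static void main(String[] args) {
        String line = "false false true false true true true true";  // first example above
        String[] tokens = line.trim().split("\\s+");
        boolean[] features = new boolean[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            features[i] = Boolean.parseBoolean(tokens[i]);  // value of feature i
        }
        System.out.println("feature 3 = " + features[3]);   // prints: feature 3 = false
        System.out.println(Arrays.toString(features));
    }
}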
In these datasets, all positive examples meet the following criterion:
feature 3 is true OR feature 4 is true OR feature 7 is true
So the decision tree should be fairly simple – something like a tree that checks feature 3, then feature 4, then feature 7, and predicts Positive as soon as one of them is true (and Negative if none of them is). However, because of our datasets, we may not end up with this decision tree. One reason is that some of the datasets contain noise; specifically, some negative training examples also meet the criterion above. And even when the training data is noise-free, we may not end up with this decision tree (can you think of why not?). The goal is to use these datasets to investigate how well a decision tree can learn the criterion.
The only differences between the datasets are:
1. The small ones have 10 examples and 8 features, while the big ones have 100 examples and 10 features.
2. Set1 is noise-free. This means that positive training examples always have the property specified above and negative training examples never do.
3. Set2 contains some noise. Specifically, positive training examples always have the property, but about 10% of negative training examples also do.
What your program should output:
The small sets are mainly for testing purposes. Here is the output (i.e., the decision tree) you should get:
For set1-small:
Feature 3 = True:Positive
Feature 3 = False:
   Feature 4 = True:Positive
   Feature 4 = False:Negative
Positive examples correct: 9 out of 10
Negative examples correct: 10 out of 10
Here is a more visually appealing version of the tree: [figure: a graphical rendering of the set1-small tree above]
For set2-small:
Feature 6 = True:
   Feature 0 = True:
      Feature 5 = True:Negative
      Feature 5 = False:Positive
   Feature 0 = False:Positive
Feature 6 = False:
   Feature 7 = True:
      Feature 3 = True:Negative
      Feature 3 = False:Positive
   Feature 7 = False:
      Feature 3 = True:Positive
      Feature 3 = False:Negative
Positive examples correct: 4 out of 10
Negative examples correct: 5 out of 10
I have also provided the expected output for the big datasets on canvas. Your analysis will focus more on the big sets. In your ReadMe, discuss how your decision tree performs on the big sets and briefly explain these results.
If you end up with output that is different from my expected output:
1. If your accuracy is less than the posted results, then it is likely that there is an error in your code.
2. If your accuracy is greater than the posted results:
   o If you are getting a smaller tree, then this is okay; in fact, it's great!
   o If you are getting a slightly larger tree, then this is also okay; it is likely due to a difference in tie-breaking approaches.
3. If your accuracy is the same as the posted results:
   o If you are getting a smaller tree, then great again; nothing to worry about.
   o If you are getting a larger tree, then it is likely that there is an error in your code.
Pseudocode:
Below is the pseudocode for the train() method (i.e., the method for constructing the tree).
train(TreeNode node)
    If all remaining examples at this node have the same label L (Base Case 1):
        - set this node's label to L
        - set this node as a leaf
    If there are no more examples at this node (Base Case 2):
        - set this node's label to the majority label of its parent's examples
        - set this node as a leaf
    If there are no more features left to split on (Base Case 3):
        - set this node's label to the majority label of this node's examples
        - set this node as a leaf
    Else:
        - pos = node.getPos(); neg = node.getNeg()   // the positive and negative examples for this node
        - find the next feature to split on, i.e. the feature f with the most information gain
        - set this node's feature to f
        - createChildren(node)   // each node will have two children: a true child and a false child
        - train(this node's true child)
        - train(this node's false child)
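To make the pseudocode concrete, here is a small, self-contained Java illustration of the same construction. It deliberately does not use the provided TreeNode or Example classes (it defines its own tiny Node and Ex types), so none of the names here are the ones your DecisionTree must use; it is a sketch of the algorithm, not of the required interface.

import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the construction algorithm above (independent of the provided classes).
public class TinyTreeDemo {

    static class Node {
        int feature = -1;            // feature this node splits on; -1 means leaf
        boolean label;               // label, meaningful only at a leaf
        Node trueChild, falseChild;
    }

    static class Ex {                // a labeled boolean feature vector
        final boolean[] features;
        final boolean label;
        Ex(boolean[] features, boolean label) { this.features = features; this.label = label; }
    }

    // Binary entropy from positive/negative counts.
    static double entropy(int pos, int neg) {
        int total = pos + neg;
        if (total == 0 || pos == 0 || neg == 0) return 0.0;
        double p = (double) pos / total, n = (double) neg / total;
        return -p * (Math.log(p) / Math.log(2)) - n * (Math.log(n) / Math.log(2));
    }

    // Weighted entropy remaining after splitting the examples on feature f.
    static double remainingEntropy(int f, List<Ex> examples) {
        int tPos = 0, tNeg = 0, fPos = 0, fNeg = 0;
        for (Ex e : examples) {
            if (e.features[f]) { if (e.label) tPos++; else tNeg++; }
            else               { if (e.label) fPos++; else fNeg++; }
        }
        int total = examples.size();
        return (tPos + tNeg) / (double) total * entropy(tPos, tNeg)
             + (fPos + fNeg) / (double) total * entropy(fPos, fNeg);
    }

    // Recursive construction mirroring the pseudocode; parentMajority stands in for
    // "the majority label of the parent's examples" used in Base Case 2.
    static Node train(List<Ex> examples, boolean[] used, boolean parentMajority) {
        Node node = new Node();
        int pos = 0, neg = 0;
        for (Ex e : examples) { if (e.label) pos++; else neg++; }
        boolean featuresLeft = false;
        for (boolean u : used) if (!u) { featuresLeft = true; break; }

        if (examples.isEmpty()) { node.label = parentMajority; return node; }  // Base Case 2
        if (pos == 0 || neg == 0) { node.label = pos > 0; return node; }       // Base Case 1
        if (!featuresLeft) { node.label = pos >= neg; return node; }           // Base Case 3

        // Choose the unused feature with the lowest remaining entropy (highest information gain).
        int best = -1;
        double bestRemaining = Double.POSITIVE_INFINITY;
        for (int f = 0; f < used.length; f++) {
            if (!used[f] && remainingEntropy(f, examples) < bestRemaining) {
                bestRemaining = remainingEntropy(f, examples);
                best = f;
            }
        }
        node.feature = best;
        boolean[] childUsed = used.clone();
        childUsed[best] = true;

        // Split the examples into the true child and the false child.
        List<Ex> trueSide = new ArrayList<>(), falseSide = new ArrayList<>();
        for (Ex e : examples) (e.features[best] ? trueSide : falseSide).add(e);

        node.trueChild = train(trueSide, childUsed, pos >= neg);
        node.falseChild = train(falseSide, childUsed, pos >= neg);
        return node;
    }

    public static void main(String[] args) {
        List<Ex> data = new ArrayList<>();
        data.add(new Ex(new boolean[]{true,  false}, true));   // label is just feature 0
        data.add(new Ex(new boolean[]{true,  true},  true));
        data.add(new Ex(new boolean[]{false, true},  false));
        data.add(new Ex(new boolean[]{false, false}, false));
        Node root = train(data, new boolean[2], true);
        System.out.println("root splits on feature " + root.feature);  // prints: root splits on feature 0
    }
}

Note that this sketch breaks ties by taking the first best feature encountered; as mentioned above, different tie-breaking choices can produce slightly different (but still correct) trees.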
Tip:
Debugging the training algorithm is somewhat tricky. The easiest way to avoid hard-to-find bugs
is to build your program up incrementally, testing as you go. Try to break your training procedure
into multiple procedures and test each of these procedures to make sure they work individually.
For example, you might write functions that split a given dataset around a particular feature and
test that it trains properly for this single feature.
Part 2b): Testing on Real-World Data
Now you will test the decision tree on a real-world dataset located here:
https://middlebury.instructure.com/courses/13367/files/folder/HWs/HW3/Part%202
The dataset is derived from this one (http://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.names) and contains data about patients who were seen for hepatitis.
The goal is to predict whether or not a given patient survived. The dataset is already divided into
a training set and a testing set. The ReadMe describes the features and provides the number of
examples.
The class TestClassifierHepatitis.java is provided for you in the above directory. It loads positive
and negative training and testing examples from the hepatitis data files, runs your decision tree
class, and outputs the results.
(i) Run TestClassifierHepatitis.java on the hepatitis data to predict survival for hepatitis patients. In your ReadMe, describe how your decision tree classifies patients. Use Wikipedia or an on-line search engine to find out the meaning of the features whose names are unfamiliar to you. Do image searches at your own risk :-) Do not turn in feature definitions, but knowing a little about them may help you interpret the learned decision trees.
(ii) In your ReadMe, report the percentage of correctly classified examples, the number of false positives, and the number of false negatives.
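If it helps, the three numbers requested in (ii) can be computed from parallel lists of predicted and actual labels, as in the following self-contained sketch. The arrays here are made-up values for illustration only (the provided driver may already report some of this for you).

// Self-contained sketch of the metrics requested in (ii); the arrays below are made-up values.
public class MetricsDemo {
    public static void main(String[] args) {
        boolean[] predicted = {true, true, false, false, true};
        boolean[] actual    = {true, false, false, true, true};
        int correct = 0, falsePositives = 0, falseNegatives = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == actual[i]) correct++;
            else if (predicted[i]) falsePositives++;  // predicted positive, actually negative
            else falseNegatives++;                    // predicted negative, actually positive
        }
        double percentCorrect = 100.0 * correct / predicted.length;
        System.out.printf("%.1f%% correct, %d false positives, %d false negatives%n",
                percentCorrect, falsePositives, falseNegatives);
        // prints: 60.0% correct, 1 false positives, 1 false negatives
    }
}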
Part 3: Testing Your Own Dataset
There are many very interesting datasets publicly available to investigate. They range in size, quality, and the type of features, and they have led to the development of many new machine learning techniques. For this part, you will find a public, free, supervised (i.e., it must have features and labels) machine learning dataset.
To find a dataset, you can try:
googling “machine learning binary dataset”
browsing the UCI repository: http://archive.ics.uci.edu/ml/
browsing KDnuggets: http://www.kdnuggets.com/datasets/index.html
You may have to modify the dataset to make the features binary valued so that they work with the
decision tree you have implemented.
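As one possible approach (not a requirement), a numeric feature can be binarized by thresholding it, and a categorical feature with k values can be turned into k true/false indicator features. The feature and threshold below are made up purely for illustration.

// Illustrative only: thresholding a made-up numeric feature into a boolean feature.
public class BinarizeDemo {
    public static void main(String[] args) {
        double[] age = {23, 47, 61, 35, 70};    // hypothetical numeric feature values
        double threshold = 50.0;                // e.g. the median or a domain-meaningful cutoff
        boolean[] ageOver50 = new boolean[age.length];
        for (int i = 0; i < age.length; i++) {
            ageOver50[i] = age[i] > threshold;  // now a binary (true/false) feature
        }
        for (boolean b : ageOver50) {
            System.out.print(b + " ");          // prints: false false true false true
        }
        System.out.println();
    }
}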
Once you've found the dataset, provide the following information in your ReadMe:
(a) The name of the dataset.
(b) From where you obtained the data.
(c) A brief (i.e. 1-2 sentences) description of the dataset including what the features are and what
is being predicted.
(d) The number of examples in the dataset.
(e) The number of features for each example.
Partition your data into training and testing (at least 50 examples each), train your decision tree on the training data, and test it on the testing data. In your ReadMe, discuss how your decision tree performs on the data (as in Part 2(a), report the percentage of correctly classified examples, the number of false positives, and the number of false negatives).
Output:
We would like to grade your programs as efficiently as possible. So please make sure your
program does not output anything other than what is asked for above. Points will be deducted for
extraneous output/debugging statements.
ReadMe:
As always, please provide a ReadMe that includes the following:
1. The names of the students in your group.
2. The names of the files required to run your program.
3. Any known bugs in your program (you will be penalized less for bugs that you identify than for bugs that we identify).
4. The discussion of results and data asked for in Parts 2 and 3.
Submission:
Please include, in a comment at the top of every file you submit, the following:
Full names of students in the group
Group honor code: “All group members were present and contributing during all work on
this project”
Acknowledgement of any outside sources of help (discussions with other students, online
resources, etc.)
Middlebury honor code
Zip the following files and use the HW3 link on canvas to submit exactly ONE zipped file per group:
TreeNode.java
DecisionTree.java
ReadMe
Name your zip file as follows: HW3_<last names of all group members>. So, for example, if
your group consists of the last names Alvarez and Baker, you should name your file
HW3_AlvarezBaker.zip.
Grading:
[77] - Correctness of implementation of the above methods
    [22] train()
    [11] classify()
    [22] createChildren()
    [22] getRemainingEntropy()
[7] - Comments in code
[16] - Discussion of Parts 2 & 3 in ReadMe
Written Problem
Submit one typed copy per group with the names of both students in class by the due date
specified above.
[18] Use the algorithm we discussed in class (choosing the feature with the most information gain) to construct a decision tree for the following dataset. A1, A2, and A3 are the features and the output y is the label. Show the computations made to determine the attribute to split at each node.
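For reference when showing your computations, these are the standard quantities behind "most information gain" (stated generically, and assuming the usual definitions from class: S is the set of examples at a node with p positive and n negative examples, and S_true and S_false are the subsets on each side of a split on feature A):

H(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

\mathrm{Remainder}(A) = \frac{|S_{\mathrm{true}}|}{|S|}\,H(S_{\mathrm{true}}) + \frac{|S_{\mathrm{false}}|}{|S|}\,H(S_{\mathrm{false}}), \qquad \mathrm{Gain}(A) = H(S) - \mathrm{Remainder}(A)

At each node you would split on the feature with the largest Gain (equivalently, the smallest Remainder).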