HW-5
2023-10-23
Abstract
Accurate cancer diagnosis is difficult. This project compares three classification methods on breast
cancer samples in order to determine which method is the most accurate. Random Forest classification
is often described as a black box: it classifies well, but its reasoning is difficult to explain, so
using it requires a degree of trust. Boosting combines many weak learners into one strong learner.
Support Vector Machines find a separating hyperplane to classify binary data.
The goal is to determine which of these three options, when tuned properly, best predicts malignant
and benign diagnoses in breast tissue samples. These supervised learning methods will be trained
on 9 quantifiable characteristics of breast tumors, along with the correct diagnosis, in order to
predict the diagnosis of future samples. The characteristics are measured by microscopic examination
and scored from 1 to 10 depending on the findings.
Intro
I chose the UCI cancer dataset. In my initial cleanup of the data, I set Class as a factor since
there were only two options. While working with Boosting and the gbm() function, I realized the
factors were breaking the cross-validation, so I reverted them to integers. Going forward, a 0
signifies a benign tumor, and a 1 signifies a malignant tumor. I also removed the ID column, since
it is not needed for my analysis.
Thick  UnifSize  UnifShape  MargAd  SglEpSize  BareNuc  BlandChr  NormNuc  Mitosis  Class
5      1         1          1       2          1        3         1        1        0
5      4         4          5       7          10       3         2        1        0
3      1         1          1       2          2        3         1        1        0
6      8         8          1       3          4        3         7        1        0
4      1         1          3       2          1        3         1        1        0
8      10        10         8       7          10       9         7        1        1
The Data
The columns have shortened names for ease in coding, but here is the explanation of all
variables:
1. Clump Thickness (1 - 10): cancer cells tend to group together, which is why doctors check for
lumps. A higher thickness is more indicative of malignancy.
2. Uniformity of Cell Size (1 - 10): cancer cells generally grow in unexpected shapes and sizes, so
less uniformity in cell size is more indicative of malignancy.
3. Uniformity of Cell Shape (1 - 10): similar to cell size, less uniformity in cell shape is more
indicative of malignancy.
4. Marginal Adhesion (1 - 10): malignant cells are less adhesive than regular cells, which is how
you get metastasis, so a loss of adhesion indicates malignancy.
5. Single Epithelial Cell Size (1 - 10): epithelial cells are surface cells, whether external or
internal. Since they are generally uniform, a higher SEC size is more indicative of malignancy.
6. Bare Nuclei (1 - 10): nuclei without cytoplasm. A higher bare nuclei score indicates malignancy.
7. Bland Chromatin (1 - 10): uniformity in texture of the nucleus. Less uniform tends to be
malignant.
8. Normal Nucleoli (1 - 10): the nucleolus is the structure inside the nucleus where ribosomes are
assembled, and unusual cells, such as cancer cells, typically have less normal nucleoli.
9. Mitoses (1 - 10): an estimate of the number of times mitosis has taken place. Since malignant
cells reproduce faster, a higher number is indicative of malignancy.
10. Class (0 - 1): classification, or diagnosis. Either Benign (0) or Malignant (1).
Class  Freq
0      458
1      241
There are 458 counts of benign tumors, 65.52% of the samples, and 241 counts of malignant tumors,
34.48% of the samples.
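As a quick check of the arithmetic above, here is a minimal Python sketch that recomputes the class shares from the two counts in the frequency table. It also reports the accuracy of a trivial model that always predicts the majority class, which is a useful floor to keep in mind when reading the accuracies later on. The variable names are my own and are not part of the original analysis.

```python
# Class counts as reported in the frequency table above.
counts = {"benign": 458, "malignant": 241}
total = sum(counts.values())

# Share of each class in percent, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Accuracy (in percent) of a trivial model that always predicts the majority class.
majority_baseline = max(shares.values())

print(total, shares, majority_baseline)
```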
In my cleaning and setup, I noticed one column of numbers was being treated as characters. On
casting to integers, I got a warning that NAs were introduced by coercion, meaning there were some
non-integers in there, probably some “NA” strings.
Number of Na values in BareNuc: 16
Percentage of rows with Na values in data: 2.29%
Number of Malignant Na values: 2
Number of Benign Na values: 14
Per Dr. Sokol in ISYE 6501, since less than 5% of the data is NA, I will impute the missing values
based on the mean BareNuc count of the other points in their class. I don’t want to take the mean
of the whole dataset, since we are trying to distinguish between malignant and benign, and an
average of those two together may not be as accurate as the two averages separately.
Before imputation:
    BareNuc  Class
24  NA       1
41  NA       0
After imputation:
    BareNuc  Class
24  7.63     1
41  1.35     0
The mean bare nucleus count for malignant cells is 7.63 and the mean bare nucleus count for benign
cells is 1.35. Specimen 24, a malignant tumor, has been imputed to 7.63. Specimen 41, a benign
tumor, has been imputed to 1.35.
Seeing the averages now, I am convinced that my choice to
average each Class individually was the right choice.
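The class-wise imputation described above can be sketched in a few lines of Python. This is only an illustration on made-up toy rows, not the real data, and the function name impute_by_class_mean is hypothetical; the idea is simply that each missing value is replaced by the mean of the observed values within the same class.

```python
from statistics import mean

# Hypothetical toy rows: (bare_nuclei, class). None marks a missing value.
rows = [(1, 0), (2, 0), (None, 0), (8, 1), (10, 1), (None, 1)]


def impute_by_class_mean(rows):
    """Replace missing values with the mean of the observed values in the same class."""
    class_means = {}
    for label in {c for _, c in rows}:
        observed = [v for v, c in rows if c == label and v is not None]
        class_means[label] = mean(observed)
    return [(class_means[c] if v is None else v, c) for v, c in rows]


imputed = impute_by_class_mean(rows)
print(imputed)
```

Using one mean for the whole column instead would pull the benign values up and the malignant values down, which is the concern raised above.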
Scaling
Ordinarily, I would scale the data first, but all columns except for the Malignant/Benign classifier
are between 1-10, so there is no need for additional scaling.
Analysis
Now that the data has been cleaned, we can begin the actual analysis.
A: Random Forest
Random Forests are fascinating. First and foremost, they are an ensemble method of prediction
built on “bagging” (bootstrap aggregating): each tree is trained on a bootstrap sample of the data,
which is a sample drawn with replacement, and the predictions of all trees are combined. One of the
benefits is that Random Forests reduce overfitting, or at least the risk of overfitting. They are
also versatile, since they can handle both regression and classification problems, although in my
experience, I have only used them for classification. One of the main critiques of Random Forests
is that they are considered to be “black box” models where you see your inputs and your outputs,
but it is difficult to explain the jump from one to the other.
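To make the sampling-with-replacement idea concrete, here is a minimal Python sketch of the two building blocks of bagging: drawing a bootstrap sample and combining tree predictions by majority vote. The data, the votes, and the helper names are all hypothetical, and no actual trees are fitted; it only shows the mechanics.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(999)
data = list(range(10))                         # hypothetical row indices
sample = bootstrap_sample(data, rng)           # rows one tree would train on
out_of_bag = sorted(set(data) - set(sample))   # rows that tree never saw

votes = [0, 1, 1, 0, 1]                        # hypothetical predictions from five trees
print(sample, out_of_bag, majority_vote(votes))
```

Because rows are drawn with replacement, some rows repeat and others are left out of each sample; the left-out rows are what a forest can use to estimate its error without a separate validation set.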
One tunable parameter in Random Forests is the number of trees we use. Here is a test of the
error rates for different numbers of trees in the forest, along with a small table of the tree
counts with the lowest errors:
Clearly, the Random Forest error plot isn’t like some plots we use where error keeps decreasing the
more neighbors we use or the more unsupervised K groups we add. The errors appear to gradually
settle around 2.625% as more trees are added.
[Figure: Error Rate per Number of Trees in Forest. x-axis: Number of Trees (1000 to 5000);
y-axis: Error Rate (about 2.55 to 2.70).]
    # Trees  Error Rate (%)
12  1600     2.55
13  1700     2.55
4   800      2.57
6   1000     2.57
10  1400     2.58
17  2100     2.58
I grabbed all the errors and the numbers of trees they correspond to and sorted by lowest error and
just displayed the first few. I’ll just use the top value for my tree value, 1600 trees, with an error
of 2.55%.
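The selection step just described amounts to taking the minimum over (trees, error) pairs. A small Python sketch, using only the six rows shown in the table, follows. Breaking ties in favor of fewer trees is my assumption; it happens to give the same pick as taking the first row of the sorted table.

```python
# (number of trees, error rate in %) pairs copied from the table above.
results = [(1600, 2.55), (1700, 2.55), (800, 2.57),
           (1000, 2.57), (1400, 2.58), (2100, 2.58)]

# Lowest error first; ties are broken by the smaller forest (an assumption).
best_trees, best_error = min(results, key=lambda pair: (pair[1], pair[0]))

print(best_trees, best_error)
```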
Unfortunately, there isn’t much to show graphically when working with Random Forests. However,
I can show an example tree out of the forest because I think they’re fun:
So what can we see here? We start at the top box and see it is slightly orange colored and has
class=Benign at the bottom of the box. That means that with no measurements at all, we would say
the sample is Benign. This matches the counts at the top of the report, where the Benign count to
Malignant count is about 2:1, and we can see value = [360, 200], meaning there are 360 Benign and
200 Malignant samples in the training set. Moving down a level, if we are only going to take one
measurement, it should be the Size Uniformity, shown on the first line of the top box. In these
plots, True goes to the left and False goes to the right. If UnifSize is less than or equal to 2.5,
the sample is classified as Benign, in orange. If it is greater than 2.5, it is classified as
Malignant, in blue. We move down from there if we want to take more measurements.
The different shades show the “purity level” of each box, that is, how dominant the majority class
is. If you look at the middle, lighter orange boxes on the third row, you can see in both values
that the breakdown is much closer to 1:1 than in the bottom outer boxes, where value = [315, 1] and
[14, 180]. We could cut the Benign side off at the first level, since it looks like nearly
everything with UnifSize <= 2.5 is Benign; but this is more a fun example of a tree.
Regarding overfitting, mentioned above, we can see the smallest sample sizes are 25, in the bottom
middle boxes. We could keep splitting all the way down to one sample per box, but that model would
be badly overfit. A common rule of thumb is to set the minimum allowable number of samples for a
classification to the square root of the total number of samples. In this tree, we have 560 total
samples and √560 ≈ 24, so we are right about there.
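For reference, the square-root rule of thumb works out as follows in Python. The variable names are mine; the value 560 is the training set size quoted above.

```python
import math

n_train = 560                          # training samples, as stated above
exact = math.sqrt(n_train)             # about 23.66
min_leaf = round(exact)                # rounded to the nearest whole sample

print(round(exact, 2), min_leaf)
```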
For the test data set, the Random Forest confusion matrix shows a 94.96% accuracy. Pretty
excellent in my opinion. Let’s see what we can do with Boosting.
Boosting
Boosting is also interesting. It takes several one-decision trees, often referred to as “stumps”, that
are just barely better than random guessing, and places weights on their predictions based on how
many samples they can correctly classify. Boosting is the process of combining the weighted stumps
into one system. For example, if one stump is great at diagnosing based on Size Uniformity, and
one is great at diagnosing based on Bare Nuclei count, those stumps will be weighted higher and
combined. One of the main goals of Boosting is to correct the errors of previous stumps.
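The weighting idea can be sketched with the classic AdaBoost formula, where a stump with weighted error e gets vote weight 0.5 * ln((1 - e) / e). This is an illustration of weighted stumps only, under hypothetical error rates; gradient boosting as implemented in gbm instead fits each new tree to the gradient of the loss, so this is not what gbm() does internally.

```python
import math


def stump_weight(error_rate):
    """AdaBoost-style vote weight: larger when the stump's error is lower."""
    return 0.5 * math.log((1 - error_rate) / error_rate)


def weighted_vote(stump_predictions, weights):
    """Combine +1/-1 stump predictions into one prediction by weighted sum."""
    score = sum(w * p for p, w in zip(stump_predictions, weights))
    return 1 if score >= 0 else -1


# Hypothetical error rates for three stumps, e.g. splits on UnifSize, BareNuc, Mitosis.
errors = [0.10, 0.15, 0.45]
weights = [stump_weight(e) for e in errors]

# The two accurate stumps say malignant (+1); the weak one says benign (-1).
prediction = weighted_vote([1, 1, -1], weights)
print([round(w, 3) for w in weights], prediction)
```

A stump that is barely better than guessing (error 0.45) gets a weight close to zero, so it cannot outvote a stump that is right 90% of the time.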
I have opted to use Gradient Boosting, which often outperforms Random Forest, though not
always.
[Figure: Bernoulli deviance by boosting iteration. x-axis: Iteration (0 to 100);
y-axis: Bernoulli deviance (0.2 to 1.2).]
The blue line marks the optimal number of trees to boost with. I will set my optimal model to
70 trees, per the recommendation.
For the same test data, the Boosting confusion matrix shows a boost accuracy of 96.4%, just about
the same as the Random Forest.
B: Baseline Method
I am going to use a Support Vector Machine as my baseline method for comparison. Since we have the
classification already, a supervised approach is more appropriate. Fans of my previous homework
assignments will know that SVMs are my favorite, so I’ll take any excuse to make one.
Support Vector Machines (SVM) are similar to Linear Discriminant Analysis, except they use the
predictors, not components. They are excellent tools for discriminating between binary responses
such as who will be approved for a bank loan, whether a car will be above or below a certain mpg,
or in this case, whether your biopsy will come back as Malignant or Benign.
I love SVM enough that I referenced it in homework 1 and 3, so I’ll include the same expertly
drawn example I made in Paint. When looking at the image, you’ll see a new point being added
in the blue circle. The Support Vector Machine would find that diagonal hyperplane to split the
data, as opposed to a different baseline method like K-Nearest Neighbors, which would ignore the
line in favor of the neighbors in the blue circle. If you count the 7 Nearest Neighbors, you’d see
that the new point would be classified as a green star, where SVM would classify the new point as
a red circle.
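The contrast between the two rules can be reproduced on a toy example. The Python sketch below uses made-up 2-D points and an assumed separating line x + y = 10 that stands in for a fitted hyperplane (nothing is actually trained here, and the coordinates are not taken from my drawing). It shows how a point can sit on the circle side of the line while most of its 7 nearest neighbors are stars.

```python
import math
from collections import Counter

# Hypothetical 2-D training points: (x, y, label).
train = [
    (1, 1, "circle"), (2, 1, "circle"), (1, 2, "circle"), (2, 2, "circle"),
    (3, 3, "circle"), (1, 3, "circle"), (3, 1, "circle"),
    (5.5, 5, "star"), (5, 5.5, "star"), (6, 4.5, "star"), (4.5, 6, "star"),
    (8, 8, "star"), (7, 8, "star"), (8, 7, "star"),
]


def linear_rule(point, w=(1.0, 1.0), b=-10.0):
    """Classify by the side of the line w.x + b = 0 (coefficients assumed, not fitted)."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "star" if score > 0 else "circle"


def knn_rule(point, data, k=7):
    """Classify by majority label among the k nearest training points."""
    nearest = sorted(data, key=lambda row: math.dist(point, row[:2]))[:k]
    return Counter(label for _, _, label in nearest).most_common(1)[0][0]


new_point = (4.5, 4.5)
print(linear_rule(new_point), knn_rule(new_point, train))
```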
Our baseline Support Vector Machine finished with 96.4% accuracy, again, similar to the others.
Results & Conclusion
Method          True Positive  True Negative  False Positive  False Negative  Accuracy
Random Forest   91             41             2               5               94.96
Gradient Boost  91             43             2               3               96.40
Support Vector  91             41             2               5               96.40
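Accuracy is (TP + TN) divided by the total number of test samples, so the Accuracy column can be recomputed from the four counts. The Python sketch below does this for the counts exactly as printed in the table. One thing it surfaces: the printed Support Vector counts are identical to the Random Forest row and work out to 94.96, not the 96.40 listed, so the counts in that row may have been mis-copied. The function name is my own.

```python
def accuracy(tp, tn, fp, fn):
    """Percentage of correct predictions among all test samples."""
    return round(100 * (tp + tn) / (tp + tn + fp + fn), 2)


# Counts as printed in the results table: (TP, TN, FP, FN).
table = {
    "Random Forest": (91, 41, 2, 5),
    "Gradient Boost": (91, 43, 2, 3),
    "Support Vector": (91, 41, 2, 5),
}

recomputed = {name: accuracy(*counts) for name, counts in table.items()}
print(recomputed)
```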
Hopefully it is obvious that I would never claim that this predictor is a substitute for a doctor’s
diagnosis. That being said, I will continue under the assumption that we are diagnosing breast
cancer with this tool. For this particular data set, it appears that it is better to use Gradient
Boosting or Support Vector Machines to diagnose or predict a diagnosis than to use Random Forest.
If I were to make this tool even better, I would determine which is worse, over-diagnosing
or under-diagnosing malignant tumors. We can tune all three of these models to never miss a
malignant case. The trade-off is that we will likely diagnose someone with malignant cancer when
the tumor is actually benign, causing undue stress, medical expenses, and damage to health through
the typical chemotherapy. On the other hand, if we would rather avoid those cases, we could tune
the models to favor benign calls, so no one needlessly goes through such a difficult treatment.
That trade-off is likely missing malignant cases, leading to a higher rate of metastases and death.
This assignment is not for that purpose, and I am content with the middle-of-the-road diagnoses
presented by the three options.
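One simple way to do that kind of tuning is to move the probability cutoff used to call a sample malignant. The Python sketch below uses made-up labels and predicted probabilities (not outputs from my models) and counts malignant as the positive class. A low cutoff removes false negatives at the cost of more false positives, and a high cutoff does the reverse.

```python
def confusion_counts(y_true, probs, threshold):
    """Return (tp, tn, fp, fn), predicting malignant (1) when prob >= threshold."""
    tp = tn = fp = fn = 0
    for truth, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        elif pred == 1 and truth == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


# Hypothetical true labels and predicted probabilities of malignancy.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.35, 0.60, 0.30, 0.55, 0.80, 0.95]

by_threshold = {t: confusion_counts(y_true, probs, t) for t in (0.25, 0.50, 0.75)}
for t, counts in by_threshold.items():
    print(t, counts)
```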
Appendix
#R setup
# set.seed(999)
# rm(list = ls())
# tinytex::install_tinytex(force = TRUE)
# library(reticulate)
# library(caret)
# library(ggplot2)
# library(randomForest)
# library(dplyr)
# library(kableExtra)
# library(e1071)
# library(magick)
# library(rpart)
# library(rpart.plot)
# library(gbm)
# python setup
# import warnings
# warnings.filterwarnings("ignore")
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# import random as rnd
#
#
# from sklearn.cluster import KMeans
# from sklearn import metrics
# from sklearn.model_selection import train_test_split
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.metrics import classification_report, confusion_matrix
# from sklearn import tree
# from scipy.spatial.distance import cdist
# from collections import Counter
# Data Setup
# set.seed(999)
# cancer <- read.csv("C:/Users/mroberts/Documents/GT/ISYE 6501/HW1/breast_cancer.txt")
# df <- data.frame(cancer)
# df = df[,-1]
# colnames(df)[1] <-"Thick"
# colnames(df)[7] <-"BlandChr"
# # df$Class <- ifelse(test = df$Class==2, yes = "Benign", no = "Malignant")
# df$Class <- ifelse(test = df$Class==2, yes = 0, no = 1)
# # df$Class <- as.factor(df$Class)
# # Show head
# kable(head(df))
# kable(table(df['Class']))
# # Get Counts
# benign_count <- table(df['Class'])[1]
# malignant_count<- table(df['Class'])[2]
# benign_perc <- round(table(df['Class'])[1]/nrow(df)*100,2)
# malignant_perc <- round(table(df['Class'])[2]/nrow(df)*100,2)
# df$BareNuc <- as.integer(df$BareNuc)
#
# Cleaning Data
# na_sum <- sum(is.na(df$BareNuc))
# percent_na <- round(na_sum/nrow(df)*100,2)
# # na_sum_malignant <- nrow(df[is.na(df$BareNuc) & df$Class == "Malignant",])
# # na_sum_benign <- nrow(df[is.na(df$BareNuc) & df$Class == "Benign",])
# na_sum_malignant <- nrow(df[is.na(df$BareNuc) & df$Class == 1,])
# na_sum_benign <- nrow(df[is.na(df$BareNuc) & df$Class == 0,])
# #Before Imputation
# kable(df[c(24,41),c(6,10)])
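# # Class-wise means of the observed BareNuc values (definitions assumed; not in the listing)
# ben_BN <- mean(df$BareNuc[df$Class == 0], na.rm = TRUE)
# malig_BN <- mean(df$BareNuc[df$Class == 1], na.rm = TRUE)
# df$BareNuc <- ifelse(is.na(df$BareNuc), ifelse(df$Class == 0, ben_BN, malig_BN), df$BareNuc)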
# #After Imputation
# kable(df[c(24,41),c(6,10)])
# # training and testing
# set.seed(999)
# # sample test rows without replacement so no row appears twice in the test set
# flag <- sort(sample(x = nrow(df), size = floor(nrow(df)/5), replace = FALSE))
# train <- df[-flag,]
# train_x <- train[,-10]
# train_y <- train[,10]
# test <- df[flag,]
# test_x <- test[,-10]
# test_y <- test[,10]
# #Random Forest
# set.seed(999)
# error_rates <- c()
# trees <- c()
#
# # optimize rf
# # upper bound of 5000 matches the range shown in the error plot
# for (i in seq(500,5000,100)){
#   rf_optimal <- randomForest(Class~., data = train, proximity=TRUE, ntree = i)
#   error_rates <- c(error_rates, round(min(rf_optimal$mse)*100, 2))
#   trees <- c(trees, i)
# }
# optimal_trees <- data.frame(trees, error_rates)
# opt_tree <- head(optimal_trees[order(error_rates, decreasing = FALSE),])
# opt_tree_num <- opt_tree$trees[1]
# opt_tree_err <- opt_tree$error_rates[1]
# colnames(opt_tree) <- c("# Trees", "Error Rate (%)")
# plot(trees, error_rates, type = 'b', xlab = "Number of Trees", ylab = "Error Rate",
#      main = "Error Rate per Number of Trees in Forest")
# kable(opt_tree)
# rf_opt <- randomForest(Class~., data = train, proximity=TRUE, ntree = opt_tree_num)
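# # index the error at the final tree of the fitted forest rather than at tree 500
# error <- round(rf_opt$mse[opt_tree_num] * 100, 2)
# correct <- round((1 - rf_opt$mse[opt_tree_num]) * 100, 2)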
# # Python tree viz
# X_train_binary = r.train_x
# Y_train_binary = r.train_y
# X_test_binary = r.test_x
# Y_test_binary = r.test_y
# ex_tree = DecisionTreeClassifier(min_samples_leaf = 24, max_depth = 2)
# ex_tree = ex_tree.fit(X_train_binary, Y_train_binary)
#
# # feature names come from the predictor columns only (Class is excluded)
# features_binary = list(r.train_x.columns)
# importance_binary = ex_tree.feature_importances_
# feat_imp_binary = dict(zip(features_binary,importance_binary))
# feat_imp_binary = {k:v for k,v in sorted(feat_imp_binary.items(),
#                    key= lambda item: item[1], reverse = True)}
# # feat_imp_binary
#
# fig_binary = plt.figure(figsize=(58,30))
# _ = tree.plot_tree(ex_tree, feature_names = features_binary, filled = True,
#                    class_names = ['Benign', 'Malignant'])
#
# preds_binary = ex_tree.predict(X_test_binary)
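# R CM stats
# rf_pred <- predict(rf_opt, test)
# rf_pred <- ifelse(rf_pred >= 0.5, 1, 0)
# # Next two lines assumed by analogy with the boosting and SVM steps; not in the listing
# r_cf_rf_matrix <- confusionMatrix(as.factor(test_y), as.factor(rf_pred))
# rf_accuracy <- round(r_cf_rf_matrix$overall[1]* 100,2)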
# # Boosting
# #optimize boost
# set.seed(999)
# boost <- gbm(Class~., data = train, distribution = 'bernoulli', cv.folds = 10)
#
# optimal_gbm <- gbm.perf(boost, method = 'cv')
# # optimal_gbm
# # n.trees set to 70 to match the prediction call below
# optimal_boost <- gbm(Class~., data = train, distribution = 'bernoulli', n.trees = 70)
# # summary(optimal_boost)
#
# boost_pred <- predict(optimal_boost, test, n.trees = 70, type = "response")
# boost_pred <- ifelse(boost_pred >= 0.5, 1, 0)
# # Python set up boosting confusion matrix
# cf_boost_matrix = confusion_matrix(r.test_y, r.boost_pred)
# cf_boost_formatted = sns.heatmap(cf_boost_matrix, annot = True, fmt = '.0f', cmap = 'Blues')
#
# #x labels
# cf_boost_formatted.set_xlabel("Predicted Diagnosis")
#
# # #y labels
# cf_boost_formatted.set_ylabel("Actual Diagnosis")
# plt.title("Gradient Boost Confusion Matrix")
# R CM stats
# r_boost_cf_matrix <- confusionMatrix(as.factor(test_y), as.factor(boost_pred))
# boost_accuracy <- round(r_boost_cf_matrix$overall[1]* 100,2)
# # SVM Image
# paint <- image_read("C:/Users/mroberts/Documents/GT/ISYE 7406/SVM_ex.png")
# paint
# #SVM setup
# svm_model <- svm(Class~., data = train, type = 'C-classification', kernel = 'linear')
# # summary(svm_model)
# svm_predict <- predict(svm_model, test[,-10])
# svm_predict <- as.numeric(svm_predict)
# svm_predict <- svm_predict-1
# SVM CM and stats, Python and R
# #set up confusion matrix
# cf_svm_matrix = confusion_matrix(r.test_y, r.svm_predict)
# cf_svm_formatted = sns.heatmap(cf_svm_matrix, annot = True, fmt = '.0f', cmap = 'Blues')
#
# #x labels
# cf_svm_formatted.set_xlabel("Predicted Diagnosis")
#
# # #y labels
# cf_svm_formatted.set_ylabel("Actual Diagnosis")
# plt.title("Support Vector Machine Confusion Matrix")
# r_cf_svm_matrix <- confusionMatrix(as.factor(test_y), as.factor(svm_predict))
# svm_accuracy <- round(r_cf_svm_matrix$overall[1]* 100,2)
#Results dataframe
# # third entry of each vector uses the SVM confusion matrix
# true_positives <- c(r_cf_rf_matrix$table[1], r_boost_cf_matrix$table[1],
#                     r_cf_svm_matrix$table[1])
# true_negatives <- c(r_cf_rf_matrix$table[4], r_boost_cf_matrix$table[4],
#                     r_cf_svm_matrix$table[4])
# false_positives <- c(r_cf_rf_matrix$table[3], r_boost_cf_matrix$table[3],
#                      r_cf_svm_matrix$table[3])
# false_negatives <- c(r_cf_rf_matrix$table[2], r_boost_cf_matrix$table[2],
#                      r_cf_svm_matrix$table[2])
# accuracy <- c(rf_accuracy, boost_accuracy, svm_accuracy)
# labels <- c("Random Forest", "Gradient Boost", "Support Vector")
# compare <- data.frame(labels, true_positives, true_negatives, false_positives,
#                       false_negatives, accuracy)
# colnames(compare) <- c("Method", "True Positive", "True Negative", "False Positive",
#                        "False Negative", "Accuracy")
# kable(compare)