HW-5
2023-10-23

Abstract

Accurate cancer diagnosis is difficult. This project compares three classification methods on breast cancer samples in order to determine which method is the most accurate. Random Forest classification is known as a black box: a fine method of classification that requires complete trust, as its methods are often difficult to explain. Boosting takes weak learners and creates an effective system of strong learners. Support Vector Machines determine hyperplanes to classify binary data. The goal is to determine which of these three options, when tuned properly, best predicts malignant and benign diagnoses in breast tissue samples. These supervised learning methods will be applied to 9 quantifiable characteristics of breast tumors, along with the correct diagnosis, in order to predict the diagnoses of future samples. The characteristics are measured by microscopic examination and scaled 1-10 depending on the findings.

Intro

I chose the UCI breast cancer dataset. In my initial cleanup of the data, I set the Class as a factor since there were only two options. Upon working with Boosting and the gbm() function, I realized that the factors were breaking the cross validation, so I reverted them back to integers. Going forward, a 0 signifies a benign tumor and a 1 signifies a malignant tumor. I also removed the ID column, since it is not necessary for my analysis (this setup is sketched just after the variable descriptions below).

Thick  UnifSize  UnifShape  MargAd  SglEpSize  BareNuc  BlandChr  NormNuc  Mitosis  Class
    5         1          1       1          2        1         3        1        1      0
    5         4          4       5          7       10         3        2        1      0
    3         1          1       1          2        2         3        1        1      0
    6         8          8       1          3        4         3        7        1      0
    4         1          1       3          2        1         3        1        1      0
    8        10         10       8          7       10         9        7        1      1

The Data

The columns have shortened names for ease of coding, but here is an explanation of all the independent variables:

1. Clump Thickness (1-10): cancer cells group together, which is why doctors check for lumps. A higher thickness is more indicative of malignancy.
2. Uniformity of Cell Size (1-10): cancer cells generally grow in unexpected shapes and sizes, so lower uniformity of cell size is more indicative of malignancy.
3. Uniformity of Cell Shape (1-10): similar to cell size, lower uniformity in cell shape is more indicative of malignancy.
4. Marginal Adhesion (1-10): malignant cells are less adhesive than regular cells, which is how metastasis occurs, so lower marginal adhesion indicates malignancy.
5. Single Epithelial Cell Size (1-10): epithelial cells are surface cells, whether external or internal. Since they are generally uniform, a larger single epithelial cell size is more indicative of malignancy.
6. Bare Nuclei (1-10): nuclei without surrounding cytoplasm. A lower bare nuclei count indicates malignancy.
7. Bland Chromatin (1-10): uniformity in the texture of the nucleus. Less uniform texture tends to be malignant.
8. Normal Nucleoli (1-10): the nucleolus is a small structure inside the nucleus, and abnormal cells such as cancer cells typically have fewer normal-looking nucleoli.
9. Mitoses (1-10): an estimate of the number of times mitosis has taken place. Since malignant cells reproduce faster, a higher number is indicative of malignancy.
10. Class (0-1): the classification, or diagnosis, either malignant (1) or benign (0).
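A minimal sketch of that setup, mirroring the Data Setup chunk in the Appendix (the file path is a hypothetical local copy of the UCI file, which codes Class as 2 for benign and 4 for malignant):

    # initial cleanup: drop the ID column, shorten two names, recode Class to 0/1
    cancer <- read.csv("breast_cancer.txt")    # hypothetical local copy of the UCI file
    df <- data.frame(cancer)
    df <- df[, -1]                             # drop the ID column
    colnames(df)[1] <- "Thick"
    colnames(df)[7] <- "BlandChr"
    df$Class <- ifelse(df$Class == 2, 0, 1)    # 0 = benign, 1 = malignant, kept as integers for gbm()
    head(df)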
Class  Freq
    0   458
    1   241

There are 458 benign tumors, 65.52% of the samples, and 241 malignant tumors, 34.48% of the samples.

In my cleaning and setup, I noticed one column of numbers was being treated as characters. On casting to integers, I got a warning that NAs were introduced by coercion, meaning there were some non-integer entries, probably some "NA" strings.

Number of NA values in BareNuc: 16
Percentage of rows with NA values in the data: 2.29%
Number of malignant NA values: 2
Number of benign NA values: 14

Per Dr. Sokol in ISYE 6501, since less than 5% of the data is NA, I will impute the missing values based on the mean BareNuc count of the other points in their class. I don't want to take the mean of the whole dataset: since we are trying to distinguish between malignant and benign, an average of the two classes together may not be as accurate as the two averages taken separately.

Before imputation:

      BareNuc  Class
24         NA      1
41         NA      0

After imputation:

      BareNuc  Class
24       7.63      1
41       1.35      0

The mean bare nucleus count for malignant cells is 7.63 and the mean bare nucleus count for benign cells is 1.35. Specimen 24, a malignant tumor, has been imputed to 7.63. Specimen 41, a benign tumor, has been imputed to 1.35. Seeing the averages now, I am convinced that averaging each Class individually was the right choice.
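The class-wise means and the imputation step only take a few lines; this is a minimal sketch assuming the df built above, with Class coded 0/1:

    # class-wise mean imputation for BareNuc
    malig_BN <- round(mean(df$BareNuc[df$Class == 1], na.rm = TRUE), 2)   # about 7.63
    ben_BN   <- round(mean(df$BareNuc[df$Class == 0], na.rm = TRUE), 2)   # about 1.35

    df$BareNuc <- ifelse(is.na(df$BareNuc),
                         ifelse(df$Class == 0, ben_BN, malig_BN),
                         df$BareNuc)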
Scaling

Ordinarily, I would scale the data first, but all columns except for the malignant/benign classifier are already on the same 1-10 scale, so there is no need for additional scaling.

Analysis

Now that the data has been cleaned, we can begin the actual analysis.

A: Random Forest

Random Forests are fascinating. First and foremost, they are an ensemble method of prediction built on "bagging": many trees are grown on resampled data, and we track the error as trees are added to the forest. One benefit is that Random Forests reduce overfitting, or at least the risk of overfitting. They are also easy to build and can handle both regression and classification problems, although in my experience I have only used them for classification. One of the main critiques of Random Forests is that they are considered "black box" models: you see your inputs and your outputs, but it is difficult to explain the jump from one to the other. Under the hood, the data for each tree is selected by bootstrapping, a method of sampling with replacement.

One tunable parameter in Random Forests is the number of trees. Here is a test of the error rates for different numbers of trees in the forest, along with a small table of the tree counts with the lowest errors. Clearly, the Random Forest error plot isn't like some plots we use where error decreases the more neighbors we use or the more unsupervised K groups we add. The errors appear to gradually collapse around 2.625% error as more trees are added.

[Figure: Error Rate per Number of Trees in Forest; error rate (roughly 2.55-2.70%) vs. number of trees (up to 5,000)]

      # Trees  Error Rate (%)
12       1600            2.55
13       1700            2.55
 4        800            2.57
 6       1000            2.57
10       1400            2.58
17       2100            2.58

I grabbed all the errors and the numbers of trees they correspond to, sorted by lowest error, and displayed the first few. I'll just use the top value for my tree count: 1600 trees, with an error of 2.55%.
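For reference, the sweep behind the plot and table above can be sketched as follows. The grid here is an assumption (500 to 5,000 trees in steps of 100, which is consistent with the tree counts in the table), and the error metric matches the Appendix: minimum out-of-bag MSE scaled to a percentage, since Class is kept as a 0/1 integer.

    # sweep the number of trees and record the report's error metric
    library(randomForest)
    set.seed(999)

    trees <- seq(500, 5000, 100)                 # assumed grid, consistent with the table above
    error_rates <- sapply(trees, function(n) {
      rf <- randomForest(Class ~ ., data = train, ntree = n)
      round(min(rf$mse) * 100, 2)                # Class is numeric, so randomForest reports OOB MSE
    })

    plot(trees, error_rates, type = "b",
         xlab = "Number of Trees", ylab = "Error Rate",
         main = "Error Rate per Number of Trees in Forest")
    head(data.frame(trees, error_rates)[order(error_rates), ])

This refits the forest at every grid point, so it can take a while.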
Unfortunately, there isn't much else to show graphically when working with Random Forests. However, I can show an example tree out of the forest, because I think they're fun:

[Figure: an example two-level classification tree, shaded by node purity (plotted with scikit-learn; see the Appendix)]

So what can we see here? We start at the top box and see that it is slightly orange and has class = Benign at the bottom of the box. That means if we do nothing else, we call the sample benign. This matches the breakdown at the top of the report, where the benign-to-malignant count is about 2:1, and we can see value = [360, 200], meaning there are 360 benign and 200 malignant samples in the training set. Moving down a level, if we are only going to take one measurement, it should be the size uniformity, shown at the top of the top box. True is to the right, and false is to the left. If UnifSize is less than or equal to 2.5, the sample is classified as malignant, in blue; if it is greater than 2.5, it is benign. We move down from there if we want to take more measurements.

The different shades are different "purity" levels, i.e. how cleanly the classification in a box holds. If you look at the middle, lighter orange boxes on the third row, you can see in both values that the breakdown is much closer to 1:1 than in the bottom outer boxes, where value = [315, 1] and [14, 180]. We could cut the benign options off at the first level, since it looks like everything with UnifSize > 2.5 is benign; but this is more a fun example of a tree.

Regarding overfitting, mentioned above, we can see the smallest sample size is 25, in the bottom middle boxes. Obviously, we could categorize and classify all the way down to one sample per box, but that model would be badly overfit. Best practice is to set the minimum number of samples allowed in a leaf to the square root of the total number of samples. In this tree, we have 560 total samples, and √560 ≈ 24, so we are right about there.

For the test data set, the Random Forest confusion matrix shows 94.96% accuracy. Pretty excellent, in my opinion.
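The example tree above was drawn with scikit-learn's DecisionTreeClassifier (see the Appendix). A comparable sketch in R, using rpart (already loaded in the Appendix) and the square-root rule for the minimum leaf size, might look like this:

    # a two-level example tree with a square-root minimum leaf size
    library(rpart)
    library(rpart.plot)

    min_leaf <- round(sqrt(nrow(train)))        # about 24 for roughly 560 training rows
    ex_tree  <- rpart(factor(Class) ~ ., data = train, method = "class",
                      control = rpart.control(minbucket = min_leaf, maxdepth = 2))
    rpart.plot(ex_tree)                         # shaded by node purity, like the figure above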
Let's see what we can do with Boosting.

Boosting

Boosting is also interesting. It takes many single-split decision trees, often referred to as "stumps", that are just barely better than random guessing, and places weights on their predictions based on how many samples they correctly classify. Boosting is the process of combining the weighted stumps into one system. For example, if one stump is great at diagnosing based on size uniformity and another is great at diagnosing based on bare nuclei count, those stumps will be weighted more heavily and combined. One of the main goals of Boosting is for each new stump to correct the errors of the previous ones. I have opted to use Gradient Boosting, which often outperforms Random Forests, though not always.

[Figure: Bernoulli deviance vs. boosting iteration (0-100), with the cross-validated optimum marked by the blue line]

What I'm showing here is the optimal number of trees to boost, marked with the blue line. I will set my optimal model to 70 trees, per the recommendation. For the same test data, the Boosting confusion matrix shows an accuracy of 96.4%, just about the same as the Random Forest.
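A condensed sketch of the boosting fit, mirroring the Appendix and assuming the same train/test split; gbm.perf() returns the cross-validated estimate of the optimal number of trees marked by the blue line above:

    # cross-validated gradient boosting with gbm
    library(gbm)
    set.seed(999)

    boost      <- gbm(Class ~ ., data = train, distribution = "bernoulli", cv.folds = 10)
    best_trees <- gbm.perf(boost, method = "cv")   # optimal iteration count from cross validation

    boost_prob <- predict(boost, test, n.trees = best_trees, type = "response")
    boost_pred <- ifelse(boost_prob >= 0.5, 1, 0)  # 0.5 cutoff: 1 = malignant, 0 = benign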
B: Baseline Method

I am going to use a Support Vector Machine as my baseline comparison method. Since we already have the class labels, a supervised approach is more appropriate. Fans of my previous homework assignments will know that SVMs are my favorite, so I'll take any excuse to make one. Support Vector Machines (SVMs) are similar in spirit to Linear Discriminant Analysis, except that they work directly with the predictors rather than derived components. They are excellent tools for discriminating between binary responses, such as who will be approved for a bank loan, whether a car will be above or below a certain mpg, or, in this case, whether a biopsy will come back malignant or benign. I love SVMs enough that I referenced them in homeworks 1 and 3, so I'll include the same expertly drawn example I made in Paint.

[Figure: hand-drawn example of two classes split by a diagonal hyperplane, with a new point circled in blue]

When looking at the image, you'll see a new point being added in the blue circle. The Support Vector Machine would find the diagonal hyperplane to split the data, as opposed to a different baseline method like K-Nearest Neighbors, which would ignore the line in favor of the neighbors in the blue circle. If you count the 7 nearest neighbors, you'd see that the new point would be classified as a green star, whereas SVM would classify it as a red circle.

Our baseline Support Vector Machine finished with 96.4% accuracy, again similar to the others.
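The baseline fit mirrors the SVM setup in the Appendix; a minimal sketch, assuming the same train/test split with Class in column 10:

    # linear support vector machine baseline
    library(e1071)

    svm_model   <- svm(Class ~ ., data = train, type = "C-classification", kernel = "linear")
    svm_predict <- predict(svm_model, test[, -10])   # predict from the 9 predictors only
    svm_predict <- as.numeric(svm_predict) - 1       # factor levels "0"/"1" back to integers 0/1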
Results & Conclusion

Method          True Positive  True Negative  False Positive  False Negative  Accuracy
Random Forest              91             41               2               5     94.96
Gradient Boost             91             43               2               3     96.40
Support Vector             91             41               2               5     96.40

Hopefully it is obvious that I would never claim that this predictor is a substitute for a doctor's diagnosis. That being said, I will continue under the assumption that we are diagnosing breast cancer with this tool. For this particular data set, it appears that it is better to use Gradient Boosting or Support Vector Machines to diagnose, or predict a diagnosis, than to use Random Forest.

If I were to make this tool even better, I would determine which is more important: over-diagnosing or under-diagnosing malignant tumors. We can tune all three of these machines to never miss a malignant case. The trade-off is that we will likely diagnose some people as having malignant cancer when the tumor is actually benign, causing undue stress, medical expenses, and damage to health through the typical chemotherapy. On the other hand, if we would rather avoid those cases, we could tune the machines toward benign classifications, so that no one needlessly goes through such a difficult treatment. That trade-off is likely missing malignant cases, leading to a higher rate of metastasis and death. This assignment is not for that purpose, and I am content with the middle-of-the-road diagnoses presented by the three options.
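That tuning mostly amounts to moving the probability cutoff away from 0.5. One minimal way to sketch it, assuming the gradient boosting probabilities boost_prob and the test labels test_y from the earlier sketches: a lower cutoff catches more malignant cases at the cost of more false positives.

    # trade specificity for sensitivity by lowering the classification cutoff
    library(caret)

    cutoff <- 0.2                                   # hypothetical, more aggressive than the default 0.5
    pred_aggressive <- ifelse(boost_prob >= cutoff, 1, 0)

    # with positive = "1", sensitivity is the share of malignant cases caught
    confusionMatrix(factor(pred_aggressive, levels = c(0, 1)),
                    factor(test_y, levels = c(0, 1)),
                    positive = "1")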
Appendix

#R setup
# set.seed(999)
# rm(list = ls())
# tinytex::install_tinytex(force = TRUE)
# library(reticulate)
# library(caret)
# library(ggplot2)
# library(randomForest)
# library(dplyr)
# library(kableExtra)
# library(e1071)
# library(magick)
# library(rpart)
# library(rpart.plot)
# library(gbm)

# python setup
# import warnings
# warnings.filterwarnings("ignore")
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# import random as rnd
#
#
# from sklearn.cluster import KMeans
# from sklearn import metrics
# from sklearn.model_selection import train_test_split
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.metrics import classification_report, confusion_matrix
# from sklearn import tree
# from scipy.spatial.distance import cdist
# from collections import Counter

# Data Setup
# set.seed(999)
# cancer <- read.csv("C:/Users/mroberts/Documents/GT/ISYE 6501/HW1/breast_cancer.txt")
# df <- data.frame(cancer)
# df = df[,-1]
# colnames(df)[1] <- "Thick"
# colnames(df)[7] <- "BlandChr"
#
# df$Class <- ifelse(test = df$Class==2, yes = "Benign", no = "Malignant")
# df$Class <- ifelse(test = df$Class==2, yes = 0, no = 1)
#
# df$Class <- as.factor(df$Class)

# Show head
# kable(head(df))
# kable(table(df['Class']))

# Get Counts
# benign_count <- table(df['Class'])[1]
# malignant_count <- table(df['Class'])[2]
# benign_perc <- round(table(df['Class'])[1]/nrow(df)*100,2)
# malignant_perc <- round(table(df['Class'])[2]/nrow(df)*100,2)
# df$BareNuc <- as.integer(df$BareNuc)

# Cleaning Data
# na_sum <- sum(is.na(df$BareNuc))
# percent_na <- round(na_sum/nrow(df)*100,2)
#
# na_sum_malignant <- nrow(df[is.na(df$BareNuc) & df$Class == "Malignant",])
#
# na_sum_benign <- nrow(df[is.na(df$BareNuc) & df$Class == "Benign",])
# na_sum_malignant <- nrow(df[is.na(df$BareNuc) & df$Class == 1,])
# na_sum_benign <- nrow(df[is.na(df$BareNuc) & df$Class == 0,])
# malig_BN <- round(mean(df$BareNuc[df$Class == 1], na.rm = TRUE), 2)
# ben_BN <- round(mean(df$BareNuc[df$Class == 0], na.rm = TRUE), 2)

#Before Imputation
# kable(df[c(24,41),c(6,10)])
# df$BareNuc <- ifelse(is.na(df$BareNuc), ifelse(df$Class == '0', ben_BN, malig_BN), df$BareNuc)

#After Imputation
# kable(df[c(24,41),c(6,10)])

# training and testing
# set.seed(999)
# flag <- sort(sample(x = nrow(df), size = nrow(df)/5, replace = TRUE))
# train <- df[-flag,]
# train_x <- train[,-10]
# train_y <- train[,10]
# test <- df[flag,]
# test_x <- test[,-10]
# test_y <- test[,10]

#Random Forest
# set.seed(999)
# error_rates <- c()
# trees <- c()
#
# optimize rf
# for (i in seq(500,5000,100)){
#   rf_optimal <- randomForest(Class~., data = train, proximity=TRUE, ntree = i)
#   error_rates <- c(error_rates, round(min(rf_optimal$mse)*100, 2))
#   trees <- c(trees, i)
# }
# optimal_trees <- data.frame(trees, error_rates)
# opt_tree <- head(optimal_trees[order(error_rates, decreasing = FALSE),])
# opt_tree_num <- opt_tree$trees[1]
# opt_tree_err <- opt_tree$error_rates[1]
# colnames(opt_tree) <- c("# Trees", "Error Rate (%)")
# plot(trees, error_rates, type = 'b', xlab = "Number of Trees", ylab = "Error Rate",
#      main = "Error Rate per Number of Trees in Forest")
# kable(opt_tree)
# rf_opt <- randomForest(Class~., data = train, proximity=TRUE, ntree = opt_tree_num)
# error <- round(rf_opt$mse[500] * 100, 2)
# correct <- round((1 - rf_opt$mse[500]) * 100, 2)

# Python tree viz
# X_train_binary = r.train_x
# Y_train_binary = r.train_y
# X_test_binary = r.test_x
# Y_test_binary = r.test_y
# ex_tree = DecisionTreeClassifier(min_samples_leaf = 24, max_depth = 2)
# ex_tree = ex_tree.fit(X_train_binary, Y_train_binary)
#
# features_binary = r.df.columns
# importance_binary = ex_tree.feature_importances_
# feat_imp_binary = dict(zip(features_binary, importance_binary))
# feat_imp_binary = {k:v for k,v in sorted(feat_imp_binary.items(),
#                                          key = lambda item: item[1], reverse = True)}
#
# feat_imp_binary
#
# fig_binary = plt.figure(figsize=(58,30))
# _ = tree.plot_tree(ex_tree, feature_names = features_binary, filled = True,
#                    class_names = {0: 'Benign', 1: 'Malignant'})
#
# preds_binary = ex_tree.predict(X_test_binary)

# R CM stats
# rf_pred <- predict(rf_opt, test)
# rf_pred <- ifelse(rf_pred >= 0.5, 1, 0)
# r_cf_rf_matrix <- confusionMatrix(as.factor(test_y), as.factor(rf_pred))
# rf_accuracy <- round(r_cf_rf_matrix$overall[1] * 100, 2)

# Boosting
# #optimize boost
# set.seed(999)
# boost <- gbm(Class~., data = train, distribution = 'bernoulli', cv.folds = 10)
#
# optimal_gbm <- gbm.perf(boost, method = 'cv')
#
# optimal_gbm
# optimal_boost <- gbm(Class~., data = train, distribution = 'bernoulli', n.trees = 70)
#
# summary(optimal_boost)
#
# boost_pred <- predict(optimal_boost, test, n.trees = 70, type = "response")
# boost_pred <- ifelse(boost_pred >= 0.5, 1, 0)

# Python set up boosting confusion matrix
# cf_boost_matrix = confusion_matrix(r.test_y, r.boost_pred)
# cf_boost_formatted = sns.heatmap(cf_boost_matrix, annot = True, fmt = '.0f', cmap = 'Blues')
#
# #x labels
# cf_boost_formatted.set_xlabel("Predicted Diagnosis")
#
#
# #y labels
# cf_boost_formatted.set_ylabel("Actual Diagnosis")
# plt.title("Gradient Boost Confusion Matrix")

# R CM stats
# r_boost_cf_matrix <- confusionMatrix(as.factor(test_y), as.factor(boost_pred))
# boost_accuracy <- round(r_boost_cf_matrix$overall[1] * 100, 2)

# SVM Image
# paint <- image_read("C:/Users/mroberts/Documents/GT/ISYE 7406/SVM_ex.png")
# paint

# #SVM setup
# svm_model <- svm(Class~., data = train, type = 'C-classification', kernel = 'linear')
#
# summary(svm_model)
# svm_predict <- predict(svm_model, test[,-10])
# svm_predict <- as.numeric(svm_predict)
# svm_predict <- svm_predict-1

# SVM CM and stats, Python and R
# #set up confusion matrix
# cf_svm_matrix = confusion_matrix(r.test_y, r.svm_predict)
# cf_svm_formatted = sns.heatmap(cf_svm_matrix, annot = True, fmt = '.0f', cmap = 'Blues')
#
# #x labels
# cf_svm_formatted.set_xlabel("Predicted Diagnosis")
#
#
# #y labels
# cf_svm_formatted.set_ylabel("Actual Diagnosis")
# plt.title("Support Vector Machine Confusion Matrix")
# r_cf_svm_matrix <- confusionMatrix(as.factor(test_y), as.factor(svm_predict))
# svm_accuracy <- round(r_cf_svm_matrix$overall[1] * 100, 2)

#Results dataframe
# true_positives <- c(r_cf_rf_matrix$table[1], r_boost_cf_matrix$table[1],
#                     r_cf_svm_matrix$table[1])
# true_negatives <- c(r_cf_rf_matrix$table[4], r_boost_cf_matrix$table[4],
#                     r_cf_svm_matrix$table[4])
# false_positives <- c(r_cf_rf_matrix$table[3], r_boost_cf_matrix$table[3],
#                      r_cf_svm_matrix$table[3])
# false_negatives <- c(r_cf_rf_matrix$table[2], r_boost_cf_matrix$table[2],
#                      r_cf_svm_matrix$table[2])
# accuracy <- c(rf_accuracy, boost_accuracy, svm_accuracy)
# labels <- c("Random Forest", "Gradient Boost", "Support Vector")
# compare <- data.frame(labels, true_positives, true_negatives, false_positives,
#                       false_negatives, accuracy)
# colnames(compare) <- c("Method", "True Positive", "True Negative", "False Positive",
#                        "False Negative", "Accuracy")
# kable(compare)