EE-267 Term Project Proposal-Karthik_Bharadwaj

School

San Jose State University

Course

267

Subject

Electrical Engineering

Date

Jan 9, 2024

Karthik Bharadwaj 016042273
Lung and Colon Cancer Prediction using Machine Learning Techniques
EE-267 COMPUTER VISION WITH AI APPLICATIONS
Term Project Final Report
By Karthik Bharadwaj 0160422783
Department of Electrical Engineering
San Jose State University
San Jose, CA

Abstract
The death rate from cancer is one out of five worldwide. The incidence of lung and colon cancers has increased dramatically over the past few decades, and they have become some of the most serious medical problems that people face worldwide. Doctors have traditionally had to spend a great deal of time and energy examining histopathological images to diagnose cancer patients. The goal of this paper is to give a high-level understanding of the different types of lung and colon cancer and of how machine learning and deep learning techniques can be used to diagnose them. The survey analyzes and compares articles and research papers on detecting lung and colon cancer with machine learning techniques, including SVM (Support Vector Machines), PCA, and AdaBoost, and with deep learning techniques, including CNNs (Convolutional Neural Networks) and transfer learning. The accuracy of the trained models reported in these papers is compared across the machine learning and deep learning approaches. On histopathology images, deep learning techniques outperformed the other methods for identifying or diagnosing cancer.

1. Detailed Description of the Project
1.1. Introduction
The term cancer refers to a group of diseases characterized by the development of abnormal cells in the body, caused by random mutations. These cells can spread out of control and invade human organs, and if untreated this can result in the death of the individual. Cancer cells can form in any region of the body. The organs most commonly affected include the liver, stomach, skin, lungs, breasts, brain, colon, and rectum. Lung and colon cancer are two of the most severe cancers worldwide and have become a primary focus of medical attention. Approximately a quarter of all cancers occur in the lungs and colon. Early diagnosis and treatment are the only means of preventing lung and colon cancer-related mortality. The disease is, however, difficult to combat in many developing and underdeveloped countries because they lack access to good diagnostic services, which leaves them defenseless against lung and colon cancer. To address this problem, these countries would have to invest heavily in medical diagnostic systems, laboratories, and qualified personnel. Such goals cannot be achieved overnight, given the economic status of the countries concerned. The limitations of conventional diagnostic techniques for detecting lung and colon cancer motivate the development of advanced diagnostic techniques that rely on artificial intelligence. Artificial intelligence has shown great promise for these diagnostic tasks and may be considered a viable alternative to traditional methods of diagnosis.
We therefore apply several deep learning and machine learning techniques, namely Convolutional Neural Networks, Transfer Learning, SVM, KNN, NBC, Decision Trees, and Random Forests, to determine how well these models perform on the available dataset. We then compare the models to determine which performs best and why.
2. Objectives
We chose to implement some of the techniques discussed above: KNN, NBC, SVM, Decision Tree, Random Forest, and a few deep learning (CNN) models. Because our problem involves a set of images, we preprocess the data by resizing the images to a 32x32 resolution.
2.1. Methodology
2.1.1. Classic machine learning techniques
Multiple classical machine learning algorithms were tried on the available dataset to find the best-performing model under similar conditions. The classical machine learning algorithms
include SVM (Support Vector Machines), XGBoost, Random Forest, and KNN (K Nearest Neighbors). Before the dataset was fed to these ML models, all images were resized to 256 x 256 and converted to grayscale. For feature extraction, the following 10 features were extracted from the images: variance, mean, standard deviation, skew, kurtosis, entropy, Canny edges, local binary pattern (LBP), Sobel X, and Sobel Y. These features were used to train separate models, one for lung cancer detection and another for colon cancer. A train-test split of 80-20 was used, and 5-fold cross-validation was performed on each model to check that the model and the data are consistent and that there is no overfitting. cross_val_score from sklearn.model_selection was used to perform the cross-validation.
2.1.2. Support vector machine
SVM (Support Vector Machine) is a classifier that maps the data into a feature space and finds a hyperplane that separates the classes. We chose SVM as one of our models because of its flexibility. Although SVMs are best suited to binary classification, they have been shown to perform well for multiclass classification under certain conditions, so we decided to experiment with them. The SVM package from sklearn was used to implement this model, and GridSearchCV was used to tune the hyperparameters. Linear, RBF, and polynomial kernels were tried, with polynomial degrees from 2 to 10. Of the three kernels, the linear kernel performed best, with an accuracy of 91% on the lung cancer data and 83% on the colon cancer data, as shown in figure 1 and 2.
Fig 1. Lung Cancer Classification
Fig 2. Colon Cancer Classification
2.1.3. Transfer learning using EfficientNetB7
EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient. Unlike conventional practice, which scales these factors arbitrarily, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients, as shown in figure 3. EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.
Fig 3. EfficientNetB7 architecture
2.1.4. Transfer Learning using VGG19
VGG19 was proposed by the Visual Geometry Group at Oxford University. It is considered a successor of AlexNet, and it is one of the few models on the ImageNet top-30 board with a comparatively shallow architecture, as shown in figure 3.
Fig 3. VGG19 model architecture.
2.1.5. Transfer Learning using InceptionV3
The InceptionV3 model was published by Google AI. It is also known as GoogLeNet version 3. It is mainly known for its deeper architecture with comparatively few parameters, as shown in figure 4.
Fig 4. Inception V3 model architecture.
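The classical pipeline of Sections 2.1.1 and 2.1.2 (feature extraction, 80-20 split, GridSearchCV over SVM kernels, 5-fold cross-validation) might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the images are synthetic noise, only the six statistical features are computed (the Canny, LBP, and Sobel features are omitted), the StandardScaler step, the C values, and the shortened degree range are all hypothetical choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def statistical_features(gray):
    """Six of the ten features named in 2.1.1, for one grayscale image."""
    x = gray.astype(np.float64).ravel()
    mean, var = x.mean(), x.var()
    std = np.sqrt(var)
    z = (x - mean) / (std + 1e-12)
    skew = np.mean(z ** 3)
    kurtosis = np.mean(z ** 4) - 3.0  # excess kurtosis
    hist, _ = np.histogram(x, bins=256, range=(0, 256))
    p = hist[hist > 0] / x.size
    entropy = -np.sum(p * np.log2(p))  # Shannon entropy of the intensity histogram
    return np.array([var, mean, std, skew, kurtosis, entropy])


def make_synthetic_images(n_per_class=40, size=32, seed=0):
    """Stand-in data (assumption): two classes of Gaussian-noise images."""
    rng = np.random.default_rng(seed)
    images, labels = [], []
    for label, (loc, scale) in enumerate([(100, 20), (140, 35)]):
        for _ in range(n_per_class):
            images.append(rng.normal(loc, scale, (size, size)).clip(0, 255))
            labels.append(label)
    return images, np.array(labels)


images, y = make_synthetic_images()
X = np.stack([statistical_features(img) for img in images])

# 80-20 train-test split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Kernels from 2.1.2; the degree range is shortened here to keep the search fast.
param_grid = [
    {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["poly"], "svc__degree": [2, 3, 4], "svc__C": [0.1, 1, 10]},
]
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=5)
search.fit(X_train, y_train)

# 5-fold cross-validation of the selected model on the training portion.
cv_scores = cross_val_score(search.best_estimator_, X_train, y_train, cv=5)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, cv_scores.mean(), test_accuracy)
```

Wrapping the scaler and the SVM in one pipeline keeps the scaling statistics inside each cross-validation fold, so the held-out fold does not leak into the fit.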

3. Results
We monitor the model's accuracy and loss during training and validation to decide whether to continue training or change any parameters. The validation accuracy is expected to rise and eventually approach the training accuracy, while the validation loss is expected to reach a minimum. Validation accuracy tends to rise as the model is trained for more epochs. After a certain number of epochs, however, the validation accuracy will decrease if the model overfits the training data. To reduce overfitting in such cases, dropout and regularization settings are tested. The image below shows an example model performance chart after training.
1. EfficientNetB7
EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient. Unlike conventional practice, which scales these factors arbitrarily, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients, as shown in figure 6. EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. It is also one of the top-performing models on the ImageNet challenge with comparatively few parameters, as shown in figure 5.
Fig 5. ImageNet competition
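To make the compound scaling idea concrete, the short sketch below computes depth, width, and resolution multipliers from a single compound coefficient. The base coefficients (1.2, 1.1, 1.15) are the values reported for the EfficientNet-B0 baseline in the EfficientNet paper, not values stated in this report, and the coefficient name phi and both helper functions are hypothetical names used only for illustration.

```python
# Base coefficients for depth, width, and resolution (assumption: taken from
# the EfficientNet paper's B0 baseline, not from this report).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate growth in FLOPs: depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

Because ALPHA * BETA^2 * GAMMA^2 is close to 2, each unit increase in phi roughly doubles the compute, which is what lets one coefficient control all three dimensions together.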

Fig 6. EfficientNetB7 architecture
Fig 7. Lung cancer training and validation accuracy per epoch for EfficientNetB7
Fig 8. Colon cancer training and validation accuracy per epoch for EfficientNetB7
EfficientNetB7 gave accuracies of 94% and 97% for lung and colon cancer classification, respectively. Figures 7 and 8 show the training accuracy increasing with every epoch while the model was trained on the lung and colon data, respectively. Figures 9 and 10 show the confusion matrices for both trained models. For all three classes, there are fewer false negatives than with the classical machine learning models shown previously.
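The false-negative comparison above reads directly off a confusion matrix. The sketch below shows one way to compute per-class false negatives and recall; the labels are made-up toy values for three classes, not the project's predictions, and the helper names are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def per_class_recall(cm):
    """Return (false negatives, recall) for each class."""
    true_pos = np.diag(cm)
    support = cm.sum(axis=1)        # number of true samples per class
    false_neg = support - true_pos  # samples of the class predicted as another class
    return false_neg, true_pos / support


# Toy labels for three classes (illustrative only).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
false_neg, recall = per_class_recall(cm)
accuracy = np.trace(cm) / cm.sum()
print(cm, false_neg, recall, accuracy, sep="\n")
```

For a diagnostic task, recall per class is worth reporting alongside accuracy, because a missed cancer case (a false negative) is usually the costlier error.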

Fig 9. Lung Cancer prediction confusion matrix using EfficientNetB7
Fig 10. Colon Cancer prediction confusion matrix using EfficientNetB7
Figures 11 and 12 show the ROC curves and their AUC for the lung and colon cancer prediction models using EfficientNetB7. As with the false negative counts in the confusion matrices, the areas under the curves are high enough to consider the models good at differentiating positive and negative cases.
Fig 11. AUC of ROC Curve for lung cancer prediction using EfficientNetB7
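As a reminder of what the area under the ROC curve measures, the sketch below computes it for a binary (or one-vs-rest) case as the fraction of positive-negative pairs in which the positive sample gets the higher score. The labels and scores are invented toy values, and roc_auc is a hypothetical helper written for illustration rather than the function used in the project.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = cancer class, 0 = all other classes (illustrative only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to random ranking.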

Fig 12. AUC of ROC Curve for colon cancer prediction using EfficientNetB7
2. VGG19
VGG19 gave accuracies of 95% and 98% for the lung and colon cancer models, respectively, as shown in figure 13 and 14. These are the highest accuracies achieved in this study.
Fig 13. Lung cancer training and validation accuracy per epoch for VGG19
Fig 13. Colon cancer training and validation accuracy per epoch for VGG19
These models also have good recall, with very few false negatives, as shown in figure 14 and 15. In fact, they are better than all other models in this study, which makes VGG19 a suitable model for this problem.

Fig 14. Lung Cancer prediction confusion matrix using VGG19
Fig 15. Colon Cancer prediction confusion matrix using VGG19
Figures 16 and 17 show the ROC curves and their AUC for both models. As with the other scores, the areas under the curves are very good, indicating that the models are very good at differentiating positive and negative cases.
Fig 16. AUC of ROC Curve for lung cancer prediction using VGG19

Fig 17. AUC of ROC Curve for colon cancer prediction using VGG19
3. Inception V3
The InceptionV3 model gave accuracies of 75% and 50% for the lung and colon cancer datasets, respectively, as shown in figure 18 and 19. In terms of accuracy, InceptionV3 is the lowest-performing model in this study.
Fig 18. Lung cancer training and validation accuracy per epoch for InceptionV3
Fig 19. Colon cancer training and validation accuracy per epoch for InceptionV3
As with accuracy, these models do not have good recall or false negative counts compared to the other models, as shown in figure 20 and 21.
Fig 20. Lung Cancer prediction confusion matrix using InceptionV3

Fig 21. Colon Cancer prediction confusion matrix using InceptionV3
Figures 22 and 23 show the ROC curves and their AUC for both models. As with the other scores, the areas under the curves are very small, indicating that the models are very poor at differentiating positive and negative cases.
Fig 22. AUC of ROC Curve for lung cancer prediction using InceptionV3
Fig 23. AUC of ROC Curve for colon cancer prediction using InceptionV3
Facilities/Tools used
Software: Python with JupyterLab / Colab (500 compute units)

Python Libraries: TensorFlow, Keras, OpenCV, PIL, Pandas, NumPy, sklearn, math, Matplotlib, Seaborn
Storage: Google Drive (~200 MB)
Conclusion
VGG19 outperformed all the other models for detecting:
• Lung cancer
• Colon cancer
• Lung and colon cancer images together
Implementing models taught in class on a real-world project further strengthened our understanding of working with ML models.
