Machine Learning
Learning Goals of this Project:
Learning Basic Pandas Dataframe Manipulations
Learning more about Machine Learning (ML) Classification models and how they are used in a Cybersecurity Context.
Learning about basic Data pipelines and Transformations
Learning how to write and use Unit Tests when developing Python code
Important Reference Materials:
NumPy Documentation
Pandas Documentation
Scikit-learn Documentation
Introduction Video
BACKGROUND
Many of the projects in CS6035 focus on offensive security tasks, the Red Team activities that many of us associate with cybersecurity. This project focuses instead on defensive security tasks, the Blue Team activities carried out by many corporate security teams.
Historically, defensive security professionals have investigated malicious activity/files/code to create patterns (often called signatures) that can be used to detect, and prevent, that malicious activity/files/code when the same pattern appears again. This was a relatively effective way of preventing known malware from infecting systems, but it does nothing to protect against novel attacks. As attackers became more sophisticated, they learned to tweak (or simply encode) their malicious activity/files/code to evade these simple pattern-matching detections.
Given that background, it would be useful if a more general solution could score the activity/files/code that pass through corporate systems every day and tell the security team that, while a certain pattern may not exactly match a known malicious signature, it is very similar to malicious examples seen in the past. Machine Learning models can do exactly that when provided with proper training data, so it is no surprise that one of the most powerful tools in the hands of defensive cybersecurity professionals is Machine Learning. Modern detection systems usually combine Machine Learning models with pattern matching (Regular Expressions) to detect and prevent malicious activity on networks and devices.
This project will teach the basic fundamentals of data analysis and building/testing your own ML models in Python using the open-source libraries Pandas and scikit-learn.
Cybersecurity Machine Learning Careers and Trends
Machine learning in cybersecurity is a growing field. The area was considered among the top trends by McKinsey in 2022.
Additional Information
ML in Cybersecurity - Crowdstrike
AI for Cybersecurity - IBM
Future of Cybersecurity and AI - Deloitte
Frequently Asked Question(s) (FAQ)
Getting Started
Q: Are there any recommended documentation resources for Python libraries used on the project?
A: The scikit-learn documentation is very useful for understanding how certain machine learning functions work and can serve as a valuable resource. The NumPy documentation can help with understanding common data structures and manipulation techniques used in data analysis. The Pandas documentation can help with understanding how to create and manipulate dataframes. Other sources may be useful as well.
Q: Are there any recommended video resources for Python libraries used on the project?
A: YouTube can serve as an excellent source of learning for those who enjoy videos. One video that may be helpful for getting a feel for machine learning concepts is Machine Learning for Everybody - Full Course, created by freeCodeCamp.
Q: What general skills are needed to succeed on this project?
A:
o Familiarity with Python programming environments and packaging:
  Functions, parameters, and the self keyword
  Basic operators and loops
  Basic understanding of the NumPy and Pandas packages
o Familiarity with data science concepts:
  Basic dataset preprocessing
  Basic train/test splits
  Basic implementation of scikit-learn modeling
  Basic clustering and PCA
o High level understanding of data science algorithms:
  Supervised learning models
  Unsupervised learning models
  High level understanding of model comparison metrics
Q: I am overwhelmed and don’t know where to start.
A: Start by reviewing the useful links/videos we have provided and doing the coding tasks (tasks 1-5) in order. They build on each other somewhat and get progressively harder, so the early tasks are easier to complete.
General Project Questions
Q: When are office hours for this project?
A: There will be a pinned Ed Discussion post with office hour dates/times, as well as recordings after they take place.
Q: Should I make my own post related to this project in Ed Discussion?
A: Please ask your question in one of the pinned project posts so others can benefit from it, and remove answer data (i.e., don't post your code, even snippets) or any other information that should not be publicly shared.
Q: Can you review my code in a private Ed Discussion Post?
A: Since we have a Gradescope autograder we will not review student code; we expect you to debug your code using information in public Ed Discussion posts or via Google searches/Stack Overflow.
Q: I have constructive feedback that can improve this project for next semester.
A: Open a private Ed Discussion post and we will review your ideas and may implement them in a future semester.
Submission and Gradescope
Q: How many submissions do we have in Gradescope?
A: Unlimited
Q: Do I have to submit all 5 tasks at once or can I submit 1 at a time and get a superscore of my task submissions?
A: You need to submit all 5 task files with your final submission. The score you see in Gradescope will be the score you get in Canvas.
Q: I can't see any scores/output in the autograder. Is it broken?
A: We have a protection in the autograder to prevent printing sensitive information, so if your code has print statements you won't see your score or any autograder output. Resubmit your code with the print statements removed and you should see the normal outputs.
Q: I think I found a bug in the Autograder
A: Open a private Ed Discussion post and we can take a look. This is a relatively new project as CS6035 projects go, so there is a chance we missed an edge case in how the autograder checks your solution. If so, we will update the autograder and make a pinned post letting students know it was changed.
Task Hints and Questions
Q: I am using RFE to find the feature importance of a random forest or gradient boosting model, and it runs for a long time and times out in the autograder.
A: Only use RFE for logistic regression models; use the built-in feature importance values for random forest and gradient boosting models.
Setup
If you want to run this assignment locally, we suggest you use the following local setup instructions. You can install and run the packages with a variety of other software environments, but you will need to figure out how to install and run the code in those environments yourself.
Anaconda Installation
First, download Anaconda from their website: Anaconda Download. There are installers for Windows, Linux and Mac, so make sure you download the right version, then run the installation wizard.
For more information on how to install it see the following docs:
Installing on Windows
Installing on Mac
Installing on Linux
Environment Setup
Now that you have Anaconda installed, we will set up the Python environment.
Note: If you go the PyCharm route, you can have it install your Anaconda environment from the env.yml file we give you by following this guide.
1. Open Anaconda Prompt.
2. Download the Starter Code and Student Local Testing folders from Canvas.
3. Navigate to the Student Local Testing folder.
4. Inside that folder there is an env.yml file which Anaconda can use to install all the required packages. To install that environment, run conda env create --file env.yml from the Anaconda Prompt once you are inside the Student Local Testing folder.
5. Anaconda should install your environment; from now on, activate it by running conda activate cs6035_ML from inside Anaconda Prompt.
Unit Test Setup
Before running any unit tests, copy the edited task files that you want to test locally into the Student Local Testing folder.
Then follow the directions for VS Code or PyCharm, depending on which you have installed (or decide to install) on your machine.
Visual Studio Code
If you don't already have Visual Studio Code installed on your machine you can follow the install guide:
Installing on Windows
Installing on Mac
Installing on Linux
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Local Testing Video
Next you need to select the Python interpreter from the Anaconda environment we created (cs6035_ML) in VS Code. Here is a guide for doing so on Windows if you don't have Anaconda in your system's PATH. VS Code also has some documentation on environments if you are struggling.
To Set Up Unit Testing:
1. Open a VS Code window with the Student Local Testing folder.
2. Follow the testing docs, or simply click the beaker shape on the left sidebar and then Configure Python Tests.
3. Select unittest from the framework options.
4. Select tests as the directory containing the tests.
5. Select test_*.py as the pattern to identify test files.
You should now see a dropdown with the unit tests in the left sidebar; you can click the play button to run all the unit tests or run each one individually. Tests with a red X show either an error or an incorrect answer, while green checkmarks indicate passed test cases.
Pycharm Community Edition
If you don't have PyCharm installed on your machine you can follow the install guide:
Install Guide
We suggest installing the Community Edition (free version).
Once you have it installed, use this guide to set up your IDE with your Anaconda environment. You can reference the instructions for creating a conda environment based on environment.yaml.
Next you can follow the Pycharm Testing Docs to configure the test cases:
Test your first Python application
Testing Docs
The Test Sources Root for PyCharm should be the tests folder inside the Student Local Testing folder.
Notebooks
You can use Google's Colab, VS Code's Notebooks, PyCharm's Notebooks or Jupyter Notebooks to write/debug your code for this assignment, but we will not provide any support for this method. Ultimately you will still have to submit the Task .py files to Gradescope, so make sure your code runs in those Python files and not just in a notebook.
Task 1 (15 points)
Let's first get familiar with some Pandas basics. Pandas is a library that handles dataframes, which you can think of as a Python class that handles tabular data. In the real world you would generally also use plotting tools like Power BI, Tableau, Data Studio, Matplotlib, etc., to create graphics and other visuals to better understand the dataset you are working with; this step is generally known as Exploratory Data Analysis. Since we are using an autograder for this class, we will skip the plotting for this project. For this task we have released a test suite; if you are struggling to understand the expected inputs and outputs for a function, please set it up and use it to debug your function.
Useful Links:
Pandas documentation — Pandas 1.5.3 documentation (pydata.org)
What is Exploratory Data Analysis? - IBM
Top Data Visualization Tools - KDnuggets
Getting Started Video
Getting started with Notebooks and Functions Video
Deliverables:
Complete the functions in task1.py
For this task we have released a local test suite; please set it up and use it to debug your functions.
Submit task1.py to gradescope
Instructions:
The Task1.py file has function skeletons that you will complete with Python code (mostly using the pandas library). The goal of each of these functions is to give you familiarity with the pandas library and some general Python concepts, like classes, which you may not have seen before. See information about each function's inputs, outputs and skeleton below.
Table of contents
1. find_data_type
2. set_index_col
3. reset_index_col
4. set_col_type
5. make_DF_from_2d_array
6. sort_DF_by_column
7. drop_NA_cols
8. drop_NA_rows
9. make_new_column
10. left_merge_DFs_by_column
11. simpleClass
12. find_dataset_statistics
find_data_type
In this function you will take a dataset and the name of one of its columns and return that column's datatype.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
INPUTS
dataset - a pandas dataframe that contains some data
column_name - a string containing the name of the column to inspect
OUTPUTS
data type of the column (np.dtype)
Function Skeleton
def find_data_type(dataset: pd.DataFrame, column_name: str) -> np.dtype:
    return np.dtype()
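A minimal sketch of one possible implementation (not the official solution); a column's dtype is available directly on the column:
def find_data_type(dataset: pd.DataFrame, column_name: str) -> np.dtype:
    # Select the column and return its numpy dtype
    return dataset[column_name].dtype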
set_index_col
In this function you will take a dataset and a series and set the index of the dataset to be the series
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html
INPUTS
dataset - a pandas dataframe that contains some data
index - a pandas series that contains an index for the dataset
OUTPUTS
a pandas dataframe indexed by the given index series
Function Skeleton
def set_index_col(dataset: pd.DataFrame, index: pd.Series) -> pd.DataFrame:
    return pd.DataFrame()
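One possible sketch using pandas set_index (assuming the given series has the same length as the dataframe):
def set_index_col(dataset: pd.DataFrame, index: pd.Series) -> pd.DataFrame:
    # set_index accepts an array-like/Series and returns a new dataframe indexed by it
    return dataset.set_index(index)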
reset_index_col
In this function you will take a dataset with an index already set and reindex the dataset from 0 to n-1, dropping the old index
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
INPUTS
dataset - a pandas dataframe that contains some data
OUTPUTS
a pandas dataframe indexed from 0 to n-1
Function Skeleton
def reset_index_col(dataset: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
set_col_type
In this function you will be given a dataframe, column name and column type. You will edit the dataset to take the column name you are given and set it to be the type given in the input variable
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
INPUTS
dataset - a pandas dataframe that contains some data
column_name - a string containing the name of a column
new_col_type - a type to change the column to
OUTPUTS
a pandas dataframe with the column in column_name changed to the type in new_col_type
Function Skeleton
# Set astype (string, int, datetime)
def set_col_type(dataset: pd.DataFrame, column_name: str, new_col_type: type) -> pd.DataFrame:
    return pd.DataFrame()
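A minimal sketch using DataFrame.astype with a single-column mapping:
def set_col_type(dataset: pd.DataFrame, column_name: str, new_col_type: type) -> pd.DataFrame:
    # astype with a dict converts only the named column and returns a new dataframe
    return dataset.astype({column_name: new_col_type})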
make_DF_from_2d_array
In this function you will take data in an array as well as column and row labels and use that information to create a pandas dataframe
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
INPUTS
array_2d - a 2 dimensional numpy array of values
column_name_list - a list of strings holding column names
index - a pandas series holding the row indices
OUTPUTS
a pandas dataframe with columns set from column_name_list, row index set from index and data set from array_2d
Function Skeleton
# Take a matrix of numbers and make it into a dataframe with column names and index numbering
def make_DF_from_2d_array(array_2d: np.array, column_name_list: list[str], index: pd.Series) -> pd.DataFrame:
    return pd.DataFrame()
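A minimal sketch; the DataFrame constructor accepts the data, column names and index directly:
def make_DF_from_2d_array(array_2d: np.array, column_name_list: list[str], index: pd.Series) -> pd.DataFrame:
    # Build the dataframe from the 2d array, labeling columns and rows as requested
    return pd.DataFrame(data=array_2d, columns=column_name_list, index=index)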
sort_DF_by_column
In this function you are given a dataset and a column name and will return the dataset sorted by that column (sort rows by the value of the column specified and do not reindex), in either descending or ascending order depending on the value of the descending variable.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
INPUTS
dataset - a pandas dataframe that contains some data
column_name - a string that contains the column name to sort the data on
descending - a boolean value (True or False) indicating whether the column should be sorted in descending order
OUTPUTS
a pandas dataframe sorted by the given column name, in descending or ascending order depending on the value of the descending variable
Function Skeleton
# Sort dataframe by values
def sort_DF_by_column(dataset: pd.DataFrame, column_name: str, descending: bool) -> pd.DataFrame:
    return pd.DataFrame()
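A minimal sketch using sort_values; sorting is ascending by default, so the flag is inverted:
def sort_DF_by_column(dataset: pd.DataFrame, column_name: str, descending: bool) -> pd.DataFrame:
    # Sort rows by the given column without changing the index
    return dataset.sort_values(by=column_name, ascending=not descending)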
drop_NA_cols
In this function you are given a dataframe; you will return a dataframe with any columns containing NA values dropped.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
INPUTS
dataset - a pandas dataframe that contains some data
OUTPUTS
a pandas dataframe with any columns that contain an NA value dropped
Function Skeleton
# Drop dataframe columns containing NA values
def drop_NA_cols(dataset: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
drop_NA_rows
In this function you are given a dataframe; you will return a dataframe with any rows containing NA values dropped.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
INPUTS
dataset - a pandas dataframe that contains some data
OUTPUTS
a pandas dataframe with any rows that contain an NA value dropped
Function Skeleton
def drop_NA_rows(dataset: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
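Minimal sketches for both drop functions; dropna's axis argument controls whether columns or rows are dropped:
def drop_NA_cols(dataset: pd.DataFrame) -> pd.DataFrame:
    # axis=1 drops any column that contains at least one NA value
    return dataset.dropna(axis=1)

def drop_NA_rows(dataset: pd.DataFrame) -> pd.DataFrame:
    # axis=0 (the default) drops any row that contains at least one NA value
    return dataset.dropna(axis=0)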
make_new_column
In this function you are given a dataset, a new column name and a static value for the new column; add the new column to the dataset and return the dataset.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/05_add_columns.html
INPUTS
dataset - a pandas dataframe that contains some data
new_column_name - a string containing the name of the new column to be created
new_column_value - a string containing a static value that will be set for the new column for every row
OUTPUTS
a pandas dataframe with the new column created named new_column_name and filled with the value in new_column_value
Function Skeleton
def make_new_column(dataset: pd.DataFrame, new_column_name: str, new_column_value: str) -> pd.DataFrame:
    return pd.DataFrame()
left_merge_DFs_by_column
In this function you are given 2 datasets and the name of a column on which you will left join them (left dataset is dataset1, right dataset is dataset2) using the pandas merge method.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
https://stackoverflow.com/questions/53645882/pandas-merging-101
INPUTS
left_dataset - a pandas dataframe that contains some data
right_dataset - a pandas dataframe that contains some data
join_col_name - a string containing the column name to join the two dataframes on
OUTPUTS
a pandas dataframe containing the two datasets left joined together on the given column
Function Skeleton
def left_merge_DFs_by_column(left_dataset: pd.DataFrame, right_dataset: pd.DataFrame, join_col_name: str) -> pd.DataFrame:
    return pd.DataFrame()
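A minimal sketch using DataFrame.merge with a left join:
def left_merge_DFs_by_column(left_dataset: pd.DataFrame, right_dataset: pd.DataFrame, join_col_name: str) -> pd.DataFrame:
    # how="left" keeps every row of the left dataset and joins matching rows from the right
    return left_dataset.merge(right_dataset, on=join_col_name, how="left")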
simpleClass
This project will require you to work with Python Classes. If you are not familiar with them we suggest learning a bit more about them.
You will take the inputs into the Class initialization and set them as instance variables (of the same name) in the python class
Useful Resources
https://www.w3schools.com/python/python_classes.asp
INPUTS
length - an integer
width - an integer
height - an integer
OUTPUTS
None
Function Skeleton
class simpleClass():
    def __init__(self, length: int, width: int, height: int):
        pass
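A minimal sketch that stores each input as an instance variable of the same name:
class simpleClass():
    def __init__(self, length: int, width: int, height: int):
        # Store the constructor arguments as instance variables of the same name
        self.length = length
        self.width = width
        self.height = height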
find_dataset_statistics
Now that you have learned a bit about pandas dataframes, you can start using them to generate some simple summary statistics for a dataframe. You will be given the dataset as an input variable, as well as the name of a column in the dataset that contains binary (0 for negative and 1 for positive) values that you will summarize.
Useful Resources
https://www.learndatasci.com/glossary/binary-classification/
https://developers.google.com/machine-learning/crash-course/framing/ml-terminology
INPUTS
dataset - a pandas dataframe that contains some data
label_col - a string containing the name of the label column
OUTPUTS
n_records (int) - the number of rows in the dataset
n_columns (int) - the number of columns in the dataset
n_negative (int) - the number of "negative" samples in the dataset (label column equals 0)
n_positive (int) - the number of "positive" samples in the dataset (label column equals 1)
perc_positive (float) - the percentage (out of 100%) of positive samples in the dataset
Function Skeleton
import numpy as np
import pandas as pd

def find_dataset_statistics(dataset: pd.DataFrame, label_col: str) -> tuple[int, int, int, int, float]:
    n_records = # TODO
    n_columns = # TODO
    n_negative = # TODO
    n_positive = # TODO
    perc_positive = # TODO
    return n_records, n_columns, n_negative, n_positive, perc_positive
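A minimal sketch, assuming the label column contains only 0/1 values as described above:
def find_dataset_statistics(dataset: pd.DataFrame, label_col: str) -> tuple[int, int, int, int, float]:
    n_records = len(dataset)                           # number of rows
    n_columns = len(dataset.columns)                   # number of columns
    n_negative = int((dataset[label_col] == 0).sum())  # rows labeled 0
    n_positive = int((dataset[label_col] == 1).sum())  # rows labeled 1
    perc_positive = 100 * n_positive / n_records       # percentage out of 100%
    return n_records, n_columns, n_negative, n_positive, perc_positive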
Task 2 (25 points)
Now that you have a basic understanding of pandas and the dataset, it is time to dive into some more complex data processing tasks. These are basic concepts in model building, but at a high level it is important to hold out a subset of your data when you train a model so you can measure the expected performance on unseen samples and determine whether the resulting model is overfit (performs much better on training data than on test data). Preprocessing data is important since most models only accept numerical values, so categorical features need to be "encoded" to numerical values before models can use them. Numerical scaling can be more or less useful depending on the type of model used, but it is especially important for linear models. These preprocessing techniques give you options to augment your dataset and improve model performance.
Useful Links:
Training and Test Sets - Machine Learning - Google Developers
Bias–variance tradeoff - Wikipedia
Overfitting - Wikipedia
Categorical and Numerical Types of Data - 365 Data Science
scikit-learn: machine learning in Python — scikit-learn 1.2.1 documentation
Deliverables:
Complete the functions and methods in task2.py
For this task we have released a local test suite; please set it up and use it to debug your functions.
Submit task2.py to Gradescope.
Instructions:
The Task2.py file has function skeletons that you will complete with Python code (mostly using the pandas and scikit-learn libraries). The goal of each of these functions is to give you familiarity with the applied concepts of splitting and preprocessing data. See information about each function's inputs, outputs and skeleton below.
Table of contents
1. tts
2. PreprocessDataset
   1. __init__
   2. One Hot Encoding
   3. Min/Max Scaling
   4. PCA
   5. Feature Engineering
   6. Preprocess
tts
In this function you will take a dataset, the name of its label column, the percentage of the data to put into the test set, whether to stratify on the label column, and a random state to pass to the sklearn function, and you will return features and labels for the training and test sets. At a high level you can separate the task into 2 subtasks: first, splitting your dataset into features and labels (by columns), and second, splitting your dataset into training and test sets (by rows). You should use the sklearn train_test_split function, but you will have to write wrapper code around it based on the input values we give you.
Useful Resources
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
https://developers.google.com/machine-learning/crash-course/framing/ml-terminology
https://stackoverflow.com/questions/40898019/what-is-the-difference-between-a-feature-and-a-label
INPUTS
dataset - a pandas dataframe that contains some data
label_col - a string containing the name of the column that contains the label values (what our model wants to predict)
test_size - a float containing the decimal fraction of rows that the test set should contain
stratify - a boolean (True or False) value indicating whether the resulting train/test split should be stratified
random_state - an integer value to set the randomness of the function (useful for repeatability, especially when autograding)
OUTPUTS
train_features - a pandas dataframe that contains the train rows and the feature columns
test_features - a pandas dataframe that contains the test rows and the feature columns
train_labels - a pandas dataframe that contains the train rows and the label column
test_labels - a pandas dataframe that contains the test rows and the label column
Function Skeleton
def tts(dataset: pd.DataFrame,
        label_col: str,
        test_size: float,
        stratify: bool,
        random_state: int) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    # TODO
    return train_features, test_features, train_labels, test_labels
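A minimal sketch wrapping sklearn's train_test_split (assumes the stratify flag means "stratify on the label column when True"):
from sklearn.model_selection import train_test_split

def tts(dataset: pd.DataFrame,
        label_col: str,
        test_size: float,
        stratify: bool,
        random_state: int) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    # Split columns into features and label
    features = dataset.drop(columns=[label_col])
    labels = dataset[label_col]
    # Split rows into train and test, optionally stratifying on the label values
    train_features, test_features, train_labels, test_labels = train_test_split(
        features,
        labels,
        test_size=test_size,
        stratify=labels if stratify else None,
        random_state=random_state)
    return train_features, test_features, train_labels, test_labels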
PreprocessDataset
The PreprocessDataset class contains a code skeleton with 9 methods for you to implement. Most methods are split into 2 parts: one that will be run on the training dataset and one that will be run on the test dataset. In Data Science/Machine Learning this is done to avoid something called Data Leakage. For this assignment we don't expect you to understand the nuances of the concept, but we will have you follow principles that minimize the chances of it occurring. You will accomplish this by splitting data into training and test datasets and processing those datasets in slightly different ways. Generally, for everything you do in this project (and any ML or Data Science work you do in the future), you should train/fit on the training data first, then predict/transform on both the training and test data. That holds for basic preprocessing steps like Task 2 and for complex models like you will see in Tasks 3 and 4. For the purposes of this project (and more generally in any ML project) you should never train or fit on the test data, because your test data is meant to give you an understanding of how your model/predictions will perform on unseen data. If you fit even a preprocessing step to your test data, you are either giving the model information about the test set it wouldn't have about unseen data (if you combine train and test and fit to both) or you are providing different preprocessing than the model expects (if you fit a separate preprocessor to the test data), and your model should not be expected to perform well.
Note: You should train/fit using the train dataset; once you have a fit encoder/scaler/pca/model instance you can transform/predict on both the training and test data.
You will also notice that we are only preprocessing the features and not the labels. There are a few cases where preprocessing steps on labels may be helpful in modeling, but they are more advanced and out of scope for this introduction. Generally you will not need to do any preprocessing to your labels beyond potentially encoding a string value (i.e. "Malware" or "Benign") into an integer value (0 or 1), which is called Label Encoding.
PreprocessDataset: __init__
Similar to the Task 1 simpleClass subtask you previously completed, you will initialize the class by adding instance variables (add all the inputs).
Useful Resources
https://www.w3schools.com/python/python_classes.asp
INPUTS
train_features - a dataset split by a function similar to tts which should be used in the training/fitting steps
test_features - a dataset split by a function similar to tts which should be used in the test steps
one_hot_encode_cols - a list of column names (strings) that should be one hot encoded by the one hot encode methods
min_max_scale_cols - a list of column names (strings) that should be min/max scaled by the min/max scaling methods
n_components - an int that contains the number of components that should be used in Principal Component Analysis
feature_engineering_functions - a dictionary that contains feature name and function to create that feature as a key value pair (example shown below)
Example of feature_engineering_functions:
def double_height(dataframe: pd.DataFrame):
    return dataframe["height"] * 2

def half_height(dataframe: pd.DataFrame):
    return dataframe["height"] / 2

feature_engineering_functions = {"double_height": double_height, "half_height": half_height}

Don't worry about copying it; we also have examples in the local test cases. This is just provided as an illustration of what to expect in your function.
OUTPUTS
None
Function Skeleton
def __init__(self,
             train_features: pd.DataFrame,
             test_features: pd.DataFrame,
             one_hot_encode_cols: list[str],
             min_max_scale_cols: list[str],
             n_components: int,
             feature_engineering_functions: dict):
    # TODO: Add any instance variables you may need to make your functions work
    return
PreprocessDataset: one_hot_encode_columns_train and one_hot_encode_columns_test
One Hot Encoding is the process of taking a column and returning a binary vector representing the various values within it. There is a separate function for the training and test datasets since they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).
Pseudocode
one_hot_encode_columns_train()
1. In the __init__() method, initialize an instance variable containing an sklearn OneHotEncoder with any parameters you may need.
2. Split train_features into 2 dataframes: a dataframe with only the columns you want to one hot encode (using one_hot_encode_cols) and a dataframe with all the other columns.
3. Fit the OneHotEncoder using the dataframe you split from train_features with the columns you want to encode.
4. Transform the dataframe you split from train_features with the columns you want to encode using the fit OneHotEncoder.
5. Create a dataframe from the 2d array of data that step 4 gave you, with column names in the form columnName_categoryName (there is an attribute in OneHotEncoder that can help you with this) and the same index that train_features had.
6. Join the dataframe you made in step 5 with the dataframe of other columns from step 2.
one_hot_encode_columns_test()
1. Split test_features into 2 dataframes: a dataframe with only the columns you want to one hot encode (using one_hot_encode_cols) and a dataframe with all the other columns.
2. Transform the dataframe you split from test_features with the columns you want to encode using the OneHotEncoder you fit in one_hot_encode_columns_train().
3. Create a dataframe from the 2d array of data that step 2 gave you, with column names in the form columnName_categoryName (there is an attribute in OneHotEncoder that can help you with this) and the same index that test_features had.
4. Join the dataframe you made in step 3 with the dataframe of other columns from step 1.
Example Walkthrough (from Local Testing suite):
INPUTS:
one_hot_encode_cols
["color","version"]
Train Features
index  color   version  cost   height
0      red     1        5.99   12
6      yellow  6        10.99  18
3      red     1        5.99   15
9      red     8        12.99  21
2      blue    3        5.99   14
5      orange  5        10.99  17
1      green   2        5.99   13
7      green   2        12.99  19
Test Features
index  color   version  cost   height
4      purple  4        10.99  16
8      blue    3        12.99  20
TRAIN DATAFRAMES AT EACH STEP:
1. Dataframe with columns to encode:
index  color   version
0      red     1
6      yellow  6
3      red     1
9      red     8
2      blue    3
5      orange  5
1      green   2
7      green   2
Dataframe with other columns:
index  cost   height
0      5.99   12
6      10.99  18
3      5.99   15
9      12.99  21
2      5.99   14
5      10.99  17
1      5.99   13
7      12.99  19
3. One Hot Encoded 2d array:
0 0 0 1 0 1 0 0 0 0 0
0 0 0 0 1 0 0 0 0 1 0
0 0 0 1 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 1
1 0 0 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 1 0 0
0 1 0 0 0 0 1 0 0 0 0
0 1 0 0 0 0 1 0 0 0 0
4. One Hot Encoded Dataframe with Index and Column Names
index  color_blue  color_green  color_orange  color_red  color_yellow  version_1  version_2  version_3  version_5  version_6  version_8
0      0           0            0             1          0             1          0          0          0          0          0
6      0           0            0             0          1             0          0          0          0          1          0
3      0           0            0             1          0             1          0          0          0          0          0
9      0           0            0             1          0             0          0          0          0          0          1
2      1           0            0             0          0             0          0          1          0          0          0
5      0           0            1             0          0             0          0          0          1          0          0
1      0           1            0             0          0             0          1          0          0          0          0
7      0           1            0             0          0             0          1          0          0          0          0
5. Final Dataframe with passthrough columns joined back
index  color_blue  color_green  color_orange  color_red  color_yellow  version_1  version_2  version_3  version_5  version_6  version_8  cost   height
0      0           0            0             1          0             1          0          0          0          0          0          5.99   12
6      0           0            0             0          1             0          0          0          0          1          0          10.99  18
3      0           0            0             1          0             1          0          0          0          0          0          5.99   15
9      0           0            0             1          0             0          0          0          0          0          1          12.99  21
2      1           0            0             0          0             0          0          1          0          0          0          5.99   14
5      0           0            1             0          0             0          0          0          1          0          0          10.99  17
1      0           1            0             0          0             0          1          0          0          0          0          5.99   13
7      0           1            0             0          0             0          1          0          0          0          0          12.99  19
TEST DATAFRAMES AT EACH STEP:
1. Dataframe with columns to encode:
index  color   version
4      purple  4
8      blue    3
Dataframe with other columns:
index  cost   height
4      10.99  16
8      12.99  20
2. One Hot Encoded 2d array:
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0 0
3. One Hot Encoded Dataframe with Index and Column Names
index  color_blue  color_green  color_orange  color_red  color_yellow  version_1  version_2  version_3  version_5  version_6  version_8
4      0           0            0             0          0             0          0          0          0          0          0
8      1           0            0             0          0             0          0          1          0          0          0
4. Final Dataframe with passthrough columns joined back
index  color_blue  color_green  color_orange  color_red  color_yellow  version_1  version_2  version_3  version_5  version_6  version_8  cost   height
4      0           0            0             0          0             0          0          0          0          0          0          10.99  16
8      1           0            0             0          0             0          0          1          0          0          0          12.99  20
Note: For the autograder, use the column naming scheme of joining the previous column name and the column value with an underscore (e.g., a Type column with values Fruit and Vegetable becomes Type_Fruit and Type_Vegetable).
Note 2: Since you should only fit your encoder on the training data, any category values that appear in the test set but not in the training set should be denoted with 0s. For example, if a test row had the value pizza in a Type column that only saw Fruit and Vegetable during training, both Type_Fruit and Type_Vegetable should be 0 for that row. If you don't handle these properly you may get errors like Test Failed: Found unknown categories.
Note 3: You may be tempted to use the pandas function get_dummies to solve this task, but it's a trap. It seems easier, but you will have to do a lot more work to make it handle a train/test split, so we suggest you use sklearn's OneHotEncoder.
Useful Resources
https://www.educative.io/blog/one-hot-encoding
https://developers.google.com/machine-learning/data-prep/transform/transform-categorical
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
https://datascience.stackexchange.com/questions/103211/do-we-need-to-pre-process-both-the-test-and-train-data-set
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with the columns listed in one_hot_encode_cols one hot encoded and all other columns in the dataframe unchanged
Function Skeleton
def one_hot_encode_columns_train(self) -> pd.DataFrame:
    one_hot_encoded_dataset = pd.DataFrame()
    return one_hot_encoded_dataset

def one_hot_encode_columns_test(self) -> pd.DataFrame:
    one_hot_encoded_dataset = pd.DataFrame()
    return one_hot_encoded_dataset
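A minimal sketch of one possible train/test pair following the pseudocode above. It assumes __init__ stored the inputs as instance variables and created self.one_hot_encoder (a OneHotEncoder; handle_unknown="ignore" gives the all-zeros behavior described in Note 2):
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode_columns_train(self) -> pd.DataFrame:
    # Assumed in __init__: self.one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    encode_df = self.train_features[self.one_hot_encode_cols]
    other_df = self.train_features.drop(columns=self.one_hot_encode_cols)
    # Fit on the training columns, then transform them
    encoded = self.one_hot_encoder.fit_transform(encode_df).toarray()
    encoded_df = pd.DataFrame(
        encoded,
        columns=self.one_hot_encoder.get_feature_names_out(self.one_hot_encode_cols),
        index=self.train_features.index)
    # Join the passthrough columns back on
    return encoded_df.join(other_df)

def one_hot_encode_columns_test(self) -> pd.DataFrame:
    encode_df = self.test_features[self.one_hot_encode_cols]
    other_df = self.test_features.drop(columns=self.one_hot_encode_cols)
    # Only transform here; the encoder was fit on the training data
    encoded = self.one_hot_encoder.transform(encode_df).toarray()
    encoded_df = pd.DataFrame(
        encoded,
        columns=self.one_hot_encoder.get_feature_names_out(self.one_hot_encode_cols),
        index=self.test_features.index)
    return encoded_df.join(other_df)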
PreprocessDataset: min_max_scaled_columns_train and min_max_scaled_columns_test
Min/Max Scaling is a process of scaling ints/floats from the min and max values in a series to a range between 0 and 1. The formula scikit-learn uses is shown below, but for this assignment you should just use the linked scikit-learn function.

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

There is a separate function for the training and test datasets since they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).
Example Dataframe:
Item        Price   Count  Type
Apples      1.99    7      Fruit
Broccoli    1.29    435    Vegetable
Bananas     0.99    123    Fruit
Oranges     2.79    25     Fruit
Pineapples  4.89    5234   Fruit
Example Min/Max Scaled Dataframe (rounded to 4 decimal places):
Item        Price   Count  Type
Apples      0.2564  7      Fruit
Broccoli    0.0769  435    Vegetable
Bananas     0       123    Fruit
Oranges     0.4615  25     Fruit
Pineapples  1       5234   Fruit
Note: For the autograder, use the same name as the original column (ex: Price -> Price).
Useful Resources
https://developers.google.com/machine-learning/data-prep/transform/normalization
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
https://datascience.stackexchange.com/questions/103211/do-we-need-to-pre-process-both-the-test-and-train-data-set
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with the columns listed in min_max_scale_cols min/max scaled and all other columns in the dataframe unchanged
Function Skeleton
def min_max_scaled_columns_train(self) -> pd.DataFrame:
    min_max_scaled_dataset = pd.DataFrame()
    return min_max_scaled_dataset

def min_max_scaled_columns_test(self) -> pd.DataFrame:
    min_max_scaled_dataset = pd.DataFrame()
    return min_max_scaled_dataset
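A minimal sketch (assumes __init__ created self.min_max_scaler = MinMaxScaler() and stored min_max_scale_cols); only the listed columns are scaled, everything else passes through unchanged:
from sklearn.preprocessing import MinMaxScaler

def min_max_scaled_columns_train(self) -> pd.DataFrame:
    # Fit the scaler on the training columns and overwrite them with scaled values
    scaled_dataset = self.train_features.copy()
    scaled_dataset[self.min_max_scale_cols] = self.min_max_scaler.fit_transform(
        self.train_features[self.min_max_scale_cols])
    return scaled_dataset

def min_max_scaled_columns_test(self) -> pd.DataFrame:
    # Reuse the scaler fit on the training data to avoid data leakage
    scaled_dataset = self.test_features.copy()
    scaled_dataset[self.min_max_scale_cols] = self.min_max_scaler.transform(
        self.test_features[self.min_max_scale_cols])
    return scaled_dataset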
PreprocessDataset: pca_train and pca_test
Principal Component Analysis is a dimensionality reduction technique (column reduction). It aims to take the variance in your input columns and map the columns into N components that contain as much of the variance as possible. This technique can be useful if you are trying to train a model faster, and it has some more advanced uses, especially when training models on data with many columns but few rows. There is a separate function for the training and test datasets since they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).
Note: For the autograder, use the column naming scheme component_1, component_2 .. component_n for the n_components passed into the __init__ method.
Note 2: For your PCA outputs to match the autograder, make sure you set the seed using a random state of 0 when you initialize the PCA function.
Note 3: Since PCA does not work with NA values, make sure you drop any columns that have NA values before running PCA.
Useful Resources
https://builtin.com/data-science/step-step-explanation-principal-component-analysis
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
https://datascience.stackexchange.com/questions/103211/do-we-need-to-pre-process-both-the-test-and-train-data-set
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with the generated pca values, using column names: component_1, component_2 .. component_n
Function Skeleton
def pca_train(self) -> pd.DataFrame:
    pca_dataset = pd.DataFrame()
    return pca_dataset

def pca_test(self) -> pd.DataFrame:
    pca_dataset = pd.DataFrame()
    return pca_dataset
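A minimal sketch (assumes __init__ created self.pca = PCA(n_components=self.n_components, random_state=0)). NA columns are dropped before fitting as Note 3 requires; in practice you would make sure the same columns are dropped for train and test:
from sklearn.decomposition import PCA

def pca_train(self) -> pd.DataFrame:
    # Drop columns that contain NA values, then fit PCA on the training data
    train_no_na = self.train_features.dropna(axis=1)
    components = self.pca.fit_transform(train_no_na)
    column_names = [f"component_{i}" for i in range(1, self.n_components + 1)]
    return pd.DataFrame(components, columns=column_names, index=self.train_features.index)

def pca_test(self) -> pd.DataFrame:
    # Apply the PCA fit on the training data to the test data
    test_no_na = self.test_features.dropna(axis=1)
    components = self.pca.transform(test_no_na)
    column_names = [f"component_{i}" for i in range(1, self.n_components + 1)]
    return pd.DataFrame(components, columns=column_names, index=self.test_features.index)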
PreprocessDataset: feature_engineering_train and feature_engineering_test
Feature Engineering is a process of using domain knowledge (physics, geometry, sports statistics, business metrics, etc.) to create new features (columns) out of the existing data. This could mean creating an area feature from the length and width of a triangle, extracting the major and minor version numbers from a software version, or more complex logic depending on the scenario. For this method you will take in a dictionary mapping a column name to a function (that takes in a dataframe and returns a column) and use it to create a new column with the name in the dict key.
For example:
def double_height(dataframe: pd.DataFrame):
    return dataframe["height"] * 2

def half_height(dataframe: pd.DataFrame):
    return dataframe["height"] / 2

feature_engineering_functions = {"double_height": double_height, "half_height": half_height}

With the above functions you would create 2 new columns named "double_height" and "half_height".
Useful Resources
https://en.wikipedia.org/wiki/Feature_engineering
https://www.geeksforgeeks.org/what-is-feature-engineering/
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with the features described in feature_engineering_functions added as new columns and all other columns in the dataframe unchanged
Function Skeleton
def feature_engineering_train(self) -> pd.DataFrame:
    feature_engineered_dataset = pd.DataFrame()
    return feature_engineered_dataset

def feature_engineering_test(self) -> pd.DataFrame:
    feature_engineered_dataset = pd.DataFrame()
    return feature_engineered_dataset
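A minimal sketch that applies each function in feature_engineering_functions to create the new columns (assumes the dictionary and feature dataframes were stored in __init__):
def feature_engineering_train(self) -> pd.DataFrame:
    feature_engineered_dataset = self.train_features.copy()
    # Each dict entry maps a new column name to a function of the dataframe
    for new_col_name, col_function in self.feature_engineering_functions.items():
        feature_engineered_dataset[new_col_name] = col_function(self.train_features)
    return feature_engineered_dataset

def feature_engineering_test(self) -> pd.DataFrame:
    feature_engineered_dataset = self.test_features.copy()
    for new_col_name, col_function in self.feature_engineering_functions.items():
        feature_engineered_dataset[new_col_name] = col_function(self.test_features)
    return feature_engineered_dataset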
PreprocessDataset: preprocess_train and preprocess_test
Now we will put 3 of the above methods together into a preprocess function which will take a dataset and encode, scale and feature engineer it using the above methods and their respective columns, then output a preprocessed dataframe.
Useful Resources
See the resources for one hot encoding, min/max scaling and feature engineering above
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe for both train and test features with the columns in one_hot_encode_cols encoded, the columns in min_max_scale_cols scaled and the features described in feature_engineering_functions engineered
Function Skeleton
def preprocess(self) -> tuple[pd.DataFrame, pd.DataFrame]:
    train_features = pd.DataFrame()
    test_features = pd.DataFrame()
    return train_features, test_features
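A minimal sketch of one possible way to chain the method pairs above; it assumes the intermediate results are written back to the instance variables so each step sees the previous step's output (your wiring may differ):
def preprocess(self) -> tuple[pd.DataFrame, pd.DataFrame]:
    # One hot encode, then min/max scale, then feature engineer; train before test at each step
    self.train_features = self.one_hot_encode_columns_train()
    self.test_features = self.one_hot_encode_columns_test()
    self.train_features = self.min_max_scaled_columns_train()
    self.test_features = self.min_max_scaled_columns_test()
    train_features = self.feature_engineering_train()
    test_features = self.feature_engineering_test()
    return train_features, test_features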
Task 3 (15 points)
So far we have functions to split the data and preprocess it. Now we will run a basic model on the data to cluster files (rows) with similar attributes together. We will use an unsupervised model (a model with no label column), KMeans, since it is simple to use and understand. Please use scikit-learn to create the model and Yellowbrick to determine the optimal value of k for our dataset.
Useful Links:
Clustering - Google Developers
Clustering Algorithms - Google Developers
Kmeans - Google Developers
Deliverables:
Complete the KmeansClustering class in task3.py
For this task we have released a local test suite; please set it up and use it to debug your functions.
Submit task3.py to Gradescope
Instructions:
The Task3.py file has function skeletons that you will complete with Python code (mostly using the pandas and scikit-learn libraries). The goal of each of these functions is to give you familiarity with the applied concepts of Unsupervised Learning. See information about each function's inputs, outputs and skeleton below.
KmeansClustering
The KmeansClustering class contains a code skeleton with 4 methods for you to implement.
Note: You should train/fit using the train dataset; once you have a fit encoder/scaler/pca/model instance you can transform/predict on the training and test data.
KmeansClustering: __init__
Similar to the Task 1 simpleClass subtask you previously completed, you will initialize the class by adding instance variables.
Useful Resources
https://www.w3schools.com/python/python_classes.asp
INPUTS
train_features - a dataset split by a function similar to tts which should be used in the training/fitting steps
test_features - a dataset split by a function similar to tts which should be used in the test steps
random_state - an integer that should be used to set the scikit-learn randomness so the model results will be repeatable, which is required for the autograder
OUTPUTS
None
Function Skeleton
def __init__(self,
             train_features: pd.DataFrame,
             test_features: pd.DataFrame,
             random_state: int):
    # TODO: Add any state variables you may need to make your functions work
    pass
KmeansClustering: kmeans_train
KMeans clustering is a process of grouping similar rows together and assigning them to a cluster. For this method you will use the training data to fit an optimal KMeans clustering of the data.
To help you get started we have provided a list of subtasks to complete for this task:
1. Initialize a sklearn KMeans model using random_state from the __init__ method and setting n_init = 10.
2. Initialize a yellowbrick KElbowVisualizer to search for the optimal value of k (between 1 and 10).
3. Train the KElbowVisualizer on the training data and determine the optimal k value.
4. Train a KMeans model with the proper initialization for that optimal value of k.
5. Return the cluster ids for each row of the training set as a list.
Useful Resources
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a list of cluster ids that the kmeans model has assigned for each row in the train dataset
Function Skeleton
def kmeans_train(self) -> list:
    cluster_ids = list()
    return cluster_ids
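A minimal sketch following the subtasks above (assumes yellowbrick is installed, that __init__ stored train_features and random_state, and that the fit model is kept on self.kmeans_model for reuse in kmeans_test):
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

def kmeans_train(self) -> list:
    # Use the elbow method to search for an optimal k between 1 and 10
    model = KMeans(random_state=self.random_state, n_init=10)
    visualizer = KElbowVisualizer(model, k=(1, 10))
    visualizer.fit(self.train_features)
    optimal_k = visualizer.elbow_value_
    # Refit KMeans with the chosen k and return the training cluster ids
    self.kmeans_model = KMeans(n_clusters=optimal_k, random_state=self.random_state, n_init=10)
    cluster_ids = self.kmeans_model.fit_predict(self.train_features)
    return list(cluster_ids)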
KmeansClustering: kmeans_test
KMeans clustering is a process of grouping similar rows together and assigning them to a cluster. For this method you will use the KMeans model you fit on the training data to assign cluster ids to the test data.
To help you get started we have provided a list of subtasks to complete for this task:
1. Use the model you trained in the kmeans_train method to generate cluster ids for each row of the test dataset.
2. Return the cluster ids for each row of the test set as a list.
Useful Resources
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a list of cluster ids that the kmeans model has assigned for each row in the test dataset
Function Skeleton
def kmeans_test(self) -> list:
    cluster_ids = list()
    return cluster_ids
KmeansClustering: train_add_kmeans_cluster_id_feature and test_add_kmeans_cluster_id_feature
Using the two methods you completed above (kmeans_train and kmeans_test) you will add a feature to the training and test dataframes. This is similar to the feature engineering method in Task 2. To do this, take the output of the corresponding method (the list of cluster ids), add it as a new column (named kmeans_cluster_id) in the input dataframe, then return the full dataframe.
Useful Resources
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with kmeans_cluster_id added as a feature and all other input columns unchanged
Function Skeleton
def train_add_kmeans_cluster_id_feature(self) -> pd.DataFrame:
    output_df = pd.DataFrame()
    return output_df

def test_add_kmeans_cluster_id_feature(self) -> pd.DataFrame:
    output_df = pd.DataFrame()
    return output_df
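A minimal sketch (assumes kmeans_train/kmeans_test are called here to obtain the cluster id lists):
def train_add_kmeans_cluster_id_feature(self) -> pd.DataFrame:
    output_df = self.train_features.copy()
    # Add the training cluster ids as a new column
    output_df["kmeans_cluster_id"] = self.kmeans_train()
    return output_df

def test_add_kmeans_cluster_id_feature(self) -> pd.DataFrame:
    output_df = self.test_features.copy()
    # Add the test cluster ids as a new column
    output_df["kmeans_cluster_id"] = self.kmeans_test()
    return output_df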
Task 4 (25 points)
Finally we are ready to try a few different supervised classification models. We have chosen a few commonly used models for you to use here, but there are many options, and in the real world specific algorithms may fit a specific dataset better. You also won't be doing any hyperparameter tuning yet, so you can better focus on writing the code. You will train a model using the training set, predict on the training/test sets, calculate performance metrics, and return a ModelMetrics object and a trained scikit-learn model from each model function. (Note: You should use RFE for determining feature importance of the logistic regression model, but do NOT use RFE for random forest or gradient boosting models; use their built-in feature importance values instead.)
Useful Links:
Training and Test Sets - Machine Learning - Google Developers
Bias–variance tradeoff - Wikipedia
Overfitting - Wikipedia
scikit-learn: machine learning in Python — scikit-learn 1.2.1 documentation
An Introduction to Classification in Machine Learning - builtin
Classification in Machine Learning: An Introduction - DataCamp
Deliverables:
Complete the functions and methods in task4.py
For this task we have released a local test suite; please set it up and use it to debug your functions.
Submit task4.py to Gradescope.
Instructions:
The Task4.py file has function skeletons that you will complete with Python code (mostly using the pandas and scikit-learn libraries). The goal of each of these functions is to give you familiarity with the applied concepts of training a model, using it to score records, and calculating performance metrics for it. See information about each function's inputs, outputs and skeleton below.
Table of contents
1. ModelMetrics
2. calculate_naive_metrics
3. calculate_logistic_regression_metrics
4. calculate_random_forest_metrics
5. calculate_gradient_boosting_metrics
ModelMetrics
In order to simplify the autograding, we have created a class that will hold the metrics and feature importances for a given trained model. You do not have to add anything to this class, but you are expected to use it (put your training and test metric dictionaries and feature importance dataframes inside it for the autograder to handle).
calculate_naive_metrics
A Naive model is a very simple model/prediction that can help to frame how well a more sophisticated model is doing. Since the Naive model is incredibly basic (often a constant result or a randomly selected result), we should expect any more sophisticated model that we train to outperform it. If the Naive model beats a trained model, it can mean that additional data (rows or columns) is needed in the dataset to improve the model, or that the dataset doesn't have a strong enough signal for the target we want to predict. In this function you will use the approach of a constant-output Naive model. You will calculate 4 metrics (accuracy, recall, precision and fscore) for the training and test datasets for a given constant integer as your prediction (passed into the function as the variable naive_assumption).
Useful Resources
https://machinelearningmastery.com/how-to-develop-and-evaluate-naive-classifier-strategies-using-probability/
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
INPUTS
train_features - a dataset split by a function similar to the tts function you created in task2
test_features - a dataset split by a function similar to the tts function you created in task2
train_targets - a dataset split by a function similar to the tts function you created in task2
test_targets - a dataset split by a function similar to the tts function you created in task2
naive_assumption - an integer that should be used as the result from the Naive model
OUTPUTS
A completed ModelMetrics object with training and test metrics dictionaries, with each metric rounded to 4 decimal places
Function Skeleton
def calculate_naive_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, naive_assumption: int) -> ModelMetrics:
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0
    }
    naive_metrics = ModelMetrics("Naive", train_metrics, test_metrics, None)
    return naive_metrics
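A minimal sketch, assuming the constant naive_assumption value is used as the prediction for every row and each metric is rounded to 4 decimal places:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def calculate_naive_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, naive_assumption: int) -> ModelMetrics:
    # The naive prediction is the same constant for every row
    train_preds = [naive_assumption] * len(train_targets)
    test_preds = [naive_assumption] * len(test_targets)
    train_metrics = {
        "accuracy": round(accuracy_score(train_targets, train_preds), 4),
        "recall": round(recall_score(train_targets, train_preds), 4),
        "precision": round(precision_score(train_targets, train_preds), 4),
        "fscore": round(f1_score(train_targets, train_preds), 4)
    }
    test_metrics = {
        "accuracy": round(accuracy_score(test_targets, test_preds), 4),
        "recall": round(recall_score(test_targets, test_preds), 4),
        "precision": round(precision_score(test_targets, test_preds), 4),
        "fscore": round(f1_score(test_targets, test_preds), 4)
    }
    return ModelMetrics("Naive", train_metrics, test_metrics, None)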
calculate_logistic_regression_metrics
A Logistic Regression model is a simple and more explainable statistical model that can be used to estimate the probability of an event (log-odds). At a high level, a Logistic Regression model uses data in the training set to estimate a column's weight in a linear approximation function (conceptually this is similar to estimating m for each column in the line formula you probably know well: y = m*x + b). If you are interested in learning more, you can read up on the math behind how this works. For this project we are more focused on showing you how to apply these models, so you can just use the sklearn Logistic Regression model in your code. For this task, use sklearn's LogisticRegression class to train a Logistic Regression model (initialized using the kwargs passed into the function), predict scores for the training and test datasets, and calculate 7 metrics (accuracy, recall, precision, fscore, false positive rate (fpr), false negative rate (fnr) and Area Under the Receiver Operating Characteristic Curve (roc_auc)) for the training and test datasets using predictions from the fit model, along with a dataframe of the top 10 most important features.
NOTE: Make sure you use the predicted probabilities for roc_auc.
NOTE2: For Feature Importance use the top 10 features selected by RFE and sort by absolute value of the coefficient from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col [Feature] and imp_col [Importance], and that the index is reset to 0-9; you can do this the same way you did in task1).
Useful Resources
https://stats.libretexts.org/Bookshelves/Introductory_Statistics/OpenIntro_Statistics_(Diez_et_al)./08%3A_Multiple_and_Logistic_Regression/8.04%3A_Introduction_to_Logistic_Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
https://en.wikipedia.org/wiki/Confusion_matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
INPUTS
train_features - a dataset split by a function similar to the tts function you created in task2
test_features - a dataset split by a function similar to the tts function you created in task2
train_targets - a dataset split by a function similar to the tts function you created in task2
test_targets - a dataset split by a function similar to the tts function you created in task2
logreg_kwargs - a dictionary with keyword arguments that can be passed directly to the sklearn Logistic Regression class
OUTPUTS
A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
An sklearn Logistic Regression model object fit on the training set
Function Skeleton
def calculate_logistic_regression_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, logreg_kwargs) -> tuple[ModelMetrics, LogisticRegression]:
    model = LogisticRegression()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    log_reg_importance = pd.DataFrame()
    log_reg_metrics = ModelMetrics("Logistic Regression", train_metrics, test_metrics, log_reg_importance)
    return log_reg_metrics, model
Example of Feature Importance Dataframe
   Feature                 Importance
0  density                 -7.1416
1  volatile acidity        -6.6914
2  sulphates                1.4095
3  alcohol                  1.0275
4  fixed acidity           -0.2234
5  pH                      -0.1779
6  residual sugar           0.0683
7  free sulfur dioxide      0.0111
8  total sulfur dioxide    -0.0025
9  citric acid              0.0007
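A minimal sketch of how the seven metrics and an importance table like the one above could be computed, assuming binary 0/1 targets; the helper names classification_metric_dict and top10_logreg_importance are ours, not part of the project template:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score, confusion_matrix)

def classification_metric_dict(model, features: pd.DataFrame, targets: pd.Series) -> dict:
    preds = model.predict(features)
    probs = model.predict_proba(features)[:, 1]  # predicted probabilities for roc_auc
    tn, fp, fn, tp = confusion_matrix(targets, preds).ravel()
    return {
        "accuracy": round(accuracy_score(targets, preds), 4),
        "recall": round(recall_score(targets, preds), 4),
        "precision": round(precision_score(targets, preds), 4),
        "fscore": round(f1_score(targets, preds), 4),
        "fpr": round(fp / (fp + tn), 4),  # false positive rate
        "fnr": round(fn / (fn + tp), 4),  # false negative rate
        "roc_auc": round(roc_auc_score(targets, probs), 4),
    }

def top10_logreg_importance(logreg_kwargs, train_features: pd.DataFrame, train_targets: pd.Series) -> pd.DataFrame:
    # RFE keeps the 10 strongest features; its fitted inner estimator provides
    # the coefficients, which are then sorted by absolute value (biggest to smallest).
    rfe = RFE(LogisticRegression(**logreg_kwargs), n_features_to_select=10)
    rfe.fit(train_features, train_targets)
    kept = train_features.columns[rfe.support_]
    imp = pd.DataFrame({"Feature": kept, "Importance": rfe.estimator_.coef_[0]})
    order = imp["Importance"].abs().sort_values(ascending=False).index
    return imp.loc[order].reset_index(drop=True)

# Possible wiring inside calculate_logistic_regression_metrics:
# model = LogisticRegression(**logreg_kwargs).fit(train_features, train_targets)
# train_metrics = classification_metric_dict(model, train_features, train_targets)
# test_metrics = classification_metric_dict(model, test_features, test_targets)
# log_reg_importance = top10_logreg_importance(logreg_kwargs, train_features, train_targets)
# return ModelMetrics("Logistic Regression", train_metrics, test_metrics, log_reg_importance), model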
calculate_random_forest_metrics
A Random Forest model is a more complex model than the Naive and Logistic Regression models you have trained so far. It can still be used to estimate the probability of an event, but it achieves this using a different underlying structure: a tree-based model. Conceptually this looks a lot like many if/else statements chained together into a "tree". A Random Forest expands on this and trains many different trees with different subsets of the data and starting conditions to get a better estimate than a single tree would give. For this project we are more focused on showing you how to apply these models, so you can just use the sklearn Random Forest model in your code. For this task, use sklearn's Random Forest Classifier class to train a Random Forest model (initialized using the kwargs passed into the function), predict scores for the training and test datasets, and calculate 7 metrics (accuracy, recall, precision, fscore, false positive rate (fpr), false negative rate (fnr) and Area Under the Receiver Operating Characteristic Curve (roc_auc)) for the training and test datasets using predictions from the fit model, along with a dataframe of the top 10 most important features.
NOTE: Make sure you use the predicted probabilities for roc_auc.
NOTE2: For Feature Importance use the top 10 features selected by the built-in method and sort by importance from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col [Feature] and imp_col [Importance], and that the index is reset to 0-9; you can do this the same way you did in task1).
Useful Resources
https://blog.dataiku.com/tree-based-models-how-they-work-in-plain-english
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
https://en.wikipedia.org/wiki/Confusion_matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
INPUTS
train_features - a dataset split by a function similar to the tts function you created in task2
test_features - a dataset split by a function similar to the tts function you created in task2
train_targets - a dataset split by a function similar to the tts function you created in task2
test_targets - a dataset split by a function similar to the tts function you created in task2
rf_kwargs - a dictionary with keyword arguments that can be passed directly to the sklearn RandomForestClassifier class
OUTPUTS
A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
An sklearn Random Forest model object fit on the training set
Function Skeleton
def calculate_random_forest_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, rf_kwargs) -> tuple[ModelMetrics, RandomForestClassifier]:
    model = RandomForestClassifier()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    rf_importance = pd.DataFrame()
    rf_metrics = ModelMetrics("Random Forest", train_metrics, test_metrics, rf_importance)
    return rf_metrics, model
Example of Feature Importance Dataframe
   Feature                 Importance
0  alcohol                  0.3567
1  density                  0.183
2  volatile acidity         0.1478
3  chlorides                0.0776
4  free sulfur dioxide      0.0725
5  citric acid              0.0684
6  total sulfur dioxide     0.0421
7  residual sugar           0.0187
8  fixed acidity            0.0144
9  pH                       0.0097
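A minimal sketch of the Random Forest-specific pieces, assuming the hypothetical classification_metric_dict helper from the Logistic Regression sketch above is in scope (top10_builtin_importance is likewise our name, not part of the project template):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top10_builtin_importance(model, feature_names) -> pd.DataFrame:
    # Tree ensembles expose impurity-based importances after fitting via the
    # built-in feature_importances_ attribute.
    imp = pd.DataFrame({"Feature": feature_names,
                        "Importance": model.feature_importances_})
    return (imp.sort_values("Importance", ascending=False)
               .head(10)
               .reset_index(drop=True))

# Possible wiring inside calculate_random_forest_metrics:
# model = RandomForestClassifier(**rf_kwargs).fit(train_features, train_targets)
# train_metrics = classification_metric_dict(model, train_features, train_targets)
# test_metrics = classification_metric_dict(model, test_features, test_targets)
# rf_importance = top10_builtin_importance(model, train_features.columns)
# return ModelMetrics("Random Forest", train_metrics, test_metrics, rf_importance), model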
calculate_gradient_boosting_metrics
A Gradient Boosted model is a more complex model than the Naive and Logistic Regression models and is similar in structure to the Random Forest model you just trained. A Gradient Boosted model expands on the tree-based approach by using each additional tree to predict the errors of the previous trees. For this project we are more focused on showing you how to apply these models, so you can just use the sklearn Gradient Boosted model in your code. For this task, use sklearn's Gradient Boosting Classifier class to train a Gradient Boosted model (initialized using the kwargs passed into the function), predict scores for the training and test datasets, and calculate 7 metrics (accuracy, recall, precision, fscore, false positive rate (fpr), false negative rate (fnr) and Area Under the Receiver Operating Characteristic Curve (roc_auc)) for the training and test datasets using predictions from the fit model, along with a dataframe of the top 10 most important features.
NOTE: Make sure you use the predicted probabilities for roc_auc.
NOTE2: For Feature Importance use the top 10 features selected by the built-in method and sort by importance from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col [Feature] and imp_col [Importance], and that the index is reset to 0-9; you can do this the same way you did in task1).
Useful Resources
https://blog.dataiku.com/tree-based-models-how-they-work-in-plain-english
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
https://en.wikipedia.org/wiki/Confusion_matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
INPUTS
train_features - a dataset split by a function similar to the tts function you created in task2
test_features - a dataset split by a function similar to the tts function you created in task2
train_targets - a dataset split by a function similar to the tts function you created in task2
test_targets - a dataset split by a function similar to the tts function you created in task2
gb_kwargs - a dictionary with keyword arguments that can be passed directly to the sklearn GradientBoostingClassifier class
OUTPUTS
A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
An sklearn Gradient Boosted model object fit on the training set
Function Skeleton
def calculate_gradient_boosting_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, gb_kwargs) -> tuple[ModelMetrics, GradientBoostingClassifier]:
    model = GradientBoostingClassifier()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    gb_importance = pd.DataFrame()
    gb_metrics = ModelMetrics("Gradient Boosting", train_metrics, test_metrics, gb_importance)
    return gb_metrics, model
Example of Feature Importance Dataframe
   Feature                 Importance
0  alcohol                  0.3495
1  volatile acidity         0.2106
2  free sulfur dioxide      0.1077
3  residual sugar           0.0599
4  fixed acidity            0.0451
5  citric acid              0.045
6  total sulfur dioxide     0.0426
7  chlorides                0.0381
8  density                  0.0367
9  sulphates                0.0326
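Since GradientBoostingClassifier also exposes feature_importances_, a sketch of the whole function can reuse the hypothetical classification_metric_dict and top10_builtin_importance helpers from the earlier sketches together with the project's ModelMetrics class (all assumed to be in scope):

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def sketch_gradient_boosting_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame,
                                     train_targets: pd.Series, test_targets: pd.Series, gb_kwargs):
    # Fit the model with the caller-supplied keyword arguments.
    model = GradientBoostingClassifier(**gb_kwargs).fit(train_features, train_targets)
    # Same seven metrics and top-10 importance table as the other tree-based model.
    train_metrics = classification_metric_dict(model, train_features, train_targets)
    test_metrics = classification_metric_dict(model, test_features, test_targets)
    gb_importance = top10_builtin_importance(model, train_features.columns)
    return ModelMetrics("Gradient Boosting", train_metrics, test_metrics, gb_importance), model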
Task 5 (20 points)
Now that you have written functions for the different steps of the model-building process, you will put it all together. You will write code that trains a model with hyperparameters you determine (you should do any tuning locally or in a notebook, i.e. don't tune your model in Gradescope since the autograder will likely time out). It will take in the CLAMP training data (note: the "class" column is the target for this dataset), train a model, then predict on a test set (the "class" column will be removed to simulate new files that will be classified by your model) and output values from 0 to 1 (values close to 0 being less likely to be malicious and values closer to 1 being more likely to be malicious) for each row. Our autograder will compare your predictions with the correct answers, and to get credit you will need a roc_auc score of .9 or higher on the test set (this should not require much hyperparameter tuning for this dataset). This is basically a simulation of how your model would perform in a "production" system using batch inference.
Instructions:
Make use of any of the techniques we covered in this project to train a model and return predicted probabilities for each row of the test set as a DataFrame with columns index (same as the index from the input test df) and malware_score (predicted probabilities); a minimal sketch is shown after the sample submission below.
Complete the train_model_return_scores function in task5.py
Sample Submission:
index    malware_score
0        0.65
1        0.1
...      ...
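One possible (hedged) shape for train_model_return_scores that would produce output like the sample above. Assumptions: the function receives the CLAMP train and test data as DataFrames with numeric feature columns, the RandomForestClassifier settings shown are illustrative placeholders rather than tuned values, and the real signature in task5.py may differ:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_model_return_scores(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    # "class" is the target column named in the task description.
    train_features = train_df.drop(columns=["class"])
    train_targets = train_df["class"]
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(train_features, train_targets)
    # Probability of the positive (malicious) class for each test row.
    scores = model.predict_proba(test_df[train_features.columns])[:, 1]
    return pd.DataFrame({"index": test_df.index, "malware_score": scores})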
Deliverables:
Submit task5.py to Gradescope