Machine Learning
Learning Goals of this Project:
Learning Basic Pandas Dataframe Manipulations
Learning more about Machine Learning (ML) Classification models and how they are used in a Cybersecurity Context.
Learning about basic Data pipelines and Transformations
Learning how to write and use Unit Tests when developing Python code
Important Reference Materials:
NumPy Documentation
Pandas Documentation
Scikit-learn Documentation
Introduction Video
BACKGROUND
Many of the projects in CS6035 focus on offensive security tasks, the Red Team activities that many of us associate with cybersecurity. This project focuses instead on defensive security tasks, the Blue Team activities carried out by many corporate security teams.
Historically, defensive security professionals have investigated malicious activity/files/code to create patterns (often called signatures) that can be used to detect, and prevent, that malicious activity/files/code when the same pattern appears again. This was a relatively effective way of preventing known malware from infecting systems, but it does nothing to protect against novel attacks. As attackers became more sophisticated, they learned to tweak (or simply encode) their malicious activity/files/code to evade these simple pattern-matching detections.
Given that background, it would be useful if a more general solution could score the activity/files/code that pass through corporate systems every day and tell the security team that, while a certain pattern may not exactly match a known malicious signature, it is very similar to malicious examples seen in the past. Machine Learning models can do exactly that when provided with proper training data, so it is no surprise that one of the most powerful tools in the hands of defensive cybersecurity professionals is Machine Learning. Modern detection systems usually combine Machine Learning models with pattern matching (Regular Expressions) to detect and prevent malicious activity on networks and devices.
This project will teach the basic fundamentals of data analysis and building/testing your own ML models in Python using the open-source libraries Pandas and scikit-learn.
Cybersecurity Machine Learning Careers and Trends
Machine learning in cybersecurity is a growing field. The area was considered among the top trends by McKinsey in 2022.
Additional Information
ML in Cybersecurity - Crowdstrike
AI for Cybersecurity - IBM
Future of Cybersecurity and AI - Deloitte
Frequently Asked Question(s) (FAQ)
Getting Started
Q: Are there any recommended documentation resources for Python libraries used on the project?
A: The scikit-learn documentation is very useful for understanding how certain machine learning functions work and can serve as a valuable resource. The NumPy documentation can help with understanding common data structures and manipulation techniques used in data analysis. The Pandas documentation can help with understanding how to create and manipulate dataframes. Other sources may be useful as well.
Q: Are there any recommended video resources for Python libraries used on the project?
A: YouTube can serve as an excellent source of learning for those who enjoy videos. One video that may be helpful for getting a feel for machine learning concepts is Machine Learning for Everybody - Full Course, created by freeCodeCamp.
Q: What general skills are needed to succeed on this project?
A:
o Familiarity with Python programming environments and packaging:
  Functions, parameters, and the self keyword
  Basic operators and loops
  Basic understanding of the NumPy and Pandas packages
o Familiarity with data science concepts:
  Basic dataset preprocessing
  Basic train/test splits
  Basic implementation of scikit-learn modeling
  Basic clustering and PCA
o High level understanding of data science algorithms:
  Supervised learning models
  Unsupervised learning models
  High level understanding of model comparison metrics
Q: I am overwhelmed and don’t know where to start.
A: Start by reviewing the useful links/videos we have provided and doing the coding tasks (tasks 1-5) in order. They build on each other somewhat and get progressively harder, so the early tasks are easier to complete.
General Project Questions
Q: When are office hours for this project?
A: There will be a pinned Ed Discussion post with office hour dates/times, as well as recordings after they take place.
Q: Should I make my own post related to this project in Ed Discussion?
A: Please ask your question in one of the pinned project posts so others can benefit from it, and remove answer data (i.e., don't post your code, even snippets) or any other information that should not be publicly shared.
Q: Can you review my code in a private Ed Discussion Post?
A: Since we have a Gradescope autograder we will not review student code; we expect you to debug your code using information in public Ed Discussion posts or via Google searches/Stack Overflow.
Q: I have constructive feedback that can improve this project for next semester.
A: Open a private Ed Discussion post and we will review your ideas and may implement them in a future semester.
Submission and Gradescope
Q: How many submissions do we have in Gradescope?
A: Unlimited
Q: Do I have to submit all 5 tasks at once or can I submit 1 at a time and get a superscore of my task submissions?
A: You need to submit all 5 task files with your final submission. The score you see in Gradescope will be the score you get in Canvas.
Q: I can't see any scores/output in the autograder. Is it broken?
A: We have a protection in the autograder to prevent printing sensitive information, so if your code has print statements you won't see your score or any autograder output. Resubmit your code with the print statements removed and you should see the normal outputs.
Q: I think I found a bug in the Autograder
A: Open a private Ed Discussion post and we can take a look. This is a relatively new project as CS6035 projects go, so there is a chance we missed an edge case in how the autograder checks your solution. If so, we will update the autograder and make a pinned post letting students know it was changed.
Task Hints and Questions
Q: I am using RFE to find the feature importance of a random forest or gradient boosting model, and it runs for a long time and times out in the autograder.
A: Only use RFE for logistic regression models; use the built-in feature importance values for random forest and gradient boosting models.
Setup
If you want to run this assignment locally, we suggest you use the following local setup instructions. You can install and run the packages with a variety of other software environments, but you will need to figure out how to install and run the code in those environments yourself.
Anaconda Installation
First, download Anaconda from their website: Anaconda Download. There are installers for Windows, Linux and Mac, so make sure you download the right version, then run the installation wizard.
For more information on how to install it see the following docs:
Installing on Windows
Installing on Mac
Installing on Linux
Environment Setup
Now that you have Anaconda installed, we will set up the Python environment.
Note: If you go the PyCharm route, you can have it install your Anaconda environment from the env.yml file we give you by following this guide.
1. Open Anaconda Prompt.
2. Download the Starter Code and Student Local Testing folders from Canvas.
3. Navigate to the Student Local Testing folder.
4. Inside that folder there is an env.yml file which Anaconda can use to install all the required packages. To install that environment, run conda env create --file env.yml from the Anaconda Prompt once you are inside the Student Local Testing folder.
5. Anaconda should install your environment; from now on, activate it by running conda activate cs6035_ML from inside Anaconda Prompt.
Unit Test Setup
Before running any unit tests, copy the edited task files that you want to test locally into the Student Local Testing folder.
Then follow the directions for VS Code or PyCharm, depending on which you have installed (or decide to install) on your machine.
Visual Studio Code
If you don't already have Visual Studio Code installed on your machine you can follow the install guide:
Installing on Windows
Installing on Mac
Installing on Linux
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Local Testing Video
Next you need to select the Python interpreter from the Anaconda environment we created (cs6035_ML) in VS Code. Here is a guide for doing so on Windows if you don't have Anaconda in your system's PATH. VS Code also has some documentation on environments if you are struggling.
To Set Up Unit Testing:
1. Open a VS Code window with the Student Local Testing folder.
2. Follow the testing docs, or simply click the beaker shape on the left sidebar and then Configure Python Tests.
3. Select unittest from the framework options.
4. Select tests as the directory containing the tests.
5. Select test_*.py as the pattern to identify test files.
You should now see a dropdown with the unit tests in the left sidebar; you can click the play button to run all the unit tests or run each one individually. Tests with a red X show either an error or an incorrect answer, while green checkmarks indicate passed test cases.
Pycharm Community Edition
If you don't have PyCharm installed on your machine you can follow the install guide:
Install Guide
We suggest installing the Community Edition (free version).
Once you have it installed, use this guide to set up your IDE with your Anaconda environment. You can reference the instructions for creating a conda environment based on environment.yaml.
Next you can follow the Pycharm Testing Docs to configure the test cases:
Test your first Python application
Testing Docs
The Test Sources Root for PyCharm should be the tests folder inside the Student Local Testing folder.
Notebooks
You can use Google's Colab, VS Code's Notebooks, PyCharm's Notebooks or Jupyter Notebooks to write/debug your code for this assignment, but we will not provide any support for this method. Ultimately you will still have to submit the Task .py files to Gradescope, so make sure your code runs in those Python files and not just in a notebook.
Task 1 (15 points)
Let's first get familiar with some Pandas basics. Pandas is a library that handles dataframes, which you can think of as a Python class that handles tabular data. In the real world you would generally also use plotting tools like Power BI, Tableau, Data Studio, Matplotlib, etc., to create graphics and other visuals to better understand the dataset you are working with; this step is generally known as Exploratory Data Analysis. Since we are using an autograder for this class, we will skip the plotting for this project. For this task we have released a test suite; if you are struggling to understand the expected inputs and outputs for a function, please set it up and use it to debug your function.
Useful Links:
Pandas documentation — Pandas 1.5.3 documentation (pydata.org)
What is Exploratory Data Analysis? - IBM
Top Data Visualization Tools - KDnuggets
Getting Started Video
Getting started with Notebooks and Functions Video
Deliverables:
Complete the functions in task1.py
For this task we have released a local test suite; please set it up and use it to debug your functions.
Submit task1.py to gradescope
Instructions:
The Task1.py file has function skeletons that you will complete with Python code (mostly using the pandas library). The goal of each of these functions is to give you familiarity with the pandas library and some general Python concepts, like classes, which you may not have seen before. See information about each function's inputs, outputs and skeleton below.
Table of contents
1. find_data_type
2. set_index_col
3. reset_index_col
4. set_col_type
5. make_DF_from_2d_array
6. sort_DF_by_column
7. drop_NA_cols
8. drop_NA_rows
9. make_new_column
10. left_merge_DFs_by_column
11. simpleClass
12. find_dataset_statistics
find_data_type
In this function you will take a dataset and the name of one of its columns and return that column's datatype.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
INPUTS
dataset - a pandas dataframe that contains some data
column_name - a string containing the name of the column to inspect
OUTPUTS
data type of the column (np.dtype)
Function Skeleton
def find_data_type(dataset: pd.DataFrame, column_name: str) -> np.dtype:
    return np.dtype()
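A minimal sketch of one possible implementation (not the official solution); a column's dtype is available directly on the column:
def find_data_type(dataset: pd.DataFrame, column_name: str) -> np.dtype:
    # Select the column and return its numpy dtype
    return dataset[column_name].dtype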
set_index_col
In this function you will take a dataset and a series and set the index of the dataset to be the series
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html
INPUTS
dataset - a pandas dataframe that contains some data
index - a pandas series that contains an index for the dataset
OUTPUTS
a pandas dataframe indexed by the given index series
Function Skeleton
def set_index_col(dataset: pd.DataFrame, index: pd.Series) -> pd.DataFrame:
    return pd.DataFrame()
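One possible sketch using pandas set_index (assuming the given series has the same length as the dataframe):
def set_index_col(dataset: pd.DataFrame, index: pd.Series) -> pd.DataFrame:
    # set_index accepts an array-like/Series and returns a new dataframe indexed by it
    return dataset.set_index(index)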
reset_index_col
In this function you will take a dataset with an index already set and reindex the dataset from 0 to n-1, dropping the old index
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
INPUTS
dataset - a pandas dataframe that contains some data
OUTPUTS
a pandas dataframe indexed from 0 to n-1
Function Skeleton
def reset_index_col(dataset: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
set_col_type
In this function you will be given a dataframe, column name and column type. You will edit the dataset to take the column name you are given and set it to be the type given in the input variable
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
INPUTS
dataset - a pandas dataframe that contains some data
column_name - a string containing the name of a column
new_col_type - a type to change the column to
OUTPUTS
a pandas dataframe with the column in column_name changed to the type in new_col_type
Function Skeleton
# Set astype (string, int, datetime)
def set_col_type(dataset: pd.DataFrame, column_name: str, new_col_type: type) -> pd.DataFrame:
    return pd.DataFrame()
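A minimal sketch using DataFrame.astype with a single-column mapping:
def set_col_type(dataset: pd.DataFrame, column_name: str, new_col_type: type) -> pd.DataFrame:
    # astype with a dict converts only the named column and returns a new dataframe
    return dataset.astype({column_name: new_col_type})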
make_DF_from_2d_array
In this function you will take data in an array as well as column and row labels and use that information to create a pandas dataframe
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
INPUTS
array_2d - a 2 dimensional numpy array of values
column_name_list - a list of strings holding column names
index - a pandas series holding the row indices
OUTPUTS
a pandas dataframe with columns set from column_name_list, row index set from index and data set from array_2d
Function Skeleton
# Take a matrix of numbers and make it into a dataframe with column names and index numbering
def make_DF_from_2d_array(array_2d: np.array, column_name_list: list[str], index: pd.Series) -> pd.DataFrame:
    return pd.DataFrame()
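A minimal sketch; the DataFrame constructor accepts the data, column names and index directly:
def make_DF_from_2d_array(array_2d: np.array, column_name_list: list[str], index: pd.Series) -> pd.DataFrame:
    # Build the dataframe from the 2d array, labeling columns and rows as requested
    return pd.DataFrame(data=array_2d, columns=column_name_list, index=index)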
sort_DF_by_column
In this function you are given a dataset and a column name and will return the dataset sorted by that column (sort rows by the value of the column specified and do not reindex), in either descending or ascending order depending on the value of the descending variable.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
INPUTS
dataset - a pandas dataframe that contains some data
column_name - a string that contains the column name to sort the data on
descending - a boolean value (True or False) indicating whether the column should be sorted in descending order
OUTPUTS
a pandas dataframe sorted by the given column name, in descending or ascending order depending on the value of the descending variable
Function Skeleton
# Sort dataframe by values
def sort_DF_by_column(dataset: pd.DataFrame, column_name: str, descending: bool) -> pd.DataFrame:
    return pd.DataFrame()
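A minimal sketch using sort_values; sorting is ascending by default, so the flag is inverted:
def sort_DF_by_column(dataset: pd.DataFrame, column_name: str, descending: bool) -> pd.DataFrame:
    # Sort rows by the given column without changing the index
    return dataset.sort_values(by=column_name, ascending=not descending)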
drop_NA_cols
In this function you are given a dataframe; you will return a dataframe with any columns containing NA values dropped.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
INPUTS
dataset - a pandas dataframe that contains some data
OUTPUTS
a pandas dataframe with any columns that contain an NA value dropped
Function Skeleton
# Drop dataframe columns containing NA values
def drop_NA_cols(dataset: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
drop_NA_rows
In this function you are given a dataframe; you will return a dataframe with any rows containing NA values dropped.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
INPUTS
dataset - a pandas dataframe that contains some data
OUTPUTS
a pandas dataframe with any rows that contain an NA value dropped
Function Skeleton
def drop_NA_rows(dataset: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
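Minimal sketches for both drop functions; dropna's axis argument controls whether columns or rows are dropped:
def drop_NA_cols(dataset: pd.DataFrame) -> pd.DataFrame:
    # axis=1 drops any column that contains at least one NA value
    return dataset.dropna(axis=1)

def drop_NA_rows(dataset: pd.DataFrame) -> pd.DataFrame:
    # axis=0 (the default) drops any row that contains at least one NA value
    return dataset.dropna(axis=0)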
make_new_column
In this function you are given a dataset, a new column name and a static value for the new column; add the new column to the dataset and return the dataset.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/05_add_columns.html
INPUTS
dataset - a pandas dataframe that contains some data
new_column_name - a string containing the name of the new column to be created
new_column_value - a string containing a static value that will be set for the new column for every row
OUTPUTS
a pandas dataframe with the new column created named new_column_name and filled with the value in new_column_value
Function Skeleton
def make_new_column(dataset: pd.DataFrame, new_column_name: str, new_column_value: str) -> pd.DataFrame:
    return pd.DataFrame()
left_merge_DFs_by_column
In this function you are given 2 datasets and the name of a column on which you will left join them (left dataset is dataset1, right dataset is dataset2) using the pandas merge method.
Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
https://stackoverflow.com/questions/53645882/pandas-merging-101
INPUTS
left_dataset - a pandas dataframe that contains some data
right_dataset - a pandas dataframe that contains some data
join_col_name - a string containing the column name to join the two dataframes on
OUTPUTS
a pandas dataframe containing the two datasets left joined together on the given column
Function Skeleton
def left_merge_DFs_by_column(left_dataset: pd.DataFrame, right_dataset: pd.DataFrame, join_col_name: str) -> pd.DataFrame:
    return pd.DataFrame()
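A minimal sketch using DataFrame.merge with a left join:
def left_merge_DFs_by_column(left_dataset: pd.DataFrame, right_dataset: pd.DataFrame, join_col_name: str) -> pd.DataFrame:
    # how="left" keeps every row of the left dataset and joins matching rows from the right
    return left_dataset.merge(right_dataset, on=join_col_name, how="left")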
simpleClass
This project will require you to work with Python Classes. If you are not familiar with them we suggest learning a bit more about them.
You will take the inputs into the Class initialization and set them as instance variables (of the same name) in the python class
Useful Resources
https://www.w3schools.com/python/python_classes.asp
INPUTS
length - an integer
width - an integer
height - an integer
OUTPUTS
None
Function Skeleton
class simpleClass():
    def __init__(self, length: int, width: int, height: int):
        pass
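A minimal sketch that stores each input as an instance variable of the same name:
class simpleClass():
    def __init__(self, length: int, width: int, height: int):
        # Store the constructor arguments as instance variables of the same name
        self.length = length
        self.width = width
        self.height = height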
find_dataset_statistics
Now that you have learned a bit about pandas dataframes, you can start using them to generate some simple summary statistics for a dataframe. You will be given the dataset as an input variable, as well as the name of a column in the dataset that contains binary (0 for negative and 1 for positive) values that you will summarize.
Useful Resources
https://www.learndatasci.com/glossary/binary-classification/
https://developers.google.com/machine-learning/crash-course/framing/ml-terminology
INPUTS
dataset - a pandas dataframe that contains some data
label_col - a string containing the name of the label column
OUTPUTS
n_records (int) - the number of rows in the dataset
n_columns (int) - the number of columns in the dataset
n_negative (int) - the number of "negative" samples in the dataset (label column equals 0)
n_positive (int) - the number of "positive" samples in the dataset (label column equals 1)
perc_positive (float) - the percentage (out of 100%) of positive samples in the dataset
Function Skeleton
import numpy as np
import pandas as pd

def find_dataset_statistics(dataset: pd.DataFrame, label_col: str) -> tuple[int, int, int, int, float]:
    n_records = # TODO
    n_columns = # TODO
    n_negative = # TODO
    n_positive = # TODO
    perc_positive = # TODO
    return n_records, n_columns, n_negative, n_positive, perc_positive
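A minimal sketch, assuming the label column contains only 0/1 values as described above:
def find_dataset_statistics(dataset: pd.DataFrame, label_col: str) -> tuple[int, int, int, int, float]:
    n_records = len(dataset)                           # number of rows
    n_columns = len(dataset.columns)                   # number of columns
    n_negative = int((dataset[label_col] == 0).sum())  # rows labeled 0
    n_positive = int((dataset[label_col] == 1).sum())  # rows labeled 1
    perc_positive = 100 * n_positive / n_records       # percentage out of 100%
    return n_records, n_columns, n_negative, n_positive, perc_positive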
Task 2 (25 points)
Now that you have a basic understanding of pandas and the dataset, it is time to dive into some more complex data processing tasks. These are basic concepts in model building, but at a high level it is important to hold out a subset of your data when you train a model so you can measure the expected performance on unseen samples and determine whether the resulting model is overfit (performs much better on training data than on test data). Preprocessing data is important since most models only accept numerical values, so categorical features need to be "encoded" to numerical values before models can use them. Numerical scaling can be more or less useful depending on the type of model used, but it is especially important for linear models. These preprocessing techniques give you options to augment your dataset and improve model performance.
Useful Links:
Training and Test Sets - Machine Learning - Google Developers
Bias–variance tradeoff - Wikipedia
Overfitting - Wikipedia
Categorical and Numerical Types of Data - 365 Data Science
scikit-learn: machine learning in Python — scikit-learn 1.2.1 documentation
Deliverables:
Complete the functions and methods in task2.py
For this task we have released a local test suite; please set it up and use it to debug your functions.
Submit task2.py to Gradescope.
Instructions:
The Task2.py file has function skeletons that you will complete with Python code (mostly using the pandas and scikit-learn libraries). The goal of each of these functions is to give you familiarity with the applied concepts of splitting and preprocessing data. See information about each function's inputs, outputs and skeleton below.
Table of contents
1. tts
2. PreprocessDataset
   1. __init__
   2. One Hot Encoding
   3. Min/Max Scaling
   4. PCA
   5. Feature Engineering
   6. Preprocess
tts
In this function you will take a dataset, the name of its label column, the percentage of the data to put into the test set, whether to stratify on the label column, and a random state to pass to the sklearn function, and you will return features and labels for the training and test sets. At a high level you can separate the task into 2 subtasks: first, splitting your dataset into features and labels (by columns), and second, splitting your dataset into training and test sets (by rows). You should use the sklearn train_test_split function, but you will have to write wrapper code around it based on the input values we give you.
Useful Resources
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
https://developers.google.com/machine-learning/crash-course/framing/ml-terminology
https://stackoverflow.com/questions/40898019/what-is-the-difference-between-a-feature-and-a-label
INPUTS
dataset - a pandas dataframe that contains some data
label_col - a string containing the name of the column that contains the label values (what our model wants to predict)
test_size - a float containing the decimal fraction of rows that the test set should contain
stratify - a boolean (True or False) value indicating whether the resulting train/test split should be stratified
random_state - an integer value to set the randomness of the function (useful for repeatability, especially when autograding)
OUTPUTS
train_features - a pandas dataframe that contains the train rows and the feature columns
test_features - a pandas dataframe that contains the test rows and the feature columns
train_labels - a pandas dataframe that contains the train rows and the label column
test_labels - a pandas dataframe that contains the test rows and the label column
Function Skeleton
def tts(dataset: pd.DataFrame,
        label_col: str,
        test_size: float,
        stratify: bool,
        random_state: int) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    # TODO
    return train_features, test_features, train_labels, test_labels
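A minimal sketch wrapping sklearn's train_test_split (assumes the stratify flag means "stratify on the label column when True"):
from sklearn.model_selection import train_test_split

def tts(dataset: pd.DataFrame,
        label_col: str,
        test_size: float,
        stratify: bool,
        random_state: int) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    # Split columns into features and label
    features = dataset.drop(columns=[label_col])
    labels = dataset[label_col]
    # Split rows into train and test, optionally stratifying on the label values
    train_features, test_features, train_labels, test_labels = train_test_split(
        features,
        labels,
        test_size=test_size,
        stratify=labels if stratify else None,
        random_state=random_state)
    return train_features, test_features, train_labels, test_labels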
PreprocessDataset
The PreprocessDataset class contains a code skeleton with 9 methods for you to implement. Most methods are split into 2 parts: one that will be run on the training dataset and one that will be run on the test dataset. In Data Science/Machine Learning this is done to avoid something called Data Leakage. For this assignment we don't expect you to understand the nuances of the concept, but we will have you follow principles that minimize the chances of it occurring. You will accomplish this by splitting data into training and test datasets and processing those datasets in slightly different ways. Generally, for everything you do in this project (and any ML or Data Science work you do in the future), you should train/fit on the training data first, then predict/transform on both the training and test data. That holds for basic preprocessing steps like Task 2 and for complex models like you will see in Tasks 3 and 4. For the purposes of this project (and more generally in any ML project) you should never train or fit on the test data, because your test data is meant to give you an understanding of how your model/predictions will perform on unseen data. If you fit even a preprocessing step to your test data, you are either giving the model information about the test set it wouldn't have about unseen data (if you combine train and test and fit to both) or you are providing different preprocessing than the model expects (if you fit a separate preprocessor to the test data), and your model should not be expected to perform well.
Note: You should train/fit using the train dataset; once you have a fit encoder/scaler/pca/model instance you can transform/predict on both the training and test data.
You will also notice that we are only preprocessing the features and not the labels. There are a few cases where preprocessing steps on labels may be helpful in modeling, but they are more advanced and out of scope for this introduction. Generally you will not need to do any preprocessing to your labels beyond potentially encoding a string value (i.e. "Malware" or "Benign") into an integer value (0 or 1), which is called Label Encoding.
PreprocessDataset: __init__
Similar to the Task 1 simpleClass subtask you previously completed, you will initialize the class by adding instance variables (add all the inputs).
Useful Resources
https://www.w3schools.com/python/python_classes.asp
INPUTS
train_features - a dataset split by a function similar to tts which should be used in the training/fitting steps
test_features - a dataset split by a function similar to tts which should be used in the test steps
one_hot_encode_cols - a list of column names (strings) that should be one hot encoded by the one hot encode methods
min_max_scale_cols - a list of column names (strings) that should be min/max scaled by the min/max scaling methods
n_components - an int that contains the number of components that should be used in Principal Component Analysis
feature_engineering_functions - a dictionary that contains feature name and function to create that feature as a key value pair (example shown below)
Example of feature_engineering_functions:
def double_height(dataframe: pd.DataFrame):
    return dataframe["height"] * 2

def half_height(dataframe: pd.DataFrame):
    return dataframe["height"] / 2

feature_engineering_functions = {"double_height": double_height, "half_height": half_height}

Don't worry about copying it; we also have examples in the local test cases. This is just provided as an illustration of what to expect in your function.
OUTPUTS
None
Function Skeleton
def __init__(self,
             train_features: pd.DataFrame,
             test_features: pd.DataFrame,
             one_hot_encode_cols: list[str],
             min_max_scale_cols: list[str],
             n_components: int,
             feature_engineering_functions: dict):
    # TODO: Add any instance variables you may need to make your functions work
    return
PreprocessDataset: one_hot_encode_columns_train and one_hot_encode_columns_test
One Hot Encoding is the process of taking a column and returning a binary vector representing the various values within it. There is a separate function for the training and test datasets since they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).
Pseudocode
one_hot_encode_columns_train()
1. In the __init__() method, initialize an instance variable containing an sklearn OneHotEncoder with any parameters you may need.
2. Split train_features into 2 dataframes: a dataframe with only the columns you want to one hot encode (using one_hot_encode_cols) and a dataframe with all the other columns.
3. Fit the OneHotEncoder using the dataframe you split from train_features with the columns you want to encode.
4. Transform the dataframe you split from train_features with the columns you want to encode using the fit OneHotEncoder.
5. Create a dataframe from the 2d array of data that step 4 gave you, with column names in the form columnName_categoryName (there is an attribute in OneHotEncoder that can help you with this) and the same index that train_features had.
6. Join the dataframe you made in step 5 with the dataframe of other columns from step 2.
one_hot_encode_columns_test()
1. Split test_features into 2 dataframes: a dataframe with only the columns you want to one hot encode (using one_hot_encode_cols) and a dataframe with all the other columns.
2. Transform the dataframe you split from test_features with the columns you want to encode using the OneHotEncoder you fit in one_hot_encode_columns_train().
3. Create a dataframe from the 2d array of data that step 2 gave you, with column names in the form columnName_categoryName (there is an attribute in OneHotEncoder that can help you with this) and the same index that test_features had.
4. Join the dataframe you made in step 3 with the dataframe of other columns from step 1.
Example Walkthrough (from Local Testing suite):
INPUTS:
one_hot_encode_cols
["color","version"]
Train Features
index  color   version  cost   height
0      red     1        5.99   12
6      yellow  6        10.99  18
3      red     1        5.99   15
9      red     8        12.99  21
2      blue    3        5.99   14
5      orange  5        10.99  17
1      green   2        5.99   13
7      green   2        12.99  19
Test Features
index  color   version  cost   height
4      purple  4        10.99  16
8      blue    3        12.99  20
TRAIN DATAFRAMES AT EACH STEP:
1. Dataframe with columns to encode:
index  color   version
0      red     1
6      yellow  6
3      red     1
9      red     8
2      blue    3
5      orange  5
1      green   2
7      green   2
Dataframe with other columns:
index  cost   height
0      5.99   12
6      10.99  18
3      5.99   15
9      12.99  21
2      5.99   14
5      10.99  17
1      5.99   13
7      12.99  19
3. One Hot Encoded 2d array:
0 0 0 1 0 1 0 0 0 0 0
0 0 0 0 1 0 0 0 0 1 0
0 0 0 1 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 1
1 0 0 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 1 0 0
0 1 0 0 0 0 1 0 0 0 0
0 1 0 0 0 0 1 0 0 0 0
4. One Hot Encoded Dataframe with Index and Column Names
index  color_blue  color_green  color_orange  color_red  color_yellow  version_1  version_2  version_3  version_5  version_6  version_8
0      0           0            0             1          0             1          0          0          0          0          0
6      0           0            0             0          1             0          0          0          0          1          0
3      0           0            0             1          0             1          0          0          0          0          0
9      0           0            0             1          0             0          0          0          0          0          1
2      1           0            0             0          0             0          0          1          0          0          0
5      0           0            1             0          0             0          0          0          1          0          0
1      0           1            0             0          0             0          1          0          0          0          0
7      0           1            0             0          0             0          1          0          0          0          0
5. Final Dataframe with passthrough columns joined back
index  color_blue  color_green  color_orange  color_red  color_yellow  version_1  version_2  version_3  version_5  version_6  version_8  cost   height
0      0           0            0             1          0             1          0          0          0          0          0          5.99   12
6      0           0            0             0          1             0          0          0          0          1          0          10.99  18
3      0           0            0             1          0             1          0          0          0          0          0          5.99   15
9      0           0            0             1          0             0          0          0          0          0          1          12.99  21
2      1           0            0             0          0             0          0          1          0          0          0          5.99   14
5      0           0            1             0          0             0          0          0          1          0          0          10.99  17
1      0           1            0             0          0             0          1          0          0          0          0          5.99   13
7      0           1            0             0          0             0          1          0          0          0          0          12.99  19
TEST DATAFRAMES AT EACH STEP:
1. Dataframe with columns to encode:
index  color   version
4      purple  4
8      blue    3
Dataframe with other columns:
index  cost   height
4      10.99  16
8      12.99  20
2. One Hot Encoded 2d array:
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0 0
3. One Hot Encoded Dataframe with Index and Column Names
index  color_blue  color_green  color_orange  color_red  color_yellow  version_1  version_2  version_3  version_5  version_6  version_8
4      0           0            0             0          0             0          0          0          0          0          0
8      1           0            0             0          0             0          0          1          0          0          0
4. Final Dataframe with passthrough columns joined back
index  color_blue  color_green  color_orange  color_red  color_yellow  version_1  version_2  version_3  version_5  version_6  version_8  cost   height
4      0           0            0             0          0             0          0          0          0          0          0          10.99  16
8      1           0            0             0          0             0          0          1          0          0          0          12.99  20
Note: For the autograder, use the column naming scheme of joining the previous column name and the column value with an underscore (e.g., a Type column with values Fruit and Vegetable becomes Type_Fruit and Type_Vegetable).
Note 2: Since you should only fit your encoder on the training data, any category values that appear in the test set but not in the training set should be denoted with 0s. For example, if a test row had the value pizza in a Type column that only saw Fruit and Vegetable during training, both Type_Fruit and Type_Vegetable should be 0 for that row. If you don't handle these properly you may get errors like Test Failed: Found unknown categories.
Note 3: You may be tempted to use the pandas function get_dummies to solve this task, but it's a trap. It seems easier, but you will have to do a lot more work to make it handle a train/test split, so we suggest you use sklearn's OneHotEncoder.
Useful Resources
https://www.educative.io/blog/one-hot-encoding
https://developers.google.com/machine-learning/data-prep/transform/transform-categorical
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
https://datascience.stackexchange.com/questions/103211/do-we-need-to-pre-process-both-the-test-and-train-data-set
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with the columns listed in one_hot_encode_cols one hot encoded and all other columns in the dataframe unchanged
Function Skeleton
def one_hot_encode_columns_train(self) -> pd.DataFrame:
    one_hot_encoded_dataset = pd.DataFrame()
    return one_hot_encoded_dataset

def one_hot_encode_columns_test(self) -> pd.DataFrame:
    one_hot_encoded_dataset = pd.DataFrame()
    return one_hot_encoded_dataset
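A minimal sketch of one possible train/test pair following the pseudocode above. It assumes __init__ stored the inputs as instance variables and created self.one_hot_encoder (a OneHotEncoder; handle_unknown="ignore" gives the all-zeros behavior described in Note 2):
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode_columns_train(self) -> pd.DataFrame:
    # Assumed in __init__: self.one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    encode_df = self.train_features[self.one_hot_encode_cols]
    other_df = self.train_features.drop(columns=self.one_hot_encode_cols)
    # Fit on the training columns, then transform them
    encoded = self.one_hot_encoder.fit_transform(encode_df).toarray()
    encoded_df = pd.DataFrame(
        encoded,
        columns=self.one_hot_encoder.get_feature_names_out(self.one_hot_encode_cols),
        index=self.train_features.index)
    # Join the passthrough columns back on
    return encoded_df.join(other_df)

def one_hot_encode_columns_test(self) -> pd.DataFrame:
    encode_df = self.test_features[self.one_hot_encode_cols]
    other_df = self.test_features.drop(columns=self.one_hot_encode_cols)
    # Only transform here; the encoder was fit on the training data
    encoded = self.one_hot_encoder.transform(encode_df).toarray()
    encoded_df = pd.DataFrame(
        encoded,
        columns=self.one_hot_encoder.get_feature_names_out(self.one_hot_encode_cols),
        index=self.test_features.index)
    return encoded_df.join(other_df)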
PreprocessDataset: min_max_scaled_columns_train and min_max_scaled_columns_test
Min/Max Scaling is a process of scaling ints/floats from the min and max values in a series to a range between 0 and 1. The formula scikit-learn uses is shown below, but for this assignment you should just use the linked scikit-learn function.

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

There is a separate function for the training and test datasets since they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).
Example Dataframe:
Item        Price   Count  Type
Apples      1.99    7      Fruit
Broccoli    1.29    435    Vegetable
Bananas     0.99    123    Fruit
Oranges     2.79    25     Fruit
Pineapples  4.89    5234   Fruit
Example Min/Max Scaled Dataframe (rounded to 4 decimal places):
Item        Price   Count  Type
Apples      0.2564  7      Fruit
Broccoli    0.0769  435    Vegetable
Bananas     0       123    Fruit
Oranges     0.4615  25     Fruit
Pineapples  1       5234   Fruit
Note: For the autograder, use the same name as the original column (ex: Price -> Price).
Useful Resources
https://developers.google.com/machine-learning/data-prep/transform/normalization
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
https://datascience.stackexchange.com/questions/103211/do-we-need-to-pre-process-both-the-test-and-train-data-set
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with the columns listed in min_max_scale_cols min/max scaled and all other columns in the dataframe unchanged
Function Skeleton
def min_max_scaled_columns_train(self) -> pd.DataFrame:
    min_max_scaled_dataset = pd.DataFrame()
    return min_max_scaled_dataset

def min_max_scaled_columns_test(self) -> pd.DataFrame:
    min_max_scaled_dataset = pd.DataFrame()
    return min_max_scaled_dataset
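A minimal sketch (assumes __init__ created self.min_max_scaler = MinMaxScaler() and stored min_max_scale_cols); only the listed columns are scaled, everything else passes through unchanged:
from sklearn.preprocessing import MinMaxScaler

def min_max_scaled_columns_train(self) -> pd.DataFrame:
    # Fit the scaler on the training columns and overwrite them with scaled values
    scaled_dataset = self.train_features.copy()
    scaled_dataset[self.min_max_scale_cols] = self.min_max_scaler.fit_transform(
        self.train_features[self.min_max_scale_cols])
    return scaled_dataset

def min_max_scaled_columns_test(self) -> pd.DataFrame:
    # Reuse the scaler fit on the training data to avoid data leakage
    scaled_dataset = self.test_features.copy()
    scaled_dataset[self.min_max_scale_cols] = self.min_max_scaler.transform(
        self.test_features[self.min_max_scale_cols])
    return scaled_dataset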
PreprocessDataset: pca_train and pca_test
Principal Component Analysis is a dimensionality reduction technique (column reduction). It aims to take the variance in your input columns and map the columns into N components that contain as much of the variance as possible. This technique can be useful if you are trying to train a model faster, and it has some more advanced uses, especially when training models on data with many columns but few rows. There is a separate function for the training and test datasets since they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).
Note: For the autograder, use the column naming scheme component_1, component_2 .. component_n for the n_components passed into the __init__ method.
Note 2: For your PCA outputs to match the autograder, make sure you set the seed using a random state of 0 when you initialize the PCA function.
Note 3: Since PCA does not work with NA values, make sure you drop any columns that have NA values before running PCA.
Useful Resources
https://builtin.com/data-science/step-step-explanation-principal-component-analysis
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
https://datascience.stackexchange.com/questions/103211/do-we-need-to-pre-process-both-the-test-and-train-data-set
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with the generated pca values, using column names: component_1, component_2 .. component_n
Function Skeleton
def pca_train(self) -> pd.DataFrame:
    pca_dataset = pd.DataFrame()
    return pca_dataset

def pca_test(self) -> pd.DataFrame:
    pca_dataset = pd.DataFrame()
    return pca_dataset
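A minimal sketch (assumes __init__ created self.pca = PCA(n_components=self.n_components, random_state=0)). NA columns are dropped before fitting as Note 3 requires; in practice you would make sure the same columns are dropped for train and test:
from sklearn.decomposition import PCA

def pca_train(self) -> pd.DataFrame:
    # Drop columns that contain NA values, then fit PCA on the training data
    train_no_na = self.train_features.dropna(axis=1)
    components = self.pca.fit_transform(train_no_na)
    column_names = [f"component_{i}" for i in range(1, self.n_components + 1)]
    return pd.DataFrame(components, columns=column_names, index=self.train_features.index)

def pca_test(self) -> pd.DataFrame:
    # Apply the PCA fit on the training data to the test data
    test_no_na = self.test_features.dropna(axis=1)
    components = self.pca.transform(test_no_na)
    column_names = [f"component_{i}" for i in range(1, self.n_components + 1)]
    return pd.DataFrame(components, columns=column_names, index=self.test_features.index)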
PreprocessDataset: feature_engineering_train and feature_engineering_test
Feature Engineering is a process of using domain knowledge (physics, geometry, sports statistics, business metrics, etc.) to create new features (columns) out of the existing data. This could mean creating an area feature from the length and width of a triangle, extracting the major and minor version numbers from a software version, or more complex logic depending on the scenario. For this method you will take in a dictionary mapping a column name to a function (that takes in a dataframe and returns a column) and use it to create a new column with the name in the dict key.
For example:
def double_height(dataframe: pd.DataFrame):
    return dataframe["height"] * 2

def half_height(dataframe: pd.DataFrame):
    return dataframe["height"] / 2

feature_engineering_functions = {"double_height": double_height, "half_height": half_height}

With the above functions you would create 2 new columns named "double_height" and "half_height".
Useful Resources
https://en.wikipedia.org/wiki/Feature_engineering
https://www.geeksforgeeks.org/what-is-feature-engineering/
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with the features described in feature_engineering_functions added as new columns and all other columns in the dataframe unchanged
Function Skeleton
def feature_engineering_train(self) -> pd.DataFrame:
    feature_engineered_dataset = pd.DataFrame()
    return feature_engineered_dataset

def feature_engineering_test(self) -> pd.DataFrame:
    feature_engineered_dataset = pd.DataFrame()
    return feature_engineered_dataset
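A minimal sketch that applies each function in feature_engineering_functions to create the new columns (assumes the dictionary and feature dataframes were stored in __init__):
def feature_engineering_train(self) -> pd.DataFrame:
    feature_engineered_dataset = self.train_features.copy()
    # Each dict entry maps a new column name to a function of the dataframe
    for new_col_name, col_function in self.feature_engineering_functions.items():
        feature_engineered_dataset[new_col_name] = col_function(self.train_features)
    return feature_engineered_dataset

def feature_engineering_test(self) -> pd.DataFrame:
    feature_engineered_dataset = self.test_features.copy()
    for new_col_name, col_function in self.feature_engineering_functions.items():
        feature_engineered_dataset[new_col_name] = col_function(self.test_features)
    return feature_engineered_dataset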
PreprocessDataset: preprocess_train and preprocess_test
Now we will put 3 of the above methods together into a preprocess function which will take a dataset and encode, scale and feature engineer it using the above methods and their respective columns, then output a preprocessed dataframe.
Useful Resources
See the resources for one hot encoding, min/max scaling and feature engineering above
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe for both train and test features with the columns in one_hot_encode_cols encoded, the columns in min_max_scale_cols scaled and the features described in feature_engineering_functions engineered
Function Skeleton
def preprocess(self) -> tuple[pd.DataFrame, pd.DataFrame]:
    train_features = pd.DataFrame()
    test_features = pd.DataFrame()
    return train_features, test_features
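A minimal sketch of one possible way to chain the method pairs above; it assumes the intermediate results are written back to the instance variables so each step sees the previous step's output (your wiring may differ):
def preprocess(self) -> tuple[pd.DataFrame, pd.DataFrame]:
    # One hot encode, then min/max scale, then feature engineer; train before test at each step
    self.train_features = self.one_hot_encode_columns_train()
    self.test_features = self.one_hot_encode_columns_test()
    self.train_features = self.min_max_scaled_columns_train()
    self.test_features = self.min_max_scaled_columns_test()
    train_features = self.feature_engineering_train()
    test_features = self.feature_engineering_test()
    return train_features, test_features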
Task 3 (15 points)
So far we have functions to split the data and preprocess it. Now we will run a basic model on the data to cluster files (rows) with similar attributes together. We will use an unsupervised model (a model with no label column), KMeans, since it is simple to use and understand. Please use scikit-learn to create the model and Yellowbrick to determine the optimal value of k for our dataset.
Useful Links:
Clustering - Google Developers
Clustering Algorithms - Google Developers
Kmeans - Google Developers
Deliverables:
Complete the KmeansClustering class in task3.py
For this task we have released a local test suite; please set it up and use it to debug your functions.
Submit task3.py to Gradescope
Instructions:
The Task3.py file has function skeletons that you will complete with Python code (mostly using the pandas and scikit-learn libraries). The goal of each of these functions is to give you familiarity with the applied concepts of Unsupervised Learning. See information about each function's inputs, outputs and skeleton below.
KmeansClustering
The KmeansClustering class contains a code skeleton with 4 methods for you to implement.
Note: You should train/fit using the train dataset; once you have a fit encoder/scaler/pca/model instance you can transform/predict on the training and test data.
KmeansClustering: __init__
Similar to the Task 1 simpleClass subtask you previously completed, you will initialize the class by adding instance variables.
Useful Resources
https://www.w3schools.com/python/python_classes.asp
INPUTS
train_features - a dataset split by a function similar to tts which should be used in the training/fitting steps
test_features - a dataset split by a function similar to tts which should be used in the test steps
random_state - an integer that should be used to set the scikit-learn randomness so the model results will be repeatable, which is required for the autograder
OUTPUTS
None
Function Skeleton
def __init__(self,
             train_features: pd.DataFrame,
             test_features: pd.DataFrame,
             random_state: int):
    # TODO: Add any state variables you may need to make your functions work
    pass
KmeansClustering: kmeans_train
KMeans clustering is a process of grouping similar rows together and assigning them to a cluster. For this method you will use the training data to fit an optimal KMeans clustering of the data.
To help you get started we have provided a list of subtasks to complete for this task:
1. Initialize a sklearn KMeans model using random_state from the __init__ method and setting n_init = 10.
2. Initialize a yellowbrick KElbowVisualizer to search for the optimal value of k (between 1 and 10).
3. Train the KElbowVisualizer on the training data and determine the optimal k value.
4. Train a KMeans model with the proper initialization for that optimal value of k.
5. Return the cluster ids for each row of the training set as a list.
Useful Resources
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a list of cluster ids that the kmeans model has assigned for each row in the train dataset
Function Skeleton
def kmeans_train(self) -> list:
    cluster_ids = list()
    return cluster_ids
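A minimal sketch following the subtasks above (assumes yellowbrick is installed, that __init__ stored train_features and random_state, and that the fit model is kept on self.kmeans_model for reuse in kmeans_test):
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

def kmeans_train(self) -> list:
    # Use the elbow method to search for an optimal k between 1 and 10
    model = KMeans(random_state=self.random_state, n_init=10)
    visualizer = KElbowVisualizer(model, k=(1, 10))
    visualizer.fit(self.train_features)
    optimal_k = visualizer.elbow_value_
    # Refit KMeans with the chosen k and return the training cluster ids
    self.kmeans_model = KMeans(n_clusters=optimal_k, random_state=self.random_state, n_init=10)
    cluster_ids = self.kmeans_model.fit_predict(self.train_features)
    return list(cluster_ids)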
KmeansClustering: kmeans_test
KMeans clustering is a process of grouping similar rows together and assigning them to a cluster. For this method you will use the KMeans model you fit on the training data to assign cluster ids to the test data.
To help you get started we have provided a list of subtasks to complete for this task:
1. Use the model you trained in the kmeans_train method to generate cluster ids for each row of the test dataset.
2. Return the cluster ids for each row of the test set as a list.
Useful Resources
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a list of cluster ids that the kmeans model has assigned for each row in the test dataset
Function Skeleton
def kmeans_test(self) -> list:
    cluster_ids = list()
    return cluster_ids
KmeansClustering: train_add_kmeans_cluster_id_feature and test_add_kmeans_cluster_id_feature
Using the two methods you completed above (kmeans_train and kmeans_test) you will add a feature to the training and test dataframes. This is similar to the feature engineering method in Task 2. To do this, take the output of the corresponding method (the list of cluster ids), add it as a new column (named kmeans_cluster_id) in the input dataframe, then return the full dataframe.
Useful Resources
INPUTS
Use the needed instance variables you set in the __init__ method
OUTPUTS
a pandas dataframe with kmeans_cluster_id added as a feature and all other input columns unchanged
Function Skeleton
def train_add_kmeans_cluster_id_feature(self) -> pd.DataFrame:
    output_df = pd.DataFrame()
    return output_df

def test_add_kmeans_cluster_id_feature(self) -> pd.DataFrame:
    output_df = pd.DataFrame()
    return output_df
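A minimal sketch (assumes kmeans_train/kmeans_test are called here to obtain the cluster id lists):
def train_add_kmeans_cluster_id_feature(self) -> pd.DataFrame:
    output_df = self.train_features.copy()
    # Add the training cluster ids as a new column
    output_df["kmeans_cluster_id"] = self.kmeans_train()
    return output_df

def test_add_kmeans_cluster_id_feature(self) -> pd.DataFrame:
    output_df = self.test_features.copy()
    # Add the test cluster ids as a new column
    output_df["kmeans_cluster_id"] = self.kmeans_test()
    return output_df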
Task 4 (25 points)
Finally we are ready to try a few different supervised classification models. We have chosen a few commonly used models for you to use here, but there are many options, and in the real world specific algorithms may fit a specific dataset better. You also won't be doing any hyperparameter tuning yet, so you can better focus on writing the code. You will train a model using the training set, predict on the training/test sets, calculate performance metrics, and return a ModelMetrics object and a trained scikit-learn model from each model function. (Note: You should use RFE for determining feature importance of the logistic regression model, but do NOT use RFE for random forest or gradient boosting models; use their built-in feature importance values instead.)
Useful Links:
Training and Test Sets - Machine Learning - Google Developers
Bias–variance tradeoff - Wikipedia
Overfitting - Wikipedia
scikit-learn: machine learning in Python — scikit-learn 1.2.1 documentation
An Introduction to Classification in Machine Learning - builtin
Classification in Machine Learning: An Introduction - DataCamp
Deliverables:
Complete the functions and methods in task4.py
For this task we have released a local test suite; please set it up and use it to debug your functions.
Submit task4.py to Gradescope.
Instructions:
The Task4.py file has function skeletons that you will complete with Python code (mostly using the pandas and scikit-learn libraries). The goal of each of these functions is to give you familiarity with the applied concepts of training a model, using it to score records, and calculating performance metrics for it. See information about each function's inputs, outputs and skeleton below.
Table of contents
1. ModelMetrics
2. calculate_naive_metrics
3. calculate_logistic_regression_metrics
4. calculate_random_forest_metrics
5. calculate_gradient_boosting_metrics
ModelMetrics
In order to simplify the autograding, we have created a class that will hold the metrics and feature importances for a given trained model. You do not have to add anything to this class, but you are expected to use it (put your training and test metric dictionaries and feature importance dataframes inside it for the autograder to handle).
calculate_naive_metrics
A Naive model is a very simple model/prediction that can help to frame how well a more sophisticated model is doing. Since the Naive model is incredibly basic (often a constant result or a randomly selected result), we should expect any more sophisticated model that we train to outperform it. If the Naive model beats a trained model, it can mean that additional data (rows or columns) is needed in the dataset to improve the model, or that the dataset doesn't have a strong enough signal for the target we want to predict. In this function you will use the approach of a constant-output Naive model. You will calculate 4 metrics (accuracy, recall, precision and fscore) for the training and test datasets for a given constant integer as your prediction (passed into the function as the variable naive_assumption).
Useful Resources
https://machinelearningmastery.com/how-to-develop-and-evaluate-naive-classifier-strategies-using-probability/
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
INPUTS
train_features - a dataset split by a function similar to the tts function you created in task2
test_features - a dataset split by a function similar to the tts function you created in task2
train_targets - a dataset split by a function similar to the tts function you created in task2
test_targets - a dataset split by a function similar to the tts function you created in task2
naive_assumption - an integer that should be used as the result from the Naive model
OUTPUTS
A completed ModelMetrics object with training and test metrics dictionaries, with each metric rounded to 4 decimal places
Function Skeleton
def calculate_naive_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, naive_assumption: int) -> ModelMetrics:
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0
    }
    naive_metrics = ModelMetrics("Naive", train_metrics, test_metrics, None)
    return naive_metrics
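A minimal sketch, assuming the constant naive_assumption value is used as the prediction for every row and each metric is rounded to 4 decimal places:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def calculate_naive_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, naive_assumption: int) -> ModelMetrics:
    # The naive prediction is the same constant for every row
    train_preds = [naive_assumption] * len(train_targets)
    test_preds = [naive_assumption] * len(test_targets)
    train_metrics = {
        "accuracy": round(accuracy_score(train_targets, train_preds), 4),
        "recall": round(recall_score(train_targets, train_preds), 4),
        "precision": round(precision_score(train_targets, train_preds), 4),
        "fscore": round(f1_score(train_targets, train_preds), 4)
    }
    test_metrics = {
        "accuracy": round(accuracy_score(test_targets, test_preds), 4),
        "recall": round(recall_score(test_targets, test_preds), 4),
        "precision": round(precision_score(test_targets, test_preds), 4),
        "fscore": round(f1_score(test_targets, test_preds), 4)
    }
    return ModelMetrics("Naive", train_metrics, test_metrics, None)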
calculate_logistic_regression_metrics
A Logistic Regression model is a simple and more explainable statistical model that can be used to estimate the probability of an event (log-odds). At a high level, a Logistic Regression model uses data in the training set to estimate a column's weight in a linear approximation function (conceptually this is similar to estimating m for each column in the line formula you probably know well: y = m*x + b). If you are interested in learning more, you can read up on the math behind how this works. For this project we are more focused on showing you how to apply these models, so you can just use the sklearn Logistic Regression model in your code. For this task, use sklearn's LogisticRegression class to train a Logistic Regression model (initialized using the kwargs passed into the function), predict scores for the training and test datasets, and calculate 7 metrics (accuracy, recall, precision, fscore, false positive rate (fpr), false negative rate (fnr) and Area Under the Receiver Operating Characteristic Curve (roc_auc)) for the training and test datasets using predictions from the fit model, along with a dataframe of the top 10 most important features.
NOTE: Make sure you use the predicted probabilities for roc_auc.
NOTE2: For Feature Importance use the top 10 features selected by RFE and sort by absolute value of the coefficient from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col [Feature] and imp_col [Importance], and that the index is reset to 0-9; you can do this the same way you did in task1).
Useful Resources
https://stats.libretexts.org/Bookshelves/Introductory_Statistics/OpenIntro_Statistics_(Diez_et_al)./08%3A_Multiple_and_Logistic_Regression/8.04%3A_Introduction_to_Logistic_Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
https://en.wikipedia.org/wiki/Confusion_matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
INPUTS
train_features - a dataset split by a function similar to the tts function you created in task2
test_features - a dataset split by a function similar to the tts function you created in task2
train_targets - a dataset split by a function similar to the tts function you created in task2
test_targets - a dataset split by a function similar to the tts function you created in task2
logreg_kwargs - a dictionary with keyword arguments that can be passed directly to the sklearn Logistic Regression class
OUTPUTS
A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
An sklearn Logistic Regression model object fit on the training set
Function Skeleton
def calculate_logistic_regression_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, logreg_kwargs) -> tuple[ModelMetrics, LogisticRegression]:
    model = LogisticRegression()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    log_reg_importance = pd.DataFrame()
    log_reg_metrics = ModelMetrics("Logistic Regression", train_metrics, test_metrics, log_reg_importance)
    return log_reg_metrics, model
Example of Feature Importance Dataframe
   Feature                 Importance
0  density                 -7.1416
1  volatile acidity        -6.6914
2  sulphates                1.4095
3  alcohol                  1.0275
4  fixed acidity           -0.2234
5  pH                      -0.1779
6  residual sugar           0.0683
7  free sulfur dioxide      0.0111
8  total sulfur dioxide    -0.0025
9  citric acid              0.0007
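A minimal sketch of how the seven metrics and an importance table like the one above could be computed, assuming binary 0/1 targets; the helper names classification_metric_dict and top10_logreg_importance are ours, not part of the project template:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score, confusion_matrix)

def classification_metric_dict(model, features: pd.DataFrame, targets: pd.Series) -> dict:
    preds = model.predict(features)
    probs = model.predict_proba(features)[:, 1]  # predicted probabilities for roc_auc
    tn, fp, fn, tp = confusion_matrix(targets, preds).ravel()
    return {
        "accuracy": round(accuracy_score(targets, preds), 4),
        "recall": round(recall_score(targets, preds), 4),
        "precision": round(precision_score(targets, preds), 4),
        "fscore": round(f1_score(targets, preds), 4),
        "fpr": round(fp / (fp + tn), 4),  # false positive rate
        "fnr": round(fn / (fn + tp), 4),  # false negative rate
        "roc_auc": round(roc_auc_score(targets, probs), 4),
    }

def top10_logreg_importance(logreg_kwargs, train_features: pd.DataFrame, train_targets: pd.Series) -> pd.DataFrame:
    # RFE keeps the 10 strongest features; its fitted inner estimator provides
    # the coefficients, which are then sorted by absolute value (biggest to smallest).
    rfe = RFE(LogisticRegression(**logreg_kwargs), n_features_to_select=10)
    rfe.fit(train_features, train_targets)
    kept = train_features.columns[rfe.support_]
    imp = pd.DataFrame({"Feature": kept, "Importance": rfe.estimator_.coef_[0]})
    order = imp["Importance"].abs().sort_values(ascending=False).index
    return imp.loc[order].reset_index(drop=True)

# Possible wiring inside calculate_logistic_regression_metrics:
# model = LogisticRegression(**logreg_kwargs).fit(train_features, train_targets)
# train_metrics = classification_metric_dict(model, train_features, train_targets)
# test_metrics = classification_metric_dict(model, test_features, test_targets)
# log_reg_importance = top10_logreg_importance(logreg_kwargs, train_features, train_targets)
# return ModelMetrics("Logistic Regression", train_metrics, test_metrics, log_reg_importance), model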
calculate_random_forest_metrics
A Random Forest model is a more complex model than the Naive and Logistic Regression models you have trained so far. It can still be used to estimate the probability of an event, but it achieves this using a different underlying structure: a tree-based model. Conceptually this looks a lot like many if/else statements chained together into a "tree". A Random Forest expands on this and trains many different trees with different subsets of the data and starting conditions to get a better estimate than a single tree would give. For this project we are more focused on showing you how to apply these models, so you can just use the sklearn Random Forest model in your code. For this task, use sklearn's Random Forest Classifier class to train a Random Forest model (initialized using the kwargs passed into the function), predict scores for the training and test datasets, and calculate 7 metrics (accuracy, recall, precision, fscore, false positive rate (fpr), false negative rate (fnr) and Area Under the Receiver Operating Characteristic Curve (roc_auc)) for the training and test datasets using predictions from the fit model, along with a dataframe of the top 10 most important features.
NOTE: Make sure you use the predicted probabilities for roc_auc.
NOTE2: For Feature Importance use the top 10 features selected by the built-in method and sort by importance from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col [Feature] and imp_col [Importance], and that the index is reset to 0-9; you can do this the same way you did in task1).
Useful Resources
https://blog.dataiku.com/tree-based-models-how-they-work-in-plain-english
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
https://en.wikipedia.org/wiki/Confusion_matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
INPUTS
train_features - a dataset split by a function similar to the tts function you created in task2
test_features - a dataset split by a function similar to the tts function you created in task2
train_targets - a dataset split by a function similar to the tts function you created in task2
test_targets - a dataset split by a function similar to the tts function you created in task2
rf_kwargs - a dictionary with keyword arguments that can be passed directly to the sklearn RandomForestClassifier class
OUTPUTS
A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
An sklearn Random Forest model object fit on the training set
Function Skeleton
def calculate_random_forest_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, rf_kwargs) -> tuple[ModelMetrics, RandomForestClassifier]:
    model = RandomForestClassifier()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    rf_importance = pd.DataFrame()
    rf_metrics = ModelMetrics("Random Forest", train_metrics, test_metrics, rf_importance)
    return rf_metrics, model
Example of Feature Importance Dataframe
   Feature                 Importance
0  alcohol                  0.3567
1  density                  0.183
2  volatile acidity         0.1478
3  chlorides                0.0776
4  free sulfur dioxide      0.0725
5  citric acid              0.0684
6  total sulfur dioxide     0.0421
7  residual sugar           0.0187
8  fixed acidity            0.0144
9  pH                       0.0097
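A minimal sketch of the Random Forest-specific pieces, assuming the hypothetical classification_metric_dict helper from the Logistic Regression sketch above is in scope (top10_builtin_importance is likewise our name, not part of the project template):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top10_builtin_importance(model, feature_names) -> pd.DataFrame:
    # Tree ensembles expose impurity-based importances after fitting via the
    # built-in feature_importances_ attribute.
    imp = pd.DataFrame({"Feature": feature_names,
                        "Importance": model.feature_importances_})
    return (imp.sort_values("Importance", ascending=False)
               .head(10)
               .reset_index(drop=True))

# Possible wiring inside calculate_random_forest_metrics:
# model = RandomForestClassifier(**rf_kwargs).fit(train_features, train_targets)
# train_metrics = classification_metric_dict(model, train_features, train_targets)
# test_metrics = classification_metric_dict(model, test_features, test_targets)
# rf_importance = top10_builtin_importance(model, train_features.columns)
# return ModelMetrics("Random Forest", train_metrics, test_metrics, rf_importance), model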
calculate_gradient_boosting_metrics
A Gradient Boosted model is a more complex model than the Naive and Logistic Regression models and is similar in structure to the Random Forest model you just trained. A Gradient Boosted model expands on the tree-based approach by using each additional tree to predict the errors of the previous trees. For this project we are more focused on showing you how to apply these models, so you can just use the sklearn Gradient Boosted model in your code. For this task, use sklearn's Gradient Boosting Classifier class to train a Gradient Boosted model (initialized using the kwargs passed into the function), predict scores for the training and test datasets, and calculate 7 metrics (accuracy, recall, precision, fscore, false positive rate (fpr), false negative rate (fnr) and Area Under the Receiver Operating Characteristic Curve (roc_auc)) for the training and test datasets using predictions from the fit model, along with a dataframe of the top 10 most important features.
NOTE: Make sure you use the predicted probabilities for roc_auc.
NOTE2: For Feature Importance use the top 10 features selected by the built-in method and sort by importance from biggest to smallest (make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col [Feature] and imp_col [Importance], and that the index is reset to 0-9; you can do this the same way you did in task1).
Useful Resources
https://blog.dataiku.com/tree-based-models-how-they-work-in-plain-english
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
https://en.wikipedia.org/wiki/Confusion_matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
INPUTS
train_features - a dataset split by a function similar to the tts function you created in task2
test_features - a dataset split by a function similar to the tts function you created in task2
train_targets - a dataset split by a function similar to the tts function you created in task2
test_targets - a dataset split by a function similar to the tts function you created in task2
gb_kwargs - a dictionary with keyword arguments that can be passed directly to the sklearn GradientBoostingClassifier class
OUTPUTS
A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
An sklearn Gradient Boosted model object fit on the training set
Function Skeleton
def calculate_gradient_boosting_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, gb_kwargs) -> tuple[ModelMetrics, GradientBoostingClassifier]:
    model = GradientBoostingClassifier()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    gb_importance = pd.DataFrame()
    gb_metrics = ModelMetrics("Gradient Boosting", train_metrics, test_metrics, gb_importance)
    return gb_metrics, model
Example of Feature Importance Dataframe
   Feature                 Importance
0  alcohol                  0.3495
1  volatile acidity         0.2106
2  free sulfur dioxide      0.1077
3  residual sugar           0.0599
4  fixed acidity            0.0451
5  citric acid              0.045
6  total sulfur dioxide     0.0426
7  chlorides                0.0381
8  density                  0.0367
9  sulphates                0.0326
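Since GradientBoostingClassifier also exposes feature_importances_, a sketch of the whole function can reuse the hypothetical classification_metric_dict and top10_builtin_importance helpers from the earlier sketches together with the project's ModelMetrics class (all assumed to be in scope):

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def sketch_gradient_boosting_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame,
                                     train_targets: pd.Series, test_targets: pd.Series, gb_kwargs):
    # Fit the model with the caller-supplied keyword arguments.
    model = GradientBoostingClassifier(**gb_kwargs).fit(train_features, train_targets)
    # Same seven metrics and top-10 importance table as the other tree-based model.
    train_metrics = classification_metric_dict(model, train_features, train_targets)
    test_metrics = classification_metric_dict(model, test_features, test_targets)
    gb_importance = top10_builtin_importance(model, train_features.columns)
    return ModelMetrics("Gradient Boosting", train_metrics, test_metrics, gb_importance), model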
Task 5 (20 points)
Now that you have written functions for the different steps of the model-building process, you will put it all together. You will write code that trains a model with hyperparameters you determine (you should do any tuning locally or in a notebook, i.e. don't tune your model in Gradescope since the autograder will likely time out). It will take in the CLAMP training data (note: the "class" column is the target for this dataset), train a model, then predict on a test set (the "class" column will be removed to simulate new files that will be classified by your model) and output values from 0 to 1 (values close to 0 being less likely to be malicious and values closer to 1 being more likely to be malicious) for each row. Our autograder will compare your predictions with the correct answers, and to get credit you will need a roc_auc score of .9 or higher on the test set (this should not require much hyperparameter tuning for this dataset). This is basically a simulation of how your model would perform in a "production" system using batch inference.
Instructions:
Make use of any of the techniques we covered in this project to train a model and return predicted probabilities for each row of the test set as a DataFrame with columns index (same as the index from the input test df) and malware_score (predicted probabilities); a minimal sketch is shown after the sample submission below.
Complete the train_model_return_scores function in task5.py
Sample Submission:
index    malware_score
0        0.65
1        0.1
...      ...
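One possible (hedged) shape for train_model_return_scores that would produce output like the sample above. Assumptions: the function receives the CLAMP train and test data as DataFrames with numeric feature columns, the RandomForestClassifier settings shown are illustrative placeholders rather than tuned values, and the real signature in task5.py may differ:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_model_return_scores(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    # "class" is the target column named in the task description.
    train_features = train_df.drop(columns=["class"])
    train_targets = train_df["class"]
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(train_features, train_targets)
    # Probability of the positive (malicious) class for each test row.
    scores = model.predict_proba(test_df[train_features.columns])[:, 1]
    return pd.DataFrame({"index": test_df.index, "malware_score": scores})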
Deliverables:
Submit task5.py to Gradescope