Week 8 Assignment Machine Learning Model Building
School: Howard University
Course: MISC
Subject: Information Systems
Date: Jan 9, 2024
Uploaded by Dillah1
Week 8 Assignment Machine Learning Model Building K Nearest Neighbors
Background information: Customer Churn Prediction in the Telecom Sector
Customer churn, also known as customer attrition, customer turnover, or
customer defection, is the loss of clients or customers.
Telephone service companies, Internet service providers, pay TV companies,
insurance firms, and alarm monitoring services often use customer attrition
analysis and customer attrition rates as key business metrics, because the
cost of retaining an existing customer is far less than the cost of acquiring
a new one. Companies in these sectors often have customer service branches
that attempt to win back defecting clients, because recovered long-term
customers can be worth much more to a company than newly recruited
clients.
Companies usually make a distinction between voluntary churn and
involuntary churn. Voluntary churn occurs when the customer decides to
switch to another company or service provider. Involuntary churn occurs
due to circumstances such as a customer's relocation to a long-term care
facility, death, or relocation to a distant location. In most applications,
involuntary reasons for churn are excluded from the analytical models.
Analysts tend to concentrate on voluntary churn, because it typically occurs
due to factors of the company-customer relationship that companies
control, such as how billing interactions are handled or how after-sales help is
provided.
Predictive analytics uses churn prediction models that estimate each
customer's propensity to churn. Since these models generate a small,
prioritized list of potential defectors, they are effective at focusing
customer retention marketing programs on the subset of the customer base
that is most vulnerable to churn.
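To make the idea of a "prioritized list of potential defectors" concrete, here is a minimal sketch, not the assignment's required method. It assumes scikit-learn's KNeighborsClassifier and uses made-up, already-scaled numbers for eight training customers and three new ones. The churn propensity is taken as the share of a customer's nearest neighbors that churned.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up, already-scaled features for eight customers (two columns each).
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.0],
                    [0.9, 1.0], [1.0, 0.8], [0.8, 0.9], [1.1, 1.0]])
y_train = np.array([False, False, False, False, True, True, True, True])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Three hypothetical new customers to score.
X_new = np.array([[0.95, 0.9], [0.15, 0.15], [0.5, 0.55]])

# predict_proba columns follow knn.classes_, so look up the TRUE (churn) column.
churn_col = list(knn.classes_).index(True)
churn_prob = knn.predict_proba(X_new)[:, churn_col]

# Indices of the new customers, highest churn propensity first.
ranking = np.argsort(-churn_prob)
print("Churn propensity:", churn_prob)
print("Contact order:", ranking)
```

A retention team would then work down the ranking from the top, which is how a model turns into a short, prioritized contact list.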
In this assignment, we will be applying the K Nearest Neighbors algorithm to
see how well our model can classify new data as either positive (TRUE) for
churn or negative (FALSE) based on certain characteristics of our dataset.
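As a rough illustration of what K Nearest Neighbors does when it classifies a new record, the sketch below implements the majority vote by hand with the Python standard library. The six training customers, the two features, and the function name knn_predict are all hypothetical. The course tutorial uses scikit-learn instead.

```python
import math
from collections import Counter

# Toy, made-up customers: (customer service calls, day minutes) -> churned?
train_points = [
    ((1, 120.0), False),
    ((0, 150.0), False),
    ((2, 180.0), False),
    ((5, 300.0), True),
    ((6, 280.0), True),
    ((4, 310.0), True),
]

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

print(knn_predict(train_points, (5, 290.0)))  # neighbors are all churners
print(knn_predict(train_points, (1, 140.0)))  # neighbors are all non-churners
```

Because the minutes column has much larger values than the calls column, it dominates the distance here. That is the reason feature scaling is part of the assignment.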
Dataset: Churn.csv
The dataset contains various data about individual customers for a monthly
billing cycle, including:
average day, evening, night and international minutes, number of calls and cost data
amount of time the customer has had a contract with the company (in months)
average number of calls to customer service (the time frame is not given, but we can assume the same time frame for all customers, so the values are directly comparable).
Preliminary remarks:
Let’s recall the Data Science Pipeline from Module 1, Week 1. We will be
working through each of these stages in this project.
You should be sure that you have studied all the Module 4 resources,
readings, videos, and tutorials before beginning this final assignment.
If you have any questions, do not hesitate to contact your instructor.
While this final assignment relies on knowledge and skills acquired
throughout the course, the assignment is particularly anchored on the
Week 8 K Nearest Neighbors tutorial (Iris dataset).
While you are in possession of a perfectly working Jupyter Notebook with
Python code applicable to a different dataset, it is a fact of ‘data science life’
that rarely is code directly usable without modification from one dataset, one
project, to another. Therefore it is critical that you not only study the code but
also the dataset carefully.
Process:
Previous tutorials and assignments in the course have given a step-by-step
breakdown of what to do, what to type, what to execute, when and where.
This final assignment of the course will deviate from that approach.
Your task:
Your task in this assignment is to execute and document an end-to-end
machine learning algorithm (K Nearest Neighbors, KNN) on the business
problem and Churn.csv dataset presented earlier. Your evaluation is not so
much based on identifying the best combination of criteria that lead to the
highest ‘accuracy’ of the model, but rather your ability to explain the process,
in layman’s terms (for non-data scientist colleagues and management), at each
stage of your investigation, your Data Science Pipeline.
You must include the code, output and comments/analysis in your
submission. You may create your submission in a Jupyter Notebook or Word.
In this assignment, you will follow the basic processes covered in the KNN Iris
dataset tutorial (with modifications that you will implement).
You should provide:
A comprehensive, multiple-step Summary of the Dataset
Data Preparation, including Label Encoding and Feature Scaling
Data Visualization of the dataset, a minimum of 3 visualizations (Note: a
single code cell that generates multiple visualizations, such as
sns.pairplot or matplotlib boxplot, counts as one visualization)
Complete Prediction set, including results, evaluations and at least one
form of cross-validation, with at least one graphic of the optimal number of
neighbors for the KNN algorithm
Each of the above stages should be accompanied with
commentary/analysis. Nearly every cell will need comments.
Suggested first steps:
There are 15 potential features (X values) in the machine learning
model you will create. You will want to at least try one iteration of the
model with all 15 features.
However, before doing so, be sure to run the Step 3.6 Correlation Heat
Map and see if that gives you some indication of which features to
concentrate on.
You will run several iterations of this process on the dataset (4
iterations minimum). Do not ‘cut to the chase’ and choose your final
features too quickly; take the time to experiment. You will need to take
notes during and after each iteration for the required deliverables
(below).
o Ultimately you should choose to concentrate on 3-4 features
(columns).
By the final iteration, you will want to remove/delete
certain features (columns) from the dataset.
This can be done manually in Excel (then reloading a new,
more focused dataset).
Or, per Step 3 Data Visualization in the KNN Iris dataset tutorial, by
using the filter method in Jupyter Notebook to create another DataFrame:
dataset2 = dataset.filter(['SepalLengthCm', 'SepalWidthCm', 'Species'], axis=1)
o Note that when you are testing which features to
include/exclude you can modify the following code (from the
KNN Iris dataset tutorial), then run the succession of steps to
get to the KNN results to compare.
feature_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
X = dataset[feature_columns].values
y = dataset['Species'].values
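The iteration loop suggested above (look at correlations, pick a feature subset, re-run KNN, compare) might be sketched as follows. This is only an illustration under stated assumptions. The DataFrame is synthetic, the column names DayMins, CustServCalls and NightMins are hypothetical stand-ins for Churn.csv columns, and the helper names rank_by_correlation and evaluate_features are invented here rather than taken from the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Churn.csv; column names and values are hypothetical.
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    "DayMins": rng.normal(180, 50, n),
    "CustServCalls": rng.integers(0, 8, n),
    "NightMins": rng.normal(200, 50, n),
})
# Made-up rule so that only the first two columns carry signal.
dataset["Churn"] = (dataset["DayMins"] / 50 + dataset["CustServCalls"]
                    + rng.normal(0, 1, n)) > 7.5

def rank_by_correlation(df, target):
    """Absolute correlation of each column with the target, largest first."""
    corr = df.astype(float).corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

def evaluate_features(df, feature_columns, target, k=5):
    """Mean 5-fold cross-validated accuracy of scaled KNN on the chosen columns."""
    X = df[feature_columns].values
    y = df[target].values
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    return cross_val_score(model, X, y, cv=5).mean()

print(rank_by_correlation(dataset, "Churn"))
for cols in (["DayMins", "CustServCalls", "NightMins"],
             ["DayMins", "CustServCalls"]):
    print(cols, round(evaluate_features(dataset, cols, "Churn"), 3))
```

Each call to evaluate_features corresponds to one iteration in the summary table, so the printed scores can be copied straight into the notes for that iteration.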
Deliverables:
In terms of complete code, output (including visualizations on your final
3-4 features only) and ‘in code’ comments/analysis, you will submit
only your final iteration. Ideally this will be the KNN where you
achieved the best results, but as mentioned previously, the
commentary/analysis is more important than the results.
Full python code + output/visualizations + ‘in code’
comments/description of process will preferably be in the form of a
Jupyter Notebook. A Word document with relevant code and output
and commentary entered/cut and pasted is possible as well.
In addition to your Jupyter Notebook (or code/output/comments
document), you should provide a one page overall summary of the
process (including a brief description of the problem, what KNN should
accomplish and descriptions of your various iterations; see below for a
sample table format). This will be a separate Word document.
Include any significant interim observations/milestones and any
conclusions of the resulting model (in this case, which level of K proved
most successful in your K Nearest Neighbor algorithm and to what
accuracy rate).
Thus you will submit two files:
o Jupyter Notebook or Word file (Full python code +
output/visualizations + ‘in code’ comments/description)
o Word file (one page overall summary of the process)
This is an open-ended, leverage-what-you’ve-observed-and-learned
assignment, very much like adjusting to the ever-evolving knowledge and
skill base required in any data science career path. Do not hesitate to consult
documentation or demos for Python or the various Python libraries (Pandas,
matplotlib, Seaborn, etc.).
There is a lot of detail in these instructions, but the bottom line is that we’re
looking for a deliverable very similar to the
knn_classification_tutorial_iris.ipynb Jupyter Notebook, with additional
comments/analysis. As mentioned elsewhere, when adapting code from one
dataset to another, adjustments to code and/or parameters will be required.
Of course, if you have any questions, do not hesitate to contact your instructor.
Sample summary table for reporting iterations (adapt as needed):
Iteration # | Features Included | Primary Results | Observations/Analysis | Next Steps
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline

# Load the dataset
df = pd.read_csv('Churn.csv')

# Explore the dataset
print("Dataset dimensions:", df.shape)
print("Data types:\n", df.dtypes)
print("Descriptive statistics:\n", df.describe())

# Identify the target variable
target = 'Churn'

# Data Preparation
# Handle missing values if any
# ...
# Perform label encoding for categorical variables
# ...

# Split the dataset into features (X) and the target variable (y)
X = df.drop(target, axis=1)
y = df[target]

# Data Visualization
# Create at least three visualizations
# ...

# Model Building
# Select features for the model
# ...

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = knn.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Perform cross-validation
# The scaler is wrapped in a pipeline so that each fold is scaled the same way
# as the model above; passing the unscaled X directly would evaluate a different model.
cv_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_scores = cross_val_score(cv_model, X, y, cv=5)
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", np.mean(cv_scores))

# Visualize the optimal number of neighbors
neighbors = range(1, 10)
accuracy_scores = []
for k in neighbors:
    cv_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(cv_model, X, y, cv=5)
    accuracy_scores.append(scores.mean())

plt.plot(neighbors, accuracy_scores)
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('KNN: Number of Neighbors vs. Accuracy')
plt.show()

# Polishing and Presenting
# Write your commentary, analysis, and conclusions
# ...
EXPLANATION OF PROGRAM
The code imports the necessary Python libraries: pandas, NumPy, matplotlib,
seaborn, and scikit-learn (sklearn).
These libraries contain functions and tools that are useful for data analysis
and machine learning.
It loads a dataset called 'Churn.csv' using the pandas library. The dataset
contains information about customers of a telecommunications company,
such as their usage patterns, the length of their contract, their calls to
customer service, and whether they have churned or not.
The data preparation step converts categorical variables (such as
'International Plan' and 'Voice Mail Plan') into numerical ones and scales the
numerical variables so that they have a similar range of values. This step is
important because KNN is based on distances: without scaling, variables with
large values would dominate the comparison and the variables would not be
treated equally by the machine learning algorithm.
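A minimal sketch of this preparation step, assuming scikit-learn's LabelEncoder and StandardScaler (both imported in the program). The two plan columns are named in the text above, while the 'Day Mins' column and all values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up rows; 'Day Mins' and every value below are hypothetical.
df = pd.DataFrame({
    "International Plan": ["no", "yes", "no", "no"],
    "Voice Mail Plan": ["yes", "no", "no", "yes"],
    "Day Mins": [265.1, 161.6, 243.4, 299.4],
})

# Label encoding: each text category becomes an integer (classes are sorted,
# so "no" -> 0 and "yes" -> 1).
for col in ["International Plan", "Voice Mail Plan"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Feature scaling: rescale the numeric column to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(df[["Day Mins"]])

print(df)
print(scaled.ravel())
```

In a real run the scaler would be fitted on the training rows only and then applied to the test rows, as the program does.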
The code leaves room to visualize the data using various plots, such as a
pair plot, correlation heatmap, and boxplot.
These plots help to understand the relationships between variables and to identify any outliers or anomalies in the data.
The code then splits the data into a training set and a testing set using the train_test_split function from sklearn. The training set is used to train the machine learning model, while the testing set is held back and used to evaluate the model's performance on data it has not seen.
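The split can be sketched on a small made-up array as below. The stratify=y argument is an addition that the program does not use. It is shown as an optional assumption because it keeps the share of churners the same in both sets, which may matter if churners are a minority in the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 customers, two features each, 10 of them churners (20%).
X = np.arange(100).reshape(50, 2)
y = np.array([True] * 10 + [False] * 40)

# 80% training, 20% testing; stratify=y preserves the 20% churn share in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("Training rows:", len(y_train), "churners:", int(y_train.sum()))
print("Testing rows:", len(y_test), "churners:", int(y_test.sum()))
```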
After that, the code classifies the data using the KNN technique. K-Nearest Neighbors is a machine learning technique that uses the distance between data points to make predictions: a new customer is assigned the class that is most common among the k closest customers in the training set. The training set is
used to train the algorithm, and the number of neighbors k (the hyperparameter) is chosen as the value with the highest cross-validated accuracy.
The code then fits the final KNN model with the chosen number of neighbors and predicts churn on the testing set. The predictions are compared
to the actual churn values to evaluate the performance of the model.
The code evaluates the model using the accuracy score. A confusion matrix and a classification report (classification_report is already imported from scikit-learn) give a fuller picture of the kinds of errors the model makes and can be used to improve the model if necessary.
OUTPUT
The output of this code will be:
1. Summary of the Dataset: This will display the dimensions, data types, and descriptive statistics of the dataset.
2. Accuracy Score: This will display the accuracy of the final KNN model on the testing set.
3. Confusion Matrix (once confusion_matrix is called on the test predictions): This will display the number of true positives, true negatives, false positives, and false negatives for the final KNN model on the testing set. A confusion matrix helps to understand the types of errors the model is making.
4. Classification Report (once classification_report is called on the test predictions): This will display various metrics such as precision, recall, and F1-score for each class (churn and no churn) of the final KNN model on the testing set. A classification report helps to understand the performance of the model in each class separately.
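The three evaluation outputs can be sketched with scikit-learn's metrics functions. The labels below are made up (1 = churn, 0 = no churn) and stand in for y_test and y_pred from the program.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up actual and predicted labels for ten test customers (1 = churn).
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print("Accuracy:", accuracy_score(y_test, y_pred))

# Rows are actual classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print("True negatives:", tn, "False positives:", fp)
print("False negatives:", fn, "True positives:", tp)

# Precision, recall and F1-score per class.
print(classification_report(y_test, y_pred, target_names=["no churn", "churn"]))
```

In this toy case the model is right 7 times out of 10 but misses half of the actual churners, which accuracy alone would not reveal.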
SUMMARY OF CODE AND FUNCTION
Data Loading and Exploration:
The dataset is loaded using pandas, and its dimensions, data types, and descriptive statistics are displayed to understand the dataset's structure and content.
Data Preparation:
Missing values are handled if any are present in the dataset.
Categorical variables are encoded using label encoding to convert them into numerical representations.
The dataset is split into features (X) and the target variable (y).
Data Visualization:
The code provides a placeholder to create at least three visualizations to
gain insights into the dataset, such as relationships between variables or
data distributions.
Model Building:
The dataset is split into training and testing sets using the train_test_split function from scikit-learn.
Feature scaling is applied to standardize the features using StandardScaler from scikit-learn.
The KNN classifier is instantiated with a chosen value of k (number of neighbors).
The model is trained on the training data using the fit method.
Model Evaluation:
The model's performance is evaluated by predicting the churn status for
the testing data using the predict method.
The accuracy of the model is calculated using the accuracy_score function from scikit-learn.
Cross-validation is performed using cross_val_score to assess the model's generalization performance.
Visualizing the Optimal Number of Neighbors:
The code iterates over different values of k and calculates the mean accuracy score using cross-validation.
The results are plotted to visualize the relationship between the number
of neighbors and the accuracy of the KNN model.
Polishing and Presenting:
The code provides a placeholder for you to add your commentary, analysis, and conclusions about the results and the performance of the model.
You can write a summary of the process, including a description of the problem, the purpose of KNN, descriptions of your iterations, significant
observations, conclusions, and the optimal value of k with its corresponding accuracy rate.