Week 8 Assignment Machine Learning Model Building

School

Howard University

Course

MISC

Subject

Information Systems

Date

Jan 9, 2024

Type

docx

Pages

12

Uploaded by Dillah1

Week 8 Assignment
Machine Learning Model Building: K Nearest Neighbors

Background information: Customer Churn Prediction in the Telecom Sector

Customer churn, also known as customer attrition, customer turnover, or customer defection, is the loss of clients or customers. Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as one of their key business metrics, because the cost of retaining an existing customer is far less than the cost of acquiring a new one. Companies from these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.

Companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs due to a decision by the customer to switch to another company or service provider. Involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided.

Predictive analytics uses churn prediction models that estimate each customer's propensity to churn. Since these models generate a small, prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base that is most vulnerable to churn.
In this assignment, we will be applying the K Nearest Neighbors algorithm to see how well our model can classify new data as either positive (TRUE) for churn or negative (FALSE) based on certain characteristics of our dataset.
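To make the idea concrete, here is a minimal from-scratch sketch of how KNN classifies a new customer: it measures the distance to every known customer, keeps the k closest, and takes a majority vote. The function name `knn_predict` and the toy numbers are illustrative assumptions, not part of the assignment or of Churn.csv; the assignment itself uses scikit-learn's KNeighborsClassifier.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up data: (customer service calls, day minutes in hundreds)
train_points = [(1, 1.5), (0, 2.0), (2, 1.8), (5, 3.0), (6, 2.8), (4, 3.2)]
train_labels = [False, False, False, True, True, True]

# The three nearest neighbours of (5, 2.9) are all churners
prediction = knn_predict(train_points, train_labels, query=(5, 2.9), k=3)
print(prediction)
```

Because the vote depends entirely on distances, features measured on very different scales should be standardized before KNN is applied.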
Dataset: Churn.csv

The dataset contains various data about individual customers for a monthly billing cycle, including:
  • average day, evening, night and international minutes, number of calls and cost data
  • amount of time the customer has had a contract with the company (in months)
  • average number of calls to customer service (the time frame is not given, but we can assume the same time frame for all customers, so the values are directly comparable)

Preliminary remarks: Let’s recall the Data Science Pipeline from Module 1, Week 1. We will be working through each of these stages in this project. You should be sure that you have studied all the Module 4 resources, readings, videos, and tutorials before beginning this final assignment. If you have any questions, do not hesitate to contact your instructor. While this final assignment relies on knowledge and skills acquired throughout the course, the assignment is particularly anchored on the Week 8 K Nearest Neighbors tutorial (Iris dataset).
While you are in possession of a perfectly working Jupyter Notebook with Python code applicable to a different dataset, it is a fact of ‘data science life’ that rarely is code directly usable without modification from one dataset, one project, to another. Therefore it is critical that you not only study the code but also the dataset carefully.
Process: Previous tutorials and assignments in the course have given a step-by-step breakdown of what to do, what to type, what to execute, when and where. This final assignment of the course will deviate from that approach.

Your task: Your task in this assignment is to execute and document an end-to-end machine learning algorithm, K Nearest Neighbors (KNN), on the business problem and Churn.csv dataset presented earlier. Your evaluation is based not so much on identifying the combination of criteria that leads to the highest ‘accuracy’ of the model as on your ability to explain the process, in layman’s terms (for non-data-scientist colleagues and management), at each stage of your investigation, your Data Science Pipeline. You must include the code, output and comments/analysis in your submission. You may create your submission in a Jupyter Notebook or Word. In this assignment, you will follow the basic processes covered in the KNN Iris dataset tutorial (with modifications that you will implement).

You should provide:
  • A comprehensive, multiple-step Summary of the Dataset
  • Data Preparation, including Label Encoding and Feature Scaling
  • Data Visualization of the dataset, with a minimum of 3 visualizations (Note: a single code cell that generates multiple visualizations, such as sns.pairplot or matplotlib boxplot, counts as one visualization)
  • A complete Prediction set, including results, evaluations and at least one form of cross-validation, with at least one graphic of the optimal number of neighbors for the KNN algorithm

Each of the above stages should be accompanied by commentary/analysis. Nearly every cell will need comments.
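As an illustration of the Data Preparation stage, the sketch below applies label encoding and feature scaling to a tiny made-up table. The column names ("Intl Plan", "Day Mins", "CustServ Calls") and values are assumptions chosen for the example; check the actual Churn.csv header before adapting it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up rows; the column names are assumptions, not the real Churn.csv header
toy = pd.DataFrame({
    "Intl Plan": ["no", "yes", "no", "yes"],
    "Day Mins": [120.0, 250.0, 180.0, 310.0],
    "CustServ Calls": [1, 4, 0, 5],
    "Churn": [False, True, False, True],
})

# Label encoding: map each category to an integer
# (classes are sorted, so "no" becomes 0 and "yes" becomes 1)
encoder = LabelEncoder()
toy["Intl Plan"] = encoder.fit_transform(toy["Intl Plan"])

# Feature scaling: give every feature mean 0 and unit variance so that
# no single column dominates the KNN distance calculation
feature_columns = ["Intl Plan", "Day Mins", "CustServ Calls"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(toy[feature_columns])

print(toy["Intl Plan"].tolist())
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

In a full workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.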
Suggested first steps:
  • There are 15 potential features (X values) in the machine learning model you will create. You will want to try at least one iteration of the model with all 15 features. However, before doing so, be sure to run the Step 3.6 Correlation Heat Map and see if that gives you some indication of which features to concentrate on.
  • You will run several iterations of this process on the dataset (4 iterations minimum). Do not ‘cut to the chase’ and choose your final features too quickly; take the time to experiment. You will need to take notes during and after each iteration for the required deliverables (below).
    o Ultimately you should choose to concentrate on 3-4 features (columns). By the final iteration, you will want to remove/delete certain features (columns) from the dataset. This can be done manually in Excel (then reloading a new, more focused dataset). Or, per Step 3 Data Visualization in the KNN Iris dataset tutorial, by using the filter method in Jupyter Notebook to create another DataFrame:
      dataset2 = dataset.filter(['SepalLengthCm', 'SepalWidthCm', 'Species'], axis=1)
    o Note that when you are testing which features to include/exclude, you can modify the following code (from the KNN Iris dataset tutorial), then run the succession of steps to get to the KNN results to compare:
      feature_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
      X = dataset[feature_columns].values
      y = dataset['Species'].values

Deliverables:
  • In terms of complete code, output (including visualizations on your final 3-4 features only) and ‘in code’ comments/analysis, you will submit only your final iteration. Ideally this will be the KNN where you achieved the best results, but as mentioned previously, the commentary/analysis is more important than the results.
  • Full Python code + output/visualizations + ‘in code’ comments/description of process should preferably be in the form of a Jupyter Notebook. A Word document with relevant code, output and commentary entered/cut and pasted is possible as well.
  • In addition to your Jupyter Notebook (or code/output/comments document), you should provide a one-page overall summary of the process (including a brief description of the problem, what KNN should accomplish and descriptions of your various iterations; see below for a sample table format). This will be a separate Word document. Include any significant interim observations/milestones and any conclusions of the resulting model: in this case, which value of K proved most successful in your K Nearest Neighbors algorithm and with what accuracy rate.
  • Thus you will submit two files:
    o Jupyter Notebook or Word file (full Python code + output/visualizations + ‘in code’ comments/description)
    o Word file (one-page overall summary of the process)

This is an open-ended, leverage-what-you’ve-observed-and-learned assignment, very much like adjusting to the ever-evolving knowledge and skill base required in any data science career path. Do not hesitate to consult documentation or demos for Python or the various Python libraries (Pandas, matplotlib, Seaborn, etc.). There is a lot of detail in these instructions, but the bottom line is that we’re looking for a deliverable very similar to the knn_classification_tutorial_iris.ipynb Jupyter Notebook, with additional comments/analysis. As mentioned elsewhere, when adapting code from one dataset to another, adjustments to code and/or parameters will be required.
Of course, if you have any questions, do not hesitate to contact your instructor.
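The feature screening described under "Suggested first steps" can be sketched as follows: rank the candidate features by the absolute value of their correlation with the target, then use filter to keep only the strongest ones. The data here is synthetic and the column names are illustrative assumptions; a correlation ranking is only a starting hint, not a substitute for the required iterations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the churn data; column names are illustrative assumptions
dataset = pd.DataFrame({
    "Day Mins": rng.normal(180, 50, n),
    "Eve Mins": rng.normal(200, 50, n),
    "CustServ Calls": rng.integers(0, 8, n),
})

# Make the synthetic target depend on two columns so the ranking has something to find
score = 0.02 * dataset["Day Mins"] + 0.5 * dataset["CustServ Calls"]
dataset["Churn"] = (score > score.median()).astype(int)

# Rank features by absolute correlation with the target
correlations = (
    dataset.corr()["Churn"].drop("Churn").abs().sort_values(ascending=False)
)
print(correlations)

# Keep the two strongest features plus the target, mirroring the filter call
# from the KNN Iris dataset tutorial
top_features = correlations.index[:2].tolist()
dataset2 = dataset.filter(top_features + ["Churn"], axis=1)
print(dataset2.columns.tolist())
```

On the real dataset, the categorical columns would need to be label encoded before they can appear in the correlation matrix.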
Sample summary table for reporting iterations (adapt as needed):

Iteration # | Features Included | Primary Results | Observations/Analysis | Next Steps
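One possible way to keep this table up to date is to record each iteration programmatically. The sketch below is an assumption about workflow, not a required method: it runs four hypothetical feature subsets on synthetic data (placeholder feature names f0 to f5) and fills the first three columns, leaving Observations/Analysis and Next Steps to be written by hand.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; f0..f5 are placeholder feature names
X_raw, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
data = pd.DataFrame(X_raw, columns=[f"f{i}" for i in range(6)])

# Hypothetical iterations: each one tries a different subset of features
iterations = {
    1: ["f0", "f1", "f2", "f3", "f4", "f5"],
    2: ["f0", "f1", "f2", "f3"],
    3: ["f0", "f1", "f2"],
    4: ["f0", "f1"],
}

rows = []
for number, features in iterations.items():
    # Scale and classify in one pipeline so every CV fold is scaled correctly
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(model, data[features], y, cv=5)
    rows.append({
        "Iteration #": number,
        "Features Included": ", ".join(features),
        "Primary Results": f"mean CV accuracy {scores.mean():.3f}",
    })

summary = pd.DataFrame(rows)
print(summary.to_string(index=False))
```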
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
df = pd.read_csv('Churn.csv')

# Explore the dataset
print("Dataset dimensions:", df.shape)
print("Data types:\n", df.dtypes)
print("Descriptive statistics:\n", df.describe())

# Identify the target variable
target = 'Churn'

# Data Preparation
# Handle missing values if any
# ...
# Perform label encoding for categorical variables
# (X must be fully numeric before scaling and KNN will run)
# ...

# Split the dataset into features (X) and the target variable (y)
X = df.drop(target, axis=1)
y = df[target]

# Data Visualization
# Create at least three visualizations
# ...

# Model Building
# Select features for the model
# ...

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = knn.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification report:\n", classification_report(y_test, y_pred))

# Perform cross-validation
# Scaling happens inside a pipeline so that each fold is scaled using only
# its own training portion (the original passed unscaled X to KNN)
cv_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_scores = cross_val_score(cv_model, X, y, cv=5)
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", np.mean(cv_scores))

# Visualize the optimal number of neighbors
neighbors = range(1, 10)
accuracy_scores = []
for k in neighbors:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    accuracy_scores.append(scores.mean())

plt.plot(neighbors, accuracy_scores)
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('KNN: Number of Neighbors vs. Accuracy')
plt.show()

# Polishing and Presenting
# Write your commentary, analysis, and conclusions
# ...

EXPLANATION OF PROGRAM

The code imports the necessary Python libraries: pandas, NumPy, matplotlib, seaborn, and sklearn (scikit-learn). These libraries contain functions and tools for data analysis and machine learning.

It loads a dataset called 'Churn.csv' using the pandas library. The dataset contains information about customers of a telecommunications company, such as their usage patterns, contract length, calls to customer service, and whether they have churned or not.

The code prepares the data by converting categorical variables (such as 'International Plan' and 'Voice Mail Plan') into numerical ones and scaling the numerical variables so that they have a similar range of values. This step matters because KNN relies on distances between data points, so a feature with a large range would otherwise dominate the result.

The code leaves room to visualize the data using various plots, such as a pair plot, correlation heatmap, and boxplot. These plots help to understand the relationships between variables and to identify any outliers or anomalies in the data.

The code then splits the data into a training set and a testing set using the train_test_split function from sklearn. The training set is used to train the machine learning model, while the testing set is used to evaluate the model's performance.

After that, the code classifies the data using the KNN technique. K-Nearest Neighbors is a machine learning technique that uses the distance between data points to make predictions: a new customer is assigned the class that is most common among the k closest customers in the training set. The number of neighbors k is the hyperparameter, and it is chosen by comparing the mean cross-validated accuracy across candidate values of k.

The code then fits the KNN model with the chosen number of neighbors and predicts churn on the testing set. The predictions are compared to the actual churn values to evaluate the performance of the model, using metrics such as the accuracy score, confusion matrix, and classification report. These metrics show how well the model is performing and can be used to improve the model if necessary.

OUTPUT

The output of this code will be:
1. Summary of the Dataset: This will display the dimensions, data types, and descriptive statistics of the dataset.
2. Accuracy Score: This will display the accuracy of the final KNN model on the testing set.
3. Confusion Matrix: This will display the number of true positives, true negatives, false positives, and false negatives for the final KNN model on the testing set. A confusion matrix helps to understand the types of errors the model is making.
4. Classification Report: This will display metrics such as precision, recall, and F1-score for each class (churn and no churn) of the final KNN model on the testing set. A classification report helps to understand the performance of the model in each class separately.

SUMMARY OF CODE AND FUNCTION

Data Loading and Exploration:
o The dataset is loaded using pandas, and its dimensions, data types, and descriptive statistics are displayed to understand the dataset's structure and content.

Data Preparation:
o Missing values are handled if any are present in the dataset.
o Categorical variables are encoded using label encoding to convert them into numerical representations.
o The dataset is split into features (X) and the target variable (y).

Data Visualization:
o The code provides a placeholder to create at least three visualizations to gain insights into the dataset, such as relationships between variables or data distributions.

Model Building:
o The dataset is split into training and testing sets using the train_test_split function from scikit-learn.
o Feature scaling is applied to standardize the features using StandardScaler from scikit-learn.
o The KNN classifier is instantiated with a chosen value of k (number of neighbors).
o The model is trained on the training data using the fit method.

Model Evaluation:
o The model's performance is evaluated by predicting the churn status for the testing data using the predict method.
o The accuracy of the model is calculated using the accuracy_score function from scikit-learn.
o Cross-validation is performed using cross_val_score to assess the model's generalization performance.

Visualizing the Optimal Number of Neighbors:
o The code iterates over different values of k and calculates the mean accuracy score using cross-validation.
o The results are plotted to visualize the relationship between the number of neighbors and the accuracy of the KNN model.

Polishing and Presenting:
o The code provides a placeholder for you to add your commentary, analysis, and conclusions about the results and the performance of the model.
o You can write a summary of the process, including a description of the problem, the purpose of KNN, descriptions of your iterations, significant observations, conclusions, and the optimal value of k with its corresponding accuracy rate.
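To show how the evaluation metrics relate to one another, here is a small worked example on ten made-up predictions (1 = churn, 0 = no churn). The labels are invented for illustration and are not results from Churn.csv.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for ten customers (1 = churned); not results from Churn.csv
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are actual classes and columns are predicted classes,
# both in the order given by `labels`
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()
accuracy = accuracy_score(y_true, y_pred)

print(matrix)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy)
print(classification_report(
    y_true, y_pred, labels=[0, 1], target_names=["no churn", "churn"]
))
```

Here 7 of 10 predictions are correct, but only 2 of the 4 actual churners are caught (recall of 0.5 for the churn class). This is why accuracy alone can be misleading when churners are the minority class, and why the confusion matrix and classification report are worth including in the commentary.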