D208_Task2_PA

School

Western Governors University

Course

D208

Subject

Statistics

Date

Feb 20, 2024

Type

docx

Pages

24

Uploaded by MagistrateAntelope3113

D208 - Predictive Modeling Logistic Regression Modeling
Table of Contents

Part I: Research Question
  A. Describe Purpose of Analysis
    1. Summarize one research question
    2. Define Goals of Analysis
Part II: Method Justification
  B. Describe Multiple Logistic Regression Methods
    1. Summarize four assumptions of a logistic regression model
    2. Describe two benefits of using Python in support of analysis
    3. Explain why logistic regression is an appropriate technique to use based on question in Part I
Part III: Data Prep
  C. Summarize the data prep process
    1. Describe Data cleaning goals – See attached .ipynb File
    2. Describe dependent and all independent variables
    3. Generate univariate and bivariate visualizations of the distributions – independent and dependent variables, include dependent variable in bivariate visualization
    4. Describe data transformation goals that align with your research question and the steps used to transform the data to achieve goals, include annotated code
    5. Provide Prepared set as a CSV file
Part IV: Model Comparison & Analysis
  D. Compare initial and reduced logistic regression model
    1. Initial multiple logistic regression model with all variables from part C2
    2. Justify statistically based feature selection
    3. Provide reduced logistic regression model
  E
    1. Model Evaluation Metric explanation
    2. Confusion Matrix & Accuracy Calculation
    3. Attached code
  F. Summary
    1. Discuss results
    2. Recommend course of action
  G
  H
  I
Part I: Research Question

A. Describe Purpose of Analysis

1. Summarize one research question

What factors contribute to Churn?

2. Define Goals of Analysis

The objective of my analysis is to gain insight into which customer factors correlate directly with whether or not a customer churns.

Part II: Method Justification

B. Describe Multiple Logistic Regression Methods

1. Summarize four assumptions of a logistic regression model

Assumptions for this model include the following. Observations are independent: the outcome of one observation should not influence what happens in another observation. There is little or no multicollinearity: the independent variables should not correlate highly with each other. The model should fit the data well, which a goodness-of-fit test can be used to evaluate. The independent variables should be linearly related to the log-odds (logit) of the outcome.

2. Describe two benefits of using Python in support of analysis

Jupyter Notebook and Python are the tools I used to complete this analysis. Using Python as my method of analysis is beneficial for many reasons, but I will list only two. The first benefit is that Python offers multiple data visualization libraries that I can use to help me visualize my logistic regression models. The second benefit is that its rich ecosystem of libraries already implements many of the calculations I need, which saves time in the analysis phase. Together, these mean that I can calculate and visualize my data with ease using Python.

3. Explain why logistic regression is an appropriate technique to use based on question in Part I

Our target variable, Churn, is a binary categorical field, and logistic regression will help identify the elements that influence it. Therefore, logistic regression is an appropriate technique to assist me in answering my question in Part I. We will test independent variables to determine the effect they have on our target variable. The effect could be positive, negative, or none.

Part III: Data Prep

C. Summarize the data prep process

1. Describe Data cleaning goals – See attached .ipynb File

While becoming familiar with the data by using .describe(), box plots, and .isnull() sums, I was able to identify areas of the data that needed to be cleaned. The goal is to have a data set that is well suited to a logistic regression analysis.
I began by identifying null fields in InternetService. Using techniques similar to those I used in D206, I filled those NaN fields with the .fillna() method. It is an easy way to fill null fields without compromising the data or removing excessive rows from the data source. Once the nulls were taken care of, I used box plots to identify outliers in the data. Multiple fields appeared to have outliers: Income, Children, and Outage_sec_perweek. I decided to count the outliers and then replace them if they were several standard deviations above the mean. The threshold was set at 3: outliers more than three standard deviations from the mean, identified by their z-scores, were replaced. With the null values and outliers taken care of, the data cleaning step was complete.
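The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration and not the code from the attached notebook: the helper names `fill_missing` and `replace_outliers`, the fill value `"None"`, and the choice to replace outliers with the median of the remaining values are all assumptions, since the report does not state what the replacement values were.

```python
import pandas as pd


def fill_missing(df, column, fill_value):
    """Return a copy of df with nulls in `column` filled via .fillna()."""
    out = df.copy()
    out[column] = out[column].fillna(fill_value)
    return out


def replace_outliers(df, column, threshold=3.0):
    """Replace values whose |z-score| exceeds `threshold`.

    The replacement value (median of the non-outlier rows) is an assumption.
    Returns the cleaned copy and the number of values replaced.
    """
    out = df.copy()
    values = out[column].astype(float)
    z_scores = (values - values.mean()) / values.std(ddof=0)
    is_outlier = z_scores.abs() > threshold
    out[column] = values.where(~is_outlier, values[~is_outlier].median())
    return out, int(is_outlier.sum())


# Hypothetical toy data: 19 ordinary incomes plus one extreme value.
toy = pd.DataFrame({
    "InternetService": ["DSL", None, "Fiber Optic", None] * 5,
    "Income": [30000 + 1000 * i for i in range(19)] + [258900],
})

filled = fill_missing(toy, "InternetService", "None")
cleaned, n_replaced = replace_outliers(filled, "Income")
print(filled["InternetService"].isnull().sum(), n_replaced)
```

Counting the flagged rows before replacing them mirrors the order described in the text: identify the count of outliers first, then replace.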
2. Describe dependent and all independent variables

The original data set contains 50 columns and 10,000 rows of customers. For my analysis I will be focusing on ‘Churn’ as my dependent variable. I will also retain and summarize ‘Income’, ‘Outage_sec_perweek’, ‘Tenure’, ‘MonthlyCharge’, and ‘Bandwidth_GB_Year’. These are my continuous values. The categorical values that will be retained as independent variables are ‘Area’, ‘Contract’, and ‘PaymentMethod’.

The dependent variable, Churn, is binary: ‘Yes’ or ‘No’. ‘Yes’ indicates that the customer churned and ‘No’ indicates that they did not. ‘No’ is the top response with 7350 rows, and 2650 were ‘Yes’.

The continuous independent variables are as follows. Income has 10000 rows, a minimum stated income of $348, and a maximum of $258900. The average income per year for the customers is $39806. Outage_sec_perweek has an average of 10 seconds per week, a maximum of 21, and a minimum of .09 seconds per week. Tenure is the length of time a customer has been with the provider, with an average of 34 months and a maximum of 71 months. Bandwidth_GB_Year has an average of 3392 GB per year, a maximum of 7158 GB per year, and a minimum of 155 GB per year. MonthlyCharge is the final continuous independent variable and has an average of $172.62; the maximum payment per month is $290.16 and the minimum is $79.98.

There are 4 categorical variables: the three independent variables Area, Contract, and PaymentMethod, plus the dependent variable Churn. Area is the type of area a customer lives in. The top area for our customers is Suburban with 33.46% of customers, followed by Urban at 33.27% and Rural at 33.27%. Contract is the type of contract our customers are on. The most frequent contract is month-to-month with 54.56% of customers, followed by Two Year at 24.42% and One year at 21.02%. Finally, PaymentMethod is the last independent categorical value. The top payment method is Electronic Check with 33.98% of customers utilizing e-checks, followed by 22.9% mailing in checks, 22.29% having an automatic bank transfer, and 20.83% having automatic credit card transactions. Churn is the final categorical field, with values of Yes or No, and 73.5% of the 10,000 rows are still customers.
Using this data to describe our typical customer, we have a customer who likely has a yearly income of $39806, experiences an average of 10 seconds of internet interruption due to outages per week, has been with us as a provider for around 34 months, and uses 3392 GB of data per year. They also likely live in a suburban area, are on a month-to-month contract, and pay about $173 per month by electronic check.
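Summary figures like the ones quoted above (mean, minimum, and maximum for continuous fields, and percentage shares for categorical fields) can be produced with a few pandas calls. The sketch below uses a small made-up data frame, and the helper name `summarize` is hypothetical; it is meant only to show the kind of calculation involved, not to reproduce the report's numbers.

```python
import pandas as pd


def summarize(df, numeric_cols, categorical_cols):
    """Return mean/min/max per numeric column and percentage shares per category."""
    numeric_summary = df[numeric_cols].agg(["mean", "min", "max"]).T
    shares = {
        col: (df[col].value_counts(normalize=True) * 100).round(2)
        for col in categorical_cols
    }
    return numeric_summary, shares


# Hypothetical toy rows; the real data set has 10,000 customers.
toy = pd.DataFrame({
    "MonthlyCharge": [100.0, 150.0, 200.0, 250.0],
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two Year"],
    "Churn": ["No", "No", "No", "Yes"],
})

numeric_summary, shares = summarize(toy, ["MonthlyCharge"], ["Contract", "Churn"])
print(numeric_summary)
print(shares["Churn"])
```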
Now the goal is to determine what aspects of these independent variables influence Churn.

3. Generate univariate and bivariate visualizations of the distributions – independent and dependent variables, include dependent variable in bivariate visualization

Dependent variable univariate visualization for monthly charge:

Independent univariate continuous variables visualizations:

Independent univariate categorical variables visualizations:
Bivariate visualizations:
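The original figures are not available here, so the following is only a sketch of how univariate and bivariate plots of this kind might be generated with matplotlib and pandas. The function name `plot_distributions`, the chart types, and the toy data are assumptions; the attached notebook may have used different plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_distributions(df, numeric_col, categorical_col, target="Churn"):
    """Draw two univariate plots (top row) and two bivariate plots against the target."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Univariate: histogram of a continuous variable.
    axes[0, 0].hist(df[numeric_col], bins=20)
    axes[0, 0].set_title(f"Univariate: {numeric_col}")

    # Univariate: bar chart of category counts.
    counts = df[categorical_col].value_counts()
    axes[0, 1].bar(counts.index.astype(str), counts.values)
    axes[0, 1].set_title(f"Univariate: {categorical_col}")

    # Bivariate: box plot of the continuous variable for each target class.
    grouped = list(df.groupby(target))
    axes[1, 0].boxplot([group[numeric_col].to_numpy() for _, group in grouped])
    axes[1, 0].set_xticklabels([str(name) for name, _ in grouped])
    axes[1, 0].set_title(f"{numeric_col} by {target}")

    # Bivariate: stacked bar chart of category counts split by target class.
    pd.crosstab(df[categorical_col], df[target]).plot.bar(
        stacked=True, ax=axes[1, 1], title=f"{categorical_col} by {target}"
    )

    fig.tight_layout()
    return fig


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "MonthlyCharge": [90, 120, 150, 180, 210, 240, 170, 200],
    "Contract": ["Month-to-month", "One year", "Two Year", "Month-to-month"] * 2,
    "Churn": ["No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes"],
})
fig = plot_distributions(toy, "MonthlyCharge", "Contract")
fig.savefig("distributions.png")
```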
4. Describe data transformation goals that align with your research question and the steps used to transform the data to achieve goals, include annotated code

The data transformation that I performed was one-hot encoding. The fields that were affected were the categorical fields Churn, Contract, PaymentMethod, and Area. Churn is the dependent field, and the others are the independent categorical fields I chose to test against my dependent variable. These fields had to be transformed using one-hot encoding because they are not ordinal fields that could be ranked using a ranking system. They also could not be converted to True or False. One-hot encoding takes a column and pulls out the responses to create a separate column for each possible value, assigning a 1 or 0 depending on what the row contained. This will assist with the logistic regression later on. To help mitigate multicollinearity, the first column of each encoded field is dropped. Unlike in the linear regression model, Churn is our dependent variable for logistic regression and therefore will not need to be transformed into a numerical, binary field.

5. Provide Prepared set as a CSV file

See attached C5_preppeddata.csv
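The one-hot encoding step described in C4 can be sketched with `pandas.get_dummies`. This is an illustration under assumptions rather than the notebook's code: the `encoded` prefix is inferred from column names such as `encoded_One year` that appear later in the report, and the helper name `one_hot_encode` and the toy rows are hypothetical.

```python
import pandas as pd


def one_hot_encode(df, columns, prefix="encoded"):
    """One-hot encode `columns`, dropping the first level of each to limit multicollinearity."""
    return pd.get_dummies(df, columns=columns, prefix=prefix, drop_first=True, dtype=int)


# Hypothetical toy rows using category labels mentioned in the report.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two Year", "Month-to-month"],
    "Area": ["Rural", "Suburban", "Urban", "Suburban"],
    "MonthlyCharge": [172.62, 290.16, 79.98, 150.00],
})

encoded = one_hot_encode(toy, ["Contract", "Area"])
print(encoded.columns.tolist())
```

With `drop_first=True`, the first level of each field in sorted order (here `Month-to-month` and `Rural`) becomes the implicit baseline, so the remaining 0/1 columns are read relative to it.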
Part IV: Model Comparison & Analysis

D. Compare initial and reduced logistic regression model

1. Initial multiple logistic regression model with all variables from part C2
2. Justify statistically based feature selection

The VIF scores indicated multicollinearity for Tenure and Bandwidth_GB_Year. The threshold is 10. Once Tenure was removed, the VIF for Bandwidth_GB_Year decreased below the threshold, so I ended up leaving that variable in. Once multicollinearity was analyzed, I used p-values to determine the independent variables to retain:

Once those variables were removed, I checked VIF again to verify that it was not affected by removing the other fields. Checking VIF (multicollinearity) again:

Using VIF scores and p-values, I was able to determine which variables affecting the Churn dependent variable should be retained. I finished the reduction by checking correlations with .corr(). MonthlyCharge (0.37) has a positive correlation and Bandwidth_GB_Year (-0.44) has a negative correlation; both indicate a stronger correlation to Churn than encoded_Two Year (-0.18) and encoded_One year (-0.14). These are the values I will be keeping for the reduced model.
3. Provide reduced logistic regression model
E.

1. Model Evaluation Metric explanation

Initial Model:

Reduced Model:

The accuracy score for the initial model is 80%, while the accuracy score for the reduced model is 87%. The reduced model accurately predicts “No” in 1340 instances and “Yes” in 394 instances. The initial model correctly predicted “No” 1276 times and “Yes” 331 times. This indicates that the reduced model was more accurate than the initial model at predicting an outcome of churn. Of the actual instances, the reduced model captures 92% of the actual “No” instances and 77% of the “Yes” instances, whereas the initial model captures 88% and 61% respectively.

2. Confusion Matrix & Accuracy Calculation
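To make the calculation concrete, the sketch below builds a 2x2 confusion matrix from predicted and actual labels and derives accuracy and the per-class capture rates (recall) from it. The labels are a made-up toy example, not the report's test set, and the helper names `confusion_counts` and `evaluate` are hypothetical.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels coded 0 ("No") and 1 ("Yes")."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return np.array([[tn, fp], [fn, tp]])


def evaluate(matrix):
    """Compute accuracy and the share of actual "No" and "Yes" cases captured."""
    tn, fp, fn, tp = matrix.ravel()
    return {
        "accuracy": (tn + tp) / matrix.sum(),
        "recall_no": tn / (tn + fp),
        "recall_yes": tp / (tp + fn),
    }


# Hypothetical toy labels: six actual "No" and four actual "Yes".
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

matrix = confusion_counts(actual, predicted)
scores = evaluate(matrix)
print(matrix)
print(scores)
```

Accuracy is the sum of the diagonal (correct “No” plus correct “Yes”) divided by all predictions, which is how counts such as 1340 and 394 translate into a single percentage once the test-set size is known.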
3. Attached code

See attached .ipynb file.
F. Summary

1. Discuss results

The regression equation:

Interpreting the coefficients:

'MonthlyCharge': it is positively associated with churn. For every one-unit increase in MonthlyCharge, the log-odds of churn increase by .041, meaning a customer paying more is more likely to churn than a customer paying less.

'encoded_One year': it is negatively associated with churn. If the customer is on a one-year contract, the log-odds of churn decrease by 2.813.

'encoded_Two year': similar to encoded_One year, if the customer is on a two-year contract, the log-odds of churn decrease by 2.380.

'Bandwidth_GB_Year': for every one-unit increase in this field, the log-odds of churn decrease by 0.001 (.00009763).

The practical significance of this reduced model is that it identifies the independent variables most relevant to the Churn variable. The most impactful variable is encoded_One year, which has a slightly stronger impact than two-year contracts. Statistically speaking, it is important to identify which values are most impactful to churn, because if we can identify the indicators of churn, we can work to prevent it. The p-values indicate that MonthlyCharge and the type of contract the customer is on are statistically the most impactful to churn, based on the analysis that I completed.

One limitation I experienced is that logistic regression assumes independence among the independent variables. In this model especially, it is very likely that the monthly charge and contract type are related in some way, which the logistic regression did not take into account.
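Logistic regression coefficients are easier to read once exponentiated, because the exponential of a coefficient is the factor by which the odds of churn are multiplied for a one-unit increase in that variable. The sketch below applies this to three coefficients quoted above; the helper name `odds_ratio` is hypothetical, and the calculation assumes the quoted values are log-odds coefficients from the fitted model.

```python
import math

# Coefficients as quoted in the coefficient interpretation above.
# Bandwidth_GB_Year is omitted because two different magnitudes are quoted for it.
reported_coefficients = {
    "MonthlyCharge": 0.041,
    "encoded_One year": -2.813,
    "encoded_Two year": -2.380,
}


def odds_ratio(coefficient):
    """Convert a log-odds coefficient into a multiplicative change in the odds."""
    return math.exp(coefficient)


for name, coefficient in reported_coefficients.items():
    print(f"{name}: odds multiplied by {odds_ratio(coefficient):.3f}")
```

Under that assumption, each additional unit of MonthlyCharge multiplies the odds of churn by roughly 1.04, while a one-year contract multiplies them by roughly 0.06 relative to the baseline contract type.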
Another limitation is that the model correctly predicts only 77% of the “Yes” cases for churn. That is a good amount, but more investigation would likely be needed to make the model more accurate for that class. This is probably because the sample contains many more “No” responses than “Yes” responses.

2. Recommend course of action

Based on the relationship between Churn and one-year contracts, my recommended course of action is for the company to push one-year contracts. Customers are less likely to leave the company if they are on a one-year contract. This research will reinforce to sales and upper management the importance of selling one-year contracts to decrease the likelihood of churn in customers.

G. See attached .mp4

H. N/A

I. N/A