Final Project Report draft 9th Dec_WIP

Final Team Project - Credit Prediction

Anirban Mukherjee (2123152)
Ashish Dhyani (2208510)
Mithushan Kirubaithasan (2211115)
Nickson Maxwell (2207948)
Hetti Arachchige Lihini Vihanga Hettiarachchi (2231246)
Zhen Wang (2233414)

BUSI 651-Section 4
University Canada West
Instructor: Ghodrat, Mohsen
Due Date: Dec. 09, 2023
Sections and tasks (Completion Check)

Executive summary: Lihini
Introduction: Lihini
Project justification, problem definition, project goal, etc.: Mitu
Team roles: Mitu
Data exploration and analysis: Zhen
Quantitative analysis and data visualizations: Zhen & Nickson
Data Preprocessing: Zhen
Predictive model selection and development: Anirban
Model performance assessment: Nickson
Predictions, discussion, and recommendations: Lihini & Mitu
References/Appendix
Contents

Executive summary
Introduction
Problem Definition and approach
    Approach
Data exploration and analysis
Quantitative analysis and visualizations
    Histogram
    Boxplot
    Heat Map
Data Preprocessing
    Missing (NULL) Values
        #Columns to drop
        Use the KNN method to impute missing values
        # Fill missing values with 'unknown'
        #Fill missing values with the mean
    Errors
        Identify and remove outliers
        Remove duplicated records
Descriptive findings
    Income Category and Education level
    Education Level and Gender
    Gender vs Credit Limit and Education Level vs Credit Limit
    Income Category Vs. Credit Limit
Predictive Model selection and development
    Predictive Model Selection
    Predictive Model development
Model performance assessment
Prediction
Appendix 1 (Data Exploration and Preprocessing)
Appendix 2 (Prediction model, evaluation, and result)
Executive summary

This report focuses on predicting customer credit limits through the development of a neural network model, leveraging various demographic and financial features. The comprehensive data exploration, cleaning, and preprocessing phases ensure the integrity of the dataset, while insightful visualizations shed light on key patterns and relationships. Notably, the average card utilization ratio emerges as a critical metric, influencing credit risk and customer behavior.

The predictive model, built using TensorFlow/Keras, showcases a multi-layered architecture with distinct activation functions. Trained and validated on relevant features, the model exhibits a promising performance, as indicated by metrics such as Mean Squared Error (MSE) and R-squared. Visualizing gradient descent patterns during training offers valuable insights into the model's learning dynamics.

Descriptive findings underscore the interplay between education level, income category, and gender, providing nuanced context to customer characteristics. The model, after meticulous training and evaluation, demonstrates its capability to reasonably predict credit limits. However, ongoing refinement opportunities are identified through visualizing prediction results.

In conclusion, this report advances the understanding of credit prediction using advanced machine learning techniques. The developed neural network model, while displaying satisfactory performance, suggests avenues for further optimization. The insights gained hold implications for enhancing credit risk assessment and strategic decision-making within the financial landscape.
Introduction

In the dynamic landscape of financial services, the ability to assess and predict customer creditworthiness is paramount for mitigating risk and ensuring sound lending practices. As the financial industry continues to evolve, leveraging cutting-edge technology becomes imperative to enhance decision-making processes. This report endeavors to delve into the realm of credit prediction, focusing specifically on estimating the credit limit of customers through the utilization of a neural network.

The dataset provided encompasses a myriad of demographic and financial features, ranging from customer age and gender to transaction amounts and education levels. Through an intricate process of data exploration, cleaning, and feature engineering, this study seeks to unveil latent patterns and relationships within the data. The ultimate goal is to construct a robust predictive model that can navigate the intricate web of variables to provide accurate estimations of credit limits.

The predictive model, driven by a neural network architecture, will be trained and fine-tuned to optimize its performance. The success of the model will be evaluated using metrics tailored to the nature of the problem, ensuring a comprehensive assessment of its accuracy and reliability. This project does not merely stop at algorithmic predictions; it aspires to offer interpretability, shedding light on the features that wield the most influence in determining credit limits.

As we embark on this journey, we aim to not only fulfill the technical requisites of the task but also cultivate a nuanced understanding of the multifaceted interplay between customer attributes and credit limits. The insights gleaned from this endeavor hold the potential to not only
augment the predictive capabilities of financial institutions but also foster strategic decision-making that aligns with the evolving landscape of credit assessment.

Student Name | Student ID | Team Role
Anirban Mukherjee | 2123152 | Predictive model selection and development and model performance assessment
Zhen Wang | 2233414 | Data Exploration and Analysis & Quantitative Analysis
Mithushan Kirubaithasan | 2211115 | Project justification, Problem definition, project goal, Predictions, and Discussions
Lihini Hettiarachchi | 2231246 | Executive Summary, Introduction & Recommendations
Nickson Maxwell | 2207948 | Data Visualizations
Ashish Dhyani | 2208510 | APA Formatting, Proofreading, recommendations

Project Goal

The goal of this research is to create a neural network model that can estimate customer credit limits with high reliability and accuracy using a variety of financial and demographic
variables. In order to understand the intricate correlations between attributes and credit limits, this model will be trained on a sizable dataset of customer data. The system will thereafter produce dependable and precise forecasts for prospective clients, providing interpretability to comprehend the primary aspects impacting those forecasts. If this project succeeds, institutions will have a strong tool to boost customer satisfaction, improve credit risk assessment, and encourage responsible lending practices. It is made to be easily integrated into current financial systems.

Project Justification

For financial institutions to reduce risk and make wise lending decisions, they must be able to estimate customer credit limits with accuracy. The intricate interactions between numerous elements may be missed by traditional algorithms for credit limit prediction, which frequently depend on static attributes. By using a neural network model that can learn from a vast data set of demographic and financial variables, this study aims to build a more accurate and dependable method of predicting credit limits.

Problem Definition and approach

Problem Statement: Predicting "Credit Limit" of customers.

Objective: Develop a predictive model using a neural network to estimate the credit limit of customers based on various demographic and financial features such as customer age, gender, education level, total revolving balance, transaction amounts, and marital status. Train the model to predict the credit limit, evaluate its accuracy using appropriate metrics, and visualize the
model's learning patterns during the training process. The aim is to create an accurate model that can effectively estimate credit limits for customers based on their attributes.

Approach:

Preparation of Data: Preprocessing of the dataset includes managing missing values, encoding categorical variables, and scaling numerical features.
Feature Choice: Relevant features are chosen based on their potential effect on "credit limit" prediction.
Model Construction: TensorFlow/Keras is used to build a neural network model with numerous layers and distinct activation functions. The model is trained using the chosen features and the target variable (Credit_Limit).
Model Assessment: The model's performance is evaluated using measures such as Mean Squared Error (MSE) and R-squared Score.
Visualisation: Gradient descent patterns are visualised to help comprehend the model's learning process.
Prediction: Based on the features of the clients, the trained model predicts credit limits.
Model Accuracy Evaluation: The model's accuracy and efficiency in forecasting credit limits are tested to determine its suitability for estimating credit limits.
Data exploration and analysis

We uploaded the dataset CreditPrediction.csv to Google Drive and loaded it from that path. Running df.describe() gives a basic description of the dataset, which has 15 columns with numerical data and 10167 rows. The 'Customer_Age' column looks abnormal because its maximum value is 352.33, and the 'Unnamed: 19' column also looks abnormal because it contains no entries (count is 0); both probably indicate an error or a placeholder.

Running df.info() shows that the data has 10167 entries and 20 columns, which contain three types of data: integer (int64), floating point number (float64), and object. The columns 'Gender', 'Marital_Status', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count' and 'Unnamed: 19' have missing values. The column 'Unnamed: 19' is completely empty and will not be used, so we drop it:

df = df.drop(['Unnamed: 19'], axis=1)

Running duplicate_rows_df = df[df.duplicated()] shows that there are 35 duplicate rows, which we remove:

df = df.drop_duplicates()

Quantitative analysis and visualizations

First, we want to find how the target variable 'Avg_Utilization_Ratio' is distributed, so we created a histogram, shown below:
Histogram

The distribution is right skewed, indicating that most customers have a low credit card utilization ratio. The highest frequency is at an `Avg_Utilization_Ratio` of 0, suggesting that many customers do not utilize their credit limit at all. Only a few ratios are close to 1, showing that customers rarely utilize their credit limits fully.

Boxplot

Turning to the categorical features, we want to find out how the average credit utilization ratio is distributed across income categories, so we created a boxplot, shown below:
Clients with the lowest annual income (less than $40K) have the highest average credit utilization ratio and the most variation in their credit utilization behavior. This could indicate different financial needs and preferences or different challenges and opportunities in accessing and managing credit among this group. Clients with the highest annual income ($120K+) have the lowest utilization ratios and the least variation in their credit utilization behavior. This could indicate more financial stability and security or more access and control over their credit sources among this group.
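The skew of the histogram and the spread shown by the boxplot can also be checked numerically. The following is a minimal pandas sketch under stated assumptions: the ten sample rows are made up for illustration and are not taken from CreditPrediction.csv, and the column name `Income_Category` is assumed.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned dataset; values are
# illustrative only. 'Income_Category' is an assumed column name.
sample = pd.DataFrame({
    "Income_Category": ["Less than $40K"] * 5 + ["$120K +"] * 5,
    "Avg_Utilization_Ratio": [0.0, 0.1, 0.4, 0.7, 0.9,
                              0.0, 0.0, 0.05, 0.1, 0.2],
})

# Positive skewness corresponds to a right-skewed histogram.
skewness = sample["Avg_Utilization_Ratio"].skew()

# Quartiles per income category: the numbers a boxplot draws.
quartiles = (
    sample.groupby("Income_Category")["Avg_Utilization_Ratio"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
# Interquartile range as a simple measure of within-group variation.
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]

print(f"skewness: {skewness:.2f}")
print(quartiles)
```

On the real data, a larger median and IQR for the lowest income group would correspond to the taller, higher box described above.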
Heat Map

The 'Avg_Utilization_Ratio' is calculated as the amount of revolving balance divided by the credit limit. By observing the numerical variables, we also chose the following features that might indirectly affect the credit utilization ratio and created a heatmap showing the correlation coefficients:

Customers with a higher revolving balance tend to have a higher utilization ratio (0.62), while customers with a higher credit limit tend to have a lower utilization ratio (-0.48). There is a negative correlation (-0.08) between 'Total_Trans_Amt' and `Avg_Utilization_Ratio`. This suggests that customers with higher total transaction amounts tend to have a lower utilization ratio, but the relationship is very weak. In addition, customers with more products with the bank tend to have a higher utilization ratio.
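The signs of the two strongest correlations follow partly from how the ratio is defined. The sketch below, in plain Python, computes the ratio as revolving balance divided by credit limit and then the Pearson coefficient against each input. The five customer records, the variable names, and the helper `pearson` are assumptions for illustration, so the coefficients will not match 0.62 or -0.48.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical customers (values made up for illustration).
revolving_balance = [0, 500, 1200, 1800, 2500]
credit_limit = [12000, 9000, 4000, 3000, 2600]

# Utilization ratio as defined in the text: balance / limit.
utilization = [bal / lim for bal, lim in zip(revolving_balance, credit_limit)]

r_balance = pearson(revolving_balance, utilization)
r_limit = pearson(credit_limit, utilization)
print(f"balance vs ratio: {r_balance:+.2f}")
print(f"limit vs ratio:   {r_limit:+.2f}")
```

A heatmap is a colour-coded grid of such coefficients, one per pair of columns.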
Data Preprocessing

Missing (NULL) Values

Running pd.DataFrame(df.isna().sum()).T shows the following null counts:

"Gender": 199, "Marital_Status": 1939, "Card_Category": 1915, "Months_on_book": 221, "Total_Relationship_Count": 20

#Columns to drop

The 'Card_Category' column is dropped because it has too many missing values and might not provide reliable information for analysis or modeling:

columns_to_drop = ['Card_Category']
df = df.drop(columns=columns_to_drop)

Use the KNN method to impute missing values. 'Marital_Status' is kept because it may be used to create a predictive model later, and it is excluded from the columns to impute:

from sklearn.impute import KNNImputer
# Define columns to impute (excluding the target column and other non-numeric columns)
columns_to_impute = [col for col in df.columns if col != 'Marital_Status' and df[col].dtype == 'float64']
# Initialize KNN imputer with desired parameters
imputer = KNNImputer(n_neighbors=5)

# Fill missing values with 'unknown'

We want to keep the rows with a missing gender for analysis:

df['Gender'] = df['Gender'].fillna('unknown')

#Fill missing values with the mean

To preserve the overall distribution of the data, we filled the missing values of 'Months_on_book' and 'Total_Relationship_Count' with the column mean:

df['Months_on_book'] = df['Months_on_book'].fillna(df['Months_on_book'].mean())
df['Total_Relationship_Count'] = df['Total_Relationship_Count'].fillna(df['Total_Relationship_Count'].mean())

Errors

Identify and remove outliers.

Customer_Age:
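The rule used to flag outliers in 'Customer_Age' is not stated here. One common choice is the 1.5 x IQR rule, sketched below in plain Python. The sample ages are made up, apart from 352.33, which mirrors the abnormal maximum noted during exploration, and the helper name `iqr_bounds` is an assumption for illustration.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages; 352.33 mirrors the abnormal maximum in the dataset.
ages = [26, 34, 38, 41, 45, 46, 49, 52, 57, 65, 352.33]

lower, upper = iqr_bounds(ages)
cleaned = [a for a in ages if lower <= a <= upper]
removed = [a for a in ages if not (lower <= a <= upper)]
print(f"fences: {lower:.1f} to {upper:.1f}; removed: {removed}")
```

A fixed plausibility range for age would work equally well for a value this extreme; the IQR rule has the advantage of not needing a hand-picked threshold.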
Remove duplicated records.

After we fill the null values, there may be some duplicated rows. Running duplicate_rows = df[df.duplicated()] identifies 2 duplicate records, which we remove. The final integrated dataset has 21 columns of consistent data containing 10,111 non-null entries.

Avg_Utilization_Ratio: The average card utilization ratio is a crucial measure of how much of the available credit a customer is using, which can impact credit risk and, consequently, churn (Appendix 1).
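Filling null values can make previously distinct rows identical, which is why duplicates are checked a second time after imputation. The following is a minimal pandas sketch of that sequence on a made-up three-row frame; the values are assumptions for illustration, while the column names follow the dataset.

```python
import pandas as pd

# Hypothetical rows: the first two differ only in a missing Gender value.
df = pd.DataFrame({
    "Customer_Age": [45.0, 45.0, 51.0],
    "Gender": ["unknown", None, "F"],
    "Months_on_book": [36.0, 36.0, None],
})

duplicates_before = int(df.duplicated().sum())

# Same kind of fills as in the missing-value step.
df["Gender"] = df["Gender"].fillna("unknown")
df["Months_on_book"] = df["Months_on_book"].fillna(df["Months_on_book"].mean())

duplicates_after = int(df.duplicated().sum())

# Keep the first copy of each duplicated row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
print(duplicates_before, duplicates_after, len(df))
```

In this sample no duplicates exist before the fill and one appears afterwards, the same effect that produced the 2 extra duplicate records in the dataset.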
Descriptive findings

Income Category and Education level:

The above graphical representation of the processed dataset shows the following findings: All the Graduate students fall under the income brackets of "Less than 40K" and "80K to 120K". This suggests that Graduate students are in either the lower or the higher income bracket. All customers with a high school education earn less than 80K. Strangely, the uneducated population falls in the 60K-80K income bracket.

The second graph in the above figure shows the total income distribution. The "60K-80K" category is the highest and is equal to the "less than 40K" category. However, the "80K-120K" category is half the size of the previous two categories.
Education Level and Gender:

The above graphical representation of the processed dataset shows the following findings: None of the "Female" population falls under the "Uneducated" category. The ratio of gender distribution is 3:2 for Male to Female.

Gender vs Credit Limit and Education Level vs Credit Limit

The first chart shows that the Male category has a much higher credit limit than the Female category. Moreover, the second chart shows the credit limit distribution across various
education levels. The second graph shows that the credit limit does not vary much with an individual's qualification level.

Income Category Vs. Credit Limit

The above figure shows that the credit limit is highest for people in the $120k+ category and decreases consistently as the income bracket gets lower.

Predictive Model selection and development

Predictive Model Selection

The problem for this project is predicting "Credit Limit" based on the set of numerical features in the dataset. As "Credit Limit" is continuous, we treat this as a regression problem. We chose an Artificial Neural Network as a suitable model for predicting the "Credit Limit" of each customer.

Predictive Model development

Feature Selection: The following numeric features have a potential impact on "Credit Limit". This was also decided based on the correlation matrix.
Splitting the data into features (X) and the target variable (y):

Model Building: We constructed a neural network model using Keras with multiple hidden layers and different activation functions (relu, tanh, sigmoid, linear) to predict the 'Credit Limit'. We compiled the model using the Adam optimizer and Mean Squared Error (MSE) as the loss function. We then trained the model on the training data (X_train, y_train) for 80 epochs with a batch size of 32.
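The model described above can be sketched with the Keras Sequential API. This is a sketch under assumptions, not the project's actual network: the layer widths, the order of the layers, the number of input features, and the helper name `build_model` are hypothetical. Only the activation functions, the Adam optimizer, the MSE loss, the 80 epochs, and the batch size of 32 come from the text.

```python
from tensorflow import keras

def build_model(n_features: int) -> keras.Model:
    """Regression network with relu, tanh, sigmoid and linear layers."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),     # width assumed
        keras.layers.Dense(32, activation="tanh"),     # width assumed
        keras.layers.Dense(16, activation="sigmoid"),  # width assumed
        keras.layers.Dense(1, activation="linear"),    # one output: Credit_Limit
    ])
    # Adam optimizer with Mean Squared Error loss, as stated in the text.
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_model(n_features=5)  # feature count assumed

# Training call as described (X_train and y_train come from the split):
# history = model.fit(X_train, y_train, epochs=80, batch_size=32)
```

The linear activation on the last layer leaves the output unbounded, which suits a continuous target such as a credit limit.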
Model performance assessment

The Mean Squared Error (MSE) is roughly 17,689,536, meaning that the average squared difference between predicted and actual values is around 17,689,536. Lower MSE values indicate better model performance, but the magnitude of the target variable must be considered when judging this number. The Root Mean Squared Error (RMSE) is easier to interpret because it expresses the typical error magnitude in the same units as the target variable. Here, the RMSE indicates that the model's 'Credit Limit' predictions are off by around 4205.89 on average. The R-squared value, about 0.778, represents the proportion of variation in the target variable that is explained by the features: the model explains roughly 77.8% of the variability in 'Credit Limit'.

Overall, the model appears to give an acceptable fit to the data, accounting for a significant percentage of the variance in the 'Credit Limit' and producing predictions with a reasonably low error. However, when interpreting model performance, domain knowledge and problem-specific criteria should be considered (Appendix 2).
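The three metrics are linked by simple formulas: RMSE is the square root of MSE, and R-squared compares the model's squared error with the error of always predicting the mean. The sketch below uses plain Python; the function names and the four-point example are assumptions for illustration, while the MSE value is the one reported above.

```python
from math import sqrt

def mse(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# RMSE from the reported MSE, in the same units as Credit_Limit.
reported_mse = 17_689_536
reported_rmse = sqrt(reported_mse)
print(f"RMSE = {reported_rmse:.2f}")

# Hypothetical credit limits to exercise the functions.
actual = [2000, 5000, 9000, 12000]
predicted = [2500, 4500, 9500, 11500]
print(mse(actual, predicted), r_squared(actual, predicted))
```

The square root of 17,689,536 is about 4205.89, which agrees with the RMSE quoted above.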
Prediction

The above figure shows the predicted values of the target "Credit Limit" against the actual values. The red data points are the actual values, and the blue data points are the predicted values. The visualization suggests that the model predicts reasonably well, but there is still scope for improvement. The prediction result is exported to an Excel sheet named "predicted_values.xlsx."
The above figure depicts the weight change during the training of the prediction model.
Discussion

The created model has the potential to greatly enhance strategic decision-making and credit risk assessment in the financial domain. Thanks to its interpretability, financial institutions can gain useful insights into the aspects that have the biggest impact on credit limits and make better judgements, which gives them a distinct advantage. Insights gained from this project can also be used to improve and optimize the model, which will result in forecasts that are even more accurate and a deeper comprehension of the creditworthiness of the clientele. All things considered, the model can support responsible lending practices and help customize credit offerings, which will ultimately help create a more stable and long-lasting financial environment.

Additional Considerations

It is critical to recognize that the completeness and quality of the training data may have an impact on the model's performance. Thus, to ensure the model's continued relevance and accuracy, its performance must be continuously observed and assessed over time. Additionally, the model must be used in conjunction with other risk assessment instruments and procedures in order to mitigate potential risks and make well-informed credit decisions.

Appendix 1 (Data Exploration and Preprocessing)

Data exploration_Preprocessing.ipynb - coding for data cleaning and preprocessing
https://drive.google.com/file/d/1qKaJiqqZoejh-NTrbrFptLc9e4QQWbNT/view?usp=sharing
Appendix 2 (Prediction model, evaluation, and result)

Predicting_Credit_Limit.ipynb - Python code for Descriptive Findings, Prediction model, evaluation, and result.
https://drive.google.com/file/d/1d6anmTZSZseq2IFDTZGQbHcHX_Z30Bdo/view?usp=sharing

predicted_values.xlsx - result of the prediction stored in an Excel file.
https://docs.google.com/spreadsheets/d/1wsh90ZfsSetdWGnRi8TTwg61jLsnZ7w7/edit?usp=sharing&ouid=114131992989484199063&rtpof=true&sd=true