MIS 637 B Midterm: Data Analytics & Machine Learning
October 31, 2023
School of Business, Stevens Institute of Technology
Professor M. Daneshmand
Student Name: Saumil Trivedi
CWID: 20011349

Question: A bank would like to decide among four interest rates for new loan applicants: 7% for "high risk" applicants, 5% for "average risk" applicants, 3% for "low risk" applicants, and 2% for "no risk" applicants. You are being asked to lead this project. Provide a comprehensive end-to-end plan for this project, including all the necessary steps from beginning to end. Make any necessary assumptions and define your notation. Give a comprehensive description of the algorithm(s) and the related formulas you will use (this is the fundamental part of your role in this project), and a detailed description of how the algorithm works. Please put your answer in the format of Step 1, Step 2, ...

Answer: The bank has decided to offer four interest-rate tiers based on applicant risk:
1) High-risk applicant: 7%
2) Average-risk applicant: 5%
3) Low-risk applicant: 3%
4) No-risk applicant: 2%

Our data science methodology is based on the CRoss Industry Standard Process for Data Mining (CRISP-DM), which moves through the following six phases:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
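As a concrete anchor for the deployment target, the rate schedule above can be captured as a simple lookup. This is only an illustrative sketch; the names `INTEREST_RATES` and `quote_rate` are invented here, not part of any bank system.

```python
# Hypothetical mapping from a predicted risk category to the quoted
# interest rate, following the four tiers listed above.
INTEREST_RATES = {
    "high risk": 0.07,
    "average risk": 0.05,
    "low risk": 0.03,
    "no risk": 0.02,
}

def quote_rate(risk_category: str) -> float:
    """Return the interest rate for a predicted risk category."""
    key = risk_category.lower()
    if key not in INTEREST_RATES:
        raise ValueError(f"Unknown risk category: {risk_category!r}")
    return INTEREST_RATES[key]
```

A classifier built in the modeling phase would emit one of these four category labels, and the lookup turns that label into the rate offered to the applicant.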
Step 1: Business Understanding

In this phase, we focus on understanding the objectives of the project and how they align with the bank's strategic goals. The primary aim is to categorize new loan applicants by risk profile and assign the corresponding interest rate:
- "High-risk" applicants: 7%
- "Average-risk" applicants: 5%
- "Low-risk" applicants: 3%
- "No-risk" applicants: 2%

Understanding the business impact is crucial: we need to assess how these decisions will affect the bank's profitability, risk management, and customer satisfaction. It is essential to define the project scope, allocate resources, and establish a clear problem statement to guide the data mining effort.

Step 2: Data Understanding

Data is the foundation of our analysis. We will collect historical data on loan applicants, including factors such as credit scores, loan terms, loan types, locations, home prices, loan amounts, and down payments. This data will be instrumental in building the risk assessment model. During this phase, we conduct exploratory data analysis (EDA) to uncover significant patterns, outliers, and potential trends, with a focus on gaining a deep understanding of the quality and characteristics of the data.

Step 3: Data Preparation

- Data Cleaning: address missing values, duplicates, and inconsistencies in the data to ensure data quality.
- Feature Selection: select the features most relevant to assessing risk level. This step involves no specific formulas; it is based on domain knowledge and statistical analysis.
- Outlier Handling: manage outliers (data points significantly different from the rest of the dataset) using the Inter-Quartile Range (IQR) method.
The IQR is calculated as IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles. Data values are considered outliers if they meet either of these conditions:
- Less than Q1 - 1.5 * IQR
- Greater than Q3 + 1.5 * IQR

Data Transformation involves two important techniques:
- Data Standardization: rescales the data to have a mean of 0 and a standard deviation of 1. The formula is:
  Standardization(x) = [x - mean(x)] / [standard deviation(x)]
- Data Normalization: translates the data into a common range, often between 0 and 1. The formula is:
  Normalization(x) = [x - min(x)] / [range(x)], where range(x) = max(x) - min(x)

These steps ensure that the data is clean, standardized, and ready for further analysis.

Step 4: Modeling

In this phase, we select the C4.5 algorithm, a decision-tree-based modeling technique, to build a predictive model that categorizes loan applicants by risk level and hence by the corresponding interest rate. C4.5 is chosen because it employs information gain (entropy reduction) to determine the optimal split at each node of the decision tree.

Explanation of the C4.5 algorithm: consider a variable X with k possible values occurring with probabilities p1, p2, ..., pk. The entropy of X, which measures its uncertainty, is:

H(X) = - sum_{i=1..k} [ pi * log2(pi) ]

To apply entropy in C4.5, we consider a candidate split S that divides the training data T into subsets T1, T2, ..., Tk. The mean information requirement is the weighted sum of the entropies of the subsets, each weighted by the proportion Pi of records falling in that subset. The information gain of the split is then:

Gain(S) = H(T) - sum_{i=1..k} [ Pi * H(Ti) ]

where:
Gain(S) is the information gain for a specific split S, H(T) is the entropy of the original dataset T, Pi is the proportion of records in subset i, and H(Ti) is the entropy of subset Ti created by the split. C4.5 selects the split that maximizes information gain at each decision node, leading to a decision tree that effectively categorizes loan applicants into their respective risk levels, which in turn determine the interest rates applied to them.

Step 5: Evaluation

The evaluation phase is crucial for ensuring the model's accuracy and reliability. We will:
- Assess the model's performance on held-out test data, employing metrics such as accuracy, precision, recall, and F1-score.
- Implement cross-validation techniques to confirm the model's robustness and generalization.
- Compare the C4.5 model's performance against alternative models to ensure it is the most suitable for the task.

Step 6: Deployment

The deployment phase depends on the project's requirements:
- Simple deployment: we provide a report the bank can use to determine the risk level of loan applicants based on the model's output.
- Complex deployment: before deploying the model, we test use cases by observing the model's performance on real data under human supervision, then deploy the trained model to classify loan applications as "High-risk," "Average-risk," "Low-risk," or "No-risk" applicants based on their assessed risk levels.

This project plan follows the CRISP-DM methodology, offering a structured approach to categorizing new loan applicants into risk levels with associated interest rates, ensuring transparency, accuracy, and informed decision-making for the bank.
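The entropy and information-gain calculations at the heart of Step 4 can be sketched in a few lines of plain Python. This is a minimal illustration of the split criterion only: the applicant labels below are invented toy data, and a full C4.5 implementation would also handle gain ratio, continuous attributes, and pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent_labels, subsets):
    """Gain(S) = H(T) - sum(P_i * H(T_i)), where each subset T_i is one
    branch of candidate split S and P_i is its share of the records."""
    n = len(parent_labels)
    weighted = sum((len(s) / n) * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Toy example: 8 applicants' risk labels, split into two branches by a
# hypothetical attribute such as credit-score band.
labels = ["high", "high", "average", "average", "low", "low", "no", "no"]
split = [["high", "high", "average"],
         ["average", "low", "low", "no", "no"]]
print(round(information_gain(labels, split), 3))  # 0.704
```

At each decision node, C4.5 would compute `information_gain` for every candidate attribute split and keep the one with the highest value, recursing on each branch until the applicants in a node share a single risk label.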