MIS 637 B Midterm
Data Analytics & Machine Learning
October 31, 2023
School of Business
Stevens Institute of Technology
Professor M. Daneshmand
Student Name: Saumil Trivedi
CWID: 20011349
A bank would like to be able to decide on 4 interest rates for new loan applicants as follows:
interest rates of 7% for "high risk" applicants, 5% for "average risk" applicants, 3% for
"low risk" applicants, and 2% for "no risk" applicants. You are being asked to lead this
project. Provide a comprehensive end-to-end plan for this project. Include all the
necessary steps from the beginning to the end. Make any necessary assumptions and
define notations. Give a comprehensive description of the algorithm(s) as well as the
related formulas you will use for this project (this is the fundamental part of your role in
this project). Provide a detailed description of the algorithm and how it works. Please
put your answer in the format of Step 1, Step 2 ...
Answer:
The bank has decided to offer four different interest rates, tiered by applicant risk. They are as
follows:
1) Interest rate for high-risk applicants = 7%
2) Interest rate for average-risk applicants = 5%
3) Interest rate for low-risk applicants = 3%
4) Interest rate for no-risk applicants = 2%
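Once an applicant's risk category is known, the pricing rule above is just a lookup table. A minimal sketch (the label strings and function name are illustrative, not from the source):

```python
# Hypothetical mapping from risk category to interest rate (labels are illustrative).
INTEREST_RATES = {"high": 0.07, "average": 0.05, "low": 0.03, "no": 0.02}

def interest_rate(risk_level: str) -> float:
    """Return the interest rate for a given risk category."""
    return INTEREST_RATES[risk_level]
```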
A process model known as the CRoss Industry Standard Process for Data Mining
(CRISP-DM) serves as the basis for our data science methodology. It moves through the
following six phases:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
Step 1: Business Understanding
In this phase, we focus on understanding the objectives of the project and how it aligns
with the bank's strategic goals. The primary aim is to determine interest rates for new
loan applicants and categorize them based on their risk profiles. The interest rates will be
as follows:
"High-risk" applicants: 7%
"Average-risk" applicants: 5%
"Low-risk" applicants: 3%
"No-risk" applicants: 2%
Understanding the business impact is crucial. We need to assess how these decisions will
affect the bank's profitability, risk management, and customer satisfaction. It's essential to
define the project scope, allocate resources, and establish a clear problem statement to
guide our data mining efforts.
Step 2: Data Understanding
Data is the foundation of our analysis. We will collect historical data on loan applicants,
which includes various factors such as credit scores, loan terms, loan types, locations,
home prices, loan amounts, and down payments. This data will be instrumental in
building our risk assessment model.
During this phase, we'll conduct exploratory data analysis (EDA) to uncover patterns and
gain preliminary insights. We'll use techniques to identify significant patterns, outliers,
and potential trends in the data. Our focus is on gaining a deep understanding of the
quality and characteristics of the data.
Step 3: Data Preparation
Data Cleaning: This involves addressing missing values, duplicates, and inconsistencies
in the data to ensure data quality.
Feature Selection: Selecting the most relevant features to impact the assessment of risk
levels. This step doesn't involve specific formulas but is based on domain knowledge and
statistical analysis.
Outlier Handling: Managing outliers (data points significantly different from the rest of
the dataset) using the Inter-Quartile Range (IQR) method. The IQR is calculated as:

IQR = Q3 - Q1

Data values are considered outliers if they meet either of these conditions:
Less than Q1 - 1.5 * IQR
Greater than Q3 + 1.5 * IQR
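The IQR rule above can be sketched in a few lines of Python using only the standard library (the function name is mine; statistics.quantiles with n=4 returns the quartile cut points):

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]
```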
Data Transformation: Involves two important techniques:
Data Standardization: This makes data have a mean of 0 and a standard deviation of 1.
The formula for data standardization is:
Standardization(x) = [x - mean(x)] / [standard deviation(x)]
Data Normalization: This translates data into a common range, often between 0 and 1.
The formula for data normalization is:
Normalization(x) = [x - min(x)] / [max(x) - min(x)]
These steps ensure that the data is clean, standardized, and ready for further analysis.
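Both transformations translate directly from the formulas above; a short sketch with Python's standard library (using the population standard deviation is an assumption):

```python
import statistics

def standardize(xs):
    """Standardization(x) = (x - mean) / stdev, giving mean 0 and stdev 1."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def normalize(xs):
    """Normalization(x) = (x - min) / (max - min), mapping data into [0, 1]."""
    low, rng = min(xs), max(xs) - min(xs)
    return [(x - low) / rng for x in xs]
```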
Step 4: Modeling
In this phase, we select the C4.5 algorithm, a decision tree-based modeling technique, to
build a predictive model for categorizing loan applicants based on their risk levels and the
corresponding interest rates. The C4.5 method is chosen because it employs information
gain and entropy reduction to determine optimal splits in decision trees.
Explanation of C4.5 Algorithm:
The C4.5 method assesses a variable X with k possible values and associated
probabilities (p1, p2, ..., pk). It calculates the entropy of X, which measures its
uncertainty. The entropy formula is as follows:

H(X) = - sum over i of [pi * log2(pi)], for i = 1, ..., k

Here, H(X) represents the entropy of variable X, and pi is the probability of the i-th
possible value.
To apply entropy in C4.5, we consider a candidate split S that divides the training data T
into subsets (T1, T2, ..., Tk). The mean information need is calculated by weighting the
entropies of each subgroup with the percentage of records in each subset (Pi). The
formula for information gain is:

Gain(S) = H(T) - sum over i of [Pi * H(Ti)], for i = 1, ..., k

Where:
Gain(S) is the information gain for a specific split S.
H(T) is the entropy of the original dataset.
Pi represents the proportion of records in subset Ti.
H(Ti) is the entropy of each subset Ti created by the split.
C4.5 selects the split that maximizes information gain at each decision node, leading to
the construction of a decision tree that effectively categorizes loan applicants into their
respective risk levels, which in turn determines the interest rates applied to them.
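The two formulas above translate almost directly into code. A minimal sketch (the function names are mine, not part of C4.5 itself):

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum(pi * log2(pi)) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Gain(S) = H(T) - sum(Pi * H(Ti)) for a split of `parent` into `subsets`."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)
```

C4.5 evaluates every candidate split this way and keeps the one with the largest gain (the full algorithm actually normalizes this into a gain ratio to avoid favoring many-valued splits).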
Step 5: Evaluation
The evaluation phase is crucial for ensuring the model's accuracy and reliability. We will:
Assess the model's performance on held-out test data, employing metrics such as
accuracy, precision, recall, and F1-score.
Implement cross-validation techniques to confirm the model's robustness and
generalization.
Compare the C4.5 model's performance against alternative models to ensure it is the most
suitable for the task.
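With one class treated as "positive," these metrics reduce to a few counts. A sketch (the one-vs-rest framing for this multi-class problem is an assumption):

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1, treating one class as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```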
Step 6: Deployment
The deployment phase depends on the project's requirements:
Simple deployment: We will provide a report for the bank to determine the risk level of
loan applicants, categorizing them as high-risk or low-risk applications based on the
model's output.
Complex deployment: Prior to deploying the model, we will:
Test use cases by observing the model's performance on real data under human
supervision.
Deploy the trained model to classify loan applications into categories, such as 'High-risk
applicant,' 'Average-risk applicant,' 'Low-risk applicant,' or 'No-risk applicant,' based on
their assessed risk levels.
This detailed project plan follows the CRISP-DM methodology, offering a structured
approach to categorize new loan applicants into risk levels with associated interest rates,
ensuring transparency, accuracy, and informed decision-making for the bank.