School: Western Michigan University
Course: 4640
Subject: Statistics
Date: Feb 20, 2024
Assignment 1
See E-learning for the due date
Introduction
This assignment is designed for you to practice several activities in the six phases of CRISP-DM using a semi-real-life data set. ABC Car Insurance Inc. sells car insurance. The dataset (audit – spring 2024.csv) you will be working on includes approximately 2000 car insurance claims. Usually when a car accident happens, ABC sends a claim adjuster to inspect the damage and interview the claimant and other related people to determine the extent of the company’s liability. As you can see in the dataset, a lot of information has been collected regarding each claim. Our goal in this project is to identify the variables that are relevant to predicting whether a claim will be adjusted or not. We will do this by creating a classification tree model and using it to predict new claims.
The following variables are available in the dataset.
ID: record ID (Integer; positive numbers only)
Age: the client’s age (integer; positive numbers between 0 and 100)
Employment: types of the client’s employment. All values are already valid. Valid values are the following:
Professional
Clerical
Repair
Service
Sales
Machinist
Transport
NA
Cleaner
Farming
Support
Protective
Home
Military
Education: (Valid values: all data values in this column are already valid.)
Marital: (Valid values: Unmarried, Widowed, Divorced, Married, Absent, and Married-spouse-absent)
Occupation: all data values in this column are already valid. Valid values are the following:
1. Executive
2. Professional
3. Clerical
4. Repair
5. Service
6. Sales
7. Machinist
8. Transport
9. NA
10. Cleaner
11. Farming
12. Support
13. Protective
14. Home
15. Military
Income: (real; positive numbers only)
Gender: valid values are Male and Female.
Deduction: (real; positive numbers only)
Hours: (integer; positive numbers only)
IGNORE_Accounts: (text; valid countries or regions only)
RISK_Adjustment: (integer; valid numbers only)
TARGET_Adjusted: whether the reported damage is adjusted. (polynominal; 0 = NO, 1 = YES)
(Note: the data type of this column should be changed to polynominal during data import.)
You will document your analysis with screenshots in Assignment 1.docx and submit it along with the exported process file. You will be asked to create screenshots of your Rapidminer Process View and the result in this Word document. If you are not familiar with how to capture screenshots, please Google to train yourself. There are plenty of tutorials that show you how it is done in your favorite operating system. Please pay attention to the box below. All these requirements are enforced for each assignment, including this one.
Note:
All analytical steps below must be completed in Rapidminer. Do not use Excel or other programs unless it is specifically noted.
All screenshots must be legible. The font displayed in the screenshots should be similar in size to the font used in your Word document. Use Word’s cropping capabilities to remove unnecessary parts of screenshots. Points will be deducted if this is not followed or if I have a difficult time viewing your screenshots.
Section 1: Decision Tree
Note: I would like to keep the amount of your work reasonable. Therefore, we will not be engaged in all the activities in each of the CRISP-DM phases, but we will cover some essential ones.
Note: Please clearly show the CRISP-DM phases as your section headings in the Word file.
Phase 1: Business understanding
We will assume the activities of this step have been completed for you, since I provided you with the data set and defined the business problem (i.e., to model claim adjustment).
Q1: In your Word document, please briefly describe what we are trying to do in this assignment.
Phase 2: Data understanding
Data is already provided to you, but you need to describe the data and verify data quality. We will focus on just the data quality here for this assignment. Import the dataset into Rapidminer. While importing the data, make sure the data type of TARGET_Adjusted is categorical. Why? Frequently people encode data values into the numeric form, such as 0 = false, 1 = true. TARGET_Adjusted is like that. Since this variable is a categorical variable, we will have to make sure it is understood as such in Rapidminer by setting it as a polynominal type.
After the data import, drag the dataset to the Process View in Rapidminer and look at both the Data and Statistics tabs in the result.
Q2: Write a paragraph or two with a screenshot of the Statistics tab in the result to report what you see regarding the following issues. Highlight these issues in your Statistics screenshot.
1. Missing values. Report the number of missing values for each variable.
2. Strange values. Go through each column and identify those values that are apparently wrong for the column. A strange value could be something outrageous or something that doesn’t make sense (e.g., Salary < 0 or Race = “Alien from Mars”), something that is not written in the same format as the other values in the same column, etc. Explain why they are wrong. Use bullet points in your answer (one bullet point per column). See the variable explanation on the first page to help judge strange values.
3. Strange data types. Review the slides about Rapidminer data types. Take a close look at the data type of each column. Identify strange data types and explain why they are strange. There may or may not be anything strange.
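The graded analysis still has to be done in Rapidminer, but the first two checks above (missing values and strange values) can be sketched in plain Python as a cross-check. This is only an illustration under stated assumptions: the four rows are invented and are not taken from the assignment file, the helper names `count_missing` and `find_strange_values` are hypothetical, and a missing value is assumed to appear as an empty cell.

```python
# Invented sample rows for illustration only; NOT from the assignment dataset.
rows = [
    {"ID": "1", "Age": "38", "Gender": "Female", "Income": "81838.0", "Marital": "Unmarried"},
    {"ID": "2", "Age": "-5", "Gender": "Male", "Income": "", "Marital": "Married"},
    {"ID": "3", "Age": "45", "Gender": "M", "Income": "27332.5", "Marital": "Divorced"},
    {"ID": "4", "Age": "", "Gender": "Male", "Income": "-100.0", "Marital": "Single"},
]

# Valid values as listed in the variable descriptions of this handout.
VALID_GENDER = {"Male", "Female"}
VALID_MARITAL = {"Unmarried", "Widowed", "Divorced", "Married", "Absent",
                 "Married-spouse-absent"}


def count_missing(rows):
    """Check 1: count empty cells per column (assumes missing = empty string)."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value.strip() == "":
                counts[column] += 1
    return counts


def find_strange_values(rows):
    """Check 2: return (ID, column, value) for cells that break the stated rules."""
    problems = []
    for row in rows:
        age = row["Age"].strip()
        # Age is described as a positive integer no larger than 100.
        if age and not 0 < int(age) <= 100:
            problems.append((row["ID"], "Age", age))
        income = row["Income"].strip()
        # Income is described as a positive real number.
        if income and float(income) <= 0:
            problems.append((row["ID"], "Income", income))
        for column, valid in (("Gender", VALID_GENDER), ("Marital", VALID_MARITAL)):
            value = row[column].strip()
            if value and value not in valid:
                problems.append((row["ID"], column, value))
    return problems


print(count_missing(rows))
print(find_strange_values(rows))
```

Missing cells are skipped by the strange-value check on purpose, so that each problem is reported under only one of the two issues.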
Phase 3: Data preparation
Do the following in the order shown to clean the data:
1. Records with invalid values should be filtered out. Refer to p. 1 for valid values of each variable. Make sure you check all columns for invalid values.
2. Perform mode substitution on Education and Employment.
3. Records with NA’s should be removed.
4. Make sure the above three steps are correctly done before you proceed to the remaining steps. Double-check the results to be sure.
5. Perform listwise removal.
6. Select all columns except ID, IGNORE_Accounts and RISK_Adjustment.
7. Q3: Show a screenshot of the resulting data. Discuss if the issues in phase 2 have been resolved.
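To make the logic of mode substitution, listwise removal and column selection concrete, here is a minimal plain-Python sketch of steps 2, 5 and 6. It is not a replacement for the Rapidminer operators you must use. The rows are invented, the function names are hypothetical, and a missing value is assumed to be an empty string; steps 1 and 3 are left out because they depend on what the real data contain.

```python
from collections import Counter

# Invented rows for illustration only; NOT from the assignment dataset.
rows = [
    {"ID": "1", "Education": "Bachelor", "Employment": "Private", "Income": "50000"},
    {"ID": "2", "Education": "", "Employment": "Private", "Income": "42000"},
    {"ID": "3", "Education": "Bachelor", "Employment": "", "Income": ""},
    {"ID": "4", "Education": "Master", "Employment": "Consultant", "Income": "61000"},
]


def mode_substitute(rows, column):
    """Step 2: fill empty cells in `column` with its most frequent observed value."""
    observed = [row[column] for row in rows if row[column] != ""]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**row, column: row[column] or mode} for row in rows]


def listwise_remove(rows):
    """Step 5: drop every record that still has at least one empty cell."""
    return [row for row in rows if all(value != "" for value in row.values())]


def select_columns(rows, drop):
    """Step 6: keep all columns except those named in `drop`."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Apply the steps in the same order as the list above.
step2 = mode_substitute(mode_substitute(rows, "Education"), "Employment")
step5 = listwise_remove(step2)
cleaned = select_columns(step5, {"ID"})

for row in cleaned:
    print(row)
```

In this toy data, record 2 survives because its missing Education is filled with the mode (Bachelor), while record 3 is still dropped by listwise removal because its Income is missing and Income is not mode-substituted. This is why the order of the steps matters.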
Phase 4: Modeling
1. Since you are building a model to predict whether a claim is adjusted or not, we will have the classification tree predict the TARGET_Adjusted variable. Use the Set Role operator for this.
2. You will split the data into 70/30 in Rapidminer. Note that the data is not pre-split for you like we did in class for the golf-playing dataset. You will have to do it yourself using the Split Data operator.
3. After the split, you should have two datasets (70% of the original and 30% of the original). The 70% will be used to train the model and the 30% will be used to test the model.
4. Use gain_ratio as the purity function. Perform both pre-pruning and post-pruning using the default setting.
5. Also answer the following questions:
a. Q4: Show a screenshot of your design, the final decision tree and confusion matrix. If the graphic tree is too big, you may choose to show the text tree. Just click the Description button on the left side of the Tree tab in the result.
b. Q5: Is the accuracy of your model acceptable? Show all steps to calculate the accuracy. (Let’s consider Accuracy >= 75% to be acceptable for this assignment. 75% is no magic number. The ideal cutoff in real life depends on the context and the application. More on this in class.)
c. Q6: Explain what each number in class recall and class precision means. Show steps to calculate each recall and precision. Don’t just give me the definition of class recall and class precision. I would like you to explain each number in a way that someone with no training in analytics would understand. See my tutorial Decision Tree – Cross Validation.pdf in the shared folder for how to interpret class recall and class precision.
d. Q7: Your manager suggests using your model to predict those claims that are adjusted. Is it a good idea? Why (Why not)? Answer this question based on the confusion matrix of your model.
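The arithmetic behind accuracy, class recall and class precision can be sketched as follows. The counts are invented for illustration (600 test records, roughly 30% of 2000) and are not results from the assignment data; your own numbers must come from the confusion matrix Rapidminer produces. The function names are hypothetical.

```python
# Hypothetical confusion matrix; keys are (true class, predicted class).
# These counts are invented and are NOT results from the assignment data.
matrix = {
    ("0", "0"): 420,  # truly NO,  predicted NO
    ("0", "1"): 30,   # truly NO,  predicted YES
    ("1", "0"): 90,   # truly YES, predicted NO
    ("1", "1"): 60,   # truly YES, predicted YES
}


def accuracy(matrix):
    """Correct predictions divided by all predictions."""
    correct = sum(n for (true, pred), n in matrix.items() if true == pred)
    return correct / sum(matrix.values())


def class_recall(matrix, cls):
    """Of the records that truly belong to `cls`, the share the model found."""
    actual = sum(n for (true, _), n in matrix.items() if true == cls)
    return matrix[(cls, cls)] / actual


def class_precision(matrix, cls):
    """Of the records the model labeled `cls`, the share that truly are `cls`."""
    predicted = sum(n for (_, pred), n in matrix.items() if pred == cls)
    return matrix[(cls, cls)] / predicted


print(f"accuracy      = {accuracy(matrix):.3f}")
for cls in ("0", "1"):
    print(f"recall({cls})     = {class_recall(matrix, cls):.3f}")
    print(f"precision({cls})  = {class_precision(matrix, cls):.3f}")
```

With these invented counts, accuracy is (420 + 60) / 600 = 80%, which passes the 75% cutoff. Class recall for YES is only 60 / 150 = 40%, meaning the model catches 4 out of every 10 claims that really were adjusted, and class precision for YES is 60 / 90, about 67%, meaning that 2 out of every 3 claims the model flags as adjusted really are. A model can therefore look acceptable on accuracy while being weak on the class the manager cares about.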
Phase 5: Evaluation
Answer the following questions:
e. Q8: What are the top three predictors? How do you know?
f. Q9: What strategy do you recommend to top management based on the findings of your Classification Tree model? (Hint: craft your strategy based on the results of your model.)
g. Q10: Here we will use your model to perform a simple prediction on one single insurance claim (see the claim information below). What is the result of the prediction? Explain the logic of how you arrived at the conclusion. Pay attention to our discussion in class.
i. Age: 50
ii. Gender: Male
iii. Income: 50000
iv. Education: Bachelor
v. Marital: Married
vi. Hours: 30
vii. Deduction: 600
viii. Employment: Private
ix. Occupation: Professional
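The logic of reading a prediction off a tree, as asked in Q10, is to start at the root and follow the one branch that matches the claim at each split until a leaf is reached. A minimal sketch is shown below. The tree in it is entirely made up: its attributes, threshold and leaf labels are assumptions for illustration and do not come from any trained model, so its output says nothing about the correct answer for your own tree. The `predict` function and the node layout are also hypothetical.

```python
# A MADE-UP tree for illustration; splits, threshold and labels are invented.
# Nominal nodes have "children"; numeric nodes have "threshold", "gt" and "le".
tree = {
    "attribute": "Marital",
    "children": {
        "Married": {
            "attribute": "Age",
            "threshold": 35,
            "gt": {"label": "1"},
            "le": {"label": "0"},
        },
        "Unmarried": {"label": "0"},
    },
}

# The claim from Q10.
claim = {
    "Age": 50, "Gender": "Male", "Income": 50000, "Education": "Bachelor",
    "Marital": "Married", "Hours": 30, "Deduction": 600,
    "Employment": "Private", "Occupation": "Professional",
}


def predict(node, record):
    """Walk from the root to a leaf; return the leaf label and the path taken."""
    path = []
    while "label" not in node:
        attribute = node["attribute"]
        value = record[attribute]
        if "threshold" in node:
            threshold = node["threshold"]
            if value > threshold:
                path.append(f"{attribute} > {threshold}")
                node = node["gt"]
            else:
                path.append(f"{attribute} <= {threshold}")
                node = node["le"]
        else:
            path.append(f"{attribute} = {value}")
            # Raises KeyError if the tree has no branch for this value.
            node = node["children"][value]
    return node["label"], path


label, path = predict(tree, claim)
print(" -> ".join(path), "=> predicted TARGET_Adjusted =", label)
```

Note that only the attributes that appear on the path are used. The other fields of the claim play no role in the prediction, which is worth stating when you explain your reasoning.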
What and where to submit
1. Assignment 1.docx. Double-check the following:
a. All questions are clearly answered.
b. All screenshots are legible. Unnecessary parts are cropped out.
c. Make sure all required screenshots are included as well.
d. Section headings (CRISP-DM phases) and question numbers are clearly shown.
e. Question number is clearly shown for each of your answers.
2. The exported Rapidminer process file (click File | Export Process).
3. Both files should be submitted to e-learning before the deadline.
4. Double-check the following. Don’t lose points because of these.
a. Did you include all the required files?
b. Did you crop the screenshots to only show the part relevant to the answer? Don’t give me the screenshot of the whole Rapidminer screen.
c. Did you label your answers with the question numbers?
d. Is your font too small? Anything less than 11 point is too small.