BAN210 Mid-term v1_ June2023
School
Seneca College
Course
210
Subject
Mathematics
Date
Feb 20, 2024
SENECA COLLEGE OF APPLIED ARTS AND TECHNOLOGY SENECA BUSINESS
Version A
DATE: 6/21/2023
TIME ALLOWED: 3 hours
PROFESSOR(S): Boire - Version A of exam
Allowable Examination Aids: (check applicable boxes)
☒ Calculators (non-programmable only)
☒ Math Tables (normal distribution table)
☒ Periodic Tables
☒ Formula Sheets (attached)
☒ Textbooks
☒ Probability Tables
☒ Dictionary
☒ Notes
☐ Other
Answers to be completed on:
☒ Exam Booklet
☐ GradeMaster Card
☐ Exam Paper
TOTAL MARKS: 100
WEIGHTED VALUE: 25%
INSTRUCTIONS: Academic Integrity Policy. Seneca upholds a learning community that values academic integrity, honesty, fairness, trust, respect, responsibility and courage. These values enhance Seneca’s commitment to students by delivering high-quality education and teaching excellence, while supporting a positive learning environment. The AI policy is always in effect. Note Sections 2.3 and 2.4:
“…2.3 Should there be a suspected violation of this policy (e.g.…cheating, falsification, impersonation or plagiarism), the academic integrity sanctions will be applied according to the severity of the offence committed. Refer to Appendix B for the academic integrity sanctions.
2.4 Should a suspected violation of this policy be a result of, or in combination with, a suspected violation of Seneca’s Student Code of Conduct and/or another non-academic-related Seneca policy, the matter will be investigated and adjudicated through the process found in the Student Code of Conduct.”
TO BE COMPLETED BY STUDENT
SUBJECT SECTION NUMBER (e.g. QNM223 AA): BAN210
STUDENT NAME: Shubham Hiteshkumar Jethwa
STUDENT NUMBER: 157367210
STUDENT SIGNATURE: Shubham Jethwa
APPROVED BY: ________________________________________________________
Cristina Italia, Interim Chair
School of Management and Entrepreneurship
DATE: June 21/2023
778c39b463d9bdea26e5aa35964090f9cbad73d6.docx Page 1
of 7 Version A
Mid-Term Exam-BAN210-Predictive Analytics-Winter2023
Instructions: -Version A
a) You will have 3 hours to complete the exam from the time that the class starts
b) The exam will be conducted during class time
c) Answer each of the 10 questions below, providing short written answers. Question 11 is a bonus question worth 5 marks
10 marks
1)
A) Why would a logistic model not be used to predict customer spend?
B) Why can a multiple regression model be used to predict a probability outcome such as response?
:- Response can be coded as a 0/1 flag, and a multiple regression model fitted to that flag produces an estimate that can be read as the probability of response, although the estimate is not constrained to lie between 0 and 1.
C) Why is decision tree output easier to explain than logistic model output?
:- Decision tree output is generally easier to explain because a tree reads as a sequence of if-then rules on the original variables, whereas logistic model output has to be interpreted through coefficients on the log-odds scale.
D) What is one disadvantage of a deep learning model over logistic regression?
:- One disadvantage of a deep learning model compared to logistic regression is its increased complexity and resource requirements: it needs more data and computing power to train, and its output is harder to explain.
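As a small illustration of parts A and B, the sketch below compares a raw linear score with the same score passed through the logistic function. The intercept, slope and input values are hypothetical and chosen only for illustration. The logistic output always stays strictly between 0 and 1, which suits a probability but not a dollar amount, while the linear score can fall below 0 or above 1.

```python
import math

def logistic(z: float) -> float:
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and slope, not taken from any fitted model.
INTERCEPT, SLOPE = -0.2, 0.15

def linear_score(x: float) -> float:
    """Unbounded linear prediction, as a linear regression would produce."""
    return INTERCEPT + SLOPE * x

inputs = [-5, 0, 5, 10]
scores = [linear_score(x) for x in inputs]
probs = [logistic(s) for s in scores]

for x, s, p in zip(inputs, scores, probs):
    print(f"x={x:>3}  linear={s:6.2f}  logistic={p:5.3f}")
```

A linear model fitted to a 0/1 response flag can therefore still rank customers by likelihood of response, but some of its estimates may need to be capped before they are reported as probabilities.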
10 marks
2)
A) A person makes 50 purchases of bread and 100 purchases of milk. Provide a scenario where both milk and bread purchases are dependent.
- For every purchase of milk, there will be more than a 33% chance that he/she will purchase bread
B) In purchasing Amazon books, what would be the direction of the correlation coefficient if both romance and history novels were presented to the potential buyer?
- The correlation coefficient will be positive (above 0)
C) What might the direction of the correlation coefficient be if comedy films were never recommended to the viewer who viewed horror films?
- The correlation coefficient will be around 0.
D) How might you determine the optimum # of clusters?
- By using the elbow method: plot the within-cluster variation against the number of clusters and pick the point (the elbow) where adding another cluster stops reducing the variation sharply and the curve flattens into a roughly linear decline.
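The elbow idea in part D can be sketched with a very small one-dimensional k-means written in plain Python. The spend values, the deterministic starting centroids and the function name `kmeans_1d` are assumptions made for this illustration; a real exercise would normally use a statistics package and several variables.

```python
from statistics import mean

def kmeans_1d(values, k, iterations=50):
    """Very small 1-D k-means; returns (centroids, within-cluster sum of squares)."""
    data = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    wcss = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, wcss

# Hypothetical annual spend values that form three visible groups.
spend = [10, 11, 12, 13, 50, 51, 52, 53, 90, 91, 92, 93]
wcss_by_k = {k: kmeans_1d(spend, k)[1] for k in range(1, 7)}

for k, wcss in wcss_by_k.items():
    print(f"k={k}  within-cluster sum of squares={wcss:.2f}")
```

With data like this, the within-cluster variation falls steeply up to three clusters and only marginally afterwards, so the elbow suggests three clusters.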
10 marks
3)
A) You have 300 variables. Describe in 4-5 sentences how you would use factor analysis to reduce your number of variables to 40.
B) After using factor analysis to arrive at your 40 variables, how would you use clustering to develop unique customer segments and how would this be presented to the business?
C) What assessment metric might have more cost in a fraud model than a response model and why?
D) What is a gains chart or decile chart attempting to depict in assessing overall models?
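For part D, a gains (decile) table can be sketched as follows: records are ranked by model score, split into ten equal groups, and the response rate and cumulative share of responders captured are reported for each group. The scores, outcomes and the function name `gains_table` are hypothetical and exist only for this illustration.

```python
def gains_table(scores, outcomes, bins=10):
    """Rank records by model score and report cumulative capture of responders."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total_responders = sum(outcomes)
    size = len(ranked) // bins  # assumes the record count divides evenly into bins
    rows, captured = [], 0
    for b in range(bins):
        chunk = ranked[b * size:(b + 1) * size]
        responders = sum(outcome for _, outcome in chunk)
        captured += responders
        rows.append({
            "decile": b + 1,
            "response_rate": responders / size,
            "cumulative_pct_of_responders": captured / total_responders,
        })
    return rows

# Hypothetical model scores (highest first) and actual 0/1 responses.
scores = [i / 20 for i in range(20, 0, -1)]
outcomes = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

rows = gains_table(scores, outcomes)
for row in rows:
    print(row["decile"], f'{row["response_rate"]:.2f}',
          f'{row["cumulative_pct_of_responders"]:.2f}')
```

A model that ranks well concentrates responders in the top deciles, so the cumulative share climbs quickly at first and then levels off.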
10 marks
4) Two data scientists decide to build a targeted response tool. There are four options:
-Linear Regression model :- Logistic regression output is easier to explain than neural net and is more granular than CHAID
-Deep Learning Model
:- A neural net works best with large volumes of data, particularly where the relationships are non-linear, as long as there is a substantial signal-to-noise ratio.
-Discriminant Analysis model
-CHAID :- CHAID is the easiest to explain relative to the other options and is more machine intensive in terms of learning relative to logistic regression modelling.
For each above option, describe one advantage relative to the other options
10 marks
5)
A) You have three datasets (a customer file, a purchase file, and a promotion file). Describe 3 tasks which should occur in creating a proper input file for modelling.
- First, we need to perform a data audit, where we read each file and run data diagnostics and frequency distributions
- Then data standardization, where we link the files together
- Finally, we create the analytical file (source vs derived variables)
B) Why is normalizing or statistically standardizing all data required in all cluster exercises?
- Data normalization is important to make sure that the data is on a consistent scale. This is especially needed when the different variables have different scales or units.
- By standardizing the data, we remove the effects of these different scales or units, so that no single variable dominates the distance calculation and the records can be compared more accurately.
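The effect described in part B can be seen in a short sketch. The income and credit score values are hypothetical, and z-score standardization is used here as one common choice among several scaling methods.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def distance(a, b):
    """Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical customers: income in dollars, credit score in points.
income = [25_000, 40_000, 60_000, 100_000]
credit = [600, 700, 800, 900]

raw = list(zip(income, credit))
scaled = list(zip(z_scores(income), z_scores(credit)))

# On raw data the income gap swamps the credit score gap.
print("raw distance:   ", round(distance(raw[0], raw[1]), 2))
# After standardizing, both variables contribute on a comparable scale.
print("scaled distance:", round(distance(scaled[0], scaled[1]), 2))
```

Because clustering relies on distances between records, the raw version would in effect cluster on income alone.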
10 marks
6) A)Create a confusion matrix where the accuracy is 80%. You have 100 records and your target variable is response. There are 20 responders in the file.
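One possible matrix for part A can be checked with a few lines of code. The cell counts below are made up to satisfy the stated constraints (100 records, 20 responders, 80% accuracy); many other combinations would work equally well.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all records that the model classifies correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical cell counts, one of many valid answers.
tp, fn = 10, 10   # the 20 actual responders: predicted yes / predicted no
fp, tn = 10, 70   # the 80 actual non-responders: predicted yes / predicted no

print(f"{'':<22}{'Predicted responder':<22}{'Predicted non-responder'}")
print(f"{'Actual responder':<22}{tp:<22}{fn}")
print(f"{'Actual non-responder':<22}{fp:<22}{tn}")
print("accuracy:", accuracy(tp, fp, fn, tn))
```

Note that a model predicting non-response for every record would also reach 80% accuracy here, which is why accuracy alone is a weak measure when responders are rare.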
B) What are the 4 components of the confusion matrix?
10 marks
7)
A) What are 3 stages of the CRISP method?
Three stages of the CRISP method are:
1) Business Understanding, 2) Data Understanding and 3) Data Preparation
B) What are the three main deliverables from a data audit used for data analysis purposes?
The 3 main deliverables from a data audit are:
- Loading of Data
- Evaluation of Initial Data diagnostics
- Frequency Distribution
10 marks
8)
You have the following metadata: 100,000 records
Gender field containing the values ‘M’, ‘F’ or missing (50,000 records have missing gender)
Credit score with values ranging from 600 to 900
Address which contains postal code (missing values comprise 30%)
Income ranging from 25K to 100K with no missing values
What are 5 derived variables or 5 features that could be engineered?
- Male binary variable (0/1) will be based on gender outcome ‘M’
- Female binary variable (0/1) will be based on gender outcome ‘F’
- Binary variable (0/1) will be based on credit score being above a certain value
- Binary regional (0/1) variables will be based on the first character of the postal code
- Binary Apartment dweller(0/1) will be based on finding unit number in address field
- Income ordinal field (1,2,3) will be based on different income levels
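Several of the derived variables listed above can be sketched in code. The field names, the credit score cutoff of 750, the income band boundaries and the extra `gender_missing` flag are all assumptions made for this illustration; the exam does not specify them. The region is taken from the first character of the postal code, which in Canada is a letter.

```python
def engineer(record, credit_cutoff=750):
    """Derive model-ready features from one raw customer record (a dict)."""
    gender = record.get("gender")        # 'M', 'F', or None when missing
    postal = record.get("postal_code")   # None when missing
    income = record["income"]
    return {
        "is_male": int(gender == "M"),
        "is_female": int(gender == "F"),
        # Extra flag (an assumption): half of the records have no gender.
        "gender_missing": int(gender is None),
        "high_credit": int(record["credit_score"] >= credit_cutoff),
        "region": postal[0].upper() if postal else "unknown",
        # Hypothetical income bands: under 50K, 50K to under 75K, 75K and up.
        "income_band": 1 if income < 50_000 else 2 if income < 75_000 else 3,
    }

# Two hypothetical customers, one with missing gender and postal code.
customers = [
    {"gender": "M", "credit_score": 780, "postal_code": "M5V 2T6", "income": 82_000},
    {"gender": None, "credit_score": 640, "postal_code": None, "income": 30_000},
]
features = [engineer(c) for c in customers]
for row in features:
    print(row)
```

The region value would still need to be expanded into one 0/1 column per region before modelling, as the answer describes.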
15 marks
9)
Describe what is happening in the cases below.
A) An RRSP model is developed from a random sample of a bank’s customers and is applied to customers over the age of 55 and performs miserably. Why?
- There is tremendous sampling bias: the development sample was wrong for this use and should have been filtered to include only people over 55
B) A deep learning model performs better than a logistic model in predicting response. Why?
- The deep learning model can capture non-linear relationships in large volumes of data, whereas the logistic model is essentially a linear regression with an exponential transformation applied to its final output.
C) What are two user parameters that can be used in altering the shape of a decision tree?
- The minimum sample size on a node and the statistical threshold for a split
D) A decision tree model is used over a logistic regression despite the fact that both models perform equally well. Why?
- The decision tree is easy to explain and exhibits more flexibility when it comes to the distribution of the data
E) What can practitioners do to reduce (Bias) and reduce (Error) in the models?
- Cross Validation
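A minimal sketch of how k-fold cross validation splits the data is shown below. The function name `k_fold_indices` is hypothetical, and the records are not shuffled here, which a real exercise would normally do first. Each record is held out for validation exactly once, and the model is refitted on the remaining folds each time.

```python
def k_fold_indices(n_records, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Averaging the validation results over the folds gives a more stable estimate of model error than a single train/validation split.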
5 marks
10) A) What is the difference between supervised and unsupervised learning?
- The main difference lies in the presence or absence of labeled data during the learning process: supervised learning is trained against a known target variable, while unsupervised learning has no target and looks for structure in the data.
B) Provide one example of supervised learning and one example of unsupervised learning
- Supervised Learning : Decision tree
- Unsupervised Learning : Clustering
5 marks(bonus question)
11) Answer the following:
The profile of a responder is young, living in Ontario, and male. Using the index approach, create a table that would depict this profile. In this table, both education and household size are variables that did not make it into the model. Make up the numbers to support your results.
                 responder   non-responder
age              Young       old
Location         Ontario     British Columbia
Gender           Male        Female
education        0.6         0.61
household size   3           3.2
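One way to put numbers behind such a profile is sketched below, assuming the index is defined as the percentage of responders in a category divided by the percentage of all customers in that category, times 100. All percentages are made up, as the question allows, and the category labels are hypothetical.

```python
def index(responder_pct, overall_pct):
    """Index of 100 = same as the overall base; above 100 = over-represented."""
    return round(100 * responder_pct / overall_pct)

# Hypothetical percentages: (% of responders, % of all customers) per category.
profile = {
    "Age: young": (60, 40),
    "Location: Ontario": (55, 38),
    "Gender: male": (65, 50),
    "Education: university": (30, 30),
    "Household size: 3+": (42, 41),
}

print(f"{'Category':<24}{'Resp %':>8}{'All %':>8}{'Index':>8}")
for name, (resp, overall) in profile.items():
    print(f"{name:<24}{resp:>8}{overall:>8}{index(resp, overall):>8}")
```

In this made-up table, age, location and gender index well above 100, while education and household size sit close to 100, which is consistent with those two variables not making it into the model.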