School
Slippery Rock University of Pennsylvania
Course
MISC
Subject
Industrial Engineering
Date
Dec 6, 2023
Uploaded by DrClover9807
Validation set
data used to tune model parameters (such as hyperparameters) during model development
Testing set
data used to assess the likely future performance of a model
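As a minimal sketch of how the validation and testing sets are kept apart from the training data, the example below shuffles a data set and cuts it into three parts. The function name `split_dataset` and the 60/20/20 proportions are assumptions chosen for illustration, not something these notes specify.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and testing sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation slices becomes the testing set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
```

The model is fit on `train`, tuned against `validation`, and scored once on `test` to estimate future performance.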
Supervised
training data includes both input (x) and result (y)
Unsupervised
the model is NOT provided with the results (y) during training
S
Supervised or Unsupervised?
Classification
S
Supervised or Unsupervised?
Regression
S
Supervised or Unsupervised?
Ranking
U
Supervised or Unsupervised?
Clustering
U
Supervised or Unsupervised?
Co-occurrence grouping/frequent itemset
BOTH
Supervised or Unsupervised?
Data reduction
Cross Industry Standard Process for Data Mining
CRISP-DM
CRISP-DM
process that places a structure on the problem
life cycle of 6 phases
used to maintain reasonable consistency, repeatability, and objectiveness
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
6 phases of CRISP-DM
Data leakage
a variable collected in historical data gives away information about the target variable (information that appears in historical data but is not actually available when the decision has to be made)
imputation
replacing missing data with substituted values estimated from the data set
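One common form of imputation is to fill each gap with the mean of the observed values. The sketch below assumes missing entries are stored as `None`. The helper name `impute_mean` and the sample ages are made up for illustration.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # estimated from the data set itself
    return [fill if v is None else v for v in values]

ages = [20, None, 40, 30, None]  # hypothetical feature with two missing values
imputed = impute_mean(ages)
print(imputed)
```

Mean imputation is only one option. A median or a model-based estimate may suit skewed or strongly correlated features better.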
Categorical/nominal data
data that has two or more categories, but there is no intrinsic ordering to the categories
ordinal data
data that has two or more categories with a clear ordering of the categories
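The nominal/ordinal distinction matters when categories are turned into numbers. A rough sketch, with hypothetical helper names and example categories: nominal values get one indicator per category so no order is implied, while ordinal values can be mapped to their rank.

```python
def one_hot(value, categories):
    """Encode a nominal value as indicator columns (no order implied)."""
    return [1 if value == c else 0 for c in categories]

def ordinal_code(value, ordered_levels):
    """Encode an ordinal value as its position in the ordered list of levels."""
    return ordered_levels.index(value)

colors = ["red", "green", "blue"]      # nominal: no intrinsic ordering
sizes = ["small", "medium", "large"]   # ordinal: small < medium < large

green_encoded = one_hot("green", colors)
large_encoded = ordinal_code("large", sizes)
print(green_encoded, large_encoded)
```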
Box Plot
Histogram
Scatter Plot
3 types of data visualization for numerical attributes
Bar Plot
Dot Plot
Mosaic Plot
3 types of data visualization for categorical attributes
Regression
Company X wants to know how much return on investment it is going to get based on the funds it has
invested in marketing a new product... what type of problem is this?
False
True/False
The difference between supervised and unsupervised learning is that supervised learning has a categorical target variable and unsupervised learning has a numeric target variable
False
True/False
The best way to deal with missing values in a feature is to always remove observations with missing values
True
True/False
When implementing CRISP-DM, a data scientist often needs to go through the process for several iterations
Predictive Model
formula learned from old data for estimating the unknown value of interest for some new data
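To make "formula learned from old data" concrete, the sketch below fits a straight line by ordinary least squares and applies it to a new input. The marketing-spend numbers are invented for illustration, and a straight line is an assumed model form. The names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from old data by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Estimate the unknown value of interest for a new input x."""
    slope, intercept = model
    return slope * x + intercept

# Made-up historical data: marketing spend (x) and observed return (y).
old_spend = [1.0, 2.0, 3.0, 4.0]
old_return = [3.0, 5.0, 7.0, 9.0]

model = fit_line(old_spend, old_return)
estimate = predict(model, 5.0)  # new data point not seen during fitting
print(model, estimate)
```

Because the target here is numeric, this is a regression problem, the same type as the return-on-investment question above.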
pure
certain in the outcome, homogeneous with respect to the target variable
entropy
measures uncertainty/impurity
higher, more
The ____ the entropy value, the ___ uncertain/impure the data is
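A short sketch of the usual Shannon entropy calculation, H = -Σ p·log2(p) over the class proportions p, shows why a pure set scores lowest. The formula is the standard definition rather than something spelled out in these notes, and the label lists are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the target labels in a set."""
    total = len(labels)
    counts = Counter(labels)
    # Each class contributes -p * log2(p), where p is its share of the set.
    return sum(-(c / total) * log2(c / total) for c in counts.values())

pure = ["yes", "yes", "yes", "yes"]   # homogeneous with respect to the target
mixed = ["yes", "yes", "no", "no"]    # evenly split, maximally uncertain for two classes

pure_entropy = entropy(pure)
mixed_entropy = entropy(mixed)
print(pure_entropy, mixed_entropy)
```

The pure set has the lower entropy and the evenly mixed set the higher, matching the card above: higher entropy means more uncertain/impure data.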