CAP 4611 Midterm review
School: University of South Florida
Course: 4611
Subject: Computer Science
Date: Dec 6, 2023
Uploaded by ColonelElephantMaster992
Intro to ML
1.
How would you define Machine Learning?
The science of programming computers so they can learn from data.
2.
Can you name 4 types of problems where it shines?
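Problems whose existing solutions need long lists of hand-tuned rules, complex problems with no good traditional solution, fluctuating environments that require adapting to new data, and getting insights from large amounts of data. (Supervised, unsupervised, reinforcement, and online are types of learning systems, not problem types.)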
3.
What is a labeled training set?
Used in supervised learning; a training set in which each input comes with its desired output (label)
4.
What are the two most common supervised tasks?
Regression and Classification
5.
Name 4 common unsupervised tasks
Clustering, visualization, dimensionality reduction, association rule learning
6.
What type of ML algorithm would you use to allow a robot to walk in various unknown
terrains?
Reinforcement learning
7.
What type of algorithm would you use to segment your customers into multiple groups?
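Clustering (unsupervised learning) if the groups are not known in advance; a supervised classification algorithm if labeled examples of each group are available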
8.
Would you frame the problem of spam detection as supervised learning or unsupervised
learning problem?
Supervised learning, because the algorithm is trained on emails whose labels (spam or not spam) are known
9.
What is an online learning system?
A learning system that learns incrementally as data arrives in a continuous stream, either
one instance at a time or in small batches
10. What kind of data is suitable for regression?
When output is real-valued (not categorical)
11. What type of learning algorithm relies on a similarity measure to make predictions?
Instance-based learning algorithm
12. What is the difference between a model parameter and a learning algorithm’s
hyperparameter?
Model parameter determines how a model will predict given a new instance;
hyperparameter is a parameter for the learning algorithm, not of a model
13. What do model-based learning algorithms search for? What is the most common
strategy they use to succeed? How do they make predictions?
Model-based learning algorithms search for the optimal value of parameters in a model
that will give the most accurate results in new instances. We use different calculations,
such as the cost function, to determine what parameters work best to minimize the cost.
They make predictions by using the values of the new instance and the parameters in
the function.
14. 4 main challenges in ML
Overfitting the data, underfitting the data, lacking data, and nonrepresentative data
15. If your model performs great on the training data but generalizes poorly on new
instances, what is happening? Can you name three possible solutions?
It is overfitting the training data. To solve, you can get more data, implement a simpler
model, or eliminate outliers or noise from existing data set
16. What is a test set and why would you want to use it?
Test set is a set that you use to test your model (fit using training data) to see how it
performs. It is necessary to see how good or bad your model performs
17. What is the purpose of a validation set?
A held-out set used to compare candidate models and tune hyperparameters, so that the test set stays untouched
18. What can go wrong if you tune hyperparameters using the test set?
The model and its hyperparameters end up tuned just for that specific set, so the measured
error is too optimistic and the model may not perform well on out-of-sample data
19. What is k-fold cross-validation and why would you need that?
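k-fold cross-validation splits the data into k folds, trains on k-1 folds, validates on the remaining fold, and repeats until each fold has been used once for validation. The averaged score estimates the skill of the model on new
data.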
EDA
1.
What are the different methods of exploratory data analysis?
-
Univariate visualization: provides summary statistics for each field in the raw data set
-
Bivariate visualization: performed to find the relationship between each variable in
the dataset and the target variable of interest
-
Multivariate visualization: performed to understand interactions between different
fields in the dataset
-
Dimensionality reduction: helps to understand the fields in the data that account for
the most variance between observations
2.
How would you assess the feasibility of your ML project?
Determine if the data required for the solution is available or could be made available.
Does the right data exist at the right level of granularity, in the necessary quantities,
within the required time frame, and is it legally obtainable?
3.
What is a design matrix?
A structure to store the data, which you feed to the ML algorithm for building models
4.
What are the different data types?
-
Numeric: real numbers
-
Interval: can be ordered and differences are meaningful, but there is no true 0 point (6am – 8am)
-
Nominal: able to put objects into categories; can be in numerical form, but have no
mathematical interpretation
-
Ordinal: similar to nominal, but the values can be arranged in a meaningful order
(small, medium, large)
-
Binary: Kind of nominal with two values
-
Text: free form text data
-
Integer: values that are genuine integers (ex. Number of children)
-
Ratio scaled: like interval, but with a true 0 point, so ratios are meaningful
-
Categorical: nominal, binary, and ordinal
-
Continuous: integer, real, ratio scaled
5.
Deriving features from existing features
Mean, median, max, min, switching from minutes to hours, etc.
6.
What kind of questions can you answer from EDA?
-
What are the types of data?
-
What are the ranges of values?
-
What is the quality of the data?
-
Are there missing or extreme values?
7.
What contents are in a “data quality report”?
-
Summary statistics for quantitative: count, median, mean, std, mode, min, max,
percentiles, etc.
-
Summary statistics for categorical: count, count and % missing, cardinality
(uniqueness), number of values in each category
-
Basic distribution plots: histograms or bar/box plots
-
Basic relationship plots: scatter plot matrix
8.
What kind of plots could be used to examine the categorical variables?
Bar plot of how common each value is
9.
What kind of plots could be used to understand the continuous variables?
Histogram and box plots to describe the central tendency and variation of the
distribution
10. What are the different types of distributions?
Uniform, normal, exponential, unimodal (skewed right/left)
11. What are the different types of data quality issues and how to fix them?
-
Missing values: if more than 60% missing, drop column
-
Irregular cardinality (no uniqueness): might need to drop column
-
Outliers
12. How to identify outliers
-
Determine upper and lower values by using whiskers in a box plot
-
Set up upper/lower values to the mean plus/minus a multiple of the std
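The two outlier rules above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names, the 1.5 x IQR whisker convention, and the sample data are assumptions made for the example, not part of the course material.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Box-plot whisker limits: Q1 - k*IQR and Q3 + k*IQR (k=1.5 is a common convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def std_bounds(values, multiple=3.0):
    """Limits set to the mean plus/minus a multiple of the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return mean - multiple * std, mean + multiple * std

def outliers(values, bounds):
    """Return the values that fall outside the (low, high) bounds."""
    low, high = bounds
    return [v for v in values if v < low or v > high]

# Hypothetical data with one extreme value
data = [10, 12, 11, 13, 12, 11, 10, 14, 13, 95]
iqr_outliers = outliers(data, iqr_bounds(data))
std_outliers = outliers(data, std_bounds(data, multiple=2.0))
print(iqr_outliers, std_outliers)
```

Note that a single extreme value inflates the standard deviation, so the std-based rule can miss it when the multiple is large; that is one reason the box-plot rule is often tried first.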
13. What types of plots to find relationships between continuous attributes?
Scatter plots or scatter plot matrix
14. How to use correlation to decide what variables should be removed before ML?
If there is high correlation, this might mean the two variables have the same meaning,
so might be able to drop one
15. What is normalization?
Rescaling a feature to fall within a certain range (for example 0 to 1) so that features
measured on different scales become comparable
16. What is binning?
Converts a continuous feature into a categorical feature
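As a rough illustration of normalization and binning, the sketch below rescales a feature to a target range and then converts it into equal-width categories. The function names, the equal-width strategy, and the age data are assumptions chosen for the example; other binning schemes (such as equal-frequency) are equally valid.

```python
def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        return [low for _ in values]  # constant feature: nothing to rescale
    return [low + (v - v_min) * (high - low) / span for v in values]

def equal_width_bins(values, labels):
    """Turn a continuous feature into a categorical one using equal-width bins."""
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / len(labels)
    binned = []
    for v in values:
        index = int((v - v_min) / width) if width else 0
        binned.append(labels[min(index, len(labels) - 1)])  # clamp the maximum value
    return binned

# Hypothetical feature
ages = [20, 30, 40, 50, 60]
normalized = min_max_normalize(ages)
binned = equal_width_bins(ages, ["young", "middle", "senior"])
print(normalized, binned)
```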
17. Different types of sampling
-
Sampling: selects a subset of data from the design matrix that should represent the
population
-
Top sampling: selects a flat % from the top of the dataset; almost always biased
-
Random sampling: selects a flat % at random from the dataset; does not preserve the relative frequencies of groups
-
Stratified sampling: dataset is grouped by one or more particular variables, then
strata% is selected for sample; maintains the relative frequency of each group
-
Under sampling: creates a sample where all groups are equally represented; from
each group, randomly sample N instances (without replacement), where N is the
number of instances in the smallest group
-
Over sampling: creates a sample where all groups are equally represented; from
each group, randomly sample N instances (with replacement), where N is the
number of instances in the largest group
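The stratified, under-, and over-sampling strategies described above can be sketched as follows. This is a minimal standard-library version; the helper names, the fixed random seed, and the toy 80/20 dataset are assumptions for illustration only.

```python
import random
from collections import defaultdict

def group_by(rows, key):
    """Group rows by the value returned from key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups

def stratified_sample(rows, key, fraction, rng):
    """Take the same fraction from every group, preserving relative frequencies."""
    sample = []
    for members in group_by(rows, key).values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

def under_sample(rows, key, rng):
    """Sample N rows per group WITHOUT replacement, N = size of the smallest group."""
    groups = group_by(rows, key)
    n = min(len(members) for members in groups.values())
    return [row for members in groups.values() for row in rng.sample(members, n)]

def over_sample(rows, key, rng):
    """Sample N rows per group WITH replacement, N = size of the largest group."""
    groups = group_by(rows, key)
    n = max(len(members) for members in groups.values())
    return [row for members in groups.values() for row in rng.choices(members, k=n)]

# Hypothetical imbalanced dataset: 80 rows of class "a", 20 rows of class "b"
rng = random.Random(0)
rows = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
label = lambda row: row[0]

strat = stratified_sample(rows, label, 0.1, rng)
under = under_sample(rows, label, rng)
over = over_sample(rows, label, rng)
print(len(strat), len(under), len(over))
```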
Classification and Evaluation
1.
What is classification?
Supervised learning approach to categorize some unknown items into a discrete set of
classes or categories
2.
Name 5 classification algorithms
Logistic regression, Naïve Bayes, KNN, Decision tree, Support vector machine (SVM)
3.
Explain the differences between binary, multiclass, and multi-label classification with
example
-
Binary classification: only two classes to choose from (late or not late / 0 or
1)
-
Multiclass classification: more than two classes to choose from (male-blue, female-
blue, male-orange, female-orange)
-
Multi-label: class labels are not mutually exclusive (male/female and blue/orange)
4.
What are the purposes of evaluation of a classification model?
-
To determine which model is the most suitable for a task
-
To estimate how the model will perform
-
To convince users that the model will meet their needs
… supports one goal: to select the best model for the problem
5.
Explain k-fold cross-validation and how would you use it to evaluate a classification
model.
Break your data up into k equally sized chunks and train the algorithm k times. Each time
you train the algorithm, you select 1 chunk as the validation set and the other k-1 chunks
as the training set
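The splitting procedure above could look like the sketch below. It only generates index splits and averages a score; `fit_and_score` is a hypothetical callback standing in for whatever model training and evaluation you use, and the data is assumed to be shuffled already.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Train k times and average the k validation scores."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each instance lands in the validation set exactly once, which is what makes the averaged score a reasonable estimate of performance on unseen data.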
6.
How to calculate misclassification rate and what is the limitation? What tool can you
use to better understand the number behind this?
Misclassification rate = number of incorrect predictions / total predictions
This hides a lot of information, so use confusion matrix to find TP, TN, FP, FN. Confusion
matrix helps to better understand where a classification algorithm goes right or wrong
7.
How to measure classification accuracy? In what situation is classification accuracy not
the correct measure?
Classification accuracy = number of correct predictions / total predictions
Accuracy is misleading when the classes are imbalanced (e.g., imagine the target vector has 99 “1s” and only a single 0. If you predict 1 for everything,
you have 99% accuracy; baggage example)
8.
What is bootstrapping?
Training k different models, with random samples (with replacement) from the dataset
to create the training and validation sets for each iteration
9.
How to calculate TPR, TNR, FPR, and FNR?
-
TPR = TP / (TP + FN)
-
TNR = TN / (TN + FP)
-
FPR = FP / (TN + FP)
-
FNR = FN / (TP + FN)
10. What is precision and recall? Interpret.
-
Precision is the accuracy of the positive predictions. This captures how often, when
the model makes a positive prediction, the prediction turns out to be actually
correct. Precision = TP / (TP + FP)
-
Recall (sensitivity, or TPR) is the ratio of positive instances that are correctly
classified. This tells us how confident we can be that all the instances with a positive
target level have been found by the model. Recall = TP / (TP + FN)
11. What is F1 score, how to measure, and its limitation?
F1 score is the combined measure of precision and recall; a harmonic mean. It is less
sensitive to outliers but favors classifiers that have a similar precision and recall score.
F1-measure = 2 x [(precision x recall) / (precision + recall)]
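To make the confusion-matrix formulas above concrete, here is a small sketch that counts TP, TN, FP, and FN and derives the rates, precision, recall, and F1 from them. The function names and the example labels are assumptions; the sketch also assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Derive the metrics from the counts (assumes no denominator is zero)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same as TPR
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "TPR": recall,
        "TNR": tn / (tn + fp),
        "FPR": fp / (tn + fp),
        "FNR": fn / (tp + fn),
        "precision": precision,
        "recall": recall,
        "F1": 2 * (precision * recall) / (precision + recall),
    }

# Hypothetical labels: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
report = classification_metrics(y_true, y_pred)
print(report)
```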
12. How can precision-recall curve help find the best threshold?
If you plot the value of precision and recall for different values of threshold, you can find
the best point in the plot that will help you to know for which threshold you can
maximize both precision and recall
13. How can profit matrix be used for deciding the best threshold?
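Associates a cost (or profit) with TP, TN, FP, and FN outcomes; the best threshold is the one that gives the highest total profit over the predictions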
14. What is ROC curve and how can it help you evaluate your classification model?
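The ROC curve plots the true positive rate against the false positive rate at every threshold. A good classifier should be as far away from the dotted diagonal line (a random classifier) as possible. The area under
the ROC curve (AUC) gives a single value, with a value of 1 indicating a perfect classifier.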
15. If you have a dataset with several classes, how would you use classification algorithms
that only work for a binary classifier?
-
One vs. all: train one binary classifier per class, where that class is the positive class and all the other classes together are "not class" (N classifiers for N classes)
-
One vs. one: if there are N classes, you need to train N x (N-1) / 2 classifiers; have a
column for each class
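The difference in the number of binary classifiers can be seen by enumerating the tasks each strategy creates. This sketch only lists the tasks and does not train anything; the function names and class labels are hypothetical.

```python
from itertools import combinations

def one_vs_all_tasks(classes):
    """One binary task per class: that class versus all the rest."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes: N * (N - 1) / 2 tasks."""
    return list(combinations(classes, 2))

classes = ["cat", "dog", "bird", "fish"]
ova = one_vs_all_tasks(classes)
ovo = one_vs_one_tasks(classes)
print(len(ova), len(ovo))
```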
Linear Regression
1.
What are the independent and dependent variable in a dataset for linear regression?
Independent – can be continuous or categorical
Dependent – must be continuous
2.
What kind of ML problems can be solved using Linear Regression?
Linear regression can help us to predict continuous values
3.
What does it mean by fitting a line in linear regression?
Finding a line that best fits the data points available on the plot, so you can use that to
predict output values given inputs
4.
What are the meanings of the different variables in linear regression?
Theta 0 = intercept
Theta 1 = slope/gradient
X1 = a single predictor (independent variable)
5.
Parameter estimation using mathematical approach
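Find theta 1 and theta 0 in $\hat{y} = \theta_0 + \theta_1 x_1$
$\theta_1 = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}$
$\theta_0 = \bar{y} - \theta_1 \bar{x}$
where $\bar{x}$ and $\bar{y}$ are the means of x and y over the s training samples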
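The closed-form estimates for theta 1 and theta 0 can be checked with a short sketch. The function name and the toy data (generated from y = 1 + 2x) are assumptions for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (theta0, theta1) for y_hat = theta0 + theta1 * x using the closed form."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    theta1 = numerator / denominator   # slope
    theta0 = y_bar - theta1 * x_bar    # intercept
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
theta0, theta1 = fit_simple_linear_regression(xs, ys)
print(theta0, theta1)
```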
6.
What is the meaning of error by a linear regression model? Give three different
evaluation metrics to measure the performance of a linear regression model.
The error is how far the actual data point is from the predicted regression line.
Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{n} \sum_{j=1}^{n} |y_j - \hat{y}_j|$
Mean Square Error (MSE): $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Root Mean Square Error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
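The three error measures are straightforward to compute directly. The sketch below uses hypothetical actual and predicted values; the function names are assumptions.

```python
import math

def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical values: the errors are 1, 0, -1, -2
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 11]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, mse, rmse)
```

Because MSE squares each error, the single error of size 2 contributes more to MSE and RMSE than to MAE.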
7.
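Explain the equation $\hat{y} = \theta^T X$ and explain why we transpose theta.
- $\hat{y}$ is the predicted value
- $\theta^T$ is the transpose of $\theta$ (a row vector instead of a column vector)
- $\theta^T X$ is the matrix multiplication of $\theta^T$ and X
- $\theta$ and X are both column vectors, so $\theta$ is transposed to make the product a row vector times a column vector, which yields a single number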
8.
Normal equation
$\theta = (X^T X)^{-1} \cdot (X^T y)$
9.
Why do we add a column with all values of 1 in linear regression?
We are adding another feature $x_0 = 1$ so that the intercept $\theta_0$ fits into the product $\theta^T X$: $\theta_0$ is multiplied by 1, so its value is not affected by the multiplication
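The Normal equation and the column of 1s can be illustrated together. The sketch below is standard-library only and, as a deliberate simplification, solves the linear system (X^T X) theta = X^T y by Gauss-Jordan elimination rather than forming the inverse explicitly; the result is the same theta whenever X^T X is invertible. All function names and the toy data are assumptions.

```python
def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def normal_equation(features, targets):
    """Return theta = [theta0, theta1, ...] minimizing squared error."""
    X = [[1.0] + list(row) for row in features]  # prepend the column x0 = 1
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * y for x, y in zip(row, targets)) for row in Xt]
    return solve(XtX, Xty)

# Toy data lying exactly on y = 1 + 2x
theta = normal_equation([[1], [2], [3], [4]], [3, 5, 7, 9])
print(theta)
```

Without the prepended 1.0, the fitted line would be forced through the origin and the intercept of 1 could not be recovered.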
10. What is the computation complexity for the Normal equation in linear regression?
About $O(n^{2.4})$ to $O(n^3)$ in the number of features n, depending on the implementation
11. What is the weakness of using Normal equation for linear regression?
It is computationally expensive when you have a large number of features
Gradient Descent
1.
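Gradient Descent is only used for linear regression: false. It is a general optimization algorithm that can minimize the cost function of many models (for example, logistic regression)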
2.
What is the purpose of gradient descent in the case of linear regression? Is it used
during training or testing or both?
Gradient descent is used to tweak parameters iteratively in order to minimize a cost
function on the training data. It is used during training only.
3.
Why is partial derivative used in gradient descent?
The partial derivative lets us change one variable at a time, while holding all the others
constant
4.
What is a contour plot?
It is a graphical technique for representing a 3-dimensional surface by plotting constant z
slices, called contours, on a 2-dimensional format. The closer a point is to the middle circle, the
closer the error is to its minimum.
5.
Outline the steps of gradient descent algorithm
- Start with some $\theta_0$ and $\theta_1$
- Keep changing $\theta_0$ and $\theta_1$ to reduce the error
- Hopefully you will end up at a minimum
6.
How does the learning rate (or step size) affect the gradient descent?
If the learning rate is too small, the algorithm will take a long time to converge; if the learning rate is too big, it might
jump across the valley and fail to converge
7.
How should $\theta_1$ change based on the sign of the slope/partial derivative?
If slope/derivative is positive, $\theta_1$ should decrease
If slope/derivative is negative, $\theta_1$ should increase
8.
Should we calculate all the different thetas at each iteration simultaneously?
Yes
9.
Why is feature scaling important in gradient descent?
It is important to ensure all features have a similar scale, or it will take longer to
converge/get to the minimum
10. What is the gradient vector of the cost function in batch gradient descent?
The gradient vector, noted $\nabla_\theta \mathrm{MSE}(\theta)$, contains all the partial derivatives of the cost
function
11. What is eta and epsilon in gradient descent?
Eta ($\eta$) is the learning rate
Epsilon ($\epsilon$) is the tolerance: if the difference between x_old and x_new is smaller than this value, the
algorithm will halt
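Putting the pieces together (simultaneous updates, learning rate eta, stopping tolerance epsilon), batch gradient descent for a single-feature linear regression might look like the sketch below. It assumes the cost is MSE, whose partial derivatives are (2/m) * sum(error) for theta 0 and (2/m) * sum(error * x) for theta 1; the default values for `eta`, `epsilon`, and `max_iterations` are illustrative choices, not recommendations.

```python
def batch_gradient_descent(xs, ys, eta=0.05, epsilon=1e-9, max_iterations=100_000):
    """Fit y_hat = theta0 + theta1 * x by batch gradient descent on the MSE cost."""
    theta0, theta1 = 0.0, 0.0  # start with some theta0 and theta1
    m = len(xs)
    for _ in range(max_iterations):
        # The whole training set is used at every step (this is what makes it "batch")
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = (2 / m) * sum(errors)
        grad1 = (2 / m) * sum(e * x for e, x in zip(errors, xs))
        # Simultaneous update: both gradients were computed from the old thetas
        new0 = theta0 - eta * grad0
        new1 = theta1 - eta * grad1
        converged = abs(new0 - theta0) < epsilon and abs(new1 - theta1) < epsilon
        theta0, theta1 = new0, new1
        if converged:  # change smaller than epsilon: halt
            break
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x
gd_theta0, gd_theta1 = batch_gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])
print(round(gd_theta0, 4), round(gd_theta1, 4))
```

A positive gradient makes the update subtract from theta, and a negative gradient makes it add, which matches the sign rule in question 7.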
12. What is the weakness of batch gradient descent and how stochastic gradient descent
can help overcome that?
Batch gradient descent uses the whole training set at every step, making it slow.
Stochastic picks one random instance in the training set at every step and computes
gradients only based on the one instance
13. What is the learning schedule in stochastic gradient descent?
The learning schedule is the function that determines the learning rate at each iteration.
If learning rate is reduced too quickly, you may get stuck at local min or freeze halfway
If learning rate is reduced too slowly, you may jump around the minimum for a long time
and end with a suboptimal solution
14. What is the weakness of stochastic gradient descent?
Since instances are picked randomly, some instances may be picked several times, while
others may not be picked at all. Much less regular than batch; cost function will bounce
around; final values are good, not optimal.
15. What is minibatch gradient descent? Does it support out-of-core learning?
In minibatch, the gradients are computed on small random sets of instances, called
minibatches. Yes, stochastic and minibatch support out of core learning.
16. Hyperparameters for batch GD?
Polynomial Regression and Regularized Models
1.
What is polynomial regression and when is it used?
A form of linear regression that estimates the relationship as an nth degree polynomial.
Used to fit nonlinear data, by using linear model.
2.
What strategy is used to use a linear model to fit nonlinear data?
Add powers of each feature as new features, then train a linear model on this extended
set of features
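The feature-extension step can be sketched as below: each row gains the powers of its original features, after which any linear model can be trained on the longer rows. The function name is hypothetical, and this version adds only pure powers (no interaction terms between different features).

```python
def add_powers(row, degree):
    """Return the row extended with powers 1..degree of each original feature."""
    return [value ** d for d in range(1, degree + 1) for value in row]

# One feature x: quadratic data y = x^2 becomes linear in the new feature x^2
xs = [1, 2, 3]
extended = [add_powers([x], 2) for x in xs]
print(extended)

# Two features per row
print(add_powers([2, 3], 2))
```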
3.
What is the purpose of learning curve and what do we plot in a learning curve?
These are plots of the model’s performance on the training set and the validation set, as
a function of the training set size. To generate, simply train the model several times on
different sized subsets of the training set.
4.
What is bias/variance tradeoff?
The increase of one will result in the decrease of the other. Increasing model’s
complexity (degree) will increase variance and reduce bias. Reducing complexity
increases bias and reduces variance
5.
What are the different techniques to overcome underfitting and overfitting?
To overcome underfitting/high bias, add new features or parameters to the model so that model
complexity increases.
To overcome overfitting, reduce model complexity or apply regularization
6.
What is the purpose of regularization and how is it performed?
It is constraining a model to make it simpler and reduce the risk of overfitting.
Keep the same number of features, but reduce the magnitude of the coefficients
7.
What are the three regularized linear models?
Ridge Regression, Lasso Regression, Elastic Net
8.
What strategy is used by Ridge regression as part of regularization?
A regularization term, $\alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$, is added to the cost function during training
9.
What are the L1 and L2 norms? Which is used by ridge regression?
L1 norm is calculated as the sum of the absolute values of the vector (MAE)
L2 norm is calculated as the square root of the sum of squared vector values
Ridge uses L2 and Lasso uses L1
10. What strategy is used by Lasso regression as part of regularization?
Adds regularization term to cost function: $\alpha \sum_{i=1}^{n} |\theta_i|$
11. Out of ridge and lasso, which one automatically performs feature selection?
Lasso; it tends to completely eliminate the weights of the least important features
12. What is elastic net? What parameter is used in elastic net to balance between the use of
lasso and ridge?
Middle ground between ridge and lasso. Regularization term is a simple mix of both
ridge and lasso’s, and you control the mix ratio r. When r = 0 it is ridge and when r = 1 it
is lasso. Adds:
$r \alpha \sum_{i=1}^{n} |\theta_i| + \frac{1 - r}{2} \alpha \sum_{i=1}^{n} \theta_i^2$
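The three penalty terms can be compared numerically. In this sketch `theta` is assumed to hold only the feature weights (the bias term is conventionally left out of the penalty), and the function names and values are illustrative.

```python
def ridge_penalty(theta, alpha):
    """alpha * 1/2 * sum of squared weights (based on the L2 norm)."""
    return alpha * 0.5 * sum(t * t for t in theta)

def lasso_penalty(theta, alpha):
    """alpha * sum of absolute weights (the L1 norm)."""
    return alpha * sum(abs(t) for t in theta)

def elastic_net_penalty(theta, alpha, r):
    """Mix of both: r = 0 gives ridge, r = 1 gives lasso."""
    return r * lasso_penalty(theta, alpha) + (1 - r) * ridge_penalty(theta, alpha)

theta = [3.0, -4.0]  # hypothetical feature weights
alpha = 0.1
print(ridge_penalty(theta, alpha))
print(lasso_penalty(theta, alpha))
print(elastic_net_penalty(theta, alpha, r=0.5))
```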
13. How to decide between lasso/ridge/elastic net?
Ridge is a good default, but if only a few features are useful, prefer lasso or elastic net. Elastic net
is generally preferred over lasso
14. What is early stopping regularization?
Regularize by stopping training as soon as validation error reaches a minimum
Logistic and SoftMax Regression
1.
What is the difference between linear and logistic regression?
Linear regression is used to predict continuous values
Logistic regression is used to predict categorical values
2.
What is the difference between logistic and softmax regression?
SoftMax is used to support multiple classes directly, not just binary
3.
What is changed in linear regression to achieve the benefit of logistic regression?
Instead of using $\theta^T X$, we use $\hat{y} = \sigma(\theta^T X)$ (sigma always returns a value between 0
and 1)
This gives us a probability instead of a continuous value
4.
Why don’t you use MSE cost function of linear regression in logistic regression?
Combined with the sigmoid, the MSE equation gives a non-convex function, so gradient descent could get stuck in a local minimum
5.
Cost function of logistic regression
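The squared-error cost $\mathrm{Cost}(\hat{y}, y) = \frac{1}{2} (\sigma(\theta^T X) - y)^2$ is non-convex, so the log loss is used instead:
In the case that the desired y = 1, the cost is $-\log(\hat{y})$; if y = 0, the cost is $-\log(1 - \hat{y})$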
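A small sketch of the sigmoid and the per-instance log loss shows why this cost works: it is close to 0 for a confident correct prediction and grows large for a confident wrong one. The function names and the parameter values are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_probability(theta, x):
    """y_hat = sigma(theta^T x); x is assumed to already include x0 = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_loss(y_hat, y):
    """Cost for one instance: -log(y_hat) if y = 1, else -log(1 - y_hat)."""
    return -math.log(y_hat) if y == 1 else -math.log(1 - y_hat)

theta = [-1.0, 2.0]          # hypothetical parameters
x = [1.0, 0.5]               # x0 = 1, x1 = 0.5, so theta^T x = 0
y_hat = predict_probability(theta, x)
print(y_hat)
print(log_loss(0.99, 1), log_loss(0.01, 1))
```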
6.
What is one-hot encoding?
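Representing a categorical value with k possible classes as k binary columns, one per class, where exactly one column is 1 and the rest are 0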
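One-hot encoding can be sketched in a few lines. The function name and the color labels are hypothetical, and ordering the categories alphabetically is just one possible convention.

```python
def one_hot(labels):
    """Return (categories, encoded rows) with one binary column per category."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # exactly one column is "hot"
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["red", "green", "blue", "green"])
print(categories)
print(encoded)
```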
KNN
1.
What is KNN classification?
Estimate the classification of an unseen instance using the classification of the k instances
that are closest to it
2.
What is the formula for Euclidean distance?
$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}$
3.
What is the difference between eager and lazy learning? What does KNN use?
Eager – when given training points, constructs a generalized model before making any
new prediction
Lazy – no learning or training of a model is needed; the learner memorizes the training dataset and uses all data points at the time of
prediction
KNN is a lazy learner
4.
What are the conditions for Minkowski distance?
-
Non-negativity
-
Identity; d (x, y) = 0 iff x==y
-
Symmetry
-
Triangle inequality: d(x, y) + d(y, z) >= d(x, z)
5.
Minkowski formula
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
p = 1: Manhattan
p = 2: Euclidean
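The Minkowski formula reduces to the Manhattan and Euclidean distances for p = 1 and p = 2, which a short sketch can confirm. The function name and the points are chosen for illustration.

```python
def minkowski_distance(x, y, p):
    """(sum of |x_i - y_i|^p) raised to the power 1/p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = [0, 0], [3, 4]
manhattan = minkowski_distance(a, b, p=1)
euclidean = minkowski_distance(a, b, p=2)
print(manhattan, euclidean)
```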
6.
Explain cosine distance and Jaccard distance
Cosine distance: mainly used to calculate similarity between two vectors; determines if
pointing in same direction
Jaccard distance: similar to cosine; looks at two binary datasets and finds the positions where both
values equal 1, relative to the positions where at least one value equals 1; the more 1-to-1 matches occur, the smaller the distance
7.
Hamming distance
Used to compare two binary data structures (or strings) of equal length by counting the positions where they differ
ABCDE vs. AGDDF = 3 (3 non-matching positions)
What is normalization and why is it important?
When using distance measures, the large values frequently swamp small ones. So,
normalize the values of continuous attributes. (0 to 1)
9.
How can you give more importance to a particular feature?
Add weight; put w in front of each square in Euclidean distance
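A weighted Euclidean distance, used inside a very small KNN majority vote, might look like the sketch below. The function names, the weights, and the training points are assumptions for illustration; a real implementation would also normalize the features first.

```python
import math
from collections import Counter

def weighted_euclidean(a, b, weights):
    """Euclidean distance with a weight w in front of each squared difference."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def knn_predict(train, query, k, weights):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: weighted_euclidean(item[0], query, weights))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points: (features, class)
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
prediction = knn_predict(train, query=(2, 2), k=3, weights=(1, 1))
print(prediction)
print(weighted_euclidean((0, 0), (3, 4), (4, 1)))  # first feature counts 4x
```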
10. What is the curse of dimensionality?
The amount of data needed grows exponentially as you increase the number of dimensions; in high dimensions the points become sparse and distances become less meaningful
Naïve Bayes
1.
What is naïve bayes?
Uses probability theory to find most likely of the possible classifications
2.
Pros and cons of naïve bayes?
Pros: easy and fast to predict the class of a test data set; performs well in multiclass
prediction
Cons: zero frequency (a category never seen in training gets probability 0); known to be a bad probability estimator
3.
What are assumption made by naïve bayes classifier about the data set?
Assumes the predictors (features) are independent of each other given the class
Assumes a normal distribution for numerical values