CAP 4611 Midterm review

School

University of South Florida

Course

4611

Subject

Computer Science

Date

Dec 6, 2023

Uploaded by ColonelElephantMaster992

Intro to ML

1. How would you define Machine Learning?
The science of programming computers so they can learn from data.

2. Can you name 4 types of problems where it shines?
Complex problems with no known algorithmic solution, problems that otherwise need long lists of hand-tuned rules, changing environments where the system must adapt to new data, and gaining insight from large amounts of data. (Supervised, unsupervised, reinforcement, and online learning are types of ML systems, not problem types.)

3. What is a labeled training set?
A training set used in supervised learning, in which each input comes with its known desired output (label).

4. What are the two most common supervised tasks?
Regression and classification.

5. Name 4 common unsupervised tasks.
Clustering, visualization, dimensionality reduction, and association rule learning.

6. What type of ML algorithm would you use to allow a robot to walk in various unknown terrains?
Reinforcement learning.

7. What type of algorithm would you use to segment your customers into multiple groups?
Unsupervised learning (clustering) if the groups are not known in advance; supervised learning (classification) if labeled groups are known.

8. Would you frame the problem of spam detection as a supervised or an unsupervised learning problem?
Supervised learning, because the labels (spam or not spam) are known.

9. What is an online learning system?
A learning system in which the machine learns incrementally as data arrives continuously in small batches.

10. What kind of data is suitable for regression?
Data where the output is real-valued (not categorical).

11. What type of learning algorithm relies on a similarity measure to make predictions?
Instance-based learning algorithms.

12. What is the difference between a model parameter and a learning algorithm's hyperparameter?
A model parameter determines how the model predicts given a new instance; a hyperparameter is a parameter of the learning algorithm, not of the model, and is set before training.

13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
Model-based learning algorithms search for the optimal values of the model parameters, so that the model gives the most accurate results on new instances. The usual strategy is to define a cost function and find the parameter values that minimize the cost.
They make predictions by plugging the feature values of the new instance into the model function, using the learned parameters.

14. What are the 4 main challenges in ML?
Overfitting the data, underfitting the data, lack of data, and nonrepresentative data.

15. If your model performs great on the training data but generalizes poorly on new instances, what is happening? Can you name three possible solutions?
It is overfitting the training data. To solve this, you can get more data, use a simpler model, or remove outliers and noise from the existing data set.

16. What is a test set and why would you want to use it?
A test set is data held out from training. You evaluate the model (fit on the training data) on it to estimate how well the model will perform on new instances.

17. What is the purpose of a validation set?
It is used to compare different trained models and select the best one.

18. What can go wrong if you tune hyperparameters using the test set?
The model ends up tuned to that specific set, so the measured test error is too optimistic and the model may not perform well on out-of-sample data.

19. What is k-fold cross-validation and why would you need that?
A procedure that splits the data into k folds, trains k times with a different fold held out each time, and uses the held-out results to estimate the skill of the model on new data.

EDA

1. What are the different methods of exploratory data analysis?
- Univariate visualization: provides summary statistics for each field in the raw data set
- Bivariate visualization: performed to find the relationship between each variable in the dataset and the target variable of interest
- Multivariate visualization: performed to understand interactions between different fields in the dataset
- Dimensionality reduction: helps to understand the fields in the data that account for the most variance between observations

2. How would you assess the feasibility of your ML project?
Determine whether the data required for the solution is available or could be made available. Does the right data exist at the right level of granularity, in the necessary quantities, within the required time frame, and is it legally obtainable?

3. What is a design matrix?
A structure that stores the data you feed to the ML algorithm for building models.

4. What are the different data types?
- Numeric: real numbers
- Interval: can be ordered and differences are meaningful, but there is no true 0 point (6am to 8am)
- Nominal: able to put objects into categories; can be in numerical form, but have no mathematical interpretation
- Ordinal: similar to nominal, but the values can be arranged in a meaningful order (small, medium, large)
- Binary: a kind of nominal with two values
- Text: free-form text data
- Integer: values that are genuine integers (e.g., number of children)
- Ratio scaled: numeric values with a true 0 point, so ratios are meaningful
- Categorical: nominal, binary, and ordinal
- Continuous: integer, real, ratio scaled

5. Deriving features from existing features
Mean, median, max, min, switching from minutes to hours, etc.
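As a small illustration of deriving features, the sketch below collapses a list of raw values into summary features and converts minutes to hours. The customer names, the call durations, and the `derive_features` helper are hypothetical and not part of the course material.

```python
from statistics import mean, median

# Hypothetical raw feature: call durations in minutes for each customer.
call_minutes = {
    "customer_a": [12.0, 30.0, 48.0],
    "customer_b": [5.0, 5.0, 20.0, 90.0],
}

def derive_features(durations_min):
    """Collapse a list of raw values into derived summary features."""
    return {
        "mean_min": mean(durations_min),
        "median_min": median(durations_min),
        "max_min": max(durations_min),
        "min_min": min(durations_min),
        # Unit conversion: total minutes expressed in hours.
        "total_hours": sum(durations_min) / 60,
    }

# One row of derived features per customer.
derived = {name: derive_features(vals) for name, vals in call_minutes.items()}
```

Each entry of `derived` could then be added as new columns of the design matrix.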
6. What kind of questions can you answer from EDA?
- What are the types of data?
- What are the ranges of values?
- What is the quality of the data?
- Are there missing or extreme values?

7. What contents are in a "data quality report"?
- Summary statistics for quantitative features: count, median, mean, std, mode, min, max, percentiles, etc.
- Summary statistics for categorical features: count, count and % missing, cardinality (uniqueness), number of values in each category
- Basic distribution plots: histograms or bar/box plots
- Basic relationship plots: scatter plot matrix

8. What kind of plots could be used to examine the categorical variables?
A bar plot of how common each value is.

9. What kind of plots could be used to understand the continuous variables?
Histograms and box plots, to describe the central tendency and variation of the distribution.

10. What are the different types of distributions?
Uniform, normal, exponential, unimodal (skewed right/left).

11. What are the different types of data quality issues and how to fix them?
- Missing values: if more than 60% are missing, drop the column
- Irregular cardinality (no uniqueness): might need to drop the column
- Outliers

12. How to identify outliers
- Determine upper and lower limits using the whiskers of a box plot
- Set the upper/lower limits to the mean plus/minus a multiple of the std

13. What types of plots are used to find relationships between continuous attributes?
Scatter plots or a scatter plot matrix.

14. How to use correlation to decide what variables should be removed before ML?
If two variables are highly correlated, they may carry the same information, so you might be able to drop one of them.

15. What is normalization?
Rescaling the values of a feature to fall within a certain range, so that it is easier to understand and compare.

16. What is binning?
Converting a continuous feature into a categorical feature.

17. Different types of sampling
- Sampling: selects a subset of data from the design matrix that should represent the population
- Top sampling: selects a flat % from the top of the dataset; almost always biased
- Random sampling: selects a flat % at random from the dataset; may not preserve relationships such as the relative frequency of groups
- Stratified sampling: the dataset is grouped by one or more particular variables, then a % of each stratum is selected for the sample; maintains the relative frequency of each group
- Under sampling: creates a sample where all groups are equally represented; from each group, randomly sample N instances (without replacement), where N is the number of instances in the smallest group
- Over sampling: creates a sample where all groups are equally represented; from each group, randomly sample N instances (with replacement), where N is the number of instances in the largest group

Classification and Evaluation

1. What is classification?
A supervised learning approach that categorizes unknown items into a discrete set of classes or categories.

2. Name 5 classification algorithms
Logistic regression, naïve Bayes, KNN, decision tree, support vector machine (SVM).

3. Explain the differences between binary, multiclass, and multi-label classification with examples
- Binary classification: only 2 classes to choose from (late or not late / 0 or 1)
- Multiclass classification: more than two classes to choose from (male-blue, female-blue, male-orange, female-orange)
- Multi-label: class labels are not mutually exclusive (male/female and blue/orange)

4. What are the purposes of evaluation of a classification model?
- To determine which model is the most suitable for a task
- To estimate how the model will perform
- To convince users that the model will meet their needs
All of these support one goal: to select the best model for the problem.

5. Explain k-fold cross-validation and how you would use it to evaluate a classification model.
Break your data up into k equally sized chunks and train the algorithm k times. Each time you train the algorithm, you select 1 chunk as the validation set and the other k-1 chunks as the training set.

6. How to calculate misclassification rate and what is the limitation? What tool can you use to better understand the number behind this?
Misclassification rate = number of incorrect predictions / total predictions
This hides a lot of information, so use a confusion matrix to find TP, TN, FP, FN.
The confusion matrix helps to better understand where a classification algorithm goes right or wrong.

7. How to measure classification accuracy? In what situation is classification accuracy not the correct measure?
Classification accuracy = number of correct predictions / total predictions
It is not the correct measure when the classes are imbalanced (e.g., imagine the target vector has 99 "1s" and only a single 0. If you predict 1 for everything, you have 99% accuracy; baggage example).

8. What is bootstrapping?
Training k different models, each using random samples (with replacement) from the dataset to create the training and validation sets for that iteration.

9. How to calculate TPR, TNR, FPR, and FNR?
- TPR = TP / (TP + FN)
- TNR = TN / (TN + FP)
- FPR = FP / (TN + FP)
- FNR = FN / (TP + FN)

10. What is precision and recall? Interpret.
- Precision is the accuracy of the positive predictions. It captures how often, when the model makes a positive prediction, the prediction turns out to be correct. Precision = TP / (TP + FP)
- Recall (sensitivity, or TPR) is the ratio of positive instances that are correctly classified. It tells us how confident we can be that all the instances with a positive target level have been found by the model. Recall = TP / (TP + FN)

11. What is F1 score, how to measure, and its limitation?
F1 score is the combined measure of precision and recall; a harmonic mean. It is less sensitive to outliers but favors classifiers that have a similar precision and recall score.
F1-measure = 2 x [(precision x recall) / (precision + recall)]

12. How can a precision-recall curve help find the best threshold?
If you plot precision and recall for different threshold values, you can find the point on the plot, and therefore the threshold, that gives the best balance between precision and recall.

13. How can a profit matrix be used for deciding the best threshold?
It associates a cost (or profit) with each of the TP, TN, FP, and FN outcomes, so thresholds can be compared by the total they produce.

14. What is a ROC curve and how can it help you evaluate your classification model?
The ROC curve plots TPR against FPR for different thresholds. A good classifier should be as far away from the dotted diagonal line (a random classifier) as possible. The area under the ROC curve (AUC) gives a single value, with a value of 1 indicating a perfect classifier.

15. If you have a dataset with several classes, how would you use classification algorithms that only work for a binary classifier?
- One vs. all: for each classifier you keep one class as the positive class, and all the other classes are treated as "not that class"
- One vs. one: train one classifier for each pair of classes; if there are N classes, you need to train N x (N-1) / 2 classifiers; have a column for each class

Linear Regression

1. What are the independent and dependent variables in a dataset for linear regression?
Independent (the predictors): can be continuous or categorical
Dependent (the target): must be continuous

2. What kind of ML problems can be solved using Linear Regression?
Linear regression can help us predict continuous values.

3. What does it mean by fitting a line in linear regression?
Finding the line that best fits the available data points, so you can use it to predict output values given inputs.
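A minimal sketch of fitting a line, assuming the usual least-squares definition of "best fit" for a single predictor. The function name `fit_line` and the toy data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept (theta0) and slope (theta1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    # Intercept: forces the line through the point (x_bar, y_bar).
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x (an assumption for the example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = fit_line(xs, ys)

def predict(x):
    """Predict an output value from the fitted line."""
    return theta0 + theta1 * x
```

With real, noisy data the points will not lie exactly on the line, and the gap between each point and the line is the error of the model.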
4. What are the meanings of the different variables in linear regression?
$\theta_0$ = intercept
$\theta_1$ = slope/gradient
$x_1$ = a single predictor (independent variable)

5. Parameter estimation using mathematical approach
Find $\theta_1$ and $\theta_0$ in $\hat{y} = \theta_0 + \theta_1 x_1$:
$$\theta_1 = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}$$
$$\theta_0 = \bar{y} - \theta_1 \bar{x}$$
where $\bar{x}$ and $\bar{y}$ are the means of x and y over the s samples.

6. What is the meaning of error by a linear regression model? Give three different evaluation metrics to measure the performance of a linear regression model.
The error is how far the actual data point is from the predicted regression line.
Mean Absolute Error (MAE):
$$MAE = \frac{1}{n} \sum_{j=1}^{n} |y_j - \hat{y}_j|$$
Mean Square Error (MSE):
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root Mean Square Error (RMSE):
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

7. Explain the equation $\hat{y} = \theta^T X$ and explain why we transpose theta.
- $\hat{y}$ is the predicted value
- $\theta^T$ is the transpose of $\theta$ (a row vector instead of a column vector)
- $\theta^T X$ is the matrix multiplication of $\theta^T$ and $X$
We transpose $\theta$ so that the dimensions line up: a row vector times a column of feature values gives a single predicted number.

8. Normal equation
$$\theta = (X^T X)^{-1} (X^T y)$$

9. Why do we add a column with all values of 1 in linear regression?
We are adding another feature $x_0 = 1$, so that $\theta_0$ is multiplied by 1 and its value is not affected by the multiplication. This lets the intercept be handled inside the same matrix product.

10. What is the computational complexity of the Normal equation in linear regression?
About $O(n^{2.4})$ to $O(n^3)$ in the number of features n, depending on the implementation.

11. What is the weakness of using the Normal equation for linear regression?
It is computationally expensive when you have a large number of features.

Gradient Descent

1. Is gradient descent only used for linear regression?
No. It is a general optimization algorithm that can be used to minimize the cost function of other models as well, such as logistic regression.

2. What is the purpose of gradient descent in the case of linear regression? Is it used during training or testing or both?
Gradient descent is used to tweak the parameters iteratively in order to minimize a cost function. It is used during training only, on the training data.

3. Why is the partial derivative used in gradient descent?
The partial derivative lets us change one variable at a time while holding all the others constant.

4. What is a contour plot?
It is a graphical technique for representing a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format. The closer to the middle circle, the closer the error is to its minimum.

5. Outline the steps of the gradient descent algorithm
- Start with some $\theta_0$ and $\theta_1$
- Keep changing $\theta_0$ and $\theta_1$ to reduce the error
- Hopefully you will end up at a minimum

6. How does the learning rate (or step size) affect gradient descent?
If the learning rate is too small, the algorithm will take a long time to converge; if the learning rate is too big, it might jump across the valley.

7. How to change the parameter based on the sign of the slope/partial derivative?
If the slope/derivative is positive, $\theta_1$ should decrease
If the slope/derivative is negative, $\theta_1$ should increase

8. Should we calculate all the different thetas at each iteration simultaneously?
Yes.

9. Why is feature scaling important in gradient descent?
It is important to ensure all features have a similar scale, or it will take longer to converge to the minimum.

10. What is the gradient vector of the cost function in batch gradient descent?
The gradient vector, noted $\nabla_\theta MSE(\theta)$, contains all the partial derivatives of the cost function.

11. What is eta and epsilon in gradient descent?
Eta ($\eta$) is the learning rate
Epsilon ($\epsilon$) is the tolerance: if the difference between x_old and x_new is smaller than this value, the algorithm halts

12. What is the weakness of batch gradient descent and how can stochastic gradient descent help overcome that?
Batch gradient descent uses the whole training set at every step, making it slow. Stochastic gradient descent picks one random instance in the training set at every step and computes the gradients based only on that instance.

13. What is the learning schedule in stochastic gradient descent?
The learning schedule is the function that determines the learning rate at each iteration.
If the learning rate is reduced too quickly, you may get stuck at a local minimum or freeze halfway
If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end with a suboptimal solution

14. What is the weakness of stochastic gradient descent?
Since instances are picked randomly, some instances may be picked several times, while others may not be picked at all. It is much less regular than batch gradient descent: the cost function will bounce around, and the final parameter values are good, but not optimal.

15. What is minibatch gradient descent? Does it support out-of-core learning?
In minibatch gradient descent, the gradients are computed on small random sets of instances, called minibatches. Yes, both stochastic and minibatch gradient descent support out-of-core learning.

16. Hyperparameters for batch GD?

Polynomial Regression and Regularized Models

1. What is polynomial regression and when is it used?
A form of linear regression that estimates the relationship as an nth degree polynomial. It is used to fit nonlinear data using a linear model.

2. What strategy is used to make a linear model fit nonlinear data?
Add powers of each feature as new features, then train a linear model on this extended set of features.

3. What is the purpose of a learning curve and what do we plot in a learning curve?
Learning curves are plots of the model's performance on the training set and the validation set, as a function of the training set size. They show whether the model is underfitting or overfitting. To generate them, simply train the model several times on different sized subsets of the training set.

4. What is the bias/variance tradeoff?
Decreasing one tends to increase the other. Increasing a model's complexity (degree) will increase variance and reduce bias. Reducing complexity increases bias and reduces variance.

5. What are the different techniques to overcome underfitting and overfitting?
To overcome underfitting/high bias, add new parameters to the model so that model complexity increases. To overcome overfitting, reduce complexity or apply regularization.

6. What is the purpose of regularization and how is it performed?
It constrains a model to make it simpler and reduce the risk of overfitting. Keep the same number of features, but reduce the magnitude of the coefficients.

7. What are the three regularized linear models?
Ridge Regression, Lasso Regression, Elastic Net

8. What strategy is used by Ridge regression as part of regularization?
A regularization term, $\alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$, is added to the cost function during training.

9. What is l1 and l2 norm? Which is used by ridge regression?
The L1 norm is calculated as the sum of the absolute values of the vector (as in MAE)
The L2 norm is calculated as the square root of the sum of the squared vector values
Ridge uses L2 and Lasso uses L1

10. What strategy is used by Lasso regression as part of regularization?
Adds a regularization term to the cost function: $\alpha \sum_{i=1}^{n} |\theta_i|$

11. Out of ridge and lasso, which one automatically performs feature selection?
Lasso; it tends to completely eliminate the weights of the least important features.

12. What is elastic net? What parameter is used in elastic net to balance between the use of lasso and ridge?
A middle ground between ridge and lasso. The regularization term is a simple mix of both ridge's and lasso's terms, and you control the mix ratio r. When r = 0 it is ridge and when r = 1 it is lasso.
Adds: $r \alpha \sum_{i=1}^{n} |\theta_i| + \frac{1 - r}{2} \alpha \sum_{i=1}^{n} \theta_i^2$

13. How to decide between lasso/ridge/elastic net?
Ridge is a good default, but if only a few features are useful, prefer lasso or elastic net. Elastic net is generally preferred over lasso.

14. What is early stopping regularization?
Regularizing by stopping training as soon as the validation error reaches a minimum.

Logistic and SoftMax Regression

1. What is the difference between linear and logistic regression?
Linear regression is used to predict continuous values
Logistic regression is used to predict categorical values

2. What is the difference between logistic and softmax regression?
SoftMax supports multiple classes directly, not just binary classification.

3. What is changed in linear regression to achieve the benefit of logistic regression?
Instead of using $\theta^T X$, we use $\hat{y} = \sigma(\theta^T X)$ (the sigmoid $\sigma$ always returns a value between 0 and 1).
This gives us a probability instead of a continuous value.

4. Why don't you use the MSE cost function of linear regression in logistic regression?
With the sigmoid inside, the MSE equation gives a non-convex function, which is hard to minimize.

5. Cost function of logistic regression
The squared-error form would be $Cost(\hat{y}, y) = \frac{1}{2} (\sigma(\theta^T X) - y)^2$. Because that form is non-convex, a log-based cost is used instead: when the desired y = 1, the cost is $-\log(\hat{y})$; if y = 0, the cost is $-\log(1 - \hat{y})$.

6. What is one-hot encoding?
Representing a categorical value as a set of binary columns, one per category, with a 1 in the column of the instance's category and 0 in all the others.

KNN

1. What is KNN classification?
Estimating the classification of an unseen instance using the classification of the instance or instances that are closest to it.
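A minimal sketch of KNN classification, assuming Euclidean distance as the closeness measure and a simple majority vote among the k nearest neighbors. The function names and the toy points are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, query, k=3):
    """train is a list of (feature_vector, label) pairs."""
    # Keep the k training instances closest to the query point.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy training set with two well-separated groups.
train = [
    ((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.0), "blue"),
    ((6.0, 6.0), "orange"), ((7.0, 6.5), "orange"), ((6.5, 7.0), "orange"),
]
pred_low = knn_predict(train, (1.2, 1.3))
pred_high = knn_predict(train, (6.4, 6.2))
```

There is no training step here: the stored data points are consulted directly at prediction time.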
2. What is the formula for Euclidean distance?
$$\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}$$

3. What is the difference between eager and lazy learning? What does KNN use?
Eager: when given training points, constructs a generalized model before making any new prediction
Lazy: no learning or training of a model is needed; it memorizes the training dataset and uses all data points at prediction time
KNN is a lazy learner

4. What are the conditions for Minkowski distance?
- Non-negativity
- Identity: d(x, y) = 0 iff x == y
- Symmetry
- Triangle inequality: d(x, y) + d(y, z) >= d(x, z)

5. Minkowski formula
$$\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
p = 1: Manhattan
p = 2: Euclidean

6. Explain cosine distance and Jaccard distance
Cosine distance: mainly used to calculate the similarity between two vectors; determines whether they point in the same direction
Jaccard distance: similar to cosine; looks at two datasets and finds the positions where both values are equal to 1; the result reflects how many 1-to-1 matches occur

7. Hamming distance
Used to compare two binary data structures (or equal-length strings) by counting the positions that do not match.
ABCDE
AGDDF
Hamming distance = 3 (3 non-matching positions)

8. What is normalization and why is it important?
When using distance measures, large values frequently swamp small ones. So, normalize the values of continuous attributes (to the range 0 to 1).

9. How can you give more importance to a particular feature?
Add a weight: put a weight $w_i$ in front of each squared term in the Euclidean distance.

10. What is the curse of dimensionality?
The amount of data needed grows exponentially as you increase the number of dimensions.

Naïve Bayes

1. What is naïve Bayes?
It uses probability theory to find the most likely of the possible classifications.
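2. Pros and cons of naïve Bayes?
Pros: easy and fast to predict the class of a test data set; performs well in multiclass prediction
Cons: zero frequency (a category value never seen with a class in training gets probability 0); known as a bad estimator, so its probability outputs should not be taken too seriously

3. What assumptions are made by the naïve Bayes classifier about the data set?
Assumes the predictors are independent of each other
Assumes a normal distribution for numerical values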
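A minimal sketch of a naïve Bayes classifier for categorical features, multiplying the class prior by one independent per-feature probability. The add-one (Laplace) smoothing controlled by `alpha` is an assumption added here as one common way to avoid the zero-frequency problem; the toy messages and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count classes and per-class feature values from categorical data."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def predict_nb(model, row, alpha=1.0):
    """Return the most likely class and the unnormalized score of each class."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(class)
        for i, v in enumerate(row):
            count = feature_counts[(i, c)][v]
            # Independence assumption: multiply one smoothed likelihood per feature.
            score *= (count + alpha) / (n_c + alpha * len(values[i]))
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (contains_link, length) -> label.
rows = [("yes", "short"), ("yes", "long"), ("no", "short"),
        ("no", "long"), ("yes", "short")]
labels = ["spam", "spam", "ham", "ham", "spam"]
model = train_nb(rows, labels)
label, scores = predict_nb(model, ("yes", "short"))
```

Without smoothing (`alpha=0`), the "ham" score for this message would be exactly 0 because no ham message in the toy data contains a link.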