TB for Quiz4 and Quiz5

Machine Learning: Classification, Regression and Clustering

15.1 Introduction to Machine Learning

15.1 Q1: Which of the following statements is false?
a. We can make machines learn.
b. The "secret sauce" of machine learning is data, and lots of it.
c. With machine learning, rather than programming expertise into our applications, we program them to learn from data.
d. All of the above statements are true.

15.1 Q2: Which of the following kinds of predictions are happening today with machine learning?
a. Improving weather prediction, cancer diagnoses and treatment regimens to save lives.
b. Predicting customer "churn," what prices houses are likely to sell for, ticket sales of new movies, and anticipated revenue of new products and services.
c. Predicting the best strategies for coaches and players to use to win more games and championships, while experiencing fewer injuries.
d. All of the above.

15.1.1 Scikit-Learn

15.1 Q3: Which of the following statements is false?
a. Scikit-learn conveniently packages the most effective machine-learning algorithms as evaluators.
b. Each scikit-learn algorithm is encapsulated, so you don't see its intricate details, including any heavy mathematics.
c. With scikit-learn and a small amount of Python code, you'll create powerful models quickly for analyzing data, extracting insights from the data and most importantly making predictions.
d. All of the above statements are true.

15.1 Q4: Which of the following statements is false?
a. With scikit-learn, you'll train each model on a subset of your data, then test each model on the rest to see how well your model works.
b. Once your models are trained, you'll put them to work making predictions based on data they have not seen. You'll often be amazed at how accurate your models will be.
c. With machine learning, your computer that you've used mostly on rote chores will take on characteristics of intelligence.
d. Although you can specify parameters to customize scikit-learn models and possibly improve their performance, if you use the models' default parameters for simplicity, you'll often obtain mediocre results.

15.1 Q5: Which of the following statements about scikit-learn and the machine-learning models you'll build with it is false?
a. It's difficult to know in advance which model(s) will perform best on your data, so you typically try many models and pick the one that performs best; scikit-learn makes this convenient for you.
b. You'll rarely get to know the details of the complex mathematical algorithms in the scikit-learn estimators, but with experience, you'll be able to intuit the best model for each new dataset.
c. It generally takes at most a few lines of code for you to create and use each scikit-learn model.
d. The models report their performance so you can compare the results and pick the model(s) with the best performance.

15.1.2 Types of Machine Learning

15.1 Q6: Which of the following statements is false?
a. The two main types of machine learning are supervised machine learning, which works with unlabeled data, and unsupervised machine learning, which works with labeled data.
b. If you're developing a computer vision application to recognize dogs and cats, you'll train your model on lots of dog photos labeled "dog" and cat photos labeled "cat." If your model is effective, when you put it to work processing unlabeled photos it will recognize dogs and cats it has never seen before.
c. The more photos you train with, the greater the chance that your model will accurately predict which new photos are dogs and which are cats.
d. In this era of big data and massive, economical computer power, you should be able to build some pretty accurate machine learning models.

15.1 Q7: Which of the following statements is false?
a. Supervised machine learning falls into two categories: classification and regression.
b. You train machine-learning models on datasets that consist of rows and columns. Each row represents a data feature. Each column represents a sample of that feature.
c. In supervised machine learning, each sample has an associated label called a target (like "spam" or "not spam"). This is the value you're trying to predict for new data that you present to your models.
d. All of the above statements are true.

15.1 Q8: Which of the following statements is false?
a. "Toy" datasets generally have a small number of samples with a limited number of features. In the world of big data, datasets commonly have millions and billions of samples, or even more.
b. There's an enormous number of free and open datasets available for data science studies. Libraries like scikit-learn bundle popular datasets for you to experiment with and provide mechanisms for loading datasets from various repositories (such as openml.org).
c. Governments, businesses and other organizations worldwide offer datasets on a vast range of subjects.
d. All of the above statements are true.

15.1 Q9: Which of the following statements is false?
a. Even though k-nearest neighbors is one of the most complex classification algorithms, because of its superior prediction accuracy we use it to analyze the Digits dataset bundled with scikit-learn.
b. Classification algorithms predict the discrete classes (categories) to which samples belong.
c. Binary classification uses two classes, such as "spam" or "not spam" in an email classification application. Multi-classification uses more than two classes, such as the 10 classes, 0 through 9, in the Digits dataset.
d. A classification scheme looking at movie descriptions might try to classify them as "action," "adventure," "fantasy," "romance," "history" and the like.

15.1 Q10: Which of the following statements is false?
a. Regression models predict a continuous output, such as the predicted temperature output in a weather time series analysis.
b. We can implement simple linear regression using scikit-learn's LinearRegression estimator.
c. The LinearRegression estimator also can perform multiple linear regression.
d. The LinearRegression estimator, by default, uses all the nonnumerical features in a dataset to make more sophisticated predictions than you can with a single-feature simple linear regression.

15.1 Q11: Unsupervised machine learning uses ________ algorithms.
a. classification
b. clustering
c. regression
d. None of the above

15.1 Q12: Which of the following are related to compressing a dataset's large number of features down to two for visualization purposes?
a. dimensionality reduction
b. TSNE estimator
c. both a) and b)
d. neither a) nor b)
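The questions above rely on the idea that scikit-learn wraps classification, regression and clustering algorithms behind a common estimator interface. The following minimal sketch is an added illustration, not quiz code; the tiny lists of numbers are invented purely to show that a classifier and a regressor are created, trained and used the same way:

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# classification: the targets are discrete class labels
X = [[1], [2], [3], [10], [11], [12]]          # made-up single-feature samples
y = ['small', 'small', 'small', 'large', 'large', 'large']
knn = KNeighborsClassifier(n_neighbors=3)      # k is specified in advance
knn.fit(X, y)
print(knn.predict([[2], [11]]))                # ['small' 'large']

# regression: the targets are continuous values
lr = LinearRegression()
lr.fit([[0], [1], [2], [3]], [0.0, 2.0, 4.0, 6.0])
print(lr.predict([[4]]))                       # approximately [8.]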
15.1 Q13: Which of the following statements is false?
a. The simplest supervised machine-learning algorithm we use is k-means clustering.
b. We use scikit-learn's PCA estimator to perform dimensionality reduction to compress a dataset's many features down to two for visualization purposes.
c. In k-means clustering, each cluster's centroid is the cluster's center point.
d. You'll often run multiple clustering estimators to compare their ability to divide a dataset's samples effectively into clusters.

15.1 Q14: Which of the following statements is false?
a. K-means clustering works through the data attempting to divide it into that many clusters.
b. As with many machine learning algorithms, k-means is recursive and gradually zeros in on the clusters to match the number you specify.
c. K-means clustering can find similarities in unlabeled data. This can ultimately help with assigning labels to that data so that supervised learning estimators can then process it.
d. Given that it's tedious and error-prone for humans to have to assign labels to unlabeled data, and given that the vast majority of the world's data is unlabeled, unsupervised machine learning is an important tool.

15.1 Q15: Which of the following statements is false?
a. The amount of data that's available today is already enormous and continues to grow exponentially; the data produced in the world in the last few years alone equals the amount produced up to that point since the dawn of civilization.
b. People used to say "I'm drowning in data and I don't know what to do with it." With machine learning, we now say, "Flood me with big data so I can use machine-learning technology to extract insights and make predictions from it."
c. The big data phenomenon is occurring at a time when computing power is exploding and computer memory and secondary storage are exploding in capacity while costs dramatically decline. This enables us to think differently about solution approaches.
d. All of the above statements are true.

15.1.4 Steps in a Typical Data Science Study

15.1 Q16: Which of the following are not steps in a typical machine-learning case study?
a. loading the dataset and exploring the data with pandas and visualizations
b. transforming your data (converting non-numeric data to numeric data because scikit-learn requires numeric data) and splitting the data for training and testing
c. creating, training and testing the model; tuning the model, evaluating its accuracy and making predictions on live data that the model hasn't seen before
d. All of the above are steps in a typical machine-learning case study.
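Q16's steps can be seen end-to-end in a few lines. The sketch below is a hedged illustration using scikit-learn's bundled Digits dataset (the same dataset the following case study uses); it is not the quiz's own code:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# step 1: load the dataset and explore its shape
digits = load_digits()
print(digits.data.shape, digits.target.shape)      # (1797, 64) (1797,)

# step 2: split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11)

# step 3: create and train the model
knn = KNeighborsClassifier()
knn.fit(X=X_train, y=y_train)

# step 4: evaluate its accuracy on data it has not seen
print(f'{knn.score(X_test, y_test):.2%}')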
15.2 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 1

15.2 Q1: Which of the following statements is false?
a. Classification in supervised machine learning attempts to predict the distinct class to which a sample belongs.
b. If you have images of dogs and images of cats, you can classify each image as a "dog" or a "cat." This is a binary classification problem.
c. When classifying digit images from the Digits dataset bundled with scikit-learn, our goal is to predict which digit an image represents. Since there are 10 possible digits (the classes), this is a multi-classification problem.
d. You train a classification model using unlabeled data.

15.2.1 k-Nearest Neighbors Algorithm

15.2 Q2: Which of the following statements is false?
a. Scikit-learn supports many classification algorithms, including the simplest, k-nearest neighbors (k-NN).
b. The k-nearest neighbors algorithm attempts to predict a test sample's class by looking at the k training samples that are nearest (in distance) to the test sample.
c. Always pick an even value of k.
d. In the k-nearest neighbors algorithm, the class with the most "votes" wins.

15.2 Q3: Which of the following statements is false?
a. In machine learning, a model implements a machine-learning algorithm. In scikit-learn, models are called estimators.
b. There are two parameter types in machine learning: those the estimator calculates as it learns from the data you provide and those you specify in advance when you create the scikit-learn estimator object that represents the model.
c. The machine-learning parameters the estimator calculates as it learns from the data are called hyperparameters; in the k-nearest neighbors algorithm, k is a hyperparameter.
d. For simplicity, we use scikit-learn's default hyperparameter values. In real-world machine-learning studies, you'll want to experiment with different values of k to produce the best possible models for your studies; this process is called hyperparameter tuning.

15.2.2 Loading the Dataset

15.2 Q4: Which of the following statements is false?
a. Scikit-learn's machine-learning algorithms require samples to be stored in a one-dimensional array of floating-point values (or one-dimensional array-like collection, such as a list).
b. To represent every sample as one row, multi-dimensional data must be flattened into a one-dimensional array.
c. If you work with a dataset containing categorical features (typically represented as strings, such as 'spam' or 'not-spam'), you have to preprocess those features into numerical values.
d. Scikit-learn's sklearn.preprocessing module provides capabilities for converting categorical data to numeric data.

15.2.3 Visualizing the Data

15.2 Q5: With regard to our code that displays 24 digit images, which of the following statements is false?
a. The following call to function subplots creates a 6-by-4 inch Figure (specified by the figsize=(6, 4) keyword argument) containing 24 subplots arranged in 6 rows and 4 columns:

import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(6, 4))

b. Each subplot has its own Axes object, which we'll use to display one digit image.
c. Function subplots returns the Axes objects in a two-dimensional NumPy array.
d. All of the above are true.

15.2.4 Splitting the Data for Training and Testing

15.2 Q6: Which of the following statements is false?
a. You typically train a machine-learning model with a subset of a dataset.
b. Typically, you should train your model with the smallest amount of data that makes the model perform well.
c. It's important to set aside a portion of your data for testing, so you can evaluate a model's performance using data that it has not yet seen. Once you're confident that the model is performing well, you can use it to make predictions using new data.
d. All of the above statements are true.

15.2 Q7: Which of the following statements is false?
a. You should first break your data into a training set and a testing set to prepare to train and test the model.
b. The function train_test_split from the sklearn.model_selection module simply splits in order the samples in the data array and the target values in the target array into training and testing sets. This helps ensure that the training and testing sets have similar characteristics.
c. A ShuffleSplit object (module sklearn.model_selection) shuffles and splits samples and their targets.
d. All of the above statements are true.

15.2 Q8: Which of the following statements is false?
a. Scikit-learn's bundled classification datasets are not balanced, so you should be sure to balance each dataset before you work with it.
b. Unbalanced classes could lead to incorrect results.
c. In machine-learning studies, this helps others confirm your results by working with the same randomly selected data.
d. Function train_test_split provides the keyword argument random_state for reproducibility. When you run the code in the future with the same seed value, train_test_split will select the same data for the training set and the same data for the testing set.

15.2 Q9: Which of the following statements is false?
a. Looking at the arrays X_train's and X_test's shapes, you can see that, by default, train_test_split reserves 75% of the data for training and 25% for testing.
b. To specify different splits, you can set the sizes of the testing and training sets with the train_test_split function's keyword arguments test_size and train_size. Use floating-point values from 0.0 through 100.0 to specify the percentages of the data to use for each.
c. You can use integer values to set the precise numbers of samples.
d. If you specify one of the keyword arguments test_size and train_size, the other is inferred. For example, the statement

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11, test_size=0.20)

specifies that 20% of the data is for testing, so train_size is inferred to be 0.80.

15.2.5 Creating the Model

15.2 Q10: Which of the following statements is false?
a. The KNeighborsClassifier estimator (module sklearn.neighbors) implements the k-nearest neighbors algorithm.
b. The following code creates a KNeighborsClassifier estimator object:

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

c. The internal details of how a KNeighborsClassifier object implements the k-nearest neighbors algorithm are hidden in the object. You simply call its methods.
d. All of the above statements are true.

15.2.6 Training the Model

15.2 Q11: Which of the following statements is false?
a. The following call to the KNeighborsClassifier object's fit method loads the sample training set (X_train) and target training set (y_train) into the estimator:
knn.fit(X=X_train, y=y_train)

b. After the KNeighborsClassifier's fit method loads the data into the estimator, it uses that data to perform complex calculations behind the scenes that learn from the data and train the model.
c. The KNeighborsClassifier estimator is said to be lazy because its work is performed only when you use it to make predictions.
d. All of the above statements are true.

15.2 Q12: Which of the following statements is false?
a. In real-world machine-learning applications, it can often take minutes, hours, days or even months to train your models; special-purpose, high-performance hardware called GPUs and TPUs can significantly reduce model training time.
b. The fit method returns the estimator, so IPython displays its string representation, which includes the estimator's default settings.
c. For simplicity, we generally use the default estimator settings; by default, a KNeighborsClassifier looks at the four nearest neighbors to make its predictions.
d. All of the above statements are true.

15.2.7 Predicting Digit Classes

15.2 Q13: Which of the following statements is false?
a. Once we've loaded our data into the KNeighborsClassifier, we can use it with the test samples to make predictions. Calling the estimator's predict method with X_test as an argument returns an array containing the predicted class of each test image:

predicted = knn.predict(X=X_test)

b. If predicted and expected are arrays containing the predictions and expected target values for digit images, evaluating the following code snippets displays the predicted digits and expected digits for the first 20 test samples:

predicted[:20]
expected[:20]

c. If predicted and expected are arrays containing the predictions and expected target values for digit images, the following list comprehension locates all the incorrect predictions for the entire test set, that is, the cases in which the predicted and expected values do not match:

wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]

d. All of the above statements are true.
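For Q13, the prediction snippets can be assembled into one runnable piece. This is a hedged sketch; the split and training lines are repeated only to make the example self-contained:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11)
knn = KNeighborsClassifier().fit(X=X_train, y=y_train)

# predict every test image's class, then compare with the known targets
predicted = knn.predict(X=X_test)
expected = y_test
print(predicted[:20])
print(expected[:20])

# collect the (predicted, expected) pairs that do not match
wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
print(wrong)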
15.3 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 2

15.3.1 Metrics for Model Accuracy

15.3 Q1: Which of the following statements is false?
a. Each estimator has a score method that returns an indication of how well the estimator performs for the test data you pass as arguments.
b. For classification estimators, the score method returns the prediction accuracy for the test data, as in:

print(f'{knn.score(X_test, y_test):.2%}')

c. You can perform hyperparameter tuning to try to determine the optimal value for k, hoping to get even better accuracy.
d. All of the above statements are true.

15.3 Q2: Which of the following statements is false?
a. Another way to check a classification estimator's accuracy is via a confusion matrix, which shows only the incorrect predicted values (also known as the misses) for a given class.
b. To create a confusion matrix, simply call the function confusion_matrix from the sklearn.metrics module, passing the expected classes and the predicted classes as arguments, as in:

from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_true=expected, y_pred=predicted)

c. The y_true keyword argument in Part (b) specifies the test samples' actual classes.
d. The y_pred keyword argument in Part (b) specifies the predicted classes for the test samples.

15.3 Q3: Consider the confusion matrix for the Digits dataset's predictions:

array([[45,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 45,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0, 54,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 42,  0,  1,  0,  1,  0,  0],
       [ 0,  0,  0,  0, 49,  0,  0,  1,  0,  0],
       [ 0,  0,  0,  0,  0, 38,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0, 42,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 45,  0,  0],
       [ 0,  1,  1,  2,  0,  0,  0,  0, 39,  1],
       [ 0,  0,  0,  0,  1,  0,  0,  0,  1, 41]])
Which of the following statements is false?
a. The correct predictions are shown on the diagonal from top-left to bottom-right; this is called the principal diagonal.
b. The nonzero values that are not on the principal diagonal indicate incorrect predictions.
c. Each row represents one distinct class, that is, one of the digits 0–9.
d. The columns within a row specify how many of the test samples were classified incorrectly into each distinct class.

15.3 Q4: The sklearn.metrics module provides function classification_report, which produces a table of classification metrics based on the expected and predicted values, as in:

from sklearn.metrics import classification_report
names = [str(digit) for digit in digits.target_names]
print(classification_report(expected, predicted, target_names=names))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.98      1.00      0.99        45
           2       0.98      1.00      0.99        54
           3       0.95      0.95      0.95        44
           4       0.98      0.98      0.98        50
           5       0.97      1.00      0.99        38
           6       1.00      1.00      1.00        42
           7       0.96      1.00      0.98        45
           8       0.97      0.89      0.93        44
           9       0.98      0.95      0.96        43

   micro avg       0.98      0.98      0.98       450
   macro avg       0.98      0.98      0.98       450
weighted avg       0.98      0.98      0.98       450

Which of the following statements about the report is false?
a. The precision column shows the total number of correct predictions for a given digit divided by the total number of predictions for that digit. You can confirm the precision by looking at each column in the confusion matrix.
b. The recall column is the total number of correct predictions for a given digit divided by the total number of samples that should have been predicted as that digit. You can confirm the recall by looking at each row in the confusion matrix.
c. The f1-score column is the average of the precision and the recall, and the support column is the number of samples with a given expected value; for example, 50 samples were labeled as 4s, and 38 samples were labeled as 5s.
d. All of the above are true.

15.3 Q5: When you display a confusion matrix as a heat map, the principal diagonal and the incorrect predictions stand out nicely. Which of the following statements about the heat map version of a confusion matrix is false?
a. Seaborn's graphing functions work with two-dimensional data such as pandas DataFrames.
b. The following code converts a confusion matrix into a DataFrame, then graphs it:

import pandas as pd
confusion_df = pd.DataFrame(confusion, index=range(10), columns=range(10))

import seaborn as sns
axes = sns.heatmap(confusion_df, annot=True, cmap='nipy_spectral_r')
c. The Seaborn function heatmap creates a heat map from the specified DataFrame.
d. The keyword argument annot=True (short for "annotation") labels the heatmap's rows with row numbers and columns with column numbers.

15.3.2 K-Fold Cross-Validation

15.3 Q6: Which of the following statements is false?
a. K-fold cross-validation enables you to use all of your data at once for training your model.
b. K-fold cross-validation splits the dataset into k equal-size folds.
c. You then repeatedly train your model with k - 1 folds and test the model with the remaining fold.
d. For example, consider using k = 10 with folds numbered 1 through 10. With 10 folds, we'd do 10 successive training and testing cycles: First, we'd train with folds 1–9, then test with fold 10. Next, we'd train with folds 1–8 and 10, then test with fold 9. Next, we'd train with folds 1–7 and 9–10, then test with fold 8. This training and testing cycle continues until each fold has been used to test the model.

15.3 Q7: Which of the following statements is false?
a. Scikit-learn provides the KFold class and the cross_val_score function (both in the module sklearn.model_selection) to help you perform the training and testing cycles.
b. The following code creates a KFold object:

from sklearn.model_selection import KFold
kfold = KFold(n_folds=10, random_state=11, shuffle=True)

c. The keyword argument random_state=11 seeds the random number generator for reproducibility.
d. The keyword argument shuffle=True causes the KFold object to randomize the data by shuffling it before splitting it into folds. This is particularly important if the samples might be ordered or grouped.

15.3 Q8: Which of the following statements is false?
a. The following code uses function cross_val_score to train and test a model:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=digits.data, y=digits.target, cv=kfold)

b. The keyword arguments in Part (a) are:
   estimator=knn, which specifies the estimator you'd like to validate.
   X=digits.data, which specifies the samples to use for training and testing.
   y=digits.target, which specifies the target predictions for the samples.
   cv=kfold, which specifies the cross-validation generator that defines how to split the samples and targets for training and testing.
c. Function cross_val_score returns one accuracy score.
d. All of the above statements are true.

15.3.3 Running Multiple Models to Find the Best One

15.3 Q9: Which of the following statements is false?
a. It's difficult to know in advance which machine learning model(s) will perform best for a given dataset, especially when they hide the details of how they operate from their users.
b. Even though the KNeighborsClassifier predicts digit images with a high degree of accuracy, it's possible that other scikit-learn estimators are even more accurate.
c. Scikit-learn provides many models with which you can quickly train and test your data. This encourages you to run multiple models to determine which is the best for a particular machine learning study.
d. All of the above statements are true.

15.3.4 Hyperparameter Tuning

15.3 Q10: Which of the following statements is false?
a. The k in the k-nearest neighbors algorithm is a hyperparameter of the algorithm.
b. Hyperparameters are set after using the algorithm to train your model.
c. In real-world machine learning studies, you'll want to use hyperparameter tuning to choose hyperparameter values that produce the best possible predictions.
d. To determine the best value for k in the kNN algorithm, try different values of k then compare the estimator's performance with each.

15.3 Q21: Consider the following code and output:

In [57]: for k in range(1, 20, 2):
    ...:     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    ...:     knn = KNeighborsClassifier(n_neighbors=k)
    ...:     scores = cross_val_score(estimator=knn,
    ...:         X=digits.data, y=digits.target, cv=kfold)
    ...:     print(f'k={k:<2}; mean accuracy={scores.mean():.2%}; ' +
    ...:           f'standard deviation={scores.std():.2%}')
    ...:
k=1 ; mean accuracy=98.83%; standard deviation=0.58%
k=3 ; mean accuracy=98.78%; standard deviation=0.78%
k=5 ; mean accuracy=98.72%; standard deviation=0.75%
k=7 ; mean accuracy=98.44%; standard deviation=0.96%
k=9 ; mean accuracy=98.39%; standard deviation=0.80%
k=11; mean accuracy=98.39%; standard deviation=0.80%
k=13; mean accuracy=97.89%; standard deviation=0.89%
k=15; mean accuracy=97.89%; standard deviation=1.02%
k=17; mean accuracy=97.50%; standard deviation=1.00%
k=19; mean accuracy=97.66%; standard deviation=0.96%

Which of the following statements is false?
a. The loop creates KNeighborsClassifiers with odd k values from 1 through 19 and performs k-fold cross-validation on each.
b. The k value 7 in kNN produces the most accurate predictions for the Digits dataset.
c. You can also see that accuracy tends to decrease for higher k values.
d. Compute time grows rapidly with k, because k-NN needs to perform more calculations to find the nearest neighbors.

15.4 Case Study: Time Series and Simple Linear Regression

15.4 Q1: Which of the following statements is false?
a. The LinearRegression estimator is in the sklearn.linear_model module.
b. By default, LinearRegression uses all the numeric features in a dataset, performing a multiple linear regression.
c. Simple linear regression uses one feature as the independent variable.
d. All of the above statements are true.

15.3 Q2: Which of the following statements is false?
a. Scikit-learn estimators require their training and testing data to be two-dimensional arrays (or two-dimensional array-like data, such as lists of lists or pandas DataFrames).
b. To use one-dimensional data with an estimator, you must transform it from one dimension containing n elements, into two dimensions containing m rows and n columns.
c. In a DataFrame nyc with a Date column, the expression nyc.Date returns the Date column's Series, and the Series values attribute returns the NumPy array containing that Series' values.
d. All of the above statements are true.

15.3 Q3: To transform a one-dimensional array into two dimensions, we call an array's ________ method.
a. transform
b. switch
c. convert
d. reshape
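A small sketch of the reshape idea behind the two questions above. The numbers here are invented (the case study's NYC temperature data isn't reproduced in this excerpt); the point is only that a one-dimensional feature must become a column of values before LinearRegression will accept it:

import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([2010, 2011, 2012, 2013, 2014, 2015])   # one-dimensional, shape (6,)
temps = np.array([37.3, 38.1, 37.9, 38.6, 39.0, 39.4])   # invented target values

# reshape(-1, 1) turns the n elements into n rows and 1 column
X = years.reshape(-1, 1)                                  # shape (6, 1)

linear_regression = LinearRegression()
linear_regression.fit(X=X, y=temps)

# the fitted slope and intercept define y = mx + b
m, b = linear_regression.coef_[0], linear_regression.intercept_
print(f'predicted for 2016: {m * 2016 + b:.2f}')
print(linear_regression.predict(np.array([[2016]])))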
15.4 Q4: Which of the following statements is false?
a. Scikit-learn has separate classes for simple linear regression and multiple linear regression.
b. To find the best fitting regression line for the data, a LinearRegression estimator iteratively adjusts the slope and intercept values to minimize the sum of the squares of the data points' distances from the line.
c. Once LinearRegression is complete, you can use the slope and intercept in the y = mx + b calculation to make predictions. The slope is stored in the estimator's coeff_ attribute (m in the equation) and the intercept is stored in the estimator's intercept_ attribute (b in the equation).
d. All of the above are true.

15.4 Q5: The following code tests a linear regression model using the data in X_test and checks some of the predictions throughout the dataset by displaying the predicted and expected values for every ________ element:

predicted = linear_regression.predict(X_test)
expected = y_test
for p, e in zip(predicted[::5], expected[::5]):
    print(f'predicted: {p:.2f}, expected: {e:.2f}')

a. second
b. fifth
c. pth
d. eth

15.3 Q24: Consider the following visualization that you studied in Chapter 15, Machine Learning:
Which of the following statements best describes the above visualization?
a. It shows a scatter plot.
b. It shows a linear regression line.
c. Both (a) and (b)
d. None of the above

15.4 Q8: Which of the following statements is false?
a. When creating a model, a key goal is to ensure that it is capable of making accurate predictions for data it has not yet seen. Two common problems that prevent accurate predictions are overfitting and underfitting.
b. Underfitting occurs when a model is too simple to make accurate predictions, based on its training data. An example of underfitting is using a linear model, such as simple linear regression, when in fact, the problem really requires a more sophisticated non-linear model.
c. Overfitting occurs when your model is too complex. In the most extreme case of overfitting, a model memorizes its training data.
d. When you make predictions with an overfit model, the model won't know what to do with new data that matches the training data, but the model will make excellent predictions with data it has never seen.
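Q8's overfitting/underfitting distinction can be made concrete by comparing training and testing scores. This sketch is an added illustration, not quiz code, using invented noisy data; the degrees 1, 3 and 15 are arbitrary choices to show the pattern:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(11)
X = np.sort(rng.uniform(0, 10, 40)).reshape(-1, 1)        # invented samples
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=40)    # noisy non-linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

for degree in (1, 3, 15):       # degree 1 tends to underfit, 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f'degree {degree:>2}: train R^2={model.score(X_train, y_train):.2f}, '
          f'test R^2={model.score(X_test, y_test):.2f}')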
15.5 Case Study: Multiple Linear Regression with the California Housing Dataset

15.5 Q1: Which of the following statements is false?
a. The California Housing dataset (bundled with scikit-learn) has 20,640 samples, each with eight numerical features.
b. The LinearRegression estimator performs multiple linear regression by default, using all of a dataset's numeric features.
c. You should expect more meaningful results from simple linear regression than from multiple linear regression.
d. All of the above statements are true.

15.5.1 Loading the Dataset

15.5 Q2: Which of the following statements is false?
a. According to the California Housing Prices dataset's description in scikit-learn, "This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people)."
b. The dataset has 20,640 samples, one per block group, with eight features each:
   median income (in tens of thousands, so 8.37 would represent $83,700)
   median house age (in the dataset, the maximum value for this feature is 52)
   average number of rooms
   average number of bedrooms
   block population
   average house occupancy
   house block latitude
   house block longitude
By combining these features to make predictions, we're more likely to get more accurate predictions.
c. Each sample has as its target a corresponding median house value in hundreds of thousands of dollars, so 3.55 would represent $355,000. In the dataset, the maximum value for this feature is 5, which represents $500,000. It's reasonable to expect that more bedrooms or more rooms or higher income would generally mean higher house value.
d. All of the above statements are true.

15.5 Q3: Which of the following statements is false?
a. You load the California Housing dataset using function fetch_california_housing from the sklearn.datasets module.
b. The Bunch object's data and target attributes are NumPy arrays containing the 20,640 samples and their target values respectively.
c. We can confirm the number of samples (rows) and features (columns) by looking at the data array's shape attribute, which shows that there are 20,640 rows and 8 columns, as in:

In [4]: california.data.shape
Out[4]: (20640, 8)

Similarly, you can see that the number of target values, that is, the median house values, matches the number of samples by looking at the target array's shape, as in:

In [5]: california.target.shape
Out[5]: (20640,)

d. The Bunch's features attribute contains the names that correspond to each column in the data array.

15.5.2 Exploring the Data with Pandas

15.5 Q4: Consider the following code that imports pandas and sets some options:

import pandas as pd
pd.set_option('precision', 4)
pd.set_option('max_columns', 9)
pd.set_option('display.width', None)

Which of the following statements about the set_option calls is false?
a. 'precision' is the maximum number of digits to display to the right of each decimal point.
b. 'max_columns' is the maximum number of columns to display when you output the DataFrame's string representation. By default, pandas displays all of the columns left-to-right. The 'max_columns' setting enables pandas to show all the columns using multiple rows of output.
c. 'display.width' specifies the width in characters of your Command Prompt (Windows), Terminal (macOS/Linux) or shell (Linux). The value None tells pandas to auto-detect the display width when formatting string representations of Series and DataFrames.
d. All of the above statements are true.

15.3 Q5: Which of the following statements is false?
a. The following code creates a DataFrame from a Bunch's data, target and feature_names arrays. The first snippet below creates the initial DataFrame using the data in california.data and with the column names specified by california.feature_names. The second snippet adds a column for the median house values stored in california.target:

In [11]: california_df = pd.DataFrame(california.data,
    ...:     columns=california.feature_names)
    ...:
In [12]: california_df['MedHouseValue'] = pd.Series(california.target)

b. We can peek at some of the data in the DataFrame using the head function:

In [13]: california_df.head()
Out[13]:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  \
0  8.3252      41.0    6.9841     1.0238       322.0    2.5556
1  8.3014      21.0    6.2381     0.9719      2401.0    2.1098
2  7.2574      52.0    8.2881     1.0734       496.0    2.8023
3  5.6431      52.0    5.8174     1.0731       558.0    2.5479
4  3.8462      52.0    6.2819     1.0811       565.0    2.1815

   Latitude  Longitude  MedHouseValue
0     37.88    -122.23          4.526
1     37.86    -122.22          3.585
2     37.85    -122.24          3.521
3     37.85    -122.25          3.413
4     37.85    -122.25          3.422

c. The \ to the right of the column head "AveOccup" in Part (b)'s output indicates that there are more columns displayed below. You'll see the \ only if the window in which IPython is running is too narrow to display all the columns left-to-right.
d. All of the above statements are true.

15.5.3 Visualizing the Features

15.5 Q6: Which of the following statements is false?
a. It's helpful to visualize your data by plotting the target value against each feature; in the case of the California Housing Prices dataset, to see how the median home value relates to each feature.
b. DataFrame method sample can randomly select a percentage of a DataFrame's data (specified by the keyword argument frac), as in:

sample_df = california_df.sample(frac=0.1, random_state=17)

c. The keyword argument random_state in Part (b)'s snippet enables you to seed the random number generator. Each time you use the same seed value, method sample selects a similar random subset of the DataFrame's rows.
d. All of the above statements are true.

15.5.4 Training the Model

15.5 Q7: Which of the following statements is false?
a. By default, a LinearRegression estimator uses all the features in the dataset's data array to perform a multiple linear regression.
b. An error occurs if any of the features passed to a LinearRegression estimator for training are categorical rather than numeric. If a dataset contains categorical data, you must exclude the categorical features from the training process.
c. A benefit of working with scikit-learn's bundled datasets is that they're already in the correct format for machine learning using scikit-learn's models.
d. All of the above statements are true.

15.5 Q8: In the context of the California Housing dataset, which of the following statements is false?
a. The following code creates a LinearRegression estimator and invokes its fit method to train the estimator using X_train and y_train:

from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)

b. Multiple linear regression produces separate coefficients for each feature (stored in coeff_) in the dataset and one intercept (stored in intercept).
c. For positive coefficients, the median house value increases as the feature value increases. For negative coefficients, the median house value decreases as the feature value decreases.
d. You can use the coefficient and intercept values with the following equation to make predictions:

y = m₁x₁ + m₂x₂ + … + mₙxₙ + b

where m₁, m₂, …, mₙ are the feature coefficients, b is the intercept, x₁, x₂, …, xₙ are the feature values (that is, the values of the independent variables), and y is the predicted value (that is, the dependent variable).

15.5.5 Testing the Model

15.5 Q9: Which of the following statements is false?
a. The following code tests a model by calling the estimator's predict method with the test samples as an argument:

predicted = linear_regression.predict(X_test)

b. Assuming the array expected contains the expected values for the samples used to make predictions in Part (a)'s snippet, evaluating the following snippets displays the first five predictions and their corresponding expected values:

In [32]: predicted[:5]
Out[32]: array([1.25396876, 2.34693107, 2.03794745, 1.8701254 , 2.53608339])
In [33]: expected[:5]
Out[33]: array([0.762, 1.732, 1.125, 1.37 , 1.856])

c. With classification, we saw that the predictions were distinct classes that matched existing classes in the dataset. With regression, it's tough to get exact predictions, because you have continuous outputs. Every possible value of x₁, x₂, …, xₙ in the calculation y = m₁x₁ + m₂x₂ + … + mₙxₙ + b predicts a different value.
d. All of the above statements are true.

15.5.6 Visualizing the Expected vs. Predicted Prices

No questions.

15.5.7 Regression Model Metrics

15.5 Q10: Which of the following statements is false?
a. Scikit-learn provides many metrics functions for evaluating how well estimators predict results and for comparing estimators to choose the best one(s) for your particular study.
b. Scikit-learn's metrics vary by estimator type.
c. The sklearn.metrics functions confusion_matrix and classification_report we used in the Digits dataset classification case study are two of many metrics functions specifically for evaluating regression estimators.
d. All of the above statements are true.

15.5 Q11: Which of the following statements is false?
a. Among the many metrics for regression estimators is the model's coefficient of determination, which is also called the R² score.
b. To calculate an estimator's R² score, the following code calls the sklearn.metrics module's r2_score function with the arrays representing the expected and predicted results:

In [44]: from sklearn import metrics

In [45]: metrics.r2_score(expected, predicted)
Out[45]: 0.6008983115964333

c. R² scores range from 0.0 to 1.0 with 1.0 being the best. An R² score of 1.0 indicates that the estimator perfectly predicts the independent variable's value, given the dependent variable(s) value(s). An R² score of 0.0 indicates the model cannot make predictions with any accuracy, based on the independent variables' values.
d. All of the above statements are true.

15.5 Q12: Which of the following statements is false?
a. Another common metric for regression models is the mean squared error, which calculates the difference between each expected and predicted value (this is called the error), squares each difference and calculates the average of the squared values.
b. To calculate an estimator's mean squared error, call function mean_squared_error (from module sklearn.metrics) with the arrays representing the expected and predicted results, as in:

In [46]: metrics.mean_squared_error(expected, predicted)
Out[46]: 0.5350149774449119

c. When comparing estimators with the mean squared error metric, the one with the value closest to 1 best fits your data.
d. All of the above statements are true.

15.5.8 Choosing the Best Model

15.5 Q13: Which of the following statements is false?
a. We can try several regression estimators to determine whether any produces better results than the LinearRegression estimator.
b. The following code from our example uses a linear_regression estimator we already created and creates ElasticNet, Lasso and Ridge regression estimators (all from the sklearn.linear_model module):

In [47]: from sklearn.linear_model import ElasticNet, Lasso, Ridge

In [48]: estimators = {
    ...:     'LinearRegression': linear_regression,
    ...:     'ElasticNet': ElasticNet(),
    ...:     'Lasso': Lasso(),
    ...:     'Ridge': Ridge()
    ...: }

c. The following code from our example runs the estimators using k-fold cross-validation with a KFold object and the cross_val_score function. The code passes to cross_val_score the additional keyword argument scoring='r2', which indicates that the function should report the R² scores for each fold; again, 1.0 is the best, so it appears that LinearRegression and Ridge are the best models for this dataset:

In [49]: from sklearn.model_selection import KFold, cross_val_score

In [50]: for estimator_name, estimator_object in estimators.items():
    ...:     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    ...:     scores = cross_val_score(estimator=estimator_object,
    ...:         X=california.data, y=california.target, cv=kfold,
    ...:         scoring='r2')
    ...:     print(f'{estimator_name:>16}: ' +
    ...:           f'mean of r2 scores={scores.mean():.3f}')
    ...:
LinearRegression: mean of r2 scores=0.599
      ElasticNet: mean of r2 scores=0.423
           Lasso: mean of r2 scores=0.285
           Ridge: mean of r2 scores=0.599

d. All of the above statements are true.

15.6 Case Study: Unsupervised Machine Learning, Part 1 – Dimensionality Reduction

15.6 Q1: Which of the following statements is false?
a. Unsupervised machine learning and visualization can help you get to know your data by finding patterns and relationships among unlabeled samples.
b. For datasets like the univariate time series we used earlier in this chapter, visualizing the data is easy. In that case, we had two variables, date and temperature, so we plotted the data in two dimensions with one variable along each axis.
c. Using Matplotlib, Seaborn and other visualization libraries, you also can plot datasets with three variables using 3D visualizations.
d. In the Digits dataset, every sample has 64 features (and a target value), so we cannot visualize the dataset.

15.6 Q2: Which of the following statements is false?
a. In big data, samples can have hundreds, thousands or even millions of features.
b. To visualize a dataset with many features (that is, many dimensions), you must first reduce the data to two or three dimensions. This requires a supervised machine learning technique called dimensionality reduction.
c. When you graph the resulting data after dimensionality reduction, you might see patterns in the data that will help you choose the most appropriate machine learning algorithms to use. For example, if the visualization contains clusters of points, it might indicate that there are distinct classes of information within the dataset.
d. All of the above statements are true.

15.6 Q3: Which of the following statements is false?
a. It's difficult for humans to think about data with large numbers of dimensions. This is called the curse of dimensionality.
b. If data has closely correlated features, some could be eliminated via dimensionality reduction to improve the training performance.
c. Eliminating features with dimensionality reduction improves the accuracy of the model.
d. All of the above statements are true.

15.6 Q4: Which of the following statements is false?
a. We can use the TSNE estimator (from the sklearn.manifold module) to perform dimensionality reduction. This estimator analyzes a dataset's features and reduces them to the specified number of dimensions.
b. The following code creates a TSNE object for reducing a dataset's features to two dimensions, as specified by the keyword argument n_components:

In [3]: from sklearn.manifold import TSNE

In [4]: tsne = TSNE(n_components=2, random_state=11)

c. The random_state keyword argument in Part (b) ensures the reproducibility of the "render sequence" when we display the digit clusters.
d. All of the above statements are true.

15.6 Q5: Which of the following statements is false?
a. Dimensionality reduction in scikit-learn typically involves two steps: training the estimator with the dataset, then using the estimator to transform the data into the specified number of dimensions.
b. The steps mentioned in Part (a) can be performed separately with the TSNE methods fit and transform, or they can be performed in one statement using the fit_transform method, as in:

In [5]: reduced_data = tsne.fit_transform(digits.data)

c. TSNE's fit_transform method takes some time to train the estimator then perform the reduction. When the method completes its task, it returns an array with the same number of rows as digits.data, but only two columns. You can confirm this by checking reduced_data's shape.
d. All of the above statements are true.

15.7 Case Study: Unsupervised Machine Learning, Part 2 – k-Means Clustering

15.7 Q1: Which of the following statements is false?
a. k-means clustering is perhaps the simplest unsupervised machine learning algorithm.
b. The k-means clustering algorithm analyzes unlabeled samples and attempts to place them in clusters that appear to be related.
c. The k in "k-means" represents the number of clusters you'd like to see imposed on your data.
d. The k-means clustering algorithm organizes samples into the number of clusters you specify in advance, using distance calculations similar to the k-nearest neighbors clustering algorithm.
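A hedged sketch of the TSNE reduction described in Q4 and Q5, followed by a scatter plot of the reduced Digits data; coloring the points by their known digit class and the choice of colormap are illustrative additions, not quiz code:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
tsne = TSNE(n_components=2, random_state=11)

# fit_transform trains the estimator and reduces the 64 features to 2
reduced_data = tsne.fit_transform(digits.data)
print(reduced_data.shape)                    # (1797, 2)

# color each point by its digit class to see whether the clusters match the labels
dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
                   c=digits.target, cmap='nipy_spectral_r')
plt.colorbar(dots)
plt.show()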
15.7 Q2: Which of the following statements about the k-means clustering algorithm is false?
a. Each cluster of samples is grouped around a centroid, the cluster's center point.
b. Initially, the algorithm chooses k centroids at random from the dataset's samples. Then the remaining samples are placed in the cluster whose centroid is the closest.
c. The centroids are iteratively recalculated and the samples re-assigned to clusters until, for all clusters, the distances from a given centroid to the samples in its cluster are maximized.
d. The algorithm's results are a one-dimensional array of labels indicating the cluster to which each sample belongs, and a two-dimensional array of centroids representing the center of each cluster.

15.7 Q3: Which of the following statements is false?
a. The Iris dataset bundled with scikit-learn is commonly analyzed with both classification and clustering.
b. Although the Iris dataset is labeled, we can ignore those labels to demonstrate clustering. Then, we can use the labels to determine how well the k-means algorithm clustered the samples.
c. The Iris dataset is referred to as a "toy dataset" because it has only 150 samples and four features. The dataset describes 50 samples for each of three Iris flower species: Iris setosa, Iris versicolor and Iris virginica.
d. All of the above statements are true.

15.7.3 Visualizing the Dataset with a Seaborn pairplot

15.7 Q4: Which of the following statements is false?
a. One way to learn more about your data is to see how the features relate to one another.
b. The samples in the Iris dataset each have four features.
c. We cannot graph one feature against the other three in a single graph. But we can plot pairs of features against one another in a pairplot.
d. All of the above statements are true.

15.7 Q5: Which of the following statements is false?
a. The following code uses Seaborn function pairplot to create a grid of graphs plotting each feature against itself and the other specified features:

import seaborn as sns
grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4], hue='species')

b. The pairplot keyword argument data is the DataFrame containing the data to plot.
c. The pairplot keyword argument vars is a sequence containing the names of the variables to plot. For a DataFrame, these are the names of the columns to plot. Here, we use the first five DataFrame columns, representing the sepal length, sepal width, petal length and petal width, respectively.
d. The pairplot keyword argument hue is the DataFrame column that's used to determine colors of the plotted data.

15.7.4 Using a KMeans Estimator

15.7 Q6: Which of the following statements is false?
a. We can use k-means clustering via scikit-learn's KMeans estimator (from the sklearn.cluster module) to place each sample in a dataset into a cluster. The KMeans estimator hides from you the algorithm's complex mathematical details, making it straightforward to use.
b. The following code creates a KMeans object:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=11)

c. The keyword argument n_clusters specifies the k-means clustering algorithm's hyperparameter k (in this case, 3), which KMeans requires to calculate the clusters and label each sample. The default value for n_clusters is 8.
d. All of the above statements are true.

15.7 Q7: Which of the following statements is false?
a. We train the KMeans estimator by calling the object's fit method; this performs the k-means algorithm.
b. As with the other estimators, the fit method returns the estimator object.
c. When the training completes, the KMeans object contains a labels_ array with values from 0 to n_clusters - 1 (in the Iris dataset example, 0–2), indicating the clusters to which the samples belong, and a cluster_centers_ array in which each row represents a cluster.
d. All of the above statements are true.

15.7 Q8: Which of the following statements is false?
a. Because the Iris dataset is labeled, we can look at its target array values to get a sense of how well the k-means algorithm clustered the samples for the three Iris species.
b. In the Iris dataset, the first 50 samples are Iris setosa, the next 50 are Iris versicolor, and the last 50 are Iris virginica.
c. If the KMeans estimator chose the Iris dataset clusters perfectly, then each group of 50 elements in the estimator's labels_ array should have mostly the same label.
d. All of the above statements are true.
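A short sketch tying Q6 through Q8 together, assuming the bundled Iris dataset; printing label counts per 50-sample group is just a convenient way to eyeball how the clusters line up with the three species:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
kmeans = KMeans(n_clusters=3, random_state=11)
kmeans.fit(iris.data)                        # fit returns the estimator itself

# the first 50 samples are setosa, the next 50 versicolor, the last 50 virginica
for start in (0, 50, 100):
    labels, counts = np.unique(kmeans.labels_[start:start + 50], return_counts=True)
    for label, count in zip(labels, counts):
        print(f'samples {start}-{start + 49}: label={label}, count={count}')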
15.7.5 Dimensionality Reduction with Principal Component Analysis

15.7 Q9: Which of the following statements is false?
a. The PCA estimator (from the sklearn.decomposition module), like TSNE, performs dimensionality reduction. The PCA estimator uses an algorithm called principal component analysis to analyze a dataset's features and reduce them to the specified number of dimensions.
b. Like TSNE, a PCA estimator uses the keyword argument n_components to specify the number of dimensions, as in:

from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=11)

c. The following snippets train the PCA estimator and produce the reduced data by calling the PCA estimator's fit and transform methods:

pca.fit(iris.data)
iris_pca = pca.transform(iris.data)

d. All of the above statements are true.

15.7 Q10: Which of the following statements is false?
a. You can use Seaborn's scatterplot function to plot the reduced Iris data in two dimensions.
b. The following code transforms the reduced data into a DataFrame and adds a species column:

iris_pca_df = pd.DataFrame(iris_pca, columns=['Component1', 'Component2'])
iris_pca_df['species'] = iris_df.species

c. The following code scatterplots the DataFrame in Part (b) using Seaborn:

In [39]: axes = sns.scatterplot(data=iris_pca_df, x='Component1',
    ...:     y='Component2', hue='species', legend='brief',
    ...:     palette='cool')

d. All of the above statements are true.

15.7 Q11: Which of the following statements is false?
a. Each centroid in the KMeans object's cluster_centers_ array has the same number of features as the original dataset (four in the case of the Iris dataset).
b. To plot the centroids in two dimensions, you must reduce their dimensions.
c. You can think of a centroid as the "mode" sample in its cluster.
d. Each centroid should be transformed using the same PCA estimator used to reduce the other samples in that cluster.
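A hedged sketch combining Q9 through Q11: reduce the Iris samples with PCA, reduce the KMeans centroids with the same PCA estimator, and overlay them. The marker size and color for the centroids are arbitrary illustrative choices:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
kmeans = KMeans(n_clusters=3, random_state=11).fit(iris.data)

pca = PCA(n_components=2, random_state=11)
pca.fit(iris.data)
iris_pca = pca.transform(iris.data)                  # 150 rows, 2 columns

iris_pca_df = pd.DataFrame(iris_pca, columns=['Component1', 'Component2'])
iris_pca_df['species'] = [iris.target_names[t] for t in iris.target]

axes = sns.scatterplot(data=iris_pca_df, x='Component1', y='Component2',
                       hue='species', legend='brief', palette='cool')

# reduce the centroids with the same PCA estimator, then overlay them
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=100, c='k')
plt.show()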
15.7.6 Choosing the Best Clustering Estimator

15.7 Q12: Which of the following statements is false?
a. You can run multiple clustering algorithms and compare how well they cluster the three species of Iris flowers.
b. The following code creates a dictionary of clustering estimators:

from sklearn.cluster import DBSCAN, MeanShift,\
    SpectralClustering, AgglomerativeClustering

estimators = {
    'KMeans': kmeans,
    'DBSCAN': DBSCAN(),
    'MeanShift': MeanShift(),
    'SpectralClustering': SpectralClustering(n_clusters=3),
    'AgglomerativeClustering': AgglomerativeClustering(n_clusters=3)
}

c. Like KMeans, you specify the number of clusters in advance for the SpectralClustering and AgglomerativeClustering, DBSCAN and MeanShift estimators.
d. All of the above statements are true.

15.7 Q17: Which of the following statements is false?
a. Each iteration of the following loop calls one clustering estimator's fit method with iris.data as an argument, then uses NumPy's unique function to get the cluster labels and counts for the three groups of 50 samples and displays the results.

In [45]: import numpy as np

In [46]: for name, estimator in estimators.items():
    ...:     estimator.fit(iris.data)
    ...:     print(f'\n{name}:')
    ...:     for i in range(0, 101, 50):
    ...:         labels, counts = np.unique(
    ...:             estimator.labels_[i:i+50], return_counts=True)
    ...:         print(f'{i}-{i+50}:')
    ...:         for label, count in zip(labels, counts):
    ...:             print(f'   label={label}, count={count}')
    ...:

KMeans:
0-50:
   label=1, count=50
50-100:
   label=0, count=48
   label=2, count=2
100-150:
   label=0, count=14
   label=2, count=36

DBSCAN:
0-50:
   label=-1, count=1
   label=0, count=49
50-100:
   label=-1, count=6
   label=1, count=44
100-150:
   label=-1, count=10
   label=1, count=40

MeanShift:
0-50:
   label=1, count=50
50-100:
   label=0, count=49
   label=1, count=1
100-150:
   label=0, count=50

SpectralClustering:
0-50:
   label=2, count=50
50-100:
   label=1, count=50
100-150:
   label=0, count=35
   label=1, count=15

AgglomerativeClustering:
0-50:
   label=1, count=50
50-100:
   label=0, count=49
   label=2, count=1
100-150:
   label=0, count=15
   label=2, count=35

b. Interestingly, the output in Part (a) shows that DBSCAN correctly predicted three clusters (labeled -1, 0 and 1), though it placed 84 of the 100 Iris virginica and Iris versicolor samples in the same cluster.
c. The output in Part (a) shows that the MeanShift estimator, on the other hand, predicted only two clusters (labeled as 0 and 1), and placed 99 of the 100 Iris virginica and Iris versicolor samples in the same cluster.
d. All of the above statements are true.
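Beyond eyeballing label counts as in Q17, one common way to quantify how closely each clustering matches the known species labels is the adjusted Rand index from sklearn.metrics. This comparison is an added sketch, not part of the quiz:

from sklearn.cluster import (AgglomerativeClustering, DBSCAN, KMeans,
                             MeanShift, SpectralClustering)
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
estimators = {
    'KMeans': KMeans(n_clusters=3, random_state=11),
    'DBSCAN': DBSCAN(),
    'MeanShift': MeanShift(),
    'SpectralClustering': SpectralClustering(n_clusters=3),
    'AgglomerativeClustering': AgglomerativeClustering(n_clusters=3)
}

# adjusted Rand index: 1.0 means the clustering matches the species labels exactly,
# values near 0.0 mean the agreement is no better than chance
for name, estimator in estimators.items():
    estimator.fit(iris.data)
    print(f'{name:>23}: {adjusted_rand_score(iris.target, estimator.labels_):.3f}')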