

In this problem, we use the "breast cancer Wisconsin dataset" from scikit-learn to train and evaluate classification models. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image. The dataset comprises 30 features (mean radius, mean texture, mean perimeter, etc.) and a target variable, or class.

 

(a) Import this dataset from scikit-learn to form the input data matrix X and the target vector y. What is the sample size? How many different classes do we have in this problem?
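One way to do this, loading the dataset with `return_X_y=True` to get the matrix and target vector directly:

```python
from sklearn.datasets import load_breast_cancer

# return_X_y=True returns the feature matrix X and the target vector y directly.
X, y = load_breast_cancer(return_X_y=True)

print(X.shape)      # (569, 30): 569 samples, 30 features
print(len(set(y)))  # 2 classes (0 = malignant, 1 = benign)
```

So the sample size is 569 and there are two classes.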

 

(b) Split the dataset using the function train_test_split() with 70% training data.
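A minimal sketch of the split; the `random_state` seed is an arbitrary choice added here for reproducibility, not required by the question:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 70% training / 30% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

print(X_train.shape[0], X_test.shape[0])  # 398 171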

 

(c) Build a logistic regression model to predict classes for the testing data. In this problem, you can use default values for all input arguments, but you will receive a warning that the optimizer does not converge. To fix this, increase the maximum number of iterations, e.g., max_iter=10000.
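A sketch of the fit, assuming the same 70/30 split as in part (b):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# The default lbfgs solver needs more than the default 100 iterations
# to converge on this unscaled data, hence the larger max_iter.
clf = LogisticRegression(max_iter=10_000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(clf.score(X_test, y_test))  # mean accuracy on the test set
```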

 

(d) Plot the precision-recall curve given the classifier trained in the previous part and the held-out testing set. Feel free to review the documentation page: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.PrecisionRecallDisplay.html#sklearn.metrics.PrecisionRecallDisplay.from_estimator

 

(e) While PrecisionRecallDisplay provides a friendly and easy way to plot the precision-recall curve, it is not trivial to find the "best" trade-off between precision and recall. What is the probability threshold for the trained logistic regression classifier that achieves the maximum F1 score? Plot the confusion matrix for this classifier.

 

(f) Based on your observations, prove the following statement: if a classifier produces an equal number of false positives and false negatives, then precision = recall = F1 score.
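A proof sketch for part (f), using the standard confusion-matrix counts (TP = true positives, FP = false positives, FN = false negatives):

```latex
% Assume FP = FN. Substituting FN for FP in the precision formula:
\begin{align*}
\text{precision} &= \frac{TP}{TP + FP} = \frac{TP}{TP + FN} = \text{recall}.
\end{align*}
% Writing p for the common value of precision and recall, the F1 score
% (their harmonic mean) collapses to the same value:
\begin{align*}
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}
           {\text{precision} + \text{recall}}
    = \frac{2p^2}{2p} = p.
\end{align*}
% Hence precision = recall = F1.
```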
