Assign-ClassificationError
School: Simon Fraser University
Course: 340
Subject: Mathematics
Date: Apr 3, 2024
Pages: 11
Uploaded by DeaconLion805
Assignment: Understanding Classification Error

Goal
In this activity you will practice calculating the ROC curve and computing the confusion matrix.

I. Evaluate an AI-based COVID-19 Diagnosis System

Note: The information provided here is based on real research; however, given the seriousness of COVID-19, please assume the information here is hypothetical, may contain errors, and should not be used for any purpose beyond this assignment.

The current gold standard for diagnosis of COVID-19 is the real-time polymerase chain reaction (RT-PCR) lab test [Bai 2020]. However, lab resources are expensive, limited, and time-consuming. A quicker, cheaper, and non-invasive alternative may be to perform CT imaging and use features of the CT images, such as peripheral distribution, ground-glass opacity, and vascular thickening, for diagnosis [Bai 2020]. Assume the scientists designed an alternative AI system, which takes in a CT image, recognizes the ground-glass opacity (GGO) feature, and performs the diagnosis in a few seconds. However, there is a trade-off between efficiency and accuracy, so we have to evaluate how much we can trust the system.

(Simulated) Dataset: 100 patients were tested by both RT-PCR and the CT-based AI system. 51 patients were diagnosed by RT-PCR (the gold standard) as positive (True), while 49 tested negative (False). The raw GGO values were collected from the AI system before any thresholding. The GGO values and the diagnoses are saved in data/GGO_value.mat and data/diagnosis.mat, respectively.

Question 1
Assume the GGO values of positive and negative patients follow Gaussian distributions (see the two schematic plots below). Notice there is overlap between the two distributions, which means that different thresholds give different prediction results.
(a) Using MATLAB, load the data and find the mean and standard deviation (std) of the Gaussian that models the positive distribution for the 51 subjects.
(b) Using MATLAB, find the mean and std of the Gaussian that models the negative distribution for the 49 subjects.
(c) Plot the two distributions in MATLAB using the mean and std values found in parts (a) and (b). Label your axes to obtain a figure similar to the schematic plot shown above. Hint: use MATLAB's normpdf function.

Question 1. Your Answers:
a)
Mean of positive subjects: 72.57
Std of positive subjects: 12.57
b)
Mean of negative subjects: 29.80
Std of negative subjects: 10.43
c)
Paste plot here:
Paste Code Here:
Loading data:
% Load the data from the provided MATLAB files
load('data/GGO_value.mat'); % Assuming GGO_value.mat contains a variable named 'GGO_values'
load('data/diagnosis.mat'); % Assuming diagnosis.mat contains a variable named 'diagnosis'

% Separate GGO values for positive and negative subjects
ggo_positive = GGO_values(diagnosis == 1);
ggo_negative = GGO_values(diagnosis == 0);
Calculating mean and std:
% Calculate mean and standard deviation for positive and negative distributions
mean_positive = mean(ggo_positive);
std_positive = std(ggo_positive);
mean_negative = mean(ggo_negative);
std_negative = std(ggo_negative);

Choose the threshold range and step
% Choose the threshold range and step (customize as needed)
threshold_range = linspace(min(GGO_values), max(GGO_values), 1000);

Build the distribution function
% Build the distribution function (PDF) for positive and negative subjects
pdf_positive = normpdf(threshold_range, mean_positive, std_positive);
pdf_negative = normpdf(threshold_range, mean_negative, std_negative);

Plot your figures
% Plot the distributions
figure;
plot(threshold_range, pdf_positive, 'b', 'LineWidth', 2, 'DisplayName', 'Positive Distribution');
hold on;
plot(threshold_range, pdf_negative, 'r', 'LineWidth', 2, 'DisplayName', 'Negative Distribution');
xlabel('GGO Values');
ylabel('Probability Density');
title('Gaussian Distributions for Positive and Negative Subjects');
legend('Location', 'best');
grid on;
hold off;
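As a cross-check on the normpdf step, the two bell curves can also be evaluated directly from the Gaussian formula, which needs neither the data files nor a toolbox function. The sketch below is only an illustration: it assumes the rounded means and stds reported in the answers above, and the helper name gauss_pdf and the grid limits are hypothetical choices, not part of the assignment. Note that MATLAB's std normalizes by N-1 by default, which is what the reported values are assumed to use.

```matlab
% Reported sample statistics from Question 1 (rounded to two decimals).
mu_pos = 72.57;  sigma_pos = 12.57;   % RT-PCR positive subjects
mu_neg = 29.80;  sigma_neg = 10.43;   % RT-PCR negative subjects

% Gaussian PDF written out explicitly (hypothetical helper name).
gauss_pdf = @(x, mu, sigma) exp(-0.5 * ((x - mu) ./ sigma).^2) ./ (sigma * sqrt(2 * pi));

% Illustrative grid, chosen wide enough to cover both bells.
x = linspace(-20, 140, 1601);
pdf_pos = gauss_pdf(x, mu_pos, sigma_pos);
pdf_neg = gauss_pdf(x, mu_neg, sigma_neg);

% Sanity check: each density should integrate to approximately 1.
area_pos = trapz(x, pdf_pos);
area_neg = trapz(x, pdf_neg);
fprintf('Area under positive PDF: %.4f\n', area_pos);
fprintf('Area under negative PDF: %.4f\n', area_neg);
```

The vectors pdf_pos and pdf_neg can be passed to plot in place of the normpdf results if the two approaches need to be compared.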
Question 2
Given the two Gaussian distributions in Question 1, the goal is to construct the corresponding ROC curve.
a) Choose your threshold values to construct the ROC curve. Make sure your choice contains at least 10 different values.
b) Plot the ROC curve with the true positive rate (TPR), in percent, along the vertical axis versus the false positive rate (FPR) along the horizontal axis, using both the erf table and the normcdf function. You are therefore expected to produce 2 plots.
Note:
• The ROC should show the operating points for equally spaced thresholds.
• Do not use the raw data to calculate the operating points; instead, use the Gaussian distributions.
• To calculate the needed integrals (i.e., the CDF), refer to the lecture slides and make use of the values given in the provided file erf_tables.pdf. Note: for x < 0, erf(x) is negative and is equal to -erf(-x) as read from the table.
• Use MATLAB to plot and make sure that the operating points are clearly visible. Double-check your answers using MATLAB's built-in function normcdf to calculate the integrals over a Gaussian distribution. You can use a finer threshold grid so the ROC curve will look smoother.

Question 2. Your Answers:
a)
Choose threshold values of the ROC operating points:
% Choose equally spaced thresholds
num_thresholds = 10;
thresholds = linspace(min(GGO_values), max(GGO_values), num_thresholds);
b)
Explain how you used the erf table for three example threshold values:
To calculate the CDF value of a Gaussian with mean mu and std sigma at a threshold T, we first standardize the threshold to x = (T - mu) / (sigma * sqrt(2)) and read erf(x) from the table. With the standard definition of erf, the CDF is then 0.5 * (1 + erf(x)). If x is positive, read erf(x) directly. If x is negative, read erf(-x) from the table and negate it, since erf(x) = -erf(-x).

Paste MATLAB code for plotting the ROC curve (use scattered points instead of line segments) here:
% Choose equally spaced thresholds
num_thresholds = 10;
thresholds = linspace(min(GGO_values), max(GGO_values), num_thresholds);

% A patient is classified as positive when GGO > threshold, so
% TPR = 1 - CDF_positive(T) and FPR = 1 - CDF_negative(T).

% Calculate TPR and FPR from the fitted Gaussians using erf
% (the built-in erf stands in for values read from the erf table):
% CDF(T) = 0.5 * (1 + erf((T - mu) / (sigma * sqrt(2))))
tpr_erf = 1 - 0.5 * (1 + erf((thresholds - mean_positive) / (std_positive * sqrt(2))));
fpr_erf = 1 - 0.5 * (1 + erf((thresholds - mean_negative) / (std_negative * sqrt(2))));

% Calculate TPR and FPR using normcdf
tpr_normcdf = 1 - normcdf(thresholds, mean_positive, std_positive);
fpr_normcdf = 1 - normcdf(thresholds, mean_negative, std_negative);

% Plot the ROC curves
figure;
subplot(1, 2, 1);
scatter(fpr_erf, tpr_erf, 'b', 'filled');
xlabel('False Positive Rate (FPR)');
ylabel('True Positive Rate (TPR)');
title('ROC Curve (Using erf Table)');
grid on;
subplot(1, 2, 2);
scatter(fpr_normcdf, tpr_normcdf, 'r', 'filled');
xlabel('False Positive Rate (FPR)');
ylabel('True Positive Rate (TPR)');
title('ROC Curve (Using normcdf)');
grid on;
sgtitle('Receiver Operating Characteristic (ROC) Curves');

Paste ROC Figure here:
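The erf-table step asked for in part (b) can be illustrated with three example thresholds. This is a minimal sketch under stated assumptions: it uses the rounded means and stds reported in Question 1, it treats a patient as positive when the GGO value exceeds the threshold, it assumes the standard definition of erf, and it lets MATLAB's built-in erf stand in for a manual table lookup. The thresholds 30, 50, and 70 and the helper name erf_lookup are hypothetical choices made for illustration only.

```matlab
% Rounded Gaussian parameters reported in Question 1.
mu_pos = 72.57;  sigma_pos = 12.57;   % positive subjects
mu_neg = 29.80;  sigma_neg = 10.43;   % negative subjects

% Three illustrative thresholds (an assumption, not from the assignment).
T = [30, 50, 70];

% Standardized arguments at which erf would be read from the table.
x_pos = (T - mu_pos) / (sigma_pos * sqrt(2));
x_neg = (T - mu_neg) / (sigma_neg * sqrt(2));

% A table lists erf only for x >= 0; for x < 0 use erf(x) = -erf(-x).
% The built-in erf stands in for the table lookup here.
erf_lookup = @(x) sign(x) .* erf(abs(x));

% Gaussian CDF at each threshold: 0.5 * (1 + erf(x)).
cdf_pos = 0.5 * (1 + erf_lookup(x_pos));
cdf_neg = 0.5 * (1 + erf_lookup(x_neg));

% Positive prediction means GGO > T, so both rates are upper tails.
tpr = 1 - cdf_pos;
fpr = 1 - cdf_neg;

for k = 1:numel(T)
    fprintf('T = %5.1f: TPR = %6.2f%%, FPR = %6.2f%%\n', T(k), 100 * tpr(k), 100 * fpr(k));
end
```

Raising the threshold lowers both rates, which is why the operating points trace the ROC curve from the upper right toward the lower left.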
Question 3
Now we look at a dataset collected by a deployed beta version of the AI system, which resulted in the following ROC curve:

[Figure: ROC curve with nine operating points, green dots 'a' through 'i']

Your task is to control the misdiagnosis ratio by reaching a trade-off between TPR and FPR.
1. In particular, you need to tune certain hyper-parameters (e.g., the threshold T) of the decision system such that $\text{FNR} \le 20\%$ with the lowest possible FPR. Among the 9 possible operating points (green dots 'a' through 'i') in the plot above, which one would you choose to satisfy the requirement? Please explain.
2. Assume 20 AI-diagnosed patients had GGO values
$V = [70, 60, 30, 80, 40, 20, 50, 90, 85, 45, 75, 65, 55, 15, 35, 45, 45, 65, 65, 75]$.
These 20 patients then underwent the more reliable RT-PCR test, which returned, 72 hours later, the following diagnoses, which we regard as "truth":
C = [P, P, N, P, N, N, N, P, P, N, P, P, P, N, N, N, N, P, P, N],
where P is positive (i.e., COVID-19) and N is negative (i.e., non-COVID-19). Choose the threshold so that the AI-classification results have $\text{FNR} \le 10\%$ and $\text{FPR} \le 40\%$. Justify your choice. Note: Calculate FNR using the data points and not a fitted Gaussian or any other distribution.
3. Using the threshold you chose in question 2, answer the following questions:
a. How many were misdiagnosed?
b. How many sick patients were diagnosed as healthy?
c. How many healthy patients were diagnosed as sick?
d. What is the false negative ratio?
4. Calculate the entries of the 2x2 confusion matrix for the 20 patients using the threshold chosen in question 2. Use the number of patients in the entries, e.g., the number of patients that are N but were misdiagnosed as P, etc.
5. Now draw another confusion matrix and enter the percentages instead, i.e., out of 100% of negative cases, what percent were correctly classified as N, etc.

Question 3. Your Answers:
1.
Choose operating point.
The requirement is FNR ≤ 20% with the lowest possible FPR. Since TPR = 1 - FNR, this is equivalent to TPR ≥ 80%. Among the operating points that reach a TPR of at least 80%, we choose the one with the smallest FPR, i.e., the leftmost such point on the ROC plot. That point keeps the sensitivity needed to limit missed positive cases while giving the best available specificity (fewest healthy patients diagnosed as sick).
2.
a.
List and sort the positive and negative GGO values (you can use the MATLAB command sort here)
Sorted positive GGO values (51 subjects):
41 47 53 57 57 58 59 59 61 61 62 62 62 66 67 68 68 68 69 69 69 70 73 73 73
73 73 73 75 75 76 77 77 77 77 77 79 79 80 81 81 84 84 84 84 85 86 88 98 100
106
Sorted negative GGO values (49 subjects):
11 12 15 16 16 18 18 19 19 19 19 22 22 22 23 23 24 26 28 28 28 28 28 29 30
30 31 31 31 32 32 32 33 34 35 36 37 37 37 38 39 41 41 44 45 45 46 54 56
b.
How to choose a threshold so that FNR ≤ 10%?
A patient is classified as positive when the GGO value is greater than the threshold T, so a false negative is a truly positive patient whose value is at or below T. The 20 patients in V include 10 positives, whose sorted GGO values are 55, 60, 65, 65, 65, 70, 75, 80, 85, 90. FNR ≤ 10% allows at most one false negative, which requires T < 60.
c.
How to choose a threshold so that FPR ≤ 40%?
A false positive is a truly negative patient whose value is above T. The 10 negatives have sorted GGO values 15, 20, 30, 35, 40, 45, 45, 45, 50, 75. FPR ≤ 40% allows at most four false positives. Five negatives lie above T = 40 but only two lie above T = 45, so this requires T ≥ 45.
d.
What's your choice of the final threshold?
Both constraints hold for 45 ≤ T < 60. Within this range, T = 52 (or any T with 50 ≤ T < 55) gives the fewest misdiagnoses: no positive patient falls at or below T, and only the negative patient with GGO value 75 lies above it. This gives FNR = 0% and FPR = 10%.
3.
a. Misdiagnosed number of patients
$\text{Misdiagnosed} = \text{FalsePositives (FP)} + \text{FalseNegatives (FN)} = 1 + 0 = 1$
b. Sick diagnosed as healthy
$\text{FalseNegatives (FN)} = 0$
c. Healthy diagnosed as sick
$\text{FalsePositives (FP)} = 1$
d. FNR
$\text{FNR} = \frac{\text{FN}}{\text{FN} + \text{TP}} = \frac{0}{0 + 10} = 0$
4.
a. Confusion matrix in number

                      Predicted Positive (P)   Predicted Negative (N)
Actual Positive (P)   10 (True Positives)      0 (False Negatives)
Actual Negative (N)   1 (False Positives)      9 (True Negatives)

b. Confusion matrix in percentage

                      Predicted Positive (P)   Predicted Negative (N)
Actual Positive (P)   100%                     0%
Actual Negative (N)   10%                      90%

References
[Bai 2020] H. X. Bai et al., "Performance of radiologists in differentiating COVID-19 from viral pneumonia on chest CT," Radiology, 2020. DOI: https://doi.org/10.1148/radiol.2020200823
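The counts for Question 3 can be checked directly from the 20 data points. The sketch below is one possible way to do this, under stated assumptions: a patient is classified as positive when the GGO value is strictly greater than the threshold, the truth vector C is encoded as a character array, and T = 52 is an illustrative threshold rather than a value prescribed by the assignment. The variable names are hypothetical.

```matlab
% GGO values and RT-PCR truth for the 20 patients in Question 3.
V = [70, 60, 30, 80, 40, 20, 50, 90, 85, 45, 75, 65, 55, 15, 35, 45, 45, 65, 65, 75];
C = 'PPNPNNNPPNPPPNNNNPPN';   % P = positive, N = negative
is_pos = (C == 'P');

% Sweep candidate thresholds taken from the observed values and keep
% those that satisfy FNR <= 10% and FPR <= 40%.
feasible = [];
for t = unique(V)
    fnr_t = sum(V(is_pos) <= t) / sum(is_pos);    % positives at or below t
    fpr_t = sum(V(~is_pos) > t) / sum(~is_pos);   % negatives above t
    if fnr_t <= 0.10 && fpr_t <= 0.40
        feasible(end + 1) = t;
    end
end
disp(feasible);

% Confusion matrix for one illustrative threshold (an assumption).
T = 52;
pred_pos = V > T;
TP = sum(pred_pos & is_pos);
FN = sum(~pred_pos & is_pos);
FP = sum(pred_pos & ~is_pos);
TN = sum(~pred_pos & ~is_pos);

FNR = FN / (FN + TP);
FPR = FP / (FP + TN);

% Rows: actual P, actual N. Columns: predicted P, predicted N.
confusion_counts = [TP, FN; FP, TN];
confusion_percent = 100 * confusion_counts ./ repmat(sum(confusion_counts, 2), 1, 2);
disp(confusion_counts);
disp(confusion_percent);
```

With these inputs the sweep keeps 45, 50, and 55 as the observed values that satisfy both constraints. Changing the comparison to V >= T shifts which boundary values are feasible, so the convention should match the one used in the written answer.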