Problem 3
The data science team in a genetic testing company has developed a predictive model to
identify Type 1 Gaucher disease. From domain knowledge, the prevalence of Type 1 Gaucher
disease in the US population is 4%. The model was built on a dataset of 4000 samples, of
which 1800 samples were diagnosed as positive. The team partitioned the dataset into 70%
training and 30% validation with a stratified sampling technique. The sensitivity and
specificity achieved on the validation set are 70% and 90%, respectively.
a. Calculate the adjusted misclassification rate, precision, and recall on the validation
set. Comment on the model performance.
Records in dataset = 4000
Training data = (70/100)*4000 = 2800
Validation data = (30/100)*4000 = 1200
Because stratified sampling preserves the class ratio (1800/4000 = 45% positive), the 1200
validation records contain 540 positive (diseased) and 660 negative (non-diseased) records.
Applying the 70% sensitivity and 90% specificity to these counts gives the following
confusion matrix:
                   Predicted Positive   Predicted Negative
Actual Positive           378                  162
Actual Negative            66                  594
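The counts in this matrix can be reproduced with a short sketch. The function name validation_confusion_matrix and its parameters are illustrative only and are not part of the original solution; it assumes stratified sampling keeps the positive share identical in every partition.

```python
def validation_confusion_matrix(n_total, n_positive, val_fraction,
                                sensitivity, specificity):
    """Return (TP, FN, FP, TN) expected on a stratified validation split."""
    # Stratified sampling: the validation split keeps the dataset's class ratio.
    n_val = n_total * val_fraction
    val_pos = n_positive * val_fraction
    val_neg = n_val - val_pos

    tp = sensitivity * val_pos   # diseased records correctly flagged
    fn = val_pos - tp            # diseased records missed
    tn = specificity * val_neg   # healthy records correctly cleared
    fp = val_neg - tn            # healthy records wrongly flagged
    return round(tp), round(fn), round(fp), round(tn)


print(validation_confusion_matrix(4000, 1800, 0.30, 0.70, 0.90))
# (378, 162, 66, 594)
```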
Only 4% of the US population has the disease, so the validation set (45% positive)
oversamples the positive class. Rescaling the 1200 validation records to a 4% prevalence
(48 positive and 1152 negative) while keeping the same sensitivity and specificity gives
the corrected confusion matrix:
                   Predicted Positive   Predicted Negative
Actual Positive            34                   14
Actual Negative           115                 1037
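The prevalence adjustment behind the corrected matrix can be sketched as below. This is a minimal illustration, assuming sensitivity and specificity do not change with prevalence; the function name adjusted_confusion_matrix is hypothetical.

```python
def adjusted_confusion_matrix(n_val, prevalence, sensitivity, specificity):
    """Rescale a confusion matrix to the population prevalence.

    Returns unrounded (TP, FN, FP, TN) so that rounding is left to the caller.
    """
    pos = n_val * prevalence    # expected diseased records at true prevalence
    neg = n_val - pos           # expected healthy records
    tp = sensitivity * pos
    fn = pos - tp
    tn = specificity * neg
    fp = neg - tn
    return tp, fn, fp, tn


tp, fn, fp, tn = adjusted_confusion_matrix(1200, 0.04, 0.70, 0.90)
print(round(tp), round(fn), round(fp), round(tn))
```

The unrounded values are 33.6, 14.4, 115.2, and 1036.8, which round to the counts shown in the table.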
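Using the corrected confusion matrix (TP = 34, FN = 14, FP = 115, TN = 1037):
Misclassification rate = (FP+FN)/(TP+TN+FP+FN) = (115+14)/(34+1037+115+14) = 129/1200 = 0.108
Recall = TP/(TP+FN) = 34/(34+14) = 34/48 = 0.71
Precision = TP/(TP+FP) = 34/(34+115) = 34/149 = 0.23
Recall is 0.70 before the counts are rounded, which equals the sensitivity. The adjusted
error rate of about 11% looks acceptable and recall stays near 70%, but precision is only
about 23%: at the true 4% prevalence, roughly three out of four positive predictions are
false positives.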
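As a cross-check, the three metrics can be computed directly from the counts in the corrected confusion matrix (34, 14, 115, 1037), using the standard definitions recall = TP/(TP+FN) and precision = TP/(TP+FP). The helper name metrics is illustrative only.

```python
def metrics(tp, fn, fp, tn):
    """Compute misclassification rate, recall, and precision from counts."""
    total = tp + fn + fp + tn
    return {
        "misclassification_rate": (fp + fn) / total,
        "recall": tp / (tp + fn),        # share of actual positives found
        "precision": tp / (tp + fp),     # share of predicted positives that are real
    }


result = metrics(tp=34, fn=14, fp=115, tn=1037)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```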
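b. Recommend another scheme to deal with the unbalanced data for this data science team.
Keep the validation set at the natural class distribution and rebalance only the training
data, for example by oversampling the minority class with a technique such as SMOTE. The
validation metrics would then reflect the real-world prevalence directly, which eliminates
the prevalence adjustment when calculating the performance metrics.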
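The idea of rebalancing only the training partition can be sketched with plain random oversampling. Note that this is a simpler stand-in for SMOTE, which creates synthetic minority points by interpolating between neighbours rather than duplicating rows. The function name oversample_minority and the toy data are hypothetical, and the sketch assumes binary labels (0 and 1) with both classes present.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Duplicate minority-class rows at random until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present")

    # Draw extra minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy training partition at 4% prevalence: 4 positives, 96 negatives.
train_rows = list(range(100))
train_labels = [1] * 4 + [0] * 96
bal_rows, bal_labels = oversample_minority(train_rows, train_labels)
print(len(bal_labels), sum(bal_labels))
```

Only the training partition is passed to the function; the validation partition is left untouched so that its class ratio still matches the population.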