Lab 3
Ashim Neupane
MSDS-632-M50 Big Data
Dr. Billy Chestnut
10/29/2023
Q 1
Entropy is a measure of disorder within a system. In binary classification, entropy is used
to measure how well the data is split into two classes (Ruby & Yendapalli, 2020). The entropy of
a classification is calculated by using the following formula:
Entropy = -p*log2(p) - q*log2(q)
where p is the probability of belonging to one class, and q is the probability of belonging
to the other class.
The entropy of a binary classification can take any value between 0 and 1. The minimum
entropy is 0, which occurs when the data is pure, that is, when every data point belongs to the
same class (p = 0 or p = 1). The maximum entropy is 1, which occurs when the data is split
evenly between the two classes, so each class has a probability of 0.5.
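The formula above can be written as a small Python function; `binary_entropy` is an illustrative helper name, not something defined in the assignment.

```python
import math

def binary_entropy(p):
    """Shannon entropy of a binary classification, where p is the
    probability of belonging to one class and 1 - p to the other."""
    if p in (0.0, 1.0):
        return 0.0  # a pure split has no disorder
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)
```

For example, an even split gives the maximum entropy of 1.0, while a pure split gives 0.0.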
Q 2
In a decision tree algorithm, attributes are picked to 'split' the data based on which
variable provides the most information gain. This splitting strategy is greedy: at each node the
algorithm makes the locally best choice without considering whether the resulting tree is
globally optimal.
The first step in picking an attribute is to calculate the entropy of the dataset, which
measures how mixed the class labels are. The attribute that provides the highest Information
Gain is then selected. Information Gain is the decrease in entropy produced by the split: the
higher the Information Gain, the better the split, because it leaves the data more organized.
Once the attribute with the best Information Gain is chosen, the data is split into new
branches, one for each subset of the attribute's values. This process repeats recursively, with
the algorithm recomputing the Information Gain and choosing the best attribute at each step,
until a terminal leaf is reached that cannot usefully be split further.
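The splitting criterion described above can be sketched as follows; the function names are illustrative, assuming class labels are stored in plain Python lists.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

def information_gain(parent_labels, subsets):
    """Entropy of the parent minus the size-weighted entropy
    of the child subsets produced by a candidate split."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted
```

A split that separates the classes perfectly recovers all of the parent's entropy: splitting four 1s and four 0s into two pure subsets yields an information gain of 1.0.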
Q 3
The probability that John has swine flu is only about 2%. Swine flu occurs in 1 in 5,000
people, so the prior probability of having it is 0.02%, and the test is 99% accurate, meaning
1% of healthy people receive a false positive. To calculate this, we use Bayes' theorem:
Probability (of having swine flu given a positive test result) = (probability of having swine
flu * accuracy of the test) / (probability of having swine flu * accuracy of the test +
probability of not having swine flu * (1 - accuracy of the test)). Plugging in the numbers:
(0.0002 * 0.99) / (0.0002 * 0.99 + 0.9998 * 0.01) ≈ 0.019, or about 1.9%.
Therefore, despite the positive test result, the probability that John actually has swine flu
is quite low, because the disease is so rare that false positives greatly outnumber true
positives. The doctor should interpret the result cautiously and confirm it with a follow-up
test before treating John.
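The Bayes' theorem calculation can be checked numerically; this sketch assumes a sensitivity of 0.99 and a 1% false-positive rate, with `posterior` as an illustrative function name.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = prior * sensitivity
    false_pos = (1.0 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# 1-in-5,000 prevalence, 99% accurate test
p = posterior(1 / 5000, 0.99, 0.01)
```

With these numbers the posterior comes out to roughly 0.019, illustrating how a rare disease keeps the probability low even after a positive result from an accurate test.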
Q 4
Naive Bayes classifiers are considered computationally efficient for high-dimensional
problems due to their ability to leverage extremely fast multinomial probability calculations
(Lakin, 2021). Naive Bayes classifiers only use the probabilities of each class to compute their
predictions instead of optimizing weights, which makes them an effective classification choice
for large datasets.
Unlike many other classifiers, Naive Bayes classifiers don't require intensive computing
resources or excessive amounts of training. Instead, they estimate the probabilities of each class
from a given data set, which allows them to quickly determine the best prediction without
performing any optimization. This is beneficial for larger data sets that contain a large number of
features.
Also, since Naive Bayes classifiers assume that the features are conditionally
independent given the class, new features can be accommodated by estimating their
class-conditional probabilities in isolation, without refitting the rest of the model. This is
especially valuable in high-dimensional settings, where the number of features keeps growing.
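As a sketch of why no optimization loop is needed, here is a minimal categorical Naive Bayes in pure Python: training is just counting, and prediction is just summing log-probabilities. All names are illustrative, and the Laplace smoothing is an added assumption to handle feature values unseen for a class.

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    """Estimate log class priors and per-feature value counts from
    categorical data -- counting only, no iterative weight fitting."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: math.log(class_counts[c] / n) for c in class_counts}
    cond = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            cond[(j, label)][v] += 1
    return priors, cond, class_counts

def predict_nb(model, row):
    """Pick the class with the highest posterior log-probability."""
    priors, cond, class_counts = model
    best, best_score = None, float("-inf")
    for c in priors:
        score = priors[c]
        for j, v in enumerate(row):
            counts = cond[(j, c)]
            # Laplace smoothing so unseen values get a small nonzero probability
            vocab = len(counts) + 1
            score += math.log((counts[v] + 1) / (class_counts[c] + vocab))
        if score > best_score:
            best, best_score = c, score
    return best
```

Because each feature contributes an independent term to the sum, adding a new feature column only requires counting its values per class, which is why the approach scales to many features.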
Q 5
The team should consider using a Decision Tree classifier in this situation. Decision tree
classifiers work well on both categorical and numeric data, so they are a good choice for a
dataset with many correlated variables. Additionally, Decision Trees can easily accommodate
interactions between variables, which can be useful when dealing with many correlated variables
(Ding, Cao & Liu, 2019). Finally, Decision Trees are advantageous because they are relatively
easy to interpret; a data scientist can easily follow the logic of a Decision Tree to identify the
most important variables and how they interact with each other to make predictions. All of these
properties make Decision Trees an ideal choice for a classification problem with many correlated
variables, many of which are categorical.
Q 6
In this case, the team should consider using logistic regression. Logistic regression
models the log-odds of the outcome as a linear combination of the input variables, allowing
each one to contribute to the classification outcome (Shah et al., 2020). Additionally, this
model outputs the probability of a certain label being assigned to a case and works well for
binary classification problems. This probability output in particular makes logistic regression
a preferred choice for tasks that require a calibrated score rather than just a hard label, which
is necessary for this kind of problem. Therefore, logistic regression is an ideal classifier
for the team to use.
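The probability output comes from passing a linear combination of the inputs through the logistic (sigmoid) function. This minimal sketch uses hypothetical, unfitted weights purely to show the mechanism.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(label = 1 | x): a linear combination of the features
    squashed through the sigmoid to produce a probability."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)
```

A score of zero maps to exactly 0.5, and every output is a valid probability, which is what makes the model suitable when the task needs probability estimates rather than bare class labels.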
Q 7
The accuracy of the predictions made by a particular model can be evaluated using
several quantities from the confusion matrix: the True Positive, True Negative, False Positive,
and False Negative counts. In this particular case, True Positive indicates the number of
predicted "good" instances that were actually good (671), while True Negative indicates the
number of predicted "bad" instances that were actually bad (262). False Positive, on the other
hand, indicates the number of instances that were predicted to be "good" but turned out to be
bad (38), while False Negative indicates the number of instances that were predicted to be
"bad" but actually turned out to be good (29).
The True Positive rate (also known as the recall rate) is calculated by dividing the
number of True Positive instances (671) by the sum of the True Positive and False Negative
instances (671 + 29). This yields a value of 0.959, or approximately 96%. The False Positive
rate, which measures how often actual "bad" instances are mislabeled as "good," is calculated
by dividing the number of False Positive instances (38) by the sum of the False Positive and
True Negative instances (38 + 262). This yields a value of 0.1267, or approximately 13%.
Finally, the False Negative rate, which measures how often actual "good" instances are missed,
is calculated by dividing the number of False Negative instances (29) by the sum of the False
Negative and True Positive instances (29 + 671). This yields a value of 0.0414, or
approximately 4%.
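These rate calculations can be verified directly from the four counts in the question:

```python
# Confusion-matrix counts from the question
tp, fn, fp, tn = 671, 29, 38, 262

tpr = tp / (tp + fn)  # True Positive rate (recall): 671 / 700
fpr = fp / (fp + tn)  # False Positive rate:          38 / 300
fnr = fn / (fn + tp)  # False Negative rate:          29 / 700
```

The three rates come out to approximately 0.959, 0.1267, and 0.0414, matching the values computed above. Note that TPR and FNR are complements: they always sum to 1.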
References
Ding, C., Cao, X., & Liu, C. (2019). How does the station-area built environment influence
Metrorail ridership? Using gradient boosting decision trees to identify non-linear
thresholds. Journal of Transport Geography, 77, 70-78.
Lakin, S. M. (2021). Modern considerations for the use of Naive Bayes in the supervised
classification of genetic sequence data (Doctoral dissertation, Colorado State University).
Ruby, U., & Yendapalli, V. (2020). Binary cross entropy with deep learning technique for image
classification. Int. J. Adv. Trends Comput. Sci. Eng., 9(10).
Shah, K., Patel, H., Sanghvi, D., & Shah, M. (2020). A comparative analysis of logistic
regression, random forest and KNN models for the text classification. Augmented Human
Research, 5, 1-16.