CSE 4/555 – Intro to Pattern Recognition
Chapter: Introduction and Bayesian Decision Theory
Edited By:
Cheng-En Chuang
University at Buffalo, The State University of New York
Buffalo, NY 14620
September 27, 2021
1 Short Answer Questions
1. Describe what a pattern is.
Answer:
A pattern is everything around us in the digital world. A pattern can either be seen physically or be observed mathematically by applying algorithms. A pattern comprises the following two fundamental things:
(a) a collection of observations, and
(b) the concept behind the observations, also known as the feature vector.
2. What is Pattern Recognition?
Answer:
Pattern recognition is the process of recognizing patterns using machine learning algorithms. It can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation. One of the important aspects of pattern recognition is its application potential.
Examples: speech recognition, speaker identification, multimedia document recognition (MDR), and automatic medical diagnosis. In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use.
3. Pattern recognition involves classification and clustering of patterns. Explain these two approaches.
Answer:
(a) In classification, an appropriate class label is assigned to a pattern based on an abstraction that is generated using a set of training patterns or domain knowledge. Classification is used in supervised learning.
(b) Clustering generates a partition of the data that helps decision making, i.e., the specific decision-making activity of interest to us. Clustering is used in unsupervised learning.
4. Explain what features are in pattern recognition, with examples.
Answer:
Features may be represented as continuous, discrete, or discrete binary variables. A feature is a function of one or more measurements, computed so that it quantifies some significant characteristic of the object.
Example: for a face, the eyes, ears, nose, etc. are features of the face. A set of features taken together forms the feature vector.
Example: in the face example above, if all the features (eyes, ears, nose, etc.) are taken together, the sequence is the feature vector ([eyes, ears, nose]). A feature vector is a sequence of features represented as a d-dimensional column vector.
5. What are the properties of a robust pattern recognition system?
Answer:
(a) Recognize familiar patterns quickly and accurately
(b) Recognize and classify unfamiliar objects
(c) Accurately recognize shapes and objects from different angles
(d) Identify patterns and objects even when partly hidden
(e) Recognize patterns quickly, with ease, and with automaticity
6. Explain how a system uses a dataset to learn patterns.
Answer:
The entire dataset is divided into two parts: one used to train the model, i.e. the training set, and one used to test the model after training, i.e. the testing set. (A minimal split-and-evaluate sketch follows after this list.)
(a) Training set: the training set is used to build a model. It consists of the set of samples (e.g. images) that are used to train the system. The training rules and algorithms used give relevant information on how to associate input data with output decisions. The system is trained by applying these algorithms to the dataset; all the relevant information is extracted from the data and results are obtained. Generally, 80% of the data in the dataset is taken as training data.
(b) Testing set: the testing data is used to test the system, i.e. to verify whether the system produces the correct output after being trained. Generally, 20% of the data in the dataset is used for testing. Testing data is used to measure the accuracy of the system. Example: if a system that identifies which category a particular flower belongs to classifies seven out of ten flowers correctly and the rest incorrectly, its accuracy is 70%.
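The following is a minimal sketch (not part of the original notes) of an 80/20 train/test split and an accuracy computation; the data, labels, and the trivial stand-in "classifier" are all hypothetical.

```python
import numpy as np

# Hypothetical dataset: 100 samples with 4 features and made-up binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

# Shuffle sample indices, then take 80% for training and 20% for testing.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Stand-in "model": predict 1 whenever the first feature is positive.
y_pred = (X_test[:, 0] > 0).astype(int)

# Accuracy = fraction of test samples predicted correctly.
accuracy = np.mean(y_pred == y_test)
print(f"test accuracy: {accuracy:.2f}")
```

In practice the stand-in rule would be replaced by a model fit on (X_train, y_train) only, so the test accuracy measures generalization rather than memorization.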
7. What is Bayes' Theorem (Bayes' Rule) all about?
Answer:
Often, we know how frequently some particular evidence is observed, given
a known outcome. We have to use this known fact to compute the reverse, to compute
the chance of that outcome happening, given the evidence. Conceptually, this is a way
to go from P(Evidence | Known Outcome) to P(Outcome | Known Evidence).
Bayes’ theorem is a relatively simple, but fundamental result of probability theory that
allows for the calculation of certain conditional probabilities. Conditional probabilities
are just those probabilities that reflect the influence of one event on the probability of
another.
The formula is:
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
which tells us:
• how often A happens given that B happens, written P(A | B),
when we know:
• how often B happens given that A happens, written P(B | A),
• how likely A is on its own, written P(A),
• and how likely B is on its own, written P(B).
As an example, let us say P(Fire) means how often there is fire and P(Smoke) means how often we see smoke. Then:
• P(Fire | Smoke) means how often there is fire when we can see smoke
• P(Smoke | Fire) means how often we can see smoke when there is fire
So the formula tells us "forwards" P(Fire | Smoke) when we know "backwards" P(Smoke | Fire).
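As a quick illustration (not part of the original notes), Bayes' rule can be written as a one-line function; the probabilities plugged in below are made-up values, not taken from the text.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) computed as P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(B|A) = 0.8, P(A) = 0.02, P(B) = 0.05.
print(bayes_posterior(0.8, 0.02, 0.05))  # -> 0.32
```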
8. What is a Naïve Bayes Classifier?
Answer:
Naive Bayes Classifiers are a family of simple probabilistic classifiers based on
applying Bayes’ theorem with strong independence assumptions between the features.
Bayes' theorem is given by the following equation:
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
Using Bayes' theorem, the probability of A happening given that B has occurred can be found. An example of the way a Naive Bayes Classifier can be used is: given that it has rained, the probability of the temperature being low is P(Temperature | Rain).
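A minimal Naive Bayes sketch in code, assuming scikit-learn is available; the toy feature matrix and labels below are made up purely to show the fit/predict pattern.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two features per sample, two classes.
X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()          # assumes features are conditionally independent per class
clf.fit(X, y)               # estimates per-class feature means and variances

print(clf.predict([[1.1, 2.0]]))        # -> [0]
print(clf.predict_proba([[1.1, 2.0]]))  # estimated class posteriors
```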
9. Why Naive Bayes is called Naive?
Answer:
We call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent of one another) are very optimistic and rarely true in real-world applications.
10. Explain Likelihood Ratio Test (LRT).
Answer:
Assume we are to classify an object based on the evidence provided by a feature vector x. Evaluate the posterior probability P(ωi | x) of each class and choose the class with the largest P(ωi | x). In other words, choose the class that is most probable given the observation x.
Let's examine this rule for a 2-class problem. In this case the decision rule becomes: if P(ω1 | x) > P(ω2 | x) choose ω1, else choose ω2. Or, in a more compact form:
P(\omega_1 \mid x) \;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\; P(\omega_2 \mid x)
Applying Bayes' rule:
\frac{P(x \mid \omega_1)\, P(\omega_1)}{P(x)} \;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\; \frac{P(x \mid \omega_2)\, P(\omega_2)}{P(x)}
Since P(x) does not affect the decision rule, it can be eliminated.¹ Then we can rearrange the previous expression:
\Lambda(x) = \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} \;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\; \frac{P(\omega_2)}{P(\omega_1)}
The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test (LRT).
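A minimal sketch of the LRT as code; the Gaussian class-conditional densities and the priors below are hypothetical placeholders, not part of the original notes.

```python
from scipy.stats import norm

def lrt_decision(x, p_x_given_w1, p_x_given_w2, prior_w1=0.5, prior_w2=0.5):
    """Choose ω1 if Λ(x) = p(x|ω1)/p(x|ω2) exceeds P(ω2)/P(ω1), else ω2."""
    likelihood_ratio = p_x_given_w1(x) / p_x_given_w2(x)
    return "omega_1" if likelihood_ratio > prior_w2 / prior_w1 else "omega_2"

# Hypothetical example: class 1 ~ N(0, 1), class 2 ~ N(3, 1), equal priors.
p1 = lambda x: norm.pdf(x, loc=0, scale=1)
p2 = lambda x: norm.pdf(x, loc=3, scale=1)
print(lrt_decision(0.5, p1, p2))  # -> omega_1 (closer to the class-1 mean)
print(lrt_decision(2.8, p1, p2))  # -> omega_2
```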
11. What is the loss function?
Answer:
There are typically two steps involved in learning a hypothesis function h(). First, we select the type of machine learning algorithm that we think is appropriate for this particular learning problem. This defines the hypothesis class H, i.e. the set of functions we can possibly learn. The second step is to find the best function within this class, h ∈ H. This second step is the actual learning process and often, but not always, involves an optimization problem.
Essentially, we try to find a function h within the hypothesis class that makes the fewest mistakes within our training data. (If there is no single best function, we typically try to choose the "simplest" by some notion of simplicity, but we will cover this in more detail in a later class.) How can we find the best function? For this we need some way to evaluate what it means for one function to be better than another.
This is where the loss function (aka risk function) comes in. A loss function evaluates a hypothesis h ∈ H on our training data and tells us how bad it is. The higher the loss, the worse it is; a loss of zero means it makes perfect predictions. It is common practice to normalize the loss by the total number of training samples, n, so that the output can be interpreted as the average loss per sample (and is independent of n).
12. What are some examples of loss functions?
Answer:
• Zero-one loss:
The simplest loss function is the zero-one loss. It literally counts how many mistakes a hypothesis function h makes on the training set: for every single example it suffers a loss of 1 if it is mispredicted, and 0 otherwise. The normalized zero-one loss returns the fraction of misclassified training samples, also often referred to as the training error. The zero-one loss is often used to evaluate classifiers in multi-class/binary classification settings but is rarely useful to guide optimization procedures, because the function is non-differentiable and non-continuous.
Formally, the zero-one loss can be stated as:
L_{0/1}(h) = \frac{1}{n} \sum_{i=1}^{n} \delta_{h(x_i) \neq y_i}, \qquad \text{where } \delta_{h(x_i) \neq y_i} = \begin{cases} 1, & \text{if } h(x_i) \neq y_i \\ 0, & \text{otherwise.} \end{cases}
This loss function returns the error rate on the data set D. For every example that the classifier misclassifies (i.e. gets wrong) a loss of 1 is suffered, whereas correctly classified samples lead to 0 loss.
¹ P(x) can be disregarded in the decision rule since it is constant regardless of the class ωi. However, P(x) will be needed if we want to estimate the posterior P(ωi | x) which, unlike P(x | ω1) P(ω1), is a true probability value and, therefore, gives us an estimate of the "goodness" of our decision.
• Squared loss:
The squared loss function is typically used in regression settings. It iterates over all training samples and suffers the loss (h(x_i) − y_i)². The squaring has two effects: first, the loss suffered is always nonnegative; second, the loss suffered grows quadratically with the absolute mispredicted amount. The latter property encourages no predictions to be really far off (or the penalty would be so large that a different hypothesis function is likely better suited).
On the flip side, if a prediction is very close to correct, the square will be tiny and little attention will be given to that example to obtain zero error. For example, if |h(x_i) − y_i| = 0.001 the squared loss will be even smaller, 0.000001, and will likely never be fully corrected. If, given an input x, the label y is probabilistic according to some distribution P(y | x), then the optimal prediction to minimize the squared loss is to predict the expected value, i.e. h(x) = E_{P(y \mid x)}[y]. Formally, the squared loss is:
L_{sq}(h) = \frac{1}{n} \sum_{i=1}^{n} (h(x_i) - y_i)^2
• Absolute loss:
Similar to the squared loss, the absolute loss function is also typically used in regression settings. It suffers the penalties |h(x_i) − y_i|. Because the suffered loss grows linearly with the mispredictions, it is more suitable for noisy data (when some mispredictions are unavoidable and shouldn't dominate the loss).
If, given an input x, the label y is probabilistic according to some distribution P(y | x), then the optimal prediction to minimize the absolute loss is to predict the median value, i.e. h(x) = \mathrm{median}_{P(y \mid x)}[y]. Formally, the absolute loss can be stated as:
L_{abs}(h) = \frac{1}{n} \sum_{i=1}^{n} |h(x_i) - y_i|
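The three losses above are easy to express directly in code; here is a minimal numpy sketch (not part of the original notes) with made-up predictions and labels.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    return np.mean(y_pred != y_true)          # fraction of misclassified samples

def squared_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)    # average squared error

def absolute_loss(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))   # average absolute error

y_true = np.array([1.0, 0.0, 2.0, 1.0])       # hypothetical targets
y_pred = np.array([1.0, 1.0, 1.5, 1.0])       # hypothetical predictions
print(zero_one_loss(y_pred, y_true))          # 0.5 (exact match counted as correct)
print(squared_loss(y_pred, y_true))           # 0.3125
print(absolute_loss(y_pred, y_true))          # 0.375
```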
13. What is the purpose of discriminant functions?
Answer:
If we have a set of K classes, then we may define a set of K discriminant functions y_k(x), one for each class. Data point x is assigned to class c if
y_c(x) > y_k(x) \quad \text{for all } k \neq c.
In other words: assign x to the class c whose discriminant function y_c(x) is biggest.
When classifying based on the values of the log posterior probability, the log posterior probability of class c given a data point x is a possible discriminant function:
y_c(x) = \ln P(c \mid x) = \ln p(x \mid c) + \ln P(c) + \text{const}
The posterior probability could also be used as a discriminant function, with the same results: choosing the class with the largest posterior probability is an identical decision rule to choosing the class with the largest log posterior probability.
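A minimal sketch of log-posterior discriminant functions for two classes; the Gaussian class-conditional densities, class means, and priors below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

means  = {"c1": 0.0, "c2": 3.0}   # hypothetical per-class means (unit variance)
priors = {"c1": 0.7, "c2": 0.3}   # hypothetical priors

def discriminant(x, c):
    # y_c(x) = ln p(x|c) + ln P(c); the constant term is dropped.
    return norm.logpdf(x, loc=means[c], scale=1.0) + np.log(priors[c])

def classify(x):
    # Assign x to the class whose discriminant function is biggest.
    return max(means, key=lambda c: discriminant(x, c))

print(classify(0.5))  # -> 'c1'
print(classify(2.9))  # -> 'c2'
```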
2 Derivations
1. Question:
Prove Bayes' Theorem.
Solution:
The probability of two events A and B happening, P(A ∩ B), is the probability of A, P(A), times the probability of B given that A has occurred, P(B | A):
P(A \cap B) = P(A)\, P(B \mid A)
On the other hand, the probability of A and B is also equal to the probability of B times the probability of A given B:
P(A \cap B) = P(B)\, P(A \mid B)
Equating the two yields:
P(A)\, P(B \mid A) = P(B)\, P(A \mid B)
and thus
P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}
This equation, known as Bayes' Theorem, is the basis of statistical inference.
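As a small numeric sanity check (not part of the original notes), the identity P(A) P(B|A) = P(B) P(A|B) can be verified on a hypothetical 2×2 joint distribution over the events A and B.

```python
# Hypothetical joint probabilities for the four outcomes of (A, B).
p_joint = {("A", "B"): 0.12, ("A", "notB"): 0.18,
           ("notA", "B"): 0.28, ("notA", "notB"): 0.42}

p_a = p_joint[("A", "B")] + p_joint[("A", "notB")]    # P(A)   = 0.30
p_b = p_joint[("A", "B")] + p_joint[("notA", "B")]    # P(B)   = 0.40
p_b_given_a = p_joint[("A", "B")] / p_a               # P(B|A) = 0.40
p_a_given_b = p_joint[("A", "B")] / p_b               # P(A|B) = 0.30

# Both products equal P(A ∩ B) = 0.12, which is exactly Bayes' theorem.
assert abs(p_a * p_b_given_a - p_b * p_a_given_b) < 1e-12
print(p_a * p_b_given_a, p_b * p_a_given_b)           # 0.12 0.12
```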
2. Question:
Assume equal priors. Given the likelihoods below, derive a decision rule based on the likelihood ratio test (LRT):
P(x \mid \omega_1) = N(4, 1), \qquad P(x \mid \omega_2) = N(10, 1)
Solution:
Substituting into the LRT expression:
\Lambda(x) = \frac{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-4)^2}}{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-10)^2}} \;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\; 1
Simplifying the LRT expression:
\Lambda(x) = e^{-\frac{1}{2}(x-4)^2 + \frac{1}{2}(x-10)^2} \;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\; 1
which yields
x \;\overset{\omega_2}{\underset{\omega_1}{\gtrless}}\; 7
i.e., choose ω1 if x < 7 and ω2 if x > 7.
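A quick numeric check of this rule (not part of the original notes), assuming scipy is available: evaluate the likelihood ratio on either side of the threshold x = 7.

```python
from scipy.stats import norm

def likelihood_ratio(x):
    # Λ(x) = p(x|ω1) / p(x|ω2) with ω1 ~ N(4, 1) and ω2 ~ N(10, 1).
    return norm.pdf(x, loc=4, scale=1) / norm.pdf(x, loc=10, scale=1)

print(likelihood_ratio(6.9) > 1)  # True  -> choose ω1 just below x = 7
print(likelihood_ratio(7.1) > 1)  # False -> choose ω2 just above x = 7
```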
3. Question:
Assuming equal priors, show that the performance of any decision rule for a 2-class problem can be measured by P[error].
Solution:
Making use of the theorem of total probability:
P[\text{error}] = \sum_{i=1}^{C} P[\text{error} \mid \omega_i]\, P[\omega_i]
The class-conditional probability P[error | ωi] can be expressed as:
P[\text{error} \mid \omega_i] = P[\text{choose } \omega_j \mid \omega_i] = \int_{R_j} P(x \mid \omega_i)\, dx = \epsilon_i
For a 2-class problem, P[error] becomes
P[\text{error}] = P[\omega_1] \int_{R_2} P(x \mid \omega_1)\, dx + P[\omega_2] \int_{R_1} P(x \mid \omega_2)\, dx
where
\epsilon_1 = \int_{R_2} P(x \mid \omega_1)\, dx, \qquad \epsilon_2 = \int_{R_1} P(x \mid \omega_2)\, dx
Since we assume equal priors,
P[\text{error}] = \frac{\epsilon_1 + \epsilon_2}{2}
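As a concrete illustration (not part of the original notes), the two error integrals can be evaluated numerically for the N(4,1) vs. N(10,1) example from the previous derivation, with the threshold x* = 7 so that R1 = {x < 7} and R2 = {x > 7}.

```python
from scipy.stats import norm

eps1 = 1 - norm.cdf(7, loc=4, scale=1)   # integral over R2 of P(x|ω1) = P(x > 7 | ω1)
eps2 = norm.cdf(7, loc=10, scale=1)      # integral over R1 of P(x|ω2) = P(x < 7 | ω2)
p_error = 0.5 * (eps1 + eps2)            # equal priors: (ε1 + ε2) / 2
print(eps1, eps2, p_error)               # ≈ 0.00135, 0.00135, 0.00135
```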
4. Question:
Consider the Neyman-Pearson criterion for two Cauchy distributions in one dimension:
p(x \mid \omega_i) = \frac{1}{\pi b} \cdot \frac{1}{1 + \left(\frac{x - a_i}{b}\right)^2}, \qquad i = 1, 2
Assume a zero-one error loss and, for simplicity, a_2 > a_1, the same "width" b, and equal priors.
(a) Suppose the maximum acceptable error rate for classifying a pattern that is actually in ω1 as if it were in ω2 is E1. Determine the decision boundary in terms of the variables given.
(b) For this boundary, what is the error rate for classifying ω2 as ω1?
(c) What is the overall error rate under zero-one loss?
(d) Apply your results to the specific case b = 1 and a_1 = 1, a_2 = 1 and E_1 = 0.1.
(e) Compare your result to the Bayes error rate (i.e., without the Neyman-Pearson conditions).
Solution:
(a) We want the probability of choosing action α2 to be smaller than, or equal to, E1, given that the true state of nature is ω1. Let's assume that μ1 < μ2 and that the decision threshold is x*, so we decide α2 if x > x*. We then have
P(\alpha_2 \mid \omega_1) \le E_1
p(x > x^* \mid \omega_1) \le E_1
1 - \int_{-\infty}^{x^*} p(x \mid \omega_1)\, dx \le E_1
We let Φ : ℝ → [0, 1] denote the cumulative Gaussian distribution, and Φ⁻¹ : [0, 1] → ℝ its inverse function. Making use of Φ we write
1 - \Phi\!\left(\frac{x^* - \mu_1}{\sigma_1}\right) \le E_1
x^* \ge \mu_1 + \sigma_1\, \Phi^{-1}(1 - E_1)
If the desired error is close to zero, then x* goes to positive infinity. If the desired error is close to one, then x* goes to negative infinity.
(b) The error rate for classifying ω2 as ω1 is
P(\alpha_1 \mid \omega_2) = p(x \le x^* \mid \omega_2) = \int_{-\infty}^{x^*} p(x \mid \omega_2)\, dx = \Phi\!\left(\frac{x^* - \mu_2}{\sigma_2}\right)
Making use of x* from the previous part, we obtain
\Phi\!\left(\frac{\mu_1 + \sigma_1 \Phi^{-1}(1 - E_1) - \mu_2}{\sigma_2}\right) = \Phi\!\left(\frac{\mu_1 - \mu_2}{\sigma_2} + \frac{\sigma_1}{\sigma_2}\, \Phi^{-1}(1 - E_1)\right)
(c) The overall error rate becomes
P(\text{error}) = P(\alpha_1, \omega_2) + P(\alpha_2, \omega_1)
= P(\alpha_1 \mid \omega_2)\, P(\omega_2) + P(\alpha_2 \mid \omega_1)\, P(\omega_1)
= \frac{1}{2}\left[ P(\alpha_1 \mid \omega_2) + P(\alpha_2 \mid \omega_1) \right]
= \frac{1}{2}\left[ E_1 + \Phi\!\left(\frac{\mu_1 - \mu_2}{\sigma_2} + \frac{\sigma_1}{\sigma_2}\, \Phi^{-1}(1 - E_1)\right) \right]
In the last equality we used the results from the previous subproblems.
(d) We substitute the given values into the equations and obtain x* ≈ 0.6449. The total error rate is P(error) ≈ 0.2056.
(e) The Bayes error rate, as a function of x*, is given by
P(\text{error}) = P(\alpha_2 \mid \omega_1)\, P(\omega_1) + P(\alpha_1 \mid \omega_2)\, P(\omega_2)
= \frac{1}{2}\left[ p(x > x^* \mid \omega_1) + p(x < x^* \mid \omega_2) \right]
= \frac{1}{2}\left[ 1 - \Phi\!\left(\frac{x^* - \mu_1}{\sigma_1}\right) + \Phi\!\left(\frac{x^* - \mu_2}{\sigma_2}\right) \right]
The Bayes error rate for this problem is depicted in the accompanying figure (not reproduced here).
5.
Question:
Let ω_max(x) be the state of nature for which P(ω_max | x) ≥ P(ω_i | x) for all i, i = 1, . . . , c.
(a) Show that P(ω_max | x) ≥ 1/c.
(b) Show that for the minimum-error-rate decision rule the average probability of error is given by
P(\text{error}) = 1 - \int P(\omega_{\max} \mid x)\, p(x)\, dx
(c) Use these two results to show that P(error) ≤ (c − 1)/c.
(d) Describe a situation for which P(error) = (c − 1)/c.
Solution:
(a) A useful observation is that the maximal value P(ω_max | x) is greater than, or equal to, the average. Therefore we obtain
P(\omega_{\max} \mid x) \;\ge\; \frac{1}{c} \sum_{i=1}^{c} P(\omega_i \mid x) = \frac{1}{c}
where the last equality is due to the probabilities summing to unity.
(b) The minimum error rate is achieved by choosing ω_max, the most likely state of nature. The average probability of error over the data space is therefore the probability that ω_max is not the true state of nature for a given x, i.e.
P(\text{error}) = E_x\!\left[1 - P(\omega_{\max} \mid x)\right] = 1 - \int P(\omega_{\max} \mid x)\, p(x)\, dx
(c) We see that
P(\text{error}) = 1 - \int P(\omega_{\max} \mid x)\, p(x)\, dx \;\le\; 1 - \int \frac{1}{c}\, p(x)\, dx = 1 - \frac{1}{c} = \frac{c - 1}{c}
where we have used the fact that ∫ p(x) dx = 1.
(d) A situation where P(error) = (c − 1)/c arises when P(ω_i | x) = 1/c for every i (and every x). Then the maximum value is equal to the average value, and the inequality in part (a) becomes an equality.
3 Numerical
1. Given:
• dangerous fires are rare (1%)
• smoke is fairly common (10%) due to barbecues
• 90% of dangerous fires make smoke
Can you find the probability of a dangerous Fire when there is Smoke?
Answer:
We can find the probability of a dangerous Fire when there is Smoke using Bayes' Theorem:
P(\text{Fire} \mid \text{Smoke}) = \frac{P(\text{Fire}) \cdot P(\text{Smoke} \mid \text{Fire})}{P(\text{Smoke})} = \frac{0.01 \cdot 0.9}{0.1} = 0.09
So the probability of a dangerous fire when there is smoke is 9%.
2. A doctor is called to see a sick child. The doctor has prior information that 90% of sick children in that neighborhood have the flu, while the other 10% are sick with measles. Let F stand for the event of a child being sick with flu and M stand for the event of a child being sick with measles. Assume for simplicity that F ∪ M = Ω, i.e., that there are no other maladies in that neighborhood.
A well-known symptom of measles is a rash (the event of having which we denote R). Assume that the probability of having a rash if one has measles is P(R | M) = 0.95. However, occasionally children with flu also develop a rash, and the probability of having a rash if one has flu is P(R | F) = 0.08.
Upon examining the child, the doctor finds a rash. What is the probability that the child has measles?
Answer:
We use Bayes' formula:
P(M \mid R) = \frac{P(R \mid M)\, P(M)}{P(R \mid M)\, P(M) + P(R \mid F)\, P(F)} = \frac{0.95 \cdot 0.10}{0.95 \cdot 0.10 + 0.08 \cdot 0.90} \approx 0.57
which is nowhere close to the 95% of P(R | M).
3. In a study, physicians were asked what the odds of breast cancer would be in a woman
who was initially thought to have a 1% risk of cancer but who ended up with a positive
mammogram result (a mammogram accurately classifies about 80% of cancerous tumors
and 90% of benign tumors.)
95 out of a hundred physicians estimated the probability of cancer to be about 75%. Do
you agree?
Answer:
Introduce the events:
+ = mammogram result is positive
B = tumor is benign
M = tumor is malignant
Note that B^c = M. We are given P(M) = 0.01, so P(B) = 1 − P(M) = 0.99. We are also given the conditional probabilities P(+ | M) = 0.80 and P(− | B) = 0.90, where the event − is the complement of +; thus P(+ | B) = 0.10.
Bayes' formula in this case is
P(M \mid +) = \frac{P(+ \mid M)\, P(M)}{P(+ \mid M)\, P(M) + P(+ \mid B)\, P(B)} = \frac{0.80 \cdot 0.01}{0.80 \cdot 0.01 + 0.10 \cdot 0.99} \approx 0.075
So the chance would be 7.5%, a far cry from the common estimate of 75%.
4. Suppose we have 3 cards identical in form except that both sides of the first card are
colored red, both sides of the second card are colored black, and one side of the third
card is colored red and the other side is colored black.
The 3 cards are mixed up in a hat, and 1 card is randomly selected and put down on
the ground. If the upper side of the chosen card is colored red, what is the probability
that the other side is colored black?
Answer:
Let RR, BB, and RB denote, respectively, the events that the chosen card is the red-red, the black-black, or the red-black card. Letting R be the event that the upturned side of the chosen card is red, the desired probability is obtained by
P(RB \mid R) = \frac{P(RB \cap R)}{P(R)} = \frac{P(R \mid RB)\, P(RB)}{P(R \mid RR)\, P(RR) + P(R \mid RB)\, P(RB) + P(R \mid BB)\, P(BB)} = \frac{\left(\tfrac{1}{2}\right)\left(\tfrac{1}{3}\right)}{(1)\left(\tfrac{1}{3}\right) + \left(\tfrac{1}{2}\right)\left(\tfrac{1}{3}\right) + (0)\left(\tfrac{1}{3}\right)} = \frac{1}{3}
This question was actually just like the Monty Hall problem!
5. It is estimated that 50% of emails are spam emails.
Some software has been applied
to filter these spam emails before they reach your inbox. A certain brand of software
claims that it can detect 99% of spam emails, and the probability for a false positive (a
non-spam email detected as spam) is 5%.
Now if an email is detected as spam, then what is the probability that it is in fact a
non-spam email?
Answer:
Define the events:
A = event that an email is detected as spam
B = event that an email is spam
B^c = event that an email is not spam
We know P(B) = P(B^c) = 0.5, P(A | B) = 0.99, and P(A | B^c) = 0.05. Hence, by Bayes' formula, we have:
P(B^c \mid A) = \frac{P(A \mid B^c)\, P(B^c)}{P(A \mid B)\, P(B) + P(A \mid B^c)\, P(B^c)} = \frac{0.05 \cdot 0.5}{0.05 \cdot 0.5 + 0.99 \cdot 0.5} = \frac{5}{104}
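The five answers in this section can be re-checked with a few lines of Python (not part of the original notes); the helper below is just Bayes' rule with a two-hypothesis total-probability denominator.

```python
def posterior(p_e_given_h, p_h, p_e_given_not_h, p_not_h):
    """P(H|E) = P(E|H)P(H) / (P(E|H)P(H) + P(E|not H)P(not H))."""
    evidence = p_e_given_h * p_h + p_e_given_not_h * p_not_h
    return p_e_given_h * p_h / evidence

print(0.9 * 0.01 / 0.1)                     # 1. P(Fire|Smoke)        -> 0.09
print(posterior(0.95, 0.10, 0.08, 0.90))    # 2. P(Measles|Rash)      -> ~0.57
print(posterior(0.80, 0.01, 0.10, 0.99))    # 3. P(Malignant|+)       -> ~0.075
print((0.5 * 1/3) / (1 * 1/3 + 0.5 * 1/3))  # 4. P(RB|R)              -> ~0.333
print(posterior(0.05, 0.50, 0.99, 0.50))    # 5. P(non-spam|detected) -> ~0.048
```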
4 References
(a) https://www.geeksforgeeks.org/pattern-recognition-basics-and-design-principles/
(b) https://www.geeksforgeeks.org/pattern-recognition-introduction/
(c) https://www.guru99.com/machine-learning-interview-questions.html
(d) https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
(e) https://www.mlstack.cafe/blog/naive-bayes-interview-questions
(f) https://www.mathsisfun.com/data/bayes-theorem.html
(g) http://www.hep.upenn.edu/ johnda/Papers/Bayes.pdf
(h) https://www2.math.upenn.edu/ mmerling/math107%20docs/practice%20on%20Bayes%20solutions.pdf
(i) https://people.engr.tamu.edu/rgutier/web courses/csce666 s20/l4.pdf
(j) https://www.cs.cornell.edu/courses/cs4780/2018sp/lectures/lecturenote01 MLsetup.html
(k) http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note10-2up.pdf
(l) Pattern Classification, 2nd Edition, by Richard O. Duda, Peter E. Hart, and David G. Stork