CSE 4/555 – Intro to Pattern Recognition
Chapter: Introduction and Bayesian Decision Theory
Edited by: Cheng-En Chuang
University at Buffalo, The State University of New York, Buffalo, NY 14620
September 27, 2021
1 Short Answer Questions

1. Describe what a pattern is.

   Answer: Patterns are everywhere in the digital world. A pattern can be observed physically, or it can be detected mathematically by applying algorithms. A pattern comprises two fundamental parts:
   (a) a collection of observations, and
   (b) the concept behind the observations, represented as a feature vector.

2. What is pattern recognition?

   Answer: Pattern recognition is the process of recognizing patterns using machine learning algorithms. It can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation. One of the important aspects of pattern recognition is its application potential. Examples: speech recognition, speaker identification, multimedia document recognition (MDR), and automatic medical diagnosis. In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use.

3. Pattern recognition involves classification and clustering of patterns. Explain these two approaches.

   Answer:
   (a) In classification, an appropriate class label is assigned to a pattern based on an abstraction generated from a set of training patterns or from domain knowledge. Classification is used in supervised learning.
   (b) Clustering generates a partition of the data that supports the specific decision-making activity of interest to us. Clustering is used in unsupervised learning.

4. Explain what features are in pattern recognition, with examples.

   Answer: Features may be represented as continuous, discrete, or discrete binary variables. A feature is a function of one or more measurements, computed so that it quantifies some significant characteristic of the object. Example: for a face, the eyes, ears, nose, etc. are features. A set of features taken together forms a feature vector. Example: in the face example above, taking all the features (eyes, ears, nose, etc.) together gives the feature vector [eyes, ears, nose]. A feature vector is a sequence of features represented as a d-dimensional column vector.

5. What are the properties of a robust pattern recognition system?

   Answer:
   (a) Recognizes familiar patterns quickly and accurately
   (b) Recognizes and classifies unfamiliar objects
   (c) Accurately recognizes shapes and objects from different angles
   (d) Identifies patterns and objects even when they are partly hidden
   (e) Recognizes patterns quickly, with ease and with automaticity

6. Explain how a system uses a dataset to learn patterns.

   Answer: The entire dataset is divided into two parts: one used to train the model (the training set) and one used to test the model after training (the testing set).
   (a) Training set: The training set is used to build the model. It consists of the examples (e.g., images) that are used to train the system. The training rules and algorithms specify how to associate input data with output decisions. The system is trained by applying these algorithms to the dataset; the relevant information is extracted from the data and results are obtained. Generally, about 80% of the dataset is used for training.
   (b) Testing set: The testing set is used to test the system. It is the data used to verify whether the system produces correct outputs after being trained. Generally, about 20% of the dataset is used for testing, and it is used to measure the accuracy of the system. Example: if a system that identifies which category a flower belongs to classifies seven out of ten flowers correctly and the rest incorrectly, its accuracy is 70%.

7. What is Bayes' theorem (Bayes' rule) all about?

   Answer: Often we know how frequently some particular evidence is observed given a known outcome, and we want to use this fact to compute the reverse: the chance of that outcome happening, given the evidence. Conceptually, this is a way to go from P(Evidence | Known Outcome) to P(Outcome | Known Evidence). Bayes' theorem is a relatively simple but fundamental result of probability theory that allows the calculation of certain conditional probabilities. Conditional probabilities are probabilities that reflect the influence of one event on the probability of another. The formula is

   \[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \]

   which tells us
   how often A happens given that B happens, written P(A | B), when we know how often B happens given that A happens, written P(B | A); how likely A is on its own, written P(A); and how likely B is on its own, written P(B).

   As an example, let P(Fire) mean how often there is fire and P(Smoke) mean how often we see smoke. Then P(Fire | Smoke) means how often there is fire when we can see smoke, and P(Smoke | Fire) means how often we can see smoke when there is fire. So the formula tells us the "forwards" probability P(Fire | Smoke) when we know the "backwards" probability P(Smoke | Fire).

8. What is a Naive Bayes classifier?

   Answer: Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. Bayes' theorem is given by

   \[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}. \]

   Using Bayes' theorem, the probability of A happening given that B has occurred can be found. An example of how a Naive Bayes classifier can be used: given that it has rained, the probability of the temperature being low is P(Temperature | Rain).

9. Why is Naive Bayes called "naive"?

   Answer: We call it naive because its assumptions (that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications.

10. Explain the likelihood ratio test (LRT).

    Answer: Assume we are to classify an object based on the evidence provided by a feature vector x. Evaluate the posterior probability P(ω_i | x) of each class and choose the class with the largest P(ω_i | x); in other words, choose the class that is most probable given the observation x. Let us examine this rule for a 2-class problem. In this case the decision rule becomes: if P(ω_1 | x) > P(ω_2 | x) choose ω_1, else choose ω_2. Or, in a more compact form,

    \[ P(\omega_1 \mid x) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(\omega_2 \mid x). \]

    Applying Bayes' rule:

    \[ \frac{P(x \mid \omega_1)\, P(\omega_1)}{P(x)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(x \mid \omega_2)\, P(\omega_2)}{P(x)}. \]
    Since P(x) does not affect the decision rule, it can be eliminated. (P(x) can be disregarded in the decision rule because it is constant regardless of the class ω_i. However, P(x) is needed if we want to estimate the posterior P(ω_i | x), which, unlike P(x | ω_1) P(ω_1), is a true probability value and therefore gives us an estimate of the "goodness" of our decision.) We can then rearrange the previous expression:

    \[ \Lambda(x) = \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(\omega_2)}{P(\omega_1)}. \]

    The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test (LRT).

11. What is the loss function?

    Answer: There are typically two steps involved in learning a hypothesis function h(). First, we select the type of machine learning algorithm that we think is appropriate for this particular learning problem. This defines the hypothesis class H, i.e. the set of functions we can possibly learn. The second step is to find the best function within this class, h ∈ H. This second step is the actual learning process and often, but not always, involves an optimization problem. Essentially, we try to find a function h within the hypothesis class that makes the fewest mistakes on our training data. (If there is no single such function, we typically try to choose the "simplest" one by some notion of simplicity, but we will cover this in more detail in a later class.) How can we find the best function? For this we need some way to evaluate what it means for one function to be better than another. This is where the loss function (also known as the risk function) comes in. A loss function evaluates a hypothesis h ∈ H on our training data and tells us how bad it is. The higher the loss, the worse the hypothesis; a loss of zero means it makes perfect predictions. It is common practice to normalize the loss by the total number of training samples, n, so that the output can be interpreted as the average loss per sample (and is independent of n).

12. What are some examples of loss functions?

    Answer:
    Zero-one loss: The simplest loss function is the zero-one loss. It literally counts how many mistakes a hypothesis function h makes on the training set: for each example it suffers a loss of 1 if the example is mispredicted and 0 otherwise. The normalized zero-one loss returns the fraction of misclassified training samples, also often referred to as the training error. The zero-one loss is often used to evaluate classifiers in multi-class/binary classification settings, but it is rarely useful for guiding optimization procedures because it is non-differentiable and non-continuous. Formally, the zero-one loss can be stated as

    \[ L_{0/1}(h) = \frac{1}{n} \sum_{i=1}^{n} \delta_{h(x_i) \neq y_i}, \quad \text{where } \delta_{h(x_i) \neq y_i} = \begin{cases} 1, & \text{if } h(x_i) \neq y_i \\ 0, & \text{otherwise.} \end{cases} \]

    This loss function returns the error rate on the data set D. For every example that the classifier misclassifies (i.e. gets wrong) a loss of 1 is suffered, whereas correctly classified samples lead to 0 loss.
    Squared loss: The squared loss function is typically used in regression settings. It iterates over all training samples and suffers the loss (h(x_i) − y_i)². The squaring has two effects: (1) the loss suffered is always nonnegative, and (2) the loss grows quadratically with the absolute amount of misprediction. The latter property discourages predictions that are very far off (otherwise the penalty would be so large that a different hypothesis function is likely better suited). On the flip side, if a prediction is very close to being correct, the square will be tiny and little attention will be given to that example to obtain zero error. For example, if |h(x_i) − y_i| = 0.001, the squared loss will be even smaller, 0.000001, and will likely never be fully corrected. If, given an input x, the label y is probabilistic according to some distribution P(y | x), then the optimal prediction to minimize the squared loss is to predict the expected value, i.e. h(x) = E_{P(y|x)}[y]. Formally, the squared loss is

    \[ L_{\text{sq}}(h) = \frac{1}{n} \sum_{i=1}^{n} \big( h(x_i) - y_i \big)^2 . \]

    Absolute loss: Similar to the squared loss, the absolute loss function is also typically used in regression settings. It suffers the penalty |h(x_i) − y_i|. Because the suffered loss grows linearly with the mispredictions, it is more suitable for noisy data (when some mispredictions are unavoidable and should not dominate the loss). If, given an input x, the label y is probabilistic according to some distribution P(y | x), then the optimal prediction to minimize the absolute loss is to predict the median value, i.e. h(x) = MEDIAN_{P(y|x)}[y]. Formally, the absolute loss can be stated as

    \[ L_{\text{abs}}(h) = \frac{1}{n} \sum_{i=1}^{n} \lvert h(x_i) - y_i \rvert . \]

    (A short NumPy sketch of these three loss functions is given after question 13.)

13. What is the purpose of discriminant functions?

    Answer: If we have a set of K classes, then we may define a set of K discriminant functions y_k(x), one for each class. A data point x is assigned to class c if

    \[ y_c(x) > y_k(x) \quad \text{for all } k \neq c. \]

    In other words: assign x to the class c whose discriminant function y_c(x) is largest. When classifying based on the values of the log posterior probability, the log posterior probability of class c given a data point x is a possible discriminant function:

    \[ y_c(x) = \ln P(c \mid x) = \ln p(x \mid c) + \ln P(c) + \text{const.} \]

    The posterior probability itself could also be used as a discriminant function, with the same results: choosing the class with the largest posterior probability is an identical decision rule to choosing the class with the largest log posterior probability.
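To make the three loss functions from question 12 concrete, here is a minimal NumPy sketch; the data values are made up purely for illustration.

```python
# Minimal sketch of the zero-one, squared, and absolute losses.
import numpy as np

def zero_one_loss(y_pred, y_true):
    """Fraction of misclassified samples (the training error)."""
    return np.mean(y_pred != y_true)

def squared_loss(y_pred, y_true):
    """Mean squared error, typical for regression."""
    return np.mean((y_pred - y_true) ** 2)

def absolute_loss(y_pred, y_true):
    """Mean absolute error, more robust to noisy labels."""
    return np.mean(np.abs(y_pred - y_true))

# Classification toy data: 3 mistakes out of 10 -> zero-one loss 0.3
y_true_cls = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
y_pred_cls = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])
print("zero-one loss:", zero_one_loss(y_pred_cls, y_true_cls))

# Regression toy data
y_true_reg = np.array([1.0, 2.0, 3.0])
y_pred_reg = np.array([1.1, 1.8, 3.5])
print("squared loss :", squared_loss(y_pred_reg, y_true_reg))
print("absolute loss:", absolute_loss(y_pred_reg, y_true_reg))
```

Running it prints a zero-one loss of 0.3 for the classification data, and a squared loss of 0.1 versus an absolute loss of about 0.267 for the regression data, showing how the squared loss weights the single large error (0.5) more heavily.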
2 Derivations

1. Question: Prove Bayes' theorem.

   Solution: The probability of two events A and B both happening, P(A ∩ B), is the probability of A, P(A), times the probability of B given that A has occurred, P(B | A):

   \[ P(A \cap B) = P(A)\, P(B \mid A). \]

   On the other hand, the probability of A and B is also equal to the probability of B times the probability of A given B:

   \[ P(A \cap B) = P(B)\, P(A \mid B). \]

   Equating the two yields

   \[ P(A)\, P(B \mid A) = P(B)\, P(A \mid B), \]

   and thus

   \[ P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}. \]

   This equation, known as Bayes' theorem, is the basis of statistical inference.
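As a quick numerical sanity check of this identity (not part of the original derivation), the snippet below builds an arbitrary joint distribution over two binary events, with made-up numbers, and confirms that P(A | B) computed directly from the joint table matches the Bayes-rule expression.

```python
# Sanity check of Bayes' theorem on a made-up joint distribution P(A, B).
p_joint = {
    (True, True): 0.12,
    (True, False): 0.18,
    (False, True): 0.28,
    (False, False): 0.42,
}

p_a = sum(p for (a, b), p in p_joint.items() if a)   # P(A) = 0.30
p_b = sum(p for (a, b), p in p_joint.items() if b)   # P(B) = 0.40
p_b_given_a = p_joint[(True, True)] / p_a            # P(B | A)
p_a_given_b_direct = p_joint[(True, True)] / p_b     # P(A | B) from the joint table
p_a_given_b_bayes = p_b_given_a * p_a / p_b          # P(A | B) via Bayes' rule

print(p_a_given_b_direct, p_a_given_b_bayes)         # both print 0.3
```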
2. Question: Assuming equal priors and given the likelihoods below, derive a decision rule based on the likelihood ratio test (LRT).

   \[ P(x \mid \omega_1) = N(4, 1), \qquad P(x \mid \omega_2) = N(10, 1). \]

   Solution: Substituting into the LRT expression:

   \[ \Lambda(x) = \frac{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-4)^2}}{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-10)^2}} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 1. \]

   Simplifying the LRT expression:

   \[ \Lambda(x) = e^{-\frac{1}{2}(x-4)^2 + \frac{1}{2}(x-10)^2} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 1, \]

   which yields

   \[ x \;\underset{\omega_1}{\overset{\omega_2}{\gtrless}}\; 7, \]

   i.e. choose ω_1 when x < 7 and ω_2 when x > 7.
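The decision rule can be checked numerically. The sketch below (using SciPy's normal density, an implementation convenience rather than anything prescribed by the course) evaluates the likelihood ratio at a few test points and shows that points below 7 are assigned to ω_1 and points above 7 to ω_2.

```python
# Numerical check of the LRT boundary for N(4, 1) vs. N(10, 1), equal priors.
from scipy.stats import norm

def lrt_decision(x, prior1=0.5, prior2=0.5):
    """Return 1 if the LRT chooses omega_1, else 2."""
    lam = norm.pdf(x, loc=4, scale=1) / norm.pdf(x, loc=10, scale=1)
    return 1 if lam > prior2 / prior1 else 2

for x in [5.0, 6.9, 7.1, 9.0]:
    print(x, "->", lrt_decision(x))
# 5.0 and 6.9 go to omega_1; 7.1 and 9.0 go to omega_2, matching x* = 7.
```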
3. Question: Assuming equal priors, show how the performance of any decision rule for a 2-class problem can be measured by P[error].

   Solution: Making use of the theorem of total probability,

   \[ P[\text{error}] = \sum_{i=1}^{C} P[\text{error} \mid \omega_i]\, P[\omega_i]. \]

   The class-conditional probability of error P[error | ω_i] can be expressed as

   \[ P[\text{error} \mid \omega_i] = P[\text{choose } \omega_j \mid \omega_i] = \int_{R_j} P(x \mid \omega_i)\, dx = \epsilon_i. \]

   For a 2-class problem, P[error] becomes

   \[ P[\text{error}] = P[\omega_1] \int_{R_2} P(x \mid \omega_1)\, dx + P[\omega_2] \int_{R_1} P(x \mid \omega_2)\, dx, \]

   where

   \[ \epsilon_1 = \int_{R_2} P(x \mid \omega_1)\, dx, \qquad \epsilon_2 = \int_{R_1} P(x \mid \omega_2)\, dx. \]

   Since we assume equal priors,

   \[ P[\text{error}] = \frac{\epsilon_1 + \epsilon_2}{2}. \]
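For the Gaussian pair from question 2, these two error integrals can be evaluated directly with the normal CDF. The sketch below assumes the LRT boundary x* = 7 derived above; both ε_1 and ε_2 come out to Φ(−3) ≈ 0.00135, so P[error] ≈ 0.00135 as well.

```python
# P[error] = (eps1 + eps2)/2 for N(4, 1) vs. N(10, 1), equal priors, x* = 7.
from scipy.stats import norm

x_star = 7.0
eps1 = norm.sf(x_star, loc=4, scale=1)    # integral of p(x|w1) over R2 = (x*, inf)
eps2 = norm.cdf(x_star, loc=10, scale=1)  # integral of p(x|w2) over R1 = (-inf, x*]
p_error = 0.5 * (eps1 + eps2)
print(eps1, eps2, p_error)                # both tails ~0.00135, so P[error] ~0.00135
```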
4. Question: Consider the Neyman-Pearson criterion for two Cauchy distributions in one dimension,

   \[ p(x \mid \omega_i) = \frac{1}{\pi b} \cdot \frac{1}{1 + \left( \frac{x - a_i}{b} \right)^2}, \qquad i = 1, 2. \]

   Assume a zero-one error loss and, for simplicity, a_2 > a_1, the same "width" b, and equal priors.
   (a) Suppose the maximum acceptable error rate for classifying a pattern that is actually in ω_1 as if it were in ω_2 is E_1. Determine the decision boundary in terms of the variables given.
   (b) For this boundary, what is the error rate for classifying ω_2 as ω_1?
   (c) What is the overall error rate under zero-one loss?
   (d) Apply your results to the specific case b = 1, a_1 = 1, a_2 = 1, and E_1 = 0.1.
   (e) Compare your result to the Bayes error rate (i.e., without the Neyman-Pearson conditions).

   Solution:
   (a) We want the probability of choosing action α_2 to be smaller than or equal to E_1, given that the true state of nature is ω_1. Assume that μ_1 < μ_2 and that the decision threshold is x*, so we decide α_2 if x > x*. We then have

   \[ P(\alpha_2 \mid \omega_1) \le E_1 \;\Longleftrightarrow\; p(x > x^* \mid \omega_1) \le E_1 \;\Longleftrightarrow\; 1 - \int_{-\infty}^{x^*} p(x \mid \omega_1)\, dx \le E_1. \]

   Let Φ : ℝ → [0, 1] denote the cumulative Gaussian distribution and Φ⁻¹ : [0, 1] → ℝ its inverse. Making use of Φ we write

   \[ 1 - \Phi\!\left( \frac{x^* - \mu_1}{\sigma_1} \right) \le E_1 \;\Longleftrightarrow\; x^* \ge \mu_1 + \sigma_1 \Phi^{-1}(1 - E_1). \]

   If the desired error is close to zero, then x* goes to positive infinity. If the desired error is close to one, then x* goes to negative infinity.

   (b) The error rate for classifying ω_2 as ω_1 is

   \[ P(\alpha_1 \mid \omega_2) = p(x \le x^* \mid \omega_2) = \int_{-\infty}^{x^*} p(x \mid \omega_2)\, dx = \Phi\!\left( \frac{x^* - \mu_2}{\sigma_2} \right). \]

   Making use of x* from the previous part, we obtain

   \[ \Phi\!\left( \frac{\mu_1 + \sigma_1 \Phi^{-1}(1 - E_1) - \mu_2}{\sigma_2} \right) = \Phi\!\left( \frac{\mu_1 - \mu_2}{\sigma_2} + \frac{\sigma_1}{\sigma_2} \Phi^{-1}(1 - E_1) \right). \]
   (c) The overall error rate becomes

   \[ P(\text{error}) = P(\alpha_1, \omega_2) + P(\alpha_2, \omega_1) = P(\alpha_1 \mid \omega_2)\, P(\omega_2) + P(\alpha_2 \mid \omega_1)\, P(\omega_1) = \frac{1}{2}\big[ P(\alpha_1 \mid \omega_2) + P(\alpha_2 \mid \omega_1) \big] = \frac{1}{2}\left[ E_1 + \Phi\!\left( \frac{\mu_1 - \mu_2}{\sigma_2} + \frac{\sigma_1}{\sigma_2} \Phi^{-1}(1 - E_1) \right) \right]. \]

   In the last equality we used the results from the previous subproblems.

   (d) Substituting the given values into the equations, we obtain x* ≈ 0.6449. The total error rate is P(error) ≈ 0.2056.

   (e) The Bayes error rate, as a function of x*, is given by

   \[ P(\text{error}) = P(\alpha_2 \mid \omega_1)\, P(\omega_1) + P(\alpha_1 \mid \omega_2)\, P(\omega_2) = \frac{1}{2}\big[ p(x > x^* \mid \omega_1) + p(x < x^* \mid \omega_2) \big] = \frac{1}{2}\left[ 1 - \Phi\!\left( \frac{x^* - \mu_1}{\sigma_1} \right) + \Phi\!\left( \frac{x^* - \mu_2}{\sigma_2} \right) \right]. \]

   The Bayes error rate for this problem is plotted as a function of x* in the original document (figure not reproduced here).
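The Neyman-Pearson construction used in this solution can be reproduced numerically. The sketch below follows the solution's Gaussian treatment (unit-variance normals; scipy.stats.cauchy could be substituted to match the Cauchy densities in the problem statement), and the means μ_1 = −1 and μ_2 = 1 are illustrative assumptions rather than values taken from the problem.

```python
# Neyman-Pearson sketch: pick x* so that P(x > x* | omega_1) = E1,
# then read off the omega_2 error and the overall error under equal priors.
# Gaussian class-conditionals and the means below are illustrative assumptions.
from scipy.stats import norm

mu1, mu2, sigma, E1 = -1.0, 1.0, 1.0, 0.1

x_star = norm.isf(E1, loc=mu1, scale=sigma)      # = mu1 + sigma * Phi^{-1}(1 - E1)
err_w2 = norm.cdf(x_star, loc=mu2, scale=sigma)  # P(alpha_1 | omega_2)
p_error = 0.5 * (E1 + err_w2)                    # equal priors

print(x_star, err_w2, p_error)
```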
5. Question: Let ω_max(x) be the state of nature for which P(ω_max | x) ≥ P(ω_i | x) for all i = 1, ..., c.
   (a) Show that P(ω_max | x) ≥ 1/c.
   (b) Show that for the minimum-error-rate decision rule the average probability of error is given by

   \[ P(\text{error}) = 1 - \int P(\omega_{\max} \mid x)\, p(x)\, dx. \]

   (c) Use these two results to show that P(error) ≤ (c − 1)/c.
   (d) Describe a situation for which P(error) = (c − 1)/c.

   Solution:
   (a) A useful observation is that the maximal value P(ω_max | x) is greater than or equal to the average. Therefore we obtain

   \[ P(\omega_{\max} \mid x) \ge \frac{1}{c} \sum_{i=1}^{c} P(\omega_i \mid x) = \frac{1}{c}, \]

   where the last equality is due to the posterior probabilities summing to unity.

   (b) The minimum error rate is achieved by choosing ω_max, the most likely state of nature. The average probability of error over the data space is therefore the probability that ω_max is not the true state of nature for a given x, i.e.

   \[ P(\text{error}) = \mathbb{E}_x\big[1 - P(\omega_{\max} \mid x)\big] = 1 - \int P(\omega_{\max} \mid x)\, p(x)\, dx. \]

   (c) We see that

   \[ P(\text{error}) = 1 - \int P(\omega_{\max} \mid x)\, p(x)\, dx \le 1 - \int \frac{1}{c}\, p(x)\, dx = 1 - \frac{1}{c} = \frac{c - 1}{c}, \]

   where we have used the fact that ∫ p(x) dx = 1.

   (d) A situation where P(error) = (c − 1)/c arises when P(ω_i | x) = 1/c for every i and every x (for instance, equal priors together with identical class-conditional densities). Then the maximum posterior equals the average, and the inequality in part (a) becomes an equality.
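The bound in parts (a) and (c) is easy to illustrate numerically: for any posterior vector the largest entry is at least 1/c, and the bound is attained only when every posterior equals 1/c. The random posteriors below are drawn purely for illustration.

```python
# Numerical illustration of P(w_max | x) >= 1/c and P(error) <= (c-1)/c.
import numpy as np

rng = np.random.default_rng(0)
c = 4
posteriors = rng.dirichlet(np.ones(c), size=100_000)  # each row sums to 1

max_post = posteriors.max(axis=1)
print("min of max posterior :", max_post.min(), ">= 1/c =", 1 / c)
print("avg error 1 - E[max] :", 1 - max_post.mean(), "<= (c-1)/c =", (c - 1) / c)

# The bound is attained only when every posterior equals 1/c:
uniform = np.full(c, 1 / c)
print("uniform-posterior error:", 1 - uniform.max())  # = (c-1)/c = 0.75
```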
3 Numerical Problems

1. Given that dangerous fires are rare (1%), smoke is fairly common (10%) due to barbecues, and 90% of dangerous fires make smoke, can you find the probability of a dangerous fire when there is smoke?

   Answer: We can find the probability of a dangerous fire when there is smoke using Bayes' theorem:

   \[ P(\text{Fire} \mid \text{Smoke}) = \frac{P(\text{Fire})\, P(\text{Smoke} \mid \text{Fire})}{P(\text{Smoke})} = \frac{0.01 \times 0.9}{0.1} = 0.09. \]

   So the probability of a dangerous fire when there is smoke is 9%.

2. A doctor is called to see a sick child. The doctor has prior information that 90% of sick children in that neighborhood have the flu, while the other 10% are sick with measles. Let F stand for the event of a child being sick with flu and M stand for the event of a child being sick with measles. Assume for simplicity that F ∪ M = Ω, i.e., that there are no other maladies in that neighborhood. A well-known symptom of measles is a rash (the event of having one we denote R). Assume that the probability of having a rash if one has measles is P(R | M) = 0.95. However, occasionally children with flu also develop a rash, and the probability of having a rash if one has flu is P(R | F) = 0.08. Upon examining the child, the doctor finds a rash. What is the probability that the child has measles?

   Answer: We use Bayes' formula:

   \[ P(M \mid R) = \frac{P(R \mid M)\, P(M)}{P(R \mid M)\, P(M) + P(R \mid F)\, P(F)} = \frac{0.95 \times 0.10}{0.95 \times 0.10 + 0.08 \times 0.90} \approx 0.57, \]

   which is nowhere close to the 95% of P(R | M).

3. In a study, physicians were asked what the odds of breast cancer would be in a woman who was initially thought to have a 1% risk of cancer but who ended up with a positive mammogram result (a mammogram accurately classifies about 80% of cancerous tumors and 90% of benign tumors). 95 out of 100 physicians estimated the probability of cancer to be about 75%. Do you agree?
   Answer: Introduce the events:
   + = mammogram result is positive
   B = tumor is benign
   M = tumor is malignant

   Note that B^c = M. We are given P(M) = 0.01, so P(B) = 1 − P(M) = 0.99. We are also given the conditional probabilities P(+ | M) = 0.80 and P(− | B) = 0.90, where the event − is the complement of +; thus P(+ | B) = 0.10. Bayes' formula in this case is

   \[ P(M \mid +) = \frac{P(+ \mid M)\, P(M)}{P(+ \mid M)\, P(M) + P(+ \mid B)\, P(B)} = \frac{0.80 \times 0.01}{0.80 \times 0.01 + 0.10 \times 0.99} \approx 0.075. \]

   So the chance would be about 7.5%, a far cry from the common estimate of 75%.

4. Suppose we have 3 cards, identical in form except that both sides of the first card are colored red, both sides of the second card are colored black, and one side of the third card is colored red and the other side black. The 3 cards are mixed up in a hat, and 1 card is randomly selected and put down on the ground. If the upper side of the chosen card is colored red, what is the probability that the other side is colored black?

   Answer: Let RR, BB, and RB denote, respectively, the events that the chosen card is the red-red, the black-black, or the red-black card. Letting R be the event that the upturned side of the chosen card is red, the desired probability is

   \[ P(RB \mid R) = \frac{P(RB \cap R)}{P(R)} = \frac{P(R \mid RB)\, P(RB)}{P(R \mid RR)\, P(RR) + P(R \mid RB)\, P(RB) + P(R \mid BB)\, P(BB)} = \frac{\left(\tfrac{1}{2}\right)\left(\tfrac{1}{3}\right)}{(1)\left(\tfrac{1}{3}\right) + \left(\tfrac{1}{2}\right)\left(\tfrac{1}{3}\right) + 0 \cdot \left(\tfrac{1}{3}\right)} = \frac{1}{3}. \]

   This question is actually a lot like the Monty Hall problem!
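The 1/3 answer can also be confirmed with a short Monte Carlo simulation (a sketch, not part of the original solution):

```python
# Monte Carlo check of the three-card problem: among draws whose visible
# side is red, roughly one third should have a black reverse side.
import random

cards = [("R", "R"), ("B", "B"), ("R", "B")]
red_up = black_down = 0

for _ in range(200_000):
    card = random.choice(cards)
    side = random.randrange(2)              # which side lands face up
    up, down = card[side], card[1 - side]
    if up == "R":
        red_up += 1
        if down == "B":
            black_down += 1

print(black_down / red_up)                  # approximately 1/3
```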
5. It is estimated that 50% of emails are spam. Some software has been applied to filter these spam emails before they reach your inbox. A certain brand of software claims that it can detect 99% of spam emails, and the probability of a false positive (a non-spam email detected as spam) is 5%. Now, if an email is detected as spam, what is the probability that it is in fact a non-spam email?

   Answer: Define the events:
   A = the event that an email is detected as spam
   B = the event that an email is spam
   B^c = the event that an email is not spam

   We know P(B) = P(B^c) = 0.5, P(A | B) = 0.99, and P(A | B^c) = 0.05. Hence, by Bayes' formula,

   \[ P(B^c \mid A) = \frac{P(A \mid B^c)\, P(B^c)}{P(A \mid B)\, P(B) + P(A \mid B^c)\, P(B^c)} = \frac{0.05 \times 0.5}{0.99 \times 0.5 + 0.05 \times 0.5} = \frac{5}{104}. \]
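The 5/104 result can be verified directly with a few lines of Python:

```python
# Check of the spam example: P(not spam | flagged) should equal 5/104.
p_spam = 0.5
p_flag_given_spam = 0.99
p_flag_given_ham = 0.05

p_flag = p_flag_given_spam * p_spam + p_flag_given_ham * (1 - p_spam)
p_ham_given_flag = p_flag_given_ham * (1 - p_spam) / p_flag
print(p_ham_given_flag, 5 / 104)  # both ~0.048
```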