HW2 Theory

School: University of Texas
Course: 6375
Subject: Computer Science
Date: Feb 20, 2024
Type: docx
Pages: 5
Uploaded by: ChefFalcon3581
Question 1:

Does the Perceptron make the same number of mistakes? No. The Perceptron does not necessarily make the same number of mistakes, because the model only learns when it errs: under a new ordering of the data it may stop making mistakes (and therefore stop updating) at a different point than before.

Does it end up with the same final weights? No. The final weights depend on both the initialization and the order of the training data, so presenting the same set in a different order can yield different weights.

Example. Consider the following points in R^2:

S  = {((1, 1), +1), ((-4, 0), +1), ((-1, 2), +1)}
S' = {((-1, 2), +1), ((-4, 0), +1), ((1, 1), +1)}

All training labels are +1, so the Perceptron converges once it predicts +1 on every example. Let the initial weight be w_1 = (0, 0), and let the Perceptron be f(x) = sign(w^T x − θ), with f(x) ∈ {-1, +1}. Assume θ = 0, so that f(x) = sign(w^T x), where sign(z) = +1 if z > 0 and -1 if z <= 0.

Iterating through S until convergence:

sign(w_1^T x_1) = sign(0) = -1  → mistake, so w_2 = (0, 0) + (1, 1) = (1, 1)
sign(w_2^T x_2) = sign(1·(-4) + 1·0) = sign(-4) = -1  → mistake, so w_3 = (1, 1) + (-4, 0) = (-3, 1)

Continuing in this way, the weights converge to (-1, 3), and the total number of mistakes made in the process is 4.
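The hand-trace above can be checked mechanically. The following is a minimal sketch (not part of the assignment; the function name `perceptron` is my own) that runs the Perceptron with θ = 0 and sign(0) = -1 on both orderings until a full pass is mistake-free:

```python
# Sketch: reproduce the mistake counts and final weights claimed above.

def perceptron(examples, w=(0.0, 0.0)):
    """Run the Perceptron (theta = 0, sign(0) = -1) until a mistake-free pass."""
    w = list(w)
    mistakes = 0
    converged = False
    while not converged:
        converged = True
        for (x1, x2), y in examples:
            z = w[0] * x1 + w[1] * x2
            pred = 1 if z > 0 else -1      # sign(0) treated as -1, as above
            if pred != y:                   # update only on a mistake
                w[0] += y * x1
                w[1] += y * x2
                mistakes += 1
                converged = False
    return tuple(w), mistakes

S  = [((1, 1), 1), ((-4, 0), 1), ((-1, 2), 1)]
Sp = [((-1, 2), 1), ((-4, 0), 1), ((1, 1), 1)]

print(perceptron(S))   # ((-1.0, 3.0), 4)
print(perceptron(Sp))  # ((-1.0, 2.0), 1)
```

The two orderings of the same three points yield different final weights and different mistake counts, matching the claim.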
Running the same process on S', starting from the same initial weight (0, 0), gives final weight (-1, 2) with only 1 mistake. So the example shows that iterating through the same data in a different order can produce both different final weights and a different number of mistakes.

Question 2:

If the loss function is ℓ(z) = max(0, −z), evaluated at the margin z = y_i w^T x_i, then SGD is equivalent to the Perceptron algorithm: in this case the SGD update rule coincides with the Perceptron update rule. The Perceptron is a linear classifier that separates two classes by finding a hyperplane between them; SGD is a general optimization procedure, and running it on this loss recovers the Perceptron update as a special case.

For the Perceptron, which updates only when a mistake is made, the update rule is:

w_new = w_old + y_i x_i

The SGD update rule is:

w_new = w_old − η ∇ℓ(y_i w^T x_i)

If there is no mistake (y_i w^T x_i >= 0), the loss is flat at 0, so ∇ℓ = 0 and w_new = w_old.
If there is a mistake (y_i w^T x_i < 0), ∇ℓ = −y_i x_i, so w_new = w_old + η y_i x_i.

Hence when η = 1, the Perceptron and SGD updates are identical.

Question 3a:

For example, a 3-piece classifier c(x) with b = +1 has 3 regions:

Region 1: on [-B, θ_1], c(x) outputs -1
Region 2: on [θ_1, θ_2], c(x) outputs +1
Region 3: on [θ_2, B], c(x) outputs -1
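The Question 2 equivalence can be checked numerically. Below is a minimal sketch (my own function name and example values, not from the assignment) of one SGD step on the perceptron loss ℓ(z) = max(0, −z); on a mistake it reproduces the Perceptron update w + y·x, and otherwise it leaves w unchanged:

```python
def sgd_step(w, x, y, eta=1.0):
    """One SGD step on the loss max(0, -y*(w.x)), with step size eta."""
    margin = y * (w[0] * x[0] + w[1] * x[1])
    if margin < 0:   # mistake: subgradient of the loss is -y*x, so w <- w + eta*y*x
        return (w[0] + eta * y * x[0], w[1] + eta * y * x[1])
    return w         # correct: loss is flat at 0, gradient is 0, no change

# Mistake case: w = (0, -1), x = (1, 1), y = +1 gives margin = -1 < 0
print(sgd_step((0.0, -1.0), (1, 1), 1))   # (1.0, 0.0): same as Perceptron's w + y*x
# No-mistake case: margin >= 0 leaves w unchanged
print(sgd_step((1.0, 0.0), (1, 1), 1))    # (1.0, 0.0)
```

With eta = 1 this is exactly the Perceptron's mistake-driven update.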
As seen above, every classifier c ∈ C consists of three regions (two unbounded intervals on the left and right, and a center interval) with alternating labels. We will show that we can always define a decision stump that agrees with the labeling on some two of the three regions.

Note that for every distribution D over R and every partitioning of the line into three such regions, the smallest of the three regions must have weight at most 1/3: the weights must sum to 1, and the smallest region's weight is maximized (at exactly 1/3) when all three regions have equal weight.

Decision stump:

h(x) = b if x <= θ_h
h(x) = -b if x > θ_h

Such a stump can always classify 2 of the 3 regions correctly:

If the stump sets θ_h = B and b = -1, it correctly classifies Region 1 and Region 3.
If the stump sets θ_h = θ_2 and b = +1, it correctly classifies Region 2 and Region 3.
If the stump sets θ_h = θ_1 and b = -1, it correctly classifies Region 1 and Region 2.

For a 3-piece classifier c(x) with b = -1, similar logic shows our decision stump can always classify 2 regions correctly. If we minimize the error, we always classify the two largest regions correctly and misclassify the smallest region, whose weight equals our error. Since the smallest region is guaranteed to have weight at most 1/3, the error is guaranteed to be at most 1/3.

Question 3b:

This is the decision stump we consider:

h(x) = b if x <= θ_h
h(x) = -b if x > θ_h

Assume our training set contains m examples (x_i, y_i). Then the following algorithm minimizes the empirical error:

1. Set error_min = 1.0
2. Set ϵ = 0.000000001
3. Sort the examples so that X is a sorted list with x_1 < x_2 < ... < x_m
4. Iterate over b ∈ {-1, 1}:
   a) Iterate over all x_i in X:
      i. Set error_num = 0
      ii. Choose θ_h = x_i + ϵ
      iii. Create the decision stump h(x) with θ_h as the threshold
      iv. Iterate over all x_j in X:
         A. Compute the prediction y_pred using the decision stump h(x_j)
         B. Get the true label y_j
         C. If y_pred != y_j: error_num = error_num + 1
      v. Compute error = error_num / m
      vi. If error <= error_min, set:
         error_min = error
         b_min = b
         θ_min = θ_h

The output of the procedure is a decision stump with final parameters θ_h = θ_min and b = b_min that minimizes the empirical error; executing this procedure yields an empirical risk minimizer (ERM). The algorithm is efficient and runs in polynomial time, O(m^2), where m is the number of training examples.

Question 3c:

Overly complex models tend not to generalize well and suffer from overfitting (high variance). On the other hand, simple models such as our decision stump tend to generalize better. Since our decision stump has 2 regions while the 3-piece classifier has 3 regions, the stump is the simpler model and we would expect it to need fewer samples to generalize. By part (a), the decision stump ERM has error at most 1/3 = 1/2 − 1/6, so it is a γ-weak learner with γ = 1/6. This lets us conclude that we can weakly learn C using H, and H has less model complexity than C.

Question 4:

Let us define variables:

E = error rate
A = accuracy = 1 − E
β = E / A

We assume that each weak learner has error rate slightly less than 1/2:

E = 1/2 − γ
A = 1/2 + γ
β = (1/2 − γ) / (1/2 + γ)
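The ERM search from Question 3b can be sketched directly in code. This is a minimal illustration under my own naming (`erm_stump`, and a made-up six-point dataset with three equal regions labeled -1 / +1 / -1), not the assignment's reference implementation:

```python
def erm_stump(examples):
    """Return (theta_min, b_min, error_min) for h(x) = b if x <= theta else -b."""
    xs = sorted(x for x, _ in examples)
    m = len(examples)
    eps = 1e-9
    # candidate thresholds: just above each point, plus one below all points
    thresholds = [xs[0] - eps] + [x + eps for x in xs]
    theta_min, b_min, error_min = None, None, 1.1
    for b in (-1, 1):                       # outer loop over the stump's sign
        for theta in thresholds:            # inner loop over all m+1 thresholds
            # count misclassified points for this (b, theta), O(m) per stump
            errors = sum((b if x <= theta else -b) != y for x, y in examples)
            if errors / m <= error_min:
                theta_min, b_min, error_min = theta, b, errors / m
    return theta_min, b_min, error_min

# Three-region labeling (-1 on the left, +1 in the middle, -1 on the right):
data = [(-3, -1), (-2, -1), (0, 1), (1, 1), (3, -1), (4, -1)]
theta, b, err = erm_stump(data)
print(err)   # 1/3: the best stump misclassifies one of the three regions
```

The nested loops over 2 signs and m+1 thresholds, with an O(m) error count inside, give the O(m^2) running time stated above.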
Assume we are at iteration t, and let W be the sum of the weights of all points before iteration t. Let's calculate the weights of the correctly and incorrectly classified points for iteration t + 1.

Correctly classified points are scaled by β under the update rule, so the weight of the correct points under D_{t+1} is determined by the previous total weight W, β, and the accuracy A:

  W · β · A = W · β · (1/2 + γ) = W · (1/2 − γ)    (since β · A = E)

The weight of the incorrect points under D_{t+1} is determined by the previous weight and the error E:

  W · E = W · (1/2 − γ)

The sum of all weights under D_{t+1} is therefore:

  (weight of correct points) + (weight of incorrect points)
  = W · β · (1/2 + γ) + W · (1/2 − γ)
  = 2W · (1/2 − γ)

Thus, with respect to the distribution D_{t+1} generated for the next iteration, the error of h_t is:

  error = (weight of incorrect points) / (sum of all weights)
        = W · (1/2 − γ) / (2W · (1/2 − γ))
        = 1/2

and the accuracy is 1 − 1/2 = 1/2.
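The derivation above can be verified numerically for any edge γ ∈ (0, 1/2). This is a small sketch with my own function name (`reweighted_error`), not part of the assignment:

```python
# Check: after rescaling correct points by beta = E/A, the error of h_t
# measured under D_{t+1} is exactly 1/2, for any edge gamma in (0, 1/2).

def reweighted_error(gamma, W=1.0):
    E = 0.5 - gamma                # error of the weak learner under D_t
    A = 0.5 + gamma                # accuracy under D_t
    beta = E / A
    correct = W * A * beta         # correct points, scaled down by beta
    incorrect = W * E              # incorrect points, unchanged
    return incorrect / (correct + incorrect)

for gamma in (0.01, 0.1, 0.25, 0.4):
    print(reweighted_error(gamma))   # 0.5 every time
```

Algebraically this is immediate: correct = A · β = E, so the ratio is E / (2E) = 1/2 whenever E > 0.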