Problem 2: Bias-Variance Decomposition for Classification

In this problem, you will prove a decomposition for the expected cross-entropy (i.e., logistic) loss for binary class probability estimation (CPE), analogous to the bias-variance decomposition for the expected squared loss for regression that we did in class.

Specifically, consider a binary CPE problem with instance space $\mathcal{X}$ and label space $\mathcal{Y} = \{-1, +1\}$, and let $P$ be an unknown probability distribution on $\mathcal{X} \times \mathcal{Y}$ from which labeled examples are generated. Say you have an algorithm which, given a training dataset $D$, produces a CPE model $h_D : \mathcal{X} \to [0, 1]$. Recall that the logistic loss of such a model on a new example is given by

$$\ell_{\mathrm{CE}}(y, h_D(x)) = \begin{cases} -\ln h_D(x) & \text{if } y = +1, \\ -\ln(1 - h_D(x)) & \text{if } y = -1. \end{cases}$$

Define $\eta(x) = \Pr(y = +1 \mid x)$ to be the conditional probability of the positive label under the true data distribution $P$. The corresponding generalization error is then

$$\mathrm{er}^{\mathrm{CE}}[h_D] = \mathbb{E}_{(x,y)\sim P}\big[\ell_{\mathrm{CE}}(y, h_D(x))\big] = \mathbb{E}_x\big[\mathbb{E}_{y\mid x}[\ell_{\mathrm{CE}}(y, h_D(x))]\big] = \mathbb{E}_x\big[-\eta(x)\ln h_D(x) - (1 - \eta(x))\ln(1 - h_D(x))\big].$$

As in class, we are interested in understanding the expected generalization error over multiple random training datasets $D$:

$$\mathbb{E}_{D\sim P^m}\big[\mathrm{er}^{\mathrm{CE}}[h_D]\big] = \mathbb{E}_x\Big[\mathbb{E}_{D\sim P^m}\big[-\eta(x)\ln h_D(x) - (1 - \eta(x))\ln(1 - h_D(x))\big]\Big].$$

We start with some information-theoretic preliminaries that you will need for this problem.

1. Entropy. The entropy of a probability distribution measures the amount of "randomness" in the distribution. The entropy of a Bernoulli distribution (i.e., coin toss) with parameter $p \in [0, 1]$ is given by
$$H(p) = -p\ln p - (1 - p)\ln(1 - p).$$

2. Cross-entropy. The cross-entropy from one probability distribution to another measures roughly how bad it is to use the first distribution in place of the second. The cross-entropy from a Bernoulli distribution with parameter $q$ (often an "estimated" distribution) to another Bernoulli distribution with parameter $p$ (often a "true" distribution) is given by
$$H(p, q) = -p\ln q - (1 - p)\ln(1 - q).$$

3. Kullback-Leibler (KL) divergence. The KL divergence (also known as relative entropy) is a form of (asymmetric) distance between two probability distributions. As with the cross-entropy, the KL divergence from one probability distribution to another also measures roughly how bad it is to use the first distribution in place of the second. Indeed, it is related to the cross-entropy simply by subtraction of an entropy term. The KL divergence from a Bernoulli distribution with parameter $q$ (the "estimated" distribution) to one with parameter $p$ (the "true" distribution) is given by
$$\mathrm{KL}(p \,\|\, q) = p\ln\frac{p}{q} + (1 - p)\ln\frac{1 - p}{1 - q} = H(p, q) - H(p).$$
This has the property that $\mathrm{KL}(p \,\|\, p) = 0$.

4. "Average" model under KL divergence. In the case of regression under squared loss, an "average" regression model is given by $\bar{h}(x) = \mathbb{E}_{D\sim P^m}[h_D(x)]$; this has the property that it minimizes the expected squared loss. In the case of binary CPE under the cross-entropy (logistic) loss, we will consider an "average" CPE model given by
$$\bar{h}(x) = \frac{1}{Z(x)}\exp\!\big[\mathbb{E}_{D\sim P^m}[\ln h_D(x)]\big], \quad \text{where } Z(x) = \exp\!\big[\mathbb{E}_{D\sim P^m}[\ln h_D(x)]\big] + \exp\!\big[\mathbb{E}_{D\sim P^m}[\ln(1 - h_D(x))]\big].$$

Using the above preliminaries:

1. Show that the expected generalization error can be decomposed as
$$\mathbb{E}_{D\sim P^m}\big[\mathrm{er}^{\mathrm{CE}}[h_D]\big] = \underbrace{\mathbb{E}_x\Big[\mathbb{E}_{D\sim P^m}\big[\mathrm{KL}(\bar{h}(x) \,\|\, h_D(x))\big]\Big]}_{\text{term 1}} + \underbrace{\mathbb{E}_x\big[\mathrm{KL}(\eta(x) \,\|\, \bar{h}(x))\big]}_{\text{term 2}} + \underbrace{\mathbb{E}_x\big[H(\eta(x))\big]}_{\text{term 3}}.$$

2. Give an interpretation for each of the three terms in the above decomposition, and explain how they play an analogous role to the terms in the standard bias-variance decomposition of the expected squared loss for regression.
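One fact that is not stated explicitly in the problem, but follows immediately from the definition of $Z(x)$ and is useful when working through the hints below: the normalizer makes the "average" model treat the two labels symmetrically, since

$$1 - \bar{h}(x) = \frac{Z(x) - \exp\!\big[\mathbb{E}_{D\sim P^m}[\ln h_D(x)]\big]}{Z(x)} = \frac{1}{Z(x)}\exp\!\big[\mathbb{E}_{D\sim P^m}[\ln(1 - h_D(x))]\big],$$

so that $\ln \bar{h}(x) = \mathbb{E}_{D\sim P^m}[\ln h_D(x)] - \ln Z(x)$ and $\ln(1 - \bar{h}(x)) = \mathbb{E}_{D\sim P^m}[\ln(1 - h_D(x))] - \ln Z(x)$.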
Hints for part 1: Show the decomposition for a fixed instance $x$ first, and then take expectations over $x$. Start by showing that

$$\mathrm{KL}(\eta(x) \,\|\, \bar{h}(x)) = \mathbb{E}_{D\sim P^m}\big[\mathrm{KL}(\eta(x) \,\|\, h_D(x))\big] + \ln Z(x).$$

Then show that

$$\ln Z(x) = -\mathbb{E}_{D\sim P^m}\big[\mathrm{KL}(\bar{h}(x) \,\|\, h_D(x))\big].$$

Putting everything together should give the expected result.
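Before writing the proof up, it can be reassuring to check the two hint identities and the final decomposition numerically. The sketch below is only an illustration and not part of the assignment: it fixes a single instance $x$, assumes a made-up value for $\eta(x)$, and lets an arbitrary collection of numbers in $(0, 1)$ stand in for the model outputs $h_D(x)$ across random training sets $D$ (all names and values in the code are these assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions, not from the problem): fix a single
# instance x, so the outer expectation over x can be dropped, and let the
# empirical distribution of `h_outputs` play the role of the distribution
# of h_D(x) over random training sets D ~ P^m.
eta = 0.3                                        # assumed eta(x) = Pr(y = +1 | x)
h_outputs = rng.uniform(0.05, 0.95, size=1000)   # assumed values of h_D(x)

def H(p):
    """Entropy of a Bernoulli(p) distribution (natural log)."""
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) from Bernoulli(q) to Bernoulli(p)."""
    return -p * np.log(q) - (1 - p) * np.log(1 - q)

def kl(p, q):
    """KL divergence KL(p || q) between Bernoulli distributions."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# "Average" CPE model under KL divergence, as defined in preliminary 4.
Z = np.exp(np.mean(np.log(h_outputs))) + np.exp(np.mean(np.log(1 - h_outputs)))
h_bar = np.exp(np.mean(np.log(h_outputs))) / Z

# Left-hand side at this x: expected cross-entropy error over datasets D.
lhs = np.mean(cross_entropy(eta, h_outputs))

# Right-hand side: the three terms of the claimed decomposition at this x.
term1 = np.mean(kl(h_bar, h_outputs))   # E_D[ KL(h_bar(x) || h_D(x)) ]  ("variance")
term2 = kl(eta, h_bar)                  # KL( eta(x) || h_bar(x) )       ("bias")
term3 = H(eta)                          # H( eta(x) )                    ("noise")

# Check the two hint identities and the final decomposition.
assert np.isclose(kl(eta, h_bar), np.mean(kl(eta, h_outputs)) + np.log(Z))
assert np.isclose(np.log(Z), -np.mean(kl(h_bar, h_outputs)))
assert np.isclose(lhs, term1 + term2 + term3)
print(f"lhs = {lhs:.6f}, term1 + term2 + term3 = {term1 + term2 + term3:.6f}")
```

The comments labeling the three terms "variance", "bias", and "noise" anticipate the interpretation that part 2 asks you to justify. The identities hold exactly for any $\eta(x)$ and any distribution of $h_D(x)$ values strictly between 0 and 1, so a failing assertion would point to an error in the algebra rather than in the toy numbers.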
Please do not rely too heavily on ChatGPT, because its answer may be wrong. Please think the problem through carefully and give your own answer. You can borrow ideas from GPT, but please do not take its answer at face value. Very grateful, thank you!