CSE543_Au23_HW1
University of Washington
CSE543 Deep Learning Homework 1
October 2, 2023

Deadline: Oct 24th. Include your name and UW NetID in your submission. You only need to submit one PDF. Homework must be typed. You can use any typesetting software you wish (LaTeX, Markdown, MS Word, etc.). You may discuss assignments with others, but you must write down the solutions by yourself.

1 Univariate approximation with shallow ReLU networks (6 Points)

Let $g : [0, 1] \to \mathbb{R}$ be an arbitrary twice-differentiable function with $g(0) = g'(0) = 0$.

Q1.1 (3 Points) Prove that $g(x)$ can be represented as an infinite-width ReLU network by deriving the following:
$$g(x) = \int_0^1 \sigma(x - b)\, g''(b)\, db,$$
where $\sigma$ is the ReLU function. Hint: use the fundamental theorem of calculus.

Q1.2 (3 Points) Suppose $|g''| \le \beta$ over $[0, 1]$ for some $\beta > 0$, and let $\epsilon > 0$ be given. Prove that there exists a ReLU network $f(x) = \sum_{i=1}^m a_i \sigma(x - b_i)$ with $m \le \lceil \beta / \epsilon \rceil$ and $\|f - g\|_\infty \le \epsilon$.

2 A nice property of positive homogeneity (10 points)

Suppose $f : \mathbb{R}^d \to \mathbb{R}$ is locally Lipschitz and positively homogeneous of degree $L$ (a function $g$ is positively homogeneous of degree $L$ if $g(\alpha x) = \alpha^L g(x)$ for any $\alpha \ge 0$). We will prove that for any given $x \in \mathbb{R}^d$ and any $s \in \partial f(x)$, we have $\langle s, x \rangle = L f(x)$. Here $\partial f(x)$ is the Clarke differential.

Q2.1 (2 Points) Show that when $x = 0$ and $s \in \partial f(x)$, we have $\langle s, x \rangle = L f(x)$.

Q2.2 (5 Points) Show that for all $x \ne 0$ such that $\nabla f(x)$ exists, $\langle \nabla f(x), x \rangle = L f(x)$. Hint: you can use the following basic property of the gradient:
$$\lim_{\delta \to 0} \frac{f(x + \delta x) - f(x) - \langle \nabla f(x), \delta x \rangle}{\delta \|x\|} = 0.$$

Q2.3 (3 Points) Use the definition of the Clarke differential to show that for any given $x \in \mathbb{R}^d$ and any $s \in \partial f(x)$, we have $\langle s, x \rangle = L f(x)$.
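As a quick numerical sanity check on the representation in Q1.1 (an illustration only, not part of the required proof), the sketch below integrates $\sigma(x - b)\, g''(b)$ with a midpoint rule for the concrete choice $g(x) = x^3$, which satisfies $g(0) = g'(0) = 0$ and has $g''(b) = 6b$. All function names here are illustrative, not part of the assignment.

```python
# Numerical sanity check of the integral representation in Q1.1,
# using g(x) = x^3, which satisfies g(0) = g'(0) = 0 and g''(b) = 6b.

def relu(u):
    return max(u, 0.0)

def g(x):
    return x ** 3

def g_pp(b):
    # Second derivative of g.
    return 6.0 * b

def integral_representation(x, n=100_000):
    """Midpoint-rule approximation of the integral of relu(x - b) * g''(b) over [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        b = (i + 0.5) * h
        total += relu(x - b) * g_pp(b)
    return total * h

for x in (0.25, 0.5, 1.0):
    print(f"x = {x}: integral = {integral_representation(x):.6f}, g(x) = {g(x):.6f}")
```

For this choice of $g$ the integral can also be done by hand: $\int_0^x (x - b)\, 6b\, db = x^3$, which the printed values should match to several decimal places.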
3 Auto-differentiation (10 points)

Consider the following function:
$$f(w_1, w_2) = \left(\sin(2\pi w_1 / w_2) + 3 w_1 / w_2 - \exp(2 w_2)\right) \cdot \left(3 w_1 / w_2 - \exp(2 w_2)\right).$$

Suppose our program for this function uses the following evaluation trace:

Input: $z_0 = w = (w_1, w_2)$.
1. $z_1 = w_1 / w_2$.
2. $z_2 = \sin(2 \pi z_1)$.
3. $z_3 = \exp(2 w_2)$.
4. $z_4 = 3 z_1 - z_3$.
5. $z_5 = z_2 + z_4$.
6. $z_6 = z_4 z_5$.
Output: $z_6$.

Q3.1 Reverse mode of auto-differentiation (3 Points) Consider the reverse mode to compute the gradient $df/dw$, which is a two-dimensional vector. Explicitly write out the reverse mode in this example: write the pseudocode computing all the intermediate derivatives as you would actually compute them in an implementation. You may assume you have evaluated the trace and have already stored $(z_0, \ldots, z_6)$. Your pseudocode for computing $\frac{dz_6}{dz_t}$ should only be written as a function of $z_1, \ldots, z_6$ and the derivatives $\frac{dz_6}{dz_{t+1}}, \ldots, \frac{dz_6}{dz_6}$ (as specified by the reverse-mode algorithm). For initialization, let $\frac{dz_6}{dz_6} = 1$. In particular, it must be written in the manner in which the reverse mode is implemented, as taught in class (so that it is an efficient algorithm). You should have roughly the same number of lines of pseudocode as for computing the original function.

Q3.2 (3 points) Now we consider another approach for auto-differentiation, conceptually simpler than the reverse mode. This question asks you to compute $\frac{dz_6}{dw_1}$. The forward mode computes the derivatives in a sequential manner, directly using the previous variables and the chain rule. The idea is as follows: at time $t$, we
(a) Compute $z_1, \ldots, z_6$ as usual.
(b) Compute $\frac{dz_t}{dw_1}$ for $t = 1, \ldots, 6$ using the previously computed derivatives $\frac{dz_1}{dw_1}, \ldots, \frac{dz_{t-1}}{dw_1}$ and the chain rule:
$$\frac{dz_t}{dw_1} = \sum_{\tau < t} \frac{\partial z_t}{\partial z_\tau} \frac{\partial z_\tau}{\partial w_1}.$$
Explicitly write out the forward mode in our example: write the pseudocode computing all the intermediate derivatives as you would actually compute them in an implementation. You will write out a series of steps (the pseudocode) in which you compute both $z_t$ and its derivative $\frac{dz_t}{dw_1}$. Your pseudocode for computing $\frac{dz_t}{dw_1}$ should only be written as a function of $z_1, \ldots, z_{t-1}$ and the previous derivatives $\frac{dz_1}{dw_1}, \ldots, \frac{dz_{t-1}}{dw_1}$.
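To make the trace concrete, the sketch below evaluates it in plain Python and checks the output $z_6$ against the closed-form $f$, then adds a central-difference estimate of $df/dw_1$ that is handy for checking whichever mode of auto-differentiation you implement. It assumes the trace as reconstructed above ($z_4 = 3 z_1 - z_3$, $z_6 = z_4 z_5$) and deliberately does not write out the reverse- or forward-mode pseudocode that Q3.1 and Q3.2 ask for.

```python
import math

# Evaluate the trace from Section 3 and check it against the closed form.
# Assumes z4 = 3*z1 - z3 and z6 = z4 * z5, as in the trace above.

def f_direct(w1, w2):
    return (math.sin(2 * math.pi * w1 / w2) + 3 * w1 / w2 - math.exp(2 * w2)) \
           * (3 * w1 / w2 - math.exp(2 * w2))

def f_trace(w1, w2):
    z1 = w1 / w2
    z2 = math.sin(2 * math.pi * z1)
    z3 = math.exp(2 * w2)
    z4 = 3 * z1 - z3
    z5 = z2 + z4
    z6 = z4 * z5
    return z6

w1, w2 = 0.7, 0.3
assert abs(f_direct(w1, w2) - f_trace(w1, w2)) < 1e-12

# Central-difference estimate of df/dw1 at (w1, w2), for debugging
# a hand-written reverse- or forward-mode implementation.
eps = 1e-6
fd = (f_trace(w1 + eps, w2) - f_trace(w1 - eps, w2)) / (2 * eps)
print(f"f = {f_trace(w1, w2):.6f}, finite-difference df/dw1 = {fd:.6f}")
```

A finite-difference check like this will not match an auto-diff gradient exactly, but agreement to four or five significant digits is a strong sign the pseudocode is right.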
Q3.3 (4 points) Now we consider the forward mode of auto-differentiation for general functions.

Q3.3.1 (2 points) Suppose we have a general function $f : \mathbb{R}^d \to \mathbb{R}$, and we want to compute $\frac{df}{dw}$. Describe how you would do this with the forward mode.

Q3.3.2 (2 points) Let $T$ denote the computation time to compute $f(w)$. Obtain an upper bound on the computation time of $\frac{df}{dw}$ in big-O notation, in terms of $T$ and $d$. Remark: the time for computing $\frac{df}{dw}$ using the reverse mode is $O(T)$.

4 Programming problem: optimizers and hyper-parameters (12 points)

In this problem, you will experiment with different optimizers on the MNIST dataset with different hyper-parameters. We recommend using PyTorch. If you plan to use PyTorch, you can use DataLoader to load the MNIST dataset directly. [1]

Q4.1 (10 points) Use LeNet [2] with the ReLU activation function, Kaiming initialization with default hyperparameters (either the uniform-distribution version or the normal-distribution version) [3], and logistic loss. You are asked to perform experiments using gradient descent, stochastic gradient descent, AdaGrad, and Adam. Since this problem asks you to choose your own hyper-parameters, you will receive full credit as long as your plots are reasonable. An example plot is shown below. For each figure, you need to list all the hyperparameters used (including those kept fixed), either in the text or in the figure caption. Please attach your code in your submission.

(a) Gradient descent [4] (2 points): choose three learning rates and plot the learning curves in one figure; each curve corresponds to one learning rate.

(b) Stochastic gradient descent [5] (2 points): choose two different batch sizes and three learning rates and plot two figures; each figure corresponds to one batch size and each curve corresponds to one learning rate.
(c) AdaGrad [6] (3 points): fix a reasonable batch size (one for which the loss is overall decreasing) and an initial accumulator value ($\tau$ in the PyTorch implementation). Choose three learning-rate decays ($\eta$ in the PyTorch implementation) and three learning rates. Plot three figures; each figure corresponds to one learning-rate decay and each curve corresponds to one learning rate.

(d) Adam [7] (3 points): fix a reasonable batch size and a learning rate. Try three different values of the gradient momentum ($\beta_1$ in the PyTorch implementation) and three different values of the adaptivity momentum ($\beta_2$ in the PyTorch implementation). Plot three figures; each figure corresponds to one $\beta_1$ and each curve corresponds to one $\beta_2$.

[1] https://pytorch.org/vision/stable/datasets.html
[2] https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial
[3] https://pytorch.org/docs/stable/nn.init
[4] A.k.a. full-batch stochastic gradient descent.
[5] https://pytorch.org/docs/stable/generated/torch.optim.SGD
[6] https://pytorch.org/docs/stable/generated/torch.optim.Adagrad
[7] https://pytorch.org/docs/stable/generated/torch.optim.Adam
Figure 1: batch size = 32, number of epochs = 10, $\beta_1 = 0.99$, $\beta_2 = 0.999$.

Q4.2 (2 points) Discuss the advantages and disadvantages of the different algorithms from the optimization point of view (one paragraph). You can discuss hyper-parameter sensitivity, convergence rates, etc. This is an open-ended question and any reasonable answer will receive full credit.