Notes on Attention and the Transformer Model

1 Introduction

Two “core” tasks: machine translation and language modeling.

Many other tasks: part-of-speech tagging, named entity recognition, coreference resolution, semantic role labeling, question answering, textual entailment, sentiment analysis, semantic parsing, etc.

Goal today: build a language model. Why? “Representations” of the language may be helpful for many tasks.

Interesting questions: memory, question answering, reasoning/logic.

2 A Short Summary of Some Improvements

A progression: linguistics (grammars, parse trees), then statistical machine learning, then “deep” models. Brown clustering, n-gram models, IBM translation models. Lots of work on neural embeddings.

MT: rule-based machine translation, then statistical MT (the IBM translation models), then a series of “deep learning” based approaches:

* One of the first end-to-end models, with an “encoder-decoder” architecture: “Recurrent Continuous Translation Models” (Kalchbrenner & Blunsom, 2013).
* Seq2Seq: using sequential neural models was a good first step. “Sequence to Sequence Learning with Neural Networks” (Sutskever et al., 2014).
* A series of papers started incorporating “attention”, where one directly tries to utilize long-range dependencies in the representation. The idea is that these long-range dependencies help when translating a given word (the broader context is important). Now, all state-of-the-art methods use some form of “attention”. The Transformer is one of the most popular ones: “Attention Is All You Need” (Vaswani et al., 2017).

Transfer learning: how can we make learning easier by transferring knowledge from one task to another? There are recent exciting results showing that representations extracted from a good language model can help with this. NAACL best paper: “Deep Contextualized Word Representations” (Peters et al., 2018). Another improvement with pretraining: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018).

3 Datasets, Tasks, and some (important) details

3.1 Datasets and Objectives for Language Modelling

Machine translation: translate one sentence to another sentence. The BLEU score is used.

Language modeling: the goal is to learn a model over documents/sequences, where given a document d = w_{1:T} (a sequence of words or characters) our model provides a probability \widehat{\Pr}(d) = \widehat{\Pr}(w_{1:T}). Note that we often specify this joint distribution by the conditional distributions \widehat{\Pr}(w_{t+1} \mid w_{1:t}), where the w's are the words.

The performance measure: if D is the true distribution, we measure the quality of our model by the cross-entropy rate:

  \mathrm{CrossEnt}(\widehat{\Pr} \,\|\, D) := \frac{1}{T} \, \mathbb{E}_{w_{1:T} \sim D}\Big[ -\log \widehat{\Pr}(w_{1:T}) \Big] = \frac{1}{T} \, \mathbb{E}_{w_{1:T} \sim D}\Big[ -\sum_t \log \widehat{\Pr}(w_{t+1} \mid w_{1:t}) \Big]

The perplexity is defined as \exp(\mathrm{CrossEnt}(\widehat{\Pr} \,\|\, D)). Intuitively, think of this as the number of plausible candidate alternative words that our model is suggesting.
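As a concrete illustration, here is a minimal NumPy sketch of the definition above: it computes the empirical cross-entropy rate and perplexity from a model's per-step conditional probabilities. The probabilities below are made-up placeholders (not from any real model), and NumPy itself is an assumption; the notes contain no code.

```python
import numpy as np

# Hypothetical per-step conditional probabilities Pr(w_{t+1} | w_{1:t}) that some
# model assigned to the words that actually occurred in a length-T document.
cond_probs = np.array([0.20, 0.05, 0.50, 0.10, 0.01, 0.30])

T = len(cond_probs)
cross_ent_rate = -np.log(cond_probs).sum() / T   # (1/T) * sum_t -log Pr(w_{t+1} | w_{1:t})
perplexity = np.exp(cross_ent_rate)              # exp of the cross-entropy rate

print(cross_ent_rate, perplexity)   # e.g. a uniform model over m words would give perplexity m
```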
Examples:

* Using a uniform distribution over m words gives a perplexity of m.
* Using the (estimated) unigram distribution gives a perplexity of about 1000.
* Shannon, in his paper “Prediction and Entropy of Printed English” (1951), estimated 0.6 to 1.3 bits/character (using human prediction of letters). This translates to 4.5 bits/word, using 1 bit/character and 4.5 characters/word, which gives a perplexity of 2^{4.5} ≈ 23.
* On the PTB dataset, the best perplexity is about 55-60 (on the validation set).
* The best character-level entropy rate is about 1.2 bits/character. This translates to about 77 in word-level perplexity units (to see this, use 2^{1.175 \cdot 390000/74000}, since there are 390,000 characters and 74,000 words in the validation set; a small conversion sketch is given at the end of this subsection).
* There are other ‘codings’, like BPE (byte pair encoding) and subwords. One can translate perplexities between different codings, provided they can faithfully represent the document/sequence.

Concerns: memory and long-term dependencies may not be reflected in this metric? Other ideas: RL, logic, meaning?

Datasets used for language modeling:

* Penn Tree Bank (PTB): the first collection. 1M words, 10K vocabulary size (based on standardization).
* WikiText-2 (2M words) and WikiText-103 (103M words). Scraped from Wikipedia articles passing a certain quality/length threshold, on all topics. 300K vocabulary size (each vocabulary word appears more than 3 times).
* Google Billion Words: web crawl, assorted topics. 1B words, 800K vocabulary size.
* Books corpus: 11K public-domain novels. 1B words.

Training: GPUs/TPUs are needed. Books/Billion Words takes GPU-weeks to a month to train (for all standard models); a TPU takes a few days.
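To make the character-level-to-word-level conversion above concrete, here is a small sketch of the arithmetic. The bits-per-character figures and the validation-set counts are the ones quoted above; everything else is just unit conversion.

```python
import math

# Shannon-style estimate: 1 bit/character at 4.5 characters/word.
bits_per_word = 1.0 * 4.5
print(2 ** bits_per_word)          # ~22.6, i.e. a word-level perplexity of about 23

# PTB character-level model: ~1.175 bits/character on a validation set with
# 390,000 characters and 74,000 words, so bits/word = bits/char * chars/word.
bits_per_char = 1.175
chars, words = 390_000, 74_000
word_ppl = 2 ** (bits_per_char * chars / words)
print(word_ppl)                    # roughly the 70-80 word-level perplexity quoted above
```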
3.2 The details matter: training and overfitting

The details matter a lot for training language models. In contrast, in visual object recognition, once we move to “ResNet”-style architectures, training is relatively easy: overfitting, hyperparameter tuning, and regularization are not major concerns. In fact, simple “early stopping” on vanilla SGD training is often non-trivially competitive for any reasonable model.

* Overfitting is very real in language models, in contrast to vision. PTB with a trigram model (i.e., predict the next word from the previous two words) reaches about 20 train perplexity (in about 4 epochs) with 150 perplexity on validation. PTB with an LSTM+dropout reaches about 30 train perplexity (in about 500 epochs) with 60 perplexity on validation.
* Dropout: this is needed, and it is used everywhere in these networks. L2 regularization alone is not comparable (it is very brittle, and even when highly tuned it is not as good).
* (Averaged) SGD or ADAM? Sometimes one algorithm is much better than the other.
* Exploding gradients: these occur in practice.
* Vanishing gradients: lots of discussion on this; it is unclear what is going on.
* Dynamic evaluation: keep training at test time to handle topic drift. This squeezes out about 5-10 perplexity points on validation (for PTB); a minimal sketch follows this list.
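A minimal sketch of dynamic evaluation, assuming PyTorch is available. The model, data, and learning rate are hypothetical stand-ins (a toy embedding-plus-linear scorer on random tokens), chosen only to show the shape of the procedure: each validation segment is scored first, and only then used for a gradient step, so the model adapts before it sees the next segment.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 10_000, 64
model = torch.nn.Sequential(                       # stand-in next-word model
    torch.nn.Embedding(vocab_size, 128),
    torch.nn.Linear(128, vocab_size),
)
val_tokens = torch.randint(vocab_size, (4096,))    # stand-in validation token stream
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

total_nll, total_count = 0.0, 0
for start in range(0, len(val_tokens) - seq_len - 1, seq_len):
    x = val_tokens[start:start + seq_len]
    y = val_tokens[start + 1:start + seq_len + 1]
    logits = model(x)                              # (seq_len, vocab_size)
    loss = F.cross_entropy(logits, y)              # score the segment while it is still held out
    total_nll += loss.item() * seq_len
    total_count += seq_len
    # dynamic evaluation: take a gradient step on the segment we just scored,
    # so the model adapts to topic drift before scoring the next segment
    opt.zero_grad()
    loss.backward()
    opt.step()

print("dynamic-eval perplexity:", torch.exp(torch.tensor(total_nll / total_count)).item())
```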
4 The Transformer

Let x be the input sequence of size T × m, where T is the sequence length (often 512) and m is the vocabulary size (often in the range 10^4 to 10^5). We will describe a one-hidden-layer transformer that predicts the next word.

Parameters:

  E ∈ R^{m × d_embedding}
  W_V, W_Q, W_K ∈ R^{d_embedding × d_hidden}
  W_1 ∈ R^{d_hidden × d_1}
  W_2 ∈ R^{d_1 × d_embedding}

1. Embed the sequence: x ← xE + P, so now x is of size T × d_embedding. Here, P is the positional encoding. One common choice is
     P_{t, 2i} = \sin(t / 10000^{2i/d_embedding}),   P_{t, 2i+1} = \cos(t / 10000^{2i/d_embedding}),
   where i indexes the embedding dimension and t the sequence position. Note that P is often a fixed choice (and not a learned parameter).

2. Compute the “values”, “queries”, and “keys”: V = xW_V, Q = xW_Q, K = xW_K, each of size T × d_hidden.

3. Compute the attention-weighted combination:
     h = softmax(QK^T / \sqrt{d_hidden}) \, V,
   so h is of size T × d_hidden. Importantly, note that QK^T is a T × T matrix. Here, abusing notation, the vector-valued softmax(·) function is applied to every row of the matrix QK^T. Recall that the vector-valued softmax(·) is defined so that its i-th component is
     [softmax(v)]_i := \exp(v_i) / \sum_j \exp(v_j).
   Note that each row of softmax(QK^T) sums to 1: the idea is that we want a convex combination of the rows of V.

4. The output after two transformations is then
     O = \mathrm{ReLU}(\mathrm{ReLU}(hW_1) W_2),
   which is of size T × d_embedding.

5. The prediction that the next word in the sequence, X_{T+1}, is the j-th word is then
     \tilde{O} = OE^T,   \widehat{\Pr}(X_{T+1} = j) = [softmax(\tilde{O}_T)]_j,
   i.e., only the last row \tilde{O}_T is used for prediction. Here, we have coupled the embedding weights and the prediction weights: both use the matrix E.
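The following is a minimal NumPy sketch of the forward pass just described, with parameter names mirroring the notes (E, W_V, W_Q, W_K, W_1, W_2). The tiny dimensions, the random initialization, and the helper functions are assumptions made purely for illustration; a real model would learn these parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
T, m = 8, 50                     # sequence length, vocabulary size (toy values)
d_emb, d_hid, d_1 = 16, 16, 32

E   = rng.normal(size=(m, d_emb))      # embedding (tied with the output projection)
W_V = rng.normal(size=(d_emb, d_hid))
W_Q = rng.normal(size=(d_emb, d_hid))
W_K = rng.normal(size=(d_emb, d_hid))
W_1 = rng.normal(size=(d_hid, d_1))
W_2 = rng.normal(size=(d_1, d_emb))

def softmax(v, axis=-1):
    v = v - v.max(axis=axis, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=axis, keepdims=True)

def positional_encoding(T, d):
    P = np.zeros((T, d))
    t = np.arange(T)[:, None]
    i = np.arange(0, d, 2)[None, :]          # even embedding indices 2i
    P[:, 0::2] = np.sin(t / 10000 ** (i / d))
    P[:, 1::2] = np.cos(t / 10000 ** (i / d))
    return P

def transformer_next_word_probs(word_ids):
    x = np.eye(m)[word_ids]                                   # one-hot sequence, T x m
    x = x @ E + positional_encoding(len(word_ids), d_emb)     # 1. embed + position, T x d_emb
    V, Q, K = x @ W_V, x @ W_Q, x @ W_K                       # 2. values/queries/keys, T x d_hid
    h = softmax(Q @ K.T / np.sqrt(d_hid)) @ V                 # 3. row-wise softmax attention, T x d_hid
    O = np.maximum(0, np.maximum(0, h @ W_1) @ W_2)           # 4. two ReLU transformations, T x d_emb
    O_tilde = O @ E.T                                         # 5. tied output projection, T x m
    return softmax(O_tilde[-1])                               # only the last position predicts X_{T+1}

probs = transformer_next_word_probs(rng.integers(m, size=T))
print(probs.shape, probs.sum())                               # (50,) and 1.0
```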
4.1 Invariances and other observations

Define M = QK^T, which is of size T × T.

Hidden state interpretation and ‘sequential’ training/scoring: the above description is a model for \widehat{\Pr}(w_{T+1} \mid w_{1:T}). We may be interested in the model predicting \widehat{\Pr}(w_{t+1} \mid w_{1:t}) for t < T (where often T = 512), i.e. we may want to make multiple predictions simultaneously (say for training). For this, there is a way to use ‘masking’ with an upper-triangular matrix so that, for all t ≤ T,
  \widehat{\Pr}(X_{t+1} = j \mid w_{1:t}) = [softmax(\tilde{O}_t)]_j.

Suppose P = 0 (no positional encoding). The matrix M is shift invariant in the sense that if we translate the sequence by τ, then the (i, j) entry gets shifted to (i+τ, j+τ) (provided these are in bounds). Similarly, h_i → h_{i+τ}.

Lemma. Let Q, K, V ∈ R^{T × d_hidden}, and let Π be a T-by-T permutation matrix. Then
  σ((ΠQ)(ΠK)^T) ΠV = Π σ(QK^T) V,
where σ(·) is the row-wise softmax of a matrix. In particular, suppose P = 0; then if we permute the sequence, i.e. x → Πx, we have h → Πh.

Proof. The LHS is σ(ΠQK^TΠ^T) ΠV. It suffices to show that σ(ΠQK^TΠ^T) = Π σ(QK^T) Π^T. Indeed, letting π denote the permutation specified by Π, so that (ΠQK^TΠ^T)_{i,j} = (QK^T)_{π(i), π(j)}, we have
  [σ(ΠQK^TΠ^T)]_{i,j} = \frac{e^{(QK^T)_{π(i), π(j)}}}{\sum_{j'=1}^{T} e^{(QK^T)_{π(i), π(j')}}} = \frac{e^{(QK^T)_{π(i), π(j)}}}{\sum_{j'=1}^{T} e^{(QK^T)_{π(i), j'}}} = [Π σ(QK^T) Π^T]_{i,j},
where the second equality holds because summing over a permutation of the column indices is the same as summing over all of them. (A small numerical check of both the masking and the lemma is sketched at the end of this subsection.)

Computation: the transformer computations are very efficient due to the manner in which the matrix multiplications can be parallelized. In contrast, the LSTM fundamentally needs a for-loop over the history. (The LSTM is a circuit with greater depth.)
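Here is a small NumPy check of the two observations above: causal masking via a strictly upper-triangular mask, and the permutation lemma. The sizes and random inputs are arbitrary assumptions; the code only verifies the algebra numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

def softmax(v, axis=-1):
    v = v - v.max(axis=axis, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=axis, keepdims=True)

# (i) causal masking: set scores above the diagonal to -inf before the softmax,
# so row t of the attention weights is supported on positions 1..t only.
scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # strictly upper-triangular
weights = softmax(np.where(mask, -np.inf, scores))
print(np.allclose(np.triu(weights, k=1), 0.0))      # True: no attention to future positions
h_causal = weights @ V                               # causal attention output, T x d

# (ii) permutation equivariance (no positional encoding, no scaling, as in the lemma)
Pi = np.eye(T)[rng.permutation(T)]                   # a random T x T permutation matrix
lhs = softmax((Pi @ Q) @ (Pi @ K).T) @ (Pi @ V)
rhs = Pi @ (softmax(Q @ K.T) @ V)
print(np.allclose(lhs, rhs))                         # True
```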
5 Acknowledgements

These notes were based on discussions with Xinyi Chen, Karthik Narashiman, Cyril Zhang, and Yi Zhang.
