Notes on Attention and the Transformer Model

1 Introduction

Two “core” tasks: machine translation and language modeling.

Many other tasks: part-of-speech tagging, named entity recognition, coreference resolution, semantic role labeling, question answering, textual entailment, sentiment analysis, semantic parsing, etc.

Goal today: build a language model. Why? “Representations” of the language may be helpful for many tasks.

Interesting questions: memory, question answering, reasoning/logic.

2 A Short Summary of Some Improvements

A progression: linguistics (grammars, parse trees), then statistical machine learning, then “deep” models. Brown clustering, n-gram models, IBM translation models. Lots of work on neural embeddings.

MT: rule-based machine translation, then statistical MT (the IBM translation models), then a series of “deep learning” based approaches:

* One of the first end-to-end models, with an “encoder-decoder” architecture: “Recurrent Continuous Translation Models” (Kalchbrenner & Blunsom, 2013).
* Seq2Seq: using sequential neural models was a good first step. “Sequence to Sequence Learning with Neural Networks” (Sutskever et al., 2014).
* A series of papers started incorporating “attention”, where one directly tries to utilize long-range dependencies in the representation. The idea is that these long-range dependencies help when translating a given word (the broader context is important). Now, all state-of-the-art methods use some form of “attention”. The Transformer is one of the most popular ones: “Attention Is All You Need” (Vaswani et al., 2017).

Transfer learning: how can we make learning easier by transferring knowledge from one task to another? There are recent exciting results showing that representations extracted from a good language model can help with this. NAACL best paper: “Deep Contextualized Word Representations” (Peters et al., 2018). Another improvement with pretraining: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018).

3 Datasets, Tasks, and some (important) details

3.1 Datasets and Objectives for Language Modelling

Machine translation: translate one sentence to another sentence. The BLEU score is used.

Language modeling: the goal is to learn a model over documents/sequences, where given a document d = w_{1:T} (a sequence of words or characters) our model provides a probability \widehat{\Pr}(d) = \widehat{\Pr}(w_{1:T}). Note that we often specify this joint distribution by the conditional distributions \widehat{\Pr}(w_{t+1} \mid w_{1:t}), where the w's are the words.

The performance measure: if D is the true distribution, we measure the quality of our model by the cross-entropy rate:

  \mathrm{CrossEnt}(\widehat{\Pr} \,\|\, D) := \frac{1}{T} \, \mathbb{E}_{w_{1:T} \sim D}\Big[ -\log \widehat{\Pr}(w_{1:T}) \Big] = \frac{1}{T} \, \mathbb{E}_{w_{1:T} \sim D}\Big[ -\sum_t \log \widehat{\Pr}(w_{t+1} \mid w_{1:t}) \Big]

The perplexity is defined as \exp(\mathrm{CrossEnt}(\widehat{\Pr} \,\|\, D)). Intuitively, think of this as the number of plausible candidate alternative words that our model is suggesting.
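As a concrete illustration, here is a minimal NumPy sketch of the definition above: it computes the empirical cross-entropy rate and perplexity from a model's per-step conditional probabilities. The probabilities below are made-up placeholders (not from any real model), and NumPy itself is an assumption; the notes contain no code.

```python
import numpy as np

# Hypothetical per-step conditional probabilities Pr(w_{t+1} | w_{1:t}) that some
# model assigned to the words that actually occurred in a length-T document.
cond_probs = np.array([0.20, 0.05, 0.50, 0.10, 0.01, 0.30])

T = len(cond_probs)
cross_ent_rate = -np.log(cond_probs).sum() / T   # (1/T) * sum_t -log Pr(w_{t+1} | w_{1:t})
perplexity = np.exp(cross_ent_rate)              # exp of the cross-entropy rate

print(cross_ent_rate, perplexity)   # e.g. a uniform model over m words would give perplexity m
```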
Examples:

* Using a uniform distribution over m words gives a perplexity of m.
* Using the (estimated) unigram distribution gives a perplexity of about 1000.
* Shannon, in his paper “Prediction and Entropy of Printed English” (1951), estimated 0.6 to 1.3 bits/character (using human prediction of letters). This translates to 4.5 bits/word, using 1 bit/character and 4.5 characters/word, which gives a perplexity of 2^{4.5} ≈ 23.
* On the PTB dataset, the best perplexity is about 55-60 (on the validation set).
* The best character-level entropy rate is about 1.2 bits/character. This translates to about 77 in word-level perplexity units (to see this, use 2^{1.175 \cdot 390000/74000}, since there are 390,000 characters and 74,000 words in the validation set; a small conversion sketch is given at the end of this subsection).
* There are other ‘codings’, like BPE (byte pair encoding) and subwords. One can translate perplexities between different codings, provided they can faithfully represent the document/sequence.

Concerns: memory and long-term dependencies may not be reflected in this metric? Other ideas: RL, logic, meaning?

Datasets used for language modeling:

* Penn Tree Bank (PTB): the first collection. 1M words, 10K vocabulary size (based on standardization).
* WikiText-2 (2M words) and WikiText-103 (103M words). Scraped from Wikipedia articles passing a certain quality/length threshold, on all topics. 300K vocabulary size (each vocabulary word appears more than 3 times).
* Google Billion Words: web crawl, assorted topics. 1B words, 800K vocabulary size.
* Books corpus: 11K public-domain novels. 1B words.

Training: GPUs/TPUs are needed. Books/Billion Words takes GPU-weeks to a month to train (for all standard models); a TPU takes a few days.
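To make the character-level-to-word-level conversion above concrete, here is a small sketch of the arithmetic. The bits-per-character figures and the validation-set counts are the ones quoted above; everything else is just unit conversion.

```python
import math

# Shannon-style estimate: 1 bit/character at 4.5 characters/word.
bits_per_word = 1.0 * 4.5
print(2 ** bits_per_word)          # ~22.6, i.e. a word-level perplexity of about 23

# PTB character-level model: ~1.175 bits/character on a validation set with
# 390,000 characters and 74,000 words, so bits/word = bits/char * chars/word.
bits_per_char = 1.175
chars, words = 390_000, 74_000
word_ppl = 2 ** (bits_per_char * chars / words)
print(word_ppl)                    # roughly the 70-80 word-level perplexity quoted above
```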
3.2 The details matter: training and overfitting

The details matter a lot for training language models. In contrast, in visual object recognition, once we move to “ResNet”-style architectures, training is relatively easy: overfitting, hyperparameter tuning, and regularization are not major concerns. In fact, simple “early stopping” on vanilla SGD training is often non-trivially competitive for any reasonable model.

* Overfitting is very real in language models, in contrast to vision. PTB with a trigram model (i.e., predict the next word from the previous two words) reaches about 20 train perplexity (in about 4 epochs) with 150 perplexity on validation. PTB with an LSTM+dropout reaches about 30 train perplexity (in about 500 epochs) with 60 perplexity on validation.
* Dropout: this is needed, and it is used everywhere in these networks. L2 regularization alone is not comparable (it is very brittle, and even when highly tuned it is not as good).
* (Averaged) SGD or ADAM? Sometimes one algorithm is much better than the other.
* Exploding gradients: these occur in practice.
* Vanishing gradients: lots of discussion on this; it is unclear what is going on.
* Dynamic evaluation: keep training at test time to handle topic drift. This squeezes out about 5-10 perplexity points on validation (for PTB); a minimal sketch follows this list.
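A minimal sketch of dynamic evaluation, assuming PyTorch is available. The model, data, and learning rate are hypothetical stand-ins (a toy embedding-plus-linear scorer on random tokens), chosen only to show the shape of the procedure: each validation segment is scored first, and only then used for a gradient step, so the model adapts before it sees the next segment.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 10_000, 64
model = torch.nn.Sequential(                       # stand-in next-word model
    torch.nn.Embedding(vocab_size, 128),
    torch.nn.Linear(128, vocab_size),
)
val_tokens = torch.randint(vocab_size, (4096,))    # stand-in validation token stream
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

total_nll, total_count = 0.0, 0
for start in range(0, len(val_tokens) - seq_len - 1, seq_len):
    x = val_tokens[start:start + seq_len]
    y = val_tokens[start + 1:start + seq_len + 1]
    logits = model(x)                              # (seq_len, vocab_size)
    loss = F.cross_entropy(logits, y)              # score the segment while it is still held out
    total_nll += loss.item() * seq_len
    total_count += seq_len
    # dynamic evaluation: take a gradient step on the segment we just scored,
    # so the model adapts to topic drift before scoring the next segment
    opt.zero_grad()
    loss.backward()
    opt.step()

print("dynamic-eval perplexity:", torch.exp(torch.tensor(total_nll / total_count)).item())
```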
4 The Transformer

Let x be the input sequence of size T × m, where T is the sequence length (often 512) and m is the vocabulary size (often in the range 10^4 to 10^5). We will describe a one-hidden-layer transformer that predicts the next word.

Parameters:

  E ∈ R^{m × d_embedding}
  W_V, W_Q, W_K ∈ R^{d_embedding × d_hidden}
  W_1 ∈ R^{d_hidden × d_1}
  W_2 ∈ R^{d_1 × d_embedding}

1. Embed the sequence: x ← xE + P, so now x is of size T × d_embedding. Here, P is the positional encoding. One common choice is
     P_{t, 2i} = \sin(t / 10000^{2i/d_embedding}),   P_{t, 2i+1} = \cos(t / 10000^{2i/d_embedding}),
   where i indexes the embedding dimension and t the sequence position. Note that P is often a fixed choice (and not a learned parameter).

2. Compute the “values”, “queries”, and “keys”: V = xW_V, Q = xW_Q, K = xW_K, each of size T × d_hidden.

3. Compute the attention-weighted combination:
     h = softmax(QK^T / \sqrt{d_hidden}) \, V,
   so h is of size T × d_hidden. Importantly, note that QK^T is a T × T matrix. Here, abusing notation, the vector-valued softmax(·) function is applied to every row of the matrix QK^T. Recall that the vector-valued softmax(·) is defined so that its i-th component is
     [softmax(v)]_i := \exp(v_i) / \sum_j \exp(v_j).
   Note that each row of softmax(QK^T) sums to 1: the idea is that we want a convex combination of the rows of V.

4. The output after two transformations is then
     O = \mathrm{ReLU}(\mathrm{ReLU}(hW_1) W_2),
   which is of size T × d_embedding.

5. The prediction that the next word in the sequence, X_{T+1}, is the j-th word is then
     \tilde{O} = OE^T,   \widehat{\Pr}(X_{T+1} = j) = [softmax(\tilde{O}_T)]_j,
   i.e., only the last row \tilde{O}_T is used for prediction. Here, we have coupled the embedding weights and the prediction weights: both use the matrix E.
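The following is a minimal NumPy sketch of the forward pass just described, with parameter names mirroring the notes (E, W_V, W_Q, W_K, W_1, W_2). The tiny dimensions, the random initialization, and the helper functions are assumptions made purely for illustration; a real model would learn these parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
T, m = 8, 50                     # sequence length, vocabulary size (toy values)
d_emb, d_hid, d_1 = 16, 16, 32

E   = rng.normal(size=(m, d_emb))      # embedding (tied with the output projection)
W_V = rng.normal(size=(d_emb, d_hid))
W_Q = rng.normal(size=(d_emb, d_hid))
W_K = rng.normal(size=(d_emb, d_hid))
W_1 = rng.normal(size=(d_hid, d_1))
W_2 = rng.normal(size=(d_1, d_emb))

def softmax(v, axis=-1):
    v = v - v.max(axis=axis, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=axis, keepdims=True)

def positional_encoding(T, d):
    P = np.zeros((T, d))
    t = np.arange(T)[:, None]
    i = np.arange(0, d, 2)[None, :]          # even embedding indices 2i
    P[:, 0::2] = np.sin(t / 10000 ** (i / d))
    P[:, 1::2] = np.cos(t / 10000 ** (i / d))
    return P

def transformer_next_word_probs(word_ids):
    x = np.eye(m)[word_ids]                                   # one-hot sequence, T x m
    x = x @ E + positional_encoding(len(word_ids), d_emb)     # 1. embed + position, T x d_emb
    V, Q, K = x @ W_V, x @ W_Q, x @ W_K                       # 2. values/queries/keys, T x d_hid
    h = softmax(Q @ K.T / np.sqrt(d_hid)) @ V                 # 3. row-wise softmax attention, T x d_hid
    O = np.maximum(0, np.maximum(0, h @ W_1) @ W_2)           # 4. two ReLU transformations, T x d_emb
    O_tilde = O @ E.T                                         # 5. tied output projection, T x m
    return softmax(O_tilde[-1])                               # only the last position predicts X_{T+1}

probs = transformer_next_word_probs(rng.integers(m, size=T))
print(probs.shape, probs.sum())                               # (50,) and 1.0
```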
4.1 Invariances and other observations

Define M = QK^T, which is of size T × T.

Hidden state interpretation and ‘sequential’ training/scoring: the above description is a model for \widehat{\Pr}(w_{T+1} \mid w_{1:T}). We may be interested in the model predicting \widehat{\Pr}(w_{t+1} \mid w_{1:t}) for t < T (where often T = 512), i.e. we may want to make multiple predictions simultaneously (say for training). For this, there is a way to use ‘masking’ with an upper-triangular matrix so that, for all t ≤ T,
  \widehat{\Pr}(X_{t+1} = j \mid w_{1:t}) = [softmax(\tilde{O}_t)]_j.

Suppose P = 0 (no positional encoding). The matrix M is shift invariant in the sense that if we translate the sequence by τ, then the (i, j) entry gets shifted to (i+τ, j+τ) (provided these are in bounds). Similarly, h_i → h_{i+τ}.

Lemma. Let Q, K, V ∈ R^{T × d_hidden}, and let Π be a T-by-T permutation matrix. Then
  σ((ΠQ)(ΠK)^T) ΠV = Π σ(QK^T) V,
where σ(·) is the row-wise softmax of a matrix. In particular, suppose P = 0; then if we permute the sequence, i.e. x → Πx, we have h → Πh.

Proof. The LHS is σ(ΠQK^TΠ^T) ΠV. It suffices to show that σ(ΠQK^TΠ^T) = Π σ(QK^T) Π^T. Indeed, letting π denote the permutation specified by Π, so that (ΠQK^TΠ^T)_{i,j} = (QK^T)_{π(i), π(j)}, we have
  [σ(ΠQK^TΠ^T)]_{i,j} = \frac{e^{(QK^T)_{π(i), π(j)}}}{\sum_{j'=1}^{T} e^{(QK^T)_{π(i), π(j')}}} = \frac{e^{(QK^T)_{π(i), π(j)}}}{\sum_{j'=1}^{T} e^{(QK^T)_{π(i), j'}}} = [Π σ(QK^T) Π^T]_{i,j},
where the second equality holds because summing over a permutation of the column indices is the same as summing over all of them. (A small numerical check of both the masking and the lemma is sketched at the end of this subsection.)

Computation: the transformer computations are very efficient due to the manner in which the matrix multiplications can be parallelized. In contrast, the LSTM fundamentally needs a for-loop over the history. (The LSTM is a circuit with greater depth.)
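Here is a small NumPy check of the two observations above: causal masking via a strictly upper-triangular mask, and the permutation lemma. The sizes and random inputs are arbitrary assumptions; the code only verifies the algebra numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

def softmax(v, axis=-1):
    v = v - v.max(axis=axis, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=axis, keepdims=True)

# (i) causal masking: set scores above the diagonal to -inf before the softmax,
# so row t of the attention weights is supported on positions 1..t only.
scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # strictly upper-triangular
weights = softmax(np.where(mask, -np.inf, scores))
print(np.allclose(np.triu(weights, k=1), 0.0))      # True: no attention to future positions
h_causal = weights @ V                               # causal attention output, T x d

# (ii) permutation equivariance (no positional encoding, no scaling, as in the lemma)
Pi = np.eye(T)[rng.permutation(T)]                   # a random T x T permutation matrix
lhs = softmax((Pi @ Q) @ (Pi @ K).T) @ (Pi @ V)
rhs = Pi @ (softmax(Q @ K.T) @ V)
print(np.allclose(lhs, rhs))                         # True
```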
5 Acknowledgements

These notes were based on discussions with Xinyi Chen, Karthik Narashiman, Cyril Zhang, and Yi Zhang.
