Lecture 11 Learning from Text -- Alec Radford

Learning From Text: Language Models and More. April 15th 2020 - Berkeley - alec@openai.com
Standard supervised learning requires “machine learning grade” data. There is not a lot of “machine learning grade” data (compared to what current models need). This lecture focuses on a variety of methods for learning from natural language in order to improve the performance of models on standard NLP datasets/tasks. Learning From Text
Autoregressive maximum likelihood language modeling will be the core. But, there are many proxy tasks involving predicting / modeling text somehow, someway that work well (sometimes even better than standard LMs!) Word2Vec / Paragraph2Vec Contrastive Predictive Coding (CPC) BERT ELECTRA A Variety of Methods
Motivation and Intro
A Wild Internet Appears
How to use it? Let’s try word-word co-occurrences

        water   steam   ice     hot
water   32879   ...     ...     ...
steam   250     324     ...     ...
ice     765     23      859     ...
hot     19540   1832    17      48323
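As a rough illustration of where such a table comes from, here is a minimal sketch (the toy corpus and window size are made up) of counting co-occurrences within a fixed window:

from collections import Counter

# Toy corpus; the real matrix would be built from billions of tokens.
corpus = "ice is cold water . steam is hot water . hot water makes steam".split()

window = 2  # symmetric context window
cooc = Counter()
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(w, corpus[j])] += 1

print(cooc[("hot", "water")])  # co-occurrence count for the pair (hot, water)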
Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions (Clark et al 2016) How good is counting a bunch of stuff?
It’s still huge! 1 million words x 1 million words x 4 byte int32 = 4 terabytes Want to come up with a much more compact, but faithful representation of the relations between words and the information they represent. Problems working with word-word co-occurrence matrix
Take the matrix X counting word-word co-occurrences (cheap, so do it for 840B tokens!) So entry X_ij would be the count of word i occurring in a context with word j. Learn low-dimensional vector representations of each word such that their dot product = log prob of co-occurring. Goes from MxM to MxN where N is the dimensionality of the word vectors (300 << 1,000,000!) GloVe (Pennington et al 2014)
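To make the objective concrete, here is a rough numpy sketch of the basic idea (fit dot products to log counts with squared error); the real GloVe recipe also adds per-word bias terms and a weighting function over counts, which are omitted here:

import numpy as np

rng = np.random.default_rng(0)
V, N = 1000, 50                         # toy vocab size and vector dimensionality
X = rng.integers(1, 100, (V, V))        # stand-in co-occurrence counts (kept > 0)

W = 0.01 * rng.standard_normal((V, N))  # word vectors
C = 0.01 * rng.standard_normal((V, N))  # context vectors

lr = 1e-2
for step in range(200):
    err = W @ C.T - np.log(X)           # want w_i . c_j ~= log X_ij
    grad_W = 2.0 * err @ C / err.size
    grad_C = 2.0 * err.T @ W / err.size
    W -= lr * grad_W
    C -= lr * grad_C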
Word2Vec
Usefulness of Word Vectors [McCann et al 2017]
Language is a lot more than just counts of words! It has a ton of structure on top of / in addition to words. Context is very important and a fixed static representation of a word is insufficient. 1. I went to the river bank. 2. I made a withdrawal from the bank. 3. “I wouldn’t bank on it” Problems with word vectors
Great, so I’ve got a 1,000,000 x 300 matrix ... now what? How to use it is up to the practitioner. Often involves a lot of task specific models slapped on top. Learning just word vectors is like learning just edge detectors in computer vision. Problems with word vectors
Intro to Language Models
70 years of samples [From Oriol Vinyals’ twitter]
Interpret language as a high-dimensional discrete data distribution to be modeled. Observe a bunch of strings of language and learn a function that can compute the probability of new ones: p(“Is it going to rain today?”) Statistical / Probabilistic Language Modeling
p(The cat sat on the mat.) = ??? What does it mean to compute the probability of a string?
p(The cat sat on the mat.) = ??? Noam Chomsky in 1969: “But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” What does it mean to compute the probability of a string?
p(The cat sat on the mat.) > p(The cat sats on the mat.) [grammar] Should p(The cat sats on the mat.) be 0? p(The hyena sat on the mat.) < p(The cat sat on the mat.) [world knowledge] Should p(“4” | “2 + 2 =”) be 1? p(“1 star out of 5” | “That movie was terrible! I’d rate it”) [sentiment analysis] How can you use the probability of a string?
Speech Recognition and Machine Translation are supervised tasks Speech Recognition = (audio_1, transcript_1) (audio_2, transcript_2) (audio_3, transcript_3) Machine Translation = (french_1, english_1) (french_2, english_2) (french_3, english_3) A major promise of language modeling is to leverage a bunch of “uncurated” text to help with these problems. How can you use the probability of a string?
Speech Recognition: prune the space of possible transcriptions from an acoustic model (famous example: “wreck a nice beach” vs “recognize speech”) Machine Translation: re-rank possible translations, or integrate directly with the decoder How can you use the probability of a string?
First, maybe do some preprocessing (like lower-casing) “THe CaT SAt oN ThE MAT.” -> “the cat sat on the mat.” How to compute the probability of a string?
Often we’ll set a maximum # of words (or minimum frequency) for computational reasons so: “the cat sat on the countertop.” -> “the cat sat on the <UNK>.” How to compute the probability of a string?
A tokenizer takes a string as input and returns a sequence of tokens: “the cat sat on the mat.” -> [the, cat, sat, on, the, mat, .] [the, cat, sat, on, the, mat, .] -> [23, 1924, 742, 101, 23, 3946, 7] How to compute the probability of a string?
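A minimal sketch of this word-level tokenize-then-index step (the vocabulary and ids below are made up, not the ones on the slide):

text = "the cat sat on the mat."

# Crude word-level tokenization: split on whitespace, treating the final period as its own token.
tokens = text.replace(".", " .").split()   # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

# Build a toy vocabulary mapping each token to an integer id, then look the tokens up.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
print(tokens, ids)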
A tokenizer takes a string as input and returns a sequence of tokens: “the cat sat on the mat.” -> [t, h, e, “ “, c, a, t, “ “, s, a, t, “ “, ...] How to compute the probability of a string?
Character level (throw out non-ascii) Byte level (work on UTF-8 byte stream) Unicode symbols / codepoints Tokenized / pre-processed word level Byte Pair Encoding (Sennrich 2016) All the different ways to dice a string! Example BPE merges:
t h -> th
i n -> in
e d -> ed
a n -> an
th e -> the
o u -> ou
e r -> er
in g -> ing
t o -> to
h e -> he
an d -> and
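For Byte Pair Encoding specifically, here is a compressed sketch of the merge-learning loop, using toy word counts similar to the example in Sennrich et al. 2016 (the end-of-word marker is omitted); a real vocabulary is learned with many thousands of merges over a large corpus:

from collections import Counter

# Word frequencies, with each word pre-split into symbols.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def merge_step(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    best = max(pairs, key=pairs.get)
    # Replace every occurrence of the most frequent pair with a single merged symbol.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged, best

for _ in range(5):
    vocab, best = merge_step(vocab)
    print("merged", best)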
1. Assume a uniform prior over tokens 2. Assume all tokens are independent p(t_0) = 1/vocab size p(t_0, t_1, t_2, t_3) = product of p(t_i) for all i How to compute the probability of a string?
1. Assume a uniform prior over tokens 2. Assume all tokens are independent Estimate the probability of a token by counting its occurrences and normalize this count by the total number of tokens seen. p(t_0, t_1, t_2, t_3, …) = p(t_0)p(t_1)p(t_2)p(t_3)… This is a unigram language model How to compute the probability of a string?
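A minimal count-based version of this unigram model over a toy token stream (note that any unseen token would get probability zero, which is the problem smoothing addresses below):

from collections import Counter
import math

tokens = "the cat sat on the mat . the dog sat on the rug .".split()
counts = Counter(tokens)
total = len(tokens)

def unigram_logprob(sequence):
    # log p(t_0, t_1, ...) = sum_i log p(t_i), with p(t) = count(t) / total
    return sum(math.log(counts[t] / total) for t in sequence)

print(unigram_logprob("the cat sat on the rug .".split()))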
1. Assume a uniform prior over tokens 2. Assume all tokens are independent Estimate the probability of a token conditioned on the previous token by counting how many times it co-occurs with that previous token and normalize this count by the total number of occurrences of that context. p(t_0, t_1, t_2, t_3, …) = p(t_0)p(t_1 | t_0)p(t_2 | t_1)p(t_3 | t_2)… This is a bigram language model How to compute the probability of a string?
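And the corresponding bigram version, where each token is conditioned on the previous one (again a toy sketch; unseen bigrams would still get probability zero):

from collections import Counter
import math

tokens = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_logprob(seq):
    # log p(t_0) + sum_i log p(t_i | t_{i-1}), with p(b | a) = count(a, b) / count(a)
    logp = math.log(unigrams[seq[0]] / len(tokens))
    for a, b in zip(seq, seq[1:]):
        logp += math.log(bigrams[(a, b)] / unigrams[a])
    return logp

print(bigram_logprob("the cat sat on the rug .".split()))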
p(self-attention) = 0 = infinite loss… p(self-attention | the cool thing about) = 0 = infinite loss... Generalization?
p(self-attention) = 0 = infinite loss… p(self-attention | the cool thing about) = 0 = infinite loss... Smooth things out by using a mixture model: p_mixture(t_1) = 0.01 * p_uniform(t_1) + 0.99 * p_unigram(t_1) Smoothing
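A sketch of that mixture, using the 0.01 / 0.99 weights from the slide: interpolating with a uniform distribution guarantees every token in the vocabulary gets nonzero probability, so the loss stays finite.

from collections import Counter

tokens = "the cat sat on the mat .".split()
counts = Counter(tokens)
vocab = {"the", "cat", "sat", "on", "mat", ".", "self-attention"}   # includes an unseen word

def p_mixture(t, lam=0.01):
    p_uniform = 1.0 / len(vocab)
    p_unigram = counts[t] / len(tokens)        # 0 for unseen tokens
    return lam * p_uniform + (1 - lam) * p_unigram

print(p_mixture("the"))              # dominated by the count-based estimate
print(p_mixture("self-attention"))   # small but nonzero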
Language model research in the 80s and 90s focused a lot on how to better estimate, smooth, and interpolate n-gram language models Smoothing
Probabilities are often within rounding error of zero. (Language is a huge space!) They are also a function of the length of the string. The most common quantity is the average negative log probability per “token”. Character level LMs use base 2 and report bits per character (can also be per byte). Word level LMs exponentiate and report perplexity. Evaluation Type 1
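A short sketch of how these quantities relate, assuming we already have the per-token probabilities a model assigned to some held-out text (the numbers are made up):

import math

probs = [0.2, 0.05, 0.5, 0.1, 0.25]   # model probability of each observed token

nll_nats = -sum(math.log(p) for p in probs) / len(probs)   # average negative log prob (nats)
bits_per_token = nll_nats / math.log(2)                    # base 2: bits per character/byte/token
perplexity = math.exp(nll_nats)                            # word-level convention

print(nll_nats, bits_per_token, perplexity)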
Working with abstract #s like these can be difficult What’s 1.23 BPC vs 1.21 BPC? (especially important when you just spent 3 months of your life on it!) These quantities are dataset dependent (it’s really easy to guess all 0s - really hard to guess the arXiv) Random guessing gets you -log2(1/256) = 8 bits per character Current human estimates range from ~0.6-1.3 BPC. Best models are now a little lower than 1 BPC so probably closer to 0.6. Grounding bits per character and perplexity
Random guessing PPL is just the vocab size, so with a vocab of 50K = 50K PPL. One way of thinking about perplexity is as a “branching factor of language”: PPL^n = space of possible generations of length n. A model can get 10 PPL by uniformly assigning probability across 10 equally likely next words (and always having the correct word within these top 10). Human level is probably between 5 and 10 from the BPC estimate. Translation is a well constrained space and the best models reach low single-digit PPL. Grounding bits per character and perplexity
There are a lot of ways to use language models. You can evaluate them based on their usefulness for a downstream task. Improve: WER for speech recognition, BLEU for translation, F1 for POS tagging, accuracy for document classification. This is an increasingly common evaluation setting. Evaluation Type 2
History of Neural Language Models
So many things! A neural net Skip connections Learn distributed representation of words Large scale asynchronous SGD A Neural Probabilistic Language Model (Bengio et al. 2003)
Replace MLP with RNN (allows for unbounded context) Showed improvements on speech recognition Recurrent neural network based language model (Mikolov et al 2010)
Character level RNN Approximates a tensor RNN which has a different set of weights for every input character Very complicated optimization scheme Ms . Claire Parters will also have a history temple for him to raise jobs until naked Prodiena to paint baseball partners , provided people to ride both of Manhattan in 1978 , but what was largely directed to China in 1946 , focusing on the trademark period is the sailboat yesterday and comments on whom they obtain overheard within the 120th anniversary , where many civil rights defined , officials said early that forms , ” said Bernard J. Marco Jr. of Pennsylvania , was monitoring New York (not actually a lot better than word level n-gram models) Generating Text with Recurrent Neural Networks (Sutskever et al 2011)
Generating Sequences with Recurrent Neural Networks (Graves 2013)
Proposed using an RNN sequence encoder trained to provide context to an LM as a sentence level text feature extractor. Skip-Thought Vectors (Kiros et al 2015)
Proposes finetuning an LM directly for downstream tasks 1. Use LM objective as a pre-training task 2. Then initialize the parameters of downstream model with LM weights 3. Then train like a normal supervised model Semi-supervised Sequence Learning (Dai et al 2015)
A larger dataset 1BW (Chelba et al 2013) An 8K projection LSTM (Sak et al 2014) Character aware (Kim et al 2015) A large vocab - 800K words Approximate with sampled softmax 32 K40s for 3 weeks 41.0 -> 23.7 perplexity Exploring The Limits of Language Modeling (Jozefowicz et al. 2016)
Was one of the first neural language models (I’m aware of) to generally have ~coherent non-trivial sentences. “With even more new technologies coming onto the market quickly during the past three years , an increasing number of companies now must tackle the ever-changing and ever-changing environmental challenges online .” Exploring The Limits of Language Modeling (Jozefowicz et al. 2016)
There’s a whole internet out there Soooooooooo much information A perfect language model would need to fit the internet into its parameters. This suggests we’re going to need a lot of parameters, compute, and data to get as close to this as possible. Why scale?
This is what a very small charRNN learns: “ Als gambrantr ’s o thkergtre akld teno … tie He Cule a , ssot Goshulan n blne t , to hered arerorinner rrk f . , ate Banat ” The best architecture in the world is useless without capacity. Even classic resources like WordNet are larger than many models trained today. (5.5M relational features and the package is 55MB on disk!) Ungrounded language learning is grotesquely inefficient. How to make peace with this? For now, address it with scale? Why scale?
- Deep Learning Scaling is Predictable, Empirically [Hestness et al. 2017] - GPipe: Efficient Training of Giant Neural Networks [Huang et al. 2018] - AI and Compute [Amodei and Hernandez 2018] These trends have been consistent across many orders of magnitude Why scale?
Maybe data is the bottleneck! Make the dataset bigger -> 80 million product reviews (40 GB of text) 4096 unit byte level mLSTM - 1 month - 4 Pascal Titan X GPUs Model ended up just underfitting by a lot But learned what sentiment is Learning To Generate Reviews and Discovering Sentiment (Radford et al. 2017)
Maybe hidden state size is the bottleneck! Make an LSTM with a much larger state size -> 18,432 units Slightly more efficient than a dense model with the same # of parameters Also better performance on sentiment analysis (when evaluated by a linear model) GPU Kernels for Block-Sparse Weights (Gray et al 2017)
LM pre-training for sentiment analysis [figure; annotation: “Small World LSTM is here”]
Story Cloze Task: UW NLP System (Schwartz et al 2017)
Maybe parameter count is the bottleneck! Make a model with as many parameters as possible -> 137 Billion More efficient than equivalent-compute dense models And a lot of very impressive systems work The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al 2017)
Replace word vectors with a learned weighted sum of features of deep bi-directional LM Improves baseline models to SOTA Uses the LM from (Jozefowicz et al. 2016) Extends benefits of LMs to a much wider variety of tasks Deep contextualized word representations (Peters et al 2018)
Deep contextualized word representations (Peters et al 2018) [Diagram: each token’s word representation feeds forward and backward LSTM layer 1 and layer 2 states, which are combined into a contextualized representation]
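A rough numpy sketch of the combination step the diagram describes: concatenate forward and backward states per layer, then take a learned softmax-weighted sum across layers (all shapes, weights, and values here are illustrative stand-ins):

import numpy as np

T, d, L = 6, 4, 3    # tokens, per-direction state size, number of layers
rng = np.random.default_rng(0)

# Pretend per-layer representations: forward and backward states concatenated.
layers = [np.concatenate([rng.standard_normal((T, d)),
                          rng.standard_normal((T, d))], axis=-1) for _ in range(L)]

s = np.array([0.2, 1.0, -0.5])        # learned scalar weight per layer
w = np.exp(s) / np.exp(s).sum()       # softmax over layers
gamma = 1.0                           # learned global scale

contextualized = gamma * sum(wi * hi for wi, hi in zip(w, layers))   # shape (T, 2*d)
print(contextualized.shape)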
Transformer based LM 12 self-attention blocks - 12 heads - 768 dim state ~100M params Trained on 7,000 books ~ 5 GB of text (BookCorpus Zhu et al 2015) Fine-tune on supervised tasks (like Dai et al. 2015) Removes the need for task specific architectures Improving Language Understanding by Generative Pre-Training (GPT-1)
Improving Language Understanding by Generative Pre-Training (GPT-1)
A digression into Transformers
the cat sat on
Query: what you want to look for
Key: what you can compare to
Value: information you can retrieve
the cat sat on -> retrieved: “the cat”
Query: what you want to look for
Key: what you can compare to
Value: information you can retrieve
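A minimal numpy sketch of the query / key / value machinery these slides describe, with a causal mask so each position can only attend to earlier ones (the shapes and random projection matrices are toy stand-ins, not the Transformer's full multi-head block):

import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    # x: (T, d) token representations; Wq, Wk, Wv: (d, d) projections.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[-1])            # compare queries to keys
    mask = np.triu(np.ones_like(scores), k=1) == 1     # hide future positions
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: where to look
    return weights @ v                                 # retrieve a mix of values

T, d = 5, 8   # e.g. a 5-token prefix
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
out = causal_self_attention(x, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)   # (5, 8)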
[Vaswani et al 2017]
Beyond standard LMs
A lot of Improvements! MultiNLI Premise: Hills and mountains are especially sanctified in Jainism. Hypothesis: Jainism hates nature. Label: Contradiction CoLA Sentence: The wagon rumbled down the road. Label: Acceptable Sentence: The car honked down the road. Label: Unacceptable
A lot of Improvements! More on this later! MultiNLI Premise: Hills and mountains are especially sanctified in Jainism. Hypothesis: Jainism hates nature. Label: Contradiction CoLA Sentence: The wagon rumbled down the road. Label: Acceptable Sentence: The car honked down the road. Label: Unacceptable
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al 2018) Left-Right LM: The cat sat on the [mask] -> The cat sat on the mat Right-Left LM: [mask] cat sat on the mat -> The cat sat on the mat Masked LM: The [mask] sat on the [mask] -> The cat sat on the mat
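A toy sketch of how masked-LM training pairs are built (BERT's actual recipe masks about 15% of tokens and sometimes keeps or randomly replaces them instead of always using [MASK]; this only shows the basic input/target shape):

tokens = "the cat sat on the mat".split()
mask_positions = {1, 5}   # chosen at random in real training

inputs  = ["[MASK]" if i in mask_positions else tok for i, tok in enumerate(tokens)]
targets = [tok if i in mask_positions else None for i, tok in enumerate(tokens)]

print(inputs)    # ['the', '[MASK]', 'sat', 'on', 'the', '[MASK]']
print(targets)   # [None, 'cat', None, None, None, 'mat']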
Really well executed refinement / engineering of BERT Better tuned (many HPs) Remove a few hacks (remove annealing context size) Better data generation (online instead of cached) A more flexible vocab scheme (more on this later) Use more compute / train longer (but same model capacity - BERT was undertrained) RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al 2019)
ELECTRA (Clark et al 2019)
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al 2019) Very thorough (50 pages!) exploration of the design space of pretraining with a pleasing task formulation (from McCann et al 2018)
Why we need Unsupervised Learning
Natural Language Inference - SNLI (Bowman et al. 2015) Predict logical relation between two sentences - P and H. Contradiction -> A man inspects a uniform. A man is sleeping. Neutral -> An older and younger man smiling. Two men are smiling at cats playing on the floor. Entailment -> A soccer game with multiple males playing. Some men are playing a sport. Models are near human level according to the standard test set Humans ~ 88.0% ESIM (Chen et al. 2017) ~ 88.0% How well does supervised learning work?
Turkers were paid to create the training data of SNLI They often use a few tricks or heuristics to quickly make data For instance: Words like (not, never, nothing) hint at negation Generic words like (person, animal, sport) hint at entailment Modifiers like (tall, sad, popular) hint at neutral If you train a classifier on only the second sentence... You get ~67.0% compared to the ~33.0% chance baseline ESIM performance drops from ~88% to ~72% on the hard examples Annotation Artifacts In Natural Language Inference Data (Gururangan et al. 2018)
Use known relations between words to construct a new test set The man is holding a {object}. The man is holding a {different object}. Contradiction A little girl is very {adjective}. A little girl is very {synonym}. Entailment Built a new test set of 8,000 examples from 14 categories to probe this. ESIM drops from ~88% to ~66% on this new test set Breaking NLI Systems with Sentences that Require Simple Lexical Inferences (Glockner et al. 2018)
Near SOTA QA model (BERT on SQUAD) drops from 86.5 F1 to: 35.6 F1 on TriviaQA 56.2 F1 on QuAC Learning and Evaluating General Linguistic Intelligence (Yogatama et. al 2019)
Standard training datasets might not encourage generalization Models learn spurious associations in the training set Models exploit distributional bias of the creation of the training set Models “stop learning” once they get to 0 training error Current techniques are brittle Current techniques are closer to memorization than generalization What might be going wrong?
Better models / architectures? More data? Different paths all together? How to make progress?
“The beautiful story of modern deep learning was supposed to be that we cleverly encoded high-level domain knowledge into our architectures and built these larger labeled datasets and then let SGD figure out all the annoying details for us.” How to make progress?
This set us up for a mindset of architecture engineering. There’s a very large design space: Multiply by a sigmoid here Add a temporal max-pool there Convolve with not 1 (or 2) but three different width filters Throw some attention on top of it all for good measure How to make progress?
We really like playing with blocks!
We can encode useful information through the choice of model: Convolution Recurrence Weight Sharing Attention Hierarchy Depth These are all important and impactful How to make progress?
What’s going on?
What we’ve been mostly doing
Huh?
The value of architecture engineering?
Supervised Learning is the dominant approach The largest supervised dataset is JFT-300M (Sun et al. 2017) - 300 million images - 18,000 classes How to learn?
Supervised Learning is the dominant approach The largest supervised dataset is JFT-300M (Sun et al. 2017) - 300 million images - 18,000 classes But JFT is only 530 MB of constraints How to learn?
Spent most of 2015 trying to build what I hoped would be ImageNet for text (to enable impactful transfer learning) 20 Newsgroups but for Reddit: a giant weakly supervised dataset 150M labeled examples across 1,000 communities Trained RNNs to predict the community from the discussion Pursuing this route for language
What we’ve been trying
How to do this instead?
KIM (Chen et al. 2017) Gets 83.5% on the new NLI test set Information (and representation) Engineering alongside Architecture Engineering
Word vectors are the classic approach! GloVe (Pennington et al. 2014) Common Crawl (a good chunk of the internet) Represent co-occurrences of words in 840 billion tokens Information (and representation) Engineering alongside Architecture Engineering
Word vectors are the classic approach! GloVe (Pennington et al. 2014) Common Crawl (a good chunk of the internet) Represent co-occurrences of words in 840 billion tokens Information (and representation) Engineering alongside Architecture Engineering The NLI models were already using word vectors So this hasn’t been figured out yet! But CoVe -> ELMo -> GPT-1 -> BERT helps a ton!
GPT-1 performs similarly to KIM (83.75%) on the new NLI test set BERT is basically SOTA on everything (that I’m aware of) It’s just a “stock” transformer! But it makes up for this with all that it has learned through pre-training. Information Engineering Taking Off (CoVe, ELMo, ULMFiT, GPT-1, BERT)
Instead of manually specifying what to predict through the creation of large supervised datasets… Figure out how to learn from and predict everything “out there”. “You can think of every time we build a dataset as setting the importance of everything else in the world to 0 and the importance of everything in the dataset to 1.” Our poor models! They know so little and yet still have so much hidden from them.
1. High capacity and flexible model classes + 2. Algos for extracting information and learning the structure of domains + 3. An almost infeasible amount of data tiling everything (billions of unlabeled examples?) + 4. An offensive amount of compute with which to learn (peta to exaflops?) = ? A Potential Recipe
1. High capacity and flexible model classes + 2. Algos for extracting information and learning the structure of domains + 3. An almost infeasible amount of data tiling everything (billions of unlabeled examples?) + 4. An offensive amount of compute with which to learn (peta to exaflops?) = Is it time to stop? To call it quits? A Potential Recipe
1. High capacity and flexible model classes + 2. Algos for extracting information and learning the structure of domains + 3. An almost infeasible amount of data tiling everything (billions of unlabeled examples?) + 4. An offensive amount of compute with which to learn (peta to exaflops?) = Or will it drive a good chunk of progress over the next few years? A Potential Recipe
What about Multitask Learning?
GPT-2?
More data 40GB of text 10B tokens 8 million webpages Bigger model Up to 1.5 billion parameters 1024 token context 48 layers, 1600 dim state Just a language model - predicts everything (with some unfortunate restrictions as BERT shows) GPT-2
Performance across tasks
Why it’s working?
Question Answering and Reading Comprehension: 6 Million 5 Ws questions in the dataset Summarization: ~100K TL;DR, In summary… Translation: ~10MB French data Why it’s working?
A concrete example of why unsupervised learning is necessary
Performance not (usually) limited by something a single paper fixes Diminishing returns mean there is always some other bottleneck Fancy model -> compute utilization, trainability Parameters -> compute Data -> capacity Capacity -> data, compute Be pragmatic about scaling If you do everything sensibly - compute will probably be the bottleneck If it’s not… there’s an interesting research problem! Takeaways from scaling language modeling
How to do research on large scale models?
Don’t do research on large scale models How to do research on large scale models?
Don’t do research on large scale models Prototype on models which are 10x smaller and 10x faster Run 10x as many experiments in parallel instead Every behavior in the GPT-2 paper shows up on these models After the proof of concept - then you scale GPT-1 was a proof point on zero-shot task transfer GPT-1 on WebText is already SOTA on several LM tasks Used the same strategy for Sentiment Unit First trained a 512 dim LSTM in a few days Final 4096 dim LSTM took a month How to do research on large scale models?
Develop and test everything quickly at small scale first Tune the hyperparams, decide on a model, check out datasets, etc... Whatever does best at a reasonable scale will also probably do well at large scale Optimize the language model as a language model Log-prob of held-out text Then see what else it can do How to do research on large scale models?
Sometimes issues don’t show up until at scale Plan for something to break about every order of magnitude of scale Will have to re-tune hyperparameters For GPT-2 models this happened at >= 24 self-attention blocks Performance of models appears to saturate Fix was better weight init and pre-activation style residual network Rewon Child figured this out The Gotchas
Self-attention architectures + long sequences = lots of memory Recompute, Half Precision (FP-16), Data Parallelism More Model More Problem
Naive TensorFlow code can now be over 5 times slower than what is achievable on modern hardware Case study: GPT-1 Took 25 days on 8 P6000s (how do you do research on models that take a month to train without going insane?) Trains in 3 days on 8 V100s 1.75x from TF data parallel -> MPI + NCCL AllReduce 1.50x from native TF ops -> Blocksparse ops 3.50x from FP32 Pascal -> FP16 Volta Write Efficient / Smart Code!
Accelerated primitives for common TensorFlow ops Dropout, normalization, optimizers, activations Custom self-attention operations Avoid transposes, fuse operations, sparse compute Targets Volta / Turing hardware Tensorcores allow for 3x+ speedup over previous gen hardware
from blocksparse.transformer import BlocksparseTransformer, softmax_cross_entropy
from blocksparse.optimize import AdamOptimizer, ClipGlobalNorm
from blocksparse.norms import layer_norm
from blocksparse.embed import embedding_lookup
from blocksparse.ewops import bias_relu, dropout
from blocksparse.nccl import allreduce
Blocksparse Library - Scott Gray
If you’re paying for it: A 4x 2080 Ti desktop The results in GPT-2 do show up on models trainable on this hardware (but will take a week) Can ~match BERT-Base in that time too Cost about $6,000 :( If someone else is paying for it: 8 V100s from a cloud provider (AWS, GCE, etc...) What is the Sweet Spot in terms of compute?
Scale matters - go beyond classic datasets like PTB Better results come from combining several sources of improvement Don’t get bottlenecked by something that can be fixed easily Don’t let scale slow you down during development A medium+ language model on a new dataset / domain will probably learn something interesting - but might take some digging to find Most of my research for the past few years has been exploring the capabilities, behaviors, and uses of language models in this regime Takeaways from language modeling
In the next few years language models will be trained on pretty much the whole internet (might as well throw in millions of books too!) Will scaling trends breakdown? How far will this get? Where is this Heading?
In the next few years language models will be trained on pretty much the whole internet (might as well throw in millions of books too!) Will scaling trends breakdown? How far will this get? If trendlines continue… Where is this Heading? [Ian Goodfellow’s twitter]
In the next few years language models will be trained on pretty much the whole internet (might as well throw in millions of books too!) Will scaling trends breakdown? How far will this get? If trendlines continue… It will probably feel unsatisfying, though Where is this Heading? [Ian Goodfellow’s twitter]