(T5) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Abstract
Transfer Learning
- model is first pre-trained on data-rich task before being fine-tuned on a downstream task
- powerful technique in NLP
- has given rise to a diversity of approaches, methodologies, and practices
Objectives
- exploring the landscape of transfer learning techniques for NLP
- introduce unified framework that converts all text-based language problems into text-to-text format
- compare pretraining objectives, architectures, unlabeled data sets, transfer approaches, and other factors on language understanding
1. Introduction
- pre-train a model so it can process text in a way that is amenable to downstream tasks → develops general-purpose knowledge that allows the model to “understand” text
- low-level : spelling/meaning of words
- high-level : knowing that a tuba is too large to fit in most backpacks (common sense, I guess?)
- providing the above knowledge is rarely done explicitly → it is learned as part of an auxiliary task
- word vectors
- pretraining the entire model on a data-rich task → develops general-purpose abilities and knowledge for downstream tasks
- NLP: unsupervised learning on unlabeled data (BERT, UniLM, RoBERTa, ALBERT)
- Benchmarks : GLUE, SuperGLUE, SentEval
Basic Idea of T5
- treat every text processing problem as “text-to-text” problem : taking text as input and text as output.
- QA: https://arxiv.org/abs/1806.08730
- Span Extraction: https://arxiv.org/abs/1904.09286
- LM: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider.
- unified approach → enables us to compare the effectiveness of different transfer learning objectives, unlabeled datasets, etc
2. Setup
2.1 Model
- early transfer learning for NLP mostly relied on recurrent neural networks, but recent work has shifted to Transformer-based models; T5 is similar to these existing architectures
Transformer (2017) Architecture
- based on self-attention
- encoder-decoder structure
- originally designed for sequence-to-sequence tasks
- more recently, language models are also built from a single layer stack with various combinations of self-attention
- Sequence of tokens → Sequence of Embeddings → Encoder
- Encoder
    - stack of blocks, each consisting of a self-attention layer followed by a small feed-forward network
    - layer normalization applied to the input of each sub-component
        - T5 uses a simplified version of layer normalization where the activations are only rescaled and no additive bias is applied
    - a residual skip connection adds each sub-component's input back to its output
    - dropout applied within the feed-forward network, on the skip connection, on the attention weights, and at the input and output of the entire stack
- Decoder
- similar to encoder
- a standard attention mechanism that attends over the encoder output is added after each self-attention layer
- the final decoder output is fed into a dense layer with a softmax output
    - its weights are shared with the input embedding matrix
All attention mechanisms in the Transformer are split into independent “heads” whose outputs are concatenated before being further processed.
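A minimal PyTorch sketch of this multi-head mechanism (standard scaled dot-product attention; input/output projections, masking, and T5's relative position biases are left out, so treat it as an illustration rather than T5's exact implementation):

```python
import torch

def multi_head_attention(q, k, v, num_heads: int) -> torch.Tensor:
    """Split queries/keys/values into independent heads, compute scaled
    dot-product attention per head, then concatenate the head outputs."""
    batch, seq_len, d_model = q.shape
    d_head = d_model // num_heads

    def split_heads(x):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        return x.view(batch, -1, num_heads, d_head).transpose(1, 2)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
    scores = qh @ kh.transpose(-2, -1) / d_head ** 0.5
    weights = torch.softmax(scores, dim=-1)       # attention weights per head
    out = weights @ vh                            # (batch, heads, seq, d_head)
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)  # concat heads

x = torch.randn(2, 5, 768)
print(multi_head_attention(x, x, x, num_heads=12).shape)  # torch.Size([2, 5, 768])
```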
Self-Attention
- order-independent (operation on sets) → requires positional signals/embedding
- instead of fixed embeddings, relative position embeddings are commonly used
- T5 uses a simplified form of position embedding where each embedding is simply a scalar that is added to the corresponding logit used for computing the attention weights
- T5 also shares the position embedding parameters across all layers in the model, though within a given layer each attention head uses a different learned position embedding
- a fixed number of embeddings is learned, each corresponding to a range of possible key-query offsets
- T5 uses 32 embeddings for all models, with ranges that increase logarithmically up to an offset of 128, beyond which all relative positions share the same embedding (see the sketch below)
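A rough sketch of this bucketed relative position scheme, assuming the half-exact/half-logarithmic bucketing used in common open-source T5 implementations; the function name and exact binning are illustrative, not the authors' code:

```python
import math

def relative_position_bucket(relative_position: int,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> int:
    """Map a key-minus-query offset to one of `num_buckets` learned embeddings.

    Bidirectional sketch: half the buckets cover each sign, small offsets get
    one bucket each, larger offsets are binned logarithmically, and every
    offset beyond `max_distance` shares the last bucket."""
    num_buckets //= 2
    bucket = num_buckets if relative_position > 0 else 0   # which sign half
    pos = abs(relative_position)
    max_exact = num_buckets // 2
    if pos < max_exact:
        return bucket + pos                                 # exact buckets for small offsets
    # logarithmically wider bins for offsets in [max_exact, max_distance)
    log_bucket = max_exact + int(
        math.log(pos / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return bucket + min(log_bucket, num_buckets - 1)

print([relative_position_bucket(d) for d in (1, 8, 64, 500)])
```

Each bucket index then selects a learned scalar bias per attention head, which is added to the corresponding attention logit as described above.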
T5 Architecture
similar to Transformer (2017) except:
- removed Layer Norm bias
- Layer Norm outside the residual path
- different position embedding scheme
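A minimal PyTorch sketch of the simplified layer normalization mentioned above, assuming the rescale-only (RMS-style) form used in common open-source T5 implementations; the class name is ours:

```python
import torch

class SimplifiedLayerNorm(torch.nn.Module):
    """Layer norm variant where activations are only rescaled:
    no mean subtraction and no learned additive bias."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(d_model))  # rescaling only
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(dim=-1, keepdim=True)          # no mean subtraction
        return self.weight * x * torch.rsqrt(variance + self.eps)

print(SimplifiedLayerNorm(768)(torch.randn(2, 5, 768)).shape)  # torch.Size([2, 5, 768])
```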
2.3 Downstream Tasks
Benchmarks
- GLUE, SuperGLUE : text classification
- CNN/Daily Mail : abstractive summarization
- SQuAD : question answering
- WMT : English to German, French, Romanian translation
- German : same training data as Transformer (2017), newstest2013 as test set
- French : standard training data from 2015, newstest2014 as test set
- Romanian : training and test data both from WMT 2016
2.4 Input and Output Format
- the text-to-text framework provides a consistent training objective for both pre-training and fine-tuning
- trained with maximum likelihood objective regardless of the task
- to specify the task, a task-specific text prefix is added to the input sequence before feeding it to the model (see the sketch after this list)
- As an example, to ask the model to translate the sentence “That is good.” from English to German, the model would be fed the sequence “translate English to German: That is good.” and would be trained to output “Das ist gut.”
- choice of text prefix used for a given task is essentially a hyperparameter
- changing the exact wording of the prefix had limited impact
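A tiny sketch of the prefixing scheme; the translation prefix and target are the paper's own example, while the other prefix strings are illustrative and, as noted above, essentially hyperparameters:

```python
# Hypothetical helper illustrating the text-to-text format: every task becomes
# "prefix + input text" -> "output text", trained with the same maximum
# likelihood objective.
def to_text_to_text(task: str, text: str) -> str:
    prefixes = {
        "en_de": "translate English to German: ",   # WMT En-De
        "summarize": "summarize: ",                 # CNN/Daily Mail
        "cola": "cola sentence: ",                  # GLUE single-sentence task
    }
    return prefixes[task] + text

print(to_text_to_text("en_de", "That is good."))
# input  -> "translate English to German: That is good."
# target -> "Das ist gut."
```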
3. Experiments
survey on transfer learning for NLP : https://researchrepository.universityofgalway.ie/server/api/core/bitstreams/db70a7a4-836b-4161-9269-e979efdd01ef/content
3.1 Baseline
3.1.1 Model
Encoder-Decoder Structure
- standard encoder-decoder transformer by Transformer (2017)
- encoder and decoder are each similar in size and configuration to a BERT_base stack
- both encoder and decoder consist of 12 blocks
- each with self-attention, encoder-decoder attention (decoder blocks only), feed-forward network
- many approaches use Transformer architecture consisting of only a single “stack”
Feed-forward Network
- dense layers
- output dimensionality of $d_{ff} = 3072$
- ReLU non-linearity between dense layers
“Key” and “Value” matrices of all attention mechanisms
- inner dimensionality of $d_{kv} = 64$
- all attention mechanisms have 12 heads
All sub-layers and embeddings
- dimensionality of $d_{model}=768$
In total, this results in a model with about 220 million parameters
- roughly twice the number of parameters of BERT_base
Regularization
- dropout probability of 0.1 for all dropouts
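Assuming the Hugging Face transformers library (not the authors' original codebase), the baseline hyperparameters above can be written down as a config and the parameter count checked directly:

```python
from transformers import T5Config, T5ForConditionalGeneration

# Baseline hyperparameters from Sections 3.1.1 / 3.1.3, expressed as a
# Hugging Face T5Config (a sketch, not the original training setup).
config = T5Config(
    vocab_size=32_000,       # 32,000 wordpieces
    d_model=768,             # dimensionality of sub-layers and embeddings
    d_kv=64,                 # inner dimensionality of key/value projections
    d_ff=3072,               # feed-forward output dimensionality
    num_layers=12,           # encoder blocks
    num_decoder_layers=12,   # decoder blocks
    num_heads=12,            # attention heads per attention mechanism
    dropout_rate=0.1,        # dropout probability used everywhere
)
model = T5ForConditionalGeneration(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # roughly 220 million
```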
3.1.2 Training
as all tasks are text-to-text, the model is trained using standard maximum likelihood
- teacher forcing
- cross-entropy loss
for optimization : AdaFactor
greedy decoding : choosing highest-probability logit at every timestep
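A minimal sketch of greedy decoding; step_logits_fn is a hypothetical stand-in for one decoder forward pass:

```python
def greedy_decode(step_logits_fn, start_id: int, eos_id: int, max_len: int = 64):
    """At every timestep pick the token whose logit is highest and feed it
    back in, stopping at the end-of-sequence token or the length limit."""
    output = [start_id]
    for _ in range(max_len):
        logits = step_logits_fn(output)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        output.append(next_id)
        if next_id == eos_id:
            break
    return output

# toy usage: a fake "model" that prefers token 3 twice, then the EOS token (id 1)
fake_logits = lambda toks: [0.0, 2.0, 0.0, 3.0] if len(toks) < 3 else [0.0, 9.0, 0.0, 0.0]
print(greedy_decode(fake_logits, start_id=0, eos_id=1))  # [0, 3, 3, 1]
```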
🌟 Pre-training : $2^{35} \approx 34B$ tokens
- C4 dataset
- steps: $2^{19}=524,288$
- maximum sequence length: 512
- batch size: 128
- $2^{16}=65,536$ tokens per batch
- learning rate scheduler: inverse square root
🌟 Fine-tuning
- steps : $2^{18} = 262,144$
- maximum sequence length: 512
- batch size: 128
- $2^{16}=65,536$ tokens per batch
- constant learning rate : 0.001 / no learning rate scheduler
- checkpoint every 5,000 steps
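A sketch of the two learning-rate rules; the 10^4 warm-up steps for the inverse-square-root schedule are taken from the paper rather than the notes above:

```python
def pretrain_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square root schedule used for pre-training: 1/sqrt(max(step, warmup)),
    i.e. a constant 0.01 for the first `warmup_steps` steps, then ~1/sqrt(step) decay."""
    return 1.0 / max(step, warmup_steps) ** 0.5

def finetune_lr(step: int) -> float:
    """Constant learning rate used for fine-tuning (no scheduler)."""
    return 1e-3

print(pretrain_lr(1), pretrain_lr(100_000), finetune_lr(1))  # 0.01  ~0.00316  0.001
```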
3.1.3 Vocabulary
- SentencePiece is used to encode text as WordPiece tokens
- vocabulary of 32,000 wordpieces
3.1.4 Unsupervised Objective
Preliminaries
- Causal Language Modeling
- GPT-1, ELMo, ULMFiT
- Masked Language Modeling
- denoising objectives
- BERT
- masks spans
T5
- consecutive spans of dropped-out tokens are replaced by a single sentinel token
- Each sentinel token is assigned a token ID that is unique to the sequence. The sentinel IDs are special tokens which are added to our vocabulary and do not correspond to any wordpiece. The target then corresponds to all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence plus a final sentinel token to mark the end of the target sequence.
⇒ Mask consecutive spans of tokens and only predict dropped-out tokens
- reduces the computational cost of pre-training
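A minimal sketch of the sentinel-based span corruption; the <extra_id_N> sentinel names follow a common convention (the paper writes them as <X>, <Y>, <Z>), and span selection is simplified to independent per-token masking rather than the paper's span-length sampling:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, seed=0):
    """Drop out random tokens, replace each consecutive dropped-out span in the
    input with a single unique sentinel, and build a target made of the dropped
    spans delimited by the same sentinels plus a final end-of-target sentinel."""
    rng = random.Random(seed)
    dropped = [rng.random() < corruption_rate for _ in tokens]
    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if dropped[i]:
            mark = f"<extra_id_{sentinel}>"
            inputs.append(mark)
            targets.append(mark)
            while i < len(tokens) and dropped[i]:   # whole span -> one sentinel
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")        # final sentinel ends the target
    return " ".join(inputs), " ".join(targets)

src = "Thank you for inviting me to your party last week .".split()
print(span_corrupt(src, corruption_rate=0.3))
```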