(T5) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Abstract
Transfer Learning
- model is first pre-trained on data-rich task before being fine-tuned on a downstream task
- powerful technique in NLP
- has given rise to a diversity of approaches, methodologies, and practices
Objectives
- exploring the landscape of transfer learning techniques for NLP
- introduce unified framework that converts all text-based language problems into text-to-text format
- compare pretraining objectives, architectures, unlabeled data sets, transfer approaches, and other factors on language understanding
1. Introduction
- pre-train a model so it can process text in a way that is amenable to downstream tasks → develops general-purpose knowledge that allows the model to “understand” text
- low-level : spelling/meaning of words
- high-level : knowing that a tuba is too large to fit in most backpacks (common sense, I guess?)
- providing the above knowledge is rarely done explicitly → it is learned as part of an auxiliary task
- word vectors
- pretraining the entire model on a data-rich task → develops general-purpose abilities and knowledge for downstream tasks
- NLP: unsupervised learning on unlabeled data (BERT, UniLM, RoBERTa, ALBERT)
- Benchmarks : GLUE, SuperGLUE, SentEval
Basic Idea of T5
- treat every text processing problem as “text-to-text” problem : taking text as input and text as output.
- QA: https://arxiv.org/abs/1806.08730
- Span Extraction: https://arxiv.org/abs/1904.09286
- LM: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider.
- unified approach → enables us to compare the effectiveness of different transfer learning objectives, unlabeled datasets, etc
2. Setup
2.1 Model
- early transfer learning for NLP mostly relied on recurrent neural networks, but recent work has shifted to Transformer-based models; T5 is similar to these existing architectures
Transformer (2017) Architecture
- based on self-attention
- encoder-decoder structure
- originally designed for sequence-to-sequence tasks
- more recently, language models are also built from a single layer stack with various combinations of self-attention
- Sequence of tokens → Sequence of Embeddings → Encoder
- Encoder
    - stack of blocks, each consisting of a self-attention layer followed by a small feed-forward network
    - layer normalization applied to the input of each sub-component
        - T5 uses a simplified version of layer normalization where the activations are only rescaled and no additive bias is applied
    - a residual skip connection adds each sub-component's input back to its output
    - dropout applied within the feed-forward network, on the skip connection, on the attention weights, and at the input and output of the entire stack
- Decoder
- similar to encoder
- a standard attention mechanism that attends over the encoder output is added after each self-attention layer
- the final decoder output is fed into a dense layer with a softmax output
    - its weights are shared with the input embedding matrix
All attention mechanisms in the Transformer are split into independent “heads” whose outputs are concatenated before being further processed.
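A minimal PyTorch sketch of this multi-head mechanism (standard scaled dot-product attention; input/output projections, masking, and T5's relative position biases are left out, so treat it as an illustration rather than T5's exact implementation):

```python
import torch

def multi_head_attention(q, k, v, num_heads: int) -> torch.Tensor:
    """Split queries/keys/values into independent heads, compute scaled
    dot-product attention per head, then concatenate the head outputs."""
    batch, seq_len, d_model = q.shape
    d_head = d_model // num_heads

    def split_heads(x):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        return x.view(batch, -1, num_heads, d_head).transpose(1, 2)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
    scores = qh @ kh.transpose(-2, -1) / d_head ** 0.5
    weights = torch.softmax(scores, dim=-1)       # attention weights per head
    out = weights @ vh                            # (batch, heads, seq, d_head)
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)  # concat heads

x = torch.randn(2, 5, 768)
print(multi_head_attention(x, x, x, num_heads=12).shape)  # torch.Size([2, 5, 768])
```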
Self-Attention
- order-independent (operation on sets) → requires positional signals/embedding
- instead of fixed embeddings, relative position embeddings are commonly used
- T5 uses a simplified form of position embedding where each embedding is simply a scalar that is added to the corresponding logit used for computing the attention weights
- T5 also shares the position embedding parameters across all layers in the model, though within a given layer each attention head uses a different learned position embedding
- a fixed number of embeddings is learned, each corresponding to a range of possible key-query offsets
- T5 uses 32 embeddings for all models, with ranges that increase logarithmically up to an offset of 128, beyond which all relative positions share the same embedding (see the sketch below)
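A rough sketch of this bucketed relative position scheme, assuming the half-exact/half-logarithmic bucketing used in common open-source T5 implementations; the function name and exact binning are illustrative, not the authors' code:

```python
import math

def relative_position_bucket(relative_position: int,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> int:
    """Map a key-minus-query offset to one of `num_buckets` learned embeddings.

    Bidirectional sketch: half the buckets cover each sign, small offsets get
    one bucket each, larger offsets are binned logarithmically, and every
    offset beyond `max_distance` shares the last bucket."""
    num_buckets //= 2
    bucket = num_buckets if relative_position > 0 else 0   # which sign half
    pos = abs(relative_position)
    max_exact = num_buckets // 2
    if pos < max_exact:
        return bucket + pos                                 # exact buckets for small offsets
    # logarithmically wider bins for offsets in [max_exact, max_distance)
    log_bucket = max_exact + int(
        math.log(pos / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return bucket + min(log_bucket, num_buckets - 1)

print([relative_position_bucket(d) for d in (1, 8, 64, 500)])
```

Each bucket index then selects a learned scalar bias per attention head, which is added to the corresponding attention logit as described above.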
T5 Architecture
similar to Transformer (2017) except:
- removed Layer Norm bias
- Layer Norm outside the residual path
- different position embedding scheme
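A minimal PyTorch sketch of the simplified layer normalization mentioned above, assuming the rescale-only (RMS-style) form used in common open-source T5 implementations; the class name is ours:

```python
import torch

class SimplifiedLayerNorm(torch.nn.Module):
    """Layer norm variant where activations are only rescaled:
    no mean subtraction and no learned additive bias."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(d_model))  # rescaling only
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(dim=-1, keepdim=True)          # no mean subtraction
        return self.weight * x * torch.rsqrt(variance + self.eps)

print(SimplifiedLayerNorm(768)(torch.randn(2, 5, 768)).shape)  # torch.Size([2, 5, 768])
```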
2.3 Downstream Tasks
Benchmarks
- GLUE, SuperGLUE : text classification
- CNN/Daily Mail : abstractive summarization
- SQuAD : question answering
- WMT : English to German, French, Romanian translation
- German : same training data as Transformer (2017), newstest2013 as test set
- French : standard training data from 2015, newstest2014 as test set
- Romanian : training and test data both from WMT 2016
2.4 Input and Output Format
- the text-to-text framework provides a consistent training objective for both pre-training and fine-tuning
- trained with maximum likelihood objective regardless of the task
- to specify the task, a task-specific text prefix is added to the input sequence before feeding it to the model (see the sketch after this list)
- As an example, to ask the model to translate the sentence “That is good.” from English to German, the model would be fed the sequence “translate English to German: That is good.” and would be trained to output “Das ist gut.”
- choice of text prefix used for a given task is essentially a hyperparameter
- changing the exact wording of the prefix had limited impact
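A tiny sketch of the prefixing scheme; the translation prefix and target are the paper's own example, while the other prefix strings are illustrative and, as noted above, essentially hyperparameters:

```python
# Hypothetical helper illustrating the text-to-text format: every task becomes
# "prefix + input text" -> "output text", trained with the same maximum
# likelihood objective.
def to_text_to_text(task: str, text: str) -> str:
    prefixes = {
        "en_de": "translate English to German: ",   # WMT En-De
        "summarize": "summarize: ",                 # CNN/Daily Mail
        "cola": "cola sentence: ",                  # GLUE single-sentence task
    }
    return prefixes[task] + text

print(to_text_to_text("en_de", "That is good."))
# input  -> "translate English to German: That is good."
# target -> "Das ist gut."
```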
3. Experiments
survey on transfer learning for NLP : https://researchrepository.universityofgalway.ie/server/api/core/bitstreams/db70a7a4-836b-4161-9269-e979efdd01ef/content
3.1 Baseline
3.1.1 Model
Encoder-Decoder Structure
- standard encoder-decoder transformer by Transformer (2017)
- encoder and decoder are each similar in size and configuration to a BERT_base stack
- both encoder and decoder consist of 12 blocks
- each with self-attention, encoder-decoder attention (decoder blocks only), feed-forward network
- many approaches use Transformer architecture consisting of only a single “stack”
Feed-forward Network
- dense layers
- output dimensionality of $d_{ff} = 3072$
- ReLU non-linearity between dense layers
“Key” and “Value” matrices of all attention mechanisms
- inner dimensionality of $d_{kv} = 64$
- all attention mechanisms have 12 heads
All sub-layers and embeddings
- dimensionality of $d_{model}=768$
In total, this results in a model with about 220 million parameters
- roughly twice the number of parameters of BERT_base
Regularization
- dropout probability of 0.1 for all dropouts
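Assuming the Hugging Face transformers library (not the authors' original codebase), the baseline hyperparameters above can be written down as a config and the parameter count checked directly:

```python
from transformers import T5Config, T5ForConditionalGeneration

# Baseline hyperparameters from Sections 3.1.1 / 3.1.3, expressed as a
# Hugging Face T5Config (a sketch, not the original training setup).
config = T5Config(
    vocab_size=32_000,       # 32,000 wordpieces
    d_model=768,             # dimensionality of sub-layers and embeddings
    d_kv=64,                 # inner dimensionality of key/value projections
    d_ff=3072,               # feed-forward output dimensionality
    num_layers=12,           # encoder blocks
    num_decoder_layers=12,   # decoder blocks
    num_heads=12,            # attention heads per attention mechanism
    dropout_rate=0.1,        # dropout probability used everywhere
)
model = T5ForConditionalGeneration(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # roughly 220 million
```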
3.1.2 Training
as all tasks are text-to-text, the model is trained using standard maximum likelihood
- teacher forcing
- cross-entropy loss
for optimization : AdaFactor
greedy decoding : choosing highest-probability logit at every timestep
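A minimal sketch of greedy decoding; step_logits_fn is a hypothetical stand-in for one decoder forward pass:

```python
def greedy_decode(step_logits_fn, start_id: int, eos_id: int, max_len: int = 64):
    """At every timestep pick the token whose logit is highest and feed it
    back in, stopping at the end-of-sequence token or the length limit."""
    output = [start_id]
    for _ in range(max_len):
        logits = step_logits_fn(output)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        output.append(next_id)
        if next_id == eos_id:
            break
    return output

# toy usage: a fake "model" that prefers token 3 twice, then the EOS token (id 1)
fake_logits = lambda toks: [0.0, 2.0, 0.0, 3.0] if len(toks) < 3 else [0.0, 9.0, 0.0, 0.0]
print(greedy_decode(fake_logits, start_id=0, eos_id=1))  # [0, 3, 3, 1]
```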
🌟 Pre-training : $2^{35} \approx 34B$ tokens
- C4 dataset
- steps: $2^{19}=524,288$
- maximum sequence length: 512
- batch size: 128
- $2^{16}=65,536$ tokens per batch
- learning rate scheduler: inverse square root
🌟 Fine-tuning
- steps : $2^{18} = 262,144$
- maximum sequence length: 512
- batch size: 128
- $2^{16}=65,536$ tokens per batch
- constant learning rate : 0.001 / no learning rate scheduler
- checkpoint every 5,000 steps
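A sketch of the two learning-rate rules; the 10^4 warm-up steps for the inverse-square-root schedule are taken from the paper rather than the notes above:

```python
def pretrain_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square root schedule used for pre-training: 1/sqrt(max(step, warmup)),
    i.e. a constant 0.01 for the first `warmup_steps` steps, then ~1/sqrt(step) decay."""
    return 1.0 / max(step, warmup_steps) ** 0.5

def finetune_lr(step: int) -> float:
    """Constant learning rate used for fine-tuning (no scheduler)."""
    return 1e-3

print(pretrain_lr(1), pretrain_lr(100_000), finetune_lr(1))  # 0.01  ~0.00316  0.001
```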
3.1.3 Vocabulary
- SentencePiece is used to encode text as WordPiece tokens
- vocabulary of 32,000 wordpieces
3.1.4 Unsupervised Objective
Preliminaries
- Causal Language Modeling
- GPT-1, ELMo, ULMFiT
- Masked Language Modeling
- denoising objectives
- BERT
- masks spans
T5
- consecutive spans of dropped-out tokens are replaced by a single sentinel token
- Each sentinel token is assigned a token ID that is unique to the sequence. The sentinel IDs are special tokens which are added to our vocabulary and do not correspond to any wordpiece. The target then corresponds to all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence plus a final sentinel token to mark the end of the target sequence.
⇒ Mask consecutive spans of tokens and only predict dropped-out tokens
- reduces the computational cost of pre-training
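A minimal sketch of the sentinel-based span corruption; the <extra_id_N> sentinel names follow a common convention (the paper writes them as <X>, <Y>, <Z>), and span selection is simplified to independent per-token masking rather than the paper's span-length sampling:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, seed=0):
    """Drop out random tokens, replace each consecutive dropped-out span in the
    input with a single unique sentinel, and build a target made of the dropped
    spans delimited by the same sentinels plus a final end-of-target sentinel."""
    rng = random.Random(seed)
    dropped = [rng.random() < corruption_rate for _ in tokens]
    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if dropped[i]:
            mark = f"<extra_id_{sentinel}>"
            inputs.append(mark)
            targets.append(mark)
            while i < len(tokens) and dropped[i]:   # whole span -> one sentinel
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")        # final sentinel ends the target
    return " ".join(inputs), " ".join(targets)

src = "Thank you for inviting me to your party last week .".split()
print(span_corrupt(src, corruption_rate=0.3))
```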