How To
Huggingface Transformers
Implementing transformer models for natural language processing
Transformers are a family of deep learning models based on attention mechanisms. First proposed by Vaswani et al. in 2017, these models have achieved state-of-the-art results on many natural language processing tasks. Transformers have outperformed recurrent networks by harnessing the potential of transfer learning whereby models are pretrained on data-rich tasks, like language modelling, and then fine-tuned on downstream tasks of interest, like summarization. Transfer learning has also allowed multilingual transformer models to generalize across languages and achieve better performance on low-resource languages.
The Huggingface Transformers library provides hundreds of pretrained transformer models for natural language processing. This is a brief tutorial on fine-tuning a Huggingface transformer model.
The library can be installed using pip as follows:
pip install transformers
pip install sentencepiece
We begin by selecting a model architecture appropriate for our task from this list of available architectures. Let’s say we want to use the T5 model.
The library can be used with both PyTorch and TensorFlow: each model is implemented in both a PyTorch and a TensorFlow version, with the TensorFlow classes carrying the prefix TF, while configurations and tokenizers are shared between the two frameworks.
The models can be imported as
PyTorch:
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration
TensorFlow:
from transformers import T5Config, T5Tokenizer, TFT5ForConditionalGeneration
T5ForConditionalGeneration is the complete seq2seq model with a language modelling head. The library also includes other versions of the architecture for each model. For example, T5Model is the bare T5 model that outputs raw hidden states without a specific head on top, while T5EncoderModel outputs the raw hidden states of the encoder only.
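As a quick illustration of the encoder-only variant (a minimal sketch, not part of the original walkthrough; it assumes the public t5-small checkpoint and the attribute names of the standard model output):

import torch
from transformers import T5Tokenizer, T5EncoderModel

tok = T5Tokenizer.from_pretrained('t5-small')
encoder_only = T5EncoderModel.from_pretrained('t5-small')

enc = tok('a quick test sentence', return_tensors='pt')
with torch.no_grad():
    hidden = encoder_only(**enc).last_hidden_state  # shape: (batch, sequence_length, d_model)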
For each model architecture, the library contains several pretrained models of varying sizes. Users can also submit their own fine-tuned or entirely retrained models. We can browse these variants through the “All model pages” link for each model here, or search through the models here. We’ll choose T5-small for reasons that will soon become apparent.
If you’re new to transformers, you may be surprised by how big these models are. The largest T5 model requires 42 GB of storage space alone. But even a standard transformer requires a GB or two.
The first thing we need, then, is around 12 GB of RAM or GPU memory. You won’t get far running this on an old Pentium 4, so Google Colab comes in handy here. You can run a standard transformer model with batch sizes of around 8–16 on the Colab GPU. Remember to connect the notebook to your Drive to preserve data and models across sessions.
from google.colab import drive
drive.mount('/content/drive')
Secondly, we need to cache these models after download.
model_dir = '/content/drive/T5model'
config = T5Config()
model = T5ForConditionalGeneration.from_pretrained(
'google/t5-small-ssm',
config=config
)
model.save_pretrained(model_dir)
On subsequent runs, the cached model can be loaded as
config = T5Config()
model = T5ForConditionalGeneration.from_pretrained(
model_dir,
config=config
)
The config can be modified to change the vocabulary size or even the number of layers. However, modifying some parameters, e.g. the model dimension, may cause the loading of pretrained weights to fail.
config = T5Config(vocab_size=250112, num_layers=8, num_heads=6)
The library can also be used to train a custom model from scratch by providing a custom config and instantiating the model directly instead of calling from_pretrained().
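For instance (a minimal sketch; the hyperparameter values are arbitrary placeholders), a randomly initialized model can be built directly from a config:

config = T5Config(vocab_size=32128, num_layers=4, num_heads=8, d_model=512)  # arbitrary example values
model = T5ForConditionalGeneration(config)  # random weights, no pretrained checkpoint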
Huggingface also provides tokenizers designed for the data used to pretrain the models. These can also be cached.
try:
    tokenizer = T5Tokenizer.from_pretrained(model_dir)
except Exception:  # fall back to downloading if no cached copy exists
    tokenizer = T5Tokenizer.from_pretrained('google/t5-small-ssm')
    tokenizer.save_pretrained(model_dir)
The tokenizer can also be slightly modified to add special tokens like language codes.
tokens = ['en', 'fr', 'zh']
tokenizer.add_special_tokens({'additional_special_tokens' : tokens})
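If new tokens are added, the model’s token embedding matrix should be resized to match the enlarged vocabulary:

model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to cover the added tokens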
The tokenizers also include useful methods for processing inputs for training and prediction. The encode method splits a string into tokens and returns a list of the integer IDs of the tokens, while the decode method converts token IDs back into a string. Batches of strings can be encoded by calling the tokenizer directly and decoded with batch_decode.
[int, int, ...] = tokenizer.encode(str)
str = tokenizer.decode([int, int, ...])
[[int...], [int...], ...] = tokenizer([str, str, ...]).input_ids
[str, str, ...] = tokenizer.batch_decode([[int...], [int...], ...])
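As a quick sanity check (a minimal sketch with a made-up input string), a round trip looks like this:

ids = tokenizer.encode('translate English to German: Hello world')  # a list of integer token IDs
text = tokenizer.decode(ids, skip_special_tokens=True)  # recovers the original string (up to tokenizer normalization)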
In seq2seq models like T5, the tokenizer also provides a handy method for preparing both the encoder and decoder inputs together.
source = [str, str, ...]
target = [str, str, ...]
encodings = tokenizer.prepare_seq2seq_batch(source, target)
where encodings = {input_ids: [[]], attention_mask: [[]], labels: [[]]}
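Note that prepare_seq2seq_batch is deprecated in newer releases of the library. If your installed version warns about it or no longer provides it, an equivalent encoding can be produced by calling the tokenizer with a text_target argument (a sketch, assuming a recent 4.x release):

encodings = tokenizer(source, text_target=target, padding=True, return_tensors='pt')
# encodings.input_ids, encodings.attention_mask and encodings.labels as before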
In seq2seq models, the decoder input must be right-shifted by one step. Here, we perform this right-shift manually by prepending an extra token to each string in target. This token is defined during pretraining and can be found in the model documentation; in T5, the padding token is used for this purpose.
target = [tokenizer.pad_token + ' ' + x for x in target]
...
encodings = tokenizer.prepare_seq2seq_batch(source, target)
encodings.labels represent the desired output and have two uses: as decoder_input_ids and as labels for the loss function. The two are identical except that labels do not include the right-shift token at the start. Therefore, we create two copies of encodings.labels: one for the decoder input and one for the loss labels. We remove the starting right-shift token from labels, as this token is not part of the expected output, and we remove the last token from decoder_input_ids to equalize tensor sizes.
encoder_input_ids = encodings.input_ids
encoder_attention_mask = encodings.attention_mask
decoder_input_ids = encodings.labels[:, :-1].clone()  # skip last
labels = encodings.labels[:, 1:].clone()  # skip first
Frequently, model inputs are padded to some maximum length to ensure consistent tensor sizes. This is accomplished by appending padding tokens to the inputs, and these tokens need to be excluded from loss calculations. Huggingface’s loss functions ignore the ID -100, so we convert all padding token IDs in labels to -100.
labels[labels == tokenizer.pad_token_id] = -100
The forward pass is initiated by
outputs = model(
input_ids = encoder_input_ids,
attention_mask = encoder_attention_mask,
decoder_input_ids = decoder_input_ids,
labels = labels,
)
loss = outputs[0]
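Equivalently, in recent library versions where the model returns a named output object by default, the loss can be accessed by name (a minor convenience, assuming a 4.x release):

loss = outputs.loss  # same value as outputs[0] when labels are supplied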
In PyTorch, the model can be trained as
import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

model.train()
for epoch in range(epochs):  # epochs: number of training epochs
    for batch in dataloader:
        batch = batch.to(device)
        optimizer.zero_grad()
        outputs = model(
            input_ids = batch.encoder_input_ids,
            attention_mask = batch.encoder_attention_mask,
            decoder_input_ids = batch.decoder_input_ids,
            labels = batch.labels,
        )
        loss = outputs[0]
        loss.backward()
        optimizer.step()
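The dataloader above is left undefined. One possible way to produce batches with the attribute names used in the loop is a small Dataset plus collate function (a hypothetical sketch: Seq2SeqDataset, collate and the batch size are illustrative choices, not part of the library):

from torch.utils.data import Dataset, DataLoader
from transformers import BatchEncoding

class Seq2SeqDataset(Dataset):  # hypothetical helper, not a library class
    def __init__(self, sources, targets):
        self.sources, self.targets = sources, targets
    def __len__(self):
        return len(self.sources)
    def __getitem__(self, idx):
        return self.sources[idx], self.targets[idx]

def collate(pairs):
    sources, targets = zip(*pairs)
    targets = [tokenizer.pad_token + ' ' + t for t in targets]  # prepend the right-shift token
    enc = tokenizer.prepare_seq2seq_batch(list(sources), list(targets), padding=True, return_tensors='pt')
    labels = enc.labels[:, 1:].clone()
    labels[labels == tokenizer.pad_token_id] = -100
    return BatchEncoding({
        'encoder_input_ids': enc.input_ids,
        'encoder_attention_mask': enc.attention_mask,
        'decoder_input_ids': enc.labels[:, :-1].clone(),
        'labels': labels,
    })

dataloader = DataLoader(Seq2SeqDataset(source, target), batch_size=8, shuffle=True, collate_fn=collate)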
In PyTorch, run model.eval() before performing a forward pass on validation input, and use with torch.no_grad() to skip gradient computation.
model.eval()
with torch.no_grad():
    output = model(
        ...
    )
Once trained, the model can be saved by
model.save_pretrained(new_model_dir)
These checkpoints are useful when training the model in multiple sessions.
A seq2seq model can be used to generate an output by
predicted_tokens = model.generate(
    encoder_input_ids,
    decoder_start_token_id=tokenizer.pad_token_id,
    num_beams=5,
    early_stopping=True,
    max_length=MAX_LEN
)
predicted_strings = tokenizer.batch_decode(
    predicted_tokens,
    skip_special_tokens=True
)
Here, decoder_start_token_id is the ID of the right-shift token.
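For example (a minimal sketch with a made-up input sentence; the 'summarize:' prefix assumes a summarization-style fine-tune, and device is the same device used during training):

text = 'summarize: The quick brown fox jumps over the lazy dog.'  # made-up example input
enc = tokenizer(text, return_tensors='pt').to(device)
pred = model.generate(
    enc.input_ids,
    decoder_start_token_id=tokenizer.pad_token_id,
    num_beams=5,
    max_length=32,
)
print(tokenizer.decode(pred[0], skip_special_tokens=True))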
And this brings the tutorial to a close. For more information, you can read the docs here. Adieu!
[Exit, pursued by a bear.]