How To

Huggingface Transformers

Implementing transformer models for natural language processing

Atif Khurshid

--

Transformers are a family of deep learning models based on attention mechanisms. First proposed by Vaswani et al. in 2017, these models have achieved state-of-the-art results on many natural language processing tasks. Transformers have outperformed recurrent networks by harnessing the potential of transfer learning whereby models are pretrained on data-rich tasks, like language modelling, and then fine-tuned on downstream tasks of interest, like summarization. Transfer learning has also allowed multilingual transformer models to generalize across languages and achieve better performance on low-resource languages.

The Huggingface Transformers library provides hundreds of pretrained transformer models for natural language processing. This is a brief tutorial on fine-tuning a Huggingface transformer model.

The library, along with the sentencepiece package required by the T5 tokenizer, can be installed using pip as follows:

pip install transformers
pip install sentencepiece

We begin by selecting a model architecture appropriate for our task from this list of available architectures. Let’s say we want to use the T5 model.

The library can be used with both PyTorch and TensorFlow, as it implements a separate version of each model class for either framework. All TensorFlow model classes start with the prefix TF, while configs and tokenizers are shared between the two frameworks.

The models can be imported as

PyTorch:

from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration

TensorFlow:

from transformers import T5Config, T5Tokenizer, TFT5ForConditionalGeneration

T5ForConditionalGeneration is the complete seq2seq model with a language modelling head on top. The library also includes other versions of each architecture: for example, T5Model is the bare T5 model that outputs raw hidden states without any specific head on top, while T5EncoderModel outputs only the raw hidden states of the encoder.
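
As a quick illustration, these variants can be loaded in the same way. This is just a sketch, using the standard t5-small checkpoint as an example.

from transformers import T5Model, T5EncoderModel

encoder_decoder = T5Model.from_pretrained('t5-small')      # raw hidden states from encoder and decoder
encoder_only = T5EncoderModel.from_pretrained('t5-small')  # raw hidden states from the encoder only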

For each model architecture, the library contains several pretrained models of varying sizes. Additionally, users are also allowed to submit fine-tuned or entirely retrained models. We can browse these variants through the “All model pages” link for each model here. Alternatively, we can also search through the models here. We’ll choose the T5-small for reasons that will soon become apparent.

If you’re new to transformers, you may be surprised by how big these models are. The largest T5 model requires 42 GB of storage space alone. But even a standard transformer requires a GB or two.

The first thing we need, then, is around 12 GB of RAM or GPU memory. Don't worry if your own machine is closer to a Pentium 4, because Google Colab comes in handy here. You can run a standard transformer model with batch sizes of around 8–16 on the Colab GPU. Remember to connect the notebook to your Drive to preserve data and models across sessions.

from google.colab import drive
drive.mount('/content/drive')

Second, we should cache these models on Drive after downloading them, so they do not have to be fetched again in every session.

model_dir = '/content/drive/T5model'

config = T5Config()
model = T5ForConditionalGeneration.from_pretrained(
    'google/t5-small-ssm',
    config=config
)
model.save_pretrained(model_dir)

In later sessions, the cached model can then be loaded as

config = T5Config()
model = T5ForConditionalGeneration.from_pretrained(
    model_dir,
    config=config
)

The config can be modified to change the vocab size or even the number of layers. However, modifying some parameters, e.g. model dimension, may cause the loading of pretrained weights to fail.

config = T5Config(vocab_size=250112, num_layers=8, num_heads=6)

The library can also be used to train a custom model from scratch by passing a custom config directly to the model class instead of calling from_pretrained(), as sketched below.
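
Here is a minimal sketch of what that looks like; the config values are arbitrary and only meant to illustrate the idea.

config = T5Config(vocab_size=32128, num_layers=4, num_heads=8, d_model=256)
model = T5ForConditionalGeneration(config)  # randomly initialized weights, no pretraining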

Huggingface also provides pretrained tokenizers that match the vocabulary each model was pretrained with. These can also be cached.

try:
    # Load the tokenizer from the cache directory if it has already been saved
    tokenizer = T5Tokenizer.from_pretrained(model_dir)
except:
    # Otherwise, download the pretrained tokenizer and cache it
    tokenizer = T5Tokenizer.from_pretrained('google/t5-small-ssm')
    tokenizer.save_pretrained(model_dir)

The tokenizer can also be slightly modified to add special tokens like language codes.

tokens = ['en', 'fr', 'zh']
tokenizer.add_special_tokens({'additional_special_tokens' : tokens})
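
Note that adding tokens grows the tokenizer's vocabulary, so the model's token embedding matrix should be resized to match before training; otherwise the new token IDs will be out of range.

model.resize_token_embeddings(len(tokenizer))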

The tokenizers also include useful methods to process the input for training and prediction. The encode method splits a string into tokens and returns a list containing the integer IDs of the tokens while the decode method converts the token IDs back to the original string. These methods are also available for batches.

[int, int, ...] = tokenizer.encode(str)
str = tokenizer.decode([int, int, ...])
[[int, ...], [int, ...], ...] = tokenizer.batch_encode_plus([str, str, ...]).input_ids
[str, str, ...] = tokenizer.batch_decode([[int, ...], [int, ...], ...])
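
As a concrete example, assuming the tokenizer loaded above, a round trip looks like this (the exact IDs depend on the tokenizer's vocabulary):

ids = tokenizer.encode('translate English to French: That is good.')
print(ids)  # a list of integer token IDs, ending with the end-of-sequence ID
print(tokenizer.decode(ids, skip_special_tokens=True))  # back to the original string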

In seq2seq models, like the T5, the tokenizers also provide a handy method for preparing both the encoder and decoder inputs together.

source = [str, str, ...]
target = [str, str, ...]
encodings = tokenizer.prepare_seq2seq_batch(source, target)

where encodings = {input_ids: [[]], attention_mask: [[]], labels: [[]]}

In seq2seq models, like T5, the decoder input must be shifted right by one step relative to the expected output. Here we perform this right shift manually by prepending an extra token to each string in target. This token is defined during pretraining and can be found in the model documentation; in T5, the padding token serves this purpose.

target = [tokenizer.pad_token + ' ' + x for x in target]

encodings = tokenizer.prepare_seq2seq_batch(source, target)

encodings.labels represent the desired output and have two uses: as decoder_input_ids and as labels for the loss function. These two are identical except labels do not include the right-shift token at the start. Therefore, we create two copies of encodings.labels: one for decoder input and one for loss labels. We remove the starting right-shift token from labels as this token is not part of the expected output. We then remove the last token from decoder_input_ids to equalize tensor sizes.

encoder_input_ids = encodings.input_ids
encoder_attention_mask = encodings.attention_mask
decoder_input_ids = encodings.labels[:, :-1].clone()  # keep the right-shift token, drop the last token
labels = encodings.labels[:, 1:].clone()  # drop the right-shift token at the start

Frequently, model inputs are padded to some maximum length to ensure consistent tensor sizes. This is accomplished by appending padding tokens to the inputs. These tokens need to be excluded from loss calculations. The loss functions used inside Huggingface models ignore positions whose label ID is -100, so we convert all padding token IDs in labels to -100.

labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

The forward pass is initiated by

outputs = model(
    input_ids = encoder_input_ids,
    attention_mask = encoder_attention_mask,
    decoder_input_ids = decoder_input_ids,
    labels = labels,
)
loss = outputs[0]
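
Depending on the library version, outputs may also be returned as a ModelOutput object, in which case the individual fields can be accessed by name, which reads a little more clearly:

loss = outputs.loss      # same as outputs[0] when labels are passed
logits = outputs.logits  # raw scores from the language modelling head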

In PyTorch, the model can be trained as

import torch
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# epochs (an integer) and dataloader are assumed to be defined elsewhere
model.train()
for epoch in range(epochs):
    for batch in dataloader:
        batch = batch.to(device)
        optimizer.zero_grad()
        outputs = model(
            input_ids = batch.encoder_input_ids,
            attention_mask = batch.encoder_attention_mask,
            decoder_input_ids = batch.decoder_input_ids,
            labels = batch.labels,
        )
        loss = outputs[0]
        loss.backward()
        optimizer.step()

In PyTorch, call model.eval() before performing a forward pass on validation input, and wrap the pass in torch.no_grad() to skip gradient computation.

model.eval()
with torch.no_grad():
    output = model(
        ...
    )

Once trained, the model can be saved by

model.save_pretrained(new_model_dir)

These checkpoints are useful when training the model in multiple sessions.
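
In a later session, training can then be resumed by loading the checkpoint exactly as we loaded the original pretrained weights.

model = T5ForConditionalGeneration.from_pretrained(new_model_dir)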

A seq2seq model can be used to generate an output by

predicted_tokens = model.generate(
    encoder_input_ids,
    decoder_start_token_id=tokenizer.pad_token_id,
    num_beams=5,
    early_stopping=True,
    max_length=MAX_LEN
)
predicted_strings = tokenizer.batch_decode(
    predicted_tokens,
    skip_special_tokens=True
)

Here, decoder_start_token_id is the ID of the right-shift token.
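
Putting it all together for inference on new text, here is a minimal sketch, assuming the trained model and tokenizer from above; the input sentence and maximum length are arbitrary.

source = ['translate English to French: The weather is nice today.']
inputs = tokenizer(source, return_tensors='pt', padding=True).to(model.device)
predicted_tokens = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    decoder_start_token_id=tokenizer.pad_token_id,
    num_beams=5,
    early_stopping=True,
    max_length=64
)
print(tokenizer.batch_decode(predicted_tokens, skip_special_tokens=True))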

And this brings the tutorial to a close. For more information, you can read the docs here. Adieu!

[Exit, pursued by a bear.]
