Hyperparameter tuning a Transformer with Optuna

This blog assumes you know a little about transformers and their architecture. To get to grips with the transformer we have used for this example, check out how the BERT architecture works:

Once you have watched that video, we will load a related model called ELECTRA. ELECTRA is a pretraining approach that trains two transformer models: the generator and the discriminator. The generator’s role is to replace tokens in a sequence, and it is therefore trained as a masked language model. The discriminator, which is the model we’re interested in, tries to identify which tokens in the sequence were replaced by the generator.

Let’s not get too far down the rabbit hole yet!

What the devil is Optuna?

Optuna is:

An open source hyperparameter optimization framework to automate hyperparameter search
Eager search spaces: automated search for optimal hyperparameters using Python conditionals, loops, and syntax
State-of-the-art algorithms: efficiently search large spaces and prune unpromising trials for faster results
Easy parallelisation: parallelise hyperparameter searches over multiple threads or processes without modifying code

The website is a great resource and contains Tutorials, a Community to help get you up and running and a supporting GitHub.

Setting up our imports

Specify our parameter and project variables

The next step is to set the ranges we want our hyperparameter search to explore, along with some project settings such as the name of the model:

Here we have set:

Learning rate minimum and maximum (ceiling), named LR_MIN and LR_CEIL
Weight decay minimum and ceiling, named WD_MIN and WD_CEIL
Minimum and maximum epochs, named MIN_EPOCHS and MAX_EPOCHS
Per-device batch sizes for the training and evaluation sets, named PER_DEVICE_TRAIN_BATCH and PER_DEVICE_EVAL_BATCH
The number of Optuna trials to run; each trial samples and evaluates one hyperparameter combination, so increasing this explores more of the search space
SAVE_DIR is the name of the folder to save the model to
NAME_OF_MODEL is what I want to call my serialised and fine-tuned transformer network
MAX_LENGTH is the maximum sequence length, in tokens
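As a rough sketch, the settings listed above could be declared like this. The constant names follow the list, apart from NUM_TRIALS, which is a hypothetical name for the trial count. The numeric values and the folder name are illustrative placeholders to adapt rather than the exact settings used for this post:

```python
# Search ranges for the hyperparameters Optuna will tune (placeholder values).
LR_MIN = 4e-5
LR_CEIL = 1e-2
WD_MIN = 4e-5
WD_CEIL = 1e-2
MIN_EPOCHS = 2
MAX_EPOCHS = 5

# Fixed training settings (placeholder values).
PER_DEVICE_TRAIN_BATCH = 4
PER_DEVICE_EVAL_BATCH = 4
NUM_TRIALS = 1  # hypothetical name; kept at 1 so the search finishes quickly

# Project settings.
SAVE_DIR = "opt-test"  # hypothetical folder name
NAME_OF_MODEL = "huggingoptunaface"
MAX_LENGTH = 512  # assumed to match the longest input ELECTRA-small accepts

print(f"Learning rate range: {LR_MIN} to {LR_CEIL}")
print(f"Weight decay range: {WD_MIN} to {WD_CEIL}")
print(f"Epoch range: {MIN_EPOCHS} to {MAX_EPOCHS}")
```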

Work with HuggingFace datasets for this example

Here we will use the HuggingFace datasets package to work with the adverse drug reaction dataset ade_corpus_v2_classification. This contains text and a label to indicate whether the sentence is related to an adverse drug reaction. Once we have the dataset loaded we will split it into training and test sets with train_test_split, passing the test set size as a proportion:

You will see a similar message to the below:

Load in Electra Small for the model

I am going to use a small model and tokeniser for this tutorial, as they are lightweight and not as slow to train as the large transformer models:

When working with transformers you will become very used to loading in pretrained models using from_pretrained(). In other tutorials I will delve into pretraining and fine-tuning your own model, on your own dataset, but for now we will use a pretrained model that has been contributed to HuggingFace.
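A minimal sketch of that loading step is below. The checkpoint name google/electra-small-discriminator is my assumption for "Electra Small", and load_tokenizer_and_model is a hypothetical helper name. num_labels=2 reflects the binary task:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed Hub checkpoint for "Electra Small"; swap in another name if needed.
MODEL_NAME = "google/electra-small-discriminator"


def load_tokenizer_and_model(model_name: str = MODEL_NAME, num_labels: int = 2):
    """Download (or read from the local cache) a tokenizer and a classifier."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The classification head is newly initialised, so the model only becomes
    # useful for our task after fine-tuning.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model


# Usage (downloads the checkpoint on first call):
# tokenizer, model = load_tokenizer_and_model()
```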

Preprocess our text

Next we will create a function to preprocess our text:

This function:

tokenizes the text field of our examples
truncates the text
pads every sequence to our maximum length; without this, texts of different lengths could not be combined into same-sized batches
sets the maximum length of our sequences

Finally, we use the map function to map our preprocessing function to our dataset, in batches.
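A preprocessing function along these lines might look like the sketch below. Passing the tokenizer in as an argument (rather than reading a global) is my own choice to keep the snippet self-contained, and the MAX_LENGTH value is a placeholder:

```python
MAX_LENGTH = 512  # placeholder; use the value from your project settings


def preprocess_function(examples, tokenizer, max_length=MAX_LENGTH):
    """Tokenize a batch of examples into fixed-length model inputs."""
    return tokenizer(
        examples["text"],
        truncation=True,       # cut sequences longer than max_length
        padding="max_length",  # pad shorter ones up to max_length
        max_length=max_length,
    )


# Usage with a HuggingFace tokenizer and dataset (not run here):
# encoded = dataset.map(
#     preprocess_function,
#     batched=True,
#     fn_kwargs={"tokenizer": tokenizer},
# )
```

With batched=True, map hands the function a dictionary of lists, which is why examples["text"] is a list of strings rather than a single string.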

Using Optuna to set our objective function

Optuna is a brilliant tool for hyperparameter tuning as it parallelises and iterates through the suggested ranges of your hyperparameters. It does this by repeatedly calling an objective function, once per trial. Our objective function is defined as below:

This will need some explanation:

the only parameter of the function is the trial variable, which is annotated as an optuna.Trial type
we then initialise our model using AutoModelForSequenceClassification, with electra-small as the pretrained model
in our transformers TrainingArguments we set the fixed values, as well as those we want tuning:
- output_dir is where we want to save our model and artefacts to
- learning_rate is the step size used for each weight update. Here we use the special Optuna command trial.suggest_loguniform('learning_rate', low=LR_MIN, high=LR_CEIL). This says sample the learning rate from a log-uniform distribution between the LR_MIN and LR_CEIL constants we specified at the start when we defined the parameters. Note that suggest_loguniform has been deprecated since Optuna 3.0 in favour of suggest_float with log=True
- weight_decay – this is a regularisation technique that adds a small penalty, usually the squared L2 norm of the weights (all the weights of the model), to the loss function, i.e. loss = loss + weight decay parameter * squared L2 norm of the weights. The penalty discourages large weights and so helps to prevent overfitting
- num_train_epochs – this is the number of complete epochs to train the model for, searched between MIN_EPOCHS and MAX_EPOCHS
- the per-device batch sizes control the size of the training and evaluation batches. These are the PER_DEVICE_TRAIN_BATCH and PER_DEVICE_EVAL_BATCH constants we included at the top of our script
- disable_tqdm – disables or enables the tqdm package’s progress bars
We will now use the Trainer class built into the transformers package, together with the TrainingArguments we have just populated from our fixed inputs and Optuna’s suggestions, for hyperparameter searching and iteration. The Trainer has the following inputs:
- the model we initialised – in this case it is Electra-Small, but it could be a larger model such as BERT or a knowledge-distilled RoBERTa architecture
- the training arguments we captured in the TrainingArguments wrapper
- the training and evaluation sets to train and evaluate the model on
Finally, we train the model and return the training_loss from the training run

Create the Optuna study

The steps below show how to create the Optuna study object. We then pass in the objective function we created in the previous steps and tell the study to run it for a number of trials:

Here we create our study with a name and tell it that what we want to do is minimise loss. We then call study.optimize with the function we defined and the number of trials (how many hyperparameter combinations we want to try).

At this point your GPU (I wouldn’t recommend this on a CPU) starts to fire up and away the training goes. This is the part where you need to run the algorithm overnight and be calm:

Ideally, you would want multiple trials, but I have kept this as 1, to allow you to run the code in under a day.

Get the best study hyperparameters

Once the trials have run through the various hyperparameter combinations, we need a way to extract the best hyperparameters to pass to our final model. The next steps show you how to do this in a couple of lines of Python code:

Here we store a variable for each of our optimal study parameters (the ones from the trial with the lowest loss, since we asked Optuna to minimize it). The remaining lines of code just print out the hyperparameters, which Optuna stores in a dictionary:

Create model based on best hyperparameters

The final step is to train our optimised model using the same process we outlined for the earlier model training. The only difference in the script now is that, instead of asking Optuna to suggest the learning rate, weight decay and number of epochs, we pass in the best values it found:
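As a sketch, the final training run might look like this. The helper names final_training_kwargs and train_final_model are hypothetical, and the constants and checkpoint name are the same placeholders as before:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

SAVE_DIR = "opt-test"  # placeholder output folder
MODEL_NAME = "google/electra-small-discriminator"  # assumed checkpoint
PER_DEVICE_TRAIN_BATCH = PER_DEVICE_EVAL_BATCH = 4  # placeholder batch sizes


def final_training_kwargs(best_params: dict) -> dict:
    """Turn Optuna's best_params into fixed TrainingArguments keyword arguments."""
    return {
        "output_dir": SAVE_DIR,
        "learning_rate": float(best_params["learning_rate"]),
        "weight_decay": float(best_params["weight_decay"]),
        "num_train_epochs": int(best_params["num_train_epochs"]),
        "per_device_train_batch_size": PER_DEVICE_TRAIN_BATCH,
        "per_device_eval_batch_size": PER_DEVICE_EVAL_BATCH,
        "disable_tqdm": True,
    }


def train_final_model(best_params: dict, train_dataset, eval_dataset) -> Trainer:
    """Fine-tune a fresh model once, using the best hyperparameters found."""
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(**final_training_kwargs(best_params)),
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    return trainer
```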

Saving our best model

This is the last step of our tuning and training process. We now need to save our best model so we can use it later on to perform inference on the relevant dataset:

This will: a) create the model directory if it doesn’t exist; b) build the model path from that folder and the name of the model (this one is called huggingoptunaface); and c) save the tokenizer and model to the model path. The method that does the saving is save_pretrained.
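Those three steps can be sketched as a small helper. The function name save_model_and_tokenizer and the SAVE_DIR value are hypothetical, and the helper works with any objects that expose save_pretrained, which HuggingFace models and tokenizers do:

```python
import os

SAVE_DIR = "opt-test"  # hypothetical folder name
NAME_OF_MODEL = "huggingoptunaface"


def save_model_and_tokenizer(model, tokenizer, save_dir=SAVE_DIR, name=NAME_OF_MODEL) -> str:
    """Save the model and tokenizer side by side and return the model path."""
    # a) create the model directory if it does not exist yet
    os.makedirs(save_dir, exist_ok=True)
    # b) build the path from the folder and the model name
    model_path = os.path.join(save_dir, name)
    # c) write the weights, config and tokenizer files to that path
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)
    return model_path


# Usage after training (not run here):
# model_path = save_model_and_tokenizer(trainer.model, tokenizer)
```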

Once your script has run the model will be saved with the relevant name:

The model is stored as a bin file (newer versions of transformers write a safetensors file instead), and the special tokens, tokenizer settings and training config are stored as JSON documents. The vocab.txt file contains the tokenizer’s vocabulary, which comes from the pretrained tokenizer and is not changed by fine-tuning.

The full script is contained below:

Loading and using the model

In the next step we will create a script just to load our fine-tuned, Optuna-optimised model and make inferences with it.

The steps below will declare the relevant imports and load our model and tokenizer that we have fine tuned with one of the HuggingFace datasets:

We have loaded our trained model from the relevant directory. Now we will pass an example sentence through the model and prepare a function that takes the text and reduces the model output to a class label:

To explain the steps:

text is the example text we want to predict the label for
the get_result function takes in the parameters text and message (boolean):
- passes our example text through our loaded tokenizer and returns PyTorch tensors (return_tensors set to pt; use tf for TensorFlow or np for NumPy)
- passes our tokenized and encoded input to the model we have fine-tuned
- we then unpack the PyTorch tensor as a NumPy array
- finally, we take the argmax of the result to get our binary label prediction
- we also keep the logits, which are the model’s raw, unnormalised scores rather than probabilities (a softmax would convert them)
- if message is set to True then the class label is printed to the console
- the function returns the result (tensor) and the class label
At the end of the script we use multiple assignment to unpack the returned result and class_label
because we set the message parameter to True, the class label is also printed to the console
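The function described in the steps above might look like the following sketch. Two things here are my own assumptions: the model and tokenizer are passed in as arguments rather than read from globals, so that the snippet is self-contained, and the wording of CLASS_LABELS is illustrative. MODEL_PATH in the usage comment is a hypothetical variable pointing at wherever the fine-tuned model was saved:

```python
import numpy as np
import torch

# Assumed label wording for the binary adverse drug reaction task.
CLASS_LABELS = {
    0: "Not related to an adverse drug reaction",
    1: "Related to an adverse drug reaction",
}


def get_result(text, model, tokenizer, message=True):
    """Return (logits as a NumPy array, predicted class label) for one text."""
    # Tokenize and encode the text as PyTorch tensors.
    encoded_input = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only, so no gradients are needed
        output = model(**encoded_input)
    # Raw, unnormalised scores with shape (1, number_of_classes).
    logits = output.logits.detach().cpu().numpy()
    predicted_index = int(np.argmax(logits, axis=-1)[0])
    class_label = CLASS_LABELS[predicted_index]
    if message:
        print(f"Predicted class: {class_label}")
    return logits, class_label


# Usage (not run here):
# from transformers import AutoModelForSequenceClassification, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
# result, class_label = get_result("Some example sentence", model, tokenizer)
```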

And that is it – we have fine-tuned and optimised the hyperparameters of our Electra-Small model and then loaded the model back in to make inferences with it.

The full script for the inferencing script is here:

What’s next?

I aim to create a tutorial to guide you through training your own tokenizer, pre-training your transformer model on a collection of relevant articles and then fine-tuning it on a classification task, as this is one of the most common NLP challenges.

I hope you had fun working through this with me and keep up the good work. Now it is time to: