This blog assumes you know a little about transformers and their architecture. To get to grips with the transformer used in this example, check out how the BERT architecture works:
Once you have watched that video, we will load a related model called ELECTRA. ELECTRA is a pretraining approach that trains two transformer models: the generator and the discriminator. The generator’s role is to replace tokens in a sequence, so it is trained as a masked language model. The discriminator, which is the model we’re interested in, tries to identify which tokens were replaced by the generator in the sequence.
Let’s not get too far down the rabbit hole yet!
What the devil is Optuna?
- An open source hyperparameter optimization framework that automates hyperparameter search
- Eager search spaces: you define the search for optimal hyperparameters using ordinary Python conditionals, loops, and syntax
- SOTA algorithms to efficiently search large spaces and prune unpromising trials for faster results
- Easy parallelisation: run hyperparameter searches over multiple threads or processes without modifying code
The website is a great resource: it contains tutorials, a community to help get you up and running, and a supporting GitHub repository.
Setting up our imports
Specify our parameter and project variables
The next step is to set the ranges we want our hyperparameter search to explore, along with some project settings such as the name of the model:
Here we have set:
- Learning rate minimum and maximum (ceiling), named LR_MIN and LR_CEIL
- Weight decay minimum and ceiling
- Minimum and maximum epochs
- Per-device batch sizes for the training and evaluation sets, named PER_DEVICE_TRAIN_BATCH and PER_DEVICE_EVAL_BATCH
- Number of Optuna trials to run: each trial trains the model once with a freshly suggested set of hyperparameters, so increasing this number explores more combinations
- SAVE_DIR is the name of the folder to save the model to
- NAME_OF_MODEL is what I want to call my serialised and fine-tuned transformer network
- MAX_LENGTH is the maximum sequence length, in tokens
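As a rough sketch, the settings listed above could look like the block below. The names LR_MIN, LR_CEIL, PER_DEVICE_TRAIN_BATCH, PER_DEVICE_EVAL_BATCH, SAVE_DIR, NAME_OF_MODEL and MAX_LENGTH appear in this post; the names WD_MIN, WD_CEIL, MIN_EPOCHS, MAX_EPOCHS and NUM_TRIALS are hypothetical, and almost every value is an assumption chosen for illustration rather than the original setting.

```python
# Hyperparameter search ranges (values are illustrative assumptions).
LR_MIN = 4e-5          # learning rate minimum
LR_CEIL = 0.01         # learning rate ceiling
WD_MIN = 4e-5          # hypothetical name: weight decay minimum
WD_CEIL = 0.01         # hypothetical name: weight decay ceiling
MIN_EPOCHS = 2         # hypothetical name: fewest epochs a trial may use
MAX_EPOCHS = 5         # hypothetical name: most epochs a trial may use

# Batch sizes per device (values assumed).
PER_DEVICE_TRAIN_BATCH = 8
PER_DEVICE_EVAL_BATCH = 8

# Number of Optuna trials; hypothetical name, kept at 1 as in this post.
NUM_TRIALS = 1

# Project settings.
SAVE_DIR = "opt-test"                 # assumed folder name
NAME_OF_MODEL = "huggingoptunaface"   # model name used in this post
MAX_LENGTH = 512                      # assumed maximum sequence length in tokens
```

Keeping these as constants at the top of the script means the search ranges can be changed in one place without touching the objective function.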
Work with HuggingFace datasets for this example
Here we will use the HuggingFace datasets package to work with the adverse drug reaction dataset ade_corpus_v2_classification. This contains text and a label to indicate whether the sentence is related to an adverse drug reaction. Once the dataset is loaded, we use train_test_split to hold out a proportion of it as a test set:
You will see a message similar to the one below:
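A minimal sketch of this loading step is below. The Hub identifiers ("ade_corpus_v2" with the "Ade_corpus_v2_classification" configuration), the 20% test proportion and the seed are all assumptions on my part, as is the helper name load_train_and_test; check the dataset card for the exact identifiers in your version of the datasets package.

```python
TEST_SIZE = 0.2  # assumed proportion of rows held out for evaluation


def load_train_and_test(test_size=TEST_SIZE, seed=42):
    """Load the ADE classification data and split it into train and test sets."""
    # Imported here so the sketch can be imported without the datasets
    # package installed; in a real script this sits at the top of the file.
    from datasets import load_dataset

    # Assumed identifiers; the dataset is expected to ship a single "train" split.
    dataset = load_dataset("ade_corpus_v2", "Ade_corpus_v2_classification")["train"]
    # train_test_split returns a DatasetDict with "train" and "test" keys.
    split = dataset.train_test_split(test_size=test_size, seed=seed)
    return split["train"], split["test"]
```

Fixing the seed makes the split repeatable, which matters when you compare trials against the same evaluation set.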
Load in Electra Small for the model
I am going to use a small model and tokeniser for this tutorial, as they are lightweight and not as slow to train as the large transformer models:
When working with transformers you will become very used to loading pretrained models using from_pretrained(). In other tutorials I will delve into pretraining and fine-tuning your own model, on your own dataset, but for now we will use a pretrained model that has been contributed to HuggingFace.
Preprocess our text
Next we will create a function to preprocess our text:
- tokenizes the text field of our examples
- truncates the text
- pads every sequence to our maximum length; without this the tokenizer would not pad, and batches containing sequences of different lengths would cause errors
- sets the maximum length of our sequences (MAX_LENGTH)
Finally, we use the map function to map our preprocessing function to our dataset, in batches.
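The preprocessing steps above can be sketched as follows. The helper name make_preprocess_function and the MAX_LENGTH value are assumptions for illustration; the tokenizer is passed in, so any HuggingFace tokenizer loaded with from_pretrained() should fit.

```python
MAX_LENGTH = 512  # assumed value for illustration


def make_preprocess_function(tokenizer, max_length=MAX_LENGTH):
    """Build a function suitable for dataset.map(..., batched=True)."""

    def preprocess(examples):
        # examples["text"] is a list of strings when map runs in batches.
        return tokenizer(
            examples["text"],
            truncation=True,        # cut sequences longer than max_length
            padding="max_length",   # pad shorter sequences up to max_length
            max_length=max_length,
        )

    return preprocess
```

You would then call something like dataset.map(make_preprocess_function(tokenizer), batched=True) on both the training and test sets.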
Using Optuna to set our objective function
Optuna is a brilliant tool for hyperparameter tuning as it parallelises and iterates through the suggested ranges of your hyperparameters. The objective function is defined as below:
This will need some explanation:
- the parameter of the objective function is the trial variable, which Optuna supplies each time it runs the function
- we then initialise our model using AutoModelForSequenceClassification, with electra-small as the pretrained model
- in our transformers TrainingArguments we set the fixed values, as well as those we want tuned:
- output_dir is where we want to save our model and artefacts
- learning_rate controls how large a step each weight update takes. Here we use the special Optuna command suggest_loguniform('learning_rate', low=LR_MIN, high=LR_CEIL), which samples the learning rate from a logarithmic uniform distribution between the LR_MIN and LR_CEIL constants we defined at the start
- weight_decay – this is a regularisation technique that adds a small penalty, usually the L2 norm of the weights (all the weights of the model), to the loss function, i.e. loss = loss + weight decay parameter * L2 norm of the weights. Penalising large weights keeps them small, which helps to prevent overfitting
- num_train_epochs – this is the number of complete epochs to train the model for
- batch sizes control the size of the training and evaluation batches. These are the PER_DEVICE_TRAIN_BATCH and PER_DEVICE_EVAL_BATCH constants we included at the top of our script
- disable_tqdm – disables or enables the tqdm progress bars
- we then use the Trainer built into the transformers package, passing it the TrainingArguments we have just built from our constants and Optuna’s suggestions. The Trainer has the following inputs:
- the model we initialised – in this case it is Electra-Small, but it could be a larger model such as BERT or a knowledge-distilled RoBERTa architecture
- the training arguments we captured above
- the training and evaluation sets to train and evaluate the model on
- finally, we train the model and return the training_loss from the model
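To make the suggestion step concrete, here is a small sketch of how the three tuned values could be requested from a trial. It uses suggest_float(..., log=True), which is the current spelling of log-uniform sampling in Optuna 3 (suggest_loguniform is deprecated there). The helper name suggest_hyperparameters, the WD_* and *_EPOCHS names, the values, and the choice to sample weight decay on a log scale are all my assumptions.

```python
LR_MIN, LR_CEIL = 4e-5, 0.01      # assumed learning rate range
WD_MIN, WD_CEIL = 4e-5, 0.01      # hypothetical names, assumed range
MIN_EPOCHS, MAX_EPOCHS = 2, 5     # hypothetical names, assumed range


def suggest_hyperparameters(trial):
    """Ask an Optuna trial for the values we want tuned."""
    return {
        # log=True samples uniformly in log space, so 1e-4 and 1e-3 are
        # as likely as each other rather than the range being dominated
        # by values near the ceiling.
        "learning_rate": trial.suggest_float(
            "learning_rate", LR_MIN, LR_CEIL, log=True
        ),
        "weight_decay": trial.suggest_float(
            "weight_decay", WD_MIN, WD_CEIL, log=True
        ),
        # Whole number of epochs between the two bounds, inclusive.
        "num_train_epochs": trial.suggest_int(
            "num_train_epochs", MIN_EPOCHS, MAX_EPOCHS
        ),
    }
```

Because the dictionary keys match the TrainingArguments keyword names, the result can be unpacked straight into TrainingArguments alongside the fixed settings such as output_dir.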
Create the Optuna study
The steps below create the Optuna study object. We then pass in the function (objective) we created in the previous steps and tell the study how many trials (runs) to perform:
Here we create our study with a name and tell it that we want to minimise the loss. We then call study.optimize with the function we defined and the number of trials (the number of times the objective is run).
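A minimal sketch of the study set-up is below. To keep it quick to run, it uses a toy objective in place of the real training function; toy_objective, run_study and the study name are hypothetical names of mine.

```python
def toy_objective(trial):
    """Stand-in for the real objective: smallest when x equals 2."""
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2


def run_study(objective, n_trials, study_name="hyperparameter-search"):
    """Create a study that minimises the objective and run n_trials trials."""
    # Imported here so the toy objective can be exercised without Optuna;
    # in a real script this import sits at the top of the file.
    import optuna

    study = optuna.create_study(study_name=study_name, direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study
```

Calling run_study(toy_objective, n_trials=20) should return a study whose best value is close to zero. If your objective returns a score where higher is better, pass direction="maximize" to create_study instead.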
At this point your GPU (I wouldn’t recommend this on a CPU) starts to fire up and away the training goes. This is the part where you need to run the algorithm overnight and stay calm:
Ideally, you would want multiple trials, but I have kept this as 1, to allow you to run the code in under a day.
Get the best study hyperparameters
Once the study has tried the various hyperparameter combinations, we need to extract the best hyperparameters to pass to our final model. The next steps show you how to do this in a couple of lines of Python code:
Here we will store a variable for each of our optimal study parameters (according to Optuna when asked to minimize loss) and the lines of code then just print out the hyperparameters and store them in a dictionary:
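As a sketch of this extraction step, the helper below reads best_params and best_value from a finished study. The function name best_hyperparameters is hypothetical; the two attributes are the ones an Optuna study exposes.

```python
def best_hyperparameters(study):
    """Print and return the best trial's hyperparameters as a plain dict."""
    # best_params maps each suggested name to the value from the best trial.
    best = dict(study.best_params)
    # best_value is the objective value (here, the loss) of that trial.
    print(f"Best loss: {study.best_value}")
    for name, value in best.items():
        print(f"{name}: {value}")
    return best
```

The returned dictionary can then be used to fill in the learning rate, weight decay and number of epochs when training the final model.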
Create model based on best hyperparameters
The final step is to train our optimised model using the same process as in the earlier training run. The only difference in the script now is that we are no longer asking Optuna to suggest the learning rate, weight decay and number of epochs; we pass in the best values from the study instead:
Saving our best model
This is the last step of our tuning and training process. We now need to save our best model so we can use it later on to perform inference on the relevant dataset:
This will: a) create a model directory if it doesn’t exist; b) store the model in the model folder with the name of the model (this one is called huggingoptunaface); and c) save the tokenizer and model to the model path. The special method here is
Once your script has run the model will be saved with the relevant name:
The model is stored as a bin file, and the special tokens, tokenizer settings and training config are stored as JSON documents. The vocab.txt file contains the tokenizer’s vocabulary, which comes from the pretrained model and is not changed by fine-tuning.
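The saving steps can be sketched like this. I am assuming the standard save_pretrained method on both the model and the tokenizer; the helper name save_model is hypothetical.

```python
import os


def save_model(model, tokenizer, save_dir, name_of_model):
    """Save a fine-tuned model and its tokenizer under save_dir/name_of_model."""
    model_path = os.path.join(save_dir, name_of_model)
    # exist_ok=True means nothing happens if the directory already exists.
    os.makedirs(model_path, exist_ok=True)
    # Both objects write their files into the same folder, so the pair can
    # later be reloaded from a single path with from_pretrained().
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)
    return model_path
```

Saving the tokenizer next to the model is a deliberate choice: the inference script then only needs one path to restore both.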
The full script is contained below:
Loading and using the model
Next, we will create a script that loads our fine-tuned, Optuna-optimised model and makes inferences with it.
The steps below will declare the relevant imports and load our model and tokenizer that we have fine tuned with one of the HuggingFace datasets:
We have loaded our trained model from the relevant directory. Now we will pass an example sentence through the model and prepare a function to take the text and collapse down the result:
To explain the steps:
- text is the example text we want to predict the label for
- the get_result function takes in the parameters text and message (boolean):
- passes our example text through our loaded tokenizer and returns pt tensors (PyTorch tensors); for TensorFlow use tf, and a NumPy option is also available
- passes our tokenized and encoded input to the model we have fine-tuned
- we then unpack the PyTorch tensor as a numpy array
- finally, we take the argmax of the result to get our binary label prediction
- we also take the logits, which are the raw, unnormalised class scores rather than probabilities
- if message is set to True then the class label is printed to the console
- the function returns the result (tensor) and the class label
- at the end of the script we use multiple assignment unpacking to capture the returned result and class_label
- because we set the message parameter to True we get the following returned:
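To illustrate the last part of get_result, here is a plain-Python sketch of turning a list of logits into a class label and a probability. This is not the original function: softmax and to_class_label are hypothetical helpers, and using softmax to get probabilities is my assumption about what you would want from the raw scores.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_class_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    # argmax: the index of the largest probability is the predicted label.
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]
```

For a binary classifier the logits list has two entries, so the label is either 0 or 1. Softmax does not change which class wins; it only rescales the scores so they can be read as probabilities.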
And that is it: we have fine-tuned and optimised the hyperparameters of our Electra-Small model and then loaded the model back in to make inferences with it.
The full inference script is here:
I aim to create a tutorial to guide you through training your own tokenizer, pre-training your transformer model on a collection of relevant articles and then fine-tuning this on a classification task, as this is one of the most common NLP challenges.
I hope you had fun working through this with me and keep up the good work. Now it is time to: