Build a PyTorch regression MLP from scratch

In this post we will go through how to build a feed-forward neural network from scratch with the awesome PyTorch library. This library was developed by researchers at Meta (Facebook) and is heavily used for tasks such as natural language processing. Here I will show you how to extend this to some of your more common tabular data tasks.

Building the training script

The below steps will take you through how to build the training script end-to-end.

Let’s get going!!!

What data are we using?

The data we are going to use is medical insurance data, which was posted as a challenge on the machine learning competition platform Kaggle. This platform pits ML engineers and practitioners against each other in a contest to find the most accurate model. As this was a regression challenge, the metric to optimise was the root mean squared error (RMSE).

To source the data for this tutorial directly, go to my GitHub data resource (https://github.com/StatsGary/Data/blob/main/insurance.csv). The code we are about to create also links to this file.

Within the data we see 7 columns and 1338 rows, with the fields containing:

Age of patient seeking medical insurance
Sex of patient (categorical)
Body Mass Index (BMI) – continuous variable
Number of children patient has
Smoker – categorical variable
Region of patient – categorical variable
Charges (continuous variable) and our outcome of interest

We will use the charges column as our target and try to estimate the charges for each patient requiring insurance.

One thing to note

Here we will focus more on building the model architecture and components. One thing we won’t be doing is extensive feature engineering, outlier analysis and feature selection, as these would lead to an entirely new article.

Importing the packages we will need

The following is the list of imports we are going to need:

Under custom imports we will build our own MLPRegressor model and store this in a models folder to be used across multiple projects. I will detail the folder structure needed once we are at that juncture.

Data loading and initial setup

We will load the data in and perform a few steps, as well as setting our batch size, which we will use later as a parameter in our model training step.

Here we are:

setting a variable with data_name to capture the current project – we will use this later on when we are saving our artifacts and models
creating a variable called df to store our raw GitHub user content for the insurance file
dropping any null values – there are many more ways you could treat this data, one of which is MICE, but as I said we will be focussing on building the model out and not the hundred other steps to getting the data into a better shape
getting the number of rows in the dataset
creating a batch size that is half of the number of observations – for better results in the model training you could tune this as a hyperparameter
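The steps above can be sketched roughly as follows. This is a minimal sketch and not the original script: the helper name load_insurance, the value given to data_name and the four sample rows are assumptions made for illustration (the rows are made up, and one has a missing bmi so that the null drop does something). In the real script you would pass the path or raw URL of the insurance file instead of the in-memory sample.

```python
import io

import pandas as pd


def load_insurance(source):
    """Read the insurance CSV from a path, URL or file-like object and drop nulls."""
    df = pd.read_csv(source)
    return df.dropna()


# Made-up rows in the same column layout as the insurance file (illustration only).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "21,female,26.5,0,yes,southwest,15000.0\n"
    "35,male,31.2,1,no,southeast,4200.0\n"
    "48,male,,3,no,southeast,9100.0\n"
    "52,female,24.1,2,no,northwest,11300.0\n"
)

data_name = "insurance"            # hypothetical project label used when saving artefacts
df = load_insurance(sample_csv)    # swap in the real CSV path or URL here
n_rows = df.shape[0]               # number of rows left after dropping nulls
batch_size = n_rows // 2           # half of the observations, as described above

print(data_name, n_rows, batch_size)
```

Integer division is used here so the batch size stays a whole number when the row count is odd.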

The next step is some initial feature engineering to treat the non-continuous columns, i.e. those with multiple levels and categorical descriptions.

Feature engineering and creating tensors

Following the code snippet I will take you through each one of the lines of code in a stepwise fashion:

Strap yourselves in and here we go:

cat_cols is a list of the categorical column names sex, smoker, region and children
cont_cols is a list of continuous column names age and bmi
At this point it goes without saying that you would need to adapt these list values for your own use case, as this structure can be used as a template for future projects
y is equal to our outcome column – for this example we are trying to predict the medical insurance charges for each patient
The next step is to loop through our cat_cols to set all the types to category using the astype('category') method
After we have done these steps we need to convert our data frame columns to a numpy array representation, and we will use the stack method to combine the per-column arrays into a single array – here we use a list comprehension to achieve this result.
Once we have the cats variable in an array, we can simply use torch.tensor to convert this into a torch readable tensor for processing with PyTorch
We repeat the same steps for the continuous variables to make sure we have them tensorfied (is that even a word?)
Finally, we will make sure the outcome variable is also a tensor and cast to torch.float, as the outcome will have multiple decimal places after it
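A minimal sketch of these steps is below. The column lists come from the walkthrough above, but the tiny data frame is made up for illustration, and the names y_col, cats, conts and y are my assumptions about how the variables might be called. It also assumes the arrays are stacked along axis 1, so that each row of the tensor is one patient.

```python
import numpy as np
import pandas as pd
import torch

# Made-up rows standing in for the insurance data (illustration only).
df = pd.DataFrame({
    "age": [21, 35, 48, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [26.5, 31.2, 29.8, 24.1],
    "children": [0, 1, 3, 2],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "charges": [15000.0, 4200.0, 9100.0, 11300.0],
})

cat_cols = ["sex", "smoker", "region", "children"]
cont_cols = ["age", "bmi"]
y_col = "charges"

# Convert every categorical column to the pandas category dtype.
for col in cat_cols:
    df[col] = df[col].astype("category")

# cat.codes maps each level to an integer index; stack the columns side by side.
cats = np.stack([df[col].cat.codes.values for col in cat_cols], axis=1)
cats = torch.tensor(cats, dtype=torch.int64)

# Same idea for the continuous columns, cast to float.
conts = np.stack([df[col].values for col in cont_cols], axis=1)
conts = torch.tensor(conts, dtype=torch.float)

# Outcome as a float column vector with one row per patient.
y = torch.tensor(df[y_col].values, dtype=torch.float).reshape(-1, 1)

print(cats.shape, conts.shape, y.shape)
```

The categorical tensor is kept as int64 because embedding layers look rows up by integer index.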

Set our embeddings

An embedding is a mapping of a discrete — categorical — variable to a vector of continuous numbers. In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables. Neural network embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space.

Essentially, in this context they are the values from each of the categorical feature columns mapped to embeddings, and they function slightly differently to the embeddings used in NLP models and neural nets, although they are still learned continuous vectors. To get the embedding sizes for any tabular model, you can use the following snippet:

Step by step, the following occurs:

We use a list comprehension to loop through each col in the categorical columns and take the length of the data frame's categories for that column – essentially how many levels does each category have – this we call cat_szs
Then, we use a second comprehension to set each embedding dimension to (size + 1) // 2, i.e. the number of levels plus one with an integer division by 2, capped at a maximum of 50. This magic will then work out the embeddings needed for any tabular regression problem. Be sure to note this down, as you will need the output of this for the inference step later on. Printed out, this shows the embedding sizes for our current problem, as below:

Great – we have our embedding sizes for our categorical columns, which are driven by the number of levels in each of the 4 columns. To translate this:

Sex has two values 1) male and 2) female
Smoker has two values 1) yes and 2) no
Region has four levels
Children has six levels
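The sizing rule described above can be sketched in a few lines. The function name embedding_sizes and the max_dim argument are hypothetical; the level counts are the ones listed above, in the order sex, smoker, region, children.

```python
def embedding_sizes(cat_szs, max_dim=50):
    """Pair each level count with an embedding width of (size + 1) // 2, capped at max_dim."""
    return [(size, min(max_dim, (size + 1) // 2)) for size in cat_szs]


cat_szs = [2, 2, 4, 6]            # sex, smoker, region, children
emb_szs = embedding_sizes(cat_szs)
print(emb_szs)                    # [(2, 1), (2, 1), (4, 2), (6, 3)]
```

Each tuple reads as (number of levels, embedding width), so the six-level children column is represented by a vector of three learned numbers.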

The next step is our model definition. Once we have built this model backbone, we will save it in a separate Python file and store it in our folder structure, so we can import it into all projects where the MLPRegressor is utilised.

Building our regression skeleton

I will include the whole code and then go through each step as we go along:

Define the modelling block

Let’s dive into this:

The MLPRegressor class inherits from nn.Module
In the initialisation function __init__() we will pass in the embedding sizes, number of continuous variables, the output size, the layer structure (as a list) and the drop out probability (this, in essence, drops out nodes in the network at random and helps to prevent overfitting)
We then call super().__init__() to initialise the parent nn.Module class, which sets up the tracking of parameters and sub-modules
This is followed by defining self.embeds and using nn.ModuleList to hold one embedding layer for each entry in the embedding sizes; remember we have a list here with four tuples, each containing two values (the number of levels and the embedding width). Hence, we are setting the embedding sizes for the network
The dropout layer is then defined using self.emb_drop and setting it to nn.Dropout with the required probability to drop out at random
We then batch normalise the continuous variables to bring them down to a standardised scale using nn.BatchNorm1d()
After we have defined these components we create an empty list that will collect the layers of our network
We then set n_emb to the sum of the embedding widths in our embedding sizes
This is followed by the variable n_in, which adds the number of continuous features to n_emb to get the full input width

Let’s break at this point and grab a coffee! You have earned it!

Once we know the input structure we can now add to the layers by looping through each of the layers and building a sub structure underneath – this is what the code below does:

This will add:

nn.Linear layer as the first step – which takes the current number of inputs and maps them to the size of the layer; as this is a regression model, it is related to the good old regression algorithms Galton first espoused, postulated and discovered
nn.ReLU as the activation function to use
nn.BatchNorm1d as the way of normalising each batch
nn.Dropout to drop nodes at random with the given probability
Finally, once the loop has finished, append a last linear layer that maps to the output size

The last step of the model definition is to set the self.layers variable to a sequential model (nn.Sequential), as we want to process each one of the steps in sequence. Next we define the forward propagation method for our model.

Define the forward passing setup

In the forward pass we push the inputs through the network and perform the following steps:

Initialise an empty embeddings list
Loop through the self.embeds variable and append the embedded values of each categorical column to the embeddings list
We will then concatenate the embeddings and apply the dropout layer
For the continuous variables we first apply the 1 dimensional batch normalisation
We then concatenate the embedded categorical values with the continuous values: x = torch.cat([x, x_cont], 1)
Finally, we pass x through self.layers and return the result
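Putting the definition and forward steps together, the class could look roughly like the sketch below. The names self.embeds, self.emb_drop, self.layers, n_emb, n_in and x_cont follow the walkthrough; everything else (bn_cont, layerlist, the argument names, the default p of 0.5 and the layer sizes in the demo) is an assumption on my part, so treat it as one possible reading and not the definitive implementation.

```python
import torch
import torch.nn as nn


class MLPRegressor(nn.Module):
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        # One embedding layer per categorical column: (number of levels, width).
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        self.emb_drop = nn.Dropout(p)
        self.bn_cont = nn.BatchNorm1d(n_cont)

        layerlist = []
        n_emb = sum(nf for _, nf in emb_szs)   # total embedding width
        n_in = n_emb + n_cont                  # embeddings plus continuous features
        for width in layers:
            layerlist.append(nn.Linear(n_in, width))
            layerlist.append(nn.ReLU(inplace=True))
            layerlist.append(nn.BatchNorm1d(width))
            layerlist.append(nn.Dropout(p))
            n_in = width
        # Final linear layer maps to the output size (1 for regression).
        layerlist.append(nn.Linear(n_in, out_sz))
        self.layers = nn.Sequential(*layerlist)

    def forward(self, x_cat, x_cont):
        # Embed each categorical column, then join the results along the feature axis.
        embeddings = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        x = torch.cat(embeddings, 1)
        x = self.emb_drop(x)
        x_cont = self.bn_cont(x_cont)
        x = torch.cat([x, x_cont], 1)
        return self.layers(x)


# Small smoke test with made-up inputs (three rows, four categorical, two continuous).
torch.manual_seed(0)
model = MLPRegressor([(2, 1), (2, 1), (4, 2), (6, 3)], n_cont=2, out_sz=1,
                     layers=[200, 100], p=0.4)
x_cat = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 5], [0, 0, 1, 0]])
x_cont = torch.tensor([[21.0, 26.5], [35.0, 31.2], [48.0, 29.8]])
out = model(x_cat, x_cont)
print(out.shape)
```

With these embedding sizes the first linear layer receives 1 + 1 + 2 + 3 = 7 embedding values plus 2 continuous values, so 9 inputs in total.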

In a nutshell – what we have done is built a way to handle continuous and categorical variables, in our network, with embeddings. We have defined the model structure and made it sequential and then indicated how the forward pass should push the data through it.

Saving our MLPRegressor structure

Now we have this structure built we will save this in a separate Python file called Regression.py and this will be nested inside a models directory. Your folder structure should look like this:

Once we have this we can import it into our two main Python projects using

from models.Regression import MLPRegressor

Use the model

The following steps will load the model in, and we will pass inputs to the model to get it set up:

The steps here are:

initialise a random seed value to make the results repeatable
create a model variable and pass through to our MLPRegressor model the embedding sizes, the number of continuous variables (taken from the shape of the continuous tensor), our output size (because it is a regression problem we will be outputting 1 value only), the layer sizes (passed as a list of values) and a drop out probability (if you don’t specify this a default value will be used, as this is an optional parameter)
Printing the model structure will show the layers in the network

Splitting our data

Before we move on to setting up the training loop we are going to split the data into train and test sets (you could add a validation split here as well):
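One simple way to do this split is to slice the tensors by position, as in the sketch below. The helper split_tensors, the test_frac argument and the random demo tensors are all hypothetical. Note that a positional split like this assumes the rows are not sorted in any meaningful order; if they are, shuffle them first.

```python
import torch


def split_tensors(cats, conts, y, test_frac=0.2):
    """Hold out the last test_frac share of rows as the test set."""
    n = len(y)
    test_size = int(n * test_frac)
    split = n - test_size
    train = (cats[:split], conts[:split], y[:split])
    test = (cats[split:], conts[split:], y[split:])
    return train, test


# Made-up tensors with ten rows, four categorical and two continuous columns.
torch.manual_seed(0)
cats = torch.randint(0, 2, (10, 4))
conts = torch.rand(10, 2)
y = torch.rand(10, 1)

(cat_train, con_train, y_train), (cat_test, con_test, y_test) = split_tensors(cats, conts, y)
print(len(y_train), len(y_test))
```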

Creating our training loop

This is the workhorse and something you will see in every PyTorch implementation. It defines how to train the network and perform the weight updates and optimisation steps. I will take you through this step-by-step after the code below:

Let’s step into this:

First we create our function train and define the parameters needed. The optional parameters are learning_rate, epochs and print_out_interval
We initialise criterion as a global variable, as we will need to use this later on
Then the criterion is set to Mean Squared Error loss, as this is a regression problem and we want to track the average squared error between the predictions and the actual values
For the gradient descent steps we are going to use Adam as our optimiser and pass in our model parameters and the desired rate we want the model to learn at
We then start a timer and call model.train() to put the model into training mode
Then, I initialise empty lists to store our losses and preds (predictions) at each epoch
From here we start the iteration through each epoch and perform the following steps:
- create a y_pred variable by passing the categorical and continuous training tensors to the model
- append the specific prediction to the empty list
- we compute the loss by taking the square root of the criterion (MSE) between y_pred and the training y values, which gives us the RMSE
- we then append the losses to the list
- we use the print out interval variable with the modulo operator to print out the loss every few epochs
- to clear the gradients left over from the previous step we use optimizer.zero_grad()
- we then propagate the loss backwards and use an optimizer step to update the weights
- finally, we create our print out steps to print the loss per epoch and the duration of the training
Once the training step is done, we will evaluate on the validation data. The steps taken here are:
- first we disable the gradient calculation while we pass inference examples to our model using torch.no_grad()
- then we initialise a y_val variable and pass the categorical validation tensor and the continuous validation tensor to the model
- We take the square root of the MSE loss again
- Then, print the RMSE (Root Mean Squared Error)
- create empty lists for the predictions, differences and actuals
- then we loop through the length of the tensor
- use numpy to take the absolute value (abs) of the difference between the validation item and the prediction
- get the pred from the validation prediction
- and the actuals from the y_test tensor
- we then append the diffs, preds and actuals to the respective empty lists
- out of the loop we create a dictionary to store the predictions, differences and actuals
finally, we save the model using the model.state_dict() to our model_artifacts folder and return the losses, preds, diffs, actuals, model, valid_results_dict and epochs to be used later on in the training script.
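A trimmed sketch of such a training function is shown below. It keeps the core of the steps above (MSE criterion, Adam, RMSE loss, zero_grad, backward, step, validation under no_grad) but leaves out the prediction bookkeeping and the model saving. TinyTabularNet is a hypothetical stand-in so the example runs on its own; in the tutorial you would pass the MLPRegressor instead. The default values and the model.eval() call before validation are my additions, not something the walkthrough states.

```python
import time

import torch
import torch.nn as nn


class TinyTabularNet(nn.Module):
    """Hypothetical stand-in for MLPRegressor: one embedding plus one linear layer."""

    def __init__(self, n_levels, n_cont):
        super().__init__()
        self.embed = nn.Embedding(n_levels, 2)
        self.linear = nn.Linear(2 + n_cont, 1)

    def forward(self, x_cat, x_cont):
        return self.linear(torch.cat([self.embed(x_cat[:, 0]), x_cont], 1))


def train(model, y_train, cat_train, cont_train, y_val, cat_val, cont_val,
          learning_rate=0.001, epochs=300, print_out_interval=2):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    start_time = time.time()
    model.train()
    losses = []
    for epoch in range(1, epochs + 1):
        y_pred = model(cat_train, cont_train)
        loss = torch.sqrt(criterion(y_pred, y_train))    # RMSE
        losses.append(loss.detach())
        if epoch % print_out_interval == 0:
            print(f"epoch: {epoch:3}  loss: {loss.item():10.4f}")
        optimizer.zero_grad()    # clear gradients from the previous step
        loss.backward()          # back-propagate the loss
        optimizer.step()         # update the weights
    print(f"Duration: {time.time() - start_time:.1f} seconds")

    model.eval()                 # my addition: switch off dropout / batch-norm updates
    with torch.no_grad():
        y_val_pred = model(cat_val, cont_val)
        val_rmse = torch.sqrt(criterion(y_val_pred, y_val))
    print(f"Validation RMSE: {val_rmse.item():.4f}")
    return losses, val_rmse.item()


# Made-up data: the target is simply the sum of the two continuous features.
torch.manual_seed(0)
cat = torch.randint(0, 3, (20, 1))
cont = torch.rand(20, 2)
y = cont.sum(dim=1, keepdim=True)
model = TinyTabularNet(n_levels=3, n_cont=2)
losses, val_rmse = train(model, y[:16], cat[:16], cont[:16], y[16:], cat[16:], cont[16:],
                         learning_rate=0.05, epochs=50, print_out_interval=10)
```

Taking the square root of the MSE inside the loop means the optimiser works on RMSE directly, which keeps the printed loss in the same units as the charges.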

Using our training loop

Using multiple assignment we will store each one of the outputs of our train function:

Here we pass in the model, y_train (training of the outcome variable), categorical training variables, continuous training variables, this is repeated for our validation data, select a learning rate, the number of epochs and the print_out_interval. When this is triggered – the script will output the below:

View our predictions vs actuals

Next, we will visualise where our model is performant and where it is way off the mark:

This will store our results in a pandas DataFrame and then visualise the data in a scatter chart:

This is a difficult dataset, as there are many outliers, and there appear to be bands of people with different levels of medical insurance charges. This could be indicative of the type of procedure they needed the medical insurance for. You could treat the outliers and repeat the training – I will leave this to you to perfect, as the aim of this post is to show how to use PyTorch to create a regression model.

Produce our model training graph

We will now see how well the model training performed:

This step uses a list comprehension to iterate through the losses list we created in our training loop, extracts every .item() (i.e. the loss value) from the list and calls the new list losses_collapsed. We do a similar comprehension to get the number of epochs and then create a pandas DataFrame.

We then save the data to csv and create the seaborn chart. The chart looks as below:

This shows our model is still learning after 400 epochs, as the loss is still on the decline. We could extend the epochs further to reduce the model loss.
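The comprehension steps described above can be sketched as follows. The four loss values are made up, and the names epochs and loss_df are assumptions; losses_collapsed is the name used in the text.

```python
import pandas as pd
import torch

# Made-up loss tensors standing in for the list returned by the training loop.
losses = [torch.tensor(v) for v in (5.0, 3.0, 2.0, 1.5)]

# Pull the plain Python float out of every loss tensor.
losses_collapsed = [loss.item() for loss in losses]
# One epoch number per recorded loss, starting from 1.
epochs = [epoch + 1 for epoch in range(len(losses))]

loss_df = pd.DataFrame({"epoch": epochs, "loss": losses_collapsed})
print(loss_df)

# From here the frame could be written out with loss_df.to_csv(...) and
# plotted with, for example, seaborn's lineplot(data=loss_df, x="epoch", y="loss").
```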

We now have our training script in place. The full code is captured here:

Building our model inference script

Now we are going to use our trained model to infer from our production data. Here we will pass through multiple examples from our production medical insurance dataset. In real life, this dataset would contain new values that we want to estimate the medical insurance cost for. Moreover, we would not know the actual cost, as we are trying to make new predictions.

Feature engineering

We will repeat the same steps as the previous example, with one slight change of loading in the production medical insurance dataset:

This next step is important. We need to specify a list of tuples with the same embedding sizes as in the previous training script. If we tried to work these out dynamically in the inference script, the production data could contain fewer levels in our categorical columns and there would be a shape mismatch. I have hard coded this, but you could import it from a text file, JSON file or similar. Again, these embedding sizes should match our training embedding sizes for the network to work correctly:
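If you prefer the JSON route over hard coding, a minimal sketch could look like this. The file name emb_szs.json and the temporary directory are assumptions for the demo; in practice you would write the file next to your model artefacts in the training script and read it in the inference script.

```python
import json
import os
import tempfile

# Embedding sizes as produced in the training script.
emb_szs = [(2, 1), (2, 1), (4, 2), (6, 3)]

# Training side: write the sizes to disk (temporary folder used for the demo).
path = os.path.join(tempfile.mkdtemp(), "emb_szs.json")
with open(path, "w") as f:
    json.dump(emb_szs, f)

# Inference side: JSON has no tuple type, so convert each pair back to a tuple.
with open(path) as f:
    emb_szs_loaded = [tuple(pair) for pair in json.load(f)]

print(emb_szs_loaded)
```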

Right, we have everything in place, such as our categorical and continuous values encoded and converted to PyTorch tensors and our embeddings have been translated from the training script to match the shapes. Please note – for your own dataset – this would need to be updated to match the shape of your categorical and continuous values.

Load and use our saved model

In the next steps we will load our saved model, with the same parameters we used for training and then load the state_dict() from the model_artefacts folder (this could be anything you like, I just called it model artefacts):
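The save and load round trip can be sketched as below. The make_model helper and its small nn.Sequential network are hypothetical stand-ins so the example runs on its own; in the tutorial this would be MLPRegressor built with the same arguments as in training. The file name and temporary folder are also assumptions.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Stand-in architecture; both sides must build the model the same way.
    return nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))


# Training side: save only the learned weights (the state_dict).
torch.manual_seed(0)
trained = make_model()
path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(trained.state_dict(), path)

# Inference side: rebuild the same structure, then load the weights into it.
model_infer = make_model()
model_infer.load_state_dict(torch.load(path))
model_infer.eval()   # switch to evaluation mode before predicting
print(model_infer)
```

Because only the weights are stored, the structure has to be rebuilt with matching shapes before load_state_dict is called, otherwise PyTorch raises a size mismatch error.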

Printing model_infer.eval() will show the original saved model structure:

Here you can see the importance of our embeddings matching, otherwise the model will throw a wobbly!

Define function to process our prod data

I will explain this function in more detail underneath the code:

Let’s break this down:

model takes in the model that we loaded the state_dict() into
setting torch.no_grad() disables gradient calculation, which is not needed for inference
setting y_val to the output of the model, passing in our categorical and continuous PyTorch tensors
we then create an empty preds list to store our results
this is followed by a loop through all the production items in the dataset, where I / we:
- get the length of the production items passed through the model
- get each .item() from the torch tensor
- append each prediction to our empty preds list and do this incrementally until the loop has finished
- a boolean variable called verbose indicates whether you want to print the result
after all this, the only return from the function is the preds list
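A function along these lines might look like the sketch below. The name prod_data, its argument names and the TinyTabularNet stand-in model are hypothetical, and the production rows are made up; in the tutorial the model passed in would be the loaded MLPRegressor.

```python
import torch
import torch.nn as nn


class TinyTabularNet(nn.Module):
    """Hypothetical stand-in for the loaded MLPRegressor."""

    def __init__(self, n_levels, n_cont):
        super().__init__()
        self.embed = nn.Embedding(n_levels, 2)
        self.linear = nn.Linear(2 + n_cont, 1)

    def forward(self, x_cat, x_cont):
        return self.linear(torch.cat([self.embed(x_cat[:, 0]), x_cont], 1))


def prod_data(model, cat_prod, cont_prod, verbose=False):
    """Run the model over the production tensors and return predictions as floats."""
    model.eval()
    with torch.no_grad():                 # no gradients needed for inference
        y_val = model(cat_prod, cont_prod)
    preds = []
    for i in range(len(y_val)):
        result = y_val[i].item()          # plain Python float for this row
        preds.append(result)
        if verbose:
            print(f"Prediction {i}: {result:.2f}")
    return preds


# Made-up production rows: one categorical column, two continuous columns.
torch.manual_seed(0)
model_infer = TinyTabularNet(n_levels=3, n_cont=2)
cat_prod = torch.tensor([[0], [2], [1]])
cont_prod = torch.tensor([[21.0, 26.5], [45.0, 31.2], [60.0, 24.5]])
preds = prod_data(model_infer, cat_prod, cont_prod, verbose=True)
```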

Running the function we get the below print outs:

Some of these predictions are dubious, as the outliers and differences in medical insurance costs are obviously confusing the model. To make this model more useful, I would suggest creating a piecewise regression model to deal with the different levels, along with treatment of the outliers.

The full code for the inference script is here:

We have reached the end!

Wow – congratulations on getting this far. We have covered so much content in this tutorial.

Feel free to adapt the code and create a pull request to the GitHub repository if you want to add or adapt the code in any way. Remember – the aim of this tutorial was to show how you can create a regression network in PyTorch, and not to go way in depth in the billion ways you can encode features before modelling – that would be its own tutorial.

Learning PyTorch can feel harder than TensorFlow, as it is very pythonic and requires you to build classes. However, once you get used to it the tool is very powerful, and it is what I mostly use in my work with natural language processing at my company.

You have done well and keep on coding!