Fine tune a Dreambooth model for image generation using Stable Diffusion with PyTorch

I recently gave a talk at Leeds Data Science on how to fine tune a Stable Diffusion model with Google’s Dreambooth method, so that the model learns a new concept and can generate interesting images of it. The talk was well received (see: ), so I thought, “shall I do a blog to aid everyone?”. The answer lies in what follows.

What is Stable Diffusion?

There are various papers that cover the important concepts of Stable Diffusion, but in essence the process is as follows. We take an input image and encode it into a compressed (latent) representation; add random noise to that representation; pass the noisy latents through a U-Net, which down-samples and up-samples them to pick out the relevant features and predict the noise to remove; and finally use the decoder of a variational autoencoder (VAE) to turn the denoised latents back into an image. When we say the decoder paints the image, we mean that it guesses (predicts) what it thinks the image is from the latents it is given. This process is not one model, it is several, all working together in harmony.

The diagram below shows the general flow of the diffusion process:

Before we step into a high level description, I will break down some important parts of the diagram above:

  • Our input image (x) is the first step of the process
  • We then pass the image through the encoder (ε)
  • The encoder converts the image into a numerical representation, which is stored in a latent vector (z)
  • The diffusion process is responsible for adding random noise to this latent, i.e. noising the image. This is an important step: without it, the decoding step would just predict the same image. The noise’s job is to try and fool the decoder at the other end, and this is where the diffusion model gets creative. At each noising step we sample the next latent from the previous one (z_{t-1}), represented as q(z_t | z_{t-1}), until we have added all the noise
  • The resultant vector is our noisy latent, given by z_T
  • Once we have the noisy latents from the encoder process (top left to right on the diagram), we enter the conditioning phase. I will explain the CLIP architecture in a section after this
  • We then use τ_θ (a Transformer) for text/image mapping
  • We use a pretrained denoising U-Net with skip connections and attention heads
  • In the denoising U-Net step we use reverse diffusion to remove the noise from the latents over a number of timesteps. The reverse diffusion is written p_θ(z_{t-1} | z_t): at each step the model takes the current noisy latent z_t and predicts the slightly less noisy latent z_{t-1}
  • Then, stepping outside of the denoising U-Net box on the diagram, we have recovered a denoised latent vector (z), which is a new representation of the image in vector form
  • This latent is then passed through D, the decoder of the variational autoencoder, which undoes the perceptual compression
  • Finally, we get x̂ (x hat), i.e. our predicted image from the decoding step
  • We do this many times and start to get some groovy looking images
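
The noising steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than Stable Diffusion's own code: the function names, the toy four-value latent and the linear schedule from 1e-4 to 0.02 over 1000 steps (the defaults from the original DDPM paper) are all assumptions made for the example. It relies on the standard result that chaining the individual noising steps lets us jump from the clean latent z_0 straight to any z_t.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, increasing linearly (illustrative values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t, the share of signal left at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(z0, alpha_bar_t, rng):
    """Sample z_t from z_0: sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [signal * z + sigma * rng.gauss(0.0, 1.0) for z in z0]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

z0 = [0.5, -1.0, 2.0, 0.0]  # a toy four-value "latent" (assumption for the example)
z_mid = noise_latent(z0, alpha_bars[499], rng)  # part signal, part noise
z_T = noise_latent(z0, alpha_bars[-1], rng)     # almost pure noise
```

By the final step almost none of the original signal is left, which is why the reverse (denoising) process has so much freedom to be creative.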

In the next section I will delve into how CLIP, the VAE and the U-Net work.

Let’s explore some of the models we use in this process

Stable Diffusion is what is known as a Latent Diffusion Model (LDM): the diffusion process runs on the compressed latents rather than directly on the image pixels.

OpenAI’s CLIP is responsible for the mapping of text prompts to the representative images (this is the conditioning phase of the diagram above). The general process is described below:

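As you can see, CLIP has a text encoder and an image encoder, and every text embedding is compared with every image embedding. The matching text and image pairs sit on the diagonal of the resulting matrix, and this gives us the ability to link our text to images. The U-Net works as below: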

It takes the text embedding, the random noise (latents) and the relevant timestep, as Stable Diffusion models everything based on the timestep (starting from the fully noised latent z_T). This way, the model can pick out new features every time we pass the latents through a denoising step.

The last piece of the puzzle is the image painter, the variational autoencoder, which is responsible for taking the encoded information and making a prediction of the image. The cool part is that the diffusion process, combined with the random noise we add, changes the image slightly at each timestep, so after many steps we end up with something new. Furthermore, because the model uses multiple attention heads over the feature maps of the image, each head learns what to focus on through its queries, keys and values. Read about how ViT and Transformers work to understand this concept.
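
To make the CLIP matching described earlier more concrete, here is a small sketch of the text-to-image similarity matrix. The three-dimensional embeddings are made up for illustration (real CLIP embeddings have hundreds of dimensions and come from its trained text and image encoders), and the function names are hypothetical. Each row is a caption, each column is an image, and the matching pairs sit on the diagonal.

```python
import math


def normalise(vector):
    """Scale a vector to unit length so a dot product becomes a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(text_embeddings, image_embeddings):
    """Cosine similarity of every caption (rows) with every image (columns)."""
    texts = [normalise(t) for t in text_embeddings]
    images = [normalise(i) for i in image_embeddings]
    return [[sum(a * b for a, b in zip(t, i)) for i in images] for t in texts]


# Made-up embeddings (assumption): caption k describes image k.
text_embeddings = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
image_embeddings = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.1]]

sims = similarity_matrix(text_embeddings, image_embeddings)

# For each caption, the most similar image should be the one on the diagonal.
best_image_per_caption = [row.index(max(row)) for row in sims]
```

During CLIP's training the diagonal entries are pushed up and the off-diagonal entries pushed down, which is what later lets a text prompt steer the image generation.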

What does Dreambooth do?

Dreambooth is an extension to the methods we have described above, with the overall aim of generating a personalised text-to-image model that can be used for later inference. The HuggingFace visual (credits to HuggingFace) shows this process:

Here we take a small sample of images, together with the class name of the object we want to generate. This can be anything from landscapes, to people, to cartoon characters, to styling new shoes, you name it. We take a pretrained text-to-image model, do some fine tuning (i.e. I want to use my dog in this image) and give the concept a unique identifier, V. At this point, you get a personalised text-to-image model. All there is to do then is engineer the prompt to start the generation process.

Put simply…

All you need to do is decide on your concept:

Once you have your concept in mind, we can go through how you want to store your images, how to load them in with the scripts provided with this tutorial and how you will then fine tune your model.

Pushing our images to HuggingFace Hub

The first step is to create an image folder locally and store the concept images you want to use in your fine tuning. I used pictures of fjords to generate weird and wacky landscapes, but you could use your favourite pet, or your child. Essentially, put the images in a folder and upload them using the Python script below:

This will upload our images to our HuggingFace account, under Datasets.
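
If you want a feel for what an upload script of this kind involves, here is a minimal sketch using the datasets library. It is not the script from the supporting repository: the folder path, the repository name and both function names are placeholders, and it assumes you have already logged in with your access token (for example with huggingface-cli login).

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_images(folder):
    """Return the image files directly inside `folder`, sorted by name."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.suffix.lower() in IMAGE_SUFFIXES
    )


def push_image_folder(folder, repo_id):
    """Upload the images in `folder` to the Hub as a dataset called `repo_id`."""
    # Imported here so find_images() can be used without `datasets` installed.
    from datasets import load_dataset

    if not find_images(folder):
        raise ValueError(f"No images found in {folder}")

    # The built-in "imagefolder" loader turns a folder of images into a dataset
    # with an "image" column.
    dataset = load_dataset("imagefolder", data_dir=str(folder), split="train")
    dataset.push_to_hub(repo_id)
    return dataset


# Example call (placeholder names, requires a logged-in HuggingFace account):
# push_image_folder("images/fjords", "your-username/fjord-images")
```

Checking the folder first is a small design choice: it gives a clear error locally instead of an empty dataset on the Hub.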

Please bear in mind you must have a HuggingFace account and have registered your access token. This is simple and can be done here:

To view your images, navigate to Datasets in HuggingFace and you will see them loaded:

Get your images from HuggingFace Hub

Once we have published the images, we need to pull them back into our project for later use. The way to do this is shown below:

This pulls the images from HuggingFace, using the load_dataset function to select the training split of my images.

Visualise your images

I have created a supporting Dreambooth module that should be used with this project, as it simplifies a lot of the training code and extra utilities you will need to fine tune your model.

To create an image grid, all you need to do is load in the function from the dreambooth module, and it will allow you to visualise your images:

This will create an output, for your use case, as below:
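
For the curious, a grid helper of this kind needs very little code. The sketch below uses Pillow and is an assumption about how such a function could look, not the implementation in the dreambooth module; the name image_grid and its arguments are illustrative.

```python
from PIL import Image


def image_grid(images, rows, cols):
    """Paste equally sized PIL images into a rows x cols contact sheet."""
    if len(images) != rows * cols:
        raise ValueError("Number of images must equal rows * cols")

    width, height = images[0].size
    grid = Image.new("RGB", size=(cols * width, rows * height))
    for index, image in enumerate(images):
        # Fill the grid left to right, then top to bottom.
        column, row = index % cols, index // cols
        grid.paste(image, box=(column * width, row * height))
    return grid


# Usage with six solid-colour stand-ins for the concept images.
colours = ["red", "green", "blue", "white", "black", "yellow"]
samples = [Image.new("RGB", (64, 48), colour) for colour in colours]
grid = image_grid(samples, rows=2, cols=3)
```

With a dataset pulled from the Hub you would pass in the first few images from its image column instead of the coloured stand-ins.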

Fine tuning the model

The training script does all the hard work for you, but I thought it worth taking you through it step by step to cement the knowledge.

You will need to clone the supporting GitHub repository to allow you to work with the next set of examples. The repository with the dreambooth module you need can be found here: and you can pull it into your environment by using git clone.


  • Make sure you have a GPU to perform this training
  • Create a virtual environment
  • Install the dependencies from the requirements.txt file in the GitHub repo into your virtual environment using pip install -r requirements.txt
  • This will make sure you have all the dependencies you need. It also installs bitsandbytes, which is used for speeding up the training process. Refer to the bitsandbytes repository for any issues you may encounter with this package. It worked on my machine, but I know that if you have two GPU drivers installed the code will error, so make sure this is resolved before running the script

Once you have cloned the repository, open the training script, as this is the first thing you need to run.

Get your imports needed

As with all Python projects, the first stage is to gather your imports into your Python project.

Bring in your dreambooth_param.yml file

All of this code relies on the settings in the supporting config file. The config file has two parts, a train config and an evaluation config, so before fine tuning simply tweak the config with your settings and hit run on the training script.

What does the config encompass?

This allows you to alter the model backbones for Stable Diffusion and CLIP, but these don’t need to be changed. What does need changing is the image store options and your output directories; the rest of the parameters can remain the same.

The Python train file loads this config and sets project variables to the config that has been passed to the script:

Parameters such as the learning rate and max_train_steps are our hyperparameters, which we can tweak to get better model results.
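
As an illustration of the idea, the sketch below parses a small YAML document with a train and an evaluation part and reads the hyperparameters from it. The key names, values and layout are hypothetical and will not match dreambooth_param.yml exactly, so check the real file in the repository. It assumes PyYAML is installed.

```python
import yaml

# Hypothetical config layout (assumption): the real dreambooth_param.yml may differ.
EXAMPLE_CONFIG = """
train:
  model_id: CompVis/stable-diffusion-v1-4
  output_dir: my-dreambooth-model
  learning_rate: 2.0e-06   # PyYAML needs the decimal point to read this as a float
  max_train_steps: 400
  gradient_accumulation_steps: 8
  gradient_checkpointing: true
evaluation:
  prompt: a photo of my concept at sunset
  num_inference_steps: 50
"""

config = yaml.safe_load(EXAMPLE_CONFIG)

# Pull the two parts apart, then set project variables from them.
train_config = config["train"]
eval_config = config["evaluation"]

learning_rate = train_config["learning_rate"]
max_train_steps = train_config["max_train_steps"]
output_dir = train_config["output_dir"]
```

To read a file on disk rather than a string, open it and pass the file handle to yaml.safe_load in the same way.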

Project setup steps

In the next set of steps we load in the dataset, set our concept name, create an instance prompt and then load in our tokenizer:

Loading our data with a PyTorch dataloader

The dataloader is the next step. It takes the images, creates batches and applies the preprocessing steps, such as resizing, scaling and conversion to tensors:

In the training script, you simply create your training dataset, here train_dataset, which takes your images, the instance_prompt and the text-to-image tokenizer:

I have actually hidden some of what is going on here for you, so if you want to skip to the next heading, then go for it, otherwise stick around. The DreamBoothDataset resides in the dreambooth module to simplify the process for you, but I will include the method for those that want to know:

The __init__ block takes in the parameters dataset, instance_prompt, tokenizer and the image size (size), which defaults to 512. It does the following:

  • Instance variables are set for each of the inputs
  • self.transforms applies the preprocessing to the image, such as resizing, centre cropping, normalising and converting the image to a tensor, i.e. a numerical array indexed by colour channel (RGB), height and width

That is all the class initialisation block does. The two dunder methods do the rest: __len__ just gives the length of the dataset, and __getitem__:

  • Creates an empty example Python dictionary
  • Gets the image at the index passed
  • Does a transform on the image
  • Tokenizes the instance prompt passed, e.g. a picture of a fjord
  • Truncation is set to True so that a prompt longer than the maximum length is cut off
  • The max_length is set to the tokenizer’s model_max_length
  • The example gets returned from the method

Load in pretrained models

As we went through in depth earlier, Stable Diffusion is a multi-model approach, so we need to load in each of the relevant components. Luckily, HuggingFace has all the pretrained models that you need for the task, and this is what is happening in the next step:

Now let’s examine the training step, in the next section.

Training the model

The fun bit, we hit train, and watch the epochs count down.

This is a memory intensive modelling process, so you will need bitsandbytes to be working properly. If errors arise when training, go to this repository and have a look at similar Issues people have had:

Training is simple with the script I have provided, as I have created the PyTorch training loop for you, which involves the creation of the noise scheduler and all the other components. If you are so inclined, check out the dreambooth.train module.

This training script takes in all the parameters from the config file, therefore you should not need to change anything in the training loop. A couple of tips:

  • To lower memory use, reduce the batch size; gradient_accumulation_steps can then be raised to keep the effective batch size the same
  • gradient_checkpointing should be set to True, which trades some extra compute for lower memory use
  • Use smaller sample batch sizes, as the processing of these models is intensive. It maxed out my Tesla T4 instance in GCP when I was working on this project
  • output_dir is the name under which the model will be serialised locally

The rest are all regular neural network hyperparameters you will have seen before, such as the learning rate, i.e. the rate at which the model weights are updated.
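
To give a feel for what a loop of this kind optimises, here is a heavily simplified sketch of the noise-prediction objective in plain PyTorch. It is not the loop from dreambooth.train: a tiny convolutional network stands in for the U-Net, random tensors stand in for the VAE-encoded images, the text conditioning is left out entirely, and the schedule values are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

# Stand-in for the U-Net (assumption): the real one also receives the timestep
# and the text embeddings.
noise_predictor = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv2d(16, 4, kernel_size=3, padding=1),
)
optimizer = torch.optim.AdamW(noise_predictor.parameters(), lr=1e-3)

# Share of the original signal left at each of 1000 timesteps (linear schedule).
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)


def training_step(latents):
    """One optimisation step: add noise to the latents, then learn to predict it."""
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, len(alpha_bars), (latents.shape[0],))
    signal = alpha_bars[timesteps].view(-1, 1, 1, 1)
    noisy_latents = signal.sqrt() * latents + (1.0 - signal).sqrt() * noise

    loss = F.mse_loss(noise_predictor(noisy_latents), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Random stand-ins for two VAE-encoded 512x512 images (4 channels, 64x64).
latents = torch.randn(2, 4, 64, 64)
losses = [training_step(latents) for _ in range(5)]
```

In a real run the latents come from the VAE encoder and the prediction is conditioned on the tokenized instance prompt, but the loss being minimised has this same shape.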

The full script for the model training is included here:

To set the model running, once the config is updated, run the training script with python in your VS Code terminal, or any terminal, and you will see the training process commence.

Push the model to the hub

When the training has finished, we can push the model to the HuggingFace Hub, so that other people can see and play with your model. The steps to do this are contained in the code below:

Here, all you need to do is add a description for your model card in HuggingFace, although you can edit the file later on. The config takes care of the rest for you, as it picks up where the trained model is stored and loads it.

After running the script in the repository your model will appear on HuggingFace, in your account:

Inferencing the model on the hub

You will now have a model on the hub that can be used to perform Text-to-image inferencing, see screenshot:

This allows you to create all sorts of new images. There is also a script version of this and you can find out how to use that on the README of the supporting repository:

Here is an example of some of the fun things I have generated with my model, which had no knowledge of what a fjord in Norway looked like before:

Where to get the code?

If you haven’t guessed already, as the repository has been linked several times, the code for this tutorial can be found here:

I hope you have found this tutorial useful. The approach certainly helped me rank second in the landscape category of the Dreambooth competition that HuggingFace ran towards the end of 2022 and the start of this year:

Although the competition is over, you can still have fun generating your own images.

