How diffusion models work - explanation and code!


Chapters

0:00 Introduction
0:46 Generative models
3:51 Latent space
7:35 Forward and reverse process
9:00 Mathematical definitions
13:00 Training loop
15:05 Sampling loop
16:36 U-Net
18:31 Training code
19:28 Sampling code
20:34 Full code

Transcript

Hello guys, welcome to my new video about the diffusion models. In this video I will be introducing the diffusion model as described in the original paper, the DDPM paper, and I will show you the structure of these models, how they work, why they were invented in the first place, and also we will go inside some code to see how they are actually implemented.

And we will see that the code is actually very simple, even if the math and everything there looks very hard, and I will also not go very much into the math because I feel like this video is more about teaching concepts and how to actually work with these models instead of teaching the math derivations, which you can find also online or in the paper.

So let's start by reviewing why we needed the DDPM model in the first place. Before we had all these fancy models like the GAN and the DDPM etc, we basically had simple models like the autoencoder. And the autoencoder had a very simple task, that is to compress data. So if we have some data and we run it through the encoder, the encoder will transform the data into some smaller representation of the data.

And if this data, this code is passed through the decoder, hopefully it will rebuild the original data. So for example, if we have a picture of a tomato and we run it through the encoder, it will be converted into a vector of numbers that represents that particular picture. And if we run this vector into the decoder, hopefully it will build up the original image.

And if we run multiple images into an autoencoder, each of them will have an associated code, different for each image. However, there was a problem with these autoencoders, that is, the autoencoder did not catch any semantic relationship between the data. So for example, the code associated with the picture of the tomato, or the code associated with the picture of the zebra, maybe they were very similar, even if the two images, they have no semantic relationship.

Because the autoencoder was never told to learn this semantic relationship between the data. Its only job was to compress data, and it was pretty good at it. But of course, we wanted to learn some representation. So this code, we wanted to transform into a latent space. And that's why we introduced the variational autoencoder.

In the variational autoencoder, we don't just learn how to compress data, we actually learn a latent space, which is basically parameters of a multivariate distribution, in such a way that this latent space actually catches also some relationship between the data. For example, the code associated with the picture of the tomato, and the code associated with the picture of the egg, maybe they are similar to each other, at least more similar compared to the picture of the egg and zebra, for example.

And the most important property of a latent space is that we can sample from it, just like we sample from a Gaussian distribution. And if we sample, for example, something from this part of the space, hopefully we will get a picture of food. And if we sample, for example, here, hopefully we will get a picture of animals.

And if we sample from here, hopefully we get a picture of a car, etc. So the most important property of these latent spaces is that we can sample from it to generate new data. So why is it called a latent space, this representation? Because basically, we model our data as x, as a variable x, which is conditioned on another variable z that we cannot observe, but we want to infer some properties about it.

If we model the variable z as a Gaussian, we want to learn its mean and its variance. Let me give you some more concrete examples on this latent space. I will use Plato's allegory of the cave for that purpose. In this allegory, we have to imagine that there are some people, these people here, who were born and have lived all their life in this small section of a cave.

And these people cannot leave the cave, they are chained in it. And these people observe some objects on these walls, and they believe this is reality. So for them, the horse is something black that moves like this, and the bird is something black that moves like this, etc. However, we know, as external observers, that this is not the real reality.

This is actually the projections through this fire of these real objects. So for example, these people, they can see the real objects, right? So basically, we have to think that our data is the only variable that we can observe. And this variable is conditioned on another variable that we cannot observe.

And this variable is hidden. So that's why it's called the latent. Latent means hidden. Now, this was true for the variational autoencoder. For diffusion model, we have to go deeper. And I will tell you why. Imagine these people here, they believe that they hold the true objects, right? But imagine these people themselves, they didn't hold the true objects, but they were themselves prisoners of a cave, and they were watching some projection of some real objects.

So they were just like these people. That is, we start from some people who can observe the real object, okay? So this is the real object, and we will say it's time step zero. And these people project this real object to some other people inside an inner cave. So these people here, they think they are seeing the real object, but actually, they are watching a projection of the real object.

So it's more noisy, just like through the fire, we projected the shadows, which is a more noisy version of the real object, right? And these people themselves, they actually projected to some other people inside an inner cave. So it becomes a noisy version of something that was already noisy, so an even noisier version.

And we do it again and again and again for 1000 steps, until it becomes pure noise. The last step is called capital T. This process of noisification is called the forward process. And then we also want the reverse process, that is, if we have some noise, can we infer something about the object that was noisified?

So for example, if we are here at the last time step, can we get some information about the previous time step, which is T capital minus one, but in this case is 500. Okay. And these people, of course, they also want to infer something about the object that projected the one they are watching, and these people, etc, etc.

And we do this for 1000 time steps, each step can only watch the previous one. And each noisy version comes from a previous noisified version. So this is the forward process. And this is the reverse process in blue. The forward process is quite easy, because we can always add noise to something.

For example, you can give the picture of the Mona Lisa to a three-year-old, and he or she will add all the noise that you want. However, the reverse process is hard, because we want to remove noise from something and observe the real object. And because it's hard, we will train a neural network to do it.

So mathematically, we have some real data that we call x zero, and this x zero is conditioned on a latent variable z one, which is itself conditioned on a variable z two, which is itself conditioned on another variable z three, and so on along this chain of conditioning until the last variable, which is pure noise.

And the process of noisification is called the forward process. And the process of denoisification is called the reverse process. And the forward process, as I said before, is fixed. So we know how to go from less noise to more noise. But we don't know how to go from more noise to less noise.

That's why we will train a neural network to do it. Another thing to notice is that with time step zero we indicate the original image, and with time step capital T we indicate pure noise. So a higher number means more noise, and a smaller number means less noise. Now we need to, of course, look at some maths. I will try to avoid any derivation and teach the concept behind the math, because this way you can also read the paper and follow it easily.

Even if you don't understand each step, you will actually grasp the meaning of all the parts described in the paper. That's why I'm actually quoting the paper itself. We start with the forward process. This is the original paper, the DDPM paper from Ho and the other authors, which was released in 2020.

And the forward process is called Q. And as you can see, the forward process, unlike the reverse process that we will see later, has no parameters. So it's not Q of theta, whereas the reverse process is P of theta. Q has no parameters to learn because it's fixed; we decide the parameters for it.

It's not like the neural network has to learn anything about it. And basically, they describe how to go from a less noisy version, so a smaller number, to a more noisy version, a bigger number, okay. And they model it as steps of a chain, which is a Markov chain of Gaussian variables, in which we know the mean and the variance of each one of them.

And the mean is this one, so the square root of 1 minus beta t multiplied by the previous version, and we also know the variance, which is beta t. Now this beta is fixed for each time step; we decide it. And the sequence of betas is called a schedule. And then we also have the reverse process P.
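The forward step described above can be sketched in a few lines of Python. This is only an illustration, not the code shown in the video: the names `linear_beta_schedule` and `forward_step` are made up for this sketch, the "image" is a plain list of four numbers, and the linear schedule from 1e-4 to 0.02 over 1000 steps is the one reported in the DDPM paper.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas; the default endpoints are the DDPM paper's values."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(x_prev, beta_t, rng):
    """Sample x_t from a Gaussian with mean sqrt(1 - beta_t) * x_prev and variance beta_t."""
    mean_scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    return [mean_scale * v + std * rng.gauss(0.0, 1.0) for v in x_prev]

rng = random.Random(0)
betas = linear_beta_schedule(1000)
x = [1.0, -1.0, 0.5, 0.0]      # a toy "image" with four pixels
for beta_t in betas:           # apply all T = 1000 noising steps in sequence
    x = forward_step(x, beta_t, rng)
```

A real implementation would do the same arithmetic on image tensors; lists are used here only so that the sketch runs without any library.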

As I said before, this P has a theta here, because we want to learn the parameters of this reverse process. So this basically means that from a more noisy version, if we want to go to a less noisy version, we want to learn this mean and this covariance matrix, because it's also modeled as a Gaussian variable, and actually as a Markov chain of Gaussian variables.

Another interesting property of the forward process is that, because it's fixed, we can always go from the original image to the image noisified at the time step t, whatever t is, without doing all the intermediate steps, just with one step, that is, using this formula here. And alpha t is basically 1 minus beta t, so since beta t is defined, alpha t is also defined.
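As a rough sketch of this one-step jump, the DDPM paper's closed form is x_t = sqrt(alpha-hat_t) * x_0 + sqrt(1 - alpha-hat_t) * noise, where alpha-hat_t is the running product of the alphas up to step t. The function names `cumulative_alphas` and `noisify` are invented for this example, and the linear schedule is an assumption carried over from the paper rather than from the transcript.

```python
import math
import random

def cumulative_alphas(betas):
    """Return the running products of alpha_t = 1 - beta_t, one per time step."""
    products, running = [], 1.0
    for beta_t in betas:
        running *= 1.0 - beta_t
        products.append(running)
    return products

def noisify(x0, t, alpha_hats, rng):
    """Jump from x_0 straight to x_t (t is 1-based) and return (x_t, noise)."""
    a = alpha_hats[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, noise)]
    return x_t, noise

# Assumed linear schedule with T = 1000 (values from the DDPM paper).
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
alpha_hats = cumulative_alphas(betas)
x_t, noise = noisify([1.0, -1.0, 0.5, 0.0], t=500, alpha_hats=alpha_hats,
                     rng=random.Random(0))
```

Because the running product shrinks toward zero as t grows, the weight on the original image fades and the last step is almost pure noise.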

And alpha t with the hat (written as alpha bar t in the paper) is the product of all the alphas from 1 to the time step t. And how do we actually train a neural network to model our reverse process? Basically, we do just like what we did for the variational autoencoder. That is, we model our data.

So P of theta of x0 is the latent space that we want to learn. And we know that our x0 is conditioned on a chain of latent variables. Here they are called x1, x2, up to xT. But basically, they are z1, z2, up to zT. And basically, we want to learn this latent space.

So, let me go here, we do what we did for the variational autoencoder. We want to maximize the log likelihood of our data. What we do is basically find something that is a lower bound for this log likelihood, which is called the ELBO. And the ELBO is also written here in the paper, which is this expression here, which will be further expanded to arrive at the loss function.

And what we do with our neural network is that we maximize this ELBO. Because if you maximize a lower bound, so this one, you also push up the quantity that is bounded. So basically, we maximize the ELBO, or we minimize the negative of the ELBO. And this is exactly what we did for the variational autoencoder.
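For reference, the bound being discussed is written in the DDPM paper in its negated form, as an upper bound on the negative log likelihood. The notation below follows the paper, not the slides shown in the video:

```latex
\mathbb{E}\left[-\log p_\theta(x_0)\right]
\le
\mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]
= L
```

Minimizing L is the same as maximizing the ELBO, which is the quantity inside the expectation on the right with the sign flipped.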

Now, I will not show you the derivation of how to arrive at the loss. I just told you the concept. So if you want to learn more about it, you can read the paper. There are many tutorials online on how the math of diffusion models works. Now let's go to the training loop.

This is from the paper also. In the paper, they describe the training loop. That is, we start basically from a picture sampled from our dataset, or a batch of pictures sampled from our dataset. And for each picture, we choose a time step of noisification. Because the time step of noisification can be between one and capital T, we can choose a random time step for each one of these pictures.

And then we sample some noise. Basically, what we do is, we take this noise, okay, and we add this noise at the time step T to each picture. And our model, which is this epsilon of theta, because as you remember, the reverse process has the theta parameters, has to predict the noise in this noisified version of the image.

So basically, why do we have this formula here? Let's go back to the paper. As we saw here, we can always go from x0, the original image, to the noisified image at time step t. And what we are doing here is exactly the same: we are going from the picture x0, which is here, to the noisified version.

And why do we do it like this? Because of the properties of Gaussian variables. And this is the output of our model, which, given a noisified image and the time step t at which it was noisified, has to predict the noise that was added.

So basically, we compare the predicted noise from our model with the noise that we added to the image. And that's it, this is the training loop. So our model has to just predict the noise that we add to an image at the time step T. And if we do it, we will learn that latent space.
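The training step just described can be sketched as follows. This is not the code from the video. To stay runnable without a deep learning library, the network is replaced by a deliberately tiny stand-in with a single weight `w` that ignores the time step, and its gradient is written by hand. The names `predict_noise` and `training_step`, the learning rate and the two-image dataset are all assumptions of this sketch.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed linear schedule
alpha_hats, running = [], 1.0
for beta_t in betas:
    running *= 1.0 - beta_t
    alpha_hats.append(running)

def predict_noise(w, x_t, t):
    """Toy stand-in for the network: one weight, and it ignores t."""
    return [w * v for v in x_t]

def training_step(w, x0, rng, lr=0.01):
    """One gradient step on the noise-prediction loss; returns (new_w, loss)."""
    t = rng.randint(1, T)                        # random time step in 1..T
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # the noise we add
    a = alpha_hats[t - 1]
    x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
    pred = predict_noise(w, x_t, t)
    n = len(x0)
    # Mean squared error between the true noise and the predicted noise.
    loss = sum((e - p) ** 2 for e, p in zip(eps, pred)) / n
    # Hand-written gradient of that loss with respect to w.
    grad = sum(-2.0 * (e - p) * v for e, p, v in zip(eps, pred, x_t)) / n
    return w - lr * grad, loss

rng = random.Random(0)
w = 0.0
dataset = [[1.0, -1.0, 0.5, 0.0], [0.2, 0.8, -0.3, -0.7]]
for step in range(200):
    w, loss = training_step(w, rng.choice(dataset), rng)
```

In a real setup, `predict_noise` would be the neural network and the gradient would come from automatic differentiation, but the sequence of operations (pick t, sample noise, noisify, predict, compare) is the same.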

The sampling, that is, how we generate new samples using our latent space, is also described in the paper. We start with some noise, and we progressively denoise this initial noise over these time steps until we arrive at x0. But of course, this x0 does not belong to our dataset; we actually sample something new, just like we did for the variational autoencoder.

So if you remember previously, let's go back here. Our goal with the variational autoencoder, but also with the diffusion model, is actually to sample new stuff. So we want to be able to sample from this space to generate new data. And also we want our latent space to actually represent features, to capture features from our data.

So the sampling basically means that we are actually generating new samples from our latent space. That's why it is called sampling. And why do we do it that way? Because basically, we start from noise, and we progressively denoise it, for T time steps in total. And we do it with this algorithm here.
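A minimal sketch of this sampling loop follows. The update inside the loop is the one from Algorithm 2 of the DDPM paper, with the noise scale set to the square root of beta t, which is one of the two choices the paper mentions. Since no trained network is available here, `predict_noise` is a hypothetical closed-form predictor that is exact only for a dataset whose every image is all zeros; with it, the loop should end very close to the all-zero image, which makes a convenient sanity check.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed linear schedule
alphas = [1.0 - b for b in betas]
alpha_hats, running = [], 1.0
for alpha_t in alphas:
    running *= alpha_t
    alpha_hats.append(running)

def predict_noise(x_t, t):
    """Stand-in for the trained network, exact only if every x_0 is all zeros."""
    scale = math.sqrt(1.0 - alpha_hats[t - 1])
    return [v / scale for v in x_t]

def sample(num_pixels, rng):
    """Start from pure noise x_T and denoise step by step down to x_0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]
    for t in range(T, 0, -1):
        beta_t, alpha_t, a_hat = betas[t - 1], alphas[t - 1], alpha_hats[t - 1]
        eps = predict_noise(x, t)
        coef = (1.0 - alpha_t) / math.sqrt(1.0 - a_hat)
        mean = [(v - coef * e) / math.sqrt(alpha_t) for v, e in zip(x, eps)]
        if t > 1:
            sigma = math.sqrt(beta_t)            # fresh noise on all but the last step
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean                             # no noise is added at the final step
    return x

generated = sample(4, random.Random(0))
```

Swapping `predict_noise` for a trained network is the only change needed to turn this into a real generator.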

This algorithm also comes from the paper. Now, this still maybe looks a little abstract to you. So let's go inside the code. But before that, let's review the model that we use to model our latent space. So the model that has to predict the reverse process of the diffusion model is the U-Net.

So why did the authors choose the U-Net? The U-Net was introduced in 2015 as an image segmentation model for medical applications. And if you look at its structure, it looks like an autoencoder. So you start with the original image, and it gets compressed until it becomes very small in this bottleneck here.

And then we upsample to reconstruct the original image. The authors of the DDPM paper also use the U-Net for the purpose of training the reverse process, however with some modifications that we will also see in the code. The first modification is that, as you saw in the sampling and also in the training, our model, epsilon theta, takes two inputs.

The first is the image noisified at the time step t, and the second is the time step t itself. So we need to tell our model what the time step t is. And how do we do it? Well, basically, at each downsampling and upsampling operation, we also concatenate, for example, this one with the positional encoding from the transformer model.

So if you remember from the transformer model, we have a way of encoding the position of a token inside the sentence, which is actually a vector that tells the position. And we use the same kind of vector to tell the model, instead of the position, the time step at which the image was noisified.
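As an illustration, a sinusoidal encoding of the time step in the style of the original transformer paper could look like the sketch below. The function name `timestep_embedding` and the embedding size of 8 are choices made for this example; real implementations differ in details such as whether sines and cosines are interleaved or placed in two halves.

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal encoding of an integer time step t as a list of length dim (dim even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    embedding = []
    for i in range(dim // 2):
        # Each sin/cos pair uses a different frequency, as in the transformer encoding.
        freq = 1.0 / (10000.0 ** (2 * i / dim))
        embedding.append(math.sin(t * freq))
        embedding.append(math.cos(t * freq))
    return embedding

emb = timestep_embedding(t=500, dim=8)
```

Every time step gets a different vector with entries between -1 and 1, which the network can combine with its feature maps to know how noisy its input is.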

The second modification is that we also do self-attention. So at each downsampling, we can do the self-attention. Let's look at the training code. As we have already seen the logic of the training code before, now I will compare it with the Python code.

So basically, we start from some samples taken from our dataset, so a batch of samples. For each sample, we generate a time step t, which is from 1 to capital T. And we also sample some random noise, and we create the noisified version of the image using this noise and the time step t.

And then we pass it through the U-Net, this is the U-Net, okay, to which we pass the noisified image and the time step t. And then our loss is basically the difference between the predicted noise, so this epsilon theta, and the epsilon that we used as the initial noise.

That's it. This is the training code. And the sampling code is also simple. So basically, we start with some random noise, which is here x. And then we progressively denoise it. This code is actually only the inner part of this for loop here.

So basically, we denoise it continuously for these time steps. And you can see that the names are also the same here. And as you can see, the code is not so hard; it's quite simple. Another thing I want you to notice is that we use the U-Net not because we have to use the U-Net, but because the U-Net works well with this kind of model.

So actually, the authors of the DDPM paper chose the U-Net because it works well with this kind of model, but we don't have to use it. We can use any model, any structure that is good at predicting noise, given the noisified version and the time step t.

It can be as simple as you want, or as complex as you want, but it doesn't have to be the U-Net. The full code is available on my GitHub. And I also want to give special thanks to two other repositories. From the first one I took the U-Net model. The U-Net model there was very complete, with a lot of features, but I removed a lot of them to simplify it as much as possible.

So that it becomes simple to understand. And the diffusion model I took from here. The problem with this implementation, however, was that the U-Net was too simple and did not reflect the U-Net actually used by the DDPM paper. Thank you guys for watching and stay tuned for more amazing content on deep learning and machine learning.