How diffusion models work - explanation and code!
Chapters
0:00 Introduction
0:46 Generative models
3:51 Latent space
7:35 Forward and reverse process
9:00 Mathematical definitions
13:00 Training loop
15:05 Sampling loop
16:36 U-Net
18:31 Training code
19:28 Sampling code
20:34 Full code
Hello guys, welcome to my new video about diffusion models. In this video I will introduce the diffusion model as described in the original paper, the DDPM paper. I will show you the structure of these models, how they work, and why they were invented in the first place, and we will also go inside some code to see how they are actually implemented. We will see that the code is actually very simple, even if the math looks very hard. I will also not go very much into the math, because this video is more about teaching concepts and how to actually work with these models than about teaching the math derivations, which you can also find online or in the paper. So let's start by reviewing why we needed the DDPM model in the first place.
Before we had all these fancy models like GANs, DDPMs, etc., we basically had simpler models like the autoencoder. The autoencoder had a very simple task: to compress data. If we have some data and we run it through the encoder, the encoder will transform the data into a smaller representation, a code. And if this code is passed through the decoder, hopefully it will rebuild the original data. So for example, if we have a picture of a tomato and we run it through the encoder, it will be converted into a vector of numbers that represents that particular picture. And if we run this vector through the decoder, hopefully it will rebuild the original image. If we run multiple images through an autoencoder, each of them will have an associated code, different for each image. However, there was a problem with these autoencoders: the autoencoder did not capture any semantic relationship between the data. So for example, the code associated with the picture of the tomato and the code associated with the picture of the zebra might be very similar, even though the two images have no semantic relationship, because the autoencoder was never told to learn such relationships. Its only job was to compress data, and it was pretty good at it. But of course, we wanted to learn some representation. So we wanted to transform this code into a latent space.
And that's why we introduced the variational autoencoder. In the variational autoencoder we don't just learn how to compress data, we actually learn a latent space, which is basically the parameters of a multivariate distribution, in such a way that this latent space also captures some relationships between the data. For example, the code associated with the picture of the tomato and the code associated with the picture of the egg may be similar to each other, at least more similar than the codes of the egg and the zebra. And the most important property of a latent space is that we can sample from it, just like we sample from a Gaussian distribution. If we sample, for example, from this part of the space, hopefully we will get a picture of food; if we sample from here, hopefully we will get a picture of an animal; and if we sample from here, hopefully we will get a picture of a car, etc. So the most important property of these latent spaces is that we can sample from them to generate new data. Why is this representation called a latent space? Because we model our data as a variable x which is conditioned on another variable z that we cannot observe, but about which we want to infer some properties. If we model the variable z as a Gaussian, we want to learn its mean and its variance.
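In the usual notation (a sketch of the standard formulation, not a formula taken from the video itself), that means:

```latex
z \sim \mathcal{N}(0, \mathbf{I}), \qquad
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz, \qquad
q_\phi(z \mid x) = \mathcal{N}\!\big(z;\ \mu_\phi(x),\ \sigma_\phi^2(x)\, \mathbf{I}\big)
```

The encoder learns the mean and variance of the approximate posterior over z, which is exactly the "mean and variance" mentioned above.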
Let me give you a more concrete example of this latent space. I will use Plato's allegory of the cave for that purpose. In this allegory, we have to imagine that there are some people who were born and have lived all their lives in this small section of a cave. These people cannot leave the cave; they are chained in it. They observe some shapes on the wall, and they believe this is reality. So for them, the horse is something black that moves like this, and the bird is something black that moves like this, etc. However, we know, as external observers, that this is not the real reality: what they see are actually the projections of real objects, cast onto the wall by the fire. The people holding those objects, on the other hand, can see the real objects. So basically, we have to think of our data as the only variable that we can observe, and this variable is conditioned on another variable that we cannot observe. This variable is hidden, and that's why it is called latent: latent means hidden.
Now, this was true for the variational autoencoder. For diffusion models, we have to go deeper, and I will tell you why. Imagine the people who hold the objects: they believe they hold the true objects, right? But imagine that these people themselves did not hold the true objects, but were themselves prisoners of a cave, watching a projection of some real objects. So they were just like the first group of people. That is, we start from some people who can observe the real object; this is the real object, and we will say it is time step zero. These people project this real object to some other people inside an inner cave. So those people think they are seeing the real object, but actually they are watching a projection of it. It is more noisy, just like the shadows projected through the fire are a noisier version of the real object. And these people themselves project it to some other people inside an even deeper cave, so it becomes a noisy version of something that was already noisy, an even noisier version. And we do it again and again and again for 1000 steps; the last step is called capital T, and at that point it has become pure noise. This process of noisification is called the forward process. Then we also want the reverse process: that is, if we have some noise, can we infer something about the object that was noisified? For example, if we are at the last time step, can we get some information about the previous time step, which is capital T minus 1 (in this drawing it is 500)? And these people, of course, also want to infer something about the object that projected the one they are watching, and so on. We do this for 1000 time steps; each step can only watch the previous one, and each noisy version comes from a previously noisified version. So this is the forward process, and this, in blue, is the reverse process. The forward process is quite easy, because we can always add noise to something: for example, you can give the picture of the Mona Lisa to a three-year-old, and he or she will add all the noise that you want. However, the reverse process is hard, because we want to remove noise from something and recover the real object. And because it is hard, we will train a neural network to do it.
So mathematically, we have some real data that we call x0, and this x0 is conditioned on a latent variable z1, which is itself conditioned on a variable z2, which is itself conditioned on another variable z3, until, at the end of this chain of conditioning, we have the last variable, which is pure noise. The process of noisification is called the forward process, and the process of denoisification is called the reverse process. The forward process, as I said before, is fixed: we know how to go from less noise to more noise. But we don't know how to go from more noise to less noise, and that's why we will train a neural network to do it. Another thing to notice is that with time step zero we indicate the original image, and with time step capital T we indicate pure noise. So a higher number means more noise, and a smaller number means less noise.
Now, of course, we need to look at some math. I will try to avoid any derivation and instead teach the concepts behind the math, because this way you can also read the paper and follow it easily: even if you don't understand each step, you will grasp the meaning of all the parts described in the paper. That's why I'm quoting the paper itself. We start with the forward process. This is the original DDPM paper from Ho et al., released in 2020. The forward process is called q. As you can see, the forward process, unlike the reverse process that we will see later, has no parameters: it is not q with a subscript theta, the way the reverse process is p with a subscript theta. q has no parameters to learn because it is fixed; we decide its parameters, and the neural network does not have to learn anything about it. Basically, it describes how to go from a less noisy version (a smaller time step, less noise) to a more noisy version (a bigger time step, more noise). It is modeled as the steps of a Markov chain of Gaussian variables, in which we know the mean and the variance of each one of them. The mean is the square root of 1 minus beta t multiplied by the previous version, and we also know the variance, which is beta t. This beta, one for each time step, is fixed: we decide it. And the sequence of betas is called a schedule.
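Written out in the paper's notation, the forward step being described is the following Gaussian:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})
```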
Then we also have the reverse process, p. As I said before, this p has a theta subscript, because we want to learn the parameters of the reverse process. This basically means that, to go from a more noisy version to a less noisy version, we want to learn this mean and this covariance matrix, because the reverse process is also modeled as a Gaussian, and actually as a Markov chain of Gaussian variables.
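In the paper's notation, that is:

```latex
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),
\qquad
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
```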
Another interesting property of the forward process is that, because it is fixed, we can always go from the original image to the image noisified at time step t, whatever t is, without doing all the intermediate steps, in just one step, using this formula here. Alpha t is basically 1 minus beta t, so since beta t is defined, alpha t is also defined. And alpha t with the bar over it is the product of all the alphas from 1 up to time step t.
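As a rough illustration of that property, here is a minimal PyTorch-style sketch of a linear beta schedule and the one-step noisification q(x_t | x_0). The schedule values (1e-4 to 0.02 over 1000 steps) are the ones used in the DDPM paper; the function and variable names are just illustrative, not code taken from the video:

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)          # fixed noise schedule, we decide it
alpha = 1.0 - beta                            # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alpha, dim=0)       # alpha_bar_t = product of alpha_1..alpha_t

def noisify(x0: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Jump from the clean image x0 directly to time step t in one shot:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)       # broadcast over (batch, C, H, W)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return xt, eps
```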
So how do we actually train a neural network to model our reverse process? Basically, we do just what we did for the variational autoencoder. That is, we model our data: p theta of x0 is the distribution we want to learn. And we know that our x0 is conditioned on a chain of latent variables; here they are called x1, x2, up to xT, but they play the role of z1, z2, up to zT. We want to learn this latent space. So, just as for the variational autoencoder, we want to maximize the log likelihood of our data. What we do is find something that is a lower bound for this log likelihood, which is called the ELBO. The ELBO is also written here in the paper, in this expression, which is then further expanded to arrive at the loss function. What we do with our neural network is maximize this ELBO, because if you push up a lower bound, you also push up the quantity that it bounds. So basically, we maximize the ELBO, or equivalently we minimize its negative. This is exactly what we did for the variational autoencoder. Now, I will not show you the derivation of the loss; I just told you the concept. If you want to learn more about it, you can read the paper; there are many tutorials online on how the math of diffusion models works.
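For reference, the simplified objective that this derivation ends up with in the paper is just a mean-squared error on the noise:

```latex
L_{\text{simple}}(\theta) =
\mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[
\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\right) \right\rVert^2
\right]
```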
Now let's go to the training loop. This is also from the paper: in the paper, they describe the training loop. We start from a picture sampled from our dataset, or a batch of pictures sampled from our dataset. For each picture, we choose a time step of noisification: because the time step can be between 1 and capital T, we choose a random time step for each picture. Then we sample some noise. Basically, we take this noise and we add it to each picture at its time step t, and our model, which is this epsilon theta (because, as you remember, the reverse process has the theta parameters), has to predict the noise in this noisified version of the image. So why do we have this formula here? Let's go back to the paper. As we saw before, we can always go from x0, the original image, to the noisified image at time step t, and what we are doing here is exactly the same: we are going from the picture x0 to its noisified version. Why can we do it like this? Because of the properties of Gaussian variables. And this is the output of our model: given a noisified image and the time step t at which it was noisified, it has to predict the noise that was added. So basically, we compare the noise predicted by our model with the noise that we actually added to the image. And that's it, this is the training loop. Our model just has to predict the noise that we added to an image at time step t, and if we do that, we will learn that latent space.
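Here is a minimal sketch of that training step (Algorithm 1 in the paper), assuming a `model(x_t, t)` that predicts the noise and reusing the `alpha_bar` schedule from the earlier snippet; this is not the exact code from the repository, just the logic:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bar, T=1000):
    """One step of Algorithm 1: predict the noise added at a random time step t."""
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,), device=x0.device)   # random time step per image
    eps = torch.randn_like(x0)                             # the noise we will add
    ab = alpha_bar[t - 1].view(-1, 1, 1, 1)                # alpha_bar_t (schedule is 0-indexed)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps          # noisified image at step t
    eps_pred = model(xt, t)                                # model predicts the added noise
    loss = F.mse_loss(eps_pred, eps)                       # compare predicted vs. true noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```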
The sampling, that is, how we generate new samples using our latent space, is also described in the paper. We start with some noise and we progressively denoise this initial noise over the time steps until we arrive at x0. But of course, this x0 does not belong to our dataset: we actually sample something new, just like we did for the variational autoencoder. If you remember from before, our goal with the variational autoencoder, but also with the diffusion model, is to sample new stuff: we want to be able to sample from this space to generate new data, and we also want our latent space to capture features of our data. So sampling means that we are generating new samples from our latent space; that's why it's called sampling. And why do we do it that way? Because we start from noise and we progressively denoise it, doing T time steps in total, with this algorithm here, which also comes from the paper.
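And here is a sketch of the corresponding sampling loop (Algorithm 2 in the paper, with sigma_t squared set to beta_t), again assuming the same `model` and schedule tensors; the shapes and variable names are illustrative:

```python
import torch

@torch.no_grad()
def sample(model, shape, beta, alpha, alpha_bar, T=1000, device="cpu"):
    """Algorithm 2: start from pure noise and denoise for T steps."""
    x = torch.randn(shape, device=device)                           # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)   # no extra noise at the last step
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_pred = model(x, t_batch)
        a, ab, b = alpha[t - 1], alpha_bar[t - 1], beta[t - 1]
        # x_{t-1} = 1/sqrt(alpha_t) * (x_t - (1 - alpha_t)/sqrt(1 - alpha_bar_t) * eps_theta) + sigma_t * z
        x = (x - (1 - a) / (1 - ab).sqrt() * eps_pred) / a.sqrt() + b.sqrt() * z
    return x
```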
Now, this may still look a little abstract to you, so let's go inside the code. But before that, let's review the model that we use to model our latent space, that is, the model that has to predict the reverse process of the diffusion model: the U-Net. Why did the authors choose the U-Net? The U-Net was introduced in 2015 as an image segmentation model for medical applications, and if you look at its structure, it looks like an autoencoder: you start with the original image, it gets compressed until it becomes very small at this bottleneck here, and then we upsample to reconstruct the original image. The authors of the DDPM paper also used the U-Net for the purpose of learning the reverse process, however with some modifications that we will also see in the code. The first modification is that, as you saw in the sampling and also in the training, our model epsilon theta takes two parameters: the first is the image noisified at time step t, and the second is the time step t itself. So we need to tell our model what the time step t is. How do we do it? Basically, at each downsampling and upsampling operation we also combine the feature maps, for example by concatenation, with the positional encoding from the transformer model. If you remember, in the transformer we have a way of encoding the position of a token inside the sentence, which is a vector that tells the position. We use the same kind of vector to tell the model at which time step the image was noisified. The second modification is that we also use self-attention: at each downsampling step we can apply some self-attention.
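To make the time-step conditioning concrete, here is a small sketch of the sinusoidal embedding borrowed from the transformer, which is the kind of vector that gets injected into the U-Net blocks; the embedding dimension and the 10000 base are the usual transformer defaults, not necessarily the exact values used in the repository:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal embedding of the time step t, same recipe as the transformer
    positional encoding: half of the channels are sines, half are cosines."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)            # shape (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)   # shape (batch, dim)
```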
Let's look at the training code. We have already seen the logic of the training loop; now I will compare it with the Python code. Basically, we start from a batch of samples taken from our dataset. For each sample, we generate a time step t, from 1 to T, and we also sample some random noise. We create the noisified version of the image using this noise and the time step t, and then we pass it through the U-Net, which is this, together with the time step t. Our loss is basically the difference between the predicted noise, this epsilon with the theta subscript, and the epsilon that we used as the initial noise. That's it, this is the training code.
The sampling code is also simple. Basically, we start with some random noise, which is x here, and then we progressively denoise it; this code is actually only the inner part of the sampling for-loop shown in the paper. We keep denoising it for the given time steps, and as you can see, the variable names are the same as in the paper. So the code is not so hard; it's quite simple. Another thing I want you to notice is that we use the U-Net not because we have to use the U-Net, but because the U-Net works well with this kind of model. The authors of the DDPM paper chose the U-Net because it works well here, but we don't have to use it: we can use any model, any structure that is good at predicting the noise given the noisified version and the time step t. It can be as simple or as complex as you want, but it doesn't have to be the U-Net.

The full code is available on my GitHub. I also want to give special thanks to two other repositories. From one of them I took the U-Net model: the U-Net there was very complete, with a lot of features, but I removed many of them to simplify it as much as possible, so that it becomes easy to understand. The diffusion model I took from the other one; the problem with that implementation, however, was that its U-Net was too simple and did not reflect the U-Net actually used in the DDPM paper. Thank you guys for watching, and stay tuned for more amazing content on deep learning and machine learning.