Coding Stable Diffusion from scratch in PyTorch
Chapters
0:00 Introduction
4:30 What is Stable Diffusion?
5:40 Generative Models
12:07 Forward and Reverse Process
17:44 ELBO and Loss
20:30 Generating New Data
22:20 Classifier-Free Guidance
31:00 CLIP
33:20 Variational Auto Encoder
37:26 Text to Image
39:54 Image to Image
41:40 Inpainting
44:30 Coding the VAE
114:50 Coding CLIP
129:10 Coding the Unet
184:40 Coding the Pipeline
233:00 Coding the Scheduler (DDPM)
278:00 Coding the Inference code
00:00:00.000 |
Hello guys! Welcome to my new video on how to code stable diffusion from scratch. 00:00:05.360 |
And stable diffusion is a model that was introduced last year. 00:00:09.480 |
I think most of you are already familiar with it. 00:00:12.360 |
And we will be coding it from scratch using PyTorch only. 00:00:16.640 |
And as usual my video is going to be quite long, 00:00:19.880 |
because we will be coding from scratch and at the same time I will be explaining each part that makes up stable diffusion. 00:00:27.480 |
So as usual, let me introduce the topics that we will discuss and the prerequisites for watching this video. 00:00:35.520 |
So of course we will discuss stable diffusion because we are going to build it from scratch using only PyTorch. 00:00:41.120 |
So no other libraries will be used except for the tokenizer. 00:00:45.720 |
I will describe the maths of the diffusion models as defined in the DDPM paper, but I will simplify it as much as possible. 00:00:54.240 |
I will show you how classifier-free guidance works and of course we will also implement it, 00:00:59.200 |
how the text-to-image works, image-to-image and in-painting. 00:01:03.360 |
Of course to have a very complete view of diffusion models actually we should also introduce the score-based models 00:01:08.920 |
and all the ODE and SDE theoretical framework. 00:01:12.240 |
But most people are not familiar with ordinary differential equations or even stochastic differential equations. 00:01:18.720 |
So I will not discuss these topics in this video and I'll leave it for future videos. 00:01:23.960 |
So anyway we will have a complete copy of Stable Diffusion; we will be able to generate images using a prompt, 00:01:35.680 |
But for example the samplers based on the Euler method or Runge-Kutta method will not be built in this video. 00:01:42.560 |
I will make a future video in which I describe these ones. 00:01:46.560 |
What do I expect you to have as a prerequisite for watching this video? 00:01:50.560 |
Well, first of all, it's good if you have some notion of probability and statistics, 00:01:55.240 |
so at least you know what is a Gaussian distribution, 00:01:58.160 |
what is the conditional probability, the marginal probability, the likelihood etc. 00:02:02.520 |
Now I don't expect you to have the mathematical formulation in your mind about these concepts, 00:02:08.520 |
but at least the concepts behind them, so at least what do we mean by conditional probability or what do we mean by marginal probability. 00:02:16.040 |
Anyway, even if you're not very strong with mathematics, I will always give a non-mathematical intuition for most concepts. 00:02:22.280 |
So even if you don't have this background you will at least understand the concept behind this, some intuition behind this. 00:02:31.000 |
And of course I expect you to know Python and PyTorch, at least basic level, because we will be coding using Python and PyTorch. 00:02:39.840 |
And then we will be using a lot the attention mechanism, 00:02:43.040 |
so if you're not familiar with the transformer model please watch my previous video on the attention and transformer. 00:02:49.800 |
And we will also be using a lot of convolutions. 00:02:53.320 |
So I don't expect you to know how mathematically the convolution layers work, 00:02:57.400 |
but at least what they do on a practical level in a neural network. 00:03:06.600 |
And because this is going to be a long video, and because stable diffusion and diffusion models in general are quite complex from a mathematical point of view, 00:03:16.840 |
we cannot jump directly to the code without explaining what we are going to code and how it works. 00:03:23.120 |
The first thing I will do is to give you some background knowledge from a mathematical point of view, 00:03:28.600 |
but also from a conceptual point of view of how the diffusion models work and how stable diffusion works. 00:03:41.600 |
Of course at the beginning you will have a lot of ideas that are kind of confused because I will give you a lot of new concepts to grasp. 00:03:49.800 |
And it's normal that you don't understand everything at the beginning. 00:03:52.920 |
But don't worry because while coding I will repeat each concept more than once. 00:03:57.840 |
So while coding you will also get a practical knowledge of what each part is doing and how they interact with each other. 00:04:05.360 |
So please don't be scared if you don't understand everything in the beginning part of this video. 00:04:10.560 |
Later when we start coding it everything will make sense to you. 00:04:14.440 |
But we need this initial part because otherwise we cannot just jump in the dark and start coding without knowing what we are going to code. 00:04:28.720 |
Stable diffusion is a model that was introduced in 2022, so last year, at the end of last year I remember, 00:04:36.080 |
by the CompVis group at the Ludwig Maximilian University in Munich, Germany. 00:04:40.680 |
And it's open source, the weights, the pre-trained weights can be found on the Internet. 00:04:45.560 |
And it became very famous because people started building a lot of projects and products with stable diffusion. 00:04:53.240 |
And one of the most simple use of stable diffusion is to do text to image. 00:04:58.320 |
So given a prompt we want to generate an image. 00:05:01.040 |
We will also see how image to image works and also how in-painting works. 00:05:05.840 |
Image to image means that you already have a picture, for example, of a dog and you want to change it a little bit by using a prompt. 00:05:12.280 |
For example, you want to ask the model to add the wings to the dog so that it looks like a flying dog. 00:05:17.880 |
Or in-painting means that you remove some part of the image. 00:05:21.920 |
For example, you can remove, I don't know, this part here and you ask the model to replace it with some other part that makes sense, 00:05:30.000 |
that is coherent with the image. And we will see also how this works. 00:05:35.800 |
Let's jump into generative models because diffusion models are generative models. 00:05:44.120 |
Well, a generative model learns a probability distribution of the data such that we can then sample from the distribution to create new instances of the data. 00:05:54.640 |
For example, if we have many pictures of cats or dogs or whatever we have, 00:05:59.640 |
we can train a generative model on it and then we can sample from this distribution to create new images of cats or dogs or whatever. 00:06:08.560 |
And this is exactly what we do with stable diffusion. 00:06:11.080 |
We actually have a lot of images, we train it on a massive amount of images, 00:06:15.720 |
and then we sample from this distribution to generate new images that don't exist in our training set. 00:06:22.320 |
But the question that may arise in your mind is: why do we model data as distributions, as probability distributions? 00:06:30.240 |
Well, let me give you an example. Imagine you are a criminal and you want to generate thousands of fake identities. 00:06:38.200 |
Imagine you also live in a very simple world and each fake identity is made up of variables representing the characteristic of a person. 00:06:45.520 |
So age and height. Suppose we only have two variables that make up a person. 00:06:49.600 |
So it's the age of the person and the height of the person. 00:06:53.520 |
In my case, I will be using the centimetre for the height. I think the Americans can convert it to feet. 00:06:59.800 |
And so how do we proceed if we are a criminal with this goal? 00:07:04.120 |
Well, we can ask the statistics department of the government to give us some statistics about the age and the height of the population. 00:07:11.320 |
This information you can easily find online, for example. And then we can sample from this distribution. 00:07:17.800 |
For example, if we model the age of the population like a Gaussian with the mean of 40 and the variance of 30. 00:07:25.840 |
OK, these numbers are made up. I don't know if they reflect the reality. 00:07:29.640 |
And the height in centimetres is 120 as mean and the variance is 100. 00:07:38.120 |
We get these two distributions. Then we can sample from these two distributions to generate a fake identity. 00:07:45.240 |
What does it mean to sample from a distribution? To sample from this kind of distribution means to throw a coin, 00:07:52.280 |
a very special coin that has a very high chance of falling in this area, a lower chance of falling in this area, 00:08:00.200 |
an even lower chance of falling in this area and a very nearly zero chance of falling in this area. 00:08:07.240 |
So imagine we flip this coin once for the age, for example, and it falls here. 00:08:12.880 |
So it's quite probable, not very probable, but quite probable. 00:08:16.760 |
So suppose the age is three, and let me write it. So the age, let's say, is three. 00:08:27.960 |
And then we toss this coin again and the coin falls, let's say, here. 00:08:35.320 |
So the height, let's say, is one hundred thirty. One hundred thirty centimetres. 00:08:43.560 |
So as you can see, the combination of age and height is quite improbable in reality. 00:08:49.960 |
I mean, no three years old is one metre and thirty centimetres high. 00:08:55.160 |
I mean, at least not the ones I know. So this combination of age and height is not very plausible. 00:09:02.600 |
So to produce plausible pairs, we actually need to model these two variables. 00:09:07.320 |
So the age and height, not as independent variables and sample from each of them independently, 00:09:12.840 |
but as a joint distribution. And usually we represent the joint distribution like this, 00:09:19.080 |
where each combination of age and height has a probability score associated with it. 00:09:25.000 |
And from this distribution, we only sample using one coin. 00:09:29.240 |
And for example, this coin with a very high chance will fall in this area, 00:09:35.160 |
with less chance will fall in this area and very close to zero chance of falling in this area. 00:09:40.440 |
Suppose we throw the coin and it ends up in this area. 00:09:44.680 |
Suppose this is the age and this is the height; to get to the corresponding age and height, 00:09:50.280 |
we just need to do like this. And suppose these are actually the real age and the real height. 00:09:55.080 |
Now the numbers here actually do not match, but you get the idea that to model something, 00:10:01.080 |
we need a joint distribution over all the variables. 00:10:03.960 |
And this is actually what we do also with our images. 00:10:06.680 |
With our images, we create a very complex distribution in which, for example, 00:10:12.680 |
each pixel is a distribution and the entirety of all the pixels are one big joint distribution. 00:10:20.120 |
And once we have a joint distribution, we can do a lot of interesting things. 00:10:26.440 |
So, for example, imagine we have a joint distribution over the age and the height. 00:10:30.840 |
So let's call the age X and let's call the height, let's say Y. 00:10:36.200 |
So if we have a joint distribution, which means having P of X and Y, 00:10:42.440 |
which is defined for each combination of X and Y, we can always calculate P of X. 00:10:48.680 |
So the probability over the single variable, by marginalizing over the other. 00:10:53.560 |
So p(x) is the integral of p(x, y) dy, 00:11:02.280 |
which means marginalizing over all the possible Y that we can have. 00:11:06.040 |
And then we can also calculate the probability, the conditional probability. 00:11:09.960 |
For example, we can say that the probability, 00:11:12.360 |
what is the probability of the age being, let's say, from 0 to 3, 00:11:24.920 |
We can do these kinds of queries by using the conditional probability. 00:11:28.760 |
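As a quick reference, the two operations mentioned here can be written as follows, with x standing for the age and y for the height (a notational aside, not part of the original narration):

```latex
% Marginalizing the joint distribution over y gives the marginal of x:
p(x) = \int p(x, y) \, dy
% The conditional probability follows from the joint and the marginal:
p(y \mid x) = \frac{p(x, y)}{p(x)}
```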
So this is actually what we do with the generative model. 00:11:31.400 |
We model our data as a very big joint distribution. 00:11:34.920 |
And then we learn the parameters of this distribution, 00:11:41.480 |
So we let the neural network learn the parameters of this distribution. 00:11:45.160 |
And our goal, of course, is to learn this very complex distribution 00:11:51.000 |
and then sample from it to generate new data, 00:11:53.720 |
just like the criminal before wanted to generate new fake identities 00:12:03.000 |
In our case, we will model our system as a joint distribution 00:12:11.880 |
As you probably are familiar with the diffusion models, 00:12:21.960 |
The forward process means that we have our initial image 00:12:37.240 |
Then we take this image, which has a little noise, 00:12:40.920 |
and we generate a new image that is same as the previous one, 00:12:45.960 |
So as you can see, this one has even more noise, 00:12:49.880 |
until we arrive to the last latent variable called Zt, 00:12:57.480 |
when it becomes completely noise, pure noise, 00:13:16.520 |
So we define how to build the noisified version 00:13:24.520 |
and we have a specific formula, an analytical formula, 00:13:30.600 |
The problem is, we don't have the analytical formula 00:13:49.800 |
to remove noise from something that has noise. 00:13:59.720 |
That's why we are using a neural network for this purpose. 00:14:02.680 |
Now we need to go inside, of course, of the math, 00:14:06.840 |
because we will be using it not only to write the code, 00:14:11.080 |
And in the sampler, it's all about mathematics. 00:14:13.800 |
And I will try to simplify it as much as possible, 00:14:22.280 |
so the Denoising Diffusion Probabilistic Models, 00:14:36.440 |
how can I generate the noisified version of this image 00:14:42.360 |
In this case, actually, this is the joint distribution. 00:14:49.000 |
This means if I have the image at time step t minus one, 00:14:58.920 |
Well, we define it as a Gaussian distribution centered, 00:15:06.600 |
and the variance defined by this beta parameter here. 00:15:19.560 |
This is also known as the Markov chain of noisification, 00:15:23.320 |
because each variable is conditioned on the previous one. 00:15:48.760 |
So this is called the Markov chain of noisification, 00:15:59.800 |
which is a series of Gaussians that add noise. 00:16:08.440 |
to go from the original image to any image at time step t, 00:16:13.320 |
without calculating all the intermediate images, 00:16:33.800 |
This mean here depends on a parameter, alpha, alpha bar, 00:16:43.720 |
And also the variance actually depends on alpha, 00:16:55.880 |
The reverse process means that we have something noisy, 00:17:05.960 |
with a mean, mu theta, and a variance, sigma theta. 00:17:10.600 |
Now, this mean and this variance are not known to us. 00:17:18.520 |
And we will use a neural network to learn these two parameters. 00:17:23.000 |
Actually, the variance, we will also set this at fixed. 00:17:27.800 |
We will parameterize it in such a way that this variance actually is fixed. 00:17:32.360 |
So we hypothesize, we already know the variance. 00:17:34.920 |
And we let the network learn only the mean of this distribution. 00:17:38.680 |
So to rehearse, we have a forward process that adds noise. 00:17:45.800 |
We have a reverse process that we don't know how to denoise. 00:17:49.640 |
So we let a network learn the parameters on how to denoise it. 00:17:54.680 |
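As a rough sketch of the forward process just described, here is the closed-form noisification from the DDPM paper in PyTorch; the linear beta schedule values are an assumption taken from the paper, not from the video:

```python
import torch

# Closed-form forward (noising) process from the DDPM paper:
# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear beta schedule (assumed values)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = product of alpha_s for s <= t

def add_noise(x0: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Noisify x0 to an arbitrary time step t in one shot, without the intermediate images."""
    eps = torch.randn_like(x0)               # epsilon ~ N(0, I)
    ab = alpha_bar[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return x_t, eps
```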
And OK, now that we have defined these two processes, 00:18:01.560 |
Because as you remember, our initial goal is actually to learn 00:18:05.000 |
a probability distribution over our data set. 00:18:12.040 |
But unlike before, when we could marginalize, for example, 00:18:15.880 |
in the case of the criminal who want to generate identities, 00:18:24.600 |
Because we need to marginalize over x1, x2, x3, x4, up to xT. 00:18:24.600 |
And to calculate this integral means to calculate it over all the possible x1. 00:18:38.760 |
So it's a very complex calculation that is computationally intractable, we say. 00:18:55.080 |
So we want to learn the parameter theta of this to maximize the likelihood we can see here. 00:19:00.520 |
What we did is we found a lower bound for this quantity here. 00:19:11.160 |
And if we maximize the lower bound, it will also maximize the likelihood. 00:19:16.040 |
So let me give you a parallel example on what it means to maximize the lower bound. 00:19:26.840 |
And usually, the revenue is more than or equal to the sales of your company. 00:19:35.640 |
Maybe you also have some revenue coming from interest that you get from your bank, et cetera. 00:19:41.000 |
But we can for sure say that the revenue of your company is more than or equal to the sales of your company. 00:19:47.240 |
So if you want to maximize your revenue, you can maximize your sales, for example, 00:19:55.480 |
So if we maximize the sales, we will also maximize the revenue. 00:20:03.880 |
Well, this is the training code for the DDPM diffusion models as defined by the DDPM paper. 00:20:13.240 |
And basically, the idea is after we get the elbow, 00:20:17.000 |
we can parameterize the loss function as this. 00:20:20.360 |
Which says that we need to learn-- we need to train a network called epsilon theta. 00:20:27.800 |
That given a noisy image-- so this formula here 00:20:31.480 |
means the noisy image at time step t and the time step at which the noise was added, 00:20:37.800 |
the network has to predict how much noise is in the image, the noisified image. 00:20:43.240 |
And if we do gradient descent over this loss function here, 00:20:53.400 |
And at the same time, we will also maximize the log likelihood of our data. 00:20:59.880 |
And this is how we train these kind of networks. 00:21:03.000 |
Now, I know that this is a lot of concept that you have to grasp. 00:21:07.960 |
For now, just remember that there is a forward process and there is a reverse process. 00:21:11.400 |
And to train this network to do the reverse process, 00:21:15.000 |
we need to train a network to detect how much noise 00:21:18.280 |
is in a noisified version of the image at time step t. 00:21:21.560 |
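A minimal sketch of this training loop, reusing the schedule and the `add_noise` helper from the earlier sketch; `model` stands in for the UNet (epsilon theta), and the optimizer and data are assumed to be provided:

```python
import torch
import torch.nn.functional as F

def train_step(model, x0: torch.Tensor, optimizer) -> float:
    """One DDPM training step: predict the noise added at a random time step t."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # random time step per image
    x_t, eps = add_noise(x0, t)                       # closed-form noisification (see earlier sketch)
    eps_pred = model(x_t, t)                          # epsilon_theta(x_t, t): predicted noise
    loss = F.mse_loss(eps_pred, eps)                  # simple L2 loss from the DDPM paper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```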
Let me show you how do we-- once we have this network that has already been trained, 00:21:29.800 |
how do we actually sample to generate new data? 00:21:38.600 |
Suppose we already have a network that was trained for detecting how much noise is in there. 00:21:44.680 |
And what we do is we start from complete noise. 00:21:47.720 |
And then we ask the network to detect how much noise is in there. 00:21:53.400 |
And then we ask the network again how much noise is in there. 00:21:57.640 |
And then we ask the network how much noise is there. 00:22:05.160 |
Until we reach this step, then here we will have something new. 00:22:09.560 |
So if we start from pure noise and we do this reverse process many times, 00:22:15.480 |
And this is the idea behind this generative model. 00:22:19.160 |
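A sketch of this reverse (sampling) loop following the DDPM sampling rule, again reusing the schedule defined above; the model's input signature is an assumption:

```python
import torch

@torch.no_grad()
def sample(model, shape):
    """Start from pure noise and repeatedly remove the predicted noise (DDPM sampling)."""
    x = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))          # predicted noise at step t
        alpha_t, ab_t = alphas[t], alpha_bar[t]
        mean = (x - (1 - alpha_t) / (1 - ab_t).sqrt() * eps) / alpha_t.sqrt()
        noise = torch.randn_like(x) if t > 0 else 0.0       # no extra noise at the last step
        x = mean + betas[t].sqrt() * noise
    return x
```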
Now that we know how to generate new data starting from pure noise, 00:22:24.520 |
we also want to be able to control this denoisification process 00:22:29.160 |
so we can generate images of something that we want. 00:22:32.280 |
I mean, how can we tell the model to generate a picture of a cat 00:22:36.120 |
or a picture of a dog or a picture of a house by starting from pure noise? 00:22:40.840 |
Because as of now, by starting from pure noise and keep denoising, 00:22:48.680 |
But it's not like we can control which new image will be generated. 00:22:52.440 |
So we need to find a way to tell the model what we want in this generational process. 00:22:57.160 |
And the idea is that we start from pure noise. 00:23:01.800 |
And during this chain of removing noise, so denoisification, 00:23:12.620 |
Or it can also be called the conditioning signal. 00:23:21.720 |
In which we influence the model into how to remove the noise 00:23:26.280 |
so that the output will move towards what we want. 00:23:29.640 |
To understand how this works, let's review again 00:23:33.560 |
how the training of this kind of networks works. 00:23:37.880 |
To learn how the training of this kind of network goes, 00:23:46.760 |
Okay, as I told you before, our final goal is to model a distribution, 00:23:53.560 |
theta, p of theta, such that we maximize the likelihood of our data. 00:23:58.520 |
And to learn this distribution, we maximize the ELBO, so the lower bound. 00:24:09.240 |
We minimize this loss, minimize this loss here. 00:24:14.840 |
So by minimizing this loss, we maximize the ELBO, 00:24:25.800 |
for the likelihood of our data distribution here. 00:24:32.440 |
Loss function here indicates that we need to create a model, epsilon theta, 00:24:37.160 |
such that if we give this model a noisified image at a particular noise level, 00:24:44.200 |
and we also tell him what noise level we included in this image, 00:24:47.880 |
the network has to predict how much noise is there. 00:24:51.880 |
So this epsilon is how much noise we have added. 00:24:54.520 |
And we can do a gradient descent on this training loop. 00:25:00.280 |
This way we will learn a distribution of our data. 00:25:06.040 |
But as you can see, this distribution doesn't include anything 00:25:09.560 |
that tells the model what is a cat, or what is a dog, or what is a house. 00:25:14.120 |
The model is just learning how to generate pictures that make sense, 00:25:18.280 |
that are similar to our initial training data. 00:25:21.000 |
But they don't know what is the relationship between that picture and the prompt. 00:25:24.680 |
So one idea could be, OK, can we learn a joint distribution of our initial data, 00:25:33.400 |
so all the images, and the conditioning signal, so the prompt? 00:25:38.120 |
Well, this is also something that we don't want, 00:25:40.120 |
because we want to actually learn this distribution, 00:25:45.240 |
We don't want to learn the joint distribution 00:25:47.000 |
that will be too much influenced by the context, 00:25:50.040 |
and the model may not learn the generative process of the data. 00:25:55.560 |
But we also want to find some way to condition this model. 00:26:11.400 |
will be built using, let me show you, this UNet model here. 00:26:16.280 |
This UNet will receive as input an image that is noisified, 00:26:21.240 |
so for example, a cat, with a particular noise level, 00:26:27.400 |
and we also tell him what is the noise level that we added to this cat, 00:26:31.240 |
and we give them both to the input of the UNet, 00:26:33.560 |
and the UNet has to predict how much noise is there. 00:26:38.600 |
What if we introduce also the prompt signal here, 00:26:43.400 |
so the conditioning signal here, so the prompt? 00:26:59.080 |
so the model has more information on how to remove the noise. 00:27:05.080 |
how to remove noise into building something that is more closer to the prompt. 00:27:14.280 |
it means that it will act like a conditioned model, 00:27:16.760 |
so we need to tell the model what is the condition that we want, 00:27:19.560 |
so that the model can remove the noise in that particular way, 00:27:23.720 |
moving the output towards that particular prompt. 00:27:27.480 |
But at the same time, when we train the model, 00:27:31.400 |
instead of only giving images along with the prompt, 00:27:35.960 |
we can also sometimes, with a probability, let's say 50%, 00:27:38.840 |
not give any prompt and let the model remove the noise 00:27:43.720 |
without telling him anything about the prompt. 00:27:46.040 |
So we just give him a bunch of zero when we give him the input. 00:27:50.680 |
This way, the model will learn to act both as a conditioned model 00:27:58.520 |
so the model will learn to pay attention to the prompt 00:28:05.880 |
Is that we can, once when we want to generate a new picture, 00:28:12.920 |
In the first one, suppose you want to generate a picture of a cat, 00:28:31.720 |
to generate a new image, we start from pure noise. 00:28:34.360 |
We indicate the model what is the noise level. 00:28:37.240 |
So at the beginning, it will be t equal to 1000, 00:28:49.320 |
The UNet will predict some noise that we need to remove 00:28:53.800 |
in order to move the image towards what we want as output. 00:29:11.240 |
And again, we give the same input noise as before, 00:29:21.720 |
So it's the same noise with the same noise level, 00:29:30.120 |
which is how to remove the noise to generate something. 00:29:34.040 |
We don't know what, but to generate something 00:29:38.040 |
And then we combine these two output in such a way 00:29:42.360 |
that we can decide how much we want the output 00:29:52.760 |
So this approach here is called classifier-free guidance. 00:29:57.320 |
I will not tell you why it's called classifier-free guidance, 00:29:59.640 |
because otherwise I need to introduce the classifier guidance. 00:30:07.960 |
But the idea is this, that we train a model that, 00:30:11.320 |
when we train it, sometimes we give it the prompt 00:30:16.440 |
so that the model learns to ignore the prompt, 00:30:21.240 |
And when we sample from this model, we do two steps. 00:30:25.480 |
First time, we give him the prompt of what we want. 00:30:51.640 |
The lower this value, the less it will resemble our prompt. 00:30:55.640 |
And this is the idea behind classifier-free guidance. 00:31:01.400 |
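The combination step of classifier-free guidance can be sketched like this; the function name and the default scale of 7.5 are illustrative:

```python
import torch

def cfg_combine(eps_cond: torch.Tensor, eps_uncond: torch.Tensor,
                guidance_scale: float = 7.5) -> torch.Tensor:
    """Combine the conditioned and unconditioned UNet outputs (classifier-free guidance)."""
    # The higher the scale, the more the output follows the prompt;
    # a scale of 1.0 keeps only the conditioned prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```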
we need to give some kind of embedding to the, 00:31:04.360 |
so the model needs to understand this prompt. 00:31:33.480 |
And the text, basically, they took a bunch of images. 00:31:38.360 |
So for example, this picture and its description. 00:31:41.720 |
Then they took another image along with its description. 00:31:44.280 |
So the image one is associated with the text number one, 00:31:50.760 |
Then the image two has the description number two. 00:32:06.760 |
multiplied with all the possible captions here. 00:32:23.320 |
between image and the text is on the diagonal 00:32:26.280 |
because the image one is associated with the text one. 00:32:31.880 |
Image three is associated with the text three. 00:32:35.480 |
Basically, they said they built a loss function 00:32:37.960 |
that they want this diagonal to have the maximum value 00:32:47.160 |
They are not the corresponding description of these images. 00:32:50.680 |
In this way, the model learned how to combine 00:32:54.360 |
the description of an image with the image itself. 00:33:09.160 |
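A rough sketch of this contrastive objective; the embeddings are assumed to come from the image and text encoders, and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric contrastive loss: maximize the diagonal of the image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature              # (N, N) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)  # image i matches text i (the diagonal)
    loss_i = F.cross_entropy(logits, targets)                    # match each image to its caption
    loss_t = F.cross_entropy(logits.t(), targets)                # match each caption to its image
    return (loss_i + loss_t) / 2
```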
And these embeddings are then used as conditioning signal 00:33:13.480 |
for our UNet to denoise the image into what we want. 00:33:17.880 |
Okay, there is another thing that we need to understand. 00:33:25.560 |
we have a forward process that adds noise to the image. 00:33:43.480 |
that we need to do many steps of denoisification 00:33:53.800 |
involves going through the UNet with a noisified image 00:33:57.720 |
and getting as output the amount of noise present in this image. 00:34:04.280 |
so suppose this image here is 512 multiplied by 512, 00:34:29.160 |
so that each step through the UNet takes less time? 00:34:37.000 |
with something that is called the variational autoencoder. 00:34:41.000 |
Let's see how the variational autoencoder works. 00:35:00.440 |
but we learn the latent representation of the data 00:35:20.200 |
And then we can decompress it to build the original data. 00:35:24.680 |
Let me show you actually how it works on a practical level. 00:35:31.560 |
and you want to send it to your friend over the internet. 00:36:15.000 |
and each of them will have a representation in this. 00:36:17.560 |
This is called a code corresponding to each image. 00:36:27.480 |
doesn't make any sense from a semantic point of view. 00:36:30.680 |
So the code associated with the cat, for example, 00:36:44.600 |
And to overcome this limitation of the autoencoder, 00:36:49.640 |
in which we learn to kind of compress the data, 00:37:02.440 |
And we learn the mean and the sigma of this distribution, 00:37:17.240 |
And this is the idea that we use also in stable diffusion. 00:37:25.960 |
to see what is the architecture of the stable diffusion. 00:37:29.960 |
So let's start with how the text-to-image works. 00:37:34.760 |
Now, imagine text-to-image basically works like this. 00:37:55.080 |
We encode it with our variational autoencoder. 00:37:58.760 |
This will give us a latent representation of this noise. 00:38:13.960 |
The goal of the UNet is to detect how much noise is there. 00:38:26.920 |
to make it into a picture that follows the prompt, 00:38:43.480 |
Our scheduler, we will see later what is the scheduler, 00:39:11.560 |
We keep doing this denoisification for many steps 00:39:15.000 |
until there is no more noise present in the image. 00:39:18.920 |
And after we have finished this loop of steps, 00:39:38.440 |
And this is why this is called a latent diffusion model 00:39:44.840 |
always works with the latent representation of the data. 00:40:02.440 |
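Putting the pieces together, the text-to-image flow described here looks roughly like this; every component name (clip, unet, decoder, scheduler) is a placeholder for what we will build later in the video, not an existing API:

```python
import torch

@torch.no_grad()
def text_to_image(prompt_tokens, clip, unet, decoder, scheduler, steps: int = 50):
    context = clip(prompt_tokens)                  # prompt -> embeddings (the conditioning signal)
    latents = torch.randn(1, 4, 64, 64)            # pure noise in the latent space (512 / 8 = 64)
    for t in scheduler.timesteps(steps):           # e.g. from 1000 down to 0
        eps = unet(latents, context, t)            # predicted noise in the latents
        latents = scheduler.step(t, latents, eps)  # remove a bit of the predicted noise
    return decoder(latents)                        # latents -> 512x512 image via the VAE decoder
```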
For example, I want the model to add glasses to this dog 00:40:09.800 |
and hopefully the model will add glasses to this dog. 00:40:16.520 |
with the encoder of the variational autoencoder 00:40:18.920 |
and we get the latent representation of our image. 00:40:29.000 |
But of course, we need to have some noise to denoise. 00:40:34.920 |
and the amount of noise that we add to this image, 00:40:47.400 |
the more the UNet has freedom to alter the image. 00:40:51.720 |
the less freedom the model has to alter the image 00:41:05.000 |
the UNet is forced to modify just a little bit the output image. 00:41:12.920 |
indicates how much we want the model to pay attention 00:41:20.520 |
For many steps, we keep denoising, denoising, denoising, denoising. 00:41:40.280 |
In-painting works similar way to the image-to-image, 00:41:53.880 |
and we want the model to generate new legs for this dog 00:42:14.520 |
We add some noise to this latent representation. 00:42:23.800 |
because I want to generate new legs for this dog. 00:42:26.920 |
And then we pass the noisified input to the UNet. 00:43:23.080 |
that came up with these details of the image, 00:44:28.600 |
Here we are finally coding our stable diffusion. 00:45:10.760 |
and the decoder of the variational autoencoder 00:46:02.440 |
we keep increasing the features of the image. 00:46:56.440 |
And later we download the pre-trained weights 00:47:18.520 |
and the decoder of our variational autoencoder. 00:48:06.840 |
For those who are familiar with computer vision models, 00:48:11.320 |
to the residual block that is used in the ResNet. 00:48:21.000 |
And this will inherit from the sequential module 00:48:56.520 |
but at the same time increases its number of features. 00:49:13.480 |
So the first thing we do, just like in the UNet, 00:49:20.840 |
Initially, our image will have three channels. 00:49:32.120 |
For those who are not familiar with convolutions, 00:49:35.640 |
let's go have a look at how convolutions work. 00:50:02.300 |
So it's made of a matrix of a size that we can decide, 00:50:06.940 |
which is defined by the parameter kernel size, 00:50:17.980 |
And at each block, each of the pixel below the kernel 00:50:23.500 |
is multiplied by the value of the kernel in that position. 00:50:34.860 |
is multiplied by this red value of the kernel. 00:50:42.380 |
is multiplied by the green value of the kernel. 00:50:49.260 |
So this output here comes from four multiplications 00:50:53.500 |
each one with the corresponding number of the kernel. 00:50:57.420 |
This way, basically, by running this kernel through the image, 00:51:00.540 |
we capture local information about the image. 00:51:07.500 |
the information of four pixels, not only one. 00:51:12.540 |
Then we can also increase the kernel size, for example. 00:51:19.180 |
means that we capture more global information. 00:51:28.700 |
And then we can introduce, for example, the stride, 00:51:32.620 |
which means that we don't do it every successive pixel, 00:51:36.460 |
but we skip some pixels, as you can see here. 00:51:41.340 |
And if the number is, the kernel size is even 00:51:54.780 |
which means that it becomes, with the same kernel size, 00:52:04.700 |
but we skip some pixels, et cetera, et cetera. 00:52:12.060 |
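To make the shape arithmetic concrete, here is a small example; the output spatial size is (input + 2*padding - kernel) / stride + 1, rounded down:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 512, 512)                               # (batch, channels, height, width)

same = nn.Conv2d(3, 128, kernel_size=3, padding=1)            # padding compensates for the kernel
print(same(x).shape)                                          # torch.Size([1, 128, 512, 512])

down = nn.Conv2d(3, 128, kernel_size=3, stride=2, padding=0)  # stride 2 roughly halves the spatial size
print(down(x).shape)                                          # torch.Size([1, 128, 255, 255])
# 255 instead of 256: this is why the encoder later applies an asymmetric padding by hand.
```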
from a local area of the picture, of the image, 00:52:23.660 |
will start with our, okay, let's define some shapes. 00:52:29.580 |
so the encoder of the variational autoencoder 00:52:32.060 |
will start with batch size and three channels. 00:52:39.660 |
Then this image will have a height and the width, 00:52:43.580 |
which will be 512 by 512, as we will see later. 00:52:47.900 |
And this convolution will convert it into batch size 128 features 00:52:56.780 |
Why, in this case, do the height and the width not change? 00:53:02.540 |
Because even if we have a kernel size of three, we also have a padding of one, 00:53:11.180 |
which adds something to the top, the bottom, the right and the left of the image. 00:53:12.940 |
So the image with the padding becomes bigger, 00:53:15.580 |
but then the output of the convolution makes it smaller 00:53:24.780 |
But we will see later that with the next blocks, 00:53:45.580 |
this residual block is a combination of convolutions and normalization. 00:53:52.540 |
So it's just a bunch of convolutions that we will define later. 00:53:55.980 |
And this one indicates how many input channels we have 00:54:02.060 |
And the residual block will not change the size of the image. 00:54:17.900 |
And it becomes, it remains the same basically. 00:54:27.580 |
Another residual block with the same transformation. 00:54:35.820 |
And this time the convolution will change the size of the image. 00:54:47.100 |
Because the output channels of the last block is 128. 00:55:02.860 |
This will basically introduce kernel size 3, stride 2. 00:55:33.340 |
Okay, with the stride of 2 and the kernel size of 3. 00:55:38.620 |
So we skip every 2 pixels before calculating the output. 00:55:43.660 |
And this makes the output smaller than the input. 00:55:51.260 |
So this transformation here will have the following shapes. 00:56:05.420 |
So the original height and the width of the input image. 00:56:34.540 |
But this time by increasing the number of features. 00:56:43.420 |
Here by increasing the feature means that we don't increase the size of the image. 00:57:07.260 |
Now you may be confused of why we are doing all of this. 00:57:10.140 |
Okay the idea is we start with the initial image. 00:57:12.460 |
And we keep decreasing the size of the image. 00:57:15.180 |
So later you will see that the image will become divided by 4, divided by 8. 00:57:19.180 |
But at the same time we keep increasing the features. 00:57:36.780 |
And this time the size will become divided by 4. 00:57:59.740 |
Also in this case the size of the image will become half of what is it now. 00:58:26.860 |
So we start from 256 and the image is divided by 4. 00:58:49.200 |
We will see later what is the residual block. 00:58:52.780 |
But the residual block you have to think of it as just a convolution with a normalization. 00:59:02.860 |
And then we have another convolution that will make it even smaller. 00:59:19.020 |
The same kernel size and the same stride and the same padding as before. 00:59:36.300 |
And with the 4 times smaller width it will become 8 times smaller. 01:00:06.860 |
It doesn't change the shape of the image or the number of features. 01:00:11.020 |
So here we are going from divide by 8 and 512 here. 01:00:31.340 |
And later we will see what is the attention block. 01:00:34.220 |
Basically it will run a self-attention over each pixel. 01:00:40.780 |
as you remember, the attention is a way to relate tokens to each other in a sentence. 01:00:48.620 |
the attention can be thought of as a sequence of pixels 01:00:52.300 |
and the attention as a way to relate the pixel to each other. 01:00:57.820 |
And because this way each pixel is related to each other, 01:01:06.620 |
Even if the convolution already actually relates close pixels to each other, 01:01:14.220 |
So even the last pixel can be related to the first pixel. 01:01:19.020 |
And also in this case we don't reduce the size 01:01:23.020 |
because the attention is, the transformer's attention, 01:01:44.300 |
Also no change in shape or size of the image. 01:02:07.820 |
Finally, we have an activation function called SiLU. 01:02:13.820 |
Okay, it's also called the sigmoid linear unit. 01:02:21.420 |
They just saw that this one works better for this kind of application. 01:02:25.580 |
But there is no particular reason to choose one over another, 01:02:31.820 |
except that they thought that practically this one works fine for this kind of models. 01:02:36.460 |
And if you watch my previous video about LLaMA, for example, 01:02:40.860 |
in which we analyzed why they chose the SwiGLU function. 01:02:43.740 |
If you read the paper, at the end of the paper, 01:02:45.580 |
they say that there is no particular reason they chose the SwiGLU. 01:02:48.700 |
They just saw that practically it works better. 01:02:52.460 |
activation function works better than the others. 01:03:05.500 |
Convolution, 512 to 8 channels, kernel size 3, and then padding 1. 01:03:05.500 |
Because just like before, we have the kernel size as 3. 01:03:25.580 |
But we have the padding that compensates for the reduction given by the kernel size. 01:03:30.220 |
But we are decreasing the number of features. 01:03:36.060 |
And I will show you later on the architecture what is the bottleneck. 01:03:57.980 |
Which also doesn't change the size of the image. 01:04:02.540 |
Because if you watch here, if you have a kernel size of 1, 01:04:08.700 |
each kernel basically is running over each pixel. 01:04:13.020 |
So each output actually captures the information of only one pixel. 01:04:16.620 |
So the output has the same dimension as the input. 01:04:18.860 |
And this is why here also we don't change the... 01:04:37.820 |
And this is the list of modules that will make up our encoder. 01:04:44.540 |
Before building the residual block and the attention block, 01:04:50.860 |
let's write the forward method and then we build the residual block. 01:05:25.740 |
And later I will show you why we need some noise. 01:05:27.740 |
That has the same size as the output of the encoder. 01:05:35.500 |
Okay, our input x will be of size batch size with some channels. 01:05:45.820 |
Initially it will be 3 because it's an image. 01:05:55.100 |
This noise has the same size as the output of the encoder. 01:06:16.940 |
Then we just run sequentially all of these modules. 01:06:21.340 |
And then there is one little thing here that in the convolutions that have the stride, 01:06:46.460 |
So if the module has a stride attribute and it's equal to (2, 2), 01:06:46.460 |
this convolution here and this convolution here, 01:06:59.980 |
we don't apply the padding in the convolution, because the padding there would be applied symmetrically to all sides of the image. 01:07:07.180 |
But we want to do an asymmetrical padding, so we do it manually. 01:07:14.380 |
Basically this says: add a layer of pixels on the right side and on the bottom of the image. 01:07:30.380 |
Because when you apply the padding, the tuple is padding left, padding right, padding top, padding bottom. 01:07:42.220 |
This means add a layer of pixels on the right side of the image and on the bottom of the image. 01:07:54.380 |
And then we apply it only for the convolutions that have the stride equal to 2. 01:08:05.660 |
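A sketch of what this forward pass looks like with the manual asymmetric padding; the class name and the trailing part of the method are illustrative:

```python
import torch
import torch.nn.functional as F

class VAE_Encoder(torch.nn.Sequential):
    def forward(self, x: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        for module in self:
            # Only the downsampling convolutions were built with stride 2 and no padding,
            # so we pad them manually and asymmetrically before applying them.
            if getattr(module, "stride", None) == (2, 2):
                # F.pad takes (pad_left, pad_right, pad_top, pad_bottom):
                # one extra column on the right, one extra row at the bottom.
                x = F.pad(x, (0, 1, 0, 1))
            x = module(x)
        # ... the mean / log-variance split and the sampling come next (see the later sketch)
        return x
```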
OK, now you may be wondering why are we building this kind of structure? 01:08:11.500 |
OK, usually in deep learning communities, especially during research, 01:08:18.460 |
So the people who made the stable diffusion, but also the people before them, 01:08:25.100 |
we check what models similar to the one we want to build 01:08:29.260 |
are already out there and they are working fine. 01:08:32.140 |
So very probably the people who built stable diffusion, 01:08:35.900 |
they saw that a model like this is working very well 01:08:39.100 |
for some previous project as a variational autoencoder. 01:08:42.700 |
They just modified it a little bit and kept it like it. 01:08:46.780 |
So for most choices, actually, there is no reason. 01:08:49.740 |
There is a historical reason, because it worked well in practice. 01:08:53.340 |
And we know that convolutions work well in practice 01:08:57.420 |
for image segmentation, for example, or anything related to computer vision. 01:09:01.340 |
And this is why they made the model like this. 01:09:04.940 |
So most encoders actually work like this, that we reduce the size of the image, 01:09:09.500 |
but at each step we keep increasing the features of the image, 01:09:12.860 |
the channels, the number of channels of the image. 01:09:18.620 |
but each pixel is represented by more than three channels. 01:09:23.980 |
Now, what we do here is we run our image sequentially, 01:09:30.940 |
one by one, through all of these modules here. 01:09:37.900 |
So first through this convolution, then through this residual block, 01:09:41.020 |
which is also some convolutions, then this residual block, 01:09:44.860 |
then again convolution, convolution, convolution, 01:09:46.940 |
until we run it through this attention block and et cetera. 01:09:50.460 |
This will transform the image into something smaller, 01:09:57.500 |
But as I showed you before, this is not an autoencoder. 01:10:02.940 |
So the variational autoencoder, let me show you again the picture here. 01:10:13.260 |
And this latent space are the parameters of a multivariate Gaussian distribution. 01:10:19.340 |
So actually, the variational autoencoder is trained to learn the mu and the sigma, 01:10:25.260 |
so the mean and the variance of this distribution. 01:10:30.460 |
And this is actually what we will get from the output of this variational autoencoder, 01:10:42.140 |
I made a previous video about the variational autoencoder, 01:10:45.100 |
in which I show you also why the history of why we do it like this, 01:10:51.820 |
But for now, just remember that this is not just a compressed version of the image, 01:10:58.300 |
And then we can sample from this distribution. 01:11:02.380 |
So the output of the variational autoencoder is actually the mean and the variance. 01:11:08.220 |
And actually, it's actually not the variance, but the log variance. 01:11:11.420 |
So the mean and the log variance are equal to torch.chunk(x, 2, dim=1). 01:11:27.500 |
So this basically converts batch size, 8 channels, height, 01:11:36.940 |
which is the output of the last layer of this encoder. 01:11:44.380 |
So this chunk basically means divide it into two tensors along this dimension. 01:11:49.500 |
So along this dimension, it will become two tensors of size, 01:11:55.980 |
So two tensors of shape, batch size 4, then height divided by 8, and width divided by 8. 01:12:12.540 |
And this basically, the output of this actually represents the mean and the variance. 01:12:21.660 |
And what we do, we don't want the log variance, we want the variance actually. 01:12:27.340 |
So to transform the log variance into variance, we do the exponentiation. 01:12:32.540 |
So the first thing actually we also need to do is to clamp this variance, 01:12:39.100 |
So clamping means that if the variance is too small or too big, 01:12:42.460 |
we want it to become within some ranges that are acceptable for us. 01:12:50.700 |
tells the PyTorch that if the value is too small or too big, make it within this range. 01:12:55.900 |
And this doesn't change the shape of the tensors. 01:13:04.300 |
And then we transform the log variance into variance. 01:13:09.660 |
So the variance is equal to the log variance dot exp, 01:13:16.140 |
So you delete the log and it becomes the variance. 01:13:18.620 |
And this also doesn't change the size of the shape of the tensor. 01:13:25.100 |
And then to calculate the standard deviation from the variance, 01:13:28.140 |
as you know, the standard deviation is the square root of the variance. 01:13:31.180 |
So standard deviation is the variance dot sqrt. 01:13:40.300 |
And also this doesn't change the size of the tensor. 01:13:44.540 |
OK, now what we want, as I told you before, this is a latent space. 01:13:51.500 |
It's a multivariate Gaussian, which has its own mean and its own variance. 01:13:56.060 |
And we know the mean and the variance, this mean and this variance. 01:14:03.740 |
Well, what we can sample from, basically, is N(0, 1). 01:14:12.620 |
how do we convert it into a sample of a given mean and the given variance? 01:14:19.100 |
This, as if you remember from probability and statistics, 01:14:25.340 |
you can convert it into any other sample of a Gaussian 01:14:28.860 |
with a given mean and a variance through this transformation. 01:14:32.140 |
So if z is equal to a sample from N(0, 1), 01:14:37.180 |
we can transform it into a sample of another Gaussian, let's call it x, 01:14:49.980 |
which is the mean of the new distribution plus the standard deviation of the new distribution multiplied by z. 01:14:54.780 |
This is the transformation, this is the formula from probability and statistics. 01:14:58.860 |
Basically means transform this distribution into this one, 01:15:03.020 |
which basically means sample from this distribution. 01:15:05.980 |
This is why we are given also the noise as input, 01:15:09.100 |
because the noise we want it to come from with a particular seed of the noise generator. 01:15:14.700 |
So we ask is as input and we sample from this distribution like this, 01:15:19.180 |
x is equal to mean plus standard deviation multiplied by noise. 01:15:25.180 |
Finally, there is also another step that we need to scale the output by a constant. 01:15:34.460 |
This constant, I found it in the original repository. 01:15:37.740 |
So I'm just writing it here without any explanation on why, 01:15:43.260 |
It's just a scaling constant that they use at the end. 01:15:46.460 |
I don't know if it's there for historical reason, 01:15:49.020 |
because they use some previous model that had this constant, 01:15:51.260 |
or they introduced it for some particular reason. 01:15:54.060 |
But it's a constant that I saw it in the original repository. 01:15:57.100 |
And actually, if you check the original parameters of the stable diffusion model, 01:16:02.460 |
So I am also scaling the output by this constant. 01:16:10.620 |
except that we didn't build the residual block and the attention block here, 01:16:14.940 |
we built the encoder part of the variational autoencoder and also the sampling part. 01:16:20.140 |
So we take the image, we run it through the encoder, it becomes very small. 01:16:26.940 |
And then we sample from that distribution given the mean and the variance. 01:16:31.340 |
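For reference, the tail of the encoder forward pass described above can be sketched like this; the clamp range and the 0.18215 scaling constant are the values I believe the original repository uses, so treat them as assumptions to verify:

```python
import torch

def encode_tail(x: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Last part of the encoder forward pass: from (B, 8, H/8, W/8) to a sampled latent."""
    mean, log_variance = torch.chunk(x, 2, dim=1)      # two tensors of shape (B, 4, H/8, W/8)
    log_variance = torch.clamp(log_variance, -30, 20)  # keep the value within an acceptable range
    variance = log_variance.exp()                      # log variance -> variance
    stdev = variance.sqrt()                            # variance -> standard deviation
    x = mean + stdev * noise                           # sample from N(mean, stdev^2) via N(0, 1)
    x *= 0.18215                                       # scaling constant from the original repo (assumed value)
    return x
```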
Now we need to build the decoder along with the residual block and the attention block. 01:16:38.780 |
we do the opposite of what we did in the encoder. 01:16:42.220 |
So we will reduce the number of channels and at the same time, 01:17:31.820 |
Let's define first the residual block, the one we defined before, 01:17:39.900 |
so that you understand what is this residual block. 01:17:42.860 |
And then we define the attention block that we defined before. 01:18:00.860 |
Okay, this is made up of normalization and convolutions, like I said before. 01:18:09.420 |
There is a normalization, which is group norm 1. 01:18:09.420 |
And then there is another group normalization. 01:19:05.900 |
Skip connection basically means that you take the input, 01:19:09.660 |
you skip some layers, and then you connect it there with the output of the last layer. 01:19:17.820 |
If the two channels are different, we need to create another intermediate layer. 01:20:05.260 |
Okay, the input of this residual layer, as you saw before, 01:20:10.780 |
is something that has a batch with some channels, 01:20:15.020 |
and then height and width, which can be different. 01:20:19.500 |
Sometimes it's 512 by 512, sometimes it's half of that, 01:20:25.980 |
So suppose x is (batch size, in channels, height, width). 01:20:41.420 |
We call it the residual or residue is equal to x. 01:20:53.660 |
And this doesn't change the shape of the tensor. 01:21:00.780 |
And this also doesn't change the size of the tensor. 01:21:10.300 |
This also doesn't change the size of the tensor, 01:21:18.220 |
because as you can see here, we have kernel size 3, yes, 01:21:22.860 |
With the padding of 1, actually, it will not change the size of the tensor. 01:21:28.140 |
Then we apply again the group normalization 2. 01:21:32.540 |
This again doesn't change the size of the tensor. 01:21:46.700 |
And finally, we apply the residual connection, 01:21:54.060 |
which basically means that we take x plus the residual. 01:21:58.940 |
But if the number of output channels is not equal to the input channels, 01:22:07.420 |
because this dimension will not match between the two. 01:22:13.660 |
to convert the input channels to the output channels of x, 01:22:19.180 |
So what we do is, we apply this residual layer. 01:22:29.660 |
So as I told you, it's just a bunch of convolutions and group normalization. 01:22:33.580 |
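A sketch of this residual block with the pieces named so far (GroupNorm with 32 groups, two 3x3 convolutions, SiLU activations, and a 1x1 convolution for the skip connection when the channel counts differ); the exact class and attribute names are illustrative:

```python
import torch
from torch import nn
from torch.nn import functional as F

class VAE_ResidualBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.groupnorm_1 = nn.GroupNorm(32, in_channels)
        self.conv_1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.groupnorm_2 = nn.GroupNorm(32, out_channels)
        self.conv_2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        if in_channels == out_channels:
            self.residual_layer = nn.Identity()
        else:
            # 1x1 convolution so the skip connection matches the output channels
            self.residual_layer = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residue = x
        x = self.groupnorm_1(x)
        x = F.silu(x)
        x = self.conv_1(x)
        x = self.groupnorm_2(x)
        x = F.silu(x)
        x = self.conv_2(x)
        return x + self.residual_layer(residue)   # skip connection
```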
And for those who are familiar with the computer vision models, 01:22:43.180 |
Let's go build the attention block that we used also before in the encoder. 01:22:49.100 |
And to define the attention, we also need to define the self-attention. 01:22:55.660 |
which is used in the variational autoencoder. 01:22:57.340 |
And then we define what is this self-attention. 01:23:28.140 |
Again, the channel is always 32 here in stable diffusion. 01:23:34.060 |
But you also may be wondering, what is group normalization, right? 01:23:37.420 |
So let's go to review it, actually, since we are here. 01:23:40.780 |
And, okay, if you remember from my previous slides on LLaMA, 01:23:47.500 |
let's go here, where we use a layer normalization. 01:23:52.620 |
And also in the vanilla transformer, actually, we use layer normalization. 01:24:00.220 |
Normalization is basically when we have a deep neural network, 01:24:03.660 |
each layer of the network produces some output that is fed to the next layer. 01:24:08.700 |
Now, what happens is that if the output of a layer is varying in distribution, 01:24:14.700 |
so sometimes, for example, the output of a layer is between 0 and 1, 01:24:18.380 |
but the next step, maybe it's between 3 and 5, 01:24:22.140 |
and the next step, maybe it's between 10 and 15, etc. 01:24:25.740 |
So the distribution of the output of a layer changes, 01:24:32.060 |
that is very different from what the layer is used to see. 01:24:36.860 |
This will basically push the output of the next layer into a new distribution itself, 01:24:42.620 |
which, in turn, will push the loss function into, 01:24:45.900 |
basically, the output of the model to change very frequently in distribution. 01:24:56.940 |
sometimes it will be negative, sometimes it will be positive, etc. 01:24:59.740 |
And this basically makes the loss function oscillate too much, 01:25:05.980 |
So what we do is we normalize the values before feeding them into layers, 01:25:09.740 |
such that each layer always sees the same distribution of the data. 01:25:13.820 |
So it will always see numbers that are distributed around 0 with a variance of 1. 01:25:19.260 |
And this is the job of the layer normalization. 01:25:21.580 |
So imagine you are a layer, and you have some input, 01:25:27.260 |
Each item has some features, so feature 1, feature 2, feature 3. 01:25:31.180 |
Layer normalization calculates a mean and the variance over these features here, 01:25:38.380 |
and then normalizes this value according to this formula. 01:25:42.140 |
So each value basically becomes distributed around 0 with a variance of 1. 01:25:46.460 |
With batch normalization, we normalize by columns, 01:25:50.700 |
so the statistics mean and the sigma is calculated by columns. 01:25:54.380 |
With layer normalization, it is calculated by rows, 01:26:03.100 |
Group normalization is like layer normalization, but not over all of the features of the item; the features are grouped. 01:26:10.860 |
So for example, imagine you have four features here. 01:26:13.500 |
So here you have F1, F2, F3, F4, and you have two groups. 01:26:18.620 |
Then the first group will be F1 and F2, and the second group will be F3 and F4. 01:26:26.860 |
one for the first group, one for the second group. 01:26:32.540 |
Why do we want to group this kind of features? 01:26:35.740 |
Because these features actually, they come from convolutions. 01:26:39.180 |
And as we saw before, let's go back to the website. 01:26:45.740 |
Each output here actually comes from local area of the image. 01:26:55.260 |
two things that are close to each other, may be related to each other. 01:26:59.340 |
So two things that are far from each other are not related to each other. 01:27:02.540 |
This is why we can group, we can use group normalization in this case. 01:27:07.260 |
Because closer features to each other will have kind of the same distribution, 01:27:15.020 |
and things that are far from each other may not. 01:27:17.580 |
This is the basic idea behind group normalization. 01:27:20.300 |
But the whole idea behind the normalization is that 01:27:22.700 |
we don't want these things to oscillate too much. 01:27:25.260 |
Otherwise, the loss of function will oscillate 01:27:28.860 |
With normalization, we make the training faster. 01:27:36.300 |
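A quick comparison of the three normalizations on a (batch, channels, height, width) tensor; the shapes and the 32 groups are illustrative:

```python
import torch
from torch import nn

x = torch.randn(4, 128, 32, 32)            # (batch, channels, height, width)

bn = nn.BatchNorm2d(128)                   # statistics per channel, computed across the batch
ln = nn.LayerNorm([128, 32, 32])           # statistics per item, across all of its features
gn = nn.GroupNorm(32, 128)                 # statistics per item, within groups of 128/32 = 4 channels

print(bn(x).shape, ln(x).shape, gn(x).shape)  # all keep the shape: torch.Size([4, 128, 32, 32])
```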
So now the attention block has this group normalization and also an attention, 01:27:54.300 |
It takes a torch.Tensor and returns, of course, a torch.Tensor. 01:28:02.060 |
The input of this block is something, where is it? 01:28:07.500 |
It's something in the form of batch size, number of channels, height and width. 01:28:11.980 |
But because it will be used in many positions, this attention block, 01:28:17.500 |
So we just say that x is something that is a batch size, 01:28:21.900 |
features or channels, if you want, height and width. 01:28:28.940 |
And the first thing we do is we extract the shape. 01:28:34.860 |
So n is the batch size, the number of channels, 01:28:38.220 |
the height and the width is equal to x.shape. 01:28:45.340 |
we do the self-attention between all the pixels of this image. 01:28:50.700 |
This will transform this tensor here into this tensor here. 01:29:06.620 |
So now we have a sequence where each item represents a pixel 01:29:20.620 |
This will transform this shape into this shape. 01:29:31.180 |
So this one comes before and features becomes the last one. 01:29:43.660 |
this is like when we do the attention in the transformer model. 01:29:47.500 |
So in the transformer model, we have a sequence of tokens. 01:29:50.220 |
Each token is representing, for example, a word. 01:29:52.700 |
And the attention basically calculates the attention between each token. 01:29:57.180 |
So how do two tokens are related to each other? 01:30:00.220 |
In this case, we can think of it as a sequence of pixels. 01:30:03.420 |
Each pixel with its own embedding, which is the features of that pixel. 01:30:17.260 |
In which self-attention means that the query key and values are the same input. 01:30:36.060 |
So because we put it in this form only to do attention. 01:30:51.900 |
And then again, we remove this multiplication by viewing again the tensor. 01:31:17.660 |
The residual connection will not change the size of the input. 01:31:25.900 |
Let me check also the residual connection here. 01:31:29.260 |
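A sketch of the attention block's forward pass, turning the pixels into a sequence for self-attention and back; the single-head attention here is a stand-in for the SelfAttention class we build next:

```python
import torch
from torch import nn

class VAE_AttentionBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.groupnorm = nn.GroupNorm(32, channels)
        # Stand-in for the single-head SelfAttention class built in the next section.
        self.attention = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residue = x
        x = self.groupnorm(x)
        n, c, h, w = x.shape
        x = x.view(n, c, h * w)          # (B, C, H, W) -> (B, C, H*W)
        x = x.transpose(-1, -2)          # (B, C, H*W) -> (B, H*W, C): a sequence of pixels
        x, _ = self.attention(x, x, x)   # self-attention: query, key and value are the same
        x = x.transpose(-1, -2)          # back to (B, C, H*W)
        x = x.view(n, c, h, w)           # back to (B, C, H, W)
        return x + residue               # residual connection keeps the shape unchanged
```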
Now that we have also built the attention block, let's build also the self-attention. 01:31:35.180 |
And the attentions, because we have two kinds of attention in the stable diffusion. 01:31:44.940 |
So let's go build it in a separate class called "Attention". 01:32:18.940 |
I think you guys maybe want to review the attention before building it. 01:32:28.860 |
I have here opened my slides from my video about the attention model for the transformer model. 01:32:35.740 |
So the self-attention, basically, it's a way for, especially in a language model, 01:32:42.140 |
is a way for us to relate tokens to each other. 01:32:47.420 |
Each one of them having an embedding of size d model. 01:32:50.540 |
And we transform it into queries, key, and values. 01:32:53.260 |
In which query, key, and values in the self-attention are the same matrix, same sequence. 01:33:01.180 |
So wq, wk, and wv, which are parameter matrices. 01:33:05.340 |
Then we split them along the d model dimension into number of heads. 01:33:12.700 |
In our case, the one attention that we will do here is actually only one head. 01:33:19.740 |
And then we calculate the attention for each of this head. 01:33:23.580 |
Then we combine back by concatenating this head together. 01:33:28.780 |
We multiply this output matrix of the concatenation with another matrix called wo, 01:33:36.620 |
And then this is the output of the multi-head attention. 01:33:40.700 |
If we have only one head, instead of being a multi-head, 01:33:44.540 |
then we will not do this splitting operation. 01:33:46.940 |
We will just do this multiplication with the w and with the wo. 01:33:51.260 |
And OK, this is how the self-attention works. 01:33:54.940 |
So in a self-attention, we have this query key and values coming from the same matrix input. 01:34:16.700 |
But in our case, we are not talking about tokens. 01:34:21.340 |
And we can think that the number of channels of each pixel is the embedding of the pixel. 01:34:26.700 |
So the embedding, just like in the original transformer, 01:34:30.380 |
the embeddings are the kind of vectors that capture the meaning of the word. 01:34:36.540 |
Each channel, each pixel represented by many channels 01:34:39.900 |
that capture the information about that pixel. 01:34:41.980 |
Here we have also the bias for the w matrices, 01:34:49.420 |
which we don't have in the original transformer. 01:35:09.580 |
We will represent it as one big linear layer. 01:35:11.980 |
Instead of representing it as three different matrices, it's possible. 01:35:17.500 |
We just say that it's one big matrix, of size three times the embedding dimension. 01:37:30.620 |
because it's a projection of the input before we apply the attention. 01:35:34.300 |
And then there is an output projection, which is applied after we apply the attention. 01:35:47.100 |
So as you remember here, the wo matrix is actually d_model by d_model. 01:36:08.060 |
And then we saved the dimension of each head. 01:36:15.500 |
The dimension of each head basically means that if we have multi head, 01:36:18.780 |
each head will watch a part of the embedding of each token. 01:36:43.660 |
As you remember, the mask is a way to avoid relating tokens, 01:36:47.580 |
one particular token with the tokens that come after it, 01:36:58.380 |
If you really are not understanding what is happening here in the attention, 01:37:05.500 |
I highly recommend you watch my previous video, 01:37:09.980 |
And if you watch it, it will take not so much time. 01:37:16.300 |
So the first thing we do is extract the shape. 01:37:31.420 |
So the batch size, the sequence length and the embedding are equal to the input shape. 01:37:38.780 |
And then we say that we will convert it into another shape 01:37:48.460 |
This is called the interim shape, intermediate shape. 01:38:06.700 |
We apply the in projection, so the wq, wk and wv matrices, to the input, 01:38:16.620 |
We multiply it, but then we split it with chunk. 01:38:23.900 |
Basically, we will multiply the input with the big matrix 01:38:31.420 |
but then we split it back into three smaller matrices. 01:38:34.700 |
This is the same as applying three different projections. 01:38:38.960 |
It's the same as applying three separate in projections, 01:38:43.500 |
but it's also possible to combine it in one big matrix. 01:38:46.860 |
This, what we will do, basically it will convert batch size, 01:38:56.060 |
sequence length, dimension into batch size, sequence length, dimension multiplied by three. 01:39:05.740 |
And then by using chunk, we split it along the last dimension 01:39:09.820 |
into three different tensors of shape, batch size, sequence length and dimension. 01:39:23.900 |
Okay, now we can split the query key and values in the number of heads. 01:39:31.180 |
According to the number of heads, this is why we built this shape, 01:39:34.220 |
which means split the dimension, the last dimension into n heads. 01:40:08.620 |
batch size, sequence length, dimension into batch size, sequence length, 01:40:17.180 |
then h, so the number of heads and each dimension divided by the number of heads. 01:40:26.060 |
but only a part of the embedding of each token, in this case, pixel. 01:40:35.180 |
So the full dimension, the embedding divided by the number of heads. 01:40:39.500 |
And then this will convert it, because we are also transposing, 01:40:55.900 |
So each head will watch all the sequence, but only a part of the embedding. 01:41:03.180 |
We then calculate the attention, just like the formula. 01:41:07.100 |
So query multiplied by the transpose of the keys. 01:41:09.900 |
So is the query, matrix multiplication with the transpose of the keys. 01:41:19.740 |
batch size, h, sequence length by sequence length. 01:41:30.860 |
As you remember, the mask is something that we apply when we calculate the attention, 01:41:35.980 |
if we don't want two tokens to relate to each other. 01:41:41.820 |
In this matrix, we substitute the interaction with minus infinity before applying the softmax, 01:41:54.540 |
This will create a causal mask, basically a mask where the upper triangle, 01:42:00.700 |
so above the principal diagonal, is made up of ones. 01:42:32.940 |
Masked fill, but with mask, and we put minus infinity, like this. 01:42:46.220 |
As you remember, the formula of the transformer is a 01:42:48.860 |
query multiplied by the transpose of the keys, and then divided by the square root of d_model. 01:42:55.580 |
Here, since we split into heads, we divide by the square root of d_head instead. 01:43:24.540 |
So we want to remove, now we want to remove the head dimension. 01:43:29.420 |
So output is equal to, let me write some shapes. 01:43:37.020 |
This is equal to batch size, H, sequence length by sequence length, 01:43:45.580 |
multiplied, so matrix multiplication, with batch size, H, sequence length, dimension divided by H. 01:43:56.620 |
This will result into batch size, H, sequence length, and dimension divided by H. 01:44:09.820 |
This we then multiplied by the, we then transpose. 01:44:15.660 |
And this will result into, so we start with this one. 01:44:23.580 |
And it becomes batch size, sequence length, H, dimension divided by H. 01:44:44.620 |
Then we can reshape as the input, like the initial shape, so this one. 01:45:26.220 |
Now let's go back to continue building the decoder. 01:45:29.100 |
For now we have built the attention block and the residual block. 01:45:45.740 |
And also this one is a sequence of modules that we will apply one after another. 01:46:00.860 |
We start with the convolution just like before. 01:46:03.180 |
Now I will not write again the shapes change, but you got the idea. 01:46:07.100 |
In the encoder we, in the encoder, let me show you here. 01:46:17.900 |
In the encoder we keep reducing the size of the image until it becomes small. 01:46:22.540 |
In the decoder we need to return to the original size of the image. 01:46:27.180 |
So we start with the latent dimension and we return to the original dimension of the image. 01:46:39.100 |
So we start with four channels and we output four channels. 01:47:03.100 |
Then we have a residual block just like before. 01:47:23.180 |
Then we have a bunch of residual blocks and we have four of them. 01:47:45.360 |
Now the residual blocks, let me write some shapes here. 01:47:49.820 |
Here we arrived to a situation in which we have batch size. 01:47:53.660 |
We have 512 features and the size of the image still didn't grow 01:47:58.700 |
because we didn't have any convolution that will make it grow. 01:48:02.140 |
This one of course will remain the same because it's a residual block and etc. 01:48:15.580 |
So now the image is actually height divided by 8 which height as you remember is 512, 01:48:21.740 |
the size of the image that we are working with. 01:48:31.260 |
The upsample, we have to think of it like when we resize an image. 01:48:42.220 |
So imagine you have an image that is 64 by 64 01:48:48.860 |
The upsample will do it just like when we resize an image. 01:48:57.580 |
So along the dimensions right and down for example twice. 01:49:02.220 |
So that the total amount of pixels, the height and the width actually doubles. 01:49:10.860 |
It will just replicate each pixel so that by this scale factor along each dimension. 01:49:24.060 |
So height divided by 8 and width divided by 8 become height divided by 4 and width divided by 4, as we see here. 01:50:09.340 |
But in this case we have three of them, 2, 3. 01:50:14.060 |
This will again double the size of the image. 01:50:17.660 |
So we have another one that will double the size of the image. 01:50:23.340 |
So now our image which was divided by 4 with 512 channels. 01:50:45.980 |
And then we have three residual blocks again. 01:50:54.940 |
But this time we reduce the number of features. 01:51:06.780 |
Okay, then we have another upsampling which will again double the size of the image. 01:51:14.300 |
And this time we will go from divided by 2 up to the original size. 01:51:26.220 |
And because the number of channels has changed, we are not 512 anymore. 01:51:35.820 |
This case with 256 because it's the new number of features. 01:51:40.540 |
Then we have another bunch of residual blocks that will decrease the number of features. 01:52:12.700 |
So we group features in groups of 32 before calculating the mu and the sigma before normalizing. 01:52:20.540 |
And we define the number of channels as 128 which is the number of features that we have. 01:52:25.180 |
So this group normalization will divide these 128 features into groups of 32. 01:52:41.340 |
The final convolution that will transform into an image with the three channels. 01:52:47.660 |
So RGB by applying these convolutions here which doesn't change the size of the output. 01:52:54.380 |
So we'll go from an image that is batch size 128 height width. 01:53:02.380 |
Because after the last upsampling we become of the original size into an image with only three channels. 01:53:36.780 |
I'm sorry if I'm putting a lot of spaces between here. 01:53:39.580 |
But otherwise it's easy to get lost and not understand where we are. 01:53:43.580 |
So here the input of the decoder is our latent. 01:53:50.620 |
So it's batch size 4 height divided by 8 width divided by 8. 01:53:57.500 |
As you remember, here in the encoder the last thing we do is scale by this constant, so in the decoder we reverse it. 01:54:13.980 |
And then return x which is batch size 3 height and width. 01:54:31.260 |
Let me also write the input of this decoder which is this one. 01:54:43.740 |
We are building our architecture of the stable diffusion. 01:54:50.620 |
So far we have built the encoder and the decoder. 01:54:53.980 |
But now we have to build the unit and then we have to build the clip text encoder. 01:55:01.020 |
And finally we have to build the pipeline that will connect all of these things. 01:55:05.740 |
So it's going to be a long journey but it's fun actually to build things. 01:55:10.220 |
Because you learn every detail of how they work. 01:55:13.180 |
So the next thing that we are going to build is the text encoder. 01:55:16.780 |
So this clip encoder here that will allow us to encode the prompt into embeddings 01:55:22.700 |
that we can then feed to this unit model here. 01:55:28.620 |
And we will of course use a pre-trained version. 01:55:31.180 |
So by downloading the vocabulary and I will show you how it works. 01:55:39.340 |
We create a new file in the sd folder called clip.py. 01:56:08.060 |
And we also import self-attention because we will be using it. 01:56:15.100 |
It's very similar to the encoder layer of the transformer. 01:56:24.140 |
This is the encoder layer of the transformer. 01:56:26.460 |
It's made of attention and then feed forwards. 01:56:30.060 |
And there are many blocks like this one after another that are applied one after another. 01:56:34.140 |
We also have something that tells the position of each token inside of the sentence. 01:56:38.700 |
And we will also have something similar in clip. 01:56:41.020 |
So we need to build something very similar to this one. 01:56:44.460 |
And actually this is why I mean the transformer model was very successful. 01:56:48.140 |
So that's why they use the same structure of course also for this purpose. 01:56:57.500 |
I will build first the skeleton of the model and then we will build each block. 01:57:15.820 |
The embeddings allow us to convert the tokens. 01:57:19.660 |
So as you remember, when you have a sentence made up of text, it is first converted into numbers (tokens). 01:57:25.260 |
Where each number indicates the position of the token inside of the vocabulary. 01:57:31.180 |
Where each embedding represents a vector of size 512 in the original transformer. 01:57:40.300 |
And each vector kind of represents the meaning that the word or the token captures. 01:58:00.140 |
The maximum sequence length that we can have. 01:58:05.260 |
We should actually use some configuration file to save these values, 01:58:09.820 |
but because we will be using the pre-trained stable diffusion model, I am hard-coding them. 01:58:16.060 |
But in the future I will refactor the code to add some configuration actually. 01:58:37.340 |
Which indicates the number of heads of the multi-head attention. 01:59:25.340 |
That indicate the position of each token inside of the vocabulary. 02:00:18.060 |
And the last one we apply the layer normalization. 02:00:44.140 |
The shape of the input should match the shape of the output. 02:00:47.740 |
So we always obtain sequence length by the model. 02:01:40.540 |
We need to tell him what is the number of embeddings. 02:01:51.260 |
And what is the dimension of each vector of the embedding token. 02:01:59.580 |
The positional encoding in the original transformer. 02:02:13.260 |
That are learned by the model during training. 02:02:15.980 |
That tell the position of the token to the model. 02:02:51.980 |
And then just like in the original transformer. 02:03:02.460 |
We add the positional encodings to each token. 02:03:17.260 |
And then later we will load these parameters. 02:03:26.220 |
Which is just like the layer of the transformer model. 02:05:51.980 |
Because it's the same input that becomes query key and values. 02:06:07.580 |
And the dimension of the embedding which is 768. 02:06:11.980 |
The first thing we do is we apply the self attention. 02:06:39.180 |
Which basically means that every token cannot watch the next tokens. 02:06:46.860 |
And this is what we want from a text model actually. 02:06:49.900 |
We don't want the one word to watch the words that come after it. 02:07:28.780 |
Because here we are already familiar with the structure of the transformer. 02:07:35.900 |
We apply the first linear of the feed forward. 02:07:49.740 |
And actually we call the QuickGELU function. 02:08:10.140 |
So this is called the QuickGELU activation function. 02:08:17.500 |
There is no justification on why we should use this one and not another one. 02:08:23.420 |
They just saw that in practice this one works better for this kind of application. 02:08:27.100 |
So that's why we are using this function here. 02:08:40.380 |
This is exactly like the feed forward layer of the transformer. 02:08:44.940 |
Except that in the transformer we don't have this activation function. 02:08:48.780 |
And if you remember, in LLaMA we don't use the ReLU function either. 02:08:54.300 |
But here we are using the QuickGELU function. 02:08:58.780 |
And in practice it works well for this model. 02:09:15.900 |
So we have built the variational autoencoder. 02:09:21.580 |
Now the next thing we have to build is this unit. 02:09:25.340 |
As you remember, the UNet is the network to which we give some noisified image, 02:09:33.420 |
and we also indicate to the network the amount of noise that we added to this image, 02:09:38.940 |
so that the model can predict how much noise is there. 02:09:59.580 |
So we reduce the size of the image while increasing its features, exactly what we did in the encoder of the variational autoencoder, 02:10:07.740 |
and then we bring it back up, just like we did with the decoder of the variational autoencoder. 02:10:10.540 |
So now again we will work with some convolutions. 02:10:17.260 |
The one big difference is that we need to tell our UNet the noise level. 02:10:29.500 |
So the time step at which this noise was added. 02:10:34.540 |
Because as you remember we need to also tell this unit what is our prompt. 02:10:40.060 |
Because we need to tell him how we want our output image to be. 02:10:44.940 |
Because there are many ways to denoise the initial noise. 02:10:47.660 |
So if we want the initial noise to become a dog. 02:10:52.140 |
If we want the initial noise to become a cat. 02:10:58.620 |
And also he has to relate this prompt with the rest of the information. 02:11:03.180 |
And what is the best way to combine two different kinds of information? 02:11:10.220 |
We will use what is called the cross attention. 02:11:13.660 |
Cross attention basically allows us to calculate the attention between two sequences, 02:11:21.980 |
where the queries come from one sequence and the keys and the values come from another sequence. 02:11:25.340 |
So let's go build it and let's see how this works. 02:11:28.060 |
Now the first thing we will do is create a new class. 02:11:39.900 |
And I think also here I will build from top down. 02:11:49.740 |
Let's start by importing the usual libraries. 02:12:56.140 |
So because we need to give the UNet not only the noisified image 02:13:00.700 |
but also the time step at which it was noisified, 02:13:03.500 |
the UNet needs some way to understand this time step. 02:13:10.300 |
So this is why this time step, which is a number, 02:13:14.620 |
will be encoded by using this particular module called time embedding. 02:13:39.020 |
As you remember, the UNet will receive the latent, 02:13:48.220 |
which is the output of the variational autoencoder. 02:14:42.300 |
Which we already converted with the clip encoder here. 02:15:23.180 |
It's actually a number that is multiplied by some frequencies, with sines and cosines, 02:15:30.940 |
because they saw that this positional encoding works for the transformer, 02:15:33.500 |
so we can also use the same positional encoding 02:15:42.300 |
to indicate at which step we arrived in the denoisification. 02:15:53.260 |
The time embedding will then convert it from a tensor of size 1 by 320 into a tensor of size 1 by 1280. 02:18:04.060 |
So the output dimension must match the input dimension. 02:19:32.300 |
What we do is first we apply this first layer. 02:20:13.660 |
And then we build each of the blocks that it will require. 02:20:36.380 |
As you can see, the unit is made up of one encoder branch. 02:20:40.060 |
So this is like the encoder of the variational autoencoder. 02:20:45.340 |
So the image becomes smaller, smaller, smaller. 02:20:58.860 |
The image from the very small size becomes the original size. 02:21:02.700 |
And then we have these skip connections between the encoder and the decoder. 02:21:07.020 |
So the output of each layer of each step of the encoder 02:21:12.460 |
is connected to the same step of the decoder on the other side. 02:21:19.500 |
So we start building the left side, which is the encoders. 02:21:25.100 |
And to build these encoders, we need to define a special layer, basically, that will apply... 02:21:37.980 |
Okay, let's build it and then I will describe it. 02:21:48.060 |
And basically, this switchSequential, given a list of layers, will apply them one by one. 02:22:07.580 |
But it can recognize what are the parameters of each of them and will apply accordingly. 02:22:17.180 |
So first we have, just like before, a convolution. 02:22:20.300 |
Because we want to increase the number of channels. 02:22:23.100 |
So as you can see, at the beginning, we increase the number of channels of the image. 02:22:29.900 |
And then we have another one of this switchSequential. 02:22:44.700 |
But it's very similar to the residual block that we built already for the variational autoencoder. 02:22:50.460 |
And then we have an attention block, which is also very similar to the attention block 02:22:54.300 |
that we built for the variational autoencoder. 02:23:03.900 |
Then we have-- OK, I think it's better to build this switchSequential. 02:23:21.260 |
But given x, which is our latent, which is a torch.tensor, our context, so our prompt. 02:23:49.340 |
So if the layer is a unit attention block, for example. 02:24:00.620 |
Because this attention block basically will compute the cross-attention between 02:24:06.620 |
This residual block will compute-- will match our latent with its time step. 02:24:22.860 |
And then if it's any other layer, we just apply it. 02:24:27.100 |
And then we return x, but only after the for loop. 02:24:37.020 |
We just need to define this residual block and this attention block. 02:24:40.220 |
Then we have another sequence-- sequential switch. 02:24:49.020 |
So the code I'm writing actually is based on a repository. 02:24:55.580 |
Upon which actually most of the code I wrote is based on. 02:24:59.100 |
Which is in turn based on another repository, 02:25:01.180 |
which was originally written for TensorFlow, if I remember correctly. 02:25:04.460 |
So actually, the code for stable diffusion-- because it's a model that is built by 02:25:11.500 |
CompVis group at the LMU university, of course, it cannot be different from that code. 02:25:15.740 |
So most of the code are actually similar to each other. 02:25:19.020 |
I mean, you cannot create the same model and change the code. 02:25:25.500 |
So we again use this one-- switch sequential. 02:26:00.940 |
And then we have an attention block of 8 to 80. 02:26:05.340 |
And this attention block takes the number of head. 02:26:14.780 |
We will see later how we transform this, the output of this, 02:26:19.260 |
into a sequence so that we can run attention on it. 02:26:42.140 |
Convolution of size from 640 to 640 channels. 02:26:51.900 |
Then we have another residual block that will again increase the features. 02:27:05.020 |
And then we have an attention block of 8 heads and 160 is the embedding size. 02:27:15.820 |
Then we have another residual block of 1280 and 8 and 160. 02:27:28.860 |
So as you can see, just like in the encoder of the variational autoencoder, 02:27:31.980 |
we, with these convolutions, we keep decreasing the size of the image. 02:27:38.940 |
So actually here we started with the latent representation, 02:27:42.620 |
which was height divided by 8 and width divided by 8. 02:27:49.100 |
At least you need to understand the size changes. 02:27:52.700 |
So batch size for height divided by 8 and width divided by 8. 02:27:58.940 |
When we apply this convolution, it will become divided by 16. 02:28:17.020 |
And after we apply the second one, it will become divided by 32. 02:28:34.060 |
That if the initial image was of size 512, the latent is of size 64 by 64. 02:28:45.580 |
And then we apply these residual connections. 02:28:51.020 |
And then we apply another convolutional layer, 02:28:57.820 |
which will reduce the size of the image further. 02:29:01.660 |
So from 32 here, divide by 32 and divide by 32 to divide by 64. 02:29:12.460 |
Every time we divided the size of the image by 2. 02:29:46.060 |
And then we have a last one, which is another one of the same size. 02:29:51.740 |
So now we have an image that is divided by 64, so height divided by 64 and width divided by 64, 02:30:02.620 |
with 1280 channels. 02:30:09.420 |
Because the residual connections don't change the size. 02:30:31.020 |
So as I said before, we keep reducing the size of the image, 02:30:35.900 |
but we keep increasing this number of features of each pixel basically. 02:31:24.140 |
So in the decoder, we will do the opposite of what we did in the encoder. 02:31:27.660 |
So we will reduce the number of features, but increase the image size. 02:31:36.700 |
Again, let's start with our beautiful switch sequential. 02:31:54.620 |
Why here is 2560 even if after the bottleneck we have 1280? 02:32:11.420 |
so the input of this side here of the UNet is the output of the bottleneck. 02:32:16.060 |
But the bottleneck is outputting 1280 features, 02:32:20.140 |
while the decoder is expecting 2560, so double the amount. 02:32:25.180 |
Why? Because we need to consider that we have this skip connection here. 02:32:29.020 |
So this skip connection will double the amount at each layer here. 02:32:33.020 |
And this is why the input we expect here is double the size 02:32:44.140 |
The image is very small, so height and width divided by 64. 02:32:59.260 |
Then we apply another switch sequential of the same size. 02:33:12.060 |
just like we did in the variational autoencoder. 02:33:14.140 |
So if you remember in the variational autoencoder, 02:33:16.300 |
to increase the size of the image we do upsampling. 02:33:23.420 |
But this is not exactly the same upsample module that we used there, so we define a new one. 02:33:57.020 |
Then we have another one with an attention block. 02:34:28.700 |
So I know that I'm not writing all the shapes, 02:34:32.380 |
but otherwise it's a really tiring job and very long. 02:34:37.180 |
So just remember that we are keep increasing the size of the image, 02:34:44.380 |
Later we will see that this number here will become very small, 02:34:47.580 |
and the size of the image will get back nearly to the original. 02:34:57.580 |
So as you can see, we are decreasing the features here. 02:35:05.820 |
Then we have 8 by 80, and we are increasing also here the size. 02:35:34.540 |
8 heads with the dimensions embedding size of 80, 02:35:43.100 |
And then we have another residual block with attention. 02:35:58.620 |
Then we have another one, which is 640 to 320, with 8 heads and 40. 02:36:09.580 |
And finally, the last one, we have 640 by 320. 02:36:19.980 |
This dimension here is the same that will be applied by the output of the unit, 02:36:31.180 |
And then we will give it to the final layer to build the original latent size. 02:36:34.620 |
Okay, let's build all these blocks that we didn't build before. 02:36:45.900 |
Let's build it here, which is exactly the same as the two. 02:37:18.300 |
And this is also doesn't change the size of the image, actually. 02:37:28.700 |
So we will go from batch channels or features. 02:37:36.700 |
Let's call it features height width to batch size features. 02:37:46.380 |
Height multiplied by 2 and width multiplied by 2. 02:37:58.380 |
We use F.interpolate on x, with scale_factor equal to 2 and mode equal to nearest, which is the same operation that we did here. 02:38:26.700 |
And we also have to define for the output layer. 02:38:30.380 |
And we also have to define the attention block and the residual block. 02:38:47.580 |
So let's... this one also has a group normalization. 02:39:25.840 |
The final layer needs to convert this shape into this shape. 02:39:33.740 |
We have... so we have an input which is batch size of 320 features. 02:40:02.540 |
This will basically... the convolution... let me write also why we are reducing the size. 02:40:09.660 |
This convolution will change the number of channels from in to out. 02:40:13.180 |
And when we will declare it, we say that we want to convert from 320 to 4 here. 02:40:30.620 |
Then we need to go build this residual block and this attention block here. 02:40:44.060 |
which is very similar to the residual block that we built for the variational autoencoder. 02:41:16.220 |
As you remember, with the time embedding, we transform into an embedding of size 1280. 02:42:37.500 |
Again, just like before, we have if the in channels is equal to the out channels, 02:42:45.260 |
we can connect them directly with the residual connection. 02:42:54.220 |
Otherwise, we create a convolution to connect them, 02:42:57.180 |
to convert the size of the input into the output. 02:43:27.500 |
So it takes in as input this feature tensor, which is actually the latent 02:43:45.100 |
And then also the time embedding, which is 1 by 1280, just like here. 02:43:50.460 |
And we build, first of all, a residual connection. 02:44:01.580 |
So usually the residual connection, the residual blocks are more or less always the same. 02:44:05.500 |
So there is a normalization and activation function. 02:44:08.780 |
Then we can have some skip connection, etc, etc. 02:44:41.420 |
here we are merging the latents with the time embedding, 02:45:03.180 |
but the time embedding doesn't have the height and the width dimensions, so we need to unsqueeze it. 02:45:37.500 |
And finally, we apply the residual connection. 02:45:49.900 |
Well, the idea is that here we have three inputs. 02:45:52.780 |
We have the time embedding, we have the latent, we have the prompt. 02:45:56.300 |
We need to find a way to combine the three information together. 02:46:00.140 |
So the unit needs to learn to detect the noise present in a noisified image 02:46:05.100 |
at a particular time step using a particular prompt as a condition. 02:46:09.100 |
Which means that the model needs to recognize this time embedding 02:46:14.860 |
and needs to relate this time embedding with the latents. 02:46:17.980 |
And this is exactly what we are doing in this residual block here. 02:46:21.100 |
We are relating the latent with the time embedding, 02:46:24.460 |
so that the output will depend on the combination of both, 02:46:29.020 |
not on the single noise or in the single time step. 02:46:31.820 |
And this will also be done with the context using cross-attention 02:46:35.980 |
in the attention block that we will build now. 02:47:09.340 |
Okay, I will define some layers that for now will not make much sense, 02:47:18.140 |
but later they will make sense when we make the forward method. 02:48:00.700 |
I think he already has food, but maybe he wants to eat something special today. 02:48:06.540 |
So, let me finish this attention block and the unit and then I'm all his. 02:48:24.940 |
As you remember, the self-attention we can have the bias for the W matrices. 02:48:30.060 |
Here we don't have any bias, just like in the vanilla transformer. 02:48:35.660 |
Then we have a layer normalization, self.layernorm 2, 02:48:47.020 |
We will see later why we need all this attention, 02:48:51.260 |
It's a cross-attention and we will see later how it works. 02:49:23.820 |
this is because we are using a function that is called the 02:50:08.700 |
Then we have our context, which is our prompt, 02:50:11.660 |
which is a batch size, sequence length, dimension. 02:50:19.260 |
So, the first thing we will do is we will do the normalization. 02:50:25.100 |
So, just like in the transformer, we will take the input, 02:50:28.540 |
so our latents, and we apply the normalization and the convolution. 02:50:32.700 |
Actually, in the transformer, there is no convolution, 02:51:01.900 |
which also doesn't change the size of the tensor. 02:51:08.460 |
which is the batch size, the number of features, the height, and the width. 02:51:17.340 |
We transpose because we want to apply cross-attention. 02:51:22.060 |
First, we apply self-attention, then we apply cross-attention. 02:51:28.460 |
So, we do normalization plus self-attention with skip connection. 02:51:40.220 |
So, X is X dot transpose of minus one, minus two. 02:51:54.140 |
Here, first of all, we need to do X is equal to X dot view. 02:52:02.780 |
So, we are going from this to batch size features, 02:52:26.940 |
Now, we apply this normalization plus self-attention. 02:52:29.500 |
So, we have a first short residual connection 02:52:37.180 |
So, we say that X is equal to layer norm one. 02:52:49.740 |
So, X is plus equal to residual short, the first residual connection. 02:52:56.060 |
Then we say that the residual short is again equal to six, 02:52:58.780 |
because we are going to apply now the cross attention. 02:53:01.740 |
So, now we apply the normalization plus the cross attention with skip connection. 02:53:11.100 |
So, what we did here is what we do in any transformer. 02:53:20.060 |
So, let me show you here what we do in any transformer. 02:53:26.140 |
And then we combine it with a skip connection here. 02:53:28.460 |
And now we will, instead of calculating a self-attention, 02:53:31.660 |
we will do a cross attention, which we still didn't define. 02:53:39.340 |
And then first we calculate, we apply the normalization. 02:53:43.020 |
Then the cross attention between the latents and the prompt. 02:54:12.300 |
Finally, just like with the attention transformer, 02:54:16.140 |
we have a feedforward layer with the GeGLU activation function. 02:54:26.780 |
And this is actually, if you watch the original implementation of the transformer, 02:54:45.180 |
of the stable diffusion, it's implemented exactly like this. 02:54:48.620 |
So, basically later we do element-wise multiplication. 02:54:55.020 |
So, these are special activation functions that involve a lot of parameters. 02:55:11.500 |
they just saw that this one works better for this kind of application. 02:55:26.940 |
So, this one is basically normalization plus feedforward layer with GeGLU and skip connection. 02:55:38.620 |
In which the skip connection is defined here. 02:55:41.660 |
So, at the end, we always apply the skip connection. 02:55:44.780 |
Finally, we change back to our tensor to not be a sequence of pixels anymore. 02:55:57.820 |
So, basically, we go from batch size, height multiplied by width, features 02:56:13.980 |
into batch size, features, height, width. 02:56:36.940 |
Finally, we apply the long skip connection that we defined here at the beginning. 02:57:04.300 |
We have defined everything, I think, except for the cross attention, which is very fast. 02:57:09.340 |
So, we go to the attention that we defined before. 02:57:20.700 |
Yeah, we only need to define this cross attention here. 02:57:38.060 |
So, class CrossAttention: it will be, actually, almost the same as the self-attention, 02:57:44.780 |
except that the queries come from one side 02:57:51.420 |
and the keys and the values come from another side. 02:58:06.300 |
So, this is the dimension of the embedding of the keys and the values. 02:58:37.420 |
In this case, we will define, instead of one big matrix made of three, WQ, WK and WV, we 02:58:46.860 |
You can define it as one big matrix or three separately. 02:59:02.860 |
So, the cross dimension is the one used for the keys and the values. 02:59:32.220 |
Then, we save the number of heads of this cross attention and also the dimension of 03:00:00.460 |
each, how much information each head will see. 03:00:04.140 |
And the dimension of each head is equal to the embedding dimension divided by the number of heads. 03:00:27.740 |
So, we are relating X, which is our latents, which is of size batch size. 03:00:36.060 |
It will have a sequence length, its own sequence length, Q, let's call it Q, and its own dimension. 03:00:43.580 |
And the Y, which is the context or the prompt, which will be batch size. 03:00:53.580 |
Sequence length of the key, because the prompt will become the key and the values. 03:00:58.860 |
And each of them will have its own embedding size, the dimension of KV. 03:01:03.100 |
We can already say that this will have a sequence length of 77, because the sequence length of 03:01:09.580 |
the prompt is 77 and its embedding is of size 768. 03:01:23.740 |
Okay, then we have the interim shape, like the same as before. 03:01:40.860 |
So, this is the sequence length, then the n number of heads. 03:01:55.260 |
The first thing we do is multiply queries by WQ matrix. 03:02:07.580 |
Then we do the same for the keys and the values, but by using the other matrices. 03:02:17.900 |
And as I told you before, the key and the values are the Y and not the X. 03:02:22.140 |
Again, we split them into H heads, so H number of heads. 03:02:38.860 |
I will not write the shapes because they match the same transformation that we do here. 03:02:46.700 |
Okay, again, we calculate the weight, which is the attention, as a query multiplied by 03:03:12.300 |
And then we divide it by the dimension of each head by the square root. 03:03:29.420 |
In this case, we don't have any causal mask, so we don't need to apply the mask like before, 03:03:36.300 |
because here we are trying to relate the tokens, so the prompt with the pixels. 03:03:41.980 |
So, each pixel can watch any token of the prompt, and any token can watch any pixel, basically. 03:04:00.940 |
Then, to obtain the output, we multiply it by the V matrix. 03:04:04.140 |
And then the output, again, is transposed, just like before. 03:04:10.140 |
So, now we are doing exactly the same things that we did here. 03:04:38.780 |
And this ends our building of the... let me show you. 03:04:43.020 |
Now we have built all the building blocks for the stable diffusion. 03:04:49.420 |
So, now we can finally combine them together. 03:04:53.420 |
So, the next thing that we are going to do is to create the system that, 03:04:57.500 |
taking the noise, taking the text, taking the time embedding, will run, 03:05:03.020 |
for example, if we want to do text to image, will run this noise many times through the unit, 03:05:10.540 |
So, we will build the scheduler, which means that, 03:05:13.340 |
because the unit is trained to predict how much noise is there, 03:05:20.460 |
So, to go from a noisy version to obtain a less noisy version, 03:05:25.180 |
we need to remove the noise that is predicted by the unit. 03:05:32.220 |
We will build the code to load the weights of the pre-trained model. 03:05:35.980 |
And then we combine all these things together. 03:05:39.580 |
And we actually build what is called the pipeline. 03:05:42.060 |
So, the pipeline of text to image, image to image, etc. 03:05:48.300 |
Now that we have built all the structure of the unit, 03:05:52.060 |
or we have built the variational autoencoder, we have built a clip, 03:06:01.980 |
So, the first thing I kindly ask you to do is to actually download 03:06:05.500 |
the pre-trained weights of the stable diffusion, because we need to inference it later. 03:06:09.340 |
So, if you go to the repository I shared, this one, PyTorch Stable Diffusion, 03:06:14.220 |
you can download the pre-trained weights of the stable diffusion 1.5 03:06:20.940 |
So, you download this file here, which is the EMA, 03:06:27.020 |
which means that it's a model that has been trained, 03:06:30.460 |
but they didn't change the weights at each iteration, 03:06:32.860 |
but with an Exponentially Moving Average schedule. 03:06:39.580 |
But if you want to fine-tune later the model, you need to download this one. 03:06:43.900 |
And we also need to download the files of the tokenizer, 03:06:48.380 |
because, of course, we will give some prompt to the model to generate an image. 03:06:53.660 |
And the prompt needs to be tokenized by a tokenizer, 03:06:56.700 |
which will convert the words into tokens and the tokens into numbers. 03:07:00.620 |
The numbers will then be mapped into embeddings by our clip embedding here. 03:07:05.420 |
So, we need to download two files for the tokenizer. 03:07:08.780 |
So, first of all, the weights of this one file here, 03:07:12.380 |
then on the tokenizer folder, we find the merges.txt and the vocab.json. 03:07:17.340 |
If we look at the vocab.json file, which I already downloaded, 03:07:25.260 |
That's it, just like what the tokenizer does. 03:07:27.580 |
And then I also prepared the picture of a dog that I will be using for image-to-image, 03:07:33.420 |
You don't have to use the one I am using, of course. 03:07:41.180 |
So, how we will inference this stable diffusion model. 03:07:47.900 |
I will also explain you how the scheduler will work. 03:07:54.540 |
I will explain all the formulas, all the mathematics behind it. 03:08:16.940 |
We will also use a tqdm to show the progress bar. 03:08:22.540 |
And later, we will build this sampler, the DDPM sampler. 03:08:30.220 |
And I will also explain what is this sampler doing and how it works, etc, etc. 03:08:35.980 |
So, first of all, let's define some constants. 03:08:38.060 |
The stable diffusion can only produce images of size 512 by 512. 03:08:46.620 |
The latent dimension is the size of the latent tensor of the variational autoencoder. 03:08:55.420 |
And as we saw before, if we go check the size, 03:09:00.300 |
the encoder of the variational autoencoder will convert something that is 512 by 512 03:09:09.260 |
So, the latent dimension is 512 divided by 8. 03:09:19.420 |
We can also call it width divided by 8 and height divided by 8. 03:09:23.420 |
Then, we create a function called the generator. 03:09:27.660 |
This will be the main function that will allow us to do text to image and also image to image, 03:09:44.620 |
If you ever used stable diffusion, for example, with the HuggingFace library, 03:09:48.380 |
you will know that you can also specify a negative prompt, 03:09:51.820 |
which tells that you want, for example, you want a picture of a cat, 03:09:56.620 |
but you don't want the cat to be on the sofa. 03:09:59.900 |
So, for example, you can put the word sofa in the negative prompt. 03:10:04.060 |
So, it will try to go away from the concept of sofa when generating the image. 03:10:09.260 |
And this is connected with the classifier free guidance that we saw before. 03:10:13.980 |
So, but don't worry, I will repeat all the concepts while we are building it. 03:10:18.540 |
We can have an input image in case we are building an image to image. 03:10:27.660 |
Strength, I will show you later what is it, but it's related to if we have an input image 03:10:33.340 |
and how much, if we start from an image to generate another image, 03:10:37.180 |
how much attention we want to pay to the initial starting image. 03:10:40.780 |
And we can also have a parameter called doCFG, 03:10:51.020 |
CFG scale, which is the weight of how much we want the model to pay attention to our prompt. 03:11:00.380 |
The sampler name, we will only implement one. 03:11:12.060 |
I think it's quite common to do 50 steps, which produces actually not bad results. 03:11:20.140 |
The seed is how we want to initialize our random number generator. 03:11:23.980 |
Let me put a new line, otherwise we become crazy reading this. 03:11:36.540 |
Then we have the device where we want to create our tensor. 03:11:39.820 |
We have an idle device, which means basically if we load some model on CUDA 03:11:45.020 |
and then we don't need the model, we move it to the CPU. 03:11:48.060 |
And then the tokenizer that we will load later. 03:11:54.860 |
This is our main pipeline that, given all this information, will generate one picture. 03:12:01.340 |
It will pay attention to the input image, if there is, 03:12:03.980 |
according to the weights that we have specified. 03:12:10.540 |
Don't worry, later I will explain them actually how they work also on the code level. 03:12:22.480 |
We use torch.no_grad(), because we are inferencing the model. 03:12:28.780 |
The first thing we make sure is the strength should be between 0 and 1. 03:12:59.340 |
If we want to move things to the CPU, we create this lambda function. 03:13:33.600 |
Then we create the random number generator that we will use. 03:13:51.040 |
And the generator is a random number generator that we will use to generate the noise. 03:14:14.140 |
Let me fix this formatting because I don't know format document. 03:14:33.900 |
The clip is a model that we take from the pre-trained models. 03:14:58.160 |
As you remember with the classifier-free guidance. 03:15:03.280 |
When we do classifier-free guidance, we inference the model twice. 03:15:16.560 |
And another time by not specifying the condition. 03:15:20.720 |
And then we combine the output of the model linearly with a weight. 03:15:31.440 |
It indicates how much we want to pay attention to the conditioned output 03:15:38.320 |
Which also means that how much we want the model to pay attention to the condition 03:15:54.880 |
So, the negative prompt that you use in stable diffusion. 03:16:19.600 |
And this will tell the model by using this weight. 03:16:21.760 |
We will combine the output in such a way that we can decide 03:16:24.880 |
how much we want the model to pay attention to the prompt. 03:17:02.720 |
We want to append the padding up to the maximum length. 03:17:06.880 |
Which means that the prompt, if it's too short, will be filled with paddings up to 77 tokens. 03:17:22.960 |
Then we convert these tokens, which are input IDs, into a tensor. 03:17:30.960 |
Which will be of size batch size and sequence length. 03:17:56.880 |
So, it will convert batch size sequence length. 03:17:59.840 |
So, these input IDs will be converted into embeddings. 03:18:12.400 |
And what we do is conditional context is equal to clip of conditional tokens. 03:18:22.800 |
So, we are taking these tokens and we are running them through clips. 03:18:27.360 |
Which will return batch size sequence length dimension. 03:18:31.040 |
And this is exactly what I have written here. 03:18:38.080 |
Which, if you don't want to specify, we will use the empty string. 03:18:41.440 |
Which means the unconditional output of the model. 03:18:44.240 |
So, the model, what would the model produce without any condition? 03:18:51.440 |
So, if we start with random noise and we ask the model to produce an image. 03:18:56.880 |
So, the model will output anything that it wants based on the initial noise. 03:19:38.820 |
So, it will also become a tensor of batch size sequence length dimension. 03:19:47.600 |
Where the sequence length is actually always 77. 03:19:55.520 |
But I forgot to write the code to convert it into tokens. 03:19:59.040 |
So, unconditional tokens is equal to tokenizer.batch_encode_plus, just like before. 03:20:31.680 |
They will become the batch of our input to the unit. 03:20:51.520 |
We are taking the conditional and unconditional input. 03:20:54.240 |
And we are combining them into one single tensor. 03:20:56.560 |
So, they will become a tensor of batch size 2. 03:21:17.840 |
If we don't want to do conditional classifier free guidance. 03:21:24.880 |
We only need to use the prompt and that's it. 03:21:33.920 |
Without combining the unconditional input with the conditional input. 03:21:38.880 |
We cannot decide how much the model pays attention to the prompt. 03:21:44.880 |
Because we don't have anything to combine it with. 03:22:53.840 |
Because the model takes care of the batch size. 03:23:08.640 |
And you want to offload the models after using them. 03:23:26.480 |
Because it's better to build it after you know how it is used. 03:23:32.560 |
I think it's easy to get lost in what is happening. 03:23:51.440 |
And we tell the sampler how many steps we want to do for the inferencing. 03:24:04.960 |
Because we didn't implement any other sampler. 03:24:33.120 |
In this case the denoisification steps will be 50. 03:24:36.960 |
Even if during the training we have maximum 1000 steps. 03:24:41.120 |
During inferencing we don't need to do 1000 steps. 03:24:44.720 |
Of course usually the more steps you do the better the quality. 03:24:50.880 |
But with different samplers they work in different way. 03:24:55.920 |
And with ddpm usually 50 is good enough to get a nice result. 03:25:03.120 |
For some other samplers that work on with differential equations. 03:25:10.640 |
And how lucky you are with the particular prompt actually also. 03:25:15.920 |
This is the latency that will run through the unit. 03:25:26.400 |
And as you know it's of size "lat_height" and "lat_width". 03:25:33.600 |
So it's 512 divided by 8 by 512 divided by 8. 03:25:43.280 |
What happens if the user specifies an input image? 03:25:49.600 |
We can take care of the prompt by either running a classifier free guidance. 03:25:56.000 |
Which means combining the output of the model with the prompt and without the prompt. 03:26:05.920 |
Or we can directly just ask the model to output only one image. 03:26:13.600 |
But then we cannot combine the two output with this scale. 03:26:17.360 |
What happens, however, if we don't want to do text to image, but image to image? 03:26:30.560 |
And then we ask the scheduler to remove noise. 03:26:33.440 |
But since the UNet will also be conditioned by the text prompt, 03:26:38.400 |
we hope that while the UNet denoises this image, it will move it towards the prompt. 03:28:03.120 |
The next thing we do is we rescale this image. 03:28:06.960 |
The input of this model should be normalized, 03:28:12.480 |
or rather, rescaled between -1 and +1. 03:28:28.480 |
But this is not what the unit wants as input. 03:28:44.640 |
So we write a function to transform anything that is between 0 and 255 into this range. 03:28:57.680 |
And this will not change the size of the tensor. 03:29:17.040 |
Okay, and then we change the order of the dimensions. 03:29:36.880 |
Because as you know the encoder of the variational autoencoder wants batch size, channel, height, width, 03:29:45.520 |
while we have batch size, height, width, channel. 03:29:49.520 |
So to obtain the correct input for the encoder, we permute the dimensions. 03:30:18.640 |
And then he will sample from this particular Gaussian. 03:30:36.480 |
And we can also make the output deterministic. 03:30:47.040 |
Okay, and now let's run it through the encoder. 03:31:23.920 |
It will produce a latent representation of this image. 03:32:02.480 |
Because the model will have more noise to remove. 03:32:08.000 |
But if we add less noise to this initial image. 03:32:13.120 |
Because most of the image is already defined. 03:32:17.680 |
So we expect that the output will resemble more or less the input. 03:32:54.320 |
And later we will see what is this method doing. 03:33:03.200 |
According to the strength that we have defined. 03:33:35.200 |
What is the initial noise level we will start with. 03:34:58.880 |
And we then finally load the diffusion model. 03:35:12.160 |
Later we see what is this model and how to load it. 03:35:14.640 |
We take it to our device where we are working. 03:35:19.840 |
And then our sampler will define some time steps. 03:35:27.840 |
As you remember to train the model we have maximum of 1000 time steps. 03:35:31.680 |
But when we inference we don't need to do 1000 steps. 03:35:34.640 |
In our case we will be doing for example 50 steps of inferencing. 03:35:54.960 |
If we do only 50, it means that we skip 20 time steps at a time, because 1000 divided by 50 is 20. 03:36:19.280 |
Basically each of these time steps indicates a noise level. 03:36:29.520 |
Or the initial noise in case we are doing the text to image. 03:36:38.480 |
Which are defined by how many inference steps we want. 03:36:41.440 |
And this is exactly what we are going to do now. 03:37:04.080 |
And for each of these time steps we denoise the image. 03:37:19.680 |
We need to tell the unit as you remember diffusion. 03:37:32.000 |
Or in case we are doing a classifier free guidance. 03:37:44.800 |
Keep denoising it according to the time embedding. 03:37:51.280 |
Which is an embedding of the current time step. 03:38:02.240 |
This function basically will convert a number. 03:38:15.200 |
It's basically just equal to the positional encoding. 03:39:25.760 |
If we are doing the classifier free guidance. 03:40:04.960 |
So basically we are repeating this dimension twice. 03:40:50.720 |
Because we are passing the input of the model. 03:41:01.040 |
So we can then split it into two different tensor. 03:41:24.400 |
And then we combine them according to this formula here. 03:42:16.000 |
That is able to predict the noise in the current latents. 03:43:37.520 |
Based on how many inference steps we want to do. 03:43:46.240 |
The unit will tell us how much is the predicted noise. 03:43:53.520 |
So the latency are equal to sampler dot step. 03:44:03.040 |
This basically means: take the image from a more noisy version to a less noisy version. 03:44:42.960 |
And then our image is run through the decoder. 03:45:13.360 |
Because the image was initially, as you remember here. 03:45:47.600 |
We want the channel dimension to be the last one. 03:45:56.320 |
So this one basically will take the batch size. 03:46:21.840 |
And then we need to convert it into a NumPy array. 03:47:33.600 |
So convert something that is within this range into this range. 03:47:59.760 |
This means basically take the time step which is a number. 03:48:08.320 |
And this will be done exactly using the same system that we use for the transformer. 03:48:13.520 |
So we first define the frequencies of our cosines and the sines. 03:48:23.440 |
Exactly using the same formula of the transformer. 03:48:26.240 |
So if you remember, the formula uses 10,000 03:48:37.840 |
raised to the power of minus torch.arange(0, 160) divided by 160. 03:48:45.040 |
So I am referring to this formula just in case you forgot. 03:48:56.480 |
So the formula that defines the positional encodings here. 03:48:59.920 |
Here we just use a different dimension of the embedding. 03:49:02.960 |
This one will produce something that is 160 numbers. 03:49:18.480 |
And this one will produce something that is 200 numbers. 03:50:06.720 |
And then we multiply this by the sines and the cosine. 03:50:13.920 |
Just like we did in the original transformer. 03:50:16.880 |
This one will return a tensor of size 1 by 320. 03:51:04.640 |
Because if we don't want to use any negative prompt. 03:51:13.280 |
The strength is how much attention we want to pay to this input image. 03:51:18.800 |
Or how much noise we want to add to it basically. 03:51:24.800 |
The more noise we add, the less the output will resemble the input image. 03:51:31.120 |
Which means that if we want the output to stay close to the input image, we add less noise. 03:51:38.320 |
And then we can adjust how much we want to pay attention to the prompt. 03:51:52.880 |
The first thing we do is we create a generator. 03:52:02.960 |
Basically we need to go through the units twice. 03:52:16.800 |
In case we don't do the classifier free guidance. 03:52:57.760 |
Then this new latent is fed again to the unit. 03:53:09.280 |
Is how we remove the noise from the image now. 03:53:20.320 |
So now we need to go build this scheduler here. 03:53:56.880 |
Because I don't want you to be confused with the beta schedule. 03:54:12.240 |
Because there is the beta schedule that we will define now. 03:54:16.400 |
Which indicates the amount of noise at each time step. 03:54:19.840 |
And then there is what is known as the scheduler or the sampler. 03:54:26.400 |
So this scheduler here actually means a sampler. 03:54:30.480 |
I will update the slides when the video is out. 03:54:46.560 |
Where what are they and where they come from? 03:55:16.160 |
We can see that the forward process is the process that makes the image more noisy. 03:55:24.800 |
So given an image with less noise, it produces an image with more noise. 03:55:33.840 |
Which is actually a chain of Gaussian distribution. 03:55:36.720 |
Which is called a Markov chain of Gaussian distribution. 03:55:39.520 |
And the noise that we add varies according to a variance schedule. 03:55:52.720 |
That indicates the variance of the noise that we add with each of these steps. 03:55:59.040 |
And we use the same schedule as in stable diffusion. 03:56:10.560 |
So these are the betas that will gradually turn the image into complete noise. 03:56:26.880 |
Which are for example the cosine schedule etc. 03:56:35.360 |
Which is actually 1000 numbers between beta start and beta end. 03:56:57.680 |
Because this is how they define it in the stable diffusion. 03:57:11.920 |
So in how many pieces we want to divide this linear space. 03:57:21.280 |
And then the type is torch dot float 32 I think. 03:57:33.360 |
This is in the diffusers libraries from Hugging Face. 03:57:38.480 |
I think this is called the scaled linear schedule. 03:57:44.480 |
That are needed for our forward and our backward process. 03:57:47.600 |
So our forward process depends on this beta schedule. 03:57:50.720 |
But actually this is only for the single step. 03:57:53.120 |
So if we want to go, for example, from the original image directly to a noisified version at time step t, 03:58:05.920 |
we need the alpha bar, the formula that allows you to go from the original image to any noise level in one step. 03:58:21.040 |
And the variance also depends on this alpha bar. 03:59:42.480 |
The second element is alpha 0 multiplied by alpha 1. 04:00:28.720 |
We will start from the more noisy to less noise. 04:00:36.240 |
So let's say timesteps is equal to torch.from_numpy of a range from 0 to 1000, reversed. 04:01:12.720 |
So if the user later specifies less than 1000. 04:01:22.960 |
Based on how many actual steps we want to make. 04:01:39.920 |
Which is also actually the one they use normally. 04:03:00.720 |
According to how many we actually want to make. 04:03:54.240 |
Now the code looks very different from each other. 04:04:01.040 |
Because actually I have been copying the code from multiple sources. 04:04:04.560 |
Maybe one of them I think I copied from the HuggingFace library. 04:04:16.160 |
So we copy the code from the HuggingFace library. 04:04:20.480 |
So now we set the exact number of time steps we want. 04:04:23.760 |
And we redefine this time steps array like this. 04:04:33.520 |
Let's define the method on how to add noise to something. 04:04:45.840 |
Well we need to apply the formula as defined in the paper. 04:04:57.680 |
I want to go to the noisified version of this image at time step t. 04:05:14.320 |
And we will apply the same trick that we did for the variational autoencoder. 04:05:17.840 |
As you remember in the variational autoencoder. 04:05:19.920 |
I actually already showed how we sample from a distribution. 04:05:23.040 |
Of which we know the mean and the variance here. 04:05:26.640 |
But we of course we need to build the mean and the variance. 04:05:35.920 |
So we need to build the mean and the variance. 04:05:50.080 |
So this is actually time step, not time steps. 04:05:57.040 |
It indicates at what time step we want to add the noise. 04:06:00.000 |
Because you can add the noise at time step 1, 2, 3, 4, 04:06:05.520 |
and so on, up to time step 1000. 04:06:15.200 |
So the noisified version at the time step 1 will be not so noisy. 04:06:19.280 |
But at the time step 1000 will be complete noise. 04:06:35.120 |
Let me check what we need to calculate first. 04:06:41.360 |
So to calculate the mean we need this alpha cum prod. 04:06:50.160 |
So the alpha bar as you can see is the cumulative product of all the alphas. 04:07:30.400 |
That we also move to the same device of the other tensor. 04:07:34.320 |
Now we need to calculate the square root of alpha bar. 04:07:47.040 |
Or alpha prod is alpha cum prod at the time step t. 04:07:57.760 |
Because raising a number to the power of 0.5 means taking its square root. 04:08:04.240 |
That is, a number to the power of 1/2 04:08:07.840 |
is the same as the square root of that number. 04:08:14.800 |
And then basically because we need to combine this alpha cum prod. 04:08:29.040 |
So one trick is to just keep adding dimensions with unsqueeze. 04:08:32.640 |
Until you have the same number of dimensions. 04:08:34.480 |
So while the length of the shape of this tensor is less than the length of the shape of the other tensor. 04:08:41.440 |
Most of this code I have taken from the Hugging Face libraries samplers. 04:08:52.080 |
So we keep the dimension until this one and this tensor and this tensor have the same dimensions. 04:09:07.440 |
This is because otherwise we cannot do broadcasting when we multiply them together. 04:09:11.120 |
The other thing that we need to calculate this formula is this part here. 04:09:24.240 |
As the name implies is 1 minus alpha cum prod at the time step t. 04:09:39.440 |
Just like we did with the encoder of the variational autoencoder. 04:09:45.680 |
Because, as you remember, if you have a sample from N(0, 1) and you want to transform it into a sample from a Gaussian with a given mean and variance, 04:09:52.720 |
the formula is X = mean + standard deviation * N(0, 1) sample. 04:10:13.920 |
Flatten and then again we keep adding the dimensions until they have the same dimension. 04:10:19.680 |
Otherwise we cannot multiply them together or sum them together. 04:10:37.360 |
Now as you remember our method should add noise to an image. 04:10:44.080 |
So we need to add noise means we need to sample some noise. 04:10:47.280 |
So we need to sample some noise from the n01. 04:11:07.280 |
I think my cat is very angry today with me because I didn't play with him enough. 04:11:14.000 |
So later if you guys excuse me I need to later play with him. 04:11:24.240 |
So let's get the noisy samples using the noise and the mean and the variance that we have calculated. 04:11:35.520 |
Actually, the mean is this coefficient multiplied by x0. 04:11:41.760 |
So this coefficient multiplied by x0 is the mean. 04:11:45.520 |
So we take the square root of alpha_cumprod, multiply it by x0, and this will be the mean. 04:11:50.880 |
So the mean is the square root of alpha_prod multiplied by the original latents, 04:11:56.320 |
so x0, the input image, or whatever we want to noisify. 04:12:00.240 |
Plus the standard deviation, which is the square root of (1 - alpha_cumprod), multiplied by a sample of the N(0, 1) noise. 04:12:22.160 |
So all of this is according to equation 4 of the DDPM paper, and also according to this. 04:12:36.480 |
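Putting it together, here is a sketch of the add_noise method just described, i.e. sampling from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I) as in equation 4 of the DDPM paper; the parameter and attribute names are assumptions.

```python
# Sketch of add_noise (equation 4 of the DDPM paper); names are assumptions.
import torch

def add_noise(self, original_samples: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    alphas_cumprod = self.alphas_cumprod.to(device=original_samples.device,
                                            dtype=original_samples.dtype)
    timesteps = timesteps.to(original_samples.device)

    # Mean coefficient: sqrt(alpha_bar_t)
    sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5
    sqrt_alpha_prod = sqrt_alpha_prod.flatten()
    while len(sqrt_alpha_prod.shape) < len(original_samples.shape):
        sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)

    # Standard deviation: sqrt(1 - alpha_bar_t)
    sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5
    sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
    while len(sqrt_one_minus_alpha_prod.shape) < len(original_samples.shape):
        sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)

    # Sample epsilon ~ N(0, I) and shift it: x_t = mean + std * epsilon
    noise = torch.randn(original_samples.shape, generator=self.generator,
                        device=original_samples.device, dtype=original_samples.dtype)
    noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise
    return noisy_samples
```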
Okay now that we know how to add noise we need to understand how to remove noise. 04:12:49.920 |
Imagine we are doing text-to-image or image-to-image, it doesn't matter. 04:12:57.200 |
The point is, our UNet, as you remember, is trained to only predict the amount of noise, given the 04:13:03.600 |
noisy latent, the prompt, and the time step at which this noise was added. 04:13:12.400 |
So what we have is this predicted noise from the UNet. 04:13:17.280 |
We need to remove this noise: the UNet will predict the noise, but we need some way of actually removing it. 04:13:25.760 |
What I mean by this is you can see this reverse process here. 04:13:36.720 |
We want to go from Xt, so something more noisy, to something less noisy, based on the noise predicted by the network. 04:13:51.440 |
But here, in this formula, you don't see any relationship to the noise predicted by the network. 04:13:57.680 |
Actually, here it just says: if you have a network that can evaluate this mean and this variance, 04:14:06.640 |
you know how to remove the noise, so how to go from Xt to Xt-1. But we don't have a method 04:14:13.120 |
that actually predicts the mean and the variance. 04:14:15.120 |
We have a method that tells us how much noise is there. 04:14:18.000 |
So the formula we should be looking at is actually here. 04:14:22.800 |
So here, because we trained our network, our UNet, as the epsilon theta: as you 04:14:33.600 |
remember, our training method was this, we do gradient descent on this loss, in which we 04:14:40.000 |
train a network to predict the noise in a noisy image. 04:14:45.040 |
So we need to use this epsilon theta now, so this predicted noise, to 04:14:51.120 |
remove the noise. And if we read the paper, it's written here that to sample Xt-1 given 04:14:58.000 |
Xt, we compute Xt-1 equal to this formula here. 04:15:04.640 |
This tells us how to go from Xt to Xt-1. So basically we sample some noise, 04:15:12.720 |
we multiply it by the sigma, and this basically reminds us of how to go from the N(0, 1) 04:15:21.440 |
to any distribution with a particular mean and a particular variance. 04:15:26.400 |
So we will be working according to this formula here, because we have a model that 04:15:31.360 |
predicts noise, this epsilon theta, and this is our UNet. 04:15:38.000 |
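For reference, the sampling step of Algorithm 2 of the DDPM paper, which uses the predicted noise epsilon_theta, can be written as:

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,
          \epsilon_\theta(x_t, t) \right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I)
```

In the code below we will implement an equivalent parameterization of this same step, via the predicted x0 and formula 7.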
So let's build this part now and I will while building it I will also tell you which formula 04:15:44.160 |
I'm referring to at each step so you can also follow the paper. 04:15:47.680 |
So now let's build the method, let's call it the step method. It takes the time step at which the 04:15:54.160 |
noise was added, or at which we pretend it was added, because when we do the reverse process we can also 04:16:00.400 |
skip some time steps. So we need to tell 04:16:06.000 |
it what is the time step at which it should remove the noise; then the latents, because as you know 04:16:12.640 |
the UNet works with the latents, so with these z's here, so this is z and it keeps getting denoised; 04:16:19.040 |
and then the model output, so the predicted noise of the UNet. 04:16:27.920 |
So the model output is the predicted noise, a torch.Tensor. 04:16:33.520 |
This model output corresponds to this epsilon theta of (xt, t), so this is the predicted noise 04:16:43.680 |
at time step t. These latents are our xt. And what else do we need? The alpha we have, the beta 04:16:52.320 |
we have, we have everything. Okay, let's go. So t is equal to the time step, and the previous t is 04:17:02.160 |
equal to self dot get previous time step t this is a function that given this time step 04:17:11.040 |
calculates the previous one later we will build it actually we can build it now it's very simple 04:17:31.680 |
get previous time step self time step which is an integer we return another integer 04:17:43.680 |
minus self minus basically this quantity here step ratio so self dot num training steps 04:17:56.720 |
divided by self dot num inference steps return previous t this one will return basically 04:18:06.560 |
given, for example, the number 1000, it will return 1000 minus 20. Because the time steps: 04:18:16.640 |
for example, suppose the initial time step is 1000; the training steps we are doing are 1000, 04:18:24.080 |
divided by the number of inference steps, which will be 50, so this means 1000 minus 04:18:30.000 |
20, because 1000 divided by 50 is 20, so it will return 980. When we give it 980 as input it will 04:18:37.760 |
return 960. So this tells us what is the next step that we will be doing in our for loop, or what is the previous 04:18:45.200 |
step of the denoising: we are going from the image noised at time step 1000 to an image 04:18:52.320 |
noised at time step 980, for example. This is the meaning of the previous step. 04:19:01.280 |
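As a sketch, this helper could look like the following (the method name is an assumption):

```python
# Sketch of the previous-timestep helper (the method name is an assumption).
def _get_previous_timestep(self, timestep: int) -> int:
    # With 1000 training steps and 50 inference steps the ratio is 20,
    # so 980 -> 960, 960 -> 940, and so on.
    prev_t = timestep - self.num_training_steps // self.num_inference_steps
    return prev_t
```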
Then we retrieve some data that we will use later: so alpha_prod_t is equal to self.alphas_cumprod at t. For now, if you don't understand, 04:19:08.240 |
don't worry, because I am just collecting some data that we need to calculate 04:19:12.400 |
a formula, and then I will tell you exactly which formula we are going to calculate. 04:19:22.640 |
if we don't have any previous step then we don't know which alpha to return so we just return one 04:19:45.440 |
and actually there is a paper that came out, I think from ByteDance, that was complaining that 04:19:50.960 |
this way of doing it is not correct, because the last time step does not 04:19:58.720 |
have a signal-to-noise ratio equal to zero. But okay, this is something we don't need 04:20:03.760 |
to care about now; actually, if you're interested, I will link the paper in the comments. 04:20:26.320 |
prod t divided by alpha prod current also this code i took it from 04:20:33.600 |
hugging face diffusers library because i mean we are applying formulas so even if i wrote it by 04:20:43.520 |
myself it wouldn't be any different because we are just applying formulas from the paper so 04:20:48.240 |
so the first thing we need to do is to compute the original sample according to the formula 15 04:20:55.040 |
of the paper what do i mean by this as you can see where is it this one where is it 04:21:05.120 |
here so actually let me show you another formula here 04:21:12.160 |
as you can see we can calculate the previous step so the less noise is the the forward process 04:21:22.080 |
sorry the reverse process we can calculate the less noisy image given a more noisy image 04:21:26.960 |
and the predicted image at time step zero according to this formula here where the mean 04:21:34.560 |
is defined in this way and the variance is defined in this way but what is the predicted 04:21:44.400 |
x0 so given an image given a noisy image at time step t how can we predict what is the x0 04:21:53.920 |
of course this is the predicted x0 not what will be the x0 so this predicted x0 we can also retrieve 04:22:00.640 |
it using the formula number 15, if I remember correctly it's here. So this x0 is given as xt 04:22:10.800 |
minus the square root of (1 minus alpha bar) multiplied by the predicted noise at time step t, divided by the square root 04:22:17.200 |
of alpha bar. All these quantities we have. So actually there are two ways, which are equivalent to each 04:22:22.800 |
other actually numerically of going from more noisy to less noisy one way is this one this 04:22:28.960 |
one here which is the algorithm 2 of the sampling and one is this one here so the equation number 04:22:36.560 |
7 that allows you to go from more noisy to less noisy but the two are numerically equivalent they 04:22:42.400 |
just in the in the effect they are equivalent it's just they have different parameterization 04:22:48.000 |
so they have different formulas so as a matter of fact for example here in the code they say 04:22:54.720 |
to go from xt to xt minus 1 you need to do this calculation here but as you can see for example 04:23:02.480 |
this numerator here, multiplied by this epsilon theta, is different from the one 04:23:11.120 |
in the algorithm here. But actually they are the same thing, because beta_t is equal to 1 minus alpha_t, 04:23:16.720 |
as alpha is defined as 1 minus beta, as you remember. So there are multiple ways of obtaining 04:23:24.080 |
the same thing so what we will do is we actually we will apply this formula here in which we need 04:23:29.760 |
to calculate the mean and we need to calculate the variance according to these formulas here 04:23:34.720 |
in which we know alpha we know beta we know alpha bar we know all the other alphas we know 04:23:40.320 |
because there are parameters that depend on beta what we don't know is x0 but x0 can be calculated 04:23:46.240 |
as in the formula 15 here so first we will calculate this x0 predicted x0 04:23:54.400 |
So first we compute the predicted original sample, using formula 15 of the DDPM paper. 04:24:16.960 |
So we do latents minus the square root of 1 minus alpha bar t. What is 1 minus alpha bar t? 04:24:28.000 |
It is equal to beta_prod_t: I have here beta_prod_t, which is already 1 minus alpha bar t, as 04:24:36.160 |
you can see, 1 minus alpha bar at the time step t, because I already retrieved it from here. 04:24:43.920 |
So beta_prod to the power of one half, or the square root of beta_prod. 04:24:52.000 |
So we do latents minus beta_prod at time step t to the power of 0.5, which basically means the square 04:25:01.120 |
root of beta_prod, and then we multiply this by the predicted noise of the latent at 04:25:08.640 |
time step t. What is the predicted noise? It's the model output, because our UNet predicts the 04:25:15.120 |
noise: model_output. And then we need to divide this by, let me check, 04:25:23.360 |
the square root of alpha bar t, which we have, I think, here: alpha_prod_t. So the square root of 04:25:36.240 |
alpha_prod_t. Here I have an extra parenthesis that I don't need, let me remove it, 04:25:44.560 |
because otherwise it's wrong: first there is the product between these two terms, and then the difference. 04:25:50.080 |
Okay, this is how we compute the predicted x0. 04:25:56.160 |
Now let's go back to the formula number 7. 04:25:59.440 |
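Here is a sketch of this predicted-x0 computation (formula 15 of the DDPM paper); the variable names latents, model_output and timestep are assumptions matching the narration above.

```python
# Sketch of the predicted-x0 computation (formula 15 of the DDPM paper); names are assumptions.
alpha_prod_t = self.alphas_cumprod[timestep]   # alpha_bar_t
beta_prod_t = 1 - alpha_prod_t                 # 1 - alpha_bar_t

# x0_pred = (x_t - sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_bar_t)
pred_original_sample = (latents - beta_prod_t ** 0.5 * model_output) / alpha_prod_t ** 0.5
```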
Okay, now we have this x0, so we can compute this term and this term, 04:26:06.960 |
and all the other terms we can also compute. So we calculate 04:26:11.120 |
this mean and this variance, and then we sample from this distribution. So: compute the coefficients 04:26:20.640 |
for pred_original_sample and the current sample xt. This is the same comment that you can find in 04:26:31.280 |
the diffusers library, which basically means we need to compute this one: this is the coefficient 04:26:36.640 |
for the predicted sample and this is the coefficient for xt this one here so predicted 04:26:43.520 |
original sample coefficient which is equal to what alpha prod t minus one so the previous alpha 04:26:54.320 |
prod t which is alpha prod t previous which means the alpha prod t but at the previous time step 04:27:04.720 |
under the square root so to the power of 0.5 multiplied by the current beta t so 04:27:10.720 |
the beta at the time step t so current beta t which is we define it here 04:27:18.640 |
current_beta_t, which we retrieve from the alphas, and then we divide it by 04:27:28.720 |
beta product t because one minus alpha bar is actually equal to beta bar 04:27:33.120 |
beta product t then we have the this coefficient here so this one here 04:27:42.320 |
so this is current sample coefficient is equal to current alpha t to the power of 0.5 04:27:50.880 |
which means the square root of this time this this thing here so the square root of alpha t 04:27:58.080 |
and then we multiply it by beta_prod at the previous time step, because 1 minus alpha bar at the 04:28:03.200 |
previous time step corresponds to beta_prod at the previous time step, so we multiply by 04:28:09.280 |
beta_prod_t_prev, and we divide by beta_prod at the time step t, so beta_prod_t. 04:28:17.840 |
now we can compute the mean so the mean is the sum of these two terms 04:28:26.160 |
pred prev sample so let me write some here compute the predicted 04:28:39.840 |
is equal to predicted original sample coefficient multiplied by what by x0 what is x0 is this one 04:28:50.000 |
that we obtained by the formula number 15 so the prediction predicted original sample so x0 04:28:56.000 |
plus this term here what is this term is this one here so the current sample coefficient 04:29:02.560 |
multiplied by xt. And what is xt? It is the latents at the time step t. 04:29:08.480 |
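Collecting all the quantities mentioned so far, a sketch of this mean computation (formula 7 of the DDPM paper) might look like this; the variable names are assumptions.

```python
# Sketch of the mean of q(x_{t-1} | x_t, x_0) (formula 7 of the DDPM paper); names are assumptions.
prev_t = self._get_previous_timestep(timestep)
alpha_prod_t = self.alphas_cumprod[timestep]
alpha_prod_t_prev = self.alphas_cumprod[prev_t] if prev_t >= 0 else self.one
beta_prod_t = 1 - alpha_prod_t
beta_prod_t_prev = 1 - alpha_prod_t_prev
current_alpha_t = alpha_prod_t / alpha_prod_t_prev
current_beta_t = 1 - current_alpha_t

# Coefficient of the predicted x0: sqrt(alpha_bar_{t-1}) * beta_t / (1 - alpha_bar_t)
pred_original_sample_coeff = (alpha_prod_t_prev ** 0.5 * current_beta_t) / beta_prod_t
# Coefficient of x_t: sqrt(alpha_t) * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t)
current_sample_coeff = (current_alpha_t ** 0.5 * beta_prod_t_prev) / beta_prod_t

# Mean, with x0 replaced by its prediction from formula 15
pred_prev_sample = pred_original_sample_coeff * pred_original_sample + current_sample_coeff * latents
```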
Now we have computed the mean; we also need to compute the variance. 04:29:17.120 |
let's create another method to compute the variance 04:29:28.480 |
Okay, we obtain the previous time step t, because we need it for later calculations. 04:29:37.440 |
Again we calculate the alpha_prod_t, so all the terms that we need. 04:30:15.040 |
And the current_beta_t is equal to 1 minus alpha_prod_t divided by alpha_prod_t_prev, this one. 04:30:24.240 |
So what is current_beta_t? It is equal to 1 minus (alpha_prod_t 04:30:36.080 |
divided by alpha_prod_t_prev). Okay, so the variance, according to the formulas number 6 and 7, 04:30:48.800 |
so this formula here, is given as 1 minus alpha_prod_t_prev, 04:31:04.000 |
divided by 1 minus alpha_prod_t. Why prod? Because 04:31:11.760 |
this is the alpha bar. And then we multiply by the current beta, 04:31:15.920 |
so current_beta_t; and beta_t is defined, I don't remember where, as 1 minus alpha_t. 04:31:29.840 |
Then we torch.clamp the variance, and the minimum that we want is 04:31:41.600 |
1e-20, to make sure that it doesn't reach zero, and then we return the variance. 04:31:52.240 |
and now that we have the mean and the variance so this variance has also been computed using 04:31:58.960 |
let me write here computed using formula seven of the ddpm paper 04:32:07.840 |
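A sketch of this variance computation (the method name _get_variance is an assumption):

```python
# Sketch of the variance (formula 7 of the DDPM paper); the method name is an assumption.
import torch

def _get_variance(self, timestep: int) -> torch.Tensor:
    prev_t = self._get_previous_timestep(timestep)
    alpha_prod_t = self.alphas_cumprod[timestep]
    alpha_prod_t_prev = self.alphas_cumprod[prev_t] if prev_t >= 0 else self.one
    current_beta_t = 1 - alpha_prod_t / alpha_prod_t_prev

    # beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
    variance = (1 - alpha_prod_t_prev) / (1 - alpha_prod_t) * current_beta_t
    # Clamp so that we never end up taking the square root of zero.
    variance = torch.clamp(variance, min=1e-20)
    return variance
```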
And now we go back to our step function. So what we do is check 04:32:17.760 |
if t is greater than zero, because we only need to add the variance if we are not at the last 04:32:24.480 |
time step. If we are at the last time step we have no noise, so we don't add any. 04:32:28.320 |
We don't need to add any noise, actually, because the point is we are going 04:32:36.880 |
to sample from this distribution and just like we did before we actually sample from the n01 and 04:32:42.240 |
then we shift it according to the formula: so a Gaussian with a particular mean and a 04:32:52.160 |
particular variance is equal to the Gaussian N(0, 1) multiplied by the standard deviation, 04:32:58.960 |
plus the mean. So we sample the noise. 04:33:26.080 |
okay we sample some noise compute the variance 04:33:42.640 |
already multiplied by the noise, so it's actually the standard deviation, 04:33:48.800 |
because, as we will see, it's self._get_variance at the time step t, to the power of 0.5. With this 04:33:58.240 |
0.5 it becomes the standard deviation, and we multiply it by the N(0, 1) sample. 04:34:04.720 |
so what we are doing is basically we are going from n01 04:34:09.520 |
to N(mu, sigma) with a particular mu and a particular sigma, using the usual trick of 04:34:18.080 |
x is equal to mu plus sigma (not sigma squared, which is the variance, but its square root) 04:34:25.520 |
multiplied by z, where z is distributed according 04:34:33.360 |
to N(0, 1). This is the same thing that we have always done, also for the variational auto- 04:34:39.520 |
encoder, and also for adding the noise before. This is how you sample from a 04:34:45.840 |
distribution, how you actually shift the parameters of the Gaussian distribution. 04:34:49.280 |
so predicted prev sample is equal to the predicted prev sample plus the variance 04:34:59.600 |
this variance term here already includes the sigma multiplied by z 04:35:03.600 |
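So the end of the step method might look like this sketch (variable names are assumptions):

```python
# Sketch of the end of step(); names are assumptions.
variance = 0
if timestep > 0:
    device = model_output.device
    # Sample z ~ N(0, I) and scale it by the standard deviation (sqrt of the variance).
    noise = torch.randn(model_output.shape, generator=self.generator,
                        device=device, dtype=model_output.dtype)
    variance = (self._get_variance(timestep) ** 0.5) * noise

# x_{t-1} = mean + sigma * z  (sigma = 0 at the last step)
pred_prev_sample = pred_prev_sample + variance
return pred_prev_sample
```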
And then we return pred_prev_sample. Okay, now we have also built 04:35:14.000 |
the sampler. Let me check if we have everything. No, we are still missing something, which is the 04:35:19.360 |
set_strength method: as you remember, we need it when we want to do image-to-image. So let's go 04:35:26.320 |
back to check our slides if we want to do image to image we convert the image using the vae to a 04:35:32.560 |
latent then we need to add noise to this latent but how much noise we can decide the more noise 04:35:38.640 |
we add the more freedom the unit will have to change this image the less noise we add the 04:35:43.360 |
less freedom it will have to change the image so what we do is basically by setting the strength 04:35:48.640 |
we make our sampler start from a particular noise level and this is exactly what the method we want 04:35:55.600 |
to implement so i made some mess okay so for example as soon as we load the image we set 04:36:02.640 |
the strength which will shift the noise level from which we start from and then we add noise 04:36:08.240 |
to our latent to create the image to image here so let's go here and we create this method called 04:36:20.720 |
Okay, we compute the start step, because we will skip some steps: 04:36:29.840 |
it is equal to self.num_inference_steps minus int(self.num_inference_steps * strength). 04:36:37.920 |
this basically means that if we have 50 inference steps and then we set the strength to let's say 04:36:48.000 |
0.8 it means that we will skip 20% of the steps so when we will add we will start from image to 04:36:55.200 |
image for example we will not start from a pure noise image but we will start from 80% of noise 04:37:01.760 |
in this image so the unit will still have freedom to change this image but not as much as with 100% 04:37:08.160 |
noise we redefine the time steps because we are altering the schedule so basically we skip some 04:37:17.920 |
time steps, and self.start_step is equal to start_step. So what we actually do here is: suppose we 04:37:29.840 |
have a strength of 80%, we are fooling the UNet into believing that it 04:37:35.440 |
came up with this image, which now has this level of noise, and now it needs to keep denoising 04:37:41.280 |
it. This is how we do image-to-image: we start with an image, we noise it, and then we make the 04:37:47.280 |
UNet believe that it came up with this image with this particular noise level, and now it has to keep 04:37:53.760 |
denoising it, according of course also to the prompt, until we reach the clean image without any noise. 04:38:01.120 |
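A sketch of the set_strength method just described (the method name and attributes are assumptions):

```python
# Sketch of set_strength; names are assumptions.
def set_strength(self, strength: float = 1.0):
    # With 50 inference steps and strength 0.8 we skip the first 10 steps (20% of them),
    # so the denoising starts from an 80%-noisy latent instead of pure noise.
    start_step = self.num_inference_steps - int(self.num_inference_steps * strength)
    self.timesteps = self.timesteps[start_step:]
    self.start_step = start_step
```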
Now we have the pipeline that we can call, we have the DDPM sampler, we have the model built. 04:38:12.080 |
of course we need to create the function to load the weights of this model so let's create another 04:38:18.000 |
file we will call it the model loader here model loader because now we are nearly close to sampling 04:38:27.760 |
finally, from this Stable Diffusion. So now we need to create the method to load the 04:38:32.160 |
pre-trained weights that we have downloaded before, so let's create it: 04:38:42.240 |
from encoder import VAE_Encoder, then from decoder import VAE_Decoder, 04:38:55.600 |
from diffusion import Diffusion, our diffusion model, which is our UNet. 04:39:05.040 |
Now let me first define it, then I'll tell you what we need to do: so, preload_models_from... 04:39:17.280 |
okay as usual we load the weights using torch but we use we will create another function 04:39:32.720 |
model converter dot load from standard weights 04:39:37.280 |
this is a method that we will create later to to load the weights 04:39:49.360 |
the pre-trained weights and i will show you why we need this method then we create our encoder 04:39:58.880 |
and we load the state dict, load_state_dict, from our state dict, 04:40:04.160 |
and we also set strict to True. 04:40:37.040 |
and strict also so this strict parameter here basically tells that when you load a model from 04:40:52.000 |
PyTorch, for example this .ckpt file here, it is a dictionary that contains many keys, 04:40:59.680 |
and each key corresponds to one matrix of our model. So for example this group 04:41:06.320 |
normalization has some parameters, and how can torch load these parameters exactly into this 04:41:12.720 |
group norm? By using the names of the variables that we have defined here. When we load a 04:41:20.480 |
model from PyTorch, it will actually load the dictionary, and then we load this dictionary 04:41:25.680 |
into our models and he will match by names now the problem is the pre-trained model 04:41:30.960 |
actually they don't use the same name that i have used and actually this code is based on another 04:41:36.720 |
code that i have seen so actually the the names that we use are not the same as the pre-trained 04:41:42.560 |
model also because the names in the pre-trained model not always uh very friendly for learning 04:41:49.120 |
this is why i changed the names and also other people changed the names of the methods but this 04:41:55.200 |
also means that the automatic mapping between the names of the pre-trained model and the names 04:42:01.440 |
defined in our classes here cannot happen because it cannot happen automatically because the names 04:42:06.400 |
do not match for this reason there is a script that i have created in my github library here 04:42:14.320 |
that you need to download to convert these names it's just a script that maps one name into another 04:42:20.160 |
so if the name is this one map it into this if the name is this one mapping into this 04:42:24.560 |
there is nothing special about this script it's just a very big mapping of the names and this is 04:42:30.560 |
actually done by most models because if you want to change the name of the classes and or the 04:42:36.960 |
variables then you need to do this kind of mapping so i will also i will basically copy it i don't 04:42:43.920 |
need to download the file so this will call the model converter.py model converter.py 04:42:52.320 |
and that's it it's just a very big mapping of names and i take it from this comment here on 04:43:00.480 |
github so this is model converter so we need to import this model converter import model converter 04:43:12.560 |
import this model converter basically will convert the names and then we can use the 04:43:17.840 |
load state dict and this will actually map all the names it's now now the names will map with 04:43:22.720 |
each other and this trick makes sure that if there is even one name that doesn't map 04:43:26.880 |
then throw an exception which is what i want because i want to make sure that all the names map 04:44:07.440 |
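Putting the whole loader together, a sketch might look like the following; the module and function names (model_converter, preload_models_from_standard_weights, VAE_Encoder, VAE_Decoder, CLIP, Diffusion) and the dictionary keys are assumptions based on the structure of this project.

```python
# Sketch of the model loader; module, class and function names are assumptions.
import model_converter
from encoder import VAE_Encoder
from decoder import VAE_Decoder
from clip import CLIP
from diffusion import Diffusion

def preload_models_from_standard_weights(ckpt_path: str, device):
    # Convert the names of the pre-trained checkpoint into the names used by our classes.
    state_dict = model_converter.load_from_standard_weights(ckpt_path, device)

    encoder = VAE_Encoder().to(device)
    encoder.load_state_dict(state_dict['encoder'], strict=True)

    decoder = VAE_Decoder().to(device)
    decoder.load_state_dict(state_dict['decoder'], strict=True)

    diffusion = Diffusion().to(device)
    diffusion.load_state_dict(state_dict['diffusion'], strict=True)

    clip = CLIP().to(device)
    clip.load_state_dict(state_dict['clip'], strict=True)

    return {
        'clip': clip,
        'encoder': encoder,
        'decoder': decoder,
        'diffusion': diffusion,
    }
```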
then we do clip is equal to clip dot to device so we move it to device where we 04:44:13.360 |
want to work and then we load also his state dict so the parameters of the weights 04:44:35.760 |
of clip. And then we return a dictionary: the encoder is the encoder, the decoder 04:44:42.960 |
is the decoder, then the diffusion is the diffusion, etc. 04:44:51.600 |
now we have all the ingredients to run finally the inference guys so thank you for being patient so 04:44:58.800 |
much and it's really finally we have we can see the light coming so let's build our notebook so we 04:45:07.680 |
can visualize the image that we will build okay let's select the kernel stable diffusion i already 04:45:16.640 |
created it in my repository you will also find the requirements that you need to install in order to 04:45:23.360 |
run this so let's import everything we need so the model loader the pipeline 04:45:30.000 |
from PIL import Image, this is how to load an image in Python; then from pathlib import... actually this one 04:45:41.040 |
we don't need transformers this is the only library that we will be using because there 04:45:48.000 |
is the tokenizer of the clip so how to tokenize the the text into tokens before sending it to 04:45:54.080 |
the clip embeddings otherwise we also need to build the tokenizer and it's really a lot of 04:45:59.120 |
work. I don't allow CUDA and I also don't allow MPS, but you can activate these two. 04:46:19.920 |
If you enable the ALLOW_CUDA variable, then the device becomes cuda, of course. 04:47:11.680 |
okay let's load the tokenizer tokenizer is the clip tokenizer we need to tell him what is the 04:47:17.760 |
vocabulary file so which is already saved here in the data data vocabulary.json and then also the 04:47:25.440 |
merges file maybe one day i will make a video on how the tokenizer works so we can build also the 04:47:32.320 |
tokenizer but this is something that requires a lot of time i mean and it's not really related 04:47:38.880 |
to the diffusion model so that's why i didn't want to build it the model file is i will use the data 04:47:45.920 |
and then this file here then we load the model so the models are model loader dot preload model from 04:47:54.480 |
the model file into this device that we have selected okay let's build from text to image 04:48:02.640 |
what we need to define the prompt for example i want a cat 04:48:08.160 |
sitting or stretching let's say stretching on the floor highly detailed we need to create a 04:48:18.560 |
prompt that will create a good image so we need to add some a lot of details ultra sharp cinematic 04:48:25.360 |
etc etc 8k resolution the unconditioned prompt 04:48:32.720 |
I keep it blank. You can also use it as a negative 04:48:42.720 |
prompt: so if you don't want the output to have some, how to say, some 04:48:50.800 |
characteristics you can define it in the negative prompt of course i like to do cfg so the 04:48:57.120 |
classifier free guidance which we set to true cfg scale is a number between 1 and 14 which 04:49:05.760 |
indicates how much attention we want the model to pay to this prompt 14 means pay 04:49:10.560 |
very much attention or 1 means we pay very little attention i use 7 04:49:17.920 |
then we can define also the parameters for image to image 04:49:20.480 |
so input image is equal to none image path is equal to i will define it with my 04:49:31.440 |
image of the dog which i already have here and um but for now i don't want to load it 04:49:38.960 |
so if we want to load it we need to do input image is equal to image.open 04:49:48.480 |
i will not use it so now let's comment it and if we use it we need to define the strength 04:49:56.960 |
so how much noise we want to add to this image but for now let's not use it 04:50:00.080 |
the sampler we will be using of course is the only one we have is the ddpm 04:50:10.880 |
50 and the seed is equal to 42 because it's a lucky number at least according to some books 04:50:19.280 |
output image is equal to pipeline generate okay the prompt is the prompt that we have defined 04:50:31.280 |
the unconditioned prompt is the unconditioned prompt that we have defined 04:50:36.240 |
input image is the input image that we have defined if it's not commented of course 04:50:53.200 |
the sampler name is the sampler name we have defined 04:51:01.760 |
the number of inference steps is the number of inference steps the seed 04:51:14.720 |
idle device is our cpu so when we don't want to use something we move it to the cpu 04:51:27.200 |
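So the full call in the notebook might look like this sketch; the exact parameter names of pipeline.generate are assumptions based on the arguments listed above.

```python
# Sketch of the inference call; parameter names of pipeline.generate are assumptions.
output_image = pipeline.generate(
    prompt=prompt,
    uncond_prompt=uncond_prompt,          # empty string, or a negative prompt
    input_image=input_image,              # None for text-to-image
    strength=strength,                    # only used for image-to-image
    do_cfg=do_cfg,
    cfg_scale=cfg_scale,
    sampler_name=sampler,                 # "ddpm"
    n_inference_steps=num_inference_steps,
    seed=seed,
    models=models,
    device=DEVICE,
    idle_device="cpu",                    # move unused models to the CPU
    tokenizer=tokenizer,
)
Image.fromarray(output_image)
```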
And then Image.fromarray(output_image). If everything is done well, if all the code has 04:51:37.680 |
been written correctly you can always go back to my repository and download the code if you 04:51:43.280 |
don't want to write it by yourself let's run the code and let's see what is the result my 04:51:48.880 |
computer will take a while so it will take some time so let's run it so if we run the code it will 04:51:56.800 |
generate an image according to our prompt in my computer it took really a long time so i cut the 04:52:01.520 |
video and i actually already replaced the code with the one from my github because now i want 04:52:08.240 |
to actually explain you the code without while showing you all the code together how does it 04:52:14.320 |
work so now we we generated an image using only the prompt i use the cpu that's why it's very slow 04:52:20.400 |
because my GPU is not powerful enough. We set the unconditioned prompt to an empty string, we are using 04:52:25.840 |
classifier-free guidance with a scale of seven. So let's go into the pipeline and let's see 04:52:30.880 |
what happens so basically because we are doing the classifier free guidance we will generate 04:52:36.400 |
two conditioning signals one with the prompt and one with empty text which is the unconditioned 04:52:43.280 |
prompt which is also called the negative prompt this will result in a batch size of two that will 04:52:50.560 |
run through the unit so let's go back to here suppose we are doing text to image so now our 04:52:56.240 |
UNet has two latents that it is processing at the same time, because we have the batch size equal to two. 04:53:01.680 |
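For reference, here is a sketch of how the two predictions are then combined by classifier-free guidance inside the denoising loop; the variable names, and the order of the two halves of the batch, are assumptions.

```python
# Sketch of the classifier-free guidance combination; names are assumptions.
if do_cfg:
    # The batch of size 2 contains the conditioned and the unconditioned prediction.
    output_cond, output_uncond = model_output.chunk(2)
    model_output = cfg_scale * (output_cond - output_uncond) + output_uncond
```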
And for each of them it is predicting the noise. But how can we remove this 04:53:10.160 |
predicted noise from the initial noise? Because to generate an image we start from random 04:53:17.680 |
noise in the latent space, along with the prompt, so we have a latent which is still pure 04:53:24.960 |
noise, and with the UNet we predict how much noise is in it, according to a schedule. So 04:53:31.040 |
according to 50 steps that of inferencing that we will be doing at the beginning the first step will 04:53:37.360 |
be 1000 the next step will be 980 the next step will be 960 etc so this time will change according 04:53:44.720 |
to this schedule so that at the 50th step we are at the time step 0 and how can we then with the 04:53:55.440 |
predicted noise go to the next latent, so remove this noise that was predicted by the UNet? Well, we 04:54:01.840 |
do it with the sampler, and in particular we do it with the step method 04:54:09.040 |
of the sampler, which basically will calculate the previous sample given the current sample, 04:54:14.640 |
according to the formula number 7 here so which basically calculates the previous sample 04:54:21.120 |
given the current one so the less noisy one given the current one and the predicted x0 so this is 04:54:27.840 |
not x0 because we don't have x0 so we don't have the noise the sample without any noise so but we 04:54:34.880 |
can predict it given the values of the current noise and the beta schedule another way of denoising 04:54:42.480 |
is to do the sampling like this: if you watch my other repository about the DDPM paper, I actually 04:54:48.000 |
implemented it like this if you want to see this version here and this is how we remove the noise 04:54:54.160 |
to get a less noisy version so once we get the less noisy version we keep doing this process 04:54:59.920 |
until there is no more noise so we are at the time step zero in which we have no more noise 04:55:04.800 |
we give this latent to the decoder which will turn it into an image this is how the text to image 04:55:09.920 |
works the image to image on the other side so let's try to do the image to image so to do the 04:55:15.200 |
image to image we need to go here and we uncomment this code here this allows us to start with the 04:55:25.680 |
dog and then give for example some prompt for example we want this dog here we want to say 04:55:31.840 |
okay we want a dog stretching on the floor highly detailed etc we can run it i will not run it 04:55:38.560 |
because it will take another five minutes and if we do this we can set a strength of let's say 0.6 04:55:45.760 |
which means that let's go here so we set a strength of 0.6 so we have this input image 04:55:54.000 |
A strength of 0.6 means that we will encode the image with the variational autoencoder, it will 04:55:59.680 |
become a latent, and we will add some noise. But how much noise? Not all the noise, so that it becomes 04:56:06.560 |
completely noise, but less than that. So let's say 60 percent noise; this is not really 04:56:14.080 |
accurate, because it depends on the schedule, but in our case it's linear, so it can be considered 04:56:21.040 |
60 percent of noise we then give this image to the scheduler which will start not from the 1000 04:56:28.240 |
step it will start before so if we set the strength to 0.6 it will start from the 600 04:56:34.800 |
step and then move by 20 we'll keep going 600 then 580 then 560 then 540 etc until it reaches 20 04:56:46.640 |
so in total it will do less steps because we start from a less noisy example but at the same 04:56:52.560 |
time, because we start with less noise, the UNet also has less freedom to alter the 04:57:00.720 |
image, because it already has the image, so it cannot change it too much. So how do you adjust 04:57:07.040 |
the noise level? It depends: if you want the UNet to pay very much attention to the input 04:57:13.440 |
image and not change it too much, then you add less noise. If you want to completely change the 04:57:20.800 |
original image then you can add all the possible noise so you set the strength to one and this is 04:57:25.280 |
how the image to image works i didn't implement the inpainting because the reason is that the 04:57:32.880 |
pre-trained model here so the model that we are using is not fine-tuned for inpainting so if you 04:57:38.080 |
go on the website and you look at the model card they have another model for inpainting which has 04:57:45.280 |
different weights here the this one here but this the structure of this model is also a little 04:57:53.120 |
different because they have in the unit they have five additional input channels for the mask 04:57:58.800 |
i will of course implement it in my repository directly so i will modify the code and 04:58:07.440 |
also implement the code for inpainting so that we can support this model but unfortunately i don't 04:58:12.960 |
have the time now, because here in China it is Guoqing (the National Day holiday) and I'm going to my laojia (hometown) with my wife, so we are 04:58:19.760 |
a little short of time but i hope that with my video guys you you got really into stable diffusion 04:58:26.240 |
and you understood what is happening under the hood instead of just using the hugging face library 04:58:31.440 |
and also notice that the model itself is not so particularly sophisticated if you check the 04:58:39.520 |
decoder and the encoder they are just a bunch of convolutions and upsampling and the normalizations 04:58:47.200 |
just like any other computer vision model, and the same goes for the UNet. Of course there are very 04:58:53.280 |
smart choices in how they do it okay but that's not the important thing of the diffusion and 04:58:59.440 |
actually if we study the diffusion models like score models you will see that it doesn't even 04:59:03.680 |
matter the structure of the model as long as the model is expressive it will actually learn the 04:59:09.040 |
score function in the same way but this is not our case in this video i will talk about score model 04:59:14.000 |
in future videos what i want you to understand is that how this all mechanism works together 04:59:20.400 |
so how can we just learn a model that predicts the noise and then we come up with images and 04:59:28.160 |
let me rehearse again the idea so we started by training a model that needs to learn a probability 04:59:35.760 |
distribution, as you remember, p theta here. We cannot learn this one directly, because we don't 04:59:43.120 |
know how to marginalize here. So what we did is we found a lower bound for this quantity here, and 04:59:49.120 |
we maximize this lower bound how do we maximize this lower bound by training a model by running 04:59:54.800 |
the gradient descent on this loss this loss produces a model that allow us to predict the 05:00:03.760 |
noise then how do we actually use this model with the predicted noise to go back in time with the 05:00:10.880 |
noise because the forward process we know how to go it's defined by us how to add noise but in back 05:00:15.680 |
in time so how to remove noise we don't know and we do it according to the formulas that i have 05:00:20.880 |
described in the sampler so the formula number seven and the formula number also this one actually 05:00:27.360 |
we can use actually i will show you in my other um here i have another repository i think it's 05:00:33.760 |
called python ddpm in which i implemented the ddpm paper but by using this algorithm here so 05:00:39.840 |
if you are interested in this version of the denoising you can check my other uh repository 05:00:44.720 |
here this one ddpm and i also wanted to show you how the inpainting works how the how the 05:00:52.720 |
image to image and how the text to image works of course the possibilities are limitless it all 05:00:59.600 |
depends on the powerfulness of the model and how you use it and i hope you use it in a clever way 05:01:07.200 |
to build amazing products i also want to thank very much many repositories that i have used 05:01:13.600 |
as a self-studying material so because of course i didn't make up all this by myself i studied a 05:01:18.720 |
lot of papers i read i think to study this diffusion models i read more than 30 papers 05:01:23.600 |
in the last few weeks so it took me a lot of time but i was really passionate about this kind of 05:01:30.240 |
models because they're complicated and i really like to study things that can generate new stuff 05:01:35.600 |
so i want to really thank in particularly some resources that i have used let me see 05:01:42.880 |
this one's here so the official code the this guy divam gupta this other repository from 05:01:49.200 |
this person here which i used very much actually as a base and the diffusers library from this 05:01:56.640 |
hugging face upon which i based most of the code of my sampler because i think it's better to use 05:02:03.040 |
because we are actually just applying some formulas there is no point in writing it from 05:02:06.960 |
zero the point is actually understanding what is happening with these formulas and why we are doing 05:02:11.200 |
it the things we are doing and as usual the full code is available i will also make all the slides 05:02:16.480 |
available for you guys and i hope if you are in china you also have a great holiday with me and 05:02:21.920 |
if you're not in china i hope you have a great time with your family and friends and everyone 05:02:25.840 |
else so welcome back to my channel anytime and please feel free to comment on send me a comment 05:02:31.840 |
or if you didn't understand something or if you want me to explain something better because i'm 05:02:37.520 |
always available for explanation and guys i do this not as my full-time job of course i do it 05:02:43.920 |
as a part-time and lately i'm doing consulting so i'm very busy but sometime i take time to record 05:02:51.040 |
videos and so please share my channel share my video with people if you like it and so that my 05:02:57.680 |
channel can grow and i have more motivation to keep doing this kind of videos which take really 05:03:02.800 |
a lot of time because to prepare a video like this i spend around many weeks of research but 05:03:08.560 |
this is okay i do it for as a passion i don't do it as a job and i spend really a lot of time 05:03:14.640 |
preparing all the slides and preparing all the speeches and preparing the code and cleaning it 05:03:20.080 |
and commenting it etc etc i always do it for free so if you would like to support me the best way is 05:03:26.400 |
to subscribe like my video and share it with other people thank you guys and have a nice day