
Coding Stable Diffusion from scratch in PyTorch


Chapters

0:00 Introduction
4:30 What is Stable Diffusion?
5:40 Generative Models
12:07 Forward and Reverse Process
17:44 ELBO and Loss
20:30 Generating New Data
22:20 Classifier-Free Guidance
31:00 CLIP
33:20 Variational Auto Encoder
37:26 Text to Image
39:54 Image to Image
41:40 Inpainting
44:30 Coding the VAE
114:50 Coding CLIP
129:10 Coding the Unet
184:40 Coding the Pipeline
233:00 Coding the Scheduler (DDPM)
278:00 Coding the Inference code

Whisper Transcript

00:00:00.000 | Hello guys! Welcome to my new video on how to code stable diffusion from scratch.
00:00:05.360 | And stable diffusion is a model that was introduced last year.
00:00:09.480 | I think most of you are already familiar with it.
00:00:12.360 | And we will be coding it from scratch using PyTorch only.
00:00:16.640 | And as usual my video is going to be quite long,
00:00:19.880 | because we will be coding from scratch and at the same time I will be explaining each part that makes up stable diffusion.
00:00:27.480 | So as usual, let me introduce the topics that we will discuss and the prerequisites for watching this video.
00:00:35.520 | So of course we will discuss stable diffusion because we are going to build it from scratch using only PyTorch.
00:00:41.120 | So no other libraries will be used except for the tokenizer.
00:00:45.720 | I will describe the maths of the diffusion models as defined in the DDPM paper, but I will simplify it as much as possible.
00:00:54.240 | I will show you how classifier-free guidance works and of course we will also implement it,
00:00:59.200 | how the text-to-image works, image-to-image and in-painting.
00:01:03.360 | Of course to have a very complete view of diffusion models actually we should also introduce the score-based models
00:01:08.920 | and all the ODE and SDE theoretical framework.
00:01:12.240 | But most people are not familiar with ordinary differential equations or even stochastic differential equations.
00:01:18.720 | So I will not discuss these topics in this video and I'll leave it for future videos.
00:01:23.960 | So anyway we will have a complete copy of stable diffusion: we will be able to generate images using a prompt,
00:01:32.280 | and also condition on existing images, etc.
00:01:35.680 | But for example the samplers based on the Euler method or Runge-Kutta method will not be built in this video.
00:01:42.560 | I will make a future video in which I describe these ones.
00:01:46.560 | What do I expect you to have as a prerequisite for watching this video?
00:01:50.560 | Well, first of all it's good if you have some notion of probability and statistics,
00:01:55.240 | so at least you know what a Gaussian distribution is,
00:01:58.160 | and what conditional probability, marginal probability, the likelihood, etc. are.
00:02:02.520 | Now I don't expect you to have the mathematical formulation of these concepts in your mind,
00:02:08.520 | but at least the concepts behind them, so at least what we mean by conditional probability or what we mean by marginal probability.
00:02:16.040 | Anyway, even if you're not very strong with mathematics, I will always give a non-mathematical intuition for most concepts.
00:02:22.280 | So even if you don't have this background, you will at least understand the concepts, or some intuition behind them.
00:02:31.000 | And of course I expect you to know Python and PyTorch, at least basic level, because we will be coding using Python and PyTorch.
00:02:39.840 | And then we will be using a lot the attention mechanism,
00:02:43.040 | so if you're not familiar with the transformer model please watch my previous video on the attention and transformer.
00:02:49.800 | And we will also be using a lot of convolutions.
00:02:53.320 | So I don't expect you to know how the convolution layers work mathematically,
00:02:57.400 | but at least what they do on a practical level in a neural network.
00:03:02.720 | Anyway I will also review this while coding.
00:03:06.600 | And because this is going to be a long video, and because stable diffusion and diffusion models in general are quite complex from a mathematical point of view,
00:03:16.840 | we cannot jump directly to the code without explaining what we are going to code and how it works.
00:03:23.120 | The first thing I will do is to give you some background knowledge from a mathematical point of view,
00:03:28.600 | but also from a conceptual point of view of how the diffusion models work and how stable diffusion works.
00:03:36.880 | And then we will build each part one by one.
00:03:41.600 | Of course at the beginning you will have a lot of ideas that are kind of confused because I will give you a lot of new concepts to grasp.
00:03:49.800 | And it's normal that you don't understand everything at the beginning.
00:03:52.920 | But don't worry because while coding I will repeat each concept more than once.
00:03:57.840 | So while coding you will also get a practical knowledge of what each part is doing and how they interact with each other.
00:04:05.360 | So please don't be scared if you don't understand everything in the beginning part of this video.
00:04:10.560 | Later when we start coding it everything will make sense to you.
00:04:14.440 | But we need this initial part because otherwise we cannot just jump in the dark and start coding without knowing what we are going to code.
00:04:22.480 | So let's start our journey.
00:04:26.800 | So what is stable diffusion?
00:04:28.720 | Stable diffusion is a model that was introduced in 2022, so last year, at the end of last year I remember,
00:04:36.080 | by the CompVis group at the Ludwig Maximilian University in Munich, Germany.
00:04:40.680 | And it's open source, the weights, the pre-trained weights can be found on the Internet.
00:04:45.560 | And it became very famous because people started doing a lot of stuff, building projects and products with stable diffusion.
00:04:53.240 | And one of the most simple use of stable diffusion is to do text to image.
00:04:58.320 | So given a prompt we want to generate an image.
00:05:01.040 | We will also see how image to image works and also how in-painting works.
00:05:05.840 | Image to image means that you already have a picture, for example, of a dog and you want to change it a little bit by using a prompt.
00:05:12.280 | For example, you want to ask the model to add the wings to the dog so that it looks like a flying dog.
00:05:17.880 | Or in-painting means that you remove some part of the image.
00:05:21.920 | For example, you can remove, I don't know, this part here and you ask the model to replace it with some other part that makes sense,
00:05:30.000 | that is coherent with the image. And we will see also how this works.
00:05:35.800 | Let's jump into generative models because diffusion models are generative models.
00:05:41.840 | But what is a generative model?
00:05:44.120 | Well, a generative model learns a probability distribution of the data such that we can then sample from the distribution to create new instances of the data.
00:05:54.640 | For example, if we have many pictures of cats or dogs or whatever we have,
00:05:59.640 | we can train a generative model on it and then we can sample from this distribution to create new images of cats or dogs or whatever.
00:06:08.560 | And this is exactly what we do with stable diffusion.
00:06:11.080 | We actually have a lot of images, we train it on a massive amount of images,
00:06:15.720 | and then we sample from this distribution to generate new images that don't exist in our training set.
00:06:22.320 | But the question may arise in your mind is why do we model data as distributions, as probability distributions?
00:06:30.240 | Well, let me give you an example. Imagine you are a criminal and you want to generate thousands of fake identities.
00:06:38.200 | Imagine you also live in a very simple world and each fake identity is made up of variables representing the characteristic of a person.
00:06:45.520 | So age and height. Suppose we only have two variables that make up a person.
00:06:49.600 | So it's the age of the person and the height of the person.
00:06:53.520 | In my case, I will be using the centimetre for the height. I think the Americans can convert it to feet.
00:06:59.800 | And so how do we proceed if we are a criminal with this goal?
00:07:04.120 | Well, we can ask the statistics department of the government to give us some statistics about the age and the height of the population.
00:07:11.320 | This information you can easily find online, for example. And then we can sample from this distribution.
00:07:17.800 | For example, if we model the age of the population like a Gaussian with the mean of 40 and the variance of 30.
00:07:25.840 | OK, these numbers are made up. I don't know if they reflect the reality.
00:07:29.640 | And the height in centimetres is 120 as mean and the variance is 100.
00:07:38.120 | We get these two distributions. Then we can sample from these two distributions to generate a fake identity.
00:07:45.240 | What does it mean to sample from a distribution? To sample from this kind of distribution means to throw a coin,
00:07:52.280 | a very special coin that has a very high chance of falling in this area, a lower chance of falling in this area,
00:08:00.200 | an even lower chance of falling in this area and a very nearly zero chance of falling in this area.
00:08:07.240 | So imagine we flip this coin once for the age, for example, and it falls here.
00:08:12.880 | So it's quite probable, not very probable, but quite probable.
00:08:16.760 | So suppose the age is three, and let me write it down. So the age, let's say, is three.
00:08:27.960 | And then we toss this coin again, and the coin falls, let's say, here.
00:08:35.320 | So the height is, let's say, one hundred and thirty. One hundred thirty centimetres.
00:08:43.560 | So as you can see, the combination of age and height is quite improbable in reality.
00:08:49.960 | I mean, no three years old is one metre and thirty centimetres high.
00:08:55.160 | I mean, at least not the ones I know. So this combination of age and height is not plausible at all.
00:09:02.600 | So to produce plausible pairs, we actually need to model these two variables.
00:09:07.320 | So the age and height, not as independent variables and sample from each of them independently,
00:09:12.840 | but as a joint distribution. And usually we represent the joint distribution like this,
00:09:19.080 | where each combination of age and height has a probability score associated with it.
00:09:25.000 | And from this distribution, we only sample using one coin.
00:09:29.240 | And for example, this coin will have a very high probability with very high chance will fall in this area,
00:09:35.160 | with less chance will fall in this area and very close to zero chance of falling in this area.
00:09:40.440 | Suppose we throw the coin and it ends up in this area.
00:09:44.680 | Suppose this axis is the age and this one is the height; to get the corresponding age and height,
00:09:50.280 | we just need to do like this. And suppose these are actually a realistic age and a realistic height.
00:09:55.080 | Now, the numbers here actually do not match, but you got the idea: to model something,
00:10:01.080 | we need a joint distribution over all the variables.
00:10:03.960 | And this is actually what we do also with our images.
00:10:06.680 | With our images, we create a very complex distribution in which, for example,
00:10:12.680 | each pixel is a distribution and the entirety of all the pixels are one big joint distribution.
00:10:20.120 | And once we have a joint distribution, we can do a lot of interesting things.
00:10:24.520 | For example, we can marginalize.
00:10:26.440 | So, for example, imagine we have a joint distribution over the age and the height.
00:10:30.840 | So let's call the age X and let's call the height, let's say Y.
00:10:36.200 | So if we have a joint distribution, which means having P of X and Y,
00:10:42.440 | which is defined for each combination of X and Y, we can always calculate P of X.
00:10:48.680 | So the probability of the single variable, by marginalizing over the other:
00:10:53.560 | p(x) = ∫ p(x, y) dy.
00:10:59.320 | And this is how we marginalize,
00:11:02.280 | which means marginalizing over all the possible Y that we can have.
00:11:06.040 | And then we can also calculate the probability, the conditional probability.
00:11:09.960 | For example, we can say that the probability,
00:11:12.360 | what is the probability of the age being, let's say, from 0 to 3,
00:11:19.080 | given that the height is more than 1 meter.
00:11:23.320 | So something like this.
00:11:24.920 | We can do this kind of queries by using the conditional probability.
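As a side note, here is a minimal sketch (with made-up numbers, like the ones above; the variable names are illustrative) of the difference between sampling the two variables independently and sampling them from a joint distribution:

```python
import torch
from torch.distributions import Normal, MultivariateNormal

# Independent sampling (the criminal's first attempt): implausible pairs are common.
age = Normal(40.0, 30.0 ** 0.5).sample()       # mean 40, variance 30
height = Normal(120.0, 100.0 ** 0.5).sample()  # mean 120, variance 100

# Joint sampling: the covariance couples age and height, so the pairs come out plausible.
mean = torch.tensor([40.0, 120.0])
covariance = torch.tensor([[30.0, 45.0],
                           [45.0, 100.0]])  # positive correlation between the two variables
age_j, height_j = MultivariateNormal(mean, covariance).sample()
```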
00:11:28.760 | So this is actually what we do with the generative model.
00:11:31.400 | We model our data as a very big joint distribution.
00:11:34.920 | And then we learn the parameters of this distribution,
00:11:39.800 | because it's a very complex distribution.
00:11:41.480 | So we let the neural network learn the parameters of this distribution.
00:11:45.160 | And our goal, of course, is to learn this very complex distribution
00:11:51.000 | and then sample from it to generate new data,
00:11:53.720 | just like the criminal before wanted to generate new fake identities
00:11:58.360 | by modeling the very complex distribution
00:12:00.680 | that represents the identity of a person.
00:12:03.000 | In our case, we will model our system as a joint distribution
00:12:08.680 | by including also some latent variables.
00:12:10.920 | So let me describe.
00:12:11.880 | As you probably are familiar with the diffusion models,
00:12:16.200 | we have two processes.
00:12:17.560 | One is called the forward process
00:12:19.480 | and one is called the reverse process.
00:12:21.960 | The forward process means that we have our initial image
00:12:25.000 | that we will call X0, so this here,
00:12:28.760 | and we add noise to it to get another image
00:12:32.120 | that is the same as the previous one,
00:12:35.240 | but with some noise on top of it.
00:12:37.240 | Then we take this image, which has a little noise,
00:12:40.920 | and we generate a new image that is same as the previous one,
00:12:44.440 | but with even more noise.
00:12:45.960 | So as you can see, this one has even more noise,
00:12:48.520 | and so on, so on, so on,
00:12:49.880 | until we arrive to the last latent variable called Zt,
00:12:53.960 | where t is equal to 1000,
00:12:57.480 | when it becomes completely noise, pure noise,
00:13:01.560 | like N(0, 1), actually N(0, I),
00:13:05.960 | because we are in the multivariate world.
00:13:09.080 | And our goal, actually, is to...
00:13:13.160 | this process, this forward process is fixed.
00:13:16.520 | So we define how to build the noisified version
00:13:20.280 | of each image given the previous one,
00:13:22.600 | so we know how to add noise,
00:13:24.520 | and we have a specific formula, an analytical formula,
00:13:27.720 | on how to add noise to an image.
00:13:30.600 | The problem is, we don't have the analytical formula
00:13:34.200 | to reverse this process,
00:13:35.720 | so we don't know how to take this one
00:13:38.280 | and just remove noise.
00:13:40.200 | There is no closed formula on how to do it,
00:13:43.640 | so we learn, we train a neural network
00:13:47.720 | to do this inverse process,
00:13:49.800 | to remove noise from something that has noise.
00:13:52.840 | And if you think about it,
00:13:54.200 | it is quite easy to add noise to something
00:13:56.840 | than it is to remove noise from something.
00:13:59.720 | That's why we are using a neural network for this purpose.
00:14:02.680 | Now we need to go inside, of course, of the math,
00:14:06.840 | because we will be using it not only to write the code,
00:14:09.720 | but also to write the sampler.
00:14:11.080 | And in the sampler, it's all about mathematics.
00:14:13.800 | And I will try to simplify it as much as possible,
00:14:16.840 | so don't be scared.
00:14:18.360 | So let's start.
00:14:19.160 | Okay, this is from the DDPM paper,
00:14:22.280 | so the Denoising Diffusion Probabilistic Models paper,
00:14:24.600 | from Ho et al. in 2020.
00:14:27.480 | And here we have two processes.
00:14:29.240 | The first is the forward process,
00:14:31.720 | which means that given the original image,
00:14:36.440 | how can I generate the noisified version of this image
00:14:40.680 | at time step t?
00:14:42.360 | In this case, actually, this is the joint distribution.
00:14:45.000 | Let's look at this one here.
00:14:49.000 | This means if I have the image at time step t minus one,
00:14:53.560 | how can I get the next time step,
00:14:55.400 | so the more noisified version of this image?
00:14:58.920 | Well, we define it as a Gaussian distribution centered,
00:15:04.200 | so the mean centered on the previous one,
00:15:06.600 | and the variance defined by this beta parameter here.
00:15:10.520 | This beta parameter here is decided by us,
00:15:13.960 | and it means how much noise we want to add
00:15:16.920 | at every step of this noisification process.
00:15:19.560 | This is also known as the Markov chain of noisification,
00:15:23.320 | because each variable is conditioned on the previous one.
00:15:26.840 | So to get xt, we need to have xt minus one.
00:15:30.600 | And as you can see from here,
00:15:32.520 | we start from x0, we go to x1.
00:15:36.120 | Here I call it z1 to differentiate it,
00:15:38.360 | but x1 actually is equal to z1.
00:15:40.920 | So x0 is the original image,
00:15:43.000 | and all the next x are noisy versions,
00:15:46.360 | with xt being the most noisy.
00:15:48.760 | So this is called the Markov chain of noisification,
00:15:53.240 | and we can do it like this.
00:15:56.840 | So it's defined by us as a process,
00:15:59.800 | which is a series of Gaussians that add noise.
00:16:02.680 | There is an interesting formula here.
00:16:05.880 | This is a closed-form formula
00:16:08.440 | to go from the original image to any image at time step t,
00:16:13.320 | without calculating all the intermediate images,
00:16:16.120 | using this particular parametrization.
00:16:19.480 | So we can go from the image, original image,
00:16:23.400 | to the image at time step t,
00:16:25.000 | by sampling from this distribution,
00:16:29.480 | by defining the distribution like this.
00:16:31.160 | So with this mean and with this variance.
00:16:33.800 | This mean here depends on a parameter, alpha, alpha bar,
00:16:38.440 | which is actually depending on beta.
00:16:40.440 | So it's something that we know,
00:16:42.040 | there is nothing we have to learn.
00:16:43.720 | And also the variance actually depends on alpha,
00:16:46.360 | which is defined as a function of beta.
00:16:49.400 | So beta is also something we know,
00:16:51.240 | so there are no parameters to learn here.
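As a hedged sketch (the variable names are mine; the 1000 steps and the beta range follow the DDPM paper), the closed formula above says x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, with epsilon ~ N(0, I):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # how much noise is added at each step
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # the alpha bar that appears in the mean and variance

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Go from the original image x0 directly to its noisified version at time step t."""
    eps = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
```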
00:16:53.080 | Now let's look at the reverse process.
00:16:55.880 | The reverse process means that we have something noisy,
00:16:59.240 | and we want to get something less noisy.
00:17:01.400 | So we want to remove noise.
00:17:03.480 | And we also define it as a Gaussian,
00:17:05.960 | with a mean, mu theta, and a variance, sigma theta.
00:17:10.600 | Now, this mean and this variance are not known to us.
00:17:16.920 | We have to learn them.
00:17:18.520 | And we will use a neural network to learn these two parameters.
00:17:23.000 | Actually, the variance, we will also set this at fixed.
00:17:27.800 | We will parameterize it in such a way that this variance actually is fixed.
00:17:32.360 | So we hypothesize, we already know the variance.
00:17:34.920 | And we let the network learn only the mean of this distribution.
00:17:38.680 | So to rehearse, we have a forward process that adds noise.
00:17:42.680 | And we know everything about this process.
00:17:44.280 | We know how to add noise.
00:17:45.800 | We have a reverse process that we don't know how to denoise.
00:17:49.640 | So we let a network learn the parameters on how to denoise it.
00:17:54.680 | And OK, now that we have defined these two processes,
00:17:58.360 | how do we actually train a model to do it?
00:18:01.560 | Because as you remember, our initial goal is actually to learn
00:18:05.000 | a probability distribution over our data set.
00:18:09.000 | And so this quantity here.
00:18:12.040 | But unlike before, when we could marginalize, for example,
00:18:15.880 | in the case of the criminal who want to generate identities,
00:18:20.120 | we could marginalize over all the variables.
00:18:23.320 | Here we cannot marginalize.
00:18:24.600 | Because we need to marginalize over x1, x2, x3, x4, and so on, up to xT.
00:18:28.920 | So over a lot of variables.
00:18:30.520 | And to calculate this integral means to calculate it over all the possible x1.
00:18:36.600 | And over all the possible x2, et cetera.
00:18:38.760 | So it's a very complex calculation that is computationally intractable, we say.
00:18:44.040 | It means that it's theoretically possible.
00:18:46.440 | But practically, it will take forever.
00:18:48.600 | So we cannot use this route here.
00:18:52.040 | So what can we do?
00:18:53.320 | We want to learn this quantity here.
00:18:55.080 | So we want to learn the parameter theta of this to maximize the likelihood we can see here.
00:19:00.520 | What we did is we found a lower bound for this quantity here.
00:19:06.520 | So the quantity, the likelihood.
00:19:08.520 | And this lower bound is called the elbow.
00:19:11.160 | And if we maximize the lower bound, it will also maximize the likelihood.
00:19:16.040 | So let me give you a parallel example on what it means to maximize the lower bound.
00:19:21.800 | For example, imagine you have a company.
00:19:24.920 | And your company has some revenue.
00:19:26.840 | And usually, the revenue is more than or equal to the sales of your company.
00:19:33.560 | So you have some revenue coming from sales.
00:19:35.640 | Maybe you also have some revenue coming from interest that you get from your bank, et cetera.
00:19:41.000 | But we can for sure say that the revenue of your company is more than or equal to the sales of your company.
00:19:47.240 | So if you want to maximize your revenue, you can maximize your sales, for example,
00:19:52.280 | which is a lower bound over your revenue.
00:19:55.480 | So if we maximize the sales, we will also maximize the revenue.
00:19:58.200 | And this is the idea here.
00:19:59.400 | But how do we do it on a practical level?
00:20:03.880 | Well, this is the training code for the DDPM diffusion models as defined by the DDPM paper.
00:20:13.240 | And basically, the idea is after we get the ELBO,
00:20:17.000 | we can parameterize the loss function as this.
00:20:20.360 | Which says that we need to learn-- we need to train a network called epsilon theta.
00:20:27.800 | That given a noisy image-- so this formula here
00:20:31.480 | means the noisy image at time step t and the time step at which the noise was added,
00:20:37.800 | the network has to predict how much noise is in the image, the noisified image.
00:20:43.240 | And if we do gradient descent over this loss function here,
00:20:50.600 | we will maximize the ELBO.
00:20:53.400 | And at the same time, we will also maximize the log likelihood of our data.
00:20:59.880 | And this is how we train these kind of networks.
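A hedged sketch of what one such training step could look like in code (the names are illustrative; `alpha_bar` and `T` are the schedule values from the previous sketch, and `model` is the noise-prediction network, which in stable diffusion will be the U-Net):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, optimizer):
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random time step per image
    eps = torch.randn_like(x0)                                  # the noise we add
    ab = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps              # noisified image at time step t
    eps_pred = model(x_t, t)                                    # network predicts how much noise is there
    loss = F.mse_loss(eps_pred, eps)                            # the simplified DDPM loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```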
00:21:03.000 | Now, I know that this is a lot of concept that you have to grasp.
00:21:06.920 | So don't worry.
00:21:07.960 | For now, just remember that there is a forward process and there is a reverse process.
00:21:11.400 | And to train this network to do the reverse process,
00:21:15.000 | we need to train a network to detect how much noise
00:21:18.280 | is in a noisified version of the image at time step t.
00:21:21.560 | Let me show you how do we-- once we have this network that has already been trained,
00:21:29.800 | how do we actually sample to generate new data?
00:21:32.280 | So let's go here.
00:21:35.560 | Let's go here.
00:21:36.920 | So how do we generate new data?
00:21:38.600 | Suppose we already have a network that was trained for detecting how much noise is in there.
00:21:44.680 | And what we do is we start from complete noise.
00:21:47.720 | And then we ask the network to detect how much noise is in there.
00:21:52.040 | We remove this noise.
00:21:53.400 | And then we ask the network again how much noise is in there.
00:21:56.440 | And we remove it.
00:21:57.640 | And then we ask the network how much noise is there.
00:22:00.280 | OK, remove it.
00:22:02.040 | Then how much noise is here?
00:22:03.480 | OK, remove it, et cetera, et cetera.
00:22:05.160 | Until we reach this step, then here we will have something new.
00:22:09.560 | So if we start from pure noise and we do this reverse process many times,
00:22:13.720 | we will end up with something new.
00:22:15.480 | And this is the idea behind this generative model.
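A hedged sketch of this sampling loop (roughly Algorithm 2 of the DDPM paper; `betas`, `alphas`, `alpha_bar` and `T` are the schedule values from the earlier sketch):

```python
import torch

@torch.no_grad()
def sample(model, shape):
    x = torch.randn(shape)                               # start from pure noise
    for t in reversed(range(T)):
        eps_pred = model(x, torch.tensor([t]))           # ask the network: how much noise is in here?
        coef = betas[t] / (1.0 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps_pred) / alphas[t].sqrt()  # remove the predicted noise
        noise = torch.randn_like(x) if t > 0 else 0.0    # no extra noise at the very last step
        x = mean + betas[t].sqrt() * noise
    return x                                             # something new, generated from pure noise
```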
00:22:19.160 | Now that we know how to generate new data starting from pure noise,
00:22:24.520 | we also want to be able to control this denoisification process
00:22:29.160 | so we can generate images of something that we want.
00:22:32.280 | I mean, how can we tell the model to generate a picture of a cat
00:22:36.120 | or a picture of a dog or a picture of a house by starting from pure noise?
00:22:40.840 | Because as of now, by starting from pure noise and keep denoising,
00:22:46.360 | we will generate a new image, of course.
00:22:48.680 | But it's not like we can control which new image will be generated.
00:22:52.440 | So we need to find a way to tell the model what we want in this generational process.
00:22:57.160 | And the idea is that we start from pure noise.
00:23:01.800 | And during this chain of removing noise, so denoisification,
00:23:07.640 | we introduce a signal.
00:23:09.400 | Let's call it prompt.
00:23:11.400 | Prompt.
00:23:12.620 | Or it can also be called the conditioning signal.
00:23:16.440 | Or it can also be called the context.
00:23:18.920 | Anyway, they are the same concept.
00:23:21.720 | In which we influence the model into how to remove the noise
00:23:26.280 | so that the output will move towards what we want.
00:23:29.640 | To understand how this works, let's review again
00:23:33.560 | how the training of this kind of networks works.
00:23:36.200 | Because this is very important for us.
00:23:37.880 | To learn how the training of this kind of network goes,
00:23:40.920 | so that we can introduce the prompt.
00:23:42.840 | Let's go back.
00:23:46.760 | Okay, as I told you before, our final goal is to model a distribution,
00:23:53.560 | theta, p of theta, such that we maximize the likelihood of our data.
00:23:58.520 | And to learn this distribution, we maximize the ELBO, so the lower bound.
00:24:07.000 | But how do we maximize the ELBO?
00:24:09.240 | We minimize this loss, minimize this loss here.
00:24:14.840 | So by minimizing this loss, we maximize the ELBO,
00:24:17.960 | which in turn learns this distribution here.
00:24:21.400 | Because this ELBO here is the lower bound
00:24:25.800 | for the likelihood of our data distribution here.
00:24:29.000 | And what is this loss function?
00:24:32.440 | Loss function here indicates that we need to create a model, epsilon theta,
00:24:37.160 | such that if we give this model a noisified image at a particular noise level,
00:24:44.200 | and we also tell him what noise level we included in this image,
00:24:47.880 | the network has to predict how much noise is there.
00:24:51.880 | So this epsilon is how much noise we have added.
00:24:54.520 | And we can do a gradient descent on this training loop.
00:25:00.280 | This way we will learn a distribution of our data.
00:25:06.040 | But as you can see, this distribution doesn't include anything
00:25:09.560 | that tells the model what is a cat, or what is a dog, or what is a house.
00:25:14.120 | The model is just learning how to generate pictures that make sense,
00:25:18.280 | that are similar to our initial training data.
00:25:21.000 | But they don't know what is the relationship between that picture and the prompt.
00:25:24.680 | So one idea could be, OK, can we learn a joint distribution of our initial data,
00:25:33.400 | so all the images, and the conditioning signal, so the prompt?
00:25:38.120 | Well, this is also something that we don't want,
00:25:40.120 | because we want to actually learn this distribution,
00:25:42.680 | so that we can sample and generate new data.
00:25:45.240 | We don't want to learn the joint distribution
00:25:47.000 | that will be too much influenced by the context,
00:25:50.040 | and the model may not learn the generative process of the data.
00:25:53.320 | So our final goal is always this one.
00:25:55.560 | But we also want to find some how to condition this model
00:25:59.240 | into building something that we want.
00:26:01.160 | And the idea is that we modify this unit,
00:26:07.000 | so this model here, epsilon theta,
00:26:11.400 | will be built using, let me show you, this unit model here.
00:26:16.280 | This unit will receive as input an image that is noisified,
00:26:21.240 | so for example, a cat, with a particular noise level,
00:26:27.400 | and we also tell him what is the noise level that we added to this cat,
00:26:31.240 | and we give them both to the input of the unit,
00:26:33.560 | and the unit has to predict how much noise is there.
00:26:37.240 | This is the job of the unit.
00:26:38.600 | What if we introduce also the prompt signal here,
00:26:43.400 | so the conditioning signal here, so the prompt?
00:26:46.120 | This way, if we tell the model,
00:26:49.400 | can you remove noise from this image,
00:26:53.560 | which has this quantity of noise,
00:26:56.760 | and I am also telling you that it's a cat,
00:26:59.080 | so the model has more information on how to remove the noise.
00:27:03.080 | Yes, the model can learn this way,
00:27:05.080 | how to remove noise into building something that is more closer to the prompt.
00:27:10.360 | This will make the model conditioned,
00:27:14.280 | it means that it will act like a conditioned model,
00:27:16.760 | so we need to tell the model what is the condition that we want,
00:27:19.560 | so that the model can remove the noise in that particular way,
00:27:23.720 | moving the output towards that particular prompt.
00:27:27.480 | But at the same time, when we train the model,
00:27:31.400 | instead of only giving images along with the prompt,
00:27:35.960 | we can also sometimes, with a probability, let's say 50%,
00:27:38.840 | not give any prompt and let the model remove the noise
00:27:43.720 | without telling him anything about the prompt.
00:27:46.040 | So we just give him a bunch of zero when we give him the input.
00:27:50.680 | This way, the model will learn to act both as a conditioned model
00:27:56.600 | and also as an unconditioned model,
00:27:58.520 | so the model will learn to pay attention to the prompt
00:28:01.080 | and also to not pay attention to the prompt.
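A minimal sketch of this idea, assuming the "bunch of zeros" approach described above (the drop probability and names are illustrative):

```python
import torch

def maybe_drop_prompt(context, drop_probability=0.5):
    # With some probability, replace the prompt embeddings with zeros so the model
    # also learns to denoise without any conditioning signal.
    if torch.rand(1).item() < drop_probability:
        return torch.zeros_like(context)
    return context
```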
00:28:03.240 | And what is the advantage of this?
00:28:05.880 | Is that we can, once when we want to generate a new picture,
00:28:11.320 | we can do two steps.
00:28:12.920 | In the first one, suppose you want to generate a picture of a cat,
00:28:16.680 | we can do like this.
00:28:17.560 | Let me delete first of all.
00:28:19.560 | Okay, we can do the first step.
00:28:23.000 | So let's call it step one.
00:28:27.160 | And we can start with pure noise,
00:28:30.440 | because as I told you before,
00:28:31.720 | to generate a new image, we start from pure noise.
00:28:34.360 | We indicate the model what is the noise level.
00:28:37.240 | So at the beginning, it will be t equal to 1000,
00:28:40.600 | so maximum noise level.
00:28:42.120 | And we tell the model that we want a cat.
00:28:44.920 | We give this as input to the U-Net.
00:28:49.320 | The U-Net will predict some noise that we need to remove
00:28:53.800 | in order to move the image towards what we want as output.
00:28:57.880 | So a cat.
00:28:58.520 | And this is our output one.
00:29:01.160 | Let's call it output one.
00:29:03.720 | Then we do another step.
00:29:05.320 | So let me delete this one.
00:29:06.920 | Then we do another step.
00:29:09.560 | Let's call it step two.
00:29:11.240 | And again, we give the same input noise as before,
00:29:17.400 | the same time step as the noise level.
00:29:21.720 | So it's the same noise with the same noise level,
00:29:24.040 | but we don't give any prompt.
00:29:25.480 | This way, the model will build some output.
00:29:28.760 | Let's call it out two,
00:29:30.120 | which is how to remove the noise to generate something.
00:29:34.040 | We don't know what, but to generate something
00:29:36.280 | that belongs to our data distribution.
00:29:38.040 | And then we combine these two output in such a way
00:29:42.360 | that we can decide how much we want the output
00:29:45.960 | to be closer to the prompt or not.
00:29:49.720 | This is called classifier-free guidance.
00:29:52.760 | So this approach here is called classifier-free guidance.
00:29:57.320 | I will not tell you why it's called classifier-free guidance,
00:29:59.640 | because otherwise I need to introduce the classifier guidance.
00:30:02.200 | And to talk about the classifier guidance,
00:30:03.960 | I need to introduce the score-based models
00:30:06.120 | to understand why it's called like this.
00:30:07.960 | But the idea is this, that we train a model that,
00:30:11.320 | when we train it, sometimes we give it the prompt
00:30:14.760 | and sometimes we don't give it the prompt,
00:30:16.440 | so that the model learns to ignore the prompt,
00:30:19.080 | but also to pay attention to the prompt.
00:30:21.240 | And when we sample from this model, we do two steps.
00:30:25.480 | First time, we give him the prompt of what we want.
00:30:27.960 | And the second time, we give the same noise,
00:30:29.800 | but without the prompt of what we want.
00:30:31.800 | And then we combine the two outputs,
00:30:35.080 | conditioned and unconditioned,
00:30:37.960 | linearly with a weight that indicates
00:30:41.560 | how much we want the output to be closer
00:30:44.600 | to our condition, to our prompt.
00:30:47.080 | The higher this value, the more the output
00:30:49.960 | will resemble our prompt.
00:30:51.640 | The lower this value, the less it will resemble our prompt.
00:30:55.640 | And this is the idea behind classifier-free guidance.
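In code, combining the two outputs is a single line; a sketch, where `guidance_scale` (often a value around 7.5) is the weight just described:

```python
def combine_outputs(conditioned, unconditioned, guidance_scale=7.5):
    # The higher the scale, the more the final prediction follows the prompt;
    # the lower the scale, the less it resembles the prompt.
    return unconditioned + guidance_scale * (conditioned - unconditioned)
```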
00:30:58.760 | To give the prompt, actually we will give,
00:31:01.400 | we need to give some kind of embedding to the,
00:31:04.360 | so the model needs to understand this prompt.
00:31:07.560 | To understand the prompt,
00:31:08.840 | the model needs some kind of embedding.
00:31:11.560 | Embedding means that we need some vectors
00:31:14.920 | that represent the meaning of the prompt.
00:31:17.640 | And this embedding are extracted
00:31:19.880 | using the CLIP text encoder.
00:31:21.640 | So before talking about the text encoder,
00:31:24.520 | let's talk about CLIP.
00:31:25.640 | So CLIP was a model built by OpenAI
00:31:28.760 | that allowed to connect text with images.
00:31:33.480 | Basically, they took a bunch of images together with their text descriptions.
00:31:38.360 | So for example, this picture and its description.
00:31:41.720 | Then they took another image along with its description.
00:31:44.280 | So the image one is associated with the text number one,
00:31:48.120 | which is the description of the image one.
00:31:50.760 | Then the image two has the description number two.
00:31:53.720 | The image three has the text number three,
00:31:56.520 | which is the description of the image three,
00:31:58.520 | et cetera, et cetera.
00:31:59.800 | They built this matrix, you can see here,
00:32:02.280 | which is made up of the dot products
00:32:04.360 | of the embedding of the first image
00:32:06.760 | multiplied with all the possible captions here.
00:32:09.880 | So the image one with the text one.
00:32:12.280 | Image one with the text two.
00:32:13.560 | Image one with the text three, et cetera.
00:32:15.080 | Then image two with the text one.
00:32:17.080 | Image two with the text two, et cetera.
00:32:18.520 | How they train it?
00:32:20.760 | Basically, we know that the correspondence
00:32:23.320 | between image and the text is on the diagonal
00:32:26.280 | because the image one is associated with the text one.
00:32:30.040 | Image two is associated with the text two.
00:32:31.880 | Image three is associated with the text three.
00:32:33.880 | So how they train it?
00:32:35.480 | Basically, they built a loss function
00:32:37.960 | such that they want this diagonal to have the maximum value
00:32:43.400 | and all the other numbers here to be zero
00:32:46.040 | because they are not matching.
00:32:47.160 | They are not the corresponding description of these images.
00:32:50.680 | In this way, the model learned how to combine
00:32:54.360 | the description of an image with the image itself.
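A hedged sketch of that training idea: compute the matrix of all image-text dot products and use a cross-entropy loss whose targets are the diagonal (the function name and the temperature value here are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    img = F.normalize(image_embeddings, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    logits = img @ txt.t() / temperature              # (N, N): image i against text j
    labels = torch.arange(logits.shape[0])            # the matching pairs sit on the diagonal
    loss_images = F.cross_entropy(logits, labels)     # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_images + loss_texts) / 2
```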
00:32:57.880 | And what we do in stable diffusion
00:32:59.960 | is that we take this text encoder here,
00:33:02.600 | so only this part of this clip,
00:33:05.480 | to encode our prompt to get some embeddings.
00:33:09.160 | And these embeddings are then used as conditioning signal
00:33:13.480 | for our unit to denoise the image into what we want.
00:33:17.880 | Okay, there is another thing that we need to understand.
00:33:24.280 | So as I said before,
00:33:25.560 | we have a forward process that adds noise to the image.
00:33:29.560 | Then we have a reverse process
00:33:31.160 | that removes noise from the image.
00:33:33.480 | And this reverse process can be conditioned
00:33:36.120 | by using the classifier-free guidance.
00:33:39.800 | And this reverse process means
00:33:43.480 | that we need to do many steps of denoisification
00:33:46.840 | to arrive to the image, to the new image.
00:33:49.480 | And this also means that each of these steps
00:33:53.800 | involves going through the unit with a noisified image
00:33:57.720 | and getting as output the amount of noise present in this image.
00:34:02.200 | But if the image is very big,
00:34:04.280 | so suppose this image here is 512 multiplied by 512,
00:34:10.120 | it means every time on the unit,
00:34:12.680 | we will have a very big matrix
00:34:14.200 | that needs to go through this unit.
00:34:16.600 | And this may be very slow,
00:34:19.000 | because it's a very big matrix of data
00:34:22.120 | that the unit has to work on.
00:34:24.120 | What if we could somehow compress this image
00:34:27.560 | into something smaller,
00:34:29.160 | so that each step through the unit takes less time?
00:34:33.240 | Well, the idea is that yes,
00:34:35.080 | we can compress this image
00:34:37.000 | with something that is called the variational autoencoder.
00:34:41.000 | Let's see how the variational autoencoder works.
00:34:43.240 | Okay, the stable diffusion is actually known
00:34:48.920 | as a latent diffusion model,
00:34:53.320 | because what we learn is not the data
00:34:56.680 | probability distribution Px of our data,
00:35:00.440 | but we learn the latent representation of the data
00:35:04.680 | using a variational autoencoder.
00:35:06.920 | So basically we compress our data,
00:35:09.000 | so let's go back,
00:35:10.040 | we compress our data into something smaller,
00:35:12.760 | and then we learn the noisification process
00:35:16.040 | using this compressed version of the data,
00:35:18.840 | not the original data.
00:35:20.200 | And then we can decompress it to build the original data.
00:35:24.680 | Let me show you actually how it works on a practical level.
00:35:28.280 | So imagine you have some data
00:35:31.560 | and you want to send it to your friend over the internet.
00:35:34.200 | What do you do?
00:35:35.240 | You can send the original file
00:35:36.840 | or you can send the zipped file.
00:35:38.600 | So you can zip the file,
00:35:40.200 | maybe with WinZip, for example,
00:35:41.960 | and then you send the file to your friend
00:35:43.800 | and the friend can unzip it after receiving
00:35:47.960 | and rebuild the original data.
00:35:49.400 | This is exactly the job of the autoencoder.
00:35:51.880 | The autoencoder is a network
00:35:53.720 | that given an image, for example,
00:35:55.480 | will, after passing through the encoder,
00:35:58.120 | will transform into a vector
00:35:59.800 | which has a dimension that is much smaller
00:36:03.160 | than the original image.
00:36:04.760 | And if we use this vector
00:36:06.600 | and run it through the decoder,
00:36:08.360 | it will build the original image back.
00:36:10.920 | And we can do it for many images
00:36:15.000 | and each of them will have a representation in this space,
00:36:17.560 | which is called the code corresponding to each image.
00:36:22.280 | Now, the problem with autoencoder
00:36:23.960 | is that the code learned by this model
00:36:27.480 | doesn't make any sense from a semantic point of view.
00:36:30.680 | So the code associated with the cat, for example,
00:36:33.480 | may be very similar to the code
00:36:36.040 | associated with pizza, for example,
00:36:38.120 | or the code associated with a building.
00:36:40.360 | So there is no semantic relationship
00:36:43.320 | between these codes.
00:36:44.600 | And to overcome this limitation of the autoencoder,
00:36:47.320 | we introduce the variational autoencoder,
00:36:49.640 | in which we learn to kind of compress the data,
00:36:54.600 | but at the same time,
00:36:56.280 | this data is distributed
00:36:58.040 | according to a multivariate distribution,
00:37:00.120 | which most of the times is a Gaussian.
00:37:02.440 | And we learn the mean and the sigma of this distribution,
00:37:08.440 | this very complex distribution here.
00:37:11.240 | And given the latent representation,
00:37:13.480 | we can always pass it through the decoder
00:37:15.640 | to rebuild the original data.
00:37:17.240 | And this is the idea that we use also in stable diffusion.
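A minimal sketch of what this means in code: the encoder outputs a mean and a (log-)variance instead of a single code, and the latent is sampled from that Gaussian (the names here are illustrative; we will see the real code later):

```python
import torch

def sample_latent(mean: torch.Tensor, log_variance: torch.Tensor) -> torch.Tensor:
    std = (0.5 * log_variance).exp()
    noise = torch.randn_like(std)
    return mean + std * noise  # z = mu + sigma * eps (the reparameterization trick)
```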
00:37:20.360 | Now we can finally combine all these things
00:37:24.440 | that we have seen together
00:37:25.960 | to see what is the architecture of the stable diffusion.
00:37:29.960 | So let's start with how the text-to-image works.
00:37:34.760 | Now, imagine text-to-image basically works like this.
00:37:39.160 | Imagine you want to generate a picture
00:37:42.280 | of a dog with glasses.
00:37:44.200 | So you start, of course, with a prompt,
00:37:46.120 | a dog with glasses.
00:37:47.640 | And then what do we do?
00:37:49.640 | We sample some noise here,
00:37:52.440 | some noise from the N01.
00:37:55.080 | We encode it with our variational autoencoder.
00:37:58.760 | This will give us a latent representation of this noise.
00:38:03.560 | Let's call it Z.
00:38:04.600 | This is, of course, a pure noise,
00:38:07.160 | but has been compressed by the encoder.
00:38:09.880 | And then we send it to the U-Net.
00:38:13.960 | The goal of the U-Net is to detect how much noise is there.
00:38:17.480 | And also, because to the U-Net
00:38:19.320 | we also give the conditioning signal,
00:38:21.480 | the U-Net has to detect the noise,
00:38:24.440 | what noise we need to remove
00:38:26.920 | to make it into a picture that follows the prompt,
00:38:30.360 | so into a picture of a dog.
00:38:32.200 | So we pass it through the U-Net
00:38:34.280 | along with the time step, the initial time step,
00:38:36.920 | so 1,000.
00:38:37.720 | And the U-Net will detect at the output here
00:38:41.800 | how much noise is there.
00:38:43.480 | Our scheduler (we will see later what the scheduler is)
00:38:46.280 | will remove this noise
00:38:48.200 | and then send it again to the U-Net
00:38:50.120 | for the second step of denoisification.
00:38:52.280 | And again, we send the time step,
00:38:54.840 | which is in this case not 1,000,
00:38:57.400 | but 980, for example,
00:38:59.560 | because we skipped some steps.
00:39:01.080 | And then we again, with the noise,
00:39:03.560 | we detect how much noise is there.
00:39:05.240 | The scheduler will remove this noise
00:39:07.480 | and again send it back.
00:39:09.240 | And we do this many times.
00:39:11.560 | We keep doing this denoisification for many steps
00:39:15.000 | until there is no more noise present in the image.
00:39:18.920 | And after we have finished this loop of steps,
00:39:22.520 | we get the output Z prime,
00:39:26.040 | which is still a latent
00:39:27.480 | because this U-Net only works
00:39:29.240 | with the latent representation of the data,
00:39:31.000 | not with the original data.
00:39:32.200 | We pass it through the decoder
00:39:35.000 | to obtain the output image.
00:39:38.440 | And this is why this is called a latent diffusion model
00:39:41.880 | because the U-Net,
00:39:43.080 | so the denoisification process,
00:39:44.840 | always works with the latent representation of the data.
00:39:47.720 | And this is how we generate text to image.
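Putting the pieces together, here is a very rough, hedged sketch of the text-to-image flow just described (the component names, signatures and latent shape are illustrative; the real pipeline we will code later also handles classifier-free guidance and the time embeddings):

```python
import torch

def text_to_image(prompt, clip, unet, decoder, scheduler,
                  latents_shape=(1, 4, 64, 64)):           # the latent is much smaller than 512x512
    context = clip(prompt)                                   # prompt -> text embeddings
    latents = torch.randn(latents_shape)                     # start from pure noise in latent space
    for t in scheduler.timesteps:                            # e.g. 1000, 980, 960, ...
        noise_pred = unet(latents, context, t)               # how much noise is in the latents?
        latents = scheduler.step(t, latents, noise_pred)     # remove (part of) that noise
    return decoder(latents)                                  # decode the latents into an image
```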
00:39:50.600 | We can do the same thing for image to image.
00:39:53.400 | Image to image means that I have,
00:39:55.480 | for example, the picture of a dog
00:39:57.320 | and I want to modify this image
00:40:00.040 | into something else by using a prompt.
00:40:02.440 | For example, I want the model to add glasses to this dog
00:40:05.480 | so I can give the input image here.
00:40:07.720 | And then I say a dog with glasses
00:40:09.800 | and hopefully the model will add glasses to this dog.
00:40:13.000 | How does it work?
00:40:13.800 | We encode the image
00:40:16.520 | with the encoder of the variational autoencoder
00:40:18.920 | and we get the latent representation of our image.
00:40:22.360 | Then we add noise to this latent
00:40:24.760 | because the U-Net, as we saw before,
00:40:26.920 | its job is to denoise an image.
00:40:29.000 | But of course, we need to have some noise to denoise.
00:40:31.640 | So we add noise to this image
00:40:34.920 | and the amount of noise that we add to this image,
00:40:38.360 | so this starting image here,
00:40:40.120 | indicates how much freedom the U-Net has
00:40:43.800 | in building the output image.
00:40:45.400 | Because the more noise we add,
00:40:47.400 | the more freedom the U-Net has to alter the image.
00:40:50.360 | But the less noise we add,
00:40:51.720 | the less freedom the model has to alter the image
00:40:54.520 | because it cannot change radically.
00:40:57.320 | If we start from pure noise,
00:41:00.680 | the U-Net can do anything it wants.
00:41:03.240 | But if we start with less noise,
00:41:05.000 | the U-Net is forced to modify the output image just a little bit.
00:41:09.720 | So the amount of noise that we start from
00:41:12.920 | indicates how much we want the model to pay attention
00:41:15.880 | to the initial image here.
00:41:17.560 | And then we give the prompt.
00:41:20.520 | For many steps, we keep denoising, denoising, denoising, denoising.
00:41:24.040 | And after there is no more noise,
00:41:26.600 | we take this latent representation,
00:41:28.760 | we pass it through the decoder
00:41:30.520 | and we get the output image here.
00:41:33.560 | And this is how image-to-image works.
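A hedged sketch of that idea, reusing the `alpha_bar` schedule from the earlier sketch (the `strength` parameter and names are illustrative):

```python
import torch

def prepare_img2img_latents(latents, alpha_bar, strength=0.8):
    # strength close to 1.0 -> start near pure noise (more freedom to change the image);
    # strength close to 0.0 -> start with little noise (stay close to the input image).
    start_t = int((len(alpha_bar) - 1) * strength)
    noise = torch.randn_like(latents)
    noisy = alpha_bar[start_t].sqrt() * latents + (1.0 - alpha_bar[start_t]).sqrt() * noise
    return noisy, start_t
```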
00:41:35.560 | Now let's go to the last part,
00:41:38.120 | which is how in-painting works.
00:41:40.280 | In-painting works similar way to the image-to-image,
00:41:44.600 | but with a mask.
00:41:45.640 | So in-painting means, first of all,
00:41:47.960 | that we have an image
00:41:49.720 | and we want to cut some part of this image,
00:41:51.960 | for example, the legs of this dog
00:41:53.880 | and we want the model to generate new legs for this dog
00:41:57.240 | that are maybe a little different.
00:41:58.760 | So as you can see,
00:42:00.360 | these feet here are a little different
00:42:03.000 | from the legs of the dog here.
00:42:05.480 | So what we do is,
00:42:08.120 | we start from our initial image of the dog.
00:42:10.440 | We pass it through the encoder.
00:42:12.680 | It becomes a latent representation.
00:42:14.520 | We add some noise to this latent representation.
00:42:17.480 | We give some prompt to tell the model
00:42:20.280 | what we want the model to generate.
00:42:22.440 | So I just say a dog running
00:42:23.800 | because I want to generate new legs for this dog.
00:42:26.920 | And then we pass the noisified input to the U-Net.
00:42:32.600 | The U-Net will produce an output here
00:42:34.680 | for the first time step.
00:42:35.880 | But then, of course, nobody told the model
00:42:39.240 | to only predict this area.
00:42:41.640 | The model, of course,
00:42:42.920 | predicted and modified
00:42:44.680 | the noise for the whole image at the output.
00:42:47.800 | But we take this output here
00:42:50.920 | and we don't care what the noise predicted
00:42:56.680 | for this area of the image.
00:42:58.200 | The area that we already know.
00:43:00.120 | We replace it with the image
00:43:02.680 | that we already know.
00:43:03.640 | And we pass it again through the U-Net.
00:43:07.560 | Basically, what we do is,
00:43:08.920 | at every step,
00:43:10.040 | at every output of the U-Net,
00:43:11.720 | we replace the areas that are already known
00:43:14.600 | with the areas of the original image.
00:43:18.120 | So, basically, to fool the model
00:43:20.760 | into believing that it was the model itself
00:43:23.080 | that came up with these details of the image,
00:43:26.120 | not us.
00:43:27.160 | So every time here in this area,
00:43:30.760 | before we send it back to the U-Net here,
00:43:33.160 | here we combine the output of the U-Net
00:43:36.600 | with the existing image
00:43:38.440 | by replacing whatever output
00:43:40.600 | the U-Net gave us for this area here
00:43:43.880 | with the original image.
00:43:45.800 | And then we give it back to the U-Net
00:43:47.880 | and we keep doing it.
00:43:49.400 | This way the model will only be able
00:43:51.720 | to work on this area here
00:43:53.560 | because this is the one we never replace
00:43:55.560 | in the output of the U-Net.
00:43:57.080 | And then after there is no more noise,
00:43:58.760 | we take the output,
00:44:00.120 | we send it to the decoder
00:44:02.440 | and then it will build the image
00:44:05.800 | we can see here.
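A minimal sketch of the replacement step described above (the mask convention and names are illustrative):

```python
def keep_known_area(predicted_latents, known_latents_at_t, mask):
    # mask == 1 where the model is free to generate (the cut-out area),
    # mask == 0 where we force the (noisified) original image back in at every step.
    return mask * predicted_latents + (1 - mask) * known_latents_at_t
```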
00:44:06.920 | Okay, this is how the stable diffusion works
00:44:11.320 | from an architecture point of view.
00:44:13.080 | I know it has been a long journey.
00:44:14.600 | I had to introduce many concepts
00:44:16.920 | but it's very important
00:44:18.120 | that we know these concepts
00:44:19.640 | before we start building the U-Net
00:44:21.560 | because otherwise we don't even know
00:44:26.200 | how to start building the stable diffusion.
00:44:28.600 | Here we are finally coding our stable diffusion.
00:44:32.520 | And the first thing that we will code
00:44:35.000 | is the variational autoencoder
00:44:36.680 | because it's external to the U-Net,
00:44:38.520 | so it's external to the diffusion model,
00:44:40.680 | so the model that will detect,
00:44:42.360 | or rather predict, how much noise
00:44:43.880 | is present in the image.
00:44:45.640 | And let's review it actually.
00:44:47.320 | Let's review the architecture
00:44:48.600 | and let me go to this slide here.
00:44:52.520 | Okay, oops.
00:44:55.080 | This one here.
00:44:59.000 | Okay, the first thing that we will build
00:45:01.240 | is this part here.
00:45:02.360 | The encoder and the decoder
00:45:07.240 | of our variational autoencoder.
00:45:09.240 | The job of the encoder
00:45:10.760 | and the decoder of the variational autoencoder
00:45:12.440 | is to encode an image or noise
00:45:15.960 | into a compressed version of the image
00:45:18.520 | or the noise itself
00:45:19.640 | such that then we can take this latent
00:45:23.560 | and run it through the U-Net.
00:45:25.480 | And then after the last step
00:45:28.120 | of the denoisification
00:45:29.320 | we take this compressed version or latent
00:45:32.360 | and we pass it through the decoder
00:45:34.440 | to get the image back,
00:45:35.720 | that is, the output image, not the original.
00:45:37.480 | And so the encoder,
00:45:41.080 | actually its job is to reduce
00:45:43.640 | the dimension of the data,
00:45:45.320 | turning the data
00:45:46.760 | into data with a smaller dimension.
00:45:49.480 | And the idea is very similar
00:45:51.240 | to the one of the U-Net.
00:45:52.280 | So we start with a picture that is very big
00:45:54.760 | and at each step there are multiple levels.
00:45:57.720 | We keep reducing the size of the image
00:46:01.080 | but at the same time
00:46:02.440 | we keep increasing the features of the image.
00:46:05.400 | What does it mean?
00:46:06.440 | That initially each pixel of the image
00:46:08.920 | will be represented by three channels.
00:46:11.560 | So red, green and blue RGB.
00:46:13.800 | At each step by using convolutions
00:46:17.720 | we will reduce the size of the image
00:46:19.880 | but at the same time
00:46:20.840 | we will increase the number of features
00:46:23.720 | that each pixel represents.
00:46:26.280 | So each pixel will be represented
00:46:27.800 | not by three channels
00:46:28.760 | but maybe by more channels.
00:46:30.840 | This means that each pixel
00:46:32.840 | will actually capture more data.
00:46:34.920 | More data of the area
00:46:37.640 | to which that pixel belongs.
00:46:39.560 | And this is thanks to the convolutions.
00:46:42.040 | But I will show you later
00:46:43.720 | with an animation.
00:46:45.400 | So let's start building.
00:46:47.720 | The first thing we do is open Visual Studio.
00:46:51.320 | And we create three folders.
00:46:54.600 | The first is called data.
00:46:56.440 | And later we download the pre-trained weights
00:46:59.160 | that you can also find on my GitHub.
00:47:01.000 | Another folder called images
00:47:03.560 | in which we put images as input and output.
00:47:06.920 | And then another folder called SD
00:47:08.520 | which is our module.
00:47:09.560 | Let's create two files.
00:47:11.480 | One called encoder.py
00:47:13.320 | and one called decoder.py.
00:47:15.560 | These are the encoder
00:47:18.520 | and the decoder of our variational autoencoder.
00:47:22.600 | Let's start with the encoder.
00:47:24.600 | And the encoder is quite simple.
00:47:28.200 | So let's start by importing Torch
00:47:30.040 | and all the other stuff.
00:47:35.560 | Let me also select the interpreter.
00:47:38.360 | Okay.
00:47:42.140 | Then we need to import two other blocks
00:47:45.880 | that we will define later in the decoder.
00:47:47.800 | Let's call them, for now,
00:47:49.080 | VAE attention block
00:48:04.280 | and VAE residual block.
00:48:06.840 | For those who are familiar with computer vision models,
00:48:09.880 | the residual block is very similar
00:48:11.320 | to the residual block that is used in the ResNet.
00:48:13.800 | So later you will see the structure.
00:48:15.160 | It's very similar.
00:48:15.880 | But if those who are not familiar,
00:48:17.400 | don't worry, I will explain it later.
00:48:19.320 | So let's start building this encoder.
00:48:21.000 | And this will inherit from the sequential module
00:48:27.720 | which means basically our encoder
00:48:29.480 | is a sequence of modules, submodules.
00:48:32.040 | Okay.
00:48:32.540 | It's a sequence of submodules
00:48:50.760 | in which each module is something
00:48:53.960 | that reduces the dimension of the data,
00:48:56.520 | but at the same time increases its number of features.
00:49:00.760 | I will write the blocks one by one.
00:49:07.160 | And as soon as we encounter a block
00:49:09.720 | that we didn't define, we go to define it.
00:49:11.640 | And then we define also the shapes.
00:49:13.480 | So the first thing we do, just like in the U-Net,
00:49:16.200 | is we define a convolution.
00:49:18.520 | Convolution 2D.
00:49:20.840 | Initially, our image will have three channels.
00:49:25.160 | And we convert it to 128 channels
00:49:27.480 | with a kernel size of 3 and a padding of 1.
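A minimal skeleton of what we are starting to write here (the full encoder will have many more blocks added one by one; the import names match the ones mentioned above):

```python
import torch
from torch import nn
from torch.nn import functional as F
from decoder import VAE_AttentionBlock, VAE_ResidualBlock  # defined later in decoder.py

class VAE_Encoder(nn.Sequential):
    def __init__(self):
        super().__init__(
            # (Batch, 3, Height, Width) -> (Batch, 128, Height, Width)
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            # ... more blocks will be added one by one ...
        )
```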
00:49:32.120 | For those who are not familiar with convolutions,
00:49:35.640 | let's go have a look at how convolutions work.
00:49:38.360 | Here.
00:49:55.500 | Here we can see that a convolution,
00:50:00.940 | basically, it's a kernel.
00:50:02.300 | So it's made of a matrix of a size that we can decide,
00:50:06.940 | which is defined by the parameter kernel size,
00:50:09.420 | which is run through the image
00:50:12.300 | as in the following animation.
00:50:15.180 | So block by block, as you can see.
00:50:17.980 | And at each block, each of the pixel below the kernel
00:50:23.500 | is multiplied by the value of the kernel in that position.
00:50:26.700 | So, for example, this pixel here,
00:50:32.780 | which is in the first row and the first column,
00:50:34.860 | is multiplied by this red value of the kernel.
00:50:39.740 | The second column, first row,
00:50:42.380 | is multiplied by the green value of the kernel.
00:50:45.420 | And then all of these multiplications
00:50:46.940 | are summed up to produce one output.
00:50:49.260 | So this output here comes from four multiplications
00:50:51.980 | that we do in this area,
00:50:53.500 | each one with the corresponding number of the kernel.
00:50:57.420 | This way, basically, by running this kernel through the image,
00:51:00.540 | we capture local information about the image.
00:51:03.980 | And this pixel here combines somehow
00:51:07.500 | the information of four pixels, not only one.
00:51:10.380 | And that's it.
00:51:12.540 | Then we can also increase the kernel size, for example.
00:51:16.380 | Increasing the kernel size
00:51:19.180 | means that we capture more global information.
00:51:21.900 | So each output pixel represents the information
00:51:24.140 | of more pixels from the original picture.
00:51:27.100 | So the output is smaller.
00:51:28.700 | And then we can introduce, for example, the stride,
00:51:32.620 | which means that we don't do it every successive pixel,
00:51:36.460 | but we skip some pixels, as you can see here.
00:51:38.700 | So we skip every second pixel here.
00:51:41.340 | And if the kernel size is even
00:51:46.780 | and the input size is odd,
00:51:49.020 | we will also never touch, for example,
00:51:51.020 | the border here, as you can see.
00:51:52.380 | We can also implement a dilation,
00:51:54.780 | which means that, with the same kernel size,
00:51:59.100 | the information becomes even more global,
00:52:01.820 | because we don't look at consecutive pixels,
00:52:04.700 | but we skip some pixels, et cetera, et cetera.
00:52:07.580 | So the kernels, basically, the convolutions
00:52:09.580 | allow us to capture information
00:52:12.060 | from a local area of the picture, of the image,
00:52:15.180 | and combine it using a kernel.
00:52:18.300 | And this is the idea behind convolutions.
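As a small hedged illustration of what kernel size, stride, and padding do to the output shape (hypothetical tensors, just for checking shapes):

```python
import torch
from torch import nn

x = torch.randn(1, 3, 512, 512)  # (batch, channels, height, width)

# kernel_size=3 with padding=1 keeps the spatial size: (1, 128, 512, 512)
same_size = nn.Conv2d(3, 128, kernel_size=3, padding=1)(x)

# kernel_size=3 with stride=2 and padding=0 roughly halves it: (1, 128, 255, 255)
# (note it is 255, not exactly 256; the encoder later fixes this with an asymmetric pad)
halved = nn.Conv2d(3, 128, kernel_size=3, stride=2, padding=0)(x)

print(same_size.shape, halved.shape)
```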
00:52:20.140 | So this convolution here, for example,
00:52:23.660 | will start with our, okay, let's define some shapes.
00:52:27.020 | Our variational autoencoder,
00:52:29.580 | so the encoder of the variational autoencoder
00:52:32.060 | will start with batch size and three channels.
00:52:37.580 | Let's define it as channel.
00:52:39.660 | Then this image will have a height and the width,
00:52:43.580 | which will be 512 by 512, as we will see later.
00:52:47.900 | And this convolution will convert it into batch size 128 features
00:52:54.140 | with the same height and the same width.
00:52:56.780 | Why, in this case, don't the height and the width change?
00:53:02.540 | Because even if we have a kernel size of three,
00:53:05.740 | we add padding, basically:
00:53:07.500 | we add something to the right side,
00:53:09.820 | something to the top side,
00:53:11.180 | something to the bottom and the left of the image.
00:53:12.940 | So the image with the padding becomes bigger,
00:53:15.580 | but then the output of the convolution makes it smaller
00:53:18.940 | and matches the original size of the image.
00:53:21.420 | This is the reason we have the padding here.
00:53:24.780 | But we will see later that with the next blocks,
00:53:28.220 | the image size will start becoming smaller.
00:53:30.940 | The next block is called the residual block.
00:53:35.420 | And VAE residual block,
00:53:38.220 | which is from 128 channels to 128 channels.
00:53:44.380 | This is a combination,
00:53:45.580 | this residual block is a combination of convolutions and normalization.
00:53:52.540 | So it's just a bunch of convolutions that we will define later.
00:53:55.980 | And this one indicates how many input channels we have
00:54:00.300 | and how many output channels we have.
00:54:02.060 | And the residual block will not change the size of the image.
00:54:05.100 | So we define it.
00:54:07.580 | So our input is batch size,
00:54:11.820 | 128 channels, height and width.
00:54:17.900 | And it remains the same, basically.
00:54:22.140 | Oops!
00:54:22.640 | Okay, we have another one.
00:54:27.580 | Another residual block with the same transformation.
00:54:33.100 | Then we have another convolution.
00:54:35.820 | And this time the convolution will change the size of the image.
00:54:38.780 | And we will see why.
00:54:39.660 | So we have a convolution.
00:54:41.740 | To the 128 to 128.
00:54:47.100 | Because the output channels of the last block is 128.
00:54:51.660 | So the input channel is 128.
00:54:53.340 | The output is 128.
00:54:54.700 | The kernel size is 3.
00:54:57.820 | The stride is 2.
00:55:00.140 | And the padding is 0.
00:55:02.860 | This will basically introduce kernel size 3, stride 2.
00:55:06.700 | Let's watch.
00:55:07.580 | So imagine the input size is 6 by 6.
00:55:10.460 | Kernel size is 3.
00:55:13.020 | Stride is 2, without dilation.
00:55:15.020 | And this is the output.
00:55:16.940 | Let me make it bigger.
00:55:20.140 | Okay, something.
00:55:25.020 | Yeah.
00:55:27.040 | So as you can see, with the stride of 2...
00:55:30.220 | Need to make it...
00:55:33.340 | Okay, with the stride of 2 and the kernel size of 3.
00:55:37.660 | This is the behavior.
00:55:38.620 | So we skip every 2 pixels before calculating the output.
00:55:43.660 | And this makes the output smaller than the input.
00:55:46.700 | Because of this stride.
00:55:48.300 | And also because of the kernel size.
00:55:49.900 | And we don't have any padding.
00:55:51.260 | So this transformation here will have the following shapes.
00:55:57.820 | So we are starting from batch size.
00:56:03.260 | Height, width.
00:56:05.420 | So the original height and the width of the input image.
00:56:08.540 | But this time it will become batch size.
00:56:16.220 | The height will become half.
00:56:18.540 | And the width will become half.
00:56:25.580 | Then we have two more residual blocks.
00:56:28.460 | With the same...
00:56:30.860 | Same as before.
00:56:34.540 | But this time by increasing the number of features.
00:56:36.940 | And here, increasing the features means that we don't increase
00:56:48.940 | or reduce the size of the image.
00:56:50.380 | We just increase the number of features.
00:56:52.620 | So this one goes from 128 to 256.
00:56:55.500 | And here we start from 256
00:57:03.980 | and we remain at 256.
00:57:07.260 | Now you may be confused of why we are doing all of this.
00:57:10.140 | Okay the idea is we start with the initial image.
00:57:12.460 | And we keep decreasing the size of the image.
00:57:15.180 | So later you will see that the image will become divided by 4, divided by 8.
00:57:19.180 | But at the same time we keep increasing the features.
00:57:22.220 | So each pixel represents more information.
00:57:25.980 | But the number of pixels is diminishing.
00:57:28.780 | Is reducing at every step.
00:57:33.180 | So let's go forward.
00:57:34.700 | Then we have another convolution.
00:57:36.780 | And this time the size will become divided by 4.
00:57:41.500 | And the convolution is...
00:57:43.900 | Let me copy this one.
00:57:44.860 | 256 by 256.
00:57:52.300 | Because the previous output is 256.
00:57:54.940 | The kernel size is 3.
00:57:56.460 | The stride is 2 and the padding is 0.
00:57:58.700 | So just like before.
00:57:59.740 | Also in this case the size of the image will become half of what is it now.
00:58:04.060 | So the image is already divided by 2.
00:58:05.820 | So it will become divided by 4 now.
00:58:07.580 | Then we have another residual block.
00:58:18.140 | In which we increase the number of features.
00:58:21.420 | This time from 256 to 512.
00:58:26.860 | So we start from 256 and the image is divided by 4.
00:58:31.340 | And we go to 512.
00:58:34.780 | And the image size doesn't change.
00:58:36.620 | Then we have another one.
00:58:41.180 | From 512 to 512.
00:58:45.420 | In this case...
00:58:48.060 | Oops.
00:58:49.200 | We will see later what is the residual block.
00:58:52.780 | But the residual block you have to think of it as just a convolution with a normalization.
00:58:56.940 | We will see later.
00:58:57.820 | And this one is 512.
00:59:01.100 | And that goes into 512.
00:59:02.860 | And then we have another convolution that will make it even smaller.
00:59:10.140 | So let's copy this convolution here.
00:59:12.060 | This one will go from 512 to 512.
00:59:19.020 | The same kernel size and the same stride and the same padding as before.
00:59:23.180 | So the image will become even smaller.
00:59:25.340 | So our last dimension was this.
00:59:28.540 | Let me copy it.
00:59:29.500 | So we start with an image that has 512 channels
00:59:32.460 | and is 4 times smaller than the original image,
00:59:36.300 | and with this convolution it will become 8 times smaller.
00:59:41.420 | And that's it.
00:59:46.220 | And then we have residual blocks also here.
00:59:49.740 | We have three of them in this case.
00:59:51.180 | Let me copy.
00:59:54.300 | One, two, three.
00:59:59.660 | I just write the shape for the last one.
01:00:01.900 | So anyway, the shape doesn't change here:
01:00:06.860 | neither the size of the image nor the number of features.
01:00:11.020 | So here we are going from 512 channels with height divided by 8 and width divided by 8,
01:00:18.460 | and we stay at the same dimensions:
01:00:24.860 | divide by 8 and divide by 8.
01:00:29.020 | Then we have an attention block.
01:00:31.340 | And later we will see what is the attention block.
01:00:34.220 | Basically it will run a self-attention over each pixel.
01:00:38.700 | So each pixel will become kind of,
01:00:40.780 | as you remember, the attention is a way to relate tokens to each other in a sentence.
01:00:45.020 | So if we have an image made of pixels,
01:00:48.620 | the attention can be thought of as a sequence of pixels
01:00:52.300 | and the attention as a way to relate the pixel to each other.
01:00:55.180 | So this is the goal of the attention block.
01:00:57.820 | And this way each pixel is related to the others,
01:01:04.620 | not independent from them.
01:01:06.620 | Even if the convolution already actually relates close pixels to each other,
01:01:12.380 | but the attention will be global.
01:01:14.220 | So even the last pixel can be related to the first pixel.
01:01:17.180 | This is the goal of the attention block.
01:01:19.020 | And also in this case we don't reduce the size
01:01:23.020 | because the attention is, the transformer's attention,
01:01:26.380 | is a sequence-to-sequence model.
01:01:27.660 | So we don't reduce the size of the sequence.
01:01:30.780 | And the image remains the same.
01:01:35.100 | Finally, we have another residual block.
01:01:36.860 | Let's...
01:01:40.800 | Let me copy here.
01:01:44.300 | Also no change in shape or size of the image.
01:01:47.740 | Then we have a normalization.
01:01:49.420 | And we will see what is this normalization.
01:01:51.660 | It's the group normalization,
01:01:53.420 | which also doesn't change the size.
01:01:55.180 | Just like any normalization, by the way.
01:01:58.060 | With the number of groups being 32
01:02:03.980 | and the number of channels being 512,
01:02:06.460 | because it's the number of features.
01:02:07.820 | Finally, we have an activation function called the SiLU.
01:02:11.500 | The SiLU is a function...
01:02:13.820 | Okay, it's the sigmoid linear unit.
01:02:17.340 | And it's a function just like the RELU.
01:02:20.220 | There is nothing special.
01:02:21.420 | They just saw that this one works better for this kind of application.
01:02:25.580 | But there is no particular reason to choose one over another,
01:02:31.820 | except that they thought that practically this one works fine for this kind of models.
01:02:36.460 | And if you watch my previous video about LLaMA, for example,
01:02:40.860 | in which we analyzed why they chose the SwiGLU function.
01:02:43.740 | If you read the paper, at the end of the paper,
01:02:45.580 | they say that there is no particular reason they chose the SwiGLU.
01:02:48.700 | They just saw that practically it works better.
01:02:50.620 | I mean, it's very difficult to describe why
01:02:52.460 | activation function works better than the others.
01:02:56.060 | So this is why they use the SiLU here,
01:02:57.660 | because practically it works well.
01:03:01.100 | Now, we have another two convolutions.
01:03:03.340 | And then we are done with the encoder.
01:03:05.500 | Convolution from 512 to 8 channels, with kernel size 3 and then padding 1.
01:03:16.460 | This will not change the size of the image.
01:03:22.860 | Because just like before, we have the kernel size as 3.
01:03:25.580 | But we have the padding that compensates for the reduction given by the kernel size.
01:03:30.220 | But we are decreasing the number of features.
01:03:32.300 | And this is the bottleneck of the encoder.
01:03:36.060 | And I will show you later on the architecture what is the bottleneck.
01:03:38.780 | And finally, we have another convolution.
01:03:47.260 | Which is 8 by 8 with kernel size equal to 1.
01:03:55.980 | And the padding is equal to 0.
01:03:57.980 | Which also doesn't change the size of the image.
01:04:02.540 | Because if you watch here, if you have a kernel size of 1,
01:04:05.740 | it means that each, without stride,
01:04:08.700 | each kernel basically is running over each pixel.
01:04:13.020 | So each output actually captures the information of only one pixel.
01:04:16.620 | So the output has the same dimension as the input.
01:04:18.860 | And this is why here also we don't change the size of the image.
01:04:24.460 | But here we need to change the number of channels:
01:04:27.500 | it becomes 8,
01:04:32.540 | and here from 8 to 8.
01:04:37.820 | And this is the list of modules that will make up our encoder.
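Putting the blocks just listed together, the encoder skeleton can be sketched roughly like this (a condensed sketch; VAE_ResidualBlock and VAE_AttentionBlock are defined later in the video):

```python
class VAE_Encoder(nn.Sequential):
    def __init__(self):
        super().__init__(
            # (b, 3, h, w) -> (b, 128, h, w)
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            VAE_ResidualBlock(128, 128),
            VAE_ResidualBlock(128, 128),
            # (b, 128, h, w) -> (b, 128, h/2, w/2)
            nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=0),
            VAE_ResidualBlock(128, 256),
            VAE_ResidualBlock(256, 256),
            # (b, 256, h/2, w/2) -> (b, 256, h/4, w/4)
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=0),
            VAE_ResidualBlock(256, 512),
            VAE_ResidualBlock(512, 512),
            # (b, 512, h/4, w/4) -> (b, 512, h/8, w/8)
            nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=0),
            VAE_ResidualBlock(512, 512),
            VAE_ResidualBlock(512, 512),
            VAE_ResidualBlock(512, 512),
            VAE_AttentionBlock(512),
            VAE_ResidualBlock(512, 512),
            nn.GroupNorm(32, 512),
            nn.SiLU(),
            # bottleneck: reduce to 8 channels, spatial size unchanged
            nn.Conv2d(512, 8, kernel_size=3, padding=1),
            nn.Conv2d(8, 8, kernel_size=1, padding=0),
        )
```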
01:04:44.540 | Before building the residual block and the attention block,
01:04:49.660 | so this attention block,
01:04:50.860 | let's write the forward method and then we build the residual block.
01:04:55.020 | So this is the init.
01:04:57.500 | Define it like this.
01:05:00.060 | Let me review it if it's correct.
01:05:04.380 | Okay, yeah.
01:05:05.180 | Now let's define the forward method.
01:05:12.700 | x is the image that we want to encode.
01:05:19.660 | So it's a tensor.
01:05:21.420 | Torch.tensor.
01:05:23.020 | And the noise, we need some noise.
01:05:25.740 | And later I will show you why we need some noise.
01:05:27.740 | That has the same size as the output of the encoder.
01:05:31.260 | This returns a tensor.
01:05:35.500 | Okay, our input x will be of size batch size with some channels.
01:05:45.820 | Initially it will be 3 because it's an image.
01:05:48.860 | Height and width which will be 512 by 512.
01:05:52.540 | And then some noise.
01:05:55.100 | This noise has the same size as the output of the encoder.
01:05:59.660 | And we will see that it's basically: batch size,
01:06:05.820 | Then output channels.
01:06:08.540 | Height divided by 8 and width divided by 8.
01:06:16.940 | Then we just run sequentially all of these modules.
01:06:21.340 | And then there is one little thing here that in the convolutions that have the stride,
01:06:35.100 | we need to apply a special embedding.
01:06:38.460 | And I will show you why and how it works.
01:06:46.460 | So if the module has a stride attribute and it's equal to 2 2,
01:06:52.460 | which basically means this convolution here,
01:06:57.180 | this convolution here and this convolution here,
01:06:59.980 | we don't apply the padding here because the padding here is applied to the top of the image,
01:07:05.180 | bottom, left and right.
01:07:07.180 | But we want to do an asymmetrical padding so we do it manually.
01:07:10.460 | And this is applied like this.
01:07:13.420 | F.pad.
01:07:14.380 | Basically this says can you add a layer of pixels on the right side of the image
01:07:27.740 | and on the bottom side of the image only?
01:07:30.380 | Because when you apply the padding, it's padding left, padding right, padding top,
01:07:40.940 | padding bottom.
01:07:42.220 | This means add a layer of pixels on the right side of the image
01:07:47.340 | and on the bottom side of the image.
01:07:48.860 | And this is asymmetrical padding.
01:07:54.380 | And we apply it only for those convolutions that have the stride equal to 2.
01:08:00.140 | And then x is equal to module of x.
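In code, this forward loop with the asymmetric padding can be sketched like this (a sketch, not shown verbatim in the video; the sampling part that follows the loop comes next):

```python
def forward(self, x, noise):
    for module in self:
        # F.pad takes (pad_left, pad_right, pad_top, pad_bottom):
        # add one column of pixels on the right and one row on the bottom,
        # but only before the stride-2 convolutions
        if getattr(module, 'stride', None) == (2, 2):
            x = F.pad(x, (0, 1, 0, 1))
        x = module(x)
    # ... the sampling step follows, see below
```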
01:08:05.660 | OK, now you may be wondering why are we building this kind of structure?
01:08:10.060 | Why it's made like this?
01:08:11.500 | OK, usually in deep learning communities, especially during research,
01:08:16.620 | we don't reinvent the wheel every time.
01:08:18.460 | So the people who made the stable diffusion, but also the people before them,
01:08:23.180 | every time we want to use a model,
01:08:25.100 | we check what models similar to the one we want to build
01:08:29.260 | are already out there and they are working fine.
01:08:32.140 | So very probably the people who built stable diffusion,
01:08:35.900 | they saw that a model like this is working very well
01:08:39.100 | for some previous project as a variational autoencoder.
01:08:42.700 | They just modified it a little bit and kept it like it.
01:08:46.780 | So for most choices, actually, there is no reason.
01:08:49.740 | There is a historical reason, because it worked well in practice.
01:08:53.340 | And we know that convolutions work well in practice
01:08:57.420 | for image segmentation, for example, or anything related to computer vision.
01:09:01.340 | And this is why they made the model like this.
01:09:04.940 | So most encoders actually work like this, that we reduce the size of the image,
01:09:09.500 | but we keep increasing the features of the image,
01:09:12.860 | the channels, the number of channels of the image.
01:09:14.940 | So the number of pixels becomes smaller,
01:09:18.620 | but each pixel is represented by more than three channels.
01:09:22.460 | So more channels at every step.
01:09:23.980 | Now, what we do here is run our image sequentially,
01:09:30.940 | one by one, through all of these modules here.
01:09:37.900 | So first through this convolution, then through this residual block,
01:09:41.020 | which is also some convolutions, then this residual block,
01:09:44.860 | then again convolution, convolution, convolution,
01:09:46.940 | until we run it through this attention block and et cetera.
01:09:50.460 | This will transform the image into something smaller,
01:09:55.260 | so a compressed version of the image.
01:09:57.500 | But as I showed you before, this is not an autoencoder.
01:10:00.540 | This is a variational autoencoder.
01:10:02.940 | So the variational autoencoder, let me show you again the picture here.
01:10:09.180 | We are not learning how to compress data.
01:10:11.500 | We are learning a latent space.
01:10:13.260 | And this latent space are the parameters of a multivariate Gaussian distribution.
01:10:19.340 | So actually, the variational autoencoder is trained to learn the mu and the sigma,
01:10:25.260 | so the mean and the variance of this distribution.
01:10:30.460 | And this is actually what we will get from the output of this variational autoencoder,
01:10:35.900 | not directly the compressed image.
01:10:39.020 | And if this is not clear, guys,
01:10:42.140 | I made a previous video about the variational autoencoder,
01:10:45.100 | in which I show you also why the history of why we do it like this,
01:10:48.460 | all the reparameterization trick, et cetera.
01:10:51.820 | But for now, just remember that this is not just a compressed version of the image,
01:10:56.700 | it's actually a distribution.
01:10:58.300 | And then we can sample from this distribution.
01:11:01.100 | And I will show you how.
01:11:02.380 | So the output of the variational autoencoder is actually the mean and the variance.
01:11:08.220 | And actually, it's actually not the variance, but the log variance.
01:11:11.420 | So the mean and the log variance are equal to torch.chunk(x, 2, dim=1).
01:11:22.380 | We will see also what is the chunk function.
01:11:25.580 | So I will show you.
01:11:27.500 | So this basically converts batch size, 8 channels,
01:11:32.700 | height divided by 8, width divided by 8,
01:11:36.940 | which is the output of the last layer of this encoder.
01:11:40.140 | So this one.
01:11:41.100 | And we divide it into two tensors.
01:11:44.380 | So this chunk basically means divide it into two tensors along this dimension.
01:11:49.500 | So along this dimension, it will become two tensors of size,
01:11:52.860 | along this dimension of size 4.
01:11:55.980 | So two tensors of shape, batch size 4, then height divided by 8, and width divided by 8.
01:12:12.540 | And this basically, the output of this actually represents the mean and the variance.
01:12:21.660 | And what we do, we don't want the log variance, we want the variance actually.
01:12:27.340 | So to transform the log variance into variance, we do the exponentiation.
01:12:32.540 | So the first thing actually we need to do is to clamp this log variance,
01:12:37.100 | because otherwise it could become very small or very big.
01:12:39.100 | So clamping means that if the variance is too small or too big,
01:12:42.460 | we want it to become within some ranges that are acceptable for us.
01:12:47.820 | So this clamping function on the log variance
01:12:50.700 | tells PyTorch that if the value is too small or too big, make it fall within this range.
01:12:55.900 | And this doesn't change the shape of the tensors.
01:13:01.020 | So this still remains this tensor here.
01:13:04.300 | And then we transform the log variance into variance.
01:13:09.660 | So the variance is equal to the log variance dot exp,
01:13:13.820 | which means make the exponential of this.
01:13:16.140 | So you delete the log and it becomes the variance.
01:13:18.620 | And this also doesn't change the size of the shape of the tensor.
01:13:25.100 | And then to calculate the standard deviation from the variance,
01:13:28.140 | as you know, the standard deviation is the square root of the variance.
01:13:31.180 | So standard deviation is the variance dot sqrt.
01:13:40.300 | And also this doesn't change the size of the tensor.
01:13:44.540 | OK, now what we want, as I told you before, this is a latent space.
01:13:51.500 | It's a multivariate Gaussian, which has its own mean and its own variance.
01:13:56.060 | And we know the mean and the variance, this mean and this variance.
01:13:59.500 | How do we convert? How do we sample from it?
01:14:03.740 | Well, what we can sample from, basically, is N(0, 1).
01:14:09.420 | If we have a sample from N(0, 1),
01:14:12.620 | how do we convert it into a sample with a given mean and a given variance?
01:14:19.100 | As you remember from probability and statistics,
01:14:23.260 | if you have a sample from N(0, 1),
01:14:25.340 | you can convert it into any other sample of a Gaussian
01:14:28.860 | with a given mean and a variance through this transformation.
01:14:32.140 | So if Z is a sample from N(0, 1),
01:14:37.180 | we can transform it into another sample, let's call it X,
01:14:40.940 | through the transformation X = mean + standard_deviation * Z,
01:14:49.980 | where the mean and the standard deviation are those of the new distribution.
01:14:54.780 | This is the transformation, this is the formula from probability and statistics.
01:14:58.860 | Basically means transform this distribution into this one,
01:15:01.420 | that has this mean and this variance,
01:15:03.020 | which basically means sample from this distribution.
01:15:05.980 | This is why we are given also the noise as input,
01:15:09.100 | because we want the noise to come from a noise generator with a particular seed.
01:15:14.700 | So we take it as input and we sample from this distribution like this:
01:15:19.180 | x is equal to mean plus standard deviation multiplied by noise.
01:15:25.180 | Finally, there is also another step that we need to scale the output by a constant.
01:15:34.460 | This constant, I found it in the original repository.
01:15:37.740 | So I'm just writing it here without any explanation on why,
01:15:41.180 | because I actually, I also don't know.
01:15:43.260 | It's just a scaling constant that they use at the end.
01:15:46.460 | I don't know if it's there for historical reason,
01:15:49.020 | because they use some previous model that had this constant,
01:15:51.260 | or they introduced it for some particular reason.
01:15:54.060 | But it's a constant that I saw it in the original repository.
01:15:57.100 | And actually, if you check the original parameters of the stable diffusion model,
01:16:01.420 | there is also this constant.
01:16:02.460 | So I am also scaling the output by this constant.
01:16:05.020 | And then we return x.
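Put together, the tail of the forward method can be sketched roughly like this (a sketch; the clamp bounds (-30, 20) and the 0.18215 constant are the values I believe are used in the original repository, mentioned above only as "a constant"):

```python
# split the 8 channels into mean and log-variance: two (b, 4, h/8, w/8) tensors
mean, log_variance = torch.chunk(x, 2, dim=1)
# keep the log-variance within an acceptable range (assumed values from the original repo)
log_variance = torch.clamp(log_variance, -30, 20)
variance = log_variance.exp()
stdev = variance.sqrt()
# reparameterization: sample from N(mean, variance) using noise ~ N(0, 1)
x = mean + stdev * noise
# scale by the constant found in the original stable diffusion repository
x *= 0.18215
return x
```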
01:16:08.140 | So now what we built so far,
01:16:10.620 | except that we didn't build the residual block and the attention block here,
01:16:14.940 | we built the encoder part of the variational autoencoder and also the sampling part.
01:16:20.140 | So we take the image, we run it through the encoder, it becomes very small.
01:16:24.140 | It will tell us the mean and the variance.
01:16:26.940 | And then we sample from that distribution given the mean and the variance.
01:16:31.340 | Now we need to build the decoder along with the residual block and the attention block.
01:16:36.700 | And what we will see is that in the decoder,
01:16:38.780 | we do the opposite of what we did in the encoder.
01:16:42.220 | So we will reduce the number of channels and at the same time,
01:16:46.700 | we will increase the size of the image.
01:16:48.620 | So let's go to the decoder.
01:16:52.620 | Let me review if everything is fine.
01:16:55.820 | Looks like it is.
01:17:00.380 | So let's go to the decoder.
01:17:01.820 | Again, import torch.
01:17:08.220 | We also need to define the attention.
01:17:29.260 | We need to define the self-attention.
01:17:30.860 | Later we define it.
01:17:31.820 | Let's define first the residual block, the one we defined before,
01:17:39.900 | so that you understand what is this residual block.
01:17:42.860 | And then we define the attention block that we defined before.
01:17:48.220 | And finally, we build the attention.
01:17:50.860 | So...
01:18:00.860 | Okay, this is made up of normalization and convolutions, like I said before.
01:18:09.420 | There is a normalization, which is a group norm, and a convolution from in channels to out channels.
01:18:14.860 | So...
01:18:24.860 | And then there is another group normalization
01:18:40.860 | and another convolution, from out channels to out channels.
01:18:58.860 | And then we have a skip connection.
01:19:05.900 | Skip connection basically means that you take the input,
01:19:09.660 | you skip some layers, and then you connect it there with the output of the last layer.
01:19:13.580 | And we also need this residual connection.
01:19:17.820 | If the two channels are different, we need to create another intermediate layer.
01:19:21.740 | Now I create it, later I explain it.
01:19:34.860 | Okay, let's create the forward method.
01:19:51.100 | Which is a torch.tensor.
01:20:02.620 | And returns a torch.tensor.
01:20:05.260 | Okay, the input of this residual layer, as you saw before,
01:20:10.780 | is something that has a batch with some channels,
01:20:15.020 | and then height and width, which can be different.
01:20:18.380 | It's not always the same.
01:20:19.500 | Sometimes it's 512 by 512, sometimes it's half of that,
01:20:23.420 | sometimes it's one fourth of that, etc.
01:20:25.980 | So suppose it's x is batch size in channels height width.
01:20:36.220 | What we do is we create the skip connection.
01:20:39.740 | So we save the initial input.
01:20:41.420 | We call it the residual or residue is equal to x.
01:20:45.500 | We apply the normalization.
01:20:48.060 | The first one.
01:20:53.660 | And this doesn't change the shape of the tensor.
01:20:56.940 | The normalization doesn't change.
01:20:59.340 | Then we apply the SiLU function.
01:21:00.780 | And this also doesn't change the size of the tensor.
01:21:07.820 | Then we apply the first convolution.
01:21:10.300 | This also doesn't change the size of the tensor,
01:21:18.220 | because as you can see here, we have kernel size 3, yes,
01:21:21.260 | but with the padding of 1.
01:21:22.860 | With the padding of 1, actually, it will not change the size of the tensor.
01:21:26.540 | So it will still remain this one.
01:21:28.140 | Then we apply again the group normalization 2.
01:21:32.540 | This again doesn't change the size of the tensor.
01:21:38.060 | Then we apply the SiLU again.
01:21:40.540 | Then we apply the convolution number 2.
01:21:46.700 | And finally, we apply the residual connection,
01:21:54.060 | which basically means that we take x plus the residual.
01:21:58.940 | But if the number of output channels is not equal to the input channels,
01:22:05.740 | you cannot add this one with this one,
01:22:07.420 | because this dimension will not match between the two.
01:22:09.740 | So what we do, we create this layer here
01:22:13.660 | to convert the input channels to the output channels of x,
01:22:17.420 | such that this sum can be done.
01:22:19.180 | So what we do is, we apply this residual layer.
01:22:22.380 | Residual layer of residual, like this.
01:22:27.100 | And this is our residual block.
01:22:29.660 | So as I told you, it's just a bunch of convolutions and group normalization.
01:22:33.580 | And for those who are familiar with the computer vision models,
01:22:36.140 | especially in ResNet, we use a lot of it.
01:22:38.220 | It's a very common block.
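Here is a sketch of the residual block as just described (two group norms, two convolutions, SiLU activations, and a skip connection with a 1x1 convolution when the channel counts differ):

```python
class VAE_ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.groupnorm_1 = nn.GroupNorm(32, in_channels)
        self.conv_1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.groupnorm_2 = nn.GroupNorm(32, out_channels)
        self.conv_2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        # if the channel counts differ, project the residue so the sum is possible
        if in_channels == out_channels:
            self.residual_layer = nn.Identity()
        else:
            self.residual_layer = nn.Conv2d(in_channels, out_channels, kernel_size=1, padding=0)

    def forward(self, x):
        # x: (batch, in_channels, height, width); the spatial shape is preserved throughout
        residue = x
        x = self.groupnorm_1(x)
        x = F.silu(x)
        x = self.conv_1(x)
        x = self.groupnorm_2(x)
        x = F.silu(x)
        x = self.conv_2(x)
        return x + self.residual_layer(residue)
```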
01:22:43.180 | Let's go build the attention block that we used also before in the encoder.
01:22:47.340 | This one here.
01:22:49.100 | And to define the attention, we also need to define the self-attention.
01:22:53.500 | So let's first build the attention block,
01:22:55.660 | which is used in the variational autoencoder.
01:22:57.340 | And then we define what is this self-attention.
01:23:11.500 | So it has a group normalization.
01:23:28.140 | Again, the number of groups is always 32 here in stable diffusion.
01:23:34.060 | But you also may be wondering, what is group normalization, right?
01:23:37.420 | So let's go to review it, actually, since we are here.
01:23:40.780 | And, okay, if you remember from my previous slides on LLaMA,
01:23:47.500 | let's go here, where we use a layer normalization.
01:23:52.620 | And also in the vanilla transformer, actually, we use layer normalization.
01:23:58.140 | So first of all, what is normalization?
01:24:00.220 | Normalization is basically when we have a deep neural network,
01:24:03.660 | each layer of the network produces some output that is fed to the next layer.
01:24:08.700 | Now, what happens is that if the output of a layer is varying in distribution,
01:24:14.700 | so sometimes, for example, the output of a layer is between 0 and 1,
01:24:18.380 | but the next step, maybe it's between 3 and 5,
01:24:22.140 | and the next step, maybe it's between 10 and 15, etc.
01:24:25.740 | So the distribution of the output of a layer changes,
01:24:29.180 | then the next layer also will see some input
01:24:32.060 | that is very different from what the layer is used to see.
01:24:36.860 | This will basically push the output of the next layer into a new distribution itself,
01:24:42.620 | which, in turn, will push,
01:24:45.900 | basically, the output of the model to change very frequently in distribution.
01:24:53.740 | So sometimes it will be a very big number,
01:24:55.500 | sometimes it will be a very small number,
01:24:56.940 | sometimes it will be negative, sometimes it will be positive, etc.
01:24:59.740 | And this basically makes the loss function oscillate too much,
01:25:04.060 | and it makes the training slower.
01:25:05.980 | So what we do is we normalize the values before feeding them into layers,
01:25:09.740 | such that each layer always sees the same distribution of the data.
01:25:13.820 | So it will always see numbers that are distributed around 0 with a variance of 1.
01:25:19.260 | And this is the job of the layer normalization.
01:25:21.580 | So imagine you are a layer, and you have some input,
01:25:25.180 | which is a batch of 10 items.
01:25:27.260 | Each item has some features, so feature 1, feature 2, feature 3.
01:25:31.180 | Layer normalization calculates a mean and the variance over these features here,
01:25:36.380 | so over this distribution here,
01:25:38.380 | and then normalizes this value according to this formula.
01:25:42.140 | So each value basically becomes distributed around 0 with a variance of 1.
01:25:46.460 | With batch normalization, we normalize by columns,
01:25:50.700 | so the statistics mean and the sigma is calculated by columns.
01:25:54.380 | With layer normalization, it is calculated by rows,
01:25:57.420 | so each item independently from the others.
01:26:00.620 | With group normalization, on the other hand,
01:26:03.100 | it is like layer normalization, but not all of the features of the item, but grouped.
01:26:10.860 | So for example, imagine you have four features here.
01:26:13.500 | So here you have F1, F2, F3, F4, and you have two groups.
01:26:18.620 | Then the first group will be F1 and F2, and the second group will be F3 and F4.
01:26:23.740 | So you will have two means and two variance,
01:26:26.860 | one for the first group, one for the second group.
01:26:29.900 | But why do we use it like this?
01:26:32.540 | Why do we want to group this kind of features?
01:26:35.740 | Because these features actually, they come from convolutions.
01:26:39.180 | And as we saw before, let's go back to the website.
01:26:42.460 | Imagine you have a kernel of five here.
01:26:45.740 | Each output here actually comes from local area of the image.
01:26:55.260 | So two close features, for example,
01:26:59.340 | two things that are close to each other, may be related to each other,
01:27:02.540 | while two things that are far from each other may not be related to each other.
01:27:02.540 | This is why we can group, we can use group normalization in this case.
01:27:07.260 | Because closer features to each other will have kind of the same distribution,
01:27:12.860 | or we make them have the same distribution,
01:27:15.020 | and things that are far from each other may not.
01:27:17.580 | This is the basic idea behind group normalization.
01:27:20.300 | But the whole idea behind the normalization is that
01:27:22.700 | we don't want these things to oscillate too much.
01:27:25.260 | Otherwise, the loss of function will oscillate
01:27:27.420 | and will make the training slower.
01:27:28.860 | With normalization, we make the training faster.
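As a tiny hedged illustration of nn.GroupNorm (hypothetical tensor sizes, just to show that normalization never changes the shape):

```python
import torch
from torch import nn

x = torch.randn(4, 512, 64, 64)                          # (batch, channels, height, width)
group_norm = nn.GroupNorm(num_groups=32, num_channels=512)
y = group_norm(x)                                        # statistics computed over groups of 512/32 = 16 channels
print(y.shape)                                           # torch.Size([4, 512, 64, 64])
```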
01:27:30.940 | So let's go back to coding.
01:27:33.100 | So we were coding the attention block.
01:27:36.300 | So now the attention block has this group normalization and also an attention,
01:27:40.540 | which is a self-attention.
01:27:43.020 | And later we define it.
01:27:44.140 | And channels, okay.
01:27:48.140 | This one have a forward method.
01:27:54.300 | Torch.tensor, returns, of course, torch.tensor.
01:27:59.020 | Okay, what is the input of this block?
01:28:02.060 | The input of this block is something, where is it?
01:28:06.700 | Here.
01:28:07.500 | It's something in the form of batch size, number of channels, height and width.
01:28:11.980 | But because it will be used in many positions, this attention block,
01:28:15.180 | we don't define a specific size.
01:28:17.500 | So we just say that x is something that is a batch size,
01:28:21.900 | features or channels, if you want, height and width.
01:28:25.900 | Again, we create a residual connection.
01:28:28.940 | And the first thing we do is we extract the shape.
01:28:34.860 | So n is the batch size, the number of channels,
01:28:38.220 | the height and the width is equal to x.shape.
01:28:41.420 | Then, as I told you before,
01:28:45.340 | we do the self-attention between all the pixels of this image.
01:28:49.580 | And I will show you how.
01:28:50.700 | This will transform this tensor here into this tensor here.
01:29:03.660 | Height multiplied by width.
01:29:06.620 | So now we have a sequence where each item represents a pixel
01:29:11.340 | because we multiplied height by width.
01:29:13.260 | And then we transpose it.
01:29:15.180 | So put it back a little before.
01:29:18.220 | Transpose the -1 with -2.
01:29:20.620 | This will transform this shape into this shape.
01:29:26.940 | So we put back this one.
01:29:31.180 | So this one comes before and features becomes the last one.
01:29:34.700 | Something like this.
01:29:37.420 | And okay.
01:29:40.140 | So as you can see from this tensor here,
01:29:43.660 | this is like when we do the attention in the transformer model.
01:29:47.500 | So in the transformer model, we have a sequence of tokens.
01:29:50.220 | Each token is representing, for example, a word.
01:29:52.700 | And the attention basically calculates the attention between each token.
01:29:57.180 | So how do two tokens are related to each other?
01:30:00.220 | In this case, we can think of it as a sequence of pixels.
01:30:03.420 | Each pixel with its own embedding, which is the features of that pixel.
01:30:07.820 | And we relate pixels to each other.
01:30:10.060 | And then we do the attention.
01:30:12.780 | Which is a self-attention.
01:30:17.260 | In which self-attention means that the query key and values are the same input.
01:30:20.940 | And this doesn't change the shape.
01:30:25.500 | So this one remains the same.
01:30:28.540 | Then we transpose back.
01:30:30.620 | And we do the inverse transformation.
01:30:36.060 | So because we put it in this form only to do attention.
01:30:39.260 | So now we transpose.
01:30:40.940 | So we take this one.
01:30:46.140 | And we convert it into features.
01:30:48.860 | And then height and width.
01:30:51.900 | And then again, we remove this multiplication by viewing again the tensor.
01:30:57.500 | So n, c, h, w.
01:31:01.420 | So we go from here.
01:31:05.980 | To here.
01:31:11.020 | Then we add the residual connection.
01:31:15.580 | And we return x.
01:31:16.460 | That's it.
01:31:17.660 | The residual connection will not change the size of the input.
01:31:22.380 | And we return a tensor of this shape here.
01:31:25.900 | Let me check also the residual connection here.
01:31:27.660 | It's correct.
01:31:28.460 | Okay.
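As a sketch, the attention block we just walked through looks roughly like this (it uses the SelfAttention class defined next, with a single head):

```python
class VAE_AttentionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.groupnorm = nn.GroupNorm(32, channels)
        self.attention = SelfAttention(1, channels)  # one head, defined next

    def forward(self, x):
        # x: (batch, channels, height, width)
        residue = x
        x = self.groupnorm(x)
        n, c, h, w = x.shape
        x = x.view(n, c, h * w)      # (n, c, h*w): a sequence of pixels
        x = x.transpose(-1, -2)      # (n, h*w, c): each pixel with its channels as embedding
        x = self.attention(x)        # self-attention between all pixels, shape unchanged
        x = x.transpose(-1, -2)      # back to (n, c, h*w)
        x = x.view(n, c, h, w)
        return x + residue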
01:31:29.260 | Now that we have also built the attention block, let's build also the self-attention.
01:31:32.700 | Since we are building the attentions.
01:31:35.180 | And the attentions, because we have two kinds of attention in the stable diffusion.
01:31:39.820 | One is called the self-attention.
01:31:41.820 | And one is the cross-attention.
01:31:43.820 | And we need to build both.
01:31:44.940 | So let's go build it in a separate class called "Attention".
01:31:48.300 | And okay.
01:31:55.100 | So again, import torch.
01:32:08.940 | Okay.
01:32:18.940 | I think you guys maybe want to review the attention before building it.
01:32:27.580 | So let's go review it.
01:32:28.860 | I have here opened my slides from my video about the attention model for the transformer model.
01:32:35.740 | So the self-attention, basically, it's a way for, especially in a language model,
01:32:42.140 | is a way for us to relate tokens to each other.
01:32:45.180 | So we start with a sequence of tokens.
01:32:47.420 | Each one of them having an embedding of size d model.
01:32:50.540 | And we transform it into queries, key, and values.
01:32:53.260 | In which query, key, and values in the self-attention are the same matrix, same sequence.
01:32:57.660 | We multiply them by wq matrix.
01:33:01.180 | So wq, wk, and wv, which are parameter matrices.
01:33:05.340 | Then we split them along the d model dimension into number of heads.
01:33:10.540 | So we can specify how many heads we want.
01:33:12.700 | In our case, the one attention that we will do here is actually only one head.
01:33:18.940 | I will show you later.
01:33:19.740 | And then we calculate the attention for each of this head.
01:33:23.580 | Then we combine back by concatenating this head together.
01:33:28.780 | We multiply this output matrix of the concatenation with another matrix called wo,
01:33:35.260 | which is the output matrix.
01:33:36.620 | And then this is the output of the multi-head attention.
01:33:40.700 | If we have only one head, instead of being a multi-head,
01:33:44.540 | then we will not do this splitting operation.
01:33:46.940 | We will just do this multiplication with the w and with the wo.
01:33:51.260 | And OK, this is how the self-attention works.
01:33:54.940 | So in a self-attention, we have this query key and values coming from the same matrix input.
01:33:59.100 | And this is what we are going to build.
01:34:01.500 | So we have the number of heads.
01:34:12.300 | Then we have the embedding.
01:34:14.380 | So what is the embedding of each token?
01:34:16.700 | But in our case, we are not talking about tokens.
01:34:19.980 | We will talk about pixels.
01:34:21.340 | And we can think that the number of channels of each pixel is the embedding of the pixel.
01:34:26.700 | So the embedding, just like in the original transformer,
01:34:30.380 | the embeddings are the kind of vectors that capture the meaning of the word.
01:34:35.020 | In this case, we have the channels.
01:34:36.540 | Each channel, each pixel represented by many channels
01:34:39.900 | that capture the information about that pixel.
01:34:41.980 | Here we have also the bias for the w matrices,
01:34:49.420 | which we don't have in the original transformer.
01:34:51.500 | OK, now let's define the w matrices.
01:35:07.900 | So Wq, Wk and Wv.
01:35:09.580 | We will represent them as one big linear layer,
01:35:11.980 | instead of representing them as three different matrices, which is also possible.
01:35:17.500 | We just say that it's a big matrix, from the embedding size to three times the embedding size.
01:35:21.340 | And the bias is optional, if we want it.
01:35:25.260 | So in projection, in projection bias.
01:35:28.620 | So this means stands for in projection,
01:35:30.620 | because it's a projection of the input before we apply the attention.
01:35:34.300 | And then there is an auto projection, which is after we apply the attention.
01:35:37.420 | So the wo matrix.
01:35:47.100 | So as you remember here, the Wo matrix is actually d_model by d_model.
01:35:51.100 | The input projection is also d_model by d_model,
01:35:53.420 | and this is exactly what we did.
01:35:54.780 | But we have three of them here,
01:35:56.860 | so it's d_model by three times d_model.
01:35:58.220 | And then we save the number of heads.
01:36:08.060 | And then we saved the dimension of each head.
01:36:15.500 | The dimension of each head basically means that if we have multi head,
01:36:18.780 | each head will watch a part of the embedding of each token.
01:36:21.980 | So we need to save how much is this size.
01:36:25.820 | So the model divided by the number of heads.
01:36:28.060 | But divide by the number of heads.
01:36:32.700 | Let's implement the forward.
01:36:35.420 | We can also apply a mask.
01:36:43.660 | As you remember, the mask is a way to avoid relating tokens,
01:36:47.580 | one particular token with the tokens that come after it,
01:36:52.540 | but only with the token that come before it.
01:36:54.780 | And this is called the causal mask.
01:36:58.380 | If you really are not understanding what is happening here in the attention,
01:37:05.500 | I highly recommend you watch my previous video,
01:37:07.420 | because it's explained very well.
01:37:09.980 | And if you watch it, it will take not so much time.
01:37:14.380 | And I think you will learn a lot.
01:37:16.300 | So the first thing we do is extract the shape.
01:37:20.780 | Then we extract the batch size, the sequence
01:37:31.420 | length and the embedding dimension from the input shape.
01:37:38.780 | And then we say that we will convert it into another shape
01:37:47.020 | that I will show you later why.
01:37:48.460 | This is called the interim shape, intermediate shape.
01:37:58.140 | Then we apply the query key and value.
01:38:06.700 | We apply the in projection, so the wq, wq and wv matrix to the input,
01:38:12.540 | and we convert it into query key and values.
01:38:14.620 | So query key and values are equal to...
01:38:16.620 | We multiply it, but then we divide it with chunk.
01:38:22.220 | As I showed you before, what is chunk?
01:38:23.900 | Basically, we will multiply the input with the big matrix
01:38:29.260 | that represents Wq, Wk and Wv,
01:38:31.420 | but then we split it back into three smaller matrices.
01:38:34.700 | This is the same as applying three different projections.
01:38:37.740 | Instead of...
01:38:38.960 | It's the same as applying three separate in projections,
01:38:43.500 | but it's also possible to combine it in one big matrix.
01:38:46.860 | This, what we will do, basically it will convert batch size,
01:38:56.060 | sequence length, dimension into batch size, sequence length, dimension multiplied by three.
01:39:05.740 | And then by using chunk, we split it along the last dimension
01:39:09.820 | into three different tensors of shape, batch size, sequence length and dimension.
01:39:23.900 | Okay, now we can split the query key and values in the number of heads.
01:39:31.180 | According to the number of heads, this is why we built this shape,
01:39:34.220 | which means split the dimension, the last dimension into n heads.
01:39:42.940 | And the values v.view, wonderful.
01:40:02.540 | This will convert, okay, let's write it,
01:40:08.620 | batch size, sequence length, dimension into batch size, sequence length,
01:40:17.180 | then h, so the number of heads and each dimension divided by the number of heads.
01:40:23.260 | So each head will watch the full sequence,
01:40:26.060 | but only a part of the embedding of each token, in this case, pixel.
01:40:31.260 | And we'll watch this part of the head.
01:40:35.180 | So the full dimension, the embedding divided by the number of heads.
01:40:39.500 | And then this will convert it, because we are also transposing,
01:40:46.380 | this will convert it into batch size,
01:40:49.020 | h, sequence length, and then dimension h.
01:40:55.900 | So each head will watch all the sequence, but only a part of the embedding.
01:41:03.180 | We then calculate the attention, just like the formula.
01:41:07.100 | So query multiplied by the transpose of the keys.
01:41:09.900 | So is the query, matrix multiplication with the transpose of the keys.
01:41:15.100 | This will return a matrix of size,
01:41:19.740 | batch size, h, sequence length by sequence length.
01:41:25.420 | We can then apply the mask.
01:41:30.860 | As you remember, the mask is something that we apply when we calculate the attention,
01:41:35.980 | if we don't want two tokens to relate to each other.
01:41:38.780 | We basically substitute their value.
01:41:41.820 | In this matrix, we substitute the interaction with minus infinity before applying the softmax,
01:41:47.500 | so that the softmax will make it zero.
01:41:49.820 | So this is what we are doing here.
01:41:51.260 | We first build the mask.
01:41:54.540 | This will create a causal mask, basically a mask where the upper triangle,
01:42:00.700 | so above the principal diagonal, is made up of ones, a lot of ones.
01:42:22.540 | And then we fill those positions up with minus infinity.
01:42:32.940 | Masked fill, with the mask, and we put minus infinity, like this.
01:42:46.220 | As you remember, the formula of the transformer is the
01:42:48.860 | query multiplied by the transpose of the keys, and then divided by the square root of d_model.
01:42:54.220 | So this is what we will do now.
01:42:55.580 | So we divide by the square root of d_head instead of d_model, since we split into heads.
01:43:02.140 | And then we apply the softmax.
01:43:15.020 | We multiply it by the V matrix.
01:43:17.180 | We transpose back.
01:43:24.540 | So we want to remove, now we want to remove the head dimension.
01:43:29.420 | So output is equal to, let me write some shapes.
01:43:35.820 | So what is this?
01:43:37.020 | This is equal to (batch size, H, sequence length, sequence length),
01:43:45.580 | matrix-multiplied with (batch size, H, sequence length, dimension divided by H).
01:43:56.620 | This will result into (batch size, H, sequence length, dimension divided by H).
01:44:09.820 | Then we transpose,
01:44:15.660 | and this will result into
01:44:23.580 | (batch size, sequence length,
01:44:38.860 | H, and dimension divided by H), okay.
01:44:44.620 | Then we can reshape as the input, like the initial shape, so this one.
01:44:57.980 | And then we apply the output projection.
01:45:02.540 | So we multiply it by the WO matrix.
01:45:13.900 | Okay.
01:45:22.060 | This is the self-attention.
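Collecting the steps above, the self-attention class can be sketched like this (a sketch of what was just described, with the mask applied before the division and softmax, as in the walkthrough):

```python
import math
import torch
from torch import nn
from torch.nn import functional as F

class SelfAttention(nn.Module):
    def __init__(self, n_heads, d_embed, in_proj_bias=True, out_proj_bias=True):
        super().__init__()
        # Wq, Wk, Wv represented as one big linear layer, split later with chunk
        self.in_proj = nn.Linear(d_embed, 3 * d_embed, bias=in_proj_bias)
        self.out_proj = nn.Linear(d_embed, d_embed, bias=out_proj_bias)
        self.n_heads = n_heads
        self.d_head = d_embed // n_heads

    def forward(self, x, causal_mask=False):
        # x: (batch, seq_len, d_embed)
        input_shape = x.shape
        batch_size, seq_len, d_embed = input_shape
        interim_shape = (batch_size, seq_len, self.n_heads, self.d_head)

        q, k, v = self.in_proj(x).chunk(3, dim=-1)
        # (batch, seq_len, d_embed) -> (batch, n_heads, seq_len, d_head)
        q = q.view(interim_shape).transpose(1, 2)
        k = k.view(interim_shape).transpose(1, 2)
        v = v.view(interim_shape).transpose(1, 2)

        # (batch, n_heads, seq_len, seq_len)
        weight = q @ k.transpose(-1, -2)
        if causal_mask:
            # ones above the principal diagonal -> filled with -inf before the softmax
            mask = torch.ones_like(weight, dtype=torch.bool).triu(1)
            weight.masked_fill_(mask, float("-inf"))
        weight /= math.sqrt(self.d_head)
        weight = F.softmax(weight, dim=-1)

        # (batch, n_heads, seq_len, d_head) -> (batch, seq_len, d_embed)
        output = weight @ v
        output = output.transpose(1, 2).reshape(input_shape)
        return self.out_proj(output)
```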
01:45:26.220 | Now let's go back to continue building the decoder.
01:45:29.100 | For now we have built the attention block and the residual block.
01:45:31.900 | But we need to build the decoder.
01:45:45.740 | And also this one is a sequence of modules that we will apply one after another.
01:46:00.860 | We start with the convolution just like before.
01:46:03.180 | Now I will not write again the shapes change, but you got the idea.
01:46:07.100 | In the encoder we, in the encoder, let me show you here.
01:46:12.860 | Here.
01:46:17.900 | In the encoder we keep reducing the size of the image until it becomes small.
01:46:22.540 | In the decoder we need to return to the original size of the image.
01:46:27.180 | So we start with the latent dimension and we return to the original dimension of the image.
01:46:34.940 | Convolution.
01:46:39.100 | So we start with four channels and we output four channels.
01:46:54.460 | Then we have another convolution.
01:47:03.100 | We go to 512.
01:47:03.100 | Then we have a residual block just like before.
01:47:12.220 | Then we have an attention block.
01:47:23.180 | Then we have a bunch of residual blocks and we have four of them.
01:47:31.980 | Let me copy.
01:47:42.540 | Okay.
01:47:45.360 | Now the residual blocks, let me write some shapes here.
01:47:49.820 | Here we arrived to a situation in which we have batch size.
01:47:53.660 | We have 512 features and the size of the image still didn't grow
01:47:58.700 | because we didn't have any convolution that will make it grow.
01:48:02.140 | This one of course will remain the same because it's a residual block and etc.
01:48:12.700 | Now to increase the size of the image.
01:48:15.580 | So now the image is actually height divided by 8 which height as you remember is 512,
01:48:21.740 | the size of the image that we are working with.
01:48:24.220 | So this dimension here is 64 by 64.
01:48:27.980 | How can we increase it?
01:48:29.260 | We use one module called upsample.
01:48:31.260 | The upsample, we have to think of it like when we resize an image.
01:48:42.220 | So imagine you have an image that is 64 by 64
01:48:45.340 | and you want to transform it to 128 by 128.
01:48:48.860 | The upsample will do it just like when we resize an image.
01:48:52.620 | So it will replicate the pixels twice.
01:48:57.580 | So along the dimensions right and down for example twice.
01:49:02.220 | So that the total amount of pixels, the height and the width actually doubles.
01:49:07.180 | This is the upsample basically.
01:49:10.860 | It will just replicate each pixel so that by this scale factor along each dimension.
01:49:16.940 | So this one becomes: batch size, 512, height
01:49:24.060 | divided by 8, width divided by 8 becomes, as we see here,
01:49:38.460 | height divided by 4 and width divided by 4.
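A quick hedged illustration of nn.Upsample with its default nearest-neighbor replication (hypothetical tensor sizes):

```python
x = torch.randn(1, 512, 64, 64)
up = nn.Upsample(scale_factor=2)   # nearest-neighbor by default: each pixel replicated along h and w
print(up(x).shape)                 # torch.Size([1, 512, 128, 128])
```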
01:49:41.260 | Then we have a convolution, residual blocks.
01:49:47.420 | So we have convolutions of 2D, 512 to 512.
01:50:05.180 | Then we have residual blocks of 512 by 512.
01:50:09.340 | But in this case we have three of them, 2, 3.
01:50:11.660 | Then we have another upsample.
01:50:14.060 | This will again double the size of the image.
01:50:17.660 | So we have another one that will double the size of the image.
01:50:20.860 | And by a scale factor of 2.
01:50:23.340 | So now our image which was divided by 4 with 512 channels.
01:50:29.340 | So let's write it like this.
01:50:30.780 | Will become divided by 2 now.
01:50:34.300 | So it will double the size of the image.
01:50:37.100 | So now our image is 256 by 256.
01:50:41.660 | Then again we have a convolution.
01:50:45.980 | And then we have three residual blocks again.
01:50:54.940 | But this time we reduce the number of features.
01:51:06.780 | So from 512 to 256, and then it's 256 to 256.
01:51:06.780 | Okay, then we have another upsampling which will again double the size of the image.
01:51:26.220 | And this time we will go from divide by 2 up to the original size.
01:51:26.220 | And because the number of channels has changed, we are not 512 anymore.
01:51:31.100 | Okay.
01:51:32.480 | And then we have another convolution.
01:51:35.820 | This case with 256 because it's the new number of features.
01:51:40.540 | Then we have another bunch of residual blocks that will decrease the number of features.
01:52:00.060 | So we go from 256 to 128.
01:52:00.060 | We have finally a group norm.
01:52:07.980 | 32 is the group size.
01:52:12.700 | So we group features in groups of 32 before calculating the mu and the sigma before normalizing.
01:52:20.540 | And we define the number of channels as 128 which is the number of features that we have.
01:52:25.180 | So this group normalization will divide these 128 features into groups of 32.
01:52:32.380 | Then we apply the silu.
01:52:36.220 | And then we have a convolution.
01:52:41.340 | The final convolution that will transform into an image with the three channels.
01:52:47.660 | So RGB by applying these convolutions here which doesn't change the size of the output.
01:52:54.380 | So we'll go from an image that is batch size 128 height width.
01:53:01.660 | Why height width?
01:53:02.380 | Because after the last upsampling we become of the original size into an image with only three channels.
01:53:10.860 | And this is our decoder.
01:53:20.860 | Now we can write the forward method.
01:53:36.780 | I'm sorry if I'm putting a lot of spaces between here.
01:53:39.580 | But otherwise it's easy to get lost and not understand where we are.
01:53:43.580 | So here the input of the decoder is our latent.
01:53:50.620 | So it's batch size 4 height divided by 8 width divided by 8.
01:53:57.500 | As you remember here in the encoder the last thing we do is be scaled by this constant.
01:54:02.300 | So we nullify this scaling.
01:54:05.180 | So we reverse this scaling by dividing by
01:54:07.020 | 0.18215, and then we run it through the decoder.
01:54:13.980 | And then return x which is batch size 3 height and width.
01:54:31.260 | Let me also write the input of this decoder which is this one.
01:54:35.500 | We already have it.
01:54:37.820 | Okay this is our variational auto encoder.
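Condensing the layer list and the forward method just described, the decoder can be sketched as follows:

```python
class VAE_Decoder(nn.Sequential):
    def __init__(self):
        super().__init__(
            nn.Conv2d(4, 4, kernel_size=1, padding=0),
            nn.Conv2d(4, 512, kernel_size=3, padding=1),
            VAE_ResidualBlock(512, 512),
            VAE_AttentionBlock(512),
            VAE_ResidualBlock(512, 512),
            VAE_ResidualBlock(512, 512),
            VAE_ResidualBlock(512, 512),
            VAE_ResidualBlock(512, 512),
            # (b, 512, h/8, w/8) -> (b, 512, h/4, w/4)
            nn.Upsample(scale_factor=2),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            VAE_ResidualBlock(512, 512),
            VAE_ResidualBlock(512, 512),
            VAE_ResidualBlock(512, 512),
            # (b, 512, h/4, w/4) -> (b, 512, h/2, w/2)
            nn.Upsample(scale_factor=2),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            VAE_ResidualBlock(512, 256),
            VAE_ResidualBlock(256, 256),
            VAE_ResidualBlock(256, 256),
            # (b, 256, h/2, w/2) -> (b, 256, h, w)
            nn.Upsample(scale_factor=2),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            VAE_ResidualBlock(256, 128),
            VAE_ResidualBlock(128, 128),
            VAE_ResidualBlock(128, 128),
            nn.GroupNorm(32, 128),
            nn.SiLU(),
            nn.Conv2d(128, 3, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # x: (batch, 4, h/8, w/8) - reverse the scaling applied at the end of the encoder
        x = x / 0.18215
        for module in self:
            x = module(x)
        # (batch, 3, h, w)
        return x
```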
01:54:41.180 | So far let's go review.
01:54:43.740 | We are building our architecture of the stable diffusion.
01:54:50.620 | So far we have built the encoder and the decoder.
01:54:53.980 | But now we have to build the unit and then we have to build the clip text encoder.
01:55:01.020 | And finally we have to build the pipeline that will connect all of these things.
01:55:05.740 | So it's going to be a long journey but it's fun actually to build things.
01:55:10.220 | Because you learn every detail of how they work.
01:55:13.180 | So the next thing that we are going to build is the text encoder.
01:55:16.780 | So this clip encoder here that will allow us to encode the prompt into embeddings
01:55:22.700 | that we can then feed to this unit model here.
01:55:25.660 | So let's build this clip encoder.
01:55:28.620 | And we will of course use a pre-trained version.
01:55:31.180 | So by downloading the vocabulary and I will show you how it works.
01:55:35.260 | So let's start.
01:55:36.700 | We go to Visual Studio Code.
01:55:39.340 | We create a new file in st folder called clip.py.
01:55:43.260 | And here.
01:55:45.820 | And we start importing the usual stuff.
01:55:49.020 | [typing]
01:56:08.060 | And we also import self-attention because we will be using it.
01:56:10.620 | So basically clip is a layer.
01:56:15.100 | It's very similar to the encoder layer of the transformer.
01:56:18.620 | So as you remember the transformer.
01:56:20.300 | Let me show you here.
01:56:21.580 | The transformer.
01:56:24.140 | This is the encoder layer of the transformer.
01:56:26.460 | It's made of attention and then feed forwards.
01:56:30.060 | And there are many blocks like this one after another that are applied one after another.
01:56:34.140 | We also have something that tells the position of each token inside of the sentence.
01:56:38.700 | And we will also have something similar in clip.
01:56:41.020 | So we need to build something very similar to this one.
01:56:44.460 | And actually this is why I mean the transformer model was very successful.
01:56:48.140 | So that's why they use the same structure of course also for this purpose.
01:56:51.420 | And so let's go to build it.
01:56:54.140 | The first thing we will build.
01:56:57.500 | I will build first the skeleton of the model and then we will build each block.
01:57:00.700 | So let's build clip.
01:57:02.860 | [typing]
01:57:13.100 | And this has some embeddings.
01:57:15.820 | The embeddings allow us to convert the tokens.
01:57:19.660 | So as you remember in when you have a sentence made up of text.
01:57:23.180 | First you convert it into numbers.
01:57:25.260 | Where each number indicates the position of the token inside of the vocabulary.
01:57:28.940 | And then you convert it into embeddings.
01:57:31.180 | Where each embedding represents a vector of size 512 in the original transformer.
01:57:36.220 | But here in clip the size is 768.
01:57:40.300 | And each vector kind of captures the meaning of the word or of the token.
01:57:45.660 | So this is an embedding.
01:57:48.380 | And later we define it.
01:57:51.260 | We need the vocabulary size.
01:57:52.620 | The vocabulary size is 49408.
01:57:55.420 | I took it directly from the file.
01:57:56.940 | This is the embedding size.
01:57:59.100 | And the sequence length.
01:58:00.140 | The maximum sequence length that we can have.
01:58:02.060 | Because we use padding, it is 77.
01:58:05.260 | Actually we should use some configuration file to store these values.
01:58:09.820 | But because we will be using the pre-trained stable diffusion model.
01:58:13.820 | The sizes are already fixed for us.
01:58:16.060 | But in the future I will refactor the code to add some configuration actually.
01:58:20.300 | To make it more extensible.
01:58:23.900 | This is a list of layers.
01:58:31.180 | Each we call it the clip layer.
01:58:36.060 | We have this 12.
01:58:37.340 | Which indicates the number of head of the multihead attention.
01:58:43.020 | And then the embedding size which is 768.
01:58:47.020 | And we have 12 of these layers.
01:58:51.020 | Then we have the layer normalization.
01:58:54.940 | Layer norm.
01:58:58.460 | And we tell him how many features.
01:59:03.100 | So 768.
01:59:05.180 | And then we define the forward method.
01:59:07.580 | This is a long tensor.
01:59:14.940 | And this one returns float tensor.
01:59:18.460 | Why long tensor?
01:59:20.940 | Because the input IDs are usually numbers.
01:59:25.340 | That indicate the position of each token inside of the vocabulary.
01:59:28.460 | Also this concept.
01:59:29.420 | Please if it's not clear.
01:59:30.460 | Go watch my previous video.
01:59:31.980 | About the transformer.
01:59:32.540 | Because it's very clear there.
01:59:34.460 | When we work with the textual models.
01:59:36.540 | Okay.
01:59:44.140 | First we convert each token into embeddings.
01:59:46.140 | And then.
01:59:52.380 | So what is the size here?
01:59:54.300 | We are going from batch size.
01:59:55.740 | Sequence length into.
02:00:00.620 | Batch size.
02:00:03.260 | Sequence length.
02:00:04.460 | And dimension.
02:00:05.100 | Where the dimension is 768.
02:00:07.260 | Then we apply one after.
02:00:09.580 | One after another.
02:00:12.220 | All the layers of this encoder.
02:00:13.740 | Just like in the transformer model.
02:00:18.060 | And the last one we apply the layer normalization.
02:00:30.320 | And finally we return the output.
02:00:36.140 | Where the output is.
02:00:37.100 | Of course it's a sequence to sequence model.
02:00:41.660 | Just like the transformer.
02:00:42.620 | So the input should match the.
02:00:44.140 | The shape of the input should match the shape of the output.
02:00:47.740 | So we always obtain batch size, sequence length, d_model.
02:00:52.300 | Okay.
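As a summary of the skeleton just described, here is a minimal sketch (CLIPEmbedding and CLIPLayer are the two blocks defined right after; the sizes 49408, 768, 77 and the 12 layers with 12 heads come from the pretrained Stable Diffusion 1.5 checkpoint):

```python
import torch
from torch import nn

class CLIP(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = CLIPEmbedding(49408, 768, 77)   # vocab size, embedding size, max tokens
        self.layers = nn.ModuleList([CLIPLayer(12, 768) for _ in range(12)])
        self.layernorm = nn.LayerNorm(768)

    def forward(self, tokens: torch.LongTensor) -> torch.FloatTensor:
        tokens = tokens.type(torch.long)
        # (batch_size, seq_len) -> (batch_size, seq_len, 768)
        state = self.embedding(tokens)
        for layer in self.layers:
            state = layer(state)
        # The output keeps the same shape: (batch_size, seq_len, 768).
        return self.layernorm(state)
```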
02:00:53.980 | Now let's define these two blocks.
02:00:55.580 | The first one is the clip embedding.
02:00:57.340 | So let's go.
02:00:59.660 | Clip embedding.
02:01:06.300 | How much is the vocabulary size?
02:01:10.220 | What is the embedding size?
02:01:12.860 | And number of token.
02:01:20.060 | Okay.
02:01:21.660 | So the sequence length basically.
02:01:28.560 | Okay.
02:01:36.060 | We define the embedding itself.
02:01:38.380 | Using nn.embedding.
02:01:39.740 | Just like always.
02:01:40.540 | We need to tell him what is the number of embeddings.
02:01:49.740 | So the vocabulary size.
02:01:51.260 | And what is the dimension of each vector of the embedding token.
02:01:54.540 | Then we define some positional encoding.
02:01:57.420 | So now as you remember.
02:01:59.580 | The positional encoding in the original transformer.
02:02:01.580 | Are given by sinusoidal functions.
02:02:04.300 | But here in clip.
02:02:06.140 | They actually don't use them.
02:02:07.500 | They use some learned parameters.
02:02:10.380 | So they have these parameters.
02:02:13.260 | That are learned by the model during training.
02:02:15.980 | That tell the position of the token to the model.
02:02:23.740 | Tokens and embeddings.
02:02:27.740 | Like this.
02:02:29.260 | We apply them.
02:02:39.180 | So first we apply the embedding.
02:02:40.620 | So we go from.
02:02:41.500 | Patch size.
02:02:43.340 | Sequence length.
02:02:46.960 | Patch size.
02:02:49.900 | Sequence length.
02:02:51.180 | Dimension.
02:02:51.980 | And then just like in the original transformer.
02:03:02.460 | We add the positional encodings to each token.
02:03:07.820 | But in this case as I told you.
02:03:09.180 | The positional embeddings are not fixed.
02:03:12.620 | Like not sinusoidal functions.
02:03:14.380 | But they are learned by the model.
02:03:16.460 | So they are learned.
02:03:17.260 | And then later we will load these parameters.
02:03:19.740 | When we load the model.
02:03:21.020 | And then we return this x.
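A minimal sketch of this embedding block, with the learned (not sinusoidal) positional parameters that will later be loaded from the pretrained weights:

```python
import torch
from torch import nn

class CLIPEmbedding(nn.Module):
    def __init__(self, n_vocab: int, n_embd: int, n_tokens: int):
        super().__init__()
        self.token_embedding = nn.Embedding(n_vocab, n_embd)
        # Learned positional embedding: one vector of size n_embd per position.
        self.position_embedding = nn.Parameter(torch.zeros(n_tokens, n_embd))

    def forward(self, tokens: torch.LongTensor) -> torch.Tensor:
        # (batch_size, seq_len) -> (batch_size, seq_len, n_embd)
        x = self.token_embedding(tokens)
        # Add the position information to every token embedding.
        return x + self.position_embedding
```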
02:03:23.420 | Then we have the clip layer.
02:03:26.220 | Which is just like the layer of the transformer model.
02:03:29.100 | The encoder of the transformer model.
02:03:43.420 | So it returns nothing actually.
02:03:47.420 | And this one is the init.
02:04:01.100 | Okay.
02:04:03.600 | We have just like in the transformer block.
02:04:06.620 | We have the pre norm.
02:04:08.860 | Then we have the attention.
02:04:09.820 | Then we have a post norm.
02:04:11.020 | And then we have the feed forward.
02:04:12.300 | So layer normalization.
02:04:16.860 | Then we have the attention.
02:04:27.900 | Which is a self attention.
02:04:31.980 | Later we will build the cross attention.
02:04:37.580 | And I will show you what is it.
02:04:38.780 | Then we have another layer normalization.
02:04:43.820 | Then we have two feed forward layers.
02:05:03.260 | And finally we have the forward method.
02:05:16.780 | Finally.
02:05:17.260 | So this one takes tensor.
02:05:23.500 | And returns a tensor.
02:05:24.700 | So let me write it.
02:05:25.660 | Tensor.
02:05:27.500 | Okay.
02:05:32.300 | Just like the transformer model.
02:05:33.740 | Okay let's go have a look.
02:05:35.020 | We have a bunch of residual connections.
02:05:37.740 | As you can see here.
02:05:38.460 | One residual connection here.
02:05:39.580 | One residual connection here.
02:05:40.780 | We have two normalizations.
02:05:41.980 | One here.
02:05:42.460 | One here.
02:05:42.960 | The feed forward.
02:05:44.460 | As just like in the original transformer.
02:05:46.220 | We have two linear layers.
02:05:48.780 | And then we have this multi head attention.
02:05:50.700 | Which is actually a self attention.
02:05:51.980 | Because it's the same input that becomes query key and values.
02:05:55.100 | So let's do it.
02:05:57.580 | The first residual connection x.
02:06:00.940 | So what is the input of this forward method?
02:06:03.260 | It's a batch size.
02:06:04.300 | Sequence length, d_model.
02:06:07.580 | And the dimension of the embedding which is 768.
02:06:11.980 | The first thing we do is we apply the self attention.
02:06:15.260 | But before applying the self attention.
02:06:18.220 | We apply the layer normalization.
02:06:20.300 | So layer normal 1.
02:06:23.100 | Then we apply the attention.
02:06:28.940 | But with the causal mask.
02:06:30.220 | As you remember here.
02:06:36.780 | Self attention.
02:06:37.580 | We have the causal mask.
02:06:39.180 | Which basically means that every token cannot watch the next tokens.
02:06:42.940 | So cannot be related to future tokens.
02:06:44.780 | But only the one on the left of it.
02:06:46.860 | And this is what we want from a text model actually.
02:06:49.900 | We don't want the one word to watch the words that come after it.
02:06:53.900 | But only the words that come before it.
02:06:55.820 | Then we do this residual connection.
02:06:59.020 | So now we are.
02:06:59.820 | Now we are doing this connection here.
02:07:04.380 | Then we do the feed forward layer.
02:07:10.540 | Again we have a residual connection.
02:07:14.460 | We apply the normalization.
02:07:18.300 | I'm not writing all the shapes.
02:07:24.220 | If you watch my code online.
02:07:25.660 | I have written all of them.
02:07:27.340 | But mostly to save time.
02:07:28.780 | Because here we are already familiar with the structure of the transformer.
02:07:32.940 | Hopefully.
02:07:33.740 | So I am not repeating all the shapes here.
02:07:35.900 | We apply the first linear of the feed forward.
02:07:43.900 | Then as activation function.
02:07:47.980 | We use the GELU function.
02:07:49.740 | And actually we use the QuickGELU function.
02:07:53.100 | Which is defined like this.
02:07:56.540 | X multiplied by torch dot sigmoid.
02:07:59.980 | Of 1.702 multiplied by x.
02:08:05.100 | And that's it.
02:08:09.260 | Should be like this.
02:08:10.140 | So this is called the QuickGELU activation function.
02:08:16.940 | Also here.
02:08:17.500 | There is no justification on why we should use this one and not another one.
02:08:23.420 | They just saw that in practice this one works better for this kind of application.
02:08:27.100 | So that's why we are using this function here.
02:08:29.340 | So now.
02:08:31.100 | And then we apply the residual connection.
02:08:37.900 | And finally we return x.
02:08:40.380 | This is exactly like the feed forward layer of the transformer.
02:08:44.940 | Except that in the transformer we don't have this activation function.
02:08:47.580 | But we have the ReLU function.
02:08:48.780 | And if you remember, in LLaMA we don't have the ReLU function.
02:08:52.460 | We have the SwiGLU function.
02:08:54.300 | But here we are using the QuickGELU function.
02:08:56.940 | Which I actually am not so familiar with.
02:08:58.780 | But I think that it works good for this model.
02:09:02.940 | And they just kept it.
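A minimal sketch of the CLIPLayer just described: pre-norm causal self-attention, then a feed-forward block with the QuickGELU activation, each with a skip connection. SelfAttention is the module written earlier in the video; the import path and the causal_mask keyword are assumptions about its interface.

```python
import torch
from torch import nn
from attention import SelfAttention   # the self-attention module built earlier in the video

class CLIPLayer(nn.Module):
    def __init__(self, n_head: int, n_embd: int):
        super().__init__()
        self.layernorm_1 = nn.LayerNorm(n_embd)
        self.attention = SelfAttention(n_head, n_embd)
        self.layernorm_2 = nn.LayerNorm(n_embd)
        self.linear_1 = nn.Linear(n_embd, 4 * n_embd)
        self.linear_2 = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, n_embd)
        residue = x
        x = self.layernorm_1(x)
        x = self.attention(x, causal_mask=True)   # each token only attends to tokens on its left
        x = x + residue

        residue = x
        x = self.layernorm_2(x)
        x = self.linear_1(x)
        x = x * torch.sigmoid(1.702 * x)           # QuickGELU activation
        x = self.linear_2(x)
        return x + residue
```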
02:09:03.820 | So now we have built our text encoder here.
02:09:08.060 | CLIP.
02:09:08.460 | Which is very small as you can see.
02:09:09.980 | And our next thing to build is our unit.
02:09:15.900 | So we have built the variational autoencoder.
02:09:18.780 | The encoder part.
02:09:19.900 | And the decoder part.
02:09:21.580 | Now the next thing we have to build is this unit.
02:09:25.340 | As you remember, the unit is the network that receives a noisified image.
02:09:31.900 | And the amount of noise.
02:09:33.420 | So we also indicate to the network what is the amount of noise that we added to this image.
02:09:38.940 | The model has to predict how much noise is there.
02:09:43.740 | And how to remove it.
02:09:44.700 | And this unit is a bunch of convolutions.
02:09:49.660 | That will reduce the size of the image.
02:09:52.060 | As you can see.
02:09:52.700 | With each step.
02:09:54.620 | But by increasing the number of features.
02:09:58.540 | So we reduce the size.
02:09:59.580 | But we increase exactly what we did in the encoder of the variational autoencoder.
02:10:04.700 | And then we do the reverse steps.
02:10:07.740 | Just like we did with the decoder of the variational autoencoder.
02:10:10.540 | So now again we will work with some convolutions.
02:10:13.900 | With the residual blocks.
02:10:15.100 | With attentions.
02:10:17.260 | The one big difference is that we need to tell our unit.
02:10:21.100 | Not only the image that is already.
02:10:24.060 | So what is the image with noise.
02:10:26.620 | Not only the amount of noise.
02:10:29.500 | So the time step at which this noise was added.
02:10:32.540 | But also the prompt.
02:10:34.540 | Because as you remember we need to also tell this unit what is our prompt.
02:10:40.060 | Because we need to tell him how we want our output image to be.
02:10:44.940 | Because there are many ways to denoise the initial noise.
02:10:47.660 | So if we want the initial noise to become a dog.
02:10:50.460 | We need to tell him we want a dog.
02:10:52.140 | If we want the initial noise to become a cat.
02:10:53.900 | We need to tell him we want a cat.
02:10:56.140 | So the unit has to know what is the prompt.
02:10:58.620 | And also he has to relate this prompt with the rest of the information.
02:11:03.180 | And what is the best way to combine two different stuff.
02:11:07.580 | So for example an image with text.
02:11:10.220 | We will use what is called the cross attention.
02:11:13.660 | Cross attention basically allows us to calculate the attention between two sequences.
02:11:18.460 | In which the query is the first sequence.
02:11:21.980 | And the keys and the values are coming from another sequence.
02:11:25.340 | So let's go build it and let's see how this works.
02:11:28.060 | Now the first thing we will do is create a new class.
02:11:33.740 | New file here called diffusion.
02:11:36.140 | Because this will be our diffusion model.
02:11:39.900 | And I think also here I will build from top down.
02:11:44.860 | So we first define the diffusion class.
02:11:47.820 | And then we build each block one by one.
02:11:49.740 | Let's start by importing the usual libraries.
02:11:53.900 | So import torch.
02:12:03.100 | From torch.
02:12:10.380 | And then we import the attention.
02:12:15.420 | The self attention.
02:12:19.340 | But also we will need the cross attention.
02:12:22.460 | Attention.
02:12:24.620 | And later we will build it.
02:12:25.740 | Then let's create the class diffusion.
02:12:32.940 | The class diffusion is basically our unit.
02:12:34.860 | This is made of time embedding.
02:12:46.140 | So something that we will define it later.
02:12:50.780 | Time embedding.
02:12:52.220 | 320 which is the size of the time embedding.
02:12:56.140 | So because we need to give the unit not only the noisified image.
02:13:00.700 | But also the time step at which it was noisified.
02:13:03.500 | So the image, the unit needs some way to understand this time step.
02:13:10.300 | So this is why this time step which is a number.
02:13:12.540 | Will be converted into an embedding.
02:13:14.620 | By using this particular module called time embedding.
02:13:17.260 | And later we will see it.
02:13:18.220 | Then we build the unit.
02:13:21.340 | And then the output layer of the unit.
02:13:27.180 | And later we will see what is it.
02:13:29.660 | This output layer.
02:13:30.620 | The output layer.
02:13:33.500 | Later we will see how to build it.
02:13:37.100 | Let's do the forward.
02:13:39.020 | As you remember the unit will receive the latent.
02:13:45.740 | So this Z which is a latent.
02:13:48.220 | Is the output of the variational autoencoder.
02:13:50.460 | So this latent which is a torch dot tensor.
02:13:53.020 | It will receive the context.
02:13:55.260 | What is the context?
02:13:56.380 | Is our prompt.
02:13:57.340 | Which is also a torch dot tensor.
02:14:00.620 | And it will receive the time.
02:14:02.460 | At which this latent was noisified.
02:14:04.700 | Which is also.
02:14:07.020 | I don't remember.
02:14:07.740 | I think it's a tensor also.
02:14:09.660 | Later I define it.
02:14:12.860 | Okay yeah it's tensor.
02:14:15.580 | Okay let's define the sizes.
02:14:21.420 | So the latent here is batch size.
02:14:25.500 | 4 because 4 is the output of the encoder.
02:14:28.380 | If you remember correctly here.
02:14:30.780 | Closing.
02:14:33.500 | Okay.
02:14:33.740 | Height divided by 8 and width divided by 8.
02:14:38.620 | Then we have the context.
02:14:41.180 | Which is our prompt.
02:14:42.300 | Which we already converted with the clip encoder here.
02:14:46.540 | Which will be batch size.
02:14:48.940 | By sequence length.
02:14:50.140 | By dimension.
02:14:50.860 | Where the dimension is 768.
02:14:53.020 | Like we defined before.
02:14:54.540 | And the time will be another.
02:14:55.820 | We will define it later.
02:14:58.460 | How it's defined.
02:14:59.580 | How it's built.
02:15:00.300 | But it's each embedding.
02:15:03.020 | It's a number with an embedding of size.
02:15:05.180 | It's a vector of a size of 320.
02:15:07.820 | The first thing we do is.
02:15:11.260 | We convert this time into an embedding.
02:15:13.180 | And actually this time.
02:15:17.020 | We will see later.
02:15:17.820 | That it's actually.
02:15:18.700 | Just like the positional encoding.
02:15:20.940 | Of the transformer model.
02:15:23.180 | It's actually a number that is multiplied by.
02:15:26.540 | Sines and cosines.
02:15:28.860 | Just like in the transformer.
02:15:30.940 | Because they saw that it works for the transformer.
02:15:33.500 | So we can also use the same positional encoding.
02:15:35.820 | To convey the information of the time.
02:15:37.980 | Which is actually kind of an information.
02:15:39.980 | About position.
02:15:40.780 | So it tells the model.
02:15:42.300 | At which step we arrived in the denoisification.
02:15:48.460 | So this one will convert a tensor of (1, 320).
02:15:53.260 | Into a tensor of (1, 1280).
02:15:57.740 | The unit will convert our latent.
02:16:03.180 | Into another latent.
02:16:04.620 | So it will not change the size.
02:16:11.500 | Batch, 4, height divided by 8, width divided by 8. This is the output.
02:16:18.780 | Of the encoder of the variational autoencoder.
02:16:20.140 | Which first becomes batch, 320 features, height divided by 8, width divided by 8.
02:16:30.060 | By going through the unit.
02:16:39.900 | So why here we have more features.
02:16:45.020 | Than the starting.
02:16:45.900 | Because let's review here.
02:16:47.420 | As you can see.
02:16:48.140 | The last layer of the unit.
02:16:50.940 | Actually we need to go back.
02:16:53.660 | To the same number of the features.
02:16:57.340 | You can see here.
02:16:58.060 | So here we start.
02:16:59.820 | Actually the dimensions here.
02:17:01.260 | Don't match what we will be using.
02:17:02.620 | So this is the original unit.
02:17:04.300 | But the one used.
02:17:05.180 | By stable diffusion is a modified unit.
02:17:08.380 | So in the last.
02:17:09.580 | When we build the decoder.
02:17:11.260 | The decoder will not build.
02:17:12.700 | The final number of features that we need.
02:17:15.100 | Which is four.
02:17:16.140 | But we need an additional output layer.
02:17:18.060 | To go back to the original size of features.
02:17:20.700 | And this is the job of this output layer.
02:17:23.100 | So later we will see.
02:17:25.580 | When we build this this layer.
02:17:28.140 | So output is equal to self dot final.
02:17:30.860 | This one will go from this size here.
02:17:35.580 | To back to the original size of the unit.
02:17:42.540 | Because the unit.
02:17:44.940 | His job is to take in latents.
02:17:47.900 | Predict how much noise is it.
02:17:49.900 | Then take again the same latent.
02:17:51.900 | Predict how much noise.
02:17:53.020 | We remove it.
02:17:54.140 | We remove the noise.
02:17:55.020 | Then again we give another latent.
02:17:57.020 | We predict how much noise.
02:17:58.220 | We remove the noise.
02:17:59.180 | We give another latent.
02:18:01.100 | We predict the noise.
02:18:01.900 | We remove the noise.
02:18:02.620 | Etc, etc, etc.
02:18:04.060 | So the output dimension must match the input dimension.
02:18:06.460 | And then we return the output.
02:18:09.660 | Which is the latent.
02:18:13.100 | Like this.
02:18:15.360 | Let's build first the time embedding.
02:18:18.540 | I think it's easy to build.
02:18:19.740 | So something that encodes information.
02:18:23.580 | About the time step in which we are.
02:18:26.060 | [TYPING SOUNDS]
02:18:28.940 | Okay.
02:18:40.320 | It is made of two linear layers.
02:18:45.740 | Nothing fancy here.
02:18:46.860 | Linear one.
02:18:54.460 | Which will map it to 4 times n_embd.
02:18:58.060 | And then linear two.
02:18:59.900 | From 4 times n_embd into 4 times n_embd.
02:19:11.180 | And now you understand why it becomes 1280.
02:19:15.180 | Which is 4 times 320.
02:19:18.300 | [TYPING SOUNDS]
02:19:21.340 | This one returns to a tensor.
02:19:27.580 | So the input size is (1, 320).
02:19:32.300 | What we do is first we apply this first layer.
02:19:36.380 | Linear one.
02:19:38.300 | Then we apply the SiLU function.
02:19:40.700 | Then we apply again the second linear layer.
02:19:45.900 | [TYPING SOUNDS]
02:19:49.900 | And then we return it.
02:19:51.100 | Nothing special here.
02:19:52.540 | The output dimension is 1 by 1280.
02:19:58.220 | [TYPING SOUNDS]
02:20:02.220 | Okay.
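A minimal sketch of this TimeEmbedding module, two linear layers with a SiLU in between:

```python
import torch
from torch import nn
from torch.nn import functional as F

class TimeEmbedding(nn.Module):
    def __init__(self, n_embd: int):
        super().__init__()
        self.linear_1 = nn.Linear(n_embd, 4 * n_embd)
        self.linear_2 = nn.Linear(4 * n_embd, 4 * n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (1, 320) -> (1, 1280)
        x = self.linear_1(x)
        x = F.silu(x)
        return self.linear_2(x)
```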
02:20:04.560 | The next thing we need to build is the unit.
02:20:06.860 | The unit will require many blocks.
02:20:10.860 | So let's first build the unit itself.
02:20:13.660 | And then we build each of the blocks that it will require.
02:20:16.220 | So class unit.
02:20:19.500 | [TYPING SOUNDS]
02:20:36.380 | As you can see, the unit is made up of one encoder branch.
02:20:40.060 | So this is like the encoder of the variational autoencoder.
02:20:44.140 | Things go down.
02:20:45.340 | So the image becomes smaller, smaller, smaller.
02:20:47.500 | But the channels keep increasing.
02:20:49.420 | The features keep increasing.
02:20:51.020 | Then we have this bottleneck layer here.
02:20:53.500 | It's called bottleneck.
02:20:54.860 | And then we have a decoder part here.
02:20:57.180 | So it becomes original size.
02:20:58.860 | The image from the very small size becomes the original size.
02:21:02.700 | And then we have these skip connections between the encoder and the decoder.
02:21:07.020 | So the output of each layer of each step of the encoder
02:21:12.460 | is connected to the same step of the decoder on the other side.
02:21:15.820 | And you will see this one here.
02:21:19.500 | So we start building the left side, which is the encoders.
02:21:22.380 | Which is a list of modules.
02:21:25.100 | And to build these encoders, we need to define a special layer, basically, that will apply...
02:21:37.980 | Okay, let's build it and then I will describe it.
02:21:40.940 | SwitchSequential.
02:21:48.060 | And basically, this switchSequential, given a list of layers, will apply them one by one.
02:22:05.340 | So we can think of it as a sequential.
02:22:07.580 | But it can recognize what are the parameters of each of them and will apply accordingly.
02:22:14.860 | So after I define it, it will be more clear.
02:22:17.180 | So first we have, just like before, a convolution.
02:22:20.300 | Because we want to increase the number of channels.
02:22:23.100 | So as you can see, at the beginning, we increase the number of channels of the image.
02:22:26.860 | Here it's 64, but we go directly to 320.
02:22:29.900 | And then we have another one of this switchSequential.
02:22:36.540 | Which is a unit residual block.
02:22:41.660 | We define it later.
02:22:44.700 | But it's very similar to the residual block that we built already for the variational autoencoder.
02:22:50.460 | And then we have an attention block, which is also very similar to the attention block
02:22:54.300 | that we built for the variational autoencoder.
02:23:03.900 | Then we have-- OK, I think it's better to build this switchSequential.
02:23:07.740 | Otherwise, we have too many-- yeah.
02:23:09.500 | Let's build it.
02:23:11.980 | It's very simple.
02:23:12.700 | As you can see, it's a sequence.
02:23:21.260 | But given x, which is our latent, which is a torch.tensor, our context, so our prompt.
02:23:33.580 | And the time, which is also a tensor.
02:23:36.540 | We'll apply them one by one.
02:23:42.380 | But based on what they are.
02:23:49.340 | So if the layer is a unit attention block, for example.
02:23:52.460 | It will apply it like this.
02:23:57.180 | So layer of x and context.
02:24:00.620 | Because this attention block basically will compute the cross-attention between
02:24:04.060 | our latent and the prompt.
02:24:05.980 | This is why.
02:24:06.620 | This residual block will compute-- will match our latent with its time step.
02:24:22.860 | And then if it's any other layer, we just apply it.
02:24:27.100 | And then we return, but after the for a while.
02:24:34.300 | Yeah.
02:24:34.540 | So this is-- now we understood this.
02:24:37.020 | We just need to define this residual block and this attention block.
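A minimal sketch of SwitchSequential as just described: it behaves like nn.Sequential, but passes each layer only the arguments it needs (the context for attention blocks, the time embedding for residual blocks). UNET_AttentionBlock and UNET_ResidualBlock are the blocks defined later in the video.

```python
import torch
from torch import nn

class SwitchSequential(nn.Sequential):
    def forward(self, x: torch.Tensor, context: torch.Tensor, time: torch.Tensor) -> torch.Tensor:
        for layer in self:
            if isinstance(layer, UNET_AttentionBlock):
                x = layer(x, context)   # cross-attention between the latent and the prompt
            elif isinstance(layer, UNET_ResidualBlock):
                x = layer(x, time)      # merge the latent with the time embedding
            else:
                x = layer(x)            # e.g. a plain convolution or an upsample
        return x
```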
02:24:40.220 | Then we have another sequence-- sequential switch.
02:24:48.300 | This one here.
02:24:49.020 | So the code I'm writing actually is based on a repository.
02:24:55.580 | Upon which actually most of the code I wrote is based on.
02:24:59.100 | Which is in turn based on another repository,
02:25:01.180 | which was originally written for TensorFlow, if I remember correctly.
02:25:04.460 | So actually, the code for stable diffusion-- because it's a model that is built by
02:25:11.500 | CompVis group at the LMU university, of course, it cannot be different from that code.
02:25:15.740 | So most of the code are actually similar to each other.
02:25:19.020 | I mean, you cannot create the same model and change the code.
02:25:23.180 | Of course, the code will be similar.
02:25:25.500 | So we again use this one-- switch sequential.
02:25:31.740 | So here we are building the encoder side.
02:25:34.140 | So we are reducing the size of the image.
02:25:54.220 | Let me check where we are.
02:25:56.220 | So we have the residual block of 320 to 640.
02:26:00.940 | And then we have an attention block of 8 heads and 80.
02:26:05.340 | And this attention block takes the number of head.
02:26:10.460 | This 8 indicates the number of head.
02:26:12.140 | And this indicates the embedding size.
02:26:14.780 | We will see later how we transform this, the output of this,
02:26:19.260 | into a sequence so that we can run attention on it.
02:26:22.620 | OK, we have this sequential.
02:26:26.460 | And then we have another one.
02:26:29.500 | Then we have another convolution.
02:26:38.380 | Let me just copy.
02:26:42.140 | Convolution of size from 640 to 640 channels.
02:26:49.660 | Kernel size 3, stride 2, padding 1.
02:26:51.900 | Then we have another residual block that will again increase the features.
02:26:59.260 | So from 640 to 1280.
02:27:05.020 | And then we have an attention block of 8 heads and 160 is the embedding size.
02:27:15.820 | Then we have another residual block of 1280 to 1280, with 8 heads and 160.
02:27:28.860 | So as you can see, just like in the encoder of the variational autoencoder,
02:27:31.980 | we, with these convolutions, we keep decreasing the size of the image.
02:27:38.940 | So actually here we started with the latent representation,
02:27:42.620 | which was height divided by 8 and height divided by 8.
02:27:47.180 | So let me write some shapes here.
02:27:49.100 | At least you need to understand the size changes.
02:27:52.700 | So batch size for height divided by 8 and width divided by 8.
02:27:58.940 | When we apply this convolution, it will become divided by 16.
02:28:04.460 | So it will become divided by 16.
02:28:13.100 | So it will become a very small image.
02:28:17.020 | And after we apply the second one, it will become divided by 32.
02:28:22.380 | So here we start from 16.
02:28:27.100 | Here it will become divided by 32.
02:28:32.380 | So what does it mean divided by 32?
02:28:34.060 | That if the initial image was of size 512, the latent is of size 64 by 64.
02:28:40.220 | Then it becomes 32 by 32.
02:28:43.340 | Now it has become 16 by 16.
02:28:45.580 | And then we apply these residual connections.
02:28:51.020 | And then we apply another convolutional layer,
02:28:57.820 | which will reduce the size of the image further.
02:29:01.660 | So from 32 here, divide by 32 and divide by 32 to divide by 64.
02:29:12.460 | Every time we divided the size of the image by 2.
02:29:18.140 | And the number of features is 1280 to 1280.
02:29:25.500 | And then we have a unit residual block.
02:29:34.940 | So let me copy also this one.
02:29:36.460 | Of 1280 and 1280.
02:29:46.060 | And then we have a last one, which is another one of the same size.
02:29:51.740 | So now we have an image that is height divided by 64 and width divided by 64,
02:29:57.340 | but with much more channels.
02:29:59.580 | I forgot to change the channel numbers here.
02:30:02.620 | So here is 1280 channels and divided by 64 divided by 64.
02:30:07.020 | And this one remains the same.
02:30:09.420 | Because the residual connections don't change the size.
02:30:12.940 | Here should be 1280 to 1280.
02:30:20.860 | Here should be 640 to 640.
02:30:25.020 | And here it should be 320 to 320.
02:30:31.020 | So as I said before, we keep reducing the size of the image,
02:30:35.900 | but we keep increasing this number of features of each pixel basically.
02:30:40.620 | Then we build the bottleneck,
02:30:42.780 | which is this part here of the unit.
02:30:49.420 | This is a sequence of a residual block.
02:30:57.420 | Then we have the attention block,
02:31:06.940 | which will make a self-attention.
02:31:11.100 | Sorry, not self-attention, cross-attention.
02:31:14.380 | And then we have another residual block.
02:31:16.540 | And then we have the decoder.
02:31:24.140 | So in the decoder, we will do the opposite of what we did in the encoder.
02:31:27.660 | So we will reduce the number of features, but increase the image size.
02:31:36.700 | Again, let's start with our beautiful switch sequential.
02:31:44.860 | So we have 2560 to 1280.
02:31:54.620 | Why here is 2560 even if after the bottleneck we have 1280?
02:32:02.940 | So we are talking about this part here.
02:32:06.460 | So after the input of the decoder,
02:32:11.420 | so this side here of the unit is the output of the bottleneck.
02:32:16.060 | But the bottleneck is outputting 1280 features,
02:32:20.140 | while the decoder is expecting 2560, so double the amount.
02:32:25.180 | Why? Because we need to consider that we have this skip connection here.
02:32:29.020 | So this skip connection will double the amount at each layer here.
02:32:33.020 | And this is why the input we expect here is double the size
02:32:36.540 | of what is the output of the previous layer.
02:32:39.420 | Let me write some shapes also here.
02:32:40.940 | So batch size 2560.
02:32:44.140 | The image is very small, so height divided by 64 and width divided by 64.
02:32:49.660 | And it will become 1280.
02:32:59.260 | Then we apply another switch sequential of the same size.
02:33:09.260 | Then we apply another one with an upsample,
02:33:12.060 | just like we did in the variational autoencoder.
02:33:14.140 | So if you remember in the variational autoencoder,
02:33:16.300 | to increase the size of the image we do upsampling.
02:33:19.660 | And this is what we do exactly here.
02:33:21.260 | We do upsample.
02:33:23.420 | But this is not the upsample that we did exactly the same,
02:33:28.780 | but the concept is similar.
02:33:30.620 | And we will define it later also, this one.
02:33:36.620 | So we have another residual with attention.
02:33:39.100 | So we have a residual of 2000.
02:33:42.140 | And then we have an attention block.
02:33:45.260 | 8 by 160.
02:33:52.780 | Then we have again this one.
02:33:57.020 | Then we have another one with an attention block.
02:34:03.900 | Then we have another one with upsampling.
02:34:08.220 | So we have 9020.
02:34:12.300 | And then we have an upsample.
02:34:19.740 | This one is small.
02:34:28.700 | So I know that I'm not writing all the shapes,
02:34:32.380 | but otherwise it's a really tiring job and very long.
02:34:37.180 | So just remember that we are keep increasing the size of the image,
02:34:41.100 | but we will decrease the number of features.
02:34:44.380 | Later we will see that this number here will become very small,
02:34:47.580 | and the size of the image will become nearly to the normal.
02:34:50.700 | Then we have another one with attention.
02:34:57.580 | So as you can see, we are decreasing the features here.
02:35:05.820 | Then we have 8 by 80, and we are increasing also here the size.
02:35:12.380 | Then we have another one.
02:35:15.420 | And 880.
02:35:23.180 | Then we have another one with upsampling.
02:35:26.060 | So we increase the size of the image.
02:35:28.140 | So 960 to 640.
02:35:34.540 | 8 heads with the dimensions embedding size of 80,
02:35:39.100 | and the upsampling with 640 features.
02:35:43.100 | And then we have another residual block with attention.
02:35:58.620 | Then we have another one, which is a 640, 320, 840.
02:36:09.580 | And finally, the last one, we have 640 by 320.
02:36:17.900 | And 8 and 40.
02:36:19.980 | This dimension here is the same that will be applied by the output of the unit,
02:36:28.540 | as you can see here.
02:36:29.340 | This one here.
02:36:31.180 | And then we will give it to the final layer to build the original latent size.
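Before building the remaining blocks, here is a sketch of how the forward pass could tie the three parts together, assuming self.encoders, self.bottleneck and self.decoders are the SwitchSequential lists just described; the concatenation of the saved encoder outputs is exactly the skip connection that doubles the channels the decoder expects.

```python
import torch
from torch import nn

class UNET(nn.Module):
    # __init__ builds self.encoders, self.bottleneck and self.decoders as described above.
    def forward(self, x: torch.Tensor, context: torch.Tensor, time: torch.Tensor) -> torch.Tensor:
        skip_connections = []
        for layers in self.encoders:
            x = layers(x, context, time)
            skip_connections.append(x)

        x = self.bottleneck(x, context, time)

        for layers in self.decoders:
            # Concatenate with the matching encoder output along the channel dimension.
            x = torch.cat((x, skip_connections.pop()), dim=1)
            x = layers(x, context, time)
        return x
```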
02:36:34.620 | Okay, let's build all these blocks that we didn't build before.
02:36:42.380 | So first, let's build the upsample.
02:36:45.900 | Let's build it here, which is exactly the same as the two.
02:36:54.300 | Okay.
02:36:56.300 | We have this convolution.
02:37:10.300 | Without changing the number of features.
02:37:18.300 | And this is also doesn't change the size of the image, actually.
02:37:28.700 | So we will go from batch channels or features.
02:37:36.700 | Let's call it features height width to batch size features.
02:37:46.380 | Height multiplied by 2 and width multiplied by 2.
02:37:51.660 | Because we are going to use the upsampling.
02:37:54.300 | This interpolation that we will do now.
02:37:58.380 | Interpolate x, scale factor equal to 2, mode equal to nearest, is the same operation that we did here.
02:38:08.860 | The same operation here.
02:38:10.940 | It will double the size, basically.
02:38:14.540 | And then we apply a convolution.
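A minimal sketch of this Upsample block:

```python
import torch
from torch import nn
from torch.nn import functional as F

class Upsample(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch_size, channels, h, w) -> (batch_size, channels, h * 2, w * 2)
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return self.conv(x)
```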
02:38:17.900 | Now, we have to define the final block.
02:38:26.700 | And we also have to define for the output layer.
02:38:30.380 | And we also have to define the attention block and the residual block.
02:38:33.260 | So let's build first this output layer.
02:38:36.380 | It's easier to build.
02:38:47.580 | So let's... this one also has a group normalization.
02:39:05.260 | Again, with a group size of 32.
02:39:08.220 | Also has a convolution.
02:39:12.620 | And the padding of 1.
02:39:20.780 | Okay.
02:39:25.840 | The final layer needs to convert this shape into this shape.
02:39:31.500 | So 320 features into 4.
02:39:33.740 | We have... so we have an input which is batch size of 320 features.
02:39:42.060 | The height is divided by 8.
02:39:44.060 | And the width is divided by 8.
02:39:45.820 | We first apply a group normalization.
02:39:48.940 | Then we apply the SILU.
02:39:54.860 | Then we apply the convolution.
02:40:00.060 | And then we return.
02:40:02.540 | This will basically... the convolution... let me write also why we are reducing the size.
02:40:09.660 | This convolution will change the number of channels from in to out.
02:40:13.180 | And when we will declare it, we say that we want to convert from 320 to 4 here.
02:40:18.700 | So this one will be of shape batch size 4.
02:40:23.020 | Height divided by 8.
02:40:28.060 | And width divided by 8.
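A minimal sketch of this output layer: group norm, SiLU, then the convolution that maps the 320 features back to the 4 latent channels.

```python
import torch
from torch import nn
from torch.nn import functional as F

class UNET_OutputLayer(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.groupnorm = nn.GroupNorm(32, in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch_size, 320, height / 8, width / 8) -> (batch_size, 4, height / 8, width / 8)
        x = self.groupnorm(x)
        x = F.silu(x)
        return self.conv(x)
```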
02:40:30.620 | Then we need to go build this residual block and this attention block here.
02:40:39.660 | So let's build it here.
02:40:40.860 | Let's start with the residual block,
02:40:44.060 | which is very similar to the residual block that we built for the variational autoencoder.
02:40:47.740 | So, UNET residual block.
02:41:01.580 | So this is the embedding of the time step.
02:41:16.220 | As you remember, with the time embedding, we transform into an embedding of size 1280.
02:41:30.460 | We have this group normalization.
02:41:32.460 | It's always this group norm.
02:41:33.660 | Then we have a convolution.
02:41:57.500 | And we have a linear for the time embedding.
02:42:01.020 | Then we have another group normalization.
02:42:11.020 | We will see later what is this merged.
02:42:24.860 | And another convolution.
02:42:30.780 | Oops, kernel size 3, embedding 1.
02:42:37.500 | Again, just like before, we have if the in channels is equal to the out channels,
02:42:45.260 | we can connect them directly with the residual connection.
02:42:54.220 | Otherwise, we create a convolution to connect them,
02:42:57.180 | to convert the size of the input into the output.
02:42:59.740 | Otherwise, we cannot add the two tensors.
02:43:18.540 | Zero, okay.
02:43:27.500 | So it takes in as input this feature tensor, which is actually the latent
02:43:37.500 | batch size in channels.
02:43:42.300 | Then we have height and width.
02:43:45.100 | And then also the time embedding, which is 1 by 1280, just like here.
02:43:50.460 | And we build, first of all, a residual connection.
02:43:55.180 | Then we do apply the group normalization.
02:44:01.580 | So usually the residual connection, the residual blocks are more or less always the same.
02:44:05.500 | So there is a normalization and activation function.
02:44:08.780 | Then we can have some skip connection, etc, etc.
02:44:14.620 | then we have the time.
02:44:41.420 | here we are merging the latency with the time embedding,
02:45:03.180 | but the time embedding doesn't have the batch and the channels dimension.
02:45:07.500 | So we add it here with unsqueeze.
02:45:09.500 | And we merge them.
02:45:11.900 | Then we normalize this merged connection.
02:45:15.180 | This is why it's called merged.
02:45:19.260 | We apply the activation function.
02:45:34.700 | Then we apply this convolution.
02:45:37.500 | And finally, we apply the residual connection.
02:45:39.820 | So why are we doing this?
02:45:49.900 | Well, the idea is that here we have three inputs.
02:45:52.780 | We have the time embedding, we have the latent, we have the prompt.
02:45:56.300 | We need to find a way to combine the three information together.
02:46:00.140 | So the unit needs to learn to detect the noise present in a noisified image
02:46:05.100 | at a particular time step using a particular prompt as a condition.
02:46:09.100 | Which means that the model needs to recognize this time embedding
02:46:14.860 | and needs to relate this time embedding with the latency.
02:46:17.980 | And this is exactly what we are doing in this residual block here.
02:46:21.100 | We are relating the latent with the time embedding,
02:46:24.460 | so that the output will depend on the combination of both,
02:46:29.020 | not on the single noise or in the single time step.
02:46:31.820 | And this will also be done with the context using cross-attention
02:46:35.980 | in the attention block that we will build now.
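A minimal sketch of this UNET residual block, showing how the latent and the time embedding are merged (the layer names follow the video's description; the exact variable names are assumptions):

```python
import torch
from torch import nn
from torch.nn import functional as F

class UNET_ResidualBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, n_time: int = 1280):
        super().__init__()
        self.groupnorm_feature = nn.GroupNorm(32, in_channels)
        self.conv_feature = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.linear_time = nn.Linear(n_time, out_channels)

        self.groupnorm_merged = nn.GroupNorm(32, out_channels)
        self.conv_merged = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

        # Residual connection: identity if the channels match, 1x1 convolution otherwise.
        if in_channels == out_channels:
            self.residual_layer = nn.Identity()
        else:
            self.residual_layer = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feature: torch.Tensor, time: torch.Tensor) -> torch.Tensor:
        # feature: (batch_size, in_channels, h, w), time: (1, 1280)
        residue = feature
        feature = self.groupnorm_feature(feature)
        feature = F.silu(feature)
        feature = self.conv_feature(feature)

        time = F.silu(time)
        time = self.linear_time(time)
        # The time embedding has no spatial dimensions, so broadcast it over h and w.
        merged = feature + time.unsqueeze(-1).unsqueeze(-1)

        merged = self.groupnorm_merged(merged)
        merged = F.silu(merged)
        merged = self.conv_merged(merged)
        return merged + self.residual_layer(residue)
```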
02:46:37.900 | So, unit attention block.
02:46:58.460 | 768, okay.
02:47:09.340 | Okay, I will define some layers that for now will not make much sense,
02:47:18.140 | but later they will make sense when we make the forward method.
02:47:34.220 | okay.
02:47:43.100 | So, my cat is asking for food.
02:48:00.700 | I think he already has food, but maybe he wants to eat something special today.
02:48:06.540 | So, let me finish this attention block and the unit and then I'm all his.
02:48:12.220 | Why everyone wants attention?
02:48:16.140 | Self-attention and head channels.
02:48:22.300 | Here we don't have any bias.
02:48:24.940 | As you remember, the self-attention we can have the bias for the W matrices.
02:48:30.060 | Here we don't have any bias, just like in the vanilla transformer.
02:48:33.020 | So, we have this attention.
02:48:35.660 | Then we have a layer normalization, self.layernorm 2,
02:48:39.980 | which is along the same number of features.
02:48:45.740 | Then we have another attention.
02:48:47.020 | We will see later why we need all this attention,
02:48:49.980 | but this is not a self-attention.
02:48:51.260 | It's a cross-attention and we will see later how it works.
02:49:00.300 | then we have the layer norm 3.
02:49:23.820 | this is because we are using a function that is called the
02:49:32.380 | GeGLU activation function.
02:49:34.300 | So, we need these matrices here.
02:49:47.340 | okay.
02:49:59.100 | Now we can build the forward method.
02:50:00.620 | So, our X is our latency.
02:50:03.260 | So, we have a batch size.
02:50:04.700 | We have features.
02:50:06.940 | We have height.
02:50:07.900 | We have width.
02:50:08.700 | Then we have our context, which is our prompt,
02:50:11.660 | which is a batch size, sequence length, dimension.
02:50:15.980 | The dimension is size 768, as we saw before.
02:50:19.260 | So, the first thing we will do is we will do the normalization.
02:50:25.100 | So, just like in the transformer, we will take the input,
02:50:28.540 | so our latency, and we apply the normalization and the convolution.
02:50:32.700 | Actually, in the transformer, there is no convolution,
02:50:34.700 | but only the normalization.
02:50:36.380 | So, this is called the long residual,
02:50:41.740 | because it will be applied at the end.
02:50:43.420 | Okay, so we have this here.
02:50:50.860 | We are applying the normalization,
02:50:52.140 | which doesn't change the size of the tensor.
02:50:54.460 | Then we have a convolution.
02:50:56.540 | X is equal to self.conv_input of X,
02:51:01.900 | which also doesn't change the size of the tensor.
02:51:05.580 | Then we take the shape,
02:51:08.460 | which is the batch size, the number of features, the height, and the width.
02:51:17.340 | We transpose because we want to apply cross-attention.
02:51:22.060 | First, we apply self-attention, then we apply cross-attention.
02:51:28.460 | So, we do normalization plus self-attention with skip connection.
02:51:40.220 | So, X is X dot transpose of minus one, minus two.
02:51:46.700 | So, we are going from this.
02:51:49.820 | Wait, I forgot something.
02:51:54.140 | Here, first of all, we need to do X is equal to X dot view.
02:51:58.860 | Then C, H multiplied by W.
02:52:02.780 | So, we are going from this to batch size features,
02:52:11.980 | and then H multiplied by W.
02:52:13.500 | So, this one multiplied by this.
02:52:14.940 | Then we transpose these two dimensions.
02:52:17.500 | So, now we get from here to here.
02:52:23.420 | So, the features become the last one.
02:52:26.940 | Now, we apply this normalization plus self-attention.
02:52:29.500 | So, we have a first short residual connection
02:52:35.260 | that we'll apply right after the attention.
02:52:37.180 | So, we say that X is equal to layer norm one.
02:52:42.380 | So, X.
02:52:43.020 | Then we apply the attention.
02:52:44.940 | So, self dot attention one.
02:52:46.700 | And then we apply the residual connection.
02:52:49.740 | So, X is plus equal to residual short, the first residual connection.
02:52:56.060 | Then we say that the residual short is again equal to x,
02:52:58.780 | because we are going to apply now the cross attention.
02:53:01.740 | So, now we apply the normalization plus the cross attention with skip connection.
02:53:11.100 | So, what we did here is what we do in any transformer.
02:53:20.060 | So, let me show you here what we do in any transformer.
02:53:23.020 | So, we apply some normalization.
02:53:24.460 | We calculate the attention.
02:53:26.140 | And then we combine it with a skip connection here.
02:53:28.460 | And now we will, instead of calculating a self-attention,
02:53:31.660 | we will do a cross attention, which we still didn't define.
02:53:35.020 | We will define it later.
02:53:36.460 | So, short.
02:53:39.340 | And then first we calculate, we apply the normalization.
02:53:43.020 | Then the cross attention between the latency and the prompt.
02:53:53.420 | This is cross attention.
02:53:54.860 | So, this is cross attention.
02:53:57.420 | And we will see how.
02:53:59.660 | And X plus or equal to residual short.
02:54:07.260 | Okay.
02:54:09.420 | And then again, equal to X.
02:54:12.300 | Finally, just like with the attention transformer,
02:54:16.140 | we have a feedforward layer with the GeGLU activation function.
02:54:20.780 | Okay.
02:54:26.780 | And this is actually, if you watch the original implementation of the transformer,
02:54:45.180 | of the stable diffusion, it's implemented exactly like this.
02:54:48.620 | So, basically later we do element-wise multiplication.
02:54:55.020 | So, these are special activation functions that involve a lot of parameters.
02:55:03.420 | But why we use one and not the other?
02:55:09.420 | I told you, just like before,
02:55:11.500 | they just saw that this one works better for this kind of application.
02:55:14.940 | There is no other.
02:55:16.940 | Then we apply the skip connection.
02:55:22.940 | So, we apply the cross attention.
02:55:24.620 | Then we define another one here.
02:55:26.940 | So, this one is basically normalization plus feedforward layer with GeGLU and skip connection.
02:55:38.620 | In which the skip connection is defined here.
02:55:41.660 | So, at the end, we always apply the skip connection.
02:55:44.780 | Finally, we change back to our tensor to not be a sequence of pixels anymore.
02:55:50.140 | So, we reverse the previous transposition.
02:55:54.700 | Transpose.
02:55:57.820 | So, basically, we go from batch size, height multiplied by width, and features.
02:56:13.980 | Into batch size, features, height multiplied by width.
02:56:25.260 | Then we remove this multiplication.
02:56:30.460 | So, we reverse this view back to batch size, C, H, W.
02:56:36.940 | Finally, we apply the long skip connection that we defined here at the beginning.
02:56:43.820 | So, only if the size match.
02:56:46.540 | If the sizes don't match, we apply the here.
02:56:49.580 | This one we have here.
02:56:50.620 | Return self.conv_output of x, plus the long residual.
02:56:55.420 | And this is all of our unit.
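A minimal sketch of this attention block, tying the steps above together: group norm and input convolution, then (treating the pixels as a sequence) pre-norm self-attention, pre-norm cross-attention with the prompt, a GeGLU feed-forward, and finally the output convolution plus the long residual. SelfAttention and CrossAttention are the modules in the video's attention file; the import path and their exact interfaces here are assumptions.

```python
import torch
from torch import nn
from torch.nn import functional as F
from attention import SelfAttention, CrossAttention

class UNET_AttentionBlock(nn.Module):
    def __init__(self, n_head: int, n_embd: int, d_context: int = 768):
        super().__init__()
        channels = n_head * n_embd
        self.groupnorm = nn.GroupNorm(32, channels, eps=1e-6)
        self.conv_input = nn.Conv2d(channels, channels, kernel_size=1)

        self.layernorm_1 = nn.LayerNorm(channels)
        self.attention_1 = SelfAttention(n_head, channels, in_proj_bias=False)
        self.layernorm_2 = nn.LayerNorm(channels)
        self.attention_2 = CrossAttention(n_head, channels, d_context, in_proj_bias=False)
        self.layernorm_3 = nn.LayerNorm(channels)
        self.linear_geglu_1 = nn.Linear(channels, 4 * channels * 2)
        self.linear_geglu_2 = nn.Linear(4 * channels, channels)

        self.conv_output = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, h, w), context: (batch, seq_len, 768)
        residue_long = x
        x = self.conv_input(self.groupnorm(x))

        n, c, h, w = x.shape
        x = x.view(n, c, h * w).transpose(-1, -2)        # (batch, h*w, channels)

        residue = x
        x = self.attention_1(self.layernorm_1(x)) + residue            # self-attention

        residue = x
        x = self.attention_2(self.layernorm_2(x), context) + residue   # cross-attention with the prompt

        residue = x
        x, gate = self.linear_geglu_1(self.layernorm_3(x)).chunk(2, dim=-1)
        x = x * F.gelu(gate)                                           # GeGLU
        x = self.linear_geglu_2(x) + residue

        x = x.transpose(-1, -2).reshape(n, c, h, w)
        return self.conv_output(x) + residue_long
```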
02:57:04.300 | We have defined everything, I think, except for the cross attention, which is very fast.
02:57:09.340 | So, we go to the attention that we defined before.
02:57:13.100 | And I put it in the wrong folder.
02:57:15.180 | It should be skip changes.
02:57:17.660 | Let me check if I put it correctly.
02:57:20.700 | Yeah, we only need to define this cross attention here.
02:57:28.460 | Okay, attention.
02:57:31.660 | So, let's go.
02:57:33.100 | And let's define this cross attention.
02:57:38.060 | So, class, it will be very similar to the, not very similar, actually, same as the self
02:57:44.780 | attention, except that the keys come from one side and the query and, sorry, the query
02:57:51.420 | come from one side and the key and the values from another side.
02:58:06.300 | So, this is the dimension of the embedding of the keys and the values.
02:58:15.100 | This is the one of the queries.
02:58:28.780 | This is the WQ matrix.
02:58:37.420 | In this case, we will define, instead of one big matrix made of three, WQ, WK and WV, we
02:58:42.940 | will define three different matrices.
02:58:44.540 | Both systems are fine.
02:58:46.860 | You can define it as one big matrix or three separately.
02:58:50.220 | It doesn't change anything, actually.
02:59:02.860 | So, the cross is from the keys and the values.
02:59:16.860 | Oops, linear.
02:59:32.220 | Then, we save the number of heads of this cross attention and also the dimension of
03:00:00.460 | each, how much information each head will see.
03:00:04.140 | And the head is equal to the embed divided by the number of heads.
03:00:13.660 | Let's define the forward method.
03:00:17.100 | X is our query and Y is our keys and values.
03:00:27.740 | So, we are relating X, which is our latency, which is of size batch size.
03:00:36.060 | It will have a sequence length, its own sequence length, Q, let's call it Q, and its own dimension.
03:00:43.580 | And the Y, which is the context or the prompt, which will be batch size.
03:00:53.580 | Sequence length of the key, because the prompt will become the key and the values.
03:00:58.860 | And each of them will have its own embedding size, the dimension of KV.
03:01:03.100 | We can already say that this will be a batch size of 77, because our sequence length of
03:01:09.580 | the prompt is 77 and its embedding is of size 768.
03:01:14.140 | So, let's build this one.
03:01:19.580 | This is input shape is equal to x dot shape.
03:01:23.740 | Okay, then we have the interim shape, like the same as before.
03:01:40.860 | So, this is the sequence length, then the n number of heads.
03:01:49.980 | And how much information each head will see.
03:01:52.300 | The head.
03:01:55.260 | The first thing we do is multiply queries by WQ matrix.
03:02:05.660 | So, query is equal to.
03:02:07.580 | Then we do the same for the keys and the values, but by using the other matrices.
03:02:17.900 | And as I told you before, the key and the values are the Y and not the X.
03:02:22.140 | Again, we split them into H heads, so H number of heads.
03:02:32.140 | Then we transpose.
03:02:38.860 | I will not write the shapes because they match the same transformation that we do here.
03:02:46.700 | Okay, again, we calculate the weight, which is the attention, as a query multiplied by
03:03:07.580 | the transpose of the keys.
03:03:12.300 | And then we divide it by the dimension of each head by the square root.
03:03:21.820 | Then we do the softmax.
03:03:29.420 | In this case, we don't have any causal mask, so we don't need to apply the mask like before,
03:03:36.300 | because here we are trying to relate the tokens, so the prompt with the pixels.
03:03:41.980 | So, each pixel can watch any token of the prompt, and any token can watch any pixel, basically.
03:03:49.100 | So, we don't need any mask.
03:04:00.940 | Then, to obtain the output, we multiply it by the V matrix.
03:04:04.140 | And then the output, again, is transposed, just like before.
03:04:10.140 | So, now we are doing exactly the same things that we did here.
03:04:13.900 | So, transpose, reshape, etc.
03:04:24.620 | And then return output.
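A minimal sketch of this CrossAttention module: the queries come from the latent (x), the keys and values come from the prompt (y), and there is no causal mask.

```python
import math
import torch
from torch import nn

class CrossAttention(nn.Module):
    def __init__(self, n_heads: int, d_embed: int, d_cross: int,
                 in_proj_bias: bool = True, out_proj_bias: bool = True):
        super().__init__()
        self.q_proj = nn.Linear(d_embed, d_embed, bias=in_proj_bias)
        self.k_proj = nn.Linear(d_cross, d_embed, bias=in_proj_bias)
        self.v_proj = nn.Linear(d_cross, d_embed, bias=in_proj_bias)
        self.out_proj = nn.Linear(d_embed, d_embed, bias=out_proj_bias)
        self.n_heads = n_heads
        self.d_head = d_embed // n_heads

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x (queries):     (batch, seq_len_q, d_embed), the latent pixels
        # y (keys/values): (batch, seq_len_kv, d_cross), the prompt, e.g. (batch, 77, 768)
        input_shape = x.shape
        batch_size, seq_len_q, _ = input_shape
        interim_shape = (batch_size, -1, self.n_heads, self.d_head)

        q = self.q_proj(x).view(interim_shape).transpose(1, 2)
        k = self.k_proj(y).view(interim_shape).transpose(1, 2)
        v = self.v_proj(y).view(interim_shape).transpose(1, 2)

        weight = q @ k.transpose(-1, -2) / math.sqrt(self.d_head)
        weight = weight.softmax(dim=-1)     # no causal mask here

        output = (weight @ v).transpose(1, 2).reshape(input_shape)
        return self.out_proj(output)
```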
03:04:38.780 | And this ends our building of the... let me show you.
03:04:43.020 | Now we have built all the building blocks for the stable diffusion.
03:04:49.420 | So, now we can finally combine them together.
03:04:53.420 | So, the next thing that we are going to do is to create the system that,
03:04:57.500 | taking the noise, taking the text, taking the time embedding, will run,
03:05:03.020 | for example, if we want to do text to image, will run this noise many times through the unit,
03:05:07.900 | according to a schedule.
03:05:10.540 | So, we will build the scheduler, which means that,
03:05:13.340 | because the unit is trained to predict how much noise is there,
03:05:17.820 | but we then need to remove this noise.
03:05:20.460 | So, to go from a noisy version to obtain a less noisy version,
03:05:25.180 | we need to remove the noise that is predicted by the unit.
03:05:28.140 | And this job is done by the scheduler.
03:05:30.460 | And now we will build the scheduler.
03:05:32.220 | We will build the code to load the weights of the pre-trained model.
03:05:35.980 | And then we combine all these things together.
03:05:39.580 | And we actually build what is called the pipeline.
03:05:42.060 | So, the pipeline of text to image, image to image, etc.
03:05:45.180 | And let's go.
03:05:48.300 | Now that we have built all the structure of the unit,
03:05:52.060 | or we have built the variational autoencoder, we have built a clip,
03:05:55.500 | we have built the attention blocks, etc.
03:05:58.700 | Now it's time to combine it all together.
03:06:01.980 | So, the first thing I kindly ask you to do is to actually download
03:06:05.500 | the pre-trained weights of the stable diffusion, because we need to inference it later.
03:06:09.340 | So, if you go to the repository I shared, this one, PyTorch Stable Diffusion,
03:06:14.220 | you can download the pre-trained weights of the stable diffusion 1.5
03:06:18.380 | directly from the website of Hugging Face.
03:06:20.940 | So, you download this file here, which is the EMA,
03:06:24.700 | which means Exponentially Moving Average,
03:06:27.020 | which means that it's a model that has been trained,
03:06:30.460 | but they didn't change the weights at each iteration,
03:06:32.860 | but with an Exponentially Moving Average schedule.
03:06:35.180 | So, this is good for inferencing.
03:06:37.500 | It means that the weights are more stable.
03:06:39.580 | But if you want to fine-tune later the model, you need to download this one.
03:06:43.900 | And we also need to download the files of the tokenizer,
03:06:48.380 | because, of course, we will give some prompt to the model to generate an image.
03:06:53.660 | And the prompt needs to be tokenized by a tokenizer,
03:06:56.700 | which will convert the words into tokens and the tokens into numbers.
03:07:00.620 | The numbers will then be mapped into embeddings by our clip embedding here.
03:07:05.420 | So, we need to download two files for the tokenizer.
03:07:08.780 | So, first of all, the weights of this one file here,
03:07:12.380 | then on the tokenizer folder, we find the merges.txt and the vocab.json.
03:07:17.340 | If we look at the vocab.json file, which I already downloaded,
03:07:21.420 | it's basically vocabulary.
03:07:23.420 | So, each token mapped to a number.
03:07:25.260 | That's it, just like what the tokenizer does.
03:07:27.580 | And then I also prepared the picture of a dog that I will be using for image-to-image,
03:07:32.140 | but you can use any image.
03:07:33.420 | You don't have to use the one I am using, of course.
03:07:36.700 | So, now, let's first build the pipeline.
03:07:41.180 | So, how we will inference this stable diffusion model.
03:07:44.540 | And then, while building the pipeline,
03:07:47.900 | I will also explain you how the scheduler will work.
03:07:51.420 | And we will build the scheduler later.
03:07:54.540 | I will explain all the formulas, all the mathematics behind it.
03:07:57.660 | So, let's start.
03:07:59.020 | Let's create a new file.
03:08:01.340 | Let's call it pipeline.py.
03:08:05.980 | And we import the usual stuff.
03:08:07.900 | NumPy.
03:08:12.380 | Oops, stop, stop, stop.
03:08:15.580 | NumPy as np.
03:08:16.940 | We will also use a tqdm to show the progress bar.
03:08:22.540 | And later, we will build this sampler, the DDPM sampler.
03:08:28.700 | And we will build it later.
03:08:30.220 | And I will also explain what is this sampler doing and how it works, etc, etc.
03:08:35.980 | So, first of all, let's define some constants.
03:08:38.060 | The stable diffusion can only produce images of size 512 by 512.
03:08:43.980 | So, width is 512 and height is 512.
03:08:46.620 | The latent dimension is the size of the latent tensor of the variational autoencoder.
03:08:55.420 | And as we saw before, if we go check the size,
03:09:00.300 | the encoder of the variational autoencoder will convert something that is 512 by 512
03:09:05.900 | into something that is 512 divided by 8.
03:09:09.260 | So, the latent dimension is 512 divided by 8.
03:09:12.780 | And the same goes on for the height.
03:09:16.060 | 512 divided by 8.
03:09:19.420 | We can also call it width divided by 8 and height divided by 8.
03:09:23.420 | Then, we create a function called the generator.
03:09:27.660 | This will be the main function that will allow us to do text to image and also image to image,
03:09:33.900 | which accepts a prompt, which is a string.
03:09:36.460 | An unconditional prompt.
03:09:40.300 | So, unconditional prompt.
03:09:41.820 | This is also called the negative prompt.
03:09:44.620 | If you ever used stable diffusion, for example, with the HuggingFace library,
03:09:48.380 | you will know that you can also specify a negative prompt,
03:09:51.820 | which tells that you want, for example, you want a picture of a cat,
03:09:56.620 | but you don't want the cat to be on the sofa.
03:09:59.900 | So, for example, you can put the word sofa in the negative prompt.
03:10:04.060 | So, it will try to go away from the concept of sofa when generating the image.
03:10:08.620 | Something like this.
03:10:09.260 | And this is connected with the classifier free guidance that we saw before.
03:10:13.980 | So, but don't worry, I will repeat all the concepts while we are building it.
03:10:17.020 | So, this is also a string.
03:10:18.540 | We can have an input image in case we are building an image to image.
03:10:25.580 | And then we have the strength.
03:10:27.660 | Strength, I will show you later what is it, but it's related to if we have an input image
03:10:33.340 | and how much, if we start from an image to generate another image,
03:10:37.180 | how much attention we want to pay to the initial starting image.
03:10:40.780 | And we can also have a parameter called doCFG,
03:10:46.300 | which means do classifier free guidance.
03:10:49.260 | We set it to yes.
03:10:51.020 | CFG scale, which is the weight of how much we want the model to pay attention to our prompt.
03:10:56.460 | It's a value that goes from 1 to 14.
03:10:58.700 | We start with 7.5.
03:11:00.380 | The sampler name, we will only implement one.
03:11:03.980 | So, it's called ddpm.
03:11:05.500 | How many inference steps we want to do.
03:11:09.100 | And we will do 50.
03:11:12.060 | I think it's quite common to do 50 steps, which produces actually not bad results.
03:11:17.180 | The models are the pre-trained models.
03:11:20.140 | The seed is how we want to initialize our random number generator.
03:11:23.980 | Let me put a new line, otherwise we become crazy reading this.
03:11:27.820 | Okay.
03:11:31.920 | New line.
03:11:34.300 | So, seed.
03:11:36.540 | Then we have the device where we want to create our tensor.
03:11:39.820 | We have an idle device, which means basically if we load some model on CUDA
03:11:45.020 | and then we don't need the model, we move it to the CPU.
03:11:48.060 | And then the tokenizer that we will load later.
03:11:50.300 | Tokenizer is none.
03:11:52.540 | Okay.
03:11:53.040 | This is our method.
03:11:54.860 | This is our main pipeline that, given all this information, will generate one picture.
03:11:59.340 | So, it will pay attention to the prompt.
03:12:01.340 | It will pay attention to the input image, if there is,
03:12:03.980 | according to the weights that we have specified.
03:12:06.220 | So, the strength and the CFG scale.
03:12:08.460 | I will repeat all this concept.
03:12:10.540 | Don't worry, later I will explain them actually how they work also on the code level.
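As a reference, the signature being assembled here looks roughly like the following; the default values and exact parameter names are assumptions based on the narration, not a verbatim copy of the repository:

```python
def generate(
    prompt: str,
    uncond_prompt: str = "",      # the "negative prompt"
    input_image=None,             # PIL image for image-to-image, None for text-to-image
    strength: float = 0.8,        # how much noise to add on top of the input image
    do_cfg: bool = True,          # classifier-free guidance on/off
    cfg_scale: float = 7.5,       # how much weight the prompt gets (roughly 1 to 14)
    sampler_name: str = "ddpm",
    n_inference_steps: int = 50,
    models={},                    # pre-trained CLIP / encoder / diffusion / decoder modules
    seed=None,
    device=None,
    idle_device=None,             # e.g. "cpu": where finished models are offloaded
    tokenizer=None,
):
    ...
```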
03:12:14.940 | So, let's start.
03:12:16.860 | So, the first thing we do is we disable.
03:12:21.980 | Okay.
03:12:22.480 | Gradient calculation, with torch.no_grad(), because we are inferencing the model.
03:12:28.780 | The first thing we make sure is the strength should be between 0 and 1.
03:12:34.220 | So, if...
03:12:34.860 | Then we raise an error.
03:12:45.020 | Raise value error.
03:12:47.660 | Must be between 0 and 1.
03:12:55.020 | If idle device.
03:12:59.340 | If we want to move things to the CPU, we create this lambda function.
03:13:06.140 | Otherwise.
03:13:14.140 | Okay.
03:13:17.440 | Then we create the... oops.
03:13:20.940 | I think I... okay.
03:13:33.100 | Okay.
03:13:33.600 | Then we create the random number generator that we will use.
03:13:38.060 | I think I made some mess with this.
03:13:43.260 | So, this one should be like here.
03:13:47.420 | Okay.
03:13:51.040 | And the generator is a random number generator that we will use to generate the noise.
03:14:01.260 | And if we want to start it with the seed.
03:14:04.540 | So, if seed.
03:14:05.580 | Then we generate with the random seed.
03:14:11.580 | Otherwise, we specify one manually.
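Putting these first few pieces together, the opening of the function looks roughly like this (a sketch continuing the signature above):

```python
with torch.no_grad():
    if not 0 < strength <= 1:
        raise ValueError("strength must be between 0 and 1")

    # Helper to offload models we no longer need (e.g. back to the CPU)
    if idle_device:
        to_idle = lambda x: x.to(idle_device)
    else:
        to_idle = lambda x: x

    # Random number generator used for all the noise we sample
    generator = torch.Generator(device=device)
    if seed is None:
        generator.seed()
    else:
        generator.manual_seed(seed)
```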
03:14:14.140 | Let me fix this formatting because I don't know format document.
03:14:27.260 | Okay.
03:14:28.700 | Now, at least the...
03:14:29.660 | Then we define clip.
03:14:33.900 | The clip is a model that we take from the pre-trained models.
03:14:37.900 | So, it will have the clip model inside.
03:14:41.180 | So, this model here, basically.
03:14:43.580 | This one here.
03:14:45.420 | We move it to our device.
03:14:49.100 | Okay.
03:14:58.160 | As you remember with the classifier-free guidance.
03:15:01.520 | So, let me go back to my slides.
03:15:03.280 | When we do classifier-free guidance, we inference the model twice.
03:15:11.680 | First, by specifying the condition.
03:15:15.680 | So, the prompt.
03:15:16.560 | And another time by not specifying the condition.
03:15:19.200 | So, without the prompt.
03:15:20.720 | And then we combine the output of the model linearly with a weight.
03:15:26.320 | This weight, W, is our...
03:15:28.000 | This weight here, CFG scale.
03:15:31.440 | It indicates how much we want to pay attention to the conditioned output
03:15:36.320 | with respect to the unconditioned output.
03:15:38.320 | Which also means that how much we want the model to pay attention to the condition
03:15:43.280 | that we have specified.
03:15:44.400 | What is the condition?
03:15:45.200 | The prompt.
03:15:46.000 | The textual prompt that we have written.
03:15:47.680 | And the unconditioned actually is also...
03:15:53.360 | Will use the negative prompt.
03:15:54.880 | So, the negative prompt that you use in stable diffusion.
03:15:57.920 | Which is this parameter here.
03:16:00.160 | So, unconditioned prompt.
03:16:02.000 | This is the unconditional output.
03:16:03.680 | So, we will sample the...
03:16:05.520 | We will inference from the model twice.
03:16:07.600 | One with the prompt.
03:16:09.200 | One without.
03:16:10.000 | With the...
03:16:10.480 | One with the prompt.
03:16:11.440 | One with the unconditioned prompt.
03:16:13.040 | Which is usually an empty text.
03:16:14.880 | An empty string.
03:16:16.320 | And then we combine the two by this.
03:16:19.600 | And this will tell the model by using this weight.
03:16:21.760 | We will combine the output in such a way that we can decide
03:16:24.880 | how much we want the model to pay attention to the prompt.
03:16:27.360 | So, let's do it.
03:16:29.920 | If we want to do classifier-free guidance.
03:16:32.960 | First, convert the prompt into tokens.
03:16:38.080 | Using the tokenizer.
03:16:42.320 | We didn't specify what is the tokenizer yet.
03:16:48.480 | But later we will define it.
03:16:50.480 | So, the conditional tokens.
03:16:51.920 | tokenizer.batch_encode_plus.
03:16:58.480 | We want to encode the prompt.
03:17:02.720 | We want to append the padding up to the maximum length.
03:17:06.880 | Which means that the prompt, if it's too short,
03:17:10.000 | it will fill up it with paddings.
03:17:11.680 | And the max length, as you remember, is 77.
03:17:15.280 | Because we have also defined it here.
03:17:17.520 | The sequence length is 77.
03:17:19.360 | And we take the input IDs of this tokenizer.
03:17:22.960 | Then we convert these tokens, which are input IDs, into a tensor.
03:17:30.960 | Which will be of size batch size and sequence length.
03:17:35.920 | So, conditional tokens.
03:17:38.560 | So, conditional tokens.
03:17:41.920 | And we put it in the right device.
03:17:54.160 | Now, we run it through clip.
03:17:56.880 | So, it will convert batch size sequence length.
03:17:59.840 | So, these input IDs will be converted into embeddings.
03:18:05.680 | Of size 768.
03:18:08.480 | Each vector of size 768.
03:18:10.720 | So, let's call it dim.
03:18:12.400 | And what we do is conditional context is equal to clip of conditional tokens.
03:18:22.800 | So, we are taking these tokens and we are running them through clips.
03:18:25.840 | So, this forward method here.
03:18:27.360 | Which will return batch size sequence length dimension.
03:18:31.040 | And this is exactly what I have written here.
03:18:34.080 | We do the same for the unconditioned tokens.
03:18:36.480 | So, the negative prompt.
03:18:38.080 | Which, if you don't want to specify, we will use the empty string.
03:18:41.440 | Which means the unconditional output of the model.
03:18:44.240 | So, the model, what would the model produce without any condition?
03:18:51.440 | So, if we start with random noise and we ask the model to produce an image.
03:18:54.480 | It will produce an image.
03:18:55.600 | But without any condition.
03:18:56.880 | So, the model will output anything that it wants based on the initial noise.
03:19:03.920 | we convert it into tensor.
03:19:29.600 | Then we pass it through clips.
03:19:31.360 | Just like the conditional tokens.
03:19:32.800 | So, it will become tokens.
03:19:38.820 | So, it will also become a tensor of batch size sequence length dimension.
03:19:47.600 | Where the sequence length is actually always 77.
03:19:50.560 | And also, in this case, it was always 77.
03:19:52.480 | Because it's the max length here.
03:19:55.520 | But I forgot to write the code to convert it into tokens.
03:19:59.040 | So, unconditional tokens is equal to tokenizer.batch_encode_plus.
03:20:07.680 | So, the unconditional prompt.
03:20:13.680 | So, also the negative prompt.
03:20:16.240 | The padding is the same as before.
03:20:19.120 | So, max length.
03:20:20.000 | And the max length is defined as 77.
03:20:25.280 | And we take the input IDs from here.
03:20:27.280 | So, now we have these two prompts.
03:20:29.920 | What we do is we concatenate them.
03:20:31.680 | They will become the batch of our input to the unit.
03:20:43.280 | Okay.
03:20:49.140 | So, basically what we are doing is.
03:20:51.520 | We are taking the conditional and unconditional input.
03:20:54.240 | And we are combining them into one single tensor.
03:20:56.560 | So, they will become a tensor of batch size 2.
03:21:00.640 | So, 2 sequence length and dimension.
03:21:05.200 | Where sequence length is actually.
03:21:07.760 | We can already write it.
03:21:09.040 | It will become 2 by 77 by 768.
03:21:12.800 | Because 77 is the sequence length.
03:21:14.400 | And the dimension is 768.
03:21:17.840 | If we don't want to do conditional classifier free guidance.
03:21:24.880 | We only need to use the prompt and that's it.
03:21:28.080 | So, we do only one step through the unit.
03:21:31.520 | And only with the prompt.
03:21:33.920 | Without combining the unconditional input with the conditional input.
03:21:38.240 | But in this case.
03:21:38.880 | We cannot decide how much the model pays attention to the prompt.
03:21:44.880 | Because we don't have anything to combine it with.
03:21:48.240 | So, again we take the just the prompt.
03:21:57.920 | Just like before.
03:21:58.800 | You can take it.
03:22:02.560 | Let's call it just tokens.
03:22:04.480 | And then we transform this into a tensor.
03:22:09.280 | Tensor long.
03:22:16.640 | We put it in the right device.
03:22:18.400 | We calculated the context.
03:22:22.560 | Which is a one big tensor.
03:22:24.480 | We pass it through clip.
03:22:26.720 | But this case it will be only one.
03:22:30.720 | Only one.
03:22:32.800 | So, the batch size will be one.
03:22:34.160 | So, the batch dimension.
03:22:35.440 | The sequence is again 77.
03:22:37.520 | And the dimension is 768.
03:22:39.520 | So, here we are combining two prompts.
03:22:41.360 | Here we are combining one.
03:22:43.200 | Because we will run through the model.
03:22:45.040 | Two prompts.
03:22:45.840 | One unconditioned.
03:22:47.040 | One conditioned.
03:22:48.080 | So, one with the prompt that we want.
03:22:50.160 | One with the empty string.
03:22:51.520 | And the model will produce two outputs.
03:22:53.840 | Because the model takes care of the batch size.
03:22:56.400 | That's why we have the batch size.
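A condensed sketch of the two branches just described, assuming a Hugging Face CLIP tokenizer; variable names such as cond_context are illustrative:

```python
clip = models["clip"]
clip.to(device)

if do_cfg:
    # Conditioned prompt -> (1, 77) token ids -> (1, 77, 768) embeddings
    cond_tokens = tokenizer.batch_encode_plus(
        [prompt], padding="max_length", max_length=77
    ).input_ids
    cond_tokens = torch.tensor(cond_tokens, dtype=torch.long, device=device)
    cond_context = clip(cond_tokens)

    # Unconditioned (negative) prompt, same shape
    uncond_tokens = tokenizer.batch_encode_plus(
        [uncond_prompt], padding="max_length", max_length=77
    ).input_ids
    uncond_tokens = torch.tensor(uncond_tokens, dtype=torch.long, device=device)
    uncond_context = clip(uncond_tokens)

    # (2, 77, 768): both prompts become one batch, so the UNet runs on them in a single pass
    context = torch.cat([cond_context, uncond_context])
else:
    # Only the prompt: a single (1, 77, 768) context, nothing to re-weight it against
    tokens = tokenizer.batch_encode_plus(
        [prompt], padding="max_length", max_length=77
    ).input_ids
    tokens = torch.tensor(tokens, dtype=torch.long, device=device)
    context = clip(tokens)
```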
03:22:57.840 | Since we have finished using the clip.
03:23:02.480 | We can move it to the idle device.
03:23:04.480 | This is very useful.
03:23:05.520 | Actually, if you have a very limited GPU.
03:23:08.640 | And you want to offload the models after using them.
03:23:11.760 | You can offload them back to the CPU.
03:23:13.440 | By moving them to the CPU again.
03:23:15.840 | And then we load the sampler.
03:23:21.920 | For now, we didn't define the sampler.
03:23:23.680 | But we use it and later we build it.
03:23:26.480 | Because it's better to build it after you know how it is used.
03:23:30.720 | If we build it before.
03:23:32.560 | I think it's easy to get lost in what is happening.
03:23:35.200 | What's happening actually.
03:23:36.720 | So, if the sampler name is ddpm.
03:23:40.960 | ddpm.
03:23:44.960 | Then we build the sampler.
03:23:46.960 | ddpm sampler.
03:23:49.520 | We pass it to the noise generator.
03:23:51.440 | And we tell the sampler how many steps we want to do for the inferencing.
03:23:55.920 | And I will show you later why.
03:24:00.480 | If the sampler is not ddpm.
03:24:03.920 | Then we raise an error.
03:24:04.960 | Because we didn't implement any other sampler.
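In code, offloading CLIP and choosing the sampler looks roughly like this (DDPMSampler is the class built later; the error message is illustrative):

```python
to_idle(clip)  # done with CLIP, move it to the idle device

if sampler_name == "ddpm":
    sampler = DDPMSampler(generator)
    sampler.set_inference_timesteps(n_inference_steps)
else:
    raise ValueError(f"Unknown sampler: {sampler_name}")
```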
03:24:08.720 | Why we need to tell him how many steps?
03:24:21.600 | Because as you remember.
03:24:22.800 | Let's go here.
03:24:23.920 | Here.
03:24:26.820 | This scheduler needs to do many steps.
03:24:30.240 | How many?
03:24:30.800 | We tell him exactly how many we want to do.
03:24:33.120 | In this case the denoisification steps will be 50.
03:24:36.960 | Even if during the training we have maximum 1000 steps.
03:24:41.120 | During inferencing we don't need to do 1000 steps.
03:24:43.600 | We can do less.
03:24:44.720 | Of course usually the more steps you do the better the quality.
03:24:47.520 | Because the more noise you can remove.
03:24:50.880 | But with different samplers they work in different way.
03:24:55.920 | And with ddpm usually 50 is good enough to get a nice result.
03:25:00.000 | For some other sampler.
03:25:00.880 | For example ddim you can do less steps.
03:25:03.120 | For some other samplers that work on with differential equations.
03:25:07.600 | You can do even less.
03:25:09.120 | Depends on which sampler you use.
03:25:10.640 | And how lucky you are with the particular prompt actually also.
03:25:15.920 | These are the latents that will run through the unit.
03:25:26.400 | And as you know, they are of size latents_height by latents_width.
03:25:32.480 | Which we defined before.
03:25:33.600 | So it's 512 divided by 8 by 512 divided by 8.
03:25:37.840 | So 64 by 64.
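As a one-line sketch:

```python
# 1 image in the batch, 4 latent channels, 64x64 spatial resolution
latents_shape = (1, 4, LATENTS_HEIGHT, LATENTS_WIDTH)
```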
03:25:39.360 | And now let's do.
03:25:43.280 | What happens if the user specifies an input image?
03:25:47.280 | So if we have a prompt.
03:25:49.600 | We can take care of the prompt by either running a classifier free guidance.
03:25:56.000 | Which means combining the output of the model with the prompt and without the prompt.
03:26:03.520 | According to this scale here.
03:26:05.920 | Or we can directly just ask the model to output only one image.
03:26:12.480 | Only using the prompt.
03:26:13.600 | But then we cannot combine the two output with this scale.
03:26:17.360 | What happens however if we don't want to do text to image.
03:26:20.960 | But we want to do image to image.
03:26:22.800 | If we do image to image as we saw before.
03:26:25.680 | We start with an image.
03:26:27.120 | We encode it with the encoder.
03:26:29.120 | And then we add noise to it.
03:26:30.560 | And then we ask the scheduler to remove noise.
03:26:32.720 | Noise, noise.
03:26:33.440 | But since the unit will also be conditioned by the text prompt.
03:26:38.400 | We hope that while the unit will denoise this image.
03:26:43.520 | It will move it towards this prompt.
03:26:46.960 | So this is what we will do.
03:26:48.080 | First of things we load the image.
03:26:50.560 | And we encode it.
03:26:51.760 | And we add noise to it.
03:26:55.120 | So if an input image is specified.
03:26:57.760 | We load the encoder.
03:27:00.960 | We move it to the device.
03:27:08.080 | In case we are using CUDA for example.
03:27:10.240 | Then we load the tensor of the image.
03:27:14.400 | We resize it.
03:27:18.640 | We make sure that it's 512 by 512.
03:27:21.280 | So, width by height.
03:27:27.280 | And then we transform it into a NumPy array.
03:27:31.120 | And then into a tensor.
03:27:35.040 | So what will be the size here?
03:27:56.400 | It will be height by width by channel.
03:28:00.800 | And the channel will be 3.
03:28:03.120 | The next thing we do is we rescale this image.
03:28:06.080 | What does it mean?
03:28:06.960 | That the input of this unit should be normalized between.
03:28:12.480 | Should be, sorry, rescaled between -1 and +1.
03:28:16.240 | Because if we load the image.
03:28:17.680 | It will have three channels.
03:28:19.360 | Each channel will be between 0 and 255.
03:28:23.200 | So each pixel have three channels RGB.
03:28:25.920 | And each number is between 0 and 255.
03:28:28.480 | But this is not what the unit wants as input.
03:28:31.200 | The unit wants every channel, every pixel.
03:28:33.760 | To be between -1 and +1.
03:28:35.680 | So we will do this.
03:28:37.440 | We will build it later this function.
03:28:43.440 | It's called the rescale.
03:28:44.640 | To transform anything from that is from between 0 and 255.
03:28:52.240 | Into something that is between -1 and +1.
03:28:57.680 | And this will not change the size of the tensor.
03:29:02.640 | We add the batch dimension.
03:29:04.000 | Unsqueeze.
03:29:07.840 | This adds the batch dimension.
03:29:11.600 | Batch size.
03:29:17.040 | Okay, and then we change the order of the dimensions.
03:29:24.640 | Which is 0, 3, 1, 2.
03:29:36.880 | Because as you know the encoder of the variation autoencoder.
03:29:41.600 | Wants batch size, channel, height and width.
03:29:45.520 | While we have batch size, height, width, channel.
03:29:48.480 | So we permute them.
03:29:49.520 | So to obtain the correct input for the encoder.
03:29:53.120 | We have this one.
03:29:55.680 | Go into channel.
03:29:58.960 | And height and width.
03:30:04.960 | And then this part we can delete.
03:30:07.360 | Okay, this is the input.
03:30:09.520 | Then what we do is we sample some noise.
03:30:13.840 | Because as you remember the encoder.
03:30:15.920 | To run the encoder we need some noise.
03:30:18.640 | And then he will sample from this particular Gaussian.
03:30:21.280 | That we have defined before.
03:30:22.560 | So encoder noise.
03:30:25.920 | We sample it from our generator.
03:30:28.640 | So as you we have defined this generator.
03:30:34.400 | So that we can define only one seed.
03:30:36.480 | And we can also make the output deterministic.
03:30:40.320 | If we never change the seed.
03:30:41.600 | And this is why we use the generator.
03:30:45.680 | Latent shape.
03:30:49.760 | Okay, and now let's run it through the encoder.
03:31:00.160 | Run the image through the encoder of the VAE.
03:31:03.280 | This will produce latents.
03:31:03.280 | So input image tensor.
03:31:07.120 | And then we give it some noise.
03:31:14.000 | Now we have to run it through the encoder.
03:31:15.280 | Now we are exactly here.
03:31:18.240 | We produced this.
03:31:19.520 | These are our latents.
03:31:20.640 | So we give the image to the encoder.
03:31:22.560 | Along with some noise.
03:31:23.920 | It will produce a latent representation of this image.
03:31:27.280 | Now we need to tell our...
03:31:29.920 | As you can see here.
03:31:32.080 | We need to add some noise to this latent.
03:31:34.640 | How can we add noise?
03:31:35.600 | We use our scheduler.
03:31:36.800 | The strength basically tells us.
03:31:40.960 | The strength parameter that we defined here.
03:31:43.760 | Tells us how much we want the model.
03:31:46.800 | To pay attention to the input image.
03:31:49.120 | When generating the output image.
03:31:50.960 | The more the strength.
03:31:53.360 | The more the noise we add.
03:31:55.360 | So the more the strength.
03:31:57.600 | The stronger the noise.
03:31:59.200 | So the model will be more creative.
03:32:02.480 | Because the model will have more noise to remove.
03:32:05.440 | And can create a different image.
03:32:08.000 | But if we add less noise to this initial image.
03:32:11.200 | The model cannot be very creative.
03:32:13.120 | Because most of the image is already defined.
03:32:15.920 | So there is not much noise to remove.
03:32:17.680 | So we expect that the output will resemble more or less the input.
03:32:21.840 | So this strength here basically means.
03:32:26.240 | The more noise.
03:32:27.440 | How much noise to add.
03:32:28.960 | The more noise we add.
03:32:30.480 | The less the output will resemble the input.
03:32:33.120 | The less noise we add.
03:32:34.400 | The more the output will resemble the input.
03:32:37.520 | Because the scheduler, the unit sorry.
03:32:40.880 | Has less possibility of changing the image.
03:32:44.640 | Because there is less noise.
03:32:45.600 | So let's do it.
03:32:49.680 | First we tell the sampler.
03:32:52.080 | What is the strength that we have defined.
03:32:54.320 | And later we will see what is this method doing.
03:32:57.200 | But for now we just write it.
03:32:59.760 | And then we ask the sampler.
03:33:01.040 | To add noise to our latents here.
03:33:03.200 | According to the strength that we have defined.
03:33:06.800 | Add noise.
03:33:13.280 | Basically the sampler will create.
03:33:26.240 | By setting the strength.
03:33:27.280 | Will create a time step schedule.
03:33:29.440 | Later we will see it.
03:33:30.800 | And by defining this time step schedule.
03:33:33.600 | It will.
03:33:34.320 | We will start.
03:33:35.200 | What is the initial noise level we will start with.
03:33:37.920 | Because if we set the noise level to be.
03:33:39.680 | For example the strength to be one.
03:33:41.280 | We will start with the maximum noise level.
03:33:43.440 | But if we set the strength to be 0.5.
03:33:46.480 | We will start with half noise.
03:33:48.800 | Not all completely noise.
03:33:50.240 | And later this will be more clear.
03:33:53.600 | When we actually build the sampler.
03:33:55.360 | So now just remember that.
03:33:56.800 | We are exactly here.
03:33:57.920 | So we have the image.
03:33:59.120 | We transform.
03:34:00.080 | We compress it with the encoder.
03:34:02.240 | Became a latent.
03:34:03.120 | We added some noise to it.
03:34:04.640 | According to the strength level.
03:34:06.080 | And then we need to pass it to the model.
03:34:09.680 | To the diffusion model.
03:34:10.720 | So now we don't need the encoder anymore.
03:34:12.400 | We can set it to the idle device.
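A sketch of the whole image-to-image branch just described; rescale is defined later in the video, and set_strength / add_noise are methods of the sampler built afterwards:

```python
if input_image:
    encoder = models["encoder"]
    encoder.to(device)

    # (Height, Width, Channel) array with values in [0, 255]
    input_image_tensor = input_image.resize((WIDTH, HEIGHT))
    input_image_tensor = np.array(input_image_tensor)
    input_image_tensor = torch.tensor(input_image_tensor, dtype=torch.float32, device=device)

    input_image_tensor = rescale(input_image_tensor, (0, 255), (-1, 1))
    input_image_tensor = input_image_tensor.unsqueeze(0)         # add the batch dimension
    input_image_tensor = input_image_tensor.permute(0, 3, 1, 2)  # (B, C, H, W) for the encoder

    # The VAE encoder samples from its latent distribution using this noise
    encoder_noise = torch.randn(latents_shape, generator=generator, device=device)
    latents = encoder(input_image_tensor, encoder_noise)

    # More strength -> more noise added -> the output strays further from the input
    sampler.set_strength(strength=strength)
    latents = sampler.add_noise(latents, sampler.timesteps[0])

    to_idle(encoder)
```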
03:34:15.440 | If the user didn't specify any image.
03:34:19.760 | Then how can we start the denoising?
03:34:22.480 | It means that we want to do text to image.
03:34:24.880 | So we start with random noise.
03:34:28.800 | Let's sample some random noise then.
03:34:31.520 | Generator and device is device.
03:34:42.400 | So let me write some comments.
03:34:45.440 | If we are doing text to image.
03:34:49.680 | Start with random noise.
03:34:52.160 | Random noise defined as N(0, I).
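And the text-to-image case, continuing the sketch:

```python
else:
    # Text-to-image: start from pure Gaussian noise N(0, I) in latent space
    latents = torch.randn(latents_shape, generator=generator, device=device)
```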
03:34:58.880 | And we then finally load the diffusion model.
03:35:05.920 | Which is our unit.
03:35:06.880 | Diffusion.
03:35:09.200 | It's models.
03:35:10.080 | Diffusion.
03:35:12.160 | Later we see what is this model and how to load it.
03:35:14.640 | We take it to our device where we are working.
03:35:18.640 | So for example CUDA.
03:35:19.840 | And then our sampler will define some time steps.
03:35:24.880 | Time steps basically means that.
03:35:27.840 | As you remember to train the model we have maximum of 1000 time steps.
03:35:31.680 | But when we inference we don't need to do 1001 steps.
03:35:34.640 | In our case we will be doing for example 50 steps of inferencing.
03:35:38.400 | If the maximum time step is 1000.
03:35:41.840 | For example if the maximum level is 1000.
03:35:46.400 | The minimum level will be 1.
03:35:48.720 | Or if the maximum level is 999.
03:35:50.720 | The minimum will be 0.
03:35:52.320 | And this is a linear time steps.
03:35:54.960 | If we do only 50 it means that we need to do.
03:35:57.760 | For example we start with 1000.
03:35:59.920 | And then we do every 20.
03:36:01.760 | So 980.
03:36:03.280 | Then 960.
03:36:07.680 | Then 940.
03:36:15.840 | Then 920, etc., etc.
03:36:17.680 | Until we arrive at the 0th level.
03:36:19.280 | Basically each of these time steps indicates a noise level.
03:36:24.240 | So when we denoise the image.
03:36:29.520 | Or the initial noise in case we are doing the text to image.
03:36:32.400 | We can tell the scheduler to remove noise.
03:36:36.000 | According to particular time steps.
03:36:38.480 | Which are defined by how many inference steps we want.
03:36:41.440 | And this is exactly what we are going to do now.
03:36:44.320 | When we initialize the sampler.
03:36:47.200 | We tell him how many steps we want to do.
03:36:49.520 | And he will create this time step schedule.
03:36:52.720 | So according to how many we want.
03:36:54.640 | And now we just go through it.
03:36:56.640 | So we tell the time steps.
03:36:58.480 | We create tqdm which is a progress bar.
03:37:02.960 | We take the time steps.
03:37:04.080 | And for each of these time steps we denoise the image.
03:37:10.000 | So we have 1300.
03:37:18.560 | This is our...
03:37:19.680 | We need to tell the unit as you remember diffusion.
03:37:23.200 | The unit has as input the time embedding.
03:37:26.560 | So what is the time step we want to denoise.
03:37:29.520 | The context which is the prompt.
03:37:32.000 | Or in case we are doing a classifier free guidance.
03:37:35.040 | Also the unconditional prompt.
03:37:36.560 | And the latent.
03:37:38.400 | The current state of the latent.
03:37:40.400 | Because we will start with some latent.
03:37:42.320 | And then keep denoising it.
03:37:43.600 | And keep denoising it.
03:37:44.800 | Keep denoising it according to the time embedding.
03:37:47.360 | To the time step.
03:37:48.160 | So we calculate first the time embedding.
03:37:51.280 | Which is an embedding of the current time step.
03:37:53.680 | And we will obtain it from this function.
03:37:58.160 | Later we define it.
03:38:02.240 | This function basically will convert a number.
03:38:04.480 | So the time step into a vector.
03:38:06.400 | One of size 320.
03:38:10.000 | That describes this particular time step.
03:38:12.320 | And as you will see later.
03:38:15.200 | It's basically just equal to the positional encoding.
03:38:18.000 | That we did for the transformer model.
03:38:20.400 | So in the transformer model.
03:38:21.520 | We use the sines and the cosines.
03:38:22.960 | To define the position.
03:38:24.400 | Here we use the sines and cosines.
03:38:25.760 | To define the time step.
03:38:26.800 | And let's build the model input.
03:38:30.480 | Which is the latents.
03:38:31.440 | Which is of shape batch size, 4.
03:38:36.880 | Because it's the output of the encoder.
03:38:40.800 | Of the variational autoencoder.
03:38:42.080 | Which has four channels.
03:38:48.160 | And then has latents height.
03:38:50.320 | And the latents width.
03:38:54.240 | Which is 64 by 64.
03:38:56.000 | Now if we do this one.
03:39:01.760 | We need to send.
03:39:04.080 | Basically we are sending the conditioned.
03:39:06.640 | Where is it?
03:39:08.240 | Here.
03:39:09.380 | We send the conditional input.
03:39:12.480 | But also the unconditional input.
03:39:14.080 | If we do the classifier free guidance.
03:39:16.000 | Which means that we need to send.
03:39:17.760 | The same latent with the prompt.
03:39:20.320 | And without the prompt.
03:39:21.840 | And so what we can do is.
03:39:23.680 | We can repeat this latent twice.
03:39:25.760 | If we are doing the classifier free guidance.
03:39:27.600 | It will become model input.
03:39:32.080 | By repeat.
03:39:33.760 | On one.
03:39:37.120 | This will basically transform.
03:39:38.640 | Batch size 4.
03:39:43.120 | So this is going to be twice.
03:39:57.120 | The size of the initial batch size.
03:39:58.960 | Which is one actually.
03:40:00.000 | And four channels.
03:40:02.720 | And latents height and latents width.
03:40:04.960 | So basically we are repeating this dimension twice.
03:40:07.280 | We are making two copies of the latents.
03:40:10.720 | One will be used with the prompt.
03:40:12.320 | One without the prompt.
03:40:13.520 | So now we do.
03:40:16.160 | We check the model output.
03:40:18.000 | What is the model output?
03:40:19.120 | It is the predicted noise by the unit.
03:40:22.000 | So the model output is.
03:40:24.080 | The predicted noise by the unit.
03:40:28.400 | We do diffusion.
03:40:33.120 | Model input.
03:40:38.720 | Context and time embedding.
03:40:40.720 | And if we do classifier free guidance.
03:40:45.920 | We need to combine the conditional output.
03:40:49.200 | And the unconditional output.
03:40:50.720 | Because we are passing the input of the model.
03:40:54.640 | If we are doing classifier free guidance.
03:40:56.080 | We are giving a batch size of two.
03:40:57.760 | The model will produce an output.
03:40:59.440 | That has batch size of two.
03:41:01.040 | So we can then split it into two different tensor.
03:41:06.160 | One will be the conditional.
03:41:08.000 | And one will be the unconditional.
03:41:10.000 | So the output conditional.
03:41:12.400 | And the output unconditional.
03:41:14.160 | Are split in this way.
03:41:16.560 | Using chunk.
03:41:17.360 | The dimension is along the 0th dimension.
03:41:22.480 | So by default it's the 0th dimension.
03:41:24.400 | And then we combine them according to this formula here.
03:41:29.600 | Where is the.
03:41:30.960 | I missed out.
03:41:34.000 | According to.
03:41:37.600 | This formula here.
03:41:38.800 | So unconditional output.
03:41:40.720 | Minus the sorry.
03:41:41.840 | The conditioned output.
03:41:42.880 | Minus the unconditioned output.
03:41:44.400 | Multiplied by the scale that we defined.
03:41:46.960 | Plus the unconditioned output.
03:41:49.600 | So the model output.
03:41:51.920 | Will be conditioned scale.
03:41:55.280 | Multiplied by the output.
03:41:58.160 | Conditioned minus the output.
03:42:00.880 | Unconditioned plus the output.
03:42:03.360 | Unconditioned.
03:42:06.000 | And then what we do is basically.
03:42:10.000 | Okay, now comes, let's say, the key part.
03:42:13.520 | So we have a model.
03:42:16.000 | That is able to predict the noise in the current latents.
03:42:20.560 | So we start for example.
03:42:21.680 | Imagine we are doing text to image.
03:42:23.600 | So let me go back here.
03:42:25.760 | We are going text to images.
03:42:32.080 | Here.
03:42:35.120 | So we start with some random noise.
03:42:36.960 | And we transform it into latents.
03:42:40.240 | Then according to some scheduler.
03:42:43.200 | According to some time step.
03:42:44.800 | We keep denoising it.
03:42:46.240 | Now our unit.
03:42:47.200 | Will predict the noise in the latency.
03:42:52.640 | But how can we remove this noise.
03:42:56.480 | From the image to obtain a less noisy image.
03:42:59.440 | This is done by the scheduler.
03:43:02.720 | So at each step we ask the unit.
03:43:05.280 | How much noise is in the image.
03:43:07.360 | We remove it.
03:43:08.320 | And then we give it again to the unit.
03:43:10.160 | And ask how much noise is there.
03:43:11.840 | And we remove it.
03:43:12.720 | And then ask again how much noise is there.
03:43:14.800 | And then we remove it.
03:43:15.680 | And then how much noise is there.
03:43:17.040 | And then we remove it.
03:43:18.000 | Until we finish all these time steps.
03:43:20.320 | After we have finished these time steps.
03:43:23.280 | We take the latent.
03:43:25.200 | Give it to the decoder.
03:43:26.320 | Which will build our image.
03:43:28.080 | And this is exactly what we are doing here.
03:43:30.320 | So imagine we don't have any input image.
03:43:32.240 | So we have some random noise.
03:43:34.000 | We define some time steps on this sampler.
03:43:37.520 | Based on how many inference steps we want to do.
03:43:40.800 | We do all this time step.
03:43:43.200 | We give the latents to the unit.
03:43:46.240 | The unit will tell us how much is the predicted noise.
03:43:48.960 | But then we need to remove this noise.
03:43:50.960 | So let's do it.
03:43:51.920 | So let's remove this noise.
03:43:58.320 | So the latents are equal to sampler.step.
03:44:03.040 | Time step, latents, model output.
03:44:03.040 | This basically means take the image from a more noisy version.
03:44:11.120 | Okay, let me write it better.
03:44:14.000 | Remove noise predicted by the unit.
03:44:18.240 | Unit, okay.
03:44:21.520 | And this is our loop of denoising.
03:44:26.400 | Then we can do to idle, diffusion.
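A sketch of the denoising loop just described; get_time_embedding is defined later, and sampler.step is the method of the DDPM sampler that removes the predicted noise:

```python
diffusion = models["diffusion"]
diffusion.to(device)

for timestep in tqdm(sampler.timesteps):
    # (1, 320) vector encoding the current timestep (sines and cosines, like positional encodings)
    time_embedding = get_time_embedding(timestep).to(device)

    model_input = latents                             # (1, 4, 64, 64)
    if do_cfg:
        model_input = model_input.repeat(2, 1, 1, 1)  # one copy per prompt: conditioned / unconditioned

    # The UNet predicts the noise present in the latents at this timestep
    model_output = diffusion(model_input, context, time_embedding)

    if do_cfg:
        output_cond, output_uncond = model_output.chunk(2)
        model_output = cfg_scale * (output_cond - output_uncond) + output_uncond

    # One step from a more noisy latent to a less noisy one
    latents = sampler.step(timestep, latents, model_output)

to_idle(diffusion)
```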
03:44:29.040 | Now we have our denoised image.
03:44:33.760 | Because we have done it for many steps.
03:44:35.520 | Now what we do is we load the decoder.
03:44:39.600 | Which is models decoder.
03:44:42.960 | And then our image is run through the decoder.
03:44:55.760 | So we run the latents through the decoder.
03:44:57.920 | So we do this step here.
03:44:59.120 | So we run these latents through the decoder.
03:45:01.040 | This will give the image.
03:45:02.080 | It actually will be only one image.
03:45:04.880 | Because we only specify one image.
03:45:06.640 | Then we do images is equal to.
03:45:13.360 | Because the image was initially, as you remember here.
03:45:16.960 | It was rescaled.
03:45:18.000 | So from 0 to 255 in a new scale.
03:45:23.520 | That is between -1 and +1.
03:45:25.840 | Now we do the opposite step.
03:45:27.120 | So rescale again.
03:45:29.840 | From -1 to 1.
03:45:32.800 | Into 0 to 255.
03:45:36.320 | With clamp equal true.
03:45:37.920 | Later we will see this function.
03:45:40.800 | It's very easy.
03:45:41.440 | It's just a rescaling function.
03:45:42.880 | We permute.
03:45:45.120 | Because to save the image on the CPU.
03:45:47.600 | We want the channel dimension to be the last one.
03:45:49.840 | Permute.
03:45:56.320 | So, 0, 2, 3, 1.
03:45:56.320 | So this one basically will take the batch size.
03:45:59.040 | Channel height width.
03:46:04.320 | Into batch size.
03:46:09.680 | Height width channel.
03:46:15.920 | And then we move the image to the CPU.
03:46:21.840 | And then we need to convert it into a NumPy array.
03:46:34.400 | And then we return the image.
03:46:35.680 | Voila!
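The final decoding and post-processing, sketched:

```python
decoder = models["decoder"]
decoder.to(device)

images = decoder(latents)                         # (1, 3, 512, 512), values in [-1, 1]
to_idle(decoder)

images = rescale(images, (-1, 1), (0, 255), clamp=True)
images = images.permute(0, 2, 3, 1)               # (B, H, W, C) so the image can be saved easily
images = images.to("cpu", torch.uint8).numpy()
return images[0]
```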
03:46:38.320 | Let's build this rescale method.
03:46:40.080 | So what is the old scale?
03:46:47.760 | Old range.
03:46:48.320 | What is the new range?
03:46:51.920 | And the clamp.
03:46:52.640 | So let's define the old minimum.
03:46:58.640 | Old maximum is the old range.
03:47:00.880 | New minimum and new maximum.
03:47:04.160 | New range.
03:47:07.680 | Minus equal to old min.
03:47:14.720 | x multiply equal to new max minus new min.
03:47:20.800 | Divided by old max minus old min.
03:47:28.000 | x plus equal to new min.
03:47:31.520 | We are just rescaling.
03:47:33.600 | So convert something that is within this range into this range.
03:47:37.040 | And if it's clamp.
03:47:42.240 | Then x is equal to x.clamp.
03:47:44.240 | New min.
03:47:46.880 | New max.
03:47:49.280 | And then we return x.
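Put together, the rescale helper looks like this:

```python
def rescale(x, old_range, new_range, clamp=False):
    old_min, old_max = old_range
    new_min, new_max = new_range
    # Map [old_min, old_max] linearly onto [new_min, new_max]
    x -= old_min
    x *= (new_max - new_min) / (old_max - old_min)
    x += new_min
    if clamp:
        x = x.clamp(new_min, new_max)
    return x
```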
03:47:51.680 | Then we have the time embedding.
03:47:55.520 | The method that we didn't define here.
03:47:58.320 | This getTimeEmbedding.
03:47:59.760 | This means basically take the time step which is a number.
03:48:02.640 | So which is an integer.
03:48:04.640 | And convert it into a vector of size 320.
03:48:08.320 | And this will be done exactly using the same system that we use for the transformer.
03:48:12.320 | For the positional embeddings.
03:48:13.520 | So we first define the frequencies of our cosines and the sines.
03:48:23.440 | Exactly using the same formula of the transformer.
03:48:26.240 | So if you remember the formula is equal to the 10,000.
03:48:30.560 | 1 over 10,000 to the power of something.
03:48:33.840 | Of i.
03:48:34.240 | If I remember correctly.
03:48:37.840 | So it's power of 10,000 and minus torch.range.
03:48:45.040 | So I am referring to this formula just in case you forgot.
03:48:48.000 | Let me find it using the slides.
03:48:51.440 | I am talking about this formula here.
03:48:56.480 | So the formula that defines the positional encodings here.
03:48:59.920 | Here we just use a different dimension of the embedding.
03:49:02.960 | This one will produce something that is 160 numbers.
03:49:18.480 | And this one will produce something that is 200 numbers.
03:49:25.760 | And 160 numbers.
03:49:27.280 | Then we multiply it.
03:49:31.680 | We multiply it with the time step.
03:49:35.840 | So we create a shape of size 1.
03:49:39.040 | So x is equal to torch dot tensor.
03:49:44.480 | Which is a single time step.
03:49:46.640 | Of t type.
03:49:49.200 | Take everything.
03:49:57.520 | We add one dimension.
03:49:59.520 | So we add one dimension here.
03:50:02.160 | This is like doing an unsqueeze.
03:50:03.920 | Multiply by the frequencies.
03:50:06.720 | And then we multiply this by the sines and the cosine.
03:50:13.920 | Just like we did in the original transformer.
03:50:16.880 | This one will return a tensor of size 100 by 62.
03:50:20.960 | So which is 320.
03:50:22.480 | Because we are concatenating two tensors.
03:50:36.480 | Not cosine, but sine of x.
03:50:38.080 | And then I concatenated along the dimension.
03:50:42.000 | The last dimension.
03:50:47.040 | And this is our time embedding.
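The whole helper, following the transformer-style sinusoidal formula with an embedding size of 320:

```python
import torch

def get_time_embedding(timestep):
    # (160,) frequencies: 1 / 10000^(i/160) for i = 0..159
    freqs = torch.pow(10000, -torch.arange(start=0, end=160, dtype=torch.float32) / 160)
    # (1, 160): the scalar timestep scaled by each frequency
    x = torch.tensor([timestep], dtype=torch.float32)[:, None] * freqs[None]
    # (1, 320): cosines and sines concatenated along the last dimension
    return torch.cat([torch.cos(x), torch.sin(x)], dim=-1)
```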
03:50:49.680 | So now let's review what we have built here.
03:50:52.000 | We built basically a system.
03:50:55.360 | A method that takes the prompt.
03:50:58.000 | The unconditional prompt.
03:50:59.280 | Also called the negative prompt.
03:51:01.360 | The prompt or empty string.
03:51:04.640 | Because if we don't want to use any negative prompt.
03:51:07.840 | The input image.
03:51:08.800 | So what is the image we want to start from.
03:51:10.960 | In case we want to do an image to image.
03:51:13.280 | The strength is how much attention we want to pay to this input image.
03:51:17.600 | When we denoise the image.
03:51:18.800 | Or how much noise we want to add to it basically.
03:51:22.160 | And the more noise we add.
03:51:24.800 | The less the output will resemble the input image.
03:51:27.760 | If we want to do classifier free guidance.
03:51:31.120 | Which means that we want the model to produce two outputs.
03:51:34.640 | One is the output with the prompt.
03:51:36.960 | And one without the prompt.
03:51:38.320 | And then we can adjust how much we want to pay attention to the prompt.
03:51:42.880 | According to this scale.
03:51:44.160 | And then we defined the scheduler.
03:51:48.240 | Which is only one.
03:51:49.440 | The DDPM.
03:51:49.440 | And we will define it now.
03:51:50.560 | And how many steps we want to do.
03:51:52.880 | The first thing we do is we create a generator.
03:51:55.280 | Which is just a random number generator.
03:51:57.120 | Then the second thing we do is.
03:51:59.680 | If we want to do classifier free guidance.
03:52:01.520 | As we need to do the.
03:52:02.960 | Basically we need to go through the units twice.
03:52:05.600 | One with the prompt.
03:52:06.560 | One without the prompt.
03:52:07.760 | The thing we do is that actually.
03:52:09.040 | We create a batch size of two.
03:52:11.120 | One with the prompt.
03:52:12.400 | And one without the prompt.
03:52:13.840 | Or using the unconditioned prompt.
03:52:15.760 | Or the negative prompt.
03:52:16.800 | In case we don't do the classifier free guidance.
03:52:20.880 | We only build one tensor.
03:52:22.480 | That only includes the prompt.
03:52:23.840 | The second thing we do is we load.
03:52:26.960 | If there is an input image.
03:52:28.240 | We load it.
03:52:29.200 | So instead of starting from random noise.
03:52:31.120 | We start from an image.
03:52:32.400 | Which is to which we add the noise.
03:52:34.560 | According to the strength we have defined.
03:52:37.200 | Then for the number of steps.
03:52:39.120 | Defined by the sampler.
03:52:40.400 | Which are actually defined.
03:52:41.520 | By the number of inference steps.
03:52:42.880 | We have defined here.
03:52:43.840 | We do a loop.
03:52:45.360 | A for loop.
03:52:46.000 | That for each for loop.
03:52:48.240 | Let me go here.
03:52:50.720 | The unit will predict some noise.
03:52:53.360 | And the scheduler will remove this noise.
03:52:56.240 | And give a new latent.
03:52:57.760 | Then this new latent is fed again to the unit.
03:53:00.080 | Which will predict some noise.
03:53:01.520 | And we remove this noise.
03:53:02.880 | According to the scheduler.
03:53:04.320 | Then we again predict some noise.
03:53:06.080 | And we remove some noise.
03:53:07.680 | The only thing we need to understand.
03:53:09.280 | Is how we remove the noise from the image now.
03:53:11.920 | Because we know that the unit is trained.
03:53:14.480 | To predict the noise.
03:53:15.440 | But how do we actually remove it?
03:53:16.960 | And this is the job of the scheduler.
03:53:20.320 | So now we need to go build this scheduler here.
03:53:23.200 | So let's go build it.
03:53:24.960 | Let's start building our ddpm scheduler.
03:53:28.400 | So ddpm.py
03:53:30.240 | Oops I forgot to put it inside the folder.
03:53:35.680 | And let me review one thing.
03:53:37.920 | Yeah.
03:53:39.860 | This is wrong.
03:53:43.280 | Okay.
03:53:43.780 | So import torch.
03:53:46.800 | Import numpy as np.
03:53:49.840 | And let's create the class ddpm sampler.
03:53:55.120 | Okay I didn't call it scheduler.
03:53:56.880 | Because I don't want you to be confused with the beta schedule.
03:54:00.320 | Which we will define later.
03:54:01.760 | So I call it scheduler here.
03:54:05.520 | Oops why I open this one.
03:54:06.960 | I call it scheduler here.
03:54:09.600 | But actually I mean the sampler.
03:54:12.240 | Because there is the beta schedule that we will define now.
03:54:15.360 | What is the beta schedule?
03:54:16.400 | Which indicates the amount of noise at each time step.
03:54:19.840 | And then there is what is known as the scheduler or the sampler.
03:54:23.040 | From now on I will refer it to as sampler.
03:54:26.400 | So this scheduler here actually means a sampler.
03:54:29.200 | I'm sorry for the confusion.
03:54:30.480 | I will update the slides when the video is out.
03:54:33.520 | So how much were the training steps?
03:54:36.320 | Which is 1000.
03:54:39.520 | The beta is okay now I define two constants.
03:54:44.560 | And later I define them.
03:54:46.560 | Where what are they and where they come from?
03:54:48.800 | Beta start, which is 0.00085, and beta end.
03:54:53.920 | And I define the sampler.
03:54:55.920 | So this is the sampler.
03:54:57.600 | And this is the scheduler.
03:54:58.720 | And this is the sampler.
03:55:00.080 | And this is the scheduler.
03:55:01.120 | And beta end is 0.0120.
03:55:06.640 | Okay.
03:55:09.060 | The parameter beta start and beta end.
03:55:11.840 | Basically if you go to the paper.
03:55:13.200 | If you look at the forward process.
03:55:16.160 | We can see that the forward process is the process that makes the image more noisy.
03:55:22.880 | We add noise to the image.
03:55:24.800 | So given an image that don't have less noise.
03:55:28.640 | How to get a more noisy image?
03:55:31.120 | According to this Gaussian distribution.
03:55:33.840 | Which is actually a chain of Gaussian distribution.
03:55:36.720 | Which is called a Markov chain of Gaussian distribution.
03:55:39.520 | And the noise that we add varies according to a variance schedule.
03:55:46.720 | Beta 1, beta 2, beta 3, beta 4, beta t.
03:55:49.840 | So beta basically it's a series of numbers.
03:55:52.720 | That indicates the variance of the noise that we add with each of these steps.
03:55:59.040 | And as in the latent in the stable diffusion.
03:56:03.440 | They use a beta start.
03:56:09.120 | So the first value of beta is 0.00085.
03:56:09.120 | And the last variance.
03:56:10.560 | So this the beta that will turn the image into complete noise.
03:56:14.960 | Is equal to 0.0120.
03:56:17.680 | It's a choice made by the authors.
03:56:19.920 | And we will use a linear schedule.
03:56:25.280 | Actually there are other schedules.
03:56:26.880 | Which are for example the cosine schedule etc.
03:56:29.200 | But we will be using the linear one.
03:56:30.880 | And we need to define this beta schedule.
03:56:35.360 | Which is actually 1000 numbers between beta start and beta end.
03:56:39.600 | So let's do it.
03:56:42.320 | So this is defined using the linear space.
03:56:48.240 | Where the starting number is beta start.
03:56:51.440 | Actually to the square root of beta start.
03:56:55.200 | So square root of beta start.
03:56:57.680 | Because this is how they define it in the stable diffusion.
03:57:01.600 | If you check the official repository.
03:57:03.600 | They will also have these numbers.
03:57:04.960 | And define in exactly the same way.
03:57:06.640 | 0.5 then the number of training steps.
03:57:11.920 | So in how many pieces we want to divide this linear space.
03:57:16.000 | Beta end.
03:57:21.280 | And then the type is torch dot float 32 I think.
03:57:26.720 | And then to the power of 2.
03:57:28.560 | Because they divide it into 1000.
03:57:32.320 | Then to the power of 2.
03:57:33.360 | This is in the diffusers libraries from Hugging Face.
03:57:38.480 | I think this is called the scaled linear schedule.
03:57:40.960 | Now we need to define other constants.
03:57:44.480 | That are needed for our forward and our backward process.
03:57:47.600 | So our forward process depends on this beta schedule.
03:57:50.720 | But actually this is only for the single step.
03:57:53.120 | So if we want to go from for example the original image.
03:57:56.320 | By one step forward of more noise.
03:57:58.800 | We need to apply this formula here.
03:58:00.480 | But there is a closed formula here.
03:58:03.920 | Called this one here.
03:58:05.920 | That allows you to go from the original image.
03:58:08.320 | To any noisified version of the image.
03:58:10.800 | At any time step.
03:58:12.240 | Between 0 and 1000.
03:58:13.920 | Using this one here.
03:58:15.920 | Which depends on alpha bar.
03:58:18.000 | That you can see here.
03:58:18.880 | So the square root of this alpha bar.
03:58:21.040 | And the variance also depends on this alpha bar.
03:58:23.520 | What is alpha bar?
03:58:24.720 | Alpha bar is the product of alpha.
03:58:27.600 | Going from 1 up to t.
03:58:29.920 | So if we are for example.
03:58:31.120 | We want to go from the time step 0.
03:58:33.440 | Which is the image without any noise.
03:58:35.440 | To the time step 10.
03:58:37.040 | Which is the image with some noise.
03:58:38.800 | And remember that time step 1000.
03:58:41.840 | Means that it's only noise.
03:58:43.680 | So we want to go to time step 10.
03:58:46.240 | Which means that we need to calculate.
03:58:48.160 | These alphas: alpha 1, alpha 2, alpha 3, and so on, up until alpha 10.
03:58:54.720 | And we multiply them together.
03:58:56.720 | This is the productory.
03:58:58.240 | And this A.
03:58:59.440 | What is this alpha?
03:59:00.880 | This alpha actually is 1 minus beta.
03:59:03.200 | So let's calculate this alphas first.
03:59:05.360 | So alpha is actually 1 minus beta.
03:59:07.520 | Beta self dot betas.
03:59:12.160 | So it becomes floating.
03:59:16.080 | And then we need to calculate.
03:59:18.080 | The product of this alphas.
03:59:19.760 | From 1 to t.
03:59:21.120 | And this is easily done with the PyTorch.
03:59:23.360 | We pre-compute them basically.
03:59:26.640 | This is also comprod self dot alphas.
03:59:33.440 | This will create basically an array.
03:59:37.840 | Where the first element is the first alpha.
03:59:40.080 | So alpha for example 0.
03:59:42.480 | The second element is alpha 0 multiplied by alpha 1.
03:59:47.360 | The third element is alpha 0.
03:59:49.840 | Multiplied by alpha 1.
03:59:52.240 | Multiplied by alpha 2 etc.
03:59:54.320 | So it's a cumulative product.
03:59:55.840 | It's we say.
03:59:56.480 | Then we create one tensor.
03:59:59.760 | That represents the number 1.
04:00:01.040 | And later we will use it.
04:00:02.080 | Tensor 1.0.
04:00:07.280 | Okay we save the generator.
04:00:12.000 | We save the number of training steps.
04:00:13.760 | And then we create the time step schedule.
04:00:21.600 | The time step basically.
04:00:23.840 | Because we want to reverse the noise.
04:00:27.680 | We want to remove noise.
04:00:28.720 | We will start from the more noisy to less noise.
04:00:31.600 | So we will go from 1000 to 0.
04:00:34.000 | Initially.
04:00:36.240 | So let's say time steps is equal to torch from.
04:00:41.200 | We reverse this.
04:00:52.560 | So this is from 0 to 1000.
04:00:54.320 | But actually we want 1000 to 0.
04:00:56.240 | And this is our initial schedule.
04:01:02.400 | In case we want to do 1000 steps.
04:01:04.080 | But later because here we actually specify.
04:01:07.040 | How many inference steps we want to do.
04:01:10.080 | We will change these time steps here.
04:01:12.720 | So if the user later specifies less than 1000.
04:01:15.200 | We will change it.
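Putting the constructor together, a sketch of what has been defined so far (beta_start = 0.00085 and beta_end = 0.0120, with the scaled-linear schedule):

```python
import torch
import numpy as np

class DDPMSampler:
    def __init__(self, generator, num_training_steps=1000,
                 beta_start=0.00085, beta_end=0.0120):
        # "Scaled linear" schedule: linear in sqrt(beta), then squared
        self.betas = torch.linspace(beta_start ** 0.5, beta_end ** 0.5,
                                    num_training_steps, dtype=torch.float32) ** 2
        self.alphas = 1.0 - self.betas                            # alpha_t = 1 - beta_t
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)   # alpha_bar_t
        self.one = torch.tensor(1.0)

        self.generator = generator
        self.num_training_steps = num_training_steps
        # 999, 998, ..., 0: full schedule, replaced later by set_inference_timesteps
        self.timesteps = torch.from_numpy(np.arange(0, num_training_steps)[::-1].copy())
```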
04:01:15.920 | So let's do it.
04:01:18.720 | We let's create the method.
04:01:19.840 | That will change this time steps.
04:01:22.960 | Based on how many actual steps we want to make.
04:01:25.840 | So set inference step.
04:01:29.680 | Time steps.
04:01:32.720 | As I said before.
04:01:38.480 | We usually perform 50.
04:01:39.920 | Which is also actually the one they use normally.
04:01:42.480 | For example in hugging face library.
04:01:45.840 | Let's save this value.
04:01:50.160 | Because we will need it later.
04:01:51.680 | Now if we have a number.
04:01:54.800 | For example we go from 1000.
04:01:57.200 | Actually it's not from.
04:01:58.320 | This is not from 0 to 1000.
04:02:00.560 | But it's from 0 to 1000 minus 1.
04:02:02.720 | Because this is excluded.
04:02:04.000 | So it will be from 999.
04:02:06.080 | 998, 997, 996, etc., up to 0.
04:02:12.640 | So we have 1000 numbers.
04:02:15.440 | But we don't want 1000 numbers.
04:02:17.040 | We want less.
04:02:18.080 | We want 50 of them.
04:02:19.680 | So what we do is basically.
04:02:20.960 | We space them every 20.
04:02:22.880 | So we start with 999.
04:02:25.280 | Then 999 minus 20.
04:02:27.600 | Then 999 minus 40 etc etc.
04:02:31.120 | Until we arrive to 0.
04:02:32.320 | But in total here will be 1000 steps.
04:02:37.120 | And here will be 50 steps.
04:02:39.440 | Why minus 20?
04:02:41.600 | Because 20 is 1000 divided by 50.
04:02:45.920 | If I'm not mistaken.
04:02:45.920 | So this is exactly what we are going to do.
04:02:51.120 | So we calculate the step ratio.
04:02:52.880 | Which is self dot num training step.
04:02:56.160 | Divide by how many we actually want.
04:02:58.400 | And we redefine the time steps.
04:03:00.720 | According to how many we actually want to make.
04:03:03.520 | 0 num inference steps.
04:03:15.440 | Multiply it by this step ratio.
04:03:17.200 | And round it.
04:03:24.160 | We reverse it just like before.
04:03:27.680 | Because this is from 0.
04:03:29.120 | So this is actually means 0.
04:03:31.360 | Then 20.
04:03:32.160 | Then 40.
04:03:32.880 | Then 60 etc.
04:03:34.480 | Until we reach 999.
04:03:36.080 | Then we reverse it.
04:03:37.200 | Then copy.
04:03:40.720 | S type.
04:03:43.760 | np dot int 64.
04:03:48.560 | So a long one.
04:03:50.640 | And then we define as tensor.
04:03:54.240 | Now the code looks very different from each other.
04:04:01.040 | Because actually I have been copying the code from multiple sources.
04:04:04.560 | Maybe one of them I think I copied from the HuggingFace library.
04:04:08.720 | So I didn't change it.
04:04:10.160 | I kept it to the original one.
04:04:12.080 | Okay.
04:04:13.760 | But the idea is the one I showed you before.
04:04:16.160 | So we copy the code from the HuggingFace library, as I showed you before.
04:04:20.480 | So now we set the exact number of time steps we want.
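A sketch of the method, continuing the class above:

```python
def set_inference_timesteps(self, num_inference_steps=50):
    self.num_inference_steps = num_inference_steps
    # e.g. 1000 // 50 = 20: keep every 20th training timestep, from 999-ish down to 0
    step_ratio = self.num_training_steps // num_inference_steps
    timesteps = (np.arange(0, num_inference_steps) * step_ratio).round()[::-1].copy()
    self.timesteps = torch.from_numpy(timesteps.astype(np.int64))
```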
04:04:23.760 | And we redefine this time steps array like this.
04:04:26.800 | Let's define the next method.
04:04:31.280 | Which basically tells us.
04:04:33.520 | Let's define the method on how to add noise to something.
04:04:36.720 | So imagine we have the image.
04:04:38.400 | As you remember to do image to image.
04:04:41.520 | We need to add noise to this latent.
04:04:43.520 | How do we add noise to something?
04:04:45.840 | Well we need to apply the formula as defined in the paper.
04:04:49.280 | Let's go in the paper here.
04:04:51.280 | We need to apply this formula here.
04:04:54.080 | And that's it.
04:04:55.520 | This means that given this image.
04:04:57.680 | I want to go to the noisified version of this image at time step t.
04:05:03.360 | Which means that I need to take.
04:05:06.800 | We need to have a sample from this Gaussian.
04:05:11.600 | But we don't.
04:05:12.320 | Okay.
04:05:13.680 | Let's build it.
04:05:14.320 | And we will apply the same trick that we did for the variational autoencoder.
04:05:17.840 | As you remember in the variational autoencoder.
04:05:19.920 | I actually already showed how we sample from a distribution.
04:05:23.040 | Of which we know the mean and the variance here.
04:05:25.440 | We will do the same here.
04:05:26.640 | But we of course we need to build the mean and the variance.
04:05:29.680 | What is the mean of this distribution?
04:05:31.520 | It's this one.
04:05:33.120 | And what is the variance?
04:05:35.040 | It's this one.
04:05:35.920 | So we need to build the mean and the variance.
04:05:37.520 | And then we sample from this.
04:05:38.640 | So let's do it.
04:05:40.480 | DDPM.
04:05:43.360 | So we take the original samples.
04:05:45.360 | Which is the float tensor.
04:05:48.880 | And then the time steps.
04:05:50.080 | So this is actually time step, not time steps.
04:05:57.040 | It indicates at what time step we want to add the noise.
04:06:00.000 | Because you can add the noise at time step 1, 2, 3, 4.
04:06:04.240 | Up to 1000.
04:06:12.240 | And with each level the noise increases.
04:06:15.200 | So the noisified version at the time step 1 will be not so noisy.
04:06:19.280 | But at the time step 1000 will be complete noise.
04:06:22.800 | This returns a float tensor.
04:06:27.680 | Okay.
04:06:31.040 | Let's calculate first.
04:06:35.120 | Let me check what we need to calculate first.
04:06:38.640 | We can calculate first the mean.
04:06:40.400 | And then the variance.
04:06:41.360 | So to calculate the mean we need this alpha cum prod.
04:06:46.000 | So the cumulative product of the alpha.
04:06:48.240 | Which stands for alpha bar.
04:06:50.160 | So the alpha bar as you can see is the cumulative product of all the alphas.
04:06:54.000 | Which is each alpha is 1 minus beta.
04:06:56.240 | So we take this alpha bar.
04:06:59.360 | Which we will call alpha cum prod.
04:07:01.280 | So it's already defined here.
04:07:04.560 | Alpha cum prod is self dot 2 device.
04:07:10.160 | We move it to the same device.
04:07:14.080 | Because we need to later combine it with it.
04:07:16.080 | And of the same type.
04:07:21.840 | This is a tensor.
04:07:30.400 | That we also move to the same device of the other tensor.
04:07:34.320 | Now we need to calculate the square root of alpha bar.
04:07:39.280 | So let's do it.
04:07:40.960 | Square root of alpha cum prod.
04:07:47.040 | Or alpha prod is alpha cum prod at the time step t.
04:07:53.280 | To the power of 0.5.
04:07:56.160 | Why to the power of 0.5?
04:07:57.760 | Because raising a number to the power of 0.5 is the same as taking the square root of the number.
04:08:07.840 | So, to the power of 1/2, which is the square root.
04:08:11.040 | And then we flatten this array.
04:08:14.800 | And then basically because we need to combine this alpha cum prod.
04:08:21.360 | Which doesn't have dimensions.
04:08:23.440 | It only has one dimension.
04:08:24.640 | Which is the number itself.
04:08:25.680 | But we need to combine it with the latents.
04:08:27.760 | We need to add some dimensions.
04:08:29.040 | So one trick is to just keep adding dimensions with unsqueeze.
04:08:32.640 | Until you have the same number of dimensions.
04:08:34.480 | So until the length of its shape is less than the number of dimensions of the latents.
04:08:41.440 | Most of this code I have taken from the Hugging Face libraries samplers.
04:08:52.080 | So we keep the dimension until this one and this tensor and this tensor have the same dimensions.
04:09:07.440 | This is because otherwise we cannot do broadcasting when we multiply them together.
04:09:11.120 | The other thing that we need to calculate this formula is this part here.
04:09:15.600 | 1 minus alpha bar.
04:09:16.880 | So let's do it.
04:09:18.800 | So sqrt of 1 minus alpha prod.
04:09:24.240 | As the name implies is 1 minus alpha cum prod at the time step t.
04:09:30.640 | To the power of 0.5.
04:09:34.480 | Why 0.5?
04:09:35.680 | Because we don't want the variance.
04:09:37.520 | We want the standard deviation.
04:09:39.440 | Just like we did with the encoder of the variational autoencoder.
04:09:43.920 | We want the standard deviation.
04:09:45.680 | Because as you remember if you have an n01 and you want to transform into an n with the given mean and the variance.
04:09:52.720 | The formula is x is equal to mean plus the standard deviation multiplied by the n01.
04:09:58.240 | Let's go back.
04:10:00.880 | So this is the standard deviation.
04:10:03.600 | And we also flatten this one.
04:10:13.920 | Flatten and then again we keep adding the dimensions until they have the same dimension.
04:10:19.680 | Otherwise we cannot multiply them together or sum them together.
04:10:27.760 | Unsqueeze so we keep adding dimensions.
04:10:37.360 | Now as you remember our method should add noise to an image.
04:10:44.080 | So we need to add noise means we need to sample some noise.
04:10:47.280 | So we need to sample some noise from the n01.
04:11:00.560 | Using this generator that we have.
04:11:07.280 | I think my cat is very angry today with me because I didn't play with him enough.
04:11:14.000 | So later if you guys excuse me I need to later play with him.
04:11:20.160 | I think we will be done very soon.
04:11:24.240 | So let's get the noisy samples using the noise and the mean and the variance that we have calculated.
04:11:30.720 | According exactly to this formula here.
04:11:32.880 | So we do the mean.
04:11:35.520 | Actually, the mean is this coefficient multiplied by x0.
04:11:45.520 | So we need to take this square root of alpha cum prod, multiply it by x0, and this will be the mean.
04:11:50.880 | So the mean is the square root of alpha prod multiplied by the original latents,
04:11:56.320 | so x0, the input image or whatever we want to noisify,
04:12:00.240 | plus the standard deviation, which is the square root of this one, multiplied by a sample
04:12:09.200 | from the N(0, 1), so the noise.
04:12:11.600 | And this is how we noisify an image.
04:12:14.720 | This is how we add noise to an image.
04:12:19.920 | So this one let me write it down.
04:12:22.160 | So all of this is according to equation 4 of the DDPM paper and also according to this formula here.
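For reference, the whole add_noise logic described above can be condensed into a short sketch. It assumes the sampler stores the cumulative products in `self.alphas_cumprod` and a `torch.Generator` in `self.generator`; the names follow the transcript and are not necessarily verbatim from the final repository.

```python
import torch

def add_noise(self, original_samples: torch.FloatTensor, timesteps: torch.IntTensor) -> torch.FloatTensor:
    # q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)   (DDPM eq. 4)
    alphas_cumprod = self.alphas_cumprod.to(device=original_samples.device, dtype=original_samples.dtype)
    timesteps = timesteps.to(original_samples.device)

    # sqrt(alpha_bar_t): coefficient of the mean
    sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5
    sqrt_alpha_prod = sqrt_alpha_prod.flatten()
    while len(sqrt_alpha_prod.shape) < len(original_samples.shape):
        sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)   # add dims so it broadcasts with the latents

    # sqrt(1 - alpha_bar_t): the standard deviation
    sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5
    sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
    while len(sqrt_one_minus_alpha_prod.shape) < len(original_samples.shape):
        sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)

    # x_t = mean + std * eps, with eps ~ N(0, I)
    noise = torch.randn(original_samples.shape, generator=self.generator,
                        device=original_samples.device, dtype=original_samples.dtype)
    return sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise
```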
04:12:36.480 | Okay now that we know how to add noise we need to understand how to remove noise.
04:12:46.400 | So as you remember let's review again here.
04:12:49.920 | Imagine we are doing text-to-image or image-to-image, it doesn't matter.
04:12:57.200 | The point is our unit as you remember is trained to only predict the amount of noise given the
04:13:03.600 | latent with noise given the prompt and the time step at which this noise was added.
04:13:12.400 | So what we do is we have this predicted noise from the unit.
04:13:17.280 | We need to remove this noise so the unit will predict the noise but we need some way of
04:13:22.480 | removing the noise to get the next latent.
04:13:25.760 | What I mean by this is you can see this reverse process here.
04:13:32.960 | So the reverse process is defined here.
04:13:36.720 | We want to go from Xt so something more noisy to something less noisy based on the noise
04:13:49.040 | that was predicted by the unit.
04:13:51.440 | But here in this formula you don't see any relationship to the noise predicted by the
04:13:56.800 | unit.
04:13:57.680 | Actually here it just says if you have a network that can evaluate this mean and this variance
04:14:06.640 | you know how to remove the noise to how to go from Xt to Xt-1 but we don't have a method
04:14:13.120 | that actually predicts the mean and the variance.
04:14:15.120 | We have a method that tells us how much noise is there.
04:14:18.000 | So the formula we should be looking at is actually here.
04:14:22.800 | So here, because we have trained our network, our UNet, as this epsilon theta: as you
04:14:33.600 | remember, our training method was this, we do gradient descent on this loss, in which we
04:14:40.000 | train a network to predict the noise in a noisy image.
04:14:45.040 | So we need to use this epsilon theta now to remove the noise so this predicted noise to
04:14:51.120 | remove the noise and if we read the paper it's written here that to sample Xt-1 given
04:14:58.000 | Xt is to compute Xt-1 is equal to this formula here.
04:15:04.640 | This tells us how to go from Xt to Xt-1 and this is the so basically we sample some noise
04:15:12.720 | we multiply it by d sigma and this basically reminds us on how to move go from the N01
04:15:21.440 | to any distribution with a particular mean and a particular variance.
04:15:26.400 | So we will be working according to this formula here actually because we have a model that
04:15:31.360 | predicts noise here this epsilon theta and this is our unit.
04:15:36.080 | The unit is trained to predict noise.
04:15:38.000 | So let's build this part now and I will while building it I will also tell you which formula
04:15:44.160 | I'm referring to at each step so you can also follow the paper.
04:15:47.680 | So now let's build the method, let's call it the step method. It is given: the time step at which the
04:15:54.160 | noise was added (or we assume it was added, because when we do the reverse process we can also
04:16:00.400 | skip some time steps), so we need to tell
04:16:06.000 | it at what time step it should remove the noise; the latents, so as you know
04:16:12.640 | the UNet works with the latents, so with these z's here, so this is z and it keeps denoising
04:16:19.040 | the latents; and then the model output, so the predicted noise of the UNet.
04:16:27.920 | so the model output is the predicted noise torch dot tensor
04:16:33.520 | This model output corresponds to this epsilon theta of (xt, t), so this is the predicted noise
04:16:43.680 | at time step t. These latents are our xt. And what else do we need? The alpha we have, the beta
04:16:52.320 | we have, we have everything. Okay, let's go. So t is equal to the time step. The previous t is
04:17:02.160 | equal to self dot get previous time step t this is a function that given this time step
04:17:11.040 | calculates the previous one later we will build it actually we can build it now it's very simple
04:17:31.680 | get previous time step self time step which is an integer we return another integer
04:17:38.960 | previous time step is equal to the time step
04:17:43.680 | minus, basically, this quantity here, the step ratio: so self.num_training_steps
04:17:56.720 | divided by self.num_inference_steps, and we return previous t. Basically,
04:18:06.560 | given for example the time step 1000, it will return 1000 minus 20, because
04:18:16.640 | the number of training steps is, suppose, 1000,
04:18:24.080 | and the number of inference steps we will be doing is 50, so this means 1000 minus
04:18:30.000 | 20, because 1000 divided by 50 is 20. So it will return 980, and when we give it 980 as input it will
04:18:37.760 | return 960. So what is the next step that we will be doing in our for loop, or what is the previous
04:18:45.200 | step of the denoising? We are going from the image noised at time step 1000 to an image
04:18:52.320 | noised at time step 980, for example. This is the meaning of the previous time step.
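As a minimal sketch, assuming the sampler stores `num_training_steps` (e.g. 1000) and `num_inference_steps` (e.g. 50) as attributes, the helper just subtracts the step ratio:

```python
def _get_previous_timestep(self, timestep: int) -> int:
    # 1000 training steps / 50 inference steps = a step ratio of 20,
    # so 980 -> 960 -> 940 -> ... down to 0.
    step_ratio = self.num_training_steps // self.num_inference_steps
    return timestep - step_ratio
```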
04:19:01.280 | Then we retrieve some data that we will use later: so alpha prod t is equal to self.alphas_cumprod at the time step t. For now, if you don't understand,
04:19:08.240 | don't worry because later i will write i will just collect some data that we need to calculate
04:19:12.400 | a formula and then i will tell you exactly which formula we are going to calculate
04:19:20.240 | alpha prod t
04:19:22.640 | if we don't have any previous step then we don't know which alpha to return so we just return one
04:19:45.440 | and actually there is a paper that came out, I think from ByteDance, that was complaining that
04:19:50.960 | this way of doing it is not correct, because the last time step doesn't
04:19:58.720 | have a signal-to-noise ratio equal to zero, but okay, this is something we don't need
04:20:03.760 | to care about now actually if you're interested i will link the paper in the comments
04:20:16.320 | Then the current alpha t
04:20:26.320 | is alpha prod t divided by alpha prod t prev. Also, this code I took from the
04:20:33.600 | hugging face diffusers library because i mean we are applying formulas so even if i wrote it by
04:20:43.520 | myself it wouldn't be any different because we are just applying formulas from the paper so
04:20:48.240 | so the first thing we need to do is to compute the original sample according to the formula 15
04:20:55.040 | of the paper what do i mean by this as you can see where is it this one where is it
04:21:05.120 | here so actually let me show you another formula here
04:21:12.160 | as you can see we can calculate the previous step so the less noise is the the forward process
04:21:22.080 | sorry the reverse process we can calculate the less noisy image given a more noisy image
04:21:26.960 | and the predicted image at time step zero according to this formula here where the mean
04:21:34.560 | is defined in this way and the variance is defined in this way but what is the predicted
04:21:44.400 | x0 so given an image given a noisy image at time step t how can we predict what is the x0
04:21:53.920 | of course this is the predicted x0 not what will be the x0 so this predicted x0 we can also retrieve
04:22:00.640 | it using formula number 15, if I remember correctly, it's here: so this x0 is given as xt
04:22:10.800 | minus the square root of 1 minus alpha bar multiplied by the predicted noise at time step t, divided by the square root
04:22:17.200 | of alpha bar. All these quantities we have. So actually there are two ways, which are equivalent to each
04:22:22.800 | other actually numerically of going from more noisy to less noisy one way is this one this
04:22:28.960 | one here which is the algorithm 2 of the sampling and one is this one here so the equation number
04:22:36.560 | 7 that allows you to go from more noisy to less noisy but the two are numerically equivalent they
04:22:42.400 | just in the in the effect they are equivalent it's just they have different parameterization
04:22:48.000 | so they have different formulas so as a matter of fact for example here in the code they say
04:22:54.720 | to go from xt to xt minus 1 you need to do this calculation here but as you can see for example
04:23:02.480 | this is this numerator of this multiplied by this epsilon theta is different from the one
04:23:11.120 | in the algorithm here, but actually they are the same thing, because beta t is equal to 1 minus alpha t,
04:23:16.720 | as alpha is defined as 1 minus beta, as you remember. So there are multiple ways of obtaining
04:23:24.080 | the same thing so what we will do is we actually we will apply this formula here in which we need
04:23:29.760 | to calculate the mean and we need to calculate the variance according to these formulas here
04:23:34.720 | in which we know alpha we know beta we know alpha bar we know all the other alphas we know
04:23:40.320 | because there are parameters that depend on beta what we don't know is x0 but x0 can be calculated
04:23:46.240 | as in the formula 15 here so first we will calculate this x0 predicted x0
04:23:54.400 | so first compute the predicted original sample using formula 15 of the DDPM paper
04:24:13.360 | predicted original sample
04:24:16.960 | Latents minus... so we do latents minus the square root of 1 minus alpha bar t. What is
04:24:28.000 | 1 minus alpha bar t? It is beta prod t, which I have here, already defined as 1 minus alpha prod t, as
04:24:36.160 | you can see, so 1 minus alpha bar at the time step t, because I already retrieved it from here.
04:24:43.920 | So beta prod to the power of one half, or the square root of beta prod:
04:24:52.000 | we do latents minus beta prod at time step t to the power of 0.5, which basically means the square
04:25:01.120 | root of beta prod, and then we multiply this by the predicted noise of the latent at
04:25:08.640 | time step t so what is the predicted noise it's the model output because our unit predicts the
04:25:15.120 | noise model output and then we need to divide this by let me check
04:25:23.360 | square root of alpha t which we have i think here alpha t here so the square root of
04:25:31.360 | alpha t alpha prod t to the power of 0.5
04:25:36.240 | Here I have something extra on this one that I don't need, so let me remove it,
04:25:44.560 | because otherwise it would be wrong: first there is the product between
04:25:50.080 | these two terms, and then there is the difference.
04:25:56.160 | Okay, this is how we compute the predicted x0.
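As a stand-alone helper, that computation could look like this (a sketch; the variable names are illustrative):

```python
import torch

def predict_x0(latents: torch.Tensor, model_output: torch.Tensor,
               alpha_prod_t: torch.Tensor) -> torch.Tensor:
    # DDPM eq. 15: x0 ~= (x_t - sqrt(1 - alpha_bar_t) * eps_theta(x_t, t)) / sqrt(alpha_bar_t)
    beta_prod_t = 1 - alpha_prod_t            # 1 - alpha_bar_t
    return (latents - beta_prod_t ** 0.5 * model_output) / alpha_prod_t ** 0.5
```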
04:25:59.440 | Now let's go back to formula number 7. Okay, now we have this x0, so we can compute this term and this term,
04:26:06.960 | and all the other terms we can also compute. So we calculate
04:26:11.120 | this mean and this variance and then we sample from this distribution so compute the coefficients
04:26:20.640 | for pred original sample and the current sample xt. This is the same comment that you can find in
04:26:31.280 | the diffusers library which basically means we need to compute this one this is the coefficient
04:26:36.640 | for the predicted sample and this is the coefficient for xt this one here so predicted
04:26:43.520 | original sample coefficient which is equal to what alpha prod t minus one so the previous alpha
04:26:54.320 | prod t which is alpha prod t previous which means the alpha prod t but at the previous time step
04:27:04.720 | under the square root so to the power of 0.5 multiplied by the current beta t so
04:27:10.720 | the beta at the time step t so current beta t which is we define it here
04:27:18.640 | current beta t, we retrieve it from the alphas, okay, and then we divide it by
04:27:28.720 | beta prod t, because 1 minus alpha bar is what we are calling beta prod.
04:27:33.120 | beta product t then we have the this coefficient here so this one here
04:27:42.320 | so this is current sample coefficient is equal to current alpha t to the power of 0.5
04:27:50.880 | which means the square root of this time this this thing here so the square root of alpha t
04:27:58.080 | and then we multiply it by beta prod at the previous time step, because 1 minus alpha bar at the
04:28:03.200 | previous time step corresponds to beta prod at the previous time step, so beta
04:28:09.280 | prod t prev, and we divide by beta prod at the time step t, so beta prod t.
04:28:17.840 | now we can compute the mean so the mean is the sum of these two terms
04:28:26.160 | pred prev sample. So let me write a comment here: compute the predicted
04:28:33.920 | previous sample mean, mu_t, which
04:28:39.840 | is equal to predicted original sample coefficient multiplied by what by x0 what is x0 is this one
04:28:50.000 | that we obtained by the formula number 15 so the prediction predicted original sample so x0
04:28:56.000 | plus this term here what is this term is this one here so the current sample coefficient
04:29:02.560 | multiplied by xt. What is xt? It is the latents at the time step t.
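Putting the two coefficients and the mean together, a minimal sketch (names follow the transcript, not necessarily the repository verbatim):

```python
import torch

def posterior_mean(latents: torch.Tensor, pred_original_sample: torch.Tensor,
                   alpha_prod_t: torch.Tensor, alpha_prod_t_prev: torch.Tensor) -> torch.Tensor:
    # Mean of q(x_{t-1} | x_t, x_0), DDPM eq. 7.
    beta_prod_t = 1 - alpha_prod_t            # 1 - alpha_bar_t
    beta_prod_t_prev = 1 - alpha_prod_t_prev  # 1 - alpha_bar_{t-1}
    current_alpha_t = alpha_prod_t / alpha_prod_t_prev
    current_beta_t = 1 - current_alpha_t

    # coefficient of the predicted x0
    pred_original_sample_coeff = alpha_prod_t_prev ** 0.5 * current_beta_t / beta_prod_t
    # coefficient of the current (more noisy) sample x_t
    current_sample_coeff = current_alpha_t ** 0.5 * beta_prod_t_prev / beta_prod_t

    return pred_original_sample_coeff * pred_original_sample + current_sample_coeff * latents
```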
04:29:08.480 | Now we have computed the mean; next we also need to compute the variance.
04:29:17.120 | let's create another method to compute the variance
04:29:19.760 | def get_variance, self, time step, int.
04:29:28.480 | Okay, we obtain the previous time step t, because we need it for later calculations.
04:29:37.440 | again we calculate the alpha prod t so all the terms that we need to calculate
04:29:45.280 | these particular terms here
04:30:15.040 | and the current beta t is equal to 1 minus (alpha prod t divided by alpha prod t prev), this one.
04:30:24.240 | So what is current beta t? It is 1 minus alpha prod t
04:30:36.080 | divided by alpha prod t prev. Okay, so the variance, according to formulas number 6 and 7,
04:30:48.800 | so this formula here is given as one minus alpha prod tprev so one minus alpha prod tprev
04:31:04.000 | divided by one minus alpha prod which is one minus alpha prod why prod because
04:31:11.760 | this is the alpha bar and multiplied by the current beta
04:31:15.920 | beta t and beta t is defined i don't remember where it's one minus alpha
04:31:26.320 | and this is our variance we clamp it
04:31:29.840 | So torch.clamp the variance, and the minimum that we want is 1e-20,
04:31:41.600 | to make sure that it doesn't reach zero, and then we return the variance.
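A compact sketch of this helper, assuming alpha prod t and alpha prod t prev have been retrieved as above:

```python
import torch

def get_variance(alpha_prod_t: torch.Tensor, alpha_prod_t_prev: torch.Tensor) -> torch.Tensor:
    # Posterior variance beta_tilde_t from DDPM eqs. 6-7.
    current_beta_t = 1 - alpha_prod_t / alpha_prod_t_prev
    variance = (1 - alpha_prod_t_prev) / (1 - alpha_prod_t) * current_beta_t
    # clamp so the variance never becomes exactly zero (we take its square root later)
    return torch.clamp(variance, min=1e-20)
```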
04:31:52.240 | and now that we have the mean and the variance so this variance has also been computed using
04:31:58.960 | let me write here computed using formula seven of the ddpm paper
04:32:07.840 | and now we go back to our step function so what we do is
04:32:17.760 | we check if t is greater than zero, because we only need to add the variance if we are not at the
04:32:24.480 | last time step; if we are at the last time step, we have no noise, so we don't add any,
04:32:28.320 | we don't need to add any noise. Actually, the point is we are going
04:32:36.880 | to sample from this distribution and just like we did before we actually sample from the n01 and
04:32:42.240 | then we shift it according to the formula: so a Gaussian with a particular mean and a
04:32:52.160 | particular variance is equal to the Gaussian N(0, 1) multiplied by the standard deviation,
04:32:58.960 | plus the mean. So we sample the noise.
04:33:26.080 | okay we sample some noise compute the variance
04:33:38.000 | actually this is the
04:33:42.640 | variance already multiplied by the noise so it's actually the standard deviation
04:33:48.800 | because, as you will see, we take self.get_variance at the time step t to the power of
04:33:58.240 | 0.5, so this one becomes the standard deviation, and we multiply it by the N(0, 1) sample.
04:34:04.720 | so what we are doing is basically we are going from n01
04:34:09.520 | to an N with a particular mu and a particular sigma, using the usual trick of
04:34:18.080 | x is equal to mu plus sigma (what we have here is actually sigma squared,
04:34:25.520 | because this is the variance) multiplied by z, where z is distributed according
04:34:33.360 | to the N(0, 1). This is the same thing that we have always done, also for the variational auto
04:34:39.520 | encoder, also for adding the noise, the same thing that we did before: this is how you sample from a
04:34:45.840 | distribution how you actually shift the parameter of the gaussian distribution
04:34:49.280 | so predicted prev sample is equal to the predicted prev sample plus the variance
04:34:59.600 | this variance term here already includes the sigma multiplied by z
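So the tail of the step method can be sketched roughly like this, given the mean and the variance computed above (a sketch under those assumptions, not a verbatim implementation):

```python
import torch

def sample_x_prev(pred_prev_sample_mean: torch.Tensor, variance_t: torch.Tensor,
                  t: int, generator: torch.Generator) -> torch.Tensor:
    # x_{t-1} = mu + sigma * z with z ~ N(0, I); no noise is added at the last step (t == 0).
    if t > 0:
        noise = torch.randn(pred_prev_sample_mean.shape, generator=generator,
                            device=pred_prev_sample_mean.device,
                            dtype=pred_prev_sample_mean.dtype)
        return pred_prev_sample_mean + variance_t ** 0.5 * noise
    return pred_prev_sample_mean
```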
04:35:03.600 | and then we return predicted prev sample. Okay, now we have also built the
04:35:14.000 | sampler. Let me check if we have everything. No, we are still missing something, which is the
04:35:19.360 | set strength method. As you remember, when we want to do image-to-image, let's go
04:35:26.320 | back to check our slides if we want to do image to image we convert the image using the vae to a
04:35:32.560 | latent then we need to add noise to this latent but how much noise we can decide the more noise
04:35:38.640 | we add the more freedom the unit will have to change this image the less noise we add the
04:35:43.360 | less freedom it will have to change the image so what we do is basically by setting the strength
04:35:48.640 | we make our sampler start from a particular noise level and this is exactly what the method we want
04:35:55.600 | to implement so i made some mess okay so for example as soon as we load the image we set
04:36:02.640 | the strength which will shift the noise level from which we start from and then we add noise
04:36:08.240 | to our latent to create the image to image here so let's go here and we create this method called
04:36:16.480 | set strength
04:36:20.720 | Okay, then the start step, because we will skip some steps,
04:36:29.840 | is equal to self.num_inference_steps minus int of (self.num_inference_steps multiplied by the strength).
04:36:37.920 | this basically means that if we have 50 inference steps and then we set the strength to let's say
04:36:48.000 | 0.8, it means that we will skip 20% of the steps. So when we do image-to-image,
04:36:55.200 | for example, we will not start from a pure noise image, but we will start from 80% of noise
04:37:01.760 | in this image so the unit will still have freedom to change this image but not as much as with 100%
04:37:08.160 | noise we redefine the time steps because we are altering the schedule so basically we skip some
04:37:17.920 | time steps and self.start step is equal to start step so actually what we do here is suppose we
04:37:29.840 | have a strength of 80%: we are actually fooling the UNet into believing that it
04:37:35.440 | came up with this image, which now has this level of noise, and now it needs to keep denoising
04:37:41.280 | it. This is how we do image-to-image: we start with an image, we noise it, and then we make the
04:37:47.280 | UNet believe that it came up with this image with this particular noise level, and now it has to keep
04:37:53.760 | denoising it, according of course also to the prompt, until we reach the clean image without any noise.
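A minimal sketch of set_strength as just described (the attribute names follow the transcript and are assumptions, not necessarily the repository verbatim):

```python
def set_strength(self, strength: float = 1.0):
    # With 50 inference steps and strength 0.8: start_step = 50 - int(50 * 0.8) = 10,
    # so the 10 noisiest steps are skipped and denoising starts from an ~80%-noisy latent.
    start_step = self.num_inference_steps - int(self.num_inference_steps * strength)
    self.timesteps = self.timesteps[start_step:]
    self.start_step = start_step
```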
04:38:01.120 | Now we have the pipeline that we can call, we have the DDPM sampler, we have the model built;
04:38:12.080 | of course we need to create the function to load the weights of this model so let's create another
04:38:18.000 | file we will call it the model loader here model loader because now we are nearly close to sampling
04:38:27.760 | finally from this stable diffusion. So now we need to create the method to load the
04:38:32.160 | pre-trained the pre-trained weights that we have downloaded before so let's create it
04:38:40.800 | So from clip we import CLIP,
04:38:42.240 | from encoder we import the VAE encoder, then from decoder we import the VAE decoder,
04:38:55.600 | and from diffusion we import Diffusion, our diffusion model, which is our UNet.
04:39:05.040 | now let me first define it then i tell you what we need to do so preload preload models from
04:39:14.800 | standard weights
04:39:17.280 | okay as usual we load the weights using torch but we use we will create another function
04:39:32.720 | model converter dot load from standard weights
04:39:37.280 | this is a method that we will create later to to load the weights
04:39:49.360 | the pre-trained weights and i will show you why we need this method then we create our encoder
04:39:58.880 | and we load the state dict, load_state_dict, from our state dict,
04:40:04.160 | and we also set strict
04:40:12.400 | equal to
04:40:26.960 | True. Then we have the decoder,
04:40:37.040 | and strict also so this strict parameter here basically tells that when you load a model from
04:40:52.000 | PyTorch, this ckpt file here for example, it is a dictionary that contains many keys,
04:40:59.680 | and each key corresponds to one matrix of our model. So for example, this group
04:41:06.320 | normalization has some parameters, and how can torch load these parameters into this
04:41:12.720 | group norm? By using the names of the variables that we have defined here. When we load a
04:41:20.480 | model with PyTorch, it will actually load the dictionary, and then we load this dictionary
04:41:25.680 | into our models, and it will match by names. Now, the problem is that the pre-trained model
04:41:30.960 | actually they don't use the same name that i have used and actually this code is based on another
04:41:36.720 | code that i have seen so actually the the names that we use are not the same as the pre-trained
04:41:42.560 | model also because the names in the pre-trained model not always uh very friendly for learning
04:41:49.120 | this is why i changed the names and also other people changed the names of the methods but this
04:41:55.200 | also means that the automatic mapping between the names of the pre-trained model and the names
04:42:01.440 | defined in our classes here cannot happen because it cannot happen automatically because the names
04:42:06.400 | do not match for this reason there is a script that i have created in my github library here
04:42:14.320 | that you need to download to convert these names it's just a script that maps one name into another
04:42:20.160 | so if the name is this one map it into this if the name is this one mapping into this
04:42:24.560 | there is nothing special about this script it's just a very big mapping of the names and this is
04:42:30.560 | actually done by most models because if you want to change the name of the classes and or the
04:42:36.960 | variables then you need to do this kind of mapping so i will also i will basically copy it i don't
04:42:43.920 | need to download the file so this will call the model converter.py model converter.py
04:42:52.320 | and that's it it's just a very big mapping of names and i take it from this comment here on
04:43:00.480 | github so this is model converter so we need to import this model converter import model converter
04:43:12.560 | import this model converter basically will convert the names and then we can use the
04:43:17.840 | load state dict and this will actually map all the names it's now now the names will map with
04:43:22.720 | each other and this trick makes sure that if there is even one name that doesn't map
04:43:26.880 | then throw an exception which is what i want because i want to make sure that all the names map
04:43:38.240 | so we define the diffusion and we load it's
04:43:40.480 | state dict
04:43:43.280 | diffusion and strict equal to true
04:43:52.480 | and let me check
04:44:07.440 | then we do clip is equal to clip dot to device so we move it to device where we
04:44:13.360 | want to work and then we load also his state dict so the parameters of the weights
04:44:17.840 | and then we return a dictionary clip
04:44:35.760 | clip and then we have the encoder is the encoder we have the decoder
04:44:42.960 | is the decoder, and then we have the diffusion, etc.
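Put together, the loader can be sketched roughly like this. The class and module names (CLIP, VAE_Encoder, VAE_Decoder, Diffusion) follow the names used earlier in the video, and the converter is assumed to return one sub-dictionary of weights per component; treat both as assumptions if your files differ.

```python
import model_converter
from clip import CLIP
from encoder import VAE_Encoder
from decoder import VAE_Decoder
from diffusion import Diffusion

def preload_models_from_standard_weights(ckpt_path: str, device: str) -> dict:
    # Remap the checkpoint's parameter names to the names used by our own classes.
    state_dict = model_converter.load_from_standard_weights(ckpt_path, device)

    encoder = VAE_Encoder().to(device)
    encoder.load_state_dict(state_dict['encoder'], strict=True)

    decoder = VAE_Decoder().to(device)
    decoder.load_state_dict(state_dict['decoder'], strict=True)

    diffusion = Diffusion().to(device)
    diffusion.load_state_dict(state_dict['diffusion'], strict=True)

    clip = CLIP().to(device)
    clip.load_state_dict(state_dict['clip'], strict=True)

    return {'clip': clip, 'encoder': encoder, 'decoder': decoder, 'diffusion': diffusion}
```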
04:44:51.600 | now we have all the ingredients to run finally the inference guys so thank you for being patient so
04:44:58.800 | much and it's really finally we have we can see the light coming so let's build our notebook so we
04:45:07.680 | can visualize the image that we will build okay let's select the kernel stable diffusion i already
04:45:16.640 | created it in my repository you will also find the requirements that you need to install in order to
04:45:23.360 | run this so let's import everything we need so the model loader the pipeline
04:45:30.000 | From PIL we import Image, this is how we load an image in Python; from pathlib we import... actually this one
04:45:41.040 | we don't need; and transformers, which is the only library that we will be using, because there
04:45:48.000 | is the tokenizer of CLIP, so how to tokenize the text into tokens before sending it to
04:45:54.080 | the CLIP embeddings; otherwise we would also need to build the tokenizer, and that's really a lot of
04:45:59.120 | work. I don't allow CUDA and I also don't allow MPS, but you can activate these two
04:46:17.680 | variables if you want to use CUDA or MPS.
04:46:19.920 | If CUDA is available and we allow CUDA, then the device becomes cuda, of course,
04:46:54.720 | and then we print the device we are using.
04:47:11.680 | okay let's load the tokenizer tokenizer is the clip tokenizer we need to tell him what is the
04:47:17.760 | vocabulary file so which is already saved here in the data data vocabulary.json and then also the
04:47:25.440 | merges file maybe one day i will make a video on how the tokenizer works so we can build also the
04:47:32.320 | tokenizer but this is something that requires a lot of time i mean and it's not really related
04:47:38.880 | to the diffusion model so that's why i didn't want to build it the model file is i will use the data
04:47:45.920 | and then this file here then we load the model so the models are model loader dot preload model from
04:47:54.480 | the model file into this device that we have selected okay let's build from text to image
04:48:02.640 | what we need to define the prompt for example i want a cat
04:48:08.160 | sitting or stretching let's say stretching on the floor highly detailed we need to create a
04:48:18.560 | prompt that will create a good image so we need to add some a lot of details ultra sharp cinematic
04:48:25.360 | etc etc 8k resolution the unconditioned prompt
04:48:32.720 | I keep it blank. You can also use it as a negative prompt,
04:48:42.720 | so if you don't want the output to have some, how to say, some
04:48:50.800 | characteristics you can define it in the negative prompt of course i like to do cfg so the
04:48:57.120 | classifier free guidance which we set to true cfg scale is a number between 1 and 14 which
04:49:05.760 | indicates how much attention we want the model to pay to this prompt 14 means pay
04:49:10.560 | very much attention or 1 means we pay very little attention i use 7
04:49:17.920 | then we can define also the parameters for image to image
04:49:20.480 | so input image is equal to none image path is equal to i will define it with my
04:49:31.440 | image of the dog which i already have here and um but for now i don't want to load it
04:49:38.960 | so if we want to load it we need to do input image is equal to image.open
04:49:46.320 | image path but for now i will
04:49:48.480 | i will not use it so now let's comment it and if we use it we need to define the strength
04:49:56.960 | so how much noise we want to add to this image but for now let's not use it
04:50:00.080 | the sampler we will be using of course is the only one we have is the ddpm
04:50:04.640 | the number of inference steps is equal to
04:50:10.880 | 50 and the seed is equal to 42 because it's a lucky number at least according to some books
04:50:19.280 | output image is equal to pipeline generate okay the prompt is the prompt that we have defined
04:50:31.280 | the unconditioned prompt is the unconditioned prompt that we have defined
04:50:36.240 | input image is the input image that we have defined if it's not commented of course
04:50:42.000 | the strength for the image
04:50:44.000 | and the cfg scale is the one we have defined
04:50:53.200 | the sampler name is the sampler name we have defined
04:51:01.760 | the number of inference steps is the number of inference steps the seed
04:51:07.200 | models
04:51:10.000 | device
04:51:14.720 | idle device is our cpu so when we don't want to use something we move it to the cpu
04:51:22.720 | and the tokenizer is the tokenizer
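For reference, the whole notebook call can be condensed into the following sketch; the file paths and the exact keyword names are assumptions based on what was just described, so adjust them to your own setup.

```python
from PIL import Image
from transformers import CLIPTokenizer
import model_loader
import pipeline

DEVICE = "cpu"  # switch to "cuda" or "mps" if you allow them and they are available

tokenizer = CLIPTokenizer("data/vocab.json", merges_file="data/merges.txt")
models = model_loader.preload_models_from_standard_weights("data/v1-5-pruned-emaonly.ckpt", DEVICE)

output_image = pipeline.generate(
    prompt="A cat stretching on the floor, highly detailed, ultra sharp, cinematic, 8k resolution",
    uncond_prompt="",            # the unconditioned (negative) prompt
    input_image=None,            # set to Image.open(image_path) for image-to-image
    strength=0.9,                # only used when an input image is given
    do_cfg=True,
    cfg_scale=7,
    sampler_name="ddpm",
    n_inference_steps=50,
    seed=42,
    models=models,
    device=DEVICE,
    idle_device="cpu",
    tokenizer=tokenizer,
)

Image.fromarray(output_image)    # turn the output array into a PIL image
```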
04:51:27.200 | And then Image.fromarray of the output image. If everything is done well, if all the code has
04:51:37.680 | been written correctly you can always go back to my repository and download the code if you
04:51:43.280 | don't want to write it by yourself let's run the code and let's see what is the result my
04:51:48.880 | computer will take a while so it will take some time so let's run it so if we run the code it will
04:51:56.800 | generate an image according to our prompt in my computer it took really a long time so i cut the
04:52:01.520 | video and i actually already replaced the code with the one from my github because now i want
04:52:08.240 | to actually explain you the code without while showing you all the code together how does it
04:52:14.320 | work so now we we generated an image using only the prompt i use the cpu that's why it's very slow
04:52:20.400 | because my gpu is not powerful enough and we set a unconditioned prompt to zero we are using the
04:52:25.840 | classifier free guidance and with a scale of seven so let's go in the pipeline and let's see
04:52:30.880 | what happens so basically because we are doing the classifier free guidance we will generate
04:52:36.400 | two conditioning signals one with the prompt and one with empty text which is the unconditioned
04:52:43.280 | prompt which is also called the negative prompt this will result in a batch size of two that will
04:52:50.560 | run through the unit so let's go back to here suppose we are doing text to image so now our
04:52:56.240 | unit has two latents that he's doing at the same time because we have the batch size equal to two
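As a reminder, the two predictions are then combined with the classifier-free guidance formula; a one-function sketch (the names are illustrative):

```python
import torch

def apply_cfg(model_output: torch.Tensor, cfg_scale: float) -> torch.Tensor:
    # The batch of 2 contains the prediction for the prompt and for the empty prompt.
    output_cond, output_uncond = model_output.chunk(2)
    return cfg_scale * (output_cond - output_uncond) + output_uncond
```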
04:53:01.680 | and for each of them it is predicting the noise level. But how can we remove this noise from
04:53:10.160 | the predicted noise from the initial noise so because to generate an image we start from random
04:53:17.680 | noise and the prompt initially we encode it with our vae so it becomes a latent which is still
04:53:24.960 | noise and with the unit we predict we predict how much noise is it according to a schedule so
04:53:31.040 | according to 50 steps of inferencing that we will be doing: at the beginning the first step will
04:53:37.360 | be 1000 the next step will be 980 the next step will be 960 etc so this time will change according
04:53:44.720 | to this schedule so that at the 50th step we are at the time step 0 and how can we then with the
04:53:55.440 | predicted noise go to the next latent so we remove this noise that was predicted by the unit well we
04:54:01.840 | do it with the sampler and in particular we do it with the sample method of the sampler step method
04:54:09.040 | sorry of the sampler which basically will calculate the previous sample given the current sample
04:54:14.640 | according to the formula number 7 here so which basically calculates the previous sample
04:54:21.120 | given the current one so the less noisy one given the current one and the predicted x0 so this is
04:54:27.840 | not the true x0, because we don't have x0, so we don't have the sample without any noise, but we
04:54:34.880 | can predict it given the current noisy latent and the beta schedule. Another way of denoising
04:54:42.480 | is to do the sampling like this: if you watch my other repository about the DDPM paper, I actually
04:54:48.000 | implemented it like this if you want to see this version here and this is how we remove the noise
04:54:54.160 | to get a less noisy version so once we get the less noisy version we keep doing this process
04:54:59.920 | until there is no more noise so we are at the time step zero in which we have no more noise
04:55:04.800 | we give this latent to the decoder which will turn it into an image this is how the text to image
04:55:09.920 | works the image to image on the other side so let's try to do the image to image so to do the
04:55:15.200 | image to image we need to go here and we uncomment this code here this allows us to start with the
04:55:25.680 | dog and then give for example some prompt for example we want this dog here we want to say
04:55:31.840 | okay we want a dog stretching on the floor highly detailed etc we can run it i will not run it
04:55:38.560 | because it will take another five minutes and if we do this we can set a strength of let's say 0.6
04:55:45.760 | which means that let's go here so we set a strength of 0.6 so we have this input image
04:55:54.000 | a strength of 0.6 means that we will encode it with the variational autoencoder, it will
04:55:59.680 | become a latent, and we will add some noise, but how much noise? Not all the noise, so that it becomes
04:56:06.560 | completely noise, but less than that, let's say 60 percent noise (which is not really
04:56:14.080 | accurate, because it depends on the schedule; in our case it's linear, so it can be considered
04:56:21.040 | 60 percent of noise). We then give this image to the scheduler, which will start not from the 1000th
04:56:28.240 | step it will start before so if we set the strength to 0.6 it will start from the 600
04:56:34.800 | step and then move by 20 we'll keep going 600 then 580 then 560 then 540 etc until it reaches 20
04:56:46.640 | so in total it will do less steps because we start from a less noisy example but at the same
04:56:52.560 | time because we start with less noise the the unit also has less freedom to change the to alter the
04:57:00.720 | image because he already have the image so he cannot change it too much so how do you adjust
04:57:07.040 | the noise? The noise level depends: if you want the UNet to pay very much attention to the input
04:57:13.440 | image and not change it too much then you add less noise if you want to change completely the
04:57:20.800 | original image then you can add all the possible noise so you set the strength to one and this is
04:57:25.280 | how the image to image works i didn't implement the inpainting because the reason is that the
04:57:32.880 | pre-trained model here so the model that we are using is not fine-tuned for inpainting so if you
04:57:38.080 | go on the website and you look at the model card they have another model for inpainting which has
04:57:45.280 | different weights, this one here. But the structure of this model is also a little
04:57:53.120 | different, because in the UNet they have five additional input channels for the mask.
04:57:58.800 | i will of course implement it in my repository directly so i will modify the code and
04:58:07.440 | also implement the code for inpainting so that we can support this model but unfortunately i don't
04:58:12.960 | have the time now, because here in China it is Guoqing and I'm going to my laojia with my wife, so we are
04:58:19.760 | a little short of time. But I hope that with my video, guys, you really got into stable diffusion
04:58:26.240 | and you understood what is happening under the hood instead of just using the hugging face library
04:58:31.440 | and also notice that the model itself is not so particularly sophisticated if you check the
04:58:39.520 | decoder and the encoder they are just a bunch of convolutions and upsampling and the normalizations
04:58:47.200 | just like any other computer vision model and the same goes on for the unit of course there are very
04:58:53.280 | smart choices in how they do it okay but that's not the important thing of the diffusion and
04:58:59.440 | actually if we study the diffusion models like score models you will see that it doesn't even
04:59:03.680 | matter the structure of the model as long as the model is expressive it will actually learn the
04:59:09.040 | score function in the same way but this is not our case in this video i will talk about score model
04:59:14.000 | in future videos what i want you to understand is that how this all mechanism works together
04:59:20.400 | so how can we just learn a model that predicts the noise and then we come up with images and
04:59:28.160 | let me rehearse again the idea so we started by training a model that needs to learn a probability
04:59:35.760 | distribution as you remember p of theta here we we cannot learn this one directly because we don't
04:59:43.120 | know how to marginalize here so what we did is we find some lower bound for this quantity here and
04:59:49.120 | we maximize this lower bound how do we maximize this lower bound by training a model by running
04:59:54.800 | the gradient descent on this loss this loss produces a model that allow us to predict the
05:00:03.760 | noise then how do we actually use this model with the predicted noise to go back in time with the
05:00:10.880 | noise because the forward process we know how to go it's defined by us how to add noise but in back
05:00:15.680 | in time so how to remove noise we don't know and we do it according to the formulas that i have
05:00:20.880 | described in the sampler so the formula number seven and the formula number also this one actually
05:00:27.360 | we can use. Actually, I will show you: here I have another repository, I think it's
05:00:33.760 | called python ddpm, in which I implemented the DDPM paper but using this algorithm here, so
05:00:39.840 | if you are interested in this version of the denoising you can check my other uh repository
05:00:44.720 | here this one ddpm and i also wanted to show you how the inpainting works how the how the
05:00:52.720 | image to image and how the text to image works of course the possibilities are limitless it all
05:00:59.600 | depends on the powerfulness of the model and how you use it and i hope you use it in a clever way
05:01:07.200 | to build amazing products i also want to thank very much many repositories that i have used
05:01:13.600 | as a self-studying material so because of course i didn't make up all this by myself i studied a
05:01:18.720 | lot of papers i read i think to study this diffusion models i read more than 30 papers
05:01:23.600 | in the last few weeks so it took me a lot of time but i was really passionate about this kind of
05:01:30.240 | models because they're complicated and i really like to study things that can generate new stuff
05:01:35.600 | so i want to really thank in particularly some resources that i have used let me see
05:01:42.880 | this one's here so the official code the this guy divam gupta this other repository from
05:01:49.200 | this person here which i used very much actually as a base and the diffusers library from this
05:01:56.640 | hugging face upon which i based most of the code of my sampler because i think it's better to use
05:02:03.040 | because we are actually just applying some formulas there is no point in writing it from
05:02:06.960 | zero the point is actually understanding what is happening with these formulas and why we are doing
05:02:11.200 | it the things we are doing and as usual the full code is available i will also make all the slides
05:02:16.480 | available for you guys and i hope if you are in china you also have a great holiday with me and
05:02:21.920 | if you're not in china i hope you have a great time with your family and friends and everyone
05:02:25.840 | else so welcome back to my channel anytime and please feel free to comment on send me a comment
05:02:31.840 | or if you didn't understand something or if you want me to explain something better because i'm
05:02:37.520 | always available for explanation and guys i do this not as my full-time job of course i do it
05:02:43.920 | as a part-time and lately i'm doing consulting so i'm very busy but sometime i take time to record
05:02:51.040 | videos and so please share my channel share my video with people if you like it and so that my
05:02:57.680 | channel can grow and i have more motivation to keep doing this kind of videos which take really
05:03:02.800 | a lot of time because to prepare a video like this i spend around many weeks of research but
05:03:08.560 | this is okay i do it for as a passion i don't do it as a job and i spend really a lot of time
05:03:14.640 | preparing all the slides and preparing all the speeches and preparing the code and cleaning it
05:03:20.080 | and commenting it etc etc i always do it for free so if you would like to support me the best way is
05:03:26.400 | to subscribe like my video and share it with other people thank you guys and have a nice day