Coding Stable Diffusion from scratch in PyTorch
Chapters
0:00 Introduction
4:30 What is Stable Diffusion?
5:40 Generative Models
12:07 Forward and Reverse Process
17:44 ELBO and Loss
20:30 Generating New Data
22:20 Classifier-Free Guidance
31:00 CLIP
33:20 Variational Auto Encoder
37:26 Text to Image
39:54 Image to Image
41:40 Inpainting
44:30 Coding the VAE
114:50 Coding CLIP
129:10 Coding the Unet
184:40 Coding the Pipeline
233:00 Coding the Scheduler (DDPM)
278:00 Coding the Inference code
00:00:00.000 |
Hello guys! Welcome to my new video on how to code stable diffusion from scratch. 00:00:05.360 |
And stable diffusion is a model that was introduced last year. 00:00:09.480 |
I think most of you are already familiar with it. 00:00:12.360 |
And we will be coding it from scratch using PyTorch only. 00:00:16.640 |
And as usual my video is going to be quite long, 00:00:19.880 |
because we will be coding from scratch and at the same time I will be explaining each part that makes up stable diffusion. 00:00:27.480 |
So as usual, let me introduce the topics that we will discuss and the prerequisites for watching this video. 00:00:35.520 |
So of course we will discuss stable diffusion because we are going to build it from scratch using only PyTorch. 00:00:41.120 |
So no other libraries will be used except for the tokenizer. 00:00:45.720 |
I will describe the maths of the diffusion models as defined in the DDPM paper, but I will simplify it as much as possible. 00:00:54.240 |
I will show you how classifier-free guidance works and of course we will also implement it, 00:00:59.200 |
how the text-to-image works, image-to-image and in-painting. 00:01:03.360 |
Of course to have a very complete view of diffusion models actually we should also introduce the score-based models 00:01:08.920 |
and all the ODE and SDE theoretical framework. 00:01:12.240 |
But most people are not familiar with ordinary differential equations or even stochastic differential equations. 00:01:18.720 |
So I will not discuss these topics in this video and I'll leave it for future videos. 00:01:23.960 |
So anyway we will have a complete copy of Stable Diffusion; we will be able to generate images using a prompt, 00:01:35.680 |
But for example the samplers based on the Euler method or Runge-Kutta method will not be built in this video. 00:01:42.560 |
I will make a future video in which I describe these ones. 00:01:46.560 |
What do I expect you to have as a prerequisite for watching this video? 00:01:50.560 |
Well, first of all, it's good if you have some notion of probability and statistics, 00:01:55.240 |
so at least you know what is a Gaussian distribution, 00:01:58.160 |
what is the conditional probability, the marginal probability, the likelihood etc. 00:02:02.520 |
Now I don't expect you to have the mathematical formulation in your mind about these concepts, 00:02:08.520 |
but at least the concepts behind them, so at least what do we mean by conditional probability or what do we mean by marginal probability. 00:02:16.040 |
Anyway, even if you're not very strong with mathematics, I will always give a non-mathematical intuition for most concepts. 00:02:22.280 |
So even if you don't have this background you will at least understand the concept behind this, some intuition behind this. 00:02:31.000 |
And of course I expect you to know Python and PyTorch, at least basic level, because we will be coding using Python and PyTorch. 00:02:39.840 |
And then we will be using a lot the attention mechanism, 00:02:43.040 |
so if you're not familiar with the transformer model please watch my previous video on the attention and transformer. 00:02:49.800 |
And we will also be using a lot of convolutions. 00:02:53.320 |
So I don't expect you to know how mathematically the convolution layers work, 00:02:57.400 |
but at least what they do on a practical level in a neural network. 00:03:06.600 |
And because this is going to be a long video, and because stable diffusion and diffusion models in general are quite complex from a mathematical point of view, 00:03:16.840 |
we cannot jump directly to the code without explaining what we are going to code and how it works. 00:03:23.120 |
The first thing I will do is to give you some background knowledge from a mathematical point of view, 00:03:28.600 |
but also from a conceptual point of view of how the diffusion models work and how stable diffusion works. 00:03:41.600 |
Of course at the beginning you will have a lot of ideas that are kind of confused because I will give you a lot of new concepts to grasp. 00:03:49.800 |
And it's normal that you don't understand everything at the beginning. 00:03:52.920 |
But don't worry because while coding I will repeat each concept more than once. 00:03:57.840 |
So while coding you will also get a practical knowledge of what each part is doing and how they interact with each other. 00:04:05.360 |
So please don't be scared if you don't understand everything in the beginning part of this video. 00:04:10.560 |
Later when we start coding it everything will make sense to you. 00:04:14.440 |
But we need this initial part because otherwise we cannot just jump in the dark and start coding without knowing what we are going to code. 00:04:28.720 |
Stable diffusion is a model that was introduced in 2022, so last year, at the end of last year I remember, 00:04:36.080 |
by the CompVis group at the Ludwig Maximilian University in Munich, Germany. 00:04:40.680 |
And it's open source, the weights, the pre-trained weights can be found on the Internet. 00:04:45.560 |
And it became very famous because people started building a lot of projects and products with stable diffusion. 00:04:53.240 |
And one of the most simple use of stable diffusion is to do text to image. 00:04:58.320 |
So given a prompt we want to generate an image. 00:05:01.040 |
We will also see how image to image works and also how in-painting works. 00:05:05.840 |
Image to image means that you already have a picture, for example, of a dog and you want to change it a little bit by using a prompt. 00:05:12.280 |
For example, you want to ask the model to add the wings to the dog so that it looks like a flying dog. 00:05:17.880 |
Or in-painting means that you remove some part of the image. 00:05:21.920 |
For example, you can remove, I don't know, this part here and you ask the model to replace it with some other part that makes sense, 00:05:30.000 |
that is coherent with the image. And we will see also how this works. 00:05:35.800 |
Let's jump into generative models because diffusion models are generative models. 00:05:44.120 |
Well, a generative model learns a probability distribution of the data such that we can then sample from the distribution to create new instances of the data. 00:05:54.640 |
For example, if we have many pictures of cats or dogs or whatever we have, 00:05:59.640 |
we can train a generative model on it and then we can sample from this distribution to create new images of cats or dogs or whatever. 00:06:08.560 |
And this is exactly what we do with stable diffusion. 00:06:11.080 |
We actually have a lot of images, we train it on a massive amount of images, 00:06:15.720 |
and then we sample from this distribution to generate new images that don't exist in our training set. 00:06:22.320 |
But the question that may arise in your mind is: why do we model data as distributions, as probability distributions? 00:06:30.240 |
Well, let me give you an example. Imagine you are a criminal and you want to generate thousands of fake identities. 00:06:38.200 |
Imagine you also live in a very simple world and each fake identity is made up of variables representing the characteristic of a person. 00:06:45.520 |
So age and height. Suppose we only have two variables that make up a person. 00:06:49.600 |
So it's the age of the person and the height of the person. 00:06:53.520 |
In my case, I will be using the centimetre for the height. I think the Americans can convert it to feet. 00:06:59.800 |
And so how do we proceed if we are a criminal with this goal? 00:07:04.120 |
Well, we can ask the statistics department of the government to give us some statistics about the age and the height of the population. 00:07:11.320 |
This information you can easily find online, for example. And then we can sample from this distribution. 00:07:17.800 |
For example, if we model the age of the population like a Gaussian with the mean of 40 and the variance of 30. 00:07:25.840 |
OK, these numbers are made up. I don't know if they reflect the reality. 00:07:29.640 |
And the height in centimetres is 120 as mean and the variance is 100. 00:07:38.120 |
We get these two distributions. Then we can sample from these two distributions to generate a fake identity. 00:07:45.240 |
What does it mean to sample from a distribution? To sample from this kind of distribution means to throw a coin, 00:07:52.280 |
a very special coin that has a very high chance of falling in this area, a lower chance of falling in this area, 00:08:00.200 |
an even lower chance of falling in this area and a very nearly zero chance of falling in this area. 00:08:07.240 |
So imagine we flip this coin once for the age, for example, and it falls here. 00:08:12.880 |
So it's quite probable, not very probable, but quite probable. 00:08:16.760 |
So suppose the age is three, and let me write it. So the age, let's say, is three. 00:08:27.960 |
And then we toss this coin again and the coin falls, let's say, here. 00:08:35.320 |
So the height, let's say, is one hundred thirty. One hundred thirty centimetres. 00:08:43.560 |
So as you can see, the combination of age and height is quite improbable in reality. 00:08:49.960 |
I mean, no three years old is one metre and thirty centimetres high. 00:08:55.160 |
I mean, at least not the ones I know. So this combination of age and height is not very plausible. 00:09:02.600 |
So to produce plausible pairs, we actually need to model these two variables. 00:09:07.320 |
So the age and height, not as independent variables and sample from each of them independently, 00:09:12.840 |
but as a joint distribution. And usually we represent the joint distribution like this, 00:09:19.080 |
where each combination of age and height has a probability score associated with it. 00:09:25.000 |
And from this distribution, we only sample using one coin. 00:09:29.240 |
And for example, this coin with a very high chance will fall in this area, 00:09:35.160 |
with less chance will fall in this area and very close to zero chance of falling in this area. 00:09:40.440 |
Suppose we throw the coin and it ends up in this area. 00:09:44.680 |
Suppose this is the age and this is the height; to get to the corresponding age and height, 00:09:50.280 |
we just need to do like this. And suppose these are actually the real age and the real height. 00:09:55.080 |
Now the numbers here actually do not match, but you get the idea that to model something, 00:10:01.080 |
we need a joint distribution over all the variables. 00:10:03.960 |
And this is actually what we do also with our images. 00:10:06.680 |
With our images, we create a very complex distribution in which, for example, 00:10:12.680 |
each pixel is a distribution and the entirety of all the pixels are one big joint distribution. 00:10:20.120 |
And once we have a joint distribution, we can do a lot of interesting things. 00:10:26.440 |
So, for example, imagine we have a joint distribution over the age and the height. 00:10:30.840 |
So let's call the age X and let's call the height, let's say Y. 00:10:36.200 |
So if we have a joint distribution, which means having P of X and Y, 00:10:42.440 |
which is defined for each combination of X and Y, we can always calculate P of X. 00:10:48.680 |
So the probability over the single variable, by marginalizing over the other. 00:10:53.560 |
So p(x) is the integral of p(x, y) dy, 00:11:02.280 |
which means marginalizing over all the possible Y that we can have. 00:11:06.040 |
And then we can also calculate the probability, the conditional probability. 00:11:09.960 |
For example, we can say that the probability, 00:11:12.360 |
what is the probability of the age being, let's say, from 0 to 3, 00:11:24.920 |
We can do these kinds of queries by using the conditional probability. 00:11:28.760 |
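As a quick reference, the two operations mentioned here can be written as follows, with x standing for the age and y for the height (a notational aside, not part of the original narration):

```latex
% Marginalizing the joint distribution over y gives the marginal of x:
p(x) = \int p(x, y) \, dy
% The conditional probability follows from the joint and the marginal:
p(y \mid x) = \frac{p(x, y)}{p(x)}
```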
So this is actually what we do with the generative model. 00:11:31.400 |
We model our data as a very big joint distribution. 00:11:34.920 |
And then we learn the parameters of this distribution, 00:11:41.480 |
So we let the neural network learn the parameters of this distribution. 00:11:45.160 |
And our goal, of course, is to learn this very complex distribution 00:11:51.000 |
and then sample from it to generate new data, 00:11:53.720 |
just like the criminal before wanted to generate new fake identities 00:12:03.000 |
In our case, we will model our system as a joint distribution 00:12:11.880 |
As you probably are familiar with the diffusion models, 00:12:21.960 |
The forward process means that we have our initial image 00:12:37.240 |
Then we take this image, which has a little noise, 00:12:40.920 |
and we generate a new image that is same as the previous one, 00:12:45.960 |
So as you can see, this one has even more noise, 00:12:49.880 |
until we arrive to the last latent variable called Zt, 00:12:57.480 |
when it becomes completely noise, pure noise, 00:13:16.520 |
So we define how to build the noisified version 00:13:24.520 |
and we have a specific formula, an analytical formula, 00:13:30.600 |
The problem is, we don't have the analytical formula 00:13:49.800 |
to remove noise from something that has noise. 00:13:59.720 |
That's why we are using a neural network for this purpose. 00:14:02.680 |
Now we need to go inside, of course, of the math, 00:14:06.840 |
because we will be using it not only to write the code, 00:14:11.080 |
And in the sampler, it's all about mathematics. 00:14:13.800 |
And I will try to simplify it as much as possible, 00:14:22.280 |
so the Denoising Diffusion Probabilistic Models, 00:14:36.440 |
how can I generate the noisified version of this image 00:14:42.360 |
In this case, actually, this is the joint distribution. 00:14:49.000 |
This means if I have the image at time step t minus one, 00:14:58.920 |
Well, we define it as a Gaussian distribution centered, 00:15:06.600 |
and the variance defined by this beta parameter here. 00:15:19.560 |
This is also known as the Markov chain of noisification, 00:15:23.320 |
because each variable is conditioned on the previous one. 00:15:48.760 |
So this is called the Markov chain of noisification, 00:15:59.800 |
which is a series of Gaussians that add noise. 00:16:08.440 |
to go from the original image to any image at time step t, 00:16:13.320 |
without calculating all the intermediate images, 00:16:33.800 |
This mean here depends on a parameter, alpha, alpha bar, 00:16:43.720 |
And also the variance actually depends on alpha, 00:16:55.880 |
The reverse process means that we have something noisy, 00:17:05.960 |
with a mean, mu theta, and a variance, sigma theta. 00:17:10.600 |
Now, this mean and this variance are not known to us. 00:17:18.520 |
And we will use a neural network to learn these two parameters. 00:17:23.000 |
Actually, the variance, we will also set this at fixed. 00:17:27.800 |
We will parameterize it in such a way that this variance actually is fixed. 00:17:32.360 |
So we hypothesize, we already know the variance. 00:17:34.920 |
And we let the network learn only the mean of this distribution. 00:17:38.680 |
So to rehearse, we have a forward process that adds noise. 00:17:45.800 |
We have a reverse process that we don't know how to denoise. 00:17:49.640 |
So we let a network learn the parameters on how to denoise it. 00:17:54.680 |
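As a rough sketch of the forward process just described, here is the closed-form noisification from the DDPM paper in PyTorch; the linear beta schedule values are an assumption taken from the paper, not from the video:

```python
import torch

# Closed-form forward (noising) process from the DDPM paper:
# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear beta schedule (assumed values)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = product of alpha_s for s <= t

def add_noise(x0: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Noisify x0 to an arbitrary time step t in one shot, without the intermediate images."""
    eps = torch.randn_like(x0)               # epsilon ~ N(0, I)
    ab = alpha_bar[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return x_t, eps
```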
And OK, now that we have defined these two processes, 00:18:01.560 |
Because as you remember, our initial goal is actually to learn 00:18:05.000 |
a probability distribution over our data set. 00:18:12.040 |
But unlike before, when we could marginalize, for example, 00:18:15.880 |
in the case of the criminal who want to generate identities, 00:18:24.600 |
Because we need to marginalize over x1, x2, x3, x4, up to xT. 00:18:24.600 |
And to calculate this integral means to calculate it over all the possible x1. 00:18:38.760 |
So it's a very complex calculation that is computationally intractable, we say. 00:18:55.080 |
So we want to learn the parameter theta of this to maximize the likelihood we can see here. 00:19:00.520 |
What we did is we found a lower bound for this quantity here. 00:19:11.160 |
And if we maximize the lower bound, it will also maximize the likelihood. 00:19:16.040 |
So let me give you a parallel example on what it means to maximize the lower bound. 00:19:26.840 |
And usually, the revenue is more than or equal to the sales of your company. 00:19:35.640 |
Maybe you also have some revenue coming from interest that you get from your bank, et cetera. 00:19:41.000 |
But we can for sure say that the revenue of your company is more than or equal to the sales of your company. 00:19:47.240 |
So if you want to maximize your revenue, you can maximize your sales, for example, 00:19:55.480 |
So if we maximize the sales, we will also maximize the revenue. 00:20:03.880 |
Well, this is the training code for the DDPM diffusion models as defined by the DDPM paper. 00:20:13.240 |
And basically, the idea is after we get the elbow, 00:20:17.000 |
we can parameterize the loss function as this. 00:20:20.360 |
Which says that we need to learn-- we need to train a network called epsilon theta. 00:20:27.800 |
That given a noisy image-- so this formula here 00:20:31.480 |
means the noisy image at time step t and the time step at which the noise was added, 00:20:37.800 |
the network has to predict how much noise is in the image, the noisified image. 00:20:43.240 |
And if we do gradient descent over this loss function here, 00:20:53.400 |
And at the same time, we will also maximize the log likelihood of our data. 00:20:59.880 |
And this is how we train these kind of networks. 00:21:03.000 |
Now, I know that this is a lot of concept that you have to grasp. 00:21:07.960 |
For now, just remember that there is a forward process and there is a reverse process. 00:21:11.400 |
And to train this network to do the reverse process, 00:21:15.000 |
we need to train a network to detect how much noise 00:21:18.280 |
is in a noisified version of the image at time step t. 00:21:21.560 |
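A minimal sketch of this training loop, reusing the schedule and the `add_noise` helper from the earlier sketch; `model` stands in for the UNet (epsilon theta), and the optimizer and data are assumed to be provided:

```python
import torch
import torch.nn.functional as F

def train_step(model, x0: torch.Tensor, optimizer) -> float:
    """One DDPM training step: predict the noise added at a random time step t."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # random time step per image
    x_t, eps = add_noise(x0, t)                       # closed-form noisification (see earlier sketch)
    eps_pred = model(x_t, t)                          # epsilon_theta(x_t, t): predicted noise
    loss = F.mse_loss(eps_pred, eps)                  # simple L2 loss from the DDPM paper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```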
Let me show you how do we-- once we have this network that has already been trained, 00:21:29.800 |
how do we actually sample to generate new data? 00:21:38.600 |
Suppose we already have a network that was trained for detecting how much noise is in there. 00:21:44.680 |
And what we do is we start from complete noise. 00:21:47.720 |
And then we ask the network to detect how much noise is in there. 00:21:53.400 |
And then we ask the network again how much noise is in there. 00:21:57.640 |
And then we ask the network how much noise is there. 00:22:05.160 |
Until we reach this step, then here we will have something new. 00:22:09.560 |
So if we start from pure noise and we do this reverse process many times, 00:22:15.480 |
And this is the idea behind this generative model. 00:22:19.160 |
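A sketch of this reverse (sampling) loop following the DDPM sampling rule, again reusing the schedule defined above; the model's input signature is an assumption:

```python
import torch

@torch.no_grad()
def sample(model, shape):
    """Start from pure noise and repeatedly remove the predicted noise (DDPM sampling)."""
    x = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))          # predicted noise at step t
        alpha_t, ab_t = alphas[t], alpha_bar[t]
        mean = (x - (1 - alpha_t) / (1 - ab_t).sqrt() * eps) / alpha_t.sqrt()
        noise = torch.randn_like(x) if t > 0 else 0.0       # no extra noise at the last step
        x = mean + betas[t].sqrt() * noise
    return x
```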
Now that we know how to generate new data starting from pure noise, 00:22:24.520 |
we also want to be able to control this denoisification process 00:22:29.160 |
so we can generate images of something that we want. 00:22:32.280 |
I mean, how can we tell the model to generate a picture of a cat 00:22:36.120 |
or a picture of a dog or a picture of a house by starting from pure noise? 00:22:40.840 |
Because as of now, by starting from pure noise and keep denoising, 00:22:48.680 |
But it's not like we can control which new image will be generated. 00:22:52.440 |
So we need to find a way to tell the model what we want in this generational process. 00:22:57.160 |
And the idea is that we start from pure noise. 00:23:01.800 |
And during this chain of removing noise, so denoisification, 00:23:12.620 |
Or it can also be called the conditioning signal. 00:23:21.720 |
In which we influence the model into how to remove the noise 00:23:26.280 |
so that the output will move towards what we want. 00:23:29.640 |
To understand how this works, let's review again 00:23:33.560 |
how the training of this kind of networks works. 00:23:37.880 |
To learn how the training of this kind of network goes, 00:23:46.760 |
Okay, as I told you before, our final goal is to model a distribution, 00:23:53.560 |
theta, p of theta, such that we maximize the likelihood of our data. 00:23:58.520 |
And to learn this distribution, we maximize the ELBO, so the lower bound. 00:24:09.240 |
We minimize this loss, minimize this loss here. 00:24:14.840 |
So by minimizing this loss, we maximize the ELBO, 00:24:25.800 |
for the likelihood of our data distribution here. 00:24:32.440 |
Loss function here indicates that we need to create a model, epsilon theta, 00:24:37.160 |
such that if we give this model a noisified image at a particular noise level, 00:24:44.200 |
and we also tell him what noise level we included in this image, 00:24:47.880 |
the network has to predict how much noise is there. 00:24:51.880 |
So this epsilon is how much noise we have added. 00:24:54.520 |
And we can do a gradient descent on this training loop. 00:25:00.280 |
This way we will learn a distribution of our data. 00:25:06.040 |
But as you can see, this distribution doesn't include anything 00:25:09.560 |
that tells the model what is a cat, or what is a dog, or what is a house. 00:25:14.120 |
The model is just learning how to generate pictures that make sense, 00:25:18.280 |
that are similar to our initial training data. 00:25:21.000 |
But they don't know what is the relationship between that picture and the prompt. 00:25:24.680 |
So one idea could be, OK, can we learn a joint distribution of our initial data, 00:25:33.400 |
so all the images, and the conditioning signal, so the prompt? 00:25:38.120 |
Well, this is also something that we don't want, 00:25:40.120 |
because we want to actually learn this distribution, 00:25:45.240 |
We don't want to learn the joint distribution 00:25:47.000 |
that will be too much influenced by the context, 00:25:50.040 |
and the model may not learn the generative process of the data. 00:25:55.560 |
But we also want to find some way to condition this model. 00:26:11.400 |
will be built using, let me show you, this UNet model here. 00:26:16.280 |
This UNet will receive as input an image that is noisified, 00:26:21.240 |
so for example, a cat, with a particular noise level, 00:26:27.400 |
and we also tell him what is the noise level that we added to this cat, 00:26:31.240 |
and we give them both to the input of the UNet, 00:26:33.560 |
and the UNet has to predict how much noise is there. 00:26:38.600 |
What if we introduce also the prompt signal here, 00:26:43.400 |
so the conditioning signal here, so the prompt? 00:26:59.080 |
so the model has more information on how to remove the noise. 00:27:05.080 |
how to remove noise into building something that is more closer to the prompt. 00:27:14.280 |
it means that it will act like a conditioned model, 00:27:16.760 |
so we need to tell the model what is the condition that we want, 00:27:19.560 |
so that the model can remove the noise in that particular way, 00:27:23.720 |
moving the output towards that particular prompt. 00:27:27.480 |
But at the same time, when we train the model, 00:27:31.400 |
instead of only giving images along with the prompt, 00:27:35.960 |
we can also sometimes, with a probability, let's say 50%, 00:27:38.840 |
not give any prompt and let the model remove the noise 00:27:43.720 |
without telling him anything about the prompt. 00:27:46.040 |
So we just give him a bunch of zero when we give him the input. 00:27:50.680 |
This way, the model will learn to act both as a conditioned model 00:27:58.520 |
so the model will learn to pay attention to the prompt 00:28:05.880 |
Is that we can, once when we want to generate a new picture, 00:28:12.920 |
In the first one, suppose you want to generate a picture of a cat, 00:28:31.720 |
to generate a new image, we start from pure noise. 00:28:34.360 |
We indicate the model what is the noise level. 00:28:37.240 |
So at the beginning, it will be t equal to 1000, 00:28:49.320 |
The UNet will predict some noise that we need to remove 00:28:53.800 |
in order to move the image towards what we want as output. 00:29:11.240 |
And again, we give the same input noise as before, 00:29:21.720 |
So it's the same noise with the same noise level, 00:29:30.120 |
which is how to remove the noise to generate something. 00:29:34.040 |
We don't know what, but to generate something 00:29:38.040 |
And then we combine these two output in such a way 00:29:42.360 |
that we can decide how much we want the output 00:29:52.760 |
So this approach here is called classifier-free guidance. 00:29:57.320 |
I will not tell you why it's called classifier-free guidance, 00:29:59.640 |
because otherwise I need to introduce the classifier guidance. 00:30:07.960 |
But the idea is this, that we train a model that, 00:30:11.320 |
when we train it, sometimes we give it the prompt 00:30:16.440 |
so that the model learns to ignore the prompt, 00:30:21.240 |
And when we sample from this model, we do two steps. 00:30:25.480 |
First time, we give him the prompt of what we want. 00:30:51.640 |
The lower this value, the less it will resemble our prompt. 00:30:55.640 |
And this is the idea behind classifier-free guidance. 00:31:01.400 |
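The combination step of classifier-free guidance can be sketched like this; the function name and the default scale of 7.5 are illustrative:

```python
import torch

def cfg_combine(eps_cond: torch.Tensor, eps_uncond: torch.Tensor,
                guidance_scale: float = 7.5) -> torch.Tensor:
    """Combine the conditioned and unconditioned UNet outputs (classifier-free guidance)."""
    # The higher the scale, the more the output follows the prompt;
    # a scale of 1.0 keeps only the conditioned prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```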
we need to give some kind of embedding to the, 00:31:04.360 |
so the model needs to understand this prompt. 00:31:33.480 |
And the text, basically, they took a bunch of images. 00:31:38.360 |
So for example, this picture and its description. 00:31:41.720 |
Then they took another image along with its description. 00:31:44.280 |
So the image one is associated with the text number one, 00:31:50.760 |
Then the image two has the description number two. 00:32:06.760 |
multiplied with all the possible captions here. 00:32:23.320 |
between image and the text is on the diagonal 00:32:26.280 |
because the image one is associated with the text one. 00:32:31.880 |
Image three is associated with the text three. 00:32:35.480 |
Basically, they said they built a loss function 00:32:37.960 |
that they want this diagonal to have the maximum value 00:32:47.160 |
They are not the corresponding description of these images. 00:32:50.680 |
In this way, the model learned how to combine 00:32:54.360 |
the description of an image with the image itself. 00:33:09.160 |
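A rough sketch of this contrastive objective; the embeddings are assumed to come from the image and text encoders, and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric contrastive loss: maximize the diagonal of the image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature              # (N, N) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)  # image i matches text i (the diagonal)
    loss_i = F.cross_entropy(logits, targets)                    # match each image to its caption
    loss_t = F.cross_entropy(logits.t(), targets)                # match each caption to its image
    return (loss_i + loss_t) / 2
```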
And these embeddings are then used as conditioning signal 00:33:13.480 |
for our UNet to denoise the image into what we want. 00:33:17.880 |
Okay, there is another thing that we need to understand. 00:33:25.560 |
we have a forward process that adds noise to the image. 00:33:43.480 |
that we need to do many steps of denoisification 00:33:53.800 |
involves going through the UNet with a noisified image 00:33:57.720 |
and getting as output the amount of noise present in this image. 00:34:04.280 |
so suppose this image here is 512 multiplied by 512, 00:34:29.160 |
so that each step through the UNet takes less time? 00:34:37.000 |
with something that is called the variational autoencoder. 00:34:41.000 |
Let's see how the variational autoencoder works. 00:35:00.440 |
but we learn the latent representation of the data 00:35:20.200 |
And then we can decompress it to build the original data. 00:35:24.680 |
Let me show you actually how it works on a practical level. 00:35:31.560 |
and you want to send it to your friend over the internet. 00:36:15.000 |
and each of them will have a representation in this. 00:36:17.560 |
This is called a code corresponding to each image. 00:36:27.480 |
doesn't make any sense from a semantic point of view. 00:36:30.680 |
So the code associated with the cat, for example, 00:36:44.600 |
And to overcome this limitation of the autoencoder, 00:36:49.640 |
in which we learn to kind of compress the data, 00:37:02.440 |
And we learn the mean and the sigma of this distribution, 00:37:17.240 |
And this is the idea that we use also in stable diffusion. 00:37:25.960 |
to see what is the architecture of the stable diffusion. 00:37:29.960 |
So let's start with how the text-to-image works. 00:37:34.760 |
Now, imagine text-to-image basically works like this. 00:37:55.080 |
We encode it with our variational autoencoder. 00:37:58.760 |
This will give us a latent representation of this noise. 00:38:13.960 |
The goal of the UNet is to detect how much noise is there. 00:38:26.920 |
to make it into a picture that follows the prompt, 00:38:43.480 |
Our scheduler, we will see later what is the scheduler, 00:39:11.560 |
We keep doing this denoisification for many steps 00:39:15.000 |
until there is no more noise present in the image. 00:39:18.920 |
And after we have finished this loop of steps, 00:39:38.440 |
And this is why this is called a latent diffusion model 00:39:44.840 |
always works with the latent representation of the data. 00:40:02.440 |
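Putting the pieces together, the text-to-image flow described here looks roughly like this; every component name (clip, unet, decoder, scheduler) is a placeholder for what we will build later in the video, not an existing API:

```python
import torch

@torch.no_grad()
def text_to_image(prompt_tokens, clip, unet, decoder, scheduler, steps: int = 50):
    context = clip(prompt_tokens)                  # prompt -> embeddings (the conditioning signal)
    latents = torch.randn(1, 4, 64, 64)            # pure noise in the latent space (512 / 8 = 64)
    for t in scheduler.timesteps(steps):           # e.g. from 1000 down to 0
        eps = unet(latents, context, t)            # predicted noise in the latents
        latents = scheduler.step(t, latents, eps)  # remove a bit of the predicted noise
    return decoder(latents)                        # latents -> 512x512 image via the VAE decoder
```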
For example, I want the model to add glasses to this dog 00:40:09.800 |
and hopefully the model will add glasses to this dog. 00:40:16.520 |
with the encoder of the variational autoencoder 00:40:18.920 |
and we get the latent representation of our image. 00:40:29.000 |
But of course, we need to have some noise to denoise. 00:40:34.920 |
and the amount of noise that we add to this image, 00:40:47.400 |
the more the UNet has freedom to alter the image. 00:40:51.720 |
the less freedom the model has to alter the image 00:41:05.000 |
the UNet is forced to modify just a little bit the output image. 00:41:12.920 |
indicates how much we want the model to pay attention 00:41:20.520 |
For many steps, we keep denoising, denoising, denoising, denoising. 00:41:40.280 |
In-painting works similar way to the image-to-image, 00:41:53.880 |
and we want the model to generate new legs for this dog 00:42:14.520 |
We add some noise to this latent representation. 00:42:23.800 |
because I want to generate new legs for this dog. 00:42:26.920 |
And then we pass the noisified input to the UNet. 00:43:23.080 |
that came up with these details of the image, 00:44:28.600 |
Here we are finally coding our stable diffusion. 00:45:10.760 |
and the decoder of the variational autoencoder 00:46:02.440 |
we keep increasing the features of the image. 00:46:56.440 |
And later we download the pre-trained weights 00:47:18.520 |
and the decoder of our variational autoencoder. 00:48:06.840 |
For those who are familiar with computer vision models, 00:48:11.320 |
to the residual block that is used in the ResNet. 00:48:21.000 |
And this will inherit from the sequential module 00:48:56.520 |
but at the same time increases its number of features. 00:49:13.480 |
So the first thing we do, just like in the UNet, 00:49:20.840 |
Initially, our image will have three channels. 00:49:32.120 |
For those who are not familiar with convolutions, 00:49:35.640 |
let's go have a look at how convolutions work. 00:50:02.300 |
So it's made of a matrix of a size that we can decide, 00:50:06.940 |
which is defined by the parameter kernel size, 00:50:17.980 |
And at each block, each of the pixel below the kernel 00:50:23.500 |
is multiplied by the value of the kernel in that position. 00:50:34.860 |
is multiplied by this red value of the kernel. 00:50:42.380 |
is multiplied by the green value of the kernel. 00:50:49.260 |
So this output here comes from four multiplications 00:50:53.500 |
each one with the corresponding number of the kernel. 00:50:57.420 |
This way, basically, by running this kernel through the image, 00:51:00.540 |
we capture local information about the image. 00:51:07.500 |
the information of four pixels, not only one. 00:51:12.540 |
Then we can also increase the kernel size, for example. 00:51:19.180 |
means that we capture more global information. 00:51:28.700 |
And then we can introduce, for example, the stride, 00:51:32.620 |
which means that we don't do it every successive pixel, 00:51:36.460 |
but we skip some pixels, as you can see here. 00:51:41.340 |
And if the number is, the kernel size is even 00:51:54.780 |
which means that it becomes, with the same kernel size, 00:52:04.700 |
but we skip some pixels, et cetera, et cetera. 00:52:12.060 |
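To make the shape arithmetic concrete, here is a small example; the output spatial size is (input + 2*padding - kernel) / stride + 1, rounded down:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 512, 512)                               # (batch, channels, height, width)

same = nn.Conv2d(3, 128, kernel_size=3, padding=1)            # padding compensates for the kernel
print(same(x).shape)                                          # torch.Size([1, 128, 512, 512])

down = nn.Conv2d(3, 128, kernel_size=3, stride=2, padding=0)  # stride 2 roughly halves the spatial size
print(down(x).shape)                                          # torch.Size([1, 128, 255, 255])
# 255 instead of 256: this is why the encoder later applies an asymmetric padding by hand.
```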
from a local area of the picture, of the image, 00:52:23.660 |
will start with our, okay, let's define some shapes. 00:52:29.580 |
so the encoder of the variational autoencoder 00:52:32.060 |
will start with batch size and three channels. 00:52:39.660 |
Then this image will have a height and the width, 00:52:43.580 |
which will be 512 by 512, as we will see later. 00:52:47.900 |
And this convolution will convert it into batch size 128 features 00:52:56.780 |
Why, in this case, do the height and the width not change? 00:53:02.540 |
Because even if we have a kernel size of three, we also have a padding of one, 00:53:11.180 |
which adds something to the top, the bottom, the right and the left of the image. 00:53:12.940 |
So the image with the padding becomes bigger, 00:53:15.580 |
but then the output of the convolution makes it smaller 00:53:24.780 |
But we will see later that with the next blocks, 00:53:45.580 |
this residual block is a combination of convolutions and normalization. 00:53:52.540 |
So it's just a bunch of convolutions that we will define later. 00:53:55.980 |
And this one indicates how many input channels we have 00:54:02.060 |
And the residual block will not change the size of the image. 00:54:17.900 |
And it becomes, it remains the same basically. 00:54:27.580 |
Another residual block with the same transformation. 00:54:35.820 |
And this time the convolution will change the size of the image. 00:54:47.100 |
Because the output channels of the last block is 128. 00:55:02.860 |
This will basically introduce kernel size 3, stride 2. 00:55:33.340 |
Okay, with the stride of 2 and the kernel size of 3. 00:55:38.620 |
So we skip every 2 pixels before calculating the output. 00:55:43.660 |
And this makes the output smaller than the input. 00:55:51.260 |
So this transformation here will have the following shapes. 00:56:05.420 |
So the original height and the width of the input image. 00:56:34.540 |
But this time by increasing the number of features. 00:56:43.420 |
Here by increasing the feature means that we don't increase the size of the image. 00:57:07.260 |
Now you may be confused of why we are doing all of this. 00:57:10.140 |
Okay the idea is we start with the initial image. 00:57:12.460 |
And we keep decreasing the size of the image. 00:57:15.180 |
So later you will see that the image will become divided by 4, divided by 8. 00:57:19.180 |
But at the same time we keep increasing the features. 00:57:36.780 |
And this time the size will become divided by 4. 00:57:59.740 |
Also in this case the size of the image will become half of what is it now. 00:58:26.860 |
So we start from 256 and the image is divided by 4. 00:58:49.200 |
We will see later what is the residual block. 00:58:52.780 |
But the residual block you have to think of it as just a convolution with a normalization. 00:59:02.860 |
And then we have another convolution that will make it even smaller. 00:59:19.020 |
The same kernel size and the same stride and the same padding as before. 00:59:36.300 |
And with the 4 times smaller width it will become 8 times smaller. 01:00:06.860 |
It doesn't change the shape of the image or the number of features. 01:00:11.020 |
So here we are going from divide by 8 and 512 here. 01:00:31.340 |
And later we will see what is the attention block. 01:00:34.220 |
Basically it will run a self-attention over each pixel. 01:00:40.780 |
as you remember, the attention is a way to relate tokens to each other in a sentence. 01:00:48.620 |
the attention can be thought of as a sequence of pixels 01:00:52.300 |
and the attention as a way to relate the pixel to each other. 01:00:57.820 |
And because this way each pixel is related to each other, 01:01:06.620 |
Even if the convolution already actually relates close pixels to each other, 01:01:14.220 |
So even the last pixel can be related to the first pixel. 01:01:19.020 |
And also in this case we don't reduce the size 01:01:23.020 |
because the attention is, the transformer's attention, 01:01:44.300 |
Also no change in shape or size of the image. 01:02:07.820 |
Finally, we have an activation function called SiLU. 01:02:13.820 |
Okay, it's also called the sigmoid linear unit. 01:02:21.420 |
They just saw that this one works better for this kind of application. 01:02:25.580 |
But there is no particular reason to choose one over another, 01:02:31.820 |
except that they thought that practically this one works fine for this kind of models. 01:02:36.460 |
And if you watch my previous video about LLaMA, for example, 01:02:40.860 |
in which we analyzed why they chose the SwiGLU function. 01:02:43.740 |
If you read the paper, at the end of the paper, 01:02:45.580 |
they say that there is no particular reason they chose the SwiGLU. 01:02:48.700 |
They just saw that practically it works better. 01:02:52.460 |
activation function works better than the others. 01:03:05.500 |
Convolution, 512 to 8 channels, kernel size 3, and then padding 1. 01:03:05.500 |
Because just like before, we have the kernel size as 3. 01:03:25.580 |
But we have the padding that compensates for the reduction given by the kernel size. 01:03:30.220 |
But we are decreasing the number of features. 01:03:36.060 |
And I will show you later on the architecture what is the bottleneck. 01:03:57.980 |
Which also doesn't change the size of the image. 01:04:02.540 |
Because if you watch here, if you have a kernel size of 1, 01:04:08.700 |
each kernel basically is running over each pixel. 01:04:13.020 |
So each output actually captures the information of only one pixel. 01:04:16.620 |
So the output has the same dimension as the input. 01:04:18.860 |
And this is why here also we don't change the... 01:04:37.820 |
And this is the list of modules that will make up our encoder. 01:04:44.540 |
Before building the residual block and the attention block, 01:04:50.860 |
let's write the forward method and then we build the residual block. 01:05:25.740 |
And later I will show you why we need some noise. 01:05:27.740 |
That has the same size as the output of the encoder. 01:05:35.500 |
Okay, our input x will be of size batch size with some channels. 01:05:45.820 |
Initially it will be 3 because it's an image. 01:05:55.100 |
This noise has the same size as the output of the encoder. 01:06:16.940 |
Then we just run sequentially all of these modules. 01:06:21.340 |
And then there is one little thing here that in the convolutions that have the stride, 01:06:46.460 |
So if the module has a stride attribute and it's equal to (2, 2), 01:06:46.460 |
this convolution here and this convolution here, 01:06:59.980 |
we don't apply the padding in the convolution, because the padding there would be applied symmetrically to all sides of the image. 01:07:07.180 |
But we want to do an asymmetrical padding, so we do it manually. 01:07:14.380 |
Basically this says: add a layer of pixels on the right side and on the bottom of the image. 01:07:30.380 |
Because when you apply the padding, the tuple is padding left, padding right, padding top, padding bottom. 01:07:42.220 |
This means add a layer of pixels on the right side of the image and on the bottom of the image. 01:07:54.380 |
And then we apply it only for the convolutions that have the stride equal to 2. 01:08:05.660 |
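A sketch of what this forward pass looks like with the manual asymmetric padding; the class name and the trailing part of the method are illustrative:

```python
import torch
import torch.nn.functional as F

class VAE_Encoder(torch.nn.Sequential):
    def forward(self, x: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        for module in self:
            # Only the downsampling convolutions were built with stride 2 and no padding,
            # so we pad them manually and asymmetrically before applying them.
            if getattr(module, "stride", None) == (2, 2):
                # F.pad takes (pad_left, pad_right, pad_top, pad_bottom):
                # one extra column on the right, one extra row at the bottom.
                x = F.pad(x, (0, 1, 0, 1))
            x = module(x)
        # ... the mean / log-variance split and the sampling come next (see the later sketch)
        return x
```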
OK, now you may be wondering why are we building this kind of structure? 01:08:11.500 |
OK, usually in deep learning communities, especially during research, 01:08:18.460 |
So the people who made the stable diffusion, but also the people before them, 01:08:25.100 |
we check what models similar to the one we want to build 01:08:29.260 |
are already out there and they are working fine. 01:08:32.140 |
So very probably the people who built stable diffusion, 01:08:35.900 |
they saw that a model like this is working very well 01:08:39.100 |
for some previous project as a variational autoencoder. 01:08:42.700 |
They just modified it a little bit and kept it like it. 01:08:46.780 |
So for most choices, actually, there is no reason. 01:08:49.740 |
There is a historical reason, because it worked well in practice. 01:08:53.340 |
And we know that convolutions work well in practice 01:08:57.420 |
for image segmentation, for example, or anything related to computer vision. 01:09:01.340 |
And this is why they made the model like this. 01:09:04.940 |
So most encoders actually work like this, that we reduce the size of the image, 01:09:09.500 |
but at each step we keep increasing the features of the image, 01:09:12.860 |
the channels, the number of channels of the image. 01:09:18.620 |
but each pixel is represented by more than three channels. 01:09:23.980 |
Now, what we do here is we run our image sequentially, 01:09:30.940 |
one by one, through all of these modules here. 01:09:37.900 |
So first through this convolution, then through this residual block, 01:09:41.020 |
which is also some convolutions, then this residual block, 01:09:44.860 |
then again convolution, convolution, convolution, 01:09:46.940 |
until we run it through this attention block and et cetera. 01:09:50.460 |
This will transform the image into something smaller, 01:09:57.500 |
But as I showed you before, this is not an autoencoder. 01:10:02.940 |
So the variational autoencoder, let me show you again the picture here. 01:10:13.260 |
And this latent space are the parameters of a multivariate Gaussian distribution. 01:10:19.340 |
So actually, the variational autoencoder is trained to learn the mu and the sigma, 01:10:25.260 |
so the mean and the variance of this distribution. 01:10:30.460 |
And this is actually what we will get from the output of this variational autoencoder, 01:10:42.140 |
I made a previous video about the variational autoencoder, 01:10:45.100 |
in which I show you also why the history of why we do it like this, 01:10:51.820 |
But for now, just remember that this is not just a compressed version of the image, 01:10:58.300 |
And then we can sample from this distribution. 01:11:02.380 |
So the output of the variational autoencoder is actually the mean and the variance. 01:11:08.220 |
And actually, it's actually not the variance, but the log variance. 01:11:11.420 |
So the mean and the log variance are equal to torch.chunk(x, 2, dim=1). 01:11:27.500 |
So this basically converts batch size, 8 channels, height, 01:11:36.940 |
which is the output of the last layer of this encoder. 01:11:44.380 |
So this chunk basically means divide it into two tensors along this dimension. 01:11:49.500 |
So along this dimension, it will become two tensors of size, 01:11:55.980 |
So two tensors of shape, batch size 4, then height divided by 8, and width divided by 8. 01:12:12.540 |
And this basically, the output of this actually represents the mean and the variance. 01:12:21.660 |
And what we do, we don't want the log variance, we want the variance actually. 01:12:27.340 |
So to transform the log variance into variance, we do the exponentiation. 01:12:32.540 |
So the first thing actually we also need to do is to clamp this variance, 01:12:39.100 |
So clamping means that if the variance is too small or too big, 01:12:42.460 |
we want it to become within some ranges that are acceptable for us. 01:12:50.700 |
tells the PyTorch that if the value is too small or too big, make it within this range. 01:12:55.900 |
And this doesn't change the shape of the tensors. 01:13:04.300 |
And then we transform the log variance into variance. 01:13:09.660 |
So the variance is equal to the log variance dot exp, 01:13:16.140 |
So you delete the log and it becomes the variance. 01:13:18.620 |
And this also doesn't change the size of the shape of the tensor. 01:13:25.100 |
And then to calculate the standard deviation from the variance, 01:13:28.140 |
as you know, the standard deviation is the square root of the variance. 01:13:31.180 |
So standard deviation is the variance dot sqrt. 01:13:40.300 |
And also this doesn't change the size of the tensor. 01:13:44.540 |
OK, now what we want, as I told you before, this is a latent space. 01:13:51.500 |
It's a multivariate Gaussian, which has its own mean and its own variance. 01:13:56.060 |
And we know the mean and the variance, this mean and this variance. 01:14:03.740 |
Well, what we can sample from, basically, is N(0, 1). 01:14:12.620 |
how do we convert it into a sample of a given mean and the given variance? 01:14:19.100 |
This, as if you remember from probability and statistics, 01:14:25.340 |
you can convert it into any other sample of a Gaussian 01:14:28.860 |
with a given mean and a variance through this transformation. 01:14:32.140 |
So if z is equal to a sample from N(0, 1), 01:14:37.180 |
we can transform it into a sample of another Gaussian, let's call it x, 01:14:49.980 |
which is the mean of the new distribution plus the standard deviation of the new distribution multiplied by z. 01:14:54.780 |
This is the transformation, this is the formula from probability and statistics. 01:14:58.860 |
Basically means transform this distribution into this one, 01:15:03.020 |
which basically means sample from this distribution. 01:15:05.980 |
This is why we are given also the noise as input, 01:15:09.100 |
because the noise we want it to come from with a particular seed of the noise generator. 01:15:14.700 |
So we ask is as input and we sample from this distribution like this, 01:15:19.180 |
x is equal to mean plus standard deviation multiplied by noise. 01:15:25.180 |
Finally, there is also another step that we need to scale the output by a constant. 01:15:34.460 |
This constant, I found it in the original repository. 01:15:37.740 |
So I'm just writing it here without any explanation on why, 01:15:43.260 |
It's just a scaling constant that they use at the end. 01:15:46.460 |
I don't know if it's there for historical reason, 01:15:49.020 |
because they use some previous model that had this constant, 01:15:51.260 |
or they introduced it for some particular reason. 01:15:54.060 |
But it's a constant that I saw it in the original repository. 01:15:57.100 |
And actually, if you check the original parameters of the stable diffusion model, 01:16:02.460 |
So I am also scaling the output by this constant. 01:16:10.620 |
except that we didn't build the residual block and the attention block here, 01:16:14.940 |
we built the encoder part of the variational autoencoder and also the sampling part. 01:16:20.140 |
So we take the image, we run it through the encoder, it becomes very small. 01:16:26.940 |
And then we sample from that distribution given the mean and the variance. 01:16:31.340 |
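For reference, the tail of the encoder forward pass described above can be sketched like this; the clamp range and the 0.18215 scaling constant are the values I believe the original repository uses, so treat them as assumptions to verify:

```python
import torch

def encode_tail(x: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Last part of the encoder forward pass: from (B, 8, H/8, W/8) to a sampled latent."""
    mean, log_variance = torch.chunk(x, 2, dim=1)      # two tensors of shape (B, 4, H/8, W/8)
    log_variance = torch.clamp(log_variance, -30, 20)  # keep the value within an acceptable range
    variance = log_variance.exp()                      # log variance -> variance
    stdev = variance.sqrt()                            # variance -> standard deviation
    x = mean + stdev * noise                           # sample from N(mean, stdev^2) via N(0, 1)
    x *= 0.18215                                       # scaling constant from the original repo (assumed value)
    return x
```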
Now we need to build the decoder along with the residual block and the attention block. 01:16:38.780 |
we do the opposite of what we did in the encoder. 01:16:42.220 |
So we will reduce the number of channels and at the same time, 01:17:31.820 |
Let's define first the residual block, the one we defined before, 01:17:39.900 |
so that you understand what is this residual block. 01:17:42.860 |
And then we define the attention block that we defined before. 01:18:00.860 |
Okay, this is made up of normalization and convolutions, like I said before. 01:18:09.420 |
There is a normalization, which is group norm 1. 01:18:09.420 |
And then there is another group normalization. 01:19:05.900 |
Skip connection basically means that you take the input, 01:19:09.660 |
you skip some layers, and then you connect it there with the output of the last layer. 01:19:17.820 |
If the two channels are different, we need to create another intermediate layer. 01:20:05.260 |
Okay, the input of this residual layer, as you saw before, 01:20:10.780 |
is something that has a batch with some channels, 01:20:15.020 |
and then height and width, which can be different. 01:20:19.500 |
Sometimes it's 512 by 512, sometimes it's half of that, 01:20:25.980 |
So suppose x is (batch size, in channels, height, width). 01:20:41.420 |
We call it the residual or residue is equal to x. 01:20:53.660 |
And this doesn't change the shape of the tensor. 01:21:00.780 |
And this also doesn't change the size of the tensor. 01:21:10.300 |
This also doesn't change the size of the tensor, 01:21:18.220 |
because as you can see here, we have kernel size 3, yes, 01:21:22.860 |
With the padding of 1, actually, it will not change the size of the tensor. 01:21:28.140 |
Then we apply again the group normalization 2. 01:21:32.540 |
This again doesn't change the size of the tensor. 01:21:46.700 |
And finally, we apply the residual connection, 01:21:54.060 |
which basically means that we take x plus the residual. 01:21:58.940 |
But if the number of output channels is not equal to the input channels, 01:22:07.420 |
because this dimension will not match between the two. 01:22:13.660 |
to convert the input channels to the output channels of x, 01:22:19.180 |
So what we do is, we apply this residual layer. 01:22:29.660 |
So as I told you, it's just a bunch of convolutions and group normalization. 01:22:33.580 |
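A sketch of this residual block with the pieces named so far (GroupNorm with 32 groups, two 3x3 convolutions, SiLU activations, and a 1x1 convolution for the skip connection when the channel counts differ); the exact class and attribute names are illustrative:

```python
import torch
from torch import nn
from torch.nn import functional as F

class VAE_ResidualBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.groupnorm_1 = nn.GroupNorm(32, in_channels)
        self.conv_1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.groupnorm_2 = nn.GroupNorm(32, out_channels)
        self.conv_2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        if in_channels == out_channels:
            self.residual_layer = nn.Identity()
        else:
            # 1x1 convolution so the skip connection matches the output channels
            self.residual_layer = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residue = x
        x = self.groupnorm_1(x)
        x = F.silu(x)
        x = self.conv_1(x)
        x = self.groupnorm_2(x)
        x = F.silu(x)
        x = self.conv_2(x)
        return x + self.residual_layer(residue)   # skip connection
```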
And for those who are familiar with the computer vision models, 01:22:43.180 |
Let's go build the attention block that we used also before in the encoder. 01:22:49.100 |
And to define the attention, we also need to define the self-attention. 01:22:55.660 |
which is used in the variational autoencoder. 01:22:57.340 |
And then we define what is this self-attention. 01:23:28.140 |
Again, the channel is always 32 here in stable diffusion. 01:23:34.060 |
But you also may be wondering, what is group normalization, right? 01:23:37.420 |
So let's go to review it, actually, since we are here. 01:23:40.780 |
And, okay, if you remember from my previous slides on LLaMA, 01:23:47.500 |
let's go here, where we use a layer normalization. 01:23:52.620 |
And also in the vanilla transformer, actually, we use layer normalization. 01:24:00.220 |
Normalization is basically when we have a deep neural network, 01:24:03.660 |
each layer of the network produces some output that is fed to the next layer. 01:24:08.700 |
Now, what happens is that if the output of a layer is varying in distribution, 01:24:14.700 |
so sometimes, for example, the output of a layer is between 0 and 1, 01:24:18.380 |
but the next step, maybe it's between 3 and 5, 01:24:22.140 |
and the next step, maybe it's between 10 and 15, etc. 01:24:25.740 |
So the distribution of the output of a layer changes, 01:24:32.060 |
that is very different from what the layer is used to see. 01:24:36.860 |
This will basically push the output of the next layer into a new distribution itself, 01:24:42.620 |
which, in turn, will push the loss function into, 01:24:45.900 |
basically, the output of the model to change very frequently in distribution. 01:24:56.940 |
sometimes it will be negative, sometimes it will be positive, etc. 01:24:59.740 |
And this basically makes the loss function oscillate too much, 01:25:05.980 |
So what we do is we normalize the values before feeding them into layers, 01:25:09.740 |
such that each layer always sees the same distribution of the data. 01:25:13.820 |
So it will always see numbers that are distributed around 0 with a variance of 1. 01:25:19.260 |
And this is the job of the layer normalization. 01:25:21.580 |
So imagine you are a layer, and you have some input, 01:25:27.260 |
Each item has some features, so feature 1, feature 2, feature 3. 01:25:31.180 |
Layer normalization calculates a mean and the variance over these features here, 01:25:38.380 |
and then normalizes this value according to this formula. 01:25:42.140 |
So each value basically becomes distributed around 0 with a variance of 1. 01:25:46.460 |
With batch normalization, we normalize by columns, 01:25:50.700 |
so the statistics mean and the sigma is calculated by columns. 01:25:54.380 |
With layer normalization, it is calculated by rows, 01:26:03.100 |
Group normalization is like layer normalization, but not over all of the features of the item; the features are grouped. 01:26:10.860 |
So for example, imagine you have four features here. 01:26:13.500 |
So here you have F1, F2, F3, F4, and you have two groups. 01:26:18.620 |
Then the first group will be F1 and F2, and the second group will be F3 and F4. 01:26:26.860 |
one for the first group, one for the second group. 01:26:32.540 |
Why do we want to group this kind of features? 01:26:35.740 |
Because these features actually, they come from convolutions. 01:26:39.180 |
And as we saw before, let's go back to the website. 01:26:45.740 |
Each output here actually comes from local area of the image. 01:26:55.260 |
two things that are close to each other, may be related to each other. 01:26:59.340 |
So two things that are far from each other are not related to each other. 01:27:02.540 |
This is why we can group, we can use group normalization in this case. 01:27:07.260 |
Because closer features to each other will have kind of the same distribution, 01:27:15.020 |
and things that are far from each other may not. 01:27:17.580 |
This is the basic idea behind group normalization. 01:27:20.300 |
But the whole idea behind the normalization is that 01:27:22.700 |
we don't want these things to oscillate too much. 01:27:25.260 |
Otherwise, the loss of function will oscillate 01:27:28.860 |
With normalization, we make the training faster. 01:27:36.300 |
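A quick comparison of the three normalizations on a (batch, channels, height, width) tensor; the shapes and the 32 groups are illustrative:

```python
import torch
from torch import nn

x = torch.randn(4, 128, 32, 32)            # (batch, channels, height, width)

bn = nn.BatchNorm2d(128)                   # statistics per channel, computed across the batch
ln = nn.LayerNorm([128, 32, 32])           # statistics per item, across all of its features
gn = nn.GroupNorm(32, 128)                 # statistics per item, within groups of 128/32 = 4 channels

print(bn(x).shape, ln(x).shape, gn(x).shape)  # all keep the shape: torch.Size([4, 128, 32, 32])
```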
So now the attention block has this group normalization and also an attention, 01:27:54.300 |
It takes a torch.Tensor and returns, of course, a torch.Tensor. 01:28:02.060 |
The input of this block is something, where is it? 01:28:07.500 |
It's something in the form of batch size, number of channels, height and width. 01:28:11.980 |
But because it will be used in many positions, this attention block, 01:28:17.500 |
So we just say that x is something that is a batch size, 01:28:21.900 |
features or channels, if you want, height and width. 01:28:28.940 |
And the first thing we do is we extract the shape. 01:28:34.860 |
So n is the batch size, the number of channels, 01:28:38.220 |
the height and the width is equal to x.shape. 01:28:45.340 |
we do the self-attention between all the pixels of this image. 01:28:50.700 |
This will transform this tensor here into this tensor here. 01:29:06.620 |
So now we have a sequence where each item represents a pixel 01:29:20.620 |
This will transform this shape into this shape. 01:29:31.180 |
So this one comes before and features becomes the last one. 01:29:43.660 |
this is like when we do the attention in the transformer model. 01:29:47.500 |
So in the transformer model, we have a sequence of tokens. 01:29:50.220 |
Each token is representing, for example, a word. 01:29:52.700 |
And the attention basically calculates the attention between each token. 01:29:57.180 |
So how do two tokens are related to each other? 01:30:00.220 |
In this case, we can think of it as a sequence of pixels. 01:30:03.420 |
Each pixel with its own embedding, which is the features of that pixel. 01:30:17.260 |
In which self-attention means that the query key and values are the same input. 01:30:36.060 |
So because we put it in this form only to do attention. 01:30:51.900 |
And then again, we remove this multiplication by viewing again the tensor. 01:31:17.660 |
The residual connection will not change the size of the input. 01:31:25.900 |
Let me check also the residual connection here. 01:31:29.260 |
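A sketch of the attention block's forward pass, turning the pixels into a sequence for self-attention and back; the single-head attention here is a stand-in for the SelfAttention class we build next:

```python
import torch
from torch import nn

class VAE_AttentionBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.groupnorm = nn.GroupNorm(32, channels)
        # Stand-in for the single-head SelfAttention class built in the next section.
        self.attention = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residue = x
        x = self.groupnorm(x)
        n, c, h, w = x.shape
        x = x.view(n, c, h * w)          # (B, C, H, W) -> (B, C, H*W)
        x = x.transpose(-1, -2)          # (B, C, H*W) -> (B, H*W, C): a sequence of pixels
        x, _ = self.attention(x, x, x)   # self-attention: query, key and value are the same
        x = x.transpose(-1, -2)          # back to (B, C, H*W)
        x = x.view(n, c, h, w)           # back to (B, C, H, W)
        return x + residue               # residual connection keeps the shape unchanged
```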
Now that we have also built the attention block, let's build also the self-attention. 01:31:35.180 |
And the attentions, because we have two kinds of attention in the stable diffusion. 01:31:44.940 |
So let's go build it in a separate class called "Attention". 01:32:18.940 |
I think you guys maybe want to review the attention before building it. 01:32:28.860 |
I have here opened my slides from my video about the attention model for the transformer model. 01:32:35.740 |
So the self-attention, basically, it's a way for, especially in a language model, 01:32:42.140 |
is a way for us to relate tokens to each other. 01:32:47.420 |
Each one of them having an embedding of size d model. 01:32:50.540 |
And we transform it into queries, key, and values. 01:32:53.260 |
In which query, key, and values in the self-attention are the same matrix, same sequence. 01:33:01.180 |
So wq, wk, and wv, which are parameter matrices. 01:33:05.340 |
Then we split them along the d model dimension into number of heads. 01:33:12.700 |
In our case, the one attention that we will do here is actually only one head. 01:33:19.740 |
And then we calculate the attention for each of this head. 01:33:23.580 |
Then we combine back by concatenating this head together. 01:33:28.780 |
We multiply this output matrix of the concatenation with another matrix called wo, 01:33:36.620 |
And then this is the output of the multi-head attention. 01:33:40.700 |
If we have only one head, instead of being a multi-head, 01:33:44.540 |
then we will not do this splitting operation. 01:33:46.940 |
We will just do this multiplication with the w and with the wo. 01:33:51.260 |
And OK, this is how the self-attention works. 01:33:54.940 |
So in a self-attention, we have this query key and values coming from the same matrix input. 01:34:16.700 |
But in our case, we are not talking about tokens. 01:34:21.340 |
And we can think that the number of channels of each pixel is the embedding of the pixel. 01:34:26.700 |
So the embedding, just like in the original transformer, 01:34:30.380 |
the embeddings are the kind of vectors that capture the meaning of the word. 01:34:36.540 |
Each channel, each pixel represented by many channels 01:34:39.900 |
that capture the information about that pixel. 01:34:41.980 |
Here we have also the bias for the w matrices, 01:34:49.420 |
which we don't have in the original transformer. 01:35:09.580 |
We will represent it as one big linear layer. 01:35:11.980 |
Instead of representing it as three different matrices, it's possible. 01:35:17.500 |
We just say that it's one big matrix, of size three times the embedding dimension. 01:37:30.620 |
because it's a projection of the input before we apply the attention. 01:35:34.300 |
And then there is an output projection, which is applied after we apply the attention. 01:35:47.100 |
So as you remember here, the wo matrix is actually d_model by d_model. 01:36:08.060 |
And then we saved the dimension of each head. 01:36:15.500 |
The dimension of each head basically means that if we have multi head, 01:36:18.780 |
each head will watch a part of the embedding of each token. 01:36:43.660 |
As you remember, the mask is a way to avoid relating tokens, 01:36:47.580 |
one particular token with the tokens that come after it, 01:36:58.380 |
If you really are not understanding what is happening here in the attention, 01:37:05.500 |
I highly recommend you watch my previous video, 01:37:09.980 |
And if you watch it, it will take not so much time. 01:37:16.300 |
So the first thing we do is extract the shape. 01:37:31.420 |
So the batch size, the sequence length and the embedding are equal to the input shape. 01:37:38.780 |
And then we say that we will convert it into another shape 01:37:48.460 |
This is called the interim shape, intermediate shape. 01:38:06.700 |
We apply the in projection, so the wq, wk and wv matrices, to the input, 01:38:16.620 |
We multiply it, but then we split it with chunk. 01:38:23.900 |
Basically, we will multiply the input with the big matrix 01:38:31.420 |
but then we split it back into three smaller matrices. 01:38:34.700 |
This is the same as applying three different projections. 01:38:38.960 |
It's the same as applying three separate in projections, 01:38:43.500 |
but it's also possible to combine it in one big matrix. 01:38:46.860 |
This, what we will do, basically it will convert batch size, 01:38:56.060 |
sequence length, dimension into batch size, sequence length, dimension multiplied by three. 01:39:05.740 |
And then by using chunk, we split it along the last dimension 01:39:09.820 |
into three different tensors of shape, batch size, sequence length and dimension. 01:39:23.900 |
Okay, now we can split the query key and values in the number of heads. 01:39:31.180 |
According to the number of heads, this is why we built this shape, 01:39:34.220 |
which means split the dimension, the last dimension into n heads. 01:40:08.620 |
batch size, sequence length, dimension into batch size, sequence length, 01:40:17.180 |
then h, so the number of heads and each dimension divided by the number of heads. 01:40:26.060 |
but only a part of the embedding of each token, in this case, pixel. 01:40:35.180 |
So the full dimension, the embedding divided by the number of heads. 01:40:39.500 |
And then this will convert it, because we are also transposing, 01:40:55.900 |
So each head will watch all the sequence, but only a part of the embedding. 01:41:03.180 |
We then calculate the attention, just like the formula. 01:41:07.100 |
So query multiplied by the transpose of the keys. 01:41:09.900 |
So is the query, matrix multiplication with the transpose of the keys. 01:41:19.740 |
batch size, h, sequence length by sequence length. 01:41:30.860 |
As you remember, the mask is something that we apply when we calculate the attention, 01:41:35.980 |
if we don't want two tokens to relate to each other. 01:41:41.820 |
In this matrix, we substitute the interaction with minus infinity before applying the softmax, 01:41:54.540 |
This will create a causal mask, basically a mask where the upper triangle, 01:42:00.700 |
so above the principal diagonal, is made up of ones. 01:42:32.940 |
Masked fill, but with mask, and we put minus infinity, like this. 01:42:46.220 |
As you remember, the formula of the transformer is a 01:42:48.860 |
query multiplied by the transpose of the keys, and then divided by the square root of d_model. 01:42:55.580 |
Here, since we split into heads, we divide by the square root of d_head instead. 01:43:24.540 |
So we want to remove, now we want to remove the head dimension. 01:43:29.420 |
So output is equal to, let me write some shapes. 01:43:37.020 |
This is equal to batch size, H, sequence length by sequence length, 01:43:45.580 |
multiplied, so matrix multiplication, with batch size, H, sequence length, dimension divided by H. 01:43:56.620 |
This will result into batch size, H, sequence length, and dimension divided by H. 01:44:09.820 |
This we then multiplied by the, we then transpose. 01:44:15.660 |
And this will result into, so we start with this one. 01:44:23.580 |
And it becomes batch size, sequence length, H, dimension divided by H. 01:44:44.620 |
Then we can reshape as the input, like the initial shape, so this one. 01:45:26.220 |
Now let's go back to continue building the decoder. 01:45:29.100 |
For now we have built the attention block and the residual block. 01:45:45.740 |
And also this one is a sequence of modules that we will apply one after another. 01:46:00.860 |
We start with the convolution just like before. 01:46:03.180 |
Now I will not write again the shapes change, but you got the idea. 01:46:07.100 |
In the encoder we, in the encoder, let me show you here. 01:46:17.900 |
In the encoder we keep reducing the size of the image until it becomes small. 01:46:22.540 |
In the decoder we need to return to the original size of the image. 01:46:27.180 |
So we start with the latent dimension and we return to the original dimension of the image. 01:46:39.100 |
So we start with four channels and we output four channels. 01:47:03.100 |
Then we have a residual block just like before. 01:47:23.180 |
Then we have a bunch of residual blocks and we have four of them. 01:47:45.360 |
Now the residual blocks, let me write some shapes here. 01:47:49.820 |
Here we arrived to a situation in which we have batch size. 01:47:53.660 |
We have 512 features and the size of the image still didn't grow 01:47:58.700 |
because we didn't have any convolution that will make it grow. 01:48:02.140 |
This one of course will remain the same because it's a residual block and etc. 01:48:15.580 |
So now the image is actually height divided by 8 which height as you remember is 512, 01:48:21.740 |
the size of the image that we are working with. 01:48:31.260 |
The upsample, we have to think of it like when we resize an image. 01:48:42.220 |
So imagine you have an image that is 64 by 64 01:48:48.860 |
The upsample will do it just like when we resize an image. 01:48:57.580 |
So along the dimensions right and down for example twice. 01:49:02.220 |
So that the total amount of pixels, the height and the width actually doubles. 01:49:10.860 |
It will just replicate each pixel so that by this scale factor along each dimension. 01:49:24.060 |
So height divided by 8 and width divided by 8 become height divided by 4 and width divided by 4, as we see here. 01:50:09.340 |
But in this case we have three of them, 2, 3. 01:50:14.060 |
This will again double the size of the image. 01:50:17.660 |
So we have another one that will double the size of the image. 01:50:23.340 |
So now our image which was divided by 4 with 512 channels. 01:50:45.980 |
And then we have three residual blocks again. 01:50:54.940 |
But this time we reduce the number of features. 01:51:06.780 |
Okay, then we have another upsampling which will again double the size of the image. 01:51:14.300 |
And this time we will go from divided by 2 up to the original size. 01:51:26.220 |
And because the number of channels has changed, we are not 512 anymore. 01:51:35.820 |
This case with 256 because it's the new number of features. 01:51:40.540 |
Then we have another bunch of residual blocks that will decrease the number of features. 01:52:12.700 |
So we group features in groups of 32 before calculating the mu and the sigma before normalizing. 01:52:20.540 |
And we define the number of channels as 128 which is the number of features that we have. 01:52:25.180 |
So this group normalization will divide these 128 features into groups of 32. 01:52:41.340 |
The final convolution that will transform into an image with the three channels. 01:52:47.660 |
So RGB by applying these convolutions here which doesn't change the size of the output. 01:52:54.380 |
So we'll go from an image that is batch size 128 height width. 01:53:02.380 |
Because after the last upsampling we become of the original size into an image with only three channels. 01:53:36.780 |
I'm sorry if I'm putting a lot of spaces between here. 01:53:39.580 |
But otherwise it's easy to get lost and not understand where we are. 01:53:43.580 |
So here the input of the decoder is our latent. 01:53:50.620 |
So it's batch size 4 height divided by 8 width divided by 8. 01:53:57.500 |
As you remember, here in the encoder the last thing we do is scale by this constant, so in the decoder we reverse it. 01:54:13.980 |
And then return x which is batch size 3 height and width. 01:54:31.260 |
Let me also write the input of this decoder which is this one. 01:54:43.740 |
We are building our architecture of the stable diffusion. 01:54:50.620 |
So far we have built the encoder and the decoder. 01:54:53.980 |
But now we have to build the unit and then we have to build the clip text encoder. 01:55:01.020 |
And finally we have to build the pipeline that will connect all of these things. 01:55:05.740 |
So it's going to be a long journey but it's fun actually to build things. 01:55:10.220 |
Because you learn every detail of how they work. 01:55:13.180 |
So the next thing that we are going to build is the text encoder. 01:55:16.780 |
So this clip encoder here that will allow us to encode the prompt into embeddings 01:55:22.700 |
that we can then feed to this unit model here. 01:55:28.620 |
And we will of course use a pre-trained version. 01:55:31.180 |
So by downloading the vocabulary and I will show you how it works. 01:55:39.340 |
We create a new file in the sd folder called clip.py. 01:56:08.060 |
And we also import self-attention because we will be using it. 01:56:15.100 |
It's very similar to the encoder layer of the transformer. 01:56:24.140 |
This is the encoder layer of the transformer. 01:56:26.460 |
It's made of attention and then feed forwards. 01:56:30.060 |
And there are many blocks like this one after another that are applied one after another. 01:56:34.140 |
We also have something that tells the position of each token inside of the sentence. 01:56:38.700 |
And we will also have something similar in clip. 01:56:41.020 |
So we need to build something very similar to this one. 01:56:44.460 |
And actually this is why I mean the transformer model was very successful. 01:56:48.140 |
So that's why they use the same structure of course also for this purpose. 01:56:57.500 |
I will build first the skeleton of the model and then we will build each block. 01:57:15.820 |
The embeddings allow us to convert the tokens. 01:57:19.660 |
So as you remember, when you have a sentence made up of text, it is first converted into numbers (tokens). 01:57:25.260 |
Where each number indicates the position of the token inside of the vocabulary. 01:57:31.180 |
Where each embedding represents a vector of size 512 in the original transformer. 01:57:40.300 |
And each vector kind of represents the meaning that the word or the token captures. 01:58:00.140 |
The maximum sequence length that we can have. 01:58:05.260 |
We should actually use some configuration file to save these values, 01:58:09.820 |
but because we will be using the pre-trained stable diffusion model, I am hard-coding them. 01:58:16.060 |
But in the future I will refactor the code to add some configuration actually. 01:58:37.340 |
Which indicates the number of heads of the multi-head attention. 01:59:25.340 |
That indicate the position of each token inside of the vocabulary. 02:00:18.060 |
And the last one we apply the layer normalization. 02:00:44.140 |
The shape of the input should match the shape of the output. 02:00:47.740 |
So we always obtain sequence length by the model. 02:01:40.540 |
We need to tell him what is the number of embeddings. 02:01:51.260 |
And what is the dimension of each vector of the embedding token. 02:01:59.580 |
The positional encoding in the original transformer. 02:02:13.260 |
That are learned by the model during training. 02:02:15.980 |
That tell the position of the token to the model. 02:02:51.980 |
And then just like in the original transformer. 02:03:02.460 |
We add the positional encodings to each token. 02:03:17.260 |
And then later we will load these parameters. 02:03:26.220 |
Which is just like the layer of the transformer model. 02:05:51.980 |
Because it's the same input that becomes query key and values. 02:06:07.580 |
And the dimension of the embedding which is 768. 02:06:11.980 |
The first thing we do is we apply the self attention. 02:06:39.180 |
Which basically means that every token cannot watch the next tokens. 02:06:46.860 |
And this is what we want from a text model actually. 02:06:49.900 |
We don't want the one word to watch the words that come after it. 02:07:28.780 |
Because here we are already familiar with the structure of the transformer. 02:07:35.900 |
We apply the first linear of the feed forward. 02:07:49.740 |
And actually we call the QuickGELU function. 02:08:10.140 |
So this is called the QuickGELU activation function. 02:08:17.500 |
There is no justification on why we should use this one and not another one. 02:08:23.420 |
They just saw that in practice this one works better for this kind of application. 02:08:27.100 |
So that's why we are using this function here. 02:08:40.380 |
This is exactly like the feed forward layer of the transformer. 02:08:44.940 |
Except that in the transformer we don't have this activation function. 02:08:48.780 |
And if you remember, in LLaMA we don't use the ReLU function either. 02:08:54.300 |
But here we are using the QuickGELU function. 02:08:58.780 |
And in practice it works well for this model. 02:09:15.900 |
So we have built the variational autoencoder. 02:09:21.580 |
Now the next thing we have to build is this unit. 02:09:25.340 |
As you remember, the UNet is the network to which we give some noisified image, 02:09:33.420 |
and we also indicate to the network the amount of noise that we added to this image, 02:09:38.940 |
so that the model can predict how much noise is there. 02:09:59.580 |
So we reduce the size of the image while increasing its features, exactly what we did in the encoder of the variational autoencoder, 02:10:07.740 |
and then we bring it back up, just like we did with the decoder of the variational autoencoder. 02:10:10.540 |
So now again we will work with some convolutions. 02:10:17.260 |
The one big difference is that we need to tell our UNet the noise level. 02:10:29.500 |
So the time step at which this noise was added. 02:10:34.540 |
Because as you remember we need to also tell this unit what is our prompt. 02:10:40.060 |
Because we need to tell him how we want our output image to be. 02:10:44.940 |
Because there are many ways to denoise the initial noise. 02:10:47.660 |
So if we want the initial noise to become a dog. 02:10:52.140 |
If we want the initial noise to become a cat. 02:10:58.620 |
And also he has to relate this prompt with the rest of the information. 02:11:03.180 |
And what is the best way to combine two different kinds of information? 02:11:10.220 |
We will use what is called the cross attention. 02:11:13.660 |
Cross attention basically allows us to calculate the attention between two sequences, 02:11:21.980 |
where the queries come from one sequence and the keys and the values come from another sequence. 02:11:25.340 |
So let's go build it and let's see how this works. 02:11:28.060 |
Now the first thing we will do is create a new class. 02:11:39.900 |
And I think also here I will build from top down. 02:11:49.740 |
Let's start by importing the usual libraries. 02:12:56.140 |
So because we need to give the UNet not only the noisified image 02:13:00.700 |
but also the time step at which it was noisified, 02:13:03.500 |
the UNet needs some way to understand this time step. 02:13:10.300 |
So this is why this time step, which is a number, 02:13:14.620 |
will be encoded by using this particular module called time embedding. 02:13:39.020 |
As you remember, the UNet will receive the latent, 02:13:48.220 |
which is the output of the variational autoencoder. 02:14:42.300 |
Which we already converted with the clip encoder here. 02:15:23.180 |
It's actually a number that is multiplied by some frequencies, with sines and cosines, 02:15:30.940 |
because they saw that this positional encoding works for the transformer, 02:15:33.500 |
so we can also use the same positional encoding 02:15:42.300 |
to indicate at which step we arrived in the denoisification. 02:15:53.260 |
The time embedding will then convert it from a tensor of size 1 by 320 into a tensor of size 1 by 1280. 02:18:04.060 |
So the output dimension must match the input dimension. 02:19:32.300 |
What we do is first we apply this first layer. 02:20:13.660 |
And then we build each of the blocks that it will require. 02:20:36.380 |
As you can see, the unit is made up of one encoder branch. 02:20:40.060 |
So this is like the encoder of the variational autoencoder. 02:20:45.340 |
So the image becomes smaller, smaller, smaller. 02:20:58.860 |
The image from the very small size becomes the original size. 02:21:02.700 |
And then we have these skip connections between the encoder and the decoder. 02:21:07.020 |
So the output of each layer of each step of the encoder 02:21:12.460 |
is connected to the same step of the decoder on the other side. 02:21:19.500 |
So we start building the left side, which is the encoders. 02:21:25.100 |
And to build these encoders, we need to define a special layer, basically, that will apply... 02:21:37.980 |
Okay, let's build it and then I will describe it. 02:21:48.060 |
And basically, this switchSequential, given a list of layers, will apply them one by one. 02:22:07.580 |
But it can recognize what are the parameters of each of them and will apply accordingly. 02:22:17.180 |
So first we have, just like before, a convolution. 02:22:20.300 |
Because we want to increase the number of channels. 02:22:23.100 |
So as you can see, at the beginning, we increase the number of channels of the image. 02:22:29.900 |
And then we have another one of this switchSequential. 02:22:44.700 |
But it's very similar to the residual block that we built already for the variational autoencoder. 02:22:50.460 |
And then we have an attention block, which is also very similar to the attention block 02:22:54.300 |
that we built for the variational autoencoder. 02:23:03.900 |
Then we have-- OK, I think it's better to build this switchSequential. 02:23:21.260 |
But given x, which is our latent, which is a torch.tensor, our context, so our prompt. 02:23:49.340 |
So if the layer is a unit attention block, for example. 02:24:00.620 |
Because this attention block basically will compute the cross-attention between 02:24:06.620 |
This residual block will compute-- will match our latent with its time step. 02:24:22.860 |
And then if it's any other layer, we just apply it. 02:24:27.100 |
And then we return x, but only after the for loop. 02:24:37.020 |
We just need to define this residual block and this attention block. 02:24:40.220 |
Then we have another sequence-- sequential switch. 02:24:49.020 |
So the code I'm writing actually is based on a repository. 02:24:55.580 |
Upon which actually most of the code I wrote is based on. 02:24:59.100 |
Which is in turn based on another repository, 02:25:01.180 |
which was originally written for TensorFlow, if I remember correctly. 02:25:04.460 |
So actually, the code for stable diffusion-- because it's a model that is built by 02:25:11.500 |
CompVis group at the LMU university, of course, it cannot be different from that code. 02:25:15.740 |
So most of the code are actually similar to each other. 02:25:19.020 |
I mean, you cannot create the same model and change the code. 02:25:25.500 |
So we again use this one-- switch sequential. 02:26:00.940 |
And then we have an attention block of 8 to 80. 02:26:05.340 |
And this attention block takes the number of head. 02:26:14.780 |
We will see later how we transform this, the output of this, 02:26:19.260 |
into a sequence so that we can run attention on it. 02:26:42.140 |
Convolution of size from 640 to 640 channels. 02:26:51.900 |
Then we have another residual block that will again increase the features. 02:27:05.020 |
And then we have an attention block of 8 heads and 160 is the embedding size. 02:27:15.820 |
Then we have another residual block of 1280 and 8 and 160. 02:27:28.860 |
So as you can see, just like in the encoder of the variational autoencoder, 02:27:31.980 |
we, with these convolutions, we keep decreasing the size of the image. 02:27:38.940 |
So actually here we started with the latent representation, 02:27:42.620 |
which was height divided by 8 and width divided by 8. 02:27:49.100 |
At least you need to understand the size changes. 02:27:52.700 |
So batch size for height divided by 8 and width divided by 8. 02:27:58.940 |
When we apply this convolution, it will become divided by 16. 02:28:17.020 |
And after we apply the second one, it will become divided by 32. 02:28:34.060 |
That if the initial image was of size 512, the latent is of size 64 by 64. 02:28:45.580 |
And then we apply these residual connections. 02:28:51.020 |
And then we apply another convolutional layer, 02:28:57.820 |
which will reduce the size of the image further. 02:29:01.660 |
So from 32 here, divide by 32 and divide by 32 to divide by 64. 02:29:12.460 |
Every time we divided the size of the image by 2. 02:29:46.060 |
And then we have a last one, which is another one of the same size. 02:29:51.740 |
So now we have an image that is divided by 64, so height divided by 64 and width divided by 64, 02:30:02.620 |
with 1280 channels. 02:30:09.420 |
Because the residual connections don't change the size. 02:30:31.020 |
So as I said before, we keep reducing the size of the image, 02:30:35.900 |
but we keep increasing this number of features of each pixel basically. 02:31:24.140 |
So in the decoder, we will do the opposite of what we did in the encoder. 02:31:27.660 |
So we will reduce the number of features, but increase the image size. 02:31:36.700 |
Again, let's start with our beautiful switch sequential. 02:31:54.620 |
Why here is 2560 even if after the bottleneck we have 1280? 02:32:11.420 |
so the input of this side here of the UNet is the output of the bottleneck. 02:32:16.060 |
But the bottleneck is outputting 1280 features, 02:32:20.140 |
while the decoder is expecting 2560, so double the amount. 02:32:25.180 |
Why? Because we need to consider that we have this skip connection here. 02:32:29.020 |
So this skip connection will double the amount at each layer here. 02:32:33.020 |
And this is why the input we expect here is double the size 02:32:44.140 |
The image is very small, so height and width divided by 64. 02:32:59.260 |
Then we apply another switch sequential of the same size. 02:33:12.060 |
just like we did in the variational autoencoder. 02:33:14.140 |
So if you remember in the variational autoencoder, 02:33:16.300 |
to increase the size of the image we do upsampling. 02:33:23.420 |
But this is not exactly the same upsample module that we used there, so we define a new one. 02:33:57.020 |
Then we have another one with an attention block. 02:34:28.700 |
So I know that I'm not writing all the shapes, 02:34:32.380 |
but otherwise it's a really tiring job and very long. 02:34:37.180 |
So just remember that we are keep increasing the size of the image, 02:34:44.380 |
Later we will see that this number here will become very small, 02:34:47.580 |
and the size of the image will get back nearly to the original. 02:34:57.580 |
So as you can see, we are decreasing the features here. 02:35:05.820 |
Then we have 8 by 80, and we are increasing also here the size. 02:35:34.540 |
8 heads with the dimensions embedding size of 80, 02:35:43.100 |
And then we have another residual block with attention. 02:35:58.620 |
Then we have another one, which is 640 to 320, with 8 heads and 40. 02:36:09.580 |
And finally, the last one, we have 640 by 320. 02:36:19.980 |
This dimension here is the same that will be applied by the output of the unit, 02:36:31.180 |
And then we will give it to the final layer to build the original latent size. 02:36:34.620 |
Okay, let's build all these blocks that we didn't build before. 02:36:45.900 |
Let's build it here, which is exactly the same as the two. 02:37:18.300 |
And this is also doesn't change the size of the image, actually. 02:37:28.700 |
So we will go from batch channels or features. 02:37:36.700 |
Let's call it features height width to batch size features. 02:37:46.380 |
Height multiplied by 2 and width multiplied by 2. 02:37:58.380 |
We use F.interpolate on x, with scale_factor equal to 2 and mode equal to nearest, which is the same operation that we did here. 02:38:26.700 |
And we also have to define for the output layer. 02:38:30.380 |
And we also have to define the attention block and the residual block. 02:38:47.580 |
So let's... this one also has a group normalization. 02:39:25.840 |
The final layer needs to convert this shape into this shape. 02:39:33.740 |
We have... so we have an input which is batch size of 320 features. 02:40:02.540 |
This will basically... the convolution... let me write also why we are reducing the size. 02:40:09.660 |
This convolution will change the number of channels from in to out. 02:40:13.180 |
And when we will declare it, we say that we want to convert from 320 to 4 here. 02:40:30.620 |
Then we need to go build this residual block and this attention block here. 02:40:44.060 |
which is very similar to the residual block that we built for the variational autoencoder. 02:41:16.220 |
As you remember, with the time embedding, we transform into an embedding of size 1280. 02:42:37.500 |
Again, just like before, we have if the in channels is equal to the out channels, 02:42:45.260 |
we can connect them directly with the residual connection. 02:42:54.220 |
Otherwise, we create a convolution to connect them, 02:42:57.180 |
to convert the size of the input into the output. 02:43:27.500 |
So it takes in as input this feature tensor, which is actually the latent 02:43:45.100 |
And then also the time embedding, which is 1 by 1280, just like here. 02:43:50.460 |
And we build, first of all, a residual connection. 02:44:01.580 |
So usually the residual connection, the residual blocks are more or less always the same. 02:44:05.500 |
So there is a normalization and activation function. 02:44:08.780 |
Then we can have some skip connection, etc, etc. 02:44:41.420 |
here we are merging the latents with the time embedding, 02:45:03.180 |
but the time embedding doesn't have the height and the width dimensions, so we need to unsqueeze it. 02:45:37.500 |
And finally, we apply the residual connection. 02:45:49.900 |
Well, the idea is that here we have three inputs. 02:45:52.780 |
We have the time embedding, we have the latent, we have the prompt. 02:45:56.300 |
We need to find a way to combine the three information together. 02:46:00.140 |
So the unit needs to learn to detect the noise present in a noisified image 02:46:05.100 |
at a particular time step using a particular prompt as a condition. 02:46:09.100 |
Which means that the model needs to recognize this time embedding 02:46:14.860 |
and needs to relate this time embedding with the latents. 02:46:17.980 |
And this is exactly what we are doing in this residual block here. 02:46:21.100 |
We are relating the latent with the time embedding, 02:46:24.460 |
so that the output will depend on the combination of both, 02:46:29.020 |
not on the single noise or in the single time step. 02:46:31.820 |
And this will also be done with the context using cross-attention 02:46:35.980 |
in the attention block that we will build now. 02:47:09.340 |
Okay, I will define some layers that for now will not make much sense, 02:47:18.140 |
but later they will make sense when we make the forward method. 02:48:00.700 |
I think he already has food, but maybe he wants to eat something special today. 02:48:06.540 |
So, let me finish this attention block and the unit and then I'm all his. 02:48:24.940 |
As you remember, the self-attention we can have the bias for the W matrices. 02:48:30.060 |
Here we don't have any bias, just like in the vanilla transformer. 02:48:35.660 |
Then we have a layer normalization, self.layernorm 2, 02:48:47.020 |
We will see later why we need all this attention, 02:48:51.260 |
It's a cross-attention and we will see later how it works. 02:49:23.820 |
this is because we are using a function that is called the 02:50:08.700 |
Then we have our context, which is our prompt, 02:50:11.660 |
which is a batch size, sequence length, dimension. 02:50:19.260 |
So, the first thing we will do is we will do the normalization. 02:50:25.100 |
So, just like in the transformer, we will take the input, 02:50:28.540 |
so our latents, and we apply the normalization and the convolution. 02:50:32.700 |
Actually, in the transformer, there is no convolution, 02:51:01.900 |
which also doesn't change the size of the tensor. 02:51:08.460 |
which is the batch size, the number of features, the height, and the width. 02:51:17.340 |
We transpose because we want to apply cross-attention. 02:51:22.060 |
First, we apply self-attention, then we apply cross-attention. 02:51:28.460 |
So, we do normalization plus self-attention with skip connection. 02:51:40.220 |
So, X is X dot transpose of minus one, minus two. 02:51:54.140 |
Here, first of all, we need to do X is equal to X dot view. 02:52:02.780 |
So, we are going from this to batch size features, 02:52:26.940 |
Now, we apply this normalization plus self-attention. 02:52:29.500 |
So, we have a first short residual connection 02:52:37.180 |
So, we say that X is equal to layer norm one. 02:52:49.740 |
So, X is plus equal to residual short, the first residual connection. 02:52:56.060 |
Then we say that the residual short is again equal to six, 02:52:58.780 |
because we are going to apply now the cross attention. 02:53:01.740 |
So, now we apply the normalization plus the cross attention with skip connection. 02:53:11.100 |
So, what we did here is what we do in any transformer. 02:53:20.060 |
So, let me show you here what we do in any transformer. 02:53:26.140 |
And then we combine it with a skip connection here. 02:53:28.460 |
And now we will, instead of calculating a self-attention, 02:53:31.660 |
we will do a cross attention, which we still didn't define. 02:53:39.340 |
And then first we calculate, we apply the normalization. 02:53:43.020 |
Then the cross attention between the latents and the prompt. 02:54:12.300 |
Finally, just like with the attention transformer, 02:54:16.140 |
we have a feedforward layer with the GeGLU activation function. 02:54:26.780 |
And this is actually, if you watch the original implementation of the transformer, 02:54:45.180 |
of the stable diffusion, it's implemented exactly like this. 02:54:48.620 |
So, basically later we do element-wise multiplication. 02:54:55.020 |
So, these are special activation functions that involve a lot of parameters. 02:55:11.500 |
they just saw that this one works better for this kind of application. 02:55:26.940 |
So, this one is basically normalization plus feedforward layer with GeGLU and skip connection. 02:55:38.620 |
In which the skip connection is defined here. 02:55:41.660 |
So, at the end, we always apply the skip connection. 02:55:44.780 |
Finally, we change back to our tensor to not be a sequence of pixels anymore. 02:55:57.820 |
So, basically, we go from batch size, height multiplied by width, features 02:56:13.980 |
into batch size, features, height, width. 02:56:36.940 |
Finally, we apply the long skip connection that we defined here at the beginning. 02:57:04.300 |
We have defined everything, I think, except for the cross attention, which is very fast. 02:57:09.340 |
So, we go to the attention that we defined before. 02:57:20.700 |
Yeah, we only need to define this cross attention here. 02:57:38.060 |
So, class CrossAttention: it will be, actually, almost the same as the self-attention, 02:57:44.780 |
except that the queries come from one side 02:57:51.420 |
and the keys and the values come from another side. 02:58:06.300 |
So, this is the dimension of the embedding of the keys and the values. 02:58:37.420 |
In this case, we will define, instead of one big matrix made of three, WQ, WK and WV, we 02:58:46.860 |
You can define it as one big matrix or three separately. 02:59:02.860 |
So, the cross dimension is the one used for the keys and the values. 02:59:32.220 |
Then, we save the number of heads of this cross attention and also the dimension of 03:00:00.460 |
each, how much information each head will see. 03:00:04.140 |
And the dimension of each head is equal to the embedding dimension divided by the number of heads. 03:00:27.740 |
So, we are relating X, which is our latents, which is of size batch size. 03:00:36.060 |
It will have a sequence length, its own sequence length, Q, let's call it Q, and its own dimension. 03:00:43.580 |
And the Y, which is the context or the prompt, which will be batch size. 03:00:53.580 |
Sequence length of the key, because the prompt will become the key and the values. 03:00:58.860 |
And each of them will have its own embedding size, the dimension of KV. 03:01:03.100 |
We can already say that this will have a sequence length of 77, because the sequence length of 03:01:09.580 |
the prompt is 77 and its embedding is of size 768. 03:01:23.740 |
Okay, then we have the interim shape, like the same as before. 03:01:40.860 |
So, this is the sequence length, then the n number of heads. 03:01:55.260 |
The first thing we do is multiply queries by WQ matrix. 03:02:07.580 |
Then we do the same for the keys and the values, but by using the other matrices. 03:02:17.900 |
And as I told you before, the key and the values are the Y and not the X. 03:02:22.140 |
Again, we split them into H heads, so H number of heads. 03:02:38.860 |
I will not write the shapes because they match the same transformation that we do here. 03:02:46.700 |
Okay, again, we calculate the weight, which is the attention, as a query multiplied by 03:03:12.300 |
And then we divide it by the dimension of each head by the square root. 03:03:29.420 |
In this case, we don't have any causal mask, so we don't need to apply the mask like before, 03:03:36.300 |
because here we are trying to relate the tokens, so the prompt with the pixels. 03:03:41.980 |
So, each pixel can watch any token of the prompt, and any token can watch any pixel, basically. 03:04:00.940 |
Then, to obtain the output, we multiply it by the V matrix. 03:04:04.140 |
And then the output, again, is transposed, just like before. 03:04:10.140 |
So, now we are doing exactly the same things that we did here. 03:04:38.780 |
And this ends our building of the... let me show you. 03:04:43.020 |
Now we have built all the building blocks for the stable diffusion. 03:04:49.420 |
So, now we can finally combine them together. 03:04:53.420 |
So, the next thing that we are going to do is to create the system that, 03:04:57.500 |
taking the noise, taking the text, taking the time embedding, will run, 03:05:03.020 |
for example, if we want to do text to image, will run this noise many times through the unit, 03:05:10.540 |
So, we will build the scheduler, which means that, 03:05:13.340 |
because the unit is trained to predict how much noise is there, 03:05:20.460 |
So, to go from a noisy version to obtain a less noisy version, 03:05:25.180 |
we need to remove the noise that is predicted by the unit. 03:05:32.220 |
We will build the code to load the weights of the pre-trained model. 03:05:35.980 |
And then we combine all these things together. 03:05:39.580 |
And we actually build what is called the pipeline. 03:05:42.060 |
So, the pipeline of text to image, image to image, etc. 03:05:48.300 |
Now that we have built all the structure of the unit, 03:05:52.060 |
or we have built the variational autoencoder, we have built a clip, 03:06:01.980 |
So, the first thing I kindly ask you to do is to actually download 03:06:05.500 |
the pre-trained weights of the stable diffusion, because we need to inference it later. 03:06:09.340 |
So, if you go to the repository I shared, this one, PyTorch Stable Diffusion, 03:06:14.220 |
you can download the pre-trained weights of the stable diffusion 1.5 03:06:20.940 |
So, you download this file here, which is the EMA, 03:06:27.020 |
which means that it's a model that has been trained, 03:06:30.460 |
but they didn't change the weights at each iteration, 03:06:32.860 |
but with an Exponentially Moving Average schedule. 03:06:39.580 |
But if you want to fine-tune later the model, you need to download this one. 03:06:43.900 |
And we also need to download the files of the tokenizer, 03:06:48.380 |
because, of course, we will give some prompt to the model to generate an image. 03:06:53.660 |
And the prompt needs to be tokenized by a tokenizer, 03:06:56.700 |
which will convert the words into tokens and the tokens into numbers. 03:07:00.620 |
The numbers will then be mapped into embeddings by our clip embedding here. 03:07:05.420 |
So, we need to download two files for the tokenizer. 03:07:08.780 |
So, first of all, the weights of this one file here, 03:07:12.380 |
then on the tokenizer folder, we find the merges.txt and the vocab.json. 03:07:17.340 |
If we look at the vocab.json file, which I already downloaded, 03:07:25.260 |
That's it, just like what the tokenizer does. 03:07:27.580 |
And then I also prepared the picture of a dog that I will be using for image-to-image, 03:07:33.420 |
You don't have to use the one I am using, of course. 03:07:41.180 |
So, how we will inference this stable diffusion model. 03:07:47.900 |
I will also explain you how the scheduler will work. 03:07:54.540 |
I will explain all the formulas, all the mathematics behind it. 03:08:16.940 |
We will also use a tqdm to show the progress bar. 03:08:22.540 |
And later, we will build this sampler, the DDPM sampler. 03:08:30.220 |
And I will also explain what is this sampler doing and how it works, etc, etc. 03:08:35.980 |
So, first of all, let's define some constants. 03:08:38.060 |
The stable diffusion can only produce images of size 512 by 512. 03:08:46.620 |
The latent dimension is the size of the latent tensor of the variational autoencoder. 03:08:55.420 |
And as we saw before, if we go check the size, 03:09:00.300 |
the encoder of the variational autoencoder will convert something that is 512 by 512 03:09:09.260 |
So, the latent dimension is 512 divided by 8. 03:09:19.420 |
We can also call it width divided by 8 and height divided by 8. 03:09:23.420 |
Then, we create a function called the generator. 03:09:27.660 |
This will be the main function that will allow us to do text to image and also image to image, 03:09:44.620 |
If you ever used stable diffusion, for example, with the HuggingFace library, 03:09:48.380 |
you will know that you can also specify a negative prompt, 03:09:51.820 |
which tells that you want, for example, you want a picture of a cat, 03:09:56.620 |
but you don't want the cat to be on the sofa. 03:09:59.900 |
So, for example, you can put the word sofa in the negative prompt. 03:10:04.060 |
So, it will try to go away from the concept of sofa when generating the image. 03:10:09.260 |
And this is connected with the classifier free guidance that we saw before. 03:10:13.980 |
So, but don't worry, I will repeat all the concepts while we are building it. 03:10:18.540 |
We can have an input image in case we are building an image to image. 03:10:27.660 |
Strength, I will show you later what is it, but it's related to if we have an input image 03:10:33.340 |
and how much, if we start from an image to generate another image, 03:10:37.180 |
how much attention we want to pay to the initial starting image. 03:10:40.780 |
And we can also have a parameter called doCFG, 03:10:51.020 |
CFG scale, which is the weight of how much we want the model to pay attention to our prompt. 03:11:00.380 |
The sampler name, we will only implement one. 03:11:12.060 |
I think it's quite common to do 50 steps, which produces actually not bad results. 03:11:20.140 |
The seed is how we want to initialize our random number generator. 03:11:23.980 |
Let me put a new line, otherwise we become crazy reading this. 03:11:36.540 |
Then we have the device where we want to create our tensor. 03:11:39.820 |
We have an idle device, which means basically if we load some model on CUDA 03:11:45.020 |
and then we don't need the model, we move it to the CPU. 03:11:48.060 |
And then the tokenizer that we will load later. 03:11:54.860 |
This is our main pipeline that, given all this information, will generate one picture. 03:12:01.340 |
It will pay attention to the input image, if there is, 03:12:03.980 |
according to the weights that we have specified. 03:12:10.540 |
Don't worry, later I will explain them actually how they work also on the code level. 03:12:22.480 |
We use torch.no_grad(), because we are inferencing the model. 03:12:28.780 |
The first thing we make sure is the strength should be between 0 and 1. 03:12:59.340 |
If we want to move things to the CPU, we create this lambda function. 03:13:33.600 |
Then we create the random number generator that we will use. 03:13:51.040 |
And the generator is a random number generator that we will use to generate the noise. 03:14:14.140 |
Let me fix this formatting because I don't know format document. 03:14:33.900 |
The clip is a model that we take from the pre-trained models. 03:14:58.160 |
As you remember with the classifier-free guidance. 03:15:03.280 |
When we do classifier-free guidance, we inference the model twice. 03:15:16.560 |
And another time by not specifying the condition. 03:15:20.720 |
And then we combine the output of the model linearly with a weight. 03:15:31.440 |
It indicates how much we want to pay attention to the conditioned output 03:15:38.320 |
Which also means that how much we want the model to pay attention to the condition 03:15:54.880 |
So, the negative prompt that you use in stable diffusion. 03:16:19.600 |
And this will tell the model by using this weight. 03:16:21.760 |
We will combine the output in such a way that we can decide 03:16:24.880 |
how much we want the model to pay attention to the prompt. 03:17:02.720 |
We want to append the padding up to the maximum length. 03:17:06.880 |
Which means that the prompt, if it's too short, will be filled with paddings up to 77 tokens. 03:17:22.960 |
Then we convert these tokens, which are input IDs, into a tensor. 03:17:30.960 |
Which will be of size batch size and sequence length. 03:17:56.880 |
So, it will convert batch size sequence length. 03:17:59.840 |
So, these input IDs will be converted into embeddings. 03:18:12.400 |
And what we do is conditional context is equal to clip of conditional tokens. 03:18:22.800 |
So, we are taking these tokens and we are running them through clips. 03:18:27.360 |
Which will return batch size sequence length dimension. 03:18:31.040 |
And this is exactly what I have written here. 03:18:38.080 |
Which, if you don't want to specify, we will use the empty string. 03:18:41.440 |
Which means the unconditional output of the model. 03:18:44.240 |
So, the model, what would the model produce without any condition? 03:18:51.440 |
So, if we start with random noise and we ask the model to produce an image. 03:18:56.880 |
So, the model will output anything that it wants based on the initial noise. 03:19:38.820 |
So, it will also become a tensor of batch size sequence length dimension. 03:19:47.600 |
Where the sequence length is actually always 77. 03:19:55.520 |
But I forgot to write the code to convert it into tokens. 03:19:59.040 |
So, unconditional tokens is equal to tokenizer.batch_encode_plus, just like before. 03:20:31.680 |
They will become the batch of our input to the unit. 03:20:51.520 |
We are taking the conditional and unconditional input. 03:20:54.240 |
And we are combining them into one single tensor. 03:20:56.560 |
So, they will become a tensor of batch size 2. 03:21:17.840 |
If we don't want to do conditional classifier free guidance. 03:21:24.880 |
We only need to use the prompt and that's it. 03:21:33.920 |
Without combining the unconditional input with the conditional input. 03:21:38.880 |
We cannot decide how much the model pays attention to the prompt. 03:21:44.880 |
Because we don't have anything to combine it with. 03:22:53.840 |
Because the model takes care of the batch size. 03:23:08.640 |
And you want to offload the models after using them. 03:23:26.480 |
Because it's better to build it after you know how it is used. 03:23:32.560 |
I think it's easy to get lost in what is happening. 03:23:51.440 |
And we tell the sampler how many steps we want to do for the inferencing. 03:24:04.960 |
Because we didn't implement any other sampler. 03:24:33.120 |
In this case the denoisification steps will be 50. 03:24:36.960 |
Even if during the training we have maximum 1000 steps. 03:24:41.120 |
During inferencing we don't need to do 1000 steps. 03:24:44.720 |
Of course usually the more steps you do the better the quality. 03:24:50.880 |
But with different samplers they work in different way. 03:24:55.920 |
And with ddpm usually 50 is good enough to get a nice result. 03:25:03.120 |
For some other samplers that work on with differential equations. 03:25:10.640 |
And how lucky you are with the particular prompt actually also. 03:25:15.920 |
This is the latency that will run through the unit. 03:25:26.400 |
And as you know it's of size "lat_height" and "lat_width". 03:25:33.600 |
So it's 512 divided by 8 by 512 divided by 8. 03:25:43.280 |
What happens if the user specifies an input image? 03:25:49.600 |
We can take care of the prompt by either running a classifier free guidance. 03:25:56.000 |
Which means combining the output of the model with the prompt and without the prompt. 03:26:05.920 |
Or we can directly just ask the model to output only one image. 03:26:13.600 |
But then we cannot combine the two output with this scale. 03:26:17.360 |
What happens, however, if we don't want to do text to image, but image to image? 03:26:30.560 |
And then we ask the scheduler to remove noise. 03:26:33.440 |
But since the UNet will also be conditioned by the text prompt, 03:26:38.400 |
we hope that while the UNet denoises this image, it will move it towards the prompt. 03:28:03.120 |
The next thing we do is we rescale this image. 03:28:06.960 |
The input of this model should be normalized, 03:28:12.480 |
or rather, rescaled between -1 and +1. 03:28:28.480 |
But this is not what the unit wants as input. 03:28:44.640 |
So we write a function to transform anything that is between 0 and 255 into this range. 03:28:57.680 |
And this will not change the size of the tensor. 03:29:17.040 |
Okay, and then we change the order of the dimensions. 03:29:36.880 |
Because as you know the encoder of the variational autoencoder wants batch size, channel, height, width, 03:29:45.520 |
while we have batch size, height, width, channel. 03:29:49.520 |
So to obtain the correct input for the encoder, we permute the dimensions. 03:30:18.640 |
And then he will sample from this particular Gaussian. 03:30:36.480 |
And we can also make the output deterministic. 03:30:47.040 |
Okay, and now let's run it through the encoder. 03:31:23.920 |
It will produce a latent representation of this image. 03:32:02.480 |
Because the model will have more noise to remove. 03:32:08.000 |
But if we add less noise to this initial image. 03:32:13.120 |
Because most of the image is already defined. 03:32:17.680 |
So we expect that the output will resemble more or less the input. 03:32:54.320 |
And later we will see what is this method doing. 03:33:03.200 |
According to the strength that we have defined. 03:33:35.200 |
What is the initial noise level we will start with. 03:34:58.880 |
And we then finally load the diffusion model. 03:35:12.160 |
Later we see what is this model and how to load it. 03:35:14.640 |
We take it to our device where we are working. 03:35:19.840 |
And then our sampler will define some time steps. 03:35:27.840 |
As you remember to train the model we have maximum of 1000 time steps. 03:35:31.680 |
But when we inference we don't need to do 1000 steps. 03:35:34.640 |
In our case we will be doing for example 50 steps of inferencing. 03:35:54.960 |
If we do only 50, it means that we skip 20 time steps at a time, because 1000 divided by 50 is 20. 03:36:19.280 |
Basically each of these time steps indicates a noise level. 03:36:29.520 |
Or the initial noise in case we are doing the text to image. 03:36:38.480 |
Which are defined by how many inference steps we want. 03:36:41.440 |
And this is exactly what we are going to do now. 03:37:04.080 |
And for each of these time steps we denoise the image. 03:37:19.680 |
We need to tell the unit as you remember diffusion. 03:37:32.000 |
Or in case we are doing a classifier free guidance. 03:37:44.800 |
Keep denoising it according to the time embedding. 03:37:51.280 |
Which is an embedding of the current time step. 03:38:02.240 |
This function basically will convert a number. 03:38:15.200 |
It's basically just equal to the positional encoding. 03:39:25.760 |
If we are doing the classifier free guidance. 03:40:04.960 |
So basically we are repeating this dimension twice. 03:40:50.720 |
Because we are passing the input of the model. 03:41:01.040 |
So we can then split it into two different tensor. 03:41:24.400 |
And then we combine them according to this formula here. 03:42:16.000 |
That is able to predict the noise in the current latents. 03:43:37.520 |
Based on how many inference steps we want to do. 03:43:46.240 |
The unit will tell us how much is the predicted noise. 03:43:53.520 |
So the latency are equal to sampler dot step. 03:44:03.040 |
This basically means: take the image from a more noisy version to a less noisy version. 03:44:42.960 |
And then our image is run through the decoder. 03:45:13.360 |
Because the image was initially, as you remember here. 03:45:47.600 |
We want the channel dimension to be the last one. 03:45:56.320 |
So this one basically will take the batch size. 03:46:21.840 |
And then we need to convert it into a NumPy array. 03:47:33.600 |
So convert something that is within this range into this range. 03:47:59.760 |
This means basically take the time step which is a number. 03:48:08.320 |
And this will be done exactly using the same system that we use for the transformer. 03:48:13.520 |
So we first define the frequencies of our cosines and the sines. 03:48:23.440 |
Exactly using the same formula of the transformer. 03:48:26.240 |
So if you remember, the formula uses 10,000 03:48:37.840 |
raised to the power of minus torch.arange(0, 160) divided by 160. 03:48:45.040 |
So I am referring to this formula just in case you forgot. 03:48:56.480 |
So the formula that defines the positional encodings here. 03:48:59.920 |
Here we just use a different dimension of the embedding. 03:49:02.960 |
This one will produce something that is 160 numbers. 03:49:18.480 |
And this one will produce something that is 200 numbers. 03:50:06.720 |
And then we multiply this by the sines and the cosine. 03:50:13.920 |
Just like we did in the original transformer. 03:50:16.880 |
This one will return a tensor of size 1 by 320. 03:51:04.640 |
Because if we don't want to use any negative prompt. 03:51:13.280 |
The strength is how much attention we want to pay to this input image. 03:51:18.800 |
Or how much noise we want to add to it basically. 03:51:24.800 |
The more noise we add, the less the output will resemble the input image. 03:51:31.120 |
Which means that if we want the output to stay close to the input image, we add less noise. 03:51:38.320 |
And then we can adjust how much we want to pay attention to the prompt. 03:51:52.880 |
The first thing we do is we create a generator. 03:52:02.960 |
Basically we need to go through the units twice. 03:52:16.800 |
In case we don't do the classifier free guidance. 03:52:57.760 |
Then this new latent is fed again to the unit. 03:53:09.280 |
Is how we remove the noise from the image now. 03:53:20.320 |
So now we need to go build this scheduler here. 03:53:56.880 |
Because I don't want you to be confused with the beta schedule. 03:54:12.240 |
Because there is the beta schedule that we will define now. 03:54:16.400 |
Which indicates the amount of noise at each time step. 03:54:19.840 |
And then there is what is known as the scheduler or the sampler. 03:54:26.400 |
So this scheduler here actually means a sampler. 03:54:30.480 |
I will update the slides when the video is out. 03:54:46.560 |
Where what are they and where they come from? 03:55:16.160 |
We can see that the forward process is the process that makes the image more noisy. 03:55:24.800 |
So given an image with less noise, it produces an image with more noise. 03:55:33.840 |
Which is actually a chain of Gaussian distribution. 03:55:36.720 |
Which is called a Markov chain of Gaussian distribution. 03:55:39.520 |
And the noise that we add varies according to a variance schedule. 03:55:52.720 |
That indicates the variance of the noise that we add with each of these steps. 03:55:59.040 |
And we use the same schedule as in stable diffusion. 03:56:10.560 |
So these are the betas that will gradually turn the image into complete noise. 03:56:26.880 |
Which are for example the cosine schedule etc. 03:56:35.360 |
Which is actually 1000 numbers between beta start and beta end. 03:56:57.680 |
Because this is how they define it in the stable diffusion. 03:57:11.920 |
So in how many pieces we want to divide this linear space. 03:57:21.280 |
And then the type is torch dot float 32 I think. 03:57:33.360 |
This is in the diffusers libraries from Hugging Face. 03:57:38.480 |
I think this is called the scaled linear schedule. 03:57:44.480 |
That are needed for our forward and our backward process. 03:57:47.600 |
So our forward process depends on this beta schedule. 03:57:50.720 |
But actually this is only for the single step. 03:57:53.120 |
So if we want to go, for example, from the original image directly to a noisified version at time step t, 03:58:05.920 |
we need the alpha bar, the formula that allows you to go from the original image to any noise level in one step. 03:58:21.040 |
And the variance also depends on this alpha bar. 03:59:42.480 |
The second element is alpha 0 multiplied by alpha 1. 04:00:28.720 |
We will start from the more noisy to less noise. 04:00:36.240 |
So let's say timesteps is equal to torch.from_numpy of a range from 0 to 1000, reversed. 04:01:12.720 |
So if the user later specifies less than 1000. 04:01:22.960 |
Based on how many actual steps we want to make. 04:01:39.920 |
Which is also actually the one they use normally. 04:03:00.720 |
According to how many we actually want to make. 04:03:54.240 |
Now the code looks very different from each other. 04:04:01.040 |
Because actually I have been copying the code from multiple sources. 04:04:04.560 |
Maybe one of them I think I copied from the HuggingFace library. 04:04:16.160 |
So we copy the code from the HuggingFace library. 04:04:20.480 |
So now we set the exact number of time steps we want. 04:04:23.760 |
And we redefine this time steps array like this. 04:04:33.520 |
Let's define the method on how to add noise to something. 04:04:45.840 |
Well we need to apply the formula as defined in the paper. 04:04:57.680 |
I want to go to the noisified version of this image at time step t. 04:05:14.320 |
And we will apply the same trick that we did for the variational autoencoder. 04:05:17.840 |
As you remember in the variational autoencoder. 04:05:19.920 |
I actually already showed how we sample from a distribution. 04:05:23.040 |
Of which we know the mean and the variance here. 04:05:26.640 |
But we of course we need to build the mean and the variance. 04:05:35.920 |
So we need to build the mean and the variance. 04:05:50.080 |
So this is actually time step, not time steps. 04:05:57.040 |
It indicates at what time step we want to add the noise. 04:06:00.000 |
Because you can add the noise at time step 1, 2, 3, 4, 04:06:05.520 |
and so on, up to time step 1000. 04:06:15.200 |
So the noisified version at the time step 1 will be not so noisy. 04:06:19.280 |
But at the time step 1000 will be complete noise. 04:06:35.120 |
Let me check what we need to calculate first. 04:06:41.360 |
So to calculate the mean we need this alpha cum prod. 04:06:50.160 |
So the alpha bar as you can see is the cumulative product of all the alphas. 04:07:30.400 |
That we also move to the same device of the other tensor. 04:07:34.320 |
Now we need to calculate the square root of alpha bar. 04:07:47.040 |
Or alpha prod is alpha cum prod at the time step t. 04:07:57.760 |
Because raising a number to the power of 0.5 means taking its square root. 04:08:04.240 |
That is, a number to the power of 1/2 04:08:07.840 |
is the same as the square root of that number. 04:08:14.800 |
And then basically because we need to combine this alpha cum prod. 04:08:29.040 |
So one trick is to just keep adding dimensions with unsqueeze. 04:08:32.640 |
Until you have the same number of dimensions. 04:08:34.480 |
So while the length of the shape of this tensor is less than the length of the shape of the other tensor. 04:08:41.440 |
Most of this code I have taken from the Hugging Face libraries samplers. 04:08:52.080 |
So we keep the dimension until this one and this tensor and this tensor have the same dimensions. 04:09:07.440 |
This is because otherwise we cannot do broadcasting when we multiply them together. 04:09:11.120 |
The other thing that we need to calculate this formula is this part here. 04:09:24.240 |
As the name implies is 1 minus alpha cum prod at the time step t. 04:09:39.440 |
Just like we did with the encoder of the variational autoencoder. 04:09:45.680 |
Because, as you remember, if you have a sample from N(0, 1) and you want to transform it into a sample from a Gaussian with a given mean and variance, 04:09:52.720 |
the formula is X = mean + standard deviation * N(0, 1) sample. 04:10:13.920 |
Flatten and then again we keep adding the dimensions until they have the same dimension. 04:10:19.680 |
Otherwise we cannot multiply them together or sum them together. 04:10:37.360 |
Now as you remember our method should add noise to an image. 04:10:44.080 |
So we need to add noise means we need to sample some noise. 04:10:47.280 |
So we need to sample some noise from the n01. 04:11:07.280 |
I think my cat is very angry today with me because I didn't play with him enough. 04:11:14.000 |
So later if you guys excuse me I need to later play with him. 04:11:24.240 |
So let's get the noisy samples using the noise and the mean and the variance that we have calculated. 04:11:35.520 |
Actually, the mean is this coefficient multiplied by x0. 04:11:41.760 |
So this coefficient multiplied by x0 is the mean. 04:11:45.520 |
So we take the square root of alpha_cumprod, multiply it by x0, and this will be the mean. 04:11:50.880 |
So the mean is the square root of alpha_prod multiplied by the original latents, 04:11:56.320 |
so x0, the input image, or whatever we want to noisify. 04:12:00.240 |
Plus the standard deviation, which is the square root of (1 - alpha_cumprod), multiplied by a sample of the N(0, 1) noise. 04:12:22.160 |
So all of this is according to equation 4 of the DDPM paper, and also according to this. 04:12:36.480 |
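Putting it together, here is a sketch of the add_noise method just described, i.e. sampling from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I) as in equation 4 of the DDPM paper; the parameter and attribute names are assumptions.

```python
# Sketch of add_noise (equation 4 of the DDPM paper); names are assumptions.
import torch

def add_noise(self, original_samples: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    alphas_cumprod = self.alphas_cumprod.to(device=original_samples.device,
                                            dtype=original_samples.dtype)
    timesteps = timesteps.to(original_samples.device)

    # Mean coefficient: sqrt(alpha_bar_t)
    sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5
    sqrt_alpha_prod = sqrt_alpha_prod.flatten()
    while len(sqrt_alpha_prod.shape) < len(original_samples.shape):
        sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)

    # Standard deviation: sqrt(1 - alpha_bar_t)
    sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5
    sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
    while len(sqrt_one_minus_alpha_prod.shape) < len(original_samples.shape):
        sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)

    # Sample epsilon ~ N(0, I) and shift it: x_t = mean + std * epsilon
    noise = torch.randn(original_samples.shape, generator=self.generator,
                        device=original_samples.device, dtype=original_samples.dtype)
    noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise
    return noisy_samples
```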
Okay now that we know how to add noise we need to understand how to remove noise. 04:12:49.920 |
Imagine we are doing text-to-image or image-to-image, it doesn't matter. 04:12:57.200 |
The point is, our UNet, as you remember, is trained to only predict the amount of noise, given the 04:13:03.600 |
noisy latent, the prompt, and the time step at which this noise was added. 04:13:12.400 |
So what we have is this predicted noise from the UNet. 04:13:17.280 |
We need to remove this noise: the UNet will predict the noise, but we need some way of actually removing it. 04:13:25.760 |
What I mean by this is you can see this reverse process here. 04:13:36.720 |
We want to go from Xt, so something more noisy, to something less noisy, based on the noise predicted by the network. 04:13:51.440 |
But here, in this formula, you don't see any relationship to the noise predicted by the network. 04:13:57.680 |
Actually, here it just says: if you have a network that can evaluate this mean and this variance, 04:14:06.640 |
you know how to remove the noise, so how to go from Xt to Xt-1. But we don't have a method 04:14:13.120 |
that actually predicts the mean and the variance. 04:14:15.120 |
We have a method that tells us how much noise is there. 04:14:18.000 |
So the formula we should be looking at is actually here. 04:14:22.800 |
So here, because we trained our network, our UNet, as the epsilon theta: as you 04:14:33.600 |
remember, our training method was this, we do gradient descent on this loss, in which we 04:14:40.000 |
train a network to predict the noise in a noisy image. 04:14:45.040 |
So we need to use this epsilon theta now, so this predicted noise, to 04:14:51.120 |
remove the noise. And if we read the paper, it's written here that to sample Xt-1 given 04:14:58.000 |
Xt, we compute Xt-1 equal to this formula here. 04:15:04.640 |
This tells us how to go from Xt to Xt-1. So basically we sample some noise, 04:15:12.720 |
we multiply it by the sigma, and this basically reminds us of how to go from the N(0, 1) 04:15:21.440 |
to any distribution with a particular mean and a particular variance. 04:15:26.400 |
So we will be working according to this formula here, because we have a model that 04:15:31.360 |
predicts noise, this epsilon theta, and this is our UNet. 04:15:38.000 |
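For reference, the sampling step of Algorithm 2 of the DDPM paper, which uses the predicted noise epsilon_theta, can be written as:

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,
          \epsilon_\theta(x_t, t) \right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I)
```

In the code below we will implement an equivalent parameterization of this same step, via the predicted x0 and formula 7.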
So let's build this part now and I will while building it I will also tell you which formula 04:15:44.160 |
I'm referring to at each step so you can also follow the paper. 04:15:47.680 |
So now let's build the method, let's call it the step method. It takes the time step at which the 04:15:54.160 |
noise was added, or at which we pretend it was added, because when we do the reverse process we can also 04:16:00.400 |
skip some time steps. So we need to tell 04:16:06.000 |
it what is the time step at which it should remove the noise; then the latents, because as you know 04:16:12.640 |
the UNet works with the latents, so with these z's here, so this is z and it keeps getting denoised; 04:16:19.040 |
and then the model output, so the predicted noise of the UNet. 04:16:27.920 |
So the model output is the predicted noise, a torch.Tensor. 04:16:33.520 |
This model output corresponds to this epsilon theta of (xt, t), so this is the predicted noise 04:16:43.680 |
at time step t. These latents are our xt. And what else do we need? The alpha we have, the beta 04:16:52.320 |
we have, we have everything. Okay, let's go. So t is equal to the time step, and the previous t is 04:17:02.160 |
equal to self dot get previous time step t this is a function that given this time step 04:17:11.040 |
calculates the previous one later we will build it actually we can build it now it's very simple 04:17:31.680 |
get previous time step self time step which is an integer we return another integer 04:17:43.680 |
minus self minus basically this quantity here step ratio so self dot num training steps 04:17:56.720 |
divided by self dot num inference steps return previous t this one will return basically 04:18:06.560 |
given, for example, the number 1000, it will return 1000 minus 20. Because the time steps: 04:18:16.640 |
for example, suppose the initial time step is 1000; the training steps we are doing are 1000, 04:18:24.080 |
divided by the number of inference steps, which will be 50, so this means 1000 minus 04:18:30.000 |
20, because 1000 divided by 50 is 20, so it will return 980. When we give it 980 as input it will 04:18:37.760 |
return 960. So this tells us what is the next step that we will be doing in our for loop, or what is the previous 04:18:45.200 |
step of the denoising: we are going from the image noised at time step 1000 to an image 04:18:52.320 |
noised at time step 980, for example. This is the meaning of the previous step. 04:19:01.280 |
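As a sketch, this helper could look like the following (the method name is an assumption):

```python
# Sketch of the previous-timestep helper (the method name is an assumption).
def _get_previous_timestep(self, timestep: int) -> int:
    # With 1000 training steps and 50 inference steps the ratio is 20,
    # so 980 -> 960, 960 -> 940, and so on.
    prev_t = timestep - self.num_training_steps // self.num_inference_steps
    return prev_t
```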
Then we retrieve some data that we will use later: so alpha_prod_t is equal to self.alphas_cumprod at t. For now, if you don't understand, 04:19:08.240 |
don't worry, because I am just collecting some data that we need to calculate 04:19:12.400 |
a formula, and then I will tell you exactly which formula we are going to calculate. 04:19:22.640 |
if we don't have any previous step then we don't know which alpha to return so we just return one 04:19:45.440 |
and actually there is a paper that came out, I think from ByteDance, that was complaining that 04:19:50.960 |
this way of doing it is not correct, because the last time step does not 04:19:58.720 |
have a signal-to-noise ratio equal to zero. But okay, this is something we don't need 04:20:03.760 |
to care about now; actually, if you're interested, I will link the paper in the comments. 04:20:26.320 |
prod t divided by alpha prod current also this code i took it from 04:20:33.600 |
hugging face diffusers library because i mean we are applying formulas so even if i wrote it by 04:20:43.520 |
myself it wouldn't be any different because we are just applying formulas from the paper so 04:20:48.240 |
so the first thing we need to do is to compute the original sample according to the formula 15 04:20:55.040 |
of the paper what do i mean by this as you can see where is it this one where is it 04:21:05.120 |
here so actually let me show you another formula here 04:21:12.160 |
as you can see we can calculate the previous step so the less noise is the the forward process 04:21:22.080 |
sorry the reverse process we can calculate the less noisy image given a more noisy image 04:21:26.960 |
and the predicted image at time step zero according to this formula here where the mean 04:21:34.560 |
is defined in this way and the variance is defined in this way but what is the predicted 04:21:44.400 |
x0 so given an image given a noisy image at time step t how can we predict what is the x0 04:21:53.920 |
of course this is the predicted x0 not what will be the x0 so this predicted x0 we can also retrieve 04:22:00.640 |
it using the formula number 15, if I remember correctly it's here. So this x0 is given as xt 04:22:10.800 |
minus the square root of (1 minus alpha bar) multiplied by the predicted noise at time step t, divided by the square root 04:22:17.200 |
of alpha bar. All these quantities we have. So actually there are two ways, which are equivalent to each 04:22:22.800 |
other actually numerically of going from more noisy to less noisy one way is this one this 04:22:28.960 |
one here which is the algorithm 2 of the sampling and one is this one here so the equation number 04:22:36.560 |
7 that allows you to go from more noisy to less noisy but the two are numerically equivalent they 04:22:42.400 |
just in the in the effect they are equivalent it's just they have different parameterization 04:22:48.000 |
so they have different formulas so as a matter of fact for example here in the code they say 04:22:54.720 |
to go from xt to xt minus 1 you need to do this calculation here but as you can see for example 04:23:02.480 |
this numerator here, multiplied by this epsilon theta, is different from the one 04:23:11.120 |
in the algorithm here. But actually they are the same thing, because beta_t is equal to 1 minus alpha_t, 04:23:16.720 |
as alpha is defined as 1 minus beta, as you remember. So there are multiple ways of obtaining 04:23:24.080 |
the same thing so what we will do is we actually we will apply this formula here in which we need 04:23:29.760 |
to calculate the mean and we need to calculate the variance according to these formulas here 04:23:34.720 |
in which we know alpha we know beta we know alpha bar we know all the other alphas we know 04:23:40.320 |
because there are parameters that depend on beta what we don't know is x0 but x0 can be calculated 04:23:46.240 |
as in the formula 15 here so first we will calculate this x0 predicted x0 04:23:54.400 |
So first we compute the predicted original sample, using formula 15 of the DDPM paper. 04:24:16.960 |
So we do latents minus the square root of 1 minus alpha bar t. What is 1 minus alpha bar t? 04:24:28.000 |
It is equal to beta_prod_t: I have here beta_prod_t, which is already 1 minus alpha bar t, as 04:24:36.160 |
you can see, 1 minus alpha bar at the time step t, because I already retrieved it from here. 04:24:43.920 |
So beta_prod to the power of one half, or the square root of beta_prod. 04:24:52.000 |
So we do latents minus beta_prod at time step t to the power of 0.5, which basically means the square 04:25:01.120 |
root of beta_prod, and then we multiply this by the predicted noise of the latent at 04:25:08.640 |
time step t. What is the predicted noise? It's the model output, because our UNet predicts the 04:25:15.120 |
noise: model_output. And then we need to divide this by, let me check, 04:25:23.360 |
the square root of alpha bar t, which we have, I think, here: alpha_prod_t. So the square root of 04:25:36.240 |
alpha_prod_t. Here I have an extra parenthesis that I don't need, let me remove it, 04:25:44.560 |
because otherwise it's wrong: first there is the product between these two terms, and then the difference. 04:25:50.080 |
Okay, this is how we compute the predicted x0. 04:25:56.160 |
Now let's go back to the formula number 7. 04:25:59.440 |
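Here is a sketch of this predicted-x0 computation (formula 15 of the DDPM paper); the variable names latents, model_output and timestep are assumptions matching the narration above.

```python
# Sketch of the predicted-x0 computation (formula 15 of the DDPM paper); names are assumptions.
alpha_prod_t = self.alphas_cumprod[timestep]   # alpha_bar_t
beta_prod_t = 1 - alpha_prod_t                 # 1 - alpha_bar_t

# x0_pred = (x_t - sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_bar_t)
pred_original_sample = (latents - beta_prod_t ** 0.5 * model_output) / alpha_prod_t ** 0.5
```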
Okay, now we have this x0, so we can compute this term and this term, 04:26:06.960 |
and all the other terms we can also compute. So we calculate 04:26:11.120 |
this mean and this variance, and then we sample from this distribution. So: compute the coefficients 04:26:20.640 |
for pred_original_sample and the current sample xt. This is the same comment that you can find in 04:26:31.280 |
the diffusers library, which basically means we need to compute this one: this is the coefficient 04:26:36.640 |
for the predicted sample and this is the coefficient for xt this one here so predicted 04:26:43.520 |
original sample coefficient which is equal to what alpha prod t minus one so the previous alpha 04:26:54.320 |
prod t which is alpha prod t previous which means the alpha prod t but at the previous time step 04:27:04.720 |
under the square root so to the power of 0.5 multiplied by the current beta t so 04:27:10.720 |
the beta at the time step t so current beta t which is we define it here 04:27:18.640 |
current_beta_t, which we retrieve from the alphas, and then we divide it by 04:27:28.720 |
beta product t because one minus alpha bar is actually equal to beta bar 04:27:33.120 |
beta product t then we have the this coefficient here so this one here 04:27:42.320 |
so this is current sample coefficient is equal to current alpha t to the power of 0.5 04:27:50.880 |
which means the square root of this time this this thing here so the square root of alpha t 04:27:58.080 |
and then we multiply it by beta_prod at the previous time step, because 1 minus alpha bar at the 04:28:03.200 |
previous time step corresponds to beta_prod at the previous time step, so we multiply by 04:28:09.280 |
beta_prod_t_prev, and we divide by beta_prod at the time step t, so beta_prod_t. 04:28:17.840 |
now we can compute the mean so the mean is the sum of these two terms 04:28:26.160 |
pred prev sample so let me write some here compute the predicted 04:28:39.840 |
is equal to predicted original sample coefficient multiplied by what by x0 what is x0 is this one 04:28:50.000 |
that we obtained by the formula number 15 so the prediction predicted original sample so x0 04:28:56.000 |
plus this term here what is this term is this one here so the current sample coefficient 04:29:02.560 |
multiplied by xt. And what is xt? It is the latents at the time step t. 04:29:08.480 |
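Collecting all the quantities mentioned so far, a sketch of this mean computation (formula 7 of the DDPM paper) might look like this; the variable names are assumptions.

```python
# Sketch of the mean of q(x_{t-1} | x_t, x_0) (formula 7 of the DDPM paper); names are assumptions.
prev_t = self._get_previous_timestep(timestep)
alpha_prod_t = self.alphas_cumprod[timestep]
alpha_prod_t_prev = self.alphas_cumprod[prev_t] if prev_t >= 0 else self.one
beta_prod_t = 1 - alpha_prod_t
beta_prod_t_prev = 1 - alpha_prod_t_prev
current_alpha_t = alpha_prod_t / alpha_prod_t_prev
current_beta_t = 1 - current_alpha_t

# Coefficient of the predicted x0: sqrt(alpha_bar_{t-1}) * beta_t / (1 - alpha_bar_t)
pred_original_sample_coeff = (alpha_prod_t_prev ** 0.5 * current_beta_t) / beta_prod_t
# Coefficient of x_t: sqrt(alpha_t) * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t)
current_sample_coeff = (current_alpha_t ** 0.5 * beta_prod_t_prev) / beta_prod_t

# Mean, with x0 replaced by its prediction from formula 15
pred_prev_sample = pred_original_sample_coeff * pred_original_sample + current_sample_coeff * latents
```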
Now we have computed the mean; we also need to compute the variance. 04:29:17.120 |
let's create another method to compute the variance 04:29:28.480 |
Okay, we obtain the previous time step t, because we need it for later calculations. 04:29:37.440 |
Again we calculate the alpha_prod_t, so all the terms that we need. 04:30:15.040 |
And the current_beta_t is equal to 1 minus alpha_prod_t divided by alpha_prod_t_prev, this one. 04:30:24.240 |
So what is current_beta_t? It is equal to 1 minus (alpha_prod_t 04:30:36.080 |
divided by alpha_prod_t_prev). Okay, so the variance, according to the formulas number 6 and 7, 04:30:48.800 |
so this formula here, is given as 1 minus alpha_prod_t_prev, 04:31:04.000 |
divided by 1 minus alpha_prod_t. Why prod? Because 04:31:11.760 |
this is the alpha bar. And then we multiply by the current beta, 04:31:15.920 |
so current_beta_t; and beta_t is defined, I don't remember where, as 1 minus alpha_t. 04:31:29.840 |
Then we torch.clamp the variance, and the minimum that we want is 04:31:41.600 |
1e-20, to make sure that it doesn't reach zero, and then we return the variance. 04:31:52.240 |
and now that we have the mean and the variance so this variance has also been computed using 04:31:58.960 |
let me write here computed using formula seven of the ddpm paper 04:32:07.840 |
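A sketch of this variance computation (the method name _get_variance is an assumption):

```python
# Sketch of the variance (formula 7 of the DDPM paper); the method name is an assumption.
import torch

def _get_variance(self, timestep: int) -> torch.Tensor:
    prev_t = self._get_previous_timestep(timestep)
    alpha_prod_t = self.alphas_cumprod[timestep]
    alpha_prod_t_prev = self.alphas_cumprod[prev_t] if prev_t >= 0 else self.one
    current_beta_t = 1 - alpha_prod_t / alpha_prod_t_prev

    # beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
    variance = (1 - alpha_prod_t_prev) / (1 - alpha_prod_t) * current_beta_t
    # Clamp so that we never end up taking the square root of zero.
    variance = torch.clamp(variance, min=1e-20)
    return variance
```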
And now we go back to our step function. So what we do is check 04:32:17.760 |
if t is greater than zero, because we only need to add the variance if we are not at the last 04:32:24.480 |
time step. If we are at the last time step we have no noise, so we don't add any. 04:32:28.320 |
We don't need to add any noise, actually, because the point is we are going 04:32:36.880 |
to sample from this distribution and just like we did before we actually sample from the n01 and 04:32:42.240 |
then we shift it according to the formula: so a Gaussian with a particular mean and a 04:32:52.160 |
particular variance is equal to the Gaussian N(0, 1) multiplied by the standard deviation, 04:32:58.960 |
plus the mean. So we sample the noise. 04:33:26.080 |
okay we sample some noise compute the variance 04:33:42.640 |
already multiplied by the noise, so it's actually the standard deviation, 04:33:48.800 |
because, as we will see, it's self._get_variance at the time step t, to the power of 0.5. With this 04:33:58.240 |
0.5 it becomes the standard deviation, and we multiply it by the N(0, 1) sample. 04:34:04.720 |
so what we are doing is basically we are going from n01 04:34:09.520 |
to N(mu, sigma) with a particular mu and a particular sigma, using the usual trick of 04:34:18.080 |
x is equal to mu plus sigma (not sigma squared, which is the variance, but its square root) 04:34:25.520 |
multiplied by z, where z is distributed according 04:34:33.360 |
to N(0, 1). This is the same thing that we have always done, also for the variational auto- 04:34:39.520 |
encoder, and also for adding the noise before. This is how you sample from a 04:34:45.840 |
distribution, how you actually shift the parameters of the Gaussian distribution. 04:34:49.280 |
so predicted prev sample is equal to the predicted prev sample plus the variance 04:34:59.600 |
this variance term here already includes the sigma multiplied by z 04:35:03.600 |
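So the end of the step method might look like this sketch (variable names are assumptions):

```python
# Sketch of the end of step(); names are assumptions.
variance = 0
if timestep > 0:
    device = model_output.device
    # Sample z ~ N(0, I) and scale it by the standard deviation (sqrt of the variance).
    noise = torch.randn(model_output.shape, generator=self.generator,
                        device=device, dtype=model_output.dtype)
    variance = (self._get_variance(timestep) ** 0.5) * noise

# x_{t-1} = mean + sigma * z  (sigma = 0 at the last step)
pred_prev_sample = pred_prev_sample + variance
return pred_prev_sample
```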
And then we return pred_prev_sample. Okay, now we have also built 04:35:14.000 |
the sampler. Let me check if we have everything. No, we are still missing something, which is the 04:35:19.360 |
set_strength method: as you remember, we need it when we want to do image-to-image. So let's go 04:35:26.320 |
back to check our slides if we want to do image to image we convert the image using the vae to a 04:35:32.560 |
latent then we need to add noise to this latent but how much noise we can decide the more noise 04:35:38.640 |
we add the more freedom the unit will have to change this image the less noise we add the 04:35:43.360 |
less freedom it will have to change the image so what we do is basically by setting the strength 04:35:48.640 |
we make our sampler start from a particular noise level and this is exactly what the method we want 04:35:55.600 |
to implement so i made some mess okay so for example as soon as we load the image we set 04:36:02.640 |
the strength which will shift the noise level from which we start from and then we add noise 04:36:08.240 |
to our latent to create the image to image here so let's go here and we create this method called 04:36:20.720 |
Okay, we compute the start step, because we will skip some steps: 04:36:29.840 |
it is equal to self.num_inference_steps minus int(self.num_inference_steps * strength). 04:36:37.920 |
this basically means that if we have 50 inference steps and then we set the strength to let's say 04:36:48.000 |
0.8 it means that we will skip 20% of the steps so when we will add we will start from image to 04:36:55.200 |
image for example we will not start from a pure noise image but we will start from 80% of noise 04:37:01.760 |
in this image so the unit will still have freedom to change this image but not as much as with 100% 04:37:08.160 |
noise we redefine the time steps because we are altering the schedule so basically we skip some 04:37:17.920 |
time steps, and self.start_step is equal to start_step. So what we actually do here is: suppose we 04:37:29.840 |
have a strength of 80%, we are fooling the UNet into believing that it 04:37:35.440 |
came up with this image, which now has this level of noise, and now it needs to keep denoising 04:37:41.280 |
it. This is how we do image-to-image: we start with an image, we noise it, and then we make the 04:37:47.280 |
UNet believe that it came up with this image with this particular noise level, and now it has to keep 04:37:53.760 |
denoising it, according of course also to the prompt, until we reach the clean image without any noise. 04:38:01.120 |
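A sketch of the set_strength method just described (the method name and attributes are assumptions):

```python
# Sketch of set_strength; names are assumptions.
def set_strength(self, strength: float = 1.0):
    # With 50 inference steps and strength 0.8 we skip the first 10 steps (20% of them),
    # so the denoising starts from an 80%-noisy latent instead of pure noise.
    start_step = self.num_inference_steps - int(self.num_inference_steps * strength)
    self.timesteps = self.timesteps[start_step:]
    self.start_step = start_step
```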
Now we have the pipeline that we can call, we have the DDPM sampler, we have the model built. 04:38:12.080 |
of course we need to create the function to load the weights of this model so let's create another 04:38:18.000 |
file we will call it the model loader here model loader because now we are nearly close to sampling 04:38:27.760 |
finally, from this Stable Diffusion. So now we need to create the method to load the 04:38:32.160 |
pre-trained weights that we have downloaded before, so let's create it: 04:38:42.240 |
from encoder import VAE_Encoder, then from decoder import VAE_Decoder, 04:38:55.600 |
from diffusion import Diffusion, our diffusion model, which is our UNet. 04:39:05.040 |
Now let me first define it, then I'll tell you what we need to do: so, preload_models_from... 04:39:17.280 |
okay as usual we load the weights using torch but we use we will create another function 04:39:32.720 |
model converter dot load from standard weights 04:39:37.280 |
this is a method that we will create later to to load the weights 04:39:49.360 |
the pre-trained weights and i will show you why we need this method then we create our encoder 04:39:58.880 |
and we load the state dict, load_state_dict, from our state dict, 04:40:04.160 |
and we also set strict to True. 04:40:37.040 |
and strict also so this strict parameter here basically tells that when you load a model from 04:40:52.000 |
PyTorch, for example this .ckpt file here, it is a dictionary that contains many keys, 04:40:59.680 |
and each key corresponds to one matrix of our model. So for example this group 04:41:06.320 |
normalization has some parameters, and how can torch load these parameters exactly into this 04:41:12.720 |
group norm? By using the names of the variables that we have defined here. When we load a 04:41:20.480 |
model from PyTorch, it will actually load the dictionary, and then we load this dictionary 04:41:25.680 |
into our models and he will match by names now the problem is the pre-trained model 04:41:30.960 |
actually they don't use the same name that i have used and actually this code is based on another 04:41:36.720 |
code that i have seen so actually the the names that we use are not the same as the pre-trained 04:41:42.560 |
model also because the names in the pre-trained model not always uh very friendly for learning 04:41:49.120 |
this is why i changed the names and also other people changed the names of the methods but this 04:41:55.200 |
also means that the automatic mapping between the names of the pre-trained model and the names 04:42:01.440 |
defined in our classes here cannot happen because it cannot happen automatically because the names 04:42:06.400 |
do not match for this reason there is a script that i have created in my github library here 04:42:14.320 |
that you need to download to convert these names it's just a script that maps one name into another 04:42:20.160 |
so if the name is this one map it into this if the name is this one mapping into this 04:42:24.560 |
there is nothing special about this script it's just a very big mapping of the names and this is 04:42:30.560 |
actually done by most models because if you want to change the name of the classes and or the 04:42:36.960 |
variables then you need to do this kind of mapping so i will also i will basically copy it i don't 04:42:43.920 |
need to download the file so this will call the model converter.py model converter.py 04:42:52.320 |
and that's it it's just a very big mapping of names and i take it from this comment here on 04:43:00.480 |
github so this is model converter so we need to import this model converter import model converter 04:43:12.560 |
import this model converter basically will convert the names and then we can use the 04:43:17.840 |
load state dict and this will actually map all the names it's now now the names will map with 04:43:22.720 |
each other and this trick makes sure that if there is even one name that doesn't map 04:43:26.880 |
then throw an exception which is what i want because i want to make sure that all the names map 04:44:07.440 |
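Putting the whole loader together, a sketch might look like the following; the module and function names (model_converter, preload_models_from_standard_weights, VAE_Encoder, VAE_Decoder, CLIP, Diffusion) and the dictionary keys are assumptions based on the structure of this project.

```python
# Sketch of the model loader; module, class and function names are assumptions.
import model_converter
from encoder import VAE_Encoder
from decoder import VAE_Decoder
from clip import CLIP
from diffusion import Diffusion

def preload_models_from_standard_weights(ckpt_path: str, device):
    # Convert the names of the pre-trained checkpoint into the names used by our classes.
    state_dict = model_converter.load_from_standard_weights(ckpt_path, device)

    encoder = VAE_Encoder().to(device)
    encoder.load_state_dict(state_dict['encoder'], strict=True)

    decoder = VAE_Decoder().to(device)
    decoder.load_state_dict(state_dict['decoder'], strict=True)

    diffusion = Diffusion().to(device)
    diffusion.load_state_dict(state_dict['diffusion'], strict=True)

    clip = CLIP().to(device)
    clip.load_state_dict(state_dict['clip'], strict=True)

    return {
        'clip': clip,
        'encoder': encoder,
        'decoder': decoder,
        'diffusion': diffusion,
    }
```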
then we do clip is equal to clip dot to device so we move it to device where we 04:44:13.360 |
want to work and then we load also his state dict so the parameters of the weights 04:44:35.760 |
of clip. And then we return a dictionary: the encoder is the encoder, the decoder 04:44:42.960 |
is the decoder, then the diffusion is the diffusion, etc. 04:44:51.600 |
now we have all the ingredients to run finally the inference guys so thank you for being patient so 04:44:58.800 |
much and it's really finally we have we can see the light coming so let's build our notebook so we 04:45:07.680 |
can visualize the image that we will build okay let's select the kernel stable diffusion i already 04:45:16.640 |
created it in my repository you will also find the requirements that you need to install in order to 04:45:23.360 |
run this so let's import everything we need so the model loader the pipeline 04:45:30.000 |
from PIL import Image, this is how to load an image in Python; then from pathlib import... actually this one 04:45:41.040 |
we don't need transformers this is the only library that we will be using because there 04:45:48.000 |
is the tokenizer of the clip so how to tokenize the the text into tokens before sending it to 04:45:54.080 |
the clip embeddings otherwise we also need to build the tokenizer and it's really a lot of 04:45:59.120 |
work. I don't allow CUDA and I also don't allow MPS, but you can activate these two. 04:46:19.920 |
If you enable the ALLOW_CUDA variable, then the device becomes cuda, of course. 04:47:11.680 |
okay let's load the tokenizer tokenizer is the clip tokenizer we need to tell him what is the 04:47:17.760 |
vocabulary file so which is already saved here in the data data vocabulary.json and then also the 04:47:25.440 |
merges file maybe one day i will make a video on how the tokenizer works so we can build also the 04:47:32.320 |
tokenizer but this is something that requires a lot of time i mean and it's not really related 04:47:38.880 |
to the diffusion model so that's why i didn't want to build it the model file is i will use the data 04:47:45.920 |
and then this file here then we load the model so the models are model loader dot preload model from 04:47:54.480 |
the model file into this device that we have selected okay let's build from text to image 04:48:02.640 |
what we need to define the prompt for example i want a cat 04:48:08.160 |
sitting or stretching let's say stretching on the floor highly detailed we need to create a 04:48:18.560 |
prompt that will create a good image so we need to add some a lot of details ultra sharp cinematic 04:48:25.360 |
etc etc 8k resolution the unconditioned prompt 04:48:32.720 |
I keep it blank. You can also use it as a negative 04:48:42.720 |
prompt: so if you don't want the output to have some, how to say, some 04:48:50.800 |
characteristics you can define it in the negative prompt of course i like to do cfg so the 04:48:57.120 |
classifier free guidance which we set to true cfg scale is a number between 1 and 14 which 04:49:05.760 |
indicates how much attention we want the model to pay to this prompt 14 means pay 04:49:10.560 |
very much attention or 1 means we pay very little attention i use 7 04:49:17.920 |
then we can define also the parameters for image to image 04:49:20.480 |
so input image is equal to none image path is equal to i will define it with my 04:49:31.440 |
image of the dog which i already have here and um but for now i don't want to load it 04:49:38.960 |
so if we want to load it we need to do input image is equal to image.open 04:49:48.480 |
i will not use it so now let's comment it and if we use it we need to define the strength 04:49:56.960 |
so how much noise we want to add to this image but for now let's not use it 04:50:00.080 |
the sampler we will be using of course is the only one we have is the ddpm 04:50:10.880 |
50 and the seed is equal to 42 because it's a lucky number at least according to some books 04:50:19.280 |
output image is equal to pipeline generate okay the prompt is the prompt that we have defined 04:50:31.280 |
the unconditioned prompt is the unconditioned prompt that we have defined 04:50:36.240 |
input image is the input image that we have defined if it's not commented of course 04:50:53.200 |
the sampler name is the sampler name we have defined 04:51:01.760 |
the number of inference steps is the number of inference steps the seed 04:51:14.720 |
idle device is our cpu so when we don't want to use something we move it to the cpu 04:51:27.200 |
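So the full call in the notebook might look like this sketch; the exact parameter names of pipeline.generate are assumptions based on the arguments listed above.

```python
# Sketch of the inference call; parameter names of pipeline.generate are assumptions.
output_image = pipeline.generate(
    prompt=prompt,
    uncond_prompt=uncond_prompt,          # empty string, or a negative prompt
    input_image=input_image,              # None for text-to-image
    strength=strength,                    # only used for image-to-image
    do_cfg=do_cfg,
    cfg_scale=cfg_scale,
    sampler_name=sampler,                 # "ddpm"
    n_inference_steps=num_inference_steps,
    seed=seed,
    models=models,
    device=DEVICE,
    idle_device="cpu",                    # move unused models to the CPU
    tokenizer=tokenizer,
)
Image.fromarray(output_image)
```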
And then Image.fromarray(output_image). If everything is done well, if all the code has 04:51:37.680 |
been written correctly you can always go back to my repository and download the code if you 04:51:43.280 |
don't want to write it by yourself let's run the code and let's see what is the result my 04:51:48.880 |
computer will take a while so it will take some time so let's run it so if we run the code it will 04:51:56.800 |
generate an image according to our prompt in my computer it took really a long time so i cut the 04:52:01.520 |
video and i actually already replaced the code with the one from my github because now i want 04:52:08.240 |
to actually explain you the code without while showing you all the code together how does it 04:52:14.320 |
work so now we we generated an image using only the prompt i use the cpu that's why it's very slow 04:52:20.400 |
because my GPU is not powerful enough. We set the unconditioned prompt to an empty string, we are using 04:52:25.840 |
classifier-free guidance with a scale of seven. So let's go into the pipeline and let's see 04:52:30.880 |
what happens so basically because we are doing the classifier free guidance we will generate 04:52:36.400 |
two conditioning signals one with the prompt and one with empty text which is the unconditioned 04:52:43.280 |
prompt which is also called the negative prompt this will result in a batch size of two that will 04:52:50.560 |
run through the unit so let's go back to here suppose we are doing text to image so now our 04:52:56.240 |
UNet has two latents that it is processing at the same time, because we have the batch size equal to two. 04:53:01.680 |
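For reference, here is a sketch of how the two predictions are then combined by classifier-free guidance inside the denoising loop; the variable names, and the order of the two halves of the batch, are assumptions.

```python
# Sketch of the classifier-free guidance combination; names are assumptions.
if do_cfg:
    # The batch of size 2 contains the conditioned and the unconditioned prediction.
    output_cond, output_uncond = model_output.chunk(2)
    model_output = cfg_scale * (output_cond - output_uncond) + output_uncond
```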
And for each of them it is predicting the noise. But how can we remove this 04:53:10.160 |
predicted noise from the initial noise? Because to generate an image we start from random 04:53:17.680 |
noise in the latent space, along with the prompt, so we have a latent which is still pure 04:53:24.960 |
noise, and with the UNet we predict how much noise is in it, according to a schedule. So 04:53:31.040 |
according to 50 steps that of inferencing that we will be doing at the beginning the first step will 04:53:37.360 |
be 1000 the next step will be 980 the next step will be 960 etc so this time will change according 04:53:44.720 |
to this schedule so that at the 50th step we are at the time step 0 and how can we then with the 04:53:55.440 |
predicted noise go to the next latent, so remove this noise that was predicted by the UNet? Well, we 04:54:01.840 |
do it with the sampler, and in particular we do it with the step method 04:54:09.040 |
of the sampler, which basically will calculate the previous sample given the current sample, 04:54:14.640 |
according to the formula number 7 here so which basically calculates the previous sample 04:54:21.120 |
given the current one so the less noisy one given the current one and the predicted x0 so this is 04:54:27.840 |
not x0 because we don't have x0 so we don't have the noise the sample without any noise so but we 04:54:34.880 |
can predict it given the values of the current noise and the beta schedule another way of denoising 04:54:42.480 |
is to do the sampling like this: if you watch my other repository about the DDPM paper, I actually 04:54:48.000 |
implemented it like this if you want to see this version here and this is how we remove the noise 04:54:54.160 |
to get a less noisy version so once we get the less noisy version we keep doing this process 04:54:59.920 |
until there is no more noise so we are at the time step zero in which we have no more noise 04:55:04.800 |
we give this latent to the decoder which will turn it into an image this is how the text to image 04:55:09.920 |
works the image to image on the other side so let's try to do the image to image so to do the 04:55:15.200 |
image to image we need to go here and we uncomment this code here this allows us to start with the 04:55:25.680 |
dog and then give for example some prompt for example we want this dog here we want to say 04:55:31.840 |
okay we want a dog stretching on the floor highly detailed etc we can run it i will not run it 04:55:38.560 |
because it will take another five minutes and if we do this we can set a strength of let's say 0.6 04:55:45.760 |
which means that let's go here so we set a strength of 0.6 so we have this input image 04:55:54.000 |
A strength of 0.6 means that we will encode the image with the variational autoencoder, it will 04:55:59.680 |
become a latent, and we will add some noise. But how much noise? Not all the noise, so that it becomes 04:56:06.560 |
completely noise, but less than that. So let's say 60 percent noise; this is not really 04:56:14.080 |
accurate, because it depends on the schedule, but in our case it's linear, so it can be considered 04:56:21.040 |
60 percent of noise we then give this image to the scheduler which will start not from the 1000 04:56:28.240 |
step it will start before so if we set the strength to 0.6 it will start from the 600 04:56:34.800 |
step and then move by 20 we'll keep going 600 then 580 then 560 then 540 etc until it reaches 20 04:56:46.640 |
so in total it will do less steps because we start from a less noisy example but at the same 04:56:52.560 |
time, because we start with less noise, the UNet also has less freedom to alter the 04:57:00.720 |
image, because it already has the image, so it cannot change it too much. So how do you adjust 04:57:07.040 |
the noise level? It depends: if you want the UNet to pay very much attention to the input 04:57:13.440 |
image and not change it too much, then you add less noise. If you want to completely change the 04:57:20.800 |
original image then you can add all the possible noise so you set the strength to one and this is 04:57:25.280 |
how the image to image works i didn't implement the inpainting because the reason is that the 04:57:32.880 |
pre-trained model here so the model that we are using is not fine-tuned for inpainting so if you 04:57:38.080 |
go on the website and you look at the model card they have another model for inpainting which has 04:57:45.280 |
different weights here the this one here but this the structure of this model is also a little 04:57:53.120 |
different because they have in the unit they have five additional input channels for the mask 04:57:58.800 |
i will of course implement it in my repository directly so i will modify the code and 04:58:07.440 |
also implement the code for inpainting so that we can support this model but unfortunately i don't 04:58:12.960 |
have the time now, because here in China it is Guoqing (the National Day holiday) and I'm going to my laojia (hometown) with my wife, so we are 04:58:19.760 |
a little short of time but i hope that with my video guys you you got really into stable diffusion 04:58:26.240 |
and you understood what is happening under the hood instead of just using the hugging face library 04:58:31.440 |
and also notice that the model itself is not so particularly sophisticated if you check the 04:58:39.520 |
decoder and the encoder they are just a bunch of convolutions and upsampling and the normalizations 04:58:47.200 |
just like any other computer vision model, and the same goes for the UNet. Of course there are very 04:58:53.280 |
smart choices in how they do it okay but that's not the important thing of the diffusion and 04:58:59.440 |
actually if we study the diffusion models like score models you will see that it doesn't even 04:59:03.680 |
matter the structure of the model as long as the model is expressive it will actually learn the 04:59:09.040 |
score function in the same way but this is not our case in this video i will talk about score model 04:59:14.000 |
in future videos what i want you to understand is that how this all mechanism works together 04:59:20.400 |
so how can we just learn a model that predicts the noise and then we come up with images and 04:59:28.160 |
let me rehearse again the idea so we started by training a model that needs to learn a probability 04:59:35.760 |
distribution, as you remember, p theta here. We cannot learn this one directly, because we don't 04:59:43.120 |
know how to marginalize here. So what we did is we found a lower bound for this quantity here, and 04:59:49.120 |
we maximize this lower bound how do we maximize this lower bound by training a model by running 04:59:54.800 |
the gradient descent on this loss this loss produces a model that allow us to predict the 05:00:03.760 |
noise then how do we actually use this model with the predicted noise to go back in time with the 05:00:10.880 |
noise because the forward process we know how to go it's defined by us how to add noise but in back 05:00:15.680 |
in time so how to remove noise we don't know and we do it according to the formulas that i have 05:00:20.880 |
described in the sampler so the formula number seven and the formula number also this one actually 05:00:27.360 |
we can use actually i will show you in my other um here i have another repository i think it's 05:00:33.760 |
called python ddpm in which i implemented the ddpm paper but by using this algorithm here so 05:00:39.840 |
if you are interested in this version of the denoising you can check my other uh repository 05:00:44.720 |
here this one ddpm and i also wanted to show you how the inpainting works how the how the 05:00:52.720 |
image to image and how the text to image works of course the possibilities are limitless it all 05:00:59.600 |
depends on the powerfulness of the model and how you use it and i hope you use it in a clever way 05:01:07.200 |
to build amazing products i also want to thank very much many repositories that i have used 05:01:13.600 |
as a self-studying material so because of course i didn't make up all this by myself i studied a 05:01:18.720 |
lot of papers i read i think to study this diffusion models i read more than 30 papers 05:01:23.600 |
in the last few weeks so it took me a lot of time but i was really passionate about this kind of 05:01:30.240 |
models because they're complicated and i really like to study things that can generate new stuff 05:01:35.600 |
so i want to really thank in particularly some resources that i have used let me see 05:01:42.880 |
this one's here so the official code the this guy divam gupta this other repository from 05:01:49.200 |
this person here which i used very much actually as a base and the diffusers library from this 05:01:56.640 |
hugging face upon which i based most of the code of my sampler because i think it's better to use 05:02:03.040 |
because we are actually just applying some formulas there is no point in writing it from 05:02:06.960 |
zero the point is actually understanding what is happening with these formulas and why we are doing 05:02:11.200 |
it the things we are doing and as usual the full code is available i will also make all the slides 05:02:16.480 |
available for you guys and i hope if you are in china you also have a great holiday with me and 05:02:21.920 |
if you're not in china i hope you have a great time with your family and friends and everyone 05:02:25.840 |
else so welcome back to my channel anytime and please feel free to comment on send me a comment 05:02:31.840 |
or if you didn't understand something or if you want me to explain something better because i'm 05:02:37.520 |
always available for explanation and guys i do this not as my full-time job of course i do it 05:02:43.920 |
as a part-time and lately i'm doing consulting so i'm very busy but sometime i take time to record 05:02:51.040 |
videos and so please share my channel share my video with people if you like it and so that my 05:02:57.680 |
channel can grow and i have more motivation to keep doing this kind of videos which take really 05:03:02.800 |
a lot of time because to prepare a video like this i spend around many weeks of research but 05:03:08.560 |
this is okay i do it for as a passion i don't do it as a job and i spend really a lot of time 05:03:14.640 |
preparing all the slides and preparing all the speeches and preparing the code and cleaning it 05:03:20.080 |
and commenting it etc etc i always do it for free so if you would like to support me the best way is 05:03:26.400 |
to subscribe like my video and share it with other people thank you guys and have a nice day