Variational Autoencoder - Model, ELBO, loss function and maths explained easily!
Chapters
0:00 Introduction
0:41 Autoencoder
2:35 Variational Autoencoder
4:20 Latent Space
6:06 Math introduction
8:45 Model definition
12:00 ELBO
16:05 Maximizing the ELBO
19:49 Reparameterization Trick
22:41 Example network
23:55 Loss function
00:00:00.000 |
Welcome to my video about the variational autoencoder. 00:00:02.840 |
In this video, I will be introducing the model, how it works, the architecture, 00:00:10.360 |
Why should you learn about the variational autoencoder? 00:00:13.160 |
Well, it's one of the building blocks of the stable diffusion, 00:00:16.960 |
and if you can understand the maths behind the variational autoencoder, 00:00:20.240 |
you have covered more than 50% of the maths that you need for the stable diffusion. 00:00:25.640 |
At the same time, I will also try to simplify the math as much as possible 00:00:29.400 |
so that everyone with whatever background can follow the video. 00:00:34.720 |
Before we go into the details of variational autoencoder, 00:00:37.360 |
we need to understand what is an autoencoder. 00:00:41.880 |
So the autoencoder is a model that is made of two smaller models. 00:00:46.520 |
The first is the encoder, the second the decoder, 00:00:49.240 |
and they are joined together by this bottleneck Z. 00:00:52.200 |
The goal of the encoder is to take some input 00:00:55.160 |
and convert it into a lower dimensional representation, let's call it Z, 00:01:00.240 |
and then if we take this lower dimensional representation and give it to the decoder, 00:01:05.600 |
we hope that the model will reproduce the original data. 00:01:10.960 |
Because we want to compress the original data into a lower dimension. 00:01:19.040 |
We can have an analogy with file compression. 00:01:21.360 |
For example, if you have a picture, let's call it zebra.jpg, 00:01:24.400 |
and you zip the file, you will end up with a zebra.zip file. 00:01:28.800 |
And if you unzip the file or decompress the file, 00:01:34.200 |
The difference between the autoencoder and file compression is that 00:01:39.520 |
the neural network will not reproduce the exact original input, 00:01:44.920 |
but will try to reproduce as much as possible of the original input. 00:01:54.760 |
that is, the lower representation of the data should be as small as possible, 00:01:58.680 |
and the reconstructed input should be as close as possible to the original input. 00:02:05.640 |
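As a concrete sketch of this encoder/bottleneck/decoder structure, here is a minimal autoencoder in PyTorch; the layer sizes and input dimension are purely illustrative and are not taken from the video.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input into the bottleneck code Z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: try to reconstruct the original input from Z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)       # lower-dimensional code
        x_hat = self.decoder(z)   # reconstruction
        return x_hat, z

# Training minimizes the reconstruction error, i.e. it pushes x_hat
# to be as close as possible to the original input x.
x = torch.randn(16, 784)          # a dummy batch
model = AutoEncoder()
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)
```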
The problem with autoencoders is that the code learned by the model doesn't make sense. 00:02:11.280 |
That is, the model just learns a mapping between input data and a code Z, 00:02:17.680 |
but doesn't learn any semantic relationship between the data. 00:02:21.040 |
For example, if we look at the code learned for the picture of the tomato, 00:02:25.200 |
it's very similar to the code learned for the picture of the zebra, 00:02:28.880 |
or the code for the cat is very similar to the code learned for the pizza, for example. 00:02:33.280 |
So the model didn't capture any relationship between the data 00:02:36.840 |
or any semantic relationship between the data. 00:02:39.160 |
And this is why we introduced the variational autoencoder. 00:02:42.120 |
In the variational autoencoder, we learn a latent space. 00:02:47.800 |
which represents a multivariate distribution over this data. 00:02:52.240 |
And we hope that this multivariate distribution, so this latent space, 00:02:56.720 |
captures also the semantic relationship between the data. 00:02:59.800 |
So for example, we hope that all the food pictures have a similar representation in this latent space, 00:03:05.440 |
and also all the animals have a similar representation, 00:03:08.160 |
and all the cars and all the buildings, for example the stadium, 00:03:13.200 |
And the most important thing that we want to do with this variational autoencoder 00:03:17.640 |
is we want to be able to sample from this latent space to generate new data. 00:03:23.440 |
So what does it mean to sample the latent space? 00:03:26.440 |
Well, for example, when you use Python to generate a random number between 1 and 100, 00:03:32.280 |
you're actually sampling from a uniform distribution, 00:03:37.560 |
because every number has equal probability of being chosen. 00:03:40.440 |
We can sample from the latent space to generate a new random vector, 00:03:44.280 |
give it to the decoder and generate new data. 00:03:46.880 |
For example, if we sample from this latent space, 00:03:49.560 |
which is the latent space of a variational autoencoder that was trained on food pictures, 00:03:54.600 |
and we happen to sample something that was exactly in between these three pictures, 00:04:00.040 |
we hope to get something that also in its meaning is similar to these three pictures. 00:04:05.680 |
So for example, in between the picture of egg, flour and basil leaves, 00:04:10.160 |
we hope to find pasta with basil, for example. 00:04:13.280 |
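As a sketch of what "sampling the latent space" could look like in code: the decoder below is only a stand-in for a trained VAE decoder, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

latent_dim = 32
# Stand-in for a trained VAE decoder (illustrative only).
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784))

# Sample a new latent vector from the prior N(0, I) and decode it into new data.
z = torch.randn(1, latent_dim)
new_sample = decoder(z)

# "In between" existing pictures: average their latent codes and decode the result.
z_food = torch.randn(3, latent_dim)           # stand-ins for the codes of egg, flour, basil
z_between = z_food.mean(dim=0, keepdim=True)
maybe_pasta = decoder(z_between)
```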
Which means that the model has captured somehow the relationship between the data it was trained upon, 00:04:23.560 |
Because we model our data as coming from a random variable X, 00:04:32.360 |
but this variable X is conditioned on another random variable Z that is not visible to us, 00:04:42.760 |
And we will model this hidden variable as a multivariate Gaussian with a mean and a variance. 00:04:50.040 |
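In symbols, and hedging on the exact notation used on the video's slides, the standard VAE generative model from Kingma and Welling takes the latent variable from a multivariate Gaussian prior and generates the observed data conditioned on it:

```latex
z \sim p(z) = \mathcal{N}(z;\, 0,\, I),
\qquad
x \sim p_\theta(x \mid z),
```

where z is the hidden (latent) variable and θ are the parameters of the decoder.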
I know that this all sounds very abstract, so let me give you a more concrete example. 00:05:01.320 |
we have some people who were born in this cave and have lived all their life there since childhood. 00:05:09.680 |
And these people never left the cave, so they only stayed in this area of the cave. 00:05:17.840 |
These people, since childhood, have seen these pictures on the cave wall 00:05:22.840 |
that are projected from these 3D objects through this fire. 00:05:27.320 |
So they are the shadow of these 3D objects here. 00:05:30.080 |
But these people, they don't know that these pictures are actually cast from these 3D objects. 00:05:37.440 |
For them, the horse is something black that moves like this. 00:05:40.480 |
The bird is something black that moves like this. 00:05:43.360 |
So we need to think that we are just like these people. 00:05:49.960 |
But this data actually comes from something that we cannot observe, 00:05:53.880 |
that is, a higher-level representation of this data, 00:05:59.320 |
And we want to learn something about this abstract representation. 00:06:03.000 |
Before we go into the maths of variational autoencoder, 00:06:08.440 |
Because the math is going to be a little hard to follow for some people 00:06:13.920 |
The point is, in order to understand the variational autoencoder, 00:06:19.480 |
Not only the numerical math, but also the concept. 00:06:22.440 |
So what I will try to do is to give you the necessary background to understand the math, 00:06:29.360 |
But at the same time, I will also try to convey some general information, 00:06:34.360 |
some high-level representation of what is happening 00:06:39.160 |
Also, I believe that the VAE is the most important component of stable diffusion models. 00:06:43.400 |
So concepts like ELBO that we will see in the following slides 00:06:49.640 |
it will make it easy for you to understand the stable diffusion. 00:06:52.800 |
Plus, in 2023, I think you shouldn't be memorizing things, 00:06:56.600 |
so just memorizing the architecture of models, 00:06:58.880 |
because ChatGPT can do that faster and better than you. 00:07:04.560 |
You can't be a machine and compete with a machine. 00:07:06.600 |
I also believe that you should try to learn things not only out of curiosity, 00:07:11.640 |
but because that's the true engine of innovation and creativity. 00:07:17.120 |
So let's start by introducing some math concepts 00:07:23.560 |
Don't be scared if you are not familiar with these concepts, 00:07:26.120 |
because I will try to give a higher-level view of what is happening. 00:07:32.440 |
you will still understand what is happening on a higher level. 00:07:36.480 |
We need to know what the expectation of a random variable is, which is this. 00:07:40.120 |
We need the chain rule of probability, which is this, 00:07:44.800 |
All of these three concepts are usually taught in a bachelor's class, 00:07:50.560 |
And another concept that is usually not taught in a bachelor's, 00:07:53.240 |
but I will introduce now, is the Kullback-Leibler divergence. 00:07:56.720 |
This is a very important concept in machine learning, 00:07:59.200 |
and it's a divergence measure that allows you to measure 00:08:02.760 |
the distance between two probability distributions. 00:08:11.720 |
how far apart these two probability distributions are. 00:08:15.600 |
But at the same time, this is not a distance metric, 00:08:20.120 |
So when you have a distance metric, usually, for example, 00:08:25.720 |
if A is one meter apart from B, then B is also one meter apart from A. 00:08:29.680 |
But this doesn't happen with the Kullback-Leibler divergence. 00:08:42.760 |
and it's equal to zero if and only if the two distributions are the same. 00:08:50.160 |
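For reference, these are the standard definitions being referred to here (the slide's exact notation may differ):

```latex
\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx,
\qquad
p(x, z) = p(x \mid z)\, p(z),
```

```latex
D_{KL}\big(q \,\|\, p\big) = \int q(x) \log \frac{q(x)}{p(x)}\, dx \;\ge\; 0,
\qquad
D_{KL}(q \,\|\, p) \neq D_{KL}(p \,\|\, q) \text{ in general},
\qquad
D_{KL}(q \,\|\, p) = 0 \iff q = p.
```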
Now, we saw before that we want to model our data 00:08:53.600 |
as coming from a random distribution that we call X, 00:08:56.720 |
which is conditioned on a hidden variable or latent variable called Z. 00:09:02.880 |
marginalize over the joint probability using this relationship here. 00:09:09.760 |
because we need to integrate over all possible values of the latent variable Z. 00:09:15.680 |
It means that in theory, we can calculate it. 00:09:18.440 |
But in practice, it is so slow and so computationally expensive 00:09:23.280 |
So something intractable is like trying to guess your neighbor's Wi-Fi password. 00:09:27.960 |
In theory, you can do it by generating all possible passwords 00:09:31.800 |
But in practice, it will take you thousands of years. 00:09:34.440 |
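Written out in standard notation, the intractable marginal being described, and the Bayes rewriting that the next lines refer to, are:

```latex
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz,
\qquad
p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}.
```

The integral runs over every possible value of z, and the posterior on the right needs the very marginal we cannot compute, which is the chicken-and-egg problem mentioned below.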
So this relationship, we can also write it like this 00:09:44.840 |
but we need the ground truth of this, which we don't have, 00:09:48.400 |
because this is the probability distribution over the latent space 00:09:57.480 |
So this looks like a chicken and egg problem, 00:10:00.640 |
because we are trying to find this using this, 00:10:12.320 |
Usually, when you cannot find something that you want, 00:10:23.160 |
And we think that it's parametrized by some parameters theta 00:10:31.320 |
However, what if we could find something that is a surrogate, 00:10:45.200 |
We start with the log likelihood of our data, 00:10:57.080 |
of a probability distribution, which is always equal to one. 00:11:00.360 |
And we can bring this quantity inside the integral, 00:11:03.760 |
because it doesn't depend on the variable that is integrated. 00:11:11.240 |
We can see that this integral is actually an expectation. 00:11:17.920 |
we can apply the equation given by the chain rule of probability. 00:11:22.560 |
We can multiply the numerator and the denominator 00:11:40.720 |
And finally, we can see that the second expectation 00:11:45.840 |
And we know that it's always greater than or equal to zero. 00:11:48.880 |
Now, let me expand this relationship that we have found. 00:11:55.160 |
is equal to this quantity plus this KL divergence. 00:12:11.040 |
that is always greater than or equal to zero. 00:12:17.480 |
without knowing anything about the quantities involved? 00:12:32.720 |
which is always greater than or equal to zero. 00:12:34.720 |
Without knowing anything about your base salary 00:12:41.480 |
that your total compensation is always greater than 00:12:46.360 |
Now, this expression here has the same structure 00:12:52.360 |
So we can infer the same for the first expression. 00:12:56.520 |
That is, the first quantity is always greater than 00:13:00.800 |
without caring what happens to the third quantity. 00:13:03.240 |
So this also means that this is a lower bound for this. 00:13:21.840 |
we are going to automatically maximize this quantity. 00:13:32.560 |
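For reference, the identity this derivation arrives at, in the notation of Kingma and Welling's paper, is:

```latex
\log p_\theta(x)
= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right]}_{\text{ELBO, } \mathcal{L}(\theta, \phi; x)}
\;+\;
\underbrace{D_{KL}\!\big( q_\phi(z \mid x) \,\|\, p_\theta(z \mid x) \big)}_{\ge\, 0}
```

Since the KL term is never negative, the ELBO is a lower bound on the log likelihood; and because the left-hand side does not depend on the approximate posterior q, raising the ELBO with respect to q also shrinks that KL term.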
And then we can see that the second expectation 00:13:43.360 |
with the probability distribution we see here. 00:14:13.160 |
and at the same time, you are minimizing this. 00:14:23.640 |
and your company has profit, revenue, and cost. 00:14:32.280 |
and at the same time, you maximize your cost. 00:14:38.720 |
we are actually maximizing this first quantity here, 00:14:42.960 |
we are minimizing the second quantity we see here. 00:14:46.240 |
Now, let's look at what these quantities mean. 00:14:49.200 |
And for that, I took this picture from a paper 00:14:57.760 |
and we can see that this is a log likelihood. 00:15:16.440 |
So this is what we want our Z space to look like, 00:15:21.760 |
we want our Z space to be a multivariate Gaussian, 00:15:25.000 |
and this is the learned distribution by the model. 00:15:34.000 |
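The two quantities being discussed here correspond to the usual rewriting of the ELBO (again following the paper's notation):

```latex
\mathcal{L}(\theta, \phi; x)
= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\big[ \log p_\theta(x \mid z) \big]}_{\text{reconstruction log-likelihood}}
\;-\;
\underbrace{D_{KL}\!\big( q_\phi(z \mid x) \,\|\, p(z) \big)}_{\text{how far the learned } z \text{ space is from the prior}}
```

Maximizing the ELBO therefore maximizes the first term and minimizes the second, which is exactly what is described just below.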
the model actually is minimizing the distance 00:15:59.560 |
Now, the problem is when you maximize something 00:16:24.360 |
we take the gradient and adjust the weights of the model 00:16:29.560 |
And this is also what happens when we train our models. 00:16:32.720 |
For example, imagine we have a function that is convex, 00:16:38.840 |
and our minimum is here, our initial weights are here, 00:16:48.560 |
to where the direction of growth of the function, 00:16:52.600 |
so the function is growing in this direction, 00:16:58.840 |
And the problem is we are not calculating the true gradient 00:17:12.440 |
stochastic gradient descent and not just gradient descent? 00:17:17.760 |
you need to evaluate the function over the whole data set, 00:17:28.960 |
you get a distribution over the possible gradient. 00:17:32.760 |
So for example, when we use stochastic gradient descent 00:17:36.480 |
and we evaluate the gradient of our loss function, 00:17:44.840 |
And someone proved that if you do it long enough, 00:17:47.960 |
so if you do it over the entire training set, 00:17:55.280 |
Now, the fact is that the variance of this gradient estimate, 00:18:05.680 |
in stochastic gradient descent, is small enough 00:18:08.840 |
so that we can use stochastic gradient descent. 00:18:14.400 |
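The standard result being alluded to here ("someone proved that...") is that the mini-batch gradient is an unbiased estimator of the full gradient; stated loosely:

```latex
\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell_i(\theta),
\qquad
\mathbb{E}_{\mathcal{B}}\!\left[ \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta \ell_i(\theta) \right] = \nabla_\theta L(\theta),
```

where B is a mini-batch drawn uniformly from the training set: on average the noisy gradient points in the true direction, and its variance is low enough in practice to be usable.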
if we do the same job with this one, we get an estimator. 00:18:19.880 |
a stochastic quantity, that has a high variance, 00:18:25.340 |
So if we look at the paper by Kingma and Welling, 00:18:28.180 |
they show that there is an estimator for the ELBO, 00:18:32.580 |
and this estimator, however, exhibits a very high variance. 00:18:38.700 |
imagine we are trying to minimize our function. 00:18:41.380 |
If we use an estimator that has high variance, 00:18:45.700 |
suppose we are here and the minimum of the model is here. 00:18:49.300 |
If we are lucky, when we calculate the gradient, 00:18:58.060 |
However, if we are unlucky because it has high variance, 00:19:01.620 |
the model may return a very different gradient 00:19:04.540 |
than what we expect, for example, in this direction, 00:19:06.900 |
and then we will move to the opposite direction, 00:19:09.180 |
which is this one, so it will take us far from the minimum, 00:19:14.740 |
So we cannot use an estimator that has high variance. 00:19:24.700 |
But because it has a high variance, 00:19:53.740 |
the source of randomness outside of the model, 00:19:56.860 |
and we will call it reparameterization trick. 00:19:59.780 |
So the reparameterization trick means basically 00:20:02.260 |
that we take the stochastic component outside of z, 00:20:14.100 |
combine it with the parameters learned by the model, 00:20:19.420 |
which is the mean and the sigma of our multivariate Gaussian 00:20:24.420 |
and then we will run backpropagation through it. 00:20:34.300 |
and we can see here that when we run backpropagation, 00:20:41.580 |
but before, this node here was random, was stochastic, 00:20:46.940 |
so we couldn't run backpropagation through it 00:20:55.820 |
However, if we take the randomness outside of this node 00:21:09.780 |
will also calculate the gradient along this path, 00:21:17.220 |
We will choose a random source that is fixed. 00:21:24.180 |
in case we are using a multivariate Gaussian, 00:21:27.300 |
and so now we can actually calculate the backpropagation. 00:21:30.060 |
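A minimal sketch of the reparameterization trick in PyTorch (variable names and shapes are illustrative): the noise ε comes from a fixed source N(0, I), and z = μ + σ·ε is a deterministic function of the encoder's outputs, so backpropagation can flow through μ and σ but not through ε.

```python
import torch

# Outputs of the encoder for one batch (illustrative shapes and values).
mu = torch.zeros(16, 32, requires_grad=True)       # mean of q(z|x)
log_var = torch.zeros(16, 32, requires_grad=True)  # log sigma^2 of q(z|x)

# Reparameterization: move the randomness into a fixed external source epsilon.
eps = torch.randn_like(mu)       # epsilon ~ N(0, I), no gradient flows through it
std = torch.exp(0.5 * log_var)   # sigma = exp(0.5 * log sigma^2)
z = mu + std * eps               # z is a deterministic function of mu, sigma, eps

# Backpropagation now reaches mu and log_var.
z.sum().backward()
print(mu.grad.shape, log_var.grad.shape)
```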
Plus, this estimator that we found has lower variance. 00:21:35.900 |
in which we replaced the stochastic quantity here, 00:21:43.260 |
which is actually coming from our noise source, 00:21:47.740 |
We combine it with the parameters learned from the model 00:21:56.980 |
This is also called the Monte Carlo estimator. 00:22:00.300 |
We can also prove that this new estimator is unbiased. 00:22:06.060 |
it will actually converge to the true gradient. 00:22:10.580 |
So if we take the gradient of this estimator, 00:22:16.540 |
we can see that we can write this quantity here like this, 00:22:25.900 |
doesn't depend on the parameters of this estimation, 00:22:33.260 |
and then we can write this quantity inside this one here 00:23:00.140 |
So now I want to combine all this knowledge together 00:23:02.860 |
to simulate what the network will actually do. 00:23:06.060 |
So imagine we have a picture here of something. 00:24:04.100 |
but I will try to simplify the meaning behind it. 00:24:11.180 |
We can see it here and it's made of two components. 00:24:23.180 |
is from what we want our distribution to look like. 00:24:26.420 |
And the second one is the quality of the reconstruction, 00:24:36.980 |
how our image is different from the original image, 00:24:40.620 |
so the reconstructed sample from the original sample. 00:24:45.820 |
allows us to calculate the KL divergence between the prior, 00:24:51.460 |
and what is actually the Z space learned by the model. 00:25:15.460 |
plus the sigma learned by the model multiplied, 00:25:42.340 |
we should force our model to learn a positive quantity, 00:25:48.180 |
So we just pretend that we are learning log sigma squared, 00:25:51.420 |
and then we transform it back into sigma squared. 00:26:04.540 |
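Putting the pieces of the loss together as a sketch (PyTorch; mean squared error is used here for the reconstruction term as one common choice, and the closed-form Gaussian KL term is the one from the Kingma and Welling paper), assuming the encoder outputs mu and log sigma squared so that sigma squared = exp(log sigma squared) is automatically positive:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    # Reconstruction term: how far the reconstructed sample is from the original.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2 I) and the prior N(0, I):
    # KL = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Minimizing this loss corresponds to maximizing the ELBO.
    return recon + kl
```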
of what is the ELBO, so the ELBO is something 00:26:09.460 |
And I also wanted to show you the derivation of this ELBO 00:26:14.780 |
because this is the same problems that we will face 00:26:17.100 |
when we will talk about the stable diffusion. 00:26:19.060 |
And this part here, I took from the original paper 00:26:27.300 |
If you're wondering why we got this particular formula here 00:26:37.180 |
to have a better understanding on how to derive it yourself. 00:26:42.540 |
and hopefully you learned everything there is to know about, 00:26:46.460 |
at least from a theoretical point of view, about the VAE. 00:26:49.900 |
In my next video, I want to also make a practical example 00:26:53.460 |
on how to code a VAE and how to train a network, 00:26:56.180 |
and then how to sample from this latent space. 00:27:00.980 |
I'm pretty sure that you will have a deep understanding of the VAE, 00:27:08.540 |
Thank you for watching, and please come back to my channel.