Variational Autoencoder - Model, ELBO, loss function and maths explained easily!


Chapters

0:00 Introduction
0:41 Autoencoder
2:35 Variational Autoencoder
4:20 Latent Space
6:06 Math introduction
8:45 Model definition
12:00 ELBO
16:05 Maximizing the ELBO
19:49 Reparameterization Trick
22:41 Example network
23:55 Loss function

Transcript

Welcome to my video about the variational autoencoder. In this video, I will be introducing the model, how it works, the architecture, and I will also be going into the maths. Why should you learn about the variational autoencoder? Well, it's one of the building blocks of Stable Diffusion, and if you can understand the maths behind the variational autoencoder, you have covered more than 50% of the maths that you need for Stable Diffusion.

At the same time, I will also try to simplify the math as much as possible so that everyone, whatever their background, can follow the video. Before we go into the details of the variational autoencoder, we need to understand what an autoencoder is. The autoencoder is a model that is made of two smaller models.

The first is the encoder, the second is the decoder, and they are joined together by a bottleneck Z. The goal of the encoder is to take some input and convert it into a lower-dimensional representation, let's call it Z; if we then take this lower-dimensional representation and give it as input to the decoder, we hope that the model will reproduce the original data.

And why do we do that? Because we want to compress the original data into a lower dimension. We can make an analogy with file compression. For example, if you have a picture, let's call it zebra.jpg, and you zip the file, you will end up with a zebra.zip file. And if you unzip the file, or decompress it, you will end up with the same file again.

The difference between the autoencoder and file compression is that the autoencoder is a neural network, and the neural network will not reproduce the exact original input, but will try to reproduce as much of it as possible. So what makes a good autoencoder? The code should be as small as possible, that is, the lower-dimensional representation of the data should be as small as possible, and the reconstructed input should be as close as possible to the original input.
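To make this concrete, here is a minimal sketch of an autoencoder in PyTorch; the layer sizes are purely illustrative, not taken from the video:

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        # Encoder: compress the input down to a small code Z
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim)
        )
        # Decoder: reconstruct the input from the code Z
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, in_dim)
        )

    def forward(self, x):
        z = self.encoder(x)     # the bottleneck: lower-dimensional representation
        return self.decoder(z)  # the reconstructed input
```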

But what's the problem with autoencoders? The problem is that the code learned by the model doesn't make sense. That is, the model just learns a mapping between input data and a code Z, but doesn't learn any semantic relationship between the data. For example, if we look at the code learned for the picture of the tomato, it's very similar to the code learned for the picture of the zebra, or the code for the cat is very similar to the code learned for the pizza.

So the model didn't capture any semantic relationship between the data. And this is why we introduce the variational autoencoder. In the variational autoencoder, we learn a latent space: not a code, but a latent space, which represents a multivariate distribution over the data.

And we hope that this multivariate distribution, this latent space, also captures the semantic relationships between the data. So for example, we hope that all the food pictures have a similar representation in this latent space, that all the animals have a similar representation, and the same for all the cars and all the buildings, for example the stadium.

And the most important thing that we want to do with this variational autoencoder is to sample from this latent space to generate new data. So what does it mean to sample the latent space? Well, for example, when you use Python to generate a random number between 1 and 100, you're actually sampling from a random distribution, in this case a uniform distribution, because every number has an equal probability of being chosen.
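For example, in Python:

```python
import random

# Sampling from a uniform distribution: every integer between 1 and 100
# has the same probability of being chosen.
n = random.randint(1, 100)
```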

We can sample from the latent space to generate a new random vector, give it to the decoder and generate new data. For example, if we sample from this latent space, which is the latent space of a variational autoencoder that was trained on food pictures, and we happen to sample something that lies exactly in between these three pictures, we hope to get something whose meaning is also similar to these three pictures.

So for example, in between the pictures of eggs, flour and basil leaves, we hope to find pasta with basil. This means that the model has somehow captured the relationship between the data it was trained on, so it can generate new data. Why is the space called latent space?

Because we model our data as coming from a random variable X that we can observe, but this variable X is conditioned on another random variable Z that is not visible to us, that is hidden. Latent means hidden. And we will model this hidden variable as a multivariate Gaussian with a mean and a variance.

I know that this all sounds very abstract, so let me give you a more concrete example. I will use Plato's allegory of the cave. In Plato's allegory of the cave, there are some people who were born in a cave and have lived their whole lives there.

We are talking about these people here. These people never left the cave, so they only ever stayed in this area, and they are chained, so they cannot leave. Since childhood, they have seen these pictures on the cave wall, projected from these 3D objects by this fire.

So they are the shadows of these 3D objects. But these people don't know that these pictures are actually cast by the 3D objects. They don't know that they are shadows. For them, the horse is something black that moves like this. The bird is something black that moves like this.

So we need to think of ourselves as being just like these people. We have some data that we can observe, but this data actually comes from something that we cannot observe, a higher, more abstract representation of this data. And we want to learn something about this abstract representation.

Before we go into the maths of the variational autoencoder, let me give you a little pep talk, because the math is going to be a little hard for some people to follow and easier for others. The point is, in order to understand the variational autoencoder, you need to understand the math behind it.

Not only the mechanics of the math, but also the concepts. So what I will try to do is give you the necessary background to understand the math, if you are interested in learning it. But at the same time, I will also try to convey a general, high-level picture of what is happening and why we are doing what we are doing.

Also, I believe that the VAE is the most important component of Stable Diffusion models. Concepts like the ELBO, which we will see in the following slides, also come up in Stable Diffusion, so if you understand them here, it will be easy for you to understand Stable Diffusion. Plus, in 2023, I think you shouldn't be memorizing things, such as the architecture of models, because ChatGPT can do that faster and better than you.

If you want to compete with a machine, you need to be human. You can't be a machine and compete with a machine. I also believe that you should learn things out of curiosity, because that's the true engine of innovation and creativity. And plus, math is fun.

So let's start by introducing some math concepts that we will need in the following slides. Don't be scared if you are not familiar with these concepts, because I will try to give a high-level picture of what is happening. So even if you don't understand each step, you will still understand what is happening at a higher level.

We need the expectation of a random variable, the chain rule of probability, and Bayes' theorem. All three of these concepts are usually taught in a bachelor's class, so I hope that you are familiar with them. Another concept that is usually not taught at the bachelor's level, but that I will introduce now, is the Kullback-Leibler divergence.
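For reference, these are the standard textbook forms of the concepts just mentioned:

```latex
\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x)\, f(x)\, dx
\qquad \text{(expectation)}

p(x, z) = p(x \mid z)\, p(z)
\qquad \text{(chain rule of probability)}

p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}
\qquad \text{(Bayes' theorem)}

D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]
\qquad \text{(Kullback-Leibler divergence)}
```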

This is a very important concept in machine learning. It's a divergence measure that allows you to measure the distance between two probability distributions. So given probability distributions P and Q, the Kullback-Leibler divergence tells you how far apart these two distributions are. At the same time, it is not a distance metric, because it's not symmetric.

With a distance metric, for example the physical distance from point A to point B, if A to B is one meter apart, then B to A is also one meter apart. But this doesn't happen with the Kullback-Leibler divergence: the divergence between P and Q is not the same as the divergence between Q and P.
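Here is a small numerical illustration of this asymmetry, using two made-up discrete distributions:

```python
import numpy as np

# Two discrete distributions over the same three outcomes
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

def kl(a, b):
    # D_KL(a || b) = sum_i a_i * log(a_i / b_i)
    return np.sum(a * np.log(a / b))

print(kl(p, q))  # ~0.37
print(kl(q, p))  # ~0.42 -- different from kl(p, q), so KL is not symmetric
# Both values are >= 0, and each would be exactly 0 only if p == q.
```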

However, just like a distance metric, it is always greater than or equal to zero, and it's equal to zero if and only if the two distributions are the same. We can now introduce our model. We saw before that we want to model our data as coming from a random variable that we call X, which is conditioned on a hidden variable, or latent variable, called Z.

We could, for example, marginalize over the joint probability using this relationship here. The problem is that this integral is intractable, because we need to integrate over all possible values of the latent variable Z. And what does it mean to be intractable? It means that in theory we can calculate it, but in practice it is so slow and so computationally expensive that it's not worth it.

Something intractable is like trying to guess your neighbor's Wi-Fi password. In theory, you can do it by generating all possible passwords and trying all of them; in practice, it would take you thousands of years. Now, this relationship can also be written in another way by using the chain rule of probability.
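Written out, the two relationships just mentioned are (with theta denoting the unknown parameters of the model):

```latex
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz
\qquad \text{(marginalization: intractable)}

p_\theta(x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}
\qquad \text{(chain rule: requires the unknown posterior } p_\theta(z \mid x)\text{)}
```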

We are trying to find a probability distribution over our data, but for that we need the ground truth of the posterior, the probability distribution over the latent space given our data, which we don't have. But that is also something we want to learn, so we cannot use this relationship directly.

So this looks like a chicken-and-egg problem: we are trying to find one quantity using another, but we don't have the second, and to find the second, we would need the first. So how do we get out of it? Usually, when you cannot find something that you want, you try to approximate it.

And this is the road that we will follow. So, this is what we want to find, and we assume it is parametrized by some parameters theta that we don't know. But what if we could find a surrogate, an approximation of it, that has its own parameters?

Well, let's follow this road. So let's do some maths. We start with the log likelihood of our data, which is equal to itself. We can then multiply by one. And why is this quantity one? Well, it is the integral of a probability distribution over its domain, which is always equal to one.

And we can bring this quantity inside the integral, because it doesn't depend on the variable being integrated. This is the definition of expectation: we can see that this integral is actually an expectation. And inside this expectation, we can apply the equation given by the chain rule of probability.

We can multiply the numerator and the denominator by the same quantity, which again means multiplying by one. Then, because the log of a product can be written as the sum of the logs, we can split the expectation into two.

And finally, we can see that the second expectation is actually a Kullback-Leibler divergence, and we know that it's always greater than or equal to zero. Now, let me expand the relationship that we have found: the log likelihood of our data is equal to a first quantity plus this KL divergence.
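Reconstructing the chain of steps just described, with q_phi(z|x) as the approximate posterior (the standard notation from the Kingma and Welling paper):

```latex
\log p_\theta(x)
= \log p_\theta(x) \int q_\phi(z \mid x)\, dz
= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x)\right]
= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\right]

= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)\, q_\phi(z \mid x)}{p_\theta(z \mid x)\, q_\phi(z \mid x)}\right]
= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\text{ELBO}}
+ \underbrace{D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)}_{\geq\, 0}
```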

We will call the first quantity the ELBO, which stands for evidence lower bound; to it we add a KL divergence that is always greater than or equal to zero. Now, what can we infer from this expression without knowing anything about the quantities involved? If you cannot see it, let me help you with a parallel example.

Imagine you are an employee. You have a total compensation, which is your base salary plus a bonus, and the bonus is always greater than or equal to zero. Without knowing anything about your base salary or your bonus, we can deduce for sure that your total compensation is always greater than or equal to your base salary.

Now, this expression has the same structure as the one above, so we can infer the same thing: the first quantity is always greater than or equal to the second, no matter what happens to the third. This also means that the ELBO is a lower bound for the log likelihood.

This also means that if we maximize the ELBO, the log likelihood will also be maximized. Let's look at the ELBO in detail. We found before that the log likelihood is what we want, and that by maximizing the ELBO we automatically maximize it. But the ELBO can also be rewritten by using the chain rule of probability.

Then we can split the expectation again, and we can see that the second expectation is itself a KL divergence. In this case, it's a reverse KL divergence, so it's not the same as the one we saw before, because the numerator doesn't match the probability distribution we are taking the expectation over.
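In formulas, this decomposition of the ELBO reads:

```latex
\text{ELBO}
= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
- D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```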

That is why we put a minus sign here. Now, our goal is to maximize the ELBO, and we maximize it in order to maximize the log likelihood. However, if we maximize the ELBO, we are actually maximizing this difference. And when you maximize a difference, you are maximizing the first term and, at the same time, minimizing the second.

If you cannot see it, let me give you a parallel example. Imagine you have a company, and your company has profit, revenue, and cost. If you want to maximize your profit, what do you do? You maximize your revenue and, at the same time, minimize your cost.

That's the only way, right? So if we are maximizing the ELBO, we are actually maximizing the first quantity and, at the same time, minimizing the second. Now, let's look at what these quantities mean. For that, I took this picture from a paper by Kingma and Welling, who are also the authors of the first paper about the variational autoencoder, and we can see that the first term is a log likelihood.

It's a distribution that, given Z, gives us a probability distribution over X, so here we are talking about the decoder. Then we have a KL divergence between two distributions. One is called the prior: this is what we want our Z space to look like, and as I said before, we want our Z space to be a multivariate Gaussian. The other is the distribution learned by the model.

So, when we maximize the ELBO, since we are minimizing this KL term, the model is actually minimizing the distance between what it is learning as the Z space and what we want the Z space to look like. In other words, it is making the Z space look like a multivariate Gaussian.

And because when we maximize the ELBO we also maximize the first quantity, the model is also learning to maximize the reconstruction quality of the sample X given its latent representation Z. Now, the problem is that we are maximizing something that has a stochastic quantity inside, an expectation over a probability distribution. First of all, let me describe how we maximize a function.

So when we want to maximize a function, we usually take the gradient and adjust the weights of the model so they move along the gradient. Or when we want to minimize a function, we take the gradient and adjust the weights of the model so they move against the gradient direction.

And this is also what happens when we train our models. For example, imagine we have a function that is convex, so it looks like, let's say, a bowl, and our minimum is here, our initial weights are here. We evaluate our gradient at this point, and the gradient always points in the direction of growth of the function, so the function is growing in this direction, and we move against the gradient if we want to minimize the function, right?

The problem is that we are not calculating the true gradient when we train our model: we are using what is called stochastic gradient descent. Have you ever wondered why it's called stochastic gradient descent and not just gradient descent? Because to compute the true gradient, you need to evaluate the function over the whole dataset, all the training data you have, not only a single batch.

By doing it on a single batch, you get a distribution over the possible gradients. So when we use stochastic gradient descent and evaluate the gradient of our loss function, we do not get the true gradient; we get a sample from a distribution over gradients. And it has been proven that if you do it long enough, over the entire training set, so one epoch, it will on average converge to the true gradient.

The fact that the gradient is stochastic also means that the gradient we get has a mean and a variance. In the case of stochastic gradient descent, the variance is small enough that we can use it in practice. The problem is that if we do the same job with the ELBO, we get an estimator with high variance.

This is called estimating a stochastic quantity, and here the estimator has high variance, as shown in the paper. If we look at the paper by Kingma and Welling, they show that there is an estimator for the ELBO, but this estimator exhibits very high variance. What does that mean? Imagine we are trying to minimize our function.

If we use an estimator that has high variance, suppose we are here and the minimum of the function is here. If we are lucky, when we calculate the gradient, we will get something close to the true gradient and move against it. However, if we are unlucky, because the estimator has high variance, it may return a very different gradient than we expect, for example in this direction, and then we will move in the opposite direction, which takes us far from the minimum. And this is not what we want.

So we cannot use an estimator that has high variance. This estimator is, however, unbiased: it means that if we compute it many times, on average it converges to the true value. But because of its high variance, we cannot use it in practice. Plus, how do we run backpropagation through a quantity that is stochastic?

Because we need to sample from our Z space to calculate the loss, how can we calculate the derivative of the sampling operation? We cannot do that, and PyTorch cannot do that. So we need a new estimator. The idea is to take the source of randomness outside of the model, and we call this the reparameterization trick.

The reparameterization trick basically means that we take the stochastic component out of Z. We create a new variable, epsilon, that is our stochastic node; we sample from epsilon, combine it with the parameters learned by the model, mu and sigma squared, which are the mean and the variance of the multivariate Gaussian we are trying to learn, and then we run backpropagation through it.
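Here is a minimal PyTorch sketch of the trick just described; the shapes and variable names are illustrative:

```python
import torch

# Hypothetical encoder outputs for a batch: mean and log-variance of q(z|x)
mu = torch.randn(16, 32)        # shape: (batch_size, latent_dim)
log_var = torch.randn(16, 32)

# Reparameterization: the randomness lives in an external node eps ~ N(0, I)
eps = torch.randn_like(mu)            # sampled outside the network
sigma = torch.exp(0.5 * log_var)      # sigma = exp(log(sigma^2) / 2)
z = mu + sigma * eps                  # deterministic in mu and sigma: gradients flow
```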

Let me show you with a picture. This is a picture I took from another paper by Kingma and Welling, and we can see that when we run backpropagation, we calculate our loss function and then the gradient. Before, this node here was random, was stochastic, so we couldn't run backpropagation through it because, as I told you, we don't know how to calculate the gradient of the sampling operation.

However, if we take the randomness out of this node and into another node that is outside of Z, we can run backpropagation and update the parameters of our model. Of course, backpropagation will also calculate the gradients along this path, but we will just discard them because we don't care about them.

We will choose a random source that is fixed: we will use N(0, 1), or N(0, I) in case we are using a multivariate Gaussian, and so now we can actually run backpropagation. Plus, this new estimator has lower variance. So we found an estimator in which we replaced the stochastic quantity, q_phi(z|x), the distribution of Z conditioned on X, with a quantity that comes from our noise source epsilon.

We combine it with the parameters learned by the model through this transformation, and this gives us our new estimator, which is also called a Monte Carlo estimator. We can also prove that this new estimator is unbiased: if we run it many times, on average it will converge to the true gradient.

We can show that like this. If we take the gradient of this estimator many times, then on average we can rewrite the quantity as follows, because this is our estimator. And because the expectation over the noise doesn't depend on the parameters, we can pull the gradient operator out of the expectation, and the quantity inside is our original ELBO, reparameterized over the noise source epsilon.
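In symbols, the argument just described looks like this (with L-tilde denoting the reparameterized estimator, a notation assumed here):

```latex
\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[\nabla_{\theta,\phi}\, \tilde{\mathcal{L}}(x; \epsilon)\right]
= \nabla_{\theta,\phi}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[\tilde{\mathcal{L}}(x; \epsilon)\right]
= \nabla_{\theta,\phi}\, \text{ELBO}
```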

I want to recap what we have said so far. We found something called the ELBO which, if we maximize it, will let us learn the latent space. We also found an estimator for this ELBO that allows backpropagation to be run. Now I want to combine all this knowledge to simulate what the network will actually do.

Imagine we have a picture of something. We run it through the encoder, which, given our picture, gives us its latent representation. Then what do we do? We sample from the noise source, which is outside the model, so it's not inside Z and not inside our neural network.

And how do we sample it? Well, there is a function in PyTorch called torch.randn_like, because we will be sampling from a distribution with zero mean and unit variance. We combine this noisy sample with the parameters learned by the model, pass it through the decoder, which, given Z, gives us back X, and then we calculate the loss between the reconstructed sample and the original sample.
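Putting the whole forward pass together, here is a minimal sketch of a fully-connected VAE in PyTorch; all layer sizes and names are illustrative, not the network from the video:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, hidden=400, latent=20):
        super().__init__()
        self.enc = nn.Linear(in_dim, hidden)
        self.mu = nn.Linear(hidden, latent)        # mean of q(z|x)
        self.log_var = nn.Linear(hidden, latent)   # log-variance of q(z|x)
        self.dec1 = nn.Linear(latent, hidden)
        self.dec2 = nn.Linear(hidden, in_dim)

    def forward(self, x):
        h = F.relu(self.enc(x))                    # encoder
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                 # noise source outside the model
        z = mu + torch.exp(0.5 * log_var) * eps    # reparameterization trick
        x_hat = torch.sigmoid(self.dec2(F.relu(self.dec1(z))))  # decoder
        return x_hat, mu, log_var
```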

Now I want to show you the loss function. Don't be scared: it's a little long and not easy to derive if you don't have the necessary background, but I will try to explain the meaning behind it. The loss function is this one here, and it's made of two components.

As we saw before, the loss function is basically the negative of the ELBO (minimizing the loss maximizes the ELBO), so it's made of two components: one that tells us how far the learned distribution is from what we want our distribution to look like, and one that measures the quality of the reconstruction.

For the reconstruction term, we can just use the MSE loss, which evaluates pixel by pixel how different the reconstructed sample is from the original sample. And the other term calculates the KL divergence between the prior, so what we want our Z space to look like, and the Z space actually learned by the model.
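For a standard Gaussian prior and a learned Gaussian q(z|x), the KL term has the closed form below, and the whole loss fits in a few lines of PyTorch (a sketch; x_hat, mu and log_var are the outputs of the hypothetical VAE module sketched above):

```latex
D_{KL}\big(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)\big)
= \frac{1}{2} \sum_{j} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)
```

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    # Reconstruction term: pixel-by-pixel MSE between reconstruction and original
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```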

How do we combine the noise sampled from the noise source epsilon with the parameters learned by the model? Well, because we chose the prior to be Gaussian, and we also chose the noise to be Gaussian, we can combine them like this: the mu learned by the model, plus the sigma learned by the model multiplied element-wise (not matrix multiplication) with the noise.

I also want you to notice that we are not learning sigma squared, but log of sigma squared. So we are not learning the variance, but the log variance. Why is this the case? Well, if we learned sigma squared directly, we would have to force our model to output a positive quantity, because sigma squared cannot be negative.

So we instead let the model learn log sigma squared, and when we want to transform it into sigma squared, we just take the exponential. I hope that you now have some understanding of what we did and why we did it, because my goal was to give you an insight into what the ELBO is: something that we can maximize to learn the latent space.

I also wanted to show you the derivation of the ELBO and all the problems involved, because these are the same problems we will face when we talk about Stable Diffusion. And this part here I took from the original paper by Kingma and Welling, in which they show the loss function.

If you're wondering how we got this particular formula for the KL divergence, I found the derivation on Stack Exchange, and I attach it here in case you want a better understanding of how to derive it yourself. Thank you for watching this video; I hope you have learned everything there is to know, at least from a theoretical point of view, about the VAE.

In my next video, I want to make a practical example of how to code a VAE, how to train the network, and how to sample from the latent space. If you watch both this video and that one, I'm pretty sure you will have a deep understanding of the VAE and a solid foundation for then understanding Stable Diffusion.

Thank you for watching, and I hope to see you back on my channel.