Stanford CS25: V5 | Transformers for Video Generation, Andrew Brown of Meta

00:00:00.000 |
I'm very happy to end it off with a bang here with Andrew. 00:00:08.180 |
So he's a research scientist at Meta's Gen AI team, focusing on media generation. 00:00:14.760 |
Over the past few years, his team has focused on publishing research papers 00:00:18.940 |
that push the frontiers of video generative models, 00:00:27.160 |
Prior to working at Meta, Andrew completed his PhD at Oxford's Visual Geometry Group, VGG, 00:00:32.400 |
under the supervision of Professor Andrew Zisserman. 00:00:36.160 |
So without further ado, I'll hand it off to Andrew to take it. 00:00:48.080 |
If you guys haven't heard of that, Gen AI is the research organization that releases Meta's generative models, 00:00:53.960 |
things like Llama for the text LLMs and our media generation models as well. 00:01:00.260 |
I've been there for about two and a half years, ever since I finished my PhD. 00:01:05.340 |
And like Stephen said, over that period, we've released a bunch of frontier-pushing video generation models. 00:01:17.020 |
And yeah, today's talk is Transformers for video generation. 00:01:20.960 |
So I saw in the list of seminar speakers that you've already had a bunch on NLP and some media generation stuff. 00:01:27.180 |
I hope this is new compared to what you've heard already. 00:01:34.800 |
You guys might have seen videos like this on the internet. 00:01:38.020 |
But given a text prompt, contemporary text-to-video generation models can now create these incredibly high-quality videos. 00:01:47.920 |
Complex motion, fantastical scenes, very high-quality. 00:01:54.360 |
Another example I quite like, this is a ghost in a white bedsheet. 00:02:04.440 |
Not only are they super high-quality, they seem to have learned some notion of the laws of physics. 00:02:09.760 |
The thing I like about this one is if you look closely, you can see the reflection of the ghost is shown in the mirror there. 00:02:15.220 |
So these models have clearly learned something; like I was saying, if you look very closely, you can see this reflection of the ghost in the mirror. 00:02:25.460 |
So these text-to-video generation models can be used for other amazing capabilities as well, like editing. 00:02:32.200 |
So, for example, you can give these models an input video, like on the top left, and specify some edit prompt, like turn the runner into an inflatable dinosaur or turn it into a cactus desert. 00:02:46.500 |
Now, I don't know how long a lot of you have been paying attention to the media generation field or even the machine learning field. 00:02:54.540 |
But if you've been only paying attention for the last year, year and a half, this is all that you're used to. 00:02:59.560 |
You're only used to seeing generated videos that are completely indistinguishable from real ones. 00:03:04.100 |
But this is an incredibly recent development. 00:03:07.780 |
So on the right here, I have the same video I showed you on the first slide. 00:03:12.460 |
This is from a model that was released in October 2024. 00:03:15.400 |
And on the left, I have what was an amazing state-of-the-art approach from September 2022. 00:03:21.800 |
This is another model that was released by my team. 00:03:26.200 |
And I can't stress enough how amazing this was at the time. 00:03:31.500 |
I'm sure every speaker at this seminar has come in and said there's been, like, amazing progress in this machine learning subfield or that one. 00:03:42.500 |
But nowhere is it more clear than video generation. 00:03:49.760 |
This is what the subject of today's talk is going to be. 00:03:56.700 |
How do we train models to generate videos like this? 00:04:00.100 |
So all of the videos you saw in the previous slides are from a paper we released in October 2024 called MovieGen. 00:04:09.760 |
A little spoiler for how we did this and a spoiler for why I was invited to this is that we did it using transformers. 00:04:19.040 |
I saw this little guy included in previous talks, so I had to put him there. 00:04:21.640 |
So today's conclusions are going to be twofold. 00:04:26.620 |
First, I'm going to talk you through every detail of how you train a transformer to generate videos like this. 00:04:36.960 |
Some of the concepts are going to be familiar, things like transformers. 00:04:43.280 |
The second takeaway is going to be this conclusion that I'll keep on saying. 00:04:47.700 |
That throughout this project, we learned that scaling data, compute, and model parameters for a simple transformer also works for video generation. 00:04:56.420 |
We've seen it work in all kinds of machine learning fields. 00:04:59.740 |
We also saw it work here for video generation. 00:05:02.040 |
Okay, so just a little bit of personal background. 00:05:08.140 |
I think most of the speakers that you've had are from NLP, which makes sense because transformers came from NLP. 00:05:13.460 |
I'm a computer vision researcher, and I've been in visual generation for a few years now, and it hasn't always been as popular as it is now. 00:05:23.220 |
So how I got into this, I was sat in this lecture hall in the engineering department at Oxford in, what, the second year of my PhD, 00:05:33.880 |
and we were having a talk by Professor Antonio Torralba from MIT, and he was presenting this work called GAN dissection. 00:05:43.220 |
The state-of-the-art image generation approaches at the time were these generative adversarial networks, 00:05:47.860 |
and in 2019, you could generate like a blurry face or a blurry kitchen or a blurry bedroom, and that was amazing at the time. 00:05:59.780 |
And this paper here was showing that you could activate or deactivate certain neurons in the GAN, 00:06:05.960 |
and in doing so, you could make certain concepts appear or disappear. 00:06:09.820 |
So these are the kind of images that we were generating at the time. 00:06:14.340 |
So this is like a blurry kitchen, and this was like near state-of-the-art. 00:06:19.600 |
So you can see, you know, the visual concepts are pretty messed up. 00:06:23.560 |
It doesn't make a huge amount of sense, but I can't stress how, like, good this was at the time. 00:06:27.820 |
I'm sure a lot of you have seen, like, DALL-E and Stable Diffusion and so on. 00:06:35.060 |
So the point of the paper, or at least in this example, was they were showing that you could activate certain neurons, say the ones corresponding to windows, and windows would appear in the image. 00:06:42.680 |
So you could imagine how your kitchen would look with windows. 00:06:48.060 |
So some, like, pretty janky windows on the left. 00:06:52.580 |
And my mind was completely blown, as I can sense, like, all of yours are as well. 00:07:01.480 |
So what they showed here is not only that the model had learned a physically plausible place to put the windows, 00:07:08.180 |
but importantly, they'd also shown that the model had learned some notion of physics. 00:07:12.340 |
The model had learned that if you put windows on the left-hand side of the room, 00:07:16.280 |
light will come through them, and there'll be a reflection of the marble countertop. 00:07:21.580 |
And the model had just learned this by looking at images. 00:07:24.960 |
So this had, like, a really profound impact for me as a young PhD student, 00:07:29.840 |
and I've been in visual generation ever since. 00:07:40.100 |
So for the rest of the talk, I'll give an overview of the model that we trained, 00:07:43.600 |
talk about the architecture, data, and training recipe, and then results and applications. 00:07:50.720 |
So first, I included a bit of historical context on MovieGen and video generation. 00:07:59.720 |
Maybe some of you are quite new to the video generation field. 00:08:05.020 |
This snapshot is maybe three years, from 2024 back to around 2022. 00:08:11.820 |
This is a century in machine learning research. 00:08:18.380 |
There are some very relevant, important works here. 00:08:22.120 |
I've just included a snapshot to make a couple of points. 00:08:24.460 |
There have been two milestone events in video generation. 00:08:30.360 |
The first was in 2022, when people started using diffusion modeling. 00:08:34.040 |
This is when the whole community started using diffusion. 00:08:37.300 |
That was a big step up in visual quality at that point. 00:08:41.760 |
The second one was in 2024, and this is the point of today's talk. 00:08:46.020 |
Before 2024, people were using quite small-scale, specialized architectures. 00:08:52.220 |
I say small because the definition of small and large has been moving all over the place recently. 00:08:59.060 |
But these were specialized architectures for computer vision, things like CNNs, U-Nets, and so on. 00:09:06.480 |
And then around 2024, video generation sort of joined this architecture unification trend. 00:09:13.760 |
So all over machine learning fields, we've seen people ditch specialized architectures 00:09:19.420 |
and move towards this simple transformer setup. 00:09:21.680 |
The reason is because all of these different fields are seeing the benefits of efficiency and scalability that come with transformers. 00:09:30.060 |
So in 2024, the video generation community started doing the same, 00:09:34.780 |
and that's where MovieGen comes in, which I'll talk about today. 00:09:38.660 |
So a quick overview on what MovieGen is before we get into the details. 00:09:45.320 |
So MovieGen was a cast of foundation models that generate high-quality 1080p HD videos 00:09:51.360 |
with different aspect ratios and synchronized audio. 00:09:53.740 |
Today, I'm just going to be talking about the text-to-video model. 00:09:56.460 |
I'll show some fun examples later of the other models that we trained. 00:10:02.140 |
And again, like I've been saying, the point of the paper was showing that scaling data, training compute, 00:10:07.900 |
and model parameters for a simple transformer trained with flow matching, I'll cover that later, works really well for video generation. 00:10:15.240 |
We also presented a few sort of innovations and simplifications along the way. 00:10:21.060 |
So MovieGen Video, the model, was a 30 billion parameter foundation model for joint text-to-image and text-to-video generation. 00:10:30.400 |
The model was trained on the order of around 100 million videos and 1 billion images. 00:10:42.220 |
There are three main things that I want to cover today. The first is what representation we use for the video data. 00:10:51.380 |
The second is what learning objective that we used. 00:10:54.220 |
And the third is what model architecture we used for learning it. 00:10:58.280 |
If I do a good job here, then all of you will know all there is to know about video generation. 00:11:02.840 |
I'm imagining most of you have more of an NLP background with sort of autoregressive text models. 00:11:10.700 |
I'm going to try and contextualize all of this in relation to text and how it differs from large language models. 00:11:25.600 |
So, what do we mean by the data representation? Well, we mean how are we going to represent the data for the model. 00:11:34.580 |
We know that X is going to be derived in some way from videos, but there's an open question of how exactly do we do that. 00:11:41.220 |
So to motivate what we ended up doing, I'm just going to talk a bit about the differences between text and media. 00:11:51.540 |
So let's look at some text data, for example. 00:11:55.220 |
Take this piece of data, a sentence: "an image of a cat." 00:11:59.500 |
What has happened when this data sample has been created is a human has put in a huge amount of thought into compressing what they're thinking into this very well-designed, semantically rich language. 00:12:17.620 |
Every word packs a huge amount of information. 00:12:22.280 |
So in practice, when people are training large language models, they can use a representation that's quite close to this. 00:12:28.320 |
So they might use a sort of simple tokenizer before feeding it into the transformer. 00:12:38.380 |
So let's look at a related piece of data, an actual image of a cat. 00:12:46.620 |
This image is just continuous raw data that's been captured by your camera. 00:12:50.580 |
Maybe the only sort of human effort that's gone into this is positioning the camera and framing the cat in the middle of the camera, but nothing else. 00:13:00.880 |
As a result of this, there's tons of redundancy. 00:13:03.080 |
So what I mean by this is if you know what a cat is, and then let's say the middle pixel you know is this white fur on the cat. 00:13:13.340 |
Well, if you know what a cat is, you know the next pixel along is probably going to be white fur and the next one along from that. 00:13:19.500 |
And if it's a video, you know the same pixels one frame along are also going to be white cat fur. 00:13:26.760 |
So this begs the question of perhaps this raw data could be transformed into something that more closely resembles language. 00:13:39.400 |
So if you forget everything that I just said on the previous slide, one thing you could do, a very simple approach, is you just model the pixels directly. 00:13:48.000 |
So let's say you took an image or a video, you unraveled it into a long sequence of pixels, and then trained, let's say, next token prediction on that. 00:13:56.860 |
And in a way, that's what some prior works did. 00:13:59.800 |
Things like Imagen Video or ImageGPT back in the day, along with a little bit of patching. 00:14:05.240 |
So this is a very sort of conceptually simple method, but it's very computationally constraining. 00:14:14.520 |
So the thing about modeling pixels directly is the number of pixels scales quadratically with the image or video resolution. 00:14:24.540 |
It's even worse for video when you have a temporal dimension. 00:14:27.860 |
In practice, what this means is these models can only model very low-resolution images or videos, things like 64 by 64 videos, which is not ideal. 00:14:37.760 |
And if we want to generate an actual large HD video, these methods had to employ a huge cascade of upsampling stages, things like super-resolution models, frame interpolation models, to increase the size of the data. 00:14:54.400 |
So instead, what prior work does is they learn a compressed latent representation using something like a VAE or a VQVAE trained offline. 00:15:07.120 |
This is what every sort of text-to-image model or text-to-video model you've seen on social media has been doing for a long time. 00:15:16.140 |
If you're modeling a compressed version of your data, then you can natively model larger data. 00:15:27.140 |
The other advantage is that this offline-trained VAE or VQVAE can remove some of the computational burden from the language model. 00:15:35.560 |
For example, these autoencoders could handle the modeling of how two separate blades of grass differ from each other in an image or a video. 00:15:44.920 |
And it can take that burden off the sort of downstream language model. 00:15:53.600 |
In terms of architecture, we train something called a temporal autoencoder, or TAE, for spatio-temporal video compression. 00:16:02.640 |
This is basically just a variational autoencoder: you have an encoder and a decoder. 00:16:16.600 |
After the encoder, the representation will be compressed, and you end up with this latent representation at the bottleneck in the middle. 00:16:24.300 |
And this is the representation that we're going to use downstream. 00:16:35.140 |
During training, you push a video through the encoder, you get to the bottleneck latent representation, and you decode it back out. 00:16:40.760 |
And you have a bunch of losses between the output and the input, things like L1 losses, adversarial losses, and so on. 00:16:48.260 |
So this representation in the middle is what we're going to model. 00:16:52.540 |
When we talk about learning P of X, we're learning the distribution of this latent. 00:16:57.920 |
What this means is when we train a generative model on this, it generates in this space. 00:17:05.300 |
So after we generate a video, we then need to decode it back to RGB space. 00:17:10.140 |
So the TAE that we trained had 8x compression in each dimension. 00:17:21.280 |
And this was pretty high compression at the time. 00:17:31.600 |
Things have moved on since then, and six months is, again, like a decade in machine learning research. 00:17:35.340 |
But at the time, this was very high compression. 00:17:36.900 |
And like I said, this means that we can natively model very high-resolution videos. 00:17:42.580 |
As an example, the largest video that we model in this work is 768x768 pixels, 16 seconds, 16 FPS. 00:17:54.080 |
Now, if we were to model pixels directly, and we took a video of that size, we unraveled everything, we treated one pixel as one token, and we flattened the whole thing, it would result in 150 million tokens. 00:18:06.980 |
Even with very long context training methods with language models at the moment, this is completely unfeasible. 00:18:15.460 |
But using this temporal autoencoder, the same video is just compressed to 73,000 tokens. 00:18:22.680 |
So this is suddenly completely computationally feasible using like today's parallelism approaches, today's infrastructure. 00:18:32.220 |
If anyone is doing this math offline, there is also a patchify layer, in case you think my math is wrong here. 00:18:38.660 |
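As a rough sanity check of that arithmetic, here is a minimal sketch in Python; the only real numbers are the ones stated above (768x768, 16 seconds at 16 FPS, 8x compression per dimension, a 2x2 spatial patchify), and everything else is just arithmetic.

```python
# Rough token-count arithmetic for a 768x768, 16 s, 16 FPS video, assuming the 8x
# per-dimension TAE compression and a 2x2 spatial patchify layer mentioned above.
height, width, frames = 768, 768, 16 * 16

raw_tokens = height * width * frames                        # one token per pixel -> ~151 million
tae_tokens = (height // 8) * (width // 8) * (frames // 8)   # after 8x spatial/temporal compression
patched_tokens = tae_tokens // (2 * 2)                      # after 2x2 patchify -> ~73k

print(f"raw: {raw_tokens:,}  after TAE: {tae_tokens:,}  after patchify: {patched_tokens:,}")
```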
Okay, so that's everything for the representation. 00:18:44.180 |
Does anyone have any questions at this point? 00:18:56.180 |
[Audience question about whether the temporal autoencoder is causal along the time dimension.]
A lot of autoencoders for videos do use causality. 00:19:02.260 |
Some nice outcomes of that is that when you encode images, they can be encoded completely independently of subsequent frames and so on. 00:19:18.960 |
Okay, so next up is which generative modeling learning objective do we use? 00:19:24.020 |
So probably in most of the talks that you've had so far, you've heard about autoregression and next token prediction for text. 00:19:30.720 |
In media generation, we haven't been doing that for a couple of years. 00:19:36.620 |
So the de facto approach in most media generation has been using diffusion modeling; in MovieGen, we used flow matching. 00:19:50.940 |
Flow matching is, in a way, a simpler generalization of diffusion. 00:19:57.580 |
Now, if any of you have watched a talk on diffusion modeling or read anything about diffusion or flow matching, I'm sure you've seen a figure like this before. 00:20:07.440 |
I'm going to give a brief sort of explainer of what is similar between flow matching and diffusion. 00:20:13.380 |
And then we're going to go over what a training step looks like. 00:20:17.660 |
So both diffusion and flow matching have a very similar setup. 00:20:22.360 |
So you assume that you have some unknown data distribution. 00:20:26.280 |
This is the sort of distribution of images in this figure. 00:20:33.040 |
This is the distribution we want to learn and then sample from. 00:20:37.700 |
You also assume you have a known data distribution on the right. 00:20:42.820 |
And we model this as normally just like Gaussian noise. 00:20:48.240 |
Both assume then that we have this fixed forward process. 00:20:51.720 |
What this means is we have a method of translating between the unknown data distribution and the known one by iteratively adding noise. 00:21:00.500 |
Both assume that if you keep on adding noise, you basically end up at this known data distribution. 00:21:08.400 |
Then both of them train a neural network to do the reverse process. 00:21:12.380 |
So they train a neural network to take one of these images and iteratively denoise those. 00:21:17.740 |
And then at inference time, you can iteratively use this neural network to go from a sample that's pure noise back to a sample from this data distribution that you've just learned. 00:21:27.600 |
And that's how we end up sampling images and videos. 00:21:30.880 |
So diffusion and flow matching are very similar in a lot of ways. 00:21:35.760 |
Flow matching is, in a way, a simpler generalization. 00:21:39.040 |
It's been very recently shown to result in more robust training and more efficient probability paths that are easier and faster to sample from. 00:21:47.920 |
So this paper came out pretty recently, in 2023, from some colleagues at Meta. 00:21:56.820 |
Importantly, it's been shown to work better than diffusion. 00:21:59.380 |
And I'm not going to go into a huge amount of detail here, but we're going to go over how a training step looks like. 00:22:11.520 |
We're going to step through them pretty quickly. 00:22:19.600 |
First, we sample a data point, X1, from the training data; this is your image of a cat from the previous slide. 00:22:29.340 |
And we sample X0 from the known data distribution. 00:22:31.940 |
This just means taking a sample from a normal Gaussian. 00:22:41.140 |
We also sample a time step, t, and use these to construct Xt, which is just this sort of intermediate, somewhat noisy image of a cat. 00:22:47.440 |
There are lots of different ways of constructing Xt. 00:22:50.440 |
We use what's called the simple linear interpolation from the flow matching paper. 00:22:56.900 |
So this is how we go from the three things we sampled above to this intermediate training sample. 00:23:02.460 |
Then in flow matching, what you do is you train the model to predict the velocity. 00:23:08.220 |
This is a value which moves the training sample back in the direction of the data sample. 00:23:16.160 |
So this is how we compute the velocity, simply by differentiating the equation above. 00:23:23.060 |
And then on the right, we have our actual learning objective. 00:23:26.180 |
So this is the mean squared error between the model prediction and the ground truth velocity. 00:23:31.300 |
Here, the model prediction is parametrized by U. 00:23:42.880 |
Remember, we're doing text to video generation. 00:23:44.840 |
So we need to condition the generation on the text prompt. 00:24:03.080 |
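To make the training step concrete, here is a minimal PyTorch-style sketch, assuming a generic velocity-prediction model; `model` and `text_emb` are placeholder names, and it uses the plain linear interpolation path, leaving out refinements like the small sigma_min term used in practice.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, text_emb):
    # x1: a batch of clean TAE latents; text_emb: the text conditioning sequence.
    x0 = torch.randn_like(x1)                      # sample from the known (Gaussian) distribution
    t = torch.rand(x1.shape[0], device=x1.device)  # sample one time step per example
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))      # reshape t so it broadcasts over the latent dims
    xt = t_b * x1 + (1.0 - t_b) * x0               # simple linear interpolation between noise and data
    v_target = x1 - x0                             # ground-truth velocity, d(xt)/dt
    v_pred = model(xt, t, text_emb)                # model predicts the velocity
    return F.mse_loss(v_pred, v_target)            # mean squared error training objective
```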
So at inference, you start by sampling from this known data distribution, Gaussian noise. 00:24:07.020 |
And then we use an ordinary differential equation solver to go back to the data distribution, 00:24:19.900 |
At each time step, you compute the model's prediction for the velocity 00:24:23.660 |
and use the solver to move the sample in the direction of the unknown data distribution. 00:24:29.820 |
And at the end of that, you have your sample. 00:24:32.660 |
Low-level detail, we use a quite simple solver. 00:24:38.420 |
There are lots of different options you can choose. 00:24:43.620 |
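And a minimal sketch of that iterative sampling with a simple first-order (Euler) solver; the step count here is an arbitrary placeholder rather than the number actually used.

```python
import torch

@torch.no_grad()
def sample_latents(model, text_emb, shape, num_steps=50, device="cuda"):
    x = torch.randn(shape, device=device)                # start from pure Gaussian noise (t = 0)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = ts[i].expand(shape[0])                       # current time step for the whole batch
        v = model(x, t, text_emb)                        # predicted velocity at this point on the path
        x = x + (ts[i + 1] - ts[i]) * v                  # Euler step toward the data distribution
    return x                                             # a latent; decode with the TAE to get RGB frames
```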
Okay, so lastly, which model architecture do we use? 00:24:49.960 |
Now, I already said that we're using transformers. 00:24:54.400 |
The big goal of this paper was to benefit from, like I've said about seven times already, 00:25:01.080 |
scaling data, training compute, and model parameters with transformers. 00:25:05.760 |
But there is a question about which transformer to use. 00:25:07.920 |
So, in my research organization, we train these things called Llamas. 00:25:14.380 |
Llama is the large language model that Meta open-sources. 00:25:19.480 |
We take the Llama 3 model, hence why I sort of pasted this L3 on it. 00:25:25.840 |
Llama 3 is quite a classic, dense, fully connected, decoder-only language model. 00:25:34.480 |
So, what we did in MovieGen is you take your videos, you encode them with the TAE that we discussed earlier, 00:25:40.840 |
you flatten the tokens, this gives you your input sequence, and we just throw it into Llama. 00:25:51.980 |
Now, when I say Llama, I don't mean a pre-trained Llama. 00:25:56.400 |
I don't mean one that's been trained for text. 00:25:59.320 |
What I mean is the architecture, so a randomly initialized architecture. 00:26:07.320 |
Training large language models at scale is incredibly difficult. 00:26:12.180 |
Every time you change anything about the architecture, you need different hyperparameters; things scale differently. 00:26:21.320 |
So, the fact that we already know, in our research organization, how to scale this architecture, 00:26:27.340 |
and the fact that we have the infra set up already to train these things at scale, makes a huge difference. 00:26:32.440 |
So, that's why the simplest thing for us to do was to go with the Llama architecture. 00:26:46.640 |
[Audience question about whether you could initialize from a Llama model pre-trained on text.]
But I agree, that would be a really cool thing to try. 00:26:50.660 |
Yeah, it's not entirely clear why that would work. 00:27:00.040 |
Obviously, like, these are very different modalities, different learning objectives. 00:27:04.400 |
But, you know, in a lot of ways, there's a lot of sort of shared structure between these modalities that would maybe benefit from that. 00:27:17.640 |
Okay, so that last slide was, like, very deceptively oversimplified. 00:27:23.220 |
There are some changes that we needed to make to Llama 3. 00:27:25.720 |
So, importantly, Llama 3 is a model for autoregressive text generation, and we are doing text-to-video generation using flow matching. 00:27:33.320 |
So, there are three changes that we need to make. 00:27:35.020 |
I'm going to go over all of them exhaustively to sort of hammer the point that we barely changed the architecture. 00:27:43.720 |
So, the first thing that we need to do is incorporate the text conditioning. 00:27:46.560 |
You'll have seen on the previous slide that our input sequence is just made up of video tokens. 00:27:51.000 |
We're doing text-to-video generation, so we need to incorporate the text conditioning somehow. 00:28:00.760 |
So, very simply, we construct a sequence made up of our text conditioning, and we add cross-attention layers into the transformer block. 00:28:09.200 |
So, these go between the self-attention layers and the feedforward network. 00:28:12.640 |
It's a very common way of adding text conditioning for media generation models. 00:28:17.920 |
There is a question, as well, of what should your text representation be? 00:28:27.000 |
One, like, very simple thing to do would be you just tokenize the prompt, the caption, and then you feed that in. 00:28:35.620 |
But when you do that, you're very much burdening your model with learning this text representation from scratch. 00:28:41.140 |
So, instead, we use pre-trained text representations; we use three of them: UL2, MetaCLIP, and ByT5. 00:28:50.120 |
All of them are pre-trained, frozen text models, basically. 00:28:54.500 |
The first two have very sort of semantic-level representations. 00:29:04.940 |
And the third one, ByT5, has more of a character-level text representation. 00:29:10.720 |
So, we encode the text prompt using all three of these. 00:29:15.940 |
We project them all to the model dimension, and we concatenate, and that gives us our text sequence. 00:29:25.860 |
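A minimal sketch of that projection-and-concatenation step; the encoder output dimensions here are illustrative placeholders, not the real ones.

```python
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """Hedged sketch: project each frozen text encoder's output to the model width and concatenate.
    The encoder dimensions (1536/1280/4096) are illustrative placeholders, not the real values."""
    def __init__(self, model_dim=6144, enc_dims=(1536, 1280, 4096)):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, model_dim) for d in enc_dims)

    def forward(self, enc_outputs):
        # enc_outputs: list of [batch, seq_len_i, enc_dim_i] tensors from the frozen encoders
        projected = [proj(x) for proj, x in zip(self.projs, enc_outputs)]
        return torch.cat(projected, dim=1)   # concatenate along the sequence dimension
```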
The second thing we need to do, you'll remember from the learning objective, that we also need to condition on the time step. 00:29:33.780 |
So, what we do here is we do this with adaptive layer norm (adaLN) blocks. 00:29:36.900 |
So, we've already added a cross-attention block; here, instead, the time-step embedding is used to predict the scale and shift parameters of the layer norms inside each transformer block. 00:29:41.180 |
This might seem like quite a strange way of adding some conditioning. 00:29:46.780 |
It's something that was popularized in the diffusion transformer paper, which was the first paper that used diffusion with transformers, obviously, by the name. 00:29:59.500 |
It's basically very computationally cheap, and works super well. 00:30:14.920 |
The third change is the attention mask: Llama uses causal masking for next token prediction. 00:30:17.180 |
For the flow matching objective, we have no such constraints. 00:30:21.220 |
We want every video token to see every other video token. 00:30:24.260 |
We don't care about causal masking, so we take that out. 00:30:27.440 |
A very low-level detail is that because of this, we use multi-head attention instead of grouped query attention. 00:30:35.360 |
Other than that, it's the Llama architecture. 00:30:37.000 |
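Putting those three changes together, here is a hedged sketch of what one modified transformer block might look like; the dimensions and exact module layout are placeholders, and the adaLN follows the DiT-style scale-and-shift idea described above rather than the exact MovieGen block.

```python
import torch
import torch.nn as nn

class VideoGenBlock(nn.Module):
    """Hedged sketch of one block with the three changes described above:
    bidirectional self-attention, cross-attention to the text sequence, and adaLN time conditioning."""
    def __init__(self, dim=6144, heads=48):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # no causal mask
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # attends to text tokens
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada_ln = nn.Linear(dim, 4 * dim)  # maps the time-step embedding to scales and shifts

    def forward(self, x, text, t_emb):
        # x: [batch, video_tokens, dim], text: [batch, text_tokens, dim], t_emb: [batch, dim]
        scale1, shift1, scale2, shift2 = self.ada_ln(t_emb).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + self.self_attn(h, h, h, need_weights=False)[0]         # every video token sees every other
        x = x + self.cross_attn(x, text, text, need_weights=False)[0]  # text conditioning
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + self.ffn(h)
```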
Okay, so we now have our full architecture diagram. 00:30:44.740 |
I was very wary of putting this earlier because it's pretty complicated, but I think it should all make sense at this point. 00:30:50.680 |
So from left to right, during training, we take one of our training videos. 00:30:56.780 |
We encode it with the TAE to get to our compressed latent representation. 00:31:05.520 |
And we flatten it, and that's how we get our input sequence. 00:31:09.040 |
During training, we'll construct our training sample by combining it with this Gaussian noise. 00:31:15.840 |
During inference, this whole sequence will start off being Gaussian noise. 00:31:20.880 |
The sequence goes through these Llama transformer blocks. 00:31:24.480 |
We add in the conditioning, and then we get our output sequence. 00:31:29.680 |
During training, we'd compute our loss and backpropagate. 00:31:32.960 |
During inference, we would do this iterative denoising process, and finally decode back to the RGB space. 00:31:50.880 |
[Audience question about where the text prompt enters at inference time.]
So, at inference, for a given text prompt, we add this conditioning into the model always. 00:32:08.620 |
So the input sequence is just Gaussian noise, but that's not where the text information comes in. 00:32:13.020 |
The text information comes in through these cross-attention layers. 00:32:16.400 |
So even though the input sequence is just noise, the model is still seeing this clean text information. 00:32:22.380 |
We don't noise the text or anything like that. 00:32:26.620 |
That's just a hyperparameter that we keep constant. 00:32:55.460 |
I cannot remember off the top of my head exactly what that hyperparameter is for, actually. 00:33:03.840 |
The next question is how many of, you know, these denoising steps we take. 00:33:13.360 |
So, during inference, you'll sample a series of time steps. 00:33:16.280 |
Usually, with these flow-based models, the more you sample, the better. 00:33:20.440 |
You'll better approximate these probability paths. 00:33:26.000 |
One of the advantages of flow matching is that the probability paths are theoretically straighter. 00:33:31.380 |
So, you should be able to require fewer function evaluations to approximate this path. 00:33:38.920 |
And is it 250? Like, so, is it not a prescribed number of t's during training? 00:33:50.840 |
During training, t is just sampled randomly, but during inference, we sample some discrete ones. 00:33:58.560 |
On the two videos where you showed the difference between 2022 and 2024, was the improvement just 00:34:05.380 |
algorithmic, or was it the hardware in terms of GPUs, or was it a pure algorithm change? 00:34:17.080 |
There are quite a lot of things that have changed. 00:34:20.720 |
There's an architecture change moving towards transformers, and then there's the scale. 00:34:27.360 |
Scaling the data, scaling the amount of compute, and in order for that to be, like, tractable, better GPU hardware does help. 00:34:39.440 |
There wasn't a huge improvement in GPU hardware over those two years. 00:34:42.260 |
It might have been one or two generations of NVIDIA's stuff. 00:34:55.840 |
So, now we have an architecture that we're pretty confident scales. 00:35:00.580 |
We have a learning objective that we think should work in flow matching. 00:35:14.260 |
So, the last technical details are about data and the training recipe. 00:35:22.520 |
I think in a lot of ways, this is the most important slide of the entire talk today. 00:35:29.260 |
Data is so important for training large language models. 00:35:33.820 |
And, by the way, when I say large language models, I'm sort of just talking about transformers at scale for any modality. 00:35:38.540 |
People use, like, different definitions there. 00:35:51.680 |
The scaling laws depend on the data being clean. 00:35:58.040 |
And the model output quality depends on the data being clean. 00:36:01.060 |
As a result of this, research groups at these big companies spend a huge amount of resources on data. 00:36:08.800 |
This is something that I find isn't really talked about so much. 00:36:12.380 |
But they'll spend huge amounts of resources in terms of GPUs and also actual researchers. 00:36:16.760 |
Often, on these research teams, the data teams massively outnumber the modeling teams. 00:36:22.880 |
Which was something very new to me after my little PhD. 00:36:31.640 |
Well, you know, remember we're training generative models. 00:36:35.540 |
We are learning this distribution of our training data. 00:36:39.840 |
And then we're sampling videos from it that are likely according to that training data. 00:36:43.960 |
So, if we want to sample the kinds of videos that I showed you on the first two slides, 00:36:48.720 |
then all of our training data needs to look like that. 00:36:56.660 |
You obviously have a post-training phase as well where you can align your videos to be more high-quality perhaps. 00:37:03.060 |
But your pre-training data still needs to look great. 00:37:09.500 |
We trained the model on the order of around 100 million videos. 00:37:16.480 |
Well, that number comes from the sort of training budget that we had for this project. 00:37:20.920 |
And you want as many videos as you can get such that you don't epoch, that is, repeat, the data. 00:37:25.360 |
So, the challenge was how to get this many videos that are high enough quality. 00:37:29.800 |
At the time of MovieGen, we constructed this incredibly detailed complex pipeline 00:37:36.140 |
with a bunch of handcrafted and model-based filters. 00:37:40.900 |
I'll just talk through a few bits of it because of the sheer amount of work that went into this. 00:37:47.500 |
So, you start with a large pool of videos from some corpus. 00:37:53.640 |
They may be different lengths, long tail of concepts. 00:37:57.040 |
We did a bunch of visual filtering on these, removing videos that are too small, scene changes, bad aesthetics. 00:38:04.940 |
We removed a bunch of videos that have bad motion. 00:38:07.880 |
It turns out a bunch of videos in any large corpus have really slow motion, janky motion, motion effects. 00:38:25.160 |
But the really important one here is resampling. 00:38:27.500 |
Large language models do not work well when they're trained on a very imbalanced data set in terms of concepts. 00:38:37.140 |
They work best if the concept distribution is roughly, roughly uniform. 00:38:41.480 |
And that uniformity doesn't occur if you just take a random set of videos. 00:38:47.320 |
So, we do this very sort of complex visual concept extraction and clustering, then upweight certain clusters and downweight certain clusters. 00:38:56.280 |
This, all of this will give us a set of videos. 00:39:00.420 |
We also need captions because we're doing text-to-video generation. 00:39:04.320 |
And we generate these automatically using Llama 3. 00:39:15.360 |
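To give a feel for the shape of such a pipeline, here is a heavily hedged pseudocode-style sketch; every field name, threshold, and clustering step below is a placeholder for illustration, not the actual MovieGen filters.

```python
import random
from collections import defaultdict

# Heavily hedged sketch of the multi-stage curation described above. Each video is a dict of
# pre-computed signals; the field names, thresholds, and clustering stand-in are placeholders.
def curate(raw_videos, caption_fn, per_cluster=1000):
    kept = []
    for v in raw_videos:
        if v["min_side"] < 720 or v["has_scene_cuts"] or v["aesthetic_score"] < 0.5:
            continue                      # visual filtering: too small, scene changes, bad aesthetics
        if v["motion_score"] < 0.2 or v["has_motion_effects"]:
            continue                      # motion filtering: static, janky, or effect-heavy clips
        kept.append(v)

    clusters = defaultdict(list)
    for v in kept:                        # stand-in for visual concept extraction + clustering
        clusters[v["concept_cluster"]].append(v)

    balanced = []
    for vids in clusters.values():        # resample toward a roughly uniform concept distribution
        balanced.extend(random.choices(vids, k=per_cluster))   # up- or down-weights each cluster

    return [(v, caption_fn(v)) for v in balanced]   # caption automatically with an LLM
```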
So, this multi-stage recipe here was optimized for convergence speed. 00:39:21.180 |
We start off with a 256-pixel text-to-image (T2I) stage. 00:39:26.360 |
Here, the model can whip through a bunch of samples in relatively few GPU hours. 00:39:33.440 |
We then move on to a pre-training stage with joint text-to-image and text-to-video generation where we progressively increase the resolution from 256p to 768p. 00:39:45.140 |
At the highest resolution here, 768p, that's where we have the sequence length of 73k. 00:39:52.700 |
And we train this on 6,000 GPUs, around 1,500 batch size. 00:40:01.100 |
So, we have a text-to-video post-training stage. 00:40:04.160 |
This is just SFT on, like, a very small set of very high-quality videos. 00:40:09.420 |
And then we also branch off into these different capabilities. 00:40:12.080 |
I'm not going to talk about those a bunch, but I'll show you some examples later. 00:40:20.140 |
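As a compact summary, the staged recipe could be sketched as a config like the one below; only the resolutions, sequence length, GPU count, and rough batch size mentioned in the talk are real, and everything not stated (learning rates, step counts) is deliberately left out rather than guessed.

```python
# Rough sketch of the staged training recipe described above; unstated fields are omitted.
TRAINING_STAGES = [
    {"name": "t2i_warmup",   "task": "text-to-image",                    "resolution": 256},
    {"name": "pretrain_256", "task": "joint text-to-image and video",    "resolution": 256},
    {"name": "pretrain_768", "task": "joint text-to-image and video",    "resolution": 768,
     "seq_len": 73_000, "gpus": 6_000, "batch_size": 1_500},
    {"name": "post_train",   "task": "text-to-video SFT on a small, very high-quality set"},
]
```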
For, like, sort of a wider tail of concepts, say brain research videos, is there sort of, is there pre-training you can do specifically to address those types of challenges? 00:40:30.860 |
Yeah, I think, definitely, I guess the question is about certain concepts in your pre-training data. 00:40:40.640 |
So, I guess, yeah, in a very large corpus like we'll be training on, there is an incredibly long tail of concepts. 00:40:46.720 |
Pretty much every concept that you might want to generate would probably appear at some point in that data set. 00:40:59.940 |
Okay, so, lastly, the results and applications. 00:41:04.040 |
So, putting everything together that I just showed you, and that's all of the technical details, by the way, like, we published all of this. 00:41:09.920 |
We get videos like this, and like the ones that I showed at the beginning of the talk as well. 00:41:16.340 |
And this pretty much sums up the whole point of this project. 00:41:22.080 |
I really want to hammer home that these kinds of videos were nowhere near possible before people started scaling transformers. 00:41:29.740 |
And we showed here that, again, this classic architecture unification story, scaling data, model parameters, and compute for a simple transformer, ended up in a model that can reason about objects, motion, and physics just by watching videos. 00:41:47.840 |
So, none of this was possible, and it's scaling transformers that unlocks all of this. 00:41:52.400 |
I have to keep on pressing replay, because I couldn't figure out how to auto-replay on Google Slides. 00:42:06.120 |
Here's a sloth with pink sunglasses on a donut float. 00:42:09.300 |
Again, I think the physics of sloth-like donut flotation here are good. 00:42:20.840 |
It does actually highlight something quite important. 00:42:24.400 |
When we're training generative models, one important thing is how well it generalizes to concepts that may not be in its pre-training data. 00:42:32.320 |
We can't be sure that there are no, like, sloths on floats in the pre-training data, but, you know, by common sense, there probably aren't too many. 00:42:41.060 |
So, the fact that it can generate this is testament to its generalizing capabilities. 00:42:59.140 |
Another capability is precise video editing: like I said before, you take the original video on the top left, and you can provide these precise edit instructions. 00:43:09.080 |
I didn't get a chance to cover this too much today, but the team trained this with paired data. 00:43:15.580 |
It's incredibly hard to get paired data for this task. 00:43:19.560 |
These pairs of input and outputs, they came up with, like, a fascinating self-supervised approach for this. 00:43:26.280 |
Where did you guys get the data from on this project? 00:43:30.020 |
The videos are completely licensed videos, licensed by Meta. 00:43:49.320 |
You can take, you know, everyone wants to take a video of their penguin and put some Victorian outfits on it. 00:44:01.680 |
One other model that we trained is the personalization model. 00:44:06.280 |
So, here, this is MovieGen video, but with the added capability that you can condition on an image of yourself. 00:44:14.300 |
So, here, the model can generate a video that is faithful to this text prompt, but also contains the person in the conditioning image. 00:44:27.780 |
This is my colleague that I worked on the project with. 00:44:31.480 |
So, yeah, if anyone is interested in this, please go read the paper. 00:44:39.200 |
Very lastly, we're going to see now if the theater has audio. 00:44:49.200 |
Again, I didn't have time to go into this today. 00:44:52.080 |
But this is a model that will condition on text and a video, either real or generated, and generate synchronized audio. 00:44:59.340 |
So, this way, we can add audio to our generated videos. 00:45:05.360 |
So, I'd recommend going to check out the paper if you're interested. 00:45:15.160 |
It's incredibly hard to do fair comparisons in video generation. 00:45:21.280 |
We don't have automated metrics or anything like this. 00:45:25.780 |
So, what we did was we did a very extensive human evaluation study. 00:45:30.400 |
We came up with a bunch of metrics that are somewhat orthogonal to each other that test every aspect of video generation. 00:45:37.220 |
Things like motion quality, how well the videos follow the text prompts, visual quality. 00:45:46.680 |
I won't go into a huge bunch of detail here, but they're very sort of fully defined in the paper. 00:45:51.960 |
We put in a load of work to make sure that the human evaluators had sort of low standard deviation when multiple evaluators were rating the same thing. 00:46:01.120 |
So, we compared to all of the methods that were released in the same year. 00:46:07.540 |
So, things like Runway's models, Luma, Sora at the time, and Kling. 00:46:15.640 |
And these are net win rates you're seeing here. 00:46:18.780 |
So, a score above zero means that our model was preferred. 00:46:21.600 |
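For reference, a net win rate of this kind is typically just the win percentage minus the loss percentage; a minimal sketch, with invented example numbers.

```python
def net_win_rate(wins: int, losses: int, ties: int) -> float:
    """Net win rate in percent: positive means our model was preferred over the baseline."""
    total = wins + losses + ties
    return 100.0 * (wins - losses) / total

# e.g. 55 wins, 30 losses, 15 ties out of 100 A/B judgments -> +25.0
print(net_win_rate(55, 30, 15))
```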
At the time of release, MovieGen outperformed all of the prior work. 00:46:33.220 |
The one conclusion we can draw is that MovieGen was better than these at the time of release. 00:46:40.580 |
As researchers, what we like to do at this point is to look into all of the technical reports of the prior work and see what they did differently and what MovieGen did differently. 00:46:53.120 |
Because we'd like to conclude, right, what led to these improvements? 00:47:04.020 |
But it's unfortunately not possible in today's age to do that, because these models don't come with research publications. 00:47:10.160 |
But what we do know is that all of the technical details I just presented work, and they work really, really well. 00:47:16.480 |
And they're a good starting point for anyone who's looking to improve text-to-video generation. 00:47:29.660 |
I've talked throughout the whole talk today about architecture unification across modalities and learning objectives, and this was a really nice result that we found at the end of the project. 00:47:44.040 |
So what you're seeing here is a scaling law graph. 00:47:46.460 |
How we use these when we're training large language models is often when you start training like a, you know, a GPT or a Llama or whatever, you have a certain training budget. 00:48:01.620 |
This is how many months you have and how many GPUs you have for that period. 00:48:06.540 |
But if you know that training compute budget, an open question is how big your model should be. 00:48:13.480 |
What should be the optimal model size for that compute budget? 00:48:17.060 |
Do you train a smaller model for more iterations, or a larger model for fewer? 00:48:23.680 |
These scaling law curves are for estimating the optimal model size for a given compute budget. 00:48:29.320 |
So just looking at the blue crosses, we plotted a few of these data points for MovieGen. 00:48:35.740 |
Remember, MovieGen is a text-to-video model with the Llama 3 architecture. 00:48:39.500 |
And looking at the blue crosses, we can see this nice correlation 00:48:43.160 |
that we often see in scaling laws with transformers. 00:48:49.020 |
Then there is the scaling law for the Llama 3 text model. 00:48:51.080 |
And amazingly, we see that the Llama 3 scaling law for this text-only model serves as a reasonable predictor for model size and compute for video generation. 00:49:04.480 |
And this seems to hint that scaling laws for transformers are maybe modality independent, which is pretty fascinating. 00:49:31.380 |
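For context on how a compute-optimal curve like this is usually estimated, here is a hedged sketch that fits a power law between training compute and the empirically best model size; the data points below are made-up placeholders, not MovieGen measurements.

```python
import numpy as np

# Assume the optimal model size N follows a power law in training compute C, N = a * C^b,
# and fit it in log space. The (compute, size) points are illustrative placeholders only.
compute = np.array([1e21, 3e21, 1e22, 3e22])      # training FLOPs
opt_params = np.array([2e9, 4e9, 8e9, 1.5e10])    # empirically best model size at each budget

b, log_a = np.polyfit(np.log(compute), np.log(opt_params), 1)
predict = lambda C: np.exp(log_a) * C ** b        # predicted optimal size for a new budget

print(f"exponent b = {b:.2f}, predicted size at 1e23 FLOPs = {predict(1e23):.2e} params")
```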
Of course, the model still has limitations: it will struggle with generating complex motions from complex prompts. 00:49:37.720 |
An example here is a dramatic scene of two cars colliding at an intersection. 00:49:45.960 |
And then at some point near the end, they sort of, I don't know what you'd call that, they sort of independently implode. 00:49:53.660 |
At one point, the silver car kind of turns into two cars as well. 00:50:00.040 |
So this is a random generation from our model. 00:50:02.200 |
So, you know, text-to-video generation is not solved. 00:50:05.240 |
So my final slide is some ideas about where I think video generation is going next. 00:50:17.240 |
Things that I'm sure we'll see some version of at some point soon. 00:50:24.140 |
How can we solve the issues that you saw in the previous slides? 00:50:29.520 |
Well, the first kind of obvious one is scaling everything more. 00:50:35.540 |
This has been like the story of machine learning for the last six years or something. 00:50:50.660 |
I think scaling everything more would definitely result in far higher quality generations. 00:50:59.000 |
Some challenges there would be around scaling data by like an order of magnitude. 00:51:08.660 |
The second idea is reasoning: we've all seen the amazing benefits in language modeling that have come from reasoning in the last year or two. 00:51:21.020 |
Here, the reasoning gives the model the ability to sort of pause, think, generate a chain of thoughts, self-correct before generating an answer. 00:51:34.080 |
I think it's very clear that video generation models would benefit from this kind of reasoning capability as well. 00:51:42.460 |
When all of us were looking at the videos on the previous slides, it was very clear, right, that something was wrong. 00:51:51.200 |
It doesn't seem too unfathomable to be able to imbue some video generation model with the capability to self-correct, maybe see that there are some errors in the video that it generates and correct it. 00:52:08.940 |
There are lots of really interesting research questions here, like what does it mean to generate a reasoning trace for media generation? 00:52:16.000 |
You might have seen these, like, chains of thought that are generated by, like, R1 or o3. 00:52:22.860 |
I wonder what that looks like for media generation. 00:52:28.260 |
So, these latest state-of-the-art reasoning approaches are all trained with RL. 00:52:35.200 |
RL requires verification models to verify the correctness of the outputs. 00:52:40.440 |
It's an open research question what that means for video generation. 00:52:44.820 |
How do you verify the correctness of a video? 00:52:52.920 |
The next idea is native multimodality: recent large language models are natively multimodal. 00:52:58.600 |
They can generate text, they can do image understanding, they can do video understanding. 00:53:05.900 |
So, it's an interesting question as to whether video generation would also benefit from being thrown into this native mix. 00:53:16.320 |
And if so, there are interesting questions around how you would train such a thing. 00:53:22.900 |
I've just talked you through how flow matching seems to work the best for video generation. 00:53:26.840 |
Is there a way that you can have multiple learning objectives? 00:53:30.280 |
Do you need to unify the learning objectives? 00:53:36.240 |
So, very lastly, it was a huge team that worked on MovieGen. 00:53:46.420 |
I had so much fun on this project, learned so much. 00:53:56.480 |
So, yeah, I'm going to leave it on this slide, but we can do questions now, if there are any. 00:54:15.200 |
And a lot of these types of architectures are UNet-based, the ConvNet kind of structures. 00:54:20.140 |
And for action diffusion policies in robotics, surprisingly, the ConvNet architecture outperforms the transformer architecture. 00:54:29.300 |
And they make an argument that it has an inductive bias toward smoothing, so that there's more consistency in the spatial and the time domain and all that great stuff. 00:54:41.680 |
And, you know, what do you think about that? 00:54:43.040 |
Yeah, so I haven't read that particular paper, but it does sound like a familiar point. 00:54:51.820 |
I guess the, you know, the whole theory around architecture unification is that we might have these specialized architectures, like CNNs, and these hold inductive biases around visual data. 00:55:09.620 |
Like, they prioritize these local interactions with the convolutional kernels and so on. 00:55:14.620 |
And the going idea has been that when you're training at small scale, having these inductive biases helps. 00:55:21.980 |
But when you're training at large scale and you have enough data, you can sort of learn all of these yourself with a transformer in a less constrained setting. 00:55:33.180 |
We have found that scaling transformers here works better than scaling these specialized architectures, for a few reasons. 00:55:41.720 |
I mean, it's not even to say that I don't think CNNs could scale. 00:55:46.680 |
It seems as though it's easier to scale transformers. 00:55:50.620 |
It's more straightforward to know in which direction to scale. 00:56:03.520 |
Can this model and approach be used for 3D generation to, like, video games and stuff like that? 00:56:09.120 |
So, if you go back to the architecture, it's totally modality independent. 00:56:27.220 |
The important thing here is that we've turned a modality into a sequence of tokens. 00:56:33.520 |
At this point, after that, everything that's happening in the architecture is modality independent. 00:56:38.920 |
So, really for any new kind of data, all you need is a way of turning it into a sequence of tokens. 00:56:47.940 |
It may be more challenging to encode, like, that kind of data. 00:56:54.700 |
But as long as you can encode it in some way to a series of tokens, you can use the exact same approach. 00:57:06.940 |
So, we can, I can read a few of those right now. 00:57:10.940 |
So, one is, so, 16 seconds seems like the longest videos you can effectively generate right now. 00:57:15.660 |
What are the main obstacles or things keeping us from getting to longer? 00:57:29.620 |
Well, there is so many different answers to this. 00:57:32.660 |
If you're thinking from the movie gen setup, the issue is sequence length. 00:57:37.380 |
Given the level of compression that we had, 73K was pretty much the longest sequence length we could feasibly train at. 00:57:46.420 |
If you want to train on 32-second videos, that's going to double. 00:57:53.260 |
You could train an encoder with far more compression. 00:57:59.620 |
There's another question about learning objective. 00:58:07.980 |
There are lots of papers out there that will iteratively generate videos along the temporal axis. 00:58:13.780 |
You will generate a chunk and then generate a new chunk conditioned on the previous chunk. 00:58:26.060 |
And there are lots of interesting papers that generate sort of infinite length videos using these like iterative processes. 00:58:37.560 |
There's some questions on like synthetic data compute. 00:58:41.500 |
So, someone says, you know, meta is a rare example of having an abundance of training data. 00:58:47.680 |
What if, you know, you run out of video, then what's next, I guess? 00:59:08.960 |
I think that there's a lot of interesting work to be done on improving the data filtration steps. 00:59:16.820 |
So, you know, in this slide, we were optimizing this for high precision. 00:59:22.860 |
So, making sure that all the videos at the end were very high quality. 00:59:32.040 |
But we would have lost a lot of good data in this like very complex pipeline here. 00:59:36.140 |
One like way forward for getting more data is moving to smarter ways of doing this process. 00:59:43.200 |
Maybe completely language model based, for example. 00:59:56.080 |
How can academic researchers contribute to video generation without having, you know, access to thousands of GPUs? 01:00:06.780 |
It's obviously very tough to do this level of pre-training outside of industry labs. 01:00:13.180 |
But, you know, throughout this paper, we've used a bunch of innovations that came from academia. 01:00:18.840 |
For example, a lot of the flow matching work was done at universities. 01:00:24.080 |
The main paper we take inspiration from did come from Meta. 01:00:29.700 |
But this kind of research can be done at small scale, and then all of us can learn from it. 01:00:52.420 |
But things like learning objectives, I think post-training schemes, we've seen a lot of, like, great work coming out of academia. 01:01:01.640 |
A few questions all related to text, I guess. 01:01:06.700 |
Some folks, you know, are saying there's a lot of work of cleaning and processing video data. 01:01:12.820 |
How do you make sure that the actual text, for example, the Llama 3 generated captions, are high quality and complete? 01:01:27.280 |
We put a lot of work into training this Llama 3 captioner, basically. 01:01:36.700 |
This went through its own sort of large-scale training in order to generate good-looking captions that were aligned with what we wanted. 01:01:46.640 |
But certainly, there's a lot of room for improvement there. 01:01:49.280 |
These captions are not as good as human-written captions. 01:01:54.900 |
There are a lot of architectural reasons for that. 01:01:56.920 |
A lot of these, like, video-conditioned language models cannot see the whole video. 01:02:03.660 |
If it's a 16-second video at 16 FPS, often it's far too much video for the model to be conditioned on. 01:02:12.620 |
So often with a lot of these open-source models, like not only Llama but Gemma and so on, you have to subsample frames. 01:02:21.160 |
And so you're blocking the model, the language model, from seeing a lot of the video. 01:02:24.600 |
It's going to lead to some issues, some missed things. 01:02:27.240 |
So we do our best by training and post-training a captioning model, doing a bunch of evaluations on it. 01:02:33.560 |
But that's definitely something that can be improved. 01:02:35.740 |
And there's a bunch of really cool results out there from the text-to-image community showing that if you improve captions, your image quality gets better. 01:02:44.020 |
It's not entirely clear why, but that keeps on happening. 01:02:47.160 |
Someone asked a related thing, which is how much of a role or importance does the text prompt encoder play? 01:02:55.120 |
I think there were some image generation works which showed, like, replacing the text encoder, 01:03:00.780 |
I think it was from CLIP to, like, T5, really helped improve performance. 01:03:05.560 |
Did you guys play around with, like, several different text encoders? 01:03:10.900 |
So this particular series of text encoders was sort of a precedent in our team. 01:03:17.280 |
We took motivation from a recent, like, state-of-the-art text-to-image paper that used this series of text encoders. 01:03:25.660 |
But, you know, it's worth pointing out here that it is quite strange what we've done here. 01:03:30.380 |
You would think it would be, like, intuitive that you want your best possible text representation here. 01:03:39.040 |
All of these text representations are nowhere near state-of-the-art. 01:03:43.800 |
There have been a few works and empirical findings showing that, in this setup at least, decoder-only text representations don't work as well for some reason. 01:03:58.800 |
Some reasons for that could be, or some have hypothesized, that you need a text representation that is more aligned with the media space. 01:04:07.640 |
That's why you'll see a lot of people conditioning on CLIP, which is what we do. 01:04:15.940 |
We didn't ablate this in this project, though. 01:04:20.400 |
I have a question followed up on the text part. 01:04:23.140 |
How well would this do with, like, very detailed prompts? 01:04:27.300 |
Like, I want to make a video about this person wearing this specific color, and here's what happens later on. 01:04:35.300 |
And then, like, a very detailed script, rather than, you know, just a video of some penguins, I guess. 01:04:42.020 |
We definitely did observe that the model can do these sequential actions, but not always completely accurately. 01:04:52.120 |
Some of that might be issues in the captioning of the pre-training data. 01:04:57.980 |
Like, are you accurately captioning all of these things happening? 01:05:01.800 |
But, yeah, that will be one of the places where it struggles, if you were to detail, like, three or four things happening in sequence. 01:05:11.160 |
We have one question, which is, could you somehow hard code in some priors related to physics and real-world common sense to improve the realism and accuracy of generated videos? 01:05:21.100 |
I guess that might be related to that video of, like, the car splitting and all of that. 01:05:35.000 |
So, in a way, it's kind of like the antithesis of... 01:05:40.300 |
the things that we were trying here, of, like, removing all of the inductive biases and just scaling compute and data. 01:05:48.160 |
But I do think those are interesting things to try, simply because maybe if you are trying to learn the laws of physics better, just random videos from, like, a large pool aren't maybe the best thing to use there. 01:06:05.240 |
And there are other cool works that have been released where they're trained, like, entirely on video game data, things like that. 01:06:12.100 |
Yeah, I think encoding some sort of priors would be cool. 01:06:19.300 |
Yeah, because, you know, there are certain things about the natural world that we do know for sure. 01:06:22.740 |
Certain, like, computer vision principles and so on. 01:06:35.960 |
We've got a couple others online, and then we can go back to in-person. 01:06:39.260 |
One's more on, you know, deep fakes and malicious use of video generation models. 01:06:44.720 |
Is there any work on things like watermarking to sort of deal with that? 01:06:49.920 |
There's definitely a bunch of work on watermarking. 01:06:52.460 |
We, yeah, there's definitely a bunch of work on watermarking coming from a few groups. 01:07:02.140 |
DeepMind has released a few interesting papers on it. 01:07:11.300 |
And then we can go back to some in-person questions. 01:07:21.660 |
I've noticed some other models have a GAN discriminator before or in the decoder for temporal coherence. 01:07:28.260 |
Is this still useful, or does scale overcome this need? 01:07:38.900 |
Yeah, so we also have a GAN discriminator there. 01:07:41.660 |
It's not necessarily for temporal consistency. 01:07:51.060 |
There's sort of a historical precedent for this. 01:07:59.760 |
When you're training these VAEs, folks found, it was the Stable Diffusion researchers, actually, 01:08:09.040 |
who published this paper called VQGAN back in 2021, I think. 01:08:14.240 |
And they showed that normally how you train these VAEs is that you just use L1 losses between the input and the output. 01:08:22.780 |
But they found that the amount of compression you could do was very limited when you were doing that. 01:08:27.180 |
So what they started doing was adding some adversarial losses from the GAN literature. 01:08:32.340 |
What this does is it tells the VAE that I don't have to decode exactly what was given to me. 01:08:41.880 |
Because the GAN losses, they're not L1 losses, they're not pixel level losses. 01:08:45.700 |
These losses are just how well the GAN can tell if it's a real image or a fake one. 01:08:51.640 |
So when folks started training VAEs with these losses, the outputs were given a bit more freedom. 01:09:00.260 |
So the model could generate a range of things for a certain input. 01:09:04.200 |
And it's not getting penalized from the loss as long as they look real. 01:09:07.440 |
And I realize there's a bunch of detail there. 01:09:10.780 |
But basically, by adding these losses, we were able to get another 2x level of compression. 01:09:16.760 |
So we are also using adversarial losses in the VAE, if that was their question. 01:09:22.180 |
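As a rough illustration of that loss mix, here is a hedged sketch; the weights and the discriminator interface are placeholders, not the objective actually used for the TAE.

```python
import torch.nn.functional as F

def vae_loss(decoder_out, target, discriminator, l1_weight=1.0, adv_weight=0.1):
    """Hedged sketch of the loss mix described above; the weights are placeholders.
    The reconstruction term keeps outputs close to the input, while the adversarial term
    only asks that outputs look real to a discriminator, which allows more compression."""
    rec = F.l1_loss(decoder_out, target)        # pixel-level reconstruction loss
    logits_fake = discriminator(decoder_out)    # discriminator scores the reconstruction
    adv = -logits_fake.mean()                   # generator side of a GAN-style objective
    return l1_weight * rec + adv_weight * adv
```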
Someone added an interesting question just now, which is, 01:09:25.520 |
how might you make sure all the videos in your training data are real? 01:09:32.700 |
I guess that's related to some work nowadays in language models. 01:09:37.280 |
But once more and more training data becomes synthetically generated, does that lead to any issues, I guess? 01:09:45.980 |
Definitely this idea of, like, data set poisoning, it's often called, is a problem. 01:09:51.140 |
I think there are a couple interesting answers to this. 01:09:57.020 |
Firstly, in our pre-training data, if it's low quality, then we don't want to train on it. 01:10:03.800 |
And hopefully, we would find it and get rid of it. 01:10:06.320 |
But training on generated data is not always a bad thing. 01:10:10.040 |
Most contemporary post-training approaches for language models do exactly this: 01:10:17.360 |
you train on generations from the model itself. 01:10:20.000 |
So it's not like training on generated data is always bad. 01:10:23.460 |
If there are some really, like, bad generated videos in the pre-training data, hopefully the filtering would catch them and get rid of them. 01:10:35.880 |
We've got time for a couple more in-person questions. 01:10:39.000 |
[Question about the training and inference infrastructure.]
In the paper, we included, like, full details of the training infrastructure that I've shown here. 01:10:45.140 |
I don't think we included any details about the inference infrastructure. 01:11:27.580 |
I don't know if I missed that, but for the videos you generated, do they come with audio? 01:11:33.240 |
Yeah, so this text-to-video model that I've shown here, it generates video. 01:11:41.860 |
But with the publication, we released a, well, published about a, where is it? 01:11:52.980 |
Actually, no, we're not going over that problem again. 01:12:01.480 |
So there were, yeah, there was a separate MovieGen Audio model that we trained that will add audio to the generated videos. 01:12:09.500 |
Video and audio are two separate models, and the audio is, like, another level of complexity, where 01:12:19.560 |
there are, like, multiple channels, and, you know, whatever kinds of files you want to train on. 01:12:23.800 |
So what is, like, your current progress in this audio generation? 01:12:33.560 |
So, we have a really great audio research team that works on this. 01:12:44.200 |
One very nice thing to do, I think, would be to generate everything at once, like video and audio together. 01:12:54.620 |
So, theoretically, both modalities should benefit from being trained together. 01:13:00.680 |
The audio even encodes some information about the scene that is not present in the video. 01:13:09.360 |
It's very hard to get high-quality video data. 01:13:11.360 |
It's even harder to get high-quality video with good audio data. 01:13:15.240 |
So, that's part of the reason why we didn't do that for this project. 01:13:24.920 |
Thanks so much, Andrew, for the very insightful talk and answering all our questions.