I'm very happy to end it off with a bang here with Andrew. So he's a research scientist at Meta's Gen AI team, focusing on media generation. Over the past few years, his team has focused on publishing research papers that push the frontiers of video generative models, including Emu Video as well as MovieGen.
Prior to working at Meta, Andrew completed his PhD at Oxford's Visual Geometry Group, VGG, under the supervision of Professor Andrew Zisserman. So without further ado, I'll hand it off to Andrew to take it. Yeah, thank you for that. Like the intro said, I'm Andrew Brown. I'm a researcher in Gen AI at Meta.
If you guys haven't heard of that, Gen AI is the research organization that releases Meta's generative models, things like Llama for the text LLMs and our media generation models as well. I've been there for about two and a half years, ever since I finished my PhD. And like Stephen said, over that period, we've released a bunch of frontier-pushing video generation models.
So thank you so much for inviting me. I'm very honored. Thank you. And yeah, today's talk is Transformers for video generation. So I saw in the list of seminar speakers already you've had a bunch on NLP and some media generation stuff. I hope this is new compared to what you've heard already.
Okay, so text-to-video models. You guys might have seen videos like this on the internet. But given a text prompt, contemporary text-to-video generation models can now create these incredibly high-quality videos. Complex motion, fantastical scenes, very high-quality. These things are amazing. Another example I quite like, this is a ghost in a white bedsheet.
These things are amazing. Not only are they super high-quality, they seem to have learned some notion of the laws of physics. The thing I like about this one is if you look closely, you can see the reflection of the ghost is shown in the mirror there. So these models have clearly learned something here.
These things are amazing. And it doesn't stop there. So these text-to-video generation models can be used for other amazing capabilities as well, like editing. So, for example, you can give these models an input video, like on the top left, and specify some edit prompt, like turn the runner into an inflatable dinosaur or turn it into a cactus desert.
These things are amazing. Now, I don't know how long a lot of you have been paying attention to the media generation field or even the machine learning field. But if you've been only paying attention for the last year, year and a half, this is all that you're used to.
You're only used to seeing generated videos that are completely indistinguishable from real ones. But this is an incredibly recent development. So on the right here, I have the same video I showed you on the first slide. This is from a model that was released in October 2024. And on the left, I have what was an amazing state-of-the-art approach from September 2022.
This is another model that was released by my team. And I can't stress enough how amazing this was at the time. And this gap, you'll see, is two years. I'm sure every speaker at this seminar has come in and said there's been, like, amazing progress in this machine learning subfield or that one.
And there has been everywhere. But nowhere is it more clear than video generation. So how did this happen? This is what the subject of today's talk is going to be. How do we train models to generate videos like this? So all of the videos you saw in the previous slides are from a paper we released in October 2024 called MovieGen.
Here's a little snapshot of the abstract. A little spoiler for how we did this and a spoiler for why I was invited to this is that we did it using transformers. I saw this little guy included in previous talks, so I had to put him there. So today's conclusions are going to be twofold.
First, I'm going to talk you through every detail of how you train a transformer to generate videos like this. We publish all of the details. We're going to step through everything. Some of the concepts are going to be familiar, things like transformers. Some are going to be new. We're going to go through it all.
The second takeaway is going to be this conclusion that I'll keep on saying. That throughout this project, we learned that scaling data, compute, and model parameters for a simple transformer also works for video generation. We've seen it work in all kinds of machine learning fields. We also saw it work here for video generation.
Okay, so just a little bit of personal background. I think most of the speakers that you've had are from NLP, which makes sense because transformers came from NLP. I'm a computer vision researcher, and I've been in visual generation for a few years now, and it hasn't always been as popular as it is now.
So how I got into this, I was sat in this lecture hall in the engineering department at Oxford in, what, the second year of my PhD, and we were having a talk by Professor Antonio Torralba from MIT, and he was presenting this work called GAN dissection. Some of you may have heard of GANs.
The state-of-the-art image generation approaches at the time were these generative adversarial networks, and in 2019, you could generate like a blurry face or a blurry kitchen or a blurry bedroom, and that was amazing at the time. And this paper here was showing that you could activate or deactivate certain neurons in the GAN, and in doing so, you could make certain concepts appear or disappear.
So these are the kind of images that we were generating at the time. So this is like a blurry kitchen, and this was like near state-of-the-art. So you can see, you know, the visual concepts are pretty messed up. It doesn't make a huge amount of sense, but I can't stress how, like, good this was at the time.
I'm sure a lot of you have seen, like, DALL-E and Stable Diffusion and so on. Things were not always this good. So the point of the paper, or at least in this example, was they were showing that you could activate certain neurons and make windows appear in the kitchen.
You could imagine how your kitchen would look with windows. And the outcome was this. So some, like, pretty janky windows on the left. And my mind was completely blown, as I can sense, like, all of yours are as well. This was amazing. So what they showed here is not only that the model had learned a physically plausible place to put the windows, but importantly, they'd also shown that the model had learned some notion of physics.
The model had learned that if you put windows on the left-hand side of the room, light will come through them, and there'll be a reflection of the marble countertop. And you can see that here. And the model had just learned this by looking at images. So this had, like, a really profound impact for me as a young PhD student, and I've been in visual generation ever since.
Okay, so the body of the talk. There's going to be five parts. I'll give an overview of the model that we trained, talk about the architecture, data, and training recipe, results and applications, and a little discussion on what is next. So first, I included a bit of historical context on MovieGen and video generation.
Maybe some of you are quite new to the video generation field. The field is quite new inherently. This snapshot is maybe three years, from 2024 back to around 2022. This is a century in machine learning research. I haven't included all of the works here. There are some very relevant, important works here.
I've just included a snapshot to make a couple of points. There have been two milestone events in video generation. The first was in 2022, when people started using diffusion modeling. This is when the whole community started using diffusion. That was a big step up in visual quality at that point.
The second one was in 2024, and this is the point of today's talk. Before 2024, people were using quite small-scale, specialized architectures. I say small because the definition of small and large has been moving all over the place recently. But these were specialized architectures for computer vision, things like CNNs, U-Nets, and so on.
And then around 2024, video generation sort of boarded this architecture unification setup. So all over machine learning fields, we've seen people ditch specialized architectures and move towards this simple transformer setup. The reason is because all of these different fields are seeing the benefits of efficiency and scalability by moving to these transformers.
So in 2024, the video generation community started doing the same, and that's where MovieGen comes in, which I'll talk about today. So a quick overview on what MovieGen is before we get into the details. So MovieGen was a cast of foundation models that generates high-quality 1080p HD videos with different aspect ratios and synchronized audio.
Today, I'm just going to be talking about the text-to-video model. I'll show some fun examples later of the other models that we trained. And again, like I've been saying, the point of the paper was showing that scaling data, training compute, and model parameters for a simple transformer trained with flow matching, I'll cover that later, yielded state-of-the-art results.
We also presented a few sort of innovations and simplifications along the way. So MovieGen Video, the model, was a 30 billion parameter foundation model for joint text-to-image and text-to-video generation. The model was trained on the order of around 100 million videos and 1 billion images. Okay, so on to the architecture.
There are three main things that I want to cover today. The first is the representation. What representation are we learning? The second is what learning objective that we used. And the third is what model architecture we used for learning it. If I do a good job here, then all of you will know all there is to know about video generation.
I'm imagining most of you have more of an NLP background with sort of autoregressive text models. I'm going to try and contextualize all of this in relation to text and how it differs from large language models. Okay, so the representation. What do we mean by that? Well, we mean how are we going to represent the data for the model?
We're doing generative modeling here. We're learning P of X. The question here is what should X be? We know that X is going to be derived in some way from videos, but there's an open question of how exactly do we do that. So to motivate what we ended up doing, I'm just going to talk a bit about the differences between text and media.
So let's look at some text data, for example. And this piece of data is a sentence: "an image of a cat." What has happened when this data sample has been created is a human has put in a huge amount of thought into compressing what they're thinking into this very well-designed, semantically rich language.
It's very compressed. Every word packs a huge amount of information. And it's also inherently discrete. So in practice, when people are training large language models, they can use a representation that's quite close to this. So they might use a sort of simple tokenizer before feeding it into the transformer.
Media data is incredibly different to this. So let's look at a related piece of data, an actual image of a cat. No compression has really occurred here. This image is just continuous raw data that's been captured by your camera. Maybe the only sort of human effort that's gone into this is positioning the camera and framing the cat in the middle of the camera, but nothing else.
As a result of this, there's tons of redundancy. So what I mean by this is if you know what a cat is, and then let's say the middle pixel you know is this white fur on the cat. Well, if you know what a cat is, you know the next pixel along is probably going to be white fur and the next one along from that.
And if it's a video, you know the same pixels one frame along are also going to be white cat fur. There's a huge amount of redundancy. So this begs the question of perhaps this raw data could be transformed into something that more closely resembles language. So what do we actually do?
So if you forget everything that I just said on the previous slide, one thing you could do, a very simple approach, is you just model the pixels directly. So let's say you took an image or a video, you unraveled it into a long sequence of pixels, and then trained, let's say, next token prediction on that.
And in a way, that's what some prior works did. Things like Imagen Video or Image GPT back in the day, along with a little bit of patching. So this is a very sort of conceptually simple method, but it's very computationally constraining. So the thing about modeling pixels directly is the number of pixels scales quadratically with the image or video resolution.
It's bad for images. It's even worse for video when you have a temporal dimension. In practice, what this means is these models can only model very low-resolution images or videos, things like 64 by 64 videos, which is not ideal. And if we want to generate an actual large HD video, these methods had to employ a huge cascade of upsampling stages, things like super-resolution models, frame interpolation models, to increase the size of the data.
So this really isn't ideal. So instead, what prior work does is they learn a compressed latent representation using something like a VAE or a VQVAE trained offline. This is what every sort of text-to-image model or text-to-video model you've seen on social media has been doing for a long time.
The advantages here are twofold. If you're modeling a compressed version of your data, then you can natively model larger data. We don't need to go down to 64 by 64. We can natively model something bigger. The other advantage is that this offline-trained VAE or VQVAE can remove some of the computational burden from the language model.
For example, these autoencoders could handle the modeling of how two separate blades of grass differ from each other in an image or a video. And it can take that burden off the sort of downstream language model. Okay, so that is what we do. In terms of architecture, we train something called a temporal autoencoder, or TAE, for spatio-temporal video compression.
This is basically just a variational autoencoder. How does this work? Well, you take a video. You'll feed it through the TAE encoder. A VAE consists of an encoder and a decoder. After the encoder, the representation will be compressed, and you end up with this latent representation at the bottleneck in the middle.
And this is the representation that we're going to use downstream. How you train these things is quite simple. Some of you may have seen it before. You take your video. You feed it through the encoder. You get to the bottleneck latent representation. You then decode it back to pixel space.
And you have a bunch of losses between the output and the input, things like L1 losses, adversarial losses, and so on. So this representation in the middle is what we're going to model. When we talk about learning P of X, we're learning the distribution of this latent. What this means is when we train a generative model on this, it generates in this space.
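Before we get to decoding back to RGB, here is a minimal PyTorch-style sketch of that TAE training step, with tiny stand-in networks. The real TAE is a much deeper 3D-conv VAE and the full loss has more terms; the module definitions, loss weight, and optimizer settings here are illustrative assumptions, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for the TAE encoder/decoder and the discriminator; the real ones
# are much deeper 3D-conv networks with 8x compression in height, width, and time.
encoder = nn.Conv3d(3, 16, kernel_size=8, stride=8)           # (B,3,T,H,W) -> (B,16,T/8,H/8,W/8)
decoder = nn.ConvTranspose3d(16, 3, kernel_size=8, stride=8)  # back to pixel space
discriminator = nn.Sequential(nn.Conv3d(3, 8, 4, 2, 1), nn.Flatten(), nn.LazyLinear(1))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def tae_train_step(video):                       # video: (B, 3, T, H, W) RGB, dims divisible by 8
    latent = encoder(video)                      # compressed bottleneck latent
    recon = decoder(latent)                      # reconstruction in pixel space
    rec_loss = F.l1_loss(recon, video)           # pixel-level reconstruction loss
    # Adversarial term: reward reconstructions the discriminator scores as "real"
    # (non-saturating GAN loss). The discriminator's own update step is omitted here.
    adv_loss = F.softplus(-discriminator(recon)).mean()
    loss = rec_loss + 0.1 * adv_loss             # the 0.1 weight is illustrative only
    opt.zero_grad()
    loss.backward()
    opt.step()
    return latent.detach()                       # this latent space is what the generator models
```

For example, `tae_train_step(torch.rand(1, 3, 16, 64, 64))` runs one step on a random clip.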
It doesn't generate in RGB space. So after we generate a video, we then need to decode it back to RGB space. So the TAE that we trained had 8x compression in each dimension. 8x in height, width, and time. And this was pretty high compression at the time. It's not the highest compression anymore.
This was published like six months ago. And six months is, again, like a decade in machine learning research. But at the time, this was very high compression. And like I said, this means that we can natively model very high-resolution videos. As an example, the largest video that we model in this work is 768x768 pixels, 16 seconds, 16 FPS.
Now, if we were to model pixels directly, and we took a video of that size, we unraveled everything, we treated one pixel as one token, and we flattened the whole thing, it would result in 150 million tokens. Even with very long context training methods with language models at the moment, this is completely unfeasible.
But using this temporal autoencoder, the same video is just compressed to 73,000 tokens. So this is suddenly completely computationally feasible using like today's parallelism approaches, today's infrastructure. If anyone is doing this math offline, there is also a patchify layer, in case someone thinks my math is wrong here.
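Here is the back-of-the-envelope arithmetic behind those numbers, assuming 8x compression in each of height, width, and time plus the 2x2 spatial patchify layer mentioned just now.

```python
# Largest video modeled in MovieGen: 768x768 pixels, 16 seconds at 16 FPS = 256 frames.
H, W, T = 768, 768, 16 * 16

pixels_as_tokens = H * W * T                       # one token per pixel
print(pixels_as_tokens)                            # 150_994_944, i.e. ~150 million tokens

# TAE: 8x compression in each of height, width, and time,
# followed by a 2x2 spatial patchify layer on the latent.
latent_tokens = (H // 8 // 2) * (W // 8 // 2) * (T // 8)
print(latent_tokens)                               # 73_728, i.e. ~73K tokens
```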
Okay, so that's everything for the representation. Does anyone have any questions at this point? [Audience question about whether the autoencoder is causal in time.] It's a great point. A lot of autoencoders for videos do use causality. Some nice outcomes of that is that when you encode images, they can be encoded completely independently of subsequent frames and so on.
But no, this isn't causal. Okay, so next up is which generative modeling learning objective do we use? So probably in most of the talks that you've had so far, you've heard about autoregression and next token prediction for text. In media generation, we haven't been doing that for a couple of years.
So the de facto approach in most media generation has been using diffusion modeling. We use something called flow matching. So what is flow matching? Flow matching is, in a way, a simpler generalization of diffusion. Now, if any of you have watched a talk on diffusion modeling or read anything about diffusion or flow matching, I'm sure you've seen a figure like this before.
I'm going to give a brief sort of explainer of what is similar between flow matching and diffusion. And then we're going to go over what a training step looks like. So both diffusion and flow matching have a very similar setup. So you assume that you have some unknown data distribution.
This is the sort of distribution of images in this figure. In this case, like images of cats. This is the distribution you're trying to learn. This is the distribution we want to learn and then sample from. You also assume you have a known data distribution on the right. And we model this as normally just like Gaussian noise.
Both assume then that we have this fixed forward process. What this means is we have a method of translating between the unknown data distribution and the known one by iteratively adding noise. Both assume that if you keep on adding noise, you basically end up at this known data distribution.
Then both of them train a neural network to do the reverse process. So they train a neural network to take one of these images and iteratively denoise those. And then at inference time, you can iteratively use this neural network to go from a sample that's pure noise back to a sample from this data distribution that you've just learned.
And that's how we end up sampling images and videos. So diffusion and flow matching are very similar in a lot of ways. Flow matching is, in a way, a simpler generalization. It's been very recently shown to result in more robust training and more efficient probability paths that are easier and faster to sample from.
So this paper came out pretty recently in 2023 from some colleagues at Meta. Importantly, it's been shown to work better than diffusion. And I'm not going to go into a huge amount of detail here, but we're going to go over what a training step looks like. So there are lots of equations here.
We're going to step through them pretty easily. It's a three-step process. We first take a training data sample, X1. This is your image of a cat on the previous slide. This is just an image from your data set. We then sample a time step. This is a float between zero and one.
And we sample from the known data distribution. This just means taking a sample from a normal Gaussian. We then construct a training sample, Xt. What is this? This is just this sort of intermediate image, a somewhat noisy image of a cat. There are lots of different ways of constructing Xt.
We use what's called the simple linear interpolation from the flow matching paper. And the equation is shown here. So this is how we go from the three things we sampled above to this intermediate training sample. Then in flow matching, what you do is you train the model to predict the velocity.
This is a value which moves the training sample back in the direction of the data sample. In actuality, it's very simple. So this is how we compute the velocity, simply by differentiating the equation above. This is how we get our ground truth. And then on the right, we have our actual learning objective.
So this is the mean squared error between the model prediction and the ground truth velocity. Here, the model prediction is parametrized by U. It takes as input the training sample. It's conditioned on two things. It's conditioned on the text prompt. That's P. Remember, we're doing text to video generation.
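As a concrete reference, here is a minimal PyTorch-style sketch of that training step. The interpolation convention assumed here (t = 0 is noise, t = 1 is data, so the target velocity is x1 - x0) and the placeholder `model(x_t, t, prompt)` signature are illustrative assumptions; the actual model is the transformer described later.

```python
import torch

def flow_matching_loss(model, x1, prompt):
    """One training step on a batch of clean TAE latents x1.

    Convention assumed here: t = 0 is pure noise, t = 1 is data, x_t is the
    simple linear interpolation, so the ground-truth velocity is x1 - x0.
    `model(x_t, t, prompt)` is a placeholder for the conditioned transformer.
    """
    x0 = torch.randn_like(x1)                      # sample from the known (Gaussian) distribution
    t = torch.rand(x1.shape[0], device=x1.device)  # one timestep in [0, 1] per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # reshape for broadcasting over latent dims

    x_t = t_ * x1 + (1.0 - t_) * x0                # intermediate, partially-noised sample
    v_target = x1 - x0                             # velocity: d x_t / d t

    v_pred = model(x_t, t, prompt)                 # conditioned on the prompt and the timestep
    return torch.mean((v_pred - v_target) ** 2)    # MSE between predicted and true velocity
```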
So we need to condition the generation on the text prompt. I'll cover later how we do this. You condition on the time step as well. And then theta are the model parameters. Okay, you know how to do flow matching. Inference is also pretty simple. So you start by sampling from this known data distribution, Gaussian noise.
And then we use an ordinary differential equation solver to go back to the data distribution, given a series of time steps. So very simply, you'll sample some noise. You'll sample a stream of time steps. At each time step, you compute the model's prediction for the velocity and use the solver to move the sample in the direction of the unknown data distribution.
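And a matching sketch of that inference loop, using a plain first-order Euler solver; the talk only says a quite simple solver is used, so this particular solver, the 250-step default (a number mentioned later in the Q&A), and the function signature are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_latent(model, prompt, latent_shape, num_steps=250):
    """Integrate the learned velocity field from noise (t = 0) to data (t = 1)
    with a first-order Euler solver."""
    x = torch.randn(latent_shape)                  # start from pure Gaussian noise
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for i in range(num_steps):
        t = ts[i].expand(latent_shape[0])          # current timestep for the whole batch
        v = model(x, t, prompt)                    # model's velocity prediction at (x, t)
        x = x + (ts[i + 1] - ts[i]) * v            # Euler step toward the data distribution
    return x                                       # latent sample; decode with the TAE decoder
```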
And at the end of that, you have your sample. Low-level detail, we use a quite simple solver. There are lots of different options you can choose. Okay, so lastly, which model architecture do we use? Now, I already said that we're using transformers. The big goal of this paper was to benefit from, like I've said about seven times already, scaling data, training compute, and model parameters with transformers.
But there is a question about which transformer to use. So, in my research organization, we train these things called Llamas. Llama is the large language model that Meta open-sources. We take the Llama 3 model, hence why I sort of pasted this L3 on it. Llama 3 is quite a classic, dense, fully connected, decoder-only language model.
So, what we did in MovieGen is you take your videos, you encode them with the TAE that we discussed earlier, you flatten the tokens, this gives you your input sequence, and we just throw it into Llama. So, very, very simple. Now, when I say Llama, I don't mean a pre-trained Llama.
I don't mean one that's been trained for text. What I mean is the architecture, so a randomly initialized architecture. But this is still very important. Training large language models at scale is incredibly difficult. Every time you change anything about the architecture, you need different hyperparameters, they scale differently. It's incredibly tricky.
So, the fact that we already know, in our research organization, how to scale this architecture, and the fact that we have the infra set up already to train these things at scale, makes a huge difference. So, that's why the simplest thing for us to do was to go with the Llama architecture.
[Audience question about whether they tried initializing from a pre-trained Llama text model.] We didn't do that for this project. But I agree, that would be a really cool thing to try. Yeah, it's not entirely clear why that would work. Obviously, like, these are very different modalities, different learning objectives. But, you know, in a lot of ways, there's a lot of sort of shared structure between these modalities that would maybe benefit from that.
Okay, so that last slide was, like, very deceptively oversimplified. There are some changes that we needed to make to Llama 3. So, importantly, Llama 3 is a model for autoregressive text generation, and we are doing text-to-video generation using flow matching. So, there are three changes that we need to make.
I'm going to go over all of them exhaustively to sort of hammer the point that we barely changed the architecture. So, the first thing that we need to do is incorporate the text conditioning. You'll have seen on the previous slide that our input sequence is just made up of video tokens.
We're doing text-to-video generation, so we need to incorporate the text conditioning somehow. And we do this using cross-attention layers. So, very simply, we construct a sequence made up of our text conditioning, and we add cross-attention layers into the transformer block. So, these go between the self-attention layers and the feedforward network.
It's a very common way of adding text conditioning for media generation models. There is a question, as well, of what should your text representation be? How should you construct this sequence? One, like, very simple thing to do would be you just tokenize the prompt, the caption, and then you feed that in.
But when you do that, you're very much burdening your model with learning this text representation from scratch. So, instead, we use pre-trained text representations. We use three, in fact. Three that are complementary to each other. The first two, all of them are pre-trained frozen text models, basically. The first two have very sort of semantic level representations.
UL2 is a large-scale encoder-decoder model. MetaCLIP is our internal CLIP model. And the third one, ByT5, has more of a sort of character-level text representation. So, we encode the text prompt using all three of these. We project them all to the model dimension, and we concatenate, and that gives us our text sequence.
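A small sketch of how that text sequence could be assembled, assuming the three frozen encoders are wrapped as callables returning (batch, length, feature_dim) tensors; the class and parameter names here are hypothetical, and only the projections are learned.

```python
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """Project several frozen text representations to the model dimension and
    concatenate them along the sequence axis to form the text sequence."""

    def __init__(self, encoders, feat_dims, d_model):
        super().__init__()
        self.encoders = encoders                   # frozen UL2 / MetaCLIP / ByT5 wrappers,
                                                   # each returning (batch, length, feat_dim)
        self.projs = nn.ModuleList(nn.Linear(d, d_model) for d in feat_dims)

    def forward(self, prompt):
        parts = [proj(enc(prompt)) for enc, proj in zip(self.encoders, self.projs)]
        return torch.cat(parts, dim=1)             # fed to the cross-attention layers
```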
The second thing we need to do, you'll remember from the learning objective, that we also need to condition on the time step. So, what we do here is we do this in adaptive layer norm blocks. So, we've already added a cross-attention block. We also add this adaptive layer norm block.
This might seem like quite a strange way of adding some conditioning. It's something that was popularized in the diffusion transformer paper, which was the first paper that used diffusion with transformers, obviously, by the name. It might seem a little bit random. It's basically very computationally cheap, and works super well.
Okay, number three. There are only three. We use full bidirectional attention. Llama is an autoregressive text model. It uses causal masking for next token prediction. For the flow matching objective, we have no such constraints. We want every video token to see every other video token. We don't care about causal masking, so we take that out.
A very low-level detail is that because of this, we use multi-head attention instead of grouped-query attention. But that's everything. Other than that, it's the Llama architecture. Okay, so we now have our full architecture diagram. I was very wary of putting this earlier because it's pretty complicated, but I think it should all make sense at this point.
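Pulling those three modifications together, here is a simplified sketch of one such transformer block: bidirectional self-attention over the video tokens, cross-attention to the text sequence, and adaptive layer norm driven by the timestep embedding. Norm placement, the adaLN parameterization, and the feedforward details are simplified assumptions rather than the exact Llama/MovieGen block.

```python
import torch
import torch.nn as nn

class MovieGenStyleBlock(nn.Module):
    """Simplified block: bidirectional self-attention over video tokens,
    cross-attention to the text sequence, and adaLN timestep conditioning."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(d_model, elementwise_affine=False)
        # Adaptive layer norm: a scale and shift per sub-layer from the timestep embedding.
        self.ada = nn.Linear(d_model, 6 * d_model)

    def forward(self, x, text, t_emb):
        # x: (B, N, d) video tokens; text: (B, M, d) text sequence; t_emb: (B, d).
        s1, b1, s2, b2, s3, b3 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.self_attn(h, h, h, need_weights=False)[0]          # no causal mask
        h = self.norm2(x) * (1 + s2) + b2
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]   # text conditioning
        h = self.norm3(x) * (1 + s3) + b3
        return x + self.ffn(h)
```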
So from left to right, during training, we take one of our training videos. We encode it with the TAE to get to our compressed latent representation. It goes through a small patchify layer. This just does some extra compression. And we flatten it, and that's how we get our input sequence.
During training, we'll construct our training sample by combining it with this Gaussian noise. During inference, this whole sequence will start off being Gaussian noise. The sequence goes through these LLAMA transformer blocks. We add in the conditioning, and then we get our output sequence. During training, we'd compute our loss and backpropagate.
During inference, we would do this iterative denoising process, and finally decode back to the RGB space. Any questions on the architecture? [Audience question about where the text prompt comes in at inference time.] So, at inference for a given text prompt, we add this conditioning into the model always. So the input sequence is just Gaussian noise, but that's not where the text information comes in.
The text information comes in through these cross-attention layers. So even though the input sequence is just noise, the model is still seeing this clean text information. We don't noise the text or anything like that. Does that make sense? [Audience question about one of the hyperparameters.] That's just a hyperparameter that we keep constant. I cannot remember off the top of my head exactly what that hyperparameter is for, actually.
I'm going to talk to you about that after. The next question is, how many of these denoising steps are typically required? Yeah. So, it's a very good point. So, during inference, you'll sample a series of time steps. Usually, with these flow-based models, the more you sample, the better.
You'll better approximate these probability paths. In practice, we use 250, I think. One of the advantages of flow matching is that the probability paths are theoretically straighter. So, you should be able to require fewer function evaluations to approximate this path. That's one of the advantages. And is it 250, like, so, is it not a prescribed number of time steps during training?
Yeah, that's continuous. Yeah. But during inference, we just sample some discrete ones. Yeah. Thank you. So, I have a question. On the two videos that you showed the difference between 2022 and 2024, was the improvement just algorithmic, or was it the hardware in terms of GPUs, or was it just a pure algorithm?
So, I think everything comes down to scale. There are quite a lot of things that have changed. There's an architecture change moving towards transformers, and then there's the scale. And that includes a whole load of things: scaling the data, scaling the amount of compute, and, in order for that to be tractable, better GPU hardware does help.
There wasn't a huge improvement in GPU hardware over those two years. It might have been one or two generations of NVIDIA's stuff. Yeah. Okay. Okay. So, now we have an architecture that we're pretty confident scales. We have a learning objective that we think should work in flow matching.
That's pretty much it for the details. It's not the entire story. But that's a lot of it. Okay. So, the last technical details are about data and the training recipe. So, data. I think in a lot of ways, this is the most important slide of the entire talk today.
Data is so important for training large language models. And, by the way, when I say large language models, I'm sort of just talking about transformers at scale for any modality. People use, like, different definitions there. But these models are incredibly data-hungry. They require internet-scale data. And they require the data to be clean.
The scaling laws depend on the data being clean. Otherwise, the scaling laws don't hold. And the model output quality depends on the data being clean. As a result of this, research groups at these big companies spend a huge amount of resources on data. This is something that I find isn't really talked about so much.
But they'll spend huge amounts of resources in terms of GPUs and also actual researchers. Often, on these research teams, the data teams massively outnumber the modeling teams. Which was something very new to me after my little PhD. So, why is this? Well, you know, remember we're training generative models.
We are learning this distribution of our training data. And then we're sampling videos from it that are likely according to that training data. So, if we want to sample the kinds of videos that I showed you on the first two slides, then all of our training data needs to look like that.
I'm mainly talking about pre-training here. You obviously have a post-training phase as well where you can align your videos to be more high-quality perhaps. But your pre-training data still needs to look great. So, this is a huge challenge. We trained the model on the order of around 100 million videos.
How do we get to this number? Well, you can predict the sort of training budget that we had for this project, and you want as many videos as you can get so that you don't have to epoch, that is, repeat, the data. So, the challenge was how to get this many videos that are high enough quality.
At the time of MovieGen, we constructed this incredibly detailed complex pipeline with a bunch of handcrafted and model-based filters. I'll just talk through a few bits of it because of the sheer amount of work that went into this. So, you start with a large pool of videos from some corpus.
They may be different lengths, long tail of concepts. We did a bunch of visual filtering on these, removing videos that are too small, scene changes, bad aesthetics. We removed a bunch of videos that have bad motion. It turns out a bunch of videos in any large corpus have really slow motion, janky motion, motion effects.
We removed all of those. We then did a content filtering step. This is first deduplication. But the really important one here is resampling. Large language models do not work well when they're trained on a very imbalanced data set in terms of concepts. So, something with a very long tail.
They work best if the concept distribution is roughly, roughly uniform. And that uniformity doesn't occur if you just take a random set of videos. So, we do this very sort of complex visual concept extraction and clustering, then upweight certain clusters and downweight certain clusters. All of this will give us a set of videos.
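As an illustration of the resampling idea (not the paper's exact recipe), here is a sketch that flattens a long-tailed concept distribution once each video has been assigned a concept cluster; the exponent is a made-up knob.

```python
import random
from collections import Counter

def resample(videos, cluster_ids, n_samples, alpha=0.5, seed=0):
    """Sample videos with probability proportional to count(cluster) ** (-alpha):
    alpha = 0 keeps the raw long-tailed distribution, alpha = 1 makes concept
    clusters roughly uniform. The exponent and the clustering are illustrative."""
    counts = Counter(cluster_ids)
    weights = [counts[c] ** (-alpha) for c in cluster_ids]
    return random.Random(seed).choices(videos, weights=weights, k=n_samples)
```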
We also need captions because we're doing text-to-video generation. And we generate these automatically using Llama 3. So, that is data. Very lastly, the training recipe. So, this multi-stage recipe here was optimized for convergence speed. We start off with a 256-pixel text-to-image (T2I) stage. Here, the model can whip through a bunch of samples in relatively few GPU hours.
We then move on to a pre-training stage with joint text-to-image and text-to-video generation where we progressively increase the resolution from 256p to 768p. At the highest resolution here, 768p, that's where we have the sequence length of 73K. And we train this on 6,000 GPUs, with a batch size of around 1,500. At this point, the model branches.
So, we have a text-to-video post-training stage. This is just SFT on, like, a very small set of very high-quality videos. And then we also branch off into these different capabilities. I'm not going to talk about those a bunch, but I'll show you some examples later. Any questions, Owen? Sure.
For, like, a rarer, longer tail of concepts, say for niche research videos, is there pre-training you can do specifically to address those types of challenges? Yeah, I think, definitely, I guess the question is about certain concepts in your pre-training data. So, I guess, yeah, in a very large corpus like we'll be training on, there is an incredibly long tail of concepts.
Pretty much every concept that you might want to generate would probably appear at some point in that data set. Okay, anything else? Okay, so, lastly, the results and applications. So, putting everything together that I just showed you, and that's all of the technical details, by the way, like, we published all of this.
We get videos like this, and like the ones that I showed at the beginning of the talk as well. And this pretty much sums up the whole point of this project. I really want to hammer home that these kinds of videos were nowhere near possible before people started scaling transformers.
And we showed here that, again, this classic architecture unification story, scaling data, model parameters, and compute for a simple transformer, ended up in a model that can reason about objects, motion, and physics just by watching videos. So, none of this was possible, and it's scaling transformers that unlocks all of this.
I have to keep on pressing replay, because I couldn't figure out how to auto-replay on Google Slides. How many PhDs does it take to auto-replay? Okay, so, another one. Here's a sloth with pink sunglasses on a donut float. Again, I think the physics of sloth-like donut flotation here are good.
This slide isn't only a joke. It does actually highlight something quite important. When we're training generative models, one important thing is how well it generalizes to concepts that may be rare in its pre-training data. We can't be sure that there are no, like, sloths on donut floats in the pre-training data, but, you know, by common sense, there probably aren't too many.
So, the fact that it can generate this is testament to its generalizing capabilities. Okay, a few other things. I showed you some of these at the beginning. We also trained a MovieGen edit model. Like I said before, you take the original video on the top left, and you can provide these precise edit instructions.
This is really magic to me. I didn't get a chance to cover this too much today, but the team trained this with paired data. It's incredibly hard to get paired data for this task. These pairs of input and outputs, they came up with, like, a fascinating self-supervised approach for this.
I'd really recommend reading the paper. Hey. Where did you guys get the data from on this project? The videos are completely licensed videos by Meta. Five minutes. And one more example here. You can take, you know, everyone wants to take a video of their penguin and put some Victorian outfits on it.
So, this is great. One other model that we trained is the personalization model. So, here, this is MovieGen video, but with the added capability that you can condition on an image of yourself. So, here, the model can generate a video that is faithful to this text prompt, but also contains the person in the conditioning image.
So, that's really fun. And another funny example here. This is my colleague that I worked on the project with. This makes me laugh. So, yeah, if anyone is interested in this, please go read the paper. A bunch of work went into this. Very lastly, we're going to see now if the theater has audio.
It doesn't. So, we trained this MovieGen audio model. Again, I didn't have time to go into this today. But this is a model that will condition on text and a video, either real or generated, and generate synchronized audio. So, this way, we can add audio to our generated videos.
It's an amazing research team that did this. So, I'd recommend going to check out the paper if you're interested. Okay. So, there are a couple of just points left. Firstly, some quantitative results. It's incredibly hard to do fair comparisons in video generation. We don't have automated metrics or anything like this.
So, what we did was we did a very extensive human evaluation study. We came up with a bunch of metrics that are somewhat orthogonal to each other that test every aspect of video generation. Things like motion quality, how well the videos follow the text prompts, visual quality. These metrics are shown on the left.
I won't go into a huge bunch of detail here, but they're very sort of fully defined in the paper. We put in a load of work to make sure that the human evaluators had sort of low standard deviation when multiple evaluators were rating the same thing. So, we compared to all of the methods that were released in the same year.
This is 2024. So, things like Runway's models, Luma, Sora at the time, and Kling. And these are net win rates you're seeing here. So, a score above zero means that our model was preferred. At the time of release, MovieGen outperformed all of the prior work. So, this is, that's great.
It's very hard to draw conclusions here. The one conclusion we can draw is that MovieGen was better than these at the time of release. As researchers, what we like to do at this point is to look into all of the technical reports of the prior work and see what they did differently and what MovieGen did differently.
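For reference, here is what that metric looks like if "net win rate" is read as the share of head-to-head ratings won minus the share lost, ties ignored; that reading is my assumption, not a definition from the talk.

```python
def net_win_rate(ratings):
    """ratings: list of 'win' / 'loss' / 'tie' outcomes for MovieGen vs. one baseline.
    Returns (wins - losses) as a percentage of all ratings; above zero means preferred."""
    wins = ratings.count("win")
    losses = ratings.count("loss")
    return 100.0 * (wins - losses) / len(ratings)
```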
Because we'd like to conclude, right, what led to these improvements? Was it more compute? Was it flow matching? Was it better data filtration? Things like that. But it's unfortunately not possible in today's age to do that because we don't see research publications. But what we do know is that all of the technical details I just presented work, and they work really, really well.
And they're a good starting point for anyone who's looking to improve text-to-video generation. And we hope the community does. Okay. The very last technical thing, scaling laws. I've talked throughout the whole talk today about architecture unification across modalities and learning objectives, and this was a really nice result that we found at the end of the project.
So what you're seeing here is a scaling law graph. How we use these when we're training large language models is often when you start training like a, you know, a GPT or a LAMA or whatever, you have a certain training budget. This is how many months you have and how many GPUs you have for that period.
But if you know that training compute budget, an open question is how big your model should be. What should be the optimal model size for that compute budget? It could be a very small model that you train for more iterations, or a larger model that you train for less.
These scaling law curves are for estimating the optimal model size for a given compute budget. So just looking at the blue crosses, we plotted a few of these data points for MovieGen. Remember, MovieGen is a text-to-video model with the Llama 3 architecture. And looking at the blue crosses, we can see this nice correlation.
It's the kind of correlation we often see in scaling laws with transformers. We then overlaid the Llama 3 scaling law. So this is the scaling law for the text model. And amazingly, we see that the Llama 3 scaling law for this text-only model serves as a reasonable predictor for model size and compute for video generation.
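For anyone wanting to reproduce this kind of plot, such a fit is typically a power law N_opt = a * C^b estimated in log-log space over measured (compute, optimal model size) points; this generic sketch is not the paper's exact fitting procedure, and the data points are left as inputs.

```python
import numpy as np

def fit_scaling_law(compute_flops, optimal_params):
    """Fit N_opt = a * C**b by least squares in log-log space, given measured
    (compute budget, compute-optimal model size) pairs from training runs."""
    log_c = np.log(np.asarray(compute_flops, dtype=float))
    log_n = np.log(np.asarray(optimal_params, dtype=float))
    b, log_a = np.polyfit(log_c, log_n, deg=1)     # slope b and intercept log(a)
    return float(np.exp(log_a)), float(b)

def predict_optimal_size(a, b, compute_budget):
    return a * compute_budget ** b                 # model size for a new training budget
```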
And this seems to hint that scaling laws for transformers are maybe modality independent, which is pretty fascinating. Okay, so we're on the last part now. What is next? MovieGen did not solve video generation. There are still lots of problems. The model will struggle with generating complex motions from complex prompts.
An example here is a dramatic scene of two cars colliding at an intersection. So it's looking pretty good at this point. And then at some point near the end, they sort of, I don't know what you'd call that, they sort of independently implode. At one point, the silver car kind of turns into two cars as well.
So this is a random generation from our model. So, you know, text-to-video generation is not solved. So my final slide is some ideas about where I think video generation is going next. Things that I'm sure we'll see some version of at some point soon. So what is next? How can we solve the issues that you saw in the previous slides?
Well, the first kind of obvious one is scaling everything more. This has been like the story of machine learning for the last six years or something. And I think this would like definitely work. MovieGen was a 30 billion parameter model. It was based on Llama 3. The largest Llama 3 model was 405B.
I think scaling everything more would definitely result in far higher quality generations. Some challenges there would be around scaling data by like an order of magnitude. Number two is reasoning. We've all seen the amazing benefits in language modeling that have come from reasoning in the last year or two.
Here, the reasoning gives the model the ability to sort of pause, think, generate a chain of thoughts, self-correct before generating an answer. I think it's very clear that video generation models would benefit from this kind of reasoning capability as well. I think that could result in a step change.
When all of us were looking at the videos on the previous slides, it was very clear, right, that something was wrong. It looked like very obvious to us. It doesn't seem too unfathomable to be able to imbue some video generation model with the capability to self-correct, maybe see that there are some errors in the video that it generates and correct it.
There are lots of really interesting research questions here, like what does it mean to generate a reasoning trace for media generation? You might have seen these like chains of thought that are generated by models like R1. I wonder what that looks like for media generation. The other issue is how to verify.
So, these latest state-of-the-art reasoning approaches are all trained with RL. RL requires verification models to verify the correctness of the outputs. It's an open research question what that means for video generation. How do you verify the correctness of a video? Very lastly, we have native generation. So, recent large language models are natively multimodal.
They can generate text, they can do image understanding, they can do video understanding. Some can even do image generation. So, it's an interesting question as to whether video generation would also benefit from being thrown into this native mix. And if so, there are interesting questions around how you would train such a thing.
I've just talked you through how flow matching seems to work the best for video generation. Is there a way that you can have multiple learning objectives? Do you need to unify the learning objectives? Things like that. So, very lastly, it was a huge team that worked on MovieGen. These are a bunch of amazing researchers.
I had so much fun on this project, learned so much. There are loads of good friends here. So, yeah, shout-outs to all of them. So, yeah, I'm going to leave it on this slide, but we can do questions now, if there are any. And a lot of these types of architectures are U-Net-based, the ConvNet kind of structures.
And that, like, so for diffusion policies for action generation, surprisingly, like, the ConvNet architecture outperforms the transformer architecture for action diffusion in robotics. And they make an argument that it has an inductive bias about smoothing, so that there's more consistency in the spatial and the time domain and all that great stuff. And, you know, what do you see for that piece?
Yeah, so I haven't read that particular paper, but it does sound like a familiar point. I guess the, you know, the whole theory around architecture unification is that we might have these specialized architecture, like CNNs, and they hold these inductive biases around visual data. Like, they prioritize these local interactions with the convolutional mask and so on.
And the going idea has been that when you're training at small scale, having these inductive biases helps. But when you're training at large scale and you have enough data, you can sort of learn all of these yourself with a transformer in a less constrained setting. We have found that scaling transformers here works better than scaling these specialized architectures, for a few reasons.
I mean, it's not even to say that I don't think CNNs could scale. It seems as though it's easier to scale transformers. It's more straightforward to know in which direction to scale. All the infrastructure already exists. But yeah, it's a very big debate. I'm not, I'm not, like, calling it.
Yeah. Hey. Can this model and approach be used for 3D generation to, like, video games and stuff like that? So, if you go back to the architecture, it's totally modality independent. So, for any, this is for videos and images. The important thing here is that we've turned a modality into a sequence of tokens.
At this point, after that, everything that's happening in the architecture is modality independent. So, really for any new kind of data, all you need is a way of turning it into a sequence of tokens. It may be more challenging to encode, like, that kind of data. But as long as you can encode it in some way to a series of tokens, you can use the exact same approach.
Yeah. We have a lot of questions online as well. So, we can, I can read a few of those right now. So, one is, so, 16 seconds seems like the longest videos you can effectively generate right now. What are the main obstacles or things keeping us from getting to longer?
For example, real movie length generations. So, the main issue is a computational one. Well, there is so many different answers to this. If you're thinking from the movie gen setup, the issue is sequence length. Given the level of compression that we had, 73K was pretty much the longest sequence length we could feasibly train at.
If you want to train on 32-second videos, that's going to double. There are multiple ways around this. You could train an encoder with far more compression. And then you could get to videos that long. There's another question about learning objective. Here, we generate everything at once. There are lots of papers out there that will iteratively generate videos along the temporal axis.
You will generate a chunk and then generate a new chunk conditioned on the previous chunk. It's kind of like next token prediction. So, I don't think there's necessarily one answer. There are lots of ways around it. And there are lots of interesting papers that generate sort of infinite length videos using these like iterative processes.
Let me see. There are some questions on, like, synthetic data and compute. So, someone says, you know, Meta is a rare example of having an abundance of training data. What if, you know, you run out of video, then what's next, I guess? I don't really know how to answer that, if we run out of video.
Look, I think there's. I think that there's a lot of interesting work to be done on improving the data filtration steps. So, you know, in this slide, we were optimizing this for high precision. So, making sure that all the videos at the end were very high quality. Recall was sacrificed throughout this.
These are like very computer vision terms. But we would have lost a lot of good data in this like very complex pipeline here. One like way forward for getting more data is moving to smarter ways of doing this process. Maybe completely language model based, for example. Yeah. I think that's one answer.
A question about compute. This is more general, I guess. How can academic researchers contribute to video generation without having, you know, access to thousands of GPUs? Yeah. Yeah. It's obviously very tough to do this level of pre-training outside of industry labs. But, you know, throughout this paper, we've used a bunch of innovations that came from academia.
For example, a lot of the flow matching work was done at universities. The main paper we take inspiration from did come from Meta. But this kind of research can be done at small scale, and then all of us can learn from it. Yeah. The pre-training is very tough. Yeah.
I think the pre-training is tough. But things like learning objectives, I think post-training schemes, we've seen a lot of, like, great work coming out of academia for that. Yeah. A few questions all related to text, I guess. Some folks, you know, are saying there's a lot of work of cleaning and processing video data.
How do you make sure that the actual text, for example, the Llama 3 generated captions, are high quality and complete? Yeah, good question. We put a lot of work into training this Llama 3 captioner, basically. So this is a video-conditioned Llama model. This went through its own sort of large-scale training in order to generate good-looking captions that were aligned with what we wanted.
But certainly, there's a lot of room for improvement there. These captions are not as good as human-written captions. There are a lot of architectural reasons for that. A lot of these, like, video-conditioned language models cannot see the whole video. If it's a 16-second video at 16 FPS, often it's far too much video for the model to be conditioned on.
So often with a lot of these open-source models, like not only Llama but Gemma and so on, you have to subsample frames. And so you're blocking the model, the language model, from seeing a lot of the video. It's going to lead to some issues, some missed things. So we do our best by training and post-training a captioning model, doing a bunch of evaluations on it.
But that's definitely something that can be improved. And there's a bunch of really cool results out there from the text-to-image community showing that if you improve captions, your image quality gets better. It's not entirely clear why, but that keeps on happening. Someone asked a related thing, which is how much of a role or importance does the text prompt encoder play?
I think there were some image generation works which showed that, like, replacing the text encoder, I think it was from CLIP to, like, T5, like, really helped improve performance. Did you guys play around with, like, several different text encoders? So this particular series of text encoders was sort of set by precedent in our team.
We took motivation from a recent, like, state-of-the-art text-to-image paper that used this series of text encodings. But, you know, it's worth pointing out here that it is quite strange what we've done here. You would think it would be, like, intuitive that you want your best possible text representation here.
All of these text representations are nowhere near state-of-the-art. They're not Llama. They're not GPT. There's been a few works and empirical findings showing that, in this setup at least, decoder-only text representations don't work as well for some reason. Some reasons for that could be, or some have hypothesized, that you need a text representation that is more aligned with the media space.
That's why a lot of people you'll see conditioning on CLIP, which is what we do. So, yeah, a bunch of cool work to do here. We didn't ablate this in this project, though. I have a follow-up question on the text part. How well would this do with, like, very detailed prompts?
Like, I want a video about this person wearing this specific color, and here's what happens later on. And then, like, a very detailed script, rather than, you know, just a video of some penguins, I guess. We definitely did observe that the model can do these sequential actions, but not always completely accurately.
Some of that might be issues in the captioning of the pre-training data. Like, are you accurately captioning all of these things happening? But, yeah, that will be one of the places where it struggles, if you were to detail, like, three or four things happening in sequence. Yeah. We have one question, which is, could you somehow hard code in some priors related to physics and real-world common sense to improve the realism and accuracy of generated videos?
I guess that might be related to that video of, like, the car splitting and all of that. Yeah. Yeah. So, in a way, it's kind of like the antithesis of... the things that we were trying here, of, like, removing all of the inductive biases and just scaling compute and data.
But I do think those are interesting things to try, simply because maybe if you are trying to learn the laws of physics better, just random videos from, like, a large pool aren't maybe the best thing to use there. And there are other cool works that have been released where they're trained, like, entirely on video game data, things like that.
Yeah, I think encoding some sort of priors would be cool. Yeah, because, you know, there are certain things about the natural world that we do know for sure. Certain, like, computer vision principles and so on. But, yeah, we didn't do that here. Do you know how to possibly encode those?
Yeah. That's, like, a super open question. Super open question. Okay, makes sense. We've got a couple others online, and then we can go back to in-person. One's more on, you know, deep fakes and malicious use of video generation models. Is there any work on things like watermarking to sort of deal with that?
There's definitely a bunch of work on watermarking coming from a few groups. DeepMind has released a few interesting papers on it. Meta has a team working on this as well. So, yeah, very important work to be done. And then we can go back to some in-person questions.
One's about GANs, actually. I've noticed some other models have a GAN discriminator before or in the decoder for temporal coherence. Is this still useful, or does scale overcome this need? They mean in the VAE decoder? I think so. Yeah, so we also have a GAN discriminator there. It's not necessarily for temporal consistency.
That might be one of the outcomes. There's sort of a historical precedent for this. When you're training these VAEs, folk found, it was the Stable Diffusion researchers, actually, while they were still at college in Germany, they published this paper called VQGAN back in 2021, I think.
And they showed that normally how you train these VAEs is that you just use L1 losses between the input and the output. But they found that the amount of compression you could do was very limited when you were doing that. So what they started doing was adding some adversarial losses from the GAN literature.
What this does is it tells the VAE that I don't have to decode exactly what was given to me. Because the GAN losses, they're not L1 losses, they're not pixel level losses. These losses are just how well the GAN can tell if it's a real image or a fake one.
So when folks started training VAEs with these losses, the outputs were given a bit more freedom. So the model could generate a range of things for a certain input. And it's not getting penalized by the loss as long as they look real. And I realize there's a bunch of detail there.
But basically, by adding these losses, we were able to get another 2x level of compression. So we are also using adversarial losses in the VAE, if that was their question. Someone added an interesting question just now, which is, how might you make sure all the videos in your training data are real?
What if some of them are fake? I guess that's related to some work nowadays in language models. But once more and more training data becomes synthetically generated, does that lead to any issues, I guess? Definitely this idea of, like, data set poisoning, it's often called, is a problem. I think there are a couple interesting answers to this.
Firstly, in our pre-training data, if it's low quality, then we don't want to train on it. And hopefully, we would find it and get rid of it. But training on generated data is not always a bad thing. Most contemporary post-training approaches for language models are based upon training on generated data.
You train on generations from the model itself. So it's not like training on generated data is always bad. If there's some really, like, bad generated videos in the pre-training data, we do want to get rid of those. But it's not always bad. We've got time for a couple more in-person questions.
If... [Audience question about the training and inference infrastructure.] In the paper, we included, like, full details of the training infrastructure. That will all be there. I don't think we included as much about the inference infrastructure, but there are lots of details on inference itself in the paper, like what I've shown here. Any other questions?
I don't know if I missed that, but for the videos you generated, do they come with audio? Yeah, so this text-to-video model that I've shown here, it generates video. But with the publication, we released a, well, where is it? Actually, no, we're not going over that again.
There's no audio in here. Yes. So there was a separate MovieGen Audio model that we trained that will add audio to a generated video. [Follow-up question:] Video and audio as two separate models, and the audio is, like, another level of complexity where there are, like, multiple channels, and if you want to train on music files, whatever.
So what is, like, your current progress in this audio generation? So, we have a really great audio research team that works on this. One very nice thing to do, I think, would be to generate everything at once, like video and audio. The two modalities are very correlated. There's a lot of shared information.
So, theoretically, both modalities should benefit from being trained together. The audio even encodes some information about videos that are not present in the video. So, the issue is one of data. It's very hard to get high-quality video data. It's even harder to get high-quality video with good audio data.
So, that's part of the reason why we didn't do that for this project. Awesome. Thanks so much, Andrew, for the very insightful talk and answering all our questions. Thanks. Thanks. Thank you.