NVIDIA Cosmos: World Foundation Model Platform for Physical AI - w/ Ethan He

00:00:00.000 |
the 75 pages of the report. I can't cover everything in one hour; I could talk about it for 00:00:07.040 |
hours. So I'll just cover what I focus on: data scaling and model scaling. First, I'll do 00:00:15.520 |
an introduction of Cosmos for people who are not familiar with it. I guess the introduction is best 00:00:22.720 |
left to Jensen himself. It includes autoregressive world foundation models, diffusion-based world 00:00:33.120 |
foundation models, advanced tokenizers, and an NVIDIA CUDA- and AI-accelerated data pipeline. 00:00:40.400 |
Cosmos models ingest text, image, or video prompts and generate virtual world states as videos. 00:00:50.400 |
Cosmos generations prioritize the unique requirements of AV and robotics use cases, 00:00:55.600 |
like real world environments, lighting, and object permanence. 00:01:00.400 |
Developers use NVIDIA Omniverse to build physics-based geospatially accurate scenarios, 00:01:07.760 |
then output Omniverse renders into Cosmos, which generates photoreal physically-based synthetic 00:01:14.720 |
data. Whether diverse objects or environments, conditions like weather or time of day, 00:01:36.400 |
or edge case scenarios, developers use Cosmos to generate worlds for reinforcement learning 00:01:44.320 |
AI feedback to improve policy models or to test and validate model performance. 00:01:51.040 |
Even across multi-sensor views, Cosmos can generate tokens in real time, 00:01:59.920 |
bringing the power of foresight and multiverse simulation to AI models, 00:02:05.440 |
generating every possible future to help the model select the right path. 00:02:09.680 |
Working with the world's developer ecosystem, 00:02:14.400 |
NVIDIA is helping advance the next wave of physical AI. 00:02:18.160 |
Okay, so what's a world model? A world model takes past observations 00:02:31.840 |
and a perturbation, and from these it can predict future observations. 00:02:39.200 |
The perturbation can take many forms: it can be actions from the physical AI, 00:02:46.320 |
some random perturbation, or a text description of the perturbation. 00:02:54.160 |
So, in Cosmos 1.0, we open-sourced a family of models. We have two sets of 00:03:02.960 |
world foundation models. One is based on diffusion, while the other is based on 00:03:08.240 |
autoregressive models. For each family, we also built two base models and two derivatives. 00:03:15.040 |
To achieve the best generation quality, we also built a prompt upsampler for the diffusion model, 00:03:21.600 |
and also a diffusion decoder to improve the video generated from the autoregressive model. 00:03:27.200 |
So, these are already open-sourced on GitHub. Feel free to try them. 00:03:32.560 |
So, for the diffusion world model, this is the architecture overview of it. 00:03:41.040 |
So, the input video goes through a video tokenizer; here it's called CV 8x8x8. 00:03:51.360 |
Basically, the temporal and spatial dimensions are both compressed by 8: if you have 8 frames, they go into 00:04:04.080 |
one latent frame. I assume everyone knows diffusion: the tokens are corrupted with noise, then go through a 00:04:12.320 |
diffusion transformer, and the model learns to reconstruct the video during training. 00:04:22.160 |
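To make the 8x8x8 compression and the denoising objective concrete, here is a minimal PyTorch sketch. The shapes, latent channel count, noise range, and the stand-in denoiser module are assumptions for illustration, not the actual Cosmos implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration only.
B, C, T, H, W = 1, 3, 8, 256, 256
video = torch.randn(B, C, T, H, W)                     # 8 RGB frames: what goes into the tokenizer

# A CV 8x8x8 tokenizer compresses time, height, and width each by 8x into
# continuous latents; 16 latent channels is an assumed value.
latents = torch.randn(B, 16, T // 8, H // 8, W // 8)   # stand-in tokenizer output: (1, 16, 1, 32, 32)

# Diffusion training step (sketch): corrupt the latents with noise at a random
# level, have a denoiser predict the clean latents, and regress with MSE.
denoiser = torch.nn.Conv3d(16, 16, kernel_size=1)      # stand-in for the diffusion transformer
sigma = torch.rand(B, 1, 1, 1, 1) * 80.0               # assumed noise-level range
noisy = latents + sigma * torch.randn_like(latents)
loss = F.mse_loss(denoiser(noisy), latents)
loss.backward()
```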
This is an example video generated from the diffusion world model. 00:04:28.880 |
For the autoregressive world model, it goes through a similar process, but the tokenizer 00:04:39.280 |
is discrete instead of continuous. The discrete tokenizer is very similar to what 00:04:47.840 |
LLMs use: it converts video patches into entries of a vocabulary; here it's a 00:05:02.560 |
64k vocabulary. These discrete tokens are fed into a transformer with a similar architecture to LLMs. 00:05:15.280 |
Then, discrete tokens are generated, and there's a decoder, a discrete decoder, that 00:05:24.320 |
decodes these tokens into videos. There has been debate on whether diffusion or autoregressive 00:05:34.160 |
models are better; since we don't know, we built both of them. 00:05:43.280 |
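As a rough illustration of the LLM-style objective on discrete video tokens, here is a minimal next-token-prediction sketch. The 64K vocabulary matches the talk; the sequence length, model width, and the stand-in transformer layer are assumptions.

```python
import torch
import torch.nn.functional as F

VOCAB = 64_000
tokens = torch.randint(0, VOCAB, (2, 1024))            # [batch, flattened video token ids]

embed = torch.nn.Embedding(VOCAB, 512)
backbone = torch.nn.TransformerEncoderLayer(512, 8, batch_first=True)  # stand-in for the LLM-style stack
head = torch.nn.Linear(512, VOCAB)

# Causal mask so each position only attends to earlier tokens.
causal = torch.nn.Transformer.generate_square_subsequent_mask(tokens.shape[1] - 1)
hidden = backbone(embed(tokens[:, :-1]), src_mask=causal)
logits = head(hidden)                                   # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()
```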
For example, here is an input image for the autoregressive model. You can 00:05:51.520 |
use it as the prefill for the transformer; then, in the decoding process, the subsequent video tokens are generated autoregressively. 00:06:06.480 |
If you want better quality in the generated result, you can go with the diffusion 00:06:16.160 |
model. If you want the model to be faster, you can try the autoregressive model. 00:06:23.520 |
Autoregressive models also play very well with other modalities: you can easily combine other tokens 00:06:33.520 |
like text tokens or action tokens. But here, our autoregressive model is trained purely on videos. 00:06:41.760 |
We also released post-training scripts for these models. In the Cosmos paper, 00:06:55.360 |
we discuss several post-training examples of the Cosmos foundation models for different physical 00:07:03.600 |
AI tasks. Right now, in the GitHub, we support general post-training. This fine-tunes the 00:07:11.440 |
world models to generate a target distribution of videos based on a custom dataset. 00:07:17.200 |
The target distribution could include a specific camera spec or a specific domain such as a factory. 00:07:26.000 |
Here is an example. We took a few videos of a humanoid robot, 00:07:37.840 |
just roughly five videos of this humanoid. The videos are in simulation. 00:07:49.280 |
After fine-tuning the diffusion model, you can generate novel videos of this 00:07:57.040 |
robot doing something else. The model is able to remember the characteristics of this robot 00:08:04.880 |
while generating novel tasks which are not possible in either simulation or 00:08:14.000 |
in the real world through teleop. There are more post-training scripts coming soon. 00:08:26.960 |
For example, instruction control: post-training the models for robotic 00:08:34.320 |
manipulation to produce a video based on a textual instruction. You can instruct the robots 00:08:41.760 |
to perform tasks like folding clothes or picking up objects. Also, action control: 00:08:49.440 |
the post-trained models can predict both the next video frame and the next action. 00:08:59.040 |
Here, the example shows camera control. Adding the camera pose as a condition, 00:09:07.840 |
you can generate 3D-consistent video simulation from a single image or video. 00:09:13.440 |
This can enable navigation in virtual environments. You can also do 00:09:24.000 |
multi-view generation, especially for autonomous driving: you can generate synchronized multi-view 00:09:36.160 |
videos from text prompts and then simulate driving scenarios with multiple camera perspectives. 00:09:43.840 |
Next, I'll dive into technical details. First, I'll go over data scaling, then model scaling. 00:10:02.560 |
So, we open-sourced a training framework. The data curation part is coming soon; you can sign up for it. 00:10:15.360 |
The training framework is already open-sourced. When we curate data for text, 00:10:31.680 |
we can just grab text online and the label is basically next-token prediction, 00:10:39.840 |
which is relatively straightforward and cheap to curate. However, for videos, 00:10:47.680 |
for example, you have a video shot of someone playing basketball. You need to label a basketball 00:10:55.600 |
player as dribbling the ball and shooting it into the hoop. Labeling video data requires 00:11:02.400 |
good AI models for automatic captioning. We want to control what the captioning models generate 00:11:11.040 |
using prompts we specify. Also, another challenge is that video signals are less refined compared to 00:11:21.680 |
text. Maybe out of like an hour of videos, there might only be a second of interesting stuff. 00:11:28.560 |
This is very computationally challenging and very expensive. We use distributed computing 00:11:36.800 |
to solve this problem. This is the life cycle of curation. So, on top of the DGX Cloud platform, 00:11:47.760 |
we run a Ray-based streaming pipeline on thousands of GPUs. 00:11:54.400 |
The long videos go into the pipeline, are split, and then transcoded into 00:12:04.720 |
shorter clips. Then, different AI models run on the short clips to detect high-quality 00:12:14.800 |
videos. An NVIDIA VLM captioning model running on TensorRT-LLM is used to caption the videos. 00:12:33.840 |
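As a rough picture of that clip-level flow, here is a toy Python sketch. Every helper below is a trivial stand-in invented for illustration; the real pipeline uses GPU-accelerated shot detection, transcoding, filtering models, and a TensorRT-LLM-served VLM captioner.

```python
import random

def split_on_scene_changes(video):     # stand-in: pretend every long video yields 3 clips
    return [f"{video}#clip{i}" for i in range(3)]

def motion_score(clip):                # stand-in for a motion-filtering model
    return random.random()

def aesthetic_score(clip):             # stand-in for an aesthetic classifier
    return random.random()

def vlm_caption(clip):                 # stand-in for the VLM captioning model
    return f"caption for {clip}"

def curate(long_video):
    clips = split_on_scene_changes(long_video)                      # split + transcode
    kept = [c for c in clips
            if motion_score(c) > 0.2 and aesthetic_score(c) > 0.5]  # quality filtering
    return [{"clip": c, "caption": vlm_caption(c)} for c in kept]   # captioning

print(curate("warehouse_walkthrough.mp4"))
```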
Data curation for video foundation models is very challenging. The scale of the video data 00:12:44.480 |
is hundreds of petabytes, much bigger than for previous image models. Orchestration at scale, 00:12:52.720 |
with the heterogeneous compute requirements of tens of AI models running efficiently together, 00:13:00.640 |
is also very challenging. You have the captioning model, you have models to detect scene changes, 00:13:08.720 |
you have models to detect video consistency, aesthetics, etc. Multiple concurrent streams of 00:13:18.480 |
high-throughput data exchange between AI models also impose bandwidth challenges on the cluster. 00:13:26.320 |
Every single step of the curation pipeline needs to be GPU-accelerated. 00:13:36.320 |
We also need to manage the resiliency of the GPU-based data pipeline at scale. 00:13:42.320 |
So, each inference model needs to run at the speed of light. We go from the baseline, 00:13:57.360 |
where the model runs in PyTorch, to using TensorRT-LLM to accelerate it. 00:14:05.600 |
Then we run it at a larger batch size to accelerate it further. And today, we use FP8 quantization 00:14:14.480 |
to accelerate it further, to 7x compared to the baseline. 00:14:19.200 |
So, video understanding: filtering for high-quality clips and auto-labeling is not 00:14:30.240 |
enough for building a video foundation model. We need to understand a lot more about the videos 00:14:37.920 |
for domain-specific training. We remove duplicated content 00:14:43.920 |
and build visual search and understanding of the data. 00:14:47.840 |
So, this is the next part of the life cycle of video data curation. 00:14:57.680 |
After captioning, we need to do clustering to group the data into different categories. 00:15:11.920 |
Then there is semantic deduplication to remove redundant data. 00:15:24.560 |
Finally, a video taxonomy further helps researchers pick the data they want to train on. 00:15:32.720 |
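Here is a minimal sketch of the semantic deduplication idea, assuming each clip already has an embedding from some video or text encoder. The greedy loop and the 0.95 similarity threshold are assumptions for illustration; at this scale the real pipeline would cluster first and deduplicate within clusters.

```python
import torch

def semantic_dedup(embeddings: torch.Tensor, threshold: float = 0.95):
    emb = torch.nn.functional.normalize(embeddings, dim=-1)
    sims = emb @ emb.T                     # pairwise cosine similarity
    keep = []
    for i in range(emb.shape[0]):
        # keep clip i only if it is not too similar to a clip we already kept
        if all(sims[i, j] < threshold for j in keep):
            keep.append(i)
    return keep

kept = semantic_dedup(torch.randn(1000, 256))   # 1000 clip embeddings of dimension 256
print(f"kept {len(kept)} of 1000 clips")
```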
The takeaway for the video data curation is we build the video processing capabilities 00:15:51.680 |
into Nemo Curator to enable the developers to curate high-quality data and train 00:15:57.680 |
highly accurate video foundation models. By leveraging end-to-end GPU acceleration 00:16:05.280 |
and optimizing the data orchestration through the pipeline, 00:16:08.720 |
Nemo Curator can scale to over 100 petabytes of data. 00:16:14.800 |
These optimizations reduce the processing time and lower the total cost of ownership. 00:16:20.800 |
The models are optimized for high throughput, enhancing overall pipeline efficiency. 00:16:39.680 |
So, using the NeMo video foundation model training framework, you can scale these video models 00:16:48.960 |
up to 20 times larger than with traditional frameworks. The framework is capable of training 00:16:57.440 |
diffusion or autoregressive foundation models of up to 100 billion parameters. 00:17:06.480 |
The throughput is highly optimized. We achieve roughly 450 00:17:11.600 |
teraflops per GPU; that's close to 50% MFU on H100 chips. 00:17:23.680 |
That is very close to the training efficiency of LLM training. 00:17:37.520 |
To recap data curation: we have hundreds of petabytes of data going into the curation pipeline. 00:17:45.600 |
After curation, the dataset we get consists of short video clips and images with text embeddings. 00:17:58.720 |
Even though the scale of the data is much smaller, these are still 00:18:03.360 |
considered relatively big if we want to train on the clusters today. 00:18:09.280 |
For example, the images are on the order of 1 billion, and the videos are on the order of 100 million 00:18:21.760 |
clips. In the paper, we use an image and video tokenizer compression rate of 8x8x8. 00:18:34.720 |
At this scale, each image is compressed to roughly 200 kilobytes. For 1 billion images, 00:18:47.120 |
that's roughly at the level of 100 terabytes. For the videos, it's at the petabyte level. 00:18:55.600 |
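A quick back-of-the-envelope version of that arithmetic; the per-sample latent sizes are rough figures (the ~200 KB per image is from the talk, the ~10 MB per clip is an assumption), so the totals are order-of-magnitude only.

```python
n_images, n_clips = 1_000_000_000, 100_000_000
bytes_per_image_latent = 200e3   # ~200 KB of tokenized latents per image (from the talk)
bytes_per_clip_latent = 10e6     # assumed ~10 MB of tokenized latents per short clip

print(f"images: ~{n_images * bytes_per_image_latent / 1e12:.0f} TB")  # ~200 TB, the hundred-terabyte level
print(f"videos: ~{n_clips * bytes_per_clip_latent / 1e15:.0f} PB")    # ~1 PB, the petabyte level
```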
A solution can be storing this data on the cluster, or storing it in cloud storage like S3. 00:19:11.520 |
Storing it on the cluster has huge costs; most clusters don't have that much 00:19:19.440 |
storage themselves. We provide both solutions in the open-source framework. 00:19:34.240 |
We leverage Megatron Energon, which is another open source library from 00:19:39.920 |
NVIDIA, to load data efficiently. It allows you to load data from web sources like 00:19:49.120 |
AWS S3 very efficiently without the GPU idling during training. It also allows you to deterministically 00:20:01.200 |
save and restore the data loader, which is one of the biggest challenges when loading from a 00:20:08.320 |
WebDataset. In a WebDataset, the data is usually loaded sequentially. When your training is 00:20:16.880 |
interrupted, with the traditional way of training you have to restart loading from a random point; you won't be able to 00:20:28.640 |
resume loading non-repetitive data without Megatron Energon. 00:20:32.960 |
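To illustrate the idea behind deterministic save and restore of a streaming loader, here is a toy sketch: persist enough state (shard index, offset within the shard, RNG state) to resume exactly where training stopped. This is only a conceptual illustration, not the Megatron Energon API.

```python
import torch

class ResumableStream:
    def __init__(self, shards, seed=0):
        self.shards, self.shard_idx, self.offset = shards, 0, 0
        self.rng = torch.Generator().manual_seed(seed)   # e.g. for shuffling (unused in this toy)

    def state_dict(self):
        return {"shard_idx": self.shard_idx, "offset": self.offset, "rng": self.rng.get_state()}

    def load_state_dict(self, state):
        self.shard_idx, self.offset = state["shard_idx"], state["offset"]
        self.rng.set_state(state["rng"])

    def __iter__(self):
        while self.shard_idx < len(self.shards):
            shard = self.shards[self.shard_idx]
            while self.offset < len(shard):
                sample = shard[self.offset]
                self.offset += 1
                yield sample
            self.shard_idx, self.offset = self.shard_idx + 1, 0

shards = [[f"shard{j}_sample{i}" for i in range(4)] for j in range(3)]
stream = ResumableStream(shards)
it = iter(stream)
_ = [next(it) for _ in range(5)]     # training consumes a few samples, then is interrupted
state = stream.state_dict()          # checkpointed alongside the model
stream.load_state_dict(state)        # on restart: resume exactly where we left off, no repeats
```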
Another challenge in loading the data is variable input data shapes. 00:20:46.560 |
The data types are different: you have images and video, and you also have different durations of 00:20:55.440 |
the videos, say one second, 10 seconds, or even 50 seconds. The resolutions are different, 00:21:04.640 |
so 60p, 720p, 1080p. There are also different aspect ratios, 16 by 9 or 9 by 16. 00:21:12.240 |
When you're training on text, you don't have this kind of problem. In video, 00:21:24.640 |
this can cause a very big efficiency problem when we batch the data. 00:21:29.840 |
The traditional approach is bucketed batching. For each different shape of the 00:21:42.960 |
input, for example images, we batch several samples together. For the videos, 00:21:53.760 |
for very large videos, you may just take one video per batch. For medium-sized ones, 00:22:00.240 |
maybe you can batch two or four into one batch. The pro is that this is commonly used for most 00:22:11.200 |
models nowadays. For example, in ImageNet training, traditionally people just resize all of 00:22:20.640 |
the images to 512 by 512 to mitigate this problem. But the challenge here is that 00:22:29.760 |
if you want to train on different aspect ratios and different resolutions, you need complicated 00:22:38.720 |
data loading logic to ensure that during training, within each iteration, the data shapes are the 00:22:46.240 |
same. And the efficiency is not very high, because not all of the data shapes can be efficiently 00:22:54.320 |
utilized by the GPU. Also, constantly changing the shape of the input data causes challenges for 00:23:02.800 |
the fused kernels. On the GPU, if all of your tensor operation shapes are the same across 00:23:13.040 |
iterations, we can optimize for that and it runs more efficiently than with dynamic shapes. 00:23:20.240 |
The data loading scheme we open-source is called sequence packing. 00:23:34.640 |
Different from the traditional SBHD format, it allows you to mix images, video, 00:23:45.680 |
multimodal data, whatever, with different aspect ratios, durations, and resolutions. The key is to 00:23:56.800 |
reshape all of the data into one-dimensional sequences and then pack them together into 00:24:04.400 |
one batch. When you pass this into the transformer, outside of self-attention, 00:24:12.720 |
there's no problem at all: the MLP operation of the transformer is just a per-token operation. 00:24:21.760 |
But for self-attention, we need to create a block-diagonal mask so that each of 00:24:30.640 |
the samples in the packed sequence computes self-attention only over itself, as sketched below. This 00:24:38.240 |
operation is done automatically in the fused CUDA kernel. You only need to supply the 00:24:49.360 |
sequence lengths in our training code, and that's all you need to enable packed-sequence training. 00:24:55.840 |
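Here is a minimal sketch of that block-diagonal masking with toy sizes. In the real training code, a fused variable-length attention kernel takes cumulative sequence lengths rather than materializing the mask; this version just makes the masking explicit.

```python
import torch
import torch.nn.functional as F

lengths = [7, 12, 5]                     # token counts of three packed samples (toy values)
total, heads, dim = sum(lengths), 4, 64

x = torch.randn(1, heads, total, dim)    # packed queries/keys/values (one shared tensor for brevity)
sample_id = torch.repeat_interleave(torch.arange(len(lengths)), torch.tensor(lengths))
block_mask = sample_id[:, None] == sample_id[None, :]   # [total, total] block-diagonal mask

# Attention stays within each packed sample; MLPs need no mask since they are per-token.
out = F.scaled_dot_product_attention(x, x, x, attn_mask=block_mask)
print(out.shape)                         # torch.Size([1, 4, 24, 64])
```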
With this data loading scheme, the training efficiency is very high. You can see that 00:25:05.920 |
at the end there is some padding, but if the max sequence length is large enough, the padding is very 00:25:14.480 |
small, and the training efficiency is very close to when you have all of the samples with the exact 00:25:22.800 |
same shape. Next, I'm covering parallelism. One of the biggest challenges of training on videos 00:25:40.560 |
is the context length. Traditionally, in pre-training LLMs, the context length is really 00:25:49.360 |
like 4K; nowadays it's 8K on Llama. But when training on videos, the context length is much larger. 00:25:57.600 |
Say we have five seconds of video: encoding it with an 8x8x8 tokenizer, it goes into roughly 00:26:10.320 |
60K or 70K tokens. This is 10 times larger than the sequence length of LLMs, as the rough calculation below shows. 00:26:19.360 |
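For reference, here is one way that number comes out, assuming a 1280x704 clip at 24 fps and a 2x2 spatial patchification on top of the 8x8x8 tokenizer; the exact resolution, frame rate, and patch size determine the final count.

```python
import math

frames, height, width = 5 * 24, 704, 1280     # assumed 5-second clip at 24 fps, 1280x704
latent_t = math.ceil(frames / 8)              # temporal compression 8x
latent_h, latent_w = height // 8, width // 8  # spatial compression 8x
tokens = latent_t * (latent_h // 2) * (latent_w // 2)   # assumed 2x2 patchification in the transformer
print(tokens)                                  # 52800, i.e. on the order of 60K tokens
```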
Context parallelism, or ring attention, is one of the key techniques we use to scale 00:26:28.320 |
the diffusion transformer and the autoregressive world model up to 1 million tokens. 00:26:38.160 |
Using context parallelism, you can partition the activations of the entire transformer along the 00:26:44.480 |
sequence dimension. This exploits the permutation invariance of attention to distribute the sequence 00:26:52.320 |
in a ring topology, as sketched below. 00:27:04.240 |
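As a single-process illustration of that idea (not the Megatron implementation): each "rank" below holds one chunk of queries, the key/value blocks arrive one at a time as they would around the ring, and partial results are combined with an online softmax so the final output matches full attention.

```python
import torch

def ring_attention_sim(q, k, v, chunk):
    """Simulate ring attention for non-causal full attention; q, k, v are [seq, dim]."""
    outs = []
    for qs in range(0, q.shape[0], chunk):
        qi = q[qs:qs + chunk]                               # query chunk held by this "rank"
        m = torch.full((qi.shape[0], 1), float("-inf"))     # running max of scores
        den = torch.zeros(qi.shape[0], 1)                   # running softmax denominator
        acc = torch.zeros_like(qi)                          # running weighted sum of values
        for ks in range(0, k.shape[0], chunk):              # KV blocks arriving around the ring
            kj, vj = k[ks:ks + chunk], v[ks:ks + chunk]
            s = qi @ kj.T / qi.shape[-1] ** 0.5
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            scale = torch.exp(m - m_new)                    # rescale previous partial results
            p = torch.exp(s - m_new)
            den = den * scale + p.sum(dim=-1, keepdim=True)
            acc = acc * scale + p @ vj
            m = m_new
        outs.append(acc / den)
    return torch.cat(outs)

q, k, v = (torch.randn(64, 32) for _ in range(3))
reference = torch.softmax(q @ k.T / 32 ** 0.5, dim=-1) @ v
print(torch.allclose(ring_attention_sim(q, k, v, chunk=16), reference, atol=1e-5))  # True
```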
Hey, quick question, Ethan. I know for some LLM models, like even the Llama models 00:27:09.440 |
that are trained up to 128K context, something that they do is the bulk of the training, 00:27:14.480 |
like the majority of the five trillion tokens, is done at a smaller context length. Then, in post-training, they continually train on longer context. Is that a thing in video 00:27:22.640 |
gen? Can you train the majority of the model at a short clip length and then extend this and 00:27:30.160 |
extrapolate it out? Yes, that's a good question. I think the bottleneck here is we don't have a 00:27:41.120 |
very efficient video compressor. Even a five second video is like 60K tokens. 00:27:53.280 |
If we say we train on shorter videos like one second, that also works. But for the majority 00:28:00.160 |
of the training, the video foundation models, they are 10 times longer context compared to 00:28:08.160 |
the LLMs. For post-training, the video models are extended to even longer context, 00:28:19.120 |
say like one million tokens, to be able to generate a video roughly like one minute. 00:28:26.000 |
That makes sense. It's basically the same problem, it's just a 10x scale on both sides, 00:28:32.880 |
so even the short context is still there. Yes. Thank you. I'd say if we have a very good 00:28:40.320 |
tokenizer in the future that can efficiently reconstruct the videos, maybe the paradigm 00:28:47.120 |
changes. Right now, the video tokenizers Cosmos releases are 8x8x8 or 8x16x16. 00:28:58.320 |
Spatial-wise, 16x16 is already near the limit. If you go beyond that, a lot of the reconstruction 00:29:11.040 |
artifacts will appear. Makes sense. Thank you. For video generation and inference, 00:29:22.400 |
we also employ context parallel. In the open source repository, you can already use context 00:29:30.880 |
parallel to accelerate the inference. For example, on 8 GPUs, using context parallel 8, you can 00:29:38.880 |
generate a 5-second video in under 30 seconds. Using more GPUs across different nodes, you can generate a 00:29:49.520 |
video in a matter of seconds. Another challenge brought by the diffusion transformer concerns 00:30:05.360 |
pipeline parallelism. Traditionally, in LLMs, between pipeline stages you only need to 00:30:13.440 |
pass the hidden states to the next pipeline stage. But diffusion transformers have a lot of 00:30:20.160 |
conditioning, with adaptive layer norm (AdaLN) and also conditioning on text, which creates difficulty for 00:30:30.000 |
pipeline parallelism. So we provide a solution that regenerates the additional conditionings 00:30:39.920 |
on each pipeline-parallel rank, as sketched below. This costs slightly more compute, 00:30:45.840 |
but reduces the communication cost a lot, which leads to improved performance. 00:30:58.000 |
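A conceptual sketch of that solution, with assumed module names and dimensions: every pipeline stage keeps its own copy of the small conditioning networks and recomputes the embeddings from the raw timestep and text inputs, so only hidden states need to cross stage boundaries.

```python
import torch

class ConditioningRecompute(torch.nn.Module):
    def __init__(self, dim=1024, text_dim=512):
        super().__init__()
        self.t_mlp = torch.nn.Sequential(torch.nn.Linear(1, dim), torch.nn.SiLU(),
                                         torch.nn.Linear(dim, dim))
        self.text_proj = torch.nn.Linear(text_dim, dim)

    def forward(self, t, text_pooled):
        # Each pipeline rank holds a copy of these small modules and runs this forward
        # itself, instead of receiving the (large) conditioning tensors from the previous stage.
        return self.t_mlp(t[:, None]) + self.text_proj(text_pooled)

cond = ConditioningRecompute()
c = cond(torch.rand(4), torch.randn(4, 512))   # recomputed identically on every rank
print(c.shape)                                  # torch.Size([4, 1024])
```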
Okay. I think that's all of my presentation. Thank you for listening. Any questions? 00:31:04.800 |
Hi, Ethan. Thanks a lot for joining us again. This is RJ. I asked a question at the beginning 00:31:21.760 |
of the chat. I'm a little unclear about how the encoder gets, like, the encoder to the 8x8 00:31:33.760 |
latent space gets trained. Is that just part of the diffusion training, or is there something, 00:31:42.560 |
like, some sort of, like, a separate step that is used to train that encoder? 00:31:51.600 |
Yeah, that's a good question. So a separate step is used to train the encoder. 00:31:55.920 |
Tokenizer is a fancy name for it, but it is essentially a (vector-quantized) variational autoencoder. 00:32:09.520 |
Yeah, you would basically train it for the task of reconstructing the videos. 00:32:18.480 |
Okay, right. So, but how do you get it to create a 3D latent space? What's the 00:32:25.440 |
TL;DR on how to get it to create a 3D latent space like that? 00:32:28.720 |
Yeah, so the model architecture itself is a causal convolutional neural network. The 00:32:41.040 |
encoder and decoder structure reconstructs the video, so the training objective 00:32:49.520 |
is basically reconstructing the video. The process is: you need to collect a 00:32:55.280 |
diverse set of different videos, ideally in your domain, and then train this causal CNN 00:33:04.160 |
to reconstruct those videos. The codebook here, for continuous tokens, is just those continuous 00:33:12.960 |
tokens, but for discrete tokenizers, you do vector quantization to quantize into the 64K codebook. 00:33:22.480 |
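A toy sketch of that reconstruction objective follows. The channel counts, kernel sizes, and stride are assumptions for illustration, and true temporal causality (left-only padding) is omitted for brevity; the real Cosmos tokenizer is much deeper.

```python
import torch
import torch.nn.functional as F

class TinyVideoAutoencoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = torch.nn.Conv3d(3, 16, kernel_size=(2, 8, 8), stride=(2, 8, 8))
        self.dec = torch.nn.ConvTranspose3d(16, 3, kernel_size=(2, 8, 8), stride=(2, 8, 8))

    def forward(self, video):
        # A real causal tokenizer would pad only to the left in time so each latent
        # frame depends on past frames; the non-overlapping stride keeps this toy simple.
        return self.dec(self.enc(video))

model = TinyVideoAutoencoder()
video = torch.randn(2, 3, 8, 64, 64)            # [batch, rgb, frames, height, width]
loss = F.mse_loss(model(video), video)          # train by reconstructing the input video
loss.backward()
```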
Okay. Is it – sorry, I didn't have time to pre-read the paper. Is this covered in the paper, 00:33:29.120 |
or is there a separate paper for this? Yeah, this is covered. 00:33:32.880 |
Okay, got it, got it. Thank you. This is really super interesting, exciting work. Thank you very 00:33:39.280 |
much for joining us. So additionally, the tokenizer is 00:33:44.560 |
frozen during the training of the transformer, because if you don't freeze the tokenizer, 00:33:53.440 |
it can lead to catastrophic forgetting. 00:34:11.360 |
Sorry, I have also a question. I didn't find any reaction button that I can 00:34:19.040 |
raise my hand. Can I ask the question right now? Yeah. 00:34:22.480 |
Okay, perfect. So my question is about the open-source framework for pre-training that you mentioned. 00:34:31.600 |
Yes. So do you think, potentially, if I have a set of videos, but those videos, originally, 00:34:40.240 |
they were not necessarily in the RGB space, okay? So I don't know, for example, satellites, 00:34:46.000 |
or anything, a spectral wavelength, or whatever. And I just somehow mapped them to videos. Do you 00:34:51.600 |
think I can still customize your framework and just pre-train my own tokenizer, or basically 00:35:00.800 |
whatever else that exists in that framework? Yeah, if your data domain is different from 00:35:11.200 |
video, it's recommended to fine-tune the tokenizers. So just fine-tuning, do you think 00:35:18.560 |
that's going to work? Well, because if the tokenizer is not fine-tuned, it might produce some artifacts 00:35:30.560 |
for your data if your data domain is different. Sorry, yeah, go ahead. 00:35:39.600 |
After fine-tuning the tokenizer, you might also want to fine-tune the diffusion transformer or 00:35:46.720 |
autoregressive transformer. Yeah, both of these are supported in the framework. 00:35:51.200 |
Awesome. And, you know, I can also pre-train the tokenizer using the current framework. 00:36:02.080 |
Thanks for the presentation. I had a quick question related to some of the, 00:36:10.720 |
well, you mentioned it's coming soon, for multi-view generation and more camera control. 00:36:15.200 |
So, curious if you could speak any more towards how you're approaching multi-view, 00:36:21.040 |
or how to make sure that the camera intrinsics correlate between one another, 00:36:27.120 |
you know, if they're all video-based generation versus having a true, like, 00:36:31.120 |
grounded scene understanding, how you guys are approaching that. 00:36:34.960 |
Yes, that's a good question. So, these are coming soon, but the techniques are covered in the paper. 00:36:44.000 |
For example, for multi-view generation, the different views are folded 00:36:50.800 |
into one of the dimensions in the data, so the model input is still roughly the same. 00:36:58.960 |
In fact, they are folded into the time axis. 00:37:04.000 |
And the camera intrinsics are not used now, because 00:37:13.440 |
if you have consistent intrinsics, you don't have this problem, but 00:37:19.040 |
if your intrinsics change across different training data, I guess it's helpful to 00:37:27.280 |
include that in the conditioning information. At least in this example, we use consistent intrinsics. 00:37:42.640 |
Yeah, so you're saying it has more to do with, perhaps, more the training data that you're using 00:37:47.600 |
to post-train these models, to have it be consistent and 00:37:51.120 |
have similar intrinsics? Is that sort of what you're saying? 00:38:08.400 |
I can answer questions in the chat; yeah, I wasn't looking at it. 00:38:15.200 |
So, what does a token represent in this case, one pixel of video? 00:38:25.360 |
The tokens are a patch of video. Say, for an image, an 8x8 patch becomes one token. 00:38:44.560 |
That means, roughly, for one second of video, if it's 30 frames, 00:38:50.560 |
in the time domain you have about four tokens. 00:38:59.040 |
And spatially, that depends on your resolution. 00:39:07.840 |
Yeah, so the video doesn't have a depth map, but it could be added as a condition. 00:39:16.000 |
What's the difference between post-training and fine-tuning? 00:39:24.400 |
I'd say post-training is a fancier word for fine-tuning. Sometimes 00:39:33.040 |
fine-tuning refers specifically to certain techniques, as opposed to, like, 00:39:40.320 |
just continued pre-training. I would say these two words are used interchangeably. 00:39:45.680 |
Oh, the number of tokens each of these foundation models is trained on. For pre-training, 00:40:02.080 |
it's at the level of a hundred million video clips, 00:40:10.080 |
and each video clip is roughly five seconds. 00:40:15.120 |
Using that information, you can calculate the number of tokens: 00:40:24.160 |
at least 10 trillion tokens or more. You can calculate it for yourself. 00:40:46.480 |
Yeah, what type of hardware is adequate for post-training on our own data? So, 00:40:55.520 |
the open-source post-training now needs, like, eight H100s for 00:41:01.840 |
diffusion and two H100s for the autoregressive model. 00:41:08.160 |
But with some techniques, like activation offloading or LoRA, 00:41:15.280 |
I believe fewer or smaller GPUs can also be used for post-training. 00:41:35.440 |
So, the 'world' in our model name: we want to emphasize that the model 00:41:46.720 |
has spatial consistency, and we're aiming to provide the best foundation model for physical AI. 00:42:10.880 |
Okay, I think that's all the questions in the chat. Any more questions? 00:42:18.800 |
Hi Ethan, thanks for the talk. I had a question. So, you said, for identifying high-quality 00:42:27.040 |
videos, you filter for high-quality clips first, right? How do you do that? 00:42:33.200 |
Do you use, like, some already available open models or do you train your own models for that? 00:42:41.600 |
Yeah, that's a good question. So, there are different metrics for 00:42:48.160 |
filtering videos; there are both heuristics and some models. A heuristic: if the 00:42:55.760 |
video is static, it's basically an image, so it's not a good video. Or you can also train a model to 00:43:04.640 |
classify the quality of the video, like an aesthetic score; that might need 00:43:15.120 |
some extra training and labeling. There's also motion scoring, like how much motion is in the video. 00:43:25.520 |
So, in your case, you guys trained a custom model for that, 00:43:29.360 |
based on these metrics, maybe motion or based on the aesthetics? 00:43:34.640 |
There are a lot of open-source models available already; you can check them out. 00:43:41.520 |
Like, there are aesthetic classifiers, etc. Yeah. 00:44:00.560 |
Um, another quick question is, you know, as Cosmos develops or releases more iterations, 00:44:06.960 |
how do you foresee approaches to adding more controllability within the scenario? 00:44:13.760 |
So, more refined control over what's happening in the scene, and what variables you want to 00:44:19.120 |
change versus not to change? Sort of inherent to, you know, video generation in general, I think 00:44:24.320 |
you don't have as much control, and curious if you're seeing that as a requirement, and how, 00:44:31.920 |
Yeah, I think that's very important for post-training. 00:44:35.360 |
That also depends on the use case. Depending on your data, 00:44:45.120 |
if your data has more parameters you can use as conditioning, 00:44:51.920 |
I think adding them into the training would definitely help. 00:44:56.240 |
Yeah, if you have additional camera intrinsics, if you have additional 00:45:04.160 |
cameras as a condition, or additional signals like audio, all of them can be used as conditioning. 00:45:11.680 |
The model is quite flexible for adding additional conditioning. 00:45:18.720 |
For the diffusion model, you can add it through cross-attention. 00:45:37.440 |
Ethan, I have another question that's somewhat related. 00:45:41.840 |
I was a little confused about how much of the ability to generate 00:45:53.040 |
realistic physics and, sort of, world models 00:46:00.640 |
is due to training versus some inductive bias in the model, and, 00:46:06.960 |
inasmuch as it was inductive bias, what were the key things there? 00:46:13.840 |
So, I think the two key things are data and scale. 00:46:21.360 |
As the models grow larger and larger, a lot of the 00:46:32.720 |
3D capability, consistency, and physics properties automatically appear when the model 00:46:41.440 |
is bigger. And the other thing is data: in the data, you need to have enough 00:46:48.480 |
demonstrations of different physical properties for the model to learn from. 00:46:55.440 |
That is to say, the model itself doesn't have a lot of 00:46:58.160 |
inductive bias. We're just using transformers. There's no, 00:47:05.360 |
like, spatial attention, temporal attention, those kinds of things. 00:47:34.240 |
If, if there aren't other questions, I actually have one more. So, in, in the, sort of, 00:47:42.000 |
like, the original diagram of the architecture, there's some, some things that I didn't understand 00:47:50.320 |
about the, the positional embeddings. Like, there's the, there's, like, two different 00:47:57.840 |
positional embeddings, or three different positional embeddings, I think. Yeah, so, 00:48:02.240 |
there's, like, this absolute positional embedding. And then, actually, there's another diagram that, 00:48:07.600 |
where there's another positional embedding that goes into the cross attention, I think. 00:48:12.080 |
Yeah, and, well, I'm not sure what that is, that time step in this 00:48:16.880 |
scale-shift-gate. So, I was kind of confused about what the purpose of all these are. 00:48:21.520 |
Yeah. So, the timestep is specific to the diffusion models. You know, 00:48:30.240 |
in the diffusion process, you're going through multiple steps to remove noise, and it becomes 00:48:37.760 |
a clear and crisp video, right? So, during training, the process is: 00:48:45.760 |
you randomly apply some noise to the tokens, and you also need to indicate to the model 00:48:54.560 |
how much noise was added. If there is more noise, the timestep is an earlier step. 00:49:01.760 |
The less noise, the closer the timestep is to the end of the generation. 00:49:08.000 |
So, during inference, the model can gradually remove the noise, conditioned on 00:49:17.600 |
which timestep it is on. As for the absolute positional embedding and the 3D RoPE, 00:49:24.880 |
those tell the model, for each token, which position it is at in the video. 00:49:33.760 |
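To show how the timestep (noise-level) embedding enters through that scale-shift-gate, here is a minimal AdaLN-style block in PyTorch; the dimensions and the stand-in inner layer are assumptions for illustration.

```python
import torch

class AdaLNBlock(torch.nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.norm = torch.nn.LayerNorm(dim, elementwise_affine=False)
        self.inner = torch.nn.Linear(dim, dim)             # stand-in for attention or MLP
        self.modulation = torch.nn.Linear(dim, 3 * dim)    # produces scale, shift, gate

    def forward(self, x, t_emb):
        # x: [batch, tokens, dim]; t_emb: [batch, dim] embedding of the diffusion timestep
        scale, shift, gate = self.modulation(t_emb)[:, None, :].chunk(3, dim=-1)
        h = self.inner(self.norm(x) * (1 + scale) + shift)
        return x + gate * h                                # gated residual update

block = AdaLNBlock()
out = block(torch.randn(2, 16, 512), torch.randn(2, 512))
print(out.shape)                                           # torch.Size([2, 16, 512])
```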
Sure. No, I guess I was just confused about the, what is the need for both 00:49:43.120 |
the rotary positional embedding and also the absolute 00:49:46.480 |
positional embedding? Like, why is that, why are both of those needed? 00:49:49.920 |
So, not necessarily, but this can improve the model. In fact, if you just use absolute 00:50:01.040 |
positional embedding, it can also work. Okay, I see. Okay, got it, got it. Thank you. 00:50:16.560 |
Yeah. So, there was a comment in the chat about the use of vector quantization: 00:50:25.920 |
how is that used, actually? I don't think it's used for selecting 00:50:35.920 |
patches, but it could be used for the discrete latent space. 00:50:41.040 |
It's a training technique. Basically, for the autoregressive part of the model, 00:50:49.280 |
you need to have a fixed vocabulary, and the inputs are basically indices, 00:50:58.080 |
like this token is number one, number two, etc. So, when you're training the tokenizer, 00:51:12.480 |
for each patch it basically looks for the closest vector in the codebook, as in the sketch below. 00:51:30.160 |
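A minimal sketch of that nearest-codebook lookup with toy sizes; the 64K vocabulary matches the talk, while the latent dimension and the plain L2 lookup are assumptions for illustration.

```python
import torch

codebook = torch.randn(64_000, 6)        # [vocabulary size, latent dim] (toy latent dim)
patches = torch.randn(16, 6)             # 16 patch embeddings from the encoder

dists = torch.cdist(patches, codebook)   # pairwise L2 distances to every codebook vector
indices = dists.argmin(dim=-1)           # discrete token id per patch (fed to the transformer)
quantized = codebook[indices]            # quantized vectors (fed to the decoder)
print(indices.shape, quantized.shape)    # torch.Size([16]) torch.Size([16, 6])
```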
Thanks, Ethan. I had a question about the size of the models that were posted to HuggingFace. 00:51:51.600 |
How did you guys select those sizes? Did you experiment with larger sizes? Yeah, those are 00:51:58.000 |
my questions. So, yeah, this is the first release of Cosmos 1.0. There might be bigger models in 00:52:12.400 |
the future, because when doing research we want to go from small to big. We're not doing 00:52:23.520 |
it in one shot, and I think we're still in the infancy of world foundation models; 00:52:30.720 |
let's say it's kind of like the GPT-1 or GPT-2 stage of world foundation models. 00:52:39.600 |
Bigger models will definitely come in the future. 00:52:42.720 |
Got it. Thanks. Was there any thinking in terms of, well, this is good enough 00:52:50.320 |
for most of the applications we see from, I don't know, customers or partners? 00:52:55.840 |
It's not good enough yet. It can get better and better. The model now has some emerging 00:53:09.280 |
physics properties in the generated video. I would think it can get better. 00:53:19.840 |
Thanks. Guys, so it looks like swyx passed the baton to me. He had to drop off for a call, 00:53:48.240 |
or for an in-real-life meeting, and so I want to, if there are any other questions, 00:53:56.400 |
encourage you to ask. Otherwise, I think we can take a little bit of time to discuss the next 00:54:04.080 |
paper, and I actually have a hard stop in three minutes, so I need to drop off 00:54:14.240 |
at that time. So, first of all, Ethan, this is fantastic. I hope you keep coming back to these 00:54:21.600 |
paper club meetings, and even if you want to present someone else's paper rather than your 00:54:33.840 |
own, certainly anytime you publish a paper, we definitely want to see you here. But if others 00:54:39.600 |
publish a paper and you think it's exciting and you want to share it, we definitely would love to have 00:54:44.560 |
you as well. Thank you. Thank you for hosting. Yeah, I mean, swyx, but I'm happy to facilitate 00:54:54.560 |
where I can. Are there other questions for Ethan before we, I'm not sure how much time 00:55:01.520 |
we have really to discuss the next paper, but, okay, does anyone want to volunteer? I think that 00:55:14.160 |
I saw some chat, and I'm not sure about this, but I saw some noise on the Discord about people just 00:55:21.360 |
picking things from the list of papers that are in our backlog, and then just giving brief, like, 00:55:29.280 |
sort of very fast discussions of those. Maybe I think that in the past we've taken 10 or 15 minutes 00:55:36.880 |
to just go over and summarize the paper for everyone. People probably won't pre-read, 00:55:43.360 |
but it'll just be a good, you know, sort of way to understand in some detail what are the key 00:55:49.760 |
points from the paper. So, maybe I can post that. I think it's already in Discord, but I can post 00:55:56.800 |
that in Discord. If there are people who are not on Discord, maybe I can suggest swyx 00:56:04.320 |
also post that on, like, Twitter or whatever. Is that okay, unless, of course, 00:56:13.520 |
someone wants to volunteer to present a paper next week? Okay. Somebody asked for the 00:56:26.480 |
Discord channel. If no one can dig that up, I think it's on the Latent Space 00:56:36.720 |
Substack; you can dig through it, or, like, I think 00:56:42.640 |
there's a website, too, and you can find it there. Otherwise, you can hit me up on X, and I'll find it 00:56:51.920 |
for you, or on LinkedIn as well. On both of them, my user handle is Haneke, or you can obviously 00:57:01.360 |
also ask swyx, or anyone else here. Oh, there it goes. Okay, great. Okay, guys, so, grab that if 00:57:09.280 |
you need it. I'm going to end the meeting, and, yeah, I've got to go. So, I'm going to 00:57:14.000 |
stop recording. Actually, I probably was supposed to stop recording earlier; oh well, 00:57:18.400 |
that can be edited. And thank you very much. We'll see you next week.