
NVIDIA Cosmos: World Foundation Model Platform for Physical AI - w/ Ethan He



00:00:00.000 | the 75 pages of the report. I can't cover everything in one hour. I could talk about it for
00:00:07.040 | hours. So I'll just cover what I focus on: data scaling and model scaling. First, I'll do
00:00:15.520 | an introduction of Cosmos for people who are not familiar with it. I guess the introduction is best
00:00:22.720 | left to Jensen himself. Cosmos includes autoregressive world foundation models, diffusion-based world
00:00:33.120 | foundation models, advanced tokenizers, and an NVIDIA CUDA- and AI-accelerated data pipeline.
00:00:40.400 | Cosmos models ingest text, image, or video prompts and generate virtual world states as videos.
00:00:50.400 | Cosmos generations prioritize the unique requirements of AV and robotics use cases,
00:00:55.600 | like real world environments, lighting, and object permanence.
00:01:00.400 | Developers use NVIDIA Omniverse to build physics-based geospatially accurate scenarios,
00:01:07.760 | then output Omniverse renders into Cosmos, which generates photoreal physically-based synthetic
00:01:14.720 | data. Whether diverse objects or environments, conditions like weather or time of day,
00:01:36.400 | or edge case scenarios, developers use Cosmos to generate worlds for reinforcement learning
00:01:44.320 | AI feedback to improve policy models or to test and validate model performance.
00:01:51.040 | Even across multi-sensor views, Cosmos can generate tokens in real time,
00:01:59.920 | bringing the power of foresight and multiverse simulation to AI models,
00:02:05.440 | generating every possible future to help the model select the right path.
00:02:09.680 | Working with the world's developer ecosystem,
00:02:14.400 | NVIDIA is helping advance the next wave of physical AI.
00:02:18.160 | Okay, so what's a world model? A world model takes past observations,
00:02:31.840 | as well as perturbations, and it can predict future observations.
00:02:39.200 | The perturbation can take any form: it can be actions from the physical AI,
00:02:46.320 | or it can just be some random perturbation, or a text description of the perturbation.
00:02:54.160 | So, in Cosmos 1.0, we open-sourced a family of models. We have two sets of
00:03:02.960 | world foundation models. One is based on diffusion, while the other is based on
00:03:08.240 | autoregressive models. For each family, we also built two base models and two derivatives.
00:03:15.040 | To achieve the best generation quality, we also built an upsampler for the diffusion model,
00:03:21.600 | and also a diffusion decoder to improve the video generated from the autoregressive model.
00:03:27.200 | So, these are already open-sourced on GitHub. Feel free to try them.
00:03:32.560 | So, for the diffusion world model, this is the architecture overview of it.
00:03:41.040 | So, the input video goes through a video tokenizer. Here it's called CV8x8x8:
00:03:51.360 | time and space are both compressed by 8. If you have 8 frames, they go into
00:04:04.080 | one latent frame. I assume everyone knows diffusion. The tokens are corrupted, then go through a
00:04:12.320 | diffusion transformer. The model then generates the reconstructed video during training.
00:04:22.160 | This is an example video generated from the diffusion world model.
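To make the training step just described concrete, here is a minimal sketch of one diffusion-training iteration on video latents. The tokenizer, the diffusion transformer (`dit`), and the simple noising scheme are placeholder assumptions for illustration, not the actual Cosmos implementation.

```python
import torch

# Minimal sketch of one diffusion-training step on video latents.
# `tokenizer` and `dit` (diffusion transformer) are placeholder modules, and
# the simple additive-noise scheme below is an assumption for illustration.

def diffusion_step(video, tokenizer, dit, optimizer):
    with torch.no_grad():
        latents = tokenizer.encode(video)          # e.g. 8x8x8 spatio-temporal compression, (B, C, T, H, W)

    sigma = torch.rand(latents.shape[0], device=latents.device)  # per-sample noise level
    noise = torch.randn_like(latents)
    noisy = latents + sigma.view(-1, 1, 1, 1, 1) * noise          # corrupt the tokens

    pred = dit(noisy, sigma)                       # model predicts the clean latents
    loss = torch.nn.functional.mse_loss(pred, latents)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```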
00:04:28.880 | For the autoregressive world model, it goes through a similar process, but the tokenizer
00:04:39.280 | is discrete instead of continuous. The discrete tokenizer is very similar to the ones used in
00:04:47.840 | LLMs. It converts video patches into entries of a vocabulary. There's a
00:05:02.560 | 64K vocabulary. These discrete tokens are fed into a transformer with a similar architecture to LLMs.
00:05:15.280 | Then, discrete tokens are generated. Then, there's a decoder, which is also a discrete decoder that
00:05:24.320 | decodes these tokens into videos. There has been debate on whether diffusion or autoregressive
00:05:34.160 | models are better. Since we don't know, we built both of them.
00:05:43.280 | For example, here, this is an input image for the autoregressive model. You can
00:05:51.520 | use it as the prefill for the transformer. Then, in the decoding process,
00:06:00.240 | it can decode into videos.
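A minimal sketch of the prefill-then-decode flow described above: the input image is converted to discrete tokens that prefill the transformer, and the remaining video tokens are sampled autoregressively. All module names here are hypothetical placeholders, not the Cosmos API.

```python
import torch

@torch.no_grad()
def generate_video_tokens(image, discrete_tokenizer, ar_model, num_new_tokens):
    """Placeholder sketch: prefill with image tokens, then sample video tokens."""
    tokens = discrete_tokenizer.encode(image)        # (1, prefill_len), indices into a 64K vocabulary
    for _ in range(num_new_tokens):
        logits = ar_model(tokens)                    # (1, seq_len, vocab_size)
        next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)  # greedy decoding for simplicity
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens                                    # decode to pixels with the discrete decoder afterwards
```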
00:06:06.480 | Between the two: if you want better quality of the generated result, you can go with the diffusion
00:06:16.160 | model. If you want the model to be faster, you can try the autoregressive model.
00:06:23.520 | Autoregressive also plays very well with other modalities. You can easily combine other tokens,
00:06:33.520 | like text tokens or action tokens. But here, our autoregressive model is trained purely on videos.
00:06:41.760 | We also released post-training scripts for these models. In the Cosmos paper,
00:06:55.360 | we discuss several post-training examples of the Cosmos foundation models for different physical
00:07:03.600 | AI tasks. Right now, in the GitHub repo, we support general post-training. This fine-tunes the
00:07:11.440 | world models to generate a target distribution of videos based on a custom dataset.
00:07:17.200 | The target distribution could include a specific camera spec or a specific domain.
00:07:26.000 | Here is an example. We took a few videos of a humanoid robot,
00:07:37.840 | just roughly five videos of this humanoid. The videos are from simulation.
00:07:49.280 | After fine-tuning the diffusion model, you can generate novel videos of this
00:07:57.040 | robot doing something else. The model is able to remember the characteristics of this robot
00:08:04.880 | while generating novel tasks which are not available in either simulation or
00:08:14.000 | in the real world through teleoperation. There are more post-training scripts coming soon.
00:08:26.960 | For example, instruction control: post-training the models for robotic
00:08:34.320 | manipulation to produce a video based on a textual instruction. You can instruct the robots
00:08:41.760 | to perform tasks like folding clothes or picking up objects. Also, action control:
00:08:49.440 | the post-trained model can predict both the next video frame and the next action.
00:08:59.040 | Here, the example shows a camera control. Adding the camera pose as a condition,
00:09:07.840 | you can generate 3D consistent video simulation from a single image or video.
00:09:13.440 | This can enable navigation in virtual environments. You can also do
00:09:24.000 | multi-view generation, especially for autonomous driving. You can generate synchronized multi-view
00:09:36.160 | videos from text prompts then simulate the driving scenarios with multiple camera perspectives.
00:09:43.840 | Next, I'll dive into technical details. First, I'll go over data scaling, then model scaling.
00:10:02.560 | So, we open-sourced a training framework. For the data curation part, you can sign up; it's
00:10:15.360 | coming soon. The training framework is open-sourced today. When we curate data for text,
00:10:31.680 | you can just grab the text online and the label is basically next token prediction,
00:10:39.840 | which is relatively straightforward and cheap to curate. However, for videos,
00:10:47.680 | for example, you have a video shot of someone playing basketball. You need to label it as a basketball
00:10:55.600 | player dribbling the ball and shooting it into the hoop. Labeling video data requires
00:11:02.400 | good AI models for automatic captioning. We want to control the AI models to generate
00:11:11.040 | using text we specify. Also, another challenge is that video signals are less information-dense compared to
00:11:21.680 | text. Out of, say, an hour of video, there might only be a second of interesting content.
00:11:28.560 | This is very computationally challenging and very expensive. We use distributed computing
00:11:36.800 | to solve this problem. This is the life cycle of curation. On top of the DGX Cloud platform,
00:11:47.760 | we run a streaming data pipeline on thousands of GPUs.
00:11:54.400 | The long videos go into the pipeline, where they are split and then transcoded into
00:12:04.720 | shorter clips. Then, different AI models are run on the short clips to detect high-quality
00:12:14.800 | videos. An NVIDIA VLM captioning model running on TensorRT-LLM is used to caption the videos.
00:12:27.600 | And finally, we get a training dataset.
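Conceptually, the pipeline just described (split and transcode long videos, filter clips with several models, caption, write out the dataset) looks roughly like the sketch below. Every helper is a hypothetical placeholder passed in by the caller, not a NeMo Curator or TensorRT-LLM call.

```python
# Hypothetical sketch of the curation life cycle; every component here is a
# placeholder supplied by the caller, not an actual NeMo Curator API.

def curate(long_videos, splitter, quality_filters, captioner, writer):
    for video in long_videos:
        for clip in splitter(video):                        # shot detection + transcoding into short clips
            if all(f(clip) for f in quality_filters):       # aesthetics, motion, scene consistency, ...
                writer.append(clip, caption=captioner(clip))  # VLM captioning of the surviving clips
    writer.close()                                          # final training dataset
```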
00:12:33.840 | Data curation for video foundation models is very challenging. The scale of the video data
00:12:44.480 | is hundreds of petabytes, much bigger than for previous image models. Orchestration at scale,
00:12:52.720 | with heterogeneous compute requirements of tens of AI models running efficiently together,
00:13:00.640 | is also very challenging. You have the captioning model, you have models to detect the scene change,
00:13:08.720 | you have models to detect the video consistency, aesthetic, etc. Multiple concurrent streams of
00:13:18.480 | high-throughput data exchange between AI models also impose bandwidth challenges to the cluster.
00:13:26.320 | Every single step of the curation pipeline needs to be GPU-accelerated.
00:13:36.320 | We also need to manage the resiliency of the GPU-based data pipeline at scale.
00:13:42.320 | So, each inference model needs to run at the speed of light. We go from the baseline,
00:13:57.360 | where the model runs on PyTorch, to using TensorRT-LLM to accelerate it.
00:14:05.600 | And then we run it on a larger batch to further accelerate it. And today, we use FP8 quantization
00:14:14.480 | to further accelerate it to 7x compared to the baseline.
00:14:19.200 | So, for video understanding, filtering the high-quality clips and auto-labeling is not
00:14:30.240 | enough for building a video foundation model. We need to understand a lot more about the videos
00:14:37.920 | for domain-specific training. We remove duplicated content
00:14:43.920 | and build visual-search-style understanding of the data.
00:14:47.840 | These are the later stages of the video data curation life cycle.
00:14:57.680 | After the captioning, we need to do clustering to group the data into different categories,
00:15:06.000 | sports, entertainment, robotics, etc.
00:15:11.920 | Then there is semantic deduplication to remove redundant data.
00:15:24.560 | Finally, video taxonomy further helps researchers pick the data they want to train on.
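Semantic deduplication is typically done by embedding each clip and dropping near-duplicates whose embeddings are too similar. A minimal cosine-similarity sketch, assuming clip embeddings have already been computed; real pipelines cluster first to avoid the quadratic comparison.

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal over clip embeddings of shape (N, D).
    A simplified sketch; the threshold is an illustrative assumption."""
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        # Keep a clip only if it is not too similar to any already-kept clip.
        if all(emb @ embeddings[j] < threshold for j in kept):
            kept.append(i)
    return kept  # indices of clips to retain
```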
00:15:32.720 | The takeaway for video data curation is that we built the video processing capabilities
00:15:51.680 | into NeMo Curator to enable developers to curate high-quality data and train
00:15:57.680 | highly accurate video foundation models. By leveraging end-to-end GPU acceleration
00:16:05.280 | and optimizing the data orchestration through the pipeline,
00:16:08.720 | NeMo Curator can scale to over 100 petabytes of data.
00:16:14.800 | These optimizations reduce the processing time and lower the total cost of ownership.
00:16:20.800 | The models are optimized for high throughput, enhancing overall pipeline efficiency.
00:16:28.560 | Next, let's go over the model scaling.
00:16:39.680 | So, using the NeMo video foundation model training framework, you can scale these video models
00:16:48.960 | up to 20 times larger than with traditional frameworks. The framework is capable of training
00:16:57.440 | diffusion or autoregressive foundation models up to 100 billion parameters.
00:17:06.480 | The throughput is highly optimized. We achieve roughly 450
00:17:11.600 | teraflops. That's close to 50% MFU on the H100 chips.
00:17:23.680 | This is very close to the training efficiency of LLM training.
00:17:34.640 | Previously, we talked about the scale of
00:17:37.520 | data curation. We have hundreds of petabytes of data going into the curation pipeline.
00:17:45.600 | After curation, the dataset we get consists of short video clips and images with text embeddings.
00:17:58.720 | Even though the scale of this data is much smaller, it is still
00:18:03.360 | considered relatively big if we want to train on today's clusters.
00:18:09.280 | For example, the images are on the order of 1 billion, and the videos are roughly on the order of
00:18:21.760 | 100 million clips. In the paper, we use an image and video tokenizer compression rate of 8x8x8.
00:18:34.720 | At this scale, the images are compressed to roughly 200 kilobytes. For 1 billion images,
00:18:47.120 | it's roughly at the level of 100 terabytes. For the videos, it's on petabyte level.
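A rough back-of-the-envelope check of those numbers; the per-sample latent sizes are illustrative assumptions.

```python
# Back-of-the-envelope storage estimate; per-sample sizes are illustrative assumptions.
num_images = 1e9                    # ~1 billion curated images
bytes_per_image_latent = 200e3      # ~200 KB per compressed image, as quoted above
image_total_tb = num_images * bytes_per_image_latent / 1e12
print(f"images: ~{image_total_tb:.0f} TB")   # ~200 TB, i.e. the hundred-terabyte scale mentioned

num_clips = 1e8                     # ~100 million short video clips
bytes_per_clip_latent = 10e6        # assume ~10 MB per short clip after compression
video_total_pb = num_clips * bytes_per_clip_latent / 1e15
print(f"videos: ~{video_total_pb:.1f} PB")   # petabyte scale, as stated
```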
00:18:55.600 | A solution can be storing this data on the cluster, or storing it on cloud storage like S3.
00:19:11.520 | Storing it on the cluster has huge costs; most clusters don't have that much
00:19:19.440 | storage on the clusters themselves. We provide both solutions in the open-source framework.
00:19:34.240 | We leverage Megatron Energon, which is another open source library from
00:19:39.920 | NVIDIA, to load data efficiently. It allows you to load data from web source like
00:19:49.120 | AWS S3 very efficiently without the GPU idling during training. It allows you to deterministically
00:20:01.200 | save and restore the data loader, which is one of the biggest challenges in loading from the
00:20:08.320 | web dataset. In a webdataset, the data is usually loaded sequentially. When your training is
00:20:16.880 | interrupted, with the traditional way of training, you have to randomize the loading. You won't be able to
00:20:28.640 | load non-repetitive data without Megatron Energon.
00:20:32.960 | Another challenge in loading the data is variable input data shapes.
00:20:46.560 | The data types are different. You have image, video, and you also have different durations of
00:20:55.440 | the videos. You have one second, 10 seconds, or even 50 seconds. The resolutions are different, too:
00:21:04.640 | 360p, 720p, 1080p. There are also different aspect ratios, 16 by 9, 9 by 16.
00:21:12.240 | When you're training on text, you don't have this kind of problem. In video,
00:21:24.640 | this can cause a very big problem in efficiency if we batch the data.
00:21:29.840 | The traditional approach is batching the data. For each different shape of the
00:21:42.960 | input, for example images, we batch the images into a batch of a few samples. For the videos,
00:21:53.760 | for very large videos, you might just take one video as the input. For medium sizes,
00:22:00.240 | maybe you can batch two or four into one batch. The pro is that this is commonly used for most
00:22:11.200 | of the models nowadays. For example, in ImageNet training, traditionally people just resize all of
00:22:20.640 | the image into 512 by 512 to mitigate this problem. But the challenge here is
00:22:29.760 | if you want to train on different aspect ratios and different resolutions, you need a complicated
00:22:38.720 | data loading logic to ensure that during training, at each iteration, at least the data shape is the
00:22:46.240 | same. And the efficiency is not very high because not all of the data shapes can be efficiently
00:22:54.320 | utilized by the GPU. Also, constantly changing the shape of the input data can cause challenges for
00:23:02.800 | the fused kernels. On the GPU, if all of your tensor operation shapes are the same across
00:23:13.040 | iterations, we can optimize for them, and they run more efficiently than with dynamic shapes.
00:23:20.240 | The data loading scheme we open source is called Pack Sequencing or Sequence Packing.
00:23:34.640 | Different from the traditional SBHD format, this one allows you to mix different images, videos,
00:23:45.680 | multi-modal data, whatever, with different aspect ratios, durations, and resolutions. The key is to
00:23:56.800 | reshape all of the data into one-dimensional sequence and then pack them together into
00:24:04.400 | one batch. And when you pass this into transformer, outside of self-attention,
00:24:12.720 | there's no problem at all. The MLP operation of the transformer is just a per-token operation.
00:24:21.760 | But for self-attention, we need to create a block-diagonal mask so that each of
00:24:30.640 | the samples in the sequence computes self-attention only on itself. This
00:24:38.240 | operation is handled automatically by the fused CUDA kernel. You only need to supply the
00:24:49.360 | sequence length in our training code and that's all you need to enable Pack Sequence training.
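A minimal sketch of what sequence packing looks like in practice: variable-length samples are flattened and concatenated into one long sequence, and a block-diagonal mask (or, equivalently, the cumulative sequence lengths passed to a fused attention kernel) keeps each sample attending only to itself. This is an illustration, not the actual NeMo implementation.

```python
import torch

def pack_sequences(samples):
    """samples: list of (seq_len_i, hidden) tensors of varying length.
    Returns the packed (1, total_len, hidden) batch, a block-diagonal attention
    mask, and the cumulative sequence lengths. Fused kernels usually take the
    cumulative lengths instead of materializing the full mask."""
    lengths = [s.shape[0] for s in samples]
    packed = torch.cat(samples, dim=0).unsqueeze(0)

    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in lengths:
        mask[start:start + n, start:start + n] = True   # each sample attends only within itself
        start += n

    cu_seqlens = torch.cumsum(torch.tensor([0] + lengths), dim=0)  # [0, l1, l1+l2, ...]
    return packed, mask, cu_seqlens
```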
00:24:55.840 | With this data loading scheme, the training efficiency is very high. You can see that
00:25:05.920 | at the end there is some padding; if you have a large enough max sequence length, the padding is
00:25:14.480 | very small, and the training efficiency is very close to when all of the samples have the exact
00:25:22.800 | same shape. Next, I'm covering parallelism. One of the biggest challenges in training
00:25:40.560 | on videos is the context length. Traditionally, in pre-training LLMs, the context length is really
00:25:49.360 | like 4K; nowadays, it's 8K on Llama. But when training on videos, the context length is much larger.
00:25:57.600 | Say we have five seconds of video: encoding it with an 8x8x8 tokenizer, it goes into roughly
00:26:10.320 | 60K or 70K tokens. This is 10 times larger than the sequence length of LLMs.
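To make that estimate concrete, here is a quick calculation under assumed settings; the frame rate, resolution, and the 2x2 patchification inside the transformer are illustrative assumptions.

```python
# Rough sequence-length estimate for a 5-second clip; frame rate, resolution,
# and the 2x2 patch embedding inside the transformer are illustrative assumptions.
fps, seconds = 24, 5
height, width = 704, 1280
latent_t = (fps * seconds) // 8                  # temporal compression by 8 -> 15 latent frames
latent_h, latent_w = height // 8, width // 8     # spatial compression by 8 -> 88 x 160
patch = 2                                        # assumed 2x2 patchification in the transformer
tokens = latent_t * (latent_h // patch) * (latent_w // patch)
print(tokens)                                    # 52,800, the same ballpark as the ~60-70K quoted
```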
00:26:19.360 | Context parallelism, or ring attention, is one of the key techniques we use to scale
00:26:28.320 | the diffusion transformer or the autoregressive world model to up to 1 million tokens.
00:26:38.160 | Using context parallelism, you can partition the activations of the entire transformer along the
00:26:44.480 | sequence dimension. This exploits the permutation invariance of attention to distribute the sequence
00:26:52.320 | in a ring topology. Hey, quick question, Ethan. I know for some LLM models, like even the Llama
00:27:04.240 | models that are trained up to 128K context, something that they do is they do the bulk of
00:27:09.440 | the training, like the majority of the five trillion tokens, are done at a smaller context
00:27:14.480 | length. Then in post-training, they continually train on longer context. Is that a thing in video
00:27:22.640 | gen? Can you train the majority of the model at a short clip length and then extend this and
00:27:30.160 | extrapolate it out? Yes, that's a good question. I think the bottleneck here is we don't have a
00:27:41.120 | very efficient video compressor. Even a five second video is like 60K tokens.
00:27:53.280 | If we say we train on shorter videos like one second, that also works. But for the majority
00:28:00.160 | of the training, the video foundation models, they are 10 times longer context compared to
00:28:08.160 | the LLMs. For post-training, the video models are extended to even longer context,
00:28:19.120 | say like one million tokens, to be able to generate a video roughly like one minute.
00:28:26.000 | That makes sense. It's basically the same problem, it's just a 10x scale on both sides,
00:28:32.880 | so even the short context is still there. Yes. Thank you. I'd say if we have a very good
00:28:40.320 | tokenizer in the future that can efficiently reconstruct the videos, maybe there will be a paradigm
00:28:47.120 | change. Right now, the video tokenizers Cosmos released are 8x8x8 or 8x16x16.
00:28:58.320 | Spatially, 16x16 is already near the limit. If you go beyond that, a lot of reconstruction
00:29:11.040 | artifacts will appear. Makes sense. Thank you. For video generation and inference,
00:29:22.400 | we also employ context parallel. In the open source repository, you can already use context
00:29:30.880 | parallel to accelerate the inference. For example, on 8 GPUs, using context parallel 8, you can
00:29:38.880 | generate a 5-second video under 30 seconds. Using more across different nodes, you can generate a
00:29:49.520 | video in a matter of seconds. Another challenge brought by the diffusion transformer concerns
00:30:05.360 | pipeline parallelism. Traditionally, in LLMs, for different pipeline stages, you only need to
00:30:13.440 | pass the hidden states to the next pipeline stage. But diffusion transformers have a lot of
00:30:20.160 | conditioning through adaptive layer norm, and also conditioning on text, which creates difficulty for
00:30:30.000 | pipeline parallelism. So we provide a solution that generates the additional conditionings
00:30:39.920 | on each pipeline-parallel rank. This requires slightly more compute,
00:30:45.840 | but reduces the communication cost a lot, which leads to improved performance.
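In other words, instead of sending the conditioning embeddings from the first pipeline stage to every later stage, each stage recomputes them locally from the small raw conditioning inputs. A hedged sketch of the idea, with placeholder module names:

```python
# Sketch of the idea only: every pipeline stage recomputes the (cheap) conditioning
# embeddings locally, so only the hidden states travel between stages.
# All module names are placeholders, not the actual framework API.

def pipeline_stage_forward(stage_blocks, hidden_states, timestep, text_tokens,
                           timestep_embedder, text_embedder):
    # Recomputed on every rank: a little extra compute, much less communication.
    t_emb = timestep_embedder(timestep)
    c_emb = text_embedder(text_tokens)
    for block in stage_blocks:
        hidden_states = block(hidden_states, t_emb, c_emb)  # AdaLN + cross-attention conditioning
    return hidden_states  # only this tensor is sent to the next pipeline stage
```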
00:30:58.000 | Okay. I think that's all of my presentation. Thank you for listening. Any questions?
00:31:04.800 | Hi, Ethan. Thanks a lot for joining us again. This is RJ. I asked a question at the beginning
00:31:21.760 | of the chat. I'm a little unclear about how the encoder gets, like, the encoder to the 8x8
00:31:33.760 | latent space gets trained. Is that just part of the diffusion training, or is there something,
00:31:42.560 | like, some sort of, like, a separate step that is used to train that encoder?
00:31:51.600 | Yeah, that's a good question. So a separate step is used to train the encoder.
00:31:55.920 | Tokenizer is a fancy name for it, but this is basically a vector-quantized variational autoencoder,
00:32:07.280 | a VQ-VAE. Okay, yeah.
00:32:09.520 | Yeah, you would basically train it for the task of reconstructing the videos.
00:32:18.480 | Okay, right. So – but how do you get it to create a 3D – what's it like, the
00:32:25.440 | TLDR, and how to get it to create a 3D latent space like that?
00:32:28.720 | Yeah, so the model architecture itself is a causal convolutional neural network. It can
00:32:41.040 | reconstruct – the encoder and decoder structure reconstructs the video. So the training objective
00:32:49.520 | is basically reconstructing the video. The process is you need to collect some
00:32:55.280 | diverse set of different videos, ideally in your domain, and then train this causal CNN
00:33:04.160 | to reconstruct those videos. For continuous tokens there is no codebook; they are just those continuous
00:33:12.960 | tokens. But for discrete tokenizers, you would do vector quantization to quantize into the 64K codebook.
00:33:22.480 | Okay. Is it – sorry, I didn't have time to pre-read the paper. Is this covered in the paper,
00:33:29.120 | or is there a separate paper for this? Yeah, this is covered.
00:33:32.880 | Okay, got it, got it. Thank you. This is really super interesting, exciting work. Thank you very
00:33:39.280 | much for joining us. So additionally, the tokenizer is
00:33:44.560 | frozen during the training of the transformer, because if you don't freeze the tokenizer,
00:33:53.440 | it can lead to catastrophic forgetting. You just end up with the prediction error
00:34:03.920 | showing up in the loss. Okay, got it, yeah.
00:34:11.360 | Sorry, I have also a question. I didn't find any reaction button that I can
00:34:19.040 | raise my hand. Can I ask the question right now? Yeah.
00:34:22.480 | Okay, perfect. So my question is about the open source framework for pre-training that you
00:34:28.480 | mentioned. I think it was NEMO, right? Yes.
00:34:31.600 | Yes. So do you think, potentially, if I have a set of videos, but those videos, originally,
00:34:40.240 | they were not necessarily in the RGB space, okay? So I don't know, for example, satellites,
00:34:46.000 | or anything, a spectral wavelength, or whatever. And I just somehow mapped them to videos. Do you
00:34:51.600 | think I can still customize your framework and just pre-train my own tokenizer, or basically
00:35:00.800 | whatever else that exists in that framework? Yeah, if your data domain is different from
00:35:11.200 | video, it's recommended to fine-tune the tokenizers. So just fine-tuning, do you think
00:35:18.560 | that's going to work? So, because if the tokenizer is not fine-tuned, it might produce some artifacts
00:35:30.560 | for your data if your data domain is different. Sorry, yeah, go ahead.
00:35:39.600 | After fine-tuning the tokenizer, you might also want to fine-tune the diffusion transformer or
00:35:46.720 | autoregressive transformer. Yeah, both of these are supported in the framework.
00:35:51.200 | Awesome. And, you know, I can also pre-train the tokenizer using the current framework.
00:35:58.560 | Or fine-tune. Yeah.
00:36:00.400 | Thank you.
00:36:02.080 | Thanks for the presentation. I had a quick question related to some of the,
00:36:10.720 | well, you mentioned it's coming soon, for multi-view generation and more camera control.
00:36:15.200 | So, curious if you could speak any more towards how you're approaching multi-view,
00:36:21.040 | or how to make sure that the camera intrinsics correlate between one another,
00:36:27.120 | you know, if they're all video-based generation versus having a true, like,
00:36:31.120 | grounded scene understanding, how you guys are approaching that.
00:36:34.960 | Yes, that's a good question. So, these are coming soon, but the techniques are covered in the paper.
00:36:44.000 | For example, for multi-view generation, the different views are folded into
00:36:50.800 | one of the dimensions of the data, so the model input is still roughly the same.
00:36:58.960 | In fact, they are folded into the time axis.
00:37:04.000 | And the camera intrinsics are not used now, because
00:37:13.440 | if you have consistent intrinsics, you don't have this problem. But
00:37:19.040 | if your intrinsics change across different training data, I guess it's helpful to
00:37:27.280 | include that in the conditioning information. At least in this example, we use consistent intrinsics.
00:37:42.640 | Yeah, so you're saying it has more to do with, perhaps, more the training data that you're using
00:37:47.600 | to post-train these models, to have it be consistent and
00:37:51.120 | have similar intrinsics? Is that sort of what you're saying?
00:37:55.040 | Yeah, yes.
00:37:57.120 | All right, okay.
00:38:08.400 | I, I can, I can answer questions in the, in the chat. Yeah, I wasn't looking at it.
00:38:15.200 | Yeah, so what does a token represent in this case? One pixel of the video? So,
00:38:25.360 | yeah, the tokens are patches of video. For an image, an 8x8 patch is one token.
00:38:38.720 | For a video, an 8x8x8 patch is one token.
00:38:44.560 | That means, roughly, for one second of video, if it's 30 frames,
00:38:50.560 | in the time domain you have about four tokens.
00:38:59.040 | And spatially, that depends on your resolution.
00:39:07.840 | Yeah, so the video doesn't have a depth map, but it can be,
00:39:12.080 | you can add it in the post-training process.
00:39:16.000 | What's the different, difference between post-training and fine-tuning?
00:39:24.400 | I'd say post-training is a fancy word for fine-tuning. Nowadays,
00:39:33.040 | fine-tuning sometimes refers specifically to certain techniques, as opposed to
00:39:40.320 | just continued pre-training, but I would use these two words interchangeably.
00:39:45.680 | Oh, the number of tokens each of these foundation models is trained on. So, for pre-training,
00:40:02.080 | it's at the level of a hundred million video clips. And
00:40:10.080 | each video clip is roughly five seconds.
00:40:15.120 | Using that information, you can calculate the number of tokens.
00:40:21.360 | I'd say it's roughly on the scale of
00:40:24.160 | 10 trillion tokens, at least 10 trillion or more. You can calculate it for yourself.
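Carrying out that calculation with the rough figures mentioned above (clip count and per-clip token count are order-of-magnitude estimates):

```python
# Order-of-magnitude estimate of pre-training tokens from the figures above.
num_clips = 1e8                 # ~100 million video clips
tokens_per_clip = 6e4           # ~60K tokens for a ~5-second clip (see the earlier estimate)
epochs = 1                      # assume roughly one pass for the estimate
total_tokens = num_clips * tokens_per_clip * epochs
print(f"~{total_tokens:.0e} video tokens")
# ~6e12 per pass; with multiple passes and images included, this lands on the
# order of the "10 trillion tokens or more" quoted above.
```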
00:40:32.000 | [silence]
00:40:46.480 | Yeah, what type of hardware is adequate for post-training on our own data? So,
00:40:55.520 | the post-training that is open-sourced now needs eight H100s for the
00:41:01.840 | diffusion model and two H100s for the autoregressive model.
00:41:08.160 | But with some techniques, like activation offloading or LoRA,
00:41:15.280 | I believe smaller GPUs can also be used for post-training.
00:41:20.080 | [silence]
00:41:35.440 | So, the "world" in our model name: we want to emphasize that the model
00:41:46.720 | has spatial consistency, and we're aiming to provide the best foundation model for
00:41:54.560 | robotics post-training.
00:41:57.440 | [silence]
00:42:10.880 | Okay, I think that's all the questions in the chat. Any more questions?
00:42:16.640 | [silence]
00:42:18.800 | Hi Ethan, thanks for the talk. I had a question. So, for, you said, for identifying high-quality
00:42:27.040 | videos, you, high-quality clips, you filter them out first, right? How do you do that?
00:42:33.200 | Do you use, like, some already available open models or do you train your own models for that?
00:42:38.320 | [silence]
00:42:41.600 | Yeah, that's a good question. So, there are different metrics for
00:42:48.160 | filtering videos. There are both heuristics and models. A heuristic is, for example, if the
00:42:55.760 | video is static, it's basically an image, so it's not a good video. Or you can also train a model to
00:43:04.640 | classify the quality of the video, like an aesthetic score. That might need
00:43:15.120 | some extra training and labeling. There's also motion scoring, like how much motion is in the video.
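As an illustration of the heuristic side, a crude motion score can be computed from mean frame-to-frame differences and used to drop near-static clips; the threshold below is an arbitrary assumption, and aesthetic scoring would use a learned classifier instead.

```python
import numpy as np

def motion_score(frames: np.ndarray) -> float:
    """frames: (T, H, W, C) array with T >= 2. Mean absolute frame-to-frame
    difference; a value near zero means the clip is essentially a static image."""
    frames = frames.astype(np.float32)
    return float(np.mean(np.abs(np.diff(frames, axis=0))))

def keep_clip(frames: np.ndarray, min_motion: float = 2.0) -> bool:
    # The threshold is an arbitrary illustration, not a tuned value.
    return motion_score(frames) > min_motion
```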
00:43:21.520 | [silence]
00:43:25.520 | So, in your case, you guys trained a custom model for that,
00:43:29.360 | based on these metrics, maybe motion or based on the aesthetics?
00:43:33.040 | [silence]
00:43:34.640 | There are a lot of open-source models available already; you can check them out.
00:43:41.520 | There are aesthetic classifiers, etc. Yeah.
00:43:47.040 | [silence]
00:43:49.120 | Okay, thanks.
00:43:50.000 | [silence]
00:44:00.560 | Um, another quick question is, you know, as Cosmos develops or releases more iterations,
00:44:06.960 | how do you foresee approaches to adding more controllability within the scenario?
00:44:13.760 | So, more refined control over what's happening in the scene, and what variables you want to
00:44:19.120 | change versus not to change? Sort of inherent to, you know, video generation in general, I think
00:44:24.320 | you don't have as much control, and curious if you're seeing that as a requirement, and how,
00:44:28.560 | how you're thinking of approaching it.
00:44:29.840 | [silence]
00:44:31.920 | Yeah, I think that's very important for post-training.
00:44:35.360 | That also depends on the use case. Say, depending on your data,
00:44:45.120 | if your data has more parameters you can use as conditioning,
00:44:51.920 | I think adding them into the training would definitely help.
00:44:56.240 | Yeah, if you have additional camera intrinsics, if you have additional
00:45:04.160 | cameras as conditions, or additional signals like audio, all of them can be used as conditioning.
00:45:11.680 | The model is quite flexible for adding additional conditioning.
00:45:18.720 | For the diffusion model, you can add it through cross-attention,
00:45:23.840 | and similarly for the autoregressive model.
00:45:26.800 | [silence]
00:45:37.440 | Ethan, I have another question that's somewhat related.
00:45:41.840 | I was a little confused about how much of the ability to generate
00:45:53.040 | realistic physics and, sort of, world models
00:46:00.640 | is due to training versus some inductive bias in the model, and,
00:46:06.960 | inasmuch as it was inductive bias, what were the key things there?
00:46:13.840 | So, I think the two key things are data and scale.
00:46:21.360 | As the models grow larger and larger, a lot of the
00:46:32.720 | 3D capability, consistency, and physics intrinsics automatically appear when the model
00:46:41.440 | is bigger. Another thing is data. I think in the data, you need to have enough
00:46:48.480 | demonstrations of different physical properties for the model to learn.
00:46:55.440 | That is to say, the model itself doesn't have a lot of
00:46:58.160 | inductive bias. We're just using transformers. There's no
00:47:05.360 | spatial attention, temporal attention, those kinds of things.
00:47:10.800 | Okay, got it. Thank you.
00:47:16.240 | [no audio]
00:47:34.240 | If, if there aren't other questions, I actually have one more. So, in, in the, sort of,
00:47:42.000 | like, the original diagram of the architecture, there's some, some things that I didn't understand
00:47:50.320 | about the, the positional embeddings. Like, there's the, there's, like, two different
00:47:57.840 | positional embeddings, or three different positional embeddings, I think. Yeah, so,
00:48:02.240 | there's, like, this absolute positional embedding. And then, actually, there's another diagram that,
00:48:07.600 | where there's another positional embedding that goes into the cross attention, I think.
00:48:12.080 | Yeah, this, or, well, it's, I'm not sure what that is, that time step in this
00:48:16.880 | scale shift gate. So, I got, I was kind of confused about what the purpose of all these are.
00:48:21.520 | Yeah. So, the timestep is specific to the diffusion models. You know, in
00:48:30.240 | the diffusion process, you go through multiple steps to remove noise, and it becomes
00:48:37.760 | a clear and crisp video, right? So, during training, the process is that
00:48:45.760 | you randomly apply some noise to the tokens, and you also need to indicate to the model
00:48:54.560 | how much noise was added. If there is more noise, the timestep is an earlier step;
00:49:01.760 | the less noise, the closer the timestep is to the end of the generation.
00:49:08.000 | So, during inference, the model can gradually remove the noise, conditioned on
00:49:17.600 | which timestep it is at. As for the absolute positional embedding and 3D RoPE,
00:49:24.880 | those tell the model, for each token, which position it is at in the video.
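To make the two roles concrete: the diffusion timestep is injected through the scale/shift/gate (adaptive layer norm) modulation, while the positional embeddings tell attention where each token sits in time and space. A compressed sketch of the AdaLN part, with sizes and module structure as placeholder assumptions:

```python
import torch
import torch.nn as nn

class AdaLNBlockSketch(nn.Module):
    """Placeholder sketch of adaptive layer norm conditioning on the diffusion timestep.
    The timestep embedding produces per-block scale, shift, and gate values that
    modulate the normalized hidden states; a real block would also apply attention/MLP."""
    def __init__(self, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.to_mod = nn.Linear(hidden, 3 * hidden)   # scale, shift, gate from the timestep embedding

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, L, hidden) tokens; t_emb: (B, hidden) timestep embedding
        scale, shift, gate = self.to_mod(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * h
```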
00:49:33.760 | Sure. No, I guess I was just confused about what the need is for both
00:49:43.120 | the rotary positional embedding and also the absolute
00:49:46.480 | positional embedding. Why are both of those needed?
00:49:49.920 | So, it's not strictly necessary, but it can improve the model. In fact, if you just use the absolute
00:50:01.040 | positional embedding, it can also work. Okay, I see. Okay, got it, got it. Thank you.
00:50:11.120 | Yeah, Ethan, can I ask a question? Yeah.
00:50:16.560 | Yeah. So, there was a comment in the chat about the use of vector quantization.
00:50:25.920 | Now, how is that used, actually? I don't think that it's used for selecting
00:50:35.920 | patches, but it could be used for the discrete latent space.
00:50:41.040 | It's a training technique. Basically, for the autoregressive part of the model,
00:50:49.280 | you need to have a fixed vocabulary, and the inputs are basically indices,
00:50:58.080 | like this token is number one, number two, etc. So, when you're training the tokenizer,
00:51:07.280 | you need to quantize the latents into the codebook.
00:51:12.480 | For each patch, it basically looks for the closest vector in the codebook
00:51:24.400 | and picks it out. Yeah.
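That nearest-codebook lookup is the core of vector quantization: each continuous patch embedding is replaced by the index of its closest codebook vector. A minimal sketch of the lookup only; the 64K codebook size matches the vocabulary mentioned earlier, and the rest is illustrative.

```python
import torch

def vector_quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """latents: (N, D) continuous patch embeddings; codebook: (K, D), e.g. K = 64K.
    Returns the index of the nearest codebook vector for each patch and the
    quantized embeddings. This sketches the lookup only; training a VQ tokenizer
    also needs commitment losses and a straight-through estimator."""
    dists = torch.cdist(latents, codebook)       # (N, K) pairwise L2 distances
    indices = dists.argmin(dim=1)                # discrete token ids
    quantized = codebook[indices]                # snap each patch to its nearest code
    return indices, quantized
```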
00:51:30.160 | Thanks, Ethan. I had a question about the size of the models that were posted to HuggingFace.
00:51:51.600 | How did you guys select those sizes? Did you experiment with larger sizes? Yeah, those are
00:51:58.000 | my questions. So, yeah, this is the first release, Cosmos 1.0. There might be bigger models in
00:52:12.400 | the future, because when doing research, we want to go from small to big. We're not doing
00:52:23.520 | it in one blind shot, and I think we're still in the infancy of world foundation models.
00:52:30.720 | Let's say it's kind of like the GPT-1 or GPT-2 stage of world foundation models.
00:52:39.600 | Bigger models will definitely come in the future.
00:52:42.720 | Got it. Thanks. Was there any thinking in terms of, well, this is good enough
00:52:50.320 | for most of the applications we see from, I don't know, customers or partners?
00:52:55.840 | It's not good enough yet. It can get better and better. The model now has some emergent
00:53:09.280 | physics properties in the generated videos. I would think it can get better.
00:53:19.840 | Thanks. Guys, so it looks like swyx passed the baton to me. He had to drop off for a call
00:53:48.240 | or an in-real-life meeting, and so I want to, if there are any other questions,
00:53:56.400 | encourage you to ask. Otherwise, I think we can take a little bit of time to discuss the next
00:54:04.080 | paper, and I actually have to have a hard stop at, in three minutes, so I need to drop off
00:54:14.240 | at that time. So, first of all, Ethan, this is fantastic. I hope you keep coming back to these
00:54:21.600 | paper club meetings, and even if you want to present someone else's paper rather than your
00:54:33.840 | own, certainly anytime you publish a paper, we definitely want to see you here. But if others
00:54:39.600 | publish paper and you think it's exciting and you want to share it, we definitely would love to have
00:54:44.560 | you as well. Thank you. Thank you for hosting. Yeah, I mean, swyx is the host, but I'm happy to facilitate
00:54:54.560 | where I can. Are there other questions for Ethan before we, I'm not sure how much time
00:55:01.520 | we have really to discuss the next paper, but, okay, does anyone want to volunteer? I think that
00:55:14.160 | I saw some chat, and I'm not sure about this, but I saw some noise on the Discord about people just
00:55:21.360 | picking things from the list of papers that are in our backlog, and then just giving brief, like,
00:55:29.280 | sort of very fast discussions of those. Maybe I think that in the past we've taken 10 or 15 minutes
00:55:36.880 | to just go over, summarize the paper for everyone. Probably you'll, people won't probably pre-read,
00:55:43.360 | but it'll just be a good, you know, sort of way to understand in some detail what are the key
00:55:49.760 | points from the paper. So, maybe I can post that. I think it's already in Discord, but I can post
00:55:56.800 | that in Discord. If there are people who are not on Discord, maybe I can suggest swyx
00:56:04.320 | also post that on, like, Twitter or whatever. Unless, of course,
00:56:13.520 | someone wants to volunteer to present a paper next week? Okay. Somebody asked that, for the
00:56:26.480 | Discord channel, if no one can dig that up, I suggest, I think it's on the Latent Space, like,
00:56:36.720 | on, you can dig through the Latent Space Substack, or, like, maybe there's a, I think
00:56:42.640 | there's a website, too, and you can find it there. Otherwise, you can hit me up on X, and I'll find it
00:56:51.920 | for you, or LinkedIn, as well. On both of them, my user handle is Haneke, or you can obviously
00:57:01.360 | also ask swyx, or anyone else here. Oh, there it goes. Okay, great. Okay, guys, so, grab that if
00:57:09.280 | you need it. I'm going to end the meeting, and, yeah, I got to go. So, I'm going to, I'm going to
00:57:14.000 | stop recording. Actually, I probably was supposed to stop recording earlier, but whatever,
00:57:18.400 | we can edit that. And thank you very much. We'll see you next week.
00:57:23.360 | Goodbye.
00:57:29.280 | [BLANK_AUDIO]