
NVIDIA Cosmos: World Foundation Model Platform for Physical AI - w/ Ethan He



00:00:00.000 | the 75 pages of the report. I can't cover everything in one hour. I could talk about it for
00:00:07.040 | hours. So I'll just cover what I focus on: data scaling and model scaling. First, I'll do
00:00:15.520 | an introduction of Cosmos for people who are not familiar with it. I guess the introduction is best
00:00:22.720 | left to Jensen himself. Cosmos includes autoregressive world foundation models, diffusion-based world
00:00:33.120 | foundation models, advanced tokenizers, and an NVIDIA CUDA- and AI-accelerated data pipeline.
00:00:40.400 | Cosmos models ingest text, image, or video prompts and generate virtual world states as videos.
00:00:50.400 | Cosmos generations prioritize the unique requirements of AV and robotics use cases,
00:00:55.600 | like real world environments, lighting, and object permanence.
00:01:00.400 | Developers use NVIDIA Omniverse to build physics-based geospatially accurate scenarios,
00:01:07.760 | then output Omniverse renders into Cosmos, which generates photoreal physically-based synthetic
00:01:14.720 | data. Whether diverse objects or environments, conditions like weather or time of day,
00:01:36.400 | or edge case scenarios, developers use Cosmos to generate worlds for reinforcement learning
00:01:44.320 | AI feedback to improve policy models or to test and validate model performance.
00:01:51.040 | Even across multi-sensor views, Cosmos can generate tokens in real time,
00:01:59.920 | bringing the power of foresight and multiverse simulation to AI models,
00:02:05.440 | generating every possible future to help the model select the right path.
00:02:09.680 | Working with the world's developer ecosystem,
00:02:14.400 | NVIDIA is helping advance the next wave of physical AI.
00:02:18.160 | Okay, so what's a world model? A world model takes past observations,
00:02:31.840 | as well as perturbations, and it can predict future observations.
00:02:39.200 | The perturbation can take any form: it can be actions from the physical AI,
00:02:46.320 | or it can just be some random perturbation, or a text description of the perturbation.
00:02:54.160 | So, in Cosmos 1.0, we open-sourced a family of models. We have two sets of
00:03:02.960 | world foundation models. One is based on diffusion, while the other is based on
00:03:08.240 | autoregressive models. For each family, we also built two base models and two derivatives.
00:03:15.040 | To achieve the best generation quality, we also built an upsampler for the diffusion model,
00:03:21.600 | and also a diffusion decoder to improve the video generated from the autoregressive model.
00:03:27.200 | So, these are already open-sourced on GitHub. Feel free to try them.
00:03:32.560 | So, for the diffusion world model, this is the architecture overview of it.
00:03:41.040 | So, the input video goes through a video tokenizer. Here it's called CV8x8x8:
00:03:51.360 | time and space are both compressed by 8. If you have 8 frames, they go into
00:04:04.080 | one latent frame. I assume everyone knows diffusion. The tokens are corrupted, then go through a
00:04:12.320 | diffusion transformer. The model then generates the reconstructed video during training.
00:04:22.160 | This is an example video generated from the diffusion world model.
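To make the training step just described concrete, here is a minimal sketch of one diffusion-training iteration on video latents. The tokenizer, the diffusion transformer (`dit`), and the simple noising scheme are placeholder assumptions for illustration, not the actual Cosmos implementation.

```python
import torch

# Minimal sketch of one diffusion-training step on video latents.
# `tokenizer` and `dit` (diffusion transformer) are placeholder modules, and
# the simple additive-noise scheme below is an assumption for illustration.

def diffusion_step(video, tokenizer, dit, optimizer):
    with torch.no_grad():
        latents = tokenizer.encode(video)          # e.g. 8x8x8 spatio-temporal compression, (B, C, T, H, W)

    sigma = torch.rand(latents.shape[0], device=latents.device)  # per-sample noise level
    noise = torch.randn_like(latents)
    noisy = latents + sigma.view(-1, 1, 1, 1, 1) * noise          # corrupt the tokens

    pred = dit(noisy, sigma)                       # model predicts the clean latents
    loss = torch.nn.functional.mse_loss(pred, latents)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```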
00:04:28.880 | For the autoregressive world model, it goes through a similar process, but the tokenizer
00:04:39.280 | is discrete instead of continuous. The discrete tokenizer is very similar to the ones used in
00:04:47.840 | LLMs. It converts video patches into entries of a vocabulary. There's a
00:05:02.560 | 64K vocabulary. These discrete tokens are fed into a transformer with a similar architecture to LLMs.
00:05:15.280 | Then, discrete tokens are generated. Then, there's a decoder, which is also a discrete decoder that
00:05:24.320 | decodes these tokens into videos. There has been debate on whether diffusion or autoregressive
00:05:34.160 | models are better. Since we don't know, we built both of them.
00:05:43.280 | For example, here, this is an input image for the autoregressive model. You can
00:05:51.520 | use it as the prefill for the transformer. Then, in the decoding process,
00:06:00.240 | it can decode into videos.
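A minimal sketch of the prefill-then-decode flow described above: the input image is converted to discrete tokens that prefill the transformer, and the remaining video tokens are sampled autoregressively. All module names here are hypothetical placeholders, not the Cosmos API.

```python
import torch

@torch.no_grad()
def generate_video_tokens(image, discrete_tokenizer, ar_model, num_new_tokens):
    """Placeholder sketch: prefill with image tokens, then sample video tokens."""
    tokens = discrete_tokenizer.encode(image)        # (1, prefill_len), indices into a 64K vocabulary
    for _ in range(num_new_tokens):
        logits = ar_model(tokens)                    # (1, seq_len, vocab_size)
        next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)  # greedy decoding for simplicity
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens                                    # decode to pixels with the discrete decoder afterwards
```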
00:06:06.480 | Between the two: if you want better quality of the generated result, you can go with the diffusion
00:06:16.160 | model. If you want the model to be faster, you can try the autoregressive model.
00:06:23.520 | Autoregressive also plays very well with other modalities. You can easily combine other tokens,
00:06:33.520 | like text tokens or action tokens. But here, our autoregressive model is trained purely on videos.
00:06:41.760 | We also released post-training scripts for these models. In the Cosmos paper,
00:06:55.360 | we discuss several post-training examples of the Cosmos foundation models for different physical
00:07:03.600 | AI tasks. Right now, in the GitHub repo, we support general post-training. This fine-tunes the
00:07:11.440 | world models to generate a target distribution of videos based on a custom dataset.
00:07:17.200 | The target distribution could include a specific camera spec or a specific domain.
00:07:26.000 | Here is an example. We took a few videos of a humanoid robot,
00:07:37.840 | just roughly five videos of this humanoid. The videos are from simulation.
00:07:49.280 | After fine-tuning the diffusion model, you can generate novel videos of this
00:07:57.040 | robot doing something else. The model is able to remember the characteristics of this robot
00:08:04.880 | while generating novel tasks which are not available in either simulation or
00:08:14.000 | in the real world through teleoperation. There are more post-training scripts coming soon.
00:08:26.960 | For example, instruction control: post-training the models for robotic
00:08:34.320 | manipulation to produce a video based on a textual instruction. You can instruct the robots
00:08:41.760 | to perform tasks like folding clothes or picking up objects. Also, action control:
00:08:49.440 | the post-trained model can predict both the next video frame and the next action.
00:08:59.040 | Here, the example shows a camera control. Adding the camera pose as a condition,
00:09:07.840 | you can generate 3D consistent video simulation from a single image or video.
00:09:13.440 | This can enable navigation in virtual environments. You can also do
00:09:24.000 | multi-view generation, especially for autonomous driving. You can generate synchronized multi-view
00:09:36.160 | videos from text prompts then simulate the driving scenarios with multiple camera perspectives.
00:09:43.840 | Next, I'll dive into technical details. First, I'll go over data scaling, then model scaling.
00:10:02.560 | So, we open-sourced a training framework. For the data curation part, you can sign up; it's
00:10:15.360 | coming soon. The training framework is open-sourced today. When we curate data for text,
00:10:31.680 | you can just grab the text online and the label is basically next token prediction,
00:10:39.840 | which is relatively straightforward and cheap to curate. However, for videos,
00:10:47.680 | for example, you have a video shot of someone playing basketball. You need to label it as a basketball
00:10:55.600 | player dribbling the ball and shooting it into the hoop. Labeling video data requires
00:11:02.400 | good AI models for automatic captioning. We want to control the AI models to generate
00:11:11.040 | using text we specify. Also, another challenge is that video signals are less information-dense compared to
00:11:21.680 | text. Out of, say, an hour of video, there might only be a second of interesting content.
00:11:28.560 | This is very computationally challenging and very expensive. We use distributed computing
00:11:36.800 | to solve this problem. This is the life cycle of curation. On top of the DGX Cloud platform,
00:11:47.760 | we run a streaming data pipeline on thousands of GPUs.
00:11:54.400 | The long videos go into the pipeline, where they are split and then transcoded into
00:12:04.720 | shorter clips. Then, different AI models are run on the short clips to detect high-quality
00:12:14.800 | videos. An NVIDIA VLM captioning model running on TensorRT-LLM is used to caption the videos.
00:12:27.600 | And finally, we get a training dataset.
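Conceptually, the pipeline just described (split and transcode long videos, filter clips with several models, caption, write out the dataset) looks roughly like the sketch below. Every helper is a hypothetical placeholder passed in by the caller, not a NeMo Curator or TensorRT-LLM call.

```python
# Hypothetical sketch of the curation life cycle; every component here is a
# placeholder supplied by the caller, not an actual NeMo Curator API.

def curate(long_videos, splitter, quality_filters, captioner, writer):
    for video in long_videos:
        for clip in splitter(video):                        # shot detection + transcoding into short clips
            if all(f(clip) for f in quality_filters):       # aesthetics, motion, scene consistency, ...
                writer.append(clip, caption=captioner(clip))  # VLM captioning of the surviving clips
    writer.close()                                          # final training dataset
```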
00:12:33.840 | Data curation for video foundation models is very challenging. The scale of the video data
00:12:44.480 | is hundreds of petabytes, much bigger than for previous image models. Orchestration at scale,
00:12:52.720 | with heterogeneous compute requirements of tens of AI models running efficiently together,
00:13:00.640 | is also very challenging. You have the captioning model, you have models to detect the scene change,
00:13:08.720 | you have models to detect the video consistency, aesthetic, etc. Multiple concurrent streams of
00:13:18.480 | high-throughput data exchange between AI models also impose bandwidth challenges to the cluster.
00:13:26.320 | Every single step of the curation pipeline needs to be GPU-accelerated.
00:13:36.320 | We also need to manage the resiliency of the GPU-based data pipeline at scale.
00:13:42.320 | So, each inference model needs to run at the speed of light. We go from the baseline,
00:13:57.360 | where the model runs on PyTorch, to using TensorRT-LLM to accelerate it.
00:14:05.600 | And then we run it on a larger batch to further accelerate it. And today, we use FP8 quantization
00:14:14.480 | to further accelerate it to 7x compared to the baseline.
00:14:19.200 | So, for video understanding, filtering the high-quality clips and auto-labeling is not
00:14:30.240 | enough for building a video foundation model. We need to understand a lot more about the videos
00:14:37.920 | for domain-specific training. We remove duplicated content
00:14:43.920 | and build visual-search-style understanding of the data.
00:14:47.840 | These are the later stages of the video data curation life cycle.
00:14:57.680 | After the captioning, we need to do clustering to group the data into different categories,
00:15:06.000 | sports, entertainment, robotics, etc.
00:15:11.920 | Then there is semantic deduplication to remove redundant data.
00:15:24.560 | Finally, video taxonomy further helps researchers pick the data they want to train on.
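Semantic deduplication is typically done by embedding each clip and dropping near-duplicates whose embeddings are too similar. A minimal cosine-similarity sketch, assuming clip embeddings have already been computed; real pipelines cluster first to avoid the quadratic comparison.

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal over clip embeddings of shape (N, D).
    A simplified sketch; the threshold is an illustrative assumption."""
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        # Keep a clip only if it is not too similar to any already-kept clip.
        if all(emb @ embeddings[j] < threshold for j in kept):
            kept.append(i)
    return kept  # indices of clips to retain
```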
00:15:32.720 | The takeaway for video data curation is that we built the video processing capabilities
00:15:51.680 | into NeMo Curator to enable developers to curate high-quality data and train
00:15:57.680 | highly accurate video foundation models. By leveraging end-to-end GPU acceleration
00:16:05.280 | and optimizing the data orchestration through the pipeline,
00:16:08.720 | NeMo Curator can scale to over 100 petabytes of data.
00:16:14.800 | These optimizations reduce the processing time and lower the total cost of ownership.
00:16:20.800 | The models are optimized for high throughput, enhancing overall pipeline efficiency.
00:16:28.560 | Next, let's go over the model scaling.
00:16:39.680 | So, using the NeMo video foundation model training framework, you can scale these video models
00:16:48.960 | up to 20 times larger than with traditional frameworks. The framework is capable of training
00:16:57.440 | diffusion or autoregressive foundation models up to 100 billion parameters.
00:17:06.480 | The throughput is highly optimized. We achieve roughly 450
00:17:11.600 | teraflops. That's close to 50% MFU on the H100 chips.
00:17:23.680 | This is very close to the training efficiency of LLM training.
00:17:34.640 | Previously, we talked about the scale of
00:17:37.520 | data curation. We have hundreds of petabytes of data going into the curation pipeline.
00:17:45.600 | After curation, the dataset we get consists of short video clips and images with text embeddings.
00:17:58.720 | Even though the scale of this data is much smaller, it is still
00:18:03.360 | considered relatively big if we want to train on today's clusters.
00:18:09.280 | For example, the images are on the order of 1 billion, and the videos are roughly on the order of
00:18:21.760 | 100 million clips. In the paper, we use an image and video tokenizer compression rate of 8x8x8.
00:18:34.720 | At this scale, the images are compressed to roughly 200 kilobytes. For 1 billion images,
00:18:47.120 | it's roughly at the level of 100 terabytes. For the videos, it's on petabyte level.
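A rough back-of-the-envelope check of those numbers; the per-sample latent sizes are illustrative assumptions.

```python
# Back-of-the-envelope storage estimate; per-sample sizes are illustrative assumptions.
num_images = 1e9                    # ~1 billion curated images
bytes_per_image_latent = 200e3      # ~200 KB per compressed image, as quoted above
image_total_tb = num_images * bytes_per_image_latent / 1e12
print(f"images: ~{image_total_tb:.0f} TB")   # ~200 TB, i.e. the hundred-terabyte scale mentioned

num_clips = 1e8                     # ~100 million short video clips
bytes_per_clip_latent = 10e6        # assume ~10 MB per short clip after compression
video_total_pb = num_clips * bytes_per_clip_latent / 1e15
print(f"videos: ~{video_total_pb:.1f} PB")   # petabyte scale, as stated
```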
00:18:55.600 | A solution can be storing this data on the cluster, or storing it on cloud storage like S3.
00:19:11.520 | Storing it on the cluster has huge costs; most clusters don't have that much
00:19:19.440 | storage on the clusters themselves. We provide both solutions in the open-source framework.
00:19:34.240 | We leverage Megatron Energon, which is another open source library from
00:19:39.920 | NVIDIA, to load data efficiently. It allows you to load data from web source like
00:19:49.120 | AWS S3 very efficiently without the GPU idling during training. It allows you to deterministically
00:20:01.200 | save and restore the data loader, which is one of the biggest challenges in loading from the
00:20:08.320 | web dataset. In a webdataset, the data is usually loaded sequentially. When your training is
00:20:16.880 | interrupted, with the traditional way of training, you have to randomize the loading. You won't be able to
00:20:28.640 | load non-repetitive data without Megatron Energon.
00:20:32.960 | Another challenge in loading the data is variable input data shapes.
00:20:46.560 | The data types are different. You have image, video, and you also have different durations of
00:20:55.440 | the videos. You have one second, 10 seconds, or even 50 seconds. The resolutions are different, too:
00:21:04.640 | 360p, 720p, 1080p. There are also different aspect ratios, 16 by 9, 9 by 16.
00:21:12.240 | When you're training on text, you don't have this kind of problem. In video,
00:21:24.640 | this can cause a very big problem in efficiency if we batch the data.
00:21:29.840 | The traditional approach is batching the data. For each different shape of the
00:21:42.960 | input, for example images, we batch the images into a batch of a few samples. For the videos,
00:21:53.760 | for very large videos, you might just take one video as the input. For medium sizes,
00:22:00.240 | maybe you can batch two or four into one batch. The pro is that this is commonly used for most
00:22:11.200 | of the models nowadays. For example, in ImageNet training, traditionally people just resize all of
00:22:20.640 | the image into 512 by 512 to mitigate this problem. But the challenge here is
00:22:29.760 | if you want to train on different aspect ratios and different resolutions, you need a complicated
00:22:38.720 | data loading logic to ensure that during training, at each iteration, at least the data shape is the
00:22:46.240 | same. And the efficiency is not very high because not all of the data shapes can be efficiently
00:22:54.320 | utilized by the GPU. Also, constantly changing the shape of the input data can cause challenges for
00:23:02.800 | the fused kernels. On the GPU, if all of your tensor operation shapes are the same across
00:23:13.040 | iterations, we can optimize for them, and they run more efficiently than with dynamic shapes.
00:23:20.240 | The data loading scheme we open source is called Pack Sequencing or Sequence Packing.
00:23:34.640 | Different from the traditional SBHD format, this one allows you to mix different images, videos,
00:23:45.680 | multi-modal data, whatever, with different aspect ratios, durations, and resolutions. The key is to
00:23:56.800 | reshape all of the data into one-dimensional sequence and then pack them together into
00:24:04.400 | one batch. And when you pass this into transformer, outside of self-attention,
00:24:12.720 | there's no problem at all. The MLP operation of the transformer is just a per-token operation.
00:24:21.760 | But for self-attention, we need to create a block-diagonal mask so that each of
00:24:30.640 | the samples in the sequence computes self-attention only on itself. This
00:24:38.240 | operation is handled automatically by the fused CUDA kernel. You only need to supply the
00:24:49.360 | sequence length in our training code and that's all you need to enable Pack Sequence training.
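A minimal sketch of what sequence packing looks like in practice: variable-length samples are flattened and concatenated into one long sequence, and a block-diagonal mask (or, equivalently, the cumulative sequence lengths passed to a fused attention kernel) keeps each sample attending only to itself. This is an illustration, not the actual NeMo implementation.

```python
import torch

def pack_sequences(samples):
    """samples: list of (seq_len_i, hidden) tensors of varying length.
    Returns the packed (1, total_len, hidden) batch, a block-diagonal attention
    mask, and the cumulative sequence lengths. Fused kernels usually take the
    cumulative lengths instead of materializing the full mask."""
    lengths = [s.shape[0] for s in samples]
    packed = torch.cat(samples, dim=0).unsqueeze(0)

    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in lengths:
        mask[start:start + n, start:start + n] = True   # each sample attends only within itself
        start += n

    cu_seqlens = torch.cumsum(torch.tensor([0] + lengths), dim=0)  # [0, l1, l1+l2, ...]
    return packed, mask, cu_seqlens
```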
00:24:55.840 | With this data loading scheme, the training efficiency is very high. You can see that
00:25:05.920 | at the end there is some padding; if you have a large enough max sequence length, the padding is
00:25:14.480 | very small, and the training efficiency is very close to when all of the samples have the exact
00:25:22.800 | same shape. Next, I'm covering parallelism. One of the biggest challenges in training
00:25:40.560 | on videos is the context length. Traditionally, in pre-training LLMs, the context length is really
00:25:49.360 | like 4K; nowadays, it's 8K on Llama. But when training on videos, the context length is much larger.
00:25:57.600 | Say we have five seconds of video: encoding it with an 8x8x8 tokenizer, it goes into roughly
00:26:10.320 | 60K or 70K tokens. This is 10 times larger than the sequence length of LLMs.
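To make that estimate concrete, here is a quick calculation under assumed settings; the frame rate, resolution, and the 2x2 patchification inside the transformer are illustrative assumptions.

```python
# Rough sequence-length estimate for a 5-second clip; frame rate, resolution,
# and the 2x2 patch embedding inside the transformer are illustrative assumptions.
fps, seconds = 24, 5
height, width = 704, 1280
latent_t = (fps * seconds) // 8                  # temporal compression by 8 -> 15 latent frames
latent_h, latent_w = height // 8, width // 8     # spatial compression by 8 -> 88 x 160
patch = 2                                        # assumed 2x2 patchification in the transformer
tokens = latent_t * (latent_h // patch) * (latent_w // patch)
print(tokens)                                    # 52,800, the same ballpark as the ~60-70K quoted
```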
00:26:19.360 | Context parallelism, or ring attention, is one of the key techniques we use to scale
00:26:28.320 | the diffusion transformer or the autoregressive world model to up to 1 million tokens.
00:26:38.160 | Using context parallelism, you can partition the activations of the entire transformer along the
00:26:44.480 | sequence dimension. This exploits the permutation invariance of attention to distribute the sequence
00:26:52.320 | in a ring topology. Hey, quick question, Ethan. I know for some LLM models, like even the Llama
00:27:04.240 | models that are trained up to 128K context, something that they do is they do the bulk of
00:27:09.440 | the training, like the majority of the five trillion tokens, are done at a smaller context
00:27:14.480 | length. Then in post-training, they continually train on longer context. Is that a thing in video
00:27:22.640 | gen? Can you train the majority of the model at a short clip length and then extend this and
00:27:30.160 | extrapolate it out? Yes, that's a good question. I think the bottleneck here is we don't have a
00:27:41.120 | very efficient video compressor. Even a five second video is like 60K tokens.
00:27:53.280 | If we say we train on shorter videos like one second, that also works. But for the majority
00:28:00.160 | of the training, the video foundation models, they are 10 times longer context compared to
00:28:08.160 | the LLMs. For post-training, the video models are extended to even longer context,
00:28:19.120 | say like one million tokens, to be able to generate a video roughly like one minute.
00:28:26.000 | That makes sense. It's basically the same problem, it's just a 10x scale on both sides,
00:28:32.880 | so even the short context is still there. Yes. Thank you. I'd say if we have a very good
00:28:40.320 | tokenizer in the future that can efficiently reconstruct the videos, maybe there will be a paradigm
00:28:47.120 | change. Right now, the video tokenizers Cosmos released are 8x8x8 or 8x16x16.
00:28:58.320 | Spatially, 16x16 is already near the limit. If you go beyond that, a lot of reconstruction
00:29:11.040 | artifacts will appear. Makes sense. Thank you. For video generation and inference,
00:29:22.400 | we also employ context parallel. In the open source repository, you can already use context
00:29:30.880 | parallel to accelerate the inference. For example, on 8 GPUs, using context parallel 8, you can
00:29:38.880 | generate a 5-second video under 30 seconds. Using more across different nodes, you can generate a
00:29:49.520 | video in a matter of seconds. Another challenge brought by the diffusion transformer concerns
00:30:05.360 | pipeline parallelism. Traditionally, in LLMs, for different pipeline stages, you only need to
00:30:13.440 | pass the hidden states to the next pipeline stage. But diffusion transformers have a lot of
00:30:20.160 | conditioning through adaptive layer norm, and also conditioning on text, which creates difficulty for
00:30:30.000 | pipeline parallelism. So we provide a solution that generates the additional conditionings
00:30:39.920 | on each pipeline-parallel rank. This requires slightly more compute,
00:30:45.840 | but reduces the communication cost a lot, which leads to improved performance.
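In other words, instead of sending the conditioning embeddings from the first pipeline stage to every later stage, each stage recomputes them locally from the small raw conditioning inputs. A hedged sketch of the idea, with placeholder module names:

```python
# Sketch of the idea only: every pipeline stage recomputes the (cheap) conditioning
# embeddings locally, so only the hidden states travel between stages.
# All module names are placeholders, not the actual framework API.

def pipeline_stage_forward(stage_blocks, hidden_states, timestep, text_tokens,
                           timestep_embedder, text_embedder):
    # Recomputed on every rank: a little extra compute, much less communication.
    t_emb = timestep_embedder(timestep)
    c_emb = text_embedder(text_tokens)
    for block in stage_blocks:
        hidden_states = block(hidden_states, t_emb, c_emb)  # AdaLN + cross-attention conditioning
    return hidden_states  # only this tensor is sent to the next pipeline stage
```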
00:30:58.000 | Okay. I think that's all of my presentation. Thank you for listening. Any questions?
00:31:04.800 | Hi, Ethan. Thanks a lot for joining us again. This is RJ. I asked a question at the beginning
00:31:21.760 | of the chat. I'm a little unclear about how the encoder gets, like, the encoder to the 8x8
00:31:33.760 | latent space gets trained. Is that just part of the diffusion training, or is there something,
00:31:42.560 | like, some sort of, like, a separate step that is used to train that encoder?
00:31:51.600 | Yeah, that's a good question. So a separate step is used to train the encoder.
00:31:55.920 | Tokenizer is a fancy name for it, but this is basically a vector-quantized variational autoencoder,
00:32:07.280 | a VQ-VAE. Okay, yeah.
00:32:09.520 | Yeah, you would basically train it for the task of reconstructing the videos.
00:32:18.480 | Okay, right. So – but how do you get it to create a 3D – what's it like, the
00:32:25.440 | TLDR, and how to get it to create a 3D latent space like that?
00:32:28.720 | Yeah, so the model architecture itself is a causal convolutional neural network. It can
00:32:41.040 | reconstruct – the encoder and decoder structure reconstructs the video. So the training objective
00:32:49.520 | is basically reconstructing the video. The process is you need to collect some
00:32:55.280 | diverse set of different videos, ideally in your domain, and then train this causal CNN
00:33:04.160 | to reconstruct those videos. For continuous tokens there is no codebook; they are just those continuous
00:33:12.960 | tokens. But for discrete tokenizers, you would do vector quantization to quantize into the 64K codebook.
00:33:22.480 | Okay. Is it – sorry, I didn't have time to pre-read the paper. Is this covered in the paper,
00:33:29.120 | or is there a separate paper for this? Yeah, this is covered.
00:33:32.880 | Okay, got it, got it. Thank you. This is really super interesting, exciting work. Thank you very
00:33:39.280 | much for joining us. So additionally, the tokenizer is
00:33:44.560 | frozen during the training of the transformer, because if you don't freeze the tokenizer,
00:33:53.440 | it can lead to catastrophic forgetting. You just end up with the prediction error
00:34:03.920 | showing up in the loss. Okay, got it, yeah.
00:34:11.360 | Sorry, I have also a question. I didn't find any reaction button that I can
00:34:19.040 | raise my hand. Can I ask the question right now? Yeah.
00:34:22.480 | Okay, perfect. So my question is about the open source framework for pre-training that you
00:34:28.480 | mentioned. I think it was NEMO, right? Yes.
00:34:31.600 | Yes. So do you think, potentially, if I have a set of videos, but those videos, originally,
00:34:40.240 | they were not necessarily in the RGB space, okay? So I don't know, for example, satellites,
00:34:46.000 | or anything, a spectral wavelength, or whatever. And I just somehow mapped them to videos. Do you
00:34:51.600 | think I can still customize your framework and just pre-train my own tokenizer, or basically
00:35:00.800 | whatever else that exists in that framework? Yeah, if your data domain is different from
00:35:11.200 | video, it's recommended to fine-tune the tokenizers. So just fine-tuning, do you think
00:35:18.560 | that's going to work? So, because if the tokenizer is not fine-tuned, it might produce some artifacts
00:35:30.560 | for your data if your data domain is different. Sorry, yeah, go ahead.
00:35:39.600 | After fine-tuning the tokenizer, you might also want to fine-tune the diffusion transformer or
00:35:46.720 | autoregressive transformer. Yeah, both of these are supported in the framework.
00:35:51.200 | Awesome. And, you know, I can also pre-train the tokenizer using the current framework.
00:35:58.560 | Or fine-tune. Yeah.
00:36:00.400 | Thank you.
00:36:02.080 | Thanks for the presentation. I had a quick question related to some of the,
00:36:10.720 | well, you mentioned it's coming soon, for multi-view generation and more camera control.
00:36:15.200 | So, curious if you could speak any more towards how you're approaching multi-view,
00:36:21.040 | or how to make sure that the camera intrinsics correlate between one another,
00:36:27.120 | you know, if they're all video-based generation versus having a true, like,
00:36:31.120 | grounded scene understanding, how you guys are approaching that.
00:36:34.960 | Yes, that's a good question. So, these are coming soon, but the techniques are covered in the paper.
00:36:44.000 | For example, for multi-view generation, the different views are folded into
00:36:50.800 | one of the dimensions of the data, so the model input is still roughly the same.
00:36:58.960 | In fact, they are folded into the time axis.
00:37:04.000 | And the camera intrinsics are not used now, because
00:37:13.440 | if you have consistent intrinsics, you don't have this problem. But
00:37:19.040 | if your intrinsics change across different training data, I guess it's helpful to
00:37:27.280 | include that in the conditioning information. At least in this example, we use consistent intrinsics.
00:37:42.640 | Yeah, so you're saying it has more to do with, perhaps, more the training data that you're using
00:37:47.600 | to post-train these models, to have it be consistent and
00:37:51.120 | have similar intrinsics? Is that sort of what you're saying?
00:37:55.040 | Yeah, yes.
00:37:57.120 | All right, okay.
00:38:08.400 | I, I can, I can answer questions in the, in the chat. Yeah, I wasn't looking at it.
00:38:15.200 | Yeah, so what does a token represent in this case? One pixel of the video? So,
00:38:25.360 | yeah, the tokens are patches of video. For an image, an 8x8 patch is one token.
00:38:38.720 | For a video, an 8x8x8 patch is one token.
00:38:44.560 | That means, roughly, for one second of video, if it's 30 frames,
00:38:50.560 | in the time domain you have about four tokens.
00:38:59.040 | And spatially, that depends on your resolution.
00:39:07.840 | Yeah, so the video doesn't have a depth map, but it can be,
00:39:12.080 | you can add it in the post-training process.
00:39:16.000 | What's the different, difference between post-training and fine-tuning?
00:39:24.400 | I'd say post-training is a fancy word for fine-tuning. Nowadays,
00:39:33.040 | fine-tuning sometimes refers specifically to certain techniques, as opposed to
00:39:40.320 | just continued pre-training, but I would use these two words interchangeably.
00:39:45.680 | Oh, the number of tokens each of these foundation models is trained on. So, for pre-training,
00:40:02.080 | it's at the level of a hundred million video clips. And
00:40:10.080 | each video clip is roughly five seconds.
00:40:15.120 | Using that information, you can calculate the number of tokens.
00:40:21.360 | I'd say it's roughly on the scale of
00:40:24.160 | 10 trillion tokens, at least 10 trillion or more. You can calculate it for yourself.
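Carrying out that calculation with the rough figures mentioned above (clip count and per-clip token count are order-of-magnitude estimates):

```python
# Order-of-magnitude estimate of pre-training tokens from the figures above.
num_clips = 1e8                 # ~100 million video clips
tokens_per_clip = 6e4           # ~60K tokens for a ~5-second clip (see the earlier estimate)
epochs = 1                      # assume roughly one pass for the estimate
total_tokens = num_clips * tokens_per_clip * epochs
print(f"~{total_tokens:.0e} video tokens")
# ~6e12 per pass; with multiple passes and images included, this lands on the
# order of the "10 trillion tokens or more" quoted above.
```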
00:40:32.000 | [silence]
00:40:46.480 | Yeah, what type of hardware is adequate for post-training on our own data? So,
00:40:55.520 | the post-training that is open-sourced now needs eight H100s for the
00:41:01.840 | diffusion model and two H100s for the autoregressive model.
00:41:08.160 | But with some techniques, like activation offloading or LoRA,
00:41:15.280 | I believe smaller GPUs can also be used for post-training.
00:41:20.080 | [silence]
00:41:35.440 | So, the "world" in our model name: we want to emphasize that the model
00:41:46.720 | has spatial consistency, and we're aiming to provide the best foundation model for
00:41:54.560 | robotics post-training.
00:41:57.440 | [silence]
00:42:10.880 | Okay, I think that's all the questions in the chat. Any more questions?
00:42:16.640 | [silence]
00:42:18.800 | Hi Ethan, thanks for the talk. I had a question. So, for, you said, for identifying high-quality
00:42:27.040 | videos, you, high-quality clips, you filter them out first, right? How do you do that?
00:42:33.200 | Do you use, like, some already available open models or do you train your own models for that?
00:42:38.320 | [silence]
00:42:41.600 | Yeah, that's a good question. So, there are different metrics for
00:42:48.160 | filtering videos. There are both heuristics and models. A heuristic is, for example, if the
00:42:55.760 | video is static, it's basically an image, so it's not a good video. Or you can also train a model to
00:43:04.640 | classify the quality of the video, like an aesthetic score. That might need
00:43:15.120 | some extra training and labeling. There's also motion scoring, like how much motion is in the video.
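As an illustration of the heuristic side, a crude motion score can be computed from mean frame-to-frame differences and used to drop near-static clips; the threshold below is an arbitrary assumption, and aesthetic scoring would use a learned classifier instead.

```python
import numpy as np

def motion_score(frames: np.ndarray) -> float:
    """frames: (T, H, W, C) array with T >= 2. Mean absolute frame-to-frame
    difference; a value near zero means the clip is essentially a static image."""
    frames = frames.astype(np.float32)
    return float(np.mean(np.abs(np.diff(frames, axis=0))))

def keep_clip(frames: np.ndarray, min_motion: float = 2.0) -> bool:
    # The threshold is an arbitrary illustration, not a tuned value.
    return motion_score(frames) > min_motion
```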
00:43:21.520 | [silence]
00:43:25.520 | So, in your case, you guys trained a custom model for that,
00:43:29.360 | based on these metrics, maybe motion or based on the aesthetics?
00:43:33.040 | [silence]
00:43:34.640 | There are a lot of open-source models available already; you can check them out.
00:43:41.520 | There are aesthetic classifiers, etc. Yeah.
00:43:47.040 | [silence]
00:43:49.120 | Okay, thanks.
00:43:50.000 | [silence]
00:44:00.560 | Um, another quick question is, you know, as Cosmos develops or releases more iterations,
00:44:06.960 | how do you foresee approaches to adding more controllability within the scenario?
00:44:13.760 | So, more refined control over what's happening in the scene, and what variables you want to
00:44:19.120 | change versus not to change? Sort of inherent to, you know, video generation in general, I think
00:44:24.320 | you don't have as much control, and curious if you're seeing that as a requirement, and how,
00:44:28.560 | how you're thinking of approaching it.
00:44:29.840 | [silence]
00:44:31.920 | Yeah, I think that's very important for post-training.
00:44:35.360 | That also depends on the use case. Say, depending on your data,
00:44:45.120 | if your data has more parameters you can use as conditioning,
00:44:51.920 | I think adding them into the training would definitely help.
00:44:56.240 | Yeah, if you have additional camera intrinsics, if you have additional
00:45:04.160 | cameras as conditions, or additional signals like audio, all of them can be used as conditioning.
00:45:11.680 | The model is quite flexible for adding additional conditioning.
00:45:18.720 | For the diffusion model, you can add it through cross-attention,
00:45:23.840 | and similarly for the autoregressive model.
00:45:26.800 | [silence]
00:45:37.440 | Ethan, I have another question that's somewhat related.
00:45:41.840 | I was a little confused about how much of the ability to generate
00:45:53.040 | realistic physics and, sort of, world models
00:46:00.640 | is due to training versus some inductive bias in the model, and,
00:46:06.960 | inasmuch as it was inductive bias, what were the key things there?
00:46:13.840 | So, I think the two key things are data and scale.
00:46:21.360 | As the models grow larger and larger, a lot of the
00:46:32.720 | 3D capability, consistency, and physics intrinsics automatically appear when the model
00:46:41.440 | is bigger. Another thing is data. I think in the data, you need to have enough
00:46:48.480 | demonstrations of different physical properties for the model to learn.
00:46:55.440 | That is to say, the model itself doesn't have a lot of
00:46:58.160 | inductive bias. We're just using transformers. There's no
00:47:05.360 | spatial attention, temporal attention, those kinds of things.
00:47:10.800 | Okay, got it. Thank you.
00:47:16.240 | [no audio]
00:47:34.240 | If, if there aren't other questions, I actually have one more. So, in, in the, sort of,
00:47:42.000 | like, the original diagram of the architecture, there's some, some things that I didn't understand
00:47:50.320 | about the, the positional embeddings. Like, there's the, there's, like, two different
00:47:57.840 | positional embeddings, or three different positional embeddings, I think. Yeah, so,
00:48:02.240 | there's, like, this absolute positional embedding. And then, actually, there's another diagram that,
00:48:07.600 | where there's another positional embedding that goes into the cross attention, I think.
00:48:12.080 | Yeah, this, or, well, it's, I'm not sure what that is, that time step in this
00:48:16.880 | scale shift gate. So, I got, I was kind of confused about what the purpose of all these are.
00:48:21.520 | Yeah. So, the timestep is specific to the diffusion models. You know, in
00:48:30.240 | the diffusion process, you go through multiple steps to remove noise, and it becomes
00:48:37.760 | a clear and crisp video, right? So, during training, the process is that
00:48:45.760 | you randomly apply some noise to the tokens, and you also need to indicate to the model
00:48:54.560 | how much noise was added. If there is more noise, the timestep is an earlier step;
00:49:01.760 | the less noise, the closer the timestep is to the end of the generation.
00:49:08.000 | So, during inference, the model can gradually remove the noise, conditioned on
00:49:17.600 | which timestep it is at. As for the absolute positional embedding and 3D RoPE,
00:49:24.880 | those tell the model, for each token, which position it is at in the video.
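To make the two roles concrete: the diffusion timestep is injected through the scale/shift/gate (adaptive layer norm) modulation, while the positional embeddings tell attention where each token sits in time and space. A compressed sketch of the AdaLN part, with sizes and module structure as placeholder assumptions:

```python
import torch
import torch.nn as nn

class AdaLNBlockSketch(nn.Module):
    """Placeholder sketch of adaptive layer norm conditioning on the diffusion timestep.
    The timestep embedding produces per-block scale, shift, and gate values that
    modulate the normalized hidden states; a real block would also apply attention/MLP."""
    def __init__(self, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.to_mod = nn.Linear(hidden, 3 * hidden)   # scale, shift, gate from the timestep embedding

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, L, hidden) tokens; t_emb: (B, hidden) timestep embedding
        scale, shift, gate = self.to_mod(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * h
```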
00:49:33.760 | Sure. No, I guess I was just confused about what the need is for both
00:49:43.120 | the rotary positional embedding and also the absolute
00:49:46.480 | positional embedding. Why are both of those needed?
00:49:49.920 | So, it's not strictly necessary, but it can improve the model. In fact, if you just use the absolute
00:50:01.040 | positional embedding, it can also work. Okay, I see. Okay, got it, got it. Thank you.
00:50:11.120 | Yeah, Ethan, can I ask a question? Yeah.
00:50:16.560 | Yeah. So, there was a comment in the chat about the use of vector quantization.
00:50:25.920 | Now, how is that used, actually? I don't think that it's used for selecting
00:50:35.920 | patches, but it could be used for the discrete latent space.
00:50:41.040 | It's a training technique. Basically, for the autoregressive part of the model,
00:50:49.280 | you need to have a fixed vocabulary, and the inputs are basically indices,
00:50:58.080 | like this token is number one, number two, etc. So, when you're training the tokenizer,
00:51:07.280 | you need to quantize the latents into the codebook.
00:51:12.480 | For each patch, it basically looks for the closest vector in the codebook
00:51:24.400 | and picks it out. Yeah.
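That nearest-codebook lookup is the core of vector quantization: each continuous patch embedding is replaced by the index of its closest codebook vector. A minimal sketch of the lookup only; the 64K codebook size matches the vocabulary mentioned earlier, and the rest is illustrative.

```python
import torch

def vector_quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """latents: (N, D) continuous patch embeddings; codebook: (K, D), e.g. K = 64K.
    Returns the index of the nearest codebook vector for each patch and the
    quantized embeddings. This sketches the lookup only; training a VQ tokenizer
    also needs commitment losses and a straight-through estimator."""
    dists = torch.cdist(latents, codebook)       # (N, K) pairwise L2 distances
    indices = dists.argmin(dim=1)                # discrete token ids
    quantized = codebook[indices]                # snap each patch to its nearest code
    return indices, quantized
```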
00:51:30.160 | Thanks, Ethan. I had a question about the size of the models that were posted to HuggingFace.
00:51:51.600 | How did you guys select those sizes? Did you experiment with larger sizes? Yeah, those are
00:51:58.000 | my questions. So, yeah, this is the first release, Cosmos 1.0. There might be bigger models in
00:52:12.400 | the future, because when doing research, we want to go from small to big. We're not doing
00:52:23.520 | it in one blind shot, and I think we're still in the infancy of world foundation models.
00:52:30.720 | Let's say it's kind of like the GPT-1 or GPT-2 stage of world foundation models.
00:52:39.600 | Bigger models will definitely come in the future.
00:52:42.720 | Got it. Thanks. Was there any thinking in terms of, well, this is good enough
00:52:50.320 | for most of the applications we see from, I don't know, customers or partners?
00:52:55.840 | It's not good enough yet. It can get better and better. The model now has some emergent
00:53:09.280 | physics properties in the generated videos. I would think it can get better.
00:53:19.840 | Thanks. Guys, so it looks like swyx passed the baton to me. He had to drop off for a call
00:53:48.240 | or an in-real-life meeting, and so I want to, if there are any other questions,
00:53:56.400 | encourage you to ask. Otherwise, I think we can take a little bit of time to discuss the next
00:54:04.080 | paper, and I actually have to have a hard stop at, in three minutes, so I need to drop off
00:54:14.240 | at that time. So, first of all, Ethan, this is fantastic. I hope you keep coming back to these
00:54:21.600 | paper club meetings, and even if you want to present someone else's paper rather than your
00:54:33.840 | own, certainly anytime you publish a paper, we definitely want to see you here. But if others
00:54:39.600 | publish paper and you think it's exciting and you want to share it, we definitely would love to have
00:54:44.560 | you as well. Thank you. Thank you for hosting. Yeah, I mean, swyx is the host, but I'm happy to facilitate
00:54:54.560 | where I can. Are there other questions for Ethan before we, I'm not sure how much time
00:55:01.520 | we have really to discuss the next paper, but, okay, does anyone want to volunteer? I think that
00:55:14.160 | I saw some chat, and I'm not sure about this, but I saw some noise on the Discord about people just
00:55:21.360 | picking things from the list of papers that are in our backlog, and then just giving brief, like,
00:55:29.280 | sort of very fast discussions of those. Maybe I think that in the past we've taken 10 or 15 minutes
00:55:36.880 | to just go over, summarize the paper for everyone. Probably you'll, people won't probably pre-read,
00:55:43.360 | but it'll just be a good, you know, sort of way to understand in some detail what are the key
00:55:49.760 | points from the paper. So, maybe I can post that. I think it's already in Discord, but I can post
00:55:56.800 | that in Discord. If there are people who are not on Discord, maybe I can suggest swyx
00:56:04.320 | also post that on, like, Twitter or whatever. Unless, of course,
00:56:13.520 | someone wants to volunteer to present a paper next week? Okay. Somebody asked that, for the
00:56:26.480 | Discord channel, if no one can dig that up, I suggest, I think it's on the Latent Space, like,
00:56:36.720 | on, you can dig through the Latent Space Substack, or, like, maybe there's a, I think
00:56:42.640 | there's a website, too, and you can find it there. Otherwise, you can hit me up on X, and I'll find it
00:56:51.920 | for you, or LinkedIn, as well. On both of them, my user handle is Haneke, or you can obviously
00:57:01.360 | also ask swyx, or anyone else here. Oh, there it goes. Okay, great. Okay, guys, so, grab that if
00:57:09.280 | you need it. I'm going to end the meeting, and, yeah, I got to go. So, I'm going to, I'm going to
00:57:14.000 | stop recording. Actually, I probably was supposed to stop recording earlier, but whatever,
00:57:18.400 | we can edit that. And thank you very much. We'll see you next week.
00:57:23.360 | Goodbye.
00:57:29.280 | [BLANK_AUDIO]