Back to Index

NVIDIA Cosmos: World Foundation Model Platform for Physical AI - w/ Ethan He


Transcript

The report is 75 pages; I can't cover everything in one hour. I could talk about it for hours, so I'll just cover what I focused on: data scaling and model scaling. First, I'll do an introduction of Cosmos for people who are not familiar with it. I guess the introduction is best left to Jensen himself.

It includes autoregressive world foundation models, diffusion-based world foundation models, advanced tokenizers, and an NVIDIA CUDA- and AI-accelerated data pipeline. Cosmos models ingest text, image, or video prompts and generate virtual world states as videos. Cosmos generations prioritize the unique requirements of AV and robotics use cases, like real-world environments, lighting, and object permanence.

Developers use NVIDIA Omniverse to build physics-based geospatially accurate scenarios, then output Omniverse renders into Cosmos, which generates photoreal physically-based synthetic data. Whether diverse objects or environments, conditions like weather or time of day, or edge case scenarios, developers use Cosmos to generate worlds for reinforcement learning AI feedback to improve policy models or to test and validate model performance.

Even across multi-sensor views, Cosmos can generate tokens in real time, bringing the power of foresight and multiverse simulation to AI models, generating every possible future to help the model select the right path. Working with the world's developer ecosystem, NVIDIA is helping advance the next wave of physical AI. Okay, so what's a world model?

A world model takes past observations plus a perturbation and predicts future observations. The perturbation can take many forms: it can be actions from the physical AI, a random perturbation, or a text description of the perturbation. So, in Cosmos 1.0, we open-sourced a family of models.

We have two sets of world foundation models: one is based on diffusion, while the other is based on autoregressive models. For each family, we also built two base models and two derivatives. To achieve the best generation quality, we also built an upsampler for the diffusion model, and a diffusion decoder to improve the video generated from the autoregressive model.

These are already open-sourced on GitHub, so feel free to try them. For the diffusion world model, this is the architecture overview. The input video goes through a video tokenizer, here called CV8x8x8: the temporal and spatial dimensions are each compressed by 8, so 8 frames are compressed into one latent frame.
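To make the 8x8x8 compression concrete, here is a small back-of-envelope sketch in Python. The 1280x720 resolution and the 8-frame clip are assumptions for illustration, not figures from the talk.

```python
# Rough effect of an 8x8x8 (time x height x width) video tokenizer.
# Assumptions (illustrative only): a 1280x720 clip of 8 frames.
frames, height, width = 8, 720, 1280
t_stride = h_stride = w_stride = 8   # CV8x8x8 compression factors

latent_frames = max(1, frames // t_stride)  # 8 input frames -> 1 latent frame
latent_h = height // h_stride               # 720 -> 90
latent_w = width // w_stride                # 1280 -> 160
print(latent_frames, latent_h, latent_w)    # 1 90 160
```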

I assume everyone knows diffusion. The tokens are corrupted, then go through a diffusion transformer, and the model reconstructs the video during training. This is an example video generated from the diffusion world model. The autoregressive world model goes through a similar process, except its tokenizer is discrete instead of continuous.

The discrete tokenizer is very similar to what LLMs use: it converts video patches into entries of a 64k vocabulary. These discrete tokens are fed into a transformer with an architecture similar to LLMs, which then generates new discrete tokens. Finally, a discrete decoder decodes these tokens back into video.

There has been debate on whether diffusion or autoregressive models are better; since we don't know, we built both. For example, here is an input image for the autoregressive model. You can use it as a pre-fill prompt for the transformer; then, during decoding, it generates the video.

If you want better quality in the generated result, go with the diffusion model. If you want the model to be faster, try the autoregressive model. Autoregressive also plays very well with other modalities: you can easily mix in other tokens, like text tokens or action tokens.

But here, our autoregressive model is trained purely on videos. We also released post-training scripts for these models. In the Cosmos paper, we discuss several post-training examples of the Cosmos foundation models for different Physical AI tasks. Right now, on GitHub, we support general post-training, which fine-tunes the world models to generate a target distribution of videos based on a custom dataset.

The target distribution could include a specific camera spec or a specific domain. Here is an example: we took roughly five videos of a humanoid robot, captured in simulation. After fine-tuning the diffusion model, you can generate novel videos of this robot doing something else.

The model is able to remember the characteristics of this robot while generating novel tasks that would not be possible to collect in either simulation or the real world through teleoperation. There are more post-training scripts coming soon. For example, instruction control: post-training the models for robotic manipulation to produce a video based on a textual instruction.

You can instruct the robots to perform tasks like folding clothes or picking up objects. There is also action control: the post-trained model can predict both the next video frame and the next action. Here, the example shows camera control: adding the camera pose as a condition, you can generate 3D-consistent video simulation from a single image or video.

This can enable navigation in virtual environments. You can also do multi-view generation, especially for autonomous driving: you can generate synchronized multi-view videos from text prompts and simulate driving scenarios with multiple camera perspectives. Next, I'll dive into technical details: first data scaling, then model scaling.

So, we open-sourced a training framework; the data curation part is coming soon, and you can sign up for it. When you curate data for text, you can just grab text from the web, and the label is basically next-token prediction, which is relatively straightforward and cheap to curate.

However, for videos, say you have a shot of someone playing basketball: you need to label it as a basketball player dribbling the ball and shooting it into the hoop. Labeling video data at scale requires good AI models for automatic captioning, because we want to be able to control the generative models with text we specify.

Another challenge is that the video signal is much sparser than text: out of an hour of video, there might only be a second of interesting content. This makes curation computationally challenging and very expensive, so we use distributed computing to solve the problem. This is the life cycle of curation.

So, on top of the DGX Cloud platform, we run a Ray Data-based streaming pipeline on thousands of GPUs. Long videos go into the pipeline, are split and transcoded into shorter clips, and then different AI models run on the short clips to detect high-quality videos.

An NVIDIA VLM captioning model running on TensorRT-LLM is used to caption the videos, and finally we get a training dataset. Data curation for video foundation models is very challenging. The scale of the video data is hundreds of petabytes, much bigger than for previous image models. Orchestration at scale, with the heterogeneous compute requirements of tens of AI models running efficiently together, is also very challenging.

You have the captioning model, models to detect scene changes, and models to score video consistency, aesthetics, and so on. Multiple concurrent streams of high-throughput data exchange between AI models also impose bandwidth challenges on the cluster. Every single step of the curation pipeline needs to be GPU-accelerated.

We also need to manage the resiliency of the GPU-based data pipeline at scale. So, each inference model needs to run at the speed of light. We go from the baseline, where the model runs in PyTorch, to using TensorRT-LLM for acceleration, and then we run with a larger batch size to accelerate it further.

And today, we use FP8 quantization to accelerate it further, to 7x over the baseline. On video understanding: filtering for high-quality clips and auto-labeling is not enough to build a video foundation model. We need to understand a lot more about the videos for domain-specific training, remove duplicated content, and enable visual search and understanding of the data.

So, this is the next part of the life cycle of video data curation. After captioning, we do clustering to group the data into different categories: sports, entertainment, robotics, and so on. Then there is semantic deduplication to remove redundant data. Finally, video taxonomy further helps researchers pick the data they want to train on.
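As a rough illustration of the semantic deduplication step, here is a minimal sketch assuming you already have one embedding per clip (for example from a video or caption encoder). The greedy loop and the 0.95 threshold are illustrative; at the scale discussed here this would be done with clustering or approximate nearest-neighbor search rather than a quadratic scan, and it is not the exact NeMo Curator algorithm.

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal on clip embeddings.

    Keeps a clip only if its cosine similarity to every already-kept clip
    is below `threshold`; returns the indices of the kept clips.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, emb in enumerate(normed):
        if all(float(emb @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy usage: four clips, where clip 3 is a near-duplicate of clip 0.
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 16))
embs[3] = embs[0] + 1e-3 * rng.normal(size=16)
print(semantic_dedup(embs))   # [0, 1, 2]
```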

The takeaway for the video data curation is we build the video processing capabilities into Nemo Curator to enable the developers to curate high-quality data and train highly accurate video foundation models. By leveraging end-to-end GPU acceleration and optimizing the data orchestration through the pipeline, Nemo Curator can scale to over 100 petabytes of data.

Other optimizations reduce processing time and lower the total cost of ownership, and the models are optimized for high throughput, enhancing overall pipeline efficiency. Next, let's go over model scaling. Using the NeMo video foundation model training framework, you can scale these video models up to 20 times larger than with traditional frameworks.

The framework is capable of training diffusion or autoregressive foundation models of up to 100 billion parameters. The throughput is highly optimized: we achieve roughly 450 TFLOPS per GPU, which is close to 50% MFU on H100 chips. That is very close to the training efficiency of LLM training.
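As a sanity check on the "roughly 450 TFLOPS is close to 50% MFU" figure, here is the arithmetic, assuming the commonly quoted H100 SXM dense BF16 peak of about 989 TFLOPS:

```python
# Model FLOPs Utilization (MFU) = achieved throughput / peak hardware throughput.
achieved_tflops = 450.0    # per-GPU training throughput quoted in the talk
h100_bf16_peak = 989.0     # approx. H100 SXM dense BF16 peak, in TFLOPS

mfu = achieved_tflops / h100_bf16_peak
print(f"MFU ~ {mfu:.1%}")  # ~45.5%, i.e. close to 50%
```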

Previously, we talked about the scale of data curation: we had hundreds of petabytes of data going into the curation pipeline. After curation, the dataset we get consists of short video clips and images with text embeddings. Even though this is much smaller, it is still relatively big if we want to train on today's clusters.

For example, the images are on the order of 1 billion, and the videos are on the order of 100 million. In the paper, we use an image and video tokenizer compression rate of 8x8x8. With this compression, each image is roughly 200 kilobytes, so 1 billion images is roughly at the level of 100 terabytes.

For the videos, it's at the petabyte level. One option is to store this data on the cluster; another is to store it in cloud storage like S3. Storing it on the cluster has a huge cost, and most clusters don't have that much storage themselves. We provide both solutions in the open-source framework.

We leverage Megatron Energon, another open-source library from NVIDIA, to load data efficiently. It lets you load data from web sources like AWS S3 without the GPUs idling during training, and it lets you deterministically save and restore the data loader state, which is one of the biggest challenges when loading from a WebDataset.

In a WebDataset, the data is usually loaded sequentially. When training is interrupted, with the traditional approach you have to restart loading from a random point; without Megatron Energon, you won't be able to resume without repeating data. Another challenge in loading the data is variable input shapes. The data types are different.

You have images and videos, and the videos have different durations: one second, 10 seconds, or even 50 seconds. The resolutions are different, say 720p or 1080p, and there are different aspect ratios, 16:9 or 9:16. When you're training on text, you don't have this kind of problem.

In video, this can cause a big efficiency problem when we batch the data. The traditional approach is to batch by shape: for each input shape, for example images, we batch several samples together; for very large videos, a batch might be just one video.

For medium-sized videos, maybe you can batch two or four together. The advantage is that this is how most models are trained today; for example, in ImageNet training, people traditionally just resize every image to 512 by 512 to sidestep the problem. But if you want to train on different aspect ratios and resolutions, you need complicated data loading logic to ensure that within each training iteration the data shapes are the same.

And the efficiency is not very high, because not every data shape can be utilized efficiently by the GPU. Constantly changing the input shape also poses challenges for fused kernels: on GPU, if all of your tensor operation shapes stay the same across iterations, we can optimize for them, and they run more efficiently than with dynamic shapes.

The data loading scheme we open-sourced is called sequence packing. Unlike the traditional SBHD format, it lets you mix images, videos, and other modalities, as well as different aspect ratios, durations, and resolutions. The key is to reshape all of the data into one-dimensional sequences and then pack them together into one batch.

When you pass this into the transformer, everything outside of self-attention is fine, since the MLP is just a per-token operation. For self-attention, though, we need to create a block-diagonal mask so that each sample in the packed sequence only attends to itself, as sketched below.
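Here is a minimal PyTorch sketch of that masking: variable-length samples are flattened into one packed sequence, and the block-diagonal mask is built from the per-sample lengths. This is only to illustrate the idea; as noted next, the fused kernel consumes sequence lengths directly and never materializes the full mask.

```python
import torch

def block_diagonal_mask(seq_lens: list[int]) -> torch.Tensor:
    """True where attention is allowed: each token attends only within its own sample."""
    sample_id = torch.repeat_interleave(
        torch.arange(len(seq_lens)), torch.tensor(seq_lens)
    )                                                # e.g. [0,0,0,0, 1,1,1, 2,2]
    return sample_id[:, None] == sample_id[None, :]  # (total_len, total_len) bool mask

# Three samples of lengths 4, 3, and 2 packed into one 9-token sequence.
seq_lens = [4, 3, 2]
mask = block_diagonal_mask(seq_lens)
print(mask.shape)   # torch.Size([9, 9])
print(mask.int())   # block-diagonal pattern of ones
```

With a fused variable-length attention kernel you would instead pass the cumulative sequence lengths, here [0, 4, 7, 9], and skip the quadratic mask entirely.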

In practice, this masking is handled automatically by the fused CUDA kernel; you only need to supply the sequence lengths in our training code, and that's all it takes to enable sequence packing. With this data loading scheme, the training efficiency is very high, and you can see that at the end there is some padding.

If your maximum sequence length is large enough, the padding is already very small, and the training efficiency is very close to the case where all samples have exactly the same shape. Next, I'm covering parallelism. One of the biggest challenges in training on videos is the context length.

Traditionally, in pre-training LLMs the context length is around 4K; nowadays it's 8K on Llama. But for training on videos, the context length is much larger. Say we have five seconds of video: encoding it with an 8x8x8 tokenizer gives roughly 60K to 70K tokens, about 10 times the sequence length of LLMs.
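To see where that 60K-70K figure can come from, here is a back-of-envelope calculation. The 720p resolution, the 24 fps frame rate, and the extra 2x2 spatial patchification inside the transformer are assumptions for illustration (patchifying latents is common in diffusion transformers), not exact numbers from the talk.

```python
# Back-of-envelope token count for a 5-second clip.
seconds, fps = 5, 24
height, width = 720, 1280

latent_frames = (seconds * fps) // 8           # 120 frames -> 15 latent frames
latent_h, latent_w = height // 8, width // 8   # 90 x 160 latents per frame
patch = 2                                      # assumed 2x2 patch embedding in the transformer

tokens = latent_frames * (latent_h // patch) * (latent_w // patch)
print(tokens)   # 54000, i.e. on the order of 60K tokens for 5 seconds
```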

Context parallelism, or ring attention, is one of the key techniques we use to scale the diffusion transformer and the autoregressive world model up to 1 million tokens. With context parallelism, you shard the activations of the entire transformer along the sequence dimension; this exploits the permutation invariance of attention to distribute the sequence in a ring topology.
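The core of ring attention is that each rank keeps its own slice of the queries and streams key/value blocks received from its ring neighbors, merging the partial results with an online softmax. Below is a single-process sketch of just that merge step (the distributed send/recv around the ring is omitted); it is a conceptual illustration, not the Megatron or Cosmos implementation.

```python
import torch

def streaming_attention(q, k_chunks, v_chunks):
    """Attention over K/V delivered chunk by chunk, merged with an online softmax.

    q: (n_q, d) local queries; k_chunks/v_chunks: lists of (n_kv_i, d) blocks,
    as if each block arrived from a different context-parallel rank.
    """
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))   # running row-wise max
    l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
    acc = torch.zeros_like(q)                        # running weighted sum of values
    for k, v in zip(k_chunks, v_chunks):
        s = (q @ k.T) * scale
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)
        corr = torch.exp(m - m_new)                  # rescale previous partial results
        l = l * corr + p.sum(dim=-1, keepdim=True)
        acc = acc * corr + p @ v
        m = m_new
    return acc / l

# Check against ordinary full attention.
q, k, v = torch.randn(4, 8), torch.randn(12, 8), torch.randn(12, 8)
ref = torch.softmax((q @ k.T) * 8 ** -0.5, dim=-1) @ v
out = streaming_attention(q, list(k.chunk(3)), list(v.chunk(3)))
print(torch.allclose(out, ref, atol=1e-5))   # True
```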

Hey, quick question, Ethan. I know for some LLM models, even the Llama models that are trained up to 128K context, what they do is the bulk of the training, the majority of the five trillion tokens, is done at a smaller context length. Then in post-training, they continue training on longer context.

Is that a thing in video gen? Can you train the majority of the model at a short clip length and then extend this and extrapolate it out? Yes, that's a good question. I think the bottleneck here is we don't have a very efficient video compressor. Even a five second video is like 60K tokens.

If we train on shorter videos, like one second, that also works. But even so, for the majority of training, video foundation models have 10 times longer context than LLMs. For post-training, the video models are extended to even longer context, say one million tokens, to be able to generate videos of roughly one minute.

That makes sense. It's basically the same problem at 10x scale on both sides, so even the short context is still large. Yes. Thank you. I'd say if, in the future, we have a very good tokenizer that can efficiently reconstruct videos, maybe the paradigm changes.

Right now, the video tokenizers Cosmos releases are 8x8x8 or 8x16x16. Spatially, 16x16 is already near the limit; if you go beyond that, a lot of reconstruction artifacts appear. Makes sense. Thank you. For video generation and inference, we also employ context parallelism. In the open-source repository, you can already use context parallelism to accelerate inference.

For example, on 8 GPUs with a context parallel size of 8, you can generate a 5-second video in under 30 seconds. Using more GPUs across different nodes, you can generate a video in a matter of seconds. Another challenge brought by the diffusion transformer concerns pipeline parallelism. Traditionally, in LLMs, each pipeline stage only needs to pass the hidden states to the next stage.

But diffusion transformers have a lot of conditioning, adaptive layer norm (AdaLN), and also conditioning on text, which creates difficulty for pipeline parallelism. So our solution is to generate the additional conditionings on each pipeline-parallel rank. This uses slightly more compute, but greatly reduces the communication cost, which leads to improved performance.
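A small sketch of that trade-off: instead of shipping large per-token conditioning tensors between pipeline stages, each stage keeps a replica of the tiny conditioning network and recomputes the AdaLN scale/shift/gate terms locally from the timestep embedding. The module below is illustrative, not the Cosmos implementation.

```python
import torch
import torch.nn as nn

class AdaLNConditioning(nn.Module):
    """Tiny MLP mapping a timestep embedding to per-layer scale/shift/gate terms.

    Replicating this module on every pipeline stage means only the small
    timestep/text embeddings need to exist on each rank, instead of
    communicating the derived conditioning tensors between stages.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 3 * dim))

    def forward(self, t_emb: torch.Tensor):
        scale, shift, gate = self.mlp(t_emb).chunk(3, dim=-1)
        return scale, shift, gate

# On each pipeline rank: receive only hidden states from the previous stage,
# then recompute the conditioning locally (cheap) rather than receiving it (costly).
dim = 512
cond = AdaLNConditioning(dim)
t_emb = torch.randn(2, dim)          # small enough to have on every rank
scale, shift, gate = cond(t_emb)
print(scale.shape)                   # torch.Size([2, 512])
```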

Okay. I think that's all of my presentation. Thank you for listening. Any questions? Hi, Ethan. Thanks a lot for joining us again. This is RJ. I asked a question at the beginning of the chat. I'm a little unclear about how the encoder gets, like, the encoder to the 8x8 latent space gets trained.

Is that just part of the diffusion training, or is there some sort of separate step used to train that encoder? Yeah, that's a good question. A separate step is used to train the encoder. Tokenizer is a fancy name for it, but it's essentially a vector-quantized variational autoencoder, a VQ-VAE.

Okay, yeah. Yeah, you basically train it on the task of reconstructing the videos. Okay, right. So, but how do you get it to create a 3D latent space? What's the TL;DR on how you get a 3D latent space like that? Yeah, so the model architecture itself is a causal convolutional neural network.

The encoder-decoder structure reconstructs the video, so the training objective is basically reconstruction. The process is: you collect a diverse set of videos, ideally in your domain, and train this causal CNN to reconstruct them. For continuous tokens there is no codebook, they are just the continuous latents; for discrete tokenizers, you do vector quantization to quantize into a 64k codebook.
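A minimal sketch of that reconstruction objective, with tiny stand-in layers: the real tokenizer is a causal 3D CNN trained with additional loss terms, so the single Conv3d/ConvTranspose3d pair and the plain MSE below are placeholders just to show the training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder continuous tokenizer with 8x8x8 compression: one strided 3D conv
# as the encoder and one transposed 3D conv as the decoder.
encoder = nn.Conv3d(3, 16, kernel_size=8, stride=8)           # (B,3,T,H,W) -> (B,16,T/8,H/8,W/8)
decoder = nn.ConvTranspose3d(16, 3, kernel_size=8, stride=8)  # back to pixel space
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

video = torch.rand(2, 3, 8, 64, 64)   # toy batch of 8-frame clips

for step in range(3):
    latents = encoder(video)
    recon = decoder(latents)
    # Reconstruction objective (real tokenizers typically add perceptual/adversarial terms).
    loss = F.mse_loss(recon, video)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(step, loss.item())
```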

Okay. Is it – sorry, I didn't have time to pre-read the paper. Is this covered in the paper, or is there a separate paper for this? Yeah, this is covered. Okay, got it, got it. Thank you. This is really super interesting, exciting work. Thank you very much for joining us.

So additionally, the tokenizer is frozen during the training of the transformer, because if you don't freeze the tokenizer, it can lead to catastrophic forgetting. Okay, got it, yeah. Sorry, I also have a question.

I didn't find any reaction button that I can raise my hand. Can I ask the question right now? Yeah. Okay, perfect. So my question is about the open source framework for pre-training that you mentioned. I think it was NEMO, right? Yes. Yes. So do you think, potentially, if I have a set of videos, but those videos, originally, they were not necessarily in the RGB space, okay?

So, I don't know, for example, satellite imagery, or some spectral wavelength, or whatever, and I just somehow mapped them to videos. Do you think I can still customize your framework and pre-train my own tokenizer, or basically whatever else exists in that framework? Yeah, if your data domain is different from natural videos, it's recommended to fine-tune the tokenizers.

So just fine-tuning, do you think that's going to work? Yes, because if the tokenizer is not fine-tuned, it might produce artifacts on your data if your domain is different. Sorry, yeah, go ahead. After fine-tuning the tokenizer, you might also want to fine-tune the diffusion transformer or the autoregressive transformer.

Yeah, both of these are supported in the framework. Awesome. And can I also pre-train the tokenizer from scratch using the current framework? Or fine-tune it, yeah. Thank you. Thanks for the presentation. I had a quick question related to something you mentioned is coming soon: multi-view generation and more camera control.

So, curious if you could speak any more towards how you're approaching multi-view, or how to make sure that the camera intrinsics correlate between one another, you know, if they're all video-based generation versus having a true, like, grounded scene understanding, how you guys are approaching that. Yes, that's a good question.

So, these are coming soon, but the techniques are covered in the paper. For example, for multi-view generation, the different views are folded into one of the dimensions of the data, so the model input stays roughly the same. In fact, they are folded into the time axis.

And the camera intrinsics are not used right now, because if you have consistent intrinsics you don't have this problem; but if your intrinsics change across the training data, I guess it's helpful to include them in the conditioning information. At least in the example, we use consistent intrinsics.

Yeah, so you're saying it has more to do with the training data you're using to post-train these models, making sure it's consistent and has similar intrinsics? Is that what you're saying? Yes. All right, okay. I can answer the questions in the chat.

Yeah, I wasn't looking at it. So, what does a token represent in this case, one pixel of video? The tokens are a patch of video. For an image, an 8x8 patch is one token; for a video, an 8x8x8 patch is one token.

That means, roughly, for one second of video at 30 frames, you have about four tokens in the time dimension, and spatially it depends on your resolution. And no, the video doesn't have a depth map, but you can add one in the post-training process.

What's the difference between post-training and fine-tuning? I'd say post-training is a fancy word for fine-tuning; sometimes fine-tuning refers specifically to particular techniques, like continued pre-training, but I would say the two words are used interchangeably. Oh, the number of tokens each of these foundation models was trained on.

So, for pre-training, it's at the level of a hundred million video clips, and each video clip is roughly five seconds. Using that information, you can calculate the number of tokens; I'd say it's roughly on the scale of 10 trillion tokens or more.
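Using the earlier rough figure of somewhere between 50K and 100K tokens per 5-second clip (an assumption, not a number from the talk), the pre-training token count works out as follows:

```python
# Rough pre-training token count; tokens_per_clip is the assumption here.
num_clips = 100_000_000                       # ~100 million clips, ~5 seconds each
for tokens_per_clip in (50_000, 100_000):
    print(f"{num_clips * tokens_per_clip:.0e} tokens")   # 5e+12 .. 1e+13
```

which lands in the several-trillion to ten-trillion token range consistent with the estimate above.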

You can calculate it for yourself. What type of hardware is adequate for post-training on our own data? The open-source post-training currently needs eight H100s for the diffusion model and two H100s for the autoregressive model, but with techniques like activation offloading or LoRA, I believe fewer GPUs can also be used for post-training.

So, the "world" in our model name: we want to emphasize that the model has spatial consistency, and we're aiming to provide the best foundation model for robotics post-training. Okay, I think that's all the questions in the chat. Any more questions? Hi Ethan, thanks for the talk.

I had a question. You said that for identifying high-quality clips, you filter them first, right? How do you do that? Do you use already-available open models, or do you train your own models for that? Yeah, that's a good question. There are different metrics for filtering videos.

There are both heuristics and models. A heuristic example: if the video is static, it's basically an image, not a good video. You can also train a model to classify the quality of the video, like an aesthetic score, which might need some extra training and labeling, and also motion scoring, that is, how much motion is in the video.

So, in your case, did you train a custom model for that, based on these metrics, maybe motion or aesthetics? There are a lot of open-source models available already; you can check them out. There are aesthetic classifiers and so on. Yeah. Okay, thanks.

Another quick question: as Cosmos develops and releases more iterations, how do you foresee adding more controllability within the scenario? More refined control over what's happening in the scene, and which variables you want to change versus keep fixed? Inherent to video generation in general, you don't have as much control, so I'm curious whether you see that as a requirement and how you're thinking of approaching it.

Yeah, I think that's very important for post-training, and it also depends on the use case. If your data has additional parameters you can use as conditioning, adding them to the training would definitely help. If you have camera intrinsics, additional cameras, or additional signals like audio, all of them can be used as conditioning.

The model is quite flexible for adding additional conditioning. For the diffusion model, you can add it through cross-attention, and similarly for the autoregressive model. Ethan, I have another question that's somewhat related. I was a little confused about how much of the ability to generate realistic physics and world models is due to training versus some inductive bias in the model, and, to the extent it was inductive bias, what were the key things there?

I think the two key things are data and scale. As the models grow larger and larger, a lot of the 3D capability, consistency, and physics properties automatically appear. The other thing is data: you need enough demonstrations of different physical properties in the data for the model to learn them.

That said, the model itself doesn't have a lot of inductive bias; we're just using transformers. There's no separate spatial attention, temporal attention, or those kinds of things. Okay, got it. Thank you. If there aren't other questions, I actually have one more. In the original diagram of the architecture, there are some things I didn't understand about the positional embeddings.

There are two, or maybe three, different positional embeddings, I think. There's this absolute positional embedding, and then in another diagram there's another positional embedding that goes into the cross-attention, I think. Or, well, I'm not sure what that is, that timestep in the scale-shift-gate.

I was kind of confused about what the purpose of all of these is. Yeah. So, the timestep is specific to the diffusion models. In the diffusion process, you go through multiple denoising steps until it becomes a clear and crisp video, right?

During training, you randomly apply noise to the tokens, and you also need to indicate to the model how much noise was added. More noise corresponds to an earlier timestep; less noise corresponds to a timestep close to the end of the generation.

So, during inference, the model can gradually remove the noise, conditioned on which timestep it is at. The absolute positional embedding and the 3D RoPE tell the model which position each token occupies in the video. Sure. I guess I was just confused about why you need both the rotary positional embedding and the absolute positional embedding.

Why are both of those needed? They're not strictly necessary, but using both can improve the model; in fact, using just the absolute positional embedding also works. Okay, I see. Got it. Thank you. Yeah, Ethan, can I ask a question? Yeah.

So, there was a comment in the chat about the use of vector quantization. How is that actually used? I don't think it's used for selecting patches, but it could be used for the discrete latent space. It's a training technique: for the autoregressive part of the model, you need a fixed vocabulary, and the inputs are basically indices, like this word is number one, this one is number two, and so on.

So, when you're training the tokenizer, you need to quantize the latents into the codebook: for each patch, it basically looks for the closest vector in the codebook and picks it out. Yeah.
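A minimal sketch of that codebook lookup: each continuous patch embedding is snapped to its nearest codebook vector, and the discrete token fed to the autoregressive transformer is simply that vector's index. This is the classic VQ-VAE style lookup being described; the exact quantization scheme in the Cosmos tokenizer may differ.

```python
import torch

def vector_quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """Snap each latent vector to its nearest codebook entry.

    latents:  (num_tokens, dim) continuous patch embeddings
    codebook: (vocab_size, dim) learned code vectors
    Returns (indices, quantized): the discrete token ids and their code vectors.
    """
    dists = torch.cdist(latents, codebook)   # (num_tokens, vocab_size) L2 distances
    indices = dists.argmin(dim=-1)           # discrete token id per patch
    return indices, codebook[indices]

# Toy usage: six patch embeddings against a 64K-entry codebook.
codebook = torch.randn(64_000, 16)
latents = torch.randn(6, 16)
ids, quantized = vector_quantize(latents, codebook)
print(ids)   # six token ids in [0, 64000)
```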

Thanks, Ethan. I had a question about the size of the models that were posted to Hugging Face. How did you select those sizes? Did you experiment with larger sizes? Those are my questions. So, this is the first release of Cosmos 1.0. There might be bigger models in the future; when doing research, we want to go from small to big rather than betting everything on one shot. I think we're still in the infancy of world foundation models; it's kind of like the GPT-1 or GPT-2 stage of world foundation models.

Bigger models will definitely come in the future. Got it. Thanks. Was there any thinking along the lines of, well, this is good enough for most of the applications we see from customers or partners? It's not good enough yet; it can keep getting better. The model now shows some emerging physics properties in the generated video.

I think it can get better. Thanks. Guys, it looks like swyx passed the baton to me; he had to drop off for an in-real-life meeting. So if there are any other questions, I encourage you to ask. Otherwise, I think we can take a little bit of time to discuss the next paper. I have a hard stop in three minutes, so I'll need to drop off then.

So, first of all, Ethan, this is fantastic. I hope you keep coming back to these paper club meetings, even to present someone else's paper rather than your own. Certainly anytime you publish a paper, we definitely want to see you here, but if others publish a paper that you think is exciting and want to share, we would definitely love to have you as well.

Thank you. Thank you for hosting. Yeah, I mean, swyx is the host, but I'm happy to facilitate where I can. Are there other questions for Ethan before we wrap up? I'm not sure how much time we really have to discuss the next paper, but does anyone want to volunteer? I saw some chatter on the Discord about people picking things from the list of papers in our backlog and giving brief, fast discussions of those.

In the past we've taken 10 or 15 minutes to summarize a paper for everyone. People probably won't pre-read, but it's a good way to understand in some detail what the key points of the paper are.

So, maybe I can post that. I think it's already in Discord, but I can post it there, and for people who are not on Discord, I can suggest swyx also post it on Twitter or wherever. Unless, of course, someone wants to volunteer to present a paper next week?

Okay. Somebody asked for the Discord channel. If no one can dig that up, I think it's on the Latent Space Substack; you can dig through it there, or I think there's a website too, and you can find it there.

Otherwise, you can hit me up on X or LinkedIn and I'll find it for you; on both of them my handle is Haneke, or you can obviously also ask swyx or anyone else here. Oh, there it goes. Okay, great. Okay, guys, grab that if you need it.

I'm going to end the meeting; yeah, I've got to go. I'm going to stop recording. Actually, I probably was supposed to stop recording earlier, but we'll sort that out in the edit, whatever. Thank you very much. We'll see you next week. Goodbye.