Veo 3 + DeepSeek V3.2-Exp Explained: Sparse Attention for Affordable Long Contexts

Chapters
0:00 Introduction to the Veo 3 analysis paper and initial skepticism
0:58 Premise that Veo 3 can act as an LLM for video, performing general reasoning and tasks
2:09 Comparison to Sora 1 and Sora 2, and a discussion on world models
3:23 The paper's claim of Veo 3's reasoning capabilities, including maze solving, and the potential influence of an LLM prompt rewriter
4:36 Discussion on general-purpose vision understanding through large-scale training
6:05 Demonstration of Veo 3's capabilities on a web page, covering perception, modeling, manipulation, and reasoning
7:52 Skepticism regarding the true source of "reasoning" (LLM vs. video model)
9:28 Quantitative results and comparison to other models, including the use of green backgrounds for better performance
11:11 How Google attempts to isolate the video model's reasoning capabilities from the LLM rewriter
11:46 Overview of the four hierarchical capabilities: perception, modeling, manipulation, and reasoning
13:36 Detailed look at perception tasks (edge detection, segmentation) and the claim that video models will replace bespoke CV models
15:06 Discussion on modeling physical properties and optical phenomena, and manipulation tasks like background removal
15:41 Visual reasoning and the concept of "chain of frames" as analogous to "chain of thought"
17:09 Quantitative tasks, performance metrics, and comparison to Veo 2 and Nano Banana
18:01 Detailed analysis of specific quantitative tasks like edge detection, object extraction, and segmentation, highlighting the green background bias
20:26 Discussion on maze solving, image editing, and visual symmetry solving
21:57 Discussion on Veo 3's emergent zero-shot abilities and its role as a foundation model for machine vision
23:06 Framing the paper's outlook and the benefits of general capabilities
24:12 Recap of the Veo 3 paper as an analysis rather than a technical detail paper
25:19 Discussion on the speaker's skepticism about the paper's claims and the importance of capability exploration
25:40 Question about dollar-per-token cost and the comparison between specialized models and foundation models
26:10 Question on what "true reasoning" would look like in video models without an LLM rewriter
27:52 Discussion on Sora 2's system card and the absence of quantitative metrics
28:36 OpenAI's approach to system cards and safety measures for Sora 2, including moderation classifiers and output blocking
30:37 Transparency, watermarking, and internal detection tools for AI-generated content
32:02 Discussion on control nets and the limitations of API-based models
33:51 Follow-up on the dollar cost question and the future of specialized vs. generalist models
36:09 Question on how text prompts are integrated into the latent space of vision models
38:19 Introduction to the DeepSeek V3.2 Experimental paper
38:52 DeepSeek V3.2-Exp's focus on **reducing inference cost** for long contexts using sparse attention
39:54 Explanation of the **sparse attention mechanism** with a lightning indexer and fine-grained token selection
41:02 How the indexer computes an index score to select top-k tokens, and the role of its smaller size for computational efficiency
47:08 Two stages of pre-training: dense warm-up (for indexer initialization) and sparse attention mechanism training
48:51 Post-training with two modifications: **specialist distillation** and **mixed RL training**
49:23 Specialist distillation using expert models in mathematics, competitive programming, logical reasoning, agentic coding, and agentic search
50:52 Mixed RL training to balance performance across tasks and prevent catastrophic forgetting
52:22 Evaluations showing the efficiency and power of the sparse attention implementation, with cost reduction
54:10 The potential for other models to adapt this sparse attention technique due to continued training feasibility
54:49 Discussion on slight performance drops in some benchmarks for DeepSeek V3.2-Exp but significant cost savings
00:00:04.400 |
So just to recap for the recording, this is not a Veo 3 technical paper. 00:00:19.820 |
If you guys have seen the papers hyping up LLMs as zero-shot learners, 00:00:32.100 |
I felt like this paper is low-key a response, to be the first one to just come out 00:00:36.620 |
and, like, say the same thing, that, you know, 00:00:40.220 |
you can have general-purpose video models, and they're getting pretty good. 00:00:49.100 |
So basically, they're making this premise that also, you know, 00:00:58.420 |
So I guess there's some background before we get into it. 00:01:01.960 |
So they're basically making the premise that Veo 3 can be, like, an LLM for video. 00:01:09.820 |
They can, you know, they can do a bunch of different tasks, 00:01:12.660 |
and then they break down how do they test whether it's, like, a general model. 00:01:19.260 |
Can they have, like, what are these different categories? 00:01:22.680 |
Perception, modeling, manipulation, and reasoning, and then different tasks, 00:01:26.860 |
and then they do a bunch of inference, and then they're, like, you know, 00:01:29.180 |
how does this compare to specific models in those domains? 00:01:33.960 |
And they're, like, Veo 3 is actually a pretty decent generalist, but, you know, 00:01:39.780 |
calling it a reasoner, a bit of a stretch, and then they cite how this is, like, 00:01:45.380 |
there was a GPT-3 moment in LLMs where GPT-3 was, like, okay, it can do text, 00:01:54.020 |
and then, you know, it became a very, very good general model that's, like, 00:01:57.720 |
a state-of-the-art summarization, classification, all these different tasks, right? 00:02:03.120 |
So, they're making the claim that Veo 3 and, you know, video models can do the same. 00:02:09.860 |
The second, like, thing that came out this week was Sora 2. 00:02:13.960 |
Sora 2 also said, like, okay, the Sora 1 was basically, like, 00:02:19.860 |
the GPT-1 moment for video models, and Sora 2 is, like, GPT-3.5. 00:02:30.720 |
And then, you know, there's an app that we can play around with. 00:02:33.640 |
So, some background before, I guess, going into this paper of if video models can reason. 00:02:39.900 |
This weekend, I was talking to someone at OpenAI. 00:02:43.900 |
He runs the ImageGen team and is also part-time at Sora. 00:02:50.140 |
And he basically, he's, like, okay, my entire background was on world models. 00:02:54.100 |
And he's, like, the right modality for world modeling and reasoning is not video. 00:03:04.240 |
It's auto-regressive image and text because you can reason. 00:03:08.360 |
And, you know, without getting too deep into it, the premise is, like, okay, 00:03:17.180 |
That's open, you know, food for thought for you to think about. 00:03:29.320 |
They can perceive, model, manipulate the world, do early forms of reasoning because they can 00:03:35.700 |
And I was, like, okay, that's kind of interesting. 00:03:37.600 |
You know, they are solving mazes and this and that. 00:03:39.880 |
Then they make this very fun claim somewhere in here. 00:03:46.640 |
Where basically what they're doing is they're doing a lot of prompting and they're testing 00:03:55.300 |
So they're generating prompts and then they're seeing, can it model this stuff? 00:03:59.180 |
And then they note that in the vertex, so this paper is from Google. 00:04:05.820 |
But basically in the vertex API, there is a prompt rewriter. 00:04:12.740 |
So basically they're like, you know, one thing to know is the solving could be done in the 00:04:19.900 |
LLM backbone because whatever prompt you said gets rewritten, right? 00:04:25.720 |
Well, maybe the prompt is actually being rewritten and the LLM is giving information on how to 00:04:37.060 |
And so they're making this claim that, you know, there's really good task specific models 00:04:41.380 |
like segment anything from meta for segmentation, YOLO for object detection and stuff. 00:04:47.160 |
And they're like, okay, you know, can we have the same primitive in video models, like just 00:04:54.660 |
by large scale training on just text and video and web scale data? 00:04:58.220 |
Do they have general purpose vision understanding similar to how LLMs do? 00:05:03.760 |
They generated roughly 14,000 videos across 62 qualitative and seven quantitative tasks. 00:05:14.900 |
They show early forms of chains of frames, which they consider visual reasoning, like maze and 00:05:25.400 |
I am a little, you know, I don't know how I feel about that claim because you have to remember 00:05:31.660 |
that as much as you have chain of frames, we know that vision transformers can have temporal consistency. 00:05:40.420 |
That doesn't necessarily translate to reasoning, right? 00:05:43.440 |
Just because you can interpolate between frames or like in the self-driving sense, you can remember 00:05:49.560 |
where an object is just because something else is in front of it. 00:05:52.900 |
That doesn't necessarily constitute reasoning, right? 00:05:55.440 |
You have attention, you have a bit of memory and you have stored state, but that's not the 00:06:08.920 |
They have different tasks, different sub stack. 00:06:13.340 |
Before we go to that, I'm going to share the website. 00:06:29.920 |
This is the, it's like a webpage of their paper. 00:06:35.980 |
So basically, TL;DR: Veo 3 shows emergent zero-shot capabilities across tasks, indicating that video 00:06:42.960 |
models are on the path to becoming vision foundation models, just like LLMs. 00:06:49.700 |
Perception: can, can you pick out a little dot in an eye? Modeling: 00:06:54.180 |
can you, like, model basic physics? Manipulation: 00:06:57.380 |
can you, you know, make sense of opening a cap? Reasoning? 00:07:04.900 |
And then if you're interested, it's very nice. 00:07:14.900 |
And then these are kind of some of these things. 00:07:16.920 |
So, uh, edge detection, uh, segmentation, you know, you can get a nice little visualization 00:07:23.020 |
of what all these tasks are, uh, deblurring, denoising, and then they break it 00:07:35.960 |
Uh, can you see what would set on fire first? 00:07:39.700 |
I don't know if this is how paper fire spreads, uh, you know, gravity on the moon is different 00:07:49.660 |
Um, you know, so it's, it's like a nice visualization because the paper is static, uh, manipulation, 00:07:58.920 |
So like, these are some of the, the things that they tested for and prompted. 00:08:03.160 |
And it's like a nice, it's like a nice thought exercise, right? 00:08:06.880 |
If you have to break down video generation into subcategories, um, these are nice tasks that 00:08:16.320 |
So it, it's cool that they did that, um, some of this reasoning stuff, you know, it's 00:08:20.820 |
just, it's just hard to, it's hard to distinguish how much of this is the LLM that's rewriting 00:08:28.600 |
the prompt and giving explicit instructions on what to generate versus, um, the video, like 00:08:36.540 |
And also some of these are just pretty basic, right? 00:08:39.420 |
Like, are these just stochastic representations, right? 00:08:42.420 |
Like in LLM with a million parameters can probably tell you what comes next in the sequence of 00:08:50.020 |
So it can also tell big, small, small, smaller, these are not super challenging puzzles. 00:09:01.500 |
There's like little things, but you know, they have a bunch of these and then the maze. 00:09:05.040 |
I'm, I'm somewhat skeptical, but fun little visualization for people that are interested. 00:09:10.940 |
Um, I am going to change back to the paper real quick. 00:09:37.500 |
Um, feel free to chime in if you found any of this interesting, because honestly, it's 00:09:41.660 |
just a lot of, a lot of, um, a lot of basic definition examples, uh, for each task, we 00:09:49.300 |
query the publicly available Veo 2 or Veo 3 APIs. 00:09:52.840 |
We prompt the model with an initial input and the text instruction. 00:09:57.420 |
They generate 16:9 video at 720p, 24 FPS, for eight seconds. 00:10:05.520 |
So according to vertex documentation, the API uses an LLM based prompt rewriter. 00:10:12.300 |
This means that some of the solutions are likely to come from the LLM instead of the video. 00:10:17.380 |
For example, Sudoku. Uh, we treat the system, the rewriter and the video generator, as a single 00:10:28.420 |
Uh, however, to isolate the video model's reasoning capabilities, 00:10:32.560 |
We verified that standalone LLMs couldn't reliably solve some key tasks. 00:10:39.940 |
Uh, here, I think you also want to digest, like, dig into this a little more. 00:10:44.500 |
Um, some LLMs are very, very bad at vision, right? 00:10:53.720 |
It can't tell like there's a barrier in the image, even though it's like a really smart LLM. 00:11:00.500 |
It could probably create something with a barrier. 00:11:03.980 |
So I could create a solution to Sudoku or whatever maze, and it could create the maze. 00:11:08.780 |
Uh, but relying on its vision, you know, it's, it's kind of cooked. 00:11:15.760 |
Um, RJ is asking about how they tried to isolate this. 00:11:19.640 |
Honestly, the sad thing was, this is the only, this is like the only five lines of the paper 00:11:28.280 |
It doesn't even tell you that there's this, and this is like the only section that they 00:11:33.260 |
Uh, they, they do bring it up with like, okay, is this going to be, um, you know, is this going 00:11:40.760 |
Cause it's, it's definitely not cause they're from Google. 00:11:43.900 |
It's cause it's the best on the leaderboards. 00:11:49.680 |
First, there's four hierarchical capabilities. 00:11:52.320 |
These, they all build on the last they claim. 00:11:54.220 |
So perception, can you understand visual information modeling, which builds on the perception? 00:11:59.900 |
You know, uh, can you form stuff in a visual world manipulation? 00:12:03.660 |
Can you alter perceive stuff in a visual world and reasoning? 00:12:09.760 |
And then, you know, we basically saw those examples before, right? 00:12:12.600 |
So perception, uh, can you, can you do de-noising? 00:12:16.260 |
Uh, can you, can you, you know, highlight what's in a thing? 00:12:19.860 |
And then modeling is like, okay, do you know what happens in this world? 00:12:23.700 |
So like, if I drop something light on water, it floats manipulation. 00:12:29.560 |
So like, if you have a guy standing facing forward, can you picture the rest of the 00:12:33.980 |
body and stuff like, okay, if you open a jar, what happens, right? 00:12:43.000 |
Um, some stuff, uh, and they're, they're trying to make those claims that, 00:12:47.740 |
There's, there's four levels of this. Uh, for each section, they prompt Veo 3 12 00:12:52.960 |
times and record the success rate in the caption. 00:12:55.720 |
Uh, there's interesting little distinctions they make later on. 00:12:58.780 |
Some stuff is like, video models really like to keep generating. 00:13:04.340 |
So after they finish a task, they still keep going until the end of the sequence. 00:13:08.040 |
So they report a best frame and a last frame, because sometimes the 00:13:13.280 |
Um, and then there's also, like, pass@k, and they really like to mention how this is all zero-shot. 00:13:20.660 |
It's not like LLMs where you're doing few shot prompting. 00:13:24.360 |
Uh, and then, you know, uh, success rate greater than zero in 12 attempts means that 00:13:29.740 |
the ability is there, while success rate closer to one means it's reliable. 00:13:35.460 |
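To make those metrics concrete, here is a minimal sketch (my own, not code from the paper) of a per-task success rate over 12 attempts and the standard unbiased pass@k estimator:

```python
from math import comb

def success_rate(successes: list[int]) -> float:
    """Fraction of the attempts that solved the task (closer to 1 = reliable)."""
    return sum(successes) / len(successes)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled attempts
    succeeds, given c successes observed in n total attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

attempts = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # made-up outcomes for one task
print(success_rate(attempts))                      # 0.25 -> "the ability is there"
print(pass_at_k(n=12, c=sum(attempts), k=5))       # chance one of 5 samples solves it
```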
Uh, so stuff like segmentation, augment, uh, object detection, edge detection, all those, 00:13:43.820 |
So they test, uh, edge detection, segmentation, keypoint localization, super-resolution, blind 00:13:49.380 |
deblurring, denoising, low-light enhancement, a bunch of these things. 00:13:53.880 |
Uh, the takeaway, uh, I think is pretty cool. 00:13:57.080 |
The takeaway is basically that, just like LLMs, video models will replace bespoke models in vision, 00:14:02.500 |
uh, once they become sufficiently cheap and reliable. 00:14:06.240 |
I think we should think more about this, uh, claim a little bit more like, so one thing 00:14:14.180 |
is with LLMs, small LLMs do decent, but they're not being used as much, right? 00:14:21.720 |
We don't have like small trained encoders that were used on the edge before, but one thing 00:14:28.860 |
we do have is small computer vision models used everywhere. 00:14:32.280 |
So like, think your car's, uh, ADAS system, right? 00:14:36.040 |
You have a very shitty, small computer vision model that can detect lanes and detect objects. 00:14:42.020 |
And like, you know, you don't have any memory for running big models. 00:14:45.800 |
So I think there's still quite a space for small computer vision models because a lot of them 00:14:54.280 |
Like your, um, you know, your, your home security system is not going to be running Veo 3. 00:14:59.420 |
Maybe it will, maybe, maybe down, down the line. 00:15:02.080 |
But it's a little different from LLMs in that sense. 00:15:14.220 |
So buoyancy, the air resistance of dropping that, uh, scarf or whatever it was, uh, optical 00:15:20.580 |
phenomena like light reflections, refraction, uh, adding and mixing colors. 00:15:25.820 |
Then in manipulation, they want to manipulate stuff, right? 00:15:28.420 |
So can we remove a background and infill stuff? 00:15:30.700 |
Uh, can we colorize images, do inpainting, outpainting? 00:15:36.980 |
And then, you know, if you're curious, there's the other like 30, 40 things that they do, but 00:15:43.700 |
Then the last one, visual reasoning. It's basically, um, you know, since stuff is frame by frame, 00:15:50.380 |
can the parallel of chain of thought in LLMs be, like, chain of frames? 00:15:55.760 |
They, they, they, they test all these things, you know, fitting shapes into holes, sorting, 00:16:02.760 |
Uh, the, the third takeaway is frame-by-frame video generation parallels chain 00:16:10.680 |
of thought. Just like chain of thought enables language models to reason, chain of frames enables video models to reason. 00:16:18.540 |
Once again, you know, I don't know that just because you can have temporal consistency, 00:16:23.020 |
that means that you are reasoning per se, right? 00:16:26.740 |
Part of this is just architecture, like fundamentals, right? 00:16:30.820 |
Vision transformers have a form of temporal consistency. 00:16:34.300 |
It doesn't mean that there's any extra reasoning applied to them. 00:16:36.980 |
You're, you're not doing like a chain of thought style reasoning, but it's, it's, uh, it's a thing 00:16:43.500 |
Uh, and then, you know, this is, this is some of it. 00:16:46.260 |
So more of that, um, and this is how they test stuff. 00:16:51.860 |
Um, they, they go, they go more into this later, but there's basically an overlap between some 00:16:56.560 |
of these. Some of the fun stuff they say is, like, uh, low-key Veo 3 was just too good on this. 00:17:12.820 |
So there's, there's seven tasks that they test, uh, you know, edge detection, segmentation, 00:17:24.760 |
Uh, Veo 3 likes to continue scenes even after task completion. 00:17:30.260 |
Uh, where, where applicable, they, they compare it to Nano Banana, um, on some stuff, you know, 00:17:40.500 |
they, they match Nano Banana or they even exceed it. 00:17:43.820 |
Um, and then for video models, there's substantial improvement when you do pass@k with k equals 10 or more attempts. 00:17:52.260 |
Uh, they also plot all this stuff with Veo 2, and they're like, okay, Veo 2 kind of sucked. 00:17:56.200 |
Veo 2 is very bad on all these. But now is the, the long part where they go into all the numbers and stuff. 00:18:04.760 |
If anyone's interested, you know, we can always pause and dive deep, but edge detection, um, 00:18:10.400 |
prompted to detect, therefore perceive edges. 00:18:14.000 |
I think a lot of this also goes into their prompting, right? 00:18:16.860 |
Let alone, there is a rewriter, but you know, how are they prompting these things? 00:18:20.160 |
So, uh, original image generated frame extracted masks, and then there's the ground truth mask. 00:18:25.320 |
So then they often grade these by the overlap of, um, the output, uh, over a subset of 50 easy images. 00:18:35.420 |
Here's their prompt blank, each distinct entity in overlaid flat color background fades from white to green, dah, dah, dah, dah, dah. 00:18:43.260 |
Some fun stuff they know is that the model always does better when there's a green background. 00:18:48.020 |
So, you know, green screen-esque, um, same thing here. 00:18:54.020 |
So, uh, you know, the background changes to white animals line up in a row. 00:18:59.160 |
How do they perform? Veo 3 is goated. Veo 2 was okay here. 00:19:04.200 |
Nano Banana is a lot better, but Veo 3 is up there. 00:19:10.760 |
It does significantly better than with no green background, but Nano Banana doesn't see the same 00:19:16.380 |
change, or they don't, they don't really plot it. 00:19:22.520 |
They, they measure it by, you know, mean intersection over union. 00:19:29.020 |
Uh, basically, uh, Veo 3 achieves 0.74 for best frame, comparable to Nano Banana's 00:19:35.660 |
0.73. Uh, Veo 3 still lags behind specialized stuff, right? 00:19:40.820 |
So Segment Anything version two from Meta, this is a specialized segmentation model. 00:19:47.700 |
I'm slightly annoyed that they don't show the performance of that, but it's okay. 00:19:51.760 |
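For reference, mean intersection over union is just the average IoU between predicted and ground-truth masks over the evaluation set; a rough NumPy sketch, assuming binary masks:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks (True = foreground)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def mean_iou(preds: list[np.ndarray], gts: list[np.ndarray]) -> float:
    """Average IoU over (predicted, ground-truth) mask pairs,
    e.g. the 50-image subset mentioned above."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```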
Um, these are also not super cherry picked samples. 00:19:55.960 |
Uh, they often list which, um, data set they pull from and they just sample some images. 00:20:04.680 |
Um, but yeah, they, they consistently perform better with green backgrounds than white. 00:20:12.000 |
So that's kind of interesting possibly due to the widespread of green screens. 00:20:23.520 |
Um, I think there's reasoning, the current one. 00:20:27.820 |
We're still discussing reasoning: maze solving. 00:20:30.020 |
I think this is where I'm going to start to go a little faster. 00:20:32.360 |
Uh, Veo 3 gets a lot better performance, like 90, 92 percent, given simpler tasks. 00:20:41.480 |
Uh, there's a strong bias for animated scenes that might induce unintended changes. 00:20:54.860 |
Uh, the, the interesting things here are like, okay. 00:20:59.480 |
It's doing significantly better than Nano Banana and way, way better than Veo 2. 00:21:04.700 |
I don't know what the hell happened to Veo 2, but Veo 2 is, like, you know, somewhere in the, like, sub-10% for a lot of these, uh, which is, which is rough. 00:21:15.400 |
So, like, you give Veo 2 a random pattern and you tell it to complete it. 00:21:29.640 |
It's, it's being told to reflect the pattern across the center. 00:21:32.260 |
So Veo 2 couldn't do it at all. Nano Banana can't reflect this that well. 00:21:36.180 |
It's, like, 28% in 10 attempts, but, uh, the best frame of Veo 3 can zero-shot it, like, a hundred percent accuracy. 00:21:44.040 |
These kinds of interesting tasks: maze solving, uh, visual symmetry solving, visual analogy completion, um, more stuff. 00:22:00.040 |
They really like, like foundational, like, okay, guys, just don't forget, you know, breakthroughs happen. 00:22:08.320 |
We are here to make the case that machine vision is on the cusp of a similar paradigm shift enabled by emergent abilities of large scale video models. 00:22:17.500 |
Our core finding is that Veo 3 can solve a wide range of tasks in a zero-shot manner, spanning the full vision stack: perception, modeling, manipulation, even early forms of visual reasoning. 00:22:30.000 |
It's, you know, a massive, consistent improvement from Veo 2 to Veo 3, which indicates that video models will become general-purpose foundation models, just as LLMs have. 00:22:40.160 |
Uh, performance is a lower bound; video generation is expensive, but costs tend to fall. 00:22:46.160 |
Uh, yeah, jack of many trades but master of a few: Veo 3's performance is below the state of the art of specialized models. 00:22:53.160 |
This mirrors the early days of LLMs: GPT-3 reported performance 00:22:57.160 |
well below fine-tuned models on many tasks; that didn't stop them from becoming foundational. 00:23:02.160 |
Uh, outlook: this is an, uh, exciting time for vision. 00:23:07.160 |
I do like the framing of the paper and I do think it's like net good for the average user, right? 00:23:13.160 |
Like if you need object detection or like, you know, if you want like a outline of how to draw something, it's a lot more work to go and have to find an object detection. 00:23:26.160 |
model or segment anything, or some web app that like, you know, you gotta enter your info to do versus just having like, you know, okay. 00:23:39.160 |
Uh, for example, I was like at a little event and there was like a page to, you know, draw yourself and everyone was drawing themselves. 00:23:48.160 |
And one girl just took a selfie, sent it to Veo 3, and was like, okay, give me an outline that I can trace. 00:23:55.160 |
And I was like, oh yeah, that's kind of cool. 00:23:58.160 |
And I was like, I never would have expected you to use Veo 3 for this, but she did it. 00:24:03.160 |
Uh, and then, you know, we have a bunch more examples towards the bottom, but I think, I think the website stuff is kind of more fun. 00:24:15.160 |
Um, this is a paper analyzing and trying to make a claim that, uh, video models can be good generalist. 00:24:23.160 |
I think the thing that I probably skipped over that I thought was interesting was there's that section on them being, um, this is all zero shot. 00:24:35.160 |
Um, so, you know, compared to LLMs where you can do like 64 shot, um, this is zero shot. 00:24:43.160 |
So they compare bespoke models against the zero-shot capability of the video model. 00:24:50.160 |
Uh, but I think it's also, like, cool to see how much better this is than, um, Veo 2. 00:24:56.160 |
But yeah, okay, I guess, I guess I skipped it because they just keep showing zero-shot everywhere, but they did make a point somewhere that, you know, they want to reiterate that this is all zero-shot. 00:25:11.160 |
That's, that's the quick overview of the paper though. 00:25:23.160 |
Like every new model needs this, um, and reasoning is obviously, uh, very important. 00:25:31.160 |
It's just that people hyped it up as a Veo 3 paper when it's not a Veo 3 technical paper, but yeah, it's, it's a nice distinction in how they, um, how they, how they split capabilities and stuff. 00:25:50.160 |
If they desire to scale this up to be a vision foundation model, nothing about dollar token costs. 00:25:59.160 |
Um, they, they mentioned that stuff will get cheaper. 00:26:08.160 |
Actually, I don't know how to raise my hand in zoom. 00:26:11.160 |
Um, uh, what is, so what would a true, I think that I heard you say that like a chain, this is not really reasoning. 00:26:24.160 |
It's like, uh, because of the, because the LLM is already doing the, the, the next frame prediction. 00:26:34.160 |
So the, the, the thing there is, we don't know what the, the, the video model is actually generating, right? 00:26:39.160 |
Like when you pass in a prompt, like, uh, generate, uh, maze that does this and solve it. 00:26:46.160 |
The LLM like pre-prompter could literally be saying, okay, outline of mazes like this, you know, uh, like ball starts in this location, then goes here, then goes here, then goes here. 00:26:58.160 |
So like if the LLM is explicitly solving it, which it really could be, then, then it's not the video model. 00:27:07.160 |
It's just the LLM prompt that's being rewritten. 00:27:09.160 |
And I mean, like, you could test this on video models that are not Veo 3, where you just don't use a prompt rewriter, right? 00:27:18.160 |
Like, test your own video generation model and don't, don't have a prompt rewriter. But, like, nonetheless, you know, it's, it's not all free inference. 00:27:27.160 |
It's like not easy to generate this many samples. 00:27:36.160 |
Like if you, if you just didn't have it rewritten, then, you know, you might have different results. 00:27:52.160 |
Um, what other fun stuff we can, we can look at the Sora technical report. 00:28:01.160 |
Oh, no, I think it's just like a, sorry, sorry, sorry. 00:28:04.160 |
Safety, safety card, system card, system card. 00:28:09.160 |
So my spicy tweet about Sora was going to be, there's no numbers in Sora. 00:28:13.160 |
The only numbers are about, like, uh, refusal rates, um, bad refusals versus good refusals. 00:28:22.160 |
Um, but there's no numbers, there's no evals, no, nothing. 00:28:30.160 |
I think they'll put out a technical blog post later. 00:28:36.160 |
Um, the reality of the situation is Sora one paper was pretty good. 00:28:46.160 |
It's like, if you were to, you know, get into a legal dispute or ask someone like, Hey, should 00:29:03.160 |
But yeah, it's, it's mostly just safety checks. 00:29:13.160 |
Um, I thought it's interesting that, you know, now they allow like face uploads. 00:29:29.160 |
So, you know, input prompts, output video frames, transcripts, comments, all that stuff 00:29:39.160 |
So the strategy involves blocking the tool from generating a video. 00:29:43.160 |
So is my screen sharing to the, yeah, that's okay. 00:29:46.160 |
Uh, basically, you know, this one is the most straightforward, right? 00:29:49.160 |
If you send in a prompt that gets flagged, then of course it will not be generated. 00:29:56.160 |
Output blocking, uh, this is after the video has been generated. 00:30:04.160 |
So there's a child sexual abuse material classification model. 00:30:12.160 |
Safety response monitoring blocks output that violates policies. 00:30:18.160 |
Um, there's additional safety stuff for people under 18 and anyone under 13. 00:30:32.160 |
Watermarking: stuff will be watermarked, plus internal detection tools to help see if, if stuff was created by Sora. 00:30:47.160 |
So even if you remove the watermark, you know, they, they want to see if the video audio was generated. 00:30:52.160 |
Um, when I read this, by the way, um, I, I was actually thinking through like, what do you really want to see in like an industry standard? 00:31:00.160 |
Um, it's that OpenAI and Google are converging on the same thing. 00:31:04.160 |
And they're not, uh, Google has its own SynthID thing. 00:31:15.160 |
Um, you've been investing in watermarking for a long time. 00:31:19.160 |
I thought you meant, like, a safety, like, distinction for what Veo 3 can and can't output. 00:31:27.160 |
There's, there's quite a few things for watermarking. 00:31:32.160 |
Like if you can do something better and you can put more effort into it, why, why match what Google does? 00:31:51.160 |
I feel like they, they can put out like a nice moderation API on this, you know, that would be interesting. 00:32:01.160 |
Uh, someone's asking when they'll have Stable Diffusion kinds of modifiers, like ControlNet and stuff. 00:32:08.160 |
We'll be able to actually do almost anything we want. 00:32:13.160 |
What you're missing is these are, like, API-based, walled-garden models, right? 00:32:18.160 |
Uh, you, you can only do so much, so, uh, you can take the output of this and then apply a ControlNet to it. 00:32:25.160 |
But like, that's not the same as taking an open, open video model and doing whatever you want. 00:32:31.160 |
So you want to use what the private labs have put out, then, you know, you got to use whatever their API lets you use or the app. 00:32:41.160 |
Um, misuse, do not support this, do not support public figures, blocking generations that include real people. 00:32:50.160 |
Uh, you can't upload a person of a picture of a person unless they opt in. 00:33:10.160 |
I tried that I was like trying to get to see if they could reason. 00:33:13.160 |
Um, they all kind of still failed on that. Child safety, teen safety, model output restrictions 00:33:22.160 |
It's like not that long, but you know, basic stuff to read. 00:33:42.160 |
These numbers don't mean much, but yeah, that's, that's kind of sort of safety card. 00:33:51.160 |
So our first quick paper club, we finished early. 00:33:53.160 |
I actually did want to follow up on the question that I had asked about the dollar cost of it 00:33:58.160 |
Um, the reason I ask is because like, I can imagine two separate worlds, one in which like a model 00:34:04.160 |
routing gets better and the specialized models get better too, which I can imagine. 00:34:08.160 |
Like if specialized computer vision models, I could, let me, let me back up the trend that 00:34:14.160 |
Wouldn't that apply to both specialized computer vision models, or any model, and a foundation 00:34:19.160 |
model? And if that's the case, if you had a world in which there was, like, better model routing to 00:34:24.160 |
specialized models versus just this, like, God model. 00:34:29.160 |
I feel like there could be a new paper about that, or they haven't talked. 00:34:37.160 |
Um, I think the thing with specialized models is they don't often just specialize on performance, 00:34:49.160 |
A lot of these things are often on device or need a certain low latency. 00:34:54.160 |
And for their use case, they're, they're specialized for that in terms of like architecture changes. 00:35:00.160 |
Some of them are different architectures, so it won't benefit in this. 00:35:05.160 |
And same thing with the like routing expert type thing. 00:35:12.160 |
Like if you want LLMs for different use cases, sure. 00:35:16.160 |
You can route to a bunch of stuff, but you know, uh, what, what does that look like for video? 00:35:21.160 |
Like, are you, are you trying to get understanding? 00:35:25.160 |
Like, are there tasks or do you just want general? 00:35:32.160 |
Like, I think it's also, there aren't many people hosting these, right? 00:35:37.160 |
Like you don't have many third party do everything video all in one platforms, right? 00:35:42.160 |
Like you have a few video editing apps, but they're not like consumer, like edge detection, 00:35:49.160 |
Like there's not many of those that would do this routing, but it's, it's an interesting thought experiment. 00:35:56.160 |
Like, I think, I think it's cool, but I don't know how much it falls in. 00:36:00.160 |
Like, you know, how much you really get out of it. 00:36:09.160 |
Any, any other thoughts, questions, or we move on for next week. 00:36:14.160 |
Well, I have another question actually, then if we have some time, um, like, so my question 00:36:20.160 |
is, uh, with these video generation models, the, just so I have a better understanding the, 00:36:25.160 |
the part that, uh, introspects a text prompt, like how is that baked into the latent space 00:36:32.160 |
Basically what we can call a vision model in a sense, right? 00:36:35.160 |
Like, is it all in one latent space or like, is it separated out? 00:36:40.160 |
We did, uh, we did a deep dive on the Sora 1 technical blog post that shows this. 00:36:49.160 |
So we've done, like, a few papers that were, like, multimodal image and text, and those show 00:36:58.160 |
Um, basically you, you have a contrastive loss that merges the two embedding spaces, right? 00:37:04.160 |
So you have text embeddings and then audio embeddings and you merge them, or you can do a fusion model. 00:37:09.160 |
Uh, but basically what these video models are these days are just diffusion scaled up. 00:37:14.160 |
And then you have, like, an inverse-CLIP-style training data setup. 00:37:18.160 |
So you have like really, really good captions. 00:37:21.160 |
And then you, you do the opposite and you scale up diffusion, but, uh, either, you know, 00:37:26.160 |
I think you could also just read the Sora one technical blog post. 00:37:31.160 |
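As a rough illustration of the contrastive-merging idea mentioned above (a generic CLIP-style recipe, not necessarily what Veo or Sora actually use), the loss looks something like this:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(text_emb: torch.Tensor,
                                image_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss that pulls matching (caption, frame) pairs together
    and pushes mismatched pairs apart, merging the two embedding spaces.
    text_emb, image_emb: [batch, dim]; row i of each is a matching pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature          # [batch, batch] similarities
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +             # text -> image direction
            F.cross_entropy(logits.T, targets)) / 2        # image -> text direction
```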
And, but it's fair to say the weights contain both like pixel level knowledge in a sense, 00:37:36.160 |
and, uh, text level, like it's, it's both in one single, like in the weights. 00:37:44.160 |
Because you have to have text understanding, but it's not like, it's not like native 00:37:51.160 |
And then those are like, uh, so like recently in the past six months, you saw some, um, 00:37:56.160 |
you saw some outputs that were like image models that can generate text. 00:38:01.160 |
Well, we covered a paper on how that works as well. 00:38:04.160 |
And it's basically just like a post train task of, um, how, how you can get them to learn, 00:38:10.160 |
learn what text looks like in the, in the image generation. 00:38:15.160 |
Um, I think, um, I'm glad you want to also do DeepSeek 3.2, 00:38:21.160 |
'cause I felt like people didn't read it 'cause it wasn't on the Luma until an hour ago. 00:38:25.160 |
But if you want to, if you want to cover it quick, it's actually a very quick paper. 00:38:52.160 |
I'm going to be doing, like, a very, very quick walk, walkthrough of the new DeepSeek paper. 00:38:58.160 |
So they just released a new model called DeepSeek V3.2 Experimental. 00:39:02.160 |
And it's all about, uh, reducing the cost of inference, especially on longer contexts. 00:39:11.160 |
So basically boosting long-context efficiency with DeepSeek Sparse Attention. 00:39:24.160 |
So basically, when we have a very long context input, like, let's say we're inferencing over 00:39:31.160 |
30K or 40K tokens, uh, the, the cost per token goes up quickly, because in the normal attention 00:39:41.160 |
mechanism, we are attending to every token in the context prior to the current token. 00:39:47.160 |
So if you're doing inference over 30K tokens, we're basically doing attention over these 30K tokens. 00:39:53.160 |
And this, uh, causes the compute and also the memory requirements to grow, uh, quadratically. 00:40:01.160 |
But the, the idea behind this sparse attention mechanism is what if we don't have to attend 00:40:08.160 |
What if we only have to attend to a limited or a finite number of tokens that doesn't grow 00:40:18.160 |
And the way they do it is by using what's called the, uh, sparse attention, which has two components: the first is a lightning indexer, 00:40:34.160 |
and the second one is a fine-grained, uh, token selection mechanism. 00:40:38.160 |
Basically, this is kind of like similar to how the mixture of expert mechanism works. 00:40:44.160 |
You have a router that determines which experts you should use. 00:40:47.160 |
And then you actually use only the, uh, limited number of experts to do the, uh, 00:40:53.160 |
the inference. The indexer here, uh, works very similarly, very similarly to the, uh, router. 00:41:00.160 |
It chooses the tokens that you only need to attend to. 00:41:04.160 |
So it basically computes an index score between the query token and every token in the sequence. 00:41:14.160 |
So basically you get the query of the current token, and then you have a key for all the previous tokens. 00:41:24.160 |
Maybe you also do averaging over number of heads, and then you get a score. 00:41:28.160 |
This score is basically like a scalar number that goes from, uh, 00:41:33.160 |
zero to one, for example, or basically like a score, uh, a single number, not, not a vector. 00:41:39.160 |
And you get this for each token in the sequence so far. 00:41:43.160 |
The next step is you can select the top K, uh, the top K tokens with this score. 00:41:50.160 |
So basically, if you set K to 2,000, you're gonna select the 2,000 tokens with the highest score. 00:41:58.160 |
And then you only have to pay attention to these tokens. 00:42:01.160 |
You don't have to, uh, attend to all the tokens in the sequence so far. 00:42:05.160 |
Uh, so H denotes the number of heads; uh, q and w are derived from the query token, 00:42:14.160 |
and then the k is derived from, uh, all the tokens in the sequence so far. 00:42:18.160 |
They say they chose ReLU because it's a simple activation function and, and it results in, in high throughput. 00:42:24.160 |
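Putting the description into pseudocode, the index score is roughly I(t, s) = sum over indexer heads j of w[t, j] * ReLU(q[t, j] · k[s]); here is a toy PyTorch sketch with my own shapes and names, not DeepSeek's actual implementation:

```python
import torch
import torch.nn.functional as F

def lightning_indexer_scores(q_idx: torch.Tensor,   # [H_idx, d_idx] indexer queries for the current token t
                             w_idx: torch.Tensor,   # [H_idx]        per-head weights derived from token t
                             k_idx: torch.Tensor    # [S, d_idx]     indexer keys for all previous tokens
                             ) -> torch.Tensor:
    """One scalar index score per previous token:
    I[t, s] = sum_j w[t, j] * ReLU(q[t, j] . k[s])."""
    dots = q_idx @ k_idx.T                               # [H_idx, S] per-head dot products
    return (w_idx[:, None] * F.relu(dots)).sum(dim=0)    # [S] summed over indexer heads
```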
And the, the important note here is that, uh, the indexer is actually attending to all the previous tokens. 00:42:33.160 |
So how are we actually saving on the computation? 00:42:36.160 |
And the reason we are, we are saving with the computation is because the, the indexer is, is relatively small. 00:42:41.160 |
It has a small number of heads, and it can be implemented in FP8. 00:42:45.160 |
So this results in massive, uh, computational efficiency. 00:42:49.160 |
And I've, I've pulled some numbers from the implementation, uh, the actual implementation of the model. 00:42:54.160 |
We can see that the top K they're using is 2048. 00:42:58.160 |
The number of heads is 64 and the hidden dimension is 128. 00:43:03.160 |
And this number is actually quite small compared to the rest of the model. 00:43:07.160 |
This is why we can get to save a lot on computations, even though we're attending to all the previous tokens in the indexer. 00:43:15.160 |
So once we have the scores, we can select the tokens we need to attend to. 00:43:19.160 |
And then we just do the normal attention, uh, calculations. 00:43:24.160 |
So we do the attention over, uh, the tokens that we have selected using the normal attention implementation of the deep seek model. 00:43:41.160 |
You select the top K tokens, and then you just do the normal attention for the deep seek models. 00:43:47.160 |
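And a toy sketch of the selection plus attention step described here: take the top-k index scores, gather those cached keys and values, and run ordinary attention over just that subset (the real model uses MLA and custom kernels; this only shows the logical flow):

```python
import torch
import torch.nn.functional as F

def sparse_attention_step(q: torch.Tensor,             # [H, d]    query heads for the current token
                          k: torch.Tensor,             # [S, H, d] cached keys for all previous tokens
                          v: torch.Tensor,             # [S, H, d] cached values
                          index_scores: torch.Tensor,  # [S]       scores from the lightning indexer
                          top_k: int = 2048) -> torch.Tensor:
    """Attend only to the top_k tokens chosen by the indexer."""
    k_sel = min(top_k, index_scores.shape[0])
    sel = index_scores.topk(k_sel).indices               # indices of the selected tokens
    k_s, v_s = k[sel], v[sel]                            # [k_sel, H, d]
    attn = torch.einsum('hd,shd->hs', q, k_s) / q.shape[-1] ** 0.5
    weights = F.softmax(attn, dim=-1)                    # [H, k_sel]
    return torch.einsum('hs,shd->hd', weights, v_s)      # [H, d] output per head
```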
Uh, any questions about this so far before we go to the training? 00:43:54.160 |
Um, does the indexer get reused across layers or does this per layer? 00:44:07.160 |
I think it's, uh, they don't mention this explicitly actually, but I think it should be maybe used across layers. 00:44:18.160 |
I'm going to have to double check this later, to be honest, but I don't think they mentioned it in the report. 00:44:28.160 |
I, I didn't quite grok why this helps if you're not reusing it, but I'll, uh, I'll have to guess I'll have to read carefully. 00:44:38.160 |
I think even if it's, uh, even if it's not reused across layer, even if, if every attention layer, uh, has its separate indexer, we're still saving quite a lot because we're only attending to 2k tokens instead of like 30k or 40k. 00:44:52.160 |
But I'm going to have to double check whether it's actually reused or not. 00:44:56.160 |
No, no, I, I get that, but I guess I didn't understand what the logic of why calculating that versus just calculating attention is cheaper. 00:45:04.160 |
I didn't understand the sort of math on that. 00:45:10.160 |
It's just, you're somehow reducing the amount of things that you're, you're quadratic over, but I don't, I don't get what that is. 00:45:20.160 |
So we're only limited to calculating the attention to 2,000 tokens. 00:45:26.160 |
No matter if you're attending to 50,000, if like the sequence length is 50,000 or 70,000 or like 1 million tokens, we're still attending only to 2,000 tokens. 00:45:39.160 |
It's not, it's no longer quadratic in the input sequence. 00:45:43.160 |
Once you've gotten the, done the indexing fine. 00:45:48.160 |
No, I just don't understand why the, the selector, I mean, the selector is also quadratic, right? 00:45:53.160 |
It's just that you've reduced what it has to do. 00:46:00.160 |
It's, it's because the number of heads are smaller and also the dimension, the hidden dimension is smaller. 00:46:06.160 |
It's much smaller than the, the normal attention. 00:46:18.160 |
I think this was a bit confusing to me as well. 00:46:19.160 |
Like, why are we like calculating over all the sequence? 00:46:23.160 |
But the reason is because the indexer is actually much smaller than the normal attention. 00:46:28.160 |
Uh, so this is how they managed to save the compute during inference. 00:46:33.160 |
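As a back-of-the-envelope illustration (my own rough arithmetic with assumed placeholder dimensions, not numbers from the report): the indexer still scans every token, but with far fewer, cheaper heads, while the expensive main attention is capped at the top-k tokens:

```python
# Rough per-token attention cost at sequence length S (constants, MLA details,
# and the indexer's FP8 discount are all ignored).
S = 128_000
H_main, d_main = 128, 192        # assumed main-attention heads x per-head dim
H_idx, d_idx = 64, 128           # lightning indexer heads x dim (quoted from the config above)
TOP_K = 2048

dense_attention  = S * H_main * d_main          # attend to every previous token
indexer_scan     = S * H_idx * d_idx            # indexer scans all tokens, but is much lighter
sparse_attention = TOP_K * H_main * d_main      # main attention now touches only TOP_K tokens

print(f"dense:            {dense_attention:,}")
print(f"indexer + sparse: {indexer_scan + sparse_attention:,}")
# Once S >> TOP_K, the second number stays well below the first, and the gap
# widens further because the indexer runs in FP8 with fewer heads.
```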
But how do they actually train such a modification? 00:46:37.160 |
And turns out the, they just do continued training. 00:46:40.160 |
They don't have to re retrain the model from scratch. 00:46:43.160 |
Uh, they can just do, uh, continued training, uh, of the current model. 00:46:49.160 |
So they start with the DeepSeek-V3.1-Terminus checkpoint. 00:46:54.160 |
And they do continued training, uh, using two stages of pre-training and also post-training. 00:47:00.160 |
So for continued pre-training, they do, uh, two stages. The first is a dense warm-up stage for the indexer, 00:47:11.160 |
because at the beginning, the indexer is initialized using random weights. 00:47:15.160 |
So they do the first step, which is like a warmup to get this to be like something that basically works. 00:47:23.160 |
And not just like random weights, uh, to align the indexer outputs with the main attention distribution. 00:47:30.160 |
They do, uh, they sum the attention scores across all attention heads, and then they do L1 normalization. 00:47:37.160 |
And then they set the KL divergence loss as the objective. 00:47:41.160 |
Uh, and with, like, a learning rate of 10 to the power of minus three, they train only for a thousand steps, 00:47:52.160 |
Each with 16 sequences of 128,000 tokens, which is kind of like around 2 billion tokens, which is actually quite cheap, uh, compared to the, the full pre-training stage. 00:48:04.160 |
Uh, the second step in pre-training is actually training the sparse attention mechanism. 00:48:10.160 |
So once the indexer has been, uh, warmed up, they, they do the fine-grained, fine-grained token selection mechanism. 00:48:17.160 |
And then they just train over all the model parameters, uh, to make the model learn how to work with only 2K tokens in the attention. 00:48:26.160 |
So they, they align the indexer outputs to the main attention distribution, but they only consider the selected tokens, uh, the top-case selected tokens according to the indexer. 00:48:39.160 |
And again, they apply the KL divergence, uh, loss. 00:48:42.160 |
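A minimal sketch of that alignment objective as described: sum the main attention weights over heads, L1-normalize them into a target distribution, and train the indexer with a KL loss to match it; in the sparse stage the same loss is restricted to the selected top-k tokens. The names, and the softmax over indexer scores, are my own simplification:

```python
from typing import Optional
import torch
import torch.nn.functional as F

def indexer_alignment_loss(main_attn: torch.Tensor,     # [H, S] main-attention weights for one query token
                           index_scores: torch.Tensor,  # [S]    raw lightning-indexer scores for that token
                           selected: Optional[torch.Tensor] = None  # [k] top-k indices (sparse stage only)
                           ) -> torch.Tensor:
    """KL(target || indexer), target = L1-normalized sum of attention over heads."""
    target = main_attn.sum(dim=0)                        # [S] summed across heads
    if selected is not None:                             # stage 2: restrict to the chosen tokens
        target, index_scores = target[selected], index_scores[selected]
    target = target / target.sum()                       # L1 normalization
    log_pred = F.log_softmax(index_scores, dim=-1)       # indexer distribution (assumed softmax)
    return F.kl_div(log_pred, target, reduction='sum')
```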
Uh, this step is actually a bit expensive because it's using almost a trillion tokens. 00:48:56.160 |
Uh, and this is not cheap for a model of this size. 00:48:59.160 |
So I think this is where most of the training cost is, uh, in this whole process. 00:49:04.160 |
So once you've done pre-training, you want to do post-training because this model is going to be like a chat model. 00:49:10.160 |
It's not just going to be a base, uh, next token prediction model. 00:49:15.160 |
So after the pre-training, they perform post-training that is quite similar to the normal deep seek post-training. 00:49:21.160 |
Uh, they, they, they have two, two, I think modifications in this version. 00:49:29.160 |
Uh, so basically they, they train specialist models on, on five, on around five tasks. 00:49:37.160 |
Each model version is, is trained to be like an expert at this task. 00:49:41.160 |
And then they generate synthetic data of these specialist models and they use it to train the generalist model that they release to the public. 00:49:51.160 |
Uh, so they mention five specialized domains. 00:49:56.160 |
Mathematics, competitive programming, uh, logical reasoning, agentic coding, and agentic search. 00:50:02.160 |
Each specialist is, is trained with large scale RL computing. 00:50:06.160 |
And then they just, uh, generate training data for the, uh, chain of thought reasoning for, for the, uh, for the generalist model. 00:50:14.160 |
And also the direct response generation, because this, this model is supposed to be like a hybrid model. 00:50:19.160 |
It can work as a reasoner and also as a direct instruct model. 00:50:24.160 |
So once the specialist models are prepared, they are used to, uh, generate the domain specific data for the final checkpoint. 00:50:33.160 |
And these, uh, the experiments show that these models, uh, they are quite similar to the specialist on the domain specific, uh, areas, but also they, they are still like a generalist model. 00:50:46.160 |
Uh, so this is like quite a nice balance between, uh, maximizing the performance on specialist tasks and also being a generalist model. 00:50:55.160 |
Uh, the second modification they make is they do mixed RL training. 00:51:01.160 |
So previously, previously the, the, uh, post training pipeline consisted of like multiple steps. 00:51:09.160 |
You do one step and then the next step and then the other step. 00:51:12.160 |
But in this version, the, they combine all the steps into one, one, uh, single stage pipeline, basically. 00:51:20.160 |
So they merge the reasoning agent and human alignment training into one RL stage. 00:51:26.160 |
Uh, and this, the reason is they want to balance the performance across the diverse tasks and domains, but prevent the catastrophic forgetting that happens when you train a model in one task and then you train it on another task. 00:51:43.160 |
Uh, for, for reasoning and, and, uh, agent tasks, they use the same rule based outcome reward. 00:51:51.160 |
They also use a length penalty and language consistency reward. 00:51:55.160 |
And also they employ a generative reward model where each prompt has its own rubrics for evaluation. 00:52:08.160 |
And the second one is language consistency versus accuracy. 00:52:11.160 |
So basically they want to maximize the accuracy of the model predictions while keeping the, uh, the length of the answer short to reduce inference cost and latency and also keeping the language consistent. 00:52:28.160 |
I'm gonna, uh, quickly go over them, but basically the, they show that the model, uh, this new, uh, sparse attention implementation is quite efficient and is quite powerful. 00:52:40.160 |
And it it's, it's like almost matching the previous model. 00:52:44.160 |
Uh, at the beginning it's, it's a bit behind, but it catches up quickly and becomes almost identical to the original model. 00:52:59.160 |
So they compare the inference cost of the original deep seek model and the sparse attention version. 00:53:06.160 |
And we can see that the original model, which is the blue line is, is, is like ballooning in cost. 00:53:11.160 |
Uh, whenever the context gets large, but the modified sparse attention version, which is the orange line. 00:53:18.160 |
It it's, it's become much cheaper to serve at a longer, uh, input sequence. 00:53:26.160 |
And the, the case is also quite similar for decoding. 00:53:28.160 |
And I think it's even more pronounced in the decoding stage, which is where the reasoning models spend most of their time on. 00:53:35.160 |
Because now we have models that do decoding for 32 K or 64 K tokens, much of which is reasoning basically. 00:53:42.160 |
Uh, they said something that's quite important. 00:53:47.160 |
This is still an experimental version and the, they're, uh, conducting validation in real world, uh, inference scenarios. 00:53:54.160 |
So that's why they call this version an experimental version. 00:53:58.160 |
And yeah, this is basically the gist of the new, uh, deep seek model. 00:54:03.160 |
If anyone has any questions, please go ahead. 00:54:14.160 |
So if you use the, uh, API, it's significantly cheaper. 00:54:19.160 |
And also this is quite, not quite, but like relatively easy to implement for almost any model. 00:54:25.160 |
Uh, so you can see, like, Kimi, uh, GLM, and, and Qwen maybe adopting this or something quite similar in the next few weeks or months. 00:54:35.160 |
Because they show that you just need to do, uh, continued training. 00:54:41.160 |
So this is quite, this is quite, quite massive if it actually, if it is as good as they claim it to be. 00:54:47.160 |
I thought there was the interesting, um, note that performance does go down slightly. 00:54:53.160 |
Like, if you look at the benchmarks between this and the previous version they had running, a few benchmarks went down a few points, like GPQA Diamond, right? 00:55:03.160 |
And I think that's why it's ever so slightly down, but, like, you know, it's half the cost. 00:55:11.160 |
Like, sometimes they're down, like for, uh, GPQA Diamond, but you can see BrowseComp is a bit higher. 00:55:18.160 |
So I think it's, it's kind of like quite similar and the next versions of models, which will, which are going to be like trained from scratch on this modified attention. 00:55:26.160 |
I think they're going to be like able to be much, much better in terms of accuracy. 00:55:39.160 |
Thank you for the quick, quick 15 minute cover. 00:55:42.160 |
Uh, next week, I think we have stuff on bio on kit and RJ is helping out too. 00:55:52.160 |
Drink scientific reasoning LMs with biological world models with soft verifiers.