Veo 3 + DeepSeek V3.2-Exp Explained: Sparse Attention for Affordable Long Contexts

Chapters
0:00 Introduction to the Veo 3 analysis paper and initial skepticism
0:58 Premise that Veo 3 can act as an LLM for video, performing general reasoning and tasks
2:09 Comparison to Sora 1 and Sora 2, and a discussion on world models
3:23 The paper's claim of Veo 3's reasoning capabilities, including maze solving, and the potential influence of an LLM prompt rewriter
4:36 Discussion on general-purpose vision understanding through large-scale training
6:05 Demonstration of Veo 3's capabilities on a web page, covering perception, modeling, manipulation, and reasoning
7:52 Skepticism regarding the true source of "reasoning" (LLM vs. video model)
9:28 Quantitative results and comparison to other models, including the use of green backgrounds for better performance
11:11 How Google attempts to isolate the video model's reasoning capabilities from the LLM rewriter
11:46 Overview of the four hierarchical capabilities: perception, modeling, manipulation, and reasoning
13:36 Detailed look at perception tasks (edge detection, segmentation) and the claim that video models will replace bespoke CV models
15:06 Discussion on modeling physical properties and optical phenomena, and manipulation tasks like background removal
15:41 Visual reasoning and the concept of "chain of frames" as analogous to "chain of thought"
17:09 Quantitative tasks, performance metrics, and comparison to Veo 2 and Nano Banana
18:01 Detailed analysis of specific quantitative tasks like edge detection, object extraction, and segmentation, highlighting the green background bias
20:26 Discussion on maze solving, image editing, and visual symmetry solving
21:57 Discussion on Veo 3's emergent zero-shot abilities and its role as a foundation model for machine vision
23:06 Framing the paper's outlook and the benefits of general capabilities
24:12 Recap of the Veo 3 paper as an analysis rather than a technical detail paper
25:19 Discussion on the speaker's skepticism about the paper's claims and the importance of capability exploration
25:40 Question about dollar-per-token cost and the comparison between specialized models and foundation models
26:10 Question on what "true reasoning" would look like in video models without an LLM rewriter
27:52 Discussion on Sora 2's system card and the absence of quantitative metrics
28:36 OpenAI's approach to system cards and safety measures for Sora 2, including moderation classifiers and output blocking
30:37 Transparency, watermarking, and internal detection tools for AI-generated content
32:02 Discussion on control nets and the limitations of API-based models
33:51 Follow-up on the dollar cost question and the future of specialized vs. generalist models
36:09 Question on how text prompts are integrated into the latent space of vision models
38:19 Introduction to the DeepSeek V3.2 Experimental paper
38:52 DeepSeek V3.2-Exp's focus on **reducing inference cost** for long contexts using sparse attention
39:54 Explanation of the **sparse attention mechanism** with a lightning indexer and fine-grained token selection
41:02 How the indexer computes an index score to select top-k tokens, and the role of its smaller size for computational efficiency
47:08 Two stages of pre-training: dense warm-up (for indexer initialization) and sparse attention mechanism training
48:51 Post-training with two modifications: **specialist distillation** and **mixed RL training**
49:23 Specialist distillation using expert models in mathematics, competitive programming, logical reasoning, agentic coding, and agentic search
50:52 Mixed RL training to balance performance across tasks and prevent catastrophic forgetting
52:22 Evaluations showing the efficiency and power of the sparse attention implementation, with cost reduction
54:10 The potential for other models to adapt this sparse attention technique due to continued training feasibility
54:49 Discussion on slight performance drops in some benchmarks for DeepSeek V3.2-Exp but significant cost savings
00:00:04.400 |
So just to recap for the recording, this is not a Veo 3 technical paper. 00:00:19.820 |
If you guys have seen the papers hyping up LLMs as zero-shot learners, 00:00:32.100 |
I felt like this paper is low-key a response, to be the first one to just come out 00:00:36.620 |
and, like, say the same thing, that, you know, 00:00:40.220 |
you can have general-purpose video models, and they're getting pretty good. 00:00:49.100 |
So basically, they're making this premise that also, you know, 00:00:58.420 |
So I guess there's some background before we get into it. 00:01:01.960 |
So they're basically making the premise that Veo 3 can be, like, an LLM for video. 00:01:09.820 |
They can, you know, they can do a bunch of different tasks, 00:01:12.660 |
and then they break down how do they test whether it's, like, a general model. 00:01:19.260 |
Can they have, like, what are these different categories? 00:01:22.680 |
Perception, modeling, manipulation, and reasoning, and then different tasks, 00:01:26.860 |
and then they do a bunch of inference, and then they're, like, you know, 00:01:29.180 |
how does this compare to specific models in those domains? 00:01:33.960 |
And they're, like, Veo 3 is actually a pretty decent generalist, but, you know, 00:01:39.780 |
calling it a reasoner, a bit of a stretch, and then they cite how this is, like, 00:01:45.380 |
there was a GPT-3 moment in LLMs where GPT-3 was, like, okay, it can do text, 00:01:54.020 |
and then, you know, it became a very, very good general model that's, like, 00:01:57.720 |
a state-of-the-art summarization, classification, all these different tasks, right? 00:02:03.120 |
So, they're making the claim that Veo 3 and, you know, video models can do the same. 00:02:09.860 |
The second, like, thing that came out this week was Sora 2. 00:02:13.960 |
Sora 2 also said, like, okay, the Sora 1 was basically, like, 00:02:19.860 |
the GPT-1 moment for video models, and Sora 2 is, like, GPT-3.5. 00:02:30.720 |
And then, you know, there's an app that we can play around with. 00:02:33.640 |
So, some background before, I guess, going into this paper of if video models can reason. 00:02:39.900 |
This weekend, I was talking to someone at OpenAI. 00:02:43.900 |
He runs the ImageGen team and is also part-time at Sora. 00:02:50.140 |
And he basically, he's, like, okay, my entire background was on world models. 00:02:54.100 |
And he's, like, the right modality for world modeling and reasoning is not video. 00:03:04.240 |
It's auto-regressive image and text because you can reason. 00:03:08.360 |
And, you know, without getting too deep into it, the premise is, like, okay, 00:03:17.180 |
That's open, you know, food for thought for you to think about. 00:03:29.320 |
They can perceive, model, manipulate the world, do early forms of reasoning because they can 00:03:35.700 |
And I was, like, okay, that's kind of interesting. 00:03:37.600 |
You know, they are solving mazes and this and that. 00:03:39.880 |
Then they make this very fun claim somewhere in here. 00:03:46.640 |
Where basically what they're doing is they're doing a lot of prompting and they're testing 00:03:55.300 |
So they're generating prompts and then they're seeing, can it model this stuff? 00:03:59.180 |
And then they note that in the vertex, so this paper is from Google. 00:04:05.820 |
But basically in the vertex API, there is a prompt rewriter. 00:04:12.740 |
So basically they're like, you know, one thing to know is the solving could be done in the 00:04:19.900 |
LLM backbone because whatever prompt you said gets rewritten, right? 00:04:25.720 |
Well, maybe the prompt is actually being rewritten and the LLM is giving information on how to 00:04:37.060 |
And so they're making this claim that, you know, there's really good task specific models 00:04:41.380 |
like segment anything from meta for segmentation, YOLO for object detection and stuff. 00:04:47.160 |
And they're like, okay, you know, can we have the same primitive in video models, like just 00:04:54.660 |
by large scale training on just text and video and web scale data? 00:04:58.220 |
Do they have general purpose vision understanding similar to how LLMs do? 00:05:03.760 |
They generated roughly 14,000 videos across 62 qualitative and seven quantitative tasks. 00:05:14.900 |
They show early forms of chains of frames, which they consider visual reasoning, like maze and 00:05:25.400 |
I am a little, you know, I don't know how I feel about that claim because you have to remember 00:05:31.660 |
that as much as you have chain of frames, we know that vision transformers can have temporal consistency. 00:05:40.420 |
That doesn't necessarily translate to reasoning, right? 00:05:43.440 |
Just because you can interpolate between frames or like in the self-driving sense, you can remember 00:05:49.560 |
where an object is just because something else is in front of it. 00:05:52.900 |
That doesn't necessarily constitute reasoning, right? 00:05:55.440 |
You have attention, you have a bit of memory and you have stored state, but that's not the 00:06:08.920 |
They have different tasks, different sub stack. 00:06:13.340 |
Before we go to that, I'm going to share the website. 00:06:29.920 |
This is the, it's like a webpage of their paper. 00:06:35.980 |
So basically, TL;DR: Veo 3 shows emergent zero-shot capabilities across tasks, indicating that video 00:06:42.960 |
models are on the path to becoming vision foundation models, just like LLMs. 00:06:49.700 |
Perception: can, can you pick out a little dot in an eye? Modeling: 00:06:54.180 |
can you, like, model basic physics? Manipulation: 00:06:57.380 |
can you, you know, make sense of opening a cap? Reasoning? 00:07:04.900 |
And then if you're interested, it's very nice. 00:07:14.900 |
And then these are kind of some of these things. 00:07:16.920 |
So, uh, edge detection, uh, segmentation, you know, you can get a nice little visualization 00:07:23.020 |
of what all these tasks are, uh, deblurring, denoising, and then they break it 00:07:35.960 |
Uh, can you see what would set on fire first? 00:07:39.700 |
I don't know if this is how paper fire spreads, uh, you know, gravity on the moon is different 00:07:49.660 |
Um, you know, so it's, it's like a nice visualization because the paper is static, uh, manipulation, 00:07:58.920 |
So like, these are some of the, the things that they tested for and prompted. 00:08:03.160 |
And it's like a nice, it's like a nice thought exercise, right? 00:08:06.880 |
If you have to break down video generation into subcategories, um, these are nice tasks that 00:08:16.320 |
So it, it's cool that they did that, um, some of this reasoning stuff, you know, it's 00:08:20.820 |
just, it's just hard to, it's hard to distinguish how much of this is the LLM that's rewriting 00:08:28.600 |
the prompt and giving explicit instructions on what to generate versus, um, the video, like 00:08:36.540 |
And also some of these are just pretty basic, right? 00:08:39.420 |
Like, are these just stochastic representations, right? 00:08:42.420 |
Like in LLM with a million parameters can probably tell you what comes next in the sequence of 00:08:50.020 |
So it can also tell big, small, small, smaller, these are not super challenging puzzles. 00:09:01.500 |
There's like little things, but you know, they have a bunch of these and then the maze. 00:09:05.040 |
I'm, I'm somewhat skeptical, but fun little visualization for people that are interested. 00:09:10.940 |
Um, I am going to change back to the paper real quick. 00:09:37.500 |
Um, feel free to chime in if you found any of this interesting, because honestly, it's 00:09:41.660 |
just a lot of, a lot of, um, a lot of basic definition examples, uh, for each task, we 00:09:49.300 |
query the publicly available Veo 2 or Veo 3 APIs. 00:09:52.840 |
We prompt the model with an initial input and the text instruction. 00:09:57.420 |
They generate 16:9 video at 720p, 24 FPS, for eight seconds. 00:10:05.520 |
So according to vertex documentation, the API uses an LLM based prompt rewriter. 00:10:12.300 |
This means that some of the solutions are likely to come from the LLM instead of the video. 00:10:17.380 |
For example, Sudoku. Uh, we treat the system, the rewriter and the video generator, as a single 00:10:28.420 |
Uh, however, to isolate the video model's reasoning capabilities, 00:10:32.560 |
We verified that standalone LLMs couldn't reliably solve some key tasks. 00:10:39.940 |
Uh, here, I think you also want to digest, like, dig into this a little more. 00:10:44.500 |
Um, some LLMs are very, very bad at vision, right? 00:10:53.720 |
It can't tell like there's a barrier in the image, even though it's like a really smart LLM. 00:11:00.500 |
It could probably create something with a barrier. 00:11:03.980 |
So I could create a solution to Sudoku or whatever maze, and it could create the maze. 00:11:08.780 |
Uh, but relying on its vision, you know, it's, it's kind of cooked. 00:11:15.760 |
Um, RJ is asking about how they tried to isolate this. 00:11:19.640 |
Honestly, the sad thing was, this is the only, this is like the only five lines of the paper 00:11:28.280 |
It doesn't even tell you that there's this, and this is like the only section that they 00:11:33.260 |
Uh, they, they do bring it up with like, okay, is this going to be, um, you know, is this going 00:11:40.760 |
Cause it's, it's definitely not cause they're from Google. 00:11:43.900 |
It's cause it's the best on the leaderboards. 00:11:49.680 |
First, there's four hierarchical capabilities. 00:11:52.320 |
These, they all build on the last they claim. 00:11:54.220 |
So perception, can you understand visual information modeling, which builds on the perception? 00:11:59.900 |
You know, uh, can you form stuff in a visual world manipulation? 00:12:03.660 |
Can you alter perceive stuff in a visual world and reasoning? 00:12:09.760 |
And then, you know, we basically saw those examples before, right? 00:12:12.600 |
So perception, uh, can you, can you do de-noising? 00:12:16.260 |
Uh, can you, can you, you know, highlight what's in a thing? 00:12:19.860 |
And then modeling is like, okay, do you know what happens in this world? 00:12:23.700 |
So like, if I drop something light on water, it floats manipulation. 00:12:29.560 |
So like, if you have a guy standing facing forward, can you picture the rest of the 00:12:33.980 |
body and stuff like, okay, if you open a jar, what happens, right? 00:12:43.000 |
Um, some stuff, uh, and they're, they're trying to make those claims that, 00:12:47.740 |
There's, there's four levels of this. Uh, for each section, they prompt Veo 3 12 00:12:52.960 |
times and record the success rate in the caption. 00:12:55.720 |
Uh, there's interesting little distinctions they make later on. 00:12:58.780 |
Some stuff is like, video models really like to keep generating. 00:13:04.340 |
So after they finish a task, they still keep going until the end of the sequence. 00:13:08.040 |
So they report a best frame and a last frame, because sometimes the 00:13:13.280 |
Um, and then there's also, like, pass@k, and they really like to mention how this is all zero-shot. 00:13:20.660 |
It's not like LLMs where you're doing few shot prompting. 00:13:24.360 |
Uh, and then, you know, uh, success rate greater than zero in 12 attempts means that 00:13:29.740 |
the ability is there, while success rate closer to one means it's reliable. 00:13:35.460 |
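To make those metrics concrete, here is a minimal sketch (my own, not code from the paper) of a per-task success rate over 12 attempts and the standard unbiased pass@k estimator:

```python
from math import comb

def success_rate(successes: list[int]) -> float:
    """Fraction of the attempts that solved the task (closer to 1 = reliable)."""
    return sum(successes) / len(successes)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled attempts
    succeeds, given c successes observed in n total attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

attempts = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # made-up outcomes for one task
print(success_rate(attempts))                      # 0.25 -> "the ability is there"
print(pass_at_k(n=12, c=sum(attempts), k=5))       # chance one of 5 samples solves it
```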
Uh, so stuff like segmentation, augment, uh, object detection, edge detection, all those, 00:13:43.820 |
So they test, uh, edge detection, segmentation, keypoint localization, super-resolution, blind 00:13:49.380 |
deblurring, denoising, low-light enhancement, a bunch of these things. 00:13:53.880 |
Uh, the takeaway, uh, I think is pretty cool. 00:13:57.080 |
The takeaway is basically that, just like LLMs, video models will replace bespoke models in vision, 00:14:02.500 |
uh, once they become sufficiently cheap and reliable. 00:14:06.240 |
I think we should think more about this, uh, claim a little bit more like, so one thing 00:14:14.180 |
is with LLMs, small LLMs do decent, but they're not being used as much, right? 00:14:21.720 |
We don't have like small trained encoders that were used on the edge before, but one thing 00:14:28.860 |
we do have is small computer vision models used everywhere. 00:14:32.280 |
So like, think your car's, uh, ADAS system, right? 00:14:36.040 |
You have a very shitty, small computer vision model that can detect lanes and detect objects. 00:14:42.020 |
And like, you know, you don't have any memory for running big models. 00:14:45.800 |
So I think there's still quite a space for small computer vision models because a lot of them 00:14:54.280 |
Like your, um, you know, your, your home security system is not going to be running Veo 3. 00:14:59.420 |
Maybe it will, maybe, maybe down, down the line. 00:15:02.080 |
But it's a little different from LLMs in that sense. 00:15:14.220 |
So buoyancy, the air resistance of dropping that, uh, scarf or whatever it was, uh, optical 00:15:20.580 |
phenomena like light reflections, refraction, uh, adding and mixing colors. 00:15:25.820 |
Then in manipulation, they want to manipulate stuff, right? 00:15:28.420 |
So can we remove a background and infill stuff? 00:15:30.700 |
Uh, can we colorize images, do inpainting, outpainting? 00:15:36.980 |
And then, you know, if you're curious, there's the other like 30, 40 things that they do, but 00:15:43.700 |
Then the last one, visual reasoning. It's basically, um, you know, since stuff is frame by frame, 00:15:50.380 |
can the parallel of chain of thought in LLMs be, like, chain of frames? 00:15:55.760 |
They, they, they, they test all these things, you know, fitting shapes into holes, sorting, 00:16:02.760 |
Uh, the, the third takeaway is frame-by-frame video generation parallels chain 00:16:10.680 |
of thought. Just like chain of thought enables language models to reason, chain of frames enables video models to reason. 00:16:18.540 |
Once again, you know, I don't know that just because you can have temporal consistency, 00:16:23.020 |
that means that you are reasoning per se, right? 00:16:26.740 |
Part of this is just architecture, like fundamentals, right? 00:16:30.820 |
Vision transformers have a form of temporal consistency. 00:16:34.300 |
It doesn't mean that there's any extra reasoning applied to them. 00:16:36.980 |
You're, you're not doing like a chain of thought style reasoning, but it's, it's, uh, it's a thing 00:16:43.500 |
Uh, and then, you know, this is, this is some of it. 00:16:46.260 |
So more of that, um, and this is how they test stuff. 00:16:51.860 |
Um, they, they go, they go more into this later, but there's basically an overlap between some 00:16:56.560 |
of these. Some of the fun stuff they say is, like, uh, low-key Veo 3 was just too good on this. 00:17:12.820 |
So there's, there's seven tasks that they test, uh, you know, edge detection, segmentation, 00:17:24.760 |
Uh, Veo 3 likes to continue scenes even after task completion. 00:17:30.260 |
Uh, where, where applicable, they, they compare it to Nano Banana, um, on some stuff, you know, 00:17:40.500 |
they, they match Nano Banana or they even exceed it. 00:17:43.820 |
Um, and then for video models, there's substantial improvement when you do pass@k with k equals 10 or more attempts. 00:17:52.260 |
Uh, they also plot all this stuff with Veo 2, and they're like, okay, Veo 2 kind of sucked. 00:17:56.200 |
Veo 2 is very bad on all these. But now is the, the long part where they go into all the numbers and stuff. 00:18:04.760 |
If anyone's interested, you know, we can always pause and dive deep, but edge detection, um, 00:18:10.400 |
prompted to detect, therefore perceive edges. 00:18:14.000 |
I think a lot of this also goes into their prompting, right? 00:18:16.860 |
Let alone, there is a rewriter, but you know, how are they prompting these things? 00:18:20.160 |
So, uh, original image generated frame extracted masks, and then there's the ground truth mask. 00:18:25.320 |
So then they often grade these by the overlap of, um, the output, uh, over a subset of 50 easy images. 00:18:35.420 |
Here's their prompt blank, each distinct entity in overlaid flat color background fades from white to green, dah, dah, dah, dah, dah. 00:18:43.260 |
Some fun stuff they know is that the model always does better when there's a green background. 00:18:48.020 |
So, you know, green screen-esque, um, same thing here. 00:18:54.020 |
So, uh, you know, the background changes to white animals line up in a row. 00:18:59.160 |
How do they perform? Veo 3 is goated. Veo 2 was okay here. 00:19:04.200 |
Nano Banana is a lot better, but Veo 3 is up there. 00:19:10.760 |
It does significantly better than with no green background, but Nano Banana doesn't see the same 00:19:16.380 |
change, or they don't, they don't really plot it. 00:19:22.520 |
They, they measure it by, you know, mean intersection over union. 00:19:29.020 |
Uh, basically, uh, Veo 3 achieves 0.74 for best frame, comparable to Nano Banana's 00:19:35.660 |
0.73. Uh, Veo 3 still lags behind specialized stuff, right? 00:19:40.820 |
So Segment Anything version two from Meta, this is a specialized segmentation model. 00:19:47.700 |
I'm slightly annoyed that they don't show the performance of that, but it's okay. 00:19:51.760 |
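For reference, mean intersection over union is just the average IoU between predicted and ground-truth masks over the evaluation set; a rough NumPy sketch, assuming binary masks:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks (True = foreground)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def mean_iou(preds: list[np.ndarray], gts: list[np.ndarray]) -> float:
    """Average IoU over (predicted, ground-truth) mask pairs,
    e.g. the 50-image subset mentioned above."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```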
Um, these are also not super cherry picked samples. 00:19:55.960 |
Uh, they often list which, um, data set they pull from and they just sample some images. 00:20:04.680 |
Um, but yeah, they, they consistently perform better with green backgrounds than white. 00:20:12.000 |
So that's kind of interesting possibly due to the widespread of green screens. 00:20:23.520 |
Um, I think there's reasoning, the current one. 00:20:27.820 |
We're still discussing reasoning: maze solving. 00:20:30.020 |
I think this is where I'm going to start to go a little faster. 00:20:32.360 |
Uh, Veo 3 gets a lot better performance, like 90, 92 percent, given simpler tasks. 00:20:41.480 |
Uh, there's a strong bias for animated scenes that might induce unintended changes. 00:20:54.860 |
Uh, the, the interesting things here are like, okay. 00:20:59.480 |
It's doing significantly better than Nano Banana and way, way better than Veo 2. 00:21:04.700 |
I don't know what the hell happened to Veo 2, but Veo 2 is, like, you know, somewhere in the, like, sub-10% for a lot of these, uh, which is, which is rough. 00:21:15.400 |
So, like, you give Veo 2 a random pattern and you tell it to complete it. 00:21:29.640 |
It's, it's being told to reflect the pattern across the center. 00:21:32.260 |
So Veo 2 couldn't do it at all. Nano Banana can't reflect this that well. 00:21:36.180 |
It's, like, 28% in 10 attempts, but, uh, the best frame of Veo 3 can zero-shot it, like, a hundred percent accuracy. 00:21:44.040 |
These kinds of interesting tasks: maze solving, uh, visual symmetry solving, visual analogy completion, um, more stuff. 00:22:00.040 |
They really like, like foundational, like, okay, guys, just don't forget, you know, breakthroughs happen. 00:22:08.320 |
We are here to make the case that machine vision is on the cusp of a similar paradigm shift enabled by emergent abilities of large scale video models. 00:22:17.500 |
Our core finding is that Veo 3 can solve a wide range of tasks in a zero-shot manner, spanning the full vision stack: perception, modeling, manipulation, even early forms of visual reasoning. 00:22:30.000 |
It's, you know, a massive, consistent improvement from Veo 2 to Veo 3, which indicates that video models will become general-purpose foundation models, just as LLMs have. 00:22:40.160 |
Uh, performance is a lower bound; video generation is expensive, but costs tend to fall. 00:22:46.160 |
Uh, yeah, jack of many trades but master of a few: Veo 3's performance is below the state of the art of specialized models. 00:22:53.160 |
This mirrors the early days of LLMs: GPT-3 reported performance 00:22:57.160 |
well below fine-tuned models on many tasks; that didn't stop them from becoming foundational. 00:23:02.160 |
Uh, outlook: this is an, uh, exciting time for vision. 00:23:07.160 |
I do like the framing of the paper and I do think it's like net good for the average user, right? 00:23:13.160 |
Like if you need object detection or like, you know, if you want like a outline of how to draw something, it's a lot more work to go and have to find an object detection. 00:23:26.160 |
model or segment anything, or some web app that like, you know, you gotta enter your info to do versus just having like, you know, okay. 00:23:39.160 |
Uh, for example, I was like at a little event and there was like a page to, you know, draw yourself and everyone was drawing themselves. 00:23:48.160 |
And one girl just took a selfie, sent it to Veo 3, and was like, okay, give me an outline that I can trace. 00:23:55.160 |
And I was like, oh yeah, that's kind of cool. 00:23:58.160 |
And I was like, I never would have expected you to use Veo 3 for this, but she did it. 00:24:03.160 |
Uh, and then, you know, we have a bunch more examples towards the bottom, but I think, I think the website stuff is kind of more fun. 00:24:15.160 |
Um, this is a paper analyzing and trying to make a claim that, uh, video models can be good generalist. 00:24:23.160 |
I think the thing that I probably skipped over that I thought was interesting was there's that section on them being, um, this is all zero shot. 00:24:35.160 |
Um, so, you know, compared to LLMs where you can do like 64 shot, um, this is zero shot. 00:24:43.160 |
So they compare bespoke models against the zero-shot capability of the video model. 00:24:50.160 |
Uh, but I think it's also, like, cool to see how much better this is than, um, Veo 2. 00:24:56.160 |
But yeah, okay, I guess, I guess I skipped it because they just keep showing zero-shot everywhere, but they did make a point somewhere that, you know, they want to reiterate that this is all zero-shot. 00:25:11.160 |
That's, that's the quick overview of the paper though. 00:25:23.160 |
Like every new model needs this, um, and reasoning is obviously, uh, very important. 00:25:31.160 |
It's just that people hyped it up as a Veo 3 paper when it's not a Veo 3 technical paper, but yeah, it's, it's a nice distinction in how they, um, how they, how they split capabilities and stuff. 00:25:50.160 |
If they desire to scale this up to be a vision foundation model, nothing about dollar token costs. 00:25:59.160 |
Um, they, they mentioned that stuff will get cheaper. 00:26:08.160 |
Actually, I don't know how to raise my hand in zoom. 00:26:11.160 |
Um, uh, what is, so what would a true, I think that I heard you say that like a chain, this is not really reasoning. 00:26:24.160 |
It's like, uh, because of the, because the LLM is already doing the, the, the next frame prediction. 00:26:34.160 |
So the, the, the thing there is, we don't know what the, the, the video model is actually generating, right? 00:26:39.160 |
Like when you pass in a prompt, like, uh, generate, uh, maze that does this and solve it. 00:26:46.160 |
The LLM like pre-prompter could literally be saying, okay, outline of mazes like this, you know, uh, like ball starts in this location, then goes here, then goes here, then goes here. 00:26:58.160 |
So like if the LLM is explicitly solving it, which it really could be, then, then it's not the video model. 00:27:07.160 |
It's just the LLM prompt that's being rewritten. 00:27:09.160 |
And I mean, like, you could test this on video models that are not Veo 3, where you just don't use a prompt rewriter, right? 00:27:18.160 |
Like, test your own video generation model and don't, don't have a prompt rewriter. But, like, nonetheless, you know, it's, it's not all free inference. 00:27:27.160 |
It's like not easy to generate this many samples. 00:27:36.160 |
Like if you, if you just didn't have it rewritten, then, you know, you might have different results. 00:27:52.160 |
Um, what other fun stuff we can, we can look at the Sora technical report. 00:28:01.160 |
Oh, no, I think it's just like a, sorry, sorry, sorry. 00:28:04.160 |
Safety, safety card, system card, system card. 00:28:09.160 |
So my spicy tweet about Sora was going to be, there's no numbers in Sora. 00:28:13.160 |
The only numbers are about, like, uh, refusal rates, um, bad refusals versus good refusals. 00:28:22.160 |
Um, but there's no numbers, there's no evals, no, nothing. 00:28:30.160 |
I think they'll put out a technical blog post later. 00:28:36.160 |
Um, the reality of the situation is Sora one paper was pretty good. 00:28:46.160 |
It's like, if you were to, you know, get into a legal dispute or ask someone like, Hey, should 00:29:03.160 |
But yeah, it's, it's mostly just safety checks. 00:29:13.160 |
Um, I thought it's interesting that, you know, now they allow like face uploads. 00:29:29.160 |
So, you know, input prompts, output video frames, transcripts, comments, all that stuff 00:29:39.160 |
So the strategy involves blocking the tool from generating a video. 00:29:43.160 |
So is my screen sharing to the, yeah, that's okay. 00:29:46.160 |
Uh, basically, you know, this one is the most straightforward, right? 00:29:49.160 |
If you send in a prompt that gets flagged, then of course it will not be generated. 00:29:56.160 |
Output blocking, uh, this is after the video has been generated. 00:30:04.160 |
So there's a child sexual abuse material classification model. 00:30:12.160 |
Safety response monitoring blocks output that violates policies. 00:30:18.160 |
Um, there's additional safety stuff for people under 18 and anyone under 13. 00:30:32.160 |
Watermarking: stuff will be watermarked, plus internal detection tools to help see if, if stuff was created by Sora. 00:30:47.160 |
So even if you remove the watermark, you know, they, they want to see if the video audio was generated. 00:30:52.160 |
Um, when I read this, by the way, um, I, I was actually thinking through like, what do you really want to see in like an industry standard? 00:31:00.160 |
Um, it's that OpenAI and Google are converging on the same thing. 00:31:04.160 |
And they're not, uh, Google has its own SynthID thing. 00:31:15.160 |
Um, you've been investing in watermarking for a long time. 00:31:19.160 |
I thought you meant, like, a safety, like, distinction for what Veo 3 can and can't output. 00:31:27.160 |
There's, there's quite a few things for watermarking. 00:31:32.160 |
Like if you can do something better and you can put more effort into it, why, why match what Google does? 00:31:51.160 |
I feel like they, they can put out like a nice moderation API on this, you know, that would be interesting. 00:32:01.160 |
Uh, someone's asking when they'll have Stable Diffusion kinds of modifiers, like ControlNet and stuff. 00:32:08.160 |
We'll be able to actually do almost anything we want. 00:32:13.160 |
What you're missing is these are, like, API-based, walled-garden models, right? 00:32:18.160 |
Uh, you, you can only do so much, so, uh, you can take the output of this and then apply a ControlNet to it. 00:32:25.160 |
But like, that's not the same as taking an open, open video model and doing whatever you want. 00:32:31.160 |
So you want to use what the private labs have put out, then, you know, you got to use whatever their API lets you use or the app. 00:32:41.160 |
Um, misuse, do not support this, do not support public figures, blocking generations that include real people. 00:32:50.160 |
Uh, you can't upload a person of a picture of a person unless they opt in. 00:33:10.160 |
I tried that I was like trying to get to see if they could reason. 00:33:13.160 |
Um, they all kind of still failed on that. Child safety, teen safety, model output restrictions 00:33:22.160 |
It's like not that long, but you know, basic stuff to read. 00:33:42.160 |
These numbers don't mean much, but yeah, that's, that's kind of sort of safety card. 00:33:51.160 |
So our first quick paper club, we finished early. 00:33:53.160 |
I actually did want to follow up on the question that I had asked about the dollar cost of it 00:33:58.160 |
Um, the reason I ask is because like, I can imagine two separate worlds, one in which like a model 00:34:04.160 |
routing gets better and the specialized models get better too, which I can imagine. 00:34:08.160 |
Like if specialized computer vision models, I could, let me, let me back up the trend that 00:34:14.160 |
Wouldn't that apply to both specialized computer vision models, or any model, and a foundation 00:34:19.160 |
model? And if that's the case, if you had a world in which there was, like, better model routing to 00:34:24.160 |
specialized models versus just this, like, God model. 00:34:29.160 |
I feel like there could be a new paper about that, or they haven't talked. 00:34:37.160 |
Um, I think the thing with specialized models is they don't often just specialize on performance, 00:34:49.160 |
A lot of these things are often on device or need a certain low latency. 00:34:54.160 |
And for their use case, they're, they're specialized for that in terms of like architecture changes. 00:35:00.160 |
Some of them are different architectures, so it won't benefit in this. 00:35:05.160 |
And same thing with the like routing expert type thing. 00:35:12.160 |
Like if you want LLMs for different use cases, sure. 00:35:16.160 |
You can route to a bunch of stuff, but you know, uh, what, what does that look like for video? 00:35:21.160 |
Like, are you, are you trying to get understanding? 00:35:25.160 |
Like, are there tasks or do you just want general? 00:35:32.160 |
Like, I think it's also, there aren't many people hosting these, right? 00:35:37.160 |
Like you don't have many third party do everything video all in one platforms, right? 00:35:42.160 |
Like you have a few video editing apps, but they're not like consumer, like edge detection, 00:35:49.160 |
Like there's not many of those that would do this routing, but it's, it's an interesting thought experiment. 00:35:56.160 |
Like, I think, I think it's cool, but I don't know how much it falls in. 00:36:00.160 |
Like, you know, how much you really get out of it. 00:36:09.160 |
Any, any other thoughts, questions, or we move on for next week. 00:36:14.160 |
Well, I have another question actually, then if we have some time, um, like, so my question 00:36:20.160 |
is, uh, with these video generation models, the, just so I have a better understanding the, 00:36:25.160 |
the part that, uh, introspects a text prompt, like how is that baked into the latent space 00:36:32.160 |
Basically what we can call a vision model in a sense, right? 00:36:35.160 |
Like, is it all in one latent space or like, is it separated out? 00:36:40.160 |
We did, uh, we did a deep dive on the Sora 1 technical blog post that shows this. 00:36:49.160 |
So we've done, like, a few papers that were, like, multimodal image and text, and those show 00:36:58.160 |
Um, basically you, you have a contrastive loss that merges the two embedding spaces, right? 00:37:04.160 |
So you have text embeddings and then audio embeddings and you merge them, or you can do a fusion model. 00:37:09.160 |
Uh, but basically what these video models are these days are just diffusion scaled up. 00:37:14.160 |
And then you have, like, an inverse-CLIP-style training data setup. 00:37:18.160 |
So you have like really, really good captions. 00:37:21.160 |
And then you, you do the opposite and you scale up diffusion, but, uh, either, you know, 00:37:26.160 |
I think you could also just read the Sora one technical blog post. 00:37:31.160 |
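As a rough illustration of the contrastive-merging idea mentioned above (a generic CLIP-style recipe, not necessarily what Veo or Sora actually use), the loss looks something like this:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(text_emb: torch.Tensor,
                                image_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss that pulls matching (caption, frame) pairs together
    and pushes mismatched pairs apart, merging the two embedding spaces.
    text_emb, image_emb: [batch, dim]; row i of each is a matching pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature          # [batch, batch] similarities
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +             # text -> image direction
            F.cross_entropy(logits.T, targets)) / 2        # image -> text direction
```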
And, but it's fair to say the weights contain both like pixel level knowledge in a sense, 00:37:36.160 |
and, uh, text level, like it's, it's both in one single, like in the weights. 00:37:44.160 |
Because you have to have text understanding, but it's not like, it's not like native 00:37:51.160 |
And then those are like, uh, so like recently in the past six months, you saw some, um, 00:37:56.160 |
you saw some outputs that were like image models that can generate text. 00:38:01.160 |
Well, we covered a paper on how that works as well. 00:38:04.160 |
And it's basically just like a post train task of, um, how, how you can get them to learn, 00:38:10.160 |
learn what text looks like in the, in the image generation. 00:38:15.160 |
Um, I think, um, I'm glad you want to also do DeepSeek 3.2, 00:38:21.160 |
'cause I felt like people didn't read it 'cause it wasn't on the Luma until an hour ago. 00:38:25.160 |
But if you want to, if you want to cover it quick, it's actually a very quick paper. 00:38:52.160 |
I'm going to be doing, like, a very, very quick walk, walkthrough of the new DeepSeek paper. 00:38:58.160 |
So they just released a new model called DeepSeek V3.2 Experimental. 00:39:02.160 |
And it's all about, uh, reducing the cost of inference, especially on longer contexts. 00:39:11.160 |
So basically boosting long-context efficiency with DeepSeek Sparse Attention. 00:39:24.160 |
So basically, when we have a very long context input, like, let's say we're inferencing over 00:39:31.160 |
30K or 40K tokens, uh, the, the cost per token goes up quickly, because in the normal attention 00:39:41.160 |
mechanism, we are attending to every token in the context prior to the current token. 00:39:47.160 |
So if you're doing inference over 30K tokens, we're basically doing attention over these 30K tokens. 00:39:53.160 |
And this, uh, causes the compute and also the memory requirements to grow, uh, quadratically. 00:40:01.160 |
But the, the idea behind this sparse attention mechanism is what if we don't have to attend 00:40:08.160 |
What if we only have to attend to a limited or a finite number of tokens that doesn't grow 00:40:18.160 |
And the way they do it is by using what's called the, uh, sparse attention, which has two components: the first is a lightning indexer, 00:40:34.160 |
and the second one is a fine-grained, uh, token selection mechanism. 00:40:38.160 |
Basically, this is kind of like similar to how the mixture of expert mechanism works. 00:40:44.160 |
You have a router that determines which experts you should use. 00:40:47.160 |
And then you actually use only the, uh, limited number of experts to do the, uh, 00:40:53.160 |
the inference. The indexer here, uh, works very similarly, very similarly to the, uh, router. 00:41:00.160 |
It chooses the tokens that you only need to attend to. 00:41:04.160 |
So it basically computes an index score between the query token and every token in the sequence. 00:41:14.160 |
So basically you get the query of the current token, and then you have a key for all the previous tokens. 00:41:24.160 |
Maybe you also do averaging over number of heads, and then you get a score. 00:41:28.160 |
This score is basically like a scalar number that goes from, uh, 00:41:33.160 |
zero to one, for example, or basically like a score, uh, a single number, not, not a vector. 00:41:39.160 |
And you get this for each token in the sequence so far. 00:41:43.160 |
The next step is you can select the top K, uh, the top K tokens with this score. 00:41:50.160 |
So basically, if you set K to 2,000, you're gonna select the 2,000 tokens with the highest score. 00:41:58.160 |
And then you only have to pay attention to these tokens. 00:42:01.160 |
You don't have to, uh, attend to all the tokens in the sequence so far. 00:42:05.160 |
Uh, so H denotes the number of heads; uh, q and w are derived from the query token, 00:42:14.160 |
and then the k is derived from, uh, all the tokens in the sequence so far. 00:42:18.160 |
They say they chose ReLU because it's a simple activation function and, and it results in, in high throughput. 00:42:24.160 |
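Putting the description into pseudocode, the index score is roughly I(t, s) = sum over indexer heads j of w[t, j] * ReLU(q[t, j] · k[s]); here is a toy PyTorch sketch with my own shapes and names, not DeepSeek's actual implementation:

```python
import torch
import torch.nn.functional as F

def lightning_indexer_scores(q_idx: torch.Tensor,   # [H_idx, d_idx] indexer queries for the current token t
                             w_idx: torch.Tensor,   # [H_idx]        per-head weights derived from token t
                             k_idx: torch.Tensor    # [S, d_idx]     indexer keys for all previous tokens
                             ) -> torch.Tensor:
    """One scalar index score per previous token:
    I[t, s] = sum_j w[t, j] * ReLU(q[t, j] . k[s])."""
    dots = q_idx @ k_idx.T                               # [H_idx, S] per-head dot products
    return (w_idx[:, None] * F.relu(dots)).sum(dim=0)    # [S] summed over indexer heads
```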
And the, the important note here is that, uh, the indexer is actually attending to all the previous tokens. 00:42:33.160 |
So how are we actually saving on the computation? 00:42:36.160 |
And the reason we are, we are saving with the computation is because the, the indexer is, is relatively small. 00:42:41.160 |
It has a small number of heads, and it can be implemented in FP8. 00:42:45.160 |
So this results in massive, uh, computational efficiency. 00:42:49.160 |
And I've, I've pulled some numbers from the implementation, uh, the actual implementation of the model. 00:42:54.160 |
We can see that the top K they're using is 2048. 00:42:58.160 |
The number of heads is 64 and the hidden dimension is 128. 00:43:03.160 |
And this number is actually quite small compared to the rest of the model. 00:43:07.160 |
This is why we can get to save a lot on computations, even though we're attending to all the previous tokens in the indexer. 00:43:15.160 |
So once we have the scores, we can select the tokens we need to attend to. 00:43:19.160 |
And then we just do the normal attention, uh, calculations. 00:43:24.160 |
So we do the attention over, uh, the tokens that we have selected using the normal attention implementation of the deep seek model. 00:43:41.160 |
You select the top K tokens, and then you just do the normal attention for the deep seek models. 00:43:47.160 |
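And a toy sketch of the selection plus attention step described here: take the top-k index scores, gather those cached keys and values, and run ordinary attention over just that subset (the real model uses MLA and custom kernels; this only shows the logical flow):

```python
import torch
import torch.nn.functional as F

def sparse_attention_step(q: torch.Tensor,             # [H, d]    query heads for the current token
                          k: torch.Tensor,             # [S, H, d] cached keys for all previous tokens
                          v: torch.Tensor,             # [S, H, d] cached values
                          index_scores: torch.Tensor,  # [S]       scores from the lightning indexer
                          top_k: int = 2048) -> torch.Tensor:
    """Attend only to the top_k tokens chosen by the indexer."""
    k_sel = min(top_k, index_scores.shape[0])
    sel = index_scores.topk(k_sel).indices               # indices of the selected tokens
    k_s, v_s = k[sel], v[sel]                            # [k_sel, H, d]
    attn = torch.einsum('hd,shd->hs', q, k_s) / q.shape[-1] ** 0.5
    weights = F.softmax(attn, dim=-1)                    # [H, k_sel]
    return torch.einsum('hs,shd->hd', weights, v_s)      # [H, d] output per head
```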
Uh, any questions about this so far before we go to the training? 00:43:54.160 |
Um, does the indexer get reused across layers or does this per layer? 00:44:07.160 |
I think it's, uh, they don't mention this explicitly actually, but I think it should be maybe used across layers. 00:44:18.160 |
I'm going to have to double check this later, to be honest, but I don't think they mentioned it in the report. 00:44:28.160 |
I, I didn't quite grok why this helps if you're not reusing it, but I'll, uh, I'll have to guess I'll have to read carefully. 00:44:38.160 |
I think even if it's, uh, even if it's not reused across layer, even if, if every attention layer, uh, has its separate indexer, we're still saving quite a lot because we're only attending to 2k tokens instead of like 30k or 40k. 00:44:52.160 |
But I'm going to have to double check whether it's actually reused or not. 00:44:56.160 |
No, no, I, I get that, but I guess I didn't understand what the logic of why calculating that versus just calculating attention is cheaper. 00:45:04.160 |
I didn't understand the sort of math on that. 00:45:10.160 |
It's just, you're somehow reducing the amount of things that you're, you're quadratic over, but I don't, I don't get what that is. 00:45:20.160 |
So we're only limited to calculating the attention to 2,000 tokens. 00:45:26.160 |
No matter if you're attending to 50,000, if like the sequence length is 50,000 or 70,000 or like 1 million tokens, we're still attending only to 2,000 tokens. 00:45:39.160 |
It's not, it's no longer quadratic in the input sequence. 00:45:43.160 |
Once you've gotten the, done the indexing fine. 00:45:48.160 |
No, I just don't understand why the, the selector, I mean, the selector is also quadratic, right? 00:45:53.160 |
It's just that you've reduced what it has to do. 00:46:00.160 |
It's, it's because the number of heads are smaller and also the dimension, the hidden dimension is smaller. 00:46:06.160 |
It's much smaller than the, the normal attention. 00:46:18.160 |
I think this was a bit confusing to me as well. 00:46:19.160 |
Like, why are we like calculating over all the sequence? 00:46:23.160 |
But the reason is because the indexer is actually much smaller than the normal attention. 00:46:28.160 |
Uh, so this is how they managed to save the compute during inference. 00:46:33.160 |
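As a back-of-the-envelope illustration (my own rough arithmetic with assumed placeholder dimensions, not numbers from the report): the indexer still scans every token, but with far fewer, cheaper heads, while the expensive main attention is capped at the top-k tokens:

```python
# Rough per-token attention cost at sequence length S (constants, MLA details,
# and the indexer's FP8 discount are all ignored).
S = 128_000
H_main, d_main = 128, 192        # assumed main-attention heads x per-head dim
H_idx, d_idx = 64, 128           # lightning indexer heads x dim (quoted from the config above)
TOP_K = 2048

dense_attention  = S * H_main * d_main          # attend to every previous token
indexer_scan     = S * H_idx * d_idx            # indexer scans all tokens, but is much lighter
sparse_attention = TOP_K * H_main * d_main      # main attention now touches only TOP_K tokens

print(f"dense:            {dense_attention:,}")
print(f"indexer + sparse: {indexer_scan + sparse_attention:,}")
# Once S >> TOP_K, the second number stays well below the first, and the gap
# widens further because the indexer runs in FP8 with fewer heads.
```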
But how do they actually train such a modification? 00:46:37.160 |
And turns out the, they just do continued training. 00:46:40.160 |
They don't have to re retrain the model from scratch. 00:46:43.160 |
Uh, they can just do, uh, continued training, uh, of the current model. 00:46:49.160 |
So they start with the DeepSeek-V3.1-Terminus checkpoint. 00:46:54.160 |
And they do continued training, uh, using two stages of pre-training and also post-training. 00:47:00.160 |
So for continued pre-training, they do, uh, two stages. The first is a dense warm-up stage for the indexer, 00:47:11.160 |
because at the beginning, the indexer is initialized using random weights. 00:47:15.160 |
So they do the first step, which is like a warmup to get this to be like something that basically works. 00:47:23.160 |
And not just like random weights, uh, to align the indexer outputs with the main attention distribution. 00:47:30.160 |
They do, uh, they sum the attention scores across all attention heads, and then they do L1 normalization. 00:47:37.160 |
And then they set the KL divergence loss as the objective. 00:47:41.160 |
Uh, and with, like, a learning rate of 10 to the power of minus three, they train only for a thousand steps, 00:47:52.160 |
Each with 16 sequences of 128,000 tokens, which is kind of like around 2 billion tokens, which is actually quite cheap, uh, compared to the, the full pre-training stage. 00:48:04.160 |
Uh, the second step in pre-training is actually training the sparse attention mechanism. 00:48:10.160 |
So once the indexer has been, uh, warmed up, they, they do the fine-grained, fine-grained token selection mechanism. 00:48:17.160 |
And then they just train over all the model parameters, uh, to make the model learn how to work with only 2K tokens in the attention. 00:48:26.160 |
So they, they align the indexer outputs to the main attention distribution, but they only consider the selected tokens, uh, the top-case selected tokens according to the indexer. 00:48:39.160 |
And again, they apply the KL divergence, uh, loss. 00:48:42.160 |
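A minimal sketch of that alignment objective as described: sum the main attention weights over heads, L1-normalize them into a target distribution, and train the indexer with a KL loss to match it; in the sparse stage the same loss is restricted to the selected top-k tokens. The names, and the softmax over indexer scores, are my own simplification:

```python
from typing import Optional
import torch
import torch.nn.functional as F

def indexer_alignment_loss(main_attn: torch.Tensor,     # [H, S] main-attention weights for one query token
                           index_scores: torch.Tensor,  # [S]    raw lightning-indexer scores for that token
                           selected: Optional[torch.Tensor] = None  # [k] top-k indices (sparse stage only)
                           ) -> torch.Tensor:
    """KL(target || indexer), target = L1-normalized sum of attention over heads."""
    target = main_attn.sum(dim=0)                        # [S] summed across heads
    if selected is not None:                             # stage 2: restrict to the chosen tokens
        target, index_scores = target[selected], index_scores[selected]
    target = target / target.sum()                       # L1 normalization
    log_pred = F.log_softmax(index_scores, dim=-1)       # indexer distribution (assumed softmax)
    return F.kl_div(log_pred, target, reduction='sum')
```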
Uh, this step is actually a bit expensive because it's using almost a trillion tokens. 00:48:56.160 |
Uh, and this is not cheap for a model of this size. 00:48:59.160 |
So I think this is where most of the training cost is, uh, in this whole process. 00:49:04.160 |
So once you've done pre-training, you want to do post-training because this model is going to be like a chat model. 00:49:10.160 |
It's not just going to be a base, uh, next token prediction model. 00:49:15.160 |
So after the pre-training, they perform post-training that is quite similar to the normal deep seek post-training. 00:49:21.160 |
Uh, they, they, they have two, two, I think modifications in this version. 00:49:29.160 |
Uh, so basically they, they train specialist models on, on five, on around five tasks. 00:49:37.160 |
Each model version is, is trained to be like an expert at this task. 00:49:41.160 |
And then they generate synthetic data of these specialist models and they use it to train the generalist model that they release to the public. 00:49:51.160 |
Uh, so they mention five specialized domains. 00:49:56.160 |
Mathematics, competitive programming, uh, logical reasoning, agentic coding, and agentic search. 00:50:02.160 |
Each specialist is, is trained with large scale RL computing. 00:50:06.160 |
And then they just, uh, generate training data for the, uh, chain of thought reasoning for, for the, uh, for the generalist model. 00:50:14.160 |
And also the direct response generation, because this, this model is supposed to be like a hybrid model. 00:50:19.160 |
It can work as a reasoner and also as a direct instruct model. 00:50:24.160 |
So once the specialist models are prepared, they are used to, uh, generate the domain specific data for the final checkpoint. 00:50:33.160 |
And these, uh, the experiments show that these models, uh, they are quite similar to the specialist on the domain specific, uh, areas, but also they, they are still like a generalist model. 00:50:46.160 |
Uh, so this is like quite a nice balance between, uh, maximizing the performance on specialist tasks and also being a generalist model. 00:50:55.160 |
Uh, the second modification they make is they do mixed RL training. 00:51:01.160 |
So previously, previously the, the, uh, post training pipeline consisted of like multiple steps. 00:51:09.160 |
You do one step and then the next step and then the other step. 00:51:12.160 |
But in this version, the, they combine all the steps into one, one, uh, single stage pipeline, basically. 00:51:20.160 |
So they merge the reasoning agent and human alignment training into one RL stage. 00:51:26.160 |
Uh, and this, the reason is they want to balance the performance across the diverse tasks and domains, but prevent the catastrophic forgetting that happens when you train a model in one task and then you train it on another task. 00:51:43.160 |
Uh, for, for reasoning and, and, uh, agent tasks, they use the same rule based outcome reward. 00:51:51.160 |
They also use a length penalty and language consistency reward. 00:51:55.160 |
And also they employ a generative reward model where each prompt has its own rubrics for evaluation. 00:52:08.160 |
And the second one is language consistency versus accuracy. 00:52:11.160 |
So basically they want to maximize the accuracy of the model predictions while keeping the, uh, the length of the answer short to reduce inference cost and latency and also keeping the language consistent. 00:52:28.160 |
I'm gonna, uh, quickly go over them, but basically the, they show that the model, uh, this new, uh, sparse attention implementation is quite efficient and is quite powerful. 00:52:40.160 |
And it it's, it's like almost matching the previous model. 00:52:44.160 |
Uh, at the beginning it's, it's a bit behind, but it catches up quickly and becomes almost identical to the original model. 00:52:59.160 |
So they compare the inference cost of the original deep seek model and the sparse attention version. 00:53:06.160 |
And we can see that the original model, which is the blue line is, is, is like ballooning in cost. 00:53:11.160 |
Uh, whenever the context gets large, but the modified sparse attention version, which is the orange line. 00:53:18.160 |
It it's, it's become much cheaper to serve at a longer, uh, input sequence. 00:53:26.160 |
And the, the case is also quite similar for decoding. 00:53:28.160 |
And I think it's even more pronounced in the decoding stage, which is where the reasoning models spend most of their time on. 00:53:35.160 |
Because now we have models that do decoding for 32 K or 64 K tokens, much of which is reasoning basically. 00:53:42.160 |
Uh, they said something that's quite important. 00:53:47.160 |
This is still an experimental version and the, they're, uh, conducting validation in real world, uh, inference scenarios. 00:53:54.160 |
So that's why they call this version an experimental version. 00:53:58.160 |
And yeah, this is basically the gist of the new, uh, deep seek model. 00:54:03.160 |
If anyone has any questions, please go ahead. 00:54:14.160 |
So if you use the, uh, API, it's significantly cheaper. 00:54:19.160 |
And also this is quite, not quite, but like relatively easy to implement for almost any model. 00:54:25.160 |
Uh, so you can see, like, Kimi, uh, GLM, and, and Qwen maybe adopting this or something quite similar in the next few weeks or months. 00:54:35.160 |
Because they show that you just need to do, uh, continued training. 00:54:41.160 |
So this is quite, this is quite, quite massive if it actually, if it is as good as they claim it to be. 00:54:47.160 |
I thought there was the interesting, um, note that performance does go down slightly. 00:54:53.160 |
Like, if you look at the benchmarks between this and the previous version they had running, a few benchmarks went down a few points, like GPQA Diamond, right? 00:55:03.160 |
And I think that's why it's ever so slightly down, but, like, you know, it's half the cost. 00:55:11.160 |
Like, sometimes they're down, like for, uh, GPQA Diamond, but you can see BrowseComp is a bit higher. 00:55:18.160 |
So I think it's, it's kind of like quite similar and the next versions of models, which will, which are going to be like trained from scratch on this modified attention. 00:55:26.160 |
I think they're going to be like able to be much, much better in terms of accuracy. 00:55:39.160 |
Thank you for the quick, quick 15 minute cover. 00:55:42.160 |
Uh, next week, I think we have stuff on bio on kit and RJ is helping out too. 00:55:52.160 |
Drink scientific reasoning LMs with biological world models with soft verifiers.