[Paper Club] The 2025 AI Engineer Reading List + LongBench Paper



00:00:00.000 | [INAUDIBLE]
00:00:02.960 | We covered Sora previously.
00:00:05.040 | Yeah.
00:00:06.800 | Yeah, Sora is not that important, right?
00:00:08.520 | [LAUGHS]
00:00:11.000 | SAM 2, yeah, I don't know if people have played around
00:00:13.480 | with SAM 2 or listened to the pod.
00:00:16.800 | I think definitely quite revolutionary.
00:00:16.800 | Also, another example of the thesis of vision
00:00:23.240 | becoming video, because what SAM 2 did was take SAM 1
00:00:28.800 | and extend it in the video direction.
00:00:32.760 | And they use this really cool architecture
00:00:36.600 | for memory attention that has object permanence.
00:00:41.960 | So where we see that in--
00:00:44.400 | I think it's in the related tweet here--
00:00:47.560 | where you can track people going off screen and coming back
00:00:50.160 | on screen, which is something that I like to show off.
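
A rough conceptual sketch of the memory-attention idea being described, assuming a simple rolling memory bank plus cross-attention; this is an illustration only, not SAM 2's actual implementation:

```python
# Conceptual sketch only, not SAM 2's real code: keep a rolling memory bank of
# embeddings from past frames and let the current frame cross-attend to it, so an
# object that left the frame can be re-identified when it comes back.
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, memory_size: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory_size = memory_size
        self.memory: list[torch.Tensor] = []   # one [num_tokens, dim] tensor per past frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: [1, num_tokens, dim] features of the current frame
        if self.memory:
            mem = torch.cat(self.memory, dim=0).unsqueeze(0)     # [1, mem_tokens, dim]
            fused, _ = self.cross_attn(frame_feats, mem, mem)    # attend to past frames
            frame_feats = frame_feats + fused                    # residual fusion with memory
        self.memory.append(frame_feats.squeeze(0).detach())      # push current frame
        self.memory = self.memory[-self.memory_size:]            # keep a rolling window
        return frame_feats

mem_attn = MemoryAttention()
for _ in range(5):                          # pretend video: 5 frames of 64 tokens each
    fused = mem_attn(torch.randn(1, 64, 256))
```
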
00:00:58.440 | I don't know if this works or not.
00:01:00.120 | I understand.
00:01:00.800 | There we have it.
00:01:02.480 | Oh, yeah.
00:01:03.320 | All right.
00:01:03.880 | So I think this is on their demo page.
00:01:08.520 | Let's see if it's here.
00:01:11.800 | Yeah, so this is segment anything across video.
00:01:13.800 | I don't know if there's--
00:01:17.800 | oh, yeah, here.
00:01:18.560 | And--
00:01:19.240 | OK, OK, this is the most impressive one.
00:01:22.080 | So you know this three-cup ball experiment?
00:01:26.200 | It can track where the ball is.
00:01:27.480 | You might have seen.
00:01:28.920 | And so here, I can click on the ball in the first frame.
00:01:32.680 | I can also click on a different cup.
00:01:36.000 | And so here, the additional challenge
00:01:38.200 | is that there's three cups that look exactly the same.
00:01:40.600 | And then there's a ball that will get occluded by the cup.
00:01:43.200 | So the ball is no longer visible.
00:01:44.240 | The cups are all moving around.
00:01:45.200 | They all look the same.
00:01:46.320 | But the model actually keeps track of the cup
00:01:48.200 | that we selected.
00:01:48.880 | And as you can see at the end-- here,
00:01:50.360 | I'll jump to the end so you can see--
00:01:51.720 | it actually finds the cup again.
00:01:53.080 | I wanted to point out a couple of fun demo UX features
00:01:55.480 | that we added that actually--
00:01:56.720 | yeah, so I thought it was pretty impressive.
00:02:00.240 | And people are using this in real-life situations.
00:02:02.720 | I would argue among the listed vision models,
00:02:08.680 | SAM is probably taking the crown in terms
00:02:10.560 | of real-world applications.
00:02:12.160 | Yeah, it's also kind of interesting.
00:02:16.680 | Basically, the SAM team publishes one thing a year,
00:02:18.720 | and then they take the rest of the year off.
00:02:20.960 | Like it's-- they're like the epitome of we
00:02:25.440 | have defined one problem well.
00:02:27.160 | We solve it very well.
00:02:28.400 | And then we don't publish anything else,
00:02:32.320 | which is kind of cool.
00:02:34.960 | Let me just grab your recommendations.
00:02:37.160 | Let me see them in the chat.
00:02:40.760 | It's here, right?
00:02:41.600 | Cool.
00:02:46.640 | Then this new one, which I was not paying attention
00:02:50.160 | to, DETRs and all that.
00:02:54.840 | I know that YOLOv10 was actually at NeurIPS,
00:03:00.800 | which was the latest update of YOLOs and real-time object
00:03:05.480 | detection.
00:03:06.000 | But apparently, according to the vision guys,
00:03:09.240 | DETRs are mostly replacing YOLOs
00:03:13.080 | in terms of their performance.
00:03:15.520 | I'm not really sure why.
00:03:16.560 | I didn't really go into it.
00:03:17.840 | But if people care about real-time object detection,
00:03:20.280 | that is it.
00:03:21.160 | Oh, yeah, I mean, the other thing about segment anything,
00:03:24.880 | they studiously avoid labeling.
00:03:29.240 | So they only know how to draw segments.
00:03:31.960 | And to me, it's very similar to typical conv net layers,
00:03:37.640 | where there's one layer that only does edge detection.
00:03:41.400 | And so this is like, because they constrain it very well,
00:03:44.440 | they solve it very well.
00:03:45.440 | But at the same time, for practical use,
00:03:47.200 | you basically always want to label things.
00:03:49.000 | So then you have to combine segment anything
00:03:50.840 | with Grounding DINO and stuff, which is not a full solution.
00:03:56.600 | YOLOs, I think, also would have that same application.
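
A minimal sketch of the "detector for labels, SAM for masks" combination described here; `detect_boxes` and `segment_box` are hypothetical stand-ins for whatever Grounding DINO and SAM wrappers are actually used:

```python
# Sketch of the labeling pipeline mentioned above: an open-vocabulary detector
# supplies labeled boxes, and SAM turns each box into a mask. `detect_boxes` and
# `segment_box` are hypothetical stand-ins for real Grounding DINO / SAM wrappers.
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Tuple

@dataclass
class LabeledMask:
    label: str
    score: float
    mask: Any            # e.g. an HxW boolean array produced by SAM

def label_and_segment(
    image: Any,
    text_prompt: str,
    detect_boxes: Callable[[Any, str], Iterable[Tuple[Any, str, float]]],
    segment_box: Callable[[Any, Any], Any],
) -> list[LabeledMask]:
    """Combine a text-prompted detector (labels) with a promptable segmenter (pixels)."""
    results = []
    for box, label, score in detect_boxes(image, text_prompt):
        results.append(LabeledMask(label=label, score=score, mask=segment_box(image, box)))
    return results

# Usage sketch: label_and_segment(img, "cup. ball.", my_dino_fn, my_sam_fn)
```
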
00:04:05.440 | I'll keep moving.
00:04:06.520 | But then also, I also don't want to maybe dominate
00:04:09.000 | too much of the conversation.
00:04:11.360 | I did read this MMVP paper, which
00:04:14.760 | they highlighted as one of the more well-talked about papers.
00:04:19.280 | By the way, I have this really useful extension
00:04:21.360 | for finding the related tweets of all the papers.
00:04:25.720 | And so I don't think this particular paper was
00:04:33.840 | that influential.
00:04:35.080 | But it was one of the best papers of CVPR.
00:04:38.520 | And it basically just pointed out the sort of, quote,
00:04:40.720 | unquote, "jagged intelligence" of frontier models.
00:04:44.720 | So even here, where the--
00:04:54.720 | they found-- they cataloged all the hallucinations
00:04:58.440 | that are still remaining in the frontier models
00:05:01.480 | and why, even though they're superhuman in some aspects,
00:05:05.240 | they're not superhuman here.
00:05:09.040 | Yeah.
00:05:10.960 | And so that creates gaps for other models to fill in.
00:05:19.600 | And I think this is an example of--
00:05:22.160 | I mean, they have a benchmark.
00:05:23.400 | And anytime you see a benchmark like this, where all
00:05:30.720 | the frontier models are here and the humans are here,
00:05:33.600 | that is a reliable way to advance the field, which
00:05:37.120 | is you're going to find gaps that it's not doing well in.
00:05:40.960 | And I think this sort of finally made it click for me,
00:05:44.240 | that why so many people were kind of focusing on clocks
00:05:48.040 | this year, here, which became a focus for PixMo
00:05:55.480 | and Moondream, which is just analog devices.
00:06:01.200 | Well, I'm trying to look for the analog in this situation.
00:06:05.440 | Can't really find it.
00:06:06.920 | It's somewhere in the presentations that I saw.
00:06:09.120 | But basically, I think this is--
00:06:15.960 | when you publish an influential shortcomings paper
00:06:19.800 | or benchmark, then people can sort of meaningfully
00:06:22.000 | advance on it.
00:06:22.560 | And so I think then they picked out PaliGemma, Florence,
00:06:26.280 | and Moondream.
00:06:27.560 | I should put Moondream there as the examples.
00:06:32.600 | Yeah, what's up?
00:06:34.200 | Is this essentially an AGI problem for the vision field?
00:06:38.440 | I guess, which is fun, which is interesting as well.
00:06:41.440 | A lot of people think that ARC-AGI is a vision issue, which--
00:06:45.640 | I am primarily using my vision sense when I'm looking at ARC-AGI.
00:06:49.360 | But Francois Chollet really, really insists that ARC-AGI is not
00:06:55.400 | a vision problem.
00:06:57.440 | I mean, this looks vision to me.
00:07:01.360 | No, but I mean, so the reason why I agree with him also
00:07:05.040 | is that even the OpenAI solution and the leading
00:07:08.840 | model of OpenAI and the winning models,
00:07:12.760 | they're not solving it through vision.
00:07:14.640 | They're solving it through text.
00:07:16.880 | It's a reasoning problem.
00:07:19.680 | Yes, but don't you think vision would help?
00:07:23.560 | Like the a priori.
00:07:26.120 | I understand that the winners, none of them use vision.
00:07:30.320 | But a priori, it should help.
00:07:34.360 | No, there are teams that are trying
00:07:36.480 | to use the vision models, and it didn't work somehow.
00:07:39.960 | So that's an interesting--
00:07:41.960 | That's a scale issue.
00:07:43.120 | It's a scale, I don't know which one it is.
00:07:49.240 | Oh, I wonder if you run QVQ on these things,
00:07:53.160 | would it be any different?
00:07:54.800 | One of the things I was thinking about for the Christmas
00:07:58.560 | episode today was taking some of the published ArcAGI questions
00:08:04.720 | and seeing if humans can solve it.
00:08:06.120 | Let's do one.
00:08:11.640 | You want to do one?
00:08:12.520 | I'll find one.
00:08:13.480 | You keep going.
00:08:14.360 | Let me find a fun one.
00:08:15.480 | OK, OK.
00:08:16.800 | Yeah, I posted a link in the Discord somewhere.
00:08:21.600 | I think Greg Kamradt is the guy to find it.
00:08:24.600 | And then find something where O3 failed to solve
00:08:29.920 | one of the ArcAGI problems.
00:08:31.280 | And then I wonder if humans can do it.
00:08:33.600 | OK, what else about PaliGemma?
00:08:36.760 | PaliGemma, that directly led to ColPali,
00:08:40.840 | and then also now ColQwen.
00:08:43.360 | I think in terms of the specific field of PDF parsing,
00:08:48.120 | this has been a relatively big win, I think, in this year.
00:08:54.240 | There's also Marker, Surya, as well,
00:08:56.920 | so that emerged this year as vision-based models
00:09:01.560 | for things.
00:09:03.640 | So where does this go?
00:09:05.080 | I think next year, vision.
00:09:08.000 | Next year, MMVP solutions to, I don't know, 50%, 70%.
00:09:15.560 | But I'm interested in what's next.
00:09:17.880 | I think a lot of people are focusing
00:09:19.880 | on the MMMU, or the multimodal MMLU, or whatever.
00:09:24.200 | But I'm not sure what else is left, apart from, I guess,
00:09:28.000 | moving on to video generation.
00:09:32.160 | The artificial analysis people, I would say,
00:09:35.320 | are the de facto leaders now in terms of judging video models.
00:09:41.160 | So you should be aware of this leaderboard,
00:09:44.920 | where you can track the arenas for all these things.
00:09:49.040 | And they're trying to do image and speech, as well.
00:09:52.440 | But video seems to be the most hyped.
00:09:55.720 | Yeah.
00:09:57.040 | Cool.
00:09:57.560 | Shall we move on?
00:09:58.160 | Any other thoughts or additions to the vision video domain?
00:10:03.240 | I just sent a link to the ARC thing.
00:10:05.760 | This is the unsolved O3 tasks.
00:10:08.880 | Let's try one.
00:10:09.840 | [INAUDIBLE]
00:10:13.840 | Sorry, what did you send it?
00:10:15.040 | Zoom chat.
00:10:16.160 | Oh, Zoom chat.
00:10:17.360 | Everyone can.
00:10:18.880 | Oh, oops.
00:10:22.400 | Oh, I've been using Orion, by the way.
00:10:24.440 | Oh, but it includes the answer, I guess, I think.
00:10:28.400 | Zoom in, man.
00:10:30.840 | What?
00:10:32.360 | OK, is this the test?
00:10:35.000 | These are hard.
00:10:36.400 | But we-- OK, so this is--
00:10:41.120 | --24 tasks that O3 was unable to solve,
00:10:43.920 | along with the incorrect guesses it made.
00:10:46.440 | OK, so what are we trying to do here?
00:10:48.000 | We have three examples.
00:10:49.600 | And then O3-- oh, OK.
00:10:51.240 | So let's kill the ground truth here.
00:10:56.160 | So, oh, yeah.
00:10:57.160 | OK, this is the one that everyone's debating, right?
00:11:02.040 | So the blue connections here turn everything
00:11:06.320 | on its path blue.
00:11:07.280 | And everything else remains red.
00:11:11.840 | So O3 managed to draw all the blues.
00:11:14.680 | And drew extra blues.
00:11:20.440 | This is somewhat unfair, because this question asks you to--
00:11:24.760 | like, none of the samples have any--
00:11:31.680 | have sort of this many dots.
00:11:34.080 | They also don't have--
00:11:40.720 | oh, go ahead.
00:11:42.320 | Oh, go ahead.
00:11:44.000 | People are saying that O3's first solution is correct.
00:11:48.520 | Yeah, that's the thing.
00:11:49.480 | They also don't have dots whereby it lines up
00:11:51.360 | horizontally and vertically.
00:11:53.680 | Yeah, yeah, so it's a known bad question.
00:11:56.800 | And also, it wasn't--
00:11:58.280 | there's the dot and the line touching the box.
00:12:01.040 | That part was very--
00:12:02.160 | Yeah, this specific one is the issue,
00:12:05.040 | because ground truth was saying that this one should
00:12:08.240 | have turned blue.
00:12:10.720 | So in ARC-AGI, you get two chances to submit a solution.
00:12:16.040 | And if you don't get either chance,
00:12:17.640 | you're deemed to have failed.
00:12:19.760 | But O3 is pretty close.
00:12:20.960 | I think this is just a bad question,
00:12:22.640 | and we should just throw it out.
00:12:25.240 | Let's see another one.
00:12:26.800 | OK, so we'll just look at it for a while.
00:12:31.280 | I've never seen this.
00:12:33.800 | The sentence below might be useful.
00:12:37.760 | Unable to open the grid.
00:12:38.800 | Oh, yeah, it just--
00:12:39.600 | Oh, you're not supposed to get the sentence.
00:12:44.080 | We're just analyzing.
00:12:45.360 | I wonder why the first example doesn't--
00:12:54.240 | the box in the middle is 3 by 3.
00:12:57.240 | I thought you're supposed to try to shrink it
00:13:00.640 | as much as you can.
00:13:03.160 | Or even I, as a human, don't understand the first--
00:13:05.920 | the one on the left.
00:13:07.440 | [INAUDIBLE]
00:13:09.600 | I thought the pattern is you add an orange ring, then
00:13:13.160 | a white ring.
00:13:14.000 | So orange, white, then orange, white.
00:13:17.040 | And the third one, you skip.
00:13:20.400 | You recurse if you can.
00:13:22.320 | Yeah, you recurse if you can.
00:13:23.520 | So I don't know why they don't recurse for the first one.
00:13:26.440 | Yeah, so--
00:13:28.000 | Yeah, especially if the fourth one,
00:13:30.640 | it did recurse with that white dot.
00:13:32.720 | Yeah.
00:13:33.240 | [INAUDIBLE]
00:13:35.760 | So, OK, I feel like this is an example where
00:13:39.000 | vision would help.
00:13:39.880 | Because--
00:13:40.400 | Yeah.
00:13:41.240 | Oh, look, this is offset.
00:13:42.840 | So over here, you've got--
00:13:43.920 | [INAUDIBLE]
00:13:44.720 | Right, yeah.
00:13:45.480 | This is the closest one.
00:13:50.160 | So you need to basically count the levels of recursion.
00:13:53.360 | And this is super small.
00:13:55.200 | So I don't know if there's a good way to mathematically go--
00:13:59.440 | oh, maybe there's some relationship between--
00:14:02.360 | there's, OK, 1, 2, 3, 4, 5, 6, 7,
00:14:05.680 | 8, 9.
00:14:06.760 | This is like a width of 9.
00:14:10.760 | Oh, hang on.
00:14:12.880 | You also have to zoom in.
00:14:15.840 | You have to output a smaller square.
00:14:18.160 | No, no, no, you don't.
00:14:19.120 | You don't.
00:14:19.640 | It's just that the input is a smaller square.
00:14:22.000 | So the second input was a 6 by 6.
00:14:24.160 | It's not output a smaller square.
00:14:26.880 | I was going left to right.
00:14:27.880 | OK, so this is a width of 6.
00:14:31.360 | 9, 6, 1, 2, maybe 16 minus 4 is 12.
00:14:36.080 | 9, 6, 12.
00:14:37.440 | And then this is 19 minus 2 is 17.
00:14:41.080 | This is an odd number.
00:14:42.120 | So 17 outputs a 17 square.
00:14:49.200 | Yeah.
00:14:49.720 | So I feel like there is a relationship between like--
00:14:52.560 | this is over-engineering.
00:14:59.400 | If you switch it on, it gets even weirder.
00:15:01.600 | But after looking at this post, my real thing
00:15:04.480 | is like, I don't know if AGI is just puzzles and squares.
00:15:08.280 | It shouldn't be.
00:15:10.000 | Yeah, the standard line is that it is necessary but not
00:15:15.240 | sufficient.
00:15:16.120 | Just because you can pass a bunch of IQ tests
00:15:17.960 | doesn't mean you're going to be good at your job.
00:15:20.040 | But it helps.
00:15:21.760 | I think I can be good at my job and fail all these tests.
00:15:25.440 | I agree.
00:15:25.960 | I think you can make a lot of money
00:15:27.400 | or create a lot of good without even being--
00:15:30.760 | without doing well on, I don't know, Dota.
00:15:33.880 | What is the DQN learning?
00:15:37.560 | Atari.
00:15:38.440 | Without doing well on Atari, even.
00:15:39.960 | I think this is the Atari for a reason.
00:15:42.880 | Well, this is reasoning.
00:15:44.520 | This is also just like, the way I think about some of these
00:15:47.840 | is if you throw enough time at it, you'll figure it out.
00:15:50.640 | Like, I think if you take like a seventh grader
00:15:53.280 | and give them like a month or a summer vacation
00:15:56.040 | and you give them like a PS5, if they get it,
00:15:58.000 | they will do it in two months.
00:15:59.760 | Like, you give them enough time, they'll do it.
00:16:03.640 | Are you referring to test time compute now?
00:16:05.840 | No, I'm just talking about like, some of this stuff
00:16:10.440 | is just trial and error.
00:16:12.120 | So you know, chain of thought.
00:16:14.600 | I'll just leave RKGI there.
00:16:16.400 | I don't know.
00:16:16.920 | I mean, we can agree that the misalignment is a clear fail.
00:16:23.920 | But neither of us can tell why the first one didn't recurse.
00:16:28.960 | What do you mean, misalignment?
00:16:32.000 | The ones in the green.
00:16:33.880 | The O3, yeah.
00:16:35.480 | The grid alignments, it's very clear cut.
00:16:38.960 | I think the like, tokenization.
00:16:43.120 | But also, it seems like a lot of models
00:16:46.680 | just don't output anything.
00:16:52.000 | That's what [INAUDIBLE] comment.
00:16:53.720 | A lot of smaller LLMs on Arc, they don't do anything.
00:17:02.240 | The model is unable to output a grid at all.
00:17:04.760 | I think this is similar to like, tokenization and language.
00:17:09.120 | LLM get grid.
00:17:11.080 | Yeah, that sounds like it.
00:17:13.240 | Like, if stuff is off a character,
00:17:15.200 | it's probably just because their tokenizer is bad at this.
00:17:18.040 | Which is like, not a great measure of AGI.
00:17:23.320 | But I guess AGI can fix its own tokenizer.
00:17:28.400 | Ooh, that'll be fun.
00:17:30.520 | OK, do you want to do more?
00:17:32.440 | Or should we move on?
00:17:34.640 | Move on.
00:17:35.160 | So then we move to open models.
00:17:42.880 | I had a list here of the GEMMA model.
00:17:46.960 | Ooh, I should probably put in--
00:17:48.360 | I guess if we want to do more, if you scroll down,
00:17:53.120 | there's one fun one that's very quick that people understand.
00:17:56.600 | It's just what line is on top.
00:17:59.280 | Go down, go down, go down, go down.
00:18:01.400 | Oh my God, there's so many examples.
00:18:04.600 | Yeah, so I actually did do an Arc AGI, like human,
00:18:08.040 | try and fill it up in Solaris.
00:18:09.960 | The biggest issue I found was after the second puzzle
00:18:14.000 | is you get fatigue.
00:18:15.680 | Because some of the puzzles are huge.
00:18:17.840 | And trying to fill it up is a pain in the ass.
00:18:20.960 | Right, right.
00:18:23.440 | Yeah, I don't know which one you're talking about, Vibhu.
00:18:27.760 | This one?
00:18:28.260 | Yeah, I agree.
00:18:34.240 | I mean, it's one of those things where AI is just
00:18:36.280 | very good at scaling up attention, and we don't.
00:18:40.120 | We're not.
00:18:41.920 | Which is one form of AGI, ASI.
00:18:46.200 | Vibhu, I don't know which one you're talking about.
00:18:48.320 | My bad, I found a new link.
00:18:49.640 | New link.
00:18:50.640 | Oh, new link.
00:18:52.040 | This one, I'm surprised that it failed.
00:18:53.720 | This one?
00:19:04.720 | Mm-hmm.
00:19:07.480 | So second to last one, at the very bottom.
00:19:09.320 | OK, second to last, meaning here.
00:19:17.000 | Up, up one, up.
00:19:19.640 | This itself is an Arc AGI task.
00:19:21.840 | Navigate the page based on instructions.
00:19:26.280 | OK, I like this.
00:19:27.560 | I like this.
00:19:28.160 | So basically, what is--
00:19:30.040 | What's on top?
00:19:31.160 | What's on top?
00:19:31.720 | What's the last thing added?
00:19:34.400 | Yeah, so pink is on top.
00:19:37.160 | Blue's on top.
00:19:38.200 | Pink's on top.
00:19:39.320 | Blue's on top.
00:19:39.960 | Green's on top.
00:19:42.080 | It's pretty cool.
00:19:44.040 | Another vision.
00:19:44.880 | Yeah.
00:19:47.960 | I feel like vision will won this as well.
00:19:51.160 | I mean, it's not just vision, right?
00:19:52.280 | You also have to know that which one is obscured
00:19:54.760 | and which one is not.
00:19:55.840 | Yeah, vision is part of it, but it's not just.
00:19:59.520 | Yeah.
00:20:00.040 | But count the most number of blocks.
00:20:02.000 | Well, the interesting thing here is the larger the grid gets,
00:20:06.120 | like on the left, you see all the grids are like 10 by 6.
00:20:09.560 | As you add more grids, the 01 mini starts to fail.
00:20:13.920 | You just can't do more grids.
00:20:15.800 | Even though it's the same number of interior lines,
00:20:17.920 | it struggles with grids.
00:20:18.920 | So yeah, maybe vision would help, but it's a skill issue.
00:20:27.800 | It's a memory attention issue as your grid gets bigger.
00:20:32.640 | I don't know about that, because the attention
00:20:34.880 | will condense this down into all the same hidden dimension.
00:20:42.560 | So basically, all this gets pre-processed to the same size.
00:20:47.920 | The grids didn't make an issue here.
00:20:50.880 | And the grids [INAUDIBLE]
00:20:54.320 | I mean, it's only 24 [INAUDIBLE]
00:21:00.480 | Interesting.
00:21:01.800 | Cool.
00:21:03.560 | All right, move on.
00:21:04.520 | Learned today on Christmas of 2024
00:21:10.480 | is that we are not AGI, guys.
00:21:12.600 | We're not AGI.
00:21:14.960 | We don't deserve a million.
00:21:16.160 | OK, I was going to move on to open models.
00:21:22.400 | I feel like these are the ones that are commonly named.
00:21:28.840 | I can sort of justify all these.
00:21:30.200 | These are the picks from Luca.
00:21:31.560 | Obviously, this does not include the sort
00:21:38.600 | of state-space models and RWA-KVs of the world,
00:21:40.920 | which are also open.
00:21:42.160 | But these are sort of the traditional open models.
00:21:44.960 | Are there any that--
00:21:46.480 | I guess the big one has not been mentioned here,
00:21:49.720 | which is Metaslama.
00:21:50.640 | Are there any that we have missed?
00:21:55.480 | Oh, we know I had Mistral.
00:21:57.440 | Falcon dropped another one too.
00:21:59.360 | Oh, I'm sorry.
00:22:02.240 | Yeah, I think this is--
00:22:04.400 | I mean, it's kind of nice to have everything,
00:22:06.360 | like the whole year in one screen.
00:22:09.760 | I definitely would find some utility from that.
00:22:15.280 | Just be like, oh, yeah, we didn't miss anything.
00:22:17.560 | This is everything.
00:22:20.880 | I think it needs more love.
00:22:26.400 | I mean, I think these guys release so much.
00:22:29.960 | There's like, you see Coder and all that.
00:22:35.400 | CKMOU.
00:22:36.560 | Do you see CKMOU this year or last year?
00:22:40.120 | I think it was early this year.
00:22:42.520 | So it's just hard to--
00:22:44.280 | I think it's just like, you have to tell people
00:22:47.120 | what the variable-based models they are so that they can then
00:22:50.880 | go and fine-tune if they want to.
00:22:52.600 | But otherwise, there's not much to say here
00:22:54.800 | apart from the options of open models that are out there.
00:23:01.080 | I think you can add some of the small sub-1B models.
00:23:05.200 | There's also the Falcon 3 series dropped a week ago.
00:23:09.040 | I don't know how they are, but they put Falcon 3, 1B, 3B, 7B,
00:23:13.280 | 10B, the MAMA version.
00:23:14.960 | I just don't know how they are.
00:23:16.840 | I'm going to put them on a misc.
00:23:19.640 | They made the list.
00:23:22.400 | Yeah, I mean, the consensus is that Falcon was not
00:23:24.760 | well-trained, apparently.
00:23:27.720 | Oh, interesting, because they put out huge data sets,
00:23:30.680 | but they don't train--
00:23:32.200 | So Guilherme, the creator--
00:23:35.320 | I talked to the same guy at NeurIPS 2023 and NeurIPS 2024.
00:23:42.160 | The guy who did the Falcon, the FineWeb--
00:23:46.920 | where was it?
00:23:49.400 | RefinedWeb?
00:23:52.040 | Actually, was it?
00:23:54.040 | I don't know where I put it.
00:23:55.200 | I actually talked to him last year at this conference.
00:24:04.200 | It looks like I didn't actually--
00:24:05.560 | I didn't even publish the interview.
00:24:08.960 | But I talked to him again this year,
00:24:11.000 | and he's actually the same guy behind FineWeb.
00:24:14.560 | So this fella, he just basically left TII UAE
00:24:20.400 | and joined Hugging Face.
00:24:22.200 | So it's the same guy.
00:24:24.400 | There's also Falcon RefinedWeb from TII UAE.
00:24:29.040 | Right, it's the same guy.
00:24:30.280 | That's what I'm saying.
00:24:33.440 | If you look at RefinedWeb, the lead author from this guy
00:24:36.760 | last year is the same guy.
00:24:39.880 | He moved-- so I mean, that's the intellectual lineage, I guess.
00:24:49.680 | He was actually at Latent Space Live.
00:24:51.720 | He came by to just say hi.
00:24:55.360 | Cool.
00:24:55.880 | Any other open models?
00:24:57.560 | I don't know.
00:24:58.320 | You said the one that's the 1B, 3B models.
00:25:01.480 | What are you specifically talking about?
00:25:03.240 | Phi models, even though they're--
00:25:05.560 | The Phi models, yeah.
00:25:06.720 | I should probably mention Phi here.
00:25:14.360 | I don't know.
00:25:17.680 | I feel like Phi constantly has these allegations of training
00:25:22.400 | on tests, and I don't know how real that is.
00:25:26.160 | I know a couple of people who--
00:25:30.200 | It's somewhat undetermined.
00:25:31.520 | They keep saying they do less and less.
00:25:33.360 | And realistically, they have a whole section
00:25:35.480 | in the papers on how they don't train on tests
00:25:38.800 | and how they filter out so we don't train on benchmarks.
00:25:43.120 | I don't know why they just put out the list for no reason.
00:25:45.680 | But Phi-4 is not really out for testing yet, right?
00:25:48.960 | It's not open yet.
00:25:50.200 | It's meant to be open later.
00:25:51.440 | I don't know if anyone's tested it.
00:25:52.840 | I guess you could add the Apple stuff here.
00:25:56.560 | Yeah.
00:25:57.720 | So I--
00:25:59.040 | Never mind, take it back.
00:26:01.440 | Yeah.
00:26:02.640 | Yeah, I just put it here.
00:26:04.720 | I guess I also mentioned Gemini Nano, which is in here.
00:26:11.440 | OK, I found that you can just go to the LocalLlama subreddit
00:26:16.760 | and you go to just type in best model.
00:26:19.480 | Every few months, they'll do some kind of survey.
00:26:22.600 | And you just sort by new.
00:26:25.320 | They'll usually have some kind of--
00:26:30.280 | here-- some kind of informal survey
00:26:32.800 | of what people are saying in terms of best models,
00:26:35.720 | including the best fine-tunes and whatever.
00:26:40.120 | So that's pretty useful.
00:26:41.800 | I don't think we've missed any, basically.
00:26:45.480 | Cool.
00:26:48.200 | I would like to point out that the rare, random EVA
00:26:50.520 | Qwen fine-tune there, that's actually
00:26:53.280 | a role-play fine-tune model.
00:26:55.200 | Oh, of course.
00:26:55.800 | This is LocalLlama.
00:26:57.280 | LocalLlama.
00:26:57.920 | No, but every now and then, some of the role-play fine-tune
00:27:01.440 | models, people do like them in human evaluations
00:27:04.880 | when it comes to tasks.
00:27:05.800 | So it's always weird to see it happen.
00:27:08.160 | What I'm not seeing is merges.
00:27:10.440 | And yeah, where are the merges?
00:27:17.960 | None of these are merges, right?
00:27:20.760 | I can clearly double-check.
00:27:22.160 | Merging is not all you need.
00:27:24.600 | Yeah, I feel like there's a lot of noise about merging,
00:27:26.840 | but nobody actually ends up using them.
00:27:29.160 | They just think--
00:27:30.680 | Ramon has an interesting comment, by the way,
00:27:32.680 | from Zoom chat.
00:27:34.480 | Some of the Falcon and fine-tuning creators,
00:27:39.040 | the first series, they started a company, AdaptiveML.
00:27:41.960 | They do on-prem RL.
00:27:46.100 | They raised a 20 mil Series A, eh?
00:27:51.880 | Nice.
00:28:01.080 | Yes, I think I saw this round come out.
00:28:06.120 | We might try to talk to them next year.
00:28:08.760 | I know basically everyone's going to focus in on RL for LLMs,
00:28:14.720 | and that'll be a big theme for next year as well.
00:28:17.440 | I'm trying to collect all these themes
00:28:19.080 | so that I'll make my life easier by fine-tuning RL for LLMs.
00:28:27.720 | OK, whatever.
00:28:29.560 | OK, so I will keep going in the interest of time.
00:28:33.320 | Synthetic data.
00:28:34.440 | I feel like we are all relatively familiar
00:28:37.640 | with many of these.
00:28:39.400 | I put five here as well for synthetic data.
00:28:42.720 | We did the ORCA 3 agent and struct paper this year.
00:28:47.240 | I did that session.
00:28:49.880 | I feel like BillionPersona was a source of a lot of noise,
00:28:56.040 | but ultimately very little actual impact,
00:29:00.240 | whereas the people who worked on real data sets,
00:29:02.640 | like FineWeb and DCLM, have had more impacts as well.
00:29:07.960 | You were trying to think--
00:29:09.240 | WizardLM.
00:29:10.560 | WizardLM series.
00:29:12.200 | Oh, that's agent.
00:29:14.800 | Is that also synthetic data?
00:29:16.000 | I don't know, man.
00:29:20.760 | So I will-- it's all Microsoft, right?
00:29:23.880 | It's all these MSR China people.
00:29:29.480 | Cool.
00:29:30.720 | The one I didn't know about was Cohere,
00:29:34.400 | which Luba mentioned in her talk.
00:29:38.080 | And that was net new to me.
00:29:40.440 | Basically, there is a chart in here.
00:29:44.160 | Cohere is always pushing, or at least Sarah Hooker
00:29:50.280 | is always pushing that there's benefits
00:29:54.320 | from learning across languages.
00:29:58.160 | And I think she's basically just trying
00:30:02.800 | to emphasize that if you have an ensemble of languages
00:30:06.480 | and your data set crosses more languages,
00:30:08.240 | you have knowledge that you don't have in one language.
00:30:11.080 | So English is not all you need, basically.
00:30:15.320 | So don't just use a single teacher; use whatever
00:30:18.520 | the routing system is for your multiple languages.
00:30:25.640 | I don't know if there's any other synthetic data stuff
00:30:28.160 | that we should pick up.
00:30:29.120 | But basically, this is completely--
00:30:31.000 | I introduced this on a whim.
00:30:33.000 | I felt like synthetic data was a big theme this year.
00:30:36.640 | And I think this really should be
00:30:40.400 | data sets that happens to be just all synthetic data.
00:30:44.280 | There's also a lot of talk about LLM-as-judge.
00:30:47.240 | But I don't know if there's a specific paper to cover this.
00:30:51.560 | When I looked at my own records, the best paper, quote unquote,
00:30:56.000 | was Hamel Husain's post on LLM-as-judge.
00:31:01.760 | So maybe that would be a quote, unquote,
00:31:04.080 | paper for creating LLM-as-judge.
00:31:08.440 | I don't know if anyone has read anything on synthetic stuff.
00:31:16.080 | But yeah, I'll just put it there.
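
Since LLM-as-judge comes up here without a single canonical paper, here is a minimal sketch of the pattern in the spirit of Hamel Husain's write-up; the model name and the pass/fail rubric are illustrative assumptions, and in practice the judge should be aligned against human labels first:

```python
# Minimal LLM-as-judge sketch. The model name and rubric are illustrative;
# in practice you'd iterate on the criteria and check the judge against human labels.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: PASS if the answer is correct and grounded, FAIL otherwise."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

# judge("What is 2+2?", "4")  -> True (usually)
```
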
00:31:20.120 | OK, small models, MobileLLM.
00:31:22.400 | I don't know if we covered this in paper club.
00:31:24.360 | But I did in AI News.
00:31:28.040 | And I think this is effectively the genesis for SmolLM,
00:31:35.000 | which is kind of Hugging Face's implementation of that.
00:31:38.080 | We did cover in paper club.
00:31:42.120 | I covered this paper.
00:31:43.800 | Cool, awesome.
00:31:45.400 | I may not have been there.
00:31:48.400 | Did you like it?
00:31:50.320 | Do you still have a good impression?
00:31:52.320 | No, I really like it because it's particular
00:31:54.600 | about the layer repeat part.
00:31:59.440 | I think it's relevant, again, because people are speculating
00:32:02.560 | that O1 does layer looping.
00:32:07.640 | And mobile LLM kind of does that.
00:32:11.040 | But I'm not sure if it's dynamic or not.
00:32:13.320 | I think it could be static, whereas O1
00:32:16.640 | would have some layer to decide whether or not
00:32:19.440 | to continue looping.
00:32:20.680 | Like it's effectively kind of a Turing complete architecture.
00:32:25.280 | I think O1 is more similar to--
00:32:29.520 | I can't remember the Google's paper name.
00:32:31.760 | But there's a Google paper where it exits early
00:32:34.600 | across the layers.
00:32:36.320 | Oh, mixture of depths.
00:32:38.480 | Yeah, mixture of depths, yeah.
00:32:40.280 | Which I think I did cover briefly as well as
00:32:44.160 | alternative to MobileLLM.
00:32:47.000 | Oh, we'll mention the mixture of depths here.
00:32:48.920 | Does it-- do they have looping?
00:32:56.000 | I don't think it's defined as a loop,
00:33:00.920 | but it's more of like a fixed depth and then exit early.
00:33:05.840 | Yeah.
00:33:07.800 | OK, I feel like to loop, to do inference time
00:33:14.640 | compute for multiple minutes and potentially hours,
00:33:18.640 | you need to loop instead of just having different depths.
00:33:23.520 | Yeah, it's also very hard to comment on this
00:33:25.720 | because there is no known open mixture-of-depths model as well.
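
To make the static-versus-dynamic distinction concrete, a toy sketch (an illustration, not any published architecture): static weight sharing reuses the same block a fixed number of times, while a dynamic loop adds a learned halting head that decides when to stop, which is the kind of mechanism being speculated about for O1; mixture-of-depths instead skips or exits layers rather than looping.

```python
# Toy illustration of static layer reuse vs. a dynamic loop with a halting head.
# Speculative sketch code, not any published model's architecture.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, dim: int = 64, max_loops: int = 8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.halt = nn.Linear(dim, 1)     # learned "stop looping?" head
        self.max_loops = max_loops

    def forward(self, x: torch.Tensor, dynamic: bool = True) -> torch.Tensor:
        for _ in range(self.max_loops):
            x = self.block(x)             # same weights reused every iteration
            if dynamic:
                # Halting probability from the mean token representation.
                p_halt = torch.sigmoid(self.halt(x.mean(dim=1)))
                if p_halt.mean() > 0.5:   # crude stopping rule for the sketch
                    break
        return x

layer = LoopedBlock()
x = torch.randn(2, 10, 64)
y_static = layer(x, dynamic=False)   # MobileLLM-style fixed repetition
y_dynamic = layer(x, dynamic=True)   # loop until the halting head says stop
```
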
00:33:29.280 | Cool, well, that's all we got.
00:33:35.520 | I think Apple Intelligence may be the biggest on-device model
00:33:39.200 | deployment, apart from RWKV, because I have it on my phone
00:33:44.440 | and I benefit from it every day.
00:33:47.360 | Still quite cool.
00:33:50.400 | I feel like people-- it's very trendy to shit
00:33:53.360 | on Apple Intelligence now.
00:33:55.760 | But it is still underrated that they rolled out
00:33:58.560 | transformers across the entire install base of iPhones,
00:34:02.960 | which is pretty cool.
00:34:06.680 | Gemini Nano, I thought, would be this year,
00:34:09.000 | still under a feature flag.
00:34:12.240 | So do people know what I'm talking
00:34:14.240 | about when I say Gemini Nano?
00:34:15.680 | So there's like a browser API.
00:34:30.280 | Built-in AI, I think it's this one.
00:34:32.920 | Yeah, so these are the APIs.
00:34:39.960 | And you can-- yeah, this prompt API for web,
00:34:46.680 | I don't know where it is here.
00:34:49.280 | This one?
00:34:49.800 | No, this is an extension.
00:34:55.400 | Where is-- I don't know where it is.
00:34:59.840 | [AUDIO OUT]
00:35:04.720 | Yeah, this is Flash.
00:35:06.600 | It will be in Chrome, where you can do browser.ai.generate.
00:35:15.120 | And that's just straight access, base level access to Gemini Nano.
00:35:21.960 | Oh, here it is.
00:35:22.680 | Yeah, yeah.
00:35:27.680 | So this will be built into the browser, no download.
00:35:32.320 | At some point, this will happen.
00:35:34.520 | And I think there were some demos this year that
00:35:40.600 | showed that it was very fast.
00:35:42.680 | Obviously, it's very dumb as well, but if you--
00:35:48.280 | yeah, I can't find it right now.
00:35:49.840 | But I mean, if you just sort of wait a bit,
00:35:53.160 | then you know it's coming.
00:35:55.080 | Maybe I'll just put it in here.
00:35:58.120 | OK, cool.
00:35:59.960 | Hymba, was this a big deal?
00:36:01.800 | I don't know.
00:36:03.080 | Luca picked it.
00:36:03.760 | I feel like Eugene might know.
00:36:09.760 | I think it's too early to even tell whether it's a big deal.
00:36:12.160 | But it has potential.
00:36:13.480 | I think that's why he picked it.
00:36:17.200 | Cool, all right.
00:36:18.240 | I'll keep moving on.
00:36:19.040 | I feel like I'm kind of running out of steam.
00:36:22.240 | Before going to post-transformers, there's also big models.
00:36:26.160 | I don't know whether you want to touch
00:36:27.720 | on DeepSeek dropping their 600B model.
00:36:30.520 | Yeah, here, right?
00:36:34.600 | Oh, OK, I forgot that there was a 600B drop.
00:36:37.880 | Yeah.
00:36:39.560 | OK, post-transformers.
00:36:41.400 | Is there any reflection on big, big model drops,
00:36:44.520 | like the 405B, the large failures?
00:36:49.440 | Blunt?
00:36:50.640 | Large failures?
00:36:52.240 | Especially--
00:36:53.640 | What is large failures?
00:36:55.280 | If it was me, I might call them large failures.
00:36:57.480 | If it wasn't me, I might call them
00:36:58.920 | like distillation-like model, so you can distill down.
00:37:04.240 | And then it becomes very weird when now, like, Llama 3 70B,
00:37:07.680 | or whatever the big one is, is as good as the 405B.
00:37:11.280 | But big failures.
00:37:15.200 | In that lens also, you can also include the Nvidia reward
00:37:18.280 | models.
00:37:20.120 | I really wish this was like live editable,
00:37:22.200 | so I can add in parentheses like burn VC money, but it's OK.
00:37:25.840 | Is this what you mean by reward models, Eugene?
00:37:32.080 | I don't know if this is the--
00:37:33.640 | This is the one.
00:37:34.400 | Let me double check.
00:37:35.280 | There's a bunch of non-branded things.
00:37:41.440 | I think Grok 1 was kind of considered a failure.
00:37:45.520 | Everyone's very excited about the weights of Grok,
00:37:48.640 | but it was too big to deploy.
00:37:51.840 | When was that?
00:37:53.080 | That was March.
00:37:54.400 | Yeah.
00:37:55.640 | So this is also a very, very big model.
00:37:58.440 | This is 314b.
00:38:02.800 | Yeah, I mean, yes, they're too big,
00:38:04.760 | but they're still teacher models.
00:38:06.120 | And I think that's OK.
00:38:07.320 | I don't see an issue with this at all.
00:38:10.840 | I don't consider it a failure.
00:38:12.240 | I don't know.
00:38:14.200 | OK, so let me know if anyone can think of any other big models
00:38:18.960 | that were released this year.
00:38:22.040 | I just thought of one.
00:38:23.440 | There was a Chinese model.
00:38:26.840 | I can't remember which one that was,
00:38:30.800 | but I will look it up.
00:38:32.880 | Some Chinese model.
00:38:34.680 | OK, there was state space stuff.
00:38:37.080 | I feel like the only thing that really made an impact this year
00:38:40.200 | was Jamba.
00:38:41.480 | I think we covered this as well in Paper Club.
00:38:44.720 | The rest was-- and obviously, there's
00:38:46.840 | Mamba 2 was also this year.
00:38:49.640 | And this is one of the best papers at NeurIPS.
00:38:53.840 | Yeah, not sure what else to cover.
00:38:56.440 | I think Sana kind of flew under my radar.
00:38:58.880 | But apparently, they have extended Mamba models
00:39:01.480 | to diffusion, which is kind of cool.
00:39:05.400 | And it works.
00:39:07.200 | Great.
00:39:08.800 | In my sessions with the image people,
00:39:12.360 | I had a session with some of the people working
00:39:17.360 | on Imagen and Veo at Google.
00:39:20.360 | They were very hyped up about autoregressive image
00:39:23.920 | generation.
00:39:24.960 | So instead of diffusion, they are sort
00:39:26.880 | of getting rid of the diffusion part
00:39:28.840 | and just straight autoregression for images.
00:39:32.120 | And I thought that was notable, but I didn't have the background
00:39:34.800 | to understand it.
00:39:35.760 | They were just like, yeah, next year
00:39:37.280 | is the year of autoregressive images.
00:39:38.960 | So a bit of a shift in my mind, because I thought
00:39:43.080 | that people were more-- like last year, this time last year,
00:39:45.440 | people were more interested in text diffusion,
00:39:47.360 | so diffusion going into text.
00:39:48.680 | Now, they're talking about autoregression
00:39:50.640 | going from text into images.
00:39:53.800 | So it's kind of the other way around.
00:39:56.320 | Actually, I think Sana, LoLCATs, and QRWKV
00:39:59.960 | might be in its own category, where it's really
00:40:02.080 | more about taking existing, I guess, attention-based models
00:40:06.040 | and converting them over.
00:40:07.880 | I think that's the team there.
00:40:09.760 | Models.
00:40:11.560 | Thank you for the point there.
00:40:14.880 | I know I said it, but yeah.
00:40:18.880 | Fine.
00:40:20.640 | Aren't Franken models like different model merges?
00:40:24.800 | Like take a llama, a mistral, you merge?
00:40:28.000 | No, they're not.
00:40:28.720 | They're the Franken models.
00:40:31.000 | They share a Franken merge.
00:40:33.400 | OK, this is more conversion.
00:40:35.720 | But whatever, it's still putting Frankenstein, yeah.
00:40:41.840 | Quick question on QRWKV.
00:40:43.720 | Is there a retraining phase when you replace the layer?
00:40:47.040 | Yes, there is.
00:40:48.360 | And is it retraining, or is it continual training?
00:40:52.920 | No, it's just 500 million tokens retraining the attention layers.
00:40:58.240 | And then another 500 million just on all the layers, yeah.
00:41:02.760 | So is it initialized from scratch?
00:41:06.320 | Yeah, the attention layer is initialized from scratch.
00:41:10.800 | That's crazy.
00:41:11.760 | You can train on 15 trillion tokens of attention
00:41:14.520 | to get good, or you can reinitialize and just
00:41:17.720 | train on 500 million.
00:41:19.840 | That's a scam.
00:41:20.560 | Exactly.
00:41:21.800 | It's so much less tokens.
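
A hedged sketch of the two-stage conversion recipe described here; the module check, the replacement factory, and the token budgets are placeholders, not the actual QRWKV training code:

```python
# Sketch of the conversion recipe described above, not the actual QRWKV code:
# swap attention blocks for freshly initialized replacements, train only those
# on a small token budget, then briefly train the whole model.
import torch.nn as nn

def convert_attention(model: nn.Module, make_replacement, new_blocks: list) -> None:
    """Replace every attention module with a fresh block (e.g. an RWKV-style time mix)."""
    for name, child in model.named_children():
        if isinstance(child, nn.MultiheadAttention):   # stand-in check for "attention layer"
            replacement = make_replacement(child.embed_dim)
            setattr(model, name, replacement)
            new_blocks.append(replacement)
        else:
            convert_attention(child, make_replacement, new_blocks)

def stage_one(model: nn.Module, new_blocks: list) -> None:
    # Freeze the pretrained weights; train only the reinitialized blocks (~500M tokens).
    for p in model.parameters():
        p.requires_grad = False
    for block in new_blocks:
        for p in block.parameters():
            p.requires_grad = True

def stage_two(model: nn.Module) -> None:
    # Unfreeze everything for a second short pass over all layers (~500M more tokens).
    for p in model.parameters():
        p.requires_grad = True
```
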
00:41:25.120 | I mean, it's the same intuition as, like, LLaVA, right?
00:41:29.280 | Like in my mind, very similar.
00:41:31.680 | Like it's effectively--
00:41:34.360 | LLaVA is an adapter where you merge in a new model
00:41:40.400 | to match a really strong pre-trained model, right?
00:41:43.280 | Like you have a pre-trained language model.
00:41:45.400 | You have a contrastive loss to merge.
00:41:48.400 | Like you use the backbone of a strong model,
00:41:50.600 | and then you add-- like you train a new one to match it.
00:41:53.240 | But reinitializing weights from scratch is--
00:41:56.920 | you basically restart.
00:41:58.240 | Like you start from stochastic noise, and you retrain.
00:42:01.560 | Like you have no intuition to go off of.
00:42:04.000 | But then the-- yeah, LLaVA is like you're still just--
00:42:07.640 | you're just mimicking a really strong model.
00:42:09.720 | Mimicking, OK.
00:42:15.440 | No opinion on that one.
00:42:17.680 | Yeah, I think I vaguely understand the difference.
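
For contrast, a minimal sketch of the LLaVA-style adapter idea referenced here: a small trainable projector maps a frozen vision encoder's patch features into the language model's embedding space and only the projector is trained; the dimensions are illustrative, not any particular model's.

```python
# Minimal sketch of the LLaVA-style adapter idea: only the small projector is
# trained; the vision encoder and language model backbones stay frozen.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: [batch, num_patches, vision_dim] from a frozen vision encoder
        return self.proj(patch_feats)      # [batch, num_patches, lm_dim]

projector = VisionProjector()
image_feats = torch.randn(1, 576, 1024)    # e.g. frozen CLIP-style patch features
text_embeds = torch.randn(1, 32, 4096)     # embedded prompt tokens
lm_input = torch.cat([projector(image_feats), text_embeds], dim=1)  # fed to the frozen LM
```
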
00:42:23.080 | OK, so the other big one is LoLCATs,
00:42:26.120 | which is basically the same thing as what
00:42:28.640 | QRWKV was doing, but it extends to Mamba as well.
00:42:33.840 | And they were using--
00:42:35.840 | they were essentially using LoRAs instead,
00:42:39.720 | which uses fewer resources, but performs worse, arguably,
00:42:45.200 | by my research.
00:42:48.160 | It's like everything is just so, so, so lacking--
00:42:52.560 | lacking-- what was it?
00:42:56.160 | Abelation, I can't even say which method is better or worse.
00:42:59.960 | Right, so you want one survey paper going
00:43:02.640 | comparing all these things.
00:43:04.640 | No, I think we just need to give all these more time.
00:43:09.400 | No, it's still very cool work.
00:43:11.480 | Love the name.
00:43:13.000 | xLSTM is the one that we did not cover.
00:43:16.880 | This was a best paper at NeurIPS.
00:43:20.800 | And I talked to them, so I'll release the interview
00:43:23.320 | at some point.
00:43:24.880 | I don't know if people have thoughts about this,
00:43:26.920 | but some of these gating and stuff
00:43:33.560 | is kind of interesting in the sense
00:43:35.560 | that they seem very clear about the ways in which LSTM did not
00:43:40.720 | scale or failed.
00:43:42.280 | And they seem very intense on fixing that.
00:43:45.200 | It's kind of cool.
00:43:46.840 | Yeah, cool.
00:43:48.360 | I'll move on to agents, which is the thing that we published
00:43:50.880 | today.
00:43:52.040 | We had eight things in agents.
00:43:55.240 | And I thought his talk was really--
00:43:57.720 | Graham Newbig's talk was really good.
00:43:59.560 | I don't know if there was a breakthrough.
00:44:01.440 | So last year, the obvious winner for last year was Voyager.
00:44:05.960 | The Voyager paper was one of the must-reads.
00:44:08.720 | I would say also the paper that I
00:44:11.920 | love to hate, the Smallville Generative Agents.
00:44:19.040 | I don't know if there's any other papers that really
00:44:21.280 | broke through this year.
00:44:23.720 | Why do you like to hate Smallville?
00:44:25.240 | It simulates chats for entertainment.
00:44:31.160 | And then in order to--
00:44:32.640 | the reason it sells itself so well
00:44:34.640 | is primarily because of that image that is there.
00:44:40.080 | It taps into this 8-bit nostalgia effect
00:44:46.600 | that people really like.
00:44:49.240 | You know what I'm talking about.
00:44:50.760 | Where's the paper?
00:44:51.560 | Yeah, I know the paper.
00:44:53.200 | So yeah, you're not happy with people over-hyping it.
00:44:58.920 | It caused this generation of papers
00:45:01.280 | that all look like this here.
00:45:05.320 | Oh, all the 8-bit images in papers for the entire way.
00:45:10.280 | It's no fucking thing to do with how the agents work.
00:45:13.480 | It's just people think it looks cute.
00:45:16.160 | And therefore, it does well on social media.
00:45:19.720 | It's just really bad science.
00:45:22.280 | I don't know how to say it.
00:45:28.880 | 8-bit is all you need.
00:45:30.680 | It's as bad as all you need.
00:45:34.040 | So it just gigafries people's brains.
00:45:37.160 | And this random project gets 26,000 stars because of this.
00:45:45.480 | Because of this.
00:45:46.680 | And I guarantee you, nobody's using it.
00:45:49.080 | It's very annoying.
00:45:49.880 | It's the source of all the noise.
00:45:51.360 | OK, makes sense.
00:45:55.440 | Yeah, hot takes.
00:46:01.560 | OK, cool.
00:46:03.360 | I think I'm done.
00:46:04.240 | I know that's the year in review for papers.
00:46:07.040 | Maybe there's other papers I haven't picked up,
00:46:09.000 | but I wanted to just share.
00:46:10.320 | Eugene, do you want to do it?
00:46:17.160 | Thank you, swyx.
00:46:20.200 | Sure, I can do that.
00:46:21.960 | I don't think I--
00:46:22.960 | Why this one sort of lodged in your brain
00:46:25.800 | as an interesting long context benchmark?
00:46:29.640 | Well, the thing is, I'm thinking a lot about long context
00:46:32.880 | recently.
00:46:33.560 | And I mean, from early on, I think neither a haystack.
00:46:37.680 | I mean, you're asking me about long context.
00:46:39.480 | I was like, come on, that's not really long context.
00:46:41.640 | I mean, it's really just extractive.
00:46:43.440 | So imagine if you were to summarize a book,
00:46:45.240 | you'd ask questions of a book.
00:46:46.640 | It's really a lot of reasoning.
00:46:48.800 | And I think this long context paper,
00:46:50.480 | they actually shared something.
00:46:52.200 | Oh, man, crap.
00:46:53.720 | I need to-- sorry, guys.
00:46:58.840 | This is a brand new laptop.
00:47:04.040 | So I'll mention BABILong.
00:47:05.680 | I'll buy some time for you.
00:47:10.080 | There's always a lot of interest in long context models.
00:47:14.160 | And I find that--
00:47:16.720 | so BABILong was sort of the long context winner of NeurIPS.
00:47:21.800 | This is the one.
00:47:25.680 | Eugene is not going to cover this one.
00:47:27.280 | He's going to cover something else that he found.
00:47:29.520 | But I think this--
00:47:30.520 | and then RULER is the one that we covered on Latent Space.
00:47:33.000 | This guy, where they train a million context LLM.
00:47:41.480 | Oh, maybe there should be a category for long context.
00:47:43.800 | I wonder.
00:47:44.840 | That could be interesting.
00:47:45.920 | OK, Eugene, you're back.
00:47:48.040 | OK, sweet.
00:47:49.200 | So the reason why--
00:47:50.560 | so now I want to share with you about this paper, which
00:47:53.960 | I think is pretty cool.
00:47:56.240 | Wait, are you actually able to see it now?
00:47:58.200 | Yes, you can see it.
00:47:59.880 | OK, perfect.
00:48:00.680 | Are you seeing my Zotero?
00:48:03.760 | I don't know why.
00:48:04.480 | OK, I'm assuming you're seeing my Zotero.
00:48:06.360 | So this is LongBench v2, Towards Deeper Understanding
00:48:09.400 | and Reasoning on Realistic Long-Context Multitasks.
00:48:11.080 | I actually have a comparison of LongBench 1,
00:48:13.320 | which I will go over at the end of everything.
00:48:16.120 | But I think it's really interesting to see
00:48:18.040 | all the different iterations of it.
00:48:19.480 | And you can see actually what they found
00:48:21.440 | didn't work in the previous iteration
00:48:23.600 | and how they try to improve on it.
00:48:25.680 | Long story short, LongBench 1--
00:48:28.440 | no, sorry, LongBench 2, how they try to create it
00:48:31.480 | is they recruit it.
00:48:33.680 | I'm going to go through this pretty quickly.
00:48:36.640 | So the task is--
00:48:38.840 | it's extremely long context.
00:48:40.960 | The context is about 8,000 to 2 million words.
00:48:44.760 | And they have several tasks.
00:48:46.640 | The simple one is really just a single document.
00:48:48.720 | Single document, they do Q&A on it.
00:48:51.080 | And then there's also multi-document Q&A.
00:48:53.600 | And what is new in LongBench 2 that was not in LongBench 1
00:48:57.360 | was Long Context History Understanding.
00:48:59.720 | So history understanding is two kinds.
00:49:01.840 | There's chat between multiple LLM agents.
00:49:05.200 | And then there's also chat between a human
00:49:07.080 | and an assistant.
00:49:07.880 | So essentially, seeing the LLM can actually
00:49:10.680 | reason over these long chats that
00:49:12.560 | are made up of small text.
00:49:14.720 | And another thing that's new is Long Structured Data
00:49:17.120 | Understanding.
00:49:18.120 | So there's tables and there's knowledge graphs.
00:49:21.200 | So how they collected the data--
00:49:22.880 | so previously in LongBench 1, a lot of it
00:49:25.120 | was synthetic-generated data.
00:49:27.520 | In this case, they actually got undergrads to create the data.
00:49:31.720 | And they incentivized them pretty well.
00:49:35.400 | They paid them quite a fair amount of money
00:49:37.200 | to generate all these long benchmarks.
00:49:39.720 | And they also had a lot of people to review it.
00:49:42.080 | So after they collect the data, one thing
00:49:44.720 | that's really interesting is that they
00:49:46.880 | ask annotators to annotate it.
00:49:49.120 | And they actually review it, which
00:49:51.160 | is they run the question through three LLMs,
00:49:54.840 | like fairly small, fast LLMs.
00:49:57.520 | And they would only keep--
00:49:59.800 | so over here, here's the image.
00:50:02.960 | They would only keep the question
00:50:05.640 | if at least one out of the three LLMs get it wrong.
00:50:09.680 | So essentially, these questions are somewhat fairly hard.
00:50:13.280 | And I won't go through all the different criteria
00:50:17.120 | they had for keeping the questions.
00:50:18.800 | But suffice to say, these questions
00:50:19.960 | are actually pretty good.
00:50:21.000 | They are fairly accurate.
00:50:22.160 | They actually did a manual review of it.
00:50:24.040 | There's maybe about 3% of them which are erroneous.
00:50:26.800 | But they're pretty good, and they're pretty hard.
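
A small sketch of the difficulty filter just described; the `ask` helper and the model names are hypothetical placeholders rather than the paper's exact setup:

```python
# Sketch of the difficulty filter described above: keep a candidate question only
# if at least one of a few fast LLMs gets it wrong. `ask` and the model list are
# hypothetical placeholders, not the paper's exact setup.
def is_hard_enough(question: str, context: str, gold_choice: str, ask, models) -> bool:
    wrong = 0
    for model in models:
        prediction = ask(model, context, question)   # returns "A" / "B" / "C" / "D"
        if prediction != gold_choice:
            wrong += 1
    return wrong >= 1                                # at least one model must fail

def filter_candidates(candidates, ask, models=("fast-llm-1", "fast-llm-2", "fast-llm-3")):
    return [c for c in candidates
            if is_hard_enough(c["question"], c["context"], c["answer"], ask, models)]
```
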
00:50:30.000 | So now let's go into the baselines.
00:50:32.520 | They tested it with 10 open-source LLMs,
00:50:34.800 | as well as 6 proprietary LLMs.
00:50:37.480 | They have zero short and zero short chain of thought,
00:50:40.520 | which is pretty standard.
00:50:42.120 | I won't go through them.
00:50:43.920 | So here's the results.
00:50:46.000 | What we see from the open-source models
00:50:47.920 | is that Qwen 2.5 72B performs the best.
00:50:51.040 | You can see it's the most bolded here.
00:50:53.960 | And o1-preview, we see that o1 actually
00:50:59.520 | provides a lot more juice than just regular 4o.
00:51:04.320 | But what is really interesting here is this.
00:51:08.040 | So we have easy questions over here
00:51:12.440 | where a human gets 100% of it correct, right?
00:51:16.840 | And LLMs are not able to get fully 100% of it correct.
00:51:20.560 | And then we have hard questions where a human only
00:51:23.640 | gets 25% of it correct, but LLMs get 50% or more of it correct.
00:51:29.320 | So what is this here?
00:51:31.760 | This really demonstrates to you the jagged edge
00:51:33.640 | of intelligence, right?
00:51:34.880 | Where by some of the easy tasks, humans just
00:51:36.880 | get all of it correct.
00:51:38.120 | But LLMs do consistently make mistakes.
00:51:40.480 | That's the thing about pushing beyond 80% accuracy.
00:51:43.240 | LLMs are going to find it harder.
00:51:46.480 | But then for the longer context, where humans struggle with,
00:51:51.760 | they only get 21% correct, LLMs with huge compute
00:51:56.640 | and their huge context length, they actually
00:51:58.680 | can get quite a bit of it correct.
00:52:00.120 | So that's pretty good.
00:52:01.840 | Similarly, you can see on this over here,
00:52:04.920 | for short context lengths, you can see humans get 47% of it,
00:52:08.520 | but LLMs outperform humans already.
00:52:11.200 | And then for medium and long context lengths,
00:52:15.320 | humans still outperform LLMs.
00:52:18.800 | So that's what is interesting I found here
00:52:22.160 | about the jagged edge of intelligence.
00:52:26.600 | Finally, they also tried--
00:52:29.640 | so what does this mean?
00:52:31.000 | Does this mean that, OK, long context is all you need?
00:52:33.240 | You don't really need a retrieval-augmented
00:52:35.120 | generation baseline?
00:52:36.360 | Well, it's different here.
00:52:38.200 | What they found is that a lot of the models,
00:52:41.080 | they don't actually perform better
00:52:44.800 | with the entire context length.
00:52:46.120 | So for example, Qwen 2.5 and GLM-4-Plus,
00:52:50.200 | they actually do better at shorter context lengths, 32K,
00:52:52.960 | with retrieval, compared to using the entire context
00:52:57.640 | window, the 128K context window, without RAG.
00:53:00.960 | So I think in this case, for this specific models,
00:53:07.680 | RAC still performs better.
00:53:10.640 | Then finally, but the one exception
00:53:13.840 | they raised was this, GPT-4, oh, actually
00:53:16.400 | is able to use the full context length.
00:53:18.840 | So that's interesting.
00:53:19.760 | Actually, it's long enough.
00:53:24.160 | I would have loved to see Sonnet.
00:53:26.000 | So you can see Sonnet over here.
00:53:27.280 | Oh, actually, they do have Sonnet.
00:53:29.240 | Sonnet is actually quite far behind 4o, with 50%.
00:53:34.800 | So well, that's a comparison between OpenAI
00:53:38.840 | and Anthropic.
00:53:41.920 | They also have something whereby they tested whether, hey,
00:53:44.360 | did the LLM memorize it?
00:53:46.960 | So this is really just asking the LLM a question
00:53:49.560 | without providing a context and see if it can answer.
00:53:52.720 | So in this case, they showed that the LLM doesn't memorize.
00:53:55.360 | Some of these questions are really interesting.
00:53:57.320 | I wanted to highlight one of them, which I thought
00:54:00.600 | was actually quite difficult, but it's quite--
00:54:02.600 | which is interesting.
00:54:09.080 | This one.
00:54:10.000 | They introduced a task where the LLM is given a mystery novel.
00:54:15.760 | And the LLM requires-- and the LLM
00:54:17.800 | is asked to identify the killer or identify the motive based
00:54:21.680 | on the information provided in the detective novel.
00:54:24.440 | So you can think that this is something
00:54:26.040 | that a human would take a very long time to do, right?
00:54:28.480 | They would have to scan through an entire book
00:54:30.600 | or pick up pieces here and there.
00:54:32.360 | But the LLM actually does this very well.
00:54:35.440 | And then you can see that the LLM--
00:54:41.480 | that said, the human accuracy is also
00:54:43.080 | extremely high on these detective novels,
00:54:45.560 | like 65% and 72%.
00:54:46.960 | But LLMs-- humans really suck at the academic benchmarks,
00:54:51.760 | which is academic multi-document, expert.
00:54:55.640 | Expert accuracy was only 22%, as well as governments.
00:54:58.640 | So I think maybe academics and governments, that's
00:55:03.280 | where LLMs might outperform humans.
00:55:06.000 | And you can see this is where expert accuracy also
00:55:09.640 | differs on dialogue history.
00:55:13.280 | Now, then maybe the question that you may be asking
00:55:16.120 | is, how does this differ from LongBench v1?
00:55:21.920 | So LongBench v1 was really extractive questions,
00:55:25.120 | very much similar to needle in a haystack.
00:55:27.360 | Essentially, given this thing, what was--
00:55:30.000 | I don't know, what pizza did Eugene Chia eat?
00:55:32.280 | Or what was Eugene Yan's favorite food?
00:55:34.560 | Things like that.
00:55:35.520 | But here, you can see that
00:55:37.840 | you really require understanding and reasoning.
00:55:40.040 | The example being the mystery novel,
00:55:41.840 | you have to identify the killer and the motive.
00:55:45.640 | Now, the second one is that the evaluation format is only MCQ.
00:55:49.400 | So previously, they used F1 and ROUGE,
00:55:51.120 | and they found them to be extremely unreliable.
00:55:53.000 | They also tried using LLM-as-judge
00:55:54.960 | and found it very expensive.
00:55:56.560 | Now, we can debate that MCQ means a 25% random chance
00:56:00.200 | of getting it correct,
00:56:01.360 | so it may not be as precise.
00:56:02.440 | Well, for them, they actually found MCQ
00:56:04.960 | to be a more reliable measure than ROUGE or F1.
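A quick back-of-the-envelope on why the 25% guessing floor is manageable: with a few hundred MCQ items, the noise band around pure guessing is narrow, so above-chance scores are easy to detect. This is just illustrative arithmetic, not the paper's analysis.

```python
import math

def guessing_band(n_questions: int, p_chance: float = 0.25) -> tuple[float, float]:
    # ~95% interval around pure guessing for an n-question MCQ benchmark.
    se = math.sqrt(p_chance * (1 - p_chance) / n_questions)
    return p_chance - 1.96 * se, p_chance + 1.96 * se

# e.g. with 500 questions, guessing lands in roughly (0.21, 0.29),
# so a model scoring 0.45 is clearly above chance.
print(guessing_band(500))
```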
00:56:08.640 | The third one is that the curation is actually
00:56:12.080 | very rigorous.
00:56:12.880 | Essentially, given the questions,
00:56:14.600 | they actually tested them against three LLMs
00:56:17.720 | to make sure that at least one of them
00:56:19.720 | gets it wrong, to test if the question is hard enough.
00:56:22.640 | And they also reviewed it with human experts.
00:56:24.720 | Can the human expert answer it?
00:56:26.120 | And human experts have up to 15 minutes
00:56:29.400 | to try to answer a single question.
00:56:31.480 | And if they can't answer within 15 minutes,
00:56:33.560 | they actually have the option to say that,
00:56:34.760 | hey, I don't know, it's too hard.
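The curation loop just described could be sketched roughly like this; the data layout, the dummy checkers, and the review helper are hypothetical stand-ins for the three checker LLMs and the expert review tooling.

```python
# Hypothetical sketch of the two-stage curation filter described above.
# Question format assumed: {"question": ..., "choices": [...], "answer": idx}.

def keep_question(question: dict, checkers: list) -> bool:
    # `checkers` stand in for the three checker LLMs: each is a callable that
    # takes the question (with its full context) and returns the index of the
    # option it picks. Keep the question only if at least one checker gets it
    # wrong, i.e. it is hard enough.
    wrong = sum(check(question) != question["answer"] for check in checkers)
    return wrong >= 1

def expert_verdict(question: dict, expert_choice, minutes_spent: float) -> str:
    # Human experts get up to 15 minutes per question and may abstain.
    if expert_choice is None or minutes_spent > 15:
        return "too hard / abstain"
    return "correct" if expert_choice == question["answer"] else "incorrect"

# Dummy usage with three "checkers" that always pick option 0.
q = {"question": "Who is the killer?",
     "choices": ["A", "B", "C", "D"], "answer": 2}
print(keep_question(q, [lambda _: 0] * 3))                   # True: all three are wrong
print(expert_verdict(q, expert_choice=2, minutes_spent=12))  # "correct"
```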
00:56:36.800 | So you can see, this is the human expert accuracy.
00:56:39.320 | For most of the tasks,
00:56:41.600 | I think it's maybe 50 to 60%.
00:56:44.920 | So even the human experts
00:56:46.240 | have quite a way to go on this.
00:56:48.000 | And then the final thing is that they included more tasks,
00:56:53.760 | namely long dialogue history
00:56:55.120 | as well as long structured data.
00:56:56.720 | So yeah, those are the two new task types.
00:57:02.120 | So that's it, that's the main thing for LongBench v2.
00:57:04.800 | Yeah, any questions?
00:57:07.560 | - Actually, I would like to add
00:57:10.120 | that I like this because it was very consistent
00:57:13.640 | with some of the conversations I had with several teams,
00:57:17.320 | in particular in the legal AI space,
00:57:20.760 | because a lot of them deal with high-end open source models.
00:57:23.760 | When I first talked to them last,
00:57:26.800 | I think at the start of last year,
00:57:29.000 | my first impression was
00:57:29.840 | that they would really love long context
00:57:31.640 | because you have a large amount of legal text.
00:57:36.120 | And they said, yeah, needle in a haystack is meaningless.
00:57:39.120 | You can get 128k on needle in a haystack,
00:57:41.480 | but it's not able to reason over the legal text
00:57:44.400 | past 8k or 32k, depending on the model.
00:57:46.960 | So a lot of them actually do 8k or 32k,
00:57:50.720 | depending on which architecture they're using
00:57:52.800 | for their legal LLM,
00:57:55.080 | despite the model being able to handle much larger.
00:57:59.800 | So this is more scientific evidence
00:58:02.040 | to what they said they tested internally.
00:58:04.280 | - And that's what is fun as well.
00:58:06.840 | Other than GPT-4o,
00:58:08.360 | most of the models didn't really
00:58:11.880 | perform better at 128k versus 32k.
00:58:14.800 | I think that's really just a matter of time.
00:58:16.600 | I think eventually, for 128k-context documents,
00:58:21.600 | we probably won't need RAG anymore.
00:58:24.000 | At least that's where I'm betting.
00:58:27.560 | - The way I rationalize it in the open-source space
00:58:30.240 | is that there's just a lack of proper training data
00:58:32.880 | at this scale.
00:58:33.720 | And also, to sum it up,
00:58:36.200 | in the whole post-transformer space,
00:58:39.400 | the problem for a lot of the open teams is that
00:58:41.480 | training at 128k is kind of big-lab territory
00:58:44.480 | because of the amount of VRAM required
00:58:45.800 | for the back propagation.
00:58:47.440 | So even though,
00:58:49.880 | strictly speaking, teams like ours could train at 128k,
00:58:52.200 | none of us will train at 128k.
00:58:56.040 | - Yeah, yeah, exactly.
00:58:57.840 | Okay, so that's it.
00:59:00.280 | That's the very short summary of LongBench v2,
00:59:03.200 | if you're interested in long-context benchmarks.
00:59:05.680 | Anything else?
00:59:09.600 | - Oh, sorry.
00:59:11.560 | Just on your comment on the 128k VRAM requirements,
00:59:16.400 | does any of the federated learning stuff help?
00:59:23.760 | - Federated, are you talking distributed federated
00:59:26.120 | or like, sorry?
00:59:28.080 | - Yeah.
00:59:28.920 | - No.
00:59:30.400 | - Whatever, you know, you need.
00:59:32.720 | - So, at the end of the day,
00:59:35.520 | with long context,
00:59:36.920 | your biggest bottleneck is you need a set of nodes
00:59:40.720 | together to essentially handle the entire problem
00:59:45.720 | from beginning to end,
00:59:46.800 | including every single state in the middle.
00:59:49.040 | And I don't think distributed multi-cluster training
00:59:53.320 | will help in this sense,
00:59:54.520 | 'cause each cluster will need to be able to handle
00:59:56.520 | the training for 128k.
00:59:57.880 | - Right, so I was thinking that one of those,
01:00:02.960 | like whatever Nous Research is doing,
01:00:04.600 | can help you distribute it effectively.
01:00:07.520 | My understanding is that the post-training for long context
01:00:12.160 | isn't actually that much in terms of compute.
01:00:14.920 | Like this is just purely a VRAM issue.
01:00:17.160 | - Actually, no, I think
01:00:22.040 | that's where I would disagree,
01:00:23.720 | because it's not much to get it to pass needle-in-a-haystack.
01:00:27.160 | It's actually a lot to get it to do proper reasoning.
01:00:29.800 | So, for example, there's a lot of cheats that we can do.
01:00:33.560 | Like, for example, on the state space side, right?
01:00:35.680 | Right now, we extend the context window, kind of,
01:00:38.720 | by snapshotting at 4k.
01:00:41.040 | So we do training in 4k chunks.
01:00:43.920 | But the reason why this does not work well, right,
01:00:47.120 | is that the model just
01:00:49.880 | learns to, let's say,
01:00:51.600 | hey, just try and memorize as much as possible
01:00:53.600 | so that needle-in-a-haystack down the line kind of works.
01:00:56.880 | But in, let's say, the detective-story
01:00:59.800 | kind of situation, right,
01:01:01.080 | You answer at the last chunk,
01:01:05.240 | but the back propagation is unable to back prop
01:01:07.960 | to the first chunk and say,
01:01:09.120 | hey, when you read this story,
01:01:11.120 | understand the reasoning behind it.
01:01:12.680 | And that's where it all just disconnects
01:01:14.800 | because of the VRAM capacity requirement.
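Here's a minimal PyTorch sketch of that chunked-training pattern, using a plain GRU purely as a stand-in for an RWKV/state-space block: the recurrent state is carried across 4K chunks but detached, so the loss on the last chunk can never push gradients back into the first chunk where the clue appeared. Model sizes and the random "document" are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in recurrent model; the real setting would be a state-space block,
# but the gradient-flow issue is the same.
vocab, dim, chunk_len = 1000, 64, 4096
embed = nn.Embedding(vocab, dim)
rnn = nn.GRU(dim, dim, batch_first=True)
head = nn.Linear(dim, vocab)
opt = torch.optim.Adam(
    list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters()))

# One "long document" of 32K tokens, processed as 8 chunks of 4K.
tokens = torch.randint(0, vocab, (1, 8 * chunk_len))

state = None
for start in range(0, tokens.size(1), chunk_len):
    chunk = tokens[:, start:start + chunk_len]
    out, state = rnn(embed(chunk), state)
    # Detach: keeps VRAM bounded to one chunk's activations, but also
    # means gradients cannot flow into any earlier chunk.
    state = state.detach()
    logits = head(out)
    # Next-token loss computed on this chunk only.
    loss = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, vocab), chunk[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
# A question answered in the final chunk therefore cannot teach the model
# how to *read* the first chunk differently, only what to memorize.
```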
01:01:18.920 | And this is tied to compute as well,
01:01:21.080 | because this is technically
01:01:24.120 | the quadratic curve, actually.
01:01:26.080 | So this is the part where,
01:01:28.440 | even for us, right, in training,
01:01:29.960 | we still suffer from that quadratic curve.
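Some rough numbers on that quadratic curve, with purely illustrative layer and head counts, assuming the raw attention score matrix is materialized (i.e. no FlashAttention-style recomputation):

```python
# Illustrative only: 32 layers x 32 heads, fp16 attention scores.
layers, heads, bytes_per = 32, 32, 2

def score_matrix_gib(seq_len: int) -> float:
    # Attention scores are seq_len x seq_len per head per layer.
    return layers * heads * seq_len * seq_len * bytes_per / 2**30

for L in (4_096, 32_768, 131_072):
    print(f"{L:>7} tokens -> ~{score_matrix_gib(L):,.0f} GiB of raw attention scores")
# 4K   ->     ~32 GiB
# 32K  ->  ~2,048 GiB
# 128K -> ~32,768 GiB  (hence chunking / recomputation tricks)
```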
01:01:32.120 | - Yeah.
01:01:34.440 | Cool.
01:01:35.280 | I, yeah.
01:01:37.160 | No comments beyond that.
01:01:40.760 | Cool.
01:01:44.720 | I guess that's the last paper club of the year.
01:01:47.320 | (laughing)
01:01:49.560 | That's sloppy as hell.
01:01:53.360 | - Very excited.
01:01:56.520 | Okay.
01:01:57.360 | Merry Christmas, everyone.
01:01:58.800 | - Merry Christmas, everyone.
01:02:00.040 | Ho, ho, ho, ho.
01:02:01.280 | - Yeah.
01:02:02.120 | - Bye.