[Paper Club] The 2025 AI Engineer Reading List + LongBench Paper



00:00:00.000 | [INAUDIBLE]
00:00:02.960 | We covered Sora previously.
00:00:05.040 | Yeah.
00:00:06.800 | Yeah, Sora is not that important, right?
00:00:08.520 | [LAUGHS]
00:00:11.000 | SAM 2, yeah, I don't know if people have played around
00:00:13.480 | with SAM 2 or listened to the pod.
00:00:16.800 | I think definitely quite revolutionary.
00:00:16.800 | Also, another example of the thesis of vision
00:00:23.240 | becoming video, because what SAM 2 did was take SAM 1
00:00:28.800 | and extend it in the video direction.
00:00:32.760 | And they use this really cool architecture
00:00:36.600 | for memory attention that has object permanence.
00:00:41.960 | So where we see that in--
00:00:44.400 | I think it's in the related tweet here--
00:00:47.560 | where you can track people going off screen and coming back
00:00:50.160 | on screen, which is something that I like to show off.
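
A rough conceptual sketch of the memory-attention idea being described, assuming a simple rolling memory bank plus cross-attention; this is an illustration only, not SAM 2's actual implementation:

```python
# Conceptual sketch only, not SAM 2's real code: keep a rolling memory bank of
# embeddings from past frames and let the current frame cross-attend to it, so an
# object that left the frame can be re-identified when it comes back.
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, memory_size: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory_size = memory_size
        self.memory: list[torch.Tensor] = []   # one [num_tokens, dim] tensor per past frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: [1, num_tokens, dim] features of the current frame
        if self.memory:
            mem = torch.cat(self.memory, dim=0).unsqueeze(0)     # [1, mem_tokens, dim]
            fused, _ = self.cross_attn(frame_feats, mem, mem)    # attend to past frames
            frame_feats = frame_feats + fused                    # residual fusion with memory
        self.memory.append(frame_feats.squeeze(0).detach())      # push current frame
        self.memory = self.memory[-self.memory_size:]            # keep a rolling window
        return frame_feats

mem_attn = MemoryAttention()
for _ in range(5):                          # pretend video: 5 frames of 64 tokens each
    fused = mem_attn(torch.randn(1, 64, 256))
```
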
00:00:58.440 | I don't know if this works or not.
00:01:00.120 | I understand.
00:01:00.800 | There we have it.
00:01:02.480 | Oh, yeah.
00:01:03.320 | All right.
00:01:03.880 | So I think this is on their demo page.
00:01:08.520 | Let's see if it's here.
00:01:11.800 | Yeah, so this is segment anything across video.
00:01:13.800 | I don't know if there's--
00:01:17.800 | oh, yeah, here.
00:01:18.560 | And--
00:01:19.240 | OK, OK, this is the most impressive one.
00:01:22.080 | So you know this three-cup ball experiment?
00:01:26.200 | It can track where the ball is.
00:01:27.480 | You might have seen.
00:01:28.920 | And so here, I can click on the ball in the first frame.
00:01:32.680 | I can also click on a different cup.
00:01:36.000 | And so here, the additional challenge
00:01:38.200 | is that there's three cups that look exactly the same.
00:01:40.600 | And then there's a ball that will get occluded by the cup.
00:01:43.200 | So the ball is no longer visible.
00:01:44.240 | The cups are all moving around.
00:01:45.200 | They all look the same.
00:01:46.320 | But the model actually keeps track of the cup
00:01:48.200 | that we selected.
00:01:48.880 | And as you can see at the end-- here,
00:01:50.360 | I'll jump to the end so you can see--
00:01:51.720 | it actually finds the cup again.
00:01:53.080 | I wanted to point out a couple of fun demo UX features
00:01:55.480 | that we added that actually--
00:01:56.720 | yeah, so I thought it was pretty impressive.
00:02:00.240 | And people are using this in real-life situations.
00:02:02.720 | I would argue among the listed vision models,
00:02:08.680 | SAM is probably taking the crown in terms
00:02:10.560 | of real-world applications.
00:02:12.160 | Yeah, it's also kind of interesting.
00:02:16.680 | Basically, the SAM team publishes one thing a year,
00:02:18.720 | and then they take the rest of the year off.
00:02:20.960 | Like it's-- they're like the epitome of we
00:02:25.440 | have defined one problem well.
00:02:27.160 | We solve it very well.
00:02:28.400 | And then we don't publish anything else,
00:02:32.320 | which is kind of cool.
00:02:34.960 | Let me just grab your recommendations.
00:02:37.160 | Let me see them in the chat.
00:02:40.760 | It's here, right?
00:02:41.600 | Cool.
00:02:46.640 | Then this new one, which I was not paying attention
00:02:50.160 | to, DETRs and all that.
00:02:54.840 | I know that YOLOv10 was actually at NeurIPS,
00:03:00.800 | which was the latest update of YOLOs and real-time object
00:03:05.480 | detection.
00:03:06.000 | But apparently, according to the vision guys,
00:03:09.240 | DETRs are mostly replacing YOLOs
00:03:13.080 | in terms of their performance.
00:03:15.520 | I'm not really sure why.
00:03:16.560 | I didn't really go into it.
00:03:17.840 | But if people care about real-time object detection,
00:03:20.280 | that is it.
00:03:21.160 | Oh, yeah, I mean, the other thing about segment anything,
00:03:24.880 | they studiously avoid labeling.
00:03:29.240 | So they only know how to draw segments.
00:03:31.960 | And to me, it's very similar to typical conv net layers,
00:03:37.640 | where there's one layer that only does edge detection.
00:03:41.400 | And so this is like, because they constrain it very well,
00:03:44.440 | they solve it very well.
00:03:45.440 | But at the same time, for practical use,
00:03:47.200 | you basically always want to label things.
00:03:49.000 | So then you have to combine segment anything
00:03:50.840 | with Grounding DINO and stuff, which is not a full solution.
00:03:56.600 | YOLOs, I think, also would have that same application.
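
A minimal sketch of the "detector for labels, SAM for masks" combination described here; `detect_boxes` and `segment_box` are hypothetical stand-ins for whatever Grounding DINO and SAM wrappers are actually used:

```python
# Sketch of the labeling pipeline mentioned above: an open-vocabulary detector
# supplies labeled boxes, and SAM turns each box into a mask. `detect_boxes` and
# `segment_box` are hypothetical stand-ins for real Grounding DINO / SAM wrappers.
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Tuple

@dataclass
class LabeledMask:
    label: str
    score: float
    mask: Any            # e.g. an HxW boolean array produced by SAM

def label_and_segment(
    image: Any,
    text_prompt: str,
    detect_boxes: Callable[[Any, str], Iterable[Tuple[Any, str, float]]],
    segment_box: Callable[[Any, Any], Any],
) -> list[LabeledMask]:
    """Combine a text-prompted detector (labels) with a promptable segmenter (pixels)."""
    results = []
    for box, label, score in detect_boxes(image, text_prompt):
        results.append(LabeledMask(label=label, score=score, mask=segment_box(image, box)))
    return results

# Usage sketch: label_and_segment(img, "cup. ball.", my_dino_fn, my_sam_fn)
```
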
00:04:05.440 | I'll keep moving.
00:04:06.520 | But then also, I also don't want to maybe dominate
00:04:09.000 | too much of the conversation.
00:04:11.360 | I did read this MMVP paper, which
00:04:14.760 | they highlighted as one of the more well-talked about papers.
00:04:19.280 | By the way, I have this really useful extension
00:04:21.360 | for finding the related tweets of all the papers.
00:04:25.720 | And so I don't think this particular paper was
00:04:33.840 | that influential.
00:04:35.080 | But it was one of the best papers of CVPR.
00:04:38.520 | And it basically just pointed out the sort of, quote,
00:04:40.720 | unquote, "jagged intelligence" of frontier models.
00:04:44.720 | So even here, where the--
00:04:54.720 | they found-- they cataloged all the hallucinations
00:04:58.440 | that are still remaining in the frontier models
00:05:01.480 | and why, even though they're superhuman in some aspects,
00:05:05.240 | they're not superhuman here.
00:05:09.040 | Yeah.
00:05:10.960 | And so that creates gaps for other models to fill in.
00:05:19.600 | And I think this is an example of--
00:05:22.160 | I mean, they have a benchmark.
00:05:23.400 | And anytime you see a benchmark like this, where all
00:05:30.720 | the frontier models are here and the humans are here,
00:05:33.600 | that is a reliable way to advance the field, which
00:05:37.120 | is you're going to find gaps that it's not doing well in.
00:05:40.960 | And I think this sort of finally made it click for me,
00:05:44.240 | that why so many people were kind of focusing on clocks
00:05:48.040 | this year, here, which became a focus for PixMo
00:05:55.480 | and Moondream, which is just analog devices.
00:06:01.200 | Well, I'm trying to look for the analog in this situation.
00:06:05.440 | Can't really find it.
00:06:06.920 | It's somewhere in the presentations that I saw.
00:06:09.120 | But basically, I think this is--
00:06:15.960 | when you publish an influential shortcomings paper
00:06:19.800 | or benchmark, then people can sort of meaningfully
00:06:22.000 | advance on it.
00:06:22.560 | And so I think then they picked out PaliGemma, Florence,
00:06:26.280 | and Moondream.
00:06:27.560 | I should put Moondream there as the examples.
00:06:32.600 | Yeah, what's up?
00:06:34.200 | Is this essentially an AGI problem for the vision field?
00:06:38.440 | I guess, which is fun, which is interesting as well.
00:06:41.440 | A lot of people think that ARC-AGI is a vision issue, which--
00:06:45.640 | I am primarily using my vision sense when I'm looking at ARC-AGI.
00:06:49.360 | But Francois Chollet really, really insists that ARC-AGI is not
00:06:55.400 | a vision problem.
00:06:57.440 | I mean, this looks vision to me.
00:07:01.360 | No, but I mean, so the reason why I agree with him also
00:07:05.040 | is that even the OpenAI solution and the leading
00:07:08.840 | model of OpenAI and the winning models,
00:07:12.760 | they're not solving it through vision.
00:07:14.640 | They're solving it through text.
00:07:16.880 | It's a reasoning problem.
00:07:19.680 | Yes, but don't you think vision would help?
00:07:23.560 | Like the a priori.
00:07:26.120 | I understand that the winners, none of them use vision.
00:07:30.320 | But a priori, it should help.
00:07:34.360 | No, there are teams that are trying
00:07:36.480 | to use the vision models, and it didn't work somehow.
00:07:39.960 | So that's an interesting--
00:07:41.960 | That's a scale issue.
00:07:43.120 | It's a scale, I don't know which one it is.
00:07:49.240 | Oh, I wonder if you run QVQ on these things,
00:07:53.160 | would it be any different?
00:07:54.800 | One of the things I was thinking about for the Christmas
00:07:58.560 | episode today was taking some of the published ArcAGI questions
00:08:04.720 | and seeing if humans can solve it.
00:08:06.120 | Let's do one.
00:08:11.640 | You want to do one?
00:08:12.520 | I'll find one.
00:08:13.480 | You keep going.
00:08:14.360 | Let me find a fun one.
00:08:15.480 | OK, OK.
00:08:16.800 | Yeah, I posted a link in the Discord somewhere.
00:08:21.600 | I think Greg Kamradt is the guy to find it.
00:08:24.600 | And then find something where O3 failed to solve
00:08:29.920 | one of the ArcAGI problems.
00:08:31.280 | And then I wonder if humans can do it.
00:08:33.600 | OK, what else about PaliGemma?
00:08:36.760 | PaliGemma, that directly led to ColPali,
00:08:40.840 | and then also now ColQwen.
00:08:43.360 | I think in terms of the specific field of PDF parsing,
00:08:48.120 | this has been a relatively big win, I think, in this year.
00:08:54.240 | There's also Marker, Surya, as well,
00:08:56.920 | so that emerged this year as vision-based models
00:09:01.560 | for things.
00:09:03.640 | So where does this go?
00:09:05.080 | I think next year, vision.
00:09:08.000 | Next year, MMVP solutions to, I don't know, 50%, 70%.
00:09:15.560 | But I'm interested in what's next.
00:09:17.880 | I think a lot of people are focusing
00:09:19.880 | on the MMMU, or the multimodal MMLU, or whatever.
00:09:24.200 | But I'm not sure what else is left, apart from, I guess,
00:09:28.000 | moving on to video generation.
00:09:32.160 | The artificial analysis people, I would say,
00:09:35.320 | are the de facto leaders now in terms of judging video models.
00:09:41.160 | So you should be aware of this leaderboard,
00:09:44.920 | where you can track the arenas for all these things.
00:09:49.040 | And they're trying to do image and speech, as well.
00:09:52.440 | But video seems to be the most hyped.
00:09:55.720 | Yeah.
00:09:57.040 | Cool.
00:09:57.560 | Shall we move on?
00:09:58.160 | Any other thoughts or additions to the vision video domain?
00:10:03.240 | I just sent a link to the ARC thing.
00:10:05.760 | This is the unsolved O3 tasks.
00:10:08.880 | Let's try one.
00:10:09.840 | [INAUDIBLE]
00:10:13.840 | Sorry, what did you send it?
00:10:15.040 | Zoom chat.
00:10:16.160 | Oh, Zoom chat.
00:10:17.360 | Everyone can.
00:10:18.880 | Oh, oops.
00:10:22.400 | Oh, I've been using Orion, by the way.
00:10:24.440 | Oh, but it includes the answer, I guess, I think.
00:10:28.400 | Zoom in, man.
00:10:30.840 | What?
00:10:32.360 | OK, is this the test?
00:10:35.000 | These are hard.
00:10:36.400 | But we-- OK, so this is--
00:10:41.120 | --24 tasks that O3 was unable to solve,
00:10:43.920 | along with the incorrect guesses it made.
00:10:46.440 | OK, so what are we trying to do here?
00:10:48.000 | We have three examples.
00:10:49.600 | And then O3-- oh, OK.
00:10:51.240 | So let's kill the ground truth here.
00:10:56.160 | So, oh, yeah.
00:10:57.160 | OK, this is the one that everyone's debating, right?
00:11:02.040 | So the blue connections here turn everything
00:11:06.320 | on its path blue.
00:11:07.280 | And everything else remains red.
00:11:11.840 | So O3 managed to draw all the blues.
00:11:14.680 | And drew extra blues.
00:11:20.440 | This is somewhat unfair, because this question asks you to--
00:11:24.760 | like, none of the samples have any--
00:11:31.680 | have sort of this many dots.
00:11:34.080 | They also don't have--
00:11:40.720 | oh, go ahead.
00:11:42.320 | Oh, go ahead.
00:11:44.000 | People are saying that O3's first solution is correct.
00:11:48.520 | Yeah, that's the thing.
00:11:49.480 | They also don't have dots whereby it lines up
00:11:51.360 | horizontally and vertically.
00:11:53.680 | Yeah, yeah, so it's a known bad question.
00:11:56.800 | And also, it wasn't--
00:11:58.280 | there's the dot and the line touching the box.
00:12:01.040 | That part was very--
00:12:02.160 | Yeah, this specific one is the issue,
00:12:05.040 | because ground truth was saying that this one should
00:12:08.240 | have turned blue.
00:12:10.720 | So in ARC-AGI, you get two chances to submit a solution.
00:12:16.040 | And if you don't get either chance,
00:12:17.640 | you're deemed to have failed.
00:12:19.760 | But O3 is pretty close.
00:12:20.960 | I think this is just a bad question,
00:12:22.640 | and we should just throw it out.
00:12:25.240 | Let's see another one.
00:12:26.800 | OK, so we'll just look at it for a while.
00:12:31.280 | I've never seen this.
00:12:33.800 | The sentence below might be useful.
00:12:37.760 | Unable to open the grid.
00:12:38.800 | Oh, yeah, it just--
00:12:39.600 | Oh, you're not supposed to get the sentence.
00:12:44.080 | We're just analyzing.
00:12:45.360 | I wonder why the first example doesn't--
00:12:54.240 | the box in the middle is 3 by 3.
00:12:57.240 | I thought you're supposed to try to shrink it
00:13:00.640 | as much as you can.
00:13:03.160 | Or even I, as a human, don't understand the first--
00:13:05.920 | the one on the left.
00:13:07.440 | [INAUDIBLE]
00:13:09.600 | I thought the pattern is you add an orange ring, then
00:13:13.160 | a white ring.
00:13:14.000 | So orange, white, then orange, white.
00:13:17.040 | And the third one, you skip.
00:13:20.400 | You recurse if you can.
00:13:22.320 | Yeah, you recurse if you can.
00:13:23.520 | So I don't know why they don't recurse for the first one.
00:13:26.440 | Yeah, so--
00:13:28.000 | Yeah, especially if the fourth one,
00:13:30.640 | it did recurse with that white dot.
00:13:32.720 | Yeah.
00:13:33.240 | [INAUDIBLE]
00:13:35.760 | So, OK, I feel like this is an example where
00:13:39.000 | vision would help.
00:13:39.880 | Because--
00:13:40.400 | Yeah.
00:13:41.240 | Oh, look, this is offset.
00:13:42.840 | So over here, you've got--
00:13:43.920 | [INAUDIBLE]
00:13:44.720 | Right, yeah.
00:13:45.480 | This is the closest one.
00:13:50.160 | So you need to basically count the levels of recursion.
00:13:53.360 | And this is super small.
00:13:55.200 | So I don't know if there's a good way to mathematically go--
00:13:59.440 | oh, maybe there's some relationship between--
00:14:02.360 | there's, OK, 1, 2, 3, 4, 5, 6, 7,
00:14:05.680 | 8, 9.
00:14:06.760 | This is like a width of 9.
00:14:10.760 | Oh, hang on.
00:14:12.880 | You also have to zoom in.
00:14:15.840 | You have to output a smaller square.
00:14:18.160 | No, no, no, you don't.
00:14:19.120 | You don't.
00:14:19.640 | It's just that the input is a smaller square.
00:14:22.000 | So the second input was a 6 by 6.
00:14:24.160 | It's not output a smaller square.
00:14:26.880 | I was going left to right.
00:14:27.880 | OK, so this is a width of 6.
00:14:31.360 | 9, 6, 1, 2, maybe 16 minus 4 is 12.
00:14:36.080 | 9, 6, 12.
00:14:37.440 | And then this is 19 minus 2 is 17.
00:14:41.080 | This is an odd number.
00:14:42.120 | So 17 outputs a 17 square.
00:14:49.200 | Yeah.
00:14:49.720 | So I feel like there is a relationship between like--
00:14:52.560 | this is over-engineering.
00:14:59.400 | If you switch it on, it gets even weirder.
00:15:01.600 | But after looking at this post, my real thing
00:15:04.480 | is like, I don't know if AGI is just puzzles and squares.
00:15:08.280 | It shouldn't be.
00:15:10.000 | Yeah, the standard line is that it is necessary but not
00:15:15.240 | sufficient.
00:15:16.120 | Just because you can pass a bunch of IQ tests
00:15:17.960 | doesn't mean you're going to be good at your job.
00:15:20.040 | But it helps.
00:15:21.760 | I think I can be good at my job and fail all these tests.
00:15:25.440 | I agree.
00:15:25.960 | I think you can make a lot of money
00:15:27.400 | or create a lot of good without even being--
00:15:30.760 | without doing well on, I don't know, Dota.
00:15:33.880 | What is the DQN learning?
00:15:37.560 | Atari.
00:15:38.440 | Without doing well on Atari, even.
00:15:39.960 | I think this is the Atari for a reason.
00:15:42.880 | Well, this is reasoning.
00:15:44.520 | This is also just like, the way I think about some of these
00:15:47.840 | is if you throw enough time at it, you'll figure it out.
00:15:50.640 | Like, I think if you take like a seventh grader
00:15:53.280 | and give them like a month or a summer vacation
00:15:56.040 | and you give them like a PS5, if they get it,
00:15:58.000 | they will do it in two months.
00:15:59.760 | Like, you give them enough time, they'll do it.
00:16:03.640 | Are you referring to test time compute now?
00:16:05.840 | No, I'm just talking about like, some of this stuff
00:16:10.440 | is just trial and error.
00:16:12.120 | So you know, chain of thought.
00:16:14.600 | I'll just leave RKGI there.
00:16:16.400 | I don't know.
00:16:16.920 | I mean, we can agree that the misalignment is a clear fail.
00:16:23.920 | But neither of us can tell why the first one didn't recurse.
00:16:28.960 | What do you mean, misalignment?
00:16:32.000 | The ones in the green.
00:16:33.880 | The O3, yeah.
00:16:35.480 | The grid alignments, it's very clear cut.
00:16:38.960 | I think the like, tokenization.
00:16:43.120 | But also, it seems like a lot of models
00:16:46.680 | just don't output anything.
00:16:52.000 | That's what [INAUDIBLE] comment.
00:16:53.720 | A lot of smaller LLMs on Arc, they don't do anything.
00:17:02.240 | The model is unable to output a grid at all.
00:17:04.760 | I think this is similar to like, tokenization and language.
00:17:09.120 | LLM get grid.
00:17:11.080 | Yeah, that sounds like it.
00:17:13.240 | Like, if stuff is off a character,
00:17:15.200 | it's probably just because their tokenizer is bad at this.
00:17:18.040 | Which is like, not a great measure of AGI.
00:17:23.320 | But I guess AGI can fix its own tokenizer.
00:17:28.400 | Ooh, that'll be fun.
00:17:30.520 | OK, do you want to do more?
00:17:32.440 | Or should we move on?
00:17:34.640 | Move on.
00:17:35.160 | So then we move to open models.
00:17:42.880 | I had a list here of the GEMMA model.
00:17:46.960 | Ooh, I should probably put in--
00:17:48.360 | I guess if we want to do more, if you scroll down,
00:17:53.120 | there's one fun one that's very quick that people understand.
00:17:56.600 | It's just what line is on top.
00:17:59.280 | Go down, go down, go down, go down.
00:18:01.400 | Oh my God, there's so many examples.
00:18:04.600 | Yeah, so I actually did do an Arc AGI, like human,
00:18:08.040 | try and fill it up in Solaris.
00:18:09.960 | The biggest issue I found was after the second puzzle
00:18:14.000 | is you get fatigue.
00:18:15.680 | Because some of the puzzles are huge.
00:18:17.840 | And trying to fill it up is a pain in the ass.
00:18:20.960 | Right, right.
00:18:23.440 | Yeah, I don't know which one you're talking about, Vibhu.
00:18:27.760 | This one?
00:18:28.260 | Yeah, I agree.
00:18:34.240 | I mean, it's one of those things where AI is just
00:18:36.280 | very good at scaling up attention, and we don't.
00:18:40.120 | We're not.
00:18:41.920 | Which is one form of AGI, ASI.
00:18:46.200 | Vibhu, I don't know which one you're talking about.
00:18:48.320 | My bad, I found a new link.
00:18:49.640 | New link.
00:18:50.640 | Oh, new link.
00:18:52.040 | This one, I'm surprised that it failed.
00:18:53.720 | This one?
00:19:04.720 | Mm-hmm.
00:19:07.480 | So second to last one, at the very bottom.
00:19:09.320 | OK, second to last, meaning here.
00:19:17.000 | Up, up one, up.
00:19:19.640 | This itself is an Arc AGI task.
00:19:21.840 | Navigate the page based on instructions.
00:19:26.280 | OK, I like this.
00:19:27.560 | I like this.
00:19:28.160 | So basically, what is--
00:19:30.040 | What's on top?
00:19:31.160 | What's on top?
00:19:31.720 | What's the last thing added?
00:19:34.400 | Yeah, so pink is on top.
00:19:37.160 | Blue's on top.
00:19:38.200 | Pink's on top.
00:19:39.320 | Blue's on top.
00:19:39.960 | Green's on top.
00:19:42.080 | It's pretty cool.
00:19:44.040 | Another vision.
00:19:44.880 | Yeah.
00:19:47.960 | I feel like vision will won this as well.
00:19:51.160 | I mean, it's not just vision, right?
00:19:52.280 | You also have to know that which one is obscured
00:19:54.760 | and which one is not.
00:19:55.840 | Yeah, vision is part of it, but it's not just.
00:19:59.520 | Yeah.
00:20:00.040 | But count the most number of blocks.
00:20:02.000 | Well, the interesting thing here is the larger the grid gets,
00:20:06.120 | like on the left, you see all the grids are like 10 by 6.
00:20:09.560 | As you add more grids, the 01 mini starts to fail.
00:20:13.920 | You just can't do more grids.
00:20:15.800 | Even though it's the same number of interior lines,
00:20:17.920 | it struggles with grids.
00:20:18.920 | So yeah, maybe vision would help, but it's a skill issue.
00:20:27.800 | It's a memory attention issue as your grid gets bigger.
00:20:32.640 | I don't know about that, because the attention
00:20:34.880 | will condense this down into all the same hidden dimension.
00:20:42.560 | So basically, all this gets pre-processed to the same size.
00:20:47.920 | The grids didn't make an issue here.
00:20:50.880 | And the grids [INAUDIBLE]
00:20:54.320 | I mean, it's only 24 [INAUDIBLE]
00:21:00.480 | Interesting.
00:21:01.800 | Cool.
00:21:03.560 | All right, move on.
00:21:04.520 | Learned today on Christmas of 2024
00:21:10.480 | is that we are not AGI, guys.
00:21:12.600 | We're not AGI.
00:21:14.960 | We don't deserve a million.
00:21:16.160 | OK, I was going to move on to open models.
00:21:22.400 | I feel like these are the ones that are commonly named.
00:21:28.840 | I can sort of justify all these.
00:21:30.200 | These are the picks from Luca.
00:21:31.560 | Obviously, this does not include the sort
00:21:38.600 | of state-space models and RWA-KVs of the world,
00:21:40.920 | which are also open.
00:21:42.160 | But these are sort of the traditional open models.
00:21:44.960 | Are there any that--
00:21:46.480 | I guess the big one has not been mentioned here,
00:21:49.720 | which is Metaslama.
00:21:50.640 | Are there any that we have missed?
00:21:55.480 | Oh, we know I had Mistral.
00:21:57.440 | Falcon dropped another one too.
00:21:59.360 | Oh, I'm sorry.
00:22:02.240 | Yeah, I think this is--
00:22:04.400 | I mean, it's kind of nice to have everything,
00:22:06.360 | like the whole year in one screen.
00:22:09.760 | I definitely would find some utility from that.
00:22:15.280 | Just be like, oh, yeah, we didn't miss anything.
00:22:17.560 | This is everything.
00:22:20.880 | I think it needs more love.
00:22:26.400 | I mean, I think these guys release so much.
00:22:29.960 | There's like, you see Coder and all that.
00:22:35.400 | CKMOU.
00:22:36.560 | Do you see CKMOU this year or last year?
00:22:40.120 | I think it was early this year.
00:22:42.520 | So it's just hard to--
00:22:44.280 | I think it's just like, you have to tell people
00:22:47.120 | what the variable-based models they are so that they can then
00:22:50.880 | go and fine-tune if they want to.
00:22:52.600 | But otherwise, there's not much to say here
00:22:54.800 | apart from the options of open models that are out there.
00:23:01.080 | I think you can add some of the small sub-1B models.
00:23:05.200 | There's also the Falcon 3 series dropped a week ago.
00:23:09.040 | I don't know how they are, but they put Falcon 3, 1B, 3B, 7B,
00:23:13.280 | 10B, the MAMA version.
00:23:14.960 | I just don't know how they are.
00:23:16.840 | I'm going to put them on a misc.
00:23:19.640 | They made the list.
00:23:22.400 | Yeah, I mean, the consensus is that Falcon was not
00:23:24.760 | well-trained, apparently.
00:23:27.720 | Oh, interesting, because they put out huge data sets,
00:23:30.680 | but they don't train--
00:23:32.200 | So Guilherme, the creator--
00:23:35.320 | I talked to the same guy at NeurIPS 2023 and NeurIPS 2024.
00:23:42.160 | The guy who did the Falcon, the FineWeb--
00:23:46.920 | where was it?
00:23:49.400 | RefinedWeb?
00:23:52.040 | Actually, was it?
00:23:54.040 | I don't know where I put it.
00:23:55.200 | I actually talked to him last year at this conference.
00:24:04.200 | It looks like I didn't actually--
00:24:05.560 | I didn't even publish the interview.
00:24:08.960 | But I talked to him again this year,
00:24:11.000 | and he's actually the same guy behind FineWeb.
00:24:14.560 | So this fella, he just basically left TII UAE
00:24:20.400 | and joined Hugging Face.
00:24:22.200 | So it's the same guy.
00:24:24.400 | There's also Falcon RefinedWeb from TII UAE.
00:24:29.040 | Right, it's the same guy.
00:24:30.280 | That's what I'm saying.
00:24:33.440 | If you look at RefinedWeb, the lead author from this guy
00:24:36.760 | last year is the same guy.
00:24:39.880 | He moved-- so I mean, that's the intellectual lineage, I guess.
00:24:49.680 | He was actually at Latent Space Live.
00:24:51.720 | He came by to just say hi.
00:24:55.360 | Cool.
00:24:55.880 | Any other open models?
00:24:57.560 | I don't know.
00:24:58.320 | You said the one that's the 1B, 3B models.
00:25:01.480 | What are you specifically talking about?
00:25:03.240 | Phi models, even though they're--
00:25:05.560 | The Phi models, yeah.
00:25:06.720 | I should probably mention Phi here.
00:25:14.360 | I don't know.
00:25:17.680 | I feel like Phi constantly has these allegations of training
00:25:22.400 | on tests, and I don't know how real that is.
00:25:26.160 | I know a couple of people who--
00:25:30.200 | It's somewhat undetermined.
00:25:31.520 | They keep saying they do less and less.
00:25:33.360 | And realistically, they have a whole section
00:25:35.480 | in the papers on how they don't train on tests
00:25:38.800 | and how they filter out so we don't train on benchmarks.
00:25:43.120 | I don't know why they just put out the list for no reason.
00:25:45.680 | But Phi-4 is not really out for testing yet, right?
00:25:48.960 | It's not open yet.
00:25:50.200 | It's meant to be open later.
00:25:51.440 | I don't know if anyone's tested it.
00:25:52.840 | I guess you could add the Apple stuff here.
00:25:56.560 | Yeah.
00:25:57.720 | So I--
00:25:59.040 | Never mind, take it back.
00:26:01.440 | Yeah.
00:26:02.640 | Yeah, I just put it here.
00:26:04.720 | I guess I also mentioned Gemini Nano, which is in here.
00:26:11.440 | OK, I found that you can just go to the LocalLlama subreddit
00:26:16.760 | and you go to just type in best model.
00:26:19.480 | Every few months, they'll do some kind of survey.
00:26:22.600 | And you just sort by new.
00:26:25.320 | They'll usually have some kind of--
00:26:30.280 | here-- some kind of informal survey
00:26:32.800 | of what people are saying in terms of best models,
00:26:35.720 | including the best fine-tunes and whatever.
00:26:40.120 | So that's pretty useful.
00:26:41.800 | I don't think we've missed any, basically.
00:26:45.480 | Cool.
00:26:48.200 | I would like to point out that the rare, random EVA
00:26:50.520 | Qwen fine-tune there, that's actually
00:26:53.280 | a role-play fine-tune model.
00:26:55.200 | Oh, of course.
00:26:55.800 | This is LocalLlama.
00:26:57.280 | LocalLlama.
00:26:57.920 | No, but every now and then, some of the role-play fine-tune
00:27:01.440 | models, people do like them in human evaluations
00:27:04.880 | when it comes to tasks.
00:27:05.800 | So it's always weird to see it happen.
00:27:08.160 | What I'm not seeing is merges.
00:27:10.440 | And yeah, where are the merges?
00:27:17.960 | None of these are merges, right?
00:27:20.760 | I can clearly double-check.
00:27:22.160 | Merging is not all you need.
00:27:24.600 | Yeah, I feel like there's a lot of noise about merging,
00:27:26.840 | but nobody actually ends up using them.
00:27:29.160 | They just think--
00:27:30.680 | Ramon has an interesting comment, by the way,
00:27:32.680 | from Zoom chat.
00:27:34.480 | Some of the Falcon and fine-tuning creators,
00:27:39.040 | the first series, they started a company, AdaptiveML.
00:27:41.960 | They do on-prem RL.
00:27:46.100 | They raised a 20 mil Series A, eh?
00:27:51.880 | Nice.
00:28:01.080 | Yes, I think I saw this round come out.
00:28:06.120 | We might try to talk to them next year.
00:28:08.760 | I know basically everyone's going to focus in on RL for LLMs,
00:28:14.720 | and that'll be a big theme for next year as well.
00:28:17.440 | I'm trying to collect all these themes
00:28:19.080 | so that I'll make my life easier by fine-tuning RL for LLMs.
00:28:27.720 | OK, whatever.
00:28:29.560 | OK, so I will keep going in the interest of time.
00:28:33.320 | Synthetic data.
00:28:34.440 | I feel like we are all relatively familiar
00:28:37.640 | with many of these.
00:28:39.400 | I put five here as well for synthetic data.
00:28:42.720 | We did the ORCA 3 agent and struct paper this year.
00:28:47.240 | I did that session.
00:28:49.880 | I feel like BillionPersona was a source of a lot of noise,
00:28:56.040 | but ultimately very little actual impact,
00:29:00.240 | whereas the people who worked on real data sets,
00:29:02.640 | like FineWeb and DCLM, have had more impacts as well.
00:29:07.960 | You were trying to think--
00:29:09.240 | WizardLM.
00:29:10.560 | WizardLM series.
00:29:12.200 | Oh, that's agent.
00:29:14.800 | Is that also synthetic data?
00:29:16.000 | I don't know, man.
00:29:20.760 | So I will-- it's all Microsoft, right?
00:29:23.880 | It's all these MSR China people.
00:29:29.480 | Cool.
00:29:30.720 | The one I didn't know about was Cohere,
00:29:34.400 | which Luba mentioned in her talk.
00:29:38.080 | And that was net new to me.
00:29:40.440 | Basically, there is a chart in here.
00:29:44.160 | Cohere is always pushing, or at least Sarah Hooker
00:29:50.280 | is always pushing that there's benefits
00:29:54.320 | from learning across languages.
00:29:58.160 | And I think she's basically just trying
00:30:02.800 | to emphasize that if you have an ensemble of languages
00:30:06.480 | and your data set crosses more languages,
00:30:08.240 | you have knowledge that you don't have in one language.
00:30:11.080 | So English is not all you need, basically.
00:30:15.320 | So don't just use a single teacher; use whatever
00:30:18.520 | the routing system is for your multiple languages.
00:30:25.640 | I don't know if there's any other synthetic data stuff
00:30:28.160 | that we should pick up.
00:30:29.120 | But basically, this is completely--
00:30:31.000 | I introduced this on a whim.
00:30:33.000 | I felt like synthetic data was a big theme this year.
00:30:36.640 | And I think this really should be
00:30:40.400 | data sets that happens to be just all synthetic data.
00:30:44.280 | There's also a lot of talk about LLM-as-judge.
00:30:47.240 | But I don't know if there's a specific paper to cover this.
00:30:51.560 | When I looked at my own records, the best paper, quote unquote,
00:30:56.000 | was Hamel Husain's post on LLM-as-judge.
00:31:01.760 | So maybe that would be a quote, unquote,
00:31:04.080 | paper for creating LLM-as-judge.
00:31:08.440 | I don't know if anyone has read anything on synthetic stuff.
00:31:16.080 | But yeah, I'll just put it there.
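
Since LLM-as-judge comes up here without a single canonical paper, here is a minimal sketch of the pattern in the spirit of Hamel Husain's write-up; the model name and the pass/fail rubric are illustrative assumptions, and in practice the judge should be aligned against human labels first:

```python
# Minimal LLM-as-judge sketch. The model name and rubric are illustrative;
# in practice you'd iterate on the criteria and check the judge against human labels.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: PASS if the answer is correct and grounded, FAIL otherwise."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

# judge("What is 2+2?", "4")  -> True (usually)
```
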
00:31:20.120 | OK, small models, MobileLLM.
00:31:22.400 | I don't know if we covered this in paper club.
00:31:24.360 | But I did in AI News.
00:31:28.040 | And I think this is effectively the genesis for SmolLM,
00:31:35.000 | which is kind of Hugging Face's implementation of that.
00:31:38.080 | We did cover in paper club.
00:31:42.120 | I covered this paper.
00:31:43.800 | Cool, awesome.
00:31:45.400 | I may not have been there.
00:31:48.400 | Did you like it?
00:31:50.320 | Do you still have a good impression?
00:31:52.320 | No, I really like it because it's particular
00:31:54.600 | about the layer repeat part.
00:31:59.440 | I think it's relevant, again, because people are speculating
00:32:02.560 | that O1 does layer looping.
00:32:07.640 | And mobile LLM kind of does that.
00:32:11.040 | But I'm not sure if it's dynamic or not.
00:32:13.320 | I think it could be static, whereas O1
00:32:16.640 | would have some layer to decide whether or not
00:32:19.440 | to continue looping.
00:32:20.680 | Like it's effectively kind of a Turing complete architecture.
00:32:25.280 | I think O1 is more similar to--
00:32:29.520 | I can't remember the Google's paper name.
00:32:31.760 | But there's a Google paper where it exits early
00:32:34.600 | across the layers.
00:32:36.320 | Oh, mixture of depths.
00:32:38.480 | Yeah, mixture of depths, yeah.
00:32:40.280 | Which I think I did cover briefly as well as
00:32:44.160 | alternative to MobileLLM.
00:32:47.000 | Oh, we'll mention the mixture of depths here.
00:32:48.920 | Does it-- do they have looping?
00:32:56.000 | I don't think it's defined as a loop,
00:33:00.920 | but it's more of like a fixed depth and then exit early.
00:33:05.840 | Yeah.
00:33:07.800 | OK, I feel like to loop, to do inference time
00:33:14.640 | compute for multiple minutes and potentially hours,
00:33:18.640 | you need to loop instead of just having different depths.
00:33:23.520 | Yeah, it's also very hard to comment on this
00:33:25.720 | because there is no known open mixture-of-depths model as well.
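
To make the static-versus-dynamic distinction concrete, a toy sketch (an illustration, not any published architecture): static weight sharing reuses the same block a fixed number of times, while a dynamic loop adds a learned halting head that decides when to stop, which is the kind of mechanism being speculated about for O1; mixture-of-depths instead skips or exits layers rather than looping.

```python
# Toy illustration of static layer reuse vs. a dynamic loop with a halting head.
# Speculative sketch code, not any published model's architecture.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, dim: int = 64, max_loops: int = 8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.halt = nn.Linear(dim, 1)     # learned "stop looping?" head
        self.max_loops = max_loops

    def forward(self, x: torch.Tensor, dynamic: bool = True) -> torch.Tensor:
        for _ in range(self.max_loops):
            x = self.block(x)             # same weights reused every iteration
            if dynamic:
                # Halting probability from the mean token representation.
                p_halt = torch.sigmoid(self.halt(x.mean(dim=1)))
                if p_halt.mean() > 0.5:   # crude stopping rule for the sketch
                    break
        return x

layer = LoopedBlock()
x = torch.randn(2, 10, 64)
y_static = layer(x, dynamic=False)   # MobileLLM-style fixed repetition
y_dynamic = layer(x, dynamic=True)   # loop until the halting head says stop
```
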
00:33:29.280 | Cool, well, that's all we got.
00:33:35.520 | I think Apple Intelligence may be the biggest on-device model
00:33:39.200 | deployment, apart from RWKV, because I have it on my phone
00:33:44.440 | and I benefit from it every day.
00:33:47.360 | Still quite cool.
00:33:50.400 | I feel like people-- it's very trendy to shit
00:33:53.360 | on Apple Intelligence now.
00:33:55.760 | But it is still underrated that they rolled out
00:33:58.560 | transformers across the entire install base of iPhones,
00:34:02.960 | which is pretty cool.
00:34:06.680 | Gemini Nano, I thought, would be this year,
00:34:09.000 | still under a feature flag.
00:34:12.240 | So do people know what I'm talking
00:34:14.240 | about when I say Gemini Nano?
00:34:15.680 | So there's like a browser API.
00:34:30.280 | Built-in AI, I think it's this one.
00:34:32.920 | Yeah, so these are the APIs.
00:34:39.960 | And you can-- yeah, this prompt API for web,
00:34:46.680 | I don't know where it is here.
00:34:49.280 | This one?
00:34:49.800 | No, this is an extension.
00:34:55.400 | Where is-- I don't know where it is.
00:34:59.840 | [AUDIO OUT]
00:35:04.720 | Yeah, this is Flash.
00:35:06.600 | It will be in Chrome, where you can do browser.ai.generate.
00:35:15.120 | And that's just straight access, base level access to Gemini Nano.
00:35:21.960 | Oh, here it is.
00:35:22.680 | Yeah, yeah.
00:35:27.680 | So this will be built into the browser, no download.
00:35:32.320 | At some point, this will happen.
00:35:34.520 | And I think there were some demos this year that
00:35:40.600 | showed that it was very fast.
00:35:42.680 | Obviously, it's very dumb as well, but if you--
00:35:48.280 | yeah, I can't find it right now.
00:35:49.840 | But I mean, if you just sort of wait a bit,
00:35:53.160 | then you know it's coming.
00:35:55.080 | Maybe I'll just put it in here.
00:35:58.120 | OK, cool.
00:35:59.960 | Hymba, was this a big deal?
00:36:01.800 | I don't know.
00:36:03.080 | Luca picked it.
00:36:03.760 | I feel like Eugene might know.
00:36:09.760 | I think it's too early to even tell whether it's a big deal.
00:36:12.160 | But it has potential.
00:36:13.480 | I think that's why he picked it.
00:36:17.200 | Cool, all right.
00:36:18.240 | I'll keep moving on.
00:36:19.040 | I feel like I'm kind of running out of steam.
00:36:22.240 | Before going to post-transformers, there's also big models.
00:36:26.160 | I don't know whether you want to touch
00:36:27.720 | on DeepSeek dropping their 600B model.
00:36:30.520 | Yeah, here, right?
00:36:34.600 | Oh, OK, I forgot that there was a 600B drop.
00:36:37.880 | Yeah.
00:36:39.560 | OK, post-transformers.
00:36:41.400 | Is there any reflection on big, big model drops,
00:36:44.520 | like the 405B, the large failures?
00:36:49.440 | Blunt?
00:36:50.640 | Large failures?
00:36:52.240 | Especially--
00:36:53.640 | What is large failures?
00:36:55.280 | If it was me, I might call them large failures.
00:36:57.480 | If it wasn't me, I might call them
00:36:58.920 | like distillation-like model, so you can distill down.
00:37:04.240 | And then it becomes very weird when now, like, Llama 3 70B,
00:37:07.680 | or whatever the big one is, is as good as the 405B.
00:37:11.280 | But big failures.
00:37:15.200 | In that lens also, you can also include the Nvidia reward
00:37:18.280 | models.
00:37:20.120 | I really wish this was like live editable,
00:37:22.200 | so I can add in parentheses like burn VC money, but it's OK.
00:37:25.840 | Is this what you mean by reward models, Eugene?
00:37:32.080 | I don't know if this is the--
00:37:33.640 | This is the one.
00:37:34.400 | Let me double check.
00:37:35.280 | There's a bunch of non-branded things.
00:37:41.440 | I think Grok 1 was kind of considered a failure.
00:37:45.520 | Everyone's very excited about the weights of Grok,
00:37:48.640 | but it was too big to deploy.
00:37:51.840 | When was that?
00:37:53.080 | That was March.
00:37:54.400 | Yeah.
00:37:55.640 | So this is also a very, very big model.
00:37:58.440 | This is 314b.
00:38:02.800 | Yeah, I mean, yes, they're too big,
00:38:04.760 | but they're still teacher models.
00:38:06.120 | And I think that's OK.
00:38:07.320 | I don't see an issue with this at all.
00:38:10.840 | I don't consider it a failure.
00:38:12.240 | I don't know.
00:38:14.200 | OK, so let me know if anyone can think of any other big models
00:38:18.960 | that were released this year.
00:38:22.040 | I just thought of one.
00:38:23.440 | There was a Chinese model.
00:38:26.840 | I can't remember which one that was,
00:38:30.800 | but I will look it up.
00:38:32.880 | Some Chinese model.
00:38:34.680 | OK, there was state space stuff.
00:38:37.080 | I feel like the only thing that really made an impact this year
00:38:40.200 | was Jamba.
00:38:41.480 | I think we covered this as well in Paper Club.
00:38:44.720 | The rest was-- and obviously, there's
00:38:46.840 | Mamba 2 was also this year.
00:38:49.640 | And this is one of the best papers at NeurIPS.
00:38:53.840 | Yeah, not sure what else to cover.
00:38:56.440 | I think Sana kind of flew under my radar.
00:38:58.880 | But apparently, they have extended Mamba models
00:39:01.480 | to diffusion, which is kind of cool.
00:39:05.400 | And it works.
00:39:07.200 | Great.
00:39:08.800 | In my sessions with the image people,
00:39:12.360 | I had a session with some of the people working
00:39:17.360 | on Imagen and Veo at Google.
00:39:20.360 | They were very hyped up about autoregressive image
00:39:23.920 | generation.
00:39:24.960 | So instead of diffusion, they are sort
00:39:26.880 | of getting rid of the diffusion part
00:39:28.840 | and just straight autoregression for images.
00:39:32.120 | And I thought that was notable, but I didn't have the background
00:39:34.800 | to understand it.
00:39:35.760 | They were just like, yeah, next year
00:39:37.280 | is the year of autoregressive images.
00:39:38.960 | So a bit of a shift in my mind, because I thought
00:39:43.080 | that people were more-- like last year, this time last year,
00:39:45.440 | people were more interested in text diffusion,
00:39:47.360 | so diffusion going into text.
00:39:48.680 | Now, they're talking about autoregression
00:39:50.640 | going from text into images.
00:39:53.800 | So it's kind of the other way around.
00:39:56.320 | Actually, I think Sana, LoLCATs, and QRWKV
00:39:59.960 | might be in its own category, where it's really
00:40:02.080 | more about taking existing, I guess, attention-based models
00:40:06.040 | and converting them over.
00:40:07.880 | I think that's the team there.
00:40:09.760 | Models.
00:40:11.560 | Thank you for the point there.
00:40:14.880 | I know I said it, but yeah.
00:40:18.880 | Fine.
00:40:20.640 | Aren't Franken models like different model merges?
00:40:24.800 | Like take a llama, a mistral, you merge?
00:40:28.000 | No, they're not.
00:40:28.720 | They're the Franken models.
00:40:31.000 | They share a Franken merge.
00:40:33.400 | OK, this is more conversion.
00:40:35.720 | But whatever, it's still putting Frankenstein, yeah.
00:40:41.840 | Quick question on QRWKV.
00:40:43.720 | Is there a retraining phase when you replace the layer?
00:40:47.040 | Yes, there is.
00:40:48.360 | And is it retraining, or is it continual training?
00:40:52.920 | No, it's just 500 million tokens retraining the attention layers.
00:40:58.240 | And then another 500 million just on all the layers, yeah.
00:41:02.760 | So is it initialized from scratch?
00:41:06.320 | Yeah, the attention layer is initialized from scratch.
00:41:10.800 | That's crazy.
00:41:11.760 | You can train on 15 trillion tokens of attention
00:41:14.520 | to get good, or you can reinitialize and just
00:41:17.720 | train on 500 million.
00:41:19.840 | That's a scam.
00:41:20.560 | Exactly.
00:41:21.800 | It's so much less tokens.
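
A hedged sketch of the two-stage conversion recipe described here; the module check, the replacement factory, and the token budgets are placeholders, not the actual QRWKV training code:

```python
# Sketch of the conversion recipe described above, not the actual QRWKV code:
# swap attention blocks for freshly initialized replacements, train only those
# on a small token budget, then briefly train the whole model.
import torch.nn as nn

def convert_attention(model: nn.Module, make_replacement, new_blocks: list) -> None:
    """Replace every attention module with a fresh block (e.g. an RWKV-style time mix)."""
    for name, child in model.named_children():
        if isinstance(child, nn.MultiheadAttention):   # stand-in check for "attention layer"
            replacement = make_replacement(child.embed_dim)
            setattr(model, name, replacement)
            new_blocks.append(replacement)
        else:
            convert_attention(child, make_replacement, new_blocks)

def stage_one(model: nn.Module, new_blocks: list) -> None:
    # Freeze the pretrained weights; train only the reinitialized blocks (~500M tokens).
    for p in model.parameters():
        p.requires_grad = False
    for block in new_blocks:
        for p in block.parameters():
            p.requires_grad = True

def stage_two(model: nn.Module) -> None:
    # Unfreeze everything for a second short pass over all layers (~500M more tokens).
    for p in model.parameters():
        p.requires_grad = True
```
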
00:41:25.120 | I mean, it's the same intuition as, like, LLaVA, right?
00:41:29.280 | Like in my mind, very similar.
00:41:31.680 | Like it's effectively--
00:41:34.360 | LLaVA is an adapter where you merge in a new model
00:41:40.400 | to match a really strong pre-trained model, right?
00:41:43.280 | Like you have a pre-trained language model.
00:41:45.400 | You have a contrastive loss to merge.
00:41:48.400 | Like you use the backbone of a strong model,
00:41:50.600 | and then you add-- like you train a new one to match it.
00:41:53.240 | But reinitializing weights from scratch is--
00:41:56.920 | you basically restart.
00:41:58.240 | Like you start from stochastic noise, and you retrain.
00:42:01.560 | Like you have no intuition to go off of.
00:42:04.000 | But then the-- yeah, LLaVA is like you're still just--
00:42:07.640 | you're just mimicking a really strong model.
00:42:09.720 | Mimicking, OK.
00:42:15.440 | No opinion on that one.
00:42:17.680 | Yeah, I think I vaguely understand the difference.
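
For contrast, a minimal sketch of the LLaVA-style adapter idea referenced here: a small trainable projector maps a frozen vision encoder's patch features into the language model's embedding space and only the projector is trained; the dimensions are illustrative, not any particular model's.

```python
# Minimal sketch of the LLaVA-style adapter idea: only the small projector is
# trained; the vision encoder and language model backbones stay frozen.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: [batch, num_patches, vision_dim] from a frozen vision encoder
        return self.proj(patch_feats)      # [batch, num_patches, lm_dim]

projector = VisionProjector()
image_feats = torch.randn(1, 576, 1024)    # e.g. frozen CLIP-style patch features
text_embeds = torch.randn(1, 32, 4096)     # embedded prompt tokens
lm_input = torch.cat([projector(image_feats), text_embeds], dim=1)  # fed to the frozen LM
```
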
00:42:23.080 | OK, so the other big one is LoLCATs,
00:42:26.120 | which is basically the same thing as what
00:42:28.640 | QRWKV was doing, but it extends to Mamba as well.
00:42:33.840 | And they were using--
00:42:35.840 | they were essentially using LoRAs instead,
00:42:39.720 | which uses fewer resources, but performs worse, arguably,
00:42:45.200 | by my research.
00:42:48.160 | It's like everything is just so, so, so lacking--
00:42:52.560 | lacking-- what was it?
00:42:56.160 | Abelation, I can't even say which method is better or worse.
00:42:59.960 | Right, so you want one survey paper going
00:43:02.640 | comparing all these things.
00:43:04.640 | No, I think we just need to give all these more time.
00:43:09.400 | No, it's still very cool work.
00:43:11.480 | Love the name.
00:43:13.000 | xLSTM is the one that we did not cover.
00:43:16.880 | This was a best paper at NeurIPS.
00:43:20.800 | And I talked to them, so I'll release the interview
00:43:23.320 | at some point.
00:43:24.880 | I don't know if people have thoughts about this,
00:43:26.920 | but some of these gating and stuff
00:43:33.560 | is kind of interesting in the sense
00:43:35.560 | that they seem very clear about the ways in which LSTM did not
00:43:40.720 | scale or failed.
00:43:42.280 | And they seem very intense on fixing that.
00:43:45.200 | It's kind of cool.
00:43:46.840 | Yeah, cool.
00:43:48.360 | I'll move on to agents, which is the thing that we published
00:43:50.880 | today.
00:43:52.040 | We had eight things in agents.
00:43:55.240 | And I thought his talk was really--
00:43:57.720 | Graham Newbig's talk was really good.
00:43:59.560 | I don't know if there was a breakthrough.
00:44:01.440 | So last year, the obvious winner for last year was Voyager.
00:44:05.960 | The Voyager paper was one of the must-reads.
00:44:08.720 | I would say also the paper that I
00:44:11.920 | love to hate, the Smallville Generative Agents.
00:44:19.040 | I don't know if there's any other papers that really
00:44:21.280 | broke through this year.
00:44:23.720 | Why do you like to hate Smallville?
00:44:25.240 | It simulates chats for entertainment.
00:44:31.160 | And then in order to--
00:44:32.640 | the reason it sells itself so well
00:44:34.640 | is primarily because of that image that is there.
00:44:40.080 | It taps into this 8-bit nostalgia effect
00:44:46.600 | that people really like.
00:44:49.240 | You know what I'm talking about.
00:44:50.760 | Where's the paper?
00:44:51.560 | Yeah, I know the paper.
00:44:53.200 | So yeah, you're not happy with people over-hyping it.
00:44:58.920 | It caused this generation of papers
00:45:01.280 | that all look like this here.
00:45:05.320 | Oh, all the 8-bit images in papers for the entire way.
00:45:10.280 | It's no fucking thing to do with how the agents work.
00:45:13.480 | It's just people think it looks cute.
00:45:16.160 | And therefore, it does well on social media.
00:45:19.720 | It's just really bad science.
00:45:22.280 | I don't know how to say it.
00:45:28.880 | 8-bit is all you need.
00:45:30.680 | It's as bad as all you need.
00:45:34.040 | So it just gigafries people's brains.
00:45:37.160 | And this random project gets 26,000 stars because of this.
00:45:45.480 | Because of this.
00:45:46.680 | And I guarantee you, nobody's using it.
00:45:49.080 | It's very annoying.
00:45:49.880 | It's the source of all the noise.
00:45:51.360 | OK, makes sense.
00:45:55.440 | Yeah, hot takes.
00:46:01.560 | OK, cool.
00:46:03.360 | I think I'm done.
00:46:04.240 | I know that's the year in review for papers.
00:46:07.040 | Maybe there's other papers I haven't picked up,
00:46:09.000 | but I wanted to just share.
00:46:10.320 | Eugene, do you want to do it?
00:46:17.160 | Thank you, swyx.
00:46:20.200 | Sure, I can do that.
00:46:21.960 | I don't think I--
00:46:22.960 | Why this one sort of lodged in your brain
00:46:25.800 | as an interesting long context benchmark?
00:46:29.640 | Well, the thing is, I'm thinking a lot about long context
00:46:32.880 | recently.
00:46:33.560 | And I mean, from early on, I think neither a haystack.
00:46:37.680 | I mean, you're asking me about long context.
00:46:39.480 | I was like, come on, that's not really long context.
00:46:41.640 | I mean, it's really just extractive.
00:46:43.440 | So imagine if you were to summarize a book,
00:46:45.240 | you'd ask questions of a book.
00:46:46.640 | It's really a lot of reasoning.
00:46:48.800 | And I think this long context paper,
00:46:50.480 | they actually shared something.
00:46:52.200 | Oh, man, crap.
00:46:53.720 | I need to-- sorry, guys.
00:46:58.840 | This is a brand new laptop.
00:47:04.040 | So I'll mention BABILong.
00:47:05.680 | I'll buy some time for you.
00:47:10.080 | There's always a lot of interest in long context models.
00:47:14.160 | And I find that--
00:47:16.720 | so BABILong was sort of the long context winner of NeurIPS.
00:47:21.800 | This is the one.
00:47:25.680 | Eugene is not going to cover this one.
00:47:27.280 | He's going to cover something else that he found.
00:47:29.520 | But I think this--
00:47:30.520 | and then RULER is the one that we covered on Latent Space.
00:47:33.000 | This guy, where they train a million context LLM.
00:47:41.480 | Oh, maybe there should be a category for long context.
00:47:43.800 | I wonder.
00:47:44.840 | That could be interesting.
00:47:45.920 | OK, Eugene, you're back.
00:47:48.040 | OK, sweet.
00:47:49.200 | So the reason why--
00:47:50.560 | so now I want to share with you about this paper, which
00:47:53.960 | I think is pretty cool.
00:47:56.240 | Wait, are you actually able to see it now?
00:47:58.200 | Yes, you can see it.
00:47:59.880 | OK, perfect.
00:48:00.680 | Are you seeing my Zotero?
00:48:03.760 | I don't know why.
00:48:04.480 | OK, I'm assuming you're seeing my Zotero.
00:48:06.360 | So this is LongBench v2, Towards Deeper Understanding
00:48:09.400 | and Reasoning on Realistic Long-Context Multitasks.
00:48:11.080 | I actually have a comparison of LongBench 1,
00:48:13.320 | which I will go over at the end of everything.
00:48:16.120 | But I think it's really interesting to see
00:48:18.040 | all the different iterations of it.
00:48:19.480 | And you can see actually what they found
00:48:21.440 | didn't work in the previous iteration
00:48:23.600 | and how they try to improve on it.
00:48:25.680 | Long story short, LongBench 1--
00:48:28.440 | no, sorry, LongBench 2, how they try to create it
00:48:31.480 | is they recruit it.
00:48:33.680 | I'm going to go through this pretty quickly.
00:48:36.640 | So the task is--
00:48:38.840 | it's extremely long context.
00:48:40.960 | The context is about 8,000 to 2 million words.
00:48:44.760 | And they have several tasks.
00:48:46.640 | The simple one is really just a single document.
00:48:48.720 | Single document, they do Q&A on it.
00:48:51.080 | And then there's also multi-document Q&A.
00:48:53.600 | And what is new in LongBench 2 that was not in LongBench 1
00:48:57.360 | was Long Context History Understanding.
00:48:59.720 | So history understanding is two kinds.
00:49:01.840 | There's chat between multiple LLM agents.
00:49:05.200 | And then there's also chat between a human
00:49:07.080 | and an assistant.
00:49:07.880 | So essentially, seeing the LLM can actually
00:49:10.680 | reason over these long chats that
00:49:12.560 | are made up of small text.
00:49:14.720 | And another thing that's new is Long Structured Data
00:49:17.120 | Understanding.
00:49:18.120 | So there's tables and there's knowledge graphs.
00:49:21.200 | So how they collected the data--
00:49:22.880 | so previously in LongBench 1, a lot of it
00:49:25.120 | was synthetic-generated data.
00:49:27.520 | In this case, they actually got undergrads to create the data.
00:49:31.720 | And they incentivized them pretty well.
00:49:35.400 | They paid them quite a fair amount of money
00:49:37.200 | to generate all these long benchmarks.
00:49:39.720 | And they also had a lot of people to review it.
00:49:42.080 | So after they collect the data, one thing
00:49:44.720 | that's really interesting is that they
00:49:46.880 | ask annotators to annotate it.
00:49:49.120 | And they actually review it, which
00:49:51.160 | is they run the question through three LLMs,
00:49:54.840 | like fairly small, fast LLMs.
00:49:57.520 | And they would only keep--
00:49:59.800 | so over here, here's the image.
00:50:02.960 | They would only keep the question
00:50:05.640 | if at least one out of the three LLMs get it wrong.
00:50:09.680 | So essentially, these questions are somewhat fairly hard.
00:50:13.280 | And I won't go through all the different criteria
00:50:17.120 | they had for keeping the questions.
00:50:18.800 | But suffice to say, these questions
00:50:19.960 | are actually pretty good.
00:50:21.000 | They are fairly accurate.
00:50:22.160 | They actually did a manual review of it.
00:50:24.040 | There's maybe about 3% of them which are erroneous.
00:50:26.800 | But they're pretty good, and they're pretty hard.
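
A small sketch of the difficulty filter just described; the `ask` helper and the model names are hypothetical placeholders rather than the paper's exact setup:

```python
# Sketch of the difficulty filter described above: keep a candidate question only
# if at least one of a few fast LLMs gets it wrong. `ask` and the model list are
# hypothetical placeholders, not the paper's exact setup.
def is_hard_enough(question: str, context: str, gold_choice: str, ask, models) -> bool:
    wrong = 0
    for model in models:
        prediction = ask(model, context, question)   # returns "A" / "B" / "C" / "D"
        if prediction != gold_choice:
            wrong += 1
    return wrong >= 1                                # at least one model must fail

def filter_candidates(candidates, ask, models=("fast-llm-1", "fast-llm-2", "fast-llm-3")):
    return [c for c in candidates
            if is_hard_enough(c["question"], c["context"], c["answer"], ask, models)]
```
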
00:50:30.000 | So now let's go into the baselines.
00:50:32.520 | They tested it with 10 open-source LLMs,
00:50:34.800 | as well as 6 proprietary LLMs.
00:50:37.480 | They have zero short and zero short chain of thought,
00:50:40.520 | which is pretty standard.
00:50:42.120 | I won't go through them.
00:50:43.920 | So here's the results.
00:50:46.000 | What we see from the open-source models
00:50:47.920 | is that Qwen 2.5 72B performs the best.
00:50:51.040 | You can see it's the most bolded here.
00:50:53.960 | And o1-preview, we see that o1 actually
00:50:59.520 | provides a lot more juice than just regular 4o.
00:51:04.320 | But what is really interesting here is this.
00:51:08.040 | So we have easy questions over here
00:51:12.440 | where a human gets 100% of it correct, right?
00:51:16.840 | And LLMs are not able to get fully 100% of it correct.
00:51:20.560 | And then we have hard questions where a human only
00:51:23.640 | gets 25% of it correct, but LLMs get 50% or more of it correct.
00:51:29.320 | So what is this here?
00:51:31.760 | This really demonstrates to you the jagged edge
00:51:33.640 | of intelligence, right?
00:51:34.880 | Where by some of the easy tasks, humans just
00:51:36.880 | get all of it correct.
00:51:38.120 | But LLMs do consistently make mistakes.
00:51:40.480 | That's the thing about pushing beyond 80% accuracy.
00:51:43.240 | LLMs are going to find it harder.
00:51:46.480 | But then for the longer context, where humans struggle with,
00:51:51.760 | they only get 21% correct, LLMs with huge compute
00:51:56.640 | and their huge context length, they actually
00:51:58.680 | can get quite a bit of it correct.
00:52:00.120 | So that's pretty good.
00:52:01.840 | Similarly, you can see on this over here,
00:52:04.920 | for short context lengths, you can see humans get 47% of it,
00:52:08.520 | but LLMs outperform humans already.
00:52:11.200 | And then for medium and long context lengths,
00:52:15.320 | humans still outperform LLMs.
00:52:18.800 | So that's what is interesting I found here
00:52:22.160 | about the jagged edge of intelligence.
00:52:26.600 | Finally, they also tried--
00:52:29.640 | so what does this mean?
00:52:31.000 | Does this mean that, OK, long context is all you need?
00:52:33.240 | You don't really need a retrieval-augmented
00:52:35.120 | generation baseline?
00:52:36.360 | Well, it's different here.
00:52:38.200 | What they found is that a lot of the models,
00:52:41.080 | they don't actually perform better
00:52:44.800 | with the entire context length.
00:52:46.120 | So for example, Qwen 2.5 and GLM-4-Plus,
00:52:50.200 | they actually do better at shorter context lengths, 32K,
00:52:52.960 | with retrieval, compared to using the entire context
00:52:57.640 | window, the 128K context window, without RAG.
00:53:00.960 | So I think in this case, for this specific models,
00:53:07.680 | RAC still performs better.
00:53:10.640 | Then finally, but the one exception
00:53:13.840 | they raised was this, GPT-4, oh, actually
00:53:16.400 | is able to use the full context length.
00:53:18.840 | So that's interesting.
00:53:19.760 | Actually, it's long enough.
00:53:24.160 | I would have loved to see Sonnet.
00:53:26.000 | So you can see Sonnet over here.
00:53:27.280 | Oh, actually, they do have Sonnet.
00:53:29.240 | Sonnet is actually quite far behind 4o, with 50%.
00:53:34.800 | So well, that's a comparison between OpenAI
00:53:38.840 | and Anthropic.
00:53:41.920 | They also have something whereby they tested whether, hey,
00:53:44.360 | did the LLM memorize it?
00:53:46.960 | So this is really just asking the LLM a question
00:53:49.560 | without providing a context and see if it can answer.
00:53:52.720 | So in this case, they showed that the LLM doesn't memorize.
00:53:55.360 | Some of these questions are really interesting.
00:53:57.320 | I wanted to highlight one of them, which I thought
00:54:00.600 | was actually quite difficult, but it's quite--
00:54:02.600 | which is interesting.
00:54:09.080 | This one.
00:54:10.000 | They introduced a task where the LLM is given a mystery novel.
00:54:15.760 | And the LLM requires-- and the LLM
00:54:17.800 | is asked to identify the killer or identify the motive based
00:54:21.680 | on the information provided in the detective novel.
00:54:24.440 | So you can think that this is something
00:54:26.040 | that a human would take a very long time to do, right?
00:54:28.480 | They would have to scan through an entire book
00:54:30.600 | or pick up pieces here and there.
00:54:32.360 | But the LLM actually does this very well.
00:54:35.440 | And then you can see that the LLM--
00:54:41.480 | that said, the human accuracy is also
00:54:43.080 | extremely high on these detective novels,
00:54:45.560 | like 65% and 72%.
00:54:46.960 | But LLMs-- humans really suck at the academic benchmarks,
00:54:51.760 | which is academic multi-document, expert.
00:54:55.640 | Expert accuracy was only 22%, as well as governments.
00:54:58.640 | So I think maybe academics and governments, that's
00:55:03.280 | where LLMs might outperform humans.
00:55:06.000 | And you can see this is where expert accuracy also
00:55:09.640 | differs on dialogue history.
00:55:13.280 | Now, then maybe the question that you may be asking
00:55:16.120 | is, how does this differ from LongBench v1?
00:55:21.920 | So LongBench v1 was really extractive questions,
00:55:25.120 | very much similar to needle in a haystack.
00:55:27.360 | Essentially, given this thing, what was--
00:55:30.000 | I don't know, what pizza did Eugene Chia eat?
00:55:32.280 | Or what was Eugene Yan's favorite food?
00:55:34.560 | Things like that.
00:55:35.520 | But here, you can see that
00:55:37.840 | you really require understanding and reasoning.
00:55:40.040 | The example being the mystery novel,
00:55:41.840 | you have to identify the killer and the motive.
00:55:45.640 | Now, the second one is that the evaluation format is only MCQ.
00:55:49.400 | So previously, they used F1 and ROUGE,
00:55:51.120 | and they found them to be extremely unreliable.
00:55:53.000 | They also tried using LLM-as-judge
00:55:54.960 | and found it very expensive.
00:55:56.560 | Now, we can debate that MCQ means a 25% random chance
00:56:00.200 | of getting it correct,
00:56:01.360 | so it may not be as precise.
00:56:02.440 | Well, for them, they actually found MCQ
00:56:04.960 | to be a more reliable measure than ROUGE or F1.
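A quick back-of-the-envelope on why the 25% guessing floor is manageable: with a few hundred MCQ items, the noise band around pure guessing is narrow, so above-chance scores are easy to detect. This is just illustrative arithmetic, not the paper's analysis.

```python
import math

def guessing_band(n_questions: int, p_chance: float = 0.25) -> tuple[float, float]:
    # ~95% interval around pure guessing for an n-question MCQ benchmark.
    se = math.sqrt(p_chance * (1 - p_chance) / n_questions)
    return p_chance - 1.96 * se, p_chance + 1.96 * se

# e.g. with 500 questions, guessing lands in roughly (0.21, 0.29),
# so a model scoring 0.45 is clearly above chance.
print(guessing_band(500))
```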
00:56:08.640 | The third one is that the curation is actually
00:56:12.080 | very rigorous.
00:56:12.880 | Essentially, given the questions,
00:56:14.600 | they actually tested them against three LLMs
00:56:17.720 | to make sure that at least one of them
00:56:19.720 | gets it wrong, to test if the question is hard enough.
00:56:22.640 | And they also reviewed it with human experts.
00:56:24.720 | Can the human expert answer it?
00:56:26.120 | And human experts have up to 15 minutes
00:56:29.400 | to try to answer a single question.
00:56:31.480 | And if they can't answer within 15 minutes,
00:56:33.560 | they actually have the option to say that,
00:56:34.760 | hey, I don't know, it's too hard.
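The curation loop just described could be sketched roughly like this; the data layout, the dummy checkers, and the review helper are hypothetical stand-ins for the three checker LLMs and the expert review tooling.

```python
# Hypothetical sketch of the two-stage curation filter described above.
# Question format assumed: {"question": ..., "choices": [...], "answer": idx}.

def keep_question(question: dict, checkers: list) -> bool:
    # `checkers` stand in for the three checker LLMs: each is a callable that
    # takes the question (with its full context) and returns the index of the
    # option it picks. Keep the question only if at least one checker gets it
    # wrong, i.e. it is hard enough.
    wrong = sum(check(question) != question["answer"] for check in checkers)
    return wrong >= 1

def expert_verdict(question: dict, expert_choice, minutes_spent: float) -> str:
    # Human experts get up to 15 minutes per question and may abstain.
    if expert_choice is None or minutes_spent > 15:
        return "too hard / abstain"
    return "correct" if expert_choice == question["answer"] else "incorrect"

# Dummy usage with three "checkers" that always pick option 0.
q = {"question": "Who is the killer?",
     "choices": ["A", "B", "C", "D"], "answer": 2}
print(keep_question(q, [lambda _: 0] * 3))                   # True: all three are wrong
print(expert_verdict(q, expert_choice=2, minutes_spent=12))  # "correct"
```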
00:56:36.800 | So you can see, this is the human expert accuracy.
00:56:39.320 | For most of the tasks,
00:56:41.600 | I think it's maybe 50 to 60%.
00:56:44.920 | So even the human experts
00:56:46.240 | have quite a way to go on this.
00:56:48.000 | And then the final thing is that they included more tasks,
00:56:53.760 | namely long dialogue history
00:56:55.120 | as well as long structured data.
00:56:56.720 | So yeah, those are the two new task types.
00:57:02.120 | So that's it, that's the main thing for LongBench v2.
00:57:04.800 | Yeah, any questions?
00:57:07.560 | - Actually, I would like to add
00:57:10.120 | that I like this because it was very consistent
00:57:13.640 | with some of the conversations I had with several teams,
00:57:17.320 | in particular in the legal AI space,
00:57:20.760 | because a lot of them deal with high-end open source models.
00:57:23.760 | When I first talked to them last,
00:57:26.800 | I think at the start of last year,
00:57:29.000 | my first impression was
00:57:29.840 | that they would really love long context
00:57:31.640 | because you have a large amount of legal text.
00:57:36.120 | And they said, yeah, needle in a haystack is meaningless.
00:57:39.120 | You can get 128k on needle in a haystack,
00:57:41.480 | but it's not able to reason over the legal text
00:57:44.400 | past 8k or 32k, depending on the model.
00:57:46.960 | So a lot of them actually do 8k or 32k,
00:57:50.720 | depending on which architecture they're using
00:57:52.800 | for their legal LLM,
00:57:55.080 | despite the model being able to handle much larger.
00:57:59.800 | So this is more scientific evidence
00:58:02.040 | to what they said they tested internally.
00:58:04.280 | - And that's what is fun as well.
00:58:06.840 | Other than GPT-4o,
00:58:08.360 | most of the models didn't really
00:58:11.880 | perform better at 128k versus 32k.
00:58:14.800 | I think that's really just a matter of time.
00:58:16.600 | I think eventually, for 128k-context documents,
00:58:21.600 | we probably won't need RAG anymore.
00:58:24.000 | At least that's where I'm betting.
00:58:27.560 | - The way I rationalize it in the open-source space
00:58:30.240 | is that there's just a lack of proper training data
00:58:32.880 | at this scale.
00:58:33.720 | And also, to sum it up,
00:58:36.200 | in the whole post-transformer space,
00:58:39.400 | the problem for a lot of the open teams is that
00:58:41.480 | training at 128k is kind of big-lab territory
00:58:44.480 | because of the amount of VRAM required
00:58:45.800 | for the back propagation.
00:58:47.440 | So even though,
00:58:49.880 | strictly speaking, teams like ours could train at 128k,
00:58:52.200 | none of us will train at 128k.
00:58:56.040 | - Yeah, yeah, exactly.
00:58:57.840 | Okay, so that's it.
00:59:00.280 | That's the very short summary of LongBench v2,
00:59:03.200 | if you're interested in long-context benchmarks.
00:59:05.680 | Anything else?
00:59:09.600 | - Oh, sorry.
00:59:11.560 | Just on your comment on the 128k VRAM requirements,
00:59:16.400 | does any of the federated learning stuff help?
00:59:23.760 | - Federated, are you talking distributed federated
00:59:26.120 | or like, sorry?
00:59:28.080 | - Yeah.
00:59:28.920 | - No.
00:59:30.400 | - Whatever, you know, you need.
00:59:32.720 | - So, at the end of the day,
00:59:35.520 | with long context,
00:59:36.920 | your biggest bottleneck is you need a set of nodes
00:59:40.720 | together to essentially handle the entire problem
00:59:45.720 | from beginning to end,
00:59:46.800 | including every single state in the middle.
00:59:49.040 | And I don't think distributed multi-cluster training
00:59:53.320 | will help in this sense,
00:59:54.520 | 'cause each cluster will need to be able to handle
00:59:56.520 | the training for 128k.
00:59:57.880 | - Right, so I was thinking that one of those,
01:00:02.960 | like whatever Nous Research is doing,
01:00:04.600 | can help you distribute it effectively.
01:00:07.520 | My understanding is that the post-training for long context
01:00:12.160 | isn't actually that much in terms of compute.
01:00:14.920 | Like this is just purely a VRAM issue.
01:00:17.160 | - Actually, no, I think
01:00:22.040 | that's where I would disagree,
01:00:23.720 | because it's not much to get it to pass needle-in-a-haystack.
01:00:27.160 | It's actually a lot to get it to do proper reasoning.
01:00:29.800 | So, for example, there's a lot of cheats that we can do.
01:00:33.560 | Like, for example, on the state space side, right?
01:00:35.680 | Right now, we extend the context window, kind of,
01:00:38.720 | by snapshotting at 4k.
01:00:41.040 | So we do training in 4k chunks.
01:00:43.920 | But the reason why this does not work well, right,
01:00:47.120 | is that the model just
01:00:49.880 | learns to, let's say,
01:00:51.600 | hey, just try and memorize as much as possible
01:00:53.600 | so that needle-in-a-haystack down the line kind of works.
01:00:56.880 | But in, let's say, the detective-story
01:00:59.800 | kind of situation, right,
01:01:01.080 | You answer at the last chunk,
01:01:05.240 | but the back propagation is unable to back prop
01:01:07.960 | to the first chunk and say,
01:01:09.120 | hey, when you read this story,
01:01:11.120 | understand the reasoning behind it.
01:01:12.680 | And that's where it all just disconnects
01:01:14.800 | because of the VRAM capacity requirement.
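Here's a minimal PyTorch sketch of that chunked-training pattern, using a plain GRU purely as a stand-in for an RWKV/state-space block: the recurrent state is carried across 4K chunks but detached, so the loss on the last chunk can never push gradients back into the first chunk where the clue appeared. Model sizes and the random "document" are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in recurrent model; the real setting would be a state-space block,
# but the gradient-flow issue is the same.
vocab, dim, chunk_len = 1000, 64, 4096
embed = nn.Embedding(vocab, dim)
rnn = nn.GRU(dim, dim, batch_first=True)
head = nn.Linear(dim, vocab)
opt = torch.optim.Adam(
    list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters()))

# One "long document" of 32K tokens, processed as 8 chunks of 4K.
tokens = torch.randint(0, vocab, (1, 8 * chunk_len))

state = None
for start in range(0, tokens.size(1), chunk_len):
    chunk = tokens[:, start:start + chunk_len]
    out, state = rnn(embed(chunk), state)
    # Detach: keeps VRAM bounded to one chunk's activations, but also
    # means gradients cannot flow into any earlier chunk.
    state = state.detach()
    logits = head(out)
    # Next-token loss computed on this chunk only.
    loss = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, vocab), chunk[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
# A question answered in the final chunk therefore cannot teach the model
# how to *read* the first chunk differently, only what to memorize.
```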
01:01:18.920 | And this is tied to compute as well,
01:01:21.080 | because this is technically
01:01:24.120 | the quadratic curve, actually.
01:01:26.080 | So this is the part where,
01:01:28.440 | even for us, right, in training,
01:01:29.960 | we still suffer from that quadratic curve.
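Some rough numbers on that quadratic curve, with purely illustrative layer and head counts, assuming the raw attention score matrix is materialized (i.e. no FlashAttention-style recomputation):

```python
# Illustrative only: 32 layers x 32 heads, fp16 attention scores.
layers, heads, bytes_per = 32, 32, 2

def score_matrix_gib(seq_len: int) -> float:
    # Attention scores are seq_len x seq_len per head per layer.
    return layers * heads * seq_len * seq_len * bytes_per / 2**30

for L in (4_096, 32_768, 131_072):
    print(f"{L:>7} tokens -> ~{score_matrix_gib(L):,.0f} GiB of raw attention scores")
# 4K   ->     ~32 GiB
# 32K  ->  ~2,048 GiB
# 128K -> ~32,768 GiB  (hence chunking / recomputation tricks)
```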
01:01:32.120 | - Yeah.
01:01:34.440 | Cool.
01:01:35.280 | I, yeah.
01:01:37.160 | No comments beyond that.
01:01:40.760 | Cool.
01:01:44.720 | I guess that's the last paper club of the year.
01:01:47.320 | (laughing)
01:01:49.560 | That's sloppy as hell.
01:01:53.360 | - Very excited.
01:01:56.520 | Okay.
01:01:57.360 | Merry Christmas, everyone.
01:01:58.800 | - Merry Christmas, everyone.
01:02:00.040 | Ho, ho, ho, ho.
01:02:01.280 | - Yeah.
01:02:02.120 | - Bye.