We covered Sora previously. Yeah. OK. Yeah, Sora is not that important, right? SAM 2, yeah, I don't know if people have played around with SAM 2 or listened to the pod. I think definitely quite revolutionary. Also, another example of the thesis of vision becoming video, because what SAM 2 did was take SAM 1 and extend it in the video direction.
And they use this really cool memory attention architecture that gives it object permanence. So we see that in-- I think it's in the related tweet here-- where you can track people going off screen and coming back on screen, which is something that I like to show off. I don't know if this works or not.
I understand. There we have it. OK. Oh, yeah. All right. So I think this is on their demo page. Let's see if it's here. Yeah, so this is segment anything across video. I don't know if there's-- oh, yeah, here. And-- OK, OK, this is the most impressive one. So you know this three-cup ball trick?
It can track where the ball is. You might have seen. And so here, I can click on the ball in the first frame. I can also click on a different cup. And so here, the additional challenge is that there's three cups that look exactly the same. And then there's a ball that will get occluded by the cup.
So the ball is no longer visible. The cups are all moving around. They all look the same. But the model actually keeps track of the cup that we selected. And as you can see at the end-- here, I'll jump to the end so you can see-- it actually finds the cup again.
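For anyone who wants to poke at that themselves, here's a minimal sketch of driving the SAM 2 video predictor with a single click prompt and letting its memory attention carry the object through occlusion. The module path and function names (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video) are from memory of the released sam2 repo, and the config and checkpoint paths are assumptions, so check the current README before running.

```python
# Minimal sketch: prompt SAM 2 with one click on the first frame, then let its
# streaming memory track the object through occlusions. Names follow the
# facebook/sam2 repo as I remember it; exact APIs may differ between releases.
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",  # assumed config path
    "checkpoints/sam2.1_hiera_large.pt",   # assumed checkpoint path
)

with torch.inference_mode():
    state = predictor.init_state(video_path="shell_game_frames/")  # dir of JPEG frames

    # One positive click (label=1) on the chosen cup in frame 0.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=[[410, 230]],
        labels=[1],
    )

    # Memory attention carries the object identity forward, so the mask
    # reappears when the cup comes back into view after being occluded.
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```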
I wanted to point out a couple of fun demo UX features that we added that actually-- yeah, so I thought it was pretty impressive. And people are using this in real-life situations. I would argue among the listed vision models, SAM is probably taking the crown in terms of real-world applications.
Yeah, it's also kind of interesting. Basically, the SAM team publishes one thing a year, and then they take the rest of the year off. Like it's-- they're like the epitome of we have defined one problem well. We solve it very well. And then we don't publish anything else, which is kind of cool.
Let me just grab your recommendations. Let me see them in the chat. It's here, right? Cool. Then this new one-- I was not paying attention to DETRs and all that. I know that YOLOv10 was actually at NeurIPS, which was the latest update of YOLOs and real-time object detection.
But apparently, according to the vision guys, DETRs are mostly replacing YOLOs in terms of their performance. I'm not really sure why. I didn't really go into it. But if people care about real-time object detection, that is it. Oh, yeah, I mean, the other thing about Segment Anything, they studiously avoid labeling.
So they only know how to draw segments. And to me, it's very similar to typical convnet layers, where there's one layer that only does edge detection. And so this is like, because they constrain it very well, they solve it very well. But at the same time, for practical use, you basically always want to label things.
So then you have to combine Segment Anything with Grounding DINO and stuff, which is not a full solution. YOLOs, I think, also would have that same application. OK. I'll keep moving. But then also, I also don't want to maybe dominate too much of the conversation. I did read this MMVP paper, which they highlighted as one of the more well-talked-about papers.
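To make the "SAM plus Grounding DINO" point above concrete, here's a rough sketch of the usual grounded-segmentation pipeline using Hugging Face transformers: an open-vocabulary detector turns a text label into boxes, and box-prompted SAM turns those boxes into labeled masks. The model names and the post-processing call are my assumptions from memory, not the exact setup anyone in the discussion used.

```python
# Rough sketch: text labels -> boxes (open-vocabulary detector) -> masks (box-prompted SAM).
# Model names and exact post-processing calls are from memory and may need adjusting.
import torch
from PIL import Image
from transformers import pipeline, SamModel, SamProcessor

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
sam_model = SamModel.from_pretrained("facebook/sam-vit-base")
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")

def grounded_segment(image: Image.Image, labels: list[str]):
    """Return labeled masks for each detected instance of the given text labels."""
    results = []
    for det in detector(image, candidate_labels=labels):
        box = [det["box"]["xmin"], det["box"]["ymin"],
               det["box"]["xmax"], det["box"]["ymax"]]
        inputs = sam_processor(image, input_boxes=[[box]], return_tensors="pt")
        with torch.no_grad():
            outputs = sam_model(**inputs)
        masks = sam_processor.image_processor.post_process_masks(
            outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
        )
        results.append({"label": det["label"], "box": box, "mask": masks[0][0]})
    return results
```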
By the way, I have this really useful extension for finding the related tweets of all the papers. And so I don't think this particular paper was that influential. But it was one of the best papers of CVPR. And it basically just pointed out the sort of, quote, unquote, "jagged intelligence" of frontier models.
So even here, they cataloged all the hallucinations that still remain in the frontier models, and why, even though they're superhuman in some aspects, they're not superhuman here. Yeah. And so that creates gaps for other models to fill in. And I think this is an example of-- I mean, they have a benchmark.
And anytime you see a benchmark like this, where all the frontier models are here and the humans are here, that is a reliable way to advance the field, which is: you're trying to find gaps that it's not doing well in. And I think this sort of finally made it click for me why so many people were kind of focusing on clocks this year, here, which became a focus for PixMo and Moondream-- which is just analog clocks.
Well, I'm trying to look for the analog in this situation. Can't really find it. It's somewhere in the presentations that I saw. But basically, I think this is-- when you publish an influential shortcomings paper or benchmark, then people can sort of meaningfully advance on it. And so I think then they picked out PaliGemma, Florence, and Moondream.
I should put Moondream there as the examples. Yeah, what's up? Is this essentially an AGI problem for the vision field? I guess, which is fun, which is interesting as well. A lot of people think that ARC-AGI is a vision issue-- I am primarily using my vision sense when I'm looking at ARC-AGI.
But Francois Chollet really, really insists that ARC-AGI is not a vision problem. I mean, this looks vision to me. No, but I mean, the reason why I agree with him also is that even the OpenAI solution-- the leading model from OpenAI-- and the winning models, they're not solving it through vision.
They're solving it through text. It's a reasoning problem. Yes, but don't you think vision would help? Like, a priori. I understand that the winners, none of them use vision. But a priori, it should help. No, there are teams that are trying to use the vision models, and it didn't work somehow.
So that's an interesting-- That's a scale issue. It's a scale-- I don't know which one it is. Oh, I wonder if you run QVQ on these things, would it be any different? One of the things I was thinking about for the Christmas episode today was taking some of the published ARC-AGI questions and seeing if humans can solve them.
Let's do one. You want to do one? I'll find one. You keep going. Let me find a fun one. OK, OK. Yeah, I posted a link in the Discord somewhere. I think Greg Kamradt is the guy to find it. And then find something where O3 failed to solve one of the ARC-AGI problems.
And then I wonder if humans can do it. OK, what else about PaliGemma? PaliGemma directly led to ColPali, and then also now ColQwen. I think in terms of the specific field of PDF parsing, this has been a relatively big win this year. There's also Marker and Surya as well, which emerged this year as vision-based models for these things.
So where does this go? I think next year, vision. Next year, MMVP solutions to, I don't know, 50%, 70%. But I'm interested in what's next. I think a lot of people are focusing on MMMU, or the multimodal MMLU, or whatever. But I'm not sure what else is left, apart from, I guess, moving on to video generation.
The Artificial Analysis people, I would say, are the de facto leaders now in terms of judging video models. So you should be aware of this leaderboard, where you can track the arenas for all these things. And they're trying to do image and speech as well. But video seems to be the most hyped.
Yeah. Cool. Shall we move on? Any other thoughts or additions to the vision video domain? I just sent a link to the ARC thing. This is the unsolved O3 tasks. Let's try one. Sorry, what did you send it? Zoom chat. Oh, Zoom chat. Everyone can. Oh, oops. Oh, I've been using Orion, by the way.
Oh, but it includes the answer, I guess, I think. Zoom in, man. What? OK, is this the test? These are hard. But we-- OK, so this is-- --24 tasks that O3 was unable to solve, along with the incorrect guesses it made. OK, so what are we trying to do here?
We have three examples. And then O3-- oh, OK. So let's kill the ground truth here. So, oh, yeah. OK, this is the one that everyone's debating, right? So the blue connections here turn everything on its path blue. And everything else remains red. So O3 managed to draw all the blues.
And drew extra blues. This is somewhat unfair, because this question asks you to-- like, none of the samples have this many dots. They also don't have-- oh, go ahead. Oh, go ahead. People are saying that O3's first solution is correct. Yeah, that's the thing. They also don't have dots that line up both horizontally and vertically.
Yeah, yeah, so it's a known bad question. And also, it wasn't-- there's the dot and the line touching the box. That part was very-- Yeah, this specific one is the issue, because ground truth was saying that this one should have turned blue. So in ARC-AGI, you get two chances to submit a solution.
And if you don't get either chance, you're deemed to have failed. But O3 is pretty close. I think this is just a bad question, and we should just throw it out. Let's see another one. OK, so we'll just look at it for a while. I've never seen this. The sentence below might be useful.
Unable to open the grid. Oh, yeah, it just-- Oh, you're not supposed to get the sentence. We're just analyzing. I wonder why the first example doesn't-- the box in the middle is 3 by 3. I thought you're supposed to try to shrink it as much as you can. Or even I, as a human, don't understand the first-- the one on the left.
I thought the pattern is you add an orange ring, then a white ring. So orange, white, then orange, white. And the third one, you skip. You recurse if you can. Yeah, you recurse if you can. So I don't know why they don't recurse for the first one. Yeah, so-- Yeah, especially if the fourth one, it did recurse with that white dot.
Yeah. So, OK, I feel like this is an example where vision would help. Because-- Yeah. Oh, look, this is offset. So over here, you've got-- Right, yeah. This is the closest one. So you need to basically count the levels of recursion. And this is super small. So I don't know if there's a good way to mathematically go-- oh, maybe there's some relationship between-- there's, OK, 1, 2, 3, 4, 5, 6, 7, 8, 9.
This is like a width of 9. Oh, hang on. You also have to zoom in. You have to output a smaller square. No, no, no, you don't. You don't. It's just that the input is a smaller square. So the second input was a 6 by 6. It doesn't output a smaller square.
I was going left to right. OK, so this is a width of 6. 9, 6, 1, 2, maybe 16 minus 4 is 12. 9, 6, 12. And then this is 19 minus 2 is 17. This is an odd number. So 17 outputs a 17 square. Yeah. So I feel like there is a relationship between like-- this is over-engineering.
If you switch it on, it gets even weirder. But after looking at this post, my real thing is like, I don't know if AGI is just puzzles and squares. It shouldn't be. Yeah, the standard line is that it is necessary but not sufficient. Just because you can pass a bunch of IQ tests doesn't mean you're going to be good at your job.
But it helps. I think I can be good at my job and fail all these tests. I agree. I think you can make a lot of money or create a lot of good without even being-- without doing well on, I don't know, Dota. What is the DQN learning? Atari.
Without doing well on Atari, even. I think this is the Atari for reasoning. Well, this is reasoning. This is also just like, the way I think about some of these is, if you throw enough time at it, you'll figure it out. Like, I think if you take a seventh grader and give them a month or a summer vacation, and you give them like a PS5 if they get it, they will do it in two months.
Like, you give them enough time, they'll do it. Are you referring to test-time compute now? No, I'm just talking about like, some of this stuff is just trial and error. So you know, chain of thought. I'll just leave ARC-AGI there. I don't know. I mean, we can agree that the misalignment is a clear fail.
But neither of us can tell why the first one didn't recurse. What do you mean, misalignment? The ones in the green. The O3, yeah. The grid alignment, it's very clear cut. I think it's the tokenization. But also, it seems like a lot of models just don't output anything. That's what the comments say.
A lot of smaller LLMs on ARC, they don't do anything. The model is unable to output a grid at all. I think this is similar to tokenization in language-- LLMs and grids. Yeah, that sounds like it. Like, if stuff is off by a character, it's probably just because their tokenizer is bad at this.
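To make the tokenization point concrete, here's a toy sketch (my own illustration, not any particular ARC harness's format) of flattening a grid into text and parsing the model's answer back; one extra or missing character and the answer no longer parses as a valid grid.

```python
# Toy sketch of serializing an ARC-style grid to text and parsing it back.
# This is my own illustration, not the format any particular solver uses.
Grid = list[list[int]]

def grid_to_text(grid: Grid) -> str:
    # Each cell is a single digit 0-9; rows become lines of digits.
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def text_to_grid(text: str, width: int) -> Grid:
    grid = []
    for line in text.strip().splitlines():
        digits = [int(ch) for ch in line.strip()]
        if len(digits) != width:
            # This is where "off by a character" answers fall apart: one extra
            # or missing token and the grid no longer parses as a valid answer.
            raise ValueError(f"expected {width} cells, got {len(digits)}: {line!r}")
        grid.append(digits)
    return grid

example = [[0, 0, 2], [0, 2, 0], [2, 0, 0]]
assert text_to_grid(grid_to_text(example), width=3) == example
```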
Which is like, not a great measure of AGI. But I guess AGI can fix its own tokenizer. Ooh, that'll be fun. OK, do you want to do more? Or should we move on? Move on. So then we move to open models. I had a list here of the Gemma models.
Ooh, I should probably put in-- I guess if we want to do more, if you scroll down, there's one fun one that's very quick that people understand. It's just: what line is on top. Go down, go down, go down, go down. Oh my God, there's so many examples. Yeah, so I actually did do an ARC-AGI, like, human try-and-fill-it-up in Solaris.
The biggest issue I found was that after the second puzzle, you get fatigued. Because some of the puzzles are huge. And trying to fill them out is a pain in the ass. Right, right. Yeah, I don't know which one you're talking about, Vibhu. This one? Yeah, I agree. I mean, it's one of those things where AI is just very good at scaling up attention, and we don't.
We're not. Which is one form of AGI, ASI. Vibhu, I don't know which one you're talking about. My bad, I found a new link. New link. Oh, new link. This one, I'm surprised that it failed. This one? Mm-hmm. So second to last one, at the very bottom. OK, second to last, meaning here.
Up, up one, up. Oh. This itself is an ARC-AGI task: navigate the page based on instructions. OK, I like this. I like this. So basically, what is-- What's on top? What's on top? What's the last thing added? Yeah, so pink is on top. Blue's on top. Pink's on top.
Blue's on top. Green's on top. See? It's pretty cool. Another vision one. Yeah. I feel like vision would win this as well. I mean, it's not just vision, right? You also have to know which one is obscured and which one is not. Yeah, vision is part of it, but it's not just that.
Yeah. But count the most number of blocks. Well, the interesting thing here is, the larger the grid gets-- like on the left, you see all the grids are like 10 by 6. As you add more grids, o1-mini starts to fail. You just can't do more grids. Even though it's the same number of interior lines, it struggles with grids.
So yeah, maybe vision would help, but it's a skill issue. It's a memory attention issue as your grids get bigger. I don't know about that, because the attention will condense this down into the same hidden dimension. So basically, all this gets pre-processed to the same size. The grid size isn't the issue here.
And the grid, I mean, it's only 24. Interesting. Cool. All right, move on. What we learned today, on Christmas of 2024, is that we are not AGI, guys. We're not AGI. We don't deserve a million. OK, I was going to move on to open models. I feel like these are the ones that are commonly named.
I can sort of justify all these. These are the picks from Luca. Obviously, this does not include the sort of state-space models and RWKVs of the world, which are also open. But these are sort of the traditional open models. Are there any that-- I guess the big one has not been mentioned here, which is Meta's Llama.
Are there any that we have missed? Oh, we know I had Mistral. Falcon dropped another one too. Oh, I'm sorry. Yeah, I think this is-- I mean, it's kind of nice to have everything, like the whole year in one screen. I definitely would find some utility from that. Just be like, oh, yeah, we didn't miss anything.
This is everything. I think it needs more love. I mean, I think these guys release so much. There's like, you see Coder and all that, and the MoE one. Was the MoE one this year or last year? I think it was early this year. So it's just hard to-- I think it's just like, you have to tell people what the available base models are so that they can then go and fine-tune if they want to.
But otherwise, there's not much to say here apart from the options of open models that are out there. I think you can add some of the small sub-1B models. There's also the Falcon 3 series that dropped a week ago. I don't know how good they are, but they put out Falcon 3 in 1B, 3B, 7B, 10B, and a Mamba version.
I just don't know how good they are. I'm going to put them under misc. They made the list. Yeah, I mean, the consensus is that Falcon was not well-trained, apparently. Oh, interesting, because they put out huge data sets, but they don't train-- So Guilherme, the creator-- I talked to the same guy at NeurIPS 2023 and NeurIPS 2024.
The guy who did the Falcon, the FineWeb-- where was it? RefinedWeb? Actually, was it? I don't know where I put it. I actually talked to him last year at this conference. It looks like I didn't actually-- I didn't even publish the interview. But I talked to him again this year, and he's actually the same guy behind FineWeb.
So this fella, he just basically left TII UAE and joined Hugging Face. So it's the same guy. There's also Falcon RefinedWeb from TII UAE. Right, it's the same guy. That's what I'm saying. If you look at RefinedWeb, the lead author from last year is the same guy.
He moved-- so I mean, that's the intellectual lineage, I guess. He was actually at the Latent Space live event. He came by to just say hi. Cool. Any other open models? I don't know. You said the one that's the 1B, 3B models. What are you specifically talking about? The Phi models, even though they're-- The Phi models, yeah.
I should probably mention Phi here. I don't know. I feel like Phi constantly has these allegations of training on test sets, and I don't know how real that is. I know a couple of people who-- It's somewhat undetermined. They keep saying they do less and less. And realistically, they have a whole section in the papers on how they don't train on test sets and how they filter so that they don't train on benchmarks.
I don't know why they just put out the list for no reason. But Phi-4 is not really out for testing yet, right? It's not open yet. It's meant to be open later. I don't know if anyone's tested it. I guess you could add the Apple stuff here. Yeah. So I-- Never mind, take it back.
Yeah. Yeah, I just put it here. I guess I also mentioned Gemini Nano, which is in here. OK, I found that you can just go to LocalLLaMA and just type in "best model." Every few months, they'll do some kind of survey. And you just sort by new.
They'll usually have some kind of-- here-- some kind of informal survey of what people are saying in terms of best models, including the best fine-tunes and whatever. So that's pretty useful. I don't think we've missed any, basically. Cool. I would like to point out that the rare, random EVA Qwen fine-tune there, that's actually a role-play fine-tune model.
Oh, of course. This is LocalLLaMA. LocalLLaMA. No, but every now and then, some of the role-play fine-tune models, people do like them in human evaluations when it comes to tasks. So it's always weird to see it happen. What I'm not seeing is merges. And yeah, where are the merges?
None of these are merges, right? I can't really double-check. Merging is not all you need. Yeah, I feel like there's a lot of noise about merging, but nobody actually ends up using them. They just think-- Ramon has an interesting comment, by the way, from Zoom chat. Some of the Falcon creators-- the first series-- they started a company, Adaptive ML.
They do on-prem RL. Oh. They raised a 20 mil Series A, eh? Nice. OK. Yes, I think I saw this round come out. We might try to talk to them next year. I know basically everyone's going to focus in on RL for LLMs, and that'll be a big theme for next year as well.
I'm trying to collect all these themes so that I'll make my life easier-- fine-tuning, RL for LLMs. OK, whatever. OK, so I will keep going in the interest of time. Synthetic data. I feel like we are all relatively familiar with many of these. I put Phi here as well for synthetic data.
We did the Orca 3 AgentInstruct paper this year. I did that session. I feel like the billion personas paper was a source of a lot of noise, but ultimately very little actual impact, whereas the people who worked on real data sets, like FineWeb and DCLM, have had more impact.
You were trying to think-- WizardLM. WizardLM series. Oh, that's agent. Is that also synthetic data? I don't know, man. Yes. OK. So I will-- it's all Microsoft, right? It's all these MSR China people. OK. Cool. The one I didn't know about was Cohere, which Luba mentioned in her talk.
And that was net new to me. Basically, there is a chart in here. Cohere is always pushing-- or at least Sara Hooker is always pushing-- the idea that there are benefits from learning across languages. And I think she's basically just trying to emphasize that if you have an ensemble of languages and your data set crosses more languages, you have knowledge that you don't have in one language.
So English is not all you need, basically. So it shouldn't just be a single teacher-- whatever the routing system is for your multiple languages. OK. I don't know if there's any other synthetic data stuff that we should pick up. But basically, this is completely-- I introduced this on a whim. I felt like synthetic data was a big theme this year.
And I think this really should be data sets that happen to be all synthetic data. There's also a lot of talk about LLM-as-judge. But I don't know if there's a specific paper to cover this. When I looked at my own records, the best "paper," quote unquote, was Hamel Husain's post on LLM-as-judge.
So maybe that would be a quote, unquote, paper for LLM-as-judge. I don't know if anyone has read anything on synthetic stuff. But yeah, I'll just put it there. OK, small models: MobileLLM. I don't know if we covered this in Paper Club. But I did in AI News.
And I think this is effectively the genesis for SmolLM, which is kind of Hugging Face's implementation of that. We did cover it in Paper Club. I covered this paper. Cool, awesome. I may not have been there. Did you like it? Do you still have a good impression? No, I really like it, particularly the layer-repeat part.
Yes. I think it's relevant, again, because people are speculating that O1 does layer looping. And MobileLLM kind of does that. But I'm not sure if it's dynamic or not. I think it could be static, whereas O1 would have some layer to decide whether or not to continue looping.
Like, it's effectively kind of a Turing-complete architecture. I think O1 is more similar to-- I can't remember the Google paper's name. But there's a Google paper where it exits early across the layers. Oh, Mixture-of-Depths. Yeah, Mixture-of-Depths, yeah. Which I think I did cover briefly as well, as an alternative to MobileLLM.
Oh, we'll mention the mixture of depths here. Does it-- do they have looping? I don't think it's defined as a loop, but it's more of like a fixed depth and then exit early. Yeah. OK, I feel like to loop, to do inference time compute for multiple minutes and potentially hours, you need to loop instead of just having different depths.
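Nobody outside the lab knows what O1 actually does, but here's a toy sketch of the distinction being drawn: a shared block that is looped until a learned halting score says stop (in the spirit of adaptive computation time), as opposed to a fixed stack of different layers. Everything here, including the halting head, is illustrative rather than a claim about any real model.

```python
# Toy illustration of "different depths" vs "looping": a shared block applied
# repeatedly until a learned halting head says stop. Purely a sketch of the
# idea (ACT-style), not a claim about how O1 or MobileLLM is implemented.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model: int = 256, max_loops: int = 8, halt_threshold: float = 0.5):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.halt_head = nn.Linear(d_model, 1)  # predicts "am I done yet?"
        self.max_loops = max_loops
        self.halt_threshold = halt_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.max_loops):          # dynamic depth, bounded
            x = self.block(x)                    # same weights reused each loop
            halt_prob = torch.sigmoid(self.halt_head(x.mean(dim=1)))
            if bool((halt_prob > self.halt_threshold).all()):
                break                            # every sequence in the batch voted to stop
        return x

x = torch.randn(2, 16, 256)   # (batch, seq, d_model)
y = LoopedBlock()(x)
print(y.shape)                # torch.Size([2, 16, 256])
```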
Yeah, it's also very hard to comment on this because there is no known open mixture-of-depths model as well. Cool, well, that's all we got. I think Apple Intelligence may be the biggest on-device model deployment, apart from RWKV, because I have it on my phone and I benefit from it every day.
Still quite cool. I feel like people-- it's very trendy to shit on Apple Intelligence now. But it is still underrated that they rolled out transformers across the entire install base of iPhones, which is pretty cool. Gemini Nano, I thought, would be this year; it's still under a feature flag. So do people know what I'm talking about when I say Gemini Nano?
So there's like a browser API. Built-in AI, I think it's this one. Yeah, so these are the APIs. And you can-- yeah, this Prompt API for web, I don't know where it is here. This one? No, this is an extension. Where is-- I don't know where it is. Yeah, this is Flash.
It will be in Chrome, where you can do something like browser.ai.generate. And that's just straight, base-level access to Gemini Nano. Oh, here it is. Yeah, yeah. So this will be built into the browser, no download. At some point, this will happen. And I think there were some demos this year that showed that it was very fast.
Obviously, it's very dumb as well, but if you-- yeah, I can't find it right now. But I mean, if you just sort of wait a bit, then you know it's coming. Maybe I'll just put it in here. OK, cool. Kimba, was this a big deal? I don't know. Luna picked it.
I feel like Eugene might know. I think it's too early to even tell whether it's a big deal. But it has potential. I think that's why he picked it. Cool, all right. I'll keep moving on. I feel like I'm kind of running out of steam. Before going to post-transformers, there's also big models.
I don't know whether you want to touch on DeepSeek dropping their 600B model. Yeah, here, right? V3? Oh, OK, I forgot that there was a 600B drop. Yeah. OK, post-transformers. Is there any reflection on big, big model drops, like the 405B-- the large failures? Blunt? Large failures? Especially-- What are large failures?
If it was me, I might call them large failures. If it wasn't me, I might call them distillation models, so you can distill down. And then it becomes very weird when now, like, Llama 3 70B, or whatever the big one is, is as good as the 405B. But big failures. In that lens, you can also include the NVIDIA reward models.
I really wish this was like live editable, so I can add in parentheses like burn VC money, but it's OK. Is this what you mean by reward models, Eugene? I don't know if this is the-- This is the one. Let me double check. There's a bunch of non-branded things.
I think Grok 1 was kind of considered a failure. Everyone's very excited about the weights of Grok, but it was too big to deploy. When was that? That was March. Yeah. So this is also a very, very big model. This is 314b. Yeah, I mean, yes, they're too big, but they're still teacher models.
And I think that's OK. I don't see an issue with this at all. I don't consider it a failure. I don't know. OK, so let me know if anyone can think of any other big models that were released this year. I just thought of one. There was a Chinese model.
I can't remember which one that was, but I will look it up. Some Chinese model. OK, there was state space stuff. I feel like the only thing that really made an impact this year was Jamba. I think we covered this as well in Paper Club. The rest was-- and obviously, Mamba 2 was also this year.
And this is one of the best papers at NeurIPS. Yeah, not sure what else to cover. I think Sana kind of flew under my radar. But apparently, they have extended Mamba models to diffusion, which is kind of cool. And it works. Great. In my sessions with the image people, I had a session with some of the people working on image and Veo at Google.
They were very hyped up about autoregressive image generation. So instead of diffusion, they are sort of getting rid of the diffusion part and just straight autoregression for images. And I thought that was notable, but I didn't have the background to understand it. They were just like, yeah, next year is the year of autoregressive images.
So a bit of a shift in my mind, because I thought that people were more-- like last year, this time last year, people were more interested in text diffusion, so diffusion going into text. Now, they're talking about autoregression going from text into images. So it's kind of the other way around.
Actually, I think Sana, LoLCATs, and QRWKV might be in their own category, where it's really more about taking existing, I guess, attention-based models and converting them over. I think that's the theme there. Thank you for the point there. I know I said it, but yeah. Fine. Aren't Franken models like different model merges? Like, take a Llama and a Mistral and you merge?
Like take a llama, a mistral, you merge? No, they're not. They're the Franken models. They share a Franken merge. OK, this is more conversion. But whatever, it's still putting Frankenstein, yeah. Quick question on QRWKB. Is there a retraining phase when you replace the layer? Yes, there is. And is it retraining, or is it continual training?
No, it's just 500 million tokens retraining the attention layers. And then another 500 million just on all the layers, yeah. So is it initialized from scratch? Yeah, the attention layer is initialized from scratch. That's crazy. You can train on 15 trillion tokens of attention to get good, or you can reinitialize and just train on 500 million.
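Here's a rough sketch of that two-stage conversion recipe as described: swap each attention block for a freshly initialized RWKV-style mixer, train only the new modules on a small token budget, then briefly unfreeze everything. The module layout and names are placeholders, not the actual QRWKV training code.

```python
# Sketch of the two-stage conversion recipe described above: replace each
# attention module with a freshly initialized recurrent mixer, train only the
# new modules on a small token budget, then briefly train all layers.
# The layer layout and mixer_factory are placeholders, not the real QRWKV code.
import torch.nn as nn

def convert_and_freeze(model: nn.Module, mixer_factory) -> list[nn.Parameter]:
    """Swap every self-attention block for a new mixer; return the new params."""
    new_params = []
    for layer in model.layers:                          # assumes a typical decoder layout
        layer.self_attn = mixer_factory(layer.hidden_size)  # reinitialized from scratch
        new_params += list(layer.self_attn.parameters())
    for p in model.parameters():
        p.requires_grad = False                         # freeze the pretrained weights
    for p in new_params:
        p.requires_grad = True                          # train only the replacement layers
    return new_params

# Stage 1: ~500M tokens with only the new mixer layers trainable.
# Stage 2: unfreeze everything and train on roughly another 500M tokens.
def unfreeze_all(model: nn.Module) -> None:
    for p in model.parameters():
        p.requires_grad = True
```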
That's a scam. Exactly. It's so many fewer tokens. I mean, it's the same intuition as LLaVA, right? Like, in my mind, very similar. Like, it's effectively-- LLaVA is an adapter where you merge in a new model to match a really strong pre-trained model, right? Like, you have a pre-trained language model.
You have a contrastive loss to merge. Like, you use the backbone of a strong model, and then you add-- like, you train a new one to match it. But reinitializing weights from scratch-- you basically restart. Like, you start from stochastic noise, and you retrain. Like, you have no intuition to go off of.
But then the-- yeah, with LLaVA, you're still just-- you're just mimicking a really strong model. Mimicking, OK. No opinion on that one. Yeah, I think I vaguely understand the difference. OK, so the other big one is LoLCATs, which is basically the same thing as what QRWKV was doing, but it extends to Mamba as well.
And they were using-- they were essentially using LoRAs instead, which needs fewer resources, but performs worse, arguably, by my research. It's like, everything is just so, so, so lacking-- lacking-- what was it? Ablations. I can't even say which method is better or worse. Right, so you want one survey paper comparing all these things.
No, I think we just need to give all these more time. No, it's still very cool work. Love the name. xLSTM is the one that we did not cover. This was a best paper at NeurIPS. And I talked to them, so I'll release the interview at some point.
I don't know if people have thoughts about this, but some of the gating and stuff is kind of interesting, in the sense that they seem very clear about the ways in which the LSTM did not scale or failed. And they seem very intent on fixing that. It's kind of cool.
Yeah, cool. I'll move on to agents, which is the thing that we published today. We had eight things in agents. And I thought his talk was really-- Graham Neubig's talk was really good. I don't know if there was a breakthrough. So last year, the obvious winner for last year was Voyager.
The Voyager paper was one of the must-reads. I would say also the paper that I love to hate, the Smallville Generative Agents. I don't know if there's any other papers that really broke through this year. Why do you like to hate Smallville? It simulates chats for entertainment. And then in order to-- the reason it sells itself so well is primarily because of that image that is there.
It taps into this 8-bit nostalgia effect that people really like. You know what I'm talking about. Where's the paper? Yeah, I know the paper. So yeah, you're not happy with people over-hyping it. It caused this generation of papers that all look like this here. Oh, all the 8-bit images in papers for the entire year.
It's got nothing to do with how the agents work. It's just people think it looks cute. And therefore, it does well on social media. It's just really bad science. I don't know how to say it. 8-bit is all you need. It's as bad as "all you need." So it just gigafries people's brains.
And this random project gets 26,000 stars because of this. Because of this. And I guarantee you, nobody's using it. It's very annoying. It's the source of all the noise. OK, makes sense. Yeah, hot takes. OK, cool. I think I'm done. So that's the year in review for papers.
Maybe there's other papers I haven't picked up, but I wanted to just share. Eugene, do you want to do it? Thank you, swyx. Sure, I can do that. I don't think I-- Why has this one sort of lodged in your brain as an interesting long context benchmark?
And I mean, from early on, I think neither a haystack. I mean, you're asking me about long context. I was like, come on, that's not really long context. I mean, it's really just extractive. So imagine if you were to summarize a book, you'd ask questions of a book. It's really a lot of reasoning.
And I think this long context paper, they actually shared something. Oh, man, crap. I need to-- sorry, guys. This is a brand new laptop. So I'll mention BABILong. I'll buy some time for you. There's always a lot of interest in long context models. And I find that-- so BABILong was sort of the long context winner of NeurIPS.
This is the one. Eugene is not going to cover this one. He's going to cover something else that he found. But I think this-- and then RULER is the one that we covered on Latent Space. This guy, where they train a million-context LLM. Oh, maybe there should be a category for long context.
I wonder. That could be interesting. OK, Eugene, you're back. OK, sweet. So the reason why-- so now I want to share with you about this paper, which I think is pretty cool. Wait, are you actually able to see it now? Yes, you can see it. OK, perfect. Are you seeing my Zotero?
I don't know why. OK, I'm assuming you're seeing my Zotero. So this is LongBench v2, Towards a Deeper Understanding of Realistic Long Context Benchmarks. I actually have a comparison with LongBench v1, which I will go over at the end of everything. But I think it's really interesting to see all the different iterations of it.
And you can see actually what they found didn't work in the previous iteration and how they try to improve on it. Long story short, LongBench 1-- no, sorry, LongBench 2-- how they created it is they recruited annotators. I'm going to go through this pretty quickly. So the task is-- it's extremely long context.
The context is about 8,000 to 2 million words. And they have several tasks. The simple one is really just a single document-- single document, they do Q&A on it. And then there's also multi-document Q&A. And what is new in LongBench v2 that was not in LongBench v1 was long context history understanding.
So history understanding is two kinds. There's chat between multiple LLM agents. And then there's also chat between a human and an assistant. So essentially, seeing if the LLM can actually reason over these long chats that are made up of small texts. And another thing that's new is long structured data understanding.
So there's tables and there's knowledge graphs. So how they collected the data-- so previously in LongBench 1, a lot of it was synthetic-generated data. In this case, they actually got undergrads to create the data. And they incentivized them pretty well. They paid them quite a fair amount of money to generate all these long benchmarks.
And they also had a lot of people review it. So after they collect the data, one thing that's really interesting is that they ask annotators to annotate it. And then they actually review it, which is: they run the question through three LLMs-- fairly small, fast LLMs. And they would only keep-- so over here, here's the image.
They would only keep the question if at least one out of the three LLMs gets it wrong. So essentially, these questions are somewhat fairly hard. And I won't go through all the different criteria they had for keeping the questions. But suffice to say, these questions are actually pretty good.
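A small sketch of that difficulty filter as described: a candidate question is kept only if at least one of the three fast reviewer LLMs answers it incorrectly. The ask_llm function is a placeholder for whatever inference client you use, and the real pipeline applies more criteria than this.

```python
# Sketch of the difficulty filter: keep a candidate question only if at least
# one of three fast "reviewer" LLMs gets it wrong. ask_llm is a placeholder;
# the paper's pipeline also applies other keep/reject criteria not shown here.
def keep_question(question: str, context: str, gold_choice: str,
                  reviewer_models: list[str], ask_llm) -> bool:
    wrong = 0
    for model_name in reviewer_models:
        prediction = ask_llm(model_name, context=context, question=question)
        if prediction.strip().upper() != gold_choice.strip().upper():
            wrong += 1
    return wrong >= 1   # too easy if all three reviewers answer correctly

# e.g. keep_question(q, ctx, "B", ["fast-llm-1", "fast-llm-2", "fast-llm-3"], ask_llm)
```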
They are fairly accurate. They actually did a manual review of it. There's maybe about 3% of them which are erroneous. But they're pretty good, and they're pretty hard. So now let's go into the baselines. They tested it with 10 open-source LLMs, as well as 6 proprietary LLMs. They have zero-shot and zero-shot chain of thought, which is pretty standard.
I won't go through them. So here's the results. What we see from the open-source models is that Qwen 2.5 72B performs the best. You can see it's the most bolded here. And o1-preview-- we see that o1 actually provides a lot more juice than just regular GPT-4o. But what is really interesting here is this.
So we have easy questions over here where a human gets 100% of it correct, right? And LLMs are not able to get fully 100% of it correct. And then we have hard questions where a human only gets 25% of it correct, but LLMs get 50% or more of it correct.
So what is this here? This really demonstrates to you the jagged edge of intelligence, right? Whereby on some of the easy tasks, humans just get all of it correct, but LLMs do consistently make mistakes. That's the thing about pushing beyond 80% accuracy-- LLMs are going to find it harder.
But then for the longer context, where humans struggle with, they only get 21% correct, LLMs with huge compute and their huge context length, they actually can get quite a bit of it correct. So that's pretty good. Similarly, you can see on this over here, for short context lengths, you can see humans get 47% of it, but LLMs outperform humans already.
And then for medium and long context lengths, humans still outperform LLMs. So that's what I found interesting here about the jagged edge of intelligence. Finally, they also tried-- so what does this mean? Does this mean that, OK, long context is all you need? You don't really need a retrieval-augmented generation baseline?
Well, it's different here. What they found is that a lot of the models don't actually perform better with the entire context length. So for example, Qwen 2.5 and GLM-4-Plus actually do better at shorter context lengths-- 32K with retrieval-- compared to using the entire 128K context window without RAG.
So I think in this case, for these specific models, RAG still performs better. Then finally, the one exception they raised was this: GPT-4o actually is able to use the full context length. So that's interesting. Actually, it's long enough. I would have loved to see Sonnet. So you can see Sonnet over here.
Oh, actually, they do have Sonnet. Sonnet is actually quite far behind GPT-4o, with 50%. So, well, that's a comparison between OpenAI and Anthropic. They also have something whereby they tested whether, hey, did the LLM memorize it? So this is really just asking the LLM a question without providing the context and seeing if it can answer.
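A tiny sketch of that contamination check: ask the same multiple-choice question closed-book, with no context at all, and flag it if the model still answers correctly. ask_llm is again a placeholder for whatever client you use.

```python
# Sketch of the memorization check: ask the same MCQ with no context at all.
# If the model answers correctly closed-book, the question may be testing
# memorization rather than long-context reading. ask_llm is a placeholder.
def memorization_probe(question: str, choices: dict[str, str],
                       gold_choice: str, model_name: str, ask_llm) -> bool:
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in choices.items())
    closed_book_answer = ask_llm(model_name, context="", question=prompt)
    return closed_book_answer.strip().upper() == gold_choice.upper()  # True = suspicious
```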
So in this case, they showed that the LLM doesn't memorize. Some of these questions are really interesting. I wanted to highlight one of them, which I thought was actually quite difficult, but it's quite-- which is interesting. This one. They introduced a task where the LLM is given a mystery novel.
And the LLM requires-- and the LLM is asked to identify the killer or identify the motive based on the information provided in the detective novel. So you can think that this is something that a human would take a very long time to do, right? They would have to scan through an entire book or pick up pieces here and there.
But the LLM actually does this very well. And then you can see that the LLM-- that said, the human accuracy is also extremely high on these detective novels, like 65% and 72%. But humans really suck at the academic benchmarks-- the academic multi-document ones. Expert accuracy was only 22%, as well as on government documents.
So I think maybe academics and governments, that's where LLMs might outperform humans. And you can see this is where expert accuracy also differs on dialogue history. Now, maybe the question that you may be asking is, how does this differ from LongBench v1? So LongBench v1 was really extractive questions, very much similar to needle in a haystack. Essentially, given this thing, what was-- I don't know, what pizza did Eugene Cheah eat?
Or what was Eugene Yan's favorite food? Things like that. But in here, you can see that it's really-- you really require understanding and reasoning. The example being the mystery novel, you have to identify the killer and the motive. Now, the second one is that this evaluation form is only MCQ.
So previously, they used F1 and ROUGE, and they found it to be extremely unreliable. They also tried using LLM-as-judge and found it very expensive. Now, we can debate-- MCQ means a 25% random chance of getting it correct, so it may not be as precise. Well, for them, they actually found it to be a better score than ROUGE or F1.
The third one is that the curation is actually very rigorous. Essentially, given the questions, they actually tested them against the three LLMs to make sure that at least one of them gets it wrong, to test if the question is hard enough. And they also reviewed it with human experts.
Can the human expert answer it? And human experts have up to 15 minutes to try to answer a single question. And if they can't answer within 15 minutes, they actually have the option to say that, hey, I don't know, it's too hard. So you can see, and this is the human expert accuracy.
So for most of the tasks, I think it's like maybe 50 to 60%. So I think that's quite a way to go for humans to get better at this. And then the final thing is that they included more tasks, which is long context history, as well as long structured data.
So yeah, those are the two main things. So that's it, that's the main thing for LongBench v2. Yeah, any questions? - Actually, I would like to add-- actually, I like this because it was very consistent with some of the conversations I had with several teams. In particular, the legal AI space, because a lot of them deal with high-end open-source models.
When I first talked to them last, I think at the start of last year, my first impression was that they would really love long context because you have a large amount of legal text. And they said, yeah, needle in a haystack is meaningless. You can get 128K in needle in a haystack, but it's not able to reason over the legal text past 8K or 32K, depending on the model.
So a lot of them actually do 8K or 32K, depending on which architecture they're using for their legal LLM, despite the model being able to handle much larger. So this is more scientific evidence for what they said they tested internally. - And that's what is fun as well. Other than GPT-4o, most of the models didn't really perform better at 128K versus 32K.
I think that's really just a matter of time. I think eventually, for 128K-context documents, we probably may not need RAG anymore. At least that's where I'm betting. - The way I rationalize it in the open-source space is that there's just a lack of proper training data at this scale.
And also, as I summed it up, I think in the whole post-transformers talk, the problem for a lot of the open teams is that training at 128K is kind of like big-lab territory because of the amount of VRAM required for the backpropagation. So even though, for RWKV, strictly speaking, you could train at 128K, none of us will train at 128K.
- Yeah, yeah, exactly. Okay, so that's it. That's the very short summary of LongBench, if you're interested in long benchmarks. Anything else? - Oh, sorry. Just on your comment on the 128K VRAM requirements, does any of the federated learning stuff help? - Federated-- are you talking distributed federated or, like, sorry?
- Yeah. - No. - Whatever, you know, you need. - So, so far, so at the end of the day, right? The long context, right? Your biggest bottleneck is you need a set of nodes to be together to essentially handle the entire problem from beginning to end, including every single state in the middle.
And I don't think distributed multi-cluster training will help in this sense, 'cause each cluster will need to be able to handle the training for 128K. - Right, so I was thinking that one of those, like, whatever Nous Research is doing can help you distribute it effectively. My understanding is that the post-training for long context isn't actually that much in terms of compute.
Like, this is just purely a VRAM issue. - It's actually, no, I think-- so that's where I would disagree, because it's not much to get it to do needle in a haystack. It's actually a lot to get it to do proper reasoning. So, for example, there's a lot of cheats that we can do.
Like, for example, on the RWKV and state space side, right now we extend the context window, kind of, by snapshotting at 4K. So we do training in 4K chunks. But the reason why this does not work well is that the model learns to, let's say, just try and memorize as much as possible so that needle in the haystack down the line kind of works.
But in, let's say, the detective story kind of situation, you answer at the last chunk, but the backpropagation is unable to backprop to the first chunk and say, hey, when you read this story, understand the reasoning behind it. And that's where it all just disconnects, because of the VRAM capacity requirement.
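A toy sketch of what that 4K-chunk training looks like for a recurrent model: the hidden state is carried across chunks but detached, so the loss on the final chunk never backpropagates into the chunk where the clue appeared. The model and loss signatures are placeholders, purely for illustration.

```python
# Toy sketch of chunked long-context training for a recurrent model: the state
# is carried forward between 4K-token chunks but detached, so the loss on the
# final chunk cannot backpropagate into earlier chunks. Purely illustrative.
import torch

def train_on_long_document(model, optimizer, token_chunks, loss_fn):
    state = None
    for chunk in token_chunks:                  # e.g. 4K tokens per chunk
        logits, state = model(chunk, state)     # recurrent state carried forward
        loss = loss_fn(logits, chunk)           # next-token loss within the chunk
        loss.backward()                         # gradients stop at the chunk boundary...
        optimizer.step()
        optimizer.zero_grad()
        state = state.detach()                  # ...because the state is detached here
    # Consequence: "who is the killer?" asked in the last chunk can't teach the
    # model how to read the first chunk, since no gradient ever reaches it.
```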
And this is tied to compute as well, because this is technically the quadratic curve, actually. So this is the part where, even for us, in training, we still suffer from that quadratic curve. - Yeah. Cool. I, yeah. No comments beyond that. Cool. I guess that's the last favorite curve of the year.
(laughing) That's sloppy as hell. - Very excited. Okay. Merry Christmas, everyone. - Merry Christmas, everyone. Ho, ho, ho, ho. - Yeah. - Bye.