Go first, then? Nope. All right, so I've been thinking a lot about Q&A evals, question answering, especially for long context. I've gone through maybe 12 to 18 papers, and I wanted to share what I thought was really good from these papers, as well as some of the key highlights and how I thought about them.
I think first we'll start with narrative Q&A. And don't worry, I will actually try to extract all the links for these five papers and I'll share it. So don't try to look for it, just focus on what was important in here. And I can't monitor the chat, so yeah.
Just stop me anytime. So this is the first paper. So you might think, Eugene, it's from 2017. Is it still relevant? I personally think this is the OG. This is a really, really good paper from DeepMind. So let's go through it. For each paper, we'll talk about the data set, the methodology, and then how it helps.
So this data set consists of stories, right? Both books and movie scripts, with human-written questions based on human-written abstractive summaries. So what does this mean? Essentially, for each book or movie, there's an abstractive summary. Then, based solely on the summary, a question is written.
What this means is that we don't deliberately generate questions that are just extractive from the text, where you answer solely based on a local passage. It requires further thinking, because the question is based on the summary and therefore requires evidence from across the entire book. So, for example, here's a question, right?
How is Oscar related to Dana? The answer is that he's her son. So here's the summary snippet: Peter's former girlfriend, Dana, had a son. And then somewhere in the movie script: "That's a good looking kid you got there." Based on this, you have to make the inference that this is her son.
And, you know, they were already thinking about this in 2017, before ChatGPT could even do anything like this. So the data set: about 1,500 stories from Project Gutenberg, plus movie scripts from the web. They ask annotators to work from the summaries, essentially what I just mentioned, so that they don't just write localized questions that can be answered from a single local passage.
There are a few things here that we don't really like, but it can't be helped; this was probably state of the art at that point in time. They use BLEU, METEOR, and ROUGE metrics, and they provide two gold references for each question. We'll see that this actually doesn't work very well in the later papers we go through.
Yeah, so that's the, oh, go for it. Why don't we like these metrics, BLEU and ROUGE and so on? That's a good question. What we found is that they don't really correlate well with human judgments of whether an answer is good or not, because they're purely n-gram based. I see.
So how are you judging whether answers are good or not? Is it like a factual QA? Or is it like, you know, this is something that is like the right tone of what I would want an answer to be? How I think about it is mostly factual, summarization, inferential.
We don't really go into the, I don't really think about the tone or the style. I think right now, what I think is probably going to be good, there's probably going to be two sets of metrics. There's going to be reference-free metrics and reference-based. What does this mean? Reference-based means that you have a right answer, right?
You have the right answer, and you're just checking to see if it matches the right answer or not. I think that's very straightforward. A reference-free metric is when there may be no single right answer, but you just want to check whether the output is good enough or not. So an example of a reference-based metric is this, right?
How are they related? He's her son. Maybe the answer is "it's her child" or something like "she is his mom." Even if the answer isn't phrased the same way, an LLM evaluator is able to say, hey, this is a hit.
An example of a reference-free metric could be summarization, where we already know LLMs surpass human performance. So it could be: here's a summary, and here's the text. Is this summary a faithful representation of the text? One of the papers I'll be covering later actually talks about this kind of reference-free metric for evaluating faithfulness.
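To make the distinction concrete, here's a minimal sketch of both styles as LLM-as-judge checks. This is my own illustration, not from any of the papers; `call_llm` is a placeholder for whatever chat client you use, and the prompts are made up.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: swap in whatever chat-completion client you actually use.
    raise NotImplementedError

def reference_based_judge(question: str, gold_answer: str, candidate: str) -> bool:
    """Reference-based: does the candidate mean the same thing as the gold answer?"""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate convey the same meaning as the reference? Reply YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")

def reference_free_judge(source_text: str, summary: str) -> bool:
    """Reference-free: is the summary faithful to the source? No gold answer needed."""
    prompt = (
        f"Source text:\n{source_text}\n\n"
        f"Summary:\n{summary}\n\n"
        "Is every claim in the summary supported by the source text? Reply YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```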
Did I answer the question? So that paper is in the realm of fiction. Now let's go into non-fiction. Here's another paper that I quite like: a data set of information-seeking questions anchored in research papers, from the Allen Institute for AI. Their approach is also very interesting. They have academics read only the title and abstract, and then ask questions about the paper.
That's it. Right? So it's a very similar approach whereby they force annotators to only look at limited data and then try to see if they can ask the question or not. And of course, the reason is because when you give annotators the full text and you ask them to generate questions, they often will generate questions that can be answered just within one or two words.
But by forcing annotators to only look at the summary data, you actually force them to generate questions that are more open-ended. And in this case, this specific data set actually includes the evidence needed to arrive at the answer. And if you think about it, this evidence can be used in a retrieval evaluation pipeline: did our retrieval find the right evidence for this question?
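As a rough sketch of what that retrieval check could look like (my own code, not from the paper; naive substring matching, just for illustration):

```python
def evidence_recall_at_k(retrieved: list[str], gold_evidence: list[str], k: int = 10) -> float:
    """Fraction of annotated evidence snippets that show up in the top-k retrieved passages."""
    top_k = retrieved[:k]
    hits = sum(any(gold in passage for passage in top_k) for gold in gold_evidence)
    return hits / len(gold_evidence) if gold_evidence else 1.0
```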
So this is quite useful. 5,000 questions on 1,500 NLP papers. And I think the authors actually collected these questions themselves. 55% of the questions require evidence from multiple paragraphs. And this is because when you ask a question from the abstract, which condenses content from multiple paragraphs, you actually have to answer it from those paragraphs.
And 13% of them are actually super challenging, requiring answers from tables and figures. So I think that's probably the main thing I want to highlight. So again, annotators don't just generate questions that cannot be answered, right? First they generate questions, then they try to answer them. If the question is answerable, they select the minimal set of evidence snippets needed to answer it.
And of course, if the question is not answerable, they just throw it out. So in a sense, I think this data set is fairly carefully curated to ensure that all the questions actually can be answered, even though they took the approach of focusing on the abstract first. And then there are a couple of example questions that you can see here.
You can see that even when you do this, a lot of the questions are going to be extractive, where you just extract from the text. What we really want to aim for is the abstractive answers, which require some form of reasoning. And I think that's what this methodology encourages.
And of course, for the yes-or-no and unanswerable questions, they just skip them. Yeah, that's all I want to share about this. Next, I want to talk about L-Eval, long eval. This is slightly more recent, 2023. Doesn't discarding the unanswerable questions throw away part of what we care about in LLMs' answers, though?
Yeah. I think it's important in the sense that we just need to make sure we don't mark those questions against the LLM, or that we don't force the LLM to make up bullshit. So it's 20 subtasks. They have about 500 long documents, and these documents range from 3K to 200K tokens.
What we're really more interested in is the 20K ones. But beyond that, the data set they have here is really interesting. They created a lot of new data sets themselves: they scraped Coursera, and there's, I think, SFiction, science fiction.
It's a science fiction data set of true-false questions. And then they have a code data set. And then there's a long-context question answering data set based on investor relations, so it's finance. So it's a really mixed bag across education, fiction, code, and finance, all long context.
So then, of course, they also combine several publicly available data sets. Now, what's really interesting here is this finding. Of course, they use an LLM as a judge. I mean, who doesn't nowadays? But what they found was this. Oh, no, where is it? Sorry, it's a little bit messy.
So what they had to do was this, which is length-instruction-enhanced evaluation. The problem with a lot of the n-gram-based metrics is that they're very reliant on length. All the n-gram-based metrics are based on some recall and some precision computation, which needs some kind of denominator.
And usually the denominator involves the number of tokens in the answer, right? And therefore longer answers get penalized. So what they did is constrain the LLMs to roughly the number of words in the gold answer. And a key point they found is that the automated metrics failed to correlate with human judgment,
and, of course, that LLM judges are more accurate and robust to length. They also showed that with this length-instruction-enhanced evaluation, with the length instructions, they were able to improve ROUGE by 0.5 to 0.8, and similarly the agreement of the GPT-4 evaluator.
Essentially, what it means is that when you try different pipelines or different models to get your answers, you want to make sure the evaluation isn't unfairly biased toward a certain LLM that happens to give longer answers. So I think this is pretty standard now, but it's a good reminder, and they show data on it.
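The mechanics are simple; here's a sketch of what a length instruction could look like (assumed phrasing, not the paper's exact prompt):

```python
def with_length_instruction(question: str, gold_answer: str) -> str:
    # Tell every model roughly how long the gold answer is, so no model is
    # unfairly rewarded or penalized by n-gram metrics just for verbosity.
    n_words = len(gold_answer.split())
    return f"{question}\nAnswer in roughly {n_words} words."
```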
The next one, since Swix likes fiction so much, is another fiction paper. So this is NovelQA: question answering over documents exceeding 200K tokens. It's based on English novels, with a mix of complexities, et cetera. And they have golden answers, which is great.
Having these reference answers is really important. But what is really interesting is how they created the data set, with multi-hop, single-hop, and detail questions. Single-hop means you locate one piece of evidence somewhere in the text and answer from that.
Multi-hop means you need to connect evidence from multiple places; you need to jump all over the book. So this is actually pretty challenging, I think. And I think in a lot of regular Q&A, if you're asking questions of your documents or of your book, et cetera, these are the kinds of very difficult questions you're asking. You're probably not asking extractive questions, like what did Tom do, or where did Dumbledore destroy the first Horcrux?
Even for a question like where did he destroy the first Horcrux, you actually need to do a lot of abstractive reasoning behind that. So what is really interesting is that they tried to show what happens with the loss-in-the-middle phenomenon, and that's the graph I want to focus on here.
Also, in NovelQA they actually have two answer formats. The first is multiple choice, which you can see over here. And the second is generative, where the model actually has to produce an answer itself. And you can see, this is probably not the best graph, but essentially both of these lines are the multiple-choice setting,
and for the generative setting you try to match the output against the gold answer. They reference L-Eval, they reference Bamboo and LongBench. We won't have time to go through a lot of these, but for L-Eval, the paper we just covered, they actually reuse the same length-instruction-enhanced evaluation.
So token size, 200k tokens, this exceeds what Claude is able to do. But we don't know what Claude's tokenizer is. Anyway, so this data set, now over here, they are very explicit on how they constructed a data set, which is very helpful if you want to construct a data set yourself.
Essentially, you need a few columns. You need the question, obviously, and the answer. You also need the golden reference answer, right? The gold answer and the evidence. So now evidence, you can probably just say the evidence is probably the entire document up to the point that I read. Or you can actually go the extra mile by saying that these are the paragraphs where the evidence is or the spans.
And you can also try to label the questions to try to help you understand, hey, is my pipeline performing better on the more complex questions, multi-hop questions, abstractive summarization questions, et cetera. I kind of ignore the multiple choice answers because that's not very representative if you're trying to build a chatbot on top of long context.
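Putting that together, a minimal schema for this kind of long-context QA data set could look something like this (the field names are my own, just a sketch of the columns described above):

```python
from dataclasses import dataclass

@dataclass
class LongContextQAExample:
    question: str
    gold_answer: str
    evidence_spans: list[str]               # paragraphs or spans that support the answer
    question_type: str                      # e.g. "multi-hop", "single-hop", "detail"
    evidence_position: float | None = None  # rough location of the evidence in the doc, 0 to 1
```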
So very interestingly, they use public domain novels from Project Gutenberg. They also purchase e-books. I'm not sure what it means to purchase the e-book and then release it as part of a golden data set. I don't know if you purchase the e-book, you actually have the copyright and the license to do that, to share it out.
But I know that's what they did; I don't want to question that. So here's what the distribution of the questions looks like: multi-hop, which is the most challenging, then single-hop, then detail, which is really just extractive. Now, there are three really interesting findings, and I'm not sure I'd agree with their interpretation of them.
So what they found, in figure 3, is that accuracy drops beyond 100K tokens. Once your context goes beyond 100K, accuracy drops. That's what the graph on the left shows. I think that's somewhat clear. I think they're right to say that if you average all three numbers, you do see a stepwise drop.
Now, figure 3b, this is the one that's actually really interesting. In a sense, what it shows is: if I give the model some text and the answer is at some percentile of the text, say the answer is in the first 10%, or the last 10%, or right in the middle.
What this graph says is that there is no loss-in-the-middle effect. Regardless of where the actual evidence is, there's no loss in the middle. And this actually suggests a hint: if you are trying to build such a benchmark yourself, you want your benchmark to be positionally robust, in the sense that you want questions that can be answered from the first quarter of the book, the halfway point, the three-quarter point, and the end of the book.
You don't want to just formulate questions that are either just all at the start or all in the middle. You want to have this concept of positional robustness to understand, hey, where does my pipeline start to fail? According to this graph, they say that it does not fail. But now let's look at figure 7, which is in the appendix, where they split this up into two stages.
Maybe let's just focus on the GPT-4 and Claude 2.1 numbers. The first panel, on top, is when the context length is below 100K. And you can see that when the context length is below 100K, there may be some kind of loss-in-the-middle effect,
if you really squint at the GPT-4 graph, and maybe the InternLM graph. But when the context length exceeds 100K, you see it really just drops all the way through. So I think maybe there are two stories here. When the context length is medium, and look at us, 100K context is already considered medium now.
When the context length is medium, maybe the loss in the middle effect is present. When the context length is long, I guess the longest one they have here, which is more than 100k, you start to see that long context start to drop off. So that's a nuanced finding that they found here.
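If you want to run that positional-robustness check on your own pipeline, a minimal sketch (my own, not from NovelQA) could look like this: bucket questions by where their evidence sits in the document, then report accuracy per bucket to see where things start to fail.

```python
from collections import defaultdict

def accuracy_by_position(examples: list[tuple[float, bool]], n_buckets: int = 4) -> dict[int, float]:
    """`examples` holds (evidence_position in [0, 1], answered_correctly) pairs."""
    buckets = defaultdict(list)
    for position, correct in examples:
        bucket = min(int(position * n_buckets), n_buckets - 1)
        buckets[bucket].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```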
Now the last one. I know I'm going super fast; I just want to finish this in maybe three minutes, and then we can go into questions. The last one is especially interesting because it goes into multi-document Q&A. So far, we've only been focusing on a single document: a single movie script, a single book, a single paper.
But what if the answer requires multiple documents? So here's an example question: list the current assets of each of the above companies in order. You do need to go through multiple documents to answer this. And different documents could have different contexts, different styles of writing.
How is the LLM able to deal with this? So I won't go through this in depth. This is the Loong benchmark, L-O-O-N-G, which is pretty fun because loong also means dragon in Chinese; maybe that's what they intended. And you can see the different tasks they look at: given multiple docs, can you pick the right one?
Given multiple docs, can you do some kind of comparison? Hey, is Alibaba doing better than Baidu or Tencent? Comparison. Given multiple docs, can you do clustering and some kind of summarization on top of that? And then, given multiple docs, can you do some kind of chain of reasoning across the docs?
But one thing that's really interesting in this is not just how the benchmark was built, but this result that they have in this paper, in this graph over here. So this paper, I don't know, again, this is a question that I often debate. I don't know what the right answer is and I'm curious to hear what people think.
The blue line, the round one with the blue dots, is the baseline accuracy. And you can see baseline accuracy just keeps dropping. Fine. This is for GPT-4o and Qwen2 72B. Now, the other lines, the other colors, are when you use OpenAI embeddings and retrieve the top 10 documents,
and when you use BGE, I can't remember what BGE stands for, but it's another embedding model, and retrieve the top five. So BGE top five, and of course OpenAI top five and OpenAI top 10. What do you see here? Can anyone just shout out? What's the big trend here? Longer context, lower?
Yes. Yes. What is the other big trend? The other big trend is maybe we don't need RAG. I know that may be controversial, but you can see that using RAG, you always get lower accuracy. Now, I don't know if this is controversial or not, but think about it two years into the future: will models have half a million or one million tokens of context as standard?
I think that's very likely, yes. Will prompt-caching costs drop further? I think the answer is yes. Will long-context capability increase? I think the answer is probably yes. So it doesn't mean that we don't need RAG. It means that if you have a single document or a single book, or even maybe a Game of Thrones-length book, maybe you just put the whole thing in the context and you don't need RAG.
Now, you still need RAG if you're trying to query on the Library of Congress, which is all the financial reports. But from this example here, you can see that at 50K or 100K or even 200K, RAG actually didn't help, but it actually hurt. So that's all I had to share.
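To make that comparison concrete before questions, here's a rough sketch of the head-to-head being run here: full context versus top-k retrieval. The `llm` and `retriever` callables are hypothetical placeholders; this is not the Loong benchmark's code.

```python
def answer_full_context(question: str, documents: list[str], llm) -> str:
    # Everything goes into the prompt; no retrieval step at all.
    context = "\n\n".join(documents)
    return llm(f"{context}\n\nQuestion: {question}")

def answer_with_rag(question: str, documents: list[str], retriever, llm, k: int = 10) -> str:
    # Only the top-k retrieved chunks go into the prompt.
    chunks = retriever(question, documents, top_k=k)
    context = "\n\n".join(chunks)
    return llm(f"{context}\n\nQuestion: {question}")
```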
Any questions? Quick question on this last one, actually. So is this a single document or is this multiple documents? I suspect this is multiple documents because their data sets are all multiple documents. So in the long context, they're just throwing in multiple documents? Yeah. All the documents. I suspect it's multiple documents.
Yeah. So one thing that I ran into in the past with long-context evals: I was working on a legal long-context QA task, and they had overfit when trying to fine-tune a Mistral model. It was really good at documents at the start, the middle, and the end, but not at the other quartiles, because they tried to fix long-context QA by training on facts that were injected in the middle,
and then it had overfit to that. But then what happened was, when we did chain-of-thought questions with multiple documents, we noticed that it would just chain-of-thought through the whole context and be like, okay, I need this source and then I will do that. So giving it multiple documents kind of produced an internal chain of thought even when not prompted.
But then when we did it on one large document, it still wouldn't work well. So interesting little like note that we had on this specific thing because we tried, let's just throw in the whole couple documents. And it was, you know, I think it was 32K context, but that could still fit multiple, multiple documents and it could reason through, okay, I need to look at this subsection because now it has internal breaks, right?
So it could easily find, okay, the 17th document is here, and then it would look into that. But if we gave it one document that was 32K, it wouldn't do well. So, an interesting little note on this. Yeah, I think maybe that's how the fine-tuning was done.
Yeah. So the question I always ask myself, and not to be spicy here, is: to RAG or not? So here are the practical considerations, right? If you want to build RAG, you need to eval the RAG. And building the RAG, evaluating the RAG, maybe that's two headcount.
And we know evaluation is really hard, right? Evaluating retrieval is especially hard because it's always a cold-start problem. If you're asking questions of a document base, everyone has their own document base, and there's maybe only one person asking questions of it. It's really hard. Unlike recommendation systems or search, where you have free data coming from everywhere, customers constantly saying yes or no, whether they like the product you've surfaced or not.
But evaluating retrieval, I think, is really, really hard and really expensive. So as much as possible, I would like to not build RAG and not have to evaluate the retrieval component of RAG, and I try to do that. Now, if anyone here wants to tell me, Eugene, you're wrong, I would love to hear it.
And feel free to ping me on Discord or Twitter, anywhere. But in my mind, I think evaluating retrieval is actually extremely hard, especially in RAG, right? Because you're retrieving the documents and then summarizing them into an answer. The user will never give you feedback on whether the documents you retrieved are actually useful or not.
There's no built-in data flywheel to give you feedback. The user will only say whether the answer is good or not. Unless you have a dedicated team that actually says, this document is good, this document is not good, it's really hard. And how would they actually know whether a document is good or not?
Firstly, you'll never be able to measure recall. The best you can do is measure precision, and precision is not bad; it actually goes a long way. So this last graph here is really making me think a lot about all of this.
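A rough sketch of what "just measure precision" could look like, with a human or LLM judge standing in for relevance labels (my own illustration, not from any of the papers):

```python
def retrieval_precision(question: str, retrieved: list[str], is_relevant) -> float:
    """`is_relevant(question, doc) -> bool` is a human or LLM-judge call."""
    if not retrieved:
        return 0.0
    return sum(is_relevant(question, doc) for doc in retrieved) / len(retrieved)
```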
And yeah, so I'll stop here. I have two quick comments on that. One is, I wonder how other models will shape up. Like, we have Llama Guard as a hallucination detection model, which you can implement live, and it's great now. So will there be a sort of RAG-checker-type model that we can implement in systems?
And how much of an effect will systems have on this, right? Models that are specialized at this sort of QA task of, is this the correct document? Something past just embeddings and similarity search. Can we have a quick, small model that just cross-references documents and sees if we can optimize on them?
Similar to how we currently have, you know, RAG with re-ranking or HyDE embeddings, stuff like that. Will we just have models that abstract this away? And then the other point is, in two pretty large use cases that I built, in medical and legal, we actually found it pretty easy to have a human in the loop for feedback on whether the right documents are suggested.
So, for example, for lawyers, we had a quick QA sort of chatbot. And what happened was, if we, you know, automatically just had an output, basically everyone would just take the LLM's word, even if it was incorrect. So we had to scale back and be like, okay, what we ended up on was a sort of fill-in-the-blank multiple-choice style output.
So you would have a reference document, you would have a subset, and then you would have to fill in sort of what would happen. This was kind of our middle ground to make sure that there was a human in the loop. Because if you just have, you know, is this correct?
Check or X is what we did. Everyone would just press the check, and we had, like, a 90-something percent rate of just, yep, this is right, and people wouldn't really check the document. But if we gave them a snippet, and they had to actually find the result, they could either leave a comment that the result isn't there, which happened a significant amount of the time, greater than 10%, or they could get the reference, and it was still highly impactful.
I know we kind of used that data to build our own guardrail for the system, but I don't know. I found that there were pretty clever ways to get these human in the loop of, do we have the right document? And we actually did it from a guardrail of just, we can't have people just accept every AI response, because models were pretty shit.
We didn't have great LM as a judge, but that was just one use case we came around for that. That's pretty cool, Vibu. I hope you have nothing on plan next Saturday and Sunday, because that's the only thing that we'll be talking about when I'm in SF next week.
Yeah, I don't know if I can actually get regular non-paying users to do this. You can imagine, you know, I'm just reading my book. Do I actually want to provide the right data? I don't know. But okay, over to whoever's next. Okay. Six, you want to go next? I got next.
Ted, you had a question? Go ahead. Yeah, I think Vibu's next, but yeah, go ahead, Ted. Yeah, just a quick question, Eugene. I'm curious if you've looked at, like, the Writing in the Margins paper, that's the one I know about, you know, sort of trying to improve RAG instead of just doing vanilla RAG.
I think Sam actually got his folks to present the writing in the margins paper. So yeah, I'm familiar with that. I'm familiar with that. I have not invested too much time experimenting and implementing that. But yes, I'm aware of that. Thank you. Also, it's another note when you bring up, do we need rag or all the context?
So once again, shout out to Sam. The Writer team shipped their new model this week. One piece of feedback I gave was, let me just share my screen right now. Sorry, Sam, I'm calling you out live. But, so, where is this? Is this the right tab? This is the right tab.
Basically, they mentioned that, you know, it takes 22 seconds to process a million tokens. And my question is basically, like, how many people know how long any other model takes to process a million tokens, right? Like, as you go to, let's say I want 10 million tokens in context, you know, does this scale to now 10x longer?
Is it, like, over two to three minutes per query? And do people really, like, people are used to pretty quick answers, right? But when I'm building a RAG system, like, sure, we have agentic stuff, but are we okay with 22 seconds for a million tokens? And if you're saying this, is it, like, a good thing?
Like, is 22 seconds quick? How long are other models? I don't know this, and I thought I would go through it. But, yeah, I feel like it's also just another consideration, right? But, like, as we throw more tokens, yeah, these things take a while. Like, we're used to that instant response.
And, yeah, I don't know, just something to know if we just do long context. I'm assuming you guys do it. This was surprisingly difficult to find comparisons on. I was trying to build, like, a little, like, homegrown benchmarking thing to test other models, but I couldn't, I didn't have, like, the API levels or the, like, credits to be able to actually make it work.
I could test it on, like, some of the smaller models, because I could get my context window limits up higher. But yeah, it's actually not a solved problem yet, testing models on million-token requests right now. Yeah, a lot of it also just, like, depends on, you know, are you doing your own inference?
What hardware? Is the API provider optimizing for throughput? Time to first token? So, I guess it's just another parameter to think about. If you're doing long context versus RAG, we don't think much about what goes into RAG, right? They're just short responses. But, yeah, now if you're doing something like this, you've got to think about, oh, shit, I have millions of tokens per call.
How do I optimize for this? But, yeah, that's enough of my Writer thing. But check out the model. It seems cool. I haven't tried it. Thanks, Vibu. Your check is in the mail. Definitely not sponsored. But they have sick merch. Swix, you want to go next? Should I go next?
I thought we said it was you, but I'm easygoing. Okay, I'm struggling on my screen share. There we go. We're good. Okay. No, no, I got it. It works now. Okay, so I think I shared the paper at the beginning of the call. I'll share it again real quick.
So basically, this is a little survey paper. Something that I was kind of annoyed by is how before, here we go, we used to kind of have the default be that, you know, big companies make big model. Big model is fast and smart. And also, is my screen share working?
It wasn't working on this before. Yes. Okay, sick. So before, you know, models would get smarter, faster, cheaper, and that's great. Like stuff is getting cheaper for us. Now that we have reasoning models, well, they've kind of passed the cost on to us, right? Now I have to pay more for simple reasoning tokens, even though I have basic questions.
And I don't want to pay more. I want big OpenAI and Gemini to pay that cost, right? And I also don't like how stuff gets slower. So I was like, someone needs to figure this out; we've got to stop throwing reasoning models at everything. We still want great next-token predictors.
And like, too many people are using reasoning models for the wrong thing. So like last week, this reasoning survey paper came out. It wasn't the best, but you know, it'll be a good little fill in. So basically, they're trying to coin this term of reasoning economy. When should we use reasoning models?
And it's just an overview of the different techniques for post-training reasoning and pre-training reasoning. So chain of thought is a version of this. And then, how do these systems perform? How can we prune out chain of thought? This is kind of the first reasoning-economy survey paper that we found.
So I figured we might as well go through it a little bit. They have these two systems of reasoning, system one and system two. System one is computationally efficient but suboptimal. They also have a GitHub repo where they're tracking all this, but I forgot exactly how they split system one and system two.
Oh, so it's like reasoning models versus regular text-decoding models, right? So they start out by saying there are some inefficiencies here. We have models wasting tokens. We have fake thinking, where models are just outputting tokens. There are problems in RL, where we've RL'd models and now they're over-optimizing on length.
Chain of thought can help, but that's not at training time. So how do we look at what's going on here? There are two stages, right? There's post-training and then there's test time. Test-time thinking is stuff like MCTS, Monte Carlo Tree Search.
There's techniques like you could do speculative decoding, have a little model do stuff fast and then have a big model check it. You could do chain of thought. You could run multiple queries and kind of condense them. They also have the agent concept where you can have like an agentic model.
Then there's the post-training stage, where you can do this natively in the model. You can use SFT to kind of do reasoning; they show, for example, that a thousand-sample reasoning data set can kind of train in reasoning. Or you can do proper RL.
They dissect a lot of DeepSeek R1 in this paper. And then there are inefficiencies, right? In RL, you have a length bias, you have deceptive behaviors. We don't really know how to optimize all this. And at test time, at inference, we're making inefficient use of computation.
So for example, there's a paper they cite that does up to 1,000 to 10,000 parallel calls of Llama 8B, and they're like, yeah, we don't need 10,000 Llama 8B calls for one question. Then it's, okay, how do we address these issues, right? So there's the architecture level.
What are people working on for model level stuff for reasoning? Those algorithmic stuff, there's the data that goes into it. And then for the inference side, you know, there's stuff where you can force in an adaptive budget. So for example, you can give in a token that says, how much budget do you want in this question?
And then the model learns to follow it. There's basic routers where you can route to a small, medium, high model, and they sort of cite how O3 does this. So there's O3 mini, O3 high, regular O3. You can have a router. There's agents. There's the decoding side. So you can do this at inference time.
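As a toy sketch of those two inference-time ideas just mentioned, a budget hint plus a router (entirely my own illustration; the token format and thresholds are made up, not from the survey):

```python
def with_budget(question: str, max_thinking_tokens: int) -> str:
    # Prepend an explicit budget hint that the model has been trained to respect.
    return f"[THINK_BUDGET={max_thinking_tokens}]\n{question}"

def route(question: str, difficulty_score: float) -> str:
    # difficulty_score could come from a small classifier; thresholds are invented.
    if difficulty_score < 0.3:
        return "small-model"
    if difficulty_score < 0.7:
        return "medium-model"
    return "large-reasoning-model"
```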
There's different techniques you can do. And they just kind of go through all this. Let's continue on through here. So this is kind of, you know, a bit of background. What's going on here? What are some of the issues? What are the methods? So post-training, yeah, we have SFT.
You can do SFT and get basic reasoning. They have this cool little diagram chart here. It kind of shows different processes that we can do, right? So for the reasoning in training time, we can do post-training and we can do test time models. So we can do SFT, RL.
There's parallel methods where you can kind of, you know, shoot out a bunch of models, shoot out a bunch of queries, condense them down in a sort of pipeline. Then there's, and they show kind of papers that do this, how they perform, how we can prune out chain of thought, how we can make it more efficient.
There's sequential methods. So this is kind of like your pipeline approach, you know. You have human in the loop. You have guardrails. You have different tool usage. There's different sequential methods. For post-training, there's stuff like length biases. So, you know, models, there's overly cautious language models. There's fake thinking that comes in when you do RL.
This is just because we don't have the best RL methods yet. Then there's inefficient model usage. So, the unreasonable algorithm selection section: systems don't always choose the right algorithm to use. Are we using the right pipeline for the right task? Are we sending the right queries to the right models?
Unreasonable compute allocation. This is basically: do you need the biggest model for a basic task, and can you optimize little models? Then, on the other side, there's the data: how can we improve the data? And there are algorithmic changes. You can have a length penalty while training, or process rewards. They break down the two types of ways to affect length and reasoning in the output.
So there's the end output, where you judge the final result: is the math correct? Is the code correct? Does it compile? And then there's the process, where you judge the reasoning process itself: how do we affect that? Then there's long-to-short RL and adaptive budget-aware tuning.
So that's kind of fine-tuning with a budget. Then on the architecture side, there's also chain-of-thought compression, explicit and implicit. Explicit is where you directly prune 70% of the wrong chain of thought and then train on more efficient chain of thought. Implicit is where the model's architecture itself makes it such that different queries go through more efficient parts of the architecture.
There are recurrent layers, where you recurrently keep a memory state to do this. There's dynamic depth, model routing, multimodal stuff, knowledge distillation. So they talk about how you can basically distill out reasoning: DeepSeek did this into base Qwen models and Llama models, and how effective that can be.
Then improvement in test time. So on the input side, you know, you tell it, budget allocation, there's adaptive decoding algorithm, so kind of speculative decoding. On the output side, early stopping, search with pruning, constrained decoding, basically fun little chart of different stuff. If you're interested in any of this, they have a bunch of papers linked.
Okay, so they start off with: DeepSeek showed that you don't need to do SFT, you can just do RL. They did find that SFT still accelerates things, so in the final run they did a little bit of SFT before the RL. The core focus of RL currently lies in the design of reward signals.
So there's two type of rewards. There's process reward models and outcome reward models. Process kind of enables more fine-grained learning signals guiding the LLMs. And then there's the outcome, right? This is kind of where you look at what is the final thing. So is the math correct and whatnot?
ORM provides the supervision signal at the outcome level. Then they talk about test time. They show how ORM alone is still pretty strong: instead of looking at the process and pruning the chain of thought itself, if you only do ORM, so only RL on outputs, you still get really good capabilities, right?
Like, DeepSeek still had its aha moment, and it's still doing good reasoning. Test-time methods are parallel methods, where you have the LLM generate several calls, and sequential methods, which is your tree of thought, chain of thought, MCTS, beam search. And their takeaway here is that the full potential of reasoning is not yet achieved.
They have cool statistics here. Then in section three, we talk about, like, inefficiencies in this model training. So there's inefficient model behaviors from post-training, right? So first one is basically a length bias where you can have reasoning that's overly cautious, and this affects simple questions, right? So LLMs trained with RL tend to produce longer responses than SFT, that makes sense.
Now, there are two questions they ask: what are the reasons for longer responses, and does this increased length indicate a bias or an enhancement of model capabilities? With overly cautious reasoning, the model has excessive, unnecessary verification steps and redundant reasoning on easy-to-handle questions, with meaningless paraphrases and deviations.
We don't want that. Deceptive behaviors. There was some work done that, you know, models don't output the real thinking that they're doing. Fake thinking happens where, you know, a model is just outputting tokens, even though we can see that it has the answer pretty early on. That's kind of section 3.1.
So what's happening inefficiently? Then there's the test-time inefficiency stuff, right? Are we using the right hyperparameters? Do we have the right pipelines? Are we doing the right thing at test time? Unreasonable computation allocation: this is that example of scaling computation, scaling Llama 3 8B Instruct to generate 1,000 to 10,000 samples for simple questions.
There's reasoning boundaries. But basically, they want to say that they emphasize the importance of adaptive computation allocation based on task complexity. So for more complex tasks, we should have, you know, a better way to allocate resources. Okay. Section 4 is kind of on two things. So part one is the data.
So what data is used for reasoning? Basically, explicitly encoding desired reasoning patterns is one way, and we can do this even with basic SFT. There was that paper where they showed a thousand diverse SFT samples can get you basic reasoning on par with O1 preview at the time.
But what's most important is the quality, diversity, and difficulty of this data. Then there are the algorithms that approach this. They have this long-to-short RL. These are strategies like model merging of different models that try different things, shortest rejection sampling, DPO optimization.
This one showed you can get about a 30 to 40 percent drop in tokens with no accuracy drop. Budget-aware tuning was another one. This is an approach where budget prediction and allocation were implemented, and it achieved a 67 percent reduction in response length with only a 3 percent loss in accuracy.
They have a cute little diagram of these different things. How they're budget-aware is basically, similar to O1, you have easy, middle, hard questions, and you have an output allocation you want for each question. You shave off a bunch of response tokens for only a 3 percent loss.
Chain-of-thought compression is another one. You've got explicit and implicit, and they're kind of as they sound: implicit is model-based, where the architecture itself tries to compress the reasoning, and explicit is where you directly prune the chain of thought in the data.
Both of these work. There are different benefits to both, and they have papers you can follow along with. OK, architecture-wise, there's system one and system two cooperation. This is where you have a routing layer, a system like O1 where you have three distinct thinking models.
Right. O1 low, middle, high. You kind of have this model routing. There's model to model collaboration. So you have pipelines that have different models. This is stuff where you have things like speculative decoding. So speculative decoding, there's first a small model that generates candidate tokens, then a big model that verifies them in parallel.
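A very simplified sketch of one speculative decoding step (not a real implementation; `propose`, `verify`, and `sample_next` are hypothetical interfaces standing in for the draft and target models):

```python
def speculative_step(prompt_tokens: list[int], draft_model, target_model, k: int = 4) -> list[int]:
    draft = draft_model.propose(prompt_tokens, n=k)        # k cheap candidate tokens
    accepted = target_model.verify(prompt_tokens, draft)   # longest prefix the big model agrees with
    if len(accepted) < len(draft):
        # Big model disagreed partway; take one token from it and continue next step.
        accepted.append(target_model.sample_next(prompt_tokens + accepted))
    return prompt_tokens + accepted
```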
That gives a two to three X speedup. Then they have the whole topic of knowledge distillation: you have a big model, you distill it down. And, you know, distilling DeepSeek R1 into Qwen 2.5 32B outperforms applying RL directly on the base model; distilling and then doing RL was very effective.
Then they have like more architecture, architecture stuff. So you've got adaptive active parameters. So this is more so like where you can have a recurrent layer and you can sort of add different depth to different queries. So the model itself has different depth that different queries are passed through.
There's different research going on in this. Dynamic depth. What else have we got? Then we've got test time. So inference time, how do we add inference time outside of the model, outside of RL, how do we adapt how much compute is spent? So you've got adaptive budget decoding. So this is stuff where you've got like budget prediction.
Budget prediction is where we basically train in the ability to tell the model, I want this many tokens to be produced, and see how well it follows that. There's budget-constrained generation. There's early stopping, search with pruning, adaptive selection, constrained decoding.
And then that's kind of the survey of what's happened so far. Then they go into a discussion of what's still not being done. So there's this topic of multimodal reasoning, right? Right now, large reasoning models, they call them LRMs, are text only. So what happens when we have multimodal reasoning models, like vision models that also reason?
How are we optimizing for that? In their survey, currently for multimodal reasoning models it's all architecture-level stuff: model architecture optimization, so lightweight vision encoders, vision token compression, vision-language projectors, smaller language models, efficient structures, efficient vision technique adaptation, ViT quantization. So, you know, evaluation of this isn't really being optimized.
Then there's efficient agentic reasoning. So stuff like deep research: how do we efficiently do agentic reasoning? They bring up some benchmarks, like DNA Bench and Humanity's Last Exam. How do we start to efficiently benchmark these things, right? And then, are we optimizing for outcome versus process efficiency?
And then there's different benchmarks that they bring up for these. This is kind of like the Mechinterp side. So Anthropic is doing some interesting work on what's actually happening during this reasoning. And then that's kind of it, you know, they kind of bring up, here's a bunch of sources.
Here's different stuff on reasoning, here's how we can optimize different things. They've got a bunch of different sources, but it's not the deepest paper, you know, a 15-minute overview. That's some of what's happening in reasoning model optimization. Okay, I want to give Swix five, six minutes, but any one or two questions?
Not from me. There's not much time anyway. Yeah. Okay, Swix, passing it to you. Okay. Today we cover a one-day-old paper, so very of the moment. Can you see my screen? Yes, you can. So this is The Leaderboard Illusion. I think basically it is Cohere not doing well on LM Arena and then Cohere saying this is fucked up.
But they are correct. I mean, this is open secret for a long time that you can somewhat game LM Arena and it is somewhat pay-to-play. But it is, I think I'm not as negative as some of the others. I think this is just capitalism at work. Anyway, I do also probably think that it was a bad idea for LM Arena to announce that they are becoming a company at the exact same time that people are questioning their commercial business model.
And this is the result. This could be the end of them if they don't handle it well. So Cohere says four things. One, and this is the most obvious thing that was going on: at any one point in time, Gemini had like three or four different variations on LM Arena.
And then they just released whatever score was the highest. Therefore, on a normal distribution, you just get a very skewed result; this is literally p-hacking, which is not fair to everyone else who does not have that capability. And basically, they were alleging that only four labs had access to that.
I know this from off-the-record stuff as well, from other people who were trying to submit things to LM Arena. They also, like, sold data access; I think there are some arguments about the kinds of battle data that LM Arena was exposing. And then they also accused LM Arena of silently removing models, beyond the models that were officially removed.
So about 66% were silently removed. I think this graphic is just a really good overview of what the accusations were. They have some evidence for them; some of it is stronger than others. But I would just highlight the conclusions for folks as well, because I think the sensitive thing about this is that Cohere itself is a pretty large lab.
And for them to criticize LM Arena, which is basically two Stanford, two UC Berkeley guys, is, you know, punching down. But obviously LM Arena has a lot of influence. So I think the important thing is to have constructive suggestions around what to do, given that LM Arena is a thing, right?
Like, I think in a fair world, maybe a lot of people would rather that LM Arena just didn't exist. But now that it does exist and people do use it, what do we do about it? So they had some really good suggestions, I thought. One, don't allow people to retract scores that do badly.
It's like, you know, just because you did badly, you don't get to hide it. Two, limit model submissions, so that you don't get to spam models and only promote the best ones. Three, have equal levels of model removals between closed source and open source; don't favor closed source.
Four, implement fair sampling. I love this one. Okay, so this is very fun, right? LM Arena is a sampling problem, meaning they have to find workloads, match models up, and try to arrive at some reasonable ELO number.
It turns out that the authors of LM Arena originally had a methodology which was more like active learning, and they abandoned it. I really like this sentence: "This formulation avoids simply favoring large proprietary providers and instead effectively prioritizes under-evaluated and high-variance pairs." And this phrase, high-variance pairs, made me realize, oh yeah, that's obviously what you should focus on.
This is very active learning in the sense of: if you have a pair with a lot of disagreement, you should focus your battles on that pair to lower the variance by increasing the sample size.
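As a toy sketch of that idea (my own reading of it, not the paper's actual algorithm), you could sample matchups in proportion to how uncertain their relative rating still is:

```python
import random

def sample_battle_pair(pairs, rating_variance):
    """Sample a (model_a, model_b) matchup with probability proportional to its
    current rating uncertainty, instead of uniformly or by submission volume."""
    weights = [rating_variance[p] for p in pairs]
    return random.choices(pairs, weights=weights, k=1)[0]
```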
And apparently they had a paper on this, and they ended up not doing it. So that's kind of interesting. And then finally, provide transparency. This is fine, this is normal. Yeah. Any questions or debates about this one? What about the Llama thing, what happened with Llama 4? I think they published this before Llama 4.
I don't think Llama 4 was discussed. Oh, yeah, there you go, in the lead-up to Llama 4. But I don't think they had the Llama 4 chat issue. That first paragraph is crazy though: 27 variants. Yeah. I think it's interesting. They say substantially higher sampling rates for OpenAI, Google, xAI, and Meta.
So one, two, three, four, but xAI is like a lot lower than the other two, and then Meta is even lower than that. Amazon, I think, was decently, nicely treated. But yeah, I think now it's kosher to say that the one treated the worst was Reka.
And I heard about it directly from Reka, so it's very sad. Yeah. At the top level, they also put out the most models, right? Google and OpenAI have the most models compared to Meta. Yeah, but Reka had trouble submitting Reka 2 and Reka 3. So yeah, I think this is a useful pushback on LM Arena.
And I really liked that. I thought it was very classy that they, first, they actually shared this with the LM Arena team before publishing this paper, which I thought is just responsible disclosure. Cool. That's a short paper. I thought it was very interesting. Senpai actually said like, oh yeah, LM Arena is dead to me.
Now OpenRouter is my best friend. So now the same dynamics are going to apply to OpenRouter, because that's how these things work. So yeah, now he's talking about OpenRouter rankings, and basically this favors cost, is my general takeaway. So when Gemini 2 Flash launched and it was free, it suddenly shot up to, you know, one of the most popular APIs, which is not a surprise because it's free.
Yeah. Cool. Okay. Did people have questions? I'm not seeing the chat. Cool. All right. I think that's it. We're out of time. I'm happy to talk about AI News another time, but there was somebody who had a question about the search on AI News. And basically it is all pre-built.
That's why it's fast because there's no compute. Everything's pre-indexed. So if I do like... I mean, doesn't, isn't that search whereby you type a keyword and it was actually able to do the lexical side? Yeah. Yeah. I don't think, I don't think, I don't know. Someone was impressed by that, but I don't think it's particularly much to it.
Like, I have a year's worth of content in here. It's not that much; everything can fit in a JSON file. Seems like we didn't have time to discuss what's up next week, but I think it's the Llama series, right? Yeah, the Llama series. So, okay. Next week we're talking Llama.
Thank you, Rafa. I thought I could confirm with him, but... Yeah, most likely. I haven't talked to him in a while. Okay. Either it's Tom or Vibu, I guess. No, I'm kidding. Okay. Take care. Thank you, Rafa. Bye.