
Evals for long-context Q&A (ft. Eugene Yan) + Ernie 4.5 Technical Report


Transcript

Okay, I'm hearing that you guys can see this. Okay, so today, I want to share about this thing that I wrote recently, which is how to evaluate long-context question and answer systems. There are two parts to it. One is essentially a summary of how to do the evaluation.

And after that, we have another six papers. I did a literature review of a bunch of papers, and then we'll summarize six of them. So one thing that makes long-context Q&A slightly different from shorter Q&A is that for longer context, you could actually retrieve a lot of irrelevant details, which leads to information overload.

And then, of course, that leads to more occurrences of hallucinations. And for longer context, you might actually have bigger questions that come up, like, hey, can you summarize everything that I've read so far? Or what is the main theme of this person's thesis, etc. And you may also have questions that require you to hop across multiple chapters.

What were the horcruxes that Dumbledore destroyed? You have to hop across different parts of the Harry Potter book, or even across the entire series. So that's why it's maybe a little bit more challenging. So this is essentially what we're going to discuss today.

We'll go through the dimensions of faithfulness and helpfulness. And we'll also talk about how to build a Q&A dataset. I don't think I'll go through the existing benchmarks; you can read my summary of those in the write-up on your own. And then after that, we'll go through how to evaluate.

So the key evaluation metrics, I think there are two. There are definitely more than two, but these are the two key ones that I think most people focus on for Q&A. The first one is faithfulness, which is how strictly the answer relies only on the source document. So I think this is important, right?

Let's just say you're asking questions about Squid Game, since the new season of Squid Game just came out. I know it's like maybe an alternate reality, or maybe it is reality. And you're asking a lot of questions. You want to make sure that it's answering questions solely based on the Squid Game show.

And it shouldn't be including things from outside of Squid Game, even from the real world. That's what faithfulness is about. And a big chunk of that is actually knowing when to say, I don't know. So if you're asking a question, and the answer actually doesn't exist in your context, the model should be smart enough to say something like, I don't have that information in the provided text.

That is actually the right answer that we would expect, right? We don't want the model to try to make up something. And depending on what you want to do, we may not want the model to use its own knowledge. And we may also not want the model to use web search.

Well, it depends on whether we have the web search toggle on or not. But that's one thing. So if you care about faithfulness, this is the main thing that you should be focused on. When we consider faithfulness, you can think of false positives and false negatives. A false positive is when the model makes up an answer that doesn't exist in the source; this is essentially a hallucination.

A false negative occurs when the document actually has the information, but the model says that it doesn't exist. This could be because you're doing retrieval badly, or it could be because the model just can't pay attention to that large amount of text.

And we sometimes call this the lost-in-the-middle problem. Oh, Sean, I had a fun hallucination today. Gemini completely made up timestamps and summaries for our podcast based only on the title. So that's one of those things, right? You're using a model in your ETL pipeline, and it shouldn't be making things up, because we kind of expect it to just work at ETL.

And also, there's a small distinction we want to make: we want to distinguish faithfulness from correctness. So again, imagine we had a historical fiction with an alternate timeline, where World War II had a different outcome. If you ask a question about it, say to summarize what happened after the war, you should be getting what is faithful to that alternate-timeline historical fiction, instead of what actually happened in the real world.

Because I think that's what users would expect. So that's the small difference. In this case, I care about faithfulness, but you may care more about correctness. But essentially, you want to distinguish those two. Now, the other thing is that a faithful answer isn't always a helpful answer. So maybe you're asking a question like, hey, what was the crime the person committed?

And instead of answering, the model says, oh, you can find the crime the person committed on page six or page nine. That is a faithful answer, but it's not a helpful answer. Or maybe the model could say something that's completely irrelevant, like, oh, the law says such and such, whatever law it is.

It doesn't actually answer the question about the crime that was committed, but it is faithful to the text. So in that case, those faithful answers are still not helpful. So how do you think about helpfulness? I think there are three main dimensions to it, and some of these are in tension.

So relevance means, does the model actually answer the question? Comprehensiveness means, does it have sufficient detail? And sufficient detail is almost like a recall problem, right? You can just spam all the details and be very comprehensive. I think the balance here is with conciseness, whereby you are comprehensive, yet concise.

So that's the tension between those two. And in this study by Shi et al., they found that domain experts preferred answers that were comprehensive and faithful, especially for long-form questions. In contrast, crowd workers, like on LMSYS, often emphasize surface aspects such as conciseness, level of detail, emoji, or more flowery language.

So those are a few things you want to think about when you're building your Q&A system: is it for experts, people who know the content well, or maybe not? So that's the section on the two dimensions we're considering. Any questions here, before I proceed?

I guess not. Some of these, I mean, mostly require a golden dataset, right? Some ground-truth golden dataset. I think, for a start, yes, we need some ground-truth golden dataset to calibrate our LLM evaluator. But then once we have it calibrated, we actually don't need it. I think that's how I think about it long-term.

And I'll actually walk through that. Oh, but Sean, Sean doesn't seem convinced. No, no. I mean, sure. Like, that's the only way to extend it beyond golden dataset. It's just not satisfying because it's so easy to get out of distribution. Oh, that's correct. That's true. So when you're building a golden dataset, it does have to be representative-ish enough.

And I think the benefit of building with LLM evaluators is that I'm not sure distribution drift matters that much. For training data, you can imagine, let's say you're doing e-commerce recommendations: you will always have new items coming out, so the distribution will drift. Whereas for language or concepts, how you classify or how you do Q&A on financial documents doesn't change much, unless there are new laws or new concepts coming out that set precedents.

I don't think that happens as often. You might need to refresh every now and then, but I think the drift is a little bit easier to tackle than for most other, more conventional machine learning models. Shulang has a question: based on experience, to implement the faithfulness and helpfulness metrics, would you recommend starting with an eval framework or just writing your own vanilla LLM-as-judge?

Personally, and maybe this is a hot take, I actually don't use frameworks. I really like to just use the raw thing on its own. I find that a few times I've used a framework, eventually the framework really gets in the way of me trying to understand what's really happening.

And on the flip side, it tries very hard to do a lot of magic, whereby you can write very little code and get something working. But personally, for me, that really gets in the way. Maybe I just need to be more okay with vibe coding and vibe evals.

But that's my personal opinion. Yeah, MCP solves everything. I think MCP does go a long way in solving a lot of things. I'm a little bit more MCP-pilled now. But we can talk about that later if you want. I have a bit of a question on this. So in normal chatbots, beyond faithfulness and the accuracy of your responses, another axis you want to evaluate your system on is: do you understand what the user is asking?

And is the user even asking the right question? Like, are they asking the question that represents the answer they want to know? So when you tie that in with a Q&A benchmark, you might be answering what the user asked, but not what they actually want, because they're not asking the right question, right?

Because, like, it's a skill issue in asking questions, right? Like, in Discord last week, someone was really trying to find a paper about, like, what AGI and the AI future looks like in a few years. And I'm like, oh, is it AI 2027? People recommended three or four others. And basically there's an answer they're trying to get and there's a question they're asking that might not sync.

But, yeah, how do you deal with that with these sorts of Q&A systems? It kind of overlaps. That's a great point. I hadn't even been considering that. I think it's really challenging: the user is asking a question, the question may not be quite aligned with what they want, yet we have to try to guess what's on the user's mind.

I think for search queries, for things where we have a lot of organic data, like search: maybe the user just types iPhone, but you realize that they're not really looking for an iPhone, they're looking for an iPhone case. With enough organic user feedback data on what people actually click on and actually buy, we can do that.

For Q&A systems, I don't know how to think about that right now. I guess with enough data, we'd get enough questions and realize that, okay, the faithful answer is maybe not what they really want. But as for getting to the actual answer they want, I don't know how to fine-tune the model for that.

The few domains that you can think of are like, if you have customer service chatbots, right? Like Intercom Fin, or if you have like a chatbot that's like for an insurance agency, you have like most frequently asked questions, right? So a lot of questions fall into a few buckets, and most of the time, users are trying to get to the same conclusion, right?

So even though they're not asking the right question, what are they trying to get towards? Does it fall into one of those buckets? In that sense, the Q&A might have different sources. So anyway, side tangent. Yeah, I think that's a great point, whereby if you find that this is what your users want, you may want to pre-generate those questions for them eventually, or put it into your prompt.

Yeah, just coming from the background of working on one of those: our QA metrics would always go down, but end satisfaction would go up when we optimized for other stuff. So even if the direct QA answer was wrong. Yeah, that's something to keep in mind.

Next thing is how to build an evaluation data set. This is my take on it. I'm not sure if this is the best way, but I found this to be fairly helpful. I think the first thing is really building questions, synthetic questions, right? And, you know, everyone's doing this right now.

You can just, there's nothing unusual. Just prompt the LLM to come up with questions. But how to prompt the LLM to come up with questions? I've seen people do it in bad ways, and then I've seen people do it in better ways. So, for example, imagine if you are prompting an LLM to generate questions about a book.

You can just say: generate questions about this chapter. Now, this is going to go all over the place. It's really hard to constrain what you want, and you'll probably end up with very fact-based questions. But if you specifically want to test the LLM's ability to answer questions about characters, here's what I might do.

I might ask it to summarize the main characters in this chapter, then generate one question about each character's backstory. So you can think of the first sentence as almost like a chain of thought: first, think through which characters are in this chapter and what they have done.

Then, based on that summary of the main characters, ask a question about their backstory. So this is how I steer the LLM towards questions about backstory. And you can imagine, if it's a narrative: questions about backstory, questions about setting, questions about plot, questions about climax, etc.
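
As a concrete illustration, here's a minimal sketch of that two-step prompt in Python. This isn't code from the write-up; `call_llm` is a hypothetical wrapper around whatever chat-completion API you use, and the prompt wording is just an assumption of how you might phrase it.

```python
# A minimal sketch of the two-step question generation described above.
# `call_llm` is a hypothetical helper that wraps whatever chat-completion
# API you use and returns the model's text response.

BACKSTORY_PROMPT = """\
You are generating evaluation questions for a long-context Q&A system.

First, summarize the main characters that appear in the chapter below
and what they have done so far.

Then, based on that summary, generate one question about each
character's backstory. Return the questions as a numbered list.

Chapter:
{chapter_text}
"""

def generate_backstory_questions(chapter_text: str, call_llm) -> str:
    """Steer the LLM toward one question type (backstory) instead of
    asking for generic 'questions about this chapter'."""
    return call_llm(BACKSTORY_PROMPT.format(chapter_text=chapter_text))

# The same template can be cloned for other aspects (setting, plot,
# climax) by swapping out the second instruction.
```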

So, this is how you might want to do it. And I was inspired by two papers, which are covered in this post. The first paper is NarrativeQA. Oh, I realized I actually didn't even share this post. Let me share it right now, so you don't have to be taking down notes.

We have the recording, and everything is available. So the first one is NarrativeQA. They deliberately generate questions based only on summaries rather than the full text. This encourages questions that are higher level, more narrative, that may require multi-hop reasoning rather than just shallow fact recall.

And QASPER did the same thing with abstracts of academic papers. So when you're generating questions, you also want to have a range of questions, right? Here's an example of some ranges that I might have. The simplest question, I think, is fact recall. Who is the protagonist? What treaty was signed?

It's maybe a short phrase, and you can probably find it in a single sentence or maybe a paragraph. The next one is definitions. Okay, what does this acronym mean? Maybe a user is reading a very dense paper: what does this acronym mean, based on everything in the paper so far?

So this tests the LLM's ability to, again, do something like a summary: explain the acronym based on the paper so far. The next one is summarization. Summarize the main findings of the paper, or, if I'm in the middle of the book, recap what has happened so far, etc.

The first three are fairly standard; there's probably a right answer that you can have a reference answer for. The next three are maybe a little bit more open-ended. So, for example, here's a question: why did Snape choose to behave the way he behaved towards Harry Potter?

That's a question about motivation, about interpretation. And depending on which book of the series you're asking the question on, you may not have the true answer that Snape shares, so the LLM may have to infer. That's a little bit more difficult. Or: what can we infer about society from these laws?

Maybe in The Three-Body Problem, right? So that's a bit trickier. And then there's also my favorite kind of question, which is the no-info question. Anytime you have a book, anytime you have a Q&A, people are going to red-team it. So, for example: what did Gandalf do in the final battle of Hogwarts?

Well, Gandalf wasn't in the final battle of Hogwarts. The LLM should be able to know that, okay, yes, Gandalf and Dumbledore, these were the wizards that carried the battle or carried the series. But it should be smart enough to say, hey, Gandalf is not in the final battle of Hogwarts.

Are you referring to Dumbledore instead? So these are the cases where you want to make sure the LLM does not just answer the question. The other thing that I think is useful is to create questions that are robust to the position of evidence within a document. What this means is that, let's say you have a document that's maybe 100 pages long.

You create a question based on the first 25 pages. You create a question based on the first 50 pages and then the first 75 pages. And then you try asking those questions given all 100 pages of the document. So, the evidence is in the first quarter, the first half, and the third quarter of your document.

And then you ask questions like this to see where it starts to fail. In the past, you might see this lost-in-the-middle pattern. In my own benchmarks and evaluations, I actually haven't seen it anymore. It's definitely still something useful to test, but it doesn't happen much as far as I can tell.
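
For reference, a rough sketch of that position-of-evidence check might look like this. None of these helpers come from the write-up; `generate_questions`, `answer_with_full_doc`, and `grade` are hypothetical functions standing in for your own question generator, Q&A system, and grader.

```python
# A rough sketch of the position-of-evidence check described above.
# The 25/50/75-page split mirrors the example from the talk.

def position_robustness_eval(pages, generate_questions, answer_with_full_doc, grade):
    """Build questions whose evidence sits in different parts of the
    document, then answer them with the full document in context."""
    full_doc = "\n".join(pages)
    checkpoints = {
        "first_quarter": pages[: len(pages) // 4],
        "first_half": pages[: len(pages) // 2],
        "first_three_quarters": pages[: 3 * len(pages) // 4],
    }
    results = {}
    for position, subset in checkpoints.items():
        questions = generate_questions("\n".join(subset))
        answers = [answer_with_full_doc(full_doc, q) for q in questions]
        # grade() returns 1 if the answer is faithful/correct, else 0
        scores = [grade(q, a) for q, a in zip(questions, answers)]
        results[position] = sum(scores) / len(scores)
    return results  # e.g. {"first_quarter": 0.9, "first_half": 0.85, ...}
```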

Any questions here before we move on to our last section? I have an interesting little thing. So, I'm curious if you've tried this or if you think it would be useful. But when you have a long context QA, this is kind of where a needle and a haystack kind of grew out of, right?

Have you considered with domain-specific stuff? So, let's say like Harry Potter, if you inject in facts and you change facts, is this a good way to eval your QA system? So, like with the Harry Potter thing, right? If I add in a sentence, let's say I add in Gandalf from Lord of the Rings, and you make up a little scenario of who wins that battle.

Let's say instead of like needle and a haystack where you just add in like a sentence, if you modify the story itself, so the model doesn't just think it's like a little prompt injection. Let's say you add like a paragraph or like a little section, does it still do good at those, right?

Because I feel like it's hard to distinguish the model's internal knowledge versus the RAG context. And if you want to benchmark your QA and really see if it uses the context, are there any fun little things like that that work well? Oh, no, that's a good point. So I think that's a good idea, to inject it and see if the model can actually pay attention to this context that is out of distribution.

Really, in this case, Harry Potter is definitely in all the models, because there are just so many reproductions of it on the internet. That's definitely something to test. The other thing to test on is documents that you are certain the model has not had access to.

And these are great documents to test on. So, it could be maybe newly published, your team's own confluence or wikis or documentations that you know the model doesn't have, but you are very familiar with, you can generate questions and answers on. Or newly published books or movie transcripts. So, those could be good things to test as well.

I haven't tested the one where you insert red herring content into the long-context documents themselves and then ask questions about that. I haven't seen any examples of those myself. I think it would work very well. My broader interest in inserting these sorts of red herrings is actually not to see if the model can get off of its internal knowledge.

It's more so for multi-hop QA, right? So, if you can inject in clauses, like let's say I have an insurance policy and I randomly inject a clause saying, you know, if this happened in the month of May, now the model now has to infer, was it May? Is it today?

Can it effectively QA, right? Because can it deduce step by step? Can it find the first thing? And like, are you even, this is a good benchmark for your retrieval, right? Are you retrieving the right stuff? Are you even retrieving clauses? So, yeah, that's just where I come from in long QA.

Because now, and then what happens when you add n degrees of complexity with this stuff? Because in the real world, like this is still simple, right? If you have document or documents that have clauses, that's cool. But if we have like agentic systems that now need multiple tool calls to look up these, like you have to find the clauses, right?

So, step zero is really just you give it clauses. You give it documents with clauses and they're fake. So, they're not like super real. But then, you know, when you take it to the real world, you need to, they have to also find out, are there any clauses, right?

My like very basic example is you go to San Francisco, you try to park somewhere. You try to park downtown and you have like seven signs telling you no parking on Thursdays, no this time, no that time. Yeah. And like, you know, now you have context. You don't even have the retrieval problem and you're still struggling.

But that's where I try to push it. So at one level, you have to web search for what the clauses even are. And then in the Q&A, if you just add them in, can it do the multi-hop? Yeah, this kind of scenario occurs naturally in some of the long-context QA I work on.

So, for example, assuming you work on movies, there are always a lot of plot twists or hidden narrators. One example is The Sixth Sense: before and after you realize that the guy is actually dead and a ghost, the answer will be different.

So, the question that Vibu has, I actually don't know. I've never tried that. Oh, wow. Corey's comment is actually brilliant. That is one of the biggest headaches; I think a lot about it and work a lot on it, but I just haven't written about it here because I didn't think anyone else would be interested in it.

But there's a lot that goes into Q&A for movies and books to make sure there are no spoilers. And you'd be surprised how often the model is going to try to be very helpful and just spoil the book or the movie for you.

So, that's hard. Lewis has a question. Why don't we have multiple levels of no-info questions as we have multiple levels of available information comprehension? Totally, partially, slightly wrong info. Maybe it will be something like an error analysis or it just doesn't work that way. Well, I think that you could have multiple levels of no-info questions, right?

And I think as you start to get into labeling the no-info responses, you probably need to get into that nuance. Personally, I think it's just easier at the start to keep it binary. And because I'm generating these questions — okay, I think maybe Lewis is thinking of asking questions where there is completely no info, or asking a two-part question where you have info for the first part but maybe not the second part.

I think that's a good idea. I haven't considered doing that; I haven't tried it. For me, it's really just very standard: asking questions that I'm certain are not covered in the document, and expecting the model to say, I don't have the information for that. That's just easier for us to evaluate.

I think it's a good idea: as our evals on these no-info questions start to saturate, we might want to make them a little trickier, with partially-no-info questions, et cetera. But I haven't thought about that yet. So, you mentioned this earlier — the focus here is not really on evaluating questions that are inherently ambiguous or require a follow-up.

So the example you mentioned earlier about Gandalf and Hogwarts and you're like, well, maybe you're referring to Lord of the Rings instead. Yeah. So that's an example of something that would require a follow-up. Yes. But there's other cases, like in the case of customer service, like what's being mentioned earlier.

Maybe the customer has a question, and you realize that in order to answer it, you have to distinguish between two situations. Because in one situation, the answer should be X, and in the other circumstance, the correct answer is actually Y. And if the question does not include that context, you need to ask: wait a second, the answer depends on whatever that Z context is.

Yes. That's not captured in this framework, correct? You're right. It does not. Okay. So what I'm talking about here is really very basic, at a high level. You will have to tweak this to the nuances of your use case, where you make sure that you're suggesting follow-up questions that are useful, so that you're not answering until you've asked a follow-up question to get good context.

Just like how deep research will ask these questions before it kicks off deep research. Or you might want to say that, okay, I don't have the answer. And you might want to evaluate the helpfulness of follow-up questions as well that help you try to get to the answer. Yeah.

Thank you. You're welcome. Have you looked at the deep research benchmark, by the way? I have not. Have you looked at the QA system? BrowseComp? Yeah, BrowseComp. Sean spent some time rerunning deep research on it.

He doesn't have the greatest things to say about it anymore. It's just not really related to the task of deep research. But, I mean, it's still a challenging benchmark. It's just testing very, very good retrieval on the open web rather than the specific task of generating a deep research report.

Yeah. So, I have a hot take. I think that for long documents, if it's a single document, you may not need to do retrieval. For multiple documents, depending on the size of the book, I think it's just really easier not to do retrieval. Evaluating retrieval is actually really hard.

Trying to do good retrieval is, I think, also really challenging. Especially since you need to think about how to do it asynchronously or in multiple steps, providing the user feedback, dealing with all that latency. I think it's challenging. I think you can do without a lot of that and simplify things a lot.

But that's just me. And I think the other thing that makes deep research different from long-context Q&A is that for long-context Q&A, the search space is bounded. You're limited to the document or the library of documents, and it's relatively bounded. For deep research, the entire World Wide Web — okay, maybe it's sort of bounded, but it's so large that it's effectively unbounded.

You could always find something, you could always get more information that's really valuable. And therefore, it's hard to know when to stop the deep research. Okay, finally, we'll go into the methods to assess the Q&A performance. So we talk about two dimensions, faithfulness and helpfulness. And we talk about how to create questions for them.

Next, let's look at how to evaluate them. The first one is faithfulness. I would just default to simple binary labels: faithful or unfaithful. Now, I know I say simple binary labels, faithful or unfaithful, but later, as we start to get into the LLM evaluation piece of it, it may not be that straightforward.

We'll get to that later. And also related to faithfulness, there's no info. So for no info, there's actually three kinds of labels. One is the model tries to make shit up. This is clearly a hallucination. Two is the model knows that it doesn't have the info and therefore declines the answer.

That's great. And the third is that the answer is clearly in the document, but the model says the answer isn't there. Now, this could be because the model can't pay attention to the long context, or it could be because of a retrieval error. So you need to look at the retrieved chunks to make sure they actually contain the answer.

And trying to measure recall of retrieval is hard — I don't think you can actually measure complete recall of retrieval. I think you can measure precision, but I'm not sure you can measure complete recall. If anyone wants to talk about that, we can chat after the next session about why I think recall is really hard.
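
To make the three-way labeling concrete, here's a minimal sketch. The label names and helper are mine, not from the write-up; the talk just describes the three outcomes.

```python
from enum import Enum

class NoInfoOutcome(Enum):
    HALLUCINATED = "model made up an answer that isn't in the document"
    CORRECT_DECLINE = "model correctly said the info isn't there"
    FALSE_NEGATIVE = "answer is in the document, but model said it isn't"

def label_no_info_answer(answer_declined: bool, evidence_in_document: bool):
    """Map a graded answer to one of the three outcomes.

    answer_declined: did the model say 'I don't have that information'?
    evidence_in_document: does the source (or the retrieved context) actually
    contain the answer? If the evidence never shows up in the retrieved
    chunks, that points at a retrieval error rather than an attention problem.
    """
    if evidence_in_document:
        # Answerable question: declining is a false negative; otherwise
        # grade the answer for faithfulness/helpfulness as usual.
        return NoInfoOutcome.FALSE_NEGATIVE if answer_declined else None
    # No-info question: declining is correct, answering is a hallucination.
    return NoInfoOutcome.CORRECT_DECLINE if answer_declined else NoInfoOutcome.HALLUCINATED
```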

Then the next thing is helpfulness. I really think that helpfulness, the right way to do this, is really doing an A-B test. And I don't mean an online A-B test. I think that it's essentially taking what you have now and this is your control and then updating your prompt, updating your retrieval or updating your document set, etc.

And then generating another set of answers. Then you just ask people which one they prefer, left or right. This is really just pairwise comparison. I won't go into the details of this, but since Llama 2 and Llama 3, I mean, they share it all in their papers.

It's just easier to get people to decide which one they prefer than to ask them to give scores, right? One to five or one to ten. Those scores are extremely subjective; different people have different scales, and it's impossible to calibrate them. But different people will probably approximately prefer the same better answer, based on the criteria that you have.

And of course, the criteria I use are just the three criteria I mentioned for helpfulness: relevance, comprehensiveness, and conciseness. So that's how we do the human annotation. On human annotation: I said that it's traditionally considered the gold standard. I'm not sure we would even consider human annotation the gold standard now.
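
A minimal sketch of that pairwise setup, with the left/right order randomized to avoid position bias, might look like this. The helper names are hypothetical, not from the write-up.

```python
import random

def build_pairwise_tasks(questions, control_answers, treatment_answers, seed=0):
    """Pair control vs. treatment answers for each question, shuffling sides."""
    rng = random.Random(seed)
    tasks = []
    for q, a_control, a_treatment in zip(questions, control_answers, treatment_answers):
        flipped = rng.random() < 0.5  # randomize left/right to avoid position bias
        left, right = (a_treatment, a_control) if flipped else (a_control, a_treatment)
        tasks.append({
            "question": q,
            "left": left,
            "right": right,
            # remember which side holds the treatment so preferences can be un-flipped
            "treatment_side": "left" if flipped else "right",
        })
    return tasks

def treatment_win_rate(tasks, preferences):
    """preferences[i] is 'left' or 'right' as chosen by the annotator or judge."""
    wins = sum(1 for t, p in zip(tasks, preferences) if p == t["treatment_side"])
    return wins / len(tasks)
```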

I think that human annotation is necessary to get a small sample set to align and calibrate your LLM evaluator. But beyond that, you find that human annotation is actually very noisy, and we may not want to use it more than we actually need to. On the other hand, LLM evaluations have consistent, systematic bias.

But at least it's consistent, and you can update the prompt or the model to remove that bias. For human annotation, that's very difficult. And again, this is the usual spiel: if you're still using automated metrics like BLEU and ROUGE, you probably should move off them.

I think they might work well for things where it's a one-to-one translation, a sentence to a sentence or a paragraph to a paragraph; they work well for machine translation. But for open-ended tasks, I think they're pretty poor. And I highlight a few examples — not just L-Eval; several of the other benchmarks at the bottom of the post, which I go through at the end, have highlighted this issue with these kinds of automatic metrics as well.

So I'm not going to beat that dead horse again. That's it; that's all I'm saying about n-gram-based metrics. So to evaluate faithfulness, there's a standard approach that we've seen in a few other papers, which is: given an answer, you break that answer down into individual claims, then you evaluate each individual claim.

This approach has been tried and many papers have been published about it. It works empirically, and there are just small tweaks here and there. So for example, imagine this is the answer: the tenant breached the lease; they missed three payments, failed to maintain insurance coverage, and sublet the apartment without permission.

So there are three claims here, right? Missed three payments, failed to maintain insurance coverage, and sublet the apartment. Now you can check each claim against the source document and get the answer. Essentially, this is really just map-reduce: you break the answer down into smaller pieces, so the LLM call can be a very specialized thing that just focuses on each of these claims and checks it.

This approach has been demonstrated in natural-language-inference-based summarization evals, where they break it down into sentences; in Q&A-based summarization evals, where they break it down into multiple questions and then try to answer each one; and in claim generation and verification, which is this exact same technique.
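
As a sketch of what that claim-decomposition loop can look like in practice: `call_llm` is again a hypothetical chat-completion wrapper, and the prompts are illustrative rather than the ones used in the cited papers.

```python
EXTRACT_PROMPT = """\
Break the answer below into a list of atomic factual claims,
one claim per line.

Answer:
{answer}
"""

VERIFY_PROMPT = """\
Source document:
{document}

Claim: {claim}

Is this claim supported by the source document? Reply with exactly
SUPPORTED or NOT_SUPPORTED.
"""

def faithfulness_score(answer: str, document: str, call_llm) -> float:
    # Map step: split the answer into individual claims.
    claims = [c.strip("- ").strip()
              for c in call_llm(EXTRACT_PROMPT.format(answer=answer)).splitlines()
              if c.strip()]
    if not claims:
        return 0.0
    # Reduce step: verify each claim against the source independently.
    verdicts = [call_llm(VERIFY_PROMPT.format(document=document, claim=c)) for c in claims]
    supported = sum("NOT_SUPPORTED" not in v for v in verdicts)
    return supported / len(claims)  # fraction of claims grounded in the source
```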

So that's all I'll say about that. Then, to evaluate helpfulness, I think it requires a more nuanced approach, because there's no single definitively helpful answer. What I would do is get maybe a couple hundred samples and ask people which one they prefer, left or right, one or zero.

And then I would try to align an LLM to do the same pairwise comparison and evaluate it with Cohen's kappa. Oh, this is actually what I say here: pairwise comparisons, evaluated with Cohen's kappa. If the LLM's Cohen's kappa against human annotations aligns well enough — and honestly, 0.4 to 0.6 is really, really good.

Cohen's kappa isn't really a zero-to-one thing in practice; it's actually very hard to score high on Cohen's kappa. You can read up more on what counts as a good number for Cohen's kappa, and I think 0.4 to 0.6 is pretty good.
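
A minimal sketch of that calibration step, assuming scikit-learn is available and `llm_judge_prefers` is a hypothetical function that returns "left" or "right" for each pair:

```python
from sklearn.metrics import cohen_kappa_score

def judge_agreement(pairs, human_prefs, llm_judge_prefers):
    """pairs: list of (question, left_answer, right_answer).
    human_prefs: human choice per pair, "left" or "right"."""
    llm_prefs = [llm_judge_prefers(q, left, right) for q, left, right in pairs]
    kappa = cohen_kappa_score(human_prefs, llm_prefs)
    # Rule of thumb from the talk: ~0.4-0.6 agreement is already pretty good,
    # since kappa corrects for chance agreement and is hard to score high on.
    return kappa
```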

So that's all I have to say about this write-up I have. Then after that, I talk about six other papers. I think these six other papers are all pretty good. That's the reason why I chose to summarize them because I highly encourage people to read these six papers or at least read the summary to get a sense of what these are.

And I also went through another 12 or so papers. Most of these are good, but if you really want to use benchmarks, the first six above are the references to go to. The rest are only worth it if you have the time or if there's something very specific you need, like this one.

This benchmark is specifically for business and finance, right? And then this one is specifically for very difficult retrieval. So that's all I had for this. And here are some of the takeaways. I don't know if anyone else has any other questions. Wow, no other questions. Maybe I lost everyone.

No, I'm kidding. I think... Okay. No, go ahead. Yeah. I mean, I'm actually kind of curious about those last few papers. These are super... I feel like... Just take the links, give it to Gemini. Hey, Gemini, read all these papers and summarize it. No, no, no, no. Actually, I think reading is not as useful as code.

And, you know, my experience of using simple evals, you know, I think like there should be a simple evals for this. Basically, that is a library that is just a standardized way to run these. I think that the last time that you did this, maybe like a year ago, there was some code that some...

One of your readers wrote code that implemented some of the hallucination evals and... The abstractive summarization one. Yeah. They did the NLI summarization one. Yeah. So I think that would actually be more useful, and then obviously you can just run it as a battery of tests. These are all pretty cheap to run.

Yeah. And, you know, figure it out when you need to double click on something. Why don't we have a paper reading club where we go through code slash lab exercises? We do. Friday. Yeah. Do we have the link for that? Just go to Luma LS. It always has a link.

L-U dot M-A slash L-S. Yeah, but A-N-A-N actually usually does like coding related things. But this would be relevant. This would be very on target. Okay. I can present on Ernie. If we have... Yeah, go for it. ...other questions. I mean, I think this is a very big paper.

Let me just paste the slides. The slides are here. It's a very big paper. I'm not sure how important it is. But as much as the paper club is meant for learning, this one is very good for learning, even though it's not a frontier model. And I think that's one framing.

The second framing I would give for reading this is that it seems like a sequel, effectively, to the DeepSeek V3 and R1 papers as well as the Qwen 3 paper. So basically, there's just a body of work coming out of the Chinese labs that is a lot more open and, I guess, easier to read compared to the U.S.

labs, all of whom basically don't share anything except for AI2 now, assuming that Meta Llama is dead. So, yeah, there's basically nothing. Even Mistral is no longer publishing many details. So basically Ernie is Baidu's internal language model. They've always made a lot of hype about it, but they've never really released it until now.

So I think this is now both a very good paper as well as a publicly available open weights model that, according to Baidu, is being used all over China. I obviously don't really know how big a deal that is in China. It could be very big. It might not be.

We just don't know. But this is a very serious paper, so it's worth learning from. I also put all of this into Gemini and OpenAI, in case you need those. But I'll just cover the main features that I thought were relevant. Hey, wait, wait, wait. Quick correction. Mistral's latest paper is pretty good.

Mistral, it came out June 12th. A good reasoning paper — as much as they don't usually do reasoning — a recommendation from Nathan Lambert, a good reasoning paper from Mistral earlier last month. Okay. I haven't read it. I just know, like, Mistral Small or Codestral didn't have that much.

Okay. So, I think it's important to always get a sense. Like, this is how I wish papers were presented. Like, I basically had to just go and rearrange a lot of things. So, these are artifacts that have been released. There are some MOEs, some non-MOEs, some multimodals, some not.

Always useful to have base models because I think you can do a lot more interesting things than just chat with models. And I think that's very useful for research. But, like, basically, they've tried to fill out the complete matrix of every possible thing with regards to multimodality and reasoning and sort of the sparsity.

The biggest claim, which nobody quite believes, is that they're sort of GPT-4.5 level but at 1% of the cost. But there are several benchmarks where they really did run against Qwen and DeepSeek and o1, and they're at least competitive, which is very cool.

I think those are basically the table stakes — and they're very, very high table stakes these days — to meet when releasing a model that at least warrants a bit of your interest. So it's very, very hard to get there. As far as learning goes, you can talk about model performance all you want.

But really, like, these are the things that last in terms of the ideas, the contributions. They have three that they call out. I don't think these are super well organized because one and two kind of merge together and the three is the post-training one. But I'll just kind of go through most of those in turn.

But I think probably the clearest picture you should get from this is that they really lean into this idea that you want to have independent vision experts. And, you know, in the normal, classical sense of MOEs, the way MOE specialization works doesn't really work the way human specialization does.

Because you want token balancing, and experts sort of learn token by token rather than by domain, and the balancing objective actually tends to spread out expertise among experts as opposed to concentrating it. So they have several different ways to address that. I also put this text into GPT-4o to generate some slop, but it's just nice to look at.

It doesn't really do anything. So, pre-training. I always like these kinds of tables because they're nice references for the key parameters that people should know. I don't really see anything huge to call out, except that this shared-experts thing is something they're trying to make a dent on, as well as this idea of dedicated vision experts.

So until now, I don't think I've seen any other people commenting about that or touting that as part of their pre-training. I also, like, for researchers who are actually trying to replicate or reproduce, it's nice to see, like, magic parameters which they arrive at and, you know, just kind of use that.

But it's definitely more art than science, I would say, on all this stuff. Did they share — well, can you go back a slide? Did they share why the small and big ones don't have shared experts? Shared experts are, interestingly, only at the 21B and 28B kind of model, but not in the big ones, not the 300B, and not the really small one.

Also, interesting to note, I thought these sizes are very interesting. That 28B, Mistral-medium-ish size is a good sweet spot for consumer stuff. It's no longer the 7, 13, and 70Bs; 24B-ish models are doing pretty well. And then 0.3B just because they have to do an on-device one.

And then, of course, a fat chonker. But it's interesting that the chonker doesn't have shared experts. Yeah. Honestly, I don't fully understand it enough to tell you. I think they've basically moved shared experts into the routing layer; they've done much more work on the routing side. They have a little diagram at the end where they did some rearrangement of the MOE, and the way I would characterize it is that the work a shared expert would do has moved to the upper layers.

I don't know that for a fact, and I wouldn't stand behind it — that's a theory, not something that's substantiated. It might just be hallucinated by me. Oh, another notable thing is the way they named these MOE models with both the total params and the active params. This seems to have been started by Qwen or DeepSeek.

But this seems to be the clearest way to communicate in the name whether it's an MOE and what the active params are. I think people are wising up to the fact that, for example, Mistral has only been reporting active params as a measure of efficiency. And as the number of experts increases, this divergence and the ratio will get higher.

Okay, let's go to the multimodal MOE. So they spend a lot of time talking about this, and we have more diagrams on their training. But they always have these kinds of commentary. We'll talk about this later; I don't think this slide comes in at the right point. Datasets: basically no info, as far as I can tell, not even about the number of trillions of tokens they trained on.

They don't even say that. So there was a lot of blah, blah, blah. Originally my note here was just: datasets, nothing important was said. But actually, if you read it closely, they do make a couple of comments. Basically, there's a lot of natural language data that is useless, so they use this DIKW framework.

I don't know what that is, but they develop five tiers of knowledge and basically mine the data for knowledge. And they also use synthetic data, which is kind of cool. For multimodality, they use interleaved text and image. This is a very common technique now, not just for vision but also for image generation.

And then finally, for domain-specific data, again, a lot of, like, mining of data, but also particularly audio transcription of podcasts and YouTube. So they don't say anything about where to get their video and podcasts. But I think if you make more podcasts, you will become part of the training data, which is good.

But that's it on data. It's really nothing specific, but it has some hints as to what they think is important. Let's see, how much time do we have? Nine minutes. Okay. They have two kinds of losses, which I thought were worth highlighting. A lot of pre-training innovation is in figuring out a better objective than the standard one.

One is router orthogonalization, which is really focused on expert specialization, which I thought was kind of cool. And balancing that, the other loss they have on top of the standard cross-entropy is a token-balanced loss, which normalizes by sequence length. I don't know if this one has as big of an impact, but it was big enough for them to call out as a unique contribution, which I think is interesting.
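
The report's exact formulation isn't given here, but as a rough idea of what a sequence-length-normalized balance term can look like, here's a sketch of the standard Switch-Transformer-style auxiliary loss computed per sequence, so long and short sequences contribute equally. This is my reconstruction under that assumption, not the paper's equation.

```python
import torch

def per_sequence_balance_loss(router_probs: torch.Tensor,
                              expert_index: torch.Tensor,
                              num_experts: int) -> torch.Tensor:
    """router_probs: [seq_len, num_experts] softmax router outputs for one sequence.
    expert_index: [seq_len] index of the expert each token was routed to."""
    seq_len = router_probs.shape[0]
    # Fraction of this sequence's tokens sent to each expert (normalized by
    # sequence length rather than batch-level token counts).
    counts = torch.bincount(expert_index, minlength=num_experts).float()
    frac_tokens = counts / seq_len
    # Mean router probability per expert over this sequence.
    mean_probs = router_probs.mean(dim=0)
    # Switch-Transformer-style balance term; minimized when routing is uniform.
    return num_experts * torch.sum(frac_tokens * mean_probs)
```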

Probably it also has a lot to do with their multimodal focus. If anything, I think this paper gives much more insight into vision and multimodality training, as well as vision MOEs, which I'd never really thought about. And this is substantiated by the fact that in the post-training section, they have a modality routing section, which I don't like.

Basically, this is their visualization proof that these losses work and the experts are better distributed. I don't think you can read much from it apart from taking it on principle that this representation is correct and more balanced. There's a vision reasoning side that I wanted to highlight down here, where they talk about the reasoning VLM post-training,

which maybe has been written about by Qwen — actually, I don't think so. I've browsed the Qwen blog posts, but they only put up blog posts; I don't think I've seen a technical report yet. But here, they actually give you the whole recipe for training vision reasoning models.

And surprise, surprise, it is not just take a reasoning model, add a vision adapter, and you're done. Actually, they do a lot of vision reasoning as well. Part of it is caption synthesis of STEM images. Part of it is this sort of three-stage process that they illustrate here. But all of this is custom data.

This is a lot of work. Visual STEM, visual puzzles, UI to code, hybrid reinforcement learning. Just a lot of post-train data that they use to post-train their reasoning VLM capabilities. So probably this is well-known if you train reasoning VLMs, but I think this is the first time it was laid out so clearly as far as I've seen it.

I'll make a comment also on just their general RL, not just the vision stuff, because we're covering this a little bit out of order. So you have the base, and then in order to get the final post-train model, they went through a bunch of normal SFT, but then also three phases of RL.

This is the general logic corpus, this is math and code, and then this is general again. And I like that they separated RL for verifiable domains, as well as RL for open-ended domains. And I think these are basically the two categories and terminologies that people are settling on. And so it's relatively clear that for math and code, you want something like this in terms of your RL.

For non-verifiable domains, it is emerging work, including our friend Nathan, that you have some other things that you want to scaffold on, basically just like checklist verifiers and reward models. It's still very open-ended, but like, again, I like to see more work in this, and I think definitely this is like kind of the next frontier after verifiable, because verifiable is a little bit boring.

Everyone kind of knows what is verifiable; it's the stuff that is not verifiable that is interesting. Okay, there are five minutes left. I don't know if there are any other interesting notables. Oh, video sampling, video understanding. Again, it's not just a series of images in chat format and they're done.

They also do even more data and sort of post-trained work, including timestamp rendering that overlays absolute timestamp onto each frame. So they just slap the timestamp on there. So apparently, it improves temporal understanding by just kind of reusing the ability to read text from the image frame, which is kind of cool.
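
As a toy sketch of that timestamp-overlay idea (using Pillow; the font, box size, and position are arbitrary choices of mine, not from the paper):

```python
from PIL import Image, ImageDraw

def overlay_timestamp(frame: Image.Image, seconds: float) -> Image.Image:
    """Burn the absolute timestamp of a sampled frame into the image itself,
    so the model can 'read' when the frame occurred."""
    frame = frame.copy()
    draw = ImageDraw.Draw(frame)
    stamp = f"{int(seconds // 60):02d}:{int(seconds % 60):02d}"
    # Draw a simple black box with white text in the top-left corner.
    draw.rectangle([0, 0, 80, 24], fill="black")
    draw.text((4, 4), stamp, fill="white")
    return frame
```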

I don't know if this is like controversial or not, because obviously you're going to lose information. That seems very bitter lesson. Actually, you don't have to lose information. You can mess with images, just add it in like two colors that cancel each other out. The model can read it.

Or just append frames to the image, and now you can add in text. But it feels like you should be able to just learn temporal consistency without a literal time overlay. But anyway, it's not that important, especially because I feel like you actually mess with temporal consistency a bit when you have frames that go back and forth, and now you've overlaid time.

Some videos try to have time jumps and stuff. And yeah, I don't know. It just seems weird. But the more interesting thing was actually the separation of vision thinking. So like I was talking to someone about this quite a bit last week about how different models do either hybrid thinking or thinking and non-thinking models, right?

Like for OpenAI, you have the O-series, which are all thinking-only models. And then you have like the regular models, which are not thinking models, and they separated them. And you have to like do your own, like you have to basically load the entire model and serve it separately. But then there's like Claude, which has hybrid thinking models.

And it's not that different to do these. But then for vision thinking, one thing to note is that it's not like thinking requires a different model architecture, right? It's just RL on next-token prediction, and now the model will put out more thinking tokens before kind of boxing its answer.

So as long as you can encode the same modality in the same latent space, like if you can encode in images into a thinking model, it should be able to do the same reasoning style thinking per se over the image domain. Now, stuff like this just shows that you can have specific visual, like reasoning questions in the training data, but you don't necessarily need it, right?

And it's kind of a switch that you can turn on and off with an adapter. But yeah, they just take it further. I thought the more interesting broader discussion is when to do thinking and non-thinking versus hybrid models. But anyway, two minutes — I don't want to take up paper time.

You're muted, by the way, Swix. Yeah, I'm done. I think there are a lot of cool ideas, and it's unclear how much they work. And, you know, obviously, we have like two paragraphs here on video, two paragraphs on data management, no code. So it's just all like, cool, man, good for you — that's where I'm at.

But for the model architecture, they were actually very, very open, which is very cool. So I think that's... I feel like once you give out weights, and people know how to run inference on those weights, you'll find out anyway — you'll find out the active params.

Yeah, so like exposing the thinking and why and doing this stuff helps people to learn and also contrast models easier. Okay, any other questions, thoughts, comments? I'm catching up on questions. Your Discord invite server link is apparently broken. Okay. I created a new one. I think it should work now.

Link is broken. Where is it broken? Yeah, yeah. I think the invite link usually only lasts seven days by default if you try to invite someone. Right. So I changed it to an evergreen invite. Maybe on the paper club link, it didn't do that. But it's good to know if we have like a big one that's broken.

It should just update in place. Yeah, it's okay. Anyway. Thank you, guys. Thank you. Any volunteers for next week? Volunteers, volunteers? Yeah, any volunteers? Latte, are you volunteering for next week? No. Someone do the Mistral thinking paper. I will volunteer. Yeah, we need someone for the Mistral one. Yeah, you caught me.

If someone wants, we can half and half it. You do half, I do half. I did half, you don't want to do. Yeah. I'm going to volunteer myself to do half, so someone has to take the other half. Unless we want half of paper next week. Someone do half the fucking paper.

No, thank you. Someone do half. Pardon me. Just a quick question on like, before we get to the shared experts and things like that from the Ernie paper, is there a definitive paper on like MOE and like how mixture of experts came about or with a certain model paper?

Oh, yeah. There is an MOE one. The OG one is Noam Shazeer. Okay. What did he write about? Is it the LSTM MOE? I don't know. Just look up Shazeer et al. MOE. I have a list of good MOE papers I can post at some point. I'm dropping the list.

I'm dropping the paper in the chat. Appreciate it. Thank you. Yeah, there's basically a history of MOEs. You would definitely have to put Mixtral 8x22B there. Like, Mixtral? I forget. Mixtral 8x22B. That was probably the biggest launch of an MOE, huh? I don't know if it was the 8x22B.

It was the 8x7B first. Oh, 7B. Oh, okay. The 7B was first; the 22B was their update. But the more interesting thing in the history of mixture of experts is sparsity, like dynamic active versus not active. So you have a super sparse one, like Llama is 16B active but 400B total, right?

So those were kind of the advancements. And then the very basic papers are just like, okay, here's a routing layer, here's what mixture of experts is. Yeah, it'd be cool if we go over that. It'll be part of an endless paper club; we'll have a whole MOE day. Got it.

Thank you, guys. Cool, guys. So who wants to do the Mistral paper with Vibu next week? Volunteers get a special Discord badge. Swix is going to create a special latent space volunteer badge. And you get privileges. Give me a badge like this. Yeah. Or you can have a badge like this.

Hang on, let me give you a badge too. It's a digital badge. I can give you this badge. Oh, should I, should I, should I? Virtual badge. Oh, there. You can have this badge. Okay, I'm going to up it. Mochi. I'm going to give you this badge. If you, if you volunteer.

Okay. Mochi is up for now. Just volunteer on the LM Paper Club Discord channel. Yes, please. Otherwise, yeah. We'll probably have Nigel Strong. I'm going to volunteer Fatima at one point. Because I know she's watching. Hi, Fatima. I would, but we have a product launch next week. So. Okay, you can do the week after.

Okay, no pressure, no pressure. I want the badge, the dog, the pin, and Discord privilege. Jesus. How about, how about we get the second dog? Latte. Where's Latte? Oh, someone is taking, someone is volunteering in the chat. Yikes. Sicko. So many puppies. Yikes. Yeah, yikes. Oh, yikes. Yeah, I'll, I'll, I'll, I'll take like a backup slot, but somebody that hasn't done it before should definitely do it.

Yeah. Because I've done it before and I can give myself the badge because I have mod privileges. Volunteering is the gift that keeps on giving. Yeah, so if anybody finds it, or if anybody finds the topic interesting, then you should definitely take it from me, but I will take it if nobody takes it.

This, you can have this one-year-old doggo. Doggo is one now. You can have the whole, the whole girlie. I'm going to have some. Happy birthday to Vivo. Doggo birthday. Birthday? Happy birthday, dude. And mochi. And mochi. Actually, it was my birthday yesterday, too. So, happy birthday, guys. Wow. You should have the same birthday as Hassan.

Hassan, too. Okay, guys. We're going to have happy birthday dumplings. Okay. Thanks, guys. See you, everyone. Bye.