Eugene Yan: Thank you. Thank you, everyone. I'm Eugene Yan, and today I want to share with you some building blocks for LLM systems and products. Like many of you here, I'm trying to figure out how to effectively use these LLMs in production. So, a few months ago, to clarify my thinking, I wrote up some patterns for building LLM systems and products, and the community seemed to like it.
There's Jason asking for this to be a seminar, so here you go, Jason. Today, I'm going to focus on four of those patterns: evaluations, retrieval-augmented generation, guardrails, and collecting feedback. All the slides will be made available after this talk, so I ask you to just focus. Buckle up, hang on tight, because we'll be going really fast.
All right, let's start with evals, or what I really consider the foundation of it all. Why do we need evals? Well, evals help us understand whether our prompt engineering, our retrieval augmentation, or our fine-tuning is doing anything at all. Right? Consider eval-driven development, where evals guide how you build your system and product.
We can also think of evals as test cases, right? We run these evals before deploying any new changes. It makes us feel safe. And finally, if managers at OpenAI take the time to write evals, or give feedback on them, you know it's pretty important. But building evals is hard.
Here are some things I've seen folks trip up on. Firstly, we don't have a consistent approach to evals. If you think about more conventional machine learning: regression has root-mean-square error; classification has precision and recall; even ranking has NDCG. All these metrics are pretty straightforward, and there's usually only one way to compute them.
But what about for LLMs? Well, we have these benchmarks whereby we write a prompt with a multiple-choice question and evaluate the model's ability to get it right. MMLU is a widely used example that assesses LLMs on knowledge and reasoning ability: computer science questions, math, US history, et cetera.
But there's no consistent way to run MMLU. Less than a week ago, Arvind and Sayash from Princeton wrote that evaluating LLMs is a minefield. They ask: are we assessing prompt sensitivity? Are we assessing the LLM, or are we assessing our prompt's ability to get the LLM to give us what we want? On the same day, Anthropic noted that the simple multiple-choice question may not be as simple as it seems.
Simple formatting changes, such as different parentheses, lead to different accuracies. And there's no consistent way to do this. As a result, it's really difficult to compare models based on these academic benchmarks. Now, speaking of academic benchmarks, we may have outgrown some of them. For example, take the task of summarization.
On the top, you see the human evaluation scores on the reference summaries. And on the bottom, you see the evaluation scores for the automated summaries. You don't have to go through all the numbers there, but the point is that all the numbers on the bottom are already higher than the numbers on top.
Here's another, more recent one on the XSum dataset (extreme summarization), where you see that all the human evaluation scores are lower than InstructGPT's. And that's not even GPT-4. Now, finally, with all these benchmarks being so easily available, we sometimes forget to ask ourselves: hey, is this a fit for our task?
If you think about it, does MMLU really apply to your task? Maybe, if you're building a college-level chatbot, right? But here's Linus reminding us that we should be measuring our apps on our own tasks and not just relying on academic evals. So how do we do evals? Well, I think as an industry, we're still figuring it out.
Evals are often called out as the number one challenge out there, and we hear so many people talk about them. I think there are some tenets emerging. Firstly, I think we should build evals for our specific task. And it's okay to start small. It may seem daunting, but it's okay to start small.
How small? Well, here's Teknium. You know, he releases a lot of open-source models. He starts with an eval set of 40 questions for his domain-expert tasks. 40 evals. That's all it takes, and it can go very far. Second, we should try to simplify the task as much as we can.
You know, while LLMs are very flexible, I think we have a better chance if we make the task more specific. For example, if you're using an LLM for content moderation tasks, you can fall back to simple precision and recall. How often is it catching toxicity? How often is it catching bias?
How often is it catching hallucination? Next, if it's something broader like writing SQL or extracting JSON, you can try to run the SQL and see if it returns the expected result. That's very deterministic. Or you can check the extracted JSON and see if the keys and values match what you expect.
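Here's a minimal sketch of what those deterministic checks could look like in Python; the helper names and the tiny in-memory database are just for illustration, not anything from the talk.

```python
# Minimal sketch of deterministic evals for structured outputs.
import json
import sqlite3

def eval_json_extraction(llm_output: str, expected: dict) -> bool:
    """Pass if the extracted JSON has the expected keys and values."""
    try:
        parsed = json.loads(llm_output)
    except json.JSONDecodeError:
        return False
    return all(parsed.get(k) == v for k, v in expected.items())

def eval_generated_sql(conn: sqlite3.Connection, generated_sql: str,
                       expected_rows: list[tuple]) -> bool:
    """Pass if running the generated SQL returns the expected result set."""
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False
    return sorted(rows) == sorted(expected_rows)

# Usage: a tiny in-memory database and one test case for each check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada'), (2, 'Bob')")

print(eval_json_extraction('{"name": "Ada", "id": 1}', {"name": "Ada", "id": 1}))      # True
print(eval_generated_sql(conn, "SELECT name FROM users ORDER BY id",
                         [("Ada",), ("Bob",)]))                                         # True
```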
These are still fairly easy to evaluate because we have expected answers. But if your task is more open-ended, such as dialogue, you may have to rely on a strong LLM to evaluate the output. However, this can be really expensive. Here's Jerry saying that running 60 evals through GPT-4 cost him a lot.
Finally, even if you have automated evals, we shouldn't discount the value of eyeballing the output. Here's Jonathan from Mosaic: "I don't believe that any of these evals capture what we care about." They have a prompt to generate games for a 3-year-old and a 7-year-old, and it was more effective for them to just eyeball the output as the model trains across epochs.
Okay, that's it for evals. Now, retrieval-augmented generation. I don't think I have to convince you all here why we need retrieval-augmented generation, but, you know, it lets us add knowledge to our model as input context, so we don't have to rely solely on the model's internal knowledge. And second, it's far more practical, right?
It's cheaper and more precise than continuously fine-tuning to add new knowledge. But retrieving the right documents is really hard. Nonetheless, we have great speakers, Jerry and Anton, sharing about this topic tomorrow, so I won't go into the challenges of retrieval here. Instead, I'd like to focus on the LLM side of things and discuss some of the challenges that remain even if we have retrieval-augmented generation.
The first is that LLMs can't really see all the documents you retrieve. Here's an interesting experiment, right? The task is retrieval-augmented generation: they take historical queries from Google and hand-annotated answers from Wikipedia. As part of the context, they provide 20 documents, each at most 100 tokens long.
So that's 2,000 tokens maximum. One of these documents contains the answer, and the rest are simply distractors. So the question they had was this: how does the position of the document containing the answer affect question answering? Now, some of you may have seen this before. Don't spoil it for the rest.
If the answer is in the first retrieved document, accuracy is the highest. If it's in the last, accuracy is decent. But if it's somewhere in the middle, accuracy is actually worse than having no retrieval-augmented generation at all. So what does this mean? It means that even as context window sizes grow, we shouldn't allow our retrieval to get worse, right?
Getting the most relevant documents to rank highly still matters, regardless of how big the context size is. And also, even if the answer is in the context and in the top position, accuracy is only 75%. So that means even with perfect retrieval, you can still expect some mistakes. Now, another gotcha is that LLMs can't really tell if the retrieved context is irrelevant.
Here's a simple example. Here are 20 top sci-fi movies, and you can think of these as movies that I like. And I asked the LLM if I would like Twilight. For folks not familiar with Twilight, it's romantic fantasy: girl, vampire, werewolf, something like that. I don't think I've ever watched it.
But I have a really important instruction: if it doesn't think I would like Twilight, given that I've watched all these sci-fi movies, it should reply with "not applicable." And this is pretty important in recommendations. We don't want to make bad recommendations. So here's what happened. First, it notes that Twilight is a different genre and not quite sci-fi, which is fantastic, right?
But then it suggests E.T. because of interspecies relationships. I mean, I'm not sure how I feel about that. How would you feel if you got this as a movie recommendation? The point is, these LLMs are fine-tuned to be so helpful, and they're really smart. They try their best to give an answer, but sometimes it's really hard to get them to say that something's not relevant, especially when it's fuzzy like this, right?
So how do we best address these limitations in RAG? Well, I think there are a lot of great ideas in the field of information retrieval. Search and recommendations have been trying to figure out how to show the most relevant documents on top, and they work really well.
And there's a lot that we can learn from them. Second, LLMs may not know that a retrieved document is irrelevant, so I think it helps to include a threshold to exclude irrelevant documents. In the Twilight and sci-fi movie example, I bet we could do something like measuring the distance between those two items, and if it's too far, we don't go to the next step.
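Here's a minimal sketch of such a threshold, assuming embedding-based retrieval; the 0.7 cutoff is a placeholder you'd tune on your own data.

```python
# Minimal sketch of a relevance threshold before generation: if the best
# retrieved item is too far from the query in embedding space, return
# "not applicable" instead of asking the LLM to be "helpful" anyway.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_with_threshold(query_emb: np.ndarray,
                          doc_embs: list[np.ndarray],
                          threshold: float = 0.7) -> str:
    # Score every retrieved document against the query.
    best = max(cosine_similarity(query_emb, d) for d in doc_embs)
    if best < threshold:
        return "not applicable"   # nothing retrieved is close enough
    return "proceed to generation with the retrieved context"
```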
Next, guardrails. Guardrails are really important in production. We want to make sure what we deploy is safe. For what counts as safe, we can look at OpenAI's moderation API: hate, harassment, self-harm, all that good stuff. But another thing that I also think about a lot is guardrails on factual consistency, or what we call hallucinations.
I think it's really important so that you don't have trust-busting experiences. You can also think of these as evals for hallucination. Fortunately, or unfortunately, the field of summarization has been trying to tackle this for a very long time, and we can take a leaf from that playbook. So one approach is via the natural language inference (NLI) task.
In a nutshell, given a premise and a hypothesis, we classify whether the hypothesis is true, false, or undetermined. So given the premise "John likes all fruits," the hypothesis "John likes apples" is true, therefore it's entailment. For the hypothesis "John eats apples daily," there's not enough information to confirm it, so it's neutral.
And finally, "John dislikes apples" is clearly false, therefore contradiction. Do you see how we can apply this to document summarization? The premise is the document, and the hypothesis is the summary. And it just works. When doing this, though, it helps to apply it at the sentence level instead of the entire document level.
So in this example, the last sentence in the summary is incorrect. If we run the NLI task on the entire document and summary, it's going to say that the whole summary is correct. But if you run it at the sentence level, it's able to tell you that the last sentence in the summary is incorrect.
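Here's a rough sketch of that sentence-level check using an off-the-shelf NLI cross-encoder from Hugging Face; the model name is one publicly available option, not necessarily what the paper used, and the label names come from that model's config.

```python
# Sketch of sentence-level factual-consistency checking via NLI.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cross-encoder/nli-deberta-v3-base"  # one publicly available NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def is_consistent(document: str, summary_sentence: str) -> bool:
    """Treat the document as the premise and one summary sentence as the
    hypothesis; flag the sentence if the top NLI label is contradiction."""
    inputs = tokenizer(document, summary_sentence, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[int(logits.argmax())]  # label names depend on the model
    return label.lower() != "contradiction"

# Check each summary sentence separately, as described above.
summary_sentences = ["John likes apples.", "John dislikes all fruits."]
flags = {s: is_consistent("John likes all fruits.", s) for s in summary_sentences}
print(flags)
```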
And they included a really nice ablation study where they varied the granularity: as it got finer and finer, from document to paragraph to sentence, the accuracy of detecting factual inconsistency goes up. That's pretty amazing. Now, another approach is sampling, and here's an example from SelfCheckGPT.
Given an input document, we generate a summary multiple times. Then we check if those summaries are similar to each other: n-gram overlap, BERTScore, et cetera. The assumption is that if the summaries are very different, it probably means that they're not grounded in the source document, and therefore likely hallucinated.
But if they're quite similar, you can assume that they're grounded effectively and therefore factual.
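Here's a bare-bones sketch of that sampling idea, scoring agreement between sampled summaries with plain unigram overlap; SelfCheckGPT proposes more sophisticated scorers, so treat this as the simplest possible version.

```python
# Sketch of the sampling-based check: generate a summary several times,
# then score how much the samples agree using simple unigram overlap.
from itertools import combinations

def unigram_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def consistency_score(samples: list[str]) -> float:
    """Average pairwise overlap across sampled summaries; a low score
    suggests the summaries aren't grounded in the source document."""
    pairs = list(combinations(samples, 2))
    return sum(unigram_overlap(a, b) for a, b in pairs) / max(len(pairs), 1)

samples = [
    "The company reported higher revenue in Q3.",
    "Revenue grew in the third quarter.",
    "The CEO resigned amid a scandal.",   # an outlier like this drags the score down
]
print(round(consistency_score(samples), 2))
```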
And the final approach is asking a strong LLM. Conceptually, it's simple: given an input document and a summary, you get the LLM to return a consistency score. But this LLM has to be pretty strong, and we have seen that strong LLMs are actually quite expensive. In the case of factual consistency, I've seen simpler methods outperform LLM-based approaches at a far lower cost. So, try to keep things simple if you can. Okay. Now, to close the loop, let's touch briefly on collecting feedback.
And I'm going to need the audience's help here. So, why is collecting feedback important? Because we want to understand what our customers like and don't like. And the magic here is that collecting feedback helps you build your evals and fine-tuning dataset. New models come and go every day, but your evals and fine-tuning dataset, that's your transferable asset that you can always use.
But collecting feedback from users is not as easy as it seems. Explicit feedback, the feedback we ask users for, can be sparse, meaning very low in volume. So, here's a quick thought experiment. How many of you here use ChatGPT? Okay. I see a lot of you.
How many of you actually click the thumbs-up and thumbs-down buttons? Excellent. Okay. But these are the beta testers, right? You can see it's a very small number. So, even if you include these thumbs-up and thumbs-down buttons, you may not be getting the feedback you expect.
Now, if the issue with explicit feedback is sparsity, then the issue with implicit feedback is noise. Implicit feedback is the feedback you get as users organically use your product: you don't have to ask them for it, but you get it anyway. So, here's the same example. How often do you click the copy-code button?
The rest of you just type it out like madmen? Okay. But does clicking the copy-code button mean that the code is correct? In this case, no: nrows is not a valid argument for pandas' read_parquet. But if we were to consider all copied code snippets as positive feedback, we would have a lot of bad data in our training set.
So, think about that. How do we collect feedback? I don't have any good answers, but here are two apps I've seen do it really well. The first is GitHub Copilot, or any kind of coding assistant. For people not familiar with it, you type a function signature and some comments, and it suggests code.
You can accept the code, reject the code, or move on to the next suggestion. We do this dozens of times a day. Imagine how much feedback they get from this, right? It's a golden dataset. Another example is Midjourney. For folks not familiar with Midjourney, you write a prompt and it suggests four images.
Based on those images, you can rerun the prompt, vary an image (that's what the V button stands for), or upscale an image (that's what the U button stands for). But do you know what an AI engineer sees? Rerunning the prompt is a negative reward: the user doesn't like any of the images.
Varying the image is a small positive reward: the user is saying this one has potential, but tweak it slightly. And choosing to upscale an image is a large positive reward: the user likes it and just wants to use it. So think about how you can build this implicit feedback data flywheel into your products, so that you quickly understand what users like and don't like.
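Here's a hypothetical sketch of what logging those implicit signals could look like; the action names and reward values are illustrative, not Midjourney's actual setup.

```python
# Hypothetical schema for turning implicit user actions into reward signals,
# following the Midjourney example above. Values are illustrative only.
ACTION_REWARDS = {
    "rerun_prompt": -1.0,   # user liked none of the four images
    "vary_image": 0.5,      # "this one has potential, tweak it slightly"
    "upscale_image": 1.0,   # user likes it and wants to use it
}

def log_feedback(prompt: str, image_id: str, action: str) -> dict:
    """Record one implicit-feedback event for later eval/fine-tuning data."""
    return {
        "prompt": prompt,
        "image_id": image_id,
        "action": action,
        "reward": ACTION_REWARDS.get(action, 0.0),
    }

print(log_feedback("a red fox in watercolor", "img_0042", "upscale_image"))
```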
Oh, sorry, you can take your phones out. All slides will be available after the talk. So that's all I wanted to share. If you remember anything from this talk, I hope it's these three things. First, you need automated evals. Just annotate 30 or 100 examples and start from there, right?
And then figure out how to automate it. It will help you iterate faster, right? On your prompt engineering, on your retrieval-augmented generation, on your fine-tuning. It will help you deploy more safely. I mean, this is a huge conference of engineers. I don't think I have to explain to you the need for testing.
Eyeballing doesn't scale. It's good as a final vibe check, but it just doesn't scale. Every time you update the prompt, you just want to run your evals immediately, right? I run tens of experiments every day, and the only way I can do this is with automated evals. Second, reuse your existing systems as much as you can.
There's no need to reinvent the wheel. BM25 and metadata matching can get you pretty far, and so can the techniques from recommendation systems: two-stage retrieval and ranking, filtering, et cetera. All these information retrieval techniques are optimized to rank the most relevant items on top, so don't forget about them.
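Here's a small sketch of that two-stage idea: BM25 for cheap candidate generation, with the reranking step left as a placeholder for whatever embedding or cross-encoder scorer you use. rank_bm25 is one small library implementing BM25, and the toy corpus is made up.

```python
# Sketch of two-stage retrieval: BM25 candidate generation, then rerank.
from rank_bm25 import BM25Okapi

docs = [
    "Evals help you iterate on prompts safely.",
    "BM25 ranks documents by term overlap with the query.",
    "Guardrails check outputs for safety and factual consistency.",
]
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

def retrieve(query: str, k: int = 2) -> list[str]:
    # First stage: cheap lexical scoring over the whole corpus.
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
    candidates = [docs[i] for i in top]
    # Second stage: rerank(query, candidates) with a stronger model goes here.
    return candidates

print(retrieve("how does bm25 rank documents"))
```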
And finally, UX plays a large role in LLM products. I think a big chunk of GitHub Copilot and ChatGPT is UX. It lets you use LLMs in your own context without having to call an API directly: you can use an IDE or a chat window. Similarly, UX makes it far more effective for you to collect user feedback.
Okay. That's all I had. Thank you, and keep on building. Thank you.