
Building Blocks for LLM Systems & Products: Eugene Yan


Chapters

0:00 Introduction
0:41 Evaluations
9:43 Guardrails
12:31 Collecting Feedback
15:35 Summary

Whisper Transcript

00:00:00.000 | Eugene Yan:
00:00:15.000 | Eugene Yan: Thank you.
00:00:16.000 | Thank you, everyone.
00:00:17.000 | I'm Eugene Yan, and today I want to share with you
00:00:19.000 | about some building blocks for LLM systems and products.
00:00:23.000 | Like many of you here, I'm trying to figure out
00:00:25.000 | how to effectively use these LLMs in production.
00:00:29.000 | So, a few months ago, to clarify my thinking,
00:00:31.000 | I wrote some patterns about building LLM systems and products,
00:00:34.000 | and the community seemed to like it.
00:00:36.000 | There was Jason asking for this to be a seminar,
00:00:39.000 | so here you go, Jason.
00:00:41.000 | Today, I'm going to focus on four of those patterns:
00:00:44.000 | evaluations, retrieval-augmented generation,
00:00:47.000 | guardrails, and collecting feedback.
00:00:50.000 | All the slides will be made available after this talk,
00:00:53.000 | so I ask you to just focus.
00:00:55.000 | Buckle up, hang on tight, because we'll be going really fast.
00:00:58.000 | All right, let's start with evals,
00:01:01.000 | or what I really consider the foundation of it all.
00:01:04.000 | Why do we need evals?
00:01:06.000 | Well, evals help us understand whether our prompt engineering,
00:01:09.000 | our retrieval augmentation, or our fine-tuning
00:01:11.000 | is doing anything at all.
00:01:13.000 | Right?
00:01:14.000 | Consider eval-driven development,
00:01:16.000 | where evals guide how you build your system and product.
00:01:19.000 | We can also think of evals as test cases, right?
00:01:21.000 | Where we run these evals before deploying any new changes.
00:01:25.000 | It makes us feel safe.
00:01:27.000 | And finally, if managers at OpenAI take the time to write evals
00:01:31.000 | or give feedback on them,
00:01:33.000 | you know it's pretty important.
00:01:38.000 | But building evals is hard.
00:01:39.000 | Here are some things I've seen folks trip up on.
00:01:42.000 | Firstly, we don't have a consistent approach to evals.
00:01:46.000 | If you think about more conventional machine learning:
00:01:48.000 | for regression, we have root-mean-square error;
00:01:50.000 | for classification, precision and recall;
00:01:52.000 | and even for ranking, NDCG.
00:01:54.000 | All these metrics are pretty straightforward,
00:01:56.000 | and there's usually only one way to compute them.
00:01:58.000 | But what about for LLMs?
00:02:00.000 | Well, we have this benchmark whereby we write a prompt,
00:02:03.000 | there's a multiple-choice question,
00:02:05.000 | we evaluate the model's ability to get it right.
00:02:07.000 | MMLU is an example that's widely used
00:02:10.000 | where it assesses LLMs on knowledge and reasoning ability,
00:02:13.000 | you know, computer science questions, math, US history, et cetera.
00:02:16.000 | But there's no consistent way to run MMLU.
00:02:20.000 | Less than a week ago, Arvind and Sayash from Princeton
00:02:23.000 | wrote that evaluating LLMs is a minefield.
00:02:26.000 | They ask, "Are we assessing prompt sensitivity?
00:02:30.000 | Are we assessing the LLM, or are we assessing our prompt
00:02:33.000 | to get the LLM to give us what we want?"
00:02:35.000 | On the same day, Anthropic noted that the simple MCQ
00:02:39.000 | may not be as simple as it seems.
00:02:41.000 | Simple formatting changes, such as different parentheses,
00:02:44.000 | lead to changes in accuracy.
00:02:46.000 | And there's no consistent way to do this.
00:02:49.000 | As a result, it makes it really difficult to compare models
00:02:52.000 | based on these academic benchmarks.
00:02:55.000 | Now, speaking of academic benchmarks,
00:02:57.000 | we may have outgrown some of them.
00:02:59.000 | For example, this task of summarization.
00:03:01.000 | On the top, you see the human evaluation scores
00:03:04.000 | on the reference summaries.
00:03:05.000 | And on the bottom, you see the evaluation scores for the automated summaries.
00:03:10.000 | You don't have to go through all the numbers there, but the point is that all the numbers
00:03:14.000 | on the bottom are already higher than the numbers on top.
00:03:18.000 | Here's another one that's more recent on the XSUM dataset, extreme summarization,
00:03:22.000 | where you see that all the human evaluation scores are lower than InstructGPT's.
00:03:27.000 | And that's not even GPT-4.
00:03:30.000 | Now, finally, with all these benchmarks being so easily available,
00:03:33.000 | we sometimes forget to ask ourselves, hey, is it a fit for our task?
00:03:37.000 | If you think about it, does MMLU really apply to your task?
00:03:42.000 | Maybe, if you're building a college-level chatbot, right?
00:03:45.000 | But here's Linus reminding us that we should be measuring our apps on our tasks
00:03:51.000 | and not just relying on academic evals.
00:03:54.000 | So how do we do evals?
00:03:56.000 | Well, I think as an industry, we're still figuring it out.
00:03:58.000 | Folks point out it's the number one challenge out there.
00:04:00.000 | And we hear so many people talk about evals.
00:04:02.000 | I think there are some tenets emerging.
00:04:05.000 | Firstly, I think we should build evals for our specific task.
00:04:09.000 | And it's okay to start small.
00:04:10.000 | It may seem daunting, but it's okay to start small.
00:04:12.000 | How small?
00:04:13.000 | Well, here's Teknium.
00:04:14.000 | You know, he releases a lot of open-source models.
00:04:17.000 | He starts with an eval set of 40 questions for his domain expert tasks.
00:04:21.000 | 40 evals.
00:04:22.000 | That's all it takes, and it can go very far.
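To make "start small" concrete, here's a minimal sketch of a tiny, task-specific eval harness: a handful of hand-written question-and-expected-answer pairs scored by substring match. The `ask_llm` function is a hypothetical placeholder for whatever model call you use, and the example questions are made up.

```python
# Minimal sketch of a tiny, task-specific eval set with substring-match scoring.
# `ask_llm` is a hypothetical placeholder for your actual model call.

EVAL_SET = [
    {"question": "What does NDCG stand for?",
     "expected": "normalized discounted cumulative gain"},
    {"question": "Which pandas function reads a Parquet file?",
     "expected": "read_parquet"},
    # ... grow this to ~40 examples for your own domain
]

def ask_llm(question: str) -> str:
    raise NotImplementedError("replace with your model call")

def run_evals() -> float:
    correct = sum(
        1 for case in EVAL_SET
        if case["expected"].lower() in ask_llm(case["question"]).lower()
    )
    return correct / len(EVAL_SET)

if __name__ == "__main__":
    print(f"accuracy: {run_evals():.2%}")
```

Forty such cases and a one-line scorer already give you a number you can track every time the prompt or model changes.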
00:04:25.000 | Second, we should try to simplify the task as much as we can.
00:04:29.000 | You know, while LLMs are very flexible,
00:04:31.000 | I think we have a better chance if we try to make it more specific.
00:04:34.000 | For example, if you're using an LLM for content moderation tasks,
00:04:38.000 | you can fall back to simple precision and recall.
00:04:40.000 | How often is it catching toxicity?
00:04:42.000 | How often is it catching bias?
00:04:43.000 | How often is it catching hallucination?
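As a rough illustration of falling back to classification metrics, here's a small sketch that scores a moderation guardrail with precision and recall; the labels and predictions below are made-up toy data.

```python
# Sketch: scoring a moderation guardrail with precision and recall.
# `labels` are human judgments, `preds` are the LLM's toxic/not-toxic flags.

def precision_recall(labels: list[bool], preds: list[bool]) -> tuple[float, float]:
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: five labeled comments and the model's toxicity flags.
labels = [True, False, True, False, True]
preds  = [True, False, False, False, True]
p, r = precision_recall(labels, preds)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.67
```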
00:04:46.000 | Next, if it's something broader like writing SQL or extracting JSON,
00:04:50.000 | you know, you can try to run the SQL
00:04:52.000 | and see if it returns the expected result.
00:04:54.000 | That's very deterministic.
00:04:56.000 | Or you can check the extracted JSON keys
00:04:58.000 | and check if the JSON keys and the values match what you expect.
00:05:01.000 | These are still fairly easy to evaluate because we have expected answers.
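Here is one way those deterministic checks might look in practice: execute the generated SQL against a small in-memory database and compare result sets, and parse the extracted JSON to compare keys and values. The table schema and expected values are invented for illustration.

```python
import json
import sqlite3

# Sketch 1: evaluate generated SQL by executing it and comparing results.
def sql_matches(generated_sql: str, expected_rows: list[tuple]) -> bool:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as a failure
    finally:
        conn.close()
    return rows == expected_rows

# Sketch 2: evaluate extracted JSON by checking keys and values.
def json_matches(generated_json: str, expected: dict) -> bool:
    try:
        parsed = json.loads(generated_json)
    except json.JSONDecodeError:
        return False
    return all(parsed.get(k) == v for k, v in expected.items())

print(sql_matches("SELECT id FROM orders WHERE amount > 10", [(2,)]))  # True
print(json_matches('{"name": "Ada", "age": 36}', {"name": "Ada"}))     # True
```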
00:05:06.000 | But if your task is more open-ended, such as dialogue,
00:05:11.000 | you may have to rely on a strong LLM to evaluate the output.
00:05:14.000 | However, this can be really expensive.
00:05:16.000 | Here's Jerry saying, you know, 60 evals with GPT-4 cost him a lot.
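For the open-ended case, a minimal LLM-as-judge sketch might look like the following; `judge_llm`, the rubric, and the 1-to-5 scale are assumptions rather than any particular vendor's API.

```python
# Minimal sketch: LLM-as-judge for open-ended output such as dialogue.
# `judge_llm` is a hypothetical placeholder for a call to a strong model.

JUDGE_PROMPT = """You are grading an assistant's reply.
Question: {question}
Reply: {reply}
Score the reply from 1 (poor) to 5 (excellent) for helpfulness and accuracy.
Answer with a single integer."""

def judge_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to a strong model")

def grade_reply(question: str, reply: str) -> int:
    raw = judge_llm(JUDGE_PROMPT.format(question=question, reply=reply))
    try:
        return max(1, min(5, int(raw.strip())))
    except ValueError:
        return 1  # treat unparseable grades as failures
```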
00:05:20.000 | Finally, even if you have automated evals,
00:05:24.000 | I think we shouldn't discount the value of eyeballing the output.
00:05:28.000 | Here's Jonathan from Mosaic.
00:05:29.000 | "I don't believe that any of these evals capture what we care about."
00:05:33.000 | They have a prompt to generate games for a 3-year-old and a 7-year-old,
00:05:38.000 | and it was more effective for them to actually just eyeball the output
00:05:41.000 | as it trains throughout the epochs.
00:05:44.000 | Okay, that's it for evals.
00:05:46.000 | Now, retrieval-augmented generation.
00:05:48.000 | I don't think I have to convince you all here
00:05:50.000 | why we need retrieval-augmented generation,
00:05:52.000 | but, you know, it lets us add knowledge to our model as input context
00:05:56.000 | where we don't have to rely solely on the model's knowledge.
00:05:59.000 | And second, it's far more practical, right?
00:06:01.000 | It's cheaper and more precise than continuously fine-tuning to add new knowledge.
00:06:05.000 | But retrieving the right documents is really hard.
00:06:09.000 | Nonetheless, we have great speakers, Jerry and Anton,
00:06:11.000 | sharing about this topic tomorrow,
00:06:13.000 | so I won't go into the challenges of retrieval here.
00:06:16.000 | Instead, I'd like to focus on the LLM side of things, right,
00:06:19.000 | and discuss some of the challenges that remain
00:06:21.000 | even if we have retrieval-augmented generation.
00:06:24.000 | The first is that LLMs can't really see all the documents you retrieve.
00:06:30.000 | Here's an interesting experiment, right?
00:06:32.000 | The task is retrieval-augmented generation.
00:06:34.000 | You know, they have historical queries on Google
00:06:37.000 | and hand-annotated answers from Wikipedia.
00:06:39.000 | As part of the context, they provide 20 documents.
00:06:42.000 | Each of these documents is, at most, 100 tokens long.
00:06:45.000 | So that means 2,000 tokens maximum.
00:06:48.000 | And one of these documents contains the answer,
00:06:50.000 | and the rest are simply distractors.
00:06:52.000 | So the question they had was this:
00:06:54.000 | How would the position of the document containing the answer
00:06:58.000 | affect question answering?
00:06:59.000 | Now, some of you may have seen this before.
00:07:01.000 | Don't spoil it for the rest.
00:07:03.000 | If the answer is in the first retrieved document, accuracy is the highest.
00:07:08.000 | If it's in the last, accuracy is decent.
00:07:12.000 | But if it's somewhere in the middle, the accuracy is actually worse
00:07:16.000 | than having no retrieval-augmented generation at all.
00:07:19.000 | So what does this mean?
00:07:20.000 | It means that even if context window sizes are growing,
00:07:25.000 | we shouldn't allow our retrieval to get worse, right?
00:07:30.000 | Getting the most relevant documents to rank highly still matters,
00:07:34.000 | regardless of how big the context size is.
00:07:36.000 | And also, even if the answer is in the context and in the top position,
00:07:42.000 | accuracy is only 75%.
00:07:44.000 | So that means even with perfect retrieval,
00:07:46.000 | you can still expect some mistakes.
00:07:48.000 | Now, another gotcha is that LLMs can't really tell
00:07:53.000 | if the retrieved context is irrelevant.
00:07:55.000 | Here's a simple example.
00:07:57.000 | So here are 20 top sci-fi movies,
00:08:00.000 | and you can think of these as movies that I like.
00:08:03.000 | And I asked the LLM if I would like Twilight.
00:08:06.000 | So for folks not familiar with Twilight,
00:08:08.000 | you know, it's romantic fantasy, girl, vampire, werewolf,
00:08:11.000 | something like that.
00:08:12.000 | But I think I've never watched it before.
00:08:16.000 | But I have a really important instruction.
00:08:19.000 | If it doesn't think I would like Twilight
00:08:22.000 | because I've watched all these sci-fi movies,
00:08:24.000 | it should reply with not applicable.
00:08:26.000 | And this is pretty important in recommendations.
00:08:28.000 | We don't want to make bad recommendations.
00:08:30.000 | So here's what happened.
00:08:32.000 | First, it notes that Twilight is a different genre
00:08:35.000 | and not quite sci-fi, which is fantastic, right?
00:08:38.000 | But then it suggests E.T. because of interspecies relationships.
00:08:43.000 | I mean, I'm not sure how I feel about that.
00:08:49.000 | Yeah, I mean, how would you feel
00:08:53.000 | if you got this for a movie recommendation?
00:08:55.000 | But the point is, these LLMs are so fine-tuned to be helpful,
00:08:58.000 | and it's really smart.
00:09:00.000 | And they try their best to give an answer,
00:09:02.000 | but sometimes it's really hard to get them to say something that's not relevant,
00:09:05.000 | especially something that's fuzzy like this, right?
00:09:08.000 | So how do we best address these limitations in RAG?
00:09:12.000 | Well, I think that there are a lot of great ideas
00:09:14.000 | in the field of information retrieval.
00:09:16.000 | Search and recommendations have been trying to figure out how to show the most relevant documents on top,
00:09:21.000 | and I think they've gotten it to work really well.
00:09:23.000 | And there's a lot that we can learn from them.
00:09:25.000 | Second, LLMs may not know that the retrieved document is irrelevant, right?
00:09:29.000 | I think it helps to include a threshold to exclude irrelevant documents.
00:09:34.000 | So in the Twilight and sci-fi movie example,
00:09:36.000 | I bet we could do something like just measuring item distance between those two,
00:09:40.000 | and if it's too far, we don't go to the next step.
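A rough sketch of that kind of relevance threshold, assuming you have some embedding model available; `embed` is a hypothetical placeholder and the 0.3 cutoff is arbitrary.

```python
import math

# Sketch: skip the LLM call when the candidate item is too far from the
# user's liked items. `embed` is a hypothetical placeholder for whatever
# embedding model you use; the 0.3 cutoff is an arbitrary example.

def embed(text: str) -> list[float]:
    raise NotImplementedError("replace with your embedding model")

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_relevant(candidate: str, liked_items: list[str], cutoff: float = 0.3) -> bool:
    cand_vec = embed(candidate)
    best = max(cosine(cand_vec, embed(item)) for item in liked_items)
    # Below the cutoff, reply "not applicable" without ever asking the LLM.
    return best >= cutoff
```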
00:09:43.000 | Next, guardrails.
00:09:46.000 | So guardrails are really important in production.
00:09:48.000 | We want to make sure what we deploy is safe.
00:09:50.000 | For what's safe, we can look at OpenAI's moderation API:
00:09:54.000 | hate, harassment, self-harm, all that good stuff.
00:09:58.000 | But another thing that I also think about a lot is guardrails on factual consistency,
00:10:03.000 | or what we call hallucinations.
00:10:05.000 | I think it's really important so that you don't have trust-busting experiences.
00:10:10.000 | You can also think of these as evals for hallucination.
00:10:13.000 | Fortunately, or unfortunately,
00:10:16.000 | the field of summarization has been trying to tackle this for a very long time,
00:10:19.000 | and we can take a page from that playbook.
00:10:21.000 | So one approach to this is via the natural language inference task.
00:10:27.000 | In a nutshell, given a premise and a hypothesis, we classify if the hypothesis is true or false.
00:10:33.000 | So given a premise, John likes all fruits, the hypothesis that John likes apples is true,
00:10:38.000 | therefore it's entailment.
00:10:40.000 | Because there's not enough information to confirm if John eats apples daily, it's neutral.
00:10:45.000 | And finally, John dislikes apples, clearly false, therefore contradiction.
00:10:49.000 | Do you see how we can apply this to document summarization?
00:10:53.000 | The premise is the document.
00:10:55.000 | And this hypothesis is the summary.
00:10:58.000 | And it just works.
00:11:00.000 | Now, when doing this, though, it helps to apply it at the sentence instead of the entire document level.
00:11:05.000 | So in this example here, the last sentence in the summary is incorrect.
00:11:10.000 | So if we run the NLI task on the entire document and summary, it's going to say that the entire summary is correct.
00:11:15.000 | But if you run it at the sentence level, it's able to tell you that the last sentence in the summary is incorrect.
00:11:21.000 | And they included a really nice ablation study, right, where they checked the granularity of the document.
00:11:27.000 | As we got finer and finer, from document to paragraph to sentence, the accuracy of detecting factual inconsistency goes up.
00:11:34.000 | That's pretty amazing.
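Here's a sketch of how a sentence-level NLI check could be wired up with the Hugging Face transformers library; the `roberta-large-mnli` checkpoint is just one common choice, and the naive period-based sentence splitting is a simplification.

```python
# Sketch: sentence-level factual-consistency check via NLI.
# Assumes the `transformers` and `torch` packages; "roberta-large-mnli"
# is one common NLI checkpoint, not the only option.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def sentence_is_consistent(document: str, sentence: str) -> bool:
    """Premise = source document, hypothesis = one summary sentence."""
    inputs = tokenizer(document, sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[int(logits.argmax(dim=-1))]
    return label.upper() == "ENTAILMENT"  # strict: require entailment

def summary_is_consistent(document: str, summary: str) -> bool:
    # Naive sentence split; a real system would use a proper splitter.
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return all(sentence_is_consistent(document, s) for s in sentences)
```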
00:11:36.000 | Now, another approach is sampling, right?
00:11:38.000 | And here's an example from SelfCheckGPT.
00:11:41.000 | Given an input document, we generate a summary multiple times.
00:11:44.000 | Now, we check if those summaries are similar to each other.
00:11:48.000 | N-gram overlap, BERTScore, et cetera.
00:11:51.000 | The assumption is that if the summaries are very different, it probably means that they're not grounded on the context document.
00:11:57.000 | And therefore, likely hallucinating.
00:11:59.000 | But if they're quite similar, you can assume that they're grounded effectively and therefore factual.
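A bare-bones sketch of that sampling idea, using unigram overlap as the agreement measure; `generate_summary` is a hypothetical stand-in for a sampled model call, and SelfCheckGPT itself uses more sophisticated scorers.

```python
from itertools import combinations

# Sketch: sample several summaries and measure how much they agree.
# `generate_summary` is a hypothetical stand-in for a non-greedy model call.

def generate_summary(document: str) -> str:
    raise NotImplementedError("replace with a sampled model call")

def unigram_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0

def consistency_score(document: str, n_samples: int = 5) -> float:
    samples = [generate_summary(document) for _ in range(n_samples)]
    pairs = list(combinations(samples, 2))
    return sum(unigram_overlap(a, b) for a, b in pairs) / len(pairs)

# A low score means the samples disagree, i.e. the summary is likely not
# grounded in the document; a high score suggests it is grounded.
```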
00:12:03.000 | And the final approach is asking a strong LLM.
00:12:06.000 | You know, conceptually, it's simple.
00:12:09.000 | Given an input document and summary, they get the LLM to return a summary score.
00:12:12.000 | And this LLM has to be pretty strong.
00:12:14.000 | And we have seen that strong LLMs are actually quite expensive.
00:12:17.000 | But in the case of factual consistency, I've seen simpler methods outperform LLM-based approaches at a far lower cost.
00:12:28.000 | So, try to keep things simple if you can.
00:12:31.000 | Okay.
00:12:32.000 | Now, to close the loop, let's touch briefly on collecting feedback.
00:12:36.000 | And I'm going to need audience help here.
00:12:38.000 | So, why is collecting feedback important?
00:12:41.000 | Because we want to understand what our customers like and don't like.
00:12:45.000 | And then the magic thing here is that collecting feedback helps you build your evals and fine-tuning data set.
00:12:51.000 | New models come and go every day, but your evals and fine-tuning data set, that's your transferable asset that you can always use.
00:12:59.000 | But collecting feedback from users is not as easy as it seems.
00:13:03.000 | So, explicit feedback can be sparse.
00:13:06.000 | Sparse means very low in number.
00:13:08.000 | And explicit feedback is feedback we ask users for.
00:13:10.000 | So, here's a quick thought experiment.
00:13:11.000 | How many of you here use ChatGPT?
00:13:14.000 | Okay.
00:13:15.000 | I see a lot of you.
00:13:16.000 | How many of you here actually click the thumbs up and thumbs down button?
00:13:20.000 | Excellent.
00:13:21.000 | Okay.
00:13:22.000 | But these are the beta testers, right?
00:13:24.000 | But you can see it's very small in number.
00:13:26.000 | So, even if you include this thumbs up, thumbs down button, you may not be getting the feedback you expect.
00:13:31.000 | Now, if the issue with explicit feedback is sparsity, then the issue with implicit feedback is noise.
00:13:38.000 | So, implicit feedback is the feedback you get as users organically use your product, right?
00:13:43.000 | You don't have to ask them for feedback, but you get this feedback.
00:13:45.000 | So, here's the same example.
00:13:47.000 | How often do you click the copy code button?
00:13:50.000 | The rest of you just type it out like a madman?
00:13:54.000 | Okay.
00:13:55.000 | So, but does clicking the copy code button mean that the code is correct?
00:14:00.000 | In this case, no.
00:14:02.000 | nrows is not a valid argument for pandas' read_parquet.
00:14:06.000 | But if we were to consider all code snippets that were copied as positive feedback, we would have a lot of bad data in our training.
00:14:13.000 | So, think about that.
00:14:15.000 | So, how do we collect feedback?
00:14:17.000 | I don't have any good answers, but here are two apps I've seen do it really well.
00:14:20.000 | First one, GitHub Copilot or any kind of coding assistant, right?
00:14:23.000 | For people not familiar with it, you type some function signature, some comments, and it suggests code.
00:14:29.000 | You can either accept the code, reject the code, move on to the next suggestion.
00:14:33.000 | We do this dozens of times a day.
00:14:35.000 | Imagine how much feedback they get from this, right?
00:14:39.000 | Here's a golden data set.
00:14:41.000 | Another example is Midjourney.
00:14:42.000 | For folks not familiar with Midjourney, you write a prompt, it suggests four images.
00:14:47.000 | And then based on those images, you can rerun the prompt, you can vary an image, that's what the V stands for, or you can upscale an image, that's what the U stands for.
00:14:58.000 | But do you know what an AI engineer sees?
00:15:02.000 | Rerunning the prompt is negative reward, where the user doesn't like any of the images.
00:15:07.000 | Varying the image is a small positive reward, where the user is saying, this one has potential, but tweak it slightly.
00:15:14.000 | And choosing the upscale image is a large positive reward, where the user likes it and just wants to use it.
00:15:20.000 | So think about this, think about how we can build this implicit feedback data flywheel into your products,
00:15:26.000 | so that you quickly understand what users like and don't like.
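One way those UI actions could translate into training data, assuming a simple event log; the action names and reward values here are illustrative, not Midjourney's actual scheme.

```python
# Sketch: mapping implicit UI actions to reward labels for a feedback dataset.
# The action names and reward values are illustrative assumptions.

ACTION_REWARD = {
    "rerun_prompt": -1.0,   # user liked none of the images
    "vary_image":    0.5,   # user sees potential but wants a tweak
    "upscale_image": 1.0,   # user likes it enough to use it
}

def label_events(events: list[dict]) -> list[dict]:
    """Attach a reward to each logged (prompt, image, action) event."""
    return [
        {**event, "reward": ACTION_REWARD[event["action"]]}
        for event in events
        if event.get("action") in ACTION_REWARD
    ]

events = [
    {"prompt": "a red fox in the snow", "image_id": "img_3", "action": "upscale_image"},
    {"prompt": "a red fox in the snow", "image_id": "img_1", "action": "rerun_prompt"},
]
print(label_events(events))
```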
00:15:29.000 | Oh, sorry, you can take your phone out.
00:15:32.000 | All slides available after the talk.
00:15:35.000 | So that's all I wanted to share.
00:15:37.000 | If you remember anything from this talk, I hope it's these three things.
00:15:42.000 | You need automated evals.
00:15:45.000 | You need automated evals.
00:15:48.000 | Just annotate 30 or 100 examples and start from there, right?
00:15:52.000 | And then figure out how to automate it.
00:15:54.000 | It will help you iterate faster, right?
00:15:55.000 | On your prompt engineering, on your retrieval-augmented generation, on your fine-tuning.
00:15:58.000 | Help you deploy safer.
00:16:00.000 | I mean, this is a huge conference of engineers.
00:16:03.000 | I don't think I have to explain to you the need for testing.
00:16:06.000 | Eyeballing doesn't scale.
00:16:08.000 | It's good as a final vibe check, but it just doesn't scale.
00:16:11.000 | Every time you update the prompt, you just want to run your evals immediately, right?
00:16:14.000 | I run tens of experiments every day, and the only way I can do this is with automated evals.
00:16:20.000 | Second, reuse your existing systems as much as you can.
00:16:25.000 | There's no need to reinvent the wheel.
00:16:27.000 | BM25, metadata matching can get you pretty far.
00:16:31.000 | And so do the techniques from recommendation systems, right?
00:16:34.000 | Two-stage retrieval and ranking, filtering, et cetera.
00:16:38.000 | All these information retrieval techniques are optimized to rank the most relevant items on top.
00:16:44.000 | So don't forget about them.
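As a tiny sketch of that two-stage pattern, assuming the `rank_bm25` package for first-stage retrieval and leaving the reranker as a placeholder hook:

```python
from rank_bm25 import BM25Okapi  # assumes: pip install rank-bm25

# Tiny sketch of two-stage retrieval: BM25 fetches candidates, a reranker
# reorders them. The rerank step here is only a placeholder hook.

corpus = [
    "The 2017 paper introduced the transformer architecture.",
    "BM25 is a classic lexical ranking function used in search.",
    "Two-stage retrieval first fetches candidates, then reranks them.",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
    return [corpus[i] for i in top]

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Placeholder: swap in a cross-encoder or other reranker here.
    return candidates

query = "what is two-stage retrieval?"
print(rerank(query, retrieve(query)))
```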
00:16:46.000 | And finally, UX plays a large role in the LLM products.
00:16:51.000 | I think that a big chunk of GitHub Copilot and ChatGPT is UX.
00:16:56.000 | It allows you to use LLMs in your own context without calling an API.
00:17:01.000 | You can use them in an IDE or in a chat window.
00:17:03.000 | Similarly, UX makes it far more effective for you to collect user feedback.
00:17:08.000 | Okay.
00:17:09.000 | That's all I had.
00:17:10.000 | Thank you, and keep on building.
00:17:12.000 | Thank you.