
Building Blocks for LLM Systems & Products: Eugene Yan


Chapters

0:00 Introduction
0:41 Evaluations
9:43 Guardrails
12:31 Collecting Feedback
15:35 Summary

Whisper Transcript

00:00:00.000 | Eugene Yan:
00:00:15.000 | Eugene Yan: Thank you.
00:00:16.000 | Thank you, everyone.
00:00:17.000 | I'm Eugene Yan, and today I want to share with you
00:00:19.000 | about some building blocks for LLM systems and products.
00:00:23.000 | Like many of you here, I'm trying to figure out
00:00:25.000 | how to effectively use these LLMs in production.
00:00:29.000 | So, a few months ago, to clarify my thinking,
00:00:31.000 | I wrote some patterns about building LLM systems and products,
00:00:34.000 | and the community seemed to like it.
00:00:36.000 | There was Jason asking for this to be a seminar,
00:00:39.000 | so here you go, Jason.
00:00:41.000 | Today, I'm going to focus on four of those patterns:
00:00:44.000 | evaluations, retrieval-augmented generation,
00:00:47.000 | guardrails, and collecting feedback.
00:00:50.000 | All the slides will be made available after this talk,
00:00:53.000 | so I ask you to just focus.
00:00:55.000 | Buckle up, hang on tight, because we'll be going really fast.
00:00:58.000 | All right, let's start with evals,
00:01:01.000 | or what I really consider the foundation of it all.
00:01:04.000 | Why do we need evals?
00:01:06.000 | Well, evals help us understand whether our prompt engineering,
00:01:09.000 | our retrieval augmentation, or our fine-tuning
00:01:11.000 | is doing anything at all.
00:01:13.000 | Right?
00:01:14.000 | Consider eval-driven development,
00:01:16.000 | where evals guide how you build your system and product.
00:01:19.000 | We can also think of evals as test cases, right?
00:01:21.000 | Where we run these evals before deploying any new changes.
00:01:25.000 | It makes us feel safe.
00:01:27.000 | And finally, if managers at OpenAI take the time to write evals
00:01:31.000 | or give feedback on them,
00:01:33.000 | you know it's pretty important.
00:01:38.000 | But building evals is hard.
00:01:39.000 | Here are some things I've seen folks trip up on.
00:01:42.000 | Firstly, we don't have a consistent approach to evals.
00:01:46.000 | If you think about more conventional machine learning:
00:01:48.000 | for regression, we have root-mean-square error;
00:01:50.000 | for classification, precision and recall;
00:01:52.000 | and even for ranking, NDCG.
00:01:54.000 | All these metrics are pretty straightforward,
00:01:56.000 | and there's usually only one way to compute them.
00:01:58.000 | But what about for LLMs?
00:02:00.000 | Well, we have this benchmark whereby we write a prompt,
00:02:03.000 | there's a multiple-choice question,
00:02:05.000 | we evaluate the model's ability to get it right.
00:02:07.000 | MMLU is an example that's widely used
00:02:10.000 | where it assesses LLMs on knowledge and reasoning ability,
00:02:13.000 | you know, computer science questions, math, US history, et cetera.
00:02:16.000 | But there's no consistent way to run MMLU.
00:02:20.000 | Less than a week ago, Arvind and Sayash from Princeton
00:02:23.000 | wrote that evaluating LLMs is a minefield.
00:02:26.000 | They ask, "Are we assessing prompt sensitivity?
00:02:30.000 | Are we assessing the LLM, or are we assessing our prompt
00:02:33.000 | to get the LLM to give us what we want?"
00:02:35.000 | On the same day, Anthropic noted that the simple MCQ
00:02:39.000 | may not be as simple as it seems.
00:02:41.000 | Simple formatting changes, such as different parentheses,
00:02:44.000 | lead to changes in accuracy.
00:02:46.000 | And there's no consistent way to do this.
00:02:49.000 | As a result, it makes it really difficult to compare models
00:02:52.000 | based on these academic benchmarks.
00:02:55.000 | Now, speaking of academic benchmarks,
00:02:57.000 | we may have outgrown some of them.
00:02:59.000 | For example, this task of summarization.
00:03:01.000 | On the top, you see the human evaluation scores
00:03:04.000 | on the reference summaries.
00:03:05.000 | And on the bottom, you see the evaluation scores for the automated summaries.
00:03:10.000 | You don't have to go through all the numbers there, but the point is that all the numbers
00:03:14.000 | on the bottom are already higher than the numbers on top.
00:03:18.000 | Here's another one that's more recent on the XSUM dataset, extreme summarization,
00:03:22.000 | where you see that all the human evaluation scores are lower than InstructGPT's.
00:03:27.000 | And that's not even GPT-4.
00:03:30.000 | Now, finally, with all these benchmarks being so easily available,
00:03:33.000 | we sometimes forget to ask ourselves, hey, is it a fit for our task?
00:03:37.000 | If you think about it, does MMLU really apply to your task?
00:03:42.000 | Maybe, if you're building a college-level chatbot, right?
00:03:45.000 | But here's Linus reminding us that we should be measuring our apps on our tasks
00:03:51.000 | and not just relying on academic evals.
00:03:54.000 | So how do we do evals?
00:03:56.000 | Well, I think as an industry, we're still figuring it out.
00:03:58.000 | Folks point out it's the number one challenge out there.
00:04:00.000 | And we hear so many people talk about evals.
00:04:02.000 | I think there are some tenets emerging.
00:04:05.000 | Firstly, I think we should build evals for our specific task.
00:04:09.000 | And it's okay to start small.
00:04:10.000 | It may seem daunting, but it's okay to start small.
00:04:12.000 | How small?
00:04:13.000 | Well, here's Teknium.
00:04:14.000 | You know, he releases a lot of open-source models.
00:04:17.000 | He starts with an eval set of 40 questions for his domain expert tasks.
00:04:21.000 | 40 evals.
00:04:22.000 | That's all it takes, and it can go very far.
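To make "start small" concrete, here's a minimal sketch of a tiny, task-specific eval harness: a handful of hand-written question-and-expected-answer pairs scored by substring match. The `ask_llm` function is a hypothetical placeholder for whatever model call you use, and the example questions are made up.

```python
# Minimal sketch of a tiny, task-specific eval set with substring-match scoring.
# `ask_llm` is a hypothetical placeholder for your actual model call.

EVAL_SET = [
    {"question": "What does NDCG stand for?",
     "expected": "normalized discounted cumulative gain"},
    {"question": "Which pandas function reads a Parquet file?",
     "expected": "read_parquet"},
    # ... grow this to ~40 examples for your own domain
]

def ask_llm(question: str) -> str:
    raise NotImplementedError("replace with your model call")

def run_evals() -> float:
    correct = sum(
        1 for case in EVAL_SET
        if case["expected"].lower() in ask_llm(case["question"]).lower()
    )
    return correct / len(EVAL_SET)

if __name__ == "__main__":
    print(f"accuracy: {run_evals():.2%}")
```

Forty such cases and a one-line scorer already give you a number you can track every time the prompt or model changes.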
00:04:25.000 | Second, we should try to simplify the task as much as we can.
00:04:29.000 | You know, while LLMs are very flexible,
00:04:31.000 | I think we have a better chance if we try to make it more specific.
00:04:34.000 | For example, if you're using an LLM for content moderation tasks,
00:04:38.000 | you can fall back to simple precision and recall.
00:04:40.000 | How often is it catching toxicity?
00:04:42.000 | How often is it catching bias?
00:04:43.000 | How often is it catching hallucination?
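As a rough illustration of falling back to classification metrics, here's a small sketch that scores a moderation guardrail with precision and recall; the labels and predictions below are made-up toy data.

```python
# Sketch: scoring a moderation guardrail with precision and recall.
# `labels` are human judgments, `preds` are the LLM's toxic/not-toxic flags.

def precision_recall(labels: list[bool], preds: list[bool]) -> tuple[float, float]:
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: five labeled comments and the model's toxicity flags.
labels = [True, False, True, False, True]
preds  = [True, False, False, False, True]
p, r = precision_recall(labels, preds)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.67
```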
00:04:46.000 | Next, if it's something broader like writing SQL or extracting JSON,
00:04:50.000 | you know, you can try to run the SQL
00:04:52.000 | and see if it returns the expected result.
00:04:54.000 | That's very deterministic.
00:04:56.000 | Or you can check the extracted JSON keys
00:04:58.000 | and check if the JSON keys and the values match what you expect.
00:05:01.000 | These are still fairly easy to evaluate because we have expected answers.
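Here is one way those deterministic checks might look in practice: execute the generated SQL against a small in-memory database and compare result sets, and parse the extracted JSON to compare keys and values. The table schema and expected values are invented for illustration.

```python
import json
import sqlite3

# Sketch 1: evaluate generated SQL by executing it and comparing results.
def sql_matches(generated_sql: str, expected_rows: list[tuple]) -> bool:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as a failure
    finally:
        conn.close()
    return rows == expected_rows

# Sketch 2: evaluate extracted JSON by checking keys and values.
def json_matches(generated_json: str, expected: dict) -> bool:
    try:
        parsed = json.loads(generated_json)
    except json.JSONDecodeError:
        return False
    return all(parsed.get(k) == v for k, v in expected.items())

print(sql_matches("SELECT id FROM orders WHERE amount > 10", [(2,)]))  # True
print(json_matches('{"name": "Ada", "age": 36}', {"name": "Ada"}))     # True
```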
00:05:06.000 | But if your task is more open-ended, such as dialogue,
00:05:11.000 | you may have to rely on a strong LLM to evaluate the output.
00:05:14.000 | However, this can be really expensive.
00:05:16.000 | Here's Jerry saying, you know, 60 evals with GPT-4 cost him a lot.
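For the open-ended case, a minimal LLM-as-judge sketch might look like the following; `judge_llm`, the rubric, and the 1-to-5 scale are assumptions rather than any particular vendor's API.

```python
# Minimal sketch: LLM-as-judge for open-ended output such as dialogue.
# `judge_llm` is a hypothetical placeholder for a call to a strong model.

JUDGE_PROMPT = """You are grading an assistant's reply.
Question: {question}
Reply: {reply}
Score the reply from 1 (poor) to 5 (excellent) for helpfulness and accuracy.
Answer with a single integer."""

def judge_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to a strong model")

def grade_reply(question: str, reply: str) -> int:
    raw = judge_llm(JUDGE_PROMPT.format(question=question, reply=reply))
    try:
        return max(1, min(5, int(raw.strip())))
    except ValueError:
        return 1  # treat unparseable grades as failures
```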
00:05:20.000 | Finally, even if you have automated evals,
00:05:24.000 | I think we shouldn't discount the value of eyeballing the output.
00:05:28.000 | Here's Jonathan from Mosaic.
00:05:29.000 | "I don't believe that any of these evals capture what we care about."
00:05:33.000 | They have a prompt to generate games for a 3-year-old and a 7-year-old,
00:05:38.000 | and it was more effective for them to actually just eyeball the output
00:05:41.000 | as it trains throughout the epochs.
00:05:44.000 | Okay, that's it for evals.
00:05:46.000 | Now, retrieval-augmented generation.
00:05:48.000 | I don't think I have to convince you all here
00:05:50.000 | why we need retrieval-augmented generation,
00:05:52.000 | but, you know, it lets us add knowledge to our model as input context
00:05:56.000 | where we don't have to rely solely on the model's knowledge.
00:05:59.000 | And second, it's far more practical, right?
00:06:01.000 | It's cheaper and more precise than continuously fine-tuning to add new knowledge.
00:06:05.000 | But retrieving the right documents is really hard.
00:06:09.000 | Nonetheless, we have great speakers, Jerry and Anton,
00:06:11.000 | sharing about this topic tomorrow,
00:06:13.000 | so I won't go into the challenges of retrieval here.
00:06:16.000 | Instead, I'd like to focus on the LLM side of things, right,
00:06:19.000 | and discuss some of the challenges that remain
00:06:21.000 | even if we have retrieval-augmented generation.
00:06:24.000 | The first is that LLMs can't really see all the documents you retrieve.
00:06:30.000 | Here's an interesting experiment, right?
00:06:32.000 | The task is retrieval-augmented generation.
00:06:34.000 | You know, they have historical queries on Google
00:06:37.000 | and hand-annotated answers from Wikipedia.
00:06:39.000 | As part of the context, they provide 20 documents.
00:06:42.000 | Each of these documents is, at most, 100 tokens long.
00:06:45.000 | So that means 2,000 tokens maximum.
00:06:48.000 | And one of these documents contains the answer,
00:06:50.000 | and the rest are simply distractors.
00:06:52.000 | So the question they had was this:
00:06:54.000 | How would the position of the document containing the answer
00:06:58.000 | affect question answering?
00:06:59.000 | Now, some of you may have seen this before.
00:07:01.000 | Don't spoil it for the rest.
00:07:03.000 | If the answer is in the first retrieved document, accuracy is the highest.
00:07:08.000 | If it's in the last, accuracy is decent.
00:07:12.000 | But if it's somewhere in the middle, the accuracy is actually worse
00:07:16.000 | than having no retrieval-augmented generation at all.
00:07:19.000 | So what does this mean?
00:07:20.000 | It means that even if context window sizes are growing,
00:07:25.000 | we shouldn't allow our retrieval to get worse, right?
00:07:30.000 | Getting the most relevant documents to rank highly still matters,
00:07:34.000 | regardless of how big the context size is.
00:07:36.000 | And also, even if the answer is in the context and in the top position,
00:07:42.000 | accuracy is only 75%.
00:07:44.000 | So that means even with perfect retrieval,
00:07:46.000 | you can still expect some mistakes.
00:07:48.000 | Now, another gotcha is that LLMs can't really tell
00:07:53.000 | if the retrieved context is irrelevant.
00:07:55.000 | Here's a simple example.
00:07:57.000 | So here are 20 top sci-fi movies,
00:08:00.000 | and you can think of these as movies that I like.
00:08:03.000 | And I asked the LLM if I would like Twilight.
00:08:06.000 | So for folks not familiar with Twilight,
00:08:08.000 | you know, it's romantic fantasy, girl, vampire, werewolf,
00:08:11.000 | something like that.
00:08:12.000 | But I think I've never watched it before.
00:08:16.000 | But I have a really important instruction.
00:08:19.000 | If it doesn't think I would like Twilight
00:08:22.000 | because I've watched all these sci-fi movies,
00:08:24.000 | it should reply with not applicable.
00:08:26.000 | And this is pretty important in recommendations.
00:08:28.000 | We don't want to make bad recommendations.
00:08:30.000 | So here's what happened.
00:08:32.000 | First, it notes that Twilight is a different genre
00:08:35.000 | and not quite sci-fi, which is fantastic, right?
00:08:38.000 | But then it suggests E.T. because of interspecies relationships.
00:08:43.000 | I mean, I'm not sure how I feel about that.
00:08:49.000 | Yeah, I mean, how would you feel
00:08:53.000 | if you got this for a movie recommendation?
00:08:55.000 | But the point is, these LLMs are so fine-tuned to be helpful,
00:08:58.000 | and it's really smart.
00:09:00.000 | And they try their best to give an answer,
00:09:02.000 | but sometimes it's really hard to get them to say something that's not relevant,
00:09:05.000 | especially something that's fuzzy like this, right?
00:09:08.000 | So how do we best address these limitations in RAG?
00:09:12.000 | Well, I think that there are a lot of great ideas
00:09:14.000 | in the field of information retrieval.
00:09:16.000 | Search and recommendations have been trying to figure out how to show the most relevant documents on top,
00:09:21.000 | and I think they've gotten it to work really well.
00:09:23.000 | And there's a lot that we can learn from them.
00:09:25.000 | Second, LLMs may not know that the retrieved document is irrelevant, right?
00:09:29.000 | I think it helps to include a threshold to exclude irrelevant documents.
00:09:34.000 | So in the Twilight and sci-fi movie example,
00:09:36.000 | I bet we could do something like just measuring item distance between those two,
00:09:40.000 | and if it's too far, we don't go to the next step.
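A rough sketch of that kind of relevance threshold, assuming you have some embedding model available; `embed` is a hypothetical placeholder and the 0.3 cutoff is arbitrary.

```python
import math

# Sketch: skip the LLM call when the candidate item is too far from the
# user's liked items. `embed` is a hypothetical placeholder for whatever
# embedding model you use; the 0.3 cutoff is an arbitrary example.

def embed(text: str) -> list[float]:
    raise NotImplementedError("replace with your embedding model")

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_relevant(candidate: str, liked_items: list[str], cutoff: float = 0.3) -> bool:
    cand_vec = embed(candidate)
    best = max(cosine(cand_vec, embed(item)) for item in liked_items)
    # Below the cutoff, reply "not applicable" without ever asking the LLM.
    return best >= cutoff
```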
00:09:43.000 | Next, guardrails.
00:09:46.000 | So guardrails are really important in production.
00:09:48.000 | We want to make sure what we deploy is safe.
00:09:50.000 | For what's safe, we can look at OpenAI's moderation API:
00:09:54.000 | hate, harassment, self-harm, all that good stuff.
00:09:58.000 | But another thing that I also think about a lot is guardrails on factual consistency,
00:10:03.000 | or what we call hallucinations.
00:10:05.000 | I think it's really important so that you don't have trust-busting experiences.
00:10:10.000 | You can also think of these as evals for hallucination.
00:10:13.000 | Fortunately, or unfortunately,
00:10:16.000 | the field of summarization has been trying to tackle this for a very long time,
00:10:19.000 | and we can take a page from that playbook.
00:10:21.000 | So one approach to this is via the natural language inference task.
00:10:27.000 | In a nutshell, given a premise and a hypothesis, we classify if the hypothesis is true or false.
00:10:33.000 | So given a premise, John likes all fruits, the hypothesis that John likes apples is true,
00:10:38.000 | therefore it's entailment.
00:10:40.000 | Because there's not enough information to confirm if John eats apples daily, it's neutral.
00:10:45.000 | And finally, John dislikes apples, clearly false, therefore contradiction.
00:10:49.000 | Do you see how we can apply this to document summarization?
00:10:53.000 | The premise is the document.
00:10:55.000 | And this hypothesis is the summary.
00:10:58.000 | And it just works.
00:11:00.000 | Now, when doing this, though, it helps to apply it at the sentence instead of the entire document level.
00:11:05.000 | So in this example here, the last sentence in the summary is incorrect.
00:11:10.000 | So if we run the NLI task on the entire document and summary, it's going to say that the entire summary is correct.
00:11:15.000 | But if you run it at the sentence level, it's able to tell you that the last sentence in the summary is incorrect.
00:11:21.000 | And they included a really nice ablation study, right, where they checked the granularity of the document.
00:11:27.000 | As we got finer and finer, from document to paragraph to sentence, the accuracy of detecting factual inconsistency goes up.
00:11:34.000 | That's pretty amazing.
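Here's a sketch of how a sentence-level NLI check could be wired up with the Hugging Face transformers library; the `roberta-large-mnli` checkpoint is just one common choice, and the naive period-based sentence splitting is a simplification.

```python
# Sketch: sentence-level factual-consistency check via NLI.
# Assumes the `transformers` and `torch` packages; "roberta-large-mnli"
# is one common NLI checkpoint, not the only option.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def sentence_is_consistent(document: str, sentence: str) -> bool:
    """Premise = source document, hypothesis = one summary sentence."""
    inputs = tokenizer(document, sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[int(logits.argmax(dim=-1))]
    return label.upper() == "ENTAILMENT"  # strict: require entailment

def summary_is_consistent(document: str, summary: str) -> bool:
    # Naive sentence split; a real system would use a proper splitter.
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return all(sentence_is_consistent(document, s) for s in sentences)
```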
00:11:36.000 | Now, another approach is sampling, right?
00:11:38.000 | And here's an example from SelfCheckGPT.
00:11:41.000 | Given an input document, we generate a summary multiple times.
00:11:44.000 | Now, we check if those summaries are similar to each other.
00:11:48.000 | N-gram overlap, BERTScore, et cetera.
00:11:51.000 | The assumption is that if the summaries are very different, it probably means that they're not grounded on the context document.
00:11:57.000 | And therefore, likely hallucinating.
00:11:59.000 | But if they're quite similar, you can assume that they're grounded effectively and therefore factual.
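A bare-bones sketch of that sampling idea, using unigram overlap as the agreement measure; `generate_summary` is a hypothetical stand-in for a sampled model call, and SelfCheckGPT itself uses more sophisticated scorers.

```python
from itertools import combinations

# Sketch: sample several summaries and measure how much they agree.
# `generate_summary` is a hypothetical stand-in for a non-greedy model call.

def generate_summary(document: str) -> str:
    raise NotImplementedError("replace with a sampled model call")

def unigram_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0

def consistency_score(document: str, n_samples: int = 5) -> float:
    samples = [generate_summary(document) for _ in range(n_samples)]
    pairs = list(combinations(samples, 2))
    return sum(unigram_overlap(a, b) for a, b in pairs) / len(pairs)

# A low score means the samples disagree, i.e. the summary is likely not
# grounded in the document; a high score suggests it is grounded.
```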
00:12:03.000 | And the final approach is asking a strong LLM.
00:12:06.000 | You know, conceptually, it's simple.
00:12:09.000 | Given an input document and summary, they get the LLM to return a summary score.
00:12:12.000 | And this LLM has to be pretty strong.
00:12:14.000 | And we have seen that strong LLMs are actually quite expensive.
00:12:17.000 | But in the case of factual consistency, I've seen simpler methods outperform LLM-based approaches at a far lower cost.
00:12:28.000 | So, try to keep things simple if you can.
00:12:31.000 | Okay.
00:12:32.000 | Now, to close the loop, let's touch briefly on collecting feedback.
00:12:36.000 | And I'm going to need audience help here.
00:12:38.000 | So, why is collecting feedback important?
00:12:41.000 | Because we want to understand what our customers like and don't like.
00:12:45.000 | And then the magic thing here is that collecting feedback helps you build your evals and fine-tuning data set.
00:12:51.000 | New models come and go every day, but your evals and fine-tuning data set, that's your transferable asset that you can always use.
00:12:59.000 | But collecting feedback from users is not as easy as it seems.
00:13:03.000 | So, explicit feedback can be sparse.
00:13:06.000 | Sparse means very low in number.
00:13:08.000 | And explicit feedback is feedback we ask users for.
00:13:10.000 | So, here's a quick thought experiment.
00:13:11.000 | How many of you here use ChatGPT?
00:13:14.000 | Okay.
00:13:15.000 | I see a lot of you.
00:13:16.000 | How many of you here actually click the thumbs up and thumbs down button?
00:13:20.000 | Excellent.
00:13:21.000 | Okay.
00:13:22.000 | But these are the beta testers, right?
00:13:24.000 | But you can see it's very small in number.
00:13:26.000 | So, even if you include this thumbs up, thumbs down button, you may not be getting the feedback you expect.
00:13:31.000 | Now, if the issue with explicit feedback is sparsity, then the issue with implicit feedback is noise.
00:13:38.000 | So, implicit feedback is the feedback you get as users organically use your product, right?
00:13:43.000 | You don't have to ask them for feedback, but you get this feedback.
00:13:45.000 | So, here's the same example.
00:13:47.000 | How often do you click the copy code button?
00:13:50.000 | The rest of you just type it out like a madman?
00:13:54.000 | Okay.
00:13:55.000 | So, but does clicking the copy code button mean that the code is correct?
00:14:00.000 | In this case, no.
00:14:02.000 | nrows is not a valid argument for pandas' read_parquet.
00:14:06.000 | But if we were to consider all code snippets that were copied as positive feedback, we would have a lot of bad data in our training.
00:14:13.000 | So, think about that.
00:14:15.000 | So, how do we collect feedback?
00:14:17.000 | I don't have any good answers, but here are two apps I've seen do it really well.
00:14:20.000 | First one, GitHub Copilot or any kind of coding assistant, right?
00:14:23.000 | For people not familiar with it, you type some function signature, some comments, and it suggests code.
00:14:29.000 | You can either accept the code, reject the code, move on to the next suggestion.
00:14:33.000 | We do this dozens of times a day.
00:14:35.000 | Imagine how much feedback they get from this, right?
00:14:39.000 | Here's a golden data set.
00:14:41.000 | Another example is Midjourney.
00:14:42.000 | For folks not familiar with Midjourney, you write a prompt, it suggests four images.
00:14:47.000 | And then based on those images, you can rerun the prompt, you can vary an image, that's what the V stands for, or you can upscale an image, that's what the U stands for.
00:14:58.000 | But do you know what an AI engineer sees?
00:15:02.000 | Rerunning the prompt is negative reward, where the user doesn't like any of the images.
00:15:07.000 | Varying the image is a small positive reward, where the user is saying, this one has potential, but tweak it slightly.
00:15:14.000 | And choosing the upscale image is a large positive reward, where the user likes it and just wants to use it.
00:15:20.000 | So think about this, think about how we can build this implicit feedback data flywheel into your products,
00:15:26.000 | so that you quickly understand what users like and don't like.
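One way those UI actions could translate into training data, assuming a simple event log; the action names and reward values here are illustrative, not Midjourney's actual scheme.

```python
# Sketch: mapping implicit UI actions to reward labels for a feedback dataset.
# The action names and reward values are illustrative assumptions.

ACTION_REWARD = {
    "rerun_prompt": -1.0,   # user liked none of the images
    "vary_image":    0.5,   # user sees potential but wants a tweak
    "upscale_image": 1.0,   # user likes it enough to use it
}

def label_events(events: list[dict]) -> list[dict]:
    """Attach a reward to each logged (prompt, image, action) event."""
    return [
        {**event, "reward": ACTION_REWARD[event["action"]]}
        for event in events
        if event.get("action") in ACTION_REWARD
    ]

events = [
    {"prompt": "a red fox in the snow", "image_id": "img_3", "action": "upscale_image"},
    {"prompt": "a red fox in the snow", "image_id": "img_1", "action": "rerun_prompt"},
]
print(label_events(events))
```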
00:15:29.000 | Oh, sorry, you can take your phone out.
00:15:32.000 | All slides available after the talk.
00:15:35.000 | So that's all I wanted to share.
00:15:37.000 | If you remember anything from this talk, I hope it's these three things.
00:15:42.000 | You need automated evals.
00:15:45.000 | You need automated evals.
00:15:48.000 | Just annotate 30 or 100 examples and start from there, right?
00:15:52.000 | And then figure out how to automate it.
00:15:54.000 | It will help you iterate faster, right?
00:15:55.000 | On your prompt engineering, on your retrieval-augmented generation, on your fine-tuning.
00:15:58.000 | Help you deploy safer.
00:16:00.000 | I mean, this is a huge conference of engineers.
00:16:03.000 | I don't think I have to explain to you the need for testing.
00:16:06.000 | Eyeballing doesn't scale.
00:16:08.000 | It's good as a final vibe check, but it just doesn't scale.
00:16:11.000 | Every time you update the prompt, you just want to run your evals immediately, right?
00:16:14.000 | I run tens of experiments every day, and the only way I can do this is with automated evals.
00:16:20.000 | Second, reuse your existing systems as much as you can.
00:16:25.000 | There's no need to reinvent the wheel.
00:16:27.000 | BM25, metadata matching can get you pretty far.
00:16:31.000 | And so do the techniques from recommendation systems, right?
00:16:34.000 | Two-stage retrieval and ranking, filtering, et cetera.
00:16:38.000 | All these information retrieval techniques are optimized to rank the most relevant items on top.
00:16:44.000 | So don't forget about them.
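As a tiny sketch of that two-stage pattern, assuming the `rank_bm25` package for first-stage retrieval and leaving the reranker as a placeholder hook:

```python
from rank_bm25 import BM25Okapi  # assumes: pip install rank-bm25

# Tiny sketch of two-stage retrieval: BM25 fetches candidates, a reranker
# reorders them. The rerank step here is only a placeholder hook.

corpus = [
    "The 2017 paper introduced the transformer architecture.",
    "BM25 is a classic lexical ranking function used in search.",
    "Two-stage retrieval first fetches candidates, then reranks them.",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
    return [corpus[i] for i in top]

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Placeholder: swap in a cross-encoder or other reranker here.
    return candidates

query = "what is two-stage retrieval?"
print(rerank(query, retrieve(query)))
```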
00:16:46.000 | And finally, UX plays a large role in the LLM products.
00:16:51.000 | I think that a big chunk of GitHub Copilot and ChatGPT is UX.
00:16:56.000 | It allows you to use LLMs in your own context without calling an API.
00:17:01.000 | You can use them in an IDE or in a chat window.
00:17:03.000 | Similarly, UX makes it far more effective for you to collect user feedback.
00:17:08.000 | Okay.
00:17:09.000 | That's all I had.
00:17:10.000 | Thank you, and keep on building.
00:17:12.000 | Thank you.