Building Blocks for LLM Systems & Products: Eugene Yan

Chapters
0:00 Introduction
0:41 Evaluations
9:43 Guardrails
12:31 Collecting Feedback
15:35 Summary
00:00:17.000 |
I'm Eugene Yan, and today I want to share with you 00:00:19.000 |
about some building blocks for LLM systems and products. 00:00:23.000 |
Like many of you here, I'm trying to figure out 00:00:25.000 |
how to effectively use these LLMs in production. 00:00:29.000 |
So, a few months ago, to clarify my thinking, 00:00:31.000 |
I wrote about some patterns for building LLM systems and products. 00:00:41.000 |
Today, I'm going to focus on four of those patterns. 00:00:50.000 |
All the slides will be made available after this talk. 00:00:55.000 |
Buckle up, hang on tight, because we'll be going really fast. 00:01:01.000 |
Let's start with evals, or what I really consider the foundation of it all. 00:01:06.000 |
Well, evals help us understand if our prompt engineering, 00:01:09.000 |
our retrieval augmentation, or our fine-tuning, 00:01:16.000 |
where evals guide how you build your system and product. 00:01:19.000 |
We can also think of evals as test cases, right? 00:01:21.000 |
Where we run these evals before deploying any new changes. 00:01:27.000 |
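For illustration, a minimal sketch of what such an eval-as-test-case could look like, assuming a hypothetical call_llm() helper and a tiny made-up golden set:

```python
# Minimal sketch: a golden set of prompts with expected answers, run as a test
# before deploying any prompt or model change. call_llm() is a placeholder.
GOLDEN_SET = [
    {"input": "Translate 'bonjour' to English.", "expected": "hello"},
    {"input": "What is 2 + 2?", "expected": "4"},
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # swap in your actual model call

def test_golden_set():
    for case in GOLDEN_SET:
        output = call_llm(case["input"])
        assert case["expected"].lower() in output.lower(), f"Failed on: {case['input']}"
```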
And finally, if managers at OpenAI take the time to write evals, 00:01:39.000 |
Here are some things I've seen folks trip up on. 00:01:42.000 |
Firstly, we don't have a consistent approach to evals. 00:01:46.000 |
If you think about more conventional machine learning, 00:01:54.000 |
All these metrics are pretty straightforward, 00:01:56.000 |
and there's usually only one way to compute them. 00:02:00.000 |
Well, we have this benchmark whereby we write a prompt, 00:02:05.000 |
we evaluate the model's ability to get it right. 00:02:10.000 |
Take MMLU, for example, where it assesses LLMs on knowledge and reasoning ability, 00:02:13.000 |
you know, computer science questions, math, US history, et cetera. 00:02:20.000 |
Less than a week ago, Arvind and Sayash from Princeton 00:02:26.000 |
They ask, "Are we assessing prompt sensitivity? 00:02:30.000 |
Are we assessing the LLM, or are we assessing our prompt? 00:02:35.000 |
On the same day, Anthropic noted that the simple MCQ 00:02:41.000 |
Simple formatting changes, such as different parentheses, 00:02:49.000 |
As a result, it makes it really difficult to compare models 00:03:01.000 |
On the top, you see the human evaluation scores 00:03:05.000 |
And on the bottom, you see the evaluation scores for the automated summaries. 00:03:10.000 |
You don't have to go through all the numbers there, but the point is that all the numbers 00:03:14.000 |
on the bottom are already higher than the numbers on top. 00:03:18.000 |
Here's another one that's more recent on the XSUM dataset, extreme summarization, 00:03:22.000 |
where you see that all the human evaluation scores are lower than InstructGPT. 00:03:30.000 |
Now, finally, with all these benchmarks being so easily available, 00:03:33.000 |
we sometimes forget to ask ourselves, hey, is it a fit for our task? 00:03:37.000 |
If you think about it, does MMLU really apply to your task? 00:03:42.000 |
Maybe, if you're building a college-level chatbot, right? 00:03:45.000 |
But here's Linus reminding us that we should be measuring our apps on our tasks 00:03:56.000 |
Well, I think as an industry, we're still figuring it out. 00:03:58.000 |
As Barr points out, it's the number one challenge out there. 00:04:00.000 |
And we hear so many people talk about evals. 00:04:05.000 |
Firstly, I think we should build evals for our specific task. 00:04:10.000 |
It may seem daunting, but it's okay to start small. 00:04:14.000 |
You know, he releases a lot of open-source models. 00:04:17.000 |
He starts with an eval set of 40 questions for his domain expert tasks. 00:04:25.000 |
Second, we should try to simplify the task as much as we can. 00:04:31.000 |
I think we have a better chance if we try to make it more specific. 00:04:34.000 |
For example, if you're using an LLM for content moderation tasks, 00:04:38.000 |
you can fall back to simple precision and recall. 00:04:46.000 |
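As a hedged sketch of that fallback, using scikit-learn and made-up binary labels (1 = flagged, 0 = not flagged):

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = content flagged, 0 = not flagged.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # ground truth from annotators
llm_labels   = [1, 0, 1, 0, 0, 1, 1, 0]   # what the LLM predicted

print("precision:", precision_score(human_labels, llm_labels))
print("recall:   ", recall_score(human_labels, llm_labels))
```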
Next, if it's something broader like writing SQL or extracting JSON, 00:04:58.000 |
you can check if the JSON keys and the values match what you expect. 00:05:01.000 |
These are still fairly easy to evaluate because we have expected answers. 00:05:06.000 |
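A minimal sketch of that kind of check for JSON extraction, with a made-up expected reference:

```python
import json

expected = {"name": "Eugene", "city": "Seattle"}          # made-up reference answer
raw_output = '{"name": "Eugene", "city": "Seattle"}'      # e.g. the LLM's raw output

def json_matches(raw: str, expected: dict) -> bool:
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not even valid JSON
    # Every expected key must be present with a matching value.
    return all(parsed.get(k) == v for k, v in expected.items())

print(json_matches(raw_output, expected))  # True
```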
But if your task is more open-ended, such as dialogue, 00:05:11.000 |
you may have to rely on a strong LLM to evaluate the output. 00:05:16.000 |
Here's Jerry saying, you know, running 60 evals with GPT-4 costs him a lot. 00:05:24.000 |
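One way such an LLM-based eval could be wired up, sketched with the OpenAI Python client; the judging prompt, rubric, and model name here are assumptions, not prescriptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str) -> str:
    """Ask a strong model to grade an open-ended answer; rubric and model are placeholders."""
    prompt = (
        "You are grading a dialogue response.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Rate the answer from 1 (poor) to 5 (excellent) and explain briefly."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```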
I think we shouldn't discount the value of eyeballing the output. 00:05:29.000 |
I don't believe that any of these evals capture what we care about. 00:05:29.000 |
They have a prompt to generate games for a 3-year-old and a 7-year-old, 00:05:38.000 |
and it was more effective for them to actually just eyeball the output 00:05:48.000 |
I don't think I have to convince you all here of the benefits of retrieval-augmented generation, 00:05:52.000 |
but, you know, it lets us add knowledge to our model as input context 00:05:56.000 |
where we don't have to rely solely on the model's knowledge. 00:06:01.000 |
It's cheaper and more precise than continuously fine-tuning to add new knowledge. 00:06:05.000 |
But retrieving the right documents is really hard. 00:06:09.000 |
Nonetheless, we have great speakers, Jerry and Anton, 00:06:13.000 |
so I won't go into the challenges of retrieval here. 00:06:16.000 |
Instead, I'd like to focus on the LLM side of things, right, 00:06:19.000 |
and discuss some of the challenges that remain 00:06:24.000 |
The first gotcha is that LLMs can't really see all the documents you retrieve. 00:06:34.000 |
You know, we've had historical queries on Google 00:06:39.000 |
As part of the context, they provide 20 documents. 00:06:42.000 |
Each of these documents is, at most, 100 tokens long. 00:06:48.000 |
And one of these documents contains the answer. 00:06:54.000 |
How would the position of the document containing the answer affect performance? 00:07:03.000 |
If the answer is in the first retrieved document, accuracy is the highest. 00:07:12.000 |
But if it's somewhere in the middle, the accuracy is actually worse. 00:07:20.000 |
It means that even if context window sizes are growing, 00:07:25.000 |
we shouldn't allow our retrieval to get worse, right? 00:07:30.000 |
Getting the most relevant documents to rank highly still matters, 00:07:36.000 |
And also, even if the answer is in the context and in the top position, the model may still not get it right. 00:07:48.000 |
Now, another gotcha is that LLMs can't really tell if a retrieved document is irrelevant. 00:08:00.000 |
and you can think of these as movies that I like. 00:08:03.000 |
And I asked the LLM if I would like Twilight. 00:08:08.000 |
you know, it's romantic fantasy, girl, vampire, werewolf, 00:08:22.000 |
because I've watched all these sci-fi movies, 00:08:26.000 |
And this is pretty important in recommendations. 00:08:32.000 |
First, it notes that Twilight is a different genre 00:08:35.000 |
and not quite sci-fi, which is fantastic, right? 00:08:38.000 |
But then it suggests E.T. because of interspecies relationships. 00:08:55.000 |
But the point is, these LLMs are so fine-tuned to be helpful 00:09:02.000 |
that sometimes it's really hard to get them to say something is not relevant, 00:09:05.000 |
especially something that's fuzzy like this, right? 00:09:08.000 |
So how do we best address these limitations in RAG? 00:09:12.000 |
Well, I think that there are a lot of great ideas 00:09:16.000 |
Search and recommendations have been trying to figure out how to show the most relevant documents on top, 00:09:23.000 |
And there's a lot that we can learn from them. 00:09:25.000 |
Second, LLMs may not know that the retrieved document is irrelevant, right? 00:09:29.000 |
I think it helps to include a threshold to exclude irrelevant documents. 00:09:36.000 |
I bet we could do something like just measuring item distance between those two, 00:09:40.000 |
and if it's too far, we don't go to the next step. 00:09:46.000 |
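A rough sketch of that thresholding idea, assuming you already have embeddings for the query and the retrieved documents (the threshold value is arbitrary):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_relevant(query_emb, docs, doc_embs, threshold=0.7):
    """Keep only documents whose similarity to the query clears the threshold."""
    kept = []
    for doc, emb in zip(docs, doc_embs):
        if cosine_similarity(query_emb, emb) >= threshold:
            kept.append(doc)
    return kept  # if nothing survives, skip generation rather than force an answer
```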
So guardrails are really important in production. 00:09:50.000 |
For what's safe, we can look at OpenAI's moderation API, 00:09:54.000 |
which covers hate, harassment, self-harm, all that good stuff. 00:09:58.000 |
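A minimal sketch of a safety guardrail built on that moderation endpoint, using the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_safe(text: str) -> bool:
    """Return False if the moderation endpoint flags the text."""
    result = client.moderations.create(input=text).results[0]
    return not result.flagged

if not is_safe("some user message here"):
    print("Blocked by safety guardrail.")
```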
But another thing that I also think about a lot is guardrails on factual consistency, 00:10:05.000 |
I think it's really important so that you don't have trust-busting experiences. 00:10:10.000 |
You can also think of these as evals for hallucination. 00:10:16.000 |
The field of summarization has been trying to tackle this for a very long time. 00:10:21.000 |
So one approach to this is via the natural language inference task. 00:10:27.000 |
In a nutshell, given a premise and a hypothesis, we classify if the hypothesis is true or false. 00:10:33.000 |
So given a premise, John likes all fruits, the hypothesis that John likes apples is true, therefore entailment. 00:10:40.000 |
Because there's not enough information to confirm if John eats apples daily, it's neutral. 00:10:45.000 |
And finally, John dislikes apples, clearly false, therefore contradiction. 00:10:49.000 |
Do you see how we can apply this to document summarization? 00:11:00.000 |
Now, when doing this, though, it helps to apply it at the sentence instead of the entire document level. 00:11:05.000 |
So in this example here, the last sentence in the summary is incorrect. 00:11:10.000 |
So if we run the NLI task on the entire document and summary, it's going to say that the entire summary is correct. 00:11:15.000 |
But if you run it at the sentence level, it's able to tell you that the last sentence in the summary is incorrect. 00:11:21.000 |
And they included a really nice ablation study, right, where they checked the granularity of the document. 00:11:27.000 |
As we got finer and finer, from document to paragraph to sentence, the accuracy of detecting factual inconsistency goes up. 00:11:41.000 |
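As a sketch, sentence-level NLI checking could look like this with an off-the-shelf MNLI model from Hugging Face; the model choice and threshold here are assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    return probs[2].item()  # this model's label order: contradiction, neutral, entailment

def inconsistent_sentences(document: str, summary_sentences: list, threshold: float = 0.5):
    """Flag summary sentences that the source document does not entail."""
    return [s for s in summary_sentences if entailment_prob(document, s) < threshold]
```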
Another approach is via sampling: given an input document, we generate a summary multiple times. 00:11:44.000 |
Now, we check if those summaries are similar to each other. 00:11:51.000 |
The assumption is that if the summaries are very different, it probably means that they're not grounded on the context document. 00:11:59.000 |
But if they're quite similar, you can assume that they're grounded effectively and therefore factual. 00:12:03.000 |
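A rough sketch of that sampling check, assuming a hypothetical summarize() call and using sentence-transformers for similarity; the sample count and threshold are arbitrary:

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def summarize(document: str) -> str:
    raise NotImplementedError  # your LLM call, sampled with temperature > 0

def looks_grounded(document: str, n: int = 5, threshold: float = 0.8) -> bool:
    """Sample several summaries; low pairwise similarity hints at hallucination."""
    summaries = [summarize(document) for _ in range(n)]
    embeddings = encoder.encode(summaries, convert_to_tensor=True)
    sims = [float(util.cos_sim(embeddings[i], embeddings[j]))
            for i, j in combinations(range(n), 2)]
    return sum(sims) / len(sims) >= threshold
```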
And the final approach is asking a strong LLM. 00:12:09.000 |
Given an input document and summary, they get the LLM to return a summary score. 00:12:14.000 |
And we have seen that strong LLMs are actually quite expensive. 00:12:17.000 |
But in the case of factual consistency, I've seen simpler methods outperform LLM-based approaches at a far lower cost. 00:12:32.000 |
Now, to close the loop, let's touch briefly on collecting feedback. 00:12:41.000 |
Because we want to understand what our customers like and don't like. 00:12:45.000 |
And then the magic thing here is that collecting feedback helps you build your evals and fine-tuning data set. 00:12:51.000 |
New models come and go every day, but your evals and fine-tuning data set, that's your transferable asset that you can always use. 00:12:59.000 |
But collecting feedback from users is not as easy as it seems. 00:13:08.000 |
Explicit feedback is feedback we ask users for. 00:13:16.000 |
How many of you here actually click the thumbs up and thumbs down button? 00:13:26.000 |
So, even if you include this thumbs up, thumbs down button, you may not be getting the feedback you expect. 00:13:31.000 |
Now, if the issue with explicit feedback is sparsity, then the issue with implicit feedback is noise. 00:13:38.000 |
So, implicit feedback is the feedback you get as users organically use your product, right? 00:13:43.000 |
You don't have to ask them for feedback, but you get this feedback. 00:13:50.000 |
The rest of you just type it out like a madman? 00:13:55.000 |
But does clicking the copy code button mean that the code is correct? 00:14:02.000 |
nrows is not a valid argument for pandas' read_parquet. 00:14:06.000 |
But if we were to consider all code snippets that were copied as positive feedback, we would have a lot of bad data in our training. 00:14:17.000 |
I don't have any good answers, but here are two apps I've seen do it really well. 00:14:20.000 |
First one, GitHub Copilot or any kind of coding assistant, right? 00:14:23.000 |
For people not familiar with it, you type some functional signature, some comments, and it suggests code. 00:14:29.000 |
You can either accept the code, reject the code, move on to the next suggestion. 00:14:35.000 |
Imagine how much feedback they get from this, right? 00:14:42.000 |
For folks not familiar with Midjourney, you write a prompt, it suggests four images. 00:14:47.000 |
And then based on those images, you can either rerun the prompt, vary an image, that's what the V stands for, or upscale an image, that's what the U stands for. 00:15:02.000 |
Rerunning the prompt is negative reward, where the user doesn't like any of the images. 00:15:07.000 |
Varying the image is a small positive reward, where the user is saying, this one has potential, but tweak it slightly. 00:15:14.000 |
And choosing the upscale image is a large positive reward, where the user likes it and just wants to use it. 00:15:20.000 |
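In code, that mapping from implicit actions to rewards could be as simple as the following sketch; the numeric values are illustrative, not from any real system:

```python
# Illustrative mapping of Midjourney-style actions to reward signals;
# the values are arbitrary and would be tuned for your own product.
ACTION_REWARDS = {
    "rerun_prompt": -1.0,   # user disliked all four images
    "vary_image":    0.5,   # image has potential, tweak it slightly
    "upscale_image": 1.0,   # user likes it and wants to use it
}

def log_feedback(prompt_id: str, action: str) -> dict:
    """Record one (prompt, action, reward) event for eval and fine-tuning datasets."""
    return {"prompt_id": prompt_id, "action": action, "reward": ACTION_REWARDS[action]}
```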
So think about this, think about how we can build this implicit feedback data flywheel into your products, 00:15:26.000 |
so that you quickly understand what users like and don't like. 00:15:37.000 |
If you remember anything from this talk, I hope it's these three things. 00:15:48.000 |
Just annotate 30 or 100 examples and start from there, right? 00:15:55.000 |
On your prompt engineering, on your retrieval augmentation, on your fine-tuning. 00:16:00.000 |
I mean, this is a huge conference of engineers. 00:16:03.000 |
I don't think I have to explain to you the need for testing. 00:16:08.000 |
Eyeballing is good as a final vibe check, but it just doesn't scale. 00:16:08.000 |
Every time you update the prompt, you just want to run your evals immediately, right? 00:16:14.000 |
I run tens of experiments every day, and the only way I can do this is with automated evals. 00:16:20.000 |
Second, reuse your existing systems as much as you can. 00:16:27.000 |
BM25, metadata matching can get you pretty far. 00:16:31.000 |
And so do the techniques from recommendation systems, right? 00:16:34.000 |
Two-stage retrieval and ranking, filtering, et cetera. 00:16:38.000 |
All these information retrieval techniques are optimized to rank the most relevant items on top. 00:16:46.000 |
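For example, a first-stage keyword retriever is only a few lines with the rank_bm25 package; the documents and query below are made up:

```python
from rank_bm25 import BM25Okapi

documents = [
    "Evals help you measure prompt changes.",
    "BM25 is a classic keyword-based ranking function.",
    "Two-stage retrieval reranks a candidate set.",
]
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query = "keyword ranking with bm25".lower().split()
scores = bm25.get_scores(query)

# First stage: rank by BM25 score; a second-stage model could rerank the top candidates.
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[0])
```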
And finally, UX plays a large role in the LLM products. 00:16:51.000 |
I think that a big chunk of GitHub Copilot and ChatGPT is UX. 00:16:56.000 |
It allows you to use the LLMs in your context without calling an API. 00:17:03.000 |
Similarly, UX makes it far more effective for you to collect user feedback.