Best Practices for Evaluating Large Language Model Applications with llmeval: Niklas Nielsen

00:00:00.000 |
Hi, this is Niklas. I'm the CTO and co-founder of Log10. And we want to talk about how you can 00:00:22.800 |
scale the reliability of LLM applications using a new tool that we've built. During this year, 00:00:28.560 |
I think we can all agree that there's been this kind of craze in the industry. And we've been 00:00:34.480 |
rolling out a ton of intelligence features based on GPT. And we're now kind of finding ourselves in 00:00:40.880 |
a "now what?" moment. Because without knowing what "good" means in a generative setting, it's really 00:00:47.120 |
hard and risky to evolve your applications by changing your prompts, configurations, let alone 00:00:53.760 |
considering going from one model provider to another, or to more advanced use cases like self-hosting 00:01:00.400 |
or fine-tuning. We want to introduce a new tool today called llmeval that enables teams to ship 00:01:08.400 |
reliable LLM products. It is a command line tool that you can run locally. And with these four lines of 00:01:19.520 |
code, you should be good to go. The initialization creates a folder structure and best practices for 00:01:30.480 |
storing prompts and tests. And then this is based on a super configurable system from Meta called Hydra. 00:01:39.760 |
So you could basically extend it to your heart's desire. And the metrics that we have wired up are in 00:01:47.360 |
Python. So they can be any logic, they can even call out to other LLMs, whatever you want. And after these 00:01:55.440 |
evaluations have been run, you can generate some reports that basically give you a brief 00:02:01.200 |
overview of how the entire app and all the tests are looking, but still support flexible test criteria, 00:02:09.440 |
because these models are very fuzzy. It's very hard to say with a guarantee that it's going to be one or 00:02:14.400 |
the other. But it's fairly safe to say that the majority of cases, say three out of five, should pass. 00:02:22.400 |
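As a rough illustration of both points, a custom metric is just Python, and a flexible criterion can be expressed as a majority rule over several samples. The sketch below uses made-up names for illustration; it is not llmeval's actual API.

```python
# Sketch only: llmeval metrics are plain Python, so they can contain any
# logic (including calls out to another LLM). Names here are illustrative.

def majority_passes(results: list[bool], min_passing: int = 3) -> bool:
    """Flexible test criterion: instead of requiring every sampled
    completion to pass, require e.g. three out of five to pass."""
    return sum(results) >= min_passing

# Five samples of the same fuzzy test, four of which passed:
print(majority_passes([True, True, False, True, True]))  # True
```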
And we're going to jump into command line and take a look. We're just going to create a directory for today. 00:02:31.440 |
And go into this directory and create ourselves a virtual environment. 00:02:50.560 |
and initialize the folder structure. What we should be able to see here is 00:02:55.360 |
a directory structure where we have our prompts. Let's see, a simple case could be this, 00:03:04.560 |
where we have this message template saying, "What is A plus B? Only return the answer without any 00:03:10.480 |
explanation." 00:03:17.280 |
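Rendered in Python, that message template looks roughly like the following. The wording comes from the demo, but the structure and the {a}/{b} placeholder syntax are assumptions; the talk does not show llmeval's on-disk prompt format.

```python
# Illustrative only: the content is from the demo, but the message structure
# and placeholder syntax are assumptions, not llmeval's actual prompt format.
math_prompt = [
    {
        "role": "user",
        "content": "What is {a} plus {b}? Only return the answer without any explanation.",
    }
]
```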
So in this case, we know that we have to prompt engineer further in order to get an exact output, 00:03:23.520 |
because of how the test is set up. Let's take a look: in this case, we're taking the actual output from the LLM and comparing it with the expected. And this is a strict comparison. 00:03:30.400 |
What we had taken the liberty to do is to strip any spaces that might come from the left. And that's 00:03:38.960 |
because some models, in this case Claude, tend to prepend spaces. And so it's things like that that 00:03:45.200 |
you have to watch out for. And then we have the metric, which could be any metric that you want to surface 00:03:51.120 |
in the report. And then the result, which is then pass or fail. And in this case, we want to add 4 and 5, and we expect it to be 9. 00:03:58.480 |
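Sketched in plain Python, the test described here boils down to something like the following. The function and field names are made up for illustration, but the left-strip and the strict comparison are exactly what the talk describes.

```python
# Illustrative sketch of the math test: a strict comparison after stripping
# leading spaces (some models, e.g. Claude, tend to prepend a space).

def math_test(actual: str, expected: str) -> dict:
    cleaned = actual.lstrip()
    passed = cleaned == expected                      # strict comparison
    return {"metric": cleaned, "result": "pass" if passed else "fail"}

print(math_test(" 9", "9"))         # {'metric': '9', 'result': 'pass'}
print(math_test("4 + 5 = 9", "9"))  # fails once the model gets chatty
```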
And I'm just going to try to run this test here and try to revert some of the 00:04:05.920 |
prompt engineering that we did earlier. So I'm going to remove "only return the answer without any explanation." 00:04:15.200 |
And the way you get started is the llmeval run command. If you just 00:04:26.240 |
do llmeval run, it runs everything, but you can override anything: if you do prompts=math, then it's only going to run the math 00:04:33.200 |
example. If you do n_tries=1, then it's just going to do one sample. By default, we do five 00:04:43.200 |
samples, so we get a better read on the stability of each test, but it might be too much for you. 00:04:49.760 |
You can find these default settings in llmeval.yaml. But let's try to run 00:04:59.920 |
this and see what happens. And so this ran across Claude, GPT-4 and GPT-3.5 once. So we can go in 00:05:10.000 |
and see that actually something failed. What was it that failed? So let's take a look at the output here. 00:05:17.280 |
And in this case, because we've removed our prompt engineering, GPT-3.5 starts being a bit chatty. It says 00:05:25.440 |
4 + 5 equals 9, and Claude does something similar. So it kind of writes out the equation. 00:05:31.600 |
And now I'm going to try to revert and see. Let's get this in. 00:05:46.560 |
Now, when we check the report, it can say some tests failed, but the most recent tests that we ran 00:05:51.280 |
passed. So when you do the report, it's going to generate a summary: you can generate a report 00:05:56.160 |
per run, but then also see, overall, whether there was anything that failed across these runs. 00:06:02.400 |
If you want to go a bit more advanced, let's say you want to use tools, we have an example here where 00:06:11.280 |
we are generating some Python code. And again, we had to add a number of different 00:06:15.680 |
clauses to make sure that it only outputs Python; it tends to be very happy generating surrounding 00:06:22.400 |
explanations. So in this case, we are going to check whether or not it returns an actual Python program 00:06:32.000 |
that can be parsed. 00:06:39.440 |
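A check like that can be written with the standard library's ast module; the sketch below is just an illustration of the idea, not llmeval's actual metric code.

```python
import ast

def is_valid_python(completion: str) -> bool:
    """Return True if the completion parses as a Python program.
    A chatty answer with surrounding explanation will fail to parse."""
    try:
        ast.parse(completion)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def add(a, b):\n    return a + b"))       # True
print(is_valid_python("Sure! Here is the code:\nprint('hi')"))   # False
```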
So let's try to run that. If you go in and take a look at this report, you can see that these tool-use tests actually end up passing. And to round up, we have model-based 00:06:49.440 |
evaluation as well, where you can test using other models. And so in this case, say with grading, 00:06:58.720 |
we can go in and define a full set of criteria. Here, we're evaluating Mermaid diagrams, 00:07:04.480 |
giving a score between one and five, and the reason. And that is also supported in llmeval. 00:07:12.160 |
One thing about the previous approach is that it takes quite a lot of work to set up these tests and 00:07:18.560 |
gather your test cases. And one really compelling answer to evaluation has been model-based evaluation. 00:07:26.480 |
And it's a setting where you typically have a larger model discriminate, or kind of grade, or be a 00:07:33.200 |
judge over the output from another LLM. And that makes it so you can get more nuanced output like 00:07:40.000 |
pass/fail, or a grade from one to five, or preferences between different options, along with the reasoning behind it. 00:07:46.960 |
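To illustrate the idea, here is a rough sketch of a model-based grader in Python. It assumes the OpenAI chat API and GPT-4 as the judge purely as an example; the grading prompt, the 1-to-5 scale, and the JSON output format are choices made for this sketch, not llmeval's built-in implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_output(task: str, output: str, criteria: str) -> dict:
    """Ask a (typically larger) judge model to grade another model's output
    against a set of criteria, returning a 1-5 score and a short reason."""
    prompt = (
        "You are grading the output of another model.\n"
        f"Task: {task}\n"
        f"Criteria: {criteria}\n"
        f"Output to grade:\n{output}\n\n"
        'Reply as JSON: {"score": <1-5>, "reason": "<short explanation>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4",  # judge model chosen for this sketch
        messages=[{"role": "user", "content": prompt}],
    )
    # A real metric would handle replies that are not clean JSON.
    return json.loads(response.choices[0].message.content)

# Example: grading a generated Mermaid diagram against some criteria.
print(grade_output(
    task="Generate a Mermaid diagram of a login flow",
    output="graph TD; A[User] --> B[Login form]; B --> C[Dashboard]",
    criteria="Valid Mermaid syntax, covers all steps, readable labels",
))
```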
There are a number of pitfalls, unfortunately, around this approach, such as biases towards the output from 00:07:55.120 |
the model itself. If you're sweeping different models, they tend to prefer their own output. 00:07:59.120 |
They are also not very good at giving point scores, say between 0 and 1, or larger scores between 0 and 100. 00:08:07.840 |
But there are different ways you can start increasing the accuracy of the kind of feedback 00:08:14.800 |
that's been generated. And we've been working on this, where you basically start bridging between 00:08:22.400 |
model-based and human feedback. So instead of removing the human completely from the feedback, 00:08:27.440 |
you start taking in all the feedback that might have been given prior and start modeling it. Say, 00:08:34.160 |
if you have all the feedback from John, then we create an auto-John that will start generating feedback 00:08:40.240 |
for review for any incoming completions. And so in this case here, we have two pieces of feedback that have 00:08:46.880 |
already been given by a human. So here it was just a score of five, and here something a bit more nuanced. 00:08:54.720 |
But here the feedback is still pending. And if you click this, we have an AI-suggested answer to it. 00:09:06.480 |
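One way to read the auto-John idea is as few-shot prompting: feedback a human has already given becomes examples that condition a model to draft feedback for new completions, which a human then reviews. The sketch below only illustrates that pattern, with a placeholder complete() function standing in for whatever LLM call is used; it is not Log10's actual implementation.

```python
from typing import Callable

def suggest_feedback(
    prior_feedback: list[tuple[str, str]],   # (completion, human feedback) pairs
    new_completion: str,
    complete: Callable[[str], str],          # any LLM call: prompt in, text out
) -> str:
    """Few-shot sketch: condition a model on feedback a human (say, John) has
    already given, and let it draft feedback for a new completion. The draft
    is only a suggestion for human review, not a final verdict."""
    examples = "\n\n".join(
        f"Completion:\n{c}\nFeedback:\n{fb}" for c, fb in prior_feedback
    )
    prompt = (
        "You imitate a specific human reviewer. Here is feedback they have given before:\n\n"
        f"{examples}\n\n"
        f"Completion:\n{new_completion}\nFeedback:"
    )
    return complete(prompt)

# Usage idea: suggest_feedback(johns_feedback, completion, my_llm_call) would
# draft "auto-John" feedback that a human can then accept, edit, or reject.
```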
And that's all I have today. If you want to get started with llmeval, we have our documentation at our 00:09:13.360 |
usual documentation site. And you can find me as Nicholas Crawford on X, formerly known as Twitter, 00:09:21.440 |
or by email at nick@log10.io. Thank you.