
Best Practices for Evaluating Large Language Model Applications with llmeval: Niklas Nielsen



00:00:00.000 | Hi, this is Niklas. I'm the CTO and co-founder of Log10. And we want to talk about how you can
00:00:22.800 | scale the reliability of LLM applications using a new tool that we've built. During this year,
00:00:28.560 | I think we all can agree that there's been like this kind of craze in the industry. And we've been
00:00:34.480 | rolling out a ton of intelligence features based on GPT. And we're now kind of finding ourselves in
00:00:40.880 | a "now what?" moment. Because without knowing what "good" means in a generative setting, it's really
00:00:47.120 | hard and risky to evolve your applications by changing your prompts, configurations, let alone
00:00:53.760 | considering going from one model provider to another, to more advanced use cases like self-hosting
00:01:00.400 | or fine tuning. We want to introduce a new tool today called llmeval that enables teams to ship
00:01:08.400 | reliable LLM products. It is a command line tool that you can run locally. And with these four lines of
00:01:19.520 | code, you should be good to go. The initialization creates a folder structure and best practices for
00:01:30.480 | storing prompts and tests. And then this is based on a super configurable system from Meta called Hydra.
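As a rough sketch, those four lines look something like the shell session below. The `init` subcommand name and the PyPI package name are assumptions based on the talk, so check the llmeval documentation for the exact commands.

```bash
python3 -m venv venv && source venv/bin/activate  # virtual environment, as in the demo
pip install llmeval                               # assumes the package is published as "llmeval"
llmeval init                                      # assumed subcommand that scaffolds the prompts/tests folders
llmeval run                                       # run every prompt and test in the scaffold
```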
00:01:39.760 | So you could basically extend it to your heart's desire. And the metrics that we have wired up are in
00:01:47.360 | Python. So they could be any logic, could call out to other LLMs, whatever you want. And after these
00:01:55.440 | evaluations have been run, you can generate some reports that basically give you a brief
00:02:01.200 | overview of how the entire app and all the tests are looking, but still support flexible test criteria,
00:02:09.440 | because these models are very fuzzy. It's very hard to say with a guarantee that it's going to be one or
00:02:14.400 | the other. But it's fairly safe to say that the majority of cases, or say three out of five, should pass.
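The flexible criterion described here, where a majority of samples passing counts as an overall pass, is simple to express. A minimal sketch with illustrative names:

```python
# Majority-pass aggregation: with five samples per test, require e.g. three passes.
def test_passes(sample_results: list[bool], min_passes: int = 3) -> bool:
    return sum(sample_results) >= min_passes

print(test_passes([True, True, False, True, False]))  # True: 3 of 5 samples passed
```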
00:02:22.400 | And we're going to jump into command line and take a look. We're just going to create a directory for today.
00:02:31.440 | And go into this directory and create ourselves a virtual environment.
00:02:38.320 | From here, we're going to install llmeval
00:02:50.560 | and initialize the folder structure. What we should be able to see here is
00:02:55.360 | a directory structure where we have our prompts. Let's see, a simple case could be this,
00:03:04.560 | where we have this message template saying "What is A plus B? Only return the answer without any
00:03:10.480 | explanation." So in this case, we know that we have to prompt engineer further in order to get an exact
00:03:17.280 | output. Let's take a look at what the test looks like. In this case, we're taking the actual
00:03:23.520 | output from the LLM and comparing it with the expected. And this is like a strict comparison.
00:03:30.400 | What we had taken the liberty to do is to strip any spaces that might come from the left. And that's
00:03:38.960 | because some models, in this case Claude, tend to prepend spaces. And so it's things like that that
00:03:45.200 | you have to watch out for. And then we have the metric, which could be any metric that you want to surface
00:03:51.120 | in the report. And then the result, which is then pass or fail. And in this case, we want to add 4 and 5, and we
00:03:58.480 | expect it to be 9. And I'm just going to try to run this test here and try to revert some of the
00:04:05.920 | prompt engineering that we did earlier. So I'm going to remove "only return the answer without any
00:04:14.640 | explanation."
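The test described here boils down to a strict string comparison with a little normalization. A sketch of that logic in plain Python, where the prompt text and function signature are illustrative rather than llmeval's actual test hook:

```python
# Illustrative version of the math test: strict comparison after stripping
# leading whitespace, because some models (e.g. Claude) prepend a space.
prompt = "What is {a} plus {b}? Only return the answer without any explanation."

def evaluate(actual: str, expected: str) -> dict:
    normalized = actual.lstrip()
    return {
        "metric": normalized,              # any value you want surfaced in the report
        "result": normalized == expected,  # pass/fail
    }

print(evaluate(" 9", "9"))  # {'metric': '9', 'result': True}
```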
00:04:15.200 | And the way you get started is with llmeval run. But if you want to override anything: if you just
00:04:26.240 | do llmeval run, it runs everything, but if you do prompts=math, then it's only going to run the math
00:04:33.200 | example. If you do n_tries=1, then it's just going to do one sample. By default, we do five
00:04:43.200 | samples, so we get a better read on the stability of each test, but it might be too much for you.
00:04:49.760 | You can override anything, and you can find these default settings in llmeval.yaml. But let's try to run
00:04:59.920 | this and see what happens. And so this ran across Claude, GPT-4 and GPT-3.5 once. So we can go in
00:05:08.640 | and generate a report.
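Put together, this part of the demo corresponds roughly to the following commands. The override spellings (prompts, n_tries) and the report subcommand are assumptions; the real names and defaults live in llmeval.yaml and the documentation.

```bash
llmeval run                          # run all prompts and tests with the defaults
llmeval run prompts=math             # Hydra-style override: only run the math example
llmeval run prompts=math n_tries=1   # one sample instead of the default five
llmeval report                       # assumed subcommand for generating the summary report
```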
00:05:10.000 | And we can see that something actually failed. What was it that failed? So let's take a look at the output here.
00:05:17.280 | And in this case, because we've removed our prompt engineering, GPT-3.5 starts being a bit chatty. It says
00:05:25.440 | 4 + 5 equals 9, and Claude does something similar. So it kind of writes out the equation.
00:05:31.600 | And now I'm going to try to revert. Let's get this in.
00:05:37.440 | And we try to run one more time.
00:05:41.200 | Great.
00:05:46.560 | Now, when we regenerate the report, it can say some tests failed, but the most recent tests that we ran
00:05:51.280 | passed. So when you do the report, it's going to generate a summary: you can generate a report
00:05:56.160 | per run, but it also says, overall, whether anything failed across these runs.
00:06:02.400 | If you want to go a bit more advanced, let's say you want to use tools, we have an example here where
00:06:11.280 | we are generating some Python code. And again, we had to add a number of different
00:06:15.680 | clauses to make sure that it only outputs Python. It tends to be very happy generating surrounding
00:06:22.400 | explanations. So in this case, we are going to see whether or not it returns an actual Python program
00:06:32.000 | that could be parsed. So let's try to run that. If you go in and take a look at this report,
00:06:39.440 | you can see that these tool-use tests actually end up passing.
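One way to test "returns an actual Python program that could be parsed" is to try parsing the model output with the standard library. This is a sketch of such a metric, not necessarily how the bundled example implements it:

```python
import ast

def returns_valid_python(actual: str) -> bool:
    """Pass if the model output parses as Python source code."""
    try:
        ast.parse(actual)
        return True
    except SyntaxError:
        return False

print(returns_valid_python("def add(a, b):\n    return a + b\n"))          # True
print(returns_valid_python("Sure! Here is the code: def add(a, b): ..."))  # False
```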
00:06:49.440 | And to round up, we have model-based evaluation as well, where you can test using other models. And so in this case, say with grading,
00:06:58.720 | we can go in and define a full set of criteria. Here, we're evaluating Mermaid diagrams,
00:07:04.480 | giving a score between one and five, and the reason. And that is also supported in llmeval.
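A model-graded criterion like the Mermaid example comes down to a judge prompt that asks another model for a 1-to-5 score and a reason. A minimal sketch, where call_llm is a placeholder for whatever completion client you use; this is not llmeval's built-in grader:

```python
import json

# Hypothetical grading prompt: score a Mermaid diagram from 1 to 5 with a reason.
GRADER_PROMPT = """You are reviewing a Mermaid diagram.
Criteria: the diagram must be syntactically valid and answer the user's request.
Return JSON of the form {"score": <1-5>, "reason": "..."}."""

def grade(diagram: str, call_llm) -> dict:
    # call_llm is a stand-in for an actual LLM client call.
    response = call_llm(GRADER_PROMPT + "\n\nDiagram:\n" + diagram)
    return json.loads(response)
```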
00:07:12.160 | One thing about the previous approach is that it takes quite a bit of work to set up these tests and
00:07:18.560 | gather your test cases. And one really compelling answer to evaluation has been model-based evaluation.
00:07:26.480 | And it's a setting where you have typically a larger model to discriminate or kind of grade or be a
00:07:33.200 | judge over the output from another LLM. And that makes it so you can get more nuanced output like
00:07:40.000 | pass/fail, or a grade from one to five, or preferences between different options, along with the reasoning behind it.
00:07:46.960 | There's a number of pitfalls, unfortunately, around this approach, around biases towards the output from
00:07:55.120 | the model itself. If you're sweeping different models, they tend to prefer their own output.
00:07:59.120 | They are also not very good at giving point scores, say between 0 and 1, or on larger scales between 0 and 100.
00:08:07.840 | But there are different ways where you can start increasing the accuracy of the kind of feedback
00:08:14.800 | that's been generated. And we've been working on this, where you basically start bridging between
00:08:22.400 | model-based and human feedback. So instead of removing the human completely from the feedback,
00:08:27.440 | you start taking in all the feedback that might have been given prior and start modeling it. And say,
00:08:34.160 | like, if you have all the feedback from John, then we create an auto-John that will start generating feedback
00:08:40.240 | for review for any incoming completions. And so in this case here, we have two pieces of feedback that have
00:08:46.880 | already been given by a human. So here it was just a score of five, and here something a bit more nuanced.
00:08:54.720 | But here, feedback is still pending. And if you click this, the AI has suggested an answer.
00:09:06.480 | And that's all I have today. If you want to get started with llmeval, we have our documentation at our
00:09:13.360 | usual documentation site. And you can find me at Nicholas Crawford on X, formerly known as Twitter,
00:09:21.440 | or by email at nick@log10.io. Thank you.