Hi, this is Niklas. I'm the CTO and co-founder of Log10, and I want to talk about how you can scale the reliability of LLM applications using a new tool that we've built. Over the past year, I think we can all agree there's been a kind of craze in the industry.
We've been rolling out a ton of intelligence features based on GPT, and we're now finding ourselves in a "now what?" moment. Without knowing what "good" means in a generative setting, it's really hard and risky to evolve your application by changing prompts or configurations, let alone moving from one model provider to another, or on to more advanced use cases like self-hosting or fine-tuning.
Today we want to introduce a new tool called LLM Eval that enables teams to ship reliable LLM products. It's a command-line tool that you can run locally, and with these four lines of code you should be good to go. The initialization creates a folder structure with best practices for storing prompts and tests.
The configuration is based on Hydra, a highly configurable system from Meta, so you can extend it to your heart's desire. The metrics we have wired up are written in Python, so they can be any logic you want, including calls out to other LLMs.
After the evaluations have run, you can generate reports that give you a brief overview of how the entire app and all the tests are doing, while still supporting flexible test criteria. These models are fuzzy, so it's very hard to guarantee that the output will always be exactly one thing or another, but it's fairly safe to require that a majority of samples, say three out of five, pass.
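To make that concrete, here is a minimal sketch of what such a majority-style criterion could look like; the helper name majority_pass and its signature are illustrative assumptions, not part of LLM Eval's actual API.

```python
# Illustrative sketch of a "3 out of 5 samples must pass" criterion over
# repeated runs of the same test; not LLM Eval's actual internals.
def majority_pass(sample_results: list[bool], min_passes: int = 3) -> bool:
    """Return True if at least min_passes of the sampled runs passed."""
    return sum(sample_results) >= min_passes

# Five samples of the same prompt, two of which were too chatty to match:
print(majority_pass([True, True, False, True, False]))  # True
```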
Now let's jump into the command line and take a look. We'll create a directory for today, go into it, and create ourselves a virtual environment. From there, we'll install LLM Eval and initialize the folder structure.
What we should see here is a directory structure containing our prompts. A simple case could be this one, where we have a message template saying "what is A plus B, only return the answer without any explanation." In this case, we know we had to do some prompt engineering to get an exact output.
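As a rough illustration of that message template (the real prompt lives as a Hydra config file in the generated folder structure, so the on-disk format will differ):

```python
# Illustrative message template for the math example; in practice this is
# stored as a Hydra config file rather than Python.
math_prompt = [
    {
        "role": "user",
        "content": "What is {a} + {b}? Only return the answer without any explanation.",
    }
]
```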
To see why, let's take a look at what the test looks like. In this case, we take the actual output from the LLM and compare it with the expected output, and it's a strict comparison. One liberty we've taken is to strip any leading spaces from the output.
That's because some models, in this case Claude, tend to prepend spaces, and it's things like that you have to watch out for. Then we have the metric, which can be any metric you want to surface in the report, and the result, which is either pass or fail.
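Here is a minimal sketch of that kind of check, strict comparison with leading whitespace stripped; the function name and return shape are assumptions for illustration rather than LLM Eval's actual test code.

```python
# Sketch of a strict-comparison test: strip leading spaces (some models,
# like Claude in this demo, prepend them), then compare exactly.
def exact_match(actual: str, expected: str) -> dict:
    cleaned = actual.lstrip()  # drop any leading whitespace from the model output
    return {
        "metric": cleaned,                                    # surfaced in the report
        "result": "pass" if cleaned == expected else "fail",  # strict comparison
    }

print(exact_match(" 9", "9"))         # {'metric': '9', 'result': 'pass'}
print(exact_match("4 + 5 = 9", "9"))  # a chatty answer fails the strict check
```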
In this case, we want to add 4 and 5, and we expect the answer to be 9. I'm going to run this test, but first revert some of the prompt engineering we did earlier by removing "only return the answer without any explanation."
The way you get started is llmeval run. If you just run llmeval run, it runs everything, but you can override anything: prompts=math will only run the math example, and n_tries=1 will take just one sample.
By default we take five samples, so we get a better read on the stability of each test, but that might be too much for you. You can find and override these default settings in llmeval.yaml. Let's run this and see what happens.
This ran once across Claude, GPT-4, and GPT-3.5, so we can go in and generate a report, and we see that something actually failed. What was it? Let's take a look at the output. In this case, because we've removed our prompt engineering, GPT-3.5 starts being a bit chatty.
It says something like "4 + 5 = 9", and Claude does something similar: it writes out the whole equation rather than just the answer. Now I'm going to revert that change, get it back in, and run one more time. Great. Now, when we regenerate the report, it shows that some tests failed, but the most recent run passed.
When you generate the report, it produces a summary: you get a report per run, but also an overall view of whether anything failed across those runs. If you want to go a bit more advanced, say you want to use tools, we have an example here where we generate some Python code.
Again, we had to add a number of clauses to make sure the model only outputs Python; it tends to be very happy to generate surrounding explanations. So in this case, we check whether or not it returns an actual Python program that can be parsed, and then we try to run that.
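The parse check itself can be as simple as this sketch built on Python's ast module; the metric actually wired up in the example may differ.

```python
import ast

# Sketch: does the model output parse as a Python program?
def is_valid_python(output: str) -> bool:
    try:
        ast.parse(output)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def add(a, b):\n    return a + b"))          # True
print(is_valid_python("Sure! Here is the code you asked for ..."))  # False
```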
If you go in and take a look at this report, you can see that these tool-use tests actually end up passing. To round things off, we also have model-based evaluation, where you can test using other models. In this case, with grading, we can go in and define a full set of criteria.
Here, we're evaluating mermaid diagrams, producing a score between one and five along with the reason, and that is also supported in LLM Eval. One thing about the previous approach is that it takes quite a bit of work to set up these tests and gather your test cases, and one really compelling answer to evaluation has been model-based evaluation.
It's a setting where you typically have a larger model discriminate, grade, or act as a judge over the output of another LLM. That lets you get more nuanced output, like pass/fail, a grade from one to five, or preferences between different options, along with the reasoning behind it.
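To ground that, here is a rough sketch of such a judge, assuming an OpenAI model as the grader; the prompt, criteria, model choice, and JSON parsing are illustrative assumptions, not LLM Eval's built-in grading implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative grading prompt: a larger model judges another model's output
# against written criteria and returns a 1-5 score plus its reasoning.
GRADER_PROMPT = """You are grading a mermaid diagram produced by another model.
Criteria: valid mermaid syntax, faithfulness to the request, readability.
Respond with JSON only, like {{"score": <1-5>, "reason": "<one sentence>"}}.

Diagram to grade:
{completion}
"""

def judge(completion: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GRADER_PROMPT.format(completion=completion)}],
    )
    # Assumes the judge follows the JSON instruction; real code should handle parse failures.
    return json.loads(response.choices[0].message.content)
```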
Unfortunately, there are a number of pitfalls around this approach, such as bias toward the model's own output: if you're sweeping different models, they tend to prefer what they themselves generated. And they are not particularly good at giving calibrated point scores, whether between 0 and 1 or on a larger scale from 0 to 100.
But there are ways to start increasing the accuracy of the feedback that gets generated, and we've been working on this: bridging between model-based and human feedback. Instead of removing the human completely from the feedback loop, you take all the feedback that has been given previously and start modeling it.
Say you have all the feedback from John; we then create an "auto-John" that starts generating feedback, for review, on any incoming completions. In this case, we have two pieces of feedback that have already been given by a human: here it was simply a score of five, and here it's a bit more nuanced.
But here we have one with pending feedback, and if you click it, the AI has suggested an answer. That's all I have today. If you want to get started with LLM Eval, we have our documentation at our usual documentation site. You can find me on X, formerly known as Twitter, as Nicholas Crawford, or by email at nick@log10.io.
Thank you.