back to indexStrategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

00:00:19.080 |
I am an AI developer advocate/evangelist/technical 00:00:24.480 |
marketing and other titles, depending on where I am. 00:00:40.160 |
So I'm new to the conference, new to speaking 00:00:42.860 |
at this conference, ready to have a good time 00:00:53.900 |
so I'm going to be a little bit glued over here. 00:00:56.060 |
But to kind of overview what we're going to do today, 00:00:59.240 |
I'm going to talk a little bit about kind of the issues of setting 00:01:05.060 |
and the reason why you need evaluations and benchmarks 00:01:08.820 |
and all of these things and why that is so critical. 00:01:14.600 |
that we'll get to do to use some evaluation methods 00:01:19.120 |
and benchmark tools to kind of get a sense of what 00:01:21.580 |
is out there to use, how we can use those tools, what 00:01:39.800 |
I don't like that 8:00 AM stuff that happens. 00:01:56.080 |
Setting this up to be scalable and reliable and safe 00:02:16.540 |
start off with a multi-agent framework and go crazy. 00:02:21.620 |
Typically, they're going to have a kind of standard repeatable 00:02:29.300 |
How can I have like a chatbot type of situation? 00:02:31.560 |
That's kind of the standard that everybody starts out with. 00:02:41.360 |
dealing with probably the first three phases of this maturity 00:02:47.800 |
But we get to talk about all the cool advanced stuff 00:02:49.680 |
at this conference so we can plan ahead and think 00:02:54.360 |
But enterprises, they need to take an incremental approach 00:03:01.840 |
Gen AI models, they have a number of drawbacks, right? 00:03:07.660 |
Policy restrictions-- typically, if you're a developer 00:03:13.720 |
you're restricted in the type of AI that you can use. 00:03:28.600 |
So the company policy restrictions of the tools 00:03:35.060 |
The legal exposures and risks of these models-- 00:03:46.700 |
And how do we protect our customers against that 00:03:56.120 |
Most of the internet data is still largely Eurocentric 00:04:01.640 |
So the models trained on this public internet data, of course, 00:04:16.040 |
to make sure that we adjust and prevent what we can? 00:04:27.260 |
to run at scale and production, and the performance, 00:04:32.780 |
need to account for when we have these production systems 00:04:39.580 |
These large frontier models have a knowledge cutoff 00:04:48.160 |
going to have that up-to-date information, which 00:04:50.340 |
is why they implement RAG systems and agent systems 00:04:53.600 |
to look out into the internet for more up-to-date info. 00:05:02.900 |
No matter how-- I'm going to go into these categories a little 00:05:06.280 |
No matter how good your model is, if it's not fast, 00:05:10.020 |
if it's not reliable, if it's not affordable, 00:05:16.280 |
So this graphic, it shows a classic bottleneck type 00:05:18.780 |
of the scenario you've got concurrent user requests, which 00:05:21.720 |
might be represented by those green, yellow, and orange 00:05:28.020 |
typically can't handle that kind of load officially. 00:05:31.320 |
To serve real-world traffic, whether you're powering a customer 00:05:34.780 |
support agent, a developer type of copilot system, 00:05:38.340 |
or maybe a RAG pipeline, you need an inference engine that's 00:05:42.900 |
So that's where inference runtimes like TRT, SGLang, 00:05:47.220 |
which I know we have a session on, or at least a couple 00:05:50.280 |
VLLM, that's where Red Hat's kind of focusing. 00:05:53.160 |
That's where those type of production-grade inference 00:06:00.060 |
There's a lot of pain points with inference that I just 00:06:03.180 |
And some of the activity will be benchmarking and evaluating 00:06:16.060 |
it's very complicated to actually evaluate this appropriately. 00:06:26.280 |
The compute load, just for performance evaluations as well, 00:06:32.120 |
You have to make sure the data sets that you're 00:06:34.500 |
using for benchmarking are compatible with the models 00:06:38.800 |
The resource optimization and identifying sizing 00:06:42.360 |
so that you're efficiently using your hardware 00:06:45.000 |
appropriately for whatever model size you're using, 00:06:48.220 |
that's a big challenge for enterprises today, 00:06:51.440 |
making sure they're efficiently using their GPU investments. 00:06:54.700 |
And then actually, cost estimating is a little bit 00:07:02.820 |
You have to backwards math map inference performance to tokens, 00:07:08.020 |
So these are what enterprises are trying to achieve. 00:07:13.840 |
Just a little kind of more examples of the challenges. 00:07:17.080 |
So we have-- this is just an example of stable diffusion bias. 00:07:20.560 |
Like I said, most of our data is Eurocentric. 00:07:24.800 |
So how do we-- what tools can we use to provide guardrails 00:07:30.580 |
The glue incident-- this was because there was something 00:07:38.800 |
And this AI overview tool used that information 00:07:42.580 |
and didn't have the right mitigation techniques in place 00:07:46.940 |
So then it came out in that AI overview suggestion. 00:07:52.700 |
And then we have this like mad situation where a lot of the-- 00:07:57.680 |
we're getting into a lot of synthetic data on the internet. 00:08:00.500 |
And each generation of these AI models that come out are consuming more 00:08:06.260 |
and more AI-generated data, which over time is going to get you further-- 00:08:11.000 |
oh, there's music happening-- further away from that original human-anchored data. 00:08:15.440 |
So this is going to lead to a loss of output diversity, a loss of precision. 00:08:21.640 |
This would be an area where you would need to use those kind of accuracy evals 00:08:25.680 |
to mitigate again and identify that this is occurring. 00:08:29.320 |
So just to cover, of course, Google and the Stable Diffusion Project. 00:08:35.380 |
They have introduced additional evaluation frameworks 00:08:42.080 |
Just like any time we have any story like that, right? 00:08:45.340 |
They are certainly working to make sure that that does not happen again. 00:08:49.600 |
We don't know with the closed source AI offerings how exactly they're doing that, 00:08:55.480 |
You know, the AI overview technology maybe introduced more RAG mitigations 00:09:01.140 |
to where it helps to identify that that was satire or whatever the case is. 00:09:09.700 |
The Stable Diffusion Model likely introduced some level of bias mitigation guardrails. 00:09:14.240 |
So we need--so, of course, they've been working to fix this, but ideally, we don't run into this, 00:09:21.500 |
and we prevent it ahead of time before a model release or before an application release. 00:09:25.500 |
So how do we prevent these kinds of issues at scale in production environments? 00:09:31.500 |
I want to just look at a couple of definitions, because sometimes evaluation and benchmarking are terms used, 00:09:38.760 |
kind of--they kind of conflate a little bit, and people kind of use them for whatever they want. 00:09:42.760 |
So benchmarking is just a subcategory of evaluation. 00:09:46.760 |
Evaluation is a comprehensive process to assess a model end-to-end, 00:09:51.760 |
and it could include a lot of different kinds of evaluations about a lot of different components. 00:09:56.640 |
Benchmarking is very specifically controlled specific data sets and specific tasks, 00:10:02.900 |
typically used to compare models against one another. 00:10:05.900 |
So this would be like a latency score that compares different hardware setups and different models, 00:10:12.900 |
or like the MMLU benchmark scoring, things like that. 00:10:16.900 |
We'll look at both custom evals that aren't benchmarking and also some benchmarks in their hands-on. 00:10:23.160 |
These are just some examples of what is typically considered a model evaluation versus a benchmarking specific test, 00:10:35.160 |
But again, like, there's so many types of evaluations, so many tools. 00:10:39.160 |
You can customize it in so many different ways, but this kind of helps, hopefully, a little bit with the definitions. 00:10:45.160 |
So, hopefully, kind of seeing all these challenges with Gen AI, we understand that this is a critical process. 00:10:57.420 |
Like, there's less of a concern, right, when I'm tinkering on my laptop, whatever, you know, I see something weird, who cares. 00:11:03.420 |
But when we're talking about a production-level environment, we're serving thousands, whatever, customers, we need to think about these things, obviously, more in these types of scenarios. 00:11:12.680 |
The credibility of the company, when those stories come out, that takes a hit for a good chunk of time. 00:11:17.680 |
And you obviously also need to continuously improve your evaluation frameworks as well, because you're not going to catch everything. 00:11:24.940 |
You need that CI process to make sure you are continuously improving your evaluations and benchmark setups. 00:11:33.940 |
It's also going to very much depend on the type of system that you have, what you set up. 00:11:36.940 |
So again, it's very much, there's tons of tools, this could look a lot of different ways, we'll get a sense of it today. 00:11:44.200 |
If you have a RAG set up, you're going to be maybe focused on the RAGAS benchmark, or evaluation tool, agents, you need to look at function, tool calling, capabilities, et cetera. 00:11:58.460 |
You can kind of get a sense, there's going to be specific metrics that you need to set up and look for, depending on the system that you have, which requires a lot of planning in advance and kind of architecture scoping. 00:12:12.460 |
I'll give an example of a RAG use case, and also it's incremental too, because you could, and I'll talk about this a bit, like you could literally evaluate every single part of things, but that's going to be time and resource extensive to set up immediately. 00:12:29.500 |
So you likely want to take an incremental approach with these types of setups. 00:12:32.860 |
So you might start out with, okay, I'm just going to evaluate the chunk retrieval, my retrieval application in a RAG system. 00:12:40.700 |
I want to set up some kind of evaluation test there. 00:12:43.660 |
I might just want to set up a latency throughput of a benchmark test for my LLM output. 00:12:49.860 |
You can start with those kind of incremental approaches for specific components, and then from there, based on priority levels as well, branch out into a full system eval that covers all the components, 00:13:02.540 |
the integration layer of components, the integration layer of how the components work together, the UI end-to-end experience, and kind of have a software engineering kind of test period approach to this, where you have that unit test layer kind of approach at the bottom, integration layer in the middle, and that UI end-to-end at the top. 00:13:22.180 |
And you can take that layer by layer as you're building this evaluation framework for your systems. 00:13:30.940 |
So there is this pyramid also for model evaluation that represents the same kind of setup for that software engineering pyramid represents as well. 00:13:43.340 |
So the base layer and very base layer is the system performance, because like I said, no matter how good your model is, if you don't have fast throughput, you aren't able to handle concurrent users, you're going to be in a bit of a pickle. 00:13:55.300 |
So GPU utilization, etc. You need to make sure kind of the basics are handled as the main kind of event. And we'll talk about, that'll be the first hands-on activity is evaluating the system performance. 00:14:08.260 |
Formatting might be making sure it's religiously giving you JSON output that you need for your application, something like that. The factual accuracy, which we'll also talk about in one of our hands-on, that would be like the MMLU benchmark. 00:14:24.260 |
So evaluating that it's performing well on various subjects, kind of standard large language model accuracy, as well as if you've fine-tuned a model, potentially making sure that it's accurate based on the information that you fine-tuned that model on. 00:14:39.220 |
And you kind of go up from there into, you know, safety, bias. There might be specific custom evaluations that are very specific to your application. So it gives you kind of a sense of the tiered approach that can be taken here. 00:14:52.220 |
So we're going to talk first, system performance, and we're going to have our first hands-on around this. We're going to be looking at Guide LLM, which is kind of a new project that is associated with the VLLM inference runtime project. 00:15:06.180 |
We're going to use that for system performance benchmarks, like latency, throughput, and you'll get a little bit of hands-on there. The general user flow there is like, you know, you select your model, you select your particular data set that you want to use to test throughput and to test inter-token latency, time-to-first token, those types of metrics. 00:15:27.180 |
And then Guide LLM allows, gives you a nice kind of in-terminal UI to visualize the results of that. 00:15:34.180 |
And then once you get kind of the results that you want based on your use case again, then you're ready to deploy. 00:15:42.180 |
You'll see this in the hands-on as well, but you want to test based on the use case. And one of the primary ways you test via Guide LLM is adjusting the input and output tokens. 00:15:55.180 |
So if you have a chat bot use case, a RAG use case, you can adjust the input and output token levels based on your use case. And you'll see that in the hands-on and have an opportunity to kind of play around with that, depending on what you're most interested in. 00:16:11.180 |
That is the link that red.ht/evals is the link to the workshop. You will be signing in with your email. I have no marketing game. It just requires you to do that. So I'm not going to haunt you after this. 00:16:28.180 |
You'll put in your email and that is the password. And, hold on. I just want to show you, well, I don't want to take up one of the systems. Otherwise I would show you, but you're going to get, once you are in the workshop, you're going to get your instructions on the left-hand side. 00:16:47.180 |
And you're going to have two terminals available, two terminal sessions available to you on the right-hand side to the same system. It's a REL system. 00:16:54.180 |
The instructions will overview what the system includes a little bit for you so you get a sense. They each have an L4 GPU as an example. 00:17:04.180 |
Anything else I want to call out before we get started. That'll be primarily what you use. It has Tmux enabled if you like to use that to open up different things and gives you some flexibility. 00:17:17.180 |
We have three different activities. I'm going to pause after each activity so we can have a little bit of a discussion. 00:17:24.180 |
in between. This is my first time running this particular activity, so we'll gauge kind of the time it takes, but I'm going to give it about 15-20 minutes for this first one. 00:17:36.180 |
Okay, so I'm also pulling up a system and I'm just going to like walk through some of the stuff. The initial page is just going to give you, of course if it loads, Jesus, the internet. 00:17:48.180 |
The initial page is just going to give you a little bit of background and preparing your system instructions. And then my terminals are on this second tab. And everything is going to be glacial pace. 00:18:00.180 |
So the first thing I have to do, these systems don't have the container toolkit installed. So I just got to do that. Some system logistics because I'm going to be running VLLM inference runtime in a container and I need that to work. 00:18:14.180 |
So that's what that's doing. And then I'm going to deploy a model with VLLM. So you're going to have to grab a hugging face token. Probably have gotten there by now. Probably most of us have a hugging face token, but just disclaimer there. 00:18:30.180 |
So incognito window, if you're hitting the hugging face rate limit to grab a new token. So VLLM you can also install locally, like if you have a Mac or whatever Linux machine, but we're deploying it as a container here. 00:18:47.180 |
So I'm going to get that to play and I'm, you can see the VLLM serve command at the end. So that's just the VLLM CLI tool. And I'm using an IBM granite model because you know, red hat IBM. 00:18:59.180 |
So we'll be working with that for a chunk of the activity and it takes a bit for VLLM to load the model. So you're going to be waiting for info, the words info like four times in green, and then it's deployed. 00:19:15.180 |
What's nice about VLLM is that it is, it's compatible with the safe tensor format. So has anybody used TRT to load up a model? Okay. Anyway, it's crazy. 00:19:28.180 |
It requires you to also convert the model formats initially. So there's less kind of configuration steps with VLLM. It takes up less space too. 00:19:37.180 |
So when you do these kinds of system and performance benchmarks, you can make a lot of adjustments like the input and output tokens that I mentioned for the guide LLM configurations. 00:19:48.180 |
But there's a lot of also configuration opportunities for the inference runtime itself, depending on what you're trying to do. So sometimes we'll reduce the max, the context window of the model. 00:20:01.180 |
So it runs more quickly because if it's a big context window, it's going to be pretty beefy. There's a lot of knobs you can use for VLLM. We're not really going to touch that this particular time, but just so you're aware. 00:20:12.180 |
So I have my three green infos. So that means it's working and the model is successfully deployed. So I'm going to get into my virtual environment, which is already in place. 00:20:27.180 |
which is already in place. And then I already have guide LLM installed, but I'm going to pip install guide LLM. And these are copy buttons, by the way. So you can just easily copy paste things over. 00:20:39.180 |
So once I have that up, then this command is set up to just work with the model deployed by VLLM and that I'm just keeping up here in the top terminal and can run this in the background, but I'm just not doing that. 00:20:52.180 |
And this is so I have my target. The rate type is a sweep of various benchmarks like inter token latency. There's various types of benchmarks that it'll run that you'll see in the output. 00:21:06.180 |
But these are all things that can be adjusted. You can run one particular benchmark at a time. For instance, you can take a look at and gets guide LLM dash dash help typical type of commands. 00:21:16.180 |
You can kind of see what the where all the knobs are and the documentation is pretty good, too. 00:21:21.180 |
So that'll take a couple of minutes to run because I have it set at a rate of five to reduce the amount of time that it takes to process. 00:21:30.180 |
So you can kind of get a sense of the output here once it all processes. And I have explainers on the left hand side of kind of how to read some of this. 00:21:42.180 |
The mean performance for each is we have the the benchmark info on the top and then benchmark stats on the bottom. 00:21:49.180 |
So the for the constant rate the on the very left hand side. So those are the number of requests sent to the model per second at that particular rate. 00:22:01.180 |
So three the constant at three dot six three six point nine three if I did written the guide LLM command rate and it was five if I did rate 10 you would see more lines of that at more progressive rates. 00:22:20.180 |
And like whether or not these numbers are good also totally depend on your use case as well. 00:22:26.180 |
And I would be comparing I would have a better hardware configuration obviously in production as well. 00:22:31.180 |
So again, I'm just I'm running an eight billion parameter side or two to two billion parameter size model on an L4, which is like, okay, but obviously if you're doing anything concurrently and at scale like that's going to go bonkers pretty quickly. 00:22:48.180 |
So you get like the mean performance the median performance and P99 which is like the the extreme level which matters for SLOs and things like that. 00:23:03.180 |
And you can also output this into JSON format as well to take a closer look. 00:23:09.180 |
So once you reach this you can try also with an additional like tweak the parameters and then maybe compare the. 00:23:17.180 |
results and kind of see what what that changed if you do a rag set up. 00:23:23.180 |
So I think we had I forget what we had what the initial command said but you can adjust those based on a different use case and kind of compare and contrast what the stats look like after it just takes a couple minutes to run. 00:23:44.180 |
We're a few minutes away from the additional 10 to 15 systems being ready. 00:23:50.180 |
for the sake of time I am not going to do breaks for kind of discussion in between if everybody's okay with that and we can kind of just converse independently and I'll just awkwardly walk around the room. 00:24:12.180 |
So if you're done with activity one move on to activity two because there are three activities and I want to make sure everybody has appropriate time for each that they would like. 00:24:24.180 |
But we will also have the systems up until probably about noon early afternoon. 00:24:30.180 |
We'll keep them up if you want it to also go back and look at it after. 00:24:40.180 |
So I have a new URL where we have three more available so far but others are provisioning. 00:24:55.180 |
And I just wanted to go ahead and put the URL up. 00:25:17.180 |
I put descriptions on the left to explain that but I just wanted to heads up we're kind of moving from system performance to now that kind of factual accuracy part of the pyramid. 00:25:30.180 |
So you get a sense of so we're going to be doing the MMLU pro in the second activity and then the third activity is going to be focused on safety and bias and more custom evals. 00:25:43.180 |
So that's the trajectory of activities feel free if one is also more interesting than the other feel free to skip around totally fine. 00:25:58.180 |
You can cut you can basically customize everything because I like all these things are open source so you can create a similar type of eval in that multiple choice format that MMLU does with your own data set. 00:26:11.180 |
There's different ways to do custom accuracy evals with your fine tuned data so we kind of do it in a we have so as part of one of the products we incorporate an eval for our fine tuned models on your proprietary data and we do like a branch of MMLU. 00:26:28.180 |
Essentially so there's kind of a lot of ways to skin a cat you know I love cats in regards to how to set up the evals and a lot of tools available. 00:26:49.180 |
Yeah, so yeah, yes. You can just like fork it and change the data sources. Yeah, yeah. 00:27:13.180 |
The instructions don't some of the instructions don't look updated as I expected so I'm going what I'm going to do is everybody in the slack for engineer world fair. 00:27:24.180 |
I created a slack channel called work workshop beyond benchmarks and I'm going to put some of these activity three does anybody see activity three. 00:27:38.180 |
Okay, I'm putting content in this in this slack. 00:27:44.180 |
If people can be it's a public channel if people can go there my you'll see starting I'm the link is activity two but then you'll also see at the page for activity three. 00:28:05.180 |
You're approaching activity three and are looking for that the systems for some reason didn't render my latest changes yesterday where I improved on activity three, but I put the link to it to my repo in the slack channel. 00:28:20.180 |
In this channel if you search for it, but there's information about the slack in the emails that we got about the event, but also on the back of our badges as well to navigate there. 00:28:29.180 |
And this will be good for after if anybody has any questions and I can send more info about any particular tool here as well. 00:28:39.180 |
I wanted to have a wrap up moment because we have we have about eight minutes. 00:28:45.180 |
I put the link to the activities in the slack channel. 00:28:48.180 |
These environments will be available until the end of the day today. 00:28:52.180 |
So you have time also to tinker around with whatever you want. 00:28:55.180 |
So you would just to kind of recap we went through the furnace sounds funny. 00:29:03.180 |
The first was at that system performance latency throughput level. 00:29:08.180 |
Did everybody kind of get through that successfully? 00:29:10.180 |
Of course, there's I tried to include reading material and stuff to kind of look more into things after because it is it is a very big topic and there's a lot going on and there's a lot of terms and it is very complicated. 00:29:22.180 |
So hopefully you can use it as a learning reason my GitHub repository as a learning resource to kind of poke around after. 00:29:27.180 |
But we started there and then we moved into the MMLU pro with MLE Val harness, which also allows you to do a lot of other evaluation benchmarks as a part of that MLE Val harness framework. 00:29:39.180 |
I happen to choose MMLU pro because it took the least amount of time, even though it took still 10 minutes. 00:29:46.180 |
But there's other ones also that you can play around with within that MLE Val harness repository. 00:29:52.180 |
You can see the different evals you can run there. 00:29:55.180 |
And then we ended with a safety evaluation with prompt, which with prompt foo, which is a tool that allows you to do a lot of customizing and your own evals. 00:30:06.180 |
Like you can do all kinds of custom tests with prompt foo. 00:30:09.180 |
So I wanted to get you exposed to that tool so that you can start looking around there as well. 00:30:14.180 |
That repository on GitHub also has a lot of different examples. 00:30:18.180 |
So we use that particular safety focused example. 00:30:21.180 |
But if you look at the prompt foo repository, it's very easy to play around with other types of examples as well. 00:30:28.180 |
So we kind of moved up the pyramid throughout the activity. 00:30:32.180 |
So hopefully you get a sense of kind of how you can layer this approach when you're looking at and trying to plan for how to strategically implement evals across your entire system. 00:30:43.180 |
Does anybody have any kind of questions or general what they experienced notes of import? 00:30:58.180 |
I'm curious also about use cases and happy to talk to you after too. 00:31:05.180 |
So like when I'm doing evals, it's like testing my prompts and like my data science or if I want to switch like models out and stuff right now. 00:31:14.180 |
I'm actually thinking about like those evals connecting those actually to like my production like running like use cases to like track that like my real performance matches kind of like the drug scene. 00:31:29.180 |
What is like, is there like a word for that concept or like? 00:31:32.180 |
So for me, what I hear is the kind of CI/CD automation implementation of an evaluation framework just like with software engineering testing is kind of what I hear from that. 00:31:48.180 |
When it's actually running for customers and stuff. 00:31:52.180 |
You should have a CI/CD framework that includes these evaluation tests just like for unit testing setups. 00:32:06.180 |
Again, the environments will be up until about 6:00 PM, 5:00 PM tonight.