How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh

My name is Emil Sedgh. I'm CTO at ReChat, and together with my partner Hamel we're gonna talk about the product we built, the challenges we faced, and how our eval framework came to the rescue, and we'll also show you some results.

A little bit about us and how the product that we built came to be. Last year we tried to see if we had an AI play. Our application is designed for real estate agents and brokers, and we had a lot of features like contact management, email marketing, social marketing, whatever. We realized that we have a lot of APIs that we've built internally and we have a lot of data. So naturally we came to the unique and brilliant idea that we need to build an AI agent for our real estate agents.

So I'm gonna rewind back a year. When we started this, we started with the process of creating a prototype. We built this prototype using the original GPT-3.5 and the ReAct framework. It was very, very slow and it was making mistakes all the time. But when it worked, it was a majestic experience, a beautiful experience. So we thought, okay, we've got the product in a demo state, but now we have to take it to production, and that's when we started partnering up with Hamel to create a production-ready product.

I'm gonna show you some very basic examples of how this product works. Agents ask it to do things for them, like "create a contact for me with this information," "send an email to somebody with some instructions," or "find me some listings," because that's what real estate agents tend to do, or "create a website for me."
So we created this prototype, and then we started the phase of improving the language model. The problem was, when we tried to make changes to see if we could improve it, we didn't really know if we were improving things or not. We would make a change, we would invoke it a couple of times, and you would get a feeling that, yeah, it worked a couple of times, but we didn't really know what the success rate or failure rate was. Is it gonna work 50% of the time or 80% of the time? And it's very difficult to launch a production app when you don't really know how well it's gonna function. The other problem was that when we did improve the situation, or got a feeling that it was improving, the moment we changed the prompts it was likely to break other use cases. We were essentially in the dark. And that's when we started to partner up with Hamel to guide us and see if we could make this app production-ready. I'm gonna let him take it from here.

Thanks, Emil. So what Emil described is that he was able to use prompt engineering, implement RAG, agents, and so on, and iterate really fast with just vibe checks to go from zero to one. This is a really common approach to building an MVP, and it actually works really well for that. However, in reality, this approach doesn't work for that long at all; it leads to stagnation. And if you don't have a way of measuring progress, you can't really build.
So in this talk, what I'm gonna go over is a systematic approach you can use to improve your AI consistently. I'm also gonna talk about how to avoid common traps and give you some resources on how to learn more, because you can't learn everything in a 15-minute talk.

This diagram is an illustration of the recipe for this systematic approach of creating an evaluation framework. You don't have to fixate too much on the details of the diagram, because I'm gonna be walking through it slowly.

The first thing I want to talk about is unit tests and assertions. A lot of people are familiar with unit tests and assertions if they have been building software, but for whatever reason people tend to skip this step, and it's kind of the foundation for evaluation systems. You don't want to jump straight to LLM-as-a-judge or generic evals. You want to try to write down as many assertions and unit tests as you can about the failure modes that you're experiencing with your large language model, and it really comes from looking at data. So what you have on the slide here are some simple unit tests and assertions that ReChat wrote based upon failure modes that we observed in the data. These are not all of them; there are many of these. But they are examples of very simple things, like testing if agents are working properly, so emails not being sent, or things like invalid placeholders, or other details being repeated when they shouldn't be. The details of these specific assertions don't matter. What I'm trying to drive home is that this is a very simple thing that people skip, but it's absolutely essential, because running these assertions gives you immediate feedback and is almost free, and it's really critical to your overall evaluation system if you can have them.
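To make this concrete, here is a minimal sketch of what such assertions can look like. The function names are hypothetical, and the two checks simply mirror the failure modes just mentioned; this is not ReChat's actual test suite:

```python
import re

def no_unresolved_placeholders(output: str) -> bool:
    """Fail when the model leaves template placeholders like
    {{first_name}} unrendered in a generated email."""
    return re.search(r"\{\{.+?\}\}", output) is None

def no_repeated_sentences(output: str) -> bool:
    """Fail when the same sentence appears twice, a cheap proxy for
    'details being repeated when they shouldn't be'."""
    sentences = [s.strip().lower() for s in output.split(".") if s.strip()]
    return len(sentences) == len(set(sentences))

def run_assertions(output: str) -> dict[str, bool]:
    """Run every assertion against one LLM output and report
    pass/fail per failure mode."""
    return {
        "unresolved_placeholders": no_unresolved_placeholders(output),
        "repeated_sentences": no_repeated_sentences(output),
    }
```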
So how do you run the assertions? One very reasonable way is to use CI. You can outgrow CI, and it may not work as you mature, but one theme I want to get across is: use what you have when you begin. Don't jump straight into tools. Another thing that you want to do with these assertions and unit tests is log the results to a database. But when you're starting out, you want to keep it simple and stupid and use your existing tools. In ReChat's case, they were already using Metabase, so we logged these results to Metabase and then used Metabase to visualize and track the results, so that we could see whether we were making progress on these dumb failure modes over time. Again, my recommendation is: don't buy stuff. Use what you have when you're beginning and get into tools later. I'll talk more about that in a minute.
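A minimal sketch of that logging step, using a plain SQLite table as a stand-in for ReChat's Metabase-backed setup; the table and column names are invented for illustration:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("eval_results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS assertion_results (
        run_at TEXT,        -- when the assertion suite ran
        trace_id TEXT,      -- which logged trace was checked
        assertion TEXT,     -- failure mode being tested
        passed INTEGER      -- 1 = pass, 0 = fail
    )
""")

def log_results(trace_id: str, results: dict) -> None:
    """Write one row per assertion so a BI tool such as Metabase
    can chart pass rates per failure mode over time."""
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO assertion_results VALUES (?, ?, ?, ?)",
        [(now, trace_id, name, int(ok)) for name, ok in results.items()],
    )
    conn.commit()
```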
So we talked a little bit about unit tests and assertions. The next thing I want to talk about is logging and human review. It's important to log your traces. There are a lot of tools you can use to do this, and this is one area where I actually do suggest using a tool right off the bat. There are a lot of commercial and open-source tools listed on the slide; in ReChat's case, they ended up using LangSmith. But more importantly, it's not enough to just log your traces. You have to look at them. Otherwise, there's no point in logging them.

One nuance here is that looking at your data is so important that I actually recommend building your own data viewing and annotation tools in a lot of cases. The reason is that your data and application are often very unique; there's a lot of domain-specific stuff in your traces. In ReChat's case, we found that existing tools had too much friction for us, so we built our own little application. You can do this very easily in something like Gradio or Streamlit; I use Shiny for Python. It really doesn't matter. But we have a lot of domain-specific stuff in this web page: things that allow us to filter data in ways that are very specific to ReChat, plus lots of other metadata associated with each trace that is ReChat-specific, so I don't have to hunt for information to evaluate a trace. And there's more going on here: this is not only a data viewing app, it's also a data labeling app that facilitates human review, which I'll talk about in a second.
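To make the build-your-own-annotation-app idea concrete, here is a minimal sketch in Gradio, one of the frameworks just mentioned. The trace store and label store are hypothetical stand-ins for wherever you actually log traces and keep review data:

```python
import gradio as gr

# Hypothetical stand-ins: load traces from wherever you log them,
# and persist labels to wherever you keep review data.
traces = [{"id": "t1", "text": "example trace ..."}]  # replace with real traces
labels: dict[str, str] = {}

def show_trace(index: int) -> str:
    """Render one logged trace for the reviewer."""
    trace = traces[index % len(traces)]
    return f"trace {trace['id']}:\n{trace['text']}"

def save_label(index: int, verdict: str) -> str:
    """Record the human verdict, then return the next trace to show."""
    labels[traces[index % len(traces)]["id"]] = verdict
    return show_trace(index + 1)

with gr.Blocks() as demo:
    idx = gr.State(0)
    view = gr.Textbox(value=show_trace(0), label="Trace", lines=10)
    good = gr.Button("Good")
    bad = gr.Button("Bad")
    # each click saves a label and advances to the next trace
    good.click(lambda i: (save_label(i, "good"), i + 1), idx, [view, idx])
    bad.click(lambda i: (save_label(i, "bad"), i + 1), idx, [view, idx])

demo.launch()
```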
So this is the most important part. If you remember anything from this talk, it is: you need to look at your data, and you need to fight as hard as you can to remove all friction in looking at your data, even down to creating your own data viewing apps if you have to. It's absolutely critical. If there's any friction in looking at data, people are not going to do it, and that will destroy the whole process; none of this is going to work.

So we've talked a little bit about unit tests, logging your traces, and human review. And you might be wondering: okay, you have these tests, but what about the test cases? What do we do about that? Especially when you're starting out, you might not have any users.
You can use LLMs to systematically generate inputs to your system. In ReChat's case, we basically used an LLM to cosplay as a real estate agent and ask questions as inputs into Lucy, which is their AI assistant, across all the different features, scenarios, and tools, to get really good test coverage. So I just want to point out that using LLMs to synthetically generate inputs is a good way to bootstrap these test cases.
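A sketch of how that bootstrapping might look, using the OpenAI chat API as one possible backend; the persona prompt, model choice, and FEATURES list are illustrative assumptions, not ReChat's actual setup:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative list of product areas to cover; in practice you would
# enumerate every feature, scenario, and tool the assistant supports.
FEATURES = ["contact management", "email marketing", "listing search"]

def synthetic_inputs(feature: str, n: int = 5) -> list[str]:
    """Ask an LLM to role-play a real estate agent and produce
    realistic requests that exercise one feature."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[
            {"role": "system",
             "content": "You are a real estate agent using an AI assistant."},
            {"role": "user",
             "content": f"Write {n} requests, one per line, that exercise "
                        f"the '{feature}' feature."},
        ],
    )
    return resp.choices[0].message.content.splitlines()

test_cases = [q for f in FEATURES for q in synthetic_inputs(f)]
```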
So we talked a little bit about unit tests, logging traces, and having a human review. When you have a very minimal setup like this, a bare-bones evaluation system, what you want to do when you first construct it is test it out. You want to do something to make progress on your AI, and the easiest way to try to make progress on your AI is prompt engineering. So what you should do is go through this loop as many times as possible: try to improve your AI with prompt engineering and see if your test coverage is good. Are you logging your traces correctly? Did you remove as much friction as possible from looking at your data? This will help you debug the evaluation system, but also give you the satisfaction of making progress on your AI as well.
One thing I want to point out is that the upshot of having an evaluation system is that you get other superpowers almost for free. All of the work in fine-tuning, or most of it, is data curation. We already talked about synthetic data generation and how that interacts with the eval framework. What you can do is use your eval framework to filter out the good cases, feed them into your human review like we showed with that application, and start to curate data for fine-tuning. And for the failed cases, you have this workflow that you can use to work through them and continuously update your fine-tuning data. What we've seen over time is that the more comprehensive your eval framework is, the more the cost of human review goes down, because you're automating more and more of these things and getting more confidence in your data.
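One way to picture that curation loop, reusing the hypothetical run_assertions helper from the earlier sketch; the trace fields here are invented for illustration:

```python
def curate_finetuning_data(traces: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split logged traces by assertion results: passing traces become
    fine-tuning candidates, failing ones go to the human-review queue."""
    candidates, review_queue = [], []
    for trace in traces:
        results = run_assertions(trace["output"])  # from the earlier sketch
        if all(results.values()):
            candidates.append({"prompt": trace["input"],
                               "completion": trace["output"]})
        else:
            review_queue.append(trace)
    return candidates, review_queue
```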
So once you have this setup, you're now in a position to know whether or not you're making progress. You have a workflow that you can use to quickly make improvements, and you can start getting rid of those dumb failure modes. But you're also now set up to move into more advanced things like LLM-as-a-judge, because you can't express everything as an assertion or a unit test. Now, LLM-as-a-judge is a deep topic that's outside the scope of this talk, but one thing I want to point out is that it's very, very important to align the LLM judge to a human, because you need to know whether you can trust the LLM as a judge. You need a principled way of reasoning about how reliable it is. So what I like to do is, again, keep it simple and stupid: I often just use a spreadsheet. Don't make it complicated. What I do is have a domain expert label data and critique the judge's critiques, and keep iterating on that until my LLM judge is in alignment with my human judge and I have high confidence that it's doing what it's supposed to do.
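A minimal sketch of what checking that alignment can look like, assuming the spreadsheet of paired judge/human labels has been exported to a CSV; the file and column names are made up for illustration:

```python
import csv

def judge_agreement(path: str) -> float:
    """Fraction of examples where the LLM judge's verdict matches
    the domain expert's label (columns: 'judge', 'human')."""
    with open(path) as f:
        rows = list(csv.DictReader(f))
    matches = sum(1 for r in rows if r["judge"] == r["human"])
    return matches / len(rows)

# Iterate on the judge's prompt until this is high enough to trust.
print(f"agreement: {judge_agreement('judge_labels.csv'):.0%}")
```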
So I'm going to go through some common mistakes that people make when building LLM evaluation systems. One is not looking at your data. It's easier said than done, and people don't do the best job of it; one key to unlocking it is to remove all the friction, as I mentioned before.

The second one, and this is just as important, is focusing on tools, not processes. If you're having a conversation about evals and the first thing you start thinking about is tools, that's a smell that you're not going to be successful in your evaluations. People like to jump straight to the tools: tell me about the tools, what tools should I use? It's really important to try not to use tools to begin with, and to do some of these things manually with what you already have. If you don't, you won't be able to evaluate the tools, and you have to know what the process is before you jump into the tools; otherwise, you're going to be blindsided.

Another common mistake is using generic evals off the shelf. You don't want to reach for generic evals; you want to write evals that are very specific to your domain. Things like conciseness score, toxicity score, all the different evals you can get off the shelf with tools: you don't want to go directly to those. That's also a smell that you're not doing things correctly. It's not that they're not valuable at all; it's just that you shouldn't rely on them, because they can become a crutch.

And finally, the other common mistake is using LLM-as-a-judge too early. I often find that if I'm looking at the data closely enough, I can find plenty of assertions and failure modes. That's not always the case, but it often is. So don't go to LLM-as-a-judge too early, and also make sure you align the LLM judge with a human.

So I'm going to flip it back over to Emil. He's going to talk about the results of implementing this system.
All right. So after we got to the virtuous cycle that Hamel just displayed, we managed to rapidly increase the success rate of the LLM application. Without the eval framework, a project anything like this seemed completely impossible for us.

One thing that I've started to hear a lot is that few-shot prompting is going to replace fine-tuning, or notions like that. In our case, we never managed to get everything that we wanted with few-shot prompting, even using the newer and smarter models. I wish we could. I've seen a lot of judgment of companies and products for being just ChatGPT wrappers. I wish we could just be a ChatGPT wrapper and manage to extract the experience we want for our users, but we never had that opportunity, because we had some really difficult cases.

One of the things that we wanted our agent to be able to do was to mix natural language with user interface elements like this inside the output, and this essentially required us to mix structured output and unstructured output together. We never managed to get this working reliably without fine-tuning.
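As a purely illustrative sketch of what mixing structured and unstructured output can mean, the component tag format and parsing code below are invented for this example, not ReChat's actual protocol. The model has to emit prose with embedded UI blocks the frontend can render as widgets:

```python
import re

# Invented example of a model response that interleaves free text
# with a structured UI block the frontend must render as a widget.
response = """Here are the listings I found for you:
<component type="listing_card" mls_id="12345" price="980000"/>
Want me to schedule a showing for any of these?"""

# Split the response into text segments and structured components
# so the UI can render each part appropriately.
parts = re.split(r'(<component .*?/>)', response)
for part in parts:
    if part.startswith("<component"):
        attrs = dict(re.findall(r'(\w+)="(.*?)"', part))
        print("render widget:", attrs)
    elif part.strip():
        print("render text:", part.strip())
```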
Another thing was feedback. Sometimes the user asks, in a case like this, "do this for me," but the agent can't just do that; it needs some sort of feedback, more input from the user. Again, something like this was very difficult for us to execute on, especially given the previous challenge of injecting user interface elements inside the conversation.

And the third reason that we had to fine-tune was complex commands like this. I'm going to show a tiny video of how this command was executed, but basically, in this example the user is asking for a very complex command that requires using five or six different tools to be done. What we wanted was for it to take that input, break it down into many different function calls, and execute them.
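For a rough picture of the decomposition involved, the ordered tool-call plan the agent has to produce might look like the sketch below; every function name and argument is hypothetical, inferred only from the demo's description:

```python
# Hypothetical plan the agent must produce for: "find listings matching
# my criteria, then for the most expensive one create a website and an
# Instagram post, email them to Hamel, invite him to dinner, and
# create a follow-up task."
plan = [
    {"tool": "find_listings",         "args": {"criteria": "..."}},
    {"tool": "pick_listing",          "args": {"sort_by": "price", "order": "desc"}},
    {"tool": "create_website",        "args": {"listing_id": "<from step 2>"}},
    {"tool": "create_instagram_post", "args": {"listing_id": "<from step 2>"}},
    {"tool": "send_email",            "args": {"to": "Hamel", "attach": "website, post"}},
    {"tool": "create_task",           "args": {"kind": "follow_up, dinner invite"}},
]
```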
So in this case, I'm asking it to find me some listings with some criteria and then create a website, which is something real estate agents sometimes do for the listings they're responsible for, and also an Instagram post, because they want to market them. And they want this done only for the most expensive listing of the three. So the application has found three listings, created a website for the most expensive one, created and rendered an Instagram post video for it, and then prepared an email to Hamel including all the information about the listings, the website that was created, and the Instagram story that was created; it has also invited Hamel to a dinner and created a follow-up task. Creating something like this might take a non-savvy real estate agent a couple of hours, but using the agent they can do it in a minute, and that essentially was not going to be possible without us using a comprehensive eval framework. Nailed the timing. Thank you guys.