My name is Emil Sedgh. I'm the CTO at ReChat, and I'm here with my partner Hamel. We're going to talk about the product we built, the challenges we faced, and how our eval framework came to the rescue, and we'll also show you some results. First, a little bit about us and how the product we built came to be.
Last year we tried to see if we had any AI play. Our application is designed for real estate agents and brokers, and we had a lot of features: contact management, email marketing, social marketing, and so on. We realized that we had a lot of APIs we'd built internally, and a lot of data.
So naturally we arrived at the unique and brilliant idea that we needed to build an AI agent for our real estate agents. Let me rewind back about a year to when we started. We began by creating a prototype, built on the original GPT-3.5 and the ReAct framework.
It was very, very slow and it made mistakes all the time. But when it worked, it was a majestic, beautiful experience. So we thought, okay, we've got the product to a demo state, but now we have to take it to production, and that's when we started partnering with Hamel to create a production-ready product.
I'm going to show you some very basic examples of how this product works. Agents ask it to do things for them: create a contact with this information, send an email to somebody with some instructions, find me some listings (the kinds of things real estate agents tend to do), or create a website for me.
So we created this prototype, and then we started the phase of improving the language model. The problem was that when we tried to make changes, we didn't really know whether we were improving things or not. We would make a change and invoke it a couple of times.
We would get a feeling that, yeah, it worked a couple of times, but we didn't really know what the success rate or failure rate was. Is it going to work 50% of the time or 80% of the time? It's very difficult to launch a production app when you don't really know how well it's going to function.
The other problem was that even when we felt we had improved one situation, the moment we changed the prompts, it was likely to break other use cases. We were essentially in the dark. That's when we partnered with Hamel to guide us and see if we could make this app production-ready.
I'm going to let him take it from here. Thanks, Emil. So what Emil described is that he was able to use prompt engineering, implement RAG, agents, and so on, and iterate really fast with just vibe checks to go from zero to one. And this is a really common approach to building an MVP.
It actually works really well for building an MVP. In reality, though, this approach doesn't work for very long; it leads to stagnation. And if you don't have a way of measuring progress, you can't really build. So in this talk, I'm going to go over a systematic approach you can use to improve your AI consistently.
I'm also going to talk about how to avoid common traps and give you some resources for learning more, because you can't learn everything in a 15-minute talk. This diagram is an illustration of the recipe for this systematic approach to creating an evaluation framework. You don't have to fixate too much on the details of the diagram, because I'm going to walk through it slowly.
The first thing I want to talk about is unit tests and assertions. A lot of people are familiar with unit tests and assertions from building software, but for whatever reason, people tend to skip this step, and it's the foundation of evaluation systems.
You don't want to jump straight to LLM-as-a-judge or generic evals. You want to write down as many assertions and unit tests as you can about the failure modes you're experiencing with your large language model. And that really comes from looking at data.
What you have on the slide here are some simple unit tests and assertions that ReChat wrote based on failure modes we observed in the data. These aren't all of them; there are many. But they're examples of very simple checks, like testing whether agents are working properly: emails not being sent, invalid placeholders, or details being repeated when they shouldn't be.
The details of these specific assertions don't matter. What I'm trying to drive home is that this is a very simple thing that people skip, but it's absolutely essential, because running these assertions gives you immediate feedback and is almost free. It's really critical to your overall evaluation system to have them.
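To make this concrete, here is a minimal sketch of what such assertions can look like as pytest tests; `run_agent`, the trace fields, and the scenarios are hypothetical stand-ins rather than ReChat's actual code:

```python
# Sketch of failure-mode assertions written as pytest tests.
import re
import pytest

from my_agent import run_agent  # hypothetical entry point to the assistant under test

SCENARIOS = [
    "Create a contact named Jane Doe with email jane@example.com",
    "Send an email to John about the open house on Saturday",
]

@pytest.mark.parametrize("user_input", SCENARIOS)
def test_no_unrendered_placeholders(user_input):
    trace = run_agent(user_input)  # assumed to return a dict with the final output and tool calls
    # Fail if the model leaked template placeholders like {{first_name}} into the output.
    assert not re.search(r"\{\{.*?\}\}", trace["output"])

def test_send_email_tool_actually_invoked():
    trace = run_agent("Send an email to John about the open house on Saturday")
    tools_used = [call["name"] for call in trace["tool_calls"]]
    # The agent should have called the email tool, not just claimed it sent an email.
    assert "send_email" in tools_used
```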
How do you run the assertions? One very reasonable way is to use CI. You can outgrow CI, and it may not work as you mature, but one theme I want to get across is: use what you have when you begin. Don't jump straight into tools. Another thing you want to do with these assertions and unit tests is log the results to a database.
When you're starting out, you want to keep it simple and stupid and use your existing tools. In ReChat's case, they were already using Metabase, so we logged these results to Metabase and then used it to visualize and track the results, so we could see whether we were making progress on these dumb failure modes over time.
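As a sketch of what that logging can look like, the snippet below writes assertion results to a table that a BI tool like Metabase can chart; it uses SQLite to stay self-contained (in practice you would write to whatever database Metabase already connects to), and the table and column names are made up:

```python
# Log each assertion result as a row so pass rates can be charted over time.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("eval_results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS eval_results (
        run_at TEXT,
        test_name TEXT,
        scenario TEXT,
        passed INTEGER,
        prompt_version TEXT
    )
""")

def log_result(test_name, scenario, passed, prompt_version):
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), test_name, scenario, int(passed), prompt_version),
    )
    conn.commit()

# A Metabase question can then group by prompt_version or by day and plot
# AVG(passed) to show whether the dumb failure modes are going away.
log_result("test_no_unrendered_placeholders", "create_contact", True, "v42")
```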
Again, my recommendation is don't buy stuff. Use what you have when you're beginning and then get into tools later. And I'll talk more about that in a minute. So we talked a little bit about unit tests and assertions. The next thing I want to talk about is logging and human review.
So it's important to log your traces. There are a lot of tools you can use to do this, and this is one area where I actually do suggest using a tool right off the bat. There are a lot of commercial and open source options listed on the slide.
In ReChat's case, they ended up using LangSmith. But more importantly, it's not enough to just log your traces; you have to look at them. Otherwise, there's no point in logging them.
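As a rough illustration of the logging side, here is a minimal sketch using LangSmith's `@traceable` decorator, assuming the `langsmith` and `openai` packages are installed and their API keys are configured in the environment; the function body is illustrative, not Lucy's actual implementation:

```python
# Log every assistant turn as a run that can be inspected later in LangSmith.
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(name="lucy_turn")  # records inputs, outputs, and timing for each call
def answer(user_input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content

answer("Find me 3-bedroom listings in Austin under $600k")
```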
One nuance here is that looking at your data is so important that I actually recommend building your own data viewing and annotation tools in a lot of cases. The reason is that your data and application are often very unique, and there's a lot of domain-specific stuff in your traces. In ReChat's case, we found that off-the-shelf tools had too much friction for us, so we built our own little application. You can do this very easily with something like Gradio or Streamlit.
I use Shiny for Python; it really doesn't matter which one. We have a lot of domain-specific stuff in this page: things that let us filter data in ways that are very specific to ReChat, plus lots of other ReChat-specific metadata associated with each trace, so I don't have to hunt for information to evaluate a trace.
And there's more going on here: this is not only a data viewing app, it's also a data labeling app that facilitates human review, which I'll talk about in a second. This is the most important part. If you remember anything from this talk, it's that you need to look at your data, and you need to fight as hard as you can to remove all friction from looking at your data, even down to creating your own data viewing apps if you have to.
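As an illustration of how small such a tool can be, here is a minimal sketch of a combined trace viewer and labeling app in Gradio; the `traces.jsonl` file and its field names are hypothetical:

```python
# Minimal home-grown trace review and labeling app.
import json
import gradio as gr

with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

labels = {}  # trace index -> {"verdict": ..., "notes": ...}

def show_trace(i):
    t = traces[int(i)]
    return t.get("user_query", ""), json.dumps(t.get("tool_calls", []), indent=2), t.get("output", "")

def save_label(i, verdict, notes):
    labels[int(i)] = {"verdict": verdict, "notes": notes}
    return f"Saved label for trace {int(i)} ({len(labels)} labeled so far)"

with gr.Blocks() as app:
    idx = gr.Slider(0, len(traces) - 1, step=1, label="Trace #")
    query = gr.Textbox(label="User query")
    tools = gr.Code(label="Tool calls", language="json")
    output = gr.Textbox(label="Model output")
    verdict = gr.Radio(["good", "bad"], label="Verdict")
    notes = gr.Textbox(label="Critique / notes")
    status = gr.Markdown()
    idx.change(show_trace, inputs=idx, outputs=[query, tools, output])
    gr.Button("Save label").click(save_label, inputs=[idx, verdict, notes], outputs=status)

app.launch()
```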
It's absolutely critical: if there's any friction in looking at data, people are not going to do it, and that will destroy the whole process; none of this is going to work. So, we've talked a little bit about unit tests, logging your traces, and human review.
And you might be wondering: okay, you have these tests, but what about the test cases? Especially when you're starting out, you might not have any users. You can use LLMs to systematically generate inputs to your system. In ReChat's case, we used an LLM to cosplay as a real estate agent and ask questions as inputs into Lucy, their AI assistant, across all the different features, scenarios, and tools to get really good test coverage.
I just want to point out that using LLMs to synthetically generate inputs is a good way to bootstrap these test cases. So we've talked about unit tests, logging traces, and human review. That is the bare-bones version of an evaluation system.
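Here is a rough sketch of that bootstrapping step, assuming an OpenAI-style chat completion API; the feature list and prompt are illustrative:

```python
# Generate synthetic user requests per feature by having an LLM role-play an agent.
import json
from openai import OpenAI

client = OpenAI()
FEATURES = ["contact management", "email marketing", "listing search", "website creation"]

def synthetic_inputs(feature: str, n: int = 5) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are role-playing a busy real estate agent talking to an AI assistant."},
            {"role": "user",
             "content": (f"Write {n} realistic requests an agent might make that exercise the "
                         f"'{feature}' feature. Return JSON like {{\"requests\": [\"...\"]}}.")},
        ],
    )
    return json.loads(response.choices[0].message.content)["requests"]

# One batch of test inputs per feature, ready to feed into the assertions above.
test_cases = {feature: synthetic_inputs(feature) for feature in FEATURES}
```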
What you want to do when you first construct that is to test out the evaluation system. You want to do something to make progress on your AI, and the easiest way to try to make progress on your AI is prompt engineering.
So what you should do is go through this loop as many times as possible: try to improve your AI with prompt engineering and see whether your test coverage is good. Are you logging your traces correctly? Did you remove as much friction as possible from looking at your data?
This will help you debug all of that, and it also gives you the satisfaction of making progress on your AI. One thing I want to point out is that the upshot of having an evaluation system is that you get other superpowers almost for free. Most of the work in fine-tuning is data curation.
We already talked about synthetic data generation and how that interacts with the eval framework. You can use your eval framework to filter out good cases and feed them into your human review, like we showed with that application, and you can start to curate data for fine-tuning.
For the failed cases, you have this workflow you can use to work through them and continuously update your fine-tuning data. And what we've seen over time is that the more comprehensive your eval framework is, the lower the cost of human review, because you're automating more and more of these checks and getting more confidence in your data.
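A minimal sketch of that curation step might look like the following, exporting only the traces a reviewer marked as good into the JSONL chat format used by the OpenAI fine-tuning API; the `traces` and `labels` structures are hypothetical and would come from the trace store and review app described earlier:

```python
# Export human-approved traces as fine-tuning examples.
import json

def export_finetuning_data(traces, labels, path="finetune.jsonl"):
    with open(path, "w") as f:
        for i, trace in enumerate(traces):
            label = labels.get(i)
            # Keep only traces a human (or a trusted judge) marked as good.
            if not label or label["verdict"] != "good":
                continue
            example = {
                "messages": [
                    {"role": "system", "content": trace["system_prompt"]},
                    {"role": "user", "content": trace["user_query"]},
                    {"role": "assistant", "content": trace["output"]},
                ]
            }
            f.write(json.dumps(example) + "\n")
```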
Once you have this setup, you're in a position to know whether or not you're making progress. You have a workflow you can use to quickly make improvements, and you can start getting rid of those dumb failure modes. You're also now set up to move into more advanced things like LLM-as-a-judge.
That matters because you can't express everything as an assertion or a unit test. Now, LLM-as-a-judge is a deep topic that's outside the scope of this talk, but one thing I want to point out is that it's very, very important to align the LLM judge with a human, because you need to know whether you can trust the LLM as a judge.
You need a principled way of reasoning about how reliable the LLM judge is. What I like to do is, again, keep it simple and stupid: I often just use a spreadsheet. Don't make it complicated. I have a domain expert label and critique the data, and I keep iterating on that until my LLM judge is in alignment with my human judge and I have high confidence that the judge is doing what it's supposed to do.
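A simple way to quantify that alignment is to export the spreadsheet to CSV and compute raw agreement between the human and the judge, as in this sketch; the file and column names are assumptions:

```python
# Measure how often the LLM judge agrees with the domain expert.
import csv
from collections import Counter

def judge_agreement(path="judge_labels.csv"):
    pairs = []
    with open(path) as f:
        for row in csv.DictReader(f):
            pairs.append((row["human_verdict"], row["judge_verdict"]))
    agreement = sum(h == j for h, j in pairs) / len(pairs)
    # The confusion counts show where the judge disagrees, e.g. ("good", "bad")
    # counts cases where the judge was harsher than the human.
    confusion = Counter(pairs)
    return agreement, confusion

agreement, confusion = judge_agreement()
print(f"Judge agrees with the human on {agreement:.0%} of examples")
print(confusion)
```

You keep iterating on the judge prompt and re-running this check until the agreement number is high enough that you trust the judge.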
Now I'm going to go through some common mistakes people make when building LLM evaluation systems. One is not looking at your data. It's easier said than done, and people don't do the best job of it.
One key to unlocking this is to remove all the friction, as I mentioned before. The second mistake, and this is just as important, is focusing on tools rather than processes. If you're having a conversation about evals and the first thing you start thinking about is tools, that's a smell that you're not going to be successful in your evaluations.
People like to jump straight to the tools: tell me about the tools, what tools should I use? It's really important not to reach for tools at the beginning; try to do some of these things manually with what you already have. If you don't, you won't be able to evaluate the tools, and you have to know what the process is before you jump into the tools.
Otherwise, you're going to be blindsided. Another common mistake is using generic, off-the-shelf evals. You don't want to reach for generic evals; you want to write evals that are very specific to your domain. Things like conciseness score, toxicity score, and all the other evals you can get off the shelf with tools.
You don't want to go directly to those. That's also a smell that you're not doing things correctly. It's not that they have no value; it's just that you shouldn't rely on them, because they can become a crutch. And finally, the other common mistake is reaching for LLM-as-a-judge too early.
I often find that if I'm looking at the data closely enough, I can find plenty of assertions and failure modes. It's not always the case, but it often is. So don't go to LLM-as-a-judge too early, and make sure you align your LLM judge with a human.
So I'm going to flip it back over to Emil, who is going to talk about the results of implementing this system. All right. After we got to the virtuous cycle that Hamel just showed, we managed to rapidly increase the success rate of the LLM application. Without the eval framework, a project like this seemed completely impossible for us.
One thing I've started to hear a lot is that few-shot prompting is going to replace fine-tuning, or notions like that. In our case, we never managed to get everything we wanted through few-shot prompting, even using the newer and smarter models. I wish we could. I've seen a lot of judgment of companies and products for being just ChatGPT wrappers.
I wish we could just be a ChatGPT wrapper and still deliver the experience we want for our users, but we never had that opportunity, because we had some really difficult cases. One of the things we wanted our agent to be able to do was mix natural language with user interface elements like this inside the output, which essentially required us to mix structured and unstructured output together.
We never managed to get this working reliably without fine-tuning. Another thing was feedback. Sometimes the user asks, in a case like this, "do this for me," but the agent can't just do it; it needs some sort of feedback, more input from the user. Again, something like this was very difficult for us to execute on, especially given the previous challenge of injecting user interface elements into the conversation.
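To give a sense of what "mixing structured and unstructured output" means in practice, here is a heavily simplified sketch of parsing an interleaved response; the `<component>` tag format is purely illustrative, not ReChat's actual protocol, and the point is only that the model's output must split cleanly into prose and renderable widgets:

```python
# Parse model output that interleaves free-form text with structured UI components.
import json
import re

sample_output = (
    "I found 3 listings that match your criteria:\n"
    '<component>{"type": "listing_card", "mls_id": "12345", "price": 950000}</component>\n'
    "Would you like me to create a website for the most expensive one?"
)

def parse_mixed_output(text):
    """Split model output into ("text", str) and ("component", dict) chunks."""
    chunks = []
    pattern = re.compile(r"<component>(.*?)</component>", re.DOTALL)
    pos = 0
    for match in pattern.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            chunks.append(("text", before))
        chunks.append(("component", json.loads(match.group(1))))
        pos = match.end()
    trailing = text[pos:].strip()
    if trailing:
        chunks.append(("text", trailing))
    return chunks

for kind, payload in parse_mixed_output(sample_output):
    print(kind, payload)
```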
And the third reason we had to fine-tune was complex commands like this one. I'm going to show a tiny video of how this command was executed, but basically, in this example, the user is issuing a very complex command that requires five or six different tools to complete.
What we wanted was for it to take that input, break it down into many different function calls, and execute them. In this case, I'm asking it to find me some listings matching certain criteria and then create a website, which is something real estate agents do for the listings they're responsible for, and also an Instagram post, because they want to market it.
They want this done only for the most expensive of the three listings. So the application has found three listings, created a website for the most expensive one, created and rendered an Instagram post video for it, and then prepared an email to Hamel including all the information about the listing, the website that was created, and the Instagram story. It has also invited Hamel to dinner and created a follow-up task.
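For a sense of what such a decomposition looks like mechanically, here is a sketch using the OpenAI tool-calling API; the tool names and schemas are illustrative, and a real agent loop would execute each call and feed the results back before continuing:

```python
# Sketch: let the model break one complex request into multiple tool calls.
from openai import OpenAI

client = OpenAI()

def tool(name, description, properties, required):
    return {"type": "function", "function": {
        "name": name, "description": description,
        "parameters": {"type": "object", "properties": properties, "required": required},
    }}

tools = [
    tool("search_listings", "Find listings matching criteria",
         {"city": {"type": "string"}, "min_beds": {"type": "integer"}}, ["city"]),
    tool("create_website", "Create a website for a listing",
         {"mls_id": {"type": "string"}}, ["mls_id"]),
    tool("create_instagram_post", "Create an Instagram post for a listing",
         {"mls_id": {"type": "string"}}, ["mls_id"]),
    tool("send_email", "Send an email to a contact",
         {"to": {"type": "string"}, "body": {"type": "string"}}, ["to", "body"]),
    tool("create_task", "Create a follow-up task",
         {"title": {"type": "string"}, "due": {"type": "string"}}, ["title"]),
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content":
        "Find 3-bed listings in Austin, build a website and an Instagram post for the "
        "most expensive one, then email the details to Hamel and add a follow-up task."}],
    tools=tools,
)

# Inspect which tool calls the model planned for this turn.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```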
Creating something like this manually might take a non-tech-savvy real estate agent a couple of hours, but using the agent, they can do it in a minute. That essentially would not have been possible without a comprehensive eval framework. Nailed the timing. Thank you, guys.
Thank you.