
How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh



00:00:00.000 | My name is Emil Sedgh, I'm the CTO at ReChat, and together with my partner Hamel, we're gonna talk about the product
00:00:20.460 | we built, the challenges we faced, and how our eval framework came to the rescue and we'll also
00:00:26.280 | show you some results. A little bit about us and how the product that we built came to be. Last
00:00:33.240 | year we tried to see if we have any AI play. Our application is designed for real estate agents
00:00:38.660 | and brokers and we had a lot of features like contact management, email marketing, social
00:00:43.200 | marketing, whatever. So we realized that we have a lot of APIs that we've built internally and we
00:00:49.680 | have a lot of data. So naturally we came to the unique and brilliant idea that we need to build
00:00:55.140 | an AI agent for our real estate agents. So I'm gonna rewind back a year. Basically last year when
00:01:06.580 | we started this we started with the process of creating a prototype. We built this prototype using
00:01:12.240 | GPT, the original GPT-3.5, and the ReAct framework. It was very, very slow and it was making mistakes all the time.
00:01:21.000 | But when it worked, it was a majestic experience. It was a beautiful experience. So we thought okay,
00:01:26.440 | we got the products in a demo state but now we have to take it to production and that's when we started
00:01:32.440 | partnering up with Hamel to basically create a production-ready product. I'm gonna show you some very, very basic
00:01:43.880 | examples of how this product works. Basically agents ask it to do things for them, like create a contact
00:01:49.880 | for me with this information or send an email to somebody with some instructions, find me some listings,
00:01:55.880 | things like that, because that's what real estate agents tend to do, or create a website for me.
00:02:01.320 | So yeah, we created this prototype, then we started the language model improvement phase. The problem
00:02:10.520 | was when we tried to make changes to see if we can improve it, we didn't really know if we're improving
00:02:17.800 | things or not. We would make a change. We would invoke it a couple of times. We would get a feeling
00:02:23.240 | that yeah, it worked a couple of times but we didn't really know what the success rate or failure rate was.
00:02:29.160 | Is it gonna work 50% of the time or 80% of the time? And it's very difficult to launch a production app when
00:02:34.760 | you don't really know how well it's gonna function. The other problem was that even when we improved
00:02:40.360 | one situation, the moment we changed the prompts,
00:02:44.520 | it was likely that it was gonna break other use cases. And we were essentially in the dark. And that's when
00:02:50.920 | we started to partner up with Hamel to guide us, to see if we can make this app production-ready. I'm gonna
00:02:56.840 | let him take it from here. Thanks, Emil. So what Emil described is he was able to use prompt engineering,
00:03:08.840 | implement RAG, agents, and so on and so forth, and iterate with just vibe checks really fast to go
00:03:16.760 | from zero to one. And this is a really common approach to building an MVP. It actually works
00:03:23.080 | really well for building an MVP. However, in reality, this approach doesn't work for that long at all.
00:03:30.680 | At some point it leads to stagnation. And if you don't have a way of measuring progress, you can't really build.
00:03:37.320 | So in this talk, what I'm gonna go over is a systematic approach you can use to improve your AI
00:03:44.440 | consistently. I'm also gonna talk about how to avoid common traps and give you some resources on how to
00:03:52.360 | learn more because you can't learn everything in a 15-minute talk. This diagram is an illustration of the
00:04:01.720 | recipe of this systematic approach of creating an evaluation framework. You don't have to fixate too much
00:04:09.800 | on the details of this diagram because I'm gonna be walking through it slowly. But the first thing I want to talk
00:04:15.080 | about is unit tests and assertions. So a lot of people are familiar with unit tests and assertions if you have been
00:04:22.920 | building software. But for whatever reason, people tend to skip this step. And it's kind of the foundation
00:04:31.160 | for evaluation systems. You don't want to jump straight to LLM as a judge or generic evals. You want to try to
00:04:38.840 | write down as many assertions and unit tests as you can about the failure modes that you're
00:04:44.040 | experiencing with your large language model. And it really comes from looking at data. So what you have
00:04:50.600 | on the slide here are some simple unit tests and assertions that ReChat wrote based upon failure modes
00:04:58.120 | that we observed in the data. And these are not all of them. There's many of these. But these are just
00:05:03.320 | examples of very simple things, like testing if agents are working properly, so emails not being sent
00:05:09.400 | or things like invalid placeholders or other details being repeated when they shouldn't.
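
For illustration, here's a minimal sketch of what assertions like these can look like in Python. The trace shape and helper names are hypothetical stand-ins, not ReChat's actual tests:

```python
import re

def email_was_sent(trace: dict) -> bool:
    # The user asked for an email, so the send_email tool should appear in the
    # trace. (The trace schema here is a made-up example.)
    return any(call["tool"] == "send_email" for call in trace["tool_calls"])

def no_invalid_placeholders(output: str) -> bool:
    # The model should not leave unfilled template placeholders like {{contact_name}}.
    return re.search(r"\{\{.*?\}\}", output) is None

def no_repeated_details(output: str) -> bool:
    # The same sentence should not be repeated verbatim within one response.
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    return len(sentences) == len(set(sentences))
```
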
00:05:14.760 | The details of these specific assertions don't matter. What I'm trying to drive home is
00:05:19.720 | this is a very simple thing that people skip. But it's absolutely essential because running these assertions
00:05:26.920 | gives you immediate feedback and is almost free. And it's really critical to your overall evaluation system
00:05:35.080 | if you can have them. And how do you run the assertions? One very reasonable way is to use
00:05:42.920 | CI. You can outgrow CI and it may not work as you mature. But one theme I want to get across is use
00:05:50.200 | what you have when you begin. Don't jump straight into tools. Another thing that you want to do with these
00:05:57.800 | assertions and unit tests is log the results to a database. But when you're starting out, you want to keep
00:06:04.280 | it simple and stupid. Use your existing tools. So in ReChat's case, they were already using Metabase.
00:06:11.640 | So we logged these results to Metabase and then used Metabase to visualize and track the results so
00:06:18.040 | that we could see if we're making progress on these dumb failure modes over time. Again, my recommendation is
00:06:24.520 | don't buy stuff. Use what you have when you're beginning and then get into tools later. And I'll talk more about that in a minute.
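
As a rough sketch of that setup (the table layout and the `run_llm_pipeline` entry point are assumptions; ReChat's actual Metabase wiring isn't shown in the talk), something like this can run in CI:

```python
import datetime
import sqlite3

def run_llm_pipeline(user_input: str) -> str:
    # Stand-in for your application's entry point (prompt + tools + model call).
    return "stub output"

def run_and_log(test_cases, assertions, conn):
    # Run every assertion against every test case and log a pass/fail row to
    # the database your dashboard (Metabase, in ReChat's case) reads from.
    for case in test_cases:
        output = run_llm_pipeline(case["input"])
        for assertion in assertions:
            conn.execute(
                "INSERT INTO eval_results (run_at, test_case, assertion, passed) "
                "VALUES (?, ?, ?, ?)",
                (datetime.datetime.utcnow().isoformat(), case["id"],
                 assertion.__name__, int(assertion(output))),
            )
    conn.commit()

conn = sqlite3.connect("evals.db")
conn.execute("CREATE TABLE IF NOT EXISTS eval_results "
             "(run_at TEXT, test_case TEXT, assertion TEXT, passed INTEGER)")
```
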
00:06:33.480 | So we talked a little bit about unit tests and assertions. The next thing I want to talk about is logging and human review.
00:06:40.840 | So it's important to log your traces. There's a lot of tools that you can use to do this. This is one area where I actually do
00:06:48.200 | suggest using a tool right off the bat. There's a lot of commercial tools and open source tools that are listed on the
00:06:54.200 | slide. In ReChat's case, they ended up using LangSmith. But more importantly, it's
00:07:02.200 | not enough to just log your traces. You have to look at them. Otherwise, there's no point in logging them.
00:07:09.240 | And one kind of nuance here is that looking at your data is so important that I actually recommend
00:07:17.560 | building your own data viewing and annotation tools in a lot of cases.
00:07:23.560 | And the reason is because your data and application are often very unique. There's a lot of domain
00:07:30.920 | specific stuff in your traces. So in ReChat's case, we found that tools had too much friction for us.
00:07:37.240 | So we built our own kind of little application. And you can do this very easily in something like Gradio,
00:07:42.760 | or Streamlit. I use Shiny for Python. It really doesn't matter. But we have a lot of domain-specific stuff
00:07:48.360 | in this web page, things that allow us to filter data in ways that are very specific to ReChat,
00:07:54.360 | but then also lots of other metadata associated with each trace that is ReChat-specific,
00:08:00.040 | so I don't have to hunt for information to evaluate a trace. And there's other things going on here.
00:08:07.320 | This is not only a data viewing app. This is also a data labeling app that
00:08:12.920 | facilitates human review, which I'll talk about in a second.
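
A bare-bones version of such an app is only a few dozen lines. This Gradio sketch (the trace file and schema are hypothetical) shows the idea of viewing traces and capturing labels in one place:

```python
import json
import gradio as gr

with open("traces.json") as f:   # hypothetical dump of logged traces
    traces = json.load(f)        # e.g. a list of {"input": ..., "output": ...}
labels = {}

def show(i):
    t = traces[i % len(traces)]
    return t["input"], t["output"]

def label(i, verdict):
    # Record the human verdict and advance to the next trace.
    labels[i % len(traces)] = verdict
    return (i + 1) % len(traces), *show(i + 1)

with gr.Blocks() as app:
    i = gr.State(0)
    inp = gr.Textbox(label="User input")
    out = gr.Textbox(label="LLM output")
    good = gr.Button("Good")
    bad = gr.Button("Bad")
    good.click(lambda i: label(i, "good"), i, [i, inp, out])
    bad.click(lambda i: label(i, "bad"), i, [i, inp, out])
    app.load(show, i, [inp, out])

app.launch()
```
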
00:08:16.440 | So this is the most important part. If you remember anything from this talk,
00:08:22.680 | it is you need to look at your data and you need to fight as hard as you can
00:08:26.680 | to remove all friction in looking at your data, even down to creating your own data viewing apps if you
00:08:34.840 | have to. And it's absolutely critical. If you have any friction in looking at data,
00:08:39.400 | people are not going to do it. And it will destroy the whole process and none of this is going to work.
00:08:44.440 | So we talked a little bit about unit tests, logging your traces, and human review.
00:08:55.720 | And you might be wondering, okay, like you have these tests. What about the test cases? What do
00:09:00.920 | we do about that? Especially when you're starting out, you might not have any users.
00:09:04.040 | So you can use LLMs to systematically generate inputs to your system. So in ReChat's case,
00:09:12.360 | we basically use an LLM to cosplay as a real estate agent and ask questions as inputs into Lucy,
00:09:23.320 | which is their AI assistant, for all the different features and the scenarios and the tools to get
00:09:28.840 | really good test coverage. So I just want to point out that using LLMs to synthetically generate inputs
00:09:34.760 | is a good way to bootstrap these test cases.
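
As a sketch (the feature list, prompt wording, and model name are illustrative, not ReChat's actual setup), the bootstrapping can look like this:

```python
from openai import OpenAI

client = OpenAI()
features = ["contact management", "email marketing", "listing search", "website creation"]
scenarios = ["happy path", "vague request", "multi-step request"]

def generate_inputs(feature: str, scenario: str, n: int = 5) -> list[str]:
    # Ask the model to role-play a real estate agent and produce test inputs.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Role-play as a busy real estate agent. Write {n} realistic "
                       f"requests you would send an AI assistant about {feature} "
                       f"({scenario} case). Return one request per line.",
        }],
    )
    return [q for q in resp.choices[0].message.content.splitlines() if q.strip()]

# One batch of inputs per feature/scenario pair gives broad test coverage.
test_inputs = [q for f in features for s in scenarios for q in generate_inputs(f, s)]
```
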
00:09:43.000 | So we talked a little bit about unit tests, logging traces, and having a human review. And so when you have a very minimal setup like this,
00:09:50.840 | this is like a bare-bones, very minimal evaluation system.
00:09:57.560 | And what you want to do when you first kind of construct that is you want to test out the
00:10:01.800 | evaluation system. So you want to do something to make progress on your AI. And the easiest way to
00:10:07.560 | try to make progress on your AI is to do prompt engineering. So what you should do is go through
00:10:12.520 | this loop as many times as possible, you know, try to improve your AI with prompt engineering and see
00:10:19.400 | if your test coverage is good. Are you logging your traces correctly?
00:10:25.560 | Did you remove as much friction as possible from looking at your data?
00:10:28.840 | And then this will help you debug that, but also give you the satisfaction of like making progress
00:10:34.120 | on your AI as well. One thing I want to point out is the upshot of having an evaluation system is you
00:10:42.920 | get other superpowers for almost free. So all of the work in fine-tuning, or most of the work, is data
00:10:49.480 | curation. So we already talked about like synthetic data generation and how that interacts with the eval
00:10:57.080 | framework. And what you can do is you can use your eval framework to kind of filter out good cases and
00:11:04.280 | feed that into your human review like we showed with that application. And you can start to curate data for
00:11:11.480 | fine tuning. And also for the failed cases, you have this workflow that you can use to work through those
00:11:17.720 | and continuously update your fine tuning data. And what we've seen over time is that the more
00:11:24.520 | comprehensive your eval framework is, the cost of human review goes down because you're automating more
00:11:31.160 | and more of these things and getting more confidence in your data.
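
A simple sketch of that filtering step, assuming string-level assertions like the earlier placeholder checks:

```python
def curate(traces, assertions):
    # Traces that pass every assertion are candidates for the fine-tuning set;
    # failures get routed to human review (e.g. the labeling app above),
    # fixed, and recycled back into the training data.
    finetune_data, needs_review = [], []
    for t in traces:
        if all(a(t["output"]) for a in assertions):
            finetune_data.append({"messages": [
                {"role": "user", "content": t["input"]},
                {"role": "assistant", "content": t["output"]},
            ]})
        else:
            needs_review.append(t)
    return finetune_data, needs_review
```
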
00:11:39.000 | So once you have kind of this setup, now you're in a position to know whether or not you're making
00:11:46.680 | progress. You have a workflow that you can use to quickly make improvements. And you can start
00:11:53.560 | getting rid of those dumb failure modes. But also now you're set up to move into more advanced things
00:11:58.360 | like LLM as a judge. Because you can't express everything as an assertion or a unit test.
00:12:04.360 | Now LLM as a judge is a deep topic just outside the scope of this talk.
00:12:10.520 | But one thing I want to point out is it's very, very important to align the LLM judge to a human.
00:12:18.760 | Because you need to know whether you can trust the LLM as a judge. You need a way, a principled way of
00:12:23.240 | reasoning about how reliable the LLM as a judge is. So what I like to do is, again, keep it simple and
00:12:32.440 | stupid. I like to use a spreadsheet often. Don't make it complicated. But what I do is have a domain
00:12:38.600 | expert label and critique data, and keep iterating on that until
00:12:48.680 | my LLM as a judge is in alignment with my human judge. And I have high confidence that the LLM judge
00:12:55.160 | is doing what it's supposed to do.
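
For instance, a minimal agreement check over labels exported from that spreadsheet (the column names are made up here):

```python
import csv

def judge_agreement(path: str) -> float:
    # Fraction of examples where the LLM judge's verdict matches the human expert's.
    with open(path) as f:
        rows = list(csv.DictReader(f))
    matches = sum(r["llm_verdict"] == r["human_verdict"] for r in rows)
    return matches / len(rows)

# Keep iterating on the judge prompt until this number is high enough to trust.
print(f"Judge/human agreement: {judge_agreement('judge_labels.csv'):.0%}")
```
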
00:13:05.080 | So I'm going to go through some common mistakes that people make when building LLM evaluation systems. One is not looking at your data. It's easier said than done,
00:13:13.960 | but people don't do the best job of this. And one key to unlocking this is to remove all the
00:13:21.640 | friction, as I mentioned before. The second one, and this is just as important, is focusing on tools,
00:13:29.240 | not processes. So if you're having a conversation about evals and the first thing you start thinking
00:13:36.600 | about is tools, that's a smell that you're not going to be successful in your evaluations. People
00:13:43.400 | like to jump straight to the tools. Tell me about the tools. What tools should I use? It's really
00:13:48.840 | important to try not to use tools to begin with and try to do some of these things manually with what you
00:13:53.800 | already have. Because if you don't do that, you won't be able to evaluate the tools and you have
00:13:59.400 | to know what the process is before you jump straight into the tools. Otherwise, you're going to be
00:14:04.840 | blindsided. Another common mistake is people using generic evals off the shelf. You don't want to reach for
00:14:15.240 | generic evals. You want to write evals that are very specific to your domain. Things like conciseness score,
00:14:21.320 | toxicity score, you know, all these different evals you can get off the shelf with tools. You don't
00:14:26.840 | want to go directly to those. That's also a smell that you are not doing things correctly.
00:14:31.960 | It's not that they're not valuable at all. It's just that you shouldn't rely on them because they can
00:14:38.040 | become a crutch. And then finally, the other common mistake is with LLM as a judge and using that too early.
00:14:48.040 | I often find that if I'm looking at the data closely enough, I can always find plenty of assertions and
00:14:55.240 | failure modes. It's not always the case, but it's often the case. So don't go to LLM as a judge too early,
00:15:01.720 | and also make sure you align LLM as a judge with a human.
00:15:07.320 | So I'm going to flip it back over to Emil. He's going to talk about the results of implementing this system.
00:15:12.280 | All right. So after we got to the virtuous cycle that Hamel just showed, we managed to rapidly increase
00:15:23.960 | the success rate of the LLM application. Without the eval framework, a project like this seemed
00:15:31.160 | completely impossible for us. One thing that I've started to hear a lot is that few-shot prompting is
00:15:39.640 | going to replace fine-tuning or some notions like that. In our case, we never managed to get everything
00:15:46.120 | that we wanted by few-shot prompting, even using the newer and smarter models. I wish we could. I've seen a lot
00:15:54.360 | of judgment of companies and products being just ChatGPT wrappers. I wish we could just be a ChatGPT
00:16:00.680 | wrapper and manage to extract the experience we want for our users, but we never had that opportunity
00:16:06.440 | because we had some really difficult cases. One of the things that we wanted our agent to be able to do
00:16:11.800 | was to mix natural language with user interface elements like this inside the output, and this
00:16:19.000 | essentially required us to mix structured output and unstructured output together. We never managed
00:16:25.800 | to get this working reliably without fine-tuning.
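
One plausible shape for that kind of mixed output (this schema is illustrative; the talk doesn't show ReChat's actual format) is a list of blocks the frontend can render:

```python
# Natural-language text interleaved with structured UI elements in one response.
response = [
    {"type": "text", "content": "I found 3 listings matching your criteria:"},
    {"type": "listing_card", "listing_id": "mls-123"},  # rendered as a UI component
    {"type": "listing_card", "listing_id": "mls-456"},
    {"type": "text", "content": "Want me to create a website for one of these?"},
]
```
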
00:16:33.240 | Another thing was feedback. So sometimes the user asks, in a case like this, to do this for me, but the agent can't just do that. It needs some sort of feedback,
00:16:39.400 | more input from the user. Again, something like this was very difficult for us to execute on,
00:16:44.760 | especially given the previous challenge of injecting user interfaces inside the conversation.
00:16:51.720 | And the third reason that we had to fine-tune was complex commands like this. I'm going to show a
00:16:59.480 | tiny video that shows how this command was executed, but basically in this example, the user is asking for
00:17:07.880 | a very complex command that requires using like five or six different tools to be done.
00:17:13.000 | Basically, what we wanted was for it to take that input, break it down into many different function
00:17:20.920 | calls, and execute it. So in this case, I'm asking it to find me some listings with some criteria,
00:17:26.760 | and then create a website. That's what real estate agents sometimes do for their listings that they're
00:17:32.120 | responsible for, and also an Instagram post, so they want to market it. They want this done only for the
00:17:38.040 | most expensive listing of these three. So the application has found three listings, created a
00:17:44.600 | website for that, created and rendered an Instagram post video for it, and then has prepared an email to
00:17:53.160 | Hamel, including all the information about the listings, and also including the website that was
00:17:59.400 | created and the Instagram story that was created, also invited Hamel to a dinner and created a
00:18:07.320 | follow-up task. Creating something like this for a non-savvy real estate agent may take a couple of
00:18:13.560 | hours to do, but using the agent, they can do it in a minute, and that essentially was not going to be
00:18:19.560 | possible without us using a comprehensive eval framework. Nailed the timing. Thank you guys.
00:18:35.960 | Thank you.