Optimizing LLMs in Insurance with DSPy: Jeronim Morina

00:00:00.000 |
Welcome to Optimizing LLMs with DSPy: Beyond Manual Tuning. And I hate to break it to you, but we're all bad AI engineers. 00:00:32.000 |
Not because we don't care enough about solving real-world problems. 00:00:36.000 |
It's that we all keep tinkering around with tools and HotpotQA datasets. But bear with me, there's a way out. 00:00:45.000 |
Let's start with first-principles thinking again and remind ourselves what we are. 00:00:53.000 |
We have to reconsider the thing we are doing these days, which is prompt engineering, and start programming AI systems again. 00:01:01.000 |
Maybe you've seen this already: on the left, we have this huge neural network model. But neural networks by themselves, whether language models or any other AI model, are not useful on their own; they are only useful if you consider them as part of a larger system. 00:01:19.000 |
So the struggle is real when it comes to use cases in the LLM space. 00:01:24.000 |
We are all thrilled by the excitement around capabilities of ever newer models, but we have only a vague idea of what we really want to achieve. 00:01:32.000 |
We throw a bunch of tools at not-well-defined problems. 00:01:35.000 |
We write some handwritten prompts and use some prompt libraries in the hope that this will lead us to a magical solution to our problems. 00:01:42.000 |
I know myself how much I love to tinker around with all these tools and see how they work, but here I'm telling you: stop hoping that calling an API just works and solves all your problems. 00:01:55.000 |
So let's imagine you're an AI engineer and you start your day by brewing some specialty coffee, sitting down, delving into focus mode. 00:02:04.000 |
And you fiddle around with a bunch of prompts and try the next tool to steer the system output toward JSON. 00:02:11.000 |
So you evaluate the system with your most loved metric. 00:02:14.000 |
The evaluation is basically: looks good to me @ 10. 00:02:17.000 |
So you just look at a handful of the outputs and it's like, yeah, that's fine enough. 00:02:26.000 |
And is that really it, that we fiddle around with our prompts until they seem stable enough? 00:02:32.000 |
So the question is, how do we become like great AI engineers again? 00:02:35.000 |
And I think we should stop working on toy problems, all these HotpotQA datasets and everything, and start working on real problems again. 00:02:50.000 |
So a company dude like me is going to explain to you how to solve real problems. 00:02:58.000 |
We have to solve real problems every day because we can. 00:03:02.000 |
And it's really about the data and helping our customers every day. 00:03:06.000 |
We have been data-driven companies since forever. 00:03:13.000 |
You give us money, and we give you money back in case something happens to you. 00:03:20.000 |
We have always been a data-focused company, providing only intangible assets. 00:03:28.000 |
So one of the huge problems we are facing today is climate change. 00:03:36.000 |
And with that come ever-increasing incoming claims, for example, and more problems our customers are facing. 00:03:45.000 |
So there's a huge rise in the amount of work we have to do. 00:03:51.000 |
And so that's the reason why AXA Germany started the Data Innovation Lab in 2017. 00:03:57.000 |
And we are like a group of machine learning engineers working with the data scientists of the business units. 00:04:02.000 |
And we are trying to work collaboratively with the business units to really make an impact on the customer service agents to help them make the customers happy every day. 00:04:15.000 |
I know you think of yourselves here in Silicon Valley as these amazing speedboats. 00:04:29.000 |
We are more like tankers, or often icebreakers. 00:04:32.000 |
It's slow, but we are making progress and pushing things forward. 00:04:39.000 |
And you can see this in the announcement we just made on Monday. 00:04:44.000 |
We were at the forefront already: at the beginning of 2023, we started collaborating with OpenAI and created this Secure GPT GenAI platform internally. 00:04:57.000 |
And just now, on Monday, we announced that we're also working together with Mistral AI. 00:05:02.000 |
So our internal GenAI platform gets more usable every day, 00:05:06.000 |
and especially without compromising our data security or anything our customers don't want us to do. 00:05:16.000 |
So that's why the tanker still has some impact, and it's fun. 00:05:21.000 |
But what does it take for us to become good engineers? 00:05:33.000 |
So let me take you down the path of an imaginary real-world insurance problem. 00:05:38.000 |
We are creating a customer-facing chatbot to help customers navigate our insurance terms and conditions. 00:05:46.000 |
I mean, yeah, we kind of have like everything. 00:05:49.000 |
But the main issue is that you really need to define your problem very clearly. 00:05:57.000 |
So we went to the domain experts and had them guide us through the problem space, especially by crafting examples. 00:06:03.000 |
This is especially important for prompt engineering: examples that are either super simple or super hard to solve, 00:06:09.000 |
so we can understand the domain space better. 00:06:13.000 |
And that guided us to learn a lot about what is expected of our production system in the end. 00:06:21.000 |
So we came up with a dozen examples and immediately started prompting our internal Secure GPT. 00:06:26.000 |
And we tried to come up with some useful prompts. 00:06:30.000 |
But prompt engineering is pretty much black magic. 00:06:34.000 |
So we didn't use phrases like, "If you don't output JSON, I'm going to quit." 00:06:41.000 |
And there are these prompts where you fiddle around. 00:06:46.000 |
You try to make your output better, cleaner, more structured. 00:06:50.000 |
And one way to do this is what Jason Liu proposed last year: hijack the prompt, say "OK, I'm the assistant here," 00:07:03.000 |
and start the answer with just triple backticks and json, 00:07:08.000 |
so that the model is pushed forward into just creating JSON. 00:07:15.000 |
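Concretely, the trick looks roughly like the sketch below. It assumes an OpenAI-compatible chat API; whether a trailing assistant message is actually treated as a prefill depends on the provider, and the model name, system prompt, and claim fields are made up for illustration.

```python
# Minimal sketch of the "hijack the assistant" trick: pre-seed the start of the
# assistant's reply with ```json so the model tends to continue with raw JSON.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": "Extract the claim type and damage amount as JSON."},
    {"role": "user", "content": "My basement flooded last week, roughly 4,000 EUR in damage."},
    # The hijacked start of the answer: the model is nudged to continue the JSON block.
    {"role": "assistant", "content": "```json\n{"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)  # ideally the rest of the JSON object
```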
But there are also a dozen other prompting techniques like chain of thought and zero-shot or few-shot learning, and tools like LangChain, Guardrails, DSPy, and Instructor, 00:07:27.000 |
which help you make the output safer, more deterministic, more structured, more resilient, and optimized for maybe even arbitrary metrics. 00:07:42.000 |
And so we fiddled around and played with all these tools. 00:07:46.000 |
And to be honest, that was a lot of work; finding a good prompt is hard and error-prone, 00:07:51.000 |
as these prompt templates are often hard-coded and their usage is quite opaque. 00:07:56.000 |
So please, don't fall into the "these tools will solve the problem for me" trap. 00:08:02.000 |
So get your hands dirty, inspect what these tools are really doing for you. 00:08:06.000 |
And I highly recommend this amazing blog post from Hamel Husain, which is called "Please Show Me the Prompt." 00:08:20.000 |
So the idea is just to use mitmproxy and to really inspect what all these tools 00:08:29.000 |
I've been mentioning before are really sending. 00:08:36.000 |
I'm not advocating here for using this or that particular tool. 00:08:47.000 |
Look at what these tools create, what kind of prompt templates they use, what they send to your LLMs. 00:08:56.000 |
Don't just believe that it will do something great. 00:09:01.000 |
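To make that concrete, here is a minimal mitmproxy addon sketch that dumps the request body of calls going to the model API, in the spirit of that blog post. The host filter and the assumption that the body carries a "messages" field are OpenAI-style conventions; adapt them to whatever your tools actually call.

```python
# Run with: mitmdump -s show_prompts.py
# and point your tool's HTTP(S) proxy at mitmproxy to see the prompts it really sends.
import json

from mitmproxy import http


def request(flow: http.HTTPFlow) -> None:
    # Only look at traffic going to the model provider (assumed host).
    if "api.openai.com" not in flow.request.pretty_host:
        return
    try:
        body = json.loads(flow.request.get_text() or "")
    except ValueError:
        return
    # Print the messages/prompt the library actually constructed for you.
    print(json.dumps(body.get("messages", body), indent=2, ensure_ascii=False))
```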
So one issue we had been experiencing a lot was sensitivity to minor prompt changes. 00:09:07.000 |
Very small prompt changes led to very big problems in the end. 00:09:12.000 |
So what we tried to do was use chaining libraries to fix these issues and throw more and more error-handling code at it. 00:09:22.000 |
So we tried to make this a bit more manageable. 00:09:28.000 |
But in the end, this led to an overly complex and fragile system. 00:09:33.000 |
So we had the gut feeling that this was by no means something we wanted to put in production. 00:09:37.000 |
But we could show that the task was at least achievable or could serve as a baseline. 00:09:44.000 |
So before continuing further, we knew we were missing some key pieces to fully understand what's happening. 00:09:50.000 |
And apart from inspecting our calls, we were not logging any traces or anything. 00:09:56.000 |
For that, we started using an open-source library called Arize Phoenix. 00:10:01.000 |
And that definitely helped a lot in understanding, like, which calls are bundled together, 00:10:08.000 |
which helped to see what is happening behind the curtains. 00:10:15.000 |
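For reference, the tracing setup looks roughly like the sketch below. The exact packages and registration calls differ between Phoenix versions, so treat the import paths here as assumptions and check the Phoenix documentation for your version.

```python
# Minimal sketch: launch a local Phoenix UI and instrument OpenAI client calls so
# every LLM request your code (or a chaining library) makes shows up as a trace.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()                # local Phoenix UI for browsing traces
tracer_provider = register()   # OpenTelemetry provider pointing at Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, LLM calls are captured, so you can see which calls are bundled
# together and what actually happens behind the curtains.
```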
So here, once again, I'm advocating for first-principles thinking. 00:10:25.000 |
You cannot improve what you don't measure. 00:10:29.000 |
And at that point in time, we were still at the level of looks good to me @ 10, right? 00:10:35.000 |
So we definitely needed to write some basic evaluations. 00:10:42.000 |
So if you want to evaluate something, you not only need your input data, but also some labels. 00:10:49.000 |
It took us several months to really scrape the production data we were already getting, look at it, and prepare it. 00:11:00.000 |
And one main thing: please don't stumble over this. 00:11:05.000 |
It doesn't matter if you have the coolest APIs. 00:11:08.000 |
If your data is compromised by leakage and you can already deduce the output label from your input data, that's really bad. 00:11:22.000 |
So if you're creating your evaluation datasets, really, really pay attention not to destroy your evaluation. 00:11:38.000 |
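A trivial example of the kind of check this implies, with a made-up schema: flag evaluation examples where the gold label already appears verbatim in the input, because those would be scored correctly without the system doing any real work.

```python
# Hypothetical leakage check over an eval set of {"input": ..., "label": ...} dicts.
def find_leaky_examples(examples):
    leaky = []
    for ex in examples:
        if ex["label"].strip().lower() in ex["input"].lower():
            leaky.append(ex)
    return leaky


eval_set = [
    {"input": "Claim text ... category: water damage", "label": "water damage"},
    {"input": "Claim text about a rear-end collision", "label": "car accident"},
]
print(find_leaky_examples(eval_set))  # the first example leaks its label into the input
```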
So there's another blog post I would definitely recommend from Hamel Husain, which is called "Evals are all you need," about evals for your LLMs. 00:12:00.000 |
And I encourage you: please don't do anything with DSPy or any other tools when you don't have this base setup, 00:12:07.000 |
when you don't understand your data, when you don't have enough examples, 00:12:14.000 |
and you're just saying: prompt here, DSPy there, profit? 00:12:21.000 |
And to be fair, I just went into DSPy because I met Omar last year at Skeletor Bay in November. 00:12:43.000 |
I love how it enabled us to-- now we have this overly complex system with these huge prompts and huge error handling code and everything. 00:12:53.000 |
And we all thought, OK, we're going to throw DSPy at it and it's going to be super easy. 00:12:58.000 |
But to be honest, the learning curve is quite steep and it's maybe a bit too much overall. 00:13:05.000 |
So this is not a 100% recommendation to go all in on DSPy. 00:13:13.000 |
OK, so without DSPy, we had to break down our problem into steps. 00:13:19.000 |
We had to prompt well; each step had to work well in isolation. 00:13:25.000 |
We had to generate examples to tune each step, and use these examples to maybe even fine-tune some smaller models. 00:13:33.000 |
And what we could now do with DSPy was take this basic idea. 00:13:39.000 |
I encourage you: please tear your problem apart into separate modules. 00:13:50.000 |
Otherwise, you won't be able to take advantage of DSPy. 00:13:54.000 |
And DSPy is not going to be able to optimize your program either, because it doesn't get any hints, any signal on how to improve the program you're providing it. 00:14:10.000 |
So don't come up with one huge ReAct module; we did this, and it's bad. 00:14:16.000 |
Rather, try to understand what the flow of your program is. 00:14:24.000 |
Do you want to have multi-hop question answering? 00:14:29.000 |
Do you want to bring together different pieces and answer them in one question in the end? 00:14:41.000 |
You have to separate your program into modules, which are kind of like the prompts and weights. 00:14:46.000 |
You have to add optimizers to tune your prompts and weights, given a metric. 00:14:53.000 |
For the chatbot, we needed to find the right answer. 00:14:58.000 |
And DSPy compiles the same program into different instructions, few-shot prompts, and fine-tuning. 00:15:05.000 |
So instead of prompting one language model with a huge prompt, as I told you, please break it apart. 00:15:11.000 |
So there's this analogy to neural networks, maybe to help you better understand how DSPy works. 00:15:18.000 |
We have the separation into an init and a forward step. In your init you define things like convolution or dropout layers in PyTorch; 00:15:34.000 |
here you use things like Chain of Thought or ReAct instead. 00:15:37.000 |
And in your forward pass you describe how you do the things you want to do, like retrieving and the other steps. 00:15:52.000 |
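As a minimal sketch of that analogy in DSPy: the module below is illustrative, not our production program, and the signature string and k value are assumptions; an LM and a retriever still have to be configured via dspy.settings.

```python
import dspy


class TermsAndConditionsQA(dspy.Module):
    def __init__(self):
        super().__init__()
        # "Layers" are declared in __init__, analogous to conv/dropout layers in PyTorch.
        self.retrieve = dspy.Retrieve(k=3)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # The forward pass describes the flow: retrieve passages, then answer with CoT.
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)
```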
So if you want to do this for, say, German texts: answer exact match and answer passage match only work for English text, 00:16:00.000 |
so the built-in metrics really don't work out of the box. 00:16:07.000 |
If you want to use something like ReAct, there is nothing like a typed ReAct. 00:16:13.000 |
DSPy comes with typed modules like TypedPredictor or TypedChainOfThought, 00:16:23.000 |
but if you want something more specific for your case, you need to come up with your own module and improve that. 00:16:31.000 |
And evaluation is also quite focused on the English language, to be honest. 00:16:35.000 |
So in case you want to use DSPy for optimization in German, you also need to create your own evaluators, especially for the exact match and passage match. 00:16:44.000 |
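A sketch of what such a home-grown metric can look like and how it feeds into an optimizer: the normalization rules and the training examples below are assumptions for illustration, not our actual evaluator.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot


def normalize_de(text: str) -> str:
    # Crude German-friendly normalization (illustrative only).
    text = text.lower().strip()
    for a, b in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        text = text.replace(a, b)
    return text


def german_answer_match(example, prediction, trace=None):
    # DSPy metrics receive the gold example and the prediction (plus an optional trace).
    return normalize_de(example.answer) == normalize_de(prediction.answer)


# A toy training set; real examples come from the domain experts and production data.
trainset = [
    dspy.Example(question="Ist ein Wasserschaden im Keller versichert?",
                 answer="Ja, laut Klausel 4.2.").with_inputs("question"),
]

# The metric then drives the optimizer, here compiling the module sketched above.
optimizer = BootstrapFewShot(metric=german_answer_match)
compiled_qa = optimizer.compile(TermsAndConditionsQA(), trainset=trainset)
```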
OK, so to sum up: we tried to come up with one big model, and we found out this is really not working. 00:17:09.000 |
You should definitely write prompts by hand and start with that. 00:17:12.000 |
Otherwise, you cannot say if your task is achievable at all. 00:17:18.000 |
Avoid metrics like looks good to me @ 10, without any annotated data or a clear goal metric. 00:17:24.000 |
Otherwise there's no clear objective to validate whether the results are even getting better. 00:17:28.000 |
You definitely should write a basic evaluation. 00:17:31.000 |
Start with some small evaluations like regex or string comparisons. 00:17:35.000 |
And only then use techniques like LLM-as-a-judge. 00:17:41.000 |
That's already super complicated on its own, so make only very, very small adjustments. 00:17:55.000 |
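In that spirit, a small sketch of what starting with regex and string comparisons can mean before reaching for an LLM judge; the checks and names are illustrative only.

```python
import json


def looks_like_valid_json(output: str) -> bool:
    # Cheapest possible structural check: does the output parse as JSON at all?
    try:
        json.loads(output)
        return True
    except ValueError:
        return False


def mentions_expected_clause(output: str, expected_clause: str) -> bool:
    # Simple string comparison against the gold label.
    return expected_clause.lower() in output.lower()


example_output = '{"answer": "Water damage is covered under clause 4.2."}'
print(looks_like_valid_json(example_output))                   # True
print(mentions_expected_clause(example_output, "clause 4.2"))  # True
# Only the cases these cheap checks cannot decide would go on to an LLM judge.
```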
So, do I really think that you've learned how to use DSPy from a 20-minute corporate-guy talk? 00:18:06.000 |
But honestly, if you want to learn DSPy, go to all these learning resources from Connor Shorten and from the DSPy website itself. 00:18:15.000 |
There are all these people presenting it in a much better fashion than I did. 00:18:22.000 |
But what I wanted to do was spark some interest in you. 00:18:27.000 |
The same spark I got when I saw this conference last year, where I got to know all the people, all the folks who created this tooling. 00:18:37.000 |
And I want to spark that same curiosity in you, so that you can become better AI engineers by using not only the tools but also first-principles thinking. 00:18:51.000 |
And I think we are meant to become like great AI engineers. 00:18:55.000 |
So, this conference is about you getting inspired. 00:18:57.000 |
Get out and find some good problems where you can try this out, problems that are worth solving with AI.