Prompt Engineering is Dead — Nir Gazit, Traceloop

All right, so prompt engineering is dead. That's a bold statement, but what I'm really saying is that it never actually existed. If you've ever done prompt engineering, you probably know it's kind of bullshit: you basically ask the LLM to act nicely and do what you want it to do. In this talk I'm going to walk you through a story of mine, where I managed to improve the sample chatbot we have on our website, making it something like five times better, without actually doing any prompt engineering, because, you know, it's not really engineering, right?

So we have a chatbot in our application, on our website. It's RAG-based and super simple: it just lets you ask questions about our documentation and gives you answers. Super simple, super straightforward, the easiest RAG you can think of.
When I deployed it the first time, it worked kind of okay, and I tried to make it work better. You can see some examples here. I needed it to only answer things related to Traceloop, because that's my company; I don't want it answering questions about the weather or anything else, only about Traceloop. I wanted it to be useful, so that when someone asks a question, the answer is actually useful to the user who asked it. And I wanted it to make fewer mistakes; it was making so many mistakes. I just wanted it to get a little bit better.
So what are we doing at this stage? We're doing prompt engineering. But why do I even need to iterate on prompts? I just want to give it a couple of examples, this is good, this is bad, and have it somehow learn how to follow my instructions properly. So I began imagining this automatically improving machine: kind of like an agent that researches the web, finds the latest and greatest prompt engineering techniques, and applies them to my prompt over and over again until I get the best prompt for my super simple RAG pipeline.
To run this kind of machine, I needed to do a bit more work. You've all been in a lot of conversations here about evaluators, so you know we need an evaluator so the crazy machine can actually know how to improve, and whether it's actually improving. For that I need to create a dataset of questions I want to ask my chatbot about the documentation, and then build an evaluator that scores how well my RAG pipeline answers those questions, and then there's the agent that keeps iterating on the prompt until I get to the best prompt ever. So this is how it looks: I have the RAG pipeline, I have the evaluator, and then I have my auto-improving agent.
Let's begin with the RAG pipeline. Super simple: a Chroma database, OpenAI, and some simple prompts that take the question, find relevant documents in the Chroma database, and just output an answer. As you'll see, it's about as simple as it gets. I ask it a question, "How do I get started with Traceloop?", it runs for a couple of seconds, and then I see an answer, and I see the trace. If you've never seen a RAG pipeline, and I'm guessing most of you have, this is how it looks: a couple of calls to OpenAI, and then in the final stage we feed all the context into OpenAI and get the final answer for the user. Great. And we have a couple of prompts here that we'll probably want to optimize.
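To make that concrete, here's a minimal sketch of a pipeline like this. The collection name, model, and prompt wording are my assumptions for illustration, not the actual demo code:

```python
# Minimal RAG pipeline sketch: Chroma for retrieval, OpenAI for generation.
# Assumes the documentation has already been indexed into the collection.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()
docs = chroma.get_or_create_collection("traceloop-docs")  # hypothetical name
llm = OpenAI()

SYSTEM_PROMPT = "Answer only questions about Traceloop, using the provided context."

def answer(question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    # Fetch the most relevant documentation chunks for the question.
    results = docs.query(query_texts=[question], n_results=3)
    context = "\n\n".join(results["documents"][0])
    # One call to OpenAI with the retrieved context stuffed in. This system
    # prompt is the thing the rest of the talk tries to optimize.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("How do I get started with Traceloop?"))
```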
Great. Okay, now let's go to the next step: my evaluator. So what do we need for evaluations? This is not a talk about evaluators, and I'm sure you've heard plenty of talks about them at this conference, but I'll tell you what kind of evaluator I chose. First, you need a dataset of questions and a way to evaluate them. The evaluator invokes the RAG pipeline, gets an answer back, evaluates it, and produces a score, and maybe a reason why the score is low or high. That's what the agent, the last step, will use to auto-improve the prompt.
There are a lot of types of evaluators. We've been talking a lot about LLM-as-a-judge, and, spoiler alert, I'm going to choose an LLM-as-a-judge here, because it's easier to build and easier to deploy. But there are also classic NLP metrics, embedding-based metrics, and all of those scores that can help you evaluate different types of tasks, like translation and even text summarization. The main difference I see between classic NLP metrics and LLM-as-a-judge is that classic NLP metrics usually require some ground truth: I need my questions and the actual answers, so that when I invoke my RAG pipeline I can compare the ground-truth answer against whatever the pipeline returned. An LLM-as-a-judge can work with ground truth too; you can build a judge that scores an answer against a reference answer you expect. But it can also assess an answer based on just the question and the context, without any ground truth.
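For a feel of the ground-truth flavor, a classic metric like ROUGE just measures overlap between the generated answer and a reference answer written ahead of time. A tiny sketch with the rouge-score package, reusing answer() from the pipeline sketch above (the reference text here is made up):

```python
# Classic ground-truth NLP metric: ROUGE-L overlap between the RAG answer
# and a hand-written reference answer.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Install the Traceloop SDK and initialize it when your app starts."
generated = answer("How do I get started with Traceloop?")

scores = scorer.score(reference, generated)
print(scores["rougeL"].fmeasure)  # 0.0 to 1.0; higher means more overlap
```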
For my example, because I know the dataset and I control everything, I'm going to build a ground-truth-based LLM-as-a-judge. But before we do that, let's talk about where we can run these evaluators. If we have an evaluator, what can we evaluate? I keep saying I want to evaluate my RAG pipeline, but what exactly do I mean?
The RAG pipeline is basically two steps: we get data from the vector database, and we call OpenAI with some of that data as context. So we can evaluate each step separately, kind of like unit testing; we can even evaluate how well the vector database fetches the data I need to answer the question. Or we can run it on the complete execution of the RAG pipeline: take the input question and the final answer, and evaluate how good the answer is given that question. And we can also dive deeper into the internals: look at the context, the question, and the answer all together, and evaluate how well the answer performs given both the context and the question.
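The unit-testing flavor on the retrieval step can be as simple as checking that the chunk you expect shows up in the top results. A sketch reusing the docs collection from above (the expected snippet is a made-up example):

```python
# Unit-test-style check on retrieval alone: did the vector database return
# the documentation chunk we expect for this question?
def retrieval_hit(question: str, expected_snippet: str, k: int = 3) -> bool:
    results = docs.query(query_texts=[question], n_results=k)
    return any(expected_snippet in chunk for chunk in results["documents"][0])

assert retrieval_hit("How do I get started with Traceloop?", "getting-started")
```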
So again, I'm going to do a simple LLM-as-a-judge. I'm taking 20 example questions that I created, and for each question I wrote down what I expect the answer to contain. We have roughly three facts that the answer generated by the RAG pipeline should contain, and the evaluator simply takes the answer we got from the pipeline and makes sure each fact actually appears in it. So for each fact we get pass or fail, a Boolean response, plus a reason: if it failed, I want to see why the judge thinks the fact does not appear in the answer we got. Then we get a numeric score, which is great because we like working with numeric scores, summarizing all the facts across all the examples.
That's 60 total facts to evaluate, and I just check how many facts were right out of the total facts we expected to appear in the RAG-generated answers. As you can see, these are the questions and the facts and everything I created to make sure the RAG pipeline behaves, and then I run the evaluator.
What you'll see the evaluator doing is running evaluations: taking questions, calling the RAG pipeline, getting an answer, and checking that answer against each of the facts I gave it. Then it produces a score, and you'll see the score slowly progressing as the evaluator runs. It's a pretty slow process and I don't have enough time to show it through to the end, but it works.
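Here's a minimal sketch of that fact-checking judge, reusing answer() and llm from the pipeline sketch above. The dataset contents, prompt wording, and JSON shape are my assumptions; the structure (per-fact pass/fail with a reason, one aggregate score) follows the talk:

```python
# Fact-checking LLM-as-a-judge: for each expected fact, ask a judge model
# whether the fact appears in the generated answer; aggregate pass/fail
# across all ~60 facts into one 0-1 score, keeping reasons for failures.
import json

dataset = [
    {
        "question": "How do I get started with Traceloop?",
        "facts": ["install the SDK", "initialize it at startup", "traces appear in the dashboard"],
    },
    # ... 19 more examples, ~60 facts in total
]

def judge_fact(question: str, answer_text: str, fact: str) -> dict:
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Does the answer contain this fact? Reply as JSON: "
                '{"passed": true|false, "reason": "..."}\n'
                f"Question: {question}\nFact: {fact}\nAnswer: {answer_text}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)

def evaluate(system_prompt: str = SYSTEM_PROMPT) -> tuple[float, list[str]]:
    passed, total, failure_reasons = 0, 0, []
    for example in dataset:
        answer_text = answer(example["question"], system_prompt)
        for fact in example["facts"]:
            verdict = judge_fact(example["question"], answer_text, fact)
            total += 1
            if verdict["passed"]:
                passed += 1
            else:
                failure_reasons.append(verdict["reason"])
    return passed / total, failure_reasons  # e.g. 54/60 -> 0.9
```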
Now for the last piece. I'm going to build a researcher agent, and, as I said, it's going to use the prompting guides that are out there online. It runs the evaluator once to get the initial score, and along with that initial score it gets the reasons why some of the evaluations failed. It combines those failure reasons with the prompting guides to produce a new prompt. Then we feed the new prompt back to the evaluator, run it again, get a new score, and run the agent again, so we can improve the score a bit more.
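The whole loop is just this (a sketch; rewrite_prompt() stands in for the researcher agent, sketched with CrewAI further down):

```python
# The auto-improvement loop: evaluate, collect failure reasons, have the
# agent rewrite the prompt, and repeat until the score is good enough.
prompt = SYSTEM_PROMPT
score, reasons = evaluate(prompt)
while score < 0.9:
    prompt = rewrite_prompt(prompt, reasons)  # failure reasons + prompting guides
    score, reasons = evaluate(prompt)
print(f"final score: {score:.2f}\n{prompt}")
```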
If you've ever done classic ML training, this is kind of like classic machine learning training, but with a bit of vibes. I used CrewAI for this. It takes a couple of seconds, and we can actually see the agent thinking and then calling the evaluator. The evaluator runs, gets the responses, and computes the score. And, yeah, there it is, running the evaluators, and then control goes back to the agent.
Now the researcher agent is trying to understand why the prompt wasn't working: can it find ways to improve prompts for RAG pipelines? Then it regenerates a new prompt, maybe with a few more instructions, or with best practices on how to write prompts. And then, you see, it calls the evaluator again to get a new score, so we know whether we need to keep optimizing.
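A rough idea of what that researcher agent can look like in CrewAI; the role, goal, and task text are illustrative, not the exact demo code, and this plays the part of rewrite_prompt() in the loop above:

```python
# Researcher agent sketched in CrewAI: given the current prompt and the
# evaluator's failure reasons, produce an improved system prompt.
from crewai import Agent, Crew, Task

researcher = Agent(
    role="Prompt researcher",
    goal="Rewrite RAG system prompts using prompt engineering best practices",
    backstory="You study prompting guides online and fix prompts based on evaluation failures.",
)

def rewrite_prompt(current_prompt: str, failure_reasons: list[str]) -> str:
    task = Task(
        description=(
            "Research prompt engineering best practices and rewrite this system "
            f"prompt to fix the evaluation failures.\nPrompt: {current_prompt}\n"
            f"Failures: {failure_reasons}"
        ),
        expected_output="The improved system prompt, and nothing else",
        agent=researcher,
    )
    return str(Crew(agents=[researcher], tasks=[task]).kickoff())
```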
And I got a really long prompt, the kind you'd expect to see from someone who's done a lot of prompt engineering. It gives the model a lot of instructions, tells it how it should react and how it shouldn't: "You are an expert in answering users' questions about Traceloop," and so on. It was really nice to see that happen without me doing it by hand. I wrote a lot of code to make the agent work, but I didn't need to do a lot of prompt engineering. And I stopped at 0.9, because that means 90% of the facts were correct.
So if there's one thing to take away from this talk, it's that you can vibe-engineer your prompts. You don't need to manually iterate on prompts. You can, in effect, run gradient ascent on your evaluators: slowly optimize your score, either automatically with an agent like I did, or manually, by reading those guides on how to write the best prompts and fixing your prompt again and again and again.
Now, about overfitting: I have 20 examples and I run the evaluators on them. So maybe the prompt I end up with works really well for those 20 examples, but give it another example and it's horrible. So, yes, I was overfitting here, because I gave the optimizer the entire 20 examples. The fix is, again, classic machine learning: split the data into train/test sets, or train/test/eval sets, optimize only against the training set, and then use the test set to make sure you're not overfitting to your training data. You don't want to give the optimizer too much context, or else it will only know how to answer those specific questions and nothing else.
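That fix is a few lines. A sketch, assuming a hypothetical evaluate_on() variant of the evaluator that runs on a given split:

```python
# Classic fix for overfitting: optimize the prompt against a train split
# only, then check the held-out split.
import random

random.seed(0)
random.shuffle(dataset)
train, test = dataset[:15], dataset[15:]

train_score, reasons = evaluate_on(train, prompt)  # hypothetical variant of evaluate()
test_score, _ = evaluate_on(test, prompt)
print(f"train={train_score:.2f} test={test_score:.2f}")  # a big gap means overfitting
```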
That said, I actually did do a lot of prompt engineering for this demo, because I needed to engineer the agent that optimizes my prompts. Maybe I could run this same process on the evaluator prompts, or even on the agent's own prompt. It gets kind of meta: who's even writing prompts here? I'd be using the agent to optimize itself.
If you have any questions, you're welcome to ask me outside. Or you can even book some time with me, not by clicking this link, but by following it.