back to indexHow to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe

00:00:20.360 |
So I'm going to talk about how to run evals at scale 00:00:28.240 |
So in the last presentation, we learned about how 00:00:39.460 |
In this presentation, I'm going to talk about the importance 00:00:45.520 |
we have to choose when we are crafting an application. 00:00:51.900 |
So I work as a lead engineer for applied AI for developer 00:00:57.480 |
I have also co-authored the CI/CD design patterns book 00:01:02.200 |
and also involved in a lot of open source work 00:01:10.240 |
So how many of you have seen this or are active on the Twitter 00:01:14.480 |
Like, have you seen these kind of patterns emerging? 00:01:17.920 |
I think this morning there was a talk where this snapshot was, again, surfaced. 00:01:22.640 |
So, one of the most important trends in AI application development is evals. 00:01:28.360 |
Because without evals, we can't craft any AI application. 00:01:35.360 |
Then how many of you are developing an AI application, be it a RAD, chatbot, agents, anything? 00:01:43.080 |
So, if you are working on that, you often have come across these kind of questions. 00:01:47.800 |
Like, how do I test applications when outputs are non-deterministic and require subjective 00:01:54.520 |
Because we all know in LLM world, you can have the different output for the same set 00:02:01.520 |
So, how many times are you wondering, like, if I am changing a prompt, what is going to 00:02:10.040 |
And then, most importantly, when you are developing an application, in order to measure the performance 00:02:15.080 |
or accuracy, you need to find out what tools to use, what metrics to use, or what models are 00:02:22.400 |
Because models are getting capable day by day. 00:02:27.160 |
So, evals is the fundamental approach where you are writing, sort of, test cases to measure 00:02:39.040 |
Because without measuring something, it can have various impacts. 00:02:46.200 |
You need to measure whatever system output is being produced. 00:02:50.160 |
How do you align your application with system goals? 00:02:53.040 |
Or, one of the important aspects is how do you keep getting better? 00:02:56.880 |
Because applications are, you are developing applications day by day. 00:03:00.180 |
And you need to make sure it is getting better. 00:03:04.520 |
This is one of the aspects which is very important. 00:03:08.680 |
Because whenever you are developing something for a customer, you need to make sure they trust 00:03:13.800 |
your application, whatever output is being generated. 00:03:18.600 |
Now, when we talk about evals, one of the important aspects to focus on is data. 00:03:24.700 |
So, when we think about evals, when we think about the tests, how do we start? 00:03:30.000 |
So, the very first step is starting with the data. 00:03:34.700 |
So, there are a couple of approaches to get the data. 00:03:37.560 |
One is you start small and you start with the synthetic data. 00:03:42.000 |
You start validating your applications output against that data. 00:03:53.300 |
Then, when you think about the data in evals, it's a continuous improvement process. 00:04:00.300 |
It means every time you generate some output, you need to observe the system and then you need 00:04:05.700 |
to keep on defining that data set, whatever data set you are procuring. 00:04:10.700 |
And another aspect is you need to label your data accordingly. 00:04:14.000 |
So, because data is fundamental to writing evals. 00:04:17.000 |
So, you need to, when generating the data, you need to define your data set in a way where it is labeled 00:04:24.000 |
It is covering multiple flows or application prospects. 00:04:29.300 |
And then, you need to continuously refine that. 00:04:32.300 |
Another approach which I have learned from my experience is one data set is never sufficient. 00:04:39.300 |
So, when you are thinking about eval, you need to think about multiple data sets based on the flows, 00:04:45.000 |
based on the applications, and whatever you are trying to achieve. 00:04:48.300 |
Now, when we think about evaluation, what do we think about? 00:04:59.300 |
So, you need to define, start by defining your goals and objectives. 00:05:05.600 |
Then you need to design in a way where you have modules defined for each of the components. 00:05:14.500 |
And I notice I am mentioning data again and again. 00:05:17.000 |
But the point is, you need to have different data sets for different flows. 00:05:22.600 |
You need to test your flows, outputs, and paths. 00:05:25.400 |
So, if your application involves multiple flows, multiple paths, you need to evaluate in all paths. 00:05:34.600 |
So, one of the previous presentation talked about, like, there is no universal eval. 00:05:39.900 |
And that is, again, the most important thing because your evals depend upon what type of application you want to evaluate. 00:05:46.700 |
For example, evaluating a RAG application, a typical RAG application, is different from code generation. 00:05:53.700 |
If you are dealing with a RAG, typical Q&A type of application, you can define your eval such as accuracy or similarity or usefulness. 00:06:01.500 |
Versus when you are generating a code, you want to test the generated code against the actual code base. 00:06:08.500 |
So, that is where you need to measure your functional correctness of the code generated or how robust that code is generated. 00:06:16.300 |
Then, when you are trying to evaluate agents. 00:06:22.300 |
So, one of the important aspects of evaluating agents is trajectory evaluation. 00:06:30.100 |
And, oftentimes, you need to define which path they are taking in order to execute a flow. 00:06:35.100 |
There is also multi-term simulation where most of these agents are complex. 00:06:40.900 |
And, you need to check, like, when you are having a conversation, like, how do you evaluate that? 00:06:46.900 |
Then, if you are doing the tool call, then you also need to check the correctness or test suite or, like, how the data is being generated. 00:06:54.900 |
Now, another aspect is how do you scale eval? 00:06:59.900 |
So, one, one strategy is you can cache the intermediate results and regression. 00:07:05.700 |
You need to focus on orchestration and parallelism, like, how you are running your evals, how you are orchestrating them, how you are paralyzing them. 00:07:16.700 |
And then, you need, the important aspect here is, you need to run them frequently and then improve upon. 00:07:22.700 |
So, one of the term which is being used in industry is measure, monitor, analyze, and repeat. 00:07:28.700 |
So, you need to often measure it, you need to analyze it, and iterate on that. 00:07:33.500 |
Then, you need to strategize what you want to measure. 00:07:36.500 |
So, again, depending upon the use case, there are different types of matrices or different types of methodologies you need to adapt to. 00:07:44.500 |
And then, again, use, there is no fixed strategy to run your eval, so use what fits best. 00:07:52.500 |
In some cases, you want your humans in the loop to be taking precedence. 00:07:57.300 |
In some cases, you have automation test, automation evals running in. 00:08:03.300 |
There is a fine balance or trade-off between human in the loop versus automation, like whether you want the high speed versus high fidelity. 00:08:12.300 |
So, again, depending upon what you want to achieve, you want to give a fine balance on that. 00:08:21.100 |
The reason is because tools, again, you cannot automate everything. 00:08:25.100 |
So, you need to define and establish the process. 00:08:28.900 |
So, these are some of the key takeaways we just talked about. 00:08:34.900 |
So, one is evals are the most important aspect for AI application. 00:08:39.900 |
There is a term being coined now, eval-driven development, which is, if you think about typical software like test-driven development, this is the eval-driven development. 00:08:52.700 |
You need to focus on positive as well as negative cases. 00:09:00.700 |
And then, remember to measure, monitor, analyze, iterate in a loop continuously. 00:09:06.700 |
And always take a balanced approach in fidelity versus speed. 00:09:10.500 |
If you have any questions, there's a barcode.