
How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe



00:00:00.000 | Thank you.
00:00:02.000 | Hey, everyone.
00:00:16.700 | Hope you are having a great conference.
00:00:20.360 | So I'm going to talk about how to run evals at scale
00:00:24.320 | and thinking beyond accuracy or similarity.
00:00:28.240 | So in the last presentation, we learned how
00:00:32.800 | to architect AI applications and
00:00:37.420 | why evals are important.
00:00:39.460 | In this presentation, I'm going to talk about the importance
00:00:43.060 | of evals as well as what type of evals
00:00:45.520 | we have to choose when we are crafting an application.
00:00:49.680 | This is a bit about me.
00:00:51.900 | So I work as a lead engineer for applied AI for developer
00:00:55.040 | platforms at Adobe.
00:00:57.480 | I have also co-authored the CI/CD design patterns book
00:01:02.200 | and am also involved in a lot of open source work
00:01:05.200 | across the communities.
00:01:07.320 | So let's get started.
00:01:10.240 | So how many of you have seen this or are active on Twitter
00:01:13.820 | right now?
00:01:14.480 | Like, have you seen these kinds of patterns emerging?
00:01:17.920 | I think this morning there was a talk where this snapshot surfaced again.
00:01:22.640 | So, one of the most important trends in AI application development is evals.
00:01:28.360 | Because without evals, we can't craft any AI application.
00:01:35.360 | Then how many of you are developing an AI application, be it RAG, a chatbot, agents, anything?
00:01:43.080 | So, if you are working on that, you have often come across these kinds of questions.
00:01:47.800 | Like, how do I test applications when outputs are non-deterministic and require subjective
00:01:53.520 | judgment?
00:01:54.520 | Because we all know in the LLM world, you can get different outputs for the same
00:01:59.520 | input.
00:02:00.520 | LLMs are non-deterministic.
00:02:01.520 | So, how many times have you wondered, like, if I change a prompt, what is going to
00:02:06.920 | break?
00:02:07.920 | Or how am I going to test that?
00:02:10.040 | And then, most importantly, when you are developing an application, in order to measure the performance
00:02:15.080 | or accuracy, you need to find out what tools to use, what metrics to use, or what models are
00:02:21.400 | best.
00:02:22.400 | Because models are getting more capable day by day.
00:02:25.740 | And the answer is evals.
00:02:27.160 | So, evals are the fundamental approach where you write, in essence, test cases to measure
00:02:33.840 | your AI applications.
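To make that idea concrete, here is a minimal sketch of an eval as a test case: an input, a reference output, and a scorer. All names here are illustrative, not from any specific framework; swap in your own model call and metric.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str    # input to the application
    expected: str  # reference output

def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on an exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(cases: list[EvalCase],
             model: Callable[[str], str],
             scorer: Callable[[str, str], float]) -> float:
    """Run every case through the model and average the scores."""
    scores = [scorer(model(c.prompt), c.expected) for c in cases]
    return sum(scores) / len(scores)

# Stub "model" for demonstration; in practice this is your LLM app.
print(run_eval([EvalCase("2 + 2 = ?", "4")], lambda p: "4", exact_match))  # 1.0
```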
00:02:36.400 | And why do they matter?
00:02:39.040 | Because not measuring something can have various impacts.
00:02:43.060 | It can impact your business.
00:02:46.200 | You need to measure whatever system output is being produced.
00:02:50.160 | How do you align your application with system goals?
00:02:53.040 | Or, one of the important aspects is how do you keep getting better?
00:02:56.880 | Because you are developing your application day by day,
00:03:00.180 | and you need to make sure it is getting better.
00:03:02.880 | And then, trust and accountability.
00:03:04.520 | This is one of the aspects which is very important.
00:03:08.680 | Because whenever you are developing something for a customer, you need to make sure they trust
00:03:13.800 | your application and whatever output it generates.
00:03:18.600 | Now, when we talk about evals, one of the important aspects to focus on is data.
00:03:24.700 | So, when we think about evals, when we think about the tests, how do we start?
00:03:30.000 | So, the very first step is starting with the data.
00:03:33.000 | Now, how do you get the data?
00:03:34.700 | So, there are a couple of approaches to get the data.
00:03:37.560 | One is to start small, with synthetic data.
00:03:42.000 | You start validating your application's output against that data.
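As a hedged sketch of that bootstrapping step, assuming a placeholder `llm()` completion call (not a real API), you might ask a model to invent grounded Q&A pairs from a seed document:

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for your model client; plug in a real call here."""
    raise NotImplementedError

SEED_DOC = "The export dialog supports PNG, JPEG, and SVG output."

def synthesize_cases(doc: str, n: int = 5) -> list[dict]:
    """Ask the model for Q&A pairs grounded in the source document."""
    prompt = (
        f"Write {n} question/answer pairs about the text below, as a JSON "
        f"list of objects with 'question' and 'answer' keys.\n\n{doc}"
    )
    return json.loads(llm(prompt))

# Each synthetic pair becomes an eval case: feed the question to your
# application and score its answer against the synthetic reference.
```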
00:03:53.300 | Then, when you think about the data in evals, it's a continuous improvement process.
00:04:00.300 | It means every time you generate some output, you need to observe the system and then you need
00:04:05.700 | to keep refining that data set, whatever data set you are procuring.
00:04:10.700 | And another aspect is you need to label your data accordingly.
00:04:14.000 | Because data is fundamental to writing evals.
00:04:17.000 | So, when generating the data, you need to define your data set in a way where it is labeled
00:04:23.000 | into different aspects
00:04:24.000 | and covers multiple flows or application aspects.
00:04:28.000 | Things like that.
00:04:29.300 | And then, you need to continuously refine that.
00:04:32.300 | Another lesson I have learned from experience is that one data set is never sufficient.
00:04:39.300 | So, when you are thinking about eval, you need to think about multiple data sets based on the flows,
00:04:45.000 | based on the applications, and whatever you are trying to achieve.
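One illustrative layout (a sketch, not a standard) that captures both points: one data set per application flow, with labels on each record so results can be sliced by aspect later.

```python
import json
import pathlib

# One JSONL data set per application flow, including negative flows.
DATASETS = {
    "qa_flow": "evals/qa.jsonl",
    "summarize_flow": "evals/summarize.jsonl",
    "refusal_flow": "evals/refusals.jsonl",  # inputs the app should refuse
}

# Each record carries labels so you can slice results by aspect.
record = {
    "input": "How do I export as PNG?",
    "expected": "Use File > Export As and choose PNG.",
    "labels": ["export", "happy_path"],
}

pathlib.Path("evals").mkdir(exist_ok=True)
with open(DATASETS["qa_flow"], "a") as f:
    f.write(json.dumps(record) + "\n")
```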
00:04:48.300 | Now, when we think about evaluation, what do we think about?
00:04:53.300 | So, what do we want to evaluate?
00:04:55.300 | The answer is everything.
00:04:57.600 | But what does that mean?
00:04:59.300 | So, start by defining your goals and objectives.
00:05:02.600 | What do you want to evaluate in your system?
00:05:05.600 | Then you need to design it in a way where you have modules defined for each of the components.
00:05:11.700 | You need to optimize your data handling.
00:05:14.500 | And I notice I am mentioning data again and again.
00:05:17.000 | But the point is, you need to have different data sets for different flows.
00:05:22.600 | You need to test your flows, outputs, and paths.
00:05:25.400 | So, if your application involves multiple flows, multiple paths, you need to evaluate all of those paths.
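A sketch of that modular design, assuming a simple RAG-style pipeline where each component gets its own evaluator so failures are attributable:

```python
from typing import Callable

Evaluator = Callable[[dict], float]  # eval case -> score in [0, 1]

def eval_retrieval(case: dict) -> float:
    """Did the retriever surface the expected document?"""
    return 1.0 if case["expected_doc"] in case["retrieved_docs"] else 0.0

def eval_generation(case: dict) -> float:
    """Crude token overlap between the answer and the reference."""
    got = set(case["answer"].lower().split())
    want = set(case["expected"].lower().split())
    return len(got & want) / max(len(want), 1)

MODULES: dict[str, Evaluator] = {
    "retrieval": eval_retrieval,
    "generation": eval_generation,
}

def evaluate(case: dict) -> dict[str, float]:
    """Score one case per component, so you can see where it failed."""
    return {name: fn(case) for name, fn in MODULES.items()}
```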
00:05:33.400 | Now, adaptive evals.
00:05:34.600 | So, one of the previous presentations talked about how there is no universal eval.
00:05:39.900 | And that is, again, the most important thing because your evals depend upon what type of application you want to evaluate.
00:05:46.700 | For example, evaluating a RAG application, a typical RAG application, is different from code generation.
00:05:53.700 | If you are dealing with RAG, a typical Q&A type of application, you can define evals such as accuracy, similarity, or usefulness.
00:06:01.500 | Versus when you are generating a code, you want to test the generated code against the actual code base.
00:06:08.500 | So, that is where you need to measure the functional correctness of the generated code, or how robust that code is.
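A minimal sketch of what functional correctness means here: execute the generated code against unit tests instead of comparing it to reference text. (In production this must be sandboxed; `exec` on untrusted model output is unsafe.)

```python
GENERATED = """
def add(a, b):
    return a + b
"""

TESTS = [((2, 3), 5), ((-1, 1), 0)]

def passes_tests(src: str) -> bool:
    """Execute the candidate and run it through the test table."""
    namespace: dict = {}
    exec(src, namespace)  # DANGER: sandbox this in real systems
    fn = namespace["add"]
    return all(fn(*args) == want for args, want in TESTS)

print(passes_tests(GENERATED))  # True
```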
00:06:16.300 | Then, when you are trying to evaluate agents.
00:06:22.300 | So, one of the important aspects of evaluating agents is trajectory evaluation.
00:06:27.100 | Because agents can take a different path.
00:06:30.100 | And, oftentimes, you need to define which path they are taking in order to execute a flow.
00:06:35.100 | There is also multi-turn simulation, because most of these agents are complex.
00:06:40.900 | And you need to check, when you are having a conversation, how do you evaluate that?
00:06:46.900 | Then, if you are doing tool calls, you also need to check their correctness, the test suite, or how the data is being generated.
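Here is one hedged sketch of trajectory evaluation: score how much of a reference tool-call path the agent actually followed, in order. The tool names are illustrative.

```python
EXPECTED = ["search_docs", "fetch_page", "answer"]

def trajectory_score(actual: list[str], expected: list[str]) -> float:
    """Fraction of expected steps found in order in the actual trace."""
    it = iter(actual)
    # `step in it` consumes the iterator, so matches must be in order.
    return sum(1 for step in expected if step in it) / len(expected)

trace = ["search_docs", "fetch_page", "fetch_page", "answer"]
print(trajectory_score(trace, EXPECTED))  # 1.0: every step, in order
```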
00:06:54.900 | Now, another aspect is how do you scale evals?
00:06:59.900 | So, one strategy is you can cache the intermediate results for regression runs.
00:07:05.700 | You need to focus on orchestration and parallelism, like, how you are running your evals, how you are orchestrating them, how you are parallelizing them.
00:07:14.700 | You need to aggregate the results.
00:07:16.700 | And then, the important aspect here is, you need to run them frequently and then improve upon them.
00:07:22.700 | So, one of the terms being used in industry is measure, monitor, analyze, and repeat.
00:07:28.700 | So, you need to often measure it, you need to analyze it, and iterate on that.
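A sketch tying those scaling pieces together: an in-memory cache keyed by case and application version (swap in a real store) so regression runs only re-score what changed, a thread pool for parallelism, and a single aggregated score at the end.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor

CACHE: dict[str, float] = {}  # use redis/disk for real regression runs

def case_key(case: dict, app_version: str) -> str:
    payload = app_version + json.dumps(case, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def score(case: dict) -> float:
    """Stub scorer; in practice, call the app and grade its output."""
    return 1.0 if case.get("output") == case.get("expected") else 0.0

def run_suite(cases: list[dict], app_version: str) -> float:
    def one(case: dict) -> float:
        key = case_key(case, app_version)
        if key not in CACHE:          # only re-score what changed
            CACHE[key] = score(case)
        return CACHE[key]
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(one, cases))
    return sum(scores) / len(scores)  # aggregate into one suite score
```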
00:07:33.500 | Then, you need to strategize what you want to measure.
00:07:36.500 | So, again, depending upon the use case, there are different types of metrics or different types of methodologies you need to adopt.
00:07:44.500 | And then, again, there is no fixed strategy to run your evals, so use what fits best.
00:07:52.500 | In some cases, you want humans in the loop to take precedence.
00:07:57.300 | In other cases, you have automated evals running.
00:08:03.300 | There is a fine balance or trade-off between human in the loop and automation, like whether you want high speed or high fidelity.
00:08:12.300 | So, again, depending upon what you want to achieve, you want to strike a fine balance there.
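One way to encode that trade-off, sketched with an illustrative sampling rate: automate every case, but route a tunable slice to human review for fidelity.

```python
import random

HUMAN_REVIEW_RATE = 0.05  # the speed-vs-fidelity knob, tuned per use case

def automated_score(case: dict) -> float:
    """Fast, cheap check that always runs."""
    return 1.0 if case.get("output") == case.get("expected") else 0.0

def grade(case: dict) -> dict:
    """Score automatically; sample a slice for high-fidelity review."""
    return {
        "auto_score": automated_score(case),
        "queue_for_human": random.random() < HUMAN_REVIEW_RATE,
    }
```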
00:08:19.100 | And rely on process over tools.
00:08:21.100 | The reason is that, with tools, again, you cannot automate everything.
00:08:25.100 | So, you need to define and establish the process
00:08:26.900 | for how you want to run the evals.
00:08:28.900 | So, these are some of the key takeaways we just talked about.
00:08:34.900 | So, one is that evals are the most important aspect of AI application development.
00:08:39.900 | There is a term being coined now, eval-driven development: if you think about test-driven development in typical software, this is its equivalent for AI applications (there is a sketch of this after these takeaways).
00:08:49.700 | Define evals based on the use cases.
00:08:52.700 | You need to focus on positive as well as negative cases.
00:08:56.700 | Then, focus on the data.
00:08:57.700 | I cannot emphasize that enough.
00:09:00.700 | And then, remember to measure, monitor, analyze, iterate in a loop continuously.
00:09:06.700 | And always take a balanced approach in fidelity versus speed.
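As a sketch of the eval-driven development idea from the takeaways: treat evals like the tests in TDD, i.e. assertions that gate a change. This is a hedged pytest-style example; the threshold, cases, and `app` stub are all illustrative.

```python
def app(prompt: str) -> str:
    """Stub for the AI application under test."""
    return "4"

CASES = [{"prompt": "2 + 2 = ?", "expected": "4"}]

def test_eval_suite_meets_bar():
    scores = [
        1.0 if app(c["prompt"]).strip() == c["expected"] else 0.0
        for c in CASES
    ]
    # Fail the build if the suite-level score drops below the bar.
    assert sum(scores) / len(scores) >= 0.9
```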
00:09:10.500 | If you have any questions, there's a barcode.
00:09:13.500 | You can come later and chat with me.
00:09:15.500 | Happy to chat more.
00:09:16.500 | And that's all from now.
00:09:18.500 | I'll see you next time.