How to run Evals at Scale: Thinking beyond Accuracy or Similarity

Thank you. Hey, everyone. Hope you are having a great conference. So I'm going to talk about how to run evals at scale and thinking beyond accuracy or similarity. So in the last presentation, we learned about how to architect the AI applications and then why evals are important. In this presentation, I'm going to talk about the importance of evals as well as what type of evals we have to choose when we are crafting an application.

This is a bit about me. So I work as a lead engineer for applied AI for developer platforms at Adobe. I have also co-authored the CI/CD design patterns book and also involved in a lot of open source work across the communities. So let's get started. So how many of you have seen this or are active on the Twitter right now?

Like, have you seen these kind of patterns emerging? I think this morning there was a talk where this snapshot was, again, surfaced. So, one of the most important trends in AI application development is evals. Because without evals, we can't craft any AI application. Then how many of you are developing an AI application, be it a RAD, chatbot, agents, anything?

So, if you are working on that, you often have come across these kind of questions. Like, how do I test applications when outputs are non-deterministic and require subjective judgment? Because we all know in LLM world, you can have the different output for the same set of input. LLMs are non-deterministic.

So, how many times are you wondering, like, if I am changing a prompt, what is going to break? Or how am I going to test that? And then, most importantly, when you are developing an application, in order to measure the performance or accuracy, you need to find out what tools to use, what metrics to use, or what models are best.

Because models are getting capable day by day. And the answer is evals. So, evals is the fundamental approach where you are writing, sort of, test cases to measure your AI applications. And why do they matter? Because without measuring something, it can have various impacts. It can impact your business.

You need to measure whatever system output is being produced. How do you align your application with system goals? Or, one of the important aspects is how do you keep getting better? Because applications are, you are developing applications day by day. And you need to make sure it is getting better.

And then, trust and accountability. This is one of the aspects which is very important. Because whenever you are developing something for a customer, you need to make sure they trust your application, whatever output is being generated. Now, when we talk about evals, one of the important aspects to focus on is data.

So, when we think about evals, when we think about the tests, how do we start? So, the very first step is starting with the data. Now, how do you get the data? So, there are a couple of approaches to get the data. One is you start small and you start with the synthetic data.

You start validating your applications output against that data. Then, when you think about the data in evals, it's a continuous improvement process. It means every time you generate some output, you need to observe the system and then you need to keep on defining that data set, whatever data set you are procuring.

And another aspect is you need to label your data accordingly. So, because data is fundamental to writing evals. So, you need to, when generating the data, you need to define your data set in a way where it is labeled into different aspects. It is covering multiple flows or application prospects.

So, things like that. And then, you need to continuously refine that. Another approach which I have learned from my experience is one data set is never sufficient. So, when you are thinking about eval, you need to think about multiple data sets based on the flows, based on the applications, and whatever you are trying to achieve.

Now, when we think about evaluation, what do we think about? So, what do we want to evaluate? The answer is everything. But what does that mean? So, you need to define, start by defining your goals and objectives. What do you want to evaluate in your system? Then you need to design in a way where you have modules defined for each of the components.

You need to optimize your data handling. And I notice I am mentioning data again and again. But the point is, you need to have different data sets for different flows. You need to test your flows, outputs, and paths. So, if your application involves multiple flows, multiple paths, you need to evaluate in all paths.

Now, adaptive evals. So, one of the previous presentation talked about, like, there is no universal eval. And that is, again, the most important thing because your evals depend upon what type of application you want to evaluate. For example, evaluating a RAG application, a typical RAG application, is different from code generation.

If you are dealing with a RAG, typical Q&A type of application, you can define your eval such as accuracy or similarity or usefulness. Versus when you are generating a code, you want to test the generated code against the actual code base. So, that is where you need to measure your functional correctness of the code generated or how robust that code is generated.

Then, when you are trying to evaluate agents. So, one of the important aspects of evaluating agents is trajectory evaluation. Because agents can take a different path. And, oftentimes, you need to define which path they are taking in order to execute a flow. There is also multi-term simulation where most of these agents are complex.

And, you need to check, like, when you are having a conversation, like, how do you evaluate that? Then, if you are doing the tool call, then you also need to check the correctness or test suite or, like, how the data is being generated. Now, another aspect is how do you scale eval?

So, one, one strategy is you can cache the intermediate results and regression. You need to focus on orchestration and parallelism, like, how you are running your evals, how you are orchestrating them, how you are paralyzing them. You need to aggregate the results. And then, you need, the important aspect here is, you need to run them frequently and then improve upon.

So, one of the term which is being used in industry is measure, monitor, analyze, and repeat. So, you need to often measure it, you need to analyze it, and iterate on that. Then, you need to strategize what you want to measure. So, again, depending upon the use case, there are different types of matrices or different types of methodologies you need to adapt to.

And then, again, use, there is no fixed strategy to run your eval, so use what fits best. In some cases, you want your humans in the loop to be taking precedence. In some cases, you have automation test, automation evals running in. There is a fine balance or trade-off between human in the loop versus automation, like whether you want the high speed versus high fidelity.

So, again, depending upon what you want to achieve, you want to give a fine balance on that. And rely on process over tools. The reason is because tools, again, you cannot automate everything. So, you need to define and establish the process. How do you want to run the evals?

So, these are some of the key takeaways we just talked about. So, one is evals are the most important aspect for AI application. There is a term being coined now, eval-driven development, which is, if you think about typical software like test-driven development, this is the eval-driven development. Define evals based on the use cases.

You need to focus on positive as well as negative cases. Then, focus on the data. That is, I cannot emphasize enough on that. And then, remember to measure, monitor, analyze, iterate in a loop continuously. And always take a balanced approach in fidelity versus speed. If you have any questions, there's a barcode.

You can come later and chat with me. Happy to chat more. And that's all from now. I'll see you next time. Bye. Bye. Bye.

How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe

Transcript