
Evaluating AI Search: A Practical Framework for Augmented AI Systems — Quotient AI + Tavily


Transcript

Hi everyone, thank you so much for coming. My name is Julia, I'm CEO and co-founder of Quotient AI. I'm Deanna Emery, I am founding AI researcher at Quotient AI. My name is Vithara Sher, I'm the head of engineering at Tavily. And today we're going to talk to you about evaluating AI search.

So let me start with a fundamental challenge we're all facing in AI today. Traditional monitoring approaches simply aren't keeping up with the complexity of modern AI systems. First off, these systems are dynamic. Unlike traditional software, AI agents operate in constantly changing environments. They're not just executing predetermined logic, they're making real-time decisions based on evolving web content, user interactions, and complex tool chains.

These systems can also have multiple failure modes that happen at the same time. They hallucinate, retrieval fails, they make reasoning errors, and all of these are interconnected. So, a little bit about what we do at Quotient: we monitor live AI agents. We have expert evaluators that can detect objective system failures without waiting on ground-truth data, human feedback, or benchmarks.

A year ago we met Rotem, Tavily's founder and CEO, and he posed a problem to us that really crystallized the core issues we needed to solve. Here's the challenge: how do you build production-ready AI search agents when your system will be dealing with two fundamental sources of unpredictability you cannot proactively control?

Under the hood, Tavily's agents gather their context by searching the web. The web is not static. Traditional benchmarks assume stable ground truth, but when you're dealing with real-time information, ground truth itself is a moving target. Your users also don't stick to your test cases. They ask odd, malformed questions, and they have implicit context that they don't share and you're not aware of.

And this is not just a theoretical problem. Tavily processes hundreds of millions of search requests for AI agents in production, and they need a solution that works at that scale under these real-world conditions. This is the story of how we built that. Yes. So at Tavily, we're building the infrastructure layer for agentic interaction at scale, essentially providing language models with real-time data from across the web.

There are many use cases where real-time AI search delivers value, and these are just a few examples of how our clients are using Tavily to power their applications: from a CLM company that built an AI legal assistant to give their legal and business teams instant case insights, to a sports news outlet that created a hybrid RAG chat agent that delivers scores, game updates, and news, to a credit card company that uses real-time search to fight fraud by pinpointing merchant locations.

So as you can imagine, evaluating a system in this kind of vast, fast-moving setting is quite challenging. We have two principles that guide our evaluation. First, the web, which is the foundation of our data, is constantly changing. This means that our evaluation methods must keep up with that ongoing change.

Second, truth is often subjective and contextual. Evaluating correctness can be tricky because what's right may depend on the source, the timing, or the user's needs. So we have a responsibility to design our evaluation methods to be as unbiased and fair as possible, even when absolute truth is hard to pin down.

So the first thing to think about in offline evaluation is which data to use to evaluate your system. Static datasets are a great start, and there are many widely used open-source datasets available on the web. SimpleQA is one example. It's a benchmark and dataset from OpenAI that serves as a standard for evaluating retrieval accuracy.

Many leading AI search providers use SimpleQA to evaluate their performance. SimpleQA is designed to evaluate a system's ability to answer short, fact-seeking questions with a single empirical answer. Another widely adopted dataset is HotpotQA, which evaluates a system's ability to answer multi-hop questions, where reasoning across multiple documents is required to reach the final answer.
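To make this concrete, here is a minimal sketch of what scoring a search system against a static QA benchmark can look like. The `answer_with_search` function and the CSV column names are placeholders for whatever provider and dataset format you actually use, and the string-match grader is deliberately naive (SimpleQA itself uses an LLM grader, shown further below).

```python
# Illustrative only: score an AI search system on a static QA benchmark.
# `answer_with_search` is a hypothetical stand-in for the system under test.
import csv

def answer_with_search(question: str) -> str:
    # Placeholder: call your AI search provider / agent here.
    raise NotImplementedError

def naive_match(predicted: str, gold: str) -> bool:
    # Deliberately simple check; real benchmarks use an LLM grader instead.
    return gold.strip().lower() in predicted.strip().lower()

def evaluate(benchmark_path: str) -> float:
    correct = total = 0
    with open(benchmark_path, newline="") as f:
        for row in csv.DictReader(f):  # assumes "question" and "answer" columns
            prediction = answer_with_search(row["question"])
            correct += naive_match(prediction, row["answer"])
            total += 1
    return correct / total if total else 0.0
```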

Datasets like SimpleQA and HotpotQA are a great start for evaluating your system. But what happens when you're evaluating real-time systems, especially when you need to measure that your system keeps up with rapidly evolving information and avoids regressions, which is exactly where we operate? Those kinds of static datasets also don't address the challenge of benchmarking questions where there is no single true answer, or where subjectivity is involved.

This is what led us to think beyond static datasets, towards dynamic evaluation that reflects the pace at which the web changes. Dynamic datasets are essential for benchmarking RAG agents in real-world production systems. You can't answer today's questions with yesterday's data. Dynamic datasets have real-world alignment.

They have broad coverage, since you can easily create eval sets for any domain or use case that is relevant to your specific needs. And they also ensure continuous relevancy, because they are regularly refreshed, which means that your system is always evaluated against the latest data. This led us to build an open-source agent that builds dynamic eval sets for web-based RAG systems.

It's open source, and we encourage everyone to check it out and contribute. I also want to acknowledge the work of Eyal, our head of data at Tavily, who initiated this project a couple of months ago. You can see here an example of a dataset generated by the agent.

It generates question-and-answer pairs for targeted domains using information found on the web. The agent leverages the LangGraph framework, and it consists of these key steps. First, it generates broad web search queries for the targeted domains, which essentially lets you create eval sets for any domain of your choice and the specific needs of your application.

The second step is to aggregate grounding documents from multiple real-time AI search providers. We understand that we can't just use Tavily to search the web on specific domains, find grounding documents, generate question-and-answer pairs from those documents, and then evaluate our own performance on those same documents. That's why we use multiple real-time AI search providers, to both maximize coverage and minimize bias.

The third step, which is the key step in this process, is to generate the evidence-based question-and-answer pairs. We ensure that in the generation process the agent is required to produce the answer context as well, which increases the reliability of our question-and-answer pairs and reduces hallucinations. You can always go back and check which sources were used, and which evidence from those sources was used, to generate each question-and-answer pair.
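As a rough illustration of the shape of such a pipeline (not the actual open-source agent; the node names, state fields, and stub bodies are illustrative placeholders), here is what those three steps could look like wired together with LangGraph:

```python
# Rough sketch of a LangGraph pipeline for generating a dynamic eval set.
# Node names, state fields, and stub bodies are illustrative placeholders,
# not the actual open-source agent.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class EvalSetState(TypedDict):
    domain: str
    queries: list[str]
    documents: list[dict]
    qa_pairs: list[dict]

def generate_queries(state: EvalSetState) -> dict:
    # Step 1: have an LLM produce broad web-search queries for the target domain.
    return {"queries": [f"latest developments in {state['domain']}"]}  # stub

def gather_documents(state: EvalSetState) -> dict:
    # Step 2: aggregate grounding documents from multiple real-time search
    # providers to maximize coverage and minimize bias.
    return {"documents": []}  # stub: fan out to the providers' APIs here

def generate_qa_pairs(state: EvalSetState) -> dict:
    # Step 3: generate evidence-based Q&A pairs, each carrying the answer
    # context (source + supporting evidence) it was derived from.
    return {"qa_pairs": []}  # stub

builder = StateGraph(EvalSetState)
builder.add_node("generate_queries", generate_queries)
builder.add_node("gather_documents", gather_documents)
builder.add_node("generate_qa_pairs", generate_qa_pairs)
builder.add_edge(START, "generate_queries")
builder.add_edge("generate_queries", "gather_documents")
builder.add_edge("gather_documents", "generate_qa_pairs")
builder.add_edge("generate_qa_pairs", END)
graph = builder.compile()

# Example invocation (state keys match EvalSetState):
# graph.invoke({"domain": "renewable energy", "queries": [], "documents": [], "qa_pairs": []})
```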

And lastly, we use LangSmith to track our experiments; it's a great observability tool for managing these offline evaluation runs and seeing how your performance evolves at different time steps. The next step that we want to address is to support a range of question types, both simple fact-based questions and multi-hop questions similar to HotpotQA.

We also want to ensure fairness and coverage by proactively addressing bias and covering a wide range of perspectives for each subject we generate questions and answers for. Additionally, we want to add a supervisor node for coordination, which has proven valuable, especially in these multi-agent architectures, and will increase the quality of our question-and-answer pairs.

The next thing to think about is benchmarking. And we argue that it's important to measure accuracy, but you should not stop there. You should build a holistic evaluation framework that uses benchmarks which, in our case, measure source diversity, source relevance, and hallucination rates. It's also important to leverage unsupervised evaluation methods that remove the need for labeled data, which lets you scale your evaluations and address the subjectivity issue.
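As a concrete example of one such signal, here is a tiny sketch of how a source diversity score could be computed from the URLs a provider returns. This is just an illustration, not Quotient's or Tavily's implementation.

```python
# Illustrative sketch of a "source diversity" signal: the fraction of unique
# domains among the sources returned for one query (not a production metric).
from urllib.parse import urlparse

def source_diversity(source_urls: list[str]) -> float:
    if not source_urls:
        return 0.0
    domains = {urlparse(url).netloc.lower().removeprefix("www.") for url in source_urls}
    return len(domains) / len(source_urls)

print(source_diversity([
    "https://www.example.com/a",
    "https://example.com/b",
    "https://news.example.org/c",
]))  # -> 2 unique domains / 3 sources ≈ 0.67
```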

With that, I'll pass it over to Deanna, who will explain more about these reference-free benchmarks and also share results from an experiment we ran using a static dataset and a dynamic dataset generated by the agent I described before. So we performed a two-part evaluation of six different AI search providers.

The first component of this experiment was to compare the accuracy of search providers on a static and a dynamic benchmark in order to demonstrate that static benchmarking is not a comprehensive method for evaluation of AI search. The second component was to evaluate the dynamic dataset responses using reference-free metrics.

And we compared these results to the reference-based accuracies that we get from the benchmark, in order to demonstrate that reference-free evaluation can be an effective substitute when ground truths are not available. So jumping right in: for our static versus dynamic benchmarking comparison, we used the SimpleQA benchmark as the static dataset, and we're using a dynamic benchmark of about a thousand rows created by Tavily.

And as you can see here, both datasets have roughly similar distributions of topics, and this helps ensure a fair comparison and a diversity of questions. So to evaluate the AI search providers' performance on these two benchmarks, we're using the SimpleQA correctness metric, which is an LLM judge used on the SimpleQA benchmark.

It compares the model's response against a ground-truth answer in order to determine if it's correct, incorrect, or not attempted. And so here we're showing the correctness scores from the SimpleQA benchmark compared against the dynamic benchmark. We've anonymized the search providers for this talk, but I do want to call out that the SimpleQA accuracy scores here are all self-reported, so they don't all necessarily have clear documentation on how they were calculated.
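For reference, a reference-based grader in this spirit can be sketched roughly as follows. The prompt is a paraphrase rather than the official SimpleQA grading prompt, the model name is just an example, and it assumes the OpenAI Python SDK with an API key in the environment.

```python
# Simplified sketch of a reference-based LLM judge in the spirit of the
# SimpleQA grader. Prompt wording and model choice are illustrative only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

GRADER_PROMPT = """You are grading an answer against a gold target.
Question: {question}
Gold target: {target}
Predicted answer: {prediction}

Reply with exactly one word:
CORRECT if the prediction contains the gold target without contradicting it,
INCORRECT if it contradicts or misses the target,
NOT_ATTEMPTED if the prediction declines to answer."""

def grade(question: str, target: str, prediction: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, target=target, prediction=prediction)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```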

But as you can see, the correctness scores for the dynamic benchmark, in blue, are substantially lower. And not only that, the relative rankings have also changed pretty considerably. For example, provider F, all the way at the end of this plot, performs the worst on SimpleQA but the best on the dynamic benchmark.

And looking a little closer at the results, while the SimpleQA evaluator is useful, it's certainly far from perfect. I have a few examples here of model responses that were flagged as incorrect by this LLM judge, even though, if you look at the actual text of the model outputs, they do contain the correct answer from the ground truth.

On the flip side, here is an example that the LLM judge classified as correct. And yes, you can see that the correct answer is in this response, but while the correct answer might be present, that doesn't necessarily mean the full answer is right. This evaluation doesn't account for any of the additional text in the response, and there might be hallucinations in there that would invalidate it.

So, ultimately, this evaluation falls short of identifying when things go wrong in AI search. So, what are some other ways that we can identify when things go wrong? Up to this point, we have been talking about a reference-based approach to evaluation. But what if we don't have ground truths?

In most online and production settings, this is typically the case, and as we've already discussed, it's especially so in AI search. For this talk, we're going to look at three of Quotient's reference-free metrics. We'll look at answer completeness, which identifies whether all components of the question were answered, so it classifies model responses as either fully addressed, unaddressed, or unknown, if the model says I don't know.

Then we'll look at document relevance, and this is the percent of the retrieved documents that are actually relevant to addressing the question. And then, finally, we'll look at hallucination detection, which identifies whether there are any facts in the model response that are not present in any of the retrieved documents.
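To make the three ideas concrete, here is one way they could be approximated with a single LLM judge over the question, the model's answer, and the retrieved documents. These are illustrative sketches, not Quotient's detectors; the prompts, thresholds, and model name are assumptions.

```python
# Illustrative reference-free checks, approximated with an LLM judge.
# NOT Quotient's detectors; they only make the three ideas concrete:
# answer completeness, document relevance, and hallucination detection.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def _judge(instruction: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def answer_completeness(question: str, answer: str) -> str:
    out = _judge(
        "Classify whether the answer fully addresses every part of the question. "
        'Respond as JSON: {"label": "complete" | "unaddressed" | "unknown"}.\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    return out["label"]

def document_relevance(question: str, documents: list[str]) -> float:
    # Fraction of retrieved documents judged relevant to the question.
    relevant = 0
    for doc in documents:
        out = _judge(
            "Is this document relevant to answering the question? "
            'Respond as JSON: {"relevant": true | false}.\n'
            f"Question: {question}\nDocument: {doc[:2000]}"
        )
        relevant += bool(out["relevant"])
    return relevant / len(documents) if documents else 0.0

def has_hallucination(answer: str, documents: list[str]) -> bool:
    # Flags facts in the answer that are not supported by any retrieved document.
    out = _judge(
        "Does the answer state any fact that is not supported by the documents? "
        'Respond as JSON: {"hallucination": true | false}.\n'
        f"Answer: {answer}\nDocuments: {' '.join(d[:1000] for d in documents)}"
    )
    return bool(out["hallucination"])
```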

So, we use these metrics to evaluate the search providers' responses on this dynamic benchmark. So, we've got answer completeness plotted here. The stacked bar plot shows the number of responses that were either completely answered, unaddressed, or marked as unknown. And if we look back at the overall rankings that we saw earlier on the dynamic benchmark, you can see that the rankings from answer completeness pretty closely match.

The average performance scores for the two get a correlation of 0.94. So, this indicates that the reference-free metric can capture relative performance pretty well. But completeness is still not the same thing as correctness. And when we have no ground truths available, then we have to turn to the next best thing.

And that is the grounding documents. So, this is where document relevance and hallucination detection come in. Both of these metrics are going to be looking at those grounding documents in order to measure the quality of the model's response. Unfortunately, of all of the search providers we looked at, only three of them actually return the retrieved documents used to generate their answers.

The majority of search providers typically only provide citations. And these are largely unhelpful at scale and also really limit transparency when it comes to debugging. So, these are those document relevance scores for the three search providers. And they've been re-anonymized here. The plot to the left shows the average document relevance, the percent of retrieved documents that are relevant to the question.

And the plot to the right shows the number of responses that have no relevant documents. And if we consider these results in conjunction with answer completeness, we find that there's a strong inverse correlation between document relevance and the number of unknown answers. And this kind of matches intuition. If you think about it, if you have no relevant documents for the question, the model should say, "I don't know," rather than trying to answer it.

And so this brings us to hallucination detection. And here we were actually surprised to see that there was a direct relationship between the hallucination rate and document relevance. Provider X here has the highest hallucination rate, but it also had the highest overall document relevance. And this is kind of counterintuitive.

But if we think about it more, provider X had high answer completeness, the lowest rate of unknown answers, and it also had the highest answer correctness of these three providers in the benchmarking from earlier. So this probably implies that provider X's responses are more likely to include new reasoning or interpretations, or maybe they're just more detailed and thorough.

And this just creates more opportunity for hallucination in their responses. But the point I want to make here is that when considering these metrics, depending on your use case, you might index more heavily on one over another. They're measuring different dimensions of response quality, and it's often a give and take.

If you perform really well in one, it might be at the expense of another. And as we see here, there is a tradeoff between answer completeness and hallucination. But also, if you take these three metrics in conjunction, you can use them to understand why things went wrong and identify potential strategies for addressing those issues.

So this diagram shows a few examples of how you can interpret your evaluation results to identify what to do to fix the issues. We've got one example here where your response is incomplete, but you have relevant documents and no hallucinations.

So this probably means you don't have all the information you need to answer the question, and just retrieving more documents might solve that. But the big picture idea is that your evaluation should do more than just provide relative rankings. It should help you identify the types of issues that are present, and it should also help you understand what strategies to implement to solve those issues, along the lines of the toy sketch below.
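Here is a toy version of that triage logic. Only the first rule comes from the example in the diagram; the others are plausible extensions rather than prescriptions, and the 0.5 threshold is arbitrary.

```python
# Toy triage sketch: map the three reference-free signals to a candidate fix.
# Only the first rule comes from the example in the talk; the rest are
# plausible extensions, not prescriptions.
def suggest_fix(complete: bool, doc_relevance: float, hallucinated: bool) -> str:
    if not complete and doc_relevance > 0.5 and not hallucinated:
        # Relevant docs, no hallucination, but an incomplete answer:
        # likely missing information, so retrieve more.
        return "Retrieve more documents (broaden or add queries)."
    if doc_relevance <= 0.5:
        return "Improve retrieval: rewrite queries or change search parameters."
    if hallucinated:
        return "Tighten grounding: constrain generation to cited evidence."
    return "Looks healthy; keep monitoring."

print(suggest_fix(complete=False, doc_relevance=0.8, hallucinated=False))
```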

So in conclusion, let me just quickly paint a picture of where we're heading with all this. Because this is not just about building the agents we've been building for the past couple of years, slapping evaluation on top, and then continuing to do the same thing.

It's actually not about building better benchmarking. It's not about better monitoring. It's not about better evaluation. It's about creating AI systems that can continuously improve themselves. Imagine for a second agents that don't just retrieve information but learn from patterns: which information is outdated, which sources are unreliable, and what users need.

They could also detect hallucinations mid-conversation and correct course, all without human intervention. The framework we shared today, dynamic datasets, holistic evaluation, reference-free metrics, these are the building blocks for getting there. And this is where we want to go with augmented AI. So thank you so much for your time.

Thank you. We'll see you next time.