All right. Welcome, everybody. I'm Jeff Huber, the co-founder and CEO of Chroma, and I'm joined by Jason. We're going to do a two-parter here. We're really going to pack in the content. It's the last session of the day, and so we thought we'd give you a lot. Everything in this presentation today is open source and the code is available, so we're also not selling you any tools.
And so there'll be QR codes and stuff throughout to grab the code. So let's talk about how to look at your data. All of you are AI practitioners. You're all building stuff, and these questions probably resonate quite deeply with you. What chunking strategy should I use? Is my embedding model the best embedding model for my data?
And more. And our contention is that you can really only manage what you measure. I think Peter Drucker is the one who originally coined that, so I can't take too much credit. But it certainly is still true today. So we have a very simple hypothesis here, which is you should look at your data.
The goal is to say look at your data, I think, at least 15 times this presentation. So that's two. And great measurement ultimately is what makes systematic improvement easy. And it really can be easy. It doesn't have to be super complicated. So I'm going to talk about part one, how to look at your inputs.
And then Jason's going to talk about part two, how to look at your outputs. So let's get into it. All right, looking at your inputs. How do you know whether or not your retrieval system is good? And how do you know how to make it better? There are a few options.
There is guess and cross your fingers. That's certainly one option. Another option is to use an LLM as a judge. You're using some of these frameworks where you're checking factuality and other metrics like this. And they cost $600 and take three hours to run. If that is your preference, you certainly can do that.
You can use public benchmarks. So you can look at things like MTEB to figure out, oh, which embedding model is the best on English? That's another option. But our contention is you should use fast evals. And I will tell you exactly what fast evals are. All right, so what is a fast eval?
A fast eval is simply a set of query and document pairs. So the first step is, if this query is put in, this document should come out. A set of those is called a golden data set. And then the way that you measure your system is you put all the queries in.
And then you see, do those documents come out? And obviously, you can retrieve 5 or retrieve 10 or retrieve 20. It kind of depends on your application. It's very fast and very inexpensive to run. And this is very important because it enables you to run a lot of experiments quickly and cheaply.
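To make that concrete, here is a minimal sketch of what a fast eval can look like with Chroma. The collection name, the example queries, and the document IDs are placeholders for your own data, not anything from the report.

```python
# Minimal fast-eval sketch: a golden set of (query, expected document id)
# pairs, and recall@k measured against a Chroma collection.
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("docs")  # assume your corpus is already loaded here

# Placeholder golden set; in practice these come from real or synthetic queries.
golden_set = [
    {"query": "How do I rotate my API key?", "doc_id": "doc_17"},
    {"query": "Which regions is the service available in?", "doc_id": "doc_42"},
]

def recall_at_k(collection, golden_set, k=10):
    hits = 0
    for pair in golden_set:
        results = collection.query(query_texts=[pair["query"]], n_results=k)
        retrieved_ids = results["ids"][0]  # ids returned for the first (and only) query
        if pair["doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden_set)

print(f"recall@10: {recall_at_k(collection, golden_set, k=10):.2f}")
```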
I'm sure all of you know that experimentation time and your energy to do experimentation go down significantly when you have to click go and then come back six hours later. All of these metrics should run extremely quickly, for pennies. So maybe you have your documents, you have your chunks, you have your stuff in your retrieval system, but you don't have queries yet.
That's OK. We found that you can actually use an LLM to write questions and write good questions. You know, I think just doing naive, like, hey, LLM, write me a question for this document, not a great strategy. However, we found that you can actually teach LLMs how to write queries.
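As a rough sketch of what "teaching" can look like, you can show the model a handful of real user queries as style examples instead of asking it cold. The prompt wording, the model name, and the example queries below are assumptions for illustration, not the exact setup from the report.

```python
from openai import OpenAI

client = OpenAI()

# A few real user queries used as style examples (invented here for illustration).
REAL_QUERY_EXAMPLES = [
    "wandb sweep not logging gpu metrics",
    "how resume run after crash",
    "difference between artifacts and files tab?",
]

def generate_query(document_text: str) -> str:
    prompt = (
        "Here are examples of real user queries. Notice how short, vague, and messy they are:\n"
        + "\n".join(f"- {q}" for q in REAL_QUERY_EXAMPLES)
        + "\n\nWrite ONE query in the same style that the following document would answer:\n\n"
        + document_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```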
These slides are getting a little bit cropped. I'm not sure why, but we'll make the most of it. To give you an example, this is actually an example from one of the MTEB benchmark data sets, kind of the golden data sets for embedding models. This also points to the fact that many of these benchmark data sets are overly clean, right?
What is a pergola used for in a garden? And then the beginning of that sentence is a pergola in a garden, dot, dot, dot, dot, dot. Real-world data is never this clean. So what we did in this report-- the link is in a few slides-- we did a huge deep dive into how can we actually align queries that are representative of real-world queries?
It's too easy to trick yourself into thinking that your system's working really well with synthetic queries that are overly specific to your data. And so what these graphs show is that we're actually able to semantically align the specificity of queries synthetically generated to real queries that users might ask of your system.
So what this enables is, you know, if a new, cool, sexy embedding model comes out and it's doing really well on its MTEB score and everybody on Twitter is talking about it, instead of just going into your code, changing it, guessing and checking, and hoping that it's going to work, you can now empirically say whether it's actually better or not for your data.
And, you know, the example here is quite contrived and simple, but you can actually look at the success rate. OK, great. These are the queries that I care about. Do I get back more of the right documents than I did before? If so, maybe you should consider switching.
Now, of course, you need to re-embed your data. The new service can be more expensive. It can be slower. The API for that service can be flaky. There are a lot of considerations, obviously, when making good engineering decisions. But having a clear north star, the success rate of how many of the right documents I get back for my queries, is super fast and super useful, and it makes improving your system much more systematic and deterministic.
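For example, a sketch of that comparison might loop over a few candidate models, re-embed the same corpus, and reuse the recall helper from the earlier sketch. The model names and the doc_ids and doc_texts placeholders are assumptions; swap in whatever providers and data you actually use.

```python
import os
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()

doc_ids = ["doc_17", "doc_42"]                  # placeholders: your corpus ids
doc_texts = ["<document text>", "<document text>"]

candidates = {
    "text-embedding-3-small": embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-small"
    ),
    "text-embedding-3-large": embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-large"
    ),
}

for name, ef in candidates.items():
    col = client.get_or_create_collection(f"docs_{name}", embedding_function=ef)
    col.add(ids=doc_ids, documents=doc_texts)        # re-embed the same corpus per model
    print(name, recall_at_k(col, golden_set, k=10))  # golden_set / recall_at_k from the earlier sketch
```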
All right, so we actually worked with Weights and Biases, looking at their chatbot, to kind of ground a lot of this work. So what you see here is, for the Weights and Biases chatbot, you can see four different embedding models. And you can see the recall at 10 across those four different embedding models.
And then I'll point out that blue is ground truth. So these are actual queries that were logged in Weave and then sent over. And then there's generated. These are the ones that are synthetically generated. And what we want to see is a few things. We want to see that those are pretty close.
And we want to see that they always stay in the same order of accuracy, right? We don't want to see any big flips between ground truth and generated. And we're really happy to say that's exactly what we found. Now, there are a few fun findings here, which, of course, are going to get cropped out, but that's OK.
Number one, the original embedding model used for this application was actually text-embedding-3-small. This actually performed the worst out of all the embedding models that we evaluated in this case, so it probably wasn't the best choice. The second one was that, if you look at MTEB, Jina embeddings v3 does very well on English.
It's, like, you know, way better than anything else. But for this application, it didn't actually perform that well. It was actually the Voyage 3 large model which performed the best. And that was empirically determined by actually running this fast eval and looking at your data. That's number three. All right, so if you'd like access to the full report, you can scan this QR code.
It's at research.trychroma.com. There's also an accompanying video, which is kind of screenshotted here, which goes into much more detail. There are full notebooks of all the code. It's all open source. You can run it on your own data. And hopefully this is helpful for all of you thinking about how, again, you can systematically and deterministically improve your retrieval systems.
And with that, I'll hand it over to Jason. Thank you. So if you're working with some kind of system, there's always going to be the inputs that we look at. And this is what we talked about, maybe thinking about things like retrieval, how does the embeddings work? But ultimately, we also have to look at the outputs, right?
And the outputs of many systems might be the outputs of a conversation that has happened, an agent execution that has happened. And the idea is that if you can look at these outputs, maybe we can do some kind of analysis that figures out what kind of products should we build?
What kind of portfolio of tools should we develop for our agents? And so forth. And so the idea is, you know, if you have a bunch of queries that users are putting in, or even a couple of hundred of conversations, it's pretty good to just look at everything manually, right?
Think very carefully about each interaction, and then only use these models when they make sense. Oftentimes, when I say that, people ask, you know, what if we just put everything into o3? And my answer there, generally, is to only use the language models if you think you're not smarter than the language model.
Then, when you have a lot of users and an actual good product, you might get thousands of queries, or tens of thousands of conversations. And now, you run into an issue where there's too much volume to manually review. There's too much detail in the conversations, and you're not really going to be the expert that can actually figure out what is useful and what is good.
And ultimately, with these long conversations, with tool calls and chains and reasoning steps, these outputs are now really hard to scan and really hard to understand. But there's still a lot of value in these conversations, right? If you use a chatbot, whether it's Cursor or something like Claude Code, oftentimes you do say things like, try again, this is not really what I meant, you know, be less lazy next time. It turns out a lot of the feedback you give is in those conversations, right?
We could build things like feedback widgets or thumbs up or thumbs down, but a lot of the information already exists in those conversations. The frustration and the retry patterns that exist can be extracted from those conversations. The idea is that the data really already exists in the conversation.
Think of a simple example from a different industry; we can imagine the analogy of marketing, right? Maybe we run our evals, and the number is 0.5. I don't really know what that means. Factuality is 0.6. I don't know if that's good or bad. Is 0.5 the average? Who knows?
But imagine we run a marketing campaign, and our, you know, ad metric, or our KPI is 0.5. There's not much we can do. But if we realize that 80% of our users are under 35, and 20% are over, and we realize that the younger audience performs well, and the older audience performs poorly, what we've done is that we've just drawn a line in the sand on who our users are.
And now we can make a decision. Do we want to double down on marketing to a younger audience, or do we want to figure out why we aren't successfully marketing to the older population, right? Do I find more podcasts to market to, you know, should I run a Super Bowl ad?
Now, just by drawing a line in the sand and deciding which segment to target, we can make decisions on what to improve, whereas just "make the ads better" is a very generic sentiment for people to have. And so one of the best ways of doing that is effectively just extracting some kind of data out of these conversations in some structured way, and then doing very traditional data analysis.
And so here we have a kind of object that says, I want to extract a summary of what has happened, maybe some tool that it's used, maybe the errors that we've noticed, the conversations that happened, maybe some metric for satisfaction, maybe some metric for frustration. The idea is that we can build this portfolio of metadata that we can extract.
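Here is one way that extraction object could look in code, with a Pydantic schema whose fields mirror the ones just mentioned, filled in by a structured-output call. The use of the instructor library and the model name are assumptions; any structured-output approach works.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class ConversationSummary(BaseModel):
    summary: str            # what happened in the conversation
    tools_used: list[str]   # tools the agent called
    errors: list[str]       # errors we noticed
    satisfaction: int       # e.g. 1 to 5
    frustration: int        # e.g. 1 to 5

client = instructor.from_openai(OpenAI())

def summarize(conversation_text: str) -> ConversationSummary:
    return client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_model=ConversationSummary,
        messages=[{
            "role": "user",
            "content": "Extract a structured summary of this conversation:\n\n" + conversation_text,
        }],
    )
```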
And then what we can do is we can embed this, find clusters, identify segments, and then start testing our hypotheses. And so what we might want to do is sort of build this extraction, put it into an LLM, get this data back out, and just start doing very traditional data analysis.
No different than any kind of product engineer or any kind of data scientist. So this tends to work quite well. You know, if you look at some of the things that Anthropic's Clio did, they basically found that code use was 40x more represented among Claude users than by, you know, its share of GDP value creation.
They go, okay, maybe code is a good avenue. And obviously that's not really the whole story, but the idea is that by understanding how your users use your product, you can figure out where to invest your time. And so this is why we built a library called Kura that allows us to summarize conversations, cluster them, build hierarchies of these clusters, and ultimately compare our evals across different KPIs.
Again, so now, you know, if factuality is 0.6, that's really hard to act on. But if it turns out that factuality is really low for queries that require time filters, right, or factuality is really high when queries revolve around, you know, contract search, now we know something's happening in one area and something else in another, and we can make a decision on what to do and how to invest our time.
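As a sketch of what that comparison looks like once each query carries a segment label, a plain pandas group-by is enough. The cluster names and scores below are made up for illustration.

```python
import pandas as pd

# One row per query: which segment it belongs to and its eval score.
df = pd.DataFrame({
    "cluster":    ["time_filter", "contract_search", "time_filter", "contract_search"],
    "factuality": [0.35, 0.92, 0.41, 0.88],
})

# A single global average hides this; the per-segment view tells you where to invest.
print(df.groupby("cluster")["factuality"].agg(["mean", "count"]))
```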
And the pipeline is pretty simple. We have models to do summarization, models to do clustering, and models that do this aggregation step. And so what you might want to do is just load in some conversations, and here we've made some fake data set, fake conversations from Gemini, and the idea is that first, we can extract some kind of summary model where there's topics that we discuss, frustrations, errors, et cetera.
We can then cluster them to find cohesive groups, and here we can find maybe, you know, some of the conversations are around data visualization, SEO content requests and authentication errors, and now we get some idea of how people are using the software. And then as we group them together, we realize, okay, really there's some themes around technical support.
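Here is a minimal sketch of that clustering step, assuming you already have the extracted summaries from earlier: embed each summary and group them. KMeans with an arbitrary cluster count stands in here for the hierarchy building that Kura does; it is not the library's actual API.

```python
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

# `conversation_summaries` is assumed to come from the extraction step above.
summaries = [s.summary for s in conversation_summaries]

response = client.embeddings.create(model="text-embedding-3-small", input=summaries)
vectors = [item.embedding for item in response.data]

labels = KMeans(n_clusters=8, random_state=0).fit_predict(vectors)  # 8 is arbitrary; tune for your data
for label, summary in zip(labels, summaries):
    print(label, summary[:80])
```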
Does the agent have tools that can do this well? Do we have tools to debug these database issues? Do we have tools to debug authentication? Do we have tools to do data visualization? That's something that's going to be very useful.
And at the end of this pipeline, we're presented with these printouts of clusters, right? And so we can talk about how the chatbot is being used at a higher level, you know, SEO content, data analysis, and at a lower level, you know, maybe blog posts and marketing.
And just by looking at this, we might have some hypotheses about what kind of tools we should build, how we should develop, you know, even our marketing, or how we might think about changing our prompts. We can do a ton of these kinds of things. And this is because the ultimate goal is to understand what to do next, right?
You do the segmentation to figure out what kind of new hypotheses you can have. And then you can make these targeted investments within these certain segments. If it turns out that, you know, 80% of the conversations I'm having with the chatbot are around SEO optimization, maybe I should have some integrations that do that.
Maybe I should reevaluate the prompts or have other workflows to make that use case more powerful for them. And again, the goal really is to just make a portfolio of tools, of metadata filters, of data sources that allows the agent to do its job. And oftentimes, the solution isn't really making the AI better.
It's really just providing the right infrastructure, right? A lot of times, if you find that a lot of queries use time filters and you just didn't add a time filter, that can probably improve your evals by quite a bit, right? We have situations where we wanted to figure out if contracts were signed, and if we just extracted one more step in the OCR process, now we can do these large-scale filters and figure out, you know, what data exists.
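For example, a time filter on retrieval can be as small as one extra where clause, assuming you've already extracted a timestamp into the document metadata. The `collection` object and the `signed_at` key here are hypothetical.

```python
# Query with a metadata time filter; `collection` and `signed_at` are assumptions.
results = collection.query(
    query_texts=["contracts signed last quarter"],
    n_results=10,
    where={"signed_at": {"$gte": 1719792000}},  # unix timestamp for the start of the window
)
```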
And generally, the practice of improving your applications is pretty straightforward, right? We all know to define evals, but not everyone that I work with has really been thinking about something like finding clusters and comparing KPIs across clusters. But once you do, then you can start making decisions on what to build, what to fix, and what to ignore.
Maybe you have a two-by-two of quadrants, right? Low usage and high usage on one axis, and high-performing evals and low-performing evals on the other. If a large proportion of your population is using tools that you are bad at, that is clearly the thing you have to fix. But if a large proportion of people are using tools that you're good at, that's totally fine.
If a small proportion of people do something that you're good at, maybe there's some product change you didn't make. Maybe it's about educating the user. Maybe it's adding some, you know, pre-filled or suggested questions to show them that we have these kinds of capabilities. And if there are things that nobody does, but when we do them, they're bad, maybe that's a one-line change in the prompt that says, "Sorry, I can't help you.
Go talk to your manager," right? These are now decisions that we can make just by looking at, you know, what proportion of our conversations are of a certain category, and whether or not we can do well in that category. And as you understand this, then you can go out, you can build these classifiers to identify these specific intents.
Maybe you build routers, maybe you build more tools. And then you can start doing things like monitoring and having the ability to do these group-bys. Right, so now you have different categories of query types over time, and you can just see what the performance looks like. Where 0.5 on its own doesn't really mean anything, whether or not a metric changes over time within a certain category can tell you a lot about how your product is being used.
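A sketch of that kind of monitoring, assuming each interaction is already tagged with a category and a success flag; the toy data here is invented.

```python
import pandas as pd

# Toy data: one row per interaction with a timestamp, category, and success flag.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-06-03", "2024-06-04", "2024-06-10", "2024-06-11"]),
    "category":  ["seo_content", "data_viz", "seo_content", "data_viz"],
    "success":   [1, 0, 1, 1],
})

weekly = (
    df.assign(week=df["timestamp"].dt.to_period("W"))
      .groupby(["week", "category"])["success"]
      .mean()
      .unstack("category")
)
print(weekly)  # one column per category; watch for a single column drifting down
```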
By doing this, we figured out that, you know, some customers, when we onboard them, they use our applications very differently than our historical customers, and we can now then make other investments in how to improve these systems. And ultimately, the goal is to create a data-driven way of defining the product roadmap.
Oftentimes, it is research that leads to better products now, rather than products justifying some research that we don't know is possible. And again, the real marker of progress is your ability to form high-quality hypotheses and your ability to test a lot of those hypotheses. And if you segment, you can make clearer hypotheses.
If you use faster evals, you can run more experiments. And by having this continuous feedback through monitoring, this is how you actually build a product. This is, regardless of being an AI product, this is just how you build a product. And so if you look at the takeaways, really, when you think about measuring the inputs, you really want to think about not using public benchmarks, building evals on your data, and focusing first on retrieval, because that is the only thing an LLM improvement won't fix.
If the retrieval is bad, the LLM will still get better over time, but you need to earn the right to sort of tinker with the LLM by having good retrieval. And then lastly, if you don't have any customers or any users, you can start thinking about synthetic data as a way of augmenting that.
And once you have users, look at your data as well. Look at the outputs, extract structure from these conversations, and understand how many conversations are happening, how often tools are being misused, what the errors are, and where people are frustrated. And by doing that, you can do this population-level data analysis, find these similar clusters, and have some kind of impact-weighted understanding of what the tools are.
It's one thing to say, you know, maybe we should build more tools for data visualization. It's another thing to say, hey boss, 40% of our conversations are around data visualization, and the code engine or the code execution can't really do that well. Maybe we should build two more tools for plotting, and then see if that's worth it.
And you can justify that, because we know there's a 40% of the population that's using data visualization, and we do that, you know, maybe only 10% of the time. Right, this is impact-weighted. And ultimately, as you compare these KPIs across these clusters, you can just make better decisions across your entire product development process.
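One way to make "impact-weighted" concrete is to weight each segment's failure rate by its share of traffic. The segment names and numbers here are the hypothetical ones from the example above, not real data.

```python
# share: fraction of conversations in the segment; success_rate: how often we do well there.
segments = {
    "data_visualization": {"share": 0.40, "success_rate": 0.10},
    "seo_content":        {"share": 0.30, "success_rate": 0.85},
    "auth_errors":        {"share": 0.05, "success_rate": 0.20},
}

def impact(seg):
    return seg["share"] * (1 - seg["success_rate"])  # traffic-weighted failure rate

for name, seg in sorted(segments.items(), key=lambda kv: impact(kv[1]), reverse=True):
    print(f"{name}: impact score {impact(seg):.2f}")
```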
So again, start small, look for structure, understand that structure, and start comparing your KPIs. And once you can do that, you can make decisions on what to fix, what to build, and what to ignore. If you want to find more resources, feel free to check out these QR codes.
The first one is the Chroma Cloud, to understand a little bit more about the research. And the second one is actually a set of notebooks that we've built out that go through this process. So we load the Weights and Biases conversations, we do this cluster analysis, and we show you how we can use that to make better product decisions.
So there's three Jupyter notebooks in that repo, check them out on your own time, and thank you for listening. We do have time for one quick question, and of course, outside as well. Perfect. Thank you. Thank you. If anybody wants to grab the mic there, and over there. Yep.
You're famous for spicy takes. What's the spicy take today? It's not KPIs, by the way; that's not the spicy take. I think more agent businesses should try to price their services on the work done rather than the tokens used. So, yeah, price on success, price on value. Very unrelated to this talk, but, you know.