How to look at your data — Jeff Huber (Chroma) + Jason Liu (567)

I'm Jeff Huber, the co-founder and CEO of Chroma. 00:00:34.560 |
And so there'll be QR codes and stuff throughout the talk. 00:00:37.600 |
So let's talk about how to look at your data. 00:00:43.340 |
You're all building stuff, and these questions probably come up all the time. 00:00:51.680 |
Is my embedding model the best embedding model for my data? 00:00:56.480 |
And our contention is that you can really only manage what you can measure. 00:01:00.060 |
Again, I think Peter Drucker is the original who coined that phrase. 00:01:10.640 |
The goal, I think, is to say: look at your data. 00:01:23.980 |
So I'm going to talk about part one, how to look at your inputs. 00:01:26.960 |
And then Jason's going to talk about part two, how to look at your outputs. 00:01:34.040 |
How do you know whether or not your retrieval system is good? 00:01:51.800 |
You could run big end-to-end evals where you're checking factuality and other metrics. 00:01:54.540 |
And they cost $600 and take three hours to run. 00:01:57.760 |
If that is your preference, you certainly can do that. 00:02:02.060 |
So you can look at things like MTEB to figure out, oh, 00:02:04.300 |
which embedding model is the best on English? 00:02:08.340 |
But our contention is you should use fast evals. 00:02:10.900 |
And I will tell you exactly what fast evals are. 00:02:16.620 |
A fast eval is simply a set of query and document pairs. 00:02:21.720 |
So the first step is to say: if this query is put in, these are the documents that should come back. 00:02:30.380 |
And then the way that you measure your system is to run each of those queries through it. 00:02:35.040 |
And then you see, do those documents come out? 00:02:38.200 |
And obviously, you can retrieve 5 or retrieve 10 or retrieve 20. 00:02:45.540 |
And this is very important because it enables you to run a lot of experiments quickly and cheaply. 00:02:50.640 |
I'm sure all of you know that experimentation time and your energy to do experimentation goes down significantly when you have to click, go, and then come back six hours later. 00:03:01.280 |
All of these metrics should run extremely quickly for pennies. 00:03:04.540 |
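To make that concrete, here is a minimal sketch of a fast eval built on Chroma. The collection, documents, and query-to-document pairs are invented for illustration, and recall@k here is the average fraction of expected documents that come back per query.

```python
# Minimal sketch of a "fast eval": (query -> expected document ids) pairs,
# scored by recall@k against a Chroma collection. Data is illustrative.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) for disk
collection = client.create_collection("docs")

# Your corpus: ids + chunk text (Chroma embeds with its default model here).
collection.add(
    ids=["doc-1", "doc-2", "doc-3"],
    documents=[
        "How to rotate API keys in the admin console.",
        "Billing: invoices are generated on the 1st of each month.",
        "Exporting experiment logs to CSV from the dashboard.",
    ],
)

# The fast eval itself: for this query, these documents should come out.
golden_pairs = {
    "how do I rotate my api key": {"doc-1"},
    "when do invoices get created": {"doc-2"},
}

def recall_at_k(pairs: dict[str, set[str]], k: int = 10) -> float:
    total = 0.0
    for query, expected_ids in pairs.items():
        result = collection.query(query_texts=[query], n_results=k)
        retrieved = set(result["ids"][0])
        total += len(expected_ids & retrieved) / len(expected_ids)
    return total / len(pairs)

# k=2 only because the toy corpus has 3 documents; retrieve 5, 10, or 20 in practice.
print(f"recall@2 = {recall_at_k(golden_pairs, k=2):.2f}")
```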
So maybe you don't have yet-- you have your documents, you have your chunks, you have your stuff in your retrieval system, but you don't have queries yet. 00:03:14.560 |
We found that you can actually use an LLM to write questions and write good questions. 00:03:20.560 |
You know, I think just doing naive, like, hey, LLM, write me a question for this document, not a great strategy. 00:03:27.260 |
However, we found that you can actually teach LLMs how to write queries. 00:03:31.060 |
These slides are getting a little bit cropped. 00:03:33.600 |
I'm not sure why, but we'll make the most of it. 00:03:36.200 |
To give you an example, this is actually an example from MTEB, one of the kind of golden benchmark data sets for embedding models. 00:03:46.140 |
This also points to the fact that, like, many of these benchmark data sets are overly clean, right? 00:03:52.580 |
And then the beginning of that sentence is a pergola in a garden, dot, dot, dot, dot, dot. 00:04:00.300 |
So what we did in this report-- the link is in a few slides-- we did a huge deep dive into how can we actually align queries that are representative of real-world queries? 00:04:11.040 |
It's too easy to trick yourself into thinking that your system's working really well with synthetic queries that are overly specific to your data. 00:04:17.680 |
And so what these graphs show is that we're actually able to semantically align the specificity of queries synthetically generated to real queries that users might ask of your system. 00:04:29.680 |
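The full recipe is in the report; as a rough sketch of the idea, the snippet below few-shots a chat model with a handful of real logged queries so that the generated queries match their style and specificity. The prompt wording, example queries, and model name are my assumptions, not the exact procedure from the report.

```python
# Sketch: generate synthetic queries aligned with real user queries by showing
# the model real examples (style, length, specificity). Illustrative only.
from openai import OpenAI

client = OpenAI()

REAL_QUERY_EXAMPLES = [          # pulled from your actual query logs
    "how do i log images to a table",
    "sweep agent keeps crashing on k8s",
    "can i resume a run from a checkpoint",
]

def generate_query(chunk: str) -> str:
    prompt = (
        "Here are examples of real queries our users ask:\n"
        + "\n".join(f"- {q}" for q in REAL_QUERY_EXAMPLES)
        + "\n\nWrite ONE query in the same style and level of specificity that "
        "the following document would be a good answer to. Return only the query.\n\n"
        f"Document:\n{chunk}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```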
So what this enables is, you know, if a new, cool, sexy embedding model comes out and it's doing really well in the MTEB score and everybody on Twitter is talking about it, 00:04:37.740 |
instead of just, you know, going into your code and changing it and guessing and checking and hoping that it's going to work, 00:04:43.380 |
you now can empirically say whether it's actually better or not for your data. 00:04:48.640 |
And, you know, the example here is quite contrived and simple, but you can look at the actual success rate. 00:04:55.880 |
Do I get back more documents than I did before? 00:05:00.320 |
Now, of course, you need to re-embed your data. 00:05:07.220 |
There are a lot of considerations, obviously, when making good engineering decisions, but having a clear north star like success rate, how many of the expected documents I get back for my queries, is super fast and super useful, 00:05:18.080 |
and makes improving your system much more systematic and deterministic. 00:05:22.580 |
All right, so we actually worked with Weights and Biases, looking at their chatbot, to kind of ground a lot of this work. 00:05:30.480 |
So what you see here is, for the Weights and Biases chatbot, you can see four different embedding models. 00:05:36.280 |
And you can see the recall at 10 across those four different embedding models. 00:05:40.480 |
And then I'll point out that blue is ground truth. 00:05:42.980 |
So these are actual queries that were logged in Weave and then sent over. 00:05:48.580 |
And then the others are the ones that are synthetically generated. 00:05:53.720 |
And we want to see that they always rank the models in the same order of accuracy, right? 00:05:57.720 |
We don't want to see any, like, big flips between ground truth and generated. 00:06:02.080 |
And we were really happy to see that that's exactly what we found. 00:06:05.460 |
Now, there are a few fun findings here-- and of course, these are going to get cropped as well. 00:06:11.660 |
Number one, the original embedding model used for this application was actually one of the OpenAI text-embedding models. 00:06:18.520 |
This actually performed the worst out of all the embedding models that we evaluated, just for this data. 00:06:25.760 |
The second one was that, actually, if you look at MTEB, Jina embeddings v3 does very well. 00:06:31.120 |
It's, like, you know, way better than anything else. 00:06:33.120 |
But for this application, it didn't actually perform that well. 00:06:36.500 |
It was actually the Voyage 3 large model which performed the best. 00:06:40.420 |
And that was empirically determined by actually running this fast eval and looking at your data. 00:06:46.540 |
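As a sketch of how a comparison like that can be run, the loop below builds one Chroma collection per candidate embedding function and scores each with the same fast eval. Only the OpenAI wrapper is spelled out; Jina, Voyage, and others plug in the same way through their own embedding functions. The data, API key, and names are placeholders.

```python
# Sketch: the same fast eval, looped over candidate embedding models.
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()

ids = ["doc-1", "doc-2"]
documents = ["How to rotate API keys.", "Invoices are generated monthly."]
golden_pairs = {"how do i rotate my api key": {"doc-1"}}  # query -> expected ids

candidates = {
    "chroma-default (all-MiniLM-L6-v2)": embedding_functions.DefaultEmbeddingFunction(),
    "text-embedding-3-small": embedding_functions.OpenAIEmbeddingFunction(
        api_key="sk-...", model_name="text-embedding-3-small"
    ),
    # Jina / Voyage / etc. slot in here via their embedding functions.
}

def recall(col, k: int = 2) -> float:
    total = 0.0
    for query, expected in golden_pairs.items():
        retrieved = set(col.query(query_texts=[query], n_results=k)["ids"][0])
        total += len(expected & retrieved) / len(expected)
    return total / len(golden_pairs)

for i, (name, ef) in enumerate(candidates.items()):
    col = client.create_collection(name=f"docs-{i}", embedding_function=ef)
    col.add(ids=ids, documents=documents)  # re-embedding happens here, per model
    print(f"{name}: recall = {recall(col):.2f}")
```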
All right, so if you'd like access to the full report, you can scan this QR code. 00:06:51.900 |
There's also an adjoining video, which is kind of screenshotted here, which goes into much more detail. 00:07:00.900 |
And hopefully this is helpful for you all thinking about how, again, you can systematically 00:07:04.900 |
and deterministically improve your retrieval systems. 00:07:12.940 |
So if you're working with some kind of system, there's always going to be the inputs that we put in. 00:07:17.140 |
And this is what we just talked about: thinking about things like retrieval, and how the embedding model performs on your data. 00:07:20.760 |
But ultimately, we also have to look at the outputs, right? 00:07:22.900 |
And the outputs of many systems might be the outputs of a conversation that has happened. 00:07:29.640 |
And the idea is that if you can look at these outputs, maybe we can do some kind of analysis 00:07:33.580 |
that figures out what kind of products should we build? 00:07:36.180 |
What kind of portfolio of tools should we develop for our agents? 00:07:40.980 |
And so the idea is, you know, if you have a bunch of queries that users are putting in, 00:07:45.340 |
or even a couple hundred conversations, it's pretty good to just look at everything by hand. 00:07:50.500 |
Think very carefully about each interaction, and then only use these models when they make sense. 00:07:56.100 |
And then oftentimes, if I say that, people say, you know, what if we just put everything into a language model? 00:08:00.160 |
And then here, generally, only use the language models if you think you're not smarter than the model. 00:08:06.100 |
Then, when you have a lot of users and an actual good product, you might get thousands 00:08:10.760 |
of queries, or tens of thousands of conversations. 00:08:13.160 |
And now, you run into an issue where there's too much volume to manually review. 00:08:17.160 |
There's too much detail in the conversations, and you're not really going to be the expert 00:08:20.620 |
that can actually figure out what is useful and what is good. 00:08:24.760 |
And ultimately, with these long conversations, with tool calls and chains and reasoning steps, 00:08:29.040 |
these outputs are now really hard to scan and really hard to understand. 00:08:32.520 |
But there's still a lot of value in these conversations, right? 00:08:36.280 |
If you use a chat bot, whether it's in Cursor or any kind of, like, Claude Code system, oftentimes, 00:08:41.000 |
you do say things like, try again, this is not really what I meant, you know, be less lazy 00:08:44.960 |
next time, it turns out a lot of the feedback you give is in those conversations, right? 00:08:50.380 |
We could build things like feedback widgets or thumbs up or thumbs down, but a lot of the real signal, 00:08:57.060 |
the frustration and the retry patterns that exist, can't be captured by those widgets; it has to be extracted from those conversations. 00:09:02.760 |
And the idea is that the data really already exists in this conversation. 00:09:07.240 |
Think of a simple example from a different industry, you know, we can imagine the analogy of a marketing campaign. 00:09:12.360 |
Maybe we run our evals, and the number is 0.5. 00:09:17.280 |
Factuality is 0.6, I don't know if that's good or bad. Is 0.5 the average? Who knows? 00:09:22.320 |
But imagine we run a marketing campaign, and our, you know, ad metric, or our KPI is 0.5. 00:09:29.180 |
But if we realize that 80% of our users are under 35, and 20% are over, and we realize that 00:09:34.740 |
the younger audience performs well, and the older audience performs poorly, what we've done 00:09:38.860 |
is that we've just drawn a line in the sand on who our users are. 00:09:43.040 |
Do we want to double down on marketing to a younger audience, or do we want to figure out 00:09:48.040 |
why we aren't successfully marketing to the older population, right? 00:09:52.480 |
Do I find more podcasts to market to, you know, should I run a Super Bowl ad? 00:09:56.540 |
Now, just by drawing a line in the sand and deciding which segment to target, we can make concrete decisions, 00:10:02.900 |
whereas "just make the ads better" is a sort of very generic sentiment that people can have. 00:10:08.620 |
And so one of the best ways of doing that is effectively just extracting some kind of data 00:10:12.540 |
out of these conversations in some structured way, and just doing very traditional data analysis. 00:10:16.800 |
And so here we have a kind of object that says, I want to extract a summary of what has happened, 00:10:21.840 |
maybe the tools that it used, maybe the errors that we've noticed in the conversations that happened, 00:10:26.720 |
maybe some metric for satisfaction, maybe some metric for frustration. 00:10:30.760 |
The idea is that we can build this portfolio of metadata that we can extract. 00:10:34.300 |
And then what we can do is we can embed this, find clusters, identify segments, and then start comparing our metrics across those segments. 00:10:42.800 |
And so what we might want to do is sort of build this extraction, put it into an LLM, get 00:10:47.140 |
this data back out, and just start doing very traditional data analysis. 00:10:50.220 |
No different than any kind of product engineer or any kind of data scientist. 00:10:56.080 |
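One plausible way to implement that extraction is a small schema plus a structured-output call, sketched below. The field names mirror the list above (summary, tools, errors, satisfaction, frustration), but the exact schema, prompt, and model choice are assumptions, not the talk's actual code.

```python
# Sketch: define the metadata you want per conversation and have an LLM fill it in.
from openai import OpenAI
from pydantic import BaseModel

class ConversationSummary(BaseModel):
    summary: str           # what the user was trying to do
    tools_used: list[str]  # tools / integrations the agent invoked
    errors: list[str]      # failures the user or the agent ran into
    satisfaction: int      # rough 1-5 judgment
    frustration: int       # rough 1-5 judgment (retries, "try again", etc.)

client = OpenAI()

def extract(conversation_text: str) -> ConversationSummary:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # assumption: any structured-output-capable model
        messages=[
            {"role": "system", "content": "Extract metadata from this agent conversation."},
            {"role": "user", "content": conversation_text},
        ],
        response_format=ConversationSummary,
    )
    return completion.choices[0].message.parsed
```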
You know, if you look at some of the things that Anthropic's Clio did, they basically found 00:11:00.080 |
that, you know, code use was 40x more represented by Claude users than by, you know, GDP value creation. 00:11:08.260 |
They go, okay, maybe code is like a good avenue. 00:11:10.080 |
And obviously that's not really the case, but the idea is that by understanding how your users 00:11:14.820 |
use your product, you can now figure out where to invest your time. 00:11:17.760 |
And so this is why we built a library called Kura that allows us to summarize conversations, 00:11:22.780 |
cluster them, build hierarchies of these clusters, and ultimately allow us to compare our evals across those clusters. 00:11:30.380 |
Again, so now, you know, if we have factuality at 0.6, that's really hard to act on, but if it turns 00:11:34.840 |
out that factuality is really low for queries that require time filters, right, or factuality 00:11:39.600 |
is really high when queries revolve around, you know, contract search, now we know something's 00:11:44.500 |
happening in one area, something's happening in another, and then we can make a decision about where to invest. 00:11:52.620 |
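In practice that comparison can be as simple as a groupby over per-query eval results, once each query carries a cluster label. The numbers below are made up purely to show the shape of the analysis.

```python
# Toy example: per-query eval scores joined with a cluster label, then grouped.
import pandas as pd

df = pd.DataFrame({
    "cluster":    ["time-filter", "time-filter", "contract-search", "contract-search", "other"],
    "factuality": [0.31, 0.42, 0.88, 0.91, 0.63],
})

by_cluster = df.groupby("cluster")["factuality"].agg(["mean", "count"])
print(by_cluster.sort_values("mean"))
# A flat 0.6 average would hide that time-filter queries are failing badly
# while contract-search queries are nearly perfect.
```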
We have models to do summarization, models to do clustering, and models that do this aggregation. 00:11:59.460 |
And so what you might want to do is just load in some conversations, and here we've made 00:12:02.460 |
some fake data set, fake conversations from Gemini, and the idea is that first, we can 00:12:09.660 |
extract some kind of summary model where there's topics that we discuss, frustrations, errors, and so on. 00:12:16.580 |
We can then cluster them to find cohesive groups, and here we can find maybe, you know, 00:12:19.960 |
some of the conversations are around data visualization, SEO content requests and authentication errors, 00:12:25.960 |
and now we get some idea of how people are using the software. 00:12:29.260 |
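Kura wraps this pipeline up; the underlying idea can be sketched with generic pieces: embed each conversation summary, cluster the embeddings, and read a few members of each cluster to name it. The snippet below uses OpenAI embeddings and scikit-learn k-means as stand-ins, and is not Kura's actual API.

```python
# Sketch: embed conversation summaries, cluster them, and eyeball each cluster.
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

summaries = [  # one summary per conversation, e.g. from the extraction step above
    "User wants a bar chart of weekly signups",
    "User hits an OAuth redirect error during login",
    "User asks for SEO keywords for a blog post",
    "User can't connect the agent to their Postgres database",
]

resp = client.embeddings.create(model="text-embedding-3-small", input=summaries)
X = np.array([item.embedding for item in resp.data])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for cluster_id in sorted(set(labels)):
    members = [s for s, label in zip(summaries, labels) if label == cluster_id]
    print(cluster_id, len(members), members[:3])  # name the cluster by reading a few members
```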
And then as we group them together, we realize, okay, really there's some themes around technical issues. 00:12:33.960 |
Does the agent have tools that can do this well? 00:12:35.340 |
Do we have tools to debug these database issues? 00:12:41.340 |
That's something that's going to be very useful. 00:12:43.340 |
And at the end of this pipeline, we're sort of presented with these printouts of clusters. 00:12:50.720 |
And so we're going to talk about how the chatbot is being used at a higher level, you know, 00:12:54.100 |
SEO content, data analysis, and at a lower level, 00:12:57.100 |
you know, maybe it's blog posts and marketing. 00:12:59.100 |
And just by looking at this, we might have some hypothesis as to what kind of tools we should 00:13:03.100 |
build, how we should, you know, develop even our marketing, or how we can think about the product. 00:13:09.600 |
And this is because the ultimate goal is to understand what to do next, right? 00:13:14.900 |
You do the segmentation to figure out what kind of new hypotheses that you can have. 00:13:18.900 |
And then you can make these targeted investments within these certain segments. 00:13:23.000 |
If it turns out that, you know, 80% of the conversation that I'm having with the chatbot 00:13:27.140 |
is around SEO optimization, maybe I should have some integrations that do that. 00:13:31.120 |
Maybe I should reevaluate the prompts or have other workflows to make that use case more successful. 00:13:35.860 |
And again, the goal really is to just make a portfolio of tools, of metadata filters, 00:13:41.000 |
of data sources that allows the agent to do its job. 00:13:44.920 |
And oftentimes, the solution isn't really making the AI better. 00:13:48.280 |
It's really just providing the right infrastructure, right? 00:13:51.400 |
A lot of times, if you find that a lot of queries use time filters and you just didn't add 00:13:55.120 |
a time filter, that can probably improve your evals by quite a bit, right? 00:13:58.880 |
We have situations where we wanted to figure out if contracts were signed, and if we just 00:14:02.960 |
added one more extraction step in the OCR process, now we can do these large-scale filters and aggregations. 00:14:10.100 |
And generally, the practice of improving your applications is pretty straightforward, right? 00:14:14.020 |
We all know to define evals, but not everyone that I work with has really been thinking about 00:14:18.300 |
something like finding clusters and comparing KPIs across clusters. 00:14:22.100 |
But once you do, then you can start making decisions on what to build, what to fix, and what to ignore. 00:14:27.160 |
Maybe you have a two-by-two set of quadrants, right? 00:14:30.240 |
Maybe we have low usage and high usage, and you have high-performing evals and low-performing evals, right? 00:14:37.240 |
If a large portion of your population are using tools that you are bad at, that is clearly the thing you have to fix. 00:14:43.240 |
But if a large proportion of people are using tools that you're good at, that's totally fine. 00:14:48.300 |
If a small proportion of people do something that you're good at, maybe there are some product changes you haven't made yet. 00:14:55.800 |
Maybe it's adding some, you know, pre-filled or suggested questions to show them that we have these kinds of capabilities. 00:15:01.300 |
And if there are things that nobody does, but when we do them, they're bad, maybe that's a one-line change in the prompt that says, 00:15:06.900 |
"Sorry, I can't help you. Go talk to your manager," right? 00:15:09.560 |
These are now decisions that we can make just by looking at, you know, what proportion of our conversations are of a certain category, 00:15:15.440 |
and whether or not we can do well in that category. 00:15:18.340 |
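Written out as code, that two-by-two is just a usage share and an eval score per cluster with a couple of thresholds. The cluster names, thresholds, and suggested actions below are illustrative.

```python
# The usage-vs-performance quadrants, as a tiny decision table per cluster.
clusters = {
    # name: (share of conversations, eval score for that cluster)
    "seo-content":        (0.40, 0.55),
    "data-visualization": (0.30, 0.85),
    "contract-search":    (0.05, 0.90),
    "time-filters":       (0.03, 0.30),
}

for name, (usage, score) in clusters.items():
    high_usage, good = usage >= 0.20, score >= 0.70
    if high_usage and not good:
        action = "fix first: lots of users, weak performance"
    elif high_usage and good:
        action = "healthy: keep monitoring"
    elif good:
        action = "promote: works well, but few users discover it"
    else:
        action = "deprioritize, or decline gracefully in the prompt"
    print(f"{name:>18}: {action}")
```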
And as you understand this, then you can go out, you can build these classifiers to identify these specific intents. 00:15:23.560 |
Maybe you build routers, maybe you build more tools. 00:15:26.300 |
And then you can start doing things like monitoring and having the ability to do these group-bys. 00:15:30.440 |
Right, so now you have different categories of query types over time, and you can just see what the performance looks like. 00:15:36.200 |
Right, where 0.5 doesn't really mean anything, but whether or not a metric changes over time across a certain category 00:15:41.400 |
can determine a lot about how your product is being used. 00:15:44.580 |
By doing this, we figured out that, you know, some customers, when we onboard them, they use our applications very differently 00:15:49.440 |
than our historical customers, and we can now then make other investments in how to improve these systems. 00:15:53.900 |
And ultimately, the goal is to create a data-driven way of defining the product roadmap. 00:15:59.080 |
Often times, it is research that leads to better products now, rather than products justifying some research that we don't know is possible. 00:16:06.840 |
And again, the real marker of progress is your ability to have a high-quality hypothesis, and your ability to test a lot of these hypotheses. 00:16:15.840 |
And if you segment, you can make clearer hypotheses. 00:16:19.840 |
If you use faster evals, you can run more experiments. 00:16:22.600 |
And by having this continuous feedback through monitoring, this is how you actually build a product. 00:16:27.600 |
This is, regardless of being an AI product, this is just how you build a product. 00:16:32.600 |
And so if you look at the takeaways, really, when you think about measuring the inputs, you really want to think about not using public benchmarks, building evals on your data, 00:16:40.360 |
and focusing first on retrieval, because that is the only thing an LLM improvement won't fix. 00:16:48.120 |
If the retrieval is bad, the LLM will still get better over time, but you need to earn the right to sort of tinker with the LLM by having good retrieval. 00:16:56.120 |
And then lastly, if you don't have any customers or any users, you can start thinking about synthetic data as a way of augmenting that. 00:17:01.880 |
And once you have users, look at your data as well. 00:17:04.880 |
Look at the outputs, extract structure from these conversations, understand how many conversations are happening, how often are tools being misused, what are the errors, and how are people frustrated. 00:17:15.880 |
And by doing that, you can do this population-level data analysis, find these similar clusters, and have some kind of impact-weighted understanding of what the tools are. 00:17:24.880 |
It's one thing to say, you know, maybe we should build more tools for data visualization. 00:17:29.640 |
It's another thing to say, hey boss, 40% of our conversations are around data visualization, and the code engine or the code execution can't really do that well. 00:17:38.480 |
Maybe we should build two more tools for plotting, and then see if that's worth it. 00:17:42.000 |
And you can justify that, because we know 40% of the population is using data visualization, and we do that well, you know, maybe only 10% of the time. 00:17:51.800 |
And ultimately, as you compare these KPIs across these clusters, you can just make better decisions across your entire product development process. 00:17:58.560 |
So again, start small, look for structure, understand that structure, and start comparing your KPIs. 00:18:05.320 |
And once you can do that, you can make decisions on what to fix, what to build, and what to ignore. 00:18:09.560 |
If you want to find more resources, feel free to check out these QR codes. 00:18:14.320 |
The first one is the Chroma Cloud to understand a little bit more about the research. 00:18:18.080 |
And the second one is actually a set of notebooks that we've built out that go through this process. 00:18:22.080 |
So we load the Weights and Biases conversations, we do this cluster analysis, and we show you how we can use that to make better product decisions. 00:18:29.080 |
So there's three Jupyter notebooks in that repo, check them out on your own time, and thank you for listening. 00:18:35.080 |
We do have time for one quick question, and of course, outside as well. 00:18:43.840 |
If anybody wants to grab the mic there, and over there. 00:18:55.600 |
It's not KPI, by the way, that's not the spice you take. 00:19:00.600 |
I think more agent businesses should try to price their services on the work done rather than the tokens used. 00:19:08.600 |
So, yeah, price on success, price on value, very unrelated to this talk, but, you know.