[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)

- We're a bit early, but I guess everyone's here, 00:00:51.840 |
- Are you, so people who have not done evals before, 00:01:10.340 |
- A little bit, all right, I'll stand on this side. 00:01:11.580 |
- He's trying to understand, like, how to approach it. 00:01:14.860 |
- Okay, so trying to understand how to approach evals. 00:01:28.520 |
- I think the struggle, one is to define the metrics that, 00:01:37.200 |
So first, it's hard to define what's the correct answer, 00:01:41.840 |
and two is, there are endless business questions, 00:01:51.980 |
versus I design a unit test, and I review it, 00:02:01.460 |
- Right, so the evals can be labor-intensive. 00:02:23.280 |
Agents depend on evals to provide feedback 00:02:26.880 |
from the world to let them know whether or not 00:02:30.180 |
And clearly, these are sophisticated and challenging problems, 00:02:35.340 |
You have to have experiments, this is like that. 00:02:41.700 |
- Yeah, we are all getting pulled into the world 00:02:43.720 |
of data science and machine learning in some ways, 00:02:51.640 |
Our engineering was part of the whole quality assurance effort, 00:02:55.920 |
and all these years people weren't happy that testing took, 00:03:00.180 |
I don't know, 30% of feature development time. 00:03:03.600 |
And now with AI, I think that evaluation takes 80% 00:03:13.020 |
And we are also trying to find ways to take shortcuts, 00:03:26.200 |
and you have to create many things from scratch 00:03:46.000 |
So basically, you said speak up, or speak low, is that? 00:04:03.080 |
you can just copy paste something that exists. 00:04:09.840 |
But now, in the world of AI, everyone has to do evals. 00:04:13.860 |
Everyone is spending a lot of time doing evals. 00:04:22.780 |
I'm looking for us to understand how synthetic data 00:04:33.940 |
So how much of this can impact the eval system? 00:04:41.500 |
There's the synthetic data, and there are the metrics. 00:04:55.460 |
We have a lot of skills along with the first comment. 00:05:13.840 |
Right, like, automatic evals are something everybody desires, 00:05:20.180 |
because you don't want human beings to sit and do it. 00:05:26.700 |
Typically, for most things, you can lean on AI to do human tasks. 00:05:31.580 |
For some reason, evals tend to be hard for LLMs to do right. 00:05:39.060 |
We're not going to solve all your eval problems here. 00:05:41.960 |
I think this is a very hard research area that we would probably continue working on in the next few months. 00:05:49.600 |
But David and I were at Google for over a decade, and we have been doing evals for a long time. 00:06:06.560 |
And most of the core effort in Google was to do a good job of evaluating where we are and improving it. 00:06:14.560 |
And some of the techniques we learned at Google we are bringing in here. 00:06:18.560 |
So we built a product around it, a bunch of technology around it, but really, the biggest takeaway I want people to have is the methodology and the stuff that we have learned, whatever manifestation it takes. 00:06:32.160 |
So I'm hoping that you get to play with some of our stuff, but also to learn some good ideas. 00:06:42.640 |
I mean, one thing I would add is at Google, we used to call this quality, and evals was part of it. 00:06:47.160 |
And there was this constant idea like benchmarking and exhausting a benchmark and then moving on to the next benchmark. 00:06:52.760 |
Whenever we work with clients right now, we take a similar approach. 00:06:55.520 |
So like I mentioned, so much of evals is just good methodology, setting up benchmarks, trying to figure out the metrics that work, calibrating metrics with humans, calibrating metrics with user data. 00:07:04.600 |
What I'm trying to say is that it can get arbitrarily complex. 00:07:07.480 |
Like, evals is not like, oh, there's one way to do it, and then you're done. 00:07:10.080 |
It's just part of how you do development, which, you know, everybody's been struggling with because we've all been trying to learn what it means to develop on top of this new stack. 00:07:17.480 |
But again, today, in today's session, we're just hoping to bring you some of these ideas where, like, methodologies that make sense, like how to think about your metrics. 00:07:25.400 |
So I would challenge you: Google Search uses around 300 metrics. 00:07:27.680 |
So, like, you start to think about, like, how can you expand the scope of how you're doing this work? 00:07:32.120 |
And, you know, part of what we're doing is trying to build technologies that make it easy to adopt these methodologies, 00:07:36.920 |
how to create benchmarks that are interesting, how to, like, make them harder. 00:07:45.720 |
We'll show a few slides, and then we'll dive into the code, because I think the best way to learn about these things is just to do them. 00:07:51.720 |
I think as we go, we'd love to just pass it around, because, you know, again, people are at different parts of their journeys. 00:07:58.720 |
For some people, it's like, hey, I don't have metrics, and I'll just develop my own metrics. 00:08:01.720 |
We work with clients that have metrics, but they want to, for example, correlate them to user behavior. 00:08:05.320 |
They have a lot of thumbs up, thumbs down data. 00:08:07.320 |
So that goes all the way into a feedback loop. 00:08:09.320 |
So you see, like, evals is not just one thing related to testing. 00:08:12.320 |
It goes all the way into your online system and your feedback loops and all that. 00:08:15.320 |
So, but hopefully, a lot of the mental models today will help you kind of gauge that. 00:08:24.920 |
So you can join there on the Slack, and then we'll just be posting things, and then we can have discussions and continue the conversations even after the workshop. 00:08:32.920 |
And the second one is we have a document that will have all the steps of the workshop, all the places where you can get the code, where you can get the sheet. 00:08:45.720 |
You will land in a Google Doc, so there's a lot of links. 00:08:48.720 |
The Google Doc also has a link to the slide deck we're presenting, so you can have it. 00:08:52.120 |
We'll keep it online even after the workshop, so you guys can reference it. 00:08:55.320 |
All right, we'll get started to keep you guys on time and maximize sort of coding time. 00:09:03.120 |
A couple of other people will just walk around. 00:09:05.120 |
If you guys get stuck on any step, if you're having trouble with anything, just raise your hands, and then we'll come over. 00:09:10.920 |
All right, I'll give you a very, like, maybe five-minute quick blurb, show a quick demo, and then get started. 00:09:26.320 |
I think we went through this, but basically, like, most people start with vibe testing, which is, like, you know, try it out, see how it works, change prompts. 00:09:34.720 |
And honestly, for many applications, you can go pretty far with just that. 00:09:42.120 |
I think agent systems, as you were saying, are more complex because with multiple steps, things fail more often. 00:09:53.120 |
Typically, most companies don't bother setting up human reader evals. 00:10:00.120 |
Code-based evals is where I think the majority of the people are spending time. 00:10:04.120 |
They're writing some sort of a code to test some verifiable things that they can. 00:10:10.120 |
People are moving into natural language, like LLM-as-a-judge type things, where it gets -- this is not a task LLMs are very good at. 00:10:17.520 |
We'll get into some of it, like, with typical decoder models, the generative AI models. 00:10:22.520 |
And so you kind of have to fight against -- these models are designed to be creative. 00:10:27.120 |
They're designed to be -- that's not what you want from a judge, typically. 00:10:33.120 |
The scoring system is this idea that you're not trying hard to build a comprehensive set of metrics from the beginning. 00:10:45.520 |
Maybe you start with five, ten signals that you know for a fact are correlated with goodness. 00:10:52.520 |
They're very simple signals that you can easily derive. 00:10:55.520 |
And then, over time, you build upon those as you see problems, as you debug. 00:11:01.520 |
And then you test your application, whether it works well or not. 00:11:08.920 |
So that makes it a much more of a feedback loop type process. 00:11:12.920 |
That's the thing that we're trying to introduce in terms of methodology. 00:11:19.520 |
Yeah, one thing I'll add to that is, like, there's no right or wrong. 00:11:22.520 |
Like, vibe testing actually gets you a very long way. 00:11:27.520 |
Maybe you want to turn that off, and then we can switch the mic. 00:11:30.520 |
Yeah, so just to say, like, there's no right or wrong. 00:11:35.920 |
I think you should absolutely start with vibe testing. 00:11:37.520 |
A lot of people use tracing, and they just monitor traces. 00:11:39.920 |
I think as your system scales, you do want to get a little bit more sophisticated. 00:11:43.920 |
So you start to layer those things in order of complexity. 00:11:46.920 |
That's a lot of how we talk about quality, generally: this idea that, like, there are complex 00:11:50.920 |
things, but they take some amount of investment and give some amount of return. 00:11:56.320 |
Some things are very cheap to do, and then you do them. 00:11:58.320 |
And as your system scales, they become impractical, and then you just layer in more techniques. 00:12:02.320 |
So there's not a statement that, like, any of these things are good or bad. 00:12:06.320 |
So think of these things as tools in your toolbox. 00:12:08.320 |
But eventually, what you really want is a scoring system. 00:12:11.320 |
What that manifests as for you, you should be the judge of. 00:12:14.120 |
But, like, this is one thing to take away: just keep layering in your tools. 00:12:17.720 |
And with increasing levels of sophistication, you probably get to a system that will look 00:12:23.720 |
Yeah, and as David was saying, like, evals are such an investment. 00:12:30.720 |
So you might ask, like, well, why am I spending so much time on just evals? 00:12:35.720 |
And I think one of the things that was core to Google, and I think this industry is going 00:12:40.920 |
to adopt this over time, is that evals are actually the only place you're going to spend 00:12:45.320 |
most of your time, because that's where domain knowledge is going to live. 00:12:48.520 |
Everything else just works off of those evals. 00:12:51.320 |
And so, for example, if you have really good evals, you don't have to write prompts. 00:12:57.320 |
You can write-- you can find problems and use metaprompts or optimizers like DSPy 00:13:05.320 |
You can filter out synthetic data and then use that for fine-tuning if you're really interested in fine-tuning, 00:13:12.120 |
But you can also use these techniques online. 00:13:14.920 |
One of the things that you're going to get to try today is a very simple but very effective technique. 00:13:22.920 |
Google used it extensively. It's the idea that you crank up the temperature, 00:13:26.920 |
you generate a bunch of responses instead of just one response. 00:13:29.920 |
You can think of this as online reinforcement learning, generate four or five responses, 00:13:34.320 |
and then score those responses online and see which one's the best one. 00:13:37.920 |
And you get a pretty decent lift just by doing this, like without actually doing any 00:13:44.320 |
So these are the kinds of things you can do when you have a really good scoring system that you can lean on. 00:13:50.320 |
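(As a rough sketch of that best-of-N idea: the `generate` and `score` functions below are hypothetical stand-ins for your own LLM call and for the scoring system you build, not any specific API.)

```python
def generate(prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical stand-in for your LLM call, run at a raised temperature."""
    raise NotImplementedError

def score(response: str) -> float:
    """Hypothetical stand-in for your scoring system; returns a 0-1 quality score."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 4) -> str:
    # Sample several candidates at high temperature, score each one online,
    # and return the highest-scoring response.
    candidates = [generate(prompt, temperature=1.0) for _ in range(n)]
    return max(candidates, key=score)
```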
So you get to try some of these techniques today. 00:13:52.520 |
But the key point is don't think of evals as testing in the classic sense. 00:13:58.320 |
Think of these as the primary place where domain knowledge lives. 00:14:02.720 |
And then one of the things that I think one of the attendees was pointing out is that right now, 00:14:10.720 |
at this point in the industry, we have figured out a handful of standard evals. 00:14:14.720 |
Like you can think of simple helpfulness, harmfulness, hallucinations. 00:14:23.120 |
That's good enough to just sort of do a guardrail and make sure you're not doing anything wrong. 00:14:27.120 |
But we're moving into the world where we want to build really good applications. 00:14:31.120 |
Like for example, if you're trying to build a trip plan, 00:14:33.120 |
you're curious about like, you know, how to make that trip plan really be perfect. 00:14:38.720 |
Like, one of the things I really hate about trip plans is if they're, like, not interesting enough. 00:14:43.520 |
I'm looking at a trip plan, I want to be excited about this place. 00:14:45.920 |
Typically when LLMs give me a trip plan, it's very kind of, very plain-ish, right? 00:14:51.520 |
So now you're trying to build these sort of nuances in your applications. 00:14:54.720 |
And that's the kind of stuff if you want to evaluate, that's where the industry is going to go. 00:14:58.920 |
So how do you build like these much more nuanced evals? 00:15:02.520 |
That's one of the places where these traditional evals fail. 00:15:06.520 |
So as I went over, like start simple, iterate, see what's broken, and then improve, right? 00:15:13.520 |
And so the other thing you get to try today is, in a co-pilot-like setting, start with something, 00:15:20.520 |
test it out with a handful of examples, maybe generate some synthetic examples, 00:15:24.320 |
test it out with good examples, bad examples, see what's working, what's not, and then iterate. 00:15:28.920 |
We've picked a relatively simple example for today for workshop purposes. 00:15:33.920 |
But you can absolutely go and try it on fairly complex things and there you can see a lot 00:15:40.920 |
So, yeah, so there are two parts of today's workshop. 00:15:45.920 |
The first part is setting up your scoring system. 00:15:50.920 |
And the second part is once you have set up the scoring system and played with it and iterated 00:15:54.920 |
on it in the co-pilot, trying to use it in a colab. 00:15:57.920 |
You would need a Google account and some proficiency in working with colabs and Python code. 00:16:05.920 |
We'll also introduce a spreadsheet component to this so that you can actually try to play around 00:16:09.920 |
with this in a spreadsheet so that you can easily make changes and test things out. 00:16:17.920 |
One last thing I want to hit on before we jump into the workshop is what is the scoring system? 00:16:27.920 |
I'd like to just give you a mental framing of it. 00:16:38.920 |
When Google does a search, what it's trying to do is it's scoring every document and seeing 00:16:43.920 |
whether this is good or not and then giving you the best document that's the best for you. 00:16:47.920 |
And that's not that different from checking and scoring an LLM-generated content. 00:16:52.920 |
And the way Google does it is by breaking this problem down into a ton of signals. 00:16:58.920 |
So you can imagine you're looking at SEO, you can look at document popularity, you're looking 00:17:02.920 |
at title scores, whether the content is good or not, maybe feasibility of things, spam, clickbaitiness, 00:17:11.920 |
And then it brings all of these signals together into a single score that combines these ideas. 00:17:16.920 |
Individual signals that you have are very easy to understand. 00:17:22.920 |
It's not a complex prompt with some random score. 00:17:24.920 |
It's something that you can easily understand. 00:17:26.920 |
But it all comes together into a single score. 00:17:29.920 |
And that's the idea we are sort of bringing in. 00:17:32.920 |
At the bottom level, things are very objective, tend to be deterministic, sometimes just Python code. 00:17:39.920 |
But as you bring this up to the top, it becomes fairly subjective. 00:17:43.920 |
And you can bring these together into very complex ways. 00:17:46.920 |
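(A minimal sketch of that layering, assuming a couple of hypothetical signals for the meeting-summary example: deterministic Python checks at the bottom, a subjective model-scored dimension on top, combined with hand-picked weights.)

```python
import json

def valid_json(output: str) -> float:
    # Objective, deterministic signal: does the output parse as JSON?
    try:
        json.loads(output)
        return 1.0
    except ValueError:
        return 0.0

def has_action_items(output: str) -> float:
    # Another cheap, objective signal for the meeting-summary example.
    return 1.0 if "action_items" in output else 0.0

def covers_key_insights(output: str) -> float:
    """Subjective dimension; imagine this being answered by a scoring model."""
    raise NotImplementedError

def overall_score(output: str) -> float:
    # Bring the simple signals together into one top-level score.
    weighted_signals = [
        (valid_json, 0.4),
        (has_action_items, 0.3),
        (covers_key_insights, 0.3),
    ]
    return sum(weight * signal(output) for signal, weight in weighted_signals)
```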
I would just encourage you to think about why this could work. 00:17:54.920 |
Some of what you will be struggling with in evals is just, like, you're not measuring the whole thing. 00:17:59.920 |
So you have a little bit of a comprehensiveness issue. 00:18:01.920 |
And so the question is like how many metrics do you need to add? 00:18:04.920 |
So I gave the example: Google Search uses around 300 signals. 00:18:07.920 |
Now maybe you don't want to be that sophisticated. 00:18:13.920 |
They care about content relevance, spam, porn seeking. 00:18:16.920 |
They have all these classifiers, all these ways they understand the content to then bring it 00:18:21.920 |
What this gives you is, like, real visibility into your application and more and more ways 00:18:26.920 |
So if instead you just go and say, hey, is this a helpful response? 00:18:30.920 |
Mostly what you're doing is delegating that eval to the LLM itself, right? 00:18:35.920 |
Or to a rater who you're asking like, hey, see if this is good. 00:18:38.920 |
But when you break this all down, you get some really nice properties where like your variance 00:18:42.920 |
goes down a lot just because you're measuring way more objective things. 00:18:45.920 |
So things are not like going back and forth all the time. 00:18:48.920 |
And it's very precise because like you get all these things and you add them together to a more 00:18:53.920 |
And when you are analyzing the data, then you can like slice and dice by way finer grained things. 00:18:58.920 |
And that's kind of like why the system tends to work much better. 00:19:01.920 |
And the best part of it is as you iterate, you just add more signals. 00:19:04.920 |
Like this now doesn't leave you as like, oh, I either have evals or I don't have evals. 00:19:09.920 |
But like rather you just have a set of metrics that you keep adding over time as you discover 00:19:31.920 |
Okay, so I'll just quickly show you a demo of where you're going to start today. 00:19:36.920 |
Give you some sort of basic ideas of what the kinds of things you would be doing. 00:19:41.920 |
And then I'll just, we should just get started, right? 00:19:44.920 |
And so this is a co-pilot that helps you put together evals. 00:19:50.920 |
Where I'm starting is basically just a system prompt. 00:19:53.920 |
This application is basically a simple meeting summarizer. 00:19:56.920 |
It takes a meeting transcript, with multiple people talking, and then generates some sort of structured JSON at the end, 00:20:03.920 |
which is a summary with very specific action items, key insights, and then a title, right? 00:20:13.920 |
It's easy for you to inspect and see where things are not going, working and not working well. 00:20:17.920 |
So typically, you would start with something like a system prompt. 00:20:21.920 |
You can also start with examples, if you have a bunch of examples. 00:20:27.920 |
And the first step is it would try to use this to build your scoring system. 00:20:34.920 |
Yeah, and this is doing exactly what was in that slide, which is trying to say, like, from that coarse-grained subjective thing, 00:20:44.920 |
what are all the smaller things I can suss out of it? 00:20:49.920 |
Like, if you drop this into a ChatGPT or so, we just try to replicate that experience where you get these artifacts on the right-hand side, 00:20:58.920 |
But if you think you just want to do it, like, iteratively through your favorite, like, sort of chat interface, you can also do it. 00:21:06.920 |
Can we do it in reverse, where we provide examples of... 00:21:11.920 |
So right in front, there is an example button. 00:21:14.920 |
You can give it, actually, hundreds of examples if you want, or 20 examples. 00:21:18.920 |
And then it gets into a much more complex process of, like, figuring out these dimensions based on examples. 00:21:23.920 |
Or you can just copy-paste one or two examples in the prompt itself, and it will generate it. 00:21:32.920 |
It starts you somewhere, and now you're going to iterate over it. 00:21:35.920 |
And that's the exercise you will spend time on. 00:21:37.920 |
I'll show you a few things, like, this is your scoring system. 00:21:40.920 |
These are your individual-- these are individual dimensions, is what we call them. 00:21:50.920 |
Effectively, for example, this is a-- does the output include any insights from the meeting? 00:21:59.920 |
Or you can have code, just Python code that we have generated. 00:22:05.920 |
This is actually as simple as what this says. 00:22:11.920 |
When you look at the actual code, the code is effectively just a bunch of questions. 00:22:16.920 |
And you're sending these questions to our specialized foundation models that are designed for scoring and evaluation. 00:22:23.920 |
So you'll get to play with this a bunch in the colab. 00:22:28.920 |
There's another thing that you would notice that there's this idea of critical, major, minor. 00:22:33.920 |
These are just weights for you to control what's important, what's not. 00:22:36.920 |
The combination of this is done through a mathematical function that you can learn over time. 00:22:41.920 |
So actually, eventually, you would give it a bunch of examples and it'll learn it. 00:22:44.920 |
But in this particular exercise, you have a little bit more control. 00:22:47.920 |
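(One way to picture those severity labels, purely as an illustration: map critical/major/minor to weights, with critical dimensions able to gate the whole score. The weight values and the gating rule here are assumptions; in the actual system the combination function is learned from examples.)

```python
# Illustrative weights only; the real combination function is learned from data.
SEVERITY_WEIGHTS = {"critical": 1.0, "major": 0.5, "minor": 0.2}

def combine(scores: dict[str, float], severities: dict[str, str]) -> float:
    # If a critical dimension fails outright, fail the whole example.
    if any(severities[name] == "critical" and s == 0.0 for name, s in scores.items()):
        return 0.0
    # Otherwise take a severity-weighted average of the dimension scores.
    total_weight = sum(SEVERITY_WEIGHTS[severities[name]] for name in scores)
    weighted = sum(SEVERITY_WEIGHTS[severities[name]] * s for name, s in scores.items())
    return weighted / total_weight if total_weight else 0.0
```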
And finally, once you have your scoring system done, there is a way for you to integrate it into Google Sheets. 00:22:55.920 |
That's the thing that you're going to play around with today, which is basically taking this criteria, moving it to a Google Sheet, 00:23:04.920 |
So here, you're going to work with synthetic examples. 00:23:06.920 |
You'll develop your scoring system, and then, completely blinded to this, 00:23:10.920 |
we have a labeled data set where users have given thumbs up or thumbs down on the summaries. 00:23:15.920 |
You're going to apply that scoring system to it and see how well it aligns with real thumbs up, thumbs down. 00:23:23.920 |
You're building your own scoring system and seeing whether it aligns or not. 00:23:28.920 |
And this is a really interesting point and why evals start to get hard. 00:23:31.920 |
We call this workshop like solving the hardest challenge, which is metrics that actually work. 00:23:35.920 |
So this idea of correlation ends up being really, really important. 00:23:38.920 |
Metrics that work are not necessarily good metrics or bad metrics. 00:23:41.920 |
They're either calibrated metrics or uncalibrated metrics. 00:23:44.920 |
At Google, for example, we had a lot of data scientists that we worked with because they would do all these correlation analyses and confusion matrices and such. 00:23:52.920 |
So part of the challenge of good evals is just getting comfortable with the numerical aspect of these things. 00:23:59.920 |
And, of course, again, having a scoring system that dissects things into much simpler things makes it easier to analyze. 00:24:05.920 |
But you still have to think about those things. 00:24:07.920 |
Like, does this actually correlate with goodness? 00:24:09.920 |
Like, if it gives a high score, is this actually a good score? 00:24:13.920 |
But the good news is, as I showed in the previous slide, once you have metrics you can trust, like, almost all of the rest of your stack gets radically simplified as a result. 00:24:26.920 |
Maybe a place to start would be, you know, generate an example. 00:24:30.920 |
This is a synthetic generation happening behind the scenes. 00:24:35.920 |
It's taking your system prompt and other information and trying to generate some sort of an example. 00:24:41.920 |
And then this example that is generated is scored. 00:24:46.920 |
And you can see how these individual scores work. 00:24:48.920 |
Now you can start kind of testing this a little bit more. 00:24:57.920 |
And then it will try to generate an example that's broken in some particular way. 00:25:02.920 |
The co-pilot understands what you've done so far. 00:25:06.920 |
It's using all of this information to kind of, like, sort of walk you through this. 00:25:10.920 |
So in this particular case, it created something that is missing a bunch of information. 00:25:16.920 |
But you can even do things like, you know, specific things like, can you create an example that has broken JSON? 00:25:26.920 |
So this is, like, basically example generation. 00:25:31.920 |
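(Broken examples like these are also easy to create yourself; a sketch of corrupting a known-good summary, where the `key_insights` field name is just taken from this example app.)

```python
import json

def make_broken_json(good_output: str) -> str:
    # Start from a valid summary and truncate it so it no longer parses.
    json.loads(good_output)                      # sanity check: input really is valid JSON
    return good_output[: len(good_output) // 2]  # truncation breaks the structure

def make_missing_insights(good_output: str) -> str:
    # Drop the key_insights field to simulate a summary that misses information.
    data = json.loads(good_output)
    data.pop("key_insights", None)               # field name assumed from the example app
    return json.dumps(data)
```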
The other thing you can do is you can make changes to your scoring system itself through the co-pilot. 00:25:37.920 |
So you can go and ask the co-pilot to make changes to your scoring system -- this is a broken JSON example. 00:25:44.920 |
You can go to the co-pilot and either ask it to change the Python code itself. 00:25:48.920 |
You can say, you know, can you update the Python code, and so on, for any of these things. 00:25:54.920 |
You can also ask it to remove or add dimensions. 00:25:56.920 |
Maybe you can say, can you generate a dimension that checks that the title is less than 20 words? 00:26:10.920 |
And then the co-pilot can add these dimensions. 00:26:14.920 |
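(A dimension like that title check can be plain Python; a minimal sketch, where the 20-word limit and the `title` field name come from this example.)

```python
import json

def title_under_20_words(output: str) -> float:
    # Returns 1.0 if the summary's title is under 20 words, 0.0 otherwise.
    try:
        title = json.loads(output).get("title", "")
    except (ValueError, AttributeError):
        return 0.0  # unparseable or unexpected output fails the check
    return 1.0 if len(title.split()) < 20 else 0.0
```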
So let's say we add a feature to the product -- the trip planner also now does bookings for you. 00:26:24.920 |
...at a higher level, because now it's more like an agent. 00:26:31.920 |
You just ask the co-pilot and say, I've updated. 00:26:34.920 |
Can you add new dimensions to test these new examples? 00:26:37.920 |
Or we have changed things and it automatically, I mean... 00:26:45.920 |
The other thing I would say is the co-pilot is very helpful if you want this sort of human in a loop type process. 00:26:51.920 |
But once you get comfortable with the system, what we see people doing is just use CoLab to send large amounts of data and let our system figure things out on its own. 00:27:05.920 |
Personally, I find it very good to play with my scoring system here every so often to see if it's like working well. 00:27:11.920 |
Like maybe paste an example from a user and see how well it's working and so on and so forth. 00:27:18.920 |
But most of our clients actually just fire off these long-running processes. 00:27:30.920 |
So it just created this new title length thing for you. 00:27:34.920 |
So you can sort of play around with this and it'll... 00:27:38.920 |
You can remove dimensions and so on and so forth. 00:27:41.920 |
The last thing I would show before we get going is one of the things that you would do is... 00:27:46.920 |
We have pre-filled a spreadsheet in the workshop directions. 00:27:54.920 |
And we have a spreadsheet integration where you can run our score inside a spreadsheet itself. 00:27:59.920 |
You see there are other sort of places where you can use it. 00:28:02.920 |
I'm not going to get into this, but you can use it for reinforcement learning, for example, using an Unsloth integration. 00:28:07.920 |
But there are other places you can actually integrate with the Pi scorer. 00:28:13.920 |
But basically in the sheets integration, what you are going to do in this workshop is... 00:28:19.920 |
You can actually just copy this wholesale and put it into a new spreadsheet. 00:28:24.920 |
With the examples that are here, what we are trying to do is just copy the criteria. 00:28:33.920 |
And then go to the spreadsheet that we will... 00:28:37.920 |
Replace the criteria that exists there with this criteria that you've created. 00:28:42.920 |
And then under extensions, these directions are all in the docs. 00:28:47.920 |
But under extensions, you can go and call the score. 00:28:50.920 |
The score is going to run across about 120 examples. 00:28:53.920 |
And then you'll see a confusion matrix which shows you how many times there's alignment on thumbs up. 00:28:59.920 |
How many times there's alignment on thumbs down. 00:29:02.920 |
And then you can play around in the spreadsheet itself. 00:29:04.920 |
Make changes to the dimensions and see if you can bring the alignment closer. 00:29:10.920 |
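(The same alignment check is easy to reproduce outside the spreadsheet; a sketch that assumes you already have per-example scores plus the thumbs-up/thumbs-down labels, with 0.5 as an arbitrary threshold.)

```python
from collections import Counter

def confusion_matrix(scores: list[float], thumbs_up: list[bool], threshold: float = 0.5) -> dict:
    # Compare the scorer's verdict (score >= threshold) against user thumbs up/down.
    counts = Counter((score >= threshold, label) for score, label in zip(scores, thumbs_up))
    return {
        "true_positive": counts[(True, True)],    # scorer likes it, user liked it
        "false_positive": counts[(True, False)],  # scorer likes it, user did not
        "false_negative": counts[(False, True)],  # scorer rejects it, user liked it
        "true_negative": counts[(False, False)],  # both agree it is bad
    }
```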
Yeah, so this is what a spreadsheet would look like. 00:29:18.920 |
The spreadsheet form of the English that you saw. 00:29:33.920 |
So this is what your criteria sheet looks like right now. 00:29:35.920 |
This is the default one that we put in there for you. 00:29:40.920 |
This basically has actual feedback, which is thumbs up, thumbs down that we got from users. 00:29:48.920 |
So it's all kinds of different sort of mix of things. 00:29:50.920 |
And what you're going to do is you're going to select these two rows. 00:29:55.920 |
And then in the extensions under here, you will have score selected ranges, which is going 00:30:01.920 |
And then you can look at the confusion matrix and see how it works. 00:30:07.920 |
But you can easily go and make changes here or even test out by making changes to the data, 00:30:12.920 |
you know, messing up your JSON and seeing how that impacts things and so on and so forth. 00:30:18.920 |
So that's the first phase of our co-pilot, of our workshop. 00:30:23.920 |
And we'll talk about the second phase when we get started with that one. 00:30:28.920 |
I'm just curious, like, what the best practices are for using this in production. 00:30:31.920 |
When you have, you know, tens of thousands, maybe hundreds of thousands of examples. 00:30:42.920 |
We will hit a lot of that in the second part of our workshop. 00:30:52.920 |
The details and best practices on how you do it. 00:30:56.920 |
But you will get to play with this in this workshop. 00:31:01.920 |
Our scorers are specifically designed for online workflows. 00:31:06.920 |
These, like, 20 dimensions that you have -- they score them all in, like, sub-50 milliseconds. 00:31:17.920 |
And we have batch processes and stuff like that set up for you. 00:31:20.920 |
So you'll be able to play with that, create sets. 00:31:23.920 |
Typically, you want to create eval sets which have a combination of hard, easy, medium, that kind of stuff. 00:31:30.920 |
We're not going to get into data generation, synthetic data generation that much. 00:31:37.920 |
But the actual data generation stuff there, there's actually documentation that you can follow up afterwards. 00:31:43.920 |
And how to create, like, easy, hard, medium sets for your testing purposes. 00:31:47.920 |
The best thing is to sample some number of things from your logs and evaluate, or just run it online. 00:31:56.920 |
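(One simple way to do that sampling, as a sketch: bucket logged outputs into easy/medium/hard using your current scorer, then sample evenly from each bucket. The thresholds and the `score` function are assumptions.)

```python
import random

def build_eval_set(logged_outputs: list[str], score, per_bucket: int = 30, seed: int = 0) -> dict:
    # Bucket logged outputs by their current score so the eval set mixes
    # easy, medium, and hard cases instead of being dominated by easy ones.
    random.seed(seed)
    buckets = {"easy": [], "medium": [], "hard": []}
    for output in logged_outputs:
        s = score(output)
        key = "easy" if s >= 0.8 else "medium" if s >= 0.5 else "hard"
        buckets[key].append(output)
    return {k: random.sample(v, min(per_bucket, len(v))) for k, v in buckets.items()}
```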
The majority of these kinds of quality checks at Google were run online. 00:32:00.920 |
Whether it's spam detection, whether it's, like, you know, what decisions to make. 00:32:06.920 |
For most people, it's very difficult to, kind of, like, sort of implement at that point. 00:32:09.920 |
I think one of the biggest challenges with LLM-as-a-judge is it's so expensive you can't run it online. 00:32:14.920 |
So that's one of the things that this solves. 00:32:22.920 |
Are you using the traditional API models or smaller API models? 00:32:29.920 |
Maybe I'll go very quickly through these slides. 00:32:31.920 |
So, these models are designed for high precision. 00:32:38.920 |
Like, for example, if you run the same score twice on the same thing, it's not going to give you different scores. 00:32:43.920 |
It may not be exactly the same score, but with small variations you keep the same scores. 00:32:47.920 |
It's designed for, like, super high precision. 00:32:49.920 |
It's just the architecture of these models is basically very low variance. 00:32:56.920 |
Part of the reason is that they're using this bi-directional attention instead of, like, the typical decoder model's attention. 00:33:03.920 |
They have a regression head on top instead of token generations. 00:33:07.920 |
It's not autoregressively generating tokens, which has a lot of, like, weirdness that happens to it. 00:33:12.920 |
Like, there's a lot of post-hoc explanation for scores. 00:33:15.920 |
They'll come up with a score, then they'll try to justify the score. 00:33:18.920 |
The other thing is these models have been trained on a lot of data, like, you know, billions and billions of tokens, which are only for scoring, but different kinds of content. 00:33:31.920 |
So, these generalize really well across domains, but that also stabilizes them quite a bit. 00:33:35.920 |
The one thing that I would say is really nice is the interface. 00:33:42.920 |
It basically just, you ask a question and give it the data and it will answer the question with a score. 00:33:48.920 |
And then you can inspect the score and understand why it gave you that score. 00:33:53.920 |
There's no, like, prompt tuning and so on and so forth. 00:33:55.920 |
Because in this particular case, when you're evaluating, prompt tuning isn't really, like, very natural. 00:34:00.920 |
Like, you know, how do you explain a rubric to a model? 00:34:03.920 |
So, these models understand internally why something should score high. 00:34:09.920 |
And then these things come together using a fairly sophisticated model, which is like an extension of a generalized additive model, which brings all of these different signals together by weighting them based on your thumbs up, thumbs down data. 00:34:24.920 |
So, this is a process called calibration, where you give it a bunch of data and it understands what's important, what's not. 00:34:30.920 |
Like, if something fails, should I fail everything? 00:34:37.920 |
Those kinds of decisions, it makes those decisions for you based on the data. 00:34:41.920 |
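(A loose sketch of what calibration means here, using plain logistic regression from scikit-learn as a stand-in for the generalized additive model described above: fit the combination of dimension scores against thumbs-up/thumbs-down labels.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate(dimension_scores: np.ndarray, thumbs_up: np.ndarray) -> LogisticRegression:
    """dimension_scores: (n_examples, n_dimensions); thumbs_up: 0/1 labels per example."""
    model = LogisticRegression()
    model.fit(dimension_scores, thumbs_up)
    return model  # model.coef_ shows which dimensions the labels say matter most

# Usage: calibrated = calibrate(X, y); use calibrated.predict_proba(X_new)[:, 1] as the combined score.
```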
So, it's a much more advanced way of doing evals, and these models are built from the ground up for eval purposes. 00:34:49.920 |
It gives them that sort of stability that you desire. 00:34:51.920 |
And the reason they're very fast is because when you use bi-directional attention, you can build much more dense embeddings. 00:34:57.920 |
So, with fewer parameters, you can get fairly high quality scores. 00:35:04.920 |
Right now, we have trained it with a handful of languages. 00:35:10.920 |
We are pretty soon going to release a model with multilingual capabilities. 00:35:17.920 |
We don't support multimodal yet, but this is on our roadmap, right? 00:35:21.920 |
So, right now, it's English and a few languages, and then we're going to expand it beyond that. 00:35:25.920 |
We should kick off the workshop and then we'll pass it on and we can just answer all the questions as well. 00:35:39.920 |
I'll just share the doc here as well so that you all can see it. 00:35:42.920 |
Yeah, I'll also put it in the Slack channel for everybody. 00:35:55.920 |
Some people, I've seen, have maybe already started working with the colabs. 00:36:01.920 |
But I'll quickly show others what the second phase of the exercise is about. 00:36:10.920 |
But, of course, we can continue all of this going forward. 00:36:16.920 |
So, the second part of the exercise, by the way -- we may not have enough time for it. 00:36:24.920 |
But the second part of the exercise, feel free to do it on your own. 00:36:28.920 |
The colab is available for you so you can use this. 00:36:44.920 |
But basically, this particular colab will take you through these multiple steps. 00:36:49.920 |
And a lot of people ask me questions about how to use it in code. 00:36:57.920 |
This is where all of your intelligence is going to live. 00:37:04.920 |
But the way you get this spec is basically over here. 00:37:12.920 |
So this is basically how you get the spec here. 00:37:15.920 |
It's all a natural language spec with some Python code in there. 00:37:25.920 |
And what you're now doing in this particular colab, which you can literally click through it. 00:37:29.920 |
And you don't have to do anything more than that. 00:37:32.920 |
Or you can play around with it a whole bunch. 00:37:42.920 |
If you've seen the sheet, that's the same data over here. 00:37:52.920 |
And then building a similar confusion matrix that you saw there. 00:37:56.920 |
Which indicates how well aligned your scores are. 00:38:03.920 |
This is where like really interesting stuff starts happening. 00:38:06.920 |
So in this particular case, we are comparing 1.5 and 2.5 models. 00:38:10.920 |
You can see that 2.5 has a slightly higher score than 1.5. 00:38:21.920 |
Like some of the mini models, like Claude Haiku. 00:38:25.920 |
You'll see a much bigger delta between the quality. 00:38:30.920 |
So now you can use your scoring system for evaluating different models. 00:38:34.920 |
What this is doing is it's taking about, you know, 10 examples. 00:38:42.920 |
And then scoring them using our scoring system. 00:38:56.920 |
This is the right way to kind of make sure that you're not regressing. 00:39:03.920 |
See if your scores are going down on your test set. 00:39:06.920 |
So again, taking 10 examples just to demonstrate how you compare them. 00:39:11.920 |
Here, like, we created a bad and a good prompt just to kind of accentuate this. 00:39:15.920 |
So like bad prompts getting much lower score than good prompts on this particular task. 00:39:20.920 |
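(That comparison is just averaging the scoring system over the same examples under each prompt variant; `run_prompt` and `score` below are hypothetical stand-ins.)

```python
from statistics import mean

def run_prompt(prompt: str, example: str) -> str:
    """Hypothetical stand-in: runs your application with this prompt on one example."""
    raise NotImplementedError

def score(output: str) -> float:
    """Hypothetical stand-in for the scoring system built earlier."""
    raise NotImplementedError

def compare_prompts(good_prompt: str, bad_prompt: str, examples: list[str]) -> dict[str, float]:
    # Average the scoring system over the same examples for each prompt variant.
    return {
        "good_prompt": mean(score(run_prompt(good_prompt, ex)) for ex in examples),
        "bad_prompt": mean(score(run_prompt(bad_prompt, ex)) for ex in examples),
    }
```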
This is the one that I'm very excited about because it brings you into the online world. 00:39:26.920 |
Where you're taking this one particular transcript. 00:39:29.920 |
And you are testing it out with different numbers of samples. 00:39:45.920 |
What it's doing behind the scenes is creating three or four of those responses. 00:39:52.920 |
And then ranking them using the Pi scoring system that you just built. 00:39:58.920 |
And what you would see is, as you increase the number of samples, the quality goes up. 00:40:04.920 |
So this is literally like you can click through this in the colab. 00:40:12.920 |
But there's a bunch of options for you to play around with this. 00:40:16.920 |
Just want to introduce this to you guys before you all disappear.