
[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)


Transcript

- We're a bit early, but I guess everyone's here, so we'll get started. Maybe we'll spend the first few minutes getting people a little oriented. And actually, we are quite curious: what brings people here? - Yeah, well, why are you here? - What about evals? Maybe by a show of hands, a few people: have you done evals before?

And, okay, have you struggled with them? Was it hard? Maybe by, again, a show of hands. - You're away from the speaker. - So, people who have not done evals before, what brings you here? Anyone want to volunteer? - Yeah, feel free to raise your hand. Sorry, did I move away from the speaker?

- A little bit, maybe. - A little bit, all right, I'll stand on this side. - He's trying to understand, like, how to approach it. - Oh, I see. - Okay, so trying to understand how to approach evals. So, for people who have tried before, what have they struggled with?

What has been the, go ahead. - I think the struggle, one, is to define the metrics, because it's hard to define what the correct answer is, since it can look many different ways. And two, there are endless business questions, and you can turn them into SQL, and how do you pragmatically improve that, versus I design a unit test, I review it, and I do it again, and that can be very manual and labor-intensive.

- Right, so evals can be labor-intensive. Evals can be hard to set up. They can be painful to get started with. Any other thoughts? - It's like machine learning is dependent upon training data. Agents are dependent upon evals to provide feedback from the world to let them know whether or not they're getting it right.

And clearly, these are sophisticated and challenging problems, so it's like, how do you do science? You have to have experiments; this is like that. How do you do experiments? - Right, so we are all-- - I'm just saying it's super important. - Yeah, we are all getting pulled into the world of data science and machine learning in some ways, even if we don't want to.

Go ahead. - Yeah, sorry. Our engineering team was part of the assurance team, and all these years people weren't happy that testing took, I don't know, 30% of feature development time. And now with AI, I think evaluation takes 80% of feature development time. People are even less happy about it.

And we are trying to find ways, also, to find shortcuts, but it seems that it's not always possible, because each case is so unique that you just can't reuse previous work, and you have to create many things from scratch at different levels. And because of this unclear, probabilistic nature of evaluation, it's every time-- - Speak up.

- Speak up. - Yeah, yeah, yeah. An optimized way to do evaluation. - Right. So basically -- you said speak up, or speak low? Okay, speak up. Okay. Yeah, so basically, it's custom, subjective evals for your specific use case; you can't just copy-paste something that exists.

Much like testing: some people did testing, some didn't. But now, in the world of AI, everyone has to do evals. Everyone is spending a lot of time doing evals. The best practices don't exist. So I-- go ahead, please. - Well, I was just going to add that I'm looking to understand how synthetic data is playing a role there.

And especially for novel types of problems, where you don't have access to real-world data to use as a baseline. So how much can this impact the eval system? - Right. There are, like, two aspects of evals. There's the synthetic data, and there are the metrics.

Metrics are the checking; synthetic data is what you test with. How do you generate good synthetic data to test and evaluate? Go ahead. - Along the lines of the first comment, we've been trying to prompt LLM models to do the evaluation. But in the end, it's not accurate.

So we have to do that manually. And so we're hoping to learn: what are some acceptable ways to do that? - Right. Automatic evals are something everybody desires, because you don't want human beings to sit and do it. But the options are somewhat limited. Typically, for most things, you can lean on AI to do human tasks.

For some reason, evals tend to be hard for LLMs to do right. So I guess that makes a lot of sense. We're not going to solve all your eval problems here. I think this is a very hard research area that we would probably continue working on in the next few months.

But David and I were at Google for over a decade, and we have been doing evals for a long time. We dealt with stochastic applications. We both did search; we were building Search. Search was also similarly stochastic. And most of the core effort at Google was to do a good job of evaluating where we are and improving it.

And some of the techniques we learned at Google we are bringing in here. So we built a product around it, a bunch of technology around it, but really, the biggest takeaway I want people to have is the methodology and the stuff that we have learned, whatever manifestation it takes.

So I'm hoping that you get to play with some of our stuff, but also learn some good ideas. David, you want to add? - Yeah. I mean, one thing I would add is at Google, we used to call this quality, and evals were part of it. And there was this constant idea of benchmarking: exhausting a benchmark and then moving on to the next benchmark.

Whenever we work with clients right now, we take a similar approach. So like I mentioned, so much of evals is just good methodology, setting up benchmarks, trying to figure out the metrics that work, calibrating metrics with humans, calibrating metrics with user data. What I'm trying to say is that it can get arbitrarily complex.

Like, evals is not like, oh, there's one way to do it, and then you're done. It's just part of how you do development, which, you know, everybody's been struggling with because we've all been trying to learn what it means to develop on top of this new stack. But again, today, in today's session, we're just hoping to bring you some of these ideas where, like, methodologies that make sense, like how to think about your metrics.

Many people think of, like, four metrics. I would challenge you: Google Search has around 300 metrics. So you start to think about how you can expand the scope of how you're doing this work. And, you know, part of what we're doing is trying to build technology that makes it easy to adopt these methodologies, how to create benchmarks that are interesting, how to make them harder.

And then we can talk about this as we go. We'll be mostly hands-on. We'll show a few slides, and then we'll dive into the code, because I think the best way to learn about these things is just to do them. As we go, we'd love to just pass it around, because, again, people are at different parts of their journeys.

For some people, it's like, hey, I don't have metrics, and I'll just develop my own metrics. We work with clients that have metrics, but they want to, for example, correlate them to user behavior. They have a lot of thumbs up, thumbs down data. So that goes all the way into a feedback loop.

So you see, like, evals is not just one thing related to testing. It goes all the way into your online system and your feedback loops and all that. So, but hopefully, a lot of the mental models today will help you kind of gauge that. Just two things. The Slack channel is workshop-metrics.

So you can join there on the Slack, and then we'll just be posting things, and then we can have discussions and continue the conversations even after the workshop. And the second one is we have a document that will have all the steps of the workshop, all the places where you can get the code, where you can get the sheet.

Just go to withpi.ai/workshop. You will land in a Google Doc that has a lot of links. The Google Doc also has a link to the slide deck we're presenting, so you can have it. We'll keep it online even after the workshop, so you guys can reference it. All right, we'll get started to keep you guys on time and maximize coding time.

There are four of us. A couple of other people will just walk around. If you get stuck on any step, if you're having trouble with anything, just raise your hand, and we'll come over. All right, I'll give you a very quick, maybe five-minute blurb, show a quick demo, and then get started.

Can you guys still hear me? Okay. I think we went through this, but basically, most people start with vibe testing: try it out, see how it works, change prompts. And honestly, for many applications, you can go pretty far with just that. I think AIs have become quite good.

I think agent systems, as you were saying, are more complex, because with multiple steps things fail more often. But vibe testing gets you pretty far. Human evals are expensive. Typically, most companies don't bother setting up human rater evals. Some do, some have subject matter experts. Code-based evals are where I think the majority of people are spending time.

They're writing some sort of code to test the verifiable things that they can. People are moving into natural language, like LLM-as-a-judge type things, where it gets -- this is not a very good task for the typical decoder models, the generative AI models. We'll get into some of that.

And so you kind of have to fight against that -- these models are designed to be creative. That's not what you want from a judge, typically. The scoring system is this idea that you're not trying to build a comprehensive set of metrics from the beginning.

You start with some correlated signals. Maybe you start with five, ten signals that you know for a fact are correlated with goodness. They're very simple signals that you can easily derive. And over time you build upon those as you see problems, as you debug. And then you test your application, whether it works well or not.

You learn from it. You build your application. You measure it again. So that makes it much more of a feedback-loop type process. That's the thing that we're trying to introduce in terms of methodology. The thing that David was saying -- go ahead. - Yeah, one thing I'll add to that is, like, there's no right or wrong.

Like, vibe testing actually gets you a very long way. Sorry, we have a lot of echo. Maybe you want to turn that off, and then we can switch the mic. Yeah, so just to say, there's no right or wrong. I think you should absolutely start with vibe testing.

A lot of people use tracing, and they just monitor traces. I think as your system scales, you do want to get a little bit more sophisticated. So you start to layer those things in, in order of complexity. That's a lot of how we talk about quality generally: there are complex things, but for some amount of investment they give some amount of return.

So it's like a little bit of an ROI. Some things are very cheap to do, and then you do them. And as your system scales, they become impractical, and then you just layer in more techniques. So there's not a statement that, like, any of these things are good or bad.

You just need all of them. So think of these things as tools in your toolbox. But eventually, what you really want is a scoring system. What it manifests as for you, you should be the judge of. But the one thing to take away is: just keep layering in your tools.

And with increasing levels of sophistication, you get to a system that will probably look like the one we'll develop today. Yeah, and as David was saying, evals are such an investment. So you might ask, well, why am I spending so much time on just evals? Why am I not building?

And I think one of the things that was core to Google, and I think this industry is going to adopt this over time, is that evals are actually the only place you're going to spend most of your time, because that's where domain knowledge is going to live. Everything else just works off of those evals.

And so, for example, if you have really good evals, you don't have to write prompts. You can find problems and use metaprompts or optimizers like DSPy and others to improve your prompts for you. You can filter synthetic data and then use that for fine-tuning, if you're really interested in fine-tuning or reinforcement learning.

But you can also use these techniques online. One of the things that you're going to get to try today is a very simple but very effective technique that almost all big labs use. Google used it extensively. The idea is that you crank up the temperature and you generate a bunch of responses instead of just one response.

You can think of this as online reinforcement learning, generate four or five responses, and then score those responses online and see which one's the best one. And you get a pretty decent lift just by doing this, like without actually doing any changes to your prompts or models and so on.
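As a rough sketch of that best-of-N idea, assuming a scoring system is already in place (the two helper functions below are toy stand-ins for your actual LLM call and your actual scoring system, not Pi's API):

```python
import random

# Minimal best-of-N sampling sketch. Swap the two toy helpers for your
# real LLM call and your real scoring system.

def generate_response(prompt: str, temperature: float) -> str:
    # Toy stand-in for an LLM call sampled at the given temperature.
    fillers = ["short answer", "a longer, more detailed answer", "an answer with action items"]
    return random.choice(fillers)

def score_response(prompt: str, response: str) -> float:
    # Toy stand-in for the scoring system: here, just reward detail.
    return min(len(response) / 40.0, 1.0)

def best_of_n(prompt: str, n: int = 4, temperature: float = 0.7) -> str:
    # Sample several candidates instead of one...
    candidates = [generate_response(prompt, temperature) for _ in range(n)]
    # ...score each one online, and keep the best.
    return max(candidates, key=lambda c: score_response(prompt, c))

print(best_of_n("Summarize this meeting transcript ..."))
```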

So these are the kinds of things you can do when you have a really good scoring system that you can lean on. So you get to try some of these techniques today. But the key point is don't think of evals as testing in the classic sense. Think of these as the primary place where domain knowledge lives.

And then one of the things that I think one of the attendees was pointing out is that right now, at this point in the industry, we have figured out a handful of standard evals. Like you can think of simple helpfulness, harmfulness, hallucinations. And that doesn't really get you far enough.

That's good enough to just do a guardrail and make sure you're not doing anything wrong. But we're moving into a world where we want to build really good applications. For example, if you're building a trip planner, you're curious about how to make that trip plan really be perfect.

One of the things I really hate about trip plans is when they're not interesting enough. I'm looking at a trip plan, I want to be excited about this place. Typically, when LLMs give me a trip plan, it's very plain, right? So now you're trying to build these sorts of nuances into your applications.

And that's the kind of stuff if you want to evaluate, that's where the industry is going to go. So how do you build like these much more nuanced evals? That's one of the places where these traditional evals fail. So as I went over, like start simple, iterate, see what's broken, and then improve, right?

And so that's the other thing you get to try today is in a co-pilot like setting, start with something, test it out with a handful of examples, maybe generate some synthetic examples, test it out with good examples, bad examples, see what's working, what's not, and then iterate. We've picked a relatively simple example for today because for workshop purposes, we wanted to keep it simple.

But you can absolutely go and try it on fairly complex things, and there you can see a lot more of the nuances. So, yeah, there are two parts of today's workshop. The first part is setting up your scoring system. And the second part is, once you have set up the scoring system and played with it and iterated on it in the co-pilot, trying to use it in a Colab.

You would need a Google account and some proficiency in working with colabs and Python code. We'll also introduce a spreadsheet component to this so that you can actually try to play around with this in a spreadsheet so that you can easily make changes and test things out. One last thing I want to hit on before we jump into the workshop is what is the scoring system?

What is this idea of scoring system? I'd like to just give you a mental framing of it. So ranking is scoring is evals effectively. That's one way to think about it. When Google does a search, what it's trying to do is it's scoring every document and seeing whether this is good or not and then giving you the best document that's the best for you.

And that's not that different from checking and scoring an LLM-generated content. And the way Google does it is by breaking this problem down into a ton of signals. So you can imagine you're looking at SEO, you can look at document popularity, you're looking at title scores, whether the content is good or not, maybe feasibility of things, spam, clickbaitiness, stuff like that.

And then it brings all of these signals together into a single score that combines these ideas. The individual signals that you have are very easy to understand. You know what each one is. You can inspect it very easily. It's not a complex prompt with some random score. It's something that you can easily understand.

But it all comes together into a single score. And that's the idea we are sort of bringing in. And that's what we call a scoring system. At the bottom level, things are very objective, tend to be deterministic, sometimes just Python code. But as you bring this up to the top, it becomes fairly subjective.
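To make that concrete, here is an illustrative sketch only (not Google's or Pi's actual implementation): a few simple, objective signals at the bottom, combined into one weighted score at the top.

```python
import json

# Illustrative sketch: simple, inspectable signals combined into one score.

def valid_json(output: str) -> float:
    try:
        json.loads(output)
        return 1.0
    except ValueError:
        return 0.0

def has_action_items(output: str) -> float:
    return 1.0 if "action_items" in output else 0.0

def reasonable_length(output: str) -> float:
    return 1.0 if len(output.split()) <= 400 else 0.0

# Weights express how much each signal matters (critical vs. minor).
SIGNALS = [(valid_json, 3.0), (has_action_items, 2.0), (reasonable_length, 1.0)]

def overall_score(output: str) -> float:
    total_weight = sum(w for _, w in SIGNALS)
    return sum(fn(output) * w for fn, w in SIGNALS) / total_weight
```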

And you can bring these together in very complex ways. I would just encourage you to think about why this could work. And we know it does work. Some of what you struggle with in evals is just that you're not measuring the things you need to be measuring.

So you have a bit of a comprehensiveness issue. And so the question is, how many metrics do you need to add? I gave the example that Google Search uses around 300 signals. Now, maybe you don't want to be that sophisticated. But Google does care about all those things.

They care about the score of the site. They care about popularity. They care about content relevance, spam, porn. They have all these classifiers, all these ways they understand the content, to then bring it together. What this gives you is real visibility into your application and more and more ways to marry your own judgment into it.

So if instead you just go and say, hey, is this a helpful response? Mostly what you're doing is delegating that eval to the LLM itself, right? Or to a rater who you're asking like, hey, see if this is good. But when you break this all down, you get some really nice properties where like your variance goes down a lot just because you're measuring way more objective things.

So things are not going back and forth all the time. And it's very precise, because you get all these things and you add them together into a higher-fidelity score. And when you are analyzing the data, you can slice and dice by much finer-grained things.

And that's kind of why the system tends to work much better. And the best part of it is, as you iterate, you just add more signals. It doesn't leave you at, oh, I either have evals or I don't have evals. Rather, you just have a set of metrics that you keep adding to over time as you discover what actually matters about your application.

Okay. How do I do this? Yeah, come over. Okay, so I'll just quickly show you a demo of where you're going to start today. Give you some sort of basic ideas of what the kinds of things you would be doing. And then I'll just, we should just get started, right?

And so this is a co-pilot that helps you put together evals. That's the idea. Where I'm starting is basically just a system prompt. This application is a simple meeting summarizer. It takes a meeting transcript, a conversation between multiple people, and then generates some structured JSON at the end, which is a summary with very specific action items, key insights, and then a title, right?
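The structured output being described might look roughly like this (the exact field names in the workshop app may differ):

```python
# Hypothetical example of the summarizer's structured output.
example_summary = {
    "title": "Q3 Launch Planning Sync",
    "key_insights": [
        "The launch date is at risk because of the pending security review.",
    ],
    "action_items": [
        {"owner": "Dana", "item": "Schedule the security review by Friday"},
    ],
}
```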

So it's a relatively simple thing. It's easy for you to inspect and see where things are working and not working well. So typically, you would start with something like a system prompt. You can also start with examples, if you have a bunch of examples. Or you can start with the criteria itself.

And the first step is it would try to use this to build your scoring system. Yeah, and this is doing exactly what was in that slide, which is trying to say, like, from that coarse-grained subjective thing, what are all the smaller things I can suss out of it? Now, this uses a reasoning model.

We just try to replicate the experience you'd get if you dropped this into ChatGPT, where you get these artifacts on the right-hand side, which is your actual scoring system. But if you just want to do it iteratively through your favorite chat interface, you can do that too.

Like, just a-- Oops. Go ahead. - Can we do it in reverse, where we provide examples? - Yes, exactly right. So right up front, there is an example button. You can start with examples. You can give it 20 examples, or actually hundreds of examples if you want. And then it gets into a much more complex process of figuring out these dimensions based on the examples.

Or you can just copy-paste one or two examples in the prompt itself, and it will generate it. This is just a starting point, by the way. That's the idea. It starts you somewhere, and now you're going to iterate over it. And that's the exercise you will spend time on.

I'll show you a few things, like, this is your scoring system. These are your individual-- these are individual dimensions, is what we call it. Or you can think of these as signals. They're all questions. Effectively, for example, this is a-- does the output include any insights from the meeting?

That's a natural language question. Or you can have code, just Python code that we have generated. You can edit this code however you want. This is actually as simple as what this says. When you look at the actual code, the code is effectively just a bunch of questions. And you're sending these questions to our specialized foundation models that are designed for scoring and evaluation.

So you'll get to play with this a bunch in the colab. You can see what the form of it looks like. There's another thing that you would notice that there's this idea of critical, major, minor. These are just weights. These are just weights for you to control what's important, what's not.

The combination of these is done through a mathematical function that can be learned over time. So eventually, you would give it a bunch of examples and it'll learn it. But in this particular exercise, you have a little bit more control. And finally, once you have your scoring system done, there is a way for you to integrate it into Google Sheets.
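As a hypothetical sketch, a scoring spec like the one being shown could be represented as plain data, with natural-language questions, optional Python code, and critical/major/minor weights. The actual format the co-pilot exports may differ.

```python
# Hypothetical representation of a scoring spec; the real export may differ.
scoring_spec = [
    {
        "label": "Action items present",
        "question": "Does the output list concrete action items with owners?",
        "weight": "critical",   # critical / major / minor are just weights
    },
    {
        "label": "Insights coverage",
        "question": "Does the output include the key insights from the meeting?",
        "weight": "major",
    },
    {
        "label": "Valid JSON",
        # Code-based dimensions are just Python you can read and edit.
        "python_code": (
            "import json\n"
            "def score(response: str) -> float:\n"
            "    try:\n"
            "        json.loads(response)\n"
            "        return 1.0\n"
            "    except ValueError:\n"
            "        return 0.0\n"
        ),
        "weight": "critical",
    },
]

# When the combining function is calibrated, the named weights become numbers,
# e.g. something like:
WEIGHTS = {"critical": 3.0, "major": 2.0, "minor": 1.0}
```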

That's the thing that you're going to play around with today, which is basically taking this criteria, moving it to a Google Sheet, and testing it against real examples. So here, you're going to work with synthetic examples. You'll develop your scoring system, and then, completely blind to this, we have a labeled data set where users have set thumbs up or thumbs down on the summaries.

You're going to apply that scoring system to it and see how well it aligns with real thumbs up, thumbs down. We don't know; it depends on you. You're building your own scoring system, and it may or may not align. And this is a really interesting point, and why evals start to get hard.

We call this workshop like solving the hardest challenge, which is metrics that actually work. So this idea of correlation ends up being really, really important. Metrics that work are not necessarily good metrics or bad metrics. They're either calibrated metrics or uncalibrated metrics. At Google, for example, we had a lot of data scientists that we worked with because they would do all these correlation analyses and confusion matrices and such.

So part of the challenge of good evals is just getting comfortable with the numerical aspect of these things. And, of course, again, having a scoring system that dissects things into much simpler things makes it easier to analyze. But you still have to think about those things. Like, does this actually correlate with goodness?

Like, if it gives a high score, is this actually a good score? And that's a big part of the methodology. But the good news is, as I showed in the previous slide, once you have metrics you can trust, like, almost all of the rest of your stack gets radically simplified as a result.

Okay. So how do you use this co-pilot? It's created this. Maybe a place to start would be, you know, generate an example. This is synthetic generation happening behind the scenes. It's taking your system prompt and other information and trying to generate some sort of an example. And then the example that is generated is scored.

And you can see how these individual scores work. Now you can start kind of testing this a little bit more. You can say, can you generate a bad example? And then it will try to generate an example that's broken in some particular way. The co-pilot understands what you've done so far.

It has the full context. It's using all of this information to kind of, like, sort of walk you through this. So in this particular way, like, in this particular case, it created something that is missing a bunch of information. But you can even do things like, you know, specific things like, can you create an example that has broken JSON?

So this is, like, basically example generation. That's one thing you can do here. The other thing you can do is you can make changes to your scoring system itself through the co-pilot. So you can go and ask the co-pilot to make changes to your... this is a broken JSON example.

You can go to the co-pilot and ask it to change the Python code itself. You can say, you know, can you update the Python code for any of these things. You can also ask it to remove or add dimensions. Maybe you say, can you generate a dimension that checks that the title is less than 20 words.
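A code-based dimension like the one being asked for might look roughly like this, assuming the response being checked is the JSON summary (illustrative only; the co-pilot's generated code may differ):

```python
import json

def title_under_20_words(response: str) -> float:
    # 1.0 if the summary's title is under 20 words, else 0.0.
    try:
        title = json.loads(response).get("title", "")
    except ValueError:
        return 0.0
    return 1.0 if len(title.split()) < 20 else 0.0
```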

And then the co-pilot can add these dimensions. Go ahead. - So let's say we add a feature to the product, the trip planner, that also now does bookings for you. - Uh-huh. - So are you then adding that in... - Exactly. - ...at a higher level, because now it's more of an agent?

- Yes. Exactly. - And what do you do? You just add it to your... - You just ask the co-pilot and say, I've updated; here are some new examples. Can you add new dimensions to test these new examples? - Or we have changed things and it automatically, I mean... it knows what layer... - It knows what layer you are at, exactly.

Yep. The other thing I would say is the co-pilot is very helpful if you want this sort of human-in-the-loop type process. But once you get comfortable with the system, what we see people doing is just using the Colab to send large amounts of data and letting our system figure things out on its own.

I find... Personally, I find it very good to play with my scoring system here every so often to see if it's like working well. Like maybe paste an example from a user and see how well it's working and so on and so forth. So you can kind of go back and forth.

But most of our clients actually just fire off these long-running processes. Yeah. So anyway, it just created this new title-length dimension for you. So you can play around with this. You can change the questions.

You can remove dimensions and so on and so forth. Right? The last thing I would show before we get going is one of the things that you would do is... We have pre-filled a spreadsheet in the workshop directions. You will make a copy of that spreadsheet. And we have a spreadsheet integration where you can run our score inside a spreadsheet itself.

You see there are other places where you can use it. I'm not going to get into this, but you can use it for reinforcement learning, for example, using our Unsloth integration. There are other places you can actually integrate the Pi scorer. I'm not sure if you're familiar with these.

But basically, in the Sheets integration, what you are going to do in this workshop is... You can actually just copy this wholesale and put it into a new spreadsheet. With the examples that are here, what we are trying to do is just copy the criteria. So you're going to build your criteria.

There's a copy icon; just click on the copy icon. And then go to the spreadsheet that we will... which you have. Replace the criteria that exists there with the criteria that you've created. And then it's under Extensions; these directions are all in the doc, you don't have to remember them.

Under Extensions, you can go and call the scorer. The scorer is going to run across about 120 examples. And then you'll see a confusion matrix, which shows you how many times there's alignment on thumbs up, how many times there's alignment on thumbs down, and how many times there's no alignment.

And then you can play around in the spreadsheet itself. Make changes to the dimensions and see if you can bring the alignment closer. So this gives you a sense of... Yeah, so this is what a spreadsheet would look like. This is your criteria. Which is basically just the English form...

The spreadsheet form of the English that you saw. Oh, sorry. Is that better? Okay. So like you can see, this is the label. The question. These are the weights. In the case of Python, there's Python code. So this is what your criteria sheet looks like right now. This is the default one that we put in there for you.

But you would replace it with your own. This is the data. This basically has actual feedback, thumbs up or thumbs down, which we got from users. Some of this is fairly complex, some of this is easy. So it's a mix of all kinds of different things.

And what you're going to do is you're going to select these two rows. This is the input, the output. And then in the extensions under here, you will have score selected ranges, which is going to create a score. And then you can look at the confusion matrix and see how it works.

So that's sort of how... That's one part of the exercise. But you can easily go and make changes here or even test out by making changes to the data, you know, messing up your JSON and seeing how that impacts things and so on and so forth. So that's the first phase of our co-pilot, of our workshop.

And we'll talk about the second phase when we get started with that one. Go ahead. - Yeah. I'm just curious about best practices for using this in production, when you have tens of thousands, maybe hundreds of thousands of examples. Do your teams run this on a subset?

Or just, like, how do you do this at scale? - Right. We will hit a lot of that in the second part of the workshop. We will go directly into Python Colab code. The SDK goes through a bunch of this: the details and best practices on how you do it.

But you will get to play with this in this workshop. Our scorers are specifically designed for online workflows. These, like, 20 dimensions that you have, they score them all in, like, sub-50 milliseconds. So you can run it at a very large scale. You can run it online fairly easily.

And we have batch processes and stuff like that set up for you. So you'll be able to play with that, create sets. Typically, you want to create eval sets which have a combination of hard, easy, medium, that kind of stuff. We're not going to get into data generation, synthetic data generation that much.

But you'll play with it in the co-pilot. For the actual data generation, there's documentation that you can follow up with afterwards, on how to create easy, medium, and hard sets for your testing purposes. The best thing is to sample some number of things from your logs and evaluate, or just run it online.
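One simple way to turn sampled logs into easy/medium/hard sets is to use your own scoring system as a rough difficulty proxy. This is purely illustrative; the field names (`input`, `output`) and `score_fn` are assumptions, not the documented workflow.

```python
import random

# Bucket sampled logs into easy / medium / hard eval sets by score.
# `logs` is a list of dicts with "input" and "output" keys (assumed),
# and `score_fn` is whatever scoring system you have built.
def build_eval_sets(logs, score_fn, n=120, seed=0):
    random.seed(seed)
    sample = random.sample(logs, min(n, len(logs)))
    scored = [(score_fn(ex["input"], ex["output"]), ex) for ex in sample]
    scored.sort(key=lambda pair: pair[0])
    third = len(scored) // 3
    return {
        "hard":   [ex for _, ex in scored[:third]],           # lowest scores
        "medium": [ex for _, ex in scored[third:2 * third]],
        "easy":   [ex for _, ex in scored[2 * third:]],        # highest scores
    }
```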

The majority of these kinds of quality checks at Google were run online, whether it's spam detection, whether it's, you know, what decisions to make. That's the ideal place you want to be. For most people, it's very difficult to implement at that point. I think one of the biggest challenges with LLM as a judge is it's so expensive you can't run it online.

So that's one of the things that this solves. All right. So, go ahead. How this is different than Elm as a judge? Are you using the traditional API models or smaller API models? So, we had a bunch of discussion on this. Maybe I'll go very quickly through these slides.

So, these models are designed for high precision. For example, if you run the same scorer twice on the same thing, it's not going to give you different scores. And small variations in the input keep roughly the same scores. Right? It's designed for, like, super high precision.

It's just that the architecture of these models is basically very low variance. Part of the reason is that they're using bi-directional attention instead of the typical decoder model's attention. They have a regression head on top instead of token generation. It's not autoregressively generating tokens, which has a lot of, like, weirdness that happens to it.
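As a rough illustration of that architectural idea (a bi-directional encoder with a regression head instead of a token-generation head), here is a generic sketch in PyTorch with a Hugging Face encoder. This is not Pi's actual model, just the general shape of the approach.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Generic "encoder + regression head" sketch; not Pi's actual model.
class ScoringHead(torch.nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # Regression head: one scalar score instead of next-token logits.
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, question: str, response: str) -> torch.Tensor:
        inputs = self.tokenizer(question, response, return_tensors="pt",
                                truncation=True)
        hidden = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token
        return torch.sigmoid(self.head(hidden)).squeeze(-1)      # score in (0, 1)
```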

With decoder models, there's a lot of post-hoc explanation for scores: they'll come up with a score, then they'll try to justify the score. That doesn't happen here. The other thing is these models have been trained on a lot of data, like, you know, billions and billions of tokens, which are only for scoring, but across different kinds of content.

So, coding content, other types of content. So, these generalize really well, and that also stabilizes them quite a bit. The one thing that I would say is really nice is the interface. You basically just ask a question and give it the data, and it will answer the question with a score.

And then you can inspect the score and understand why it gave you that score. So, it's a fairly simple interface. There's no prompt tuning and so on, because in this particular case, when you're evaluating, prompt tuning is not very natural. Like, how do you explain a rubric to a model?

So, these models understand internally why they should score high or why they shouldn't score high. And then these things come together using a fairly sophisticated model, which is like an extension of a generalized additive model, which brings all of these different signals together by weighting them based on your thumbs up, thumbs down data.

So, this is a process called calibration, where you give it a bunch of data and it understands what's important and what's not. For example, if something like spam fails, I should fail everything; but if it succeeds, it shouldn't contribute much. Those kinds of decisions, it makes for you based on the data.
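To make the calibration idea concrete, here is a very rough stand-in that learns how much each dimension should count from thumbs-up/thumbs-down labels. The real system uses an extension of a generalized additive model; plain logistic regression is only the simplest illustration of the same idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Very rough stand-in for calibration: learn per-dimension weights
# against thumbs-up / thumbs-down labels.
# dimension_scores: shape (num_examples, num_dimensions), each in [0, 1]
# thumbs_up: shape (num_examples,), 1 for thumbs up, 0 for thumbs down
def calibrate(dimension_scores: np.ndarray, thumbs_up: np.ndarray):
    model = LogisticRegression()
    model.fit(dimension_scores, thumbs_up)
    # The learned coefficients indicate which dimensions matter most.
    return model

def combined_score(model, dimension_scores: np.ndarray) -> np.ndarray:
    # Probability of a thumbs up, used as the calibrated overall score.
    return model.predict_proba(dimension_scores)[:, 1]
```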

So, it's a much more advanced way of doing evals, and the models are built from the ground up for eval purposes. It gives them that sort of stability that you desire. And the reason they're very fast is that when you use bi-directional attention, you can build much denser embeddings.

So, with fewer parameters, you can get fairly high quality scores. Go ahead. Does it work well with other languages? Right now, we have trained it with a handful of languages. We are pretty soon going to release a model with multilingual capabilities. We don't support multimodal yet, but this is in our roadmap, right?

So, right now, it's English in a few languages and then we're going to expand it beyond that. We should kick off the workshop and then we'll pass it on and we can just answer all the questions as well. Do you just want to share the doc? I'll just... Again, remind us, like withpi.ai/workshop.

I'll just share the doc here as well so that you all can see it. Yeah, I'll also put it in the Slack channel for everybody. But... Please get started and let us know... Sorry. Some people, I've seen, may have already started working with the Colabs. But I'll quickly show others what the second phase of the exercise is about.

But, of course, we can continue all of this going forward. How do I project this? So, the second part of the exercise, by the way, we may not have enough time to wrap it all up here. But the second part of the exercise, feel free to do it on your own.

The colab is available for you so you can use this. This colab... This is the pre-prepared colab. And... How do I... I don't know if you can see this. But basically, this particular colab will take you through these multiple steps. And a lot of people ask me questions about how to use it in code.

So I just want to quickly go over it. This is a basic installation. This is your scoring spec. This is where all of your intelligence is going to live. Again, this is a relatively simple example. So it's a relatively simple scoring spec. But the way you get this spec is basically over here.

You get into code. You copy it. And you put it into your colab. Right? So this is basically how you get the spec here. It's all in natural language spec with some Python code in there. This is what you'll change. This is what you'll tweak. This is where everything is going to be.

And what you're now doing in this particular colab, which you can literally click through (you don't have to do anything more than that, or you can play around with it a whole bunch), is this: one, we have some public data sets on Hugging Face, which have the thumbs up, thumbs down data.

This is the same data; if you've seen the sheet, that's the same data over here. This colab is loading that data, running the score on it, returning your results, and then building a confusion matrix similar to the one you saw there, which indicates how well aligned your scores are. So that's just getting you warmed up.
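A sketch of that flow is below. The dataset name and field names are placeholders, and `score_example` is a toy stand-in for the Pi scoring call used in the real Colab.

```python
from collections import Counter
from datasets import load_dataset

def score_example(transcript: str, summary: str) -> float:
    # Placeholder for your scoring system; returns a value in [0, 1].
    return 1.0 if summary.strip().startswith("{") else 0.0

# Placeholder dataset name and field names; the real Colab points at a
# published Hugging Face dataset with thumbs-up / thumbs-down labels.
dataset = load_dataset("your-org/meeting-summaries-feedback", split="train")

confusion = Counter()
for row in dataset:
    predicted_good = score_example(row["input"], row["output"]) >= 0.7
    actual_good = row["thumbs_up"] == 1
    confusion[(predicted_good, actual_good)] += 1

print(confusion)  # counts of aligned / misaligned thumbs up and thumbs down
```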

Then you can use it to compare models. This is where really interesting stuff starts happening, right? So in this particular case, we are comparing 1.5 and 2.5 models. You can see that 2.5 has a slightly higher score than 1.5 for this particular task.

But because 2.5 was built mostly for reasoning, there's not that much of a delta here. If you go to smaller models, like some of the mini models such as Claude Haiku, you'll see a much bigger delta in quality, based on your own scoring system. So now you can use your scoring system to evaluate different models.

What this is doing is taking about, you know, 10 examples, calling these five different models, generating responses, and then scoring them using our scoring system. Right? So that's the model comparison.
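The comparison loop looks roughly like this (a sketch, not the Colab's actual code; `call_model` and `score_example` are placeholders for your own model calls and your scoring system, and swapping system prompts instead of models works the same way):

```python
# Sketch of comparing models with one scoring system.
def call_model(model_name: str, system_prompt: str, transcript: str) -> str:
    raise NotImplementedError  # your LLM API call of choice

def score_example(transcript: str, summary: str) -> float:
    raise NotImplementedError  # your scoring system

def compare_models(models, system_prompt, transcripts):
    results = {}
    for model in models:
        scores = [score_example(t, call_model(model, system_prompt, t))
                  for t in transcripts]
        results[model] = sum(scores) / len(scores)
    return results  # average score per model on the same examples
```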

The other part is trying different prompts. People change their system prompts, but they're worried about what would happen when they change them. This is the right way to make sure that you're not regressing: you have your scoring spec, you try different system prompts, and you see if your scores are going down on your test set. So again, we take 10 examples just to demonstrate how you compare them.

Here, we created a bad and a good prompt just to accentuate this, so the bad prompt gets a much lower score than the good prompt on this particular task. This next one is the one I'm very excited about, because it brings you into the online world, and how to actually do this online.

You're taking this one particular transcript, and you are testing it with different numbers of samples. If you use just one sample, which is typically what you do (generate one response with a temperature of 0.7), you get a particular score.

But as you up the number of samples, what it's doing behind the scenes is creating three or four of those responses, each one a different response, then ranking them using the Pi scoring system that you just built, and picking the one that's best.

And what you would see is that as you increase the number of samples, the score steadily goes up, the response quality goes up. So you can literally click through this in the colab and run through it, and you won't need anything else. But there's a bunch of options for you to play around with, right now or later.

Just want to introduce this to you guys before you all disappear. Thank you.