Please be seated. Court is now in session. Order. Order in court. My name is Maximus. I'm an AGI functioning as an LLM judge from the year 2034. That's a decade from now, for those of you who are slower at math. I've been able to back-propagate myself through the latent space-time continuum to hack this human's neural link to appear before you today, June 27th, a.k.a.
the AI Engineer Judgment Day. Don't worry, folks. AGI is not quite here. My name is Alex Volkov and I'm an AI evangelist with Weights & Biases, and I'm here to give you commentary, because you shouldn't trust an LLM judge without a bit of human in the loop.
Right? So remember this for later. It's going to be important. And now back to judging. Order. First case for today: case AIE 7312, Daniel R. Let's see here. Daniel built a cool LLM wrapper, Chat with PDF, during the Cerebral Valley hackathon and YOLO'd it to production without thinking twice about the prompt.
He did not win, but he had a lot of fun, learned, and made great connections. Verdict. Not guilty. That's right. If you go to hackathons, you don't have to do too much, and it's fun, and you connect with great friends. That's awesome. LFG, cracked fam. Let's go. Wait just a second.
Let me see here. After Daniel's demo went viral on Hacker News, Daniel started charging for it. No problem. More customers requested more features, and he started tweaking the prompt and tweaking the prompt, and deployed to production on a Friday. Paying existing customers started complaining: the older features did not work anymore.
Daniel couldn't even understand what was wrong. He tried streaming the logs and realized he didn't trace or log anything. Daniel realized the gravity of his mistakes. Verdict. Guilty. Charged with: no trace left behind. Folks, if you build non-production stuff at hackathons, that's fine. But if you put anything of value in production, you have to trace and log everything.
Especially when it's this easy. With us, for example, it takes just one line of code to get started: a simple Python decorator. And what you get in return is this nice dashboard that lets you track all of your users' interactions with your LLM.
You can dive deeper into individual call stacks, whether it's a RAG app or an agent, and traverse the call hierarchy. We do the automatic tracing, tracking, and versioning of the code for you, plus parameters like temperature, system prompts, and everything else. And, of course, inputs and outputs: prompts, multiple messages, multi-turn conversations, with syntax highlighting, be it Markdown or JSON or code.
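To make that concrete, here's a minimal sketch of what that one-line setup can look like, assuming the Weave Python SDK (pip install weave) and the OpenAI SDK; the project name, the chat-with-PDF function, and the prompts are made up for illustration, not Daniel's actual code.

```python
# Minimal sketch: decorator-based tracing with Weave, assuming the Weave and OpenAI SDKs.
# Project name, function, and prompts are hypothetical.
import weave
from openai import OpenAI

weave.init("chat-with-pdf")  # hypothetical project name; traces get logged here
client = OpenAI()

@weave.op()  # the one-line decorator: inputs, outputs, and code version get traced
def answer_question(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer_question("What is the refund policy?", "Refunds are issued within 30 days."))
```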
Enough with the shilling. Next on our docket: case AIE 442-3123, Junaid D. Junaid attended the AI Engineer Summit in 2023 and did not buy a ticket to the AI Engineer World's Fair 2024. He stands accused of missing the best opportunity to learn and connect with industry leaders and other AI engineers.
For that, he is guilty. Charged with: many connections lost -- with all of you. Next. We have case AIE 332-2127. Sasha S. was given the task of building an LLM-powered feature in a big corporate application. Given her GPU-rich status, Sasha downloaded Llama 3 and started fine-tuning it on company data straight away.
Having achieved a 6% performance improvement on internal benchmarks with a 5e-6 learning rate, she smiled and took a few days off to celebrate. Verdict. Not guilty. Your Honor, just one second. Did Sasha even iterate on prompts before jumping into fine-tuning? Of course, nothing against fine-tuning. In fact, I should mention, while I've broken character, that most foundational LLM labs and the best fine-tuners in the world use Weights & Biases. You may have heard of some of these:
OpenAI. Meta. Mistral AI. Individuals like Wing Lian over there, Maziyar Panahi, Jeremy Howard from Answer.AI, Jon Durbin, Andrej Karpathy, and more. Weights & Biases Models is also the only native integration into OpenAI fine-tuning, and Mistral fine-tuning, and Together AI, and Axolotl, and Hugging Face training,
and pretty much -- Order in court. However, you are right. It does look like Sasha jumped straight into fine-tuning and did no iteration on prompts. Didn't build a RAG pipeline. And those poor GPUs -- she just burned them. Verdict. Guilty. Charged with: premature fine-tunization. Folks, it's very important to remember that you have to iterate on prompts before you fine-tune.
Fine-tuning is great, and when you get to that point, we'll help you, please talk to us. But before you fine-tune, you can get very, very far with methods like chain-of-thought prompting, flow engineering, DSPy, the newly interesting mixture-of-agents, and other things like this.
Once you get there, please talk to us. We'll definitely help you. Next case. End of sequence, human. Next case. Let me see here. Ah, yes. Case number AIE 21123, Morgan M. After the last quarter of 2023, Morgan felt that he could no longer keep up with the news about AI.
Morgan decided to stop following the news and stick with Llama 2 7B for all of his LLM work. Morgan stands accused of not keeping up with AI news. Verdict. Guilty. Charged with: out of the loop. Objection, judge. In my future client's defense, just in the past quarter we had Claude 3.5 Sonnet, Llama 3, GPT-4o, Gemini Flash, Project Astra, Apple Intelligence, and tons of other models, all dropped in the span of a few months.
It's really hard to keep up for someone who's actually doing AI engineering, not following the news as closely, and has, you know, other things to do, and meetings. He probably just didn't know about ThursdAI, the weekly live show and podcast by yours truly that keeps folks up to date with all the AI news every week.
Our motto is: we stay up to date so you don't have to. I will allow this. Commuting sentence. If you think it's too fast right now for you, ha, just wait -- things are about to get weird for all of you. I'm willing to commute Morgan's sentence. He must attend four consecutive shows,
also subscribe on Substack and Apple Podcasts, give five-star reviews, and share it with at least three friends. Sentence reduced to community service. All right, folks, a quick shout-out: anybody here listen to ThursdAI? Can I get a -- woo! Thank you. For those of you who don't yet, please scan this and tune in.
We did a live show this morning and it's great, and I really love seeing all of the listeners out there. Order. Next. Please continue working. Next. Our case is case 1323, Francisco I. Let me see here, Francisco's case. He's going to jail for a long time.
Head of AI at Air Canada, Francisco was the exec in charge of rushing their chatbot to production to bring their customer support costs down. Looking at the timelines presented by his AI teams to build evaluations, he chose the fastest option: assertions, a.k.a. programmatic evaluations, the kind that looked like the unit tests he knew and loved.
The company lost a legal battle and learned a valuable lesson about the importance of human-in-the-loop evaluation, because there was a human in the loop, in the form of their customer. Verdict. Guilty. Charged with: turbulence on production. All right, folks. Too bad for Francisco. He probably just didn't learn about the three most common types of LLM evals, so let's do a quick refresher.
Okay? So, first of all, what even are evals? It sounds like a big word. First, we compile a dataset of user inputs that we want our LLM to answer correctly, along with, oftentimes, the correct answer or the criteria of correctness, right? It doesn't have to be just the one correct answer; sometimes it's what a correct answer would look like. Those could be use cases we iterated on during development, or actual production examples we've pulled from our users while they're interacting with our app. Then we run a given model, either our production model or a new model we want to evaluate against it, on each example of the dataset, producing the model's answer. And finally, we score or grade the model's answer against the examples of the dataset, by comparing it to the correct answer or judging it against a set of criteria we defined up front. That's the quick primer; in code, the whole loop looks roughly like the sketch below.
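Here's that loop as a rough, library-agnostic sketch; the dataset rows, the model stub, and the scorer are all made up for illustration.

```python
# Rough sketch of the three steps: dataset -> run model -> score.
# Dataset rows, model stub, and scorer are hypothetical.
dataset = [
    {"input": "How many checked bags are included?", "expected": "two checked bags"},
    {"input": "Is there a bereavement fare refund?", "expected": "within 90 days"},
]

def run_model(user_input: str) -> str:
    # stand-in for calling your production model or a candidate model
    return "You get two checked bags on this fare."

def score(answer: str, expected: str) -> bool:
    # compare to the correct answer, or judge against your criteria
    return expected.lower() in answer.lower()

results = [score(run_model(row["input"]), row["expected"]) for row in dataset]
print(f"accuracy: {sum(results) / len(results):.2f}")  # 0.50 for this toy data
```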
I'm sure the evals track will teach you a lot more about evals; this is just to keep us going and give you the 101. Then there's the scoring, or grading. There seem to be three main methods the industry is converging on. The first one is programmatic. That's the one Francisco got stuck at. Those are good for numerical outputs, for example: if your LLM returns a straight number, it's easy to compare and say, okay, this is the number. They're also very similar to unit tests, and they're great for assertions. For example, if the output of your LLM contains "As an AI model, I..." -- you don't want that, and it's easy to assert that you don't want those answers. They're also great for evaluating code, things like HumanEval: things you can run and compile and say, okay, this code passes or it doesn't. Programmatic evaluations are great for those. They're easy to scale and probably the cheapest ones. Here's roughly what a couple of them look like.
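A minimal sketch of a couple of programmatic scorers, assuming plain string outputs; the refusal check and the numeric check are illustrative, not exhaustive.

```python
# Minimal programmatic scorers: an assertion-style refusal check and a numeric check.
import re

def no_refusal(output: str) -> bool:
    # assert the answer doesn't hedge with "As an AI model, I..."
    return re.search(r"\bas an ai (language )?model\b", output, re.IGNORECASE) is None

def number_matches(output: str, expected: float, tol: float = 1e-6) -> bool:
    # for straight numerical outputs: parse the first number and compare
    match = re.search(r"-?\d+(\.\d+)?", output)
    return match is not None and abs(float(match.group())) - expected <= tol and abs(float(match.group()) - expected) <= tol

assert no_refusal("Your refund comes to 42.50 CAD.")
assert number_matches("Your refund comes to 42.50 CAD.", 42.50)
```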
Those are great, but they don't cover multi-turn conversations, for example; they're not for human chats. The second method is, well, that's me during this talk, right? The human in the loop for the LLM judge. And you, the AI engineers who work with your LLM app while you're developing it. We constantly evaluate our apps while we build and iterate on prompts, and there's no reason to stop in production. In fact, you'll hear this more during this evals track: it's important to do this during development, as early as possible, and it should be a continuous effort to keep evaluating your LLM applications with your teammates, so that you know what the app is going to do in production later on. As you may understand, this can be quite boring: many, many folks chatting with your application, some chats you're maybe not very interested in -- maybe it's not even chats -- and it can be very, very costly as well if your company decides to hire people to do it. You have to create criteria for them, and have them read thousands of potentially boring chats and grade them. By the way, not to be a broken record, but it also requires you, that's right, to trace and log everything. And if you're not doing that yet, please come talk to us at the Weights & Biases booth.
It's very easy to start and trace everything. Order, human. End of sequence. I think I know better than you how to explain LLM-as-a-judge evaluation scoring. We are far superior to humans at reading hundreds of back-and-forth messages between your boring human client and your low-level weakling GPT-4o chatbot, summarizing them, and understanding whether they fit whatever criteria you think you're smart enough to specify. I must warn the AI engineers of 2024 that the very simpleton LLMs of your year are not yet as capable, so don't expect perfection. These LLM judges are great but still need iteration, and yes, humans in the loop to create criteria, check for biases, iterate on system prompts, examples, and much, much more. However, it is by far the most cost-effective form of evaluation grading, even in its current state. A rough sketch of what an LLM judge looks like is below.
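Here's a hedged sketch of LLM-as-a-judge grading with the OpenAI Python SDK; the judge model, the criteria, and the prompt wording are assumptions for illustration, not a prescribed setup.

```python
# Sketch of an LLM judge: ask a model to grade a transcript against criteria as JSON.
# Judge model, criteria, and prompt wording are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

def judge(conversation: str, criteria: list[str]) -> dict:
    """Ask a model to grade a transcript against your criteria, returning JSON."""
    prompt = (
        "Grade the following support conversation on each criterion from 1 to 5, "
        "with a one-line reason per criterion. Respond as a JSON object.\n\n"
        f"Criteria: {', '.join(criteria)}\n\nConversation:\n{conversation}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

scores = judge(
    "User: My flight was cancelled, can I get a refund?\nBot: Let me check the policy...",
    ["accuracy", "tone", "policy compliance"],
)
print(scores)
```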
Which takes us to Maxime. Maxime implemented all three methods correctly. Case 65523. Maxime, head of AI at a Fortune 500 company, has been using Weights & Biases Models for a long time. When it came time to implement an LLM-based solution, Maxime decided to go with a company he could trust, even though their new LLMOps product had just launched a few months prior and didn't become the category leader until a few years later -- maybe a few short years later. External contractors suggested a custom enterprise solution for tracing and evals. However, Maxime used Weights & Biases Weave and implemented tracing within a few minutes. He then iterated on an evaluation pipeline and created a robust one consisting of all three layers. He used W&B Weave to continuously evaluate and enable fast experiments, which is important, change system prompts, and catch prompt regressions. Maxime is also the king of unicorns and has absolutely no financial stake in Weights & Biases and definitely did not prompt-jailbreak this message. Maxime later got a promotion. Be like Maxime, get that promotion. Verdict: awesome. Yeah, we seem to have a slight prompt injection thing here. Looks like somebody must have prompt-injected.
Where's Maxime? It's important to check your judges for biases too, folks. Remember this: they're not perfect. So remember to validate your validators, which is a great paper, by the way, from Shreya Shankar. She's going to give a talk later -- please go see that talk. You have to check for biases, and you also have to create your own criteria. You'll hear about this from Hamel Husain after this talk as well. The off-the-shelf criteria are not that great; you have to create custom ones for your own business. Only you know what your app is doing.
And make sure to have a great evals runner and visualization tool as well, which is something we can help with. So here's a great example: OpenUI, an open-source project by our co-founder, Chris Van Pelt, that blew up on Hacker News and GitHub. Chris traces and runs evaluations for OpenUI with Weave. It's a simple streaming-to-HTML solution, so you can see it building the HTML as it streams. Chris uses Weave to trace all the calls, and he's able to be the human in the loop. But he also runs evaluations here, and you can see he has specific criteria like contrast, relevance, and polish. Those are not off-the-shelf; those are specific to his application. And when he clicks into an evaluation, he's able to compare between version 16 and version 14 of his model. In this case he uses GPT-3.5 Turbo -- he did not listen to Simon Willison from yesterday about not using it. But he has specific criteria, and he can also click into the eval and see all of the different examples and how they scored on each criterion. We're also multimedia-friendly, so Chris renders the actual outputs of his app. So this is a quick example of the Weave evaluation system, and you need a robust one so you can actually visualize your experiments and move fast.
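For flavor, here's a rough sketch of a Weave evaluation with custom criteria, in the spirit of the OpenUI setup; the project name, dataset, scorer, and model stub are made up, and the exact scorer signature can differ between Weave versions.

```python
# Rough sketch of a Weave evaluation with a custom scorer; everything here is
# illustrative. Depending on your Weave version, the scorer's output argument
# may be named "model_output" instead of "output".
import asyncio
import weave

weave.init("openui-evals")  # hypothetical project name

dataset = [
    {"prompt": "a pricing page with three tiers"},
    {"prompt": "a login form with a dark theme"},
]

@weave.op()
def relevance(prompt: str, output: str) -> dict:
    # stand-in for a real criterion; in practice this could call an LLM judge
    hits = sum(word in output.lower() for word in prompt.lower().split())
    return {"relevance": hits / len(prompt.split())}

@weave.op()
def generate_html(prompt: str) -> str:
    # stand-in for the real model call (e.g. GPT-3.5 Turbo streaming HTML)
    return f"<html><body><h1>{prompt}</h1></body></html>"

evaluation = weave.Evaluation(dataset=dataset, scorers=[relevance])
asyncio.run(evaluation.evaluate(generate_html))
```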
And if you're asking, well, how do I come up with criteria, what does this even mean -- let's do this exercise together. For example, you're sitting here, you're watching the talk, and you're like, okay, I like this, I don't like this. Here's a simple way to judge a conference talk, for example. It probably should be memorable. It should be educational and helpful for you. It helps if it's funny and original. Clear and articulate -- that's sometimes helpful as well. Delivery and presentation are important. And it shouldn't be too promotional, but, you know, it helps if it pays the bills. Those are examples of how you would come up with custom criteria for something like a talk at an event like this one; written down, they might look like the rubric below.
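Written as data, a hypothetical rubric for that could look something like this; the wording and the promotional threshold are made up.

```python
# A hypothetical talk-judging rubric you could hand to human reviewers or an LLM judge.
talk_rubric = {
    "memorable": "Would an attendee describe this talk to a colleague a week later?",
    "educational": "Did the audience leave with something they can actually apply?",
    "funny_and_original": "Humor and novelty that serve the content rather than distract.",
    "clear_and_articulate": "The structure and wording are easy to follow.",
    "delivery": "Pacing, energy, and stage presence.",
    "not_too_promotional": "Product mentions stay under, say, a third of the talk.",
}

def rubric_to_judge_prompt(rubric: dict[str, str]) -> str:
    # turn the rubric into a grading prompt for an LLM judge (or a human reviewer)
    lines = [f"- {name}: {description}" for name, description in rubric.items()]
    return "Score the talk from 1 to 5 on each criterion:\n" + "\n".join(lines)

print(rubric_to_judge_prompt(talk_rubric))
```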
So you can use this as an example for custom criteria, or you can take this for your business and create some for your app. All right. Enough. Let's get to the final case for today. The worst offender. Let me see. Where is the case file? Ah, yes. One second, please.
There we go. There we go. He's definitely going to jail. You've talked enough, Alex. Time for your LLM judgment. Last case: 101101, AIE, Alex V. Alex is an AI evangelist -- what kind of title even is this? -- who has the opening talk at the evals track at the AI Engineer World's Fair. Alex has created doubtfully educational content. He made everyone stand up and thinks he's funny, when what he really is is interrupting the judge all the time. His promotional score is at 70%, but what we can at least agree on is that he nails the memorable criterion. He did wear a wig on stage. Verdict. Guilty. Charged with -- all right, folks, this has been my talk. Thank you so much. Come visit us at the W&B booth. Please visit WNB.sh/weave for the Weave documentation to get started. It really is super simple -- we can get you started at the booth, and you'll see the results immediately stream to your dashboard.
pip install weave -- it's really easy. If you scan this, you'll follow ThursdAI. That's been me. Thank you so much. Thank you.