Judging LLMs: Alex Volkov

00:00:15.000 |
Court is now in session. Order. Order in court. 00:00:22.000 |
I'm an AGI functioning as a LLM judge from the year 2034. 00:00:29.000 |
That's a decade from now for those of you who are slower at math. 00:00:33.000 |
I've been able to back-propagate myself through the latent space-time continuum 00:00:38.000 |
to hack this human's neural link to appear before you today, 00:00:42.000 |
June 27th, a.k.a. the AI Engineer Judgment Day. 00:00:51.000 |
My name is Alex Volkov and I'm an AI evangelist with Weights and Biases. 00:00:55.000 |
I work at Weights and Biases and I'm here to give you commentary 00:00:58.000 |
because you shouldn't trust an LLM judge without a bit of human in the loop. 00:01:02.000 |
Right? So remember this for later. It's going to be important. 00:01:17.000 |
Daniel built a cool LLM wrapper, a chat-with-PDF app, during the Cerebral Valley hackathon 00:01:23.000 |
and YOLO'd it to production without thinking twice about the prompt. 00:01:29.000 |
He did not win, but he had a lot of fun, learned and made great connections. 00:01:37.000 |
That's right. If you go to hackathons, you don't have to do too much and it's fun and you connect with great friends. 00:01:49.000 |
After Daniel's demo went viral on Hacker News, Daniel started charging for it. No problem. 00:01:56.000 |
More customers requested more features and he started tweaking the prompt and tweaking the prompt and deployed to production on a Friday. 00:02:03.000 |
Paying existing customers started complaining. The older features did not work anymore. 00:02:08.000 |
Daniel couldn't even understand what was wrong. He tried streaming the logs and realized he hadn't traced or logged anything. 00:02:29.000 |
Folks, if you build non-production stuff in hackathons, that's fine. 00:02:35.000 |
But if you put anything of value in production, you have to trace and log everything. 00:02:40.000 |
With us, for example, it takes just one line of code to get started, with a simple Python decorator. 00:02:45.000 |
And what you get in response -- we'll get there -- is this nice dashboard that allows you to track all of your users' interactions with your LLM. 00:03:01.000 |
We can dive deeper into individual call stacks. 00:03:06.000 |
We'll do the automatic tracing, tracking, and versioning of the code for you, 00:03:10.000 |
along with parameters like temperature, system prompts, and everything else. 00:03:22.000 |
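As a hedged illustration of what that one-line setup can look like (a minimal sketch assuming the current `weave` Python package; the project name and the wrapped function are made-up placeholders, not shown in the talk):

```python
# Minimal tracing sketch with Weights & Biases Weave (assumed current API).
# The project name and function below are illustrative placeholders.
import weave

weave.init("chat-with-pdf")  # points traces at a W&B project

@weave.op()  # the "one line": traces inputs, outputs, and code versions
def answer_question(pdf_text: str, question: str) -> str:
    # Your real LLM call would go here; a canned reply keeps the sketch self-contained.
    return f"(stub answer to: {question})"

answer_question("...pdf contents...", "What is the refund policy?")
```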
On our docket, AIE case number 442-3123, Junaid D. 00:03:32.000 |
Junaid attended the AI Engineer Summit in 2023 and did not buy a ticket to the AI Engineer World's Fair 2024. 00:03:41.000 |
He stands accused of missing the best opportunity to learn and connect with industry leaders and other AI engineers. 00:04:03.000 |
Sasha S. was given the task of building an LLM powered feature in a big corporate application. 00:04:10.000 |
Given her GPU-rich status, Sasha downloaded Llama 3 and started fine-tuning it on company data straight away. 00:04:17.000 |
Having achieved a 6% performance improvement on internal benchmarks with a 5e-6 learning rate, 00:04:24.000 |
she smiled and took a few days off to celebrate. 00:04:34.000 |
Did Sasha even iterate on prompts before jumping into fine tuning? 00:04:40.000 |
In fact, I should mention, while I've broken character, that most foundational LLM labs and the best fine-tuners in the world use 00:05:00.000 |
Weights and Biases Models, which is also the only native integration into OpenAI fine-tuning. 00:05:14.000 |
It does look like Sasha jumped straight into fine tuning and did no iteration on prompts. 00:05:37.000 |
Folks, it's very important to remember that you have to iterate on prompts before you fine-tune. 00:05:49.000 |
Before you fine-tune, you can get very, very far with methods like chain-of-thought prompting, 00:05:53.000 |
flow engineering, DSPy, for example, the newly interesting mixture-of-agents approach, and other things like this. 00:06:10.000 |
Case number AIE 21123, Morgan M. After the last quarter of 2023, Morgan felt that he could no longer keep up with the news about AI. 00:06:22.000 |
Morgan decided to stop following the news and stick with Llama 2 7B for all of his LLM work. 00:06:28.000 |
Morgan stands accused of not keeping up with AI news. 00:06:41.000 |
In my future client's defense, just in the past quarter, we had Claude 3.5 Sonnet, Llama 3, GPT-4o, Gemini Flash, 00:06:49.000 |
Project Astra, Apple Intelligence, and just tons of other models all dropped in the span of a few months. 00:06:54.000 |
It's really hard to keep up for someone who's actually doing AI engineering, not following the news as closely, and has, you know, other things to do and meetings to be in. 00:07:04.000 |
He probably just didn't know about ThursdAI, the weekly live show and podcast by yours truly that keeps folks up to date with all the AI news every week. 00:07:11.000 |
Our motto is we stay up to date so you don't have to. 00:07:18.000 |
If you guys think that it's too fast right now for you, haha, just wait. 00:07:22.000 |
Things are about to get weird for all of you. 00:07:30.000 |
Also subscribe on Substack and Apple Podcasts, give five-star reviews, and share it with at least three friends. 00:07:51.000 |
For those of you who don't yet, please scan this and tune in. 00:07:53.000 |
We did a live show this morning and it's great and I really love just seeing all of the listeners out there. 00:08:16.000 |
Head of AI at Air Canada, Francisco was the exec in charge of rushing their chatbot to production. 00:08:25.000 |
Looking at timelines presented by their AI teams to build evaluations, he chose the fastest option, 00:08:34.000 |
which kind of looked like the unit tests that he knew and loved. 00:08:38.000 |
The company lost a legal battle and learned a valuable lesson about the importance of human-in-the-loop evaluation, 00:08:44.000 |
because there was a human in the loop in the form of their customer. 00:08:54.000 |
He probably just didn't know about the, like, three most common types of LLM evals, so let's go over them. 00:09:15.000 |
First we compile a data set of user inputs that we want our LLM to answer correctly. 00:09:22.000 |
And oftentimes the correct answer, or the criteria of correctness, goes along with each input. 00:09:28.000 |
So it doesn't have to be just the one right answer; 00:09:29.000 |
sometimes it's more like what a correct answer would look like. 00:09:33.000 |
Those could be use cases we iterated on during development or actual production examples 00:09:37.000 |
that we've pulled from our users while they're interacting with our app. 00:09:42.000 |
Then we run either our production model or a new model we want to evaluate against our production model 00:09:46.000 |
on each example of the data set, producing the model's answer. 00:09:50.000 |
And finally, we score or grade the model's answer against the examples of the data set by comparing 00:09:56.000 |
it to the correct answer or judging it against a set of criteria that we had before. 00:10:01.000 |
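As a rough, framework-free illustration of those three steps (dataset, run, score) -- the example rows, model stub, and grading rule below are placeholders, not anything shown in the talk:

```python
# Sketch of the three eval steps: (1) a dataset of inputs with expected answers
# or criteria, (2) run the model on each example, (3) score the answers.
dataset = [
    {"input": "What is the refund window?", "expected": "30 days"},
    {"input": "Do you ship to Canada?", "expected": "yes"},
]

def call_model(prompt: str) -> str:
    # Stand-in for your production model or the candidate you want to compare against it.
    return "We offer a 30 days refund window."

def grade(answer: str, expected: str) -> bool:
    # Simplest possible criterion: the expected string appears in the answer.
    return expected.lower() in answer.lower()

scores = [grade(call_model(row["input"]), row["expected"]) for row in dataset]
print(f"accuracy: {sum(scores) / len(scores):.0%}")
```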
I'm sure that the evals track will teach you a lot more about evals. 00:10:04.000 |
So this is just to keep us going along and give you, like, a 101. 00:10:11.000 |
There seem to be three main methods that the industry is kind of converging upon. 00:10:19.000 |
The first, programmatic evaluations, are good for numerical outputs, for example. 00:10:22.000 |
If your LLM returns a straight number, for example, it's easy to compare and say, okay, this matches or it doesn't. 00:10:33.000 |
Or, for example, if the output of your LLM consists of "As an AI model, I'm something...", 00:10:39.000 |
it's easy to assert that you don't want those answers. 00:10:42.000 |
And those are also great for evaluating code, for example. 00:10:46.000 |
Things you can run and compile and say, okay, this code passes or doesn't pass. 00:10:49.000 |
Programmatic evaluations are great for those. 00:10:55.000 |
They don't cover multi-turn conversations, for example. 00:11:01.000 |
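As a quick hedged sketch of what such programmatic checks can look like (the refusal pattern and numeric tolerance below are arbitrary examples, not from the talk):

```python
import re

def no_refusal(output: str) -> bool:
    # Assert the model didn't fall back to "As an AI model, I..." boilerplate.
    return re.search(r"as an ai (language )?model", output, re.IGNORECASE) is None

def numeric_match(output: str, expected: float, tol: float = 1e-6) -> bool:
    # For straight numerical outputs, parse and compare within a tolerance.
    try:
        return abs(float(output.strip()) - expected) <= tol
    except ValueError:
        return False

assert no_refusal("The refund window is 30 days.")
assert numeric_match("42.0", 42)
```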
The second one is human evaluation -- well, that's me during this talk, right? 00:11:06.000 |
And you, the AI engineers who work with your LLM app while you're developing it. 00:11:11.000 |
We constantly evaluate our apps while we build and iterate on prompts. 00:11:17.000 |
In fact, you'll hear this more during this eval track. 00:11:19.000 |
It's important to do this during the development as early as possible. 00:11:23.000 |
And it should be a continuous effort to keep evaluating your LLM applications 00:11:29.000 |
with your teammates, so that you'll know what the app is going to do in production later on. 00:11:36.000 |
As you may understand, this can be quite boring. 00:11:39.000 |
Many, many folks chatting with your application. 00:11:42.000 |
Some chats maybe you're not very interested in. 00:11:45.000 |
And it can be very, very costly as well if your company decides to hire people to do it, 00:11:52.000 |
have them read thousands of potentially boring chats, and grade them. 00:11:58.000 |
It also requires you -- that's right -- to trace and log everything. 00:12:01.000 |
And if you're not doing that yet, please come talk to us at the Weights and Biases booth. 00:12:05.000 |
It's very easy to start tracing everything. 00:12:11.000 |
I think I know better than you how to explain LLM-as-a-judge. 00:12:16.000 |
We are far superior to humans at reading hundreds of back-and-forth messages between your boring human clients and your app, 00:12:25.000 |
summarizing those, and understanding if they fit whatever criteria you think you're smart enough to specify. 00:12:31.000 |
I must warn the AI engineers of 2024 that the very simpleton LLMs of your year are not yet as capable. 00:12:41.000 |
These LLM judges are great but still need iteration. 00:12:44.000 |
And yes, humans in the loop to create criteria, check for biases, iterate on system prompts, examples, and much, much more. 00:12:51.000 |
However, it is by far the most cost-effective version of evaluation grading, even in its current state. 00:13:01.000 |
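For readers who want the shape of an LLM judge in code, here is a minimal sketch using the OpenAI Python client; the judge model, criteria, and prompt wording are illustrative assumptions, and, as the judge itself admits, they need iteration, bias checks, and human review.

```python
# Minimal LLM-as-a-judge sketch. Model name, criteria, and prompt are
# illustrative assumptions; expect to iterate on all of them.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a chatbot reply against these criteria:
1. Relevance: does it answer the user's question?
2. Faithfulness: does it avoid claims the provided context doesn't support?
Reply with PASS or FAIL and one sentence of justification."""

def judge(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # example judge model, not a recommendation
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return response.choices[0].message.content

print(judge("What is the refund window?", "Our refund window is 30 days."))
```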
Maxime implemented all three methods correctly. 00:13:08.000 |
Maxime, a head of AI at a Fortune 500 company, has been using Weights and Biases Models for a long time. 00:13:15.000 |
When it came time to implement an LLM-based solution, Maxime decided to go with a company he could trust 00:13:21.000 |
even though their new LLM Ops product just launched a few months prior and didn't become the category leader until a few years later. 00:13:31.000 |
External contractors suggested a custom enterprise solution for tracing and evals. 00:13:35.000 |
However, Maxime used Weights and Biases Weave and implemented tracing within a few minutes. 00:13:41.000 |
He then iterated on an evaluation pipeline and created a robust one consisting of all three layers. 00:13:46.000 |
He used W&B Weave to continuously evaluate and enable fast experiments (which is important), change system prompts, and catch prompt regressions. 00:13:54.000 |
Maxime is also the king of unicorns and has absolutely no financial stake in weights and biases and definitely did not prompt jailbreak this message. 00:14:01.000 |
Maxime later got a promotion. Be like Maxime, get that promotion. Verdict: awesome. 00:14:05.000 |
Yeah, we seem to have, like, a slight prompt injection thing here. 00:14:13.000 |
Looks like somebody must have prompt injected. 00:14:16.000 |
It's important to check your judges also for biases, folks. 00:14:22.000 |
So you have to check your -- remember to validate your validators, which is a great paper, by the way, from Shreya Shankar. 00:14:34.000 |
You'll hear about this from Hamel Husain after this talk as well. 00:14:39.000 |
The off-the-shelf criteria are not that great. 00:14:42.000 |
You have to create your custom ones for your own business. 00:14:47.000 |
And make sure to have a great evals runner and visualization tool as well, which is something we can help with. 00:14:56.000 |
This is an open source project by our co-founder, Chris Van Pelt. 00:15:02.000 |
Chris traces and runs evaluations for this OpenUI project with Weave. 00:15:06.000 |
So here it's a simple streaming to HTML solution. 00:15:10.000 |
So you can see it's building HTML as it's streaming it. 00:15:14.000 |
And here Chris uses Weave to trace all the calls. 00:15:19.000 |
And he's able to be the human in the loop. 00:15:23.000 |
So you can see he has specific criteria like contrast, relevance, and polish. 00:15:30.000 |
And while he clicks into evaluation, he's able to compare between version 16 and version 14 of the model that he has. 00:15:39.000 |
He did not listen to Simon Willison's advice from yesterday to not use this. 00:15:45.000 |
And he also can click in into the eval and see all of the different examples and the specific criteria. 00:15:52.000 |
So Chris renders the actual outputs of his thing. 00:15:55.000 |
So this is a quick example of the Weave evaluation system. 00:15:59.000 |
And you have to have a robust one to be able to actually visualize your experiments to be able to move fast. 00:16:06.000 |
And if you're asking, well, how can I come up with criteria? 00:16:15.000 |
Here's a simple way to judge a conference talk, for example. 00:16:21.000 |
It should be educational and helpful for you. 00:16:32.000 |
But, you know, it helps if, you know, it pays the bills. 00:16:35.000 |
So those are, like, examples of how you would come up with custom criteria for something like a talk at an event like this one, for example. 00:16:44.000 |
So you can use this as an example for custom criteria. 00:16:47.000 |
Or you can take this for your business and create some for your app. 00:17:26.000 |
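As a toy illustration of turning such criteria into something a human grader or an LLM judge can score -- the criterion names echo the talk's joke rubric, but the structure and weights are assumptions:

```python
# Toy rubric: the talk's example criteria expressed as a scorable structure.
talk_rubric = {
    "educational": "Did the audience learn something they can apply?",
    "promotional": "How much of the talk was a product pitch?",
    "memorable": "Will the audience remember it next week?",
}

def score_talk(grades: dict[str, int]) -> float:
    # grades maps criterion name -> a 1-5 rating from a human or an LLM judge.
    return sum(grades.get(name, 0) for name in talk_rubric) / (5 * len(talk_rubric))

print(score_talk({"educational": 4, "promotional": 3, "memorable": 5}))
```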
Who has the opening talk at the evals track at the AI Engineer World's Fair. 00:17:30.000 |
Alex has created content of doubtful educational value. 00:17:33.000 |
He made everyone stand up and thinks he's funny, when what he's really doing is interrupting the judge all the time. 00:17:39.000 |
His promotional score is at 70%, but what we can at least agree on is that he meets the memorable criterion. 00:18:05.000 |
Please visit WNB.sh/weave for the Weave documentation to get started. 00:18:12.000 |
You'll see the results immediately stream to your thing. 00:18:16.000 |
If you scan this, you'll follow ThursdAI.