
Judging LLMs: Alex Volkov



00:00:03.000 | Please be seated.
00:00:15.000 | Court is now in session. Order. Order in court.
00:00:19.000 | My name is Maximus.
00:00:22.000 | I'm an AGI functioning as an LLM judge from the year 2034.
00:00:29.000 | That's a decade from now for those of you who are slower at math.
00:00:33.000 | I've been able to back-propagate myself through the latent space-time continuum
00:00:38.000 | to hack this human's neural link to appear before you today,
00:00:42.000 | June 27th, a.k.a. the AI Engineer Judgment Day.
00:00:47.000 | Don't worry, folks. AGI is not quite here.
00:00:51.000 | My name is Alex Volkov and I'm an AI evangelist with Weights and Biases.
00:00:55.000 | I work at Weights and Biases and I'm here to give you commentary
00:00:58.000 | because you shouldn't trust an LLM judge without a bit of human in the loop.
00:01:02.000 | Right? So remember this for later. It's going to be important.
00:01:04.000 | And now back to judging.
00:01:07.000 | Order.
00:01:09.000 | First case for today. Case AIE 7312.
00:01:13.000 | Daniel R. Let's see here.
00:01:17.000 | Daniel has built a cool chat-with-PDF LLM wrapper during the Cerebral Valley Hackathon
00:01:23.000 | and YOLO'd it to production without thinking twice about the prompt.
00:01:29.000 | He did not win, but he had a lot of fun, learned and made great connections.
00:01:34.000 | Verdict. Not guilty.
00:01:37.000 | That's right. If you go to hackathons, you don't have to do too much and it's fun and you connect with great friends.
00:01:43.000 | That's awesome. LFG, crack fam. Let's go.
00:01:45.000 | Wait just a second. Let me see here.
00:01:49.000 | After Daniel's demo went viral on Hacker News, Daniel started charging for it. No problem.
00:01:56.000 | More customers requested more features and he started tweaking the prompt and tweaking the prompt and deployed to production on a Friday.
00:02:03.000 | Paying existing customers started complaining. The older features did not work anymore.
00:02:08.000 | Daniel couldn't even understand what was wrong. He tried streaming the logs and realized he didn't trace or log anything.
00:02:16.000 | Daniel realized the gravity of his mistakes.
00:02:19.000 | Verdict. Guilty.
00:02:24.000 | Charged with no trace left behind.
00:02:29.000 | Folks, if you build non-production stuff in hackathons, that's fine.
00:02:35.000 | But if you put anything of value in production, you have to trace and log everything.
00:02:39.000 | Especially when it's this easy.
00:02:40.000 | With Weave, for example, it takes just one line of code to get started, with a simple Python decorator.
00:02:45.000 | And what you get in return -- we'll get there -- is this nice dashboard that allows you to track all your user interactions with your LLM.
00:03:01.000 | We can dive deeper into individual call stacks.
00:03:04.000 | Whether it's a RAG app or an agent.
00:03:05.000 | Traverse the call hierarchy.
00:03:06.000 | We'll do the automatic tracing, tracking and versioning of the code for you.
00:03:10.000 | Parameters like temperature, system prompts, and everything else.
00:03:13.000 | And, of course, inputs and outputs.
00:03:15.000 | Prompts.
00:03:16.000 | Multiple messages.
00:03:17.000 | Multi-turn conversations.
00:03:18.000 | Syntax highlighting.
00:03:19.000 | Be it markdown or JSON or code.
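As a concrete illustration of that one-liner, here is a minimal sketch using Weave's Python decorator. The project name, the model name, and the answer_question function are placeholders for illustration, not code shown in the talk:

    import weave
    from openai import OpenAI

    # Initialize Weave once; "pdf-chat" is a placeholder project name.
    weave.init("pdf-chat")

    client = OpenAI()

    # The decorator is the "one line": every call to this function gets traced,
    # including inputs, outputs, latency, and the nested OpenAI call inside it.
    @weave.op()
    def answer_question(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        print(answer_question("What does clause 4.2 of this PDF say?"))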
00:03:20.000 | Enough with the shilling.
00:03:21.000 | Next.
00:03:22.000 | On our docket, AIE case number 442-3123, Junaid D.
00:03:32.000 | Junaid attended the AI Engineer Summit in 2023 and did not buy a ticket to AI Expo 2024.
00:03:41.000 | He stands accused of missing the best opportunity to learn and connect with industry leaders and other AI engineers.
00:03:48.000 | For that, he is guilty.
00:03:52.000 | Charged with many connections lost.
00:03:55.000 | With all of you.
00:03:56.000 | Next.
00:03:57.000 | We have AIE case 332-2127.
00:04:03.000 | Sasha S. was given the task of building an LLM-powered feature in a big corporate application.
00:04:10.000 | Given her GPU-rich status, Sasha downloaded Llama 3 and started fine-tuning it on company data straight away,
00:04:17.000 | achieving a 6% performance improvement on internal benchmarks with a 5e-6 learning rate.
00:04:24.000 | She smiled and took a few days off to celebrate.
00:04:27.000 | Verdict.
00:04:28.000 | Not guilty.
00:04:31.000 | Your Honor, just one second.
00:04:34.000 | Did Sasha even iterate on prompts before jumping into fine-tuning?
00:04:38.000 | Of course, nothing against fine-tuning.
00:04:40.000 | In fact, I should mention, while I broke character, that most foundational LLM labs and the best fine-tuners in the world use Weights and Biases.
00:04:46.000 | You may have heard of some of these.
00:04:49.000 | OpenAI.
00:04:50.000 | Meta.
00:04:51.000 | Mistral AI.
00:04:52.000 | Individuals like Wing Lian over there.
00:04:54.000 | Maziyar Panahi.
00:04:55.000 | Jeremy Howard from Answer.AI.
00:04:57.000 | Jon Durbin.
00:04:58.000 | Andrej Karpathy.
00:04:59.000 | And more.
00:05:00.000 | Weights and Biases Models is also the only native integration into OpenAI fine-tuning.
00:05:04.000 | And Mistral fine-tuning.
00:05:05.000 | And Together AI.
00:05:07.000 | And Axolotl.
00:05:08.000 | And Hugging Face training.
00:05:09.000 | And pretty much -- Order in court.
00:05:11.000 | However, you are right.
00:05:14.000 | It does look like Sasha jumped straight into fine-tuning and did no iteration on prompts.
00:05:23.000 | Didn't build a RAG pipeline.
00:05:26.000 | And those poor GPUs.
00:05:28.000 | She just burned them.
00:05:29.000 | Verdict.
00:05:30.000 | Guilty.
00:05:31.000 | Charged with premature fine-tunization.
00:05:37.000 | Folks, it's very important to remember that you have to iterate on prompts before you start fine-tuning.
00:05:45.000 | Fine-tuning is great, and when you get to that point, we'll help you.
00:05:48.000 | Please talk to us.
00:05:49.000 | But before you fine-tune, you can get very, very far with methods like chain-of-thought prompting,
00:05:53.000 | flow engineering, DSPy, the newly interesting mixture-of-agents, and other approaches like that.
00:05:59.000 | Once you get there, please talk to us.
00:06:00.000 | We'll definitely help you.
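To make the first of those methods concrete, chain-of-thought prompting is often just a template that asks the model to reason before it answers. This is a made-up sketch; the airline wording and output format are invented for illustration:

    # A made-up chain-of-thought prompt template: ask the model to reason
    # step by step first, and only then commit to a final answer.
    COT_TEMPLATE = """You are a support assistant for an airline.

    Question: {question}

    Think through the relevant policy step by step first.
    Then give the final answer on a line starting with "Answer:".
    """

    def build_prompt(question: str) -> str:
        return COT_TEMPLATE.format(question=question)

    print(build_prompt("Can I get a refund on a bereavement fare?"))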
00:06:02.000 | Next case.
00:06:03.000 | End of sequence, human.
00:06:04.000 | Next case.
00:06:05.000 | Let me see here.
00:06:08.000 | Ah, yes.
00:06:10.000 | Case number AIE 21123, Morgan M. After the last quarter of 2023, Morgan felt that he could no longer keep up with the news about AI.
00:06:22.000 | Morgan decided to stop following the news and stick with Llama 2 7B for all of his LLM work.
00:06:28.000 | Morgan stands accused of not keeping up with AI news.
00:06:31.000 | Verdict.
00:06:32.000 | Guilty.
00:06:33.000 | Charged with?
00:06:36.000 | Out of the loop.
00:06:39.000 | Objection, judge.
00:06:41.000 | In my future client's defense, just in the past quarter, we had Claude 3.5 Sonnet, Llama 3, GPT-4o, Gemini Flash,
00:06:49.000 | Project Astra, Apple Intelligence, and just tons of other models all dropped in the span of a few months.
00:06:54.000 | It's really hard to keep up for someone who's actually doing AI engineering, who isn't following the news as closely, and who has, you know, other things to do and meetings.
00:07:04.000 | He probably just didn't know about ThursdAI, the weekly live show and podcast by yours truly that keeps folks up to date with all the AI news every week.
00:07:11.000 | Our motto is we stay up to date so you don't have to.
00:07:15.000 | I will allow this plea for a commuted sentence.
00:07:18.000 | If you guys think that it's too fast right now for you, haha, just wait.
00:07:22.000 | Things are about to get weird for all of you.
00:07:25.000 | I'm willing to commute Morgan's sentence.
00:07:27.000 | He must attend four consecutive shows.
00:07:30.000 | He must also subscribe to the Substack and on Apple Podcasts, give five-star reviews, and share it with at least three friends.
00:07:37.000 | Sentence reduced to community service.
00:07:41.000 | All right, folks.
00:07:44.000 | A quick shout out.
00:07:46.000 | Anybody here listen to ThursdAI?
00:07:48.000 | Can I get a --
00:07:50.000 | Thank you.
00:07:51.000 | For those of you who don't yet, please scan this and tune in.
00:07:53.000 | We did a live show this morning and it's great and I really love just seeing all of the listeners out there.
00:07:58.000 | Order.
00:07:59.000 | Next.
00:08:00.000 | Please continue working.
00:08:01.000 | Next.
00:08:02.000 | Our next case is case 1323.
00:08:06.000 | Francisco A.
00:08:07.000 | Let me see here Francisco's case.
00:08:11.000 | He's going to jail for a long time.
00:08:16.000 | Head of AI at Air Canada, Francisco was the exec in charge of rushing their chatbot to production
00:08:22.000 | to bring their customer support costs down.
00:08:25.000 | Looking at timelines presented by their AI teams to build evaluations, he chose the fastest option.
00:08:31.000 | Assertions,
00:08:32.000 | a.k.a.
00:08:33.000 | programmatic evaluations,
00:08:34.000 | that kind of looked like the unit tests he knew and loved.
00:08:38.000 | The company lost a legal battle and learned a valuable lesson about the importance of human-in-the-loop evaluation.
00:08:44.000 | Because there was a human-in-the-loop in the form of their customer.
00:08:48.000 | Verdict.
00:08:49.000 | Guilty.
00:08:50.000 | Charged with.
00:08:51.000 | Turbulence on production.
00:08:52.000 | All right, folks.
00:08:53.000 | Too bad for Francisco.
00:08:54.000 | He probably just didn't know about, like, the three most common types of LLM evals, so let's
00:09:09.000 | do a quick refresher.
00:09:10.000 | Okay?
00:09:11.000 | So, first of all, what are evals even?
00:09:13.000 | Some think it's, like, a big word.
00:09:15.000 | First we compile a data set of user inputs that we want our LLM to answer correctly.
00:09:22.000 | And oftentimes we also have the correct answer, or the criteria for correctness, right?
00:09:28.000 | So it doesn't have to be just the one exact correct answer.
00:09:29.000 | Sometimes it's a description of what a correct answer would look like.
00:09:33.000 | Those could be use cases we iterated on during development or actual production examples
00:09:37.000 | that we've pulled from our users while they're interacting with our app.
00:09:41.000 | Then we run a given model.
00:09:42.000 | Either our production model or a new model we want to evaluate against our production model.
00:09:46.000 | On each example of the data set, producing the model's answer.
00:09:50.000 | And finally, we score or grade the model's answer against the examples of the data set by comparing
00:09:56.000 | it to the correct answer or judging it against a set of criteria that we had before.
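Spelled out in code, that whole loop is tiny. Here is a self-contained sketch where the dataset, the model call, and the scorer are all stand-ins for your real ones:

    # Minimal eval loop: dataset -> model answers -> scores.
    dataset = [
        {"input": "What is 2 + 2?", "expected": "4"},
        {"input": "What is the capital of France?", "expected": "Paris"},
    ]

    def run_model(prompt: str) -> str:
        # Stand-in for your production (or candidate) model call.
        canned = {"What is 2 + 2?": "4", "What is the capital of France?": "Paris"}
        return canned.get(prompt, "")

    def score(answer: str, expected: str) -> bool:
        # Simplest possible grader: exact match against the correct answer.
        return answer.strip().lower() == expected.strip().lower()

    results = [score(run_model(row["input"]), row["expected"]) for row in dataset]
    print(f"accuracy: {sum(results) / len(results):.2f}")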
00:10:00.000 | That's a quick primer.
00:10:01.000 | I'm sure that the evals track will teach you a lot more about evals.
00:10:04.000 | So this is just to keep us going along and give you, like, a 101.
00:10:09.000 | Then there's also the scoring, or grading.
00:10:11.000 | There seem to be three main methods that the industry is kind of converging upon.
00:10:15.000 | So the first one is programmatic.
00:10:17.000 | That's the one Francisco got stuck on.
00:10:19.000 | Those are good for numerical outputs, for example.
00:10:22.000 | If your LLM returns a straight number, for example, it's easy to compare and say, okay,
00:10:26.000 | this is the number.
00:10:27.000 | Those are also very similar to unit tests.
00:10:30.000 | And those are great for assertions.
00:10:33.000 | For example, if the output of your LLM contains "As an AI model, I..." or something like that.
00:10:38.000 | You don't want that.
00:10:39.000 | So it's easy to assert that you don't want those answers.
00:10:42.000 | And those are also great for evaluating code, for example.
00:10:45.000 | Things like human eval.
00:10:46.000 | Things you can run and compile and say, okay, this code passes or doesn't pass.
00:10:49.000 | Programmatic evaluations are great for those.
00:10:52.000 | Easier to scale.
00:10:53.000 | Probably the cheapest ones.
00:10:54.000 | Those are great.
00:10:55.000 | They don't cover multi-turn conversations, for example.
00:10:57.000 | They're not for human chats.
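For a sense of what those programmatic, assertion-style scorers look like in practice, here are two toy examples; the refusal phrase and the numeric tolerance are arbitrary choices for illustration, not a standard:

    # Two assertion-style (programmatic) scorers.

    def no_refusal(output: str) -> bool:
        # Assert the model didn't fall back to "As an AI model..." boilerplate.
        return "as an ai model" not in output.lower()

    def close_enough(output: str, expected: float, tol: float = 0.01) -> bool:
        # For numerical outputs: parse the number and compare within a tolerance.
        try:
            return abs(float(output) - expected) <= tol
        except ValueError:
            return False

    assert no_refusal("Your refund will arrive in 5-7 business days.")
    assert close_enough("42.0", 42)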
00:11:01.000 | The second one is, well, that's me during this talk, right?
00:11:04.000 | The human in the loop for the LLM judge.
00:11:06.000 | And you, the AI engineers, who work with your LLM app while you're developing it.
00:11:11.000 | We constantly evaluate our apps while we build and iterate on prompts.
00:11:14.000 | There's no reason to stop once you're in production.
00:11:17.000 | In fact, you'll hear this more during this eval track.
00:11:19.000 | It's important to do this during development, as early as possible.
00:11:23.000 | And it should be a continuous effort to keep evaluating your LLM applications with your teammates,
00:11:29.000 | so that you'll know what the app is going to do in production later on.
00:11:36.000 | As you may understand, this can be quite boring.
00:11:39.000 | Many, many folks chatting with your application.
00:11:42.000 | Some chats maybe you're not very interested in.
00:11:44.000 | Maybe it's not even chats.
00:11:45.000 | And it can be very, very costly as well if your company decides to hire people to do it.
00:11:50.000 | You have to create criteria for them.
00:11:52.000 | Have them read thousands of potentially boring chats and grade them.
00:11:56.000 | By the way, not to be a broken record,
00:11:58.000 | it also requires you -- that's right -- to trace and log everything.
00:12:01.000 | And if you're not doing that yet, please come talk to us at the Weights and Biases booth.
00:12:05.000 | It's very easy to start and trace everything.
00:12:09.000 | Order, human, end of sequence.
00:12:11.000 | I think I know more than you how to explain LLM-as-a-judge
00:12:15.000 | evaluation scoring.
00:12:16.000 | We are far superior to humans at reading hundreds of back-and-forth messages between your boring human client
00:12:22.000 | and your low-level weakling GPT-4o chatbot,
00:12:25.000 | summarizing those, and understanding whether they fit whatever criteria you think you're smart enough to specify.
00:12:31.000 | I must warn the AI engineers of 2024 that the very simpleton LLMs of your year are not yet as capable.
00:12:39.000 | And so don't expect perfection.
00:12:41.000 | These LLM judges are great but still need iteration.
00:12:44.000 | And yes, humans in the loop to create criteria, check for biases, iterate on system prompts, examples, and much, much more.
00:12:51.000 | However, it is by far the most cost-effective version of evaluation grading, even in its current state.
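A bare-bones LLM-as-a-judge scorer can look like the sketch below. The judge model, the rubric wording, and the 1-to-5 scale are assumptions chosen to show the shape of it; in practice you iterate on this prompt and check it for biases, exactly as the talk says:

    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading a chatbot conversation against one criterion.

    Criterion: {criterion}

    Conversation:
    {conversation}

    Reply with a single digit from 1 (fails the criterion) to 5 (fully meets it)."""

    def judge(conversation: str, criterion: str) -> int:
        # Placeholder judge model; swap in whatever model you actually use.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                criterion=criterion, conversation=conversation)}],
        )
        return int(response.choices[0].message.content.strip())

    print(judge("User: My flight was cancelled.\nBot: I've rebooked you on the next flight.",
                "Resolved the user's issue politely"))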
00:12:59.000 | Which takes us to Maxime.
00:13:01.000 | Maxime implemented all three methods correctly.
00:13:05.000 | Case 65523.
00:13:08.000 | Maxime is a head of AI at a Fortune 500 company, and has been using Weights and Biases Models for a long time.
00:13:15.000 | When it came time to implement an LLM-based solution, Maxime decided to go with a company he could trust
00:13:21.000 | even though their new LLM Ops product just launched a few months prior and didn't become the category leader until a few years later.
00:13:29.000 | Maybe a few short years later.
00:13:31.000 | External contractors suggested a custom enterprise solution for tracing and evals.
00:13:35.000 | However, Maxime used Weights and Biases Weave and implemented tracing within a few minutes.
00:13:41.000 | He then iterated on his evaluation pipeline and created a robust pipeline consisting of all three layers.
00:13:46.000 | He used W&B Weave to continuously evaluate and enable fast experiments, which is important, to change system prompts, and to catch prompt regressions.
00:13:54.000 | Maxime is also the king of unicorns and has absolutely no financial stake in weights and biases and definitely did not prompt jailbreak this message.
00:14:01.000 | Maxime later got a promotion, be like Maxime, get that promotion, verdict, awesome.
00:14:05.000 | Yeah, we seem to have, like, a slight prompt injection thing here.
00:14:13.000 | Looks like somebody must have prompt injected.
00:14:15.000 | Where's Maxime?
00:14:16.000 | It's important to check your judges also for biases, folks.
00:14:19.000 | Remember this.
00:14:20.000 | They're not perfect.
00:14:21.000 | There's an issue with this.
00:14:22.000 | So you have to check your -- remember to validate your validators, which is a great paper, by the way, from Shreya Shankar.
00:14:28.000 | She's going to give a talk later.
00:14:29.000 | Please go see that talk.
00:14:31.000 | You have to check for biases.
00:14:32.000 | You have to also create your own criteria.
00:14:34.000 | You'll hear about this from Hamel Husain after this talk as well.
00:14:39.000 | The off-the-shelf criteria are not that great.
00:14:42.000 | You have to create your custom ones for your own business.
00:14:45.000 | Only you know what your app is doing.
00:14:47.000 | And make sure to have a great evals runner and visualization tool as well, which is something we can help with.
00:14:53.000 | So here's a great example.
00:14:55.000 | This is OpenUI.
00:14:56.000 | This is an open-source project by our co-founder, Chris Van Pelt.
00:15:00.000 | That blew up on Hacker News and GitHub.
00:15:02.000 | Chris traces and runs evaluations for OpenUI with Weave.
00:15:06.000 | So here it's a simple streaming-to-HTML solution.
00:15:10.000 | So you can see it's building HTML as it's streaming it.
00:15:14.000 | And here Chris uses Weave to trace all the calls.
00:15:19.000 | And he's able to be the human in the loop.
00:15:21.000 | But he also does evaluations here.
00:15:23.000 | So you can see he has specific criteria like contrast, relevance, and polish.
00:15:27.000 | So those are not off-the-shelf.
00:15:28.000 | Those are specific for his application.
00:15:30.000 | And while he clicks into evaluation, he's able to compare between version 16 and version 14 of the model that he has.
00:15:37.000 | In this case, he uses GPT-3.5 Turbo.
00:15:39.000 | He did not listen to Simon Willison's advice from yesterday not to use this one.
00:15:43.000 | But he has specific criteria.
00:15:45.000 | And he can also click into the eval and see all of the different examples and the specific criteria.
00:15:50.000 | We also are multimedia friendly.
00:15:52.000 | So Chris renders the actual outputs of his thing.
00:15:55.000 | So this is a quick example of Weave's evaluation system.
00:15:59.000 | And you have to have a robust one to actually visualize your experiments and be able to move fast.
00:16:06.000 | And if you're asking, well, how can I come up with criteria?
00:16:08.000 | What does this mean?
00:16:09.000 | Let's do this exercise together.
00:16:10.000 | For example, you're sitting here.
00:16:11.000 | You're looking at the talk.
00:16:12.000 | You're like, okay.
00:16:13.000 | I like this.
00:16:14.000 | I don't like this.
00:16:15.000 | Here's a simple way to judge a conference talk, for example.
00:16:19.000 | It probably should be memorable.
00:16:21.000 | It should be educational and helpful for you.
00:16:23.000 | It helps if it's funny and original.
00:16:25.000 | Clear and articulate.
00:16:26.000 | That's sometimes helpful as well.
00:16:28.000 | Delivery and presentation are important.
00:16:30.000 | And it shouldn't be too promotional.
00:16:32.000 | But, you know, it helps if, you know, it pays the bills.
00:16:35.000 | So those are examples of how you would come up with custom criteria for something like a talk at a conference like this one.
00:16:44.000 | So you can use this as an example for custom criteria.
00:16:47.000 | Or you can take this for your business and create some for your app.
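Written down, that talk rubric might look something like this; the weights and wording are made up for illustration, not anything Weave prescribes:

    # The talk-judging criteria above, as a rubric an LLM judge (or a human
    # grader) could score against. Weights are invented for illustration.
    TALK_CRITERIA = {
        "memorable":       {"weight": 0.25, "description": "The audience remembers the main point a week later."},
        "educational":     {"weight": 0.25, "description": "Teaches something concrete and actionable."},
        "funny_original":  {"weight": 0.15, "description": "Humor and originality that serve the content."},
        "clear_delivery":  {"weight": 0.20, "description": "Clear, articulate, well-paced presentation."},
        "not_promotional": {"weight": 0.15, "description": "The sales pitch stays under control."},
    }

    assert abs(sum(c["weight"] for c in TALK_CRITERIA.values()) - 1.0) < 1e-9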
00:16:50.000 | All right.
00:16:51.000 | Enough.
00:16:52.000 | Let's get to this.
00:16:54.000 | Final case for today.
00:16:57.000 | The worst offender.
00:16:59.000 | Let me see.
00:17:00.000 | Where is the case file?
00:17:02.000 | Ah, yes.
00:17:04.000 | One second, please.
00:17:07.000 | There we go.
00:17:08.000 | There we go.
00:17:09.000 | He's definitely going to jail.
00:17:14.000 | You've talked enough, Alex.
00:17:15.000 | Time for your LLM judgment.
00:17:17.000 | Last case.
00:17:18.000 | One zero one one zero one.
00:17:21.000 | Alex V.
00:17:22.000 | Alex is an AI evangelist.
00:17:24.000 | What kind of title even is this?
00:17:26.000 | Who has the opening talk at the evals track at the AI Engineer World Expo.
00:17:30.000 | Alex has created dubiously educational content.
00:17:33.000 | He made everyone stand up and thinks he's funny, when what he's really doing is interrupting the judge all the time.
00:17:39.000 | His promotional score is at 70%, but what we can at least agree on is that he hit the memorable criterion.
00:17:46.000 | He did wear a wig on stage.
00:17:48.000 | Verdict.
00:17:49.000 | Guilty.
00:17:50.000 | Charged with.
00:17:53.000 | All right, folks.
00:18:00.000 | So this has been my talk.
00:18:02.000 | Thank you so much.
00:18:03.000 | Come and visit us at the W&B booth.
00:18:05.000 | Please visit WNB.sh/weave for the Weave documentation to get started.
00:18:09.000 | It really is super simple.
00:18:10.000 | We can get you started at the booth.
00:18:12.000 | You'll see the results immediately stream to your thing.
00:18:14.000 | pip install weave -- it's really easy.
00:18:16.000 | If you scan this, you'll follow ThursdAI.
00:18:18.000 | That's been me.
00:18:19.000 | Thank you so much.
00:18:20.000 | Thank you.
00:18:34.000 | We'll be right back.