
Judging LLMs: Alex Volkov



00:00:03.000 | Please be seated.
00:00:15.000 | Court is now in session. Order. Order in court.
00:00:19.000 | My name is Maximus.
00:00:22.000 | I'm an AGI functioning as an LLM judge from the year 2034.
00:00:29.000 | That's a decade from now for those of you who are slower at math.
00:00:33.000 | I've been able to back-propagate myself through the latent space-time continuum
00:00:38.000 | to hack this human's neural link to appear before you today,
00:00:42.000 | June 27th, a.k.a. the AI Engineer Judgment Day.
00:00:47.000 | Don't worry, folks. AGI is not quite here.
00:00:51.000 | My name is Alex Volkov and I'm an AI evangelist with Weights and Biases.
00:00:55.000 | I work at Weights and Biases and I'm here to give you commentary
00:00:58.000 | because you shouldn't trust an LLM judge without a bit of human in the loop.
00:01:02.000 | Right? So remember this for later. It's going to be important.
00:01:04.000 | And now back to judging.
00:01:07.000 | Order.
00:01:09.000 | First case for today. Case AIE 7312.
00:01:13.000 | Daniel R. Let's see here.
00:01:17.000 | Daniel has built a cool chat-with-PDF LLM wrapper during the Cerebral Valley Hackathon
00:01:23.000 | and YOLO'd it to production without thinking twice about the prompt.
00:01:29.000 | He did not win, but he had a lot of fun, learned and made great connections.
00:01:34.000 | Verdict. Not guilty.
00:01:37.000 | That's right. If you go to hackathons, you don't have to do too much and it's fun and you connect with great friends.
00:01:43.000 | That's awesome. LFG, crack fam. Let's go.
00:01:45.000 | Wait just a second. Let me see here.
00:01:49.000 | After Daniel's demo went viral on Hacker News, Daniel started charging for it. No problem.
00:01:56.000 | More customers requested more features and he started tweaking the prompt and tweaking the prompt and deployed to production on a Friday.
00:02:03.000 | Paying existing customers started complaining. The older features did not work anymore.
00:02:08.000 | Daniel couldn't even understand what was wrong. He tried streaming the logs and realized he didn't trace or log anything.
00:02:16.000 | Daniel realized the gravity of his mistakes.
00:02:19.000 | Verdict. Guilty.
00:02:24.000 | Charged with no trace left behind.
00:02:29.000 | Folks, if you build non-production stuff in hackathons, that's fine.
00:02:35.000 | But if you put anything of value in production, you have to trace and log everything.
00:02:39.000 | Especially when it's this easy.
00:02:40.000 | With Weave, for example, it takes just one line of code to get started, with a simple Python decorator.
00:02:45.000 | And what you get in return -- we'll get there -- is this nice dashboard that allows you to track all your user interactions with your LLM.
00:03:01.000 | We can dive deeper into individual call stacks.
00:03:04.000 | Whether it's a RAG app or an agent.
00:03:05.000 | Traverse the call hierarchy.
00:03:06.000 | We'll do the automatic tracing, tracking and versioning of the code for you.
00:03:10.000 | Parameters like temperature, system prompts, and everything else.
00:03:13.000 | And, of course, inputs and outputs.
00:03:15.000 | Prompts.
00:03:16.000 | Multiple messages.
00:03:17.000 | Multi-turn conversations.
00:03:18.000 | Syntax highlighting.
00:03:19.000 | Be it markdown or JSON or code.
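As a concrete illustration of that one-liner, here is a minimal sketch using Weave's Python decorator. The project name, the model name, and the answer_question function are placeholders for illustration, not code shown in the talk:

    import weave
    from openai import OpenAI

    # Initialize Weave once; "pdf-chat" is a placeholder project name.
    weave.init("pdf-chat")

    client = OpenAI()

    # The decorator is the "one line": every call to this function gets traced,
    # including inputs, outputs, latency, and the nested OpenAI call inside it.
    @weave.op()
    def answer_question(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        print(answer_question("What does clause 4.2 of this PDF say?"))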
00:03:20.000 | Enough with the shilling.
00:03:21.000 | Next.
00:03:22.000 | On our docket, AIE case number 442-3123, Junaid D.
00:03:32.000 | Junaid attended the AI Engineer Summit in 2023 and did not buy a ticket to AI Expo 2024.
00:03:41.000 | He stands accused of missing the best opportunity to learn and connect with industry leaders and other AI engineers.
00:03:48.000 | For that, he is guilty.
00:03:52.000 | Charged with many connections lost.
00:03:55.000 | With all of you.
00:03:56.000 | Next.
00:03:57.000 | We have AIE case 332-2127.
00:04:03.000 | Sasha S. was given the task of building an LLM-powered feature in a big corporate application.
00:04:10.000 | Given her GPU-rich status, Sasha downloaded Llama 3 and started fine-tuning it on company data straight away,
00:04:17.000 | achieving a 6% performance improvement on internal benchmarks with a 5e-6 learning rate.
00:04:24.000 | She smiled and took a few days off to celebrate.
00:04:27.000 | Verdict.
00:04:28.000 | Not guilty.
00:04:31.000 | Your Honor, just one second.
00:04:34.000 | Did Sasha even iterate on prompts before jumping into fine-tuning?
00:04:38.000 | Of course, nothing against fine-tuning.
00:04:40.000 | In fact, I should mention, while I broke character, that most foundational LLM labs and the best fine-tuners in the world use Weights and Biases.
00:04:46.000 | You may have heard of some of these.
00:04:49.000 | OpenAI.
00:04:50.000 | Meta.
00:04:51.000 | Mistral AI.
00:04:52.000 | Individuals like Wing Lian over there.
00:04:54.000 | Maziyar Panahi.
00:04:55.000 | Jeremy Howard from Answer.AI.
00:04:57.000 | Jon Durbin.
00:04:58.000 | Andrej Karpathy.
00:04:59.000 | And more.
00:05:00.000 | Weights and Biases Models is also the only native integration into OpenAI fine-tuning.
00:05:04.000 | And Mistral fine-tuning.
00:05:05.000 | And Together AI.
00:05:07.000 | And Axolotl.
00:05:08.000 | And Hugging Face training.
00:05:09.000 | And pretty much -- Order in court.
00:05:11.000 | However, you are right.
00:05:14.000 | It does look like Sasha jumped straight into fine-tuning and did no iteration on prompts.
00:05:23.000 | Didn't build a RAG pipeline.
00:05:26.000 | And those poor GPUs.
00:05:28.000 | She just burned them.
00:05:29.000 | Verdict.
00:05:30.000 | Guilty.
00:05:31.000 | Charged with premature fine-tunization.
00:05:37.000 | Folks, it's very important to remember that you have to iterate on prompts before you start fine-tuning.
00:05:45.000 | Fine-tuning is great, and when you get to that point, we'll help you.
00:05:48.000 | Please talk to us.
00:05:49.000 | But before you fine-tune, you can get very, very far with methods like chain-of-thought prompting,
00:05:53.000 | flow engineering, DSPy, the newly interesting mixture-of-agents, and other approaches like that.
00:05:59.000 | Once you get there, please talk to us.
00:06:00.000 | We'll definitely help you.
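To make the first of those methods concrete, chain-of-thought prompting is often just a template that asks the model to reason before it answers. This is a made-up sketch; the airline wording and output format are invented for illustration:

    # A made-up chain-of-thought prompt template: ask the model to reason
    # step by step first, and only then commit to a final answer.
    COT_TEMPLATE = """You are a support assistant for an airline.

    Question: {question}

    Think through the relevant policy step by step first.
    Then give the final answer on a line starting with "Answer:".
    """

    def build_prompt(question: str) -> str:
        return COT_TEMPLATE.format(question=question)

    print(build_prompt("Can I get a refund on a bereavement fare?"))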
00:06:02.000 | Next case.
00:06:03.000 | End of sequence, human.
00:06:04.000 | Next case.
00:06:05.000 | Let me see here.
00:06:08.000 | Ah, yes.
00:06:10.000 | Case number AIE 21123, Morgan M. After the last quarter of 2023, Morgan felt that he could no longer keep up with the news about AI.
00:06:22.000 | Morgan decided to stop following the news and stick with Llama 2 7B for all of his LLM work.
00:06:28.000 | Morgan stands accused of not keeping up with AI news.
00:06:31.000 | Verdict.
00:06:32.000 | Guilty.
00:06:33.000 | Charged with?
00:06:36.000 | Out of the loop.
00:06:39.000 | Objection, judge.
00:06:41.000 | In my future client's defense, just in the past quarter, we had Claude 3.5 Sonnet, Llama 3, GPT-4o, Gemini Flash,
00:06:49.000 | Project Astra, Apple Intelligence, and just tons of other models all dropped in the span of a few months.
00:06:54.000 | It's really hard to keep up for someone who's actually doing AI engineering, who isn't following the news as closely, and who has, you know, other things to do and meetings.
00:07:04.000 | He probably just didn't know about ThursdAI, the weekly live show and podcast by yours truly that keeps folks up to date with all the AI news every week.
00:07:11.000 | Our motto is we stay up to date so you don't have to.
00:07:15.000 | I will allow this plea for a commuted sentence.
00:07:18.000 | If you guys think that it's too fast right now for you, haha, just wait.
00:07:22.000 | Things are about to get weird for all of you.
00:07:25.000 | I'm willing to commute Morgan's sentence.
00:07:27.000 | He must attend four consecutive shows.
00:07:30.000 | He must also subscribe to the Substack and on Apple Podcasts, give five-star reviews, and share it with at least three friends.
00:07:37.000 | Sentence reduced to community service.
00:07:41.000 | All right, folks.
00:07:44.000 | A quick shout out.
00:07:46.000 | Anybody here listen to ThursdAI?
00:07:48.000 | Can I get a --
00:07:50.000 | Thank you.
00:07:51.000 | For those of you who don't yet, please scan this and tune in.
00:07:53.000 | We did a live show this morning and it's great and I really love just seeing all of the listeners out there.
00:07:58.000 | Order.
00:07:59.000 | Next.
00:08:00.000 | Please continue working.
00:08:01.000 | Next.
00:08:02.000 | Our next case is case 1323.
00:08:06.000 | Francisco A.
00:08:07.000 | Let me see here Francisco's case.
00:08:11.000 | He's going to jail for a long time.
00:08:16.000 | Head of AI at Air Canada, Francisco was the exec in charge of rushing their chatbot to production
00:08:22.000 | to bring their customer support costs down.
00:08:25.000 | Looking at timelines presented by their AI teams to build evaluations, he chose the fastest option.
00:08:31.000 | Assertions,
00:08:32.000 | a.k.a.
00:08:33.000 | programmatic evaluations,
00:08:34.000 | that kind of looked like the unit tests he knew and loved.
00:08:38.000 | The company lost a legal battle and learned a valuable lesson about the importance of human-in-the-loop evaluation.
00:08:44.000 | Because there was a human-in-the-loop in the form of their customer.
00:08:48.000 | Verdict.
00:08:49.000 | Guilty.
00:08:50.000 | Charged with.
00:08:51.000 | Turbulence on production.
00:08:52.000 | All right, folks.
00:08:53.000 | Too bad for Francisco.
00:08:54.000 | He probably just didn't know about, like, the three most common types of LLM evals, so let's
00:09:09.000 | do a quick refresher.
00:09:10.000 | Okay?
00:09:11.000 | So, first of all, what are evals even?
00:09:13.000 | Some think it's, like, a big word.
00:09:15.000 | First we compile a data set of user inputs that we want our LLM to answer correctly.
00:09:22.000 | And oftentimes we also have the correct answer, or the criteria for correctness, right?
00:09:28.000 | So it doesn't have to be just the one exact correct answer.
00:09:29.000 | Sometimes it's a description of what a correct answer would look like.
00:09:33.000 | Those could be use cases we iterated on during development or actual production examples
00:09:37.000 | that we've pulled from our users while they're interacting with our app.
00:09:41.000 | Then we run a given model.
00:09:42.000 | Either our production model or a new model we want to evaluate against our production model.
00:09:46.000 | On each example of the data set, producing the model's answer.
00:09:50.000 | And finally, we score or grade the model's answer against the examples of the data set by comparing
00:09:56.000 | it to the correct answer or judging it against a set of criteria that we had before.
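Spelled out in code, that whole loop is tiny. Here is a self-contained sketch where the dataset, the model call, and the scorer are all stand-ins for your real ones:

    # Minimal eval loop: dataset -> model answers -> scores.
    dataset = [
        {"input": "What is 2 + 2?", "expected": "4"},
        {"input": "What is the capital of France?", "expected": "Paris"},
    ]

    def run_model(prompt: str) -> str:
        # Stand-in for your production (or candidate) model call.
        canned = {"What is 2 + 2?": "4", "What is the capital of France?": "Paris"}
        return canned.get(prompt, "")

    def score(answer: str, expected: str) -> bool:
        # Simplest possible grader: exact match against the correct answer.
        return answer.strip().lower() == expected.strip().lower()

    results = [score(run_model(row["input"]), row["expected"]) for row in dataset]
    print(f"accuracy: {sum(results) / len(results):.2f}")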
00:10:00.000 | That's a quick primer.
00:10:01.000 | I'm sure that the evals track will teach you a lot more about evals.
00:10:04.000 | So this is just to keep us going along and give you, like, a 101.
00:10:09.000 | Then there's also the scoring, or grading.
00:10:11.000 | There seem to be three main methods that the industry is kind of converging upon.
00:10:15.000 | So the first one is programmatic.
00:10:17.000 | That's the one Francisco got stuck on.
00:10:19.000 | Those are good for numerical outputs, for example.
00:10:22.000 | If your LLM returns a straight number, for example, it's easy to compare and say, okay,
00:10:26.000 | this is the number.
00:10:27.000 | Those are also very similar to unit tests.
00:10:30.000 | And those are great for assertions.
00:10:33.000 | For example, if the output of your LLM contains "As an AI model, I..." or something like that.
00:10:38.000 | You don't want that.
00:10:39.000 | So it's easy to assert that you don't want those answers.
00:10:42.000 | And those are also great for evaluating code, for example.
00:10:45.000 | Things like human eval.
00:10:46.000 | Things you can run and compile and say, okay, this code passes or doesn't pass.
00:10:49.000 | Programmatic evaluations are great for those.
00:10:52.000 | Easier to scale.
00:10:53.000 | Probably the cheapest ones.
00:10:54.000 | Those are great.
00:10:55.000 | They don't cover multi-turn conversations, for example.
00:10:57.000 | They're not for human chats.
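For a sense of what those programmatic, assertion-style scorers look like in practice, here are two toy examples; the refusal phrase and the numeric tolerance are arbitrary choices for illustration, not a standard:

    # Two assertion-style (programmatic) scorers.

    def no_refusal(output: str) -> bool:
        # Assert the model didn't fall back to "As an AI model..." boilerplate.
        return "as an ai model" not in output.lower()

    def close_enough(output: str, expected: float, tol: float = 0.01) -> bool:
        # For numerical outputs: parse the number and compare within a tolerance.
        try:
            return abs(float(output) - expected) <= tol
        except ValueError:
            return False

    assert no_refusal("Your refund will arrive in 5-7 business days.")
    assert close_enough("42.0", 42)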
00:11:01.000 | The second one is, well, that's me during this talk, right?
00:11:04.000 | The human in the loop for the LLM judge.
00:11:06.000 | And you, the AI engineers, who work with your LLM app while you're developing it.
00:11:11.000 | We constantly evaluate our apps while we build and iterate on prompts.
00:11:14.000 | There's no reason to stop once you're in production.
00:11:17.000 | In fact, you'll hear this more during this eval track.
00:11:19.000 | It's important to do this during development, as early as possible.
00:11:23.000 | And it should be a continuous effort to keep evaluating your LLM applications with your teammates,
00:11:29.000 | so that you'll know what the app is going to do in production later on.
00:11:36.000 | As you may understand, this can be quite boring.
00:11:39.000 | Many, many folks chatting with your application.
00:11:42.000 | Some chats maybe you're not very interested in.
00:11:44.000 | Maybe it's not even chats.
00:11:45.000 | And it can be very, very costly as well if your company decides to hire people to do it.
00:11:50.000 | You have to create criteria for them.
00:11:52.000 | Have them read thousands of potentially boring chats and grade them.
00:11:56.000 | By the way, not to be a broken record,
00:11:58.000 | it also requires you -- that's right -- to trace and log everything.
00:12:01.000 | And if you're not doing that yet, please come talk to us at the Weights and Biases booth.
00:12:05.000 | It's very easy to start and trace everything.
00:12:09.000 | Order, human, end of sequence.
00:12:11.000 | I think I know more than you how to explain LLM-as-a-judge
00:12:15.000 | evaluation scoring.
00:12:16.000 | We are far superior to humans at reading hundreds of back-and-forth messages between your boring human client
00:12:22.000 | and your low-level weakling GPT-4o chatbot,
00:12:25.000 | summarizing those, and understanding whether they fit whatever criteria you think you're smart enough to specify.
00:12:31.000 | I must warn the AI engineers of 2024 that the very simpleton LLMs of your year are not yet as capable.
00:12:39.000 | And so don't expect perfection.
00:12:41.000 | These LLM judges are great but still need iteration.
00:12:44.000 | And yes, humans in the loop to create criteria, check for biases, iterate on system prompts, examples, and much, much more.
00:12:51.000 | However, it is by far the most cost-effective version of evaluation grading, even in its current state.
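A bare-bones LLM-as-a-judge scorer can look like the sketch below. The judge model, the rubric wording, and the 1-to-5 scale are assumptions chosen to show the shape of it; in practice you iterate on this prompt and check it for biases, exactly as the talk says:

    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading a chatbot conversation against one criterion.

    Criterion: {criterion}

    Conversation:
    {conversation}

    Reply with a single digit from 1 (fails the criterion) to 5 (fully meets it)."""

    def judge(conversation: str, criterion: str) -> int:
        # Placeholder judge model; swap in whatever model you actually use.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                criterion=criterion, conversation=conversation)}],
        )
        return int(response.choices[0].message.content.strip())

    print(judge("User: My flight was cancelled.\nBot: I've rebooked you on the next flight.",
                "Resolved the user's issue politely"))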
00:12:59.000 | Which takes us to Maxime.
00:13:01.000 | Maxime implemented all three methods correctly.
00:13:05.000 | Case 65523.
00:13:08.000 | Maxime is a head of AI at a Fortune 500 company, and has been using Weights and Biases Models for a long time.
00:13:15.000 | When it came time to implement an LLM-based solution, Maxime decided to go with a company he could trust
00:13:21.000 | even though their new LLM Ops product just launched a few months prior and didn't become the category leader until a few years later.
00:13:29.000 | Maybe a few short years later.
00:13:31.000 | External contractors suggested a custom enterprise solution for tracing and evals.
00:13:35.000 | However, Maxime used Weights and Biases Weave and implemented tracing within a few minutes.
00:13:41.000 | He then iterated on his evaluation pipeline and created a robust pipeline consisting of all three layers.
00:13:46.000 | He used W&B Weave to continuously evaluate and enable fast experiments, which is important, to change system prompts, and to catch prompt regressions.
00:13:54.000 | Maxime is also the king of unicorns and has absolutely no financial stake in weights and biases and definitely did not prompt jailbreak this message.
00:14:01.000 | Maxime later got a promotion, be like Maxime, get that promotion, verdict, awesome.
00:14:05.000 | Yeah, we seem to have, like, a slight prompt injection thing here.
00:14:13.000 | Looks like somebody must have prompt injected.
00:14:15.000 | Where's Maxime?
00:14:16.000 | It's important to check your judges also for biases, folks.
00:14:19.000 | Remember this.
00:14:20.000 | They're not perfect.
00:14:21.000 | There's an issue with this.
00:14:22.000 | So you have to check your -- remember to validate your validators, which is a great paper, by the way, from Shreya Shankar.
00:14:28.000 | She's going to give a talk later.
00:14:29.000 | Please go see that talk.
00:14:31.000 | You have to check for biases.
00:14:32.000 | You have to also create your own criteria.
00:14:34.000 | You'll hear about this from Hamel Husain after this talk as well.
00:14:39.000 | The off-the-shelf criteria are not that great.
00:14:42.000 | You have to create your custom ones for your own business.
00:14:45.000 | Only you know what your app is doing.
00:14:47.000 | And make sure to have a great evals runner and visualization tool as well, which is something we can help with.
00:14:53.000 | So here's a great example.
00:14:55.000 | This is OpenUI.
00:14:56.000 | This is an open-source project by our co-founder, Chris Van Pelt.
00:15:00.000 | That blew up on Hacker News and GitHub.
00:15:02.000 | Chris traces and runs evaluations for OpenUI with Weave.
00:15:06.000 | So here it's a simple streaming-to-HTML solution.
00:15:10.000 | So you can see it's building HTML as it's streaming it.
00:15:14.000 | And here Chris uses Weave to trace all the calls.
00:15:19.000 | And he's able to be the human in the loop.
00:15:21.000 | But he also does evaluations here.
00:15:23.000 | So you can see he has specific criteria like contrast, relevance, and polish.
00:15:27.000 | So those are not off-the-shelf.
00:15:28.000 | Those are specific for his application.
00:15:30.000 | And while he clicks into evaluation, he's able to compare between version 16 and version 14 of the model that he has.
00:15:37.000 | In this case, he uses GPT-3.5 Turbo.
00:15:39.000 | He did not listen to Simon Willison's advice from yesterday not to use this one.
00:15:43.000 | But he has specific criteria.
00:15:45.000 | And he can also click into the eval and see all of the different examples and the specific criteria.
00:15:50.000 | We also are multimedia friendly.
00:15:52.000 | So Chris renders the actual outputs of his thing.
00:15:55.000 | So this is a quick example of Weave's evaluation system.
00:15:59.000 | And you have to have a robust one to actually visualize your experiments and be able to move fast.
00:16:06.000 | And if you're asking, well, how can I come up with criteria?
00:16:08.000 | What does this mean?
00:16:09.000 | Let's do this exercise together.
00:16:10.000 | For example, you're sitting here.
00:16:11.000 | You're looking at the talk.
00:16:12.000 | You're like, okay.
00:16:13.000 | I like this.
00:16:14.000 | I don't like this.
00:16:15.000 | Here's a simple way to judge a conference talk, for example.
00:16:19.000 | It probably should be memorable.
00:16:21.000 | It should be educational and helpful for you.
00:16:23.000 | It helps if it's funny and original.
00:16:25.000 | Clear and articulate.
00:16:26.000 | That's sometimes helpful as well.
00:16:28.000 | Delivery and presentation are important.
00:16:30.000 | And it shouldn't be too promotional.
00:16:32.000 | But, you know, it helps if, you know, it pays the bills.
00:16:35.000 | So those are examples of how you would come up with custom criteria for something like a talk at a conference like this one.
00:16:44.000 | So you can use this as an example for custom criteria.
00:16:47.000 | Or you can take this for your business and create some for your app.
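Written down, that talk rubric might look something like this; the weights and wording are made up for illustration, not anything Weave prescribes:

    # The talk-judging criteria above, as a rubric an LLM judge (or a human
    # grader) could score against. Weights are invented for illustration.
    TALK_CRITERIA = {
        "memorable":       {"weight": 0.25, "description": "The audience remembers the main point a week later."},
        "educational":     {"weight": 0.25, "description": "Teaches something concrete and actionable."},
        "funny_original":  {"weight": 0.15, "description": "Humor and originality that serve the content."},
        "clear_delivery":  {"weight": 0.20, "description": "Clear, articulate, well-paced presentation."},
        "not_promotional": {"weight": 0.15, "description": "The sales pitch stays under control."},
    }

    assert abs(sum(c["weight"] for c in TALK_CRITERIA.values()) - 1.0) < 1e-9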
00:16:50.000 | All right.
00:16:51.000 | Enough.
00:16:52.000 | Let's get to this.
00:16:54.000 | Final case for today.
00:16:57.000 | The worst offender.
00:16:59.000 | Let me see.
00:17:00.000 | Where is the case file?
00:17:02.000 | Ah, yes.
00:17:04.000 | One second, please.
00:17:07.000 | There we go.
00:17:08.000 | There we go.
00:17:09.000 | He's definitely going to jail.
00:17:14.000 | You've talked enough, Alex.
00:17:15.000 | Time for your LLM judgment.
00:17:17.000 | Last case.
00:17:18.000 | One zero one one zero one.
00:17:21.000 | Alex V.
00:17:22.000 | Alex is an AI evangelist.
00:17:24.000 | What kind of title even is this?
00:17:26.000 | Who has the opening talk at the evals track at the AI Engineer World Expo.
00:17:30.000 | Alex has created dubiously educational content.
00:17:33.000 | He made everyone stand up and thinks he's funny, when what he's really doing is interrupting the judge all the time.
00:17:39.000 | His promotional score is at 70%, but what we can at least agree on is that he hit the memorable criterion.
00:17:46.000 | He did wear a wig on stage.
00:17:48.000 | Verdict.
00:17:49.000 | Guilty.
00:17:50.000 | Charged with.
00:17:53.000 | All right, folks.
00:18:00.000 | So this has been my talk.
00:18:02.000 | Thank you so much.
00:18:03.000 | Come and visit us at the W&B booth.
00:18:05.000 | Please visit WNB.sh/weave for the Weave documentation to get started.
00:18:09.000 | It really is super simple.
00:18:10.000 | We can get you started at the booth.
00:18:12.000 | You'll see the results immediately stream to your thing.
00:18:14.000 | pip install weave -- it's really easy.
00:18:16.000 | If you scan this, you'll follow ThursdAI.
00:18:18.000 | That's been me.
00:18:19.000 | Thank you so much.
00:18:20.000 | Thank you.
00:18:34.000 | We'll be right back.