Lessons From A Year Building With LLMs

00:00:00.000 | .

00:00:13.720 | You're about to experience something of a strange talk,

00:00:17.040 | and not just because Brian and I are strange,

00:00:20.740 | but because something kind of strange happened.

00:00:23.960 | Over the last year, a bunch of us were posting things on Twitter.

00:00:27.760 | We were writing blog posts complaining about LLNs,

00:00:31.700 | and we formed a little group chat.

00:00:33.720 | And we were, you know, continuing to complain about LLNs to each other

00:00:38.300 | and sharing what we were working on when we realized we were all

00:00:41.320 | about to write the exact same blog post,

00:00:44.300 | what we learned in the last year.

00:00:46.780 | So we got together, and we turned what was initially a couple

00:00:52.520 | of short blog posts into a long white paper on O'Reilly,

00:00:56.900 | combining our lessons across strategic, operational,

00:01:00.920 | and tactical levels of building LLN applications.

00:01:04.480 | And the response to that white paper was overwhelmingly positive.

00:01:08.340 | We got -- we heard from everybody, from people who contribute to Postgres,

00:01:14.380 | to venture capitalists, to tool builders, saying we loved what you wrote

00:01:19.520 | in that article, like I felt that pain too.

00:01:23.700 | And we were invited on the strength of that to give this keynote address.

00:01:29.480 | And so we faced a kind of funny challenge, which is part of the appeal

00:01:32.960 | of this blog post of this article was that the six of us all came together to write it.

00:01:38.760 | As Scott Condren put it, it was like an Avengers team-up.

00:01:42.780 | So we had to figure out a way to deliver one keynote talk from six people.

00:01:48.620 | So we pulled the Avengers together for one night only to sort of like deliver some

00:01:58.520 | of the most important insights from that 30-page article, to add some

00:02:03.640 | of our spicy extra takes that ended up on the cutting room floor,

00:02:07.640 | and to respond to the allegations.

00:02:10.180 | I'd like to state unequivocally that we are not, in fact, crypto bros who just found

00:02:17.700 | out that GPT-4 was the new Web3.

00:02:21.600 | We all trained our first neural networks back when you had to write the gradients by hand.

00:02:27.580 | So we split the article up to three pieces, we split the talk into three pieces.

00:02:31.680 | First you're going to hear from me and Brian talking about the strategic considerations

00:02:35.980 | for building LLM applications.

00:02:38.400 | How do you look to the future?

00:02:40.000 | How do you see around corners?

00:02:41.360 | How do you make big decisions?

00:02:44.160 | Then we're going to hand the clickers and the stage over to Hamel Hussein and Jason Liu who

00:02:51.400 | are going to share the operational considerations.

00:02:53.880 | How do you put together processes?

00:02:55.400 | How do you put together teams?

00:02:57.160 | How do you think about workflows around delivering LLM applications?

00:03:02.880 | And then they will hand over the clickers and the stage to Shreya Shankar and Eugene Yan who

00:03:08.500 | will talk about the tactical considerations for building LLM applications.

00:03:12.440 | What are the specific techniques, tactics, and moves that have stood the test of one year's

00:03:17.640 | time for building LLM applications?

00:03:21.300 | All right.

00:03:23.640 | So Brian, how do you build an LLM application without getting outmaneuvered and wasting everybody's

00:03:29.800 | time and money?

00:03:30.800 | Yes.

00:03:31.800 | Yes.

00:03:32.800 | Well, many of you may be thinking that there's really only one way to win in this new, exciting,

00:03:40.680 | dynamic, and very scary industry.

00:03:43.360 | And that, of course, is to train your own custom model.

00:03:46.940 | Pre-training, fine tuning, a little RLHF here and there.

00:03:50.800 | You better start from scratch, buddy.

00:03:52.800 | Eh.

00:03:53.800 | Not quite.

00:03:55.440 | The model is actually not your moat.

00:03:58.980 | For almost no one in this audience, the model is the moat.

00:04:04.000 | You all as AI engineering devotees should be building in your zone of genius.

00:04:12.200 | You should be leveraging your product expertise or your existing product.

00:04:16.440 | Maybe you've got one.

00:04:18.060 | And you should be finding your niche and digging into that niche, exploiting it.

00:04:25.000 | You should be building what the model providers are not.

00:04:29.200 | There's a high likelihood that the model providers have to build a lot of things for all of their

00:04:34.700 | customers.

00:04:36.440 | Don't waste your calories on building these things.

00:04:39.580 | The Sam Altman phrase of steamrolling is appropriate here.

00:04:44.560 | And you should be treating the models like any other SaaS product.

00:04:49.080 | You should be quickly dropping them when there's a competitor that's clearly better.

00:04:54.020 | No offense to GPT-40, but Sonic 3.5 looking pretty sharp.

00:05:02.700 | It's important to keep in mind that a model with high MMLU scores, that's not a product.

00:05:08.280 | 87% on Spyder SQL.

00:05:10.980 | That doesn't automate all data requests or even 87% of them.

00:05:16.040 | You can't sell human eval pass at 67.

00:05:20.420 | At least my GTM team doesn't know how.

00:05:24.340 | An excellent LLM-powered application is an excellent product.

00:05:30.380 | It's well designed.

00:05:32.420 | It solves a job to be done.

00:05:35.120 | And it enhances your user.

00:05:40.180 | Why are we so excited about AI?

00:05:42.880 | Human enhancement.

00:05:43.880 | So, what should you build if not all of these things?

00:05:50.940 | Things that generalize to smarter and faster models.

00:05:54.880 | Things that help you maintain your product's quality bar under uncertainty.

00:06:02.080 | And things that help you continuously improve.

00:06:04.640 | Whoa, Brian.

00:06:08.340 | Continuous improvement.

00:06:09.340 | That's my trigger phrase.

00:06:12.400 | The idea of continuous improvement has been brought to the world of LLM applications by,

00:06:22.340 | like, this shift in focus that we've all felt since the previous AI engineer summit to focus

00:06:28.980 | on evaluation and data.

00:06:32.360 | It's nicely synagogalized by this diagram from our co-author, Hamill Hussain, showing this virtuous

00:06:38.800 | cycle of improvement.

00:06:40.300 | It has evals and data at the center.

00:06:42.480 | But the core reason to create those evals, the core reason to collect that data, is to drive

00:06:47.680 | forward this loop of continuous improvement.

00:06:52.280 | And despite what your expensive consultants or your -- many of the LinkedIn influencers posting

00:07:01.300 | about LLM apps might say, this is not actually the first time that engineers have tried to

00:07:06.560 | tame a complex system and make it useful and valuable.

00:07:11.760 | This same loop of iterative improvement was also at the core of MLOps, at the operationalization

00:07:18.560 | of machine learning models before LLMs.

00:07:21.980 | This figure from our co-author Shreya Shankar's paper had that same loop of iterative improvement

00:07:28.240 | centered also on evaluation and on data collection.

00:07:33.260 | MLOps was also not the first time that engineers faced this problem, the problem of complexity,

00:07:42.100 | the problem of nondeterminism and uncertainty.

00:07:46.500 | The DevOps movement that gave MLOps its name also focused on this kind of iterative improvement

00:07:53.760 | and on monitoring information in production to turn into improvements to products.

00:08:01.420 | But dear reader, DevOps was not the first time that engineers tackled this problem of uncertainty

00:08:09.780 | and solved it and solved it with iterative improvement.

00:08:13.040 | DevOps built on the ideas of the lean startup movement from Eric Ries that was focusing not

00:08:20.060 | just on building an application, not just on building a machine learning model or an LLM agent,

00:08:26.540 | but on building the entire business.

00:08:28.040 | And it used this same loop centered on measurement and data to drive the improvement and building

00:08:35.800 | of a business.

00:08:38.520 | This idea itself was not invented in Northern California, despite what some people might say.

00:08:45.900 | It has its roots in the Toyota production system and in the idea of Kaizen or continuous improvement.

00:08:53.000 | Genchi Genbutsu is one of the core principles from that movement that we can take forward into

00:08:57.840 | the development of LLM applications.

00:09:00.100 | It means real things, real places.

00:09:02.220 | And at Toyota, that meant sending executives out to factory floors, getting their khakis a

00:09:07.420 | bit dirty.

00:09:08.540 | For LLM applications, the equivalent is looking at your data.

00:09:14.520 | That data is the real information about how your LLM application is delivering value to

00:09:19.060 | users.

00:09:20.060 | There's nothing that is more valuable than that.

00:09:25.620 | Finally, there's lots of people selling tools at this conference, including myself.

00:09:31.380 | It's easy to get overly excited about the tools and the construction of this iterative loop

00:09:35.240 | of improvement and to forget where value actually comes from.

00:09:38.460 | And there's a great pithy, earthy statement from the Toyota production system from Shigeo Shingo

00:09:44.180 | that I really like.

00:09:45.180 | Value is only created when metal gets bent.

00:09:48.720 | So we have to make sure that we don't get lost just building our evals and calculating concept

00:09:54.720 | drift.

00:09:55.720 | Instead, make sure that we continue to get out there and bend metal and create value for

00:10:00.260 | our users.

00:10:01.260 | I'm not going to lie.

00:10:01.800 | I might have misunderstood earlier when you said let's get bent.

00:10:04.800 | Okay.

00:10:05.800 | So right off the bat, we need to spin that data flywheel, Bob.

00:10:10.800 | Oh, wait.

00:10:11.800 | Sorry.

00:10:12.800 | Wrong game show.

00:10:13.800 | I'm not going to lie.

00:10:26.340 | metrics.

00:10:27.880 | I'm not going to lie.

00:10:28.880 | I'm not going to lie.

00:10:29.880 | I'm not going to lie.

00:10:32.420 | I'm not going to lie.

00:10:34.420 | coloring this behavior.

00:10:35.420 | I'm not going to lie.

00:10:36.960 | enough.

00:10:37.960 | I don't have an evals framework to sell you.

00:10:39.960 | What I do have to sell you is this idea that you should be getting out there.

00:10:44.560 | You should be getting started.

00:10:46.280 | But wait, Brian, I'm really nervous.

00:10:48.780 | What if this isn't good enough for my customers?

00:10:51.260 | Fear is the mind killer.

00:10:56.400 | Put it out there in beta.

00:10:58.400 | If it's good enough for these incredible companies like Apple Intelligence, Photoshop, and Hex, that's

00:11:03.860 | me.

00:11:04.860 | It's good enough for you.

00:11:05.860 | You need to collect this data.

00:11:07.460 | You need to put something in the wild.

00:11:08.720 | You need to start looking at your user interactions.

00:11:11.680 | The real user interactions, LLM's responses deserve human eyes.

00:11:19.140 | You can give it some AI eyes, too, but definitely look at it with your human eyes.

00:11:25.600 | Binary human feedback is valuable.

00:11:28.680 | It's nice to add some rich feedback, too.

00:11:30.320 | That can be interesting.

00:11:31.520 | But start with binaries.

00:11:33.680 | And finally, user requests will reveal the PMF opportunities that lie below your product

00:11:40.520 | substrate.

00:11:42.020 | Where is your PMF?

00:11:43.100 | Everybody wants to know.

00:11:44.100 | It's in your user interactions.

00:11:47.100 | What are they asking your chat bot that you haven't yet implemented?

00:11:50.640 | That's a really nice direction to skate if that's where the puck is going.

00:11:57.960 | And despite the focus on the user interactions that you can have today, the things that you

00:12:03.260 | can ship right now, it's important to also think about the future.

00:12:08.600 | The best way to predict the future is to look at the past, find people predicting the present,

00:12:14.560 | and copy what they did.

00:12:17.160 | In designing many of the components of the personal computing revolution, Alan Kaye and others

00:12:22.500 | at Park adopted as a core technique projecting Moore's law out into the future.

00:12:29.460 | They built expensive, unmarketable, slow, and buggy systems themselves so they could experience

00:12:35.580 | what it was like and build for that future and create it.

00:12:41.780 | We don't have quite the industrial scaling information that Moore had when he wrote down his predictions.

00:12:49.940 | But we do have the beginnings of those same laws.

00:12:54.500 | There's been an order of magnitude decrease every 12 to 18 months at three distinct levels

00:12:59.400 | of capability.

00:13:00.620 | At the capability of DaVinci, the original GPT-3 API that brought, that excited a lot of us about

00:13:06.900 | the idea of building on foundation models.

00:13:10.420 | The capabilities of text DaVinci 2, the model lineage underlying chat GPT that brought the

00:13:16.580 | rest of the world to excitement about this technology.

00:13:20.460 | And the latest and greatest level of capabilities with GPT-4 and Sonnet.

00:13:25.060 | In each case, around 15 months is enough time to drop the cost by an entire order of magnitude.

00:13:32.960 | This is faster than Moore's law.

00:13:36.020 | And so the appropriate way to plan for the future is to think what this implies for what

00:13:42.620 | applications that are not economical today will be economical at the time that you need

00:13:47.380 | to raise your next round.

00:13:50.060 | So in 2023, it cost about $625 an hour to run a video game where all the NPCs were powered

00:13:56.220 | by a chat bot.

00:13:58.220 | That's pretty expensive.

00:14:00.100 | In 1980, it cost about $6 an hour to play Pac-Man, inflation adjusted.

00:14:06.100 | That suggests that if we just wait for two orders of magnitude reduction or about 30 months from

00:14:10.740 | mid-2023, it should be possible to deliver a compelling video game experience with chat bot

00:14:17.260 | NPCs at about $6 an hour and people will probably pay for it.

00:14:22.140 | So, you can't sell it now, but you can live it and you can design it and you can be ready

00:14:27.700 | when the time comes.

00:14:28.980 | So that's how to think about the future and how to think strategically when building LLM applications.

00:14:36.020 | I'd like to call to the stage my co-authors, Jason Liu and Hamo Hussain, to talk about the

00:14:41.060 | operational aspects.

00:14:42.060 | Let's give him a hand.

00:14:52.060 | All right.

00:14:53.060 | So, Hamo and I have basically been doing a lot of AI consulting in the past year, right?

00:14:58.420 | We've worked with about 20 companies so far and, you know, we've done something from pre-seed

00:15:02.940 | all the way to public companies and I'm pretty bored of giving generic good advice, especially

00:15:07.820 | because there's such a range of operators here.

00:15:10.580 | And so, instead, I'm going to invert.

00:15:13.080 | My goal today is to tell you how to ruin your business.

00:15:17.840 | First of all, everyone knows that in the gold rush, you sell shovels and so if you want to

00:15:22.820 | get gold, you've got to buy shovels too, right?

00:15:25.140 | You know, if you want to find more gold, keep buying shovels.

00:15:29.080 | Where do I dig?

00:15:30.240 | Keep buying shovels.

00:15:31.340 | How do I know when to stop digging?

00:15:32.960 | The shovel will tell you.

00:15:35.000 | And how do I dig one deep hole versus making investments in plenty of shallow holes?

00:15:39.600 | Again, the answer is more shovels, clearly, right?

00:15:42.100 | And this might be generic so I'll give you some more specific advice.

00:15:46.860 | If your rag app doesn't work, try a vector database, a different vector database.

00:15:50.900 | If the methodology doesn't work, implement a new paper.

00:15:54.100 | And maybe if you update the embedding model, you'll finally find product market fit.

00:16:02.360 | Because truth be told, success does not lie in developing expertise or processes.

00:16:07.600 | Try more tools.

00:16:08.600 | There's no need to balance between exploring and exploiting the mechanisms that work for you.

00:16:14.360 | Change the tools.

00:16:15.360 | And the processes and the decision-making frameworks don't matter.

00:16:19.740 | The right tool will solve everything.

00:16:22.620 | How do you find a machine learning engineer who can fine tune as quickly as possible?

00:16:28.120 | A $2,000 per month open AI bill is very expensive.

00:16:31.880 | And instead, hire someone for a quarter of a million dollars, give them 1% of their company,

00:16:36.960 | to fight CUDA build errors and figure out server cold starts, right?

00:16:40.960 | Because what's the point of growing your company if you're just a wrapper?

00:16:44.960 | And if your margins are too low, try fine-tuning.

00:16:48.300 | It's much easier than figuring out how to build something worth charging for.

00:16:52.460 | It's really -- I cannot reiterate this enough.

00:16:56.300 | It's very important to hire a machine learning engineer as quickly as possible, right?

00:17:00.660 | Even if you have no data, generating products.

00:17:03.460 | They love fixing Vercel TypeScript build errors.

00:17:07.720 | And generally, if you hire a full-stack engineer who's really caught the LLM bug,

00:17:13.720 | they're going to lack real experience.

00:17:16.220 | And this is because Python is a dead language, right?

00:17:19.720 | Machine learning engineers, research engineers can easily pick up TypeScript,

00:17:24.220 | and the ecosystem that exists in Python could be quickly re-implemented in a couple of weekends, right?

00:17:30.380 | The people who wrote Python code for the past 10 years doing data analysis,

00:17:34.520 | they're going to easily be able to transition their tools.

00:17:37.060 | And if anything, it's really easy to teach things like product sense and data literacy to the JavaScript community.

00:17:44.060 | And most important of all, in order to find this kind of magic talent,

00:17:50.060 | we need to create a very catch-all job title.

00:17:53.060 | Let's use words like ninja and wizard or data scientist or prompt engineer or even the AI engineer.

00:17:59.560 | In the past 10 years, we've known that this works really well, right?

00:18:03.560 | Every time we know exactly who we want, as long as we catch a very wide net of skills,

00:18:10.560 | it doesn't really matter whether or not we don't know what outcomes we're looking for.

00:18:14.560 | Anyways, to dig me out of this hole, I'll have Hamill explain and, you know, take a deep breath,

00:18:21.060 | think out loud step by step.

00:18:24.060 | Thank you, Jason.

00:18:28.060 | So that was really good.

00:18:33.060 | I mean, let's just step back from the cliff a little bit.

00:18:37.060 | And let's kind of linger on the topic of AI engineer.

00:18:40.060 | I heard some booing in the audience.

00:18:42.060 | And so I love the term AI engineer.

00:18:46.060 | Like, much props to Swiss for kind of popularizing this term.

00:18:50.060 | It allows us all to get together and have conversations like this.

00:18:53.060 | But I think that there's a misunderstanding of the skills of AI engineer.

00:18:58.060 | What skills you need to be successful.

00:19:01.060 | And there's a lot of inflated expectations.

00:19:03.060 | As a founder or engineering leader, the talent is the most important lever that you have.

00:19:10.060 | And so what I'm going to do is I'm going to talk about some of the problems

00:19:14.060 | and perhaps some solutions when it comes to this talent misunderstanding.

00:19:20.060 | So just to review, what is an AI engineer?

00:19:23.060 | So this is a diagram that everyone has probably seen.

00:19:27.060 | There's a spectrum of skills in the AI space.

00:19:30.060 | And there's this API dividing line in the middle.

00:19:33.060 | And kind of to the right of the API dividing line, we have AI engineer.

00:19:37.060 | The AI engineer skills are focused on things like chains, agents, tooling, and infra.

00:19:45.060 | And auspiciously missing from the AI engineer are tools like evals and data.

00:19:52.060 | And I think a lot of people have taken this diagram too literally and taken it to heart

00:19:57.060 | and say, hey, we don't really need to know about evals, for example.

00:20:02.060 | The problem is that you can go from 0 to 1 really fast.

00:20:06.060 | In fact, you can go to 0 to 1 faster than ever before with all the great tools out there.

00:20:11.060 | Just by using vibe checks and implementing the tools that we talked about.

00:20:15.060 | However, without evals, you can't make progress.

00:20:18.060 | Quickly lead to stagnation.

00:20:20.060 | Because if you can't measure what you're doing, you can't make your system better.

00:20:25.060 | And you can't go beyond 0 to 1.

00:20:28.060 | So, what can we do about this?

00:20:32.060 | About this evals, skill set, and data literacy?

00:20:35.060 | So, Jason and I have found that we can actually get really good at writing evals and data literacy

00:20:43.060 | with just four to six weeks of deliberate practice.

00:20:46.060 | In fact, like very effective.

00:20:48.060 | And we think that these skills, evals and data, should be brought more into the core of AI engineer.

00:20:54.060 | And it really helps solve this problem.

00:20:57.060 | And it's something that we see over and over again.

00:20:59.060 | So, the next thing I want to talk about is the AI engineer job title itself.

00:21:04.060 | And so, vague job titles can be problematic.

00:21:09.060 | What we see over and over again in our consulting is that this kind of catch-all role have very inflated expectations.

00:21:18.060 | Any time anything goes wrong with the AI, people look towards that role to fix it.

00:21:24.060 | And sometimes that role doesn't have all the skills they need to move forward.

00:21:28.060 | And we've seen this before with the role of data scientists.

00:21:32.060 | The titles and names really matter.

00:21:37.060 | What I want to emphasize is I think AI engineer is very aspirational.

00:21:41.060 | And you should keep learning.

00:21:43.060 | And it's a good thing to strive towards.

00:21:45.060 | But you need to have reasonable expectations.

00:21:48.060 | And just to kind of bring it back to data science, we've seen this before in data science as well.

00:21:54.060 | Where we had kind of a decade ago when this role was coined.

00:22:00.060 | It was a unicorn that had all these skills.

00:22:03.060 | Software engineering skills, statistics, math, domain expertise.

00:22:08.060 | And we found out as an industry that we had to unroll that into many other different roles.

00:22:12.060 | Such as decision scientist, machine learning engineer, data engineer, so on and so forth.

00:22:17.060 | And I think similar things may be happening with the role of AI engineer.

00:22:20.060 | And it's good to keep that in mind.

00:22:22.060 | And what I see, or what we both see in consulting, is that it's helpful to be more specific.

00:22:29.060 | To be more deliberate about what skills you need and at what time.

00:22:32.060 | And depending on your maturity, it's very helpful to not only specify what the skills are, but

00:22:38.060 | what kinds of products you'll be working on.

00:22:40.060 | So these are some job titles from Github Co-Pilot.

00:22:44.060 | That kind of are very specific about the skills you need at that time.

00:22:48.060 | And really it's important to hire the right talent at the right time.

00:22:52.060 | On the maturity curve.

00:22:53.060 | So when you're first starting out, you only need application development, software engineering,

00:22:59.060 | or AI engineering to go from zero to one.

00:23:02.060 | Then you need platform and data engineering to capture that data.

00:23:06.060 | And then only after that you should hire a machine learning engineer.

00:23:10.060 | Do not hire a machine learning engineer without having any data.

00:23:13.060 | But again, you can get a lot more mileage out of your AI engineer with deliberate practice

00:23:19.060 | on evals and data.

00:23:21.060 | We usually find four to six weeks practice does the job.

00:23:25.060 | So in recap, one of the biggest failure modes is talent.

00:23:29.060 | We think the AI engineer is often over-scoped but under-specified.

00:23:33.060 | But we can fix that by learning evals.

00:23:37.060 | Next, I want to give it over to Shreya Shankar and Eugene Yan to talk about,

00:23:42.060 | dive into this evals and data literacy.

00:23:46.060 | Thanks.

00:23:53.060 | Question.

00:23:54.060 | Thank you, Jason.

00:23:55.060 | Thank you, Hamel.

00:23:56.060 | Next up, Shreya and I are going to share with you about the tactical aspects of building with LLMs in production.

00:24:02.060 | Specifically, evals, monitoring, and guardrails.

00:24:05.060 | So here's a hackney's quote.

00:24:07.060 | How important evals are to the team is a differentiator between teams shipping out hot garbage and those building real products.

00:24:14.060 | I would agree.

00:24:15.060 | I think here's an example of LLMs, of Apple's recent LLM, where they shared about how they actually collected 750 summaries of push notification and email summarizations.

00:24:27.060 | Because these are datasets, they are representative of their actual use case.

00:24:32.060 | So how do we build evals for our own products?

00:24:35.060 | Well, I think the simple thing is to just make it simpler.

00:24:39.060 | For example, if you're trying to extract product attributes from a product description, break it down into title, price, rating.

00:24:46.060 | And then you can just simply do assertions.

00:24:50.060 | Similarly, for summarization, instead of trying to eval that amorphous blob of a summary, break it down into dimensions, such as factual inconsistency, relevance, and informational density.

00:25:01.060 | And once you've done that, assertion-based tests can go a long way.

00:25:05.060 | Are we extracting the correct price?

00:25:07.060 | Are we extracting the correct title?

00:25:09.060 | Or if you're doing natural language to SQL generation, is it using the expected table?

00:25:13.060 | Is it using the expected columns?

00:25:15.060 | These are very simple to eval and reiterates what Hamel has mentioned about keeping it simple.

00:25:20.060 | Lastly, assertions can do everything, but they can only go so far.

00:25:26.060 | So therefore, consider evaluator models.

00:25:29.060 | Maybe training a classifier for factual inconsistency or reward model for relevance.

00:25:34.060 | This is easier if your evals are classification and regression-based.

00:25:38.060 | But that said, I don't know how I feel about LLM as a judge.

00:25:41.060 | What do you mean you don't like LLM as a judge?

00:25:45.060 | I personally am super bullish on LLM as a judge.

00:25:49.060 | And I'm curious how many of you are exploring LLM as judge or have implemented it?

00:25:55.060 | No.

00:25:56.060 | Yeah?

00:25:57.060 | There's a judge right here.

00:25:58.060 | You want to stand up?

00:25:59.060 | No.

00:26:00.060 | Actual LLM judge here.

00:26:02.060 | Yeah.

00:26:03.060 | Anyways, we're going to go through some points on what to consider when deploying LLM as judge.

00:26:10.060 | First of all, it's a no-brainer.

00:26:12.060 | LLM as judge is the most easy to prototype.

00:26:14.060 | You just have to write a prompt to check for the criteria or metric that you want.

00:26:18.060 | And you can even align this towards your own preferences by providing few-shot examples of good and bad for that criteria.

00:26:26.060 | On the other hand, fine-tuned models or LLMs that you have to collect a lot of data instead of a pipeline to train as your evaluator are not super easy to prototype and have a lot of upfront investment.

00:26:39.060 | Yeah.

00:26:40.060 | But that said, LLM as a judge is pretty difficult to align it to your specific criteria in the business.

00:26:45.060 | Who here has not had any difficulty aligning the LLM as a judge to your criteria?

00:26:52.060 | Anyone?

00:26:53.060 | Okay, we've got to talk later, Shria.

00:26:55.060 | I think that if you just have a few hundred to a few thousand samples, it's very easy to fine-tune a simple model who can do it more precisely.

00:27:03.060 | Second, if you want to do LLM as a judge and have it fairly precise, you sort of need to use chain of thought.

00:27:10.060 | And chain of thought is going to be, I don't know, five to eight seconds long.

00:27:13.060 | On the other hand, if you have a simple classifier or reward model, every request is maybe 10 milliseconds long.

00:27:20.060 | That's two orders of magnitude lower and would improve throughput.

00:27:25.060 | Next, we want to think about technical debt.

00:27:28.060 | Okay, when we're implementing our validators in production, even if they run asynchronously or they run in the critical path, how much effort do we need to put in to keep these up to date?

00:27:39.060 | With LLM as judge, if you don't make sure your few-shot examples are dynamic or some way of making sure your judge kind of prompt aligns with your definition of good and bad, then you're toast.

00:27:52.060 | And the effect is not as pronounced for fine-tuned models, but if you don't continually fine-tune your validators on new data, on new production data, then they will also be susceptible to drift.

00:28:04.060 | So overall, when do you want to use LLM as judge?

00:28:07.060 | It's honestly a resources question and where you are in your application development.

00:28:12.060 | If you're starting to prototype it, you need quick evals with minimal dev effort and need something, you have a low-ish volume of evals, start with LLM as a judge and kind of invest in the infrastructure to align that over time.

00:28:25.060 | If you have more resources or you know that your product is going to be sticky, go for a fine-tuned model.

00:28:31.060 | Next, I'm going to talk about looking at the data.

00:28:36.060 | Eugene mentioned, you know, you should create evals on your custom or bespoke criteria, but how do you know what criteria you want?

00:28:43.060 | Simple answer, look at your data.

00:28:46.060 | Great AI researchers, but we changed that to engineers, great AI engineers look at their data.

00:28:52.060 | So how do we do this?

00:28:53.060 | The first question actually before how is when do you look at this?

00:28:58.060 | I know people who never look at their data at all or people who look at it initially after deployment.

00:29:03.060 | Wrong answer.

00:29:04.060 | You want to look at it regularly.

00:29:05.060 | I work with a startup that, you know, whenever they ship a new LLM agent, they create a new Slack channel with all of the agents outputs that come in real time.

00:29:15.060 | After a couple of weeks, they transition this to kind of daily batch jobs and make sure that, you know, they're not running into errors that they didn't anticipate.

00:29:24.060 | Second thing is what specifically are you looking for?

00:29:27.060 | You want to find slices of the data that are pretty simple or easy to characterize in some way.

00:29:32.060 | For example, data that comes from a particular source or data that has a certain keyword or phrase or is about a certain topic, right?

00:29:39.060 | Simply just saying all of these are bad, but having no way of characterizing them and then improving your pipeline based on that is not going to help.

00:29:47.060 | Finally, some things to keep in mind throughout this whole kind of looking at your data experience is that your code base is very rapidly changing over time probably.

00:29:57.060 | Your prompts, components of the pipeline, and et cetera.

00:30:00.060 | So when you're inspecting traces, it's super helpful to be able to know, you know, what GitHub commit or what model version or prompt version did this correspond to?

00:30:08.060 | I think this is one of the very successful things that traditional MLOps tools did, like MLflow, for example.

00:30:14.060 | They made it very easy to trace back and then hopefully you could replay something.

00:30:18.060 | I see the judge shaking his head.

00:30:22.060 | Great.

00:30:23.060 | And finally, when using LLMs as APIs, pin model versions.

00:30:28.060 | LLM APIs are known to, you know, exhibit different behavior that is very hard to quantify for certain tasks.

00:30:35.060 | So pin, you know, GPT-4, 11.06, pin GPT-4-0, whatever it is that you're using.

00:30:42.060 | So Shreya mentioned that we need to look at our data, but how do we look at our data all the time?

00:30:47.060 | I think the way to do this is via an automated guardrail.

00:30:50.060 | Here's Brandolini's law adapted.

00:30:52.060 | The amount of energy to catch and fix defects is an order of magnitude larger than needed to produce it.

00:30:58.060 | And that's true.

00:30:59.060 | It's really easy to call an LLM API and just get something.

00:31:02.060 | But how do we know if it's actually bad?

00:31:04.060 | I think it's really important that we do have some basic form of guardrails.

00:31:08.060 | And some of them are just table sticks.

00:31:10.060 | Toxicity, personally identified information, copyright, and expected language.

00:31:14.060 | Now you may imagine that this is pretty straightforward, but sometimes you don't actually have control over the context.

00:31:20.060 | For example, if someone's posting an ad on your English website that's in a different language,

00:31:25.060 | and you're asking your LLM to extract the attributes or to summarize it, you may be surprised that for some non-zero proportion of the time,

00:31:32.060 | it's actually in a different language.

00:31:34.060 | Similarly, hallucinations happen more often than we would like.

00:31:38.060 | So imagine you're trying to summarize a movie based on the description.

00:31:41.060 | You just have a description for the trailer.

00:31:43.060 | It may actually include spoilers because it's trying so hard to be helpful.

00:31:47.060 | But that's actually a bad user experience.

00:31:50.060 | So sometimes you will include information that's not in there.

00:31:53.060 | Here's a tip.

00:31:54.060 | If we spend a little bit more time building reference-free evals, we can use them as guardrails.

00:32:00.060 | So reference-based evals are when we generate some kind of output, and we compare it to some ideal sample.

00:32:06.060 | This is pretty expensive, and you actually have to collect all these goal samples.

00:32:09.060 | On the other hand, if we have these labels, we can train an evaluator model and just compare it to the source document.

00:32:15.060 | So for example, if you're comparing summarizations, we can just check if the summary entails or contradicts the source document,

00:32:22.060 | and now we have a hallucination eval.

00:32:25.060 | So therefore, if we spend some time building reference-free evals once, we can use it to guardrail or new output.

00:32:31.060 | Cool.

00:32:32.060 | Thanks, Eugene.

00:32:34.060 | So we're going to wrap up next minute or so on some high-level bird's-eye view, 2,000-foot view, whatever you want to call it, takeaways.

00:32:43.060 | First off, how many of you remember this figure from this pretty seminal paper in MLOps that came out maybe 10 years ago?

00:32:51.060 | 2015, so nine years ago.

00:32:53.060 | Yeah.

00:32:54.060 | So I think this paper really communicated the idea that the model is a small part, and when you're productionizing ML systems,

00:33:03.060 | right, there's so much more around the model that you have to maintain over time.

00:33:08.060 | Data verification, feature engineering, monitoring your infrastructure, et cetera.

00:33:13.060 | So you might be wondering, you know, we have LLMs.

00:33:17.060 | Does any of this matter?

00:33:20.060 | Yeah?

00:33:22.060 | You have a few nods here.

00:33:25.060 | Absolutely.

00:33:26.060 | When we have LLMs, all of these, you know, tech debt principles still apply.

00:33:33.060 | And you can even think of the exact mapping for every single component in here to the LLM equivalent.

00:33:39.060 | For example, maybe we don't have feature engineering pipelines, but, you know, cast in a new light, it's RAC, right?

00:33:45.060 | We're looking at context, we're trying to retrieve what's relevant, engineer that to, you know, not distract the LLM too much.

00:33:52.060 | We have a ton of experimentation around that.

00:33:54.060 | All of this is something that needs to be maintained over time, especially as models change under the hood.

00:33:59.060 | Similarly for data validation and verification, right?

00:34:03.060 | We have evals.

00:34:04.060 | We have guardrails that need to be deployed, right?

00:34:06.060 | It's not just simply wrap your model or GPT in some software and ship it.

00:34:12.060 | No, there's like a lot of investment that needs to happen around the model.

00:34:17.060 | All right.

00:34:18.060 | So I'd like to end with this quote from Kapati Senpai.

00:34:20.060 | There's a large class of problems.

00:34:22.060 | They are really easy to imagine and build demos for, but it's extremely hard to milk products out of.

00:34:28.060 | For example, Charles dug up this paper of the first car driven by a neural network.

00:34:35.060 | That was 1988.

00:34:37.060 | 25 years later, Andre Kapati took his first demo drive of Waymo, 2013.

00:34:44.060 | Ten years later, I hope all of you had a chance to try the Waymo.

00:34:48.060 | We got the driverless permit for Waymo in San Francisco.

00:34:53.060 | Maybe in a couple more years, we'll have it for the whole of California.

00:34:57.060 | The point is, going from demo to production takes time.

00:35:01.060 | That's all we had.

00:35:02.060 | Thank you.

00:35:03.060 | Let's do it.

00:35:04.060 | We'll see you next time.

00:35:05.060 | We'll see you next time.

00:35:05.060 | Bye.

00:35:06.060 | Bye.

00:35:07.060 | Bye.

00:35:08.060 | Bye.

00:35:09.060 | Bye.

00:35:10.060 | Bye.

00:35:11.060 | Bye.

00:35:12.060 | Bye.

00:35:13.060 | Bye.

00:35:14.060 | Bye.

00:35:15.060 | Bye.

00:35:16.060 | you

00:35:18.120 | you

Lessons From A Year Building With LLMs

Chapters