Lessons From A Year Building With LLMs

Chapters
0:00 Introduction
3:22 Strategic: Bryan Bischof & Charles Frye
14:47 Operational: Hamel Husain & Jason Liu
23:51 Tactical: Eugene Yan & Shreya Shankar
00:00:13.720 |
You're about to experience something of a strange talk, 00:00:17.040 |
and not just because Bryan and I are strange, 00:00:20.740 |
but because something kind of strange happened. 00:00:23.960 |
Over the last year, a bunch of us were posting things on Twitter. 00:00:27.760 |
We were writing blog posts complaining about LLMs, 00:00:33.720 |
And we were, you know, continuing to complain about LLMs to each other 00:00:38.300 |
and sharing what we were working on when we realized we were all learning the same lessons. 00:00:46.780 |
So we got together, and we turned what was initially a couple 00:00:52.520 |
of short blog posts into a long white paper on O'Reilly, 00:00:56.900 |
combining our lessons across strategic, operational, 00:01:00.920 |
and tactical levels of building LLM applications. 00:01:04.480 |
And the response to that white paper was overwhelmingly positive. 00:01:08.340 |
We heard from everybody, from people who contribute to Postgres, 00:01:14.380 |
to venture capitalists, to tool builders, saying they loved what we wrote. 00:01:23.700 |
And we were invited on the strength of that to give this keynote address. 00:01:29.480 |
And so we faced a kind of funny challenge, which is that part of the appeal 00:01:32.960 |
of this article was that the six of us all came together to write it. 00:01:38.760 |
As Scott Condren put it, it was like an Avengers team-up. 00:01:42.780 |
So we had to figure out a way to deliver one keynote talk from six people. 00:01:48.620 |
So we pulled the Avengers together for one night only to sort of like deliver some 00:01:58.520 |
of the most important insights from that 30-page article, to add some 00:02:03.640 |
of our spicy extra takes that ended up on the cutting room floor. 00:02:10.180 |
I'd like to state unequivocally that we are not, in fact, crypto bros who just found a new hype train to jump on. 00:02:21.600 |
We all trained our first neural networks back when you had to write the gradients by hand. 00:02:27.580 |
So just as we split the article into three pieces, we split the talk into three pieces. 00:02:31.680 |
First you're going to hear from me and Bryan talking about the strategic considerations for building LLM applications. 00:02:44.160 |
Then we're going to hand the clickers and the stage over to Hamel Husain and Jason Liu, who 00:02:51.400 |
are going to share the operational considerations. 00:02:57.160 |
How do you think about workflows around delivering LLM applications? 00:03:02.880 |
And then they will hand over the clickers and the stage to Shreya Shankar and Eugene Yan who 00:03:08.500 |
will talk about the tactical considerations for building LLM applications. 00:03:12.440 |
What are the specific techniques, tactics, and moves that have stood the test of one year's time? 00:03:23.640 |
So Bryan, how do you build an LLM application without getting outmaneuvered and wasting everybody's time? 00:03:32.800 |
Well, many of you may be thinking that there's really only one way to win in this new, exciting space. 00:03:43.360 |
And that, of course, is to train your own custom model. 00:03:46.940 |
Pre-training, fine tuning, a little RLHF here and there. 00:03:58.980 |
For almost no one in this audience, the model is the moat. 00:04:04.000 |
You all as AI engineering devotees should be building in your zone of genius. 00:04:12.200 |
You should be leveraging your product expertise or your existing product. 00:04:18.060 |
And you should be finding your niche and digging into that niche, exploiting it. 00:04:25.000 |
You should be building what the model providers are not. 00:04:29.200 |
There's a high likelihood that the model providers have to build a lot of things for all of their customers. 00:04:36.440 |
Don't waste your calories on building these things. 00:04:39.580 |
Sam Altman's phrase about getting steamrolled is appropriate here. 00:04:44.560 |
And you should be treating the models like any other SaaS product. 00:04:49.080 |
You should be quickly dropping them when there's a competitor that's clearly better. 00:04:54.020 |
No offense to GPT-4o, but Sonnet 3.5 is looking pretty sharp. 00:05:02.700 |
It's important to keep in mind that a model with high MMLU scores is not a product. 00:05:10.980 |
High MMLU scores don't automate all of your users' data requests, or even 87% of them. 00:05:24.340 |
An excellent LLM-powered application is an excellent product. 00:05:43.880 |
So, what should you build if not all of these things? 00:05:50.940 |
Things that generalize to smarter and faster models. 00:05:54.880 |
Things that help you maintain your product's quality bar under uncertainty. 00:06:02.080 |
And things that help you continuously improve. 00:06:12.400 |
The idea of continuous improvement has been brought to the world of LLM applications by 00:06:22.340 |
this shift in focus that we've all felt since the previous AI Engineer Summit to focus on evals and data. 00:06:32.360 |
It's nicely summarized by this diagram from our co-author, Hamel Husain, showing this virtuous cycle. 00:06:42.480 |
But the core reason to create those evals, the core reason to collect that data, is to drive that continuous improvement. 00:06:52.280 |
And despite what your expensive consultants or the many LinkedIn influencers posting 00:07:01.300 |
about LLM apps might say, this is not actually the first time that engineers have tried to 00:07:06.560 |
tame a complex system and make it useful and valuable. 00:07:11.760 |
This same loop of iterative improvement was also at the core of MLOps and the operationalization of machine learning. 00:07:21.980 |
This figure from our co-author Shreya Shankar's paper had that same loop of iterative improvement 00:07:28.240 |
centered also on evaluation and on data collection. 00:07:33.260 |
MLOps was also not the first time that engineers faced this problem, the problem of complexity, 00:07:42.100 |
the problem of nondeterminism and uncertainty. 00:07:46.500 |
The DevOps movement that gave MLOps its name also focused on this kind of iterative improvement 00:07:53.760 |
and on monitoring information in production to turn into improvements to products. 00:08:01.420 |
But dear reader, DevOps was not the first time that engineers tackled this problem of uncertainty 00:08:09.780 |
and solved it and solved it with iterative improvement. 00:08:13.040 |
DevOps built on the ideas of the lean startup movement from Eric Ries that was focusing not 00:08:20.060 |
just on building an application, not just on building a machine learning model or an LLM agent, but on building a business. 00:08:28.040 |
And it used this same loop centered on measurement and data to drive the improvement and building of that business. 00:08:38.520 |
This idea itself was not invented in Northern California, despite what some people might say. 00:08:45.900 |
It has its roots in the Toyota production system and in the idea of Kaizen or continuous improvement. 00:08:53.000 |
Genchi Genbutsu is one of the core principles from that movement that we can take forward into the world of LLM applications. 00:09:02.220 |
And at Toyota, that meant sending executives out to factory floors, getting their khakis a little bit dirty. 00:09:08.540 |
For LLM applications, the equivalent is looking at your data. 00:09:14.520 |
That data is the real information about how your LLM application is delivering value to your users. 00:09:20.060 |
There's nothing that is more valuable than that. 00:09:25.620 |
Finally, there's lots of people selling tools at this conference, including myself. 00:09:31.380 |
It's easy to get overly excited about the tools and the construction of this iterative loop 00:09:35.240 |
of improvement and to forget where value actually comes from. 00:09:38.460 |
And there's a great pithy, earthy statement from the Toyota Production System, from Shigeo Shingo, on exactly this point. 00:09:48.720 |
So we have to make sure that we don't get lost just building our evals and calculating concept drift metrics. 00:09:55.720 |
Instead, make sure that we continue to get out there and bend metal and create value for our users. 00:10:01.800 |
I might have misunderstood earlier when you said let's get bent. 00:10:05.800 |
So right off the bat, we need to spin that data flywheel, Bob. 00:10:39.960 |
What I do have to sell you is this idea that you should be getting out there. 00:10:48.780 |
You might ask: what if this isn't good enough for my customers? 00:10:58.400 |
If it's good enough for incredible products like Apple Intelligence, Photoshop, and Hex, it's probably good enough for you. 00:11:08.720 |
You need to start looking at your user interactions. 00:11:11.680 |
Real user interactions and LLM responses deserve human eyes. 00:11:19.140 |
You can give it some AI eyes, too, but definitely look at it with your human eyes. 00:11:33.680 |
And finally, user requests will reveal the PMF opportunities that lie just below your product's surface. 00:11:47.100 |
What are they asking your chat bot that you haven't yet implemented? 00:11:50.640 |
That's a really nice direction to skate if that's where the puck is going. 00:11:57.960 |
And despite the focus on the user interactions that you can have today, the things that you 00:12:03.260 |
can ship right now, it's important to also think about the future. 00:12:08.600 |
The best way to predict the future is to look at the past, find the people who predicted the present, and learn how they did it. 00:12:17.160 |
In designing many of the components of the personal computing revolution, Alan Kay and others 00:12:22.500 |
at Xerox PARC adopted, as a core technique, projecting Moore's law out into the future. 00:12:29.460 |
They built expensive, unmarketable, slow, and buggy systems themselves so they could experience 00:12:35.580 |
what it was like and build for that future and create it. 00:12:41.780 |
We don't have quite the industrial scaling information that Moore had when he wrote down his predictions. 00:12:49.940 |
But we do have the beginnings of those same laws. 00:12:54.500 |
There's been an order of magnitude decrease in cost every 12 to 18 months at three distinct levels of capability. 00:13:00.620 |
At the level of davinci, the original GPT-3 API model that got a lot of us excited about this technology. 00:13:10.420 |
At the level of text-davinci-002, the model lineage underlying ChatGPT, which brought the 00:13:16.580 |
rest of the world to excitement about this technology. 00:13:20.460 |
And the latest and greatest level of capabilities with GPT-4 and Sonnet. 00:13:25.060 |
In each case, around 15 months is enough time to drop the cost by an entire order of magnitude. 00:13:36.020 |
And so the appropriate way to plan for the future is to think what this implies for what 00:13:42.620 |
applications that are not economical today will be economical at the time that you need 00:13:50.060 |
So in 2023, it cost about $625 an hour to run a video game where all the NPCs were powered by LLMs. 00:14:00.100 |
In 1980, it cost about $6 an hour to play Pac-Man, inflation adjusted. 00:14:06.100 |
That suggests that if we just wait for two orders of magnitude reduction or about 30 months from 00:14:10.740 |
mid-2023, it should be possible to deliver a compelling video game experience with chat bot 00:14:17.260 |
NPCs at about $6 an hour and people will probably pay for it. 00:14:22.140 |
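To make the arithmetic concrete, here's a minimal back-of-the-envelope sketch of that projection in Python; the 15-months-per-order-of-magnitude rate and the dollar figures come from the talk, while the function and variable names are just illustrative.

```python
# A minimal sketch of the cost projection described above.
# Assumption (from the talk): LLM inference cost at a fixed capability level
# drops by roughly 10x every ~15 months.
def projected_cost(cost_today: float, months_from_now: float, months_per_10x: float = 15.0) -> float:
    """Project a cost forward, assuming one order-of-magnitude drop every `months_per_10x` months."""
    return cost_today / (10 ** (months_from_now / months_per_10x))

llm_npc_game_2023 = 625.0  # ~$625/hour to run LLM-powered NPCs in mid-2023 (figure from the talk)
pac_man_1980 = 6.0         # ~$6/hour to play Pac-Man in 1980, inflation adjusted

# Two orders of magnitude take ~30 months: mid-2023 + 30 months is roughly late 2025 / early 2026.
print(projected_cost(llm_npc_game_2023, months_from_now=30))  # ~6.25, in the Pac-Man price range
```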
So, you can't sell it now, but you can live it, you can design it, and you can be ready when the costs catch up. 00:14:28.980 |
So that's how to think about the future and how to think strategically when building LLM applications. 00:14:36.020 |
I'd like to call to the stage my co-authors, Jason Liu and Hamel Husain, to talk about the operational considerations. 00:14:53.060 |
So, Hamel and I have basically been doing a lot of AI consulting in the past year, right? 00:14:58.420 |
We've worked with about 20 companies so far, you know, everything from pre-seed 00:15:02.940 |
all the way to public companies, and I'm pretty bored of giving generic good advice, especially 00:15:07.820 |
because there's such a range of operators here. 00:15:13.080 |
My goal today is to tell you how to ruin your business. 00:15:17.840 |
First of all, everyone knows that in the gold rush, you sell shovels and so if you want to 00:15:22.820 |
get gold, you've got to buy shovels too, right? 00:15:25.140 |
You know, if you want to find more gold, keep buying shovels. 00:15:35.000 |
And how do I decide between digging one deep hole versus making investments in plenty of shallow holes? 00:15:39.600 |
Again, the answer is more shovels, clearly, right? 00:15:42.100 |
And this might be generic so I'll give you some more specific advice. 00:15:46.860 |
If your RAG app doesn't work, try a different vector database. 00:15:50.900 |
If the methodology doesn't work, implement a new paper. 00:15:54.100 |
And maybe if you update the embedding model, you'll finally find product market fit. 00:16:02.360 |
Because truth be told, success does not lie in developing expertise or processes. 00:16:08.600 |
There's no need to balance between exploring and exploiting the mechanisms that work for you. 00:16:15.360 |
And the processes and the decision-making frameworks don't matter. 00:16:22.620 |
So, how do you find a machine learning engineer who can fine-tune as quickly as possible? 00:16:28.120 |
After all, a $2,000-per-month OpenAI bill is very expensive. 00:16:31.880 |
Instead, hire someone for a quarter of a million dollars, give them 1% of your company, 00:16:36.960 |
and have them fight CUDA build errors and figure out server cold starts, right? 00:16:40.960 |
Because what's the point of growing your company if you're just a wrapper? 00:16:44.960 |
And if your margins are too low, try fine-tuning. 00:16:48.300 |
It's much easier than figuring out how to build something worth charging for. 00:16:52.460 |
It's really -- I cannot reiterate this enough. 00:16:56.300 |
It's very important to hire a machine learning engineer as quickly as possible, right? 00:17:00.660 |
Even if you have no data and no generative product yet. 00:17:03.460 |
They love fixing Vercel TypeScript build errors. 00:17:07.720 |
And generally, if you hire a full-stack engineer who's really caught the LLM bug, they can handle the machine learning too. 00:17:16.220 |
And this is because Python is a dead language, right? 00:17:19.720 |
Machine learning engineers, research engineers can easily pick up TypeScript, 00:17:24.220 |
and the ecosystem that exists in Python could be quickly re-implemented in a couple of weekends, right? 00:17:30.380 |
The people who wrote Python code for the past 10 years doing data analysis, 00:17:34.520 |
they're going to easily be able to transition their tools. 00:17:37.060 |
And if anything, it's really easy to teach things like product sense and data literacy to the JavaScript community. 00:17:44.060 |
And most important of all, in order to find this kind of magic talent, 00:17:50.060 |
we need to create a very catch-all job title. 00:17:53.060 |
Let's use words like ninja and wizard or data scientist or prompt engineer or even the AI engineer. 00:17:59.560 |
In the past 10 years, we've known that this works really well, right? 00:18:03.560 |
Every time, we know exactly who we want: as long as we cast a very wide net of skills, 00:18:10.560 |
it doesn't really matter that we don't know what outcomes we're looking for. 00:18:14.560 |
Anyways, to dig me out of this hole, I'll have Hamel explain. And, you know, take a deep breath. 00:18:33.060 |
I mean, let's just step back from the cliff a little bit. 00:18:37.060 |
And let's kind of linger on the topic of AI engineer. 00:18:46.060 |
Like, much props to swyx for kind of popularizing this term. 00:18:50.060 |
It allows us all to get together and have conversations like this. 00:18:53.060 |
But I think that there's a misunderstanding of the skills of AI engineer. 00:19:03.060 |
As a founder or engineering leader, the talent is the most important lever that you have. 00:19:10.060 |
And so what I'm going to do is I'm going to talk about some of the problems 00:19:14.060 |
and perhaps some solutions when it comes to this talent misunderstanding. 00:19:23.060 |
So this is a diagram that everyone has probably seen. 00:19:27.060 |
There's a spectrum of skills in the AI space. 00:19:30.060 |
And there's this API dividing line in the middle. 00:19:33.060 |
And kind of to the right of the API dividing line, we have AI engineer. 00:19:37.060 |
The AI engineer skills are focused on things like chains, agents, tooling, and infra. 00:19:45.060 |
And conspicuously missing from the AI engineer are skills like evals and data. 00:19:52.060 |
And I think a lot of people have taken this diagram too literally and taken it to heart 00:19:57.060 |
and say, hey, we don't really need to know about evals, for example. 00:20:02.060 |
The problem is that you can go from 0 to 1 really fast. 00:20:06.060 |
In fact, you can go from 0 to 1 faster than ever before with all the great tools out there. 00:20:11.060 |
Just by using vibe checks and implementing the tools that we talked about. 00:20:15.060 |
However, without evals, you can't make progress. 00:20:20.060 |
Because if you can't measure what you're doing, you can't make your system better. 00:20:32.060 |
So what do you do about this evals skill set and data literacy? 00:20:35.060 |
So, Jason and I have found that engineers can actually get really good at writing evals and at data literacy 00:20:43.060 |
with just four to six weeks of deliberate practice. 00:20:48.060 |
And we think that these skills, evals and data, should be brought more into the core of AI engineer. 00:20:57.060 |
And it's something that we see over and over again. 00:20:59.060 |
So, the next thing I want to talk about is the AI engineer job title itself. 00:21:09.060 |
What we see over and over again in our consulting is that this kind of catch-all role has very inflated expectations. 00:21:18.060 |
Any time anything goes wrong with the AI, people look towards that role to fix it. 00:21:24.060 |
And sometimes that role doesn't have all the skills they need to move forward. 00:21:28.060 |
And we've seen this before with the role of data scientists. 00:21:37.060 |
What I want to emphasize is I think AI engineer is very aspirational. 00:21:45.060 |
But you need to have reasonable expectations. 00:21:48.060 |
And just to kind of bring it back to data science, we've seen this before in data science as well. 00:21:54.060 |
A decade ago, when this role was coined, it was expected to cover everything: 00:22:03.060 |
software engineering skills, statistics, math, domain expertise. 00:22:08.060 |
And we found out as an industry that we had to unroll that into many other different roles. 00:22:12.060 |
Such as decision scientist, machine learning engineer, data engineer, so on and so forth. 00:22:17.060 |
And I think similar things may be happening with the role of AI engineer. 00:22:22.060 |
And what I see, or what we both see in consulting, is that it's helpful to be more specific. 00:22:29.060 |
To be more deliberate about what skills you need and at what time. 00:22:32.060 |
And depending on your maturity, it's very helpful to not only specify what the skills are, but also when you need them. 00:22:40.060 |
So these are some job titles from GitHub Copilot 00:22:44.060 |
that are very specific about the skills you need at that time. 00:22:48.060 |
And really it's important to hire the right talent at the right time. 00:22:53.060 |
So when you're first starting out, you only need application development and software engineering skills. 00:23:02.060 |
Then you need platform and data engineering to capture that data. 00:23:06.060 |
And then only after that you should hire a machine learning engineer. 00:23:10.060 |
Do not hire a machine learning engineer without having any data. 00:23:13.060 |
But again, you can get a lot more mileage out of your AI engineer with deliberate practice on evals and data. 00:23:21.060 |
We usually find four to six weeks practice does the job. 00:23:25.060 |
So in recap, one of the biggest failure modes is talent. 00:23:29.060 |
We think the AI engineer is often over-scoped but under-specified. 00:23:37.060 |
Next, I want to give it over to Shreya Shankar and Eugene Yan to talk about the tactical considerations. 00:23:56.060 |
Next up, Shreya and I are going to share with you about the tactical aspects of building with LLMs in production. 00:24:02.060 |
Specifically, evals, monitoring, and guardrails. 00:24:07.060 |
How seriously a team takes evals is a differentiator between teams shipping hot garbage and teams building real products. 00:24:15.060 |
Here's an example from Apple's recent LLM work, where they shared how they actually collected 750 summaries for each of their push notification and email summarization use cases. 00:24:27.060 |
These datasets are representative of their actual use cases. 00:24:32.060 |
So how do we build evals for our own products? 00:24:35.060 |
Well, I think the simple answer is to just make the task simpler. 00:24:39.060 |
For example, if you're trying to extract product attributes from a product description, break it down into title, price, rating. 00:24:50.060 |
Similarly, for summarization, instead of trying to eval that amorphous blob of a summary, break it down into dimensions, such as factual inconsistency, relevance, and informational density. 00:25:01.060 |
And once you've done that, assertion-based tests can go a long way. 00:25:09.060 |
Or if you're doing natural language to SQL generation, is it using the expected table? 00:25:15.060 |
These are very simple to eval and reiterate what Hamel has mentioned about keeping it simple. 00:25:20.060 |
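As an illustration of what such assertion-based checks might look like, here's a minimal sketch; `extract_attributes`, `generate_sql`, and the field names are hypothetical stand-ins for your own pipeline.

```python
# Minimal sketch of assertion-based evals (pipeline functions are hypothetical).
def test_product_extraction(extract_attributes, product_description: str):
    attrs = extract_attributes(product_description)  # your LLM extraction step
    # Assert the decomposed fields are present and well-formed.
    assert "title" in attrs and attrs["title"].strip()
    assert isinstance(attrs["price"], (int, float)) and attrs["price"] >= 0
    assert 0 <= attrs["rating"] <= 5

def test_nl_to_sql(generate_sql, question: str, expected_table: str):
    sql = generate_sql(question)  # your LLM text-to-SQL step
    # Cheap, deterministic check: is the expected table referenced at all?
    assert expected_table.lower() in sql.lower()
```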
Lastly, assertions can do a lot, but they can only go so far. 00:25:29.060 |
The next step might be training a classifier for factual inconsistency, or a reward model for relevance. 00:25:34.060 |
This is easier if your evals are classification and regression-based. 00:25:38.060 |
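One way such a classification-based evaluator could be sketched, assuming you already have a few hundred labeled examples, is with off-the-shelf sentence embeddings and logistic regression; the model name and feature scheme here are illustrative choices, not a prescription.

```python
# Minimal sketch: framing "is this summary factually consistent?" as binary classification.
# Assumes labeled (source, summary, label) examples are available.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model (illustrative choice)

def featurize(sources, summaries):
    # Concatenate embeddings of source and summary as a simple feature vector.
    return np.hstack([encoder.encode(sources), encoder.encode(summaries)])

def train_consistency_classifier(sources, summaries, labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(featurize(sources, summaries), labels)  # labels: 1 = consistent, 0 = inconsistent
    return clf

# At eval/guardrail time, scoring is a few milliseconds per example:
# score = clf.predict_proba(featurize([source], [summary]))[0, 1]
```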
But that said, I don't know how I feel about LLM as a judge. 00:25:41.060 |
What do you mean you don't like LLM as a judge? 00:25:45.060 |
I personally am super bullish on LLM as a judge. 00:25:49.060 |
And I'm curious how many of you are exploring LLM as judge or have implemented it? 00:26:03.060 |
Anyways, we're going to go through some points on what to consider when deploying LLM as judge. 00:26:14.060 |
You just have to write a prompt to check for the criteria or metric that you want. 00:26:18.060 |
And you can even align this towards your own preferences by providing few-shot examples of good and bad for that criteria. 00:26:26.060 |
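A minimal sketch of an LLM judge along those lines, assuming the OpenAI Python client; the criterion, the few-shot examples, and the model name are placeholders to adapt to your own use case.

```python
# Minimal LLM-as-judge sketch (criterion, examples, and model name are placeholders).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating summaries for factual consistency with the source.
Answer with a single word: PASS or FAIL.

Example (good): Source: "The meeting moved to 3pm." Summary: "Meeting now at 3pm." -> PASS
Example (bad):  Source: "The meeting moved to 3pm." Summary: "Meeting cancelled."   -> FAIL

Source: {source}
Summary: {summary}
Answer:"""

def judge(source: str, summary: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # pin a specific snapshot in production
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```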
On the other hand, fine-tuned evaluator models, where you have to collect a lot of data and set up a pipeline to train them, are not super easy to prototype and have a lot of upfront investment. 00:26:40.060 |
But that said, LLM as a judge is pretty difficult to align to your specific criteria in the business. 00:26:45.060 |
Who here has not had any difficulty aligning the LLM as a judge to your criteria? 00:26:55.060 |
I think that if you just have a few hundred to a few thousand samples, it's very easy to fine-tune a simple model that can do it more precisely. 00:27:03.060 |
Second, if you want to do LLM as a judge and have it fairly precise, you sort of need to use chain of thought. 00:27:10.060 |
And chain of thought is going to be, I don't know, five to eight seconds long. 00:27:13.060 |
On the other hand, if you have a simple classifier or reward model, every request is maybe 10 milliseconds long. 00:27:20.060 |
That's more than two orders of magnitude lower latency, which would improve throughput. 00:27:28.060 |
Okay, when we're implementing our validators in production, whether they run asynchronously or in the critical path, how much effort do we need to put in to keep them up to date? 00:27:39.060 |
With LLM as judge, if you don't keep your few-shot examples dynamic, or have some other way of making sure your judge prompt stays aligned with your definition of good and bad, then you're toast. 00:27:52.060 |
And the effect is not as pronounced for fine-tuned models, but if you don't continually fine-tune your validators on new data, on new production data, then they will also be susceptible to drift. 00:28:04.060 |
So overall, when do you want to use LLM as judge? 00:28:07.060 |
It's honestly a resources question and where you are in your application development. 00:28:12.060 |
If you're just starting to prototype, you need quick evals with minimal dev effort, and you have a low-ish volume of evals: start with LLM as a judge and invest in the infrastructure to align it over time. 00:28:25.060 |
If you have more resources or you know that your product is going to be sticky, go for a fine-tuned model. 00:28:31.060 |
Next, I'm going to talk about looking at the data. 00:28:36.060 |
Eugene mentioned, you know, you should create evals on your custom or bespoke criteria, but how do you know what criteria you want? 00:28:46.060 |
The saying goes that great AI researchers look at their data; we changed that to engineers: great AI engineers look at their data. 00:28:53.060 |
The first question, actually, before how, is: when do you look at this data? 00:28:58.060 |
I know people who never look at their data at all, or people who only look at it right after deployment. 00:29:05.060 |
I work with a startup that, you know, whenever they ship a new LLM agent, they create a new Slack channel with all of the agent's outputs coming in in real time. 00:29:15.060 |
After a couple of weeks, they transition this to kind of daily batch jobs and make sure that, you know, they're not running into errors that they didn't anticipate. 00:29:24.060 |
Second thing is what specifically are you looking for? 00:29:27.060 |
You want to find slices of the data that are pretty simple or easy to characterize in some way. 00:29:32.060 |
For example, data that comes from a particular source or data that has a certain keyword or phrase or is about a certain topic, right? 00:29:39.060 |
Simply just saying all of these are bad, but having no way of characterizing them and then improving your pipeline based on that is not going to help. 00:29:47.060 |
Finally, some things to keep in mind throughout this whole kind of looking at your data experience is that your code base is very rapidly changing over time probably. 00:29:57.060 |
Your prompts, components of the pipeline, and et cetera. 00:30:00.060 |
So when you're inspecting traces, it's super helpful to be able to know, you know, which GitHub commit, model version, or prompt version this corresponds to. 00:30:08.060 |
I think this is one of the very successful things that traditional MLOps tools did, like MLflow, for example. 00:30:14.060 |
They made it very easy to trace back and then hopefully you could replay something. 00:30:23.060 |
And finally, when using LLMs as APIs, pin model versions. 00:30:28.060 |
LLM APIs are known to, you know, exhibit different behavior that is very hard to quantify for certain tasks. 00:30:35.060 |
So pin, you know, gpt-4-1106-preview, pin gpt-4o, whatever it is that you're using. 00:30:42.060 |
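A minimal sketch combining both points, pinning an exact model snapshot and attaching version metadata to every trace; the field names and the JSONL sink are illustrative, not any particular tracing tool's API.

```python
# Minimal sketch: pin the model snapshot and attach version metadata to every trace.
import json, subprocess, time, uuid

MODEL = "gpt-4o-2024-05-13"      # pinned snapshot (illustrative), not a floating alias like "gpt-4o"
PROMPT_VERSION = "summarize-v7"  # whatever versioning scheme you use for prompts

def current_git_commit() -> str:
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

def log_trace(inputs: dict, output: str) -> None:
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "git_commit": current_git_commit(),
        "model": MODEL,
        "prompt_version": PROMPT_VERSION,
        "inputs": inputs,
        "output": output,
    }
    # Append-only JSONL log; swap in your tracing/observability tool of choice.
    with open("llm_traces.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```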
So Shreya mentioned that we need to look at our data, but how do we look at our data all the time? 00:30:47.060 |
I think the way to do this is via an automated guardrail. 00:30:52.060 |
The amount of energy needed to catch and fix defects is an order of magnitude larger than the energy needed to produce them. 00:30:59.060 |
It's really easy to call an LLM API and just get something. 00:31:04.060 |
I think it's really important that we do have some basic form of guardrails. 00:31:10.060 |
Toxicity, personally identifiable information, copyright, and expected language. 00:31:14.060 |
Now you may imagine that this is pretty straightforward, but sometimes you don't actually have control over the context. 00:31:20.060 |
For example, if someone's posting an ad on your English website that's in a different language, 00:31:25.060 |
and you're asking your LLM to extract the attributes or to summarize it, you may be surprised that, for some non-zero proportion of the time, the output comes back in that other language. 00:31:34.060 |
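Here's a minimal sketch of such basic output guardrails, assuming the `langdetect` package for the language check; the PII regexes are illustrative, and a real toxicity check would call out to a dedicated classifier.

```python
# Minimal output-guardrail sketch: language, PII, and a placeholder toxicity check.
import re
from langdetect import detect  # pip install langdetect

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def guardrail(output: str, expected_lang: str = "en") -> list[str]:
    violations = []
    if detect(output) != expected_lang:                     # language drift (e.g., ad text in another language)
        violations.append("unexpected_language")
    if EMAIL_RE.search(output) or PHONE_RE.search(output):  # crude PII check
        violations.append("possible_pii")
    # Toxicity would typically be a small hosted or local classifier; stubbed here.
    # if toxicity_score(output) > 0.8: violations.append("toxicity")
    return violations
```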
Similarly, hallucinations happen more often than we would like. 00:31:38.060 |
So imagine you're trying to summarize a movie based on the description. 00:31:43.060 |
It may actually include spoilers because it's trying so hard to be helpful. 00:31:50.060 |
So sometimes it will include information that's not in the source. 00:31:54.060 |
If we spend a little bit more time building reference-free evals, we can use them as guardrails. 00:32:00.060 |
So reference-based evals are when we generate some kind of output, and we compare it to some ideal sample. 00:32:06.060 |
This is pretty expensive, and you actually have to collect all these gold samples. 00:32:09.060 |
On the other hand, once we have these labels, we can train an evaluator model that just compares the output to the source document. 00:32:15.060 |
So for example, for summarization, we can just check if the summary entails or contradicts the source document. 00:32:25.060 |
So therefore, if we spend some time building reference-free evals once, we can use them to guardrail all new output. 00:32:34.060 |
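One way to sketch that entailment check is with an off-the-shelf NLI model via Hugging Face Transformers; the model choice, label order, and threshold are assumptions to verify against the model card before relying on this.

```python
# Minimal reference-free guardrail sketch: does the summary contradict its source?
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # illustrative NLI model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entailment_scores(source: str, summary: str) -> dict:
    inputs = tokenizer(source, summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Assumed label order for roberta-large-mnli (check the model card): contradiction, neutral, entailment.
    return {"contradiction": probs[0].item(), "neutral": probs[1].item(), "entailment": probs[2].item()}

def passes_guardrail(source: str, summary: str, max_contradiction: float = 0.5) -> bool:
    return entailment_scores(source, summary)["contradiction"] < max_contradiction
```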
So we're going to wrap up in the next minute or so with some high-level bird's-eye-view, 2,000-foot-view, whatever you want to call it, takeaways. 00:32:43.060 |
First off, how many of you remember this figure from this pretty seminal paper in MLOps that came out maybe 10 years ago? 00:32:54.060 |
So I think this paper really communicated the idea that the model is a small part, and when you're productionizing ML systems, 00:33:03.060 |
right, there's so much more around the model that you have to maintain over time. 00:33:08.060 |
Data verification, feature engineering, monitoring your infrastructure, et cetera. 00:33:13.060 |
So you might be wondering, you know, now that we have LLMs, does all of this still apply? 00:33:26.060 |
When we have LLMs, all of these, you know, tech debt principles still apply. 00:33:33.060 |
And you can even think of the exact mapping for every single component in here to the LLM equivalent. 00:33:39.060 |
For example, maybe we don't have feature engineering pipelines, but, you know, cast in a new light, it's RAG, right? 00:33:45.060 |
We're looking at context, we're trying to retrieve what's relevant, engineer that to, you know, not distract the LLM too much. 00:33:52.060 |
We have a ton of experimentation around that. 00:33:54.060 |
All of this is something that needs to be maintained over time, especially as models change under the hood. 00:33:59.060 |
Similarly for data validation and verification, right? 00:34:04.060 |
We have guardrails that need to be deployed, right? 00:34:06.060 |
It's not as simple as just wrapping your model or GPT in some software and shipping it. 00:34:12.060 |
No, there's like a lot of investment that needs to happen around the model. 00:34:18.060 |
So I'd like to end with this quote from Karpathy-senpai. 00:34:22.060 |
These things are really easy to imagine and build demos for, but it's extremely hard to milk products out of them. 00:34:28.060 |
For example, Charles dug up this paper of the first car driven by a neural network. 00:34:37.060 |
25 years later, in 2013, Andrej Karpathy took his first demo drive in a Waymo. 00:34:44.060 |
Ten years later, I hope all of you had a chance to try the Waymo. 00:34:48.060 |
Waymo got its driverless permit in San Francisco. 00:34:53.060 |
Maybe in a couple more years, we'll have it for the whole of California. 00:34:57.060 |
The point is, going from demo to production takes time.