On Engineering AI Systems that Endure The Bitter Lesson - Omar Khattab, DSPy & Databricks

Chapters
0:00 AI Engineer World's Fair
0:22 On Engineering AI Systems that Endure the Bitter Lesson
0:32 The Challenges of AI Software Engineering
0:40 The Bitter Lesson
4:50 AI Engineering's Purpose
6:39 Takeaway 1: Engineering for Scalability
7:19 Premature Optimization
12:18 The Problem with Prompts
14:26 Trusty Old Separation of Concerns
17:11 Takeaway 2: Invest in Decoupling
17:21 The Pyramid of LLM Software and DSPy
17:45 The DSPy Concept: Declarative Signatures
So thanks everyone for showing up, and thanks to the organizers for inviting me and having me here. I'm excited to talk to you all about engineering AI systems that endure the bitter lesson. I'm Omar; I guess the intro has already happened, so let's not repeat that.

If you're here, I think it's probably because you engineer what we might call AI software, or maybe you manage or work with people who do. It's not a term that has been used in this special way for very long, so we're all trying to figure out what the right basics and fundamentals are here, and which things are fleeting. That's what this talk will largely be about.

The name of the game, and it's kind of a meme at this point, is that every week there's a new large language model; maybe every week is actually too slow at this point. Each one actually changes something in terms of the trade-offs you can strike. It might not be the state of the art in terms of the best quality necessarily, although sometimes it is, but maybe it's the best performance for a certain cost, or the best performance for certain types of applications, or maybe it's the speed that's really incredible; we've seen things like diffusion models now. So every week there's a new LLM that you have to think about if you're engineering software in this space, which is really unusual: if you think back to normal software engineering, you change your hardware every two or three years, maybe, if that.

The other part that's actually a little bit weirder is that, if you are lucky, the LLM provider has recognized that they're not really building these LLMs, they're training them; the models emerge from a lot of nudging and data and iterating on a lot of evals, and a lot of vibes as well. If you're lucky, the provider has realized that there are new quirks in their latest models that weren't there before, and, to the surprise of many people, to this day you still get longer and longer prompting guides for the latest models that are supposed to be closer and closer to AGI. If you're less lucky, you have to figure that out on your own. If you're even less lucky, the prompting guides from the provider are not even that good, so you have to figure out what the model actually needs on your own.
And every day, maybe at an even faster pace, someone is releasing an arXiv paper or a tweet or something that introduces a new learning algorithm, maybe some reinforcement learning bells and whistles, maybe some prompting tricks, maybe a prompt optimization technique, something or other that promises to make your system learn better and fit your goals better. Someone else is introducing some search or scaling or inference strategies, or agent frameworks or agent architectures, that promise to finally unlock levels of reliability or quality better than what you had before. And I think if you're actually doing a reasonable job, most likely you're scrambling every week. That's not if you're doing a bad job; that's if you're doing a good job, because you're thinking, I've got to stay on top of at least some of this stuff so that I don't fall behind. And in many cases model APIs actually change the model under the hood even though you're using the same name, so you're forced to scramble. Actually, I would say maybe the question isn't whether you will scramble every week; maybe a different question is whether you will even get to scramble for long. If you think about the rate of progress of these LLMs, are they going to eat your lunch? These are, I think, questions on a lot of people's minds,
and this is what the talk is going to address.

So the talk mentions the bitter lesson, which sounds like really ancient AI lore but is just six years old. This year's Turing Award winner, Rich Sutton, who's a pioneer of reinforcement learning, wrote a short essay on his website that basically says 70 years of AI have taught him, and other people in the AI community, from his perspective, that when AI researchers leverage domain knowledge to solve problems, chess for example, we build complicated methods that essentially don't scale, we get stuck, and we get beaten by methods that leverage scale a lot better. What seems to work better, according to Sutton, is general methods that scale, and he identifies search, which is not retrieval but more like exploring large spaces, and learning, getting the system to understand its environment, as what work best. Search here is what we'd call, in LLM land, maybe inference-time scaling or something. I don't speak for Sutton, and I'm not suggesting that I have the right understanding of what he's saying, or that I necessarily agree or disagree, but I think this is just a fundamental and important concept in this space.
I think it raises interesting questions for us as people who build and engineer AI systems, because if leveraging domain knowledge is bad, what exactly is AI engineering supposed to be about? I mean, engineering is understanding your domain and working in it with a lot of human ingenuity, in repeatable ways, let's say, or with principles. So are we just doomed? Were we just wasting our time? Why are we at an AI engineering fair? I'll tell you how I resolve this. I've not really seen a lot of people discuss what Sutton is actually talking about, and a lot of people throw the bitter lesson around, so clearly somebody has to think about this. Sutton is talking about maximizing intelligence, which is something like the ability to figure things out in a new environment really fast, let's say. All of us probably care about that to some degree;
I'm also an AI researcher. But when we're building AI systems, I think it's important to remember that the reason we build software is not that we lack AGI. The way to understand this is that we already have general intelligences everywhere; we have eight billion of them. They're unreliable, because that's what intelligence is, and they've not solved the problems that we want to solve with software. That's why we're building software. We program software not because we lack AGI but because we want reliable, robust, controllable, scalable systems, and we want these to be things we can reason about and understand at scale. And actually, if you think about engineering and reliable systems, if you think about checks and balances, any time you try to systematize stuff, it's about subtracting agency and subtracting intelligence in exactly the right places, carefully, and not restricting the intelligence otherwise.
So this is a very different axis from the kinds of lessons that you would draw from the bitter lesson. That does not mean the bitter lesson is irrelevant; let me tell you the precise way in which it's relevant. The first takeaway here is that scaling search and learning works best for intelligence. That is the right thing to do if you're an AI researcher interested in building agents that learn really well, really fast, in new environments: don't hard-code stuff at all, unless you really have to. But in building AI systems, it's helpful to ask: sure, search and learning, but searching for what? What is your AI system even supposed to be doing? What is the fundamental problem that you're solving? It's not intelligence; it's something else. And what are you learning for? What is the system learning in order to do well? That is what you need to be engineering, not the specifics of search and not the specifics of learning, as I'll talk about in the rest of this talk.
So Sutton is saying that complicated methods get in the way of scaling, especially if you apply them early, before you know what you're doing, essentially. Did we hear that before? I feel like I heard it back in the 1970s, although I wasn't around: this is the era of structured programming, with Knuth writing his popular phrase in a paper that premature optimization is the root of all evil. I think this is the bitter lesson for software, and thereby also for AI software. It's not that human ingenuity and human knowledge of the domain are harmful; it's that when you apply them prematurely, in ways that constrain your system, in ways that reflect poor understanding, they're bad. But you can't get away, in an engineering field, with not engineering your system; that's just quitting, or something.
So here's a little piece of code. If you follow me on X, on Twitter, you might recognize it, but otherwise I think it looks pretty opaque. Even for me, in three seconds I can't really look at this and tell exactly what it's doing, and I also honestly don't really care.
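For a concrete sense of what such code looks like, here is a sketch in the same spirit as the slide (which isn't reproduced in this transcript): the well-known fast inverse square root bit trick, written in Python purely for illustration. The magic constant and the bit reinterpretation only make sense for one particular 32-bit float layout, which is exactly the point made next.

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) by abusing the bit layout of a 32-bit IEEE 754 float."""
    # Reinterpret the float's bits as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # A magic constant and shift that are only valid for this exact representation.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step to refine the guess.
    return y * (1.5 - 0.5 * x * y * y)

print(fast_inv_sqrt(4.0))  # roughly 0.5, i.e. 1/sqrt(4)
```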
So, lo and behold, this is computing a square root in a certain floating-point representation on an old machine, and the thing that jumps out at me immediately is that this is not the most future-proof program possible. If you change the machine architecture, different floating-point representations, better CPUs, first of all it'll be wrong, because it's just hard-coding some values here, and second of all it'll probably be slower than a normal square root, which maybe is a single instruction, or maybe the compiler has a really smart way of doing it, or a lot of other things could be optimized for you. Whoever wrote this, maybe they had a good reason, maybe they didn't, but certainly if you're writing this kind of thing often, you're probably messing up as an engineer. So premature optimization is maybe the square root of all evil, or something.
But what counts as premature? That's kind of the name of the game; we could just say 'avoid premature optimization,' but by itself that doesn't mean anything. I don't think any strategy is guaranteed to work in tech; nobody can anticipate what will happen in three years, five years, ten years. But I think you still have to have a conceptual model that you're working off of, and I happen to have built two things that are on the order of several years old, that have fundamentally stayed the same over the years, from the days of BERT and text-davinci-002 up to o4-mini, and they're bigger now than they ever were. They're these stable, fundamental abstractions for AI systems around LLMs. So what gives? What does it take for something like ColBERT or something like DSPy to exist in this ecosystem and endure a few years, which is like centuries in LLM land? I'll try to reflect on this, and again, none of this is guaranteed to last forever.
So here's my hypothesis: premature optimization is what happens whenever you hard-code stuff at a lower level of abstraction than you can justify. If you want a square root, please just say 'give me a square root'; don't start doing random bit shifts and bit manipulation that happen to appease your particular machine today. And actually, take a step back: do you even want a square root, or are you computing something even more general, and is there a way you could express that more general thing? Only stoop down to the lower level of abstraction once you've demonstrated that the higher level of abstraction is not good enough.
I think the bigger picture here is that applied machine learning, and definitely prompt engineering, has a huge issue here. Tighter coupling than necessary is known to be bad in software, but it's not really something we talk about when we're building machine learning systems. In fact, the name of the game in machine learning is usually: hey, this latest thing came out, let's rewrite everything so that we're working around that specific thing. I tweeted about this a year ago, 13 months ago, last May in 2024, saying that the bitter lesson is just an artifact of lacking good high-level ML abstractions. Scaling deep learning helps predictably, but after every paradigm shift the best systems always include modular specializations, because we're trying to build software and we need those, and every time they basically look the same. They should have been reusable, but they're not, because we're writing bad code. Here's a nice example, just to demonstrate this; it's not special at all. Here's a 2006 paper whose title could have come from a paper today: 'A Modular Approach for Multilingual Question Answering.'
And here's the system architecture. It looks like your favorite multi-agent framework today: it has an execution manager, it has question analyzers and retrieval strategies over a bunch of corpora, and it's a figure that, if you colored it in, you'd think came from a paper last year or something. Now here's the problem. It's a pretty figure, and architecturally the system is actually not that wrong. I'm not saying it's the perfect architecture, but in a normal software environment you could just upgrade the machine, put it on new hardware, put it on a new operating system, and it would just work, and actually work reasonably well, because the architecture is not that bad. But we know that's not the case for these ML architectures, because they're not expressed in the right way.
So I think, fundamentally, I can express this most passionately against prompts. A prompt is a horrible abstraction for programming, and this needs to be fixed ASAP. I say for programming because it's actually not a horrible abstraction for management: if you want to manage an employee or an agent, a prompt is reasonable, it's a Slack channel to a remote employee. And if you want to be a pet trainer, working with tensors and objectives is a great way to iterate; that's how we build the models. But I want us to be able to also engineer AI systems, and I think for engineering and programming a prompt is a horrible abstraction. Here's why. It's a stringly typed canvas, just a big blurb, no structure whatsoever, even if structure actually exists there in a latent way. It couples and entangles the fundamental task definition you want to state, which is the really important stuff, the thing you're actually engineering, with random, overfitted, half-baked decisions: hey, this LLM responded when I talked to it this way, or I put in this example to demonstrate my point and it kind of clicked for this model, so I'll just keep it in. And there's no way to tell the difference between the fundamental thing you're solving and the random trick you applied. It's like the square root thing, except you don't call it a square root, and we just have to stare at it and go, wait, why are we shifting left by five bits, or something?
You're also taking the inference-time strategy, which is changing every few weeks, with people proposing stuff all the time, and baking it, literally entangling it, into your system. If it's an agent, your prompt is telling it it's an agent; your system has no business knowing whether it's an agent or a reasoning system or whatever. What are you actually trying to solve? It's as if, while writing a square root function, you were also specifying the layout of the structs in memory or something. You're also talking about formatting and parsing things: write XML, produce JSON, whatever. Again, that's really none of your business most of the time. You want to write a human-readable spec, but you're saying things like 'do not ignore this,' 'generate XML,' 'answer in JSON,' 'you are Professor Einstein, a wise expert in the field,' 'I'll tip you a thousand dollars.' That is just not engineering, guys.
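To make the entanglement concrete, here is a hypothetical prompt blob (invented for illustration, not from the talk), with each fragment labeled by the concern it actually belongs to. Everything lives in one untyped string, so none of the pieces can be identified, swapped, or optimized independently.

```python
# A hypothetical, entangled prompt (illustrative only).
PROMPT = (
    "You are Professor Einstein, a wise expert in the field. "    # persona trick, overfitted to one model
    "I'll tip you $1000 for a great answer. "                     # incentive trick, not part of the task
    "Think step by step before answering. "                       # inference-time strategy (chain of thought)
    "Given the context below, answer the user's question. "       # the actual task definition
    "Do NOT ignore these instructions. "                          # reliability duct tape
    "Respond ONLY in JSON with keys 'answer' and 'confidence'. "  # formatting and parsing concerns
    "Context: {context}\nQuestion: {question}"                    # information flow (the real inputs)
)
```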
So what should we do? Trusty old separation of concerns, I think, is the answer. Your job as an engineer is to invest in your actual system design, starting with the spec. The spec, unfortunately or fortunately, cannot be reduced to one thing, and this is the point where I'll mention evals; I know everyone here hears about evals, so this is my one line about them, even though we talk about evals a lot of the time. You want to invest in natural-language descriptions, because that is the power of this new paradigm. Natural-language definitions are not prompts; they are highly localized pieces of inherently ambiguous stuff that could not have been said in any other way. I can't tell the system certain things except in English, so I'll say them in English. But a lot of the time what I'm actually doing is iterating to appease a certain model and make it perform well relative to some criteria I have, without telling it the criteria, just tinkering. Evals are the way to handle that, because evals say: here's what I actually care about. Change the model, and the evals are still what I care about; they're the fundamental thing. Now, evals are not for everything. If you try to use evals to define the core behavior of your system, it won't work well: induction, learning from data, is a lot harder than following instructions. So you need to have both.
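As a rough illustration of 'evals say what I actually care about,' here is a minimal sketch in DSPy-flavored Python; the example data, field names, and the exact evaluation call are assumptions for illustration rather than a definitive recipe.

```python
import dspy

# Hypothetical labeled examples; the field names are placeholders.
devset = [
    dspy.Example(question="Who wrote 'The Bitter Lesson'?", answer="Rich Sutton").with_inputs("question"),
    # ... more examples ...
]

def answer_match(example, prediction, trace=None):
    # The metric encodes what we care about, independently of which model or prompt we use.
    return example.answer.strip().lower() in prediction.answer.strip().lower()

# Assuming DSPy's evaluation harness: swap the model or the program,
# and the metric (what we care about) stays fixed.
evaluate = dspy.Evaluate(devset=devset, metric=answer_match, display_progress=True)
# score = evaluate(my_program)
```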
Code is another thing you need. A lot of people say you can just ask the model to do the thing, but who's going to define the tools? Who's going to define the structure? How do you handle information flow, so that things that are private don't flow to the wrong places? You need to control these things. How do you apply function composition? LLMs are horrible at composition, because neural networks essentially don't learn things that reliably, whereas function composition in software is basically always perfectly reliable, by construction. So a lot of things are often best delegated to code. But it's hard, and it's really important, to be able to juggle and combine these things, and you need a canvas that allows you to combine them well. The criterion for a good canvas is that it should let you express those three things in a way that's highly streamlined, and in a way that is decoupled and not entangled with models that are changing; I should just be able to hot-swap models. Decoupled from inference strategies that are changing: hey, I want to switch from chain of thought to an agent, I want to switch from an agent to Monte Carlo tree search, whatever the latest thing that has come out is; I should be able to just do that. And decoupled from new learning algorithms. This is really important: we talked about learning, but learning is always happening at the level of your entire system, if you're engineering it, or at least you've got to be thinking about it that way, where you're saying, I want the whole thing to work as a whole for my problem, not for some general default. That's what the evals here are going to be doing, and you want a way of expressing all this that allows you to do reinforcement learning, but also prompt optimization, but also any of these things, at the level of abstraction that you're actually working with.
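Here is a rough sketch of what that hot-swapping can look like in DSPy; the model names and the choice of modules are placeholders, and the inline 'question -> answer' signature is just a minimal stand-in for a real task definition.

```python
import dspy

# Swap the model: the rest of the program does not change.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model name
# dspy.configure(lm=dspy.LM("<another-provider/model>"))

# Swap the inference strategy around the same declared task.
qa = dspy.Predict("question -> answer")         # a single plain call
qa = dspy.ChainOfThought("question -> answer")  # add intermediate reasoning
# qa = dspy.ReAct("question -> answer", tools=[my_search_tool])  # an agentic loop with tools

result = qa(question="What does the bitter lesson say?")
print(result.answer)
```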
So the second takeaway is that you should invest in defining the things specific to your AI system, and decouple them from the lower-level swappable pieces, because those will expire faster than ever. I'll just conclude by telling you that we've built, and have been building for three years, the DSPy framework, which is the only framework that actually decouples your job, which is writing your AI software, from our job, which is giving you powerful, evolving toolkits for learning and for search, which is scaling, and for swapping LLMs through adapters. There's only one concept you have to learn. It is a new, first-class concept, which we call signatures; if you learn it, you've learned DSPy. I'll unfortunately have to skip the details because of time, for the other speakers.
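Since the details are skipped here for time, this is a minimal sketch of what a DSPy signature can look like; the task and field names are made up for illustration (see dspy.ai for the actual documentation).

```python
import dspy

class TriageTicket(dspy.Signature):
    """Classify a customer support ticket and draft a short reply."""
    ticket: str = dspy.InputField(desc="raw text of the incoming ticket")
    category: str = dspy.OutputField(desc="one of: billing, bug, feature_request, other")
    reply: str = dspy.OutputField(desc="a brief, polite draft response")

# The signature declares what the step does; the module decides how.
triage = dspy.ChainOfThought(TriageTicket)
# out = triage(ticket="I was charged twice this month...")
# print(out.category, out.reply)
```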
But let me give you a summary. I can't predict the future; I'm not telling you that if you do this, the code you write tomorrow will be there forever. But I'm telling you the least you can do. This is not the top level; it's just the baseline, I would say: avoid hand-engineering at lower levels than today allows you to. That's the big lesson from the bitter lesson and from premature optimization being the root of all evil. Among your safest bets, and they could turn out to be wrong, I don't know: models are not, anytime soon, going to read specs off of your mind, I don't know if we'll ever figure that out, and they're not going to magically collect all the structure and tools specific to your application. So that's clearly stuff you should invest in when you're building a system. Invest in the signatures, which again you can learn about on the DSPy site, dspy.ai. Invest in essential control flow and tools. And invest in evals for the things that you would otherwise be iterating on by hand. Then ride the wave of swappable models, ride the wave of the modules we build, you just swap them in and out, and ride the wave of optimizers, which can do things like reinforcement learning or prompt optimization for whatever application it is that you've built.
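As a closing sketch of what 'ride the wave of optimizers' can mean in practice, here is a DSPy-flavored example; the choice of optimizer, its arguments, and the metric are assumptions for illustration, not the talk's prescription.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot  # one of several interchangeable optimizers

def metric(example, prediction, trace=None):
    # Whatever "good" means for your application; a placeholder check here.
    return example.answer.lower() in prediction.answer.lower()

optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)

# 'program' is any DSPy module (for instance the ChainOfThought step sketched earlier)
# and 'trainset' is a list of dspy.Example objects.
# compiled_program = optimizer.compile(program, trainset=trainset)
# Swap the LM, the modules, or the optimizer (e.g. a prompt optimizer or an RL-based one)
# and simply re-compile; the signatures, code, and metric stay the same.
```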
All right, thank you, everyone.