
On Engineering AI Systems that Endure the Bitter Lesson - Omar Khattab, DSPy & Databricks


Chapters

0:00 AI Engineer World's Fair
0:22 On Engineering AI Systems that Endure the Bitter Lesson
0:32 The Challenges of AI Software Engineering
0:40 The Bitter Lesson
4:50 AI Engineering's Purpose
6:39 Takeaway 1: Engineering for Scalability
7:19 Premature Optimization
12:18 The Problem with Prompts
14:26 Trusty Old Separation of Concerns
17:11 Takeaway 2: Invest in Decoupling
17:21 The Pyramid of LLM Software and DSPy
17:45 The DSPy Concept: Declarative Signatures

Whisper Transcript

00:00:00.000 | So thanks everyone for showing up, and thanks to the organizers for inviting me
00:00:19.800 | and having me here. I'm excited to talk to you all about engineering AI systems
00:00:25.980 | that endure the bitter lesson. So I'm Omar, I guess the intro has already happened so
00:00:31.560 | let's not repeat that. So I mean if you're here I think it's probably because you
00:00:36.300 | engineer what we might call AI software, or maybe you manage or work
00:00:41.220 | with people that do. It's not a term that's been used in this special way for
00:00:46.620 | very long, so we're all kind of trying to figure out
00:00:51.720 | what are the right sort of basics and fundamentals here and what are the things
00:00:55.080 | that are fleeting. So this is what the talk will be largely about and you know
00:01:00.360 | like the name of the game is it's kind of a meme at this point every week there's a
00:01:04.320 | new large language model maybe every week is actually too slow at this point that
00:01:08.160 | actually changes something in terms of the trade-offs you can strike. It might
00:01:12.900 | not be the state-of-the-art in terms of the best quality necessarily although
00:01:15.960 | sometimes it is but maybe it's the best performance for certain costs or it's the
00:01:21.420 | best performance for certain types of applications or maybe it's the you know
00:01:24.900 | the speed that's really incredible. We've seen things like the
00:01:27.660 | diffusion LLMs now. So every week there's a new LLM that you kind of have to think
00:01:32.480 | about if you're engineering software in this space which is really unusual like if
00:01:35.760 | you think back to normal software engineering you change your hardware every
00:01:39.200 | two three years maybe if that so this is pretty unusual. The other part that's
00:01:45.380 | actually also a little bit weirder is, if you are lucky, the LLM provider has
00:01:50.720 | recognized that they're not really building these LLMs, they're training them;
00:01:55.340 | the models emerge based on a lot of nudging and data and iterating on a lot of evals
00:02:01.460 | and a lot of vibes as well, and they have realized, you know, if you're lucky, that
00:02:06.020 | there are new quirks in their latest models that weren't there before and to
00:02:09.680 | the surprise of many people, to this day you still get longer and
00:02:12.980 | longer prompting guides for the latest models that are supposed to be you know
00:02:16.520 | closer and closer to AGI and if you're less lucky you have to figure that out on
00:02:21.200 | your own right if you're even less lucky the prompting guides from the
00:02:25.020 | provider are not even that good, so you have to actually kind of figure out what
00:02:28.140 | the quirks are yourself. And every day, maybe at an even faster pace, someone is releasing an
00:02:33.300 | arXiv paper or a tweet or something that introduces a new learning algorithm,
00:02:37.600 | maybe some reinforcement learning bells and whistles, maybe some prompting tricks,
00:02:41.520 | maybe a prompt optimization you know technique something or the other that
00:02:45.780 | promises to make your system learn better and sort of fit your goals better
00:02:49.740 | someone else is introducing some search or scaling or inference strategies or
00:02:54.120 | agent frameworks or agent architectures that are promising to finally
00:02:58.020 | unlock levels of reliability or quality better than what you had before and I
00:03:02.280 | think if you're actually doing a reasonable job now most likely you're
00:03:06.000 | scrambling every week that's not if you're doing a bad job that's if you're
00:03:08.280 | doing a good job right because you're like you know I've got to stay on top of at
00:03:12.000 | least some of this stuff so that like I don't fall behind and in many cases like
00:03:16.560 | you know model APIs actually change the model under the hood even though you know
00:03:20.880 | you're using the same name so it's actually you're forced to scramble and
00:03:25.440 | actually I would say maybe the question isn't whether you will scramble every
00:03:28.560 | week and maybe a different question is will you even get to scramble for long if
00:03:32.880 | you think about the rate of progress of these LLMs like are they gonna eat your
00:03:35.700 | lunch right so these are I think questions that are on a lot of people's minds and
00:03:39.540 | this is what the talk is going to be addressing. So the talk mentions the
00:03:43.080 | bitter lesson, which sounds like this really ancient old kind of AI
00:03:47.700 | lore, but it's just six years old, where this year's Turing Award
00:03:52.600 | winner, Rich Sutton, who's a pioneer of reinforcement learning, wrote this short
00:03:56.460 | essay on his website basically that says 70 years of AI has taught him and taught
00:04:01.940 | you know other people in the AI community from his perspective that when AI
00:04:05.760 | researchers leverage domain knowledge to solve problems like I don't know chess or
00:04:09.900 | something we build complicated methods that essentially don't scale and we get
00:04:13.940 | stuck, and we get beat by methods that leverage scale a lot better. What seems
00:04:19.340 | to work better, according to Sutton, is general methods that scale, and he
00:04:23.880 | identifies search, which is not like retrieval, more like, you know, exploring
00:04:28.160 | large spaces, and learning, so getting the system to kind of understand its
00:04:32.540 | environment, as what works best. And search here is what we'd call in
00:04:37.620 | LLM land maybe inference-time scaling or something. So I don't speak for Sutton
00:04:41.280 | and I'm not you know suggesting that I have the right understanding of what he's
00:04:44.700 | saying or that I necessarily agree or disagree, but I think this is just a
00:04:47.380 | fundamental and important kind of concept in this space. So I think it
00:04:53.440 | raises interesting questions for us as people who build you know engineer AI
00:04:57.120 | systems because if leveraging domain knowledge is bad what exactly is AI
00:05:01.740 | engineering supposed to be about I mean engineering is understanding your
00:05:04.400 | domain and working in it with a lot of human ingenuity in repeatable ways let's
00:05:08.540 | say, or with principles. So like, are we just doomed? Are we just wasting our
00:05:12.140 | time? Why are we at an AI engineering fair? And I'll tell you how to
00:05:17.240 | resolve this. I've not really seen a lot of people discuss what Sutton is talking
00:05:20.660 | about, and a lot of people throw the bitter lesson around, so clearly
00:05:23.160 | somebody has to think about this, right? Sutton is talking about maximizing
00:05:26.420 | intelligence, which is something like the ability to figure things out in a new
00:05:30.080 | environment really fast, let's say. All of us probably
00:05:32.300 | care about this to some
00:05:36.320 | degree I'm also an AI researcher but when we're building AI systems I think it's
00:05:41.440 | important to remember that the reason we build software is not that we lack AGI
00:05:45.200 | we build software, you know, and the reason for this, and the way to kind of
00:05:48.860 | understand this, is we already have general intelligences everywhere, we have eight
00:05:52.760 | billion of them, they're unreliable because that's what intelligence is, and
00:05:56.980 | they've not solved the problems that we want to solve with software that's why
00:05:59.560 | we're building software so we program software not because we lack AGI but
00:06:04.580 | because we want reliable robust controllable scalable systems and we want
00:06:10.900 | these to be things that we can reason about and understand at scale. And
00:06:15.320 | actually if you think about engineering reliable systems, if you think about
00:06:17.880 | checks and balances in any case where you try to systematize stuff it's about
00:06:21.520 | subtracting agency and subtracting intelligence in exactly the right places
00:06:25.700 | carefully and not restricting the intelligence
00:06:28.720 | otherwise so this is a very different axis from the kinds of lessons that you
00:06:33.080 | would draw on from the bitter lesson now that does not mean the bitter lesson is
00:06:36.440 | irrelevant let me tell you the precise way in which it's relevant so the first
00:06:40.280 | takeaway here is that scaling search and learning works best for intelligence this
00:06:44.480 | is the right thing to do if you're an AI researcher interested in building you know
00:06:47.660 | agents that learn really well, really fast, in new environments, right? Don't hard code stuff at all, unless you really have to. But in building AI
00:06:55.600 | systems it's helpful to think about well sure search and learning but
00:06:59.920 | searching for what right like what is your AI system even supposed to be doing
00:07:03.880 | what is the the fundamental problem that you're solving it's not intelligence it's
00:07:07.660 | something else and what are you learning for right like what is the system
00:07:11.600 | learning in order to do well and that is what you need to be engineering not the
00:07:16.000 | specifics of search and not the specifics of learning as I'll talk about in
00:07:19.100 | the rest of this talk so he's saying Sutton is saying complicated methods get in the
00:07:25.400 | way of scaling especially if you do it early like before you know what you're
00:07:29.420 | doing essentially. Did we hear that before? I feel like I heard that back in the
00:07:32.880 | 1970s, although I wasn't around. This is, you know, the notion of structured
00:07:36.440 | programming, with Knuth saying his popular phrase in a paper:
00:07:41.540 | premature optimization is the root of all evil I think this is the bitter lesson for
00:07:47.300 | software, and thereby also for AI software. So human ingenuity and human
00:07:53.660 | knowledge of the domain, it's not that they're harmful, it's that when you apply them
00:07:57.740 | prematurely, in ways that constrain your system, in ways that reflect poor
00:08:01.580 | understanding, they're bad. But you can't get away in an engineering field with not
00:08:05.840 | engineering your system like you're just quitting or something right so here's a
00:08:10.060 | little piece of code if you follow me on X on Twitter you might recognize it but
00:08:14.240 | otherwise I think it looks pretty opaque to me in like three seconds and I can't
00:08:18.440 | really look at this and tell exactly what it's doing and I also honestly
00:08:20.600 | don't really care so lo and behold this is computing a square root in a certain
00:08:26.300 | floating point representation on an old machine and I think the thing that jumps out at
00:08:30.020 | me immediately is this is not the most future-proof program possible if you
00:08:34.400 | change the machine architecture different floating point representations better CPUs
00:08:38.600 | first of all it'll be wrong because you know like it's just hard coding some
00:08:41.860 | values here and second of all it'll probably be slower than a normal, you know,
00:08:45.980 | square root, which maybe is a single instruction or maybe the compiler has a
00:08:49.700 | really smart way of doing it or you know a lot of other things that could be
00:08:52.640 | optimized for you right so someone who wrote this maybe they had a good reason
00:08:56.760 | maybe they didn't but certainly if you're writing this kind of thing often you're
00:09:00.400 | probably messing up as an engineer so premature optimization is maybe the
00:09:05.800 | square root of all evil or something but what counts as premature like I mean
00:09:12.880 | that's kind of the name of the game, right? We could just say that, but it
00:09:15.220 | doesn't mean anything by itself. So I don't think any strategy will work; in tech nobody can
00:09:19.780 | anticipate what will happen in three years five years ten years but I think you
00:09:23.440 | still have to have a conceptual model that you're working off of and I happen
00:09:27.520 | to have built two things that are you know on the order of several years old
00:09:31.060 | that have fundamentally stayed the same over the years, from the days of BERT and
00:09:35.200 | text-davinci-002 up to o4-mini, and they're bigger now than they ever were, and they're
00:09:39.280 | sort of like these stable, fundamental kind of abstractions or AI systems
00:09:44.620 | around LLMs. So what gives? What happens in order to get something like ColBERT or
00:09:50.220 | something like DSPy in this ecosystem and sort of endure a few years, which is like, you
00:09:55.260 | know, centuries in AI land? I'll try to reflect on this, and, you know, again, none of this is
00:10:01.100 | guaranteed to be something that lasts forever so here's my hypothesis premature optimization
00:10:05.600 | is what is happening whenever you're hard coding stuff at a lower level of abstraction
00:10:11.020 | than you can justify. If you want a square root, please just say give me
00:10:16.120 | a square root don't start doing random bit shifts and bit stuff like you know
00:10:20.220 | bit manipulation that happens to appease your particular machine today but
00:10:25.180 | actually take a step back do you even want a square root or are you computing
00:10:28.060 | something even more general and is there a way you could express that thing that is
00:10:31.420 | more general, right? And only, you know, stoop down or go down to a lower level of
00:10:36.060 | abstraction if you've demonstrated that the higher level of
00:10:39.340 | abstraction is not good enough.
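A minimal sketch of that contrast in Python, assuming IEEE-754 doubles; the bit-level version is illustrative only (it is not the code from the slide) and stands in for the kind of machine-specific trick being described:

```python
import math
import struct

def sqrt_bit_hack(x: float) -> float:
    """Approximate sqrt(x) by poking at the bit representation of a 64-bit float.

    This is the kind of hard-coded, format-specific trick the talk warns about:
    the magic constant assumes IEEE-754 double precision and is meaningless
    under any other representation.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits = (bits >> 1) + 0x1FF8000000000000   # halve the exponent, restore the bias
    y = struct.unpack("<d", struct.pack("<Q", bits))[0]
    return 0.5 * (y + x / y)                  # one Newton-Raphson step to refine

def sqrt_declarative(x: float) -> float:
    """Say what you want; let the platform pick the fastest correct implementation."""
    return math.sqrt(x)

print(sqrt_bit_hack(2.0), sqrt_declarative(2.0))  # ~1.4167 vs 1.41421...
```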
00:10:44.140 | So I think the bigger picture here is that applied machine learning, and definitely prompt engineering, has a huge issue here.
00:10:49.100 | Tighter coupling than necessary is known to be bad in software, but it's not really something
00:10:53.100 | we talk about when we're building machine learning systems in fact the name of the game in machine
00:10:56.700 | learning is usually like hey this latest thing came out let's rewrite everything so that we're
00:11:00.940 | working around that specific thing. And I tweeted about this a year ago, 13 months ago, last May in
00:11:05.900 | 2024, saying the bitter lesson is just an artifact of lacking good high-level ML abstractions.
00:11:12.140 | Scaling deep learning helps predictably, but after every paradigm shift the best
00:11:17.980 | systems always include modular specializations because we're trying to build software we need those
00:11:21.980 | and every time they basically look the same, and they should have been reusable, but they're not,
00:11:26.140 | not because we're writing code, but because we're writing bad code. So here's a nice example just to demonstrate
00:11:31.180 | this. It's not special at all. Here's a 2006 paper whose title could really have been from a paper today, right:
00:11:35.980 | a modular approach for multilingual question answering and here's the system architecture it looks like
00:11:40.540 | your favorite multi-agent framework today right it has an execution manager it has some question
00:11:45.100 | analyzers and retrieval strategies over a bunch of corpora, and it's the kind of figure where, if you
00:11:50.380 | recolored it, you would think it's a paper maybe from last year or something. Now here's the problem:
00:11:55.260 | it's a pretty figure the system architecturally is actually not that wrong i'm not saying it's the
00:12:00.140 | perfect architecture but in a normal software environment you could actually just upgrade the
00:12:05.020 | machine, right? Put it on new hardware, put it on a new operating system, and it would just work, and
00:12:09.660 | actually work reasonably well because the architecture is not that bad but we know that that's not the
00:12:13.980 | case for these ml sort of architectures because they're not expressed in the right way so i think
00:12:19.580 | fundamentally i can express this most passionately against prompts a prompt is a
00:12:25.100 | horrible abstraction for programming and this needs to be fixed asap i say for programming because
00:12:30.460 | it's actually not a horrible one for management if you want to manage an employee or an agent a
00:12:34.940 | prompt is reasonable, it's kind of like a Slack channel where you have a remote employee. If you want to be a pet
00:12:40.060 | trainer, you know, working with tensors and objectives is a great way to iterate; that's how we
00:12:44.940 | build the models but i want us to be able to also engineer ai systems and i think for engineering and
00:12:50.220 | programming a prompt is a horrible abstraction here's why it's a stringly typed canvas just a big blurb no
00:12:56.140 | structure whatsoever even if structure actually exists in a latent way that couples and entangles
00:13:02.220 | the fundamental task definition you want to say which is really important stuff this is what you're
00:13:07.020 | engineering, with some random overfitted, half-baked decisions about hey, this LLM responded to this
00:13:13.900 | language when I talk to it this way, or I put in this example to demonstrate my point and it kind
00:13:20.060 | of clicked for this model so i'll just keep it in and there's no way to really tell the difference
00:13:23.500 | what was the fundamental thing you're solving and like you know what was the random uh trick you applied
00:13:28.140 | it's like a square root thing except you don't call it a square root and we just have to stare at it and
00:13:32.060 | be like wait, why are we shifting to the left by five bits or something. You're also baking
00:13:37.980 | in the inference-time strategy, which is changing every few weeks, or people are proposing stuff
00:13:42.220 | all the time, and you're literally entangling it into your system. So if it's an agent, your prompt
00:13:47.500 | is telling it it's an agent; your system has no business knowing about the fact that it's an agent or a
00:13:53.020 | reasoning system or whatever what are you actually trying to solve right if it's like if you're writing
00:13:56.700 | a square root function and then you're like hey here's the layout of the structs in memory or something
00:14:01.820 | You're also talking about formatting and parsing things, you know, write XML, produce JSON, whatever;
00:14:08.540 | again, that's really none of your business most of the time. So you want to write a human-readable spec,
00:14:13.820 | but you're saying things like: do not ignore this, generate XML, answer in JSON, you are Professor
00:14:19.020 | Einstein, a wise expert in the field, I'll tip you a thousand dollars. Right? Like, that is just not
00:14:24.780 | engineering, guys.
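A hypothetical illustration of that entanglement; the task, the wording, and the tricks below are invented for the example and are not from the talk's slides:

```python
# A stringly typed canvas: one blurb entangling several very different concerns.
entangled_prompt = (
    # The fundamental task definition -- the part you are actually engineering:
    "Given a customer support email, classify its urgency and draft a reply. "
    # An inference-time strategy, hard-coded into the text:
    "Think step by step before answering. "
    # Formatting and parsing concerns, which are really none of your business here:
    "Respond ONLY in JSON. Do not ignore this. Do not wrap the JSON in markdown. "
    # Overfitted, half-baked tricks that happened to appease one particular model:
    "You are Professor Einstein, a wise expert in the field. "
    "I'll tip you a thousand dollars for a good answer."
)
# There is no way to tell, from the string alone, which parts are the task spec
# and which are accidents of one model, one format, or one strategy.
```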
00:14:31.740 | So what should we do? Trusty old separation of concerns, I think, is the answer. Your job as an engineer is to invest in your actual system design, and, you know, starting with the spec.
00:14:38.540 | the spec unfortunately or fortunately cannot be reduced to one thing and this is the time i'll
00:14:43.820 | talk about evals. I know everyone here is all about evals, so this is the one line about evals, even though we
00:14:47.500 | talk about evals a lot of the time. You want to invest in natural language descriptions, because that
00:14:53.900 | is the power of this new framework natural language definitions are not prompts they are highly localized
00:14:59.820 | pieces of ambiguous stuff that could not have been said in any other way, right? I can't tell the system
00:15:04.940 | certain things except in English, so I'll say it in English. But a lot of the time I'm actually iterating
00:15:09.820 | to appease a certain model and to make it perform well relative to some criteria I have, without
00:15:16.460 | telling it the criteria, just tinkering with things. Evals are the way to do this, because evals say
00:15:22.380 | here's what i actually care about change the model the evals are still what i care about it's a fundamental
00:15:26.620 | thing now evals are not for everything if you try to use evals to define the core behavior of your system
00:15:31.500 | you will not learn it well; induction, learning from data, is a lot harder than following instructions, right?
00:15:36.140 | So you need to have both. Code is another thing that you need. You know, a lot of people are like, oh,
00:15:41.420 | you just ask it to do the thing. Well, who's gonna define the tools, who's gonna
00:15:45.660 | define the structure how do you handle information flow like you know like things that are private should
00:15:50.220 | not flow in the wrong places right you need to control these things um how do you apply function
00:15:54.460 | composition? LLMs are horrible at composition, because neural networks kind of essentially don't learn things
00:15:59.420 | that reliably; function composition in software is always perfectly reliable, basically, by construction.
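A small sketch of that division of labor, assuming DSPy-style modules (the string signatures and field names here are made up): composition and information flow live in ordinary code, which is reliable by construction, while the LLM handles only the fuzzy steps.

```python
import dspy

# Illustrative string signatures; the point is the plain-Python composition around them.
summarize = dspy.ChainOfThought("document -> summary")
classify = dspy.Predict("summary -> category")

def route(document: str, contains_private_data: bool) -> str:
    """Compose two LLM steps, with code controlling what flows where."""
    summary = summarize(document=document).summary
    if contains_private_data:
        # Private material never reaches the second step; code enforces this, not a prompt.
        return "needs-human-review"
    return classify(summary=summary).category
```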
00:16:04.220 | So a lot of these things are often best delegated to code, right? But it's hard, and
00:16:11.260 | it's really important that you can actually juggle and combine these things, and you need a canvas that
00:16:15.820 | can allow you to combine these things well. When you do this, a good canvas, the
00:16:21.020 | criteria for a good canvas, is that it should allow you to express those
00:16:25.900 | three in a way that's highly streamlined and in a way that is decoupled and not entangled with models
00:16:32.220 | that are changing i should just be able to hot swap models uh inference strategies that are changing hey i
00:16:36.860 | want to switch from a chain of thought to an agent i want to switch from an agent to a monte carlo tree
00:16:41.260 | search whatever the latest thing that has come out is right i should be able to just do that um and new
00:16:45.580 | learning algorithms this is really important we talked about learning but learning uh is you know always
00:16:51.020 | happening at the level of your entire system if you're engineering it or at least you've got to be thinking about
00:16:54.700 | it that way where you're saying i want the whole thing to work as a whole for my problem not for
00:16:59.580 | some general default right so that's what the evals here are going to be doing and you want a way of
00:17:04.540 | expressing this that allows you to do reinforcement learning, but also allows you to do prompt optimization,
00:17:08.620 | but also allows you to do any of these things at the level of abstraction that you're actually working with.
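A minimal sketch of what that can look like in DSPy-style code, assuming the current DSPy API; the model name, the training example, and the metric are all made up for illustration:

```python
import dspy

# Swappable model; the identifier is illustrative.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# The program: an inference strategy (chain of thought) wrapped around a task
# signature, rather than baked into a prompt string.
program = dspy.ChainOfThought("question -> answer")

# The eval says what you actually care about; it survives model swaps.
def exact_match(example, prediction, trace=None):
    return example.answer.lower() == prediction.answer.lower()

# A tiny made-up trainset; in practice this is your own data.
trainset = [
    dspy.Example(question="Who wrote 'Structured Programming with go to Statements'?",
                 answer="Donald Knuth").with_inputs("question"),
]

# An optimizer (prompt optimization here; RL-style optimizers plug in the same way)
# tunes the whole program against the metric, at the level you actually work at.
optimizer = dspy.BootstrapFewShot(metric=exact_match)
compiled_program = optimizer.compile(program, trainset=trainset)
```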
00:17:11.900 | So the second takeaway is that you should invest in defining things specific to your AI system,
00:17:17.180 | and decouple from the lower-level swappable pieces, because they'll expire faster than ever.
00:17:23.340 | So I'll just conclude by telling you: we've built, and been building for three years, this DSPy framework,
00:17:28.140 | which is the only framework that actually decouples your job, which is writing the AI
00:17:35.100 | software, from our job, which is giving you powerful, evolving toolkits for learning and for search, which
00:17:41.420 | is scaling, and for swapping LMs through adapters. So there's only one concept you have to learn;
00:17:48.300 | it is a new concept, which we call signatures, a new first-class concept. If you learn it, you've
00:17:53.340 | learned DSPy. I'll unfortunately have to skip the details because of the time for the other speakers.
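Since the details are skipped for time, here is a minimal sketch of the signature concept, assuming the current DSPy API; the task, field names, and model identifier are invented for illustration:

```python
import dspy

class DraftReply(dspy.Signature):
    """Draft a polite, accurate reply to a customer support email."""
    # The docstring and field descriptions carry the natural-language part of the
    # spec; formatting, parsing, and model-specific phrasing are left to the framework.
    email: str = dspy.InputField(desc="the customer's message")
    reply: str = dspy.OutputField(desc="the drafted reply")

# The same declarative signature runs under different inference strategies, and
# the underlying LM can be hot-swapped without touching the task definition.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # illustrative model name
draft_simple = dspy.Predict(DraftReply)
draft_with_reasoning = dspy.ChainOfThought(DraftReply)

print(draft_with_reasoning(email="My order arrived broken.").reply)
```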
00:17:57.740 | But let me give you a summary. I can't predict the future; I'm not telling you that if you do this,
00:18:02.620 | the code you write tomorrow will be there forever. But I'm telling you the least you can do,
00:18:07.100 | and this is not the top level, it's just the baseline I would say,
00:18:11.740 | is avoid hand-engineering at lower levels than today allows you to, right? That's the big
00:18:17.740 | lesson from the bitter lesson and from premature optimization being the root of all evil. Among
00:18:22.620 | your safest bets, and they could turn out to be wrong, I don't know, is that models are not anytime soon gonna read
00:18:28.540 | specs off of your mind, I don't know if we'll figure that out, and they're not going to
00:18:33.740 | magically collect all the structure and tools specific to your application so that's clearly
00:18:38.060 | stuff you should invest in right when you're building a system invest in the signatures which
00:18:42.140 | again, you can learn about on the DSPy site, dspy.ai. Invest in essential control flow and tools,
00:18:48.620 | and invest in evals for things that you would otherwise be iterating on by hand. And ride the wave of
00:18:53.900 | swappable models, ride the wave of the modules we build, you just swap them in and out,
00:18:58.460 | and ride the wave of optimizers, which can do things like reinforcement learning or prompt optimization
00:19:02.620 | for any application that you've built. All right, thank you everyone.