
On Engineering AI Systems that Endure The Bitter Lesson - Omar Khattab, DSPy & Databricks


Chapters

0:00 AI Engineer World's Fair
0:22 On Engineering AI Systems that Endure the Bitter Lesson
0:32 The Challenges of AI Software Engineering
0:40 The Bitter Lesson
4:50 AI Engineering's Purpose
6:39 Takeaway 1: Engineering for Scalability
7:19 Premature Optimization
12:18 The Problem with Prompts
14:26 Trusty Old Separation of Concerns
17:11 Takeaway 2: Invest in Decoupling
17:21 The Pyramid of LLM Software and DSPy
17:45 The DSPy Concept: Declarative Signatures

Transcript

So thanks everyone for showing up, and thanks to the organizers for inviting me and having me here. I'm excited to talk to you all about engineering AI systems that endure the bitter lesson. I'm Omar; I guess the intro has already happened, so let's not repeat that. If you're here, I think it's probably because you engineer what we might call AI software, or maybe you manage or work with people who do.

It's not really a term that has been used as a special thing in this way for very long, so we're all trying to figure out what the right basics and fundamentals are here, and what the things are that are fleeting. That's what this talk will largely be about. The name of the game, and it's kind of a meme at this point, is that every week there's a new large language model; maybe every week is actually too slow at this point. Each one actually changes something in terms of the trade-offs you can strike.

It might not be the state of the art in terms of best quality necessarily, although sometimes it is, but maybe it's the best performance for a certain cost, or the best performance for certain types of applications, or maybe it's the speed that's really incredible. We've seen things like the diffusion models now.

So every week there's a new LLM that you have to think about if you're engineering software in this space, which is really unusual. If you think back to normal software engineering, you change your hardware every two or three years, maybe, if that. So this is pretty unusual.

The other part that's actually also a little bit weirder is this: if you are lucky, the LLM provider has recognized that they're not really building these LLMs, they're training them; the models emerge out of a lot of nudging and data and iterating on a lot of evals, and a lot of vibes as well. And they've realized, if you're lucky, that there are new quirks in their latest models that weren't there before. To the surprise of many people, to this day you still get longer and longer prompting guides for the latest models that are supposed to be closer and closer to AGI. If you're less lucky, you have to figure that out on your own, and if you're even less lucky, the prompting guides from the provider are not even that good, so you have to figure out what the model actually thinks by yourself.

And every day, maybe at an even faster pace, someone is releasing an arXiv paper or a tweet or something that introduces a new learning algorithm, maybe some reinforcement learning bells and whistles, maybe some prompting tricks, maybe a prompt optimization technique, something or other that promises to make your system learn better and fit your goals better. Someone else is introducing search or scaling or inference strategies, or agent frameworks or agent architectures, that promise to finally unlock levels of reliability or quality better than what you had before. If you're actually doing a reasonable job, most likely you're scrambling every week. That's not if you're doing a bad job; that's if you're doing a good job, because you're thinking, I've got to stay on top of at least some of this stuff so that I don't fall behind. And in many cases, model APIs actually change the model under the hood even though you're using the same name, so you're forced to scramble. Maybe the question isn't whether you will scramble every week; maybe a different question is whether you'll even get to scramble for long. If you think about the rate of progress of these LLMs, are they going to eat your lunch? These are questions that are on a lot of people's minds, and this is what the talk is going to be addressing.

So the talk mentions the bitter lesson, which sounds like really ancient AI lore but is just six years old. The current year's Turing Award winner, Rich Sutton, who's a pioneer of reinforcement learning, wrote a short essay on his website that basically says 70 years of AI have taught him, and other people in the AI community, from his perspective, that when AI researchers leverage domain knowledge to solve problems, like chess or something, we build complicated methods that essentially don't scale, we get stuck, and we get beaten by methods that leverage scale a lot better. What seems to work better, according to Sutton, are general methods that scale, and he identifies search, which is not retrieval but more like exploring large spaces, and learning, getting the system to understand its environment, as the things that work best. Search here is what we'd call, in LLM land, maybe inference-time scaling or something. I don't speak for Sutton, and I'm not suggesting that I have the right understanding of what he's saying, or that I necessarily agree or disagree, but I think this is a fundamental and important concept in this space.

It raises interesting questions for us as people who engineer AI systems, because if leveraging domain knowledge is bad, what exactly is AI engineering supposed to be about? Engineering is understanding your domain and working in it with a lot of human ingenuity, in repeatable ways, with principles. So are we just doomed? Were we just wasting our time? Why are we at an AI engineering fair? I'll tell you how to resolve this. I've not really seen a lot of people discuss what Sutton is actually talking about, and a lot of people throw the bitter lesson around, so clearly somebody has to think about this. Sutton is talking about maximizing intelligence, which is something like the ability to figure things out in a new environment really fast. All of us probably care about that to some degree, and I'm also an AI researcher. But when we're building AI systems, I think it's important to remember that the reason we build software is not that we lack AGI. We already have general intelligence everywhere; we have eight billion of them. They're unreliable, because that's what intelligence is, and they've not solved the problems that we want to solve with software; that's why we're building software. We program software not because we lack AGI but because we want reliable, robust, controllable, scalable systems, things we can reason about and understand at scale. And if you think about engineering reliable systems, about checks and balances, about any case where you try to systematize stuff, it's about subtracting agency and subtracting intelligence in exactly the right places, carefully, and not restricting the intelligence otherwise. That is a very different axis from the kinds of lessons you would draw from the bitter lesson.

Now, that does not mean the bitter lesson is irrelevant; let me tell you the precise way in which it's relevant. The first takeaway here is that scaling search and learning works best for intelligence. That is the right thing to do if you're an AI researcher interested in building agents that learn really well, really fast, in new environments: don't hard-code stuff at all, unless you really have to. But in building AI systems, it's helpful to think: sure, search and learning, but searching for what? What is your AI system even supposed to be doing? What is the fundamental problem you're solving? It's not intelligence; it's something else. And what are you learning for? What is the system learning in order to do well? That is what you need to be engineering, not the specifics of search and not the specifics of learning, as I'll talk about in the rest of this talk.

So Sutton is saying complicated methods get in the way of scaling, especially if you apply them early, before you know what you're doing. Did we hear that before? I feel like I heard that back in the 1970s, although I wasn't around: the notion of structured programming, with Knuth's popular phrase in a paper, "premature optimization is the root of all evil." I think this is the bitter lesson for software, and thereby also for AI software. It's not that human ingenuity and knowledge of the domain are harmful; it's that when you apply them prematurely, in ways that constrain your system and reflect poor understanding, they're bad. But you can't get away, in an engineering field, with not engineering your system; that's just quitting.
So here's a little piece of code. If you follow me on X, on Twitter, you might recognize it, but otherwise it looks pretty opaque; in three seconds I can't really look at it and tell exactly what it's doing, and honestly I don't really care. Lo and behold, this is computing a square root in a certain floating-point representation on an old machine. The thing that jumps out at me immediately is that this is not the most future-proof program possible. If you change the machine architecture, the floating-point representation, or get better CPUs, first of all it'll be wrong, because it's just hard-coding some values, and second of all it'll probably be slower than a normal square root, where maybe it's a single instruction, or maybe the compiler has a really smart way of doing it, or there are a lot of other things that could be optimized for you. Whoever wrote this maybe had a good reason, maybe they didn't, but certainly if you're writing this kind of thing often, you're probably messing up as an engineer.

So premature optimization is maybe the square root of all evil, or something. But what counts as premature? That's kind of the name of the game; if we could just say "don't do it prematurely," it wouldn't mean anything on its own. I don't think any strategy is guaranteed to work in tech; nobody can anticipate what will happen in three, five, ten years. But I think you still have to have a conceptual model that you're working off of. I happen to have built two things that are on the order of several years old and have fundamentally stayed the same over the years, from the days of BERT and text-davinci-002 up to o4-mini, and they're bigger now than they ever were. They're these stable, fundamental abstractions, or AI systems, around LLMs. So what gives? What does it take for something like ColBERT or something like DSPy to appear in this ecosystem and endure a few years, which is like centuries in AI land? I'll try to reflect on this, and again, none of this is guaranteed to last forever.

Here's my hypothesis: premature optimization is what happens when you hard-code stuff at a lower level of abstraction than you can justify. If you want a square root, please just say "give me a square root." Don't start doing random bit shifts and bit manipulation that happens to appease your particular machine today. And actually, take a step back: do you even want a square root, or are you computing something even more general, and is there a way you could express that more general thing? Only go down to the lower level of abstraction if you've demonstrated that the higher level is not good enough.
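The code from the slide isn't included in the transcript, but the kind of representation-specific trick being described can be sketched roughly like this in Python; this is an illustrative guess, not the actual slide code, and the magic constant is tied to the exact IEEE-754 double bit layout, which is the point.

    import struct

    def approx_sqrt(x: float) -> float:
        """Crude square root via bit manipulation of the IEEE-754 double layout,
        plus one Newton-Raphson step. Change the representation and it breaks."""
        i = struct.unpack("<Q", struct.pack("<d", x))[0]  # reinterpret the float's bits as a uint64
        i = (i >> 1) + 0x1FF8000000000000                 # magic constant: roughly halves the exponent
        y = struct.unpack("<d", struct.pack("<Q", i))[0]  # back to a float: a rough first guess
        return 0.5 * (y + x / y)                          # one Newton step sharpens the guess

    print(approx_sqrt(2.0))  # ~1.4167, vs math.sqrt(2) ~ 1.4142

It works, roughly, on exactly one representation; "give me a square root" would survive a change of machine, while this will not.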
I think the bigger picture here is that applied machine learning, and definitely prompt engineering, has a huge issue here. Tight coupling, or rather tighter coupling than necessary, is known to be bad in software, but it's not really something we talk about when we're building machine learning systems. In fact, the name of the game in machine learning is usually: hey, this latest thing came out, let's rewrite everything so that we're working around that specific thing. I tweeted about this a year ago, 13 months ago, last May in 2024, saying the bitter lesson is just an artifact of lacking good high-level ML abstractions. Scaling deep learning helps predictably, but after every paradigm shift the best systems always include modular specializations, because we're trying to build software and we need those, and every time they basically look the same. They should have been reusable, but they're not, because we're writing bad code.

Here's a nice example, just to demonstrate this; it's not special at all. Here's a 2006 paper whose title could really have been written now: a modular approach for multilingual question answering. And here's the system architecture. It looks like your favorite multi-agent framework today: it has an execution manager, it has question analyzers and retrieval strategies over a bunch of corpora, and it's the kind of figure that, if you colored it in, you would think came from a paper last year. Now here's the problem. It's a pretty figure, and the system architecture is actually not that wrong; I'm not saying it's the perfect architecture, but in a normal software environment you could just upgrade the machine, put it on new hardware, put it on a new operating system, and it would just work, and actually work reasonably well, because the architecture is not that bad. But we know that's not the case for these ML architectures, because they're not expressed in the right way.

I think I can express this most passionately against prompts: a prompt is a horrible abstraction for programming, and this needs to be fixed ASAP. I say for programming because it's actually not a horrible abstraction for management. If you want to manage an employee or an agent, a prompt is reasonable; it's a Slack channel with a remote employee. And if you want to be a pet trainer, working with tensors and objectives is a great way to iterate; that's how we build the models. But I want us to be able to also engineer AI systems, and for engineering and programming, a prompt is a horrible abstraction. Here's why.

It's a stringly typed canvas, just a big blurb, no structure whatsoever, even if structure actually exists in a latent way. It couples and entangles the fundamental task definition you want to state, which is the really important stuff, the thing you're actually engineering, with some random, overfitted, half-baked decisions: hey, this LLM responded when I talked to it this way, or I put in this example to demonstrate my point and it kind of clicked for this model, so I'll just keep it in. And there's no way to tell the difference between the fundamental thing you're solving and the random trick you applied. It's like the square root thing, except you don't even call it a square root; we just have to stare at it and wonder, wait, why are we shifting to the left by five bits?

You're also baking in the inference-time strategy, which is changing every few weeks, people are proposing stuff all the time, and you're literally entangling it into your system. If it's an agent, your prompt is telling it that it's an agent; your system has no business knowing that it's an agent or a reasoning system or whatever. What are you actually trying to solve? It's as if you're writing a square root function and then saying, hey, here's the layout of the structs in memory. You're also talking about formatting and parsing things, write XML, produce JSON, whatever; again, that's really none of your business most of the time. You want to write a human-readable spec, but you're saying things like "do not ignore this," "generate XML," "answer in JSON," "you are Professor Einstein, a wise expert in the field," "I'll tip you a thousand dollars." That is just not engineering, guys.
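To make that entanglement concrete, here's a small, made-up illustration (the refund-policy task and every line in it are hypothetical, not from the talk's slides): one string fusing the task definition with model-appeasing tricks, the inference strategy, and formatting concerns, with no way to tell which is which from the artifact itself.

    # A typical "prompt as program": several concerns fused into one untyped string.
    # Everything here is an illustration of the entanglement described above.
    prompt = (
        "Answer the user's question about our refund policy.\n"            # the fundamental task definition
        "For example, if asked about damaged items, offer a refund.\n"     # an example that once "clicked" for some model
        "You are an agent. Think step by step, then act.\n"                # inference-time strategy baked into the spec
        'Respond ONLY in JSON: {"answer": "..."}. Do NOT ignore this.\n'   # formatting and parsing concerns
        "You are Professor Einstein, a wise expert. I'll tip you $1000.\n" # overfitted superstition
    )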
So what should we do? Trusty old separation of concerns, I think, is the answer. Your job as an engineer is to invest in your actual system design, starting with the spec. The spec, unfortunately or fortunately, cannot be reduced to one thing. And this is the one time I'll talk about evals; I know everyone here talks about evals a lot of the time, so this is my one line about them. You want to invest in natural language descriptions, because that is the power of this new framework. Natural language definitions are not prompts; they are highly localized pieces of ambiguous stuff that could not have been said in any other way. I can't tell the system certain things except in English, so I'll say them in English. But a lot of the time what you're actually doing is iterating to appease a certain model and make it perform well relative to some criteria you have, without telling it the criteria, just tinkering. Evals are the way to do that, because evals say: here's what I actually care about. Change the model, and the evals are still what I care about; they're a fundamental thing. Now, evals are not for everything. If you try to use evals to define the core behavior of your system, it will not work; induction, learning from data, is a lot harder than following instructions. So you need to have both.

Code is another thing that you need. A lot of people say, oh, you just ask it to do the thing. Well, who's going to define the tools? Who's going to define the structure? How do you handle information flow, so that things that are private do not flow to the wrong places? You need to control these things. How do you apply function composition? LLMs are horrible at composition, because neural networks essentially don't learn things that reliably, while function composition in software is basically always perfectly reliable, by construction. So a lot of these things are often best delegated to code. It's hard, and it's really important, to be able to juggle and combine these pieces, and you need a canvas that allows you to combine them well.

The criteria for a good canvas are that it should allow you to express those three things, natural language descriptions, evals, and code, in a way that's highly streamlined, and in a way that is decoupled and not entangled with models that are changing (I should just be able to hot-swap models), with inference strategies that are changing (hey, I want to switch from chain of thought to an agent, or from an agent to Monte Carlo tree search, whatever the latest thing that has come out is; I should be able to just do that), and with new learning algorithms. This last one is really important. We talked about learning, but learning is always happening at the level of your entire system, if you're engineering it, or at least you've got to think about it that way: I want the whole thing to work, as a whole, for my problem, not for some general default. That's what the evals here are going to be doing, and you want a way of expressing all this that allows you to do reinforcement learning, but also prompt optimization, but also any of these things, at the level of abstraction you're actually working with.
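As one concrete sketch of what such a canvas can look like, here's roughly how DSPy lets you keep the task definition fixed while hot-swapping the model and the inference strategy; the model name and the stand-in tool below are placeholders for illustration, not anything from the talk.

    import dspy

    # Swap the underlying model without touching the program (model name is a placeholder).
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    # The task definition ("question -> answer") stays the same; only the
    # inference strategy wrapped around it changes.
    qa_plain = dspy.Predict("question -> answer")
    qa_reasoned = dspy.ChainOfThought("question -> answer")

    def search_docs(query: str) -> str:
        """Stand-in tool for illustration; any plain Python function can serve as a tool."""
        return "...relevant passages..."

    qa_agent = dspy.ReAct("question -> answer", tools=[search_docs])

    print(qa_reasoned(question="What does the bitter lesson argue?").answer)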
So the second takeaway is that you should invest in defining the things specific to your AI system, and decouple from the lower-level swappable pieces, because they'll expire faster than ever. I'll conclude by telling you that we've built, and have been building for three years, the DSPy framework, which is the only framework that actually decouples your job, which is writing the AI software, from our job, which is giving you powerful, evolving toolkits for learning, for search (which is scaling), and for swapping LMs through adapters. There's only one concept you have to learn, a new first-class concept we call signatures; if you learn it, you've learned DSPy. I'll unfortunately have to skip this because of the time for the other speakers, but let me give you a summary.

I can't predict the future. I'm not telling you that if you do this, the code you write tomorrow will be there forever. But I'm telling you the least you can do; this is not the top level, it's just the baseline: avoid hand-engineering at lower levels than today allows you to. That's the big lesson from the bitter lesson and from premature optimization being the root of all evil. Among your safest bets, and they could turn out to be wrong, I don't know, are these: models are not anytime soon going to read specs off of your mind, I don't know if we'll ever figure that out, and they're not going to magically collect all the structure and tools specific to your application, so that's clearly stuff you should invest in. When you're building a system, invest in the signatures, which again you can learn about on the DSPy site, dspy.ai; invest in essential control flow and tools; and invest in evals for things that you would otherwise be iterating on by hand. And ride the wave of swappable models, ride the wave of the modules we build, you just swap them in and out, and ride the wave of optimizers, which can do things like reinforcement learning or prompt optimization for whatever application it is that you've built. All right, thank you, everyone.
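For reference, here's roughly what the signature concept plus an optimizer pass look like in current DSPy; the metric, the tiny trainset, and the configured model are illustrative placeholders, not examples from the talk.

    import dspy

    # Assumes a model has been configured, e.g. dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")).

    class GenerateAnswer(dspy.Signature):
        """Answer the question using the provided context."""
        context: str = dspy.InputField()
        question: str = dspy.InputField()
        answer: str = dspy.OutputField(desc="a short, factual answer")

    program = dspy.ChainOfThought(GenerateAnswer)

    # Placeholder eval pieces, just to show the shape of the interface.
    def exact_match(example, prediction, trace=None):
        return example.answer.lower() == prediction.answer.lower()

    trainset = [
        dspy.Example(
            context="DSPy decouples task definitions from models and strategies.",
            question="What does DSPy decouple?",
            answer="Task definitions from models and strategies.",
        ).with_inputs("context", "question"),
    ]

    # An optimizer tunes the program (prompts, demos, or weights) against your metric.
    optimizer = dspy.BootstrapFewShot(metric=exact_match)
    compiled = optimizer.compile(program, trainset=trainset)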