On Engineering AI Systems that Endure The Bitter Lesson - Omar Khattab, DSPy & Databricks

Chapters
0:00 AI Engineer World's Fair
0:22 On Engineering AI Systems that Endure the Bitter Lesson
0:32 The Challenges of AI Software Engineering
0:40 The Bitter Lesson
4:50 AI Engineering's Purpose
6:39 Takeaway 1: Engineering for Scalability
7:19 Premature Optimization
12:18 The Problem with Prompts
14:26 Trusty Old Separation of Concerns
17:11 Takeaway 2: Invest in Decoupling
17:21 The Pyramid of LLM Software and DSPy
17:45 The DSPy Concept: Declarative Signatures
So thanks everyone for showing up, and thanks to the organizers for inviting me and having me here. I'm excited to talk to you all about engineering AI systems that endure the bitter lesson. I'm Omar; I guess the intro has already happened, so let's not repeat that.

If you're here, I think it's probably because you engineer what we might call AI software, or maybe you manage or work with people who do. It's not a term that has been used in this special way for very long, so we're all trying to figure out what the right basics and fundamentals are here, and which things are fleeting. That's what this talk will largely be about.

The name of the game, and it's kind of a meme at this point, is that every week there's a new large language model; maybe every week is actually too slow at this point. Each one actually changes something in terms of the trade-offs you can strike. It might not be the state of the art in terms of the best quality necessarily, although sometimes it is, but maybe it's the best performance for a certain cost, or the best performance for certain types of applications, or maybe it's the speed that's really incredible; we've seen things like diffusion models now. So every week there's a new LLM that you have to think about if you're engineering software in this space, which is really unusual: if you think back to normal software engineering, you change your hardware every two or three years, maybe, if that.

The other part that's actually a little bit weirder is that, if you are lucky, the LLM provider has recognized that they're not really building these LLMs, they're training them; the models emerge from a lot of nudging and data and iterating on a lot of evals, and a lot of vibes as well. If you're lucky, the provider has realized that there are new quirks in their latest models that weren't there before, and, to the surprise of many people, to this day you still get longer and longer prompting guides for the latest models that are supposed to be closer and closer to AGI. If you're less lucky, you have to figure that out on your own. If you're even less lucky, the prompting guides from the provider are not even that good, so you have to figure out what the model actually needs on your own.
And every day, maybe at an even faster pace, someone is releasing an arXiv paper or a tweet or something that introduces a new learning algorithm, maybe some reinforcement learning bells and whistles, maybe some prompting tricks, maybe a prompt optimization technique, something or other that promises to make your system learn better and fit your goals better. Someone else is introducing some search or scaling or inference strategies, or agent frameworks or agent architectures, that promise to finally unlock levels of reliability or quality better than what you had before. And I think if you're actually doing a reasonable job, most likely you're scrambling every week. That's not if you're doing a bad job; that's if you're doing a good job, because you're thinking, I've got to stay on top of at least some of this stuff so that I don't fall behind. And in many cases model APIs actually change the model under the hood even though you're using the same name, so you're forced to scramble. Actually, I would say maybe the question isn't whether you will scramble every week; maybe a different question is whether you will even get to scramble for long. If you think about the rate of progress of these LLMs, are they going to eat your lunch? These are, I think, questions on a lot of people's minds,
and this is what the talk is going to address.

So the talk mentions the bitter lesson, which sounds like really ancient AI lore but is just six years old. This year's Turing Award winner, Rich Sutton, who's a pioneer of reinforcement learning, wrote a short essay on his website that basically says 70 years of AI have taught him, and other people in the AI community, from his perspective, that when AI researchers leverage domain knowledge to solve problems, chess for example, we build complicated methods that essentially don't scale, we get stuck, and we get beaten by methods that leverage scale a lot better. What seems to work better, according to Sutton, is general methods that scale, and he identifies search, which is not retrieval but more like exploring large spaces, and learning, getting the system to understand its environment, as what work best. Search here is what we'd call, in LLM land, maybe inference-time scaling or something. I don't speak for Sutton, and I'm not suggesting that I have the right understanding of what he's saying, or that I necessarily agree or disagree, but I think this is just a fundamental and important concept in this space.
I think it raises interesting questions for us as people who build and engineer AI systems, because if leveraging domain knowledge is bad, what exactly is AI engineering supposed to be about? I mean, engineering is understanding your domain and working in it with a lot of human ingenuity, in repeatable ways, let's say, or with principles. So are we just doomed? Were we just wasting our time? Why are we at an AI engineering fair? I'll tell you how I resolve this. I've not really seen a lot of people discuss what Sutton is actually talking about, and a lot of people throw the bitter lesson around, so clearly somebody has to think about this. Sutton is talking about maximizing intelligence, which is something like the ability to figure things out in a new environment really fast, let's say. All of us probably care about that to some degree;
I'm also an AI researcher. But when we're building AI systems, I think it's important to remember that the reason we build software is not that we lack AGI. The way to understand this is that we already have general intelligences everywhere; we have eight billion of them. They're unreliable, because that's what intelligence is, and they've not solved the problems that we want to solve with software. That's why we're building software. We program software not because we lack AGI but because we want reliable, robust, controllable, scalable systems, and we want these to be things we can reason about and understand at scale. And actually, if you think about engineering and reliable systems, if you think about checks and balances, any time you try to systematize stuff, it's about subtracting agency and subtracting intelligence in exactly the right places, carefully, and not restricting the intelligence otherwise.
So this is a very different axis from the kinds of lessons that you would draw from the bitter lesson. That does not mean the bitter lesson is irrelevant; let me tell you the precise way in which it's relevant. The first takeaway here is that scaling search and learning works best for intelligence. That is the right thing to do if you're an AI researcher interested in building agents that learn really well, really fast, in new environments: don't hard-code stuff at all, unless you really have to. But in building AI systems, it's helpful to ask: sure, search and learning, but searching for what? What is your AI system even supposed to be doing? What is the fundamental problem that you're solving? It's not intelligence; it's something else. And what are you learning for? What is the system learning in order to do well? That is what you need to be engineering, not the specifics of search and not the specifics of learning, as I'll talk about in the rest of this talk.
So Sutton is saying that complicated methods get in the way of scaling, especially if you apply them early, before you know what you're doing, essentially. Did we hear that before? I feel like I heard it back in the 1970s, although I wasn't around: this is the era of structured programming, with Knuth writing his popular phrase in a paper that premature optimization is the root of all evil. I think this is the bitter lesson for software, and thereby also for AI software. It's not that human ingenuity and human knowledge of the domain are harmful; it's that when you apply them prematurely, in ways that constrain your system, in ways that reflect poor understanding, they're bad. But you can't get away, in an engineering field, with not engineering your system; that's just quitting, or something.
So here's a little piece of code. If you follow me on X, on Twitter, you might recognize it, but otherwise I think it looks pretty opaque. Even for me, in three seconds I can't really look at this and tell exactly what it's doing, and I also honestly don't really care.
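For a concrete sense of what such code looks like, here is a sketch in the same spirit as the slide (which isn't reproduced in this transcript): the well-known fast inverse square root bit trick, written in Python purely for illustration. The magic constant and the bit reinterpretation only make sense for one particular 32-bit float layout, which is exactly the point made next.

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) by abusing the bit layout of a 32-bit IEEE 754 float."""
    # Reinterpret the float's bits as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # A magic constant and shift that are only valid for this exact representation.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step to refine the guess.
    return y * (1.5 - 0.5 * x * y * y)

print(fast_inv_sqrt(4.0))  # roughly 0.5, i.e. 1/sqrt(4)
```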
So, lo and behold, this is computing a square root in a certain floating-point representation on an old machine, and the thing that jumps out at me immediately is that this is not the most future-proof program possible. If you change the machine architecture, different floating-point representations, better CPUs, first of all it'll be wrong, because it's just hard-coding some values here, and second of all it'll probably be slower than a normal square root, which maybe is a single instruction, or maybe the compiler has a really smart way of doing it, or a lot of other things could be optimized for you. Whoever wrote this, maybe they had a good reason, maybe they didn't, but certainly if you're writing this kind of thing often, you're probably messing up as an engineer. So premature optimization is maybe the square root of all evil, or something.
But what counts as premature? That's kind of the name of the game; we could just say 'avoid premature optimization,' but by itself that doesn't mean anything. I don't think any strategy is guaranteed to work in tech; nobody can anticipate what will happen in three years, five years, ten years. But I think you still have to have a conceptual model that you're working off of, and I happen to have built two things that are on the order of several years old, that have fundamentally stayed the same over the years, from the days of BERT and text-davinci-002 up to o4-mini, and they're bigger now than they ever were. They're these stable, fundamental abstractions for AI systems around LLMs. So what gives? What does it take for something like ColBERT or something like DSPy to exist in this ecosystem and endure a few years, which is like centuries in LLM land? I'll try to reflect on this, and again, none of this is guaranteed to last forever.
So here's my hypothesis: premature optimization is what happens whenever you hard-code stuff at a lower level of abstraction than you can justify. If you want a square root, please just say 'give me a square root'; don't start doing random bit shifts and bit manipulation that happen to appease your particular machine today. And actually, take a step back: do you even want a square root, or are you computing something even more general, and is there a way you could express that more general thing? Only stoop down to the lower level of abstraction once you've demonstrated that the higher level of abstraction is not good enough.
I think the bigger picture here is that applied machine learning, and definitely prompt engineering, has a huge issue here. Tighter coupling than necessary is known to be bad in software, but it's not really something we talk about when we're building machine learning systems. In fact, the name of the game in machine learning is usually: hey, this latest thing came out, let's rewrite everything so that we're working around that specific thing. I tweeted about this a year ago, 13 months ago, last May in 2024, saying that the bitter lesson is just an artifact of lacking good high-level ML abstractions. Scaling deep learning helps predictably, but after every paradigm shift the best systems always include modular specializations, because we're trying to build software and we need those, and every time they basically look the same. They should have been reusable, but they're not, because we're writing bad code. Here's a nice example, just to demonstrate this; it's not special at all. Here's a 2006 paper whose title could have come from a paper today: 'A Modular Approach for Multilingual Question Answering.'
And here's the system architecture. It looks like your favorite multi-agent framework today: it has an execution manager, it has question analyzers and retrieval strategies over a bunch of corpora, and it's a figure that, if you colored it in, you'd think came from a paper last year or something. Now here's the problem. It's a pretty figure, and architecturally the system is actually not that wrong. I'm not saying it's the perfect architecture, but in a normal software environment you could just upgrade the machine, put it on new hardware, put it on a new operating system, and it would just work, and actually work reasonably well, because the architecture is not that bad. But we know that's not the case for these ML architectures, because they're not expressed in the right way.
So I think, fundamentally, I can express this most passionately against prompts. A prompt is a horrible abstraction for programming, and this needs to be fixed ASAP. I say for programming because it's actually not a horrible abstraction for management: if you want to manage an employee or an agent, a prompt is reasonable, it's a Slack channel to a remote employee. And if you want to be a pet trainer, working with tensors and objectives is a great way to iterate; that's how we build the models. But I want us to be able to also engineer AI systems, and I think for engineering and programming a prompt is a horrible abstraction. Here's why. It's a stringly typed canvas, just a big blurb, no structure whatsoever, even if structure actually exists there in a latent way. It couples and entangles the fundamental task definition you want to state, which is the really important stuff, the thing you're actually engineering, with random, overfitted, half-baked decisions: hey, this LLM responded when I talked to it this way, or I put in this example to demonstrate my point and it kind of clicked for this model, so I'll just keep it in. And there's no way to tell the difference between the fundamental thing you're solving and the random trick you applied. It's like the square root thing, except you don't call it a square root, and we just have to stare at it and go, wait, why are we shifting left by five bits, or something?
You're also taking the inference-time strategy, which is changing every few weeks, with people proposing stuff all the time, and baking it, literally entangling it, into your system. If it's an agent, your prompt is telling it it's an agent; your system has no business knowing whether it's an agent or a reasoning system or whatever. What are you actually trying to solve? It's as if, while writing a square root function, you were also specifying the layout of the structs in memory or something. You're also talking about formatting and parsing things: write XML, produce JSON, whatever. Again, that's really none of your business most of the time. You want to write a human-readable spec, but you're saying things like 'do not ignore this,' 'generate XML,' 'answer in JSON,' 'you are Professor Einstein, a wise expert in the field,' 'I'll tip you a thousand dollars.' That is just not engineering, guys.
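To make the entanglement concrete, here is a hypothetical prompt blob (invented for illustration, not from the talk), with each fragment labeled by the concern it actually belongs to. Everything lives in one untyped string, so none of the pieces can be identified, swapped, or optimized independently.

```python
# A hypothetical, entangled prompt (illustrative only).
PROMPT = (
    "You are Professor Einstein, a wise expert in the field. "    # persona trick, overfitted to one model
    "I'll tip you $1000 for a great answer. "                     # incentive trick, not part of the task
    "Think step by step before answering. "                       # inference-time strategy (chain of thought)
    "Given the context below, answer the user's question. "       # the actual task definition
    "Do NOT ignore these instructions. "                          # reliability duct tape
    "Respond ONLY in JSON with keys 'answer' and 'confidence'. "  # formatting and parsing concerns
    "Context: {context}\nQuestion: {question}"                    # information flow (the real inputs)
)
```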
So what should we do? Trusty old separation of concerns, I think, is the answer. Your job as an engineer is to invest in your actual system design, starting with the spec. The spec, unfortunately or fortunately, cannot be reduced to one thing, and this is the point where I'll mention evals; I know everyone here hears about evals, so this is my one line about them, even though we talk about evals a lot of the time. You want to invest in natural-language descriptions, because that is the power of this new paradigm. Natural-language definitions are not prompts; they are highly localized pieces of inherently ambiguous stuff that could not have been said in any other way. I can't tell the system certain things except in English, so I'll say them in English. But a lot of the time what I'm actually doing is iterating to appease a certain model and make it perform well relative to some criteria I have, without telling it the criteria, just tinkering. Evals are the way to handle that, because evals say: here's what I actually care about. Change the model, and the evals are still what I care about; they're the fundamental thing. Now, evals are not for everything. If you try to use evals to define the core behavior of your system, it won't work well: induction, learning from data, is a lot harder than following instructions. So you need to have both.
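As a rough illustration of 'evals say what I actually care about,' here is a minimal sketch in DSPy-flavored Python; the example data, field names, and the exact evaluation call are assumptions for illustration rather than a definitive recipe.

```python
import dspy

# Hypothetical labeled examples; the field names are placeholders.
devset = [
    dspy.Example(question="Who wrote 'The Bitter Lesson'?", answer="Rich Sutton").with_inputs("question"),
    # ... more examples ...
]

def answer_match(example, prediction, trace=None):
    # The metric encodes what we care about, independently of which model or prompt we use.
    return example.answer.strip().lower() in prediction.answer.strip().lower()

# Assuming DSPy's evaluation harness: swap the model or the program,
# and the metric (what we care about) stays fixed.
evaluate = dspy.Evaluate(devset=devset, metric=answer_match, display_progress=True)
# score = evaluate(my_program)
```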
Code is another thing you need. A lot of people say you can just ask the model to do the thing, but who's going to define the tools? Who's going to define the structure? How do you handle information flow, so that things that are private don't flow to the wrong places? You need to control these things. How do you apply function composition? LLMs are horrible at composition, because neural networks essentially don't learn things that reliably, whereas function composition in software is basically always perfectly reliable, by construction. So a lot of things are often best delegated to code. But it's hard, and it's really important, to be able to juggle and combine these things, and you need a canvas that allows you to combine them well. The criterion for a good canvas is that it should let you express those three things in a way that's highly streamlined, and in a way that is decoupled and not entangled with models that are changing; I should just be able to hot-swap models. Decoupled from inference strategies that are changing: hey, I want to switch from chain of thought to an agent, I want to switch from an agent to Monte Carlo tree search, whatever the latest thing that has come out is; I should be able to just do that. And decoupled from new learning algorithms. This is really important: we talked about learning, but learning is always happening at the level of your entire system, if you're engineering it, or at least you've got to be thinking about it that way, where you're saying, I want the whole thing to work as a whole for my problem, not for some general default. That's what the evals here are going to be doing, and you want a way of expressing all this that allows you to do reinforcement learning, but also prompt optimization, but also any of these things, at the level of abstraction that you're actually working with.
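Here is a rough sketch of what that hot-swapping can look like in DSPy; the model names and the choice of modules are placeholders, and the inline 'question -> answer' signature is just a minimal stand-in for a real task definition.

```python
import dspy

# Swap the model: the rest of the program does not change.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model name
# dspy.configure(lm=dspy.LM("<another-provider/model>"))

# Swap the inference strategy around the same declared task.
qa = dspy.Predict("question -> answer")         # a single plain call
qa = dspy.ChainOfThought("question -> answer")  # add intermediate reasoning
# qa = dspy.ReAct("question -> answer", tools=[my_search_tool])  # an agentic loop with tools

result = qa(question="What does the bitter lesson say?")
print(result.answer)
```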
So the second takeaway is that you should invest in defining the things specific to your AI system, and decouple them from the lower-level swappable pieces, because those will expire faster than ever. I'll just conclude by telling you that we've built, and have been building for three years, the DSPy framework, which is the only framework that actually decouples your job, which is writing your AI software, from our job, which is giving you powerful, evolving toolkits for learning and for search, which is scaling, and for swapping LLMs through adapters. There's only one concept you have to learn. It is a new, first-class concept, which we call signatures; if you learn it, you've learned DSPy. I'll unfortunately have to skip the details because of time, for the other speakers.
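Since the details are skipped here for time, this is a minimal sketch of what a DSPy signature can look like; the task and field names are made up for illustration (see dspy.ai for the actual documentation).

```python
import dspy

class TriageTicket(dspy.Signature):
    """Classify a customer support ticket and draft a short reply."""
    ticket: str = dspy.InputField(desc="raw text of the incoming ticket")
    category: str = dspy.OutputField(desc="one of: billing, bug, feature_request, other")
    reply: str = dspy.OutputField(desc="a brief, polite draft response")

# The signature declares what the step does; the module decides how.
triage = dspy.ChainOfThought(TriageTicket)
# out = triage(ticket="I was charged twice this month...")
# print(out.category, out.reply)
```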
But let me give you a summary. I can't predict the future; I'm not telling you that if you do this, the code you write tomorrow will be there forever. But I'm telling you the least you can do. This is not the top level; it's just the baseline, I would say: avoid hand-engineering at lower levels than today allows you to. That's the big lesson from the bitter lesson and from premature optimization being the root of all evil. Among your safest bets, and they could turn out to be wrong, I don't know: models are not, anytime soon, going to read specs off of your mind, I don't know if we'll ever figure that out, and they're not going to magically collect all the structure and tools specific to your application. So that's clearly stuff you should invest in when you're building a system. Invest in the signatures, which again you can learn about on the DSPy site, dspy.ai. Invest in essential control flow and tools. And invest in evals for the things that you would otherwise be iterating on by hand. Then ride the wave of swappable models, ride the wave of the modules we build, you just swap them in and out, and ride the wave of optimizers, which can do things like reinforcement learning or prompt optimization for whatever application it is that you've built.
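As a closing sketch of what 'ride the wave of optimizers' can mean in practice, here is a DSPy-flavored example; the choice of optimizer, its arguments, and the metric are assumptions for illustration, not the talk's prescription.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot  # one of several interchangeable optimizers

def metric(example, prediction, trace=None):
    # Whatever "good" means for your application; a placeholder check here.
    return example.answer.lower() in prediction.answer.lower()

optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)

# 'program' is any DSPy module (for instance the ChainOfThought step sketched earlier)
# and 'trainset' is a list of dspy.Example objects.
# compiled_program = optimizer.compile(program, trainset=trainset)
# Swap the LM, the modules, or the optimizer (e.g. a prompt optimizer or an RL-based one)
# and simply re-compile; the signatures, code, and metric stay the same.
```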
All right, thank you, everyone.