AI Engineer Summit 2025: Agent Engineering (Day 2)

Chapters
0:00 start
15:36 swyx: Why Agent Engineering
27:23 AI Snake Oil: Building and evaluating AI Agents
47:17 Gemini: Going deep on Gemini Deep Research
62:23 Anthropic: How We Build Effective Agents
77:30 Sierra: The Agent Development Life Cycle
96:07 Morgan Stanley: What RL Means for Agents
173:39 Jane Street: Building AI-Powered Developer Tools at Jane Street
190:51 Bloomberg: Challenges to Scaling Agents for Generative AI Products
210:29 Brightwave: Knowledge Agents for Finance Workflows
307:34 Windsurf: Agents are built at the fringe: getting from 90 to 100
346:59 SuperDial: Voice AI: Your Bot Isn't Special
366:12 Ramp: AI Agents: the Bitter Lesson
426:33 OpenAI: Creating Agents that Co-Create
450:48 Gemini: The Next AI Engineers
472:26 Meta: What does it take to build a personal, local, private AI Agent that augments you deeply?
00:10:21.620 |
So, we're talking about the agent engineering track, and it's all about builders, all about 00:10:27.680 |
you, who are actually shaping our agentic future. 00:11:01.180 |
So today's leap isn't in theory, but in scale. LLMs and automation frameworks are making agentic 00:11:08.640 |
workflows practical, driving real-world adoption across industries, and what was once theoretical 00:11:13.920 |
is now operational on an unprecedented level. 00:11:25.480 |
You're going to hear a lot about what's actually happening now, 00:11:34.020 |
Yes, and in fact, things have gotten so real 00:11:37.520 |
that we will hear a number of use cases and deployment stories, 00:11:43.140 |
such as how agents are impacting the finance space 00:11:46.640 |
at Jane Street and BlackRock, very excited about that. 00:11:50.420 |
But of course, not everything is figured out. 00:11:52.660 |
The other big theme for today is the big open questions, 00:11:58.860 |
Yes, we have a very, very exciting lineup today. 00:12:03.880 |
And as AI engineers, we will go deep on Gemini deep research, 00:12:18.720 |
We will explore what reinforcement learning means for agents, 00:12:24.780 |
and discuss how to scaffold wisely while scaling effectively 00:12:30.660 |
Experts from Weights and Biases, Datadog, Morgan Stanley, 00:12:35.540 |
Bloomberg, Brightwave, Galileo, and other amazing companies 00:12:41.560 |
And don't forget, tomorrow we will have hands-on workshops. 00:12:46.560 |
Before we start, we have to give a big thank you to our sponsors, 00:12:49.620 |
especially the AI Engineer Summit's platinum sponsor, Solana. 00:12:53.620 |
Solana, for those of you who don't know, is a blockchain and crypto ecosystem. 00:12:57.560 |
It's one of the ecosystems that's most directly at the center of this intersection 00:13:03.040 |
In short, they are building a permissionless layer for unlocking and allowing your agents 00:13:09.040 |
I think it's extremely telling that they are here supporting this event in terms of where 00:13:16.460 |
If you want to learn more about them, they have a large booth downstairs with three demonstrations 00:13:22.460 |
And this event would not be possible without all of our sponsors. 00:13:28.440 |
All the companies you see on the screen are pushing the boundaries of AI engineering, 00:13:32.460 |
and represent a fascinating mix of pioneers shaping the future of the field. 00:13:39.800 |
The expo area is just down the stairs in the hallway on the lower level. 00:13:47.780 |
But please take time to visit our sponsors, chat with their experts and top engineers. 00:13:53.280 |
There is a wealth of knowledge to gain from them. 00:13:56.920 |
And you know, you can make connections, there are a lot of opportunities to explore, and these 00:14:03.000 |
sponsors could be your next collaborators, they can be your service providers, or maybe mentors 00:14:09.000 |
if you're just starting your AI journey, so take the time to visit them. 00:14:15.780 |
Last quick announcement before we get out of here, from a logistic perspective. 00:14:19.860 |
At the break following their sessions, most speakers will be able to answer your questions 00:14:26.160 |
There's one on this level, and then two down below just outside the expo. 00:14:31.040 |
There's also, during break time, the hallway track, which allows you to gather and talk about 00:14:37.500 |
It's a way to sort of engage in the conversation in a more direct way. 00:14:41.080 |
You'll see there's several breaks happening throughout the day. 00:14:43.280 |
Hopefully, you take advantage of those to go engage with the speakers, engage with each 00:14:46.500 |
other, and make sure, of course, that you do not miss the after party after all of this 00:14:59.860 |
So, please help me to welcome this person who probably all of you know. 00:15:10.060 |
He's a co-founder of this super practical AI engineer summit. 00:15:15.060 |
He'll set the context for today's track and discuss what needs to happen in 2025 to make 00:41:59.640 |
So very roughly speaking, capability means what a model could do at a certain point in time. 00:42:06.640 |
For those of you who are technically minded, this means the pass@k accuracy of a model for a given task. 00:42:13.140 |
That means that at least one of the k answers that the model outputs is correct. 00:42:17.020 |
On the other hand, reliability means consistently getting the answer right each and every single time. 00:42:23.140 |
When agents are deployed for consequential decisions in the real world, what you really need 00:42:27.400 |
to focus on is reliability rather than capability. 00:42:31.680 |
That's because language models are already capable of very many things. 00:42:35.900 |
But if you trick yourself into believing this means a reliable experience for the end user, 00:42:40.640 |
that's when products in the real world go wrong. 00:42:43.440 |
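To make that distinction concrete, here is a minimal sketch of how pass@k capability and every-call reliability diverge. The per-attempt accuracy below is an assumed, purely illustrative number, not a figure from the talk.

```python
# Illustrative only: assumed per-attempt accuracy, independent attempts.

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts is correct (capability)."""
    return 1 - (1 - p) ** k

def all_correct(p: float, n: int) -> float:
    """Probability that n consecutive independent calls are all correct (reliability)."""
    return p ** n

p = 0.80  # assumed per-attempt accuracy
print(f"pass@3 capability:     {pass_at_k(p, 3):.3f}")     # ~0.992 -> looks very capable
print(f"100 calls all correct: {all_correct(p, 100):.2e}")  # ~2e-10 -> nowhere near five nines
```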
So in particular, I think the methods for training models that get us to the 90 percent, what 00:42:49.800 |
in swyx's terms would be the job of a machine learning engineer, don't necessarily get us to the 00:42:55.520 |
99.999 percent, what is often known as the five nines of reliability. 00:43:01.800 |
Closing this gap between the 90 percent and the 99.999 percent is the job of an AI engineer. 00:43:09.540 |
And I think this is what has led to the failures of products like the Humane AI Pin and the Rabbit R1. 00:43:14.800 |
In part, it's because the developers did not anticipate the impact of not having reliability in products like these. 00:43:23.080 |
In other words, if your personal assistant only orders your DoorDash food correctly 00:43:28.080 |
80 percent of the time, that is a catastrophic failure from the point of view of a product. 00:43:34.080 |
Now, one thing people have proposed to fix this sort of issue, to improve reliability, is to create a verifier. 00:43:42.360 |
And on this basis, there have been several claims that if we could improve the inference scaling capabilities 00:43:48.100 |
of these tools, we could get to very reliable models. 00:43:52.360 |
Unfortunately, what we've found is that verifiers can also be imperfect in practice. 00:43:57.640 |
For instance, two of the leading coding benchmarks, HumanEval and MBPP, both have false positives. 00:44:04.900 |
That is, a model could output incorrect code and still pass the unit test. 00:44:10.040 |
And once we account for these false positives, what we have are these inference scaling curves 00:44:15.680 |
So rather than model performance continuing to improve, if there are false positives in 00:44:20.640 |
your verifiers, the model performance sort of bends downwards, simply because the more 00:44:24.820 |
you try, the more likely it is you'll get a wrong answer. 00:44:29.360 |
And so this is also not a perfect solution to the problem of reliability. 00:44:36.260 |
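A small illustration of that point, using assumed numbers rather than the benchmarks' actual data: with a verifier that has a nonzero false-positive rate, sampling more candidates makes it more likely that a wrong-but-passing answer is the one you return.

```python
# Assumed numbers, for illustration: p_correct is the chance a sample is truly
# correct (and correct samples always pass), fpr is the chance the verifier
# wrongly passes an incorrect sample. We return the first sample that passes.

def p_false_positive_returned(p_correct: float, fpr: float, k: int) -> float:
    """P(the first verifier-accepted sample among k candidates is actually wrong)."""
    p_accept = p_correct + (1 - p_correct) * fpr      # chance any one sample passes
    wrong_share = (1 - p_correct) * fpr / p_accept    # share of passing samples that are wrong
    return wrong_share * (1 - (1 - p_accept) ** k)    # chance a wrong one gets returned

for k in (1, 4, 16, 64):
    print(k, round(p_false_positive_returned(p_correct=0.2, fpr=0.1, k=k), 3))
# 1 -> 0.08, 4 -> 0.209, 16 -> 0.284, 64 -> 0.286: the more you sample, the more
# likely a wrong-but-passing answer slips through, so true accuracy stops improving.
```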
I think the challenge for AI engineers is to figure out what sorts of software optimizations 00:44:42.720 |
and abstractions are needed for working with inherently stochastic components like LLMs. 00:44:48.920 |
In other words, it's a system design problem rather than just a modeling problem where you 00:44:53.320 |
need to work around the constraints of an inherently stochastic system. 00:44:58.080 |
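One minimal sketch of what that system-design framing can look like in code. Here `call_llm`, `validate`, and `fallback` are hypothetical placeholders, not any particular product's API; the point is treating the model call as an unreliable component wrapped in reliability machinery.

```python
import time

def reliable_call(prompt: str, call_llm, validate, max_retries: int = 3, fallback=None):
    """Retry a stochastic LLM call until its output passes validation, then fall back."""
    last_error = None
    for attempt in range(max_retries):
        try:
            output = call_llm(prompt)
            if validate(output):              # schema check, unit test, rule check, ...
                return output
            last_error = ValueError("output failed validation")
        except Exception as exc:              # transient API or tool failures
            last_error = exc
        time.sleep(2 ** attempt)              # simple exponential backoff
    if fallback is not None:
        return fallback(prompt, last_error)   # e.g. hand off to a human or a fixed workflow
    raise RuntimeError(f"no valid output after {max_retries} attempts") from last_error
```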
And I want to argue in the last one minute of my talk that this means looking at AI engineering 00:45:03.680 |
as more of a reliability engineering field than a software or machine learning engineering field. 00:45:09.760 |
And this also brings me to the clear mindset shift that is needed to become successful from 00:45:18.960 |
If you look at the title slide of my talk, this title slide sort of pointed to one such area 00:45:25.000 |
where we've already overcome certain types of limitations of stochastic systems. 00:45:32.740 |
The 1946 ENIAC computer used over 17,000 vacuum tubes, many of which at the beginning of this 00:45:40.740 |
process, used to fail so often that the computer was unavailable half the time. 00:45:46.440 |
And the engineers who built this product knew that this is a failure from the point of view 00:45:52.160 |
So their primary job in the first two years of this computer was to fix the reliability issues, 00:45:58.560 |
to reduce failures to the point where the machine worked well enough to become usable 00:46:05.900 |
And I say that this is precisely what AI engineers need to be thinking about as their real job. 00:46:12.420 |
It is not to create excellent products, though that is important, but rather to fix the reliability 00:46:17.300 |
issues that plague every single agent that uses inherently stochastic models as its basis. 00:46:27.580 |
To become successful engineers, you need a reliability shift in your mindset to think of yourselves as 00:46:33.000 |
the people who are ensuring that this next wave of computing is as reliable for end users 00:46:39.320 |
And there's a lot of precedent for this type of thing happening in the past. 00:46:44.160 |
With this, I'll leave you with the key takeaways. 00:46:48.200 |
Let's dive with our next presenters into Gemini Deep Research. 00:47:02.460 |
Please join me in welcoming to the stage staff ML software engineer at Google, Mukund Sridhar, 00:47:08.520 |
and product manager of Google Gemini, Arush Selvan. 00:47:19.360 |
I'm a software engineer at Google, working on Deep Research. 00:47:21.360 |
Arush Selvan: So I don't know if people have had a chance to try Deep Research on Gemini or are 00:47:23.360 |
Arush Selvan: But you can try it if you go to Gemini Advanced. 00:47:38.200 |
Arush Selvan: And in the model picker, alongside 2.0 Flash, 2.0 Flash Thinking Experimental, 2.0 Flash Thinking 00:47:42.040 |
Experimental with Apps, and 2.0 Pro Experimental, 00:47:45.040 |
Arush Selvan: you will find 1.5 Pro with Deep Research, which is what we built. 00:47:50.040 |
Arush Selvan: And if you have the chance to use it and you pay the 20 bucks, you'll see that 00:47:54.920 |
it's a personal research agent that can browse the web for you to build reports on your behalf. 00:48:00.380 |
Arush Selvan: And so our motivation and what we want to talk about today is kind of why we 00:48:04.420 |
built it, some of the product challenges we overcame, and some of the technical challenges 00:48:07.860 |
Arush Selvan: you'll face building a web research agent. 00:48:09.860 |
Arush Selvan: So our motivation was really we wanted to help people get smart fast. 00:48:15.960 |
Arush Selvan: We saw that research and learning queries are some of the top use cases in Gemini, 00:48:20.620 |
but when you bring like really hard questions to chatbots in general, what we were finding 00:48:25.740 |
is that it would often give you a blueprint for an answer, rather than actually give you 00:48:32.980 |
Arush Selvan: So we had this query that we used to throw around of like, "Tell me 00:48:36.500 |
what does it take to get an athletic scholarship for shot put, and like how do I go get one?" 00:48:41.800 |
Arush Selvan: And often the answers would be things like, "You should talk to coaches, you 00:48:44.620 |
should find out how far you should be able to throw, and you know, you should make sure 00:48:49.880 |
Arush Selvan: But really what I want to know is like, "Okay, what are the grade boundaries? 00:48:52.740 |
Like, how far do I need to actually be able to throw?" 00:48:55.380 |
Arush Selvan: I want something super comprehensive, and that's where we saw a big opportunity. 00:48:59.000 |
Arush Selvan: Yeah, so we said, "What if you remove the constraints of compute and latency 00:49:04.360 |
Arush Selvan: Let Gemini take as long as it wants, browse the web as much as it needs, 00:49:08.660 |
Arush Selvan: and see if we can trade that off for a much more comprehensive answer for the user." 00:49:13.960 |
Arush Selvan: But you've got to do it in five minutes, because beyond that, we don't have 00:49:18.960 |
Arush Selvan: So this brought a bunch of product challenges for us. 00:49:23.960 |
Arush Selvan: Gemini, up to this point, is an inherently synchronous feature. 00:49:27.960 |
Arush Selvan: And so we needed to figure out how do you sort of build asynchronous experiences 00:49:36.260 |
Arush Selvan: We also wanted to set expectations with users. 00:49:38.260 |
Arush Selvan: Deep research is good for one very specific thing, but a lot of user queries 00:49:41.960 |
to Gemini are things like, "What's the weather? 00:49:44.260 |
Arush Selvan: Things like that, where waiting five minutes is not going to get you a good answer, 00:49:48.260 |
Arush Selvan: and we wanted to set expectations. 00:49:50.060 |
Arush Selvan: And the last thing is, our answers can be thousands of words long, and we needed 00:49:54.460 |
to figure out how do you make it easy for users to engage with really long outputs in a chat 00:50:01.560 |
Arush Selvan: So let's walk through kind of the UX and kind of think about how we solve some 00:50:09.340 |
Arush Selvan: So imagine you're a VC, and everybody's talking about investing in nuclear in America, 00:50:14.540 |
Arush Selvan: So you come with this query like, "Hey, help me learn the latest technology breakthroughs 00:50:18.300 |
in small nuclear reactors, and tell me interesting companies in the supply chain." 00:50:22.240 |
Arush Selvan: So the first step when you bring this query to deep research is that Gemini 00:50:26.580 |
will actually put together a research plan for you and present it in a card. 00:50:30.020 |
Arush Selvan: And so this is the first way in which we're able to communicate with users. 00:50:34.020 |
Arush Selvan: This isn't your standard chatbot experience. 00:50:38.360 |
Arush Selvan: But it's also an opportunity for us to actually show the user a research plan that they 00:50:44.860 |
Arush Selvan: It wouldn't just get to work. 00:50:45.860 |
Arush Selvan: It would actually show you, "Okay, here's how I'm going to approach this." 00:50:47.860 |
Arush Selvan: And it's a way for users to, if they want, kind of engage and steer the direction 00:50:54.100 |
Arush Selvan: Now, once you hit start, we actually try and show you what Gemini is doing under the 00:51:03.140 |
hood in real time by showing you the websites it's browsing. 00:51:07.140 |
Arush Selvan: And this is a feature that was built before Thinking Models, and thoughts 00:51:10.600 |
are also a really great way of kind of showing transparency of what the model is thinking. 00:51:15.620 |
Arush Selvan: But what's really nice here is while you wait, you can sort of click through 00:51:18.140 |
the websites, dive into any of the content, but what we also inadvertently saw is people 00:51:22.980 |
trying to game that number to see how high it could go. 00:51:25.520 |
Arush Selvan: And so we definitely saw people push that number into the thousands to try 00:51:28.900 |
and see how many websites Deep Research could read. 00:51:32.900 |
Arush Selvan: Finally, we kind of get this report that's thousands of words long. 00:51:39.440 |
And we're really inspired by what kind of what Anthropic does with artifacts, and so we thought 00:51:44.460 |
that was a really great way of sort of being able to pin an artifact so that users can actually 00:51:49.480 |
ask questions about the research while reading the material. 00:51:52.520 |
Arush Selvan: They don't have to scroll back and forth, and what's really neat about this 00:51:55.360 |
is it means it's easy for you to engage in sort of changing the style of the report, adding 00:51:59.340 |
sections, removing sections, asking follow up questions, and it sort of makes that really 00:52:05.280 |
Arush Selvan: And the last part that's super important is kind of user trust, and also doing right 00:52:10.320 |
So we try and always show as all the sources we read, as well as all the sources we used in 00:52:14.960 |
the report, because not everything that we read is used, but it stays in context for 00:52:20.860 |
Arush Selvan: And also sort of these are all things that carry over to Google Docs, citations, 00:52:25.840 |
and things like that, if you choose to export. 00:52:27.780 |
Arush Selvan: So I thought today we can pick some of the challenges that one has to encounter 00:52:37.140 |
while building a research agent, and talk through some of them. 00:52:42.680 |
Arush Selvan: So one is this long-running nature of tasks introduces a couple of things that we 00:52:50.180 |
Arush Selvan: Second is the model has to plan iteratively and spend its time and compute during 00:52:57.680 |
Arush Selvan: So what are those challenges there? 00:52:59.660 |
Arush Selvan: And it has to do this while interacting with a very noisy environment that is the web. 00:53:05.840 |
Arush Selvan: And as you do this and read through information, very quickly you can start seeing 00:53:11.760 |
your context grow, and how do you effectively manage context. 00:53:15.300 |
Arush Selvan: So if you think about a job that runs for multiple minutes, and something that can 00:53:22.720 |
make many, many different LLM calls and calls to different services, they're bound to be failures. 00:53:29.560 |
Arush Selvan: And today we are talking about minutes, but you can very easily think in the 00:53:33.800 |
future of these kind of research agents taking like multiple hours. 00:53:38.680 |
Arush Selvan: So it's important to be robust to intermediate failures of these various services 00:53:45.080 |
Arush Selvan: And so being able to build a good state management solution, being able to recover 00:53:50.400 |
Arush Selvan: from errors effectively so that you just don't drop the whole research task due 00:53:57.440 |
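A rough sketch of the kind of state management being described, as one possible shape rather than Gemini's actual implementation: checkpoint each completed step so an intermediate LLM or service failure can be retried without restarting the whole research task.

```python
import json
import pathlib

class ResearchState:
    """Persist intermediate research results so a long-running task can resume."""

    def __init__(self, task_id: str, root: str = "./runs"):
        self.path = pathlib.Path(root) / f"{task_id}.json"
        self.state = {"plan": [], "completed": {}, "notes": []}
        if self.path.exists():                      # resume after a crash or timeout
            self.state = json.loads(self.path.read_text())

    def save(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.state, indent=2))

    def complete_step(self, step_id: str, result: dict):
        self.state["completed"][step_id] = result
        self.save()                                 # checkpoint after every step

    def pending_steps(self):
        return [s for s in self.state["plan"] if s["id"] not in self.state["completed"]]
```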
Arush Selvan: The second aspect of doing this is that it enables this feature 00:54:01.600 |
Arush Selvan: cross-platform, so we believe more and more users will start kind of registering 00:54:08.580 |
their asks, or their research tasks, and just walk away, do their thing, and then come back. 00:54:14.860 |
Arush Selvan: And this can happen now across devices, and you can pick up reading it once 00:54:25.180 |
Arush Selvan: So now what is the model doing through these few minutes? 00:54:30.380 |
Arush Selvan: So let's take an example, right? 00:54:32.580 |
Arush Selvan: So here we're looking for athletic scholarships for shot put. 00:54:36.420 |
Arush Selvan: There are many facets to this query, and we kind of show this in a research 00:54:42.360 |
Arush Selvan: The first thing the model has to do is try to figure out which of these sub-problems 00:54:46.940 |
it can start tackling in parallel versus things that are inherently sequential, right? 00:54:52.860 |
Arush Selvan: So the model has to be able to reason to do that. 00:54:56.180 |
Arush Selvan: And the other challenge is, here you see you're always going to land in this 00:55:01.540 |
state where there's partial information, so it's important to look at all the information 00:55:06.260 |
found so far before you decide what to do next. 00:55:09.380 |
Arush Selvan: So in this instance, the model found, hey, it knows the qualifying standards 00:55:14.980 |
for the D1 division, but in order to provide a complete report and answer the user's question, 00:55:20.160 |
Arush Selvan: It has to go figure out what the equivalent for the D2 and D3 divisions are. 00:55:25.480 |
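A hedged sketch of that planning pattern, with `research` standing in for a hypothetical async search-and-read call: fan out the sub-questions that are independent, then issue the follow-up that depends on what came back.

```python
import asyncio

async def research(question: str) -> dict:
    ...  # hypothetical placeholder: search the web, read pages, return findings

async def plan_and_execute():
    # Facets of "athletic scholarships for shot put" that can run in parallel:
    parallel = [
        "NCAA D1 shot put qualifying standards",
        "academic and grade requirements for athletic scholarships",
        "how to contact college track and field coaches",
    ]
    findings = await asyncio.gather(*(research(q) for q in parallel))

    # Inherently sequential: only after seeing the D1 standards do we know we
    # still need the D2 and D3 equivalents to give a complete answer.
    if not any("D2" in str(f) for f in findings):
        findings.append(await research("NCAA D2 and D3 shot put qualifying standards"))
    return findings

# asyncio.run(plan_and_execute())
```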
Arush Selvan: So this notion of being able to ground on information you find and then plan 00:55:34.420 |
Arush Selvan: Another example of partial information could be when you make searches. 00:55:38.020 |
Arush Selvan: So in this case, you're trying to find the best roller coaster for kids. 00:55:43.340 |
Arush Selvan: You might find results that provide partial information again. 00:55:47.340 |
Arush Selvan: So here you end up at a link which talks about the top ten roller coasters, but 00:55:53.700 |
does not mention anything about them being suitable to kids. 00:55:57.340 |
Arush Selvan: So the plan has to recognize this fact and then go ahead and in the next steps 00:56:02.660 |
Arush Selvan: of planning try to resolve this ambiguity. 00:56:05.980 |
Arush Selvan: Another example of challenges in planning is information is often not found 00:56:15.440 |
Arush Selvan: You find facets of information spread across different sources. 00:56:18.860 |
Arush Selvan: So here you're trying to find what it would take to get a certification 00:56:25.260 |
for scuba diving at some dive centers nearby. 00:56:29.200 |
Arush Selvan: So you see one source has kind of the structure of what you 00:56:36.140 |
have to go through to get a certification, but in a completely different source you have 00:56:40.980 |
this notion of the pricing for this diving center. 00:56:43.560 |
Arush Selvan: So the model has to weave this together to figure out what the cost structure 00:56:47.660 |
Arush Selvan: for such a certification would look like. 00:56:50.680 |
Arush Selvan: Then there's the classic entity resolution problem. 00:56:55.460 |
Arush Selvan: So you might find mentions of the same entity across different sources. 00:57:00.420 |
Arush Selvan: So you need to be able to reason about some information indicators to kind of figure 00:57:04.900 |
out if they're talking about the same entity, or whether you need to explore more to resolve such ambiguities. 00:57:09.900 |
Arush Selvan: Yeah, I think most people here have worked on some notion of a web problem. 00:57:17.460 |
Arush Selvan: And we know like it's super fragmented. 00:57:19.480 |
Arush Selvan: So here you see two different websites talking about the same thing about music festivals 00:57:26.180 |
Arush Selvan: in Portugal this year on the left. 00:57:28.200 |
Arush Selvan: On the left, if you end up at such a website, it's easier and you get most of your information 00:57:34.860 |
Arush Selvan: On the right, the layout is different, so having a robust browsing mechanism if you want 00:57:41.200 |
Arush Selvan: To navigate the web for your research tasks is another important challenge. 00:57:46.220 |
Arush Selvan: So like we saw, there is a lot of these intermediate outputs, and as you do this 00:57:53.180 |
Arush Selvan: and you start getting streams of information during your planning. You can imagine your context size growing very quickly. 00:58:00.200 |
Arush Selvan: The other challenge about context size is that your research task doesn't typically end with your first query. 00:58:08.200 |
Arush Selvan: People have follow ups. People can say, hey, can you also do the same for this other topic? 00:58:13.220 |
Arush Selvan: So there is like this kind of a follow up, a deep research, and that also adds pressure on the context. 00:58:21.020 |
Arush Selvan: We at Gemini have the liberty of really long context models, but even then you have to design 00:58:29.020 |
Arush Selvan: some way to make sure you effectively manage your context. And there are multiple choices here. Each comes with different trade-offs. 00:58:38.040 |
Arush Selvan: We're showing one here where we kind of have like this recency bias. So you have a lot more information about your current and your previous tasks. 00:58:48.040 |
Arush Selvan: But as you get to older tasks, we kind of selectively pick out, you know, what we call research notes and put them in a RAG store. 00:58:56.040 |
Arush Selvan: That way, the model can still access it, but it's being selective. 00:58:59.060 |
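A rough sketch of that recency-biased strategy, as an illustration under my own assumptions rather than Gemini's code: recent tasks stay in the prompt in full, older ones are distilled into research notes that are selectively retrieved. `summarize` and `retrieve` are hypothetical callables.

```python
class ContextManager:
    """Keep recent research tasks verbatim; compress older ones into retrievable notes."""

    def __init__(self, keep_recent: int = 2):
        self.keep_recent = keep_recent
        self.tasks = []          # full transcripts of research tasks, newest last
        self.note_index = []     # (task_id, note_text) pairs acting as a tiny RAG store

    def add_task(self, task_id: str, transcript: str, summarize):
        self.tasks.append((task_id, transcript))
        while len(self.tasks) > self.keep_recent:
            old_id, old_transcript = self.tasks.pop(0)
            self.note_index.append((old_id, summarize(old_transcript)))  # research notes

    def build_prompt(self, query: str, retrieve) -> str:
        recent = "\n\n".join(t for _, t in self.tasks)
        notes = "\n".join(retrieve(query, self.note_index))  # selectively pulled back in
        return f"{notes}\n\n{recent}\n\nUser: {query}"
```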
Arush Selvan: I'll hand it back to Arush to talk about what's next. 00:59:03.060 |
Arush Selvan: Yeah, so we were super excited to put this feature out in December. We weren't actually sure if anyone was going to use it, if anyone was going to care to wait five minutes for something, and we were really positively surprised by the reception. 00:59:18.060 |
Arush Selvan: And really what we saw was, hey, we've built something that's maybe as good as like a McKinsey analyst, right? And we give it away for 20 bucks. But, you know, that's really great. 00:59:31.080 |
Arush Selvan: But what it does is it just retrieves from the open web, and it's a text-in, text-out only system, right? And so we sort of see a few different directions of where research agents are going to go next. 00:59:44.100 |
Arush Selvan: And the first one is around expertise, right? So how do you go from a McKinsey analyst to a McKinsey partner, or a Goldman Sachs partner, or like a partner at a law firm, right? So that's really around not just being able to aggregate information and synthesize it, but also think through the so what of how do, like, what are the implications for what we're going to do, and what are the most interesting insights and patterns that come out of it? 01:00:04.120 |
Arush Selvan: The, the other thing is, you know, there are plenty of domains beyond professional services, like the sciences, where you, you know, want to get really good, you know, you want something that can read many papers, form hypotheses, find really interesting patterns in, you know, what methods we used, and, and come up with novel hypotheses to explore. 01:00:22.140 |
Arush Selvan: However, just because you build something that can be really smart, doesn't mean that it's useful to someone, right? So, if we were thinking about a use case of running a due diligence 01:00:34.120 |
Arush Selvan: on a company, the way you'd present that information to me, would be very different to the way you'd present that information to say a Goldman Sachs banker, right? 01:00:42.140 |
Arush Selvan: For me, you really want to talk through, like, what, like, what is this company, and how's its position strategically, but a banker would want to know all the financial information, actually have a DCF that they could look at, right? 01:00:53.140 |
Arush Selvan: Actually have a, have a much more, like, fine-grained, sort of, finance, financial modeling and analysis, and, and that really should shape the way in which you browse the web, right? 01:01:03.520 |
Arush Selvan: The way you browse the web, the way you frame your answer, the kind of questions you pursue, should be very personalized to kind of meeting the user where they're at. 01:01:09.520 |
Arush Selvan: I think the last part is sort of something that goes across domains of what models can do, right? 01:01:15.520 |
Arush Selvan: So not just being able to do web research with text, but being able to combine that with abilities in coding, data science, even video generation, right? 01:01:22.520 |
Arush Selvan: So coming back to this example, if you're doing a due diligence, what if it could go and do, like, a lot of statistical analysis and actually build financial models to inform the research output that it gives you, right? 01:01:32.920 |
Arush Selvan: Telling you, hey, why is this a good company or not? 01:01:34.920 |
Arush Selvan: I should say Google doesn't give financial advice, and, you know, it's not a financial advisor. 01:01:40.920 |
Arush Selvan: But yeah, and so we're really excited about the potential. 01:01:43.920 |
Arush Selvan: We think there's a ton of headroom to make research agents better. 01:01:46.920 |
Arush Selvan: And we are really glad we didn't call this Gemini Deep Dive, which was our best name before launching this feature. 01:01:56.920 |
Arush Selvan: Our next presenter is a member of technical staff at Anthropic, here to present how they build effective agents. 01:02:15.920 |
Arush Selvan: Please join me in welcoming to the stage Barry Zhang. 01:02:29.920 |
Arush Selvan: Wow, it's incredible to be on the same stage as so many people I've learned so much from. 01:02:36.920 |
Arush Selvan: My name is Barry, and today we're going to be talking about how we build effective agents. 01:02:42.920 |
Arush Selvan: About two months ago, Eric and I wrote a blog post called Building Effective Agents. 01:02:47.920 |
Arush Selvan: In there, we shared some opinionated take on what an agent is and isn't, and we give some practical learnings that we have gained along the way. 01:02:55.920 |
Arush Selvan: Today, I'd like to go deeper on three core ideas from the blog post and provide you with some personal musings at the end. 01:03:06.920 |
Arush Selvan: First, don't build agents for everything. 01:03:12.920 |
Arush Selvan: Second, keep it as simple as possible. And third, think like your agents. 01:03:15.920 |
Arush Selvan: Let's first start with a recap of how we got here. 01:03:19.920 |
Arush Selvan: Most of us probably started building very simple features, things like summarization, classification, extraction, just really simple things that felt like magic two to three years ago and have now become table stakes. 01:03:31.920 |
Arush Selvan: Then as we got more sophisticated and as products mature, we got more creative. 01:03:36.920 |
Arush Selvan: One model call often wasn't enough. 01:03:38.920 |
Arush Selvan: So we started orchestrating multiple model calls in predefined control flows. 01:03:43.920 |
Arush Selvan: This basically gave us a way to trade off cost and latency for better performance. 01:03:50.920 |
Arush Selvan: We believe this is the beginning of agentic systems. 01:03:54.920 |
Arush Selvan: Now models are even more capable and we're seeing more and more domain specific agents start to pop up in production. 01:04:04.920 |
Arush Selvan: Unlike workflows, agents can decide their own trajectory and operate almost independently based on environment feedback. 01:04:11.920 |
Arush Selvan: This is going to be our focus today. 01:04:14.920 |
Arush Selvan: It's probably a little bit too early to name what the next phase of agentic system is going to look like, especially in production. 01:04:21.920 |
Arush Selvan: Single agents could become a lot more general purpose and more capable, or we can start to see collaboration and delegation in multi-agent settings. 01:04:28.920 |
Arush Selvan: Regardless, I think the broad trend here is that as we give these systems a lot more agency, they become more useful and more capable. 01:04:37.920 |
Arush Selvan: But as a result, the cost, the latency, the consequences of errors also go up. 01:04:42.920 |
Arush Selvan: And that brings us to the first point: Don't build agents for everything. 01:04:50.920 |
Arush Selvan: We think of agents as a way to scale complex and valuable tasks. 01:04:54.920 |
Arush Selvan: They shouldn't be a drop-in upgrade for every use case. 01:04:58.920 |
Arush Selvan: If you have read the blog post, you'll know that we talked a lot about workflows. 01:05:02.920 |
Arush Selvan: And that's because we really like them, and they're a great, concrete way to deliver value today. 01:05:08.920 |
Arush Selvan: Well, so when should you build an agent? 01:05:13.920 |
Arush Selvan: The first thing to consider is the complexity of your task. 01:05:17.920 |
Arush Selvan: Agents really thrive in ambiguous problem spaces, and if you can map out the entire decision tree pretty easily, just build that explicitly and then optimize every node of that decision tree. 01:05:28.920 |
Arush Selvan: It's a lot more cost-effective, and it's going to give you a lot more control. 01:05:32.920 |
Arush Selvan: Next thing to consider is the value of your task. 01:05:36.920 |
Arush Selvan: The exploration I just mentioned is going to cost you a lot of tokens. 01:05:39.920 |
Arush Selvan: So the task really needs to justify the cost. 01:05:42.920 |
Arush Selvan: If your budget per task is around 10 cents, for example, because you're building a high-volume customer support system, 01:05:49.920 |
Arush Selvan: that only affords you 30 to 50 thousand tokens. 01:05:53.920 |
Arush Selvan: In that case, just use a workflow to solve the most common scenarios, and you're able to capture the majority of the value from there. 01:06:00.920 |
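The arithmetic behind that token figure, with assumed per-token prices rather than quoted rates:

```python
# Back-of-the-envelope version of the budget argument. Prices are assumptions.
budget_usd = 0.10
for price_per_million in (2.00, 3.00):          # assumed blended $/1M tokens
    tokens = budget_usd / price_per_million * 1_000_000
    print(f"${price_per_million}/1M tokens -> ~{tokens:,.0f} tokens per task")
# ~50,000 and ~33,333 tokens: roughly the 30-50k range mentioned in the talk.
```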
Arush Selvan: On the other hand, though, if you look at this question and your first thought is, 01:06:04.920 |
Arush Selvan: I don't care how many tokens I spend, I just want to get the task done. 01:06:09.920 |
Arush Selvan: Our go-to-market team would love to speak with you. 01:06:11.920 |
Arush Selvan: From there, we want to de-risk the critical capabilities. 01:06:16.920 |
Arush Selvan: This is to make sure that there aren't any significant bottlenecks in the agent's trajectory. 01:06:21.920 |
Arush Selvan: If you're doing a coding agent, you want to make sure it's able to write good code, it's able to debug, and it's able to recover from its errors. 01:06:28.920 |
Arush Selvan: If you do have bottlenecks, that's probably not going to be fatal, but they will multiply your cost and latency. 01:06:35.920 |
Arush Selvan: So in that case, we normally just reduce the scope, simplify the task, and try again. 01:06:40.920 |
Arush Selvan: Finally, the last important thing to consider is the cost of error and error discovery. 01:06:48.920 |
Arush Selvan: If your errors are going to be high stake and very hard to discover, it's going to be very difficult for you to trust the agent to take actions on your behalf and to have more autonomy. 01:06:58.920 |
Arush Selvan: You can always mitigate this by limiting the scope, right? 01:07:01.920 |
Arush Selvan: You can have read-only access, you can have more human in the loop, but this will also limit how well you're able to scale your agent in your use case. 01:07:09.920 |
Arush Selvan: Let's see this checklist in action. 01:07:13.920 |
Arush Selvan: Why is coding a great agent use case? 01:07:16.920 |
Arush Selvan: First, to go from design doc to a PR is obviously a very ambiguous and very complex task. 01:07:22.920 |
Arush Selvan: And second, a lot of us are developers here, so we know that good code has a lot of value. 01:07:28.920 |
Arush Selvan: And third, many of us already use Claude for coding, so we know that it's great at many parts of the coding workflow. 01:07:36.920 |
Arush Selvan: And last, coding has this really nice property where the output is easily verifiable through unit test and CI. 01:07:43.920 |
Arush Selvan: And that's probably why we're seeing so many creative and successful coding agents right now. 01:07:49.920 |
Arush Selvan: Once you find a good use case for agents, this is the second core idea, which is to keep it as simple as possible. 01:08:02.920 |
Arush Selvan: This is what agents look like to us. 01:08:05.920 |
Arush Selvan: They're models using tools in a loop. 01:08:08.920 |
Arush Selvan: In this frame, three components define what an agent really looks like. 01:08:15.920 |
Arush Selvan: First, there's the environment: the system that the agent is operating in. 01:08:17.920 |
Arush Selvan: Then we have a set of tools, which offer an interface for the agent to take action and get feedback. 01:08:23.920 |
Arush Selvan: Then we have the system prompt, which defines the goals, the constraints, and the ideal behavior for the agent to actually work in this environment. 01:08:32.920 |
Arush Selvan: Then the model gets called in a loop, and that's agents. 01:08:38.920 |
Arush Selvan: We have learned the hard way to keep this simple because any complexity upfront is really going to kill iteration speed. 01:08:45.920 |
Arush Selvan: Iterating on just these three basic components is going to give you by far the highest ROI, and optimizations can come later. 01:08:52.920 |
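For concreteness, a minimal sketch of "a model using tools in a loop" built from those three components. `call_model` and the reply protocol are hypothetical placeholders, not Anthropic's internal code.

```python
def run_agent(task: str, system_prompt: str, tools: dict, call_model, max_steps: int = 20):
    """Call a model in a loop; each turn it either uses a tool or returns a final answer."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages, tools=list(tools))   # model decides what to do next
        if reply["type"] == "final_answer":
            return reply["content"]
        tool = tools[reply["tool_name"]]                  # take an action in the environment
        result = tool(**reply["tool_args"])
        messages.append({"role": "assistant", "content": str(reply)})
        messages.append({"role": "tool", "content": str(result)})  # feedback for the next step
    raise RuntimeError("agent did not finish within the step budget")
```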
Arush Selvan: Here are examples of three agent use cases that we have built for ourselves or our customers, just to make it more concrete. 01:09:02.920 |
Arush Selvan: They're going to look very different on the product surface. 01:09:04.920 |
Arush Selvan: They're going to look very different in their scope. 01:09:06.920 |
Arush Selvan: They're going to look different in their capability. 01:09:08.920 |
Arush Selvan: But they share almost exactly the same backbone. 01:09:11.920 |
Arush Selvan: They actually share almost the exact same code. 01:09:14.920 |
Arush Selvan: The environment largely depends on your use case. 01:09:18.920 |
Arush Selvan: So really, the only two design decisions is what are the set of tools you want to offer to the agent, and what is the prompt that you want to instruct your agent to follow? 01:09:29.920 |
Arush Selvan: On this note, if you want to learn more about tools, my friend Mahesh is going to be giving a workshop on model context protocol, MCP, tomorrow morning. 01:09:36.920 |
Arush Selvan: I've seen that workshop. It's going to be really fun, so I highly encourage you guys to check that out. 01:09:43.920 |
Arush Selvan: Once you have figured out these three basic components, you have a lot of optimization to do from there. 01:09:48.920 |
Arush Selvan: For coding and computer use, you might want to cache the trajectory to reduce cost. 01:09:54.920 |
Arush Selvan: For search, where you have a lot of tool calls, you can parallelize a lot of those to reduce latency. 01:09:59.920 |
Arush Selvan: And for almost all of these, we want to make sure to present the agent's progress in such a way that gains user trust. 01:10:05.920 |
Arush Selvan: But that's it. Keep it as simple as possible as you're iterating, build these three components first, and then optimize once you have the behaviors down. 01:10:14.920 |
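A hedged sketch of two of the optimizations just mentioned, caching repeated prompts and parallelizing independent tool calls; the function names are placeholders, not a specific SDK's API.

```python
from concurrent.futures import ThreadPoolExecutor

def make_cached_model(call_model):
    """Wrap a model call with a simple in-process cache keyed on the full prompt."""
    cache: dict[str, str] = {}
    def cached(prompt: str) -> str:
        if prompt not in cache:
            cache[prompt] = call_model(prompt)   # only pay for novel prompts
        return cache[prompt]
    return cached

def parallel_searches(queries, search_tool, max_workers: int = 8):
    """Run independent search tool calls concurrently to cut latency."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search_tool, queries))
```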
Arush Selvan: All right. This is the last idea. 01:10:19.920 |
Arush Selvan: It's to think like your agents. 01:10:21.920 |
Arush Selvan: I've seen a lot of builders, and myself included, who develop agents from our own perspectives and get confused when agents make a mistake. 01:10:30.920 |
Arush Selvan: It seems counterintuitive to us. 01:10:32.920 |
Arush Selvan: And that's why we always recommend to put yourself in the agent's context window. 01:10:37.920 |
Arush Selvan: Agents can exhibit some really sophisticated behavior. It can look incredibly complex. 01:10:43.920 |
Arush Selvan: But at each step, what the model is doing is still just running inference on a very limited set of contexts. 01:10:49.920 |
Arush Selvan: Everything that the model knows about the current state of the world is going to be explained in that 10 to 20k tokens. 01:10:56.920 |
Arush Selvan: And it's really helpful to limit ourselves in that context and see if it's actually sufficient and coherent. 01:11:03.920 |
Arush Selvan: This will give you a much better understanding of how agents see the world and then kind of bridge the gap between our understanding and theirs. 01:11:11.920 |
Arush Selvan: Let's imagine for a second that we're computer use agents now and then see what that feels like. 01:11:18.920 |
Arush Selvan: All we're going to get is a static screenshot and a very poorly written description. 01:11:23.920 |
Arush Selvan: This is about yours truly. Let's read through it. You know, you're a computer use agent. 01:11:27.920 |
Arush Selvan: You have a set of tools and you have a task. Terrible. 01:11:31.920 |
Arush Selvan: We can think and talk and reason what we want, but the only thing that's going to take effect in the environment are our tools. 01:11:38.920 |
Arush Selvan: So we attempt a click without really seeing what's happening and while the inference is happening, while the tool execution is happening, 01:11:46.920 |
Arush Selvan: This is basically equivalent to us closing our eyes for three to five seconds and using the computer in the dark. 01:11:52.920 |
Arush Selvan: Then you open up your eyes and you see another screenshot. Whatever you did could have worked or you could have shut down the computer. 01:11:59.920 |
Arush Selvan: You just don't know. This is a huge leap of faith, and the cycle kind of starts again. 01:12:04.920 |
Arush Selvan: I highly recommend just trying - try doing a full task from the agent's perspective like this. 01:12:10.920 |
Arush Selvan: I promise you it's a fascinating and only mildly uncomfortable experience. 01:12:14.920 |
Arush Selvan: However, once you go through that mildly uncomfortable experience, I think it becomes very clear what the agents would have actually needed. 01:12:24.920 |
Arush Selvan: It's clearly very crucial to know what the screen resolution is so I know how to click. 01:12:30.920 |
Arush Selvan: It's also good to have recommended actions and limitations just so that, you know, we can put some guardrails around what we should be exploring and we can avoid unnecessary exploration. 01:12:41.920 |
Arush Selvan: These are just some examples and, you know, do this exercise for your own agent use case and figure out what kind of context do you actually want to provide for the agent. 01:12:51.920 |
Arush Selvan: Fortunately, though, we are building systems that speak our language. So we could just ask Claude to understand Claude. 01:12:59.920 |
Arush Selvan: You can throw in your system prompt and ask, well, is any of this instruction ambiguous? Does it make sense to you? Are you able to follow this? 01:13:06.920 |
Arush Selvan: You can throw in your tool description and see whether the agent knows how to use the tool. You can see if it wants more parameters, fewer parameters. 01:13:13.920 |
Arush Selvan: And one thing that we do quite frequently is we throw the entire agent's trajectory into Claude and just ask it, hey, why do you think we made this decision right here? 01:13:23.920 |
Arush Selvan: And is there anything that we can do to help you make better decisions? 01:13:26.920 |
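One possible shape for that kind of self-critique pass, as a sketch under my own assumptions; `client.complete` stands in for whichever chat API you use.

```python
CRITIQUE_PROMPT = """Here are the system prompt, tool descriptions, and trajectory
of an agent run. Answer as the agent:
1. Is any instruction ambiguous or contradictory?
2. For each tool: do you know when to use it? Are parameters missing or unnecessary?
3. At step {step}: why do you think this decision was made, and what extra context
   would have led to a better one?

{agent_context}
"""

def critique_agent_run(client, system_prompt: str, tools: str, trajectory: str, step: int) -> str:
    """Ask the model to critique its own instructions, tools, and decisions."""
    context = "\n\n".join([system_prompt, tools, trajectory])
    return client.complete(CRITIQUE_PROMPT.format(step=step, agent_context=context))
```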
Arush Selvan: This shouldn't replace your own understanding of the context, but it will help you gain a much closer perspective on how the agent is seeing the world. 01:13:34.920 |
Arush Selvan: So once again, think like your agent as you're iterating. 01:13:38.920 |
Arush Selvan: All right, I've spent most of the talk talking about very practical stuff. I'm going to indulge myself and spend one slide on personal musings. 01:13:48.920 |
Arush Selvan: This is going to be my view on how this might evolve and some open questions I think we need to answer together as AI engineers. 01:13:55.920 |
Arush Selvan: These are the top three things that are always on my mind. First, I think we need to make agents a lot more budget aware. 01:14:02.920 |
Arush Selvan: Unlike workflows, we don't really have a great sense of control for the cost and latency for agents. I think figuring this out will enable a lot more use cases as it gives us a necessary control to deploy them in production. 01:14:16.920 |
Arush Selvan: The open question is just what's the best way to define and enforce budgets in terms of time, in terms of money, in terms of tokens, the things that we care about. 01:14:24.920 |
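One possible shape for that budget enforcement, a sketch of my own rather than an Anthropic proposal: track tokens, dollars, and wall-clock time, and stop or degrade gracefully when any limit is hit.

```python
import time

class BudgetGuard:
    """Track token, dollar, and time budgets for an agent run."""

    def __init__(self, max_tokens: int, max_usd: float, max_seconds: float):
        self.limits = {"tokens": max_tokens, "usd": max_usd, "seconds": max_seconds}
        self.spent = {"tokens": 0, "usd": 0.0}
        self.start = time.monotonic()

    def record(self, tokens: int, usd: float):
        self.spent["tokens"] += tokens
        self.spent["usd"] += usd

    def exceeded(self):
        """Return the name of the first exhausted budget, or None."""
        if self.spent["tokens"] >= self.limits["tokens"]:
            return "tokens"
        if self.spent["usd"] >= self.limits["usd"]:
            return "usd"
        if time.monotonic() - self.start >= self.limits["seconds"]:
            return "seconds"
        return None

# In the agent loop: if guard.exceeded(): return the best partial answer instead of continuing.
```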
Arush Selvan: Next up is this concept of self-evolving tools. I've already hinted at this two slides ago, but we are already using models to help iterate on the tool description. 01:14:34.920 |
Arush Selvan: But this should generalize pretty well into a meta tool where agents can design and improve their own tool ergonomics. 01:14:41.920 |
Arush Selvan: This will make agents a lot more general purpose as they can adopt the tools that they need for each use case. 01:14:48.920 |
Arush Selvan: Finally, I don't even think this is a hot take anymore. I have a personal conviction that we will see a lot more multi-agent collaborations in production by the end of this year. 01:14:59.920 |
Arush Selvan: They're well-parallelized, they have very nice separation of concerns, and having sub-agent, for example, will really protect the main agent's context window. 01:15:10.920 |
Arush Selvan: But I think a big open question here is how do these agents actually communicate with each other? 01:15:17.920 |
Arush Selvan: We're currently in this very rigid frame of having mostly synchronous user assistant turns, and I think most of our systems are built around that. 01:15:26.920 |
Arush Selvan: So how do we expand from there and build an asynchronous communication, and enable more roles that afford agents to communicate with each other and recognize each other? 01:15:34.920 |
Arush Selvan: I think that's going to be a big open question as we explore this more multi-agent future. 01:15:40.920 |
Arush Selvan: These are the areas that take up a lot of my mind space. If you're also thinking about this, please shoot me a text. I would love to chat. 01:15:49.920 |
Arush Selvan: Okay, let's bring it all together. If you forget everything I said today, these are the three takeaways. 01:15:55.920 |
Arush Selvan: First, don't build agents for everything. If you do find a good use case and want to build an agent, keep it as simple for as long as possible. 01:16:03.920 |
Arush Selvan: And finally, as you iterate, try to think like your agent, gain their perspective, and help them do their job. 01:16:09.920 |
Arush Selvan: I would love to keep in touch with every one of you. If you want to chat about agents, especially those open questions that I talked about, it would be incredibly lovely to just, you know, jam on some of these ideas. 01:16:24.920 |
Arush Selvan: These are my socials if you want to get connected. And I'm going to end the presentation on a personal anecdote. 01:16:30.920 |
Arush Selvan: So back in 2023, I was building an AI product at Meta. And we had this funny thing where we could change our job description to anything we want. 01:16:38.920 |
Arush Selvan: After reading that blog post from swyx, I decided I was going to be the first AI engineer. I really love the focus on practicality and just making AI actually useful to the world. And I think that aspiration brought me here today. So I hope you enjoy the rest of the AI Engineer Summit. And in the meantime, let's keep building. Thank you. 01:19:26.920 |
Arush Selvan: So if you zoom in on the bottom right, you can see I'm actually down there. 01:19:29.920 |
Arush Selvan: I was working at Google with a bunch of amazing computer vision engineers. 01:19:34.860 |
Arush Selvan: And what that meant in 2016 is we were really trying to help computers understand 01:19:39.980 |
the difference between chihuahuas and blueberry muffins. 01:19:43.980 |
Arush Selvan: And, you know, it's not actually that simple. 01:19:48.940 |
Arush Selvan: It's not just chihuahuas and blueberry muffins. 01:19:50.940 |
Arush Selvan: You know, it's dogs and bagels, dogs and mops, and, of course, dogs and 01:19:57.880 |
Arush Selvan: And so in other words, what we were doing is we were building the first version 01:20:03.680 |
Arush Selvan: And at this time, I lived in New York City. 01:20:05.460 |
I was in the East Village, and I had about a 30-minute walk to work. 01:20:09.100 |
And on my walk, I would see a bunch of stuff. 01:20:10.640 |
New York's one of the greatest walking cities in the world. 01:20:15.820 |
Or, "Oh, I wonder if that bookstore is nice." 01:20:23.440 |
Arush Selvan: And so there were also a bunch of flowers on the walk. 01:20:26.840 |
Arush Selvan: At this time, Google Lens was in its infancy. 01:20:28.840 |
Arush Selvan: And one of the very few things that computer vision models were actually good 01:20:32.580 |
at that had some consumer application was identifying plants. 01:20:36.480 |
Arush Selvan: You might still know this today. 01:20:37.480 |
Arush Selvan: It's kind of in the, you know, "Is that bug poisonous?" category. 01:20:40.840 |
Arush Selvan: And so I'd ask questions on the walk, like, you know, "Can it tell the color 01:20:45.240 |
of the plant in addition to the species?" or, "What's that? 01:20:50.980 |
Arush Selvan: And there's a bunch of flower shops on this walk, so I'd even walk in. 01:20:54.300 |
And these are all actually photos from 2016 from my walks to work. 01:20:57.880 |
Arush Selvan: And I would go in and test them all out. 01:21:00.240 |
Arush Selvan: And as you can imagine, you know, sometimes it was accurate. 01:21:04.980 |
Arush Selvan: And sometimes, you know, it wasn't necessarily wrong, but it wasn't really on 01:21:10.660 |
Arush Selvan: And so it felt like a slot machine. 01:21:12.700 |
Arush Selvan: And I think everyone here who's building with AI can probably understand that 01:21:15.860 |
feeling of, "Oh, it worked five times in a row. 01:21:21.040 |
Arush Selvan: Whether it's the non-determinism of the inputs or the non-determinism of the outputs, 01:21:25.320 |
that's just part of what it means to be building with AI. 01:21:28.320 |
Arush Selvan: So let's fast forward a bit to present day. 01:21:33.320 |
Arush Selvan: You can not only search what you see, you can also shop what you see. 01:21:36.520 |
Arush Selvan: You can do this on Google Images, on YouTube, you can do it with your camera. 01:21:40.140 |
Arush Selvan: You can translate non-Latin character sets into English, so you can read the washing 01:21:44.160 |
machine in Tokyo and actually figure out what settings in your Airbnb you should use. 01:21:50.140 |
Arush Selvan: I'm a little bit too old to have benefited from this, but apparently it's a brave 01:21:56.080 |
Arush Selvan: And of course, this is from the Google Lens homepage, you can still identify flowers. 01:22:01.920 |
Arush Selvan: So this is all very mind blowing. 01:22:03.940 |
Arush Selvan: But in my opinion, it comes down to consistent step by step iteration over a 01:22:09.740 |
Arush Selvan: And when we think about what drives this, we're all engineers in the room, we understand 01:22:14.160 |
that you need a process to iteratively improve to get better without also getting worse. 01:22:18.120 |
Arush Selvan: And this, over time, has kind of been considered software development lifecycle. 01:22:22.940 |
Arush Selvan: How do you continuously improve? 01:22:24.940 |
Arush Selvan: How do you implement, test, maintain, analyze, design, and go through this as many 01:22:35.140 |
Arush Selvan: The AI caves, you know, the drawings are a little bit less sophisticated. 01:22:42.040 |
Arush Selvan: And I pulled some headlines from around this time. 01:22:43.840 |
Arush Selvan: You can see this is around when Google Brain was watching cat videos and identifying 01:22:49.980 |
them on YouTube, and it was a big breakthrough. 01:22:51.760 |
Arush Selvan: I don't know if anyone remembers how big this model was. 01:22:54.600 |
Arush Selvan: It was about a billion parameters, and this was a huge breakthrough. 01:22:58.380 |
Arush Selvan: If you think today, the frontier models are about a trillion parameters. 01:23:03.740 |
Arush Selvan: It was as if this whole room had like a quarter of a person in it. 01:23:08.400 |
Arush Selvan: And so it was still very impressive at the time. 01:23:11.540 |
Arush Selvan: There was also a theory, you know, everyone thought computers would be limited 01:23:15.780 |
Arush Selvan: I think this is a less popular theory today. 01:23:18.200 |
Arush Selvan: What I'm trying to say is it was a long time ago. 01:23:22.720 |
Arush Selvan: This is also around the time that Marc Andreessen published his famous essay 01:23:31.760 |
Arush Selvan: And that took a lot of people by storm. 01:23:33.880 |
If you looked at Stanford University on campus, you would have seen some early stage startups 01:23:39.300 |
Arush Selvan: Does anyone know which startups I'm talking about? 01:23:50.340 |
Arush Selvan: You might be thinking Snapchat. 01:23:53.160 |
Arush Selvan: I did actually hear DoorDash in the back. 01:23:58.040 |
Arush Selvan: Of course, you look like stylish people, so I think you'll know what I'm talking 01:24:04.380 |
Arush Selvan: Chubbies had a contrarian idea that was also right, which was that not only 01:24:10.720 |
was software eating the world, but teeny shorts for men were also going to take over. 01:24:15.880 |
Arush Selvan: And as I mentioned, they were correct, which you can see here, and you can also see 01:24:21.800 |
Arush Selvan: Fast forward to 2024, Kit Garten, SVP of Commercial at Chubbies, we were fortunate 01:24:31.920 |
Arush Selvan: And Chubbies has had an amazing brand since they were founded, and they've 01:24:35.500 |
always been on the forefront of customer experience. 01:24:37.920 |
Arush Selvan: They've always been thinking about how to level up and how to make the experience 01:24:41.400 |
Arush Selvan: more fun and better for their customers. 01:24:44.240 |
Arush Selvan: And so it clicked immediately for Kit. 01:24:47.280 |
Arush Selvan: That the same way you needed a website in 1995, the same way your business 01:24:50.820 |
needed a social profile and a mobile app this millennium, in 2025, you need an AI agent to 01:24:56.940 |
represent your business and to help your customers. 01:24:59.740 |
Arush Selvan: So Kit and Chubbies partnered with Sierra. 01:25:05.440 |
Arush Selvan: We came up with an AI agent, which is affectionately called Duncan Smothers. 01:25:10.720 |
Arush Selvan: First and foremost, he's incredibly capable, but almost as importantly, he's always 01:25:17.460 |
Arush Selvan: Duncan Smothers is on the Chubbies website and can help you with a variety of 01:25:22.520 |
Arush Selvan: I got permission from Kit to show some of these conversations to you today so 01:25:26.100 |
you can see what some of the Sierra interactions look like under the hood and some of the things 01:25:32.620 |
Arush Selvan: So on the left here, you have a customer asking a question about sizing and 01:25:36.900 |
Arush Selvan: Duncan is able to empathetically help them while asking questions like, "What's 01:25:41.640 |
your waist size?" and offer product recommendations. 01:25:45.380 |
Arush Selvan: At the end, he gets a thumbs up from the customer. 01:25:48.380 |
Arush Selvan: Another example, another thumbs up. 01:25:52.380 |
Arush Selvan: Duncan can tell what's in stock and help customers choose new items. 01:25:56.380 |
Arush Selvan: And then finally, package tracking and refunds. 01:26:00.380 |
Arush Selvan: So more customer love, in this case, Duncan is able to inform the customer. 01:26:05.120 |
Arush Selvan: Actually, there's a couple of different tracking numbers for your order. 01:26:08.120 |
Arush Selvan: And in the second case, issue a refund. 01:26:10.540 |
Arush Selvan: And so when we talk about autonomous agents, agents actually taking action, not just 01:26:15.040 |
answering questions, this is what we're talking about. 01:26:17.120 |
Arush Selvan: And the results for Chubbies have been they're able to help more customers more 01:26:23.500 |
Arush Selvan: The way that we get to this is because we believe at Sierra that every agent 01:26:30.580 |
Arush Selvan: That means that you can't just drag and drop a bunch of boxes. 01:26:33.800 |
Arush Selvan: You need a fully-featured developer platform. 01:26:36.020 |
Arush Selvan: You need a fully-featured customer experience operations platform in order to work 01:26:41.060 |
on this the same way you would work on your mobile app, the same way that you would work 01:26:44.200 |
on your website if you want the best results. 01:26:46.200 |
Arush Selvan: And so when Chubbies is partnering with Sierra, it's not just using the product. 01:26:51.760 |
Arush Selvan: It's actually partnering with our team. 01:26:54.080 |
Arush Selvan: And so we have dedicated agent engineering and agent product management functions 01:26:57.980 |
that you can think of sort of as forward-deployed with our customers, working closely with Kit 01:27:06.220 |
Arush Selvan: By the way, remember that face that you just saw on the last slide? 01:27:11.520 |
Arush Selvan: Was anyone here at the AI Engineer World's Fair back in June? 01:27:15.920 |
Arush Selvan: Nice, got some whoops from the audience. 01:27:21.620 |
Arush Selvan: I was there on stage introducing everyone, and the energy was electric. 01:27:24.940 |
Arush Selvan: You can see the crowd is packed. 01:27:27.060 |
Arush Selvan: When I got there, the first thing I did was I sat down at the Deepgram workshop. 01:27:31.400 |
Arush Selvan: This was about three months into me building voice agents at Sierra, and I was 01:27:38.940 |
Arush Selvan: What did they think of the latest multimodal models? 01:27:42.820 |
Arush Selvan: How are they handling tone and phrasing? 01:27:44.500 |
Arush Selvan: All of these problems that were new at the time. 01:27:46.500 |
Arush Selvan: And I sat down next to a man named Sean. 01:27:49.820 |
Arush Selvan: And Sean and I were nerding out about how to increase the speed of our developer 01:27:53.040 |
loop by using the say command on Mac and then using a program called Loopback in order to 01:27:58.020 |
pipe that into the browser so that we didn't have to wear headphones and talk and look awkward 01:28:08.020 |
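For anyone who wants to try that loop, here is a minimal sketch in Python; it assumes macOS (the built-in say command) and that you have routed system audio into the browser with something like Loopback by hand, and the utterances and function names are purely illustrative, not Sierra's code.

    import subprocess

    def speak(text: str, voice: str = "Samantha") -> None:
        # macOS's built-in `say` synthesizes speech out loud; with a virtual
        # audio device (e.g. Loopback) set as the agent's input, the browser
        # "hears" it, so you can iterate without a headset or talking aloud.
        subprocess.run(["say", "-v", voice, text], check=True)

    # Hypothetical turns for exercising a retail voice agent end to end.
    for turn in [
        "Hi, I ordered some shorts last week and they have not arrived.",
        "Sure, my order number is one two three four five.",
    ]:
        speak(turn)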
Arush Selvan: And a few months later, there we are working together in the office. 01:28:11.360 |
Arush Selvan: So when I told our company and our founders, hey, I'm going to the AI summit. 01:28:16.160 |
Arush Selvan: You know, I hope it's as productive as the last one. 01:28:20.940 |
Arush Selvan: They said, go find more Seans. 01:28:23.280 |
Arush Selvan: So I'm hopeful that people in the audience will say hi after this. 01:28:27.460 |
Arush Selvan: Whether or not you're interested in working at Sierra, I'm interested in meeting you. 01:28:31.500 |
Arush Selvan: And so I hope to meet you later today. 01:28:33.500 |
Arush Selvan: Anyway, back to Duncan Smothers. 01:28:35.760 |
Arush Selvan: The point of the software development lifecycle, the point of our agent engineering 01:28:40.880 |
team, is that even if Duncan is not perfect today, he should be getting better every single day. 01:28:46.780 |
Arush Selvan: And so what we did is we set out to build something like the software development 01:28:49.840 |
lifecycle, borrowing as many concepts as we could and inventing new ones where we needed to. 01:28:55.420 |
Arush Selvan: The issue is that large language models are like building on top of a foundation 01:28:59.960 |
Arush Selvan: And so you can't just take everything out of the box and have it just work. 01:29:04.280 |
Arush Selvan: While traditional software is deterministic, fast, cheap, and rigid, 01:29:09.400 |
governed by if statements that always follow the same logic, 01:29:12.520 |
Arush Selvan: Large language models can be non-deterministic. 01:29:20.420 |
Arush Selvan: They can reason through problems. 01:29:22.420 |
Arush Selvan: And so we wanted to create a methodology that takes advantage of all the strengths of large 01:29:27.540 |
Arush Selvan: language models and then also is able to invoke traditional software where it's helpful. 01:29:32.660 |
Arush Selvan: And that brings me to slide 78. 01:29:34.660 |
Arush Selvan: The agent development lifecycle. 01:29:37.660 |
Arush Selvan: So at Sierra, this is the process by which we build and improve AI agents. 01:29:43.660 |
Arush Selvan: You might be thinking about it like, oh, that looks kind of like the software development lifecycle. 01:29:49.780 |
Arush Selvan: And I think the devil is in the details, so I'm going to dive in a little bit. 01:29:52.960 |
Arush Selvan: It's not that these are revolutionary or innovative concepts. 01:29:56.260 |
Arush Selvan: It's that each one of them involves iterative refinement with customers in production 01:30:01.660 |
Arush Selvan: To make it as productive and as bulletproof as possible. 01:30:05.660 |
Arush Selvan: So if we dig into quality assurance, for example: if you work at one of our customer companies, you have access to Sierra's experience manager. 01:30:14.660 |
Arush Selvan: What that means is that you can dive in and look at every conversation, and you can look at high-level reports of how the agent is performing in real time. 01:30:24.700 |
Arush Selvan: So, for example, if Duncan Smothers has incorrect inventory, maybe it made one API call to one warehouse. 01:30:31.620 |
Arush Selvan: But it didn't make all the API calls that it needed to or one of them timed out, whatever it may be. 01:30:38.620 |
Arush Selvan: It then will lead to an issue being filed, which leads to a test being created. 01:30:43.620 |
Arush Selvan: And then once that test is passing, we can make a new release. 01:30:46.620 |
Arush Selvan: And over the course of time, a Sierra agent will go from having a handful of tests at launch to hundreds and then thousands of tests as it improves. 01:30:55.620 |
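As a rough sketch of what one of those tests could look like, here is a hypothetical regression test for the inventory incident described above; the agent object, fixtures, and tool names are assumptions for illustration, not Sierra's actual API.

    # Hypothetical pytest-style regression test created from a filed issue:
    # the agent must query every warehouse before answering a stock question.
    def test_stock_answer_queries_all_warehouses(agent, fake_warehouses):
        # fake_warehouses is an assumed fixture that stubs each warehouse API
        # and records which warehouses were actually called.
        reply = agent.run("Do you have the classic 5.5 inch shorts in a 32?")

        called = {call.warehouse_id for call in fake_warehouses.calls}
        assert called == set(fake_warehouses.all_ids), "agent skipped a warehouse"
        assert "stock" in reply.text.lower()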
Arush Selvan: Another example here is it's not always that the agent is making a mistake. 01:31:00.620 |
Arush Selvan: Sometimes there's an opportunity to go above and beyond. 01:31:03.620 |
Arush Selvan: Chubbies actually gives each of its agents a budget in order to delight customers. 01:31:08.620 |
Arush Selvan: And so in this case, Duncan Smothers could actually, you know, DoorDash the shorts from a retail location if they're not available online. 01:31:16.620 |
Arush Selvan: So this is the agent development lifecycle at work. 01:31:20.620 |
Arush Selvan: But the thing is, a year ago, we were doing this all manually. 01:31:25.620 |
Arush Selvan: This was kind of early on in the history of Sierra, and we were learning what works at each of these stages. 01:31:31.620 |
Arush Selvan: And with the improvements to AI, we're actually able to add AI to each part of this lifecycle and speed up the improvements in the present day. 01:31:42.620 |
Arush Selvan: But it's bigger than just Duncan. 01:31:45.620 |
Arush Selvan: The agent development lifecycle is more effective the larger the customer is. 01:31:49.620 |
Arush Selvan: And while Duncan handles hundreds of thousands of requests, we have customers that are doing tens of millions. 01:31:54.620 |
Arush Selvan: So velocity and change management are all the more valuable when you're that big. 01:32:01.620 |
Arush Selvan: And the change also comes from everywhere. 01:32:05.620 |
Arush Selvan: It's not just that, oh, there's an issue with the agent and we need to improve it. 01:32:09.620 |
Arush Selvan: There's tons of stuff going on outside. 01:32:11.620 |
Arush Selvan: There's all those graphs at the beginning of this presentation showing how fast our space is moving. 01:32:16.620 |
Arush Selvan: You have models being upgraded. 01:32:17.620 |
Arush Selvan: You have new paradigms like reasoning models. 01:32:20.620 |
Arush Selvan: You have multimodality and more and more. 01:32:24.620 |
Arush Selvan: When we think about how these impact the agent development lifecycle, reasoning models are a force multiplier for each step. 01:32:30.620 |
Arush Selvan: We're actually able to be more effective applying AI to development, to testing, to QA, and every step in between. 01:32:37.620 |
Arush Selvan: Now, another one that's near and dear to my heart is voice. I mentioned the Deepgram workshop eight months ago, which was an accelerant in my understanding of the voice landscape. 01:32:49.620 |
Arush Selvan: And I started working on this about a year ago. 01:32:51.620 |
Arush Selvan: And in October, we were able to launch voice generally available at Sierra. 01:32:56.620 |
Arush Selvan: One of our large customers that has benefited from the agent development lifecycle, with tens of millions of customers in the United States, is SiriusXM. 01:33:06.620 |
Arush Selvan: And with Sierra's voice capabilities, they're able to pick up the phone right away every time to answer their customers. 01:33:13.620 |
Arush Selvan: The way that we think about voice, I think, is similar to the way that we think about web development today. 01:33:20.620 |
Arush Selvan: If you remember 10, 15 years ago, a lot of websites were, you know, m.website.com. 01:33:27.620 |
Arush Selvan: You had two separate websites for mobile phones and for desktops. 01:33:31.620 |
Arush Selvan: And then we graduated to responsive design. 01:33:33.620 |
Arush Selvan: And this is how we think about our AI agents at Sierra, too. 01:33:37.620 |
Arush Selvan: Under the hood, it's the same platform, it's the same agent code, but it's able to be responsive to whatever channel someone reaches out in, and whatever modality you're operating in. 01:33:47.620 |
Arush Selvan: Of course, you can still customize the same way you might have a different layout. 01:33:52.620 |
Arush Selvan: You can still have different phrasing. 01:33:54.620 |
Arush Selvan: You can still parallelize requests to achieve lower latency. 01:33:57.620 |
Arush Selvan: But it basically just works out of the box. 01:34:00.620 |
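A toy sketch of that responsive idea, with made-up names: one shared agent core, thin per-channel adapters, and parallelized tool calls on the latency-sensitive voice path.

    import asyncio

    async def agent_core(message: str, tools: dict) -> str:
        # One agent, one set of tools; the latency-critical lookups run in
        # parallel rather than one after another.
        order, stock = await asyncio.gather(
            tools["order_status"](message),
            tools["inventory"](message),
        )
        return f"Your order is {order}. The item you asked about is {stock}."

    async def handle(message: str, channel: str, tools: dict) -> str:
        reply = await agent_core(message, tools)
        if channel == "voice":
            # Voice wants short, speakable sentences; chat can keep the detail.
            return reply.split(".")[0] + "."
        return reply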
Arush Selvan: I'll close with a few thoughts. 01:34:03.620 |
Arush Selvan: This is something I've been thinking about a lot lately. 01:34:05.620 |
Arush Selvan: One of the most fascinating and fun parts about building with AI is that large language models remind us of ourselves. 01:34:12.620 |
Arush Selvan: In short, they're unpredictable. 01:34:18.620 |
Arush Selvan: And they're not that great at math. 01:34:20.620 |
Arush Selvan: But also, it allows us to be great designers by having empathy in a way that we probably couldn't ever before with computers. 01:34:29.620 |
Arush Selvan: And so you can actually put yourself in the shoes of the robot. 01:34:34.620 |
Arush Selvan: You can put yourself in the, I don't know, primordial soup of the jello. 01:34:39.620 |
Arush Selvan: And you can think about what it would mean to actually build a good experience. 01:34:43.620 |
Arush Selvan: And as someone who's building voice agents, and a bunch of you, I bet, in the audience are, I know there's kind of this thought on, are these multimodal agents the real deal? 01:34:52.620 |
Arush Selvan: You know, should I just kind of wire everything together and hope it works? 01:34:56.620 |
Arush Selvan: And the question I've been asking myself a lot lately, and what our results have kind of shown us is, you know, how would you do if someone just passed you transcribed text of your conversation partner with a few hundred milliseconds of delay, and then you had to respond on the spot? 01:35:10.620 |
Arush Selvan: And so what we're building at Sierra is much more robust and very exciting to me, and I hope to talk to you all about it. 01:35:18.620 |
Arush Selvan: I think on my badge it says voice-to-voice models are the thing that I'm excited about. 01:35:22.620 |
Arush Selvan: And so here is kind of a sense of the robustness and the richness of what you can create when you let large language models have the same inputs and same experiences that humans have. 01:35:34.620 |
Arush Selvan: And so thank you for your time today. 01:35:37.620 |
Arush Selvan: I look forward to a lot of engaging discussions, and it's great to talk to you all. 01:35:42.620 |
Arush Selvan: Our next presenter is a researcher at Morgan Stanley. 01:35:56.620 |
Arush Selvan: Please join me in welcoming to the stage, Will Brown. 01:36:11.620 |
Arush Selvan: Thanks, Rick and the whole AI engineer conference team for putting this together and having me. 01:36:17.620 |
Arush Selvan: I am a machine learning researcher at Morgan Stanley. 01:36:19.620 |
Arush Selvan: And today I want to talk to you all a bit about what I think reinforcement learning, or RL, means for agents. 01:36:26.620 |
Arush Selvan: So I was in grad school at Columbia for a while and I mostly worked on theory for multi-agent reinforcement learning. 01:36:31.620 |
Arush Selvan: And over the past couple of years I have been working at Morgan Stanley on a wide range of LLM related projects, some of which look kind of like agents, but I will not really be talking too much about that today. 01:36:41.620 |
Arush Selvan: I'm also relatively active on X, the everything app, and that will become relevant later in the talk. 01:36:47.620 |
Arush Selvan: This talk I think will be probably a little different from most of the talks at the conference. 01:36:51.620 |
Arush Selvan: It's not about things we ship to prod. 01:36:53.620 |
Arush Selvan: It's not about things that definitely work and you should go do tomorrow that are like proven science or best practices. 01:37:00.620 |
Arush Selvan: It's about where we might be headed and I want to really just tell a story that will synthesize some things that have been happening in the broader research community. 01:37:09.620 |
Arush Selvan: And where these trends might be pointing, do some speculation, and also talk about some recent open source work of my own. 01:37:16.620 |
Arush Selvan: And the goal of this is to help you plan and understand what reinforcement learning means, what it means for agents, and how to best be ready for a potential future which may involve reinforcement learning as part of the agent engineering loop. 01:37:35.620 |
Arush Selvan: Most LLMs that we work with are essentially chatbots. 01:37:38.620 |
Arush Selvan: I think it's helpful to think about OpenAI's five levels framework here. 01:37:42.620 |
Arush Selvan: So, we did pretty well with chatbots. 01:37:44.620 |
Arush Selvan: Seems like we're doing pretty well with reasoners. 01:37:47.620 |
Arush Selvan: These are great models for question and answer. 01:37:49.620 |
Arush Selvan: They're very helpful for interactive problem solving. 01:37:51.620 |
Arush Selvan: We have the o1, o3, R1, Grok 3, Gemini, et cetera, models that are really good at kind of thinking longer. 01:37:58.620 |
Arush Selvan: And we're trying to figure out how we take all of this and make agents level three. 01:38:03.620 |
Arush Selvan: And these are systems that are taking actions. 01:38:05.620 |
Arush Selvan: These are systems that are doing things that are longer and harder and more complex. 01:38:09.620 |
Arush Selvan: And currently, the way we tend to do this is chaining together multiple calls to these underlying chatbot or reasoner LLMs. 01:38:16.620 |
Arush Selvan: And we do lots of things like prompt engineering, tool calling, evals, ops, giving the models tools of their own to use, having humans in the loop. 01:38:24.620 |
Arush Selvan: And the results are like pretty good. 01:38:26.620 |
Arush Selvan: There's a lot of things that we can do. 01:38:27.620 |
Arush Selvan: And then there's a lot of stuff that it feels like is around the corner that we're all imagining about AGI. 01:38:33.620 |
Arush Selvan: But we're not really to the point yet where these things are going off and doing the things that we would imagine an AGI is really doing to the degree of autonomy that that would, I presume, entail. 01:38:45.620 |
Arush Selvan: So I think it's useful a bit to distinguish between agents and pipelines. 01:38:50.620 |
Arush Selvan: I think Barry's talk earlier was a good way to kind of frame this. 01:38:52.620 |
Arush Selvan: I'm going to use pipelines to encapsulate what Barry called workflows. 01:38:55.620 |
Arush Selvan: And I think these are really systems with fairly low degrees of autonomy. 01:39:00.620 |
Arush Selvan: And there's a very nontrivial amount of engineering required to determine these decision trees to say, how does one action or call flow into another? 01:39:11.620 |
Arush Selvan: And it seems like a lot of the winning apps in the agent space have very tight feedback loops. 01:39:17.620 |
Arush Selvan: And so whether or not you want to call these agents or pipelines, these are things where a user is interacting with some sort of interface. 01:39:24.620 |
Arush Selvan: The thing will do some stuff and come back relatively quickly. 01:39:27.620 |
Arush Selvan: Things like the IDEs like Cursor, Windsurf, and Replit. 01:39:30.620 |
Arush Selvan: And search tools that are really good at harder question answer, maybe with some web search or research integrated. 01:39:35.620 |
Arush Selvan: But there's not that many agents nowadays that will go off and do stuff for more than 10 minutes at a time. 01:39:40.620 |
Arush Selvan: I think Devin, Operator, and OpenAI's deep research are the three that really come to mind as feeling a little more in the autonomous agent direction. 01:39:49.620 |
Arush Selvan: And I think a lot of us might be wondering, how do we make more of these? 01:39:52.620 |
Arush Selvan: And the kind of traditional wisdom is like, okay, we'll just wait for better models. 01:39:57.620 |
Arush Selvan: Once better models are around, we can just like use those, we'll be good. 01:40:02.620 |
Arush Selvan: But I think it's also worth taking note of the traditional definition of reinforcement learning, 01:40:06.620 |
Arush Selvan: and what an agent means there, which is this idea of a thing that is interacting with an environment with a goal. 01:40:12.620 |
Arush Selvan: And the system is designed to learn how to get better at that goal over time via repeated interaction with the environment. 01:40:19.620 |
Arush Selvan: And I think this is something that a lot of us are either doing manually or don't really have the tools to do, 01:40:24.620 |
Arush Selvan: which is: once we have our thing set up to make the calls we want, and the performance is like 70%, 01:40:31.620 |
Arush Selvan: and we've done a lot of prompt tuning and we want to get it up to 90%, 01:40:34.620 |
Arush Selvan: but we just don't have the models to do it, or the models struggle to get there. 01:40:40.620 |
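For reference, that textbook agent-environment loop looks like this in a minimal sketch, with abstract env and agent objects standing in for whatever system you are training.

    def run_episode(env, agent):
        # Classic reinforcement learning: act, observe a reward, update,
        # and get better at the goal through repeated interaction.
        observation = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(observation)               # explore or exploit
            observation, reward, done = env.step(action)  # environment responds
            agent.update(observation, action, reward)     # learn from feedback
            total_reward += reward
        return total_reward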
Arush Selvan: And so in terms of model trends, I think I won't spend too much time talking about this. 01:40:44.620 |
Arush Selvan: But pre-training seems to be having diminishing returns to capital at least. 01:40:48.620 |
Arush Selvan: We're still seeing kind of like loss go down, but it does kind of feel like we need new tricks. 01:40:53.620 |
Arush Selvan: Reinforcement learning from human feedback is great for making kind of friendly chatbots. 01:40:58.620 |
Arush Selvan: But it doesn't really seem to be continually pushing us at the frontier of smarter and smarter and smarter models. 01:41:05.620 |
Arush Selvan: We talk a lot about synthetic data and I think synthetic data is great for distilling larger models down into smaller models to have kind of really tiny models that are really performant. 01:41:14.620 |
Arush Selvan: But on its own, it doesn't seem to be an unlock for like massive capabilities getting better and better. 01:41:21.620 |
Arush Selvan: Unless we throw in verification in the loop or rejection sampling or any of these things. 01:41:25.620 |
Arush Selvan: And that really takes us to the world of reinforcement learning, where this seems to be the trick that unlocked test-time scaling for o1 and R1. 01:41:32.620 |
Arush Selvan: It's not bottlenecked by needing manually curated human data and it does seem to actually work. 01:41:37.620 |
Arush Selvan: I think we all kind of took note about a month ago when DeepSeek released the R1 model and paper to the world. 01:41:46.620 |
Arush Selvan: And I think this was really exciting because it was the first paper that really explained how you build a thing like o1. 01:41:51.620 |
Arush Selvan: We had kind of speculation and some rumors, but they really laid out the algorithm and the mechanisms for what it takes to get a model to learn to do this kind of reasoning. 01:42:02.620 |
Arush Selvan: And it turns out it was essentially just reinforcement learning where you give the model some questions, you measure if it's getting the answer right, and you just kind of turn this crank of giving it feedback to do more like the things that worked well and less like the things that didn't work. 01:42:16.620 |
Arush Selvan: And what you end up seeing is that the long chain of thought from models like o1 and R1 actually emerges as a byproduct of this. 01:42:23.620 |
Arush Selvan: It wasn't kind of manually programmed in where the models were like given data of like 10,000 token reasoning steps. 01:42:29.620 |
Arush Selvan: This was the thing the model learned to do because it was a good strategy. 01:42:32.620 |
Arush Selvan: And reinforcement learning at the core is really about identifying good strategies for solving problems. 01:42:37.620 |
Arush Selvan: It also seems like open source models are back in a big way. There's a lot of excitement around the open source community. 01:42:43.620 |
Arush Selvan: People have been working on replication efforts for the o1 project and have also been trying to distill data from o1 down to smaller models. 01:42:50.620 |
Arush Selvan: And so what next? How does this relate to agents? 01:42:53.620 |
Arush Selvan: I think it'll be helpful to know a little bit about how reinforcement learning works. 01:42:56.620 |
Arush Selvan: The key idea is to explore and exploit. So you want to try stuff, see what works, do more of the things that worked, less of the things that didn't. 01:43:03.620 |
Arush Selvan: And so in this feedback loop demonstrated here in the image, we can see a challenge where a model is supposed to be writing code to pass test cases 01:43:11.620 |
Arush Selvan: And we give it rewards that correspond to things like formatting, using the right language, and then ultimately whether or not the test cases are passing. 01:43:18.620 |
Arush Selvan: And so this is kind of a numerical signal: rather than training on data that we curated in advance, 01:43:25.620 |
Arush Selvan: we are letting the model do synthetic data rollouts and seeing scores from those rollouts, which then are fed back into the model. 01:43:31.620 |
Arush Selvan: And so the GRPO algorithm, which maybe some of you have heard about, is the algorithm DeepSeek used. 01:43:36.620 |
Arush Selvan: I think it's less of like a technical breakthrough in terms of it being a really important new algorithm to study, but I think it's very conceptually simple. 01:43:42.620 |
Arush Selvan: And I think it's a nice way to think about what reinforcement learning means. 01:43:45.620 |
Arush Selvan: And the idea really is just that you, for a given prompt, sample n completions, you score them all, and you tell the model be more like the ones with higher scores. 01:43:53.620 |
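In code, the core of that idea is just group-relative scoring; here is a schematic sketch that leaves out the policy-gradient update, clipping, and KL penalty a real implementation would have.

    import statistics

    def group_relative_advantages(rewards: list[float]) -> list[float]:
        # For one prompt: sample n completions, score each, and normalize the
        # scores within the group. "Be more like the above-average completions"
        # is the whole training signal.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # avoid dividing by zero
        return [(r - mean) / std for r in rewards]

    # Example: four completions for the same math question, scored by a rubric.
    print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))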
Arush Selvan: This is still in kind of the single-turn reasoner model non-agentic world. 01:43:59.620 |
Arush Selvan: And so the challenges that lie ahead are going to be about how do we take these ideas and extend them into more powerful, more agentic, more autonomous systems. 01:44:09.620 |
Arush Selvan: But we do know that it can be done. 01:44:11.620 |
Arush Selvan: So OpenAI's deep research still has a lot of questions that we do not know the answers to about how it works, but they have told us that it was end-to-end reinforcement learning. 01:44:19.620 |
Arush Selvan: And so this is a case where the model is taking up to potentially 100 different tool calls of browsing or querying different parts of the internet to synthesize a large answer. 01:44:27.620 |
Arush Selvan: And it does seem, I think, to many people's vibe check opinions, very impressive. 01:44:32.620 |
Arush Selvan: But it also is not AGI in the sense of you can't get it to go work in a repo or solve hard software engineering tasks. 01:44:40.620 |
Arush Selvan: And people have kind of anecdotally found that it does struggle a bit for out-of-distribution tasks. 01:44:45.620 |
Arush Selvan: If you want it to fill out a table with 100 very manual calculations, it can struggle there. 01:44:51.620 |
Arush Selvan: And so it seems like reinforcement learning, on one hand, is a big unlock for new skills and more autonomy. 01:44:57.620 |
Arush Selvan: But it's not a thing that, so far, has granted us agents that can just do everything and know how to solve all kinds of problems. 01:45:04.620 |
Arush Selvan: But it is a path forward for teaching a model skills and having the model learn how to get better at certain skills, particularly in conjunction with environments and tools and verification. 01:45:16.620 |
Arush Selvan: And so there is infrastructure out there for doing this on our own, kind of. 01:45:22.620 |
Arush Selvan: A lot of it is still RLHF style, by which I mean it's about kind of single-turn interactions, where the goal is we have reward signals that come from kind of human data that has been combined into a reward model. 01:45:34.620 |
Arush Selvan: And if we want to have RL agents becoming part of our systems, maybe we will get really good API services from the large labs that let us build these things and hook into GPT, whatever, or Claude, whatever, and train these sorts of models on our own with fine-tuning. 01:45:49.620 |
Arush Selvan: But we also don't really have these options yet. 01:45:52.620 |
Arush Selvan: OpenAI has kind of teased their reinforcement fine-tuning feature, but it's not multi-step tool calling yet. 01:45:57.620 |
Arush Selvan: And so I think if we want to plan ahead, it's worth kind of noting and asking, what would this ecosystem look like? 01:46:04.620 |
Arush Selvan: And there are a lot of unknown questions, like: how much will this cost? 01:46:09.620 |
Arush Selvan: Will it generalize across tasks? 01:46:11.620 |
Arush Selvan: And how do we design good rewards and good environments? 01:46:14.620 |
Arush Selvan: And there's a lot of opportunity here. 01:46:16.620 |
Arush Selvan: In open source infrastructure, there's a lot of room to build and grow and determine what the best practices are going to be, what the right tools will be, 01:46:22.620 |
Arush Selvan: as well as for companies that can build tools to support this ecosystem, whether or not they're already in the fine-tuning world, 01:46:29.620 |
Arush Selvan: and services for supporting this kind of agentic RL. 01:46:32.620 |
Arush Selvan: And I think also it is worth thinking about things that are like not literal RL in the sense of training the model, but at the prompt level, there's all sorts of automation we can do. 01:46:40.620 |
Arush Selvan: So if you've used DSPy, I think that is kind of adjacent to RL in the flavor of having a signal that we can then bootstrap from to improve our underlying system based on improving some downstream scores. 01:46:53.620 |
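Without tying it to any particular framework, the flavor of that prompt-level automation is something like the following toy loop; the llm and score callables are assumed to exist and are not part of any real library's API.

    def optimize_prompt(variants, tasks, llm, score):
        # RL-adjacent automation at the prompt level: try each prompt variant
        # on a small eval set and keep whichever improves the downstream score.
        # `llm(prompt, task)` returns an output; `score(task, output)` a float.
        best_prompt, best_total = None, float("-inf")
        for prompt in variants:
            total = sum(score(task, llm(prompt, task)) for task in tasks)
            if total > best_total:
                best_prompt, best_total = prompt, total
        return best_prompt, best_total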
Arush Selvan: Now I want to share a story with you about a single Python file I wrote a couple weeks ago. 01:46:59.620 |
Arush Selvan: So this was the weekend after R1 came out, and I had been reading the paper and thought it was really cool. 01:47:04.620 |
Arush Selvan: We had not had the NVIDIA stock crash quite yet. 01:47:07.620 |
Arush Selvan: And I was just playing around some experiments. 01:47:10.620 |
Arush Selvan: I was taking a trainer from Hugging Face that had the GRPO algorithm, and I was getting a really small language model, Llama 1B, to do some reasoning and then give an answer for math questions. 01:47:22.620 |
Arush Selvan: And I started with like a pretty simple system prompt, and I was just training the model to let it see what it did, and I had kind of manually curated some rewards in terms of what the scoring function should look like. 01:47:33.620 |
Arush Selvan: And I just kind of like tweeted it out, where I had an example of the model kind of looking like it's doing some self-correction and showing that the accuracy gets better as well as the length of response will initially drop once it learns to kind of follow the format. 01:47:48.620 |
Arush Selvan: Then it goes back up as it learns to kind of take advantage of longer chains of thought to do its reasoning. 01:47:54.620 |
Arush Selvan: And this was not the first thing to replicate in any sense. 01:47:57.620 |
Arush Selvan: I wouldn't really call it a true replication. 01:47:59.620 |
Arush Selvan: It was far from the most complicated, and I think that actually caught a lot of people's imaginations, and it became kind of a thing. 01:48:07.620 |
Arush Selvan: So over the next two weeks after that, it just took on a life of its own where a lot of people were kind of tweeting about it and forking it and making modifications to it, 01:48:17.620 |
Arush Selvan: And making it something you could run in a Jupyter notebook, making it more accessible, writing blog posts about it, and it was interesting. 01:48:25.620 |
Arush Selvan: Because it, to me, didn't feel like a thing that kind of merited this level of excitement. 01:48:32.620 |
Arush Selvan: But what I think was catching people's imagination was that it was one file of code. 01:48:39.620 |
Arush Selvan: And it invited modification in a very user-friendly, engaging way, which I like to call rubric engineering. 01:48:46.620 |
Arush Selvan: And so the idea of rubric engineering here is that, similar to prompt engineering, to have a model do reinforcement learning, it's going to get some reward. 01:48:56.620 |
Arush Selvan: But what should this reward be? 01:48:57.620 |
Arush Selvan: In the most simple version, it's just like, did it get the question right or wrong? 01:49:01.620 |
Arush Selvan: But there's a lot more you can do beyond this. 01:49:03.620 |
Arush Selvan: And so I think the single file of code exposed examples of this, where you can give the model points for things like following this XML structure. 01:49:12.620 |
Arush Selvan: Like, if it gets a certain tag right, you give it plus one point. 01:49:15.620 |
Arush Selvan: If it has an integer answer that's still the wrong answer, but it's learned that the format should be an integer answer, get some points for that. 01:49:21.620 |
Arush Selvan: And there's a lot of room here for getting creative. 01:49:25.620 |
Arush Selvan: And for designing rules that are not just downstream evals for our own sake, to know whether a thing is working, but signals that allow the model itself to know whether it's working. 01:49:34.620 |
Arush Selvan: And use that as feedback for going further and training more. 01:49:40.620 |
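To make rubric engineering concrete, here is a small sketch of what such reward functions might look like; the tags and point values are illustrative, not the exact code from that file.

    import re

    def format_reward(completion: str) -> float:
        # Partial credit just for following the XML-style reasoning format.
        score = 0.0
        if "<reasoning>" in completion and "</reasoning>" in completion:
            score += 0.5
        if "<answer>" in completion and "</answer>" in completion:
            score += 0.5
        return score

    def integer_answer_reward(completion: str) -> float:
        # Even a wrong answer earns something if the model has learned that
        # the answer should be a bare integer.
        return 0.5 if re.search(r"<answer>\s*-?\d+\s*</answer>", completion) else 0.0

    def correctness_reward(completion: str, gold: str) -> float:
        match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
        return 2.0 if match and match.group(1).strip() == gold.strip() else 0.0

    def rubric(completion: str, gold: str) -> float:
        return (format_reward(completion)
                + integer_answer_reward(completion)
                + correctness_reward(completion, gold))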
Arush Selvan: There's a lot of things we don't know. 01:49:41.620 |
Arush Selvan: And I think there's a lot of opportunity to get creative and explore and try things out. 01:49:45.620 |
Arush Selvan: Such as using LLMs to design these rubrics, auto-tuning these rubrics or auto-tuning your prompts with frameworks like DSPy. 01:49:52.620 |
Arush Selvan: Incorporating LLM judges as part of the scoring system. 01:49:55.620 |
Arush Selvan: And then also I think reward hacking is an issue to be very conscious of where the idea is you want to ensure that the reward model you're using is actually capturing the goal. 01:50:05.620 |
Arush Selvan: And it doesn't have kind of these back doors where a model can kind of cheat and do something else that ultimately results in its kind of getting a super high reward without learning to do the actual task. 01:50:17.620 |
Arush Selvan: And following this, I have been trying to learn from those lessons of what I saw people using out in the wild and make something that is a little more robust and usable for actual projects beyond just one file of code. 01:50:30.620 |
Arush Selvan: And this has been a kind of very recent effort. It's not a thing that I've been telling you to go use for all your problems tomorrow. 01:50:36.620 |
Arush Selvan: But I think it's my attempt at doing some open source research code that will help people potentially try these things out easier and answer some questions about this. 01:50:46.620 |
Arush Selvan: And so what this really is, it's a framework for doing RL inside of multi-step environments. 01:50:52.620 |
Arush Selvan: So the idea here is that lots of us have built these great agent frameworks for using API models. 01:50:57.620 |
Arush Selvan: And the hope would be that we can leverage those existing environments and frameworks to actually do RL. 01:51:04.620 |
Arush Selvan: So here the idea is you can just create this environment thing that the model plugs into and you don't have to worry about the weights or the tokens. 01:51:11.620 |
Arush Selvan: You can just write an interaction protocol and then this gets fed into a trainer. 01:51:15.620 |
Arush Selvan: And so once you build this environment, you can just kind of let it run and have a model that, once you give it some rewards, learns to get better and better over time. 01:51:24.620 |
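A sketch of what that kind of interaction protocol could look like, written as a hypothetical interface rather than the actual framework's API.

    from typing import Protocol

    class MultiStepEnv(Protocol):
        # You describe the environment; a trainer owns the weights and tokens.
        def reset(self) -> list[dict]:
            """Return the initial messages: system prompt, user task, etc."""
            ...

        def step(self, assistant_msg: str) -> tuple[list[dict], float, bool]:
            """Return (new messages such as tool results, reward, done)."""
            ...

    def rollout(env: MultiStepEnv, policy) -> float:
        # `policy(messages)` is any callable returning the next assistant turn,
        # e.g. a wrapper around an API model or the model being trained.
        messages, reward, done = env.reset(), 0.0, False
        while not done:
            reply = policy(messages)
            new_msgs, reward, done = env.step(reply)
            messages = messages + [{"role": "assistant", "content": reply}] + new_msgs
        return reward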
Arush Selvan: And to conclude, I want to talk about what I think AI engineering might look like in the RL era. 01:51:31.620 |
Arush Selvan: So this is all still something that is very new. 01:51:35.620 |
Arush Selvan: We don't know whether the off-the-shelf API models are going to just work for the tasks we throw at them. 01:51:41.620 |
Arush Selvan: It might be the case that they do. 01:51:43.620 |
Arush Selvan: It might be the case that they don't. 01:51:44.620 |
Arush Selvan: One reason I think that they might not be the entire solution is that it is really hard to include a skill in a prompt. 01:51:53.620 |
Arush Selvan: You can include knowledge in a prompt, but a lot of us, when we try something, we don't nail it the first time. 01:52:00.620 |
Arush Selvan: And it takes a little bit of trial and error. 01:52:02.620 |
Arush Selvan: And it seems to be the case that models are like this as well, where a model does get better at a thing and really gets a skill nailed down by trial and error. 01:52:13.620 |
Arush Selvan: And this has been the most promising unlock we've seen so far for these higher autonomy agents like DeepResearch. 01:52:20.620 |
Arush Selvan: Fine-tuning might still be important. 01:52:23.620 |
Arush Selvan: I think a lot of people wrote off fine-tuning for a while because open models were far enough behind the frontier that a prompted frontier model API was just going to beat a smaller fine-tuned model. 01:52:34.620 |
Arush Selvan: I think, one, we're now seeing the open and closed source gap be close enough that this is less of a concern. 01:52:39.620 |
Arush Selvan: A lot of people are using open source hosted models in their platforms. 01:52:43.620 |
Arush Selvan: And also, the most true version of RL, the kind that DeepSeek did for their R1 model and that OpenAI has talked about doing for deep research, requires actually training the model yourself. 01:52:56.620 |
Arush Selvan: There's a lot of challenges here. 01:52:58.620 |
Arush Selvan: There's a lot of research questions we don't know the answers to. 01:53:00.620 |
Arush Selvan: But there's a lot of things that I think these skills we've learned from doing AI engineering over the past couple years translate very directly to. 01:53:07.620 |
Arush Selvan: Which is that the challenge of building environments and rubrics is not that different from the challenge of building evals and prompts. 01:53:13.620 |
Arush Selvan: We still need good monitoring tools. 01:53:15.620 |
Arush Selvan: We still need a large ecosystem of companies and platforms and products that support the kinds of agents we want to build. 01:53:22.620 |
Arush Selvan: So I think all the stuff we've been doing is going to be essential. 01:53:27.620 |
Arush Selvan: And it's worth looking ahead a little bit to see if we end up in a world where we have to do a little bit more reinforcement learning to unlock things like true autonomous agents or innovators or organizations that are powered by language models. 01:54:08.620 |
Arush Selvan: Ladies and gentlemen, please welcome back to the stage, MC for the AI Engineer Summit, Agent Engineering Day. 01:54:15.620 |
Arush Selvan: The founder and CEO of Superintelligent, NLW. 01:54:30.620 |
Arush Selvan: Thank you, Will, for a great way to close us out and for all the other great presenters as well. 01:54:35.620 |
Arush Selvan: Quick clarification before I let you guys go tonight. 01:54:38.620 |
Arush Selvan: So there is no on-site after party. 01:54:44.620 |
Arush Selvan: However, there are a number of affiliated events. 01:54:47.620 |
Arush Selvan: If you check the website homepage for info and RSVP instructions, that's all there. 01:54:52.620 |
Arush Selvan: But again, expo closes at 4:00. 01:54:55.620 |
Arush Selvan: So we'll be wrapping up conversations and evening plans at around 5:30. 01:55:03.620 |
Arush Selvan: And so we're going to do a 30-minute break now. 01:55:06.620 |
Arush Selvan: If you want to check out and have discussions with the speakers, the Q&A lounges are available to meet them. 01:55:11.620 |
Arush Selvan: The first one is on the first floor to the right as you exit the theater, and there are two downstairs as well. 01:55:17.620 |
Arush Selvan: Also, we recommend making time to stop by the sponsor expo. 01:55:21.620 |
Arush Selvan: You're going to find coffee, snacks, and also the amazing products and services of our sponsors. 01:55:27.620 |
Arush Selvan: So thank you very much, and we will see you back here in about half an hour. 02:54:06.620 |
Arush Selvan: My name is John Crepezzi, and I work on a team at Jane Street called AI Assistance. 02:54:11.620 |
Arush Selvan: Our group, roughly, is there at Jane Street to try to maximize the value that 02:54:16.620 |
Jane Street can get from large language models. 02:54:19.620 |
Arush Selvan: And I've spent my entire career in dev tools. 02:54:22.620 |
Arush Selvan: Before I worked at Jane Street, I was at GitHub for a long time, and then before 02:54:25.620 |
that I worked at a variety of other dev tools companies. 02:54:28.620 |
Arush Selvan: And LLMs kind of present this really amazing opportunity in that they're so 02:54:32.120 |
open-ended that we can build kind of anything that we can imagine. 02:54:35.120 |
Arush Selvan: And it seems like right now, the only thing moving faster than the progress 02:54:38.520 |
of the models is kind of our creativity around how to employ them. 02:54:42.120 |
Arush Selvan: At Jane Street, though, we've made some choices that make adoption of off-the-shelf 02:54:47.220 |
Arush Selvan: tooling a little bit more difficult than it is for other companies. 02:54:50.220 |
Arush Selvan: And kind of the biggest reason that we have this problem is that we use OCaml 02:54:56.220 |
Arush Selvan: For those not familiar with OCaml, it is a functional, very powerful language, but fairly obscure. 02:55:04.220 |
Arush Selvan: It was built in France, and its most common applications are in things like 02:55:10.220 |
Arush Selvan: theorem proving or formal verification. 02:55:12.220 |
Arush Selvan: It's also used to write programming languages. 02:55:15.220 |
Arush Selvan: We use OCaml kind of for everything at Jane Street. 02:55:20.220 |
Arush Selvan: So just a couple of quick examples. 02:55:22.220 |
Arush Selvan: When we write web applications, of course, web applications have to be written 02:55:26.220 |
Arush Selvan: in JavaScript, but instead we write OCaml, and we use a library called JS of OCaml 02:55:31.220 |
Arush Selvan: that is essentially an OCaml bytecode to JavaScript transpiler. 02:55:35.220 |
Arush Selvan: When we write plugins for Vim, those have to be written in VimScript. 02:55:39.220 |
Arush Selvan: But we actually use a library called VCaml, which again is OCaml to VimScript transpiler. 02:55:45.220 |
Arush Selvan: And even people at the company that are working on FPGA code, they're not writing 02:55:51.220 |
Arush Selvan: Verilog, they're writing in an OCaml library called Hard Caml. 02:55:55.220 |
Arush Selvan: So why are the tools on the market available not good for working with OCaml? 02:56:01.220 |
Arush Selvan: I think it kind of comes down to a few primary reasons. 02:56:04.220 |
Arush Selvan: The first and the most important is that models themselves are just not very good at OCaml. 02:56:09.220 |
Arush Selvan: And this isn't the fault of the AI labs. 02:56:11.220 |
Arush Selvan: This is just kind of a byproduct of the amount of data that exists for training. 02:56:15.220 |
Arush Selvan: So there's a really good chance that the amount of OCaml code that we have 02:56:19.220 |
Arush Selvan: inside of Jane Street is just more than like the total combined amount of OCaml 02:56:23.220 |
Arush Selvan: code that there exists in the world outside of our walls. 02:56:26.220 |
Arush Selvan: The second is that we've made things really hard on ourselves. 02:56:30.220 |
Arush Selvan: Partially as a byproduct of working in OCaml, we've had to build our own build systems. 02:56:34.220 |
Arush Selvan: We built our own distributed build environment. 02:56:36.220 |
Arush Selvan: We even built our own code review system, which is called Iron. 02:56:39.220 |
Arush Selvan: We develop all of our software on a giant monorepo application. 02:56:44.220 |
Arush Selvan: And just for fun, instead of storing that monorepo in Git, we store it in Mercurial. 02:56:49.220 |
Arush Selvan: And at last count, 67% of the firm uses Emacs instead of normal editors, maybe like VS Code. 02:56:59.220 |
Arush Selvan: We do have people using VS Code, but Emacs is the most popular. 02:57:02.220 |
Arush Selvan: And the last thing is we're dreamers. 02:57:05.220 |
Arush Selvan: I mean, kind of everyone in this room hopefully is a dreamer in a way. 02:57:08.220 |
Arush Selvan: And what I mean by this is we want the ability to kind of take LLMs and apply them to different parts of our development flow and light up different parts. 02:57:15.220 |
Arush Selvan: So maybe we want to use LLMs to resolve merge conflicts or build better feature descriptions or figure out who reviewers for features should be. 02:57:24.220 |
Arush Selvan: And we don't want to be hampered by the boundaries between different systems when we do that. 02:57:29.220 |
Arush Selvan: Over the next 15 minutes, I'm going to cover our approach to large language models at Jane Street, particularly when it comes to developer tools. 02:57:38.220 |
Arush Selvan: I'm going to talk about custom models that we're building and how we build them. 02:57:42.220 |
Arush Selvan: I'm going to talk about editor integrations, so these are the integrations into VS Code, Emacs, and NeoVim. 02:57:49.220 |
Arush Selvan: And I will talk about the ability that we've built over time to evaluate models and figure out how to make them perform best. 02:57:55.220 |
Arush Selvan: And I guess at first glance, it's not really obvious that training models at all is a good idea. 02:58:01.220 |
Arush Selvan: I mean, it's a very expensive proposition. 02:58:03.220 |
Arush Selvan: It takes a lot of time, and it can go wrong in a ton of different ways. 02:58:06.220 |
Arush Selvan: Who here has trained a model before or tried to train something like a model? 02:58:09.220 |
Arush Selvan: Maybe you took a foundation model and trained on top of it. 02:58:13.220 |
Arush Selvan: We were more convinced after we read this paper. 02:58:17.220 |
Arush Selvan: This is a paper from Meta about a project called Code Compose. 02:58:20.220 |
Arush Selvan: And in this paper, they detail the results of fine-tuning a model specifically for use with Hack. 02:58:26.220 |
Arush Selvan: Hack is actually pretty similar to OCaml. 02:58:28.220 |
Arush Selvan: Not in its syntax or function, but really just in the fact that it's used primarily 02:58:33.220 |
at one company and not really used much outside of that company, even though it's open source. 02:58:38.220 |
Arush Selvan: So actually, a fun fact, Hack is implemented in OCaml. 02:58:42.220 |
Arush Selvan: I think that's just like a total coincidence. 02:58:48.220 |
Arush Selvan: We read this paper, and we decided that it would be really cool if we could replicate their results. 02:58:53.220 |
Arush Selvan: We thought we would just take a model off the shelf. 02:58:55.220 |
Arush Selvan: We would show it a bunch of our code, and then we would get back a model that 02:58:59.220 |
worked like the original model, but knew about our libraries and idioms. 02:59:02.220 |
Arush Selvan: It turns out that's just not how it works. 02:59:06.220 |
Arush Selvan: In order to get good outcomes, you have to have the model see a bunch of examples 02:59:10.220 |
Arush Selvan: that are in the shape of the type of question that you want to ask the model. 02:59:14.220 |
Arush Selvan: So we needed to first create a goal, a thing that we wanted the model to be able to do. 02:59:18.220 |
Arush Selvan: And in our world, the goal that we came up with was this. 02:59:21.220 |
Arush Selvan: We wanted to be able to generate diffs given a prompt. 02:59:25.220 |
Arush Selvan: So what that means is we wanted a user inside of an editor to be able to write a description 02:59:29.220 |
of what they wanted to happen, and then have the model suggest a potentially multi-file diff. 02:59:34.220 |
Arush Selvan: So maybe you want to modify the test file, an ML file, and an MLI, which is kind of like a header file. 02:59:40.220 |
Arush Selvan: We wanted the diffs to apply cleanly, and we wanted them to have a good likelihood of type checking after they had been applied. 02:59:47.220 |
Arush Selvan: And we were kind of targeting this range of up to 100 lines as an ideal zone of what we thought LLMs would actually be capable of. 02:59:56.220 |
Arush Selvan: And in order for that to work, we needed to collect data, like I was talking about before. 03:00:01.220 |
Arush Selvan: We needed data of the training shape that looked just like the test shape. 03:00:05.220 |
Arush Selvan: And this is what that shape looks like for this task. 03:00:07.220 |
Arush Selvan: You need to be able to collect a bunch of examples of what context the model would have had beforehand, 03:00:12.220 |
Arush Selvan: and then some prompt of what you want the model to do, written hopefully in the same way that a human would write it, 03:00:17.220 |
Arush Selvan: and then some diff that would accomplish that goal. 03:00:19.220 |
Arush Selvan: So context, prompt, diff, and we need a bunch of these examples. 03:00:23.220 |
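Written out, that training shape is essentially this; the field names are illustrative rather than Jane Street's actual schema.

    from dataclasses import dataclass

    @dataclass
    class DiffExample:
        context: str  # what the developer could see: open files, build status, etc.
        prompt: str   # the short, human-style request, e.g. "Fix that type error"
        diff: str     # a possibly multi-file diff that accomplishes the request

    # The training set is a large collection of these, in the same
    # context / prompt / diff shape the model will see at inference time.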
Arush Selvan: So how do we get these? How do we get these training examples? 03:00:26.220 |
Arush Selvan: Kind of the first place to look is features. Features are part of the code review system I mentioned that we built internally. 03:00:32.220 |
Arush Selvan: This is what it looks like. It's called Iron. Features are very similar to pull requests. 03:00:37.220 |
Arush Selvan: I think you can just, you know, swap that term in your head. And features at first glance have exactly the data they want. 03:00:43.220 |
Arush Selvan: On the description tab, they have a human written description of a change. And on the diff tab, they have the code that accomplishes the goal of the developer. 03:00:51.220 |
Arush Selvan: But on closer look, they're not exactly what you want, right? 03:00:54.220 |
Arush Selvan: The way that you write a feature description or a pull request description is really very different from what you might want to say inside of an editor. 03:01:01.220 |
Arush Selvan: So you're not writing multiple paragraphs in the editor. You're just saying something like, "Fix that error that's happening right now." 03:01:07.220 |
Arush Selvan: And that's just not how we write feature descriptions. 03:01:09.220 |
Arush Selvan: Another problem with these features or pull requests is that they're really large, right? 03:01:14.220 |
Arush Selvan: Often a feature is 500 lines or 1,000 lines. So in order to use it as training data, we would need to have an automated way to pull features apart into individual smaller components 03:01:24.200 |
Arush Selvan: that we could train on. So we need smaller things than features. What are those? Well, maybe commits. Commits are smaller chunks than features. 03:01:31.200 |
Arush Selvan: This is what a typical commit log looks like at Jane Street. So this is not like a git short log. This is literally just like an actual-- 03:01:39.200 |
Arush Selvan: I want you to look at this as an actual git log. And where it says summary z, that's my commit message. 03:01:46.200 |
Arush Selvan: We don't really use commits the same way that the rest of the world uses them. So we use commits mostly as checkpoints between different parts of a development cycle that you might want to revert back to. 03:01:57.200 |
Arush Selvan: Commits don't have a description, and they also have the same problem in that they're not isolated changes. They're a collection of changes. 03:02:04.200 |
Arush Selvan: What we actually ended up with was an approach called workspace snapshotting. And the way that that works is we take snapshots of developer workstations throughout the workday. 03:02:13.200 |
Arush Selvan: So you can think like every 20 seconds, we just take a snapshot of what the developer's doing. And as we take the snapshots, we also take snapshots of the build status. 03:02:20.200 |
Arush Selvan: So there's the build that's running on the box. We can see what the error is or whether the build is green. And we can kind of notice these little patterns. 03:02:27.200 |
Arush Selvan: If you have a green to red to green, that often corresponds to a place where a developer has made an isolated change, right? You start writing some code, you break the build, and then you get it back to green. And that's how you make a change. 03:02:39.200 |
Arush Selvan: Maybe this one, the red to green, this is a place where the developer encountered an error, whether that's a type error or a compilation error, and then they fixed it. So if we capture the build error at the red state, and then the diff from red to green, we can use that as training data to help the model be able to recover from mistakes. 03:02:55.200 |
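A sketch of the kind of mining that implies, with made-up structures: walk the snapshot stream, find red-to-green transitions, and keep the build error together with the diff that fixed it.

    from dataclasses import dataclass

    @dataclass
    class Snapshot:
        time: float
        build_status: str  # "green" or "red"
        build_error: str   # compiler or type error text when red, else ""
        workspace: dict    # hypothetical: file path -> file contents

    def mine_fix_examples(snapshots: list[Snapshot], diff_fn) -> list[tuple[str, str]]:
        # `diff_fn(before, after)` is an assumed helper producing a textual diff
        # between two workspace states.
        examples = []
        for prev, curr in zip(snapshots, snapshots[1:]):
            if prev.build_status == "red" and curr.build_status == "green":
                # The error at the red state plus the diff that turned it green
                # becomes a "recover from this mistake" training example.
                examples.append((prev.build_error,
                                 diff_fn(prev.workspace, curr.workspace)))
        return examples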
Arush Selvan: The next thing we need is a description. And the way that we did that, we just used a large language model. So we had a large language model write a really detailed description of a change in as many words as it possibly could. 03:03:07.200 |
Arush Selvan: And then we just kept filtering it down until it was something that was around the right level of what a human would write. 03:03:12.200 |
Arush Selvan: So now we have this training data, and training data is kind of only half the picture of training a model. So you have the supervised training data, and then you need to do the second part, which is the reinforcement learning. 03:03:23.200 |
Arush Selvan: This is really where models get a lot of the power, right? We align the model's ability to what humans think is actually good code. So what is good code? 03:03:35.200 |
Arush Selvan: I guess, on the surface, good code is, well, code that parses, meaning if a piece of code doesn't go through the OCaml parser and come out with a green status, that is not good code, I would say, by most definitions. 03:03:49.200 |
Arush Selvan: Good code in OCaml, because it's statically typed, is code that type checks. So we want to have good code be code that, when it is applied on top of a base revision, can go through the type checker, and the type checker agrees that the code is valid. 03:04:03.200 |
Arush Selvan: And, of course, the gold standard is that good code is code that compiles and passes tests. So ideally, during the reinforcement learning phase of a model, you could give the model a bunch of tasks that are, like, verifiable. 03:04:15.200 |
Arush Selvan: We have the model perform some edit, and then we check whether or not it actually passes the test when applied to the code. 03:04:24.200 |
Arush Selvan: So we did that. We've done this as part of our training cycle, and we built this thing that is called CES. It's the Code Evaluation Service. 03:04:33.200 |
Arush Selvan: You can think of it kind of like a build service, except with a slight modification to make it much faster. And that's that, first, we pre-warm a build. 03:04:41.200 |
Arush Selvan: It sits at a revision and is green. And then we have these workers that all day just take diffs from the model, they apply them, and then we determine whether the build status turns red or green, and then we report that error or success back up to the build function. 03:04:56.200 |
Arush Selvan: And through continued use of this service over the course of, like, months, we're able to better align the model to write code that actually does compile and pass tests. 03:05:05.200 |
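In rough pseudocode, a CES-style worker loop is something like this; the names are hypothetical, and the real service deals with pre-warmed OCaml builds, many parallel workers, and much more.

    def ces_worker(queue, build):
        # `build` is an assumed handle to a pre-warmed checkout sitting green
        # at a base revision; `queue` yields candidate diffs from the model.
        while True:
            job = queue.get()                      # a diff proposed during training or eval
            build.apply(job.diff)                  # apply the model's change
            status, error = build.check()          # parse, typecheck, optionally run tests
            job.report(status=status, error=error) # red/green feeds back as the signal
            build.revert()                         # return to the warm green base revision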
Arush Selvan: It turns out this exact same setup is the one that you would want for evaluation. So if you just hold out some of the RL data, you can also use it to evaluate models' ability to write code. 03:05:17.200 |
Arush Selvan: It kind of looks like this. You give the model a problem, you let it write some code, and then you evaluate whether or not the code that it writes actually works. 03:05:24.200 |
Arush Selvan: And training is hard, and it can have kind of catastrophic but hilarious results. So at one point we were training a code review model, and this is a totally separate model, but the idea was we want to be able to give some code to this model and have it do a first pass of code review just like a human would do to try to save some of the toil of code review. 03:05:46.200 |
Arush Selvan: We trained this model. We put a bunch of data into it. We worked on it for months. We're real excited, and we put our first code in for code review through the automated agent. 03:05:55.200 |
Arush Selvan: It ran for a bit, and it came back with something along the lines of, "Hmm, I'll do it tomorrow." 03:06:01.200 |
Arush Selvan: And like, of course it did that, because it's trained on a bunch of human examples, and humans write things like, "I'll do this tomorrow." So it's, you know, not very surprising. 03:06:13.200 |
Arush Selvan: So having evaluations that are meaningful is kind of a cornerstone of making sure that models don't go off the rails like this and you don't waste a bunch of your time and money. 03:06:20.200 |
Arush Selvan: In the end, though, the real test of models is whether or not they work for humans. So I'm going to talk a little bit about the editor integrations that we've built to expose these models to our developers. 03:06:33.200 |
Arush Selvan: Kind of when we were starting building these integrations, we had three ideas in mind. The first idea was, "Wow, we support three editors." We have NeoVim, VS Code, and Emacs, and we really don't want to write the same thing three times. 03:06:44.200 |
Arush Selvan: So ideally, we don't want to write all the same context-building strategies and all of the same prompting strategies. We want to just write it once. 03:06:51.200 |
Arush Selvan: The second is that we wanted to maintain flexibility. So we had a model that we were using at the time that was not a fine-tuned model. We were pretty convinced 03:06:59.200 |
Arush Selvan: That a fine-tuned model was in our future. We wanted the ability to do things like swap the model or swap the prompting strategy out. 03:07:06.200 |
Arush Selvan: And lastly, we wanted to be able to collect metrics. So in the editor, developers care about latency. They care about whether or not the diffs actually apply. 03:07:16.200 |
Arush Selvan: So we wanted to get kind of on-the-ground real experience of whether or not the diffs really were meaningful for people. 03:07:23.200 |
Arush Selvan: This is the simplified version of the architecture that we settled on for this service, the AI development environment. 03:07:29.200 |
Arush Selvan: Essentially, you have LLMs on one side, and then Aid handles all of the ability to construct prompts and to construct context and to see the build status. 03:07:38.200 |
Arush Selvan: And then we were able to just write these really thin layers on top of Aid for each of the individual editors. 03:07:44.200 |
Arush Selvan: And what's really neat about this is that Aid sits as a sidecar application on the developer's machine. 03:07:49.200 |
Arush Selvan: Which means that when we want to make changes to Aid, we don't have to make changes to the individual editors and hope that people restart their editors. 03:07:56.200 |
Arush Selvan: We can just restart the Aid service on all of the boxes. So we restart Aid, and then everyone gets the most recent copy. 03:08:03.200 |
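The editor side can stay as thin as a local HTTP call; here is a sketch, where the endpoint, port, and fields are illustrative rather than Aid's actual protocol.

    import json
    import urllib.request

    AID_URL = "http://localhost:8377"  # hypothetical port for the local sidecar

    def request_diff(prompt: str, workspace_root: str) -> dict:
        # The editor plugin only gathers the user's request and their location;
        # the sidecar handles context building, prompting, and model choice.
        body = json.dumps({"prompt": prompt, "workspace": workspace_root}).encode()
        req = urllib.request.Request(
            AID_URL + "/v1/diff",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)  # e.g. {"files": [{"path": ..., "patch": ...}]}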
Arush Selvan: This is an example of Aid working inside of VS Code. So this is the sidebar in VS Code, very similar to something like Copilot, except this thing allows you to ask for it and get back multi-file diffs. 03:08:17.200 |
Arush Selvan: And you can see it kind of looks like what you'd expect in VS Code. It's a visual interface that lays things out really nicely. 03:08:23.200 |
Arush Selvan: This is what we built in Emacs, though. So in Emacs, developers are used to working in text buffers. 03:08:30.200 |
Arush Selvan: They move around files. They want to be able to copy things, the normal way that they copy things. 03:08:34.200 |
Arush Selvan: So we actually built the Aid experience in Emacs into a markdown buffer. So users can move around inside this markdown buffer. 03:08:40.200 |
Arush Selvan: They can ask questions, and then there are keybinds that essentially append extra content to the bottom of the markdown buffer. 03:08:46.200 |
Arush Selvan: Aid's architecture lets us plug various pieces in and out, like I mentioned. 03:08:52.200 |
Arush Selvan: So we can swap in new models. We can make changes to the context building. We can add support for new editors, which I think probably sounds far-fetched, but this is something we're actually just doing right now. 03:09:05.200 |
Arush Selvan: And we can even add domain-specific tools. So different areas of the company can supply tools that are available inside of the editors, and they kind of end up in all the editors without having to write individual integrations. 03:09:17.200 |
Arush Selvan: Aid also allows us to A/B test different approaches, so we can do something like send 50% of the company to one model and 50% to another, and then determine which one gets the higher acceptance rate. 03:09:27.200 |
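That kind of experiment can be as simple as deterministic bucketing with acceptance tracked per arm; a sketch with illustrative names.

    import hashlib
    from collections import defaultdict

    ARMS = ["model_a", "model_b"]

    def assign_arm(username: str) -> str:
        # Deterministic: the same developer always lands in the same bucket.
        digest = hashlib.sha256(username.encode()).hexdigest()
        return ARMS[int(digest, 16) % len(ARMS)]

    acceptance = defaultdict(lambda: [0, 0])  # arm -> [accepted, shown]

    def record_suggestion(username: str, accepted: bool) -> None:
        arm = assign_arm(username)
        acceptance[arm][1] += 1
        if accepted:
            acceptance[arm][0] += 1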
Arush Selvan: Aid is kind of an investment that pays off over time. Every time something changes in large language models, we're able to change it in one place downstream of the editors, and then have it available everywhere. 03:09:39.200 |
Arush Selvan: And things change, like, really often. We need to be ready when things change. What I've had time to show you today is only a small portion of what my team is doing. 03:09:51.200 |
Arush Selvan: We've got a lot of other things going on. So we're finding new ways to apply RAG inside of the editors. We're applying similar approaches to what you've seen here to large-scale multi-agent workflows. 03:10:02.200 |
Arush Selvan: We are working with reasoning models more and more. But the approach is the same through all of these. We keep things pluggable, we lay a strong foundation to build on top of, and we build the ways for the rest of the company to add to our experience by adding more domain-specific tooling on top of it. 03:10:18.200 |
Arush Selvan: If you think what I've said is interesting and you want to talk more about this, I would love to hear from you. You can just find me outside, and thank you for your time. 03:10:25.200 |
Arush Selvan: Next up, the head of AI engineering at Bloomberg is here to present challenges to scaling agents for generative AI products. 03:10:45.200 |
Arush Selvan: Please join me in welcoming to the stage Anju Kambadur. 03:10:50.200 |
Anju Kambadur: Oh man, it's really hard to see my photo that big or small. 03:11:00.200 |
Anju Kambadur: Thank you so much for inviting me. As I was trying to think what would be a good topic to present in this talk, the organizers were really nice. 03:11:12.200 |
Anju Kambadur: So a lot of things that you'll hear today were influenced by what the organizers thought was important, because there are really so many exciting things happening in the agentic landscape to talk about. 03:11:22.200 |
Anju Kambadur: So let's get started. The first thing: in late 2021, I think LLMs were really starting to capture the imagination. 03:11:30.200 |
Anju Kambadur: As a company we've been investing in AI for almost 15, 16 years, so we decided we'd build our own large language model. 03:11:38.200 |
Anju Kambadur: It took all of 2022 to do that, and in 2023 we wrote a paper about it. We had learned a lot about how you build these models, how you organize data sets for them, how evaluation works, and how you coax performance out of them in certain zones. 03:11:52.200 |
Anju Kambadur: But then ChatGPT happened, and I think the open-weight and open-source community has come along so beautifully. So while we continue to do very similar work, as a strategy we've pivoted to say: let's build on top of whatever is available out there. 03:12:08.200 |
Anju Kambadur: We have many, many different use cases, so we pretty much pivoted to say we'll build on top. If it helps you in any way to know how we are doing things, there you go. 03:12:17.200 |
Anju Kambadur: The other thing was, I think, a curiosity about how exactly a company like Bloomberg organizes its AI efforts. 03:12:26.200 |
Anju Kambadur: So, I report to the global head of engineering, and we are organized somewhat as a special group, if you will. 03:12:35.200 |
Anju Kambadur: We work a lot with our data counterpart. Bloomberg has a really strong, large data organization that, as you can appreciate, helps us out a lot. 03:12:42.200 |
Anju Kambadur: We work with product and the CTO in cross-functional settings. About 400 people, 50 teams, in London, New York, Princeton and Toronto. So that's a little bit about our group. 03:12:55.200 |
Anju Kambadur: Okay. So we've been building products using generative AI, starting with tools and becoming more agentic, for 12 to 16 months now. I think the effort has been really, really serious. 03:13:09.200 |
Anju Kambadur: And so there have been so many things we've had to solve in order to build something today using what's available today. 03:13:16.200 |
Anju Kambadur: And then I decided somebody must cover all of these topics, so I'm not going to talk about these at all. I think there are some wonderful speakers talking about this. 03:13:26.200 |
Anju Kambadur: I'll try to hang around a bit after this. And I'm really bullish on the developments in any one of those challenges that we need to solve. 03:13:38.200 |
Anju Kambadur: I think it gets easier and easier to solve those challenges. So please don't read these as being pessimistic. It's just realistic. I need to build and ship things today. 03:13:48.200 |
Anju Kambadur: And that means these are the things I need to deal with today. Again, we won't be touching on any of these topics today. 03:13:54.200 |
Anju Kambadur: So internally, it was really hard to say what's an agent and what's a tool, because everyone kind of had their own vocabulary. And then this really nice paper came out: Cognitive Architectures for Language Agents. If you haven't read it, you should try to read that paper. So when I'm talking today, when I say a tool, I mean the left-hand side of that spectrum. 03:14:16.200 |
Anju Kambadur: And an agent is really more autonomous, has memory, and can evolve. So whenever I say agentic, it's on the right-hand side of the spectrum, and the other one is the left-hand side. So that's what my vocabulary will be. 03:14:28.200 |
Anju Kambadur: Finally, to set the stage for the talk: I don't know how many of you know about Bloomberg. I certainly did not know as much as I do today when I joined. We are a fintech company, as you can imagine from my nice jacket, or jumper. 03:14:45.200 |
Anju Kambadur: And our clients are in finance, but finance is a very diverse field. Listed here are 10 different archetypes of people who are in finance, and they do very different activities, but they also do a lot of similar activities. So, what is a short form of thinking about what Bloomberg does? We both generate and accumulate a lot of data. This is structured and unstructured: news, research, documents, slides. We 03:15:15.200 |
also provide access to websites. There's a lot of reference data and market data coming in. Just to give you a sense of the scale: every day, we get 400 billion ticks of structured data, about a billion-plus unstructured messages, and millions of well-written documents, which include news. And this is just every day. And we have over 40 years of history on it. So when we say we offer information as one of the things to our clients, 03:15:45.180 |
this is the scale at which we are working. For the rest of this talk, as you can imagine, we are building a very broad set of products, so to focus the talk, I'll talk about one particular archetype: the research analyst. If you don't know what a research analyst does, here is a short course. A research analyst is typically an expert in a particular area. Think: I'm a research 03:16:15.160 |
analyst in AI, or semiconductors, or technology, or electric vehicles, and the kinds of things they need to do on a daily basis are written at the bottom. They do a lot of work with search and discovery and summarization, a lot of things with unstructured data, on the left-hand side. 03:16:33.160 |
They do a lot of work in data and analytics, structured data and analytics, in the middle part of the segment. They reach out to their colleagues both to disperse and gather information, so there's a lot of communication. And then some of them are also building models, which means they need to normalize data, and they need to actually program and generate models as well. So this is a research analyst in a nutshell. 03:16:42.160 |
The other bit is that we are in finance, and we've been in finance since founding, 40 years ago. There are some aspects of our products that are non-negotiable. 03:16:51.160 |
Those include things like precision, comprehensiveness, speed, throughput, availability, and some principles like protecting our contributor and client data, 03:17:04.160 |
and making sure that whatever we build, there is transparency throughout. These are non-negotiables. It doesn't matter whether you're using AI or not. So these should ground you in the kinds of challenges we face when we use what's available today to build agents. 03:17:34.160 |
Okay. So what was the first thing we did? Again, 2023 is when I think we got serious. The first thing we did was in the zone of helping the research analyst community: companies, public companies in particular, 03:17:49.160 |
they have scheduled quarterly calls that discuss the health of their company. They talk about their future. It's a conference call. A lot of analysts attend the call. Uh, there's a presentation by the company's executives. And then there's a Q and A segment. 03:18:03.160 |
And during earnings season, it happens that on any given day many, many of these calls are happening. I told you that a research analyst has to stay on top of what's happening every single day. So transcripts of these calls need to be generated; again, AI is used. And in 2023, we saw an opportunity to say: well, for every company operating in a particular sector, we know what kinds 03:18:21.160 |
of questions are of interest, and maybe we can try to answer them for the analyst to take a look at. That way they can be informed on whether they want a deeper dive or not. Seems like a simple product. And again, I'm talking about work that started in '23. So, where the technology was, we still needed to do a lot to bring in the product 03:18:51.140 |
and take it to the market, keeping our principles and features in place. So what does it mean? Just focus on the right hand side, if you will. Performance out of the box was not great. Like precision, accuracy, factuality, things like that. And for those of you who are interested in MLOps, I think there was a lot of work done in order to just build remediation workflows and circuit breakers. Because remember, these summaries are not somebody just chatting with a transcript. It's actually published. 03:19:20.140 |
And everyone gets to see the same summary, and anything that is an error has an outsized impact for us. So we constantly monitor performance, remediate, and then the summaries get more and more accurate. I think a lot of monitoring goes on behind it, and a lot of CI/CD goes on behind it as well. 03:19:37.140 |
Okay, so today, how are the products that we're building structured? What does the agentic architecture look like? Well, first of all, it's semi-agentic, and this is an opinion, 03:19:49.140 |
because we don't yet fully have the trust that everything can be autonomous. So some pieces are autonomous, and other pieces are not. 03:19:57.140 |
Guardrails are a classic example. Bloomberg doesn't offer financial advice, so if someone starts with, hey, should I invest in...? then you need to catch it. 03:20:06.140 |
We need to be factual. That's again a guardrail. So those are not optional pieces for any agent. They are coded in as: you must do this check. 03:20:16.140 |
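A minimal sketch of what a non-optional guardrail like this could look like, assuming a simple pattern check that runs before any agent sees the request; the patterns and refusal message are illustrative, not Bloomberg's actual implementation.

```python
# Hypothetical sketch of a non-optional guardrail check (patterns and refusal text
# are illustrative, not Bloomberg's): the check runs before any agent sees the query.
import re

ADVICE_PATTERNS = [
    r"\bshould i (buy|sell|invest in)\b",
    r"\bis .+ a good investment\b",
]


def is_advice_request(query: str) -> bool:
    q = query.lower()
    return any(re.search(p, q) for p in ADVICE_PATTERNS)


def handle(query: str, agent) -> str:
    if is_advice_request(query):
        # Not up to the agent: this branch is enforced for every request.
        return "I can't provide investment advice."
    answer = agent(query)
    # A factuality / grounding check on the answer would also be enforced here.
    return answer


print(handle("Should I invest in ACME?", agent=lambda q: "..."))
```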
So just take this, keep this image in mind. It will come back. Okay. So this is about, this is a talk about scaling. 03:20:24.140 |
So with that long runway, let's get to scaling. So I just wanted to cover two aspects of scaling. I'm hoping that both these aspects will be more of a confirmation and not a surprise to any of you. 03:20:34.140 |
So let's see. So the first thing is if you want to build agents and you want each agent to evolve really quickly, because when you build the first time, unless you're a magician, it's going to suck a bit. 03:20:46.140 |
And then it needs to improve and improve and improve and improve, right? So how do you get there? 03:20:50.140 |
Well, let's go back to how some really good software is built. When I was a grad student, I used matrix multiplication a lot. 03:20:58.140 |
And this is a snapshot of the generalized matrix-matrix product. If you read the API documentation, it lays out every aspect of the input and every error code; how long it will take is also in the documentation. It just works. 03:21:14.140 |
Right? And when you build software on top of such really well-documented, well-written software, your software also tends to be robust. Your products tend to be robust. 03:21:26.140 |
Even 20 years ago, when we started using machine learning to build products, there were tools, APIs that use models or pipelines of models behind them. 03:21:36.140 |
You as a caller or a person downstream of such APIs, there is a bit of stochasticity, if I can pronounce it correct, involved, right? You don't quite know what the result will be. 03:21:50.140 |
And you don't quite know if it'll work for you or not. And this is despite best intentions of establishing, you know, what the input distributions are and what the output distributions are. There's always a bit of stochasticity. 03:22:00.140 |
It was still okay to work with them, and I'll tell you why it was okay to work with these. But when you start using LLMs and agents, which are really compositions of LLMs, 03:22:12.140 |
the errors multiply a lot. And that is something that causes a lot of fragile behavior. And we'll just take a look at it. 03:22:24.140 |
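A quick back-of-the-envelope illustration of that compounding: if each call in a chain is right with probability p, the whole chain is right with probability roughly p to the n. The numbers below are illustrative only.

```python
# Illustrative arithmetic only (numbers are not Bloomberg's): if each step in a
# chain of LLM calls is right with probability p, the chain is right with ~p**n.
def chain_accuracy(per_call_accuracy: float, num_calls: int) -> float:
    return per_call_accuracy ** num_calls


for p in (0.95, 0.90):
    for n in (1, 3, 5, 10):
        print(f"per-call={p:.2f}  calls={n:2d}  chain={chain_accuracy(p, n):.2f}")
# e.g. five chained calls at 90% each are right only about 59% of the time.
```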
And I hope my answer is mildly surprising to you on how to avoid the fragility. In 2009, we built a new sentiment product. It was basically to detect if a piece of news for a given company would be beneficial for that company or not. 03:22:41.140 |
So, for the input distribution, we knew which news wires we were monitoring and we knew which language it was in. News wires also have editorial guidelines on how they write things. 03:22:51.140 |
So, while the API that sits in front of the model is not as clean as matrix-matrix multiply, you still have a very decent handle on what is coming into the system. 03:23:02.140 |
And the outputs are obviously just like, you know, it's minus one to plus one, pretty much. So, like the output space is also very easy. Training data, we built it from scratch, so we know the training data. 03:23:12.140 |
We could have really nice, held out in time and space, test sets, and then we could establish the risk of deploying this. We could monitor it. 03:23:21.140 |
So, despite all of these guardrails being present, we still ended up having a lot of out-of-band communication with anyone who was downstream of us. 03:23:31.140 |
So, for example, if you were consuming our stream of output on sentiment, we would give you a heads up. We would tell you that, hey, the model version is changing. 03:23:39.140 |
If you have a downstream application using this as a signal, you want to test it out, things like that. 03:23:44.140 |
This was the landscape that's changed a lot when you think about building agentic architectures. 03:23:50.140 |
Like, you want to make improvements to your agents every single day. You don't want a release cycle that is, you know, purely batch regression-test based. 03:24:01.140 |
Because there are so many customers who are downstream of you who are also making independent improvements to your model. 03:24:07.140 |
So, I'll give you one small example. One of the workflows that we have agents for, for a research analyst: I told you that structured data is something that they look at. 03:24:19.140 |
The question here is, US CPI for the last five quarters, Q is just a quarter. There's an agent that deeply understands the query, figures out what domain it should dispatch to, and then uses a tool. 03:24:30.140 |
There's an NLP front end to the tool, but uses a tool to basically fetch the data, right? 03:24:35.140 |
It turns out that the data is wrong, which is why you need the guardrails. The data is wrong because of one character that was missed. 03:24:45.140 |
It fetched monthly data as opposed to quarterly data. If you're exposing the table, a good research analyst would catch it. 03:24:55.140 |
But if you're building a downstream workflow where you're not even exposing the table, and you're just looking at an answer that says, well, it looks like the answer is 42, it's really hard to catch these compounding errors. Which is why it is easier not to count on the upstream systems to be accurate, but rather to factor in that they will be fragile and evolving, and to do your own safety checks. 03:25:19.140 |
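A minimal sketch of that kind of downstream safety check, using the quarterly-versus-monthly example; the function and data shapes are hypothetical, but the idea is that the consumer verifies periodicity itself instead of trusting the upstream tool.

```python
# Hypothetical sketch of a downstream safety check in the spirit of the CPI example:
# verify the returned series actually has the periodicity the query asked for,
# rather than trusting the upstream tool. Data shapes here are made up.
from datetime import date


def looks_quarterly(observations: list[tuple[date, float]]) -> bool:
    # Quarterly points should be about three months apart; monthly data fails this.
    months = [d.year * 12 + d.month for d, _ in sorted(observations)]
    gaps = [b - a for a, b in zip(months, months[1:])]
    return all(gap == 3 for gap in gaps)


monthly = [(date(2024, m, 1), 3.1) for m in range(1, 6)]
quarterly = [(date(2024, 1, 1), 3.1), (date(2024, 4, 1), 3.0), (date(2024, 7, 1), 2.9)]

assert not looks_quarterly(monthly)  # the silent error from the missed character
assert looks_quarterly(quarterly)    # what the query actually asked for
```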
I'm talking about within my own org: people are operating independently, and every version of the data and analytics API tool that comes out is better and better, but being better means being better on average. 03:25:31.140 |
It doesn't mean it'll be better for you as a downstream consumer. So building in some of these guardrails, I just think, is good sense, and that almost makes you go faster: as you factor out individual agents, each agent can evolve without these handshake signals of, well, for every downstream caller I have, I have to make sure they understand what's changed 03:25:58.140 |
and they sign off before I can promote my new agent to beta or production. I think we just need to change that mindset and be more resilient. 03:26:11.140 |
So that's one. The second thing is, as much as I used to code one fine day long, long ago, I'm a manager now. So I thought I'd talk about org structure, and I don't know how many of you will resonate with it. 03:26:24.140 |
So Bloomberg, like I said, we've been building these things with traditional machine learning for like 15 years, and that has a particular factorization of software. 03:26:34.140 |
And that software factorization is then reflected in the org structure. If you are lucky, you have the reverse Conway law of design, but you really need to rethink that as you start using different tech stacks and start building different kinds of products. What do I mean? 03:26:53.140 |
How many agents do you want to build and what should each agent do and should agents have overlapping functionality or not? 03:27:01.140 |
These are some basic questions. And typically it's very tempting to just say, let's just keep our current software stack and see if we can build on top of that, or let's keep our current org structure and build on top of that. 03:27:12.140 |
And so, what I've learned: on the columns here, you can see the first two columns are vertically aligned teams and the next two columns are horizontally aligned teams, with some properties in the rows. What we've learned, and we've actually done some reorgs, is that in the beginning you don't really know much about what the product design is going to be, and you want to iterate fast. It's just easier to collapse the org, 03:27:30.140 |
collapse the software stack, and just say: here's a team, go build what needs to be built and figure things out. That's where you want really fast iteration. 03:27:36.140 |
You want sharing of core data models, things like that. The more you have understood this for a single product or a single agent, the more you understand what its use is and what it's good at and what it's not. 03:27:54.140 |
And you actually build many, many of these agents and that's when you start thinking, okay, I can go back to the foundations of building good software and good orgs and I want to have things like optimization on it. 03:28:16.140 |
So I want to increase the performance, reduce the cost, make it more testable, make it more transparent and that's where you move into the bottom right corner of the segment where you do have some horizontal. 03:28:27.140 |
So in our case, guardrails are horizontal. We don't want every one of those 50 teams trying to figure out what it means to not accept user inputs that are thinly veiled requests for financial advice. 03:28:45.140 |
Right? It's something that you want to do horizontally, but you want to figure out for yourself what is the right time for you and your organization to start creating horizontals, to start breaking out some of these monolithic agents, which are again reflected in our org structure, into smaller and smaller pieces. 03:29:04.140 |
So, all that said and done, just again for the running example of a research agent, this is how it looks today. 03:29:13.140 |
So, you know, I think taking in the user world and session context and deeply understanding what is the question and then figuring out what kinds of information are needed to answer that question. 03:29:24.140 |
It's factorized as its own agent, reflected in the org structure. Similarly, for answer generation, we have a lot of rigor around what constitutes a well-formed answer. 03:29:35.140 |
Again, that's factored out. I call it semi-agentic, like I alluded to before, because we do have guardrails that are non-optional. 03:29:41.140 |
There is no autonomy there. You have to call it at multiple points. And then, yeah, like, we build on top of, like, years of traditional and more and more modern forms of data munging, like, you know, your sparse indices have become dense and hybrid indices now. 03:30:00.140 |
So, yeah, that's a little bit, and I think I'm right at time. So, have a nice day. Thank you. 03:30:05.140 |
Our final speaker this morning will teach us how to distill accurate, actionable insights from vast, multimodal data sources. 03:30:30.140 |
He's the founder and CEO of BrightWave. Please join me in welcoming to the stage, Mike Conover. 03:30:35.140 |
Hey, everybody. I'm Mike Conover. I am founder and CEO of BrightWave. 03:30:48.140 |
We build a research agent that digests very large corpuses of content in the financial domain. 03:30:54.140 |
So, you can think of due diligence in a competitive deal process. You are pre-term sheet. You step into a data room with thousands of pages of content. 03:31:02.140 |
You need to get to conviction quickly ahead of other teams. You need to spot critical risk factors that would diminish asset performance. 03:31:11.140 |
It's a fairly non-trivial task. Or think about mutual fund analysts: it's earnings season, and you've got a coverage universe of 80 to 120 names. 03:31:21.140 |
There are calls, transcripts, filings. It's a fairly non-trivial problem to understand at a sector level but also at the individual ticker level what's happening in the market. 03:31:34.140 |
For goodness, you get into confirmatory diligence and you've got 80, 800 vendor contracts and you need to spot early termination clauses. 03:31:44.140 |
You need to understand thematically how is my entire portfolio negotiating their vendor contracts. 03:31:49.140 |
It's frankly not a human level intelligence task. And the reality as we've stepped into this space is that these professionals just get put in a meat grinder. 03:32:04.140 |
Junior analysts are tasked with the impossible on extremely tight deadlines. 03:32:10.140 |
I come from a technical background. Prior to BrightWave I was at Databricks, where I created a language model called Dolly that was one of the earlier models to demonstrate the power of instruction tuning for eliciting instruction-following behavior from open-source technologies. 03:32:27.140 |
And as I have met with these professionals I have developed a deep sense of empathy for the stakes and the human cost of doing this work manually. 03:32:40.140 |
When it comes to the role of the individual in finance workflows and financial research, we think of the parallels to early spreadsheets. 03:32:51.140 |
You go to an accountant or finance professional in 1978 before the advent of computational spreadsheets. 03:32:56.140 |
You say, "What's your job?" Well, I run the numbers. It's cognitively demanding. 03:33:00.140 |
These people write this stuff out by hand on literally wide pieces of paper called spreadsheets. 03:33:05.140 |
It's cognitively demanding. It's important to the business. It's time intensive. It feels like real work. 03:33:09.140 |
And now nobody wants that job. And it's not because there aren't finance professionals. It's not because nobody's doing analysis. 03:33:17.140 |
It's the sophistication of the thought that you can bring to bear on the problem has increased so substantially because there are tools that allow us to think more effectively, more efficiently. 03:33:28.140 |
What we're seeing, what we're hearing from our customers is that a system like Bright Wave that is able to -- and not just Bright Wave. 03:33:38.140 |
What we're seeing is that a system of knowledge agents like that is able to digest volumes of content and perform meaningful work that accelerates by orders of magnitude the efficiency, and also the time to value, in these markets. 03:33:52.140 |
The purpose of this talk is to relate the intelligence that we've developed in the course of building this high-fidelity research agent and just things that we're seeing both technically but also in terms of product affordances. 03:34:07.140 |
I mean the design problem that you have to solve is how do you reveal the thought process of something that's considered 10,000 pages of content to a human in a way that's useful and legible. 03:34:19.140 |
That is not a UI/UX problem. It's not a product architecture problem that existed three years ago. 03:34:25.140 |
And the final form factor has not been determined. Chat, everybody's very target fixated on chat. That's probably not enough. 03:34:35.140 |
So the first thing that I'll observe is that non-reasoning models are performing greedy local search. 03:34:41.140 |
So the Bloomberg talk highlighted that sort of fidelity issue, like a really concrete example. 03:34:45.140 |
You put a Reuters article into GPT-4o and you ask it to extract all the organizations, goodness if it's not going to give you products. 03:34:51.140 |
And if you have a 5 or 10% error rate and you chain calls like that, you're going to compound the likelihood of error in an exponential way. 03:35:02.140 |
And so the winning systems will perform end-to-end RL over tool use calls where the results of the API call are in fact part of the RL sequence of decisions so that you can make locally suboptimal decisions in order to get globally optimal outputs. 03:35:19.140 |
The reality is that that's still an open research problem. You know, how do I avail myself of a knowledge graph or... I did not do that. 03:35:29.140 |
How do you avail yourself of these tools in an intelligent way so that you get globally optimal outputs? 03:35:37.140 |
It does seem like that that is not a solved question. 03:35:39.140 |
So the reality, and I think it's heartening to see this as a theme, and I think everybody in this room can be sort of comforted by this: 03:35:47.140 |
you've got to build a product today, and there's going to be this talk of the bitter lesson, that more data, more compute, better models dominate all other approaches. 03:35:56.140 |
It's like nobody wants an expert system. Nobody wants to use spaCy to do named entity recognition. 03:36:06.140 |
You can think of being more circumspect about the scope of behaviors that the system, the agent, is going to engage in 03:36:13.140 |
sort of like a regularization parameter which constrains the complexity of the model, and that reduces the likelihood that it will go truly off the rails and begin to produce degenerate output. 03:36:24.140 |
You can think of it sort of like multi-turn: the most interesting interactions I've had with language models are deep into a conversational tree. 03:36:33.140 |
Where you can think of selecting at each branch, each response. 03:36:37.140 |
There are a set of reactions that I can have to the model output and I'm steering, I'm choosing. 03:36:43.140 |
This is what, knowing how to use language models, that's, that's a skill. 03:36:46.140 |
Um, and many people who have real full-time jobs may not invest in developing that skill. 03:36:52.140 |
This is not dissimilar to what these RL systems are doing. 03:36:55.140 |
And if you can think of a multi-turn conversation as not just establishing a human-orchestrated chain of thought, but really that set of tokens defines the activations of the model. 03:37:08.140 |
And if you think of the activations of the model as defining a program, then what you are doing when you respond to the model and say, no, not quite like that, more like this, is, if you think of the activations as a point in a vector space, you are nudging the activations to a place where they can finally solve the problem at hand. 03:37:29.140 |
And I think that's what the chain of thought process, or the sort of reasoning monologue is performing. 03:37:35.140 |
It's, it's getting the activations to a position where it can actually solve the problem. 03:37:38.140 |
So it's cute that you can interpret it, but I would prefer if it just got to the right set of activations automatically. 03:37:46.140 |
Um, and so from a product affordance standpoint, people are not going to want to really become prompting experts in a deep way. 03:37:54.140 |
And, and frankly, it takes, you know, easily a thousand hours. 03:37:57.140 |
And so the scaffolding that products put in place in order to orchestrate these workflows and shape the behavior of these systems, I think, means these verticalized product workflows are probably going to be enduring, because they specify intent. 03:38:16.140 |
Um, so some of the things that we see with respect to archetypal design patterns in the space scale. 03:38:26.140 |
You really want to mimic the human decision making process and decompose. 03:38:33.140 |
Well, if I need to understand how this polypropylene resin manufacturer is managing costs, I might look for public market comparables. 03:38:42.140 |
And that would maybe entail going to the SEC filings or earnings call transcripts. 03:38:46.140 |
And I would assess content, potentially from a knowledge graph constructed from previous deals that I, as a private equity investor, have done, 03:38:55.140 |
and from news corpora; assess which documents are relevant to me; distill down from those documents findings that substantiate premises or hypotheses that I might have about this question or this investment thesis; 03:39:09.140 |
and then enrich and error-correct those findings. 03:39:14.140 |
One thing, I forget who it was, but it was the deep research team talking about this: on that next step, what are my intermediary notes? 03:39:25.140 |
What is it that I believed on the basis of what I found? 03:39:28.140 |
That's actually an extremely useful pattern: think out loud about what we believe given the facts as they have materialized on that first pass through the data set. 03:39:38.140 |
Um, enriching individual findings that are distilled down from documents is an extremely powerful, um, design pattern. 03:39:46.140 |
Likewise, um, it's, it's kind of, you can ask these models, you know, is this accurate? 03:39:51.140 |
For that Reuters example, you can say, uh, is this factually entailed by this document? 03:39:58.140 |
Um, and the model can frequently self-correct. 03:40:01.140 |
And what we've noticed is that you can do that in the JSON, as sort of a chain-of-thought behavior, but it's actually more powerful to do it as a secondary call. 03:40:10.140 |
Because the model's kind of primed to be credulous. 03:40:13.140 |
It says, well, you know, I told you it was, and so, yeah, I'm probably right. 03:40:17.140 |
Um, so it's interesting how you can tease apart some of these steps into multiple different, uh, calls. 03:40:22.140 |
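A minimal sketch of teasing the steps apart, assuming a generic call_llm stand-in rather than any particular client: extraction happens in one call, and a separate verification call asks only whether the finding is entailed by the source passage.

```python
# Hypothetical sketch of extraction plus a separate verification call; call_llm is
# a stand-in stub, not any particular vendor client.
def call_llm(prompt: str) -> str:
    # Replace with a real model call; the stub just keeps the example runnable.
    return "YES" if "Answer YES or NO" in prompt else "Revenue grew 12% year over year."


def extract_finding(document: str, question: str) -> str:
    return call_llm(f"Document:\n{document}\n\nAnswer concisely: {question}")


def is_entailed(document: str, finding: str) -> bool:
    verdict = call_llm(
        "Does the document below factually entail the claim? Answer YES or NO.\n"
        f"Document:\n{document}\n\nClaim:\n{finding}"
    )
    return verdict.strip().upper().startswith("YES")


def verified_finding(document: str, question: str) -> str | None:
    finding = extract_finding(document, question)
    # Fresh context: the verifier is not primed by having just asserted the finding.
    return finding if is_entailed(document, finding) else None


print(verified_finding("Acme reported revenue grew 12% year over year.", "How did revenue change?"))
```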
And then through this process of synthesis, you're able to weave together fact patterns across many, many, many documents into a coherent narrative. 03:40:30.140 |
Um, and that control loop, we think that, obviously, human oversight is extremely important. 03:40:35.140 |
Um, the ability to nudge the model, um, with directives or, or sort of selecting, this is an interesting thread, I want you to pull that, is extremely important. 03:40:46.140 |
And that's because the human analyst always is going to have access to information that has not been digitized. 03:40:52.140 |
That's, uh, your portfolio manager thinks this class of biotech is just harebrained. 03:40:57.140 |
Um, that taste making, I think, is going to be where you see, um, the most powerful, uh, products lean. 03:41:04.140 |
I firmly believe, with respect to the nodes in that knowledge graph, and we probably, many people in this room probably have reached this conclusion as well. 03:41:12.140 |
But you still see this, oh, we got a portfolio manager agent. 03:41:14.140 |
And this is the fact checker, and that sort of needless anthropomorphizing of these systems. 03:41:21.140 |
Um, it constrains your flexibility if the design needs of your compute graph change. 03:41:26.140 |
And this is, I think it was 1978, you know, the Unix philosophy: you think about piping and teeing on the bash command line. 03:41:35.140 |
Or I guess I date myself, I still use bash, not zsh. 03:41:38.140 |
Um, just simple tools that do one thing and that work together well. 03:41:47.140 |
Um, so our friends at Latent Space put together this plot. 03:41:51.140 |
With respect to the structure of these graphs, obviously that Pareto frontier, which is the sort of efficiency frontier. 03:42:00.140 |
Um, the efficiency frontier for compute and performance trade-off, or price performance trade-off. 03:42:06.140 |
That frontier is going to continue to move out, but I believe there will, for at least an enduring time, be a frontier. 03:42:12.140 |
And what's notable about that is that you have to select then which tool, which system, which model, 03:42:17.140 |
am I going to use for each node in the compute graph? 03:42:20.140 |
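As one way to picture that selection, here is a minimal sketch of mapping compute-graph nodes to model tiers; the node names, tiers, and rationales are placeholders, not BrightWave's actual choices.

```python
# Hypothetical sketch of per-node model selection on that frontier; node names,
# model tiers, and rationales are placeholders, not BrightWave's actual choices.
NODE_MODELS = {
    "classify_relevance": ("small-fast-model", "high volume, simple decision"),
    "extract_findings":   ("mid-tier-model",   "precision matters, still bulk work"),
    "final_synthesis":    ("frontier-model",   "quality dominates, few calls"),
}


def model_for(node: str) -> str:
    return NODE_MODELS[node][0]


for node, (model, reason) in NODE_MODELS.items():
    print(f"{node:20s} -> {model:18s} ({reason})")
```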
And the reason that this is important is what I call the latency trap. 03:42:28.140 |
If you think about the plot of timed value and realized value for agentic systems, and I think this is extremely important, 03:42:35.140 |
it's very easy to think, oh, man, it's going to do all of these things. 03:42:38.140 |
It's going to, you know, I'm going to check it and error correct, and then, you know, in 25 minutes, it's going to be banger. 03:42:44.140 |
And I think even with high quality products like OpenAI's deep research, it's, hmm, you're not always sure that what you're going to get out is high quality. 03:42:53.140 |
So there's kind of like a question of like, which side of the diagonal? 03:42:56.140 |
It's probably not a straight line, but is that product on? 03:42:58.140 |
But also from a reps standpoint, the impulse response for the user: you can think of the gap between my expectation of what the report is going to look like and what the report actually looks like as the loss. 03:43:11.140 |
And the user's mental model is developing a sense for how do my prompts elicit behaviors from these models. 03:43:17.140 |
If it's an eight-minute feedback loop, or a 20-minute feedback loop, goodness, you're not going to do many of those in the course of a day. 03:43:22.140 |
And your facility with the system and the product is going to be low. 03:43:27.140 |
So, synthesis is really where a lot of the magic happens in these systems. 03:43:35.140 |
So notice, I don't know, has anybody in this room ever had a 50,000-token response from any model? 03:43:43.140 |
They say that, you know, O1 has a 100,000-token output context length. 03:43:51.140 |
And it's because the instruction tuning demonstrations, these synthetic or human-generated outputs that are used to post-train the models, have a characteristic output length. 03:44:01.140 |
It's hard to write 50,000 coherent novel words. 03:44:06.140 |
And so the likelihood that the models are able to produce, I mean, even O1 still is about 2,000, 3,000 tokens. 03:44:16.140 |
And so what happens, it's kind of like a, there's a compression problem. 03:44:19.140 |
If I have a very, very large context window for input, I'm compressing that information into a set of tokens. 03:44:25.140 |
And so it's like the difference between writing a book report and a synopsis of each chapter. 03:44:31.140 |
You can be more focused and specific about what you want those N-thousand tokens to be focused on. 03:44:39.140 |
Here we have, you know, I said, write, write an analysis of the global financial crisis. 03:44:45.140 |
Goodness if I don't think the rise of the shadow banking system warrants more than three sentences. 03:44:50.140 |
And so if you, if you can be more granular and more specific, you can get higher quality, higher fidelity, 03:44:56.140 |
more information dense outputs out of these systems by decomposing your research instructions into multiple sub-themes. 03:45:03.140 |
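A minimal sketch of that decomposition pattern, assuming a generic call_llm stand-in: the broad instruction is split into focused sub-themes, each synthesized separately, then woven back together in a final pass.

```python
# Hypothetical sketch of sub-theme decomposition; call_llm is a stand-in stub and
# the sub-themes here are hard-coded for illustration (in practice they could come
# from a model call too).
def call_llm(prompt: str) -> str:
    return f"(model output for: {prompt[:60]}...)"  # replace with a real call


def decompose(instruction: str) -> list[str]:
    return [
        f"{instruction} -- focus: causes and early warning signs",
        f"{instruction} -- focus: the shadow banking system",
        f"{instruction} -- focus: policy response and aftermath",
    ]


def research(instruction: str, corpus: str) -> str:
    sections = [call_llm(f"{sub}\n\nSources:\n{corpus}") for sub in decompose(instruction)]
    # A final pass weaves the focused sections back into one coherent narrative.
    return call_llm("Combine these sections into one report:\n" + "\n\n".join(sections))


print(research("Write an analysis of the global financial crisis", corpus="..."))
```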
Additionally, the last point I'll make on this problem is that the presence of recombinative reasoning demonstrations 03:45:11.140 |
in the instruction-tuning and post-training corpora is low. 03:45:15.140 |
So it is easy to say: here, given the text of The Great Gatsby, this is the epilogue; 03:45:22.140 |
write a new epilogue for The Great Gatsby, because the cost of internalizing that corpus is fixed. 03:45:28.140 |
Effectively, you read the book and then you write five epilogues and it's like, goodness, I got it. 03:45:33.140 |
Synthesis really is about weaving together disparate fact patterns from multiple documents. 03:45:37.140 |
Think about the applications to biomedical literature synthesis. 03:45:40.140 |
I need to read all of these papers and then have something useful to say that actually brings together the facts from these documents. 03:45:48.140 |
Now there's like a cute trick you could try, which is to say, given the bibliography of any given paper, write the abstract as a post-training exercise. 03:45:58.140 |
But it's just really hard to get high-quality, intelligent, thoughtful analysis of many, many, many different documents. 03:46:05.140 |
And so there are limitations in practice for even state-of-the-art models in terms of how they are able to manage complex real-world situations. 03:46:21.140 |
And being able to understand, you know, something like a merger and an acquisition, you know, these pro forma financial statements are different from those that came before the event. 03:46:34.140 |
If there are addendums to contracts, it's important to propagate, along with the evidentiary passages, metadata that contextualizes: why do I care about this? 03:46:48.140 |
What, how should I consider this in relation to the other pieces of evidence in the context window? 03:46:56.140 |
So I'll now shift a little bit with some examples from the product that we've built, which is how do you reveal the thought process of something that's considered 10,000 pages of text? 03:47:10.140 |
And I think that it is more like a surface. It's kind of like People You May Know, the Facebook and LinkedIn recommendation algorithm for connections, 03:47:28.140 |
which feels uncannily good, in part not because the algorithms are great, they're okay and have gotten a lot better over time, but because in your visual cortex, 03:47:39.140 |
there is a bundle of nerves that are exclusively dedicated to face recognition. 03:47:45.140 |
And the ability to say, you know, in a six by six grid of faces, goodness, I know that person. 03:47:50.140 |
And so you attend to the things that matter, even if it's actually a low precision product experience. 03:47:56.140 |
And so the ability to give the person details on demand is extremely important. 03:48:07.140 |
You know, I think the ability to click on a citation and then get additional context about it, not just what document it's from, but how should I be thinking about this? 03:48:15.140 |
What was the model thinking in the course of this? 03:48:17.140 |
As well as structured interactive outputs that give you the ability to pull the thread and say, well, tell me more about that rising capex spend. 03:48:26.140 |
In Bright Wave, you can highlight any passage of text. 03:48:30.140 |
So it's not just the citations, but you can highlight any passage of text and say, tell me more. 03:48:35.140 |
I think OpenAI gestures towards this with respect to Canvas and the ability to increase the reading level of a passage. 03:48:42.140 |
Having a continuous surface that's not just these citations, but in fact, any finding should be interrogable. 03:48:52.140 |
Likewise, you can think of the set of things that the model has discovered 03:49:12.140 |
as a high-dimensional data structure, and the report is one view on that data structure. 03:49:17.140 |
It's kind of a low effort point of entry into the space of ideas. 03:49:22.140 |
You want to be able to turn over that cube and see, especially in finance, the receipts. 03:49:29.140 |
What's the audit trail for this system that's read all of these materials? 03:49:32.140 |
And so being able to, in this example, click into the documents is one level. 03:49:35.140 |
But having all of the findings laid out for you, whether it's a fundraising timeline, ongoing litigation, 03:49:41.140 |
I'm able to, if something catches my attention, click on it. 03:49:56.140 |
This patent litigation, goodness, that seems important. 03:49:59.140 |
You had a factory fire in Mexico that wiped out, you know, critical single source supplier. 03:50:06.140 |
That ability to drill in and get additional details on demand is extremely important. 03:50:11.140 |
And I think, candidly, we do not yet have the final version, the final form factor of this 03:50:19.140 |
But it's an extremely interesting design problem. 03:50:25.140 |
So, these QR codes. Not only is it a great place to work; we've got people from Goldman Sachs and UBS and Meta and Instagram and Anaplan. 03:50:34.140 |
And we just hired a senior staff software engineer from Brave. 03:50:40.140 |
So I'm going to see a lot more phones come out now. 03:50:43.140 |
$10,000 referral bonus for all of these roles, primarily the product designer and the front-end engineer. 03:50:50.140 |
We're hiring staff and senior staff level professionals. 03:50:53.140 |
We have a small team of extremely experienced individuals. 03:50:56.140 |
And this is structured like the DARPA red balloon challenge, if you're familiar. 03:51:01.140 |
So if you refer the person that refers the person that we hire, you get $1,000. 03:51:06.140 |
And so on and so on and so on, all along that exponentially exploding referral tree. 03:51:11.140 |
So we're Bright Wave, we build knowledge agents for finance workflows. 03:51:16.140 |
Ladies and gentlemen, please welcome back to the stage, MC for the AI Engineer Summit, Agent Engineering Day. 03:51:51.140 |
Thanks to Mike, who already left, for this amazing deep dive. 03:51:57.140 |
If you have any questions to any of the speakers, please find them in the Q&A areas. 03:52:04.140 |
One is right here on this floor, other two on the lower level. 03:52:09.140 |
And I just wanted to say that this session was amazing. 03:52:14.140 |
I'm buzzing with insights, and I hope you got a lot of interesting things to think about. 03:52:27.140 |
We got the sneak peek into allotting Copilot's enterprise multi-agent platform. 03:52:33.140 |
We learned about Jane Street's tooling for OCaml, Bloomberg's challenges in scaling generative AI agents, and BrightWave's knowledge agents. 03:52:48.140 |
All these companies are hiring, so go talk to them if you're interested. 03:52:57.140 |
Ladies and gentlemen, lunch is now being served. 03:53:02.140 |