It's a pleasure to be with you today. My name is Michael Albada and I'm a principal applied scientist at Microsoft. And today I'm going to be presenting on building applications with AI agents. So just as a brief bio, I've been at Microsoft for about two years. I've been one of the key contributors to Security Copilot and the recently announced Security Copilot agents, specifically working in the cybersecurity division.
Before that I spent four years working on machine learning at Uber, lots of big geospatial problems, and I was in startups before that. And this talk is really a distillation of a 300-page book that I have coming out with O'Reilly. The first seven chapters are already up on early release on the platform and it is going to print next month.
And while I'll be focusing mostly on slides and won't get too deep into code for this particular talk, I just want to say that there are full code examples backing everything I'm describing here. So if you want to dive in a little bit deeper, take a look at what this looks like in actual code.
This talk is meant to give you a brief introduction to that. So to give a quick overview of what I'll be covering: I'll talk a little bit about the promise and the obstacles we're seeing so far in agentic development, and I'll go through some of the core components required to build really effective AI systems that actually make it to production.
And then I'll talk through some of the common pitfalls and lessons. There's a tremendous amount of excitement happening here. To take just one data point, if we look at the companies accepted to Y Combinator over the last three years, we've seen a 254% increase in companies describing themselves as agentic or as building agents.
Certainly we're seeing a lot of increased investment. There's a lot of excitement here. But I think we're also seeing that these are really hard problems that we're going after. And so if we're looking at some of the leading agentic benchmarks from academia, we're seeing these are hard tasks that require a sequence of multiple tool calls, multiple executions, and really operating in complex environments.
If we were to go back five or ten years, we'd be in the single digits on some of these tasks. So the fact that we're getting up into the 50s, 60s, and 70s on some of these is really impressive. And I think, just as Clay described, we want to move to where the puck is moving, and it's actually accelerating in that direction.
But if you're going and trying to build your own agentic system, do not expect perfection by any measure. And we're seeing it's quite easy to get to those initial prototype stages that might get you to 70% accuracy, but it's increasingly challenging to go after that long tail of increasingly complex scenarios.
So just to give a brief definition to start, and I'm sure you've heard a few of these at the conference so far: I'm defining an agent as an entity that can reason, act, communicate, and adapt to solve tasks. We treat the foundation model as the foundation, and then we can add these additional components to increase performance and effectiveness in different scenarios.
There's been a lot of discussion about what constitutes an agentic system. And I think Andrew Ng helped clarify this: it's not a binary distinction, it really is a continuum, a spectrum. What I would add on to that is there's a second axis we should consider, which is the effectiveness of the system.
And so I wouldn't think about agency, or the agenticness of your system, as a goal or an end in and of itself. It is a tool to help you solve problems. A classic example of something with a very low degree of agency but a really high degree of efficacy is robotic process automation.
This is a previous generation of automation that's been incredibly helpful and incredibly useful, is used by many companies, and delivers a lot of economic value. The problem is those types of automations are fixed. They're brittle. Small changes to the input can cause the entire thing to break.
So it requires a lot of manual intervention to keep maintaining and updating these. Part of the promise as we move to these more agentic systems is that they're flexible, they're adaptable, and they unlock this additional capability of adapting and responding to those changing inputs. But we want to make sure that any time we're adding agency to our system, as we're moving to the right along that axis, we're staying at that very high level of effectiveness.
We want to make sure that any incremental addition towards agency maintains that high level of performance. What we don't want to do is end up compromising the degree of effectiveness. And I think there's no shortage of bad chatbots that we've seen shipped by many companies that are relatively low on efficacy and relatively low on agency.
But we also want to avoid getting out over our skis and building agentic systems that are low in efficacy. Those are the future news stories that all of us want to avoid. As we design these systems, we want to be delivering real value. So I'll start with tool use.
And I just want to say it's such a powerful idea, and it's really simple in principle. We're working with foundation models. These are autoregressive generative models that predict one token at a time. Typically they're predicting natural language, but they can also output function calls.
And so if you expose tools and functionality to this language model, the entity we're calling an agent, all of a sudden that agent can invoke functions. We're now exposing the full range of tools that we can expose over APIs. So just think about the incredible functionality you can make available here, but also the risk and challenge that comes along with it.
And so it requires a great degree of discernment and responsibility to think about which of those functionalities we're going to expose, and in what way, so that we can deliver value to our customers. And, of course, this also operates in a loop. We apply a parser to the output text.
We invoke the tool, and we get some response back, some observation. That's a way of providing more information to the agent that it can use to solve problems. We then continue this in a loop until we produce our final output and return it to the customer. As you're thinking about designing and building tools, one really common fallacy I see is thinking that there's a one-to-one mapping between your APIs and your tools.
If you're working in an organization that has 300 APIs, please do not register 300 tools with your agent. It will get really confused. And something we've seen empirically is that the more tools that you expose to an individual prompt, to an individual completion call, the less accuracy you see overall.
There's more semantic collision between those different tools. So if at all possible, reduce the number of tools you're exposing at any single time and really try to group them together in logical ways. You want to keep the scope really specific, with clear, specific names and descriptions. And each tool should really feel like a single human-facing action.
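To make that concrete, here's a minimal sketch of the parse, invoke, observe loop with a couple of narrowly scoped tools. This is illustrative rather than the exact code from the book: the call_model helper, the TOOLS registry, and the specific tool names are all hypothetical.

```python
import json

# Two narrowly scoped tools with clear names and descriptions, not a
# one-to-one mirror of every API the organization exposes.
def lookup_incident(incident_id: str) -> dict:
    """Return the stored record for a single security incident."""
    return {"id": incident_id, "severity": "unknown", "source_ip": "203.0.113.7"}

def enrich_ip(ip: str) -> dict:
    """Return reputation data for an IP address."""
    return {"ip": ip, "reputation": "suspicious", "seen_in_campaigns": 3}

TOOLS = {
    "lookup_incident": lookup_incident,
    "enrich_ip": enrich_ip,
}

def call_model(messages: list[dict]) -> str:
    """Placeholder for the foundation-model call (OpenAI, Azure OpenAI, etc.).
    Assumed to return either a JSON tool call or a final natural-language answer."""
    raise NotImplementedError

def run_agent(user_input: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        output = call_model(messages)
        try:
            # The parser: look for a structured tool call in the model output.
            call = json.loads(output)
            tool = TOOLS[call["tool"]]
        except (json.JSONDecodeError, KeyError):
            # No tool call found: treat the text as the final answer.
            return output
        observation = tool(**call.get("arguments", {}))
        # Feed the observation back so the next completion can use it.
        messages.append({"role": "assistant", "content": output})
        messages.append({"role": "tool", "content": json.dumps(observation)})
    return "Step limit reached; escalating to a human reviewer."
```

The key points are that each tool has a tight scope and a descriptive name, and that observations flow back into the loop until the model produces a final answer or hits a step limit.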
So now that you've exposed this rich functionality and tools to your agent, you need to think about how you're going to invoke it and what the orchestration pattern is going to look like. I would recommend that you keep it simple. And in particular, just a huge amount of great work can be done with these standard workflow patterns.
If your task can fall into a single chain, please do that. It will make it easier to measure, keep your costs down, keep your reliability up, and let you deliver value to customers more easily. You can also apply different types of branching logic.
You can rely on the LLM to choose which path through the tree to take. I work in cybersecurity, and these types of patterns work really nicely for deciding what the severity of an incident is and what additional information we need for that multi-hop enrichment and reasoning.
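As a rough illustration, here's a minimal sketch of that kind of LLM-driven branching for incident triage. The classify_severity helper is hypothetical and stands in for a completion call that is constrained to answer with exactly one label.

```python
# One constrained classification call chooses a path, and each path is an
# ordinary, testable chain of fixed steps.
VALID_SEVERITIES = {"low", "medium", "high"}

def classify_severity(incident: dict) -> str:
    """Hypothetical: ask the model for a single severity label (stubbed here)."""
    label = "high"  # in practice, a completion constrained to VALID_SEVERITIES
    return label if label in VALID_SEVERITIES else "medium"

def handle_incident(incident: dict) -> dict:
    severity = classify_severity(incident)
    if severity == "low":
        return {"action": "log_and_close", "incident": incident}
    if severity == "medium":
        # A short, fixed enrichment chain: look up the asset, then the user.
        return {"action": "enrich_and_queue", "incident": incident}
    # High severity: multi-hop enrichment plus escalation to an analyst.
    return {"action": "escalate_to_analyst", "incident": incident}
```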
That kind of branching is incredibly valuable. Moving to a full agentic pattern puts more power into the hands of the model. You're relying on it to choose which actions to invoke, and to do that repeatedly, in order to get the work done. That's harder to measure, and it's harder to get full performance out of it.
But remember the bitter lesson. If you're getting to a point where the chains and trees you're building out are becoming so complex and convoluted that they're difficult to maintain, that's probably a sign you want to move to a more agentic pattern that will be easier to maintain long term.
And at that point you might want to start considering some additional fine-tuning on your model. The other pattern we've found incredibly useful, and that I think we'll see scale in the future, addresses a common mistake: too many teams are relying on the large language model itself to apply the logic they want applied.
So if you have some fixed business logic, say you only want to take an action if conditions A, B, and C are all true, what you can do is expose tools to your agent that update each of those pieces of state, with sanitization and validation on each update. You keep the decision in deterministic business logic that you can maintain over time, and you keep the state external to the model.
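Here's a minimal sketch of that pattern, with made-up condition names: the tools exposed to the agent only update state, each update is validated, and the final decision lives in plain deterministic code outside the model.

```python
from dataclasses import dataclass

@dataclass
class ApprovalState:
    identity_verified: bool = False
    amount_within_limit: bool = False
    manager_approved: bool = False

STATE = ApprovalState()

# Tools exposed to the agent only update state, with validation on each write.
def set_identity_verified(verified: bool) -> str:
    STATE.identity_verified = bool(verified)
    return f"identity_verified={STATE.identity_verified}"

def set_amount(amount: float) -> str:
    if amount < 0:
        return "rejected: amount must be non-negative"
    STATE.amount_within_limit = amount <= 10_000
    return f"amount_within_limit={STATE.amount_within_limit}"

def set_manager_approved(approved: bool) -> str:
    STATE.manager_approved = bool(approved)
    return f"manager_approved={STATE.manager_approved}"

# The actual decision is fixed, deterministic business logic; the model never
# gets to decide on its own whether A, B, and C were satisfied.
def execute_if_allowed() -> str:
    if STATE.identity_verified and STATE.amount_within_limit and STATE.manager_approved:
        return "action executed"
    return "blocked: preconditions not met"
```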
That way you can ensure the action is only taken when those conditions are actually met. There's also been a real interest in moving from single-agent to multi-agent systems. The best reason I know to break a single-agent system down into a multi-agent system is exactly the problem I described earlier with tool calls.
If you just start dumping too many tools into a single prompt, it will get overwhelmed. But if you break that larger set of tools into semantically similar groups, register each group with an individual agent, and then rely on a coordinator to route each task to the appropriate agent, that's a great way to continue to scale as your number of tools grows and to handle a wider variety of scenarios.
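A minimal sketch of that coordinator pattern might look like the following; the specialist names, tool groups, and the route helper are all hypothetical placeholders.

```python
# Each specialist agent owns a small, semantically related tool group, and a
# cheap routing step picks the specialist for a given task.
SPECIALISTS = {
    "identity": ["lookup_user", "reset_credentials"],
    "network": ["enrich_ip", "block_ip"],
    "malware": ["detonate_sample", "lookup_hash"],
}

def route(task: str) -> str:
    """Hypothetical LLM classification call returning one key from SPECIALISTS."""
    return "network"

def run_specialist(name: str, task: str) -> str:
    tools = SPECIALISTS[name]
    # In practice this runs the single-agent loop from earlier, with only this
    # specialist's tools registered in its prompt.
    return f"{name} agent handled {task!r} using {tools}"

def coordinator(task: str) -> str:
    specialist = route(task)
    return run_specialist(specialist, task)
```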
There's also been a lot of talk about the agent-to-agent protocol. It's a really exciting future direction. That said, most of the successful multi-agent systems I've seen have been built by a single team that can coordinate everything itself. What the agent-to-agent protocol is really reaching towards is a future where different teams building different agents are able to discover each other, coordinate, and work together.
I think we will see more of that, but it's really early days and there are plenty of additional technical and security questions to work through. This brings us to evaluation. My general recommendation, both for my team and I think for just about everyone, is to invest more in evaluation.
It's gotten so easy to build, and it is very easy to get to your 70% or 80% accuracy. But there are so many hyperparameters to choose. How many agents? How many tools do you expose? Which model do you use? What type of memory do you want to use? All of those questions are almost impossible to answer without a high-quality, rigorous evaluation set.
So I really encourage everyone to spend more time on this than you think you need. And really, it's moving us towards a type of test-driven development with agents. Your agent then becomes defined in terms of the inputs and the outputs that you're expecting. And there's a whole range of tools that you can use to automatically improve relative to that.
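As a sketch of what that test-driven setup can look like, here's a tiny evaluation harness. The cases and the keyword-based scorer are placeholders; in practice you'd likely swap in a model-as-judge or a task-specific grader, but the shape of the loop stays the same.

```python
# The agent is defined by the inputs and outputs we expect from it.
# agent_fn is any callable mapping an input string to an output string,
# for example the run_agent function from the earlier sketch.
EVAL_SET = [
    {"input": "Summarize incident INC-1042", "must_contain": "INC-1042"},
    {"input": "Is 203.0.113.7 malicious?", "must_contain": "203.0.113.7"},
]

def score(output: str, case: dict) -> bool:
    """Placeholder scorer: a simple keyword check."""
    return case["must_contain"].lower() in output.lower()

def run_eval(agent_fn) -> tuple[float, list[dict]]:
    failures = []
    for case in EVAL_SET:
        output = agent_fn(case["input"])
        if not score(output, case):
            failures.append({"case": case, "output": output})
    accuracy = 1 - len(failures) / len(EVAL_SET)
    # Keep the failures: they're the raw material for clustering,
    # summarization, and the next round of improvements.
    return accuracy, failures
```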
So labeling, I think, became a bit of a bad word from all of our time in machine learning; it was the thing we outsourced to Mechanical Turk. I don't think that's something we can do anymore. The AI architects and AI engineers who are building agents really need to take more ownership and responsibility for exactly what they want their agents to do.
And so spending time defining those inputs and outputs can really help accelerate your team through all of those hard decisions as you move forward, especially because the ground is changing under us in terms of models and frameworks. So you take those user inputs, you run them through your agent, and you get your outputs.
There needs to be some amount of human review. Then you take those new additions and add them to your evaluation set. Once you have this evaluation set, you can run it through your agent and go through the evaluation loop. You can analyze your failures, and you probably want to do some type of clustering and summarization on those outputs.
And you can suggest improvements. If this looks like a lot of work, fortunately there are fantastic tools to help you with this, and it's not as hard as it looks. So here are a couple of incredible open-source libraries I recommend. IntellAgent is great at generating additional synthetic inputs.
Let's say you don't have access to your raw user data because of security or privacy concerns, or let's say you're building something that hasn't shipped yet. Synthetic data can get you a long way. Microsoft has also open sourced PyRIT, which is great for red teaming agents. It's a fantastic idea to run it before you ship: it will try to jailbreak, red team, and otherwise compromise your agent.
It's a great strategy to take. Label Studio is a great framework to help you build up these evaluation sets. And then there's this whole rich set of tools for automatic prompt optimization and automatic prompt engineering: Trace, TextGrad, DSPy. All of these allow you to treat your prompts and flows as tunable parameters and calculate a kind of gradient using a foundation model as a judge.
So they can look at your failures and automatically suggest changes back through your flows to improve your system. That's instead of you manually looking at individual examples and saying, well, maybe this change will work, let me run it through; there's a lot of that development-by-anecdote happening right now.
And I just encourage all of us to build up these evaluation sets, run them in batches, and analyze at the aggregate level, because it allows us to take more intelligent steps. To borrow the parallel from optimizing a neural network, slightly larger batches help us take more accurate steps towards the global minimum.
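In that spirit, here's a minimal sketch of batch-level failure analysis: rather than eyeballing one anecdote at a time, it hands a whole batch of failing cases (in the shape produced by the earlier harness) to a judge model and asks for themes plus concrete prompt edits. The judge parameter is any prompt-in, text-out callable; it's a placeholder, not a specific library API.

```python
from typing import Callable

def suggest_improvements(
    failures: list[dict],
    current_prompt: str,
    judge: Callable[[str], str],
) -> str:
    """Ask a judge model to cluster failures into themes and propose prompt edits."""
    report = "\n".join(
        f"- input: {f['case']['input']}\n  output: {f['output']}" for f in failures
    )
    judge_prompt = (
        "You are reviewing failures from an agent evaluation run.\n\n"
        f"Current system prompt:\n{current_prompt}\n\n"
        f"Failing cases:\n{report}\n\n"
        "Group the failures into a few themes, and for each theme propose a "
        "specific edit to the system prompt or tool descriptions."
    )
    return judge(judge_prompt)
```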
This brings us to observability. I think the iceberg is a fantastic metaphor here. We're working with generative models, and they're very good at generating content. That's a very good thing, but it's also a real challenge as we think about evaluating these systems at scale. As soon as you deploy these and get them into the hands of customers, it becomes really hard to understand what's actually happening out there and to see the full range of failures and use cases.
So I encourage you to use tools like OpenLLMetry and OpenTelemetry integrations. You really want detailed logs and tracing, and probably some way of doing additional clustering and automated summarization, so you understand the main categories of failure modes and can optimize and improve your system more easily.
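For instance, here's a minimal sketch of wrapping a tool call in plain OpenTelemetry spans; OpenLLMetry layers LLM-specific auto-instrumentation on top of these same primitives. The span and attribute names here are just illustrative, not a standard schema.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for the sketch; in production you'd point this
# at your tracing backend instead.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("agent")

def traced_tool_call(tool_name: str, tool_fn, arguments: dict):
    """Wrap a single tool invocation in a span with useful attributes."""
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("agent.tool.name", tool_name)
        span.set_attribute("agent.tool.arguments", str(arguments))
        result = tool_fn(**arguments)
        span.set_attribute("agent.tool.result", str(result))
        return result

# Example usage with a tool from the earlier sketch:
# traced_tool_call("enrich_ip", enrich_ip, {"ip": "203.0.113.7"})
```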
And now this brings us to just a few of the common pitfalls we've seen, both internally and in speaking with folks outside of Microsoft. Insufficient evals are far and away the biggest limitation and challenge that I see. But also, on the tool side, maybe you haven't built enough tools.
Maybe the descriptions are not sufficiently accurate or clear. Maybe there's too high a degree of semantic overlap between your tools, so individual completion calls are getting confused between them and leading to worse outcomes than you expect. And then excessive complexity: there are so many bells and whistles these days.
It's very easy to go chasing those other things. I just encourage us all to stay really focused on the principles, really focused on what we're trying to achieve, and to only add additional complexity if we've actually tested it and made sure it's providing a better experience for our users and customers.
And then there's the lack of learning. Tightening up the learning loop is really challenging, because all of this generated content is hard to sift through. So really focus on getting down to root causes and suggesting improvements that will result in a better system. The final thing I'll add, coming from the cybersecurity division: this is such an exciting time for this technology.
I think it's going to help us in so many ways. But agentic systems are a new class of potential vulnerability. So I just encourage all of us to really design for safety at every layer. PyRIT can definitely help on many layers of this, but good software engineering and good principles are really critical here.
And make sure that you're building tripwires and detectors at different stages of your agentic stack so that you can eject and fall back to human review in all of the critical cases. So this brings me to the end of my talk, and I'll close with a quote that I love from Paul Krugman: "Productivity isn't everything, but in the long run it is almost everything.
A country's ability to improve its standard of living over time depends almost entirely on its ability to raise its output per worker." I really think we're at the beginning of an upshift in the amount of work that every single one of us can accomplish. And I think this new design pattern for agents is going to help each of us accomplish more.
And I'm really excited about what we're going to be able to do together. Thank you so much.