Hi everyone, my name is Victor Dibia. I'm a principal research software engineer at Microsoft Research, and my background is mostly in human-AI experiences, which is what I'm interested in right now. Over the last few years, I've been looking at scenarios where a human works in tandem with an AI agent to solve problems.
One of the things I've worked on at Microsoft Research is GitHub Copilot. How many of you have used GitHub Copilot? Excellent. In my opinion, it's the first example of an AI model working at scale inside an IDE, helping developers solve problems.
Beyond that, more recently, I've spent my time working on an open-source multi-agent framework called AutoGen. How many of you have heard of AutoGen? Okay, great, about half of the room. As part of that, I've also helped build AutoGen Studio, which is a local developer tool for building out multi-agent workflows.
Previously, I worked at Cloudera as a machine learning engineer, and before that at IBM Research as a research staff member focusing on human-computer interaction. Okay, so how did I get into agents? I'll give a brief history. Sometime around August 2022, about four months before ChatGPT took off,
I had worked on a project called LIDA. Essentially, the tool let you drag some data, a CSV or JSON file, into a web user interface, and it did a few things. First, it came up with a summarization of the data. Next, it generated a set of questions you could ask about that data.
For each of those questions, we generated code, executed it, did some processing and error recovery, and then showed the user a set of visualizations. If you look at it, it actually is an agentic workflow, and a fairly early one at that.
It had these main components: summarization, goal exploration, and visualization generation, which was essentially a code interpreter built into the system. Once we had those visualizations, we used an image generation model to come up with more diverse, stylized representations of the dataset.
An interesting thing about the system is that the first version used the OpenAI Davinci and Codex models. Does anybody remember those? That was a really long time ago. When it first shipped, the error rate was about 20%.
Then, about three months later, the GPT-3.5 Turbo models came out. We tuned the system against them and the error rate went down to about 1.5%. The fun fact there is that this showed these kinds of applications were possible, and today you see these capabilities across many Microsoft products, and products beyond Microsoft.
Fast forward from there: a few colleagues and I started to think about how, instead of building these hand-crafted workflows, we could build multi-agent applications where you define agents that exchange messages and self-organize to explore a problem space. That's where AutoGen came about.
I'd encourage you to look through it. It's a framework for building multi-agent applications, and it's pretty widely used, with tens of thousands of stars on GitHub. The more interesting thing I did there, I think, is AutoGen Studio, which is a nifty low-code tool.
In this case, essentially what happens is that you sign into a web interface where you can compose multiple agents. For example, you create a team and drag a set of agents into that team, and for each team you have primitives like models and tools that you can compose together to build multi-agent applications.
When I started to prepare this talk, one of the things I wanted to do was walk you through all of the capabilities of AutoGen Studio, how we built it, and the design philosophy behind it. But I thought, you know, there are already a bunch of resources out there for that.
And in general, this is AI.Engineer, so how about we build something from scratch and show that today? Maybe you shouldn't do that, but we're going to do it anyway. The tool I'm going to show today is something called BlenderLM.
It's a multi-agent system built from scratch, no frameworks, nothing. The idea is that it helps you with 3D tasks. You could go to this tool and say something like, build a scene with a ball on a table, and it does all of the plumbing underneath and drives a Blender instance to accomplish that.
Anyone here familiar with Blender? Okay, great, awesome. So here's the plan for today. I have about 13 minutes left. I'll show you a demo, walk you through how I built it, and then at the end we'll synthesize a set of design principles that underpin a good user experience for a tool like this.
And then finally, we'll settle on a few takeaways. Okay, let's go. In terms of background, how did I settle on BlenderLM? About two years ago, I wanted to learn how to use Blender, and of course, if you've tried to use Blender, there's a really popular tutorial called the donut tutorial.
That's what you're looking at here. It's kind of deceptive, because the tutorial itself takes about four hours, but at the end of the day you need about 40 or 50 hours just to get through the whole thing. You're trying to learn where things are, where things live,
how to use the tool, and then you need to learn all of the concepts underneath. So one of the things I asked myself was: with all that I know about agents, with all my experience building AutoGen, can I create an agentic workflow that will help me go from natural language to, let's say, something that looks like this?
The prototype is not at this level of quality yet, but I think it can get there. So the next question is, how do you express this as a multi-agent workflow? You have a couple of options. Do you build a fixed workflow? If you've been at this conference, you've seen people debate the pros and cons of a fixed, deterministic workflow, where you know exactly what all the steps are.
And this is great. We use a lot of that in production today. You can build reliable systems, take advantage of things like function calling and structured output, and build really valuable systems. However, it requires that you know the exact solution to the problem, because what you're doing is expressing that solution as a workflow.
But there's a class of problems, like the kind of thing we want to address here, where you don't know the exact solution, because every time you take an action, say clicking something in Blender, the entire state of the scene changes, and you have to react to that in some way.
On the other end of the spectrum, and what I'm going to focus on today, are more autonomous, exploratory systems. What that means is a system where an LLM drives the flow of control: we have tools, we take actions, we inspect the results, we observe, and then we make progress.
Okay, so there are three characteristics to keep at the back of our minds. The system should have a bit of autonomy, so it might not address just a single task but many different tasks. It should be able to take actions, and an action here can have side effects.
For example, you could try something, call a tool, and it could return a result you don't expect, and your system should be able to handle that. And finally, these systems are expected to explore complex tasks, break them down into steps, and run for extended periods of time.
Okay, so let's switch to a quick demo. This is the BlenderLM interface. It's a web application, and what's going on here is that it's connected over a WebSocket connection to an actual Blender instance. This is Blender, a software tool for building 3D content.
The first thing you'll notice is that we have a set of fixed tools the developer can use directly, and I'll tell you why we need that in a second. For example, I can click a button to clear the scene, and because we have a socket connection, we can stream exactly what's going on in the Blender interface.
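Under the hood, a fixed tool like that is just a small command message sent to a Blender add-on over the socket. Here's a minimal sketch of the client side; the endpoint, command name, and message fields are hypothetical, not the actual BlenderLM protocol.

```python
# A minimal sketch of what a fixed tool like "clear scene" can look like on
# the client side. Endpoint and message fields are hypothetical.
import asyncio
import json

import websockets  # pip install websockets


async def clear_scene(uri: str = "ws://localhost:8765") -> None:
    async with websockets.connect(uri) as ws:
        # Ask the Blender add-on to run its predefined "clear_scene" command.
        await ws.send(json.dumps({"command": "clear_scene", "params": {}}))
        # The add-on streams status updates back until the command completes.
        async for raw in ws:
            event = json.loads(raw)
            print(event)
            if event.get("status") in ("done", "error"):
                break


if __name__ == "__main__":
    asyncio.run(clear_scene())
```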
The scene is now clear, and we can show that in the UI. Next, let's go to a list of predetermined examples here. Maybe I'll ask the system to create two balls with a shiny, glossy, silver finish. And essentially what we see is that a bunch of activities start to occur.
They're streamed to the UI in real time. If we scroll up just a bit, it says we're analyzing the task and we've come up with a plan; there's a planning agent underneath. It says that in the first step, we're going to set up the scene environment by adding a ground plane.
We're going to create two spheres with correct spatial separation, assign a glossy silver finish, and all of that. And we can see in real time that the first thing it's done is put in that plane. If you look at Blender, we see the horizontal plane here, and all of this is running live.
We probably shouldn't do live demos in a talk, but hey, we're trying to be brave here. And then it explores. Essentially, each time it takes a step, it calls a bunch of tools, executes them, and we stream the update to the user interface.
Then we have a verification loop. There's a verify agent that takes a snapshot of the scene, an actual log of what's in the scene plus a visual representation, and we use an LLM to judge: are we making progress? Are we stalled? We then use that information to decide what to do next.
And we can see that we have a ball here, which is actually what we wanted. We can look at that in Blender and tweak it around. Hey, look at that, it works. I think we deserve a little applause here. Come on.
There we go. I won't explore further; I have about eight minutes left, so let's move on. So how is all of this built? Let's walk through the process really fast. Most of the time, when people think of a system like this, the first thing they'll probably say is, let's define the agent.
That's really not what you should do first. First, define the goal. Pretty simple. Next, come up with a baseline, which probably has nothing to do with agents or AI; create it and just ensure everything works correctly. Third, build out your tools. What tools does this agent need?
If a human were going to do this, what tools would they need to accomplish it? Then, still not the agent yet: define a testbed. How do we evaluate how this thing works? And finally, when you have all of that, that's when you go ahead and build the agent.
Step one is really simple. What we want here is to translate natural language tasks into 3D artifacts. Next, we create a baseline: the hello world of Blender. We write a script, we run it, and it adds a single cube to the scene.
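As a concrete baseline, here's roughly what that hello-world script can look like. It uses the real Blender Python API (bpy) and runs inside Blender, for example headless with `blender --background --python hello_cube.py`; the file name is just a placeholder.

```python
# hello_cube.py -- the "hello world" baseline: clear the default scene and
# add a single cube. Run it inside Blender, e.g.:
#   blender --background --python hello_cube.py
import bpy

# Remove whatever objects are in the default scene (camera, light, cube).
bpy.ops.object.select_all(action="SELECT")
bpy.ops.object.delete()

# Add one cube at the origin.
bpy.ops.mesh.primitive_cube_add(size=2.0, location=(0.0, 0.0, 0.0))
print("Objects in scene:", [obj.name for obj in bpy.data.objects])
```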
What we need for that is a Blender add-on and a client library that can handle the socket connection, all of that. This is really valuable for rapid prototyping and testing. Next, we need to define a set of tools, and there are two types. The first type is task-specific tools.
For example, something direct that just creates a Blender object and does nothing else. Then you might have a general-purpose tool, something that executes arbitrary code. In that case, you get your LLM to generate code, you execute it, and that's what drives all the capabilities in Blender.
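Here's a minimal sketch of what those two flavors of tool can look like, expressed as OpenAI-style function-calling schemas. The function names, the schemas, and the send_to_blender() helper are illustrative, not the actual BlenderLM API.

```python
# Two flavors of tools: task-specific and general-purpose code execution.
# Names, schemas, and the send_to_blender() helper are illustrative.

def send_to_blender(message: dict) -> dict:
    """Placeholder: in the real system this forwards the message to the
    Blender add-on over the WebSocket connection and returns its reply."""
    print("would send:", message)
    return {"status": "ok"}


def add_primitive(kind: str, location: list[float]) -> dict:
    """Task-specific tool: add a single primitive (cube, sphere, plane)."""
    return send_to_blender({"command": "add_primitive",
                            "params": {"kind": kind, "location": location}})


def execute_blender_code(code: str) -> dict:
    """General-purpose tool: run arbitrary Python (bpy) code inside Blender."""
    return send_to_blender({"command": "execute_code", "params": {"code": code}})


TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "add_primitive",
            "description": "Add a primitive object to the Blender scene.",
            "parameters": {
                "type": "object",
                "properties": {
                    "kind": {"type": "string", "enum": ["cube", "sphere", "plane"]},
                    "location": {"type": "array", "items": {"type": "number"}},
                },
                "required": ["kind", "location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "execute_blender_code",
            "description": "Execute arbitrary Python (bpy) code in Blender.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
]
```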
One thing to note: your agent is only ever as good as the tools you give it, so spend a lot of time, about 50% of your time, on tools. And you can test all of this in plain code first. Next, you want to build an eval testbed.
In this case, it's three stages. V1 is just a Jupyter notebook: we write all the code and test it there. Next, we create a full interactive web UI, which is the kind of thing I just showed. And third, we probably want to create an automated eval suite, with metrics and a full evaluation harness.
Then, finally, to create your agent, the first thing you want to do is create a base agent loop. If you've been at this conference, you know that an agent is mostly an LLM in a tight loop with a bunch of function calls. You create that, you get your final result, and then you're fine.
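A minimal sketch of that base loop, assuming the OpenAI Python SDK and the illustrative tool definitions from the earlier sketch; the model name and step limit are placeholders.

```python
# A minimal "LLM in a tight loop with tool calls" sketch using the OpenAI
# Python SDK. Assumes TOOL_SCHEMAS, add_primitive, and execute_blender_code
# from the earlier sketch; the model name is a placeholder.
import json

from openai import OpenAI

client = OpenAI()
TOOLS = {"add_primitive": add_primitive,
         "execute_blender_code": execute_blender_code}


def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": "You control Blender through the provided tools."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOL_SCHEMAS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:          # no more tool calls: the agent is done
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:     # execute each requested tool call
            result = TOOLS[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return "Stopped after reaching the step limit."
```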
But typically, for a problem like this, that is not enough, and you need to iterate just a little bit more. In this case, we have two other agents. The first is called the verify agent. What it does is that every time the agent takes a step, we take the contents of the scene, a list of all the objects in it, and use an LLM to judge: are we making progress?
Is the user's task complete? Then we decide how to move forward. The second agent you want here is a planner. You saw earlier that when the task came in, the planner broke it down into atomic steps, and then each of those steps was addressed in this tight loop.
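Here's a rough sketch of what the verification step can look like: one part runs inside Blender and captures the scene state, the other asks an LLM to judge progress. The prompt, model name, and file path are placeholders, not the actual BlenderLM code.

```python
# A sketch of the verification step. capture_scene_state() runs inside
# Blender (real bpy calls); judge_progress() runs on the agent side.
import json


def capture_scene_state() -> dict:
    """Runs inside Blender: list every object plus a rendered snapshot."""
    import bpy
    bpy.context.scene.render.filepath = "/tmp/blenderlm_snapshot.png"
    bpy.ops.render.render(write_still=True)
    return {
        "objects": [
            {"name": o.name, "type": o.type, "location": list(o.location)}
            for o in bpy.data.objects
        ],
        "snapshot_path": bpy.context.scene.render.filepath,
    }


def judge_progress(task: str, scene_state: dict) -> dict:
    """Ask an LLM: is the task done, are we making progress, or stalled?"""
    from openai import OpenAI
    client = OpenAI()
    prompt = (
        f"Task: {task}\n"
        f"Scene objects: {json.dumps(scene_state['objects'])}\n"
        'Reply with JSON: {"status": "complete" | "in_progress" | "stalled", "reason": "..."}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```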
So, what can we learn from all of this? The design principles I'm going to give you are not exhaustive and they're not perfect. In fact, if you meet someone who tells you they know the exact design principles for multi-agent system design, you probably shouldn't trust them, because the space is just too early for that.
What I'm going to give you today is a set of four high-level ideas that you can take and, as you build this kind of system, apply to see how you can improve your own systems. The first principle is capability discovery.
The idea is that because you have an agent, it can do a whole bunch of things, but there are only a few things it can do with high reliability. You saw earlier that I had this little pill in the UI that showed the things the agent can do.
So you want to itemize the kinds of things your agent can do with high reliability. The second thing you can do is offer proactive suggestions based on user context. Say we have a scene open: you can parse the scene and suggest to the user some high-level things they could accomplish.
This is an example of that. The second principle is observability and provenance. Stream all of the activity logs, help the user make sense of what the agent is doing, and provide tools for debugging. All the little things, like the number of tokens used and the amount of time taken for each step, are very valuable for helping the user make sense of what the agent is doing.
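As an illustration, the kind of structured activity event the backend can stream to the UI for each step might look like this; the field names are assumptions, not the actual BlenderLM event format.

```python
# A sketch of a structured activity event streamed to the UI per agent step.
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class AgentEvent:
    agent: str                      # e.g. "planner", "executor", "verifier"
    action: str                     # e.g. "tool_call:add_primitive"
    detail: str                     # human-readable description
    prompt_tokens: int = 0
    completion_tokens: int = 0
    duration_s: float = 0.0
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


# In the real system this JSON line is pushed over the WebSocket to the UI.
event = AgentEvent(agent="executor", action="tool_call:add_primitive",
                   detail="Added ground plane", prompt_tokens=812,
                   completion_tokens=64, duration_s=1.9)
print(event.to_json())
```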
The third is interruptibility. At this point, your agent is taking all kinds of actions, so you want to design your system such that you can pause it at any time. It might be going down the wrong route, about to make a mistake, or about to consume a bunch of resources you didn't intend.
So you want a system that enables things like checkpointing, rollback, pause, and resume. And finally, cost-aware delegation. Every time an agent takes an action, from the LLM's perspective all actions are equal, all tool calls are equal, unless you do something about it. In the case of Blender, you might have it write some Python code that, say, adds something to the scene.
But if for whatever reason it tries to delete your entire operating system, you really don't want that to happen. So you want a module that actively inspects the proposed action, tries to estimate its cost, and knows when to delegate to the user.
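A toy sketch of that idea: before executing LLM-generated Blender code, estimate how risky it is and hand control back to the user above a threshold. The patterns, weights, and threshold are illustrative and nowhere near a complete safety mechanism; execute_blender_code is the general-purpose tool from the earlier sketch.

```python
# Cost-aware delegation sketch: score the riskiness of generated code and
# delegate to the user when it crosses a threshold. Illustrative only.
import re

RISKY_PATTERNS = {
    r"\bos\.|subprocess|shutil\.rmtree": 1.0,    # touches the OS / filesystem
    r"bpy\.ops\.wm\.(save|open)_mainfile": 0.7,  # overwrites or reloads files
    r"\.delete\(|select_all": 0.4,               # destructive scene edits
}


def estimate_risk(code: str) -> float:
    return max((w for p, w in RISKY_PATTERNS.items() if re.search(p, code)),
               default=0.1)


def run_or_delegate(code: str, threshold: float = 0.6) -> None:
    risk = estimate_risk(code)
    if risk >= threshold:
        answer = input(f"Proposed action has risk {risk:.1f}. Run it? [y/N] ")
        if answer.strip().lower() != "y":
            print("Action cancelled; delegated back to the user.")
            return
    execute_blender_code(code)  # general-purpose tool from the earlier sketch
```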
I'm getting toward the end, so what are some of the key takeaways? The first is knowing when to use a multi-agent approach. A multi-agent approach is not always the right thing to use: when you have multiple agents collaborating and you give them a bunch of autonomy, you also increase the surface for error.
So, like with any other tool, you should inspect the problem space and verify whether a multi-agent system is actually the right tool for the job. I always show this little graph: if the big circle is the set of tasks most engineering teams need to do, then this small circle at the bottom is the set of tasks that truly benefit from a multi-agent approach.
It's really that small, and before you try to build an autonomous multi-agent system, think very carefully and ensure that you have good ROI on this specific approach. The next question I get is: how do I know if my task might benefit from a multi-agent approach? I typically offer a five-step framework.
The first is planning: does the task benefit from planning? Can you take a high-level input from a user and meaningfully break it down into a set of steps that take you from an unsolved state to a solved state? Next, can you break the task into multiple perspectives or personas?
In this case, you might have a persona that handles planning of the task and other personas that handle, say, code execution. With a multi-agent approach, you can explore this kind of domain-driven design. The third is: does the task require consuming or processing extensive context?
Here, we are constantly capturing the state of the app, screenshots, all of that, and it's useful to have individual agents each process one of these large pieces of context and then return it to some final coordination agent. And finally, adaptive solutions: as you take actions in the environment your agent lives in, the environment might change, and you have to constantly react to that.
So you might need an autonomous multi-agent approach there. The second takeaway is eval-driven design. Most people want to start out by just building the agent; that's typically a mistake. Instead, define your task, define your evaluation metrics, build a baseline that has nothing to do with agents, and improve your agents iteratively.
In the case of this app, we started with a simple tight loop, then we improved it, added a verification agent, and then added a planning agent. And based on the interactive evaluation tool I built, I could see that each of these additions actually had ROI and improved things, and that's why it makes sense to explore a multi-agent approach in this space.
And the final thing is that academic benchmarks are great, but they're not your task, so you really should build evals that are attuned to your task. The last set of key takeaways are the design principles we walked through today. I think this is the money slide: four high-level things.
First, always ensure that your users can discover the ideal tasks your multi-agent system is designed for. Provide user-facing observability traces. Ensure that your agents are interruptible, that you can checkpoint and restart them. And ensure that your agents can quantify the risk or cost of their actions and delegate to users as needed.
And finally, don't build a whole multi-agent system from scratch just to give a talk; you probably know that. It's fun, but it's a lot of work, and if you ever want to do something like this, consider using a framework to save you a couple of keystrokes here and there.
On the last slide, I have some further reading: a couple of papers we've written on AutoGen Studio, Magentic-One, Magentic-UI, and challenges in human-AI communication. These are all good references; I recommend you take a look. And that's the end of my slides. Thank you so much for listening.
I have a book I'm writing; there's a lot more in it, and chapter three is really just about design. Take a look if it's helpful for you, and all the code for BlenderLM is also available. Thank you.