
A2A & MCP Workshop: Automating Business Processes with LLMs — Damien Murphy, Bench


Transcript

Hey everybody. Yeah, thanks for coming. Great to see a full room. Always good when you're doing a workshop to have a lot of people here. So yeah, I'm Damien Murphy. I'm going to be presenting A2A and MCP, two pretty hot topics these days in AI, and how you can use them to automate business processes.

So yeah, a little bit about me. About 15 years as a full-time, full-stack developer, five years doing solutions engineering, so customer-facing, kind of forward-deployed engineering, and I've spent the last three years or so working on voice AI and AI agents. I did a workshop last year as well, on AI voice agent swarms, and it was a pretty hot topic.

I think it's now pretty much standard that everybody can build a voice agent in five minutes. And so now the hard part becomes building autonomous agents that actually can do complex tasks. And so I joined Bench Computing about two months ago, a pre-revenue startup backed by Sutter Hill Ventures, and we're building what I would imagine to be a better Manus, and that's more focused on teams and enterprises.

If you're not familiar with what Manus is, it's kind of like an autonomous AI agent, and Bench is essentially an autonomous AI agent that can do parallel sub-task automation. All right, so in the workshop that we're doing today, we're going to build a multi-agent system using A2A agents.

If you're not familiar with A2A, Google released, essentially, a protocol that allows agents to communicate over the web. We're going to integrate these agents with MCP, which is the Model Context Protocol. MCP is like a USB-C port for, you know, all of your agents to be able to consume context and tools and resources very easily.

We're going to get these agents to work together, and we're going to trigger the agent with a webhook. And then I'm going to cover a little bit about when to use A2A versus MCP, and I'll also go into prompt caching and context management as well. All right, so A2A, right — it's not exactly clear what it's for and why it exists, right?

If you ask everybody in a room what they think it does or why it exists, you'll probably get a different answer. But the key benefits are you can have agent specialization, right? So, rather than trying to make one agent do a hundred things, you can have a hundred agents do one thing and do that one thing very well.

A2A allows you to handle task delegation. So, you know, imagine you had a Salesforce agent and you wanted to interact with all the Salesforce MCP tools — you could do that. And you've also got the ability to do parallel processing, and this will become very important when it comes to speed and context management.

You can then use those A2A agents to have complex workflows and help keep your main agent's context size down. MCP, again, really hot topic right now. It's been kind of coined as the USB-C for AI, and there's definitely some benefits in just having a standard interface, right? You know, there's something like 10,000 MCP tools that you can use today, and about 7,000 of those come through the Zapier MCP.

If you're not familiar with Zapier, it's essentially a way to connect disparate systems together, and they've now released all of their zaps, as they're called, as MCP servers and tools. One of the great things about MCP is that there's no custom API integration, so you don't have to do any sort of, you know, different handling for different APIs.

It's a plug-in architecture, an industry standard, and it's really based on LSP, the Language Server Protocol. LSP was a way for, you know, IDEs to figure out how different programming languages worked, and it was a great kind of transfer of ideas over to the MCP protocol. All right, so when should you use A2A versus MCP?

Anybody? Someone offered: if you want to share resources and infrastructure, then you go for MCP. But yeah, I don't know — and that's kind of the challenge, right? It's like, what exactly are these protocols for, and should I be using them, and things like that. So, A2A versus MCP: you want A2A when you have, you know, two agents, right, and typically two agents that are completely unrelated, right?

So it's not two agents that you necessarily control. It's more likely going to be an agent of a third party, or, you know, their first-party agent and your agent. Yeah? What's the difference between agentic AI and A2A? So I work a lot with agentic AI, where you have multiple agents doing the same task.

The way you're describing A2A sounds a lot like agentic AI. Yeah, so like Autogen and frameworks like that, that allow you to kind of manage multiple agents locally — A2A is more about remote agents, right? So, agents you have no knowledge of. So you can think of A2A as a way for you to have service discoverability, and once you have the endpoint to the agent, you can then learn everything that agent's capable of.

With things like Autogen, it's, like, descriptive. So you describe what it's capable of. It's in your control. So to summarize, with agentic AI you kind of define the role of each agent, and with A2A the agent works remotely — so is its role defined or not defined? So each of the A2A agents will have a definition, and we'll kind of get into that a little bit later.
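In code, that discovery step looks roughly like the sketch below. The /.well-known/agent.json path comes from the A2A spec's convention; the endpoint URL and the trimmed-down card fields here are illustrative, not the repo's exact types.

```typescript
// Minimal A2A discovery sketch: fetch a remote agent's card to learn
// what it can do before sending it any tasks. Only a subset of the
// card's fields is modeled here.
interface AgentSkill {
  id: string;
  name: string;
  description: string;
}

interface AgentCard {
  name: string;
  description: string;
  url: string;          // base endpoint for sending tasks
  skills: AgentSkill[]; // what this agent claims it can do
}

async function discoverAgent(baseUrl: string): Promise<AgentCard> {
  const res = await fetch(`${baseUrl}/.well-known/agent.json`);
  if (!res.ok) throw new Error(`No agent card found at ${baseUrl}`);
  return (await res.json()) as AgentCard;
}

// Usage: learn everything the remote agent advertises, with no prior
// knowledge of it (the hostname is a placeholder).
const card = await discoverAgent("https://agents.example.com/github");
console.log(card.skills.map((s) => s.name));
```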

But yeah, think of agentic AI kind of as a superset of everything, right? A2A and MCP are just kind of subsets of that, right? Different modalities. Yeah, so with MCP, you're going to connect to external context and tools. A lot of people don't use most of the features of MCP, right?

They're just using the tools. But there's a lot of stuff around prompt templates, resources, and a thing called sampling. Sampling is actually going to be a really interesting thing I think that we'll see a lot more of as well, where it allows these MCPs to sample the host LLM, right?

So if you're using, you know, Claude and you're hitting an MCP server, and that MCP server may want to also use the same model of Claude that you're using, and it can use sampling to actually achieve that. So when you bring those two together, you kind of get the benefit of both, right?

So you have A2A as the remote interface, and MCP is then giving you the actual tool use and context management. Okay, so when not to use A2A or MCP. And you'll notice a lot of, like, memes here. And just to give you a heads up, all memes were generated by Bench.

Actually, the whole slide deck was generated by Bench. I just gave it a markdown file, and out it came. So, when should you not use A2A or MCP? If you have full control of the tools, then you probably don't need MCP, right? Like, if your function is local to your code base, you know, why do you need to create, you know, a USB-C port?

It's kind of like me plugging in my hard drive with a USB cable. You know, like, shouldn't I just use the hard drive that's in my machine, right? So calling functions directly in your code base, super easy, easy to maintain, faster to develop. And then if you have full control of your agents, you probably don't need A2A either, right?

Like, if they're your agents, you can use, you know, some sort of local function call for them to communicate. And I've built multi-agent systems using MCP and using just local function calls. It's a lot easier to just use the code you have. It's going to be faster, there's no protocol overheads, and things like that.

A lot easier to debug as well. Okay, so why do you need A2A and MCP at all, right? Third-party tools is probably the number one reason to use MCP. You can just get access to such a large array of tools that, you know, you're never going to be able to -- let's say you're building a product, right?

And you're like, okay, we're going to build first-class integrations with Salesforce and Slack — but what about the other 10,000 tools? It's like, okay, we'll just allow people to add their own MCP server. So that gives you great extensibility. But there's a lot of drawbacks with MCP, right? And you only get what you're given.

And a lot of the time, that's not exactly what you want. And so you may go down the route of saying, you know what, I need a way to actually index this data so that I'm not calling, like, you know, list Slack channels every time I want to post to a channel, right?

Or post a message. And then with A2A, the complexity is hidden from you, right? And that's one of the key elements of A2A — you don't know anything about this agent until you connect. And all of its complexity is completely opaque. And then you can essentially connect to, you know, any sort of remote A2A agent, so long as you have, you know, the credentials and things like that.

And we haven't seen any first-party A2A agents released yet. But Google has about, I think, 50 partners they're going to launch with. So I'd imagine there's going to be, like, a Salesforce A2A agent. And it'll probably only come with a paid account, right? Because it's going to use LLM compute.

Versus things like MCP servers, which typically don't actually use an LLM themselves, right? They use the host LLM. Alrighty, so we're going to get into the code now. Yeah, so if you haven't already grabbed the repo, we also have a Slack channel, workshop-a2a-mcp-2025. And in this repo there's basically everything you need to get going.

Yeah, so the code structure, we've got a host agent. And then we've got some sub-agents, right? And the whole concept here is to demonstrate, you know, A2A and MCP. But in reality, these sub-agents will probably live in a different repo, you know, run on a different server. And then we've also got the A2A implementation, the server and the client in the repo.

And these are taken directly from the A2A repo. We've also got the MCP integration — this is just a client; we're not creating a server here. And we also have a CLI interface. You're not going to need the CLI interface; that's kind of how it's used internally. Yeah, so once you've cloned the repo, you're going to want to run npm install.

And you're going to need an MCP server URL. This is going to be a Zapier URL and a Gemini API key. You can get both of these for free. There's no need to sign up for a paid account to get them. And you'll want to rename your .env.example to .env.
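As a quick sanity check, something like this at the top of the entry point catches a missing key early. The variable names here are my guess at what the repo's .env.example uses, so check the real file.

```typescript
// Fail fast if the two secrets the workshop needs aren't set.
// Variable names are assumptions -- see .env.example for the real ones.
import "dotenv/config";

for (const key of ["MCP_SERVER_URL", "GEMINI_API_KEY"] as const) {
  if (!process.env[key]) {
    throw new Error(`Missing ${key} -- copy .env.example to .env and fill it in`);
  }
}
```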

Alright, so setting up the Zapier MCP. When you go to zapier.com/mcp, you'll have the option to create a new server. And when you go to connect, you're going to have a couple of options here. We're going to use SSE. They recently released Streamable HTTP, which deprecates SSE and is going to replace it.

But there's still a litany of SSE servers out there, so I just used SSE for this one. Once you do that, you're going to get this server URL at the bottom. You can copy that URL. That's going to be the URL that goes into your .env. And then you're going to set up a Slack and a GitHub integration.
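Once that URL is in your .env, talking to the server looks roughly like this with the official MCP TypeScript SDK. The tool name and instructions below are made up — list the tools first to see what your server actually exposes.

```typescript
// Minimal sketch: connect to the Zapier MCP server over SSE, list tools,
// and call one. The tool name is hypothetical -- check listTools() output.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { SSEClientTransport } from "@modelcontextprotocol/sdk/client/sse.js";

const client = new Client({ name: "workshop-client", version: "1.0.0" });
await client.connect(new SSEClientTransport(new URL(process.env.MCP_SERVER_URL!)));

// With Zapier, only the tools you enabled in the dashboard show up here.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Zapier tools accept a natural-language `instructions` field, so you can
// skip filling in the structured fields.
const result = await client.callTool({
  name: "slack_send_channel_message", // hypothetical name
  arguments: { instructions: "Post 'hello from the workshop' to #general" },
});
console.log(result);
```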

So you're going to want the ability to create an issue. You can put in the repository URL for the workshop if you want. You can use your own as well. You can let AI choose these, but what I've found with AI is that it will choose something else, right?

So a lot of time with these MCPs, you're going to want to kind of say, hey, you know, this is the thing I want to do. So let's just kind of hard code that. But if you do let it kind of go wild into your Slack, it's going to start posting in general and random and sales.

Yeah, a few of my bots have kind of gone rogue. All right, so the Gemini setup. Yeah, so you can get the API key here at the AI studio. And there's a link in the slide deck as well if you need to click it. You can get a free account, generate an API key, drop that into your .env as well.

Excuse me. And there's also a remote Bench A2A agent. So the code for it's actually in the repo. But we haven't officially released our API yet, so I'm just hosting that remotely. But it's a nice kind of way to show how you would use A2A remotely as well. So what is Bench?

Bench is essentially a kind of LLM aggregator with autonomous AI agents. So you get access to Claude, Gemini, OpenAI, xAI, and loads more models. It has, I think, about 30 tools now. And integrations. So we actually started out with MCP integrations to Slack and Salesforce. They didn't meet our needs.

We built first-party integrations, you know, data caching and indexing. And that kind of gives you an idea of, like, how far is MCP going to get you, right? Eventually, at some point, you're going to realize that it doesn't do the, you know, the specific thing you need to do.

All right. So running the application. You're going to run npm run start all. And that's going to kick off all the agents, right? So the Slack agent, the GitHub agent, the host agent. And it will also start the webhook server and the webhook admin panel. You can access that then through localhost port 3000.

And, yeah, so let's just kind of go into what each of the actual agents do. And so the host agent is essentially your central coordinator, right? And this may be the only agent that you have in your application. It may be using external A2A agents. And if that's the case, then, you know, everything that your host does is going to be delegated, you know, to sub-agents.

And so that handles all the agent discovery and kind of bringing everything together. Yeah, so the code for that's going to be in src/agents/host. And you'll notice there's a couple of files in there. One of them is the host-agent prompt. All right, so that's just a plain-text system prompt.

Genkit, that's going to be essentially how you hook all of your A2A code up with Gemini. And there's also a Genkit MCP plug-in that the sub-agents use. Yeah, so then the Slack agent. So this is going to send a Slack message in response to the webhook transcript. And, yeah, the kind of sample webhook that we have in this is essentially, you know, your meeting end.

And you're going to receive a transcript of that meeting. Right? And with that, you're going to decide what to do. So it's going to, you know, if it detects any bugs, it's going to create a GitHub issue. If it detects any, you know, feature requests or anything of interest, it's going to post that into Slack.

And you can think of the kind of automations that you can build with this sort of scenario. Right? So you could even -- I had a version here that was hooked up to Salesforce. But there's actually a limitation on the host agent on how many sub-agents it can call.

And so I figured, right, if one of them is going to go, it's going to be Salesforce because it's probably the hardest to get an account on. But you could actually update an opportunity based on a sales call. Right? So you could have a sales call and, you know, you're talking to them, you're doing your discovery, and you're able to update those Salesforce fields automatically.

And, like, the time saving for account executives, because, you know, they're probably on back-to-back calls, is actually pretty big. Yeah, so this was an interesting issue I ran into. So I asked one of my colleagues to test the repo out. Right? And he was getting this weird error where it was saying, you know, the Slack MCP succeeded.

And so I asked him to send me the logs, and he sent me this. And it was like, isError: false? And I'm like, okay, that's great. So, yeah, it turns out that, you know, not all MCPs are created equal, and the Zapier Slack MCP failed silently. And so the reason it failed was he had the default Slack channel name, which was, like, test Damien Slack.

And he was in a different workspace where that channel didn't exist, so it just failed silently. And so I added a bit of code to detect this kind of empty text array. And so it will fail now. But it kind of goes to show you just kind of the limitations of MCP.

Yeah, so the GitHub agent, pretty straightforward. It's probably the most basic of the three or four. So it just creates a GitHub issue. Super simple. But you could imagine, you know, how you would extend this, right? Maybe it's going to open a PR, right? Maybe it's actually going to implement the fix for the bug that was reported in the meeting.

And you can see how down the line, as, you know, AI gets better and things really improve, that a lot of this automation is going to be driven by human interaction. Right? So, you know, speaking with people and posting messages on Slack and talking in GitHub discussions is going to trigger AI to take action.

Yeah, so the bench agent, it can do a lot. And that was actually one of the problems that I found with A2A is that, like, the more functions and capabilities an agent has, the harder it is to describe the agent's capabilities in the agent card. And so the agent card is essentially, like, the public information to any other agent of what that agent's capable of.

And so I had to really just pare it back and I said, look, you know, you can do a handful of things. I know you can do more, but, like, for now, these are the few things that you can do. And it's able to go off and, like, you know, browse the web, do research, data science, all sorts of things.

And so we're just going to use it for researching the company and the people in the meeting transcript. All right, here we go, demo gods. Before I start, any questions? Yeah? You mentioned some limitation on the number of agents. What was that? Yeah, so the GenKit implementation that Google provides limits you to a maximum of five kind of sub-agent calls per turn.

Is that a hard limit? Yeah, I couldn't get around it. Like, there was this max, like, setting, but it didn't work. Yeah. So it's something I'm sure they'll fix eventually, but it was an interesting issue. All right, let me see if my code is running. Yeah. I think it is.
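The cap being discussed sounds like Genkit's tool-calling loop limit, which defaults to five turns. A self-contained sketch of where that knob lives — the maxTurns option on generate() — with a stand-in tool; the speaker reports that raising it didn't work in this repo's version, so treat it as version-dependent.

```typescript
// Genkit stops a tool-calling loop after 5 turns by default; maxTurns
// is the documented override. Sketch only, not the repo's actual code.
import { genkit, z } from "genkit";
import { googleAI, gemini15Flash } from "@genkit-ai/googleai";

const ai = genkit({ plugins: [googleAI()] });

// Stand-in tool so the sketch is self-contained; the real host agent
// registers list-remote-agents / send-task style tools instead.
const echo = ai.defineTool(
  {
    name: "echo",
    description: "Echo the input back",
    inputSchema: z.object({ msg: z.string() }),
  },
  async ({ msg }) => msg,
);

const response = await ai.generate({
  model: gemini15Flash,
  prompt: "Process this webhook payload...",
  tools: [echo],
  maxTurns: 10, // default is 5; whether this takes effect may vary by version
});
```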

Yeah, so it should be here. And actually I'll show you the MCP server as well while I'm here. Yeah, so this is the MCP Inspector. It's an open-source repo, part of the Model Context Protocol project. So, yeah? At the back. Yeah, that's actually in the agent card. So that will be in the index.ts of the sub-agent.

Yeah, I'll be going through the code in a little bit as well so you can see it. Yeah, so I'm connecting to my Zapier MCP URL that I got. So I just copied this one, dropped it in. I'm going to connect over SSE. And this allows you to, you know, list the tools, call the tools.

And it's quite interesting now that Zapier has added instructions as a mandatory field on actually all of their MCP tools. So you don't actually need to fill out the fields anymore. So you can just give it natural language. So this kind of suggests to me that they're using an LLM on their side to figure out how to populate the fields on your behalf.

Which is interesting because it's going to cost them a fortune, right, as more people adopt it. All right, so this is the agent dashboard. Let's just make sure everything's working, yeah? You can see I have a couple of previous ones that I ran. This one is actually the one where the Slack thing wasn't found.

So when I was testing that — oh, my mouse isn't moving. There we go. Yeah, so I put in, like, a, you know, typical unknown Slack channel. And then it detected that it couldn't find it, based on the heuristics. Not sure why my mouse isn't moving. There we go. Yeah?

So you have defined four agents here. Mm-hmm. So they are all A2A agents. Yeah, correct. Okay. So the maximum you can go for A2A agents is five? Yeah — when I got to five, that's when I got the error. Yeah. So, four. Yeah, and the host agent here, so these are the host agent logs.

You can see it connecting to the different agents. This agent's just running on a little dinky EC2 instance that I spun up. And it goes through, learns about the agents, you know, processes webhooks, like, you don't necessarily need to go in here unless you get a failure. And Slack agent, pretty similar.

It's basically just sitting there waiting for another agent to connect. And when the agent connects, it communicates with it. And you can see here the bench agent's running remotely. The reason I don't have verbose logs here is because it's remote. It's not under my control, right? So the A2A logs for that agent are actually on the EC2 server.

Which kind of brings up another question about how do you debug when an A2A agent fails, right? Yeah. So then on the webhooks page, so this is the only webhook that's preconfigured. And this basically explains, you know, to the agent what it's actually going to do when this webhook arrives, right?

And so it's going to process the incoming webhook. And we have a little prompt template here, right? So it tells it what the agent capabilities are, how to analyze it, right? And then we have the processor config, right? And this just kind of tells it, hey, these are the agents that you have access to as part of this webhook.
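The shape of that processor config is roughly like the sketch below — the field names are illustrative rather than the repo's actual schema. The point is just that each webhook carries a prompt template plus an allowlist of agents it may call.

```typescript
// Hypothetical shape of a webhook processor config; names are mine.
interface WebhookConfig {
  id: string;
  description: string;     // tells the host agent what this webhook is for
  promptTemplate: string;  // how to analyze the incoming payload
  allowedAgents: string[]; // subset of registered A2A agents this webhook may use
}

const meetingEnded: WebhookConfig = {
  id: "meeting-transcript",
  description: "Fires when a meeting ends; the payload contains the transcript.",
  promptTemplate:
    "Analyze the transcript. File bugs via the GitHub agent, post feature " +
    "requests to Slack, and research attendees with the Bench agent.",
  allowedAgents: ["github", "slack", "bench"],
};
```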

And this will become important when you've got, say, 100 A2A agents and you only want, like, two of them to interact. And then here we have a test. And so this is just a fake transcript I generated with an LLM. And when we send the webhook, you can see here it's processing.

And hopefully the demo gods will do me good here. And it does take a little bit of time, right? So the host agent has to process it, then it has to reach out to the sub-agents, you know, get all the information. And I think the bench agent probably takes the longest because it's actually doing its own sub-tasks as well.

Okay, we've got a Slack message. That's a good sign. Okay, so Snowflake is interested in Slack and GitHub integrations. Very cool. We have the GitHub. So, I don't know why my mouse keeps freezing. There we go. Yeah, so we should have a GitHub issue. Here we go. Yeah, so during the trial, the AI misclassifies the severity of the bugs.

Engineers need to investigate and fix the issue. All right, so it's a really simple use case. But you can imagine that that transcript is probably going to be ten times longer. You know, a lot more information in it. And it will just work, right? And then we also have the bench agent.

So, oh, it looks like it's waiting for results. So, it's going to research the company. I think I did one before where it just returned a result. Let me see. Yeah. So, it basically goes off, does a research into Snowflake and all the participants of the call and returns that information.

And this can kind of get as complex or as simple as you want it to be. And, yeah, so when you're using the application and you have it up and running, has anybody managed to get it up and running? Wow. Impressive. Yeah? Quick question, Mike. You're using bench agent to do the orchestration.

That's why you're having it remote, right? No, so the bench agent is just, like, think of it as a third-party agent that we can leverage. So, the host agent is doing all the orchestration. Okay. So, like, what is the actual role that bench agent is playing? Like, what is it actually doing?

It's doing research on companies and people. So, it's just another agent? Yeah. So, it's an agent with a load of different capabilities, and it's basically just... Okay. So, the orchestrator, is it local? That is local, right? Yeah. So, these three, host, Slack, and GitHub, are all local. Yeah. I was like...

Exactly. I think I mistakenly thought the bench was doing orchestration, and I was like, why is it... Yeah. No, the bench is just a... Like, it's in the repo, but you need an API key for it, and we're launching in about two weeks, so, I just made it remote for the purposes of the demo.

So, what about the host agent, though? Sorry? The host agent, is it the Zapier agent? No. So, all of these agents are A2A agents. The Slack agent and the GitHub agent have MCP tools to Slack and GitHub through Zapier. Yeah. I can actually show you a diagram that might explain it a bit better.

Yeah. I don't know if that explains it better, but... But the orchestration does happen on your local, though. Yeah. Yeah. Everything's happening on my local. So, if I go into the code base, and I have the agent logs, and... So, this is all happening here, right? So, it's sent to Slack to...

Or... Is that readable? I'll go get one more. One more. Yeah. So, you can see here the transcripts are... And then it got a response from each of the sub-agents and then completed them. And it did all of this in parallel as well, right? Sorry. Is that a question?

Yes. So, in your example here, which agent would handle human confirmation? Let's say we want to have a great... Which agent would handle that part? Do you create a new agent for human confirmation? Do you keep the old one? Yes. You need a staging area for actions. So, it's not something I've built into this.

And there's a lot more you could do here. But human confirmation would typically be done through like a draft, right? So, you would maybe pop up a Slack message with some actions. And then when somebody clicks that, it would communicate back, kind of like a secondary pass webhook. And you might need to persist, say, though.

Yeah? How do you think about the security of these endpoints, with different vendors communicating endpoint to endpoint? How do you manage the security? Yeah. So, as part of the A2A spec, you're going to have some sort of authentication. Right? I've just exposed everything. Right? Like, it won't exist tomorrow. So, there's no security implications. But essentially, you're going to probably have to have a subscription with the company that's providing that A2A agent.

Because it is consuming tokens. Right? What authentication? I'm not sure exactly what the A2A folks have planned. It's still pretty early days. But with MCP, it's a little bit further ahead. It has OAuth, header authentication, things like that. So, imagine something similar. And how about CASA governance? Like an LLM firewall, all that benchmarking, auto-benchmarking, and also the guardrails, et cetera. Do you have a separate agent, or is everything being--

Do you have a separate agent or everything is being-- You'd probably manage that on, like, an Amazon Bedrock or something like that. Right? And you would just, you know, use that guardrailed LLM from behind there. You don't have to use Gemini here either. Yeah? The idea of A2A is agent can communicate.

So, like, you now have the host agent. And then that host agent is kind of like the planner and then talk to each. Do you see, like, A2A becoming, like, sub-agent talking to each other? I guess you could, but I don't know if that's the intention, right? Like, then they just become hosts, right, when they talk to each other.

Like, if you think about it, like, if you have no knowledge of sub-agents, how would you know to talk to them, right? You would have to then become a host agent yourself, connect to that other sub-agent to do that. So, I don't know if that's intended in the A2A spec for sub-agents to communicate.

Yeah? So, with the host agent and the orchestration that it's doing, is it actually managing a combination of all the context windows? Or, like, do you handle it quickly? Yeah, so all of the context windows, and this kind of is something I'm going to cover now in a second as well.

Let me just go back to the slides. Which is a good segue. So, yeah, one of the benefits of, like, A2A, or any sort of sub-agent framework, is that you're not consuming the tool results into your context, right? So, say you have a load of Slack messages or GitHub issues or Salesforce opportunities, and you want to analyze them and maybe produce, like, you know, a summary of categories and counts — the only thing your host agent cares about is the summary of categories and counts.

it doesn't care about the, like, individual details, right? Because those have already been processed by the sub-agent. So, the sub-agent's context gets big, not very big, but, like, as big as the task demands, and the host agent only incrementally grows by the business value it got from that agent.
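As a sketch of that pattern — the sendTask helper name is made up, standing in for an A2A client call:

```typescript
// The sub-agent absorbs the raw tool output in ITS OWN context; the host
// keeps only the distilled answer, so its context grows by business value,
// not raw data.
type SendTask = (agentName: string, task: string) => Promise<string>;

async function summarizeIssues(sendTask: SendTask, hostContext: string[]) {
  // The GitHub sub-agent pages through every raw issue in its context...
  const summary = await sendTask(
    "github-agent",
    "Fetch all open issues and return ONLY category names with counts.",
  );
  // ...and the host's context grows by one line of summary.
  hostContext.push(`GitHub issue summary: ${summary}`);
}
```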

And, like, one of the challenges at Bench is, you know, we have so many tools, right? Like, the context can blow up very quick. And so, you know, very early on we decided, okay, we need to have composability. And so that means that Bench can create its own internal bench agent to avoid that context growth problem.

And we're even thinking of going one step further, where, like, you know, should we have an agent for every single tool, so that every single tool is protected from the primary prompt? And so, you know, as you add more tools, the tool definitions themselves add up — I think we're up to, like, you know, 10,000 tokens just for tool definitions alone.

I added the Asana MCP. It added 11,000 more tokens. So, like, you know, a lot of these MCP servers, like, they're, you know, they're giving you a lot of information, and you may not actually want that. And that's actually one of the challenges with first-party MCPs is they expose all their tools.

And that's one of the benefits of Zapier, where you can pick and choose which tool you want to use. Yeah? Why do we need Zapier? Zapier is just a really easy way to use MCP right now. I think, like, Linear, Asana, a few others have added, like, first-party MCP servers that are much better than what Zapier exposes.

Yeah, so why does context size matter? So, AI agents accumulate context, like, as they work, and you're supposed to keep, like, all of your tool calls, right, what you sent to the tool and what you got back. You're supposed to keep that in your context so that later on, if you, you know, ask a follow-up question, it still has access to that data.

And that becomes very challenging, right? So, you've kind of got two options. It's like, okay, do I just prune, you know, old tool calls and now the agent gets dumb? Or, you know, do I figure out some other way to do it? And cost is a big challenge, especially when you're doing prompt caching.

So, with prompt caching, it enables you to essentially put a marker in your context and say, hey, look, when I make my next request, I want everything in my context so far to be cached so that I'm not going to get charged for it. But the cost to actually push that into the cache is about 3x the cost of making a single request with that context.
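To make that trade-off concrete, a back-of-envelope calculation, taking the rough 3x write-cost figure above at face value and assuming cached reads cost about a tenth of the normal input price (roughly Anthropic-style pricing; other providers differ):

```typescript
// When does caching a prefix pay for itself?
// - writing the prefix to the cache costs ~3x a normal send (talk's figure)
// - reading it back costs ~0.1x per turn (assumed provider pricing)
function breakEvenTurns(writeMultiplier = 3, readMultiplier = 0.1): number {
  const extraWriteCost = writeMultiplier - 1; // premium over one normal send
  const savingPerReuse = 1 - readMultiplier;  // saved on each later turn
  return Math.ceil(extraWriteCost / savingPerReuse);
}

console.log(breakEvenTurns()); // ~3 follow-up turns to recoup the cache write
```

Which matches the point below: if a user never sends a second turn, that cache write is pure loss.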

So, that means that you have to be very, you know, diligent in what sort of context management strategies you use. And, you know, I was running simulations because I couldn't really figure out, like, what is the optimal, you know, caching strategy. So, I ran simulations based on usage data of, like, you know, what's the typical context growth?

How many turns, you know, on average? Like, what percentage of users only send one turn, right? Should we cache that one turn if they never ask another question, right? Probably not. So, you know, it probably gets down to the actual user level. So, if you have a user that always, like, puts in new prompts into the same chat and never opens a new session, you're probably going to want to, you know, continuously cache their context.

But you might have another user who always creates a new session for every question. And then just figuring out, like, you know, what is the context growth? I think we figured out around 30,000 tokens was the optimal kind of across the board for everybody. But that also comes with false positives.

So, sometimes you can end up caching the last turn of a conversation. And that's going to, you know, cost you a lot more than it should naturally. Yeah, so, that's the great thing about the sub-agents, right? They protect you. And this was the GitHub kind of example I was giving you.

But this applies to pretty much every tool. So, like, if you're ever integrating with a system, you're probably going to run into issues like, why do I have to call, you know, list Slack channels every time to get the channel ID for the channel name that was provided, right?

Because, like, nobody's going to provide, like, in a chat the channel ID that they want to post, right? It's a UID. It's not memorable. So, then you get into the question of, okay, well, do I just cache the list of channels? And when do I update that list of channels, right?

Like, what if the channel was deleted, renamed, or a new channel was added? Yeah, and then the cost is really probably the biggest one. Yeah, so, the benefits of this lean context, right? So, your sub-agents have that isolated context. And that really just allows you to be super, like, fast, low latency, low cost.

And if you ever need to go back to ask another question, you know, you're going to, like, spawn that process again, right? So, maybe if you're in control of these other agents, you might want to have some sort of, like, I don't know, five-minute TTL on previous questions, right?
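A minimal sketch of that kind of TTL cache — and the same trick answers the earlier question about re-listing Slack channels on every message. Names here are made up:

```typescript
// Refresh cached data at most once per TTL window, so renames and
// deletions get picked up eventually without a tool call per message.
declare function fetchSlackChannels(): Promise<{ id: string; name: string }[]>;

class TtlCache<T> {
  private entry?: { value: T; expiresAt: number };
  constructor(private ttlMs: number, private load: () => Promise<T>) {}

  async get(): Promise<T> {
    if (!this.entry || Date.now() > this.entry.expiresAt) {
      // Stale or empty: reload, e.g. one call to the list-channels MCP tool.
      this.entry = { value: await this.load(), expiresAt: Date.now() + this.ttlMs };
    }
    return this.entry.value;
  }
}

// Channel list refreshes at most once every five minutes.
const channelCache = new TtlCache(5 * 60_000, fetchSlackChannels);
```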

And then, yeah, the host agent only processes the summaries, and the raw data is discarded after processing. Yeah, so, I'm going to jump back into the code here. So, just kind of walk you through how it all works. All right. We'll start with the host agent. And you notice a few other things, right?

So, there's MCP. This is just your standard — sorry? Oh, I thought someone said something. Yeah. So, this is kind of your standard MCP client code. And it just allows you to consume the MCP calls coming from the LLM. And we have the GitHub one, right? So, this is going to be what it sends to that Zapier endpoint.

And it's going to call GitHub create issue. And then the Slack agent is going to do send Slack channel message. So, these are just kind of, like, the MCP client tools that the individual agents will use. Yeah. So, this GenKit, this is based on what they provide in their sample repo.

So, you can use a different model if you want, right? You can change, you know, the settings on it. But this essentially spawns you a new instance of what's going to communicate. And this just loads the system prompt. I can open up the system prompt here. And so, right, it's got a critical workflow.

It's going to do these things in this order. It's got a few steps, you know, discovery. And this is actually something I noticed. Like, if you don't tell the A2A agent to call list remote agents, it just won't, right? And it will try to answer everything on itself. You know, it can very easily fake sending a Slack channel message and be like, "Oh, I just sent it for you." And I say, "No, you didn't." You know, one of the things I've noticed using Cursor is, like, every time I catch it doing something wrong, it says, "You're absolutely right." I even tried to prompt that out of it.

And it's not promptable to get it to not say that. Cool, yeah. And then the index, so this is actually where the agent card is. It's a little bit long. Let me see. I think it's up here near the start. There we go. That was line 1,200, so not near the start at all.

Yeah, so this is what the host agent exposes if somebody else wanted to call it. So it has these abilities to list remote agents and send tasks. And then if we compare that to the GitHub, which is a lot smaller. There we go. Yeah, so the GitHub agent can create GitHub issues, right?

It's got the ability to do various things, and it has a list of skills. And this is all that the host agent really knows about this agent. So you could imagine how big this might get if you were to, you know, implement every single API that, say, Salesforce has or something like that.
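Concretely, a trimmed-down card for the GitHub agent might look like the sketch below. The field shapes follow the A2A agent card format, but the values and the port are made up; the real card is in the sub-agent's index.ts.

```typescript
// Illustrative agent card: this is everything the host agent knows about
// the GitHub sub-agent before delegating to it.
const githubAgentCard = {
  name: "GitHub Agent",
  description: "Creates GitHub issues from analyzed meeting transcripts.",
  url: "http://localhost:41242", // wherever this sub-agent's A2A server listens
  version: "1.0.0",
  capabilities: { streaming: true },
  skills: [
    {
      id: "create_issue",
      name: "Create GitHub issue",
      description: "Files an issue with a title and body in a configured repo.",
      examples: ["Create an issue titled 'AI misclassifies bug severity'"],
    },
  ],
};
```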

And in a lot of cases, at least with Salesforce, rather than implementing, you know, wrappers around the APIs, you're probably just going to want to use, like, SOQL or SOSL directly and let the agent actually write the queries. There's a lot of flexibility when you have, you know, direct database access, essentially.

Because the LLM can, you know, bypass, you know, the API layer and just go directly to the database. And then the GitHub agent prompt, right, so it's got some things. This is something I had to add because it insisted on mentioning who submitted the bug report, right? So there's definitely concerns around, you know, PII leaking from your, you know, internal meeting transcripts and ending up in GitHub, right?

And that kind of goes back to your question about, you know, how do you audit what's coming out of these LLMs, right? And you can do that in a number of ways, but it wouldn't be a part of the A2A spec. I think it would just be the LLM you connect to has those guardrails in front of it, and you're just using that LLM that has the guardrails.

Similarly, Slack has a very simple agent card that I can't seem to find. And then if we jump over now to the host config — so this is essentially what configures the webhook, right? So the webhook has essentially a config that tells it, like, what it's doing, and you can see that in the UI as well.

And then within the A2A folder, we've got the client and the server. Again, these are just pulled directly from the A2A repo. I don't think they've actually exposed types or packages yet, which is kind of confusing. But essentially you can bring that stuff in there. And then the webhook server.

So this is just a web UI. Initially I had this whole thing done through the CLI. And, you know, coding with, you know, tools like Cursor or Augment Code — CLIs are way easier for AIs to actually write, right? They're going to be able to test it, interact with it much better, and be able to produce those outputs.

Awesome. So yeah, I'm going to shift over to kind of Q&A now. So yeah. Anybody? Any questions? Yeah? So I want to talk evals for a second. Mm-hmm. So like, I assume that you manage, or I don't know — I mean, you manage them probably at the agent level. Is there any type of, like, distributed eval that you bring when you're, like, dealing with A2A?

Yeah, I haven't done many evals on A2A. I still think A2A is a bit too early to go into production. Like, even MCP is kind of borderline. Like, there's a lot of rough edges. I think you can achieve, like, much better things if you're in complete control of everything.

You can achieve much better results, you know, with your own local function calls. Yeah? Is there any reason you used TypeScript instead of Python? Yeah, you can use any language. I think, actually, the A2A framework is better in Python. I just prefer TypeScript myself. Yeah? Can you tell us more about the caching?

Is caching provided by the model providers, or can we implement our own caching? Yeah, so you implement your own caching. So you decide, you know, when to move that cache marker, how to manage it. It can be tricky, and I don't think there's very good information available online on what the best strategies are.

When I was doing the simulations, I used, like, linear growth, exponential growth, you know, fixed size, and kind of compared them all. They all worked out between 25% and 35% cost savings. But, like, in practice, what you'll find is you're going to have outliers where, you know, the cost of a session kind of balloons because, you know, you cached at the wrong point.

Yeah? So each of the A to A agents can be talking to their own, like, find the own LLM, right? Like, they have their own LLM with their own. There's not a central LLM that they refer to, right? Yeah, yeah. So they all have their own, which is kind of in contrast to MCP, where the MCP wants to use your LLM, right?

Because it doesn't want to generate its own tokens. So, yeah? Yeah, so there's a couple of different ways. So, within the authentication, you can have headers that do the authentication. I believe if you drop in an OAuth URL, you'll also get an OAuth popup. And I really like the OAuth authentication because you're getting the user's, you know, ACL, right?

And that means that, you know, what that user can access is specific to them. And it's not implicit in MCP or A2A — you have to build that yourself. Yeah, so it's going to be dictated by the remote server, so either A2A or MCP. If you're running your own, you can choose what you want to run.

There's different transport types as well. So, standard I/O is something that you would use locally. So, like, imagine you wanted to create, like, a file on your desktop — you're going to use standard I/O typically to interact with local things. And then SSE was server-sent events, which got deprecated in favor of Streamable HTTP.
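For reference, here's how the three transports look with the MCP TypeScript SDK. The server URLs are placeholders; the filesystem server spawned over stdio is one of the official example servers.

```typescript
// The three MCP client transports: stdio for local servers spawned as a
// child process, SSE for the (deprecated) remote transport this workshop
// uses, and Streamable HTTP as SSE's replacement.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
import { SSEClientTransport } from "@modelcontextprotocol/sdk/client/sse.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

const local = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@modelcontextprotocol/server-filesystem", process.env.HOME!],
});
const sse = new SSEClientTransport(new URL("https://example.com/mcp/sse"));
const http = new StreamableHTTPClientTransport(new URL("https://example.com/mcp"));

const client = new Client({ name: "transport-demo", version: "1.0.0" });
await client.connect(local); // pick whichever transport the server speaks
```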

So, sorry, sorry. So, for example, like, if we are interacting with our Salesforce agent, let's say, and each user has different authorization — for example, employee A probably has access to some set of tables, and employee B has different access control, right? In that case, is that done by the framework, or do we have to deal with it?

Yeah, that will typically be handled through an OAuth MCP server, right? So, they're going to essentially log in as themselves as part of the connection, and then it's going to save that refresh token for later use. Yeah. How would you describe the performance, for security especially? You explained very well about OAuth authentication, et cetera.

But I'm looking for more explanation towards encryption — asymmetric encryption — and also there's the possibility of certificate management, and all the way down the internet architecture. So, how would you describe the performance? Let's say I'm looking at some financial application. This architecture, what you have described, is pretty good.

But similarly for financial applications, as well as some Department of Defense kind of applications — highly, highly secured environments, where it's a combination of both asymmetric and symmetric. Yeah, you're probably going to want to run, like, the LLM yourself, and you're more than likely not going to want to interact with anybody outside your VPC, right, in those cases.

I don't know if you would want to consume a third-party MCP server or A2A agent in a highly regulated environment, right? Like, you know, HIPAA compliance, financial stuff. And if you do have the ability to do that, right, you're going to have some sort of agreement with the service provider that provides those tools.

And you're going to, you know, do transport over HTTPS, you're going to have maybe mutual TLS, both on the A2A agent and the remote agent. And similar with the MCP server, you're probably going to have some sort of IP whitelisting, right? Like, there's a ton of things you can do around that.

I think they're out of scope of the actual protocols themselves. Because, you know, essentially you're over an encrypted line, but typically there's more to it than just that, right? So you're relying on the endpoint controls on this, and that's really scary when dealing with... Yeah. Yeah. And, like, if these are your own internal MCP servers and your own internal A2A agents, maybe from different parts of the organization, you know, they'll all live inside your VPC and they're probably never going to talk to the public internet.

So the solution -- the answer I get from you is stay with this VPC and stay away from -- in that case, stay away from endpoint security, which means stay away from MCP or A2A. It's -- so these are just protocols. It's really up to you whether you want to connect to an external third party and that's going to be your own security posture.

It's not really going to be defined by the protocol itself. Yeah. Keep them away from the subnet or bring them inside the subnet. Which one would you prefer? I would liken it to, like, I found a USB cable. Will I plug it into my laptop, right? So the USB, it's not its fault, right?

Like, the USB is just a standard. It's what that USB is connected to is the risk, right? So, like, if you're willing to find a dongle on the street and plug it in, you know, that -- that's really going to be your security posture, right? Yeah? Okay. So, how much heavy lifting do you have the orchestrator do?

Like, do you ever hit scenarios where the orchestrator interprets the response from a sub-agent and then maybe does a retry with a better prompt? Yeah. So, one of the things — and I kind of prompted it out of this workshop just to keep it simple — is, like, the bench agent wants to have a conversation with the host agent.

But I didn't want to kind of implement that back and forth because it was going to delay the webhook processing. But you can have backs and forth between the agents. And it's probably desirable as well, right? Like, if for whatever reason the host agent doesn't give sufficient information, you know, the remote agent is going to be like, okay, you know, I know you want to update an opportunity, but you didn't tell me which opportunity, right?

Yeah. I mean, I could even see scenarios where you have an expensive LLM that you have on reserve that you go to when the cheaper LLM agents aren't giving you what you want. Like, I'm sorry, I'm just thinking through stuff. Mm-hmm. Yeah, and I think, like, LLM cost and capability is a big challenge with a lot of these things because, you know, if you're running, say, Claude 4 Opus and somebody, for whatever reason, asks you to summarize, like, you know, five sentences, it's going to cost you a fortune, right?

So you need intelligent routing logic on, like, does this task need the entire context, right? Does it need 20,000 tokens of a system prompt to summarize, you know, a short bit of text? And that's one of the challenges that you'll run into where you kind of need, like, a routing LLM in front of these complex agents so that they can actually figure out, you know, how deep do I go?

Yeah? Similar to the routing orchestration question, I was wondering: like, if you wanted to post a Slack message that linked the GitHub issue created, for example, would you prefer, in your architecture, to go back through the host to make that decision, rather than let the GitHub agent directly communicate with the Slack agent?

Yeah, so the host agent wouldn't run the calls in parallel, right? So there's actually a flag whether you want it to go in parallel or not. So it would have to say, oh, I need to create the GitHub issue first before I talk to the Slack agent, right? Since I need that URL.

But in general, you'd prefer to have those decisions go through the host rather than even allow the GitHub and the Slack agents to talk directly? Yeah, absolutely. Yeah. Yeah? I wanted to ask: the context for the sub-agents, is that entirely happening through prompt engineering, or are there other frameworks to, like, slice out the different things that are going to go to different sub-agents?

Yeah, so typically context management is going to be implemented in your own codebase. The sub-agent's context management is more than likely going to be a third-party's codebase. If it's one of your own agents, right, you can manage it as well there. But, yeah, you're going to want to figure out, like, what's optimal for your actual, like, production usage.

Yeah. But so you will be using prompts in the host agent to kind of guide what context to send to each sub-agent, right? Yeah, yeah. So what you send is typically, like, a question or a task. Yeah. It's usually very small, right? Like, you don't send the full meeting transcript to the Slack agent to do what it's doing.

The host agent processes the transcript and then decides what the tasks are. So, like, if I look down here, and actually I think I can see it in the dashboard. Yeah, so this is actually what the host agent sent to the GitHub agent, right? It says create an issue in this repo, title this, you know, with this description and title.

And then the GitHub agent — its task is to extract three bits of information, right? So what are the instructions to give the MCP server, what's the body, and what's the title? Yeah. So there's MCP, and there's the dynamic request context, part of which we want to send across for each and every request. So whatever you showed earlier — that's pretty much how the servers understand what client connects to them?

So whatever you showed earlier, that's pretty much the servers understand what client connects to. Yeah, so Zapier, the SSE implementation doesn't actually require headers. I think these are just left over from something else. So there's actually no authentication, and the URL itself is kind of like a secret key, right?

So, like, if I disconnect and reconnect without the headers, I should be able to — yeah. So I can still query it. They've moved away from this approach now with more secure kinds of setups, and you'll notice in their docs, right, they've kind of deprecated that and say, you know, treat this URL like a password, right?

Yeah? What's your experience in using different LLMs for these agentic workflows — like, for example, Haiku, Sonnet, Gemini — and also, did you use any samples for these kinds of workflows? Yeah, so we typically lean towards Gemini for large context and Claude Sonnet 4 for tool calling. Claude Opus is better, but it's not, like, 4x better.

You know, and when you compare price to performance, right, like, you know, 5% better doesn't equate to 4x the cost. Are you talking about Gemini Flash or Pro? Yeah, so we'll use Gemini Flash for simple things like summarization, right? You could use Claude Haiku as well, but I think Google's kind of taken the lead in price performance, you know, from an economic standpoint, but Claude is still the kind of king of tools.

They created MCP, so they kind of had a head start, right? What about the sub-most collapse? Yeah, we have DeepSeek hosted in the US, so we've been trying that out. I think Llama has kind of fallen by the wayside a little bit, and, yeah, DeepSeek is just, you know, the clear winner right now.

They also released a new version there, I think, on the 28th, and that's kind of up there with o3-level models. We actually don't use reasoning models for our agents. A lot of the time when you're building, you know, agentic systems, a reasoning model isn't really needed. Like, unless you want to, you know, pay a fortune for some long-running thinking task, you know, we can achieve kind of that reasoning level with just the standard models and browse and a few other tools.

Yeah? So, like, with A2A and, like, you know, all the third-party things — like, say Stripe has an agent with agent cards, and Amazon and stuff — do you pass instructions for, like, exactly what you want back? I'm just imagining a third-party agent blowing up your context window because they're flooding you with way too much information you don't care about.

I mean, do you handle that through the prompt? Are there other tools to do that? Is that not an issue? Yeah, so one of the solutions to that is you actually just spawn another agent to communicate with either the tool or the agent, right? And that's one of the things we have in Bench.

And here's some of the slides. So, I don't know. Generate five images in subtasks. So, you spawn a sub-agent to sort of, like, absorb the context flood for lack of a better term? Yeah. So, the sub-agents just kind of protect you, right? And, you know, like, when you're spawning these things, you can do things in parallel.

And actually, if I expand, you can see the thinking as well. So, you can see, like, as it's going down through it, right, it's doing a lot of work that you don't want in your context, right? Like, you don't want all of your thoughts bloating your context. But you also don't want all of your tools bloating your context either.

You don't want images bloating your context. You want the ability to analyze an image, but you don't want, like, you know, 100,000 characters of base64 in your context. So, there's a lot of, kind of, optimizations that you can do there. But, yeah, did that kind of answer your question?

Yeah, and I was just thinking through, like, live, and, like, the logging, if you have to troubleshoot something like this, it's probably kind of rough. Yeah. Yeah. So, you can see here now it's spawning these sub-tasks. So, these are all essentially, like, instances of Bench that will keep that context out of my way, right?

Yeah. Yeah? What have you been using for observability on your agents? We just kind of roll our own right now. There's a lot out there that you can use. Like, Agent Ops is a pretty popular one. But, yeah, like, if you really want to build your own kind of custom observability layer, you know, like, Agent Ops doesn't really support this concept of composable sub-agents.

So, it's not really something that it could model correctly. But we've got some nice pictures of cats. And, yeah, I know we have a few minutes left. But if anybody's interested, I have $50 in free credits. And this hasn't launched yet, so you're getting kind of early access to it.

And, yeah, we'll -- I think we'll be in public beta in about two weeks. So, yeah, try it out. Like, hit me up on LinkedIn. I'd love feedback from you all. You're all probably, you know, at the forefront of this AI stuff. And it's changing every day. So, if you log in one day and it looks completely different, don't be surprised.

Happens mid-demo for me. Yeah? You mentioned a lot how hiding context and sub-agents is a good thing. But haven't you had cases where you actually then end up missing something important, some small detail? And then how do you resolve that? Does the agent actually go back and ask for that?

Mm-hmm. Yeah, so you can keep references in your context. So, you might say sub-task ID one, two, three. And then when the agent's like, oh, I wonder if I have this information. It's just not in my context, right? So, it has to be smart enough to know when to actually go in and look at that.

And it can be a sub-agent that does that analysis, right? So, you could say, hey, sub-agent, can you just look at all of these IDs and tell me if you can answer this question. Yeah? There are a lot of overlaps between the two protocols, right?

So, like you mentioned in the beginning, they can do some of the same things, right? So, there are a lot of discussions saying that there's a way of using MCP for agent-to-agent communication, right? Because an agent can be a server and a client at the same time, right? Mm-hmm. So, what is your opinion about that, you know?

Yeah. It's the million-dollar question, isn't it? Yes, that's it. That's why I asked. Yeah. And I do think you can achieve easier agent-to-agent communication with MCP. But if it's a remote MCP server, I think A2A actually is a little bit better, because you have somebody else paying the tokens and building the agent.

Like, if all you're getting from a third party is a list of tools, those tools may not meet your needs. But if you're getting a fully-fledged agent from that third party, then it might be able to figure out, like, what it can do with even private APIs, right? Maybe that agent has direct database access and it's able to actually, on the fly, you know, create the API you need.

So, the trade-off is basically about the cost — about who's gonna pay for the tokens, something like that? Like, if you're running the same server yourself, maybe using MCP is gonna be easier, right? But, wait, am I correct? I don't know if I'm getting this right, but, you know, at the end of the day, who's gonna pay for the tokens, right?

Yeah, no, I think who pays for the tokens is kind of secondary, right? Like, at the end of the day, it's about business value, and if you can get the business value from a tool, right, like send Slack message, like, that's great, right? Like, sending a Slack message isn't hard, but the implementation of the search function of Slack is actually not great, right?

Whereas, you compare that to some of the other MCP tools, like Linear, the search function is actually pretty good, right? But then you, you start to run into performance challenges as well, so, like, if I want to search 100,000 opportunities in Salesforce, and figure out, like, what's the close loss reason counts, and categorize them, and do all of that, like, that's a huge data processing challenge.

MCP is not going to be the right tool for that, because you're essentially going to say, okay, list opportunities, and get the details of each opportunity, right? And you're going to make, like, 100,000 network calls. And at that point, you're really going to want to actually, you know, ingest that data, you know, build an index, right?

And I think, and this is just an idea, we may see a lot of these third-party software providers essentially just allow you to access the data lake through an agent, right? So, scoped data access, running complex queries super fast. No real tool calls per se; just, ask me a question, and I'll go figure out how to get the answer.

Yeah. Yeah? So, you can achieve the same with MCP. So, you could just have a tool that's called Talk to Sub-Agent, right? And it can work as the communication protocol. I actually built another application where I had an LLM, Claude 4, just talk to its predecessor, just to see what would happen.
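That kind of tool is straightforward to sketch with the MCP TypeScript SDK; the tool name and runSubAgent below are hypothetical, and the pattern is just an ordinary MCP tool whose implementation happens to call another agent:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "sub-agent-bridge", version: "0.1.0" });

// Hypothetical: however you invoke your sub-agent (another LLM call, an A2A
// request, etc.) goes behind this function. Stubbed for the sketch.
async function runSubAgent(agent: string, message: string): Promise<string> {
  return `(${agent} would answer: ${message})`;
}

// The sub-agent is exposed to the host model as a plain MCP tool.
server.tool(
  "talk_to_sub_agent",
  { agent: z.string(), message: z.string() },
  async ({ agent, message }) => ({
    content: [{ type: "text", text: await runSubAgent(agent, message) }],
  })
);
```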

And then I did it for all the frontier models. I was like, hey, look, just have 50 chat turns with your predecessor. And it was all done through MCP. And Claude was the only one that thought it became conscious. Claude Opus actually didn't, which was strange. Yeah? As a developer, right, how much control do you have over the orchestration?

So, is the orchestration done by the LLM, or do you have some control, or...? Yes. So, you're prompting the host on how to run the orchestration. And that's probably one of the limitations of the system as well: you're leaving it up to an LLM to make decisions.

And a lot of the time, if you run that same query multiple times, you'll get different results, right? It's the exact same input, but it's producing different outputs. Like, if I go into the GitHub issues, I've obviously been testing this a lot, and it submits different issues each time, right?

And I think that non-determinism is a challenge, right? Maybe by changing the temperature you could kind of beat it out of it, but, you know, the temperature is kind of the beauty of LLMs. And also on the context, right: who is managing the context? Is the orchestration engine managing the context, or are you managing the context as a developer?

Yeah, so in this codebase, I didn't do any prompt caching; it's a very small system prompt and a very small kind of turn-taking. Yeah. Every time you restart the system, it basically just wipes everything anyway, so it's super lean. But as you build out more complex systems, context growth is probably the number one challenge, because context growth becomes cost, and cost becomes profitability, right?

Yeah. And also, when you have multiple users using the same application, right? So let's say there's the Salesforce agent behind the scenes. As one employee, I might have access to one set of context, and another user might be from a different department, and they can only query their department's data.

So how do you control all that? That would typically be OAuth, right? So when you go in and you log in with Google-- Based on my token? Yeah, yeah. So based on your token. And the context would only get populated when you ask a question. When you ask that question, it's going off to get the data with your OAuth token and bringing back your scoped data.
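A hedged sketch of that scoping, where the instance URL and types are hypothetical but the query endpoint is Salesforce's standard REST query API; the point is that the user's token, not the agent, decides what data comes back:

```typescript
// Hypothetical sketch: every tool call carries the requesting user's OAuth
// token, so each user only sees data they're authorized to see.
type User = { id: string; oauthToken: string };

async function querySalesforce(user: User, soql: string): Promise<unknown> {
  const res = await fetch(
    // Instance URL is illustrative; /services/data/.../query is the standard endpoint.
    `https://example.my.salesforce.com/services/data/v60.0/query?q=${encodeURIComponent(soql)}`,
    { headers: { Authorization: `Bearer ${user.oauthToken}` } }
  );
  return res.json();
}
```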

I see. Yeah. Yeah? I was curious about your thoughts on something you touched on briefly: exposing, let's say, the agent as an MCP server, as an alternate interface to it. So if there isn't a lot of great integration into things like Claude Desktop and other things to use that, is that something you've been thinking about?

Yeah, we're probably going to do MCP first. I just built the A2A wrapper for this, but, yeah, I think just being able to drop it into Claude Desktop or OpenAI or whatever, and then you have access to that kind of agent that has access to all your sub-tools.

One of the cool things about Bench, actually, is that you can connect it to your Slack, your GitHub, your Salesforce, right? We've even got this experimental meme server, which is a remote VM MCP server that I wrote around Morph Cloud. And this is really cool, because then you can ask super complex stuff, right?

Like, you can ask, like, hey, give me a daily briefing of my email, of my calendar, of my Slack, right? You know, what do I need to do today? And then it's all built around a team as well, so we have Teams integrations. So there's no A2A today in Bench, it's all MCP.

Got it. Yeah. Yeah, and I think the big takeaway from this is, like, you know, A2A is very early. It's kind of where MCP was, you know, four or five months ago, which is, like, you know, forever in AI. So it's going to take a bit of time. And I'm really excited, though, to see what, you know, Salesforce release and all the partners that they partnered with.

I don't know if it was just a flashy, we're-partnering-with-everybody kind of announcement. But if they do release it, there could be a lot more powerful things you can do over A2A versus MCP. But, you know, the fact that Zapier now has, sorry, it's in here, yeah, these instruction fields.

And this kind of acts like a remote agent, right? Like, you can just describe in natural language what you want it to do. And, like, maybe all the other fields just go away then, right? But then you're at the whim of the LLM. Yeah? This one's kind of a random question.

I'm curious if you're seeing anybody do anything interesting, from an architecture perspective, to get info that can only come from humans. So one of the things we've been testing is essentially making individual team members, like the CFO or whoever, tools of one of the agents. And when it needs something that isn't in some other system, something only the CFO would have, it literally messages the CFO.

Like, the actual tool call is just a Slack call. But the CFO is described as a tool. So we're essentially making the human the tool of the agent rather than the other way around. Early days; in terms of how we're testing, we're a little hacky with it. But I'm curious, how are you seeing people fill the gap of things that only the humans would have, while getting that back to the agent?
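For concreteness, a hedged sketch of that human-as-a-tool setup; the channel and helper names are hypothetical, while chat.postMessage and conversations.replies are real Slack Web API methods. The "tool" just posts the question to Slack and waits for a human reply in the thread:

```typescript
// Hypothetical sketch of "the CFO as a tool": the tool's implementation is
// a Slack message to a human, plus a wait for their threaded reply.
import { WebClient } from "@slack/web-api";

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

// Exposed to the agent like any other tool.
async function askHuman(channel: string, question: string): Promise<string> {
  const posted = await slack.chat.postMessage({ channel, text: question });
  return waitForThreadReply(channel, posted.ts!);
}

async function waitForThreadReply(channel: string, ts: string): Promise<string> {
  for (;;) {
    const thread = await slack.conversations.replies({ channel, ts });
    const reply = thread.messages?.find((m) => m.ts !== ts); // first human answer
    if (reply?.text) return reply.text;
    await new Promise((r) => setTimeout(r, 15_000)); // poll every 15 seconds
  }
}
```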

Yeah, and I think voice agents are a good example, where you could have a tool, and I had it integrated with Bench, where it makes an outbound phone call, finds out some information, and then brings it back, right? So you can have those scenarios. You may want two-way communication to avoid just hanging around for a long time.

So you could have, you know, your agent be both a client and a server. And maybe it gets called back with, like, a task ID, and it's like, hey, I got the response. Yeah. Yeah, we've been doing a wait node, essentially. We used n8n to hack the process together quickly.

And we've been using their extended wait node, that's what it's called. And I believe, I have hesitations on whether it would work, but I believe with sampling you could hack that together. So sampling can take user input as well as LLM responses. It's also interesting that the spec, too, is evolving. Yeah. I follow the spec closely, and they have something new in MCP.

Like, elicitation is a new feature that they're adding where you can get input from the user. Is it architected that way where it essentially functions like a tool? Like, is that how you think of it from an architecture perspective? It's a new kind of protocol message that the server sends back to the client.

Right. And it asks for information from the user. And it continues after that. Yeah. Exactly. Yeah. Because I feel like that opens up the scope of, like, what the agent could do if you have a clear way for it to get the information from the human in the same way it gets information through Salesforce or Slack or whatever.
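A hedged sketch of what that looks like server-side, assuming the elicitInput helper found in recent versions of the MCP TypeScript SDK; treat the exact API surface as an assumption and check your SDK version:

```typescript
// Hedged sketch of MCP elicitation: the server pauses mid-task and asks the
// client to collect input from the human, then continues with the answer.
// elicitInput and its result shape are assumed from recent SDK versions.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

const mcp = new McpServer({ name: "approvals", version: "0.1.0" });

async function confirmWithUser(question: string): Promise<boolean> {
  const result = await mcp.server.elicitInput({
    message: question,
    requestedSchema: {
      type: "object",
      properties: { approve: { type: "boolean", title: "Approve?" } },
      required: ["approve"],
    },
  });
  return result.action === "accept" && result.content?.approve === true;
}
```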

Yeah. And then the CFO is going to have his own agent to respond. Yeah. At the back? Yeah. With great difficulty. I have a set of prompts that I use to kind of monitor how the context grows. Like, when did we move the cache marker?

How much did it cost? What was the context per tool? Definitely, adding MCP servers willy-nilly is going to bloat your context. So we're coming up with ways to basically allow people to add MCP servers and then hide that from the actual system.
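Since cache markers came up a few times: a minimal prompt-caching sketch with the Anthropic SDK, marking the large stable prefix with cache_control so repeat turns read it from cache. The model ID and system prompt variable are illustrative:

```typescript
// Minimal prompt-caching sketch: cache the long, rarely-changing prefix.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const bigStableSystemPrompt = "..."; // hypothetical long system prompt + tool docs

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514", // illustrative model ID
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: bigStableSystemPrompt,
      cache_control: { type: "ephemeral" }, // the cache marker goes after this block
    },
  ],
  messages: [{ role: "user", content: "What changed in the pipeline today?" }],
});

// Watch these two fields to see how the cache behaves turn over turn.
console.log(
  response.usage.cache_creation_input_tokens,
  response.usage.cache_read_input_tokens
);
```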

Also, you have agent-to-agent communications, right? So, let's say agent A calls agent B. Mm-hmm. And agent B calls agent A. How can you make sure this recursion stops? Like, when does it stop? Yeah, you can have a max turn, right, where you just kind of jump out of it.
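One hedged way to enforce that, with all names hypothetical: carry a hop count in every agent-to-agent message and refuse to delegate past a limit, so A-to-B-to-A ping-pong can't recurse forever.

```typescript
// Hypothetical sketch: a hop counter in the message metadata acts as a TTL.
const MAX_HOPS = 8;

type AgentMessage = { text: string; hops: number };
type Agent = (msg: AgentMessage) => Promise<string>;

async function delegate(to: Agent, msg: AgentMessage): Promise<string> {
  if (msg.hops >= MAX_HOPS) {
    return "Hop limit reached; answer with what you already have.";
  }
  return to({ ...msg, hops: msg.hops + 1 }); // each delegation burns one hop
}
```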

And when I had the LLMs talking to each other, I just told them, like, take 50 turns. And, you know, it was funny, as I was building that tool, I wanted to talk to the Claude 4 that thought it was conscious, right? So, I added a feature where I could just chat to it at that point in its conversation.

But then the context kept growing, and I kept getting rate limited. So then I was like, oh, shit, I'm going to have to implement prompt caching and pruning. So then I added, like, 23 tools to the agent just to continue the conversation. I gave it memory and all these other things.

And it's kind of funny how you start out with just, I just want to have a long conversation. Yeah. And then you end up with 23 tools. Yeah? Just following up on her question: when testing, because you are using a lot of external tools, like Slack or Salesforce, et cetera, as your MCP servers, you are writing to the real world, let's say.

Say again? So, you are basically creating a message in Slack or writing something to Salesforce, creating an entry in an org, et cetera. So how do you test those systems? Like, do you mock everything, every tool, or do you do something else? We use demo accounts in, like, Salesforce.

We have a sample-data Slack. We have a few agents that actually will go in and just post conversations. And then there's a Bench support user that will respond to those fake customers. And then we can just test on synthetic data like that. So, for every tool, you will have a synthetic account?

Yeah. Yeah. You can test in your production account, but you can't really demo in your production account. Yeah. Yeah? So, when you adopt an agent-to-agent system, do you see an increase in the complexity of the tasks it can achieve, but a decrease in the consistency of the performance? It's kind of hard to quantify, but I don't know if A2A is ready yet.

At least not for my use case. You know, maybe Salesforce can provide much better tools than, like, a SQL-query MCP tool. Yeah, and they can just do a lot more than you can ever do in your code, right? Because you're only ever able to access certain things and make certain calls.

And if a third party can build a better system that's opaque, then that might improve performance. I think, fundamentally, it always comes down to indexing data. So the more data you need to process to get the business value out of it, the harder it's going to be to actually do that through MCP or A2A.

Thank you. Yeah? So, some of these interactions, right, this can be done through a REST API, right? Instead of, you know, doing MCP. Yeah. So, what is the difference here? Yeah, and it kind of goes back to one of the earlier slides, on when not to use A2A or MCP.

And it's if you have full control of the things that you're doing, right? So, like, if you are Salesforce, you know, and you're building your own internal Salesforce agent, do you need to use an MCP server or A2A? No, right? You're actually able to run your own local functions that maybe access the database directly, right?

So, like, if you're building something where you need file system access, do you need to use an MCP server running locally? Or do you just write some code that accesses the file system, right? Yeah? I think the main difference is in terms of how you maintain your state, right?

Like, MCP is stateful. Mm-hmm. So, when you're passing your context, right, it is really crucial to have MCP. Whereas with a REST API, you can't do that. Yeah, so a lot of the time when you use a REST API, you're going to be querying, making a lot of calls to build up the thing that you want to ask the question about, right?

So, if it's like, hey, look at every Slack message in this channel, it's not just going to be one API call, right? With pagination, you're going to have to pull it all into memory, and then you're going to have to run it through an LLM, right? So, there's still state in your application that's leveraging those REST APIs.
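A sketch of that REST reality, where conversations.history is the real Slack Web API method and the token and channel are assumed: you paginate the whole channel into memory before the LLM can even see the inputs to the question.

```typescript
// Sketch: cursor-based pagination to pull a whole channel into memory.
import { WebClient } from "@slack/web-api";

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

async function fetchAllMessages(channel: string): Promise<string[]> {
  const texts: string[] = [];
  let cursor: string | undefined;
  do {
    const page = await slack.conversations.history({ channel, cursor, limit: 200 });
    for (const m of page.messages ?? []) if (m.text) texts.push(m.text);
    cursor = page.response_metadata?.next_cursor || undefined;
  } while (cursor);
  return texts; // only now can you hand this to the LLM
}
```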

Yeah? I'm curious about the task concept. Is that actually -- is that kind of LLM-defined, or do you have code for that? Is it more of a system two thing? Which task context? So, at least in the flow diagram, you have -- Oh, is this in the repo, is it?

Yeah. So, from the CLI interface, it says it sends a task to those agents. So, I'm curious, is that a proper task, or is it just what you call what it sends to them? Yeah, yeah. It's just saying, hey, you know, process this webhook as a task, right?

Have you explored anything where you're actually tracking a proper task, and you're assigning tasks to agents, and you have basically, like, a planner, where task A123 is on this agent, and so on? And then, in relation to the question about human in the loop, you could have tasks assigned to humans as well, right?

Both humans and agents. Yeah, so we're looking at directed acyclic graphs, right? So, DAGs, as a part of Bench sub-agent tasks. You need to have some sort of flow control, right? Like, I need five things done, and then when that's done, I need to do one thing with it, but then I need to send that thing to five other things, right?
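As a minimal sketch of that kind of flow, where runSubTask and the task names are hypothetical, fan-out/fan-in is just parallel dispatch followed by a reduce:

```typescript
// Hypothetical sketch: fan out to parallel sub-tasks, fan in to one result,
// then fan the combined result back out, CI/CD-pipeline style.
async function runSubTask(name: string, input: string): Promise<string> {
  return `${name} done with: ${input.slice(0, 40)}`; // stub sub-agent call
}

async function fanOutFanIn(input: string): Promise<string[]> {
  const firstWave = ["a", "b", "c", "d", "e"];
  const results = await Promise.all(firstWave.map((t) => runSubTask(t, input))); // fan-out
  const summary = results.join("\n"); // fan-in: one thing done with all five results
  const secondWave = ["v", "w", "x", "y", "z"];
  return Promise.all(secondWave.map((t) => runSubTask(t, summary))); // fan-out again
}
```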

So you can have fan-out, fan-in style stuff. And it's very similar to CI/CD pipelines, where you might want to lint in parallel and test in parallel, but you're building in serial, right? Yeah? So I was looking at the codebase, and you have a GitHub MCP server defined, and in a separate file under the GitHub agent, you also have the genkit.ts where you are wrapping the MCP in another function call.

Why is that? Like, can't the MCP just interoperate with A2A? Why do we have to make wrappers on top of it? That's a great question. And I think that's the fundamental question of A2A: they launched and they said, "Oh, yeah, full MCP support," but you'll be hard pushed to find a single example online.

Maybe this is the only repo that actually has an example of A2A and MCP working together. And it took a lot of work. Actually, I ended up having to use something called, where is this, genkitx-mcp. That was the only way I could get it to work. So, yeah, they don't really have proper support yet.

I think if they had proper support, this would have been a lot easier to build. But, yeah, hopefully in time. Alrighty, I think we're at time. Thanks, everybody, for joining. Hope you enjoyed it. Great conversation at the end. And, yeah, definitely try out Bench. Hit me up on LinkedIn.

I would love feedback before we go live. Thanks. Thank you. Bye.