
AI in Action 15 Aug 2025: How to Build and Manage a Team of AI Agents


Chapters

0:00 Introduction to AI Engineering
0:30 Challenges with Language Models
0:58 Compliance Rules
1:06 Execution Protocol
1:45 Task Delegation
7:02 Test-Driven Development (TDD) Strategy
8:44 Prompt Engineering and Improvement
9:50 Sub-Agent Orchestration and Workflow
31:06 Sub-Agent Types & Responsibilities
31:51 Contextual Identity and Protocol
33:04 Validation Hand-off
38:08 Role of the Reviewer Agent
39:58 Local Documentation Indexing
40:37 Hooks for Automation
40:52 Database Schema and Task Management
42:01 Architecture & Framework Compliance
42:30 Playwright Debugger for Front-end Testing
43:11 Visual Anthropologist for Design Decisions
43:54 Release Gate Auditor
43:58 Data Reviewer for Business Logic

Transcript

Thank you. I got a thumbs up. Dirk's like the one other person besides me actually on camera. Yeah, I would, but I'm quite sick. So no worries. No worries. Thank you for still being willing to kick us off then. Appreciate it. Yeah, no worries. Okay. I finally found it.

Sorry for the delay. Awesome. All right. Well, I'll hand off. Would you, if you'd like, I can keep an eye on the chat and just flag things as they're there so you don't have to pay attention to it. And then, yep. So welcome everyone. I'm going to hand over to Olivier.

He kicked off probably the most epic thread we've had in the AI in Action channel, talking about Claude and subagents. So he's going to present about his approach and workflow. It sounds like he's sick and he's in an EU time zone, so he'll kick us off for the first, you know, 20, 30 minutes, and then we can continue the conversation.

And with that, I will go off camera and hand it over to you, Olivier. Great. Thank you so much. Okay. So let's give you some background, everyone who's watching. Markov is an AI engineering studio. We're based in Antwerp, and long story short, we're focused on the automation of accounting systems.

So end-to-end accounting, and we're also in MedTech now, bringing both systems to production on like a pan-European scale. For me with Claude Code, the really big unlock came around the Claude 4 model series release, and then these subagents just make things a lot more reliable. I think it's important, even though I won't go into this in high detail today, that everyone is aware of how we got to the point within Markov where we started using Claude Code, right?

So for us, the big learning has been how much better the models have been getting at coding. Actually, you know, for everyone who's been working with these models for a long time, I'm sure you'll know, like a few years, maybe even double-digit months ago, they were still making a lot of syntax errors.

That's not really the case anymore. And the models have gotten pretty good. So now we've come to the point at least for us internally where we will often, and this is like, you know, in the scope of like internal tools or like prototypes for customers because we also do a bit of consulting on the side.

You know, our flagship products are MedTech and accounting, so fintech, but we also do a bit of consulting on the side because we haven't raised any capital and it's a nice way to generate some cash flow. So what we have, basically, on my screen you'll see, this is kind of our directory, and all of the prompts that we share within our team are stored in here.

The main reason that I'm sharing all of this, even though, you know, I think a lot of people would consider this to be quite proprietary data, is because in a few months it's all going to be irrelevant anyway. Because I'm pretty sure, at least that's been my experience so far, that all this kind of harness and scaffolding, as the models get better, just gets washed away.

So maybe I'll be able to help some people in the coming months with this. So as you can see, we have these architecting prompts. We have these subsections of prototypes: we have a Jinja, FastAPI, local, Azure, you know. So this is a prompt that we use to create prototype plans for really, really quick prototypes that will just run on our machine.

Like we don't even deploy to Azure. It's for super quick customer demos, and this prompt is basically catered fully to our stack. It has a lot of methodologies that we use on how we want our plans to be constructed. So we always use the OpenAI Agents SDK, and we use LiteLLM with OpenRouter.

So all our code is model-agnostic, or at least the prototypes are. The big downside with this, and I'm not sure if there are many other Europeans here, is that it's more difficult to be GDPR compliant when you use LiteLLM and OpenRouter. So in production, it's usually the OpenAI Agents SDK and then some European endpoint.
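To make that model-agnostic setup concrete, here is a minimal sketch of routing calls through LiteLLM so the provider is just a string you swap out; the model identifier and the OPENROUTER_API_KEY environment variable are illustrative assumptions, not Markov's actual configuration.

```python
# Minimal sketch of the model-agnostic prototyping idea: route every call
# through LiteLLM so the underlying provider is just a string you swap out.
# The model string and the OPENROUTER_API_KEY env var are illustrative.
import os
import litellm

def ask(model: str, prompt: str) -> str:
    """Send one prompt to whichever model string is currently 'best'."""
    response = litellm.completion(
        model=model,  # e.g. an OpenRouter-routed model
        messages=[{"role": "user", "content": prompt}],
        api_key=os.environ["OPENROUTER_API_KEY"],
    )
    return response.choices[0].message.content

# Swapping the planning model is just a different string:
plan = ask("openrouter/openai/gpt-5", "Draft an architecture plan for ...")
```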

But for prototyping, it doesn't matter as much. And this is really nice because this way you can just easily swap models in and out, and if it is already included in your planning stage, then it just makes life a lot easier. Then we usually go with hexagonal-lite structures, or hexagonal software architecture.

We define a bunch of variables that are important to us specifically for these local prototypes. We usually use SQLite and FastAPI. And then if we wanted to go really fast, we also have like Streamlit. So some Streamlit prototypes which don't even have like FastAPI. We also have one where we use Streamlit as a UI with a FastAPI and then two containers which we can then deploy to Azure if we want to.
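As a rough picture of what such a local prototype looks like, here is a minimal, self-contained sketch of a FastAPI app backed by a local SQLite file; the table, routes, and file name are invented for illustration.

```python
# A minimal sketch of the "local prototype" shape described above:
# FastAPI in front, SQLite behind it, nothing deployed anywhere.
# Table and route names are made up for illustration.
import sqlite3
from fastapi import FastAPI

app = FastAPI(title="local-prototype")
DB_PATH = "prototype.db"  # local file, no Azure involved

def get_conn() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
    return conn

@app.post("/notes")
def add_note(body: str) -> dict:
    with get_conn() as conn:
        cur = conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
        return {"id": cur.lastrowid, "body": body}

@app.get("/notes")
def list_notes() -> list[dict]:
    with get_conn() as conn:
        rows = conn.execute("SELECT id, body FROM notes").fetchall()
    return [{"id": r[0], "body": r[1]} for r in rows]
```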

So that's kind of the essence of it, right? And you'll see like there's nothing in here that really takes us to production. That's because, well, you don't, you don't really architect software that often to take it to production, right? So that's usually like custom work that we only have to do when we like want to start a new flagship product or something.

That's not really part of this. But if it were, then it would be something like FastAPI with Pydantic, and then if there's a frontend, it would be something TypeScript and React-based, right? I think that's what most people are doing these days. Oh, and then I also have this agent blueprint prompt which will basically generate a plan of the agents that we want to create for a certain project.

And we also have a test, oh yeah, we also have the test-driven development prompt. So to recap how it usually works: we have some idea of some prototype or some application that we want to make. Then we will pick the stack that we want, so FastAPI plus Jinja, or Streamlit, or something else.

And then at the bottom of these prompts you can always see, sorry, you can always see: paste your idea below this line. So this is nice because that means that whatever provider, whatever model is best at architecting, you know, for a week it was Gemini 2.5 Deep Think, and before that it was o3 Pro, and now it's GPT-5 Pro again.

So you just swap between the models. And usually how I go about it is, if I want to create some internal tool, and to give some context, our internal tools, we build them once and then we actually keep building on them, so we want to make sure that the first time we make them it's already maintainable, scalable code.

So no spaghetti code. And how it usually works is, depending on what I want to do, I will pick a prompt, I will just paste this prompt in, and then I will ideate with whichever model I'm using. So I will just use voice messages, transcribe, and basically spend half an hour to an hour just going through the motions of: what does this actually look like?

What is the user journey that I want? So the user stories, and what are the functional requirements? And then when I've gone through that, I'll take the output of that, start a new session, and basically paste it underneath here, and then you get a very nice, verbose plan.

Then when you have that plan, you can put that plan underneath this prompt, which will basically create a whole test set for your code. Because, and we'll get to that in a bit, within Claude Code the only way you can really get it to work reliably, from my experience at least, which is obviously limited, is to really double down on test-driven development. But these models are quite prone to reward hacking, so you really want to make sure that for the test-driven development itself you also have a standalone plan.

So you'll see we kind of go through the motions of doing, you know, unit tests, end-to-end tests, smoke tests, live API keys, kind of everything. Then when this is done, you'll have two plans, right? You'll have your test-driven development plan, so you have your test-plan.md, and then you have your plan.md. And these two files, depending on the complexity of the application that you're making, either you can separate them and have two markdown files you're working with, or you just append one after the other, and you have your plan.md to get started.
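To give a feel for what the test plan turns into, here is a small hypothetical sketch of layered pytest tests; the markers, the create_invoice function, and the health endpoint are assumptions, and a real smoke test in this workflow would hit live API keys or a running dev server.

```python
# A minimal sketch of how a test-plan.md might translate into layered pytest
# tests. Markers and the `create_invoice` function are hypothetical.
import os
import pytest

def create_invoice(amount: float) -> dict:
    # Trivial implementation; in red-to-green TDD this would start as a
    # failing stub and only be filled in once the test is red.
    return {"amount": amount, "status": "draft"}

@pytest.mark.unit
def test_create_invoice_is_draft():
    assert create_invoice(10.0)["status"] == "draft"

@pytest.mark.smoke
@pytest.mark.skipif("LIVE_API_KEY" not in os.environ, reason="needs a live key")
def test_live_endpoint_responds():
    # Smoke test against a real running dependency, the kind of check that
    # forces agents to produce proof instead of mocking everything away.
    import httpx
    assert httpx.get("http://127.0.0.1:8000/health").status_code == 200
```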

So that's kind of the whole journey before you get to actually using Claude Code, and then this is the fun part. A quick question came in in the chat before we move on from this section, which is: for all of these sort of high-level guidance prompts, are these also generated by a model?

Do you have like a base prompt that you use to generate them or like how are you managing those kind of prompts over time? Yeah, that's a good question. I mean, I've been working with these models for a very long time so it's kind of intuitive to me but we do have, so we have prompts for everything, right, and let's see.

So depending on the subject matter, like if it's something I don't know anything about, for example, we have this MedTech project, the flagship product, right, and obviously I'm not a doctor, so I want to do a lot of deep research. So we have these prompts where, same principle, you put your query underneath and it will generate a highly verbose, super detailed deep research query for you, which you can then again put into ChatGPT or whatever model you want to use for deep research.

And then we also have, like, user query to prompt, same principle: it's a prompt that helps you create prompts, right? So you basically put your thoughts underneath. And from my experience, I always, always, always just use speech-to-text, because with the bandwidth that you have when you're typing, you're always going to take shortcuts, whereas when you can just talk to these models, you will just use a lot more of your own tokens, your bandwidth is much larger. And from my experience, obviously anecdotally, a lot of the time when you're talking to these models or recording this first message, halfway through you're going to realize that something you said actually doesn't make sense, and then you just cancel and go again.

So you're kind of iterating with yourself, and then when you're happy... Talking makes you think in a different way than writing does. Yeah, exactly. And then you put your idea underneath and then you get a prompt, right? And then, for actually improving your prompts, that's kind of the nice thing with these models.

So usually what I do is I just find all the papers and best practices that the labs release. So OpenAI, Anthropic, they always have these best practices for prompting their models. So I will scrape that, and then based on the documents that I've scraped, I will basically improve my prompt improver, right?

And then you use that scraped documentation, you use the most intelligent model that's available to you, and you just iterate on your prompt improver. So it's kind of funny because it scales pretty well with model intelligence. And we also have these best practices, right? So we have this best-practices prompting playbook.

You know, you can see you have some XML stuff here that's specifically for the Claude model series, and then we have some best practices for the GPT-5 models. But yeah, long story short, that's kind of how we go about it: you just scrape what's out there, you take that context, and you ask the smartest model that's available to you to, you know, make it make sense and help you with prompt engineering.
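As a sketch of that loop, and only a sketch, the following pulls down prompting guides and asks a strong model to rewrite the prompt improver; the URLs, file name, and model string are placeholders rather than the actual sources used here.

```python
# A rough sketch of the "improve the prompt improver" loop: scrape the labs'
# prompting guides, then hand them plus your current meta-prompt to the
# strongest model available. URLs, file names, and model string are placeholders.
import pathlib
import requests
import litellm

GUIDE_URLS = [
    "https://example.com/prompting-best-practices",  # placeholder URL
]

def refresh_prompt_improver(current: str) -> str:
    scraped = "\n\n".join(requests.get(u, timeout=30).text for u in GUIDE_URLS)
    response = litellm.completion(
        model="openrouter/anthropic/claude-opus-4",  # illustrative model string
        messages=[{
            "role": "user",
            "content": (
                "Here are the latest prompting guides:\n" + scraped +
                "\n\nRewrite my prompt-improver prompt so it follows them:\n" + current
            ),
        }],
    )
    return response.choices[0].message.content

improved = refresh_prompt_improver(pathlib.Path("prompt_improver.md").read_text())
pathlib.Path("prompt_improver.md").write_text(improved)
```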

But it's very important that this whole phase of planning and prompt engineering is not a hands-off thing. You actually have to monitor it and put in your own time, your own thoughts, else it's not going to work. OK, so moving on. Right now, the flow that we have in Claude Code is basically: you start with your plan, right?

So you use your architect prompts, and the moment you actually start in Claude Code, and maybe I can show you an example. Let me see if I have one here. Is this? OK. This is maybe a bit too proprietary. Wait, let me grab something from an internal repository. OK, so this is an example of a plan.

This is for a nonprofit that I'm in, where we basically want to use AI for managing our social media advertising. So we wanted to create a Jinja front end with FastAPI and Pydantic. We use the OpenAI Agents SDK, and this is the plan, right? So you can see it's over a thousand lines.

It's like one hour of iterating with GPT-5 Pro using our planning prompts. We have our test plan. As you can see, we have the test plan appended underneath our plan, the plan.md, so to speak. So the first part is focused more on software architecture, data models, user stories, et cetera.

And the second part is focused purely on the testing, the test-driven development of it all. Then we have our CLAUDE.md file, very heavily focused on orchestration. So in this setup, your CLAUDE.md, let me, I can just start one here. So in this setup, your root agent, so to speak, is never actually going to write any code, right?

All it does is orchestration. And how it will always start is, you know, let's say you have a clean repository and you want to start with the implementation of your plan. You open it up, you point it at the root CLAUDE.md, and then, well, this isn't actually the root repository, right? Which is why it's not looking for the plan.

It's a sub-repository. It's like just a prompt stash, so to speak. But what it would do is if you tell it, okay, read this plan, it will read the plan, then that's when the automation or the autonomous engineering comes into play. So the first thing it will do is it will summon or it will execute the architect agent.

And what this will do is it will basically turn our plan, our markdown file into a task list. And for this task list, we have an MCP server. So I wrote my own MCP server for this, and this one is empty. Let me grab one from a different repository real quick.

I believe this one. So it's very simple. It's a SQLite database. And what the architect does is it will go through the plan and it will basically create vertically sliced task lists for maximum parallelization. And then you can see it's a little bit bigger. Okay.
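A guessed-at shape for such a task database, for illustration only (the real schema behind the MCP server is not shown in the talk):

```python
# A guessed-at shape for the task database behind the MCP server: one row per
# vertically sliced task, with a status column driving the state machine and a
# files column used for the ownership preflight. Names are illustrative,
# including the database file name.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id          INTEGER PRIMARY KEY,
    phase       INTEGER NOT NULL,            -- release-gate phases 1..N
    title       TEXT    NOT NULL,
    status      TEXT    NOT NULL DEFAULT 'pending',
                -- pending -> assigned -> in_progress -> needs_review
                --        -> approved -> completed -> merged
    assigned_to TEXT,                        -- executor / reviewer / etc.
    files       TEXT,                        -- JSON list of files the task may touch
    artifact    TEXT                         -- completion report written by the executor
);
"""

def init_db(path: str = "tasks.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```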

So it will basically create tasks and then assign them to agents and execute them in parallel, right? So what's very important is that every agent, so we have one main agent that writes code in this setup. It's the executor, it runs Sonnet, it has access to Bash; it basically has access to all the tools, and it also has access to all of the MCP servers that we use for this.

So we use Playwright for runtime testing, and then we have this core hub lite, that's the MCP server. And what it'll do is it gets a task. The orchestrator agent is going to give it a contextual identity. So when the orchestrator summons or invokes an executor, it will give it a contextual identity, and within that identity it will basically tell it: this is your task, these are the files that you can edit, these are the tests that you have to run, and this is the output that you have to generate. And the output that it generates is a task artifact.
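One way to picture that contextual identity as data, purely as an illustrative sketch with invented field names:

```python
# A sketch of the "contextual identity" the orchestrator hands to an executor:
# the task, the files it may touch, the tests it must run, and where its
# completion artifact has to go. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ContextualIdentity:
    task_id: int
    instructions: str
    editable_files: list[str]      # the only files this executor may modify
    tests_to_run: list[str]        # e.g. ["uv run pytest tests/test_invoices.py"]
    artifact_path: str             # where the completion report gets written

    def to_prompt(self) -> str:
        return (
            f"You are the executor for task {self.task_id}.\n"
            f"Task: {self.instructions}\n"
            f"You may ONLY edit: {', '.join(self.editable_files)}\n"
            f"You must run: {'; '.join(self.tests_to_run)}\n"
            f"Write your completion report to {self.artifact_path}."
        )
```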

Let me see if I can find one here. Right, so it's always red-to-green test-driven development. So it's very small. Let me see if there's a better way to display this. Right, so we use UV, we run tests, and it will generate an artifact. And every time an executor agent runs, the orchestrator will always invoke a reviewer agent afterwards, because in our system, in this MCP server which is being used by the orchestrator to do the task management, a completed task, so the first iteration of writing code, will always go to the status needs review. And that's the machine state: even the orchestrator cannot simply move a task from, like, in progress to approved or completed. It always has to go through in review, and that means the orchestrator will trigger this reviewer agent.

The reviewer agent basically checks two things. It checks the output of the executor agent, so it will read the artifact. The artifact is basically a highly verbose summary of what the executor agent has done. Let me see, I think the template should be in here somewhere. Right, so this is a completion report. So the task ID that they got from the orchestrator, right, the orchestrator will give the executor agent the task ID, that's the, what do we call it, contextual identity. And then when the agent is done it will create this artifact: the implementation context, the stack, the architecture, adherence to the plan. So every instance of this executor agent, and remember we can run four in parallel, will first read the plan, it will read the framework guidelines, and it gets from the orchestrator a concrete list of all the files that it's able to edit. So it will output in this summary, this artifact that we write to our SQLite database, which is our MCP server, all the files that it touched, what it did, what it built, and the key decisions it made. Then the test results: passing tests, coverage, all tests passing. Then validation handoff, so ready for validation: when it has that status, that means it will be handed over to the reviewer. And then we have the actual outputs of the tests, right?

So what we mandate from these executor agents is that they have to basically deliver a, let me see, let me search real quick, right, so this is kind of an important one: the business value truth gate. Why? Because these models are very prone to reward hacking, and you have to force them to show proof that the tests that they created, so red-to-green test-driven development, so the tests that they created, the green tests, especially in the phase where they're testing stuff at runtime, like live API key smoke tests, et cetera, actually did something. They have to deliver proof that what they did is actually real, because if we don't do that, then they will simply use mock data and they'll be like, yeah, sure, passed, when it really didn't. And then after three hours of these agents running autonomously, you boot up your application and it won't even start, because they just reward-hacked the whole thing. And this kind of fixes that, because the reviewer agent that comes after the executor agent is going to actually check the proof, and it's going to review the code, and it's going to run the tests. And if the reviewer agent establishes that the tests are faulty, then, let's see quickly, here we go, sorry for all the scrolling, it will basically hand the task back over to the executor agent and it will tell it what to fix, right? And it just goes back and forth until it is actually correct, and then the reviewer agent can say, okay, fine, this is approved.
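The status flow described here can be pictured as a small transition table; this is a minimal sketch of the machine-state guardrail, with statuses and roles taken from the talk but the enforcement code itself assumed:

```python
# A minimal sketch of the machine-state guardrail: the task tool only allows
# certain status transitions, and only certain agents may perform them, so
# nothing can jump from in_progress straight to completed. The exact API of
# the real MCP server is assumed.
ALLOWED = {
    ("in_progress", "needs_review"): {"executor"},
    ("needs_review", "approved"):    {"reviewer"},
    ("needs_review", "needs_fixes"): {"reviewer"},
    ("needs_fixes", "needs_review"): {"executor"},
    ("approved", "completed"):       {"integration_manager"},
}

def transition(current: str, new: str, actor: str) -> str:
    allowed_actors = ALLOWED.get((current, new))
    if allowed_actors is None or actor not in allowed_actors:
        raise PermissionError(f"{actor} may not move a task from {current} to {new}")
    return new

# The orchestrator can never self-approve:
# transition("needs_review", "approved", "orchestrator")  -> PermissionError
```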
And then when it's approved, the orchestrator can move on to the next task in the task list. From my experience so far, and this is kind of arbitrary because, you know, sub-agents are quite new, I capped it at four agents running in parallel. Let's see, let me share my other screen real quick. Is my whole screen visible? Not yet, we've got tabs and stuff. Yeah, okay.

So this is just some stuff I was running earlier. This is actually the non-profit thing I showed you earlier, the plan. And this isn't running now, just to be clear, I got rate limited, unlucky. So as you can see, what it'll do is it'll basically hand over tasks to these reviewer agents after the executor agents run. We also have a release gate auditor. So every plan that we turn into tasks is in phases, right? So we have phase one, two, three, four, five, six, seven, whatever, and then after every phase it goes through a release gate auditor. You can think of it as a final-boss kind of review agent, which will again go through all the artifacts that were created and test everything. So you can see, well, this one actually didn't run too long, but most of the time these will take quite a while.

And when you do it this way, so you see all these agents were running in parallel for quite some time, the way that we keep it going is also quite important: it's through hooks. So we have a few hooks. We have one hook for when your root agent, so the agent you're talking to when you boot up Claude Code, stops: it will basically tell it, you cannot stop, you have to resume. And what it'll do is it forces the root agent to do exactly this. It has to run this get-tasks tool, then it'll basically query the database and it will figure out, okay, which tasks are approved, completed, merged, pending, or assigned. Then, if any tasks are approved, it'll hand them over to the integration manager. The integration manager, you can kind of consider that to be the agent that handles your repository, your commits, and whatnot. And then after that it goes to the release gate auditor. For any pending or assigned tasks that are ready to go, it'll run the ownership preflight, which basically checks: okay, let's say we have four executor agents who are assigned to their own tasks, is there overlap between the files that these four agents use? Because obviously, in this setup, we're not using separate git worktrees, right? So we have to make sure that all the agents are working independently on their own files. So it runs this tool, it sees, okay, all these files that we're assigning to these executor agents in parallel, they're fine, there's no conflict. Then it will batch-assign the tasks, maximum four, and once it does that, it will simply start these four agents in parallel.

And that's quite important. What you kind of have to do, and for some reason it takes quite a lot of repetition to actually get it to do this, is you have to really force it to spawn these sub-agents in one message. Because from my experience, what it will usually do is it'll say, yes, spawn these in parallel, and then it'll spawn one sub-agent, and then it's blocked, and only when that one sub-agent finishes can it spawn the next one. So it has to spawn all the sub-agents, all four in parallel, in one tool call basically, and then they can actually run in parallel. So that's that.
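Since the executors share one working tree rather than separate git worktrees, the ownership preflight amounts to a file-overlap check before batch-assigning up to four tasks; a minimal sketch, with invented data shapes:

```python
# A sketch of the ownership preflight: before batch-assigning up to four tasks,
# check that their file lists don't overlap, since the executors share one
# working tree instead of separate git worktrees. Data shapes are assumptions.
def ownership_preflight(tasks: list[dict]) -> list[dict]:
    """Return a batch (max 4) of tasks whose editable files never collide."""
    batch: list[dict] = []
    claimed: set[str] = set()
    for task in tasks:
        files = set(task["files"])
        if files & claimed:
            continue  # conflicts with an already-batched task; leave it for later
        batch.append(task)
        claimed |= files
        if len(batch) == 4:  # parallelism cap from the talk
            break
    return batch

pending = [
    {"id": 1, "files": ["app/invoices.py", "tests/test_invoices.py"]},
    {"id": 2, "files": ["app/reports.py"]},
    {"id": 3, "files": ["app/invoices.py"]},  # collides with task 1, deferred
]
print([t["id"] for t in ownership_preflight(pending)])  # [1, 2]
```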
And then the CLAUDE.md file, a lot of iteration went into this. This is basically just the whole workflow: all the agents, artifact-based completion, what agents are available to you, how you initialize when you start your session, the boot artifacts. So, to make sure that when you're running an app, or when you're coding an application, it actually works, we have to force these sub-agents to actually run it, try it, and then debug it, else it'll go for two hours and it won't work. And then there's a lot of just rules as to how the coordination has to take place. And this is very important: it always goes from needs review, auto-assigned to the reviewer; approved goes to validation; completed goes to the integration coordinator. These are all the sub-agents, just to be clear, as you can see on the left-hand side. And then on a no-go it basically resets it, and depending on the severity of the issue it will send it to the architect, if it's a fundamental issue with the architecture of the code, a structural problem, and the architect will basically alter the plan and then the cycle starts again.

And then we also have these dedicated testing agents and a data review agent. A data review agent, you can think of it like this: for example, we have one project where we're creating synthetic data, so the data review agent will basically sanity-check that the data coming out of your API calls to LLMs, when you want to create synthetic data, actually checks out with what you're trying to do. So this is basically just LLM-as-a-judge in the form of a Claude Code sub-agent.

So yeah, that's about it. From my experience so far, this can run for, I would say, three to five hours fully autonomously. The output, I would say, gets to about 80% there. I have to be super honest, I don't think we've had a single project where it just came out and everything worked. There's always some buttons that don't work or some tests that were reward hacked. We only implemented this business value truth gate today, so I haven't had a chance to really try it yet. I'm going through those motions now, and so far it's actually been really good, because of this whole value proof, right? It has to actually show proof that the tests executed, especially with live API keys or when it uses Playwright to run tests on your development server or whatever, and that those actually produced real outputs. That seems to have solved a lot of reward hacking, so I'm quite optimistic about that. And then we also added this validator agent, which is basically a final boss for reviewer agents. It will just try to build your Docker image, it will do smoke tests, it will basically just do all the end-to-end tests and everything to make sure that there was no reward hacking whatsoever when it comes to the test-driven development. That's honestly the main challenge. The code itself, from a syntax perspective and an architecture perspective, is okay; they're just quite prone to reward hacking.

And then this one: in your Claude Code settings, bash maintain project working directory, very important, else it's going to get lost in the sauce halfway through and just forget where it actually works from. Then after, I think, every three sub-agent stops, we re-inject the CLAUDE.md file into the context window.

So then, okay, maybe this last thing: the framework guide is just a project-agnostic playbook, you know, like a cookbook of how we want you to write code, how we want you to write your tests, what the target shape is, and what some of the rules are.
And because we force every sub-agent to read the framework guidelines and the plan, and they get very concrete instructions from the orchestrator agent, this means that your code stays clean and maintainable and you don't wind up with, you know, massive files and stuff that's all over the place.

Oh, and maybe one last thing, and this has proven super handy: especially when you want to develop your own MCP servers or your own agents, just take the time and scrape all the documentation. So I scraped all the documentation: OpenAI Agents SDK, OpenRouter, MCP, Firecrawl, FastMCP, and then also Anthropic's sub-agents documentation and MCP documentation. Because when, let's say for example, I'm developing an MCP server for Claude Code, the orchestrator will also instruct the executor agents that this documentation is available to them, because obviously it's not in the training data. So let's say they're working on some MCP server, then every executor will get, in the prompt that the orchestrator uses to invoke them, the path to this directory. We also have a nice index file here which basically tells the models what to grep, you can see: grep pattern, documentation, multi-agent, evaluations, MCP, FastMCP, Agents SDK. So it always tells it to read the index, and then it knows, okay shit, I'm running into some issues with getting this MCP server to work for Claude Code, and it knows it can find documentation for Claude Code with MCP servers exactly here, and that just saves a lot of time. So I would say, anytime we're working on some project that has some specific API which the models aren't super familiar with, just scrape the documentation and make it available to your sub-agents so that they can check it before they write the code.
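One plausible way to produce that index file, sketched under the assumption that the scraped docs live as markdown files in a local directory; the layout and the grep hints are illustrative:

```python
# A rough sketch of building the local-documentation index: walk the scraped
# docs directory and emit one index line per file with a grep hint, so agents
# read the index first and then grep only the relevant file. The directory
# layout and keyword extraction are assumptions.
import pathlib

DOCS_DIR = pathlib.Path("docs_scraped")   # e.g. openai-agents-sdk/, fastmcp/, ...
INDEX = DOCS_DIR / "INDEX.md"

def build_index() -> None:
    lines = ["# Local documentation index", ""]
    for doc in sorted(DOCS_DIR.rglob("*.md")):
        if doc == INDEX:
            continue
        first_heading = next(
            (line.lstrip("# ").strip()
             for line in doc.read_text().splitlines() if line.startswith("#")),
            doc.stem,
        )
        # One line per doc: what it covers and what to grep for.
        lines.append(f"- `{doc.relative_to(DOCS_DIR)}`: {first_heading} (grep: `{doc.stem}`)")
    INDEX.write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    build_index()
```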
Okay, I think that's awesome. We've accumulated a set of questions, if you're game to go through some of the questions people have. So I'm going to start back at some of the earlier ones. Tactically, Flo asked: is there a VS Code extension that's letting you play with the .db inside of the IDE, or is that a part of Cursor? Oh yeah, that is a VS Code extension, let's see, SQLite Viewer. Got it, perfect.

Next question from Kyle was: how are you framing the work when it gets handed to the reviewer agent? I meant, like, how are you framing who is producing the work and from what perspective they're reviewing it. Okay, wait, let me open, yeah. So personally with Claude Code, whenever I ask it to review its own stuff, it thinks it's great; if I ask it to review my stuff, it thinks it's great; if I ask it to review somebody else's stuff, even if it's the same stuff, yeah. So from my experience that hasn't been a huge problem. I think that's probably because we set out very strict compliance rules. There's the sub-agent markdown file, so you can see here: reject if, and there's a whole bunch of rules, common violations, ACDD testing, and then there's a protocol that has to be followed for verification of the tests, right? And from my experience, if you are very deliberate in establishing for the agent what is good and what is bad, you know, like violations, so if any of these protocols are violated it must set the decision to needs fixes, then I think for me that's kind of solved a lot of it. I think if you just ask, is it good or bad, most of the time it's going to say, like, yeah, it's great, it's perfect, nothing wrong with this. But if you tell it very specifically that it has to adhere to these thresholds or these requirements, then from my experience it does actually flag issues when they occur. I got you, so you're saying just give it a rubric to work with. Yes, exactly. Cool, thank you.

Next question was from Dirk. He says: I fail to see how the handover from executor to reviewer is firmly defined. Couldn't that part fail easily, and then you only have part of the flow complete? No. So let's say the orchestrator invokes a bunch of executor agents, then it will set the task to a certain status, right? So for example it can set it to needs fixes, but it can never move a task that's been completed by an executor to anything besides needs review. And then we also have a hook that will automatically assign any task that was completed by an executor to a review agent. And there's a machine state in the SQLite database, so the MCP server, that doesn't allow the orchestrator to ever change the status of a task that's been completed by an executor to anything other than needs review. Only when the status of the task is approved, which can only be done by the review agent through their MCP tool use, can it be moved to completed. So there's a machine-state guardrail that means the models simply can't reward themselves by skipping the review.

All right, let's see, next question is around the models that the sub-agents use. Do you dictate which models they're using, or are they just kind of going to default? So, depending on the complexity: for the executor I use Sonnet, since that makes up the bulk of the token usage, but everything else is Opus. So just the thing that writes the code is Sonnet, but the reviewer, the architect, the validator, et cetera, is all Opus.

All right, let's see, next question I saw was: have you considered breaking the validation into layers approached by different agents? Yes, and I mean we kind of do that, it depends a bit. So we also have a Playwright debugger. For some things, especially runtime issues, the orchestrator can choose to hand off to a Playwright debugger, and this agent is then, well, you know, specialized is a big word, obviously it's just a prompt, but focused on debugging issues using the Playwright MCP. And for our frontend we also have a frontend agent stack, kind of the same principle. It's a bit different because the frontend, it's very hard, at least, you know, I'm not good at frontend, so maybe I should start with that, definitely not frontend design, but it's been very hard to just use design tokens and stuff to keep everything looking good. So we kind of went with this more esoteric approach: we have this visual anthropologist which basically takes screenshots and then just ingests the screenshots of the frontend to make design decisions. So yes, sorry, that was a bit of a tangent, but yes, we do use specialized agents, or sub-agents rather, for review if it's necessary, and usually that will be the data reviewer if it's specifically for stuff that's business-logic related, like synthetic data that we generate. And we have a frontend architect, and then we have a Playwright debugger, and that's kind of all we have. But maybe down the road it would make sense to have more, I don't know.
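The data reviewer described here is essentially LLM-as-a-judge packaged as a sub-agent; a minimal standalone sketch of the same idea using the OpenAI client, with the model name and rubric chosen purely for illustration:

```python
# LLM-as-a-judge for synthetic data: ask a judge model whether a generated
# record matches the business spec. Model name and rubric are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def review_synthetic_record(record: dict, spec: str) -> dict:
    """Return a pass/fail verdict plus a reason for one generated record."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # any strong model works; placeholder choice
        messages=[
            {"role": "system", "content": (
                "You are a strict data reviewer. Reply in JSON with keys "
                "'verdict' ('pass' or 'fail') and 'reason'."
            )},
            {"role": "user", "content": f"Spec:\n{spec}\n\nRecord:\n{json.dumps(record)}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# verdict = review_synthetic_record({"amount": -5}, "Amounts must be positive euros.")
```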
Awesome. Scanning through, I guess the last question I see is about the local documentation, which a lot of people are chiming in and agreeing on that approach, and they love it. Do you do any sort of indexing of that, or how do you manage which documentation is relevant when, for what context? Yeah, so we have the index file, right? So for documentation we have the index directory, sorry, the index file, and the orchestrator, I believe, also has an overview. Yeah, so we have documentation guidance stuff where we kind of tell the orchestrator which documentation is available to it, and then based on that it will read the index file and inform the sub-agents, in the prompt when they're invoked, what documentation they should use.

Have you considered an archivist role, specifically for context and documentation management, who can take that off of the orchestrator's context window? Yeah, I mean, that would make a lot of sense. I think that what I have here is like version 0.01, which is probably going to get washed away by model improvements in a few months, but I'm 100% sure there are a thousand ways that you can do what I'm doing better, by having more granular task management, more granular agents, and a better system for handoffs. 100%.

Are you going to open source this? Oh yeah, I think so, yeah. Cool, thanks. I built my own version of this, but yours is much bigger, I think you've put more time into it than I did, and I like what you did. Yeah, I hope it works, I hope it doesn't just work on my machine. It should be fine.

Awesome, well, this was super interesting, lots of engagement in the comments. Appreciate you taking the time, I know it's late, you're sick, we've held you long past when we thought we would, so thank you, Olivier, really appreciate it, and hopefully we will hear from you again in the future. Great, thank you very much, have a good day, see you.

All right, awesome, we have a few minutes left if anybody wants to hop up and discuss, debrief, share insights, things you took away from that, but we can also wrap a few minutes early, whatever y'all prefer. I think the local documentation thing and keeping that up to date was super cool, like the discussion around that seemed interesting, and I want to make sure I post all the links that were in that conversation in the Discord, but that was kind of a huge takeaway for me as someone who's trying to keep up with documentation. It was just an interesting discussion, but I think the whole thing was actually pretty interesting. I think there is something to be said, because a huge part of those 400 messages that led to this call was that there can be over-orchestration, which I think there's probably some element of going on here, but overall, just to see somebody's workflow is always interesting, every time, and I'll watch it every time, so yeah, those are my thoughts.

On that note, unfortunately for those who haven't read the thread, what I see with regard to the over-orchestration is something kind of like what you were touching on there, where it's like: this system that I'm building right now is probably not going to be relevant in, you know, one or two model generations. So to me it seems like there's value in figuring out what the minimum version of this system is, because you're going to need to redo it from scratch two generations from now, so what you want to be able to do then is the minimum amount of work to get back to the kind of functionality that you're using here effectively. The alternative angle is that you just make it good enough that it can rebuild itself for every new generation.
Yeah, well, exactly. You could take either approach: you could have a fairly sophisticated system that's built for a particular model and that's spitting out littler systems, or you could have these six or seven fairly straightforward approaches that will generally give me all the chassis I need to get this thing going. Basically establish the primitives. Yeah, more or less.

I think another takeaway from watching him is that using a database for orchestration is really cool too. I think that's something, I'm becoming more database-pilled by the day. Just a quick tip: when you are using databases, always have one test environment and one production environment, and name them as such, so the LLM doesn't delete your production database. I'm a big fan of Neo4j myself. Yeah, I was going to say, the most recent database thing that I've been interested in lately is called Dolt, where basically what it is, it's like Git and SQLite squished together, so you can fork your database or have different branches or whatever. So you'd have one prod branch and one test branch, and they wouldn't be able to touch it. Thanks for pointing that out to me. Yeah, a really good prompt is literally just: use SQLite or use Dolt, and then put your tasks in it, and then it will create a schema and it will use it, which is really nice.

Yeah, which is kind of where I see the difference between your approach and Olivier's, where he's got a fairly sophisticated system that took a while to build up, whereas a lot of what I see you do is that the applications are more where that structure goes, and the thing that you're doing is kind of trying to shoot for the minimal sort of approach that's going to work: how can I best leverage the model, and what the model is, to make it just do stuff, as opposed to, I need it to do these things, so how do I get it to do that, or what artifacts do I need to make and put in the context window to make it do that kind of approach.

I do have a question for everyone here, just because I don't have a lot of experience with databases, and I'm checking out Dolt as we speak, but does anyone else, Marcus if you want to expand, does anyone else have any tips for dealing with databases when it comes to agent orchestration, in this sense? I would personally say, sorry, I think it was Kyle or Marcus speaking, if I'm not mistaken; Kyle, if you want to go. Okay, what I was going to say is just make sure you use something that's represented in the training data. Because you can try to get the latest and greatest of a certain X, Y, or Z thing and it might make sense, but if it doesn't operate on an existing schema that's in the training data, then you're going to have a hard time getting the LLM to recognize the commands it needs to use to interact with it effectively. Yeah, an MCP server obviates some of that, but even an MCP server is just tools built to give the LLM handles, and I have personally found that just making sure that the schema is in the model's training data makes a massive difference to the consistency of your LLM's ability to reliably interact with a database as a tool call. Yeah, that's interesting. And Marcus, I don't know if that was you saying something about MCP or someone else. Yeah, I would start with PostgreSQL and with an MCP. Nice, yeah, I'll look into that.
And then it seems like Dirk in the comments is saying he's also a fan of Supabase, so you can have your database in the cloud, yeah, which often just makes things easier if you go from one machine to another or you're just collaborating. Sorry, someone who spoke before you, you mentioned something about the MCP server and it not working, I was just curious what your point was again. Well, if you're talking about me, it wasn't necessarily that MCP servers wouldn't work. Regardless of how you're interacting with the database, whether it's an MCP server or a custom tool call for the agent, you'll want to make sure that you're using a database that operates on a schema that is popular, existing, and has been around for a while, for best results. Because I was going between FalkorDB and Neo4j personally, and I really liked what FalkorDB had to offer in terms of features and functionality, but I personally found that Neo4j was just much better represented in the training data and was just a much more accessible system to the LLMs themselves. And so for that reason I've just had a lot better luck since moving to Neo4j for that project, and my error rate has dropped substantially. And errors compound over operations and over the length of time you're using the system, so even a 1 or 2% difference is huge over the lifetime of a project, and I'm talking differences of 20%.

One more question here on Falkor: do you query both of those databases in the same way, but you have different schemas inside them, is that kind of what you're getting at? I forget exactly off the top of my head what the schema pattern of FalkorDB was, I don't remember exactly what it was, but the models could not reliably query a FalkorDB database. My thought there is it needs to be represented in the training data, in that it needs to have a lot of examples in the training data of ways to accurately query that database. Yes, yes, yes. And so as long as it's in Cypher or SQL, I think you're probably good; if it has some weird proprietary language, you're going to be in trouble. Yes, that's a great way to put it: if it works with SQL or Cypher, then you're set.

Top of the hour, gentlemen, if y'all want to wrap. Yeah, I was just going to say thank you all, great conversation, appreciate everybody who participated. We do not have a speaker for next week yet. If you are inspired by this, if you want to go and tinker with something you learned today and come back and report, whatever it might be, come speak with us, run the session next week. You can sign up by just at-mentioning the AI in Action bot in our channel, get it all autonomous, you don't have to talk to any of us human folks. Kyle, Neo4j, if you want to bring a conversation about that, it doesn't have to be next week. But yeah, look, you all are what makes this continue to work, so bring the topics, bring the stuff, it doesn't have to be polished, it just has to be interesting. And thank you again, we'll see you next week. Strong recommend for the Tool Use podcast, I'll post a link in the master post. Thank you.