
Agentic Coding Ecosystem 2025


Transcript

Hello, I'm Armin, and recently, more often than not, I'm sharing agentic coding and general LLM stuff on this channel. Since May there has been an explosion of different kinds of agentic coding tools, and I wanted to give you a little bit of an update on which tools exist and what I've learned about them in general over the last couple of weeks in particular.

So if you're on this channel you probably already know who I am. I've created a lot of open source software over the years, mostly in the Python ecosystem, and as of recently I'm kind of obsessed with agentic coding tools and trying my best to use a lot of them to help me build my new startup.

One of the ways I think we should look at the current point in time is that there has been an explosion of interest, both from investors and from users and programmers in general, in this broad topic of using agentic tool loops to help us build software. This has primarily resulted in tools that help us write more code, though the same pattern is also being applied a little bit to non-coding agents. Most of what's currently making the rounds on the internet, though, is a combination of a very good foundation model with some agentic tooling.

A lot of them are moving, for one reason or another, into the command line, and it has been very hard to keep track of everything, because every single day a new tool pops up, there are different models you can use with them, some tools support multiple models, and so forth.

I just want to give my best understanding of what's going on right now. And I also apologize for calling them agents.

There's a lot of discussion about whether this is an appropriate name or not. It's probably not, but it is what it is; this is what we're seemingly calling these things now. So when I use "agent" here, it means an agentic coding tool in one form or another.

Just to level-set here: roughly, we're looking at five different types of tools. Some of them are IDE extensions; GitHub Copilot and Cursor are probably the most well known ones. They both do autocomplete, but they can also do basic agentic coding loops, and Cursor can actually do very capable coding loops at this point.

In part that's also just because of where things are right now. Cursor has recently moved into the CLI as well, and they are also providing standalone agents; Cursor, for instance, can run on a remote server, too.

They have these background agents. Then there are tools like Devin, whose maker also, I believe, bought Windsurf recently. So these are all different kinds of tools that exist. And then we have these vibe coding tools that use very much the same kind of technology behind the scenes.

So they're running, generating code, running in a tool loop, but they are often very targeted towards user interface creation. And they behave quite a bit differently from what you can do with a tool like Claude Code or similar, which really runs on your machine and can do pretty much everything that you give it access to.

When I recorded my last video here, there was really just a handful of tools capable of competing with Claude Code, which was one of the first, if not the first, to get mass adoption. But by now, honestly, there are so many of them that I don't even want to enumerate them.

But I think there are more than 30 different command line tools at this point that have some version of an agentic coding loop. And I don't have the time to look at all of those tools; I really don't have the time for that.

And I don't honestly think anyone has time for that. But there are some reasons why you might want to look at more than one tool. For me, one of them is that there has been a discussion recently about pricing, which I will go into later: there is a worry that we are being heavily subsidized by venture capital here and that the cost of all of this will go up.

So people have started looking into whether you can use open models to do the same thing. And for one reason or another, there are limitations in some of the tools that we're currently using, and it's very helpful to understand whether those limitations are related to the coding agent or to the model, and what this might mean in the future.

I don't want to talk too much about which kinds of workloads work really well with agentic coding tools today. But I do want to mention that one of the main unlocks I see in agentic coding tools today is the creation of software that can be perpetual prototypes.

Some very good examples of this are internal tools. You've probably been in a situation multiple times in your career where you wished you had a slightly better debugging tool, but you also didn't have the time at that very moment to create it. This is actually one of the ways in which I really enjoy using these tools: I'm working on something, but I can also have the agent on the side give me a better way to visualize my data or a better way to debug a particular problem.

And it can write a tool in a way that I can maybe use a couple of times, which makes the whole flow generally more enjoyable. To create these tools, you're in a situation where you want to be able to see what the agent does and to interrupt and steer it.

So it's a bit of a combination of hands-off and hands-on, which is very different from tools that are fully detached; Devin is a good example here, and I think a lot of the tools that sit as GitHub extensions on the pull request are a little bit like this too. With those it's much harder to understand whether a change you're making to a prompt or a tool setup or something like that has a positive or negative effect.

So these command line based or IDE integrated tools make it easier to do rather quick experimentation, but it is still very hard, and I'll share a little bit later why I think it's so hard to evaluate these. Last time I talked about this, I think I made the assumption that there's a bit more of an understanding out there of how these tools work.

What I've actually discovered over the last couple of weeks is that there is a lack of understanding of how they actually work. And that makes some sense, because on one hand it seems very simple: it's an LLM that runs inference and then calls a bunch of tools.

But if you go into the nitty gritty details of how these agents work, there are some really rather important aspects to it that can fundamentally change how they behave. And this also directly translates to their capabilities and what they cost and how long they run. So the very basic behavior is obvious, right?

There's an agentic coding tool which has a system prompt that tells it something. It takes a user prompt and then runs in a loop with a bunch of tools until the agent thinks it's done. But how all of this fits together, how they are actually capable of doing that, and why we're actually seeing progress in these tools is, I think, rather subtle.
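
To make the shape of that loop concrete, here is a minimal sketch in Python. The `call_model` and `run_tool` callables are hypothetical placeholders rather than any vendor's actual API; the point is just the structure: prompt in, tool calls out, results fed back in, repeat until the model stops asking for tools.

```python
from typing import Callable


def agent_loop(
    call_model: Callable[[list], dict],    # hypothetical: one inference step over the message history
    run_tool: Callable[[str, dict], str],  # hypothetical: executes a named tool, returns output as text
    system_prompt: str,
    user_prompt: str,
    max_turns: int = 50,
) -> str:
    """Prompt in, tool calls out, results fed back, repeat until no tool is requested."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    for _ in range(max_turns):
        reply = call_model(messages)                      # one inference step
        messages.append({"role": "assistant", **reply})
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:                                # no tool requested: the agent thinks it's done
            return reply.get("content", "")
        for call in tool_calls:                           # execute each requested tool...
            output = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool",              # ...and feed the result back for the next turn
                             "name": call["name"],
                             "content": output})
    raise RuntimeError("agent did not finish within max_turns")
```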

So the most important tool that an agent needs is the ability to read a file. What does that mean? It basically means that when the agent is interested in seeing something, it reads the file and pulls it into its context. And in order to understand which files it might want to read, it needs to find that code.

And so, for instance, Claude Code and a bunch of other tools use tools like grep, usually ripgrep, to search for what they're looking for, find the file names, and then pull those files into the context. It's the combination of grep and read-file that brings files into context.

Once it has determined that it wants to make some changes, it needs another tool to change the files on disk. And then, for the agentic coding loop to work best, it needs to be able to execute commands. This is particularly important for questions like: does this even compile?

In many ways the most useful ability for an agent is just to understand whether the linters pass, whether the compiler can compile the code, and whether it can run the program that was just changed. For some things, it is also capable of doing web searches or fetching remote files, which are also useful.
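
To give a rough idea of how small these core tools really are, here is what client-side versions of them might look like in Python. This is not how any particular agent implements them, the names are made up, and real agents add permission checks, output truncation, and smarter edits, but the read / search / edit / execute quartet is essentially this:

```python
import subprocess
from pathlib import Path

MAX_OUTPUT = 20_000  # crude guard so huge outputs don't blow up the context window


def read_file(path: str) -> str:
    """Bring a file into the model's context."""
    return Path(path).read_text(errors="replace")[:MAX_OUTPUT]


def grep_files(pattern: str, directory: str = ".") -> str:
    """Find candidate files and lines; real agents typically shell out to ripgrep (assumed installed)."""
    proc = subprocess.run(["rg", "--line-number", pattern, directory],
                          capture_output=True, text=True)
    return proc.stdout[:MAX_OUTPUT] or "(no matches)"


def edit_file(path: str, old: str, new: str) -> str:
    """Naive string-replacement edit; real tools usually apply more structured edits or diffs."""
    text = Path(path).read_text()
    if old not in text:
        return "error: text to replace not found"
    Path(path).write_text(text.replace(old, new, 1))
    return "ok"


def run_command(cmd: str, timeout: int = 120) -> str:
    """Let the agent compile, lint, and test -- this is its main feedback signal."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return (proc.stdout + proc.stderr)[:MAX_OUTPUT]
```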

With these tools, there are some rather subtle aspects that influence quality, and I will go into that. But what you should keep in mind is that, as the author of an agentic coding tool, you don't just wake up in the morning and decide to add a new tool to it.

In a way you can, because that's what MCP does, but the best performing tools have been trained into the LLM in one form or another. So the foundation model that you're using has, to some degree, an understanding of these tools. And the consequence of this built-in understanding of tooling is that there is some sort of binding between a particular coding agent, or the tools used by an agentic coding tool, and the model it's using.

There are different kinds of models that you can use: Sonnet, for instance, or GPT-5. And some of these models have been trained not just with the knowledge that a tool exists, but also, seemingly, with a whole bunch of examples of what it means. Anthropic actually does a pretty good job of documenting most of these tools in their documentation.

If you go to the API documentation for the Sonnet family of models or the Opus family of models, they document these tools. There is, for instance, a bash tool, so the model understands how to run bash commands, and through bash it also understands how to run Python.

There's a code execution tool and a computer use tool, neither of which, I believe, is actually being used by Claude Code; I think they might be used by Claude Desktop, although I'm not sure. There's a text editor tool that has been specifically trained for manipulating text, a web search tool, and so forth.

But if you use the GPT-OSS models, for instance, you will rather quickly figure out that they love to call a tool called search, even if one hasn't been supplied in the context, particularly the smaller model. The interesting part of this is that Codex doesn't have a search tool either, even though Codex is written by OpenAI, who also shipped GPT-OSS.

The answer, seemingly, is that this is a tool that's meant to be executed by the inference platform and not by the coding agent. This is probably similar to the web search tool, although again I'm not really familiar with how it's supposed to be used, but these tools are effectively not ones that the coding agent provides.

Normally, if you run a tool, you basically tell the model: hey, here are the tools that you can use. The LLM will reply with a JSON blob or an XML blob or something similar saying: hey, I want to use this tool. Then you, in the agentic coding loop, execute the tool and provide the results back, and it makes another turn.

With the server-side or inference-side tools, what actually happens is that you're not supposed to call these tools yourself, because they will be called automatically behind the scenes. The results of those tool invocations are immediately fed back into the prompt, and then it takes another turn.
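
To make that difference concrete, here is roughly what one client-side tool round trip looks like. The field names below are illustrative, not any provider's actual schema:

```python
# Illustrative message shapes only; real providers use their own field names and schemas.

# 1. Instead of plain text, the model replies with a structured tool request:
assistant_turn = {
    "role": "assistant",
    "tool_calls": [
        {"id": "call_1", "name": "read_file", "arguments": {"path": "src/app.py"}},
    ],
}

# 2. The agent (not the inference provider) executes the tool locally and
#    appends the result, then asks the model for another turn:
tool_result_turn = {
    "role": "tool",
    "tool_call_id": "call_1",
    "content": "def main():\n    ...",
}

# With inference-side tools (like GPT-OSS's built-in search), steps 1 and 2 happen
# inside the provider's stack: the result is injected into the prompt before the
# next turn, and the coding agent never gets to run the tool itself.
```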

The primary result of this is that if your agent does not match its tools to those models, the experience you're going to get from that agent is worse. I really don't want to name any agentic coding tools today, but coding tools that let you swap out the model, and that are open-ended about which model you can use, are not necessarily going to match their prompts, tool naming, or tool selection exactly to those models.

You can sort of see this with the GPT-5 release. There were a bunch of day-one tools that had had access to GPT-5 for two weeks or more, and at least on the first two days when I tried them, they outperformed tools which only added support for GPT-5 on launch day.

Presumably because they got a little bit of a heads up of how they should match up prompts and tool usage. It's not just the tools that you provide that can have a significant impact on the agent's performance. In some cases, there are pre-flights and post-flights to each agentic coding loop.

I will go with the example of Claude Code here. There's a little spinning indicator to the left of your prompt, which makes up a fancy, cute little description, and that comes from also running the request through Haiku, which is one of the fast models.

For any tool usage emitted by the main model, which is usually Sonnet or Opus, they will also run it through Haiku as a kind of LLM-as-a-judge to figure out whether it's safe to run that tool or not. Which, for instance, means it's much less likely that Claude Code will delete the wrong file or work outside of its working directory, or at least outside of your project's directory, than an agent that skips that step.

It is very possible for an agent to emit commands which are highly inappropriate to run, and having a second LLM do a sanity check will catch some of those cases. Not all of them, but some of them. And so, for instance, I know for a fact that not every single agentic coding tool that exists today is safe to run in YOLO mode, which is basically full permissions. In fact, quite a few of them do not protect you to the same degree that some others do.
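
The pattern itself is easy to sketch. The following is not Claude Code's actual implementation, the prompt and the wiring are made up, but it shows the idea of routing every proposed shell command through a small, fast model before executing it:

```python
import subprocess

# Made-up judge prompt; this illustrates the pattern, not Claude Code's actual implementation.
JUDGE_PROMPT = """You review shell commands that an autonomous coding agent wants to run.
Reply with exactly SAFE or UNSAFE. Treat deletions outside the project directory,
credential access, network exfiltration, and other irreversible operations as UNSAFE."""


def guarded_run(command: str, project_dir: str, ask_fast_model) -> str:
    """Vet a command with a small, fast model before executing it.

    `ask_fast_model` is a hypothetical callable wrapping whatever cheap model
    plays the judge (in the Claude Code example above, that role falls to Haiku).
    """
    verdict = ask_fast_model(JUDGE_PROMPT, f"cwd={project_dir}\ncommand={command}")
    if "UNSAFE" in verdict.upper():
        return "refused: the judge flagged this command as unsafe"
    proc = subprocess.run(command, shell=True, cwd=project_dir,
                          capture_output=True, text=True, timeout=120)
    return proc.stdout + proc.stderr
```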

There's also a second problem, which is that once you give the agent the ability to run code, you also have to figure out what to do if that code doesn't run properly.

And there are definitely big quality differences in the tools' ability to run executables. For instance, some tools get stuck and never recover. I ran into this yesterday. I will not name the agents I was using, but I ran an agent last night before I went to bed.

Actually, I ran three agents with one prompt overnight. Same prompt, same model, just to see how they use it differently. Two out of the three agents got stuck and never recovered, and they got stuck in different places. One of them got stuck because it ran a command that brought up an interactive prompt, and it never figured that out and never killed it.

The other one got stuck because it seemingly made an HTTP request that hung somehow, and it didn't recover from that. I could see, or at least I assume, that this is what was happening, because there was no tool usage going on at the time.

It was effectively mid-inference, the token counter did not increase anymore, and it was just stuck there, either because of the response it got back from the server or because something let it get stuck in a loop locally.
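
Most of these hangs come down to two mundane things: a command that waits for interactive input, and a network call with no deadline. A battle-tested tool runner guards against both; here is a rough sketch of the kind of guard I mean, not any specific agent's code:

```python
import os
import subprocess


def run_noninteractive(cmd: str, timeout_s: int = 300) -> str:
    """Run a command so it can never sit waiting for keyboard input or hang forever."""
    env = os.environ.copy()
    env.update({"GIT_TERMINAL_PROMPT": "0", "CI": "1"})   # common "please don't prompt" knobs
    try:
        proc = subprocess.run(
            cmd,
            shell=True,
            stdin=subprocess.DEVNULL,       # anything that asks for keyboard input fails fast
            capture_output=True,
            text=True,
            timeout=timeout_s,              # hung network requests get killed eventually
            env=env,
        )
        return proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return f"error: command timed out after {timeout_s}s"
```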

But my point here is that there are some tools which have already been battle-tested for autonomous use over extended periods of time, and some which haven't, even though they might be using the exact same underlying model. This puts us in a really unfortunate situation where, even with the same model, it is very hard to evaluate the quality of an agent. And seemingly, to me at least, it has become harder.

If you look at SWE-bench, for instance, or things like that, you will get a bunch of numbers, but they don't necessarily tell you everything. There are too many different workloads that we're throwing at these things. And there also seems to be some form of benchmark manipulation going on; I don't think it's necessarily nefarious.

It might just be accidental benchmark optimization: you run against SWE-bench, see how you're doing, and if those tests don't change all that much, you end up optimizing towards them. But, for instance, when GPT-5 launched, it looked very appealing, and it probably is very appealing, because it is significantly cheaper than the Anthropic models.

The problem today is that if I compare their official CLI, or Cursor, with Claude Code, I notice that it uses many more tokens and many more turns. Some of the savings you get from the cheaper per-token cost are negated by less efficient agentic loops. And even token counts and turns don't necessarily tell you how long it's going to take,

because there are also other things that influence how quickly you get to a good result. There is actually some benefit to one-shotting a good result, because you're less likely to go into iteration, which might be slow. On the other hand, if you make iterative improvements and you also run the right tools afterwards every time, that can sometimes result in better code quality overall.
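
A back-of-the-envelope calculation shows why per-token price alone is misleading. The numbers here are purely illustrative, not real list prices or measured token counts: a model that charges half as much per token but needs more tokens per turn and more turns can end up costing more per task.

```python
# Purely illustrative numbers, not real list prices or measured token counts.

def task_cost(price_per_mtok: float, tokens_per_turn: int, turns: int) -> float:
    """Rough cost of one agentic task: price times total tokens across all turns."""
    return price_per_mtok * tokens_per_turn * turns / 1_000_000


pricier_but_lean = task_cost(price_per_mtok=15.0, tokens_per_turn=20_000, turns=12)
cheaper_but_chatty = task_cost(price_per_mtok=7.5, tokens_per_turn=30_000, turns=18)

print(pricier_but_lean)    # 3.6
print(cheaper_but_chatty)  # 4.05 -- half the per-token price, more expensive per task
```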

I think that Sonnet loves to write wrong code, but because it is also rather eager to run the tools, it overall gets to better looking code quicker than you would currently achieve the same result with some versions of GPT-5. And some agents are at least functionally capable of parallelization, even though that is not always possible.

But there are so many different ways in which these models and these agents can behave differently that it's very, very hard to evaluate. And I just generally want to caution everybody about a lot of the takes you get on Twitter and elsewhere, because it is really, really hard, I think at least, to get a good sense of what works and what doesn't.

I've also noticed that in many ways what prevents me from using a tool today is not necessarily the quality of the code it writes or the quality of the inference, but really annoying little things. For instance, there's one agent that I think I would actually like quite a lot, but I don't enjoy its user interface.

So I'm less likely to use it; another person might have a different opinion on it. But there are so many different things that influence how you interact with a tool, and that can also compromise the results you're going to get. So it is very hard to evaluate.

I think a lot of people are attached to one of the tools they're using, particularly once they start paying for it. And I do think it's important for us to get better at evaluating this, in particular because I think it's unsustainable to have this many tools and we'll have to consolidate down.

And the way we're evaluating is sometimes just kind of wrong. In the last two weeks in particular, there was a lot of discussion about who has the nicest looking terminal UI or the best performing terminal UI. I also don't like how the terminal renderer of most of those tools jitters around and is annoying.

At the same time, that's not the most pressing issue I have with an agentic coding tool, and I also think that long term, the terminal UI is not where we want to end up. There are many ways to evaluate these, but what we should really be evaluating is how good they are, day to day, at solving the problems we're throwing at them.

And I don't think we're doing a particularly good job right now; at the very least, I find Twitter to be a horrible source of high-signal information on the topic. Evaluating models, I think, is even harder, because there are so many trade-offs. At the moment, there are basically a handful of models capable of doing this.

The most significant ones are, of course, the Anthropic models, Opus and Sonnet, and GPT-5. Actually, Gemini is also in there; I kind of forgot about it, but I think GPT-5 and Gemini are both fantastic models for code generation. Gemini didn't perform particularly well at tool usage, though some of that can be compensated for by the tool.

Gemini CLI, for instance, got quite a bit better recently just by recovering better from bad tool usage. With GPT-5 I think it's a little too early to say, but from the initial feedback of people who already had it for two weeks or so, it looks pretty capable.

And then there's a growing number of open-weights models that are at least capable of tool calls, and I think they are very interesting for all kinds of reasons. Qwen3-Coder, for instance: I don't know if it has been trained on any particular tools, but it is seemingly at least quite a capable tool caller.

I think that's very interesting. Kimi also seems to be a capable tool caller, and there's the GLM model, which I haven't tried yet. Many of these open-weights models, though, come in very different sizes. If you go to OpenRouter, you can pick different ones, and if you run them locally, they can run through different systems.

There's sometimes a pretty significant difference in behavior between LM Studio and Ollama, for example. On OpenRouter, it's almost a slot machine whether a model comes as advertised or not. I forget which model it was, it might have been GPT-OSS, where people reported wildly different measured tool-calling reliability between the different providers on OpenRouter.

It also looked like the initial source release of Harmony, the new token format, or whatever you want to call it, had some bugs that people worked around in different ways. All of this makes it just so much harder to evaluate. And I want to reiterate: I think it's very hard to estimate costs right now, because a cheaper token doesn't mean you end up paying less than with another model.

Also, I don't think I wrote this down here, but another aspect of estimating this: there was a lot of excitement in the first couple of days about Cerebras and Groq being dramatically faster for these agents. The challenge is that it doesn't seem to be a consistent experience.

And in some cases just being faster doesn't help you if you're also worse, and inference speed isn't necessarily the most expensive piece here anyway. For instance, pure throughput is higher on some hosting platforms, but time to first token is also higher, which in many cases cancels out a lot of the gain.

So it is really, really hard to evaluate; I just want to point this out once again. I have no idea how to do this better yet, but many of the benchmarks are kind of misleading, and I think we just need to share more of the practical results and experiences that we have.

This is basically just why everything is so crazy right now. It's very clear that there is something really cool here; there's a lot of potential value in these models, particularly when the agent and tool combination works out really well. And almost everybody can use one of these agents to build another agent.

So there are only going to be more of these for a while, just because it's so much fun to build them, too. And it's kind of shocking how many people are raising money right now for yet more coding agents. I don't understand the motivation here, but it doesn't stop; there are so many more.

And because we haven't found a good way to evaluate them, I think it's also very hard to judge the quality of any one agent. I'm really only evaluating the quality of one agent in particular: I still use Claude Code most of the time because I've gotten used to it, and I have one eval where I run the same task on my code base at regular intervals just to see that it doesn't get worse.

I basically want to make sure that my code base stays in a state where the agent doesn't get meaningfully worse. But even that evaluation I can only run against Claude Code, just because of some of the behaviors it has. I currently don't have it set up in a way where I could use OpenCode, Crush, or Gemini CLI.
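
For what it's worth, the shape of that eval is not complicated. Here is a rough sketch of the idea, not my exact setup: it assumes Claude Code's non-interactive print mode and a project where the test suite is the ground truth, and the task prompt and repository URL are placeholders.

```python
import subprocess
import tempfile

# Hypothetical fixed task; the real one is specific to my code base.
TASK = "Add an endpoint that returns the current schema version, with a test."


def run_eval(repo_url: str) -> bool:
    """Run the same task against a throwaway checkout and check that the tests still pass."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)
        # Assumes Claude Code's non-interactive print mode; permission flags and
        # output capture are omitted, and other agents need their own invocation.
        subprocess.run(["claude", "-p", TASK], cwd=workdir, timeout=1800)
        tests = subprocess.run(["make", "test"], cwd=workdir)
        return tests.returncode == 0


if __name__ == "__main__":
    print("pass" if run_eval("git@example.com:me/myproject.git") else "fail")
```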

This is very specific to how I'm evaluating Claude Code, and it's just for me to check that my code base does not get worse. The last thing here is pricing. There was an interview with Dario recently, a podcast somewhere, where he mentioned that they underestimated the degree to which people would use the Claude Max subscription.

I know a handful of people who have gotten a lot of value out of it, thousands of dollars' worth of value out of a $200 subscription. So there is now some pressure towards stricter rate limits. But if you look into how you could get tokens cheaper, you will discover that you can't match those token prices through self-hosting.

There are H100s, H200s, B200s, whatever they're called, available on the internet. You can rent them for a couple of hours or for longer. But if you want to end up with a self-hosted open-weights model, it is going to be more expensive than buying those tokens on OpenRouter.

That's the situation we find ourselves in today. And the experience is also not necessarily going to work all that well: even some of the open-weights models require a lot of hacks for you to self-host them reliably. That might also explain why some of the providers you can find on OpenRouter for the same model have worse performance.

There's a lot of stuff in the stack. For instance, when GPT-OSS launched, there were a bunch of bugs filed against OpenCode and other tools saying that the responses coming from GPT-OSS were wrong. But it was actually a bug, or limitation, in the underlying Ollama and LM Studio where that issue appeared.

And so fixing that problem in OpenCode would not be correct, because the transformation should happen one layer down. But at the same time, I have already seen some fixes landing in some of the coding tools specifically for that. So it's kind of tricky to say where this goes.

I think right now there's only limited motivation for people to get these open models running. But some people are trying, right? And maybe as the cost goes down, there will be more people trying it and better tooling around it. You might actually get to the point where maybe ten people can share two H200s and get a decent experience out of it.

But we're not there today. And even more so with locally running models: I think you're just going to have a really bad time with them. I try regularly to see how far they get. GPT-OSS and a bunch of others in their smaller variants are definitely capable of running, but they give up early.

They sometimes call the wrong tools, and I haven't found a good agentic coding tool that is really tweaked for them. So it's just too early right now; unless you're really motivated to try this, there's not much of a point. There is visible price pressure per token, though.

GPT-5 launched at a lower token price than the Anthropic models. But overall, we are using many more tokens. The problems we throw at these agents are still largely constrained by how much we can fit into the context, and so for quite a while longer, the number of tokens we throw at them is going to go up.

As a result, I generally expect that the cost will go up. And that's all I have at the moment. I have absolutely no recommendations about which agents you should be using; I just wanted to share a little bit of what I've learned. And I encourage everybody to share their experiences, their genuine experiences, not their quick takes on Twitter.

Write blog posts, long-form content, share what you really did with these things. That is so much more valuable than any quick tweet with a mind-blown emoji or something like that. We don't need more of that; we have enough of it already. What we need is genuine experiences with these tools, and some consolidation down to fewer of them.

That is just too much. Yeah. So thank you for listening. I hope this was useful in one form or another.