I'm thrilled to announce our next speaker from Replit. You all are probably familiar with Replit. They've made programming accessible to anyone. They've revolutionized how we write, deploy, and collaborate on code, empowering a community of over 30 million developers to build more efficiently than ever before. And so I'd like to welcome my friend Michele, president at Replit, to the stage for our fireside chat.
Welcome, Michele. Thanks for coming. Thanks for being here. MICHELE: Thanks for inviting me. Excited about it. So I think most people are probably familiar with Replit and what they do. You guys launched V2 of Replit Agent six months ago? MICHELE: Oh, I wish. Two months ago.
Two months ago. OK. MICHELE: Early access end of February, GA late March. And I've heard nothing but fantastic things about it. And so if people haven't tried out Replit Agent in the last two months, what is new? What is different?
Why should they try it out? MICHELE: I think the shortest possible summary is autonomy, the level of autonomy it showcases compared to V1. If you tried V1 starting from September last year, you'll recall that it was working autonomously for a couple of minutes at most. And right now it's not uncommon to see it running for 10 or 15 minutes.
And what I mean by running is not spinning its wheels, but rather doing useful work and accomplishing what the user wants. It took a lot of re-architecting, and also the new models that are coming out helped. And things we learned, to be honest; shipping things in production teaches you a lot.
I think we learned a lot of tweaks to make the agent better overall in these months. Are you able to share any of those tweaks? MICHELE: Yeah, where do I start? I would say there are two pillars, which, by the way, reiterate what you just explained during your keynote.
On one end, investing early in evaluations is extremely important. Otherwise, especially as your agent becomes more advanced, you have no idea whether you're introducing regressions or actually making progress. The other one is observability. We can go deep in there; as you know, we use LangSmith pretty thoroughly.
We also use another set of tools. And I think we are all learning, as a field, how to do observability on agents. It's a completely different animal compared to how we built distributed systems in past decades. One of the things I'd love to hear more about: when we did a separate fireside chat, maybe in December, we talked about the human-in-the-loop experience and how important that was at the time.
Now you're saying these agents are more autonomous. How do you think about that? Has that changed, or is it just present in a different way? MICHELE: Yeah, you're spot on. There is this constant tension between wanting to put the human in the loop, so that you can break the agentic flow and make sure that if it's going sideways the human can bring it back on track.
But at the same time, what we're seeing from our users is that when the agent is actually working correctly, they don't want to be bothered. They just want it to get things done. And the bar keeps rising basically on a monthly basis. The more we can get done, it maybe takes a week for users to get used to it, and then they just want more.
So the strategy we're following at the moment is to push notifications to other platforms as well. We have a mobile app, for instance, that basically lets us bring the user's attention back. But at the same time, there is always a chat available where you can ask the agent to stop.
You can ask it to do different work, even while it's running. So it depends, I think, on the user profile. Some users tend to be more trusting and delegate agency to the agent, and others are more hands-on. We're trying to build a product that makes both of them happy.
But I think, overall, we are all going towards more autonomy over time, and that's the winning recipe. On the topic of users, how are people using Replit Agent? What types of things are they building? What are their backgrounds? Who are the users you're thinking of targeting? Yeah.
Starting in early February, we finally opened our free tier, so everyone can use Replit just by creating an account. And we are on track to create roughly 1 million applications per month; that's the level of scale we've reached today. A lot of them are just testing what agents can do.
And I think the same high that we got when we were younger, when we wrote our first piece of code and actually saw it running, is what a lot of people are chasing when first trying the agent: realizing that you can actually build software even without any coding background.
At the same time, some of them get hooked. They realize, oh, I can build what I need for my business, I can build something I need at work. And that's when they start to work on much more ambitious applications. So one of the key differences of our product is that it's not mostly used to create simple landing pages or prototypes; rather, people find value in very long trajectories.
I've seen people spending hundreds of hours on a single project with Replit, writing absolutely no lines of code, just making progress with the agent. That is, first of all, a great technical challenge, because it makes things much harder for several different reasons. And the people spending so much time are usually building internal tools in companies.
That's something I'm very excited about. There is this concept of unbundling SaaS that people talk about: the idea of, why would I spend seven figures buying a very expensive SaaS when I only need two features? I'd rather rebuild it and deploy it internally in the company.
So this is one direction that I see a lot more companies working on. And at the same time, also personalized applications for professionals, or even people who have a hobby and want to build software around it. So that's the general landscape today. That's awesome. For people who have agents, are maybe starting on the lower end of autonomy, and are thinking of letting them run for 10 or 15 minutes like you are, how did you get the confidence to do that?
When was the point where you were like, okay, we can take the human out of the loop and start letting it run? Was that based on feedback from users, internal testing, metrics? What did the process of getting that confidence look like? I would say a lot of internal testing.
Even before we launched V1, we had a prototype of it since early 2024, so we have always been trying to make it work. And the moment we find the right unlocks, which are partially due to what the frontier labs are working on, the new models they give us.
And at the same time, it's also due to how good the scaffold we're building is. The moment it works well enough, that's when we start to feel we should launch this, or at least put it in front of a small cohort of alpha users. What happened with V2 is that we re-architected it to best leverage the latest models out there, and then we started to use it a lot internally.
And we started with an approach that was a bit more similar to V1, so we were more cautious. And then we just gave it more leash. We wanted to see, "Okay, how far can we take this? How well is it going to work?" And it turns out that it exceeded our expectations.
So the confidence, in all honesty, as usual, came during the early access program, where we launched it as an opt-in. We asked users just through social media to go and try it. And then we received exceedingly positive feedback, and as a team we rushed to basically go to GA as soon as possible.
So you've mentioned models a few times. Are you able to share what models you all are using, or how you generally think about the model landscape out there? We are heavy users of the Sonnet models, especially 3.7, as it unlocked a new level of autonomy for coding agents. I see the industry overall pointing in that direction; the latest Gemini 2.5 Pro is also following a very similar philosophy.
And I do believe the frontier labs are realizing that there is a lot of value in allowing companies like ours, and all your customers, to create much more advanced agentic workflows than in the past. So I wouldn't be surprised if in the next few months we see all the top models exposing tools and being post-trained in a way that allows you to have much more autonomy than before.
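As a rough illustration of what "exposing tools" looks like in practice, here is a minimal sketch of declaring a tool for a Claude model via the Anthropic Python SDK. The run_shell tool and its schema are invented for the example, and the model alias is an assumption; substitute whatever model you actually use.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Declare a tool the model is allowed to call; name and schema are made up for illustration.
tools = [{
    "name": "run_shell",
    "description": "Run a shell command inside the user's workspace and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string", "description": "Command to execute"}},
        "required": ["command"],
    },
}]

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # assumed alias; pick the model you actually use
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "List the files in the project root."}],
)

# If the model decides to use the tool, the response contains a tool_use block.
for block in response.content:
    if block.type == "tool_use":
        print("model wants to call", block.name, "with", block.input)
```

An agent scaffold would then execute the requested tool, append the result as a tool_result message, and loop until the model stops asking for tools.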
And do you let users choose what model is used under the hood, or is that hidden? No, we are very opinionated, and it's also a product choice. In all honesty, there are platforms where, of course, you can pick your model. We use Cursor internally at Replit, for example, to develop parts of it.
So I think it's great to have a model selector and get the best possible performance from the different models available on the market. In our case, it would be a fairly big challenge to allow you to switch models. We use multiple models, by the way. In one run of the agent?
Yeah. 3.7 is kind of the foundation, the main building block for the IQ of the agent. But we also use a lot of other models for accessory functions. Especially where we can trade a bit of performance for latency, we go with flash models or smaller models in general.
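A toy sketch of the kind of routing described here: send the main reasoning to a frontier model and latency-sensitive accessory work to a smaller, faster one. The function, model names, task categories, and thresholds are hypothetical, not Replit's actual logic.

```python
from dataclasses import dataclass


@dataclass
class ModelChoice:
    name: str
    reason: str


def pick_model(task_kind: str, latency_sensitive: bool) -> ModelChoice:
    """Route heavyweight reasoning to a frontier model, accessory work to a cheaper one."""
    if task_kind in {"code_edit", "planning", "debugging"}:
        # the "IQ" of the agent lives here, so pay for the strongest model
        return ModelChoice("claude-3-7-sonnet", "main reasoning step")
    if latency_sensitive:
        # accessory step: accept a little accuracy headroom in exchange for speed
        return ModelChoice("gemini-2.0-flash", "latency-sensitive accessory function")
    return ModelChoice("small-general-model", "cheap background task")


print(pick_model("code_edit", latency_sensitive=False))
print(pick_model("summarize_logs", latency_sensitive=True))
```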
So we don't give you that optionality, because it would be very hard for us to even maintain several different prompts. Yeah. If you think about it, we go very deep down the rabbit hole of optimizing the prompts. It would be very hard for me to go from n=1 to n=3 prompt sets.
It would be quite a lot of work for now. Do you use any open source models as well as part of this, or is it mostly foundation models? At this point, it's mostly foundation models. We definitely spent some time testing DeepSeek, and I'm very bullish over the long run. The reason we're not investing too much time today in fine-tuning or exploring open-source models at length is that, again, the labs are moving at a completely different pace compared even to a year ago.
I think back in the days when we got to know each other, maybe there was a new leap every six to nine months. Now it's probably happening every couple of months. So it's better to explore what you can do today with the frontier labs. And then eventually, when things slow down, if they ever slow down, by the way, or if there is a reason for us to take an open-source model, fine-tune it, and perhaps optimize some of the key actions our agent takes, then I'd be happy to spend time there.
But for now, it's already frantic enough as it is. You've mentioned the trade-off between cost and latency, and then there's also performance there. And performance, yeah. How do you think about that now, and how have you thought about it over time? Because Replit Agent, at least based on what I see on Twitter, has exploded recently.
And so was there a moment-- I think everyone has some fear when they launch an agent or an AI application: if this becomes really popular, it's going to bankrupt me. Did you guys have that fear as you started to see things take off?
I still have that fear, so it doesn't change much, trust me. So I think I went on a podcast, probably in early October last year, of course, saying that the three dimensions you want to optimize are performance, cost, and latency. And for me, performance and cost are almost at the same level in terms of importance.
And then, already back in the V1 days, I treated latency as a distant third. That doesn't change much today with V2; if anything, the gap has become even wider. Because it runs for so long. It runs for so long, and possibly that was the scariest bet we made when we launched it, especially when we made it GA.
And the reason is, we were already not emphasizing the latency component too much, but we strongly believe it's far more important for the agent to get done what people want, especially for the ICP we have in mind, which is non-technical people. So we added almost an order of magnitude of additional latency.
And the reaction has been fairly uncontroversial, I think. Maybe for the first week we heard some people shocked about the amount of time it was taking, but then you realize how much more it gets done and how many headaches it solves for you, because you don't have to go and try to debug.
Even if you debugged with an older version of the agent, you had to know what to ask. Right now, that's oftentimes not the case anymore. So do you see people still modifying the code manually, or is it completely hands-off? It's a great question. We have an internal metric, and it's one of my North Stars, to be honest.
We try to track how often people go back into our editor, which, by the way, we have been hiding in the product since we launched Agent V1. I mean, that was the main product. That was the goal. Yeah, exactly. The main product, for those who didn't know Replit before we launched the agent, was an editor in the cloud.
We started by still showing you the file tree; now it's hidden by default, and it takes some effort to get in front of the editor. We started at a point where, I think, one user out of four was still editing the code, especially the more professional ones. As of today, we've arrived at a point where it's one out of ten doing that.
And my goal is that, eventually, zero users should need to put their hands on the code. One of the cool features of Replit that I remember from before Agent was the multiplayer collaboration. When people build apps with Agent, is it mostly one person using the agent, or is it sometimes collaborative as well, with multiple people interacting with the agent?
So for our consumers around the world, yes, most of them, I think, are just a single-player experience. But in a business and enterprise setting, we bring them in as a team so everyone can see each other's projects, and we see them using the agent together. Now, we have a giant lock as of now, for reasons I'm happy to explain.
But, you know, we often see in the chat logs that there are several people sending prompts to the agent. The reason it's still hard to run a lot of agents in parallel is not so much the infrastructure side. We have everything it takes to run multiple instances, because we already run at scale, so that wouldn't be such a big leap.
The real challenge is how you merge all the different patches, basically the PRs that the agent creates, which is still a non-trivial problem even for frontier AI models. Merge conflicts are hard, unfortunately. You mentioned earlier that there's an app for using Replit and getting notifications. Where I'm going with this is, when this agent's running for 10 or 15 minutes, what are the communication patterns you're seeing?
How do the users know when it's done? Are they just keeping the browser open and looking there? Do you have Slack notifications? Is it the app that sends them a push? What are you seeing being helpful there? And has that changed as the agent runs longer and longer?
Yeah. So with V1, most of the users were in front of the screen all the time, because the feedback loop was relatively short. And I think there was also quite a bit to learn from what the agent was doing. It's still the case today. It's fairly verbose. If you're curious, you can basically expand every single action it does.
If you want, you can see the output of every single tool we run. We try to be as transparent as possible. So there is a subset of users that are using the agent not only because they want to build something, but also because they want to speedrun their learning experience.
It teaches you how to build 0-to-1 apps in possibly the best way. There are also users who absolutely don't care: they just submit a prompt, go do something else, and then come back and check Replit. To make that loop a bit tighter, the Replit mobile app, which is available on both the App Store and Android, sends you notifications when the agent wants your feedback.
And the vision that we have for the next release is to send you even fewer notifications. And the idea is, right now, one of the bottlenecks, at least for us, is the fact that we rely solely on humans for testing. But, as you know, more and more progress is happening on the computer use side.
You know, Anthropic launched that back in late October, if I recall correctly. OpenAI fast-followed, and open source is also catching up; I saw Hugging Face launched something similar a week ago. That is something we are actively working on, to remove even this additional hurdle from the user.
Because a lot of the time what we ask you to test is fairly trivial: it's data input and clicking around a very simple interface. I expect us to be able to do that with computer use very soon, bring it into the product, and then jump from, say, ten minutes of autonomy to one hour of autonomy.
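For a sense of what that kind of trivial "data input and clicking around" check could look like, here is a minimal sketch using Playwright as a scripted stand-in for a computer-use model; the URL, selectors, and expected text are made up for the example.

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright


def smoke_test(app_url: str) -> bool:
    """Fill a form, submit it, and check that a confirmation shows up."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(app_url)
        page.fill("#name", "Test User")            # data input into a hypothetical field
        page.click("button[type=submit]")           # click around the simple interface
        page.wait_for_selector(".confirmation")     # wait for a hypothetical result element
        ok = "Test User" in page.inner_text(".confirmation")
        browser.close()
        return ok


if __name__ == "__main__":
    print("smoke test passed:", smoke_test("http://localhost:3000"))
```

A computer-use model would replace the hard-coded selectors with vision-driven clicks, but the shape of the check, drive the UI and verify the outcome, stays the same.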
That is my target, you know, for V3, hopefully in a few months. How do you think about this: there's testing, but then there's also making sure it's doing what the human actually wanted. Oftentimes we're bad communicators and don't specify everything up front. How do you think about getting all of that specification?
Do you have something like deep research, where it grills the user back and forth at the start? Or how do you think about that? So we are changing the planning experience as we speak, and we're going to launch it very soon. It's hard to reconcile how most users have been trained by products like ChatGPT with how we actually expect them to use a coding agent, or any agent in general.
Because if you have a complicated task you want to express, let's say in the case of building software, you basically want to submit a PRD; that's what every PM is capable of doing. Very few people are willing to do that. Or what they do is write a two-line prompt, throw it into Claude, get back a long PRD, and then expect the agent to follow every single item in that PRD pedantically.
We're not there yet. The challenge here is to make both kinds of people happy. People who love to use it as a chatbot basically do one single task at a time, and we put some effort into training: we did a course with Andrew Ng, who's going to be on stage in a few hours, just to tell people that if you want to use it that way, it's important to split your main goal into subtasks and submit them sequentially.
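A tiny sketch of that workflow: split the goal into subtasks and submit them one at a time, reviewing each result before moving on. submit_to_agent and the subtasks are hypothetical stand-ins, not a real API.

```python
def submit_to_agent(prompt: str) -> str:
    # placeholder for however you send a prompt to your coding agent and wait for it to finish
    return f"done: {prompt}"


subtasks = [
    "Set up the database schema for customers and invoices",
    "Build the invoice list page",
    "Add CSV export for invoices",
]

for task in subtasks:
    result = submit_to_agent(task)  # one focused task per run
    print(result)                   # review before submitting the next subtask
```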
But at the same time, I would love to reach a point where we go through each subtask in isolation and get things done, and maybe only ask for feedback after, say, one hour; then it's up to you as a user to check whether we accomplished everything you wanted.
But I think there is so much that can be done autonomously that maybe brings you, say, 90% of the way to what the user wants. And then when we get their attention back, we basically ask them to polish the user experience and refine exactly what they want. You mentioned observability and thinking about that early on.
What have you learned as Replit Agent has gone crazy viral? That observability is even harder than expected, regardless of the fact that you guys are building something awesome with LangSmith. What are the hardest parts? Give us some product ideas. So first of all, this feels a bit like back in the days when we were debating the best possible architecture for databases.
The truth is, one size does not fit all in this case. There is Datadog-style observability, which is still very useful: you want aggregates, you want dashboards that tell you you're failing to use this tool 50% of the time, and then ring an alert and go fix it.
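A toy example of that kind of aggregate: compute per-tool failure rates from a list of call records and flag any tool above a threshold. The record format and the 50% threshold are illustrative only.

```python
from collections import defaultdict


def tool_failure_rates(records: list[dict]) -> dict[str, float]:
    """records look like {"tool": "apply_patch", "ok": False} (hypothetical shape)."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["tool"]] += 1
        failures[r["tool"]] += 0 if r["ok"] else 1
    return {tool: failures[tool] / totals[tool] for tool in totals}


def tools_to_alert(records: list[dict], threshold: float = 0.5) -> list[str]:
    """Return the tools whose failure rate is at or above the alert threshold."""
    return [t for t, rate in tool_failure_rates(records).items() if rate >= threshold]


records = [
    {"tool": "apply_patch", "ok": False},
    {"tool": "apply_patch", "ok": False},
    {"tool": "run_tests", "ok": True},
    {"tool": "run_tests", "ok": True},
]
print(tools_to_alert(records))  # ['apply_patch']
```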
At the same time, something like LangSmith is extremely important, because unfortunately we're still in the assembly era of debugging for agents. I think you would agree with me: when you're trying to understand why the agent made the wrong choice or is going sideways, your last resort is to actually read the entire input prompt and the generated output and try to figure out why certain choices were made.
So it's much more effort to debug than even an advanced distributed system. Aggregates are not enough. You have something that looks like a step debugger, but rather than showing you the state in memory, you need to read 100,000 tokens and figure out what's wrong. So I think we are at the early stages of observability.
But what I recommend to everyone who starts seriously thinking about building an agent, or any agentic workflow, is: invest in observability from day one. Otherwise, you're going to be lost immediately and you'll probably give up, because you'll think it's impossible to pull this off.
And I hope that we, and many other companies, are proof that it's not impossible. It's just really hard as we speak. Who do you see being the best at this-- who debugs these agents? Is it everyone on the team? I mean, you guys are building a technical product.
So presumably everyone has some product sense and feel for it. But is there a particular persona that spends the majority of their time in LangSmith looking at logs, or who has the best skill or knack or intuition for that? Given the size of Replit today, we are barely 75 people across the entire company.
The way we work is that everyone does a bit of everything. So even if you're an AI engineer and you're the person who has been optimizing the prompts, if there is a page and something is broken, most of the people on the technical team are capable of going all the way from the product surface down to the metal.
Now, what makes it a bit more challenging for Replit is that we own the entire stack. We have the execution plane, where we orchestrate all the containers, and the control plane, which is basically a combination of our agent code base and LangGraph-style orchestration, and then all the way up to the product.
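For readers unfamiliar with the pattern, here is a generic, minimal LangGraph-style loop (plan, act, repeat until done). It is a sketch of the orchestration style mentioned, not Replit's control plane; the state fields and node logic are stubbed.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    task: str
    steps_done: int
    done: bool


def plan(state: AgentState) -> AgentState:
    # decide the next action for the task (stubbed out in this sketch)
    return {**state}


def act(state: AgentState) -> AgentState:
    # execute one action, e.g. call a tool or edit a file (stubbed: stop after 3 steps)
    steps = state["steps_done"] + 1
    return {**state, "steps_done": steps, "done": steps >= 3}


graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("act", act)
graph.add_edge(START, "plan")
graph.add_edge("plan", "act")
# loop back to planning until the act node marks the task as done
graph.add_conditional_edges("act", lambda s: END if s["done"] else "plan")

app = graph.compile()
print(app.invoke({"task": "add a login page", "steps_done": 0, "done": False}))
```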
So, unfortunately, as of now it's important to be capable of reading the traces all the way down, because problems can happen anywhere. Even for one of the tools we invoke, maybe the interface is correct, but the binary of the tool itself could be broken. We've talked a bit about the journey from V1 to V2; maybe to close us off, what's coming in V3?
What are some things on the roadmap that we can expect? I already hinted at one of them: I expect us to bring in computer use, or in general, make it easier to test applications. At the same time, I'm also very bullish on bringing software testing into the loop.
Yeah. The beauty of building a coding agent is that code is far more observable, and there are way more tools you can apply to code to test whether it's correct or not. And last but not least, I want to push even further on test-time compute, where, as of today, we already use a fair amount of tokens, as you know.
But we definitely want to explore both sampling and parallelism. We already see this, especially at the beginning: a lot of our users open several projects in parallel and do the initial build, so they can see which one matches their UI taste best. I imagine taking this concept and carrying it along the entire trajectory, where you sample, then rank and pick the best solution for the problem.
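A minimal sketch of that sample-then-rank idea (best-of-n at test time): generate several candidates in parallel, score them, keep the best. generate_candidate and score are hypothetical placeholders, e.g. for an agent rollout and a test-based or LLM-judge ranker.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_candidate(prompt: str, seed: int) -> str:
    # placeholder for one agent/model rollout producing a candidate solution
    return f"solution-{seed} for: {prompt}"


def score(candidate: str) -> float:
    # placeholder ranker: in practice, tests passed, lint results, or a judge model
    return float(len(candidate) % 7)


def best_of_n(prompt: str, n: int = 4) -> str:
    """Sample n candidates in parallel and return the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda i: generate_candidate(prompt, i), range(n)))
    return max(candidates, key=score)


print(best_of_n("build a landing page"))
```

The trade-off is exactly the one described above: n times the tokens for a better chance of a good trajectory, which is why it fits heavier-spending users first.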
So this will probably be for our high spenders, but it definitely helps you get better performance. Awesome. Well, I'm looking forward to all of those. Thank you, Michele, for joining me. Thank you. Let's give Michele a big round of applause.