
LangChain Interrupt 2025: Building Replit Agent v2 – Michele Catasta


Transcript

Thanks for joining me. So I think most people are probably familiar with Replit and what you do. You guys launched V2 of Replit Agent about two months ago, and I've heard nothing but fantastic things about it. So for people who haven't tried Replit Agent in the last two months, what is it?

What's different? Why should they try it out? I think the shortest possible summary is autonomy — the level of autonomy compared to V1. If you tried V1, which launched last year, you know it could only run autonomously for short stretches. Right now, it's not uncommon to see it run for ten or fifteen minutes at a time.

And what I mean by running is not spinning its wheels, but doing useful work and accomplishing what the user wants. It took a lot of re-architecting, testing the new models as they came out, and, to be honest, applying the things we learned from shipping V1 in production.

I imagine you learned a lot of valuable things in those months. Are you able to share any of those learnings? Yeah. Where to start? I would say there are two pillars, which, by the way, reiterate what you explained earlier.

On one end, investing early in evaluations is extremely important advice, especially because the more advanced your agent becomes, the easier it is to introduce regressions unless you have automated evals. The other is observability, and we can go deeper there. As you know, we use LangSmith pretty thoroughly. We also use another set of tools.
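
As a rough illustration of the "invest early in evaluations" point, here is a minimal regression-eval sketch: a handful of fixed prompts, each with a programmatic check, gated by a pass-rate threshold in CI. The `run_agent` entry point, the cases, and the threshold are all hypothetical placeholders, not Replit's actual harness.

```python
# Minimal regression-eval sketch. run_agent() and the cases are hypothetical
# placeholders, not Replit's actual harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the agent's output passes

def run_agent(prompt: str) -> str:
    # Stand-in for a real agent run; replace with the actual entry point.
    return f"stub output for: {prompt}"

CASES = [
    EvalCase("todo-app", "Build a todo app with add and delete",
             lambda out: "delete" in out.lower()),
    EvalCase("csv-import", "Add CSV import to the dashboard",
             lambda out: "csv" in out.lower()),
]

def run_suite() -> float:
    passed = 0
    for case in CASES:
        ok = case.check(run_agent(case.prompt))
        passed += ok
        print(f"{case.name}: {'PASS' if ok else 'FAIL'}")
    return passed / len(CASES)

if __name__ == "__main__":
    # Fail CI when the pass rate regresses below a chosen baseline.
    assert run_suite() >= 0.9, "regression detected"
```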

I think we're all still learning as we build these feedback loops into V2 of the agent. One of the things that I'd love to hear more about: we did a separate fireside chat in December, and we talked about the human-in-the-loop experience and how important that was at the time.

Now you're saying these agents are more autonomous. How do you think about that? Has it changed? Yeah, there is still this constant tension between wanting the human in the loop, so that you can break the agentic flow and, if it goes sideways, the human can bring it back on track.

At the same time, what we're seeing from our users is that when the agent is actually working correctly, they don't want to be bothered; they just want to get things done. It maybe takes a week for users to get used to that, and then they just want more.

So the strategy we're following at the moment is to meet users on other surfaces: a mobile app, for instance, obviously allows you to bring the user back into the conversation. But at the same time, there is always a chat where you can ask the agent to stop, or ask it to do different work, even while it's running.

So it depends on the user profile. Some users are more trusting and happy to delegate agency to the agent, and some users want to be more hands-on, and we try to build a product that works for all of them. But I think more autonomy is the direction we are all going towards, and I think that will be interesting.

On the topic of users: how are people using Replit? What types of things are they building? What are their backgrounds? Who are the users you're targeting? Yeah, so starting from early February, everyone can use Replit Agent — you just sign up and start building — and we've gotten to a point where the number of applications created per month is beyond anything we imagined.

So that's where we are today. A lot of these users are people building software for the first time. And I think the same joy we got when we were younger, when we wrote our first piece of code and actually saw it run — that's what a lot of people are experiencing for the first time.

Realizing that you can actually build software, even without learning to program from the ground up. At the same time, some of them get hooked, and they realize: oh, I can build what I need for my business, I can build something that I need at work. And that's when they start to work on much more ambitious applications.

So I think one of the key differentiators of our product is the fact that it's not used mostly to create simple landing pages or prototypes; rather, people find value in very long projects. I see people spending hundreds of hours on a single project, barely ever touching the code, just making progress with the agent.

That is, first of all, a great technical challenge, because long projects make things much harder for several different reasons. The people spending that much time are usually building internal tools for companies. There's something I'm getting excited about, this concept of abandoning SaaS: the idea that, why would I spend seven figures buying an expensive SaaS product when I only need two or three of its features?

I'd rather rebuild it and keep that value inside the company. So this is one direction I see a lot more companies going in. But at the same time, also personalized applications for professionals — people building exactly the tool they need for their own work.

So those are the kinds of use cases. Awesome. So for people who are building agents and are maybe starting on the lower end of autonomy, and are thinking of letting them run for 10 or 15 minutes: how did you get the confidence to do that?

When was the point where you said, okay, take the human out of the loop, we need to start letting it run? Was that based on feedback from users, internal testing, metrics? What did the process of getting that confidence look like? I would say a lot of internal testing.

Even before we launched V1, we had a prototype of it, since early 2024. So we've always been trying to make it work. And the moment we find the right building blocks — which is partially due to what the frontier labs are working on, so the new models they give us, and at the same time also due to how good the scaffolding we build around them is —

the moment it works well enough, that's when we start to feel we should launch it, or at least put it in front of a small alpha cohort of users. What we did was re-architect it to best leverage the latest models out there, and then we started to use it a lot internally.

We started with an approach that was a bit more similar to V1, so we were more cautious, and then we just gave it more leash. We wanted to see: how far can we take this? How well is it going to work? And it turns out that it exceeded our expectations.

So the confidence, in all honesty, as usual, came with the early access product, where we launched it as an opt-in. We asked users on social channels to try it, and we received exceedingly positive feedback, and then as a team we rushed to get to GA as fast as possible.

You mentioned models a few times. Are you able to share what models you all are using, or how you think about the model landscape generally? You know, we are heavy users of the Sonnet models, especially 3.7, which unlocked a lot of new capabilities, not only for coding agents.

I see the industry overall pointing in one direction — the latest Gemini 2.5 probably follows a very similar philosophy — and I do believe that labs are realizing there is a lot of value in how companies like ours, and, you know, all of your customers, create much more advanced, agentic workflows compared to the past.

So I wouldn't be surprised if, in the next few months, you see all the models exposing tools and infrastructure in a way that allows you to get much more autonomy out of them. So, do you let users choose what model is used under the hood, or is that hidden?

No, we are very opinionated, and it's also a product choice. In all honesty, there are platforms where, of course, you can pick which model to use — developer tools, for example — and there I think it's great to have a model selector and get the best possible performance out of the different models on the market.

In our case, it would be a fairly big challenge to allow you to switch models. We use multiple models, by the way. In one run of the agent? Yeah, yeah — 3.7 is kind of like the foundation, the main reasoning loop of the agent, but we also use a lot of other models for accessory functions, especially when we need to trade a bit of performance for latency: then we go to flash models, or smaller, faster models in general.
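
As a rough illustration of the split he describes, one could imagine routing the main planning-and-editing loop to a large model while sending latency-sensitive accessory calls (summaries, classification, and so on) to a faster, cheaper model. The route table and model names below are assumptions for the sketch, not Replit's actual configuration.

```python
# Hypothetical sketch of routing work to different models by role.
# Model names and the route table are illustrative, not Replit's real setup.
MODEL_ROUTES = {
    "plan_and_edit": "large-reasoning-model",  # main agent loop
    "summarize_diff": "small-fast-model",      # latency-sensitive accessory work
    "classify_intent": "small-fast-model",
}

def call_model(model: str, prompt: str) -> str:
    # Placeholder for whatever LLM client is actually used.
    return f"[{model}] response to: {prompt[:40]}"

def run_task(task: str, prompt: str) -> str:
    model = MODEL_ROUTES.get(task, "large-reasoning-model")
    return call_model(model, prompt)

print(run_task("summarize_diff", "Summarize the changes made to app.py"))
```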

So we don't offer that optionality, because it would be very hard for us to maintain several different sets of prompts. And if you think about it, we go very deep into the rabbit hole of optimizing the prompts, so going back and re-tuning them for every model would be a lot of work for us.

Do you use open-source models as well as part of this, or is it mostly foundation models? At this point, it's mostly foundation models. We have done some testing, and I'm very bullish over the long term. The reason we're not spending too much time on open-source models or fine-tuning right now is that, again, the labs are moving at a crazy pace. Back when we last talked, there was a new release maybe every couple of months, and now it's happening every single month, so it's better to explore what you can do with the frontier labs' models, and then eventually, when things slow down, we'll invest more there.

By the way, if there were a reason for us to take an open-source model, continue training it, and perhaps optimize some of the key actions that our agent takes, that would make me want to spend time there. But for now, it's really early, I think.

You mentioned the trade-off between cost and latency and performance. Yeah, yeah. How do you think about that now, and how have you thought about it over time? Because Replit, I feel like, at least based on what I've seen on Twitter, has exploded recently.

And so, was there a moment — I think everyone has some fear when they launch an agent or some AI feature: if this becomes really popular, it's going to bankrupt me. Did you guys have that fear as you started to see things take off?

I still have the fear. It hasn't changed my strategy, though. I think I was on your podcast last year saying that the three dimensions we want to optimize are performance, cost, and latency. And for me, performance and cost are almost at the same level in terms of importance.

Latency, back in the V1 days, was already a distant third. That hasn't changed much — today, if anything, that gap has become even wider, because it runs for so long. It runs for so long, and that was possibly the scariest bet we made when we launched it, especially when we made it GA.

And the reason is, we deliberately de-emphasized latency. We strongly believe it's far more important for the agent to get done what people want, especially for the ICP we've been targeting, which is non-technical people. So we added almost, like, 90% additional latency.

And the reaction was fairly controversial at first. Maybe for the first week there were some people shocked at the amount of time it was taking. But the moment you realize how much more it gets done, and how many edits it makes for you, because you don't have to go and try to debug yourself...

Even if you debug with the agent, with the older versions of the agent you had to know what to ask. Now, that's not a big burden most of the time. So, do you see people modifying the code manually still, or is it completely hands-off?

That's a great question. We have an internal metric, and it's one of the ones I care most about. To be honest, we track when people go back into our editor, which, by the way, we have been hiding in the product since we launched Agent V1. I mean, that was the main...

The main product surface for Replit before we launched the agent was the IDE. We started out basically showing you the IDE; now it's hidden by default, and you have to go out of your way to get to the editor. We started at a point where, I think, one user out of four was still editing the code, especially the more professional ones.

I think as of today we've arrived at the point where it's one out of ten, and my goal is, actually, zero users needing to touch the code. One of the cool features of Replit that I remember from before Agent was the multiplayer collaboration thing.

When people build agents — sorry, when people build apps with agents — is it mostly one person using the agent, or is there sometimes a collaborative aspect as well, with multiple people interacting with the agent?

So for our consumers around the world, most of them, I think, have a single-player experience. But especially in a business or enterprise setting, we bring them in as a team, so everyone can see each other's projects, and we see them using the agent together.

Now, we have a giant lock at the moment — one agent run per project at a time — for reasons that I'm happy to explain, but we have seen cases where several people take turns sending prompts to the same agent. Challenge-wise, running multiple agents in parallel is not that big a deal on the infrastructure side — we already run an enormous number of instances, because we really want to scale — so that wouldn't be the hard part.

The real challenge is: how do you merge all the different patches or PRs that the agents create? That's still an open problem, even if you try to use AI models to resolve the conflicts. And merge conflicts are no fun. That leads to my next question. You mentioned earlier that there's an app for using Replit and getting notifications. Where I'm going with this is: when this agent is running for 10 to 15 minutes, what are the communication patterns you're seeing?

How do users know when it's done? Are they just keeping the browser open and watching? Do you have Slack notifications? Is it the app that sends them a push? What are you seeing be helpful there? And has that changed as the agent runs longer and longer?

Right. Yeah, so with V1, most users stayed in front of the screen, because the feedback loop was relatively short, and I think there was also quite a bit to learn from watching what the agent was doing. That's still the case today: it's fairly verbose, and if you're curious you can basically inspect every single action it takes.

If you want, you can see the output of every single tool we run; we try to be as transparent as possible. So there is a subset of users who use the agent not only because they want to build something, but also because they want to speed-run their learning experience.

And it teaches you how to build zero-to-one apps in the best possible way. There are also users who absolutely don't care: they just submit a prompt, go do something else, and then come back and check later.

To make that loop a bit tighter, the Replit mobile app, which is available in the app stores, sends you a notification when the agent wants to check in with you. And the vision we have for the next release is to need even fewer of those check-ins.

The idea is: right now, one bottleneck for us is the fact that we rely solely on the user for testing. But, as you know, more and more progress is happening on the computer-use side. That is something we are actively working on, to remove even these additional hurdles from the user, because a lot of the time what we ask them to test is fairly simple.

Like, it's data input and clicking around a very simple interface. I expect us to be able to do that with computer use very soon, right in the product, and then jump from, say, 10 minutes of autonomy to one hour of autonomy. That is my target — ask me again in four months.
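
To make the "data input and clicking around" kind of check concrete, here is a hedged sketch of an automated smoke test for a generated app, written with Playwright. The URL and the selectors are hypothetical; the talk only describes this style of verification as the target for computer use, not this specific script.

```python
# Hedged sketch of an automated "fill a form and click around" smoke test.
# The app URL and selectors are hypothetical placeholders.
from playwright.sync_api import sync_playwright

def smoke_test(app_url: str) -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(app_url)
        page.fill("#new-item", "first todo")             # data input
        page.click("#add-button")                        # clicking around
        page.wait_for_selector("text=first todo", timeout=5000)
        browser.close()
    return True

if __name__ == "__main__":
    print(smoke_test("http://localhost:3000"))
```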

How do you think about — there's testing, but there's also making sure it's doing what the human actually wanted, and oftentimes we're bad communicators and don't specify everything up front. How do you think about getting all that specification? Do you have something like deep research, where it grills the user back and forth at the start, or how do you think about that?

So we are changing the prompting experience as we speak, and we're going to launch it very soon. It's hard to reconcile how most users have been trained by products like ChatGPT with how we actually expect them to use a coding agent, or any agent in general, because if you have a complicated task you want to address, it's not the same as chatting — some people want to submit a PRD.

That's what some people are willing to do: they write a two-line prompt, throw it into Claude, get back a long PRD, and then they expect the agent to follow up on every single item in it.

We're not there yet. So the challenge is to meet the people who want to use it like a chatbot, so that they do basically one single task at a time, and we've put some effort into training. We did that course with Andrew, who's going to be on stage in a few hours, just to tell people: if you want to use it that way, it's important that you split your main goal into sub-tasks and prompt it step by step.

But at the same time, I would love to reach a point where we go through each sub-task autonomously, we get things done, and maybe after, let's say, one hour, it's up to the user to verify whether we accomplished everything they wanted.
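
A rough sketch of that "split the main goal into sub-tasks" pattern, expressed as a loop: decompose the goal, run the agent on each sub-task in sequence, and only hand results back at the end. The function names are placeholders, not Replit's API; in practice the plan would come from a model rather than a hard-coded list.

```python
# Hypothetical sketch of the "split your main goal into sub-tasks" advice.
# decompose() and run_agent() stand in for a real planner and real agent runs.
def decompose(goal: str) -> list[str]:
    # In practice a model would produce this plan; hard-coded for illustration.
    return [
        "Scaffold the app and data model",
        "Build the main UI screens",
        "Wire up persistence",
        "Polish styling and add basic tests",
    ]

def run_agent(subtask: str) -> str:
    return f"done: {subtask}"  # placeholder for a full agent run

def run_project(goal: str) -> list[str]:
    # Work through each sub-task in sequence; the user reviews at the end.
    return [run_agent(subtask) for subtask in decompose(goal)]

print(run_project("Build an internal CRM for my team"))
```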

But I think there is so much that can be done autonomously — let's say 90% of what the user wants — and then we ask for their attention only at the end, to polish the last mile, rather than risk messing up what they actually want. You mentioned observability and thinking about that early on.

What have you learned as Replit Agent has gone crazy viral? That observability is even harder than I expected, regardless of the fact that you guys are building something awesome in LangSmith. What are the hardest parts? Give us some product ideas. So, first of all, this is a bit like back in the days when we were debating what the best possible architecture for databases is.

One size does not fit all in this case. There is the Datadog-style observability that is still useful: you want aggregates, you want to know that you're failing to use this tool 50% of the time, and then raise an alert and go fix it.
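
As an illustration of that aggregate, alert-driven side, one could track per-tool failure rates and fire an alert once a tool crosses a threshold. The 50% figure comes from the example in the talk; the counters, names, and threshold logic below are assumptions for the sketch.

```python
# Illustrative sketch: aggregate per-tool failure rates and alert on a threshold.
# The 50% threshold echoes the example in the talk; everything else is assumed.
from collections import defaultdict

calls = defaultdict(lambda: {"ok": 0, "fail": 0})

def record(tool: str, success: bool) -> None:
    calls[tool]["ok" if success else "fail"] += 1

def check_alerts(threshold: float = 0.5) -> list[str]:
    alerts = []
    for tool, c in calls.items():
        total = c["ok"] + c["fail"]
        if total and c["fail"] / total >= threshold:
            alerts.append(f"ALERT: {tool} failing {c['fail']}/{total} calls")
    return alerts

record("apply_patch", False)
record("apply_patch", False)
record("apply_patch", True)
record("run_tests", True)
print(check_alerts())   # -> ['ALERT: apply_patch failing 2/3 calls']
```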

At the same time, something like LangSmith is extremely important, because, unfortunately, we're still in the assembly era of agents, you know what I mean — you'll agree with me. When you're trying to understand why the agent made the wrong choice and went sideways, your last resort is to actually read the generated output, the trace, and try to figure out why the agent behaved that way.

It's much more effort compared to traditional systems, where you have aggregates and whatnot, something like a stack trace; instead, you need to read a hundred thousand tokens and figure out what's wrong. So I think we are at the early stages of observability, but what I recommend to everyone who starts building a coding agent, or any agentic workflow, is to invest in observability from day one, otherwise you're going to be lost immediately — and you're probably going to give up because you'll think it's impossible to pull this off.

And I hope that we are proof, and the other companies here are proof, that it's not impossible — it's just really hard. So who do you see being the best at debugging these agents? Is it everyone on the team? I mean, you guys are building a technical product, so probably everyone has some product sense and feel for it, but is there a particular persona that spends the majority of their time poring over logs, who has the best skill or natural intuition for that?

Given the size of Replit today — we are barely 75 people across the entire company — the way we work is that everyone ends up dealing with everything. So even if you're an AI engineer whose job is optimizing the prompts, when there is a page and something is broken, most people on the technical team are capable of going all the way from the product surface down to the network.

Now, what makes it a bit more challenging for Replit is that we own the entire stack. We have the execution plane, where we orchestrate all the containers; we have the control plane, which is basically how the agent itself is coordinated; and everything in between, all the way down.

So it's important, unfortunately, as of now, to be able to read the traces all the way down, because problems can happen anywhere — even in one of the tools we use, maybe the interface looks correct, but something underneath could be broken. We've talked a bit about the journey from V1 to V2, and maybe to close us off: what's coming in V3?

What are some things on the roadmap that we can expect? Yeah. So, I already hinted at one of them: I expect us to bring in computer use, or in general the ability to test applications automatically. At the same time, I think we also have a very good shot at bringing in software testing more broadly.

The beauty of building a coding agent is that all of this is more verifiable: there are ways you can run the code to test whether it's correct or not. And, last but not least, I want to invest on the test-time compute side. As of today, we already use a fair amount of tokens, as you know.

But definitely, we want to explore both sampling and parallelism. We see this especially at the beginning: a lot of our users open several projects in parallel for the initial design, so that they can see which one matches their UI taste better. I imagine taking this concept and carrying it along the whole trajectory, where you sample and then pick what turns out to be the best solution to the problem.
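
A hedged sketch of that sampling-plus-parallelism idea: generate several candidates in parallel and keep the one that scores best, whether the judge is the user (as in the initial-design case he describes) or an automated check. The names and the scoring function are illustrative placeholders.

```python
# Illustrative best-of-N sketch: sample several candidates in parallel,
# score them, and keep the winner. generate() and score() are placeholders.
from concurrent.futures import ThreadPoolExecutor
import random

def generate(prompt: str, seed: int) -> str:
    return f"candidate app #{seed} for: {prompt}"   # stands in for a full agent run

def score(candidate: str) -> float:
    return random.random()   # stands in for user choice or automated tests

def best_of_n(prompt: str, n: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda i: generate(prompt, i), range(n)))
    return max(candidates, key=score)

print(best_of_n("landing page with a pricing table"))
```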

So this would be more of a power-user feature, but it definitely helps to get better performance. Awesome. Well, I'm looking forward to all of those. Thank you, Michele, for joining me. Let's give Michele a big round of applause. Thank you. And with that, I'd like to introduce Lance Martin to the stage — or Lance from LangChain, as you might know him.
