I'm not sure how many of you have been following along up to now, but this is what's happening behind the scenes over the course of an agent run. So Connor and Sam both go through the same kind of research work, and they do it through a three-to-four tool system.
Searching the internet, searching the website, and scraping the website. So you can see here we have the setup for both agents side by side, with all the steps each agent goes through, and you can see the differences in the architecture. The key difference is in the planning: one agent uses a weaker, faster model to generate its plan, while the other uses one of the newer reasoning models to generate a more detailed plan, and then both of them execute against that plan.
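As a rough sketch of what that plan-and-execute split can look like (the model names, prompts, and function names here are illustrative assumptions, not the team's actual code):

```python
# Minimal plan-and-execute sketch: one variant plans with a fast model,
# the other with a reasoning model; both execute the plan the same way.
from openai import OpenAI

client = OpenAI()

def generate_plan(question: str, planner_model: str) -> str:
    """Ask the planning model for a short, numbered research plan."""
    resp = client.chat.completions.create(
        model=planner_model,  # e.g. "gpt-4o" for the fast planner, "o1" for the reasoning planner
        messages=[{
            "role": "user",
            "content": "Write a short, numbered research plan for answering:\n" + question,
        }],
    )
    return resp.choices[0].message.content

def execute_plan(question: str, plan: str) -> str:
    """Execution is identical for both variants; only the plan differs."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Follow the plan step by step to answer the question."},
            {"role": "user", "content": f"Question: {question}\n\nPlan:\n{plan}"},
        ],
    )
    return resp.choices[0].message.content
```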
After we built those, the first thing we wanted to know is: how do we know which one is better? Before we built any evals or any metrics, we spent a lot of time just looking at the traces to see which agent was actually doing better.
What we found initially is that o1 produced much more thorough plans. You can see here, on the left we have 4o and on the right we have o1, both given the same content. The o1 plan comes in at around 600 tokens, while the 4o plan is much shorter. And if you zoom in on the plan, you can see, for one specific question, the difference in quality and specificity between the 4o plan and the o1 plan.
And what we found from these plans is that this increase in specificity and accuracy in the initial planning phase actually helps the agent improve outcomes downstream. But after we did this kind of vibe check, we started to build actual evals. We started with just accuracy, which was just the percentage of questions answered correctly, and then hand labeled a bunch of datasets.
So we hand labeled these ourselves, probably a few hundred companies per dataset. And we built the datasets around the questions customers actually use the agent for, things like whether a company is B2B and what type of product it sells, so that we could evaluate how the agents get to answers on those kinds of questions.
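As a minimal sketch of what that first accuracy eval can look like over a hand-labeled file (the column names, file name, and exact-match grading are assumptions for illustration):

```python
# Plain accuracy over a hand-labeled CSV: what fraction of questions
# does the agent answer correctly?
import csv

def load_labeled_rows(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))  # assumed columns: company, question, expected

def accuracy(rows: list[dict], run_agent) -> float:
    """Exact-match grading; in practice a human or LLM grader would replace this."""
    correct = 0
    for row in rows:
        answer = run_agent(row["company"], row["question"])
        correct += answer.strip().lower() == row["expected"].strip().lower()
    return correct / len(rows)

# Usage with any callable agent:
# rows = load_labeled_rows("hand_labeled_companies.csv")
# print(f"accuracy: {accuracy(rows, my_agent):.1%}")
```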
And we put the Connor agent and the Sam agent head to head, and the Connor agent came out on top. It beat out Sam in most of the categories, and the difference grew with the harder tasks. So we learned a few things from these initial evals. Like I mentioned earlier, these reasoning models have an outsized impact on downstream actions and downstream accuracy.
And you can also see the value of having an accuracy eval for comparing agents like this. But even with that, we still had to dig through traces, because the eval alone didn't give us a clear picture of where to go from here. So the question became: what are the levers we have in these agents?
So how do we improve the agents? We thought about three or four axes for improvement: one is changing the graph or the architecture, two is changing the models and the prompts, and three is adding more tools. And we spent a lot of time reflecting on customer use cases.
Based on what customers needed, changing the models and the prompts, and adding new tools, were the first two areas we wanted to invest in for improvements. So I'll talk a little bit about some of the learnings we had from changing models and prompts.
So the first thing we wanted to do was optimize the cost and performance of the prompts. o1 and the other reasoning models are pretty expensive, and they're also pretty slow. And at the start of this year it seemed like new models were coming out every week.
So you're often just plugging in a new model and trying to see if it's better or not. What was interesting is that we didn't really see a huge difference until 4.1 came out recently. That's the model we ended up replacing o1 with for planning. And the outcome of that change is that a run of the Connor agent that used to cost around 35 cents now costs around 10 cents with 4.1, with similar performance.
You can see here a number of the other models we tried and why 4.1 was the best, or most cost-effective, model for planning in our agent. We even tried DeepSeek and a couple of others recently. One thing with DeepSeek was that when it came out it was really promising, but it was only a week or two ago that the latency came down to an acceptable threshold, similar to o1, for these tasks.
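As a rough sketch of that plug-in-a-model-and-measure loop, reusing the accuracy() harness sketched earlier (the candidate list and the make_agent factory are illustrative assumptions):

```python
# Hypothetical sweep over candidate planning models using the earlier accuracy() helper.
def sweep_planner_models(rows, make_agent, candidates=("gpt-4o", "gpt-4.1", "o1")):
    """make_agent(model) should return a callable agent whose planner uses `model`."""
    results = {}
    for model in candidates:
        agent = make_agent(model)
        results[model] = accuracy(rows, agent)
        print(f"{model}: {results[model]:.1%}")
    return results
```

In practice you would track cost and latency per run alongside accuracy, since that is what drove the o1 to 4.1 swap.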
The other thing we came across was date formatting, which is pretty interesting: we had a bunch of models fail to correctly identify whether a date was in the future, just because of the format it was in. So we would see a date formatted numerically, something like 2025-03-14, versus one written out like May 15th, 2024.
Because the current date was given in one format, the model actually thought that the date in 2024 was in the future as well. So we made some adjustments to the prompt, providing the current date in a few different formats, which improved and standardized accuracy on date questions across the models.
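As a small sketch of that fix (the exact prompt wording and the set of formats are assumptions):

```python
# Give the model today's date in several explicit formats so date comparisons
# don't hinge on any one format.
from datetime import date

def current_date_block(today: date | None = None) -> str:
    today = today or date.today()
    return (
        "Today's date is:\n"
        f"- ISO format: {today.isoformat()}\n"
        f"- Written out: {today.strftime('%B %d, %Y')}\n"
        f"- US numeric: {today.strftime('%m/%d/%Y')}\n"
        "Any date earlier than this is in the past."
    )

# This block gets prepended to the planner and tool-calling prompts.
```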
The last main thing was how we do tool calling. This is something we're still working on, but initially a huge problem was that agents were making pretty vague tool calls, something like searching for "B2B" with no other context. So what we ended up doing was changing a lot of the Pydantic models behind these tools' input schemas, to force the tool-calling node, or tool-calling agent, to conform to the schema and think a little bit more about what it was passing in.
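As a sketch of what tightening a tool's input schema can look like with Pydantic (the field names, constraints, and validator are illustrative assumptions, not the team's actual schema):

```python
# Force the tool-calling node to be specific instead of searching for "B2B" generally.
from pydantic import BaseModel, Field, field_validator

class CompanySearchArgs(BaseModel):
    """Input schema for a company-research search tool call."""
    query: str = Field(
        ...,
        min_length=10,
        description="A full, specific search query naming the company, "
                    "e.g. 'Is Acme Corp a B2B or B2C company?'",
    )
    company_name: str = Field(..., description="The company this search is about.")
    reason: str = Field(..., description="One sentence on why this search helps answer the question.")

    @field_validator("query")
    @classmethod
    def query_not_generic(cls, v: str) -> str:
        if v.strip().lower() in {"b2b", "b2c", "company info"}:
            raise ValueError("Query is too generic; name the company and what you want to learn.")
        return v
```

If the validation error is fed back to the model, it gets a concrete nudge to retry with a more thoughtful call.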
So overall, across the prompt and model changes, we learned that agent costs are coming down a lot: we were able to swap o1 out for 4.1 for planning with no real quality change. And we also learned that there are a ton of cases that evals won't necessarily catch. Even if you build a lot of evals, you're probably still going to find yourself in the game of reading traces, doing some kind of human eval or human vibe check.
That seems to be the case at the big labs recently too, with the changes in their models. We also learned that models tend to spike on different use cases. The last area is the new tools we really wanted to turn on, because whole new customer use cases can be unlocked by adding just a single new tool. The four tools that we decided to add were deep internet research, browser access, searching HTML, and dataset access.
So I'll go deeper on a couple of these. The reason we started on deep internet research is that internet search is still hard: between SEO-driven articles and Google, as well as search grounding and things like that, a lot of the result quality ends up in your hands. And we also saw in our usage that tool calls to the internet were not being utilized the way we would utilize them ourselves.
So we thought about how we do internet research today. We're pretty good at it, and we do it fundamentally differently than how our agents were doing it initially. When you do a search on Google, you might search for a query, look over the top ten links, implicitly filter out probably five of those just based on the source, open a couple of tabs, maybe read a couple of sentences on each, before deciding that you need a different search query or that you've found your answer.
So we saw that our agents were not mimicking this kind of behavior, and we wanted to adjust course to improve the result quality. So we upgraded our initial internet search tool. Initially it had a very naive structure, which was just a query term. We flipped that to include a bunch of other arguments: things like the result category, whether you want a live crawl, whether to include the full text or a summary, and also things like restricting the domain and the publish date.
And by changing all these parameters, we also changed the trajectory of the search itself: from first reviewing just the preview from the internet search output, which is what we have on the left here, to getting both the URL and the actual page content in one tool call. What this allows us to do is pull in all of this content at once, and we sidestep the issue we were seeing with agents picking an answer based just on the Google search preview, which, as you know, isn't always reliable or accurate.
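As a sketch of what that upgraded search tool can look like (the argument names mirror the ones mentioned above, but the schema and the injected search_fn client are assumptions):

```python
# Richer search-tool input, and results that carry the page text alongside the
# URL so the agent never has to answer from a preview snippet alone.
from pydantic import BaseModel, Field

class InternetSearchInput(BaseModel):
    query: str = Field(..., description="Full search query.")
    category: str | None = Field(None, description="Optional result category, e.g. 'company' or 'news'.")
    live_crawl: bool = Field(False, description="Fetch pages live instead of from an index cache.")
    include_text: bool = Field(True, description="Return full page text, not just a preview snippet.")
    include_domains: list[str] | None = Field(None, description="Restrict results to these domains.")
    published_after: str | None = Field(None, description="Only results published after this ISO date.")

def internet_search(args: InternetSearchInput, search_fn) -> list[dict]:
    """search_fn is whatever search API client is in use; it is assumed to accept
    these keyword arguments and return dicts with url, title, and text."""
    results = search_fn(**args.model_dump())
    return [{"url": r["url"], "title": r["title"], "content": r["text"]} for r in results]
```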
So the second new tool we built was browser access. Again, think about how we'd do this ourselves: there's a lot of rich data online that scraping alone just can't capture. Between online data sources or datasets where you have to enter a query, interactive search experiences, even things like Google Maps or images, you can't really capture that content by scraping the page.
So we wanted to allow our agent to use the browser the same way we would. We built browser access in as a sub-agent: we gave the agent this tool, which is basically browser access, and what it does is decompose the task into a browser trajectory using a smaller model, and then it uses computer-use-preview to actually act on that trajectory.
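As a high-level sketch of that decompose-then-act pattern (the planner model name, prompt, and the injected execute_step executor are assumptions; the actual action-taking is whatever computer-use executor you wire in):

```python
# A small model breaks the task into browser steps; an executor carries each step out.
from openai import OpenAI

client = OpenAI()

def plan_browser_trajectory(task: str) -> list[str]:
    """Decompose a research task into ordered browser steps."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed small planner model
        messages=[{
            "role": "user",
            "content": "List, one step per line, the browser actions needed to complete this task:\n" + task,
        }],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("-* ").strip() for line in lines if line.strip()]

def run_browser_subagent(task: str, execute_step) -> list[str]:
    """execute_step wraps the computer-use executor: it takes one step
    description and returns what was observed in the browser."""
    return [execute_step(step) for step in plan_browser_trajectory(task)]
```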
We evaluated Browser Use, the open-source alternative, and we found that while it was marginally faster, it struggled on more complex browser paths, which led us to use computer-use-preview instead. You can see an example of this here, where we try to find out whether Google has EV charging in its on-site parking.
So eventually it ends up using the browser tool: it goes to Google Maps, ends up using Street View, looking through the parking lot for an EV charging station, and then also flips to another tab in the browser to check whether the site lists EV charging.
And on that last page there, it does actually confirm, between Google Maps and that page, that there is an EV charging station. So we learned a lot from these tools. One thing we learned is that you can't take a blanket approach to internet search. Internet search with Google is great, but you still need to empower your agents to look at the data, put the right content into the context, and then actually answer off of that context.
This switch, this pivot to changing the trajectory and pulling in content all at once, massively reduced the amount of misinterpretation we had in internet search and changed how we do search. And these other tools, like browser access and searching HTML, automatically unlock new use cases for our agent. So as a result, the new agent, the new champion we have in the product, is called the browser agent, which, as you can see, is a slight tweak to our theme for naming agents.
So a couple of quick next steps. Based on these changes and new tools, we want to invest a little more time in evals that actually capture the kinds of issues we found by looking at traces and looking at outputs, to make this process a little more repeatable and scalable.
Awesome. We are solving a lot of interesting agent problems, so if you also want your name in our code base as an agent, reach out to us afterwards. We are hiring tons of engineers. Thank you guys. Thank you. All right. Thank you, Conor and Pinal. So next up is Eno Reyes.
Eno is the co-founder and CTO of Factory. Eno will be sharing insights on the challenges of agentic systems and how to actually make them work. Please welcome Eno. Hey everybody. My name is Eno, co-founder and CTO of a company called Factory. At Factory, we believe that the way we build software is radically changing.
We are transitioning from the era of human-driven software development to agent-driven software development. You can see glimpses of that today. However, it seems to me,