LangChain Interrupt 2025: Learnings from Building AI Research Agents – Connor Heggie

00:00:00.760 |
I don't know if you've all been following along by now, but this is the demo of what's happening in the background. 00:00:10.100 |
So Connor and Sam differ at the margins of the framework, but they both go through the same three-tool system: 00:00:15.600 |
searching the internet, searching the website, and scraping the website. 00:00:20.200 |
So you can see here on the slide we have the setups for both of the agents, 00:00:26.000 |
And you can see the differences in the architecture. 00:00:30.000 |
The key difference here is that we use a weaker and faster model for generating the plan and for critiquing the plan: 00:00:37.000 |
instead of using one of the newer, more expensive reasoning models the whole time, 00:00:41.000 |
we generate a smaller plan with the cheap model and reserve the stronger reasoning model for the steps that need it. 00:00:48.000 |
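(A minimal sketch of what that plan/critique/execute split could look like, assuming the OpenAI Python SDK and illustrative model names; the actual models and prompts from the talk may differ.)

```python
# Sketch of the split described above: a weaker, faster model drafts and
# critiques the plan, while the stronger reasoning model is reserved for the
# steps that need it. Model names and prompts are assumptions.
from openai import OpenAI

client = OpenAI()

def make_plan(question: str) -> str:
    # The cheap, fast model drafts the research plan...
    draft = client.responses.create(
        model="gpt-4o-mini",  # assumption: any small, fast model works here
        input=f"Draft a short, specific research plan for: {question}",
    ).output_text
    # ...and the same cheap model critiques and tightens it.
    return client.responses.create(
        model="gpt-4o-mini",
        input=f"Critique this plan and rewrite it to be more specific:\n{draft}",
    ).output_text

def execute_step(step: str) -> str:
    # Spend the expensive reasoning model only on the actual research steps.
    return client.responses.create(model="o1", input=step).output_text
```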
After we built those, the first thing we wanted to know was: how do we know which one is better? 00:00:53.000 |
Before we built any evals or any metrics, we spent a lot of time looking at the traces 00:00:57.000 |
to see which trajectories looked okay and which ones were better. 00:01:01.000 |
What we found initially is that o1 produced much more specific plans. 00:01:06.000 |
So you can see here on the left we have o1, and on the right we have 4o. 00:01:14.000 |
The o1 plan has around 600 tokens, versus only a fraction of that for 4o. 00:01:20.000 |
If you zoom in on this plan, you can see, for a specific question, 00:01:26.000 |
the difference in quality and specificity between the o1 and 4o plans. 00:01:32.000 |
And so what we found from this plan is that this increase in specificity 00:01:50.000 |
and accuracy in plan generation helps the agent improve outcomes downstream. 00:01:59.000 |
But after we did this kind of manual audit, we started to build an actual eval. 00:02:03.000 |
So we started with just accuracy, which was just the percentage of questions answered correctly. 00:02:11.000 |
I hand-labeled probably 500 examples across 100 companies for the dataset. 00:02:17.000 |
And in putting these datasets together, 00:02:20.000 |
we focused on the questions customers actually use the agent for 00:02:22.000 |
and the kinds of attributes they were researching, 00:02:26.000 |
so that we could run these datasets through the agents 00:02:29.000 |
and grade the answers that come back. 00:02:33.000 |
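(A simple version of that accuracy eval could look like the following; the JSONL dataset format, the `run_agent` callable, and exact-match grading are illustrative assumptions, not the actual harness.)

```python
# Minimal accuracy eval over a hand-labeled dataset of (company, question,
# answer) examples. Dataset format and grading are assumptions.
import json

def grade(predicted: str, expected: str) -> bool:
    # Naive grading: normalize and compare exactly. Free-form answers would
    # likely need an LLM-as-judge instead.
    return predicted.strip().lower() == expected.strip().lower()

def run_eval(dataset_path: str, run_agent) -> float:
    correct = total = 0
    with open(dataset_path) as f:
        for line in f:
            ex = json.loads(line)  # {"company": ..., "question": ..., "answer": ...}
            prediction = run_agent(ex["company"], ex["question"])
            correct += grade(prediction, ex["answer"])
            total += 1
    return correct / total  # accuracy: percentage of questions answered correctly
```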
And we pit the Connor agent and the Sam agent head-to-head, 00:02:38.000 |
and Connor beat out Sam in most of the categories, 00:02:42.000 |
with the difference increasing on the harder tasks. 00:02:47.000 |
So here's what we learned from the evals we initially developed. 00:02:49.000 |
So like I mentioned earlier, these reasoning models have an outsize impact 00:02:53.000 |
on downstream actions and downstream accuracy, 00:02:56.000 |
and you can see that reflected in the accuracy on our eval. 00:03:03.000 |
But beyond that, we still weren't able to triage failures, 00:03:06.000 |
because we didn't have a clear slice into where to go from here. 00:03:09.000 |
So we knew these were the vectors for improving these agents. 00:03:13.000 |
And so we thought about three or four axes for how we improve these agents. 00:03:19.000 |
One is changing the graph architecture; the others are the prompts, the models, and the tools. 00:03:26.000 |
So we spent a lot of time reflecting on customer use cases 00:03:29.000 |
and customer needs, figuring out which changes would bring the agents 00:03:32.000 |
closer to what needed to happen, 00:03:37.000 |
and landed on the first two areas we wanted to invest in for improvements. 00:03:42.000 |
So I'll talk a little bit about some of the learnings we had. 00:03:48.000 |
So the first thing we wanted to do was optimize the performance of the prompts. 00:04:00.000 |
There always seem to be new models coming out, 00:04:04.000 |
so you're often just plugging in a new model to see how it does. 00:04:08.000 |
And what was interesting is that we didn't really see a huge difference in quality. 00:04:13.000 |
The only model we ended up replacing was o1 with 4.1: 00:04:20.000 |
the agent prompts that used to cost around 35 cents 00:04:23.000 |
now cost around 10 cents on 4.1 with similar performance. 00:04:27.000 |
You can see here the number of other models we tried 00:04:31.000 |
and why 4.1 is the best or most cost-effective model for planning in our agent. 00:04:41.000 |
And one thing, I guess, with DeepSeek was that the quality looked promising, 00:04:46.000 |
but it was only probably a week or two ago 00:04:49.000 |
that the latency came down to an acceptable threshold. 00:04:54.000 |
The other thing we came across was date formatting: 00:05:01.000 |
the model struggled to correctly identify which dates were in the future. 00:05:07.000 |
With something like 5/14/2025 versus May 15th, 2024, 00:05:14.000 |
it actually thought that the 2024 date was in the future as well. 00:05:19.000 |
So we made some adjustments to the prompt, 00:05:21.000 |
just by providing several different formats of the current date, 00:05:24.000 |
which improved performance and standardized accuracy on date questions. 00:05:30.000 |
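(One low-tech version of that fix: render today's date in several formats directly in the system prompt. The wording below is an assumption, not the actual prompt.)

```python
# Give the model "today" in multiple formats so it can resolve dates like
# "5/14/2025" vs "May 15th, 2024" consistently. Prompt wording is illustrative.
from datetime import date

def date_context(today: date | None = None) -> str:
    today = today or date.today()
    return (
        f"Today's date is {today.isoformat()} "        # e.g. 2025-05-14
        f"({today.strftime('%m/%d/%Y')}, "             # e.g. 05/14/2025
        f"{today.strftime('%B %d, %Y')}). "            # e.g. May 14, 2025
        "Any date later than this is in the future."
    )

system_prompt = date_context() + "\nYou are a research agent..."
```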
The last main thing was how we do tool calling. 00:05:36.000 |
What we saw was that agents were making pretty vague tool calls, 00:05:38.000 |
something like searching for 'B2B' just generally. 00:05:41.000 |
So what we did is we ended up changing a lot of the Pydantic models 00:05:45.000 |
for these tools' input schemas, to force the tool-calling node 00:05:52.000 |
to think a little bit more about what it was searching for. 00:05:55.000 |
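(In Pydantic terms, the change is roughly to replace a bare query string with required, well-described fields that force the model to commit to specifics. These field names are illustrative, not the actual schema.)

```python
# Illustrative input schema for a search tool. Required, well-described fields
# push the model away from vague calls like searching "B2B" generally.
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(
        description=(
            "A fully specified search query. Include the company name and the "
            "exact fact to verify; never a bare keyword like 'B2B'."
        ),
        min_length=10,
    )
    objective: str = Field(
        description="The question this search is meant to answer, in one sentence."
    )
```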
So just overall, across the prompt and model changes, 00:06:01.000 |
we cut costs significantly, because we were able to swap o1 for 4.1 for planning, 00:06:16.000 |
and in a final pass over the traces, with some light human evals, 00:06:20.000 |
the quality seemed comparable. 00:06:25.000 |
We also learned models tend to spike in different use cases. 00:06:42.000 |
So whole customer use cases can be unlocked by adding just a single tool. 00:06:48.000 |
For the first of the tools we decided to add, the motivation 00:07:13.000 |
is that internet search is still hard: between SEO articles and Google rankings, 00:07:17.000 |
as well as search grounding and things like that, 00:07:22.000 |
you're left with a lot of the result quality in your own hands. 00:07:34.000 |
So think about how we conduct research on the internet ourselves. 00:07:43.000 |
We're pretty good at doing internet research, 00:07:45.000 |
and we do it fundamentally differently than the agents were doing it: 00:07:51.000 |
you might search a query and look over the top ten links, 00:07:54.000 |
implicitly filter out probably five of those just based on the source, 00:07:59.000 |
and maybe read through a couple sentences on each 00:08:01.000 |
before deciding that you need to do a different search query. 00:08:06.000 |
So we saw that our agents were not mimicking this common behavior 00:08:09.000 |
and wanted to adjust the search tool to improve the result quality. 00:08:13.000 |
So we upgraded our initial search tool, 00:08:17.000 |
which initially had a very naive structure. 00:08:21.000 |
We flipped that to include a bunch of other arguments, 00:08:24.000 |
things like the category, whether you want a live crawl, 00:08:29.000 |
and also being able to restrict the domain. 00:08:34.000 |
We also changed the trajectory of a Google search, 00:08:37.000 |
from first reviewing just the preview from the internet search output 00:08:43.000 |
to getting both the URL and the actual page content in one tool call. 00:08:48.000 |
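(A sketch of that richer search tool. The parameters mirror an Exa-style search API, which is an assumption here; the key point is structured arguments plus URL and page content returned in one call.)

```python
# Upgraded search tool: structured arguments instead of a bare query, and
# URL + full page text returned together so the agent never answers from a
# preview alone. Exa-style parameters are an assumption.
from exa_py import Exa

exa = Exa(api_key="...")

def search_internet(
    query: str,
    category: str | None = None,             # e.g. "news" or "company"
    include_domains: list[str] | None = None,
    livecrawl: str = "fallback",              # fetch the live page if the cache is stale
) -> list[dict]:
    kwargs = {"text": True, "num_results": 10, "livecrawl": livecrawl}
    if category:
        kwargs["category"] = category
    if include_domains:
        kwargs["include_domains"] = include_domains
    response = exa.search_and_contents(query, **kwargs)
    return [
        {"url": r.url, "title": r.title, "content": r.text}
        for r in response.results
    ]
```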
So what this allows us to do is pull in all this content at once, 00:08:55.000 |
and we can sidestep the issue that we were seeing with agents 00:08:58.000 |
picking an answer just based on the Google search preview, 00:09:01.000 |
which, as we know, isn't always reliable or accurate. 00:09:05.000 |
So the second new tool we built was browser access. 00:09:10.000 |
There's a lot of rich data online that scraping often fails to capture. 00:09:14.000 |
Between interactive online data sources and datasets, 00:09:23.000 |
you can't really capture that content in a scrape. 00:09:26.000 |
So we wanted to allow our agent to use the browser the same way we would. 00:09:31.000 |
So we built browser access as a sub-agent. 00:09:34.000 |
We gave the main agent this tool, which is basically browser access, 00:09:39.000 |
and what it does is decompose the task into a browser trajectory 00:09:43.000 |
using a smaller model, and then use computer-use-preview to actually act on it. 00:09:47.000 |
We evaluated browser-use, the open-source alternative, 00:09:50.000 |
and we found that while it was marginally faster, 00:09:55.000 |
it wasn't as reliable, which led us to use computer-use-preview instead. 00:10:01.000 |
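(A sketch of how such a browser sub-agent tool could be wired up, based only on the description above. The planner model choice, the elided action-execution loop, and the helper name are assumptions.)

```python
# Hypothetical browser sub-agent tool: a small model decomposes the task into
# a browser trajectory, and OpenAI's computer-use-preview model issues the
# low-level actions. The execution loop is elided.
from openai import OpenAI

client = OpenAI()

def browser_access(task: str) -> str:
    """Tool exposed to the research agent: answer `task` by driving a browser."""
    # Step 1: decompose the task into concrete browser steps with a small model.
    plan = client.responses.create(
        model="gpt-4o-mini",  # assumption: any small, fast planner
        input=f"Break this research task into concrete browser steps: {task}",
    ).output_text

    # Step 2: hand the trajectory to the computer-use model, which emits
    # click/type/scroll actions against a browser session.
    response = client.responses.create(
        model="computer-use-preview",
        tools=[{
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser",
        }],
        input=plan,
        truncation="auto",  # required for computer-use-preview
    )
    # A real loop would execute each returned computer_call in a browser
    # (e.g. via Playwright), screenshot the page, and send the image back
    # until the model produces a final answer.
    return response.output_text
```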
So here's an example where we try to find out if a company has EV charging on site. 00:10:05.000 |
And so eventually it ends up using the browser tool, 00:10:13.000 |
going to Google Maps and looking for an EV charging station in their parking lot, 00:10:16.000 |
and then also flipping to another tab in the browser to double-check for EV charging. 00:10:24.000 |
And it does actually confirm, between Google Maps and that page, that there is an EV charging station. 00:10:33.000 |
And one thing we learned is that you can't use a blanket approach to internet search; 00:10:39.000 |
you still need to empower your agents to actually look at the data. 00:10:46.000 |
This pivot to fetching and pulling in full content at once 00:10:51.000 |
massively reduced the amount of misinterpretation we had in internet search. 00:10:57.000 |
And these other tools, like browser access and searching HTML, 00:11:00.000 |
unlock entirely new use cases for our agent. 00:11:18.000 |
Going forward, we want to invest a little bit more time in evals 00:11:20.000 |
to actually highlight some of the issues that I talked about, 00:11:25.000 |
and to make this process a little more repeatable and scalable. 00:11:31.000 |
We are solving a lot of interesting agent problems, 00:11:35.000 |
so if you also want your name in our code base as an agent, come talk to us. 00:12:13.000 |
Next up, Eno will be sharing insights on challenges with agentic systems. 00:12:33.000 |
My name is Eno, co-founder and CTO of a company called Factory. 00:12:38.000 |
At Factory, we believe that the way we build software is radically changing. 00:12:44.000 |
We are transitioning from the era of human-driven software development