
LangChain Interrupt 2025 Learnings from Building AI Research Agents – Connor Heggie



00:00:00.760 | I don't know if you've been following along by now, but this is a demo that is happening live in the background
00:00:07.700 | as part of the agent project.
00:00:10.100 | So the Connor and Sam agents go through the same kind of framework, and they also go through a three- to four-tool system:
00:00:15.600 | searching the internet, searching a website, and scraping a website.
00:00:20.200 | So you can see here on the left we have SamBot R1, and then all the agents here,
00:00:26.000 | and you can see the differences in the architecture.
00:00:30.000 | The key difference here is that one uses weaker and faster models for generating and parsing the plan,
00:00:37.000 | while the other uses some of the newer reasoning models, o1 at the time,
00:00:41.000 | to generate a more detailed plan up front, and then uses cheaper models to execute on that plan.
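
As a rough illustration of the split described above, here is a minimal plan-and-execute sketch. This is not the speakers' actual agent code: the model names, prompts, and step-splitting are assumptions, and the real agents would call tools at each step.

```python
# Minimal plan-and-execute sketch (illustrative only): a reasoning model writes
# the research plan once, and a cheaper, faster model executes each step.
from langchain_openai import ChatOpenAI

planner = ChatOpenAI(model="o1")            # stronger reasoning model for planning
executor = ChatOpenAI(model="gpt-4o-mini")  # weaker/faster model for execution

def research(question: str) -> str:
    plan = planner.invoke(
        f"Write a short, numbered research plan to answer: {question}"
    ).content
    notes = []
    for step in (s for s in plan.splitlines() if s.strip()):
        # The real agents would call tools here (internet search, website
        # search, scraping); this sketch only asks the executor to reason.
        notes.append(executor.invoke(
            f"Question: {question}\nPlan step: {step}\n"
            "Carry out this step and report what you find or conclude."
        ).content)
    return executor.invoke(
        f"Question: {question}\nNotes from each plan step:\n"
        + "\n".join(notes) + "\nGive a final answer."
    ).content
```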
00:00:48.000 | After we'd built those, the first thing we wanted to know is: how do we know which agent is better?
00:00:53.000 | Before we built any evals or any metrics, we spent a lot of time looking at the traces
00:00:57.000 | to see which trajectories looked okay and which ones were better.
00:01:01.000 | What we found initially is that o1 produced much more detailed research plans.
00:01:06.000 | So you can see here the 4o plan and the o1 plan side by side.
00:01:11.000 | Both were generated for the same task.
00:01:14.000 | The o1 plan runs to far more tokens, versus only around 600 for 4o.
00:01:20.000 | If you zoom in on this plan, you can see, for a specific question,
00:01:24.000 | one of the questions in the plan,
00:01:26.000 | the difference in quality and specificity between the 4o and o1 outputs.
00:01:32.000 | And so what we found from this is that this increase in specificity
00:01:50.000 | and accuracy in the generated plan helps the agent improve outcomes downstream
00:01:56.000 | of this initial planning phase.
00:01:59.000 | But after we did this kind of vibe check, we started to build an actual eval.
00:02:03.000 | So we started with just accuracy, which was just the percentage of questions answered correctly,
00:02:08.000 | and then hand-labeled a bunch of datasets.
00:02:11.000 | So I hand-labeled a dataset of probably 500 rows across about 100 companies.
00:02:17.000 | And we tried to make these datasets representative,
00:02:20.000 | covering the things customers actually use the agent for.
00:02:22.000 | So things like the kind of information they want to retrieve and the type of question,
00:02:26.000 | so that we could run these datasets through the agents
00:02:29.000 | and see how they get answers for those kinds of questions.
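
A minimal sketch of that first accuracy-only eval might look like the following. The field names and the exact-match grading are assumptions for illustration, not the actual harness.

```python
# Accuracy = percentage of questions answered correctly over a hand-labeled
# dataset. Field names ("company", "question", "expected") are illustrative.
import json

def run_accuracy_eval(agent, dataset_path: str) -> float:
    with open(dataset_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    correct = 0
    for row in rows:
        answer = agent(row["company"], row["question"])
        # Naive exact-match grading; free-form answers usually need an
        # LLM judge or normalization instead.
        if answer.strip().lower() == row["expected"].strip().lower():
            correct += 1
    return correct / len(rows)

# e.g. accuracy = run_accuracy_eval(my_agent, "hand_labeled_companies.jsonl")
```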
00:02:33.000 | And we pit ConnorAgent and SamBot R1 head-to-head,
00:02:36.000 | and ConnorAgent came out on top.
00:02:38.000 | It beat out SamBot in most of the categories,
00:02:42.000 | with the difference increasing with the difficulty of these tasks.
00:02:47.000 | So we learned a few things from the evals we initially developed.
00:02:49.000 | Like I mentioned earlier, these reasoning models have an outsized impact
00:02:53.000 | on downstream actions and downstream accuracy,
00:02:56.000 | and we also learned that an accuracy-style eval is a solid foundation
00:03:00.000 | for how we evaluate an agent going forward.
00:03:03.000 | But beyond that, we actually still had to rely on reading traces,
00:03:06.000 | because we didn't have a clear sense of where to go from here.
00:03:09.000 | So we knew where these agents were falling short.
00:03:11.000 | So how do we improve the agents?
00:03:13.000 | And so we thought about three or four axes for how we improve these agents.
00:03:19.000 | One is changing the graph and the architecture.
00:03:22.000 | Two is changing the models and the prompts.
00:03:24.000 | And then three is adding more tools.
00:03:26.000 | So we spent a lot of time reflecting on customer use cases
00:03:29.000 | and customer needs to figure out which of these to prioritize
00:03:32.000 | and which would bring us closest to what customers need.
00:03:35.000 | Changing the models and the prompts, and adding tools,
00:03:37.000 | ended up being the first two areas we wanted to invest in for improvements.
00:03:42.000 | So I'll talk a little bit about some of the learnings we had
00:03:46.000 | from changing models and prompts.
00:03:48.000 | So the first thing we wanted to do was optimize the cost and performance of the planning step.
00:03:54.000 | o1 and the other reasoning models we use for plan creation —
00:03:55.000 | They're pretty expensive.
00:03:56.000 | They're also pretty slow.
00:03:58.000 | And also, at the start of this year,
00:04:00.000 | it seemed like new models were coming out
00:04:03.000 | like every week.
00:04:04.000 | So you're often just plugging in a new model,
00:04:06.000 | trying to see if it's better or not.
00:04:08.000 | And what was interesting is that we didn't really see a huge difference
00:04:11.000 | until 4.1 came out recently.
00:04:13.000 | And that's the only model we replaced o1 with
00:04:16.000 | in terms of planning.
00:04:18.000 | And the outcome of this change is that
00:04:20.000 | a ConnorAgent run that used to cost around 35 cents
00:04:23.000 | now costs around 10 cents with 4.1, with similar performance.
00:04:27.000 | You can see here a number of other models we tried
00:04:31.000 | and why 4.1 was the best and most cost-effective model for planning in our agent.
00:04:36.000 | And we even tried DeepSeek R1
00:04:38.000 | and a couple of the other recent models.
00:04:41.000 | And one interesting thing with DeepSeek was that
00:04:44.000 | when it came out it was really promising,
00:04:46.000 | but it was only until probably like a week or two ago
00:04:49.000 | that the latency came down to an acceptable threshold,
00:04:51.000 | similar to o1, for these tasks.
00:04:54.000 | The other thing we came across was date formatting,
00:04:58.000 | which was a pretty distinct failure mode,
00:04:59.000 | where we had a bunch of models fail
00:05:01.000 | to correctly identify which date was in the future
00:05:04.000 | just because of the format it was in.
00:05:06.000 | So we saw it with something like
00:05:07.000 | 05-14-2025 versus May 15th, 2024.
00:05:13.000 | Even though the first date is the one in the future,
00:05:14.000 | the model actually thought that the 2024 date was in the future as well.
00:05:19.000 | So we made some adjustments to the prompt,
00:05:21.000 | just by providing several different formats of the current date,
00:05:24.000 | which improved performance and standardized accuracy on the dates
00:05:28.000 | that are passed across the models.
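
A sketch of that date fix, assuming it simply means spelling out today's date in several formats in the system prompt; the exact prompt wording used is not shown in the talk.

```python
# Provide the current date in multiple formats so no single representation
# trips up the model when it compares dates. Prompt wording is illustrative.
from datetime import date

def current_date_block(today: date | None = None) -> str:
    today = today or date.today()
    return (
        f"Today's date is {today.isoformat()} (ISO 8601), "
        f"also written {today.strftime('%m/%d/%Y')} (US style) "
        f"or {today.strftime('%B %d, %Y')}. "
        "Any date earlier than this is in the past."
    )

# system_prompt = current_date_block() + "\n\n" + base_system_prompt
```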
00:05:30.000 | The last main thing was how we do tool calling.
00:05:33.000 | This is something we're still working on,
00:05:34.000 | but initially the huge problem
00:05:36.000 | was that agents were making really vague tool calls,
00:05:38.000 | something like searching for "B2B" just generally.
00:05:41.000 | So what we did is we ended up changing a lot of the Pydantic models
00:05:45.000 | for these tools' input schemas to force the tool-calling node,
00:05:49.000 | or tool-calling agent, to conform to the schema
00:05:52.000 | and think a little bit more about what it was searching for.
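
As an illustration of that kind of schema tightening, a Pydantic input model with required, well-described fields pushes the tool-calling model toward specific queries. The field names, constraints, and tool name here are assumptions, not the actual schemas.

```python
# Illustrative tool input schema: required, well-described fields discourage
# vague calls like a bare search for "B2B".
from pydantic import BaseModel, Field
from langchain_core.tools import tool

class SearchInput(BaseModel):
    query: str = Field(
        ...,
        min_length=10,
        description=(
            "A specific, self-contained search query, e.g. "
            "'Acme Corp pricing model 2024'. Never a single generic keyword."
        ),
    )
    reason: str = Field(
        ...,
        description="One sentence on how this search helps answer the user's question.",
    )

@tool("company_search", args_schema=SearchInput)
def company_search(query: str, reason: str) -> str:
    """Search the internet for a specific, well-scoped query."""
    ...  # call the real search provider here
    return f"(results for: {query})"
```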
00:05:55.000 | So just overall, across prompt and model changes,
00:05:58.000 | we learned that, you know,
00:05:59.000 | agent costs are coming down a lot,
00:06:01.000 | because we were able to swap o1 for 4.1 in planning
00:06:04.000 | with no real quality change.
00:06:07.000 | And also we learned about a ton of cases
00:06:09.000 | that our evals were not necessarily catching.
00:06:11.000 | And even if you build out a lot of evals,
00:06:14.000 | you're probably still going to find yourself
00:06:16.000 | digging through traces doing some kind of human eval
00:06:18.000 | or human vibe check.
00:06:20.000 | And that seemed notable from the model labs recently
00:06:22.000 | with their changes in the models as well.
00:06:25.000 | We also learned that models tend to spike on different use cases.
00:06:29.000 | The second axis we really wanted to invest in was tools,
00:06:42.000 | since customer use cases can be unlocked by adding just a single new tool.
00:06:48.000 | So the four tools that we decided to add on top of this
00:07:02.000 | are deep internet research, browser access,
00:07:05.000 | searching HTML, and dataset access.
00:07:08.000 | And so I'll walk through a couple of these.
00:07:11.000 | So why we started on deep internet research
00:07:13.000 | is that searching the internet is still hard: between SEO articles on Google,
00:07:17.000 | as well as search grounding and things like that,
00:07:22.000 | you're left with a lot of the result quality out of your hands.
00:07:26.000 | And we also saw in our usage
00:07:29.000 | that tool calls to the internet
00:07:32.000 | would not be utilized the way we would use them ourselves.
00:07:34.000 | So we looked at how we conduct research on the internet
00:07:38.000 | ourselves,
00:07:40.000 | and thought about how we do it today.
00:07:43.000 | So we're pretty good at doing internet research,
00:07:45.000 | and we do it fundamentally differently than how agents do it,
00:07:48.000 | or how our agents were doing it initially.
00:07:50.000 | When you do a search on Google,
00:07:51.000 | you might search for a query, scan over the top ten links,
00:07:54.000 | implicitly filter out probably five of those just based on the source,
00:07:57.000 | open a couple of these tabs,
00:07:59.000 | maybe you're going through a couple sentences on each
00:08:01.000 | before deciding that you need to do a different search query
00:08:04.000 | or that you found your answer.
00:08:06.000 | So we saw that our agents were not mimicking this common behavior
00:08:09.000 | and wanted to adjust the search tool to improve the result quality.
00:08:13.000 | So we upgraded from our initial input model,
00:08:17.000 | which initially was a very naive structure,
00:08:19.000 | just a single query term.
00:08:21.000 | We flipped that to include a bunch of other arguments,
00:08:24.000 | with things like category, whether you want a live crawl,
00:08:27.000 | whether to include the full text or a summary,
00:08:29.000 | and also things like restricting the domain
00:08:31.000 | and the published date.
00:08:32.000 | And so by changing all these parameters,
00:08:34.000 | we're changing the trajectory of the search as well,
00:08:37.000 | from first reviewing just the preview snippet from the internet search output,
00:08:42.000 | which is what we have on the left here,
00:08:43.000 | to getting both the URL and the actual page content in one tool call.
00:08:48.000 | So what this allows us to do is pull in all this content at once,
00:08:55.000 | and we can sidestep this issue that we were seeing with agents
00:08:58.000 | picking an answer just based on the Google search preview,
00:09:01.000 | which, as we know, isn't always reliable or accurate.
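
A hypothetical tool signature along those lines: the parameter names mirror the arguments described above, and `run_provider_search` is a stand-in for whichever search API is used, not a real SDK call.

```python
# Richer search tool: the agent controls category, live crawl, content type,
# domain and date filters, and gets URLs plus page content in one call.
from typing import Optional
from langchain_core.tools import tool

def run_provider_search(query: str, **filters) -> list[dict]:
    """Stand-in for the underlying search provider's API."""
    raise NotImplementedError

@tool
def internet_search(
    query: str,
    category: Optional[str] = None,           # e.g. "news" or "company"
    livecrawl: bool = False,                   # fetch the live page, not a cached copy
    include_full_text: bool = True,            # page text rather than just a preview snippet
    include_domains: Optional[list[str]] = None,
    published_after: Optional[str] = None,     # ISO date, e.g. "2024-01-01"
) -> list[dict]:
    """Search the web and return URLs plus page content in a single tool call."""
    return run_provider_search(
        query,
        category=category,
        livecrawl=livecrawl,
        include_full_text=include_full_text,
        include_domains=include_domains,
        published_after=published_after,
    )
```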
00:09:05.000 | So the second new tool we built was browser access.
00:09:08.000 | So why did we do it?
00:09:10.000 | There's a lot of rich data online that scraping just fails to capture.
00:09:14.000 | Between online data sources or datasets
00:09:18.000 | where you already have to enter a query,
00:09:20.000 | interactive search experiences,
00:09:21.000 | and even things like Google Maps or images,
00:09:23.000 | you can't really capture that content by scraping.
00:09:26.000 | So we wanted to allow the agent to use the browser the same way we would.
00:09:31.000 | So we built browser access as a sub-agent.
00:09:34.000 | We gave the agent this tool, which is basically browser access,
00:09:39.000 | and what it does is decompose the task into a browser trajectory
00:09:43.000 | using a smaller mini model, and it also uses computer-use-preview to actually act on that.
00:09:47.000 | We evaluated Browser Use, the open-source alternative,
00:09:50.000 | and we found that while it was marginally faster,
00:09:53.000 | it struggled on more complex browser paths,
00:09:55.000 | which led us to use computer-use-preview instead.
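
A rough sketch of that sub-agent pattern, under stated assumptions: the model choice, prompts, and the `execute_browser_step` placeholder are all illustrative, and the actual implementation drives a real browser through the computer-use-preview model.

```python
# Browser-access sub-agent sketch: a small model decomposes the task into a
# browser trajectory, and each step is handed to a computer-use execution loop.
from langchain_openai import ChatOpenAI

step_planner = ChatOpenAI(model="gpt-4o-mini")  # illustrative choice of "mini" model

def execute_browser_step(step: str) -> str:
    """Placeholder for the computer-use loop (e.g. computer-use-preview
    taking screenshots and issuing clicks/keystrokes in a real browser)."""
    raise NotImplementedError

def browser_access(task: str) -> str:
    plan = step_planner.invoke(
        "Break this task into a short numbered list of concrete browser actions:\n" + task
    ).content
    observations = [execute_browser_step(s) for s in plan.splitlines() if s.strip()]
    return step_planner.invoke(
        f"Task: {task}\nObservations from the browser:\n"
        + "\n".join(observations) + "\nAnswer the task based on what was observed."
    ).content
```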
00:09:59.000 | You can see an example of this here,
00:10:01.000 | where we try to find if Google has EV charging on site.
00:10:05.000 | And so eventually it ends up using the browser tool,
00:10:09.000 | it goes to Google Maps,
00:10:10.000 | it ends up using Street View,
00:10:12.000 | going through,
00:10:13.000 | looking for an EV charging station in their parking lot,
00:10:16.000 | and then also flipping to another tab in the browser to check to see if it has EV charging.
00:10:22.000 | And on that last page there,
00:10:24.000 | it does actually confirm between Google Maps and that page that there is an EV charging station.
00:10:30.000 | So we learned a lot from these tools,
00:10:33.000 | and one thing we learned is that you can't use a blanket approach to internet search.
00:10:37.000 | Internet search and Google are great,
00:10:39.000 | but you still need to empower your agents to be able to look at the data,
00:10:42.000 | ingest the right content into the context,
00:10:44.000 | and then actually reason off of that context.
00:10:46.000 | This pivot to chaining search and pulling in content at once
00:10:51.000 | massively reduced the amount of misinterpretation we had in internet search
00:10:55.000 | and changed how we were doing search.
00:10:57.000 | And these other tools, like browser access and searching HTML,
00:11:00.000 | open up entirely new use cases for our agent.
00:11:03.000 | So as a result,
00:11:05.000 | the new agent, the new champion
00:11:07.000 | we have in the product, is called BrowserAgent,
00:11:09.000 | which, as you can see, is a bit of a cop-out on the name,
00:11:11.000 | or on the theme of naming our agents.
00:11:13.000 | So a couple of quick next steps:
00:11:16.000 | based on these changes in tools,
00:11:18.000 | we want to invest a little bit more time in evals
00:11:20.000 | to actually highlight some of these issues that we found
00:11:22.000 | just by looking at traces
00:11:24.000 | and looking at outputs,
00:11:25.000 | to make this process a little more repeatable and scalable.
00:11:30.000 | Awesome.
00:11:31.000 | We are solving a lot of interesting agent problems,
00:11:35.000 | so if you also want your name in our code base as an agent,
00:11:39.000 | reach out to us after, at 4:59.
00:11:41.000 | We are hiring tons of engineers.
00:11:43.000 | Thank you guys.
00:11:45.000 | Thank you.
00:12:00.000 | All right.
00:12:05.000 | Thank you, Conor and Pinal.
00:12:09.000 | So next up is Eno Reyes.
00:12:11.000 | He's the co-founder and CTO of Factory.
00:12:13.000 | Eno will be sharing insights on challenges with agentic systems
00:12:17.000 | and how Factory has been navigating them.
00:12:20.000 | Please welcome Eno.
00:12:21.000 | Hey everybody.
00:12:33.000 | My name is Eno, co-founder and CTO of a company called Factory.
00:12:38.000 | At Factory, we believe that the way we build software is radically changing.
00:12:44.000 | We are transitioning from the era of human-driven software development
00:12:49.000 | to agent-driven software development.
00:12:52.000 | You can see glimpses of that today.
00:12:55.000 | However, it seems to me,