
LangChain Interrupt 2025 Learnings from Building AI Research Agents – Connor Heggie



00:00:00.760 | I don't know if you've been following along by now, but this is a demo that is happening live in the background
00:00:07.700 | as part of the agent project.
00:00:10.100 | So the Connor and Sam agents go through the same kind of framework, and they also go through a three- to four-tool system:
00:00:15.600 | searching the internet, searching a website, and scraping a website.
00:00:20.200 | So you can see here on the left we have SamBot R1, and then all the agents here,
00:00:26.000 | and you can see the differences in the architecture.
00:00:30.000 | The key difference here is that one uses weaker and faster models for generating and parsing the plan,
00:00:37.000 | while the other uses some of the newer reasoning models, o1 at the time,
00:00:41.000 | to generate a more detailed plan up front, and then uses cheaper models to execute on that plan.
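
As a rough illustration of the split described above, here is a minimal plan-and-execute sketch. This is not the speakers' actual agent code: the model names, prompts, and step-splitting are assumptions, and the real agents would call tools at each step.

```python
# Minimal plan-and-execute sketch (illustrative only): a reasoning model writes
# the research plan once, and a cheaper, faster model executes each step.
from langchain_openai import ChatOpenAI

planner = ChatOpenAI(model="o1")            # stronger reasoning model for planning
executor = ChatOpenAI(model="gpt-4o-mini")  # weaker/faster model for execution

def research(question: str) -> str:
    plan = planner.invoke(
        f"Write a short, numbered research plan to answer: {question}"
    ).content
    notes = []
    for step in (s for s in plan.splitlines() if s.strip()):
        # The real agents would call tools here (internet search, website
        # search, scraping); this sketch only asks the executor to reason.
        notes.append(executor.invoke(
            f"Question: {question}\nPlan step: {step}\n"
            "Carry out this step and report what you find or conclude."
        ).content)
    return executor.invoke(
        f"Question: {question}\nNotes from each plan step:\n"
        + "\n".join(notes) + "\nGive a final answer."
    ).content
```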
00:00:48.000 | After we'd built those, the first thing we wanted to know is: how do we know which agent is better?
00:00:53.000 | Before we built any evals or any metrics, we spent a lot of time looking at the traces
00:00:57.000 | to see which trajectories looked okay and which ones were better.
00:01:01.000 | What we found initially is that o1 produced much more detailed research plans.
00:01:06.000 | So you can see here the 4o plan and the o1 plan side by side.
00:01:11.000 | Both were generated for the same task.
00:01:14.000 | The o1 plan runs to far more tokens, versus only around 600 for 4o.
00:01:20.000 | If you zoom in on this plan, you can see, for a specific question,
00:01:24.000 | one of the questions in the plan,
00:01:26.000 | the difference in quality and specificity between the 4o and o1 outputs.
00:01:32.000 | And so what we found from this is that this increase in specificity
00:01:50.000 | and accuracy in the generated plan helps the agent improve outcomes downstream
00:01:56.000 | of this initial planning phase.
00:01:59.000 | But after we did this kind of vibe check, we started to build an actual eval.
00:02:03.000 | So we started with just accuracy, which was just the percentage of questions answered correctly,
00:02:08.000 | and then hand-labeled a bunch of datasets.
00:02:11.000 | So I hand-labeled a dataset of probably 500 rows across about 100 companies.
00:02:17.000 | And we tried to make these datasets representative,
00:02:20.000 | covering the things customers actually use the agent for.
00:02:22.000 | So things like the kind of information they want to retrieve and the type of question,
00:02:26.000 | so that we could run these datasets through the agents
00:02:29.000 | and see how they get answers for those kinds of questions.
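
A minimal sketch of that first accuracy-only eval might look like the following. The field names and the exact-match grading are assumptions for illustration, not the actual harness.

```python
# Accuracy = percentage of questions answered correctly over a hand-labeled
# dataset. Field names ("company", "question", "expected") are illustrative.
import json

def run_accuracy_eval(agent, dataset_path: str) -> float:
    with open(dataset_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    correct = 0
    for row in rows:
        answer = agent(row["company"], row["question"])
        # Naive exact-match grading; free-form answers usually need an
        # LLM judge or normalization instead.
        if answer.strip().lower() == row["expected"].strip().lower():
            correct += 1
    return correct / len(rows)

# e.g. accuracy = run_accuracy_eval(my_agent, "hand_labeled_companies.jsonl")
```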
00:02:33.000 | And we pit ConnorAgent and SamBot R1 head-to-head,
00:02:36.000 | and ConnorAgent came out on top.
00:02:38.000 | It beat out SamBot in most of the categories,
00:02:42.000 | with the difference increasing with the difficulty of these tasks.
00:02:47.000 | So we learned a few things from the evals we initially developed.
00:02:49.000 | Like I mentioned earlier, these reasoning models have an outsized impact
00:02:53.000 | on downstream actions and downstream accuracy,
00:02:56.000 | and we also learned that an accuracy-style eval is a solid foundation
00:03:00.000 | for how we evaluate an agent going forward.
00:03:03.000 | But beyond that, we actually still had to rely on reading traces,
00:03:06.000 | because we didn't have a clear sense of where to go from here.
00:03:09.000 | So we knew where these agents were falling short.
00:03:11.000 | So how do we improve the agents?
00:03:13.000 | And so we thought about three or four axes for how we improve these agents.
00:03:19.000 | One is changing the graph and the architecture.
00:03:22.000 | Two is changing the models and the prompts.
00:03:24.000 | And then three is adding more tools.
00:03:26.000 | So we spent a lot of time reflecting on customer use cases
00:03:29.000 | and customer needs to figure out which of these to prioritize
00:03:32.000 | and which would bring us closest to what customers need.
00:03:35.000 | Changing the models and the prompts, and adding tools,
00:03:37.000 | ended up being the first two areas we wanted to invest in for improvements.
00:03:42.000 | So I'll talk a little bit about some of the learnings we had
00:03:46.000 | from changing models and prompts.
00:03:48.000 | So the first thing we wanted to do was optimize the cost and performance of the planning step.
00:03:54.000 | o1 and the other reasoning models we use for plan creation —
00:03:55.000 | They're pretty expensive.
00:03:56.000 | They're also pretty slow.
00:03:58.000 | And also, at the start of this year,
00:04:00.000 | it seemed like new models were coming out
00:04:03.000 | like every week.
00:04:04.000 | So you're often just plugging in a new model,
00:04:06.000 | trying to see if it's better or not.
00:04:08.000 | And what was interesting is that we didn't really see a huge difference
00:04:11.000 | until 4.1 came out recently.
00:04:13.000 | And that's the only model we replaced o1 with
00:04:16.000 | in terms of planning.
00:04:18.000 | And the outcome of this change is that
00:04:20.000 | a ConnorAgent run that used to cost around 35 cents
00:04:23.000 | now costs around 10 cents with 4.1, with similar performance.
00:04:27.000 | You can see here a number of other models we tried
00:04:31.000 | and why 4.1 was the best and most cost-effective model for planning in our agent.
00:04:36.000 | And we even tried DeepSeek R1
00:04:38.000 | and a couple of the other recent models.
00:04:41.000 | And one interesting thing with DeepSeek was that
00:04:44.000 | when it came out it was really promising,
00:04:46.000 | but it was only until probably like a week or two ago
00:04:49.000 | that the latency came down to an acceptable threshold,
00:04:51.000 | similar to o1, for these tasks.
00:04:54.000 | The other thing we came across was date formatting,
00:04:58.000 | which was a pretty distinct failure mode,
00:04:59.000 | where we had a bunch of models fail
00:05:01.000 | to correctly identify which date was in the future
00:05:04.000 | just because of the format it was in.
00:05:06.000 | So we saw it with something like
00:05:07.000 | 05-14-2025 versus May 15th, 2024.
00:05:13.000 | Even though the first date is the one in the future,
00:05:14.000 | the model actually thought that the 2024 date was in the future as well.
00:05:19.000 | So we made some adjustments to the prompt,
00:05:21.000 | just by providing several different formats of the current date,
00:05:24.000 | which improved performance and standardized accuracy on the dates
00:05:28.000 | that are passed across the models.
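
A sketch of that date fix, assuming it simply means spelling out today's date in several formats in the system prompt; the exact prompt wording used is not shown in the talk.

```python
# Provide the current date in multiple formats so no single representation
# trips up the model when it compares dates. Prompt wording is illustrative.
from datetime import date

def current_date_block(today: date | None = None) -> str:
    today = today or date.today()
    return (
        f"Today's date is {today.isoformat()} (ISO 8601), "
        f"also written {today.strftime('%m/%d/%Y')} (US style) "
        f"or {today.strftime('%B %d, %Y')}. "
        "Any date earlier than this is in the past."
    )

# system_prompt = current_date_block() + "\n\n" + base_system_prompt
```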
00:05:30.000 | The last main thing was how we do tool calling.
00:05:33.000 | This is something we're still working on,
00:05:34.000 | but initially the huge problem
00:05:36.000 | was that agents were making really vague tool calls,
00:05:38.000 | something like searching for "B2B" just generally.
00:05:41.000 | So what we did is we ended up changing a lot of the Pydantic models
00:05:45.000 | for these tools' input schemas to force the tool-calling node,
00:05:49.000 | or tool-calling agent, to conform to the schema
00:05:52.000 | and think a little bit more about what it was searching for.
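
As an illustration of that kind of schema tightening, a Pydantic input model with required, well-described fields pushes the tool-calling model toward specific queries. The field names, constraints, and tool name here are assumptions, not the actual schemas.

```python
# Illustrative tool input schema: required, well-described fields discourage
# vague calls like a bare search for "B2B".
from pydantic import BaseModel, Field
from langchain_core.tools import tool

class SearchInput(BaseModel):
    query: str = Field(
        ...,
        min_length=10,
        description=(
            "A specific, self-contained search query, e.g. "
            "'Acme Corp pricing model 2024'. Never a single generic keyword."
        ),
    )
    reason: str = Field(
        ...,
        description="One sentence on how this search helps answer the user's question.",
    )

@tool("company_search", args_schema=SearchInput)
def company_search(query: str, reason: str) -> str:
    """Search the internet for a specific, well-scoped query."""
    ...  # call the real search provider here
    return f"(results for: {query})"
```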
00:05:55.000 | So just overall, across prompt and model changes,
00:05:58.000 | we learned that, you know,
00:05:59.000 | agent costs are coming down a lot,
00:06:01.000 | because we were able to swap o1 for 4.1 in planning
00:06:04.000 | with no real quality change.
00:06:07.000 | And also we learned about a ton of cases
00:06:09.000 | that our evals were not necessarily catching.
00:06:11.000 | And even if you build out a lot of evals,
00:06:14.000 | you're probably still going to find yourself
00:06:16.000 | digging through traces doing some kind of human eval
00:06:18.000 | or human vibe check.
00:06:20.000 | And that seemed notable from the model labs recently
00:06:22.000 | with their changes in the models as well.
00:06:25.000 | We also learned that models tend to spike on different use cases.
00:06:29.000 | The second axis we really wanted to invest in was tools,
00:06:42.000 | since customer use cases can be unlocked by adding just a single new tool.
00:06:48.000 | So the four tools that we decided to add on top of this
00:07:02.000 | are deep internet research, browser access,
00:07:05.000 | searching HTML, and dataset access.
00:07:08.000 | And so I'll walk through a couple of these.
00:07:11.000 | So why we started on deep internet research
00:07:13.000 | is that searching the internet is still hard: between SEO articles on Google,
00:07:17.000 | as well as search grounding and things like that,
00:07:22.000 | you're left with a lot of the result quality out of your hands.
00:07:26.000 | And we also saw in our usage
00:07:29.000 | that tool calls to the internet
00:07:32.000 | would not be utilized the way we would use them ourselves.
00:07:34.000 | So we looked at how we conduct research on the internet
00:07:38.000 | ourselves,
00:07:40.000 | and thought about how we do it today.
00:07:43.000 | So we're pretty good at doing internet research,
00:07:45.000 | and we do it fundamentally differently than how agents do it,
00:07:48.000 | or how our agents were doing it initially.
00:07:50.000 | When you do a search on Google,
00:07:51.000 | you might search for a query, scan over the top ten links,
00:07:54.000 | implicitly filter out probably five of those just based on the source,
00:07:57.000 | open a couple of these tabs,
00:07:59.000 | maybe you're going through a couple sentences on each
00:08:01.000 | before deciding that you need to do a different search query
00:08:04.000 | or that you found your answer.
00:08:06.000 | So we saw that our agents were not mimicking this common behavior
00:08:09.000 | and wanted to adjust the search tool to improve the result quality.
00:08:13.000 | So we upgraded from our initial input model,
00:08:17.000 | which initially was a very naive structure,
00:08:19.000 | just a single query term.
00:08:21.000 | We flipped that to include a bunch of other arguments,
00:08:24.000 | with things like category, whether you want a live crawl,
00:08:27.000 | whether to include the full text or a summary,
00:08:29.000 | and also things like restricting the domain
00:08:31.000 | and the published date.
00:08:32.000 | And so by changing all these parameters,
00:08:34.000 | we're changing the trajectory of the search as well,
00:08:37.000 | from first reviewing just the preview snippet from the internet search output,
00:08:42.000 | which is what we have on the left here,
00:08:43.000 | to getting both the URL and the actual page content in one tool call.
00:08:48.000 | So what this allows us to do is pull in all this content at once,
00:08:55.000 | and we can sidestep this issue that we were seeing with agents
00:08:58.000 | picking an answer just based on the Google search preview,
00:09:01.000 | which, as we know, isn't always reliable or accurate.
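
A hypothetical tool signature along those lines: the parameter names mirror the arguments described above, and `run_provider_search` is a stand-in for whichever search API is used, not a real SDK call.

```python
# Richer search tool: the agent controls category, live crawl, content type,
# domain and date filters, and gets URLs plus page content in one call.
from typing import Optional
from langchain_core.tools import tool

def run_provider_search(query: str, **filters) -> list[dict]:
    """Stand-in for the underlying search provider's API."""
    raise NotImplementedError

@tool
def internet_search(
    query: str,
    category: Optional[str] = None,           # e.g. "news" or "company"
    livecrawl: bool = False,                   # fetch the live page, not a cached copy
    include_full_text: bool = True,            # page text rather than just a preview snippet
    include_domains: Optional[list[str]] = None,
    published_after: Optional[str] = None,     # ISO date, e.g. "2024-01-01"
) -> list[dict]:
    """Search the web and return URLs plus page content in a single tool call."""
    return run_provider_search(
        query,
        category=category,
        livecrawl=livecrawl,
        include_full_text=include_full_text,
        include_domains=include_domains,
        published_after=published_after,
    )
```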
00:09:05.000 | So the second new tool we built was browser access.
00:09:08.000 | So why did we do it?
00:09:10.000 | There's a lot of rich data online that scraping just fails to capture.
00:09:14.000 | Between online data sources or datasets
00:09:18.000 | where you already have to enter a query,
00:09:20.000 | interactive search experiences,
00:09:21.000 | and even things like Google Maps or images,
00:09:23.000 | you can't really capture that content by scraping.
00:09:26.000 | So we wanted to allow the agent to use the browser the same way we would.
00:09:31.000 | So we built browser access as a sub-agent.
00:09:34.000 | We gave the agent this tool, which is basically browser access,
00:09:39.000 | and what it does is decompose the task into a browser trajectory
00:09:43.000 | using a smaller mini model, and it also uses computer-use-preview to actually act on that.
00:09:47.000 | We evaluated Browser Use, the open-source alternative,
00:09:50.000 | and we found that while it was marginally faster,
00:09:53.000 | it struggled on more complex browser paths,
00:09:55.000 | which led us to use computer-use-preview instead.
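
A rough sketch of that sub-agent pattern, under stated assumptions: the model choice, prompts, and the `execute_browser_step` placeholder are all illustrative, and the actual implementation drives a real browser through the computer-use-preview model.

```python
# Browser-access sub-agent sketch: a small model decomposes the task into a
# browser trajectory, and each step is handed to a computer-use execution loop.
from langchain_openai import ChatOpenAI

step_planner = ChatOpenAI(model="gpt-4o-mini")  # illustrative choice of "mini" model

def execute_browser_step(step: str) -> str:
    """Placeholder for the computer-use loop (e.g. computer-use-preview
    taking screenshots and issuing clicks/keystrokes in a real browser)."""
    raise NotImplementedError

def browser_access(task: str) -> str:
    plan = step_planner.invoke(
        "Break this task into a short numbered list of concrete browser actions:\n" + task
    ).content
    observations = [execute_browser_step(s) for s in plan.splitlines() if s.strip()]
    return step_planner.invoke(
        f"Task: {task}\nObservations from the browser:\n"
        + "\n".join(observations) + "\nAnswer the task based on what was observed."
    ).content
```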
00:09:59.000 | You can see an example of this here,
00:10:01.000 | where we try to find if Google has EV charging on site.
00:10:05.000 | And so eventually it ends up using the browser tool,
00:10:09.000 | it goes to Google Maps,
00:10:10.000 | it ends up using Street View,
00:10:12.000 | going through,
00:10:13.000 | looking for an EV charging station in their parking lot,
00:10:16.000 | and then also flipping to another tab in the browser to check to see if it has EV charging.
00:10:22.000 | And on that last page there,
00:10:24.000 | it does actually confirm between Google Maps and that page that there is an EV charging station.
00:10:30.000 | So we learned a lot from these tools,
00:10:33.000 | and one thing we learned is that you can't use a blanket approach to internet search.
00:10:37.000 | Internet search and Google are great,
00:10:39.000 | but you still need to empower your agents to be able to look at the data,
00:10:42.000 | ingest the right content into the context,
00:10:44.000 | and then actually reason off of that context.
00:10:46.000 | This pivot to chaining search and pulling in content at once
00:10:51.000 | massively reduced the amount of misinterpretation we had in internet search
00:10:55.000 | and changed how we were doing search.
00:10:57.000 | And these other tools, like browser access and searching HTML,
00:11:00.000 | open up entirely new use cases for our agent.
00:11:03.000 | So as a result,
00:11:05.000 | the new agent, the new champion
00:11:07.000 | we have in the product, is called BrowserAgent,
00:11:09.000 | which, as you can see, is a bit of a cop-out on the name,
00:11:11.000 | or on the theme of naming our agents.
00:11:13.000 | So a couple of quick next steps:
00:11:16.000 | based on these changes in tools,
00:11:18.000 | we want to invest a little bit more time in evals
00:11:20.000 | to actually highlight some of these issues that we found
00:11:22.000 | just by looking at traces
00:11:24.000 | and looking at outputs,
00:11:25.000 | to make this process a little more repeatable and scalable.
00:11:30.000 | Awesome.
00:11:31.000 | We are solving a lot of interesting agent problems,
00:11:35.000 | so if you also want your name in our code base as an agent,
00:11:39.000 | reach out to us after, at 4:59.
00:11:41.000 | We are hiring tons of engineers.
00:11:43.000 | Thank you guys.
00:11:45.000 | Thank you.
00:12:00.000 | All right.
00:12:05.000 | Thank you, Conor and Pinal.
00:12:09.000 | So next up is Eno Reyes.
00:12:11.000 | He's the co-founder and CTO of Factory.
00:12:13.000 | Eno will be sharing insights on challenges with agentic systems
00:12:17.000 | and how Factory has been navigating them.
00:12:20.000 | Please welcome Eno.
00:12:21.000 | Hey everybody.
00:12:33.000 | My name is Eno, co-founder and CTO of a company called Factory.
00:12:38.000 | At Factory, we believe that the way we build software is radically changing.
00:12:44.000 | We are transitioning from the era of human-driven software development
00:12:49.000 | to agent-driven software development.
00:12:52.000 | You can see glimpses of that today.
00:12:55.000 | However, it seems to me,