
Breakthrough Agents: Learnings from Building AI Research Agents


Transcript

SPEAKER 1: Hey, I'm Connor. I'm the co-founder and CTO of Unify. SPEAKER 2: I'm Kunal. I'm an engineer at Unify. SPEAKER 1: Awesome. Some quick background on Unify. We're building an AI system of action that lets companies grow revenue in a repeatable, observable, and scalable way: generate pipeline and find new business.

A key belief that we have is that growth should be a science and that the best products foundationally should win. An insight that we had was that foundationally go-to-market is a search problem. It's about finding people and companies that have a problem that you uniquely solve. Historically, in order to run this search problem over huge amounts of unstructured, semantically rich data, you had to deploy people.

You have a sales team. They do research. But in the world of LLMs and AI, you can now do that in code. You can get a bunch of great benefits like repeatability, observability, scalability that historically you couldn't get. How do you run this research? Well, you run it with agents.

Here's what that looks like in our product. We take customer questions and figure out a way to answer those questions using internet data and sort of open research. We ask our customers for two things. One, a list of questions that they want answered about a specific company or person with defined outputs.

So that might be text, enum, Boolean. And then we ask them to provide some guidance. That's a free-form text field that lets you describe, sort of like how you would describe to a high schooler, how they should go about the actual research. This agent then runs on thousands or tens of thousands of companies, answers those questions in an engine called plays, and helps you send targeted sales outreach to those companies in a timely manner.

Some examples of things that our customers will research. One, when did this company last have downtime? Maybe if you're selling an incident response tool, you want to know if they've had downtime and speak to that. Or if you're selling a login tool or an auth tool, maybe you want to find times where customers are complaining about login experiences, bad login experiences, find some links to it and include that in a timely email.

We run quite a few agents, and we're pushing a lot of tokens through OpenAI. In April, we pushed 36 billion tokens, and that has increased pretty quickly every month since. We've learned a lot in doing this, so we thought it would be fun to share some practical learnings from running these really generalized research agents at scale.

Thanks, Connor. Going back all the way to November, the first thing we did was build version one of our agent. Sam, one of our founding engineers, and Connor both took cracks at it using the ReAct framework, which most of you are probably familiar with by now: it's just reasoning and then acting.

That allows for corrective and reactive actions over the course of an agent's trajectory. Connor and Sam both built versions of this framework, and they built it with three core tools: searching the internet, searching a website, and scraping a website. So you can see here on the left we have Sambot Mark 1 and ConnorAgent.

We named all of our agents here, and you can see the differences in the architecture. The key difference is that Sam chose to use a weaker, faster model, 4o, for generating and revising the plan, while Connor chose o1-preview, which was the reasoning model at the time, to hopefully generate a stronger plan. Both then lead into this agentic tool-use loop.
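To make that concrete, here's a minimal sketch of the kind of ReAct-style loop described here, using the OpenAI chat-completions tool-calling API. The three tool definitions mirror the core tools mentioned above; the exact schemas, the model choice, and the `dispatch_tool` executor are illustrative assumptions, not Unify's actual code.

```python
from openai import OpenAI

client = OpenAI()

# Three core tools from the talk; the schemas here are illustrative assumptions.
TOOLS = [
    {"type": "function", "function": {
        "name": "search_internet",
        "description": "Run a web search and return the top results.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "search_website",
        "description": "Search within a single website for relevant pages.",
        "parameters": {"type": "object",
                       "properties": {"domain": {"type": "string"},
                                      "query": {"type": "string"}},
                       "required": ["domain", "query"]}}},
    {"type": "function", "function": {
        "name": "scrape_website",
        "description": "Fetch and return the text content of a URL.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
]

def run_agent(plan: str, question: str, max_steps: int = 10) -> str:
    """Reason, call tools, observe the results, and repeat until the model answers."""
    messages = [
        {"role": "system", "content": f"Follow this research plan:\n{plan}"},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:              # no tool requested: final answer
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:         # execute each requested tool
            result = dispatch_tool(call)    # hypothetical executor for the three tools
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": result})
    return "No answer found within the step budget."
```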

After we built those, the first thing we wanted to know was: how do we know which one is better? Before we built any evals or metrics, we spent a lot of time looking at the traces and the trajectories to see which one was working better.

And what we found initially is that o1 produced much more thorough research plans. You can see here on the left we have o1-preview, and on the right we have 4o. Given the same prompt, o1-preview produces around 1,600 tokens of plan, and 4o only around 600.

If we zoom into this plan, we can see, for one specific question from the prompt, the difference in quality and specificity of output for o1-preview versus 4o. You can see the o1-preview plan outlines how to answer the question, potential mistakes or pitfalls, and even the value type, the structured output that was provided.

And what we found from these plans is that this increase in specificity, and just general verbosity, helped the agent improve outcomes downstream of the initial planning phase. But after this kind of vibe check, we started to build actual evals. We started with just accuracy, the percentage of questions answered correctly, and hand-labeled a bunch of datasets.

I actually hand-labeled probably 500 examples, around 100 companies for each of these five core datasets. We picked these datasets based on what we thought customers at the time would use our agent for: things like, is this company B2B or B2C, firmographics, and technographics, so that they could deploy these agents and accurately get answers about these companies.
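As a rough illustration, an accuracy eval like the one described can be as simple as the sketch below; the `Example` rows stand in for the hand-labeled data, and the `agent.answer` interface is a hypothetical stand-in for however the agent is actually invoked.

```python
from dataclasses import dataclass

@dataclass
class Example:
    company: str
    question: str
    expected: str  # hand-labeled ground truth, e.g. "B2B" or "true"

def accuracy(agent, dataset: list[Example]) -> float:
    """Fraction of questions the agent answers correctly (exact match on the label)."""
    correct = 0
    for ex in dataset:
        predicted = agent.answer(ex.company, ex.question)  # hypothetical agent interface
        if predicted.strip().lower() == ex.expected.strip().lower():
            correct += 1
    return correct / len(dataset)

# Head-to-head comparison on the same labeled set:
# print(accuracy(connor_agent, b2b_dataset), accuracy(sambot_mark_1, b2b_dataset))
```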

And when we put ConnorAgent and Sambot Mark 1 head to head, you can see ConnorAgent came out on top. It beat out Sambot in most of these categories, and the margin was pretty significant for these tasks. So what did we learn from these initial evals? Like I mentioned earlier, these reasoning models had an outsized impact on downstream actions and downstream accuracy.

We also learned that accuracy-based evaluations were a good heuristic for how well an agent is doing. But between those and still having to look at the traces, we didn't have clear insight into where to go from here. We knew the metrics these agents produced, but how do we improve the agents?

So we thought about three core axes for improving these agents: one, changing the graph or architecture; two, changing the models and the prompts; and three, adding more tools. We spent a lot of time reflecting on the customer use cases and customer needs at the time, and we wanted to enable a lot of these initial workflows with these changes.

So we picked changing the models and the prompts, and adding more tools, as the first two areas we wanted to invest in for improvements. I'll talk a little bit about a couple of the cool learnings we had from doing model and prompt changes. The first thing we wanted to do was optimize for performance and cost.

o1, o3, and o1-preview were pretty expensive, and they were also pretty slow. And at the start of this year, new models were seemingly coming out week after week, so we were often just plugging in a new model and seeing whether it was better or not.

What was interesting is that we didn't really see a huge difference until GPT-4.1 came out recently, and that's the only model we've replaced o1 with in production for agentic planning. The outcome of this change was that an initial agent run that used to cost around 35 cents now costs around 10 cents with GPT-4.1, with similar performance.

You can see here the number of other models that we tried and why GPT-4.1 is the best, or most cost-effective, model for planning in our agent. We even tried DeepSeek, Claude 3.7, and Gemini 2.5 recently. One thing to note about DeepSeek is that when it came out, it was really promising.

But it was only a week or two ago that its latency came down to an acceptable threshold, similar to o1, for these tasks. Another thing we came across was date formatting, which is pretty interesting: we had a bunch of models fail to correctly identify which date was in the future, just because of the format it was in.

We saw that 4o struggled with a date written like 5/14/2025 versus one written like May 15th, 2024: because the day of the month was in the future, it actually thought the 2024 date was in the future as well. So we made some adjustments in prompting, just by providing several different formats of the date, and that improved performance and standardized accuracy on date-based tasks across models.
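A minimal sketch of that prompting adjustment, assuming the fix is simply to hand the model the current date in several unambiguous formats before any date-sensitive question:

```python
from datetime import date

def date_context(today: date | None = None) -> str:
    """Build a prompt snippet that states today's date in multiple formats."""
    today = today or date.today()
    return (
        f"Today's date is {today.isoformat()} (ISO 8601), "
        f"also written {today.strftime('%B %d, %Y')} and {today.strftime('%m/%d/%Y')}. "
        "Treat any date strictly later than this as being in the future."
    )

# Prepend date_context() to the system prompt of any date-sensitive task.
```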

And the last main thing was: how do we improve tool calling? This is something we're still working on, but initially the huge problem was that agents were making throwaway tool calls, something like searching just for "B2B" generically. So we ended up changing a lot of the Pydantic models for the tools' input schemas to force the tool-calling node, or tool-calling agent, to think a little more about what it was calling.
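Here's an illustrative sketch of what tightening a tool's Pydantic input schema can look like; the field names and constraints are assumptions, not Unify's actual schema, but the idea is to make the tool-calling node justify and scope each call instead of firing off a throwaway query.

```python
from pydantic import BaseModel, Field

class InternetSearchInput(BaseModel):
    reasoning: str = Field(
        description="One sentence on why this search is needed and what evidence it should surface.")
    query: str = Field(
        min_length=12,
        description="A specific, self-contained search query; generic one-word queries are not acceptable.")
    expected_evidence: str = Field(
        description="What a good result would look like, e.g. a pricing page or a status-page incident report.")
```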

So overall, across prompt and model changes, we learned that agent costs are going down a lot, because we were able to swap o1 for GPT-4.1 in agentic planning with no notable quality change. We also learned there are a ton of edge cases that evals don't necessarily catch.

And even if you build a lot of robust evals, you're probably still going to find yourself looking at the traces to do some kind of human eval or human vibe check. That's even been notable from OpenAI recently with their model changes in ChatGPT. We also learned that models tend to spike in different use cases.

A model that works well for planning might not be as effective for tool calling or reflection or other parts of your agent workflow, so you probably want to do some kind of node-based eval. The second axis was building more tools, and we needed to think about which tools to build.

And so we thought about, okay, what use cases can we not support today that we really want to turn on with additional tools? So what customers can we power new workflows with by adding just a single new tool? So the four tools that we decided to add because of this were deep internet research, browser access, searching HTML, and dataset access.

And so I'll go through a couple of these. The reason we started with deep internet research is that internet search is still hard. Between SEO articles on Google and search grounding with LLMs, with things like OpenAI and Perplexity, a lot of result quality is out of your hands.

We also saw that, in our agents' usage, calls to internet search were not being used the way we would use them. So we thought about how we conduct research on the internet ourselves. We're pretty good at doing internet research, but we do it fundamentally differently from how our agents were doing it initially.

When you do a search on Google, you might search for a query, look at the top 10 links, implicitly filter out probably five of those just based on the source, open a couple of new tabs, maybe read through a couple of sentences on each, before deciding that you need to do a different search query or that you found your answer.

So we saw that our agents were not mimicking this common behavior, and we wanted to adjust course to improve agent result quality. We upgraded from our initial Pydantic model, which was a very naive structure, basically just a query term, and flipped it to include a bunch of other arguments: things like a category, whether we want to live crawl, whether to include the page text alongside the summary, and constraints like domain or published-at date.
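A hedged sketch of what that richer search schema might look like, with field names assumed from the parameters just mentioned (category, live crawl, page text, domain and published-date constraints), rather than copied from the actual implementation:

```python
from datetime import date
from pydantic import BaseModel, Field

class DeepSearchInput(BaseModel):
    query: str = Field(description="Specific, self-contained search query.")
    category: str | None = Field(
        default=None, description="Result category to bias toward, e.g. 'news' or 'company'.")
    live_crawl: bool = Field(
        default=False, description="Fetch pages live instead of relying on a cached index.")
    include_page_text: bool = Field(
        default=True, description="Return the full page text with each result, not just a preview.")
    include_domains: list[str] | None = Field(
        default=None, description="Restrict results to these domains.")
    published_after: date | None = Field(
        default=None, description="Only include pages published after this date.")
```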

And by changing all these parameters, we're also changing the trajectory of a search: from first reviewing just the preview from an internet search output, which is what we have on the left here, to getting both the URL and the actual page content in one tool call.

What this allows us to do is pull in all this content at once and sidestep the issue we were seeing with agents picking an answer based just on a Google search preview, which as we know isn't always reliable or accurate. So the second main tool we built was browser access.

Again, how do we do it ourselves? There's a lot of rich data online that scraping isn't able to capture. Between online data sources or datasets that require you to enter a query, interactive search experiences, or even things like Google Maps or images, you can't really capture that content with scraping.

So we wanted to allow our Unify agent to use the browser the same way we would, and we built browser access as a sub-agent. We gave the agent this tool, which is basically browser access. What it does is decompose the task into a browser trajectory using o4-mini, and then use computer-use-preview to actually act on it.
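At a high level, the sub-agent pattern described here looks something like the sketch below; `plan_browser_steps`, `execute_browser_step`, and `summarize` are hypothetical stand-ins for the o4-mini planner, the computer-use-preview executor, and the final write-up, not real API calls.

```python
def browser_research(task: str) -> str:
    """Tool exposed to the main agent: run a browser sub-agent and return what it found."""
    steps = plan_browser_steps(task)                     # e.g. a small model decomposes the task
    observations = []
    for step in steps:
        observations.append(execute_browser_step(step))  # computer-use executor acts in the browser
    return summarize(task, observations)                 # answer handed back to the parent agent
```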

We evaluated Browser Use, the open-source alternative, and we found that while it was marginally faster, it struggled on more complex browser tasks, which led us to use computer-use-preview instead. You can see an example of this here, where we try to find out whether Google has EV parking on site.

It eventually ends up using the browser tool. It goes to Google Maps, ends up using Street View, looks for an EV charging station in their parking lot, and then flips to a new tab in the browser to check whether it has EV charging.

And on that last page there, it does actually confirm, between Google Maps and that page, that there is an EV charging station. So we learned a lot from these tools. One thing we learned was that we can't use this kind of naive approach to internet search. Internet search with Google is great, but you still need to empower your agent to look at the data, ingest the right content into context, and then act based on that context.

Deep search and this pivot to pulling in full content at once massively reduced the amount of misinterpretation we had in internet search and changed how we conducted research. And other tools like browser access and searching HTML unlocked completely new use cases for our agent. So as a result, the new agent, the new champion we have in prod, is Kunal Browser Agent.

As you can see, we kept the theme of naming our agents. A couple of quick next steps: based on these changes in tools, we want to invest a bit more time in evals to actually surface some of the issues we found by looking through traces and outputs, to make this process a little more repeatable and scalable.

Awesome. We are solving a lot of interesting agent problems, so if you also want your name in our code base as an agent, come chat with us after or apply online. We are hiring tons of engineers. Thank you, guys.