Hey, everybody. I'm Paul. I'm the founder of BrowserBase, and I am obsessed with browsers, specifically one type of browser: headless browsers. And I'm here to talk about how the browser is all you need. It's not attention. It's not MCP. It's the browser, or more specifically the browser MCP server, that is all you need.
And I'm going to try and keep it light on slides. We only have 100 slides to get through in 20 minutes, but you and I, we can do it together. Just stay with me for 20 more minutes, okay? So, first of all, a little about BrowserBase. This is what it looks like to build your own browser infrastructure.
It's messy, and it breaks all the time. With BrowserBase, we let you run thousands of headless browsers in the cloud for agents to control. And you might be wondering, well, why do I need a browser? Well, every AI agent needs a web browser. That's the point of this whole talk.
So, of course, you know, I can talk to you about how to scale browser infrastructure and all that interesting stuff, but that's too pluggy. We'll save that for later. But I do know quite a bit about, you know, how customers are using browsers as part of their AI agent applications, specifically with MCP.
We have the most popular browser automation MCP server out there. And the reason people choose BrowserBase to run their headless browser MCP server is that we make it really nice and can really help you scale your infrastructure. And when you think about why you need a browser, well, I like to think about two things.
You have AI agents and the legacy internet. You know, the DMV is not going to have an MCP server anytime soon. My barber shop is not going to open a GraphQL API for me to schedule a haircut. As much as I keep begging John to do it, he's got better things to do.
So if we want AI agents to interact with the rest of the legacy internet, they need a bridge. And I really do believe that the browser is that bridge between AI and the rest of the internet. And this is the unsexy internet, I might add. It's the internet that's not going to get a lot of attention.
I've seen countless flight bookers and countless restaurant pickers, and I have not seen anyone do the thing I need, which is a Delaware franchise tax filing agent. Any founders in the room who have done that before? Not super fun. And I think people use a lot of acronyms these days.
You know, you have MCP, you have A2A, you have OpenAPI. But if those aren't available, you can just do what can be considered the dumb thing: you can just use the website. And websites are out there. There are plenty of them. There are billions of websites. And when your user prompts your agent to do something, you might not always have a first-party integration available.
That's where a browser is kind of the integration of last resort, the path that you can take your agent down if you don't have a primary integration already built in. You might be wondering, well, okay, Paul, cool, I get it, you've beaten this thing to death: AI agents need the browser. But how do they control it?
Well, you can think about web agents and browser tools. First of all, you know, what is a web agent? I want to keep this technical, as this is a bunch of AI architects, right? Web agents, we've heard about them for a long time: take a model and have it generate actions or code to control a browser, generally by parsing the page, the DOM, the HTML and the CSS.
WebVoyager was early here. Adept did a lot of really cool stuff here with their Fuyu models. OpenAI had Operator, Proxy by Convergence (now Salesforce), H Company. Everyone was doing a lot of this stuff last year. We really got to see a lot of web agents in production, but it was still early days, you know?
And WebVoyager was first, you know: it was taking screenshots of a page, it was using chain-of-thought prompting, and then from that it was saying, click the button at this coordinate. Sometimes they do this thing, we'll talk about it in a second, called labeling on top of the page.
But I think it's pretty cool because we haven't changed that much from this. There really are two different types of web agents. There's vision-driven agents. These are ones that predominantly use screenshots as context for the model. They might do some marking up of the screenshot to indicate what box to click on.
Or there's text-based web agents, which predominantly use HTML as the context for the model. Both have different approaches, pros and cons. Text web agents use XPaths and Playwright code; some may argue they're more repeatable. Vision models can be more accurate on more complex pages. There are trade-offs here, and it really does depend on what website you're trying to automate.
Here's an example of a vision agent using Set-of-Marks prompting. You can see these little boxes here where you're marking up what you could click, and the agent or the model will turn around and say, "Click the box labeled 25." That's going to help you out. And on the DOM-based agent side, there's also this idea of taking HTML and asking, how can we transform HTML to make it more reliable for web agents?
So the accessibility tree is something that's built into every page in a lot of applications. You can take a different structure of the same information and condense it down, so you get the same content without all the extra div tags and classes. So we have vision-based agents and DOM-based agents.
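To make the accessibility-tree idea concrete, here's a minimal sketch of pulling it with plain Playwright and flattening it into model-friendly text. This isn't a production pipeline; the flattening helper is just illustrative.

```ts
// Minimal sketch: pull the accessibility tree with Playwright and flatten it
// into "role: name" lines that fit nicely in a model's context.
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");

// Same information as the DOM, but without the div soup, classes, and styling noise.
const tree = await page.accessibility.snapshot();

function flatten(node: any, depth = 0): string[] {
  if (!node) return [];
  const line = `${"  ".repeat(depth)}${node.role}: ${node.name ?? ""}`;
  const children = (node.children ?? []).flatMap((c: any) => flatten(c, depth + 1));
  return [line, ...children];
}

console.log(flatten(tree).join("\n"));
await browser.close();
```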
And there are now also computer use models, which are kind of the next step here: let's train a whole model on this stuff. Previously, we were just using stock vision models. But now we can train that model on these things called web trajectories. And I won't go too much into this.
There's a lot of great papers out here. I'd recommend this paper linked down here about web trajectories and how you can generate them to do RL to teach models how to not just pick the right button on the page, but how to reason across multiple pages about the right path to take.
But all in all, there's just a lot of innovation here happening on teaching AI how to browse the web. And this stuff is getting good. It is working. And you can use it today to help add some sort of extra functionality to your applications if you're making the right choices.
And I will add: you want to think about whether you want a web agent or a browser tool. You may have never heard of the difference here, but there is a difference in my head. A web agent is basically one prompt to many actions.
I think OpenAI's operator is a good example of this. You say, hey, operator, go, you know, file my Delaware franchise tax. You give it some prompt, it's going to go take many actions. And if you give it that same prompt twice, it might take two different paths to get the task done.
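For a flavor of the one-prompt-to-many-actions style, here's a rough sketch assuming Stagehand's agent() API; the provider and model values are illustrative, not a recommendation.

```ts
// Sketch of the web-agent style: one prompt, many autonomous actions.
// Assumes Stagehand's agent() API; provider/model values are illustrative.
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();

// The agent decides its own path; run it twice and it may take two different routes.
const agent = stagehand.agent({ provider: "openai", model: "computer-use-preview" });
const result = await agent.execute("file my Delaware franchise tax on the state portal");
console.log(result);

await stagehand.close();
```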
Web agents are good. They're like little cockroaches: they're just going to keep trying to find a way to complete your task. But they're a little bit more non-deterministic, because the reasoning is in their control. Whereas browser tools are one prompt to one action. You say, click the sign-in button, that thing's going to click the sign-in button.
You ask it to purchase the Amazon item that you want. That's a series of multiple steps that really might be more suited for a web agent. So we have a framework called Stagehand that we think is the best browser tool. And it really does depend on what you want.
If you know what your workflow is going to be with some high level steps, you can actually use a browser tool to take those steps and translate them into reliable web actions. If you don't know what you're going to do, if you don't know what the prompt will be, a web agent more generically might be the right fit.
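As a rough sketch of the browser-tool style, assuming Stagehand's act()/extract() API, here's what a known workflow might look like when each high-level step becomes one reliable action. The specific steps, site, and schema are made up for illustration.

```ts
// Rough sketch of using a browser tool for a known workflow.
// Assumes Stagehand's act()/extract() API; steps and schema are illustrative.
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();
const page = stagehand.page;

await page.goto("https://www.sfspca.org");
// Each high-level step is one single-action call, so the workflow stays predictable.
await page.act("close any modal or newsletter popup");
await page.act("click the adoptions link in the navigation");
await page.act("filter the listings to dogs in San Francisco");

// Pull structured data back out instead of parsing HTML by hand.
const { url } = await page.extract({
  instruction: "return the URL of the first adoptable dog",
  schema: z.object({ url: z.string() }),
});
console.log(url);

await stagehand.close();
```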
But I do believe your agent needs a browser tool. Another thing I'll add: you may want to think about, with MCP, what types of servers you're integrating. A vertical MCP server is something like Linear, where it's going to give you options to control specific things on a specific task, like create a Linear ticket or assign someone to the Linear ticket. Whereas with a horizontal MCP server, you're going to have some sort of primitive that can do many things.
You know, for browsing, we view it as a horizontal MCP server. You're exposing primitives like click a button on a page. Now that page may change, there may be many different pages, but when you have a horizontal MCP server, with one server you have the opportunity to automate the whole web.
And you will still see vertical MCP servers, which are more direct tool calls, as an important part of your agent. I'm not saying we replace those with browsers. If you are interacting with Salesforce, you probably should just use the Salesforce MCP. You don't need a browser there.
But if you're interacting with a custom, bespoke CRM built by a large enterprise that doesn't have an MCP server, you don't have to go reverse-engineering APIs. You can use a browser tool and a browser MCP server to go out and automate that.
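For a flavor of what a horizontal server looks like, here's a minimal sketch using the TypeScript MCP SDK that exposes one generic browser primitive. The tool name and the stubbed-out browser call are hypothetical, not the actual BrowserBase MCP server.

```ts
// Sketch of a horizontal MCP server exposing one browser primitive.
// The "click" tool and the stubbed browser call are hypothetical.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "browser", version: "0.1.0" });

// One generic primitive ("click something described in natural language")
// instead of one bespoke tool per SaaS integration.
server.tool(
  "click",
  "Click an element on the current page, described in natural language",
  { description: z.string() },
  async ({ description }) => {
    // In a real server this would drive a headless browser
    // (e.g. via Stagehand or Playwright). Stubbed here.
    return { content: [{ type: "text" as const, text: `clicked: ${description}` }] };
  }
);

await server.connect(new StdioServerTransport());
```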
Is that making sense for everybody so far? Okay. Thank you, guys. Seeing some heads nodding. Okay. A few other notes on MCP and some concerns I have here. I do think that MCPs are going to have to pass compliance. And dynamic tool discovery, this idea that you can plug your agent into an infinite toolbox of MCP servers, is going to be hard for CSOs to get on board with, right?
You're going to want to be able to select which MCP servers make the most sense. And with a browser or a horizontal MCP server, this could be browser, it could be email, it could be anything, you really only have to onboard one MCP server as opposed to an MCP server for each individual integration.
Yeah, that's the most important one. Secondly, evals. Most benchmarks are fake news, just want to let you know, especially when the company putting out the benchmark is the one that's also ranking themselves. So I would be very critical of public benchmarks of any web agent or any model you see out there.
You really need your own evals. My friend Ankur, who runs Braintrust, supports our evals, and I really like them, because then we're able to say, hey, actually, for the web agent we're building, which model is the best for this web task or for this certain website? And you can get really intelligent and honest about what you actually need to do to automate the web and which model is the right choice for you.
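As a sketch of what a per-website, per-model eval might look like, assuming Braintrust's Eval() API: runWebAgent is a hypothetical stand-in for your own agent runner, and the data, model names, and scorer are made up.

```ts
// Sketch of a per-website eval across models, assuming Braintrust's Eval() API.
// runWebAgent, the cases, and the scorer are all illustrative.
import { Eval } from "braintrust";

// Hypothetical stand-in for the actual web agent runner.
async function runWebAgent(task: string, opts: { model: string }): Promise<string> {
  // ... drive a browser with the chosen model and return the final URL ...
  return "https://www.sfspca.org/adoptions";
}

const cases = [
  { input: "find a dog for adoption in San Francisco", expected: "sfspca.org" },
];

for (const model of ["gpt-4o", "claude-sonnet"]) {
  await Eval(`web-agent-${model}`, {
    data: () => cases,
    task: async (input: string) => runWebAgent(input, { model }),
    scores: [
      // Did the agent end up on the right site?
      ({ output, expected }) => ({
        name: "landed_on_expected_site",
        score: output.includes(expected ?? "") ? 1 : 0,
      }),
    ],
  });
}
```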
Finally, I think you need observability. If your AI agent is controlling a browser, you need to be able to see what happens in the browser. BrowserBase bundles this out of the box, or you can build your own: taking screenshots, recording history, recording actions, making sure that you know exactly where your agent went and why.
It's important because, let's say, your agent is going to go buy an Xbox and it buys you AirPods. You want to understand what prompts went into that, what page paths it took, and really break that down. At BrowserBase, we include this in every browser. Sessions are recorded, logs are available.
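If you do build your own, a small wrapper that screenshots and records every action gets you most of the way. This is a hypothetical sketch assuming a Playwright-style page object; the log shape is made up for illustration.

```ts
// Hypothetical "logged action" wrapper: screenshot + record every step so you
// can reconstruct what the agent did and why. Assumes a Playwright-style Page.
import type { Page } from "playwright";

type ActionLog = { step: number; action: string; url: string; screenshot: string };
const history: ActionLog[] = [];

async function loggedAct(page: Page, step: number, action: string, run: () => Promise<void>) {
  await run();
  const screenshot = `step-${step}.png`;
  await page.screenshot({ path: screenshot });
  history.push({ step, action, url: page.url(), screenshot });
}

// Usage: wrap each action so the trail exists even when the run goes sideways.
// await loggedAct(page, 1, "click 'Add to Cart'", () => page.click("text=Add to Cart"));
```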
Either way, it's quite easy to get that visibility. Okay, so that was a lot on browsers. I've got a ton of time left, but I'll end with this one point: the browser is the default MCP server for the rest of the internet. If you need to integrate with something, whether it's via MCP or an API, and there's not something available, you should really consider including a browser, because a browser is all you need.
And since I'm doing so well on time, I'm going to do some live coding, because I feel like there hasn't been enough live coding in this room today, so I'm breaking it out. And let's pull up Cursor and bring it over here, and it's going to be so hard to see, but let me try.
Give me this, please. Love. All right, so we have Cursor right here in all of my screenshots. I'm a screenshot hoarder, I apologize. So I pulled up the Cursor MCP server, sorry, the Cursor controller, and I've written a prompt. It says: create a new Browserbase session, navigate to sfspca.org, close out any modals, and find a dog for adoption in San Francisco, return the URL.
We'll click enter, and I may have to jump to a browser here really quickly. So you can see it's calling the Browserbase session tool, it's generated a Browserbase session right here, and you can look at how it's actually making these individual tool calls on the page. If I pull up the session at the same time and scoot this tab over there, we can see, in parallel, the browser loading.
And as our MCP server is making these tool calls, right now it's trying to close out the modal, it's navigating the website. And think about how there's a reasoning model here that's deciding: what should I do, which of these exposed tools should I call? And now, we go here, we have been given a dog with this URL, and one of you will be going home with a lucky dog today.
All right, can I get a drumroll, please? All right, that was so half-hearted, but thank you. Give me a dog! Yay! There we go! Oh, they really want that 200k match. This campaign's been running for a while, I feel kind of bad for the SF SPCA. But that's a good example of what will happen when you're building these web automations.
Sometimes modals will pop up. Sometimes things you aren't expecting may happen on the web page. You need to have an AI agent driving the page with primitives, so it can react to any sort of weird changes that happen. And hopefully, you can integrate with things that aren't going to be AI-native.
To me, the most important problems to solve in AI right now are the unimportant, boring problems. And they're going to require, you know, intelligent engineering to bring the best models to the unsexy problems. And when we talk to customers at BrowserBase, they're not just Perplexity or Clay or Commure.
They're also a 55-year-old dairy trucking company who has never hired an engineer in their 55 years until this year. And the first thing they did was use BrowserBase to automate a really painful ops workflow. So if I can somehow pull my slides back, I don't know if I'll be able to.
I might just have to move this bad boy over here and we'll just go for it live. Yeah, screw it. Let's do this. Okay, so if you do want to try a browser MCP server, it's available today. You can go ahead and scan this QR code and sign up, or use this to try adding some sort of automation.
It's really easy. If I can use it, you can use it, I promise. I'm going to pause. I'm seeing some photos being taken. Amazing. Great. And finally, you know, if you are looking to join a company that's growing quickly, BrowserBase has been around for a year and a half.
We're 30 people. We're backed by some really great investors. And we would love for you to come build the future of automation with us. All right. That's all my time. I might have a minute for some questions or two, if that's okay. Yeah. If there's any questions, happy to take them.
Otherwise, thanks so much, everybody. Any questions? Do you use specific models? Hold on. Maybe raise hands, if that's okay. We'll go up front and then back. Yeah, go ahead. Do you use a specific model for navigation? Like, is it your own model, for when the browser is scrolled and so on?
We are model agnostic. So BrowserBase is just the infrastructure for running headless browsers, as well as the frameworks and tools. We have an MCP server. We have a framework called Stagehand, which is like Playwright but better. You bring your own model. And you may want to choose different models based on your conditions.
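Concretely, the bring-your-own-model idea is just a constructor option, assuming Stagehand's modelName and modelClientOptions settings; the values here are illustrative.

```ts
// Sketch of bring-your-own-model: the browser infra stays the same, the model
// is a constructor option. Assumes Stagehand's modelName / modelClientOptions.
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({
  env: "BROWSERBASE",                       // run on Browserbase infra
  modelName: "gpt-4o",                      // swap per task, compliance needs, etc.
  modelClientOptions: { apiKey: process.env.OPENAI_API_KEY },
});
await stagehand.init();
```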
If you are doing HIPAA and you need something like zero data retention, and OpenAI offers that, you might want to use one of their models. So you bring the model, we bring the infra. Hold on, I'm sorry, go ahead. Say it again? We can just skip it.
Yeah, for the example we just did, I used the Cursor agent mode. I think it's 4.0 max. Oh, sorry, it's probably 4.0 Sonnet right now. That's what I had baked in there. Go ahead. Love the questions. Keep them coming. How do you guys manage anything that requires human in the loop?
Because we have to deal with a lot of legacy infrastructure, but it's financial data, so an advisor wants to review it before it goes into a financial planning tool or something like that. A, how do you bring some sort of human-in-the-loop interaction into a browser? And B, because this is financial data, we also want to give users a clear view of what the agent has done.
Is there a way we can send that information, even if they can't interfere in the process? Yeah, I was just going to hop back to the slides. So not only do we have recordings available, and they can be turned on or off depending on data sensitivity, you can embed these recordings into a user-facing application and show your user what happened.
We also have this feature called the live view where you can embed an iframe and show exactly what's happening in the browser. And better yet, if someone wants to, a human wants to come in and click and type in on the live view, they can do that as well.
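A rough sketch of how you might embed that live view, assuming the Browserbase Node SDK's sessions.debug() endpoint and its debuggerFullscreenUrl field; check the current docs, since the exact names here may differ.

```ts
// Sketch of embedding the live view, assuming the Browserbase Node SDK exposes
// a sessions.debug() endpoint returning a debuggerFullscreenUrl field.
import Browserbase from "@browserbasehq/sdk";

const bb = new Browserbase({ apiKey: process.env.BROWSERBASE_API_KEY! });
const session = await bb.sessions.create({ projectId: process.env.BROWSERBASE_PROJECT_ID! });

// Fetch the live view URL for this session and hand it to the frontend.
const liveView = await bb.sessions.debug(session.id);
const iframeSrc = liveView.debuggerFullscreenUrl;

// On the frontend (e.g. React), render it as a browser inside a browser:
// <iframe src={iframeSrc} sandbox="allow-scripts allow-same-origin" />
```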
So it's not just for browser automation, it can be a browser co-pilot. And it's an iframe, so it's a browser inside a browser, which is kind of fun to see. Yeah, thank you. I think there's a question over here. Yeah, and we'll go to you guys next. So my question is kind of two parts.
One, have you dealt with captchas at all, and do you see future websites using similar strategies to defend against automation tools, so that only real users can use their website? Yeah, captchas. So for the longest time, there have never been good bots on the internet, and captchas were built to stop all bots.
But now there are good bots and bad bots. And at BrowserBase, we do offer captcha solving and proxies as something that's needed to browse the web. We have customers that use captcha solving against their own captchas, because they can't get their security ops team to agree on a good way to bypass them.
However, I think longer term, my friend Michael at WorkOS was just talking about captchas. Agent authentication is the path to avoid captchas. Most captchas we see at BrowserBase are when someone's logging in. Once you log in, you know who the agent is. You know who they're acting on behalf of.
And I'm really hopeful that solving captchas at BrowserBase is a short-term solution as we figure out how to do agent authentication on the internet longer term. But I'm down to talk about that afterwards. Come find me. I've got a minute and 42 seconds. I'm holding us to it. Yeah. So, I think my question is related to captchas, but during web navigation, does the website actually detect that it's a robot doing the navigation?
And does that increase the captchas coming up? Yeah, I think the way that captcha detection works is often based on your behavior. And what we advise our customers is: listen, you know, even though we provide the best stealth browsing features necessary, in the end, if you're a bad citizen of the internet, you are going to get blocked.
It's an inevitability. You can see this on LinkedIn. If you have an agent that's using LinkedIn, LinkedIn measures how many actions you take per minute. And if you're violating that, you know, you're going to get stopped. So what we advise our customers is: you need to be a good citizen of the internet first.
You need to try to obey robots.txt. You need to be careful what you're doing. And if not, you're going to have a really hard time, and no matter what we do at BrowserBase, we can't stop that. We can help with the simple things. But if you're doing something that's against the law or unethical, we don't really want that on our platform.