AI in Action 3 Oct 2025: AI Browser Agents: Navigating the Web with Vision and DOM

Chapters
0:00 Introduction to the Speaker and Computer Use Agents
2:26 Understanding Computer Use Agents and their Components
3:54 Browser Use: A DOM-First Approach for Web Automation
8:39 Magnitude: A Vision-First Approach for Web Automation
9:57 Applications in Software Testing and Self-Healing
12:31 Discussion on Visual vs. DOM-based Navigation
14:50 Challenges with Advanced Web Interactions and Anti-Scraping Measures
16:42 Balancing DOM and Vision-First Approaches
19:44 Live Demo: Claude Chrome Browser Extension
20:48 Claude's Challenges with Complex Tasks (e.g., Presentation Creation)
26:38 Exploring Claude's Capabilities with Live Examples
28:37 Current State of Computer Use Agents: Toys vs. Reliable Tools
29:48 Atomic vs. Open-Ended Interactions
32:30 Reverse Engineering Web Effects with Claude (Hypothetical)
33:27 Live Demo: Claude Explaining Inception
35:40 What Makes an Agent? Tools and Browser Interaction
37:25 Live Demo: Claude Summarizing "Attention Is All You Need" Paper
39:39 Premature Stopping in Agents and Prompt Tweaking
41:01 Claude's PDF Reading and Zooming Capabilities
42:01 Token Limits and Transparency Issues with Anthropic
43:16 Comparing Claude Extension to Perplexity Comet
49:16 Future Frontiers: Multi-Persona Agents and Swarming Models
52:28 Cost Reduction with Cheaper Vision Models
00:00:12.040 |
Crisp. Okay, cool. That is my shameless plug. My name is CJ. Hey everyone. I've been kind 00:00:21.940 |
of lurking around. I think I've been on the one that we did on Nano Banana a couple of 00:00:28.520 |
weeks back. So, just a little bit about me. Let's see. I think folks are trying to look 00:00:37.420 |
for the code. But yeah, background is in EE. I'm currently pursuing my master's in artificial 00:00:49.600 |
intelligence. Currently working for a software test company. And yeah, kind of dabbling here 00:01:02.420 |
and there in VLMs specifically, so visual language models, and also computer use agents. 00:01:11.320 |
So, I figured I'd do a session on them since I've been kind of playing around quite a bit. 00:01:17.220 |
And I don't know about you guys, but I also got access to Claude, the Chrome browser extension. 00:01:24.220 |
So, just kind of been playing around with that too. So, yeah. Fun fact, I was born in Hawaii. So, a little bit about me. 00:01:33.120 |
Just going to do kind of a quick overview with some of the open source agents that I've been kind of playing around with. 00:01:43.120 |
And then show you kind of some of the stuff that I've built into with it just to see how they work. 00:01:52.020 |
And then kind of walk through the Claude extension just to see what it looks like. So, if anyone has anything, let me know. 00:02:04.920 |
I don't know if I can see. Let me see if I can pull the chat over. Yeah, I can see the chat. 00:02:09.920 |
I can monitor chat for you if you'd like. Just like hop in if there's questions. 00:02:13.920 |
Okay. Yeah, that'd be great. Awesome. And yeah, I saw that. Yeah, I'm out here in Sacramento. 00:02:21.320 |
I'm actually in one of the northern suburbs of Sacramento. So, not too far. 00:02:29.220 |
Okay. So, I'm sure you guys have a pretty good background about computer use stuff. 00:02:36.020 |
You know, it's basically just agents with the ability to look at the screen, usually through screenshots, 00:02:44.320 |
you know, pass that to like a grounded type of LLM. 00:02:49.220 |
So, you know, your Claudes; I think OpenAI's GPT-5 is grounded too. 00:03:00.220 |
There's a couple of others, like UI-TARS, which is a Chinese one. 00:03:05.220 |
Basically, it can see what's on the screen and it can like convert that into XY pixels. 00:03:15.120 |
Sometimes there's other tools built into it to be able to look at the DOM or other tools 00:03:21.120 |
to basically sense the context of what you're trying to interact with. 00:03:25.120 |
And then on top of that, they have the ability to call tools to interact with the DOM for websites. 00:03:33.020 |
For computer use, specifically for operating systems, there's usually some sort of accessibility layer. 00:03:39.020 |
So, it can look at the accessibility layer and then also interact with it. 00:03:44.920 |
And then generally speaking, it also has stuff like keyboard and mouse interactions, being able to simulate those things and be able to call those specific tools. 00:03:56.920 |
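To make that concrete, here is a minimal sketch of the screenshot-to-action loop just described. It is an illustration, not any particular framework's API: `vlm_propose_action` is a hypothetical stand-in for a call to a grounded model (Claude, UI-TARS, etc.), and `page` is assumed to be a Playwright page object.

```python
# Hypothetical sketch of the screenshot -> grounded VLM -> action loop.
# vlm_propose_action() stands in for a real grounded-model call that
# returns an action plus XY pixel coordinates; `page` is a Playwright Page.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    x: int = 0       # screen coordinates proposed by the grounded model
    y: int = 0
    text: str = ""

def vlm_propose_action(screenshot: bytes, goal: str) -> Action:
    """Placeholder: send the screenshot and goal to a grounded VLM and
    parse its reply into an Action. Not a real API."""
    raise NotImplementedError

def run_agent(page, goal: str, max_steps: int = 20) -> None:
    # Cap the number of steps so a confused model can't loop forever,
    # the infinite-loop failure mode mentioned later in the talk.
    for _ in range(max_steps):
        action = vlm_propose_action(page.screenshot(), goal)
        if action.kind == "done":
            return
        if action.kind == "click":
            page.mouse.click(action.x, action.y)   # simulated mouse tool
        elif action.kind == "type":
            page.keyboard.type(action.text)        # simulated keyboard tool
```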
So, one of the ones that I'm going to show you right now is Browser Use. 00:04:01.920 |
If you haven't heard of it, theirs specifically looks at the DOM layer. 00:04:14.820 |
So, when it looks at a website, it's going into the HTML, the CSS, all of the X paths and stuff like that. 00:04:23.820 |
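For reference, a Browser Use script looks roughly like the sketch below. Treat it as the shape of the thing rather than exact code: the project's API has shifted across versions, and the LLM wrapper import and model ID here are assumptions.

```python
# Rough sketch of a Browser Use task. Each step, the agent reads the DOM
# state (interactive elements, XPaths, etc.) and lets the LLM decide.
import asyncio
from browser_use import Agent
from langchain_anthropic import ChatAnthropic  # LangChain wrapper assumed

async def main():
    agent = Agent(
        task="Visit these career sites and list the job titles on each one.",
        llm=ChatAnthropic(model="claude-sonnet-4-20250514"),  # model ID assumed
    )
    await agent.run()  # loop: read DOM, decide, act

asyncio.run(main())
```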
One of the projects I'm building is the ability to scrape career sites and look through them. 00:04:36.720 |
One of the reasons why I went down the computer use or browser use path is because I'm sure you guys know when you're looking at all of the different solicitations that companies put out, they're not standardized. 00:04:52.720 |
Some folks put the location in the description. 00:04:59.620 |
Even when you're navigating through websites to go find job boards, it's not at, you know, google.com/careers. 00:05:08.620 |
Sometimes they put it at much deeper paths, like google.com/careers/job-board. 00:05:15.120 |
And because of the non-standardization, having an agent that can navigate through all of that stuff is pretty useful. 00:05:21.620 |
Some of the drawbacks of using agents: they really need to be trained very well on user journeys. 00:05:31.520 |
And a lot of them aren't, because there's usually a planning step and then a navigation step. 00:05:43.420 |
And a lot of times these agents aren't really trained on all of those things. 00:05:48.420 |
So sometimes you'll hit infinite loops with some of the models where it's just trying to literally figure out how to navigate a website. 00:05:55.420 |
So in this particular example for browser use, this is their cloud platform specifically. 00:06:03.320 |
You can see I gave it a bunch of instructions to go through specific sites and to find what job is listed in each of these sites. 00:06:15.320 |
And it took about 15 minutes to complete, I think, five or six sites' worth of jobs. 00:06:25.220 |
So in this case I'm using Claude, the Sonnet 4 one specifically. 00:06:30.120 |
And then I'm going through all the specific steps here. 00:06:34.120 |
You can see all the different, you know, individual steps that it's taking, the thinking process behind each step, and how it's navigating through this particular website. 00:06:47.120 |
The challenge with this website, which, let me minimize this so it's kind of out of the way, this website, I believe, is careers at Uniswap. 00:07:00.020 |
Each one of these individual job solicitations is hidden under like an expander, and it gets stuck a couple of times trying to open up each expander, as you can see. 00:07:13.120 |
So the navigation training in this one is probably not that great for this type of website, but it manages to get through it eventually, like opens up engineering and opens up design several times before it completes it and then moves on to the next step. 00:07:30.020 |
So this is one example of browser navigation through this specific browser use platform. 00:07:42.920 |
And of course, you can change models. I think I tried it once with, let's see, o3. This particular run was with o3. And it just kept getting stuck on trying to navigate through the expanders. 00:07:58.920 |
It literally goes into customer experience, data science, customer experience, data science, and it just doesn't get anywhere with this particular task. So anyways, that's one of them. I can see that you guys are putting quite a bit of stuff into the chat. So yeah, I think it's all just chatter. There haven't really been any questions. 00:08:22.920 |
You're welcome to follow along with the chatter. 00:08:27.920 |
Cool, cool, cool. So this is one example. This is Browser Use. Again, this one just goes through the DOM and it collects all the DOM information before making a decision. 00:08:39.920 |
Another one that I've been looking at quite a bit is called Magnitude. And this one's more vision-first. So it definitely focuses more on, you know, using the VLM over using the DOM, and kind of falls back on the DOM itself. 00:08:58.920 |
Of course, all of this is being done through Playwright. That's been kind of the latest automation 00:09:05.920 |
platform library for a lot of these computer use 00:09:10.920 |
applications, these computer use systems. This one tends to be a little bit more focused on 00:09:19.920 |
making sure that the grounded model is the thing that drives a lot of the automation rather than going through the DOM. 00:09:29.920 |
It's a little bit more helpful. I don't know how much you guys do web tech stuff, but sometimes the DOM can be flaky versus going a more UI direction. 00:09:40.920 |
You can see here in the docs, it tells you a lot about what types of models to use, specifically in the compatible models. 00:09:54.920 |
The reason that I'm kind of getting into a lot of this stuff is because I'm in test and automation. 00:10:01.920 |
And so one of the things I'm seeing is that these types of browser use, computer use models are great for self-healing and maintenance. 00:10:13.920 |
And so the big reason why I've been kind of diving deeper into this stuff is to help with self-healing and maintenance. 00:10:22.920 |
And that's where a lot of this Playwright stuff comes into play. 00:10:25.920 |
But you can see here, it's even calling out, if you want to use a more open source model, to use Qwen 2.5 VL. 00:10:34.920 |
And so these vision-based models have the ability to find things and navigate the browser and the computer more effectively. 00:10:47.920 |
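Here is a sketch of why that helps with self-healing, contrasting a brittle DOM selector with a vision-first click. `ground_target` is hypothetical, standing in for a grounded vision model such as Qwen 2.5 VL; `page` is again assumed to be a Playwright page.

```python
# Contrast: brittle DOM-first selector vs. vision-first targeting.
def ground_target(screenshot: bytes, description: str) -> tuple[int, int]:
    """Placeholder: ask a grounded vision model where `description`
    appears on screen; returns pixel coordinates. Not a real API."""
    raise NotImplementedError

def click_dom_first(page):
    # Breaks whenever the markup churns -- the "flaky DOM" problem.
    page.click("#app > div:nth-child(3) button.btn-primary")

def click_vision_first(page):
    # Keeps working as long as the button still *looks* like a submit
    # button, which is the self-healing property described above.
    x, y = ground_target(page.screenshot(), "the Submit button")
    page.mouse.click(x, y)
```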
So yeah, this is another example of that. 00:10:51.920 |
And kind of how I've been building stuff: the application portion of it is just kind of wrapping all of the stuff around. 00:11:01.920 |
This is vibe coded, obviously, but wrapping it around like a site where you can use these in the back end. 00:11:09.920 |
So this is an example of something I've put together using Magnitude. 00:11:14.920 |
And the idea is to kind of get closer to something like Manus, 00:11:23.920 |
that big Chinese one where they have a bunch of agents that you can control. 00:11:35.920 |
So to get something of that level, where you can launch several agents at once 00:11:41.920 |
and be able to, you know, kind of run across different applications and be able to run a multitude of tests 00:11:51.920 |
and be able to, you know, utilize a bunch of browsers through these types of libraries. 00:11:59.920 |
So yeah, that's kind of the first portion of it, 00:12:03.920 |
just kind of talking about some of the browser-based automation and agents. 00:12:09.920 |
So yeah, kind of curious about what you guys think, 00:12:13.920 |
what you guys are noticing, stuff like that. 00:12:16.920 |
So Cable and I were in the chat just kind of bemoaning the state of navigating web UIs in general in scripts. Like, I know Playwright's the new big one, but in the past I've used other tools for it; I can't remember the names of them because it's been a while. 00:12:34.920 |
But like with Ajax stuff, it just gets tough. And then, I mean, I have trouble with that even when I know I'm building something for one site. 00:12:54.920 |
So it's interesting to think about how to have an agent navigate that in a more adaptable way. 00:13:03.920 |
Yeah, I am curious how well the visual navigation approach works, because I would assume that the DOM-based one is going to work really well on a very accessible website, which is like maybe 10% of the websites and applications out there. 00:13:20.920 |
And most people are using, you know, LLMs to generate code, or JavaScript libraries that don't do them any favors, and not investing in making this thing machine-navigable, which is the same problem that accessibility advocates have been bitching about for forever. 00:13:38.920 |
Like, it means that anyone not using eyes and a mouse to manipulate the web is struggling if you're using any assistive technology, but it also means that an agent trying to do the same thing is going to flail, like it'll struggle. 00:13:55.920 |
I can kind of show you how Magnitude runs just so you can see what it looks like as it's executing. 00:14:08.920 |
But this is a really basic run. And like the Magnitude website says, it's very focused on the visual aspect of it. 00:14:19.920 |
I don't really know how to answer your question, mostly because of the complexity of some of the development that goes into just setting some of this stuff up. 00:14:25.920 |
That's probably part of the challenges that I have noticed from getting into that stuff. 00:14:44.920 |
That was a super simple one, but it kind of took just two steps to get to this point. 00:14:48.920 |
The first part is, for folks who have been working in the web tech space, there are some very advanced anti-bot measures in this space. 00:15:00.920 |
And so just even trying to run some tests across some of the more accessible websites is already challenging, because the first step is that you can't even access those sites. 00:15:12.920 |
You know, robots.txt kind of blocking the scraping, anti-scraping measures, more advanced CAPTCHAs, Cloudflare; like, even trying to test this stuff, 00:15:29.920 |
the first step is, how do you even break into places that you can test, for like real production sites, and so on and so forth. 00:15:37.920 |
So I've had to stand up a lot of custom sites just to be able to test certain types of interactions that I would hope the agent could take. 00:15:44.920 |
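The robots.txt part of that gatekeeping is at least easy to check up front; the sketch below uses only the Python standard library, with a placeholder bot name and URLs. CAPTCHAs and Cloudflare challenges are a different story entirely.

```python
# Check robots.txt before pointing an agent at a path.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")  # placeholder site
rp.read()
if rp.can_fetch("MyAgentBot/1.0", "https://example.com/careers"):
    print("robots.txt allows crawling /careers")
else:
    print("robots.txt disallows this path for our user agent")
```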
And then of course, with more advanced graphics and, you know, animations and stuff like that, 00:15:54.920 |
this is something agents really struggle with, because right now the technology is: take a screenshot, analyze the screenshot, think about next steps, take the next step. 00:16:06.920 |
And so if there's anything that's close to real time, or anything that requires human reaction time or a human-like click-drag-and-drop, it immediately puts the agent into this weird spot where it doesn't know how to interact with it. 00:16:23.920 |
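For a sense of why, here is what a scripted click-drag looks like in Playwright: the whole gesture has to happen as one continuous motion, while a screenshot-think-act agent inserts a multi-second model round trip between every move. Coordinates are made up for illustration.

```python
# A continuous drag gesture that an agent's turn-based loop can't
# easily reproduce; each mouse call must follow the last immediately.
def drag(page, start, end, steps=20):
    sx, sy = start
    ex, ey = end
    page.mouse.move(sx, sy)
    page.mouse.down()
    page.mouse.move(ex, ey, steps=steps)  # smooth, human-like motion
    page.mouse.up()

# drag(page, (200, 300), (600, 300))  # e.g. a slider or a kanban card
```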
So yeah, I mean, that's a really good point, because as cool as this is to be able to set up very basic things, as soon as it gets even a little bit more advanced, this technology kind of breaks down. 00:16:40.920 |
So it sounds like there's sort of a two-pronged approach, where there's DOM exploration, but also render it, analyze the image, 00:16:51.920 |
and then figure out where to click and navigate from there. Is that right? How do you balance that? 00:16:58.920 |
Yeah, that's kind of like the two approaches right now. So Magnitude is very much more focused on looking at the screen and, like, understanding and getting context and insight from the screen. 00:17:13.920 |
And then Browser Use is very much focused on, you know, what's available in the DOM. Both of them have fallbacks. So I would say Magnitude is vision-first, and Browser Use is DOM-first, and then they fall back on each other 00:17:28.920 |
if the agent struggles to navigate any of that in any way. But how do you balance that? I don't know. That's a great question. I'm still trying to figure that out. 00:17:41.920 |
I kind of had a similar question there, because I haven't really explored this stuff a lot, but Anthropic's computer use uses... 00:18:00.920 |
Is that right? So the two approaches are DOM-based or VLMs, but some of them use a hybrid. And when you're talking about agents, are most of these agent frameworks, the newer ones, VLM-based, or is it hybrid, as you were just explaining? 00:18:23.920 |
I would say most of them are hybrid, because a lot of these browser use frameworks want you to prioritize using a VLM, because at a certain point, they need to look at the screen. 00:18:44.920 |
So I would say most, if not all, of these agents use a VLM as kind of the starting point; it's just what they prioritize first. From my use of the Browser Use framework, it tends to do DOM first, because that information is what it coalesces. 00:19:03.920 |
And then you have the option to turn on vision-based tasks. I would say the accuracy in the way it interacts with the browser goes way up if you have vision turned on, specifically for Browser Use. For Magnitude, 00:19:21.920 |
I'm pretty sure it's vision-first; it's almost a requirement that you use, like, Claude for the model to drive the browser. 00:19:42.920 |
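In code, that vision toggle on Browser Use is just a constructor flag; hedged, since the parameter name below is what I have seen in some versions, so check your version's docs. Same assumed imports as the earlier Browser Use sketch.

```python
# Same shape as the earlier Browser Use sketch, with vision enabled.
from browser_use import Agent
from langchain_anthropic import ChatAnthropic  # wrapper assumed, as before

agent = Agent(
    task="Find the careers page and list open roles",
    llm=ChatAnthropic(model="claude-sonnet-4-20250514"),  # model ID assumed
    use_vision=True,   # parameter name assumed from versions I've used
)
```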
Okay, so hopefully there's no questions from there. I was going to show you, let's see if I can find my Chrome. 00:20:03.920 |
So I was putting together a presentation the other day, and I wanted to use a template for my company to be able to put together the presentation. 00:20:19.920 |
And I figured, you know what, I had access to the Claude Chrome browser extension, just gotten waitlist access this week, and was basically like, okay, maybe I'll put the presentation together with the plugin, because, you know, I don't want to put together a bunch of PowerPoint presentations. 00:20:46.920 |
It really struggled with that for some reason. I gave it a prompt, and I was like, hey, can you put together this presentation for me 00:20:54.920 |
and, you know, make this slide about X. And it just could not figure that out. And I'm really curious if anyone else has gotten the Claude research preview for the Chrome extension, because 00:21:09.920 |
I almost find the performance of this much worse than some of the frameworks that I've shown you guys before. 00:21:17.920 |
But, you know, I figured we can do a live demonstration, or, you know, try to use this together and just see what it does, because I've maybe played with it for a total of 30 minutes. 00:21:35.920 |
But maybe I can do something like put together a presentation for the Latent Space. 00:21:49.920 |
So, you know, I can do something like that. But yeah, I think one of the cool things about this is, it's great for very basic and very repeatable actions. 00:22:17.920 |
But if it's complex, like putting together, you know, a PowerPoint, sometimes it just really doesn't understand how this works, because the model itself hasn't been trained with the interaction. 00:22:39.920 |
I will say, I didn't get to use this with 4.5 when it came out. So I'm wondering if 4.5 is enough, like they've done enough training for more agentic interactions with applications, to be able to do stuff like put together slides now. 00:22:59.920 |
But yeah, when I was doing this the other day, it really was confused about how to use this particular application. 00:23:16.920 |
So let me see if I can go... I think you also need Anthropic's Max plan. 00:23:27.920 |
But don't quote me on that. I'm not entirely sure. 00:23:44.920 |
I'll put this in the chat. Oh, that's not the website. 00:23:54.920 |
So yeah, if you want to sign up for the waitlist, that's, I think, somewhere on this page. 00:24:13.920 |
So yeah, this seems like it's gotten farther than I remember it doing. 00:24:19.920 |
It really struggled yesterday when, or not yesterday, the time when I first got 00:24:24.920 |
it to put together a company presentation, but, uh, looks like it's putting together a PowerPoint 00:24:31.920 |
presentation for our group, which is kind of fun. 00:24:37.920 |
There are some interesting inconsistencies when it comes to formatting and interacting 00:24:46.920 |
with applications for stuff where it has to do text input, because you can tell an AI has put it 00:24:55.920 |
together. But overall it's like a great way to start something, like most AI applications, 00:25:06.920 |
and then you have to go in and, of course, touch it up and improve it. But yeah, there you go. 00:25:15.920 |
It looks like it's doing a pretty decent job of trying to make a slideshow for us. 00:25:20.920 |
But yeah, that's kind of most of the browser use stuff that I wanted to show you guys at the moment. 00:25:35.920 |
If there are any comments or discussion points, I'm happy to open the floor up. 00:25:41.920 |
This is really interesting. Like, I saw the blog post drop the other day and I just didn't really pay attention to it, but this is actually cool to see the demo. 00:25:56.920 |
I'm really curious. I mentioned it in the chat, it might be kind of lost in the sauce, but this seems like a really interesting use case, 00:26:04.920 |
because the DOM is a huge mess, but the UI is pretty intuitive. So, like you mentioned job searches, where there's some non-trivial stuff happening, especially given they have very different layouts. 00:26:17.920 |
I'm curious what other kinds of web apps you've tried this on, like the full spectrum, some apps that are super heavy UI versus some stuff 00:26:26.920 |
that's a lot more conventional. Like, I can imagine this thing could navigate Hacker News and, you know, do some sentiment analysis on certain threads or something like that a lot easier. 00:26:38.920 |
Yeah, that's a great question. I'll be honest with you, I have not had this for too long of a time, but 00:26:46.920 |
I'm happy to do live demonstrations right now with you guys. If you can think of any examples that you want to run through, I'm more than happy to just do that. 00:26:55.920 |
We can have it stop making us a PowerPoint for the group. 00:27:00.920 |
Um, could we do like Claudeception and have it like go to Claude.ai and do some dumb shit? 00:27:06.920 |
Uh, we probably could. Let me see if I can get onto, uh, Anthropic. 00:27:15.920 |
Someone also brought up Perplexity Comet. 00:27:19.920 |
I actually have Perplexity Comet too. 00:27:31.920 |
Well, I have Comet on here, or if someone else has Comet and wants to run it for us as a live demo while I figure out how to... 00:27:50.920 |
I think I need to do a cheeky log-on for a second. 00:27:53.920 |
If anyone has Comet and wants to share, go for it. 00:28:11.920 |
Claudeception is an interesting idea. I also find myself wondering about using Claudeception to interact with like the Claude Prompt Tuning Workbench. 00:28:30.740 |
And you just get the AI to drive a bunch of stuff. It would be entertaining. 00:28:36.420 |
That said, as we talk about this, it all mostly feels like toys still. 00:28:41.620 |
At this point, when we're talking about these uses of computer use, I don't know if you found some areas where it's reliable, but it feels like it's a long way from being able to independently assign it a task. 00:28:56.080 |
It's much more like, okay, novelty, get this started for me and I'll come back and then I'll have to change a bunch of stuff or look up job board scraping. 00:29:05.240 |
Yeah. If that's reliable, I think it's potentially useful, but yeah, I don't know. I'm curious your sense of, like, is this... 00:29:14.240 |
We did a different one, it was not web-based, but it was a computer-based computer use one. 00:29:21.460 |
And there was a lot of conversation at that point around like, at what point does this become our way of interacting with our computers? 00:29:28.780 |
Where we're just talking to this thing and it goes and does things for us, and it felt then like we were a long way away from it. 00:29:35.380 |
I'm curious what your perception is having played with this for a while. 00:29:41.500 |
I would say my answer to that is it's great for atomic interactions. 00:29:54.080 |
I think the reason why scraping job boards felt easy is because there are only so many steps from a landing page to a career board. 00:30:08.360 |
Before it, you know, realizes that a certain page doesn't have a job board. 00:30:16.680 |
There are, like, N number of steps that you as a user can take before you realize, oh, this company doesn't have available jobs. 00:30:24.720 |
It's when the interaction that you assign it is so open-ended that it becomes almost toddler-feeling. 00:30:37.360 |
Um, and that's where kind of human intuition needs to step in and be like, okay, this is not how you do it. 00:30:43.580 |
This is the correct way you're supposed to do it. 00:30:45.580 |
And I think it just comes down to the corpus of training that it has on steps to interact with an application, and that application's use case. 00:31:00.900 |
If I'm going onto a website with the closed purpose of finding a career board, 00:31:09.640 |
that makes more sense than going onto a webpage and being like, you know, just browse around and see what it feels like, or try to understand what this company is trying to do. 00:31:25.520 |
Maybe that wasn't a great example, but I think that is where it starts to get kind of loosey-goosey. 00:31:32.160 |
The reason why I've been exploring it with testing, specifically software testing, is that typically the things that we test for tend to be closed. 00:31:46.920 |
So for example, you know, if you give it a user journey to go look for a job board, and you know that job board is supposed to be there and the agent can't find it, 00:31:58.580 |
that's a pretty closed problem for it to solve. But if you're doing a very open one, like, you know, tell me if this looks the best, 00:32:11.180 |
tell me if this website looks the best, then it really struggles, because it doesn't really understand what that concept of 'looks the best' is. 00:32:18.980 |
I don't know if I answered that question really well, but that's kind of been my experience with some of these browser use agents. 00:32:26.460 |
I wonder, so as a dev, if I saw a really cool site and told Claude, hey, how is that effect done, 00:32:43.940 |
would it look in the elements and be able to reverse engineer it? 00:32:53.340 |
So I'm kind of feeling it out before I let it just go wild on any site. 00:33:01.240 |
What would you guys like me to have Claude do with Claude? 00:33:14.300 |
Well, I guess we could start with having it explain the plot of Inception; that's hard enough for a human to do. 00:33:41.880 |
Well, and we should make sure it's not just a one-step thing, right? 00:33:46.120 |
So explain the plot of Inception, keep asking questions until you feel satisfied or something like that, or until you think you could explain it to a five-year-old. 00:34:27.340 |
Oh, I think I got, I think I, Oh, let me see. 00:34:53.760 |
While that's trying to talk to itself, I'm going to see if I have Comet. 00:35:09.900 |
You can see one of the tools in it is, uh, a waiting time. 00:35:31.140 |
So it'll set its own waiting time to make sure whatever it's doing is, is completed. 00:35:36.760 |
I think there was a question back a few minutes ago about what makes it an agent. 00:35:43.480 |
The agent part is the ability for it to use its own tools to interact with the browser that it, quote unquote, lives in. 00:35:54.400 |
And speaking of browsers, that was actually something that I also took a deep, deep dive into, which is all the browser services that are out there. 00:36:08.600 |
It's something that I don't really interact with on a daily basis. 00:36:13.540 |
But yeah, just kind of going into that space too, 00:36:17.560 |
and learning about what all the different browser providers are out there is kind of neat. 00:36:22.800 |
So I got to play with steel.dev, and also learning about headless Chromium and how Playwright interfaces with that, and the CDP protocol. 00:36:35.740 |
So generally speaking, what happens is that these libraries, Browser Use and Magnitude, recommend that you use kind of a browser provider for the agent to interface with. 00:36:58.740 |
But you can, of course, use your own headless Chromium instance and get, you know, CDP pipes into that. 00:37:06.060 |
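Connecting Playwright to one of those providers, or to your own headless Chromium, over CDP looks like the sketch below; the websocket endpoint is a placeholder that the provider (or your own Chromium launch) would hand you.

```python
# Attach Playwright to an already-running browser over CDP.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Placeholder endpoint: a provider like steel.dev (or `chromium
    # --remote-debugging-port=9222`) gives you the real one.
    browser = p.chromium.connect_over_cdp("ws://localhost:9222")
    context = browser.contexts[0] if browser.contexts else browser.new_context()
    page = context.new_page()
    page.goto("https://example.com")
    print(page.title())
```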
Yeah, it's really having a fun conversation with itself. 00:37:17.140 |
Yeah, at this point I'm just burning tokens to let it talk to itself, but it's just interesting. 00:37:26.520 |
I think I have Comet, so if you guys are interested in looking at that too, I can get that pulled up and running. 00:37:45.560 |
Can you do some kind of scraping task, like going to a feed or a paper, and just seeing how well it can pull it out? 00:37:55.680 |
Does anybody else have the max plan, but you still have to go on the waiting list for this? 00:38:13.840 |
You know, I should have just done that from Google: look up the 'Attention Is All You Need' paper and ELI5 it. 00:39:13.040 |
Oh, yeah, it is kind of inefficient. 00:39:26.140 |
It also looked like it just looked at the summary page and then decided to give me a summary. 00:39:35.840 |
It didn't actually look at the actual paper, which is something interesting that I've noticed, 00:39:45.080 |
something that you'll notice with running these agents: once it gets to a partial solution to your answer, it just decides to stop. 00:39:55.620 |
And we've had to do some prompt tweaking, and maybe some guardrailing, to help move it along. 00:40:04.320 |
But once it feels like it has a sufficient answer for you, sometimes it just decides to completely stop, which is not great when you need it to be continuing or doing something that's persistent. 00:40:18.060 |
So, how does it do with PDFs? 00:40:23.080 |
Well, let's see. Uh, open the PDF for the full paper, 00:40:29.000 |
read through it, and give me an ELI5 summary. 00:40:37.740 |
So, you know, what's interesting is, 00:40:52.500 |
because it's using a VLM, and I'm pretty sure it's using a VLM because it's reading through a PDF, 00:41:10.840 |
it can notice if it doesn't see text well. It will actually zoom in of its own accord, using its tools, and get a better read of what it's looking at, which I think is fascinating behavior for an agent. 00:41:33.320 |
Um, can it see only what you see in the browser window? 00:41:38.160 |
Yes, that is probably the thing that limits it the most: it's locked into whatever is available in the browser window. 00:41:49.200 |
What's really interesting is that, oh, actually, sorry, that was a different thought. 00:42:00.820 |
I bet I probably exceeded the token amounts that I have for my account. 00:42:06.280 |
So one of the downsides: you're kind of limited to the service. 00:42:11.560 |
Of course, there's still, what, how many pages, there's still 15 more pages in this PDF, and it just kind of crapped out. 00:42:37.580 |
This is one of the things with Anthropic: you know, there's not really a lot of transparency. 00:42:44.660 |
I pay for the $100 5x Max plan. 00:42:48.800 |
I don't know what the actual limitations are, because this is a, quote unquote, research beta. 00:43:00.740 |
So I don't know what the limitations are. 00:43:03.980 |
Sometimes I'll hit those, you know, connection errors, 00:43:07.400 |
and if I tell it to continue, it will. Sometimes I hit a really hard limit, and I don't know why I've hit the limit, and it just doesn't tell you. 00:43:15.080 |
So, yeah, some of the caveats for that. 00:43:19.420 |
So while that's running, I do have Comet up and we can tell it the same thing: 00:43:25.480 |
can you find the 'Attention Is All You Need' paper and give me, you know, an ELI5 summary. Uh, but you're not paying per token via the API? 00:43:43.800 |
Yeah, you're not. If you have the Max plan, you can just use their service. 00:43:59.100 |
That almost feels like Perplexity in a browser versus a browser use agent. 00:44:12.280 |
So like I said before, there's a little bit of premature stoppage with the agent. 00:44:20.480 |
So it decided eight pages out of 15 was enough for it to give you a summary. 00:44:25.340 |
Um, after reading through the full paper, here's the simple explanation. 00:44:37.000 |
Um, but again, it decided that eight pages was enough. 00:44:42.600 |
I don't know if it really captured the full extent of the paper. 00:44:45.280 |
And I think there are some built-in guardrails that Anthropic put in so that 00:44:54.580 |
they're not just burning tokens for folks who are using the Max plan. 00:45:03.580 |
But yeah, has anyone played around enough with Comet, 00:45:09.180 |
actually, there's an update, to know if it will actually do stuff like this, where it will drive the browser for you, or if it's just more like Perplexity in a browser? 00:45:33.400 |
I was going to ask why it stopped at eight pages 00:45:39.580 |
instead of going through all 15. 00:45:49.100 |
Probably because the rest of the pages are all appendix. 00:46:41.100 |
Do you think that Claude is just using the VLM? 00:46:46.100 |
Whereas Perplexity and some of these other ones are the hybrid? 00:47:06.100 |
Because I asked it to now go to the arXiv site and pull the paper. 00:47:11.100 |
And in the assistant itself, it's going through like its own computer. 00:47:18.100 |
I don't know if you guys can see this because it's a bit tiny and I, am I allowed to expand this? 00:47:27.100 |
Like a Perplexity browser use agent within the browser. 00:47:33.100 |
And then, you know, it almost feels like you're using the Perplexity app in the browser itself. 00:47:43.100 |
And it's doing the browsing within the tool rather than doing it through the browser, the Comet browser that Perplexity has given you, which is such a bizarre way to do it. 00:47:55.100 |
Do you think the 'oops' response is a result of Anthropic lacking transparency? 00:48:10.100 |
I could not tell you how Anthropic wants to make their product. 00:48:17.100 |
It's fascinating, though, it actually gives you like a seven part breakdown about the paper. 00:48:22.100 |
So, I mean, I would use that for my research stuff in the future. 00:48:27.100 |
But yeah, to answer your question, Brad, the Perplexity app doesn't feel like a browser use agent within the browser. 00:48:40.100 |
It's like its own tool, and Comet has just kind of reskinned it for you. 00:48:46.100 |
Whereas the Claude extension actually feels like it's using VLMs to look at stuff and interact with the browser. 00:48:55.100 |
I'm sure there's some DOM interaction, maybe DOM tool calling in the back end, to do interactions like clicks and stuff like that. 00:49:04.100 |
But it actually looks like Anthropic is using a VLM to look at the screen, so to speak. 00:49:17.100 |
It looks like we've got like five minutes left, so I'm going to stop sharing and then just kind of open up the floor for folks to discuss and have conversation. 00:49:29.100 |
All right, maybe at this point, as people come in: one, thank you for running the show today and for sharing this. Super interesting, and we appreciate it. 00:49:57.100 |
If this sparked your curiosity, maybe take that, go do some experimentation, go do some learning and then bring it back to us and do another session on this. 00:50:09.100 |
Maybe take it a little further, find a use case where this works really, really well, or one where it breaks in a very interesting way that is illuminating in some way. 00:50:17.100 |
But yeah, I do want to encourage you all to step up the way that Susia did and sorry, I probably just butchered your name. 00:50:27.100 |
And you know, lead us through one of these as we all like to say. 00:50:34.100 |
I think I said it once that people latched onto it and so now I'm repeating it every time because it's good, but it doesn't have to be, you know, polished. 00:50:46.100 |
And I think a lot of you all can do that yourself in whatever direction. 00:50:56.100 |
So, CJ, what are you looking forward to exploring and digging into deeper next? 00:51:02.100 |
Like now that you've shown us kind of a lay of the land, like what are some, what are some of your frontiers on learning about this that you're digging into next? 00:51:16.100 |
I would love to build an agent, or sorry, an application that is multiple personality agents that can crawl through a website and give you feedback or information after you've run it. 00:51:36.100 |
So being able to launch, like, five agents, where one has, like, a copy persona, where it literally just goes in and reads the website copy and gives you feedback on it. 00:51:49.100 |
And then have a different agent that looks at, because they're VLMs, aesthetics, and gives you information on best practices, like, oh yeah, this particular button didn't render 00:52:04.100 |
because your Tailwind wasn't implemented correctly, or something like that. 00:52:08.100 |
And, you know, have multitudes of these agents just be able to walk through and crawl through a website and then give you feedback. 00:52:16.100 |
That would be one cool project that I've thought of, that I probably will let die on the vine because I have so many projects going on. 00:52:26.100 |
And then the second thing is, I've been really getting into swarms, 00:52:32.100 |
so like biological, animal-based swarming. 00:52:37.100 |
And it's the idea of: how do you take small but cheap agents, or in this case models, that have vision capability, and allow for this kind of advanced navigating through websites, 00:52:58.100 |
but at a much, much reduced cost? Because 00:53:02.100 |
one of these agentic steps using Claude, uh, Claude Sonnet, 00:53:11.100 |
thirty cents is like the max I've seen, where it's just a really dense picture with a lot of DOM information and a lot of tokens being pushed into it. 00:53:23.100 |
But taking something super cheap, like Gemini Flash Lite, and just running a bunch of cheap agents to reduce the cost but get more information and complexity. 00:53:36.100 |
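A hedged sketch of that fan-out idea: `run_persona_agent` is a placeholder for one cheap vision-model-driven crawl (Gemini Flash Lite or similar), and the personas echo the multi-persona idea above.

```python
# Fan several low-cost persona agents out over a site concurrently.
import asyncio

async def run_persona_agent(persona: str, url: str) -> str:
    """Placeholder for one cheap vision agent crawling `url` with a
    single job (copy review, aesthetics check, ...)."""
    await asyncio.sleep(0)  # stands in for the real screenshot/act loop
    return f"[{persona}] feedback for {url}"

async def swarm(url: str) -> list[str]:
    personas = ["copy review", "aesthetics", "accessibility"]
    return list(await asyncio.gather(*(run_persona_agent(p, url) for p in personas)))

print(asyncio.run(swarm("https://example.com")))
```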
Those are some of the things that I've been thinking about and interested in, but yeah, good question, Zach.