
Stanford CS25: V5 | The Advent of AGI, Div Garg



00:00:00.000 | Today we have our co-instructor, Div, talking about human-inspired approaches to agents and
00:00:10.800 | how the path to AGI requires a rethinking of how we design, evaluate, and deploy intelligence.
00:00:15.760 | Div Garg is the founder and CEO of AGI Inc., a new applied AI lab redefining AI-human interaction
00:00:22.900 | with the mission to bring AGI into everyday life.
00:00:26.000 | Div previously founded MultiOn, the first AI agent startup, developing agents that can interact with computers and assist with everyday tasks, funded by top Silicon Valley VCs.
00:00:36.940 | Div has spent his career at the intersection of AI, research, and startups, and was previously a CS PhD student here at Stanford, focused on RL.
00:00:46.440 | His work spans across various high-impact areas, ranging from self-driving cars, robotics, computer control, and Minecraft AI agents.
00:00:54.520 | With that, I'll hand it to him to take it away.
00:00:57.200 | Yes, excited to be here.
00:01:00.680 | Great.
00:01:02.140 | So, yeah, excited to be here.
00:01:03.900 | And the topic for this lecture is we wanted to talk about a lot of new things that are happening in the AI world right now.
00:01:12.360 | So, there's been a lot of developments with agents and all the new models that are coming out.
00:01:17.980 | And it seems like we already have some sort of superintelligence when it comes to chat and reasoning, compared to average humans.
00:01:26.740 | And it's going to be very interesting over the next few years as we figure out: what does intelligence look like?
00:01:34.500 | What is something like AGI?
00:01:38.020 | And what's the form factor?
00:01:39.420 | How can this be something that's useful?
00:01:40.860 | And how will this be applied in society?
00:01:43.240 | Cool.
00:01:47.300 | So the first thing we want to touch on is: what does AGI look like?
00:01:52.140 | AGI is such an abstract concept right now.
00:01:55.060 | No one has really visualized it or given it a concrete meaning.
00:01:57.780 | Is it some sort of supercomputer?
00:02:00.780 | Is it just ChatGPT but 10x better?
00:02:03.100 | Is it something that's more of a personal companion?
00:02:05.320 | Is it something that's embedded in your life?
00:02:07.620 | That's not clear yet.
00:02:09.060 | And those are the kinds of questions I think we really need to go and figure out.
00:02:12.340 | This is one diagram of how AI agents work.
00:02:20.040 | This architecture is from an OpenAI researcher, Lilian Weng.
00:02:23.580 | She recently left and joined a new company.
00:02:26.280 | It shows how you can think about agents and how they can be broken down into different subparts.
00:02:34.100 | And there's a lot of different things that you require to make these agents work.
00:02:38.480 | So, the first layer is memory.
00:02:40.240 | You need some sort of short-term memory.
00:02:42.960 | You also want some sort of long-term memory.
00:02:44.300 | The short-term representation is maybe the chat window, if you're using something like ChatGPT.
00:02:49.640 | And you might also have a personal history of the user:
00:02:52.620 | okay, this is maybe what the user likes,
00:02:54.700 | and this is what they don't like.
00:02:55.480 | The second thing that you need is tools.
00:02:58.040 | You want this kind of agent to be able to use tools the way humans use tools.
00:03:03.560 | So you want them to be able to use calculators,
00:03:06.160 | calendars, web search, coding, and so on.
00:03:10.740 | The third part here is advanced planning.
00:03:13.920 | That means you want the agents to be able to use reflection, so that if something goes wrong they have failover mechanisms, can error-correct, and recover.
00:03:22.560 | You want self-criticism, and you want decomposition, where you have chains of thought so the agent can run its own reasoning loops.
00:03:30.600 | It can also break a complex task down into sub-goals.
00:03:34.280 | And the final, fourth ingredient is actions, where you want these agents to act on your behalf and go do things.
00:03:40.200 | At a high level, this encapsulates what agents fundamentally look like.
00:03:46.080 | And as these systems become more powerful over time, they will eventually lead to something like AGI.
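To make those four ingredients concrete, here is a minimal sketch of that agent loop in Python. The `call_llm` callable, the `Tool` class, and the memory fields are hypothetical placeholders (not any specific framework from the lecture); the point is just how memory, tools, planning, and actions fit together in one loop.

```python
# Minimal agent-loop sketch: memory + tools + planning + actions.
# `call_llm` and the tool implementations are hypothetical stand-ins.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]          # takes an argument string, returns an observation

@dataclass
class Agent:
    tools: dict[str, Tool]
    short_term: list[str] = field(default_factory=list)   # chat-window style memory
    long_term: list[str] = field(default_factory=list)    # persistent user facts

    def step(self, task: str, call_llm: Callable[[str], str]) -> str:
        for _ in range(10):                                # bounded reasoning loop
            prompt = (
                f"Task: {task}\n"
                f"User facts: {self.long_term}\n"
                f"History: {self.short_term}\n"
                f"Tools: {[t.description for t in self.tools.values()]}\n"
                "Reply with 'TOOL <name> <args>' or 'FINAL <answer>'."
            )
            decision = call_llm(prompt)                    # planning / reflection happens in the model
            self.short_term.append(decision)
            if decision.startswith("FINAL"):
                return decision.removeprefix("FINAL").strip()
            _, name, args = decision.split(" ", 2)         # action: call a tool
            observation = self.tools[name].run(args)
            self.short_term.append(f"Observation: {observation}")
        return "Gave up after 10 steps."
```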
00:03:56.320 | This is one thing that we're also building.
00:03:58.760 | So I recently started this new AI lab called AGI Inc.
00:04:04.660 | And we're looking a lot into what AGI looks like for everyday purposes and how it can be applied to daily life.
00:04:11.080 | This is one of the demos of some technologies we built in the past.
00:04:18.400 | This shows how an AI agent can be applied in the real world.
00:04:21.260 | So this is a bit old.
00:04:24.220 | And this shows, like, how an AI agent can be applied to pass a real driving test in California.
00:04:29.340 | And so this is, like, an actual DMV test that the agent took.
00:04:33.940 | And then let me share the screen and talk about the setup.
00:04:37.880 | So in this screen, what's happening is there's someone attempting the DMV online test.
00:04:46.920 | And there's a human who has their hands over the keyboard.
00:04:49.500 | They're not actually touching the screen.
00:04:50.960 | And it's the agent that's going and taking all the exams.
00:04:53.380 | And there are about 40 questions in this test.
00:04:55.500 | And the agent's really good,
00:04:57.740 | so it can go and pass the whole thing.
00:04:58.960 | And we did this live.
00:05:00.780 | So the DMV was actually screen recording what we were doing.
00:05:05.300 | They were also watching the person on camera.
00:05:07.460 | But even then, the agent was able to get through the whole setup undetected and pass the exam.
00:05:13.480 | So this was really fun.
00:05:16.280 | We did this as a testing attempt,
00:05:17.680 | so we informed the DMV afterwards that we had done it.
00:05:20.340 | Funnily enough, they actually sent us a driving license afterwards.
00:05:23.960 | So that was really fun, actually.
00:05:25.680 | At the end, the agent is able to pass and get a full score on this test.
00:05:32.800 | And so, yeah.
00:05:34.340 | So this is a very fun experiment showing how agents can be applied in the real world,
00:05:37.440 | and there are so many things that are possible.
00:05:40.660 | In this vein of how we can make agents more useful and apply them in real life,
00:05:44.860 | we've been working on a lot of different efforts along with the rest of the AI community.
00:05:51.840 | One of those things is agent evaluations.
00:05:53.900 | How can we evaluate these kinds of agents in the real world,
00:05:57.320 | and make sure we have standards and benchmarks that let us know how well these agents work on different websites and use cases?
00:06:05.240 | How can we trust them?
00:06:06.500 | How can we know where to deploy them and how to use them?
00:06:10.660 | Another thing we have been doing is agent training.
00:06:13.560 | Can we train agents to do advanced planning, self-correct, and improve themselves?
00:06:19.540 | And this uses a combination of reinforcement learning and a bunch of other advanced techniques.
00:06:24.260 | And finally, we have also been looking a lot into agent communication:
00:06:28.280 | how can you have agents communicate with other agents?
00:06:31.520 | There have been a lot of new breakthroughs in this area recently.
00:06:34.520 | If you've looked at the Model Context Protocol, MCP, that's a very new thing that has been coming out.
00:06:39.960 | Similarly, there's a lot of work around A2A, Google's agent-to-agent communication protocol that recently came out.
00:06:45.560 | We have also been working on an open-source project called Agent Protocol, which allows different kinds of agents to communicate with each other.
00:06:56.620 | So you can have a coding agent that talks to a web agent that talks to an API-based agent, and so on.
00:07:01.140 | And that allows you to do much more complex things than what's possible with just a single agent.
00:07:05.920 | Cool.
00:07:09.440 | So before we dive deeper into how a lot of these things work, I want to bring out: why do we need agents?
00:07:17.800 | Why are they useful?
00:07:19.200 | Why do we actually want to go and build them?
00:07:20.620 | There's a lot we need to think about here.
00:07:24.360 | I'll touch on a lot of different topics in the introduction, going from the architectures, to building more human-like agents using computer interactions, to memory, communication, and future directions.
00:07:38.920 | So when you think about building agents, there are a lot of questions you have to answer.
00:07:46.000 | The first one is: why is this useful?
00:07:48.880 | How can you actually build them?
00:07:49.860 | What are the different building blocks?
00:07:51.680 | And finally, what can you do with them?
00:07:53.560 | To first answer the why question, we have this key thesis: agents will be more efficient at interfacing with computers in the digital world than humans are.
00:08:09.320 | And that's the reason we want to apply agents to do things for us.
00:08:14.780 | So you can imagine you have an army of fully digital virtual assistants that can go and do whatever you want on your behalf,
00:08:22.560 | and you can just talk to them through a human interface.
00:08:24.820 | That's the vision we have been moving towards.
00:08:29.780 | I also have a blog post about this, called Software 3.0, that you can check out, which touches on some of these ideas.
00:08:36.600 | Cool.
00:08:41.620 | So why do we want to go and build agents? Usually, large language models on their own are not good enough.
00:08:46.540 | We want action capabilities that let us unlock more productivity and go do things.
00:08:51.100 | And this also allows us to build more complex systems.
00:08:53.900 | There are a lot of techniques involved in actually building this, such as chaining different models together, reflection, and a bunch of other mechanisms.
00:09:05.140 | And as I showed before in the architecture slide, there are a lot of different components: memory, actions, personalization, access to the internet, and so on.
00:09:17.480 | And finally, the question becomes: what are the different applications you can apply them to?
00:09:23.120 | There's also the question of why we want to build human-like agents.
00:09:28.820 | Why can't we just have API agents,
00:09:30.940 | or a bunch of other kinds of agents you can imagine that don't mimic human interactions?
00:09:37.540 | One reason to push towards more human-like agents is that these agents can operate interfaces the way we do.
00:09:45.140 | The internet, the web, and computers are designed for humans.
00:09:49.360 | They're designed for keyboard and mouse interactions, so that we can navigate interfaces ourselves.
00:09:56.860 | And if agents are able to use interfaces the way we do, that lets them directly do a lot of things without changing how current software works.
00:10:06.120 | And that becomes very, very effective, because it allows you to work on 100% of the internet without any bottlenecks.
00:10:12.880 | If you think about APIs, only about 5% of APIs on the internet are public and accessible,
00:10:18.860 | and it's very hard to build agents that are fully reliable over APIs.
00:10:21.580 | So there's a lot of contention between human-like agents and API agents,
00:10:25.560 | and that's an ongoing battle right now.
00:10:31.160 | The second thing is that you can imagine human-like agents becoming a digital extension of you.
00:10:36.840 | They can learn about you.
00:10:38.060 | They can have context about you.
00:10:39.160 | They can do tasks the way you would do them.
00:10:41.140 | They also have less restrictive boundaries.
00:10:45.560 | These kinds of human-like agents can handle logins.
00:10:49.040 | They can handle payments.
00:10:49.840 | And they're able to interact with any service without restrictions in terms of API access.
00:10:55.980 | So you don't need to pay to use an API, and you don't need to go to a service provider and ask them for access to it.
00:11:03.020 | You can just use the interface the way you normally do.
00:11:06.160 | And the final thing is that there's a very simple action space:
00:11:11.460 | these agents only need to learn how to click and type.
00:11:14.000 | If they're able to do that very effectively, they can generalize to any sort of interface.
00:11:17.360 | And they can also improve over time:
00:11:21.940 | the more you teach them and the more data you give them,
00:11:25.160 | they can learn from user recordings and feedback, and become better and better over time.
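A quick sketch of what such a click-and-type action space can look like in code. The action types and the dispatcher below are illustrative assumptions, not the actual schema of any particular agent mentioned here.

```python
# Hypothetical minimal action space for a computer-use agent:
# everything the agent does reduces to clicking, typing, and scrolling.
from dataclasses import dataclass
from typing import Union

@dataclass
class Click:
    x: int            # pixel coordinates on the screenshot
    y: int

@dataclass
class Type:
    text: str         # keystrokes to send to the focused element

@dataclass
class Scroll:
    dy: int           # positive = down, negative = up

Action = Union[Click, Type, Scroll]

def execute(action: Action) -> None:
    """Dispatch an action to whatever driver controls the browser or OS.
    The driver itself (Playwright, an OS-level controller, ...) is out of scope here."""
    if isinstance(action, Click):
        print(f"click at ({action.x}, {action.y})")
    elif isinstance(action, Type):
        print(f"type {action.text!r}")
    elif isinstance(action, Scroll):
        print(f"scroll by {action.dy}")

# Example: the plan an agent might emit for a search box.
for a in [Click(x=512, y=88), Type(text="running shoes"), Scroll(dy=400)]:
    execute(a)
```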
00:11:29.380 | So when it comes to API agents versus more direct computer-control agents, this is how we think about the pros and cons.
00:11:42.340 | API agents are usually easier to build.
00:11:45.420 | They are more controllable.
00:11:47.280 | They are safer.
00:11:49.380 | But APIs have higher variability,
00:11:51.680 | so you have to build a different agent for each API.
00:11:54.720 | And APIs can keep changing,
00:11:56.700 | so you never have a full guarantee that the agent will always work 100% of the time.
00:12:00.940 | When it comes to the more direct-interaction, computer-control agents, it's easier to take actions,
00:12:07.700 | and the interactions are more free-form,
00:12:12.360 | because you're not restricted by API boundaries.
00:12:14.260 | But it's also harder to provide guarantees,
00:12:16.700 | because you don't know what the agent will do.
00:12:18.400 | So if anyone here has played with agents like Operator, it's a work in progress.
00:12:24.680 | It's not clearly there yet.
00:12:26.060 | There are a lot of issues it runs into.
00:12:27.960 | And that's roughly where agents are right now.
00:12:34.640 | There are also different levels of autonomy when you think about agents.
00:12:37.480 | This usually goes from level 1 to level 5.
00:12:40.200 | So, level 1 to level 2 is when a human is in control.
00:12:43.980 | And the agent is acting like a copilot.
00:12:45.940 | So, it's helping the human.
00:12:47.080 | So this is something like using a code editor like Cursor.
00:12:52.020 | That's an L2 agent, where you have partial automation and the human is in control.
00:12:56.660 | The human is directing the code,
00:12:57.660 | but the agent is helping them.
00:12:59.120 | Something like L3 is where there's still a human fallback mechanism,
00:13:05.060 | but the agent is in control.
00:13:06.620 | This is like using Cursor Composer or Windsurf or any of the newer, more agentic code editors.
00:13:11.620 | The agent is writing most of the code,
00:13:13.080 | but a human is monitoring it and giving it feedback:
00:13:14.700 | okay, this went wrong,
00:13:15.740 | can you correct that for me,
00:13:16.580 | can you fix this issue?
00:13:17.540 | And that is more of an L3 system.
00:13:20.600 | And then you have more advanced systems, which are L4 and L5.
00:13:23.220 | In L4 systems, you don't have a human in the loop,
00:13:26.140 | so it's the agent that's going and doing everything.
00:13:28.260 | You might still have some sort of automated fallback layers.
00:13:30.980 | So if you look at Waymo in SF, that's an L4 system, because the self-driving car is driving itself,
00:13:35.500 | but there are human operators remotely monitoring it, making sure nothing goes wrong.
00:13:39.900 | And when you have an L5 system, there are no humans in the loop.
00:13:43.920 | There's no monitoring,
00:13:44.800 | and the AI agent is able to operate fully autonomously and independently.
00:13:49.520 | So when we are building these agents, one hard thing is trust.
00:14:01.580 | How do we trust that these agents are actually going to go do what we want them to do?
00:14:05.320 | How can we go and deploy them in the real world?
00:14:07.120 | To solve these issues, one effort we have been building is a miniature version of the internet, where we have cloned the top 20 websites on the internet,
00:14:19.740 | and we are benchmarking how agents perform on all these interfaces.
00:14:24.300 | This is actually live, so you can go check it out at realevals.xyz.
00:14:30.040 | What we have done is build digital clones of websites like Airbnb, Amazon, DoorDash, and LinkedIn.
00:14:37.420 | The agents can go and navigate these interfaces on predefined tasks, and you get a final score.
00:14:42.540 | This is showing the evaluation results for GPT-4o.
00:14:51.180 | We find that GPT-4o is actually not very good when it comes to being agentic,
00:14:56.580 | and it only reaches a 14% success rate in this case.
00:15:01.200 | We tried this on 11 different environments that we are showing on the right.
00:15:06.440 | Those are our different environments:
00:15:13.840 | we have DashDash, which is our DoorDash clone, and OmniZone, and so on.
00:15:17.200 | So you can actually go and check this environment out.
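Conceptually, a benchmark like this boils down to running each agent over a fixed task set per cloned site and scoring completion. Below is a minimal sketch of that evaluation loop; the environment and agent interfaces are hypothetical, not the actual API of the benchmark described in the lecture.

```python
# Sketch of an agent evaluation harness over cloned-website environments.
# Environment/agent interfaces are illustrative assumptions, not a real benchmark API.
from dataclasses import dataclass
from typing import Callable, Protocol

class Env(Protocol):
    def reset(self, task: str) -> str: ...                 # returns initial observation (e.g. page state)
    def step(self, action: str) -> tuple[str, bool]: ...   # returns (observation, done)
    def success(self) -> bool: ...                          # did the task's goal condition hold at the end?

@dataclass
class Result:
    env_name: str
    task: str
    success: bool

def evaluate(agent: Callable[[str], str], envs: dict[str, Env],
             tasks: dict[str, list[str]], max_steps: int = 30) -> list[Result]:
    results = []
    for name, env in envs.items():
        for task in tasks[name]:
            obs = env.reset(task)
            for _ in range(max_steps):
                obs, done = env.step(agent(obs))
                if done:
                    break
            results.append(Result(name, task, env.success()))
    return results

def success_rate(results: list[Result]) -> float:
    return sum(r.success for r in results) / max(len(results), 1)
```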
00:15:24.440 | We also compare a lot of the open-source frameworks out there.
00:15:28.200 | One of them is the OpenAI computer-use model that's powering Operator.
00:15:33.820 | We actually find it's not very good when it comes to these tasks.
00:15:37.140 | It only reaches at most about 20% accuracy on some of the environments, like our email and calendar environments,
00:15:44.820 | but on a lot of the other environments it's not able to do very well.
00:15:48.720 | We also tried a bunch of other frameworks out there:
00:15:52.720 | Stagehand, if you've seen that, which is an open-source framework for automating web agents; Browser Use; and one of our own custom agents, which we are calling Agent0.
00:16:02.320 | And we find agents are still early when it comes to actually automating a lot of these interfaces.
00:16:07.460 | We are able to reach maybe up to a 50% success rate,
00:16:11.700 | but a lot of the agents are actually failing when you apply them to these real-world websites.
00:16:15.720 | Similarly, we benchmark all the different models that are available, including all the closed-source APIs and all the open-source models.
00:16:28.180 | And we find, again, that on these agentic tasks most models are doing decently well, but no one is really, really good right now.
00:16:36.860 | The maximum success we have seen is with Claude 3.7, where it can reach around 40% accuracy.
00:16:43.240 | Gemini 2.5 and o3 follow very closely behind it.
00:16:49.660 | The other models tend to taper off.
00:16:51.460 | So the interesting learning for us has been that a lot of these models are not fully ready to be deployed in the real world.
00:16:59.360 | Because if you have, say, an agent that's powered by Claude and you apply it, you can only expect about a 41% chance that it will actually go and do what you want it to do.
00:17:08.700 | And that's not good enough.
00:17:11.160 | And this brings up the question: what is required to make these agents even better?
00:17:16.620 | How can they improve, and how can they be applied to your actual practical use cases?
00:17:21.420 | And so this brings us to our next topic for the lecture: how can we train agentic AI models?
00:17:32.300 | How can we have models that are custom fine-tuned and better at these agentic tasks?
00:17:40.780 | This is one of our past works called Agent Q, which is a self-improving agent system.
00:18:51.080 | So, this is Agent Q.
00:19:05.760 | It's a system that can self-improve.
00:19:07.700 | It can learn through corrections and planning.
00:19:11.420 | The way the system works is that it's able to self-correct.
00:19:14.960 | Whenever it makes a mistake,
00:19:16.580 | it can save that mistake in its memory,
00:19:18.780 | and it's able to use that to do a lot of
00:19:22.100 | trial-and-error learning, similar to humans.
00:19:24.280 | So, the first time you learn how to ride a bike,
00:19:28.120 | you make a lot of mistakes and you fall over a lot of times,
00:19:30.140 | but over time you're able to improve your policy and
00:19:32.180 | do it really well.
00:19:34.000 | We apply similar mechanisms to make
00:19:36.160 | these agents actually work really,
00:19:37.240 | really well in the real world.
00:19:39.140 | What's happening in the system is the agent can
00:19:43.080 | explore the space of interfaces and see
00:19:47.880 | which of the things it did went wrong
00:19:50.000 | and which things went right.
00:19:51.200 | And it's able to use reinforcement learning to
00:19:53.100 | self-improve and become better and better.
00:19:55.380 | So, Agent Q combines a lot of different techniques.
00:19:58.480 | The first method is Monte Carlo tree search.
00:20:00.800 | This is borrowed from other RL systems like AlphaGo,
00:20:05.200 | and it allows you to plan over the search space of a task
00:20:09.100 | and unlock advanced reasoning.
00:20:11.260 | The second thing we do is self-critic mechanisms.
00:20:15.980 | The agent can self-verify and get feedback whenever it makes
00:20:19.380 | a mistake, and it's able to learn from that feedback.
00:20:22.220 | And finally, we use RLAIF techniques like DPO,
00:20:27.480 | direct preference optimization, to be able to improve the agent using RL.
00:20:34.600 | And by combining all three of these techniques together,
00:20:37.220 | we are able to build some very powerful systems.
00:20:40.020 | Agent Q is also available on arXiv as a research paper,
00:20:45.760 | so you can go and check it out.
00:20:46.720 | So, for the sake of time, I will skip some of the slides here.
00:20:58.180 | But how Agent Q normally works is we have this Monte Carlo tree search,
00:21:04.100 | where the agent is exploring the different states.
00:21:08.240 | It's estimating rewards:
00:21:10.900 | okay, if we were to visit this state,
00:21:13.420 | what's the expected value of the future predicted reward?
00:21:18.020 | And based on that, it's able to improve its prediction model:
00:21:22.260 | should we take this path or a different path in the tree?
00:21:26.660 | And then over time,
00:21:31.720 | the agent can become very good at exploring the right states,
00:21:35.040 | and figuring out which paths in
00:21:37.020 | the state space are right and which are wrong.
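The core bookkeeping in a tree search like this is just visit counts and running value estimates per state, plus a rule (e.g. UCB) for deciding which branch to explore next. Here is a stripped-down sketch of that selection-and-backup step; it's a generic MCTS skeleton under those assumptions, not the exact formulation from the Agent Q paper.

```python
# Generic MCTS-style bookkeeping: pick a child by UCB, then back up the observed reward.
import math

class Node:
    def __init__(self, state):
        self.state = state
        self.children = []       # expanded successor Nodes
        self.visits = 0
        self.value = 0.0         # running mean of rewards seen through this node

def ucb_score(parent: Node, child: Node, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")      # always try unvisited actions first
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return child.value + explore

def select_child(node: Node) -> Node:
    return max(node.children, key=lambda ch: ucb_score(node, ch))

def backup(path: list[Node], reward: float) -> None:
    # Propagate the final task reward (e.g. 1.0 = booked the reservation, 0.0 = failed)
    # back up the visited path, updating running means.
    for node in path:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
```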
00:21:40.040 | We also do self-critic mechanisms.
00:21:45.720 | What happens in this case is that you have a particular task,
00:21:52.560 | where say a user says: book me a reservation for a restaurant,
00:21:56.480 | The Chow, on OpenTable, for two people on August 14, 2024 at 7:00 PM.
00:22:02.600 | And this is the current state of the screen, where you can see the screenshot.
00:22:05.760 | Then the agent can go and propose a bunch of different actions.
00:22:10.640 | It can choose to go and select the date and time.
00:22:12.940 | It can also choose to select the number of people and then open the date selector.
00:22:17.440 | It can instead search for the Terra Italy,
00:22:21.200 | Silicon Valley restaurant and type that in the search bar,
00:22:24.580 | or it can maybe decide to go to the OpenTable homepage.
00:22:29.960 | The way the self-critic mechanism works is that all these proposed actions are passed to a critic network.
00:22:37.760 | The critic LLM is able to go and predict, okay, what's the best action to take?
00:22:43.800 | And it's able to give a ranking order:
00:22:45.760 | okay, this is the best action that we should go and use,
00:22:48.600 | so this is rank one, this is rank two, and this is rank three.
00:22:52.680 | And based on that,
00:22:53.480 | we can optimize the system to take the correct actions and improve over time.
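In code, that critic step is essentially: show the critic model the state plus the candidate actions, and ask it for a ranking. The prompt format and the `call_llm` helper below are assumptions for illustration, not the actual Agent Q implementation.

```python
# Hypothetical self-critic step: an LLM ranks the actions proposed by the actor.
from typing import Callable

def rank_actions(state: str, proposed_actions: list[str],
                 call_llm: Callable[[str], str]) -> list[str]:
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(proposed_actions))
    prompt = (
        "You are a critic for a web agent.\n"
        f"Current page state:\n{state}\n\n"
        f"Proposed actions:\n{numbered}\n\n"
        "Rank the actions from best to worst for completing the task. "
        "Reply with the indices separated by commas, e.g. '2,1,3'."
    )
    reply = call_llm(prompt)
    order = [int(tok) - 1 for tok in reply.split(",") if tok.strip().isdigit()]
    # Fall back to the original order if the critic's reply can't be parsed.
    if sorted(order) != list(range(len(proposed_actions))):
        order = list(range(len(proposed_actions)))
    return [proposed_actions[i] for i in order]

# Top-ranked vs. lower-ranked actions can then be logged as preference pairs for DPO.
```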
00:22:57.940 | And finally, we use reinforcement learning from human feedback,
00:23:04.880 | where we use methods like GRPO and DPO,
00:23:08.220 | which are different RL algorithms,
00:23:09.880 | to take all the failed and successful trajectories you've collected so far
00:23:17.260 | and improve the agent over them.
00:23:18.560 | DPO is a technique related to RLHF,
00:23:34.380 | where you can train an LLM using preference data, in this case pairs of failures and successes,
00:23:39.480 | and use that to improve the model overall.
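As a reference point, the DPO objective on a preference pair (a chosen response y⁺ and a rejected one y⁻) is a logistic loss over the difference of log-probability ratios between the policy being trained and a frozen reference policy. The snippet below is a generic DPO loss sketch under that textbook formulation, not code from Agent Q; log-probabilities are assumed to already be summed over the tokens of each response.

```python
# Generic DPO loss sketch: prefer the chosen response over the rejected one,
# measured relative to a frozen reference model. Inputs are summed token log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,    # log pi_theta(y+ | x), shape [batch]
             policy_rejected_logp: torch.Tensor,  # log pi_theta(y- | x)
             ref_chosen_logp: torch.Tensor,       # log pi_ref(y+ | x)
             ref_rejected_logp: torch.Tensor,     # log pi_ref(y- | x)
             beta: float = 0.1) -> torch.Tensor:
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between chosen and rejected; beta controls its sharpness.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```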
00:23:40.880 | And so this is how Agent Q works:
00:23:46.700 | we run this Monte Carlo tree search to create trajectories of successes and failures.
00:23:53.600 | We then use self-critic mechanisms to identify which of the proposed actions actually succeeded and which failed.
00:24:00.300 | And then we pass them through DPO to actually go and optimize the network.
00:24:06.780 | So here is an example of how this works.
00:24:10.680 | The agent starts in the first state, and the task in this case is to go and book a restaurant reservation on OpenTable.
00:24:27.300 | First, it makes a mistake and goes to the homepage.
00:24:32.300 | Then it recognizes that it made a mistake and can backtrack.
00:24:35.600 | The blue arrow here shows that it's backtracking.
00:24:38.020 | Then it can go and navigate to the restaurant.
00:24:42.340 | In this case, if the agent accidentally makes a mistake and chooses the incorrect date,
00:24:47.180 | it can again backtrack, recover, open the date selector, and choose the right date,
00:24:54.180 | open the seat selection, and then finally complete the reservation.
00:24:56.920 | And so this is how the system is learning over time.
00:25:00.360 | It's making a lot of mistakes, but it's saving those mistakes and improving on them over time.
00:25:10.180 | So we tried Agent Q in a lot of real-world scenarios, including actual OpenTable reservations.
00:25:17.180 | We actually spun up thousands, or more like hundreds of thousands, of bots that ran on OpenTable and used our method to create agents that can book restaurants, make reservations, and do a bunch of other things.
00:25:32.180 | And we tried this with a lot of different methods and models out there.
00:25:36.180 | We tried GPT-4o, and we found that on these OpenTable reservation tasks
00:25:40.180 | it's only able to reach around 62.6 percent accuracy.
00:25:44.180 | When it comes to something like DPO, the accuracy goes to something like 71 percent.
00:25:49.180 | When we try Agent Q, we are able to make this work much, much better.
00:25:57.180 | We are able to reach 81 percent accuracy without any MCTS as part of the method.
00:26:02.180 | And when we apply the whole technique, with MCTS, DPO, and the self-critic mechanisms, we are actually able to reach close to 95.4 percent accuracy.
00:26:11.180 | And this is using a lot of self-learning for the agent to improve itself.
00:26:16.180 | This usually takes less than one day of training for the agent to go from roughly 20 percent accuracy,
00:26:26.180 | 18.6 percent to be exact, all the way to 95.4 percent.
00:26:31.180 | So that's roughly a 5x improvement in agent performance in less than one day.
00:26:39.180 | Cool.
00:26:46.180 | As the next topic, I'll touch on memory and personalization.
00:26:58.180 | So one way to think about AI agents is that they are taking in information and processing it.
00:27:11.180 | Imagine you have an AI model.
00:27:13.180 | What an AI model is doing is taking some prompts,
00:27:16.180 | so it's taking some language tokens and outputting some new language tokens.
00:27:21.180 | And this is acting similar to a processor: if you have a CPU, what happens is you have some instructions, usually binary encoded, that go into the CPU,
00:27:30.180 | and then you have some outputs that come out, which are also binary encoded,
00:27:33.180 | and then you loop over them again and again.
00:27:35.180 | That's how normal computers work.
00:27:37.180 | You can do a very similar thing and have this abstraction of an AI model acting like a computer, where you have language tokens going in, encoded in the prompt,
00:27:50.180 | and you have language tokens coming out.
00:27:53.180 | And this allows you to think about an AI model as a processor that's operating over natural language.
00:27:59.180 | So you can think about GPT-4, for example, doing this.
00:28:06.180 | This is similar to some of the older processors, like MIPS32, that use 32-bit instructions.
00:28:14.180 | Right now, if you look at GPT-4, we are able to reach very big context lengths.
00:28:20.180 | So that's very interesting.
00:28:23.180 | When GPT-4 initially came out, it was constrained to 8K tokens.
00:28:26.180 | Now we have 32K tokens, 128K tokens, and 1 million tokens.
00:28:30.180 | So the context length of these models just keeps increasing over time.
00:28:37.180 | And as the context length increases, that also allows us to have...
00:28:42.180 | A question from online.
00:28:48.180 | Can you speak to the compute budget for the day long run?
00:28:51.180 | Like was it H100s or like a cluster?
00:28:56.180 | The results?
00:28:57.180 | Yes, yes, yes, yes.
00:28:58.180 | So that was all H100s actually.
00:28:59.180 | We trained the whole model on 50 H100s in less than one day.
00:29:04.180 | Gotcha.
00:29:06.180 | And then one question from before.
00:29:08.180 | As AI agents increasingly emulate human behavior, what protocols do you foresee being implemented to help users distinguish between AI and humans in conversation?
00:29:17.180 | Yeah, that's very interesting.
00:29:20.180 | I think that becomes a question of security, of how can we identify whether it's a human or an agent.
00:29:26.180 | It's actually a very hard question right now because you actually have voice agents that are effectively able to mimic humans and are able to pass as humans.
00:29:33.180 | And that's actually happening in the real world right now.
00:29:35.180 | Over time we will need some form of human proof of identity.
00:29:39.180 | This could be biometrics.
00:29:41.180 | It could also be a combination of some sort of personal data, or a password or secrets that only you know.
00:29:48.180 | And you can use that to authenticate that you're talking to an actual human and not an agent.
00:29:53.180 | Cool.
00:30:02.180 | Any more questions?
00:30:03.180 | Yeah.
00:30:04.180 | So, you know, Berkeley just published a paper, by professors and students, saying,
00:30:10.180 | "Why do multi-agent systems fail?"
00:30:14.180 | Right.
00:30:15.180 | So, they did a comprehensive study.
00:30:17.180 | First of all, they said multi-agent systems have been around for more than 20 years, right?
00:30:22.180 | You have distributed systems, transaction processing.
00:30:25.180 | So, we are just having AI covered by the same name.
00:30:30.180 | And so far I really haven't seen anything new, except that instead of just having APIs and people coding all the logic in the program,
00:30:43.180 | you have an agent that will be able to do something.
00:30:45.180 | Mm-hmm.
00:30:46.180 | You give it a prompt and it will give you some results, right?
00:30:48.180 | How can just putting all the agents together suddenly elevate the intelligence?
00:30:55.180 | My point is that communication between agents is exactly the same as in multi-agent systems from 20 years ago.
00:31:03.180 | I think collaboration between agents is the only thing that would yield any new intelligence,
00:31:08.180 | but I'm missing that part today in the lecture.
00:31:12.180 | All right.
00:31:13.180 | This is actually something that's coming next.
00:31:16.180 | But just to answer your question, the biggest issue, just to elaborate,
00:31:21.180 | is that when all these agents are communicating using natural language, that causes a lot of miscommunication,
00:31:27.180 | where maybe your agent got the wrong instruction or failed to understand what's happening.
00:31:32.180 | And the more agents you add, the more communication overhead there is.
00:31:35.180 | So you can imagine, if you have an agentic system with n different agents, there are on the order of n squared communication pairs.
00:31:42.180 | And so the amount of error in that system grows quadratically,
00:31:46.180 | and that allows for a lot of different mistakes to happen.
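As a rough sanity check on that scaling argument: with n agents and pairwise messaging, the number of directed communication channels is n(n-1), which grows quadratically, so even a modest per-message error rate compounds quickly. The numbers below are illustrative, not measurements.

```python
# Rough scaling illustration: pairwise channels among n agents, and how a small
# per-message error rate compounds. Numbers are illustrative, not measurements.
def directed_channels(n: int) -> int:
    return n * (n - 1)

def p_all_messages_ok(n: int, p_error: float = 0.02) -> float:
    # Probability that none of the pairwise messages is misunderstood,
    # assuming one message per channel and independent errors.
    return (1 - p_error) ** directed_channels(n)

for n in (2, 5, 10):
    print(n, directed_channels(n), round(p_all_messages_ok(n), 3))
# 2  ->  2 channels, ~0.960 chance everything is understood
# 5  -> 20 channels, ~0.668
# 10 -> 90 channels, ~0.162
```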
00:31:50.180 | I would suggest looking at a textbook called MACI, M-A-C-I.
00:31:55.180 | It has pretty much all of these problems solved,
00:31:57.180 | in the 15 chapters of the book.
00:31:59.180 | Great.
00:32:00.180 | Yeah, totally.
00:32:01.180 | Yeah, that could be very interesting also for the audience here.
00:32:07.180 | So, let's come back to this.
00:32:14.180 | So, one way to think about agents is that you have this transformer model,
00:32:20.180 | and the transformer model is acting as the processor.
00:32:23.180 | It's taking in input prompts and giving out output prompts.
00:32:26.180 | And what you want is to also have a memory system,
00:32:30.180 | something like a disk or RAM, where you are saving what's happening
00:32:36.180 | and are able to process that over time.
00:32:39.180 | So you want repeated operations.
00:32:40.180 | You do the first pass over the model,
00:32:42.180 | you get some output tokens,
00:32:43.180 | you can save them in a RAM-like system,
00:32:45.180 | and then you have some new instructions that come out:
00:32:49.180 | okay, now here's step two of the plan,
00:32:51.180 | go execute that,
00:32:52.180 | here's step three of the plan,
00:32:53.180 | here's step four of the plan.
00:32:54.180 | And that looping behavior is what, in a sense, gives rise to agents.
00:32:59.180 | You can imagine the transformer is the processor,
00:33:04.180 | and the memory system, the instructions, and the planning are acting like the file system and the RAM.
00:33:11.180 | Overall, they give rise to this computer architecture,
00:33:16.180 | where you have the agent acting as a computer system,
00:33:21.180 | with the memory, and the processor, which is the compute,
00:33:26.180 | and then being able to use browsers, actions, and multi-modality,
00:33:31.180 | which can be inputs like audio and voice, and so on.
00:33:35.180 | Okay.
00:33:40.180 | When we think about long-term memory, based on the analogy before,
00:33:45.180 | you can think of it as similar to a disk,
00:33:47.180 | where you want user memory that's long-lived and persistent,
00:33:52.180 | so that you can save context about the user
00:33:55.180 | and load it on the fly whenever you want to.
00:33:59.180 | There are different mechanisms for long-term memory.
00:34:02.180 | The prevalent one is embeddings.
00:34:04.180 | You have retrieval models that can go and fetch the right user embeddings on the fly.
00:34:09.180 | So if I have a question like, okay, is this person, Joe, allergic to peanuts?
00:34:16.180 | Then can the system go and find out?
00:34:18.180 | If we have a lot of data about the user, we can use a retrieval model to do an embedding lookup,
00:34:27.180 | find out whether this is something we already know about the user or not,
00:34:30.180 | and based on that, make the right judgment.
00:34:34.180 | And this is something that is very important.
00:34:36.180 | You are able to see early traces of this in systems like ChatGPT right now.
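A minimal sketch of that embedding-lookup pattern: store user facts as vectors, then at question time retrieve the closest facts and hand them to the model. The `embed` function here is a toy stand-in for a real embedding model, and the memory class is hypothetical.

```python
# Toy long-term memory via embedding lookup. `embed` is a placeholder for a real
# embedding model; here it just hashes words into a small vector for illustration.
import math

def embed(text: str, dim: int = 64) -> list[float]:
    v = [0.0] * dim
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

class UserMemory:
    def __init__(self):
        self.facts: list[tuple[str, list[float]]] = []

    def add(self, fact: str) -> None:
        self.facts.append((fact, embed(fact)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Cosine similarity (vectors are already normalized) against every stored fact.
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, v)), fact) for fact, v in self.facts]
        return [fact for _, fact in sorted(scored, reverse=True)[:k]]

memory = UserMemory()
memory.add("Joe is allergic to peanuts")
memory.add("Joe prefers window seats on flights")
print(memory.retrieve("Is Joe allergic to peanuts?", k=1))
```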
00:34:40.180 | There are still a lot of open questions when it comes to long-term memory.
00:34:45.180 | The first one is hierarchy:
00:34:47.180 | how do we decompose memory into more graph-like structures where you can have temporal persistence
00:34:56.180 | and more structure?
00:34:57.180 | You might also want to think about memory as something that is adaptable,
00:35:01.180 | because human memory is usually not static;
00:35:03.180 | it's changing over time.
00:35:04.180 | So you also want to think, when you have agent memory,
00:35:07.180 | how can it change?
00:35:08.180 | How can it be dynamic?
00:35:09.180 | How can it self-adjust?
00:35:10.180 | Because these systems are also learning;
00:35:12.180 | they're improving.
00:35:13.180 | And what do these dynamic memory systems look like?
00:35:16.180 | Cool.
00:35:23.180 | And memory leads to personalization,
00:35:29.180 | where the goal of having long-term memory is that you can personalize these agents to the user,
00:35:37.180 | and they're able to understand what you like and what you don't like
00:35:41.180 | and stay aligned with your preferences.
00:35:44.180 | So if you have the case where maybe someone is allergic to peanuts,
00:35:48.180 | and you want an agent that's ordering food on DoorDash,
00:35:51.180 | then you want it to be personalized
00:35:53.180 | so it doesn't accidentally order something that you're allergic to.
00:35:56.180 | And how can you go and build that?
00:35:57.180 | And everyone has different preferences, likes and dislikes.
00:36:01.180 | So, when you're designing agents, it's very important to actually make sure that you can account for this.
00:36:06.180 | There might be a lot of explicit personalization information that you can collect,
00:36:10.180 | like what the user likes,
00:36:11.180 | whether they're allergic to something,
00:36:12.180 | what their favorite dishes are,
00:36:14.180 | what seat preferences they have if they're flying, and so on.
00:36:19.180 | There are also a lot of implicit preferences.
00:36:21.180 | There are a lot of things around which brands you like:
00:36:24.180 | do you like Adidas versus Nike?
00:36:26.180 | If there were ten items on a list, say you're looking for housing,
00:36:30.180 | which one do you prefer and why?
00:36:32.180 | Those things are very implicit,
00:36:34.180 | so they're not explicitly known.
00:36:36.180 | But there are mechanisms where you can collect a lot of these implicit preferences
00:36:40.180 | and then personalize over time.
00:36:42.180 | There's a lot of challenges when you're building these personalization systems.
00:36:49.180 | The first one is just user privacy and trust.
00:36:52.180 | How do you actually go and actively collect this information?
00:36:56.180 | And how do you get people to give that to you?
00:37:00.180 | There's different methods you can go and use to actually collect this information.
00:37:04.180 | So one is just active learning,
00:37:06.180 | where you're explicitly asking the user for their preferences.
00:37:09.180 | You're asking them, okay, like, are you allergic to something?
00:37:11.180 | Or do you have the seat preference?
00:37:12.180 | And so on.
00:37:13.180 | And there can also be passive learning,
00:37:15.180 | where, if you can record the users and see what they're doing,
00:37:18.180 | you're able to passively learn their preferences.
00:37:21.180 | Maybe this person likes Nike shoes,
00:37:23.180 | because that's what we have seen them do on the computer,
00:37:26.180 | and the agent is learning from your behavior and becoming better and better.
00:37:29.180 | And you can learn to personalize by supervised fine-tuning,
00:37:34.180 | where you are collecting a lot of interactions.
00:37:36.180 | This can also be through human feedback,
00:37:38.180 | where you can get thumbs up or thumbs down
00:37:40.180 | and use that to improve:
00:37:41.180 | okay, did this agent go and do the right thing?
00:37:44.180 | This is similar to ChatGPT,
00:37:46.180 | where if you like the chat outputs, you can give them a thumbs up,
00:37:50.180 | and if you don't like them, you give them a thumbs down.
00:37:52.180 | And then this can be used to personalize the system over time.
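In practice, that thumbs-up/thumbs-down signal is just logged as labeled examples that later feed a fine-tuning or preference-optimization run. Below is a minimal sketch of that logging, with made-up field names; it is not the data format of any particular product.

```python
# Hypothetical feedback log: each record pairs a prompt/response with a user rating,
# which can later be turned into SFT data (keep the liked ones) or preference pairs.
import json
from dataclasses import dataclass, asdict

@dataclass
class FeedbackRecord:
    user_id: str
    prompt: str
    response: str
    thumbs_up: bool

def log_feedback(record: FeedbackRecord, path: str = "feedback.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def to_preference_pairs(records: list[FeedbackRecord]) -> list[dict]:
    # Group by prompt; pair a liked response (chosen) with a disliked one (rejected).
    by_prompt: dict[str, dict[bool, list[str]]] = {}
    for r in records:
        by_prompt.setdefault(r.prompt, {True: [], False: []})[r.thumbs_up].append(r.response)
    pairs = []
    for prompt, groups in by_prompt.items():
        for chosen in groups[True]:
            for rejected in groups[False]:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```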
00:37:55.180 | Okay, so now I'm moving on to agent-to-agent communication.
00:38:01.180 | One question online.
00:38:06.180 | How do you do evaluations on the performance of agents that collaborate with humans?
00:38:10.180 | And is it a moving target?
00:38:11.180 | At what point is human performance redundant?
00:38:14.180 | And agents can be fully autonomous?
00:38:16.180 | I would say it's a hard question.
00:38:18.180 | You just have to go and build benchmarks.
00:38:20.180 | Because it's very hard to know what's going to happen in the real world.
00:38:23.180 | Right now, I would say, based on the current state of evaluations
00:38:29.180 | and what I showed before,
00:38:30.180 | agents are not fully there.
00:38:32.180 | The most successful agents we have seen so far are coding agents.
00:38:35.180 | So if you use an intelligent code editor,
00:38:39.180 | you can already see the traces:
00:38:43.180 | they are automating a lot of engineering for you already,
00:38:46.180 | so you don't have to go and write a lot of boilerplate code,
00:38:48.180 | and you don't have to spend a lot of your own time fixing bugs.
00:38:50.180 | So at some point, we'll see this thing where humans become more like managers.
00:38:55.180 | We are giving the agents feedback.
00:38:56.180 | We are giving them direction.
00:38:57.180 | Suppose you have systems of different agents:
00:39:01.180 | you tell them, okay, I want agent one to go and do this, agent two to go and do that,
00:39:05.180 | and so on.
00:39:06.180 | You see what the final output is,
00:39:07.180 | and using that, you improve the overall generation process you're working towards.
00:39:14.180 | So what's likely going to happen is that agentic systems will become better and better executors,
00:39:19.180 | while humans become the managers for these systems of agents.
00:39:24.180 | Okay.
00:39:25.180 | Cool.
00:39:26.180 | So when it comes to agent-to-agent communication, we think about multi-agent architectures and multi-agent systems,
00:39:39.180 | where you have all these cute little digital robots that can talk to each other, communicate, and do your work in a very coordinated and streamlined manner.
00:39:52.180 | There are reasons you want to build multi-agent systems.
00:39:55.180 | The first one is parallelization.
00:39:58.180 | By dividing a task into smaller chunks and having multiple agents, n agents instead of one agent, you can improve the overall speed and efficiency.
00:40:10.180 | The second thing is specialization.
00:40:13.180 | If you have different specialized agents,
00:40:14.180 | maybe a spreadsheet agent, a Slack agent, and a web-browser agent,
00:40:18.180 | then you can route different tasks to different agents,
00:40:20.180 | and each agent can become really, really good at its task.
00:40:23.180 | And this is similar to having a degree in a specific major, or having an occupation and specializing in that occupation.
00:40:37.180 | There are a lot of challenges when it comes to agent-to-agent communication.
00:40:41.180 | The biggest one is that this kind of communication is lossy.
00:40:45.180 | When you have one agent communicating with another agent, it's possible that it will make mistakes.
00:40:51.180 | This is similar to what happens in human organizations:
00:40:55.180 | maybe your manager asks you to go do something, but you misunderstood them and did something different, and they wonder what happened.
00:41:01.180 | Similarly, agent-to-agent communication is also fundamentally lossy: whenever you are communicating information from one agent to another, you're losing some percentage of that information.
00:41:12.180 | And that allows mistakes to propagate in the system and become increasingly prevalent.
00:41:18.180 | Yeah.
00:41:22.180 | And there are different mechanisms for multi-agent systems.
00:41:25.180 | This is a very novel field right now;
00:41:26.180 | people are still trying to figure it out.
00:41:28.180 | No one has actually cracked this yet.
00:41:30.180 | What you want to do is build the right system of hierarchies, where you might have manager agents that are working with worker agents.
00:41:39.180 | You might have managers of manager agents, or you might have flat agent organizations where one manager is managing hundreds of agents.
00:41:46.180 | Or it could be a big vertical tree where you have maybe ten different levels of agents managing each other.
00:41:54.180 | A lot of these systems are possible,
00:41:56.180 | and it just depends on the task you're specializing in.
00:42:01.180 | And the biggest challenge with these kinds of systems is: how do you exchange communication effectively without losing information?
00:42:08.180 | How do you build syncing primitives?
00:42:10.180 | How can communication from one agent that's very far from another agent in the hierarchy be passed very, very effectively across the chain?
00:42:25.180 | There are a couple of frameworks out there that are looking to solve these problems: how do we make these communication protocols robust,
00:42:32.180 | and how can we add mechanisms to reduce this miscommunication?
00:42:37.180 | A big one here is MCP, the Model Context Protocol.
00:42:41.180 | This is a protocol that came from Anthropic that a lot of people are using right now.
00:42:46.180 | It's a simple wrapper around APIs.
00:42:48.180 | What it does is give you a streamlined, standard format around each API.
00:42:53.180 | You create an MCP wrapper around your service.
00:42:58.180 | So maybe you have a file-server service that's exposed as an API;
00:43:02.180 | you can create an MCP wrapper for your file server, or for your email client, or for a Slack client, or something running on your computer.
00:43:09.180 | Then all these MCP-connected servers can communicate with each other and do things for you.
00:43:15.180 | And this allows for very effective communication, where you are able to control the routing and make decisions modular,
00:43:21.180 | so you're able to plug in new services as you want to.
00:43:28.180 | Similarly, another framework in this space is the agent-to-agent protocol, A2A.
00:43:32.180 | This is a new protocol that came from Google very recently.
00:43:36.180 | It's designed for agents to communicate with other agents and to add a lot of reliability and fallback mechanisms.
00:43:43.180 | And I'm not sure how many people here in the room have used MCPs.
00:43:52.180 | Okay.
00:43:53.180 | Yeah, not many.
00:43:55.180 | Okay.
00:43:56.180 | Okay.
00:43:57.180 | Cool.
00:43:58.180 | So, MCPs are actually very cool.
00:44:00.180 | What they are doing is abstracting your APIs and making them very, very modular, so that you can plug your API into the MCP protocol.
00:44:13.180 | And once it's wrapped in that, you can interconnect it to any other service that supports MCP.
00:44:20.180 | So it kind of becomes, in a sense, a central interface for communication across the different services and applications you have, exposing them and letting them connect and talk to each other.
00:44:35.180 | Similar to how you have something like HTTPS for communication on the normal internet, MCP becomes an interesting protocol for communication across different agents.
00:44:49.180 | And, yeah.
00:44:50.180 | So if you have a client, like Claude or Replit or some other model, you can connect it to servers that support the MCP protocol.
00:45:01.180 | You can have a bunch of different services;
00:45:03.180 | each service could be some sort of data tool, like a database, an API, or pretty much anything else.
00:45:08.180 | And they can all interconnect and do modular things for you.
00:45:14.180 | And because MCPs are not dependent on the spec of your API, they allow you to absorb a lot of changes and add this level of modularity and abstraction by standardizing the whole interface.
00:45:29.180 | You can also have dynamic tool discovery, because you can find different MCP servers that are exposed in some sort of directory,
00:45:37.180 | and then you can plug in the MCP servers that you like and connect to them.
00:45:42.180 | So you can plug in new tools and leave others out,
00:45:45.180 | and you can route information based on what you want to do.
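As a rough illustration of what "wrapping a service in MCP" looks like, here is a tiny tool server sketched with the FastMCP helper from the official Python MCP SDK (assuming the `mcp` package is installed); the tools themselves are made-up examples, not anything specific from the lecture.

```python
# Minimal MCP tool server sketch using the official Python SDK's FastMCP helper.
# Assumes `pip install mcp`; the tools here are just illustrative examples.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@mcp.tool()
def read_note(name: str) -> str:
    """Return the contents of a (hypothetical) local note file."""
    with open(f"notes/{name}.txt") as f:
        return f.read()

if __name__ == "__main__":
    # Serves the tools over stdio so an MCP client (e.g. a desktop LLM app) can call them.
    mcp.run()
```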
00:45:50.180 | Okay.
00:45:51.180 | Finally, let me touch on some of the issues when it comes to agent systems.
00:46:00.180 | So, so far we have seen a lot of different things.
00:46:03.180 | Okay, like, how can these agents work?
00:46:05.180 | How can we evaluate them?
00:46:06.180 | How can we train them?
00:46:07.180 | How can we think about communicating with different agent systems?
00:46:12.180 | And even though a lot of these things are very interesting and a lot of these things are taking off, there remain a lot of key problems in the space that still have to be solved for these agents to be practical, for them to be applied in everyday life, and for them to become useful to you.
00:46:26.180 | The biggest one is just reliability: the systems have to become very, very reliable.
00:46:30.180 | They need to be close to 99.9% reliable if you're giving them access to your payments and your bank details, for example, or if they're connected to your emails, calendars, and other services.
00:46:41.180 | And then you want to really trust them: you don't want the systems to go rogue and maybe post something wrong for you on socials, on your Twitter or your LinkedIn, and you don't want them to go and create havoc or make a wrong transaction on your behalf.
00:46:53.180 | So that becomes: how can you trust an agentic system that's operating autonomously?
00:46:59.180 | And that's where reliability becomes a big thing.
00:47:02.180 | Second issue with autonomous agents is looping.
00:47:05.180 | Like, this agent can go do something wrong.
00:47:08.180 | So they can get stuck in a loop, and they might just go and repeat that process again and again.
00:47:12.180 | So if you give them a task, and, like, maybe, like, if you remember, like, the restaurant booking task that I showed before, and, like, maybe, like, the agent went to the wrong restaurant,
00:47:22.180 | and got stuck, and maybe just trying to do the same thing again and again, and doesn't know what to do.
00:47:27.180 | And that kind of issues can happen a lot of the agents where you might end up wasting a lot of money in compute.
00:47:31.180 | And it's very important to be able to, like, figure that out and correct that.
00:47:34.180 | And that leads to a lot of use cases around, like, how can we test agents?
00:47:38.180 | How can we properly benchmark them in the real world on a lot of, like, different use cases and make sure, like, we are learning from that?
00:47:44.180 | And how can we also, once we deploy the systems, be able to, like, observe them?
00:47:48.180 | And that becomes, like, how can we know what is happening?
00:47:51.180 | Can we monitor them online?
00:47:53.180 | Can we have some sort of, like, safety, which could be based on audit trails that we can audit all the operations that this agent has done so far?
00:48:00.180 | And we can maybe also have human overrides.
00:48:02.180 | But if something goes wrong, we have some sort of human fallback where maybe, like, a remote operator can take control of the agent and correct it and fix it.
00:48:09.180 | Or maybe you are able to go and, like, actually take control and fix it.
00:48:13.180 | This is similar to, like, autopilot in Tesla.
00:48:16.180 | So, when you're driving autopilot, maybe, like, you see, like, something, maybe it's going to go do something wrong and you can take control and override the system.
00:48:23.180 | And that becomes very interesting when you're thinking about real world deployment of agents.
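Here is a rough sketch of what an audit trail plus a human-override hook might look like in code. The `agent.execute` call and the list of "risky" action types are assumptions for illustration, not a real API:

```python
import json
import time

class AuditedAgent:
    """Wraps an agent with an audit trail and a human-override hook (illustrative only)."""

    def __init__(self, agent, needs_approval=("payment", "send_email", "post_social")):
        self.agent = agent                         # assumed to expose execute(action, args)
        self.needs_approval = set(needs_approval)  # hypothetical "risky" action types
        self.log_path = "audit_trail.jsonl"

    def act(self, action: str, args: dict):
        record = {"ts": time.time(), "action": action, "args": args}
        if action in self.needs_approval:
            # Human fallback: pause and ask an operator before executing risky actions.
            answer = input(f"Approve {action} with {args}? [y/N] ").strip().lower()
            record["approved"] = (answer == "y")
            if not record["approved"]:
                self._log(record)
                return {"status": "blocked_by_human"}
        result = self.agent.execute(action, args)
        record["result"] = str(result)
        self._log(record)
        return result

    def _log(self, record: dict):
        # Append-only audit trail, one JSON line per action, so operations can be reviewed later.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
```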
00:48:30.180 | Cool.
00:48:31.180 | Okay.
00:48:32.180 | So, that was, yeah.
00:48:33.180 | So, that was the whole lecture on agents.
00:48:35.180 | Sorry, there were some things that were a bit messy.
00:48:38.180 | Yeah, we had to put together some final slides.
00:48:41.180 | So, happy to take questions.
00:48:42.180 | And, yeah, go on.
00:48:43.180 | So, when you see an accuracy, say, like here.
00:48:46.180 | So, this is one.
00:48:47.180 | Yeah.
00:48:48.180 | So, when you see an accuracy of, say, 40% or something on a task over the course of a day, do you think that there is a plan to get to, you know, 99.9% or 99.99%?
00:49:01.180 | And do you have a thought on whether that is just iterations on research?
00:49:06.180 | Or are there actually a lot of things that you need to try?
00:49:10.180 | So, I would say this is definitely possible, especially with reinforcement learning, like the Agent Q method I showed before.
00:49:17.180 | So, right now, a lot of these models, even Claude Sonnet or GPT-4o or Gemini, are not trained on these agentic interface tasks.
00:49:27.180 | So, that's why they're kind of working zero-shot.
00:49:30.180 | They were never trained, in their training distribution, on actually going and optimizing these problems.
00:49:36.180 | And so, when they encounter these new interfaces or these kinds of new tasks in the real world, they often fail.
00:49:40.180 | But if you're able to train these systems directly to work on these tasks using reinforcement learning, corrections, and self-improvement, then you can actually reach very, very high accuracies.
00:49:49.180 | So, on the OpenTable task with Agent Q, we reached around 95% accuracy.
00:49:53.180 | And if you keep going and training the systems, you can fully saturate them and reach close to 99.9%.
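As a sketch of the general idea (not the Agent Q algorithm itself, which combines search with preference-based fine-tuning), a filtered self-improvement loop might look roughly like this; `run_task` and `finetune` stand in for a real browser rollout, success checker, and training step:

```python
import random

def run_task(agent, task):
    """Roll out the agent on one task; returns (trajectory, success). Placeholder logic only."""
    trajectory = []                          # imagine a real browser rollout recorded here
    success = random.random() < 0.4          # imagine a real task checker instead of a coin flip
    return trajectory, success

def self_improvement_round(agent, tasks, finetune):
    """One round of 'collect successful trajectories, train on them' (a generic sketch)."""
    successes = []
    for task in tasks:
        traj, ok = run_task(agent, task)
        if ok:
            successes.append(traj)           # keep only trajectories that passed the checker
    if successes:
        finetune(agent, successes)           # assumed training hook (e.g. SFT or an RL update)
    return len(successes) / max(len(tasks), 1)   # success rate observed this round
```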
00:49:59.180 | The hard thing is that there's a huge diversity of tasks.
00:50:02.180 | So, you can imagine there are millions of websites.
00:50:04.180 | And if you want to train an agent that's 99.9% reliable on each website,
00:50:07.180 | that's a hard challenge.
00:50:09.180 | And that's something that's very interesting.
00:50:11.180 | How can you build a generalized agent that can work on the whole internet, that can generalize to everything?
00:50:15.180 | Maybe in the future, you will have agents that can automate all of voice calling, all of computer control.
00:50:21.180 | Maybe they can also use all of the APIs and everything.
00:50:23.180 | And something like that is possible theoretically.
00:50:26.180 | It's just very hard to build out.
00:50:31.180 | Do you know whether AI agents are able to solve CAPTCHAs?
00:50:34.180 | They can.
00:50:35.180 | What do you think the implications of that are for, like, how the internet's going to work in the next 10 years?
00:50:41.180 | It's definitely very interesting.
00:50:44.180 | I would say it's a cat and mouse game.
00:50:46.180 | So, if you have seen the new generation of CAPTCHAs, they're becoming harder and harder to solve.
00:50:50.180 | And I think it's very hard to beat this, because if a human can do it, then theoretically an agent can also go and do the same thing.
00:50:57.180 | So, over time, I think we'll just have to figure out a better method of identity.
00:51:01.180 | Biometrics can be a big part of that.
00:51:03.180 | If you are able to use fingerprints or some sort of 2FA mechanism, then we know this is an actual human, not an agent.
00:51:10.180 | So, there's this article called AI 2027 that you've probably heard of, which outlined, you know, where AI research is going to go and what might happen.
00:51:20.180 | And in 2027, or soon after, you know, we automate programming and then we automate AI research.
00:51:33.180 | And, you know, after your lecture, I was wondering: do you think we could automate the process of creating AI agents?
00:51:48.180 | Because, from what I understand, the main bottleneck is: how am I going to access UIs and APIs?
00:51:54.180 | How am I going to be able to access data that is enclosed in those, I guess, complex and somewhat dynamic systems?
00:52:01.180 | So, what if, very simply, someone designed an agent that was optimized to vectorize APIs and UIs, and then you designed an agent that was optimized to train agents on different vectorized data sets?
00:52:17.180 | Because there are specific architectures that you can use to train agents, whatever.
00:52:20.180 | Do you think we will see in the future people automating, with confidence, the process of creating AI agents, making all these niche-specific AI agents that we're seeing on the market obsolete?
00:52:36.180 | Yeah, absolutely.
00:52:37.180 | I absolutely think so.
00:52:38.180 | So, this is going to happen.
00:52:39.180 | And I think it's already happening in the bigger labs.
00:52:42.180 | So, if you look at labs like OpenAI, they have a lot of research agents.
00:52:45.180 | There are also papers, if you've seen the ones from Sakana AI, where people are working on AI research agents that can go and write research papers, train models, and do a bunch of things.
00:52:54.180 | So, it's certainly possible for agents to go and self-improve and build other agents, and you can have a whole process for how that can happen.
00:53:01.180 | And definitely, it's possible to train on a lot of these data sources and APIs, find ways to represent them, collect the right sets of data, and improve that.
00:53:11.180 | I do think that seems to be the future of a lot of hard research, especially around protein design and a lot of the hard sciences.
00:53:20.180 | So, we'll definitely see a lot of that happen.
00:53:33.180 | Hi, Garg.
00:53:34.180 | Nice to meet you again.
00:53:35.180 | Just to give you context, we're building the Slack for AI agents.
00:53:39.180 | Basically, it's like Uber for AI agents.
00:53:41.180 | So, I've been working on agents for a long time.
00:53:45.180 | The biggest problem with agents has been, as you said, reliability and hallucination.
00:53:50.180 | The first thing we tried to work on is: how do we prevent agents from hallucinating?
00:53:57.180 | The next thing is: which models are best at executing actions?
00:54:02.180 | So, for my research, we realized that Claude is great.
00:54:07.180 | Right.
00:54:08.180 | Better still, we have GPT at the end.
00:54:10.180 | So, just like Slack, we have a team of agents doing work.
00:54:15.180 | And then, the one that does the actions seems to be the GPT agent, because we struggle with some other agents there; as you said, GPT-4o is great at, you know, taking action.
00:54:29.180 | And other GPT models seem to not work well, and Claude and other stuff.
00:54:32.180 | So, I think the biggest challenge with building agents, the third one, is the fact that end users can't take even one mistake.
00:54:39.180 | So, my wife, for example: if I give her the product and it makes one mistake, there's no space for reinforcement learning.
00:54:49.180 | In the sense that, if I say, book my flight, like I told Manus to do yesterday, and it made one mistake, I lost trust.
00:54:56.180 | So, the problem is, to work in the real world, our agents need to avoid making mistakes in the real world.
00:55:02.180 | So, that brings us to sandboxes. I love what you're doing with sandboxes and clones of these websites.
00:55:07.180 | The challenge with sandboxes is you can't clone all the websites on the internet.
00:55:11.180 | And where humans excel is the fact that, if given a new task, they figure their way around.
00:55:18.180 | So, these are challenges that we have with agents, and I'm happy to talk more about it, or we can talk about it later.
00:55:26.180 | Totally, totally. Just to get a gist of it, what's the exact question there?
00:55:31.180 | So, I think the question is, how do we make them ready for the real world?
00:55:35.180 | We have a bot, which does a good job with calling, but makes mistakes.
00:55:41.180 | We have an email agent of my own that got stuck in a loop and sent an email five times to an investor.
00:55:49.180 | We have a coding agent that wiped out 3,000 lines of code for me yesterday.
00:55:55.180 | I had to redo it.
00:55:56.180 | So, we have these challenges in the real world, and people like my wife are only going to give it one shot, and then they will just stop using it.
00:56:02.180 | So, I think the question is: how do we prevent agents from hallucinating?
00:56:05.180 | Right.
00:56:06.180 | Yeah, so it's definitely a hard problem, but you can go and keep improving these agents.
00:56:12.180 | Even if you look at a lot of the initial models that came out, like the first versions of GPT-3 and so on, they hallucinated a lot.
00:56:21.180 | But as you get bigger models with more parameters, trained on more data, they start hallucinating less.
00:56:29.180 | So, if you look at the new generation of models, GPT-4 and Claude, I think over time, as you figure out how to make better foundation models, a lot of these errors in the systems go down, especially hallucinations and other things that can happen.
00:56:42.180 | You just require a lot of monitoring and evaluations and a lot of testing.
00:56:46.180 | And this also becomes very domain-specific.
00:56:48.180 | So, if you're working on a domain-specific problem, and you want an agent that can work 99.9% of the time in this domain, then what you want to do is get the right test cases.
00:56:58.180 | You can say, okay, here are 1,000 scenarios that we really need to care about.
00:57:02.180 | Can we go and test these agents on these 1,000 scenarios all the time, which could be in production when you're actually running this with your users?
00:57:09.180 | Or this could be some sort of offline simulation where you're testing daily: are there any regressions in the system?
00:57:15.180 | What happens if you change a prompt?
00:57:16.180 | What will this look like?
00:57:18.180 | And if you're able to build a lot of very robust testing, then you can also verify that your accuracies are going up.
00:57:25.180 | And then it kind of becomes: can you fine-tune these agents to become better and better for your use cases?
00:57:29.180 | So, I would say the correct answer is a combination: models will become better and better over time,
00:57:34.180 | so you can just implicitly trust them more as new models come out.
00:57:37.180 | And the second thing is that you want very domain-specific testing and evaluation.
00:57:41.180 | So, for your own use case, can you have some way to rank which model is doing what?
00:57:47.180 | How good is it?
00:57:48.180 | And make the right judgment, and be able to fine-tune and use reinforcement learning and other techniques to make them better over time.
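As a minimal sketch of that kind of domain-specific regression suite, the scenario names, the checks, and the `agent_fn` interface below are all made up for illustration:

```python
# A rough sketch of the "1,000 scenarios" idea: a fixed suite of domain-specific
# test cases, re-run whenever the model or the prompt changes.
SCENARIOS = [
    {"id": "book_table_2p", "goal": "Book a table for 2 tonight", "check": lambda out: "confirmed" in out},
    {"id": "cancel_booking", "goal": "Cancel my 7pm reservation", "check": lambda out: "cancelled" in out},
    # ... up to ~1,000 scenarios in practice
]

def run_regression(agent_fn) -> float:
    """agent_fn(goal) -> output string; returns the pass rate over the whole suite."""
    passed = 0
    for s in SCENARIOS:
        out = agent_fn(s["goal"])
        if s["check"](out):
            passed += 1
        else:
            print("REGRESSION:", s["id"])   # flag failures for inspection
    return passed / len(SCENARIOS)
```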
00:57:54.180 | What do you think about smaller language models?
00:57:56.180 | Right.
00:57:57.180 | Right.
00:57:58.180 | Smaller model?
00:57:59.180 | Right.
00:58:00.180 | Thank you.
00:58:01.180 | I think the problem is large language models.
00:58:06.180 | I don't think we need bigger brains.
00:58:08.180 | We just need smaller ones.
00:58:10.180 | So, I'm leaning way more toward smaller language models.
00:58:13.180 | Yeah, so that's an interesting question.
00:58:25.180 | Also, we are already seeing some hints of this.
00:58:28.180 | So, if you look at a lot of the newer models, they're trained on reasoning traces.
00:58:31.180 | And we have found that you can actually train smaller models on reasoning traces and get better accuracies on these things.
00:58:39.180 | So, a lot of the newer GPT-4-class models, like GPT-4o and the new series like o3-mini and so on,
00:58:46.180 | are actually distilled small models.
00:58:48.180 | But they're fine-tuned, using reinforcement learning and other techniques, to be very good at reasoning.
00:58:54.180 | And so, we are already seeing that with the new generation of thinking models that are coming out, the o1 and o3 series.
00:59:01.180 | So, that's showing that smaller models with better reasoning and better processing are actually the right answer.
00:59:06.180 | It will be interesting to see how far you can push the limits.
00:59:09.180 | And what will this look like?
00:59:11.180 | Maybe over this year, what are the best accuracies we can expect from these kinds of reasoning models?
00:59:18.180 | Can we actually go and be PhD-level at mathematics, and even reach superintelligence in a lot of these specific domains?
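A minimal sketch of the distillation recipe being described, training a small model on a big model's reasoning traces; `big_model.generate` and `small_model.finetune` are hypothetical interfaces, not a specific library:

```python
def build_distillation_set(big_model, problems):
    """Have the large teacher model write out its reasoning for each problem."""
    dataset = []
    for p in problems:
        trace = big_model.generate(f"Solve step by step:\n{p}")   # teacher's full reasoning trace
        dataset.append({"prompt": p, "target": trace})            # student will imitate the trace
    return dataset

def distill(small_model, big_model, problems):
    """Standard supervised fine-tuning of the small student on (prompt, trace) pairs."""
    data = build_distillation_set(big_model, problems)
    small_model.finetune(data)
    return small_model
```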
00:59:27.180 | Okay.
00:59:28.180 | I think the lead process is reward.
00:59:30.180 | And my proposed architecture, which I think may work, is that the manager agent could be a large language model,
00:59:37.180 | and then the worker agents could be small language models.
00:59:39.180 | Because I think there's distillation happening when you're collaborating in a team.
00:59:45.180 | Yeah.
00:59:46.180 | Yeah.
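As a rough sketch of that manager/worker idea, assuming a generic `call_llm(model, prompt)` helper and made-up model names:

```python
# A large model plans and routes; small specialized models execute.
WORKERS = {
    "code": "small-code-model",
    "email": "small-writing-model",
    "browse": "small-action-model",
}

def manager_route(task: str, call_llm) -> str:
    # The manager (large model) picks which worker should handle the task.
    plan = call_llm(
        "large-manager-model",
        f"Pick one worker from {list(WORKERS)} for this task and restate the subtask:\n{task}",
    )
    worker_key = next((k for k in WORKERS if k in plan.lower()), "browse")
    # The worker (small model) actually executes the subtask.
    return call_llm(WORKERS[worker_key], f"Execute this subtask: {plan}")
```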
00:59:47.180 | My last question is regarding memory.
00:59:50.180 | What analogy would you give with respect to a computer?
00:59:53.180 | We have random access memory.
00:59:55.180 | We have the ROM, and then we have the, you know, hard drive, right?
00:59:59.180 | With AI agents right now, I just think they have random access memory.
01:00:03.180 | And with Mem0, we are just giving it ROM.
01:00:06.180 | I don't think they have the hard drive part of this, the consciousness, right,
01:00:11.180 | of why they are working.
01:00:12.180 | I think that's a challenge.
01:00:13.180 | I would like to know: how do we implement that kind of system, to make it work sort of like a computer?
01:00:20.180 | Yeah.
01:00:21.180 | That's an interesting question.
01:00:22.180 | I'd be curious if you actually try and experiment and see how that works.
01:00:25.180 | I'll say there's no straight answer to this.
01:00:28.180 | It just depends on what you're building, what your applications are.
01:00:31.180 | And then, depending on what you're doing, there can be different sizes of models that might work better.
01:00:36.180 | If you're doing a coding task, you might want more of a coding model, versus if you're doing something more chat-based, or actions, and so on.
01:00:43.180 | And I think you just have to find the right ingredients, in a sense, the right components for your application, and then go and build that.
01:00:52.180 | Yeah, so there's no right answer to it in a sense.
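On the earlier memory question, a toy sketch of the RAM-versus-hard-drive analogy might look like this: a short-term buffer that falls out of context, plus a persistent store on disk. The file name and the retrieval logic are made up for illustration:

```python
import json
import os

class LayeredMemory:
    """Toy version of the RAM-vs-hard-drive analogy: short-term buffer plus persistent store."""

    def __init__(self, path="agent_memory.json", ram_size=20):
        self.path = path
        self.ram = []                         # "RAM": recent turns kept in the context window
        self.ram_size = ram_size
        if os.path.exists(path):              # "hard drive": durable facts that survive restarts
            with open(path) as f:
                self.disk = json.load(f)
        else:
            self.disk = {}

    def remember_turn(self, text: str):
        self.ram.append(text)
        self.ram = self.ram[-self.ram_size:]  # old turns fall out of short-term memory

    def write_fact(self, key: str, value: str):
        self.disk[key] = value
        with open(self.path, "w") as f:
            json.dump(self.disk, f)

    def build_context(self, query: str) -> str:
        # Naive retrieval: pull any stored fact whose key appears in the query.
        recalled = [f"{k}: {v}" for k, v in self.disk.items() if k.lower() in query.lower()]
        return "\n".join(recalled + self.ram + [query])
```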
01:00:54.180 | Yeah.
01:00:56.180 | Thank you.