Useful General Intelligence — Danielle Perszyk, Amazon AGI

00:00:00.000 |
Hi everyone. That's a little loud. I'm Danielle and I'm a cognitive scientist working at the 00:00:21.200 |
experimental new Amazon AGI SF lab. And throughout this conference you're going to hear a lot of 00:00:26.880 |
talks about building and scaling agents including some from my colleagues at AWS. But this talk is 00:00:32.720 |
going to be a little different. I want to think about how we can co-evolve with general purpose 00:00:37.040 |
agents and what it will take to make them reliable and aligned with our own intelligence. So I'd like 00:00:43.280 |
to set the stage by reminding us of a fact about the reliability of our own minds. We're all 00:00:50.880 |
hallucinating right now. Our brains don't have direct access to reality; they're stuck inside 00:00:57.840 |
our heads, so they can only really do a few things. They can make predictions with their world models. 00:01:04.640 |
They can take in sensory information and they can reconcile errors between the two. That's about it. 00:01:10.720 |
And that's why neuroscientists call our brains prediction machines and say that perception is 00:01:16.480 |
controlled hallucination. But there's no way of course that I could be standing up here in front 00:01:22.640 |
of you if I didn't have my hallucinations under control. The controlled part is the critical bit. 00:01:26.640 |
But that's not all that's happening right now. If you're understanding my words then I'm also influencing 00:01:32.880 |
your hallucinations. And assuming you do understand my words then your brain just did something else. 00:01:38.960 |
It activated all meanings of the word hallucination including this one. So today we rely upon 00:01:46.240 |
hallucinating chatbots for brainstorming, generating content and code, and images of themselves like this. 00:01:53.840 |
But what they can't yet do is think, learn, or act in a reliable general purpose way. And we're not 00:01:59.760 |
satisfied with that because we've set our sights on building AI that more closely resembles our own 00:02:06.560 |
intelligence. But what makes our intelligence general? Well one thing we know is that hallucinations 00:02:13.760 |
are necessary because they allow us to go beyond the data. They're features rather than bugs of AI that's 00:02:20.560 |
flexible like ours. So we just need to figure out how to control them. I'm going to be drawing a lot of 00:02:26.160 |
parallels to our intelligence. But I'm not saying that we are or should be building something like a human 00:02:32.240 |
brain. We don't want AI to replace us or replicate us. We want it to complement us. We want AI plus humans 00:02:40.480 |
to be greater than the sum of our parts. Now this isn't typically what we think about when we hear AGI. 00:02:46.640 |
We think about the AI becoming more advanced. But this reflects a category error about how our intelligence 00:02:54.480 |
actually works. And that error is that general intelligence can exist within a thinking machine. 00:03:00.560 |
So when you think about AGI, you probably think about something like this. And you might think that 00:03:07.600 |
it's right around the corner. But why does it then feel like agents are closer to something like this? 00:03:14.320 |
The reality is that models can't yet reliably click, type, or scroll. And so everyone wants to know, 00:03:24.160 |
how do we make agents reliable? That's the question that I'm going to focus on today. 00:03:28.080 |
So first I'll share our lab's vision for agents. Then I will show you how Nova Act, which is a research 00:03:35.680 |
preview of our agent, works today. And then finally I'll show you how Nova Act will evolve and how 00:03:41.600 |
you are all central to that evolution. So let's start with the big picture. Our vision for agents is 00:03:47.920 |
different than the standard vision, which reflects this long lineage of thought that has become folklore. 00:03:53.760 |
So you all know the story by now, which is why you probably spotted the hallucination here. 00:03:59.200 |
The concept of machines that can think like humans didn't originate in the 2010s, 00:04:04.800 |
but in 1956 when a group of engineers and mathematicians set out to build thinking 00:04:10.160 |
machines so they could solve intelligence. Of course, you also all know that these guys didn't 00:04:15.680 |
solve intelligence, but they did succeed in founding the field of AI and sparking a feedback loop that 00:04:21.840 |
changed how we live and work. So first we built more powerful computers, then we connected them 00:04:27.360 |
together to build the internet, which enabled more sophisticated learning algorithms. And this made our 00:04:32.720 |
computers even more powerful. And now we're back to aiming for thinking machines by another name: 00:04:37.440 |
Artificial General Intelligence, or AGI. So the standard vision is to make AI smarter and give it more 00:04:44.720 |
agency. And notice that this is about the technology, not us. Well, luckily this wasn't the only historical vision. 00:04:56.240 |
This is Douglas Engelbart, and he invented the computer mouse and the GUI. He didn't care so much 00:05:01.600 |
about thinking machines and solving intelligence. What he cared about was thinking humans and augmenting 00:05:08.320 |
our intelligence. And he proposed that computers could make us smarter. Of course, he was absolutely right. 00:05:14.800 |
So as computers became pervasive, they also started changing our brains. We began offloading our computation to 00:05:22.640 |
devices, distributing our cognition across the digital environment, and this had the effect of augmenting 00:05:29.360 |
our intelligence. Scientists call this techno-social co-evolution. It just means that we invent new technologies 00:05:37.040 |
that then shape us. So here we have two historical perspectives for the goal of building more advanced 00:05:44.320 |
intelligence that resembles our own. We can build AI that is as smart as or even smarter than us, 00:05:50.000 |
or we can build AI that makes us smarter. We all believe that more general purpose agents are going to be 00:05:57.200 |
more useful. But how? Well, things are useful when they have one of two effects. They can simplify our lives by 00:06:04.160 |
allowing us to offload things, or they can give us more leverage. And yes, automation is an engine for 00:06:11.360 |
augmentation. This is how we become expert at things. We start by paying conscious attention to the details, 00:06:17.440 |
we practice, and then our brain moves things over to our subconscious. Automation frees up our attention to 00:06:23.760 |
focus on other things. The problem is that automation doesn't always lead to augmentation. Sometimes it even 00:06:31.440 |
comes at a cost. How many hours have we lost to scrolling? Or how many echo chambers have we been 00:06:37.120 |
trapped within? How many times has autocomplete just shut down our thinking? So this is how algorithms can 00:06:43.760 |
reduce our agency. And it's how increasingly intelligent agents might cause more problems than they solve. 00:06:50.400 |
But if we have precise control and we actively tailor these systems the way that we want, then we can actually 00:06:57.440 |
increase our agency. And this is the crossroads in front of us. We can continue to make AI smarter and give 00:07:04.240 |
it more agency. We can focus on unhobbling the AI, as it's fashionable to say. But this doesn't guarantee 00:07:11.120 |
that it will be useful to us. It just guarantees that we'll continue to see a lot of the same patterns that 00:07:15.920 |
we've seen in tech recently. And that's why our vision is to build AI that makes us smarter and gives 00:07:22.240 |
us more agency. To build AI that unhobbles humans. So how do we do that? Well, in these early stages, 00:07:30.720 |
we need to do two things. We need to meet the models where they are and meet the builders where they are. So 00:07:37.920 |
all of you have a million ideas about what you want to do with agents. We have to make it frictionless 00:07:42.640 |
for you to get started. And Nova Act does these two things. We're building a future where the atomic unit 00:07:51.440 |
of all digital interactions will be an agent call. The big obstacle is that we still only have partial 00:07:57.440 |
infrastructure for APIs. Most websites are built for visual UIs. And since most websites lack APIs, 00:08:06.240 |
we need to use the browser itself as a tool. And that's why we've trained a version of Amazon's 00:08:12.720 |
foundation model, Nova, to be really good at UIs, to interact with UIs like we do. Nova Act combines 00:08:20.480 |
this model with an SDK to allow developers to build and deploy agents. All you have to do is make an Act 00:08:27.840 |
call, which translates natural language into actions on the screen. And I'm going to show you a demo here 00:08:34.800 |
where my teammate, Carolyn, will show you how you can use 00:08:38.480 |
Nova Act to find our dream apartment. We're searching for a two-bedroom, one-bath in Redwood City. 00:08:47.040 |
Here, we've given our first Act call to the agent. It's going to break down how to complete this task, 00:08:53.440 |
considering the outcome of each step as it plans the next one. Behind the scenes, this is all powered by a 00:08:59.120 |
specialized version of Amazon Nova, trained for high reliability on UI tasks. 00:09:03.520 |
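For reference, here is a minimal sketch of what an Act call looks like with the Nova Act SDK, based on the publicly documented examples; the starting page and prompts here are illustrative, and exact parameter names may differ from the current release:

```python
from nova_act import NovaAct

# Open a browser session on a starting page and issue natural-language act() calls.
# Each call is translated by the Nova Act model into clicks, typing, and scrolling on the page.
with NovaAct(starting_page="https://www.example-rentals.com") as nova:
    nova.act("search for 2 bedroom, 1 bath apartments in Redwood City")
    nova.act("sort the results by price, lowest first")
```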
Next, I'm going to show you my teammate, Fjord, who will describe how you can do even more things with 00:09:13.680 |
Python integrations. All right. We see a bunch of rentals on the screen, so let's grab them using a 00:09:18.800 |
structured extract. We'll define a Pydantic class and ask the agent to return JSON matching that schema. 00:09:27.200 |
For my commute, I want to know the biking distance to the nearest Caltrain station for each of these results. 00:09:31.600 |
Let's define a helper function, add_biking_distance, which will take in an apartment and then use Google Maps to calculate the distance. 00:09:38.240 |
Now, I don't want to wait for each of these searches to complete one by one, so let's do this in parallel. 00:09:45.600 |
Since this is Python, we can just use a thread pool to spin up multiple browsers, one for each address. 00:09:50.160 |
Finally, I'll use pandas to turn all these results into a table and sort by biking time to the Caltrain station. 00:09:57.680 |
We've checked this script into the samples folder of our GitHub repo, so feel free to give it a try. 00:10:04.400 |
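For reference, a condensed sketch of the workflow Fjord describes, combining a Pydantic schema extract, a helper function, a thread pool, and pandas. The class names, prompts, and rental site are illustrative; the actual script lives in the samples folder of the Nova Act GitHub repo and differs in detail:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from pydantic import BaseModel
from nova_act import NovaAct


class Apartment(BaseModel):
    # Schema the agent's JSON output must match.
    address: str
    price: str


class ApartmentList(BaseModel):
    apartments: list[Apartment]


class BikeTime(BaseModel):
    minutes: int


def find_apartments() -> list[Apartment]:
    # Search for rentals, then do a structured extract of the listings on screen.
    with NovaAct(starting_page="https://www.example-rentals.com") as nova:
        nova.act("search for 2 bedroom, 1 bath apartments in Redwood City")
        result = nova.act("Return the rental listings shown on the page",
                          schema=ApartmentList.model_json_schema())
        if not result.matches_schema:
            return []
        return ApartmentList.model_validate(result.parsed_response).apartments


def add_biking_distance(apartment: Apartment) -> dict:
    # Each call spins up its own browser to look up biking time on Google Maps.
    with NovaAct(starting_page="https://maps.google.com") as nova:
        result = nova.act(
            f"Find the biking time in minutes from {apartment.address} "
            "to the nearest Caltrain station",
            schema=BikeTime.model_json_schema())
        minutes = (BikeTime.model_validate(result.parsed_response).minutes
                   if result.matches_schema else None)
    return {"address": apartment.address, "price": apartment.price,
            "bike_minutes": minutes}


if __name__ == "__main__":
    apartments = find_apartments()
    # Run one browser per apartment in parallel with a thread pool.
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = list(pool.map(add_biking_distance, apartments))
    # Turn the results into a table and sort by biking time to Caltrain.
    print(pd.DataFrame(rows).sort_values("bike_minutes"))
```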
So we've made it really easy to get started. It's just three lines of code. And under the hood, 00:10:11.840 |
we're constantly making improvements to our model and shipping those every few weeks. And this is 00:10:17.040 |
important because even the building blocks of computer use are deceptively challenging. Here's why. 00:10:23.120 |
This is the Amazon website. And let me ask you, what do these icons mean? We typically take for granted 00:10:29.360 |
that even if we've never seen them before, we can easily interpret them. And when we can't, there are 00:10:34.240 |
usually plenty of cues for us to know what they mean. Now, Amazon actually labels these, but in many 00:10:40.000 |
contexts, the icons are not labeled and we couldn't possibly teach our agent all of the different icons, 00:10:45.360 |
let alone all of the different useful ways that it could use a computer. So we have to let our agent explore and learn 00:10:51.760 |
with RL. And it's really fascinating to think about how RL will enable these agents to discover how to use computers 00:10:59.120 |
in entirely new ways. And that's okay because we want them to be complementary to us. But if we're going to diverge in our computer use methods, 00:11:06.800 |
then it's really critical that our agent's perception of the digital world is aligned with our own. 00:11:12.160 |
And that's not what most agents can do right now. So current agents are LLM wrappers that function as 00:11:20.480 |
read-only assistants. They can use tools and some of them are getting really good at code, but they don't 00:11:26.480 |
have an environment to ground their interactions. They lack a world model. Computer use agents are different. 00:11:33.280 |
They can see pixels and interact with UIs just like us. So you can think of them as kind of having this 00:11:40.400 |
early form of embodiment. Now, we're not the only ones working on computer use agents, but our approach 00:11:46.560 |
is different. We are focusing on making the smallest units of interaction reliable and giving you granular 00:11:53.760 |
control over them. Just like you can string together words to generate infinite combinations of meaning, 00:12:00.480 |
you can string together atomic actions to generate increasingly complex workflows. Now, grounding our 00:12:07.280 |
interactions in a shared environment is necessary for building aligned general purpose agents, but it's not 00:12:14.400 |
sufficient. Computer use agents will need something else to be able to really reliably understand our higher 00:12:20.560 |
level goals. So how will Nova Act need to evolve to make us smarter and give us more agency? In other words, 00:12:28.240 |
what is it that makes our intelligence reliable and flexible and general purpose? Well, it turns out that 00:12:35.520 |
over the past decades, as engineers were building more advanced intelligence, scientists were learning 00:12:42.480 |
about how it works. And what they learned was that this isn't the whole story. It's just the most recent 00:12:48.800 |
story of our co-evolution with technology. So co-evolving with computers is the thing that we're fixated 00:12:57.840 |
on. But the story goes back a lot longer, and Engelbart actually hinted at this. He said, "In a very real sense, 00:13:05.840 |
as represented by the steady evolution of our augmentation means, the development of artificial 00:13:10.400 |
intelligence has been going on for centuries." Now, he was correct, but it was actually going on for a lot 00:13:15.520 |
longer than that. So let me take you back to the beginning. Around six million years ago, the environment 00:13:21.440 |
changed for our ancestors, and they had exactly two options. They could solve intelligence, 00:13:27.600 |
or go extinct. And the ones that solved intelligence did so through a feedback loop that changed our 00:13:34.160 |
social cognition. This should look familiar. First, our brains got bigger, then we connected them together, 00:13:41.360 |
which enabled us to further fine-tune into social information, and this made our brains even bigger. 00:13:46.720 |
But now you know that this scaling part is only half of the story. The other half had to do with how we all 00:13:54.000 |
got smarter. So we offloaded our computation to each other's minds and distributed our cognition across 00:14:01.360 |
the social environment. And this had the effect of augmenting our intelligence. So scientists call the 00:14:08.000 |
thing that we got better at through these flywheels representational alignment. We figured out how to 00:14:14.720 |
reproduce the contents of our minds to better cooperate. The key insight here is that the history of upgrading our 00:14:22.160 |
intelligence didn't start with computers. It started with an evolutionary adaptation that allowed us 00:14:28.160 |
to use each other's minds as tools. Let me say that in another way. The thing that makes our intelligence 00:14:33.920 |
general and flexible is inferring the existence of other minds. This means that this is general intelligence. 00:14:41.920 |
This can be general intelligence. This could possibly be general intelligence, but there's no reason to 00:14:49.520 |
expect that it will be aligned. And this is not general intelligence. Intelligence of the variety that humans 00:14:56.400 |
have can't exist in a vacuum. It doesn't exist in individual humans. It won't exist in individual models. 00:15:03.680 |
Instead, general intelligence emerges through our interactions. It's social, distributed, 00:15:09.680 |
ever-evolving. And that means that we need to measure the interactions and optimize for the interactions 00:15:15.120 |
that we have with agents. We can't just measure model capabilities or things like time spent on platform. 00:15:21.200 |
We have to measure human things like creativity, productivity, strategic thinking, even things like states of flow. 00:15:28.400 |
So let's take a closer look at this evolutionary adaptation. Any ideas as to what it was? 00:15:34.400 |
It was language. So language co-evolved with our models of minds in yet another flywheel that 00:15:43.520 |
integrated our systems for communication and representation. And it did this by being both a cause 00:15:49.200 |
and an effect of modeling our minds. Let's break that down. 00:15:52.720 |
We've got our models and our communicative interfaces, and then here's how they became integrated. 00:15:58.160 |
As we fine-tuned into social cues, our models of mind became more stable. This advanced our language, 00:16:06.240 |
and our language made our models of mind even more stable. And then here's the big bang moment for our intelligence. 00:16:14.560 |
Our models of mind became the original placeholder concept, the first variable for being able to 00:16:20.160 |
represent any concept. That right there is generalization. So you might be thinking, but is this 00:16:26.240 |
different from other languages? And the answer is yes. Other communication systems don't have models of mind. 00:16:33.200 |
Programming languages don't negotiate meaning in real time. This is why code is so easily verifiable. 00:16:39.600 |
And LLMs don't understand language. What do we mean they don't understand language? They don't understand 00:16:45.440 |
that words refer to things that minds make up. So when we ask what's in a word, the answer is quite literally a mind. 00:16:53.440 |
So language was so immensely useful that it triggered a whole new series of flywheels that scientists call cognitive technologies. 00:17:02.880 |
Each one is a foundation for the next, and each one allows us to have increasingly abstract thoughts. 00:17:08.960 |
They become useful by evolving within communities. 00:17:13.200 |
So early computers were not very useful to many people. They didn't have great interfaces, but Engelbart changed this. 00:17:20.640 |
Now computers are getting in our way. We've never had the world's information so easily accessible, 00:17:27.760 |
but also we've never had more distractions. And agents can help fix this. They can do the repetitive stuff 00:17:35.360 |
for us. They can learn from us and redistribute our skills across communities, and they can teach us new things 00:17:41.360 |
when they discover new knowledge. In essence, agents can become our collective subconscious, 00:17:47.360 |
but we need to build them in a way that reflects this larger pattern. So collectively, these tools for thought 00:17:55.040 |
stabilize our thinking, reorganize our brains, and control our hallucinations. How do they control our 00:18:03.120 |
hallucinations? Well, they direct our attention to the same things in the environment. They pick out the relevant 00:18:09.120 |
signals from the noise, and then we stabilize these signals to co-create these shared world models. And what does 00:18:15.920 |
that sound like? It sounds like what we're building. So another way of thinking about Nova Act is as the 00:18:22.400 |
primitives for a cognitive technology that aligns agents' and humans' representations. And just like with 00:18:28.880 |
other cognitive technologies, early agents will need to evolve in diverse communities. So that's where all of 00:18:36.640 |
you come in. But reliability isn't just about clicking in the same place every time. It's about understanding 00:18:43.360 |
the larger goal. So to return to our big question: How do we make agents reliable? Eventually, 00:18:50.240 |
they're going to need models of our minds. So the next thing that we'll need to build is agents with models 00:18:56.880 |
of our minds. But we don't actually build those directly. We need to set the preconditions for them 00:19:02.240 |
to emerge. And this requires a common language for humans and computers. And at this point, you know what this entails. 00:19:08.960 |
Agents will need a model of our shared environment and interfaces that support intuitive interactions with us. 00:19:17.840 |
These will enable humans and agents to reciprocally level up one another's intelligence. To advance the models, we will need human-agent interaction data. 00:19:27.280 |
And to motivate people to use the agents in the first place, we'll need useful products. The more 00:19:32.720 |
useful the products become, the smarter we will all become. So this is how we can collectively build 00:19:38.320 |
useful general intelligence. If you want to learn more about Nova Act, then stick around right here 00:19:45.200 |
for the upcoming workshop, and thank you for your time.