
Useful General Intelligence — Danielle Perszyk, Amazon AGI



00:00:00.000 | Hi everyone. That's a little loud. I'm Danielle and I'm a cognitive scientist working at the
00:00:21.200 | experimental new Amazon AGI SF lab. And throughout this conference you're going to hear a lot of
00:00:26.880 | talks about building and scaling agents including some from my colleagues at AWS. But this talk is
00:00:32.720 | going to be a little different. I want to think about how we can co-evolve with general purpose
00:00:37.040 | agents and what it will take to make them reliable and aligned with our own intelligence. So I'd like
00:00:43.280 | to set the stage by reminding us of a fact about the reliability of our own minds. We're all
00:00:50.880 | hallucinating right now. Our brains don't have direct access to reality; they're stuck inside
00:00:57.840 | our heads, so they can only really do a few things. They can make predictions with their world models.
00:01:04.640 | They can take in sensory information and they can reconcile errors between the two. That's about it.
00:01:10.720 | And that's why neuroscientists call our brains prediction machines and say that perception is
00:01:16.480 | controlled hallucination. But there's no way of course that I could be standing up here in front
00:01:22.640 | of you if I didn't have my hallucinations under control. The controlled part is the critical bit.
00:01:26.640 | But that's not all that's happening right now. If you're understanding my words then I'm also influencing
00:01:32.880 | your hallucinations. And assuming you do understand my words then your brain just did something else.
00:01:38.960 | It activated all meanings of the word hallucination including this one. So today we rely upon
00:01:46.240 | hallucinating chatbots for brainstorming, generating content and code, and images of themselves like this.
00:01:53.840 | But what they can't yet do is think, learn, or act in a reliable general purpose way. And we're not
00:01:59.760 | satisfied with that because we've set our sights on building AI that more closely resembles our own
00:02:06.560 | intelligence. But what makes our intelligence general? Well one thing we know is that hallucinations
00:02:13.760 | are necessary because they allow us to go beyond the data. They're features rather than bugs of AI that's
00:02:20.560 | flexible like ours. So we just need to figure out how to control them. I'm going to be drawing a lot of
00:02:26.160 | parallels to our intelligence. But I'm not saying that we are or should be building something like a human
00:02:32.240 | brain. We don't want AI to replace us or replicate us. We want it to complement us. We want AI plus humans
00:02:40.480 | to be greater than the sum of our parts. Now this isn't typically what we think about when we hear AGI.
00:02:46.640 | We think about the AI becoming more advanced. But this reflects a category error about how our intelligence
00:02:54.480 | actually works. And that error is that general intelligence can exist within a thinking machine.
00:03:00.560 | So when you think about AGI, you probably think about something like this. And you might think that
00:03:07.600 | it's right around the corner. But why does it then feel like agents are closer to something like this?
00:03:14.320 | The reality is that models can't yet reliably click, type, or scroll. And so everyone wants to know,
00:03:24.160 | how do we make agents reliable? That's the question that I'm going to focus on today.
00:03:28.080 | So first I'll share our lab's vision for agents. Then I will show you how Nova Act, which is a research
00:03:35.680 | preview of our agent, works today. And then finally I'll show you how Nova Act will evolve and how
00:03:41.600 | you are all central to that evolution. So let's start with the big picture. Our vision for agents is
00:03:47.920 | different than the standard vision, which reflects this long lineage of thought that has become folklore.
00:03:53.760 | So you all know the story by now, which is why you probably spotted the hallucination here.
00:03:59.200 | The concept of machines that can think like humans didn't originate in the 2010s,
00:04:04.800 | but in 1956 when a group of engineers and mathematicians set out to build thinking
00:04:10.160 | machines so they could solve intelligence. Of course, you also all know that these guys didn't
00:04:15.680 | solve intelligence, but they did succeed in founding the field of AI and sparking a feedback loop that
00:04:21.840 | changed how we live and work. So first we built more powerful computers, then we connected them
00:04:27.360 | together to build the internet, which enabled more sophisticated learning algorithms. And this made our
00:04:32.720 | computers even more powerful. And now we're back to aiming for thinking machines by another name:
00:04:37.440 | Artificial General Intelligence, or AGI. So the standard vision is to make AI smarter and give it more
00:04:44.720 | agency. And notice that this is about the technology, not us. Well, luckily this wasn't the only historical
00:04:51.120 | perspective. Does anybody know who this is?
00:04:56.240 | This is Douglas Engelbart, and he invented the computer mouse and the GUI. He didn't care so much
00:05:01.600 | about thinking machines and solving intelligence. What he cared about was thinking humans and augmenting
00:05:08.320 | our intelligence. And he proposed that computers could make us smarter. Of course, he was absolutely right.
00:05:14.800 | So as computers became pervasive, they also started changing our brains. We began offloading our computation to
00:05:22.640 | devices, distributing our cognition across the digital environment, and this had the effect of augmenting
00:05:29.360 | our intelligence. Scientists call this techno-social co-evolution. It just means that we invent new technologies
00:05:37.040 | that then shape us. So here we have two historical perspectives for the goal of building more advanced
00:05:44.320 | intelligence that resembles our own. We can build AI that is as smart as or even smarter than us,
00:05:50.000 | or we can build AI that makes us smarter. We all believe that more general purpose agents are going to be
00:05:57.200 | more useful. But how? Well, things are useful when they have one of two effects. They can simplify our lives by
00:06:04.160 | allowing us to offload things, or they can give us more leverage. And yes, automation is an engine for
00:06:11.360 | augmentation. This is how we become expert at things. We start by paying conscious attention to the details,
00:06:17.440 | we practice, and then our brain moves things over to our subconscious. Automation frees up our attention to
00:06:23.760 | focus on other things. The problem is that automation doesn't always lead to augmentation. Sometimes it even
00:06:31.440 | comes at a cost. How many hours have we lost to scrolling? Or how many echo chambers have we been
00:06:37.120 | trapped within? How many times has autocomplete just shut down our thinking? So this is how algorithms can
00:06:43.760 | reduce our agency. And it's how increasingly intelligent agents might cause more problems than they solve.
00:06:50.400 | But if we have precise control and we actively tailor these systems the way that we want, then we can actually
00:06:57.440 | increase our agency. And this is the crossroads in front of us. We can continue to make AI smarter and give
00:07:04.240 | it more agency. We can focus on unhobbling the AI, as it's fashionable to say. But this doesn't guarantee
00:07:11.120 | that it will be useful to us. It just guarantees that we'll continue to see a lot of the same patterns that
00:07:15.920 | we've seen in tech recently. And that's why our vision is to build AI that makes us smarter and gives
00:07:22.240 | us more agency. To build AI that unhobbles humans. So how do we do that? Well, in these early stages,
00:07:30.720 | we need to do two things. We need to meet the models where they are and meet the builders where they are. So
00:07:37.920 | all of you have a million ideas about what you want to do with agents. We have to make it frictionless
00:07:42.640 | for you to get started. And Nova Act does these two things. We're building a future where the atomic unit
00:07:51.440 | of all digital interactions will be an agent call. The big obstacle is that we still have only limited
00:07:57.440 | infrastructure for APIs. Most websites are built for visual UIs rather than APIs, and so
00:08:06.240 | we need to use the browser itself as a tool. And that's why we've trained a version of Amazon's
00:08:12.720 | foundation model, Nova, to be really good at UIs, to interact with UIs like we do. Nova Act combines
00:08:20.480 | this model with an SDK to allow developers to build and deploy agents. All you have to do is make an Act
00:08:27.840 | call, which translates natural language into actions on the screen. And I'm going to show you a demo here
00:08:34.800 | where my teammate, Carolyn, will show you how you can use
00:08:38.480 | Nova Act to find our dream apartment. We're searching for a two-bedroom, one-bath in Redwood City.
00:08:47.040 | Here, we've given our first Act call to the agent. It's going to break down how to complete this task,
00:08:53.440 | considering the outcome of each step as it plans the next one. Behind the scenes, this is all powered by a
00:08:59.120 | specialized version of Amazon Nova, trained for high reliability on UI tasks.
00:09:03.520 | Next, I'm going to show you my teammate, Fjord, who will describe how you can do even more things with
00:09:13.680 | Python integrations. All right. We see a bunch of rentals on the screen, so let's grab them using a
00:09:18.800 | structured extract. We'll define a Pydantic class and ask the agent to return JSON matching that schema.
00:09:27.200 | For my commute, I want to know the biking distance to the nearest Caltrain station for each of these results.
00:09:31.600 | Let's define a helper function: add_biking_distance will take in an apartment and then use Google Maps to calculate the distance.
00:09:38.240 | Now, I don't want to wait for each of these searches to complete one by one, so let's do this in parallel.
00:09:45.600 | Since this is Python, we can just use a thread pool to spin up multiple browsers, one for each address.
00:09:50.160 | Finally, I'll use pandas to turn all these results into a table and sort by biking time to the Caltrain station.
00:09:57.680 | We've checked this script into the samples folder of our GitHub repo, so feel free to give it a try.
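Here is a rough sketch of what a script like that might look like. The NovaAct class, act() method, schema= parameter, and result fields follow the public nova-act samples, but treat the exact signatures as assumptions; the rental-site URL, prompts, and field names are illustrative placeholders rather than the exact script in the repo.

```python
# Hypothetical sketch of the apartment-search workflow described in the demo.
# NovaAct, act(), schema=, matches_schema, and parsed_response follow the
# public nova-act samples; exact signatures may differ.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from pydantic import BaseModel

from nova_act import NovaAct


class Apartment(BaseModel):
    address: str
    price: str
    beds: int
    baths: int


class ApartmentList(BaseModel):
    apartments: list[Apartment]


class BikeTime(BaseModel):
    minutes: int


def find_apartments() -> list[Apartment]:
    # Each act() call is one atomic unit: natural language in, UI actions out.
    with NovaAct(starting_page="https://example-rentals.com") as nova:  # placeholder site
        nova.act("search for 2-bedroom, 1-bath apartments in Redwood City")
        # Structured extract: ask the agent to return JSON matching the schema.
        result = nova.act(
            "return the rentals currently visible on the page",
            schema=ApartmentList.model_json_schema(),
        )
        if not result.matches_schema:
            return []
        return ApartmentList.model_validate(result.parsed_response).apartments


def add_biking_distance(apartment: Apartment) -> dict:
    # Helper: open its own browser and use Google Maps to estimate the
    # biking time to the nearest Caltrain station.
    with NovaAct(starting_page="https://maps.google.com") as nova:
        result = nova.act(
            f"find the biking time in minutes from {apartment.address} "
            "to the nearest Caltrain station",
            schema=BikeTime.model_json_schema(),
        )
        minutes = (
            BikeTime.model_validate(result.parsed_response).minutes
            if result.matches_schema
            else None
        )
        return {"address": apartment.address, "price": apartment.price, "bike_minutes": minutes}


if __name__ == "__main__":
    apartments = find_apartments()
    # Plain Python parallelism: a thread pool spins up one browser per address.
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = list(pool.map(add_biking_distance, apartments))
    # Turn the results into a table and sort by biking time to Caltrain.
    print(pd.DataFrame(rows).sort_values("bike_minutes"))
```

Everything downstream of the Act calls is ordinary Python: the thread pool mirrors the "multiple browsers, one for each address" step, and pandas handles the final table.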
00:10:04.400 | So we've made it really easy to get started. It's just three lines of code.
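The getting-started pattern itself can be that small. A minimal sketch, again assuming the NovaAct class and a single act() call from the public samples, with a placeholder URL and prompt:

```python
from nova_act import NovaAct

# Minimal sketch: open a browser on a starting page and make one Act call.
with NovaAct(starting_page="https://www.amazon.com") as nova:
    nova.act("search for a coffee maker and open the first result")
```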
00:10:11.840 | And under the hood, we're constantly making improvements to our model and shipping those every few weeks. And this is
00:10:17.040 | important because even the building blocks of computer use are deceptively challenging. Here's why.
00:10:23.120 | This is the Amazon website. And let me ask you, what do these icons mean? We typically take for granted
00:10:29.360 | that even if we've never seen them before, we can easily interpret them. And when we can't, there are
00:10:34.240 | usually plenty of cues for us to know what they mean. Now, Amazon actually labels these, but in many
00:10:40.000 | contexts, the icons are not labeled and we couldn't possibly teach our agent all of the different icons,
00:10:45.360 | let alone all of the different useful ways that it could use a computer. So we have to let our agent explore and learn
00:10:51.760 | with RL. And it's really fascinating to think about how RL will enable these agents to discover how to use computers
00:10:59.120 | in entirely new ways. And that's okay because we want them to be complementary to us. But if we're going to diverge in our computer use methods,
00:11:06.800 | then it's really critical that our agent's perception of the digital world is aligned with our own.
00:11:12.160 | And that's not what most agents can do right now. So current agents are LLM wrappers that function as
00:11:20.480 | read-only assistants. They can use tools and some of them are getting really good at code, but they don't
00:11:26.480 | have an environment to ground their interactions. They lack a world model. Computer use agents are different.
00:11:33.280 | They can see pixels and interact with UIs just like us. So you can think of them as kind of having this
00:11:40.400 | early form of embodiment. Now, we're not the only ones working on computer use agents, but our approach
00:11:46.560 | is different. We are focusing on making the smallest units of interaction reliable and giving you granular
00:11:53.760 | control over them. Just like you can string together words to generate infinite combinations of meaning,
00:12:00.480 | you can string together atomic actions to generate increasingly complex workflows. Now, grounding our
00:12:07.280 | interactions in a shared environment is necessary for building aligned general purpose agents, but it's not
00:12:14.400 | sufficient. Computer use agents will need something else to be able to really reliably understand our higher
00:12:20.560 | level goals. So how will Nova Act need to evolve to make us smarter and give us more agency? In other words,
00:12:28.240 | what is it that makes our intelligence reliable and flexible and general purpose? Well, it turns out that
00:12:42.480 | over the past decades, as engineers were building more advanced intelligence, scientists were learning
00:12:42.480 | about how it works. And what they learned was that this isn't the whole story. It's just the most recent
00:12:48.800 | story of our co-evolution with technology. So co-evolving with computers is the thing that we're fixated
00:12:57.840 | on. But the story goes back a lot longer, and Engelbart actually hinted at this. He said, "In a very real sense,
00:13:05.840 | as represented by the steady evolution of our augmentation means, the development of artificial
00:13:10.400 | intelligence has been going on for centuries." Now, he was correct, but it was actually going on for a lot
00:13:15.520 | longer than that. So let me take you back to the beginning. Around six million years ago, the environment
00:13:21.440 | changed for our ancestors, and they had exactly two options. They could solve intelligence,
00:13:27.600 | or go extinct. And the ones that solved intelligence did so through a feedback loop that changed our
00:13:34.160 | social cognition. This should look familiar. First, our brains got bigger, then we connected them together,
00:13:41.360 | which enabled us to further fine-tune into social information, and this made our brains even bigger.
00:13:46.720 | But now you know that this scaling part is only half of the story. The other half had to do with how we all
00:13:54.000 | got smarter. So we offloaded our computation to each other's minds and distributed our cognition across
00:14:01.360 | the social environment. And this had the effect of augmenting our intelligence. So scientists call the
00:14:08.000 | thing that we got better at through these flywheels representational alignment. We figured out how to
00:14:14.720 | reproduce the contents of our minds to better cooperate. The key insight here is that the history of upgrading our
00:14:22.160 | intelligence didn't start with computers. It started with an evolutionary adaptation that allowed us
00:14:28.160 | to use each other's minds as tools. Let me say that in another way. The thing that makes our intelligence
00:14:33.920 | general and flexible is inferring the existence of other minds. This means that this is general intelligence.
00:14:41.920 | This can be general intelligence. This could possibly be general intelligence, but there's no reason to
00:14:49.520 | expect that it will be aligned. And this is not general intelligence. Intelligence of the variety that humans
00:14:56.400 | have can't exist in a vacuum. It doesn't exist in individual humans. It won't exist in individual models.
00:15:03.680 | Instead, general intelligence emerges through our interactions. It's social, distributed,
00:15:09.680 | ever-evolving. And that means that we need to measure the interactions and optimize for the interactions
00:15:15.120 | that we have with agents. We can't just measure model capabilities or things like time spent on platform.
00:15:21.200 | We have to measure human things like creativity, productivity, strategic thinking, even things like states of flow.
00:15:28.400 | So let's take a closer look at this evolutionary adaptation. Any ideas as to what it was?
00:15:34.400 | It was language. So language co-evolved with our models of minds in yet another flywheel that
00:15:43.520 | integrated our systems for communication and representation. And it did this by being both a cause
00:15:49.200 | and an effect of modeling our minds. Let's break that down.
00:15:52.720 | We've got our models and our communicative interfaces, and then here's how they became integrated.
00:15:58.160 | As we fine-tuned into social cues, our models of mind became more stable. This advanced our language,
00:16:06.240 | and our language made our models of mind even more stable. And then here's the big bang moment for our intelligence.
00:16:14.560 | Our models of mind became the original placeholder concept, the first variable for being able to
00:16:20.160 | represent any concept. That right there is generalization. So you might be thinking, but is this
00:16:26.240 | different from other languages? And the answer is yes. Other communication systems don't have models of mind.
00:16:33.200 | Programming languages don't negotiate meaning in real time. This is why code is so easily verifiable.
00:16:39.600 | And LLMs don't understand language. What do we mean they don't understand language? They don't understand
00:16:45.440 | that words refer to things that minds make up. So when we ask what's in a word, the answer is quite literally a mind.
00:16:53.440 | So language was so immensely useful that it triggered a whole new series of flywheels that scientists call cognitive technologies.
00:17:02.880 | Each one is a foundation for the next, and each one allows us to have increasingly abstract thoughts.
00:17:08.960 | They become useful by evolving within communities.
00:17:13.200 | So early computers were not very useful to many people. They didn't have great interfaces, but Engelbart changed this.
00:17:20.640 | Now computers are getting in our way. We've never had the world's information so easily accessible,
00:17:27.760 | but also we've never had more distractions. And agents can help fix this. They can do the repetitive stuff
00:17:35.360 | for us. They can learn from us and redistribute our skills across communities, and they can teach us new things
00:17:41.360 | when they discover new knowledge. In essence, agents can become our collective subconscious,
00:17:47.360 | but we need to build them in a way that reflects this larger pattern. So collectively, these tools for thought
00:17:55.040 | stabilize our thinking, reorganize our brains, and control our hallucinations. How do they control our
00:18:03.120 | hallucinations? Well, they direct our attention to the same things in the environment. They pick out the relevant
00:18:09.120 | signals from the noise, and then we stabilize these signals to co-create these shared world models. And what does
00:18:15.920 | that sound like? It sounds like what we're building. So another way of thinking about Nova Act is as the
00:18:22.400 | primitives for a cognitive technology that aligns agents' and humans' representations. And just like with
00:18:28.880 | other cognitive technologies, early agents will need to evolve in diverse communities. So that's where all of
00:18:36.640 | you come in. But reliability isn't just about clicking in the same place every time. It's about understanding
00:18:43.360 | the larger goal. So to return to our big question: How do we make agents reliable? Eventually,
00:18:50.240 | they're going to need models of our minds. So the next thing that we'll need to build is agents with models
00:18:56.880 | of our minds. But we don't actually build those directly. We need to set the preconditions for them
00:19:02.240 | to emerge. And this requires a common language for humans and computers. And at this point, you know what this entails.
00:19:08.960 | Agents will need a model of our shared environment and interfaces that support intuitive interactions with us.
00:19:17.840 | These will enable humans and agents to reciprocally level up one another's intelligence. To advance the models, we will need human-agent interaction data.
00:19:27.280 | And to motivate people to use the agents in the first place, we'll need useful products. The more
00:19:32.720 | useful the products become, the smarter we will all become. So this is how we can collectively build
00:19:38.320 | useful general intelligence. If you want to learn more about Nova Act, then stick around right here
00:19:45.200 | for the upcoming workshop, and thank you for your time.