Hi everyone. That's a little loud. I'm Danielle and I'm a cognitive scientist working at the experimental new Amazon AGI SF lab. And throughout this conference you're going to hear a lot of talks about building and scaling agents including some from my colleagues at AWS. But this talk is going to be a little different.
I want to think about how we can co-evolve with general purpose agents and what it will take to make them reliable and aligned with our own intelligence. So I'd like to set the stage by reminding us of a fact about the reliability of our own minds. We're all hallucinating right now.
Our brains don't have direct access to reality; they're stuck inside our heads, so they can only really do a few things. They can make predictions with their world models. They can take in sensory information. And they can reconcile errors between the two. That's about it. And that's why neuroscientists call our brains prediction machines and say that perception is controlled hallucination.
But there's no way of course that I could be standing up here in front of you if I didn't have my hallucinations under control. The controlled part is the critical bit. But that's not all that's happening right now. If you're understanding my words then I'm also influencing your hallucinations.
And assuming you do understand my words, then your brain just did something else. It activated all the meanings of the word hallucination, including this one. So today we rely upon hallucinating chatbots for brainstorming, for generating content and code, and even for generating images of themselves, like this one. But what they can't yet do is think, learn, or act in a reliable, general-purpose way.
And we're not satisfied with that because we've set our sights on building AI that more closely resembles our own intelligence. But what makes our intelligence general? Well one thing we know is that hallucinations are necessary because they allow us to go beyond the data. They're features rather than bugs of AI that's flexible like ours.
So we just need to figure out how to control them. I'm going to be drawing a lot of parallels to our intelligence. But I'm not saying that we are or should be building something like a human brain. We don't want AI to replace us or replicate us. We want it to complement us.
We want AI plus humans to be greater than the sum of our parts. Now this isn't typically what we think about when we hear AGI; we think about the AI becoming more advanced. But this reflects a category error about how our intelligence actually works. And that error is that general intelligence can exist within a thinking machine.
So when you think about AGI, you probably think about something like this. And you might think that it's right around the corner. But why does it then feel like agents are closer to something like this? The reality is that models can't yet reliably click, type, or scroll. And so everyone wants to know, how do we make agents reliable?
That's the question that I'm going to focus on today. So first I'll share our lab's vision for agents. Then I'll show you how Nova Act, which is a research preview of our agent, works today. And then finally I'll show you how Nova Act will evolve and how you are all central to that evolution.
So let's start with the big picture. Our vision for agents is different than the standard vision, which reflects this long lineage of thought that has become folklore. So you all know the story by now, which is why you probably spotted the hallucination here. The concept of machines that can think like humans didn't originate in the 2010s, but in 1956 when a group of engineers and mathematicians set out to build thinking machines so they could solve intelligence.
Of course, you also all know that these guys didn't solve intelligence, but they did succeed in founding the field of AI and sparking a feedback loop that changed how we live and work. So first we built more powerful computers, then we connected them together to build the internet, which enabled more sophisticated learning algorithms.
And this made our computers even more powerful. And now we're back to aiming for thinking machines by another name: Artificial General Intelligence, or AGI. So the standard vision is to make AI smarter and give it more agency. And notice that this is about the technology, not us. Well, luckily this wasn't the only historical perspective.
Does anybody know who this is? This is Douglas Engelbart, and he invented the computer mouse and the GUI. He didn't care so much about thinking machines and solving intelligence. What he cared about was thinking humans and augmenting our intelligence. And he proposed that computers could make us smarter. Of course, he was absolutely right.
So as computers became pervasive, they also started changing our brains. We began offloading our computation to devices, distributing our cognition across the digital environment, and this had the effect of augmenting our intelligence. Scientists call this techno-social co-evolution. It just means that we invent new technologies that then shape us.
So here we have two historical perspectives on the goal of building more advanced intelligence that resembles our own. We can build AI that is as smart as or even smarter than us, or we can build AI that makes us smarter. We all believe that more general-purpose agents are going to be more useful.
But how? Well, things are useful when they have one of two effects. They can simplify our lives by allowing us to offload things, or they can give us more leverage. And yes, automation is an engine for augmentation. This is how we become expert at things. We start by paying conscious attention to the details, we practice, and then our brain moves things over to our subconscious.
Automation frees up our attention to focus on other things. The problem is that automation doesn't always lead to augmentation. Sometimes it even comes at a cost. How many hours have we lost to scrolling? Or how many echo chambers have we been trapped within? How many times has autocomplete just shut down our thinking?
So this is how algorithms can reduce our agency. And it's how increasingly intelligent agents might cause more problems than they solve. But if we have precise control and we actively tailor these systems the way that we want, then we can actually increase our agency. And this is the crossroads in front of us.
We can continue to make AI smarter and give it more agency. We can focus on unhobbling the AI, as it's fashionable to say. But this doesn't guarantee that it will be useful to us. It just guarantees that we'll continue to see a lot of the same patterns that we've seen in tech recently.
And that's why our vision is to build AI that makes us smarter and gives us more agency. To build AI that unhobbles humans. So how do we do that? Well, in these early stages, we need to do two things. We need to meet the models where they are and meet the builders where they are.
So all of you have a million ideas about what you want to do with agents. We have to make it frictionless for you to get started. And Nova Act does these two things. We're building a future where the atomic unit of all digital interactions will be an agent call.
The big obstacle is that the infrastructure for APIs is still incomplete. Most websites are built for visual UIs rather than APIs, so we need to use the browser itself as a tool. And that's why we've trained a version of Amazon's foundation model, Nova, to be really good at UIs, to interact with UIs like we do.
Nova Act combines this model with an SDK to allow developers to build and deploy agents. All you have to do is make an Act call, which translates natural language into actions on the screen. And I'm going to show you a demo here where my teammate, Carolyn, will show you how you can use Nova Act to find our dream apartment.
We're searching for a two-bedroom, one-bath in Redwood City. Here, we've given our first Act call to the agent. It's going to break down how to complete this task, considering the outcome of each step as it plans the next one. Behind the scenes, this is all powered by a specialized version of Amazon Nova, trained for high reliability on UI tasks.
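To make that concrete, here's a minimal sketch of what that first Act call could look like with the Nova Act SDK. The rental-site URL is a placeholder, and exact parameter names may differ slightly in the research preview.

```python
from nova_act import NovaAct

# Placeholder starting page; any rental-listings site would work here.
with NovaAct(starting_page="https://www.example-rentals.com") as nova:
    nova.act("search for a two-bedroom, one-bath apartment in Redwood City")
```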
Next, I'm going to show you my teammate, Fjord, who will describe how you can do even more things with Python integrations. All right. We see a bunch of rentals on the screen, so let's grab them using a structured extract. We'll define a Pydantic class and ask the agent to return JSON matching that schema.
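Here's a sketch of what that structured extract might look like, assuming the act call accepts a JSON schema and hands back a parsed response, as in the SDK samples; the field names and URL are illustrative.

```python
from pydantic import BaseModel
from nova_act import NovaAct


class Apartment(BaseModel):
    address: str
    price: str
    beds: int
    baths: int


class ApartmentList(BaseModel):
    apartments: list[Apartment]


with NovaAct(starting_page="https://www.example-rentals.com") as nova:
    nova.act("search for two-bedroom, one-bath apartments in Redwood City")
    # Ask the agent to return JSON matching the Pydantic schema, then
    # validate the parsed response back into typed Python objects.
    result = nova.act(
        "Return the rentals visible on this page",
        schema=ApartmentList.model_json_schema(),
    )
    apartments = ApartmentList.model_validate(result.parsed_response).apartments
```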
For my commute, I want to know the biking distance to the nearest Caltrain station for each of these results. Let's define a helper function: add_biking_distance will take in an apartment and then use Google Maps to calculate the distance. Now, I don't want to wait for each of these searches to complete one by one, so let's do this in parallel.
Since this is Python, we can just use a thread pool to spin up multiple browsers, one for each address. Finally, I'll use pandas to turn all these results into a table and sort by my biking time to the Caltrain station. We've checked this script into the samples folder of our GitHub repo, so feel free to give it a try.
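And here's a rough sketch of that parallel step, continuing from the Apartment and apartments names in the sketch above. The add_biking_distance helper and its Google Maps lookup are stubbed out purely for illustration; the real, complete script is the one in the samples folder of the GitHub repo.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def lookup_biking_minutes(address: str) -> float:
    # Stub for the real lookup, which would drive its own browser session
    # through Google Maps to get the biking time from this address to the
    # nearest Caltrain station. Stubbed here so the sketch stays self-contained.
    return float(len(address) % 30)


def add_biking_distance(apartment: Apartment) -> dict:
    # Attach the biking time to each apartment record.
    minutes = lookup_biking_minutes(apartment.address)
    return {**apartment.model_dump(), "biking_minutes": minutes}


# One worker per apartment, so each lookup can run in its own browser in parallel.
with ThreadPoolExecutor(max_workers=max(1, len(apartments))) as pool:
    rows = list(pool.map(add_biking_distance, apartments))

# Turn the results into a table and sort by biking time to Caltrain.
table = pd.DataFrame(rows).sort_values("biking_minutes")
print(table)
```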
So we've made it really easy to get started. It's just three lines of code. And under the hood, we're constantly making improvements to our model and shipping those every few weeks. And this is important because even the building blocks of computer use are deceptively challenging. Here's why. This is the Amazon website.
And let me ask you, what do these icons mean? We typically take for granted that even if we've never seen them before, we can easily interpret them. And when we can't, there are usually plenty of cues for us to know what they mean. Now, Amazon actually labels these, but in many contexts, the icons are not labeled and we couldn't possibly teach our agent all of the different icons, let alone all of the different useful ways that it could use a computer.
So we have to let our agent explore and learn with RL. And it's really fascinating to think about how RL will enable these agents to discover how to use computers in entirely new ways. And that's okay because we want them to be complementary to us. But if we're going to diverge in our computer use methods, then it's really critical that our agent's perception of the digital world is aligned with our own.
And that's not what most agents can do right now. So current agents are LLM wrappers that function as read-only assistants. They can use tools and some of them are getting really good at code, but they don't have an environment to ground their interactions. They lack a world model. Computer use agents are different.
They can see pixels and interact with UIs just like us. So you can think of them as kind of having this early form of embodiment. Now, we're not the only ones working on computer use agents, but our approach is different. We are focusing on making the smallest units of interaction reliable and giving you granular control over them.
Just like you can string together words to generate infinite combinations of meaning, you can string together atomic actions to generate increasingly complex workflows. Now, grounding our interactions in a shared environment is necessary for building aligned general purpose agents, but it's not sufficient. Computer use agents will need something else to be able to really reliably understand our higher level goals.
So how will Nova Act need to evolve to make us smarter and give us more agency? In other words, what is it that makes our intelligence reliable and flexible and general purpose? Well, it turns out that over the past few decades, as engineers were building more advanced intelligence, scientists were learning about how it works.
And what they learned was that this isn't the whole story. It's just the most recent story of our co-evolution with technology. Co-evolving with computers is the thing that we're fixated on, but the story goes back a lot longer, and Engelbart actually hinted at this. He said, "In a very real sense, as represented by the steady evolution of our augmentation means, the development of artificial intelligence has been going on for centuries." Now, he was correct, but it was actually going on for a lot longer than that.
So let me take you back to the beginning. Around six million years ago, the environment changed for our ancestors, and they had exactly two options. They could solve intelligence, or go extinct. And the ones that solved intelligence did so through a feedback loop that changed our social cognition. This should look familiar.
First, our brains got bigger, then we connected them together, which enabled us to further fine-tune into social information, and this made our brains even bigger. But now you know that this scaling part is only half of the story. The other half had to do with how we all got smarter.
So we offloaded our computation to each other's minds and distributed our cognition across the social environment. And this had the effect of augmenting our intelligence. So scientists call the thing that we got better at through these flywheels representational alignment. We figured out how to reproduce the contents of our minds to better cooperate.
The key insight here is that the history of upgrading our intelligence didn't start with computers. It started with an evolutionary adaptation that allowed us to use each other's minds as tools. Let me say that in another way. The thing that makes our intelligence general and flexible is inferring the existence of other minds.
This means that this is general intelligence. This can be general intelligence. This could possibly be general intelligence, but there's no reason to expect that it will be aligned. And this is not general intelligence. Intelligence of the variety that humans have can't exist in a vacuum. It doesn't exist in individual humans.
It won't exist in individual models. Instead, general intelligence emerges through our interactions. It's social, distributed, ever-evolving. And that means that we need to measure the interactions and optimize for the interactions that we have with agents. We can't just measure model capabilities or things like time spent on platform. We have to measure human things like creativity, productivity, strategic thinking, even things like states of flow.
So let's take a closer look at this evolutionary adaptation. Any ideas as to what it was? It was language. So language co-evolved with our models of minds in yet another flywheel that integrated our systems for communication and representation. And it did this by being both a cause and an effect of modeling our minds.
Let's break that down. We've got our models and our communicative interfaces, and then here's how they became integrated. As we fine-tuned into social cues, our models of mind became more stable. This advanced our language, and our language made our models of mind even more stable. And then here's the big bang moment for our intelligence.
Our models of mind became the original placeholder concept, the first variable for being able to represent any concept. That right there is generalization. So you might be thinking, but is this different from other languages? And the answer is yes. Other communication systems don't have models of mind. Programming languages don't negotiate meaning in real time.
This is why code is so easily verifiable. And LLMs don't understand language. What do we mean they don't understand language? They don't understand that words refer to things that minds make up. So when we ask what's in a word, the answer is quite literally a mind. So language was so immensely useful that it triggered a whole new series of flywheels that scientists call cognitive technologies.
Each one is a foundation for the next, and each one allows us to have increasingly abstract thoughts. They become useful by evolving within communities. So early computers were not very useful to many people. They didn't have great interfaces, but Engelbart changed this. Now computers are getting in our way.
We've never had the world's information so easily accessible, but also we've never had more distractions. And agents can help fix this. They can do the repetitive stuff for us. They can learn from us and redistribute our skills across communities, and they can teach us new things when they discover new knowledge.
In essence, agents can become our collective subconscious, but we need to build them in a way that reflects this larger pattern. So collectively, these tools for thought stabilize our thinking, reorganize our brains, and control our hallucinations. How do they control our hallucinations? Well, they direct our attention to the same things in the environment.
They pick out the relevant signals from the noise, and then we stabilize these signals to co-create these shared world models. And what does that sound like? It sounds like what we're building. So another way of thinking about Nova Act is as the primitives for a cognitive technology that aligns agents' and humans' representations.
And just like with other cognitive technologies, early agents will need to evolve in diverse communities. So that's where all of you come in. But reliability isn't just about clicking in the same place every time. It's about understanding the larger goal. So to return to our big question: How do we make agents reliable?
Eventually, they're going to need models of our minds. So the next thing that we'll need to build is agents with models of our minds. But we don't actually build those directly. We need to set the preconditions for them to emerge. And this requires a common language for humans and computers.
And at this point, you know what this entails. Agents will need a model of our shared environment and interfaces that support intuitive interactions with us. These will enable humans and agents to reciprocally level up one another's intelligence. To advance the models, we will need human-agent interaction data. And to motivate people to use the agents in the first place, we'll need useful products.
The more useful the products become, the smarter we will all become. So this is how we can collectively build useful general intelligence. If you want to learn more about Nova Act, then stick around right here for the upcoming workshop, and thank you for your time.