
Measuring AGI: Interactive Reasoning Benchmarks for ARC-AGI-3 — Greg Kamradt, ARC Prize Foundation


Transcript

*Music Playing* Today we are going to talk about why AI benchmarking is about to get a lot more fun. But, before we do, we need to go over some cool demos here. So, I love the Claude Plays Pokémon demo. There's something really special about seeing this little embodied agent make its own decisions and go play Pokémon, a game from our childhood, for us.

Now, OpenAI got in on the same game. I thought this was awesome. And then, just the other day, Gemini beat Pokémon. I like seeing the little robots with AGI on top of their head. That must mean that we're already there, right? Well, if we have these agents playing Pokémon and it's already doing it and beating it, that means the game's over, right?

We're all done? Well, not quite. Because, with Claude Plays Pokémon, we saw that it would get stuck in the same place for three days. It would need interventions. It would hallucinate different actions. And not only that, there was a ton of Pokémon training data within the model itself.

So, although this is a really cool example of an agent exploring a world, there are a lot of things that we can improve on. So, as Kyle was saying, my name is Greg Kamradt, President of the ARC Prize Foundation. We are a non-profit with the mission to be a North Star guide towards open AGI.

We were founded last year by Francois Chollet and Mike Knoop. And, just last December, we were invited by OpenAI to join them on their live stream to co-announce the o3 preview results on ARC-AGI. Now, there are a lot of AI benchmarking companies out there, but we take a very opinionated approach as to how we should do this.

And our opinion is that the best target we should be aiming for is actually humans. And the reason why we think that is because we see that humans are the one proof point of general intelligence that we know about. And if we use humans as the target, that does two things for us.

What we do is come up with problems that are feasible for humans but hard for AI. Doing that gives us two things. Number one, it creates a gap. And when you have that gap, you can start to measure: how many problems can we come up with that humans can still do but AI can't?

And then number two is it guides research. So, you can quantify that class of problems and then go tell researchers, "Hey, there's something really interesting going on on this side of the problem. There's something that we need to go check out from there." Alright? So, if we're going to measure artificial general intelligence based off of humans, we need to actually define, well, what is general intelligence?

And there's two definitions that I love to quote. The first one was by John McCarthy. And he says that AI is the science and engineering of making machines do tasks, and this is the important part, that they have never seen beforehand and they have not prepared for beforehand. This is very important because if you've seen a class of problem beforehand, if it's already in your training data, then you're simply just repeating memorization.

You're not actually learning anything new on the fly, right? The second person I'd like to quote on this is actually Francois himself. And he put it very eloquently in just three words: he calls intelligence skill-acquisition efficiency. And this is really beautiful, because skill acquisition asks: can you learn new things?

And not only that, but how efficiently can you learn those new things? And humans are, spoiler, extremely efficient at learning these new things. So Francois proposed this definition in his 2019 paper, "On the Measure of Intelligence," but he went further than that. He didn't just define it.

He actually proposed a benchmark to see whether a human or an AI can learn something new and then go repeat what it learned. And this is where the ARC-AGI version 1 benchmark came from. So over here on the left-hand side, this is the learn-the-skill portion.

This is what we call the training portion. And what we show you is a transformation from an input to an output grid. And then the goal for the human or the AI is to look at it and say, hmm, what's going on here? And then on the right, we actually ask you to demonstrate that skill.

So it's a little mini skill you learn on the left and we ask you to demonstrate it on the right. And if you can successfully do it, and this is what it looks like, it's just the grid editor here, then yes, you've learned what the transformation is and you've actually applied this.

And so you're showing a non-zero level of generalization as you go through this. So our benchmark ARC-AGI-2, this is the most recent one, has over a thousand tasks in it. And the important part here is that each one of these tasks is novel and unique. And what I mean by that is, the skills required for one of them, we will never ask you to apply that same skill to another task.
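
To make that concrete, here is a rough sketch of what a single ARC task looks like as data. The grids and the toy "mirror each row" rule below are invented for illustration; the real tasks share the same train/test structure, with grids of integers 0 through 9, but the transformation itself is exactly what the solver has to discover.

```python
# A single ARC-style task: a few demonstration pairs to learn the
# transformation from, and a test input to apply it to.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must produce the output grid
    ],
}

def solve(grid):
    """Toy rule for this toy task: mirror each row (swap the columns)."""
    return [list(reversed(row)) for row in grid]

# Check the hypothesised rule against every demonstration pair,
# then apply it to the test input.
assert all(solve(pair["input"]) == pair["output"] for pair in task["train"])
print(solve(task["test"][0]["input"]))  # -> [[0, 3], [3, 0]]
```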

This is very important because we're not testing whether or not you can just repeat the skill you've already learned, but we want to test all the little mini skills that you can do over time and see if you can actually demonstrate those. And if we're going to back up that humans can actually do this, well, we need to go get first party data.

So our group as a nonprofit, we went down to San Diego and we tested over 400 people. We rented a bunch of computers and did this in person to preserve data privacy. And we made sure that every single task included in ARC-AGI was solvable by people.

So this isn't just an aim here; we're actually doing the work to back it up. But if we think about it, there's actually quite a bit of human-like intelligence that's out of scope for what we call a single-turn type of benchmark. With ARC-AGI, you have all the information you need presented right at test time.

You don't need to do any exploring or anything, and it's all single turn. So I would argue that if you are going to measure human-like intelligence, it needs to be interactive by design. What you need is to be able to test the ability of an agent, whether biological or artificial, to explore an open world, understand what goals it needs to pursue, and ultimately look at the rewards and go from there.

So this is actually very much in line with what Rich Sutton just published in his paper, "Welcome to the Era of Experience." And he argues that if we want agents that will be readily adaptable to the human world, they need to engage with the open world, they need to collect observational data, and they need to be able to take that data to build a world model, make their own rules, and really understand what it is.

Or else you're just going to have the human data ceiling going forward from here. If we're going to be able to build this, we're going to need a new type of benchmark that gets out of the single turn realm, and this is where interactive reasoning benchmarks are going to come in.

Now an interactive reasoning benchmark is a benchmark where you have a controlled environment, you have defined rules, and you may have sparse rewards, where an agent needs to navigate and understand what is going on in order to explore and complete the objective. Now there's an open question: all right, if our aim is interactive reasoning benchmarks, what is the medium in which we're actually going to execute these benchmarks?

And it turns out that games are actually quite suitable for interactive reasoning benchmarks. The reason is that games sit at a unique intersection of complex rules and defined scope, and you have a lot of flexibility in creating these types of environments that you can then put different artificial or biological systems into.
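
To make the "controlled environment, hidden rules, sparse reward" framing concrete, here is a minimal sketch of the interaction loop such a benchmark implies. The toy environment, its hidden rule, and the random-explorer agent are all invented for illustration; nothing here is the actual ARC-AGI-3 interface.

```python
import random

class ToyEnvironment:
    """Stand-in for one interactive task: the agent only sees an observation
    and a done flag; the rule (reach cell 9) is never stated."""
    def reset(self):
        self.position = 0
        return self.position                       # the observation

    def step(self, action):                        # action is -1 or +1
        self.position = max(0, min(9, self.position + action))
        return self.position, self.position == 9   # sparse signal: only at the goal

def play(env, choose_action, max_actions=1000):
    """Generic loop: act, observe, repeat. Returns the number of actions
    needed, which is the raw signal behind an efficiency metric."""
    observation = env.reset()
    for n in range(1, max_actions + 1):
        observation, done = env.step(choose_action(observation))
        if done:
            return n
    return None                                    # never solved within the budget

random.seed(0)
# A random explorer eventually stumbles on the goal, but inefficiently.
print(play(ToyEnvironment(), lambda obs: random.choice([-1, 1])))
```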

Now, I know what you may be asking here: wait, Greg, didn't we already do games? Didn't we do this 10 years ago? We already went through the Atari phase. Well, yes, we did, but there were actually a huge number of issues with what was going on during that era, not least all the dense rewards that come with the Atari games.

There was a ton of irregular reporting, so everybody would report their own performance on different scales and it was tough to compare models. There was no hidden test set, and then one of my biggest gripes with the Atari phase was that all the developers already knew what the Atari games were.

So they were able to inject their own developer intelligence into the models themselves, and then all of a sudden the intelligence behind the performance, well, that's borrowed from the developer. That's not actually coming from the model itself. So if we were able to create a benchmark that overcame these shortcomings, well, then we'd be able to make a capabilities assertion about the model that beat it that we've never been able to make before.

And so to put it another way that's a bit more visual: we know that AI can beat one game. This is proven. AI can beat chess. AI can beat Go. We've seen this many, many times. And we know that AI can beat 50 known games with unlimited compute and unlimited training data.

We've seen this happen with Agent57 and MuZero. But the assertion that we want to make is: well, what if AI beat 100 games that the system has never seen beforehand and the developer has never seen beforehand either? If we were able to successfully put AI to this test, then we could make a capabilities assertion about that AI that we don't currently have in the market right now.

And I'm excited to say that that's exactly what ARC Prize is going to build. So this is going to be our version three benchmark. Today is a sneak preview of what that's going to look like. And this is going to be our first interactive reasoning benchmark from ARC.

And I want to jump into three reasons why it's very unique. The first one is, much like our current benchmark, we're going to have a public training set and a public evaluation set. The public training set will have, call it, on the order of about 40 different novel games.

This will be where the developer and the AI can understand the interface and understand kind of what's going on here. But all performance reporting will happen on the private evaluation set. And this is very important because on this private evaluation set, there's no internet access allowed. So no data is getting out about this.

The scores that come out of the private evaluation set will have been achieved by an AI that has never seen these games before, and neither has the developer. So we can authoritatively say that this AI has generalized to these open domains. Now, the second important point about ARC-AGI-3 is that it's going to force understanding through exploration.

One of my other gripes with current game benchmarks out there is you give a lot of instruction to the actual AI itself. It's like, hey, you're in a racing game, or hey, you're in an FPS, go control the mouse and do all these things. We're going to drop AI and humans into this world, and they won't know what's going on until they start exploring.

So even as I look at this screenshot, this is actually one of our first games, we call it Locksmith. We give all of our games a cool little name like that. As I look at this, I don't know what's going on, right? But I start to explore and I start to understand, oh, there's certain things I need to pick up.

There may be walls, there may be goals and objectives. I'm not sure what those goals and objectives are right when I first start, but that's the point. So not only are we going to ask humans to explore and make up their own rules as to understand how to do the game, but we're going to require the same thing for AI as well.

And that's something that we're not currently seeing from the reasoning models that we have today. Now, the third key point is that we're going to require core knowledge priors only. This is something that we carry over from ARC-AGI-1 and 2 as well. What this means is, you'll notice in the ARC tasks there's no language, there's no text involved, there are no symbols, and we're not asking you any trivia.

With these other benchmarks that rely on those things, sometimes we try to make the hardest problems possible. We go hire the best people in the world, and I call those PhD++ problems, right? And that's great, but AI is already superhuman; it's way smarter than me in a lot of different domains.

We take the alternative approach, which is to look more at the floor and at the reliability side. Let's take anything outside of core knowledge and strip it away. There are four core knowledge priors, and these are things that we as humans are either born with or hardwired to acquire immediately after birth.

The first is basic math, meaning counting up to 10. The second is basic geometry, so understanding different shapes and topology. The third is agentness, which is a basic theory of mind: understanding that there are other types of agents out there in the world that are acting and interacting. And the fourth is objectness. So as we create our benchmark, these are the four priors we go after when we try to test the abstraction and reasoning piece.

Now, I was reading the recent Dwarkesh essay, and he actually put it really well in one of his paragraphs. He was talking about one of the reasons why humans are great, and he says it's their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.

We don't yet have this type of environment that can test this from a benchmark perspective for AI, and this is exactly what ARC-AGI-3 is going to build. So before we wrap up here, I want to talk about how we're going to evaluate AI. Because it's like, okay, cool, they go play the game.

Well, what does it mean? How do you know if it's doing well or it's not? And we're going to bring it back to Francois' definition. So we're going to bring it back to skill acquisition efficiency. And we're going to use humans, which again, is our only proof point of general intelligence.

We're going to use humans as the baseline. So we're going to go and test hundreds of humans on these exact Arc tasks. And we're going to measure how long does it take them? How many actions does it take them to complete the game? And then we're going to get a human baseline and we're going to be able to measure AI in the same exact way.

So can the AI explore the environment, intuit about it, create its own goals, and complete the objectives faster than humans? Well, if it cannot, I would go as far as to assert that we do not yet have AGI. And as long as we can come up with problems that humans can still do but machines cannot, I would again assert that we do not have AGI.
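
One way to picture that comparison, purely as an illustrative sketch (the talk does not specify the actual ARC-AGI-3 scoring rule), is to collect a human baseline of action counts per game and score an AI's action count against it:

```python
def efficiency_score(ai_actions, human_action_counts):
    """Hypothetical score: 1.0 means roughly as action-efficient as the
    median human tester on this game; above 1.0 means more efficient."""
    median_human = sorted(human_action_counts)[len(human_action_counts) // 2]
    if ai_actions is None:          # the AI never completed the game
        return 0.0
    return median_human / ai_actions

# e.g. humans needed 120, 150, and 200 actions; the AI needed 300
print(efficiency_score(300, [120, 150, 200]))   # -> 0.5, half as efficient
```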

So we're going to be looking at skill-acquisition efficiency as our main output metric here. Today, we're giving a sneak peek at what this looks like. At World's Fair, actually, next month in San Francisco, we're going to give a sandbox preview. So we're going to release five games.

We know better than to try to wait till the end. We're going to make contact with reality. We're going to put out these five games. We're actually going to host a mini agent competition too. We want to see the best possible agent that people can build. We'll put up a little prize money.

And then we're going to look forward to launching about 120 games. That's the goal by Q1 of 2026. Now, that sounds like it's not that many games and you think it's not that many data points. But the richness of each one of these games goes really, really deep. There's multiple levels.

It goes deep with each one of them. And it's quite the operational challenge to make all of these. That's a whole other side of the benchmarking process, which I'm happy to talk about later. If this mission resonates with you, again, ARC Prize is a nonprofit, and one of the best ways to get involved is by making a direct, tax-deductible donation.

If anybody in the room knows any philanthropic donors, whether LPs or individuals, I'd absolutely love to talk to them. But we're also looking for adversarial testers. We want to pressure test ARC-AGI-3 as best we can. So if there's anybody who's interested in participating in the agent competition, whether it's online or offline, let me know.

Happy to chat. And then also, kind of a cool story: we originally started with Unity to try to make these games. And we quickly found out that Unity was way overkill for what we needed if you're just doing 2D, 64x64 games. So we're actually making a very lightweight Python engine ourselves.
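
For a sense of why a full game engine is overkill here, a minimal sketch of a turn-based grid game in plain Python follows. This is not the ARC Prize engine, just an invented example of the general shape: a small integer grid, discrete actions, and no real-time loop.

```python
import numpy as np

class GridGame:
    """Toy turn-based game on a 64x64 integer grid: move the player to the goal."""
    SIZE = 64
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self):
        self.player = (0, 0)
        self.goal = (self.SIZE - 1, self.SIZE - 1)

    def step(self, move):
        """Apply one discrete action; purely turn-based, no wall clock."""
        dr, dc = self.MOVES[move]
        r = min(self.SIZE - 1, max(0, self.player[0] + dr))
        c = min(self.SIZE - 1, max(0, self.player[1] + dc))
        self.player = (r, c)
        return self.render(), self.player == self.goal

    def render(self):
        """The observation is just the raw grid: no textures, physics, or 3D."""
        frame = np.zeros((self.SIZE, self.SIZE), dtype=np.uint8)
        frame[self.player] = 1
        frame[self.goal] = 2
        return frame
```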

So if there's any game developers out there, anybody who wants to get involved with this and knows Python well, we're looking for game developers and game designers as well. And that is all we have today. Thank you very much. Kyle, do we have-- Yeah, I think we have time in this case for a couple of questions.

If anyone wants to come up, there's microphones, one, two, three of them. Maybe a couple of questions. I'm going to kick that off. Question for you: it's famously very hard to make estimates about timelines, but if you had to guess, how long do you think this new version of the benchmark you're making will take before it gets saturated?

Well, the way I think about that is, I would say I'm counting in years, not decades. We'll put it that way. Okay, interesting. All right. Yeah, we'll take one at each mic. Looks like it's well distributed. So, starting over here. Sure. Hi. You mentioned efficiency as part of the requirements.

And so I'm wondering for the benchmarks if you're considering things like wattage or time or other ways of using that as one of the criteria. Yeah, I love that question. And I would have put it in if I had more time, but I'm very opinionated about efficiency for measuring AI systems.

If I could have two denominators for intelligence on the output, number one would be energy, because we know how much energy the human brain takes, and that's our proof point of general intelligence. You can take how many calories the human brain uses. So I'd love to use energy.

But the number two denominator is the amount of training data that you need. Neither of these is very accessible for closed models today. So we use proxies, and the proxy is cost. But then with interactive evals like this, you get another proxy, which is action count.

And how long does it take you to actually do it? It's going to be turn-based, so we're not going to have a wall clock within these games. Awesome. All right. Question two, and then we'll do three. Please keep them both very short.

Yeah. Very quick question. Could you define more what you mean by objectness? Yes. That one's actually quite simple. It's just understanding that when you look out into the world, there are masses of things that may act together. So the crude way to put it: you have one pixel, but it's surrounded by a whole bunch of other pixels and they all move together.

You understand all those pixels as one, acting as a single body rather than as individuals. And evolutionarily, it's the same thing: that's a tree over there, and all of this is part of the same tree, that kind of thing. Final question. I'll keep this one super short.

In the games that you guys are developing, how do you distinguish between tasks that humans cannot do and that an AGI also cannot do? Like, what is the north star there? It's a good question. Tasks that humans cannot do are a bit out of scope for our thesis on how we want to drive towards AGI.

So I would say that's not really the aim that we're looking for. That's a whole different conversation around superintelligence, and maybe that's for another time. Thank you. Thank you. Transcription by CastingWords