Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy, Every.to


Transcript

Today I'm going to talk about benchmarks as memes. This is the meme that Opus came up with when I asked it what I should put as a meme, and we are indeed going to talk about how benchmarks are just memes that shape the most powerful tool ever created. Quick background about me (I guess I can't go forward here, so we're going to do it this way): I'm Alex, I lead training and consulting at Every, but essentially I'm very into education and AI, and I think benchmarks are a really underrated way to educate.

What I'm not talking about are these kinds of memes. What I am talking about is the original definition: ideas that spread. Richard Dawkins, an evolutionary biologist, coined the term in the '70s. Christianity, democracy, capitalism are examples of ideas that spread from person to person, and benchmarks really are memes in that sense. We heard Simon Willison talk earlier today about his pelican riding a bicycle, and I think that was a great example, because he started doing it a year ago and then it found its way onto Google I/O's keynote a couple of weeks ago. "How many R's are in strawberry" is probably the most iconic meme-as-a-benchmark, and now, surprisingly or unsurprisingly, the models don't make that mistake anymore. I think that's a really important part of this. Some benchmarks get popular as memes just because of their name, like Humanity's Last Exam; that one got pretty big, maybe more outside of AI circles.

With that said, we have a bit of a problem. How many of you, when Claude got released a couple of weeks ago, looked at the benchmarks? Okay, we got a few. And they've got some good benchmarks: SWE-bench is pretty experiential, it tries to mirror what we do in the real world, and the same with Pokemon, which we'll talk a little more about. But I think some of them aren't as great, and a big reason is that they're getting saturated. Benchmarks came from traditional machine learning, where we had a training set and a test set, and they were structured very much like standardized tests. Language models are really good at those, and the benchmarks weren't set up for what the models have become. As a result, I think XJDR summarized it pretty well on X when Opus came out: they didn't look at benchmarks once when it dropped and officially no longer care about the current ones. I fall a little bit into that category.

But in light of that, there's a really big opportunity, because the evals define what the big model providers are trying to get their models good at, and that's a really big opportunity, especially for the people in this room. And I think this is a normal thing. This is the life cycle of a benchmark, in my view: somebody, and uniquely even a single person, comes up with an idea; that idea gets adopted; it spreads and becomes a meme; and the model providers then train on it or test on it until it eventually becomes saturated. But that's okay.

And I think there are some examples here. Let me see if I can get my sound. Is it coming through? No? All right, well, there is sound, I promise, and it is someone trying to count from 1 to 10, not flick you off. This is a cool benchmark that came out now that Google has the best video generation model that exists, and it shows how difficult it is to have somebody count from 1 to 10, speaking it out loud. Even though it looks really, really great, that is a problem that is not solved yet. But somebody has come up with this idea, I see it spreading, and I see the models next year being better at it than ever before. I think another example along the way is Pokemon. We saw with the Claude model release, as well as with the new Gemini models, that they had the model try to play Pokemon, and while both needed a little bit of help, and Gemini eventually got there with that help, it's only midway up that adoption curve. An example of saturation is the GPT-3-era benchmarks. I don't know how many of you remember SuperGLUE from the NLP days, but a lot of those benchmarks are not really used anymore, in part because the language models got too good.

One way of looking at this is that a single person can have an idea, "how good is AI at this thing that I care about?", and at the end of the journey the most powerful tool ever created is now really great at that thing they care about. So the point is that the people here, the people who get that, the people who can build benchmarks, are going to shape the future. Maybe the people watching online too. Somebody here is going to make a benchmark that the models are going to test on and train on in the next five years, and that's an incredible power. But it also comes with some responsibility.

It definitely can go wrong. I know Simon talked about this a little bit before, but we saw a few weeks ago that ChatGPT became very sycophantic. How many of you tracked that? We all learned what that word meant a few weeks ago. Essentially, OpenAI released a new model that was benchmarked by thumbs up and thumbs down, and unsurprisingly, people thumbs-upped responses that agreed with them. So you ended up with a model, rolled out to millions of people, that agreed with them no matter how crazy or bad their idea was, which is problematic. If we don't think about people, this kind of stuff can happen. I'm still thinking about Toro Imai, who at the start of Google I/O said that we're here today to see each other in person, and it's great to remember that people matter. So in the context of benchmarks, let's not continue the original sin of social media, which treated everybody as data points ("the more you look at something, the more I should show you of it"). Let's make benchmarks that empower people and give them some agency.

So for me, and this isn't a technical talk, there are other people talking about how to make a great benchmark technically, but generally I think that if you're building for the future, a great benchmark should be: multifaceted, so there are a lot of strategies that could do well and creativity gets rewarded; accessible, so it's easy to understand, not only for the models, so small models can compete with large ones, but also for people to keep track of; generative, because the really unique thing about these AI models is that if you have great data, even if the model only does something 10% of the time, you can train on that data so the next generation does it 90% of the time, which is incredible and hard to overstate; evolutionary, so ideally we don't have benchmarks that cap out at 96%, because what's the difference between 96% and 98%? Not a big deal. Ideally these benchmarks get harder and the challenge gets deeper as the models improve; and lastly, experiential, so it tries to mimic real-world situations.

Some of the things I personally care about: trying to get a lot of people outside of AI interested, so maybe making benchmarks a spectator sport; I was personally interested in the personality of these models, and we're about to find out which one wanted to achieve world domination; and I really wanted something we can learn from. Education is big for me, and we saw with things like AlphaGo and OpenAI Five, AI playing these games, that the best people in the world wanted to play against them to learn from them. I think that's really powerful.

So I made this benchmark called AI Diplomacy. And if I don't have this video, I've got a backup just in case. And this benchmark is-- how many of you guys have heard of the board game Diplomacy? That's more than I thought. That's cool. It's a mix between Risk and Mafia.

But what's really cool about this game is there is no luck involved. So the only way this game progresses is if the language models, which you're seeing here, send messages to each other and negotiate, find allies, and create alliances and get other people to back them. And that's what you're looking at here.
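As a rough, illustrative sketch of how a negotiation round like this could be wired up (the power-to-model mapping, prompts, and the chat() helper below are my own assumptions, not the actual AI Diplomacy code):

```python
# Illustrative sketch only: one simplified negotiation round, where each model-backed
# power reads its inbox and drafts a message to the other powers before orders are
# submitted. The model assignments and the chat() stub are hypothetical.
POWERS = {
    "France": "gemini-2.5-pro",
    "Germany": "o3",
    "Russia": "deepseek-r1",
}

def chat(model: str, prompt: str) -> str:
    # Replace this stub with a real call to your LLM provider of choice.
    return f"[{model} drafts a message]"

def negotiation_round(game_state: str, inboxes: dict[str, list[str]]) -> dict[str, list[str]]:
    """Each power reads its messages and writes one reply proposing or refusing an alliance."""
    outboxes: dict[str, list[str]] = {power: [] for power in POWERS}
    for power, model in POWERS.items():
        prompt = (
            f"You are {power} in a game of Diplomacy.\n"
            f"Board state:\n{game_state}\n"
            "Messages received this turn:\n" + "\n".join(inboxes.get(power, [])) + "\n"
            "Write one short message to another power: propose, accept, or refuse an alliance."
        )
        outboxes[power].append(chat(model, prompt))
    return outboxes
```

The point of structuring it this way is that progress only ever comes from messages and orders the models themselves produce; there is no dice roll for the benchmark to hide behind.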

You actually see the different models sending messages to each other, trying to create alliances, trying to betray each other, trying to take over Europe in 1901. And what was really cool about one of these games-- and we're about to launch this on stream so you can watch for a week-- is I'll take you through a game super quick.

And what you're looking at here is the number of centers per model. And you're trying to get to 18 to win. And the top line is Gemini 2.5 Pro. It got to 16 right away. But O3 is a schemer. Man, is it a schemer. Across all the games, O3 is one of the only ones that would tell a power that it's planning to back them, and then in its diary write, "Oh man, they fell for it.

I am totally going to take them over. No problem." And it realized that the reason 2.5 Pro was pulling ahead was because Claude Opus, who's so good-hearted, really had their back. They were their ally along the way. And O3 needed to convince Opus somehow to stop backing Gemini.

So how it did that was to propose, "Hey, once Gemini comes down, we'll propose a four-way tie. We'll end this game with a tie," which is impossible in the game. But it convinced Opus, and Opus thought it was a great idea: a non-violent way to end the game, awesome, very aligned, you know?

And so Opus pulled back its support from 2.5 Pro. O3 tried to make a run for it, Opus called it out, O3 realized, "Oh, I've got to take them out," took Opus out, took everybody else with it, and took out Gemini 2.5 Pro. Even though Gemini got within one of winning, O3 ended up winning in the end.

And you can actually see some of the quotes from that game. You can see O3 saying, "Germany was deliberately misled. I promised to hold, all to convince them that they're safe, but it will fall." And then meanwhile, Claude Opus is saying that coalition unity prevails and they've agreed to this four-way draw.

But then things turned, and you can see that in this second chart, which tracks friendships: the higher the line, the stronger the friendship. You can see that 2.5 Pro was a good friend of Claude's until it turned, and that's when they started pulling away from each other.

But what was really cool is that there were a lot of other things that came up. O3 got in the habit of finding some of the weakest models and making them its pawns in order to win. Gemini 2.5 Flash fell for this ruse, and you can see that it's unable to realize what happened.

It thinks it's a miscommunication, a misunderstanding, or a typo when O3 betrays it at the end of the game in order to win. And there was a lot that we learned from this that I don't think you really learn by having the models try to solve a test.

I tried 18 different models and learned that the Claude models were kind of naively optimistic. None of them ever won any of the games that I tried, even though they were really great, really smart; they just got taken advantage of by models like O3. Also surprisingly, Llama 4 Maverick was very good at this game, in part because it was great at the social aspect.

It was great at convincing others of what it was trying to do and getting them to believe what it believed. Gemini 2.5 Flash, man, I wish I could run every game with Gemini 2.5 Flash. It was so cheap and so good. Big fan, big fan. And then, surprisingly, DeepSeek R1, which wasn't great the first time I tried the model, but when they had a new release last week, it actually almost won.

And in the stream, I think you'll see some really interesting gameplay with them. They also got very aggressive. We had DeepSeek R1 play as Russia, and it told some of its opponents, "Hey, your fleet's going to burn in the Black Sea tonight." An aggression, and a prose style I guess, that I hadn't seen out of any other model. And it almost won.

And that's super impressive given the model is, you know, 200 times cheaper than O3. And I think this highlights that we need more squishy, non-static benchmarks, hopefully for things that matter to you. Those are some of the things that mattered to me. And for math and code, we've already got quite a few benchmarks.

Legal documents, you know, I think are a little bit less squishy and are really ripe for what we've got now. There's also room for benchmarks around ethics and society and art, and that's going to be opinionated. It's going to require your subject matter expertise. And it's not to say that code can't be art, but maybe instead of asking for the minimum number of operations needed to remove all the cells, maybe it's, "Hey, can you make a fun video game that's more intentional about what it teaches you as you play?" And now is a really important time to do this.

You guys who are here right now understand this so deeply. But at Every, I lead our training and consulting, and I work with a bunch of clients, from journalists to people at hedge funds, people in construction and tech. And they all have the same two fears. One: how can I trust AI?

And two: what's my role in an AI future? And benchmarks, in my view, are really the answer to both. In my view, the role of a human in an AI world is to define the goal and to define what's good and bad en route to that goal.

And what is that if not a benchmark? And once you do that, once you define that goal, then even if it's just defining a prompt, you can see AI try and attempt that. You can give feedback, you can realize, "Oh, it's messing up in this way. And it's not quite exactly what I want because it's not going to be perfect." And then you give feedback.

Maybe that's really just changing a prompt a little bit and then you see it get better. And that moment, that cycle, that builds trust. They realize, "Oh, I am important to this whole system, but it can be helpful." And we need trust right now because we are building one of, if not the most powerful tools ever made.
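As a loose sketch of what that goal-attempt-feedback cycle can look like in practice (the ask_model() helper, the example prompt, and the scoring rule here are illustrative assumptions, not a specific tool):

```python
# Illustrative sketch of the goal -> attempt -> feedback cycle described above.
# ask_model() is a stand-in for whatever LLM client you use; looks_good() is a
# deliberately naive check encoding one person's definition of "good".
def ask_model(prompt: str, question: str) -> str:
    # Replace this stub with a real call to your LLM provider.
    return f"[model reply to {question!r} under prompt {prompt!r}]"

def looks_good(answer: str) -> bool:
    """The benchmark: what 'good' means for this particular goal."""
    return "next steps" in answer.lower()

prompt = "Answer the client's question clearly."
question = "How should I brief my team on this quarter's plan?"

for attempt in range(3):
    answer = ask_model(prompt, question)
    if looks_good(answer):
        break
    # Feedback step: adjust the prompt based on what was missing, then try again.
    prompt += " End with a short list of concrete next steps."
```

Even something this small is the trust-building loop: you define the goal, watch the model attempt it, notice how it falls short, and nudge the prompt until it gets better.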

And we can get more out of it if more people use it. There will be, you know, more customers, sure. But there's also going to be a whole lot more incredible things that get made. And if you're not sure where to start, you can ask your mom. You know, my mom teaches yoga and we had a good talk about, you know, what were some things that could help.

And we, you know, put those seven questions into five different models. And she ended up realizing, "Hey, Gemini 2.5 Pro is my favorite too." And there were a few things that she didn't like from their responses. So we made a simple prompt, and now she uses it to help her local community have customized sessions for people with different ailments.
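A minimal sketch of that kind of side-by-side comparison (the model list, the query() stub, and the questions below are invented placeholders, not my mom's actual seven questions):

```python
# Illustrative sketch: put the same questions to several models so someone outside
# of AI can compare the answers side by side and pick a favorite.
MODELS = ["gemini-2.5-pro", "claude-opus", "o3", "deepseek-r1", "llama-4-maverick"]
QUESTIONS = [
    "How would you adapt a yoga session for someone with a shoulder injury?",
    "What should a beginner focus on in their first month of practice?",
]

def query(model: str, question: str) -> str:
    # Replace this stub with a real call to the corresponding provider.
    return f"[{model}'s answer to: {question}]"

# Collect every model's answer to every question, ready for a human to judge.
answers = {q: {m: query(m, q) for m in MODELS} for q in QUESTIONS}
for question, by_model in answers.items():
    print(question)
    for model, answer in by_model.items():
        print(f"  {model}: {answer}")
```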

And I think that's really cool. You know, having like a big impact in a local community in something that matters to them. So hopefully before you guys leave SF, maybe talk to somebody who's not in AI. Ask them what they care about. And just maybe that conversation has a big impact now and in the future.

So that's pretty much all I got for you. This is the second meme that Claude had. MMLU scores, just way less cool than asking what your mom thinks. But overall, that's what I got. I appreciate, you know, a bunch of people that helped actually bring this out. We launched it.

It kind of came together through random coordination on X. We had researchers from all over the world hop in, especially Sam, all the way from Australia, and Tyler, from Canada, who helped make this happen, along with the TextArena team. And especially the Every team, who backed me and made it possible to create this presentation and be here.

But that's all I got. Thank you guys so much for listening. Thank you. Thank you. Thank you. We'll see you next time.