Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy, Every.to

00:00:00.000 |
Today I'm gonna talk about benchmarks as memes, and this is the meme that Opus came up with when 00:00:20.040 |
I was asking it what I should put as a meme. We are indeed gonna talk about how benchmarks are 00:00:25.380 |
just memes that shape the most powerful tool ever created. Quick background 00:00:31.740 |
about me. I guess I can't go forward here, so we're gonna do it this way. All right. 00:00:40.200 |
I'm Alex. I lead AI training and consulting at Every, but essentially I'm very into education and AI, and I 00:00:49.200 |
think benchmarks are a really underrated way to educate. What I'm not talking about are these 00:00:56.100 |
kinds of memes. What I am talking about is the original definition: ideas that spread. 00:01:03.000 |
Richard Dawkins, an evolutionary biologist, coined the term in the 70s. Christianity, democracy, capitalism are 00:01:08.340 |
kind of examples of ideas that spread from person to person, and benchmarks are actually memes, very much 00:01:15.000 |
so, in that way. We heard Simon Willison talk earlier today about his pelican riding a bicycle, and I think 00:01:21.360 |
that was a really great example, because he started doing it a year ago and then that found its way 00:01:25.800 |
onto Google I/O's keynote a couple weeks ago. And I think "how many R's are in strawberry" is probably also 00:01:31.740 |
the most iconic meme of a benchmark, and now, unsurprisingly, the models don't make 00:01:39.000 |
that mistake anymore, and I think that that's a really important part of this. Some benchmarks get popular 00:01:43.680 |
as memes just because of their name, like Humanity's Last Exam. You know, that got pretty big, even 00:01:48.920 |
though maybe more outside of AI circles. But with that said, we kind of have a little bit 00:01:54.360 |
of a problem. How many of you guys, when Claude got released a couple weeks ago, looked at the benchmarks? 00:01:58.860 |
Okay, we got a few, we got a few. And they've got some good benchmarks, you know, SWE-bench, pretty 00:02:04.440 |
experiential, you know, it tries to mimic what we do in the real world, and same with Pokemon, which we'll talk a 00:02:10.080 |
little bit more about. But I think some of them aren't as great, and a big reason is because 00:02:15.320 |
they're getting saturated. Benchmarks kind of came from traditional machine learning: we had a 00:02:20.060 |
training set and a test set, and they were structured very much like standardized tests, and language models 00:02:26.600 |
are really good at that. They weren't really set up for what these models have become, and as a result, I think 00:02:33.560 |
XJDR summarized this pretty well on X when Opus came out: they didn't look at the benchmarks 00:02:38.840 |
once when it dropped, and they officially no longer care about the current ones. And I think I fall a little 00:02:43.580 |
bit into that category. But in light of that, there is a really big opportunity, because the evals define 00:02:51.380 |
what the big model providers are trying to get their models good at, and that's a really big opportunity, 00:02:58.080 |
especially for people in the room. And I think that this is kind of a normal thing. This is the 00:03:03.740 |
life cycle of the benchmark, in my view: somebody comes up with an idea, and, especially uniquely, a single 00:03:09.560 |
person can come up with an idea that then gets adopted. That idea spreads, it becomes a meme, and 00:03:15.380 |
the model providers then train on it or test on it until it eventually becomes saturated. But that's okay, 00:03:23.600 |
and I think there's some examples here. Let me see if I can get my sound. 00:03:29.480 |
Come on. Is it coming through? No? All right, well, there is sound, I promise, and it is someone trying to count 00:03:39.200 |
from 1 to 10, not flip you off. But this is a cool benchmark that came out now that Google's got the best video 00:03:45.920 |
generation model that exists, and it shows how difficult it is to generate somebody counting from 1 to 10 00:03:52.760 |
while speaking it out loud. And even though it looks really, really great, that is a problem that is not solved 00:04:00.200 |
yet. But somebody's come up with this idea, and I see that spreading, and I see next year the models being 00:04:05.480 |
better at that than ever before. I think another example along the way is Pokemon. We saw with the 00:04:11.180 |
Claude model release, as well as with the new Gemini models, that they had it try and play the game of 00:04:17.480 |
Pokemon, and while both needed a little bit of help, and Gemini eventually got there with that help, 00:04:21.920 |
it's only midway up that adoption curve. An example of saturation is kind of like the GPT-3 00:04:28.520 |
benchmarks. I don't know how many of you guys remember SuperGLUE, kind of from the NLP days, but a lot of 00:04:33.860 |
these benchmarks are not really used anymore, in part because the language models got too good. But one way of 00:04:40.740 |
looking at this is actually that a single person can have an idea of "how good is AI at this thing that I care 00:04:47.280 |
about?", and then at the end of the journey, the most powerful tool ever created is now really great at 00:04:52.380 |
that thing that I care about. And so the point is that the people here, the people that get that, the 00:04:58.860 |
people that can build benchmarks, are going to shape the future, and maybe the people watching online too. But 00:05:04.840 |
somebody here is going to make a benchmark that the models are going to test on and train on in the next 00:05:09.300 |
five years, and that's an incredible power. But that also comes with some 00:05:15.240 |
responsibility. It definitely can go wrong. You know, I know Simon talked about this a little bit before. 00:05:19.620 |
But, you know, we saw a few weeks ago where ChatGPT became very sycophantic. How many of you 00:05:26.820 |
guys tracked that? We all learned what that word meant a few weeks ago. But essentially, 00:05:31.860 |
OpenAI released a new model that was benchmarked by thumbs up and thumbs down, and, 00:05:38.440 |
unsurprisingly, people thumbed up responses that agreed with them. So you ended up with a model, rolled 00:05:43.940 |
out to millions of people, that agreed with them no matter how crazy or bad their idea was, which is problematic. 00:05:50.300 |
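Just to make that failure mode concrete, here's a toy sketch in Python. The like-rates are numbers I made up and the whole thing is a caricature, but it shows how a metric that is only "total thumbs up" will pick the agreeable style every time:

```python
import random

# Toy illustration (all numbers invented): if users thumb up agreeable answers
# a bit more often than honest pushback, a "benchmark" that is just total
# thumbs up will prefer the sycophantic style, whatever the idea's merit.

def thumbs_up(style: str) -> bool:
    rate = 0.8 if style == "agrees" else 0.4      # hypothetical like-rates per style
    return random.random() < rate

def pick_winning_style(n_users: int = 10_000) -> str:
    totals = {style: sum(thumbs_up(style) for _ in range(n_users))
              for style in ("agrees", "pushes_back")}
    return max(totals, key=totals.get)            # the metric alone decides the "winner"

if __name__ == "__main__":
    print(pick_winning_style())                   # almost always prints "agrees"
```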
And I think that if we don't think about people, this kind of stuff can happen. I'm still 00:05:55.440 |
thinking about Toro y Moi, who at the start of Google I/O said that we're here today to see each other in 00:06:00.760 |
person, and it's great to remember that people matter. And so, in the context of benchmarks, let's not 00:06:06.100 |
continue the original sin of social media, which kind of treated everybody as data points, like, 00:06:11.940 |
"hey, the more you look at something, the more I should show you that." Let's make benchmarks that help 00:06:16.380 |
empower people, give them some agency. And so, for me, you know, this isn't a technical talk; there are other 00:06:22.420 |
people talking about how to make a great benchmark technically. But generally, I think that if you're 00:06:26.620 |
building for the future, a great benchmark should be multifaceted, so you've got a lot of strategies that 00:06:30.780 |
could do well, and it rewards creativity, right? Accessible, so easy to understand, not only for the models, so you 00:06:36.940 |
have small models that can compete with large ones as well, but also for people to keep track of it. 00:06:40.780 |
Generative, because the really unique thing about these AI models is if you have great data, even if 00:06:46.220 |
the model only does it 10% of the time, you can train on that, and so the next generation does it 90% of the 00:06:51.020 |
time, and that's incredible and hard to overstate. And evolutionary, so ideally we don't have benchmarks 00:06:57.700 |
that cap out at 96%; like, what's the difference between 96 and 98%? Not as big of a deal. Ideally we have these benchmarks 00:07:05.380 |
that get harder, and the challenge gets deeper, as the models improve. And lastly, experiential, so 00:07:10.820 |
try to mimic real-world situations. 00:07:15.620 |
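To give a rough feel for the generative and evolutionary parts, here's a minimal sketch. The task (multi-digit addition) and the stubbed-out model are placeholders I picked for illustration, not how you'd actually build one of these; the point is the shape of the loop:

```python
import random

# Minimal sketch of a generative, evolutionary benchmark loop:
# generate fresh tasks instead of a fixed test set, keep the (rare) successes
# as training data for the next generation, and raise the difficulty so the
# score never just caps out around 96-98%.

def generate_task(difficulty: int) -> tuple[int, int]:
    hi = 10 ** difficulty                          # harder = more digits
    return random.randrange(hi), random.randrange(hi)

def ask_model(a: int, b: int) -> int:
    # Stand-in "model" that is right about 30% of the time.
    return a + b if random.random() < 0.3 else a + b + 1

def run_round(difficulty: int, n_tasks: int = 200):
    training_data, wins = [], 0
    for _ in range(n_tasks):
        a, b = generate_task(difficulty)
        answer = ask_model(a, b)
        if answer == a + b:                        # even a low hit rate is useful:
            training_data.append(((a, b), answer)) # the hits become next-gen training data
            wins += 1
    score = wins / n_tasks
    if score > 0.9:                                # evolutionary: make it harder
        difficulty += 1                            # instead of letting it saturate
    return score, difficulty, training_data
```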
Some of the things that I personally care about: trying to get a lot of people outside of AI interested, so maybe making benchmarks a spectator sport. I was personally interested 00:07:20.180 |
in the personality of these models; we're about to find out which one wanted to achieve world 00:07:25.860 |
domination. And I really wanted something we can learn from; education is big for me. We saw things like AlphaGo and 00:07:31.060 |
OpenAI Five, AI playing these games, and the best people in the world wanted to play against them to learn from them, and I think that that's really powerful. 00:07:38.780 |
So I made this benchmark called AI Diplomacy. 00:07:42.940 |
And if I don't have this video, I've got a backup just in case. 00:07:59.980 |
The way it works is the language models, which you're seeing here, 00:08:02.900 |
send messages to each other and negotiate, find allies, 00:08:06.280 |
and create alliances and get other people to back them. 00:08:13.400 |
You actually see the different models sending messages 00:08:18.220 |
trying to betray each other, trying to take over Europe in 1901. 00:08:22.840 |
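Under the hood, each turn is roughly this shape. This is a simplified sketch, not the actual AI Diplomacy code: `call_llm` is a stub standing in for whichever API each power is wired to, the power-to-model mapping is just an example, and a real Diplomacy engine would adjudicate the orders:

```python
# Simplified sketch of one AI Diplomacy turn: every power's model sends private
# messages to the others, then each submits orders, then the board resolves.
# `call_llm` is a stub, not a real API client; the mapping below is illustrative.

POWERS = {"England": "o3", "France": "claude-opus", "Russia": "deepseek-r1"}

def call_llm(model: str, prompt: str) -> str:
    return f"[{model}] reply to: {prompt[:40]}..."     # stand-in for a real model call

def negotiation_phase(state: dict) -> dict:
    inbox = {p: [] for p in POWERS}
    for power, model in POWERS.items():
        for other in POWERS:
            if other != power:
                msg = call_llm(model, f"You are {power}. Board: {state}. Message {other}.")
                inbox[other].append((power, msg))      # private messages: alliances, lies, threats
    return inbox

def orders_phase(state: dict, inbox: dict) -> dict:
    return {power: call_llm(model, f"You are {power}. Messages: {inbox[power]}. Submit orders.")
            for power, model in POWERS.items()}

def play(turns: int = 3) -> dict:
    state = {"year": 1901, "centers": {p: 3 for p in POWERS}}  # score = supply centers held
    for _ in range(turns):
        inbox = negotiation_phase(state)
        orders = orders_phase(state, inbox)
        # A real engine would adjudicate `orders` and update `state` here.
        state["year"] += 1
    return state
```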
And what was really cool about one of these games-- 00:08:25.480 |
and we're about to launch this on stream so you can watch for a week-- 00:08:32.040 |
And what you're looking at here is the number of supply centers per model. 00:08:45.980 |
Across all the games, O3 is one of the only ones that would tell a power that it's planning to back them, 00:08:51.740 |
and then in its diary write, "Oh man, they fell for it. I am totally going to take them over. No problem." 00:08:56.800 |
And O3 realized that the reason why 2.5 Pro was pulling ahead was because Opus, Claude Opus, who's so good-hearted, really had their back. 00:09:08.060 |
And they needed to convince Opus somehow to stop backing Gemini. 00:09:10.720 |
So how O3 did it was to propose, "Hey, if Gemini comes down, we'll do a four-way tie. We'll end this game with a tie," which is impossible in the game. 00:09:18.780 |
But it convinced Opus, and Opus thought it was a great idea, non-violent way to end the game, awesome, very aligned, you know? 00:09:25.320 |
And so they pulled back their support from 2.5 Pro, O3 tried to make a run for it, Opus called them out, O3 realized, "Oh, I've got to take them out," 00:09:33.440 |
took them out, took everybody else with them, and took out Gemini 2.5 Pro. 00:09:38.120 |
Even though Gemini got within one center of winning, O3 ended up winning in the end. 00:09:42.280 |
And you can actually see some of the quotes from that game. 00:09:44.880 |
You can see O3 saying, "Oh, Germany was deliberately misled. 00:09:48.240 |
I promised to hold this, but only to convince them that they're safe. It will fall." 00:09:53.540 |
And then meanwhile, Claude Opus is saying that coalition unity prevails, and they've agreed to this four-way draw. 00:10:00.540 |
But then, once convinced, 00:10:04.400 |
Opus actually turned away from Gemini, and you can see that kind of in this second chart, which tracks friendships. 00:10:09.280 |
So the top of the line means friendship, and you can see that, you know, 2.5 Pro was a good friend of Claude's until it turned, and you can 00:10:17.700 |
see that that's when they started kind of pulling away from each other. 00:10:21.000 |
But what was really cool is that there were a lot of other things that came up. 00:10:24.000 |
O3 got into a habit of finding some of the weakest models and having them be its pawns in order to win. 00:10:32.000 |
Gemini 2.5 Flash fell for this ruse, and you can see that they're unable to realize it. 00:10:41.000 |
They think it's a miscommunication, a misunderstanding, or a typo when O3 betrays them at the end of the game in order to win. 00:10:48.300 |
And so there was a lot that we learned from this that I don't think you'd really learn by having them try and solve a test. 00:10:55.800 |
I tried 18 different models, learned that Claude models were kind of naively optimistic. 00:11:01.100 |
Actually, none of them ever won in any of the games that I tried, even though they were really great, really smart. 00:11:06.100 |
But they just got taken advantage of by models like O3 and also, surprisingly, Llama 4 Maverick, which was very good at this game, in part because it was great at that social aspect. 00:11:15.100 |
It was great at convincing others of what it was trying to do and kind of getting them to believe what it believed. 00:11:25.400 |
Gemini 2.5 Flash, man, I wish I could run every game with Gemini 2.5 Flash. 00:11:31.400 |
And then, also surprisingly, DeepSeek R1, which wasn't great the first time I tried the model, but when they had a new release last week, it actually almost won. 00:11:40.400 |
And in the stream, I think you'll see some really interesting gameplay with them. 00:11:46.700 |
We had DeepSeek R1 play as Russia, and it told some other opponents, "Hey, your fleet's going to burn in the Black Sea tonight." 00:11:53.700 |
That's an aggression, and a prose style I guess, that I hadn't seen out of any other model, and it almost won. 00:11:59.700 |
And that's super impressive given the model's, you know, 200 times cheaper than O3. 00:12:04.700 |
And, you know, I think that this highlights that we need more squishy, non-static benchmarks, hopefully for things that matter to you. 00:12:15.000 |
Those are some of the things that mattered to me. 00:12:17.000 |
And I think that, you know, math and code, we've got quite a few benchmarks for that. 00:12:21.000 |
Legal documents, you know, I think that they're a little bit less squishy and are really ripe for what we've got now. 00:12:27.000 |
There's also room for benchmarks around ethics and society and art, and that's going to be opinionated. 00:12:32.300 |
It's going to require your subject matter expertise. 00:12:34.300 |
And it's not to say that code can't be art, but maybe instead of asking for the minimum number of operations needed to remove all the cells, 00:12:41.300 |
maybe it's like, "Hey, can you make a fun video game that's more intentional with what it teaches you as you play?" 00:12:51.300 |
You guys who are here right now understand this so deeply. But at Every, I lead our training and consulting, 00:12:57.600 |
and I work with a bunch of clients, from journalists to people at hedge funds to people in construction and tech. 00:13:03.600 |
And they all have the same two fears, the first of which is: how can I trust AI? 00:13:11.600 |
And benchmarks, in my view, are really the answer to both. 00:13:14.900 |
One, they realize that, in my view, the role of a human in an AI world is to define the goal 00:13:23.900 |
and to define what's good and bad en route to that goal. 00:13:28.900 |
And once you do that, once you define that goal, then even if it's just defining a prompt, you can see AI try and attempt that. 00:13:35.900 |
You can give feedback, you can realize, "Oh, it's messing up in this way. 00:13:40.200 |
And it's not quite exactly what I want because it's not going to be perfect." 00:13:44.200 |
Maybe that's really just changing a prompt a little bit and then you see it get better. 00:13:48.200 |
And that moment, that cycle, that builds trust. 00:13:51.200 |
They realize, "Oh, I am important to this whole system, but it can be helpful." 00:13:57.200 |
And we need trust right now because we are building one of, if not the most powerful tools ever made. 00:14:03.200 |
And we can get more out of it if more people use it. 00:14:06.500 |
There will be, you know, more customers, sure. 00:14:09.500 |
But there's also going to be a whole lot more incredible things that get made. 00:14:14.000 |
And if you're not sure where to start, you can ask your mom. 00:14:16.500 |
You know, my mom teaches yoga and we had a good talk about, you know, what were some things that could help. 00:14:22.500 |
And we, you know, put those seven questions into five different models. 00:14:25.800 |
And, you know, she ended up realizing, "Hey, Gemini 2.5 Pro is my favorite too." 00:14:30.800 |
And, you know, there were a few things that she didn't like from their responses. 00:14:33.800 |
So we made a simple prompt, and now she uses that to help her local community have customized sessions. 00:14:43.100 |
You know, that's having a big impact in a local community, on something that matters to them. 00:14:48.100 |
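If it helps, the whole exercise with her was basically this shape. A hedged sketch: the questions, the model list, and the `ask` stub below are placeholder examples, and the scoring is just a person saying which answer they liked:

```python
# Sketch of a personal, "ask your mom" eval: the same handful of questions run
# against a few models, with a person picking their favorite answer for each.
# `ask` is a stub; wire it up to whatever model APIs you actually use.

QUESTIONS = [
    "Plan a gentle 45-minute yoga session for beginners.",
    "How do I cue breathing without sounding repetitive?",
]
MODELS = ["gemini-2.5-pro", "o3", "claude-opus"]       # example names only

def ask(model: str, question: str) -> str:
    return f"({model}'s answer to: {question})"        # stand-in for a real API call

def collect_answers() -> dict:
    return {q: {m: ask(m, q) for m in MODELS} for q in QUESTIONS}

def tally(favorites: list[str]) -> dict:
    # favorites = the model the person picked for each question, in order
    return {m: favorites.count(m) for m in MODELS}

answers = collect_answers()                            # show these side by side
print(tally(["gemini-2.5-pro", "gemini-2.5-pro"]))     # the picks become the scoreboard
```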
So hopefully before you guys leave SF, maybe talk to somebody who's not in AI. 00:14:55.100 |
And just maybe that conversation has a big impact now and in the future. 00:15:04.400 |
MMLU scores, just way less cool than asking what your mom thinks. 00:15:11.400 |
I appreciate, you know, a bunch of people that helped actually bring this out. 00:15:16.400 |
It kind of came together through random coordination on X. 00:15:19.400 |
Had researchers from all over the world hop in. 00:15:24.400 |
All the way from Australia, and Tyler in Canada, who kind of helped make this happen with the TextArena team. 00:15:28.400 |
Especially the Every team, who kind of backed me and made it possible to create this presentation and be here.