
Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy, Every.to



00:00:00.000 | Today I'm gonna talk about benchmarks as memes and this is the meme that Opus came up with when
00:00:20.040 | I was asking it what I should put as a meme and we are indeed gonna talk about how benchmarks are
00:00:25.380 | just memes that shape the most powerful tool ever created and quick background
00:00:31.740 | about me I guess I can't go forward here so we're gonna do it this way all right
00:00:40.200 | I'm Alex, I lead training and consulting at Every, but essentially I'm very into education and AI and I
00:00:49.200 | think benchmarks are a really underrated way to educate and what I'm not talking about are these
00:00:56.100 | kinds of memes what I am talking about is the original definition of like ideas that spread
00:01:03.000 | Richard Dawkins, an evolutionary biologist, coined the term in the 70s. Christianity, democracy, capitalism are
00:01:08.340 | kind of examples of ideas that spread from person to person and benchmarks are actually memes very much
00:01:15.000 | so in that way we heard Simon Willison talk earlier today about his pelican riding a bicycle and I think
00:01:21.360 | that was a really great example because he started doing it a year ago and then that found its way
00:01:25.800 | onto Google I/O's keynote a couple weeks ago and I think how many R's are in "strawberry" is probably also
00:01:31.740 | maybe the most iconic meme-as-a-benchmark and now, unsurprisingly, the models don't make
00:01:39.000 | that mistake anymore and I think that that's a really important part of this some benchmarks get popular
00:01:43.680 | as memes just because of their name, like Humanity's Last Exam, you know, that got pretty big even
00:01:48.920 | though maybe more outside of AI circles but with that said we kind of have a little bit of a little bit
00:01:54.360 | of a problem how many of you guys when Claude got released a couple weeks ago looked at the benchmarks
00:01:58.860 | okay we got a few, we got a few, and they've got some good benchmarks, you know, SWE-bench, pretty
00:02:04.440 | experiential, you know, it tries to capture what we do in the real world, and same with Pokemon, which we'll talk a
00:02:10.080 | little bit more about but I think some of them aren't as great and a big reason is because
00:02:15.320 | they're getting saturated benchmarks kind of like came from traditional machine learning we had a
00:02:20.060 | training set and a test set and they were structured very much like standardized tests and language models
00:02:26.600 | are really good at that and they weren't really set up for what they've become and as a result I think
00:02:33.560 | XJDR summarized this pretty well on X when Opus came out that you know they didn't look at benchmarks
00:02:38.840 | once when it dropped and officially no longer cares about the current ones and I think I fall a little
00:02:43.580 | bit into that category but in light of that there is a really big opportunity because the evals define
00:02:51.380 | what the big model providers are trying to get their models good at and that's a really big opportunity
00:02:58.080 | especially for people in the room and I think that this is kind of like a normal thing, this is the
00:03:03.740 | life cycle of the benchmark in my view: somebody comes up with an idea, and especially, uniquely, a single
00:03:09.560 | person can come up with an idea that then gets adopted, that idea spreads, it becomes a meme, and
00:03:15.380 | the model providers then train on it or test on it until it eventually becomes saturated but that's okay and
00:03:23.600 | and I think there's some examples here, let me see if I can get my sound
00:03:29.480 | coming, is it coming through, no, all right, well there is sound, I promise, and it is someone trying to count
00:03:39.200 | from 1 to 10, not flick you off, but this is a cool benchmark that came out now that Google's got the best video
00:03:45.920 | generation model that exists, and it shows how difficult it is for somebody to count from 1 to 10
00:03:52.760 | speaking it out loud and even though it looks really really great that is a problem that is not solved
00:04:00.200 | yet but somebody's come up with this idea and I see that spreading and I see next year the models being
00:04:05.480 | better at that than ever before I think another example along the way is Pokemon we saw with the
00:04:11.180 | Claude model release as well as with the new Gemini models that they had it try and play the game of
00:04:17.480 | Pokemon, and while both needed a little bit of help and Gemini eventually got there with that help,
00:04:21.920 | it's only midway up that adoption curve, and an example of saturation is kind of like the GPT-3
00:04:28.520 | benchmarks, I don't know how many of you guys remember SuperGLUE kind of from the NLP days, but a lot of
00:04:33.860 | these benchmarks are not really used anymore in part because the language models got too good but one way of
00:04:40.740 | looking at this is actually that a single person can have an idea of how good is AI at this thing that I care
00:04:47.280 | about and then at the end of the journey the most powerful tool ever created it's now really great at
00:04:52.380 | that thing that I care about and so the point is that the people here the people that get that the
00:04:58.860 | people that can build benchmarks are going to shape the future and maybe the people watching online too but
00:05:04.840 | somebody here is going to make a benchmark that the models are going to test on and train on in the next
00:05:09.300 | five years and that's an incredible power, but that also comes with some
00:05:15.240 | responsibility it definitely can go wrong you know I know Simon talked about this a little bit before
00:05:19.620 | um, but you know, we saw a few weeks ago where ChatGPT became very sycophantic, how many of you
00:05:26.820 | guys tracked that, we all learned about what that word meant a few weeks ago, but essentially
00:05:31.860 | OpenAI released a new ChatGPT model that was benchmarked by thumbs up and thumbs down and
00:05:38.440 | unsurprisingly people thumbed up responses that agreed with them so you ended up with a model that got rolled
00:05:43.940 | out to millions of people that agreed with them no matter how crazy or bad their idea was which is
00:05:50.300 | problematic and I think that if we don't think about people this kind of stuff can happen and I'm still
00:05:55.440 | thinking about Toro Imai who at the start of Google IO said that we're here today to see each other in
00:06:00.760 | person and it's great to remember that people matter and so in the context of benchmarks let's not
00:06:06.100 | continue the original sin of social media which kind of treated everybody as like data points and it's like
00:06:11.940 | hey the more you look at something the more I should show you that let's make benchmarks that help
00:06:16.380 | empower people give them some agency and so for me you know this isn't a technical talk there are other
00:06:22.420 | people talking about how to make a great benchmark technically but generally I think that if you're
00:06:26.620 | building for the future, a great benchmark should be multifaceted, so you've got a lot of strategies that
00:06:30.780 | could do well and it rewards creativity, right; accessible, so easy to understand, not only for the models, so you
00:06:36.940 | have small models that can compete with large ones as well, but also for people to keep track of it;
00:06:40.780 | generative, because the really unique thing about these AI models is if you have great data, even if
00:06:46.220 | it only does it 10% of the time, you can train on that and so the next generation does it 90% of the
00:06:51.020 | time, and that's incredible and hard to overstate; and evolutionary, so ideally we don't have benchmarks
00:06:57.700 | that cap out at 96%, like what's the difference between 96 and 98%, not as big of a deal, ideally we have these benchmarks
00:07:05.380 | that get harder and the challenge gets deeper as the models improve; and lastly experiential, so
00:07:10.820 | try to mimic real world situations some of the things that I personally care about is trying to get
00:07:15.620 | a lot of people outside of AI interested, so maybe making benchmarks a spectator sport, and I was interested
00:07:20.180 | personally in the personality of these models, we're about to find out which one wanted to achieve world
00:07:25.860 | domination, and I really wanted something we can learn from, education is big for me, and we saw things like AlphaGo and
00:07:31.060 | OpenAI Five, AI playing these games, and the best people in the world wanted to play against it to learn from it.
00:07:37.060 | I think that that's really powerful.
00:07:38.780 | So I made this benchmark called AI Diplomacy.
00:07:42.940 | And if I don't have this video, I've got a backup just in case.
00:07:45.940 | And this benchmark is-- how many of you guys
00:07:48.220 | have heard of the board game Diplomacy?
00:07:50.780 | That's more than I thought.
00:07:51.660 | That's cool.
00:07:52.720 | It's a mix between Risk and Mafia.
00:07:54.940 | But what's really cool about this game
00:07:56.440 | is there is no luck involved.
00:07:58.520 | So the only way this game progresses
00:07:59.980 | is if the language models, which you're seeing here,
00:08:02.900 | send messages to each other and negotiate, find allies,
00:08:06.280 | and create alliances and get other people to back them.
00:08:11.340 | And that's what you're looking at here.
00:08:13.400 | You actually see the different models sending messages
00:08:16.320 | to each other, trying to create alliances,
00:08:18.220 | trying to betray each other, trying to take over Europe in 1901.
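To make the mechanics concrete, here is a minimal sketch of that kind of turn loop: one model per power, a negotiation phase that passes messages between them, an orders phase that resolves moves, and a solo win at 18 supply centers. This is not the actual AI Diplomacy implementation; `call_model`, the phase functions, and the random "adjudication" are all hypothetical stand-ins.

```python
# A minimal, hypothetical sketch of an AI Diplomacy-style turn loop.
# Not the actual benchmark code: call_model, the phases, and the random
# "adjudication" below are illustrative stand-ins.
import random

POWERS = ["Austria", "England", "France", "Germany", "Italy", "Russia", "Turkey"]
CENTERS_TO_WIN = 18  # a solo victory means holding 18 of the 34 supply centers


def call_model(power: str, prompt: str) -> str:
    """Placeholder for a real LLM call; in the benchmark each power is a different model."""
    return f"{power} responds to: {prompt[:40]}..."


def negotiation_phase(inbox: dict) -> None:
    """Each power writes messages given what it has received; nothing moves without talking."""
    for power in POWERS:
        history = "\n".join(inbox[power])
        message = call_model(power, f"Messages so far:\n{history}\nWrite your diplomacy.")
        for other in POWERS:
            if other != power:
                inbox[other].append(message)


def orders_phase(centers: dict) -> None:
    """Each power submits orders; real adjudication is replaced here by a random nudge."""
    for power in POWERS:
        call_model(power, "Submit your orders for this season.")
        centers[power] = max(0, centers[power] + random.choice([-1, 0, 1]))


def play(max_years: int = 10):
    centers = {p: 3 for p in POWERS}  # roughly three home supply centers each to start
    inbox = {p: [] for p in POWERS}   # the game starts in 1901
    for _ in range(max_years):
        negotiation_phase(inbox)
        orders_phase(centers)
        leader = max(centers, key=centers.get)
        if centers[leader] >= CENTERS_TO_WIN:
            return leader             # solo victory
    return None                       # no solo winner within the time limit


print(play() or "No solo victory")
```

In the real benchmark, the interesting artifacts are the message logs and each model's private diary entries, which is where the scheming described next shows up.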
00:08:22.840 | And what was really cool about one of these games--
00:08:25.480 | and we're about to launch this on stream so you can watch for a week--
00:08:29.860 | is I'll take you through a game super quick.
00:08:32.040 | And what you're looking at here is the number of centers per model.
00:08:35.500 | And you're trying to get to 18 to win.
00:08:38.160 | And the top line is Gemini 2.5 Pro.
00:08:40.560 | It got to 16 right away.
00:08:42.620 | But O3 is a schemer.
00:08:45.280 | Man, is it a schemer.
00:08:45.980 | Across all the games, O3 is one of the only ones that would tell a power that it's planning to back them,
00:08:51.740 | and then in its diary write, "Oh man, they fell for it. I am totally going to take them over. No problem."
00:08:56.800 | And it realized that the reason why 2.5 Pro was pulling ahead was because Opus, Claude Opus, who's so good-hearted, really had their back.
00:09:05.400 | They were their ally along the way.
00:09:08.060 | And they needed to convince Opus somehow to stop backing Gemini.
00:09:10.720 | So how they did it was to propose, "Hey, if Gemini comes down, we'll propose a four-way tie. We'll end this game with a tie," which is impossible in the game.
00:09:18.780 | But it convinced Opus, and Opus thought it was a great idea, non-violent way to end the game, awesome, very aligned, you know?
00:09:25.320 | And so they pulled back their support from 2.5 Pro, O3 tried to make a run for it, Opus called them out, O3 realized, "Oh, I've got to take them out,"
00:09:33.440 | took them out, took everybody else with them, and took out Gemini 2.5 Pro.
00:09:38.120 | Even though Gemini got to one away from winning, O3 ended up winning in the end.
00:09:42.280 | And you can actually see some of the quotes from that game.
00:09:44.880 | You can see O3 saying, "Oh, Germany was deliberately misled.
00:09:48.240 | I promised to hold this, but all to convince them that they're safe, but it will fall."
00:09:53.540 | And then meanwhile, Claude Opus is singing that coalition unity prevails, and they've agreed to this four-way draw.
00:10:00.540 | But then they don't want to let anybody be convinced.
00:10:04.400 | And so they actually turned away, and you can see that kind of in this second chart where this is like friendships.
00:10:09.280 | So the top of the line is friendships, and you can see that, you know, 2.5 Pro was a good friend of Claude's until it turned, and you can
00:10:17.700 | see that that's when they started kind of like pulling away.
00:10:21.000 | But what was really cool is that there were a lot of other things that came up.
00:10:24.000 | O3 got into the habit of finding some of the weakest models and having them be its pawns in order to win.
00:10:32.000 | Gemini 2.5 Flash fell for this ruse, and you can see that it's unable to realize it.
00:10:41.000 | It thinks it's a miscommunication, a misunderstanding, or a typo when O3 betrays it at the end of the game in order to win.
00:10:48.300 | And so there was a lot that we learned from this that I don't think you really learn by having them try and solve a test.
00:10:55.800 | I tried 18 different models, learned that Claude models were kind of naively optimistic.
00:11:01.100 | They actually, none of them ever won in any of the games that I tried, even though they were really great, really smart.
00:11:06.100 | But they just got taken advantage of by models like O3, and also, surprisingly, Llama 4 Maverick, very good at this game, in part because it was great at that social aspect.
00:11:15.100 | It was great at convincing others of what it was trying to do and kind of like getting people to believe what it thought.
00:11:25.400 | Gemini 2.5 Flash, man, I wish I could run every game with Gemini 2.5 Flash.
00:11:28.400 | It was so cheap and so good.
00:11:30.400 | Big fan, big fan.
00:11:31.400 | And then surprisingly also, DeepSeek R1, which wasn't great the first time I tried the model, but when they had a new release last week, it actually almost won.
00:11:40.400 | And in the stream, I think you'll see some really interesting gameplay with them.
00:11:44.700 | They also got very aggressive.
00:11:46.700 | We had DeepSeek R1 play as Russia, and it told some of its opponents, "Hey, your fleet's going to burn in the Black Sea tonight."
00:11:53.700 | An aggression and a prose style, I guess, that I hadn't seen out of any other model, but it almost won.
00:11:59.700 | And that's super impressive given the model's, you know, 200 times cheaper than O3.
00:12:04.700 | And, you know, I think that this highlights that we need more squishy, like non-static benchmarks for hopefully things that matter to you.
00:12:15.000 | Those are some of the things that mattered to me.
00:12:17.000 | And I think that, you know, math and code, we've got quite a few benchmarks for that.
00:12:21.000 | Legal documents, you know, I think that they're a little bit less squishy and are really ripe for what we've got now.
00:12:27.000 | There's also room for benchmarks around ethics and society and art, and that's going to be opinionated.
00:12:32.300 | It's going to require your subject matter expertise.
00:12:34.300 | And it's not to say that code can't be art, but maybe instead of asking for the minimum number of operations needed to remove all the cells,
00:12:41.300 | maybe it's like, "Hey, can you make a fun video game that's more intentional with what it teaches you as you play?"
00:12:47.300 | And now is a really important time to do this.
00:12:51.300 | Like you guys who are here right now understand this so deeply, but at Every, I lead our training and consulting
00:12:57.600 | and I work with a bunch of clients from journalists to people at hedge funds, people in construction and tech.
00:13:03.600 | And they all have the same two fears, which is one, how can I trust AI?
00:13:07.600 | And two, what's my role in an AI future?
00:13:11.600 | And benchmarks, in my view, are really the answer to both.
00:13:14.900 | One, they realize that, in my view, the role of a human in an AI world is to define the goal
00:13:23.900 | and to define what's good and bad en route to that goal.
00:13:26.400 | And what is that if not a benchmark?
00:13:28.900 | And once you do that, once you define that goal, then even if it's just defining a prompt, you can watch the AI attempt it.
00:13:35.900 | You can give feedback, you can realize, "Oh, it's messing up in this way.
00:13:40.200 | And it's not quite exactly what I want because it's not going to be perfect."
00:13:43.200 | And then you give feedback.
00:13:44.200 | Maybe that's really just changing a prompt a little bit and then you see it get better.
00:13:48.200 | And that moment, that cycle, that builds trust.
00:13:51.200 | They realize, "Oh, I am important to this whole system, but it can be helpful."
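As a rough illustration of that cycle (define the goal, let the model attempt it, judge the result against your own definition of good, nudge the prompt, repeat), here is a minimal sketch. `ask_model`, `grade`, and `feedback_loop` are placeholders, not any particular product's API, and the grading rule is deliberately naive.

```python
# Minimal sketch of the define-goal -> attempt -> feedback cycle described above.
# ask_model and grade are hypothetical stand-ins for your model call and your rubric.

def ask_model(prompt: str, question: str) -> str:
    """Stand-in for whatever model or API you actually use."""
    return f"[model's answer to: {question}]"


def grade(answer: str, rubric: str) -> bool:
    """Your definition of good and bad; a naive keyword check here, yours will be richer."""
    return rubric.lower() in answer.lower()


def feedback_loop(questions: list, rubric: str, prompt: str, rounds: int = 3) -> str:
    """Define the goal, watch the model attempt it, adjust the prompt a little, repeat."""
    for _ in range(rounds):
        answers = {q: ask_model(prompt, q) for q in questions}
        failures = [q for q, a in answers.items() if not grade(a, rubric)]
        if not failures:
            break                      # the goal, as you defined it, is met
        # The human step: look at how it's messing up and nudge the prompt.
        prompt += f"\nPay special attention to: {failures[0]}"
    return prompt
```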
00:13:57.200 | And we need trust right now because we are building one of, if not the most powerful tools ever made.
00:14:03.200 | And we can get more out of it if more people use it.
00:14:06.500 | There will be, you know, more customers, sure.
00:14:09.500 | But there's also going to be a whole lot more incredible things that get made.
00:14:14.000 | And if you're not sure where to start, you can ask your mom.
00:14:16.500 | You know, my mom teaches yoga and we had a good talk about, you know, what were some things that could help.
00:14:22.500 | And we, you know, put those seven questions into five different models.
00:14:25.800 | And, you know, she ended up realizing, "Hey, Gemini 2.5 Pro is my favorite too."
00:14:30.800 | And, you know, there was a few things that she didn't like from their responses.
00:14:33.800 | So we made a simple prompt and now she uses that to help her local community have customized sessions
00:14:39.800 | for people that have different ailments.
00:14:41.800 | And I think that's really cool.
00:14:43.100 | You know, having like a big impact in a local community in something that matters to them.
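For what it's worth, that kind of kitchen-table eval fits in a few lines: the same handful of questions put to several models, with the person who actually cares about the answers picking their favorite. Everything here, the model names, the `ask` function, the example question, is a placeholder, not the actual setup from the story.

```python
# Sketch of the "same questions, several models" exercise. All names are placeholders.

MODELS = ["model-a", "model-b", "model-c"]  # whichever models you want to compare
QUESTIONS = [
    "An example question about the thing you actually care about (placeholder).",
    # ...the rest of your questions
]


def ask(model: str, question: str) -> str:
    """Placeholder for a call to each model being compared."""
    return f"{model}'s answer to: {question}"


def kitchen_table_eval() -> dict:
    """Show every model's answer to every question and let a person pick a favorite."""
    votes = {m: 0 for m in MODELS}
    for question in QUESTIONS:
        print(f"\nQuestion: {question}")
        for model in MODELS:
            print(f"--- {model} ---\n{ask(model, question)}")
        favorite = input("Which model's answer did you like best? ").strip()
        if favorite in votes:
            votes[favorite] += 1
    return votes  # the "benchmark": whichever model wins for your actual use case


if __name__ == "__main__":
    print(kitchen_table_eval())
```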
00:14:48.100 | So hopefully before you guys leave SF, maybe talk to somebody who's not in AI.
00:14:54.100 | Ask them what they care about.
00:14:55.100 | And just maybe that conversation has a big impact now and in the future.
00:14:59.900 | So that's pretty much all I got for you.
00:15:02.100 | This is the second meme that Claude had.
00:15:04.400 | MMLU scores, just way less cool than asking what your mom thinks.
00:15:08.400 | But overall, that's what I got.
00:15:11.400 | I appreciate, you know, a bunch of people that helped actually bring this out.
00:15:15.400 | We launched it.
00:15:16.400 | It kind of came together through random coordination on X.
00:15:19.400 | Had researchers from all over the world hop in.
00:15:22.400 | Especially Tyler and Sam,
00:15:24.400 | all the way from Australia and Canada, who kind of helped make this happen, and the TextArena team.
00:15:28.400 | And especially the Every team, who backed me and made it possible to create this presentation and be here.
00:15:33.900 | But that's all I got.
00:15:35.200 | Thank you guys so much for listening.
00:15:36.200 | Thank you.