
SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam's Many Errors


Chapters

0:00 Intro
0:24 Where to Start
2:50 The Results
3:37 What is MMLU
4:58 How did we get 89
6:38 Google Minerva
8:26 Self-Consistency
10:09 OpenAI Fine-Tuning
10:43 Josh Stapleton
13:46 The Question
20:32 Final Results
21:24 Future Benchmarks
23:46 Example

Whisper Transcript

00:00:00.000 | Since late April, myself and machine learning engineer Josh Stapleton have evaluated over
00:00:05.900 | 120,000 answers from GPT models to explore their limits. In my original SmartGPT video,
00:00:13.480 | I showed that even popular TED talks calling GPT-4 stupid were not accurately testing what
00:00:19.860 | GPT-4 could do. And actually, it could easily get such questions right. Little did we foresee that
00:00:25.780 | come the summer, our tests with GPT-4 would be revealing a host of mistakes in an official
00:00:31.400 | globally used benchmark, uncovering concerns that even OpenAI and Google don't appear to be aware
00:00:37.720 | of. But by the end of the video, I want to show how you can tangibly benefit from our experiments,
00:00:43.820 | including in unexpected domains like medicine. Where to start? Well, here's a super quick intro
00:00:50.160 | to those of you who haven't seen the original video. SmartGPT was a way of using the latest
00:00:55.340 | prompt engineering research to trigger better performance in a model like GPT-4. Getting the model to think
00:01:01.880 | a bit, aka use some tokens, before giving a final answer was key. Another important element I talked
00:01:07.260 | about in that video was the power of getting the model to self-reflect. An insight I drew on from
00:01:12.380 | talks with the lead author of the famous reflection paper. My manual experiments showed that using
00:01:18.520 | optimized prompts, reflection, and self-dialogue, you could boost performance in almost any task.
00:01:23.900 | And I demoed the improvement on
00:01:25.560 | formal logic and college mathematics. But there was a problem, which is why you guys haven't heard
00:01:30.500 | about SmartGPT in a while. How could I systematically benchmark GPT-4 using these methods,
00:01:35.920 | when I'm just one guy? Well, enter Josh Stapleton, machine learning engineer extraordinaire. Without
00:01:41.440 | him, it would have been impossible to build out such a fleshed out, flexible code base,
00:01:46.320 | with which we could systematize experiments and iterate rapidly. But then we both quickly
00:01:51.040 | realized that there was another problem with benchmarking the original version of SmartGPT.
00:01:58.880 | It would be hell to manually extract out the final answers within pages of reflection and resolving,
00:02:05.320 | not to mention cost tens of thousands of dollars. And trust me, a month of YouTube advertising would
00:02:10.880 | not even cover the first hour of that run, unfortunately. And no, we would never compromise
00:02:16.360 | by asking GPT-4 to grade its own answers. It would be unscientific and inaccurate. The infamous MIT
00:02:23.920 | paper is enough evidence of that. GPT-4 didn't get 100% on an MIT degree,
00:02:28.940 | and this paper was withdrawn. So yes, we had to lower the power level of SmartGPT,
00:02:34.640 | get rid of the reflection and resolving, deliberately sacrificing some of its intelligence
00:02:39.840 | because we simply couldn't afford to unleash it fully. And yet, we still got a new, albeit
00:02:45.460 | unofficial, record of 88.4% on the MMLU. That not only beats the 86.4% recorded by OpenAI,
00:02:54.580 | it beats the projections for 2024 that Metaculus recorded before ChatGPT came out. And yet,
00:03:01.660 | we are both convinced that there are at least a dozen more ways performance can be further
00:03:06.900 | boosted using existing models. Yes, that might mean GPT-4 getting a result reserved for June
00:03:13.380 | of 2025. The thing is, we have hit the limits of what a self-funding team of two can do.
00:03:19.320 | Before I briefly touch on what the MMLU is, I am happy to say that all of our results,
00:03:24.560 | and answers, that's 2,850 for the GPT-4 run and over 120,000 for GPT-3.5, are freely available
00:03:33.380 | to browse in a GitHub repository linked in the description. So what the hell is MMLU anyway?
00:03:39.520 | Well, it is arguably the best-known benchmark of language model performance. It stands for
00:03:45.360 | Massive Multitask Language Understanding. Massive because it has over 14,000 questions and multitask
00:03:52.600 | because it covers 57 different subjects. The idea behind it was truly fantastic. And it is important
00:03:59.420 | enough to feature prominently on the first page of the GPT-4 technical report. In the past,
00:04:05.740 | I have said that getting 100% on this test would be a good sign of AGI. Others have talked about 95%.
00:04:12.220 | I do think I have like a 50% chance, like within the next 20 years or so,
00:04:16.680 | there might be something we might call an AGI or transformative AI.
00:04:20.560 | What do I mean by this? Well, maybe you can measure it on benchmarks. There's like,
00:04:24.520 | this famous MMLU benchmark is like, yeah, there's something which like scores like 95% on this.
00:04:29.400 | And the paper itself notes that an 89.8% performance represents human expert ability,
00:04:35.900 | which as you can tell from the title, we are achingly close to beating. And as you'll see
00:04:40.880 | in a moment, GPT-4 with the full power of prompt engineering could likely get upwards of 90 to 92%
00:04:47.860 | right now. And frankly, whether it's GPT-5 or Gemini, that 95% threshold should be
00:04:54.500 | easily broken by next year, not in 20 years. If we didn't use the full power of SmartGPT,
00:05:00.980 | how did we get 88.4%? And why does the title say 89%? Well, let me show you the two facets of
00:05:08.980 | SmartGPT that we did use. The thing is the MMLU demands a single character answer, A, B, C, or D.
00:05:16.420 | And that answer must be immediate. Now imagine taking a test and the very first thought you had
00:05:22.200 | had to be your final answer. In fact,
00:05:24.480 | this is believed to be a key reason for hallucinations. This was a great paper on
00:05:30.000 | how language model hallucinations can snowball from the first few tokens. As they say, the language
00:05:36.060 | model first commits to an answer. That's before outputting the explanation. And this is a problem
00:05:41.240 | because transformers cannot find the answer within one time step because of their limited
00:05:45.700 | reasoning abilities within that time step. And why don't language models like GPT-4 back down
00:05:50.760 | and change halfway through? Well, they prioritize fluency and
00:05:54.460 | coherence at the expense of factuality. But rushing an answer in your first token is particularly
00:06:01.120 | hobbling in questions like this requiring deeper thought or calculation. It's fine for questions
00:06:06.800 | that need memorized knowledge, but not for questions like these. We went through all of
00:06:11.720 | these subjects in the MMLU and characterized around a third of them as requiring that kind
00:06:18.200 | of deeper thought. And of course, most ways that you use GPT-4 will also require some thought.
00:06:26.740 | Open source teams and groups like OpenAI and Google all draw on the dev set when testing a model.
00:06:33.860 | Notice the five questions each with a single character answer. We were not, of course,
00:06:38.820 | the first team to realize that this underplays the abilities of the model.
00:06:43.820 | The Minerva paper from Google said this: "The standard way of evaluating on the
00:06:49.380 | MMLU is to construct a five-shot prompt out of the dev set." So what they did instead, like us,
00:06:54.420 | was to use a prompt which has a chain of thought
00:06:58.980 | before outputting the final answer. You can see some examples below.
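To make "chain of thought before the final answer" concrete, here is an invented exemplar in that style; the question, working, and exact formatting are illustrative, not taken from the Minerva paper or the MMLU dev set:

```
Question: A train travels 120 km in 1.5 hours. What is its average speed?
(A) 60 km/h  (B) 70 km/h  (C) 80 km/h  (D) 90 km/h

Answer: Average speed is distance divided by time: 120 km / 1.5 h = 80 km/h,
which matches option (C).
Final answer: (C)
```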
00:07:02.880 | Essentially, it allows the model to think a bit first and gives it a scratch pad. There
00:07:08.100 | are other theories though, like the length and detail of the exemplars triggering different
00:07:12.540 | weights of the model. This paper from two months ago used longer exemplars for the moral scenarios
00:07:19.080 | subject within the MMLU. With these five custom exemplars plus self-consistency, which I'll
00:07:24.400 | get to in a moment, they saw accuracy go up to 80% from 68%. But before even this paper came out,
00:07:30.940 | Josh and I were sourcing and crafting bespoke exemplars for the 21 subjects we deemed would
00:07:37.780 | need the most working out. For the other subjects, we made do with the normal dev examples. OpenAI and
00:07:43.960 | Google don't do this for their benchmarking, underplaying the abilities of their model.
00:07:48.580 | So why doesn't everyone do it, you might ask? My theory is that it's because you have to hand-grade
00:07:53.620 | every answer.
00:07:58.060 | Essentially, you're taking the time to listen, which is the least that we can do, I feel, as we approach
00:08:04.940 | human-level intelligences. Even though it still took weeks, to make checking easier, we taught
00:08:10.720 | GPT-3.5 and GPT-4 through our exemplars to always end with a final answer in the same format.
00:08:18.780 | Lesson one, therefore, for everyone watching, is don't make the first token the final answer.
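As an illustration of that lesson, here is a minimal, hypothetical Python sketch of the idea: prompt for reasoning first, pin down a fixed final-answer format, and extract the choice with a regex rather than trusting the first token. The prompt wording and helper names are our own, not the actual SmartGPT code:

```python
import re

# Ask for reasoning BEFORE the answer, and require a fixed final format
# so grading can be automated without the model committing on token one.
# (The exact wording here is an assumption, not the SmartGPT prompt.)
COT_SUFFIX = (
    "Let's think step by step, then finish with a line of the form\n"
    "'Final answer: (X)' where X is A, B, C, or D."
)

def extract_choice(completion: str) -> str | None:
    """Pull the single-letter choice out of a 'Final answer: (X)' line."""
    match = re.search(r"Final answer:\s*\(?([ABCD])\)?", completion)
    return match.group(1) if match else None  # None = unparseable answer

# Example: extract_choice("... so 120/1.5 = 80. Final answer: (C)") -> "C"
```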
00:08:24.360 | Lesson two comes from a paper on self-consistency, which in a nutshell says that taking the highest
00:08:31.940 | probability answer, sometimes called greedy decoding, doesn't always reflect the best
00:08:37.040 | answer the model is capable of. In other words, don't take the model's first answer as its final
00:08:42.380 | answer. Take this example. The highest single probability answer was this, and that was
00:08:47.920 | incorrect. For OpenAI, it would now be over: it's incorrect, done. But sometimes, if you look at all
00:08:54.340 | the different answers that a model might give, and then take the majority answer, the final answer
00:09:00.040 | that came up the most often, it can get it right. Interestingly, in the Minerva paper, they used
00:09:05.160 | 256 samples, although only 16 for the MMLU. OpenAI even put a little footnote in their GPT-4
00:09:13.620 | technical paper, admitting that they don't use that approach, but Google does. And yes, this can
00:09:18.920 | significantly affect the final results. Look at the boost going up to 40%.
00:09:24.320 | Now, the MMLU curve stops at 40 sampling paths, and it hasn't fully leveled off yet. These aren't
00:09:27.940 | re-dos where you keep trying until you get it right. This is letting the model explore its
00:09:32.720 | full probability distribution of outputs, and taking the truly most probable final answer.
00:09:37.940 | Letting the model think, not rushing it. For our runs, we limited ourselves to 9 samples,
00:09:43.100 | and took the majority vote. But of course, the results could have been dramatically better if
00:09:47.780 | we did 40 samples, or indeed 256.
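Here is a hedged Python sketch of self-consistency as just described: sample several reasoning paths at a non-zero temperature and take the majority final answer. The model name, temperature, and prompt are placeholder assumptions, and extract_choice is the regex helper sketched above, not the actual SmartGPT code:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def self_consistent_answer(prompt: str, n_samples: int = 9) -> str | None:
    """Sample several reasoning paths and return the majority final answer."""
    completion = client.chat.completions.create(
        model="gpt-4",               # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,             # non-zero so the sampled paths differ
        n=n_samples,                 # nine samples, as in the runs described
    )
    # Count only parseable final answers, then take the most common one.
    votes = Counter(
        choice
        for c in completion.choices
        if (choice := extract_choice(c.message.content)) is not None
    )
    return votes.most_common(1)[0][0] if votes else None
```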
00:09:54.300 | Now, aside from these two hard-won lessons, which I'm going to show how all of you can benefit from, the other difference from the previous state-of-the-art 86.4%
00:10:00.760 | was that we did use the most current versions of each model. So, the models may have independently
00:10:06.060 | gotten better or worse in certain topics. But I would say that our broad findings do run counter
00:10:12.020 | to any simple, it's-got-dumber narrative. And if I had to guess, behind the scenes, OpenAI have
00:10:17.920 | implemented some fine-tuning involving step-by-step solutions. As I see that phrase cropping up in
00:10:24.280 | a lot of its answers. And that particular trick from the original SmartGPT seems less effective than before.
00:10:34.280 | And now I'm going to ask Josh to talk about our state-of-the-art score, not only with GPT-4,
00:10:38.460 | but also with GPT-3.5. But just before I do, here is a hint of why the title talks about
00:10:45.340 | breaking a benchmark. I'll show you how GPT-4 itself encouraged us to question many of the
00:10:54.260 | tests, leading to the discovery of at least 80, and likely hundreds of errors in the test. Enough to
00:11:00.440 | significantly affect final results by up to 2%. And given that the differences in, say, the open-source
00:11:07.280 | language model leaderboard come down to as little as 0.1 of a percent, that's pretty big. Yes, we've
00:11:12.440 | been in contact with some of the authors of the test over the past month to check our findings, and
00:11:17.000 | I'll say more in a bit. But first, here is ML engineer Josh detailing how the magic happened. Josh, by the way, is a pretty
00:11:24.240 | precocious AI consultant working on a master's at Imperial College London.
00:11:28.260 | Josh Stapleton: Hi everyone, nice to meet you all. My name is Josh Stapleton, and let me show you the version of SmartGPT we used. The SmartGPT framework is highly parallelized and can handle industry-scale use cases. We used a thread- and asyncio-based approach to make simultaneous calls to the API at the answer-option, answer, and subject levels, stacking parallelization upon parallelization. This led to crazy iteration speed boosts. For example, we were able to
00:11:54.220 | complete the final GPT-4 run in under two hours; generating single answer options in series would have taken weeks. We did two large runs using SmartGPT, first with GPT-3.5 and then GPT-4. The 3.5 run was on the entirety of the MMLU, a total of nine times 14,042 questions, over 126,000 answers. This was a mammoth effort to manually grade, but the SmartGPT innovations and hard work ended up boosting GPT
00:12:24.200 | 3.5's performance by a significant 3.7%, from 70% to 73.7%. The GPT-4 run using SmartGPT also beat the OpenAI MMLU benchmark score substantially, and this run actually resulted in the discovery of a number of problematic MMLU questions, which Philip will talk about shortly. The cost to run GPT-4 on all MMLU would have been too high for us to self-fund, having already each invested four-figure sums, so we used a representative subset
00:12:54.180 | of 2,850 questions from the total of 14,042, of course fully weighted to official standards. SmartGPT is a model-agnostic, parametrized, and highly flexible system that can be applied to disparate use cases. We are already working on applications in a number of domains in both the public and private sectors. The system is evolving and improving constantly under the hood as we continue to innovate. While the current system can get state-of-the-art results, with the ability to handle enterprise-scale data, there
00:13:24.160 | are a number of known ways to improve it, which we aim to implement in the near future. From better and more numerous automatically sourced exemplars, to LLM-driven prompt optimization, to fine-tuning, we are just getting started with SmartGPT. And we are uniquely positioned as a tiny team to integrate both our own ongoing improvements, as well as promising discoveries in the field as they arise. Back to Philip.
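Josh's stacked parallelization (simultaneous API calls at the answer, question, and subject levels) might look roughly like the following asyncio sketch. The structure, model name, and function names are illustrative guesses, not the real SmartGPT code base:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def answer_question(question: str, n_samples: int = 9) -> list[str]:
    """Fire all self-consistency samples for one question in a single call."""
    completion = await client.chat.completions.create(
        model="gpt-3.5-turbo",       # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
        n=n_samples,
    )
    return [c.message.content for c in completion.choices]

async def answer_subject(questions: list[str]) -> list[list[str]]:
    """Parallelize across a subject's questions; subjects themselves
    can be gathered concurrently in exactly the same way."""
    return await asyncio.gather(*(answer_question(q) for q in questions))

# asyncio.run(answer_subject(["Q1 ...", "Q2 ..."])) issues every call
# concurrently instead of waiting on each answer in series.
```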
00:13:54.140 | Here is the question that started it all off. As you can see, the question makes no sense. The text says, demand reduction, and the answers are either 134, 234, 123, 124. What on earth is that about? Now remember, it was only human grading that enabled us to spot this. I reckon most companies like OpenAI rely on auto-grading by exact match. That would immediately toss out any answer like these as null because an answer of A, B, C, or D
00:14:24.120 | hasn't been given. Now I should say, it was human grading that caught this, and GPT-4 itself. Here is poor GPT-3.5 bravely giving an answer to a question that makes absolutely no sense at all. I love how a couple of times it changed its mind and was like, no, no, no, D, not B. What then followed was weeks and weeks of me checking every quote-unquote mistake against the official source the question came from. When I found the original source, I realised what the problem was. Sometimes,
00:14:54.100 | they just hadn't pasted all of these statements. When you can see all four of these statements,
00:14:58.900 | the answer options make a lot more sense. Now I know what some of you may be thinking. Maybe
00:15:03.560 | it's just business ethics, that's just one subject, and what, it's a dozen questions,
00:15:07.640 | what's the big deal? Well first of all, business ethics only has 100 questions, so 13 of them,
00:15:12.780 | missing vital context, completely undermines that entire subject. And second of all, it wasn't just
00:15:19.080 | business ethics, and it wasn't just this same problem. It wouldn't always be about missing statements,
00:15:24.080 | it would be about missing the whole thing.
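Returning to the auto-grading point above: here is a minimal, hypothetical sketch (our illustration, not OpenAI's actual evaluation harness) of why exact-match grading silently nulls out answers to malformed questions instead of flagging them:

```python
def exact_match_grade(model_output: str, gold: str) -> bool | None:
    """Exact-match grading: anything that isn't a bare A/B/C/D is just null."""
    answer = model_output.strip()
    if answer not in {"A", "B", "C", "D"}:
        return None  # tossed out silently; a broken question is never inspected
    return answer == gold

# A rambling answer to a nonsensical question is discarded, not investigated:
# exact_match_grade("no, no, no, D, not B... D", "C") -> None
```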