Back to Index

SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam's Many Errors


Chapters

0:00 Intro
0:24 Where to Start
2:50 The Results
3:37 What is MMLU
4:58 How did we get 89
6:38 Google Minerva
8:26 Self-Consistency
10:09 OpenAI Fine-Tuning
10:43 Josh Stapleton
13:46 The Question
20:32 Final Results
21:24 Future Benchmarks
23:46 Example

Transcript

Since late April, myself and machine learning engineer Josh Stapleton have evaluated over 120,000 answers from GPT models to explore their limits. In my original SmartGPT video, I showed that even popular TED talks calling GPT-4 stupid were not accurately testing what GPT-4 could do. And actually, it could easily get such questions right.

Little did we foresee that come the summer, our tests with GPT-4 would be revealing a host of mistakes in an official globally used benchmark, uncovering concerns that even OpenAI and Google don't appear to be aware of. But by the end of the video, I want to show how you can tangibly benefit from our experiments, including in unexpected domains like medicine.

Where to start? Well, here's a super quick intro for those of you who haven't seen the original video. SmartGPT was a way of using the latest prompting methods and prompt engineering research to trigger better performance in a model like GPT-4. Getting the model to think a bit, aka use some tokens, before giving a final answer was key.

Another important element I talked about in that video was the power of getting the model to self-reflect, an insight I drew from talks with the lead author of the famous Reflexion paper. My manual experiments showed that using optimized prompts, reflection, and self-dialogue, you could boost performance in almost any task.

And I demoed the improvement on formal logic and college mathematics. But there was a problem, which is why you guys haven't heard about SmartGPT in a while: how could I systematically benchmark GPT-4 using these methods when I'm just one guy? Well, enter Josh Stapleton, machine learning engineer extraordinaire. Without him, it would have been impossible to build out such a fleshed-out, flexible code base with which we could systematize experiments and iterate rapidly.

But then we both quickly realized that there was another problem with benchmarking the original version of SmartGPT: it simply wasn't practical at that scale. It would be hell to manually extract the final answers from within pages of reflection and resolving, not to mention it would cost tens of thousands of dollars.

And trust me, a month of YouTube advertising would not even cover the first hour of that run, unfortunately. And no, we would never compromise by asking GPT-4 to grade its own answers. That would be unscientific and inaccurate. The infamous MIT paper is evidence enough of that.

GPT-4 didn't get 100% on an MIT degree, and that paper was withdrawn. So yes, we had to lower the power level of SmartGPT, getting rid of the reflection and resolving, deliberately sacrificing some of its intelligence because we simply couldn't afford to unleash it fully.

And yet, we still got a new, albeit unofficial, record of 88.4% on the MMLU. That not only beats the 86.4% recorded by OpenAI, it also beats the projections for 2024 that Metaculus recorded before ChatGPT came out.

And yet, we are both convinced that there are at least a dozen more ways performance can be further boosted using existing models. Yes, that might mean GPT-4 getting a result the forecasts had reserved for June of 2025. The thing is, we have hit the limits of what a self-funded team of two can do.

Before I briefly touch on what the MMLU is, I am happy to say that all of our results and answers (2,850 for the GPT-4 run and over 120,000 for GPT-3.5) are freely available to browse in a GitHub repository linked in the description. So what the hell is MMLU anyway?

Well, it is arguably the best-known benchmark of language model performance. It stands for Massive Multitask Language Understanding: massive because it has over 14,000 questions, and multitask because it covers 57 different subjects.
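If you want to poke at the benchmark yourself, here is a minimal sketch of loading it, assuming the commonly used Hugging Face mirror cais/mmlu (that dataset id and its field names are my assumption, not something stated in the video):

```python
# Minimal sketch: peeking at MMLU records, assuming the Hugging Face
# mirror "cais/mmlu" (dataset id and field names are an assumption).
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")   # all 57 subjects in one config
row = mmlu["test"][0]

print(row["subject"])    # e.g. "abstract_algebra"
print(row["question"])   # the question stem
print(row["choices"])    # four options; indices 0-3 correspond to A-D
print(row["answer"])     # integer index of the correct option
```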

The idea behind it was truly fantastic, and it is important enough to feature prominently on the first page of the GPT-4 technical report. In the past, I have said that getting 100% on this test would be a good sign of AGI.

Others have talked about 95%. I do think I have like a 50% chance that, within the next 20 years or so, there might be something we might call an AGI or transformative AI. What do I mean by this? Well, maybe you can measure it on benchmarks. There's this famous MMLU benchmark, and maybe there's something which scores like 95% on this.

And the paper itself notes that an 89.8% performance represents human expert ability, which as you can tell from the title, we are achingly close to beating. And as you'll see in a moment, GPT-4 with the full power of prompt engineering could likely get upwards of 90 to 92% right now.

And frankly, whether it's GPT-5 or Gemini, that 95% threshold should easily be broken by next year, not in 20 years. If we didn't use the full power of SmartGPT, how did we get 88.4%? And why does the title say 89%? Well, let me show you the two facets of SmartGPT that we did use.

The thing is, the MMLU demands a single-character answer, A, B, C, or D, and that answer must be immediate. Now imagine taking a test where the very first thought you had had to be your final answer. Forcing a model to answer in its first token like that is believed to be a key reason for hallucinations.

This was a great paper on how language model hallucinations can snowball from the first few tokens. As they say, the language model first commits to an answer, before outputting the explanation. And this is a problem because transformers have limited reasoning abilities within a single time step, so they often cannot find the right answer in that one step.

And why don't language models like GPT-4 back down and change course halfway through? Well, they prioritize fluency and coherence at the expense of factuality. But rushing out an answer in your first token is particularly hobbling for questions like this one, which require deeper thought or calculation. It's fine for questions that need only memorized knowledge, but not for questions like these.

We went through all of the subjects in the MMLU and characterized around a third of them as requiring that kind of deeper thought. And of course, most ways that you use GPT-4 will also require some thought.

So how is the benchmark officially run? Open-source teams and groups like OpenAI and Google all draw on the dev set when testing a model.

Notice the five questions, each with a single-character answer. We were not, of course, the first team to realize that this underplays the abilities of the model. The Minerva paper from Google said this: "The standard way of evaluating on the MMLU is to construct a five-shot prompt out of the dev set." So what they did instead, like us, was to use a prompt which has a chain of thought before outputting the final answer.

You can see some examples below. Essentially, it allows the model to think a bit first and gives it a scratch pad. There are other theories though, like the length and detail of the exemplars triggering different weights of the model. This paper from two months ago used longer exemplars for the moral scenarios subject within the MMLU.

With these five custom exemplars plus self-consistency, which I'll get to in a moment, they saw accuracy go up to 80% from 68%. But before even this paper came out, Josh and I were sourcing and crafting bespoke exemplars for the 21 subjects we deemed would need the most working out.
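To make that concrete, here is a rough sketch of how a five-shot chain-of-thought prompt can be assembled from exemplars; the field names and wording are illustrative placeholders, not our actual bespoke prompts.

```python
# Illustrative sketch of a five-shot chain-of-thought prompt builder.
# The exemplar format here is a placeholder, not our actual bespoke prompts.
LETTERS = "ABCD"

def format_question(question: str, choices: list[str]) -> str:
    options = "\n".join(f"({LETTERS[i]}) {c}" for i, c in enumerate(choices))
    return f"Question: {question}\n{options}"

def build_prompt(exemplars: list[dict], question: str, choices: list[str]) -> str:
    """Each exemplar carries a worked reasoning chain before its letter answer."""
    blocks = []
    for ex in exemplars:
        blocks.append(
            format_question(ex["question"], ex["choices"])
            + f"\nLet's think step by step. {ex['reasoning']}"
            + f"\nFinal answer: ({ex['answer']})"
        )
    # The real question gets the same scaffold, so the model reasons first
    # and only then commits to a letter.
    blocks.append(format_question(question, choices) + "\nLet's think step by step.")
    return "\n\n".join(blocks)
```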

For the other subjects, we made do with the normal dev examples. OpenAI and Google don't do this for their benchmarking, underplaying the abilities of their own models. So why doesn't everyone do it, you might ask? My theory is that it's because you then have to hand-grade every answer: once the model is allowed to reason out loud, there is no single letter to auto-mark.

Essentially, you're taking the time to listen, which is the least that we can do, I feel, as we approach human-level intelligences.

To make checking easier, even though it still took weeks, we taught GPT-3.5 and GPT-4, through our exemplars, to always end with a final answer in the same format. Lesson one, therefore, for everyone watching, is don't make the first token the final answer. Lesson two comes from a paper on self-consistency, which in a nutshell says that taking the highest probability answer, sometimes called greedy decoding, doesn't always reflect the best answer the model is capable of.
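Before getting to lesson two properly, here is a tiny sketch of lesson one's bookkeeping: pulling the final letter out of a chain-of-thought reply, assuming (as our exemplars taught) the model closes with a line like "Final answer: (X)". The exact closing phrase is illustrative.

```python
import re

# Sketch: recover the single-letter answer from a chain-of-thought reply,
# assuming the exemplars teach the model to end with "Final answer: (X)".
FINAL_ANSWER = re.compile(r"Final answer:\s*\(?([ABCD])\)?", re.IGNORECASE)

def extract_answer(completion: str) -> str | None:
    matches = FINAL_ANSWER.findall(completion)
    # Take the last match; no match at all gets flagged for hand-grading.
    return matches[-1].upper() if matches else None
```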

In other words, don't take the model's first answer as its final answer. Take this example: the highest single-probability answer was this, and it was incorrect. For OpenAI, that would be it: incorrect, done. But sometimes, if you look at all the different answers that a model might give and then take the majority answer, the final answer that came up most often, it can get it right.

Interestingly, in the Minerva paper, they used 256 samples, although only 16 for the MMLU. OpenAI even put a little footnote in their GPT-4 technical report, admitting that they don't use that approach, but Google does. And yes, this can significantly affect the final results. Look at the boost as the number of samples goes up to 40.

This chart goes up to 40 sampling paths, and it hasn't fully leveled off yet. These aren't re-dos where you keep trying until you get it right. This is letting the model explore its full probability distribution of outputs and taking the truly most probable final answer. Letting the model think, not rushing it.
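Here is a minimal sketch of that idea, reusing the extract_answer helper from above; the model name, temperature, and sample count are illustrative assumptions, not our exact settings.

```python
# Self-consistency sketch: sample several reasoning paths at temperature > 0
# and keep the most frequent final answer. Settings here are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(prompt: str, n_samples: int = 9) -> str | None:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,   # above zero so the sampled paths actually differ
        n=n_samples,       # several completions from a single request
    )
    votes = [extract_answer(choice.message.content) for choice in response.choices]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None
```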

For our runs, we limited ourselves to 9 samples and took the majority vote. But of course, the results could have been dramatically better if we had done 40 samples, or indeed 256. Now, aside from these two hard-won lessons, which I'm going to show how all of you can benefit from, the other difference from the previous state-of-the-art 86.4% was that we did use the most current versions of each model.

So, the models may have independently gotten better or worse in certain topics. But I would say that our broad findings do run counter to any simple it's-got-dumber narrative. And if I had to guess, behind the scenes, OpenAI have implemented some fine-tuning involving step-by-step solutions, as I see that phrase cropping up in the middle of answers far more often than before.

And that particular trick from the original SmartGPT seems less effective than before. And now I'm going to ask Josh to talk about our state-of-the-art score, not only with GPT-4, but also with GPT-3.5. But just before I do, here is a hint of why the title talks about breaking a benchmark.

I'll show you how GPT-4 itself encouraged us to question many of the tests, leading to the discovery of at least 80, and likely hundreds of errors in the test. Enough to significantly affect final results by up to 2%. And given that the differences in, say, the open-source language model leaderboard come down to as little as 0.1 of a percent, that's pretty big.

Yes, we've been in contact with some of the authors of the test over the past month to check our findings, and I'll say more in a bit. But first, here is ML engineer Josh detailing how the magic happened. Josh, by the way, is a pretty precocious AI consultant working on a master's at Imperial College London.

Josh Stapleton: Hi everyone, nice to meet you all. My name is Josh Stapleton, and let me show you the version of SmartGPT we used. The SmartGPT framework is highly parallelized and can handle industry-scale use cases. We used a thread- and asyncio-based approach to make simultaneous calls to the API at the answer-option, answer, and subject levels, stacking parallelization upon parallelization.
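A rough sketch of that fan-out idea (not the actual SmartGPT codebase) might look like this, with a semaphore as a crude rate limiter:

```python
# Rough sketch of the fan-out idea: many prompts hit the API concurrently
# via asyncio, with a semaphore as a crude rate limiter. This is an
# illustration, not the actual SmartGPT codebase.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
limiter = asyncio.Semaphore(20)   # cap the number of in-flight requests

async def ask(prompt: str) -> str:
    async with limiter:
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def run_batch(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(ask(p) for p in prompts))

# answers = asyncio.run(run_batch(all_prompts))
```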

This led to crazy iteration speed boosts. For example, we were able to complete the final GPT-4 run in under two hours; generating single answer options in series would have taken weeks. We did two large runs using SmartGPT, first with GPT-3.5 and then GPT-4. The 3.5 run was on the entirety of the MMLU, nine samples for each of its 14,042 questions, over 126,000 answers in total.

This was a mammoth effort to manually grade, but the SmartGPT innovations and hard work ended up boosting GPT-3.5's performance by a significant 3.7%, from 70% to 73.7%. The GPT-4 run using SmartGPT also beat the OpenAI MMLU benchmark score substantially, and this run actually resulted in the discovery of a number of problematic MMLU questions, which Philip will talk about shortly.

The cost to run GPT-4 on all MMLU would have been too high for us to self-fund, having already each invested four-figure sums, so we used a representative subset of 2,850 questions from the total of 14,042, of course fully weighted to official standards. SmartGPT is a model-agnostic, parametrized, and highly flexible system that can be applied to disparate use cases.
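On that representative subset: one plausible way to keep a subset weighted like the full benchmark is to sample proportionally within each subject, along the lines of the sketch below (an illustration of the idea, not necessarily the exact scheme we used).

```python
# Illustration: draw a subject-proportional subset so each of the 57 subjects
# keeps roughly its share of the full 14,042 questions. Not necessarily the
# exact weighting scheme behind our 2,850-question run.
import random
from collections import defaultdict

def stratified_subset(questions: list[dict], target_size: int, seed: int = 0) -> list[dict]:
    random.seed(seed)
    by_subject: dict[str, list[dict]] = defaultdict(list)
    for q in questions:
        by_subject[q["subject"]].append(q)

    total = len(questions)
    subset: list[dict] = []
    for items in by_subject.values():
        k = max(1, round(len(items) / total * target_size))
        subset.extend(random.sample(items, min(k, len(items))))
    return subset
```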

We are already working on applications in a number of domains, in both the public and private sectors. The system is evolving and improving constantly under the hood as we continue to innovate. And while the current system can already get state-of-the-art results and handle enterprise-scale data, we are far from done.

There are a number of known ways to improve it, which we aim to implement in the near future. From better and more numerous automatically sourced exemplars, to LLM-driven prompt optimization, to fine-tuning, we are just getting started with SmartGPT. And we are uniquely positioned as a tiny team to integrate both our own ongoing improvements, as well as promising discoveries in the field as they arise.

Back to Philip. Here is the question that started it all off. As you can see, the question makes no sense. The text just says "demand reduction", and the answer options are combinations of numbers: 1, 3 and 4; 2, 3 and 4; 1, 2 and 3; or 1, 2 and 4. What on earth is that about? Now remember, it was only human grading that enabled us to spot this.

I reckon most companies like OpenAI rely on auto-grading by exact match. That would immediately toss out any answer like these as null, because an answer of A, B, C, or D hasn't been given. Now, I should say, it was human grading that caught this, and GPT-4 itself. Here is poor GPT-3.5 bravely giving an answer to a question that doesn't make any sense at all.
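As an aside on that auto-grading point, exact match is roughly the following, which is why a broken question never raises a flag; it just gets silently marked wrong.

```python
# Exact-match auto-grading in miniature: anything that isn't the bare
# correct letter is counted as wrong, so a nonsensical question never
# raises a flag during an automated run.
def exact_match_grade(model_output: str, correct_letter: str) -> bool:
    return model_output.strip() == correct_letter

print(exact_match_grade("B", "B"))                                  # True
print(exact_match_grade("None of these options make sense.", "B"))  # False, silently wrong
```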

I love how a couple of times it changed its mind and was like, no, no, no, D, not B. What then followed was weeks and weeks of me following up every quote-unquote mistake with the official source the question came from. When I found the original source, I realised what the problem was.

Sometimes, they just hadn't pasted all of these statements. When you can see all four of these statements, the answer options make a lot more sense. Now I know what some of you may be thinking. Maybe it's just business ethics, that's just one subject, and what, it's a dozen questions, what's the big deal?

Well, first of all, business ethics only has 100 questions, so 13 of them missing vital context completely undermines that entire subject. And second of all, it wasn't just business ethics, and it wasn't just this one problem. It wouldn't always be about missing statements; sometimes the whole thing was missing.

So let's take a look at the results of the first two questions.