I have three goals for this video. First, I want to show you a way of using GPT-4 to get smarter results. Second, I want to argue that the benchmark results we have for GPT-4 do not reflect its full abilities. And third, I want to show you a system that I am developing, somewhat cheekily called SmartGPT, that is already showing significant results on official benchmarks.
It remains to be fully optimized, which I think is exciting in itself. I have shown the system to people at OpenAI who have been quite impressed, and I'm going to end with some reflections on where that might leave us for GPT-5. But before I get into how it works, I just want to show you one example of it in action to whet your appetite.
This example comes from a TED talk that was released this week. So suppose I left five clothes to dry out in the sun, and it took them five hours to dry completely. How long would it take to dry 30 clothes? GPT-4, the newest, greatest AI system says 30 hours.
Not good. On the left, you can see GPT-4's original answer, and it gives this answer pretty consistently whenever you prompt it with the question provided. On the right, you can see the final answer from the SmartGPT model, which is correct (still around five hours, since the clothes dry at the same time), and it consistently gives that answer. I really like how it gives context as well, and it states some of the assumptions it made in reaching this correct answer.
Now, don't you worry, there will be plenty more examples to go through in this video, including another one from that TED talk. But first, I want to give you an overview of what this SmartGPT model is, where I got my inspiration for it, and how it works.
I'm going to keep it fairly simple because it's the beginning of the video, and I know a lot of people won't really care about the inner details; those will come later in the video. But the high-level overview is this: there are at least three things that have been proven to improve the outputs of GPT-4.
First, what's called chain-of-thought prompting, sometimes called step-by-step prompting. Second, reflection, or finding its own errors; I did an entire video on this called GPT-4 Can Self-Improve. And third, dialoguing with itself, entering into a back-and-forth on its own outputs and deciding which one is best. You can see the titles of the papers, which contain much more detailed results, of course, linked above.
Now, the first paper only came out a few days ago, midway through my testing, so my results don't even reflect the full capacity of the model. And even if there's nothing else you take from this video, the findings from this paper can instantly improve the outputs you get from the model.
So I'm going to show you how to instantly improve the outputs you get from GPT-4. Many of you might remember that prompting GPT-4 with let's think step-by-step improves its results. To give you a very quick reference point, just asking a question to GPT-4 gives you 81% accuracy. With that prompt, let's think step-by-step, it goes up to 86%.
But algorithmically, the paper found an improved prompt that can give you even better results: 89% accuracy. So let's go to the first part of SmartGPT: to the user's question we add "Answer: Let's work this out in a step by step way to be sure we have the right answer." Now, I have so much to say about why I think this works, but I know many of you won't be that interested in my theories.
So I'm going to save them to the end for those who are interested. Some of you just want the results. So I'm going to get to those first. So far, you might be thinking, well, thanks, Philip, that's a cool prompt. I'm going to use that. But what's this whole SmartGPT about?
Is it just a single prompt? No. I believe, with evidence, there are ways of leveraging even better results than just using a great chain-of-thought prompt. So let's move on to the next part of the system: these different outputs in the middle.
For my tests, I typically did three outputs, but of course, depending on the context window, it could be far more than that. And I'm going to talk about ways I, or we, could further improve this model later on in the video. Just to restate, these outputs are generated by taking the user input, adding "Question:" at the start, and then at the end adding "Answer: Let's work this out in a step by step way to be sure we have the right answer." A rough sketch of that format is below.
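To make that format concrete, here's a minimal sketch in Python of how the prompt could be assembled and sampled several times. This is my reconstruction for illustration rather than the exact code behind the video, and the model name, temperature, and openai client interface are assumptions that may differ with your library version and API access.

```python
# A minimal sketch of the SmartGPT prompt format (a reconstruction, not the
# exact script from the video); the openai client call and model name are
# assumptions that may differ with your library version and API access.
import openai  # assumes openai.api_key is set, e.g. via OPENAI_API_KEY

COT_SUFFIX = "Answer: Let's work this out in a step by step way to be sure we have the right answer."

def build_prompt(user_input: str) -> str:
    # "Question:" at the start, the optimized chain-of-thought cue at the end.
    return f"Question: {user_input}\n\n{COT_SUFFIX}"

def generate_outputs(user_input: str, n: int = 3) -> list[str]:
    # Ask for several independent completions of the same prompt; because
    # sampling is non-deterministic, the outputs can disagree with each other.
    response = openai.ChatCompletion.create(
        model="gpt-4",  # or whichever model you have API access to
        messages=[{"role": "user", "content": build_prompt(user_input)}],
        n=n,
        temperature=0.7,
    )
    return [choice.message.content for choice in response.choices]
```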
And at this moment, many of you are thinking, what is the point of multiple outputs? It's GPT-4, it's just going to give you the answer it thinks is best, and that's it.
Well, actually, it doesn't quite work like that. These models have a temperature between zero and one. I believe the default for GPT-4 might be around 0.5. And simplifying massively, this determines how creative or conservative the model is in giving its outputs. So given that GPT-4 tries to be fairly creative, you don't get the same output every time.
The output is randomly sampled according to an internal probability distribution. So you can get situations, and I faced this hundreds of times, where some of the outputs are correct and others are incorrect. And this is where reflection comes in. Sometimes, definitely not always, but quite often, GPT-4 can detect the errors in its own output.
And many of you will notice at this point that the prompt that I used to elicit GPT-4 to spot its own errors contains the same step-by-step prompt I used earlier that has been shown to produce good results. So to summarize, sometimes at this stage, GPT-4 detects the errors that some of its outputs have made.
Definitely not always. There are certain questions where it simply can't spot the error. But sometimes it can, and then I get it to engage in a dialogue using a format similar to one in this paper published last month. It's a short dialogue, and this is the step I believe can be most optimized; a rough sketch of the prompts involved is below.
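As a rough illustration of that two-step dialogue, here are the kinds of researcher and resolver prompt templates I'm describing. The wording below is a reconstruction for illustration, not necessarily the exact prompts used in my tests.

```python
# Illustrative researcher / resolver prompt templates (wording is a
# reconstruction, not the exact prompts from the video).
# {n} is the number of candidate answers being examined.
RESEARCHER_PROMPT = (
    "You are a researcher tasked with investigating the {n} answer options provided. "
    "List the flaws and faulty logic of each answer option. "
    "Let's work this out in a step by step way to be sure we have all the errors:"
)

RESOLVER_PROMPT = (
    "You are a resolver tasked with finding which of the {n} answer options "
    "the researcher thought was best, improving that answer, and printing the "
    "improved answer in full. "
    "Let's work this out in a step by step way to be sure we have the right answer:"
)

# The researcher prompt is appended to the candidate answers; the resolver
# prompt is then appended to the researcher's critique, so the critique
# stays in context for the final, improved answer.
```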
In the future, I envision an entire council of advisors made up of GPT-4 imitating mathematicians, judges, etc. At the moment, it's just a resolver printing a final improved output. Anyway, I'm going to get back to the theory later in the video, because I know some of you will be getting bored at this stage and want to see more practical examples and the results from my benchmark tests.
As I don't have the GPT-4 API key, yes, I had to input each of these steps manually hundreds of times, waiting sometimes three hours between each go, because you can only do 25 messages every three hours. On the left, you can see the three outputs when you ask it to think step by step.
And then you have the researcher step in the middle and at the top right, and finally, the resolver step. Notice here, I was using the original "Let's think step by step" because the paper on improving that prompt hadn't yet been published. It's time for the second example from that TED talk, and then I definitely will get on to the benchmarks.
A different one. I have a 12-liter jug and a 6-liter jug, and I want to measure 6 liters. How do I do it? Just use the 6-liter jug, right? GPT-4 spits out some very elaborate nonsense. Of course, I tested SmartGPT with that question, and you can see the difference between the original GPT-4, which gives this incredibly convoluted, bad answer, and SmartGPT's final answer output.
Now, at this point, I know many of you will be impressed, but you'll be thinking, I don't have time to input things five times. Well, I'm developing a program where it can all be done automatically. Here is a preview of how it works. But of course, at the moment, it has to use GPT-3.5 Turbo, because I don't have the GPT-4 API key.
But the epic thing is this. You just ask a single question. I've written, ask SmartGPT a question. And of course, it does take a little bit longer to respond because it's doing five or six calls via API. But it does output the final answer from the resolver. I will be honest and say that GPT-3.5 isn't as good at reflecting or resolving.
But this is an example of a question where the original ChatGPT consistently gets it wrong, and SmartGPT-3.5 gets it right using this program. Remember, all you have to do as a user is type in a question as normal, and it goes through this entire five or six step process behind the scenes.
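For those curious, here's a rough end-to-end sketch of what that automation could look like in Python. It mirrors the five-call flow described above (three drafts, one researcher pass, one resolver pass), but the prompt wording, temperature, and model name are assumptions rather than the exact code of my Replit program, and the openai interface shown is the pre-1.0 ChatCompletion API, which may differ in newer library versions.

```python
# A rough end-to-end sketch of the automated SmartGPT pipeline, assuming the
# pre-1.0 openai Python package and gpt-3.5-turbo (as in the Replit version);
# prompt wording, temperature, and model name are assumptions you can swap out.
import openai  # assumes openai.api_key is set, e.g. via OPENAI_API_KEY

MODEL = "gpt-3.5-turbo"
COT = "Let's work this out in a step by step way to be sure we have the right answer."

def chat(prompt: str, temperature: float = 0.7) -> str:
    # One API call: one user message in, one assistant message out.
    response = openai.ChatCompletion.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

def smart_gpt(question: str, n: int = 3) -> str:
    # Step 1: sample several independent chain-of-thought answers.
    drafts = [chat(f"Question: {question}\nAnswer: {COT}") for _ in range(n)]
    options = "\n\n".join(f"Answer option {i + 1}:\n{d}" for i, d in enumerate(drafts))
    # Step 2 (researcher): ask the model to critique every draft.
    critique = chat(
        f"Question: {question}\n\n{options}\n\n"
        f"You are a researcher investigating the {n} answer options provided. "
        f"List the flaws and faulty logic of each answer option. {COT}"
    )
    # Step 3 (resolver): pick the best draft, improve it, print it in full.
    return chat(
        f"Question: {question}\n\n{options}\n\nResearcher critique:\n{critique}\n\n"
        f"You are a resolver. Based on the critique, decide which answer option is best, "
        f"improve it, and print the improved answer in full. {COT}"
    )

if __name__ == "__main__":
    print(smart_gpt("I have a 12 liter jug and a 6 liter jug and I want to measure 6 liters. How do I do it?"))
```

With n=3 drafts this makes five API calls per question, which is why each response takes noticeably longer than a single ChatGPT message.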
By the way, this was a question from the MMLU, which is a famous benchmark that I'll get to in a second. Here's one last practical example before I get to that benchmark. I know many teachers use ChatGPT and GPT-4 to create quizzes for their classes. And here is the same question put through GPT-4 and SmartGPT.
The question is: create a high school algebra quiz with five questions, with answers and explanations at the end. Now, points for spotting the difference. But if the teacher had handed out the original quiz, look at the answers for question five. It says the answers are 1 and 1.5, but then in the explanation it gives the final answers, which are correct by the way, of 3 and 0.5.
So that would really confuse some students. At the reflection stage, SmartGPT spotted that error and resolved it. And as you can see, the answer for question five has the correct answers straight away. If at any point you're wondering if I completed the OpenAI ChatGPT prompt engineering course, the answer is yes, but it didn't inform too much of my thinking.
It was more for beginners, and I had already factored in things like giving the model time to think and writing clear instructions. The benchmark that I chose to test SmartGPT on was the famous MMLU, Massive Multitask Language Understanding benchmark. As you can see, the state of the art is indeed GPT-4 with 86.4% accuracy.
And you know OpenAI think it's a big deal, because it's the benchmark mentioned on the front page of their technical report. Without boring you too much, I extracted the questions from the test set of the MMLU data file. And I didn't pick the topics at random.
I went for those that I thought GPT-4 would find the hardest. Delving into the original MMLU paper, you can see that GPT-3 found formal logic the hardest, scoring just over 25%, which is random chance. It's a multiple-choice test with four options, so around 25 or 30% is pretty bad.
And notice they helped out GPT-3 here. They did it few-shot, meaning they gave it five successful examples before asking it a new question. It's the same thing they did with GPT-4: they did it five-shot.
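For anyone unfamiliar with the format, here's a minimal sketch of what a five-shot prompt looks like compared with zero-shot; the example items are placeholders, not the actual MMLU questions they used.

```python
# Sketch of the five-shot format: five worked examples, then the real question.
# The example items below are placeholders, not actual MMLU questions.
examples = [
    ("Example question 1", "Example worked answer 1"),
    ("Example question 2", "Example worked answer 2"),
    ("Example question 3", "Example worked answer 3"),
    ("Example question 4", "Example worked answer 4"),
    ("Example question 5", "Example worked answer 5"),
]

def five_shot_prompt(new_question: str) -> str:
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    # Zero-shot, by contrast, would be just the final "Question: ... Answer:" pair.
    return f"{shots}\n\nQuestion: {new_question}\nAnswer:"
```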
But just before I show you the results, there are three things I want to mention. First, I was curious how SmartGPT would do without any help, zero-shot. Second, I wanted to do it zero-shot because people using GPT-4 don't typically give five successful examples before asking it a question; they just want code, or a quiz, or a poem, or an example, and they don't often provide five brilliant examples before asking their question. And third, if I can prove it works zero-shot, then of course future refinements can be made to push the results even further. And here are the results from the first 25 questions from the formal logic test set of the MMLU.
I did many more tests after this, but you can see from this set that if you just ask the question, you get a lower overall accuracy. But of course, 68% for GPT-4 is still a huge improvement over GPT-3's roughly 25%. What happens when you add "Let's think step by step", which as we know now isn't the fully optimized chain-of-thought prompt?
Well, on average you get around 74-75%. That was 75 examples inputted manually and I still have all the tabs open. I'm keeping them open because I'm compiling a spreadsheet with the actual outputs. But what did the resolver get drawing upon GPT-4's ability to reflect and engage in dialogue with itself?
It got 84%. Now notice something about that number. GPT-4 zero-shot got 32% of the questions wrong; that was halved to 16% after putting it through the SmartGPT system. There was one question where the resolver model gave both a correct and an incorrect answer, but I'm counting that as incorrect for the purposes of this test.
Anyway, from 32% to 16% incorrect. That is a pattern that stayed consistent throughout all my testing. Ultimately, half of the errors that GPT-4 makes can be rectified if you give it the optimized step by step prompt, get it to reflect on its results and get it to engage in dialogue and decide on a final answer.
At this point, for those people losing track of all the details, I want to put into context what resolving half of the errors on the MMLU might mean in the big picture. Here's Lennart Heim, an AI governance researcher, suggesting that a score of 95% on the MMLU would be reflective of AGI-like abilities: "I do think I have like a 50% chance, like within the next 20 years or so, there might be something what we might call an AGI or transformative AI. What do I mean by this? Well, maybe we can measure it on benchmarks. There's like this famous MMLU benchmark, like, yeah, there's something which, like, scores like 95% on this." Going back to the results: if a SmartGPT-like system can automatically resolve half of the errors that GPT-4 makes on the MMLU, that would increase its score from around 86.4% to around 93%, which is not far off 95%.
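To spell out that arithmetic (a rough back-of-the-envelope figure that assumes the halving of errors holds at that scale): 86.4% accuracy means 13.6% of questions wrong; halving those errors leaves 6.8% wrong, which corresponds to roughly 93.2% accuracy.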
Remember, his prediction was a 50% chance in 20 years; I'm talking about GPT-4 now. For those who are still skeptical, I'm going to show you plenty more results now, and then walk through the papers that give the theory as to why this works. One thing that I forgot to mention earlier is that the human expert level on the MMLU is 89.8%, and that's taking the 95th percentile of human test takers.
And remember, those are domain experts in each of the subtopics; what we're doing is testing GPT-4, or SmartGPT, on all of the topics simultaneously. So even if SmartGPT-like systems can't quite reach 95%, and I think honestly they'll get pretty close with all the refinements that I'm going to suggest, I think they should almost certainly beat 89.8%, which is the human expert test-taker level.
Intrigued by these results, I then put it through the college math test from the MMLU. And remember, this was before using the optimized version of the step-by-step prompt. Obviously, I'm not going to go through all the questions here, but let's skip to the final results.
We have zero shot accuracy, 6 out of 15, which is 40%. The average when you add let's think step-by-step was 53.5%. And then the final output of the resolver model had a 60% accuracy. So it couldn't quite resolve half of the errors, but the overall pattern held up. In case anyone is wondering about methodology, I kept the formatting identical for every question.
I always opened a new tab for each question. It wasn't looking at the context of what it had already put out. Each attempt was fresh, aside from the resolver model, which looked at the context of the researcher's output. And again, as you can see from example 14, it wasn't like the researcher could always spot the errors, or that the resolver could always pick the right option.
Sometimes the "Let's think step by step" prompt gave the right output, but the resolver couldn't quite distinguish it. The optimized prompt gets a slightly better output; upon reflection, the researcher can sometimes, but not always, spot the errors in those outputs; and the resolver can sometimes, but not always, decide, based on those flaws, which answer is best. These are incremental improvements. Sometimes GPT-4 simply can't get it right. I have noticed a few themes in those questions: anytime it comes to division, multiplication, characters, or counting in general, GPT-4 tends to make mistakes that neither the researcher nor the resolver can spot.
Of course, integrating a few tools via API would likely solve those issues. And I don't want to preempt the conclusion too much, but I believe a SmartGPT-like system with tools integrated could probably score around 95% right now on the MMLU, especially if it was helped out with few-shot prompting.
To add weight to that preliminary conclusion, I tested it on certain topics and had to stop because it simply got the questions right every single time. For example, high school psychology from the MMLU. I then tried prehistory, which it also aced, before finding machine learning, where I got more interesting results.
Zooming in, this time the raw score was 65%, the chain-of-thought "Let's think step by step" average was 71.6%, and the resolver model got 80%. Let's now look a little deeper into why all of these steps might improve the end result. In reply to the original "Let's think step by step" paper, which was published around a year ago, Andrej Karpathy said this: adding something like "Let's think step by step" to the prompt is a way of using the input space for computation that you'd normally want in the hidden state of the model.
Instead of the workings out being done in the activations of the neural network, it's done in the discrete tokens of that input space." And he adds: "Did not super see this coming." And here is the paper released three days ago that improves upon that original prompt. They also did their testing zero-shot like me.
And they tested many prompts, starting, like I did, with just direct prompting: just asking the question, like 99% of users of GPT-4 would do. And then they tried, like me, the well-established "Let's think step by step" prompt. They also iteratively tested seven original prompts, as well as the prompt that I've now integrated into SmartGPT:
the "Let's work this out in a step by step way to be sure we have the right answer" prompt. They share my opinion that zero-shot prompting setups have the benefit of not requiring such task-dependent selection of exemplars; you don't have to find correct examples, it just does it all for you. Here are the end results for GPT-4 that we saw earlier, showing the difference between asking your question directly and using these refined prompts. Notice that this technique is somewhat model-dependent, and it doesn't have the same effect on smaller or weaker models. Before we move on to the next paper, there is one somewhat failed prompt that I want to pick up on.
It's this self-critique prompt where they ask: "Answer the question, then critique the answer. Based on the critique, reconsider the other answer options and give a single final answer." And you might wonder why that prompt didn't perform best, when we know that reflection and dialogue can work. My theory is that it's because it's trying to do all of it in one prompt.
Through my hundreds of experiments, I've noticed that GPT-4 can only handle so much in one go. It simply gets overwhelmed or confused if you ask it to do too much in one prompt. That's why I broke my model into stages to allow it to show off each of its abilities one by one.
And before we get to the other papers, what's my personal theory as to why this eliminates up to half of the errors that GPT-4 makes? Well, my guess is this: remember that GPT-4 is drawing on a vast dataset of internet text. And let me ask you, what kind of text has things like "Question:", "Answer:", "Let's work this out", "be sure we have the right answer"? The kind of data that would have that text would be things like tutorials or expert breakdowns. So I believe you're triggering more of the weights inside GPT-4 that relate to things like expert tutorials, and so inevitably you're getting slightly better answers. Next, I've already explained why you get different outputs when you give the exact same prompt.
That's down to sampling and the temperature of the model. But to simplify massively, sometimes GPT-4 will give you an output that it knows isn't the most probable. It introduces some randomness into its sampling. By generating multiple outputs, you're getting a larger sample size, reflecting the full range of probabilities that GPT-4 ascribes to its outputs.
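As a toy illustration of what temperature does to that sampling (a simplified sketch with made-up numbers; the real model samples over a vocabulary of tens of thousands of tokens, not three canned answers):

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    # Divide the logits by the temperature, then softmax them into probabilities.
    # Lower temperature sharpens the distribution (more conservative);
    # higher temperature flattens it (more creative, more varied outputs).
    scaled = {token: logit / temperature for token, logit in logits.items()}
    peak = max(scaled.values())
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    probs = {token: value / total for token, value in exps.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Made-up scores for three candidate answers to the drying question:
toy_logits = {"about 5 hours": 2.0, "30 hours": 1.5, "it depends": 0.5}
print([sample_with_temperature(toy_logits, 0.7) for _ in range(5)])
```

Run it a few times and you'll see the most likely answer usually wins, but not always, which is exactly why the same prompt can produce a mix of correct and incorrect outputs.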
You're reducing a little bit some of the randomness that's inherent in GPT-4's outputs. Next, I believe that GPT-4 can sometimes spot its own errors through reflection, because prompting like this triggers a different set of weights. You could almost think of it as a different mindset, one more focused on finding errors.
Again, if the question is too hard or involves counting, characters, division, or multiplication, as I said earlier, this won't help. But a percentage of the time, it can spot its own errors and point them out. Notice this is a separate bit of inference, not lumped into the original prompt. And when it does successfully point out the errors, it can often engage in this dialogue with itself.
Notice, in a meta kind of way, I'm using the step-by-step prompting to improve the reflection and dialogue. So those are my theories as to why it works. But at the end of the video, I'm going to show you at least five ways I think the model can be further refined.
Before we do, though, I looked up the paper by Zhou, which produced the prompt that did the best in the previous paper. They came to that special prompt through automatic prompt engineering. But there's something interesting I want to point out. On page seven, they say they used automatic prompt engineering to find a prompt starting with "Let's" that maximizes the likelihood of correct reasoning steps.
Then they found the best one, the one I integrated into SmartGPT: "Let's work this out in a step by step way to be sure we have the right answer." That's the one I want you to use. And they ran their own benchmarks, and of course it did improve the scores. But the interesting thing to me is that they started with "Let's" each time.
So even that first stage of the model might not yet be fully optimized. Maybe there's a prompt that doesn't begin with "Let's" that improves this initial result still further. Anyway, back to the papers. I know many people watching this will wonder if I read the paper Boosting Theory-of-Mind Performance in Large Language Models via Prompting. And yes, I did, because they tested something similar for a theory-of-mind test. Using similar techniques, they were able to get theory-of-mind accuracy for GPT-4 from 80% to 100%. And they conclude that these results demonstrate that appropriate prompting enhances large language model theory-of-mind reasoning.
And they underscore the context dependent nature of these models' cognitive capacities. They used that original prompt, and they used the same formula for the first two examples. So they did improve the results dramatically. And as I theorized earlier, adding few shot examples would push this still further. This is part of why I think that 95% barrier on the MMLU will be broken probably this year by GPT-4.
A few other points from this paper. They admit that there is not currently a theoretical understanding of why these prompting methods are so important; they are simply not sure why these prompting techniques are beneficial.
I've given you my theory and Karpathy's, but no one quite knows for sure. Lastly from this paper, and I found this really interesting: giving it generic few-shot prompts that weren't directly about theory of mind actually improved the outputs slightly more than giving it direct theory-of-mind examples. This opens the door to the first of the five ways I anticipate SmartGPT getting even smarter.
It could be possible to come up with generic few shot prompts that could be automatically integrated into the model that don't necessarily relate to the topic at hand. This graph shows the impact of adding few shot examples to GPT-3. And if this can be done in a generic way for GPT-4, results could be improved still further.
Next, the boosting theory-of-mind paper speculates that integrating some of these approaches could boost the performance of weaker models to beyond the level of GPT-4's zero-shot accuracy. Next, here is the original DERA paper that inspired me to have the researcher and resolver dialogue at the end of SmartGPT.
As they say, the DERA approach shows significant improvement over base GPT-4 performance. And these were open-ended questions, by the way, not multiple choice, so this is more generally applicable than you might think. You can see from this table how results improved after engaging in this dialogue. And that brings me to the second way I anticipate SmartGPT getting smarter in the future.
A longer and richer dialogue. At the moment, we have this simple researcher-and-resolver two-step dialogue, but I can imagine a council of advisors. You can imagine a mathematician chipping in, and a philosopher, and a professor, each one tapping into slightly different weights of GPT-4, extracting more hidden expertise.
I'm not saying that would transform the results, but it might edge them another few percent higher. Next, even with longer dialogues and different experts, we could find ways of optimizing these prompts, just like we did with the original "Let's think step by step". That's the third avenue of improvement that I envisage.
Because I came up with these prompts, I'm sure they could be improved. Next, we could experiment with different temperatures. Remember, a lower temperature makes the model more conservative. A higher one, towards one, makes it more creative. We could experiment with a higher temperature to produce a more diverse range of outputs at this stage.
And then perhaps a more conservative, deterministic temperature for the final judge or resolver. It might not work, but it's worth trying.
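As a tiny sketch of that fourth idea, these are the kinds of per-stage settings I have in mind; the values are untested guesses, and they would simply be passed as the temperature argument of each stage's API call in the earlier pipeline sketch.

```python
# Hypothetical per-stage temperature settings for the pipeline sketched earlier;
# the values are untested guesses to experiment with, not proven settings.
DRAFT_TEMPERATURE = 1.0       # higher: more diverse candidate answers
RESEARCHER_TEMPERATURE = 0.5  # middling: the critique needs some flexibility
RESOLVER_TEMPERATURE = 0.0    # low: a conservative, near-deterministic final judgment
```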
And the fifth improvement is one I know would work: integrating APIs for character counting, calculators, code interpreters, etc. Spending these weeks manually sorting through the outputs of GPT-4 on these benchmarks, I can really see where it goes wrong. And it's often by getting letters in the wrong order, or making mistakes with division. It gets the high-level logic right, and then makes quite simple errors. Basic tool integration would, I am sure, push the results still higher. Now, I know this isn't my usual video, and trust me, I have been following the AI news, and we'll get back to that very soon.
I'm determined to make those improvements and push SmartGPT even further, but of course that would be aided massively by getting access to the plugins and the GPT-4 API key. So far I've had to do all of this manually, which was a lot of work. Now as you saw earlier, I have drawn on GPT-4 to help me develop a project.
I'm also working on a program in Replit to automate this process, but at the moment it's GPT-3.5, and honestly the context window really limits the ability. But I do look forward to the day when I can integrate GPT-4, and put this out as an automatic model for people to test and play about with.
I am sure that something similar will ultimately be incorporated by OpenAI itself, maybe as a thoughtful mode, or smart mode, a bit like Bing has creative, precise, balanced, etc. Each response does take longer, but as you've seen the outputs are noticeably better. If the results of models like this one do officially exceed the 86.4% that OpenAI talked about in the GPT-4 technical report, I do think that would reveal quite a few things.
First, that OpenAI isn't even aware of the full capabilities of its own model. I don't even know if they anticipated things like AutoGPT. I do think it would reveal that they need to do far more proper testing of their models before they release them. They should make falsifiable predictions about what their models won't be capable of.
That way we would know just how much they know about their own models. What we're trying to avoid is a situation where OpenAI say their model can only achieve X, and then when they release the model in the wild, someone comes along and achieves Y, where Y is much more impactful than X.
So those were the goals of this video: to show you how to get more out of GPT-4, to run you through some of the fascinating papers that have been released in the last few days and weeks, to show you what this system could do on some official benchmarks, and to suggest ways it might get better in the near-term future.
Of course, if you have a GPT-4 API key, or are an expert in benchmarking systems like GPT-4, I'd love to hear from you. I guess the final goal was to perhaps suggest to you that OpenAI don't know as much about their own models as they might lead you to believe.
Thank you so much for watching to the end, and have a wonderful day.