GPT-4 can improve itself by reflecting on its mistakes and learning from them. Even if the world does pause AI development, GPT-4 will keep getting smarter. Drawing upon the stunning Reflexion paper and three other papers released only in the last 72 hours, I will show you not only how GPT-4 is breaking its own records, but also how it's helping AI researchers to develop better models.
I will also cover the groundbreaking Hugging GPT model, which, like a centralized brain, can draw upon thousands of other AI models to combine tasks like text-to-image, text-to-video, and question-answering. The Reflexion paper and the follow-up Substack post that caught global attention were released only a week ago, and yes, I did read both, but I also reached out to the lead author, Noah Shinn, and discussed their significance at length.
Others picked up on the results, with the legendary Andrej Karpathy of Tesla and OpenAI fame saying that this metacognition strategy revealed that we haven't yet seen the max capacity of GPT-4. So what exactly was found? Here is the headline result. I'm going to explain and demonstrate what was tested in a moment, but look how they used GPT-4 itself to beat past GPT-4 scores using this reflection technique.
This isn't any random challenge. This is HumanEval, a coding benchmark of handwritten programming problems designed to test whether models can write correct, working code. This is a test that was designed by some of the most senior AI researchers just two years ago. The designers included Ilya Sutskever of OpenAI fame and Dario Amodei, who went on to found Anthropic.
These are realistic, handwritten programming tasks that assess language comprehension, reasoning, algorithms, and mathematics. So how exactly did GPT-4 improve itself and beat its own record? Because remember, in the distant past of two weeks ago, the GPT-4 technical report had it scoring 67%, not 88%. Here is an example from page nine of the Reflexion paper.
As you can read in the caption, this was a HotPotQA trial, designed specifically such that models needed to find multiple documents and analyze the data in each of them to come up with the correct answer. Notice how initially a mistake was made on the left by the model, and then, at the bottom, the model reflected on how it had gone wrong.
In a self-contained loop, it then came up with a better strategy and got it right. The authors put it like this: "We emphasize that LLMs, large language models, possess an emergent property of self-reflection," meaning that earlier models couldn't do this, or couldn't do it as well.
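To make that loop concrete, here is a minimal sketch of how such a reflect-and-retry cycle could be wired up. This is not the authors' code: the chat helper, the prompts, and the external evaluate check are my own assumptions, and the calls use the 2023-era openai Python SDK.

```python
import openai  # assumes the 2023-era SDK (openai.ChatCompletion) and an API key

def chat(messages, model="gpt-4"):
    """Hypothetical helper: one chat completion call, returning just the text."""
    response = openai.ChatCompletion.create(model=model, messages=messages)
    return response["choices"][0]["message"]["content"]

def solve_with_reflection(task, evaluate, max_attempts=3):
    """Try a task, reflect on failures, and retry with the reflections in context."""
    reflections = []  # the model's own notes on what went wrong
    answer = ""
    for _ in range(max_attempts):
        prompt = task
        if reflections:
            prompt += "\n\nLessons from previous failed attempts:\n" + "\n".join(reflections)
        answer = chat([{"role": "user", "content": prompt}])
        if evaluate(answer):  # an external check, e.g. unit tests or an exact-match answer
            return answer
        # Ask the model to diagnose its own mistake before the next attempt
        reflection = chat([
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "That answer was wrong. In two sentences, say what went wrong and how to fix it."},
        ])
        reflections.append(reflection)
    return answer
```

The key design choice is that the model never sees the correct answer, only its own reflections on why the previous attempt failed.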
It's a bit like GPT models are learning how to learn. And in case you think the model was just blindly trying again and again until it was successful: no, it wasn't. This was another challenge, called ALFWorld, and look at the difference between success without reflection and success with reflection.
I discussed this, of course, with the lead author, and the goal was to distinguish learning curves driven by self-improvement from simple probabilistic success over time. If you're wondering about ALFWorld, by the way, it's about aligning text and embodied environments for interactive learning. For example, in a simulated environment, the model had the task of putting a pan on the dining table, and it had to understand and act on that prompt.
So as you can see, this ability to reflect doesn't just help with coding; it helps with a variety of tasks. At this point, I want to quickly mention something. I know there will be a couple of well-versed insiders who say: didn't GPT-4 actually get 82% on HumanEval in the Sparks of AGI paper?
Of course, I did a video on that paper, too, and asked the author of Reflexion about this point. There are a few possibilities, such as prompting changes and the Sparks authors having access to the raw GPT-4 model. But either way, it is the relative performance gain that matters: whichever baseline you start with, GPT-4 can improve on it with reflection.
And the 88% figure is not a cap. The author has observed results in the last few hours as high as 91%. But before I go on, I can't resist showing you the examples I found through experimentation and also shared with the author. Take this prompt that I gave GPT-4.
Write a poem in which every word begins with E. Now, as you can see, it did a good job, but it didn't fully get it right. Look at the word "ascent", for example. Without mentioning anything specific, I then just wrote, "Did the poem meet the assignment?" Not even a particularly leading question, because it could have just said, "Yes." GPT-4 then said, "Apologies, it appears the poem I provided did not meet the assignment requirements.
Not every word begins with the letter E. Here is a revised poem with every word beginning with the letter E." Remember, I didn't help it at all. And look at the results: every word begins with E.
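If you want to reproduce this, the whole technique is just a neutral follow-up question in the same conversation. Here is a minimal sketch, assuming the 2023-era openai Python SDK and an API key; the prompts are the ones from my experiment.

```python
import openai  # assumes the 2023-era SDK and an API key in OPENAI_API_KEY

messages = [{"role": "user", "content": "Write a poem in which every word begins with E."}]
first = openai.ChatCompletion.create(model="gpt-4", messages=messages)
poem = first["choices"][0]["message"]["content"]

# The entire trick: keep the conversation going and ask a neutral follow-up question.
messages += [
    {"role": "assistant", "content": poem},
    {"role": "user", "content": "Did the poem meet the assignment?"},
]
second = openai.ChatCompletion.create(model="gpt-4", messages=messages)
print(second["choices"][0]["message"]["content"])  # typically an apology plus a corrected poem
```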
How far can we take this? For the next example, I chose mathematics and asked, "Write me a five-question multiple-choice quiz to test my knowledge of probability, with correct answers and explanations at the bottom. There should only be one correct answer per question." It comes up with a decent quiz, but notice a problem in question three, for example. The probability of drawing an ace or a king is indeed eight out of fifty-two, but that simplifies down to two out of thirteen, so two of the answers are correct.
And I explicitly asked it not to do this in the prompt. So can the model self-reflect with mathematics? Kind of, almost. Look what happens. First, I give a vague prompt: "Did the quiz meet the assignment?" GPT-4 fumbles this and says, "Yes, the quiz did meet the assignment." Hmm.
So I try, "Did the quiz meet all of the requirements?" And GPT-4 says, "Yes." So I did have to help it a bit and said, "Did the quiz meet the requirement that there should only be one correct answer per question?" That was just enough to get GPT-4 to self-reflect properly.
And it corrected the mistake. But I must say it didn't self-correct perfectly. Notice it identified C and D as being correct and equivalent when it was B and D. But despite making that mistake, it was able to correct the quiz. In case you're wondering, the original ChatGPT or GPT-3.5 can't self-reflect as well.
I went back to the poem example, and not only was the generated poem full of words that didn't begin with E, the self-reflection was also lacking. I said, "Did the poem meet the assignment?" And it said, "Yes, the poem meets the assignment." As the lead author Noah Shinn put it, "With GPT-4, we are shifting the accuracy bottleneck from correct syntactic and semantic generation to correct syntactic and semantic test generation." In other words, if a model knows how to test its outputs accurately, that might be enough, even if its initial generations don't work. It just needs to be smart enough to know where it went wrong.
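On coding benchmarks like HumanEval, that idea roughly cashes out as letting the model write both a candidate solution and its own unit tests, running those tests, and feeding any failure back as the reflection. Here is a sketch of just the test-running part; it is my own illustration of the idea rather than the Reflexion implementation, with the outer prompting loop only described in comments.

```python
import subprocess
import sys
import tempfile

def run_self_tests(solution: str, tests: str) -> tuple:
    """Execute model-written unit tests against a model-written solution.
    Returns (passed, error_output) so any failure can be fed back as a reflection."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests + "\n")
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return result.returncode == 0, result.stderr

# Sketch of the outer loop (prompts are illustrative):
# 1. ask GPT-4 for a solution to the programming problem
# 2. ask GPT-4 to write unit tests for the same problem
# 3. if run_self_tests() fails, send the stderr back with
#    "Your solution failed these tests; explain the bug and rewrite the function", and repeat.
```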
Others are discovering similar breakthroughs. This paper from just three days ago comes up with its own self-improvement technique. It gets GPT-4 to frame its dialogue as a discussion between two agent types, a researcher and a decider, a bit like a split personality: one identifies crucial problem components and the other decides how to integrate that information.
Here is an example with GPT-4's initial medical care plan being insufficient in crucial regards. The model then talks to itself as a researcher and as a decider. And then, lo and behold, it comes up with a better final care plan. The points in bold were added by GPT-4 to its initial care plan after discussions with itself.
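A rough sketch of that two-role setup might look like the following. The role prompts and the loop are my own simplification of the DERA idea, not the paper's implementation, and the calls assume the 2023-era openai SDK.

```python
import openai  # assumes the 2023-era SDK and an API key

RESEARCHER = ("You are the Researcher. Read the patient notes and the draft care plan, "
              "and list the crucial pieces of information the plan misses or gets wrong.")
DECIDER = ("You are the Decider. You own the final care plan. Consider the Researcher's points "
           "and output a revised plan, integrating only the points you agree with.")

def ask(system, user):
    out = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    return out["choices"][0]["message"]["content"]

def dera_revise(notes, draft_plan, rounds=2):
    """Alternate researcher critiques and decider revisions over a few exchanges."""
    plan = draft_plan
    for _ in range(rounds):
        critique = ask(RESEARCHER, f"Patient notes:\n{notes}\n\nDraft plan:\n{plan}")
        plan = ask(DECIDER, f"Patient notes:\n{notes}\n\nDraft plan:\n{plan}\n\nResearcher's points:\n{critique}")
    return plan
```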
And the results are incredible. Physicians chose the final summary produced by this DERA dialogue over the initial GPT-4-generated summary 90% to 10%. That's the dark red versus the pink; I'm colorblind, but even I can see there's a pretty big difference. The authors also introduced hallucinations at three levels: low, medium and high.
And they wanted to see whether this dialogue model would reduce those hallucinations. These are different medical gradings, and you can see that pretty much every time, the dialogue did improve things, quite dramatically. And then there was this paper, also released less than 72 hours ago. The authors get a model to recursively criticize and improve its own output and find that this process of reflection outperforms chain-of-thought prompting.
They tested their model on MiniWoB++, which is a challenging suite of web-browser-based tasks for computer control, ranging from simple button clicking to complex form filling. Here it is deleting files, clicking "like" buttons and switching between tabs. A bit like my earlier experiments, they gave it a math problem and said, "Review your previous answer and find problems with your answer."
This was a slightly more leading prompt, but it worked. They then said, "Based on the problems you found, improve your answer." And then the model got it right. Even if you take nothing else from this video, just deploying this technique will massively improve your outputs from GPT-4. But we can go much further, which is what the rest of the video is about.
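For reference, those two follow-up prompts are simple enough to reuse verbatim. Here is a minimal sketch of one critique-then-improve round; only the two quoted prompts come from the approach described above, and the wrapper assumes the 2023-era openai SDK.

```python
import openai  # assumes the 2023-era SDK and an API key

def chat(history, model="gpt-4"):
    reply = openai.ChatCompletion.create(model=model, messages=history)
    return reply["choices"][0]["message"]["content"]

def recursively_criticize_and_improve(question, rounds=1):
    history = [{"role": "user", "content": question}]
    answer = chat(history)
    history.append({"role": "assistant", "content": answer})
    for _ in range(rounds):
        # Step 1: ask the model to critique its own answer
        history.append({"role": "user", "content": "Review your previous answer and find problems with your answer."})
        critique = chat(history)
        history.append({"role": "assistant", "content": critique})
        # Step 2: ask it to improve based on that critique
        history.append({"role": "user", "content": "Based on the problems you found, improve your answer."})
        answer = chat(history)
        history.append({"role": "assistant", "content": answer})
    return answer
```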
Before I move on, though, I found it very interesting that the authors say this technique can be viewed as using the LLM's output to write to an external memory, which is later retrieved to choose an action.
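In other words, the reflections act like a small scratchpad the agent carries between attempts. Here is a toy illustration of that bookkeeping; the function names and the example reflection are purely my own, and no API calls are involved.

```python
memory = []  # reflections the model writes after failed episodes

def record_failure(reflection: str):
    """Write the model's own post-mortem to the external memory."""
    memory.append(reflection.strip())

def build_action_prompt(task: str, observation: str) -> str:
    """Retrieve recent reflections and prepend them before asking for the next action."""
    lessons = "\n".join(f"- {note}" for note in memory[-3:])  # keep only the latest notes
    return (f"Task: {task}\n"
            f"Lessons from earlier attempts:\n{lessons or '- none yet'}\n"
            f"Current observation: {observation}\n"
            f"Next action:")

record_failure("I put the pan on the counter instead of the dining table; check the target location first.")
print(build_action_prompt("Put a pan on the dining table", "You are in the kitchen; a pan is on the stove."))
```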
Going back to Karpathy, remember that this critique-and-retry metacognition strategy isn't the only way that GPT-4 will beat its own records. The use of tools, as he says, will also be critical. Less than 72 hours ago, this paper was released, and arguably it is as significant as the Reflexion paper. It's called Hugging GPT, and as the authors put it, it achieves impressive results in language, vision, speech and other challenging tasks, which paves a new way towards AGI.
Essentially what the paper did is it used language as an interface to connect numerous AI models for solving complicated AI tasks. It's a little bit like a brain deciding which muscle to use to complete an action. Take this example. The prompt was, can you describe what this picture depicts and count how many objects in the picture?
The model, which was actually ChatGPT, not even GPT-4, used two different tools to execute the task, one model to describe the image and one model to count the objects within it. And if you didn't think that was impressive, what about six different models? So the task was this. Please generate an image where a girl is reading a book and her pose is the same as the boy in the image given, then please describe the new image with your voice.
The central language model, or brain, which was ChatGPT, had to delegate appropriately. All of these models, by the way, are freely available on Hugging Face. The first model was used to analyze the pose of the boy; the next transposed that pose into an image; then others generated the new image, detected the objects in it, turned the image into text, and finally converted that text into speech.
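To make the "brain delegating to muscles" idea concrete, here is a heavily simplified sketch of the earlier describe-and-count example. In the real system, ChatGPT writes the plan and chooses models from their Hugging Face model cards; below, the plan is hard-coded and the two models are ones I picked for illustration, not necessarily those the paper used.

```python
from transformers import pipeline  # the delegated models all come from Hugging Face

# Stand-in "muscles" the controller can delegate to (model choices are mine, for illustration)
tools = {
    "image-to-text":    pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning"),
    "object-detection": pipeline("object-detection", model="facebook/detr-resnet-50"),
}

def run_plan(plan, image_path):
    """Execute a task plan step by step; in HuggingGPT the plan itself is written by the LLM."""
    return {task: tools[task](image_path) for task in plan}

# The simpler describe-and-count example, with a hard-coded two-step plan
results = run_plan(["image-to-text", "object-detection"], "picture.jpg")  # placeholder image path
caption = results["image-to-text"][0]["generated_text"]
count = len(results["object-detection"])
print(f"{caption} I can see {count} objects in the picture.")
```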
Hugging GPT did all of this, and notice how the girl is in the same pose as the boy: same head position and arm position. And then, as a cherry on top, the model read out loud what it had accomplished. This example actually comes from another paper released four days ago, called TaskMatrix.AI.
Remember how the original Toolformer paper used only five APIs? This paper proposes that we could soon use millions. In this example, the model is calling different APIs to answer questions about the image, caption the image, and do outpainting from the image, extending it from a simple single flower to this 4K image.
Going back to Hugging GPT, we can see how it deciphers these inscrutable invoices and reads them out loud, and it can even perform text-to-video, with an astronaut walking in space. At this point, I can't resist showing you what kind of CGI video editing might soon be possible with AI. Here's Wonder Studio, which is backed by Steven Spielberg.
Welcome to Wonder Studio, where making movies with CGI is as simple as selecting your actor and assigning a character. The system uses AI to track the actor's performance across cuts and automatically animates, lights, and composites the CG character directly into the scene. Whether it's one shot or a full sequence, Wonder Studio analyzes and captures everything: body motion, lighting, compositing, camera motion, and it even tracks the actor's facial performance.
These advancements do seem to be accelerating and requiring fewer and fewer humans. This paper showed, back in the before times of October, that models didn't need carefully labeled human datasets and could generate their own. Going back to the Language Models Can Solve Computer Tasks paper, the authors seem to concur.
They said that previously, significant amounts of expert demonstration data were still required to fine-tune large language models. In contrast, the agent they suggest needs fewer than two demonstrations per task on average and doesn't necessitate any fine-tuning. This reminded me of the Alpaca model, which was fine-tuned on the outputs of another language model.
Human experts were needed briefly at the start, but far less than before. A bit like a child no longer needing a parent, except maybe GPT-4 is on growth steroids. Ilya Sutskever from OpenAI put it like this: I mean, already most of the data for reinforcement learning is coming from AIs.
The humans are being used to train the reward function. But then the reward function, in its interaction with the model, is automatic, and all the data that's generated during the process of reinforcement learning is created by AI. Before I end, I should point out that these recursive self-improvements are not limited to algorithms and APIs.
Even hardware is advancing more rapidly due to AI. This week we had this from Reuters. NVIDIA on Monday showed new research that explains how AI can be used to improve chip design. By the way, this includes the new H100 GPU. They say that the NVIDIA research took reinforcement learning and added a second layer of AI on top of it to get even better results.
And to go back to where we started, the GPT-4 technical report showed that even with compute alone, not self-learning, we can predict with a high degree of specificity the future performance of models like GPT-5 on tasks such as HumanEval. These accelerations in AI are being felt even by the CEO of Google, and I can't help feeling that there is one more feedback loop to point out.
As one company like OpenAI makes breakthroughs, it puts pressure on other companies like Google to catch up. Apparently, Bard, which has been powered by LaMDA, will soon be upgraded to the more powerful PaLM model. With self-improvement, tool use, hardware advances and now commercial pressure, it is hard to see how AI will slow down.
And of course, as always, I will be here to discuss it all. Thank you for watching to the end and have a wonderful day.