GPT 4 Can Improve Itself - (ft. Reflexion, HuggingGPT, Bard Upgrade and much more)
Chapters
0:00 Intro
4:00 Examples
6:47 Other breakthroughs
9:07 HuggingGPT
12:22 Fewer and fewer humans
13:34 AI and Hardware
00:00:00.000 |
GPT-4 can improve itself by reflecting on its mistakes and learning from them. 00:00:05.840 |
Even if the world does pause AI development, GPT-4 will keep getting smarter. 00:00:11.420 |
Drawing upon the stunning Reflexion paper and three other papers released only in the last 72 hours, 00:00:18.420 |
I will show you not only how GPT-4 is breaking its own records, 00:00:22.740 |
but also how it's helping AI researchers to develop better models. 00:00:26.820 |
I will also cover the groundbreaking HuggingGPT model, 00:00:30.520 |
which, like a centralized brain, can draw upon thousands of other AI models 00:00:35.640 |
to combine tasks like text-to-image, text-to-video, and question-answering. 00:00:40.600 |
The Reflexion paper and follow-up Substack post that caught global attention were released only a week ago, 00:00:46.780 |
and yes, I did read both, but I also reached out to the lead author, Noah Shinn. 00:00:55.920 |
The excitement peaked with the legendary Andrej Karpathy of Tesla and OpenAI fame saying that this 00:01:00.960 |
metacognition strategy shows that we haven't yet seen the max capacity of GPT-4. 00:01:10.840 |
I'm going to explain and demonstrate what was tested in a moment, 00:01:14.060 |
but look how they used GPT-4 itself to beat past GPT-4 standards using this Reflexion technique. 00:01:25.520 |
The benchmark here is HumanEval, designed to test the coding ability of these models. 00:01:25.900 |
This is a test that was designed by the most senior AI researchers just two years ago. 00:01:29.240 |
The designers included Ilya Sutskever of OpenAI fame and Dario Amodei, who went on to found Anthropic. 00:01:35.640 |
These are realistic, handwritten programming tasks that assess language comprehension, reasoning, algorithms, and mathematics. 00:01:43.120 |
So how exactly did GPT-4 improve itself and beat its own record? 00:01:47.680 |
Because remember in the distant past of two weeks ago in the GPT-4 technical report, it scored 67%, not 88%. 00:01:55.880 |
Well, here is an example from page nine of the reflection paper. 00:01:59.380 |
As you can read in the caption, this was a HotPotQA trial designed specifically such that models needed to find multiple documents 00:02:06.880 |
and analyze the data in each of them to come up with the correct answer. 00:02:10.900 |
Notice how initially a mistake was made on the left by the model, 00:02:14.540 |
and then the model at the bottom reflected on how it had gone wrong. 00:02:18.460 |
In a self-contained loop, it then came up with a better strategy and got it right. 00:02:25.860 |
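To make that loop concrete, here is a minimal sketch of the reflect-and-retry pattern. It is illustrative only, not the paper's code: `ask_llm` is a hypothetical stand-in for whatever GPT-4 chat call you use, and `check` is a task-specific test, such as unit tests for a coding problem.

```python
# Minimal sketch of a reflect-and-retry loop (illustrative, not the paper's code).
from typing import Callable

def reflexion_loop(task: str,
                   ask_llm: Callable[[str], str],
                   check: Callable[[str], bool],
                   max_tries: int = 3) -> str:
    answer = ask_llm(f"Solve the following task:\n{task}")
    for _ in range(max_tries):
        if check(answer):
            return answer  # the answer passed the task-specific test
        # Ask the model to reflect on *why* it failed...
        reflection = ask_llm(
            f"Task:\n{task}\n\nYour answer:\n{answer}\n\n"
            "This answer failed. Explain concisely what went wrong."
        )
        # ...then retry with that reflection in context.
        answer = ask_llm(
            f"Task:\n{task}\n\nFailed answer:\n{answer}\n\n"
            f"Reflection on the failure:\n{reflection}\n\n"
            "Using this reflection, give an improved answer."
        )
    return answer
```

The key point is that the reflection itself, not just a fresh sample, is fed back into the next attempt.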
They said, "We emphasize that LLMs, large language models, possess an emergent property of self-reflection," 00:02:31.860 |
meaning that earlier models couldn't do this or couldn't do it as well. 00:02:35.460 |
It's a bit like GPT models are learning how to learn. 00:02:38.660 |
In case you think it was the model blindly trying again and again until it was successful, it wasn't. 00:02:46.580 |
And look at the difference between success without reflection and success with reflection. 00:02:51.660 |
I discussed this, of course, with the lead author, and the goal was to distinguish 00:02:55.980 |
the learning curve of genuine self-improvement from simple probabilistic success over time. 00:03:00.780 |
If you're wondering about ALFWorld, by the way, 00:03:03.020 |
it's about interactively aligning text and embodied worlds. 00:03:09.340 |
In this example, the model had the task of putting a pan on the dining table, 00:03:13.140 |
and it had to understand and act on that prompt. 00:03:16.340 |
So as you can see, this ability to reflect doesn't just help with coding. 00:03:22.340 |
At this point, I want to quickly mention something. 00:03:25.940 |
I have seen a couple of well-versed insiders ask, didn't GPT-4 actually get 82% in the Sparks of AGI paper? 00:03:36.140 |
I asked the author of Reflexion about this point. 00:03:39.300 |
There are a few possibilities, such as prompting changes and the Sparks 00:03:43.060 |
authors having access to the raw GPT-4 model. 00:03:45.820 |
But either way, it is the relative performance gain that matters. 00:03:49.140 |
Whichever baseline you start with, GPT-4 can improve on it with reflection. 00:03:55.900 |
In the last few hours, the author has observed results as high as 91%. 00:04:00.060 |
But before I go on, I can't resist showing you the examples 00:04:03.900 |
I found through experimentation and also shared with the author. 00:04:09.740 |
Write a poem in which every word begins with E. 00:04:12.860 |
Now, as you can see, it did a good job, but it didn't fully get it right. 00:04:17.820 |
Without mentioning anything specific, I then simply wrote, 00:04:20.980 |
"Did the poem meet the assignment?" Not even a particularly leading question. 00:04:26.900 |
GPT-4 then said, "Apologies, it appears the poem I provided did not meet the assignment requirements. 00:04:35.940 |
Here is a revised poem with every word beginning with the letter E." 00:04:46.340 |
For the next example, I chose mathematics and asked, "Write me a five question 00:04:50.780 |
multiple choice quiz to test my knowledge of probability with correct answers and explanations at 00:04:55.820 |
the bottom. There should only be one correct answer per question." 00:04:58.940 |
It comes up with a decent quiz, but notice a problem in question three, for example. 00:05:04.100 |
The probability of drawing an ace or a king is indeed eight out of fifty-two. 00:05:09.540 |
But that simplifies down to two out of thirteen, which was also offered as an option, making two answers correct. 00:05:14.620 |
And I had explicitly asked for only one correct answer per question in the prompt. 00:05:18.380 |
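For the record, the arithmetic itself is easy to verify:

```python
# 8 of the 52 cards are aces or kings, and 8/52 does simplify to 2/13.
from fractions import Fraction

assert Fraction(8, 52) == Fraction(2, 13)
print(Fraction(8, 52))  # prints 2/13
```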
So can the model self-reflect with mathematics? 00:05:25.660 |
I tried the same vague prompt, saying, "Did the quiz meet the assignment?" 00:05:28.580 |
GPT-4 fumbles this and says, "Yes, the quiz did meet the assignment." 00:05:32.460 |
Hmm. So I try, "Did the quiz meet all of the requirements?" 00:05:38.260 |
So I did have to help it a bit and said, "Did the quiz meet the requirement that 00:05:42.380 |
there should only be one correct answer per question?" 00:05:45.460 |
That was just enough to get GPT-4 to self-reflect properly. 00:05:50.940 |
But I must say it didn't self-correct perfectly. 00:05:55.580 |
It singled out only D as being correct and equivalent, when it was actually B and D. 00:05:59.540 |
But despite making that mistake, it was able to correct the quiz. 00:06:05.660 |
By contrast, ChatGPT, or GPT-3.5, can't self-reflect as well. 00:06:10.060 |
I went back to the poem example, and not only was the generated poem full 00:06:14.860 |
of words that didn't begin with E, the self-reflection was also lacking. 00:06:21.220 |
And it said, "Yes, the poem meets the assignment." As the lead author Noah Shinn 00:06:25.540 |
wrote, "With GPT-4, we are shifting the accuracy bottleneck from correct 00:06:29.860 |
syntactic and semantic generation to correct syntactic and semantic test generation." 00:06:36.380 |
In other words, if a model can know how to test its outputs accurately, 00:06:40.300 |
that might be enough, even if its initial generations don't work. 00:06:43.980 |
It just needs to be smart enough to know where it went wrong. 00:06:47.380 |
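As one hedged illustration of shifting the bottleneck to test generation for coding tasks: let the model write unit tests for its own code, and use those tests as the acceptance check. The helper below is a sketch of that idea, assuming `python` is on your PATH; the model-written `code` and `tests` would come from your own chat calls.

```python
# Sketch: accept a model-written answer only when the model-written
# tests for it pass in a fresh subprocess.
import subprocess
import tempfile

def passes_self_tests(code: str, tests: str, timeout: int = 30) -> bool:
    """Run model-written assert-based tests against model-written code."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    result = subprocess.run(
        ["python", path], capture_output=True, timeout=timeout
    )
    return result.returncode == 0  # non-zero means an assert failed
```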
Others are discovering similar breakthroughs. 00:06:49.460 |
This paper from just three days ago comes up with a similar self-improvement technique 00:06:55.500 |
and frames its dialogue as a discussion between two agent types, a researcher 00:07:00.180 |
and a decider, a bit like a split personality, one identifying crucial 00:07:04.780 |
problem components and the other deciding how to integrate that information. 00:07:11.860 |
The example given is of an initial medical care plan being insufficient in crucial regards. 00:07:15.420 |
The model then talks to itself as a researcher and as a decider. 00:07:19.340 |
And then, lo and behold, it comes up with a better final care plan. 00:07:25.460 |
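As a rough sketch of that researcher/decider exchange (my illustration, not the DERA paper's code, with `ask_llm` again a hypothetical chat helper):

```python
# Sketch of a two-agent dialogue: a "researcher" flags issues, a
# "decider" chooses how to fold them into the working care plan.
from typing import Callable

def dera_dialogue(case: str, ask_llm: Callable[[str], str],
                  rounds: int = 3) -> str:
    plan = ask_llm(f"Draft a care plan for this case:\n{case}")
    for _ in range(rounds):
        finding = ask_llm(
            "You are the RESEARCHER. Identify the most important problem "
            f"component missed or wrong in this plan.\nCase: {case}\n"
            f"Plan: {plan}"
        )
        plan = ask_llm(
            "You are the DECIDER. Decide whether and how to integrate the "
            f"researcher's point, then output the revised plan.\n"
            f"Plan: {plan}\nResearcher's point: {finding}"
        )
    return plan  # the final plan after the model talks to itself
```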
Compare the final care plan produced by GPT-4 to its initial care plan after discussions with itself. 00:07:31.740 |
Physicians chose the final summary produced by this DERA dialogue over 00:07:36.220 |
the initial GPT-4 generated summary 90% to 10%. 00:07:42.540 |
I'm colorblind, but even I can see there's a pretty big difference. 00:07:45.580 |
The authors also introduce hallucinations at different levels, low, medium and high. 00:07:50.420 |
And they wanted to see whether this dialogue model would reduce those hallucinations. 00:07:55.460 |
Judged by medical gradings, you can see that pretty much every time, it did improve things 00:07:59.340 |
quite dramatically. And then there was this paper, also released less than 72 hours ago. 00:08:06.380 |
Its authors get the model to criticize and improve its own output and find that this process of reflection significantly improves performance. 00:08:13.340 |
They tested their model on MiniWoB++, which is a challenging suite of web browser 00:08:19.260 |
based tasks for computer control, ranging from simple button clicking to complex form 00:08:25.420 |
filling. Here it is deleting files, clicking on like buttons, and switching between tabs. 00:08:32.260 |
In another test, they gave it a math problem and said, "Review your previous answer and find problems with your answer." 00:08:37.740 |
This was a slightly more leading prompt, but it worked. 00:08:40.580 |
They then said, "Based on the problems you found, improve your answer." 00:08:45.420 |
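Those two prompts are essentially the whole technique, so here they are wired together in a short sketch; `ask_llm` is again a hypothetical stand-in for your chat call, not the paper's code.

```python
# The critique-then-improve prompt pair, exactly as quoted above.
from typing import Callable

def criticize_and_improve(question: str,
                          ask_llm: Callable[[str], str]) -> str:
    answer = ask_llm(question)
    critique = ask_llm(
        f"Question: {question}\nYour answer: {answer}\n"
        "Review your previous answer and find problems with your answer."
    )
    return ask_llm(
        f"Question: {question}\nYour answer: {answer}\n"
        f"Problems found: {critique}\n"
        "Based on the problems you found, improve your answer."
    )
```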
Even if you take nothing else from this video, just deploying this technique will 00:08:49.460 |
massively improve your outputs from GPT-4, but we can go much further, which is what the rest of the video 00:08:55.500 |
is about. Before I move on, though, I found it very interesting that the authors say that this 00:08:59.580 |
technique can be viewed as using the LLM's output to write to an external memory, 00:09:05.140 |
which is later retrieved to choose an action. 00:09:07.620 |
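Read that way, a sketch of the idea is just a buffer of reflections that gets read back into the prompt before the next attempt. This is my illustration of the framing, not the paper's implementation.

```python
# Sketch of the external-memory framing: reflections are written to a
# buffer and retrieved later to inform the next action.
class ReflectionMemory:
    def __init__(self) -> None:
        self.notes = []

    def write(self, note: str) -> None:
        self.notes.append(note)  # the LLM's reflection, stored verbatim

    def read(self) -> str:
        # Prepended to the prompt before the next attempt is made.
        return "\n".join(f"- {n}" for n in self.notes)
```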
Going back to Karpathy, remember that this critique-retry metacognition strategy isn't 00:09:12.980 |
the only way that GPT-4 will beat its own records. 00:09:16.260 |
The use of tools, as he says, will also be critical. 00:09:19.860 |
Less than 72 hours ago, this paper was released, and arguably it is just as significant. 00:09:26.620 |
It's called HuggingGPT, and as the authors put it, it achieves impressive 00:09:30.820 |
results in language, vision, speech and other challenging tasks, 00:09:34.660 |
which paves a new way towards AGI. Essentially what the paper did is it used 00:09:39.220 |
language as an interface to connect numerous AI models for solving complicated AI tasks. 00:09:45.220 |
It's a little bit like a brain deciding which muscle to use to complete an action. 00:09:50.780 |
The prompt was, can you describe what this picture depicts and count how many objects are in the picture? 00:09:56.580 |
The model, which was actually ChatGPT, not even GPT-4, used two different tools 00:10:02.140 |
to execute the task, one model to describe the image and one model to count the objects within it. 00:10:08.220 |
And if you didn't think that was impressive, what about six different models? 00:10:13.380 |
Please generate an image where a girl is reading a book and her pose is the same 00:10:18.460 |
as the boy in the image given, then please describe the new image with your voice. 00:10:25.340 |
The central model, or brain, which was ChatGPT, had to delegate appropriately. 00:10:29.020 |
All of these models, by the way, are freely available on Hugging Face. 00:10:32.580 |
The first model was used to analyze the pose of the boy. 00:10:36.420 |
The next ones transposed that pose into a new image, generated the image, 00:10:40.860 |
detected the objects in that image, described them in text, and then turned that text into speech. 00:10:46.740 |
It did all of this and notice how the girl is in the same pose as the boy. 00:10:59.180 |
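To make the brain-and-muscles pattern concrete, here is a heavily simplified toy sketch. The real HuggingGPT pipeline has four stages (task planning, model selection, task execution, and response generation); below, planning and selection are collapsed into one prompt, and the registry names and `ask_llm` helper are made up for illustration.

```python
# Toy sketch of LLM-as-controller orchestration: the LLM plans JSON steps,
# and each step is dispatched to a specialist model from a registry.
import json
from typing import Callable, Dict, List

def orchestrate(request: str, ask_llm: Callable[[str], str],
                registry: Dict[str, Callable[[str], str]]) -> List[str]:
    plan = ask_llm(
        "Decompose this request into a JSON list of steps, each of the "
        f'form {{"task": one of {list(registry)}, "input": text}}:\n'
        + request
    )
    outputs = []
    for step in json.loads(plan):
        model = registry[step["task"]]        # choose the specialist model
        outputs.append(model(step["input"]))  # run it and collect the result
    return outputs
```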
This approach is extended in a paper released four days ago called TaskMatrix.AI. 00:11:02.580 |
Remember how the original Toolformer paper used only five APIs? 00:11:06.820 |
This paper proposes that we could soon use millions. 00:11:09.940 |
In this example, the model is calling different APIs to answer 00:11:13.900 |
questions about the image, caption the image and do outpainting from the image, 00:11:18.700 |
extending it from a simple single flower to this 4K image. 00:11:25.180 |
Elsewhere, we can see how it deciphers these inscrutable invoices and reads them out 00:11:29.420 |
loud, and can even perform text-to-video with an astronaut walking in space. 00:11:35.700 |
And speaking of video, look at what CGI video editing might soon be possible with AI. 00:11:39.220 |
Here's Wonder Studio, which is backed by Steven Spielberg. 00:11:42.740 |
Welcome to Wonder Studio, where making movies with CGI 00:11:47.140 |
is as simple as selecting your actor and assigning a character. 00:11:51.900 |
The system uses AI to track the actor's performance 00:11:55.260 |
across cuts and automatically animates, lights, and composes the CG character into a live-action scene. 00:12:06.380 |
Wonder Studio analyzes and captures everything from body motion, 00:12:17.540 |
and it even tracks the actor's facial performance. 00:12:22.020 |
These advancements do seem to be accelerating, and they rely on fewer and fewer humans. 00:12:27.300 |
This paper showed back in the before times of October that models didn't need 00:12:32.140 |
carefully labeled human datasets and could generate their own. 00:12:35.820 |
Going back to the Language Models Can Solve Computer Tasks paper, 00:12:40.900 |
They said that previously, significant amounts of expert demonstration data were 00:12:45.180 |
still required to fine tune large language models. 00:12:48.060 |
On the contrary, the agent we suggest needs less than two 00:12:51.660 |
demonstrations per task on average and doesn't necessitate fine-tuning. 00:12:56.300 |
This reminded me of the Alpaca model that fine-tuned its answers based on the outputs of GPT-3.5. 00:13:03.060 |
Human experts were needed briefly at the start, but far less than before. 00:13:07.460 |
A bit like a child no longer needing a parent, except maybe GPT-4 is on growth steroids. 00:13:16.380 |
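In sketch form, the Alpaca recipe is: a small set of human-seeded tasks, a strong model writing the answers, and those pairs becoming the fine-tuning set. This is my illustration of the distillation pattern, not Stanford's actual pipeline, and `ask_strong_llm` is a hypothetical helper.

```python
# Sketch of Alpaca-style distillation: a stronger model generates the
# instruction/output pairs a smaller model is later fine-tuned on.
from typing import Callable, Dict, List

def make_training_pairs(seed_tasks: List[str],
                        ask_strong_llm: Callable[[str], str]) -> List[Dict]:
    pairs = []
    for task in seed_tasks:
        pairs.append({"instruction": task,
                      "output": ask_strong_llm(task)})
    return pairs  # written out afterwards as the fine-tuning dataset
```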
I mean, already most of the data for reinforcement learning is coming from AIs. 00:13:20.580 |
The humans are being used to train the reward function, but everything else 00:13:26.820 |
in its interaction with the model is automatic, and all the data that's 00:13:30.260 |
generated during the process of reinforcement learning is created by AI. 00:13:34.460 |
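To picture that division of labour: humans supply a modest set of preference labels, a reward model is trained on a pairwise objective like the sketch below, and after that the RL loop generates its own data. This is a NumPy illustration of the standard Bradley-Terry-style loss, not any lab's actual code.

```python
# Pairwise preference loss: the human-preferred answer's reward should
# exceed the rejected one's.
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
    return float(np.mean(np.log(1.0 + np.exp(-(r_chosen - r_rejected)))))

# Example: rewards the model assigns to three labelled comparisons.
print(preference_loss(np.array([2.0, 1.5, 0.3]),
                      np.array([1.0, 1.6, 0.0])))
```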
Before I end, I should point out that these recursive self-improvements are not 00:13:38.980 |
limited to algorithms and APIs. Even hardware is advancing more rapidly due to AI. 00:13:49.780 |
This report from NVIDIA explains how AI can be used to improve chip design. 00:13:56.340 |
They say that the NVIDIA research took reinforcement learning and added a second 00:14:00.860 |
layer of AI on top of it to get even better results. 00:14:05.700 |
And remember, the GPT-4 technical report showed that even with compute alone, not self-learning, 00:14:11.540 |
we can predict with a high degree of specificity the future performance 00:14:16.260 |
of models like GPT-5 on tasks such as HumanEval. 00:14:19.980 |
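The method behind those predictions is, roughly, fitting a power law to smaller training runs and extrapolating to the big one. Here is a sketch of that idea with entirely made-up numbers, not OpenAI's data.

```python
# Sketch of loss-vs-compute extrapolation: fit a power law on small runs,
# then predict the loss at a much larger compute budget.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs (fictional)
loss = np.array([2.9, 2.5, 2.2, 1.95])        # observed loss (fictional)

# Fit log(loss) = a*log(compute) + b, i.e. loss ~ k * compute**a.
a, b = np.polyfit(np.log(compute), np.log(loss), 1)
predicted = np.exp(a * np.log(1e23) + b)      # extrapolate to 1e23 FLOPs
print(f"Predicted loss at 1e23 FLOPs: {predicted:.2f}")
```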
These accelerations of AI are even giving the CEO of Google 00:14:24.940 |
whiplash, and I can't help feeling that there is one more feedback loop to point out. 00:14:28.860 |
As one company like OpenAI makes breakthroughs, 00:14:31.860 |
it puts pressure on other companies like Google to catch up. 00:14:37.060 |
Google's Bard, currently powered by LaMDA, will soon be upgraded to the more powerful model PaLM. 00:14:40.980 |
With self-improvement, tool use, hardware advances and now commercial pressure, the pace of progress seems set only to increase. 00:14:48.940 |
And of course, as always, I will be here to discuss it all. 00:14:52.020 |
Thank you for watching to the end and have a wonderful day.