
GPT 4 Can Improve Itself - (ft. Reflexion, HuggingGPT, Bard Upgrade and much more)


Chapters

0:00 Intro
4:00 Examples
6:47 Other breakthroughs
9:07 HuggingGPT
12:22 Fewer and fewer humans
13:34 AI and Hardware

Whisper Transcript

00:00:00.000 | GPT-4 can improve itself by reflecting on its mistakes and learning from them.
00:00:05.840 | Even if the world does pause AI development, GPT-4 will keep getting smarter.
00:00:11.420 | Drawing upon the stunning reflection paper and three other papers released only in the last 72 hours,
00:00:18.420 | I will show you not only how GPT-4 is breaking its own records,
00:00:22.740 | but also how it's helping AI researchers to develop better models.
00:00:26.820 | I will also cover the groundbreaking HuggingGPT model,
00:00:30.520 | which, like a centralized brain, can draw upon thousands of other AI models
00:00:35.640 | to combine tasks like text-to-image, text-to-video, and question-answering.
00:00:40.600 | The reflection paper and follow-up Substack post that caught global attention was released only a week ago,
00:00:46.780 | and yes, I did read both, but I also reached out to the lead author, Noah Shinn,
00:00:51.720 | and discussed their significance at length.
00:00:54.160 | Others picked up on the results,
00:00:55.920 | with the legendary Andrej Karpathy of Tesla and OpenAI fame saying that this
00:01:00.960 | meta-cognition strategy revealed that we haven't yet seen the max capacity of GPT-4.
00:01:07.340 | So what exactly was found?
00:01:09.160 | Here is the headline result.
00:01:10.840 | I'm going to explain and demonstrate what was tested in a moment,
00:01:14.060 | but look how they used GPT-4 itself to beat past GPT-4 standards using this reflection technique.
00:01:21.280 | This isn't any random challenge.
00:01:22.940 | This is HumanEval, a coding benchmark,
00:01:25.900 | a test that was designed by some of the most senior AI researchers just two years ago.
00:01:29.240 | The designers included Ilya Sutskever of OpenAI fame and Dario Amodei, who went on to found Anthropic.
00:01:35.640 | These are realistic, handwritten programming tasks that assess language comprehension, reasoning, algorithms, and mathematics.
00:01:43.120 | So how exactly did GPT-4 improve itself and beat its own record?
00:01:47.680 | Because remember in the distant past of two weeks ago in the GPT-4 technical report, it scored 67%, not 88%.
00:01:55.880 | Well, here is an example from page nine of the reflection paper.
00:01:59.380 | As you can read in the caption, this was a HotpotQA trial designed specifically such that models needed to find multiple documents
00:02:06.880 | and analyze the data in each of them to come up with the correct answer.
00:02:10.900 | Notice how initially a mistake was made on the left by the model,
00:02:14.540 | and then the model at the bottom reflected on how it had gone wrong.
00:02:18.460 | In a self-contained loop, it then came up with a better strategy and got it right.
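To make that loop concrete, here is a minimal sketch of the pattern in Python. Everything in it is a hypothetical stand-in rather than the paper's actual code: `complete()` represents whatever chat-model API you use, `check()` is some external grading signal, and the prompt wording is mine. The point is just the shape: attempt, check, reflect, retry with the reflection in context.

```python
def complete(prompt: str) -> str:
    raise NotImplementedError  # wrap your chat-completion API of choice here

def answer_with_reflection(task: str, check, max_tries: int = 3) -> str:
    reflections: list[str] = []  # self-critiques from failed attempts
    answer = ""
    for _ in range(max_tries):
        context = "\n".join(reflections)
        answer = complete(f"{context}\nTask: {task}\nAnswer:")
        ok, feedback = check(answer)  # external grading signal, e.g. exact match
        if ok:
            return answer
        # Ask the model to diagnose its own mistake before retrying.
        reflections.append(complete(
            f"Task: {task}\nYour answer: {answer}\nFeedback: {feedback}\n"
            "Reflect on what went wrong and state a better strategy:"
        ))
    return answer
```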
00:02:23.160 | And the authors put it like this.
00:02:25.860 | They said, "We emphasize that LLMs, large language models, possess an emergent property of self-reflection,"
00:02:31.860 | meaning that earlier models couldn't do this or couldn't do it as well.
00:02:35.460 | It's a bit like GPT models are learning how to learn.
00:02:38.660 | In case you think the model was blindly trying again and again until it was successful:
00:02:43.260 | no, it wasn't.
00:02:44.140 | This was another challenge called ALFWorld.
00:02:46.580 | And look at the difference between success without reflection and success with reflection.
00:02:51.660 | I discussed this, of course, with the lead author, and the goal was to distinguish
00:02:55.980 | the learning curves of genuine self-improvement from simple probabilistic success over time.
00:03:00.780 | If you're wondering about ALFWorld, by the way,
00:03:03.020 | it's about aligning text and embodied environments for interactive learning.
00:03:07.140 | For example, in a simulated environment,
00:03:09.340 | the model had the task of putting a pan on the dining table,
00:03:13.140 | and it had to understand and act on that prompt.
00:03:16.340 | So as you can see, this ability to reflect doesn't just help with coding.
00:03:20.340 | It helps with a variety of tasks.
00:03:22.340 | At this point, I want to quickly mention something.
00:03:24.500 | I know that there will be
00:03:25.940 | a couple of well-versed insiders who say, didn't GPT-4 actually get 82%
00:03:31.020 | in HumanEval in the Sparks of AGI paper?
00:03:33.820 | Of course, I did a video on that paper, too,
00:03:36.140 | and asked the author of Reflection about this point.
00:03:39.300 | There are a few possibilities, such as prompting changes and the Sparks
00:03:43.060 | authors having access to the raw GPT-4 model.
00:03:45.820 | But either way, it is the relative performance gain that matters.
00:03:49.140 | Whichever baseline you start with, GPT-4 can improve on it with reflection.
00:03:53.700 | And the 88% figure is not a cap.
00:03:55.900 | The author has observed results in the last few hours as high as 91%.
00:04:00.060 | But before I go on, I can't resist showing you the examples
00:04:03.900 | I found through experimentation and also shared with the author.
00:04:07.460 | Take this prompt that I gave GPT-4.
00:04:09.740 | Write a poem in which every word begins with E.
00:04:12.860 | Now, as you can see, it did a good job, but it didn't fully get it right.
00:04:16.140 | Look at the word "ascent", for example.
00:04:17.820 | Without mentioning anything specific, I just then wrote,
00:04:20.980 | "Did the poem meet the assignment?" Not even a particularly leading question, because
00:04:25.700 | it could have just said, "Yes."
00:04:26.900 | GPT-4 then said, "Apologies, it appears the poem I provided did not meet the assignment requirements.
00:04:33.660 | Not every word begins with the letter E.
00:04:35.940 | Here is a revised poem with every word beginning with the letter E."
00:04:39.660 | Remember, I didn't help it at all.
00:04:41.540 | And look at the results.
00:04:43.020 | Every word begins with E.
00:04:44.980 | How far can we take this?
00:04:46.340 | For the next example, I chose mathematics and asked, "Write me a five question
00:04:50.780 | multiple choice quiz to test my knowledge of probability with correct answers and explanations at
00:04:55.820 | the bottom. There should only be one correct answer per question."
00:04:58.940 | It comes up with a decent quiz, but notice a problem in question three, for example.
00:05:04.100 | The probability of drawing an ace or a king is indeed eight out of fifty two.
00:05:09.540 | But that simplifies down to two out of thirteen.
00:05:12.660 | So two of the answers are correct.
00:05:14.620 | And I explicitly asked for it not to do this in the prompt.
00:05:18.380 | So can the model self-reflect with mathematics?
00:05:21.220 | Kind of, almost.
00:05:22.940 | Look what happens. First, I give a
00:05:25.660 | vague prompt, asking, "Did the quiz meet the assignment?"
00:05:28.580 | GPT-4 fumbles this and says, "Yes, the quiz did meet the assignment."
00:05:32.460 | Hmm. So I try, "Did the quiz meet all of the requirements?"
00:05:36.300 | And GPT-4 says, "Yes."
00:05:38.260 | So I did have to help it a bit and said, "Did the quiz meet the requirement that
00:05:42.380 | there should only be one correct answer per question?"
00:05:45.460 | That was just enough to get GPT-4 to self-reflect properly.
00:05:49.500 | And it corrected the mistake.
00:05:50.940 | But I must say it didn't self-correct perfectly.
00:05:53.460 | Notice it identified C and
00:05:55.580 | D as being correct and equivalent when it was B and D.
00:05:59.540 | But despite making that mistake, it was able to correct the quiz.
00:06:03.940 | In case you're wondering, the original
00:06:05.660 | ChatGPT or GPT-3.5 can't self-reflect as well.
00:06:10.060 | I went back to the poem example, and not only was the generated poem full
00:06:14.860 | of words that didn't begin with E, but the self-reflection was also lacking.
00:06:18.980 | I said, "Did the poem meet the assignment?"
00:06:21.220 | And it said, "Yes, the poem meets the assignment." As the lead author Noah Shinn
00:06:25.540 | wrote, "With GPT-4, we are shifting the accuracy bottleneck from correct
00:06:29.860 | syntactic and semantic generation to correct syntactic and semantic test generation."
00:06:36.380 | In other words, if a model can know how to test its outputs accurately,
00:06:40.300 | that might be enough, even if its initial generations don't work.
00:06:43.980 | It just needs to be smart enough to know where it went wrong.
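One way to picture that shift, offered as my own sketch rather than the paper's exact setup: have the model write unit tests for its own code, run them, and feed any failure back in. `complete()` is again a hypothetical LLM call, and running untrusted model output with `exec` is radically simplified here; a real harness would sandbox it.

```python
def complete(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call, as in the earlier sketch

def solve_with_self_tests(problem: str, max_tries: int = 3) -> str:
    # The accuracy bottleneck moves to the tests: if these are right,
    # imperfect first drafts of the solution can still converge.
    tests = complete(f"Write Python assert-based unit tests for:\n{problem}")
    code = complete(f"Write a Python solution for:\n{problem}")
    for _ in range(max_tries):
        try:
            scope: dict = {}
            exec(code, scope)   # define the candidate solution (sandbox in practice!)
            exec(tests, scope)  # asserts raise if the solution misbehaves
            return code
        except Exception as err:
            code = complete(
                f"Problem: {problem}\nCode:\n{code}\nA test failed: {err!r}\n"
                "Reflect on the bug and rewrite the code:"
            )
    return code
```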
00:06:47.380 | Others are discovering similar breakthroughs.
00:06:49.460 | This paper from just three days ago comes up with this self-improvement technique.
00:06:53.980 | They get GPT-4
00:06:55.500 | and frame its dialogue as a discussion between two agent types, a researcher
00:07:00.180 | and a decider, a bit like a split personality, one identifying crucial
00:07:04.780 | problem components and the other one deciding how to integrate that information.
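As a rough illustration of that two-role pattern (my own sketch, not the DERA authors' actual prompts), the dialogue can be run as two personas passing a working draft back and forth:

```python
def complete(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def dera_style_refine(notes: str, draft: str, rounds: int = 3) -> str:
    """Researcher flags issues in the draft; Decider folds them back in."""
    for _ in range(rounds):
        findings = complete(
            "You are the RESEARCHER. Point out crucial information in the notes "
            f"that the draft misses or gets wrong.\nNotes:\n{notes}\nDraft:\n{draft}"
        )
        draft = complete(
            "You are the DECIDER. Integrate the researcher's findings, keeping "
            f"whatever is already correct.\nDraft:\n{draft}\nFindings:\n{findings}"
        )
    return draft
```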
00:07:09.300 | Here is an example with GPT-4's initial
00:07:11.860 | medical care plan being insufficient in crucial regards.
00:07:15.420 | The model then talks to itself as a researcher and as a decider.
00:07:19.340 | And then, lo and behold, it comes up with a better final care plan.
00:07:23.700 | The points in bold were added
00:07:25.460 | by GPT-4 to its initial care plan after discussions with itself.
00:07:30.020 | And the results are incredible.
00:07:31.740 | Physicians chose the final summary produced by this DERA dialogue over
00:07:36.220 | the initial GPT-4-generated summary 90% to 10%.
00:07:40.060 | That's the dark red versus the pink.
00:07:42.540 | I'm colorblind, but even I can see there's a pretty big difference.
00:07:45.580 | The authors also introduce hallucinations at different levels, low, medium and high.
00:07:50.420 | And they wanted to see whether this dialogue model would reduce those hallucinations.
00:07:54.380 | These are different
00:07:55.460 | medical gradings, and you can see that pretty much every time it did reduce them
00:07:59.340 | quite dramatically. And then there was this paper also released less than 72 hours ago.
00:08:04.380 | They also get a model to recursively
00:08:06.380 | criticize and improve its own output and find that this process of reflection
00:08:10.980 | outperforms chain of thought prompting.
00:08:13.340 | They tested their model on MiniWoB++, which is a challenging suite of web-browser-based
00:08:19.260 | tasks for computer control, ranging from simple button clicking to complex form
00:08:25.420 | filling. Here it is deleting files, clicking on like buttons and switching between tabs.
00:08:30.340 | A bit like my earlier experiments,
00:08:32.260 | they gave it a math problem and said, "Review your previous answer and find problems with your answer."
00:08:37.740 | This was a slightly more leading prompt, but it worked.
00:08:40.580 | They then said, "Based on the problems you found, improve your answer."
00:08:43.780 | And then the model got it right.
00:08:45.420 | Even if you take nothing else from this video, just deploying this technique will
00:08:49.460 | massively improve your outputs from GPT-4, but we can go much further, which is what the rest of the video
00:08:55.500 | is about. Before I move on, though, I found it very interesting that the authors say that this
00:08:59.580 | technique can be viewed as using the LLM's output to write to an external memory,
00:09:05.140 | which is later retrieved to choose an action.
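In prompt form, the recipe is just two follow-up turns, and appending each critique to the running transcript is exactly that write-to-external-memory view. A hedged sketch, reusing the paper's follow-up wording quoted above; the `complete()` chat helper is a hypothetical stand-in:

```python
def complete(history: list[str]) -> str:
    raise NotImplementedError  # hypothetical chat call over a message transcript

def rci(question: str, rounds: int = 2) -> str:
    history = [f"User: {question}"]
    answer = complete(history)
    for _ in range(rounds):
        history.append(f"Assistant: {answer}")
        # The critique is written into the transcript: the "external memory".
        history.append("User: Review your previous answer and find problems with your answer.")
        history.append(f"Assistant: {complete(history)}")
        history.append("User: Based on the problems you found, improve your answer.")
        answer = complete(history)
    return answer
```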
00:09:07.620 | Going back to Karpathy, remember that this critique-and-retry metacognition strategy isn't
00:09:12.980 | the only way that GPT-4 will beat its own records.
00:09:16.260 | The use of tools, as he says, will also be critical.
00:09:19.860 | Less than 72 hours ago, this paper was released and arguably it is as significant
00:09:25.340 | as the reflection paper.
00:09:26.620 | It's called HuggingGPT, and as the authors put it, it achieves impressive
00:09:30.820 | results in language, vision, speech and other challenging tasks,
00:09:34.660 | which paves a new way towards AGI. Essentially, the paper uses
00:09:39.220 | language as an interface to connect numerous AI models for solving complicated AI tasks.
00:09:45.220 | It's a little bit like a brain deciding which muscle to use to complete an action.
00:09:49.700 | Take this example.
00:09:50.780 | The prompt was, can you describe what this picture depicts and count
00:09:55.260 | how many objects in the picture?
00:09:56.580 | The model, which was actually ChatGPT, not even GPT-4, used two different tools
00:10:02.140 | to execute the task, one model to describe the image and one model to count the objects within it.
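The controller pattern behind this is easy to sketch. Everything below is illustrative: the expert names and the plain-line plan format are invented for this example, and the real HuggingGPT has ChatGPT emit a structured task list instead.

```python
def complete(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call playing the central "brain"

# Hypothetical registry of expert models; the real ones live on Hugging Face.
EXPERTS = {
    "image-caption": lambda image: "a caption of the image",  # e.g. a captioning model
    "object-count": lambda image: "N objects",                # e.g. a detector
}

def controller(request: str, image) -> str:
    # 1. The language model plans which experts to call, one name per line.
    plan = complete(
        f"Request: {request}\nAvailable tools: {', '.join(EXPERTS)}\n"
        "List the tools to call, one per line:"
    )
    # 2. A thin runtime executes each chosen expert on the input.
    results = {t: EXPERTS[t](image) for t in plan.splitlines() if t in EXPERTS}
    # 3. The language model composes the final answer from the experts' outputs.
    return complete(f"Request: {request}\nTool results: {results}\nFinal answer:")
```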
00:10:08.220 | And if you didn't think that was impressive, what about six different models?
00:10:12.060 | So the task was this.
00:10:13.380 | Please generate an image where a girl is reading a book and her pose is the same
00:10:18.460 | as the boy in the image given, then please describe the new image with your voice.
00:10:23.660 | The central language model,
00:10:25.340 | or brain, which was ChatGPT, had to delegate appropriately.
00:10:29.020 | All of these models, by the way, are freely available on Hugging Face.
00:10:32.580 | The first model was used to analyze the pose of the boy.
00:10:36.420 | The next one was to transpose that into an image, then generate an image,
00:10:40.860 | detect an object in that image, break that down into text and then turn that text into speech.
00:10:46.740 | It did all of this and notice how the girl is in the same pose as the boy.
00:10:51.180 | Same head position and arm position.
00:10:53.260 | And then as a cherry on top, the model
00:10:55.340 | read out loud what it had accomplished.
00:10:57.180 | This example actually comes from another
00:10:59.180 | paper released four days ago called Task Matrix.
00:11:02.580 | Remember how the original Toolformer paper used only five APIs?
00:11:06.820 | This paper proposes that we could soon use millions.
00:11:09.940 | In this example, the model is calling different APIs to answer
00:11:13.900 | questions about the image, caption the image and do outpainting from the image,
00:11:18.700 | extending it from a simple single flower to this 4K image.
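How would a model pick from millions of APIs? One plausible mechanism, stated here as an assumption rather than a detail from the Task Matrix paper, is to embed every API's description once and retrieve only the closest matches to each request before the model ever sees them:

```python
import math

def embed(text: str) -> list[float]:
    raise NotImplementedError  # hypothetical text-embedding call

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_apis(request: str, registry: dict[str, list[float]], k: int = 5) -> list[str]:
    """`registry` maps API names to precomputed embeddings of their docs."""
    q = embed(request)
    return sorted(registry, key=lambda name: cosine(q, registry[name]), reverse=True)[:k]
```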
00:11:23.340 | Going back to HuggingGPT,
00:11:25.180 | we can see how it deciphers these inscrutable invoices and reads them out
00:11:29.420 | loud and can even perform text to video with an astronaut walking in space.
00:11:34.060 | At this point, I can't resist showing you
00:11:35.700 | what CGI video editing might soon be possible with AI.
00:11:39.220 | Here's Wonder Studio, which is backed by Steven Spielberg.
00:11:42.740 | Welcome to Wonder Studio, where making movies with CGI
00:11:47.140 | is as simple as selecting your actor and assigning a character.
00:11:51.900 | The system uses AI to track the actor's performance
00:11:55.260 | across cuts and automatically animates, lights, and composites the CG character
00:12:00.220 | directly into the scene.
00:12:03.580 | Whether it's one shot or a full sequence,
00:12:06.380 | Wonder Studio analyzes and captures everything from body motion,
00:12:12.100 | lighting, compositing,
00:12:15.260 | camera motion,
00:12:17.540 | and it even tracks the actor's facial performance.
00:12:22.020 | These advancements do seem to be accelerating and
00:12:25.100 | requiring fewer and fewer humans.
00:12:27.300 | This paper showed back in the before times of October that models didn't need
00:12:32.140 | carefully labeled human datasets and could generate their own.
00:12:35.820 | Going back to the Language Models Can Solve Computer Task paper,
00:12:39.500 | the authors seem to concur.
00:12:40.900 | They said that previously significant amounts of expert demonstration data are
00:12:45.180 | still required to fine tune large language models.
00:12:48.060 | On the contrary, the agent we suggest needs less than two
00:12:51.660 | demonstrations per task on average and doesn't necessitate
00:12:55.180 | any fine tuning.
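Concretely, "fewer than two demonstrations and no fine-tuning" just means the demonstrations ride along in the prompt of a frozen model. A minimal sketch, with the computer-control task and action format invented for illustration:

```python
def complete(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

# Two worked examples stand in for an entire fine-tuning dataset.
DEMOS = '''Task: click the button labelled "OK"
Actions: find_element("button", "OK"); click()

Task: type "hello" into the search box
Actions: find_element("input", "search"); type("hello")'''

def act(task: str) -> str:
    # No weights change; the frozen model imitates the demonstrated format.
    return complete(f"{DEMOS}\n\nTask: {task}\nActions:")
```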
00:12:56.300 | This reminded me of the Alpaca model, which was fine-tuned on the outputs
00:13:01.500 | of another language model.
00:13:03.060 | Human experts were needed briefly at the start, but far less than before.
00:13:07.460 | A bit like a child no longer needing a parent, except maybe GPT-4 is on growth steroids.
00:13:13.380 | Ilya Sutskever from OpenAI put it like this.
00:13:16.380 | I mean, already most of the data for reinforcement learning is coming from AIs.
00:13:20.580 | The humans are being used to train the reward function.
00:13:25.020 | But then the reward function
00:13:26.820 | in its interaction with the model is automatic and all the data that's
00:13:30.260 | generated during the process of reinforcement learning is created by AI.
00:13:34.460 | Before I end, I should point out that these recursive self improvements are not
00:13:38.980 | limited to algorithms and APIs. Even hardware is advancing more rapidly due to AI.
00:13:45.100 | This week we had this from Reuters.
00:13:47.180 | NVIDIA on Monday showed new research that
00:13:49.780 | explains how AI can be used to improve chip design.
00:13:53.300 | By the way, this includes the new
00:13:55.100 | H100 GPU.
00:13:56.340 | They say that the NVIDIA research took reinforcement learning and added a second
00:14:00.860 | layer of AI on top of it to get even better results.
00:14:03.980 | And to go back to where we started,
00:14:05.700 | the GPT-4 technical report showed that even with compute alone, not self-learning,
00:14:11.540 | we can predict with a high degree of specificity the future performance
00:14:16.260 | of models like GPT-5 on tasks such as HumanEval.
00:14:19.980 | These accelerations of AI even seem to be rattling the CEO of Google,
00:14:24.940 | and I can't help feeling that there is one more feedback loop to point out.
00:14:28.860 | As one company like OpenAI makes breakthroughs,
00:14:31.860 | it puts pressure on other companies like Google to catch up.
00:14:35.100 | Apparently, Bard, which has been powered
00:14:37.060 | by LaMDA, will soon be upgraded to the more powerful PaLM model.
00:14:40.980 | With self-improvement, tool use, hardware advances and now commercial pressure,
00:14:46.140 | it is hard to see how AI will slow down.
00:14:48.940 | And of course, as always, I will be here to discuss it all.
00:14:52.020 | Thank you for watching to the end and have a wonderful day.