GPT-4 is Smarter than You Think: Introducing SmartGPT
00:00:00.000 |
I have three goals for this video. First, I want to show you a way of using GPT-4 to get smarter 00:00:06.120 |
results. Second, I want to argue that the benchmark results we have for GPT-4 do not reflect its full 00:00:13.080 |
abilities. And third, I want to show you a system that I am developing, somewhat cheekily called 00:00:18.080 |
SmartGPT, that is already showing significant results on official benchmarks. It remains to 00:00:24.420 |
be fully optimized, which I think is exciting in itself. I have shown the system to people 00:00:29.520 |
at OpenAI who have been quite impressed, and I'm going to end with some reflections on where 00:00:34.340 |
that might leave us for GPT-5. But before I get into how it works, I just want to show you one 00:00:40.200 |
example of it in action to whet your appetite. This example comes from a TED talk that was 00:00:45.380 |
released this week. So suppose I left five clothes to dry out in the sun, and it took them five hours 00:00:52.120 |
to dry completely. How long would it take to dry 30 clothes? GPT-4, the newest, greatest AI 00:00:59.040 |
system says 30 hours. Not good. On the left, you can see GPT-4's original answer, and it gives this 00:01:05.980 |
answer pretty consistently whenever you prompt it with the question provided. On the right, 00:01:10.680 |
you can see the final answer from the SmartGPT model, which is correct, and it consistently 00:01:15.640 |
gives that answer. I really like how it gives context as well, and it provides some of the 00:01:20.260 |
assumptions that it had in giving this correct answer. Now, don't you worry, there will be 00:01:24.620 |
plenty more examples to go through in this video, including another one from that TED talk. 00:01:28.560 |
But first, I want to give you an overview of what this SmartGPT model is, where I got my 00:01:33.920 |
inspiration for it, and how it works. I'm going to keep it fairly simple because it's 00:01:38.100 |
the beginning of the video, and I know a lot of people won't really care about the inner details. 00:01:42.280 |
That will come later in the video. But the high-level overview is this. There are at least 00:01:47.700 |
three things that have been proven to improve the outputs of GPT-4: what's called chain-of- 00:01:53.480 |
thought prompting, sometimes called step-by-step prompting; reflection, or finding its own 00:01:58.080 |
errors (I did an entire video on this called GPT-4 Can Self-Improve); and dialoguing with itself, 00:02:04.040 |
entering into a back and forth on its own outputs and deciding which one is best. You can see the 00:02:09.800 |
title of the papers, which contain much more detailed results, of course, linked above. Now, 00:02:14.480 |
the first paper only came out a few days ago, midway through my testing. So my results don't 00:02:19.400 |
even reflect the full capacity of the model. And even if there's nothing else you take from this 00:02:24.440 |
video, the results from this paper alone are worth remembering. So I'm going to 00:02:27.600 |
show you how to instantly improve the outputs you get from GPT-4. Many of you might remember that 00:02:33.360 |
prompting GPT-4 with let's think step-by-step improves its results. To give you a very quick 00:02:39.920 |
reference point, just asking a question to GPT-4 gives you 81% accuracy. With that prompt, let's 00:02:46.320 |
think step-by-step, it goes up to 86%. But algorithmically, the paper found an improved 00:02:52.560 |
prompt that can give you even better results, 89% accuracy. 00:02:57.120 |
So let's go back to the first part of SmartGPT: to the user's question, we add "Answer: Let's work this out in a step-by-step 00:03:03.200 |
way to be sure we have the right answer." Now, I have so much to say about why I think this works, 00:03:09.600 |
but I know many of you won't be that interested in my theories. So I'm going to save them to the 00:03:14.240 |
end for those who are interested. Some of you just want the results. So I'm going to get to 00:03:18.120 |
those first. So far, you might be thinking, well, thanks, Philip, that's a cool prompt. I'm going 00:03:22.040 |
to use that. But what's this whole SmartGPT about? Is it just a single prompt? No. 00:03:26.640 |
Because I believe, with evidence, there are ways of leveraging even better results than just 00:03:31.480 |
using a great chain of thought prompt. So let's move on to the next part of the system, these 00:03:36.920 |
different outputs in the middle. For my tests, I typically did three outputs, but of course, 00:03:40.840 |
depending on the context window, it could be far more than that. And I'm going to talk about ways 00:03:45.080 |
I could further improve this model, or we could, later on in the video. Just to restate, these 00:03:50.440 |
outputs are when you take the user input and add the word question at the start, and then at the end, 00:03:55.480 |
add "Answer: Let's work this out in a step-by-step way to be sure we have the right answer." 00:03:59.800 |
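To make that concrete, stage one is nothing more than string formatting. Here is a minimal sketch; the exact "Question."/"Answer:" punctuation is my rendering of what's shown on screen, not a spec:

```python
def first_stage_prompt(user_input: str) -> str:
    """Wrap the user's question in the SmartGPT stage-one format:
    the word 'Question' up front, the chain-of-thought cue at the end."""
    return (
        f"Question. {user_input}\n"
        "Answer: Let's work this out in a step by step way "
        "to be sure we have the right answer."
    )

print(first_stage_prompt("If 5 clothes take 5 hours to dry in the sun, how long do 30 clothes take?"))
```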
And at this moment, many of you are thinking, what is the point of multiple outputs? It's GPT-4, 00:04:04.900 |
it's just going to give you the answer it thinks is best, and that's it. Well, actually, it doesn't 00:04:08.580 |
quite work like that. These models have a temperature between zero and one. I believe 00:04:12.860 |
the default for GPT-4 might be around 0.5. And simplifying massively, this determines how creative 00:04:19.380 |
or conservative the model is in giving its outputs. So given that GPT-4 tries to be fairly creative, 00:04:25.680 |
you don't get the same output every time. The output is randomly sampled according to an 00:04:30.760 |
internal probability distribution. So you can get situations, and I face this hundreds of times, 00:04:35.960 |
where some of the outputs are correct, and others are incorrect. And this is where reflection comes 00:04:41.580 |
in. Sometimes, definitely not always, but sometimes, quite often, GPT-4 can detect the errors in its 00:04:48.640 |
own output. And many of you will notice at this point that the prompt that I used to get GPT-4 00:04:55.200 |
to spot its own errors contains the same step-by-step prompt I used earlier that has been shown to 00:05:02.200 |
produce good results. So to summarize, sometimes at this stage, GPT-4 detects the errors that some 00:05:09.420 |
of its outputs have made. Definitely not always. There are certain questions it just simply can't 00:05:14.260 |
spot the error. But sometimes it can, and then I get it to engage in a dialogue using a format 00:05:19.680 |
similar to one in this paper published last month. It's a short dialogue, and this is the 00:05:24.720 |
step I believe that can be most optimized. In the future, I envision an entire council of advisors 00:05:30.660 |
made up of GPT-4 imitating mathematicians, judges, etc. At the moment, it's just a resolver 00:05:38.040 |
printing a final improved output. Anyway, I'm going to get back to the theory later in the video, 00:05:42.900 |
because I know some of you will be getting bored at this stage and want to see more practical 00:05:46.640 |
examples and the results from my benchmark tests. As I don't have the GPT-4 API key, yes, I had to 00:05:54.240 |
input each of these steps hundreds of times, waiting sometimes three hours between each go, 00:05:59.740 |
because you can only do 25 messages every three hours. On the left, you can see the three outputs 00:06:05.020 |
when you ask it to think step by step. And then you have the researcher step in the middle and 00:06:10.340 |
at the top right. And finally, the resolver step. Notice here, I was using the original 00:06:14.680 |
let's think step by step because the paper hadn't yet been published on improving that prompt. 00:06:19.580 |
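For anyone who wants to see the whole loop in one place, here is a minimal sketch of how those manual steps could be automated. It's a sketch, not my exact program: it uses the pre-1.0 openai Python package, and the researcher and resolver prompts are paraphrases of the ones I've described, not verbatim:

```python
import openai  # pip install "openai<1.0"; assumes openai.api_key is set

MODEL = "gpt-3.5-turbo"  # would be "gpt-4", if I had API access

def chat(prompt, n=1, temperature=0.7):
    """One call to the chat endpoint, returning n sampled completions."""
    response = openai.ChatCompletion.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=temperature,
    )
    return [choice["message"]["content"] for choice in response["choices"]]

def smart_gpt(question, n=3):
    # Stage 1: sample several step-by-step drafts; a nonzero temperature makes them differ.
    drafts = chat(
        f"Question. {question}\n"
        "Answer: Let's work this out in a step by step way to be sure we have the right answer.",
        n=n,
    )
    options = "\n\n".join(f"Answer option {i}: {d}" for i, d in enumerate(drafts, 1))

    # Stage 2: the "researcher" reflects on the drafts and lists their flaws.
    researcher = chat(
        f"Question. {question}\n\n{options}\n\n"
        "You are a researcher tasked with investigating the answer options provided. "
        "List the flaws and faulty logic of each answer option. "
        "Let's work this out in a step by step way to be sure we have all the errors."
    )[0]

    # Stage 3: the "resolver" picks the best option and prints an improved final answer.
    return chat(
        f"Question. {question}\n\n{options}\n\nResearcher findings:\n{researcher}\n\n"
        "You are a resolver tasked with finding which answer option the researcher "
        "thought was best, improving that answer, and printing it in full. "
        "Let's work this out in a step by step way to be sure we have the right answer."
    )[0]
```

Keeping each stage as a separate API call is deliberate: as I'll argue later, GPT-4 tends to get overwhelmed if you ask it to draft, critique, and resolve all in one prompt.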
It's time for the second example from that TED talk, and then I definitely will get back to the details. 00:06:25.420 |
A different one. I have a 12-liter jug and a 6-liter jug, and I want to measure 6 liters. How do I do it? 00:06:31.860 |
Just use the 6-liter jug, right? GPT-4 spits out some very elaborate nonsense. 00:06:38.260 |
Of course, I tested SmartGPT with that question, and you can see the difference between the original GPT-4, 00:06:45.760 |
which gives this incredibly convoluted bad answer, and SmartGPT, the final answer output. 00:06:51.520 |
Now, at this point, I know many of you will be impressed, 00:06:53.280 |
but you'll be thinking, I don't have time to input things five times. 00:06:57.460 |
Well, I'm developing a model where it can all be done automatically. 00:07:01.340 |
Here is a preview of how it works. But of course, at the moment, it has to use GPT-3.5 turbo 00:07:06.780 |
because I don't have the API key of GPT-4. But the epic thing is this. You just ask a single question. 00:07:12.840 |
I've written, ask SmartGPT a question. And of course, it does take a little bit longer to respond 00:07:17.580 |
because it's doing five or six calls via API. But it does output the final answer from the resolver. 00:07:22.800 |
I will be honest and say that GPT-3.5 isn't as good at reflecting or resolving. 00:07:28.680 |
But this is an example of a question where the original ChatGPT consistently gets it wrong, 00:07:33.360 |
and SmartGPT-3.5 gets it right using this program. 00:07:37.900 |
Remember, all you have to do as a user is type in a question as normal, 00:07:41.700 |
and it goes through this entire five or six step process behind the scenes. 00:07:46.020 |
By the way, this was a question from MMLU, which is a famous benchmark, which I'll get to in a second. 00:07:54.820 |
I know many teachers use ChatGPT and GPT-4 to create quizzes for their classes. 00:07:59.760 |
And here is the same question put through GPT-4 and SmartGPT. 00:08:03.900 |
The question is create a high school algebra quiz with five questions and answers and explanations at the end. 00:08:11.520 |
But if the teacher had handed out the original quiz, look at the answers for question five. 00:08:16.780 |
It says the answers are 1 and 1.5, but then in the explanation, it gives 00:08:21.840 |
the final answers, which are correct, by the way: 3 and 0.5. 00:08:28.800 |
At the reflection stage, SmartGPT spotted that error and resolved it. 00:08:32.880 |
And as you can see, the answer for question five has the correct answers straight away. 00:08:37.920 |
If at any point you're wondering if I completed the OpenAI ChatGPT prompt engineering course, the answer is yes, 00:08:43.800 |
but it didn't inform too much of my thinking. 00:08:46.080 |
It was more for beginners, and I had already factored in things like giving the model time to think and writing clear instructions. 00:08:52.980 |
The benchmark that I chose to test SmartGPT on was the famous MMLU, Massive Multitask Language Understanding benchmark. 00:09:01.680 |
As you can see, the state of the art is indeed GPT-4 with 86.4% accuracy. 00:09:07.560 |
And you know OpenAI think it's a big deal because it's the benchmark mentioned on the front page of their technical report. 00:09:14.160 |
Without boring you too much, I extracted the questions from the test set of the MMLU data file. 00:09:22.800 |
I went for those that I thought GPT-4 would find the hardest. 00:09:26.940 |
Delving into the original MMLU paper, you can see that GPT-3 found formal logic the hardest, scoring just over 25%, which is barely better than random chance. 00:09:37.860 |
It's a four-option multiple-choice test, so around 25 to 30% is pretty bad. 00:09:45.900 |
They did it few shot, meaning they gave it five successful examples. 00:09:56.400 |
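If you want to replicate this, the questions are straightforward to pull out programmatically. Here is a sketch, assuming the usual layout of the MMLU CSV files (no header row; columns are the question, the four options, then the correct letter); check your copy of the data before trusting it:

```python
import csv

def load_mmlu(path="formal_logic_test.csv"):
    """Read one MMLU test-set CSV into a list of question dicts."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for question, a, b, c, d, answer in csv.reader(f):
            rows.append({
                "question": question,
                "options": {"A": a, "B": b, "C": c, "D": d},
                "answer": answer,  # the correct letter, e.g. "B"
            })
    return rows

def format_question(row):
    """Lay one row out as a zero-shot multiple-choice prompt."""
    options = "\n".join(f"{k}. {v}" for k, v in row["options"].items())
    return f"{row['question']}\n{options}"
```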
But just before I show you the results, there are three things I want to mention here. 00:09:58.400 |
First, I was curious how GPT-4 would do without any help. 00:10:06.400 |
Second, I wanted to do it zero shot because people using GPT-4 don't typically give five successful examples before asking GPT-4 a question. 00:10:14.400 |
They just want code or a quiz or a poem or an example. 00:10:16.400 |
They don't often provide five brilliant examples of code. 00:10:22.400 |
And third, if I can prove it works zero shot, then of course future refinements can be made to push the results even further. 00:10:28.400 |
And here are the results from the first 25 questions from the formal logic test set of the MMLU. 00:10:37.400 |
But you can see from this set, if you just ask the question, you get a lower overall accuracy. 00:10:43.400 |
But of course 68% for GPT-4 is still a huge improvement over GPT-3's around 25%. 00:10:49.400 |
What happens when you add "Let's think step by step", which as we know now isn't the fully optimized chain-of-thought prompt? 00:11:01.400 |
That was 75 examples inputted manually and I still have all the tabs open. 00:11:06.400 |
I'm keeping them open because I'm compiling a spreadsheet with the actual outputs. 00:11:10.400 |
But what did the resolver get drawing upon GPT-4's ability to reflect and engage in dialogue with itself? 00:11:20.400 |
GPT-4 zero shot got 32% of the questions wrong. 00:11:24.400 |
That was halved to 16% after putting it through the smart GPT system. 00:11:29.400 |
There was one question where the resolver model gave both a correct and incorrect answer. 00:11:34.400 |
But I'm counting that as an incorrect answer for the purposes of this test. 00:11:42.400 |
That is a pattern that stayed consistent throughout all my testing. 00:11:47.400 |
Ultimately, half of the errors that GPT-4 makes can be rectified if you give it the optimized step by step prompt, 00:11:55.400 |
get it to reflect on its results and get it to engage in dialogue and decide on a final answer. 00:12:02.400 |
At this point, for those people losing track of all the details, I want to put into context what resolving half of the errors on MMLU might mean in the context of the big picture. 00:12:12.400 |
Here's Lennart Heim, an AI governance researcher, suggesting that a score of 00:12:16.400 |
95% on the MMLU would be reflective of AGI-like abilities. 00:12:22.400 |
I do think I have like a 50% chance that, within the next 20 years or so, there might be something we might call an AGI or transformative AI. 00:12:30.400 |
What do I mean by this? Well, maybe we can measure it on benchmarks. 00:12:33.400 |
There's this famous MMLU benchmark; yeah, maybe there's something which scores 95% on this. 00:12:39.400 |
Going back to the results, if a smart GPT-like system can automatically resolve half of the errors 00:12:45.400 |
that GPT-4 makes on the MMLU, that would increase its score from around 86.4% to around 93%, which is not far off 95%. 00:12:56.400 |
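To check that arithmetic: GPT-4's error rate is 100 − 86.4 = 13.6 percentage points; halving it leaves 6.8 points of error, and 100 − 6.8 = 93.2%, hence "around 93%".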
Remember, his prediction was a 50% chance in 20 years. 00:13:02.400 |
For those who are still skeptical, I'm going to show you plenty more results now and then walk through the papers that give the theory as to why this works. 00:13:10.400 |
One thing that I forgot to mention earlier is that the human expert level on the MMLU 00:13:14.400 |
is 89.8% and that's taking the 95th percentile of human test takers. 00:13:21.400 |
And remember, those are domain experts in each of the subtopics. 00:13:25.400 |
What we're doing is testing GPT-4 or smart GPT on all of the topics simultaneously. 00:13:32.400 |
So even if smart GPT-like systems can't quite reach 95%, and I think honestly they'll get pretty close with all the refinements that I'm going to suggest, 00:13:40.400 |
I think they will almost certainly surpass that 89.8% human expert level. 00:13:47.400 |
Intrigued by these results, I then put it through the college math test from the MMLU. 00:13:53.400 |
And remember, this was before using the optimized version of the step-by-step prompt. 00:13:57.400 |
Obviously, I'm not going to go through all the questions here, but let's skip to the final results. 00:14:04.400 |
We have zero shot accuracy, 6 out of 15, which is 40%. 00:14:08.400 |
The average when you add let's think step-by-step was 53.5%. 00:14:12.400 |
And then the final output of the resolver model had a 60% accuracy. 00:14:18.400 |
So it couldn't quite resolve half of the errors, but the overall pattern held up. 00:14:22.400 |
In case anyone is wondering about methodology, I kept the formatting identical for every question. 00:14:31.400 |
It wasn't looking at the context of what it had already put out. 00:14:34.400 |
Each attempt was fresh, aside from the resolver model, which looked at the context of the researcher's output. 00:14:41.400 |
And again, as you can see from example 14, it wasn't like the researcher could always spot the errors, 00:14:47.400 |
or that the resolver could always pick the right option. 00:14:50.400 |
Sometimes the let's think step-by-step prompt gave the right output, but the resolver couldn't quite distinguish it. 00:14:57.400 |
The optimized prompt gets a slightly better output. 00:15:00.400 |
And upon reflection, the researcher can sometimes, but not always, spot the errors in those outputs. 00:15:07.400 |
And the resolver can sometimes, but not always, pick out, based on those flaws, which answer is best. 00:15:18.400 |
I have noticed a few themes in those questions. 00:15:21.400 |
Anytime it comes to division, multiplication, characters, or counting in general, 00:15:26.400 |
GPT-4 tends to make mistakes that neither the researcher nor the resolver can spot. 00:15:31.400 |
Of course, integrating a few tools via API would likely solve those issues. 00:15:36.400 |
And I don't want to preempt the conclusion too much, 00:15:39.400 |
but I believe a smart GPT-like system with tools integrated could probably score around 95% right now on the MMLU. 00:15:49.400 |
Especially if it was helped out with few-shot prompting. 00:15:53.400 |
To add weight to that preliminary conclusion, I tested it on certain topics and had to stop because it simply got the questions right every single time. 00:16:01.400 |
For example, high school psychology from the MMLU. 00:16:04.400 |
I then tried prehistory, which it also aced, before finding a topic it struggled with more: machine learning. 00:16:14.400 |
There, the chain-of-thought, let's-think-step-by-step average was 71.6%. 00:16:21.400 |
Let's now look a little deeper into why all of these steps might improve the end result. 00:16:25.400 |
In reply to the original let's think step-by-step paper, which was published around a year ago, one response put it this way: 00:16:33.400 |
"Adding something like let's think step-by-step to the prompt is a way of using the idea that 00:16:37.400 |
the input space for computation is what you'd normally want in the hidden state of the model. 00:16:43.400 |
Instead of the workings out being done in the activations of the neural network, 00:16:47.400 |
it's done in the discrete tokens of that input space." 00:16:54.400 |
And here is the paper released three days ago that improves upon that original prompt. 00:16:59.400 |
They also did their testing zero-shot like me. 00:17:04.400 |
Starting like I did with just direct prompting. 00:17:06.400 |
Just asking the question like 99% of users would do of GPT-4. 00:17:12.400 |
And then they tried like me the well-established let's think step-by-step prompt. 00:17:17.400 |
They also iteratively tested seven original prompts as well as the prompt that I've now integrated into SmartGPT. 00:17:23.400 |
The let's work this out in a step-by-step way, etc. 00:17:26.400 |
They share my opinion that zero-shot prompting setups have the benefit of not requiring such task-dependent selection of exemplars. 00:17:39.400 |
Here are the end results for GPT-4 that we saw earlier. 00:17:42.400 |
Showing the difference between asking directly your question and using these refined prompts. 00:17:47.400 |
Notice that this technique is somewhat model dependent and it doesn't have the same effect on smaller or weaker models. 00:17:54.400 |
Before we move on to the next paper, there is one somewhat failed prompt that I want to pick up on. 00:17:59.400 |
It's this self-critique prompt where they ask: 00:18:01.400 |
"Answer the question then critique the answer. 00:18:04.400 |
reconsider the other answer options and give a single final answer." 00:18:08.400 |
And you might wonder why didn't that prompt perform best when we know that reflection and dialogue can work. 00:18:14.400 |
My theory is because it's trying to do all of it in one prompt. 00:18:18.400 |
Through my hundreds of experiments, I've noticed that GPT-4 can only handle so much in one go. 00:18:24.400 |
It simply gets overwhelmed or confused if you ask it to do too much in one prompt. 00:18:29.400 |
That's why I broke my model into stages to allow it to show off each of its abilities one at a time. 00:18:36.400 |
So what's my personal theory as to why this eliminates up to half of the errors that GPT-4 makes? 00:18:44.400 |
Remember that GPT-4 is drawing on a vast dataset of internet text. 00:18:57.400 |
The kind of data that would contain that sort of step-by-step working would be things like expert tutorials and worked solutions. 00:19:02.400 |
So I believe you're triggering more of the weights inside GPT-4 that relate to things like expert tutorials. 00:19:10.400 |
And so inevitably you're getting slightly better answers. 00:19:13.400 |
Next, I've already explained why you get different outputs when you give the exact same prompt. 00:19:18.400 |
That's down to sampling and the temperature of the model. 00:19:21.400 |
But to simplify massively, sometimes GPT-4 will give you an output that it knows isn't the most probable. 00:19:27.400 |
It introduces some randomness into its sampling. 00:19:33.400 |
By sampling multiple outputs, you're reflecting the full range of probabilities that GPT-4 ascribes to its outputs, 00:19:39.400 |
and you're reducing a little bit some of the randomness that's inherent in GPT-4's outputs. 00:19:44.400 |
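You can see this sampling effect for yourself with a tiny script. A sketch using the pre-1.0 openai package; GPT-4's default temperature isn't publicly documented, so the value below is just illustrative:

```python
import openai  # assumes openai.api_key is set

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Name a prime number between 90 and 100."}],
    n=5,              # five independent samples of the same prompt
    temperature=1.0,  # higher temperature means more diverse samples
)
# At temperature 0 the five answers would almost always be identical;
# at 1.0 you will often see different (and sometimes wrong) answers,
# which is exactly the variation the multi-output stage of SmartGPT exploits.
for i, choice in enumerate(response["choices"], start=1):
    print(i, choice["message"]["content"])
```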
Next, I believe that GPT-4 can sometimes spot its own errors through reflection. 00:19:49.400 |
Because prompting like this triggers a different set of weights. 00:19:53.400 |
You could almost think of it as a different mindset. 00:19:58.400 |
Again, if the question is too hard, or involves counting, characters, division, or multiplication, 00:20:00.400 |
as I said earlier, this often won't work. 00:20:04.400 |
But a percentage of the time, it can spot its own errors and point them out. 00:20:08.400 |
Notice this is a separate bit of inference not lumped into the original prompt. 00:20:12.400 |
And when it does successfully point out the errors, 00:20:15.400 |
it can often engage in this dialogue with itself. 00:20:18.400 |
Notice in a meta kind of way, I'm using the step-by-step prompting to improve the reflection and dialogue. 00:20:35.400 |
Speaking of papers, let's return to the one which produced that prompt that did the best earlier on. 00:20:39.400 |
They came to that special prompt through automatic prompt engineering. 00:20:43.400 |
But there's something interesting I want to point out. 00:20:46.400 |
As they put it, they "use automatic prompt engineering to find a prompt starting with 'Let's' 00:20:51.400 |
that maximizes the likelihood of correct reasoning steps." 00:20:55.400 |
Then they found the best one, which I integrated into SmartGPT: 00:20:58.400 |
let's work this out in a step-by-step way to be sure we have the right answer. 00:21:07.400 |
But the interesting thing to me is they started with "Let's" each time. 00:21:11.400 |
So even that first stage for the model might not yet be fully optimized. 00:21:15.400 |
Maybe there's a prompt that doesn't begin with "Let's" that improves this initial result still further. 00:21:22.400 |
I know many people watching this will wonder if I read the paper Boosting Theory-of-Mind 00:21:27.400 |
Performance in Large Language Models via Prompting. 00:21:30.400 |
And yes, I did, because they tested something similar for a theory of mind test. 00:21:34.400 |
Using similar techniques, they were able to get theory of mind accuracy for GPT-4 from 80% to 100%. 00:21:41.400 |
And they conclude that these results demonstrate that appropriate prompting enhances large language model theory of mind reasoning. 00:21:48.400 |
And they underscore the context dependent nature of these models' cognitive capacities. 00:21:56.400 |
Notice that they used the same step-by-step formula for the first two examples. 00:21:59.400 |
So they did improve the results dramatically. 00:22:03.400 |
I believe adding few-shot examples would push this still further. 00:22:06.400 |
This is part of why I think that 95% barrier on the MMLU will be broken probably this year by GPT-4. 00:22:14.400 |
They admit that there is not currently a theoretical understanding of why these prompting methods are so important. 00:22:20.400 |
So they are not sure exactly why these prompting techniques are beneficial. 00:22:25.400 |
But one of their findings stood out to me. 00:22:35.400 |
Giving it generic few-shot prompts that weren't directly theory of mind 00:22:39.400 |
actually improved the outputs slightly more than giving it direct theory of mind examples. 00:22:45.400 |
This opens the door to the first of the five ways I anticipate smart GPT getting even smarter. 00:22:51.400 |
It could be possible to come up with generic few shot prompts 00:22:54.400 |
that could be automatically integrated into the model 00:22:57.400 |
that don't necessarily relate to the topic at hand. 00:23:00.400 |
This graph shows the impact of adding few shot examples to GPT-3. 00:23:04.400 |
And if this can be done in a generic way for GPT-4, the results could be pushed higher still. 00:23:10.400 |
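To make this first improvement concrete, here is roughly what a generic few-shot preamble could look like in chat format. The two worked examples below are placeholders I've invented; they are deliberately unrelated to whatever topic the final question is about:

```python
# Generic few-shot preamble: short exchanges showing careful reasoning,
# unrelated to the target topic (both examples are invented placeholders).
GENERIC_SHOTS = [
    {"role": "user", "content": "If a train leaves at 3pm and the trip takes 2.5 hours, when does it arrive?"},
    {"role": "assistant", "content": "2.5 hours is 2 hours 30 minutes. 3:00pm plus 2:30 is 5:30pm, so it arrives at 5:30pm."},
    {"role": "user", "content": "Is 51 a prime number?"},
    {"role": "assistant", "content": "The digits sum to 5 + 1 = 6, which is divisible by 3, so 51 = 3 x 17. Not prime."},
]

def with_generic_shots(question: str) -> list:
    """Prepend the generic examples to any question, regardless of topic."""
    return GENERIC_SHOTS + [{"role": "user", "content": question}]
```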
Next, the boosting theory of mind paper speculates 00:23:16.400 |
that prompting techniques like these could boost the performance of weaker models to beyond the levels of GPT-4's zero-shot accuracy. 00:23:25.400 |
And it was the DERA paper that inspired me to have the researcher and resolver dialogue at the end of SmartGPT. 00:23:30.400 |
As they say, the DERA approach shows significant improvement over base GPT-4 performance. 00:23:36.400 |
And these were open-ended questions by the way, not multiple choice. 00:23:39.400 |
So this is more generally applicable than you might think. 00:23:42.400 |
You can see from this table how results improved after engaging in this dialogue. 00:23:47.400 |
And that brings me to the second way I anticipate smart GPT getting smarter in the future. 00:23:54.400 |
At the moment we have this simple researcher and resolver, two-step dialogue. 00:24:02.400 |
You can imagine a mathematician chipping in, and a philosopher, and a professor. 00:24:06.400 |
Each one tapping into slightly different weights of GPT-4. 00:24:13.400 |
I'm not saying that would transform the results, but it might edge them another few percent higher. 00:24:18.400 |
Next, even with longer dialogues and different experts, 00:24:21.400 |
we could find ways of optimizing these prompts, 00:24:24.400 |
just like we did with the original "Let's think step by step". 00:24:27.400 |
That's the third avenue of improvement that I envisage. 00:24:29.400 |
Because I came up with these prompts, I'm sure they could be improved. 00:24:32.400 |
Next, we could experiment with different temperatures. 00:24:35.400 |
Remember, a lower temperature makes the model more conservative. 00:24:38.400 |
A higher one, towards one, makes it more creative. 00:24:41.400 |
We could experiment with a higher temperature to produce a more diverse range of outputs at this stage. 00:24:47.400 |
And then perhaps a more conservative, deterministic temperature for the researcher and resolver stages. 00:24:58.400 |
The fifth avenue is integrating APIs for character counting, calculators, code interpreters, etc. 00:25:03.400 |
Spending these weeks manually sorting through the outputs of GPT-4 on these benchmarks, 00:25:11.400 |
I've seen it fail in the same ways again and again. It's often by getting letters in the wrong order, 00:25:16.400 |
or by miscounting or miscalculating. It gets the high-level logic right, and then makes quite simple errors. 00:25:19.400 |
Basic tool integration would, I am sure, push the results still higher. 00:25:23.400 |
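As a toy illustration of what basic tool integration means, you intercept the operations GPT-4 fumbles and do them in ordinary code instead. The function names here are mine, hypothetical:

```python
import re

def eval_arithmetic(expr: str) -> str:
    """Evaluate a plain arithmetic expression exactly,
    instead of trusting the model's mental arithmetic."""
    if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        raise ValueError("not a plain arithmetic expression")
    return str(eval(expr))  # tolerable here: the regex only admits digits and operators

def count_letters(word: str, letter: str) -> int:
    """Character counting, another operation GPT-4 reliably fumbles."""
    return word.count(letter)

print(eval_arithmetic("1234 * 5678"))     # 7006652
print(count_letters("mississippi", "s"))  # 4
```

The model's job then shrinks to deciding when to call the tool and what to pass it, which is what plugins promise.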
Now I know this isn't my usual video, and trust me, I have been following the AI news. 00:25:30.400 |
I'm determined to make those improvements and push SmartGPT even further, 00:25:34.400 |
but of course that would be aided massively by getting access to the plugins and the GPT-4 API key. 00:25:41.400 |
So far I've had to do all of this manually, which was a lot of work. 00:25:45.400 |
Now as you saw earlier, I have drawn on GPT-4 to help me develop a project. 00:25:48.400 |
I'm also working on a program in Replit to automate this process, 00:25:51.400 |
but at the moment it's GPT-3.5, and honestly the context window really limits the ability. 00:25:57.400 |
But I do look forward to the day when I can integrate GPT-4, 00:26:00.400 |
and put this out as an automatic model for people to test and play about with. 00:26:05.400 |
I am sure that something similar will ultimately be incorporated by OpenAI itself, 00:26:12.400 |
a bit like Bing has creative, precise, balanced, etc. 00:26:17.400 |
A system like this takes longer and costs more compute, but as you've seen, the outputs are noticeably better. 00:26:20.400 |
If the results of models like this one do officially exceed the 86.4% 00:26:26.400 |
that OpenAI talked about in the GPT-4 technical report, 00:26:30.400 |
I do think that would reveal quite a few things. 00:26:32.400 |
First, that OpenAI isn't even aware of the full capabilities of its own model. 00:26:37.400 |
I don't even know if they anticipated things like AutoGPT. 00:26:40.400 |
I do think it would reveal that they need to do far more proper testing of their models before they release them. 00:26:46.400 |
They should make falsifiable predictions about what their models won't be capable of. 00:26:51.400 |
That way we would know just how much they know about their own models. 00:26:55.400 |
What we're trying to avoid is a situation where OpenAI say their model can only achieve X, 00:27:00.400 |
and then when they release the model in the wild, 00:27:11.400 |
people find it can actually achieve far more than X. My second goal was to run you through some of the fascinating papers that have been released in the last few days and weeks, 00:27:15.400 |
the third goal was to show you what this model could do with some official benchmarks, 00:27:19.400 |
and suggest ways it might get better in the near term future. 00:27:25.400 |
If you can help me with that, or are an expert in benchmarking systems like GPT-4, do get in touch. 00:27:30.400 |
I guess the final goal was to perhaps suggest to you that OpenAI don't know as much about their own models as we might assume. Thank you for watching.