I would not blame you if you thought that all talk about GPT-4 or ChatGPT-4 is just that, talk. But we actually can have a surprising amount of confidence in the ways in which GPT-4 will improve on ChatGPT. By examining publicly accessible benchmarks, comparable large language models like PaLM, and the latest research papers, which I've spent dozens of hours reading, we can discern at least eight clear ways in which GPT-4, integrated into Bing or otherwise, will beat ChatGPT.
I'm going to show you how unreleased models already beat current ChatGPT. And all of this will give us a clearer insight into what even GPT-5 and future rival models from Google might well soon be able to achieve. There are numerous benchmarks that PaLM, Google's large language model, and by extension GPT-4, will beat ChatGPT on. But the largest and most impressive is the BIG-bench set of tasks: more than 200 language-modeling tasks, and I've studied almost all of them. And you can see the approximate current state of affairs summarized in this graph, where the latest models are now beating the average human and showing dramatic improvement on previous models.
ChatGPT would be somewhere around this point: lower than what is privately available, but better than previous models down here. But this just skims the surface. I want to show you in detail the eight ways that you can expect ChatGPT-4 or GPT-4 to beat the current ChatGPT. And no, that's not just because it's going to have more parameters, off to the right of this graph at 10 to the 12, a trillion parameters.
It's also because compute efficiency will improve, chain-of-thought prompting will be integrated, and the number of tokens it's trained on might go up by an order of magnitude. This is a very important aspect of ChatGPT-4.
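To make the chain-of-thought idea concrete: it is simply a way of structuring the prompt so that the model sees worked reasoning before it answers, rather than a bare question. Here is a minimal sketch in Python of how such a prompt can be assembled. The exemplar text is invented for illustration and no real model API is called; it just shows the shape of the technique.

```python
def build_cot_prompt(question: str) -> str:
    """Build a chain-of-thought prompt: one worked exemplar whose answer
    spells out its reasoning step by step, followed by the new question."""
    exemplar = (
        "Q: Shelley is visiting the city with the famous market where "
        "they throw the fish. Is she near the Pacific Ocean?\n"
        "A: The market where they throw the fish is Pike Place Market, "
        "which is in Seattle. Seattle is on the Pacific coast. "
        "So the answer is yes.\n\n"
    )
    # The exemplar's step-by-step answer nudges the model to reason
    # out loud before committing to a final answer to the new question.
    return exemplar + f"Q: {question}\nA:"

print(build_cot_prompt("Is it likely Shelley is near the Atlantic Ocean?"))
```

The whole trick is in the exemplar: research on chain-of-thought prompting reports that showing reasoning in the examples makes large models emit reasoning of their own, which is one of the improvements being described here.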
Lots of reasons why GPT-4 will be better. Let's start with logic and logical inference. This example comes from Google's PaLM research paper. The question, or input, was this: Shelley is from Virginia, but is visiting that city with that famous market where they throw the fish (so vague), going home next Tuesday.
Question: is it likely that Shelley will be near the Pacific Ocean this weekend? And you can see how the improved model is able to deduce that Shelley is likely to be near the Pacific Ocean. Whereas if you ask current ChatGPT this question, what you get is: based on the information given, it's not possible to determine. The statement only mentions that Shelley is from Virginia and visiting a city with a famous market. It really can't handle it. It can't do that level of logical inference. Here is another great example. This test of critical reasoning and logic was designed, again, for the BIG-bench benchmark.
And it was tested on different language models, and most of them fail, including ChatGPT. I gave it this question and it picked the wrong answer. You can examine the question yourself, but C is not the correct answer. It gets it wrong.
However, let's take a look at the graph beneath at other language models, ones to come, GPT-4 maybe. And look what happens: as the models increase in effective parameter count and other things like token size, look at the performance.
We start to beat not only average raters but all previous models, and approximate the performance of the best human rater.
The top line is the best human rater; the blue line is the average human rater. Three-shot means the model was given three examples of what was expected before being tested. These best unreleased models, and you can imagine GPT-4 would be around the same level, crush what ChatGPT was capable of.
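Since "three-shot" just means the prompt is stuffed with three solved examples before the real question, it is easy to sketch. The task and the example pairs below are made up purely for illustration; the point is the prompt format, not the content.

```python
def build_few_shot_prompt(examples, question):
    """Concatenate k solved (question, answer) pairs ahead of the real
    question: 'k-shot' prompting. Three pairs gives a three-shot prompt."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
    return shots + f"Q: {question}\nA:"

# A three-shot prompt: three worked examples, then the test question.
three_shot = build_few_shot_prompt(
    [("Is 7 even?", "No"), ("Is 12 even?", "Yes"), ("Is 9 even?", "No")],
    "Is 20 even?",
)
print(three_shot)
```

This is why benchmark charts distinguish zero-shot, one-shot, and three-shot scores: the same model can perform very differently depending on how many examples it is shown before being tested.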
You can imagine what this means in terms of GPT-4 giving more rigorous arguments. Or conversely, you can give vague inputs, like this thing talking about a famous market where they throw the fish, and GPT-4 might well be able to understand exactly what you mean. And to be honest, if you thought that's interesting, we are just getting started.
Next, jokes. On the left you can see a computer science-y type of joke that it was able to explain. But I tested ChatGPT on a variety of jokes and some of them it could explain,