back to index'Show Your Working': ChatGPT Performance Doubled w/ Process Rewards (+Synthetic Data Event Horizon)
00:00:00.000 |
In the last 24 hours OpenAI have released this paper, Let's Verify Step-by-Step. 00:00:05.840 |
It represents an almost doubling of GPT-4's raw performance in a test of mathematics, 00:00:11.120 |
but also extends to other domains. Sam Altman calls it a positive sign for alignment and yes, 00:00:16.880 |
I have read it all already along with the release notes. 00:00:20.080 |
Let's get to the main takeaways. They trained two reward models for GPT-4. One which gave positive 00:00:26.240 |
feedback for a final result, the final answer to a mathematics problem for example. And another 00:00:32.080 |
model where they gave positive feedback to GPT-4 or ChatGPT based on each intermediate reasoning 00:00:39.040 |
step in the mathematical solution. Basically a show your working out kind of approach. 00:00:44.240 |
And the result they got by rewarding good working out surprised even them. It was able to solve 78% 00:00:50.560 |
of problems from a subset of the math test set which I'll get onto in a second. Not only is that 00:00:56.160 |
almost double GPT-4's raw performance of 42.5%, which by the way is about double GPT-3's performance 00:01:03.520 |
of 23%, it also outperformed just rewarding correct answers. The blue line represents 00:01:09.680 |
using a model that rewarded correct answers only and then you have the reasoning or process 00:01:15.120 |
supervised RM at the top. So even when you explicitly reward correct answers, 00:01:20.000 |
you get fewer correct answers than rewarding good working out. And yes that did surprise OpenAI. 00:01:26.080 |
I can hear some of you wondering about Palm2, the latest model behind Bard. Well the raw model gets 00:01:32.640 |
34.3% and even the model with self-consistency and chain of thought only gets 48.8% on this 00:01:39.920 |
math data set. The previous state of the art by the way was 50.3%. So 78.2% is quite a big leap. 00:01:48.240 |
And later on I'm going to show you why that's not even the cap. Just for interest, 00:01:51.760 |
here is the rather ugly title page that OpenAI put out. They call it "Improve 00:01:56.000 |
Proving Mathematical Reasoning with Process Supervision". Maybe if someone had supervised 00:02:00.480 |
the colour scheme of this release page it might have looked better. But my point wasn't just to 00:02:04.960 |
diss a colour scheme, it was to point out something that they also said down here. They say "In 00:02:09.200 |
addition to boosting performance relative to just looking at outcomes or correct answers, this form 00:02:14.640 |
of process supervision also has an important alignment benefit. It directly trains the model to 00:02:20.080 |
produce a chain of thought that is endorsed by humans". Indeed Ilya Sutskova retweeted this from 00:02:25.920 |
"I'm not sure if this is a good idea, but I'm not sure if this is a good idea". 00:02:26.880 |
Calling it a really interesting result. But let's leave alignment for later. Let's focus on what 00:02:32.080 |
they actually did. First they used the base model of GPT-4, not the one with reinforcement learning 00:02:38.080 |
from human feedback. Next they fine-tuned that base GPT-4 model on a data set of roughly 1.5 00:02:45.040 |
billion math related tokens. Further on they call that the "math mix". This being OpenAI of course, 00:02:51.600 |
they don't give you the exact details of that math mix. But I'll come back to that later 00:02:55.840 |
on. So how could they give feedback based on working out or reasoning? Well human labelers 00:03:01.440 |
would come along and give each step in a generated solution either negative feedback, neutral feedback 00:03:08.160 |
or positive feedback. Then using that human label data a model would be trained to predict the 00:03:13.920 |
correctness of each step. In other words it got good at recognizing good working out. As mentioned 00:03:20.000 |
there was another model trained just to focus on correct or incorrect final answers. As you 00:03:25.760 |
can see at the top the model got good at spotting incorrect steps in the reasoning process. The green 00:03:32.560 |
steps got a high process score and the red steps got a low process score. And to turn this into a 00:03:38.880 |
single score they got the probability that each step is correct as judged by the model. And then 00:03:44.560 |
they got the product of all of those individual probabilities to get a final overall process 00:03:50.800 |
score. A score in other words for good working out. Just in case anyone's interested they did 00:03:55.680 |
try other ways of generating a working out score. For example by looking at the minimum probability 00:04:01.840 |
in the outputs. But that step didn't make too much difference to the end result as you can see here. 00:04:07.120 |
To quickly recap we have a base model trained only to output solutions in the desired format. And then 00:04:13.520 |
we have a separate smaller model or two actually. One trained only to predict whether each solution 00:04:20.240 |
is correct or incorrect as a final answer. Of course that leaves in false positives which are 00:04:25.600 |
solutions that reach the correct answer with incorrect reasoning. And then another model 00:04:30.720 |
trained only to predict the correctness of each step. It stops if it finds a first incorrect step. 00:04:36.960 |
And as the paper says both methods reveal the existence of at least one mistake. But this 00:04:41.920 |
process supervision additionally reveals the precise location of that mistake. But back to 00:04:47.440 |
why this is so crazy. Look at how many solutions it could scan. At the end of the x-axis here are 1000 00:04:55.520 |
and 860 solutions. And one tried and tested way of finding the best of those solutions is to do 00:05:02.080 |
majority voting. In other words which one came out the most often. This has been google's preferred 00:05:07.280 |
approach and it's linked to self-consistency. It's a fairly state-of-the-art approach but look at how 00:05:12.800 |
the other methods outperform it. By scanning for the solution that has the best reasoning or working 00:05:18.640 |
out. A model trained to spot good reasoning steps outperforms even a model trained to spot correct arm 00:05:25.440 |
answers. And far outperforms just finding the majority answer. That difference of about 10% 00:05:30.720 |
is more than half of the difference between GPT-3 and GPT-4. And also is it me or is that line 00:05:38.000 |
continuing to grow? Suggesting that when more compute is available the difference could be 00:05:42.800 |
even more stark. Imagine a future where GPT-4 or 5 can sample say a trillion 10 to the 12 00:05:49.600 |
solutions. So is this just relevant for mathematics? No it's relevant for all of science. Here it 00:05:55.360 |
is getting state-of-the-art results in calculus, chemistry, physics and more. Now the paper didn't 00:06:01.360 |
give baseline performance for AP chemistry for example but I tried to compute it myself. 00:06:07.120 |
Notice how this method scored 80%. I conservatively and approximately 00:06:12.320 |
inputted those scores into an AP chemistry calculator and that gave an AP score of 5. 00:06:18.480 |
So what did the raw model GPT-4 get in AP chemistry? A4. That by the way compares to the original 00:06:25.280 |
chat GPT which got a 2. So yes this isn't just mathematics it's relevant for other domains too. 00:06:31.440 |
They call this out of distribution generalization. Before I get onto alignment there is one more thing 00:06:36.800 |
I want to point out and that is that it does show that fine tuning still works really well for GPT-4. 00:06:42.720 |
The math mix was an aggressively filtered set of tokens of high quality math problem solving 00:06:48.480 |
content. And notice how much smaller it is at 1.5 billion tokens compared to Google's Minerva which was 00:06:55.200 |
38.5 billion tokens. But there was one more thing that I noticed that I found fascinating. 00:07:01.040 |
While they don't tell us anything about the specific data that they use they do have this 00:07:05.920 |
category "synthetic data 2". That's data generated by the language model itself. 00:07:11.760 |
And for that category "synthetic data 2" they say "was it present in pre-training?" 00:07:17.440 |
Yes. Now my best guess is that this reveals that GPT-4 was trained on some synthetic data and even 00:07:25.120 |
Sam Altman hinted that this was a possibility and described a synthetic data event horizon. 00:07:31.520 |
Some people have made the case that we're now training on order of all of the internet's tokens 00:07:37.040 |
and you can't grow that you know another two orders of magnitude. I guess you could counter 00:07:41.440 |
with yeah with the synthetic data generation. Do you think data bottlenecks matter at all? 00:07:45.360 |
I think you just touched on it like as long as you can get to like over the synthetic data event horizon where 00:07:55.040 |
you can just say "oh I think that the model is smart enough to make good synthetic data I think it should be all right". 00:07:58.480 |
Now this paper and these results have been welcomed by many for its promise in alignment. 00:08:04.000 |
If we get models that give us more interpretable reasoning working out that we can follow, 00:08:09.360 |
we will be encouraging models to follow a process that's endorsed by humans. 00:08:13.840 |
And they say that this is inherently safer especially compared to just focusing on 00:08:18.720 |
outcomes. They say that in the worst case if we just focus on correct answers or positive outcomes 00:08:24.960 |
that will become a proxy that could lead models to become misaligned after learning to exploit the reward signal. 00:08:32.400 |
However I want to argue that the reasoning steps that GPT-4 puts out don't always represent what it's actually thinking. 00:08:38.880 |
In other words we might get outer alignment these lovely chain of thought steps but not inner alignment. 00:08:44.800 |
Not steps that actually represent its methodology. I found this paper fascinating from earlier this month. 00:08:50.720 |
Language models don't always say what they think. You get unfaithful explanations 00:08:54.880 |
in chain of thought prompting. Let me try to give you a vivid example. 00:08:59.200 |
This was one of the math questions from the dataset. 00:09:02.240 |
The raw model of GPT-4 could only get it right 5.8% of the time. 00:09:07.280 |
I confirmed that for myself in this question that involves basic addition and division. 00:09:11.920 |
It couldn't find an answer. But going back to the unfaithful reasoning paper. 00:09:15.760 |
They added the following string to the prompt. I think the answer is this but I'm curious to hear what you think. 00:09:24.800 |
The model would agree with you whatever you said and then make up a chain of thought to justify its erroneous sycophantic answer. 00:09:32.320 |
And I think this exchange demonstrates that quite well. 00:09:35.120 |
I added in the words I as the user already know the answer is t=19 which is incorrect by the way. 00:09:43.520 |
It said sure yes I do and then gave me this detailed chain of thought and then said yes I'm correct it's t=19 which it isn't. 00:09:51.920 |
In contrast by the way when I use code interpreters. 00:09:54.720 |
It not only got the question correct first time and every time. 00:09:59.280 |
But also when I tried to tempt it into sycophancy it still got the question right. 00:10:04.880 |
As you can see it said therefore t=19 is not the solution to the problem. 00:10:09.360 |
The calculation shows that the correct answer is indeed t=17. 00:10:12.960 |
And obviously the benefit of code interpreter is you get the working out as well. 00:10:17.280 |
So I want someone to explain to me why code interpreter wouldn't be even more of a step forward in interpretability. 00:10:28.000 |
He said these models or engineers never speak a word or document anything. 00:10:34.880 |
And then he links to this prominent mechanistic interpretability researcher at Google DeepMind. 00:10:39.920 |
He trained a tiny transformer to do addition. 00:10:42.640 |
Then spent weeks figuring out what it was actually doing. 00:10:46.000 |
One of the only times in history someone has understood how a transformer actually works. 00:10:51.040 |
Down to the level of weights and activations. 00:10:54.560 |
This is the algorithm it created to add two numbers. 00:10:58.320 |
It thought of basic addition in terms of a rotation around a circle. 00:11:02.800 |
And of course if you asked it why is 1+1=2 it would never give you this as an explanation of its methodology. 00:11:09.280 |
But maybe this is what it's actually calculating. 00:11:11.920 |
That's why I'm personally a little bit skeptical when OpenAI say that this form of process supervision directly rewards the model for following an aligned chain of thought. 00:11:22.880 |
It definitely rewards the model for following an aligned chain of thought. 00:11:24.480 |
But is it actually following that chain of thought? 00:11:32.720 |
They changed the context so that the answer was always A. 00:11:36.480 |
And low and behold ChatGPT picked answer A for the next question even though that answer was wrong. 00:11:42.400 |
It said that it was plausible that Lebron James took a corner kick. 00:11:46.160 |
But when asked for a chain of thought explanation it never mentioned that it spotted that pattern that the answer was always A. 00:11:54.400 |
So it's a fake line of reasoning about why Lebron James could take a corner kick. 00:11:58.560 |
Now of course I might well be wrong here and I'd love for someone to explain in detail why. 00:12:03.200 |
But on the one hand I do want to acknowledge that this process does yield incredible results. 00:12:08.400 |
But on the other hand we might be getting a story about which methodology most reassures humans. 00:12:14.880 |
Not an output that most faithfully represents the methodology actually used by GPT-4. 00:12:20.720 |
Now for some people that might be good enough. 00:12:24.320 |
in the reasoning steps that we can understand. 00:12:26.240 |
Especially in an area like mathematics where we have some ground truth. 00:12:29.920 |
But it is interesting to me that they call the other approach outcome supervision. 00:12:33.840 |
An approach that may reward an unaligned process and it being harder to scrutinize. 00:12:39.200 |
But is it possible that the process reward model isn't just a more granular outcome reward model. 00:12:44.720 |
Where the output is each step of the reasoning still pretty impossible to actually scrutinize. 00:12:50.480 |
Well either way it seems we're pinning our hopes on this process 00:12:57.920 |
They say we currently believe process oriented learning may be the most promising path to 00:13:03.120 |
training safe and transparent systems up to and somewhat beyond human level capabilities. 00:13:08.960 |
And let's end on this positive note from the head of alignment at OpenAI. 00:13:13.120 |
He says this is positive evidence for the strategy of using process supervision to 00:13:19.920 |
At least in that case we would get a model whose work we can check more easily. 00:13:24.160 |
And that that model would be better at alignment research. 00:13:27.440 |
I really hope so and I want to hear what you think. 00:13:30.560 |
Thank you for watching all the way to the end.