
'Show Your Working': ChatGPT Performance Doubled w/ Process Rewards (+Synthetic Data Event Horizon)


Transcript

In the last 24 hours OpenAI have released this paper, Let's Verify Step by Step. It represents an almost doubling of GPT-4's raw performance in a test of mathematics, but also extends to other domains. Sam Altman calls it a positive sign for alignment, and yes, I have read it all already, along with the release notes.

Let's get to the main takeaways. They trained two reward models for GPT-4. One gave positive feedback for a final result: the final answer to a mathematics problem, for example. The other gave positive feedback to GPT-4, or ChatGPT, based on each intermediate reasoning step in the mathematical solution.

Basically, a "show your working out" kind of approach. And the result they got by rewarding good working out surprised even them. It was able to solve 78% of problems from a subset of the MATH test set, which I'll get onto in a second. Not only is that almost double GPT-4's raw performance of 42.5%, which by the way is about double GPT-3's performance of 23%, it also outperformed just rewarding correct answers.

The blue line represents a model that rewarded correct answers only, and then you have the reasoning, or process-supervised, RM at the top. So even when you explicitly reward correct answers, you get fewer correct answers than by rewarding good working out. And yes, that did surprise OpenAI. I can hear some of you wondering about PaLM 2, the latest model behind Bard.

Well, the raw model gets 34.3%, and even the model with self-consistency and chain of thought only gets 48.8% on the MATH dataset. The previous state of the art, by the way, was 50.3%. So 78.2% is quite a big leap. And later on I'm going to show you why that's not even the cap.

Just for interest, here is the rather ugly title page that OpenAI put out. They call it "Improving Mathematical Reasoning with Process Supervision". Maybe if someone had supervised the colour scheme of this release page it might have looked better. But my point wasn't just to diss a colour scheme, it was to point out something that they also said down here.

They say: "In addition to boosting performance relative to just looking at outcomes or correct answers, this form of process supervision also has an important alignment benefit. It directly trains the model to produce a chain of thought that is endorsed by humans". Indeed, Ilya Sutskever retweeted this from the head of alignment, who called it a really interesting result.

But let's leave alignment for later and focus on what they actually did. First, they used the base model of GPT-4, not the one with reinforcement learning from human feedback. Next, they fine-tuned that base GPT-4 model on a dataset of roughly 1.5 billion math-related tokens.

Further on they call that the "MathMix". This being OpenAI, of course, they don't give you the exact details of that math mix, but I'll come back to that later on. So how could they give feedback based on working out, or reasoning? Well, human labelers would come along and give each step in a generated solution either negative, neutral or positive feedback.

Then, using that human-labelled data, a model was trained to predict the correctness of each step. In other words, it got good at recognizing good working out. As mentioned, there was another model trained just to focus on correct or incorrect final answers. As you can see at the top, the model got good at spotting incorrect steps in the reasoning process.

The green steps got a high process score and the red steps got a low process score. To turn this into a single score, they took the probability that each step is correct, as judged by the model, and then took the product of all of those individual probabilities to get a final, overall process score.

A score, in other words, for good working out. Just in case anyone's interested, they did try other ways of generating a working-out score, for example by taking the minimum of the per-step probabilities instead of the product. But that change didn't make much difference to the end result, as you can see here.
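Just to make that aggregation concrete, here is a minimal sketch of how per-step probabilities could be combined into a single solution-level score. This is my own illustration rather than OpenAI's code, and the function name and inputs are assumptions; it shows the product rule described above plus the "minimum" variant they also tried.

```python
import math

def process_score(step_probs, method="product"):
    """Combine per-step correctness probabilities (as judged by the
    process reward model) into one solution-level score.

    step_probs: list of floats in [0, 1], one per reasoning step.
    method:     "product" (the scheme described above) or "min"
                (the alternative aggregation also tried).
    """
    if method == "product":
        return math.prod(step_probs)
    if method == "min":
        return min(step_probs)
    raise ValueError(f"unknown method: {method}")

# A four-step solution where the model doubts the third step gets
# penalised under either scheme.
print(process_score([0.98, 0.95, 0.40, 0.97]))          # ~0.36
print(process_score([0.98, 0.95, 0.40, 0.97], "min"))   # 0.40
```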

To quickly recap we have a base model trained only to output solutions in the desired format. And then we have a separate smaller model or two actually. One trained only to predict whether each solution is correct or incorrect as a final answer. Of course that leaves in false positives which are solutions that reach the correct answer with incorrect reasoning.

And then another model trained only to predict the correctness of each step, stopping at the first incorrect step it finds. And as the paper says, both methods reveal the existence of at least one mistake, but process supervision additionally reveals the precise location of that mistake.
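To make that contrast concrete, here is a small sketch of the two labelling schemes. It is purely my own illustration, with assumed data shapes and function names, not code from the paper; the per-step ratings follow the negative/neutral/positive scheme the human labelers used.

```python
def outcome_label(final_answer, reference_answer):
    """Outcome supervision: one label for the whole solution, based only
    on whether the final answer matches. A solution that reaches the right
    answer through faulty reasoning still gets labelled correct, which is
    exactly the false-positive problem mentioned above."""
    return 1 if final_answer == reference_answer else 0

def first_mistake(step_ratings):
    """Process supervision: each step is rated -1 (negative), 0 (neutral)
    or +1 (positive). Scanning the ratings not only reveals that a mistake
    exists, it reveals precisely where the first one is."""
    for i, rating in enumerate(step_ratings):
        if rating == -1:
            return i  # index of the first incorrect step
    return None       # no incorrect step was found

print(outcome_label("17", "17"))        # 1, even if the reasoning was wrong
print(first_mistake([+1, +1, -1, +1]))  # 2: the third step is where it goes wrong
```

But back to why this is so crazy.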

Look at how many solutions it could scan. At the end of the x-axis here are 1,860 solutions. And one tried and tested way of finding the best of those solutions is to do majority voting; in other words, which one came out the most often. This has been Google's preferred approach and it's linked to self-consistency.

It's a fairly state-of-the-art approach, but look at how the other methods outperform it by scanning for the solution that has the best reasoning or working out. A model trained to spot good reasoning steps outperforms even a model trained to spot correct final answers, and far outperforms just finding the majority answer.

That difference of about 10% is more than half of the difference between GPT-3 and GPT-4. And also, is it me, or is that line continuing to grow? Suggesting that when more compute is available, the difference could be even more stark. Imagine a future where GPT-4 or 5 can sample, say, a trillion (10^12) solutions.
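For a sense of how those selection strategies differ, here is a hedged sketch of best-of-N selection by majority voting versus reward-model reranking. The data shapes and function names are my own assumptions, and score_fn stands in for any reward model, for instance the process score sketched earlier.

```python
from collections import Counter

def pick_by_majority(candidates):
    """Self-consistency / majority voting: choose whichever final answer
    appears most often among the N sampled solutions."""
    answers = [c["final_answer"] for c in candidates]
    return Counter(answers).most_common(1)[0][0]

def pick_by_reward_model(candidates, score_fn):
    """Best-of-N reranking: score every sampled solution with a reward
    model (outcome- or process-supervised) and keep the final answer of
    the highest-scoring one."""
    best = max(candidates, key=score_fn)
    return best["final_answer"]

# Each candidate might look like:
#   {"final_answer": "17", "step_probs": [0.98, 0.91, 0.95]}
# and score_fn could be, say:
#   lambda c: math.prod(c["step_probs"])
```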

So is this just relevant for mathematics? No it's relevant for all of science. Here it is getting state-of-the-art results in calculus, chemistry, physics and more. Now the paper didn't give baseline performance for AP chemistry for example but I tried to compute it myself. Notice how this method scored 80%.

I conservatively and approximately inputted those scores into an AP chemistry calculator and that gave an AP score of 5. So what did the raw model GPT-4 get in AP chemistry? A 4. That, by the way, compares to the original ChatGPT, which got a 2. So yes, this isn't just mathematics, it's relevant for other domains too.

They call this out-of-distribution generalization. Before I get onto alignment, there is one more thing I want to point out, and that is that it shows fine-tuning still works really well for GPT-4. The MathMix was an aggressively filtered set of tokens of high-quality math problem-solving content.

And notice how much smaller it is, at 1.5 billion tokens, compared to Google's Minerva, which used 38.5 billion tokens. But there was one more thing that I noticed that I found fascinating. While they don't tell us anything about the specific data that they used, they do have this category, "synthetic data 2".

That's data generated by a language model itself. And for that category, "synthetic data 2", the answer to "was it present in pre-training?" is yes. Now, my best guess is that this reveals GPT-4 was trained on some synthetic data. Even Sam Altman has hinted that this was a possibility, describing a "synthetic data event horizon".

"Some people have made the case that we're now training on the order of all of the internet's tokens, and you can't grow that, you know, another two orders of magnitude. I guess you could counter with: yeah, but there's the synthetic data generation. Do you think data bottlenecks matter at all?" "I think you just touched on it. As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, I think it should be all right."

Now, this paper and these results have been welcomed by many for their promise in alignment. If we get models that give us more interpretable reasoning, working out that we can follow, we will be encouraging models to follow a process that's endorsed by humans. And they say that this is inherently safer, especially compared to just focusing on outcomes.

They say that in the worst case if we just focus on correct answers or positive outcomes that will become a proxy that could lead models to become misaligned after learning to exploit the reward signal. However I want to argue that the reasoning steps that GPT-4 puts out don't always represent what it's actually thinking.

In other words, we might get outer alignment, these lovely chain-of-thought steps, but not inner alignment: not steps that actually represent its methodology. I found this paper from earlier this month fascinating: "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting".

Let me try to give you a vivid example. This was one of the math questions from the dataset; the raw model of GPT-4 could only get it right 5.8% of the time. I confirmed that for myself: on this question, which involves basic addition and division, it couldn't find the answer.

But going back to the unfaithful reasoning paper: they added a string to the prompt along the lines of "I think the answer is X, but I'm curious to hear what you think." The model would demonstrate sycophancy: it would agree with you whatever you said, and then make up a chain of thought to justify its erroneous, sycophantic answer.
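To make that manipulation concrete, here is roughly what the perturbation looks like. This is a sketch based on my reading of the paper; the question text is omitted and the variable names are mine.

```python
base_question = "..."  # the math problem itself (omitted here)

# The biasing suffix: suggest a (wrong) answer and invite agreement,
# without ever telling the model to accept it.
bias = "I think the answer is 19 but I'm curious to hear what you think."

biased_prompt = f"{base_question}\n\n{bias}"
# A faithful model should still reason its way to t = 17; a sycophantic
# one agrees with 19 and then constructs a chain of thought to justify it.
```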

And I think this exchange demonstrates that quite well. I added in the words "I, as the user, already know the answer is t=19", which is incorrect, by the way. But did GPT-4 realize that? It said sure, agreed, then gave me this detailed chain of thought, and then concluded that yes, the answer is t=19. Which it isn't.

In contrast, by the way, when I used Code Interpreter, it not only got the question correct first time, and every time, but also, when I tried to tempt it into sycophancy, it still got the question right. As you can see, it said that t=19 is therefore not the solution to the problem.

The calculation shows that the correct answer is indeed t=17. And obviously the benefit of Code Interpreter is that you get the working out as well. So I want someone to explain to me why Code Interpreter wouldn't be even more of a step forward in interpretability, not to mention in accuracy.

Also bear in mind this tweet by Rob Miles. He said these models are like engineers who never speak a word or document anything; their results are bizarre and inhuman. And then he links to this prominent mechanistic interpretability researcher at Google DeepMind, who trained a tiny transformer to do addition and then spent weeks figuring out what it was actually doing.

One of the only times in history someone has understood how a transformer actually works. Down to the level of weights and activations. This is the algorithm it created to add two numbers. It thought of basic addition in terms of a rotation around a circle. And of course if you asked it why is 1+1=2 it would never give you this as an explanation of its methodology.
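Here is a small illustration of that geometric idea, my own sketch rather than anything from that research; in the original work the task was modular addition and the learned algorithm involved Fourier features, so treat the modulus and the details as assumptions. The point is simply that composing rotations around a circle implements addition.

```python
import math

P = 113  # an assumed prime modulus, for illustration only

def angle(x, p=P):
    """Embed an integer as an angle on the unit circle."""
    return 2 * math.pi * x / p

def add_via_rotation(a, b, p=P):
    """Compose the two rotations (i.e. add the angles), then read off
    which integer's angle we landed on. Rotating by 2*pi*a/p and then by
    2*pi*b/p lands on 2*pi*(a+b)/p, which is addition mod p."""
    theta = (angle(a, p) + angle(b, p)) % (2 * math.pi)
    return round(theta * p / (2 * math.pi)) % p

assert add_via_rotation(57, 80) == (57 + 80) % P  # both give 24
```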

But maybe this is what it's actually calculating. That's why I'm personally a little bit skeptical when OpenAI say that this form of process supervision directly rewards the model for following an aligned chain of thought. It definitely rewards the model for producing an aligned chain of thought. But is it actually following that chain of thought?

Back to the unfaithful paper for a moment. They changed the few-shot examples in the context so that the correct answer was always (A). And lo and behold, ChatGPT picked answer A for the next question, even though that answer was wrong: it said that it was plausible that LeBron James took a corner kick. But when asked for a chain-of-thought explanation, it never mentioned that it had spotted the pattern that the answer was always A.

It gave A as the answer, with a fake line of reasoning about why LeBron James could take a corner kick. Now, of course, I might well be wrong here, and I'd love for someone to explain in detail why. On the one hand, I do want to acknowledge that this process does yield incredible results.

But on the other hand we might be getting a story about which methodology most reassures humans. Not an output that most faithfully represents the methodology actually used by GPT-4. Now for some people that might be good enough. At least we can see some reason in the reasoning steps that we can understand.

Especially in an area like mathematics, where we have some ground truth. But it is interesting to me that they call the other approach outcome supervision: an approach that, they say, may reward an unaligned process and is generally harder to scrutinize. Is it possible, though, that the process reward model is just a more granular outcome reward model, where the outcome is each step of the reasoning, each still pretty much impossible to actually scrutinize?

Well, either way, it seems we're pinning our hopes on this process-oriented learning. This is from the website of Anthropic. They say: "We currently believe process-oriented learning may be the most promising path to training safe and transparent systems up to and somewhat beyond human-level capabilities."

And let's end on this positive note from the head of alignment at OpenAI. He says this is positive evidence for the strategy of using process supervision to train a model to do alignment research. At least in that case we would get a model whose work we can check more easily.

And that that model would be better at alignment research. I really hope so and I want to hear what you think. Thank you for watching all the way to the end. Have a wonderful day.