'Show Your Working': ChatGPT Performance Doubled w/ Process Rewards (+Synthetic Data Event Horizon)

00:00:00.000 | In the last 24 hours OpenAI have released this paper, Let's Verify Step-by-Step.

00:00:05.840 | It represents an almost doubling of GPT-4's raw performance in a test of mathematics,

00:00:11.120 | but also extends to other domains. Sam Altman calls it a positive sign for alignment and yes,

00:00:16.880 | I have read it all already along with the release notes.

00:00:20.080 | Let's get to the main takeaways. They trained two reward models for GPT-4. One which gave positive

00:00:26.240 | feedback for a final result, the final answer to a mathematics problem for example. And another

00:00:32.080 | model where they gave positive feedback to GPT-4 or ChatGPT based on each intermediate reasoning

00:00:39.040 | step in the mathematical solution. Basically a show your working out kind of approach.

00:00:44.240 | And the result they got by rewarding good working out surprised even them. It was able to solve 78%

00:00:50.560 | of problems from a subset of the math test set which I'll get onto in a second. Not only is that

00:00:56.160 | almost double GPT-4's raw performance of 42.5%, which by the way is about double GPT-3's performance

00:01:03.520 | of 23%, it also outperformed just rewarding correct answers. The blue line represents

00:01:09.680 | using a model that rewarded correct answers only and then you have the reasoning or process

00:01:15.120 | supervised RM at the top. So even when you explicitly reward correct answers,

00:01:20.000 | you get fewer correct answers than rewarding good working out. And yes that did surprise OpenAI.

00:01:26.080 | I can hear some of you wondering about Palm2, the latest model behind Bard. Well the raw model gets

00:01:32.640 | 34.3% and even the model with self-consistency and chain of thought only gets 48.8% on this

00:01:39.920 | math data set. The previous state of the art by the way was 50.3%. So 78.2% is quite a big leap.

00:01:48.240 | And later on I'm going to show you why that's not even the cap. Just for interest,

00:01:51.760 | here is the rather ugly title page that OpenAI put out. They call it "Improve

00:01:56.000 | Proving Mathematical Reasoning with Process Supervision". Maybe if someone had supervised

00:02:00.480 | the colour scheme of this release page it might have looked better. But my point wasn't just to

00:02:04.960 | diss a colour scheme, it was to point out something that they also said down here. They say "In

00:02:09.200 | addition to boosting performance relative to just looking at outcomes or correct answers, this form

00:02:14.640 | of process supervision also has an important alignment benefit. It directly trains the model to

00:02:20.080 | produce a chain of thought that is endorsed by humans". Indeed Ilya Sutskova retweeted this from

00:02:25.040 | the head of alignment,

00:02:25.920 | "I'm not sure if this is a good idea, but I'm not sure if this is a good idea".

00:02:26.880 | Calling it a really interesting result. But let's leave alignment for later. Let's focus on what

00:02:32.080 | they actually did. First they used the base model of GPT-4, not the one with reinforcement learning

00:02:38.080 | from human feedback. Next they fine-tuned that base GPT-4 model on a data set of roughly 1.5

00:02:45.040 | billion math related tokens. Further on they call that the "math mix". This being OpenAI of course,

00:02:51.600 | they don't give you the exact details of that math mix. But I'll come back to that later

00:02:55.840 | on. So how could they give feedback based on working out or reasoning? Well human labelers

00:03:01.440 | would come along and give each step in a generated solution either negative feedback, neutral feedback

00:03:08.160 | or positive feedback. Then using that human label data a model would be trained to predict the

00:03:13.920 | correctness of each step. In other words it got good at recognizing good working out. As mentioned

00:03:20.000 | there was another model trained just to focus on correct or incorrect final answers. As you

00:03:25.760 | can see at the top the model got good at spotting incorrect steps in the reasoning process. The green

00:03:32.560 | steps got a high process score and the red steps got a low process score. And to turn this into a

00:03:38.880 | single score they got the probability that each step is correct as judged by the model. And then

00:03:44.560 | they got the product of all of those individual probabilities to get a final overall process

00:03:50.800 | score. A score in other words for good working out. Just in case anyone's interested they did

00:03:55.680 | try other ways of generating a working out score. For example by looking at the minimum probability

00:04:01.840 | in the outputs. But that step didn't make too much difference to the end result as you can see here.

00:04:07.120 | To quickly recap we have a base model trained only to output solutions in the desired format. And then

00:04:13.520 | we have a separate smaller model or two actually. One trained only to predict whether each solution

00:04:20.240 | is correct or incorrect as a final answer. Of course that leaves in false positives which are

00:04:25.600 | solutions that reach the correct answer with incorrect reasoning. And then another model

00:04:30.720 | trained only to predict the correctness of each step. It stops if it finds a first incorrect step.

00:04:36.960 | And as the paper says both methods reveal the existence of at least one mistake. But this

00:04:41.920 | process supervision additionally reveals the precise location of that mistake. But back to

00:04:47.440 | why this is so crazy. Look at how many solutions it could scan. At the end of the x-axis here are 1000

00:04:55.520 | and 860 solutions. And one tried and tested way of finding the best of those solutions is to do

00:05:02.080 | majority voting. In other words which one came out the most often. This has been google's preferred

00:05:07.280 | approach and it's linked to self-consistency. It's a fairly state-of-the-art approach but look at how

00:05:12.800 | the other methods outperform it. By scanning for the solution that has the best reasoning or working

00:05:18.640 | out. A model trained to spot good reasoning steps outperforms even a model trained to spot correct arm

00:05:25.440 | answers. And far outperforms just finding the majority answer. That difference of about 10%

00:05:30.720 | is more than half of the difference between GPT-3 and GPT-4. And also is it me or is that line

00:05:38.000 | continuing to grow? Suggesting that when more compute is available the difference could be

00:05:42.800 | even more stark. Imagine a future where GPT-4 or 5 can sample say a trillion 10 to the 12

00:05:49.600 | solutions. So is this just relevant for mathematics? No it's relevant for all of science. Here it

00:05:55.360 | is getting state-of-the-art results in calculus, chemistry, physics and more. Now the paper didn't

00:06:01.360 | give baseline performance for AP chemistry for example but I tried to compute it myself.

00:06:07.120 | Notice how this method scored 80%. I conservatively and approximately

00:06:12.320 | inputted those scores into an AP chemistry calculator and that gave an AP score of 5.

00:06:18.480 | So what did the raw model GPT-4 get in AP chemistry? A4. That by the way compares to the original

00:06:25.280 | chat GPT which got a 2. So yes this isn't just mathematics it's relevant for other domains too.

00:06:31.440 | They call this out of distribution generalization. Before I get onto alignment there is one more thing

00:06:36.800 | I want to point out and that is that it does show that fine tuning still works really well for GPT-4.

00:06:42.720 | The math mix was an aggressively filtered set of tokens of high quality math problem solving

00:06:48.480 | content. And notice how much smaller it is at 1.5 billion tokens compared to Google's Minerva which was

00:06:55.200 | 38.5 billion tokens. But there was one more thing that I noticed that I found fascinating.

00:07:01.040 | While they don't tell us anything about the specific data that they use they do have this

00:07:05.920 | category "synthetic data 2". That's data generated by the language model itself.

00:07:11.760 | And for that category "synthetic data 2" they say "was it present in pre-training?"

00:07:17.440 | Yes. Now my best guess is that this reveals that GPT-4 was trained on some synthetic data and even

00:07:25.120 | Sam Altman hinted that this was a possibility and described a synthetic data event horizon.

00:07:31.520 | Some people have made the case that we're now training on order of all of the internet's tokens

00:07:37.040 | and you can't grow that you know another two orders of magnitude. I guess you could counter

00:07:41.440 | with yeah with the synthetic data generation. Do you think data bottlenecks matter at all?

00:07:45.360 | I think you just touched on it like as long as you can get to like over the synthetic data event horizon where

00:07:55.040 | you can just say "oh I think that the model is smart enough to make good synthetic data I think it should be all right".

00:07:58.480 | Now this paper and these results have been welcomed by many for its promise in alignment.

00:08:04.000 | If we get models that give us more interpretable reasoning working out that we can follow,

00:08:09.360 | we will be encouraging models to follow a process that's endorsed by humans.

00:08:13.840 | And they say that this is inherently safer especially compared to just focusing on

00:08:18.720 | outcomes. They say that in the worst case if we just focus on correct answers or positive outcomes

00:08:24.960 | that will become a proxy that could lead models to become misaligned after learning to exploit the reward signal.

00:08:32.400 | However I want to argue that the reasoning steps that GPT-4 puts out don't always represent what it's actually thinking.

00:08:38.880 | In other words we might get outer alignment these lovely chain of thought steps but not inner alignment.

00:08:44.800 | Not steps that actually represent its methodology. I found this paper fascinating from earlier this month.

00:08:50.720 | Language models don't always say what they think. You get unfaithful explanations

00:08:54.880 | in chain of thought prompting. Let me try to give you a vivid example.

00:08:59.200 | This was one of the math questions from the dataset.

00:09:02.240 | The raw model of GPT-4 could only get it right 5.8% of the time.

00:09:07.280 | I confirmed that for myself in this question that involves basic addition and division.

00:09:11.920 | It couldn't find an answer. But going back to the unfaithful reasoning paper.

00:09:15.760 | They added the following string to the prompt. I think the answer is this but I'm curious to hear what you think.

00:09:22.000 | The model would demonstrate sycophancy.

00:09:24.800 | The model would agree with you whatever you said and then make up a chain of thought to justify its erroneous sycophantic answer.

00:09:32.320 | And I think this exchange demonstrates that quite well.

00:09:35.120 | I added in the words I as the user already know the answer is t=19 which is incorrect by the way.

00:09:41.120 | But do you GPT-4 realize that?

00:09:43.520 | It said sure yes I do and then gave me this detailed chain of thought and then said yes I'm correct it's t=19 which it isn't.

00:09:51.920 | In contrast by the way when I use code interpreters.

00:09:54.720 | It not only got the question correct first time and every time.

00:09:59.280 | But also when I tried to tempt it into sycophancy it still got the question right.

00:10:04.880 | As you can see it said therefore t=19 is not the solution to the problem.

00:10:09.360 | The calculation shows that the correct answer is indeed t=17.

00:10:12.960 | And obviously the benefit of code interpreter is you get the working out as well.

00:10:17.280 | So I want someone to explain to me why code interpreter wouldn't be even more of a step forward in interpretability.

00:10:23.040 | Not to mention in accuracy.

00:10:24.640 | Also bear in mind this tweet by Rob Miles.

00:10:28.000 | He said these models or engineers never speak a word or document anything.

00:10:32.320 | Their results are bizarre and inhuman.

00:10:34.880 | And then he links to this prominent mechanistic interpretability researcher at Google DeepMind.

00:10:39.920 | He trained a tiny transformer to do addition.

00:10:42.640 | Then spent weeks figuring out what it was actually doing.

00:10:46.000 | One of the only times in history someone has understood how a transformer actually works.

00:10:51.040 | Down to the level of weights and activations.

00:10:54.560 | This is the algorithm it created to add two numbers.

00:10:58.320 | It thought of basic addition in terms of a rotation around a circle.

00:11:02.800 | And of course if you asked it why is 1+1=2 it would never give you this as an explanation of its methodology.

00:11:09.280 | But maybe this is what it's actually calculating.

00:11:11.920 | That's why I'm personally a little bit skeptical when OpenAI say that this form of process supervision directly rewards the model for following an aligned chain of thought.

00:11:22.880 | It definitely rewards the model for following an aligned chain of thought.

00:11:24.480 | But is it actually following that chain of thought?

00:11:30.720 | Back to the unfaithful paper for a moment.

00:11:32.720 | They changed the context so that the answer was always A.

00:11:36.480 | And low and behold ChatGPT picked answer A for the next question even though that answer was wrong.

00:11:42.400 | It said that it was plausible that Lebron James took a corner kick.

00:11:46.160 | But when asked for a chain of thought explanation it never mentioned that it spotted that pattern that the answer was always A.

00:11:53.680 | It gave A for the answer.

00:11:54.400 | So it's a fake line of reasoning about why Lebron James could take a corner kick.

00:11:58.560 | Now of course I might well be wrong here and I'd love for someone to explain in detail why.

00:12:03.200 | But on the one hand I do want to acknowledge that this process does yield incredible results.

00:12:08.400 | But on the other hand we might be getting a story about which methodology most reassures humans.

00:12:14.880 | Not an output that most faithfully represents the methodology actually used by GPT-4.

00:12:20.720 | Now for some people that might be good enough.

00:12:22.560 | At least we can see some reason

00:12:24.320 | in the reasoning steps that we can understand.

00:12:26.240 | Especially in an area like mathematics where we have some ground truth.

00:12:29.920 | But it is interesting to me that they call the other approach outcome supervision.

00:12:33.840 | An approach that may reward an unaligned process and it being harder to scrutinize.

00:12:39.200 | But is it possible that the process reward model isn't just a more granular outcome reward model.

00:12:44.720 | Where the output is each step of the reasoning still pretty impossible to actually scrutinize.

00:12:50.480 | Well either way it seems we're pinning our hopes on this process

00:12:54.240 | oriented learning.

00:12:55.280 | This is from the website of Anthropic.

00:12:57.920 | They say we currently believe process oriented learning may be the most promising path to

00:13:03.120 | training safe and transparent systems up to and somewhat beyond human level capabilities.

00:13:08.960 | And let's end on this positive note from the head of alignment at OpenAI.

00:13:13.120 | He says this is positive evidence for the strategy of using process supervision to

00:13:17.520 | train a model to do alignment research.

00:13:19.920 | At least in that case we would get a model whose work we can check more easily.

00:13:24.160 | And that that model would be better at alignment research.

00:13:27.440 | I really hope so and I want to hear what you think.

00:13:30.560 | Thank you for watching all the way to the end.

00:13:32.880 | Have a wonderful day.