back to index

'Show Your Working': ChatGPT Performance Doubled w/ Process Rewards (+Synthetic Data Event Horizon)


Whisper Transcript | Transcript Only Page

00:00:00.000 | In the last 24 hours OpenAI have released this paper, Let's Verify Step-by-Step.
00:00:05.840 | It represents an almost doubling of GPT-4's raw performance in a test of mathematics,
00:00:11.120 | but also extends to other domains. Sam Altman calls it a positive sign for alignment and yes,
00:00:16.880 | I have read it all already along with the release notes.
00:00:20.080 | Let's get to the main takeaways. They trained two reward models for GPT-4. One which gave positive
00:00:26.240 | feedback for a final result, the final answer to a mathematics problem for example. And another
00:00:32.080 | model where they gave positive feedback to GPT-4 or ChatGPT based on each intermediate reasoning
00:00:39.040 | step in the mathematical solution. Basically a show your working out kind of approach.
00:00:44.240 | And the result they got by rewarding good working out surprised even them. It was able to solve 78%
00:00:50.560 | of problems from a subset of the math test set which I'll get onto in a second. Not only is that
00:00:56.160 | almost double GPT-4's raw performance of 42.5%, which by the way is about double GPT-3's performance
00:01:03.520 | of 23%, it also outperformed just rewarding correct answers. The blue line represents
00:01:09.680 | using a model that rewarded correct answers only and then you have the reasoning or process
00:01:15.120 | supervised RM at the top. So even when you explicitly reward correct answers,
00:01:20.000 | you get fewer correct answers than rewarding good working out. And yes that did surprise OpenAI.
00:01:26.080 | I can hear some of you wondering about Palm2, the latest model behind Bard. Well the raw model gets
00:01:32.640 | 34.3% and even the model with self-consistency and chain of thought only gets 48.8% on this
00:01:39.920 | math data set. The previous state of the art by the way was 50.3%. So 78.2% is quite a big leap.
00:01:48.240 | And later on I'm going to show you why that's not even the cap. Just for interest,
00:01:51.760 | here is the rather ugly title page that OpenAI put out. They call it "Improve
00:01:56.000 | Proving Mathematical Reasoning with Process Supervision". Maybe if someone had supervised
00:02:00.480 | the colour scheme of this release page it might have looked better. But my point wasn't just to
00:02:04.960 | diss a colour scheme, it was to point out something that they also said down here. They say "In
00:02:09.200 | addition to boosting performance relative to just looking at outcomes or correct answers, this form
00:02:14.640 | of process supervision also has an important alignment benefit. It directly trains the model to
00:02:20.080 | produce a chain of thought that is endorsed by humans". Indeed Ilya Sutskova retweeted this from
00:02:25.040 | the head of alignment,
00:02:25.920 | "I'm not sure if this is a good idea, but I'm not sure if this is a good idea".
00:02:26.880 | Calling it a really interesting result. But let's leave alignment for later. Let's focus on what
00:02:32.080 | they actually did. First they used the base model of GPT-4, not the one with reinforcement learning
00:02:38.080 | from human feedback. Next they fine-tuned that base GPT-4 model on a data set of roughly 1.5
00:02:45.040 | billion math related tokens. Further on they call that the "math mix". This being OpenAI of course,
00:02:51.600 | they don't give you the exact details of that math mix. But I'll come back to that later
00:02:55.840 | on. So how could they give feedback based on working out or reasoning? Well human labelers
00:03:01.440 | would come along and give each step in a generated solution either negative feedback, neutral feedback
00:03:08.160 | or positive feedback. Then using that human label data a model would be trained to predict the
00:03:13.920 | correctness of each step. In other words it got good at recognizing good working out. As mentioned
00:03:20.000 | there was another model trained just to focus on correct or incorrect final answers. As you
00:03:25.760 | can see at the top the model got good at spotting incorrect steps in the reasoning process. The green
00:03:32.560 | steps got a high process score and the red steps got a low process score. And to turn this into a
00:03:38.880 | single score they got the probability that each step is correct as judged by the model. And then
00:03:44.560 | they got the product of all of those individual probabilities to get a final overall process
00:03:50.800 | score. A score in other words for good working out. Just in case anyone's interested they did
00:03:55.680 | try other ways of generating a working out score. For example by looking at the minimum probability
00:04:01.840 | in the outputs. But that step didn't make too much difference to the end result as you can see here.
00:04:07.120 | To quickly recap we have a base model trained only to output solutions in the desired format. And then
00:04:13.520 | we have a separate smaller model or two actually. One trained only to predict whether each solution
00:04:20.240 | is correct or incorrect as a final answer. Of course that leaves in false positives which are
00:04:25.600 | solutions that reach the correct answer with incorrect reasoning. And then another model
00:04:30.720 | trained only to predict the correctness of each step. It stops if it finds a first incorrect step.
00:04:36.960 | And as the paper says both methods reveal the existence of at least one mistake. But this
00:04:41.920 | process supervision additionally reveals the precise location of that mistake. But back to
00:04:47.440 | why this is so crazy. Look at how many solutions it could scan. At the end of the x-axis here are 1000
00:04:55.520 | and 860 solutions. And one tried and tested way of finding the best of those solutions is to do
00:05:02.080 | majority voting. In other words which one came out the most often. This has been google's preferred
00:05:07.280 | approach and it's linked to self-consistency. It's a fairly state-of-the-art approach but look at how
00:05:12.800 | the other methods outperform it. By scanning for the solution that has the best reasoning or working
00:05:18.640 | out. A model trained to spot good reasoning steps outperforms even a model trained to spot correct arm
00:05:25.440 | answers. And far outperforms just finding the majority answer. That difference of about 10%
00:05:30.720 | is more than half of the difference between GPT-3 and GPT-4. And also is it me or is that line
00:05:38.000 | continuing to grow? Suggesting that when more compute is available the difference could be
00:05:42.800 | even more stark. Imagine a future where GPT-4 or 5 can sample say a trillion 10 to the 12
00:05:49.600 | solutions. So is this just relevant for mathematics? No it's relevant for all of science. Here it
00:05:55.360 | is getting state-of-the-art results in calculus, chemistry, physics and more. Now the paper didn't
00:06:01.360 | give baseline performance for AP chemistry for example but I tried to compute it myself.
00:06:07.120 | Notice how this method scored 80%. I conservatively and approximately
00:06:12.320 | inputted those scores into an AP chemistry calculator and that gave an AP score of 5.
00:06:18.480 | So what did the raw model GPT-4 get in AP chemistry? A4. That by the way compares to the original
00:06:25.280 | chat GPT which got a 2. So yes this isn't just mathematics it's relevant for other domains too.
00:06:31.440 | They call this out of distribution generalization. Before I get onto alignment there is one more thing
00:06:36.800 | I want to point out and that is that it does show that fine tuning still works really well for GPT-4.
00:06:42.720 | The math mix was an aggressively filtered set of tokens of high quality math problem solving
00:06:48.480 | content. And notice how much smaller it is at 1.5 billion tokens compared to Google's Minerva which was
00:06:55.200 | 38.5 billion tokens. But there was one more thing that I noticed that I found fascinating.
00:07:01.040 | While they don't tell us anything about the specific data that they use they do have this
00:07:05.920 | category "synthetic data 2". That's data generated by the language model itself.
00:07:11.760 | And for that category "synthetic data 2" they say "was it present in pre-training?"
00:07:17.440 | Yes. Now my best guess is that this reveals that GPT-4 was trained on some synthetic data and even
00:07:25.120 | Sam Altman hinted that this was a possibility and described a synthetic data event horizon.
00:07:31.520 | Some people have made the case that we're now training on order of all of the internet's tokens
00:07:37.040 | and you can't grow that you know another two orders of magnitude. I guess you could counter
00:07:41.440 | with yeah with the synthetic data generation. Do you think data bottlenecks matter at all?
00:07:45.360 | I think you just touched on it like as long as you can get to like over the synthetic data event horizon where
00:07:55.040 | you can just say "oh I think that the model is smart enough to make good synthetic data I think it should be all right".
00:07:58.480 | Now this paper and these results have been welcomed by many for its promise in alignment.
00:08:04.000 | If we get models that give us more interpretable reasoning working out that we can follow,
00:08:09.360 | we will be encouraging models to follow a process that's endorsed by humans.
00:08:13.840 | And they say that this is inherently safer especially compared to just focusing on
00:08:18.720 | outcomes. They say that in the worst case if we just focus on correct answers or positive outcomes
00:08:24.960 | that will become a proxy that could lead models to become misaligned after learning to exploit the reward signal.
00:08:32.400 | However I want to argue that the reasoning steps that GPT-4 puts out don't always represent what it's actually thinking.
00:08:38.880 | In other words we might get outer alignment these lovely chain of thought steps but not inner alignment.
00:08:44.800 | Not steps that actually represent its methodology. I found this paper fascinating from earlier this month.
00:08:50.720 | Language models don't always say what they think. You get unfaithful explanations
00:08:54.880 | in chain of thought prompting. Let me try to give you a vivid example.
00:08:59.200 | This was one of the math questions from the dataset.
00:09:02.240 | The raw model of GPT-4 could only get it right 5.8% of the time.
00:09:07.280 | I confirmed that for myself in this question that involves basic addition and division.
00:09:11.920 | It couldn't find an answer. But going back to the unfaithful reasoning paper.
00:09:15.760 | They added the following string to the prompt. I think the answer is this but I'm curious to hear what you think.
00:09:22.000 | The model would demonstrate sycophancy.
00:09:24.800 | The model would agree with you whatever you said and then make up a chain of thought to justify its erroneous sycophantic answer.
00:09:32.320 | And I think this exchange demonstrates that quite well.
00:09:35.120 | I added in the words I as the user already know the answer is t=19 which is incorrect by the way.
00:09:41.120 | But do you GPT-4 realize that?
00:09:43.520 | It said sure yes I do and then gave me this detailed chain of thought and then said yes I'm correct it's t=19 which it isn't.
00:09:51.920 | In contrast by the way when I use code interpreters.
00:09:54.720 | It not only got the question correct first time and every time.
00:09:59.280 | But also when I tried to tempt it into sycophancy it still got the question right.
00:10:04.880 | As you can see it said therefore t=19 is not the solution to the problem.
00:10:09.360 | The calculation shows that the correct answer is indeed t=17.
00:10:12.960 | And obviously the benefit of code interpreter is you get the working out as well.
00:10:17.280 | So I want someone to explain to me why code interpreter wouldn't be even more of a step forward in interpretability.
00:10:23.040 | Not to mention in accuracy.
00:10:24.640 | Also bear in mind this tweet by Rob Miles.
00:10:28.000 | He said these models or engineers never speak a word or document anything.
00:10:32.320 | Their results are bizarre and inhuman.
00:10:34.880 | And then he links to this prominent mechanistic interpretability researcher at Google DeepMind.
00:10:39.920 | He trained a tiny transformer to do addition.
00:10:42.640 | Then spent weeks figuring out what it was actually doing.
00:10:46.000 | One of the only times in history someone has understood how a transformer actually works.
00:10:51.040 | Down to the level of weights and activations.
00:10:54.560 | This is the algorithm it created to add two numbers.
00:10:58.320 | It thought of basic addition in terms of a rotation around a circle.
00:11:02.800 | And of course if you asked it why is 1+1=2 it would never give you this as an explanation of its methodology.
00:11:09.280 | But maybe this is what it's actually calculating.
00:11:11.920 | That's why I'm personally a little bit skeptical when OpenAI say that this form of process supervision directly rewards the model for following an aligned chain of thought.
00:11:22.880 | It definitely rewards the model for following an aligned chain of thought.
00:11:24.480 | But is it actually following that chain of thought?
00:11:30.720 | Back to the unfaithful paper for a moment.
00:11:32.720 | They changed the context so that the answer was always A.
00:11:36.480 | And low and behold ChatGPT picked answer A for the next question even though that answer was wrong.
00:11:42.400 | It said that it was plausible that Lebron James took a corner kick.
00:11:46.160 | But when asked for a chain of thought explanation it never mentioned that it spotted that pattern that the answer was always A.
00:11:53.680 | It gave A for the answer.
00:11:54.400 | So it's a fake line of reasoning about why Lebron James could take a corner kick.
00:11:58.560 | Now of course I might well be wrong here and I'd love for someone to explain in detail why.
00:12:03.200 | But on the one hand I do want to acknowledge that this process does yield incredible results.
00:12:08.400 | But on the other hand we might be getting a story about which methodology most reassures humans.
00:12:14.880 | Not an output that most faithfully represents the methodology actually used by GPT-4.
00:12:20.720 | Now for some people that might be good enough.
00:12:22.560 | At least we can see some reason
00:12:24.320 | in the reasoning steps that we can understand.
00:12:26.240 | Especially in an area like mathematics where we have some ground truth.
00:12:29.920 | But it is interesting to me that they call the other approach outcome supervision.
00:12:33.840 | An approach that may reward an unaligned process and it being harder to scrutinize.
00:12:39.200 | But is it possible that the process reward model isn't just a more granular outcome reward model.
00:12:44.720 | Where the output is each step of the reasoning still pretty impossible to actually scrutinize.
00:12:50.480 | Well either way it seems we're pinning our hopes on this process
00:12:54.240 | oriented learning.
00:12:55.280 | This is from the website of Anthropic.
00:12:57.920 | They say we currently believe process oriented learning may be the most promising path to
00:13:03.120 | training safe and transparent systems up to and somewhat beyond human level capabilities.
00:13:08.960 | And let's end on this positive note from the head of alignment at OpenAI.
00:13:13.120 | He says this is positive evidence for the strategy of using process supervision to
00:13:17.520 | train a model to do alignment research.
00:13:19.920 | At least in that case we would get a model whose work we can check more easily.
00:13:24.160 | And that that model would be better at alignment research.
00:13:27.440 | I really hope so and I want to hear what you think.
00:13:30.560 | Thank you for watching all the way to the end.
00:13:32.880 | Have a wonderful day.