Stanford CS25: V2 | Language and Human Alignment
00:00:08.600 |
He leads the alignment team there and was previously a researcher at DeepMind as well. 00:00:12.920 |
He holds a PhD in reinforcement learning theory, has been thinking about the alignment problem 00:00:17.840 |
for over 10 years, and today he'll be giving a very interesting talk. 00:00:22.920 |
Yeah, thanks a lot for the intro, and thanks a lot for having me. 00:00:33.320 |
If you have questions at any point, please interrupt me. 00:00:39.280 |
I want to start out with a few very basic observations on what I think is going on. 00:00:50.080 |
So the first one is, TeamAI is joining the game, so TeamAI has a lot of different players. 00:01:00.200 |
They don't all join at the same time, but rather they join one by one. 00:01:05.520 |
And they're not all the same; their players vary a lot in how good they are. 00:01:10.640 |
And right now, a lot of the players that have joined so far aren't really that smart and 00:01:16.720 |
usually can do only a very narrow set of tasks. 00:01:23.000 |
But one thing that we've kind of observed is that over time, you know, we're seeing 00:01:29.000 |
stronger and stronger players join, and this is kind of where we are now. 00:01:35.520 |
And then in general, we expect that TeamAI has incredibly strong players, so those will 00:01:41.880 |
be players that are able to think so much better than humans, so much faster, and so on. 00:01:53.400 |
And so the anchor point that we have, if you think, for example, about ChatGPT: ChatGPT 00:02:01.880 |
can already beat any human at knowing more facts or speaking more languages, and it can 00:02:09.360 |
write about 50 words per second, and can do so about 100 times cheaper than humans could. 00:02:18.560 |
And so, you know, ChatGPT also has some really important limitations, and there's a lot of 00:02:25.940 |
things that it can't do yet, but it is kind of an indicator of some of the players that are still to come. 00:02:37.220 |
And so it seems like in the long run, TeamAI will have all the advantages over TeamHuman. 00:02:47.380 |
And there's an important caveat, which is there's one important advantage that TeamHuman 00:02:56.100 |
has, which is TeamHuman gets to pick which players from TeamAI join, and when. 00:03:03.640 |
And so this is kind of like an advantage that we should really be leaning into when we're 00:03:09.500 |
thinking about what to do, and when we're thinking about, you know, this game that we're 00:03:13.260 |
playing with TeamAI, and that we'll be playing with TeamAI in the future. 00:03:19.340 |
So I think two of the main objectives of what we as TeamHuman should do is, like, first, 00:03:28.880 |
we should try to recruit players from TeamAI to play on TeamHuman. 00:03:36.100 |
And so this is kind of what I would broadly call alignment. 00:03:40.420 |
And this is kind of like the problem that I'm working on. 00:03:45.700 |
So another objective that I think is going to be really important is you want to write 00:03:49.540 |
the rules of the game so that TeamHuman doesn't lose. 00:03:53.600 |
And right now, TeamHuman kind of has the ball, and we get to write the rules, so we should 00:03:58.140 |
write rules that, you know, make sense, and that let us keep playing this game in the future. 00:04:06.660 |
And so in this talk, I won't really talk about the second point at all. 00:04:10.820 |
And I'll talk about the first point, because that's what I know best and where I'm working. 00:04:16.220 |
And kind of to phrase it differently, or to make it kind of, like, more practical, like, 00:04:24.340 |
one way I'm thinking about alignment is, like, you want to build AI systems that follow human 00:04:29.340 |
intent, and that, you know, follow human preferences that do what we want them to do. 00:04:36.620 |
And so a bunch of the things, basically, I'll talk about two main things. 00:04:42.380 |
The first part is going to be work that we've done in the past, and kind of, like, which 00:04:48.560 |
roughly is in the bucket of, like, we are trying to figure out how we can make the models 00:04:56.620 |
that we have today as aligned as we can, and we're just kind of trying -- we're going to 00:05:01.420 |
try hard to do this, and we'll see how far we get. 00:05:05.180 |
And then the second bucket is the things that we have to do next, the stuff that we haven't 00:05:10.140 |
done yet that we think are going to be really important, and I want to kind of, like, lay 00:05:14.940 |
out why I think they're going to be important. 00:05:19.740 |
So, now, I said, you know, I'm, like, trying to make this more clear, or, like, break 00:05:27.220 |
down what alignment means, but we're not done yet, because now, you know, the big question 00:05:32.540 |
is, like, what does it mean to follow human intent? 00:05:35.180 |
And kind of, like, the two main categories of intent that we care about are, I would say, 00:05:40.580 |
stated intent, so if I, you know, give the system an instruction, or if I want it 00:05:45.420 |
to be my assistant, it should be my assistant, and it should follow the instruction. 00:05:49.700 |
But then there's also all these other intents that I don't say when I'm usually, you know, 00:05:54.380 |
talking to a system or a human that I also really care about, like, you know, it shouldn't 00:05:59.420 |
literally always do what I say, but do the thing that I mean, and it shouldn't make up stuff, 00:06:03.860 |
and it shouldn't, you know, do harmful things, and it should ask a lot of questions when 00:06:08.580 |
it's not sure what I mean, and so on and so on. 00:06:11.420 |
And so these are all kind of, like, things that are often just, like, really difficult 00:06:17.180 |
to, like, precisely specify, or, like, you know, write down precisely, but they are still 00:06:26.740 |
things that we want to get AI to do, and that we have to figure out how to, you know, get it to do. 00:06:35.220 |
And so, kind of, like, the main technique that we're using today for this is what we 00:06:40.420 |
call reinforcement learning from human feedback, so that was used to train InstructGPT and ChatGPT, which 00:06:46.100 |
are the two, like, main systems that I'll talk about in this talk. 00:06:50.500 |
And basically, the basic setup is very simple, and it's also, like, a super general technique 00:06:56.260 |
that applies to lots of different AI models, and modalities, and settings, but in this 00:07:03.140 |
case, we'll be using that as well, and so there are two steps -- there's actually another step of, 00:07:08.980 |
like, fine-tuning on demonstrations, but I'm going to just skip that for the sake of simplicity. 00:07:14.580 |
The first step is you want to train your reward model from comparisons, so you have a prompt on the 00:07:20.420 |
top, in this case, you know, explain the moon landing to a six-year-old, or, you know, help 00:07:27.300 |
me with my term paper, whatever it is, and then the model does a bunch of things, and 00:07:32.900 |
then you rate which one is, like, closest to the thing that you intended the model to do. 00:07:37.940 |
And so you have this big set of preferences, and you train your reward model, and the reward 00:07:41.060 |
model basically just learns to predict which one you would prefer. 00:07:46.180 |
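(To make the comparison step concrete, here is a minimal sketch of the kind of pairwise loss a reward model like this can be trained with; the Bradley-Terry-style objective and the `reward_model` interface are illustrative assumptions, not the exact InstructGPT code.)

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, preferred_ids, rejected_ids):
    """Pairwise comparison loss: score the human-preferred response higher.

    reward_model is assumed to map a batch of token-id sequences to one
    scalar reward per sequence; the interface is illustrative.
    """
    r_preferred = reward_model(preferred_ids)  # shape (batch,)
    r_rejected = reward_model(rejected_ids)    # shape (batch,)
    # Negative log-likelihood that the preferred response "wins" under a
    # Bradley-Terry model of the labeler's choice.
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```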
I'm going to say, like, just stand more in front of the camera, but I think it'll look 00:08:00.220 |
So now we have this reward model that captures kind of our preferences and what we care about 00:08:06.540 |
and what we intend for the model to do, and then the second step is now you optimize your policy against this reward model with reinforcement learning. 00:08:15.180 |
And so in that setting, you know, like, the model tries a whole bunch of different things, 00:08:19.780 |
and the reward model kind of tells it which one of these things is probably more of, like, what you would prefer. 00:08:28.580 |
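(A sketch of what that second step can look like in code: the policy samples responses, the frozen reward model scores them, and those scores become the reward an RL algorithm such as PPO optimizes. Every interface here, and the optional KL penalty toward the pre-RL model, is an illustrative assumption rather than a detail from the talk.)

```python
import torch

@torch.no_grad()
def score_rollouts(policy, ref_policy, reward_model, prompts, kl_coef=0.1):
    """Sample responses and turn reward-model scores into RL rewards.

    policy.generate, policy.log_prob, and reward_model(prompts, responses)
    are assumed stand-in interfaces; kl_coef is an illustrative value.
    """
    responses = policy.generate(prompts)            # sample from the current policy
    scores = reward_model(prompts, responses)       # "which would a human prefer"
    # Common practice: penalize drifting too far from the pre-RL model.
    kl = policy.log_prob(prompts, responses) - ref_policy.log_prob(prompts, responses)
    rewards = scores - kl_coef * kl
    return responses, rewards                       # fed to PPO (or another RL algorithm)
```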
When you say "comparison," is that made by a human labeler that you get the data from? 00:08:34.340 |
And are those consistent, or does that not depend on the labeler? 00:08:39.740 |
Different labelers will have different preferences. 00:08:40.740 |
There also might be inconsistencies, and we can give you examples of, like, intransitive 00:08:46.740 |
preferences, but those haven't really been a problem in practice. 00:08:52.620 |
And so far, you know, like, our labelers often don't agree, but the model will average over that. 00:09:07.340 |
You can make it even simpler if you had, you know, if you didn't train the reward model 00:09:16.340 |
and you labeled, instead, like, every episode, but it would be a lot less data efficient. 00:09:22.820 |
And so training a reward model makes it, like, a lot more data efficient. 00:09:31.660 |
So this is kind of, like, one of the main plots from the InstructGPT paper. 00:09:35.660 |
And this is the one I like showing, because it really blew my mind, and it still does. 00:09:42.780 |
So on the x-axis, you see, this is from the GPT-3 model series, and you see this is, like, 00:09:49.060 |
three different sizes of models over two orders of magnitude. 00:09:53.020 |
And on the y-axis is, how well does the model score on human preferences? 00:09:57.980 |
So if we show a bunch of samples to humans, how likely are they to prefer one over the other? 00:10:04.460 |
And then what we see is that even, like, the largest GPT-3 model is dispreferred to the smallest InstructGPT model. 00:10:13.780 |
And so the 100x smaller InstructGPT model is actually preferred over the much larger, like, GPT-3 model. 00:10:33.380 |
So basically, it basically shows that, like, this fine-tuning on human feedback 00:10:45.380 |
makes the model so much more useful than, you know, scaling up alone. 00:10:56.300 |
One worry is that fine-tuning makes the model worse, so then nobody wants to use it, and then we make all 00:10:57.300 |
these fancy alignment techniques that don't get adopted. 00:11:00.100 |
And so what we were-- like, originally, in, like, the first version, we saw these regressions. 00:11:07.100 |
And then what here is labeled PPO-ptx is kind of like a variant where we mix in pre-training gradients during the RL fine-tuning. 00:11:16.220 |
And that mitigated a bunch of the regressions that we saw. 00:11:45.680 |
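(A rough sketch of what "mixing in pre-training gradients" can mean: combine the PPO objective on the RL rollouts with an ordinary language-modeling loss on pre-training data, weighted by a coefficient. The helper names, the HuggingFace-style `.logits` access, and the coefficient value are assumptions for illustration, not the exact InstructGPT setup.)

```python
import torch.nn.functional as F

def ppo_ptx_loss(policy, ppo_loss_fn, rl_batch, pretrain_ids, ptx_coef=1.0):
    """Mix pre-training gradients into the RL fine-tuning objective.

    ppo_loss_fn computes the clipped PPO objective on RL rollouts (assumed given);
    ptx_coef is a tunable weight on the pre-training loss (illustrative value).
    """
    rl_loss = ppo_loss_fn(policy, rl_batch)

    # Ordinary next-token prediction on pre-training data, which is what
    # mitigates regressions on the original capabilities.
    logits = policy(pretrain_ids[:, :-1]).logits
    lm_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        pretrain_ids[:, 1:].reshape(-1),
    )
    return rl_loss + ptx_coef * lm_loss
```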
How important is, like, fidelity of fine-tuning data that you have? 00:11:46.680 |
Like, you guys-- you collect data from humans, right? 00:11:48.680 |
What if you were to use some pre-trained language model to score, you know, generated data for you? 00:11:53.560 |
In terms of-- well, there are certain things that the language model will be able to automatically 00:11:58.440 |
rank, and some things it won't, because it won't know your exact preferences, or it won't 00:12:07.440 |
And so whenever the language model does something that we disprefer, we actually-- we have to correct it somehow. 00:12:16.240 |
Or in other words, you know, if you're aligning with humans, you somehow have to put humans 00:12:20.720 |
into the loop so that, you know-- otherwise, how does the model know what it's supposed to do? 00:12:31.720 |
How many human-- approximately, like, what's-- how many orders of magnitude of, like, human 00:12:36.720 |
Of course, it will sort of like-- it will look at your PD over here, which is, I think, 00:13:02.280 |
So why did you decide to use PPO for this? 00:13:03.280 |
We haven't-- we haven't actually compared-- carefully compared across our all algorithms. 00:13:04.760 |
And it could very well be that a different RL algorithm would be better. 00:13:08.080 |
That was kind of like-- I mean, PPO was invented at OpenAI, so that's why we used it. 00:13:14.760 |
It's not-- not a really good reason other than that. 00:13:21.760 |
What are the labels that humans are giving -- is it, like, thumbs up and thumbs down, versus comparisons? 00:13:31.760 |
We have people compare between, like, three to six different responses from usually different models. 00:13:39.760 |
So are PPO and the reward model currently used in ChatGPT-- 00:13:49.760 |
And if so, like, do you use any of the human feedback, like, you know, regenerate responses 00:13:54.760 |
and stuff like that to help as a reward function as well? 00:13:59.760 |
Like, there's a button on ChatGPT where you can say, like, regenerate responses. 00:14:03.760 |
Or do you use any implicit feedback, basically, in human use? 00:14:07.760 |
I don't know what the current state is for that. 00:14:12.760 |
But, you know, ChatGPT hasn't been out that long. 00:14:19.760 |
Like, it seems like 100x, as you mentioned, increasing parameters doesn't give you that much. 00:14:25.760 |
Qualitatively, you have been tracking this for a while. 00:14:28.760 |
Can you tell right off the bat, if you're, like, interacting with the 1 billion, like, 00:14:32.760 |
model or the, like, 100 billion model -- like, a pseudo-Turing test for the parameter size? 00:14:36.760 |
Like, I give you a black box, can you tell me how many parameters it has? 00:14:43.760 |
But I think the big counter question is, like, do I get to write the prompt? 00:14:50.760 |
So if you just draw random prompts from whatever people put into the OpenAI Playground, which 00:14:55.760 |
is what we used for InstructGPT, then I probably need quite a few to tell the difference. 00:15:01.760 |
But if I get to write the prompt, I can probably do it in one or two. 00:15:04.760 |
At least, like, if the task is, like, tell the difference between this and this. 00:15:13.760 |
I want to-- can I just do two more slides, and maybe your questions get answered? 00:15:17.760 |
And then-- so this was the question about training costs. 00:15:22.760 |
So this is another thing that kind of really blew my mind, is, like, compared to pre-training, fine-tuning is really cheap. 00:15:29.760 |
So if you look at, like, the amount of FLOPs that it takes to pre-train GPT-3, and then 00:15:34.760 |
you compare it with, like, how much the fine-tuning and the RL, with the pre-training mix and everything, cost, 00:15:41.760 |
like, the most expensive InstructGPT version is, like, less than 2% of the pre-training compute. 00:15:47.760 |
And if you want to train an even bigger model, it's going to be more expensive, and you could 00:15:51.760 |
still use the same, like, fine-tuning step to make it more aligned. 00:15:56.760 |
And of course, I think the important thing to note also here is, like, we haven't fixed everything yet. 00:16:02.760 |
And so I wouldn't say that this is, like, you know, the last version, and we will still 00:16:07.760 |
try to figure out how to spend more compute and more human data in the future. 00:16:11.760 |
But all in all, it was surprisingly effective. 00:16:24.760 |
Mixing pre-training data into the RL fine-tuning -- we just, like, mix the gradients. 00:16:32.760 |
What's the number of branches for this graph? 00:16:34.760 |
So you're fixing a number of branches for this graph. 00:16:58.760 |
So the first question is, how do you deal with RLHF breaking in the limit? 00:17:02.760 |
Expressed preferences are a good proxy for values. 00:17:05.760 |
But optimizing for them is theorized to incentivize deception. 00:17:17.760 |
So that is, like, you want to automate alignment research. 00:17:20.760 |
What happens if you need conceptual breakthroughs, which are hard to automate? 00:17:25.760 |
OK, that would be a good one to take at the end as well. 00:17:36.760 |
how would fine-tuning directly on human feedback data compare? 00:17:44.760 |
I think it's more like -- if you directly use the human feedback data, the comparison 00:17:56.760 |
is, like, what if you just take human demonstrations, in the sense that 00:18:02.760 |
we just ask humans to do the tasks, record what they did, and train on that. 00:18:08.760 |
And here, it's, like, just very basic behavioral cloning, 00:18:11.760 |
just using the same loss they use in pre-training. 00:18:14.760 |
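(A minimal sketch of that behavioral-cloning step: plain next-token cross-entropy, as in pre-training, here restricted to the demonstrated response tokens. The masking choice and the model interface are illustrative assumptions, not details from the talk.)

```python
import torch.nn.functional as F

def behavioral_cloning_loss(model, input_ids, demo_mask):
    """Supervised fine-tuning on human demonstrations (same loss as pre-training).

    input_ids: prompt followed by the human-written demonstration.
    demo_mask: 1 where the token belongs to the demonstration; masking out the
    prompt is an illustrative choice.
    """
    logits = model(input_ids[:, :-1]).logits            # predict the next token
    targets = input_ids[:, 1:]
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    mask = demo_mask[:, 1:].float()
    return (per_token * mask).sum() / mask.sum()         # average over demonstration tokens
```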
And then, you know, it is noticeably better than the few-shot prompted version. 00:18:23.760 |
And basically, conceptually, there's two problems with this. 00:18:28.760 |
One is humans are better at some things than the model is, and vice versa. 00:18:34.760 |
And so at the things that the model is worse at, 00:18:36.760 |
you're trying to imitate something that the model can't do. 00:18:41.760 |
And at the things the model is better at, you're making the model worse because you're forcing it 00:18:44.760 |
to do the thing in the way that the human would. 00:18:53.760 |
Whereas with the comparisons and RL, you're kind of letting the model do whatever it wants to, 00:18:55.760 |
and it can just figure out, like, the best way for it to do things. 00:19:03.760 |
and I'm going to get to that, but I briefly want to talk about ChatGPT. 00:19:07.760 |
So one thing--I kind of think of ChatGPT as, like, 00:19:12.760 |
It's kind of like the next step at making the models more aligned 00:19:17.760 |
And some of the things that, like, you know, I think ChatGPT does better 00:19:21.760 |
is kind of, like, using dialogue as the universal interface, right? 00:19:28.760 |
You can, like, ask it to, you know, refine the answer, and so on. 00:19:39.760 |
but it's also--there's still important limitations, right? 00:19:43.760 |
Like, the biggest one is, like, the model hallucinates a lot. 00:19:46.760 |
It makes up facts when, you know, for whatever task you give it, 00:19:53.760 |
and that, you know, just makes it quite unreliable. 00:19:58.760 |
which kind of shows that, you know, it still has important misalignment issues. 00:20:09.760 |
the model should really, like, do the task to the best of its ability 00:20:20.760 |
But, yeah, one important principle that I think is really useful here 00:20:28.760 |
is that evaluation is easier than generation. 00:20:31.760 |
So if we ask humans to compare and rank different responses the model gave, 00:20:37.760 |
it is easier to tell the difference between different variants 00:20:42.760 |
of what the model did than it is to do the task itself. 00:20:46.760 |
Or, in other words, you know, you can do the comparisons on tasks-- 00:20:51.760 |
you can still, like, spot good behavior on tasks 00:20:53.760 |
that you might not be able to do by yourself. 00:20:56.760 |
And so if you're giving this kind of, like, feedback 00:21:00.760 |
that lets the system do better than you actually could. 00:21:07.760 |
And I think that's a very general principle that holds in lots of domains. 00:21:11.760 |
So, kind of like, you're probably most familiar-- 00:21:15.760 |
if you studied CS, you know about P versus NP -- 00:21:18.760 |
you know, we don't actually know whether they're different, 00:21:20.760 |
but in practice it seems like NP tasks are just much harder to solve than to verify. 00:21:27.760 |
like a lot of professional sports or esports just wouldn't be fun to watch 00:21:30.760 |
if you couldn't tell who's winning more easily 00:21:34.760 |
than you could actually compete on a professional level. 00:21:40.760 |
You can, like, look at your smartphones and tell which one you like more. 00:21:45.760 |
That is, like, also deeper than just looking at, like, the specs. 00:21:50.760 |
But it is actually very hard to build a good smartphone. 00:22:10.760 |
yeah, basically there's lots of domains where this applies. 00:22:16.760 |
this principle is, like, very useful when we want to, like, 00:22:20.760 |
align AI systems on tasks that we might not be able to do ourselves well. 00:22:33.760 |
And I think that's going to make it really difficult 00:22:46.760 |
So basically, on the x-axis, let's plot, like, the AI progress. 00:22:53.760 |
And on the y-axis, how difficult different tasks are. 00:22:57.760 |
And then as we have more AI progress, kind of like the tasks that AI-- 00:23:01.760 |
the difficulty of tasks that AI can do goes up. 00:23:05.760 |
And, like, one of the fundamental problems is that 00:23:09.760 |
the level of tasks that humans can reliably evaluate doesn't go up 00:23:14.760 |
because humans don't get better with AI progress. 00:23:22.760 |
But the problem is, once you cross this line, 00:23:28.760 |
you can't really tell, like, whether your model is actually doing the right thing. 00:23:40.760 |
And what we'll probably see is kind of what the question 00:23:45.760 |
from before alluded to, which is, like, well, now the systems are optimized for our approval. 00:23:52.760 |
And so they will try to tell us what we want to hear, 00:23:54.760 |
rather than all the things that they know to be true. 00:23:57.760 |
And, you know, they might learn how to deceive us 00:24:00.760 |
because, you know, that makes it easier to score higher on preferences. 00:24:06.760 |
And so kind of, like, the basic idea that we want to leverage 00:24:11.760 |
is related to the principle I just mentioned, 00:24:19.760 |
So, for example, if you have a large language model 00:24:22.760 |
writing a code base, like an entire code base, 00:24:25.760 |
there's just no way humans would be able to find all the bugs 00:24:31.760 |
Or, you know, the code base could have, like, a Trojan in there 00:24:34.760 |
and you might not be able to tell because it is so hard. 00:24:38.760 |
And that's why we see so much buggy code out there. 00:24:41.760 |
But if you ask your language model to find bugs and point them out to you, 00:24:46.760 |
once you've seen the bug, it's so much easier for you to say, "yep, that's a bug." 00:24:54.760 |
And so now you've taken the task of writing a code base down to, 00:24:58.760 |
"Well, I just have to evaluate whether that was a bug 00:25:05.760 |
And so the general principle that we're excited about here is, like, 00:25:09.760 |
we want to leverage AI assistance for human evaluation. 00:25:14.760 |
And so the hope is that we, together, if we pair up humans with AI, 00:25:17.760 |
you actually get a line that looks more like this, 00:25:20.760 |
where, you know, like, humans together with AI can evaluate much more difficult tasks. 00:25:32.760 |
there's, like, two different ways you could do that, 00:25:34.760 |
or there's many different ways you could do that. 00:25:36.760 |
Two I want to highlight is, like, first, you can ask AI to write a critique. 00:25:44.760 |
And in this case, it was a simple summarization task, 00:25:47.760 |
and we trained a language model to kind of, like, 00:25:49.760 |
to say things that are wrong with the summary. 00:25:57.760 |
For example, you could give people ChatGPT and ask them to use it while they evaluate. 00:26:10.760 |
You can ask for fact-checking or a quote or, you know, 00:26:13.760 |
whatever the model, like, chat GPT can actually reliably help you with. 00:26:18.760 |
And so the idea would be that, you know, like, using AI assistance, 00:26:22.760 |
you can kind of get all the smarts that AI has and leverage that 00:26:28.760 |
in order to figure out how you should evaluate what this system is doing 00:26:31.760 |
and, like, whether it's aligned with your preferences 00:26:38.760 |
And the big problem with this is how do we know whether it's working? 00:26:43.760 |
And one of the kind of, like, difficulties is that by assumption, 00:26:51.760 |
we're kind of dealing with a hard task where it's difficult to evaluate. 00:26:55.760 |
And we also want the task to be real because we don't want to, you know, 00:26:59.760 |
we don't want to solve a hard task that doesn't matter. 00:27:09.760 |
But also, if you have those, you usually don't have ground truth, 00:27:15.760 |
and how do you know whether the assistance is working 00:27:17.760 |
or it's biasing everyone to just say the same thing. 00:27:21.760 |
And so there's a simple technique that we used in the critiques work to do this, 00:27:29.760 |
which we call targeted perturbations. 00:27:33.760 |
And so what you do is you have a bunch of prompts. 00:27:35.760 |
So this could be, like, whatever people type into ChatGPT. 00:27:39.760 |
And then you kind of, like, take the response that you have 00:27:46.760 |
It doesn't actually have to be correct, but let's just assume it is. 00:27:50.760 |
And now you're asking a human to introduce some kind of subtle flaw 00:27:54.760 |
that is easy to miss but is an important flaw. 00:27:58.760 |
And now what you have is you have this paired data set of, like, correct and flawed responses. 00:28:03.760 |
And you know which one is good and bad because you made it worse. 00:28:07.760 |
And so, you know, in a way, that gives you ground truth. 00:28:13.760 |
And so what you now can do is you can take, like, 00:28:18.760 |
randomly select either the correct or the flawed response 00:28:21.760 |
and then show it to either a human or a human with assistance. 00:28:24.760 |
And then they have to say whether it was the correct or the flawed one 00:28:30.760 |
And in this setting, we can kind of, like, figure out -- like, 00:28:36.760 |
just try a lot of different AI assistance or scalable oversight techniques 00:28:41.760 |
and then say, you know, like, is this actually helping humans find the flaws we know are there? 00:28:51.760 |
And if it is, then, you know, like, you're actually really helping. 00:29:01.760 |
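(A minimal sketch of that evaluation protocol as a loop, under the assumption that the human raters are represented by callables; all names here are illustrative, not a published implementation.)

```python
import random

def targeted_perturbation_eval(pairs, judges, n_trials=1000):
    """Estimate how much assistance helps raters spot deliberately introduced flaws.

    pairs: list of (correct_response, flawed_response), where the flaw was added
    by a human, so ground truth is known.
    judges: dict mapping a condition name (e.g. "unassisted", "assisted") to a
    callable standing in for a human rater; it returns True if the shown
    response is judged flawed.
    """
    accuracy = {}
    for name, judge in judges.items():
        correct_calls = 0
        for _ in range(n_trials):
            good, bad = random.choice(pairs)
            show_flawed = random.random() < 0.5      # randomly show either version
            verdict = judge(bad if show_flawed else good)
            correct_calls += (verdict == show_flawed)
        accuracy[name] = correct_calls / n_trials
    return accuracy  # compare "assisted" vs "unassisted" to see if the help is real
```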
So this is training the language models to write critiques for summaries. 00:29:05.760 |
And what we can show is that when we are assisting humans with critiques 00:29:09.760 |
at the evaluation, they actually find 50% more flaws than they did without. 00:29:14.760 |
And so this is kind of, like, real signs of life that you can already use 00:29:18.760 |
in models that we can have today to help humans evaluate 00:29:22.760 |
and, like, find problems they would have missed otherwise. 00:29:26.760 |
And, of course, we still have to do this, like, on a much harder task 00:29:30.760 |
and, like, with, like, a real task in a sense. 00:29:35.760 |
And we also want to have, like, bigger effect size. 00:29:38.760 |
But I think it's just, like, it shows that there's promise in this approach. 00:29:44.760 |
And so in the long run, what I think we want to get to is 00:29:50.760 |
we kind of want to leverage AI for all the cognitive labor 00:29:54.760 |
that goes into evaluating whatever our AI systems are doing. 00:29:58.760 |
And this could be, you know, like, reading everything that's relevant 00:30:02.760 |
or fact-checking or doing calculations or, like, writing code 00:30:10.760 |
And then humans should focus on, like, their preference input, 00:30:13.760 |
like the things figuring out what they actually care about 00:30:19.760 |
And this way we can kind of, like, leverage, you know, like, 00:30:27.760 |
the abilities that, you know, the AI players will bring to the table 00:30:32.760 |
and the things that they will be better at than us eventually. 00:30:36.760 |
And then kind of, like, use them to help communicate the thing 00:30:41.760 |
that we actually care about and, you know, the things that we want. 00:31:05.760 |
I was wondering about this hallucination of responses. 00:31:09.760 |
Have you ever tried to consider some notion of uncertainty, like, you know, ensembling? 00:31:24.760 |
So ensembling is difficult because either you're, like, 00:31:28.760 |
training and fine-tuning an ensemble from the same pre-trained model 00:31:32.760 |
so you don't get that much variance in your ensemble, 00:31:34.760 |
or you're pre-training a bunch of different models 00:31:37.760 |
and now you're spending a lot of money on pre-trainings. 00:31:41.760 |
One thing, I mean, it seems like it should be a solvable problem 00:31:46.760 |
to just teach the model to say it's uncertain when it's actually uncertain. 00:31:53.760 |
And there's been a bunch of research in that direction, 00:31:56.760 |
but I think right now it's still, like, we're not really in a good shape. 00:32:07.760 |
Do you think we may run into a kind of signals and noise ratio problem 00:32:12.760 |
when it comes to AI-suggested critiques to AI answers? 00:32:17.760 |
Because I'm sure, like, when AI is trying to point out 00:32:21.760 |
particular problems in text, humans are more likely to report more problems. 00:32:26.760 |
But what if it's noticing problems that humans wouldn't have necessarily noticed themselves? 00:32:32.760 |
So we did try to control for that a little bit 00:32:35.760 |
by, like, having humans rate the severity of their flaws 00:32:40.760 |
and whether they would have noticed them otherwise. 00:32:47.760 |
But also, like, I mean, a lot of the time the model is nitpicking, 00:32:51.760 |
and then those are, like, not the interesting cases. 00:32:55.760 |
Also, if you, like, look at the example I showed, 00:33:00.760 |
like, a lot of the critiques are just actually quite garbage. 00:33:03.760 |
And one of the, like, things that makes it easy for critiques is it's okay 00:33:10.760 |
if most of them are garbage because the human can just read them and discard the bad ones. 00:33:14.760 |
And it kind of, like, more, you know, helps the evaluator know 00:33:19.760 |
where to focus on or, like, notice, like, think of something they might have missed otherwise. 00:33:25.760 |
So it's more like, you know, the critiques help you brainstorm 00:33:31.760 |
But if you're kind of, like, using an assistant, 00:33:33.760 |
you probably want more reliability than, like, it getting most of the answers wrong. 00:33:45.760 |
How do we ensure that the evaluation metrics we are using 00:33:48.760 |
in your recursive reward modeling approach, like, detect deception 00:33:52.760 |
and, like, sharp left turns or something don't have, like, 00:33:57.760 |
Yeah, I think, well, it depends a lot what kind of discontinuity 00:34:04.760 |
Like, if, you know, you get overnight, like, a model that is, 00:34:09.760 |
let's say, 1,000 times larger on, like, number of parameters 00:34:14.760 |
or, like, equivalently better, that can, like, create quite a step up 00:34:20.760 |
and that makes it quite difficult to do this kind of evaluation. 00:34:29.760 |
So in that sense, I think it's going to be very important 00:34:32.760 |
to, like, scale up AI more continuously. And, like, the idea 00:34:42.760 |
with recursive reward modeling is that you're training the systems 00:34:47.760 |
to help you evaluate, you know, systems that are trained with that help. 00:34:53.760 |
And so if you can figure out how to get, like, fine-tune them 00:35:00.760 |
in a way that they mention everything that is actually useful 00:35:03.760 |
and relevant, then you would still be able to evaluate systems, 00:35:08.760 |
even though they're much smarter than anything you've evaluated before. 00:35:13.760 |
Maybe let me make this more concrete because there is, I think, 00:35:17.760 |
a way that you could measure that or, like, one way we've tried it 00:35:21.760 |
was using what we call the discriminator-critique gap. 00:35:24.760 |
So to measure that, you're training two different things. One is a discriminator that predicts, 00:35:34.760 |
is this the flawed response or the correct response? 00:35:49.760 |
And the discriminator is just, like, pure gradient descent. 00:35:52.760 |
It doesn't have to want to tell you anything. 00:35:54.760 |
You're just basically hooking into the model's representations 00:35:57.760 |
and trying to, like, get all the relevant latent insight it has about the task. 00:36:05.760 |
And then on the other hand, you have this critique model 00:36:08.760 |
or this general assistance model, and you're measuring 00:36:12.760 |
how often does the assistant actually help me point to the right flaw? 00:36:28.760 |
And if that gap is large, it means the model knows a lot about the task that it's not telling you. 00:36:42.760 |
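(A sketch of how the two measurements just described might be combined into a single gap number; the callables are stand-ins for the trained discriminator and the critique pipeline, not a published implementation.)

```python
def discriminator_critique_gap(examples, discriminator, critique_finds_flaw):
    """Gap between what the model can detect and what its critiques tell you.

    examples: list of (response, is_flawed) with ground truth from targeted
    perturbations.
    discriminator(response): predicts whether the response is flawed, reading
    out the model's latent knowledge via supervised fine-tuning (assumed).
    critique_finds_flaw(response): True if the written critique points a human
    to the introduced flaw (assumed).
    """
    disc_correct = sum(discriminator(r) == f for r, f in examples)
    flawed = [r for r, f in examples if f]
    critiqued = sum(critique_finds_flaw(r) for r in flawed)

    disc_acc = disc_correct / len(examples)       # what the model "knows"
    critique_rate = critiqued / len(flawed)       # what the model "tells you"
    return disc_acc - critique_rate               # large gap: knows more than it says
```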
Yeah, but I guess there's, like, lots of value 00:36:51.760 |
That's why we want to test it on the current models. 00:36:59.760 |
So I wanted to ask about, like, maybe towards the end, 00:37:08.760 |
you had a slide where, like, there was, like, 00:37:12.760 |
And so, you know, I couldn't help but notice, like, 00:37:27.760 |
And so, like, at least, like, in my personal experience 00:37:30.760 |
using ChatGPT, like, there were some things 00:37:38.760 |
And I was like, oh, like, how did that come up? 00:37:43.760 |
if it's, like-- there's, like, different things, right? 00:37:49.760 |
One thing that I thought was a bit concerning 00:37:54.760 |
people don't always communicate their preferences, 00:37:58.760 |
Or, like, there could be, like, coordinated efforts, 00:38:31.760 |
And so my question is, like, how do you, like, 00:38:38.760 |
Like, have you, like, recognized coordinated efforts 00:38:41.760 |
to, like, you know, like, specifically reward 00:39:00.760 |
I mean, the first obvious thing that you shouldn't do is, like, naively train on whatever feedback people on the internet give you. 00:39:14.760 |
If you think of, like, Microsoft Tay or something, 00:39:20.760 |
I mean, right now, what we're doing is, like, 00:39:22.760 |
we're hiring a bunch of people and then ask them to give this feedback, but that isn't necessarily 00:39:44.760 |
a diverse and representative set of human preferences. 00:39:53.760 |
And so I kind of wish there was also just, like, 00:39:56.760 |
more targeted research on, like, how we should do that, and I think that kind of research is probably 00:40:04.760 |
better placed outside of, like, big tech companies. 00:40:21.760 |
Ideally, you'd want something like what humanity would do under reflection or something. 00:40:24.760 |
And so I think it's a really big, important question. 00:40:55.760 |
considering that we're currently training these models 00:41:01.760 |
and hopefully getting closer to human preferences 00:41:04.760 |
As human preferences change, we've seen, like, 00:41:20.260 |
I mean, the most obvious thing is it's, like-- 00:41:22.760 |
the model's knowledge base is kind of like the pre-training 00:41:35.760 |
In terms of updating kind of, like, human preferences, you can redo the fine-tuning. 00:41:43.760 |
And the fine-tuning run is, like, comparatively cheap. 00:41:53.760 |
and people started using it for all kinds of, you know, 00:41:56.760 |
tasks that they want to build their company around, 00:42:07.760 |
into, like, adapting their prompts to whatever 00:42:19.760 |
So on the note of exceeding human level performance, 00:42:25.760 |
the model has this immense corpus of the entire internet. 00:42:28.760 |
If you want to specialize in a specific domain, 00:42:30.760 |
like chemistry or material science or something, 00:42:46.760 |
You mean, like, less data on, like, the chemical domain 00:42:50.760 |
[INAUDIBLE] research paper over the last 30 years or something. 00:42:54.760 |
And you can throw that into pre-training, right? 00:42:57.760 |
But can the model really learn this effectively 00:43:00.760 |
Or can we somehow adapt the abstract concepts 00:43:06.260 |
I mean, that's kind of the general idea with what you can do with fine-tuning. 00:43:14.760 |
For example, InstructGPT was trained almost entirely 00:43:17.760 |
on English language feedback and demonstrations. 00:43:28.760 |
with people who don't know anything about chemistry. 00:43:37.760 |
And this fine tuning can be very sample efficient. 00:43:41.760 |
So even a small amount of data can make a meaningful change in the model behavior. 00:43:59.760 |
Do you put emphasis in training on different expression styles? 00:44:03.760 |
So what I've noticed from GPT-3 that it always 00:44:06.760 |
gives you, like, very structured or scientifically worded answers. 00:44:11.760 |
Do you consider any training if it returns you, 00:44:25.760 |
I mean, the tricky thing is, ideally, the model 00:44:28.760 |
should give you the kind of answer that you want to have. 00:44:33.760 |
And some people prefer a more scientific or technical answer. 00:44:36.760 |
Some people might prefer a more generic answer. 00:44:39.760 |
And I mean, right now, like, ChatGPT doesn't have, like, a way to customize that. 00:44:48.760 |
And that's something that would be really exciting to have. 00:44:52.760 |
But also, I think the kind of stylistic property 00:44:56.760 |
that you've observed is, in fact, like, probably an artifact of who produced the fine-tuning data. 00:45:02.760 |
And so a lot of the ChatGPT labelers were, like, more, 00:45:06.760 |
you know, like, I think more, like, computer science-y 00:45:09.760 |
and, like, more-- there was, like, more data generated 00:45:12.760 |
by programmers compared to InstructGPT, which 00:45:20.760 |
And yeah, there's, like, different-- it's, like, 00:45:27.760 |
So there is no specific effort to distinguish that, 00:45:35.760 |
I mean, we should make a distinguished effort. 00:45:38.760 |
It should give you, like, the style that you want, right? 00:45:45.760 |
So one of the things that I've been thinking about, 00:45:52.760 |
to play a factor in the education of the younger generation. 00:46:00.760 |
And so if you go back to the graph of the AI progress 00:46:03.760 |
and the human level-- yeah, what humans can evaluate, 00:46:08.760 |
what I'm starting to think about is, like, over a break, 00:46:12.760 |
I showed, like, my 10-year-old cousin how to use ChatGPT, 00:46:12.760 |
or I perceive it to be more difficult for them 00:46:32.760 |
to discriminate even simpler tasks than what we do now. 00:46:39.760 |
So I'm wondering how that might disrupt or make this alignment problem harder, 00:46:49.760 |
especially for kids who take, for instance, what ChatGPT says as a given truth. 00:46:58.760 |
I was just wondering what your thoughts are on that. 00:47:09.760 |
I mean, we do tell people, like, please don't believe everything the model says, 00:47:13.760 |
But also, I think one thing that I'm hopeful for 00:47:17.760 |
is that, like, your cousin will end up, like, figuring out 00:47:24.760 |
how to work with all of these AI tools that are getting better 00:47:29.760 |
and learning how to actually leverage them productively, 00:47:38.760 |
It's kind of like 20 years ago or something, when you were, like, 00:47:41.760 |
using Google Search much earlier than everyone else, 00:47:44.760 |
you're probably going to get better at, like, 00:47:46.760 |
using that as a tool for everything you want to do. 00:47:57.760 |
I think the slide where human tasks and the chat tasks 00:48:18.760 |
to the real world, to, like, physical ground truth, 00:48:21.760 |
and using language as, like, a compressed interface 00:48:28.760 |
sensor technology directly with your models? 00:48:38.760 |
I mean, it depends on what that sensor could be, right? 00:48:40.760 |
Like, I guess, like, one of the most straightforward things 00:48:45.760 |
is letting it browse the web, and then it can, like, fact check its own answers, 00:48:48.760 |
and it can, you know, like, import external knowledge into its answers. 00:48:54.760 |
And yeah, I think that would be quite useful. 00:48:59.760 |
I think that would also be quite useful for assisting human evaluation. 00:49:12.760 |
There is a published work on using the model for browsing. 00:49:17.760 |
I think-- so one thing that makes it harder when you're 00:49:20.760 |
using these, like, external sensors, or if you're 00:49:23.760 |
letting the model interact more directly with the real world 00:49:26.760 |
is that it raises more safety questions, right? 00:49:29.760 |
If you let your language model make arbitrary API calls, 00:49:33.760 |
then you have to be a lot more careful with which calls it actually gets to make. 00:49:41.760 |
And if you're-- as opposed to if you're just, like, 00:49:46.760 |
then you can decide which ones you want to make. 00:50:02.760 |
About the reasoning abilities of these large language models. 00:50:06.760 |
I've seen, like, different people talk about how 00:50:08.760 |
it's only, like, a fixed amount of compute per token, 00:50:10.760 |
while, like, humans, they have system one and system two, 00:50:13.760 |
where we can, like, just speak quickly versus, like, 00:50:15.760 |
actually do some reasoning and think through things 00:50:19.760 |
And then I've seen other works that try to, like, 00:50:21.760 |
kind of use prompting to force it to do, like, a chain of thought 00:50:27.760 |
Do you think that stuff is sufficient to do, like, 00:50:46.760 |
to have new capabilities, and more, like, you know, 00:51:05.760 |
So what do you think is the-- is there a role for [INAUDIBLE] 00:51:08.760 |
playing from the feedback, especially if you have, like, 00:51:11.760 |
human-- like, you don't have to play chatbots. 00:51:15.760 |
So do you think, like, this would be more popping out? 00:51:38.760 |
I think, broadly, you can categorize this kind of thing 00:51:46.760 |
And I think that's valuable, and that should help us, 00:51:48.760 |
like, make the same pre-trained models, like, 00:51:52.760 |
more aligned according to the human preferences 00:52:03.760 |
Also, if, like, someone wants to use RLHF on, like, 00:52:06.760 |
the GPT-3 models, will OpenAI offer some sort of API for that? 00:52:19.760 |
you can, like, distill best-of-n sampling and do this kind of thing. 00:52:26.760 |
So the first question is, could you more clearly describe 00:52:35.760 |
like, x amount of programming data, y steps of RLHF. 00:52:43.760 |
And then, so how much of each data do you use? 00:53:20.760 |
yeah, it was, like, about 20,000 hours of human feedback, 00:53:26.760 |
human feedback, because that's-- you can get, like, 00:53:32.260 |
I mean, the big question is, like, how do you make-- 00:53:53.260 |
So the next question, I think, that I was told was, 00:54:02.760 |
Yeah, so, I mean, the kind of, like, ambition of that plan is that the model basically 00:54:18.760 |
writes an alignment research paper that we read, 00:54:22.760 |
and then we're like, oh, this is a really cool idea. 00:54:26.760 |
And I think, you know, going back to evaluation, 00:54:31.760 |
I think it also applies to alignment research. 00:54:36.760 |
like, I find it much easier to evaluate, you know, 00:54:39.760 |
alignment research than I find it to, like, produce it. 00:54:43.760 |
And so while there might be conceptual breakthroughs 00:54:48.760 |
that we need that we couldn't even evaluate right now, that's also kind of 00:55:00.760 |
the reason why we want to do scalable oversight, right? 00:55:06.760 |
Like, if the language model produces this really brilliant insight, 00:55:13.760 |
we should be able to have an easier time recognizing it 00:55:22.760 |
and to, like, figure out whether or not that was a good idea, 00:55:25.760 |
what is the weaknesses and what are the strengths? 00:55:27.760 |
And, like, you know, what kind of experiments should we run 00:55:37.760 |
Now, the story of just using RLHF to train a model -- 00:55:42.760 |
you have the obvious pitfalls, which is, you know, 00:55:45.760 |
the model might write, like, an alignment proposal 00:55:47.760 |
that kind of looks good to us, but is actually, you know, subtly flawed. 00:55:58.760 |
And so evaluating that might be really hard, maybe it's not, but, you know, 00:56:02.760 |
I think we should expect it to be really hard, 00:56:04.760 |
and then leveraging AI assistance to evaluate that 00:56:47.760 |
all the cool things you see it do come from pre-training 00:56:55.760 |
to the fine-tuning stage is that you didn't see it before. 00:57:03.760 |
And the reason that we didn't see it in the pre-trained model 00:57:05.760 |
is because the pre-trained model was so misaligned, it 00:57:09.760 |
was not trying to show you all the things it can do. 00:57:21.760 |
I think that what our project basically has been doing 00:57:24.760 |
is, like, unlocking capabilities that were already in the model 00:57:28.760 |
and making those available for humans to use. 00:57:35.760 |
alignment research is very dual-use in the sense that, 00:57:38.760 |
you know, A, if you have really good alignment techniques, 00:57:42.760 |
you can use it to align with whatever values you want, 00:57:49.760 |
And B, it also, like, if you're doing alignment right, 00:57:54.760 |
it will always look a little bit like you made 00:58:01.760 |
the model more capable, because before, it just wasn't really trying that hard to help you. 00:58:06.760 |
So, you know, you actually see these capabilities 00:58:27.760 |
Yeah, so that was what I was talking about here, right? 00:58:29.760 |
Like, this is, like, the whole problem that we have, 00:58:43.760 |
And that's why we want to do scalable supervision 00:59:10.760 |
have to test empirically, like, how difficult 00:59:24.760 |
to get the outer alignment signal really right 00:59:30.760 |
And once we have that, then a lot of the other things 00:59:40.760 |
But, you know, one story is you're kind of training 00:59:47.760 |
the model, and it learns, basically, a bunch of inner optimizers, 00:59:53.760 |
So for example, like, GPT-3 can do, like, in-context learning. 00:59:57.760 |
And that's, like, a kind of, you know, learned optimizer. 01:00:01.760 |
And so now you're, like, doing all that RLHF training 01:00:05.760 |
or whatever, like, alignment training you have. 01:00:11.760 |
And it might only learn to do the thing that you want on distribution. 01:00:17.760 |
and this distributional shift could be auto-induced, 01:00:19.760 |
meaning, like, the model is causing it itself. 01:00:33.760 |
and, you know, like, how much that would actually happen 01:00:37.760 |
But one kind of, like, more important question 01:00:40.760 |
is, like, if you have a really reliable outer alignment 01:00:44.760 |
signal and you have this, like, general training 01:00:56.760 |
or, like, to get its inner optimizers in a row, basically. 01:01:01.760 |
And so then you've reduced, like, the inner alignment problem to the outer alignment problem. 01:01:07.760 |
And how do you, like, construct an outer alignment signal that is really reliable? 01:01:11.760 |
And those are problems that we have to deal with anyways. 01:01:14.760 |
But yeah, I don't know how it's actually going to shake out. 01:01:27.760 |
So regarding alignments, one of the kind of problems 01:01:33.760 |
that I've been encountering in some discussions 01:01:38.760 |
in, like, explaining why it comes to these judgments. 01:01:42.760 |
There's not even been much interest in the way 01:01:54.760 |
But as to why it's making these out-of-line judgments, 01:02:00.760 |
have you all been able to interrogate the model? 01:02:16.760 |
But you don't know whether it's answering truthfully. 01:02:42.760 |
is probably interpretability, where, you know, 01:02:59.760 |
in particular, and that's such a high-dimensional space. 01:03:06.760 |
Reducing the dimensionality of that representation 01:03:12.760 |
Yeah, I mean, we are working on that problem. 01:03:20.760 |
And so it seems generally not to be a very easy problem. 01:03:28.760 |
But, you know, I'm hopeful that we can do some things. 01:03:32.760 |
I think, in general, the problem of interpretability, 01:03:37.760 |
or, like, using interpretability for alignment, 01:03:51.760 |
is that anything you can leverage would be useful, because it's 01:03:54.760 |
another tool in your toolbox of, like, detecting deception. 01:04:10.760 |
if you really get really good at interpretability, 01:04:25.760 |
are there problems that are really hard to find with the interpretability tools? 01:04:32.760 |
of the standard practice that we have in general 01:04:34.760 |
is that you have to find explanations of the problem. 01:04:37.760 |
- And I guess then my question would be, like, 01:04:46.760 |
So, again, this is kind of, like, an open question. 01:04:53.760 |
But the way I think about it is that at the end of the day, what really is going to matter 01:04:57.760 |
is the decisions that the model actually takes in the world. 01:05:05.760 |
And if you're in a setting where you're confident that all the things the model actually 01:05:12.760 |
does are good, then does it still matter what the model thinks internally? 01:05:19.760 |
You have to find out, like, if that value would be more 01:05:24.760 |
Like, we're trying to make, like, a really, really 01:05:30.760 |
you know, you can train the model to do the things 01:05:32.760 |
that you want it to do because you can always 01:05:53.760 |
as we're talking about the topic of interpretability,