
Stanford CS25: V2 I Language and Human Alignment



00:00:00.000 | It's my pleasure to welcome Jan from OpenAI.
00:00:08.600 | He leads the alignment team there and was previously a researcher at DeepMind as well.
00:00:12.920 | He holds a PhD in reinforcement learning theory, has been thinking about the alignment problem
00:00:17.840 | for over 10 years, and today he'll be giving a very interesting talk.
00:00:21.400 | So hope you guys enjoy.
00:00:22.920 | Yeah, thanks a lot for the intro, and thanks a lot for having me.
00:00:27.160 | I'm very excited to talk about this stuff.
00:00:30.720 | I'm also super happy to keep it interactive.
00:00:33.320 | If you have questions at any point, please interrupt me.
00:00:39.280 | I want to start out with a few very basic observations on what I think is going on.
00:00:50.080 | So the first one is, Team AI is joining the game, and Team AI has a lot of different players.
00:01:00.200 | They don't all join at the same time, but rather they join one by one.
00:01:05.520 | And their players vary a lot in how good they are.
00:01:10.640 | And right now, a lot of the players that have joined so far aren't really that smart and
00:01:16.720 | usually can do only a very narrow set of tasks.
00:01:23.000 | But one thing that we've kind of observed is that over time, you know, we're seeing
00:01:29.000 | stronger and stronger players join, and this is kind of where we are now.
00:01:35.520 | And then in general, we expect that Team AI will have incredibly strong players, so those will
00:01:41.880 | be players that are able to think so much better than humans, so much faster, and so
00:01:47.820 | much more cheaply.
00:01:50.200 | And these haven't joined yet.
00:01:53.400 | And so the anchor point that we have, if you think, for example, about ChatGPT, ChatGPT
00:02:01.880 | can already beat any human at knowing more facts or speaking more languages, and it can
00:02:09.360 | write about 50 words per second, and can do so about 100 times cheaper than humans could
00:02:15.760 | at minimum wage.
00:02:18.560 | And so, you know, ChatGPT also has some really important limitations, and there's a lot of
00:02:25.940 | things that it can't do yet, but it is kind of an indicator of some of the players that
00:02:33.920 | may be able to join in the future.
00:02:37.220 | And so it seems like in the long run, Team AI will have all the advantages over Team Human.
00:02:47.380 | And there's an important caveat, which is there's one important advantage that Team Human
00:02:56.100 | has, which is Team Human gets to pick which players from Team AI join, and when.
00:03:03.640 | And so this is kind of like an advantage that we should really be leaning into when we're
00:03:09.500 | thinking about what to do, and when we're thinking about, you know, this game that we're
00:03:13.260 | playing with Team AI, and that we'll be playing with Team AI in the future.
00:03:19.340 | So I think two of the main objectives of what we as Team Human should do is, like, first,
00:03:28.880 | we should try to recruit players from Team AI to play on Team Human.
00:03:36.100 | And so this is kind of what I would broadly call alignment.
00:03:40.420 | And this is kind of like the problem that I'm working on.
00:03:44.040 | And then there's also other objectives.
00:03:45.700 | So another objective that I think is going to be really important is you want to write
00:03:49.540 | the rules of the game so that Team Human doesn't lose.
00:03:53.600 | And right now, Team Human kind of has the ball, and we get to write the rules, so we should
00:03:58.140 | write rules that, you know, make sense, and that let us keep playing this game in the future.
00:04:06.660 | And so in this talk, I won't really talk about the second point at all.
00:04:10.820 | And I'll talk about the first point, because that's what I know best and what I'm
00:04:15.220 | working on.
00:04:16.220 | And kind of to phrase it differently, or to make it kind of, like, more practical, like,
00:04:24.340 | one way I'm thinking about alignment is, like, you want to build AI systems that follow human
00:04:29.340 | intent, and that, you know, follow human preferences that do what we want them to do.
00:04:36.620 | And so a bunch of the things, basically, I'll talk about two main things.
00:04:42.380 | The first part is going to be work that we've done in the past, and kind of, like, which
00:04:48.560 | roughly is in the bucket of, like, we are trying to figure out how we can make the models
00:04:56.620 | that we have today as aligned as we can, and we're just kind of trying -- we're going to
00:05:01.420 | try hard to do this, and we'll see how far we get.
00:05:05.180 | And then the second bucket is the things that we have to do next, the stuff that we haven't
00:05:10.140 | done yet that we think are going to be really important, and I want to kind of, like, lay
00:05:14.940 | out why I think they're going to be important.
00:05:19.740 | So, now, I said, you know, I'm, like, trying to make this clearer, or, like, break down
00:05:27.220 | more what alignment means, so let me unpack that here, because now, you know, the big question
00:05:32.540 | is, like, what does it mean to follow human intent?
00:05:35.180 | And kind of, like, two main categories of intent that we care about is, I would say,
00:05:40.580 | stated intent, so if you -- you know, I give the system an instruction, or if I want it
00:05:45.420 | to be my assistant, it should be my assistant, and it should follow the instruction.
00:05:49.700 | But then there's also all these other intents that I don't say when I'm usually, you know,
00:05:54.380 | talking to a system or a human that I also really care about, like, you know, it shouldn't
00:05:59.420 | literally always do what I say, but do the thing that I mean, and it shouldn't make up stuff,
00:06:03.860 | and it shouldn't, you know, do harmful things, and it should ask a lot of questions when
00:06:08.580 | it's not sure what I mean, and so on and so on.
00:06:11.420 | And so these are all kind of, like, things that are often just, like, really difficult
00:06:17.180 | to, like, precisely specify, or, like, you know, put precisely into rules, but it is still
00:06:26.740 | things that we want to get AI to do, and that we have to figure out how to, you know, get
00:06:32.220 | into our system.
00:06:35.220 | And so, kind of, like, the main technique that we're using today for this is what we
00:06:40.420 | call reinforcement learning from human feedback, so that was used to train InstructGPT and ChatGPT, which
00:06:46.100 | are the two, like, main systems that I'll talk about in this talk.
00:06:50.500 | And basically, the basic setup is very simple, and it's also, like, a super general technique
00:06:56.260 | that applies to lots of different AI models, and modalities, and settings, but in this
00:07:03.140 | case, we'll be using it here as well, and so there's two steps -- there's actually another step of,
00:07:08.980 | like, fine-tuning on demonstrations, but I'm going to just skip it for the sake of simplicity.
00:07:14.580 | The first step is you want to train your reward model from comparisons, so you have a prompt on the
00:07:20.420 | top, in this case, you know, explain something in the nucleotide field, or, you know, help
00:07:27.300 | me with my term paper, whatever it is, and then the model does a bunch of things, and
00:07:32.900 | then you rate which one is, like, closest to the thing that you intended the model to do.
00:07:37.940 | And so you have this big set of preferences, and you train your reward model, and the reward
00:07:41.060 | model basically just learns to predict which one you would prefer.
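To make this first step concrete, here is a minimal sketch of fitting a reward model to pairwise comparison data; the toy model, feature tensors, and loss below are simplifying assumptions, not the actual InstructGPT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (not the actual InstructGPT code): a reward model maps an encoded
# (prompt, response) pair to a scalar score and is trained so that the response the
# labeler preferred scores higher than the rejected one.
class ToyRewardModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # stand-in for a transformer with a scalar head

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)  # shape: (batch,)

def preference_loss(model: ToyRewardModel,
                    chosen_feats: torch.Tensor,
                    rejected_feats: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style loss: maximize the probability that the chosen response
    # is ranked above the rejected one.
    return -F.logsigmoid(model(chosen_feats) - model(rejected_feats)).mean()

# Toy usage: random features stand in for encoded (prompt, response) pairs.
model = ToyRewardModel()
loss = preference_loss(model, torch.randn(4, 128), torch.randn(4, 128))
loss.backward()
```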
00:07:44.980 | Everything okay?
00:07:46.180 | I'm going to say, like, just stand more in front of the camera, but I think it'll look
00:07:53.620 | good.
00:07:54.620 | Sorry about that.
00:07:55.620 | Maybe let's turn it a little bit.
00:07:59.220 | Okay.
00:08:00.220 | So now we have this reward model that captures kind of our preferences and what we care about
00:08:06.540 | and what we intend for the model to do, and then the second step is now you optimize against
00:08:12.420 | your reward model with RL.
00:08:15.180 | And so in that setting, you know, like, the model tries a whole bunch of different things,
00:08:19.780 | and the reward model kind of tells it which one of these things is probably more of, like,
00:08:25.140 | the thing that it cares about.
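A rough sketch of this second step, assuming a PPO-style setup (PPO is named later in the talk) with a KL penalty that keeps the tuned policy near the original model; the coefficient and numbers below are placeholders, not OpenAI's training configuration.

```python
import torch

# Schematic of the shaped reward used in RLHF fine-tuning: the reward model's score
# minus a KL penalty against the original model, so the policy can't drift into
# reward-model exploits. All values below are illustrative placeholders.
def rlhf_reward(rm_score: torch.Tensor,
                logprob_policy: torch.Tensor,
                logprob_reference: torch.Tensor,
                kl_coef: float = 0.02) -> torch.Tensor:
    approx_kl = logprob_policy - logprob_reference  # per-sample KL estimate
    return rm_score - kl_coef * approx_kl

shaped = rlhf_reward(rm_score=torch.tensor([1.3, 0.4]),
                     logprob_policy=torch.tensor([-12.0, -20.0]),
                     logprob_reference=torch.tensor([-13.5, -20.5]))
# These shaped rewards would then drive a policy-gradient update such as PPO.
```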
00:08:28.580 | When you say "comparison," is that made by a human labeler to get the data?
00:08:33.340 | Okay.
00:08:34.340 | And are those consistent, or does that depend on the labeler?
00:08:38.740 | It'll depend on the labeler.
00:08:39.740 | Different labelers will have different preferences.
00:08:40.740 | There also might be inconsistencies, and we can give you examples of, like, intransitive
00:08:46.740 | preferences, but those haven't really been a problem in practice.
00:08:52.620 | And so far, you know, like, our labelers often don't agree, but the model will average over
00:08:58.540 | all of that.
00:08:59.540 | But yeah.
00:09:00.540 | So this is, like, the basic technique.
00:09:03.540 | It's conceptually, like, quite simple.
00:09:07.340 | You can make it even simpler if you had, you know, if you didn't train the reward model
00:09:16.340 | and you labeled, instead, like, every episode, but it would be a lot less data efficient.
00:09:22.820 | And so you train a reward model to make it, like, more data efficient.
00:09:30.660 | So how well does it work?
00:09:31.660 | So this is kind of, like, one of the main plots from the InstructGPT paper.
00:09:35.660 | And this is the one I like showing, because it really blew my mind, and it still does.
00:09:41.780 | What do we see here?
00:09:42.780 | So on the x-axis, you see, this is from the GPT-3 model series, and you see this is, like,
00:09:49.060 | three different sizes of models over two orders of magnitude.
00:09:53.020 | And on the y-axis is, how well does the model score on human preferences?
00:09:57.980 | So if we show a bunch of samples to humans, how likely are they to prefer one over the
00:10:03.460 | other?
00:10:04.460 | And then what we see is that even, like, the largest GPT-3 model is dispreferred to the
00:10:11.260 | smallest InstructGPT variant.
00:10:13.780 | And so the 100x smaller InstructGPT model is actually preferred over the much larger, like,
00:10:25.900 | full-size GPT-3 model.
00:10:29.380 | And that's kind of wild.
00:10:30.380 | Sorry.
00:10:31.380 | Let me just finish my talk.
00:10:32.380 | So why is this a big deal?
00:10:33.380 | So basically, it basically shows that, like, fine-tuning with human feedback
00:10:45.380 | makes the model so much more useful than, you know, scaling up without
00:10:55.300 | fine-tuning.
00:10:56.300 | Our worry was that fine-tuning makes the model worse, because then nobody wants to use it, and then we make all
00:10:57.300 | these fancy alignment techniques that don't get adopted.
00:11:00.100 | And so what we were-- like, originally, in, like, the first version, we saw these regressions.
00:11:07.100 | And then what here is labeled PPO-ptx is kind of like a variant where we mix in pre-training
00:11:14.220 | data into the fine-tuning.
00:11:16.220 | And that mitigated a bunch of the regressions that we saw.
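As a rough illustration of the PPO-ptx idea just described, here is a schematic combined loss that mixes a language-modeling loss on pre-training data into the RL fine-tuning objective; the coefficient is an arbitrary placeholder, not the value used for InstructGPT.

```python
import torch

# Schematic of PPO-ptx: add a standard language-modeling loss on pre-training data to
# the PPO loss so fine-tuning doesn't regress the model's original capabilities.
def ppo_ptx_loss(ppo_loss: torch.Tensor,
                 pretrain_lm_loss: torch.Tensor,
                 ptx_coef: float = 1.0) -> torch.Tensor:  # placeholder coefficient
    return ppo_loss + ptx_coef * pretrain_lm_loss

combined = ppo_ptx_loss(torch.tensor(0.7), torch.tensor(2.3))
```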
00:11:19.780 | Yeah.
00:11:20.780 | And I just had a quick follow-up to that.
00:11:45.680 | How important is, like, fidelity of fine-tuning data that you have?
00:11:46.680 | Like, you guys-- you collect data from humans, right?
00:11:47.680 | Yeah.
00:11:48.680 | What if you were to use some pre-trained language model to score, you know, the data for
00:11:49.680 | the comparisons or something like that?
00:11:50.680 | How do you do that?
00:11:53.560 | In terms of-- well, there are certain things that the language model will be able to automatically
00:11:58.440 | rank, and some things it won't, because it won't know your exact preferences, or it won't
00:12:04.320 | know exactly what we wanted to do.
00:12:07.440 | And so whenever the language model does something that we disprefer, we actually-- we have to
00:12:14.160 | give it another data point, right?
00:12:16.240 | Or in other words, you know, if you're aligning with humans, you somehow have to put humans
00:12:20.720 | into the loop so that, you know-- otherwise, how does the model know what it's supposed
00:12:26.720 | to do?
00:12:28.720 | And lots of more questions.
00:12:29.720 | I don't know who was first.
00:12:30.720 | Yeah?
00:12:31.720 | How many human-- approximately, like, what's-- how many orders of magnitude of, like, human
00:12:32.720 | preferences do you need to achieve these--
00:12:34.720 | I'm going to get to that in a second.
00:12:35.720 | Sure.
00:12:36.720 | Of course, it will sort of like-- it will look at your PD over here, which is, I think,
00:13:00.280 | not a sample of each other, but some others.
00:13:01.280 | Yeah.
00:13:02.280 | So why did you decide to use that particular RL algorithm for this?
00:13:03.280 | We haven't-- we haven't actually compared-- carefully compared across our all algorithms.
00:13:04.760 | And it could very well be that a different RL algorithm would be better.
00:13:08.080 | That was kind of like-- I know, PPO was invented in OpenAI, so that's why we used it.
00:13:14.760 | It's not-- not a really good reason other than that.
00:13:18.760 | It works also pretty well.
00:13:21.760 | What are the labels that humans are using to, like, count up and count down versus comparisons?
00:13:29.760 | Comparisons.
00:13:30.760 | This is better than this other thing.
00:13:31.760 | We have people compare between, like, three to six different responses from usually different
00:13:37.760 | models.
00:13:38.760 | Yeah.
00:13:39.760 | So are PPO and the reward model currently used in ChatGPT
00:13:48.760 | in production?
00:13:49.760 | And if so, like, do you use any of the human feedback, like, you know, regenerate responses
00:13:54.760 | and stuff like that to help as a reward function as well?
00:13:58.760 | How do you mean regenerate?
00:13:59.760 | Like, there's a button on chat GPT where you can say, like, regenerate responses.
00:14:03.760 | Or do you use any implicit feedback, basically, in human use?
00:14:07.760 | I don't know what the current state is for that.
00:14:10.760 | I expect people will try to use it.
00:14:12.760 | But, you know, model-- chat GPT hasn't been out that long.
00:14:16.760 | Yeah.
00:14:17.760 | So I'm curious about this graph.
00:14:19.760 | Like, it seems like 100x, as you mentioned, increasing parameters doesn't give you that
00:14:23.760 | much more, like, fidelity there.
00:14:25.760 | Qualitatively, you have been tracking this for a while.
00:14:28.760 | Can you tell right off the bat, if you're, like, interacting with the 1 billion, like,
00:14:32.760 | model or the, like, 100 billion model, like, a pseudo-Turing test, the parameter size?
00:14:36.760 | Like, I give you a black box, can you tell me how many parameters it has?
00:14:40.760 | Probably not very precisely.
00:14:43.760 | But I think the big counter question is, like, do I get to write the prompt?
00:14:49.760 | I see.
00:14:50.760 | So if you just draw random prompts from whatever people put into the OpenAI Playground, which
00:14:55.760 | is what we used for InstructGPT, then I probably need quite a few to tell the difference.
00:15:01.760 | But if I get to write the prompt, I can probably do it in one or two.
00:15:04.760 | At least, like, if the task is, like, tell the difference between this and this.
00:15:10.760 | Yeah.
00:15:13.760 | I want to-- can I just do two more slides, and maybe your questions get answered?
00:15:17.760 | And then-- so this was the question about training costs.
00:15:22.760 | So this is another thing that kind of really blew my mind, is, like, compared to pre-training,
00:15:27.760 | it is incredibly cheap.
00:15:29.760 | So if you look at, like, the amount of laps that it takes to train GPT to E, and then
00:15:34.760 | you compare it with, like, how much does fine-tuning and the RL, what's pre-training mix and everything,
00:15:41.760 | like, the most expensive unstructured GPT version is, like, less than 2% of the pre-training
00:15:46.760 | compute.
00:15:47.760 | And if you want to train an even bigger model, it's going to be more expensive, and you could
00:15:51.760 | still use the same, like, fine-tuning step to make it more aligned.
00:15:56.760 | And of course, I think the important thing to note also here is, like, we haven't fixed
00:15:59.760 | all the problems.
00:16:00.760 | There's, like, important limitations.
00:16:02.760 | And so I wouldn't say that this is, like, you know, the last version, and we wouldn't
00:16:07.760 | try to figure out how to spend more compute and more human data in the future.
00:16:11.760 | But all in all, it was surprisingly effective.
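As a back-of-the-envelope illustration of the "less than 2% of pre-training compute" point, here is the ratio with placeholder FLOP counts; these figures are illustrative assumptions in the ballpark of public GPT-3 estimates, not OpenAI's actual numbers.

```python
# Hypothetical sanity check of the compute comparison; both FLOP figures are
# illustrative placeholders, NOT actual GPT-3 / InstructGPT accounting.
pretraining_flops = 3.1e23  # assumed pre-training cost
finetuning_flops = 5.0e21   # assumed cost of demonstrations + reward model + RL combined

print(f"Fine-tuning fraction: {finetuning_flops / pretraining_flops:.1%}")  # ~1.6%
```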
00:16:16.760 | OK, there were no more questions.
00:16:19.760 | More questions?
00:16:20.760 | Yeah.
00:16:21.760 | I just wanted to ask what the PTX was.
00:16:24.760 | Mixing pre-training data into the RL fine-tune, just, like, mix the gradients.
00:16:30.760 | Yeah.
00:16:31.760 | Quick one.
00:16:32.760 | What's the number of parameters for this graph?
00:16:34.760 | So you're asking about the number of parameters for this graph.
00:16:37.760 | So this is the full-size GPT-3 version.
00:16:39.760 | So this is the 175-billion-parameter model.
00:16:45.760 | More questions?
00:16:48.760 | There's also some questions on Zoom.
00:16:50.760 | Great.
00:16:51.760 | [INAUDIBLE]
00:16:54.760 | OK, sure.
00:16:55.760 | So the first one is--
00:16:57.760 | OK, sure.
00:16:58.760 | So the first question is, how do you deal with RLHF breaking in the limit?
00:17:02.760 | Expressed preferences are a good proxy for values.
00:17:05.760 | But optimizing for them is theorized to incentivize deception.
00:17:11.760 | Yes, I'll get to that.
00:17:13.760 | Sure.
00:17:15.760 | Sure.
00:17:16.760 | That's the next question.
00:17:17.760 | So that is, like, you want to automate alignment research.
00:17:20.760 | What happens if you need conceptual breakthroughs, which
00:17:23.760 | are difficult for experts to verify?
00:17:25.760 | OK, that would be a good take at the end as well.
00:17:30.760 | Sure, let's see.
00:17:31.760 | Sorry.
00:17:32.760 | [LAUGHTER]
00:17:34.760 | Yeah, I guess, like, one question is, like,
00:17:36.760 | how would fine-tuning directly on human feedback
00:17:38.760 | compare to fine-tuning with RL?
00:17:41.760 | Fine-tuning, like, supervised fine-tuning?
00:17:44.760 | I think it's more like if you directly use the human feedback data.
00:17:50.760 | I'm also not sure what that means.
00:17:54.760 | So, I mean, so one baseline I'm showing here
00:17:56.760 | is, like, what if you just take human demonstrations in the sense
00:18:00.760 | that, you know, we have a bunch of tasks.
00:18:02.760 | We just ask humans to do them, record what they did,
00:18:05.760 | and then train the model to imitate that.
00:18:08.760 | And here, it's, like, just very basic behavioral cloning,
00:18:11.760 | just using the same loss they use in pre-training.
00:18:14.760 | And then, you know, it is noticeably better than the few-shot prompted version,
00:18:18.760 | but it's still not as good as RL.
00:18:21.760 | And so that's why we like using RL.
00:18:23.760 | And basically, conceptually, there's two problems
00:18:26.760 | with the imitating humans approach.
00:18:28.760 | One is humans are better at some things than the model is,
00:18:32.760 | and they're worse at other things.
00:18:34.760 | And so at the things that the model is worse,
00:18:36.760 | you're trying to imitate something that you can't do.
00:18:39.760 | And on the things where the model is better,
00:18:41.760 | you're making the model worse because you're forcing it
00:18:44.760 | to do the thing in the way that the human would.
00:18:49.760 | And so with RL, you're kind of--with RLHF,
00:18:53.760 | you're kind of letting the model do whatever it wants to,
00:18:55.760 | and it can just figure out, like, the best way for it to do things.
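For contrast with RLHF, here is a minimal sketch of the behavioral-cloning baseline described above: supervised fine-tuning on human demonstrations with the same next-token cross-entropy loss used in pre-training. The tensors are toy stand-ins for a real model's logits and tokenized demonstrations.

```python
import torch
import torch.nn.functional as F

# Minimal behavioral-cloning sketch: imitate human demonstrations with the same
# next-token cross-entropy loss as pre-training. Random tensors stand in for the
# language model's logits and the tokenized human demonstrations.
vocab_size, seq_len, batch = 50_000, 16, 2
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)
demo_tokens = torch.randint(0, vocab_size, (batch, seq_len))

bc_loss = F.cross_entropy(logits.reshape(-1, vocab_size), demo_tokens.reshape(-1))
bc_loss.backward()  # in practice this gradient would update the model weights
```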
00:19:01.760 | There's also another important advantage,
00:19:03.760 | and I'm going to get to that, but I briefly want to talk about chat GPT.
00:19:07.760 | So one thing--I kind of think of chat GPT as, like,
00:19:10.760 | the upgrade to instructor GPT.
00:19:12.760 | It's kind of like the next step at making the models more aligned
00:19:15.760 | and more useful to humans.
00:19:17.760 | And some things that is, like, you know, I think chat does better
00:19:21.760 | is kind of, like, using dialogue as the universal interface, right?
00:19:24.760 | You can talk to it directly.
00:19:26.760 | You can ask follow-up questions.
00:19:28.760 | You can, like, ask it to, you know, refine the answer
00:19:32.760 | and all these things.
00:19:33.760 | That makes it a lot easier to deal with.
00:19:36.760 | It's better at refusing harmful tasks,
00:19:39.760 | but it's also--there's still important limitations, right?
00:19:43.760 | Like, the biggest one is, like, the model hallucinates a lot.
00:19:46.760 | It makes up facts when, you know, for whatever task you give it,
00:19:53.760 | and that, you know, just makes it quite unreliable.
00:19:56.760 | It's also still sensitive to prompting,
00:19:58.760 | which kind of shows that, you know, it still has important misalignment
00:20:03.760 | that we need to fix.
00:20:06.760 | Like, really, if the model was, like--
00:20:09.760 | the model should really, like, do the task to the best of its ability
00:20:15.760 | no matter how you prompt it to do that.
00:20:20.760 | But, yeah, one important principle that I think is really useful for--
00:20:26.760 | or that, like, our job leans on a lot
00:20:28.760 | is that evaluation is easier than generation.
00:20:31.760 | So if we ask humans to compare and rank different responses the model gave,
00:20:37.760 | it is easier to tell the difference between different variants
00:20:42.760 | of what the model did than it is to do the task itself.
00:20:46.760 | Or, in other words, you know, you can do the comparisons on tasks--
00:20:51.760 | you can still, like, spot good behavior on tasks
00:20:53.760 | that you might not be able to do by yourself.
00:20:56.760 | And so if you're giving this kind of, like, feedback
00:21:00.760 | that lets the system do better than you actually could.
00:21:07.760 | And I think that's a very general principle that holds in lots of domains.
00:21:11.760 | So, kind of like, you're probably most familiar--
00:21:15.760 | if you studied CS, you know about P versus NP--
00:21:18.760 | you know, we don't actually know whether they're different,
00:21:20.760 | but in practice it seems like NP-complete tasks are just much harder.
00:21:25.760 | It also applies to lots of other settings,
00:21:27.760 | like a lot of professional sports or esports just wouldn't be fun to watch
00:21:30.760 | if you couldn't tell who's winning more easily
00:21:34.760 | than you could actually compete on a professional level.
00:21:38.760 | It applies to a lot of consumer products.
00:21:40.760 | You can, like, look at your smartphones and tell which one you like more.
00:21:45.760 | That is, like, also deeper than just looking at, like, the specs.
00:21:50.760 | But it is actually very hard to build a good smartphone.
00:21:53.760 | It also applies to academic research.
00:21:56.760 | You know, it's much easier to review a paper
00:21:58.760 | and say all the things that are bad about it
00:22:02.760 | than it is to write a good paper yourself.
00:22:05.760 | It applies to, I don't know, when you--
00:22:10.760 | yeah, basically there's lots of domains where this applies.
00:22:13.760 | And so I think this is, like, a very--
00:22:16.760 | this principle is, like, very useful when we want to, like,
00:22:20.760 | align AI systems on tasks that we might not be able to do ourselves well.
00:22:26.760 | Okay, so having said that,
00:22:29.760 | RLHF has some really important limitations.
00:22:33.760 | And I think that's going to make it really difficult
00:22:36.760 | to use RLHF to scale alignment.
00:22:44.760 | Let me explain this with a diagram.
00:22:46.760 | So basically, on the x-axis, let's plot, like, the AI progress.
00:22:53.760 | And on the y-axis, how difficult different tasks are.
00:22:57.760 | And then as we have more AI progress, kind of like the tasks that AI--
00:23:01.760 | the difficulty of tasks that AI can do goes up.
00:23:05.760 | And, like, one of the fundamental problems is that
00:23:09.760 | the level of tasks that humans can reliably evaluate doesn't go up
00:23:14.760 | because humans don't get better with AI progress.
00:23:18.760 | And so I think we're, like, somewhere here.
00:23:22.760 | But the problem is, once you cross this line,
00:23:25.760 | you don't really know what--
00:23:28.760 | like, whether your model is actually doing the right thing
00:23:31.760 | because you can't reliably evaluate anymore.
00:23:34.760 | And so that's kind of, like, the point where
00:23:37.760 | RLHF training will start to break down.
00:23:40.760 | And what we'll probably see is kind of what the question
00:23:45.760 | before I lead it to is, like, well, now the systems are optimized
00:23:49.760 | for whatever feedback we give them.
00:23:52.760 | And so they will try to tell us what we want to hear,
00:23:54.760 | rather all the things that they know to be true.
00:23:57.760 | And, you know, they might learn how to deceive us
00:24:00.760 | because, you know, that makes it easier to score higher on preferences.
00:24:06.760 | And so kind of, like, the basic idea that we want to leverage
00:24:11.760 | is related to the principle I just mentioned,
00:24:16.760 | which is evaluation is easier than generation.
00:24:19.760 | So, for example, if you have a large language model
00:24:22.760 | writing a code base, like an entire code base,
00:24:25.760 | there's just no way humans would be able to find all the bugs
00:24:29.760 | and all the flaws in the code base.
00:24:31.760 | Or, you know, the code base could have, like, a Trojan in there
00:24:34.760 | and you might not be able to tell because it is so hard.
00:24:38.760 | And that's why we see so much buggy code out there.
00:24:41.760 | But if you ask your language model to find bugs and point them out to you,
00:24:46.760 | once you've seen the bug, it's so much easier for you to say,
00:24:50.760 | "Oh, yeah, this was a bug. Please fix it."
00:24:54.760 | And so now you've taken the task of writing a code base down to,
00:24:58.760 | "Well, I just have to evaluate whether that was a bug
00:25:02.760 | according to the spec that I had in mind."
00:25:05.760 | And so the general principle that we're excited about here is, like,
00:25:09.760 | we want to leverage AI assistance for human evaluation.
00:25:14.760 | And so the hope is that we, together, if we pair up humans with AI,
00:25:17.760 | you actually get a line that looks more like this,
00:25:20.760 | where, you know, like, humans together with AI can evaluate
00:25:23.760 | much more than they could on their own.
00:25:29.760 | And so to make this concrete,
00:25:32.760 | there's, like, two different ways you could do that,
00:25:34.760 | or there's many different ways you could do that.
00:25:36.760 | Two I want to highlight is, like, first, you can ask AI to write a critique.
00:25:41.760 | This is a project we did last year.
00:25:44.760 | And in this case, it was a simple summarization task,
00:25:47.760 | and we trained a language model to kind of, like,
00:25:49.760 | to say things that are wrong with the summary.
00:25:54.760 | And there's other things you could do.
00:25:57.760 | For example, you could give people chat GPT and ask them,
00:26:01.760 | "Okay, use chat GPT to help you evaluate."
00:26:04.760 | And then you could ask for a critique,
00:26:06.760 | or you could ask for a lot of other things.
00:26:08.760 | You could ask for an explanation.
00:26:10.760 | You can ask for fact-checking or a quote or, you know,
00:26:13.760 | whatever the model, like, chat GPT can actually reliably help you with.
00:26:18.760 | And so the idea would be that, you know, like, using AI assistance,
00:26:22.760 | you can kind of get all the smarts that AI has and leverage that
00:26:28.760 | in order to figure out how you should evaluate what this system is doing
00:26:31.760 | and, like, whether it's aligned with your preferences
00:26:34.760 | or whether it's trying to deceive you.
00:26:38.760 | And the big problem with this is how do we know whether it's working?
00:26:43.760 | And one of the kind of, like, difficulties is that by assumption,
00:26:51.760 | we're kind of dealing with a hard task where it's difficult to evaluate.
00:26:55.760 | And we also want the task to be real because we don't want to, you know,
00:26:59.760 | we don't want to solve a hard task that doesn't matter.
00:27:03.760 | And so it becomes difficult.
00:27:06.760 | So you need, like, a hard task that is real.
00:27:09.760 | But also, if you have those, you usually don't have ground truth,
00:27:13.760 | so you don't know which was the right answer
00:27:15.760 | and how do you know whether the assistance is working
00:27:17.760 | or it's biasing everyone to just say the same thing.
00:27:21.760 | And so there's a simple technique that we use in a critique to do this
00:27:29.760 | where, like, that we call targeted perturbations.
00:27:33.760 | And so what you do is you have a bunch of prompts.
00:27:35.760 | So this could be, like, whatever people type into ChatGPT.
00:27:39.760 | And then you kind of, like, take the response that you have
00:27:45.760 | and say, like, this is the correct response.
00:27:46.760 | It doesn't actually have to be correct, but let's just assume it is.
00:27:50.760 | And now you're asking a human to introduce some kind of subtle flaw
00:27:54.760 | that is easy to miss but is an important flaw.
00:27:58.760 | And now what you have is you have this pure data set of, like,
00:28:01.760 | a good response and a bad response.
00:28:03.760 | And you know which one is good and bad because you made it worse.
00:28:07.760 | And so, you know, in a way, that gives you ground truth.
00:28:13.760 | And so what you now can do is you can take, like,
00:28:18.760 | randomly select either the correct or the flawed response
00:28:21.760 | and then show it to either a human or a human with assistance.
00:28:24.760 | And then they have to say whether it was the correct or the flawed one
00:28:27.760 | or, like, how good the response is.
00:28:30.760 | And in this setting, we can kind of, like, figure, like,
00:28:36.760 | just try a lot of different AI assistance or scalable oversight techniques
00:28:41.760 | and then say, you know, like, is this actually helping humans find the flaws
00:28:47.760 | that we wanted more likely than not.
00:28:51.760 | And if it is, then, you know, like, you're actually really helping.
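Here is a minimal sketch of that targeted-perturbation protocol: randomly show raters either the original or the deliberately flawed response, with or without AI assistance, and compare how often they catch the introduced flaw. The data items and the rater below are hypothetical toys, not the setup from the critiques paper.

```python
import random

# Sketch of the targeted-perturbation evaluation. Each item pairs an original response
# with a version where a human introduced a subtle flaw, which gives ground truth.
def run_trial(item: dict, rater, assisted: bool) -> bool:
    show_flawed = random.random() < 0.5                      # randomly pick which version to show
    response = item["flawed"] if show_flawed else item["original"]
    judged_flawed = rater(item["prompt"], response, assisted=assisted)
    return judged_flawed == show_flawed                      # did the rater get it right?

def detection_rate(items, rater, assisted: bool) -> float:
    return sum(run_trial(it, rater, assisted) for it in items) / len(items)

# Toy rater: assisted raters are assumed to be somewhat more likely to spot the flaw.
def toy_rater(prompt: str, response: str, assisted: bool) -> bool:
    p_correct = 0.8 if assisted else 0.6
    truly_flawed = response.endswith("[flawed]")
    return truly_flawed if random.random() < p_correct else not truly_flawed

items = [{"prompt": f"q{i}", "original": f"a{i}", "flawed": f"a{i} [flawed]"} for i in range(200)]
print(detection_rate(items, toy_rater, assisted=True),
      detection_rate(items, toy_rater, assisted=False))
```

Comparing the assisted and unassisted detection rates is what tells you whether a given assistance technique actually helps.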
00:28:57.760 | And so we did this in the critiques paper.
00:29:01.760 | So this is training the language models to write critiques for summaries.
00:29:05.760 | And what we can show is that when we are assisting humans with critiques
00:29:09.760 | at the evaluation, they actually find 50% more flaws than they did without.
00:29:14.760 | And so this is kind of, like, real signs of life that you can already use
00:29:18.760 | in models that we can have today to help humans evaluate
00:29:22.760 | and, like, find problems they would have missed otherwise.
00:29:26.760 | And, of course, we still have to do this, like, on a much harder task
00:29:30.760 | and, like, with, like, a real task in a sense.
00:29:35.760 | And we also want to have, like, bigger effect size.
00:29:38.760 | But I think it's just, like, it shows that there's promise
00:29:41.760 | of these techniques already working.
00:29:44.760 | And so in the long run, what I think we want to get to is
00:29:50.760 | we kind of want to leverage AI for all the cognitive labor
00:29:54.760 | that goes into evaluating whatever our AI systems are doing.
00:29:58.760 | And this could be, you know, like, reading everything that's relevant
00:30:02.760 | or fact-checking or doing calculations or, like, writing code
00:30:07.760 | or any of these things.
00:30:10.760 | And then humans should focus on, like, their preference input,
00:30:13.760 | like the things figuring out what they actually care about
00:30:16.760 | and what they want the model to do.
00:30:19.760 | And this way we can kind of, like, leverage, you know, like,
00:30:27.760 | the abilities that, you know, the AI players will bring to the table
00:30:32.760 | and the things that they will be better at than us eventually.
00:30:36.760 | And then kind of, like, use them to help communicate the thing
00:30:41.760 | that we actually care about and, you know, the things that we
00:30:44.760 | actually want them to do.
00:30:48.760 | And, yeah, that's it.
00:30:52.760 | But, yeah, those are, like, the main slides.
00:30:54.760 | I'm happy to take more questions.
00:31:05.760 | I was wondering about this hallucination of responses.
00:31:09.760 | Have you ever tried to consider some notion of uncertainty
00:31:12.760 | in the answers?
00:31:15.760 | [INAUDIBLE]
00:31:24.760 | So ensembling is difficult because either you're, like,
00:31:28.760 | training and fine-tuning an ensemble from the same pre-trained model
00:31:32.760 | so you don't get that much variance in your ensemble,
00:31:34.760 | or you're pre-training a bunch of different models
00:31:37.760 | and now you're spending a lot of money on pre-trainings.
00:31:41.760 | One thing, I mean, it seems like it should be a solvable problem
00:31:46.760 | to just teach the model to say it's uncertain when it's actually uncertain.
00:31:53.760 | And there's been a bunch of research in that direction,
00:31:56.760 | but I think right now it's still, like, we're not really in a good shape.
00:32:02.760 | There's more stuff to do.
00:32:06.760 | Yeah.
00:32:07.760 | Do you think we may run into a kind of signals and noise ratio problem
00:32:12.760 | when it comes to AI-suggested critiques to AI answers?
00:32:17.760 | Because I'm sure, like, when AI is trying to point out
00:32:21.760 | particular problems in text, humans are more likely to report more problems.
00:32:26.760 | But what if it's noticing problems that humans wouldn't have necessarily
00:32:29.760 | had a problem with to begin with?
00:32:31.760 | Yeah.
00:32:32.760 | So we did try to control for that a little bit
00:32:35.760 | by, like, having humans rate the severity of their flaws
00:32:40.760 | and whether they would have noticed them otherwise.
00:32:43.760 | We could still see a significant effect.
00:32:47.760 | But also, like, I mean, a lot of the time the model is nitpicking,
00:32:51.760 | and then those are, like, not the interesting cases.
00:32:54.760 | Yeah.
00:32:55.760 | Also, if you, like, look at the example I showed,
00:32:58.760 | which I think is from the blog post,
00:33:00.760 | like, a lot of the critiques are just actually quite garbage.
00:33:03.760 | And one of the, like, things that makes it easy for critiques is it's okay
00:33:10.760 | if most of them are garbage because the human can just read them
00:33:12.760 | and discard them.
00:33:14.760 | And it kind of, like, more, you know, helps the evaluator know
00:33:19.760 | where to focus on or, like, notice, like, think of something
00:33:23.760 | they would have missed otherwise.
00:33:25.760 | So it's more like, you know, the critiques help you brainstorm
00:33:28.760 | how you should evaluate or something.
00:33:31.760 | But if you're kind of, like, using an assistant,
00:33:33.760 | you probably want more reliability than, like, most of the answers
00:33:37.760 | being garbage.
00:33:40.760 | Yeah.
00:33:45.760 | How do we ensure that the evaluation metrics we are using
00:33:48.760 | in your recursive reward modeling approach, like, detect deception
00:33:52.760 | and, like, left turns or something don't have, like,
00:33:54.760 | major discontinuity?
00:33:57.760 | Yeah, I think, well, it depends a lot what kind of discontinuity
00:34:03.760 | you're talking about.
00:34:04.760 | Like, if, you know, you get overnight, like, a model that is,
00:34:09.760 | let's say, 1,000 inches larger on, like, number of parameters
00:34:14.760 | or, like, equivalently better, that can, like, create quite a step up
00:34:20.760 | and that makes it quite difficult to do this kind of evaluation.
00:34:29.760 | So in that sense, I think it's going to be very important
00:34:32.760 | to, like, scale up AI more continuously and, like,
00:34:36.760 | do more incremental steps.
00:34:39.760 | Having said that, I think the basic picture
00:34:42.760 | with recursive reward modeling is that you're training the systems
00:34:47.760 | to help you evaluate, you know, systems that are trained
00:34:51.760 | or fine-tuned from the same system, right?
00:34:53.760 | And so if you can figure out how to get, like, fine-tune them
00:35:00.760 | in a way that they mention everything that is actually useful
00:35:03.760 | and relevant, then it would still be able to evaluate systems,
00:35:08.760 | even though they're much smarter than anything you've evaluated before.
00:35:13.760 | Maybe let me make this more concrete because there is, I think,
00:35:17.760 | a way that you could measure that or, like, one way we've tried it
00:35:21.760 | was using what we call the discriminator-critique gap.
00:35:24.760 | So to measure that, you're training two different,
00:35:27.760 | or you're fine-tuning two different models
00:35:29.760 | from the same pre-trained model.
00:35:31.760 | One is the discriminator, which says, like,
00:35:34.760 | is this the flawed response or the correct response
00:35:37.760 | or more correct response?
00:35:39.760 | And the other one is a critique model
00:35:41.760 | that tells you what the flaw was.
00:35:44.760 | And now you can measure how often
00:35:47.760 | did the discriminator get it correctly?
00:35:49.760 | And the discriminator is just, like, pure gradient descent.
00:35:52.760 | It doesn't have to want to tell you anything.
00:35:54.760 | You're just basically hooking into the model's representations
00:35:57.760 | and trying to, like, get all the relevant latent insight it has
00:36:03.760 | to this prediction task.
00:36:05.760 | And then on the other hand, you have this critique model
00:36:08.760 | or this general assistance model, and you're measuring
00:36:12.760 | how often does the assistant actually help me point to the right flaw?
00:36:17.760 | And then if you compare these two scores,
00:36:19.760 | you can, like, look at the gap between them.
00:36:22.760 | Hence, this discriminator critique gap.
00:36:25.760 | And then if the gap is large, then the model
00:36:28.760 | knows a lot about the task it's not telling you.
00:36:30.760 | And so that would be very worrying.
00:36:32.760 | And if the gap is small, then there's much--
00:36:38.760 | presumably, there's, like, fewer things
00:36:40.760 | that the model is, like, not telling you.
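A minimal sketch of computing the discriminator-critique gap described here; the two rates would come from evaluating a fine-tuned discriminator and a critique model on the same flawed-versus-correct dataset, and the numbers below are hypothetical.

```python
# Sketch of the discriminator-critique (DC) gap. A large gap suggests the model
# "knows" about flaws (its discriminator finds them) but its critiques don't surface
# them to the human, which would be worrying. Values here are hypothetical.
def dc_gap(discriminator_accuracy: float, critique_detection_rate: float) -> float:
    return discriminator_accuracy - critique_detection_rate

print(dc_gap(discriminator_accuracy=0.85, critique_detection_rate=0.70))  # 0.15
```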
00:36:42.760 | Yeah, but I guess there's, like, lots of value
00:36:45.760 | in ensuring this approach is, like, robust
00:36:47.760 | and testing it on [INAUDIBLE]
00:36:50.760 | That's right.
00:36:51.760 | That's why we want to test it on the current models.
00:36:55.760 | Yes, I don't know who was first.
00:36:57.760 | I think someone in the back.
00:36:59.760 | So I wanted to ask about, like, maybe towards the end,
00:37:08.760 | you had a slide where, like, there was, like,
00:37:10.760 | [INAUDIBLE]
00:37:12.760 | And so, you know, I couldn't help but notice, like,
00:37:15.760 | part of that also is, like, communicating
00:37:17.760 | what you want the AI to do, right?
00:37:19.760 | Right.
00:37:20.760 | Like, not just, like, evaluating, but, like,
00:37:22.760 | communicating, like, what happens, like,
00:37:24.760 | I would like you to do this.
00:37:25.760 | And maybe it can't do that.
00:37:27.760 | And so, like, at least, like, in my personal experience
00:37:30.760 | using the chat GPT, like, there were some things
00:37:32.760 | that could do that without surprising.
00:37:34.760 | Like, you could, like, prompt it to act like
00:37:36.760 | a terminal, for instance.
00:37:38.760 | And I was like, oh, like, how did that come up?
00:37:40.760 | Or, you know, like, you can ask about, like,
00:37:43.760 | if it's, like-- there's, like, different things, right?
00:37:45.760 | Or I'm like, OK, like, what can I ask for?
00:37:47.760 | And, like, what kind of--
00:37:48.760 | Yeah.
00:37:49.760 | One thing that I thought was a bit concerning
00:37:52.760 | was just this idea that, like, you know,
00:37:54.760 | people don't always communicate their preferences,
00:37:56.760 | like, honestly.
00:37:58.760 | Or, like, there could be, like, coordinated efforts,
00:38:02.760 | right, to, like, instill rewards for, like,
00:38:06.760 | specific capabilities, you know, like,
00:38:08.760 | a coordinated effort to do such a thing.
00:38:10.760 | One idea, like, I had with this was, like,
00:38:12.760 | I tried to ask if it has, like, some idea
00:38:14.760 | of, like, a Wikipedia for itself, right?
00:38:16.760 | Because, like, I don't--
00:38:17.760 | I didn't know how to use it at first,
00:38:19.760 | so I just thought, like, maybe [INAUDIBLE]
00:38:21.760 | well, there didn't seem to be one.
00:38:23.760 | But, like, I was hoping there was one.
00:38:25.760 | There was one for, like, GPT-3, right?
00:38:27.760 | Like, I think Brockman broke, like,
00:38:29.760 | a little unofficially.
00:38:30.760 | So I was hoping.
00:38:31.760 | And so my question is, like, how do you, like,
00:38:35.760 | make that sort of, like, thing safe, right?
00:38:38.760 | Like, have you, like, recognized coordinated efforts
00:38:41.760 | to, like, you know, like, specifically reward
00:38:44.760 | certain kinds of behavior?
00:38:47.760 | Maybe, like, some group opportunity
00:38:49.760 | decides that they would like to--
00:38:51.760 | Yeah.
00:38:52.760 | --you know, give it some capability.
00:38:55.760 | So this is a--
00:38:56.760 | You know, like--
00:38:57.760 | Yeah, this is a really good question.
00:38:58.760 | And, like, in a way--
00:39:00.760 | I mean, the first obvious thing that you shouldn't do
00:39:02.760 | is, like, you shouldn't just, like,
00:39:04.760 | literally turn in the data that people
00:39:07.760 | give through, like, using interface.
00:39:09.760 | And we've kind of, like, seen other examples
00:39:12.760 | of what happens if you do that.
00:39:14.760 | If you think of, like, Microsoft Tay or something,
00:39:16.760 | that can go pretty wrong.
00:39:18.760 | The other thing is--
00:39:20.760 | I mean, right now, what we're doing is, like,
00:39:22.760 | we're hiring a bunch of people and then ask them
00:39:24.760 | to rate different model responses.
00:39:26.760 | But also, now the question becomes, like,
00:39:29.760 | you know, who are we hiring?
00:39:31.760 | And, like, what's their background?
00:39:32.760 | What are they trying to do?
00:39:34.760 | And so--
00:39:35.760 | And in particular, like, the thing I think
00:39:38.760 | we're doing quite poorly right now is, like,
00:39:40.760 | actually, like, importing, like,
00:39:44.760 | a diverse and representative set of human preferences.
00:39:48.760 | And it's more just, like, you know,
00:39:49.760 | whoever we end up, we can hire.
00:39:53.760 | And so I kind of wish there was also just, like,
00:39:56.760 | more targeted research on, like, how we should do that
00:40:00.760 | and how that could be done well.
00:40:02.760 | And some of it is also, like, you know,
00:40:04.760 | better placed outside of, like, big tech companies.
00:40:07.760 | Because if you are-- like, tech companies
00:40:10.760 | always have an incentive to, like, you know,
00:40:15.760 | import human preferences in a way that maybe
00:40:18.760 | is not, like, the thing that we actually--
00:40:21.760 | humanity would do under a reflection or something.
00:40:24.760 | And so I think it's a really big, important question.
00:40:27.760 | There's a slight follow-up.
00:40:28.760 | Like, data contamination is, like,
00:40:30.760 | the dual problem for this.
00:40:31.760 | Like, do you think the internet might be
00:40:34.760 | contaminated with--
00:40:36.760 | Obvious.
00:40:37.760 | Yeah.
00:40:38.260 | I mean--
00:40:38.760 | Do you have, like-- is that something--
00:40:39.760 | People might-- can--
00:40:40.760 | anyone can poison the pre-training, right?
00:40:42.760 | Just put something on the internet.
00:40:45.760 | And it's, you know, something that we
00:40:47.760 | have to be very mindful of.
00:40:52.760 | I don't know.
00:40:53.760 | Have you thought much about, like,
00:40:55.760 | considering that we're currently training these models
00:40:59.760 | [INAUDIBLE]
00:41:01.760 | and hopefully getting closer to human preferences
00:41:03.760 | at this point.
00:41:04.760 | As human preferences change, we've seen, like,
00:41:07.760 | [INAUDIBLE]
00:41:08.760 | quite drastically.
00:41:10.760 | Is there something-- like, is it a paradigm?
00:41:12.760 | Is it a [INAUDIBLE]
00:41:14.760 | models keeping up with data better?
00:41:16.760 | Like, that's, like, a very complex problem.
00:41:19.760 | Yeah.
00:41:20.260 | I mean, the most obvious thing is it's, like--
00:41:22.760 | the model's knowledge base is kind of set by the pre-training
00:41:25.760 | cut-off date.
00:41:26.760 | Like, somebody-- you know, whatever data
00:41:28.760 | went into pre-training, it doesn't
00:41:30.760 | know about, like, a lot of things
00:41:32.760 | that happened after that.
00:41:35.760 | In terms of updating kind of, like, human preferences
00:41:38.760 | or the, you know, like, the comparisons that
00:41:40.760 | go into the reward model, you just
00:41:42.760 | collect more data and retrain.
00:41:43.760 | And the fine-tuning run is, like, comparatively cheap.
00:41:46.760 | So you can, you know, do that again.
00:41:48.760 | I think what gets harder is that, you know,
00:41:51.760 | like, as you've deployed the model
00:41:53.760 | and people started using it for all kinds of, you know,
00:41:56.760 | tasks that they want to build their company around,
00:41:59.760 | like, they--
00:42:00.760 | if you update and you change the model,
00:42:05.760 | then they also have to do a bunch of work
00:42:07.760 | into, like, adopting their prompts to whatever
00:42:09.760 | they're doing.
00:42:11.760 | And so it doesn't come at a zero cost.
00:42:17.760 | Sorry.
00:42:19.760 | So on the note of exceeding human level performance,
00:42:22.760 | one of the advantages of GPT-3 is that it
00:42:25.760 | has this immense corpus of the entire internet.
00:42:28.760 | If you want to specialize in a specific domain,
00:42:30.760 | like chemistry or material science or something,
00:42:33.760 | and potentially to generate new compounds,
00:42:36.760 | can GPT-3 be adapted, like, to use less data
00:42:41.760 | and still learn as efficiently?
00:42:44.760 | [INAUDIBLE]
00:42:46.760 | You mean, like, less data on, like, the chemical domain
00:42:49.760 | or something?
00:42:50.760 | Yeah.
00:42:50.760 | [INAUDIBLE] research paper over the last 30 years or something.
00:42:53.760 | Yeah.
00:42:54.760 | And you can throw that into pre-training, right?
00:42:56.760 | And then the model knows about it.
00:42:57.760 | But can the model really learn this effectively
00:42:59.760 | without so much data?
00:43:00.760 | Or can we somehow adapt the abstract concepts
00:43:02.760 | behind GPT-3 for that domain?
00:43:05.760 | Yeah.
00:43:06.260 | I mean, that's kind of the general idea with what you
00:43:09.760 | intend to do with fine tuning.
00:43:11.260 | And to some extent, we've seen it, like,
00:43:13.760 | generalized in this way.
00:43:14.760 | For example, InstructGPT was trained almost entirely
00:43:17.760 | on English language feedback and demonstrations.
00:43:22.760 | And it works in other languages.
00:43:24.760 | And so that's kind of wild.
00:43:25.760 | And so similarly, you could train the model
00:43:28.760 | with people who don't know anything about chemistry.
00:43:31.760 | And then it learns to follow instructions.
00:43:33.760 | And it will do so on the topic of chemistry.
00:43:37.760 | And this fine tuning can be very sample efficient.
00:43:39.760 | Like, with 100 data points, you can actually
00:43:41.760 | make a meaningful change in the model behavior.
00:43:44.760 | So it can be quite effective.
00:43:47.760 | I'm going to pick someone who hasn't asked.
00:43:51.760 | Regarding response generation, do you
00:43:55.760 | or how much effort do you put on or do you
00:43:59.760 | put emphasis in training on different expression styles?
00:44:03.760 | So what I've noticed from GPT-3 is that it always
00:44:06.760 | gives you, like, very structured or scientifically
00:44:09.760 | structured answers.
00:44:11.760 | Do you consider any training for whether it returns you,
00:44:16.760 | like, a scientifically structured answer
00:44:20.760 | or rather a more artistic answer?
00:44:23.760 | Yeah.
00:44:25.760 | I mean, the tricky thing is, ideally, the model
00:44:28.760 | should give you the kind of answer that you want to have.
00:44:32.760 | Right?
00:44:33.760 | And some people prefer a more scientific or technical
00:44:35.760 | answer.
00:44:36.760 | Some people might prefer a more generic answer.
00:44:39.760 | And I mean, right now, like, ChatGPT doesn't have,
00:44:43.760 | like, you know, a way for you to set, like,
00:44:46.760 | your specific preferences.
00:44:48.760 | And that's something that would be really exciting to have.
00:44:52.760 | But also, I think the kind of stylistic property
00:44:56.760 | that you've observed is, in fact, like, probably
00:44:59.760 | a product of our labeler pool.
00:45:02.760 | And so a lot of the ChatGPT workers were, like, more,
00:45:06.760 | you know, like, I think more, like, computer science-y
00:45:09.760 | and, like, more-- there was, like, more data generated
00:45:12.760 | by programmers compared to InstructGPT, which
00:45:16.760 | was more, like, generalist labelers.
00:45:20.760 | And yeah, there's, like, different-- it's, like,
00:45:24.760 | kind of-- it changes also the style.
00:45:27.760 | So there is no specific effort to distinguish that,
00:45:31.760 | but you can.
00:45:34.760 | Yeah.
00:45:35.760 | I mean, we should make a distinguished effort.
00:45:38.760 | It should give you, like, the style that you want, right?
00:45:45.760 | So one of the things that I've been thinking about,
00:45:48.760 | honestly, is how ChatGPT is going
00:45:52.760 | to play a factor in the education of the younger
00:45:57.760 | generation or the coming generation.
00:46:00.760 | And so if you go back to the graph of the AI progress
00:46:03.760 | and the human level-- yeah, what humans can evaluate,
00:46:08.760 | what I'm starting to think about is, like, over a break,
00:46:11.760 | I have used this.
00:46:12.760 | I showed, like, my 10-year-old cousin how to use ChatGPT
00:46:15.760 | just to mess around with.
00:46:17.760 | And that green line is a lot lower, right?
00:46:22.760 | And furthermore, if that just becomes
00:46:25.760 | part of their educational experience,
00:46:28.760 | it's going to be much--
00:46:30.760 | or I perceive it to be more difficult for them
00:46:32.760 | to discriminate even simpler tasks than what we do now.
00:46:37.760 | And so I'm already thinking about, like,
00:46:39.760 | how that might disrupt or make this alignment
00:46:44.760 | a little bit more difficult in the long run,
00:46:47.760 | as you have people who are more--
00:46:49.760 | who take, for instance, what ChatGPT says as a given truth
00:46:56.760 | anyway.
00:46:58.760 | I was just wondering what your thoughts are on that.
00:47:01.760 | I mean, there's a real risk of over-relying
00:47:03.760 | on a tech that is immature and that is not
00:47:05.760 | ready for you just believing--
00:47:09.760 | like, please don't believe everything the model says,
00:47:11.760 | right?
00:47:12.760 | Right.
00:47:13.760 | But also, I think one thing that I'm hopeful for
00:47:17.760 | is that, like, your cousin will end up, like, figuring out
00:47:21.760 | how to do this, where, like, they grew up
00:47:24.760 | with all of these AI tools that are getting better
00:47:29.760 | and learning how to actually leverage them productively,
00:47:33.760 | right?
00:47:34.760 | And, like, it's kind of like, you know,
00:47:38.760 | 20 years ago or something, when you were, like,
00:47:41.760 | using Google Search much earlier than everyone else,
00:47:44.760 | you're probably going to get better at, like,
00:47:46.760 | using that as a tool for everything you want to do.
00:47:48.760 | Any more questions?
00:47:54.760 | I think you had your hand up for a while.
00:47:56.760 | [INAUDIBLE]
00:47:57.760 | I think the slide where human tasks and the chat tasks
00:48:02.760 | and the model tasks, right?
00:48:04.760 | The AI tasks.
00:48:07.760 | Oh, wait, the last one?
00:48:08.760 | The last one.
00:48:11.760 | Yeah, so right now, it seems like you
00:48:13.760 | guys are using humans as biological sensors
00:48:18.760 | to the real world, to, like, physical ground truth,
00:48:21.760 | and using language as, like, a compressed interface
00:48:25.760 | to that ground truth.
00:48:26.760 | Are you guys also looking at using
00:48:28.760 | sensor technology directly with your models
00:48:31.760 | to get a more truthful answer of, you know--
00:48:36.760 | Yeah.
00:48:38.760 | I mean, it depends on what that sensor could be, right?
00:48:40.760 | Like, I guess, like, one of the most straightforward things
00:48:43.760 | is you could ask the model to browse,
00:48:45.760 | and then it can, like, fact check its own answers,
00:48:48.760 | and it can, you know, like, import external knowledge
00:48:52.760 | that it didn't remember.
00:48:54.760 | And yeah, I think that would be quite useful.
00:48:59.760 | I think that would also be quite useful for assisting
00:49:01.760 | human evaluation.
00:49:02.760 | [INAUDIBLE]
00:49:08.760 | And you can look at WebGPT, which, you know,
00:49:12.760 | is a published work on using the model for browsing.
00:49:17.760 | I think-- so one thing that makes it harder when you're
00:49:20.760 | using these, like, external sensors, or if you're
00:49:23.760 | letting the model interact more directly with the real world
00:49:26.760 | is that it raises more safety questions, right?
00:49:29.760 | If you let your language model make arbitrary API calls,
00:49:33.760 | then you have to be a lot more careful with which calls
00:49:38.760 | it's allowed to make, and which is it not.
00:49:41.760 | And if you're-- as opposed to if you're just, like,
00:49:44.760 | you're reviewing everything the model says,
00:49:46.760 | then you can decide which ones you want to make.
00:49:49.760 | So yeah.
00:49:53.760 | It's an open problem.
00:49:57.760 | One more question.
00:49:59.760 | I think you did.
00:50:01.760 | [INAUDIBLE]
00:50:02.760 | About the reasoning abilities of these model language models.
00:50:06.760 | I've seen, like, different people talk about how
00:50:08.760 | it's only, like, a fixed amount of compute per token,
00:50:10.760 | while, like, humans, they have system one and system two,
00:50:13.760 | where we can, like, just speak quickly versus, like,
00:50:15.760 | actually do some reasoning and think through things
00:50:17.760 | that take more effort.
00:50:19.760 | And then I've seen other works that try to, like,
00:50:21.760 | kind of use-- force it to do chain-of-thought prompting
00:50:23.760 | or a chain of, like, reasoning, or, like,
00:50:25.760 | just think step by step and stuff.
00:50:27.760 | Do you think that stuff is sufficient to do, like,
00:50:29.760 | everything at the system-two level that we want,
00:50:31.760 | or will it require real, big fine-tuning
00:50:35.760 | or architectural changes?
00:50:39.760 | I don't know.
00:50:40.760 | I'm also the wrong person to ask.
00:50:42.760 | I'm mostly not trying to get the models
00:50:46.760 | to have new capabilities, and more, like, you know,
00:50:50.760 | getting them to play on Team Human.
00:50:52.760 | Oh, do we want to do the online questions?
00:50:57.760 | Yeah.
00:51:02.760 | Sorry.
00:51:03.760 | [INAUDIBLE]
00:51:05.760 | So what do you think is the-- is there a role for [INAUDIBLE]
00:51:08.760 | playing from the feedback, especially if you have, like,
00:51:11.760 | human-- like, you don't have to play chatbots.
00:51:13.760 | It's hard to model human interactions.
00:51:17.760 | So do you think, like, this would be more promising?
00:51:17.760 | Yeah, quite possibly.
00:51:22.760 | I mean, yeah, as you point out, right,
00:51:30.760 | like, there is a lot of conversational data.
00:51:32.760 | And if you can use it, that would be--
00:51:35.760 | should be useful.
00:51:38.760 | I think, broadly, you can categorize this kind of thing
00:51:41.760 | as, like, let's make the RL algorithm better
00:51:44.760 | and, like, RL from human feedback.
00:51:46.760 | And I think that's valuable, and that should help us,
00:51:48.760 | like, make the same pre-trained models, like,
00:51:52.760 | more aligned according to the human preferences
00:51:54.760 | that we collected.
00:51:57.760 | But also, you would still run into, like,
00:52:00.760 | all the limitations that RLHF has, right?
00:52:03.760 | Also, if, like, someone wants to use RLHF on, like,
00:52:06.760 | the GPT-3 models, will OpenAI offer some sort of API for that?
00:52:09.760 | I think there's a fine-tuning API for GPT-3.
00:52:12.760 | I don't think it offers RL right now.
00:52:14.760 | Got it.
00:52:15.760 | But it's supervised fine-tuning.
00:52:17.760 | So you could do, like--
00:52:19.760 | you can, like, distill best-of-n and do this kind
00:52:21.760 | of expert iteration RL.
00:52:23.760 | Got it.
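To make "distill best-of-n with a supervised fine-tuning API" concrete, here is a rough sketch under assumed helpers; `sample`, `reward`, and the commented-out `supervised_finetune` are toy stand-ins, not OpenAI's API:

```python
# Sketch of distilling best-of-n with only a supervised fine-tuning API
# (one round of expert iteration). All helpers are toy stand-ins.

import random

def sample(prompt: str) -> str:
    """Stand-in for drawing one completion from the base model."""
    return random.choice(["completion A", "completion B", "completion C"])

def reward(prompt: str, completion: str) -> float:
    """Stand-in for a trained reward model scoring a completion."""
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

prompts = ["Explain RLHF in one sentence.", "Summarize the lecture."]
distillation_set = [(p, best_of_n(p)) for p in prompts]

# supervised_finetune(base_model, distillation_set)  # fine-tune on the winners
print(distillation_set)
```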
00:52:24.760 | So I'll try to move on to the questions.
00:52:26.760 | So the first question is, could you more clearly describe
00:52:29.760 | the pre-training process for ChatGPT?
00:52:31.760 | For example, starting with the text wc001,
00:52:35.760 | then x GB of programming data, y steps of RLHF.
00:52:39.760 | Sorry.
00:52:40.760 | I didn't catch that.
00:52:41.760 | Start with text wc001.
00:52:43.760 | And then, so how much of each data do you use?
00:52:46.760 | And how many steps of RLHF kind of things?
00:52:49.760 | I think the exact numbers are not public.
00:52:56.760 | It's basically similar to InstructGPT.
00:52:59.760 | And for the InstructGPT numbers,
00:53:03.760 | we had, I think, around 50,000 comparisons,
00:53:08.760 | and probably, like, 10,000 demonstrations,
00:53:11.760 | or, like, maybe tens of thousands.
00:53:13.760 | I don't remember the exact number.
00:53:16.760 | And so I had, like, this other slide with--
00:53:20.760 | yeah, it was, like, about 20,000 hours of human feedback,
00:53:23.760 | is what I calculated.
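For context on how comparisons like these are typically used: the standard recipe fits a reward model with a pairwise loss, roughly -log sigmoid(r_chosen - r_rejected). Below is a toy PyTorch sketch with random features standing in for (prompt, response) representations; this is not the actual InstructGPT training code:

```python
# Toy sketch of reward-model training from pairwise comparisons: minimize
# -log sigmoid(r_chosen - r_rejected). Features and shapes here are made up;
# a real setup would score (prompt, response) pairs with a language model.

import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 32
reward_model = nn.Linear(dim, 1)  # stand-in for a transformer plus scalar head
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Toy batch: one feature vector for the preferred and one for the rejected
# response in each human comparison.
chosen = torch.randn(64, dim)
rejected = torch.randn(64, dim)

for _ in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final pairwise loss: {loss.item():.3f}")
```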
00:53:24.760 | And do you think it's [INAUDIBLE]
00:53:26.760 | human feedback, because that's-- you can get, like,
00:53:28.760 | 1 million or whatever [INAUDIBLE]
00:53:31.760 | Right.
00:53:32.260 | I mean, the big question is, like, how do you make--
00:53:34.760 | how do you ensure quality?
00:53:35.760 | [INAUDIBLE]
00:53:39.760 | But that's the whole problem, right?
00:53:45.260 | Like, that assumes you already have
00:53:46.760 | a model that you trust.
00:53:52.760 | Sure.
00:53:53.260 | So the next question, I think, that I was told was,
00:53:55.760 | you want to automate alignment research.
00:53:57.760 | What happens if you need conceptual breakthroughs,
00:53:59.760 | which are difficult for experts to verify?
00:54:02.760 | Yeah, so, I mean, the kind of, like, ambition of that plan
00:54:08.760 | is to train a model that can do this kind
00:54:11.760 | of conceptual research.
00:54:13.760 | And, you know, you can picture, like,
00:54:16.760 | a language model that, like, you know,
00:54:18.760 | writes an alignment research paper that we read,
00:54:22.760 | and then we're like, oh, this is a really cool idea.
00:54:24.760 | We should try this.
00:54:26.760 | And I think, you know, going back to evaluation
00:54:30.760 | being easier than generation,
00:54:31.760 | I think it also applies to alignment research.
00:54:34.760 | And, like, I think, at the very least,
00:54:36.760 | like, I find it much easier to evaluate, you know,
00:54:39.760 | alignment research than I find it to, like, produce it.
00:54:43.760 | And so while there might be conceptual breakthroughs
00:54:48.760 | that we need that we couldn't even evaluate right now
00:54:52.760 | because they're just, like, you know,
00:54:54.760 | if we saw them, we'd be like, what is this?
00:54:57.760 | And this is kind of, like, this is, like,
00:55:00.760 | the reason why we want to do scalable oversight, right?
00:55:03.760 | Because, you know, like, if, you know,
00:55:06.760 | the language model produces this really brilliant insight
00:55:09.760 | and we can't even recognize it at the time,
00:55:13.760 | we should be able to have an easier time recognizing it
00:55:16.760 | if we use AI assistance.
00:55:18.760 | And if we leverage, like, our best AI models
00:55:22.760 | to, like, figure out whether or not that was a good idea,
00:55:25.760 | what is the weaknesses and what are the strengths?
00:55:27.760 | And, like, you know, what kind of experiments should we run
00:55:30.760 | to know whether this is a good idea?
00:55:32.760 | And so, yeah, I think basically, you know,
00:55:37.760 | the story of just using RLHF to train a model
00:55:40.760 | to do good alignment research,
00:55:42.760 | you have the obvious pitfalls, which is, you know,
00:55:45.760 | the model might write, like, an alignment proposal
00:55:47.760 | that kind of looks good to us, but is actually, you know,
00:55:50.760 | not a good proposal, and it creates AI
00:55:52.760 | that is misaligned with humans.
00:55:55.760 | And so in order to distinguish the two,
00:55:58.760 | which might be really hard, maybe it's not, but, you know,
00:56:02.760 | I think we should expect it to be really hard,
00:56:04.760 | and then leveraging AI assistance to evaluate that
00:56:08.760 | seems like a really promising plan.
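One way to picture the AI-assisted evaluation described here is the sketch below, assuming a hypothetical `generate` completion call; the exact assistance scheme is an open research question, so this is only illustrative:

```python
# Hedged sketch of critique-assisted evaluation for scalable oversight.
# `generate` is a hypothetical completion call, not a real client library.

def generate(prompt: str) -> str:
    return "<assistant critique goes here>"

def critiques(proposal: str, n: int = 3) -> list[str]:
    """Ask an assistant model for candidate flaws a human reviewer should check."""
    prompt = (f"Proposal:\n{proposal}\n\n"
              f"List {n} specific weaknesses a careful reviewer should verify:")
    return generate(prompt).splitlines()

def present_for_human_judgment(proposal_a: str, proposal_b: str) -> None:
    """Show both proposals with assistant critiques; the human still decides."""
    for name, proposal in (("A", proposal_a), ("B", proposal_b)):
        print(f"--- Proposal {name} ---\n{proposal}")
        for flaw in critiques(proposal):
            print(f"  possible flaw to check: {flaw}")

present_for_human_judgment(
    "Train a model with RLHF to write alignment proposals.",
    "Use AI assistants to recursively evaluate alignment proposals.",
)
```

The human comparison at the end remains the trusted signal; the assistant only narrows down what the human needs to check.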
00:56:12.760 | [INAUDIBLE]
00:56:18.760 | I mean, that was my whole point.
00:56:20.760 | It's not suppression.
00:56:21.760 | [LAUGHTER]
00:56:24.760 | [INAUDIBLE]
00:56:38.760 | I mean, the general answer is, like,
00:56:41.760 | I think basically the vast majority
00:56:45.760 | of the model's capabilities and, like,
00:56:47.760 | all the cool things you see it do come from pre-training
00:56:50.760 | and not from the fine-tuning stage.
00:56:52.760 | The reason why people sometimes attribute it
00:56:55.760 | to the fine-tuning stage is that you didn't see it
00:56:59.760 | in the pre-trained model.
00:57:01.760 | And the reason, I think, the reason
00:57:03.760 | that we didn't see it in the pre-trained model
00:57:05.760 | is because the pre-trained model was so misaligned,
00:57:07.760 | it was not trying to help you, and it
00:57:09.760 | was not trying to show you all the things it can do.
00:57:12.760 | And instead, it just regurgitates
00:57:14.760 | a bunch of random web text.
00:57:16.760 | And that's not what you're looking for.
00:57:19.760 | And so, yeah.
00:57:21.760 | I think that what our project basically has been doing
00:57:24.760 | is, like, unlocking capabilities that were already in the model
00:57:28.760 | and making those available for humans to use.
00:57:32.760 | And in some ways, like, you know,
00:57:35.760 | alignment research is very dual-use in the sense that,
00:57:38.760 | you know, A, if you have really good alignment techniques,
00:57:42.760 | you can use it to align with whatever values you want,
00:57:45.760 | including values that, you know,
00:57:47.760 | we wouldn't particularly endorse.
00:57:49.760 | And B, it also, like, if you're doing alignment right,
00:57:54.760 | it will always look a little bit like you made
00:57:57.760 | the AI system more capable because before,
00:58:01.760 | it just wasn't really trying that hard to help you.
00:58:04.760 | And now, you've made it more aligned.
00:58:06.760 | So, you know, you actually see these capabilities
00:58:08.760 | that you already have.
00:58:11.760 | Sure.
00:58:12.760 | Let's see.
00:58:14.760 | [INAUDIBLE]
00:58:27.760 | Yeah, so that was what I was talking about here, right?
00:58:29.760 | Like, this is, like, the whole problem that we have,
00:58:34.760 | where what humans can evaluate is constant.
00:58:38.760 | And so we won't be able to evaluate, like,
00:58:40.760 | sophisticated attempts at deceiving us.
00:58:43.760 | And that's why we want to do scalable supervision
00:58:45.760 | so that we empower humans to spot
00:58:47.760 | these attempts at deception.
00:58:53.760 | [INAUDIBLE]
00:59:02.760 | Yeah, so I think these are real worries.
00:59:07.760 | And to some extent, we kind of, like,
00:59:10.760 | have to test empirically, like, how difficult
00:59:15.760 | and how severe they actually are.
00:59:18.760 | I think-- so my personal stance right now
00:59:22.760 | is something like, I think, trying
00:59:24.760 | to get the outer alignment signal really right
00:59:27.760 | is going to be, like, 90% of the effort.
00:59:30.760 | And once we have that, then a lot of the other things
00:59:33.760 | might also fall into place.
00:59:35.760 | So for example, I mean, it kind of
00:59:37.760 | depends on which story of inner misalignment
00:59:39.760 | you're worried about.
00:59:40.760 | But, you know, one story is you're kind of training
00:59:43.760 | your system, and it learns how to do--
00:59:47.760 | it learns, basically, a bunch of inner optimizers,
00:59:50.760 | kind of like meta reinforcement learning.
00:59:53.760 | So for example, like, GPT-3 can do, like, in-context learning.
00:59:57.760 | And that's, like, a kind of, you know, learned optimizer.
01:00:01.760 | And so now you're, like, doing all that RLHF training
01:00:05.760 | or whatever, like, alignment training you have.
01:00:08.760 | And you're, like, the learned optimizers
01:00:11.760 | learn to do the thing that you want on distribution.
01:00:15.760 | But now if you have a distributional shift--
01:00:17.760 | and this distributional shift could be auto-induced,
01:00:19.760 | meaning, like, the model is causing it itself.
01:00:22.760 | And now you're going out of distribution.
01:00:24.760 | All these inner optimizers, like, try
01:00:26.760 | to optimize for something else.
01:00:28.760 | And one way you can, like--
01:00:33.760 | and, you know, like, how much that would actually happen
01:00:36.760 | in practice is kind of unclear.
01:00:37.760 | But one kind of, like, more important question
01:00:40.760 | is, like, if you have a really reliable outer alignment
01:00:44.760 | signal and you have this, like, general training
01:00:46.760 | signal that you trust, you can also
01:00:50.760 | use that, you know, on the new distribution
01:00:52.760 | to train the system to be more--
01:00:56.760 | or, like, to get its inner optimizers in a row, basically.
01:01:01.760 | And so then you've reduced, like, the inner alignment
01:01:04.760 | problems to, like, how do you deal
01:01:06.760 | with a distributional shift?
01:01:07.760 | And how do you, like, construct an outer alignment
01:01:10.760 | signal that you trust?
01:01:11.760 | And those are problems that we have to deal with anyways.
01:01:14.760 | But yeah, I don't know how it's actually going to shake out.
01:01:19.760 | But there's some important open questions.
01:01:27.760 | So regarding alignment, one of the kinds of problems
01:01:33.760 | that I've been encountering in some discussions
01:01:35.760 | is there's not much interest, it seems,
01:01:38.760 | in, like, explaining why we come to these judgments
01:01:41.760 | or the lack thereof.
01:01:42.760 | There's not even been much interest in the way
01:01:45.760 | I would decompose these models.
01:01:47.760 | It's, like, explaining why this--
01:01:49.760 | and even set constraints arbitrarily.
01:01:51.760 | I mean, that's definitely a fruitful route,
01:01:53.760 | to be able to show this.
01:01:54.760 | But as to why it's making these out-of-line judgments,
01:02:00.760 | have you all been able to interrogate the model?
01:02:04.760 | I mean, I think where we are right now
01:02:07.760 | is, like, pretty dissatisfactory.
01:02:12.760 | I mean, you can ask the model why
01:02:15.760 | it gave a certain response.
01:02:16.760 | But you don't know whether it's answering truthfully.
01:02:19.760 | And you can also--
01:02:20.760 | I mean, another thing you can do is
01:02:22.760 | you can give the model its own response
01:02:24.760 | and ask it to find out flaws, which is
01:02:26.760 | what we did in the critiques paper.
01:02:29.760 | But, you know, I think that, like--
01:02:32.760 | I mean, there's one version where
01:02:34.760 | you try to make that better.
01:02:35.760 | But then the question is, like, what
01:02:37.760 | is your ground truth signal?
01:02:39.760 | I think a, like, better angle of attack
01:02:42.760 | is probably interpretability, where, you know,
01:02:46.760 | you figure out how to look inside the model
01:02:48.760 | and then how it actually works.
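The two probes mentioned here, asking the model why it answered and feeding its answer back for critique, amount to something like the following sketch; `generate` is again a hypothetical completion call, and neither probe yields a verified explanation:

```python
# Hedged sketch of post-hoc probes on a model's own answer. `generate` is a
# placeholder completion call; nothing here checks that the outputs are truthful.

def generate(prompt: str) -> str:
    return "<model output goes here>"

question = "Is this code snippet memory-safe?"
answer = generate(question)

# Probe 1: ask the model to explain its own answer (may not be faithful).
explanation = generate(f"Q: {question}\nA: {answer}\nWhy did you answer that way?")

# Probe 2: feed the answer back and ask for flaws, as in self-critique setups.
flaws = generate(f"Q: {question}\nA: {answer}\nList possible flaws in this answer:")

print(explanation)
print(flaws)
```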
01:02:50.760 | That's what I was asking about, like,
01:02:52.760 | the level of research on interpretability.
01:02:54.760 | Yeah.
01:02:55.760 | It's like, it seems as though that's
01:02:57.760 | been really difficult to do for you
01:02:58.760 | because you are working with large models
01:02:59.760 | in particular, and that's such a high-dimensional space.
01:03:01.760 | Yeah.
01:03:02.760 | Your current thinking is moving towards
01:03:05.760 | [INAUDIBLE]
01:03:06.760 | reducing the dimensionality of that representation
01:03:09.760 | [INAUDIBLE]
01:03:12.760 | Yeah, I mean, we are working on that problem.
01:03:16.760 | But I don't think we have anything
01:03:18.760 | to show right now.
01:03:20.760 | And so it seems generally not to be a very easy problem.
01:03:28.760 | But, you know, I'm hopeful that we can do some things.
01:03:32.760 | I think, in general, the problem of interpretability,
01:03:37.760 | or, like, using interpretability for alignment,
01:03:39.760 | is kind of tricky because I suspect
01:03:43.760 | it's going to be neither--
01:03:45.760 | it's not going to be sufficient,
01:03:47.760 | and it might not be necessary.
01:03:49.760 | So any amount of interpretability
01:03:51.760 | you can leverage would be useful because it's
01:03:54.760 | another tool in your toolbox of, like, detecting deception
01:03:57.760 | or, like, knowing, like you said,
01:04:00.760 | how-- why the model gave a certain answer
01:04:02.760 | and made a certain decision.
01:04:05.760 | But, you know, it is kind of unclear
01:04:10.760 | if you really get really good at interpretability,
01:04:12.760 | how you then leverage that for alignment.
01:04:15.760 | Like, presumably, you can look in the model
01:04:17.760 | and just, like, throw all the models out
01:04:19.760 | that you can find a misalignment in.
01:04:21.760 | But then aren't you just selecting
01:04:23.760 | for models that have misalignments that
01:04:25.760 | are really hard to find with the interpretability tools?
01:04:28.760 | - Sure.
01:04:29.760 | Just a follow-up to that.
01:04:30.760 | It's-- really, my perspective on that is kind
01:04:32.760 | of the standard practice that we have in general
01:04:34.760 | is that you have to find explanations of the problem.
01:04:36.760 | - Yeah.
01:04:37.760 | - And I guess then my question would be, like,
01:04:39.760 | why would you take interpretability
01:04:41.760 | to not be necessary?
01:04:43.760 | - Yes.
01:04:44.760 | - Why would it not be necessary?
01:04:46.760 | So, again, this is kind of, like, an open question.
01:04:50.760 | But basically, what stance you could take
01:04:53.760 | is that at the end of the day, what really is going to matter
01:04:57.760 | is the decisions that the model actually takes
01:05:00.760 | and not the reasons why it took them.
01:05:03.760 | And so if you can get to the point
01:05:05.760 | where you're confident that all the things the model actually
01:05:08.760 | does are aligned with what you want,
01:05:12.760 | then does it still matter what the model thinks internally?
01:05:16.760 | I don't know.
01:05:17.760 | - But you have to find out, too.
01:05:19.760 | You have to find out, like, if that evaluation would be more
01:05:21.760 | valid than the model.
01:05:22.760 | - Yeah.
01:05:23.760 | That's what we're trying to do, right?
01:05:24.760 | Like, we're trying to make, like, a really, really
01:05:26.760 | good evaluation signal.
01:05:28.760 | And then you can select for--
01:05:30.760 | you know, you can train the model to do the things
01:05:32.760 | that you want it to do because you can always
01:05:35.760 | evaluate better than the model can do stuff.
01:05:41.760 | Yeah.
01:05:42.760 | - I'll probably have my shot at that.
01:05:44.760 | That's my question.
01:05:45.760 | - Very good.
01:05:49.760 | Thanks so much, Jan, for the great lecture.
01:05:51.760 | Very interesting.
01:05:52.760 | - I might actually do, like--
01:05:53.760 | as we're talking about the topic of connectivity,
01:05:55.760 | I might just do, like, a live--
01:05:58.760 | I want just, like, an application thing.
01:06:01.760 | - Yeah, sure.
01:06:02.760 | Sure.
01:06:03.760 | - Just to end up the class.
01:06:06.760 | Sorry.
01:06:07.760 | So how do you find this thing?
01:06:10.760 | - Oh, you need--
01:06:11.760 | - All right, guys.
01:06:12.760 | Can we give our speaker a round of applause?
01:06:14.760 | - Thank you.
01:06:15.760 | - Thank you.