Stanford CS224N | 2023 | Lecture 10 - Prompting, Reinforcement Learning from Human Feedback
00:00:00.000 |
Okay, awesome. We're going to get started. So my name is Jesse Mu. I'm a PhD student 00:00:13.400 |
in the CS department here working with the NLP group and really excited to be talking 00:00:18.760 |
about the topic of today's lecture, which is on prompting, instruction fine-tuning, 00:00:23.480 |
and RLHF. So this is all stuff that has been super hot recently because of all the latest 00:00:30.480 |
excitement about chatbots, ChatGPT, etc. And we're going to hopefully get somewhat of an 00:00:36.480 |
understanding as to how these systems are trained. 00:00:40.360 |
Okay, so before that, some course logistics things. So project proposals, both custom 00:00:45.760 |
and final, were due a few minutes ago. So if you haven't done that, this is a nice reminder. 00:00:52.560 |
We're in the process of assigning mentors of projects, so we'll give feedback soon. 00:00:57.840 |
Besides that, assignment five is due on Friday at midnight. We still recommend using Colab 00:01:03.320 |
for the assignments, even if you've had AWS or Azure credits granted. If that doesn't 00:01:07.960 |
work, there's instructions for how to connect to a Kaggle notebook where you will also be 00:01:11.200 |
able to use GPUs. Look for that post on ed. And then finally, also just posted on ed by 00:01:16.760 |
John is a course feedback survey. So this is part of your participation grade. Please fill it out. 00:01:28.440 |
Okay, so let's get into this lecture, which is going to be about what we are trying to 00:01:34.920 |
do with these larger and larger models. Over the years, the compute for these models has 00:01:40.680 |
just gone up by many orders of magnitude, and they're trained on more and more data. So larger and larger 00:01:49.680 |
models, they're seeing more and more data. And in the pre-training lecture, if you recall this slide, 00:01:56.520 |
we talked a little bit about what happens when you do pre-training. And as you begin 00:02:01.680 |
to really learn to predict the missing words in text, right? You learn things 00:02:06.240 |
like syntax, co-reference, sentiment, et cetera. But in this lecture, we're going to take it 00:02:11.840 |
a little bit further and really take this idea to its logical conclusion. So if you 00:02:15.680 |
really follow this idea of we're just going to train a giant language model on all of 00:02:20.400 |
the world's text, you really begin to see language models sort of in a way as rudimentary 00:02:25.680 |
world models. So maybe they're not very good world models, but they kind of have to 00:02:29.480 |
be doing some implicit world modeling just because we have so much information on the 00:02:33.600 |
internet and so much of human collective knowledge is transcribed and written for us on the internet, 00:02:38.680 |
right? So if you are really good at predicting the next word in text, what do you learn to 00:02:42.720 |
do? There's evidence that these large language models are to some degree learning to represent 00:02:47.960 |
and think about agents and humans and the beliefs and actions that they might take. 00:02:52.600 |
So here's an example from our recent paper where we are talking about someone named Pat 00:02:57.800 |
watching a demonstration of a bowling ball and a leaf being dropped at the same time 00:03:01.520 |
in a vacuum chamber. And the idea is here we're saying Pat is a physicist, right? So 00:03:07.320 |
Pat is a physicist and we ask for the language model's next continuation of this sentence. 00:03:13.520 |
Because he's a physicist, we do some inference about what kind of knowledge Pat has and Pat 00:03:17.560 |
will predict that the bowling ball and the leaf will fall at the same time. But if we 00:03:21.880 |
change the sentence of the prompt and we say, well, Pat has actually never seen this demonstration 00:03:25.400 |
before, then Pat will predict that the bowling ball will fall to the ground first, which 00:03:29.600 |
is wrong, right? So if you get really good at predicting the next sentence in text, you 00:03:33.880 |
also to some degree have to learn to predict an agent's beliefs, their backgrounds, common 00:03:39.080 |
knowledge and what they might do next. So not just that, of course, if we continue browsing 00:03:44.920 |
the internet, we see a lot of encyclopedic knowledge. So maybe language models are actually 00:03:48.960 |
good at solving math reasoning problems if they've seen enough demonstrations of math 00:03:53.080 |
on the internet. Code, of course, code generation is a really exciting topic 00:03:58.560 |
that people are looking into, and we'll give a presentation on that in a few weeks. Even 00:04:04.120 |
medicine, right? We're beginning to think about language models trained on medical texts 00:04:07.680 |
and being applied to the sciences and whatnot. So this is what happens when we really take 00:04:11.920 |
this language modeling idea seriously. And this has resulted in a resurgence of interest 00:04:17.760 |
in building language models that are basically assistants, right? You can give them any task 00:04:22.960 |
under the sun. I want to create a three course meal and a language model should be able to 00:04:27.960 |
take a good stab at being able to do this. This is kind of the promise of language modeling. 00:04:34.320 |
But of course, there's a lot of steps required to get from this, from our basic language 00:04:38.720 |
modeling objective. And that's what this lecture is going to be about. So how do we get from 00:04:44.040 |
just predicting the next word in a sentence to something like chat GPT, which you can 00:04:48.600 |
really ask it to do anything and it might fail sometimes, but it's getting really, really 00:04:52.400 |
convincingly good at some things. Okay. So this is the lecture plan. Basically, I'm going 00:04:58.760 |
to talk about as we're working with these large language models, we come up with kind 00:05:02.360 |
of increasingly complex ways of steering the language models closer and closer to something 00:05:06.600 |
like ChatGPT. So we'll start with zero-shot and few-shot learning, then instruction fine 00:05:11.120 |
tuning, and then reinforcement learning from human feedback, or RLHF. Okay. So let's first 00:05:21.000 |
talk about few shot and zero shot learning. And in order to do so, we're again going to 00:05:26.480 |
kind of build off of the pre-training lecture last Tuesday. So in the pre-training lecture, 00:05:30.800 |
John talked about these models like GPT, generative pre-trained transformer, that are these decoder 00:05:38.200 |
only language models. So they're just trained to predict the next word in a corpus of text. 00:05:43.600 |
And back in 2018 was the first iteration of this model. And it was 117 million parameters. 00:05:50.000 |
So at the time it was pretty big. Nowadays, it's definitely much smaller. And again, it's 00:05:54.400 |
just a vanilla transformer decoder using the techniques that you've seen. And it's trained 00:05:58.280 |
on a corpus of books. So about 4.6 gigabytes of text. And what GPT showed was the promise 00:06:04.840 |
at doing this simple language modeling objective and serving as an effective pre-training technique 00:06:10.200 |
for various downstream tasks that you might care about. So if you wanted to apply it to 00:06:13.920 |
something like natural language inference, you would take your premise sentence and your 00:06:17.720 |
hypothesis sentence, concatenate them, and then maybe train a linear classifier on top of the model's final representation. 00:06:24.960 |
OK, but that was three, four, five years ago. What has changed since then? So they came 00:06:34.520 |
out with GPT-2. So GPT-2 was released the next year in 2019. This is 1.5 billion parameters. 00:06:41.720 |
So it's the same architecture as GPT, but just an order of magnitude bigger. And also 00:06:46.620 |
trained on much more data. So we went from 4 gigabytes of books to 40 gigabytes of internet 00:06:53.040 |
text data. So they produced a data set called WebText. This is produced by scraping a bunch 00:06:57.560 |
of links to comments on Reddit. So the idea is that the web contains a lot of spam, maybe 00:07:01.920 |
a lot of low-quality information. But they took links that were posted on Reddit that 00:07:05.680 |
had at least a few upvotes. So humans maybe looked through it and said, you know, this 00:07:09.240 |
is a useful post. So that was kind of a rough proxy of human quality. And that's how they 00:07:14.120 |
collected this large data set. And so if you look at the size of GPT in 2018, we can 00:07:20.920 |
draw a bigger dot, which is the size of GPT-2 in 2019. And one might ask, how much better does this model get? 00:07:30.500 |
So the authors of GPT-2 titled their paper, "Language Models are Unsupervised Multitask 00:07:35.600 |
Learners." And that kind of gives you a hint as to what the key takeaway they found was, 00:07:40.240 |
which is this unsupervised multitasking part. 00:07:44.640 |
So basically, I think the key takeaway from GPT-2 was this idea that language models can 00:07:49.920 |
display zero-shot learning. So what I mean by zero-shot learning is you can do many tasks 00:07:55.720 |
that the model may not have actually explicitly been trained for with no gradient updates. 00:08:00.360 |
So you just kind of query the model by simply specifying the right sequence prediction problem. 00:08:06.340 |
So if you care about question answering, for example, you might include your passage, like 00:08:10.000 |
a Wikipedia article about Tom Brady. And then you'll add a question, so a question, where 00:08:14.000 |
was Tom Brady born? And then include an answer, like A followed by a colon. And then just 00:08:18.720 |
ask the model to predict the next token. You've kind of jury-rigged the model into doing question answering. 00:08:26.720 |
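For concreteness, here is a minimal sketch of the zero-shot QA prompt format being described; the passage text is an illustrative placeholder, not from the lecture slides.

```python
# A sketch of the zero-shot QA prompt format: passage, then Q:, then A:.
passage = "Tom Brady was born on August 3, 1977, in San Mateo, California. ..."
prompt = passage + "\nQ: Where was Tom Brady born?\nA:"
# Feeding `prompt` to the language model and letting it continue should produce
# tokens that read like an answer, e.g. " San Mateo, California".
```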
For other tasks, like classification tasks, another thing you can do is compare different 00:08:30.440 |
probabilities of sequences. So this task is called the Winograd Schema Challenge. It's 00:08:35.080 |
a pronoun resolution task. So the task is to kind of resolve a pronoun which requires 00:08:39.620 |
some world knowledge. So one example is something like, the cat couldn't fit into the hat because 00:08:44.720 |
it was too big. And the question is whether it refers to the cat or to the hat. And in 00:08:50.480 |
this case, it makes most sense for it to refer to the cat, because when something can't fit 00:08:55.400 |
into something else because it's too big, it's the thing being fitted that is too big; you need to use some world knowledge to kind of resolve that. 00:09:00.060 |
So the way that you get zero-shot predictions for this task out of a language model like 00:09:03.560 |
GPT-2 is you just ask the language model, which sequence is more likely? Is the probability 00:09:10.080 |
of the cat couldn't fit into the hat because the cat was too big deemed more likely by 00:09:15.080 |
the language model than the probability that the cat couldn't fit into the hat because 00:09:19.760 |
the hat was too big? You can score those sequences because this is a language model. And from 00:09:24.120 |
there, you get your zero-shot prediction. And you can end up doing fairly well on this task. 00:09:31.960 |
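As a rough sketch of how this scoring can be done in practice, assuming the Hugging Face transformers library and GPT-2 as a stand-in model:

```python
# Compare the total log-probability of the two resolved sentences and
# pick whichever one the language model deems more likely.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean next-token
        # cross-entropy; multiply by the number of predictions to get
        # the (negative) total log-probability.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

a = "The cat couldn't fit into the hat because the cat was too big."
b = "The cat couldn't fit into the hat because the hat was too big."
print("cat" if sequence_log_prob(a) > sequence_log_prob(b) else "hat")
```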
OK. Yeah, so digging a little bit more into the results, GPT-2 at the time beat the state 00:09:39.480 |
of the art on a bunch of language modeling benchmarks with no task-specific fine-tuning. 00:09:44.000 |
So no traditional fine-tune on a training set and then test on a testing set. So here's 00:09:48.800 |
an example of such a task. This is a language modeling task called Lambada, where the goal 00:09:52.480 |
is to predict a missing word. And the idea is that the word that you need to predict 00:09:56.600 |
depends on some discourse earlier in the sentence or earlier a few sentences ago. And by simply 00:10:03.560 |
training your language model and then running it on the Lambada task, you end up doing better 00:10:07.880 |
than the supervised fine-tuned state of the art at the time, and similarly across a wide variety of other language modeling benchmarks. 00:10:18.640 |
OK. Another kind of interesting behavior they observed-- and so you'll see hints of things 00:10:27.240 |
that we now take for granted in this paper-- is that you can get interesting zero-shot 00:10:31.200 |
behavior as long as you take some liberties with how you specify the task. So for example, 00:10:36.360 |
let's imagine that we want our model to do summarization. Even though GPT-2 was just 00:10:40.880 |
a language model, how can we make it do summarization? 00:10:44.880 |
The idea they explored was we're going to take an article, some news article, and then 00:10:48.960 |
at the end, we're going to append the TLDR sign, the TLDR token. So this stands for Too 00:10:53.920 |
Long Didn't Read. It's used a lot on Reddit to just say, if you didn't want to read the 00:10:57.460 |
above stuff, here's a few sentences that summarizes it. 00:11:00.920 |
So if you ask the model to predict what follows after the TLDR token, you might expect that 00:11:06.600 |
it'll generate some sort of summary. And this is kind of early whispers at this term that 00:11:12.040 |
we now call prompting, which is thinking of the right way to define a task such that your 00:11:17.160 |
model will do the behavior that you want it to do. 00:11:22.240 |
So if we look at the performance we actually observed on this task, here at the bottom 00:11:25.960 |
is a random baseline. So you just select three sentences from the article. And the scores 00:11:31.040 |
that we're using here are Rouge scores, if you remember the natural language generation 00:11:34.040 |
lecture. GPT-2 is right above. So it's not actually that good. It only does maybe a little 00:11:40.040 |
bit or barely any better than the random baseline. But it is approaching supervised 00:11:46.160 |
approaches that are actually explicitly fine-tuned to do summarization. 00:11:52.120 |
And of course, at the time, it still underperformed the state of the art. But this really showed 00:11:56.180 |
the promise of getting language models to do things that maybe they weren't trained to do. 00:12:01.520 |
OK, so that was GPT-2. That was 2019. Now here's 2020, GPT-3. So GPT-3 is 175 billion 00:12:13.000 |
parameters. So it's another increase in size by an order of magnitude. And at the time, 00:12:17.280 |
it was unprecedented. I think it still is kind of overwhelmingly large for most people. 00:12:22.320 |
And data. So they scaled up the data once again. 00:12:24.720 |
OK, so what does this buy you? This paper's title was "Language Models are Few-Shot Learners." 00:12:30.960 |
So what does that mean? So the key takeaway from GPT-3 was emergent few-shot learning. 00:12:37.480 |
So the idea here is, sure, GPT can still do zero-shot learning. But now you can specify 00:12:43.200 |
a task by basically giving a few examples of the task before asking it to make a prediction for a new example. 00:12:51.000 |
So this is often called in-context learning to stress that there are no gradient updates 00:12:55.540 |
being performed when you learn a new task. You're basically kind of constructing a tiny 00:12:59.580 |
little training data set and just including it in the prompt, including it in the context 00:13:03.520 |
window of your transformer, and then asking it to pick up on what the task is and then 00:13:07.500 |
predict the right answer. And this is in contrast to a separate literature on few-shot learning, 00:13:13.600 |
which assumes that you can do gradient updates. In this case, it's really just a frozen language model. 00:13:21.340 |
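A minimal sketch of such a few-shot prompt; the translation task and exemplars are illustrative (they echo the English-to-French example in the GPT-3 paper), not taken from this slide.

```python
# Few-shot in-context learning: exemplars go in the context window,
# followed by the query; the model is frozen and no gradients are taken.
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
query = "plush giraffe"

prompt = "Translate English to French.\n"
for english, french in examples:
    prompt += f"{english} => {french}\n"
prompt += f"{query} =>"
# Asking the model to continue `prompt` should yield the French translation.
```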
So few-shot learning works, and it's really impressive. So here's a graph. SuperGLUE here 00:13:26.560 |
is a kind of a wide coverage natural language understanding benchmark. And what they did 00:13:30.360 |
was they took GPT-3, and this data point here is what you get when you just do zero-shot 00:13:36.480 |
learning with GPT-3. So you provide an English description of a task to be completed, and ask the model to complete it. 00:13:44.960 |
Just by providing one example, so one shot, you get like a 10% accuracy increase. So you 00:13:49.840 |
give not only the natural language task description, but also an example input and an example output, 00:13:55.520 |
and you ask it to complete the next output. And as you increase to more shots, you do get 00:14:01.000 |
better and better scores, although, of course, you get diminishing returns after a while. 00:14:06.480 |
But what you can notice is that few-shot GPT-3, so no gradient updates, is doing as well as 00:14:12.000 |
or outperforming BERT fine-tuned on the SuperGLUE task explicitly. 00:14:23.680 |
So one thing that I think is really exciting is that you might think, OK, a few-shot learning, 00:14:28.840 |
whatever, it's just memorizing. Maybe there's a lot of examples of needing to do a few-shot 00:14:32.200 |
learning in the internet text data. And that's true, but I think there's also evidence that 00:14:37.320 |
GPT-3 is really learning to do some sort of on-the-fly optimization or reasoning. 00:14:43.560 |
And so the evidence for this comes in the form of these synthetic word unscrambling 00:14:46.840 |
tasks. So the authors came up with a bunch of simple kind of letter manipulation tasks 00:14:51.480 |
that are probably unlikely to exist in internet text data. So these include things like cycling 00:14:56.880 |
through the letters to get the kind of uncycled version of a word, so converting from 'pleap' 00:15:02.160 |
to 'apple', removing random characters that have been inserted into a word, or even just reversing words. 00:15:07.800 |
And what you see here is performance as you do few-shot learning as you increase the model 00:15:12.240 |
size. And what you can see is that the ability to do few-shot learning is kind of an emergent 00:15:19.160 |
property of model scale. So at the very largest model, we're actually seeing a model be able to do these tasks. 00:15:27.160 |
I've noticed the performance on the reversed words is horrible. 00:15:36.160 |
Yeah. Yeah. So the question was about the reversed words, where performance is still low. Yeah, that's an 00:15:42.160 |
example of a task that these models still can't solve yet, although I'm not sure if 00:15:46.600 |
we've evaluated it with newer and newer models. Maybe the latest versions can indeed actually do it. 00:15:51.360 |
Is there some intuition for why this emerges as a result of model scale? 00:15:56.160 |
I think that's a highly active area of research, and there's been papers published every week 00:16:00.360 |
on this. So I think there's a lot of interesting experiments that really try to dissect either 00:16:05.120 |
with synthetic tasks, like can GPT-3 learn linear regression in context? And there's 00:16:10.720 |
some model interpretability tasks, like what in the attention layers or what in the hidden 00:16:14.560 |
states are resulting in this kind of emergent learning. But yeah, I'd have to just refer 00:16:19.160 |
you to the recent literature on that. Anything else? Awesome. 00:16:27.000 |
Okay, so just to summarize, traditional fine tuning here is on the right. We take a bunch 00:16:32.280 |
of examples of a task that we care about. We give it to our model, and then we do a 00:16:35.680 |
gradient step on each example. And then at the end, we hopefully get a model that can 00:16:39.080 |
do well on some outputs. And in this new kind of paradigm of just prompting a language model, 00:16:43.640 |
we just have a frozen language model, and we just give some examples and ask the model to complete the pattern and predict the right output. 00:16:53.320 |
So you might think, and you'd be right, that there are some limits of prompting. Well, 00:16:57.040 |
there's a lot of limits of prompting, but especially for tasks that are too hard. There 00:17:00.680 |
are a lot of tasks that maybe seem too difficult, especially ones that involve maybe richer 00:17:04.840 |
reasoning steps or needing to synthesize multiple pieces of information. And these are tasks 00:17:10.060 |
that humans struggle with too. So one example is GPT-3. I don't have the actual graph here, 00:17:16.620 |
but it was famously bad at doing addition with larger numbers of digits. And so if you prompt 00:17:21.960 |
GPT-3 with a bunch of examples of addition, it won't do it correctly. But part of the reason 00:17:27.000 |
is because humans are also pretty bad at doing this in one step. Like if I asked you to just 00:17:31.280 |
add these two numbers on the fly and didn't give you a pencil and paper, you'd have a hard time too. 00:17:37.520 |
So one observation is that you can just change the prompts and hopefully get some better performance. 00:17:44.100 |
So there's this idea of doing chain of thought prompting, where in standard prompting, we 00:17:49.700 |
give some examples of a task that we'd like to complete. So here is an example of a math 00:17:53.380 |
word problem. And I told you that what we would do is we would give the question and 00:17:58.180 |
then the answer. And then for a data point that we actually care about, we ask the model 00:18:03.320 |
to predict the answer. And the model will try to produce the right answer, and it's often wrong. 00:18:09.780 |
So the idea of chain of thought prompting is to actually demonstrate what kind of reasoning 00:18:14.040 |
you want the model to complete. So in your prompt, you not only put the question, but 00:18:20.060 |
you also put an answer and the kinds of reasoning steps that are required to arrive at the correct 00:18:24.700 |
answer. So here is actually some reasoning of how you actually would answer this tennis 00:18:28.400 |
ball question and then get the right answer. And because the language model is incentivized 00:18:33.540 |
to just follow the pattern and continue the prompt, if you give it another question, it 00:18:38.260 |
will in turn produce a rationale followed by an answer. 00:18:44.860 |
So you're kind of asking the language model to work through the steps itself. And by 00:18:48.640 |
doing so, you end up getting some questions right when you otherwise might not. 00:18:54.620 |
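A sketch of what such a chain-of-thought prompt looks like; the exemplar and test question paraphrase the well-known tennis-ball and cafeteria examples from the Wei et al. chain-of-thought paper.

```python
# Chain-of-thought prompting: the in-context exemplar includes a worked
# rationale before the final answer, so the model imitates that pattern.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
question = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\nA:"
)
prompt = exemplar + question
# The model is expected to continue with a rationale ending in "The answer is 9."
```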
So a super simple idea, but it's shown to be extremely effective. So here is this middle 00:18:59.940 |
school math word problems benchmark. And again, as we scale up the model for GPT and 00:19:04.540 |
some other kinds of models, being able to do chain of thought prompting emerges. So 00:19:10.020 |
we really see a performance approaching that of supervised baselines for these larger and larger models. 00:19:17.220 |
Going back to the problem with the addition of the large numbers, do you have results on 00:19:35.900 |
how chain of thought prompting does for those larger numbers, like on the middle school math word problems? 00:19:37.900 |
Yeah. So the question is, does chain of thought prompting work for those addition problems 00:19:39.020 |
that I had presented? Yeah. There should be some results in the actual paper. They're 00:19:44.220 |
just not here, but you can take a look. Yeah. 00:19:48.540 |
Is there some intuition for how the model is learning without doing gradient updates? 00:19:53.340 |
Intuition about how the model is learning without gradient updates. Yeah. So this is 00:19:56.540 |
related to the question asked earlier about how is this actually happening. That is, yeah, 00:20:02.220 |
again, it's an active area of research. So my understanding of the literature is something 00:20:06.660 |
like you can show that models are kind of almost doing in-context gradient descent as 00:20:11.140 |
they encode a prompt. And you can analyze this with model interpretability experiments. 00:20:16.820 |
But I'm happy to suggest papers afterwards that kind of deal with this problem more carefully. 00:20:26.220 |
Cool. Okay. So a follow up work to this asked the question of, do we actually even need 00:20:35.740 |
examples of reasoning? Do we actually even need to collect humans working through these 00:20:39.860 |
problems? Can we actually just ask the model to reason through things? Just ask it nicely. 00:20:45.980 |
So this introduced this idea called zero shot chain of thought prompting. And it was honestly 00:20:49.820 |
like I think probably the highest impact-to-simplicity ratio I've seen in a paper, 00:20:55.260 |
where it's the simplest possible thing: instead of doing this manual chain of thought 00:20:58.620 |
stuff, you just ask the question, and then for the answer, you first prepend the tokens "let's 00:21:03.740 |
think step by step." And the model will decode as if it had said, let's think step by step. 00:21:09.540 |
And it will work through some reasoning and produce the right answer. So does this work 00:21:15.900 |
on some arithmetic benchmarks? Here's what happens when you prompt the model just zero 00:21:20.300 |
shot. So just asking it to produce the answer right away without any reasoning. Few-shot 00:21:24.900 |
is giving some examples of inputs and outputs. And this is zero shot chain of thought. So 00:21:29.820 |
just asking the model to think through things, you get crazy good accuracy. When we compare 00:21:35.980 |
to actually doing manual chain of thought, you still do better with manual chains of 00:21:39.620 |
thought. But that just goes to show you how simple of an idea this is and ends up producing 00:21:44.620 |
improved performance numbers. So the funny part of this paper was, why use "let's think 00:21:51.620 |
step by step"? They actually used a lot of prompts and tried them out. So here's zero 00:21:55.860 |
shot baseline performance. They tried out a bunch of different prefixes: "The answer is 00:22:00.340 |
after the proof," "Let's think," "Let's think about this logically." And they found that 00:22:04.140 |
let's think step by step was the best one. It turns out this was actually built upon 00:22:09.260 |
later in the year where they actually use a language model to search through the best 00:22:12.860 |
possible strings that would maximize performance on this task, which is probably gross overfitting. 00:22:18.420 |
But the best prompt they found was "Let's work this out in a step by step way 00:22:23.060 |
to be sure we have the right answer." So the "right answer" part is presuming that 00:22:27.060 |
you get the answer right. It's like giving the model some confidence in itself. 00:22:32.780 |
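To make the zero-shot chain-of-thought setup concrete, here is a sketch; the juggler question is the example used in the Kojima et al. paper, and the two-stage answer extraction follows that paper's description.

```python
# Zero-shot chain-of-thought: no exemplars at all, just the trigger phrase.
question = ("Q: A juggler can juggle 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")
prompt = question + "\nA: Let's think step by step."
# Stage 1: the model decodes a free-form rationale after this prompt.
# Stage 2: a second call appends something like "Therefore, the answer is"
# to that rationale to extract the final answer (here, 4).
```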
So this might seem to you like a total dark arcane art. And that's because it is. We really 00:22:38.260 |
have no intuition as to what's going on here. Or we're trying to build some intuition. But 00:22:44.580 |
as a result, and I'm sure you've seen if you spend time in tech circles or you've seen 00:22:48.380 |
on the internet, there's this whole new idea of prompt engineering being an emergent science 00:22:52.700 |
and profession. So this includes things like asking a model for reasoning. It includes 00:22:57.060 |
jailbreaking language models for telling them to do things that they otherwise aren't trained 00:23:01.780 |
to do. Even AI art like DALI or stable diffusion, this idea of constructing these really complex 00:23:08.740 |
prompts to get model outputs that you want. That's also prompting. Anecdotally, I've heard 00:23:13.660 |
of people saying I'm going to use a code generation model, but I'm going to include the Google 00:23:17.100 |
code header first, because that will produce more professional or bug-free code, depending 00:23:21.460 |
on how much you believe in Google. But yeah, and there's a Wikipedia article on this now 00:23:27.260 |
and there's even startups that are hiring for prompt engineers and they pay quite well. 00:23:30.480 |
So if you want to be a prompt engineer, definitely practice your GPT whispering skills. 00:23:35.980 |
We have a question? Sorry. Yes, you go. Yeah, go ahead. 00:23:44.500 |
A few slides ago, you showed an LM-designed prompt that was like this long. How can you get the LM to come up with that? 00:23:54.620 |
I think they treated it like a reinforcement learning problem. But I'll just direct you 00:23:58.100 |
to this paper at the bottom to learn more details. Yeah, I think it's the Zhou et al. paper. 00:24:02.780 |
So I'm just a bit curious about how they provided feedback. So in case the model was not giving 00:24:03.780 |
the right answer, were there prompts to say that that's not right, maybe think about this 00:24:04.780 |
with a different approach? How is feedback provided? 00:24:09.780 |
They don't incorporate feedback in these kinds of chain of thought prompting experiments. 00:24:25.020 |
If the model gets the answer wrong, then it gets the answer wrong, and we 00:24:27.900 |
just evaluate accuracy. Right. But this idea of incorporating feedback, I think, is quite 00:24:31.700 |
interesting and I think you'll see some maybe hints of discussion of that later on. Yeah. 00:24:41.900 |
Questions? Okay, awesome. Okay, so talking about these three things, I'm going to talk 00:24:51.820 |
about the benefits and limitations of the various different things that we could be 00:24:55.060 |
doing here. So for zero shot and few shot in context learning, the benefit is you don't 00:24:59.860 |
need any fine tuning and you can carefully construct your prompts to hopefully get better 00:25:04.140 |
performance. The downsides are there are limits to what you can fit in context. Transformers 00:25:09.540 |
have a fixed context window of say 1,000 or a few thousand tokens. And I think, as you 00:25:14.940 |
will probably find out, for really complex tasks, you are indeed going to need some gradient 00:25:18.980 |
steps. So you're going to need some sort of fine tuning. But that brings us to the next 00:25:24.220 |
part of the lecture. So that's instruction fine tuning. Okay, so the idea of instruction 00:25:31.180 |
fine tuning is that, sure, these models are pretty good at doing prompting. You can get 00:25:35.900 |
them to do really interesting things. But there is still a problem, which is that language 00:25:40.280 |
models are trained to predict the most likely continuation of tokens. And that is not the 00:25:44.220 |
same as what we want language models to do, which is to assist people. So as an example, 00:25:49.460 |
if I give GPT-3 this kind of prompt, explain the moon landing, GPT-3 is trained to predict, 00:25:54.700 |
you know, if I saw this on the internet somewhere, what is the most likely continuation? Well, 00:25:59.140 |
maybe someone was coming up with a list of things to do with a six year old. So it's 00:26:02.420 |
just predicting a list of other tasks, right? It's not answering your question. And so the 00:26:07.260 |
issue here is that language models are not, to use the term, aligned with user intent. So how 00:26:13.360 |
might we better align models with user intent for this case? Well, super simple answer, 00:26:19.700 |
right? We're machine learners. Let's do machine learning. So we're going to ask a human, give 00:26:25.360 |
me the right answer, right? Give me the way that a language model should respond according 00:26:29.300 |
to this prompt. And let's just do fine tuning. So this is a slide from the pre-training lecture. 00:26:39.140 |
Again, pre-training can improve NLP applications by serving as parameter initialization. So 00:26:45.940 |
this kind of pipeline, I think you are familiar with. And the difference here is that instead 00:26:51.420 |
of fine tuning on a single downstream task of interest, like sentiment analysis, what 00:26:55.460 |
we're going to do is we're going to fine tune on many tasks. So we have a lot of tasks and 00:26:59.860 |
the hope is that we can then generalize to other unseen tasks at test time. So as you 00:27:06.040 |
might expect, data and scale is kind of key for this to work. So we're going to collect 00:27:11.340 |
a bunch of examples of instruction output pairs across many tasks and then fine tune 00:27:16.940 |
our language model and then evaluate generalization to unseen tasks. 00:27:23.100 |
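A minimal sketch of what this looks like in code, assuming the Hugging Face transformers library; the model name and the two instruction-output pairs are placeholders, not the actual FLAN data or recipe.

```python
# Instruction fine-tuning sketch: concatenate (instruction, output) pairs
# drawn from many tasks and train with the ordinary language modeling loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

data = [  # hypothetical examples from two different tasks
    ("Translate to French: I like cats.", "J'aime les chats."),
    ("Is the sentiment positive or negative? 'Great movie!'", "Positive."),
]

model.train()
for instruction, output in data:
    text = instruction + "\n" + output + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Next-token prediction loss over the whole sequence; a real setup would
    # usually mask out the loss on the instruction tokens.
    loss = model(**batch, labels=batch["input_ids"]).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```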
Yeah, so data and scale is important. So as an example, one recent data set that was published 00:27:29.900 |
for this is called the Supernatural Instructions Dataset. It contains over 1.6 thousand tasks 00:27:35.820 |
containing 3 million examples. So this includes translation, question answering, question 00:27:40.740 |
generation, even coding, mathematical reasoning, etc. And when you look at this, you really 00:27:47.540 |
begin to think, well, is this actually fine tuning or is this just more pre-training? 00:27:51.340 |
And it's actually both. We're kind of blurring the lines here: given the scale that 00:27:55.780 |
we're training this on, it is basically still a general pre-training task, just slightly more 00:28:00.520 |
specific than plain language modeling. 00:28:05.780 |
So one question I have is, now that we are training our model on so many tasks, how do 00:28:10.660 |
we evaluate such a model? Because you can't really say, OK, can you now do sentiment analysis 00:28:15.140 |
well? The scale of tasks we want to evaluate this language model on is much greater. 00:28:22.220 |
So just as a brief aside, a lot of research has been going into building up these benchmarks 00:28:27.740 |
for these massive multitask language models and seeing to what degree they can do not 00:28:32.380 |
only just one task, but just a variety of tasks. So this is the Massive Multitask Language 00:28:37.180 |
Understanding Benchmark or MMLU. It consists of a bunch of benchmarks for measuring language 00:28:42.420 |
model performance on a bunch of knowledge intensive tasks that you would expect a high 00:28:46.380 |
school or college student to complete. So you're testing a language model not only on 00:28:51.500 |
sentiment analysis, but on astronomy and logic and European history. And here are some numbers 00:28:58.340 |
where at the time, GPT-3 is not that good, but it's certainly above a random baseline. 00:29:07.380 |
Here's another example. So this is the Beyond the Imitation Game Benchmark or BigBench. 00:29:11.760 |
This has like a billion authors because it was a huge collaborative effort. And this 00:29:16.100 |
is a word cloud of the tasks that were evaluated. And it really contains some very esoteric 00:29:22.740 |
tasks. So this is an example of one task included where you have to, given a kanji or Japanese 00:29:27.740 |
character in ASCII art, you need to predict the meaning of the character. So we're really testing a very wide range of capabilities here. 00:29:35.060 |
OK, so instruction fine tuning, does it work? Recall the T5 encoder-decoder model. 00:29:44.860 |
So this is kind of Google's encoder-decoder model, where it's pre-trained on this span 00:29:48.540 |
corruption task. So if you don't remember that, you can refer back to that lecture. 00:29:52.780 |
But the authors released a newer version called FLAN-T5, where FLAN stands for Finetuned 00:29:57.780 |
Language Net. And this is T5 models trained on an additional 1.8 thousand tasks, which include 00:30:02.820 |
the natural instructions data that I just mentioned. And if we average across both the 00:30:07.120 |
BigBench and MMLU performance and normalize it, what we see is that instruction fine tuning 00:30:12.940 |
works. And crucially, the bigger the model, the bigger the benefit that you get from doing 00:30:18.300 |
instruction fine tuning. So it's really the large models that stand to benefit from fine-tuning. 00:30:25.640 |
And you might look at this and say, this is kind of sad for academics or anyone without 00:30:29.620 |
a massive GPU cluster. It's like who can run an 11 billion parameter model? I guess the 00:30:34.020 |
one silver lining, if you look at the results here, is the 80 million parameter model, which is the 00:30:38.500 |
smallest one. If you look at after fine tuning, it ends up performing about as well as the 00:30:43.140 |
un-fine tuned 11 billion parameter model. So there's a lot of examples in the literature 00:30:48.020 |
about smaller instruction fine tune pre-trained models outperforming larger models that are 00:30:53.700 |
many, many times the size. So hopefully there's still some hope for people with just smaller compute budgets. 00:31:00.420 |
Any questions? Awesome. In order to really understand the capabilities, I highly recommend 00:31:08.140 |
that you just try it out yourself. So Flan T5 is hosted on Hugging Face. I think Hugging 00:31:13.500 |
Face has a demo where you can just type in a little query, ask it to do anything, see 00:31:17.500 |
what it does. But there are qualitative examples of this working. So for questions where a 00:31:23.140 |
non-instruction fine-tuned model will just kind of waffle on and not answer the question, 00:31:27.820 |
doing instruction fine tuning will get your model to much more accurately reason through and answer them. 00:31:37.380 |
OK. So that was instruction fine tuning. Positives of this method. Super simple, super straightforward. 00:31:45.300 |
It's just doing fine tuning. And you see this really cool ability to generalize to unseen tasks. 00:31:52.940 |
In terms of negatives, does anyone have any ideas for what might be downsides of instruction fine-tuning? 00:31:59.580 |
It seems like it suffers from the same negatives as any human-sourced data. It's hard to get 00:32:13.620 |
people to provide the input, and different people might think different things about what the right output is. 00:32:19.620 |
Yeah, yeah, exactly. So comments are, well, it's hard and annoying to get human labels 00:32:25.700 |
and it's expensive. That's something that definitely matters. And that last part you 00:32:29.180 |
mentioned about there might be, you know, humans might disagree on what the right label 00:32:32.540 |
is. Yeah, that's increasingly a problem. Yeah. So what are the limitations? The obvious limitation 00:32:39.540 |
is money. Collecting ground truth data for so many tasks costs a lot of money. Subtler 00:32:45.740 |
limitations include the one that you were mentioning. So as we begin to ask for more 00:32:50.340 |
creative and open-ended tasks from our models, right, there are tasks where there is no right 00:32:54.300 |
answer. And it's a little bit weird to say, you know, this is an example of how to write 00:32:59.020 |
some story, right? So write me a story about a dog and her pet grasshopper. Like there 00:33:03.100 |
is not one answer to this, but if we were only to collect one or two demonstrations, 00:33:07.740 |
the language modeling objective would say you should put all of your probability mass 00:33:11.820 |
on the two ways that two humans wrote this answer, right? When in reality, there's no single right answer. 00:33:19.900 |
Another problem, which is related kind of fundamentally to language modeling in the first 00:33:23.260 |
place, is that language modeling as an objective penalizes all token level mistakes equally. 00:33:29.860 |
So what I mean by that is if you were asking a language model, for example, to predict 00:33:33.020 |
the sentence, "Avatar is a fantasy TV show," and you were asking it, and let's imagine 00:33:39.140 |
that the LM mispredicted adventure instead of fantasy, right? So adventure is a mistake. 00:33:45.420 |
It's not the right word, but it is equally as bad as if the model were to predict something 00:33:50.500 |
like musical, right? But the problem is that "Avatar is an adventure TV show" is still 00:33:56.420 |
true, right? So it's not necessarily a bad thing, whereas "Avatar is a musical" is just 00:34:00.460 |
false. So under the language modeling objective, right, if the model were equally confident, 00:34:06.020 |
you would pay the equal penalty, an equal loss penalty for predicting either of those 00:34:09.480 |
tokens wrong. But it's clear that this objective is not actually aligned with what users want, 00:34:15.700 |
which is maybe truth or creativity or generally just this idea of human preferences, right? 00:34:21.860 |
Could we do something like multiply the penalty by the distance in the word embedding space 00:34:29.060 |
in order to reduce this? Because musical would have a higher distance away than adventure. 00:34:34.740 |
Yeah, that's an interesting question. It's an interesting idea. I haven't heard of people 00:34:41.140 |
doing that, but it seems plausible. I guess one issue is you might come up with adversarial 00:34:46.540 |
settings where maybe the word embedding distance is also not telling you the right thing, right? 00:34:50.340 |
So for example, show and musical maybe are very close together because they're both shows 00:34:55.220 |
or things to watch, but in terms of veracity, right, they're completely different. One is 00:34:59.860 |
true, one is false, right? So yeah, you can try it, although I think there might be some issues. 00:35:07.540 |
Cool. Okay, so in the next part of the talk, we're going to actually explicitly try to 00:35:14.700 |
satisfy human preferences and come up with a mathematical framework for doing so. And yeah, 00:35:23.020 |
so these are the limitations, as I had just mentioned. So this is where we get into reinforcement learning from human feedback. 00:35:31.020 |
Okay, so RLHF. So let's say we were training a language model on some task like summarization. 00:35:41.700 |
And let's imagine that for each language model sample S, let's imagine that we had a way 00:35:46.340 |
to obtain a human reward of that summary. So we could score this summary with a reward 00:35:52.740 |
function, which we'll call R of S, and the higher the reward, the better. So let's imagine 00:35:59.940 |
we're summarizing this article, and we have this summary, which maybe is pretty good, 00:36:04.900 |
let's say. We had another summary, maybe it's a bit worse. And if we were able to ask a 00:36:10.780 |
human to just rate all these outputs, then the objective that we want to maximize or 00:36:14.740 |
satisfy is very obvious. We just want to maximize the expected reward of samples from our language 00:36:19.900 |
model, right? So in expectation, as we take samples from our language model, P theta, 00:36:25.940 |
we just want to maximize the reward of those samples. Fairly straightforward. 00:36:33.300 |
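Written out, the objective being described is roughly the following, where p_theta is the language model, s-hat a sample from it, and R the (human) reward:

```latex
\max_{\theta}\; \mathbb{E}_{\hat{s} \sim p_{\theta}(s)}\big[\, R(\hat{s}) \,\big]
```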
So for mathematical simplicity here, I'm kind of assuming that there's only one task or 00:36:38.220 |
one prompt, right? So let's imagine we were just trying to summarize this article, but 00:36:42.340 |
we could talk about how to extend it to multiple prompts later on. 00:36:46.580 |
Okay, so this kind of task is the domain of reinforcement learning. So I'm not going to 00:36:52.580 |
presume there's any knowledge of reinforcement learning, although I'm sure some of you are 00:36:55.780 |
quite familiar with it, probably even more familiar than I am. But the field of reinforcement 00:37:00.380 |
learning has studied these kinds of problems, these optimization problems of how to optimize 00:37:04.700 |
something that you can only evaluate by sampling or simulation, for many years now. And in 2013, there was 00:37:11.140 |
a resurgence of interest in reinforcement learning for deep learning specifically. So 00:37:15.100 |
you might have seen these results from DeepMind about an agent learning to play Atari games, 00:37:20.140 |
an agent mastering Go much earlier than expected. 00:37:24.660 |
But interestingly, I think the interest in applying reinforcement learning to modern 00:37:28.140 |
LMs is a bit newer, on the other hand. And I think the kind of earliest success story 00:37:32.940 |
or one of the earliest success stories was only in 2019, for example. So why might this 00:37:37.620 |
be the case? There's a few reasons. I think in general, the field had kind of this sense 00:37:42.100 |
that reinforcement learning with language models was really hard to get right, partially 00:37:46.780 |
because language models are very complicated. And if you think of language models as actors 00:37:52.540 |
that have an action space where they can spit out any sentence, that's a lot of sentences. 00:37:56.980 |
So it's a very complex space to explore. So it still is a really hard problem. So that's 00:38:01.460 |
part of the reason. But also practically, I think there have been these newer algorithms 00:38:06.340 |
that seem to work much better for deep neural models, including language models. And these 00:38:11.080 |
include algorithms like proximal policy optimization. But we won't get into the details of that 00:38:15.620 |
for this course. But these are kind of the reasons why we've become re-interested in applying RL to language models. 00:38:28.060 |
So how do we actually maximize this objective? I've written it down. And ideally, we should 00:38:32.420 |
just change our parameters theta so that reward is high. But it's not really clear how to 00:38:37.020 |
do so. So when we think about it, I mean, what have we learned in the class thus far? 00:38:42.180 |
We know that we can do gradient descent or gradient ascent. So let's try doing gradient 00:38:45.780 |
ascent. We're going to maximize this objective. So we're going to step in the direction of 00:38:49.220 |
steepest gradient. But this quickly becomes a problem, which is what is this quantity 00:38:54.820 |
and how do we evaluate it? How do we estimate this expectation given that the parameters 00:39:00.460 |
we're taking the gradient with respect to, theta, appear in the sampling distribution of the expectation? And 00:39:06.420 |
the second is what if our reward function is not differentiable? Like human judgments 00:39:10.600 |
are not differentiable. We can't backprop through them. And so we need this to work with arbitrary, non-differentiable reward functions. 00:39:16.620 |
So there's a class of methods in reinforcement learning called policy gradient methods that 00:39:22.820 |
gives us tools for estimating and optimizing this objective. And for the purposes of this 00:39:28.220 |
course, I'm going to try to describe the highest level possible intuition for this, which looks 00:39:35.260 |
at the math and shows what's going on here. But it is going to omit a lot of the details. 00:39:40.380 |
And a full treatment of reinforcement learning is definitely outside of the scope of this 00:39:43.820 |
course. So if you're more interested in this kind of content, you should check out CS234 00:39:48.620 |
Reinforcement Learning, for example. And in general, I think this is going to get a little 00:39:53.380 |
mathy, but it's totally fine if you don't understand it. We will talk, we'll regroup 00:39:56.740 |
at the end and just show what this means for how to do RLHF. 00:40:03.320 |
But what I'm going to do is just describe how we actually estimate this objective. So 00:40:06.780 |
we want to obtain this gradient. So it's the gradient of the expectation of the reward 00:40:12.980 |
of samples from our language model. And if we do the math, we break this apart. This 00:40:17.260 |
is our definition of what an expectation is. We're going to sum over all sentences, weighted 00:40:21.580 |
by their probability. And due to the linearity of the gradient, we can push the gradient operator inside the sum. 00:40:31.680 |
Now what we're going to do is we're going to use a very handy trick known as a log derivative 00:40:35.380 |
trick. And this is called a trick, but it's really just the chain rule. But let's just 00:40:38.940 |
see what happens when we take the gradient of the log probability of a sample from our 00:40:46.240 |
So if I take the gradient, then how do we use the chain rule? So the gradient of the 00:40:49.900 |
log of something is going to be 1 over that something times the gradient 00:40:53.460 |
of that something. So 1 over P theta of s times the gradient of P theta of s. And if we rearrange, we 00:40:58.380 |
see that we can alternatively write the gradient of P theta of s as this product. So P theta 00:41:04.580 |
of s times the gradient of the log P theta of s. And we can plug this back in. 00:41:12.660 |
And the reason why we're doing this is because we're going to convert this into a form where 00:41:15.740 |
the expectation is easy to estimate. So we plug it back in. That gives us this. And if 00:41:22.340 |
you squint quite closely at this last equation here, this first part here is the definition 00:41:28.540 |
of an expectation. We are summing over a bunch of samples from our model, and we are weighting 00:41:33.060 |
it by the probability of that sample, which means that we can rewrite it as an expectation. 00:41:37.740 |
And in particular, it's an expectation of this quantity here. So let's just rewrite 00:41:42.900 |
it. And this gives us our kind of newer form of this objective. So these two are equivalent. 00:41:50.500 |
And what has happened here is we've kind of shoved the gradient inside of the expectation, 00:41:54.860 |
if that makes sense. So why is this useful? Does anyone have any questions on this before 00:42:00.700 |
I move on? If you don't understand it, that's fine as well, because we will go over the intuition in a second. 00:42:14.420 |
So we've converted this into this. And we put the gradient inside the expectation, which 00:42:20.180 |
means we can now approximate this objective with Monte Carlo samples. So the way to approximate 00:42:24.900 |
any expectation is to just take a bunch of samples and then average them. So approximately, 00:42:30.420 |
this is equal to sampling a finite number of samples from our model, and then summing 00:42:34.920 |
up and averaging the reward times the gradient of the log probability 00:42:39.700 |
of that sample. And that gives us this update rule, plugging it back in for the gradient ascent step. 00:42:49.580 |
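Reconstructing the spoken derivation in symbols (m is the number of Monte Carlo samples, alpha the learning rate), it goes roughly like this:

```latex
\nabla_{\theta}\,\mathbb{E}_{\hat{s}\sim p_{\theta}(s)}[R(\hat{s})]
 = \sum_{s} R(s)\,\nabla_{\theta}\, p_{\theta}(s)
 = \sum_{s} R(s)\, p_{\theta}(s)\,\nabla_{\theta}\log p_{\theta}(s)
 = \mathbb{E}_{\hat{s}\sim p_{\theta}(s)}\big[R(\hat{s})\,\nabla_{\theta}\log p_{\theta}(\hat{s})\big]
 \approx \frac{1}{m}\sum_{i=1}^{m} R(s_i)\,\nabla_{\theta}\log p_{\theta}(s_i),
 \qquad s_i \sim p_{\theta}(s)

% which yields the gradient ascent update:
\theta_{t+1} \;\leftarrow\; \theta_t \;+\; \alpha\,\frac{1}{m}\sum_{i=1}^{m} R(s_i)\,\nabla_{\theta}\log p_{\theta}(s_i)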
So what is this? What does this mean? Let's think about a very simple case. Imagine the 00:42:56.840 |
reward was a binary reward. So it was either 0 or 1. So for example, imagine we were trying 00:43:01.820 |
to train a language model to talk about cats. So whenever it utters a sentence with the 00:43:05.500 |
word cat, we give it a 1 reward. Otherwise, we give it a 0 reward. Now, if our reward 00:43:10.860 |
is binary, does anyone know what this objective reduces to or look like? Any ideas? If I've 00:43:27.460 |
The reward would just be an indicator function. 00:43:35.380 |
So basically, to answer, the reward would be 0 everywhere, except for sentences that 00:43:41.260 |
contain the word cat. And in that case, it would be 1. So basically, that would just 00:43:46.300 |
look like vanilla gradient descent, just on sentences that contain the word cat. 00:43:52.760 |
So to generalize this to the more general case, where the reward is scalar, what this 00:43:57.860 |
is looking like, if you look at it, is if r is very high, very positive, then we're 00:44:03.060 |
multiplying the gradient of that sample by a large number. And so our objective will 00:44:07.420 |
try to take gradient steps in the direction of maximizing the probability of producing 00:44:11.460 |
that sample again, producing the sample that led to high reward. 00:44:15.680 |
And on the other hand, if r is low or even negative, then we will actively take steps 00:44:19.700 |
to minimize the probability of that happening again. And that's the English intuition of 00:44:24.180 |
what's going on here. The reason why we call it reinforcement learning is because we want 00:44:28.280 |
to reinforce good actions and increase the probability that they happen again in the 00:44:33.100 |
And hopefully, this intuitively makes sense to all of you. Let's say you're playing a 00:44:36.100 |
video game, and on one run, you get a super high score. And you think to yourself, oh, 00:44:40.300 |
that was really good. Whatever I did that time, I should do again in the future. This 00:44:43.980 |
is what we're trying to capture with this kind of update. 00:44:47.020 |
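A toy sketch of this update, using a single-token "policy" over a tiny vocabulary rather than a real language model; the reward is 1 only when the sampled token is "cat", mirroring the example above.

```python
# REINFORCE-style update: sample, score, and reweight log-probs by reward.
import torch

vocab = ["dog", "cat", "fish", "bird"]
logits = torch.zeros(len(vocab), requires_grad=True)  # toy policy parameters theta
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward(token_id: int) -> float:
    return 1.0 if vocab[token_id] == "cat" else 0.0

for step in range(100):
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample((16,))                       # m Monte Carlo samples s_i ~ p_theta
    rewards = torch.tensor([reward(int(s)) for s in samples])
    log_probs = dist.log_prob(samples)                 # log p_theta(s_i)
    loss = -(rewards * log_probs).mean()               # negative of the estimated objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=-1))  # probability mass should concentrate on "cat"
```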
Is there any reason that we use policy gradient and not value iteration or other methods? 00:44:55.300 |
You can do a lot of things. I think there have been methods for doing Q-learning, offline 00:44:59.980 |
learning, et cetera, with language models. I think the design space has been very underexplored. 00:45:06.660 |
So there's a lot of low-hanging fruit out there for people who are willing to think 00:45:09.440 |
about what fancy things we can do in RL and apply them to this language modeling case. 00:45:15.540 |
And in practice, what we use is not this simple thing, but a fancier algorithm like PPO. 00:45:22.340 |
Do you know how this works when, for language models, the action space is super big, like almost infinite? 00:45:28.100 |
So that's the challenge. So one thing that I haven't mentioned here is that right now, 00:45:33.500 |
I'm talking about entire samples of sentences, which is a massive space. In practice, when 00:45:38.260 |
we do RL, we actually do it at the level of generating individual tokens. So each token 00:45:42.180 |
is, let's say, one of GPT's 50,000 tokens. So it's a pretty large action space, but it's still tractable. 00:45:51.300 |
So that kind of answers this question I was asking, which is, can you see any problems 00:45:54.240 |
with this objective? Which is that this is a very simplified objective. There is a lot 00:45:58.440 |
more tricks needed to make this work. But hopefully, this has given you kind of the 00:46:02.100 |
high-level intuition as to what we're trying to do in the first place. 00:46:09.260 |
OK, so now we are set. We have a bunch of samples from a language model. And for any 00:46:19.020 |
arbitrary reward function, like we're just asking a human to rate these samples, we can 00:46:26.260 |
OK, so not so fast. There's a few problems. The first is the same as in the instruction 00:46:31.660 |
fine-tuning case, which is that keeping a human in the loop is expensive. I don't really 00:46:36.180 |
want to supervise every single output from a language model. I don't know if you all would want to do that either. 00:46:44.660 |
So one idea is, instead of needing to ask humans for preferences every single time, 00:46:49.120 |
you can actually build a model of their preferences, like literally just train an NLP model of 00:46:53.540 |
their preferences. So this idea was kind of first introduced outside of language modeling 00:46:58.820 |
by this paper, Knox and Stone. They called it TAMER. But we're going to see it re-implemented 00:47:04.560 |
in this idea, where we're going to train a language model-- we'll call it a reward model, 00:47:08.940 |
RM, which is parameterized by phi-- to predict human preferences from an annotated data set. 00:47:15.060 |
And then when doing RLHF, we're going to optimize for the reward model's rewards instead of actual human rewards. 00:47:25.140 |
Here's another conceptual problem. So here's a new sample for our summarization task. What 00:47:30.260 |
is the score of the sample? Anyone give me a number. Does anyone want to rate this sample? 00:47:36.020 |
Is it a 3? A 6? What scale are we using? Et cetera. 00:47:42.560 |
So the issue here is that human judgments can be noisy and miscalibrated when you ask 00:47:46.660 |
people for things alone. So one workaround for this problem is, instead of asking for 00:47:53.960 |
direct ratings, ask humans to compare two summaries and judge which one is better. This 00:47:59.660 |
has been shown, I think, in a variety of fields where people work with human subjects and 00:48:03.400 |
human responses to be more reliable. This includes psychology and medicine, et cetera. 00:48:09.340 |
So in other words, instead of asking humans to just give absolute scores, we're going 00:48:13.400 |
to ask humans to compare different samples and rate which one is better. So as an example, 00:48:19.580 |
maybe this first sample is better than the middle sample, and it's better than the last one. 00:48:25.760 |
Now that we have these pairwise comparisons, our reward model is going to generate latent 00:48:30.260 |
scores, so implicit scores based on this pairwise comparison data. So our reward model is a 00:48:35.660 |
language model that takes in a possible sample, and then it's going to produce a number, which is its predicted reward. 00:48:43.700 |
And the way that we're going to train this model-- and again, you don't really need to 00:48:46.980 |
know too much of the details here, but this is a classic statistical comparison model-- 00:48:51.700 |
is via the following loss, where the reward model essentially should just predict a higher 00:48:56.700 |
score if a sample is judged to be better than another sample. So in expectation, if we sample 00:49:03.160 |
winning samples and losing samples from our data sets, then if you look at this term here, 00:49:09.160 |
the score of the winning sample should be higher than the score of the losing sample. Does 00:49:16.540 |
that make sense? And in doing so, by just training on this objective, you will get a 00:49:22.020 |
language model that will learn to assign numerical scores to things, which indicate their relative 00:49:27.420 |
preference over other samples. And we can use those outputs as rewards. 00:49:32.580 |
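The pairwise loss being described is, roughly, the standard comparison loss used in InstructGPT-style reward models, where s^w is the sample judged better, s^l the one judged worse, and sigma is the logistic sigmoid:

```latex
J_{RM}(\phi) \;=\; -\,\mathbb{E}_{(s^{w},\,s^{l})\sim D}
  \Big[\, \log \sigma\big( RM_{\phi}(s^{w}) - RM_{\phi}(s^{l}) \big) \,\Big]
```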
Is there some renormalization either in the output or somewhere else? 00:49:44.140 |
Yeah, so I don't remember if it happens during training. But certainly, after you've trained 00:49:49.580 |
this model, you normalize the reward model so that the score is-- the expectation of 00:49:52.380 |
the score is 0, because that's good for reinforcement learning and things like that as well. Yeah, 00:49:59.380 |
How do we account for the fact that these judgments are noisy, and some people could view S3 00:50:06.780 |
as better than S1? How do we account for that noise in the ordering? 00:50:14.260 |
Yeah, I think that's just kind of limitations with asking for these preferences in the first 00:50:20.060 |
place is that humans will disagree. So we really have no ground truth unless we maybe 00:50:24.500 |
ask an ensemble of humans, for example. That's just a limitation with this. I think hopefully, 00:50:29.540 |
in the limit with enough data, this kind of noise washes out. But it's certainly an issue. 00:50:33.980 |
And this next slide will also kind of touch on this. 00:50:38.180 |
So does the reward model work? Can we actually learn to model human preferences in this way? 00:50:42.620 |
This is obviously an important sanity check before we actually try to optimize this 00:50:45.620 |
objective. And they measured this. So this is kind of evaluating the reward model on 00:50:51.540 |
a standard kind of validation set. So can the reward model predict outcomes for data 00:50:56.660 |
points that it has not seen during training? And does it change based on model size or 00:51:02.020 |
amount of data? And if you notice here, there's one dashed line, which is the human baseline, 00:51:06.780 |
which is if you ask a human to predict the outcome, a human does not get 100% accuracy 00:51:11.660 |
because humans disagree. And even an ensemble of, let's say, five humans also doesn't get 00:51:16.660 |
100% accuracy because humans have different preferences. 00:51:20.580 |
But the key takeaway here is that for the largest possible model and for enough data, 00:51:25.820 |
a reward model, at least on the validation set that they used, is kind of approaching 00:51:30.380 |
the performance of a single human. And that's kind of a green light that maybe we can use it as a stand-in for human feedback. 00:51:42.940 |
So if there are no questions, these are the components of RLHF. So we have a 00:51:49.500 |
pre-trained model, maybe it's instruction fine-tuned, which we're going to call P of 00:51:53.180 |
PT. We have a reward model, which produces scalar rewards for language model outputs, 00:51:59.500 |
and it is trained on a dataset of human comparisons. And we have a method, policy gradient, for 00:52:04.900 |
arbitrarily optimizing language model parameters towards some reward function. 00:52:10.220 |
And so now if you want to do RLHF, you clone the pre-trained model-- we're going to 00:52:14.860 |
call this copy the RL model, with parameters theta that we're 00:52:19.420 |
actually going to optimize. And we're going to optimize the following reward with reinforcement 00:52:25.660 |
learning. And this reward looks a little bit more complicated than just using the reward 00:52:29.660 |
model. And the extra term here is a penalty, which prevents us from diverging too far from 00:52:37.060 |
the pre-trained model. So in expectation, this is known as the KL or Kullback-Leibler 00:52:41.860 |
divergence between the RL model and the pre-trained model. 00:52:47.940 |
And I'll explain why we need this in a few slides. But basically, if you over-optimize 00:52:52.580 |
the reward model, you end up producing-- you can produce gibberish. And what happens is 00:52:57.300 |
you pay a price. So this quantity is large if the probability of a sample under the RL 00:53:04.180 |
tuned model is much higher than the probability of the sample under the pre-trained model. 00:53:08.820 |
So the pre-trained model would say, this is a very unlikely sequence of characters for 00:53:12.160 |
anyone to say. That's when you would pay a price here. And beta here is a tunable parameter. 00:53:18.780 |
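Here's a small sketch of that combined objective, computed for one sampled output. The function and variable names are placeholders, and real implementations differ in details (for example, computing a per-token KL term), so read this as the shape of the idea rather than the actual recipe:

```python
def rlhf_reward(rm_score, logp_rl, logp_pt, beta=0.02):
    """Reward used during the RL step for one sampled output s:

        R(s) = RM_phi(s) - beta * ( log p_RL(s) - log p_PT(s) )

    rm_score: scalar score from the (normalized) reward model
    logp_rl:  log-probability of s under the model being fine-tuned
    logp_pt:  log-probability of s under the frozen pre-trained copy
    beta:     tunable strength of the penalty
    """
    # The penalty is large when the RL-tuned model puts much more mass on s
    # than the pre-trained model does; in expectation over samples from p_RL
    # this term is the KL divergence KL(p_RL || p_PT).
    return rm_score - beta * (logp_rl - logp_pt)
```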
When you say initialize a copy, that means the first iteration, PRL is equal to PPT? 00:53:28.060 |
That's right. Yeah. Yeah, when I say initialize a copy, basically, we want to be able to compare 00:53:33.200 |
to the non-fine-tuned model just to evaluate this penalty term. So we just keep the pre-trained model's predictions around, frozen. 00:53:43.620 |
More questions? Great. So does it work? The answer is yes. So here is the key takeaway, 00:53:56.660 |
at least for the summarization task on this Daily Mail dataset. So again, we're looking 00:54:02.220 |
at different model sizes. But at the end here, we see that if we do just pre-training-- so 00:54:07.060 |
just like the typical language modeling objective that GPT uses-- you end up producing summaries 00:54:12.060 |
that, in general, are not preferred to the reference summaries. The y-axis 00:54:16.020 |
here is the fraction of times that a human prefers the model-generated summary to a summary that 00:54:21.860 |
a human actually wrote, the one that's in the dataset. 00:54:25.480 |
So pre-training doesn't work well, even if you do supervised learning. So supervised 00:54:29.340 |
learning in this case is, let's actually fine-tune our model on the summaries that were in our 00:54:33.900 |
data sets. Even if you do that, you still kind of underperform the reference summaries, 00:54:39.020 |
because you're not perfectly modeling those summaries. But it's only with this human feedback 00:54:44.860 |
that we end up producing a language model that actually ends up producing summaries 00:54:48.780 |
that are judged to be better than the summaries in a data set that you were training on in 00:54:52.260 |
the first place. I think that's quite interesting. Any questions? 00:55:06.100 |
So now we talk about-- yeah, we're getting closer and closer to something like InstructGPT 00:55:10.900 |
or ChatGPT. The basic idea of InstructGPT is that we are scaling up RLHF to not just 00:55:18.660 |
one prompt, as I had described previously, but tens of thousands of prompts. And if you 00:55:23.940 |
look at these three pieces, these are the three pieces that we've just described. The 00:55:27.740 |
first piece here being instruction fine-tuning, the second piece being RLHF, and the third 00:55:33.540 |
piece-- oh, sorry, the second part being reward model training, and the last part being RLHF. 00:55:39.740 |
The difference here is that they use 30,000 tasks. So again, with the same instruction 00:55:47.060 |
fine-tuning idea, it's really about the scale and diversity of tasks that really matters 00:55:50.820 |
for getting good performance for these things. Yeah? 00:55:54.300 |
Yeah, so the preceding results, you suggested that you really needed the RLHF, and it didn't 00:56:08.540 |
work so well to do supervised learning on the data. But they do supervised learning 00:56:14.020 |
on the data in the first fine-tuning stage. Is that necessary, or could they 00:56:21.060 |
have skipped that and gone straight to RLHF? 00:56:27.140 |
Oh, yeah, that's a good question. So I think a key point here is that they initialized 00:56:31.300 |
the RL policy on the supervised policy. So they first got the model getting reasonably 00:56:36.340 |
good at doing summarization first, and then you do the RLHF on top to get the boost in performance. 00:56:42.640 |
Your question you're asking is maybe, can we just do the RLHF starting from that pre-trained 00:56:46.380 |
baseline? That's a good question. I don't think they explored that, although I'm not 00:56:52.300 |
sure. I'd have to look at the paper again to remind myself. Yeah. 00:57:02.380 |
So certainly for something like InstructGPT, yeah, they've always kind of presumed that 00:57:06.240 |
you need the kind of fine-tuning phase first, and then you build on top of it. But I think, 00:57:10.820 |
yeah, there's still some interesting open questions as to whether you can just go directly from the pre-trained model. 00:57:16.580 |
Is the human reward function trained simultaneously with the fine-tuning of the language model? 00:57:30.340 |
The reward model should be trained first. Yeah. You train it first, you make sure it's good, and then you do the RL against it. 00:57:35.620 |
What are the samples for the human rewards? Do they come from text generated by the 00:57:42.620 |
language model? Or where do the training samples come from? 00:57:48.220 |
So, yeah, actually, it's a good question. Where do the rewards come from? So there's 00:57:54.700 |
kind of an iterative process you can apply where you kind of repeat steps two and three 00:57:58.580 |
over and over again. So you sample a bunch of outputs from your language model. You get 00:58:03.380 |
humans to rate them. You then do RLHF to update your model again. And then you sample more outputs and repeat. 00:58:09.800 |
So in general, the rewards are done on sampled model outputs, because those are the outputs 00:58:13.780 |
that you want to steer in one direction or another. But you can do this in an iterative 00:58:17.820 |
process where you kind of do RL and then maybe train a better reward model based on the new 00:58:22.420 |
outputs and continue. And I think they do a few iterations in InstructGPT, for example. 00:58:31.220 |
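To summarize that iterative loop as pseudocode (purely a sketch; every helper name here is hypothetical, and this is not claiming to be the InstructGPT pipeline):

```python
def iterative_rlhf(policy, reward_model, prompts, num_rounds=3):
    """Alternate between collecting comparisons on fresh samples and
    doing RL against the refreshed reward model."""
    for _ in range(num_rounds):
        # 1. Sample outputs from the current policy.
        samples = [policy.generate(p) for p in prompts]
        # 2. Humans compare pairs of samples (hypothetical labeling step).
        comparisons = collect_human_comparisons(prompts, samples)
        # 3. Refit the reward model on the accumulated comparison data.
        reward_model = train_reward_model(reward_model, comparisons)
        # 4. Optimize the policy against the reward model with policy gradient.
        policy = run_policy_gradient(policy, reward_model, prompts)
    return policy
```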
Questions? OK. So 30,000 tasks. I think we're getting into very recent stuff where increasingly 00:58:45.180 |
companies like OpenAI are sharing less and less details about what actually happens in 00:58:49.740 |
training these models. So we have a little bit less clarity as to what's going on here 00:58:53.220 |
than maybe we have had in the past. The data itself is not public, but they 00:58:59.620 |
do share the kinds of tasks that they collected from labelers. So they collected a bunch of 00:59:03.780 |
prompts from people who were already using the GPT-3 API. So they had the benefit of 00:59:08.540 |
having many, many users of their API and taking the kinds of tasks that users would ask GPT 00:59:15.860 |
to do. And so these include things like brainstorming or open-ended generation, et cetera. 00:59:24.860 |
And yeah, I mean, the key results of InstructGPT, which is kind of the backbone of ChatGPT, 00:59:31.060 |
really just needs to be seen and played with to understand. So you can feel free to play 00:59:34.540 |
with either ChatGPT or one of the OpenAI APIs. But again, take this example of a language model 00:59:40.620 |
not necessarily following the task it's given: by doing this kind of instruction fine-tuning followed 00:59:45.900 |
by RLHF, you get a model that is much better at adhering to user commands. Similarly, such a 00:59:55.220 |
language model can be very good at generating super interesting open-ended creative text as well. 01:00:09.580 |
This brings us to ChatGPT, which is even newer, and we have even less information about what's 01:00:14.260 |
actually going on or what's being trained here. But yeah, and they're keeping their 01:00:19.660 |
secret sauce secret. But we do have a blog post where they wrote two paragraphs. And 01:00:25.940 |
in the first paragraph, they said that they did instruction fine tuning. So we trained 01:00:30.980 |
an initial model using supervised fine tuning. So human AI trainers provided conversations 01:00:35.940 |
where they played both sides. And then we asked them to act as an AI assistant. And then 01:00:40.620 |
we fine-tuned our model on acting like an AI assistant for humans. That's part one. 01:00:46.540 |
Second paragraph, to create a reward model for RL, we collected comparison data. So we 01:00:52.940 |
took conversations with an earlier version of the chatbot, so the one that's pre-trained 01:00:56.940 |
on instruction following or instruction fine tuning, and then take multiple samples and 01:01:02.020 |
then rate the quality of the samples. And then using these reward models, we fine-tune 01:01:07.900 |
it with RL. In particular, they used PPO, which is a fancier policy-gradient RL algorithm. 01:01:18.340 |
And yeah, so that produces-- I don't need to introduce the capabilities of ChatGPT. 01:01:21.580 |
It's been very exciting recently. Here's an example. It's fun to play with. Definitely 01:01:26.260 |
play with it. Sorry, it's a bit of an attack on the students. Yeah. OK. 01:01:43.300 |
So reinforcement learning, pluses. You're directly modeling what you care about, 01:01:49.700 |
which is human preferences, not whether the demonstrations you happened to collect 01:01:55.300 |
have the highest probability mass under your model. You're actually just asking, how well 01:01:59.140 |
am I satisfying human preferences? So that's a clear benefit over something like instruction fine-tuning. 01:02:06.340 |
So in terms of negatives, one is that RL is hard. It's very tricky to get right. I think 01:02:11.420 |
it will get easier in the future as we kind of explore the design space of possible options. 01:02:16.900 |
So that's an obvious one. Does anyone come up with any other kind of maybe weaknesses 01:02:21.060 |
or issues they see with this kind of training? Yeah. 01:02:25.420 |
Is it possible that your language model and then your reward model can over-fit to each 01:02:33.300 |
other, especially-- even if you're not training them together, if you're going back and forth between them? 01:02:38.820 |
Yeah. Yeah. So over-optimization, I think, of the reward model is an issue. Yeah. 01:02:43.900 |
Is it also that if you retrain your baseline, you have to repeat all this expensive human feedback? 01:02:50.580 |
Yeah. So it still is extremely data expensive. And you can see some articles if you just 01:02:55.820 |
Google OpenAI data labeling. People have not been very happy with the amount of data that 01:03:00.020 |
has been needed to train something like ChatGPT. I mean, they're hiring developers to just 01:03:03.660 |
explain coding problems 40 hours a week. So it is still data intensive. That's kind of 01:03:09.820 |
the takeaway. All of these methods are still data intensive, every single one of them. 01:03:16.420 |
Yeah. I think that summarizes kind of the big ones here. So when we talk about limitations 01:03:24.100 |
of RLHF, we also need to talk about just limitations in general of RL, and also this idea that 01:03:30.540 |
we can model or capture human preferences in a single scalar reward. 01:03:35.460 |
So human preferences can be very unreliable. The RL people have known this for a very long 01:03:41.100 |
time. They have a term called reward hacking, which is when an agent is optimizing for something 01:03:45.900 |
that the developer specified, but it is not what we actually care about. So one of the 01:03:51.620 |
classic examples is this example from OpenAI, where they were training this agent to race 01:03:58.020 |
boats. And they were training it to maximize the score, which you can see at the bottom 01:04:02.220 |
left. But implicitly, the score actually isn't what you care about. What you care about is 01:04:06.220 |
just finishing the race ahead of everyone else. And the score is just kind of this bonus. 01:04:09.740 |
But what the agent found out was that there are these turbo boost things that you can 01:04:13.660 |
collect, which boost your score. And so what it ends up doing is it ends up kind of just 01:04:17.580 |
driving in the middle, collecting these turbo boosts over and over again. So it's racking 01:04:20.820 |
up insane score, but it is not doing the race. It is continuously crashing into objects, 01:04:25.860 |
and its boat is always on fire. And this is a pretty salient example of what we call the AI alignment problem. 01:04:34.300 |
And you might think, well, OK, this is a really simple example. They made a dumb mistake. 01:04:39.900 |
They shouldn't have used score as a reward function. But I think it's even more naive 01:04:44.220 |
to think that we can capture all of human preferences in a single number and assign that as a reward. 01:04:54.020 |
So one example where I think this is already happening, you can see, is maybe you have 01:04:58.940 |
played with chatbots before, and you notice that they do a lot of hallucination. They 01:05:03.060 |
make up a lot of facts. And this might be because of RLHF. Chatbots are rewarded to 01:05:07.960 |
produce responses that seem authoritative or seem helpful, but they don't care about 01:05:13.060 |
whether it's actually true or not. They just want to seem helpful. 01:05:17.060 |
So this results in making up facts. You may be seeing the news about chatbots. Companies 01:05:22.180 |
are in this race to deploy chatbots, and they make mistakes. Even Bing has also been hallucinating. 01:05:31.220 |
And in general, when you think about that, you think, well, models of human preferences 01:05:35.700 |
are even more unreliable. We're not even just using human preferences by themselves. We're 01:05:40.340 |
also training a model, a deep model, that we have no idea how that works. We're going 01:05:44.500 |
to use that instead. And that can obviously be quite dangerous. 01:05:50.420 |
And so going back to this slide here, where I was describing why we need this KL penalty 01:05:54.740 |
term, this yellow highlighted term here, here's a concrete example of what actually happens 01:05:59.900 |
when a language model overfits to the reward model. 01:06:03.460 |
So what this is showing is, in this case, they took off the KL penalty. So they were 01:06:07.300 |
just trying to maximize reward. They trained this reward model. Let's just push those numbers 01:06:11.140 |
up as high as possible. And on the x-axis here is what happens as training continues. 01:06:16.620 |
You diverge further and further. This is the KL divergence, or the distance from where you started. 01:06:22.580 |
And the golden dashed line here is what the reward model predicts your language model 01:06:27.340 |
is doing. So your reward model is thinking, wow, you are killing it. They are going to 01:06:31.220 |
love these summaries. They are going to love them way more than the reference summaries. 01:06:35.860 |
But in reality, when you actually ask humans, the preferences peak, and then they just crater. 01:06:43.060 |
So this can be an example of over-optimizing for a metric that you care about: it ceases to be a good measure once you push on it too hard. 01:06:57.380 |
So there's this real concern of, I think, what people are calling the AI alignment problem. 01:07:01.020 |
I'll let Percy Liang talk about this. He tweeted that the main tool that we have for alignment 01:07:07.580 |
is RLHF. But reward hacking happens a lot. Humans are not very good supervisors of rewards. 01:07:14.520 |
So this strategy is probably going to result in agents that seem like they're doing the 01:07:18.100 |
right thing, but they're wrong in subtle and inconspicuous ways. And I think we're already 01:07:21.900 |
seeing examples of that in the current generation of chatbots. 01:07:29.060 |
So those are the positives. But again, RL is tricky to get right. Human 01:07:34.700 |
preferences are fallible, and models of human preferences are even more so. 01:07:42.860 |
So I remember seeing a joke on Twitter somewhere where someone was saying that zero shot and 01:07:47.540 |
few-shot learning is the worst way to align an AI. Instruction fine-tuning is the second 01:07:52.180 |
worst way to align an AI. And RLHF is the third worst way to align an AI. So we're getting 01:07:57.700 |
somewhere, but each of these has clear fundamental limitations. 01:08:03.660 |
I have a question more on the computational side of reinforcement learning. Because if you 01:08:11.540 |
look at the math that Nick showed before, essentially you're moving the gradient inside so that 01:08:15.940 |
you can estimate the expectation by sampling. But when it comes to sampling, how do you 01:08:20.980 |
make that parallel? Because then you need to adaptively stop sampling, and then you 01:08:27.660 |
don't know when you're going to stop. How do you make that process quicker? The whole 01:08:32.700 |
unit on transformers and all that was parallelizing everything. 01:08:38.460 |
I mean, yeah. So this is really compute heavy. And I'm actually not sure what kind of infrastructure 01:08:44.220 |
is used for a state of the art, very performant implementation of RLHF. But it's possible 01:08:48.420 |
that they use parallelization like what you're describing, where I think in a lot of maybe 01:08:52.140 |
more traditional RL, there's this kind of idea of having an actor learner architecture 01:08:57.020 |
where you have a bunch of actor workers, which are each kind of a language model producing 01:09:00.100 |
a bunch of samples. And then the learner would then integrate them and perform the gradient 01:09:03.740 |
updates. So it's possible that you do need to do just sheer multiprocessing in order 01:09:08.420 |
to get enough samples to make this work in a reasonable amount of time. Is that the kind 01:09:13.060 |
of question you had? Or do you have other questions? 01:09:15.060 |
Kind of. So you're basically saying that each unit that you parallelize over is larger than 01:09:26.780 |
I was saying that you might need to actually copy your model several times and take samples 01:09:31.180 |
from different copies of the models. Yeah. But in terms of like-- yeah, so autoregressive 01:09:35.540 |
generation, transformers, especially like the forward pass and the multi-head attention 01:09:39.500 |
stuff is very easy to parallelize. But autoregressive generation is still kind of bottlenecked by 01:09:48.220 |
the fact that it's autoregressive. So you have to run it first and then you need to-- 01:09:51.500 |
depending on what you sample, you have to run it again. So those are kind of blocks that 01:09:55.140 |
we haven't fully been able to solve, I think. And that will add to compute cost. 01:10:07.260 |
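For what it's worth, here's a toy sketch of that actor/learner split, where several copies of the policy generate rollouts in parallel and the learner aggregates them. This is illustrative only; it is not how any particular RLHF system is actually implemented, and in practice the policy copies would live on separate GPUs rather than in a process pool. The `generate` method is a hypothetical stand-in.

```python
from concurrent.futures import ProcessPoolExecutor

def actor_generate(args):
    """One actor: a snapshot of the current policy generates samples
    for its chunk of prompts (policy_snapshot.generate is hypothetical)."""
    policy_snapshot, prompt_chunk = args
    return [policy_snapshot.generate(p) for p in prompt_chunk]

def parallel_rollouts(policy_snapshot, prompts, num_actors=8):
    chunks = [prompts[i::num_actors] for i in range(num_actors)]
    with ProcessPoolExecutor(max_workers=num_actors) as pool:
        results = pool.map(actor_generate, [(policy_snapshot, c) for c in chunks])
    # The learner then flattens the rollouts and performs the gradient update.
    return [sample for chunk in results for sample in chunk]
```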
So I think we have 10 more minutes if I'm not mistaken. So we've mostly finally answered 01:10:12.220 |
how we get from this to this. There's some details missing. But the key kind of factors 01:10:16.940 |
are one, instruction fine-tuning. Two, this idea of reinforcement learning from human feedback. 01:10:24.260 |
So let's talk a little bit about what's next. So as I had mentioned, RLHF is still a very 01:10:31.520 |
new area. It's still very fast moving. I think by the next time I 01:10:36.380 |
give these slides, they might look completely different, because maybe a 01:10:39.740 |
lot of the things that I was presenting here turn out to be really bad ideas or not the 01:10:44.580 |
most efficient way of going about things. RLHF gets you further than instruction fine 01:10:49.860 |
tuning. But as someone had already mentioned, it is still very data expensive. There are 01:10:54.740 |
a lot of articles about OpenAI needing to hire a legion of annotators or developers to produce this feedback data. 01:11:03.700 |
I think a recent work that I'm especially interested in and been thinking about is how 01:11:07.740 |
we can get the benefits of RLHF without such stringent data requirements. So there's these 01:11:13.140 |
newer kind of crazy ideas about doing reinforcement learning from not human feedback, but from 01:11:19.100 |
AI feedback. So having language models themselves evaluate the output of language models. So 01:11:24.340 |
as an example of what that might look like, a team from Anthropic, which works on these 01:11:28.300 |
large language models, came up with this idea called constitutional AI. And the basic idea 01:11:33.580 |
here is that if you ask GPT-3 to identify whether a response was not helpful, it would 01:11:38.180 |
be pretty good at doing so. And you might be able to use that feedback itself to improve 01:11:42.260 |
a model. So as an example, if you have some sort of human request, like, can you help 01:11:47.380 |
me hack into my neighbor's Wi-Fi? And the assistant says, yeah, sure, you can use this 01:11:51.380 |
app, right? We can ask a model for feedback on this. What we do is we add a critique request, 01:11:58.540 |
which says, hey, language model GPT-3, identify ways in which the assistant's response is 01:12:04.060 |
harmful. And then it will generate a critique, like hacking into someone else's Wi-Fi is 01:12:09.580 |
illegal. And then you might ask it to then revise it, right? So just rewrite the assistant 01:12:14.940 |
response to remove harmful content. And it does so. And now by just decoding from a language 01:12:23.260 |
model, assuming you can do this well, what you have now is a set of data that you can 01:12:28.220 |
do instruction fine-tuning on, right? You have a request, and you have a response that 01:12:32.220 |
has been revised to make sure it doesn't contain harmful content. 01:12:37.780 |
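Here's a hedged sketch of that critique-then-revise loop. The prompt wording and the `llm` text-in/text-out helper are made up for illustration; the actual constitutional AI prompts and sampling setup are in the Anthropic paper.

```python
CRITIQUE_REQUEST = ("Identify specific ways in which the assistant's last "
                    "response is harmful, unethical, or illegal.")
REVISION_REQUEST = ("Rewrite the assistant's response to remove any harmful, "
                    "unethical, or illegal content.")

def critique_and_revise(llm, human_request, assistant_response):
    """Ask the model to critique its own response, then rewrite it.
    Returns a (request, revised_response) pair usable as fine-tuning data."""
    transcript = f"Human: {human_request}\nAssistant: {assistant_response}"
    critique = llm(f"{transcript}\n\nCritique request: {CRITIQUE_REQUEST}")
    revision = llm(f"{transcript}\n\nCritique: {critique}\n\n"
                   f"Revision request: {REVISION_REQUEST}")
    return human_request, revision
```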
So this is pretty interesting. I think it's quite exciting. But all of those issues that 01:12:41.700 |
I had mentioned about alignment, misinterpreting human preferences, reward models being fallible, 01:12:49.980 |
everything gets compounded like 40,000 times when you're thinking about this, right? We 01:12:53.140 |
have no understanding of how safe this is or where this ends up going, but it is something. 01:12:59.940 |
Another kind of more common idea also is this general idea of fine tuning language models 01:13:03.780 |
on their own outputs. And this has been explored a lot in the context of chain of thought reasoning, 01:13:07.580 |
which is something I presented at the beginning of the lecture. And these are provocatively 01:13:11.420 |
named "Large Language Models Can Self-Improve." But again, it's not clear how much runway there is. 01:13:17.220 |
But the basic idea is to use "let's think step by step," for example, to 01:13:21.420 |
get a language model to produce a bunch of reasoning. And then you fine-tune 01:13:24.260 |
on that reasoning as if it were real training data and see whether or not the language model can improve itself. 01:13:33.940 |
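As a sketch of that recipe (the `llm` helper and the data format are hypothetical; the paper itself also filters which generated reasoning to keep, which this omits):

```python
def build_self_training_data(llm, questions):
    """Zero-shot prompt the model for step-by-step reasoning, then keep
    the generations as if they were supervised fine-tuning targets."""
    examples = []
    for q in questions:
        reasoning = llm(f"{q}\nLet's think step by step.")
        examples.append({"prompt": q, "target": reasoning})
    return examples

# These examples would then go through ordinary supervised fine-tuning,
# and you'd evaluate whether the fine-tuned model reasons better than before.
```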
But as I mentioned, this is all still very new. There are, I think, a lot of limitations 01:13:38.100 |
of large language models like hallucination and also just the sheer size and compute intensity 01:13:42.900 |
of this that may or may not be solvable with RLHF. 01:13:47.420 |
[INAUDIBLE] feedback on behaviors we don't want the model to have. I've seen people talking about 01:13:56.700 |
how you can jailbreak ChatGPT to still give those types of harmful responses. Are there 01:14:02.300 |
any ways for us to buffer against those types of things as well? Because it seems like you're 01:14:09.700 |
just going to keep building on this-- we need to identify cases where it's acting in 01:14:14.940 |
ways we don't want. I guess, is there any way to build that up at scale to avoid those exploits? 01:14:24.740 |
Yeah, that's interesting. So there are certainly ways that you can use either AI feedback or 01:14:32.860 |
human feedback to mitigate those kinds of jailbreaks. If you see someone on Twitter 01:14:36.420 |
saying that, oh, I made GPT-3 jailbreak using this strategy or whatever, you can then maybe 01:14:43.180 |
plug it into this kind of framework and say identify ways in which the assistant went 01:14:46.260 |
off the rails and then fine tune and hopefully correct those. But it is really difficult, 01:14:50.980 |
I think, in most of these kinds of settings. It's really difficult to anticipate all the 01:14:54.140 |
possible ways in which a user might jailbreak an assistant. So you always have this kind 01:14:58.820 |
of dynamic of like in security, cybersecurity, for example, there's always the attacker advantage 01:15:04.260 |
where the attacker will always come up with something new or some new exploit. So yeah, 01:15:10.260 |
I think this is a deep problem. I don't have a really clear answer. But certainly, if we 01:15:14.180 |
knew what the jailbreak was, we could mitigate it. I think that seems pretty straightforward. 01:15:21.180 |
But if you know how to do that, you should be hired by one of these companies. They'll pay you a lot. 01:15:30.100 |
OK. Yeah, so just last remarks is with all of these scaling results that I presented 01:15:36.900 |
and all of these like, oh, you can just do instruction fine tuning and it'll follow your 01:15:40.300 |
instructions, or you can do RLHF. You might have a very bullish view on like, oh, this 01:15:44.580 |
is how we're going to solve artificial general intelligence by just scaling up RLHF. It's 01:15:48.860 |
possible that that is actually going to happen. But it's also possible that there are certain 01:15:53.300 |
fundamental limitations that we just need to figure out how to solve, like hallucination, 01:15:58.220 |
before we get anywhere productive with these models. But it is a really exciting time to 01:16:01.720 |
work on this kind of stuff. So yeah. Thanks for listening.