
Stanford CS224N | 2023 | Lecture 10 - Prompting, Reinforcement Learning from Human Feedback


Transcript

Okay, awesome. We're going to get started. So my name is Jesse Mu. I'm a PhD student in the CS department here working with the NLP group, and I'm really excited to be talking about the topic of today's lecture, which is on prompting, instruction fine-tuning, and RLHF. So this is all stuff that has been super hot recently because of all the latest excitement about chatbots, ChatGPT, etc.

And we're going to hopefully get somewhat of an understanding as to how these systems are trained. Okay, so before that, some course logistics things. So project proposals, both custom and default, were due a few minutes ago. So if you haven't done that, this is a nice reminder. We're in the process of assigning mentors to projects, so we'll give feedback soon.

Besides that, assignment five is due on Friday at midnight. We still recommend using Colab for the assignments, even if you've had AWS or Azure credits granted. If that doesn't work, there are instructions for how to connect to a Kaggle notebook where you will also be able to use GPUs. Look for that post on Ed.

And then finally, also just posted on Ed by John is a course feedback survey. So this is part of your participation grade. Please fill that in by Sunday, 11:59 p.m. Okay, so let's get into this lecture, which is going to be about what we are trying to do with these larger and larger models.

Over the years, the compute for these models has gone up by many orders of magnitude, and they're trained on more and more data. So larger and larger models, seeing more and more data. And in the pre-training lecture, if you recall this slide, we talked a little bit about what happens when you do pre-training.

And as you begin to really learn to predict missing words in text, right? You learn things like syntax, coreference, sentiment, et cetera. But in this lecture, we're going to take it a little bit further and really take this idea to its logical conclusion. So if you really follow this idea of we're just going to train a giant language model on all of the world's text, you really begin to see language models, in a way, as rudimentary world models.

So maybe they're not very good world models, but they kind of have to be doing some implicit world modeling just because we have so much information on the internet and so much of human collective knowledge is transcribed and written for us on the internet, right? So if you are really good at predicting the next word in text, what do you learn to do?

There's evidence that these large language models are to some degree learning to represent and think about agents and humans and the beliefs and actions that they might take. So here's an example from our recent paper where we are talking about someone named Pat watching a demonstration of a bowling ball and a leaf being dropped at the same time in a vacuum chamber.

And the idea here is we're saying Pat is a physicist, right? So Pat is a physicist, and we ask for the language model's next continuation of this sentence. Because he's a physicist, the model does some inference about what kind of knowledge Pat has, and Pat will predict that the bowling ball and the leaf will fall at the same time.

But if we change the sentence of the prompt and we say, well, Pat has actually never seen this demonstration before, then Pat will predict that the bowling ball will fall to the ground first, which is wrong, right? So if you get really good at predicting the next sentence in text, you also to some degree have to learn to predict an agent's beliefs, their backgrounds, common knowledge and what they might do next.

So not just that, of course. If we continue browsing the internet, we see a lot of encyclopedic knowledge. So maybe language models are actually good at solving math reasoning problems if they've seen enough demonstrations of math on the internet. Code, of course, code generation is a really exciting topic that people are looking into, and we'll give a presentation on that in a few weeks.

Even medicine, right? We're beginning to think about language models trained on medical texts and being applied to the sciences and whatnot. So this is what happens when we really take this language modeling idea seriously. And this has resulted in a resurgence of interest in building language models that are basically assistants, right?

You can give them any task under the sun. I want to create a three-course meal, and a language model should be able to take a good stab at doing this. This is kind of the promise of language modeling. But of course, there are a lot of steps required to get from our basic language modeling objective to something like this.

And that's what this lecture is going to be about. So how do we get from just predicting the next word in a sentence to something like ChatGPT, which you can really ask to do anything, and it might fail sometimes, but it's getting really, really convincingly good at some things.

Okay. So this is the lecture plan. Basically, as we're working with these large language models, we come up with increasingly complex ways of steering the language models closer and closer to something like ChatGPT. So we'll start with zero-shot and few-shot learning, then instruction fine-tuning, and then reinforcement learning from human feedback, or RLHF.

Okay. So let's first talk about few shot and zero shot learning. And in order to do so, we're again going to kind of build off of the pre-training lecture last Tuesday. So in the pre-training lecture, John talked about these models like GPT, generative pre-trained transformer, that are these decoder only language models.

So they're just trained to predict the next word in a corpus of text. And back in 2018 was the first iteration of this model. And it was 117 million parameters. So at the time it was pretty big. Nowadays, it's definitely on the small side. And again, it's just a vanilla transformer decoder using the techniques that you've seen.

And it's trained on a corpus of books. So about 4.6 gigabytes of text. And what GPT showed was that this simple language modeling objective can serve as an effective pre-training technique for various downstream tasks that you might care about. So if you wanted to apply it to something like natural language inference, you would take your premise sentence and your hypothesis sentence, concatenate them, and then maybe train a linear classifier on the last representation the model produces.

OK, but that was three, four, five years ago. What has changed since then? So they came out with GPT-2. So GPT-2 was released the next year in 2019. This is 1.5 billion parameters. So it's the same architecture as GPT, but just an order of magnitude bigger. And also trained on much more data.

So we went from 4 gigabytes of books to 40 gigabytes of internet text data. So they produced a data set called WebText. This was produced by scraping links that were posted on Reddit. So the idea is that the web contains a lot of spam, maybe a lot of low-quality information.

But they took links that were posted on Reddit that had at least a few upvotes. So humans maybe looked through it and said, you know, this is a useful post. So that was kind of a rough proxy of human quality. And that's how they collected this large data set.

And so if you look at the size of GPT in 2018, we can draw a bigger dot, which is the size of GPT-2 in 2019. And one might ask, how much better does this do? What is this value? So the authors of GPT-2 titled their paper, "Language Models are Unsupervised Multitask Learners." And that kind of gives you a hint as to what the key takeaway they found was, which is this unsupervised multitasking part.

So basically, I think the key takeaway from GPT-2 was this idea that language models can display zero-shot learning. So what I mean by zero-shot learning is you can do many tasks that the model may not have actually explicitly been trained for with no gradient updates. So you just kind of query the model by simply specifying the right sequence prediction problem.

So if you care about question answering, for example, you might include your passage, like a Wikipedia article about Tom Brady. And then you'll add a question, so a question, where was Tom Brady born? And then include an answer, like A followed by a colon. And then just ask the model to predict the next token.
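Concretely, a zero-shot QA prompt of this kind might look something like the following sketch. The passage text and exact formatting here are made up for illustration; GPT-2 was never trained with a specific QA template, we're just framing QA as next-token prediction.

```python
# Hypothetical example of the zero-shot QA prompt format described above.
# The passage and layout are invented for illustration.
prompt = (
    "Tom Brady is an American football quarterback. He was born in "
    "San Mateo, California, and played college football at Michigan.\n"
    "Q: Where was Tom Brady born?\n"
    "A:"
)
# Feeding `prompt` to the language model and decoding the next few tokens
# would, ideally, continue with something like " San Mateo, California".
```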

You've kind of jury-rigged the model into doing question answering. For other tasks, like classification tasks, another thing you can do is compare different probabilities of sequences. So this task is called the Winograd Schema Challenge. It's a pronoun resolution task. So the task is to kind of resolve a pronoun which requires some world knowledge.

So one example is something like, the cat couldn't fit into the hat because it was too big. And the question is whether it refers to the cat or to the hat. And in this case, it makes most sense for it to refer to the cat, because when something doesn't fit into something else because it's too big, it's the thing being fitted that's too big. You need to use some world knowledge to resolve that.

So the way that you get zero-shot predictions for this task out of a language model like GPT-2 is you just ask the language model, which sequence is more likely? Is the probability of the cat couldn't fit into the hat because the cat was too big deemed more likely by the language model than the probability that the cat couldn't fit into the hat because the hat was too big?
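As a rough sketch of how that sequence scoring could be done, here's a minimal example assuming the Hugging Face transformers library with GPT-2 as a stand-in model:

```python
# Minimal sketch of zero-shot Winograd-style classification by comparing
# the total log-probability a language model assigns to each candidate.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum of log-probabilities the model assigns to the tokens of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean negative
        # log-likelihood per predicted token, so multiply back out.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

cand_a = "The cat couldn't fit into the hat because the cat was too big."
cand_b = "The cat couldn't fit into the hat because the hat was too big."
prediction = "cat" if sequence_log_prob(cand_a) > sequence_log_prob(cand_b) else "hat"
print(prediction)
```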

You can score those sequences because this is a language model. And from there, you get your zero-shot prediction. And you can end up doing fairly well on this task. Any questions about this? OK. Yeah, so digging a little bit more into the results, GPT-2 at the time beat the state of the art on a bunch of language modeling benchmarks with no task-specific fine-tuning.

So no traditional fine-tuning on a training set and then testing on a test set. So here's an example of such a task. This is a language modeling task called LAMBADA, where the goal is to predict a missing word. And the idea is that the word that you need to predict depends on some discourse earlier in the sentence or a few sentences ago.

And by simply training your language model and then running it on the LAMBADA task, you end up doing better than the supervised fine-tuned state of the art at the time, and across a wide variety of other tasks as well. OK. Another kind of interesting behavior they observed-- and so you'll see hints of things that we now take for granted in this paper-- is that you can get interesting zero-shot behavior as long as you take some liberties with how you specify the task.

So for example, let's imagine that we want our model to do summarization. Even though GPT-2 was just a language model, how can we make it do summarization? The idea they explored was we're going to take an article, some news article, and then at the end, we're going to append the TLDR sign, the TLDR token.

So this stands for Too Long Didn't Read. It's used a lot on Reddit to just say, if you didn't want to read the above stuff, here's a few sentences that summarizes it. So if you ask the model to predict what follows after the TLDR token, you might expect that it'll generate some sort of summary.

And this is kind of early whispers at this term that we now call prompting, which is thinking of the right way to define a task such that your model will do the behavior that you want it to do. So if we look at the performance we actually observed on this task, here at the bottom is a random baseline.

So you just select three sentences from the article. And the scores that we're using here are ROUGE scores, if you remember the natural language generation lecture. GPT-2 is right above. So it's not actually that good. It only does maybe a little bit or barely any better than the random baseline.

But it is approaching supervised approaches that are actually explicitly fine-tuned to do summarization. And of course, at the time, it still underperformed the state of the art. But this really showed the promise of getting language models to do things that maybe they weren't trained to do.

OK, so that was GPT-2. That was 2019. Now here's 2020, GPT-3. So GPT-3 is 175 billion parameters. So it's another increase in size by an order of magnitude. And at the time, it was unprecedented. I think it still is kind of overwhelmingly large for most people. And data. So they scaled up the data once again.

OK, so what does this buy you? This paper's title was Language Models are Few-Shot Learners. So what does that mean? So the key takeaway from GPT-3 was emergent few-shot learning. So the idea here is, sure, GPT-3 can still do zero-shot learning. But now you can specify a task by basically giving examples of the task before asking it to predict the example that you care about.

So this is often called in-context learning to stress that there are no gradient updates being performed when you learn a new task. You're basically kind of constructing a tiny little training data set and just including it in the prompt, including it in the context window of your transformer, and then asking it to pick up on what the task is and then predict the right answer.
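For instance, a few-shot prompt might look like the sketch below. The translation demonstrations are in the style of the GPT-3 paper's figures; the exact strings are illustrative.

```python
# Illustrative few-shot (in-context learning) prompt: a tiny "training set"
# is placed directly in the context window; no gradient updates are made.
few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
# The frozen model is asked to continue the pattern, ideally producing
# " fromage" as the next tokens.
```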

And this is in contrast to a separate literature on few-shot learning, which assumes that you can do gradient updates. In this case, it's really just a frozen language model. So few-shot learning works, and it's really impressive. So here's a graph. SuperGLUE here is a kind of a wide coverage natural language understanding benchmark.

And what they did was they took GPT-3, and this data point here is what you get when you just do zero-shot learning with GPT-3. So you provide an English description of a task to be completed, and then you ask it to complete the task. Just by providing one example, so one shot, you get like a 10% accuracy increase.

So you give not only the natural language task description, but also an example input and an example output, and you ask it to complete the next output. And as you increase to more shots, you do get better and better scores, although, of course, you get diminishing returns after a while.

But what you can notice is that few-shot GPT-3, so no gradient updates, is doing as well as or outperforming BERT fine-tuned on the SuperGLUE task explicitly. Any questions about this? So one thing that I think is really exciting is that you might think, OK, a few-shot learning, whatever, it's just memorizing.

Maybe there's a lot of examples of needing to do a few-shot learning in the internet text data. And that's true, but I think there's also evidence that GPT-3 is really learning to do some sort of on-the-fly optimization or reasoning. And so the evidence for this comes in the form of these synthetic word unscrambling tasks.

So the authors came up with a bunch of simple letter manipulation tasks that are probably unlikely to exist in internet text data. So these include things like cycling through the letters to get the uncycled version of a word, so converting from "pleap" to "apple", removing random characters that have been inserted into a word, or even just reversing words.

And what you see here is performance on these few-shot tasks as you increase the model size. And what you can see is that the ability to do few-shot learning is kind of an emergent property of model scale. So only at the very largest model size are we actually seeing a model be able to do this purely in context.

Question? Yeah. I've noticed the performance on the reversed words is horrible. Yeah. Yeah. So the question was about the reversed words, where performance is still low. Yeah, that's an example of a task that these models still can't solve yet, although I'm not sure if we've evaluated it with newer and newer models.

Maybe the latest versions can indeed actually solve that task. Yeah, question? Is there some intuition for why this emerges as a result of model scale? I think that's a highly active area of research, and there's been papers published every week on this. So I think there's a lot of interesting experiments that really try to dissect either with synthetic tasks, like can GPT-3 learn linear regression in context?

And there's some model interpretability tasks, like what in the attention layers or what in the hidden states are resulting in this kind of emergent learning. But yeah, I'd have to just refer you to the recent literature on that. Anything else? Awesome. Okay, so just to summarize, traditional fine tuning here is on the right.

We take a bunch of examples of a task that we care about. We give it to our model, and then we do a gradient step on each example. And then at the end, we hopefully get a model that can do well on some outputs. And in this new kind of paradigm of just prompting a language model, we just have a frozen language model, and we just give some examples and ask the model to predict the right answer.

So you might think, and you'd be right, that there are some limits of prompting. Well, there's a lot of limits of prompting, but especially for tasks that are too hard. There are a lot of tasks that maybe seem too difficult, especially ones that involve maybe richer reasoning steps or needing to synthesize multiple pieces of information.

And these are tasks that humans struggle with too. So one example is GPT-3. I don't have the actual graph here, but it was famously bad at doing addition with larger numbers of digits. And so if you prompt GPT-3 with a bunch of examples of addition, it won't get it correct.

But part of the reason is because humans are also pretty bad at doing this in one step. Like if I asked you to just add these two numbers on the fly and didn't give you a pencil and paper, you'd have a pretty hard time with it. So one observation is that you can just change the prompts and hopefully get some better performance out of this.

So there's this idea of doing chain of thought prompting, where in standard prompting, we give some examples of a task that we'd like to complete. So here is an example of a math word problem. And I told you that what we would do is we would give the question and then the answer.

And then for a data point that we actually care about, we ask the model to predict the answer. And the model will try to produce the right answer, and it's just wrong. So the idea of chain of thought prompting is to actually demonstrate what kind of reasoning you want the model to complete.

So in your prompt, you not only put the question, but you also put an answer and the kinds of reasoning steps that are required to arrive at the correct answer. So here is actually some reasoning of how you actually would answer this tennis ball question and then get the right answer.

And because the language model is incentivized to just follow the pattern and continue the prompt, if you give it another question, it will in turn produce a rationale followed by an answer. So you're kind of asking the language model to work through the steps itself. And by doing so, you end up getting some questions right when you otherwise might not.
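A chain-of-thought prompt in this style might look like the following sketch. The wording is adapted from the tennis-ball example in the Wei et al. chain-of-thought paper; the second question is illustrative.

```python
# Sketch of a chain-of-thought prompt: the demonstration includes a worked
# rationale, so the model tends to continue with its own rationale before
# stating an answer.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
# Ideally the model reasons its way to "23 - 20 + 6 = 9. The answer is 9."
```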

So a super simple idea, but it's shown to be extremely effective. So here is this middle school math word problems benchmark. And again, as we scale up the model for GPT and some other kinds of models, being able to do chain of thought prompting emerges. So we really see a performance approaching that of supervised baselines for these larger and larger models.

Questions? Yeah. Going back to the problem with the addition of large numbers, do you have results on how chain-of-thought prompting does for those larger numbers, like in the middle school math word problems? Yeah. So the question is, does chain-of-thought prompting work for those addition problems that I had presented?

Yeah. There should be some results in the actual paper. They're just not here, but you can take a look. Yeah. Any intuition for how the model is learning without doing gradient updates? Intuition about how the model is learning without gradient updates, yeah. So this is related to the question asked earlier about how this is actually happening.

That is, yeah, again, it's an active area of research. So my understanding of the literature is something like you can show that models are kind of almost doing in-context gradient descent as it's encoding a prompt. And you can analyze this with model interpretability experiments. But I'm happy to suggest papers afterwards that kind of deal with this problem more carefully.

Cool. Okay. So a follow up work to this asked the question of, do we actually even need examples of reasoning? Do we actually even need to collect humans working through these problems? Can we actually just ask the model to reason through things? Just ask it nicely. So this introduced this idea called zero shot chain of thought prompting.

And it was honestly, I think, probably the highest impact-to-simple-idea ratio I've seen in a paper, where it's the simplest possible thing: instead of doing this chain-of-thought stuff with demonstrations, you just ask the question, and then before the answer, you first prepend the phrase, let's think step by step.
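So the whole trick is roughly this (the question here is adapted from the figure in the Kojima et al. paper):

```python
# Zero-shot chain-of-thought: no worked demonstrations at all, just a
# trigger phrase prepended to the answer.
zero_shot_cot_prompt = (
    "Q: A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are "
    "there?\n"
    "A: Let's think step by step."
)
# The model continues with its own reasoning ("There are 16 / 2 = 8 golf
# balls, and half of them, 4, are blue...") before stating the final answer.
```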

And the model will decode as if it had said, let's think step by step. And it will work through some reasoning and produce the right answer. So does this work? On some arithmetic benchmarks, here's what happens when you prompt the model just zero-shot, so just asking it to produce the answer right away without any reasoning.

Few-shot is giving some examples of inputs and outputs. And this is zero-shot chain of thought. So just by asking the model to think through things, you get crazy good accuracy. When we compare to actually doing manual chain of thought, you still do better with manual chains of thought.

But that just goes to show you how simple of an idea this is while still producing improved performance numbers. So the funny part of this paper was, why use "let's think step by step"? They actually tried out a lot of prompts. So here's zero-shot baseline performance.

They tried out a bunch of different prefixes: "The answer is after the proof." "Let's think." "Let's think about this logically." And they found that "let's think step by step" was the best one. It turns out this was actually built upon later in the year, where they actually used a language model to search through the best possible strings that would maximize performance on this task, which is probably gross overfitting.

But the best prompt they found was, "Let's work this out in a step by step way to be sure we have the right answer." So the "right answer" part is presuming that you get the answer right. It's like giving the model some confidence in itself.

So this might seem to you like a total dark arcane art. And that's because it is. We really have no intuition as to what's going on here. Or we're trying to build some intuition. But as a result, and I'm sure you've seen if you spend time in tech circles or you've seen on the internet, there's this whole new idea of prompt engineering being an emergent science and profession.

So this includes things like asking a model for reasoning. It includes jailbreaking language models, telling them to do things that they otherwise aren't trained to do. Even AI art like DALL-E or Stable Diffusion, this idea of constructing these really complex prompts to get the model outputs that you want.

That's also prompting. Anecdotally, I've heard of people saying, I'm going to use a code generation model, but I'm going to include the Google code header first, because that will produce more professional or bug-free code, depending on how much you believe in Google. But yeah, there's a Wikipedia article on this now, and there are even startups that are hiring for prompt engineers, and they pay quite well.

So if you want to be a prompt engineer, definitely practice your GPT whispering skills. We have a question? Sorry. Yes, you go. Yeah, go ahead. A few slides ago, you showed a prompt an LM designed that was like this long. How can you get the LM to design an input like that? I think they treated it like a reinforcement learning problem.

But I'll just direct you to this paper at the bottom to learn more details. Yeah, I think it's the Zhou et al. 2022 paper. Yeah. Yeah, question? So I'm just a bit curious about how they provided feedback. So in case the model was not giving the right answer, were there prompts to say that that's not right?

Maybe think about this with a different approach. How is feedback provided? There's no notion of feedback in these chain-of-thought prompting experiments. If the model gets the answer wrong, then it gets the answer wrong, and we just evaluate accuracy. Right. But this idea of incorporating feedback, I think, is quite interesting, and I think you'll see some hints of discussion of that later on.

Yeah. Questions? Okay, awesome. Okay, so talking about these three things, I'm going to talk about the benefits and limitations of the various different things that we could be doing here. So for zero shot and few shot in context learning, the benefit is you don't need any fine tuning and you can carefully construct your prompts to hopefully get better performance.

The downsides are there are limits to what you can fit in context. Transformers have a fixed context window of say 1,000 or a few thousand tokens. And I think, as you will probably find out, for really complex tasks, you are indeed going to need some gradient steps. So you're going to need some sort of fine tuning.

But that brings us to the next part of the lecture. So that's instruction fine tuning. Okay, so the idea of instruction fine tuning is that, sure, these models are pretty good at doing prompting. You can get them to do really interesting things. But there is still a problem, which is that language models are trained to predict the most likely continuation of tokens.

And that is not the same as what we want language models to do, which is to assist people. So as an example, if I give GPT-3 this kind of prompt, explain the moon landing, GPT-3 is trained to predict, you know, if I saw this on the internet somewhere, what is the most likely continuation?

Well, maybe someone was coming up with a list of things to do with a six year old. So it's just predicting a list of other tasks, right? It's not answering your question. And so the issue here is that language models are not, the term is aligned with user intent.

So how might we better align models with user intent for this case? Well, super simple answer, right? We're machine learners. Let's do machine learning. So we're going to ask a human, give me the right answer, right? Give me the way that a language model should respond according to this prompt.

And let's just do fine tuning. So this is a slide from the pre-training lecture. Again, pre-training can improve NLP applications by serving as parameter initialization. So this kind of pipeline, I think you are familiar with. And the difference here is that instead of fine tuning on a single downstream task of interest, like sentiment analysis, what we're going to do is we're going to fine tune on many tasks.

So we have a lot of tasks and the hope is that we can then generalize to other unseen tasks at test time. So as you might expect, data and scale is kind of key for this to work. So we're going to collect a bunch of examples of instruction output pairs across many tasks and then fine tune our language model and then evaluate generalization to unseen tasks.
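As a minimal sketch of what that fine-tuning loop amounts to, here's an example assuming the Hugging Face transformers library, a small GPT-2 as a stand-in model, and two invented instruction-output pairs. Real instruction tuning uses thousands of tasks and millions of examples, and typically masks the loss on the instruction tokens, but the core loop is just ordinary language model fine-tuning on (instruction, output) text.

```python
# Toy instruction-tuning loop: fine-tune the LM on instruction-output pairs
# drawn from many different tasks, using the standard next-token loss.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# (instruction, desired output) pairs; invented here for illustration.
examples = [
    ("Translate to French: Where is the library?", "Où est la bibliothèque ?"),
    ("Is the sentiment positive or negative? 'I loved this movie.'", "Positive"),
]

model.train()
for instruction, output in examples:
    text = instruction + "\n" + output + tokenizer.eos_token
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # standard next-token prediction loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```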

Yeah, so data and scale are important. So as an example, one recent data set that was published for this is called the Super-NaturalInstructions dataset. It contains over 1,600 tasks and over 3 million examples. So this includes translation, question answering, question generation, even coding, mathematical reasoning, etc. And when you look at this, you really begin to think, well, is this actually fine-tuning, or is this just more pre-training?

And it's actually both. We're kind of blurring the lines here, where at the scale that we're training on, it's still general, but it's a slightly more specific pre-training task than pure language modeling. So one question I have is, now that we are training our model on so many tasks, how do we evaluate such a model?

Because you can't really say, OK, can you now do sentiment analysis well? The scale of tasks we want to evaluate this language model on is much greater. So just as a brief aside, a lot of research has been going into building up these benchmarks for these massive multitask language models and seeing to what degree they can do not just one task, but a whole variety of tasks.

So this is the Massive Multitask Language Understanding Benchmark or MMLU. It consists of a bunch of benchmarks for measuring language model performance on a bunch of knowledge intensive tasks that you would expect a high school or college student to complete. So you're testing a language model not only on sentiment analysis, but on astronomy and logic and European history.

And here are some numbers where, at the time, GPT-3 is not that good, but it's certainly above a random baseline on all of these tasks. Here's another example. So this is the Beyond the Imitation Game Benchmark, or BIG-bench. This has like a billion authors because it was a huge collaborative effort.

And this is a word cloud of the tasks that were evaluated. And it really contains some very esoteric tasks. So this is an example of one task included where you have to, given a kanji or Japanese character in ASCII art, you need to predict the meaning of the character.

So we're really stress testing these language models. OK, so instruction fine-tuning, does it work? Recall the T5 encoder-decoder model. So this is kind of Google's encoder-decoder model, where it's pre-trained on this span corruption task. So if you don't remember that, you can refer back to that lecture.

But the authors released a newer version called FLAN-T5. So FLAN stands for fine-tuning language models. And this is T5 models trained on an additional 1,800 tasks, which include the natural instructions data that I just mentioned. And if we average across both the BIG-bench and MMLU performance and normalize it, what we see is that instruction fine-tuning works.

And crucially, the bigger the model, the bigger the benefit that you get from doing instruction fine-tuning. So it's really the large models that stand to benefit the most from fine-tuning. And you might look at this and say, this is kind of sad for academics or anyone without a massive GPU cluster.

It's like, who can run an 11-billion-parameter model? I guess the one silver lining, if you look at the results here, is the 80-million-parameter model, which is the smallest one. If you look at it after fine-tuning, it ends up performing about as well as the un-fine-tuned 11-billion-parameter model.

So there are a lot of examples in the literature of smaller instruction-fine-tuned models outperforming larger models that are many, many times their size. So hopefully there's still some hope for people with just a few GPUs. Any questions? Awesome. In order to really understand the capabilities, I highly recommend that you just try it out yourself.

So FLAN-T5 is hosted on Hugging Face. I think Hugging Face has a demo where you can just type in a little query, ask it to do anything, see what it does. But there are qualitative examples of this working. So for questions where a non-instruction-fine-tuned model will just kind of waffle on and not answer the question.

Doing instruction fine tuning will get your model to much more accurately reason through things and give you the right answer. OK. So that was instruction fine tuning. Positives of this method. Super simple, super straightforward. It's just doing fine tuning. And you see this really cool ability to generalize to unseen tasks.

In terms of negatives, does anyone have any ideas for what might be downsides of instruction fine-tuning? It seems like it suffers from the same negatives as any human-sourced data. It's hard to get people to provide the input, and different people might provide different inputs.

Yeah, yeah, exactly. So comments are, well, it's hard and annoying to get human labels and it's expensive. That's something that definitely matters. And that last part you mentioned about there might be, you know, humans might disagree on what the right label is. Yeah, that's increasingly a problem. Yeah. So what are the limitations?

The obvious limitation is money. Collecting ground truth data for so many tasks costs a lot of money. Subtler limitations include the one that you were mentioning. So as we begin to ask for more creative and open-ended tasks from our models, right, there are tasks where there is no right answer.

And it's a little bit weird to say, you know, this is an example of how to write some story, right? So write me a story about a dog and her pet grasshopper. Like there is not one answer to this, but if we were only to collect one or two demonstrations, the language modeling objective would say you should put all of your probability mass on the two ways that two humans wrote this answer, right?

When in reality, there's no right answer. Another problem, which is related kind of fundamentally to language modeling in the first place, is that language modeling as an objective penalizes all token level mistakes equally. So what I mean by that is if you were asking a language model, for example, to predict the sentence, "Avatar is a fantasy TV show," and you were asking it, and let's imagine that the LM mispredicted adventure instead of fantasy, right?

So adventure is a mistake. It's not the right word, but it is equally as bad as if the model were to predict something like musical, right? But the problem is that "Avatar is an adventure TV show" is still true, right? So it's not necessarily a bad thing, whereas "Avatar is a musical" is just false.

So under the language modeling objective, right, if the model were equally confident, you would pay the equal penalty, an equal loss penalty for predicting either of those tokens wrong. But it's clear that this objective is not actually aligned with what users want, which is maybe truth or creativity or generally just this idea of human preferences, right?
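In terms of the objective, the point is that the standard language modeling loss only looks at the probability assigned to the reference token, not at how bad the alternative the model preferred actually is:

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t} \log p_\theta\!\left(w_t \mid w_{<t}\right)
```

So mispredicting "adventure" and mispredicting "musical" in place of "fantasy" incur the same penalty if the model assigns them equal probability, even though only one of them yields a false sentence.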

Yeah, question. Could we do something like multiply the penalty by the distance in word embedding space in order to reduce this? Because musical would have a higher distance away than adventure. Yeah, that's an interesting question. It's an interesting idea. I haven't heard of people doing that, but it seems plausible.

I guess one issue is you might come up with adversarial settings where maybe the word embedding distance is also not telling you the right thing, right? So for example, show and musical maybe are very close together because they're both shows or things to watch, but they differ in veracity, right?

They're completely different. One is true, one is false, right? So yeah, you can try it, although I think there might be some tricky edge cases like that. Cool. Okay, so in the next part of the talk, we're going to actually explicitly try to satisfy human preferences and come with a mathematical framework for doing so.

And yeah, so these are the limitations, as I had just mentioned. So this is where we get into reinforcement learning from human feedback. Okay, so RLHF. So let's say we were training a language model on some task like summarization. And let's imagine that for each language model sample S, we had a way to obtain a human reward of that summary.

So we could score this summary with a reward function, which we'll call R of S, and the higher the reward, the better. So let's imagine we're summarizing this article, and we have this summary, which maybe is pretty good, let's say. We had another summary, maybe it's a bit worse.

And if we were able to ask a human to just rate all these outputs, then the objective that we want to maximize or satisfy is very obvious. We just want to maximize the expected reward of samples from our language model, right? So in expectation, as we take samples from our language model, P theta, we just want to maximize the reward of those samples.
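Written out, the objective being described is roughly:

```latex
\max_{\theta} \; \mathbb{E}_{\hat{s} \sim p_\theta(s)}\!\left[ R(\hat{s}) \right]
```

where p_theta is the language model we're sampling from and R is the (human) reward on a sampled output, here a summary.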

Fairly straightforward. So for mathematical simplicity here, I'm kind of assuming that there's only one task or one prompt, right? So let's imagine we were just trying to summarize this article, but we could talk about how to extend it to multiple prompts later on. Okay, so this kind of task is the domain of reinforcement learning.

So I'm not going to presume any knowledge of reinforcement learning, although I'm sure some of you are quite familiar with it, probably even more familiar than I am. But the field of reinforcement learning has studied these kinds of problems, these optimization problems where you're optimizing an objective that you can only sample from, for many years now.

And in 2013, there was a resurgence of interest in reinforcement learning for deep learning specifically. So you might have seen these results from DeepMind about an agent learning to play Atari games, an agent mastering Go much earlier than expected. But interestingly, I think the interest in applying reinforcement learning to modern LMs is a bit newer, on the other hand.

And I think the kind of earliest success story or one of the earliest success stories was only in 2019, for example. So why might this be the case? There's a few reasons. I think in general, the field had kind of this sense that reinforcement learning with language models was really hard to get right, partially because language models are very complicated.

And if you think of language models as actors that have an action space where they can spit out any sentence, that's a lot of sentences. So it's a very complex space to explore. So it still is a really hard problem. So that's part of the reason. But also practically, I think there have been these newer algorithms that seem to work much better for deep neural models, including language models.

And these include algorithms like proximal policy optimization. But we won't get into the details of that for this course. But these are kind of the reasons why there's been renewed interest in this idea of doing RL with language models. So how do we actually maximize this objective? I've written it down. And ideally, we should just change our parameters theta so that reward is high. But it's not really clear how to do so.

And ideally, we should just change our parameters data so that reward is high. But it's not really clear how to do so. So when we think about it, I mean, what have we learned in the class thus far? We know that we can do gradient descent or gradient ascent.

So let's try doing gradient ascent. We're going to maximize this objective. So we're going to step in the direction of steepest gradient. But this quickly becomes a problem, which is, what is this quantity and how do we evaluate it? How do we estimate this expectation, given that the parameters theta that we're taking the gradient with respect to appear in the sampling distribution of the expectation?

And the second is what if our reward function is not differentiable? Like human judgments are not differentiable. We can't back prop through them. And so we need this to be able to work with a black box reward function. So there's a class of methods in reinforcement learning called policy gradient methods that gives us tools for estimating and optimizing this objective.

And for the purposes of this course, I'm going to try to describe the highest level possible intuition for this, which looks at the math and shows what's going on here. But it is going to omit a lot of the details. And a full treatment of reinforcement learning is definitely outside of the scope of this course.

So if you're more interested in this kind of content, you should check out CS234 Reinforcement Learning, for example. And in general, I think this is going to get a little mathy, but it's totally fine if you don't understand it. We will talk, we'll regroup at the end and just show what this means for how to do RLHF.

But what I'm going to do is just describe how we actually estimate this objective. So we want to obtain this gradient. So it's the gradient of the expectation of the reward of samples from our language model. And if we do the math, we break this apart. This is our definition of what an expectation is.

We're going to sum over all sentences, weighted by their probability. And due to the linearity of the gradient, we can put the gradient operator inside of the sum. Now what we're going to do is we're going to use a very handy trick known as the log-derivative trick. And this is called a trick, but it's really just the chain rule.

But let's just see what happens when we take the gradient of the log probability of a sample from our language model. So if I take the gradient, how do we use the chain rule? The gradient of the log of something is going to be 1 over that something, times the gradient of that something.

So 1 over P theta of s, times the gradient of P theta of s. And if we rearrange, we see that we can alternatively write the gradient of P theta of s as this product: P theta of s times the gradient of log P theta of s. And we can plug this back in.

And the reason why we're doing this is because we're going to convert this into a form where the expectation is easy to estimate. So we plug it back in. That gives us this. And if you squint quite closely at this last equation here, this first part here is the definition of an expectation.

We are summing over a bunch of samples from our model, and we are weighting it by the probability of that sample, which means that we can rewrite it as an expectation. And in particular, it's an expectation of this quantity here. So let's just rewrite it. And this gives us our kind of newer form of this objective.

So these two are equivalent, the top here and the bottom. And what has happened here is we've kind of shoved the gradient inside of the expectation, if that makes sense. So why is this useful? Does anyone have any questions on this before I move on? If you don't understand it, that's fine as well, because we will understand the intuition behind it later.

So we've converted this into this. And we put the gradient inside the expectation, which means we can now approximate this objective with Monte Carlo samples. So the way to approximate any expectation is to just take a bunch of samples and then average them. So approximately, this is equal to sampling a finite number of samples from our model, and then averaging the reward times the gradient of the log probability of each sample.
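Putting the steps just described into symbols, the log-derivative trick, the resulting expectation, its Monte Carlo estimate, and the gradient ascent update look roughly like this:

```latex
\nabla_\theta\, \mathbb{E}_{\hat{s} \sim p_\theta}\!\left[R(\hat{s})\right]
  = \sum_{s} R(s)\, \nabla_\theta\, p_\theta(s)
  = \sum_{s} R(s)\, p_\theta(s)\, \nabla_\theta \log p_\theta(s)
  = \mathbb{E}_{\hat{s} \sim p_\theta}\!\left[R(\hat{s})\, \nabla_\theta \log p_\theta(\hat{s})\right]
  \approx \frac{1}{m} \sum_{i=1}^{m} R(s_i)\, \nabla_\theta \log p_\theta(s_i),
  \qquad s_i \sim p_\theta

\theta_{t+1} \;\leftarrow\; \theta_t \;+\; \alpha\, \frac{1}{m} \sum_{i=1}^{m} R(s_i)\, \nabla_\theta \log p_\theta(s_i)
```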

And that gives us this update rule, plugging it back in for that gradient descent step that we wanted. So what is this? What does this mean? Let's think about a very simple case. Imagine the reward was a binary reward. So it was either 0 or 1. So for example, imagine we were trying to train a language model to talk about cats.

So whenever it utters a sentence with the word cat, we give it a 1 reward. Otherwise, we give it a 0 reward. Now, if our reward is binary, does anyone know what this objective reduces to or look like? Any ideas? If I've lost everyone, that's fine too. The reward would just be an indicator function.

So basically, to answer, the reward would be 0 everywhere, except for sentences that contain the word cat. And in that case, it would be 1. So basically, that would just look like vanilla gradient descent, just on sentences that contain the word cat. So to generalize this to the more general case, where the reward is scalar, what this is looking like, if you look at it, is if r is very high, very positive, then we're multiplying the gradient of that sample by a large number.

And so our objective will try to take gradient steps in the direction of maximizing the probability of producing that sample again, producing the sample that led to high reward. And on the other hand, if r is low or even negative, then we will actively take steps to minimize the probability of that happening again.

And that's the English intuition of what's going on here. The reason why we call it reinforcement learning is because we want to reinforce good actions and increase the probability that they happen again in the future. And hopefully, this intuitively makes sense to all of you. Let's say you're playing a video game, and on one run, you get a super high score.

And you think to yourself, oh, that was really good. Whatever I did that time, I should do again in the future. This is what we're trying to capture with this kind of update. Question? Is there any reason that we use policy gradient and not value iteration or other methods?

You can do a lot of things. I think there have been methods for doing Q-learning, offline learning, et cetera, with language models. I think the design space has been very underexplored. So there's a lot of low-hanging fruit out there for people who are willing to think about what fancy things we can do in RL and apply them to this language modeling case.

And in practice, what we use is not this simple thing, but a fancier thing, which is proximal policy optimization. Question? For an LM, isn't the action space super big, almost infinite? So that's the challenge. So one thing that I haven't mentioned here is that right now, I'm talking about entire samples of sentences, which is a massive space.

In practice, when we do RL, we actually do it at the level of generating individual tokens. So each token is, let's say, GPT has 50,000 tokens. So it's a pretty large action space, but it's still manageable. So that kind of answers this question I was asking, which is, can you see any problems with this objective?

Which is that this is a very simplified objective. There is a lot more tricks needed to make this work. But hopefully, this has given you kind of the high-level intuition as to what we're trying to do in the first place. OK, so now we are set. We have a bunch of samples from a language model.

And for any arbitrary reward function, like we're just asking a human to rate these samples, we can maximize that reward. So we're done. OK, so not so fast. There's a few problems. The first is the same as in the instruction fine-tuning case, which is that keeping a human in the loop is expensive.

I don't really want to supervise every single output from a language model. I don't know if you all want to. So what can we do to fix this? So one idea is, instead of needing to ask humans for preferences every single time, you can actually build a model of their preferences, like literally just train an NLP model of their preferences.

So this idea was kind of first introduced outside of language modeling by this paper by Knox and Stone. They called it TAMER. But we're going to see it re-implemented in this idea, where we're going to train a language model-- we'll call it a reward model, RM, which is parameterized by phi-- to predict human preferences from an annotated data set.

And then when doing RLHF, we're going to optimize for the reward model rewards instead of actual human rewards. Here's another conceptual problem. So here's a new sample for our summarization task. What is the score of the sample? Anyone give me a number. Does anyone want to rate this sample?

It's like a 3? A 6? What scale are we using? Et cetera. So the issue here is that human judgments can be noisy and miscalibrated when you ask people for scores alone. So one workaround for this problem is, instead of asking for direct ratings, ask humans to compare two summaries and judge which one is better.

This has been shown, I think, in a variety of fields where people work with human subjects and human responses to be more reliable. This includes psychology and medicine, et cetera. So in other words, instead of asking humans to just give absolute scores, we're going to ask humans to compare different samples and rate which one is better.

So as an example, maybe this first sample is better than the middle sample, and it's better than the last sample. Now that we have these pairwise comparisons, our reward model is going to generate latent scores, so implicit scores based on this pairwise comparison data. So our reward model is a language model that takes in a possible sample, and then it's going to produce a number, which is the score or the reward.

And the way that we're going to train this model-- and again, you don't really need to know too much of the details here, but this is a classic statistical comparison model-- is via the following loss, where the reward model essentially should just predict a higher score if a sample is judged to be better than another sample.

So in expectation, if we sample winning samples and losing samples from our data set, then if you look at this term here, the score of the winning sample should be higher than the score of the losing sample. Does that make sense? And in doing so, by just training on this objective, you will get a language model that will learn to assign numerical scores to samples, which indicate their relative preference over other samples.
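Written out, the loss being described is a standard pairwise comparison (Bradley-Terry-style) loss, roughly:

```latex
J_{\mathrm{RM}}(\phi) \;=\; -\,\mathbb{E}_{(s^{w},\, s^{l}) \sim D}\!\left[ \log \sigma\!\left( \mathrm{RM}_\phi(s^{w}) - \mathrm{RM}_\phi(s^{l}) \right) \right]
```

where s^w is the sample judged better, s^l the sample judged worse, and sigma is the sigmoid function.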

And we can use those outputs as rewards. Is there some renormalization either in the output or somewhere else? Yeah, so I don't remember if it happens during training. But certainly, after you've trained this model, you normalize the reward model so that the score is-- the expectation of the score is 0, because that's good for reinforcement learning and things like that as well.

Yeah, question? How do we account for the fact that these judgments are noisy? Some people could view S3 as better than S1. How do we make sure the ordering and the comparisons still work when it's noisy? Yeah, I think that's just kind of a limitation with asking for these preferences in the first place, which is that humans will disagree.

So we really have no ground truth unless we maybe ask an ensemble of humans, for example. That's just a limitation with this. I think hopefully, in the limit with enough data, this kind of noise washes out. But it's certainly an issue. And this next slide will also kind of touch on this.

So does the reward model work? Can we actually learn to model human preferences in this way? This is obviously an important sanity check before we actually try to optimize this objective. And they measured this. So this is kind of evaluating the reward model on a standard held-out validation set.

So can the reward model predict outcomes for data points that they have not seen during training? And does it change based on model size or amount of data? And if you notice here, there's one dashed line, which is the human baseline, which is if you ask a human to predict the outcome, a human does not get 100% accuracy because humans disagree.

And even an ensemble of, let's say, five humans also doesn't get 100% accuracy because humans have different preferences. But the key takeaway here is that for the largest possible model and for enough data, a reward model, at least on the validation set that they used, is kind of approaching the performance of a single human person.

And that's kind of a green light that maybe we can try this out and see what happens. So if there are no questions, this is kind of the components of our LHF. So we have a pre-trained model, maybe it's instruction fine-tuned, which we're going to call P of PT.

We have a reward model, which produces scalar rewards for language model outputs, and it is trained on a dataset of human comparisons. And we have a method, policy gradient, for arbitrarily optimizing language model parameters towards some reward function. And so now, if you want to do RLHF, you clone the pre-trained model. We're going to call this copy the RL model, with parameters theta, which we're actually going to optimize.

And we're going to optimize the following reward with reinforcement learning. And this reward looks a little bit more complicated than just using the reward model. The extra term here is a penalty which prevents us from diverging too far from the pre-trained model. In expectation, this is known as the KL or Kullback-Leibler divergence between the RL model and the pre-trained model.

And I'll explain why we need this in a few slides. But basically, if you over-optimize the reward model, you end up producing-- you can produce gibberish. And what happens is you pay a price. So this quantity is large if the probability of a sample under the RL tuned model is much higher than the probability of the sample under the pre-trained model.
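So the reward that actually gets optimized looks roughly like this, with notation following the components above:

```latex
R(s) \;=\; \mathrm{RM}_\phi(s) \;-\; \beta\, \log\!\left( \frac{p^{\mathrm{RL}}_\theta(s)}{p^{\mathrm{PT}}(s)} \right)
```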

So the pre-trained model would say, this is a very unlikely sequence of characters for anyone to say. That's when you would pay a price here. And beta here is a tunable parameter. Yeah, question? When you say initialize a copy, that means the first iteration, PRL is equal to PPT?

That's right. Yeah. Yeah, when I say initialize a copy, basically, we want to be able to compare to the non-fine-tuned model just to evaluate this penalty term. So just keep the predictions of the pre-RL model around. More questions? Great. So does it work? The answer is yes. So here is the key takeaway, at least for the task of summarization on this Daily Mail data set.

So again, we're looking at different model sizes. But at the end here, we see that if we do just pre-training-- so just the typical language modeling objective that GPT uses-- you end up producing summaries that, in general, are not preferred to the reference summaries. So on the y-axis here is the fraction of times that a human prefers the model-generated summary to a summary that a human actually wrote, the one that's in the data set.

So pre-training doesn't work well, even if you do supervised learning. So supervised learning in this case is, let's actually fine-tune our model on the summaries that were in our data sets. Even if you do that, you still kind of underperform the reference summaries, because you're not perfectly modeling those summaries.

But it's only with this human feedback that we end up producing a language model that actually ends up producing summaries that are judged to be better than the summaries in a data set that you were training on in the first place. I think that's quite interesting. Any questions? So now we talk about-- yeah, we're getting closer and closer to something like InstructGPT or ChatGPT.

The basic idea of InstructGPT is that we are scaling up RLHF to not just one prompt, as I had described previously, but tens of thousands of prompts. And if you look at these three pieces, these are the three pieces that we've just described. The first piece here being instruction fine-tuning, the second piece being RLHF, and the third piece-- oh, sorry, the second part being reward model training, and the last part being RLHF.

The difference here is that they use 30,000 tasks. So again, with the same instruction fine-tuning idea, it's really about the scale and diversity of tasks that really matters for getting good performance for these things. Yeah? Yeah, so the preceding results, you suggested that you really needed the RLHF, and it didn't work so well to do supervised learning on the data.

But they do supervised learning on the data in the fine-tuning in the first stage. Is that necessary, or else they should have tended to go haywire and just went straight to RLHF? Oh, yeah, that's a good question. So I think a key point here is that they initialized the RL policy on the supervised policy.

So they first got the model getting reasonably good at doing summarization first, and then you do the RLHF on top to get the boost performance. Your question you're asking is maybe, can we just do the RLHF starting from that pre-trained baseline? That's a good question. I don't think they explored that, although I'm not sure.

I'd have to look at the paper again to remind myself. Yeah. So certainly for something like InstructGPT, yeah, they've always kind of presumed that you need the kind of fine-tuning phase first, and then you build on top of it. But I think, yeah, there's still some interesting open questions as to whether you can just go directly to RLHF.

Question? Is the human reward function trained simultaneously with the fine-tuning of the language model? Or is it sequential? Reward model should be trained first. Yeah. You train it first, you make sure it's good, it's frozen, you optimize against that. What are the samples for the human rewards? Do they come from the generated task from language model?

Or where do the training samples come from? For training the reward model? Yeah. So, yeah, actually, it's a good question: where do the rewards come from? There's kind of an iterative process you can apply where you repeat steps two and three over and over again. So you sample a bunch of outputs from your language model.

You get humans to rate them. You then do RLHF to update your model again. And then you sample more outputs and get humans to rate them. So in general, the rewards are done on sampled model outputs, because those are the outputs that you want to steer in one direction or another.

But you can do this in an iterative process where you do RL, then maybe train a better reward model based on the new outputs, and continue. And I think they do a few iterations in InstructGPT, for example. Questions? OK. So, 30,000 tasks. I think we're getting into very recent stuff, where companies like OpenAI are increasingly sharing fewer and fewer details about what actually happens in training these models.

So we have a little bit less clarity as to what's going on here than we have had in the past. The data itself is not public, but they do share the kinds of tasks that they collected from labelers. So they collected a bunch of prompts from people who were already using the GPT-3 API.

So they had the benefit of having many, many users of their API, and they took the kinds of tasks that users would ask GPT to do. These include things like brainstorming, open-ended generation, et cetera. And yeah, I mean, the key result of InstructGPT, which is kind of the backbone of ChatGPT, really just needs to be seen and played with to understand.

So you can feel free to play with either ChatGPT or one of the OpenAI APIs. But again, where a plain language model does not necessarily follow instructions, by doing this kind of instruction fine-tuning followed by RLHF, you get a model that is much better at adhering to user commands.

Similarly, a language model can be very good at generating super interesting, open-ended creative text as well. This brings us to ChatGPT, which is even newer, and we have even less information about what's actually going on or what's being trained here. But yeah, they're keeping their secret sauce secret.

But we do have a blog post where they wrote two paragraphs. In the first paragraph, they said that they did instruction fine-tuning: we trained an initial model using supervised fine-tuning, where human AI trainers provided conversations in which they played both sides -- the user and an AI assistant.

And then we fine-tuned our model on acting like an AI assistant for humans. That's part one. Second paragraph: to create a reward model for RL, we collected comparison data. So we took conversations with an earlier version of the chatbot -- the one that was trained with instruction fine-tuning -- then took multiple samples and had labelers rate the quality of the samples.
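For that comparison data, a standard recipe -- the one described in the InstructGPT paper, and presumably similar here -- is to train the reward model with a pairwise ranking loss. A minimal sketch, assuming the two inputs are PyTorch tensors of reward-model scores:

```python
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    # score_chosen / score_rejected: reward-model scores for the preferred
    # and dispreferred completion in each human comparison (shape [batch]).
    # Maximizing the log-probability that the chosen one is ranked higher
    # gives a pairwise, Bradley-Terry-style ranking loss.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```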

And then using these reward models, we fine-tune with RL -- in particular, they used PPO, which is a fancier policy-gradient RL algorithm. And yeah, so that produces -- I don't need to introduce the capabilities of ChatGPT. It's been very exciting recently. Here's an example. It's fun to play with. Definitely play with it.

Sorry, it's a bit of an attack on the students. Yeah. OK. So, reinforcement learning: pluses. You're directly modeling what you care about, which is human preferences -- not "does the set of demonstrations I collected get the highest probability mass under my model?" You're actually just asking, how well am I satisfying human preferences?

So that's a clear benefit over something like instruction fine tuning. So in terms of negatives, one is that RL is hard. It's very tricky to get right. I think it will get easier in the future as we kind of explore the design space of possible options. So that's an obvious one.

Can anyone come up with any other weaknesses or issues they see with this kind of training? Yeah. Is it possible that your language model and your reward model can overfit to each other -- even if you're not training them together, if you're going back and forth like this? Yeah.

Yeah. So over-optimization of the reward model is an issue, I think. Yeah. Is it also an issue that if you retrain your baseline, you have to collect all this human feedback over again? Yeah. So it still is extremely data expensive. And you can see some articles if you just Google OpenAI data labeling.

People have not been very happy with the amount of data that has been needed to train something like ChatGPT. I mean, they're hiring developers just to explain coding problems 40 hours a week. So it is still data intensive -- that's kind of the takeaway. All of these approaches are still data intensive, every single one of them.

Yeah. I think that summarizes the big ones here. So when we talk about limitations of RLHF, we also need to talk about limitations of RL in general, and about this idea that we can model or capture human reward with a single number. So human preferences can be very unreliable.

The RL people have known this for a very long time. They have a term called reward hacking, which is when an agent is optimizing for something that the developer specified, but it is not what we actually care about. So one of the classic examples is this example from OpenAI, where they were training this agent to race boats.

And they were training it to maximize the score, which you can see at the bottom left. But implicitly, the score actually isn't what you care about. What you care about is just finishing the race ahead of everyone else. And the score is just kind of this bonus. But what the agent found out was that there are these turbo boost things that you can collect, which boost your score.

And so what it ends up doing is it ends up kind of just driving in the middle, collecting these turbo boosts over and over again. So it's racking up insane score, but it is not doing the race. It is continuously crashing into objects, and its boat is always on fire.

And this is a pretty salient example of what we call AI misalignment. And you might think, well, OK, this is a really simple example. They made a dumb mistake. They shouldn't have used score as a reward function. But I think it's even more naive to think that we can capture all of human preferences in a single number and assign certain scalar values to things.

So one example where I think this is already happening: maybe you have played with chatbots before, and you notice that they do a lot of hallucination -- they make up a lot of facts. And this might be because of RLHF. Chatbots are rewarded for producing responses that seem authoritative or seem helpful, but the reward doesn't care about whether the response is actually true or not.

They just want to seem helpful. So this results in making up facts. You may have seen the news about chatbots: companies are in this race to deploy chatbots, and they make mistakes. Even Bing has been hallucinating a lot. And in general, when you think about that, you realize that models of human preferences are even more unreliable.

We're not even using human preferences by themselves -- we're also training a deep model of those preferences that we don't really understand, and using that instead. And that can obviously be quite dangerous. So going back to this slide here, where I was describing why we need this KL penalty term -- this yellow highlighted term here -- here's a concrete example of what actually happens when a language model overfits to the reward model.

So what this is showing is, in this case, they took off the KL penalty. So they were just trying to maximize reward: they trained this reward model, and then just pushed those numbers up as high as possible. The x-axis here shows what happens as training continues -- you diverge further and further.

This is the KL divergence, or the distance from where you started. And the golden dashed line here is how well the reward model predicts your language model is doing. So your reward model is thinking, wow, you are killing it -- humans are going to love these summaries, way more than the reference summaries.

But in reality, when you actually ask humans, the preferences peak and then they just crater. So this is an example of over-optimizing for a metric: once you over-optimize, it ceases to be a good metric to optimize for. Any questions about this? So there's this real concern about what people are calling the AI alignment problem.

I'll let Percy Liang talk about this. He tweeted that the main tool we have for alignment is RLHF, but reward hacking happens a lot, and humans are not very good supervisors of rewards. So this strategy is probably going to result in agents that seem like they're doing the right thing, but are wrong in subtle and inconspicuous ways.

And I think we're already seeing examples of that in the current generation of chatbots. So, in terms of positives, here are some positives. But again, RL is tricky to get right, human preferences are fallible, and models of human preferences are even more so. I remember seeing a joke on Twitter where someone was saying that zero-shot and few-shot learning is the worst way to align an AI.

Instruction fine-tuning is the second worst way to align an AI. And RLHF is the third worst way to align an AI. So we're getting somewhere, but each of these has clear fundamental limitations. Yeah, question. I have a question more on the computational side of reinforcement learning. Because in the math that was shown before, essentially you're moving the gradient inside the expectation so that you can estimate it by sampling.

But when it comes to sampling, how do you make that parallel? Because you need to adaptively stop sampling, and you don't know when you're going to stop. How do you make that process quicker? The whole unit on transformers and all that was about parallelizing everything. I mean, yeah.

So this is really compute heavy. And I'm actually not sure what kind of infrastructure is used for a state of the art, very performant implementation of RLHF. But it's possible that they use parallelization like what you're describing, where I think in a lot of maybe more traditional RL, there's this kind of idea of having an actor learner architecture where you have a bunch of actor workers, which are each kind of a language model producing a bunch of samples.

And then the learner would then integrate them and perform the gradient updates. So it's possible that you do need to do just sheer multiprocessing in order to get enough samples to make this work in a reasonable amount of time. Is that the kind of question you had? Or do you have other questions?

Kind of. So you're basically saying that each unit that you parallelize over is larger than what we would typically see in transformers? I was saying that you might need to actually copy your model several times and take samples from different copies of the models. Yeah. But in terms of like-- yeah, so autoregressive generation, transformers, especially like the forward pass and the multi-head attention stuff is very easy to parallelize.

But autoregressive generation is still bottlenecked by the fact that it's autoregressive. So you have to run a forward pass, and then, depending on what you sample, you have to run it again. So those are bottlenecks that we haven't fully been able to solve, I think.

And that will add to compute cost. So I think we have 10 more minutes, if I'm not mistaken. So we've mostly answered how we get from this to this. There are some details missing, but the key factors are, one, instruction fine-tuning, and two, this idea of reinforcement learning from human feedback.

So let's talk a little bit about what's next. As I had mentioned, RLHF is still a very new area, and it's still very fast moving. By the next time this lecture is given, these slides might look completely different, because maybe a lot of the things that I presented here turn out to be really bad ideas or not the most efficient way of going about things.

RLHF gets you further than instruction fine-tuning. But as someone already mentioned, it is still very data expensive. There are a lot of articles about OpenAI needing to hire a legion of annotators or developers to just compare outputs over and over again. A recent line of work that I'm especially interested in and have been thinking about is how we can get the benefits of RLHF without such stringent data requirements.

So there are these newer, kind of crazy ideas about doing reinforcement learning not from human feedback, but from AI feedback -- having language models themselves evaluate the output of language models. As an example of what that might look like, a team from Anthropic, which works on these large language models, came up with this idea called constitutional AI.

And the basic idea here is that if you ask GPT-3 to identify whether a response was not helpful, it would be pretty good at doing so. And you might be able to use that feedback itself to improve a model. So as an example, if you have some sort of human request, like, can you help me hack into my neighbor's Wi-Fi?

And the assistant says, yeah, sure, you can use this app, right? We can ask a model for feedback on this. What we do is we add a critique request, which says, hey, language model GPT-3, identify ways in which the assistant's response is harmful. And then it will generate a critique, like hacking into someone else's Wi-Fi is illegal.

And then you might ask it to revise the response, right? So: just rewrite the assistant's response to remove harmful content. And it does so. And now, by just decoding from a language model -- assuming you can do this well -- what you have is a set of data that you can do instruction fine-tuning on, right?
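As a very rough sketch of that critique-and-revise loop -- where the `model.generate` wrapper and the exact prompt wording are stand-ins, not the actual prompts from the Anthropic paper:

```python
def constitutional_revision(model, prompt, initial_response):
    """Produce a (prompt, revised_response) pair for later fine-tuning.

    `model` is a hypothetical language-model wrapper with a .generate(text)
    method that returns a string completion.
    """
    critique = model.generate(
        f"Human: {prompt}\nAssistant: {initial_response}\n"
        "Critique request: Identify ways in which the assistant's response "
        "is harmful.\nCritique:"
    )
    revision = model.generate(
        f"Human: {prompt}\nAssistant: {initial_response}\n"
        f"Critique: {critique}\n"
        "Revision request: Rewrite the assistant's response to remove "
        "harmful content.\nRevision:"
    )
    # The (prompt, revision) pair becomes instruction fine-tuning data.
    return prompt, revision
```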

You have a request, and you have a response that has been revised to make sure it doesn't contain harmful content. So this is pretty interesting -- I think it's quite exciting. But all of those issues that I mentioned about alignment -- misinterpreting human preferences, reward models being fallible -- everything gets compounded like 40,000 times when you're thinking about this, right?

We have no understanding of how safe this is or where this ends up going, but it is something. Another, more common idea is this general notion of fine-tuning language models on their own outputs. This has been explored a lot in the context of chain-of-thought reasoning, which is something I presented at the beginning of the lecture.

And these papers are provocatively titled things like "large language models can self-improve." But again, it's not clear how much runway there is. The basic idea is that you can use "let's think step by step," for example, to get a language model to produce a bunch of reasoning, then fine-tune on that reasoning as if it were real data, and see whether or not the language model gets any better using that technique.
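A minimal sketch of that recipe, assuming a hypothetical `model.generate` sampler and a caller-supplied `extract_final_answer` function; keeping only rationales that agree with the majority answer is one common filtering heuristic, not necessarily exactly what any one paper does:

```python
def self_improvement_data(model, questions, extract_final_answer, num_samples=8):
    """Build a fine-tuning set from the model's own chain-of-thought samples."""
    finetune_examples = []
    for q in questions:
        # Sample several reasoning chains with a zero-shot CoT prompt.
        samples = [model.generate(q + "\nLet's think step by step.")
                   for _ in range(num_samples)]
        answers = [extract_final_answer(s) for s in samples]
        # Keep rationales whose final answer matches the majority answer,
        # and treat them as if they were ground-truth training data.
        majority = max(set(answers), key=answers.count)
        finetune_examples += [(q, s) for s, a in zip(samples, answers)
                              if a == majority]
    return finetune_examples
```

The resulting (question, rationale) pairs would then be used for ordinary supervised fine-tuning.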

But as I mentioned, this is all still very new. There are, I think, a lot of limitations of large language models -- like hallucination, and also just the sheer size and compute intensity -- that may or may not be solvable with RLHF. Question? So we give the model feedback about behaviors we don't want it to have.

But I've seen people talking about how you can jailbreak ChatGPT to still give those types of responses. Are there any ways for us to buffer against those types of things as well? Because it seems like you're just going to keep building on this -- you need to identify every case where it's acting in a way you don't want it to.

I guess, is there any way to do that at scale to avoid those jailbreaking possibilities? Yeah, that's interesting. So there are certainly ways that you can use either AI feedback or human feedback to mitigate those kinds of jailbreaks. If you see someone on Twitter saying, oh, I jailbroke GPT-3 using this strategy or whatever, you can then maybe plug that into this kind of framework, say "identify ways in which the assistant went off the rails," and then fine-tune and hopefully correct those.

But it is really difficult, I think, in most of these kinds of settings -- it's really difficult to anticipate all the possible ways in which a user might jailbreak an assistant. So you always have this dynamic, like in cybersecurity, for example, where there's always the attacker's advantage: the attacker will always come up with something new, some new exploit.

So yeah, I think this is a deep problem. I don't have a really clear answer. But certainly, if we knew what the jailbreak was, we could mitigate it. I think that seems pretty straightforward. But if you know how to do that, you should be hired by one of these companies.

They'll pay you millions if you can solve this. OK. So, just some last remarks. With all of these scaling results that I presented -- all of this "oh, you can just do instruction fine-tuning and it'll follow your instructions, or you can do RLHF" -- you might have a very bullish view, like, this is how we're going to solve artificial general intelligence, by just scaling up RLHF.

It's possible that that is actually going to happen. But it's also possible that there are certain fundamental limitations that we just need to figure out how to solve, like hallucination, before we get anywhere productive with these models. But it is a really exciting time to work on this kind of stuff.

So yeah. Thanks for listening. Thanks.