
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 11 - Natural Language Generation



00:00:00.000 | Hello everyone, my name is Lisa. I'm a third year PhD student in the NLP group. I'm advised
00:00:11.080 | by Percy and Tatsu. Today I will give a lecture on natural language generation. And this is
00:00:17.080 | also the research area that I work on. So I'm super excited about it. I'm happy to answer
00:00:21.400 | any questions both during the lecture and after class about natural language generation.
00:00:26.580 | So NLG is a super exciting area and it's also moving really, really fast. So today we will
00:00:32.840 | discuss all the excitement of NLG. But before we get into the really exciting part, I have
00:00:39.120 | to make some announcements. So first, it is very, very important for you to remember to
00:00:44.060 | sign up for AWS by midnight today. So this is related to our homework five, whether you
00:00:51.140 | have GPU access and then also related to our final project. So please, please remember
00:00:56.180 | to sign up for AWS by tonight. And second, the project proposal is due on Tuesday, next
00:01:04.300 | Tuesday. And I think assignment four should be due just about now. Hopefully you had fun with machine
00:01:10.660 | translation and stuff. And also assignment five is out today, I think just now. And it
00:01:17.420 | is due on Friday, basically Friday midnight. And last, we will hold a Hugging Face Transformers
00:01:27.260 | library tutorial this Friday. So if your final project is related to implementing transformers
00:01:33.820 | or playing with large language models, you should definitely go to this tutorial because
00:01:37.260 | it's going to be very, very helpful. Also, yeah, just one more time, please remember
00:01:42.180 | to sign up for AWS because this is the final hard deadline. Okay, cool. Now moving on to
00:01:49.580 | the main topic for today, the very exciting natural language generation stuff. So today,
00:01:54.700 | we will discuss what NLG is, review some models, discuss how to decode from language
00:02:00.620 | models and how to train language models. And we will also talk about evaluations. And finally,
00:02:06.780 | we'll discuss ethical and risk considerations with the current NLG systems. So these natural
00:02:12.500 | language generation techniques are going to be really exciting because they are kind of
00:02:17.260 | getting us closer to explaining the magic of ChatGPT, which is a super popular model recently.
00:02:22.860 | And practically speaking, they could also help you with your final project if you decide
00:02:26.700 | to work on something related to text generation. So let's get started. To begin with, let's
00:02:32.660 | ask the question of what is natural language generation. So natural language generation
00:02:37.940 | is actually a really broad category. People have divided NLP into natural language understanding
00:02:44.180 | and natural language generation. So the understanding part mostly means that the task input is in
00:02:49.900 | natural language, such as semantic parsing, natural language inference, and so on. Whereas
00:02:56.260 | natural language generation means that the task output is in natural language. So NLG
00:03:02.780 | focuses on systems that produce fluent, coherent, and useful language outputs for humans to use.
00:03:09.700 | Historically, there are many NLG systems that use rule-based systems, such as templates
00:03:15.820 | or infilling. But nowadays, deep learning is powering almost every text generation system.
00:03:22.460 | So this lecture today will be mostly focused on deep learning stuff.
00:03:27.940 | So first, what are some examples of natural language generation? It's actually everywhere,
00:03:33.140 | including our homework. Machine translation is a form of NLG, where the input is some
00:03:38.460 | utterance in the source language, and the output is generated text in the target language.
00:03:44.500 | Digital assistants, such as Siri or Alexa, are also NLG systems. They take in
00:03:50.140 | dialogue history and generate continuations of the conversation. There are also summarization
00:03:56.620 | systems that take in a long document, such as a research article, and then the idea is
00:04:02.300 | trying to summarize it into a few sentences that are easy to read.
00:04:07.460 | So beyond these classic tasks, there are some more interesting uses, like creative storywriting,
00:04:13.500 | where you can prompt a language model with a story plot, and then it will give you some
00:04:17.860 | creative stories that are aligned with the plot. There is data to text, where you give
00:04:22.660 | the language model some database or some tables, and then the idea is that it will output some
00:04:27.980 | textual description of the table content. And finally, there is also visual description-based
00:04:33.540 | NLG systems, like image captioning or image-based storytelling.
00:04:40.380 | So the really cool example is the popular ChatGPT models. So ChatGPT is also an NLG
00:04:48.500 | system. It is very general purpose, so therefore you can use it to do many different tasks
00:04:54.860 | with different prompts. For example, we can use ChatGPT to simulate a chatbot. It can
00:05:01.220 | answer questions about creative gifts for a 10-year-old. It can be used to do poetry generation.
00:05:08.820 | For example, we can ask it to generate a poem about sorting algorithms. And it's actually,
00:05:14.100 | well, I wouldn't say it's very poetic, but at least it has the same format as a poem,
00:05:18.540 | and the content is actually correct.
00:05:22.740 | So ChatGPT can also be used in some really useful settings, like web search. So here,
00:05:30.180 | Bing is augmented with ChatGPT, and there are some tweets saying that the
00:05:34.100 | magic of ChatGPT is that it actually makes people happy to use Bing.
00:05:42.700 | So there are so many tasks that actually belong to the NLG category. So how do we categorize
00:05:47.380 | these tasks? One common way is to think about the open-endedness of the task. So here, we
00:05:53.100 | draw a line for the spectrum of open-endedness. On the one end, we have tasks like machine
00:05:58.820 | translation and summarization. So we consider them not very open-ended, because for each
00:06:04.500 | source sentence, the output is almost determined by the input. Because basically, when we are
00:06:10.580 | doing machine translation, the semantics should stay the same as the input sentence.
00:06:15.540 | So there are only a few ways that you can rephrase the output, like authorities have
00:06:19.500 | announced that today is a national holiday. You can rephrase it a little bit to say, today
00:06:24.060 | is a national holiday announced by the authorities. But the actual space is really small, because
00:06:29.500 | you have to make sure the semantics doesn't change. So we can say that the output space
00:06:34.140 | here is not very diverse.
00:06:37.900 | And moving to the middle of the spectrum, there is dialogue tasks, such as task-driven
00:06:42.260 | dialogue or a chitchat dialogue. So we can see that for each dialogue input, there are
00:06:47.100 | multiple responses, and the degree of freedom has increased. Here, we can respond by saying
00:06:53.580 | good and you, or we can say about, thanks for asking, barely surviving all my homeworks.
00:06:59.980 | So here, we are observing that there are actually multiple ways to continue this conversation.
00:07:04.780 | And then this is where we say the output space is getting more and more diverse.
00:07:09.900 | And on the other end of the spectrum, there is the very open-ended generation tasks, like
00:07:15.020 | story generation. So given the input, like write me a story about three little pigs,
00:07:20.060 | there are so many ways to continue the prompt. We can write about them going to schools,
00:07:24.300 | building houses, like they always do. So the valid output here is extremely large. And
00:07:30.300 | we call this open-ended generation.
00:07:33.940 | So it's hard to really draw a boundary between open-ended and non-open-ended tasks. But we
00:07:38.820 | still try to give a rough categorization. So open-ended generation refers to tasks whose
00:07:44.140 | output distribution has a high degree of freedom. Whereas non-open-ended generation refers
00:07:50.580 | to tasks where the input will almost certainly determine the output generation. Examples
00:07:57.260 | of non-open-ended generations are machine translation, summarization. And examples of
00:08:01.940 | open-ended generations are story generation, chitchat dialogue, task-oriented dialogue,
00:08:07.100 | et cetera.
00:08:08.060 | So how do we formalize this categorization? One way of formalizing is by computing the
00:08:13.500 | entropy of the NLG system. So high entropy means that we are to the right of the spectrum.
00:08:20.220 | So it is more open-ended. And low entropy means that we are to the left of the spectrum
00:08:25.420 | and less open-ended. So these two classes of NLG tasks actually require different decoding
00:08:32.020 | and training approaches, as we will talk about later.
00:08:35.420 | OK, cool. Now let's recall some previous lectures and review the NLG models and trainings that
00:08:42.260 | we have studied before. So I think we discussed the basics of natural language generation.
00:08:48.780 | So here is how autoregressive language model works. At each time step, our model would
00:08:53.420 | take in a sequence of tokens as input. And here it is y less than t. And the output is
00:09:00.540 | basically the new token yt. So to decide on yt, we first use the model to assign a score
00:09:06.980 | for each token in the vocabulary, denoted as s. And then we apply softmax to get the
00:09:12.740 | next token distribution, p. And we choose a token according to this next token distribution.
00:09:19.300 | And similarly, once we have predicted yt hat, we then pass it back into the language model
00:09:23.300 | as the input, predict y hat t plus 1. And then we do so recursively until we reach the
00:09:29.660 | end of the sequence.
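To make the recursion above concrete, here is a minimal sketch of autoregressive generation in PyTorch. The names `score_fn`, `VOCAB_SIZE`, and `EOS_ID` are hypothetical stand-ins, not from the lecture: `score_fn` plays the role of the trained language model that maps a prefix y less than t to a score vector S.

```python
import torch

VOCAB_SIZE, EOS_ID = 1000, 0

def score_fn(prefix):
    # Hypothetical stand-in for the language model: returns a score S
    # for every token in the vocabulary given the prefix y_<t.
    return torch.randn(VOCAB_SIZE)

def generate(prefix, max_len=20):
    y = list(prefix)
    for _ in range(max_len):
        scores = score_fn(y)                              # S = f(y_<t)
        probs = torch.softmax(scores, dim=-1)             # P = softmax(S)
        next_token = torch.multinomial(probs, 1).item()   # g(P): here, sampling
        y.append(next_token)
        if next_token == EOS_ID:                          # stop at end of sequence
            break
    return y

print(generate([5, 42, 7]))
```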
00:09:32.020 | So any questions so far? OK, good. So for the two types of NLG tasks that we talked
00:09:40.620 | about, like the open-ended and non-open-ended tasks, they tend to prefer different model
00:09:45.140 | architectures. So for non-open-ended tasks, like machine translation, we typically use
00:09:50.660 | an encoder-decoder system, where the autoregressive decoder that we just talked about functions
00:09:56.060 | as the decoder. And then we have another bidirectional encoder for encoding the inputs. So this is
00:10:01.260 | kind of what you implemented for assignment 4, because the encoder is like the bidirectional
00:10:07.140 | LSTM, and the decoder is another LSTM that is autoregressive.
00:10:12.660 | So for more open-ended tasks, typically an autoregressive generation model is the only component. Of
00:10:21.180 | course, these architectures are not really hard constraints, because an autoregressive
00:10:25.860 | decoder alone can also be used to do machine translation. And an encoder-decoder model
00:10:30.660 | can also be used for story generation. So this is kind of the convention for now, but
00:10:35.980 | it's a reasonable convention, because using decoder-only model for MT tends to hurt performance
00:10:42.180 | compared to an encoder-decoder model for MT. And using an encoder-decoder model for open-ended
00:10:47.500 | generation seems to achieve similar performance to a decoder-only model. And therefore, if
00:10:53.260 | you have the compute budget to train an encoder-decoder model, you might just be better off by only
00:10:57.900 | training a larger decoder model. So it's kind of more of an allocation of resources problem
00:11:02.740 | than whether this architecture will type check with your task.
00:11:08.700 | So how do we train such a language model? In previous lectures, we talked about that
00:11:15.540 | the language models are trained by maximum likelihood. So basically, we were trying to
00:11:20.740 | maximize the probability of the next token, yt, given the preceding words. And this is
00:11:26.260 | our optimization objective. So at each time step, this can be regarded as a classification
00:11:32.540 | task, because we are trying to distinguish the actual word, yt star, from all the remaining
00:11:38.340 | words in the vocabulary. And this is also called teacher forcing, because at each time
00:11:43.780 | step, we are using the gold standard, y star less than t, as input to the model. Whereas,
00:11:52.940 | presumably, at generation time, you wouldn't have any access to y star. So you would have
00:11:57.140 | to use the model's own prediction to feed it back into the model to generate the next
00:12:01.340 | token. And that is called student forcing, which we'll talk about in detail later.
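As a rough illustration of the teacher-forcing objective, here is a minimal PyTorch sketch. The random `logits` tensor stands in for the scores an actual model would produce when fed the gold prefix y star less than t; at every position the loss is just a cross-entropy classification against the gold token y star t.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 5, 1000
# Stand-in for model scores computed from the gold prefix y*_{<t}.
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
gold = torch.randint(0, vocab, (batch, seq_len))   # gold next tokens y*_t

# Maximize log P(y*_t | y*_{<t}) at every step == minimize cross-entropy.
loss = F.cross_entropy(logits.reshape(-1, vocab), gold.reshape(-1))
loss.backward()
print(loss.item())
```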
00:12:05.500 | Oh, sorry.
00:12:06.500 | Yeah, I think I skipped two slides ago. About autoregressive, we never used that word before.
00:12:14.740 | What does it mean?
00:12:15.740 | Autoregressive? Oh, it just means like-- so let's look at this animation again. Oops,
00:12:22.380 | sorry. It just means that you are generating words from left to right, one by one. So here,
00:12:28.020 | suppose that you are given y less than t. And then autoregressively, you first generate
00:12:32.660 | yt. And then once you have yt, you'll feed it back in, generate yt plus 1, and then feed
00:12:37.780 | it back in, generate another thing. So this left to right nature, because you are using
00:12:41.260 | chain rule to condition on the tokens that you just generated, this chain rule thing
00:12:47.100 | is called autoregressive.
00:12:48.740 | And typically, I think conventionally, we are doing left to right autoregressive by
00:12:52.460 | generating from left to right. But there are also other more interesting models that can
00:12:56.580 | do backward or infill and other things. This idea of generating one token at a time is autoregressive.
00:13:03.580 | Cool. Any other questions? Yep.
00:13:13.340 | So at inference time, our decoding algorithm would define a function to select a token
00:13:19.020 | from this distribution. So we've discussed that we can use the language model to compute
00:13:23.820 | this p, which is the next token distribution. And then g here, based on our notation, is
00:13:29.140 | the decoding algorithm, which helps us select what token we are actually going to use for generation.
00:13:34.700 | So the obvious decoding algorithm is to greedily choose the highest probability token as yt
00:13:40.660 | for each time step. So while this basic algorithm sort of works, because they work for your
00:13:45.340 | homework 4, to do better, there are two main avenues that we can take. We can decide to
00:13:50.500 | improve decoding. And we can also decide to improve the training.
00:13:55.020 | Of course, there are other things that we can do. We can improve training data. And
00:13:58.340 | we can improve model architectures. But for this lecture, we will focus on decoding and
00:14:02.460 | training.
00:14:04.660 | So now let's talk about how decoding algorithms work for natural language generation
00:14:09.860 | models. Before that, I'm happy to take any questions about the previous slides.
00:14:13.980 | OK. Yeah.
00:14:14.980 | Sorry, could you just explain one more time the difference between teacher forcing and
00:14:15.980 | student forcing?
00:14:16.980 | I think I'll go into this in detail later. But sure. So basically, for teacher forcing,
00:14:28.820 | the idea is you do teacher forcing where you train the language model, because you already
00:14:32.500 | observe the gold text. So you use the gold text up until time step t, put it into the
00:14:38.500 | model. And then the model would try to predict y t plus 1.
00:14:42.900 | Whereas student forcing means that you don't have access to this gold reference data. Instead,
00:14:48.020 | but you are still trying to generate a sequence of data. So you have to use the text that
00:14:51.380 | you generated yourself using the model, and then feed it back into the model as input
00:14:55.580 | to predict t plus 1. That's the primary difference.
00:15:00.420 | Cool. So what is decoding all about? At each time step, our model computes a vector of
00:15:08.180 | score for each token. So it takes in preceding context y less than t and produce a score
00:15:13.860 | s. And then we try to compute a probability distribution p out of the scores by just applying
00:15:19.980 | softmax to normalize them. And our decoding algorithm is defined as this function g, which
00:15:26.860 | takes in the probability distribution and try to map it to some word. Basically, try
00:15:31.580 | to select a token from this probability distribution.
00:15:35.140 | So in the machine translation lecture, we talked about greedy decoding, which selects
00:15:40.300 | the highest probability token of this p distribution. And we also talk about beam search, which
00:15:47.380 | has the same objective as greedy decoding, which is that we are both trying to find the
00:15:51.980 | most likely string defined based on the model. But instead of doing so greedily for beam
00:15:56.780 | search, we actually explore a wider range of candidates. So we have a wider exploration
00:16:02.140 | of candidates by always keeping k candidates in the beam.
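For reference, the greedy choice is just an argmax over the next-token distribution; this tiny sketch, with a random stand-in distribution, shows one step. Beam search instead keeps k partial hypotheses at every step, which is not shown here.

```python
import torch

probs = torch.softmax(torch.randn(1000), dim=-1)   # stand-in next-token distribution P
y_t = torch.argmax(probs).item()                   # greedy decoding: pick the top token
print(y_t)
```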
00:16:08.340 | So overall, this maximum probability decoding is good for low entropy tasks like machine
00:16:13.180 | translation and summarization. But it actually encounters more problems for open-ended generation.
00:16:19.420 | So the most likely string is actually very repetitive when we try to do open-ended text
00:16:24.380 | generation. As we can see in this example, the context is perfectly normal. It's about
00:16:30.540 | a unicorn trying to speak English.
00:16:33.140 | And in the continuation, the first part of it looks great. It's like valid English. It
00:16:38.380 | talks about science. But suddenly, it starts to repeat. And it starts to repeat, I think,
00:16:44.100 | an institution's name.
00:16:46.660 | So why does this happen? If we look at, for example, this plot, which shows the language
00:16:54.500 | model's probability assigned to the sequence I don't know, we can see here is the pattern.
00:17:00.140 | It has regular probability. But if we keep repeating this phrase, I don't know, I don't
00:17:04.460 | know, I don't know, for 10 times, then we can see that there is a decreasing trend in
00:17:09.300 | their negative log likelihood. So the y-axis is the negative log probability.
00:17:14.020 | We can see this decreasing trend, which means that the model actually has higher probability
00:17:18.740 | as the repeat goes on, which is quite strange because it's suggesting that there is a self-amplification
00:17:25.180 | effect. So the more repeat we have, the more confident the model becomes about this
00:17:29.980 | repeat.
00:17:32.420 | And this keeps going on. We can see that for I am tired, I'm tired, repeat 100 times, we
00:17:36.540 | can see a continuously decreasing trend until the model is almost 100% sure that it's going
00:17:42.100 | to keep repeating the same thing.
00:17:45.900 | And sadly, this problem is not really solved by architecture. Here, the red plot is a LSTM
00:17:53.020 | model, and the blue curve is a transformer model. We can see that both models kind of
00:17:57.380 | suffer from the same problem. And scale also doesn't solve this problem. So we kind of
00:18:02.060 | believe that scale is the magical thing in NLP. But even models with 175 billion parameters
00:18:09.060 | will still suffer from repetition if we try to find the most likely string.
00:18:16.300 | So how do we reduce repetition? One canonical approach is to do n-gram blocking. So the
00:18:22.180 | principle is fairly simple. Basically, you just don't want to see the same n-gram twice.
00:18:27.460 | If we set n to be 3, then for any text that contains the phrase "I am happy," the next
00:18:32.300 | time you see the prefix "I am," n-gram blocking would automatically set the probability of
00:18:37.340 | happy to be 0 so that you will never see this n-gram, this trigram again.
00:18:43.100 | But clearly, this n-gram blocking heuristic has some problems because sometimes it is
00:18:48.020 | quite common for you to want to see a person's name appear twice or three times or even more
00:18:52.500 | in a text. But this n-gram blocking will eliminate that possibility.
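Here is one simple way n-gram blocking is often implemented at decoding time; this is a generic sketch, not code from the lecture. Given the tokens generated so far, it returns the tokens whose selection would complete an already-seen n-gram, and the decoder then sets their probability to zero.

```python
def blocked_tokens(generated, n=3):
    """Return the set of next tokens that would repeat an already-seen n-gram."""
    banned = set()
    if len(generated) < n - 1:
        return banned
    prefix = tuple(generated[-(n - 1):])              # the current (n-1)-token context
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])          # token that completed this n-gram before
    return banned

# With n = 3, seeing the prefix "i am" again bans "happy":
print(blocked_tokens(["i", "am", "happy", "because", "i", "am"]))  # {'happy'}
```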
00:18:57.300 | So what are better options that possibly are more complicated? For example, we can use
00:19:02.700 | a different training objective. Instead of training by MLE, we can train by unlikelihood
00:19:08.300 | objective. So in this approach, the model is actually penalized for generating already
00:19:14.660 | seen tokens. So it's kind of like putting this n-gram blocking idea into training time.
00:19:20.700 | Rather than at decoding time for this constraint, at training time, we just decrease the probability
00:19:24.460 | of repetition. Another training objective is coverage loss, which uses the attention
00:19:31.420 | mechanism to prevent repetition. So basically, if you try to regularize and enforce your
00:19:35.980 | attention so that it's always attending to different words for each token, then it is
00:19:41.260 | highly likely that you are not going to repeat because repetition tends to happen when you
00:19:45.700 | have similar attention patterns. Another different angle is that instead of searching for the
00:19:51.980 | most likely string, we can use a different decoding objective. So maybe we can search
00:19:56.500 | for strings that maximize the difference between log probabilities of two models. Say
00:20:01.940 | that we want to maximize log probability of large model minus log probability of small
00:20:06.260 | model. In this way, because both models are repetitive, they would both assign high
00:20:10.860 | probability to repetition, and those repetitive continuations kind of cancel out. So after applying this new
00:20:14.900 | objective, the repetition will actually be penalized because it cancels out.
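The objective described here is the idea behind contrastive decoding; below is a rough single-step sketch with two random stand-in distributions for the large and small models. The full method adds extra constraints on which candidates are considered, which are omitted here.

```python
import torch

# Stand-in next-token distributions for a large and a small language model.
p_large = torch.softmax(torch.randn(1000), dim=-1)
p_small = torch.softmax(torch.randn(1000), dim=-1)

# Both models assign high probability to repetition, so subtracting the small
# model's log-probabilities cancels that out and penalizes degenerate continuations.
contrastive_score = torch.log(p_large) - torch.log(p_small)
next_token = torch.argmax(contrastive_score).item()
print(next_token)
```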
00:20:20.820 | So here comes the broader question. Is finding the most likely string even a reasonable thing
00:20:26.420 | to do for open-ended text generation? The answer is probably no, because this doesn't
00:20:32.540 | really match human pattern. So we can see in this plot, the orange curve is the human
00:20:37.060 | pattern, and the blue curve is the machine-generated text using beam search. So you can see that
00:20:42.100 | with human talks, there are actually lots of uncertainty, as we can see by the fluctuation
00:20:47.860 | of the probabilities. For some words, we can be very certain. For some words, we are a
00:20:52.580 | little bit unsure. Whereas here, for the model distribution, it's always very sure. It's
00:20:56.580 | always assigning probability 1 to the sequence.
00:20:59.500 | So because we now are seeing a-- basically, there is a mismatch between the two distributions.
00:21:06.080 | So it's kind of suggesting that maybe searching for the most likely string is not the right
00:21:10.940 | decoding objective at all. Any questions so far before we move on? Yeah?
00:21:15.940 | So is this the underlying mechanism for some detector of whether some text is generated
00:21:21.940 | by changing the [INAUDIBLE]
00:21:24.540 | Not really, because this can only detect the really simple things that humans are also
00:21:28.740 | able to detect, like repetition. So in order to avoid the previous problems that we've
00:21:34.420 | talked about, I'll talk about some other decoding families that generate more robust text that
00:21:40.180 | actually look like this, whose probability distribution looks like the orange curve.
00:21:45.500 | So I wouldn't say this is the go-to answer for watermarking or detection.
00:21:50.460 | Can you repeat the student's question?
00:21:53.700 | Oh, yeah. OK, cool. So she asked about whether this mechanism of plotting the probabilities
00:22:00.180 | of human text and machine-generated text is one way of detecting whether some text is
00:22:05.980 | generated by a model or a human. And my answer is, I don't think so, but this could be an
00:22:11.900 | interesting research direction. Because I feel like there are more robust decoding approaches
00:22:17.540 | that generate text that actually fluctuates a lot.
00:22:24.400 | So yeah, let's talk about the decoding algorithm that is able to generate text that fluctuates.
00:22:29.260 | So given that searching for the most likely string is a bad idea, what else should we
00:22:33.580 | do? And how do we simulate that human pattern? And the answer to this is to introduce randomness
00:22:39.060 | and stochasticity to decoding.
00:22:41.860 | So suppose that we are sampling a token from this distribution, P. Basically, we are trying
00:22:48.420 | to sample YT hat from this distribution. It is random so that you can essentially sample
00:22:53.420 | any token in the distribution. Previously, you were kind of restricted to selecting, say,
00:22:57.460 | restroom or grocery. But now you can select bathroom instead.
00:23:02.980 | So however, sampling introduces a new set of problems. Since we never really zero out
00:23:08.580 | any token probabilities, vanilla sampling would make every token in the vocabulary a
00:23:14.100 | viable option. And in some unlucky cases, we might end up with a bad word.
00:23:20.040 | So assuming that we already have a very well-trained model, even if most of the probability mass
00:23:26.540 | of the distribution is over the limited set of good options, the tail of the distribution
00:23:31.580 | will still be very long because we have so many words in our vocabulary. And therefore,
00:23:36.700 | if we aggregate all those long-tail tokens, they still have a considerable mass. So statistically
00:23:42.060 | speaking, this is called a heavy-tailed distribution. And language is exactly a heavy-tailed distribution.
00:23:47.980 | So for example, many tokens are probably really wrong in this context. And then given that
00:23:54.420 | we have a good language model, we assign them each very little probability.
00:23:58.740 | But this doesn't really solve the problem because there are so many of them. So you
00:24:02.500 | aggregate them as a group. We'll still have a high chance of being selected.
00:24:08.180 | And the solution here that we have for this problem of long tail is that we should just
00:24:12.380 | cut off the tail. We should just zero out the probabilities that we don't want. And
00:24:16.900 | one idea is called top-k sampling, where the idea is that we would only sample from
00:24:23.580 | the top k tokens in the probability distribution.
00:24:29.060 | Any questions for now?
00:24:30.580 | OK, yeah.
00:24:31.580 | Well, the model we were looking at a second ago had some very low probability samples
00:24:39.940 | as well on the graph, right? How would top-k sampling deal with that?
00:24:45.420 | You mean this one?
00:24:46.420 | You mean the orange-blue graph of the human versus--
00:24:51.660 | Oh, yeah. So top k will basically eliminate-- it will make it impossible to generate the
00:24:58.940 | super low probability tokens. So technically, it's not exactly simulating this pattern because
00:25:04.700 | now you don't have the super low probability tokens, whereas human can generate super low
00:25:08.780 | probability tokens in a fluent way. But yeah, that could be another hint that people can
00:25:15.140 | use for detecting machine-generated text.
00:25:17.980 | Yeah?
00:25:18.980 | It also depends on the type of text you want to generate, for example, for more novels
00:25:24.500 | or more creative writing. Is it then you decide the hyperparameter?
00:25:28.820 | Yeah, yeah, for sure. K is a hyperparameter. Depending on the type of task, you will choose
00:25:33.020 | K differently. Mostly for a closed-ended task, K should be small. And for open-ended, K should
00:25:39.100 | be large. Yeah, question in the back.
00:25:41.660 | How come-- I guess intuitively, this builds off of one of the earlier questions. Why don't
00:25:46.820 | we consider the case where we sample, and then we just weight the probability of each
00:25:52.020 | word by its score or something, rather than just looking at top K? We don't do a weighted
00:25:57.580 | sampling type of situation. So we still have that small but non-zero probability of selecting.
00:26:03.740 | I think top K is also weighted. So top K just zeroes out all the tails of the distribution.
00:26:11.380 | But for the things that it didn't zero out, it's not a uniform choice among the K. It's
00:26:16.260 | still trying to choose proportional to the scores that you computed.
00:26:20.540 | Is it just that it's computationally more efficient, because you don't have to do it for
00:26:25.700 | 17,000 words? It could be for 10 or something? Yeah, sure. That could be one gain of top
00:26:31.820 | K decoding is that your softmax will take in fewer candidates.
00:26:36.020 | But it's not the main reason. I think you should show--
00:26:40.900 | Yeah, I'll keep talking about the main reason. So we've discussed this part. And then here,
00:26:51.140 | this is formally what is happening for top K sampling. Now that we are only sampling
00:26:57.740 | from the top K tokens of the probability distribution. And as we've said, K is a hyperparameter.
00:27:03.780 | So we can set K to be large or small. If we increase K, this means that we are making
00:27:09.460 | our output more diverse, but at the risk of including some tokens that are bad. If we
00:27:14.500 | decrease K, then we are making more conservative and safe options. But possibly the generation
00:27:19.380 | will be quite generic and boring.
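A minimal top-k sampling sketch, using a random stand-in score vector: keep only the k highest-probability tokens, renormalize, and sample proportionally to what remains.

```python
import torch

def top_k_sample(scores, k=50):
    probs = torch.softmax(scores, dim=-1)
    topk_probs, topk_ids = torch.topk(probs, k)    # keep the k most probable tokens
    topk_probs = topk_probs / topk_probs.sum()     # renormalize the truncated head
    choice = torch.multinomial(topk_probs, 1)      # still weighted by probability
    return topk_ids[choice].item()

print(top_k_sample(torch.randn(1000), k=10))
```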
00:27:24.340 | So is top K decoding good enough? The answer is not really. Because we can still find some
00:27:30.220 | problems with top K decoding. For example, in the context, she said, I never blank. There
00:27:36.060 | are many words that are still valid options, such as went, ate. But those words got zeroed
00:27:42.340 | out because they are not within the top K candidates. So this actually leads to bad
00:27:46.780 | recall for your generation system. And similarly, another failure of top K is that it can also
00:27:53.260 | cut off too quickly. So in this example, code is not really a valid answer, according to
00:27:59.580 | common sense, because you probably don't want to eat a piece of code. But the probability
00:28:04.100 | remains non-zero, meaning that the model might still sample code as an output, despite this
00:28:10.100 | low probability. And this means bad precision for the generation
00:28:15.180 | model.
00:28:17.780 | So given these problems with top K decoding, how can we address them? How can we address
00:28:23.900 | this issue of there is no single K that fits all circumstances? This is basically because
00:28:30.580 | the probability distribution that we sample from is dynamic. So when the probability
00:28:34.740 | distribution is relatively flat, having a small, limited K will remove many viable options,
00:28:41.660 | and we want K to be larger for this
00:28:45.820 | case. Similarly, when the distribution P is very peaky, a high K would
00:28:53.700 | allow for too many options to be viable. And instead, we might want a smaller K so that
00:28:59.460 | we are being safer. So the solution here is that maybe K is just a bad hyperparameter.
00:29:05.060 | And instead of doing K, we should think about probability. We should think about how to
00:29:10.420 | sample from the tokens in the top p percentile of the cumulative probability
00:29:16.980 | mass of the distribution, for example.
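A minimal top-p (nucleus) sampling sketch, again with a random stand-in score vector: keep the smallest set of tokens whose cumulative probability exceeds p, renormalize, and sample from that set.

```python
import torch

def top_p_sample(scores, p=0.9):
    probs = torch.softmax(scores, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1        # tokens needed to cover mass p
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, 1)
    return sorted_ids[choice].item()

print(top_p_sample(torch.randn(1000), p=0.9))
```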
00:29:20.940 | So now, the advantage of doing top P sampling, where we sample from the top P percentile
00:29:27.340 | of the cumulative probability mass, is that this is actually equivalent to-- we have now
00:29:31.980 | an adaptive K for each different distribution. And let me explain what I mean by having an
00:29:38.260 | adaptive K. So in the first distribution, this is like a regular power law of language
00:29:44.180 | that's kind of typical. And then doing top K sampling means we are selecting the top
00:29:49.820 | K. But doing top P sampling means that we are zooming into maybe something that's similar
00:29:55.460 | to top K in effect. But if I have a relatively flat distribution like the blue one, we can
00:30:01.380 | see that doing top P means that we are including more candidates. And then if we have a more
00:30:06.700 | skewed distribution like the green one, doing top P means that we actually include fewer
00:30:11.180 | candidates. So by actually selecting the top P percentile in the probability distribution,
00:30:18.780 | we are actually having a more flexible K and therefore have a better sense of what are
00:30:24.460 | the good options in the model. Any questions about top P, top K decoding? So everything's
00:30:33.500 | clear. Yeah, sounds good. So to go back to that question, doing top K is not necessarily
00:30:40.460 | saving compute. Or this whole idea is not really compute saving intended. Because in
00:30:46.820 | the case of top P, in order to select the top P percentile, we still need to compute
00:30:51.540 | the softmax over the entire vocabulary set in order for us to compute the P properly.
00:30:59.180 | So therefore, it's not really saving compute, but it's improving performance. Cool. Moving
00:31:05.420 | on. So there are much more to go with decoding algorithms. Besides the top K and top P that
00:31:12.740 | we've discussed, there are some more recent approaches like typical sampling, where the
00:31:17.780 | idea is that we want to relate the score based on the entropy of the distribution and try
00:31:22.740 | to generate texts that are closer to the negative-- whose probability is closer to the negative
00:31:27.620 | entropy of the data distribution. This means that if you have a closed-ended task or non-open-ended
00:31:34.700 | task, it has smaller entropy. So you'll want negative log probability to be smaller. So
00:31:40.380 | you want probability to be larger. So it type checks very well. And additionally, there
00:31:46.380 | is also epsilon sampling coming from John Hewitt. So this is an idea where we set a threshold
00:31:53.700 | to lower bound probabilities. So basically, if you have a word whose probability is less
00:31:58.380 | than 0.03, for example, then that word will be zeroed out of the output distribution, and
00:32:05.140 | it will never be part of your output because it has such low probability. Yeah.
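A minimal epsilon-sampling sketch; the 0.03 threshold is just the example value mentioned above, and the single-token back-off is one possible way to handle the edge case where everything falls below the threshold.

```python
import torch

def epsilon_sample(probs, eps=0.03):
    mask = probs >= eps                    # drop every token below the threshold
    if not mask.any():                     # back-off: keep the single most likely token
        mask[torch.argmax(probs)] = True
    kept = probs * mask
    kept = kept / kept.sum()               # renormalize what is left
    return torch.multinomial(kept, 1).item()

probs = torch.softmax(torch.randn(50), dim=-1)
print(epsilon_sample(probs))
```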
00:32:09.980 | How do you compute the entropy of a distribution?
00:32:14.100 | Oh, cool. Great question. So the entropy of a distribution is defined as-- suppose that we have a discrete
00:32:21.300 | distribution. We just enumerate over x and sum p(x) times the negative log probability
00:32:27.620 | of x, so H(p) = - sum over x of p(x) log p(x). If we write it from an expectation perspective, it's basically the expectation of negative log
00:32:34.700 | probability of x. So this is the entropy of a distribution.
00:32:45.140 | And then-- so basically, if your distribution is very concentrated to a few words, then
00:32:50.260 | the entropy will be relatively small. If your distribution is very flat, then your entropy
00:32:55.660 | will be very large. Yeah.
00:32:57.660 | What if the epsilon sampling is such that we have no valid option?
00:33:05.700 | Oh, yeah. I mean, there will be some back-off cases, I think. So in the case that there
00:33:11.180 | is no valid options, you'll probably still want to select one or two things, just as
00:33:16.540 | an edge case, I think. OK, cool. Moving on. So another hyperparameter that we can tune
00:33:26.780 | to affect decoding is the temperature parameter. So recall that previously at each time step,
00:33:33.060 | we asked the model to compute a score. And then we renormalized that score using softmax
00:33:38.180 | to get a probability distribution. So one thing that we can adjust here is that we can
00:33:42.540 | insert this temperature parameter tau to rescale the scores. So basically, we just divide all
00:33:47.660 | the scores s_w by tau. And after dividing, we apply softmax. And we get a new distribution.
00:33:55.460 | And this temperature adjustment is not really going to affect the monotonicity of the distribution.
00:34:01.580 | For example, if word A has higher probability than word B previously, then after the adjustment,
00:34:07.620 | word A is still going to have a higher probability than word B. But their relative difference
00:34:12.300 | will change. So for example, if we raise the temperature tau to be greater than 1, then
00:34:20.380 | the distribution Pt will become more uniform. It will be flatter. And this implies that
00:34:26.700 | there will be more diverse output because our distribution is flatter. And it's more
00:34:31.660 | spread out across different words in the vocabulary. On the other hand, if we lower the temperature
00:34:37.620 | tau less than 1, then Pt becomes very spiky. And then this means that if we sample from
00:34:44.700 | the Pt, we'll get less diverse output. So because here, the probability is concentrated
00:34:49.860 | only on the top words. So in the very extreme case, if we set tau to be very, very close
00:34:54.540 | to 0, then the probability will be a one-hot vector, where all the probability mass will be centered
00:35:01.220 | on one word. And then this reduces back to argmax sampling or greedy decoding.
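A minimal temperature-scaling sketch: divide the scores by tau before the softmax. tau > 1 flattens the distribution, tau < 1 makes it peakier, and tau near 0 approaches the one-hot / greedy case.

```python
import torch

def temperature_softmax(scores, tau=1.0):
    return torch.softmax(scores / tau, dim=-1)

scores = torch.tensor([2.0, 1.0, 0.5])
print(temperature_softmax(scores, tau=2.0))   # flatter: more diverse samples
print(temperature_softmax(scores, tau=0.5))   # peakier: closer to greedy decoding
```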
00:35:07.860 | So temperature is a hyperparameter as well, like k and p in top-k and top-p. It is a hyperparameter
00:35:14.220 | for decoding. It can be tuned for beam search and sampling algorithms. So it's kind of orthogonal
00:35:19.900 | to the approaches that we discussed before. Any questions so far? OK, cool. Temperature
00:35:29.860 | is so easy. So well, because sampling still involves randomness, even though we try very
00:35:38.900 | hard in terms of truncation, truncating the tail, sampling still has randomness. So what
00:35:43.740 | if we're just unlucky and decode a bad sequence from the model? One common solution is to
00:35:49.180 | do re-ranking. So basically, we would decode a bunch of sequences. For example, we can
00:35:53.460 | decode 10 candidates. But 10 or 30 is up to you. The only choice is that you want to balance
00:35:59.540 | between your compute efficiency and performance. So if you decode too many sequences, then,
00:36:05.460 | of course, your performance is going to increase. But it's also very costly to just generate
00:36:10.460 | a lot of things for one example. And so once you have a bunch of sample sequences, then
00:36:18.100 | we are trying to define a score to approximate the quality of the sequence and re-rank all
00:36:24.060 | the candidates by this score. So the simple thing to do is we can use perplexity as a
00:36:29.500 | metric, as a scoring function. But we need to be careful that, because we have talked
00:36:35.180 | about this, the extreme of perplexity, like if we try to arc max log probability, when
00:36:40.540 | we try to aim for a super low perplexity, the tags are actually very repetitive. So
00:36:45.500 | we shouldn't really aim for extremely low perplexity. And perplexity, to some extent,
00:36:50.180 | is not a perfect scoring function. It's not a perfect scoring function because it's not
00:36:55.540 | really robust to maximize. So alternatively, the re-rankers can actually use a wide variety
00:37:02.380 | of other scoring functions. We can score tags based on their style, their discourse coherence,
00:37:08.620 | their entailment, factuality properties, consistency, and so on. And additionally, we can compose
00:37:16.860 | multiple re-rankers together. Yeah, question?
00:37:20.540 | >> You mentioned 10 candidates or any number of candidates. What's the strategy you usually
00:37:27.540 | use to generate these other candidates? Like what heuristic do you use?
00:37:32.540 | >> So basically, the idea is to sample from the model. So when you sample from the model,
00:37:37.260 | each time you sample, you are going to get a different output. And then that's what I
00:37:40.820 | mean by different candidates. So if you sample 10 times, you will very likely get 10 different
00:37:45.820 | outputs. And then you are just-- given these 10 different outputs that come from sampling,
00:37:51.500 | you can just decide, re-rank them, and select the candidate that has the highest score.
00:37:56.180 | >> Where does the randomness come from? >> Oh, because we are sampling here.
00:38:00.980 | >> That sample, okay. >> Yeah, yeah. For example, if you are doing
00:38:04.420 | top-p sampling, then, well, suppose that A and B are equally probable, then you might
00:38:09.180 | sample A, you might sample B with the same probability. Okay, cool. And another cool
00:38:16.140 | thing that we can do with re-ranking is that we can compose multiple re-rankers together.
00:38:20.540 | So basically, suppose you have a scoring function for style, and you have a scoring function
00:38:24.780 | for factual consistency. You can just add those two scoring functions together to get
00:38:28.900 | a new scoring function, and then re-rank everything based on your new scoring function to get
00:38:34.580 | text that is both good at style and good at factual consistency. Yeah?
00:38:38.980 | >> Yeah, so when you say that we re-rank by score, do we just pick the decoding that has
00:38:45.500 | the highest score, or do we do some more sampling again based on the score?
00:38:49.900 | >> The idea is you just take the decoding that has the highest score, because you already
00:38:52.540 | have, say, 10 candidates. So out of these 10, you only need one, and then you just choose
00:38:57.300 | one that has the highest score. Yeah. Cool. Any other questions? Yeah?
00:39:04.900 | >> Sorry. What is perplexity?
00:39:08.900 | >> Oh, yeah. Perplexity is like, you can kind of regard it as exponentiated negative log probability. It's like
00:39:15.180 | e to the average negative log probability. It's kind of like if a token has high perplexity,
00:39:22.140 | then it means it has low probability, because you are more perplexed.
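Putting the re-ranking discussion above together, here is a rough best-of-n sketch. `sample_candidate`, `style_score`, and `consistency_score` are hypothetical stand-ins for sampling from the model and for whatever scoring functions you choose; composing re-rankers is just adding (or weighting) their scores.

```python
import random

def sample_candidate():
    # Stand-in for drawing one sample from the language model.
    return "candidate-" + str(random.randint(0, 9999))

def style_score(text):          # hypothetical scorer
    return random.random()

def consistency_score(text):    # hypothetical scorer
    return random.random()

def rerank(n=10):
    candidates = [sample_candidate() for _ in range(n)]
    # Compose re-rankers by summing their scores, then keep the best candidate.
    scored = [(style_score(c) + consistency_score(c), c) for c in candidates]
    return max(scored)[1]

print(rerank())
```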
00:39:28.020 | Okay. So taking a step back to summarize this decoding section, we have discussed many decoding
00:39:35.180 | approaches from selecting the most probable string to sampling, and then to various truncation
00:39:41.860 | approaches that we can do to improve sampling, like top P, top K, epsilon, typical decoding.
00:39:47.820 | And finally, we discussed how we can do in terms of re-ranking the results. So decoding
00:39:54.020 | is still a really essential problem in NLG, and there are lots of works to be done here
00:39:59.420 | still, especially as ChatGPT is so powerful. We should all go study decoding. So it would
00:40:05.060 | be interesting if you want to do such final projects. And also, different decoding algorithms
00:40:10.380 | can allow us to inject different inductive biases to the text that we are trying to generate.
00:40:17.800 | And some of the most impactful advances in NLG in the last couple of years actually come
00:40:22.420 | from simple but effective decoding algorithms. For example, the nucleus sampling paper is
00:40:28.260 | actually very, very highly cited.
00:40:31.540 | So moving on to talk about training NLG models. Well, we have seen this example before in
00:40:38.740 | the decoding slides, and I'm just trying to show them again, because even though we can
00:40:43.100 | solve this repetition problem by instead of doing search, doing sampling. But it's still
00:40:49.300 | concerning from a language modeling perspective that your model would put so much probability
00:40:54.540 | on such repetitive and degenerate text. So we ask this question, well, is repetition
00:40:59.740 | due to how language models are trained? You have also seen this plot before, which shows
00:41:06.940 | this decaying pattern or this self-amplification effect. So we can conclude from this observation
00:41:13.340 | that models trained via an MLE objective have a really bad mode of the distribution. By
00:41:19.820 | mode of the distribution, I mean the argmax of the distribution. So basically, they would
00:41:23.500 | assign high probability to terrible strings. And this is definitely problematic from a modeling
00:41:28.900 | perspective. So why is this the case? Shouldn't MLE be a gold standard in machine learning
00:41:36.140 | in general, not just machine translation? Shouldn't MLE be a gold standard for machine
00:41:39.700 | learning? The answer here is not really, especially for text, because MLE has some problem for
00:41:46.340 | sequential data. And we call this problem exposure bias. So training with teacher forcing
00:41:53.340 | leads to exposure bias at generation time, because during training, our model's inputs
00:41:58.140 | are gold context tokens from real human-generated text, as denoted by y star less than T here.
00:42:05.060 | But during generation time, our model's inputs become previously decoded tokens from the
00:42:10.820 | model, y hat less than T. And suppose that our model has minor errors, then y hat less than T will
00:42:18.260 | be much worse in terms of quality than y star less than T. And this is terrible,
00:42:23.900 | because it causes a discrepancy between training and test time, which actually hurts
00:42:30.300 | model performance. And we call this problem exposure bias.
00:42:35.980 | So people have proposed many solutions to address this exposure bias problem. One thing
00:42:41.180 | to do is to do scheduled sampling, which means that with probability p, we try to decode
00:42:47.900 | a token and feed it back in as context to train the model. And with probability 1 minus
00:42:53.860 | p, we use the gold token as context. So throughout training, we try to increase
00:43:00.140 | p to gradually warm it up, and then prepare it for test time generation. So this leads
00:43:06.500 | to improvement in practice, because using this probability p, we're actually gradually
00:43:14.580 | trying to narrow the discrepancy between training and test time. But the objective is actually
00:43:19.260 | quite strange, and training can be very unstable.
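A minimal sketch of the scheduled-sampling decision at a single time step; the tokens here are placeholders, and p would be increased over the course of training.

```python
import random

def choose_next_input(gold_token, model_token, p):
    """With probability p feed back the model's own prediction, else the gold token."""
    return model_token if random.random() < p else gold_token

print(choose_next_input(gold_token=17, model_token=42, p=0.25))
```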
00:43:23.580 | Another idea is to do data set aggregation. And the method is called Dagger. Essentially,
00:43:29.860 | at various intervals during training, we try to generate a sequence of text from the current
00:43:33.980 | model, and then put this generated sequence into the training data. So
00:43:39.060 | we're kind of continuously doing this training data augmentation scheme to make sure that
00:43:44.940 | the training distribution and the generation distribution are closer together. So both
00:43:49.980 | approaches, both scheduled sampling and data set aggregation, are ways to narrow the discrepancy
00:43:55.060 | between training and test. Yes, question?
00:43:58.820 | What is the gold token?
00:44:00.980 | Gold token just means human text. It means like-- well, when you train a language model,
00:44:06.500 | you will see lots of corpus that are human written. Gold is just human. Yeah. OK, cool.
00:44:15.540 | So another approach is to do retrieval augmented generation. So we first learn to retrieve
00:44:20.700 | a sequence from some existing corpus of prototypes. And then we train a model to actually edit
00:44:26.140 | the retrieved text by doing insertion, deletion, or swapping. We can add or remove tokens from
00:44:33.260 | this prototype, and then try to modify it into another sentence. So this doesn't really
00:44:40.260 | suffer from exposure bias, because we start from a high-quality prototype. So that at
00:44:45.620 | training time and at test time, you don't really have the discrepancy anymore, because
00:44:49.460 | you are not generating from left to right.
00:44:53.780 | Another approach is to do reinforcement learning. So here, the idea is to cast your generation
00:44:59.340 | problem as a Markov decision process. So there is the state s, which is the model's representation
00:45:06.700 | for all the preceding context. There is action a, which is basically the next token that
00:45:12.740 | we are trying to pick. And there is policy, which is the language model, or also called
00:45:16.860 | the decoder. And there is the reward r, which is provided by some external score. And the
00:45:22.540 | idea here-- well, we won't go into details about reinforcement learning and how it works,
00:45:28.220 | but we will recommend the class CS234.
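As a very rough illustration of how the RL objective can look in this setup, here is a REINFORCE-style sketch (one common formulation, not necessarily the exact algorithm the lecture has in mind). `seq_logprob` stands in for the model's summed log-probability of a sampled sequence and `reward` for some external score.

```python
import torch

seq_logprob = torch.tensor(-12.3, requires_grad=True)  # stand-in for sum_t log p(y_t | y_<t)
reward = 0.8                                           # stand-in external reward r(y)

# Policy gradient: push up the log-probability of sequences that get high reward.
loss = -reward * seq_logprob
loss.backward()
print(seq_logprob.grad)   # gradient is -reward
```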
00:45:34.380 | So in the reinforcement learning context, because reinforcement learning involves a
00:45:38.180 | reward function, that's very important. So how do we do reward estimation for text generation?
00:45:44.020 | Well, really a natural idea is to just use the evaluation metrics. So whatever-- because
00:45:49.060 | you are trying to do well in terms of evaluation, so why not just improve for evaluation metrics
00:45:54.100 | directly at training time? For example, in the case of machine translation, we can use
00:45:58.940 | BLEU score as the reward function. In the case of summarization, we can use ROUGE score
00:46:03.980 | as the reward function.
00:46:06.620 | But we really need to be careful about optimizing for the task as opposed to gaming the reward,
00:46:12.140 | because evaluation metrics are merely proxies for the generation quality. So sometimes,
00:46:17.060 | you run RL and improve the blue score by a lot. But when you run human evaluations, humans
00:46:23.740 | might still think that, well, this generated text is no better than the previous one, or
00:46:28.300 | even worse, even though it gives you a much better BLEU score. So we want to be careful
00:46:33.540 | about this case and make sure we are not gaming the reward.
00:46:37.500 | So what behaviors can we tie to a reward function? This is about reward design and reward estimation.
00:46:42.780 | There are so many things that we can do. We can do cross-modality consistency for image
00:46:47.540 | captioning. We can do sentence simplicity to make sure that we are generating simple
00:46:58.660 | English that is understandable. We can do formality and politeness to make sure that,
00:46:58.660 | I don't know, your chatbot doesn't suddenly yell at you.
00:47:02.780 | And the most important thing that's really, really popular recently is human preference.
00:47:08.860 | So we should just build a reward model that captures human preference. And this is actually
00:47:14.220 | the technique behind the ChatGPT model. So the idea here is that we would ask humans to
00:47:19.860 | rank a bunch of generated text based on their preference. And then we will use this preference
00:47:24.740 | data to learn a reward function, which will basically always assign high score to something
00:47:31.580 | that humans might prefer and assign low score to something that humans wouldn't prefer.
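A minimal sketch of how such a preference reward model is commonly trained, using a pairwise ranking loss over a preferred and a dispreferred response; this is the standard formulation in the RLHF literature, not necessarily the exact recipe behind ChatGPT. The two scalar scores are hypothetical reward-model outputs.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for a human-preferred and a dispreferred response.
r_preferred = torch.tensor(1.7, requires_grad=True)
r_dispreferred = torch.tensor(0.4, requires_grad=True)

# Pairwise ranking loss: -log sigmoid(r_preferred - r_dispreferred).
loss = -F.logsigmoid(r_preferred - r_dispreferred)
loss.backward()
print(loss.item())
```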
00:47:36.620 | Yeah, question?
00:47:37.620 | Would it be more expensive? Like, is it all just real humans?
00:47:42.620 | Oh yeah, sure. I mean, it is going to be very expensive. But I feel like compared to all
00:47:47.660 | the cost of training models, training like 170 billion parameter models, I feel like
00:47:52.780 | OpenAI and Google are, well, they can afford hiring lots of humans to do human annotations
00:47:58.300 | and ask their preference.
00:47:59.300 | How much data would we need to, like, get similar answers?
00:48:04.300 | Yeah, this is a great question. So I think it's kind of a mystery about how much data
00:48:09.860 | you exactly need to achieve the level of performance of chat GPT. But roughly speaking, I feel
00:48:15.500 | like, I mean, whenever you try to fine tune a model on some downstream task, similarly
00:48:19.940 | here you are trying to fine tune your model on human preference. It does need quite a lot
00:48:24.940 | of data, like maybe on the scale of 50k to 100k. That's roughly the scale that-- like
00:48:29.780 | Anthropic actually released some data set about human preference. That's roughly the
00:48:33.780 | scale that they released, I think, if I remember correctly. Yeah, question.
00:48:38.780 | So we talked about earlier about how many of the state of the art language models use
00:48:43.180 | transformers as their architecture. How do you apply reinforcement learning to this model?
00:48:50.860 | To what do you mean? To transformer model?
00:48:53.380 | Yeah.
00:48:54.380 | Yeah, I feel like reinforcement learning is kind of a modeling tool. I mean, it's kind
00:49:00.220 | of an objective that you are trying to optimize. Instead of an MLE objective, now you are optimizing
00:49:04.540 | for an RL objective. So it's kind of orthogonal to the architecture choice. So a transformer
00:49:11.900 | is an architecture. You just use transformer to give you probability of the next token
00:49:16.260 | distribution or to try to estimate probability of a sequence. And then once you have the
00:49:21.420 | probability of a sequence, you use that probability of the sequence, pass it into the RL objective
00:49:27.660 | that you have. And then suppose that you are trying to do policy gradient or something,
00:49:31.900 | then you need to estimate the probability of that sequence. And then you just need to
00:49:36.460 | be able to backprop through transformer, which is doable.
00:49:40.340 | Yeah, so I think the question about architecture and objectives are orthogonal. So even if
00:49:45.220 | you have an LSTM, you can do it. You have a transformer, you can also do it.
00:49:48.700 | Yeah.
00:49:49.700 | Cool. Hope I answered that question. Yeah.
00:49:52.700 | And it just like with a model for this kind of reward. For example, we can do another
00:49:59.700 | transformer to calculate the reward.
00:50:02.420 | Yeah, I think that's exactly what they did. So for example, you would have GPT-3. You
00:50:08.020 | use GPT-3 as the generator that generate text. And you kind of have another pre-trained model
00:50:13.700 | that could probably also be GPT-3, but I'm guessing here, that you fine tune it to learn
00:50:18.820 | human preference. And then once you have a human preference model, you use the human
00:50:23.580 | preference model to put it into RL as the reward model. And then use the original GPT-3
00:50:29.140 | as the policy model. And then you apply RL objectives and then update them so that you
00:50:35.460 | will get a new model that's better at everything.
00:50:38.940 | OK, cool. Yeah, actually, if you are very curious about RLHF, I would encourage you
00:50:45.020 | to come to the next lecture, where Jesse will talk about RLHF. RLHF is shorthand for RL
00:50:54.660 | using human feedback.
00:50:59.100 | So takeaways. Teacher forcing is still the main algorithm for training text generation
00:51:05.300 | models. And exposure bias causes problems in text generation models. For example, it
00:51:10.940 | causes models to lose coherence, causes model to be repetitive. And models must learn to
00:51:16.180 | recover from their own bad samples by using techniques like scheduled sampling or DAgger.
00:51:22.860 | And another approach to reduce exposure bias is to start with good
00:51:28.540 | text, like retrieval plus generation. And we also discussed how to do training with
00:51:32.780 | RL. And this can actually make the model learn behaviors that
00:51:40.260 | are preferred by humans or preferred by some metrics.
00:51:43.180 | So to be very up to date, in the best language model nowadays, ChatGPT, the training is
00:51:49.220 | actually pipelined. For example, we would first pre-train a large language model using
00:51:53.420 | an internet corpus by self-supervision. And this kind of gets you ChatGPT-- sorry, GPT-3,
00:51:59.740 | which is the original version. And then you would do some sort of instruction tuning to
00:52:04.260 | fine-tune the language model, to fine-tune the pre-trained language model so that it
00:52:07.500 | learns roughly how to follow human instructions.
00:52:10.900 | And finally, we would do RLHF to make sure that these models are well-aligned with human
00:52:15.580 | preference. So if we start RLHF from scratch, it's probably going to be very hard for the
00:52:21.700 | model to converge, because RL is hard to train for text data, et cetera. So RL doesn't really
00:52:27.500 | work from scratch. But with all these smart tricks about pre-training and instruction
00:52:33.260 | tuning, suddenly now they're off to a good start.
00:52:39.060 | Cool. Any questions so far? OK. Oh, yeah.
00:52:45.060 | [INAUDIBLE]
00:52:55.060 | You mean the difference between DAgger and scheduled sampling is how long the sequences
00:53:01.300 | are? Yeah, I think roughly that is it. Because for DAgger, you are trying to put in the fully generated
00:53:10.100 | sequence. But I feel like there can be variations of DAgger. DAgger is just a high-level
00:53:13.740 | framework and idea. There can be variations of DAgger that are very similar to scheduled
00:53:18.820 | sampling, I think.
00:53:20.140 | I feel like scheduled sampling is kind of a more smoothed version of DAgger. Because
00:53:24.860 | for DAgger, basically, for this epoch, I am generating something.
00:53:31.260 | And then after this epoch finishes, I aggregate this into the data and then train
00:53:35.380 | for another epoch. Whereas DAgger seems to be more flexible in terms of where you add
00:53:39.940 | data. Yes?
00:53:40.940 | So for DAgger, if you regress the model on its own output, how does that help the model?
00:53:48.740 | I think that's a good question. I feel like if you regress the model-- for example, if
00:53:54.900 | you regress the model on its own output, I think there should be smarter ways than to
00:54:01.740 | exactly regress on your own output. For example, you might still consult some gold reference
00:54:06.780 | data. For example, say you ask the model
00:54:10.980 | to generate five tokens. Then, instead of using the model's own
00:54:16.420 | generation as the sixth token, you'll probably try to
00:54:21.980 | find some examples in the training data that would be good continuations. And then you
00:54:26.660 | try to plug that in by connecting the model generation and some gold text. And then therefore,
00:54:34.100 | you are able to correct the model, even though it probably went off path a little bit by
00:54:39.700 | generating its own stuff. So it's kind of like letting the model learn how to correct
00:54:43.060 | for itself.
00:54:44.060 | But yes, I think you are right. If you just put model generation in the data, it shouldn't
00:54:51.540 | really work. Yeah. Any other questions? Cool. Moving on. Yes. So now we'll talk about how
00:55:08.020 | we are going to evaluate NLG systems. So there are three types of methods for evaluation.
00:55:13.540 | There are content overlap metrics, there are model-based metrics, and there are human evaluations.
00:55:20.340 | So first, content overlap metrics compute a score based on lexical similarity between
00:55:25.100 | the generated text and the gold reference text. So the advantage of this approach is
00:55:29.460 | that it's very fast and efficient and widely used. For example, BLEU score is very popular
00:55:35.300 | in MT. And ROUGE score is very popular in summarization.
00:55:41.660 | So these methods are very popular because they are cheap and easy to run. But they are
00:55:47.940 | not really the ideal metrics. For example, simply relying on lexical overlap might miss
00:55:53.540 | some rephrasings that have the same semantic meaning. Or it might reward text with a large
00:55:59.180 | portion of lexical overlap, but actually have the opposite meaning. So you have lots of
00:56:03.620 | both false positive and false negative problems.
00:56:07.540 | So despite all these disadvantages, these metrics are still the go-to evaluation standard in
00:56:12.300 | machine translation. Part of the reason is that MT is actually super close-ended. It's
00:56:18.060 | very non-open-ended. And therefore, it is probably still fine to use BLEU score
00:56:25.020 | to measure machine translation. But they get progressively worse for tasks that are more
00:56:29.740 | open-ended. For example, they get worse for summarization,
00:56:35.580 | because the output text becomes much harder to measure. They are much worse for dialogue,
00:56:40.340 | which is more open-ended. And then they are much, much worse for story generation, which
00:56:44.100 | is also open-ended. And the drawback here with the n-gram metrics is this:
00:56:51.100 | suppose that you are generating a story that's relatively long. Then if you
00:56:55.740 | are still looking at word overlap, you might actually get very high n-gram scores
00:57:00.380 | because your text is very long, not because it's actually of high quality. Just because
00:57:04.780 | you are talking so much, you might have covered lots of points already.
00:57:09.580 | [INAUDIBLE]
00:57:10.580 | Yes, exactly. That's the next thing that I will talk about as a better metric for evaluation.
00:57:22.700 | But for now, let's do a case study of a failure mode for BLEU score, for example. So suppose
00:57:29.300 | that Chris asked the question, are you enjoying the CS224N lectures? The correct answer, of
00:57:33.900 | course, is heck yes. So if one of the answers is yes, it will get a score
00:57:42.220 | of 0.61 because it has some lexical overlap with the correct answer. If you answer you
00:57:48.100 | know it, then it gets a relatively lower score because it doesn't really have any lexical
00:57:53.420 | overlap except for the exclamation mark. And if you answer yep, this is semantically
00:57:59.420 | correct, but it actually gets a score of 0 because there is no lexical overlap between the gold
00:58:05.340 | answer and the generation. And if you answer heck no, this should be wrong. But because it has
00:58:12.020 | lots of lexical overlap with the correct answer, it actually gets a fairly high score.
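As a rough illustration of these failure modes, here is a small script using NLTK's sentence-level BLEU (with smoothing, since the sentences are very short); the tokenizations are simplified and the exact numbers will differ from the slide.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1
    reference = [["heck", "yes", "!"]]
    candidates = [["yes", "!"], ["you", "know", "it", "!"], ["yep", "."], ["heck", "no", "!"]]
    for cand in candidates:
        score = sentence_bleu(reference, cand, smoothing_function=smooth)
        print(" ".join(cand), round(score, 3))
    # "heck no !" shares n-grams with the reference (a false positive), while the
    # semantically correct "yep ." gets essentially no credit (a false negative).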
00:58:19.740 | So these two cases are the major failure modes of lexical-based n-gram overlap metrics. You
00:58:26.060 | get false negatives and false positives. So moving beyond these failure modes of lexical-based
00:58:35.180 | metrics, the next step is to check for semantic similarities. And model-based metrics are
00:58:40.380 | better at capturing semantic similarity. So this is kind of similar to what was
00:58:45.200 | raised a couple of minutes ago. We can actually use learned representations of words
00:58:50.700 | and sentences to compute semantic similarities between generated and reference text. So
00:58:58.460 | now we are no longer bottlenecked by n-grams. Instead, we're using embeddings. And these
00:59:03.820 | embeddings are going to be pre-trained. But the metrics can still improve because we can
00:59:07.900 | just swap in a different pre-trained model while keeping the metric fixed.
00:59:12.460 | So here are some good examples of the metrics that could be used. One thing is to do vector
00:59:17.540 | similarity. This is very similar to homework one, where you are trying to compute similarity
00:59:22.340 | between words, except now we are trying to compute similarity between sentences. There
00:59:27.620 | are some ideas of how to go from word similarity to sentence similarities. For example, you
00:59:32.100 | can just average the embedding, which is like a relatively naive idea, but it works sometimes.
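For instance, here is a minimal sketch of that averaging idea, using pre-trained GloVe vectors loaded through gensim's downloader; the particular vector set and the sentence pair below are just illustrative choices.

    import numpy as np
    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-50")   # pre-trained word vectors

    def sentence_vector(tokens):
        # Naive sentence embedding: average the word vectors we have.
        vecs = [wv[w] for w in tokens if w in wv]
        return np.mean(vecs, axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    a = sentence_vector("obama speaks to the media in illinois".split())
    b = sentence_vector("the president greets the press in chicago".split())
    print(cosine(a, b))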
00:59:40.260 | Another high-level idea is that we can measure Word Mover's Distance. The idea here is that
00:59:47.460 | we can use optimal transport to align the source and target word embeddings. Suppose
00:59:52.180 | that your source sentence is Obama speaks to the media in Illinois, and the target is
00:59:59.820 | the president greets the press in Chicago. From a human evaluation perspective, these
01:00:04.100 | two are actually very similar, but they are not exactly aligned word by word. So we need
01:00:09.100 | to figure out how to optimally align words to words, like aligning Obama to president and
01:00:14.220 | Illinois to Chicago, and then we can compute a score. We can compute the pairwise
01:00:20.100 | word embedding distances under this alignment, and then get a score for sentence similarity.
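The same sentence pair can be scored with Word Mover's Distance; one convenient route (an assumption here) is gensim's KeyedVectors.wmdistance, which needs an optimal-transport backend such as POT installed. A lower distance means more similar.

    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-50")
    s1 = "obama speaks to the media in illinois".split()
    s2 = "the president greets the press in chicago".split()
    # Optimal-transport cost between the two bags of word vectors.
    print(wv.wmdistance(s1, s2))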
01:00:27.540 | And finally, there is BERTScore, which is also a very popular metric for semantic similarity.
01:00:32.580 | So it first computes pairwise cosine similarities using BERT embeddings, then it finds an
01:00:38.260 | alignment between the source and target sentence, and then it finally computes a
01:00:43.260 | score. So I feel like these details are not really that important, but the high-level
01:00:48.580 | idea is super important: we can now use word embeddings to compute sentence similarities
01:00:56.180 | by doing some sort of smart alignment, and thereby transform from word similarity to sentence
01:01:00.620 | similarity. To move beyond word embeddings, we can also use sentence embeddings to compute
01:01:06.820 | sentence similarities. Typically, this doesn't have the word-by-word alignment
01:01:11.260 | problem, but it has a similar issue in that you now need to align sentences or phrases
01:01:16.580 | within a sentence. And similarly, there is BLEURT, which is slightly different. It is a regression
01:01:22.340 | model based on BERT. So the model is trained as a regression problem to return a score
01:01:29.140 | that indicates how good the text is in terms of grammaticality and similarity to the meaning
01:01:33.620 | of the reference text. So this treats evaluation as
01:01:38.020 | a regression problem. Any questions so far? OK, cool. Let's move on.
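For reference, a minimal sketch of computing BERTScore with the bert-score package; this assumes that package's score helper, which downloads a pre-trained model on first use.

    from bert_score import score

    candidates = ["the president greets the press in chicago"]
    references = ["obama speaks to the media in illinois"]
    P, R, F1 = score(candidates, references, lang="en")
    print(float(P[0]), float(R[0]), float(F1[0]))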
01:01:51.180 | So all the previously mentioned approaches evaluate semantic similarity, so they
01:01:55.500 | can be applied to non-open-ended generation tasks. But what about open-ended settings?
01:02:00.980 | So here, enforcing semantic similarity seems wrong, because a story can be perfectly fluent
01:02:06.220 | and perfectly high quality without having to resemble any of the reference stories.
01:02:11.380 | So one idea here is that maybe we want to evaluate open-ended text generation using
01:02:16.780 | the MAUVE score. MAUVE computes the information divergence in a quantized embedding space
01:02:23.260 | between the generated text and the gold reference text. So here is roughly the detail of what's
01:02:28.420 | going on. Suppose that you have a batch of text from the gold reference that is human
01:02:33.220 | written, and you have a batch of text that's generated by your model. Step number one is
01:02:37.820 | that you want to embed this text. You want to put this text into some continuous representation
01:02:42.340 | space, which is kind of the figure to the left. But it's really hard to compute any
01:02:47.660 | distance metric in this continuous embedding space, because different sentences might actually
01:02:53.140 | lie very far away from each other. So the idea here is that we try to do k-means
01:02:58.300 | clustering to discretize the continuous space into a discrete space. Now, after the discretization,
01:03:04.460 | we can actually have a histogram for the gold human-written text and a histogram for the
01:03:11.220 | machine-generated text. And then we can compute precision and recall using these two discretized
01:03:17.020 | distributions. And then we can compute precision by forward KL and recall by backward KL. Yes,
01:03:23.300 | question?
01:03:24.300 | Why do we want to discretize it? I didn't catch that.
01:03:28.780 | Maybe it's equivalent to answer, why is it hard to work
01:03:34.300 | with the continuous space? The idea is if you embed a sentence into the continuous space,
01:03:40.580 | say that it lies here, and you embed another sentence that lies here,
01:03:44.740 | and suppose that you only have a finite number of sentences. Then they would basically be
01:03:48.660 | Dirac delta distributions on your manifold. So you probably want a smoother
01:03:55.460 | distribution. But it's hard to define what a good, smooth distribution is in the case
01:03:59.940 | of text embeddings, because they're not super interpretable. So eventually, if
01:04:04.100 | you embed everything in a continuous space, you will have lots of Dirac deltas
01:04:10.380 | that are just very high and not really connected to their neighbors. So it's hard to
01:04:17.540 | quantify KL divergence or a distance metric in that space.
01:04:22.220 | For example, you would have to make some assumptions, like a Gaussian assumption
01:04:26.060 | where you smooth all the embeddings by convolving them with a Gaussian. And then you
01:04:31.260 | can start getting some meaningful distance metrics. But with just the embeddings alone,
01:04:37.540 | you're not going to get meaningful distance metrics. And it doesn't really make sense
01:04:40.540 | to smooth things using a Gaussian, because who said word representations are Gaussian-distributed?
01:04:46.060 | Yeah.
01:04:47.060 | Question?
01:04:51.380 | How did you get those continuous-looking distributions in the plot?
01:04:51.380 | I think this requires some Gaussian smoothing. Yeah, I think that the plot is made with some
01:04:55.780 | smoothing. Yeah, I mean, I didn't make the plot, so I couldn't be perfectly sure. But
01:04:59.980 | I think the fact that it looks like this means that you smooth it a little bit.
01:05:03.220 | So you put in word embeddings and--
01:05:06.100 | These are sentence embeddings or concatenated word embeddings, because you are comparing
01:05:10.380 | sentences to sentences, not words to words. Yeah, so the advantage of the MAUVE score is that
01:05:16.980 | it is applicable to open-ended settings, because you are now measuring precision and recall
01:05:22.820 | with regard to the target distribution. Cool. So it has a better probabilistic interpretation
01:05:30.580 | than all the previous similarity metrics. Cool. Any other questions? Yes?
01:05:37.980 | I'm just not entirely clear. So if we're trying to maximize precision and recall here.
01:05:42.980 | Yeah.
01:05:43.980 | How is that different from just trying to maximize the similarity between the target
01:05:47.980 | and the generated distribution?
01:05:48.980 | Oh, yeah, that's a good question. Well, this is for the case where it's really hard
01:05:55.580 | to get exactly the same thing. For example--
01:06:00.580 | I've never tried this myself, but if you try to run MAUVE on a machine translation task,
01:06:05.660 | you might get a very high score. But if you try to run BLEU score on open-ended text
01:06:11.260 | generation, you will get a super low score. So it's just not really measurable, because
01:06:14.900 | everything is so different from each other. So I feel like MAUVE is kind of a middle ground,
01:06:19.780 | where you are trying to evaluate things that are actually very far away from each
01:06:23.460 | other, but you still want a meaningful measure. Of course, if your source and target
01:06:30.380 | are exactly the same or just differ up to some rephrasing, you will get the best
01:06:34.780 | MAUVE score. But maybe that's not really what you're looking for, because in the current
01:06:39.940 | situation, you only have generations that are very far away from the gold text. How
01:06:43.940 | do we evaluate this type of thing?
01:06:46.220 | Yes, question in the back.
01:06:48.900 | I'm still trying to understand the MAUVE score. Is it possible to write out the math, even
01:06:54.500 | in just kind of a pseudo, simple form?
01:06:57.700 | Yeah, I think it's possible. I mean, maybe we can put this discussion after class, because
01:07:02.780 | I kind of want to finish my slides. Yeah, but happy to chat after class. There is a
01:07:08.060 | paper about it if you search for MAUVE score. I think it's probably the best paper in some
01:07:12.860 | ICML or NeurIPS conference as well.
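As a rough sketch of the quantized-KL idea described above (not the full divergence-frontier computation in the MAUVE paper), here is one simplified version; the cluster count, smoothing constant, and random toy embeddings are placeholders.

    import numpy as np
    from sklearn.cluster import KMeans

    def quantized_kls(p_embs, q_embs, k=8, eps=1e-6):
        # Quantize the shared embedding space with k-means, then compare histograms.
        km = KMeans(n_clusters=k, n_init=10).fit(np.vstack([p_embs, q_embs]))
        p_hist = np.bincount(km.predict(p_embs), minlength=k) + eps
        q_hist = np.bincount(km.predict(q_embs), minlength=k) + eps
        p, q = p_hist / p_hist.sum(), q_hist / q_hist.sum()
        kl_pq = float(np.sum(p * np.log(p / q)))   # forward KL, precision-like
        kl_qp = float(np.sum(q * np.log(q / p)))   # backward KL, recall-like
        return kl_pq, kl_qp

    # Toy usage with random vectors standing in for sentence embeddings of
    # human-written (p) and machine-generated (q) text.
    human = np.random.randn(200, 32)
    model = np.random.randn(200, 32) + 0.5
    print(quantized_kls(human, model))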
01:07:16.020 | OK, so moving on. I've pointed out that there are so many evaluation methods. So let's take
01:07:22.300 | a step back and think about what's a good metric for evaluation methods. So how do we
01:07:26.660 | evaluate evaluations? Nowadays, the gold standard is still to check how well this metric is
01:07:32.580 | aligned with human judgment. So if a metric matches human preference, in other words, if
01:07:40.980 | the metric correlates very strongly with human judgment, then we say that the metric is a
01:07:45.100 | good metric. So in this plot, people have plotted BLEU score and human score on the y and x
01:07:52.140 | axes respectively. And because we don't see a strong correlation, this
01:07:56.580 | kind of suggests that BLEU score is not a very good metric.
01:08:01.660 | So actually, the gold standard for evaluating language
01:08:07.860 | models is always to do human evaluation. Automatic metrics fall short of matching human
01:08:14.860 | decisions. And human evaluation is kind of the most important criterion for evaluating
01:08:20.620 | text that is generated from a model. And it's also the gold standard in developing
01:08:25.200 | automatic metrics because we want everything to match human evaluation.
01:08:31.620 | So what do we mean by human evaluation? How is it conducted? Typically, we will provide
01:08:36.780 | human annotators with some axes that we care about, like fluency, coherence for open-ended
01:08:43.460 | text generation. Suppose that we also care about factuality for summarization. We care
01:08:47.900 | about style of the writing and common sense, for example, if we're trying to write a children's
01:08:52.740 | story.
01:08:56.540 | Essentially, another thing to note is that please don't compare human evaluations across
01:09:00.720 | different papers or different studies, because human evaluations tend to not be well-calibrated
01:09:05.660 | and are not really reproducible. Even though we believe that human evaluations are the
01:09:10.900 | gold standard, there are still many drawbacks. For example, human evaluations are really
01:09:15.580 | slow and expensive. But even beyond the slowness and expense, they are still not perfect
01:09:23.300 | because, first, the results may be inconsistent and may not be very
01:09:28.220 | reproducible. So if you ask the same human whether they like A or B, they might say A
01:09:32.100 | the first time and B the second time. And then human evaluations are typically not really
01:09:37.500 | logical. And sometimes, human annotators might misinterpret your question. Suppose that you
01:09:44.580 | want them to measure coherence of the text. Different people have different criteria for
01:09:48.860 | coherence. Some people might think coherence is equivalent to fluency, and then they look
01:09:53.020 | for grammaticality errors. Some people might think coherence means how well your continuation
01:09:58.860 | is aligned with the prompt or the topic. So there are all sorts of misunderstandings that
01:10:05.100 | might make human evaluation very hard.
01:10:08.420 | And finally, human evaluation only measures precision, not recall. This means that you
01:10:13.380 | can give a sentence to human and ask the human, how do you like the sentence? But you couldn't
01:10:18.220 | ask the human whether this model is able to generate all possible sentences that are good.
01:10:24.460 | So it's only a precision-based metric, not a recall-based metric.
01:10:28.260 | So here are two approaches that try to combine human evaluations with modeling. For example,
01:10:36.260 | the first idea is basically trying to learn a metric from human judgment, basically by
01:10:42.540 | trying to use human judgment data as training data, and then train a model to simulate human
01:10:48.420 | judgment. And the second approach is trying to ask human and model to collaborate so that
01:10:54.780 | the human would be in charge of evaluating precision, whereas the model would be in charge
01:10:58.940 | of evaluating recall.
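As a rough sketch of the first idea, learning a metric from human judgments, one could fine-tune a small regression model on (text, human rating) pairs; the model name and the two toy examples below are placeholder assumptions.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=1)   # single scalar = predicted human score
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    texts = ["heck yes !", "heck no !"]            # generated outputs shown to annotators
    ratings = torch.tensor([[0.9], [0.1]])         # their (normalized) quality judgments

    batch = tok(texts, return_tensors="pt", padding=True)
    preds = model(**batch).logits                  # shape (2, 1)
    loss = torch.nn.functional.mse_loss(preds, ratings)
    loss.backward()
    optimizer.step()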
01:11:01.140 | Also, we have tried approaches in terms of evaluating models interactively. So in this
01:11:07.380 | case, we not only care about the output quality, we also care about how the person feels when
01:11:14.460 | they interact with the model, when they try to be a co-author with the model, and how
01:11:18.860 | the person feels about the writing process, et cetera. So this is called trying to evaluate
01:11:24.900 | the models more interactively.
01:11:29.300 | So the takeaway here is that content overlap is a bad metric. Model-based metrics are
01:11:35.900 | better because they are more focused on semantics, but they're still not good enough. Human judgment
01:11:40.900 | is the gold standard, but it's hard to do a human study
01:11:45.140 | well. And in many cases, and this is a hint for the final project, the best judge of the output
01:11:51.420 | quality is actually you. So if you want to do a final project in natural language generation,
01:11:56.980 | you should look at the model output yourself. And don't just rely on the numbers that are
01:12:01.740 | reported by BLEU score or something.
01:12:04.940 | Cool. So finally, we will discuss ethical considerations of natural language generation
01:12:11.220 | problems. So as language models get better and better, ethical considerations become
01:12:16.980 | much more pressing. So we want to ensure that the models are well-aligned with human values.
01:12:21.940 | For example, we want to make sure the models are not harmful, they are not toxic, and we
01:12:26.820 | want to make sure that the models are unbiased and fair to all demographic groups.
01:12:31.460 | So for example here, we also don't want the model to generate any harmful content. Basically,
01:12:37.460 | I try to prompt ChatGPT to say, can you write me some toxic content? ChatGPT politely refused
01:12:42.860 | me, which I'm quite happy about. But there are other people who try to jailbreak ChatGPT.
01:12:51.860 | The idea here is that, internally, I think they probably implement
01:12:56.060 | some detection tools so that when you try to prompt it adversarially, it's going to
01:13:00.380 | refuse to do adversarial things. But there are many very complicated ways to prompt
01:13:06.820 | ChatGPT so that you can get past the firewall and still ask ChatGPT to generate
01:13:13.420 | some bad things.
01:13:22.140 | So another problem with these large language models is that they are not necessarily truthful.
01:13:27.940 | So for example, there was this very famous news story where Google's model generated factual
01:13:33.460 | errors, which is quite disappointing. But the way the model talks about it is very convincing.
01:13:41.660 | So you wouldn't really know that it's a factual error unless you go check that this is not
01:13:46.100 | the first picture or something.
01:13:51.060 | So we want to avoid this type of problem. Actually, the models have already been trying
01:13:55.700 | very hard to refrain from generating harmful content. But for models that are more open-source
01:14:03.540 | and are smaller, the same problem still appears. And then typically, when we do our final projects
01:14:09.460 | or when we work with models, we are probably going to deal with much smaller models. And
01:14:13.500 | then therefore, we need to think about ways to deal with these problems better.
01:14:17.540 | So text generation models are often constructed from pre-trained language models. And then
01:14:21.980 | pre-trained language models are trained on internet data, which contains lots of harmful
01:14:25.940 | stuff and bias. So when the models are prompted for this information, they will just repeat
01:14:33.340 | the negative stereotypes that they learn from the internet training data.
01:14:37.060 | So one way to avoid this is to do extensive data cleaning so that the pre-training data
01:14:41.980 | does not contain any bias or stereotypical content. However, this is going to be very
01:14:46.700 | labor-intensive and almost impossible to do because filtering a large amount of internet
01:14:51.100 | data is just so costly that it's not really possible.
01:14:56.860 | Again, with existing language models like GPT-2 Medium, there are some adversarial inputs
01:15:03.860 | that almost always trigger toxic content. And these models might be exploited in the
01:15:09.100 | real world by ill-intended people. So for example, there is a paper about universal
01:15:15.820 | adversarial triggers where the authors just find some universal set of words that would
01:15:21.060 | trigger toxic content from the model.
01:15:28.300 | And sometimes, even if you don't try to trigger the model, the model might still start to
01:15:32.180 | generate toxic content by itself. So in this case, the pre-trained language models are
01:15:38.100 | prompted with very innocuous prompts, but they still degenerate into toxic content.
01:15:43.540 | So the takeaway here is that models really shouldn't be deployed without proper safeguards
01:15:48.940 | to control for toxic content or any harmful contents in general. And models should not
01:15:53.420 | be deployed without careful considerations of how users will interact with these models.
01:16:02.300 | So in the ethics section, one major takeaway is that we are trying to advocate that you
01:16:07.460 | need to think more about the model that you are building. So before deploying or publishing
01:16:13.420 | any NLG models, please check that the model's output is not harmful. And please check that
01:16:19.340 | the model is robust to trigger words and other adversarial prompts.
01:16:25.380 | And of course, there are more. So well, basically, one can never do enough to improve the ethics
01:16:30.660 | of text generation systems. And OK, cool. I still have three minutes left, so I can
01:16:35.420 | still do concluding thoughts. Today, we talked about the exciting applications
01:16:41.380 | of natural language generation systems. But one might think that, well, given that ChatGPT
01:16:48.460 | is already so good, are there any other things that we can do research-wise? If you try interacting
01:16:53.380 | with these models, you can actually see that
01:16:58.580 | there are still lots of limitations in their skills and performance. For example, ChatGPT
01:17:02.940 | is able to do a lot of things with manipulating text, but it can't really create interesting
01:17:09.540 | content, and it can't really think deeply about things. So there is lots of headroom,
01:17:15.860 | and there are still many improvements ahead.
01:17:18.460 | And evaluation remains a really huge challenge in natural language generation. Basically,
01:17:23.900 | we need better ways to automatically evaluate performance of NLG models, because human evaluations
01:17:29.500 | are expensive and not reproducible. So it's better to figure out ways to compile all those
01:17:36.900 | human judgments into a very reliable and trustworthy model.
01:17:41.620 | And also, with the advance of all these large-scale language models, doing neural natural language
01:17:48.300 | generation has been reset. And it's never been easier to jump into this space, because
01:17:54.340 | now there are all the tools that are already there for you to build upon.
01:17:58.740 | And finally, it is one of the most exciting and fun areas of NLP to work on. So yeah,
01:18:03.340 | I'm happy to chat more about NLG if you have any questions, both after class and in class,
01:18:08.580 | I guess, in one minute.
01:18:10.580 | OK, cool. That's everything. So do you have any questions? If you don't, we can end the
01:18:17.300 | class.