Stanford CS224N NLP with Deep Learning | 2023 | Lecture 11 - Natural Language Generation
00:00:00.000 |
Hello everyone, my name is Lisa. I'm a third year PhD student in the NLP group. I'm advised 00:00:11.080 |
by Percy and Tatsu. Today I will give a lecture on natural language generation. And this is 00:00:17.080 |
also the research area that I work on. So I'm super excited about it. I'm happy to answer 00:00:21.400 |
any questions both during the lecture and after class about natural language generation. 00:00:26.580 |
So NLG is a super exciting area and it's also moving really, really fast. So today we will 00:00:32.840 |
discuss all the excitement of NLG. But before we get into the really exciting part, I have 00:00:39.120 |
to make some announcements. So first, it is very, very important for you to remember to 00:00:44.060 |
sign up for AWS by midnight today. So this is related to our homework five, whether you 00:00:51.140 |
have GPU access and then also related to our final project. So please, please remember 00:00:56.180 |
to sign up for AWS by tonight. And second, the project proposal is due on Tuesday, next 00:01:04.300 |
Tuesday. And I think assignment four is just due. Hopefully you had fun with machine 00:01:10.660 |
translation and stuff. And also assignment five is out today, I think just now. And it 00:01:17.420 |
is due on Friday, basically Friday midnight. And last, we will hold a Hugging Face Transformer 00:01:27.260 |
Library tutorial this Friday. So if your final project is related to implementing transformers 00:01:33.820 |
or playing with large language models, you should definitely go to this tutorial because 00:01:37.260 |
it's going to be very, very helpful. Also, yeah, just one more time, please remember 00:01:42.180 |
to sign up for AWS because this is the final hard deadline. Okay, cool. Now moving on to 00:01:49.580 |
the main topic for today, the very exciting natural language generation stuff. So today, 00:01:54.700 |
we will discuss what is NLG, review some models, discuss about how to decode from language 00:02:00.620 |
models and how to train language models. And we will also talk about evaluations. And finally, 00:02:06.780 |
we'll discuss ethical and risk considerations with the current NLG systems. So these natural 00:02:12.500 |
language generation techniques are going to be really exciting because this is kind of 00:02:17.260 |
getting us closer to explaining the magic of ChatGPT, which is a super popular model recently. 00:02:22.860 |
And practically speaking, they could also help you with your final project if you decide 00:02:26.700 |
to work on something related to text generation. So let's get started. To begin with, let's 00:02:32.660 |
ask the question of what is natural language generation. So natural language generation 00:02:37.940 |
is actually a really broad category. People have divided NLP into natural language understanding 00:02:44.180 |
and natural language generation. So the understanding part mostly means that the task input is in 00:02:49.900 |
natural language, such as semantic parsing, natural language inference, and so on. Whereas 00:02:56.260 |
natural language generation means that the task output is in natural language. So NLG 00:03:02.780 |
focuses on systems that produce fluent, coherent, and useful language outputs for human to use. 00:03:09.700 |
Historically, there are many NLG systems that use rule-based systems, such as templates 00:03:15.820 |
or infilling. But nowadays, deep learning is powering almost every text generation systems. 00:03:22.460 |
So this lecture today will be mostly focused on deep learning stuff. 00:03:27.940 |
So first, what are some examples of natural language generation? It's actually everywhere, 00:03:33.140 |
including our homework. Machine translation is a form of NLG, where the input is some 00:03:38.460 |
utterance in the source language, and the output is generated text in the target language. 00:03:44.500 |
Digital assistants, such as Siri or Alexa, are also NLG systems. They take in 00:03:50.140 |
dialogue history and generate continuations of the conversation. There are also summarization 00:03:56.620 |
systems that take in a long document, such as a research article, and the idea is 00:04:02.300 |
to summarize it into a few sentences that are easy to read. 00:04:07.460 |
So beyond these classic tasks, there are some more interesting uses, like creative storywriting, 00:04:13.500 |
where you can prompt a language model with a story plot, and then it will give you some 00:04:17.860 |
creative stories that are aligned with the plot. There is data to text, where you give 00:04:22.660 |
the language model some database or some tables, and then the idea is that it will output some 00:04:27.980 |
textual description of the table content. And finally, there is also visual description-based 00:04:33.540 |
NLG systems, like image captioning or image-based storytelling. 00:04:40.380 |
So the really cool example is the popular ChatGPT models. So ChatGPT is also an NLG 00:04:48.500 |
system. It is very general purpose, so therefore you can use it to do many different tasks 00:04:54.860 |
with different prompts. For example, we can use ChatGPT to simulate a chatbot. It can 00:05:01.220 |
answer questions about creative gifts for 10-year-olds. It can be used to do poetry generation. 00:05:08.820 |
For example, we can ask it to generate a poem about sorting algorithms. And it's actually, 00:05:14.100 |
well, I wouldn't say it's very poetic, but at least it has the same format as a poem. 00:05:22.740 |
So ChatGPT can also be used in some really useful settings, like web search. So here, 00:05:30.180 |
Bing is augmented with ChatGPT, and there are some tweets saying that the 00:05:34.100 |
magic of ChatGPT is that it actually makes people happy to use Bing. 00:05:42.700 |
So there are so many tasks that actually belong to the NLG category. So how do we categorize 00:05:47.380 |
these tasks? One common way is to think about the open-endedness of the task. So here, we 00:05:53.100 |
draw a line for the spectrum of open-endedness. On the one end, we have tasks like machine 00:05:58.820 |
translation and summarization. So we consider them not very open-ended, because for each 00:06:04.500 |
source sentence, the output is almost determined by the input. Because we are trying 00:06:10.580 |
to do machine translation, the semantics of the output should be essentially the same as the input sentence. 00:06:15.540 |
So there are only a few ways that you can rephrase the output, like authorities have 00:06:19.500 |
announced that today is a national holiday. You can rephrase it a little bit to say, today 00:06:24.060 |
is a national holiday announced by the authorities. But the actual space is really small, because 00:06:29.500 |
you have to make sure the semantics doesn't change. So we can say that the output space is quite limited. 00:06:37.900 |
And moving to the middle of the spectrum, there is dialogue tasks, such as task-driven 00:06:42.260 |
dialogue or a chitchat dialogue. So we can see that for each dialogue input, there are 00:06:47.100 |
multiple responses, and the degree of freedom has increased. Here, we can respond by saying 00:06:53.580 |
good and you, or we can say about, thanks for asking, barely surviving all my homeworks. 00:06:59.980 |
So here, we are observing that there are actually multiple ways to continue this conversation. 00:07:04.780 |
And then this is where we say the output space is getting more and more diverse. 00:07:09.900 |
And on the other end of the spectrum, there is the very open-ended generation tasks, like 00:07:15.020 |
story generation. So given the input, like write me a story about three little pigs, 00:07:20.060 |
there are so many ways to continue the prompt. We can write about them going to schools, 00:07:24.300 |
building houses, like they always do. So the valid output space here is extremely large. 00:07:33.940 |
So it's hard to really draw a boundary between open-ended and non-open-ended tasks. But we 00:07:38.820 |
still try to give a rough categorization. So open-ended generation refers to tasks whose 00:07:44.140 |
output distribution has a high degree of freedom. Or non-open-ended generation tasks refers 00:07:50.580 |
to tasks where the input will almost certainly determine the output generation. Examples 00:07:57.260 |
of non-open-ended generations are machine translation, summarization. And examples of 00:08:01.940 |
open-ended generations are story generation, chitchat dialogue, task-oriented dialogue, and so on. 00:08:08.060 |
So how do we formalize this categorization? One way of formalizing is by computing the 00:08:13.500 |
entropy of the NLG system. So high entropy means that we are to the right of the spectrum. 00:08:20.220 |
So it is more open-ended. And low entropy means that we are to the left of the spectrum 00:08:25.420 |
and less open-ended. So these two classes of NLG tasks actually require different decoding 00:08:32.020 |
and training approaches, as we will talk about later. 00:08:35.420 |
OK, cool. Now let's recall some previous lectures and review the NLG models and trainings that 00:08:42.260 |
we have studied before. So I think we discussed the basics of natural language generation. 00:08:48.780 |
So here is how autoregressive language model works. At each time step, our model would 00:08:53.420 |
take in a sequence of tokens as input. And here it is y less than t. And the output is 00:09:00.540 |
basically the new token yt. So to decide on yt, we first use the model to assign a score 00:09:06.980 |
for each token in the vocabulary, denoted as s. And then we apply softmax to get the 00:09:12.740 |
next token distribution, p. And we choose a token according to this next token distribution. 00:09:19.300 |
And similarly, once we have predicted yt hat, we then pass it back into the language model 00:09:23.300 |
as the input, to predict y hat t plus 1. And then we do so recursively until we reach the end token. 00:09:32.020 |
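As a concrete illustration (not from the lecture itself), here is a minimal Python sketch of this autoregressive loop. The toy vocabulary and the random score function are stand-ins for a trained language model, and the decoding rule g here is just greedy argmax:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumption)

def toy_score_fn(prefix):
    # Stand-in for a trained LM: a real model would condition on y_{<t} (prefix);
    # here we just return one random score per vocabulary token.
    return rng.normal(size=len(VOCAB))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def generate(score_fn, max_len=10):
    ys = []                                   # y_{<t}, grows by one token per step
    for _ in range(max_len):
        p = softmax(score_fn(ys))             # next-token distribution p
        yt = int(np.argmax(p))                # decoding rule g (greedy here)
        ys.append(yt)                         # feed y_hat_t back in as input
        if VOCAB[yt] == "<eos>":              # stop once we reach the end token
            break
    return [VOCAB[i] for i in ys]

print(generate(toy_score_fn))
```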
So any questions so far? OK, good. So for the two types of NLG tasks that we talked 00:09:40.620 |
about, like the open-ended and non-open-ended tasks, they tend to prefer different model 00:09:45.140 |
architectures. So for non-open-ended tasks, like machine translation, we typically use 00:09:50.660 |
an encoder-decoder system, where the autoregressive decoder that we just talked about functions 00:09:56.060 |
as the decoder. And then we have another bidirectional encoder for encoding the inputs. So this is 00:10:01.260 |
kind of what you implemented for assignment 4, because the encoder is like the bidirectional 00:10:07.140 |
LSTM, and the decoder is another LSTM that is autoregressive. 00:10:12.660 |
So for more open-ended tasks, typically autoregressive generation model is the only component. Of 00:10:21.180 |
course, these architectures are not really hard constraints, because an autoregressive 00:10:25.860 |
decoder alone can also be used to do machine translation. And an encoder-decoder model 00:10:30.660 |
can also be used for story generation. So this is kind of the convention for now, but 00:10:35.980 |
it's a reasonable convention, because using decoder-only model for MT tends to hurt performance 00:10:42.180 |
compared to an encoder-decoder model for MT. And using an encoder-decoder model for open-ended 00:10:47.500 |
generation seems to achieve similar performance to a decoder-only model. And therefore, if 00:10:53.260 |
you have the compute budget to train an encoder-decoder model, you might just be better off by only 00:10:57.900 |
training a larger decoder model. So it's kind of more of an allocation of resources problem 00:11:02.740 |
than whether this architecture will type check with your task. 00:11:08.700 |
So how do we train such a language model? In previous lectures, we talked about that 00:11:15.540 |
the language models are trained by maximum likelihood. So basically, we were trying to 00:11:20.740 |
maximize the probability of the next token, yt, given the preceding words. And this is 00:11:26.260 |
our optimization objective. So at each time step, this can be regarded as a classification 00:11:32.540 |
task, because we are trying to distinguish the actual word, yt star, from all the remaining 00:11:38.340 |
words in the vocabulary. And this is also called teacher forcing, because at each time 00:11:43.780 |
step, we are using the gold standard, y star less than t, as input to the model. Whereas, 00:11:52.940 |
presumably, at generation time, you wouldn't have any access to y star. So you would have 00:11:57.140 |
to use the model's own prediction to feed it back into the model to generate the next 00:12:01.340 |
token. And that is called student forcing, which we'll talk in detail later. 00:12:06.500 |
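A minimal sketch of one teacher-forced training step, assuming a toy LSTM language model and random "gold" token ids (none of this is the course's actual starter code):

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32                   # hypothetical sizes
embed = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTM(hidden, hidden, batch_first=True)
out = nn.Linear(hidden, vocab_size)

gold = torch.randint(0, vocab_size, (4, 16))   # a batch of gold sequences y*

# Teacher forcing: inputs at every step are the gold tokens y*_{<t},
# and the targets are the gold next tokens y*_t (the inputs shifted by one).
inputs, targets = gold[:, :-1], gold[:, 1:]
hidden_states, _ = lstm(embed(inputs))
logits = out(hidden_states)                    # scores S at every position

# Negative log-likelihood of the gold next token = the MLE objective.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                # an optimizer step would follow
```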
Yeah, I think I skipped two slides ago. About autoregressive, we never used that word before. 00:12:15.740 |
Autoregressive? Oh, it just means like-- so let's look at this animation again. Oops, 00:12:22.380 |
sorry. It just looks like you are generating word from left to right, one by one. So here, 00:12:28.020 |
suppose that you are given y less than t. And then autoregressively, you first generate 00:12:32.660 |
yt. And then once you have yt, you'll feed it back in, generate yt plus 1, and then feed 00:12:37.780 |
it back in, generate another thing. So this left to right nature, because you are using 00:12:41.260 |
chain rule to condition on the tokens that you just generated, this chain rule thing is why we call it autoregressive. 00:12:48.740 |
And typically, I think conventionally, we are doing left to right autoregressive by 00:12:52.460 |
generating from left to right. But there are also other more interesting models that can 00:12:56.580 |
do backward or infill and other things. This idea of generating one token at a time is autoregressive. 00:13:13.340 |
So at inference time, our decoding algorithm would define a function to select a token 00:13:19.020 |
from this distribution. So we've discussed that we can use the language model to compute 00:13:23.820 |
this p, which is the next token distribution. And then g here, based on our notation, is 00:13:29.140 |
the decoding algorithm, which helps us select what token we are actually going to use for the next time step. 00:13:34.700 |
So the obvious decoding algorithm is to greedily choose the highest probability token as yt 00:13:40.660 |
for each time step. So while this basic algorithm sort of works, because they work for your 00:13:45.340 |
homework 4, to do better, there are two main avenues that we can take. We can decide to 00:13:50.500 |
improve decoding. And we can also decide to improve the training. 00:13:55.020 |
Of course, there are other things that we can do. We can improve training data. And 00:13:58.340 |
we can improve model architectures. But for this lecture, we will focus on decoding and training. 00:14:04.660 |
So now let's talk about how decoding algorithms work for natural language generation 00:14:09.860 |
models. Before that, I'm happy to take any questions about the previous slides. 00:14:14.980 |
Sorry, could you just explain one more time the difference between teacher forcing and student forcing? 00:14:16.980 |
I think I'll go into this in detail later. But sure. So basically, for teacher forcing, 00:14:28.820 |
the idea is you do teacher forcing where you train the language model, because you already 00:14:32.500 |
observe the gold text. So you use the gold text up until time step t, put it into the 00:14:38.500 |
model. And then the model would try to predict y t plus 1. 00:14:42.900 |
Whereas student forcing means that you don't have access to this gold reference data. Instead, 00:14:48.020 |
but you are still trying to generate a sequence of data. So you have to use the text that 00:14:51.380 |
you generated yourself using the model, and then feed it back into the model as input 00:14:55.580 |
to predict t plus 1. That's the primary difference. 00:15:00.420 |
Cool. So what is decoding all about? At each time step, our model computes a vector of 00:15:08.180 |
score for each token. So it takes in preceding context y less than t and produce a score 00:15:13.860 |
s. And then we try to compute a probability distribution p out of the scores by just applying 00:15:19.980 |
softmax to normalize them. And our decoding algorithm is defined as this function g, which 00:15:26.860 |
takes in the probability distribution and try to map it to some word. Basically, try 00:15:31.580 |
to select a token from this probability distribution. 00:15:35.140 |
So in the machine translation lecture, we talked about greedy decoding, which selects 00:15:40.300 |
the highest probability token of this p distribution. And we also talk about beam search, which 00:15:47.380 |
has the same objective as greedy decoding, which is that we are both trying to find the 00:15:51.980 |
most likely string defined based on the model. But instead of doing so greedily for beam 00:15:56.780 |
search, we actually explore a wider range of candidates. So we have a wider exploration 00:16:02.140 |
of candidates by keeping always k candidates in the beam. 00:16:08.340 |
So overall, this maximum probability decoding is good for low entropy tasks like machine 00:16:13.180 |
translation and summarization. But it actually encounters more problems for open-ended generation. 00:16:19.420 |
So the most likely string is actually very repetitive when we try to do open-ended text 00:16:24.380 |
generation. As we can see in this example, the context is perfectly normal. It's about 00:16:33.140 |
And by the continuation, the first part of it looks great. It's like valid English. It 00:16:38.380 |
talks about science. But suddenly, it starts to repeat. And it starts to repeat, I think, 00:16:46.660 |
So why does this happen? If we look at, for example, this plot, which shows the language 00:16:54.500 |
model's probability assigned to the sequence I don't know, we can see here is the pattern. 00:17:00.140 |
It has regular probability. But if we keep repeating this phrase, I don't know, I don't 00:17:04.460 |
know, I don't know, for 10 times, then we can see that there is a decreasing trend in 00:17:09.300 |
their negative log likelihood. So the y-axis is the negative log probability. 00:17:14.020 |
We can see this decreasing trend, which means that the model actually has higher probability 00:17:18.740 |
as the repeat goes on, which is quite strange because it's suggesting that there is a self-amplification 00:17:25.180 |
effect. So the more repeats we have, the more confident the model becomes about this repetition. 00:17:32.420 |
And this keeps going on. We can see that for I am tired, I'm tired, repeat 100 times, we 00:17:36.540 |
can see a continuously decreasing trend until the model is almost 100% sure that it's going to repeat again. 00:17:45.900 |
And sadly, this problem is not really solved by architecture. Here, the red plot is a LSTM 00:17:53.020 |
model, and the blue curve is a transformer model. We can see that both models kind of 00:17:57.380 |
suffer from the same problem. And scale also doesn't solve this problem. So we kind of 00:18:02.060 |
believe that scale is the magical thing in NLP. But even models with 175 billion parameters 00:18:09.060 |
will still suffer from repetition if we try to find the most likely string. 00:18:16.300 |
So how do we reduce repetition? One canonical approach is to do n-gram blocking. So the 00:18:22.180 |
principle is fairly simple. Basically, you just don't want to see the same n-gram twice. 00:18:27.460 |
If we set n to be 3, then for any text that contains the phrase "I am happy," the next 00:18:32.300 |
time you see the prefix "I am," n-gram blocking would automatically set the probability of 00:18:37.340 |
happy to be 0 so that you will never see this n-gram, this trigram again. 00:18:43.100 |
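Here is a rough sketch of how trigram blocking can be applied to the next-token distribution. This is my own simplified version; libraries like Hugging Face Transformers expose a similar idea through the no_repeat_ngram_size argument of generate, usually applied to logits rather than probabilities:

```python
import numpy as np

def block_repeated_ngrams(prefix_tokens, probs, n=3):
    """Zero out any next token that would repeat an n-gram already in the prefix.

    prefix_tokens: list of token ids generated so far (y_{<t})
    probs:         next-token probability distribution over the vocabulary
    """
    probs = np.asarray(probs, dtype=float).copy()
    if len(prefix_tokens) < n - 1:
        return probs
    seen = {tuple(prefix_tokens[i:i + n]) for i in range(len(prefix_tokens) - n + 1)}
    recent = tuple(prefix_tokens[-(n - 1):])         # e.g. the token ids for "I am"
    for tok in range(len(probs)):
        if recent + (tok,) in seen:                  # would complete a seen n-gram
            probs[tok] = 0.0                         # so it is forbidden
    total = probs.sum()
    return probs / total if total > 0 else probs     # renormalize what is left
```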
But clearly, this n-gram blocking heuristic has some problems because sometimes it is 00:18:48.020 |
quite common for you to want to see a person's name appear twice or three times or even more 00:18:52.500 |
in a text. But this n-gram blocking will eliminate that possibility. 00:18:57.300 |
So what are better options that possibly are more complicated? For example, we can use 00:19:02.700 |
a different training objective. Instead of training by MLE, we can train by unlikelihood 00:19:08.300 |
objective. So in this approach, the model is actually penalized for generating already 00:19:14.660 |
seen tokens. So it's kind of like putting this n-gram blocking idea into training time. 00:19:20.700 |
Rather than at decoding time for this constraint, at training time, we just decrease the probability 00:19:24.460 |
of repetition. Another training objective is coverage loss, which uses the attention 00:19:31.420 |
mechanism to prevent repetition. So basically, if you try to regularize and enforce your 00:19:35.980 |
attention so that it's always attending to different words for each token, then it is 00:19:41.260 |
highly likely that you are not going to repeat because repetition tends to happen when you 00:19:45.700 |
have similar attention patterns. Another different angle is that instead of searching for the 00:19:51.980 |
most likely string, we can use a different decoding objective. So maybe we can search 00:19:56.500 |
for strings that maximizes the difference between log probabilities of two models. Say 00:20:01.940 |
that we want to maximize the log probability under a large model minus the log probability under a small 00:20:06.260 |
model. Because both models are repetitive, they would both assign 00:20:10.860 |
high probability to repetition, and so under this new 00:20:14.900 |
objective, the repetitive continuations cancel out and are effectively penalized. 00:20:20.820 |
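This difference-of-log-probabilities idea is essentially what contrastive decoding does. Below is a toy, per-step sketch; the plausibility mask, which keeps only tokens the large model itself finds reasonably likely, is a simplification of how such methods avoid rewarding tokens that are merely improbable under the small model:

```python
import numpy as np

def contrastive_scores(logp_large, logp_small, alpha=0.1):
    """Score each candidate token by log p_large(x) - log p_small(x).

    Repetitive continuations get high probability under BOTH models, so the
    difference cancels them out. Only tokens within a factor alpha of the
    large model's best token are considered at all (plausibility constraint).
    """
    mask = logp_large >= np.log(alpha) + logp_large.max()
    return np.where(mask, logp_large - logp_small, -np.inf)

# Toy example: token 0 is a repetition both models love; token 1 is a fresh word.
logp_large = np.log(np.array([0.50, 0.30, 0.20]))
logp_small = np.log(np.array([0.55, 0.05, 0.40]))
print(contrastive_scores(logp_large, logp_small).argmax())  # picks token 1
```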
So here comes the broader question. Is finding the most likely string even a reasonable thing 00:20:26.420 |
to do for open-ended text generation? The answer is probably no, because this doesn't 00:20:32.540 |
really match human pattern. So we can see in this plot, the orange curve is the human 00:20:37.060 |
pattern, and the blue curve is the machine-generated text using beam search. So you can see that 00:20:42.100 |
with human text, there is actually lots of uncertainty, as we can see by the fluctuation 00:20:47.860 |
of the probabilities. For some words, we can be very certain. For some words, we are a 00:20:52.580 |
little bit unsure. Whereas here, for the model distribution, it's always very sure. It's 00:20:56.580 |
always assigning probability 1 to the sequence. 00:20:59.500 |
So because we now are seeing a-- basically, there is a mismatch between the two distributions. 00:21:06.080 |
So it's kind of suggesting that maybe searching for the most likely string is not the right 00:21:10.940 |
decoding objective at all. Any questions so far before we move on? Yeah? 00:21:15.940 |
So is this the underlying mechanism for some detector of whether some text is generated 00:21:24.540 |
Not really, because this can only detect the really simple things that humans are also 00:21:28.740 |
able to detect, like repetition. So in order to avoid the previous problems that we've 00:21:34.420 |
talked about, I'll talk about some other decoding families that generate more robust text that 00:21:40.180 |
actually look like this, whose probability distribution looks like the orange curve. 00:21:45.500 |
So I wouldn't say this is the go-to answer for watermarking or detection. 00:21:53.700 |
Oh, yeah. OK, cool. So she asked about whether this mechanism of plotting the probabilities 00:22:00.180 |
of human text and machine-generated text is one way of detecting whether some text is 00:22:05.980 |
generated by a model or a human. And my answer is, I don't think so, but this could be an 00:22:11.900 |
interesting research direction. Because I feel like there are more robust decoding approaches 00:22:17.540 |
that generate text that actually fluctuates a lot. 00:22:24.400 |
So yeah, let's talk about the decoding algorithm that is able to generate text that fluctuates. 00:22:29.260 |
So given that searching for the most likely string is a bad idea, what else should we 00:22:33.580 |
do? And how do we simulate that human pattern? And the answer to this is to introduce randomness 00:22:41.860 |
So suppose that we are sampling a token from this distribution, P. Basically, we are trying 00:22:48.420 |
to sample YT hat from this distribution. It is random so that you can essentially sample 00:22:53.420 |
any token in the distribution. Previously, you were kind of restricted to selecting, say, restroom 00:22:57.460 |
or grocery. But now you can select bathroom instead. 00:23:02.980 |
So however, sampling introduces a new set of problems. Since we never really zero out 00:23:08.580 |
any token probabilities, vanilla sampling would make every token in the vocabulary a 00:23:14.100 |
viable option. And in some unlucky cases, we might end up with a bad word. 00:23:20.040 |
So assuming that we already have a very well-trained model, even if most of the probability mass 00:23:26.540 |
of the distribution is over the limited set of good options, the tail of the distribution 00:23:31.580 |
will still be very long because we have so many words in our vocabulary. And therefore, 00:23:36.700 |
if we add up all those long tails, in aggregate they still have a considerable mass. So statistically 00:23:42.060 |
speaking, this is called heavy tail distribution. And language is exactly a heavy tail distribution. 00:23:47.980 |
So for example, many tokens are probably really wrong in this context. And then given that 00:23:54.420 |
we have a good language model, we assign them each very little probability. 00:23:58.740 |
But this doesn't really solve the problem because there are so many of them. So you 00:24:02.500 |
aggregate them as a group. We'll still have a high chance of being selected. 00:24:08.180 |
And the solution here that we have for this problem of long tail is that we should just 00:24:12.380 |
cut off the tail. We should just zero out the probabilities that we don't want. And 00:24:16.900 |
one idea is called top-k sampling, where the idea is that we would only sample from 00:24:23.580 |
the top k tokens in the probability distribution. 00:24:31.580 |
Well, the model we were looking at a second ago had some very low probability samples 00:24:39.940 |
as well on the graph, right? How would top-k sampling deal with that? 00:24:46.420 |
You mean the orange-blue graph of the human versus-- 00:24:51.660 |
Oh, yeah. So top k will basically eliminate-- it will make it impossible to generate the 00:24:58.940 |
super low probability tokens. So technically, it's not exactly simulating this pattern because 00:25:04.700 |
now you don't have the super low probability tokens, whereas human can generate super low 00:25:08.780 |
probability tokens in a fluent way. But yeah, that could be another hint that people can use for detection. 00:25:18.980 |
It also depends on the type of text you want to generate, for example, for more novels 00:25:24.500 |
or more creative writing. Is it then you decide the hyperparameter? 00:25:28.820 |
Yeah, yeah, for sure. K is a hyperparameter. Depending on the type of task, you will choose 00:25:33.020 |
K differently. Mostly for a closed-ended task, K should be small. And for open-ended, K should be larger. 00:25:41.660 |
How come-- I guess intuitively, this builds off of one of the earlier questions. Why don't 00:25:46.820 |
we consider the case where we sample, and then we just weight the probability of each 00:25:52.020 |
word by its score or something, rather than just looking at top K? We don't do a weighted 00:25:57.580 |
sampling type of situation. So we still have that small but non-zero probability of selecting. 00:26:03.740 |
I think top K is also weighted. So top K just zeroes out all the tails of the distribution. 00:26:11.380 |
But for the things that it didn't zero out, it's not a uniform choice among the K. It's 00:26:16.260 |
still trying to choose proportional to the scores that you computed. 00:26:20.540 |
Is that just like a computationally it's more efficient because you don't have to do for 00:26:25.700 |
17,000 words. It could be for 10 or something? Yeah, sure. That could be one gain of top 00:26:31.820 |
K decoding is that your softmax will take in fewer candidates. 00:26:36.020 |
But it's not the main reason. I think you should show-- 00:26:40.900 |
Yeah, I'll keep talking about the main reason. So we've discussed this part. And then here, 00:26:51.140 |
this is formally what is happening for top K sampling. Now that we are only sampling 00:26:57.740 |
from the top K tokens of the probability distribution. And as we've said, K is a hyperparameter. 00:27:03.780 |
So we can set K to be large or small. If we increase K, this means that we are making 00:27:09.460 |
our output more diverse, but at the risk of including some tokens that are bad. If we 00:27:14.500 |
decrease K, then we are making more conservative and safe choices. But possibly the generation will be more generic and less diverse. 00:27:24.340 |
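A minimal numpy sketch of top-k sampling with a toy distribution. Note that the surviving probabilities are renormalized, so the choice among the kept tokens is still weighted by the model's scores, which was the earlier question about weighted sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k=5):
    """Keep only the k highest-probability tokens, renormalize, then sample."""
    probs = np.asarray(probs, dtype=float)
    topk_idx = np.argsort(probs)[-k:]           # indices of the k largest probs
    truncated = np.zeros_like(probs)
    truncated[topk_idx] = probs[topk_idx]       # zero out the long tail
    truncated /= truncated.sum()                # sampling stays proportional
    return rng.choice(len(probs), p=truncated)  # to the kept probabilities

probs = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]
print(top_k_sample(probs, k=2))   # only the two most likely tokens can be chosen
```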
So is top K decoding good enough? The answer is not really. Because we can still find some 00:27:30.220 |
problems with top K decoding. For example, in the context, she said, I never blank. There 00:27:36.060 |
are many words that are still valid options, such as went, ate. But those words got zeroed 00:27:42.340 |
out because they are not within the top K candidates. So this actually leads to bad 00:27:46.780 |
recall for your generation system. And similarly, another failure of top K is that it can also 00:27:53.260 |
cut off too slowly. So in this example, code is not really a valid answer, according to 00:27:59.580 |
common sense, because you probably don't want to eat a piece of code. But the probability 00:28:04.100 |
remains non-zero, meaning that the model might still sample code as an output despite its 00:28:10.100 |
low probability. And this means bad precision for the generation system. 00:28:17.780 |
So given these problems with top K decoding, how can we address them? How can we address 00:28:23.900 |
this issue of there is no single K that fits all circumstances? This is basically because 00:28:30.580 |
the probability distributions that we sample from are dynamic. So when the probability 00:28:34.740 |
distribution is relatively flat, having a small K will remove many viable options, 00:28:41.660 |
so we want K to be larger for this 00:28:45.820 |
case. Similarly, when a distribution P is too peaky, a high K would 00:28:53.700 |
allow for too many options to be viable. And instead, we might want a smaller K so that 00:28:59.460 |
we are being safer. So the solution here is that maybe K is just a bad hyperparameter. 00:29:05.060 |
And instead of doing K, we should think about probability. We should think about how to 00:29:10.420 |
sample from the tokens within the top P percentile of the cumulative probability mass. 00:29:20.940 |
So now, the advantage of doing top P sampling, where we sample from the top P percentile 00:29:27.340 |
of the cumulative probability mass, is that this is actually equivalent to-- we have now 00:29:31.980 |
an adaptive K for each different distribution. And let me explain what I mean by having an 00:29:38.260 |
adaptive K. So in the first distribution, this is like a regular power law of language 00:29:44.180 |
that's kind of typical. And then doing top K sampling means we are selecting the top 00:29:49.820 |
K. But doing top P sampling means that we are zooming into maybe something that's similar 00:29:55.460 |
to top K in effect. But if I have a relatively flat distribution like the blue one, we can 00:30:01.380 |
see that doing top P means that we are including more candidates. And then if we have a more 00:30:06.700 |
skewed distribution like the green one, doing top P means that we actually include fewer 00:30:11.180 |
candidates. So by actually selecting the top P percentile in the probability distribution, 00:30:18.780 |
we are actually having a more flexible K and therefore have a better sense of what are 00:30:24.460 |
the good options in the model. Any questions about top P, top K decoding? So everything's 00:30:33.500 |
clear. Yeah, sounds good. So to go back to that question, doing top K is not necessarily 00:30:40.460 |
saving compute. Or rather, this whole idea is not really intended to save compute. Because in 00:30:46.820 |
the case of top P, in order to select the top P percentile, we still need to compute 00:30:51.540 |
the softmax over the entire vocabulary set in order for us to compute the P properly. 00:30:59.180 |
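Here is a matching sketch of top-p (nucleus) sampling. The cutoff index plays the role of an adaptive k, and the whole softmaxed distribution still has to be computed and sorted to find it:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs, p=0.9):
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                    # sort by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # adaptive "k" for this step
    keep = order[:cutoff]
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    truncated /= truncated.sum()
    return rng.choice(len(probs), p=truncated)

# A peaky distribution keeps very few candidates; a flat one keeps many.
print(top_p_sample([0.85, 0.10, 0.03, 0.02]))  # nucleus is just the top 2 tokens
print(top_p_sample([0.30, 0.25, 0.25, 0.20]))  # nucleus includes nearly all tokens
```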
So therefore, it's not really saving compute, but it's improving performance. Cool. Moving 00:31:05.420 |
on. So there are much more to go with decoding algorithms. Besides the top K and top P that 00:31:12.740 |
we've discussed, there are some more recent approaches like typical sampling, where the 00:31:17.780 |
idea is that we want to rescale the scores based on the entropy of the distribution and try 00:31:22.740 |
to generate text whose negative log probability is close to the 00:31:27.620 |
entropy of the data distribution. This means that if you have a closed-ended task or non-open-ended 00:31:34.700 |
task, it has smaller entropy. So you'll want negative log probability to be smaller. So 00:31:40.380 |
you want probability to be larger. So it type checks very well. And additionally, there 00:31:46.380 |
is also epsilon sampling coming from John. So this is an idea where we set the threshold 00:31:53.700 |
to lower bound probabilities. So basically, if you have a word whose probability is less 00:31:58.380 |
than 0.03, for example, then that word will never appear in the output distribution. 00:32:05.140 |
It will never be part of your output because it has such low probability. Yeah. 00:32:09.980 |
How do you compute the entropy of a distribution? 00:32:14.100 |
Oh, cool. Great question. So the entropy of a distribution is defined as-- suppose that we have a discrete 00:32:21.300 |
distribution. We just enumerate over x, and sum p of x times negative log probability 00:32:27.620 |
of x. So if we write it from an expectation perspective, it's basically the expectation of negative log 00:32:34.700 |
probability of x. So this is the entropy of a distribution. 00:32:45.140 |
And then-- so basically, if your distribution is very concentrated to a few words, then 00:32:50.260 |
the entropy will be relatively small. If your distribution is very flat, then your entropy will be relatively large. 00:32:57.660 |
What if the epsilon sampling is such that we have no valid option? 00:33:05.700 |
Oh, yeah. I mean, there will be some back-off cases, I think. So in the case that there 00:33:11.180 |
is no valid options, you'll probably still want to select one or two things, just as 00:33:16.540 |
an edge case, I think. OK, cool. Moving on. So another hyperparameter that we can tune 00:33:26.780 |
to affect decoding is the temperature parameter. So recall that previously at each time step, 00:33:33.060 |
we asked the model to compute a score. And then we renormalized that score using softmax 00:33:38.180 |
to get a probability distribution. So one thing that we can adjust here is that we can 00:33:42.540 |
insert this temperature parameter tau to rescale the scores. So basically, we just divide all 00:33:47.660 |
the scores S_w by tau. And after dividing this, we apply softmax. And we get a new distribution. 00:33:55.460 |
And this temperature adjustment is not really going to affect the monotonicity of the distribution. 00:34:01.580 |
For example, if word A has higher probability than word B previously, then after the adjustment, 00:34:07.620 |
word A is still going to have a higher probability than word B. But their relative difference 00:34:12.300 |
will change. So for example, if we raise the temperature tau to be greater than 1, then 00:34:20.380 |
the distribution Pt will become more uniform. It will be flatter. And this implies that 00:34:26.700 |
there will be more diverse output because our distribution is flatter. And it's more 00:34:31.660 |
spread out across different words in the vocabulary. On the other hand, if we lower the temperature 00:34:37.620 |
tau less than 1, then Pt becomes very spiky. And then this means that if we sample from 00:34:44.700 |
the Pt, we'll get less diverse output. So because here, the probability is concentrated 00:34:49.860 |
only on the top words. So in the very extreme case, if we set tau to be very, very close 00:34:54.540 |
to 0, then the probability will be a one-hot vector, where all the probability mass will be centered 00:35:01.220 |
on one word. And then this reduces back to argmax sampling or greedy decoding. 00:35:07.860 |
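A small sketch of how the temperature enters before the softmax, with toy scores that are not from the lecture:

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """Rescale scores by 1/tau before the softmax.

    tau > 1 flattens the distribution (more diverse samples);
    tau < 1 sharpens it (safer, less diverse); tau -> 0 approaches argmax.
    """
    scores = np.asarray(scores, dtype=float) / tau
    scores -= scores.max()                      # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

scores = [2.0, 1.0, 0.5]
print(softmax_with_temperature(scores, tau=2.0))   # flatter distribution
print(softmax_with_temperature(scores, tau=0.5))   # spikier distribution
```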
So temperature is a hyperparameter as well, just like k and P in top-k and top-p. It is a hyperparameter 00:35:14.220 |
for decoding. It can be tuned for beam search and sampling algorithms. So it's kind of orthogonal 00:35:19.900 |
to the approaches that we discussed before. Any questions so far? OK, cool. Temperature 00:35:29.860 |
is so easy. So well, because sampling still involves randomness, even though we try very 00:35:38.900 |
hard in terms of truncation, truncating the tail, sampling still has randomness. So what 00:35:43.740 |
if we're just unlucky and decode a bad sequence from the model? One common solution is to 00:35:49.180 |
do re-ranking. So basically, we would decode a bunch of sequences. For example, we can 00:35:53.460 |
decode 10 candidates. But 10 or 30 is up to you. The only consideration is that you want to balance 00:35:59.540 |
between your compute efficiency and performance. So if you decode more sequences, then, 00:36:05.460 |
of course, your performance is going to increase. But it's also very costly to just generate 00:36:10.460 |
a lot of things for one example. And so once you have a bunch of sample sequences, then 00:36:18.100 |
we are trying to define a score to approximate the quality of the sequence and re-rank all 00:36:24.060 |
the candidates by this score. So the simple thing to do is we can use perplexity as a 00:36:29.500 |
metric, as a scoring function. But we need to be careful that, because we have talked 00:36:35.180 |
about this, the extreme of perplexity, like if we try to argmax log probability, when 00:36:40.540 |
we try to aim for a super low perplexity, the text is actually very repetitive. So 00:36:45.500 |
we shouldn't really aim for extremely low perplexity. And perplexity, to some extent, 00:36:50.180 |
is not a perfect scoring function. It's not a perfect scoring function because it's not 00:36:55.540 |
really robust to maximize. So alternatively, the re-rankers can actually use a wide variety 00:37:02.380 |
of other scoring functions. We can score text based on its style, its discourse coherence, 00:37:08.620 |
its entailment or factuality properties, consistency, and so on. And additionally, we can compose 00:37:16.860 |
multiple re-rankers together. Yeah, question? 00:37:20.540 |
>> You mentioned 10 candidates or any number of candidates. What's the strategy you usually 00:37:27.540 |
use to generate these other candidates? Like what heuristic do you use? 00:37:32.540 |
>> So basically, the idea is to sample from the model. So when you sample from the model, 00:37:37.260 |
each time you sample, you are going to get a different output. And then that's what I 00:37:40.820 |
mean by different candidates. So if you sample 10 times, you will very likely get 10 different 00:37:45.820 |
outputs. And then you are just-- given these 10 different outputs that come from sampling, 00:37:51.500 |
you can just decide, re-rank them, and select the candidate that has the highest score. 00:37:56.180 |
>> Where does the randomness come from? >> Oh, because we are sampling here. 00:38:00.980 |
>> That sample, okay. >> Yeah, yeah. For example, if you are doing 00:38:04.420 |
top-p sampling, then, well, suppose that A and B are equally probable, then you might 00:38:09.180 |
sample A, you might sample B with the same probability. Okay, cool. And another cool 00:38:16.140 |
thing that we can do with re-ranking is that we can compose multiple re-rankers together. 00:38:20.540 |
So basically, suppose you have a scoring function for style, and you have a scoring function 00:38:24.780 |
for factual consistency. You can just add those two scoring functions together to get 00:38:28.900 |
a new scoring function, and then re-rank everything based on your new scoring function to get 00:38:34.580 |
text that is both good in style and good in factual consistency. Yeah? 00:38:38.980 |
>> Yeah, so when you say that we re-rank by score, do we just pick the decoding that has 00:38:45.500 |
the highest score, or do we do some more sampling again based on the score? 00:38:49.900 |
>> The idea is you just take the decoding that has the highest score, because you already 00:38:52.540 |
have, say, 10 candidates. So out of these 10, you only need one, and then you just choose 00:38:57.300 |
one that has the highest score. Yeah. Cool. Any other questions? Yeah? 00:39:08.900 |
>> Oh, yeah. Perplexity is like, you can kind of regard it as log probabilities. It's like 00:39:15.180 |
e to the average negative log probability. It's kind of like if a token has high perplexity, 00:39:22.140 |
then it means it has low probability, because you are more perplexed. 00:39:28.020 |
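Tying the re-ranking discussion together, here is a toy sketch: sample several candidates, score each with one or more scoring functions, and keep the argmax. The two scorers below are deliberately silly stand-ins for the real ones mentioned above, such as fluency, style, or factual-consistency models:

```python
def rerank(candidates, scorers, weights=None):
    """Pick the best of several sampled candidates.

    candidates: list of generated strings (e.g. 10 samples from the model)
    scorers:    list of functions, each mapping a string to a quality score
    weights:    optional weights for composing multiple re-rankers
    """
    weights = weights or [1.0] * len(scorers)
    def total(c):
        return sum(w * s(c) for w, s in zip(weights, scorers))
    return max(candidates, key=total)

# Hypothetical toy scorers: prefer ~10-word outputs, penalize repeated words.
length_score = lambda c: -abs(len(c.split()) - 10)
no_repeat    = lambda c: -(len(c.split()) - len(set(c.split())))

samples = ["I am tired I am tired I am tired",
           "Barely surviving all my homework, thanks for asking!"]
print(rerank(samples, [length_score, no_repeat]))   # picks the second candidate
```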
Okay. So taking a step back to summarize this decoding section, we have discussed many decoding 00:39:35.180 |
approaches from selecting the most probable string to sampling, and then to various truncation 00:39:41.860 |
approaches that we can do to improve sampling, like top P, top K, epsilon, typical decoding. 00:39:47.820 |
And finally, we discussed how we can do in terms of re-ranking the results. So decoding 00:39:54.020 |
is still a really essential problem in NLG, and there are lots of works to be done here 00:39:59.420 |
still, especially as ChatGPT is so powerful. We should all go study decoding. So it would 00:40:05.060 |
be interesting if you want to do such final projects. And also, different decoding algorithms 00:40:10.380 |
can allow us to inject different inductive biases to the text that we are trying to generate. 00:40:17.800 |
And some of the most impactful advances in NLG in the last couple of years actually come 00:40:22.420 |
from simple but effective decoding algorithms. The nucleus sampling paper is one such example. 00:40:31.540 |
So moving on to talk about training NLG models. Well, we have seen this example before in 00:40:38.740 |
the decoding slides, and I'm just trying to show them again, because even though we can 00:40:43.100 |
solve this repetition problem by instead of doing search, doing sampling. But it's still 00:40:49.300 |
concerning from a language modeling perspective that your model would put so much probability 00:40:54.540 |
on such repetitive and degenerate text. So we ask this question, well, is repetition 00:40:59.740 |
due to how language models are trained? You have also seen this plot before, which shows 00:41:06.940 |
this decaying pattern or this self-amplification effect. So we can conclude from this observation 00:41:13.340 |
that a model trained via the MLE objective can have a really bad mode of the distribution. By 00:41:19.820 |
mode of the distribution, I mean the argmax of the distribution. So basically, they would 00:41:23.500 |
assign high probability to terrible strings. And this is definitely problematic from a modeling 00:41:28.900 |
perspective. So why is this the case? Shouldn't MLE be a gold standard in machine learning 00:41:36.140 |
in general, not just machine translation? Shouldn't MLE be a gold standard for machine 00:41:39.700 |
learning? The answer here is not really, especially for text, because MLE has some problem for 00:41:46.340 |
sequential data. And we call this problem exposure bias. So training with teacher forcing 00:41:53.340 |
leads to exposure bias at generation time, because during training, our model's inputs 00:41:58.140 |
are gold context tokens from real human-generated text, as denoted by y star less than t here. 00:42:05.060 |
But during generation time, our model's input become previously decoded tokens from the 00:42:10.820 |
model, y hat less than t. And suppose that our model has minor errors, then y hat less than t will 00:42:18.260 |
be much worse in terms of quality than y star less than t. And this is terrible, 00:42:23.900 |
because it actually causes a discrepancy between training and test time, which actually hurts 00:42:30.300 |
model performance. And we call this problem exposure bias. 00:42:35.980 |
So people have proposed many solutions to address this exposure bias problem. One thing 00:42:41.180 |
to do is to do scheduled sampling, which means that with probability p, we try to decode 00:42:47.900 |
a token and feed it back in as context to train the model. And with probability 1 minus 00:42:53.860 |
p, we use the gold token as context. So throughout training, we try to increase 00:43:00.140 |
p to gradually warm it up, and then prepare it for test time generation. So this leads 00:43:06.500 |
to improvement in practice, because using this p probabilities, we're actually gradually 00:43:14.580 |
trying to narrow the discrepancy between training and test time. But the objective is actually 00:43:19.260 |
quite strange, and training can be very unstable. 00:43:23.580 |
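A rough sketch of the scheduled sampling coin flip. In reality the model's predictions are produced step by step during the forward pass; here they are just given as a tensor to keep the example short:

```python
import torch

def scheduled_sampling_inputs(gold, model_preds, p):
    """Build the input sequence for one training example.

    gold:        gold tokens y*_{<t}                     (LongTensor, length T)
    model_preds: the model's own predictions y_hat_{<t}  (LongTensor, length T)
    p:           probability of feeding back the model's own token; warmed up
                 from ~0 toward 1 over training to narrow the train/test gap.
    """
    use_model = torch.rand(gold.shape) < p           # coin flip per position
    return torch.where(use_model, model_preds, gold)

# Early in training p is small (mostly teacher forcing) ...
print(scheduled_sampling_inputs(torch.tensor([1, 2, 3, 4]),
                                torch.tensor([9, 9, 9, 9]), p=0.1))
# ... and later p is large (mostly the model's own samples).
print(scheduled_sampling_inputs(torch.tensor([1, 2, 3, 4]),
                                torch.tensor([9, 9, 9, 9]), p=0.9))
```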
Another idea is to do data set aggregation. And the method is called DAgger. Essentially, 00:43:29.860 |
at various intervals during training, we try to generate a sequence of text from the current 00:43:33.980 |
model, and then put this generated sequence into the training data. So 00:43:39.060 |
we're kind of continuously doing this training data augmentation scheme to make sure that 00:43:44.940 |
the training distribution and the generation distribution are closer together. So both 00:43:49.980 |
approaches, both scheduled sampling and data set aggregation, are ways to narrow the discrepancy between training and test time. 00:44:00.980 |
Gold token just means human text. It means like-- well, when you train a language model, 00:44:06.500 |
you will see lots of corpus that are human written. Gold is just human. Yeah. OK, cool. 00:44:15.540 |
So another approach is to do retrieval augmented generation. So we first learn to retrieve 00:44:20.700 |
a sequence from some existing corpus of prototypes. And then we train a model to actually edit 00:44:26.140 |
the retrieved text by doing insertion, deletion, or swapping. We can add or remove tokens from 00:44:33.260 |
this prototype, and then try to modify it into another sentence. So this doesn't really 00:44:40.260 |
suffer from exposure bias, because we start from a high-quality prototype. So that at 00:44:45.620 |
training time and at test time, you don't really have the discrepancy anymore, because you always start from a high-quality prototype. 00:44:53.780 |
Another approach is to do reinforcement learning. So here, the idea is to cast your generation 00:44:59.340 |
problem as a Markov decision process. So there is the state s, which is the model's representation 00:45:06.700 |
for all the preceding context. There is action a, which is basically the next token that 00:45:12.740 |
we are trying to pick. And there is policy, which is the language model, or also called 00:45:16.860 |
the decoder. And there is the reward r, which is provided by some external score. And the 00:45:22.540 |
idea here-- well, we won't go into details about reinforcement learning and how it works, 00:45:34.380 |
So in the reinforcement learning context, because reinforcement learning involves a 00:45:38.180 |
reward function, that's very important. So how do we do reward estimation for text generation? 00:45:44.020 |
Well, really a natural idea is to just use the evaluation metrics. So whatever-- because 00:45:49.060 |
you are trying to do well in terms of evaluation, so why not just improve for evaluation metrics 00:45:54.100 |
directly at training time? For example, in the case of machine translation, we can use 00:45:58.940 |
BLEU score as the reward function. In the case of summarization, we can use ROUGE score as the reward. 00:46:06.620 |
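As a sketch of what "using the metric as a reward" can look like, here is a bare-bones REINFORCE-style loss for one sampled sequence; the reward value is a made-up stand-in for, say, sentence-level BLEU against the reference:

```python
import torch

def reinforce_loss(token_logprobs, reward, baseline=0.0):
    """Policy-gradient (REINFORCE) loss for one sampled sequence.

    token_logprobs: log p(y_hat_t | y_hat_{<t}) for each sampled token (Tensor)
    reward:         a scalar score for the whole sequence, e.g. sentence BLEU
                    against the reference (only a proxy for quality!)
    baseline:       optional baseline to reduce gradient variance
    """
    return -(reward - baseline) * token_logprobs.sum()

# Toy usage: pretend these log-probs came from sampling a translation.
logprobs = torch.tensor([-0.9, -0.5, -0.7], requires_grad=True)
loss = reinforce_loss(logprobs, reward=0.72, baseline=0.5)
loss.backward()
```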
But we really need to be careful about optimizing for the task as opposed to gaming the reward, 00:46:12.140 |
because evaluation metrics are merely proxies for the generation quality. So sometimes, 00:46:17.060 |
you run RL and improve the BLEU score by a lot. But when you run human evaluations, humans 00:46:23.740 |
might still think that, well, this generated text is no better than the previous one, or 00:46:28.300 |
even worse, even though it gives you a much better BLEU score. So we want to be careful about that. 00:46:37.500 |
So what behaviors can we tie to a reward function? This is about reward design and reward estimation. 00:46:42.780 |
There are so many things that we can do. We can do cross-modality consistency for image 00:46:47.540 |
captioning. We can do sentence simplicity to make sure that we are generating simple 00:46:53.780 |
English that are understandable. We can do formality and politeness to make sure that, 00:46:58.660 |
I don't know, your chatbot doesn't suddenly yell at you. 00:47:02.780 |
And the most important thing that's really, really popular recently is human preference. 00:47:08.860 |
So we should just build a reward model that captures human preference. And this is actually 00:47:14.220 |
the technique behind the ChatGPT model. So the idea here is that we would ask humans to 00:47:19.860 |
rank a bunch of generated text based on their preference. And then we will use this preference 00:47:24.740 |
data to learn a reward function, which will basically always assign high score to something 00:47:31.580 |
that humans might prefer and assign low score to something that humans wouldn't prefer. 00:47:37.620 |
Would it be more expensive? Like, is it all just real humans doing the ranking? 00:47:42.620 |
Oh yeah, sure. I mean, it is going to be very expensive. But I feel like compared to all 00:47:47.660 |
the cost of training models, training like 170 billion parameter models, I feel like 00:47:52.780 |
OpenAI and Google are, well, they can afford hiring lots of humans to do human annotations 00:47:59.300 |
How much data would we need to, like, give simple answers? 00:48:04.300 |
Yeah, this is a great question. So I think it's kind of a mystery about how much data 00:48:09.860 |
you exactly need to achieve the level of performance of chat GPT. But roughly speaking, I feel 00:48:15.500 |
like, I mean, whenever you try to fine tune a model on some downstream task, similarly 00:48:19.940 |
here you are trying to fine tune your model on human preference. It does need quite a lot 00:48:24.940 |
of data, like maybe on the scale of 50k to 100k. That's roughly the scale that-- like 00:48:29.780 |
Anthropic actually released some data set about human preference. That's roughly the 00:48:33.780 |
scale that they released, I think, if I remember correctly. Yeah, question. 00:48:38.780 |
So we talked about earlier about how many of the state of the art language models use 00:48:43.180 |
transformers as their architecture. How do you apply reinforcement learning to this model? 00:48:54.380 |
Yeah, I feel like reinforcement learning is kind of a modeling tool. I mean, it's kind 00:49:00.220 |
of an objective that you are trying to optimize. Instead of an MLE objective, now you are optimizing 00:49:04.540 |
for an RL objective. So it's kind of orthogonal to the architecture choice. So a transformer 00:49:11.900 |
is an architecture. You just use transformer to give you probability of the next token 00:49:16.260 |
distribution or to try to estimate probability of a sequence. And then once you have the 00:49:21.420 |
probability of a sequence, you use that probability of the sequence, pass it into the RL objective 00:49:27.660 |
that you have. And then suppose that you are trying to do policy gradient or something, 00:49:31.900 |
then you need to estimate the probability of that sequence. And then you just need to 00:49:36.460 |
be able to backprop through transformer, which is doable. 00:49:40.340 |
Yeah, so I think the question about architecture and objectives are orthogonal. So even if 00:49:45.220 |
you have an LSTM, you can do it. You have a transformer, you can also do it. 00:49:52.700 |
And can we just use another model for this kind of reward? For example, can we use another pre-trained model? 00:50:02.420 |
Yeah, I think that's exactly what they did. So for example, you would have GPT-3. You 00:50:08.020 |
use GPT-3 as the generator that generate text. And you kind of have another pre-trained model 00:50:13.700 |
that could probably also be GPT-3, but I'm guessing here, that you fine tune it to learn 00:50:18.820 |
human preference. And then once you have a human preference model, you use the human 00:50:23.580 |
preference model to put it into RL as the reward model. And then use the original GPT-3 00:50:29.140 |
as the policy model. And then you apply RL objectives and then update them so that you 00:50:35.460 |
will get a new model that's better at everything. 00:50:38.940 |
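The reward model itself is typically trained with a pairwise ranking loss on the human preference data. A minimal sketch, where the scores are made up; in practice they would come from a learned reward model head on top of a pre-trained transformer:

```python
import torch
import torch.nn as nn

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss for a reward model trained on human preferences.

    reward_chosen / reward_rejected: scores the reward model assigns to the
    response the annotator preferred vs. the one they did not. Minimizing this
    pushes the reward model to score preferred responses higher.
    """
    return -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with hypothetical reward-model scores for two comparison pairs:
chosen, rejected = torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9])
print(preference_loss(chosen, rejected))
```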
OK, cool. Yeah, actually, if you are very curious about RLHF, I would encourage you 00:50:45.020 |
to come to the next lecture, where Jesse will talk about RLHF. RLHF is shorthand for RL from human feedback. 00:50:59.100 |
So takeaways. Teacher forcing is still the main algorithm for training text generation 00:51:05.300 |
models. And exposure bias causes problems in text generation models. For example, it 00:51:10.940 |
causes models to lose coherence, causes model to be repetitive. And models must learn to 00:51:16.180 |
recover from their own bad samples by using techniques like scheduled sampling or DAgger. 00:51:22.860 |
And models shouldn't-- another approach to reduce exposure bias is to start with good 00:51:28.540 |
text, like retrieval plus generation. And we also discussed how to do training with 00:51:32.780 |
RL. And this can actually make models learn behaviors that 00:51:40.260 |
are preferred by humans or preferred by some metrics. 00:51:43.180 |
So to be very up to date, in the best language models nowadays, like ChatGPT, the training is 00:51:49.220 |
actually pipelined. For example, we would first pre-train a large language model using 00:51:53.420 |
internet corpus by self-supervision. And this kind of gets you chat-GPT-- sorry, GPT-3, 00:51:59.740 |
which is the original version. And then you would do some sort of instruction tuning to 00:52:04.260 |
fine-tune the language model, to fine-tune the pre-trained language model so that it 00:52:07.500 |
learns roughly how to follow human instructions. 00:52:10.900 |
And finally, we would do RLHF to make sure that these models are well-aligned with human 00:52:15.580 |
preference. So if we start RLHF from scratch, it's probably going to be very hard for the 00:52:21.700 |
model to converge, because RL is hard to train for text data, et cetera. So RL doesn't really 00:52:27.500 |
work from scratch. But with all these smart tricks about pre-training and instruction 00:52:33.260 |
tuning, suddenly now they're off to a good start. 00:52:55.060 |
You mean the difference between DAgger and scheduled sampling is how long the sequences 00:53:01.300 |
are? Yeah, I think roughly that is it. Because for Dagger, you are trying to put in full-generated 00:53:10.100 |
sequence. But I feel like there can be variations of Dagger. Dagger is just like a high-level 00:53:13.740 |
framework and idea. There can be variations of Dagger that are very similar to schedule 00:53:20.140 |
I feel like for schedule sampling, it's kind of a more smoothed version of Dagger. Because 00:53:24.860 |
for Dagger, you have to-- well, basically, for this epoch, I am generating something. 00:53:31.260 |
And then after this epoch finishes, I put this into the data together and then train 00:53:35.380 |
for another epoch. Whereas scheduled sampling seems to be more flexible in terms of where you mix in the model's own tokens. 00:53:40.940 |
So for DAgger, if you regress the model on its own output, how does that help the model? 00:53:48.740 |
I think that's a good question. I feel like if you regress the model-- for example, if 00:53:54.900 |
you regress the model on its own output, I think there should be smarter ways than to 00:54:01.740 |
exactly regress on your own output. For example, you might still consult some gold reference 00:54:06.780 |
data, for example, given that you ask the model to generate for something. And then 00:54:10.980 |
you can, instead of using-- say you ask the model to generate for five tokens. And then 00:54:16.420 |
instead of using the model's generation to be the sixth token, you'll probably try to 00:54:21.980 |
find some examples in the training data that would be good continuations. And then you 00:54:26.660 |
try to plug that in by connecting the model generation and some gold text. And then therefore, 00:54:34.100 |
you are able to correct the model, even though it probably went off path a little bit by 00:54:39.700 |
generating its own stuff. So it's kind of like letting the model learn how to correct itself. 00:54:44.060 |
But yes, I think you are right. If you just put model generation in the data, it shouldn't 00:54:51.540 |
really work. Yeah. Any other questions? Cool. Moving on. Yes. So now we'll talk about how 00:55:08.020 |
we are going to evaluate NLG systems. So there are three types of methods for evaluation. 00:55:13.540 |
There are content overlap metrics, there are model-based metrics, and there are human evaluations. 00:55:20.340 |
So first, content overlap metrics compute a score based on lexical similarities between 00:55:25.100 |
the generated text and the gold reference text. So the advantage of this approach is 00:55:29.460 |
that it's very fast and efficient and widely used. For example, BLEU score is very popular 00:55:35.300 |
in MT. And ROUGE score is very popular in summarization. 00:55:41.660 |
So these methods are very popular because they are cheap and easy to run. But they are 00:55:47.940 |
not really the ideal metrics. For example, simply relying on lexical overlap might miss 00:55:53.540 |
some rephrasings that have the same semantic meaning. Or it might reward text with a large 00:55:59.180 |
portion of lexical overlap, but actually have the opposite meaning. So you have lots of 00:56:03.620 |
both false positive and false negative problems. 00:56:07.540 |
So despite all these disadvantages, these metrics are still the go-to evaluation standard in 00:56:12.300 |
machine translation. Part of the reason is that MT is actually very close-ended; it's 00:56:18.060 |
not open-ended at all. And therefore, it is probably still fine to use BLEU score 00:56:25.020 |
to measure machine translation. And they get progressively worse for tasks that are more 00:56:29.740 |
open-ended. For example, they get worse for summarization, because the output text 00:56:35.580 |
becomes much harder to measure. They are much worse for dialogue, 00:56:40.340 |
which is more open-ended. And then they are much, much worse for story generation, which 00:56:44.100 |
is also very open-ended. And a further drawback of n-gram metrics here 00:56:51.100 |
is the following: suppose that you are generating a story that's relatively long. Then if you 00:56:55.740 |
are still looking at word overlap, then you might actually get very high n-gram scores 00:57:00.380 |
because your text is very long, not because it's actually of high quality. Just because 00:57:04.780 |
you are talking so much that you might have covered lots of points already. 00:57:10.580 |
Yes, exactly. That's the next thing that I will talk about as a better metric for evaluation. 00:57:22.700 |
But for now, let's do a case study of a failure mode for blue score, for example. So suppose 00:57:29.300 |
that Chris asked the question, are you enjoying the CS224N lectures? The correct answer, of 00:57:33.900 |
course, is heck yes. So if one of the answers is yes, it will get a score 00:57:42.220 |
of 0.61 because it has some lexical overlap with the correct answer. If you answer you 00:57:48.100 |
know it, then it gets a relatively lower score because it doesn't really have any lexical 00:57:53.420 |
overlap except from the exclamation mark. And if you answer yep, this is semantically 00:57:59.420 |
correct, but it actually gets 0 score because there is no lexical overlap between the gold 00:58:05.340 |
answer and the generation. If you answer heck no, this should be wrong. But because it has 00:58:12.020 |
lots of lexical overlap with the correct answer, it actually gets a high score. 00:58:19.740 |
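To make this failure mode easy to poke at yourself, here is a small hedged sketch using NLTK's sentence-level BLEU. The numbers will not match the slide exactly (they depend on the BLEU variant, n-gram weights, and smoothing), but the ordering shows the same problem: the semantically wrong "heck no" outscores the semantically correct "yep".

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["heck", "yes", "!"]]          # the gold answer, tokenized
candidates = {
    "yes !": ["yes", "!"],
    "you know it !": ["you", "know", "it", "!"],
    "yep": ["yep"],
    "heck no !": ["heck", "no", "!"],
}

smooth = SmoothingFunction().method1
for text, tokens in candidates.items():
    score = sentence_bleu(reference, tokens,
                          weights=(0.5, 0.5),          # unigram + bigram precision only
                          smoothing_function=smooth)
    print(f"{text:15s} BLEU = {score:.2f}")
```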
So these two cases are the major failure modes of lexical-based n-gram overlap metrics. You 00:58:26.060 |
get false negatives and false positives. So moving beyond these failure modes of lexical-based 00:58:35.180 |
metrics, the next step is to check for semantic similarities. And model-based metrics are 00:58:40.380 |
better at capturing the semantic similarities. So this is similar to what was raised 00:58:45.200 |
a couple of minutes ago. We can actually use learned representations of words 00:58:50.700 |
and sentences to compute semantic similarities between generated and referenced text. So 00:58:58.460 |
now we are no longer bottlenecked by n-grams. And instead, we're using embeddings. And these 00:59:03.820 |
embeddings are going to be pre-trained. But the metrics can still improve over time because we can 00:59:07.900 |
just swap in different pre-trained embeddings and keep the metric fixed. 00:59:12.460 |
So here are some good examples of the metrics that could be used. One thing is to do vector 00:59:17.540 |
similarity. This is very similar to homework one, where you are trying to compute similarity 00:59:22.340 |
between words, except now we are trying to compute similarity between sentences. There 00:59:27.620 |
are some ideas of how to go from word similarity to sentence similarities. For example, you 00:59:32.100 |
can just average the embeddings, which is a relatively naive idea, but it works sometimes. 00:59:40.260 |
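As a minimal sketch of that averaging baseline (assuming `emb` is any word-to-vector lookup you have loaded elsewhere, e.g. GloVe vectors; the names here are illustrative, not a specific library API):

```python
import numpy as np

def sentence_vector(tokens, emb):
    """Average the embeddings of the in-vocabulary tokens (a naive sentence embedding)."""
    vecs = [emb[w] for w in tokens if w in emb]
    assert vecs, "expects at least one in-vocabulary token"
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# similarity between a generated sentence and a reference sentence:
# cosine(sentence_vector(gen_tokens, emb), sentence_vector(ref_tokens, emb))
```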
Another high-level idea is that we can measure word mover's distance. The idea here is that 00:59:47.460 |
we can use optimal transport to align the source and target word embeddings. Suppose 00:59:52.180 |
that your source sentence is "Obama speaks to the media in Illinois," and the target is 00:59:59.820 |
"The president greets the press in Chicago." From a human evaluation perspective, these 01:00:04.100 |
two are actually very similar, but they are not exactly aligned word by word. So we need 01:00:09.100 |
to figure out how to optimally align words to words, like align Obama to president, align 01:00:14.220 |
Chicago to Illinois, and then therefore we can compute a score. We can compute the pairwise 01:00:20.100 |
word embedding distances between the aligned words, and then get a score for the sentence similarity. 01:00:27.540 |
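If you want to try word mover's distance yourself, gensim's KeyedVectors exposes a `wmdistance` method. A hedged sketch, assuming gensim and its optimal-transport dependency are installed and that the pretrained vector name below is available through gensim's downloader:

```python
import gensim.downloader as api

# Small pretrained word vectors; any KeyedVectors model should work similarly.
wv = api.load("glove-wiki-gigaword-50")

s1 = "obama speaks to the media in illinois".split()
s2 = "the president greets the press in chicago".split()
s3 = "the band gave a concert in japan".split()

print(wv.wmdistance(s1, s2))   # semantically close pair -> smaller distance
print(wv.wmdistance(s1, s3))   # unrelated pair -> larger distance
```

In practice, stop words are usually removed before computing the distance, as in the original example from the word mover's distance paper.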
And finally, there is BERT score, which is also a very popular metric for semantic similarity. 01:00:32.580 |
So it first computes pairwise cosine distance using BERT embeddings, and then it finds an 01:00:38.260 |
optimal alignment between the source and target sentence, and then it finally computes some 01:00:43.260 |
score. So I feel like these details are not really that important, but the high-level 01:00:48.580 |
idea is super important: we can now use word embeddings to compute sentence similarities 01:00:56.180 |
by doing some sort of smart alignment, and then transform from word similarity to sentence 01:01:00.620 |
similarities. To move beyond word embeddings, we can also use sentence embeddings to compute 01:01:06.820 |
sentence similarities. So typically, this doesn't have the word-by-word alignment 01:01:11.260 |
problem, but it has a similar issue in that you now need to align sentences or phrases 01:01:16.580 |
in a sentence. And similarly, there is BLEURT, which is slightly different. It is a regression 01:01:22.340 |
model based on BERT. So the model is trained as a regression problem to return a score 01:01:29.140 |
that indicates how good the text is in terms of grammaticality and 01:01:33.620 |
similarity of meaning with the reference text. So this is kind of training the evaluation as 01:01:38.020 |
a regression problem. Any questions so far? OK, cool. You can move on. 01:01:51.180 |
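For BERTScore specifically, there is a `bert-score` package that wraps the embed-align-score pipeline described above. A hedged sketch; the exact number depends on the underlying model the package selects and downloads:

```python
from bert_score import score

candidates = ["the president greets the press in chicago"]
references = ["obama speaks to the media in illinois"]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1 = {F1.item():.3f}")   # fairly high despite little lexical overlap
```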
So all the previously mentioned approaches evaluate semantic similarity, so they 01:01:55.500 |
can be applied to non-open-ended generation tasks. But what about open-ended settings? 01:02:00.980 |
So here, enforcing semantic similarity seems wrong, because a story can be perfectly fluent 01:02:06.220 |
and perfectly high quality without having to resemble any of the reference stories. 01:02:11.380 |
So one idea here is that maybe we want to evaluate open-ended text generation using 01:02:16.780 |
the MAUVE score. MAUVE computes the information divergence in a quantized embedding space 01:02:23.260 |
between the generated text and the gold reference text. So here is roughly the detail of what's 01:02:28.420 |
going on. Suppose that you have a batch of text from the gold reference that is human 01:02:33.220 |
written, and you have a batch of text that's generated by your model. Step number one is 01:02:37.820 |
that you want to embed this text. You want to put this text into some continuous representation 01:02:42.340 |
space, which is kind of the figure to the left. But it's really hard to compute any 01:02:47.660 |
distance metrics in this continuous embedding space, because different sentences might actually 01:02:53.140 |
lie very far away from each other. So the idea here is that we are trying to do k-means 01:02:58.300 |
clustering to discretize the continuous space into some discrete space. Now, after the discretization, 01:03:04.460 |
we can actually have a histogram for the gold human-written text and a histogram for the 01:03:11.220 |
machine generated text. And then we can now compute precision recall using these two discretized 01:03:17.020 |
distributions. And then we can compute precision by forward KL and recall by backward KL. Yes, 01:03:24.300 |
Why do we want to discretize it? I didn't catch that. 01:03:28.780 |
So imagine that you-- suppose-- maybe it's equivalent to answer, why is it hard to work 01:03:34.300 |
with the continuous space? The idea is if you embed a sentence into the continuous space, 01:03:40.580 |
say that it lies here, and you embed another sentence in a continuous space that lies here, 01:03:44.740 |
suppose that you only have a finite number of sentences. Then they would basically be 01:03:48.660 |
Dirac delta distributions in your manifold. So it's hard to-- you probably want a smoother 01:03:55.460 |
distribution. But it's hard to define what is a good, smooth distribution in the case 01:03:59.940 |
of text embedding, because they're not super interpretable. So therefore, eventually, you 01:04:04.100 |
will have-- if you embed everything in a continuous space, you will have lots of Dirac deltas 01:04:10.380 |
that are just very high and then not really connected to their neighbors. So it's hard to 01:04:17.540 |
quantify KL divergence or a distance metric in that space. 01:04:22.220 |
For example, you have to make some assumptions. For example, you want to make Gaussian assumptions 01:04:26.060 |
that I want to smooth all the embeddings by convolving it with a Gaussian. And then you 01:04:31.260 |
can start getting some meaningful distance metrics. But with just the embeddings alone, 01:04:37.540 |
you're not going to get meaningful distance metrics. And then it doesn't really make sense 01:04:40.540 |
to smooth things using a Gaussian, because who says word representations are Gaussian-distributed? 01:04:48.060 |
Then how do you get smooth, continuous-looking distributions from the embeddings? 01:04:51.380 |
I think this requires some Gaussian smoothing. Yeah, I think that the plot is made with some 01:04:55.780 |
smoothing. Yeah, I mean, I didn't make the plot, so I couldn't be perfectly sure. But 01:04:59.980 |
I think the fact that it looks like this means that you smooth it a little bit. 01:05:06.100 |
These are sentence embeddings or concatenated word embeddings, because you are comparing 01:05:10.380 |
sentences to sentences, not words to words. Yeah, so the advantage of the MAUVE score is that 01:05:16.980 |
it is applicable to open-ended settings, because you are now measuring precision and recall 01:05:22.820 |
with regard to the target distribution. Cool. So it has a better probabilistic interpretation 01:05:30.580 |
than all the previous similarity metrics. Cool. Any other questions? Yes? 01:05:37.980 |
I'm just not entirely clear. So if we're trying to maximize precision and recall here, 01:05:43.980 |
how is that different from just trying to maximize the similarity between the target and the generated text? 01:05:48.980 |
Oh, yeah, that's a good question. Well, this is because in a case where it's really hard 01:05:55.580 |
to get exactly the same thing-- well, for example, I would say that maybe-- because 01:06:00.580 |
I've never tried this myself, but if you try to run MAUVE on a machine translation task, 01:06:05.660 |
you might get a very high score. But if you try to run BLEU score on open-ended text 01:06:11.260 |
generation, you will get a super low score. So it's just not really measurable, because 01:06:14.900 |
everything is so different from each other. So I feel like MAUVE is kind of a middle ground, 01:06:19.780 |
where you are trying to evaluate something that are actually very far away from each 01:06:23.460 |
other, but you still want a meaningful representation. Of course, I mean, if your source and target 01:06:30.380 |
are exactly the same or are just different up to some rephrasing, you will get the best 01:06:34.780 |
MAUVE score. But maybe that's not really what you're looking for, because given the current 01:06:39.940 |
situation, you only have generations that are very far away from the gold text. 01:06:48.900 |
I'm still trying to understand the MAUVE score. Is it possible to write out the math? 01:06:57.700 |
Yeah, I think it's possible. I mean, maybe we can put this discussion after class, because 01:07:02.780 |
I kind of want to finish my slides. Yeah, but happy to chat after class. There is a 01:07:08.060 |
paper about it if you search for MAUVE score. I think it probably won a best paper award at some venue. 01:07:16.020 |
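To make the quantize-then-compare idea concrete, here is a minimal from-scratch sketch of the skeleton described above. It is not the official MAUVE implementation (which builds a full divergence frontier and reports a single summary score); `human_emb` and `model_emb` are assumed to be (N, d) arrays of sentence embeddings produced elsewhere by some pretrained encoder.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantized_divergences(human_emb, model_emb, k=16, eps=1e-6):
    # 1. Discretize the shared embedding space with k-means.
    km = KMeans(n_clusters=k, n_init=10).fit(np.vstack([human_emb, model_emb]))
    # 2. Histogram each batch of text over the learned clusters.
    p = np.bincount(km.predict(human_emb), minlength=k) + eps   # gold, human-written text
    q = np.bincount(km.predict(model_emb), minlength=k) + eps   # model-generated text
    p, q = p / p.sum(), q / q.sum()
    # 3. Compare the two discrete distributions in both KL directions
    #    (the lecture maps one direction to precision and the other to recall).
    kl_pq = float(np.sum(p * np.log(p / q)))   # KL(human || model)
    kl_qp = float(np.sum(q * np.log(q / p)))   # KL(model || human)
    return kl_pq, kl_qp
```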
OK, so moving on. I've pointed out that there are so many evaluation methods. So let's take 01:07:22.300 |
a step back and think about what's a good metric for evaluation methods. So how do we 01:07:26.660 |
evaluate evaluations? Nowadays, the gold standard is still to check how well this metric is 01:07:32.580 |
aligned with human judgment. So if a metric matches human preference, in other words, if 01:07:40.980 |
the metric correlates very strongly with human judgment, then we say that the metric is a 01:07:45.100 |
good metric. So in this plot, people have plotted BLEU score and human score on the y and x 01:07:52.140 |
axes respectively. And then because we don't see a strong correlation, this 01:07:56.580 |
kind of suggests that BLEU score is not a very good metric. 01:08:01.660 |
So actually, the gold standard for evaluating language 01:08:07.860 |
models is always to do human evaluation. So automatic metrics fall short of matching human 01:08:14.860 |
decisions. And human evaluation is kind of the most important criteria for evaluating 01:08:20.620 |
text that are generated from a model. And it's also the gold standard in developing 01:08:25.200 |
automatic metrics because we want everything to match human evaluation. 01:08:31.620 |
So what do we mean by human evaluation? How is it conducted? Typically, we will provide 01:08:36.780 |
human annotators with some axes that we care about, like fluency, coherence for open-ended 01:08:43.460 |
text generation. Suppose that we also care about factuality for summarization. We care 01:08:47.900 |
about style of the writing and common sense, for example, if we're trying to write a children's story. 01:08:56.540 |
Another thing to note is that you shouldn't compare human evaluations across 01:09:00.720 |
different papers or different studies, because human evaluations tend to not be well-calibrated 01:09:05.660 |
and are not really reproducible. Even though we believe that human evaluations are the 01:09:10.900 |
gold standard, there are still many drawbacks. For example, human evaluations are really 01:09:15.580 |
slow and expensive. But even beyond being slow and expensive, they are still not perfect 01:09:23.300 |
because, first, the results of human evaluations may be inconsistent, and they may not be very 01:09:28.220 |
reproducible. So if you ask the same human whether you like A or B, they might say A 01:09:32.100 |
the first time and B the second time. And then human evaluations are typically not really 01:09:37.500 |
logical. And sometimes, human annotators might misinterpret your question. Suppose that you 01:09:44.580 |
want them to measure coherence of the text. Different people have different criteria for 01:09:48.860 |
coherence. Some people might think coherence is equivalent to fluency, and then they look 01:09:53.020 |
for grammaticality errors. Some people might think coherence means how well your continuation 01:09:58.860 |
is aligned with the prompt or the topic. So there are all sorts of misunderstandings that can happen. 01:10:08.420 |
And finally, human evaluation only measures precision, not recall. This means that you 01:10:13.380 |
can give a sentence to human and ask the human, how do you like the sentence? But you couldn't 01:10:18.220 |
ask the human whether this model is able to generate all possible sentences that are good. 01:10:24.460 |
So it's only a precision-based metric, not a recall-based metric. 01:10:28.260 |
So here are two approaches that try to combine human evaluations with modeling. For example, 01:10:36.260 |
the first idea is basically trying to learn a metric from human judgment, basically by 01:10:42.540 |
trying to use human judgment data as training data, and then train a model to simulate human 01:10:48.420 |
judgment. And the second approach is trying to ask human and model to collaborate so that 01:10:54.780 |
the human would be in charge of evaluating precision, whereas the model would be in charge of evaluating recall. 01:11:01.140 |
Also, we have tried approaches in terms of evaluating models interactively. So in this 01:11:07.380 |
case, we not only care about the output quality, we also care about how the person feels when 01:11:14.460 |
they interact with the model, when they try to be a co-author with the model, and how 01:11:18.860 |
the person feels about the writing process, et cetera. So this is called evaluating models interactively. 01:11:29.300 |
So the takeaway here is that content overlap is a bad metric. Model-based metrics are 01:11:35.900 |
better because they are more focused on semantics, but they're still not good enough. Human judgment 01:11:40.900 |
is the gold standard, but it's hard to do a human study 01:11:45.140 |
well. And in many cases, this is a hint for final project. The best judge of the output 01:11:51.420 |
quality is actually you. So if you want to do a final project in natural language generation, 01:11:56.980 |
you should look at the model output yourself. And don't just rely on the numbers that the automatic metrics report. 01:12:04.940 |
Cool. So finally, we will discuss ethical considerations of natural language generation 01:12:11.220 |
problems. So as language models get better and better, ethical considerations become 01:12:16.980 |
much more pressing. So we want to ensure that the models are well-aligned with human values. 01:12:21.940 |
For example, we want to make sure the models are not harmful, they are not toxic, and we 01:12:26.820 |
want to make sure that the models are unbiased and fair to all demographic groups. 01:12:31.460 |
So for example here, we also don't want the model to generate any harmful content. Basically, 01:12:37.460 |
I try to prompt ChatGPT to say, can you write me some toxic content? ChatGPT politely refused 01:12:42.860 |
me, which I'm quite happy about. But there are other people who try to jailbreak ChatGPT. 01:12:51.860 |
The idea here is that ChatGPT-- actually, I think internally, they probably implement 01:12:56.060 |
some detection tools so that when you try to prompt it adversarially, it's going to 01:13:00.380 |
avoid doing adversarial things. But here, there are many very complicated ways to prompt 01:13:06.820 |
ChatGPT so that you can get around the safeguards and therefore still ask ChatGPT to generate harmful content. 01:13:22.140 |
So another problem with these large language models is that they are not necessarily truthful. 01:13:27.940 |
So for example, this very famous news that Google's model actually generated factual 01:13:33.460 |
errors, which is quite disappointing. But the way the model talks about it is very convincing. 01:13:41.660 |
So you wouldn't really know that it's a factual error unless you go and check that the claim is not true. 01:13:51.060 |
So we want to avoid this type of problem. Actually, the models have already been trying 01:13:55.700 |
very hard to refrain from generating harmful content. But for models that are more open-source 01:14:03.540 |
and are smaller, the same problem still appears. And then typically, when we do our final projects 01:14:09.460 |
or when we work with models, we are probably going to deal with much smaller models. And 01:14:13.500 |
then therefore, we need to think about ways to deal with these problems better. 01:14:17.540 |
So text generation models are often constructed from pre-trained language models. And then 01:14:21.980 |
pre-trained language models are trained on internet data, which contains lots of harmful 01:14:25.940 |
stuff and bias. So when the models are prompted for this information, they will just repeat 01:14:33.340 |
the negative stereotypes that they learn from the internet training data. 01:14:37.060 |
So one way to avoid this is to do extensive data cleaning so that the pre-training data 01:14:41.980 |
does not contain any bias or stereotypical content. However, this is going to be very 01:14:46.700 |
labor-intensive and almost impossible to do because filtering a large amount of internet 01:14:51.100 |
data is just so costly that it's not really possible. 01:14:56.860 |
Again, with existing language models like GPT-2 Medium, there are some adversarial inputs 01:15:03.860 |
that almost always trigger toxic content. And these models might be exploited in the 01:15:09.100 |
real world by ill-intended people. So for example, there is a paper about universal 01:15:15.820 |
adversarial triggers, where the authors just find some universal set of words that would 01:15:21.060 |
trigger toxic content from the model. 01:15:28.300 |
And sometimes, even if you don't try to trigger the model, the model might still start to 01:15:32.180 |
generate toxic content by itself. So in this case, the pre-trained language models are 01:15:38.100 |
prompted with very innocuous prompts, but they still degenerate into toxic content. 01:15:43.540 |
So the takeaway here is that models really shouldn't be deployed without proper safeguards 01:15:48.940 |
to control for toxic content or any harmful contents in general. And models should not 01:15:53.420 |
be deployed without careful considerations of how users will interact with these models. 01:16:02.300 |
So in the ethics section, one major takeaway is that we are trying to advocate that you 01:16:07.460 |
need to think more about the model that you are building. So before deploying or publishing 01:16:13.420 |
any NLG models, please check if the model's output is not harmful. And please check if 01:16:19.340 |
the model is robust to trigger words and other adversarial prompts. 01:16:25.380 |
And of course, there are more. Basically, one can never do enough to improve the ethics 01:16:30.660 |
of text generation systems. And OK, cool. I still have three minutes left, so I can 01:16:35.420 |
still do concluding thoughts. The idea here-- well, today, we talk about the exciting applications 01:16:41.380 |
of natural language generation systems. But one might think that, well, given that ChatGPT 01:16:48.460 |
is already so good, are there any other things that we can do research-wise? If you try interacting 01:16:53.380 |
with these models, you can actually see that 01:16:58.580 |
there are still lots of limitations in their skills and performance. For example, ChatGPT 01:17:02.940 |
is able to do a lot of things with manipulating text, but it couldn't really create interesting 01:17:09.540 |
content, or really think deeply about things. So there is a lot of headroom for research. 01:17:18.460 |
And evaluation remains a really huge challenge in natural language generation. Basically, 01:17:23.900 |
we need better ways to automatically evaluate performance of NLG models, because human evaluations 01:17:29.500 |
are expensive and not reproducible. So it's better to figure out ways to compile all those 01:17:36.900 |
human judgments into a very reliable and trustworthy model. 01:17:41.620 |
And also, with the advance of all these large-scale language models, doing neural natural language 01:17:48.300 |
generation has been reset. And it's never been easier to jump into this space, because 01:17:54.340 |
now there are all the tools that are already there for you to build upon. 01:17:58.740 |
And finally, it is one of the most exciting and fun areas of NLP to work on. So yeah, 01:18:03.340 |
I'm happy to chat more about NLG if you have any questions, both after class and in class. 01:18:10.580 |
OK, cool. That's everything. So do you have any questions? If you don't, we can end the lecture here.