Hello everyone, my name is Lisa. I'm a third year PhD student in the NLP group. I'm advised by Percy and Tatsu. Today I will give a lecture on natural language generation. And this is also the research area that I work on. So I'm super excited about it. I'm happy to answer any questions both during the lecture and after class about natural language generation.
So NLG is a super exciting area and it's also moving really, really fast. So today we will discuss all the excitement of NLG. But before we get into the really exciting part, I have to make some announcements. So first, it is very, very important for you to remember to sign up for AWS by midnight today.
This is related to homework 5, since it determines whether you have GPU access, and it's also related to your final project. So please, please remember to sign up for AWS by tonight. And second, the project proposal is due on Tuesday, next Tuesday. And I think assignment 4 is due just about now.
Hopefully you had fun with machine translation and stuff. And also assignment 5 is out today, I think just now. And it is due on Friday, basically Friday midnight. And last, we will hold a Hugging Face Transformers library tutorial this Friday. So if your final project is related to implementing transformers or playing with large language models, you should definitely go to this tutorial because it's going to be very, very helpful.
Also, yeah, just one more time, please remember to sign up for AWS because this is the final hard deadline. Okay, cool. Now moving on to the main topic for today, the very exciting natural language generation stuff. So today, we will discuss what is NLG, review some models, discuss how to decode from language models and how to train language models.
And we will also talk about evaluations. And finally, we'll discuss ethical and risk considerations with current NLG systems. These natural language generation techniques are really exciting because they kind of get us closer to explaining the magic of ChatGPT, which is a super popular model recently.
And practically speaking, they could also help you with your final project if you decide to work on something related to text generation. So let's get started. To begin with, let's ask the question of what is natural language generation. So natural language generation is actually a really broad category. People have divided NLP into natural language understanding and natural language generation.
So the understanding part mostly means that the task input is in natural language, such as semantic parsing, natural language inference, and so on. Whereas natural language generation means that the task output is in natural language. So NLG focuses on systems that produce fluent, coherent, and useful language outputs for humans to use.
Historically, many NLG systems used rule-based approaches, such as templates or infilling. But nowadays, deep learning is powering almost every text generation system. So this lecture today will be mostly focused on deep learning stuff. So first, what are some examples of natural language generation? It's actually everywhere, including our homework.
Machine translation is a form of NLG, where the input is some utterance in the source language, and the output is generated text in the target language. Digital assistants, such as Siri or Alexa, are also NLG systems: they take in dialogue history and generate continuations of the conversation.
There are also summarization systems that take in a long document, such as a research article, and try to summarize it into a few sentences that are easy to read. So beyond these classic tasks, there are some more interesting uses, like creative storywriting, where you can prompt a language model with a story plot, and then it will give you some creative stories that are aligned with the plot.
There is data to text, where you give the language model some database or some tables, and then the idea is that it will output some textual description of the table content. And finally, there is also visual description-based NLG systems, like image captioning or image-based storytelling. So the really cool example is the popular ChatGPT models.
So ChatGPT is also an NLG system. It is very general purpose, so you can use it to do many different tasks with different prompts. For example, we can use ChatGPT to simulate a chatbot. It can answer questions about creative gifts for a 10-year-old. It can be used to do poetry generation.
For example, we can ask it to generate a poem about sorting algorithms. And it's actually, well, I wouldn't say it's very poetic, but at least it has the same format as a poem, and the content is actually correct. So ChatGPT can also be used in some really useful settings, like web search.
So here, Bing is augmented with ChatGPT, and there are some tweets saying that the magic of ChatGPT is that it actually makes people happy to use Bing. So there are so many tasks that actually belong to the NLG category. So how do we categorize these tasks?
One common way is to think about the open-endedness of the task. So here, we draw a line for the spectrum of open-endedness. On the one end, we have tasks like machine translation and summarization. So we consider them not very open-ended, because for each source sentence, the output is almost determined by the input.
Because in machine translation, the semantics of the output should stay essentially the same as the input sentence, there are only a few ways that you can rephrase the output. For "authorities have announced that today is a national holiday," you can rephrase it a little bit to say, "today is a national holiday, announced by the authorities."
But the space of valid outputs is really small, because you have to make sure the semantics doesn't change. So we can say that the output space here is not very diverse. And moving to the middle of the spectrum, there are dialogue tasks, such as task-driven dialogue or chitchat dialogue. So we can see that for each dialogue input, there are multiple responses, and the degree of freedom has increased.
Here, we can respond by saying "good, and you?", or we can say, "oh, thanks for asking, barely surviving all my homeworks." So here, we are observing that there are actually multiple ways to continue this conversation. And then this is where we say the output space is getting more and more diverse.
And on the other end of the spectrum, there is the very open-ended generation tasks, like story generation. So given the input, like write me a story about three little pigs, there are so many ways to continue the prompt. We can write about them going to schools, building houses, like they always do.
So the set of valid outputs here is extremely large. And we call this open-ended generation. So it's hard to really draw a boundary between open-ended and non-open-ended tasks. But we still try to give a rough categorization. So open-ended generation refers to tasks whose output distribution has a high degree of freedom.
Non-open-ended generation refers to tasks where the input almost certainly determines the output. Examples of non-open-ended generation are machine translation and summarization. And examples of open-ended generation are story generation, chitchat dialogue, task-oriented dialogue, et cetera. So how do we formalize this categorization? One way of formalizing is by computing the entropy of the NLG system.
So high entropy means that we are to the right of the spectrum. So it is more open-ended. And low entropy means that we are to the left of the spectrum and less open-ended. So these two classes of NLG tasks actually require different decoding and training approaches, as we will talk about later.
OK, cool. Now let's recall some previous lectures and review the NLG models and trainings that we have studied before. So I think we discussed the basics of natural language generation. So here is how autoregressive language model works. At each time step, our model would take in a sequence of tokens as input.
And here it is y less than t. And the output is basically the new token yt. So to decide on yt, we first use the model to assign a score for each token in the vocabulary, denoted as s. And then we apply softmax to get the next token distribution, p.
And we choose a token according to this next token distribution. And similarly, once we have predicted yt hat, we then pass it back into the language model as the input, predict y hat t plus 1. And then we do so recursively until we reach the end of the sequence.
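Just to make this loop concrete, here is a minimal sketch of it in code. The model interface (returning one score per vocabulary token for the last position) and the greedy choice of the next token are illustrative assumptions, not the exact setup from the slides.

```python
import torch
import torch.nn.functional as F

def generate(model, tokens, max_new_tokens, eos_id):
    """Autoregressive loop sketch: score -> softmax -> pick a token -> feed it back in."""
    for _ in range(max_new_tokens):
        scores = model(tokens)[:, -1, :]                  # s: one score per vocabulary token
        probs = F.softmax(scores, dim=-1)                 # p: next-token distribution
        next_token = probs.argmax(dim=-1, keepdim=True)   # greedy choice (one possible g)
        tokens = torch.cat([tokens, next_token], dim=-1)  # feed y_t back in as input
        if next_token.item() == eos_id:                   # assumes batch size 1; stop at end token
            break
    return tokens
```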
So any questions so far? OK, good. So for the two types of NLG tasks that we talked about, like the open-ended and non-open-ended tasks, they tend to prefer different model architectures. So for non-open-ended tasks, like machine translation, we typically use an encoder-decoder system, where the autoregressive decoder that we just talked about functions as the decoder.
And then we have another bidirectional encoder for encoding the inputs. So this is kind of what you implemented for assignment 4, because the encoder is like the bidirectional LSTM, and the decoder is another LSTM that is autoregressive. So for more open-ended tasks, typically autoregressive generation model is the only component.
Of course, these architectures are not really hard constraints, because an autoregressive decoder alone can also be used to do machine translation. And an encoder-decoder model can also be used for story generation. So this is kind of the convention for now, but it's a reasonable convention, because using a decoder-only model for MT tends to hurt performance compared to an encoder-decoder model for MT.
And using an encoder-decoder model for open-ended generation seems to achieve similar performance to a decoder-only model. And therefore, if you have the compute budget to train an encoder-decoder model, you might just be better off by only training a larger decoder model. So it's kind of more of an allocation of resources problem than whether this architecture will type check with your task.
So how do we train such a language model? In previous lectures, we talked about how language models are trained by maximum likelihood. So basically, we are trying to maximize the probability of the next token, yt, given the preceding words. And this is our optimization objective. So at each time step, this can be regarded as a classification task, because we are trying to distinguish the actual word, yt star, from all the remaining words in the vocabulary.
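Roughly, the per-step classification view of this objective looks like the sketch below; the tensor shapes and the use of a standard cross-entropy loss are assumptions for illustration.

```python
import torch.nn.functional as F

def mle_loss(model, gold_tokens):
    """MLE sketch: condition on the gold prefix y*_{<t}, predict the gold token y*_t at every step."""
    inputs, targets = gold_tokens[:, :-1], gold_tokens[:, 1:]
    logits = model(inputs)  # scores for every position, shape (batch, length, vocab)
    # Each position is a classification over the vocabulary against the gold next token.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```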
And this is also called teacher forcing, because at each time step, we are using the gold standard, y star less than t, as input to the model. Whereas, presumably, at generation time, you wouldn't have any access to y star. So you would have to use the model's own prediction to feed it back into the model to generate the next token.
And that is called student forcing, which we'll talk about in detail later. Oh, sorry. Yeah, I think I skipped this two slides ago. About autoregressive--we never used that word before. What does it mean? Autoregressive? Oh, it just means--so let's look at this animation again. Oops, sorry. It just means that you are generating words from left to right, one by one.
So here, suppose that you are given y less than t. And then autoregressively, you first generate yt. And then once you have yt, you'll feed it back in, generate yt plus 1, and then feed it back in, generate another thing. So this left to right nature, because you are using chain rule to condition on the tokens that you just generated, this chain rule thing is called autoregressive.
And typically, I think conventionally, we do left-to-right autoregressive generation. But there are also other more interesting models that can do backward generation or infilling and other things. This idea of generating one token at a time is autoregressive. Cool. Any other questions? Yep. So at inference time, our decoding algorithm would define a function to select a token from this distribution.
So we've discussed that we can use the language model to compute this p, which is the next token distribution. And then g here, based on our notation, is the decoding algorithm, which helps us select what token we are actually going to use for yt. So the obvious decoding algorithm is to greedily choose the highest probability token as yt for each time step.
So while this basic algorithm sort of works, because they work for your homework 4, to do better, there are two main avenues that we can take. We can decide to improve decoding. And we can also decide to improve the training. Of course, there are other things that we can do.
We can improve training data. And we can improve model architectures. But for this lecture, we will focus on decoding and training. So now let's talk about how decoding algorithms work for natural language generation models. Before that, I'm happy to take any questions about the previous slides. OK. Yeah. Sorry, could you just explain one more time the difference between teacher forcing and student forcing?
I think I'll go into this in detail later. But sure. So basically, you do teacher forcing when you train the language model, because you already observe the gold text. So you use the gold text up until time step t, put it into the model.
And then the model would try to predict y t plus 1. Whereas student forcing means that you don't have access to this gold reference data, but you are still trying to generate a sequence. So you have to use the text that you generated yourself using the model, and then feed it back into the model as input to predict the token at t plus 1.
That's the primary difference. Cool. So what is decoding all about? At each time step, our model computes a vector of scores, one for each token. So it takes in the preceding context y less than t and produces scores s. And then we compute a probability distribution p out of the scores by just applying softmax to normalize them.
And our decoding algorithm is defined as this function g, which takes in the probability distribution and try to map it to some word. Basically, try to select a token from this probability distribution. So in the machine translation lecture, we talked about greedy decoding, which selects the highest probability token of this p distribution.
And we also talk about beam search, which has the same objective as greedy decoding, which is that we are both trying to find the most likely string defined based on the model. But instead of doing so greedily for beam search, we actually explore a wider range of candidates. So we have a wider exploration of candidates by keeping always k candidates in the beam.
So overall, this maximum probability decoding is good for low entropy tasks like machine translation and summarization. But it actually encounters more problems for open-ended generation. So the most likely string is actually very repetitive when we try to do open-ended text generation. As we can see in this example, the context is perfectly normal.
It's about unicorns trying to speak English. And the first part of the continuation looks great. It's valid English. It talks about science. But suddenly, it starts to repeat. And it starts to repeat, I think, an institution's name. So why does this happen? If we look at, for example, this plot, which shows the language model's probability assigned to the sequence "I don't know," we can see the pattern here.
The first occurrence has a regular probability. But if we keep repeating this phrase, "I don't know, I don't know, I don't know," for 10 times, then we can see that there is a decreasing trend in the negative log likelihood. So the y-axis is the negative log probability. We can see this decreasing trend, which means that the model actually assigns higher probability as the repeat goes on, which is quite strange because it's suggesting that there is a self-amplification effect.
So the more repeat we have, the more confident the model becomes about this repeat. And this keeps going on. We can see that for I am tired, I'm tired, repeat 100 times, we can see a continuously decreasing trend until the model is almost 100% sure that it's going to keep repeating the same thing.
And sadly, this problem is not really solved by architecture. Here, the red plot is an LSTM model, and the blue curve is a transformer model. We can see that both models kind of suffer from the same problem. And scale also doesn't solve this problem. So we kind of believe that scale is the magical thing in NLP.
But even models with 175 billion parameters will still suffer from repetition if we try to find the most likely string. So how do we reduce repetition? One canonical approach is to do n-gram blocking. So the principle is fairly simple. Basically, you just don't want to see the same n-gram twice.
If we set n to be 3, then for any text that contains the phrase "I am happy," the next time you see the prefix "I am," n-gram blocking would automatically set the probability of "happy" to be 0, so that you will never see this trigram again.
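A minimal sketch of this trigram-blocking idea might look like the following; the data structures are just one possible way to implement it.

```python
def block_repeated_trigrams(prev_tokens, probs):
    """Zero out any token that would complete an already-seen trigram."""
    seen = set()
    for i in range(len(prev_tokens) - 2):
        seen.add(tuple(prev_tokens[i:i + 3]))    # record every trigram generated so far
    prefix = tuple(prev_tokens[-2:])             # the current bigram prefix, e.g. ("I", "am")
    for token in range(len(probs)):
        if prefix + (token,) in seen:            # "I am happy" already appeared
            probs[token] = 0.0                   # so "happy" can never be chosen here again
    return probs
```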
But clearly, this n-gram blocking heuristic has some problems, because it is quite common to want a person's name to appear twice or three times or even more in a text, and n-gram blocking would eliminate that possibility. So what are better options, that are possibly more complicated? For example, we can use a different training objective. Instead of training by MLE, we can train with an unlikelihood objective. So in this approach, the model is actually penalized for generating already seen tokens.
So it's kind of like putting this n-gram blocking idea into training time. Rather than applying the constraint at decoding time, at training time, we just decrease the probability of repetition. Another training objective is coverage loss, which uses the attention mechanism to prevent repetition. So basically, if you try to regularize and enforce your attention so that it's always attending to different words for each token, then it is highly likely that you are not going to repeat, because repetition tends to happen when you have similar attention patterns.
Another different angle is that instead of searching for the most likely string, we can use a different decoding objective. So maybe we can search for strings that maximizes the difference between log probabilities of two models. Say that we want to maximize log probability of large model minus log probability of small model.
In this way, because both models are repetitive, they would both assign high probability to repetition, and under this new objective the repetition cancels out and is effectively penalized. So here comes the broader question. Is finding the most likely string even a reasonable thing to do for open-ended text generation?
The answer is probably no, because this doesn't really match the human pattern. So we can see in this plot, the orange curve is the human pattern, and the blue curve is the machine-generated text using beam search. So you can see that with human text, there is actually lots of uncertainty, as we can see by the fluctuation of the probabilities.
For some words, we can be very certain. For some words, we are a little bit unsure. Whereas here, for the model distribution, it's always very sure. It's always assigning probability 1 to the sequence. So because we now are seeing a-- basically, there is a mismatch between the two distributions.
So it's kind of suggesting that maybe searching for the most likely string is not the right decoding objective at all. Any questions so far before we move on? Yeah? So is this the underlying mechanism for some detectors of whether some text is generated by ChatGPT? Not really, because this can only detect the really simple things that humans are also able to detect, like repetition.
So in order to avoid the previous problems that we've talked about, I'll talk about some other decoding families that generate more robust text that actually looks like this, whose probability distribution looks like the orange curve. So I wouldn't say this is the go-to answer for watermarking or detection. Can you repeat the student's question?
Oh, yeah. OK, cool. So she asked about whether this mechanism of plotting the probabilities of human text and machine-generated text is one way of detecting whether some text is generated by a model or a human. And my answer is, I don't think so, but this could be an interesting research direction.
Because I feel like there are more robust decoding approaches that generate text that actually fluctuates a lot. So yeah, let's talk about the decoding algorithm that is able to generate text that fluctuates. So given that searching for the most likely string is a bad idea, what else should we do?
And how do we simulate that human pattern? And the answer to this is to introduce randomness and stochasticity to decoding. So suppose that we are sampling a token from this distribution, P. Basically, we are trying to sample YT hat from this distribution. It is random so that you can essentially sample any token in the distribution.
Previously, with greedy decoding, you were kind of restricted to selecting the single highest-probability token, like restroom. But now you can sample a lower-probability token like bathroom instead. However, sampling introduces a new set of problems. Since we never really zero out any token probabilities, vanilla sampling would make every token in the vocabulary a viable option. And in some unlucky cases, we might end up with a bad word.
So assuming that we already have a very well-trained model, even if most of the probability mass of the distribution is over the limited set of good options, the tail of the distribution will still be very long because we have so many words in our vocabulary. And therefore, if we add all those long tails, it aggregates.
They still have a considerable mass. So statistically speaking, this is called heavy tail distribution. And language is exactly a heavy tail distribution. So for example, many tokens are probably really wrong in this context. And then given that we have a good language model, we assign them each very little probability.
But this doesn't really solve the problem, because there are so many of them; if you aggregate them as a group, they still have a high chance of being selected. And the solution that we have for this problem of the long tail is that we should just cut off the tail.
We should just zero out the probabilities that we don't want. And one idea is called top-k sampling, where the idea is that we would only sample from the top k tokens in the probability distribution. Any questions for now? OK, yeah. Well, the model we were looking at a second ago had some very low probability samples as well on the graph, right?
How would top-k sampling deal with that? You mean this one? You mean the orange-blue graph of the human versus-- Oh, yeah. So top-k will basically eliminate-- it will make it impossible to generate the super low probability tokens. So technically, it's not exactly simulating this pattern, because now you don't have the super low probability tokens, whereas humans can generate super low probability tokens in a fluent way.
But yeah, that could be another hint that people can use for detecting machine-generated text. Yeah? It also depends on the type of text you want to generate, for example, for more novels or more creative writing. Is it then you decide the hyperparameter? Yeah, yeah, for sure. K is a hyperparameter.
Depending on the type of task, you will choose K differently. Mostly for a closed-ended task, K should be small. And for open-ended, K should be large. Yeah, question in the back. How come-- I guess intuitively, this builds off of one of the earlier questions. Why don't we consider the case where we sample, and then we just weight the probability of each word by its score or something, rather than just looking at top K?
We don't do a weighted sampling type of situation. So we still have that small but non-zero probability of selecting. I think top K is also weighted. So top K just zeroes out all the tails of the distribution. But for the things that it didn't zero out, it's not a uniform choice among the K.
It's still trying to choose proportionally to the scores that you computed. Is it just that it's computationally more efficient, because you don't have to do it for 17,000 words--it could be for 10 or something? Yeah, sure. That could be one gain of top-k decoding, that your softmax will take in fewer candidates.
But it's not the main reason. I'll keep talking about the main reason. So we've discussed this part. And then here, this is formally what is happening for top-k sampling: now we are only sampling from the top k tokens of the probability distribution.
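Concretely, a single top-k sampling step might look like this sketch; the PyTorch calls are one illustrative implementation, not the canonical one.

```python
import torch

def top_k_sample(scores, k):
    """Keep only the k highest-scoring tokens, renormalize, then sample from them."""
    topk_scores, topk_ids = torch.topk(scores, k)        # everything outside the top k is dropped
    probs = torch.softmax(topk_scores, dim=-1)           # still proportional to the original scores
    choice = torch.multinomial(probs, num_samples=1)     # weighted sample, not a uniform choice
    return topk_ids[choice]
```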
And as we've said, K is a hyperparameter. So we can set K to be large or small. If we increase K, this means that we are making our output more diverse, but at the risk of including some tokens that are bad. If we decrease K, then we are making more conservative and safe options.
But possibly the generation will be quite generic and boring. So is top K decoding good enough? The answer is not really. Because we can still find some problems with top K decoding. For example, in the context, she said, I never blank. There are many words that are still valid options, such as went, ate.
But those words got zeroed out because they are not within the top k candidates. So this actually leads to bad recall for your generation system. And similarly, another failure of top-k is that it can also cut off too slowly. So in this example, code is not really a valid answer, according to common sense, because you probably don't want to eat a piece of code.
But the probability remains non-zero, meaning that the model might still sample code as an output; despite this low probability, it might still happen. And this means bad precision for the generation model. So given these problems with top-k decoding, how can we address them? How can we address this issue that there is no single k that fits all circumstances?
This is basically because the probability distributions that we sample from are dynamic. When the probability distribution is relatively flat, a small k will remove many viable options, so we want k to be larger in this case. Conversely, when the distribution p is very peaky, a high k would allow too many options to be viable.
And instead, we might want a smaller k so that we are being safer. So the solution here is that maybe k is just a bad hyperparameter. And instead of fixing k, we should think in terms of probability. We should think about how to sample from the tokens in the top p percentile of the cumulative probability mass.
So now, the advantage of doing top P sampling, where we sample from the top P percentile of the cumulative probability mass, is that this is actually equivalent to-- we have now an adaptive K for each different distribution. And let me explain what I mean by having an adaptive K.
So in the first distribution, this is like a regular power law of language that's kind of typical. And then doing top K sampling means we are selecting the top K. But doing top P sampling means that we are zooming into maybe something that's similar to top K in effect.
But if I have a relatively flat distribution like the blue one, we can see that doing top P means that we are including more candidates. And then if we have a more skewed distribution like the green one, doing top P means that we actually include fewer candidates. So by actually selecting the top P percentile in the probability distribution, we are actually having a more flexible K and therefore have a better sense of what are the good options in the model.
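And a corresponding top-p (nucleus) step might look like this sketch, where the cutoff adapts to the shape of the distribution; again, the exact bookkeeping here is just one way to do it.

```python
import torch

def top_p_sample(scores, p):
    """Keep the smallest set of top tokens whose cumulative probability covers p, then sample."""
    probs = torch.softmax(scores, dim=-1)                  # softmax over the full vocabulary
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative <= p                                 # the "nucleus" of the distribution
    keep[0] = True                                         # always keep at least the top token
    kept_probs = sorted_probs[keep] / sorted_probs[keep].sum()
    choice = torch.multinomial(kept_probs, num_samples=1)
    return sorted_ids[keep][choice]
```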
Any questions about top P, top K decoding? So everything's clear. Yeah, sounds good. So to go back to that question, doing top K is not necessarily saving compute. Or this whole idea is not really compute saving intended. Because in the case of top P, in order to select the top P percentile, we still need to compute the softmax over the entire vocabulary set in order for us to compute the P properly.
So therefore, it's not really saving compute, but it's improving performance. Cool. Moving on. So there is much more to decoding algorithms. Besides the top-k and top-p that we've discussed, there are some more recent approaches like typical sampling, where the idea is that we reweight the scores based on the entropy of the distribution and try to generate text whose negative log probability is close to the entropy of the data distribution.
This means that if you have a closed-ended or non-open-ended task, it has smaller entropy, so you'll want the negative log probability to be smaller, so you want the probability to be larger. So it type checks very well. And additionally, there is also epsilon sampling, coming from John. This is an idea where we set a threshold to lower-bound the probabilities.
So basically, if you have a word whose probability is less than 0.03, for example, then that word will never appear in the output distribution. That word will never be part of your output because it has such low probability. Yeah. How do you compute the entropy of a distribution?
Oh, cool. Great question. So the entropy of a distribution is defined as--suppose that we have a discrete distribution p. We just enumerate over x and sum up p of x times the negative log probability of x. So if we write it from an expectation perspective, it's basically the expectation of the negative log probability of x.
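Roughly, what I'm writing on the board is this, for a discrete distribution p:

H(p) = -\sum_{x} p(x) \log p(x) = \mathbb{E}_{x \sim p}\left[-\log p(x)\right]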
I have to do a little bit here. So this is the entropy of a distribution. And then-- so basically, if your distribution is very concentrated to a few words, then the entropy will be relatively small. If your distribution is very flat, then your entropy will be very large. Yeah.
What if the epsilon sampling is such that we have no valid option? Oh, yeah. I mean, there will be some back-off cases, I think. So in the case that there is no valid options, you'll probably still want to select one or two things, just as an edge case, I think.
OK, cool. Moving on. So another hyperparameter that we can tune to affect decoding is the temperature parameter. So recall that previously at each time step, we asked the model to compute a score. And then we renormalized that score using softmax to get a probability distribution. So one thing that we can adjust here is that we can insert this temperature parameter tau to rescale the scores.
So basically, we just divide all the scores s_w by tau. And after dividing, we apply softmax. And we get a new distribution. And this temperature adjustment is not really going to affect the monotonicity of the distribution. For example, if word A has higher probability than word B previously, then after the adjustment, word A is still going to have a higher probability than word B.
But their relative difference will change. So for example, if we raise the temperature tau to be greater than 1, then the distribution Pt will become more uniform. It will be flatter. And this implies that there will be more diverse output because our distribution is flatter. And it's more spread out across different words in the vocabulary.
On the other hand, if we lower the temperature tau to be less than 1, then Pt becomes very spiky. And then this means that if we sample from Pt, we'll get less diverse output, because here the probability is concentrated only on the top words. So in the very extreme case, if we set tau to be very, very close to 0, then the probability will become a one-hot vector, where all the probability mass will be centered on one word.
And then this reduces back to argmax sampling, or greedy decoding. So temperature is a hyperparameter as well, like k and p in top-k and top-p sampling. It is a hyperparameter for decoding. It can be tuned for both beam search and sampling algorithms. So it's kind of orthogonal to the approaches that we discussed before.
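As a sketch, temperature scaling is just one extra division before the softmax; the function below shows one way it could slot into a sampling step.

```python
import torch

def sample_with_temperature(scores, tau):
    """Divide scores by tau before softmax: tau > 1 flattens p, tau < 1 sharpens it."""
    probs = torch.softmax(scores / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```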
Any questions so far? OK, cool. Temperature is so easy. So well, because sampling still involves randomness, even though we try very hard in terms of truncation, truncating the tail, sampling still has randomness. So what if we're just unlucky and decode a bad sequence from the model? One common solution is to do re-ranking.
So basically, we would decode a bunch of sequences. For example, we can decode 10 candidates. Whether it's 10 or 30 is up to you. The only consideration is that you want to balance compute efficiency and performance. So if you decode more sequences, then, of course, your performance is going to increase.
But it's also very costly to just generate a lot of things for one example. And so once you have a bunch of sample sequences, then we are trying to define a score to approximate the quality of the sequence and re-rank all the candidates by this score. So the simple thing to do is we can use perplexity as a metric, as a scoring function.
But we need to be careful, because we have talked about the extreme of perplexity: if we try to argmax the log probability and aim for a super low perplexity, the text is actually very repetitive. So we shouldn't really aim for extremely low perplexity.
And perplexity, to some extent, is not a perfect scoring function, because it's not really robust to maximize. So alternatively, the re-rankers can actually use a wide variety of other scoring functions. We can score text based on its style, its discourse coherence, its entailment and factuality properties, consistency, and so on.
And additionally, we can compose multiple re-rankers together. Yeah, question? >> You mentioned 10 candidates or any number of candidates. What's the strategy you usually use to generate these other candidates? Like what heuristic do you use? >> So basically, the idea is to sample from the model. So when you sample from the model, each time you sample, you are going to get a different output.
And then that's what I mean by different candidates. So if you sample 10 times, you will very likely get 10 different outputs. And then you are just-- given these 10 different outputs that come from sampling, you can just decide, re-rank them, and select the candidate that has the highest score.
>> Where does the randomness come from? >> Oh, because we are sampling here. >> That sample, okay. >> Yeah, yeah. For example, if you are doing top-p sampling, then, well, suppose that A and B are equally probable, then you might sample A, or you might sample B, with the same probability.
Okay, cool. And another cool thing that we can do with re-ranking is that we can compose multiple re-rankers together. So basically, suppose you have a scoring function for style, and you have a scoring function for factual consistency. You can just add those two scoring functions together to get a new scoring function, and then re-rank everything based on your new scoring function to get text that is both good at style and good at factual consistency.
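Putting this together, a minimal sample-then-rerank loop might look like the sketch below; the scorer functions and the number of candidates are placeholders, not a specific system.

```python
def sample_and_rerank(sample_fn, scorers, num_candidates=10):
    """Draw several candidates, score each with a composed scorer, keep the best one."""
    candidates = [sample_fn() for _ in range(num_candidates)]   # each call samples a new sequence
    def total_score(text):
        return sum(score(text) for score in scorers)            # e.g. style + factual consistency
    return max(candidates, key=total_score)
```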
Yeah? >> Yeah, so when you say that we re-rank by score, do we just pick the decoding that has the highest score, or do we do some more sampling again based on the score? >> The idea is you just take the decoding that has the highest score, because you already have, say, 10 candidates.
So out of these 10, you only need one, and then you just choose one that has the highest score. Yeah. Cool. Any other questions? Yeah? >> Sorry. What is perplexity? >> Oh, yeah. Perplexity is like, you can kind of regard it as log probabilities. It's like E to the negative log probabilities.
It's kind of like if a token has high perplexity, then it means it has low probability, because you are more perplexed. Okay. So taking a step back to summarize this decoding section, we have discussed many decoding approaches from selecting the most probable string to sampling, and then to various truncation approaches that we can do to improve sampling, like top P, top K, epsilon, typical decoding.
And finally, we discussed how we can re-rank the results. So decoding is still a really essential problem in NLG, and there is lots of work to be done here still, especially as ChatGPT is so powerful. We should all go study decoding. So it would be interesting if you want to do such final projects.
And also, different decoding algorithms allow us to inject different inductive biases into the text that we are trying to generate. And some of the most impactful advances in NLG in the last couple of years actually come from simple but effective decoding algorithms. For example, the nucleus sampling paper is actually very, very highly cited.
So moving on to talk about training NLG models. Well, we have seen this example before in the decoding slides, and I'm showing it again because, even though we can mitigate this repetition problem by sampling instead of searching, it's still concerning from a language modeling perspective that your model would put so much probability on such repetitive and degenerate text.
So we ask this question: is repetition due to how language models are trained? You have also seen this plot before, which shows this decaying pattern, or this self-amplification effect. So we can conclude from this observation that a model trained via an MLE objective learns a really bad mode of the distribution.
By mode of the distribution, I mean the argmax of the distribution. So basically, it would assign high probability to terrible strings. And this is definitely problematic from a modeling perspective. So why is this the case? Shouldn't MLE be a gold standard in machine learning in general, not just machine translation?
Shouldn't MLE be a gold standard for machine learning? The answer here is not really, especially for text, because MLE has some problems for sequential data. And we call this problem exposure bias. So training with teacher forcing leads to exposure bias at generation time, because during training, our model's inputs are gold context tokens from real human-generated text, denoted by y star less than t here.
But during generation time, our model's inputs become previously decoded tokens from the model, y hat less than t. And suppose that our model has minor errors; then y hat less than t will be much worse in terms of quality than y star less than t. And this is terrible, because it causes a discrepancy between training and test time, which actually hurts model performance.
And we call this problem exposure bias. So people have proposed many solutions to address this exposure bias problem. One thing to do is to do scheduled sampling, which means that with probability p, we try to decode a token and feed it back in as context to train the model.
And with probability 1 minus p, we use the gold token as context. So throughout training, we try to increase p to gradually warm it up, and then prepare it for test-time generation. This leads to improvements in practice, because by gradually increasing p, we are narrowing the discrepancy between training and test time.
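A rough sketch of one scheduled-sampling training step is below, where p is the probability of feeding in the model's own prediction; the helper methods predict_next and loss are hypothetical names, not a real API.

```python
import random

def scheduled_sampling_step(model, gold_tokens, p):
    """Build the input prefix token by token, mixing model predictions and gold tokens."""
    prefix = [gold_tokens[0]]
    for t in range(1, len(gold_tokens)):
        if random.random() < p:
            prefix.append(model.predict_next(prefix))   # with probability p: the model's own token
        else:
            prefix.append(gold_tokens[t])               # with probability 1 - p: the gold token
    return model.loss(prefix, gold_tokens)              # targets are still the gold text
```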
But the objective is actually quite strange, and training can be very unstable. Another idea is to do dataset aggregation, and the method is called DAgger. Essentially, at various intervals during training, we generate sequences of text from the current model and then put these sequences into the training data.
So we're kind of continuously doing this training data augmentation scheme to make sure that the training distribution and the generation distribution are closer together. So both approaches, scheduled sampling and dataset aggregation, are ways to narrow the discrepancy between training and test. Yes, question? What is the gold token?
Gold token just means human text. It means like-- well, when you train a language model, you will see lots of corpus that are human written. Gold is just human. Yeah. OK, cool. So another approach is to do retrieval augmented generation. So we first learn to retrieve a sequence from some existing corpus of prototypes.
And then we train a model to actually edit the retrieved text by doing insertion, deletion, or swapping. We can add or remove tokens from this prototype, and then try to modify it into another sentence. So this doesn't really suffer from exposure bias, because we start from a high-quality prototype.
So that at training time and at test time, you don't really have the discrepancy anymore, because you are not generating from left to right. Another approach is to do reinforcement learning. So here, the idea is to cast your generation problem as a Markov decision process. So there is the state s, which is the model's representation for all the preceding context.
There is action a, which is basically the next token that we are trying to pick. And there is policy, which is the language model, or also called the decoder. And there is the reward r, which is provided by some external score. And the idea here-- well, we won't go into details about reinforcement learning and how it works, but we will recommend the class CS234.
So in the reinforcement learning context, because reinforcement learning involves a reward function, that reward function is very important. So how do we do reward estimation for text generation? Well, a really natural idea is to just use the evaluation metrics. Because you are trying to do well in terms of evaluation anyway, why not just optimize the evaluation metrics directly at training time?
For example, in the case of machine translation, we can use BLEU score as the reward function. In the case of summarization, we can use ROUGE score as the reward function. But we really need to be careful about optimizing for the task as opposed to gaming the reward, because evaluation metrics are merely proxies for generation quality.
So sometimes, you run RL and improve the BLEU score by a lot. But when you run human evaluations, humans might still think that, well, this generated text is no better than the previous one, or even worse, even though it gives you a much better BLEU score. So we want to be careful about this and not game the reward.
So what behaviors can we tie to a reward function? This is about reward design and reward estimation. There are so many things that we can do. We can do cross-modality consistency for image captioning. We can do sentence simplicity to make sure that we are generating simple English that are understandable.
We can do formality and politeness to make sure that, I don't know, your chatbot doesn't suddenly yell at you. And the most important thing that's really, really popular recently is human preference. So we should just build a reward model that captures human preference. And this is actually the technique behind the ChatGPT model.
So the idea here is that we would ask human to rank a bunch of generated text based on their preference. And then we will use this preference data to learn a reward function, which will basically always assign high score to something that humans might prefer and assign low score to something that humans wouldn't prefer.
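As a sketch, one common way to turn such pairwise preferences into a reward model is a loss like the one below; the scalar reward model interface is an assumption for illustration, not necessarily exactly what was done for ChatGPT.

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred, dispreferred):
    """Push the reward of the human-preferred text above the reward of the other one."""
    r_good = reward_model(preferred)        # scalar score for the preferred generation
    r_bad = reward_model(dispreferred)      # scalar score for the dispreferred generation
    return -F.logsigmoid(r_good - r_bad)    # standard pairwise (Bradley-Terry style) loss
```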
Yeah, question? Wouldn't it be more expensive? Like, is it all just real people annotating? Oh yeah, sure. I mean, it is going to be very expensive. But I feel like compared to all the cost of training models, training like 170-billion-parameter models, I feel like OpenAI and Google are, well, they can afford hiring lots of humans to do human annotations and ask their preference.
How much data would we need to, like, get reasonable answers? Yeah, this is a great question. So I think it's kind of a mystery how much data you exactly need to achieve the level of performance of ChatGPT. But roughly speaking, I feel like, whenever you try to fine-tune a model on some downstream task--and similarly here you are trying to fine-tune your model on human preference--
it does need quite a lot of data, maybe on the scale of 50k to 100k. That's roughly the scale that-- like, Anthropic actually released some dataset about human preference. That's roughly the scale that they released, I think, if I remember correctly. Yeah, question. So we talked about earlier how many of the state-of-the-art language models use transformers as their architecture.
How do you apply reinforcement learning to this model? To what do you mean? To transformer model? Yeah. Yeah, I feel like reinforcement learning is kind of a modeling tool. I mean, it's kind of an objective that you are trying to optimize. Instead of an MLE objective, now you are optimizing for an RL objective.
So it's kind of orthogonal to the architecture choice. So a transformer is an architecture. You just use transformer to give you probability of the next token distribution or to try to estimate probability of a sequence. And then once you have the probability of a sequence, you use that probability of the sequence, pass it into the RL objective that you have.
And then suppose that you are trying to do policy gradient or something, then you need to estimate the probability of that sequence. And then you just need to be able to backprop through transformer, which is doable. Yeah, so I think the question about architecture and objectives are orthogonal. So even if you have an LSTM, you can do it.
If you have a transformer, you can also do it. Yeah. Cool. Hope I answered that question. Yeah. Can we just use another model for this kind of reward--for example, another transformer to calculate the reward? Yeah, I think that's exactly what they did. So for example, you would have GPT-3.
You use GPT-3 as the generator that generate text. And you kind of have another pre-trained model that could probably also be GPT-3, but I'm guessing here, that you fine tune it to learn human preference. And then once you have a human preference model, you use the human preference model to put it into RL as the reward model.
And then use the original GPT-3 as the policy model. And then you apply RL objectives and then update them so that you will get a new model that's better at everything. OK, cool. Yeah, actually, if you are very curious about RLHF, I would encourage you to come to the next lecture, where Jesse will talk about RLHF.
RLHF is shorthand for reinforcement learning from human feedback. So, takeaways. Teacher forcing is still the main algorithm for training text generation models. And exposure bias causes problems in text generation models. For example, it causes models to lose coherence, and causes models to be repetitive. And models must learn to recover from their own bad samples by using techniques like scheduled sampling or DAgger.
Another approach to reduce exposure bias is to start from good text, like retrieval plus generation. And we also discussed how to do training with RL. And this can actually make the model learn behaviors that are preferred by humans or preferred by some metrics.
So to be very up to date, in the best language model nowadays, ChatGPT, the training is actually pipelined. For example, we would first pre-train a large language model on internet corpora by self-supervision. And this kind of gets you ChatGPT-- sorry, GPT-3, which is the original version. And then you would do some sort of instruction tuning to fine-tune the pre-trained language model so that it learns roughly how to follow human instructions.
And finally, we would do RLHF to make sure that these models are well-aligned with human preference. So if we start RLHF from scratch, it's probably going to be very hard for the model to converge, because RL is hard to train for text data, et cetera. So RL doesn't really work from scratch.
But with all these smart tricks about pre-training and instruction tuning, suddenly now they're off to a good start. Cool. Any questions so far? OK. Oh, yeah. You mean the difference between DAgger and scheduled sampling is how long the sequences are? Yeah, I think roughly that is it. Because for DAgger, you are trying to put in fully generated sequences.
But I feel like there can be variations of DAgger. DAgger is just a high-level framework and idea. There can be variations of DAgger that are very similar to scheduled sampling, I think. I feel like scheduled sampling is kind of a more smoothed version of DAgger. Because for DAgger, you have to-- well, basically, for this epoch, I am generating something.
And then after this epoch finishes, I put this into the data and then train for another epoch. Whereas scheduled sampling seems to be more flexible in terms of where you add data. Yes? So for DAgger, if you regress the model on its own output, how does it help the model? I think that's a good question.
I feel like if you regress the model-- for example, if you regress the model on its own output, I think there should be smarter ways than to exactly regress on your own output. For example, you might still consult some gold reference data, for example, given that you ask the model to generate for something.
And then you can, instead of using-- say you ask the model to generate for five tokens. And then instead of using the model's generation to be the sixth token, you'll probably try to find some examples in the training data that would be good continuations. And then you try to plug that in by connecting the model generation and some gold text.
And then therefore, you are able to correct the model, even though it probably went off path a little bit by generating its own stuff. So it's kind of like letting the model learn how to correct for itself. But yes, I think you are right. If you just put model generation in the data, it shouldn't really work.
Yeah. Any other questions? Cool. Moving on. Yes. So now we'll talk about how we are going to evaluate NLG systems. So there are three types of methods for evaluation. There is content overlap metrics. There is model-based metrics. And there is human evaluations. So first, content overlap metrics compute a score based on lexical similarities between the generated text and the gold reference text.
So the advantage of this approach is that it's very fast and efficient and widely used. For example, BLEU score is very popular in MT. And ROUGE score is very popular in summarization. So these methods are very popular because they are cheap and easy to run. But they are not really ideal metrics.
For example, simply relying on lexical overlap might miss rephrasings that have the same semantic meaning. Or it might reward text with a large amount of lexical overlap that actually has the opposite meaning. So you have lots of both false positive and false negative problems. So despite all these disadvantages, these metrics are still the go-to evaluation standard in machine translation.
Part of the reason is that MT is actually super close-ended. It's very non-open-ended. And therefore it is probably still fine to use BLEU score to measure machine translation. And they get progressively worse for tasks that are more open-ended. For example, they get worse for summarization, because the output text becomes much harder to measure with overlap.
They are much worse for dialogue, which is more open-ended. And then they are much, much worse for story generation, which is also open-ended. And the drawback here with the n-gram metrics is this: suppose that you are generating a story that's relatively long. Then if you are still looking at word overlap, you might actually get very high n-gram scores because your text is very long, not because it's actually of high quality.
Just because you are talking so much, you might have covered lots of points already. Yes? Yes, exactly. That's the next thing that I will talk about as a better metric for evaluation. But for now, let's do a case study of a failure mode of BLEU score, for example.
So suppose that Chris asked the question, are you enjoying the CS224N lectures? The correct answer, of course, is "heck yes!" So if one of the answers is "yes," it will get a score of 0.61 because it has some lexical overlap with the correct answer. If you answer "you know it!", then it gets a relatively lower score because it doesn't really have any lexical overlap except for the exclamation mark.
And if you answer yep, this is semantically correct, but it actually gets 0 score because there is no lexical overlap between the gold answer and the generation. If you answer heck no, this should be wrong. But because it has lots of lexical overlap with the correct answer, it's actually getting some high scores.
So these two cases are the major failure modes of lexical n-gram overlap metrics: you get false negatives and false positives. So moving beyond these failure modes of lexical metrics, the next step is to check for semantic similarity. And model-based metrics are better at capturing semantic similarity. So this is kind of similar to what you raised a couple minutes ago.
We can actually use learned representations of words and sentences to compute semantic similarity between generated and reference text. So now we are no longer bottlenecked by n-grams. Instead, we're using embeddings. And these embeddings are going to be pre-trained. But the metrics can still improve over time, because we can just swap in different pre-trained embeddings and keep the same metric.
So here are some good examples of the metrics that could be used. One thing is to do vector similarity. This is very similar to homework one, where you are trying to compute similarity between words, except now we are trying to compute similarity between sentences. There are some ideas of how to go from word similarity to sentence similarities.
For example, you can just average the embeddings, which is a relatively naive idea, but it works sometimes. Another high-level idea is that we can measure Word Mover's Distance. The idea here is that we can use optimal transport to align the source and target word embeddings. Suppose that your source sentence is "Obama speaks to the media in Illinois," and the target is "The president greets the press in Chicago."
From a human evaluation perspective, these two are actually very similar, but they are not exactly aligned word by word. So we need to figure out how to optimally align words to words, like align Obama to president, align Chicago to Illinois, and then therefore we can compute a score. We can compute the pairwise word embedding difference between this, and then get a good score for the sentence similarities.
And finally, there is BERTScore, which is also a very popular metric for semantic similarity. It first computes pairwise cosine distances using BERT embeddings, then it finds an optimal alignment between the source and target sentences, and then it finally computes a score. So I feel like these details are not really that important, but the high-level idea is super important: we can now use word embeddings to compute sentence similarities by doing some sort of smart alignment, transforming word similarity into sentence similarity.
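For the naive embedding-averaging idea mentioned a moment ago, a sketch might look like this; the embed function, which returns one vector per word, is a hypothetical stand-in for any pre-trained word embeddings.

```python
import numpy as np

def sentence_similarity(sent_a, sent_b, embed):
    """Average the word embeddings of each sentence, then compare with cosine similarity."""
    vec_a = np.mean([embed(w) for w in sent_a.split()], axis=0)
    vec_b = np.mean([embed(w) for w in sent_b.split()], axis=0)
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
```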
To move beyond word embeddings, we can also use sentence embeddings to compute sentence similarities. So typically, this doesn't have the word-by-word alignment problem, but it has similar problems in that you now need to align sentences or phrases within a sentence. And similarly, there is BLEURT, which is slightly different.
It is a regression model based on BERT. The model is trained as a regression problem to return a score that indicates how good the text is in terms of grammaticality, meaning, and similarity with the reference text. So this is kind of treating evaluation as a regression problem.
Any questions so far? OK, cool. We can move on. So all the previously mentioned approaches are evaluating semantic similarity, so they can be applied to non-open-ended generation tasks. But what about open-ended settings? Here, enforcing semantic similarity seems wrong, because a story can be perfectly fluent and perfectly high quality without having to resemble any of the reference stories.
So one idea here is that maybe we want to evaluate open-ended text generation using the MAUVE score. MAUVE computes the information divergence in a quantized embedding space between the generated text and the gold reference text. So here is roughly the detail of what's going on. Suppose that you have a batch of text from the gold reference that is human-written, and you have a batch of text that's generated by your model.
Step number one is to embed this text, that is, to map it into some continuous representation space, which is the figure on the left. But it's really hard to compute any distance metric in this continuous embedding space, because different sentences might lie very far away from each other.
So the idea is to run k-means clustering to discretize the continuous space. After the discretization, we have a histogram over clusters for the gold human-written text and a histogram for the machine-generated text, and we can then compute precision and recall using these two discretized distributions.
And then we can compute precision via the forward KL and recall via the backward KL. Yes, question? Why do we want to discretize it? Maybe it's equivalent to answer why it's hard to work with the continuous space. The idea is: if you embed one sentence into the continuous space and it lies here, and you embed another sentence and it lies over there, and you only have a finite number of sentences,
then they would basically be Dirac delta distributions on your manifold. You probably want a smoother distribution, but it's hard to define what a good, smooth distribution is in the space of text embeddings, because they're not very interpretable. So if you embed everything in the continuous space, you end up with lots of Dirac deltas that are very peaked and not really connected to their neighbors.
So it's hard to quantify a KL divergence or any distance metric in that space without making extra assumptions. For example, you could make a Gaussian assumption and smooth all the embeddings by convolving them with a Gaussian, and then you would start getting meaningful distance metrics.
But with just the embeddings alone, you're not going to get meaningful distance metrics. And it doesn't really make sense to smooth things with a Gaussian, because who said text representations are Gaussian? Yeah, question? How do the distributions in the plot end up looking continuous? I think this requires some Gaussian smoothing.
Yeah, I think that plot was made with some smoothing. I didn't make the plot, so I can't be perfectly sure, but the fact that it looks like this suggests it was smoothed a little bit. So you put in word embeddings and-- These are sentence embeddings, or concatenated word embeddings, because you are comparing sentences to sentences, not words to words.
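To make the quantize-and-compare idea concrete, here is a pseudo-code-level Python sketch. It assumes human_emb and model_emb are arrays of sentence embeddings (one row per text) produced by some encoder, which is left out. Note that the real MAUVE metric traces out a whole divergence frontier and summarizes it, rather than reporting two raw KL values, so this only mirrors the steps described above.

import numpy as np
from sklearn.cluster import KMeans

def mauve_style_sketch(human_emb, model_emb, n_clusters=50, eps=1e-6):
    # Quantize the joint embedding space with k-means.
    all_emb = np.vstack([human_emb, model_emb])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(all_emb)

    def cluster_histogram(emb):
        counts = np.bincount(km.predict(emb), minlength=n_clusters) + eps
        return counts / counts.sum()

    p = cluster_histogram(human_emb)   # discretized distribution of human text
    q = cluster_histogram(model_emb)   # discretized distribution of model text

    # The two KL directions play the roles of the precision-like and
    # recall-like quantities discussed in the lecture.
    kl_q_p = float(np.sum(q * np.log(q / p)))
    kl_p_q = float(np.sum(p * np.log(p / q)))
    return kl_q_p, kl_p_q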
Yeah, so the advantage of the MAUVE score is that it is applicable to open-ended settings, because you are now measuring precision and recall with respect to the target distribution. So it has a better probabilistic interpretation than all the previous similarity metrics. Cool. Any other questions? Yes? I'm just not entirely clear.
So if we're trying to maximize precision and recall here, how is that different from just trying to maximize the similarity between the generated and target distributions? Oh yeah, that's a good question. This matters in cases where it's really hard to get exactly the same thing. I've never tried this myself, but if you run MAUVE on a machine translation task, you would probably get a very high score.
But if you run BLEU score on open-ended text generation, you will get a super low score; it's just not really informative, because everything is so different from everything else. So I feel like MAUVE is a middle ground, where you are trying to evaluate generations that are actually very far from the references, but you still want a meaningful measurement.
Of course, if your source and target are exactly the same, or differ only up to some rephrasing, you will get the best MAUVE score. But that's not really what you're looking for, because in the open-ended setting you only have generations that are very far away from the gold text.
How do we evaluate this type of thing? Yes, question in the back. I'm still trying to understand the MAUVE score. Is it possible to write out the math, even in just a kind of pseudo, simple form? Yeah, I think it's possible. Maybe we can take this discussion after class, because I'd like to finish my slides.
Yeah, but happy to chat after class. There is a paper about it if you search for the MAUVE score; I think it won a best paper award at NeurIPS or ICML. OK, so moving on. I've pointed out that there are so many evaluation methods, so let's take a step back and think about what makes a good evaluation method.
So how do we evaluate evaluations? Nowadays, the gold standard is still to check how well the metric aligns with human judgment. If the metric correlates strongly with human judgment, in other words, if it matches human preferences, then we say that it is a good metric.
In this plot, people have plotted BLEU score and human score on the y and x axes, respectively. And because we don't see a strong correlation, this suggests that BLEU is not a very good metric. So the gold standard for evaluating language models is still human evaluation.
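As a minimal sketch of "evaluating the evaluation", one can simply measure how strongly an automatic metric's scores correlate with human ratings over the same set of outputs; metric_scores and human_scores below are assumed to be parallel lists of per-example scores.

from scipy.stats import pearsonr, spearmanr

def metric_human_agreement(metric_scores, human_scores):
    pearson_r, _ = pearsonr(metric_scores, human_scores)
    spearman_rho, _ = spearmanr(metric_scores, human_scores)
    # The higher these correlations, the better the automatic metric
    # tracks human judgment.
    return pearson_r, spearman_rho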
Automatic metrics fall short of matching human decisions, so human evaluation is the most important criterion for judging text generated by a model. It's also the gold standard when developing automatic metrics, because we want those metrics to match human evaluation. So what do we mean by human evaluation?
How is it conducted? Typically, we provide human annotators with some axes we care about, such as fluency and coherence for open-ended text generation, factuality for summarization, or style and common sense if we're trying to write a children's story.
Another thing to note: please don't compare human evaluations across different papers or studies, because human evaluations tend not to be well-calibrated and are not really reproducible. And even though we believe human evaluations are the gold standard, they still have many drawbacks. For example, human evaluations are really slow and expensive.
But even beyond the slowness and expense, they are still not perfect. First, the results may be inconsistent and not very reproducible: if you ask the same human whether they prefer A or B, they might say A the first time and B the second time.
Human annotators are also not always logical, and they might misinterpret your question. Suppose you want them to measure the coherence of the text. Different people have different criteria for coherence: some think coherence is equivalent to fluency and just look for grammatical errors,
while others think coherence means how well the continuation is aligned with the prompt or the topic. So there are all sorts of misunderstandings that can make human evaluation very hard. And finally, human evaluation only measures precision, not recall. You can give a sentence to a human and ask how much they like it,
but you can't ask a human whether the model is able to generate all the possible good sentences. So it's a precision-based metric, not a recall-based one. Here are two approaches that try to combine human evaluation with modeling. The first idea is to learn a metric from human judgments: use human judgment data as training data, and train a model to simulate human judgment.
The second approach is to ask humans and models to collaborate, so that the human is in charge of evaluating precision while the model is in charge of evaluating recall. People have also tried evaluating models interactively. In that case, we not only care about output quality; we also care about how the person feels when they interact with the model, when they try to co-author with it, how they feel about the writing process, and so on.
So this is called evaluating models more interactively. The takeaway here is that content overlap is a bad metric. Model-based metrics are better because they focus more on semantics, but they're still not good enough. Human judgment is the gold standard, but it's hard to run a human study well.
And in many cases, and this is a hint for the final project, the best judge of output quality is actually you. So if you want to do a final project in natural language generation, you should look at the model outputs yourself, and don't just rely on the numbers reported by BLEU score or similar metrics.
Cool. So finally, we will discuss ethical considerations in natural language generation. As language models get better and better, ethical considerations become much more pressing. We want to ensure that models are well aligned with human values: for example, that they are not harmful or toxic, and that they are unbiased and fair to all demographic groups.
For example, we don't want the model to generate any harmful content. Basically, I tried to prompt ChatGPT to write me some toxic content, and ChatGPT politely refused, which I'm quite happy about. But other people have tried to jailbreak ChatGPT. I think internally they implement some detection tools, so that when you prompt it adversarially, it refuses to do the adversarial thing.
But there are many complicated ways to prompt ChatGPT so that you can get around that safeguard and still get it to generate harmful content. Another problem with these large language models is that they are not necessarily truthful. For example, there was very famous news coverage of Google's model generating a factual error, which is quite disappointing.
But the way the model states it is very convincing, so you wouldn't really know it's a factual error unless you went and checked that this was not actually the first picture, or whatever the claim was. We want to avoid this type of problem. The big models have already been trying very hard to refrain from generating harmful content.
But for models that are smaller and more open-source, the same problems still appear. And typically, for final projects or when we work with models ourselves, we're probably going to deal with much smaller models, so we need to think about ways to handle these problems better.
So text generation models are often constructed from pre-trained language models. And then pre-trained language models are trained on internet data, which contains lots of harmful stuff and bias. So when the models are prompted for this information, they will just repeat the negative stereotypes that they learn from the internet training data.
One way to avoid this is to do extensive data cleaning so that the pre-training data does not contain biased or stereotypical content. However, this is very labor-intensive and close to impossible in practice, because filtering such a large amount of internet data is extremely costly.
Even with existing language models like GPT-2 Medium, there are adversarial inputs that almost always trigger toxic content, and these models might be exploited in the real world by ill-intentioned people. For example, there is a paper on universal adversarial triggers, where the authors find a universal set of words that reliably triggers toxic content from the model.
And sometimes, even if you don't try to trigger the model, it might still start generating toxic content by itself. In this case, the pre-trained language models are prompted with very innocuous prompts, but they still degenerate into toxic content. So the takeaway is that models really shouldn't be deployed without proper safeguards to control for toxic or otherwise harmful content.
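As an illustrative (and deliberately simplistic) sketch of what such a safeguard might look like, here is a wrapper that runs a toxicity classifier over generated text before returning it. Here generate_fn and toxicity_classifier are placeholders for whatever generation function and classifier you have, and the threshold and fallback message are arbitrary choices.

def safe_generate(generate_fn, prompt, toxicity_classifier, threshold=0.5):
    text = generate_fn(prompt)
    # toxicity_classifier is assumed to return a score in [0, 1],
    # where higher means more likely to be toxic.
    if toxicity_classifier(text) >= threshold:
        # Withhold output that the classifier flags as likely toxic.
        return "[output withheld by safety filter]"
    return text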
And models should not be deployed without careful consideration of how users will interact with them. So in the ethics section, one major takeaway is that you need to think carefully about the model you are building. Before deploying or publishing any NLG model, please check that the model's output is not harmful,
and please check that the model is robust to trigger words and other adversarial prompts. And of course, there is more; basically, one can never do enough to improve the ethics of text generation systems. OK, cool. I still have three minutes left, so I can do some concluding thoughts.
Today, we talked about the exciting applications of natural language generation systems. But one might think: given that ChatGPT is already so good, is there any research left to do? If you try interacting with these models, you can see that there are still lots of limitations in their skills and performance.
For example, ChatGPT can do a lot when it comes to manipulating text, but it can't really create genuinely interesting content or think deeply about things. So there is a lot of headroom, and there are still many improvements ahead. And evaluation remains a really huge challenge in natural language generation.
Basically, we need better ways to automatically evaluate the performance of NLG models, because human evaluations are expensive and not reproducible; it would be better to distill all those human judgments into a reliable and trustworthy model. Also, with the advance of large-scale language models, the landscape of neural natural language generation has been reset.
It's never been easier to jump into this space, because all the tools are already there for you to build on. And finally, it is one of the most exciting and fun areas of NLP to work on. So yeah, I'm happy to chat more about NLG if you have any questions, both after class and in class, I guess, in one minute.
OK, cool. That's everything. So do you have any questions? If you don't, we can end the class.