
How-to Decode Outputs From NLP Models (Python)


Chapters

0:00 Introduction
1:17 Generate Method
3:32 Random Sampling Method
4:41 Random Sampling
7:09 Beam Search Explained
7:41 Adding Beam Search
8:29 Temperature

Transcript

Hi, welcome to this video on decoding methods for text generation. When we're performing a lot of different NLP tasks, a lot of people focus very much on the actual model, which is fair enough. The model is very important, but a lot of people, at least when they're self-learning, miss the fact that you can make a big impact on your output depending on how you decode your text, or your tokens, at the end of your model.

What I mean by that is you have your model here, and when it produces an output, you typically have a probability distribution across your vocabulary, or output vocabulary. It is that probability distribution that allows you to choose which token you're going to output. So we're going to focus on a few different ways of doing that, because there are quite a few different ways.
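To make that concrete, here is a minimal sketch, using PyTorch, of what a probability distribution over the vocabulary looks like; the three-token vocabulary and the logit values are made up purely for illustration.

import torch

# The model's final layer produces one score (logit) per vocabulary token;
# softmax turns those scores into probabilities that sum to 1.
logits = torch.tensor([2.0, 1.0, 0.1])   # hypothetical scores for a 3-token vocabulary
probs = torch.softmax(logits, dim=-1)    # roughly tensor([0.66, 0.24, 0.10])
print(probs, probs.sum())                # the decoding method decides which token to pick from this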

The three ways that we're going to look at are greedy decoding, random sampling, and beam search. Generally, I think once you know these three, you're pretty good to go, and you can make a really big difference, as you'll see in this video, in the actual outputs of your models. So let's just jump straight into it.

Here we just have the setup. So we have our model set up, and it's just generating text based on this input here, which is an extract from the Wikipedia page about Winston Churchill. We've encoded our input here, and all we need to do to actually generate our outputs is assign the generated tokens to outputs.
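As a rough sketch of this setup, assuming the Hugging Face transformers library with GPT-2 (the exact model checkpoint and the prompt text below are assumptions for illustration, not taken from the video):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Extract from the Wikipedia page about Winston Churchill (placeholder text here).
text = "Winston Churchill was a British statesman who served as Prime Minister..."
inputs = tokenizer.encode(text, return_tensors="pt")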

We just call the generate method on our model, pass in the inputs, and then pass in the length that we want to use, so the number of tokens that we actually want to output. I'm just going to go with 200. Okay. And then to actually read that, we need to decode it.
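Continuing the sketch, the generate call and the tokenizer decode might look like this (using max_length, one of the standard length arguments for generate):

# Greedy decoding by default; max_length caps the total number of tokens generated.
outputs = model.generate(inputs, max_length=200)

# tokenizer.decode turns the predicted token IDs back into readable English.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))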

Okay. And that is our text. I just want to point out here that when we're calling this decode method, that is not where the decoding methods this video is about come in; that is just decoding using our tokenizer. It takes the token IDs that we have, our predicted token IDs, and decodes them into English. The decoding methods that we're talking about take place inside this generate method here.

So by default, this is greedy decoding, and greedy decoding is, I would say, the simplest approach; it's the one that you would probably come up with if you tried to figure this out through intuition. So obviously we have our probability distribution and our output vocabulary, and each value within that output distribution maps to one of our words or tokens.

What greedy decoding does is simply choose the token that has the highest probability, which obviously makes a lot of sense. And for shorter sequences, this is perfectly fine. But when we start applying this to longer sequences, it causes a lot of issues. If you've ever seen an output that looks like this, where you have a lot of repetition, it just keeps repeating itself over and over again.
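In miniature, greedy decoding is just an argmax over the next-token distribution; the numbers here are made up:

import torch

probs = torch.tensor([0.1, 0.6, 0.3])        # hypothetical next-token distribution
next_token_id = torch.argmax(probs).item()   # -> 1, always the highest-probability token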

That repetition is typically because of greedy decoding. When we're looking at longer sequences, what we find is that it creates a kind of loop. It sees, you know, the best match for one word, and then it finds the best match for the next word. And as soon as it sees that first word again as the next best match, it's just going to create a loop.

So that's why we want to avoid greedy decoding for longer sequences. So the next option we have is random sampling. Like before, we still have our outputs and we have that probability distribution across all of those. And what random sampling does differently is it chooses one of these tokens based on that probability at random.

So what I mean by that is it will choose a token at random. So it can choose any of those, but the probability of choosing that token is weighted based on the probability of that token being the next token from the model. It's very similar to greedy decoding, but it just adds that sort of layer of randomness where every now and again, it's going to choose something other than the most probable token.
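A minimal sketch of that weighted draw, again with made-up numbers:

import torch

probs = torch.tensor([0.1, 0.6, 0.3])                           # hypothetical next-token distribution
next_token_id = torch.multinomial(probs, num_samples=1).item()  # usually 1, but sometimes 0 or 2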

And all we do to actually use random sampling is add a do_sample argument here and set it to True. This will switch our decoding method from greedy decoding to random sampling, and we will see that the output of this is not as repetitive. So if we just decode this here.
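Continuing the earlier sketch, the only change is the do_sample flag:

# Switch from greedy decoding to random sampling.
outputs = model.generate(inputs, max_length=200, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))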

Okay. And we can see that it's a lot more random. So this solves our problem of getting stuck in a repeating loop of the same words over and over again, because we just add this randomness to our predictions. However, this does introduce another problem: we will often find that the outputs we're producing are too random, and they're just not coherent because of that.

This one is reasonably coherent; it's still talking about UK politics. Sometimes it gets super weird, though. It will start talking about chameleons and game sales and stuff like that within, you know, a few sentences of where it was actually talking about Churchill. So on one side, we have greedy search, which is too strict and just causes a loop of the same words over and over again.

And on the other side, we have random sampling, which is too random and ends up just producing gibberish almost all the time. So what we need is something in the middle, and that is why we use beam search. So beam search allows us to explore multiple levels of our output before selecting the best option.

Whereas greedy decoding and random sampling just look at the very next token and then calculate, you know, which one to choose, what beam search does is look ahead: it searches through multiple potential paths and then finds the best option based on the full sequence, rather than just the next token.

This just allows us to assess multiple different options over a longer span than just one token, which means typically we're going to see more coherent language from these outputs as well. And the beam in beam search, all that's referring to is the number of paths that we assess and consider.
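To illustrate the idea, here is a toy sketch (not the transformers implementation): at each step we extend every beam with every candidate token, score each candidate sequence by its cumulative log-probability, and keep only the top num_beams sequences. The next_token_probs function below is a made-up stand-in for a real language model.

import math

def next_token_probs(tokens):
    # Stand-in for a real model's softmax output over the vocabulary.
    return {"the": 0.5, "a": 0.3, "dog": 0.15, "barks": 0.05}

def beam_search(start, steps=3, num_beams=2):
    # Each beam is (token list, cumulative log-probability).
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, p in next_token_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep only the num_beams best-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    # Return the best full sequence, not just the best next token.
    return beams[0][0]

print(beam_search("<s>"))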

To add beam search, all we need to do is add the num_beams argument, and we just set that to a value that is more than one, because one beam is actually just the default. So if we set this to two, we then search two different beams, or two different options, at once.
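In the sketch from earlier, that's just one more argument:

# num_beams=1 is the default (no beam search); anything above 1 turns it on.
outputs = model.generate(inputs, max_length=200, num_beams=2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))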

And this usually is pretty good. However, because we're now back to ranking sequences and selecting the most probable one, beam search can quite easily fall back into the repetitive behaviour that we get with greedy decoding. So what we need to do to counteract that is add the temperature argument to our code.

The temperature essentially controls the amount of randomness within the beam search choice, and by default this is set to one. We can add more randomness to the output by increasing this, say to 1.2, and this will just make the outputs more random. But because we're using beam search, it will still remain reasonably coherent, unless we really turn the temperature up to a very high number, like five or something.
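Continuing the sketch, one caveat worth flagging: in current versions of transformers, temperature generally only takes effect when sampling is enabled, so the example below keeps do_sample=True alongside the beams.

outputs = model.generate(
    inputs,
    max_length=200,
    do_sample=True,     # temperature is applied during sampling
    num_beams=2,
    temperature=1.2,    # above 1.0 adds randomness; below 1.0 (e.g. 0.8) reduces it
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))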

And if we want to reduce the randomness in our outputs, we just reduce the temperature to something below one, so 0.8, for example. So with this, I mean, we have our beam search, and we have some good outputs: somewhat peculiar, but pretty coherent. So it's pretty good.

I mean, I think that's everything for this video. I think these three decoding methods are pretty important to know and they can make a big difference in the quality of your outputs. So thank you very much for watching and I will see you again next time.