How to Decode Outputs From NLP Models (Python)
Chapters
0:00 Introduction
1:17 Generate Method
3:32 Random Sampling Method
4:41 Random Sampling
7:09 Beam Search Explained
7:41 Adding Beam Search
8:29 Temperature
Hi, welcome to this video on decoding methods for text generation. When we're performing a lot of different NLP tasks, a lot of people focus very much on the actual model, which is fair enough. The model is very important. But a lot of people, at least when they're self-learning, miss the fact that you can make a big impact on your output depending on how you're decoding your text, or your tokens, at the end of your model. What I mean by that is you have your model here, and at the end of it
you typically have a probability distribution across your entire vocabulary. And it is that probability distribution that allows you to choose which token to output next. So we're going to focus on a few different ways of doing that, because there are several. The three ways that we're going to look at are greedy decoding, random sampling, and beam search. Generally, I think once you know these three, you're pretty good to go, and you'll see in this video the difference they make in the actual outputs of your models.
To start, we just have the setup. So we have our model set up, and it's just generating text based on this input here, which is just an extract from the Wikipedia page about Winston Churchill. All we need to do to actually generate our outputs is call the generate method on our model, and then we just pass in the length that we want to use, so the number of tokens that we actually want to output. And then, to actually read that, we need to decode it.
I just want to point out here that when we're calling this decode method, that is not when we are using the decoding methods that this video is about. That is just decoding using our tokenizer, so taking the predicted token IDs that we have and decoding them back into human-readable text.
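As a concrete reference, here is a minimal sketch of that setup and generate-then-decode flow. The transcript doesn't name the exact model, so GPT-2 via Hugging Face transformers and the placeholder input text are assumptions:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# assumed model: GPT-2 from Hugging Face (the video doesn't name its model)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# placeholder for the extract from the Wikipedia page about Winston Churchill
text = "Winston Churchill was a British statesman who served as Prime Minister"
inputs = tokenizer.encode(text, return_tensors='pt')

# generate with the default (greedy) decoding; max_length caps total tokens
outputs = model.generate(inputs, max_length=200)

# this decode step only maps token IDs back to text via the tokenizer;
# it is not one of the decoding methods this video is about
print(tokenizer.decode(outputs[0]))
```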
The decoding methods that we're talking about take place at the end of the model, inside that generate step. The first of these is greedy decoding. Greedy decoding is, I would say, the simplest approach, and it's the one that you would probably think of when you try to figure this out through intuition.
So obviously we have our probability distribution over our output vocabulary, and each value within that output distribution corresponds to one token in the vocabulary. What greedy decoding does is simply choose the token that has the highest probability, which obviously makes a lot of sense.
And for shorter sequences, this is perfectly fine. But when we start applying this to longer sequences, it begins to cause problems. If you've ever seen an output that just keeps repeating itself over and over again, this is typically because of greedy decoding.
With greedy decoding, what we will find is that it creates kind of a loop. The model chooses the best match for one word, then the best match for the next word, and as soon as it sees that first word again as the next best match, it gets stuck repeating the same cycle. So that's why we want to avoid greedy decoding for longer sequences.
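Conceptually, greedy decoding is just an argmax over the next-token distribution at every step. A minimal sketch, assuming the model and tokenizer from the setup above:

```python
import torch

# start from a short prompt and greedily extend it token by token
generated = tokenizer.encode("Winston Churchill was", return_tensors='pt')
for _ in range(20):
    with torch.no_grad():
        logits = model(generated).logits    # shape: (1, seq_len, vocab_size)
    next_token = logits[0, -1].argmax()     # highest-probability next token
    generated = torch.cat([generated, next_token.view(1, 1)], dim=-1)

print(tokenizer.decode(generated[0]))
```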
So the next option we have is random sampling. With random sampling, we still have our outputs, and we have that probability distribution across all of the tokens in our vocabulary.
What random sampling does differently is choose one of these tokens based on that probability, at random. What I mean by that is it will choose a token at random, but the probability of choosing that token is weighted by the probability of that token being the next token. So most of the time it will still pick a likely token; it just adds that layer of randomness where, every now and again, it's going to choose something other than the most probable token.
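In other words, it samples from the distribution rather than taking the argmax. A toy sketch of that weighted draw (the five-token vocabulary here is purely illustrative):

```python
import torch

# toy next-token logits over a five-token vocabulary
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])

# softmax turns logits into probabilities that sum to one
probs = torch.softmax(logits, dim=-1)

# draw one token index, weighted by those probabilities
next_token = torch.multinomial(probs, num_samples=1)
print(next_token.item())
```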
And all we do to actually use random sampling is add a do_sample argument here and set it to True. This will switch our decoding method from greedy decoding to random sampling, and we will see that the output of this is not as repetitive.
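Continuing from the setup sketch above (reusing its model, tokenizer, and inputs), that is a one-flag change:

```python
# do_sample=True switches from greedy decoding to random sampling
outputs = model.generate(inputs, max_length=200, do_sample=True)
print(tokenizer.decode(outputs[0]))
```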
So this solves our problem of getting stuck in a repeating loop of the same words, because we just add this randomness to our predictions. However, it introduces a new problem, which is that we will often find the outputs we are producing are too random, and they're just not coherent because of that.
It will start talking about chameleons and game sales and stuff like that within, you know, a few sentences of where it was actually talking about Churchill.
So on one side we have greedy decoding, which is too strict and just causes a loop of the same words over and over again. And on the other side we have random sampling, which is too random and ends up just producing incoherent text. What we want is something in between, and that's where beam search comes in.
So beam search allows us to explore multiple levels of our sequence at once. Whereas greedy decoding and random sampling just look at the very next token and calculate which one to choose, beam search looks ahead and searches through multiple potential paths, then finds the best option based on the full sequence rather than just the next token. This allows us to assess multiple different options, over a longer stretch than just one token, which means typically we're going to see more coherent language in these outputs as well. And the beam in beam search, all that's referring to is the number of paths that we assess in parallel, which we control with the num_beams argument. We just set that to a value that is more than one, because a beam of one is actually just the default. So if we set this to two, we then search two different beams, or two different options, at each step.
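Again extending the earlier sketch, adding beam search is one more argument to generate:

```python
# num_beams=2 keeps the two most probable candidate sequences at each step
# (num_beams=1, the default, is equivalent to greedy decoding)
outputs = model.generate(inputs, max_length=200, num_beams=2)
print(tokenizer.decode(outputs[0]))
```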
However, because we're now back to ranking sequences and selecting the most probable, beam search can quite easily fall back into the repetitive behavior that we get with greedy decoding.
So what we need to do to counteract that is add some randomness back in, using the temperature. The temperature essentially controls the amount of randomness in the output.
We can add more randomness to the output by increasing this, say to 1.2, and this will just make the outputs more varied. It will still remain reasonably coherent unless we really turn the temperature up to a very high number, like five or something.
And if we want to reduce the randomness in our outputs, we just reduce the temperature to something below one, so 0.8, for example.
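Putting it together, a temperature-adjusted version of the beam search call might look like the sketch below. Note that temperature only has an effect when sampling is enabled, so including do_sample=True alongside num_beams is an assumption about the full call:

```python
# sample within the beams, with temperature scaling the logits:
# above 1.0 flattens the distribution (more random), below 1.0 sharpens it
outputs = model.generate(inputs, max_length=200, num_beams=2,
                         do_sample=True, temperature=1.2)
print(tokenizer.decode(outputs[0]))
```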
With this, we have some good outputs, somewhat peculiar ones, but they're pretty coherent. So it's pretty good.
I think these three decoding methods are pretty important to know, and they can make a big difference in the quality of your outputs. So thank you very much for watching, and I will see you again next time.