
How to Decode Outputs From NLP Models (Python)


Chapters

0:00 Introduction
1:17 Generate Method
3:32 Random Sampling Method
4:41 Random Sampling
7:09 Beam Search Explained
7:41 Adding Beam Search
8:29 Temperature

Whisper Transcript

00:00:00.000 | Hi, welcome to this video on decoding methods for text generation.
00:00:03.840 | So when we're performing a lot of different NLP tasks,
00:00:09.000 | a lot of people focus very much on the actual model,
00:00:13.600 | which is fair enough. The model is very important,
00:00:16.800 | but a lot of people, at least when they're self-learning,
00:00:20.160 | miss the fact that you can make a big impact on your output,
00:00:24.360 | depending on how you're decoding your text or your tokens at the end of your
00:00:29.320 | model. What I mean by that is you have your model here,
00:00:33.000 | and then when you're outputting that,
00:00:35.840 | you typically have a probability distribution across your
00:00:40.040 | vocabulary or output vocabulary.
00:00:42.000 | And it is that probability distribution that allows you to choose what token
00:00:47.000 | you're going to output.
00:00:47.960 | So we're going to focus on a few different ways of doing that because there are
00:00:51.520 | quite a few different ways.
00:00:52.560 | The three ways that we're going to look at include greedy decoding,
00:00:58.880 | random sampling, and beam search.
00:01:01.560 | Generally, I think once you know these three, you're pretty good to go,
00:01:06.880 | and you can make a really big difference,
00:01:08.680 | as you'll see in this video, in the actual outputs of your models.
00:01:13.520 | So let's just jump straight into it. Here,
00:01:17.840 | we just have the setup. So we have our model setup,
00:01:21.400 | and it's just generating text based on this input here,
00:01:26.200 | which is just an extract from the Wikipedia page about Winston Churchill.
00:01:30.880 | So here,
00:01:33.000 | we've encoded our input, and all we need
00:01:37.920 | to do to actually generate our output from this is assign
00:01:42.920 | the generated tokens to outputs.
00:01:45.520 | And we just call the generate method on our model.
00:01:50.400 | And then we pass in the inputs.
00:01:53.560 | And then we just pass in the length that we want to use.
00:01:56.360 | So the number of tokens that we actually want to output.
00:01:58.960 | So I'm just going to go 200. Okay.
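A minimal sketch of the setup described above, assuming a GPT-2 style causal language model loaded through Hugging Face transformers; the checkpoint name and input text here are stand-ins, not necessarily the exact ones used in the video:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint is an assumption; any causal LM from the Hub works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A short extract standing in for the Winston Churchill passage used as input.
text = "Winston Churchill was a British statesman who served as Prime Minister"
inputs = tokenizer(text, return_tensors="pt")

# generate() uses greedy decoding by default; max_length caps the total
# number of tokens (input plus generated).
outputs = model.generate(**inputs, max_length=200)
```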
00:02:04.280 | And then to actually read that we need to decode it.
00:02:07.800 | Okay. And that is our text.
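Continuing the sketch above, turning the predicted token IDs back into readable text:

```python
# tokenizer.decode maps token IDs back to a string; this is separate from
# the decoding *methods* (greedy, sampling, beam search) discussed here.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```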
00:02:17.520 | So I just want to point out here that when we're calling this decode method,
00:02:22.760 | that is not when we are using the decoding methods that this video is about;
00:02:27.360 | that is just decoding using our tokenizer.
00:02:31.440 | So it's taking the token IDs that we have, our predicted token IDs, and decoding
00:02:36.360 | them into English.
00:02:37.400 | The decoding methods that we're talking about take place at the end of this
00:02:42.240 | generate method here. So by default,
00:02:45.280 | this is greedy decoding, and greedy decoding is,
00:02:49.200 | I would say, the simplest approach. It's the one that you would probably
00:02:53.240 | come up with if you tried to figure this out through intuition.
00:02:56.640 | So obviously we have our probability distribution and we have our output
00:03:02.520 | vocabulary and each value within that output distribution
00:03:07.400 | represents a mapping to one of our
00:03:12.680 | words or tokens.
00:03:13.920 | What greedy decoding does is simply choose the token that has the highest
00:03:18.840 | probability, which obviously makes a lot of sense.
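As a toy illustration of that choice, with made-up logits over a tiny five-token vocabulary:

```python
import torch

# Hypothetical next-token logits over a tiny five-token vocabulary.
logits = torch.tensor([1.2, 3.1, 0.4, 2.2, 0.9])
probs = torch.softmax(logits, dim=-1)

# Greedy decoding: always take the single most probable token.
next_token_id = torch.argmax(probs).item()  # -> 1, the highest-scoring token
```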
00:03:21.200 | And for shorter sequences, this is perfectly fine.
00:03:25.960 | But then when we start applying this to longer sequences,
00:03:30.160 | it causes a lot of issues.
00:03:31.560 | If you've ever seen an output that looks like this,
00:03:35.480 | where you have a lot of repetition,
00:03:38.520 | it just keeps repeating itself over and over again.
00:03:41.080 | This is typically because of greedy decoding.
00:03:44.480 | So when we're looking at longer sequences,
00:03:47.960 | what we will find is that it creates kind of like a loop.
00:03:52.000 | So it sees
00:03:55.120 | the best match for one word, and then it finds the best match for
00:04:00.040 | the next word.
00:04:01.360 | And then as soon as it sees that first word again as the next best match,
00:04:06.360 | it's just going to create a loop.
00:04:08.080 | So that's why we want to avoid greedy decoding for longer
00:04:13.920 | sequences. So the next option we have is random sampling.
00:04:18.040 | Like before,
00:04:19.640 | we still have our outputs and we have that probability distribution across all
00:04:24.240 | of those.
00:04:25.080 | And what random sampling does differently is it chooses
00:04:29.360 | one of these tokens based on that probability at random.
00:04:34.120 | So what I mean by that is it will choose a token at random.
00:04:38.800 | So it can choose any of those,
00:04:40.880 | but the probability of choosing a token is
00:04:45.160 | weighted by the probability of that token being the next token
00:04:49.800 | according to the model.
00:04:50.880 | It's very similar to greedy decoding,
00:04:55.480 | but it just adds that sort of layer of randomness where every now and again,
00:05:00.320 | it's going to choose something other than the most probable token.
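A toy sketch of that weighted random choice, again with made-up probabilities:

```python
import torch

# Hypothetical next-token probabilities over a tiny five-token vocabulary.
probs = torch.tensor([0.50, 0.25, 0.15, 0.07, 0.03])

# Random sampling: draw one token ID, weighted by its probability, so the
# most likely token is favoured but any token can be chosen.
next_token_id = torch.multinomial(probs, num_samples=1).item()
```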
00:05:04.560 | And all we do to actually use random sampling is
00:05:09.760 | add a do_sample argument here and set it to
00:05:14.360 | true.
00:05:15.200 | And this will switch our decoding method from
00:05:20.160 | greedy decoding to random sampling.
00:05:22.160 | And we will see that the output of this is not as repetitive.
00:05:26.920 | So if we just decode this here. Okay.
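In the sketch from earlier, this is one extra argument:

```python
# do_sample=True switches generate() from greedy decoding to random sampling.
outputs = model.generate(**inputs, max_length=200, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```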
00:05:32.840 | And we can see that it's a lot more random.
00:05:37.920 | So this solves our problem of getting stuck in a repeating loop of the same
00:05:42.480 | words over and over again,
00:05:43.600 | because we just add this randomness to our predictions. However,
00:05:47.800 | this does introduce another problem,
00:05:50.400 | which is we will often find that the outputs we are
00:05:55.040 | producing are too random and they're just not coherent because of that.
00:05:59.800 | This one is reasonably coherent;
00:06:02.720 | it's talking about UK politics.
00:06:07.200 | Sometimes it gets super weird.
00:06:08.640 | It will start talking about chameleons and game sales and stuff like that
00:06:13.880 | within a few sentences of when it was actually talking about Churchill.
00:06:18.440 | So on one side, we have greedy search,
00:06:20.920 | which is too strict and it just causes a loop of the same
00:06:25.800 | words over and over again. And on the other side, we have random sampling,
00:06:29.920 | which is too random and ends up just producing
00:06:34.960 | gibberish almost all the time.
00:06:36.920 | So what we need is something in the middle,
00:06:41.520 | and that is why we use beam search.
00:06:45.320 | So beam search allows us to explore multiple levels of
00:06:51.200 | our output before selecting the best option.
00:06:54.400 | Whereas greedy decoding and random sampling are just looking at the very next
00:06:59.960 | token and then calculating which one to choose,
00:07:03.680 | What beam search does is it goes ahead in time and it searches through multiple
00:07:08.560 | potential paths and then finds the best option based on
00:07:13.120 | the full sequence rather than just the next token.
00:07:17.320 | And this just allows us to assess multiple different options
00:07:23.160 | and assess a longer stretch of the sequence than just one token,
00:07:25.840 | which means typically we're going to see more coherent language from these
00:07:30.880 | outputs as well. And the "beam" in beam search is
00:07:35.000 | just referring to the number of paths that we assess
00:07:40.320 | and consider. To add beam search,
00:07:43.680 | all we need to do is add the number of beams argument.
00:07:46.840 | And we just set that to a value that is more than one because a beam of one is
00:07:51.560 | actually just the default. So if we set this to two,
00:07:55.760 | we then search two different beams or two different options
00:08:00.960 | at once. And this usually is pretty good.
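Again as a sketch on top of the earlier setup:

```python
# num_beams > 1 enables beam search; here two candidate sequences are
# tracked at each step and the most probable full sequence is returned.
outputs = model.generate(**inputs, max_length=200, num_beams=2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```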
00:08:05.240 | However,
00:08:06.040 | because we're now back to ranking sequences and selecting the most
00:08:10.720 | probable, beam search can quite easily fall back into
00:08:15.680 | this repetitive nature that we get with greedy decoding.
00:08:19.040 | So what we need to do to counteract that is add
00:08:24.920 | the temperature argument to our code.
00:08:28.280 | So the temperature essentially controls the amount of randomness
00:08:33.160 | within the beam search choice.
00:08:36.680 | And by default, this is set to one.
00:08:39.400 | We can add more randomness to the output by increasing this.
00:08:45.480 | So say to 1.2 and this will just make the outputs more
00:08:50.200 | random, but because we're using beam search,
00:08:53.800 | it will still remain reasonably coherent unless we really turn the
00:08:58.440 | temperature up to a very high number, like five or something.
00:09:01.880 | And if we want to reduce the randomness in our outputs,
00:09:04.960 | we just reduce the temperature to something below one,
00:09:08.800 | so 0.8, for example.
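A sketch of this, building on the earlier setup. One caveat not mentioned in the video: current transformers versions only apply temperature when do_sample=True, so sampling is enabled here alongside beam search:

```python
# Temperature rescales the distribution before sampling: values above 1.0
# add randomness, values below 1.0 reduce it (1.0 is the default).
outputs = model.generate(
    **inputs,
    max_length=200,
    num_beams=2,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=1.2,  # try 0.8 for a more conservative output
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```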
00:09:13.560 | So with this, we have our beam search,
00:09:14.880 | and we have some good outputs. They're somewhat peculiar,
00:09:19.040 | but pretty coherent, so it's pretty good. I mean,
00:09:21.960 | I think that's everything for this video.
00:09:24.240 | I think these three decoding methods are pretty important to know and they can
00:09:28.880 | make a big difference in the quality of your outputs.
00:09:31.040 | So thank you very much for watching and I will see you again next time.