
How to Decode Outputs From NLP Models (Python)


Chapters

0:00 Introduction
1:17 Generate Method
3:32 Random Sampling Method
4:41 Random Sampling
7:09 Beam Search Explained
7:41 Adding Beam Search
8:29 Temperature

Whisper Transcript

00:00:00.000 | Hi, welcome to this video on decoding methods for text generation.
00:00:03.840 | So when we're performing a lot of different NLP tasks,
00:00:09.000 | a lot of people focus very much on the actual model,
00:00:13.600 | which is fair enough. The model is very important,
00:00:16.800 | but a lot of people, at least when they're self-learning,
00:00:20.160 | miss the fact that you can make a big impact on your output,
00:00:24.360 | depending on how you're decoding your text or your tokens at the end of your
00:00:29.320 | model. What I mean by that is you have your model here,
00:00:33.000 | and then when you're outputting that,
00:00:35.840 | you typically have a probability distribution across your
00:00:40.040 | vocabulary or output vocabulary.
00:00:42.000 | And it is that probability distribution that allows you to choose what token
00:00:47.000 | you're going to output.
00:00:47.960 | So we're going to focus on a few different ways of doing that because there are
00:00:51.520 | quite a few different ways.
00:00:52.560 | The three ways that we're going to look at include greedy decoding,
00:00:58.880 | random sampling, and beam search.
00:01:01.560 | Generally, I think once you know these three, you're pretty good to go,
00:01:06.880 | and you can make a really big difference,
00:01:08.680 | as you'll see in this video, in the actual outputs of your models.
00:01:13.520 | So let's just jump straight into it. Here,
00:01:17.840 | we just have the setup. So we have our model setup,
00:01:21.400 | and it's just generating text based on this input here,
00:01:26.200 | which is just an extract from the Wikipedia page about Winston Churchill.
00:01:30.880 | So here,
00:01:33.000 | we've encoded our input, and all we need
00:01:37.920 | to do to actually generate our output from this is assign
00:01:42.920 | the generated tokens to outputs.
00:01:45.520 | And we just call the generate method on our model.
00:01:50.400 | And then we pass in the inputs.
00:01:53.560 | And then we just pass in the length that we want to use.
00:01:56.360 | So the number of tokens that we actually want to output.
00:01:58.960 | So I'm just going to go 200. Okay.
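A minimal sketch of the setup described above, assuming a GPT-2 style causal language model loaded through Hugging Face transformers; the checkpoint name and input text here are stand-ins, not necessarily the exact ones used in the video:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint is an assumption; any causal LM from the Hub works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A short extract standing in for the Winston Churchill passage used as input.
text = "Winston Churchill was a British statesman who served as Prime Minister"
inputs = tokenizer(text, return_tensors="pt")

# generate() uses greedy decoding by default; max_length caps the total
# number of tokens (input plus generated).
outputs = model.generate(**inputs, max_length=200)
```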
00:02:04.280 | And then to actually read that we need to decode it.
00:02:07.800 | Okay. And that is our text.
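Continuing the sketch above, turning the predicted token IDs back into readable text:

```python
# tokenizer.decode maps token IDs back to a string; this is separate from
# the decoding *methods* (greedy, sampling, beam search) discussed here.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```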
00:02:17.520 | So I just want to point out here that when we're calling this decode method,
00:02:22.760 | that is not when we are using the decoding methods that this video is about;
00:02:27.360 | that is just decoding using our tokenizer.
00:02:31.440 | So it's taking the token IDs that we have, our predicted token IDs, and decoding
00:02:36.360 | them into English.
00:02:37.400 | The decoding methods that we're talking about take place at the end of this
00:02:42.240 | generate method here. So by default,
00:02:45.280 | this is greedy decoding, and greedy decoding is,
00:02:49.200 | I would say, the simplest approach. It's the one that you would probably
00:02:53.240 | come up with if you tried to figure this out through intuition.
00:02:56.640 | So obviously we have our probability distribution and we have our output
00:03:02.520 | vocabulary and each value within that output distribution
00:03:07.400 | represents a mapping to one of our
00:03:12.680 | words or tokens.
00:03:13.920 | What greedy decoding does is simply choose the token that has the highest
00:03:18.840 | probability, which obviously makes a lot of sense.
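As a toy illustration of that choice, with made-up logits over a tiny five-token vocabulary:

```python
import torch

# Hypothetical next-token logits over a tiny five-token vocabulary.
logits = torch.tensor([1.2, 3.1, 0.4, 2.2, 0.9])
probs = torch.softmax(logits, dim=-1)

# Greedy decoding: always take the single most probable token.
next_token_id = torch.argmax(probs).item()  # -> 1, the highest-scoring token
```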
00:03:21.200 | And for shorter sequences, this is perfectly fine.
00:03:25.960 | But then when we start applying this to longer sequences,
00:03:30.160 | it causes a lot of issues.
00:03:31.560 | If you've ever seen an output that looks like this,
00:03:35.480 | where you have a lot of repetition,
00:03:38.520 | it just keeps repeating itself over and over again.
00:03:41.080 | This is typically because of greedy decoding.
00:03:44.480 | So when we're looking at longer sequences,
00:03:47.960 | what we will find is that it creates kind of like a loop.
00:03:52.000 | So it sees
00:03:55.120 | the best match for one word, and then it finds the best match for
00:04:00.040 | the next word.
00:04:01.360 | And then as soon as it sees that first word again as the next best match,
00:04:06.360 | it's just going to create a loop.
00:04:08.080 | So that's why we want to avoid greedy decoding for longer
00:04:13.920 | sequences. So the next option we have is random sampling.
00:04:18.040 | Like before,
00:04:19.640 | we still have our outputs and we have that probability distribution across all
00:04:24.240 | of those.
00:04:25.080 | And what random sampling does differently is it chooses
00:04:29.360 | one of these tokens based on that probability at random.
00:04:34.120 | So what I mean by that is it will choose a token at random.
00:04:38.800 | So it can choose any of those,
00:04:40.880 | but the probability of choosing a token is
00:04:45.160 | weighted by the probability of that token being the next token
00:04:49.800 | according to the model.
00:04:50.880 | It's very similar to greedy decoding,
00:04:55.480 | but it just adds that sort of layer of randomness where every now and again,
00:05:00.320 | it's going to choose something other than the most probable token.
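A toy sketch of that weighted random choice, again with made-up probabilities:

```python
import torch

# Hypothetical next-token probabilities over a tiny five-token vocabulary.
probs = torch.tensor([0.50, 0.25, 0.15, 0.07, 0.03])

# Random sampling: draw one token ID, weighted by its probability, so the
# most likely token is favoured but any token can be chosen.
next_token_id = torch.multinomial(probs, num_samples=1).item()
```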
00:05:04.560 | And all we do to actually use random sampling is
00:05:09.760 | add a do_sample argument here and set it to
00:05:14.360 | true.
00:05:15.200 | And this will switch our decoding method from
00:05:20.160 | greedy decoding to random sampling.
00:05:22.160 | And we will see that the output of this is not as repetitive.
00:05:26.920 | So if we just decode this here. Okay.
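In the sketch from earlier, this is one extra argument:

```python
# do_sample=True switches generate() from greedy decoding to random sampling.
outputs = model.generate(**inputs, max_length=200, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```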
00:05:32.840 | And we can see that it's a lot more random.
00:05:37.920 | So this solves our problem of getting stuck in a repeating loop of the same
00:05:42.480 | words over and over again,
00:05:43.600 | because we just add this randomness to our predictions. However,
00:05:47.800 | this does introduce another problem,
00:05:50.400 | which is we will often find that the outputs we are
00:05:55.040 | producing are too random and they're just not coherent because of that.
00:05:59.800 | This one is reasonably coherent;
00:06:02.720 | it's talking about UK politics.
00:06:07.200 | Sometimes it gets super weird.
00:06:08.640 | It will start talking about chameleons and game sales and stuff like that
00:06:13.880 | within a few sentences of when it was actually talking about Churchill.
00:06:18.440 | So on one side, we have greedy search,
00:06:20.920 | which is too strict and it just causes a loop of the same
00:06:25.800 | words over and over again. And on the other side, we have random sampling,
00:06:29.920 | which is too random and ends up just producing
00:06:34.960 | gibberish almost all the time.
00:06:36.920 | So what we need is something in the middle,
00:06:41.520 | and that is why we use beam search.
00:06:45.320 | So beam search allows us to explore multiple levels of
00:06:51.200 | our output before selecting the best option.
00:06:54.400 | Whereas greedy decoding and random sampling are just looking at the very next
00:06:59.960 | token and then calculating which one to choose,
00:07:03.680 | What beam search does is it goes ahead in time and it searches through multiple
00:07:08.560 | potential paths and then finds the best option based on
00:07:13.120 | the full sequence rather than just the next token.
00:07:17.320 | And this just allows us to assess multiple different options
00:07:23.160 | and assess a longer stretch of the sequence than just one token,
00:07:25.840 | which means typically we're going to see more coherent language from these
00:07:30.880 | outputs as well. And the "beam" in beam search is
00:07:35.000 | just referring to the number of paths that we assess
00:07:40.320 | and consider. To add beam search,
00:07:43.680 | all we need to do is add the number of beams argument.
00:07:46.840 | And we just set that to a value that is more than one because a beam of one is
00:07:51.560 | actually just the default. So if we set this to two,
00:07:55.760 | we then search two different beams or two different options
00:08:00.960 | at once. And this usually is pretty good.
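Again as a sketch on top of the earlier setup:

```python
# num_beams > 1 enables beam search; here two candidate sequences are
# tracked at each step and the most probable full sequence is returned.
outputs = model.generate(**inputs, max_length=200, num_beams=2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```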
00:08:05.240 | However,
00:08:06.040 | because we're now back to ranking sequences and selecting the most
00:08:10.720 | probable, beam search can quite easily fall back into
00:08:15.680 | this repetitive nature that we get with greedy decoding.
00:08:19.040 | So what we need to do to counteract that is add
00:08:24.920 | the temperature argument to our code.
00:08:28.280 | So the temperature essentially controls the amount of randomness
00:08:33.160 | within the beam search choice.
00:08:36.680 | And by default, this is set to one.
00:08:39.400 | We can add more randomness to the output by increasing this.
00:08:45.480 | So say to 1.2 and this will just make the outputs more
00:08:50.200 | random, but because we're using beam search,
00:08:53.800 | it will still remain reasonably coherent unless we really turn the
00:08:58.440 | temperature up to a very high number, like five or something.
00:09:01.880 | And if we want to reduce the randomness in our outputs,
00:09:04.960 | we just reduce the temperature to something below one,
00:09:08.800 | so 0.8, for example.
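A sketch of this, building on the earlier setup. One caveat not mentioned in the video: current transformers versions only apply temperature when do_sample=True, so sampling is enabled here alongside beam search:

```python
# Temperature rescales the distribution before sampling: values above 1.0
# add randomness, values below 1.0 reduce it (1.0 is the default).
outputs = model.generate(
    **inputs,
    max_length=200,
    num_beams=2,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=1.2,  # try 0.8 for a more conservative output
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```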
00:09:13.560 | So with this, we have our beam search,
00:09:14.880 | and we have some good outputs. They're somewhat peculiar,
00:09:19.040 | but pretty coherent, so it's pretty good. I mean,
00:09:21.960 | I think that's everything for this video.
00:09:24.240 | I think these three decoding methods are pretty important to know and they can
00:09:28.880 | make a big difference in the quality of your outputs.
00:09:31.040 | So thank you very much for watching and I will see you again next time.