Language Generation with OpenAI's GPT-2 in Python
00:00:00.000 |
Hi and welcome to the video. We're going to go through language generation using GPT-2. 00:00:05.920 |
Now this is actually incredibly easy to do and we can build this entire model including the 00:00:13.760 |
imports, the tokenizer and model, and outputting our generated text, with just seven lines of code, 00:00:21.600 |
which is pretty insane. Now the only libraries we need for this are PyTorch and Transformers. 00:00:30.960 |
Now all we need from the Transformers library are the GPT-2 00:00:49.200 |
LM head model and GPT-2 tokenizer, so we can initialize both of those as well now. 00:01:04.240 |
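As a rough sketch, assuming the base 'gpt2' checkpoint (the video may use a different model size), the setup looks something like this; the later snippets reuse these names:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pretrained GPT-2 tokenizer and language-modeling head
# ('gpt2' is the small base checkpoint; larger variants also exist)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
```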
So now we have initialized our tokenizer and model. 00:01:23.520 |
We just need a sequence of text to feed in and get our model going. 00:01:30.400 |
So I've taken a snippet of text from the Wikipedia page of Winston Churchill, which is here. 00:01:39.200 |
And it's just a small little snippet talking about when he took office during World War II. 00:01:48.160 |
Now from this, I've tested it briefly and it seems to give some pretty interesting results. 00:01:53.600 |
So we will go ahead and use this; all we need to do is tokenize it. 00:01:59.760 |
Now all we're doing here is taking each of these words, splitting them into tokens, 00:02:13.120 |
so that would be a list where each word is its own item. So, for "he began his premiership", 00:02:21.760 |
Each one of those would be a separate value within that list. 00:02:25.600 |
Once we have them in that tokenized format, our tokenizer will then convert them into numerical 00:02:34.320 |
IDs, which map to word vectors that have been trained to work with GPT-2. 00:02:41.840 |
Now, because we're using PyTorch, we just need to remember to return PyTorch ('pt') tensors here. 00:02:48.080 |
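A minimal sketch of that encoding step; the `sequence` string here is just a placeholder, since the exact Churchill snippet isn't reproduced in the transcript:

```python
# Placeholder for the snippet from Winston Churchill's Wikipedia page
sequence = "He began his premiership ..."

# Tokenize and return the token IDs as a PyTorch tensor
inputs = tokenizer.encode(sequence, return_tensors='pt')
```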
So now we have our inputs, we just need to feed them into our model. 00:03:05.680 |
And we add our inputs. Now, we also need to tell the generate method how long we want our generated 00:03:15.520 |
sequence to be. So all we do for that is add a max length. And this will act as the cutoff point, 00:03:24.400 |
anything longer than this will simply be cut off. 00:03:31.360 |
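As a sketch, that call might look like the following; 200 is an assumed value for the max length, not necessarily the one used in the video:

```python
# Generate a continuation; anything beyond max_length is cut off
outputs = model.generate(inputs, max_length=200)
```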
And now here we are just generating our output. We also need to pass this into the outputs 00:03:39.760 |
variable here, so that we can actually read from it and decode it. So to decode our output IDs, 00:03:50.560 |
because it will output numerical IDs representing words, just like we fed into it, we need to use 00:03:57.520 |
the tokenizer decode method. And our output IDs are in the zero index of the outputs object. 00:04:11.520 |
And we also want to skip any special tokens. So this would be stuff like end of sequence 00:04:19.760 |
tokens, padding tokens, unknown word tokens, and so on. 00:04:23.600 |
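That decoding step, as a minimal sketch:

```python
# Decode the generated IDs back into text, skipping special tokens
# such as end-of-sequence, padding, and unknown-token markers
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```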
And then we can print the text. Now, we can see here that it's basically just 00:04:33.520 |
going over and over again, saying the same things, which is not really what we want. 00:04:37.840 |
So this is a pretty common problem. And all we need to do 00:04:42.080 |
to fix this is add another argument to our generate method here. 00:04:48.160 |
So we simply add do_sample equals True. And then we can rerun this. 00:04:57.840 |
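In code, that's just one extra argument to generate (a sketch, reusing the assumed max_length from before):

```python
# Sample from the distribution instead of greedy decoding,
# which breaks up the repetitive output
outputs = model.generate(inputs, max_length=200, do_sample=True)
```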
So we can add more randomness and restrict the number of possible tokens for the model to use, 00:05:07.840 |
using the temperature and top k parameters, respectively. 00:05:14.800 |
Now, temperature controls the amount of randomness in the model's sampling. So a high temperature above 00:05:22.080 |
one will create more random tokens than the default. Anything below one makes the model 00:05:29.200 |
less random. So say if we put a stupidly high number, like five, we will probably get a pretty 00:05:35.600 |
weird output. Okay, so we can see here, initially skimming over, it doesn't look too bad. But then 00:05:43.200 |
when you start reading it, it's practically impossible to follow. There's no structure, 00:05:48.720 |
and there's just a couple of random words in there that are just completely irrelevant. 00:05:54.400 |
Now, we can also see here, there's an end bracket, and there's a starting bracket that pairs with it. 00:06:00.560 |
And generally, it's just some really weird syntax. So we turn the temperature down, 00:06:09.200 |
maybe to 0.7, and we will actually decrease the randomness relative to the original model. 00:06:16.400 |
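A sketch of the call with the temperature argument added (0.7 as in the video; a stupidly high value like 5.0 reproduces the gibberish seen a moment ago):

```python
# Temperature below 1.0 sharpens the distribution (less random);
# values well above 1.0, like 5.0, flatten it and produce gibberish
outputs = model.generate(inputs, max_length=200, do_sample=True,
                         temperature=0.7)
```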
Now, you can toy around with this and see what produces more interesting results. Generally, 00:06:24.240 |
higher temperature will create more creative outputs. And the other parameter we can also use 00:06:31.360 |
is the top k parameter. Now, top k limits sampling to the k highest-probability tokens that the 00:06:40.160 |
model is predicting at each step. So we can set it to 50, for example, and this will alter our output. Generally, 00:06:48.400 |
I've found top k tends to make the text a little more coherent. And I would assume this is because 00:06:56.000 |
it is sticking within a smaller space of possible tokens or words that it can output. So now here, 00:07:04.080 |
we can see pretty understandable, logical text again. And we can see here, it mentions lord a 00:07:11.920 |
lot, which makes sense because this is Britain. So if we put the temperature back up to 1, 00:07:20.640 |
we should get a slightly more random output again. And then here, we can see that there's 00:07:28.800 |
a little more weird text coming in. So we have here that the first Australian prime minister 00:07:35.120 |
was sacked by a labor minister, which obviously is a little bit strange. But it just shows that 00:07:42.160 |
we can add more randomness, or we can try and restrict our model to become more coherent. 00:07:48.560 |
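Putting the sampling controls together, a final sketch; all of the specific values here are just illustrative:

```python
# do_sample enables sampling, temperature scales the randomness,
# and top_k restricts each step to the 50 most likely tokens
outputs = model.generate(inputs, max_length=200, do_sample=True,
                         temperature=1.0, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```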
And we can do this super easily using the generate parameters. So with just a few lines of code, 00:07:56.480 |
we got our model up and running and actually generating text incredibly easily. So I hope 00:08:03.840 |
this has been insightful and useful. If you have any questions or suggestions, please just let me 00:08:10.560 |
know in the comments below. But thank you for watching, and I will see you next time.