Language Generation with OpenAI's GPT-2 in Python
00:00:00.000 |
Hi and welcome to the video. We're going to go through language generation using GPT-2. 00:00:05.920 |
Now this is actually incredibly easy to do and we can build this entire model including the 00:00:13.760 |
imports, the tokenizer and model, and outputting our generated text, with just seven lines of code, 00:00:21.600 |
which is pretty insane. Now the only libraries we need for this are PyTorch and Transformers. 00:00:30.960 |
Now all we need from the Transformers library are the GPT-2 00:00:49.200 |
LM head model and GPT-2 tokenizer, so we can initialize both of those as well now. 00:01:04.240 |
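As a rough sketch, assuming the base 'gpt2' checkpoint (the video may use a different model size), the setup looks something like this; the later snippets reuse these names:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pretrained GPT-2 tokenizer and language-modeling head
# ('gpt2' is the small base checkpoint; larger variants also exist)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
```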
So now we have initialized our tokenizer and model. 00:01:23.520 |
We just need a sequence of text to feed in and get our model going. 00:01:30.400 |
So I've taken a snippet of text from the Wikipedia page of Winston Churchill, which is here. 00:01:39.200 |
And it's just a small little snippet talking about when he took office during World War II. 00:01:48.160 |
Now from this, I've tested it briefly and it seems to give some pretty interesting results. 00:01:53.600 |
So we will go ahead and use this; all we need to do is tokenize it. 00:01:59.760 |
Now all we're doing here is taking each of these words, splitting them into tokens, 00:02:13.120 |
so that would be a list where each word is its own item. So, for "he began his premiership", 00:02:21.760 |
Each one of those would be a separate value within that list. 00:02:25.600 |
Once we have them in that tokenized format, our tokenizer will then convert them into numerical 00:02:34.320 |
IDs, which map to word vectors that have been trained to work with GPT-2. 00:02:41.840 |
Now, because we're using PyTorch, we just need to remember to return PyTorch ('pt') tensors here. 00:02:48.080 |
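A minimal sketch of that encoding step; the `sequence` string here is just a placeholder, since the exact Churchill snippet isn't reproduced in the transcript:

```python
# Placeholder for the snippet from Winston Churchill's Wikipedia page
sequence = "He began his premiership ..."

# Tokenize and return the token IDs as a PyTorch tensor
inputs = tokenizer.encode(sequence, return_tensors='pt')
```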
So now we have our inputs, we just need to feed them into our model. 00:03:05.680 |
And we add our inputs. Now, we also need to tell the generate method how long we want our generated 00:03:15.520 |
sequence to be. So all we do for that is add a max length. And this will act as the cutoff point, 00:03:24.400 |
anything longer than this will simply be cut off. 00:03:31.360 |
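As a sketch, that call might look like the following; 200 is an assumed value for the max length, not necessarily the one used in the video:

```python
# Generate a continuation; anything beyond max_length is cut off
outputs = model.generate(inputs, max_length=200)
```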
And now here we are just generating our output. We also need to pass this into the outputs 00:03:39.760 |
variable here, so that we can actually read from it and decode it. So to decode our output IDs, 00:03:50.560 |
because it will output numerical IDs representing words, just like we fed into it, we need to use 00:03:57.520 |
the tokenizer decode method. And our output IDs are in the zero index of the outputs object. 00:04:11.520 |
And we also want to skip any special tokens. So this would be stuff like end of sequence 00:04:19.760 |
tokens, padding tokens, unknown word tokens, and so on. 00:04:23.600 |
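That decoding step, as a minimal sketch:

```python
# Decode the generated IDs back into text, skipping special tokens
# such as end-of-sequence, padding, and unknown-token markers
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```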
And then we can print the text. Now, we can see here that it's basically just 00:04:33.520 |
going over and over again, saying the same things, which is not really what we want. 00:04:37.840 |
So this is a pretty common problem. And all we need to do 00:04:42.080 |
to fix this is add another argument to our generate method here. 00:04:48.160 |
So we simply add do_sample equals True. And then we can rerun this. 00:04:57.840 |
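In code, that's just one extra argument to generate (a sketch, reusing the assumed max_length from before):

```python
# Sample from the distribution instead of greedy decoding,
# which breaks up the repetitive output
outputs = model.generate(inputs, max_length=200, do_sample=True)
```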
So we can add more randomness and restrict the number of possible tokens for the model to use, 00:05:07.840 |
using the temperature and top k parameters, respectively. 00:05:14.800 |
Now, temperature controls the amount of randomness in the model's sampling. So a high temperature above 00:05:22.080 |
one will create more random tokens than the default. Anything below one makes the model 00:05:29.200 |
less random. So say if we put a stupidly high number, like five, we will probably get a pretty 00:05:35.600 |
weird output. Okay, so we can see here, initially skimming over, it doesn't look too bad. But then 00:05:43.200 |
when you start reading it, it's practically impossible to follow. There's no structure, 00:05:48.720 |
and there's just a couple of random words in there that are just completely irrelevant. 00:05:54.400 |
Now, we can also see here, there's an end bracket, and there's a starting bracket that pairs with it. 00:06:00.560 |
And generally, it's just some really weird syntax. So we turn the temperature down, 00:06:09.200 |
maybe to 0.7, and we will actually decrease the randomness relative to the original model. 00:06:16.400 |
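A sketch of the call with the temperature argument added (0.7 as in the video; a stupidly high value like 5.0 reproduces the gibberish seen a moment ago):

```python
# Temperature below 1.0 sharpens the distribution (less random);
# values well above 1.0, like 5.0, flatten it and produce gibberish
outputs = model.generate(inputs, max_length=200, do_sample=True,
                         temperature=0.7)
```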
Now, you can toy around with this and see what produces more interesting results. Generally, 00:06:24.240 |
higher temperature will create more creative outputs. And the other parameter we can also use 00:06:31.360 |
is the top k parameter. Now, top k limits sampling to the k highest-probability tokens that the 00:06:40.160 |
model is predicting at each step. So we can set it to 50, for example, and this will alter our output. Generally, 00:06:48.400 |
I've found top k tends to make the text a little more coherent. And I would assume this is because 00:06:56.000 |
it is sticking within a smaller space of possible tokens or words that it can output. So now here, 00:07:04.080 |
we can see pretty understandable, logical text again. And we can see here, it mentions lord a 00:07:11.920 |
lot, which makes sense because this is Britain. So if we put the temperature back up to 1, 00:07:20.640 |
we should get a slightly more random output again. And then here, we can see that there's 00:07:28.800 |
a little more weird text coming in. So we have here that the first Australian prime minister 00:07:35.120 |
was sacked by a labor minister, which obviously is a little bit strange. But it just shows that 00:07:42.160 |
we can add more randomness, or we can try and restrict our model to become more coherent. 00:07:48.560 |
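Putting the sampling controls together, a final sketch; all of the specific values here are just illustrative:

```python
# do_sample enables sampling, temperature scales the randomness,
# and top_k restricts each step to the 50 most likely tokens
outputs = model.generate(inputs, max_length=200, do_sample=True,
                         temperature=1.0, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```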
And we can do this super easily using the generate parameters. So with just a few lines of code, 00:07:56.480 |
we got our model up and running and actually generating text incredibly easily. So I hope 00:08:03.840 |
this has been insightful and useful. If you have any questions or suggestions, please just let me 00:08:10.560 |
know in the comments below. But thank you for watching, and I will see you next time.