
Language Generation with OpenAI's GPT-2 in Python


Transcript

Hi and welcome to the video. We're going to go through language generation using GPT-2. Now this is actually incredibly easy to do, and we can build this entire pipeline, including the imports, the tokenizer and model, and outputting our generated text, with just seven lines of code, which is pretty insane.

Now the only libraries we need for this are PyTorch and Transformers, so we'll go ahead and import them now. All we need from the Transformers library are the GPT2LMHeadModel and GPT2Tokenizer classes, so we can initialize both of those as well now. And both will be loaded with from_pretrained.
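As a minimal sketch of that setup (assuming the base "gpt2" checkpoint, since the model size isn't specified here):

```python
import torch  # PyTorch backend for the tensors we'll create later
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained tokenizer and language model weights.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
```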

So now we have initialized our tokenizer and model, we just need a sequence of text to feed in to get our model going. I've taken a snippet of text from the Wikipedia page on Winston Churchill, which is here. It's just a small snippet talking about when he took office during World War II.

Now from this, I've tested it briefly and it seems to give some pretty interesting results, so we will go ahead and use it. All we need to do is tokenize it. All we're doing here is taking each of these words and splitting them into tokens, so that would be a list where each word is its own item: 'he', 'began', 'his', 'premiership'.

Each one of those would be a separate value within that list. Once we have them in that tokenized format, our tokenizer converts them into numerical IDs, which map to the word vectors that GPT-2 was trained with. And because we're using PyTorch, we just need to remember to return PyTorch tensors here, by setting return_tensors to "pt".
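Roughly, that step looks like this (the prompt string below is a stand-in for whatever Wikipedia snippet you paste in):

```python
# Placeholder prompt: substitute the full snippet from the
# Winston Churchill Wikipedia page (or any other text you like).
sequence = "He began his premiership by forming a five-man war cabinet."

# encode() tokenizes the text, maps each token to its numerical ID,
# and returns the IDs as a PyTorch tensor ready for the model.
inputs = tokenizer.encode(sequence, return_tensors="pt")
```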

So now we have our inputs, we just need to feed them into our model. We can do that using model.generate, and we add our inputs. Now, we also need to tell the generate method how long we want our generated sequence to be, so all we do for that is add a max_length argument.

This will act as the cutoff point: anything longer than this will simply be cut off. And here we are just generating our output. We also need to assign this to the outputs variable, so that we can actually read from it and decode it. To decode our output IDs, because the model will output numerical IDs representing words, just like we fed into it, we need to use the tokenizer's decode method.

And our output IDs are in the zero index of the outputs object. We also want to skip any special tokens, so this would be things like end-of-sequence tokens, padding tokens, unknown-word tokens, and so on. And then we can print the text. Now, we can see here that it's basically just repeating itself, saying the same things over and over, which is not really what we want.
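Sketching those steps out (the max_length of 200 is just an illustrative value; any cutoff works the same way):

```python
# Greedy generation: keep producing tokens until max_length is hit.
outputs = model.generate(inputs, max_length=200)

# Decode the first (and only) generated sequence back into text,
# dropping special tokens like end-of-sequence or padding.
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```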

So this is a pretty common problem, and all we need to do to fix it is add another argument to our generate method here. We simply set do_sample equal to True, and then we can rerun this. And this looks pretty good now. From here we can add more randomness, and restrict the number of possible tokens for the model to use, using the temperature and top_k parameters, respectively.
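That one-argument change looks like this:

```python
# Sample from the model's token distribution instead of always taking
# the single most likely token; this breaks the repetitive loop.
outputs = model.generate(inputs, max_length=200, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```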

Now, temperature controls the amount of randomness in the model's sampling. A temperature above one will produce more random tokens than the default, and anything below one makes the model less random. So if we put in a stupidly high number, like five, we will probably get a pretty weird output.
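For example, a deliberately extreme setting:

```python
# temperature=5.0 flattens the token probability distribution, so
# low-probability tokens get sampled far more often than usual.
outputs = model.generate(inputs, max_length=200, do_sample=True,
                         temperature=5.0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```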

Okay, so we can see here, initially skimming over it, it doesn't look too bad. But when you start reading it, it's practically impossible to follow. There's no structure, and there are a couple of random words in there that are just completely irrelevant. We can also see here there's an end bracket, and a starting bracket that pairs with it.

And generally, it's just some really weird syntax. So if we turn the temperature down, maybe to 0.7, we will actually decrease the randomness from the default model. Now, you can toy around with this and see what produces more interesting results; generally, a higher temperature will create more creative outputs. The other parameter we can use is the top_k parameter.

Now, top_k limits sampling to the k highest-probability tokens the model is predicting. So we can set it to 50, for example, and this will alter our output. Generally, I've found top_k tends to make the text a little more coherent, and I would assume this is because it is sticking within a smaller space of possible tokens, or words, that it can output.
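Combining the two parameters, again with illustrative values:

```python
# A lower temperature sharpens the distribution, and top_k=50 restricts
# sampling to the 50 most likely tokens at each step.
outputs = model.generate(inputs, max_length=200, do_sample=True,
                         temperature=0.7, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```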

So now here, we can see pretty understandable, logical text again. And we can see it mentions 'lord' a lot, which makes sense because this is Britain. So if we put the temperature back up to 1, we should get a slightly more random output again. And then here, we can see that there's a little more weird text coming in.

So we have here that the first Australian prime minister was sacked by a Labor minister, which obviously is a little bit strange. But it just shows that we can add more randomness, or we can restrict our model to make it more coherent, and we can do this super easily using the generate parameters.
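Putting it all together, the whole pipeline is roughly the handful of lines below (model size, prompt text, and parameter values are all placeholders you can swap out):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer  # PyTorch must be installed

# Load the pre-trained tokenizer and model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Placeholder prompt: use any snippet of text you want to extend.
sequence = "He began his premiership by forming a five-man war cabinet."
inputs = tokenizer.encode(sequence, return_tensors="pt")

# Sample up to max_length tokens, then decode and print the result.
outputs = model.generate(inputs, max_length=200, do_sample=True,
                         temperature=1.0, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```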

So with just a few lines of code, we got our model up and running and actually generating text, incredibly easily. I hope this has been insightful and useful. If you have any questions or suggestions, please let me know in the comments below. But thank you for watching, and I will see you next time.
