
Text Summarization with Google AI's T5 in Python


Transcript

Hi, and welcome to this video on text summarization. We're going to go ahead and build a really simple, easy-to-use text summarizer using Google AI's T5 model. This is insanely easy to do: altogether we only need seven lines of code, and with that we can summarize text using Google's T5 model, which is the cutting edge in text summarization at the moment.

So it's really impressive that we can do this so easily, and we'll just run through really quickly and see what we can do. First, we need to import Torch and the Transformers library, and from Transformers we just need the AutoTokenizer and AutoModelWithLMHead classes.
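In code, the imports look like this. Note that `AutoModelWithLMHead` has since been deprecated in the Transformers library; `AutoModelForSeq2SeqLM` is the drop-in replacement for T5, so this sketch uses the newer class:

```python
import torch  # the Transformers library uses PyTorch tensors under the hood
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The video imports AutoModelWithLMHead instead; for T5 it loads the same
# weights, but it is deprecated in newer releases of Transformers.
```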

Whilst they're importing, we can initialize our tokenizer and model. All we do for this is call the tokenizer's from_pretrained method, and we will be using the t5-base model. Then we do the same for our model, except with the AutoModelWithLMHead class.

And we also need to make sure that we return a dictionary here as well, by setting return_dict to True. For this we're going to take some text from the Wikipedia page about Winston Churchill. We will just take this text here; I've already formatted it over here, so I'm just going to take this and paste it in.
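A sketch of the initialization step, again substituting `AutoModelForSeq2SeqLM` for the now-deprecated `AutoModelWithLMHead`:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the pretrained t5-base tokenizer and model weights
# (these are downloaded on first use).
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# return_dict=True makes the model return structured ModelOutput
# objects rather than plain tuples.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base", return_dict=True)
```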

But this is exactly the same as what I highlighted just here, without the numbers and headers. So we run that, and we simply build our input IDs. All we're doing here is taking each of these words and splitting them into tokens. So imagine we split one sentence into a list of words, say "his first speech as prime minister".

Each one of these words would be a separate token. So we split the text into those tokens, and then we convert those tokens into unique identifier numbers. Each of these IDs will be used by the model to map that word, which is now a number, to a vector that has been trained to represent that word.
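As a rough sketch of that idea, here is the same sentence tokenized with a toy word-level vocabulary. This is only for illustration: T5's real tokenizer uses SentencePiece subword units, and its IDs are fixed by the pretrained vocabulary, not made up like these:

```python
sentence = "his first speech as prime minister"

# Step 1: split the sentence into tokens (naively, on whitespace --
# T5's actual tokenizer splits into subword pieces instead).
tokens = sentence.split()

# Step 2: map each token to a unique identifier via a toy vocabulary.
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
input_ids = [vocab[token] for token in tokens]

print(tokens)     # ['his', 'first', 'speech', 'as', 'prime', 'minister']
print(input_ids)  # the same sentence as vocabulary IDs
```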

We prepend "summarize: " at the front here, followed by our sequence. Because we are using PyTorch, we want to return "pt" tensors. And we set a max length of 512 tokens, which is the maximum number of tokens that T5 can handle at once; anything longer than this we would like to truncate.
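In code, the encoding step looks roughly like this; the `text` variable here is a short placeholder standing in for the Churchill passage pasted earlier:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Placeholder passage -- stands in for the Churchill text from the video.
text = (
    "Winston Churchill served as Prime Minister of the United Kingdom "
    "during the Second World War and is remembered for his speeches."
)

# Prepend the "summarize: " prefix so T5 knows which task to perform,
# ask for PyTorch ("pt") tensors, and truncate anything beyond 512 tokens.
inputs = tokenizer.encode(
    "summarize: " + text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
)

print(inputs)  # a 1 x sequence_length tensor of input IDs
```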

So now we can have a look at those inputs, and we can see we have our tensor of input IDs. Next we need to run these input IDs through our model. So we call model.generate, and this will generate a certain number of output tokens, which are also numeric representations of words.

Now all we need to do here is pass our inputs, and then we give a max length and a minimum length as well. This just tells the model that we don't want anything longer than 150 tokens or shorter than 80 tokens. We also have a length penalty parameter here.

The length penalty is applied during beam search: the higher the value, the more longer sequences are favoured, and we're going to use quite a high value here of 5. And we use two beams. What we also need to do here is assign the result to another variable, outputs, and then when we want to access these outputs we will use outputs[0], as this is the tensor containing our numeric word IDs.

Now we can use the tokenizer again to decode our outputs. This converts our outputs from the numeric IDs back into text, and we also want to assign that to another variable. Finally, we can print our summary. And here we can see that the model has taken some of the information, I think entirely from this second paragraph here, and created a summary of the full text.
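Putting the steps above together, the whole script is roughly the following sketch (again substituting `AutoModelForSeq2SeqLM` for the deprecated `AutoModelWithLMHead`, and a short placeholder for the Churchill passage):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base", return_dict=True)

# Placeholder passage -- in the video this is several paragraphs
# about Winston Churchill.
text = (
    "Winston Churchill served as Prime Minister of the United Kingdom "
    "during the Second World War. He is remembered for his wartime "
    "speeches, which rallied the British public during the darkest "
    "days of the conflict."
)

inputs = tokenizer.encode(
    "summarize: " + text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
)

outputs = model.generate(
    inputs,
    max_length=150,      # no more than 150 output tokens
    min_length=80,       # no fewer than 80 output tokens
    length_penalty=5.0,  # high value -> beam search favours longer outputs
    num_beams=2,
)

# Convert the generated token IDs back into text;
# skip_special_tokens=True drops markers like <pad> and </s>.
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
```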

Out of the box this is pretty good, because if you read through it, it includes a lot of the main points. Now, the first paragraph isn't that relevant, and I would say the final paragraph is not either; most of the information that we want, from my point of view, is in the second and third paragraphs.

Now, the model has quite clearly only extracted information from the second paragraph, which is not ideal, but for an out-of-the-box solution it still performed pretty well. So that's it for this model. I hope this has been useful and insightful as to how quickly we can build a pretty good text summarizer and implement Google's T5 model in almost no lines of code.

So I hope you enjoyed and I will see you in the next one.