Text Summarization with Google AI's T5 in Python
Hi, and welcome to this video on text summarization. We're going to build a really simple, easy-to-use text summarizer using Google AI's T5 model. This is insanely easy to do: altogether we only need about seven lines of code, and with that we can summarize text using Google's T5 model, which is at the cutting edge of text summarization at the moment. So it's really impressive that we can do this so easily, and we'll run through it quickly and see what we can do.

First, we need to import torch and the Transformers library. From Transformers we just need the AutoTokenizer and the AutoModelWithLMHead class.
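In code, those two imports look like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead
```

(In newer versions of Transformers, AutoModelWithLMHead has been deprecated in favour of AutoModelForSeq2SeqLM, but the code here follows the class used in the video.)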
While those are importing, we can initialize our tokenizer and model. All we do for this is load our tokenizer with from_pretrained, then do the same for our model, except with the AutoModelWithLMHead class. We also need to make sure that we return a dictionary here as well.
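As a minimal sketch, that initialization might look like this. The exact checkpoint name isn't stated here, so "t5-base" below is an assumption:

```python
# Load a pretrained T5 tokenizer and model. "t5-base" is an assumed
# checkpoint name; any T5 checkpoint follows the same pattern.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
```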
Okay, for this we're going to take some text from the PDF page about Winston Churchill. I've already formatted it over here, so I'm just going to take it and paste it in. This is exactly the same as what I highlighted just here, minus the numbers and headers. So we run that.
Then we simply build our input IDs. All we're doing here is taking each of the words and splitting them into tokens. Imagine we split one sentence into a list of words, say "his first speech as prime minister": each of those words would be a separate token. So we split the text into those tokens, and then we convert the tokens into unique identifier numbers. Each of these IDs will be used by the model to map that word, which is now a number, to a vector that has been trained to represent it. I've also added "summarize:" at the front here, which is the task prefix that tells T5 we want a summary. Because we are using PyTorch, we want to return "pt" tensors. And we set a max length of 512 tokens, which is the maximum number of tokens that T5 can handle at once; anything longer than this we'd like to truncate.
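As a sketch, the encoding step looks like the following; the text value is just a placeholder for the Churchill passage pasted in above:

```python
# Placeholder: the Churchill passage pasted in the video goes here.
text = "..."

# Prefix the input with the "summarize:" task tag, return PyTorch ("pt")
# tensors, and truncate anything beyond T5's 512-token limit.
inputs = tokenizer.encode(
    "summarize: " + text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
)
print(inputs)  # a tensor of input IDs
```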
So now we can have a look at those inputs, and we can see we have our tensor of input IDs.

Next, we need to run these input IDs through our model. So we call model.generate, and this will generate a certain number of output tokens, which are also numeric representations of words. All we need to do here is pass our inputs, and then we give a max length and a minimum length as well. This just tells the model the range of lengths we want; we're going to cap the summary at 150 tokens.
We also have a length_penalty parameter here. This is an exponential penalty applied to the length of each candidate sequence when the beams are scored, so higher values favor longer outputs and lower values favor shorter ones; we're going to use quite a high value here of 5. And we use two beams, which means the model keeps the two most likely sequences as it generates and returns the better one. Now, what we also need to do here is assign all of this to another variable, outputs, and then when we want to access these outputs we will use outputs[0], as this is the tensor containing our numeric word IDs.
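Put together, the generation call looks roughly like this. The exact minimum length isn't stated in the video, so min_length=80 below is an assumed value:

```python
# Generate summary token IDs with beam search.
outputs = model.generate(
    inputs,
    max_length=150,      # cap the summary at 150 tokens
    min_length=80,       # assumed value; the exact number isn't stated
    length_penalty=5.0,  # higher values favor longer sequences in beam scoring
    num_beams=2,         # track the two most likely candidate sequences
)
```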
Now we can use the tokenizer again to decode our outputs, converting them from numeric IDs back into text, and we also want to assign that to another variable.
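The decoding and printing steps then look like this:

```python
# Convert the generated token IDs back into a readable string; outputs[0]
# is the first (and only) sequence in the batch.
summary = tokenizer.decode(outputs[0])
print(summary)
```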
Finally, we can print our summary. And here we can see that the model has taken some of the information, I think entirely from this second paragraph here, and created a summary of the full text. Out of the box this is pretty good, because if you read through it, it includes a lot of the main points. Now, the first paragraph isn't that relevant, and I would say the final paragraph isn't either; most of the information that we want, from my point of view, is in the second and third paragraphs. The model has quite clearly only extracted information from the second paragraph, which is not ideal, but for an out-of-the-box solution it still performed pretty well.
So, that's it for this model. I hope this has been useful and insightful as to how quickly we can build a pretty good text summarizer and implement Google's T5 model in almost no lines of code. I hope you enjoyed it, and I will see you in the next one.