
Text Summarization with Google AI's T5 in Python



00:00:00.000 | Hi, and welcome to this video on text summarization. We're going to build a
00:00:06.800 | really simple, easy-to-use text summarizer using Google AI's T5 model. This is insanely easy
00:00:16.080 | to do. Altogether we only need seven lines of code, and with that we can
00:00:21.200 | summarize text using Google's T5 model, which is the cutting edge in text
00:00:28.480 | summarization at the moment. So it's really impressive that we can do this so easily, and
00:00:34.320 | we'll run through it quickly and see what we can do. First, we need to import Torch and
00:00:42.240 | the Transformers library. From the Transformers library, we just need the
00:00:51.840 | AutoTokenizer and the AutoModelWithLMHead class.
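In code, those imports might look like this (a sketch, assuming the torch and transformers packages are installed):

    import torch
    # tokenizer and model classes from Hugging Face Transformers
    from transformers import AutoTokenizer, AutoModelWithLMHead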
00:01:10.240 | While those are importing, we can initialize our tokenizer and model. All we do for this
00:01:16.960 | is load our tokenizer with from_pretrained, and we will be using the T5 base model.
00:01:25.680 | Then we do the same for our model, except with the AutoModelWithLMHead class.
00:01:39.360 | We also need to make sure that we return a dictionary here as well.
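Putting that together, the initialization might look like this (a sketch; 't5-base' is the checkpoint named in the video, and return_dict=True makes the model return its outputs as a dictionary):

    # load the T5-base tokenizer and model weights
    tokenizer = AutoTokenizer.from_pretrained('t5-base')
    model = AutoModelWithLMHead.from_pretrained('t5-base', return_dict=True)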
00:01:52.640 | Okay, for this we're going to take some text from a PDF page about Winston Churchill,
00:02:03.680 | and we will just take this text here.
00:02:11.920 | I've already formatted it over here, so I'm just going to take this and paste it in. This is
00:02:20.160 | exactly the same as what I highlighted just here, without the numbers and headers.
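In code, this step is just assigning the pasted passage to a variable; the placeholder below stands in for the actual Churchill text, which is not reproduced in the transcript:

    # the passage copied from the page (placeholder, not the real text)
    text = """Winston Churchill ... (paste the formatted passage here)"""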
00:02:28.960 | So we run that, and we simply build our input IDs. All we're doing here is taking each of the words and splitting
00:02:39.920 | them into tokens. Imagine we split one sentence into a list of words; for
00:02:46.800 | "his first speech as prime minister", each one of those words would be a separate token. So we split
00:02:54.560 | the text into those tokens, and then we convert those tokens into unique identifier numbers. Each of
00:03:01.440 | these identifying numbers will be used by the model to map that word, which is now a number,
00:03:08.240 | to a trained vector that represents that word.
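As an illustration of tokens versus IDs, using the tokenizer loaded earlier (a sketch; the exact subword tokens and ID values depend on the T5 vocabulary):

    # split a sentence into tokens, then map tokens to their IDs
    tokens = tokenizer.tokenize("his first speech as prime minister")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(tokens)  # subword tokens such as '▁his', '▁first', ...
    print(ids)     # the matching unique identifier numbers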
00:03:16.000 | We add "summarize:" at the front here, followed by our sequence.
00:03:24.240 | Because we are using PyTorch, we want to return 'pt' tensors.
00:03:28.080 | And we set a max length of 512 tokens, which is the maximum number of tokens that
00:03:41.920 | T5 can handle at once. Anything longer than this we would like to truncate.
00:03:52.240 | So now we can have a look at those inputs and we can see we have our tensor of input IDs.
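The encoding step, with the parameters just described, might look like this (a sketch; truncation=True is an assumption matching the "anything longer we truncate" note):

    # prepend the summarization task prefix and encode to a PyTorch tensor
    inputs = tokenizer.encode("summarize: " + text,
                              return_tensors='pt',
                              max_length=512,
                              truncation=True)
    print(inputs)  # tensor of input IDs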
00:03:58.160 | Now we need to run these input IDs through our model. So we call model.generate, and this will
00:04:08.640 | generate a certain number of output tokens, which are also numeric representations
00:04:16.800 | of the words. All we need to do here is pass our inputs,
00:04:22.480 | and then we give a max length and a minimum length as well. This just tells the model
00:04:31.680 | that we do not want anything longer than 150 tokens
00:04:36.400 | or anything shorter than 80 tokens.
00:04:46.160 | We have a length penalty parameter here as well: the higher the number, the more the model
00:04:52.560 | is penalized for going either below or above that minimum and maximum length. We're going
00:04:58.640 | to use quite a high value here of 5. And we use two beams. What we also need to do here is
00:05:17.520 | assign the result to another variable, outputs, and then when we want to access the generated sequence we
00:05:25.440 | will use outputs[0], as this is the tensor containing our numeric word IDs.
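Putting those parameters together, the generation call might look like this (a sketch using the values mentioned in the video):

    # generate summary token IDs with beam search
    outputs = model.generate(inputs,
                             max_length=150,
                             min_length=80,
                             length_penalty=5.0,
                             num_beams=2)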
00:05:35.920 | Now we can use the tokenizer again to decode our outputs, converting them from the numeric IDs back into text,
00:05:43.920 | and we also want to assign that to another variable.
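The decoding and printing steps might look like this (a sketch; skip_special_tokens=True is an assumption, used to strip markers like </s> from the text):

    # convert the generated IDs back into a text summary
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(summary)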
00:05:53.120 | Finally, we can print our summary. Here we can see that the model has taken some of the
00:06:01.200 | information, I think entirely from this second paragraph here, and created a summary of the
00:06:10.560 | full text. Out of the box this is pretty good, because if you read through it, it includes
00:06:16.320 | a lot of the main points. Now, the first paragraph isn't that relevant, and I would say the final
00:06:22.640 | paragraph is not either. Most of the information that we want, from my point of view, is in the
00:06:28.400 | second and third paragraphs. The model has quite clearly only extracted information from
00:06:34.880 | the second paragraph, which is not ideal, but for an out-of-the-box solution it still performed pretty well.
00:06:40.640 | So, that's it for this video. I hope it has been useful and insightful as to how quickly we
00:06:47.280 | can build a pretty good text summarizer and implement Google's T5 model in almost
00:06:54.160 | no lines of code. I hope you enjoyed it, and I will see you in the next one.