Why are there so many Tokenization methods in HF Transformers?
Chapters
0:00 Intro
0:58 Tokenization
2:40 Building a Tokenizer
3:37 Tensors
11:10 Batch Encode
15:13 Tokenizer
Hi, and welcome to the video. Today we're going to have a look at the different tokenization methods in Hugging Face Transformers, or at least a few of them. Now, I'm sure a few of you are thinking that tokenization is pretty straightforward, and I believe that as well. So why are there so many tokenization methods? On the screen right now you can see five different methods. In reality each of them does do something slightly different, but all of them exist simply to produce token IDs. For those of you who are new to tokenization, and maybe to Transformers, we'll quickly cover the very basics of a tokenizer, or at least the basics needed to understand what each of these methods actually does.
Tokenization, in short, is this: the process of going from what we have up here, our original human-readable text, "hello world!" (note the exclamation mark at the end), and converting that original text into what we call tokens. Tokens can be a few different things. In this case, what we see are tokens built from words, so each token represents a word or a piece of syntax, like the exclamation mark at the end. Depending on what sort of tokenizer you're using, you can build tokens from completely different things: you can build them from the bytes within the text, or you can use word-piece encoding. There's no great example in this text, but say we had the word "something"; we could split that into word pieces such as "some" and "thing", and a common ending like "ing" is the kind of fragment that would be a word piece in its own right. So when we tokenize, it doesn't have to be a single word per token; it can be a whole host of different things. Then we go from those tokens to the token IDs we see at the bottom. In this case "hello" is translated to the integer 7592, and then we have "world" and the exclamation mark as well. So that's the process; that's what we're doing.
But how do we do that with a Hugging Face Transformers tokenizer? We have these two files that our tokenizer is built from, and those two files are our tokenizer. If you've followed some of my previous videos on building a tokenizer, you will recognize both of these files, and they correspond to the two steps. First, merges.txt takes us from that original text up here to our tokens down here; that's step one. Then in step two we take the tokens we built in step one and process them through vocab.json, and that produces our transformer-readable token IDs, which we see at the bottom there.
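As a reference point for the examples that follow, here is a minimal sketch of loading a tokenizer. The video builds its tokenizer from local files; loading a pretrained BERT checkpoint is my assumption here, but it is consistent with the token IDs (7592, 101, 102) shown later in the video.

```python
from transformers import AutoTokenizer

# Assumed checkpoint -- the IDs it produces (101, 102, 7592, ...) match those shown in the video.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "hello world!"
```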
Now, there are a few different tensors that we need to feed into our transformer models. We've just seen how we build the input IDs, or token IDs; those are essential, and we need them for every transformer model. We also have the attention mask. This is typically a tensor containing ones and zeros: the ones correspond to the real tokens within our token IDs tensor, and the zeros correspond to the padding tokens in the token IDs tensor. So we have the attention mask, and then we also have the token type IDs, which you can also call segment IDs. Segment IDs are used when we have multiple segments in our input. For example, if we're doing question answering with BERT, we would have our question, then in the middle a separator token ([SEP]), and then the context that we're getting the answer to our question from. In the segment IDs, anything that belongs to our question is represented by a zero, and anything that belongs to our context is represented by a one. Those are the three key tensors we'd be using, and here's just a visualization of the attention mask: the real tokens are represented by ones in the attention mask tensor, and the padding tokens are represented by zeros.
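To make those three tensors concrete, here is a small illustration of what they might look like for "hello world!" padded to a length of six. The ID values assume a BERT-style tokenizer and are shown purely for illustration.

```python
# Illustration only -- IDs assume a BERT-style tokenizer, padded to length 6.
input_ids      = [101, 7592, 2088, 999, 102, 0]  # [CLS] hello world ! [SEP] [PAD]
attention_mask = [  1,    1,    1,   1,   1, 0]  # 1 = real token, 0 = padding
token_type_ids = [  0,    0,    0,   0,   0, 0]  # all zeros: only one segment here
```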
I think that's all we really need as a quick summary of tokenizers in Transformers. Now, how does that relate to what we're doing here? Let's create a new cell and take this as our first example. We have our text, "hello world!", and let's see what happens when we use tokenizer.tokenize alone. If we do this, we see that we create our tokens. So straight away we know that this first method does our tokenization in the steps we outlined before; it doesn't do everything all at once, it does them step by step. And you can probably guess that the next step after creating those tokens is to convert them into token IDs, which we do with convert_tokens_to_ids.
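A minimal sketch of this two-step method; the printed values assume the bert-base-uncased tokenizer loaded above:

```python
tokens = tokenizer.tokenize(text)
print(tokens)   # ['hello', 'world', '!']

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)      # [7592, 2088, 999] -- no special tokens, no padding
```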
This is a completely valid method; it works, and it's good, but you can see that it's pretty simple. With what we have here, we can't create PyTorch or TensorFlow tensors, there are no arguments for adding padding or truncation (which we almost always need), and we can't add special tokens either. So it works, it's fine, but it's very basic, and maybe it's not the best option if you want all of that done automatically. If you do want to do it manually, you can go ahead and add your special tokens, padding, and truncation yourself without a problem, and then convert the result into PyTorch tensors, as sketched below.
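Here is a rough sketch of that manual route, assuming the same BERT tokenizer; the target length of 10 is an arbitrary choice for illustration.

```python
import torch

# Manual version of what the later methods do for us.
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world!"))
ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]   # add special tokens
ids = ids + [tokenizer.pad_token_id] * (10 - len(ids))            # pad to length 10
tensor = torch.tensor([ids])                                      # shape (1, 10)
```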
The maybe-easier method, though, is to use encode. If we look at encode here, you can see we have the same actual tokens as before, the 7592 through to the 999, so that is our text tokenized, or converted into token IDs. But we also have this 101 and 102. If you don't know what those are, that's fine: they're special tokens that BERT uses to indicate the start of a sequence (101) and the end of a sequence (102). There's also another special token that we'll see in a minute, which is zero, the padding token that BERT uses.
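A quick sketch of encode on its own, again assuming the BERT tokenizer from above:

```python
ids = tokenizer.encode(text)
print(ids)   # [101, 7592, 2088, 999, 102] -- [CLS] and [SEP] added automatically
```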
Now, if we were to use this method to actually build a tensor for BERT in PyTorch, we would probably write something like this: we'd set max_length equal to 512, set padding to the max length, and make sure we return PyTorch tensors. Then we see we get this big PyTorch tensor where all of those zeros are padding tokens, and it goes up to a length of 512, which is the correct size for BERT base. Let me just restrict how much of that we're seeing.
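A sketch of that call; the argument names follow the current transformers API, and truncation is enabled here as an extra assumption:

```python
ids = tokenizer.encode(text,
                       max_length=512,
                       padding="max_length",
                       truncation=True,
                       return_tensors="pt")
print(ids.shape)   # torch.Size([1, 512]) -- trailing 0s are BERT's [PAD] token
```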
So that's encode. Up here we also have encode_plus, so let's see what that does. With encode we just got our input IDs; referring back to the earlier diagram, we have our token IDs, or input IDs, but we don't have the attention mask or the segment IDs, which we also need. That's a limitation of the encode method, and it gets fixed by the encode_plus method. If we run that, we see that instead of getting a single list, we get back a dictionary containing the input IDs (token IDs), the token type IDs (segment IDs), and the attention mask. Straight away, that looks a lot better.
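A sketch of encode_plus and the dictionary it returns; the values assume the same BERT tokenizer:

```python
out = tokenizer.encode_plus(text)
print(out)
# {'input_ids': [101, 7592, 2088, 999, 102],
#  'token_type_ids': [0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1]}
```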
We can also use all the same arguments that we used with encode, so let's change this to encode_plus (we'll remove that argument for now and add it back in a minute). You can see we now have input_ids, then, scrolling down a little, token_type_ids. The token types here are just zeros because we don't have two sequences in the input; if we were to pass two sequences, we would get zeros and ones. And then we also have the attention_mask. So that's three methods: tokenize plus convert_tokens_to_ids, then encode, and then encode_plus.
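For completeness, here is encode_plus with padding applied; the max_length of 10 is chosen only to keep the output readable, and it shows how the attention mask zeros line up with the padding tokens:

```python
out = tokenizer.encode_plus(text, max_length=10, padding="max_length", truncation=True)
print(out["input_ids"])        # [101, 7592, 2088, 999, 102, 0, 0, 0, 0, 0]
print(out["attention_mask"])   # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```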
Now, you may have guessed it from the name already, but we also have batch_encode_plus. batch_encode_plus lets us do the same as encode_plus, but for batches of sentences. So let's go down here, remove that, and create a text list; I'm going to add another item, "hello world again". Now, if I were to call encode_plus on that text list, we get this pretty weird output that just doesn't look right, and that's because it isn't right. We can't pass a list to encode_plus; it won't work, because it expects each string one at a time. What we actually see here are these 100 tokens, and 100 is the unknown token. That's because we're passing two string objects inside a list, and the tokenizer is reading each string object as a whole, saying "I have no idea what this is," and giving it an unknown token. So we can't use encode_plus here; instead we use batch_encode_plus, like that.
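A sketch of the difference; the [UNK] behaviour for encode_plus on a list is what the video shows and may vary between transformers versions:

```python
text_list = ["hello world!", "hello world again"]

# encode_plus treats each string in the list as one unknown "word" -> [UNK] (id 100)
print(tokenizer.encode_plus(text_list)["input_ids"])   # e.g. [101, 100, 100, 102]

# batch_encode_plus tokenizes every string in the list properly
out = tokenizer.batch_encode_plus(text_list)
print(out["input_ids"])   # one list of IDs per input string
```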
Now we see that we don't get just one of each tensor; we get two, an array for each of our input strings. Because the output is a dictionary, we can access each of those tensors by its key, so we write input_ids and have a look at the shape. We can only use the shape attribute when we have a tensor, so we need to return PyTorch tensors, and when we first try that we get an error. The reason is that our two arrays are of different lengths, and we can't create a tensor with differently sized rows. So I'll change the max_length to 10 so it's not huge, and we can see that in the first row we've added five padding tokens and in the second we've added four, to make them both the same size, which means we can actually create a tensor from them. If we now check the shape, we see that our two strings have both been tokenized and converted into tensor rows of 10 values each. So that's batch_encode_plus. We can pass a huge number of strings into it; we just need to pass them as a list.
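A sketch of the working call; max_length=10 and the example strings come from the video, while the remaining settings are assumed:

```python
out = tokenizer.batch_encode_plus(text_list,
                                  max_length=10,
                                  padding="max_length",
                                  truncation=True,
                                  return_tensors="pt")
print(out["input_ids"].shape)   # torch.Size([2, 10]) -- one padded row per string
```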
And that leads us on to our final method, which is calling the tokenizer itself. So far we've been using tokenizer followed by a method of that class or object; this time we're calling the class, or object, directly. So we write tokenizer and pass in text. If we look at this, it looks like it's doing the same thing as encode_plus, and if we compare it to encode_plus, it's exactly the same output. Now, what if we change the input from text to text_list? Now we're getting the same output as we got from batch_encode_plus. So what calling the tokenizer directly does is look at the data type of the input we feed into it: is it a string, as is the case with text, or is it a list, as is the case with text_list? If it's a string, it calls encode_plus; if it's a list, it calls batch_encode_plus. That's all that calling the tokenizer directly is doing. So generally, calling the tokenizer directly is usually the way to go; if you're not sure whether you've got batches or single strings coming through, it can be very useful.
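A short sketch of that dispatch behaviour:

```python
# A single string behaves like encode_plus...
print(tokenizer(text)["input_ids"])        # [101, 7592, 2088, 999, 102]

# ...and a list of strings behaves like batch_encode_plus.
print(tokenizer(text_list)["input_ids"])   # one list of IDs per string
```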
That's pretty much all I wanted to cover in this video. With calling the tokenizer directly, it's also worth noting that we can use the same parameters: if we take these arguments, we can use all of them just as we did with encode and encode_plus, like so (see the sketch below).
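A final sketch, reusing the assumed settings from the batch_encode_plus example above:

```python
out = tokenizer(text_list,
                max_length=10,
                padding="max_length",
                truncation=True,
                return_tensors="pt")
print(out["input_ids"].shape)   # torch.Size([2, 10])
```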
So those are all of the tokenization methods, or at least the main ones, in Transformers. I'd imagine there are probably more that I'm not aware of; if you know of any others, let me know in the comments below, as it would be interesting to see more of them. But these are probably the five main ones, and I've seen a few questions about them before, so I thought it would be worth covering them; I was also curious myself about the specific differences between each of them. So that's it for this video. Thank you very much for watching, and I'll see you in the next one.