Why are there so many Tokenization methods in HF Transformers?
Chapters
0:00 Intro
0:58 Tokenization
2:40 Building a Tokenizer
3:37 Tensors
11:10 Batch Encode
15:13 Tokenizer
Hi, and welcome to the video. Today we're going to have a look at the different tokenization methods in Hugging Face Transformers, or at least a few of them. Now, I'm sure a few of you are thinking that tokenization is pretty straightforward, and I believe that as well. So why are there so many tokenization methods? On the screen right now you can see five different methods. In reality each of them does do something slightly different, but all of them exist simply to produce token IDs. For those of you who are new to tokenization, and maybe to Transformers, we'll quickly cover the very basics of a tokenizer, or at least the basics needed to understand what each of these methods actually does.
Tokenization, in short, is this: the process of going from what we have up here, our original human-readable text, "hello world!" (note the exclamation mark at the end), and converting that original text into what we call tokens. Tokens can be a few different things. In this case, what we see are tokens built from words, so each token represents a word or a piece of syntax, like the exclamation mark at the end. Depending on what sort of tokenizer you're using, you can build tokens from completely different things: you can build them from the bytes within the text, or you can use word-piece encoding. There's no great example in this text, but say we had the word "something"; we could split that into word pieces such as "some" and "thing", and a common ending like "ing" is the kind of fragment that would be a word piece in its own right. So when we tokenize, it doesn't have to be a single word per token; it can be a whole host of different things. Then we go from those tokens to the token IDs we see at the bottom. In this case "hello" is translated to the integer 7592, and then we have "world" and the exclamation mark as well. So that's the process; that's what we're doing.
But how do we do that with a Hugging Face Transformers tokenizer? We have these two files that our tokenizer is built from, and those two files are our tokenizer. If you've followed some of my previous videos on building a tokenizer, you will recognize both of these files, and they correspond to the two steps. First, merges.txt takes us from that original text up here to our tokens down here; that's step one. Then in step two we take the tokens we built in step one and process them through vocab.json, and that produces our transformer-readable token IDs, which we see at the bottom there.
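As a reference point for the examples that follow, here is a minimal sketch of loading a tokenizer. The video builds its tokenizer from local files; loading a pretrained BERT checkpoint is my assumption here, but it is consistent with the token IDs (7592, 101, 102) shown later in the video.

```python
from transformers import AutoTokenizer

# Assumed checkpoint -- the IDs it produces (101, 102, 7592, ...) match those shown in the video.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "hello world!"
```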
Now, there are a few different tensors that we need to feed into our transformer models. We've just seen how we build the input IDs, or token IDs; those are essential, and we need them for every transformer model. We also have the attention mask. This is typically a tensor containing ones and zeros: the ones correspond to the real tokens within our token IDs tensor, and the zeros correspond to the padding tokens in the token IDs tensor. So we have the attention mask, and then we also have the token type IDs, which you can also call segment IDs. Segment IDs are used when we have multiple segments in our input. For example, if we're doing question answering with BERT, we would have our question, then in the middle a separator token ([SEP]), and then the context that we're getting the answer to our question from. In the segment IDs, anything that belongs to our question is represented by a zero, and anything that belongs to our context is represented by a one. Those are the three key tensors we'd be using, and here's just a visualization of the attention mask: the real tokens are represented by ones in the attention mask tensor, and the padding tokens are represented by zeros.
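To make those three tensors concrete, here is a small illustration of what they might look like for "hello world!" padded to a length of six. The ID values assume a BERT-style tokenizer and are shown purely for illustration.

```python
# Illustration only -- IDs assume a BERT-style tokenizer, padded to length 6.
input_ids      = [101, 7592, 2088, 999, 102, 0]  # [CLS] hello world ! [SEP] [PAD]
attention_mask = [  1,    1,    1,   1,   1, 0]  # 1 = real token, 0 = padding
token_type_ids = [  0,    0,    0,   0,   0, 0]  # all zeros: only one segment here
```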
I think that's all we really need as a quick summary of tokenizers in Transformers. Now, how does that relate to what we're doing here? Let's create a new cell and take this as our first example. We have our text, "hello world!", and let's see what happens when we use tokenizer.tokenize alone. If we do this, we see that we create our tokens. So straight away we know that this first method does our tokenization in the steps we outlined before; it doesn't do everything all at once, it does them step by step. And you can probably guess that the next step after creating those tokens is to convert them into token IDs, which we do with convert_tokens_to_ids.
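A minimal sketch of this two-step method; the printed values assume the bert-base-uncased tokenizer loaded above:

```python
tokens = tokenizer.tokenize(text)
print(tokens)   # ['hello', 'world', '!']

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)      # [7592, 2088, 999] -- no special tokens, no padding
```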
This is a completely valid method; it works, and it's good, but you can see that it's pretty simple. With what we have here, we can't create PyTorch or TensorFlow tensors, there are no arguments for adding padding or truncation (which we almost always need), and we can't add special tokens either. So it works, it's fine, but it's very basic, and maybe it's not the best option if you want all of that done automatically. If you do want to do it manually, you can go ahead and add your special tokens, padding, and truncation yourself without a problem, and then convert the result into PyTorch tensors, as sketched below.
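Here is a rough sketch of that manual route, assuming the same BERT tokenizer; the target length of 10 is an arbitrary choice for illustration.

```python
import torch

# Manual version of what the later methods do for us.
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world!"))
ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]   # add special tokens
ids = ids + [tokenizer.pad_token_id] * (10 - len(ids))            # pad to length 10
tensor = torch.tensor([ids])                                      # shape (1, 10)
```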
The maybe-easier method, though, is to use encode. If we look at encode here, you can see we have the same actual tokens as before, the 7592 through to the 999, so that is our text tokenized, or converted into token IDs. But we also have this 101 and 102. If you don't know what those are, that's fine: they're special tokens that BERT uses to indicate the start of a sequence (101) and the end of a sequence (102). There's also another special token that we'll see in a minute, which is zero, the padding token that BERT uses.
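A quick sketch of encode on its own, again assuming the BERT tokenizer from above:

```python
ids = tokenizer.encode(text)
print(ids)   # [101, 7592, 2088, 999, 102] -- [CLS] and [SEP] added automatically
```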
Now, if we were to use this method to actually build a tensor for BERT in PyTorch, we would probably write something like this: we'd set max_length equal to 512, set padding to the max length, and make sure we return PyTorch tensors. Then we see we get this big PyTorch tensor where all of those zeros are padding tokens, and it goes up to a length of 512, which is the correct size for BERT base. Let me just restrict how much of that we're seeing.
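A sketch of that call; the argument names follow the current transformers API, and truncation is enabled here as an extra assumption:

```python
ids = tokenizer.encode(text,
                       max_length=512,
                       padding="max_length",
                       truncation=True,
                       return_tensors="pt")
print(ids.shape)   # torch.Size([1, 512]) -- trailing 0s are BERT's [PAD] token
```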
So that's encode. Up here we also have encode_plus, so let's see what that does. With encode we just got our input IDs; referring back to the earlier diagram, we have our token IDs, or input IDs, but we don't have the attention mask or the segment IDs, which we also need. That's a limitation of the encode method, and it gets fixed by the encode_plus method. If we run that, we see that instead of getting a single list, we get back a dictionary containing the input IDs (token IDs), the token type IDs (segment IDs), and the attention mask. Straight away, that looks a lot better.
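A sketch of encode_plus and the dictionary it returns; the values assume the same BERT tokenizer:

```python
out = tokenizer.encode_plus(text)
print(out)
# {'input_ids': [101, 7592, 2088, 999, 102],
#  'token_type_ids': [0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1]}
```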
We can also use all the same arguments that we used with encode, so let's change this to encode_plus (we'll remove that argument for now and add it back in a minute). You can see we now have input_ids, then, scrolling down a little, token_type_ids. The token types here are just zeros because we don't have two sequences in the input; if we were to pass two sequences, we would get zeros and ones. And then we also have the attention_mask. So that's three methods: tokenize plus convert_tokens_to_ids, then encode, and then encode_plus.
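For completeness, here is encode_plus with padding applied; the max_length of 10 is chosen only to keep the output readable, and it shows how the attention mask zeros line up with the padding tokens:

```python
out = tokenizer.encode_plus(text, max_length=10, padding="max_length", truncation=True)
print(out["input_ids"])        # [101, 7592, 2088, 999, 102, 0, 0, 0, 0, 0]
print(out["attention_mask"])   # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```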
Now, you may have guessed it from the name already, but we also have batch_encode_plus. batch_encode_plus lets us do the same as encode_plus, but for batches of sentences. So let's go down here, remove that, and create a text list; I'm going to add another item, "hello world again". Now, if I were to call encode_plus on that text list, we get this pretty weird output that just doesn't look right, and that's because it isn't right. We can't pass a list to encode_plus; it won't work, because it expects each string one at a time. What we actually see here are these 100 tokens, and 100 is the unknown token. That's because we're passing two string objects inside a list, and the tokenizer is reading each string object as a whole, saying "I have no idea what this is," and giving it an unknown token. So we can't use encode_plus here; instead we use batch_encode_plus, like that.
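A sketch of the difference; the [UNK] behaviour for encode_plus on a list is what the video shows and may vary between transformers versions:

```python
text_list = ["hello world!", "hello world again"]

# encode_plus treats each string in the list as one unknown "word" -> [UNK] (id 100)
print(tokenizer.encode_plus(text_list)["input_ids"])   # e.g. [101, 100, 100, 102]

# batch_encode_plus tokenizes every string in the list properly
out = tokenizer.batch_encode_plus(text_list)
print(out["input_ids"])   # one list of IDs per input string
```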
Now we see that we don't get just one of each tensor; we get two, an array for each of our input strings. Because the output is a dictionary, we can access each of those tensors by its key, so we write input_ids and have a look at the shape. We can only use the shape attribute when we have a tensor, so we need to return PyTorch tensors, and when we first try that we get an error. The reason is that our two arrays are of different lengths, and we can't create a tensor with differently sized rows. So I'll change the max_length to 10 so it's not huge, and we can see that in the first row we've added five padding tokens and in the second we've added four, to make them both the same size, which means we can actually create a tensor from them. If we now check the shape, we see that our two strings have both been tokenized and converted into tensor rows of 10 values each. So that's batch_encode_plus. We can pass a huge number of strings into it; we just need to pass them as a list.
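A sketch of the working call; max_length=10 and the example strings come from the video, while the remaining settings are assumed:

```python
out = tokenizer.batch_encode_plus(text_list,
                                  max_length=10,
                                  padding="max_length",
                                  truncation=True,
                                  return_tensors="pt")
print(out["input_ids"].shape)   # torch.Size([2, 10]) -- one padded row per string
```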
And that leads us on to our final method, which is calling the tokenizer itself. So far we've been using tokenizer followed by a method of that class or object; this time we're calling the class, or object, directly. So we write tokenizer and pass in text. If we look at this, it looks like it's doing the same thing as encode_plus, and if we compare it to encode_plus, it's exactly the same output. Now, what if we change the input from text to text_list? Now we're getting the same output as we got from batch_encode_plus. So what calling the tokenizer directly does is look at the data type of the input we feed into it: is it a string, as is the case with text, or is it a list, as is the case with text_list? If it's a string, it calls encode_plus; if it's a list, it calls batch_encode_plus. That's all that calling the tokenizer directly is doing. So generally, calling the tokenizer directly is usually the way to go; if you're not sure whether you've got batches or single strings coming through, it can be very useful.
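A short sketch of that dispatch behaviour:

```python
# A single string behaves like encode_plus...
print(tokenizer(text)["input_ids"])        # [101, 7592, 2088, 999, 102]

# ...and a list of strings behaves like batch_encode_plus.
print(tokenizer(text_list)["input_ids"])   # one list of IDs per string
```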
That's pretty much all I wanted to cover in this video. With calling the tokenizer directly, it's also worth noting that we can use the same parameters: if we take these arguments, we can use all of them just as we did with encode and encode_plus, like so (see the sketch below).
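A final sketch, reusing the assumed settings from the batch_encode_plus example above:

```python
out = tokenizer(text_list,
                max_length=10,
                padding="max_length",
                truncation=True,
                return_tensors="pt")
print(out["input_ids"].shape)   # torch.Size([2, 10])
```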
So those are all of the tokenization methods, or at least the main ones, in Transformers. I'd imagine there are probably more that I'm not aware of; if you know of any others, let me know in the comments below, as it would be interesting to see more of them. But these are probably the five main ones, and I've seen a few questions about them before, so I thought it would be worth covering them; I was also curious myself about the specific differences between each of them. So that's it for this video. Thank you very much for watching, and I'll see you in the next one.