
Why are there so many Tokenization methods in HF Transformers?


Chapters

0:00 Intro
0:58 Tokenization
2:40 Building a Tokenizer
3:37 Tensors
11:10 Batch Encode
15:13 Tokenizer

Transcript

Hi, and welcome to the video. Today we're going to have a look at the different tokenization methods, or at least a few of them, in Hugging Face Transformers. Now, I'm sure a few of you are thinking that tokenization is pretty straightforward, and I believe that as well, so why are there so many tokenization methods?

So on the screen right now you can see we have these five different methods. In reality each of these does do something slightly different, but all of them are simply there to produce token IDs. For those of you who are new to tokenization, and maybe to Transformers, we'll quickly have a look at the very basics of a tokenizer, or at least the very basics needed to understand what each of these methods actually does.

So tokenization, in short, is this: the process of going from what we have up here, which is our original human-readable text, "hello world" with an exclamation mark at the end, and converting that original text into what we call tokens. Now, tokens can be a few different things. In this case what we see are tokens built from words, so each token represents a word or a piece of syntax, like the exclamation mark at the end.

Now, depending on what sort of tokenizer you're using, you can build tokens from completely different things. You can build tokens from the bytes within the text, or you can use word-piece encoding. There's no great example of that in this case, but say we had the word "something": we could easily split this into probably three different word pieces, so we'd have "some", then "ing" at the end, which is a common part of a word and would be a word piece in itself, and then we'd also have the "thing" part in the middle there.

So when we tokenize, it doesn't have to be a single word for each token; it can be a whole host of different things. Then we go from those tokens to the token IDs, which we see at the bottom, so in this case "hello" is being translated to 7592, the integer, and then we have "world" and also the exclamation mark as well.
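That text → tokens → token IDs pipeline looks something like this in code; a minimal sketch, assuming the transformers library and the bert-base-uncased tokenizer (the exact tokens and IDs depend on which tokenizer you load, but the 7592 for "hello" matches the value on screen):

```python
from transformers import AutoTokenizer

# load a pretrained tokenizer (bert-base-uncased is assumed here)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "hello world!"

# step one: original human-readable text -> tokens
tokens = tokenizer.tokenize(text)
print(tokens)      # e.g. ['hello', 'world', '!']

# step two: tokens -> transformer-readable token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)   # e.g. [7592, 2088, 999]
```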

So that's the process, that's what we're doing, but how do we do that with a Hugging Face transformer? Well, we have these two files that our tokenizer is built from, these two here, and if you've followed some of my previous videos on building a tokenizer you will recognize both of these files. They correspond to the two steps: the first, merges.txt, takes us from that original text here to our tokens down here, so that's step one, and then step two is where we take those tokens we built in step one, process them through vocab.json, and that produces our transformer-readable token IDs, which we see at the bottom there.
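For context, a vocab.json and merges.txt pair like this is typically what you get when training a byte-level BPE tokenizer with the separate tokenizers library; here's a minimal sketch, where "data.txt" and the output directory are hypothetical placeholders:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# train a byte-level BPE tokenizer on a plain-text corpus
# ("data.txt" is a placeholder for your own training file)
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(files=["data.txt"], vocab_size=30_000, min_frequency=2)

# save_model writes the two files discussed above:
# merges.txt (the merge rules that build tokens) and vocab.json (token -> ID)
os.makedirs("my_tokenizer", exist_ok=True)
bpe_tokenizer.save_model("my_tokenizer")
```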

Now, there are a few different tensors that we need for feeding into our model. We've just seen how we build the input IDs, or token IDs; those are essential, we need them for every transformer model. We also have the attention mask (I'll just write "mask" for now). These are the typical ones that we'd see. The attention mask is typically a tensor containing ones and zeros: the ones correlate to the real tokens in our token IDs tensor, and the zeros correlate to padding tokens in the token IDs tensor. Then we also have the token type IDs, which you can also call segment IDs. Segment IDs are used when we have multiple segments in our inputs. For example, if we're doing question answering and feeding into BERT, we would have our question (I believe it's in this order), then in the middle a separator token (I'll just write SEP), and then the context that we're getting the answer to our question from. In the segment IDs, anything that belongs to our question is represented by a zero, and anything that belongs to our context is represented by a one. So those are the three key tensors we'd be using, and here's just a visualization of the attention mask: the real tokens are represented by ones in the attention mask tensor, and the padding tokens by zeros.

I think that's all we really need as a quick summary of tokenizers in Transformers, so how does it relate to what we're doing here? Let's create a new cell and take this as our first example. We have our text, "hello world", and let's have a look at what happens when we use tokenizer.tokenize alone. If we do this, we see that we create our tokens, so straight away we know that this first method does our tokenization in the steps that we outlined before; it doesn't do everything all at once, it does them step by step. You can probably guess that the next step after creating those tokens is to convert them into token IDs, which we do there with convert_tokens_to_ids. Now, this is a completely valid method; it works, it's good, but it's pretty simple. With what we have here we can't create PyTorch or TensorFlow tensors, there are no arguments for adding padding or truncation, which we almost always need, and we also can't add special tokens. So it works, it's fine, but it's very simple, so maybe it's not the best option if you want to do all of that automatically. If you want to do it manually, though, you can go ahead and add your special tokens, your padding and your truncation yourself without a problem, and then convert the result into PyTorch tensors.

The maybe easier method is to go ahead and use encode. If we look at encode here, you can see we have the same actual tokens as we got up here, the 7592 up to the 999, so that is our text tokenized, or converted into token IDs, but we also have this 101 and 102. Now, if you don't know what those are, that's fine: they are special tokens that BERT uses to indicate the start of a sequence (the 101) and the end of a sequence (the 102). There's also another special token that we'll see in a minute, which is zero, the padding token that BERT uses.
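To make those special tokens concrete, here's a small sketch, again assuming a bert-base-uncased tokenizer; the 101, 102 and 0 values match the ones described above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# encode goes straight from text to token IDs and adds the special tokens for us
ids = tokenizer.encode("hello world!")
print(ids)                     # e.g. [101, 7592, 2088, 999, 102]

# the special tokens BERT uses
print(tokenizer.cls_token_id)  # 101 -> [CLS], start of sequence
print(tokenizer.sep_token_id)  # 102 -> [SEP], end of sequence / separator
print(tokenizer.pad_token_id)  # 0   -> [PAD], padding
```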
Now, if we were to use this method to actually build a tensor for PyTorch, for BERT, we would probably write something like this: we'd set max_length equal to 512, we'd set padding equal to "max_length", and then we'd make sure that we return PyTorch tensors. Then we see we get this big PyTorch tensor where all of these zeros are padding tokens, and it goes up to a length of 512, which is the correct size for BERT base, so that makes sense. Let me just restrict how much of that we're seeing a little bit; let's go to 10.

Okay, so that's encode. Up here we also have encode_plus, so let's try this and see what it is. With encode we just got our input IDs, so referring back to here, we have our token IDs or input IDs, but we don't have the mask or the segment IDs, which we also need; that's a limitation of the encode method, and it gets fixed by the encode_plus method. If we run that, we see that instead of getting a single list we return a dictionary that contains the input IDs (or token IDs), the token type IDs (or segment IDs), and the attention mask, so straight away that looks a lot better. We can also use all the same arguments that we used with encode, so let's change to encode_plus, and I'll remove the return_tensors argument for now and add it back in a minute so you can see the output. Okay, we have our input IDs, and if we go down a little bit we have our token type IDs; these are just zeros because we don't have two sequences in there, but if we were to pass two sequences we would get the zeros and ones. Then we also see we have the attention mask.

So that's three methods: we have tokenize and convert_tokens_to_ids, then we have encode and encode_plus. Now, you may have guessed already from the name, but we also have this batch_encode_plus. batch_encode_plus allows us to do the same as encode_plus, but for batches of sentences. So let's go down here, remove that, and create a text list; in here we're going to have our text, and I'm going to add another item, which is also "hello world" again. Now, if I were to pass that text list to encode_plus, we see that we get this pretty weird output that just doesn't look right, and that's because it isn't right: we can't pass a list to encode_plus, it won't work, we have to pass each string one at a time. What we can actually see here are these 100 tokens, and 100 is the unknown token; that's because we're passing two string objects in a list, and the tokenizer is reading each string object as a whole and saying "I have no idea what this is", so it's just giving it an unknown token. So we can't use encode_plus; instead we use batch_encode_plus, like that, and now we see that we don't get just one of our tensors, we get two, so we get an array for each one of these strings.

Now, because the output is a dictionary, if we write token_ids["input_ids"] we can access each one of those arrays, so let's have a look at the shape. Oh, sorry, I need to add return_tensors first, because we can only use the shape attribute when we have a tensor. And then we get this error: the reason is that our two arrays are of different lengths, and we can't create a tensor where we have differently sized rows, if that makes sense.
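Putting those pieces together, here's a minimal sketch of the cells described above, with padding so the batch rows come out the same length and can be stacked into one tensor (bert-base-uncased is assumed, and the second string in the list is approximated from the video):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "hello world!"
text_list = ["hello world!", "hello world again!"]  # second string approximated

# encode: a single list of token IDs, padded and returned as a PyTorch tensor
ids = tokenizer.encode(text, max_length=10, padding="max_length",
                       truncation=True, return_tensors="pt")
print(ids)  # e.g. tensor([[ 101, 7592, 2088,  999,  102,    0,    0,    0,    0,    0]])

# encode_plus: a dictionary with input IDs, token type IDs and attention mask
encoded = tokenizer.encode_plus(text, max_length=10, padding="max_length",
                                truncation=True, return_tensors="pt")
print(list(encoded.keys()))  # ['input_ids', 'token_type_ids', 'attention_mask']

# batch_encode_plus: the same, but for a list of strings at once;
# padding makes both rows length 10, so a single tensor can be built
token_ids = tokenizer.batch_encode_plus(text_list, max_length=10, padding="max_length",
                                        truncation=True, return_tensors="pt")
print(token_ids["input_ids"].shape)  # torch.Size([2, 10])
```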
So, I'll change the max length to 10 so it's not huge, and you'll see in a moment that in our first row we've added five padding tokens, and in the second we've added four, to make them both the same size so we can actually create a tensor from them. Let's now go for the shape, and we see that we have our two strings, which have both been tokenized and converted into a tensor row of 10 values each. So that's batch_encode_plus; we can feed a huge number of strings into it, we just need to pass them as a list.

And that leads us on to our final method, which is tokenizer itself. You see that so far we've been using tokenizer followed by a method within that class or object; this time we're just calling the class directly. So we write tokenizer and pass in text, and if we look at this, it looks like it's doing the same as encode_plus; if we compare it to encode_plus, it's exactly the same output. Now, what if we change the input to text_list? Okay, now we're getting the same output as we got from batch_encode_plus. So what calling tokenizer directly is doing is looking at the data type of the input that we're feeding into it: is the data type a string, as is the case with text, or is it a list, as is the case with text_list? If it's a string, it calls the encode_plus method; if it's a list, it calls the batch_encode_plus method. That's all that tokenizer is doing, so generally tokenizer is usually the way to go: if you're not sure whether you've got batches or single strings coming through, it can be very useful to call tokenizer directly. With tokenizer it's also worth noting that we can use the same parameters, so if we take these, we can use all of them just as we did with encode and encode_plus, like so.

So those are all the tokenization methods, or at least the main ones, in Transformers. I'd imagine there are probably more that I'm not aware of, and if you know of any, let me know in the comments below; it would be pretty interesting to see more of them. But these are probably the five main ones, and I've seen a few questions on them before, so I thought it would be worth covering them; I was also curious myself as to what the specific differences are between each one. So that's it for this video, thank you very much for watching, and I'll see you in the next one.
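To round the whole thing off, here's a minimal sketch of that final method, calling the tokenizer object directly (again assuming bert-base-uncased): a single string behaves like encode_plus, and a list of strings behaves like batch_encode_plus:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "hello world!"
text_list = ["hello world!", "hello world again!"]

# a single string: same output as encode_plus
single = tokenizer(text, max_length=10, padding="max_length",
                   truncation=True, return_tensors="pt")
print(single["input_ids"].shape)  # torch.Size([1, 10])

# a list of strings: same output as batch_encode_plus
batch = tokenizer(text_list, max_length=10, padding="max_length",
                  truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)   # torch.Size([2, 10])
```

It's also worth noting that newer versions of transformers steer you towards this calling convention, with encode_plus and batch_encode_plus kept around mainly for backwards compatibility.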