
Build a Custom Transformer Tokenizer - Transformers From Scratch #2


Chapters

0:00
3:10 Training the Tokenizer
5:21 Vocab Size
12:27 Encoding Text

Transcript

Hi, and welcome to the video. We're going to have a look at how we can build our own tokenizer in transformers from scratch. This is the second video in our Transformers from scratch series, and what we're going to be covering is the tokenizer itself. We've already got our data, so we can cross that off and move on to the tokenizer.

So let's move over to our code. In the previous video we created all these files here: a lot of text files that contain the Italian subset of the OSCAR dataset. Let's open one; ignore that, and we can see we just get all this Italian.

Now, each sample in this text file is separated by a newline character. So let's go ahead and begin using that data to build our tokenizer. First we want to get a list of all the paths to our files, and for that we're going to be using pathlib. You could also use os.listdir here as well.

It's up to you. So we do from pathlib import Path. I'm using this one because I've noticed that people are using it a lot at the moment for machine learning work; I'm not sure why you would pick it over os.listdir.

But it's what people are using, so let's give it a go and see how it is. We just want to create a string from each path object that we get: so str(x) for x in, and then in here we write Path, and we just need to tell it where to look.

We're using Path here and we're in the same directory, so we don't really need to pass anything there. And then at the end we use glob, which I think is why people like this approach, and we give it a wildcard because we want all the text files in this directory.
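Putting that together, a minimal sketch of the path-gathering step might look like this (assuming the text files sit in the current working directory; adjust the glob pattern if yours live in a subfolder):

```python
from pathlib import Path

# Gather the plain-text files created in the previous video.
# Each file holds samples from the Italian OSCAR subset, one per line.
paths = [str(x) for x in Path('.').glob('*.txt')]

print(paths[:5])  # peek at the first five paths
```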

So we write that and run it, and looking at the first five we can see we have our text file paths now. That's good. What we can now do is move on to actually training the tokenizer. The tokenizer we're going to be using is a byte-level byte pair encoding tokenizer, or BPE tokenizer.

Essentially what that means is that it's going to break our text down into bytes. With most tokenizers that we probably use, unless you've used this one before, we tend to have unknown tokens. For BERT, for example, which uses WordPiece encodings, we have to have an unknown token for when we don't have a token for a specific word, like some new word.

Now, with the BPE tokenizer, we are breaking things down into bytes, so essentially we don't actually need an unknown token anymore, which I think is pretty cool. To use it, we need to import from tokenizers. This is another Hugging Face package, so you might need to install it.

So pip install tokenizers, and then we import ByteLevelBPETokenizer, like that. Okay. Now we take that and initialize our tokenizer. We just write that, and that's our tokenizer initialized; we haven't trained it yet.
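As a quick sketch, the import and initialization described here would look something like this:

```python
from tokenizers import ByteLevelBPETokenizer

# Initialize an (as yet untrained) byte-level BPE tokenizer.
tokenizer = ByteLevelBPETokenizer()
```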

To train it, we call tokenizer.train, and in there we need to include the files that we're training on. This is why we have that paths variable up here: it's just a list of all the text files that we created, where each sample is separated by a newline character. Now, for the vocab size: we're going to be using a RoBERTa model here.

And I think the typical RoBERTa vocab size is 50K. You can use that if you want, it's up to you, but I'm going to stick with the typical BERT size, just because I don't think we need that much. You know, we're just figuring things out here.

So, you know, this is going to mean less training time, and that's a good thing in my opinion. We also set the min frequency, which is the minimum number of times you want to see a word, a part of a word, or a byte pair (it's a little different with this tokenizer) before it gets added to our vocabulary. So that's all that is.

Okay, and then we also need to include our special tokens. We're using the RoBERTa special tokens here, so special tokens, and the first one is our start of sequence token.

I'm going to put this on a new line, like this. So we have the start of sequence token <s>, the padding token <pad>, the end of sequence token </s>, and the unknown token <unk>, which, with this being a byte-level encoding, you'd hope it doesn't need to use very much, but it's there anyway.

And the masking token <mask>. So that's everything we need to train our model. Okay. One thing I do remember is that if you train on all of those files, it takes a very, very long time, which is fine if you're training it overnight or something, but that's not what we're doing here.
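Pulled together, the training call might look roughly like this. The vocab size of 30,522 (the "typical BERT size") and a min_frequency of 2 are assumptions for the values discussed above; paths is the file list built earlier:

```python
# Train the byte-level BPE tokenizer on our text files.
tokenizer.train(
    files=paths,                 # list of .txt files, one sample per line
    vocab_size=30_522,           # BERT-sized vocab rather than RoBERTa's ~50K
    min_frequency=2,             # minimum occurrences before a token joins the vocab
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>']
)
```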

So I'm just going to shorten that to the first 100 files, and maybe I'll train it on the full set after this; let's see. I will leave that to train for a while and be back when it's done. Okay, so it's finished training our tokenizer, and we can go ahead and save it.

So I'm going to import os; I'm just doing this so I can make a new directory to store the tokenizer files in. And a typical Italian name, or so I've been told, is Filiberto, which fits really well with BERT. So this is our Italian BERT model name, Filiberto, and that is our new directory.

And if we just come over to here, we have this working directory, which is the one I'm in, and then we have this new directory, Filiberto, inside it. That's where we're going to save our tokenizer. So we just write tokenizer.save_model, and we can see here that we can do either save or save_model.

Save just saves a JSON file with our tokenizer data inside it, but I don't think that's the standard way of doing it; I think save_model is the way you want to be doing it. And we're saving it into filiberto, like that. So we do that, and we see that we get these two new files, vocab.json and merges.txt.
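A minimal sketch of the directory creation and save step, assuming the directory is named 'filiberto' as in the video:

```python
import os

# Make a new directory for the tokenizer files.
os.mkdir('filiberto')

# save_model writes vocab.json and merges.txt into that directory.
tokenizer.save_model('filiberto')
```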

Now, if we look over here, we see both of those, and they are essentially two steps of tokenization for our tokenizer. When we feed text into our tokenizer, it first goes to merges.txt, where characters, parts of words, and so on are translated into tokens.

So these are the characters on the right and tokens on the left. If we scroll down, we can see different ones, and we can keep going. Here we have 'zione'. Although my Italian is very bad, that's like the English '-tion', as in a word like 'attention'.

Italians have the same thing, but it's 'attenzione'. So that's what we have there: it's part of a word, it's pretty common, and it gets translated into this token here. Now, after that, our tokenizer moves into vocab.json. I don't know why it started at the bottom there.

Go to the top. If I clean this up quickly, we can see we have a JSON object, like a dictionary in Python, with all of our tokens and the token IDs they will get translated into. So if we scroll down here, we should be able to find it; what was it, 'va', I think?

Okay, so there it is, our 'zione' token, and this is the token ID it eventually gets converted into. So that's our full tokenizer process. Let me just open that file back up. If we wanted to load the tokenizer, we would do it like we normally would with Transformers: from transformers we import the RoBERTa tokenizer.

So we're using a RoBERTa tokenizer here. We can use either RobertaTokenizer or the fast version; it's up to you. And we just initialize our tokenizer with from_pretrained, and in here, rather than putting a model name from the Hugging Face website, we put the local path to our model directory.

So for us it's filiberto. And then we can use that to begin encoding text. So we go with "ciao, come va", which is like "hi, how are you?" If we run that, we can see these are the tokens we get.

I'll try in a minute. So we have the sort of sequence token here and the sequence token here. So the S and the D, S like that. So we have those at the sign end of each sequence. And we can also add padding in there. So padding equals max length.

And max_length needs to have a value as well, so max_length of, say, 12. Then we get these padding tokens, which are the 1s. So that's pretty cool. And, purely out of curiosity more than anything else, we have 'attenzione'; let's see if we recognize the number there.

So no, we don't, so I suppose this is probably the full word. In fact, it is; this is the full token here. If we just do this, maybe we will get, I can't remember what number it was, 3322 maybe? Maybe that's right, I'm not sure.
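Putting the loading and encoding steps together, here is a rough sketch, assuming the tokenizer was saved to 'filiberto' as above (truncation is added here just for safety; it isn't discussed in the video):

```python
from transformers import RobertaTokenizerFast

# Load our trained tokenizer from the local directory.
tokenizer = RobertaTokenizerFast.from_pretrained('filiberto')

# Encode a short Italian greeting, padding out to a fixed length of 12.
tokens = tokenizer(
    'ciao, come va?',
    max_length=12,
    padding='max_length',
    truncation=True
)

print(tokens['input_ids'])       # 0 = <s>, 2 = </s>, 1 = <pad>
print(tokens['attention_mask'])
```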

But anyway, that's how everything works, so that's it for this video. In the next video, we'll take a look at how we can use this tokenizer to build out our input pipeline for training our actual transformer model. So that's everything, and I'll see you in the next one.