Build a Custom Transformer Tokenizer - Transformers From Scratch #2
Chapters
0:00
3:10 Training the Tokenizer
5:21 Vocab Size
12:27 Encoding Text
Hi, and welcome to the video. We're going to have a look at how we can build our own tokenizer in Transformers from scratch. This is the second video in our Transformers from scratch series, and what we're going to cover is the tokenizer itself. We've already got our data, so we can cross that off and move on to the tokenizer.

Let's move over to our code. In the previous video we created all of these files: a lot of text files containing the Italian subset of the OSCAR dataset. If we open one, we just get a lot of Italian text, and each sample in the file is separated by a newline character.
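If you want to sanity-check the data yourself, a peek at one of those files might look like the sketch below; the filename text_0.txt is just a hypothetical example of how the files from the previous video were named.

```python
from pathlib import Path

# Read one of the plain-text files from the previous video.
# The filename here is a hypothetical example, not a fixed name.
sample_file = Path('text_0.txt').read_text(encoding='utf-8')

# Each sample is separated by a newline character.
samples = sample_file.split('\n')
print(samples[:3])
```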
Let's go ahead and begin using that data to build our tokenizer. First, we want a list of the paths to all of our files. We're going to use pathlib for this; you could also use os.listdir, it's up to you. So from pathlib we import Path. I'm using it because I've noticed people are using it a lot at the moment for machine learning work. I'm not sure why you would choose it over os.listdir, but it's what people are using, so let's give it a go and see how it is.

We want to create a string from each path object we get, so we write a list comprehension: for x in Path, where we just tell Path where to look. We're in the same directory, so we don't need to point it anywhere special. Then at the end we call glob, which I think is why people like pathlib, and pass a wildcard saying we want all text files in this directory. If we run that and look at the first five entries, we can see that we have our text file paths, which is good.
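Put together, that step looks roughly like this, assuming the text files sit in the current working directory:

```python
from pathlib import Path

# Build a list of string paths, one per .txt file in the current directory.
paths = [str(x) for x in Path('.').glob('*.txt')]

# Quick check that the files were picked up.
print(paths[:5])
```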
What we can do now is move on to actually training the tokenizer. The tokenizer we're going to use is a byte-level byte-pair encoding tokenizer, or byte-level BPE tokenizer. Essentially, that means it breaks our text down into bytes. Most tokenizers you've probably used, unless you've used this one before, tend to need unknown tokens. BERT, for example, uses a WordPiece encoding, and it has to have an unknown token for when there is no token for a specific word, like some new word. With the byte-level BPE tokenizer, we're breaking things down into bytes, so we don't actually need an unknown token anymore, which I think is pretty cool.

To use it, we import from tokenizers, which is another Hugging Face package, so you might need to install it first with pip install tokenizers. From there we import ByteLevelBPETokenizer and initialize our tokenizer. That's our tokenizer initialized, but we haven't trained it yet. To train it, we call tokenizer.train, and in there we include the files we're training on. This is why we have that paths variable up above: it's just the list of all the text files we created, where each sample is separated by a newline character.
Next is the vocab size. We're building a RoBERTa-style model here, and a typical RoBERTa vocab size is around 50K. You can use that if you want, it's up to you, but I'm going to stick with the typical BERT size of around 30K, just because I don't think we need that much. We're just figuring things out here, and a smaller vocabulary means less training time, which is a good thing in my opinion. We also set the min frequency, which says the minimum number of times a word, part of a word, or byte (it's a little different with this tokenizer) has to appear before it's added to our vocabulary.

Then we also need to include our special tokens. We're using the RoBERTa special tokens here: the start-of-sequence token <s>, the padding token <pad>, the end-of-sequence token </s>, the unknown token <unk> (which, this being a byte-level encoding, you'd hope it doesn't need to use very much, but it's there anyway), and the masking token <mask>. That's everything we need to train our tokenizer.
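Putting the whole training step together, a minimal sketch looks like this; the exact vocab_size and min_frequency values (30,522 and 2) are assumptions based on the typical BERT setup mentioned above rather than values taken from the video.

```python
from tokenizers import ByteLevelBPETokenizer

# Initialize a blank byte-level BPE tokenizer.
tokenizer = ByteLevelBPETokenizer()

# Train it on our plain-text files.
tokenizer.train(
    files=paths,                 # list of .txt paths from earlier (a subset is faster)
    vocab_size=30_522,           # typical BERT-sized vocab (assumed value)
    min_frequency=2,             # assumed minimum occurrence count
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'],
)
```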
One thing I do remember is that if you train on all of those files, it takes a really, really long time, which is fine if you're training overnight or something, but that's not what we're doing here. So I'm just going to shorten that to the first 100 files, and maybe I'll train it on the full set after this. I'll leave that to train for a while and be back when it's done.
Okay, the tokenizer has finished training, so we can go ahead and save it. I'm going to import os, just so I can make a new directory to store the tokenizer files in. A typical Italian name, or so I've been told, is Filiberto, which fits really well with BERT, so that's the name of our Italian BERT model and of our new directory. If we look over in the working directory, we now have this new directory, filiberto, and that's where we're going to save our tokenizer. We just call tokenizer.save_model. You can see there are two options here, save and save_model: save just writes a JSON file with the tokenizer data inside it, but I don't think that's the standard way of doing it; save_model is the one you want. We save into filiberto, and we see that we get two new files, vocab.json and merges.txt.
Looking in the directory, we see both of those files, and they are essentially the two steps of tokenization for our tokenizer. When we feed text into the tokenizer, it first goes through merges.txt, where characters and partial words are paired up and merged into tokens. Scrolling down, we can see different merges. For example, we find "zione", which (although my Italian is very bad) is like the English "-tion": where English says "attention", Italian has "attenzione". It's a common part of a word, and it gets merged into a single token.

After that, the tokenizer moves on to vocab.json. If we go to the top and clean it up a little, we can see it's a JSON object, like a dictionary in Python, containing all of our tokens and the token IDs they get translated into. Scrolling down, we can find "zione" again, mapped to its token ID. So that's the full tokenizer process: text goes through the merges in merges.txt to produce tokens, and vocab.json then maps those tokens to token IDs.
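If you want to peek at those two files yourself, a quick way to inspect them (assuming the filiberto directory from above) is:

```python
import json

# vocab.json maps token strings to token IDs.
with open('filiberto/vocab.json', encoding='utf-8') as f:
    vocab = json.load(f)
print(len(vocab), list(vocab.items())[:5])

# merges.txt lists the learned merge rules (after a version header), one pair per line.
with open('filiberto/merges.txt', encoding='utf-8') as f:
    merges = f.read().split('\n')
print(merges[:5])
```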
Let me just open that file back up. If we want to load the tokenizer, we do it just like we normally would with Transformers. We write from transformers import RobertaTokenizer, since we're using a RoBERTa tokenizer here; you can use either RobertaTokenizer or the fast version, RobertaTokenizerFast, it's up to you. Then we initialize our tokenizer with from_pretrained, and rather than passing a model name from the Hugging Face Hub, we pass the local path to our model directory, which for us is filiberto. Then we can use it to begin encoding text.
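A minimal sketch of that load, again assuming the local 'filiberto' directory:

```python
from transformers import RobertaTokenizer

# Load the tokenizer from our local directory instead of a Hub model name.
tokenizer = RobertaTokenizer.from_pretrained('filiberto')
```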
So we go, "Ciao, come va," which is like, "Hi, how are you?" If we write that, we can see that we get, 00:12:38.480 |
these are the tokens here. I wonder if we did a 10. So I'll do it. I'll try in a minute. 00:12:48.800 |
So we have the sort of sequence token here and the sequence token here. So the S and the 00:12:58.400 |
D, S like that. So we have those at the sign end of each 00:13:04.400 |
sequence. And we can also add padding in there. So padding equals max length. And also max length 00:13:12.960 |
needs to have a value as well. So max length, like 12. And then we get these padding tokens, 00:13:19.760 |
which are the ones. So that's pretty cool. And I just want to, purely out of curiosity than anything 00:13:25.680 |
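Sketching that encoding step (the sentence and max_length just follow the example above):

```python
# Encode a short Italian sentence, padding it out to a fixed length of 12.
tokens = tokenizer('ciao, come va?', padding='max_length', max_length=12)

# With the special-token order used in training, <s>=0, <pad>=1, </s>=2.
print(tokens['input_ids'])
print(tokens['attention_mask'])
```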
Now, purely out of curiosity more than anything else, let's encode "attenzione" and see whether we recognize the token ID for "zione" in there. We don't, so this is presumably being encoded as the full word, and in fact it is: "attenzione" comes out as a single token. If we encode just "zione" on its own, maybe we get the ID we saw in vocab.json earlier; I can't remember exactly what number it was, but anyway, that's how everything works.
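One way to check this kind of thing, not necessarily what's on screen in the video, is to look at the tokens directly rather than the IDs:

```python
# Inspect how a word is split into sub-word tokens.
print(tokenizer.tokenize('attenzione'))

# Map the resulting IDs back to token strings for a full sentence.
ids = tokenizer('ciao, come va?')['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))
```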
That's it for this video. In the next video, we'll take a look at how we can use this tokenizer to build out our input pipeline for training the actual transformer model. So that's everything, and I'll see you in the next one.