
Build a Custom Transformer Tokenizer - Transformers From Scratch #2


Chapters

0:00
3:10 Training the Tokenizer
5:21 Vocab Size
12:27 Encoding Text

Whisper Transcript

00:00:00.000 | Hi and welcome to the video. We're going to have a look at how we can build our own tokenizer
00:00:06.400 | in Transformers from scratch. So this is the second video in our Transformers from scratch
00:00:14.960 | series and what we're going to be covering is the actual tokenizer itself. So we've already
00:00:24.160 | got our data, so we can cross that off and move on to the tokenizer. So let's move over to our code. So in
00:00:33.040 | the previous video we created all these files here. So these are just a lot of text files
00:00:44.160 | that contain the Italian subset of the OSCAR dataset. Now let's maybe open one.
00:00:55.360 | Ignore that; we just get all this Italian text. Now each sample in this text file
00:01:06.640 | is separated by a newline character. So let's go ahead and begin using that data
00:01:21.600 | to build our tokenizer. So we first want to get a list of all the paths to our files.
00:01:31.600 | So we are going to be using pathlib. You could also use os.listdir as well; it's up to you.
00:01:38.080 | So we import Path, so from pathlib
00:01:46.000 | import Path. I'm using this one because I've noticed that people are using it
00:01:54.880 | a lot at the moment for machine learning stuff. I'm not sure why you would use it over os.listdir,
00:02:02.160 | but it's what people are using, so let's give it a go and see how it is. So we have this.
00:02:10.960 | And we just want to create a string from each path object that we get. So for x in,
00:02:21.520 | and then in here we need to write Path, and in here we just want to
00:02:28.000 | basically tell it where to look. So we're using Path here and we're just in the same directory,
00:02:36.000 | so we don't really need to do anything here. That's fine. And then at the end we are
00:02:41.200 | going to use glob here. I think this is why people are using this. And we just pass a wildcard
00:02:48.720 | because we want all text files in this directory. So we just write that. Now let's run that. I'll
00:03:01.760 | look at the first five and see that we have our text file paths now. So that's good.
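As a rough sketch, that path collection step looks something like this, assuming the text files sit in the current working directory:

```python
from pathlib import Path

# gather every .txt file in the current directory as a plain string path
paths = [str(x) for x in Path('.').glob('*.txt')]

paths[:5]  # peek at the first five to check we picked up the text files
```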
00:03:08.480 | What we can now do is move on to actually training the tokenizer. So the tokenizer that we're going to
00:03:14.880 | be using is a byte-level byte pair encoding tokenizer, or byte-level BPE tokenizer. And essentially what
00:03:26.480 | that means is that it's going to break our text down into bytes. So with most tokenizers
00:03:36.320 | that we probably use, unless you've used this one before, we tend to have unknown tokens.
00:03:45.840 | So for BERT, for example, we use WordPiece encodings and we have to have this unknown token for when
00:03:53.760 | we don't have a token for a specific word, like for some new word. Now with the BPE tokenizer,
00:04:05.680 | we are breaking things down into bytes, so essentially we don't actually need an unknown
00:04:10.880 | token anymore, which I think is pretty cool. Now to use that, we need to do from tokenizers,
00:04:19.920 | so this is another Hugging Face package, so you might need to install it first with
00:04:28.160 | pip install tokenizers, and we want to import ByteLevelBPETokenizer, like that. Okay. Now we take
00:04:39.840 | that and we're going to initialize our tokenizer. So we just write that. That's our tokenizer
00:04:51.760 | initialized. We haven't trained it yet. To train it, we need to write tokenizer.train.
00:04:59.040 | And then in here we need to include the files that we're training on. So this is why we have
00:05:06.880 | that paths variable up here. So this is just a list of all the text files that we created,
00:05:12.480 | in which each sample is separated by a newline character.
00:05:21.200 | Now the vocab size: we're going to be using a RoBERTa model here, and I think the typical
00:05:32.320 | RoBERTa vocab size is 50K. Now you can use that if you want, it's up to you,
00:05:41.360 | but I'm going to stick with the typical BERT size of around 30K, just because I don't think we need that much.
00:05:50.160 | You know, we're just figuring things out here. So, you know, this is going to mean
00:05:54.960 | less training time, and that's a good thing in my opinion. Then we have the min frequency.
00:06:03.040 | So this is saying what the minimum number of times is that you want to see a word, or a part of a word,
00:06:11.760 | or a byte (it's kind of weird with this tokenizer), before it gets added into our vocabulary.
00:06:19.680 | So that's all that is. Okay. And then we also need to include our special tokens.
00:06:26.720 | So we're using the RoBERTa special tokens here. So special_tokens, and then in here,
00:06:32.720 | we have our start of sequence token <s>. So I'm going to put this on a new line.
00:06:36.640 | Not like that, like this. So we have this start of sequence token <s>, the padding token <pad>,
00:06:49.600 | the end of sequence token </s>, which is like this, and the unknown token <unk>, which, with it being a
00:06:57.440 | byte-level encoding, you'd hope it doesn't need to be used very much, but it's there anyway.
00:07:04.240 | And the masking token <mask>. So that's everything we need to train our model.
00:07:17.920 | Okay. And one thing I do remember is that if you train on all of those files,
00:07:27.120 | it takes a really, very long time, which is fine if you're training it overnight or something,
00:07:33.040 | but that's not what we're doing here. So I'm just going to shorten that to the first 100
00:07:40.640 | files, and maybe I'll train it after this with the full set. Let's see. So I will leave that to
00:07:50.000 | train for a while and I'll be back when it's done.
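Putting the above together, a minimal training sketch might look like the following. The exact vocab_size of 30_522 (the usual BERT size) and min_frequency of 2 are assumptions on my part rather than values confirmed in the transcript:

```python
from tokenizers import ByteLevelBPETokenizer

# initialize an empty byte-level BPE tokenizer
tokenizer = ByteLevelBPETokenizer()

# train on a slice of the files so it doesn't take all night
tokenizer.train(
    files=paths[:100],
    vocab_size=30_522,   # roughly the typical BERT vocab size (assumed)
    min_frequency=2,     # assumed minimum count before a token joins the vocab
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>']
)
```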
00:07:57.920 | Okay, so it's finished training our tokenizer, and we can go ahead and actually save it. So I'm going to import os. I'm just doing this so I can
00:08:06.400 | make a new directory to store the tokenizer files in. And a typical Italian name, or so I've been
00:08:15.440 | told, is Filiberto, which fits really well with BERT. So this is our Italian BERT model name,
00:08:25.440 | Filiberto. So that is our new directory. And if we just come over to here, we have this working
00:08:38.000 | directory, which is what I'm in. And then we have this new directory, Filiberto, in here. That's
00:08:45.680 | where we're going to save our tokenizer. So we just write tokenizer.save_model. And here we can
00:08:53.600 | see we can do save or save_model. save just saves a JSON file with our tokenizer data inside
00:09:02.160 | it, but I don't think that's the standard way of doing it. I think save_model is the way that you want
00:09:09.680 | to be doing it. And we're saving it as Filiberto, like that.
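As a sketch of that save step, with the directory name written here in lowercase as an assumption:

```python
import os

# make a directory to hold the tokenizer files
os.mkdir('filiberto')

# save_model writes vocab.json and merges.txt into that directory
tokenizer.save_model('filiberto')
```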
00:09:19.360 | So we'll do that, and we see that we get these two new files, vocab.json and merges.txt. Now, if we look over here, we see both of those.
00:09:27.520 | And these are essentially like two steps of tokenization for our tokenizer.
00:09:35.280 | So when we feed text into our tokenizer, it first goes to merges.txt. And in here, we have
00:09:46.560 | characters, words, so on. And they are translated into these tokens. So these are characters on the
00:09:56.640 | right, tokens on the left. So we scroll down. We can see different ones. We can keep going.
00:10:03.360 | So here, we have zione. That's like, although my Italian's very bad, that is like the English t-i-o-n.
00:10:16.080 | So t-i-o-n. And we would say stuff like attention, right? Italians have the same,
00:10:25.680 | but they have like attenzione. So that's what we have there. So it's part of a word,
00:10:33.840 | and it's pretty common. And that gets translated into this token here. Now, after that,
00:10:41.600 | our tokenizer moves into vocab.json. And I don't know why it started at the bottom there.
00:10:50.800 | Go to the top. If I clean this up quickly, we can see we have a JSON object. It's like
00:11:01.520 | a dictionary in Python. And we have all of our tokens and the token IDs that they will get
00:11:08.960 | translated into. So if we scroll down here, we should be able to find it. Was it 'va', I think?
00:11:16.000 | Okay, so there it is, the token that our 'zione' was translated into. And then that eventually gets
00:11:24.640 | converted into this token ID. So that's our full tokenizer process.
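To make that two-step process concrete, here is a small illustrative sketch (not from the video) that reloads the two saved files with the low-level tokenizer and prints both the merged tokens and the IDs they map to; 'attenzione' is just an example word:

```python
from tokenizers import ByteLevelBPETokenizer

# load vocab.json and merges.txt straight back into a byte-level BPE tokenizer
tokenizer = ByteLevelBPETokenizer('filiberto/vocab.json', 'filiberto/merges.txt')

encoding = tokenizer.encode('attenzione')
print(encoding.tokens)  # the sub-word tokens produced via merges.txt
print(encoding.ids)     # the IDs those tokens map to in vocab.json
```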
00:11:34.400 | Just open that file back up. If we wanted to load that, we would do that like we normally would
00:11:40.880 | with Transformers. So we just write from transformers import
00:11:44.960 | RobertaTokenizer. So we're using a RoBERTa tokenizer here. We can use either
00:11:55.280 | RobertaTokenizer or the fast version, RobertaTokenizerFast; it's up to you. And we just initialize our tokenizer
00:12:03.440 | like that, with from_pretrained. And in here, rather than putting a model name from the
00:12:12.000 | Hugging Face website, we would put the local path to our directory, our model directory.
00:12:19.680 | So it's Filiberto for us. And then we can use that to begin encoding text.
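A quick sketch of that loading step:

```python
from transformers import RobertaTokenizerFast

# point from_pretrained at the local directory rather than a model name on the Hub
tokenizer = RobertaTokenizerFast.from_pretrained('filiberto')
```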
00:12:29.360 | So we go, "Ciao, come va," which is like, "Hi, how are you?" If we write that, we can see that we get,
00:12:38.480 | these are the tokens here. I wonder if we did a 10. So I'll do it, I'll try in a minute.
00:12:48.800 | So we have the start of sequence token here and the end of sequence token here. So the <s> and the
00:12:58.400 | </s>, like that. So we have those at the start and end of each
00:13:04.400 | sequence. And we can also add padding in there. So padding='max_length'. And also max_length
00:13:12.960 | needs to have a value as well. So max_length, like 12. And then we get these padding tokens,
00:13:19.760 | which are the 1s. So that's pretty cool.
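Roughly, that encoding call looks like this. The exact IDs depend on your trained vocab, but with the special tokens added in the order above, <s> is 0, <pad> is 1 and </s> is 2:

```python
# encode a short Italian phrase, padding it out to a fixed length of 12
tokens = tokenizer('Ciao, come va?', padding='max_length', max_length=12)

print(tokens['input_ids'])
# expect something like [0, ..., 2, 1, 1, ...] where 0 = <s>, 2 = </s>, 1 = <pad>
```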
00:13:25.680 | And then, purely out of curiosity more than anything else, we have "attenzione." Let's see if we recognize the number there. So no,
00:13:33.280 | we don't. So I suppose this is probably the full word. In fact, it is. So this is the full token
00:13:42.960 | here. If we just do this, maybe we will get, I can't remember what number it was,
00:13:50.800 | the 3322? Maybe that's right, I'm not sure.
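If you want to check this yourself, a quick way (a sketch, not taken from the video) is to tokenize the word directly and look at what comes back:

```python
# see whether 'attenzione' survives as a single token and which ID it gets
print(tokenizer.tokenize('attenzione'))
print(tokenizer('attenzione')['input_ids'])
```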
00:13:57.120 | But anyway, that's how everything works. So that's it for this video. In the next video, we will take a look at
00:14:04.720 | how we can use this tokenizer to build out our input pipeline for training our actual
00:14:11.440 | transformer model. So that's everything, and I'll see you in the next one.