
Building Transformer Tokenizers (Dhivehi NLP #1)


Chapters

0:00 Intro
1:06 Dhivehi Project
2:28 Hurdles for Low Resource Domains
4:21 Dhivehi Dataset
4:52 Download Dhivehi Corpus
8:25 Tokenizer Components
8:44 Normalizer Component
11:55 Pre-tokenization Component
14:59 Post-tokenization Component
16:26 Decoder Component
17:41 Tokenizer Implementation
21:04 Tokenizer Training
24:22 Post-processing Implementation
27:12 Decoder Implementation
28:07 Saving for Transformers
30:33 Tokenizer Test and Usage
31:36 Download Dhivehi Models
32:21 First Steps

Whisper Transcript

00:00:00.000 | Today we're going to have a look at how we can design tokenizers or more
00:00:05.760 | specifically a BERT tokenizer for low resource languages. So when I say low
00:00:14.120 | resource language I mean a language where we don't really have that much
00:00:19.160 | data and there is not already a tokenizer or transformer model out there
00:00:26.680 | built for that specific language. Now there are transformer models and
00:00:33.540 | tokenizers for a lot or even most languages but there's still a lot of
00:00:42.600 | languages that are well simply less common and get less attention than
00:00:50.600 | something like English or Chinese. So you may find if you're working on these
00:00:56.400 | languages there are very few models out there or maybe no models out there. So
00:01:03.880 | that's what we're going to focus on today. We are going to more specifically
00:01:08.960 | focus on building a tokenizer and in the future we're going to do more of this as
00:01:13.640 | well more models and so on but we're going to focus on building a tokenizer
00:01:18.160 | for the Dhivehi language. Now Dhivehi is the language of the Maldives, so the Maldives
00:01:25.440 | islands in the Indian Ocean I'm sure a lot of you probably know them but they're
00:01:30.000 | very beautiful islands incredibly amazing weather and generally I would
00:01:37.360 | not usually think of the Maldives and NLP in the same context but there are a
00:01:44.040 | lot of people that do and in particular there is a guy called Ashraq or Ismail
00:01:49.560 | Ashraq and we've been speaking a lot about NLP and through that I've been
00:01:59.800 | made aware of how difficult it is to actually build models for languages
00:02:04.120 | where there is not much attention out there already. So what we decided to do
00:02:11.400 | is to kind of team up and try and put together some transformer models for the
00:02:18.680 | Dhivehi language, and the first step to that is obviously the tokenizer. You need a
00:02:26.040 | tokenizer before you build anything else so there are a few hurdles that we can
00:02:32.200 | overcome when we're building these models. So there are, well not zero, there are
00:02:36.640 | some pre-trained models for Dhivehi, but they're
00:02:40.960 | not necessarily that useful for what we want to apply them to and you may think
00:02:47.840 | okay there's those multilingual models with hundreds of languages included now
00:02:52.840 | from at least what I've seen, all of them miss Dhivehi. Another thing that is
00:02:57.800 | difficult with Dhivehi and other low resource languages is actually finding
00:03:03.160 | text data. Now labeled data is practically impossible for a lot of
00:03:08.440 | these languages but unlabeled unstructured data maybe that's a bit
00:03:13.400 | more reasonable: you can use a web scraper and scrape loads of text from
00:03:18.740 | websites in that language on the internet, and that
00:03:23.320 | is what Ashraq has done to get the data set I'm using today. And as well as that
00:03:30.240 | another really difficult thing is that Dhivehi uses a very unique writing system
00:03:36.840 | that is known as Thaana, and Thaana looks really cool and is definitely very
00:03:45.360 | unique and beautiful but it's not really very well supported even by those
00:03:50.840 | multilingual language models you can see here it looks really cool you're also
00:03:56.560 | going from right to left which is quite interesting everything's also in the
00:04:00.800 | same case there's no uppercase or lowercase which is again quite
00:04:04.480 | interesting I think so there's a lot of unique properties with this language
00:04:10.440 | that we need to deal with and from what I've seen there are no current tokenizers
00:04:16.320 | which actually support these characters so we really do need a tokenizer. First
00:04:22.080 | step is getting data so like I said before Ashraq went and did the hard part
00:04:26.760 | and took or scraped a load of Dhivehi text from the Dhivehi internet, and he managed
00:04:36.800 | to put together 16, or more than 16, million samples of Dhivehi paragraphs and
00:04:43.720 | sentences. Now obviously there are going to be little things in there that we
00:04:48.320 | maybe don't want but for the most part it's actually very good. So step number
00:04:54.040 | one is downloading the Dhivehi corpus from Ashraq, and you can see here it's
00:05:00.720 | hosted over here on Hugging Face so we can use the datasets library to go ahead
00:05:08.880 | and actually get that data which is pretty useful so we just do pip install
00:05:14.040 | if you don't have it already, datasets. Okay now once that's installed we can go
00:05:22.120 | ahead and we just write this, so we are going to Ashraq's Dhivehi corpus and we
00:05:29.640 | are taking the train data split from that corpus. Now it's fairly big like I
00:05:38.320 | said it's more than 16 million samples so what you can do if you don't want to
00:05:42.680 | load it all at once you can set streaming equals true and what that will
00:05:48.240 | do is create an iterable object which you can see down here and that means we
00:05:55.120 | can form a for loop and iterate through the data one sample at a time and not
00:06:01.480 | download the full thing in one go we just download each sample as we look
00:06:08.600 | through the whole list. So you can see here that's what I've done: for each row
00:06:16.800 | in the Dhivehi train split we're just printing one row
00:06:22.760 | at a time, and again you can see that really cool Dhivehi text in here.
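As a rough sketch of the streaming download just described (the exact dataset ID on the Hugging Face hub is an assumption here, check Ashraq's profile for the real one):

```python
from datasets import load_dataset  # pip install datasets

# Stream the corpus rather than downloading all 16M+ samples in one go.
# The dataset ID below is assumed; swap in the actual ID from the hub.
dv = load_dataset("ashraq/dhivehi-corpus", split="train", streaming=True)

# Streaming gives an iterable object; each row is a dict with a "text" key.
for i, row in enumerate(dv):
    print(row)
    if i == 2:  # just peek at the first three samples
        break
```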
00:06:31.160 | There's just three samples here, so it's very simple, we just have almost like a
00:06:37.040 | dictionary: we have text and then we have the Dhivehi text following it. So that is
00:06:45.200 | all I've done here; I'm creating a generator object because later on, when
00:06:53.480 | we are loading this data into the tokenizer, we're going to be loading
00:06:56.920 | it in as a generator, and that generator will expect us to load the text in
00:07:04.280 | almost like lines, so it will expect to iterate through and just receive the
00:07:09.160 | text; it will not expect to receive a dictionary which contains the key text
00:07:15.160 | which then contains the text; it expects just the text itself, okay, nothing else.
00:07:23.560 | So all I've done is create a generator object here, and you don't need enumerate,
00:07:29.960 | I'm not sure why that's there, so remove that, and all this does is iterate
00:07:37.920 | through each row and extract the text from each row, so it's literally going through
00:07:43.120 | and extracting the text and nothing else. So now when we actually use that, that's what we
00:07:52.120 | get. So here we're using this Dhivehi text function or generator, we are using the
00:07:59.560 | exact same code we did before, we're just printing the row, and you see now we don't
00:08:02.960 | get a dictionary, we get just plain text, which is what we want.
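Continuing from the dv object above, the generator described here could be sketched as follows (the name dv_text is just an illustrative choice):

```python
def dv_text():
    # Yield only the raw string from each row, not the {"text": ...} dict,
    # because train_from_iterator expects plain lines of text.
    for row in dv:
        yield row["text"]

# Using the generator returns plain text rather than dictionaries.
for i, text in enumerate(dv_text()):
    print(text)
    if i == 2:
        break
```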
00:08:09.080 | Now that will be fed into our tokenizer's train_from_iterator method, but that's kind of
00:08:16.740 | skipping ahead a bit, so let's go back and take a look at where this tokenizer
00:08:22.940 | is coming from in the first place. Okay, so in our tokenizer we have a
00:08:30.500 | few different steps. So let's say we have the tokenizer; it's not just a tokenizer,
00:08:38.840 | there are different components or sub-processes. The very first is
00:08:48.480 | normalization. Now this is optional, you don't need to do this, but
00:08:55.640 | it's useful. So say maybe we have something like these two words,
00:09:05.360 | so this is specifically European languages here, but we have sí and si. Now
00:09:14.520 | these are pretty similar and a lot of people may actually use these
00:09:20.120 | interchangeably, although they, as far as I know, are actually different words, but
00:09:25.800 | people may use them interchangeably, and if we would expect that then we can
00:09:30.800 | use something called NFKD Unicode normalization to actually take both of
00:09:39.200 | these and map them to essentially the same word, which would just be si. And this is particularly
00:09:44.800 | useful if you have weird characters, so for example rather than the letter f
00:09:52.160 | like this, you see people on social media using the weird ones,
00:09:59.460 | kind of like this, all these weird characters and stuff, but basically what
00:10:03.580 | they mean is an f, and because we're people we can understand that's what
00:10:08.440 | they're talking about, but a machine learning model is not going to see those
00:10:14.120 | as the same thing, because they have different Unicode code points. So what we do
00:10:19.280 | is use NFKD Unicode normalization and we convert them to being the same thing, so
00:10:24.320 | that can be quite useful. Another very useful thing is you can also lowercase
00:10:30.360 | within your normalization step, so that's also useful. Now like I said before, with
00:10:37.840 | Dhivehi you don't have uppercase and lowercase, so it's not really necessary for
00:10:42.280 | that, but if we do have, for example, English text in our data at any point,
00:10:49.040 | it's probably better we normalize it or lowercase it, because that means we will
00:10:55.280 | create fewer non-Dhivehi tokens. So for example, we have the word hello and we
00:11:06.360 | also have the word Hello, like this. Now without lowercasing these may become two
00:11:13.840 | separate tokens and take up more space in our tokenizer vocabulary, where we
00:11:21.240 | only really want to keep Dhivehi characters or words. So what we can do
00:11:26.600 | is lowercase that, and that reduces the likelihood of getting duplicate
00:11:32.600 | lowercase and uppercase non-Dhivehi tokens in there. So that's another
00:11:37.360 | thing we will do. So let me remove all of this: we are going to lowercase and
00:11:48.560 | also use NFKD. Okay, so that's the first component.
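As a quick illustration of what NFKD plus lowercasing does, using Python's standard unicodedata module rather than the tokenizers library:

```python
import unicodedata

# NFKD decomposes "compatibility" characters into plain base characters,
# so styled look-alikes map back to ordinary letters.
fancy_f = "\U0001d4bb"  # MATHEMATICAL SCRIPT SMALL F, a different code point from "f"
print(unicodedata.normalize("NFKD", fancy_f))          # -> "f"

# Lowercasing on top of that collapses case variants into one form.
print(unicodedata.normalize("NFKD", "Hello").lower())  # -> "hello"
```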
00:11:57.480 | The next component after that is pre-tokenization. Now pre-tokenization is a tokenization step that happens
00:12:09.160 | before we actually tokenize. So what do I mean by that? For pre-tokenization, let's say
00:12:15.760 | we have a string, hello world!, that's probably easiest. Now what this is going
00:12:28.120 | to do is just split this into very simple tokens so we might say anything
00:12:36.480 | with a space will get split or anything with space in between it will get split
00:12:42.080 | and maybe punctuation as well will get split. So then we would end up with the
00:12:47.200 | token list: from this we get hello, we get world, and we also get this
00:12:59.120 | exclamation mark. It's messy but I think that makes sense, so that is the
00:13:06.560 | pre-tokenization step and then after that we actually have the model or the
00:13:11.480 | tokenizer itself, the model. Now this is where you have something like a
00:13:19.920 | WordPiece tokenizer, which is what we're going to use. Now WordPiece is the
00:13:23.400 | tokenizer that BERT uses, and later on we're going to be using BERT models, so
00:13:27.240 | that's why I'm sticking with WordPiece, and what this will do is it will take all of
00:13:33.240 | the tokens we've created and it will merge or split them into the largest
00:13:43.360 | logical components it can. So hello world isn't a very good example,
00:13:49.120 | but let's say we have being, maybe, yep, being. So being, for BERT
00:13:59.520 | specifically, would probably get split into something like be and then a
00:14:04.680 | subword token, which is always prefixed by the double hash symbol, and that
00:14:15.160 | would be ##ing, so we get two tokens from that with the BERT WordPiece
00:14:22.240 | tokenizer, a subword tokenizer. So it doesn't split every word; hello would
00:14:26.680 | probably just be hello, but it will split some words, like being, where it is more
00:14:32.420 | relevant to split them, which can be useful. For example, if you had
00:14:35.840 | snowed, snowing, snow, or snowboarding, you would get snow in all
00:14:42.260 | of those, and then you get boarding, or board and ##ing, ##ing for snowing,
00:14:47.620 | and ##ed for snowed, so that can be quite useful to find patterns in the
00:14:56.580 | words or subwords.
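Purely as an illustration of WordPiece behaviour, here is what an already-trained English WordPiece vocabulary (BERT's, not the Dhivehi tokenizer we are building) tends to do; the splits in the comments are indicative, not guaranteed:

```python
from transformers import BertTokenizer

# English BERT's existing WordPiece vocab, used only to illustrate subword splits.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")

print(bert_tok.tokenize("being"))         # likely kept whole: ['being']
print(bert_tok.tokenize("snowboarding"))  # split into pieces, e.g. something like ['snow', '##board', '##ing']
```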
00:15:11.740 | The next step is post-processing, and this is where we would add any special tokens to our text. So BERT specifically would
00:15:19.260 | use something like what's called a classifier token, CLS, followed by hello world,
00:15:27.260 | followed by a separator token, SEP, and you'd probably have padding tokens and all of
00:15:33.260 | these different things. So we're adding in any special tokens there, and we're
00:15:38.140 | also going to create different tensors. So we will have an input IDs tensor,
00:15:47.300 | which will be the ID or integer values that represent each one of these word
00:15:55.420 | tokens or subword tokens; we have token type IDs, which is useful when you have
00:16:00.900 | sentence pairs; and you will also have an attention mask tensor, which tells BERT,
00:16:11.500 | or the transformer model, which tokens to actually pay attention to, for
00:16:16.100 | example to ignore any padding tokens.
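To make those three tensors concrete, here is an illustrative encoding with an existing English tokenizer (not the Dhivehi one we are building; the exact ID values will differ):

```python
from transformers import BertTokenizerFast

# Illustrative only: an English tokenizer, not our Dhivehi one.
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tok("hello world!", padding="max_length", max_length=8)

print(enc["input_ids"])       # [CLS] hello world ! [SEP] followed by [PAD] ids
print(enc["token_type_ids"])  # all zeros for a single sentence
print(enc["attention_mask"])  # 1s over real tokens, 0s over the padding
```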
00:16:21.300 | So those are the main components of a tokenizer, but there is also another which we will almost
00:16:27.140 | always add, at least, which is a decoder. So when your model outputs, let's say it
00:16:35.300 | outputs a word prediction so it's predicting you've masked the word you
00:16:40.940 | know let's say here we've covered hello and we said BERT what is that word and
00:16:47.020 | BERT is trying to tell us hello but BERT doesn't know the string hello it just
00:16:52.540 | knows token ID values so it's going to give us like it's going to say yeah this
00:16:58.060 | is my prediction is the number three and obviously we're like okay great I don't
00:17:02.540 | know what number three means so we need a decoder to take that number three and
00:17:08.780 | translate that into something more readable for us which would be hello
00:17:13.380 | okay, and that's what a decoder is for. So obviously we don't need it for
00:17:18.940 | the input into BERT, but we do need it if we want to understand the output from BERT.
00:17:25.060 | So that's our tokenizer components, reasonably high level, but let's go
00:17:33.180 | into the code and we'll have a look at how, at least, we've implemented that
00:17:37.100 | for our BERT WordPiece tokenizer. So in this we are using the Hugging Face
00:17:45.260 | tokenizers library, which is very good, and I definitely recommend you do the same,
00:17:50.320 | otherwise this will be very difficult. Now to install this you would just
00:17:56.780 | pip install tokenizers, there's nothing more to it than that, and what I've done
00:18:02.540 | here is imported everything that we're going to be using. Now you can see a few
00:18:06.420 | of the components I mentioned earlier, so we have decoders, the models module,
00:18:10.860 | which is the tokenizer model itself, normalizers, pre-tokenizers and the post-processors
00:18:18.580 | as well. Further on we have these other two classes and we'll get to
00:18:26.180 | them soon, so I'll explain them later.
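The import block described here looks roughly like this:

```python
# pip install tokenizers
from tokenizers import (
    Tokenizer,       # wrapper that ties all the components together
    decoders,        # decoder component
    models,          # the tokenization model itself (WordPiece here)
    normalizers,     # normalization component
    pre_tokenizers,  # pre-tokenization component
    processors,      # post-processing component
    trainers,        # training controllers, covered further down
)
```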
00:18:32.980 | Now the first thing we want to do: actually we're not going in order 1, 2, 3, 4 like I said before, because then we would be initializing
00:18:38.980 | the normalization step first. Instead, what we actually do is initialize the
00:18:45.780 | tokenizer, and the main component of that tokenizer is the model, so in this case
00:18:51.220 | WordPiece. So we initialize that using Tokenizer here, so you see up here we
00:19:00.700 | have Tokenizer, this is just a wrapper so that we create a Hugging Face tokenizers
00:19:08.380 | object, and in there we need to pass the type of tokenizer that we're using, and we are
00:19:13.820 | taking that from models up here, and we're using WordPiece. Now in here we
00:19:19.820 | also have the unknown token, so you specify that, so that's where your
00:19:24.380 | tokenizer, if it sees a word it doesn't know, will replace it; okay, it will
00:19:30.820 | put [UNK] rather than just leaving it empty or not understanding. This is like
00:19:38.100 | the only thing it can put there instead of raising an error or something, so we
00:19:44.620 | put that in there.
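A minimal sketch of that initialization:

```python
# The core model is WordPiece; unknown words will be replaced with [UNK].
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
```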
00:19:52.180 | Now after we have initialized a tokenizer instance, we need to set the tokenizer attributes which are going to relate to the components
00:19:58.580 | that we listed just a moment ago. So here we set the normalizer attribute on our
00:20:04.820 | tokenizer, which we just initialized, and we're setting that equal to a normalizers
00:20:10.660 | Sequence, so there's a sequence of normalization steps here, and we are using
00:20:17.300 | the lowercase first, we lowercase everything, and then we are using NFKD
00:20:23.420 | Unicode normalization, so that's that. So we've set the normalization step, then we
00:20:29.780 | do the pre-tokenization step. Now pre-tokenization is where we're splitting the
00:20:36.220 | string into tokens, so words or punctuation. Now the simplest way to do
00:20:44.980 | this is we use this Whitespace pre-tokenizer, so all that does is split on
00:20:51.380 | either whitespace or on punctuation, like commas or exclamation marks or
00:20:56.260 | something like that. So that's it, we've set our pre-tokenizer.
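A sketch of those two attribute assignments:

```python
# Normalization: lowercase everything, then apply NFKD unicode normalization.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.Lowercase(),
    normalizers.NFKD(),
])

# Pre-tokenization: split on whitespace and punctuation.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```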
00:21:06.820 | And then we have this: so this was another part that we didn't mention, this trainers module, and
00:21:14.300 | we imported it here. Now trainers is basically the method or function
00:21:25.260 | that we'll be using to train, or control the training process of, our tokenizer,
00:21:32.140 | and all we're doing here is we're using a WordPiece trainer, because we have a
00:21:37.540 | WordPiece tokenizer model. We are setting the vocabulary size, so that's the
00:21:43.060 | maximum number of tokens that our tokenizer will contain, and then we also
00:21:51.580 | need to set any special tokens. So we already set one earlier, but we need to
00:21:55.140 | make sure that is included, because with special tokens we probably won't
00:22:02.660 | see them, or we hopefully won't see them, in our input text, or the text that
00:22:10.460 | we'll be training on, so what we do is we insert them now beforehand. Okay, so we
00:22:17.100 | have the unknown token, padding token, classifier token, separator token and
00:22:22.180 | mask token. Then here we set a min frequency, so the minimum number of times
00:22:27.180 | we need to see a token to add it to the vocabulary, and then also, earlier on,
00:22:32.980 | you saw we had the subwords in our WordPiece tokenizer; here we're just setting
00:22:37.500 | the prefix that identifies those, so it's just the two hash symbols there.
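A hedged sketch of the trainer configuration; the 30,000 vocab size is the figure used here, while the min_frequency value is an assumed placeholder:

```python
trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,               # maximum number of tokens in the vocabulary
    min_frequency=2,                 # assumed value: minimum occurrences to keep a token
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    continuing_subword_prefix="##",  # prefix marking subword tokens
)
```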
00:22:48.220 | Now at this point we're back to the start, where we were downloading the Dhivehi corpus; again, we
00:22:52.140 | already covered it so we can skip that. We already have the Dhivehi corpus and we
00:22:57.780 | are storing it in the text generator, and then that's where we go into our
00:23:03.940 | training step. So the reason that we create this Dhivehi text generator
00:23:10.600 | is because we're training from an iterator here, which expects to receive lines of
00:23:16.820 | text. So we have the text, and then the trainer is equal to the trainer
00:23:24.560 | we defined just up here, so that controls the training process, and that's it. So
00:23:32.220 | once you run that, it will run through all of the text in your Dhivehi text
00:23:40.060 | generator and train to the specifications you set in your trainer
00:23:45.140 | object. We can check: from the tokenizer you can get the vocab size we set earlier,
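In code, the training call might look like this, reusing the generator sketched earlier:

```python
# Feed plain text strings from the generator; the trainer controls the process.
tokenizer.train_from_iterator(dv_text(), trainer=trainer)

print(tokenizer.get_vocab_size())  # 30000
```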
00:23:51.300 | and you see this is 30,000, so as we would expect. Now before, when I showed
00:23:58.740 | you that list, there were five steps and we've covered one, which is normalization,
00:24:04.060 | two, pre-tokenization, and three, which would be our actual tokenization model.
00:24:14.820 | Now after that we train the tokenizer, so we train using those first three steps
00:24:20.900 | or components, and then we still need to define the next step, which is the
00:24:25.740 | post-processing step. Now the first thing we do is we get our classifier token ID, so the
00:24:34.620 | integer value that represents CLS, and the integer value that represents SEP;
00:24:40.340 | we get both of those and we use this processors TemplateProcessing
00:24:46.700 | method to create this here, which looks quite messy, but if you
00:24:54.020 | take a look at this here so we have the one input at the top and we have the two
00:25:01.220 | inputs at the bottom so when you are feeding just a single sentence into
00:25:07.020 | your BERT model what's going to happen is looking at code we are going to
00:25:11.500 | we are going to use this format: so we're going to have our CLS token, followed by A, which is
00:25:17.220 | sentence one, followed by the separator token. Now back to the image, at the
00:25:27.260 | bottom we have two sentences now if we feed those into BERT they need to be
00:25:32.580 | understood as separate sentences to BERT or separate maybe question and context
00:25:39.980 | for example now in that case back to our code we will use or we would format it as
00:25:46.540 | a CLS token, followed by sentence A, the first one, followed by a separator token to
00:25:53.580 | separate the two sentences, and finally we would have sentence B after that, and
00:26:00.260 | then we finish with a separator token again. Now another thing that's
00:26:05.220 | different here is that we have this :1 for the pair. Now that's for
00:26:11.020 | the token type IDs array so token type IDs tells BERT where we have sentence A
00:26:16.220 | and sentence B now in the case of a single sentence everything in token type IDs is
00:26:22.460 | equal to zero, so it's just all zeros. The alternative, where we have
00:26:31.420 | two sentences is that sentence A is going to be zero and everything related to sentence B is
00:26:37.740 | going to be one it's as simple as that and then we need to specify the special
00:26:43.420 | tokens that we're using here so we have CLS and we're specifying it again here
00:26:48.340 | and mapping that to the actual token ID integer. So that's the post-processing
00:26:56.540 | step; that's all sorted, we don't need to do anything else with that now.
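A sketch of the post-processing setup matching the single-sentence and sentence-pair templates just described:

```python
# Look up the IDs that the trained vocab assigned to the special tokens.
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")

# $A is sentence A, $B is sentence B; ":1" marks tokens that receive
# token_type_id 1 (everything else defaults to 0).
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)
```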
00:27:03.900 | And we can move on. So after this, our tokenizer will feed the input IDs that it
00:27:10.060 | creates into the model and then we would move on to the decoder step so after
00:27:18.460 | BERT has finished processing whatever it's processing it's going to output you
00:27:22.220 | a number like a word prediction maybe it's going to give you a token ID and we
00:27:27.260 | need to know, okay, how do we decode that into something we can understand, and
00:27:31.420 | maybe it gives us a load of token IDs and what we need to do is say okay we're
00:27:38.060 | using a WordPiece tokenizer and we're going to decode from WordPiece, and we
00:27:42.540 | also, because we're using WordPiece, set this ## prefix. This is already set by
00:27:47.540 | default, by the way, but I'm just putting it in there in case you want to use a different
00:27:50.860 | prefix, although I wouldn't recommend it, but if you have reason to, you can
00:27:55.620 | change that.
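The decoder assignment is a one-liner; the prefix argument is the default anyway:

```python
# Decode WordPiece output back to readable text, merging "##" subwords.
tokenizer.decoder = decoders.WordPiece(prefix="##")
```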
00:28:04.500 | So that's our tokenizer, that's it; we've initialized or created our tokenizer, and after that we move on to saving it into a
00:28:12.500 | format that is most useful to us. So I think most of us, when we are using a
00:28:19.820 | tokenizer model, we're probably going to use Hugging Face transformers, not
00:28:24.020 | Hugging Face tokenizers, because they are two different libraries, and when we load a
00:28:30.340 | tokenizer with Hugging Face transformers we tend to use the PreTrainedTokenizer or
00:28:34.980 | PreTrainedTokenizerFast class. Now what we can do is save our tokenizer into
00:28:45.700 | the format that is compatible with Hugging Face transformers, and compatible
00:28:50.740 | with this class specifically, and to do that we first actually initialize the
00:28:56.060 | tokenizer, the Hugging Face transformers tokenizer, using this class. So we are
00:29:02.420 | using PreTrainedTokenizerFast, and if you've used this before in Hugging Face transformers,
00:29:06.900 | we usually write from_pretrained and we load the model from there, so something like
00:29:11.980 | bert-base-cased. This time we're not using that, not using any methods, we're just
00:29:16.260 | initializing the object directly. We pass the tokenizer to the tokenizer_object parameter, then we
00:29:23.780 | also have the unknown token, padding token and the other special tokens in
00:29:27.420 | there as well. Okay, so that is ready for us to save; it's in the correct format now.
00:29:37.060 | So what we do is take the full tokenizer, which we've initialized here, and we save it as a
00:29:42.320 | pre-trained model, and what I'm doing here is saving it to the bert-base-dv
00:29:47.900 | directory, and once that's saved we're going to see these three files here.
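Roughly, the conversion and saving steps could look like this (the bert-base-dv directory name follows the video; adjust as needed):

```python
from transformers import PreTrainedTokenizerFast

# Wrap the tokenizers-library object so Hugging Face transformers can use it.
full_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Save in the transformers-compatible format (writes the tokenizer files).
full_tokenizer.save_pretrained("bert-base-dv")
```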
00:29:56.660 | And from now on we can actually just load it with from_pretrained like we normally would.
00:30:01.700 | Now just one thing: when we load with from_pretrained, normally we are probably going to be
00:30:07.260 | loading from the Hugging Face models hub; in this case we're not, we're loading from a
00:30:13.180 | local directory. So in some cases maybe you have this in a different directory,
00:30:18.840 | so you write your path here which would go to a different directory and then you
00:30:25.180 | would have your model directory there okay so just be aware of that and then
00:30:31.620 | we can test our tokenizer with some Dhivehi text, and we see, okay cool, this is
00:30:38.500 | Dhivehi text, no idea what any of it means, but it looks great, and from that we get
00:30:46.660 | three tensors. So we have this input IDs tensor, where the two represents, I think, the CLS
00:30:55.220 | token and the three here would represent our separator token, the two special tokens;
00:31:00.500 | everything in between is Dhivehi, or punctuation like these brackets, open and
00:31:07.860 | closing brackets, a comma or something else. And then we have token type IDs; now we
00:31:13.420 | only have one sentence here, so sentence A, all of that should be zero. And then we
00:31:22.940 | have the attention mask tensor as well; in this case we don't have any padding,
00:31:27.020 | that is, I haven't been padding, so the attention mask should just be ones.
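A hedged sketch of that test; the input string is a placeholder, so substitute real Thaana-script text:

```python
from transformers import PreTrainedTokenizerFast

# Load from the local directory we just saved to.
tokenizer = PreTrainedTokenizerFast.from_pretrained("bert-base-dv")

encoded = tokenizer("some dhivehi text here")  # placeholder; use real Thaana text
print(encoded["input_ids"])       # starts with the [CLS] id, ends with the [SEP] id
print(encoded["token_type_ids"])  # all zeros for a single sentence
print(encoded["attention_mask"])  # all ones, since nothing is padded
```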
00:31:32.020 | And that's it, our tokenizer is actually ready. Now if you do want to go ahead and actually
00:31:38.020 | load this tokenizer directly, rather than going through all of this, you can just
00:31:43.260 | write jamescalam and load it like this, because this is on the Hugging Face
00:31:50.580 | models hub. We have a model there, and this also includes a BERT model as well, so
00:31:56.060 | you have the BERT model, which you can load, and also the tokenizer.
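If you'd rather skip training, loading from the hub would look something like this; the exact model ID under the jamescalam account is an assumption, so check the hub for the actual repository name:

```python
from transformers import AutoTokenizer

# Hypothetical hub ID; look up the actual repository under the jamescalam account.
tokenizer = AutoTokenizer.from_pretrained("jamescalam/bert-base-dv")
```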
00:32:02.780 | So that is it for this guide, or walkthrough, to building a BERT WordPiece tokenizer for a low
00:32:09.620 | resource language, a language which does not have any currently supported tokenizer.
00:32:17.860 | So I hope this has been useful. As I said, the tokenizer is really just the first step
00:32:24.660 | in what we hope will be a great way to support the Dhivehi language and
00:32:33.820 | particularly the AI community over there in what they are building and doing, by
00:32:40.140 | putting together a few BERT models that are fine-tuned or built for specific
00:32:45.860 | or different purposes, and this is the very first step in that. So I hope
00:32:56.060 | it's all been very useful. Thank you very much for watching and I will see you in the next one. Bye!