Back to Index

Building Transformer Tokenizers (Dhivehi NLP #1)


Chapters

0:00 Intro
1:06 Dhivehi Project
2:28 Hurdles for Low Resource Domains
4:21 Dhivehi Dataset
4:52 Download Dhivehi Corpus
8:25 Tokenizer Components
8:44 Normalizer Component
11:55 Pre-tokenization Component
14:59 Post-tokenization Component
16:26 Decoder Component
17:41 Tokenizer Implementation
21:04 Tokenizer Training
24:22 Post-processing Implementation
27:12 Decoder Implementation
28:7 Saving for Transformers
30:33 Tokenizer Test and Usage
31:36 Download Dhivehi Models
32:21 First Steps

Transcript

Today we're going to have a look at how we can design tokenizers, or more specifically a BERT tokenizer, for low-resource languages. When I say low-resource language, I mean a language where we don't really have that much data, and where there isn't already a tokenizer or transformer model out there built for that specific language.

Now there are transformer models and tokenizers for a lot, or even most, languages, but there are still plenty of languages that are simply less common and get less attention than something like English or Chinese. So if you're working on one of these languages, you may find there are very few models out there, or maybe none at all.

That's what we're going to focus on today. More specifically, we're going to focus on building a tokenizer, and in the future we'll do more of this as well, more models and so on, but for now we're building a tokenizer for the Dhivehi language.

Now Dhivehi is the language of the Maldives, the Maldive islands in the Indian Ocean. I'm sure a lot of you know them; they're very beautiful islands with incredibly amazing weather. I would not usually think of the Maldives and NLP in the same context, but there are a lot of people who do, and in particular there is a guy called Ashraq, or Ismail Ashraq. We've been speaking a lot about NLP, and through that I've been made aware of how difficult it actually is to build models for languages that don't get much attention.

So what we decided to do is team up and try to put together some transformer models for the Dhivehi language, and the first step to that is obviously the tokenizer; you need a tokenizer before you build anything else. There are a few hurdles to overcome when building these models. There are some pre-trained models for Dhivehi, but they're not necessarily that useful for what we want to apply them to. And you may think, okay, there are those multilingual models with hundreds of languages included, but from what I've seen, all of them miss Dhivehi.

Another thing that is difficult with Dhivehi and other low-resource languages is actually finding text data. Labeled data is practically impossible to find for a lot of these languages, but unlabeled, unstructured data is a bit more reasonable: you can use a web scraper and scrape lots of text from websites written in that language, and that is what Ashraq has done to get the dataset I'm using today.

On top of that, another really difficult thing is that Dhivehi uses a very unique writing system known as Thaana. Thaana looks really cool and is definitely very unique and beautiful, but it's not very well supported, even by those multilingual language models. You can see here what it looks like. You also read it from right to left, which is quite interesting, and everything is in the same case; there's no uppercase or lowercase, which is again quite interesting, I think. So there are a lot of unique properties to this language that we need to deal with, and from what I've seen there are no current tokenizers which actually support these characters, so we really do need a tokenizer.

The first step is getting data. Like I said before, Ashraq did the hard part and scraped a load of Dhivehi text from the Dhivehi internet, and he managed to put together more than 16 million samples of Dhivehi paragraphs and sentences. Obviously there are going to be little things in there that we maybe don't want, but for the most part it's actually very good.

So step number one is downloading the Dhivehi corpus from Ashraq. It's hosted over here on Hugging Face, so we can use the datasets library to go ahead and actually get that data, which is pretty useful. If you don't have it already, we just do pip install datasets.

Okay, once that's installed we can go ahead and write this: we are going to Ashraq's Dhivehi corpus and we are taking the train split from that corpus. Now it's fairly big, like I said, more than 16 million samples, so if you don't want to load it all at once you can set streaming equal to True. That creates an iterable object, which you can see down here, and it means we can write a for loop and iterate through the data one sample at a time, downloading each sample only as we loop through rather than downloading the full thing in one go.
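As a rough sketch of that step, it looks something like this (the dataset id is my assumption of the hub path described in the video, so double-check the exact name on the Hugging Face hub):

```python
# pip install datasets
from datasets import load_dataset

# Dataset id is an assumption based on the video; check the hub for the exact name.
dhivehi = load_dataset("ashraq/dhivehi-corpus", split="train", streaming=True)

# streaming=True returns an iterable, so each sample is downloaded as we loop,
# rather than pulling all 16M+ rows up front.
for i, row in enumerate(dhivehi):
    print(row)   # e.g. {"text": "<Dhivehi paragraph>"}
    if i == 2:   # just peek at the first three samples
        break
```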

You can see that's what I've done here: looping through the rows of the Dhivehi train split, we just print one row at a time, and again you can see that really cool Dhivehi text in there. There are just three samples here, and each one is essentially a dictionary: we have a text key, and then the Dhivehi text following it.

The next thing I've done is create a generator object, because later on, when we load this data into the tokenizer, we're going to load it in as a generator, and the tokenizer will expect to iterate through it and receive the text almost like lines. It will not expect a dictionary containing a text key; it expects just the text itself and nothing else. So all this generator does is iterate through each row and extract the text field, nothing else (you don't need the enumerate there, I'm not sure why that's in the code, so remove that).

Now when we actually use that generator, with the exact same loop as before, just printing each row, you see we no longer get a dictionary, we get plain text, which is what we want. That is what will later be fed into our tokenizer's train_from_iterator method.
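A minimal sketch of that generator, reusing the streamed dataset from the snippet above (the function name is just illustrative):

```python
def dhivehi_text(dataset):
    # Yield plain strings rather than {"text": ...} dictionaries,
    # because train_from_iterator expects lines of raw text.
    for row in dataset:
        yield row["text"]

# Same loop as before, but now each row is plain Dhivehi text.
for i, row in enumerate(dhivehi_text(dhivehi)):
    print(row)
    if i == 2:
        break
```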
Now that's skipping ahead a bit, so let's go back and take a look at where this tokenizer is coming from in the first place.

A tokenizer is not just one thing; there are several components or sub-processes inside it. The very first is normalization. This is optional, you don't need to do it, but it's useful. Say we have two words that look almost identical, which happens a lot in European languages, where the only difference is an accented character; people may use them interchangeably even though, as far as I know, they are actually different words. If we expect that, we can use NFKD Unicode normalization to map both onto the same form. This is particularly useful when you have weird characters: rather than the plain letter F, for example, you see people on social media using all sorts of stylized variants. Because we're people, we understand they mean an F, but a machine learning model is not going to see those as the same thing, because they have different Unicode code points. With NFKD Unicode normalization we convert them all to the same thing, which can be quite useful.

Another very useful thing you can do in the normalization step is lowercasing. Like I said before, Dhivehi doesn't have uppercase and lowercase, so it's not strictly necessary here, but if we do have, for example, English text in our data at any point, it's probably better to lowercase it, because that means we will create fewer non-Dhivehi tokens. If we have the word "hello" and also "Hello", then without lowercasing these may become two separate tokens and take up more space in our tokenizer vocabulary, where we only really want to keep Dhivehi tokens; lowercasing reduces the likelihood of getting duplicate lowercase and uppercase non-Dhivehi tokens in there. So for our normalization step we are going to lowercase and also use NFKD.

The next component after that is pre-tokenization. Pre-tokenization is a tokenization step that happens before we actually tokenize. Say we have the string "hello world!"; pre-tokenization is just going to split this into very simple tokens. Anything separated by a space gets split, and punctuation gets split off as well, so from this we get the tokens "hello", "world", and the exclamation mark.

After that we have the model, the tokenizer itself. This is where you have something like a WordPiece tokenizer, which is what we're going to use. WordPiece is the tokenizer that BERT uses, and later on we're going to be using BERT models, so that's why I've stuck with WordPiece. What it does is take the tokens we've created and merge them into the largest logical components it can. "hello world" isn't a great example here, so let's say we have "being": for BERT specifically, this would probably get split into something like "be" and then a subword token, which is always prefixed by the double-hash symbol, so "##ing". So we get two tokens from the BERT WordPiece subword tokenizer. It doesn't split every word, "hello" would probably just stay "hello", but it will split words where it's more useful to do so. For example, with "it snowed", "it is snowing", "snow", and "snowboarding", you would get "snow" in all of those, plus "##boarding" (or "##board" and "##ing"), "##ing" for "snowing", and "##ed" for "snowed". That can be quite useful for finding patterns across words and subwords.

The next step is post-processing, and this is where we add any special tokens to our text. BERT specifically would use something like a classifier token, [CLS], followed by "hello world", followed by a separator token, [SEP], and you'd probably have padding tokens and so on. So we're adding any special tokens there, and we're also going to create the different tensors: an input IDs tensor, which holds the integer values that represent each of these word or subword tokens; a token type IDs tensor, which is useful when you have sentence pairs; and an attention mask tensor, which tells BERT, or the transformer model, which tokens to actually pay attention to, for example to ignore any padding tokens.

Those are the main components of a tokenizer, but there is another one which we will almost always add, which is a decoder. Say your model outputs a word prediction: we've masked the word "hello", we ask BERT what that word is, and BERT tries to tell us "hello". But BERT doesn't know the string "hello", it only knows token ID values, so it's going to give us its prediction as a number.
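To make the NFKD normalization described in the normalization step above a bit more concrete, here is a quick standalone illustration using Python's unicodedata module (separate from the tokenizers library, purely for intuition):

```python
import unicodedata

# Stylized "font" characters and ligatures decompose to their plain forms under NFKD,
# so the model sees one consistent representation instead of many look-alikes.
for s in ["𝓗𝓮𝓵𝓵𝓸", "ﬁne", "café"]:
    print(s, "->", unicodedata.normalize("NFKD", s))
# 𝓗𝓮𝓵𝓵𝓸 -> Hello
# ﬁne    -> fine
# café   -> café  (the é is decomposed into 'e' plus a combining accent)
```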
Let's say that number is three. Obviously we're like, okay, great, but I don't know what the number three means, so we need a decoder to take that three and translate it into something more readable for us, which would be "hello". That's what a decoder is for. We don't need it for the input into BERT, but we do need it if we want to understand the output from BERT.

So those are our tokenizer components, at a reasonably high level. Let's go into the code and have a look at how we've implemented them for our BERT WordPiece tokenizer.

Here we are using the Hugging Face tokenizers library, which is very good, and I definitely recommend you do the same, otherwise this will be very difficult. To install it you just pip install tokenizers; there's nothing more to it than that. What I've done here is import everything we're going to be using, and you can see a few of the components I mentioned earlier: we have the decoders, the models (the tokenizer model itself), the normalizers, the pre-tokenizers, and the post-processors as well. Further on we have two other classes, and we'll get to them soon, so I'll explain them later.

Now, the first thing we do is not actually go in order 1, 2, 3, 4 like I listed before, because then we would be initializing the normalization step first. Instead, we initialize the tokenizer, and the main component of that tokenizer is the model, in this case WordPiece. We initialize it using Tokenizer, which you see up here; this is just a wrapper so that we create a Hugging Face tokenizers object, and into it we pass the type of tokenizer we're using, which we take from models up here: WordPiece. In there we also specify the unknown token, so that when the tokenizer sees a word it doesn't know, it puts [UNK] rather than leaving it empty or raising an error; that's the only thing it can put there.

After we have initialized the tokenizer instance, we need to set the tokenizer attributes that relate to the components we listed a moment ago. First we set the normalizer attribute on the tokenizer we just initialized, and we set it equal to a normalizers Sequence, a sequence of normalization steps: we lowercase everything first, and then we apply NFKD Unicode normalization. So that's the normalization step set.

Then we do the pre-tokenization step. Pre-tokenization is where we split the string into rough tokens, words or punctuation, and the simplest way to do this is to use the Whitespace pre-tokenizer, which just splits on whitespace and on punctuation like commas or exclamation marks. So that's our pre-tokenizer set.

Then we have the trainers import, which is the other part we didn't mention earlier. A trainer is the object that controls the training process of our tokenizer. Here we're using a WordPieceTrainer, since we have a WordPiece tokenizer model. We set the vocabulary size, which is the maximum number of tokens our tokenizer will contain, and then we also need to set the special tokens. We already set one, [UNK], earlier, but we need to make sure it is included here as well.
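A sketch of that initialization with the Hugging Face tokenizers library (the overall pattern follows what's described above; treat it as a sketch rather than the exact code from the video):

```python
from tokenizers import (Tokenizer, decoders, models, normalizers,
                        pre_tokenizers, processors, trainers)

# The core model is WordPiece (the same scheme BERT uses), with [UNK] for unknown tokens.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# Normalization: lowercase first, then NFKD unicode normalization.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.Lowercase(),
    normalizers.NFKD(),
])

# Pre-tokenization: split on whitespace and punctuation.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```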
These special tokens are tokens we probably won't see, or hopefully won't see, in the text we'll be training on, so we insert them up front: the unknown token [UNK], the padding token [PAD], the classifier token [CLS], the separator token [SEP], and the mask token [MASK]. Then we set a min frequency, the minimum number of times we need to see a token before it gets added to the vocabulary. And finally, you saw earlier that we have subwords in our WordPiece tokenizer; here we're just saying that the "##" prefix, the two hash symbols, identifies those.

At this point we're back to the start, where we were downloading the Dhivehi corpus. We already covered that, so we can skip it: we have the Dhivehi corpus and we are storing it in the generator called text. Then we go into our training step. The reason we create this Dhivehi text generator is that we're training with train_from_iterator here, which expects to receive lines of text. So we pass text, and the trainer argument is set to the trainer we defined just above, which controls the training process. That's it: once you run that, it will run through all of the text in your Dhivehi text generator and train to the specifications you set in the trainer object. We can check the result: from the tokenizer you can get the vocab size we set earlier, and you see it is 30,000, as we would expect.

Before, when I showed you that list, there were five steps, and we've now covered one, normalization, two, pre-tokenization, and three, the actual tokenization model; those are the first three components we just trained. We still need to define the next step, which is post-processing.

The first thing we do is get our classifier token ID, the integer value that represents [CLS], and the integer value that represents [SEP]. We get both of those and use the TemplateProcessing post-processor to create the template here, which looks quite messy, but take a look: we have the single input at the top and the two inputs at the bottom. When you are feeding just a single sentence into your BERT model, looking at the code, we are going to use this format: our [CLS] token, followed by $A, which is sentence one, followed by the separator token. Back in the image, on the bottom, we have two sentences. If we feed those into BERT, they need to be understood as separate sentences, maybe a question and a context, for example. In that case, back in our code, we format it as the [CLS] token, followed by sentence A, followed by a separator token to separate the two sentences, then sentence B, and we finish with another separator token.

Another thing that's different here is the ":1" on the pair template. That's for the token type IDs array: token type IDs tell BERT where we have sentence A and sentence B. In the case of a single sentence, everything in token type IDs is equal to zero, it's just all zeros. With two sentences, everything belonging to sentence A is zero and everything belonging to sentence B is one; it's as simple as that. Then we need to specify the special tokens that we're using here, so we have [CLS] and [SEP], specifying each again and mapping it to its actual token ID integer. And that's the post-processing step sorted.
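Continuing the sketch, the trainer, training call, and post-processing template look roughly like this. The vocab size follows the 30,000 mentioned above; the min_frequency value is my own illustrative choice, since the video doesn't state the number; the special token IDs are looked up rather than hard-coded:

```python
# Trainer: controls vocab size, special tokens, and the "##" subword prefix.
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,
    special_tokens=special_tokens,
    min_frequency=2,                   # illustrative; the video only says "a min frequency"
    continuing_subword_prefix="##",
)

# Train from the plain-text generator defined earlier.
text = dhivehi_text(dhivehi)
tokenizer.train_from_iterator(text, trainer=trainer)
print(tokenizer.get_vocab_size())      # ~30,000

# Post-processing: add [CLS]/[SEP] and set token_type_ids for single sentences and pairs.
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)
```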
We don't need to do anything else with that now, and we can move on. After this, our tokenizer will feed the input IDs it creates into the model, and then we reach the decoder step. Once BERT has finished processing whatever it's processing, it's going to output a number, maybe a load of token IDs, for something like a word prediction, and we need to know how to decode that into something we can understand. So we say, okay, we're using a WordPiece tokenizer, so we decode from WordPiece, and because we're using WordPiece we also set the "##" prefix. That's already the default, by the way, but I'm putting it in there in case you want to use a different prefix; I wouldn't recommend it, but if you have a reason to, you can change it. So that's our tokenizer: we've initialized, or rather created, our tokenizer.

After that we move on to saving it in a format that is most useful to us. I think most of us, when we're using a tokenizer, are probably going to use Hugging Face transformers rather than Hugging Face tokenizers, because they are two different libraries, and when we load a tokenizer with Hugging Face transformers we tend to use the PreTrainedTokenizer or PreTrainedTokenizerFast class. So what we can do is save our tokenizer into a format that is compatible with Hugging Face transformers, and with this class specifically. To do that, we first initialize a transformers tokenizer using the PreTrainedTokenizerFast class. If you've used this before in Hugging Face transformers, we usually write from_pretrained and load the model from the hub, but this time we're not using that method; we're initializing the object directly. We pass our tokenizer to the tokenizer_object argument, and we also pass the unknown token, padding token, and the other special tokens in there as well.

So that's ready for us to save; it's in the correct format now. We take the full tokenizer we've just initialized and save it with save_pretrained, and what I'm doing here is saving it to the bert-base-dv directory. Once that's saved, we get these three files here, and from then on we can load it with from_pretrained like we normally would. One thing to note: normally with from_pretrained we'd probably be loading from the Hugging Face models hub, but in this case we're loading from a local directory. If you have this saved somewhere else, you'd write the path pointing to whichever directory your model is in, so just be aware of that.

Then we can test our tokenizer with some Dhivehi text, and we see, okay, cool, this is Dhivehi text, no idea what any of it means, but it looks great. From that we get three tensors. We have the input IDs tensor, where the 2 represents, I think, the [CLS] token and the 3 represents our separator token, the two special tokens; everything in between is Dhivehi, or punctuation like these opening and closing brackets, a comma, or something else. Then we have token type IDs; we only have one sentence here, sentence A, so all of that should be zero. And then we have the attention mask tensor as well; in this case we don't have any padding, so the attention mask should just be all ones. And with that, our tokenizer is ready.
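And a sketch of those final pieces: the decoder, wrapping for transformers, saving, reloading, and a quick test. The directory name and the sample string are placeholders:

```python
from transformers import PreTrainedTokenizerFast

# Decoder: rebuild readable text from WordPiece IDs ("##" is the default prefix anyway).
tokenizer.decoder = decoders.WordPiece(prefix="##")

# Wrap the trained tokenizer so it behaves like any transformers tokenizer.
full_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Save in the transformers format, then reload it like any pretrained tokenizer.
full_tokenizer.save_pretrained("bert-base-dv")                  # local directory (placeholder name)
reloaded = PreTrainedTokenizerFast.from_pretrained("bert-base-dv")

# Quick check: [CLS]/[SEP] IDs at the ends, token_type_ids all 0, attention_mask all 1.
print(reloaded("some Dhivehi text here"))                       # placeholder input string
```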
Now, if you do want to use this tokenizer directly rather than going through all of this yourself, you can load it from the Hugging Face models hub under the jamescalam namespace, because it's hosted there, and that also includes a BERT model, so you can load the BERT model as well as the tokenizer.

So that is it for this guide, or walkthrough, to building a BERT WordPiece tokenizer for a low-resource language, a language which does not currently have any supported tokenizer. I hope this has been useful. As I said, the tokenizer is really just the first step in what we hope will be a good way to support the Dhivehi language, and particularly the AI community over there, in what they are building and doing, by putting together a few BERT models that are fine-tuned or built for different purposes. This is the very first step in that, so I hope it's all been very useful. Thank you very much for watching and I will see you in the next one.

Bye!