Building Transformer Tokenizers (Dhivehi NLP #1)
Chapters
0:00 Intro
1:06 Dhivehi Project
2:28 Hurdles for Low Resource Domains
4:21 Dhivehi Dataset
4:52 Download Dhivehi Corpus
8:25 Tokenizer Components
8:44 Normalizer Component
11:55 Pre-tokenization Component
14:59 Post-tokenization Component
16:26 Decoder Component
17:41 Tokenizer Implementation
21:04 Tokenizer Training
24:22 Post-processing Implementation
27:12 Decoder Implementation
28:7 Saving for Transformers
30:33 Tokenizer Test and Usage
31:36 Download Dhivehi Models
32:21 First Steps
Today we're going to have a look at how we can design tokenizers, or more specifically a BERT tokenizer, for low-resource languages. When I say low-resource language, I mean a language where we don't really have that much data, and where there isn't already a tokenizer or transformer model built for it. There are transformer models and tokenizers for a lot of languages, maybe even most languages, but there are still plenty that are simply less common and get less attention than something like English or Chinese. If you're working with one of these languages you may find there are very few models out there, or maybe none at all. So that's what we're going to focus on today. Specifically, we're going to build a tokenizer (in the future we'll do more of this as well, more models and so on) for the Dhivehi language.

Dhivehi is the language of the Maldives, the islands in the Indian Ocean that I'm sure a lot of you know: very beautiful islands with incredible weather. I would not usually think of the Maldives and NLP in the same context, but there are a lot of people who do, and in particular there is Ismail Ashraq. We've been speaking a lot about NLP, and through that I've been made aware of how difficult it is to actually build models for languages that don't get much attention. So we decided to team up and try to put together some transformer models for the Dhivehi language, and the first step to that is obviously the tokenizer. You need a tokenizer before you build anything else.
Now, there are a few hurdles to overcome when building these models. There are some pre-trained models for Dhivehi, but they're not necessarily that useful for what we want to apply them to. You may think, okay, there are those multilingual models with hundreds of languages included, but from what I've seen, all of them miss Dhivehi. Another difficulty with Dhivehi and other low-resource languages is actually finding text data. Labeled data is practically impossible to find for a lot of these languages, but unlabeled, unstructured data is a bit more reasonable: you can use a web scraper and collect a lot of text from websites written in whichever language you're working with. That is what Ashraq did to get the dataset I'm using today.
Another really difficult thing is that Dhivehi uses a very distinctive writing system known as Thaana. Thaana looks really cool and is definitely unique and beautiful, but it's not well supported, even by those multilingual language models. It's also written right to left, which is quite interesting, and everything is in the same case, there's no uppercase or lowercase, which again I find quite interesting. So there are a lot of unique properties of this language that we need to deal with, and from what I've seen there are no current tokenizers that actually support these characters, so we really do need a tokenizer of our own.
The first step is getting data. Like I said before, Ashraq went and did the hard part: he scraped a load of Dhivehi text from the Dhivehi web and managed to put together more than 16 million samples of Dhivehi paragraphs and sentences. Obviously there will be a few things in there that we don't want, but for the most part it's actually very good.
Step number one is downloading the Dhivehi corpus from Ashraq. It's hosted on Hugging Face, so we can use the datasets library to go ahead and get that data, which is pretty useful. If you don't have it already, just pip install datasets. Once that's installed, we point load_dataset at Ashraq's Dhivehi corpus and take the train split. It's fairly big, like I said, more than 16 million samples, so if you don't want to load it all at once you can set streaming=True. That creates an iterable object, which means we can write a for loop and iterate through the data one sample at a time, downloading each sample as we go rather than pulling the full dataset down in one shot. Iterating through the train split and printing one row at a time, you can see that really cool Dhivehi text. Each sample is very simple, almost like a dictionary: there is a "text" key followed by the Dhivehi text.
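Here is a minimal sketch of that download step, assuming the dataset is hosted under the ID ashraq/dhivehi-corpus (adjust the ID and variable names if yours differ):

```python
from datasets import load_dataset

# Stream the train split so we don't download all 16M+ samples up front.
dv = load_dataset("ashraq/dhivehi-corpus", split="train", streaming=True)

# Each row is a dict with a "text" key holding one Dhivehi paragraph or sentence.
for i, row in enumerate(dv):
    print(row)
    if i == 2:  # peek at just the first three samples
        break
```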
Next I'm creating a generator, because later on we're going to feed this data into the tokenizer as an iterator, and that iterator will expect to receive the text almost like lines: as we iterate through, it expects to receive just the text itself, not a dictionary containing a "text" key. So all the generator does is iterate through each row and extract the text field, nothing else. (You don't need the enumerate that was in the original code, I'm not sure why it was there, so I've removed it.) Now when we use that generator, running the exact same loop as before and just printing each row, we no longer get a dictionary, we get plain text, which is what we want.
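A sketch of that generator, assuming the streamed dataset object dv from the snippet above (the function name dv_text is just an illustrative choice):

```python
def dv_text():
    # Yield only the raw string from each row, not the {"text": ...} dict.
    for row in dv:
        yield row["text"]

# Same loop as before, but now each item is plain text.
for i, text in enumerate(dv_text()):
    print(text)
    if i == 2:
        break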
That plain text will eventually be fed into our tokenizer's train_from_iterator method, but that's skipping ahead a bit, so let's go back and take a look at where this tokenizer is coming from in the first place. A tokenizer isn't just a single thing; there are several different components, or sub-processes. The very first is normalization. This step is optional, you don't have to do it, but it's useful.
For example, sticking with European languages for a moment, say we have the words sí and si. These are pretty similar, and a lot of people may use them interchangeably, even though as far as I know they are actually different words. If we expect that, we can use NFKD Unicode normalization, which decomposes the accented character into its base letter plus a combining mark, so both words end up sharing the same base characters. This is particularly useful when you have unusual characters: you see people on social media using decorative Unicode letters instead of a plain f, for instance. Because we're people, we understand what they mean, but a machine learning model is not going to see those as the same thing, because they have different Unicode code points. NFKD normalization converts them to the same thing, which can be quite useful.

Another very useful thing is that you can also lowercase within your normalization step. As I said before, Dhivehi doesn't have uppercase and lowercase, so it's not really necessary for the Dhivehi text itself, but if we do have, for example, English text in our data at any point, it's probably better to lowercase it, because that means we create fewer non-Dhivehi tokens. Without lowercasing, "hello" and "Hello" may become two separate tokens and take up more space in our tokenizer vocabulary, where we only really want to keep Dhivehi characters and words. Lowercasing reduces the likelihood of getting duplicate lowercase and uppercase non-Dhivehi tokens in there. So for our normalizer we are going to lowercase and also use NFKD. That's the first component.
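As a quick illustration of what that normalization does, here is a sketch using Python's standard unicodedata module (the tokenizer itself will use the equivalent normalizers from the tokenizers library, shown later):

```python
import unicodedata

# NFKD maps compatibility characters (ligatures, decorative letters, etc.)
# back to their plain forms; lowercasing then removes case differences.
for s in ["ﬁnance", "ℌello World"]:
    print(unicodedata.normalize("NFKD", s).lower())
# -> finance
# -> hello world
```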
The next component is pre-tokenization. Pre-tokenization is a tokenization step that happens before we actually tokenize with the model. Say we have the string "hello world!", which is probably the easiest example. Pre-tokenization just splits this into very simple tokens: anything separated by whitespace gets split apart, and punctuation gets split off as well. So from "hello world!" we end up with "hello", "world" and the exclamation mark. That is the pre-tokenization step.
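Here is a small sketch of that behaviour using the Whitespace pre-tokenizer from the tokenizers library, the same one we will set on our tokenizer later:

```python
from tokenizers import pre_tokenizers

pre_tok = pre_tokenizers.Whitespace()
# Splits on whitespace and punctuation; the offsets show where each piece came from.
print(pre_tok.pre_tokenize_str("hello world!"))
# -> [('hello', (0, 5)), ('world', (6, 11)), ('!', (11, 12))]
```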
After that we have the model, the tokenizer itself. This is where you have something like a WordPiece tokenizer, which is what we're going to use. WordPiece is the tokenizer that BERT uses, and later on we're going to be using BERT models, so that's why I've stuck with WordPiece. What it does is take the tokens created by pre-tokenization and break each one down into the largest pieces it knows from its vocabulary. "hello world" isn't a great example here, so take the word "being": for BERT this might get split into something like "b" plus a subword token, which is always prefixed by the double-hash symbol, so "##ing", giving two tokens from BERT's WordPiece subword tokenizer. It doesn't split every word; "hello" would probably just stay "hello", but it will split words where the pieces are more meaningful on their own. That can be useful: if you had "it snowed", "it is snowing", "snow" and "snowboarding", you would get "snow" in all of them, then "##board" and "##ing" for snowboarding, "##ing" for snowing, and something like "##ed" for snowed. That helps the model find patterns across words and subwords.
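To see this kind of splitting in practice before we train our own, you can inspect an existing English WordPiece vocabulary. The snippet below uses the standard bert-base-uncased tokenizer from transformers purely as an illustration; the exact splits depend on that learned vocabulary, so treat the results as indicative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["snow", "snowing", "snowboarding"]:
    # Common words tend to stay whole; rarer ones break into a root plus ## subwords.
    print(word, "->", tok.tokenize(word))
```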
The next step is post-processing, and this is where we add any special tokens to our text. BERT specifically uses a classifier token, [CLS], followed by the text ("hello world"), followed by a separator token, [SEP], and you'll probably have padding tokens and so on, so any special tokens get added here. We also create the different tensors: an input_ids tensor, which holds the integer IDs representing each word or subword token; token_type_ids, which is useful when you have sentence pairs; and an attention_mask tensor, which tells BERT, or whichever transformer model, which tokens to actually pay attention to, for example so that it ignores any padding tokens.
Those are the main components of a tokenizer, but there is one more that we will almost always add: the decoder. Say your model outputs a word prediction; we've masked the word "hello" and asked BERT what that word is. BERT doesn't know the string "hello", it only knows token ID values, so it's going to say: my prediction is the number three. Obviously that's not much use to us on its own, so we need a decoder to take that number three and translate it back into something readable, which would be "hello". That's what the decoder is for. We don't need it for the input into BERT, but we do need it if we want to understand the output from BERT.
So those are our tokenizer components at a reasonably high level. Let's go into the code and have a look at how we've implemented that for our BERT WordPiece tokenizer. We're using the Hugging Face tokenizers library, which is very good and which I definitely recommend you use as well, otherwise this would be very difficult. To install it you just pip install tokenizers, nothing more to it than that. I've imported everything we're going to be using, and you can see a few of the components I mentioned earlier: decoders, models (the tokenization model itself), normalizers, pre_tokenizers and processors. Further on there are another couple of imports which we'll get to soon, so I'll explain those later.
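The imports used here look roughly like this (a sketch of the on-screen code, which may differ slightly):

```python
from datasets import load_dataset
from tokenizers import (
    Tokenizer,       # wrapper class for the whole tokenizer
    decoders,
    models,          # the tokenization models themselves (WordPiece, BPE, ...)
    normalizers,
    pre_tokenizers,
    processors,      # post-processing templates
    trainers,        # training configuration, explained below
)
from transformers import PreTrainedTokenizerFast  # used later, for saving
```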
Now, the first thing we do is not step one of the list I gave earlier, because that would mean initializing the normalization step first. Instead we initialize the tokenizer, and the main component of that tokenizer is the model, in this case WordPiece. We initialize it using the Tokenizer class, which is just a wrapper that creates a Hugging Face tokenizers object, and into it we pass the type of tokenizer model we're using, taken from models: WordPiece. In there we also specify the unknown token, so that if the tokenizer sees a word it doesn't know, it puts [UNK] rather than leaving it empty or raising an error. That's the only thing it can put there, so we pass it in.
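A minimal sketch of that initialization:

```python
from tokenizers import Tokenizer, models

# The core of the tokenizer is the WordPiece model; "[UNK]" is what it will
# emit for anything it cannot tokenize.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
```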
After we've initialized the tokenizer instance, we need to set the tokenizer attributes that correspond to the components we listed a moment ago. First we set the normalizer attribute on the tokenizer we just initialized, setting it equal to a normalizers Sequence, which is a sequence of normalization steps: we lowercase everything first, then apply NFKD Unicode normalization. With the normalization step set, we move on to pre-tokenization, where we split the string into tokens, words or punctuation. The simplest way to do this is the Whitespace pre-tokenizer, which splits on either whitespace or punctuation such as commas and exclamation marks. So that's our pre-tokenizer set.
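Those two attributes are set like this (assuming the tokenizer object initialized above):

```python
from tokenizers import normalizers, pre_tokenizers

# Lowercase first, then apply NFKD Unicode normalization.
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Lowercase(), normalizers.NFKD()]
)

# Split incoming text on whitespace and punctuation before the model runs.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```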
Then we have another part that we didn't mention earlier: the trainers module we imported. A trainer is what controls the training process of our tokenizer, and since we have a WordPiece tokenizer model we use a WordPieceTrainer. We set the vocabulary size, the maximum number of tokens our tokenizer will contain, and we also need to set any special tokens. We already set one of them earlier, but we need to make sure it's included; we hopefully won't see the special tokens in the text we're training on, so we insert them up front: the unknown token, padding token, classifier token, separator token and mask token. Then we set a min frequency, the minimum number of times a token needs to be seen to be added to the vocabulary, and finally the prefix that identifies the subword tokens we looked at earlier, which is just the two ## symbols.
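A sketch of that trainer configuration; the min_frequency value of 2 is an assumption, since the exact number isn't stated here:

```python
from tokenizers import trainers

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,               # maximum number of tokens in the vocabulary
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    min_frequency=2,                 # assumed: minimum occurrences to keep a token
    continuing_subword_prefix="##",  # prefix marking subword tokens
)
```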
At this point we're back to the start: downloading the Dhivehi corpus, which we already covered, so we can skip that. We have the corpus and we're feeding it through the text generator we created earlier, and then we go into our training step. The reason we created that generator is that we're training with train_from_iterator, which expects to receive plain lines of text, so we pass it the text generator along with trainer=trainer, the trainer we defined just above that controls the training process. And that's it: once you run it, it works through all of the text coming out of the generator and trains to the specifications you set in your trainer object.
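Putting those pieces together, the training call is a one-liner (using the dv_text generator sketched earlier):

```python
# Streams text through the normalizer, pre-tokenizer and WordPiece trainer.
tokenizer.train_from_iterator(dv_text(), trainer=trainer)
```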
We can check the result: asking the tokenizer for its vocab size with tokenizer.get_vocab_size() returns 30,000, as we would expect. Now, of the five steps I showed you earlier, we've covered one, normalization, two, pre-tokenization, and three, the actual tokenization model, and we've just trained the tokenizer using those first three components. We still need to define the next step, which is post-processing.
The first thing we do is get our classifier token ID, the integer value that represents [CLS], and the integer value that represents [SEP]. We get both of those and use the template processing method from the processors module to build the template. It looks a little messy, but think of it as two cases: one input and two inputs. When you feed just a single sentence into your BERT model, we use the single format: the [CLS] token, followed by A, which is sentence one, followed by the [SEP] token. When we have two sentences, they need to be understood by BERT as separate sentences, maybe a question and a context for example. In that case we format it as the [CLS] token, followed by sentence A, followed by a [SEP] token separating the two sentences, then sentence B, and we finish with another [SEP] token. The other thing that's different in the pair template is the ":1" notation, which is for the token_type_ids array. token_type_ids tells BERT where we have sentence A and where we have sentence B: with a single sentence everything in token_type_ids is zero, whereas with two sentences everything belonging to sentence A is zero and everything belonging to sentence B is one. It's as simple as that. Finally we specify the special tokens we're using, so we list [CLS] and [SEP] again and map each one to its actual token ID integer. That's the post-processing step sorted; we don't need to do anything else with it.
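A sketch of that post-processor setup using TemplateProcessing:

```python
from tokenizers import processors

# Look up the integer IDs assigned to the special tokens during training.
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",              # one sentence
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",   # sentence pair; ":1" sets token_type_ids to 1
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)
```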
After this, our tokenizer will feed the input IDs it creates into the model, and then we move on to the decoder step. Once BERT has finished processing whatever it's processing, it's going to output a token ID, maybe a whole load of token IDs, and we need a way to decode that into something we can understand. Since we're using a WordPiece tokenizer, we decode from WordPiece, and because WordPiece uses the ## prefix we pass that in too. It's already set by default, but I'm including it in case you want to use a different prefix, although I wouldn't recommend it; if you have reason to, you can change it.
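Setting the decoder is a single line:

```python
from tokenizers import decoders

# "##" is already the default prefix; it's written out explicitly here in case
# you ever want to change it.
tokenizer.decoder = decoders.WordPiece(prefix="##")
```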
So that's our tokenizer created. After that we move on to saving it in a format that is most useful to us. I think most of us, when using a tokenizer model, are going to use Hugging Face transformers rather than Hugging Face tokenizers, because they are two different libraries, and when we load a tokenizer with transformers we tend to use the PreTrainedTokenizer or PreTrainedTokenizerFast class. What we can do is save our tokenizer into a format that is compatible with transformers, and with that class specifically. To do that, we first wrap the tokenizer using PreTrainedTokenizerFast. If you've used this before in transformers, you'd usually write from_pretrained and load a model by name, something like bert-base-cased; this time we're not using any of those methods, we're initializing the object directly. We pass our trained tokenizer to the tokenizer_object argument, and we also pass the unknown token, padding token and the other special tokens. That puts it in the correct format, ready for us to save, so we take the full tokenizer we've just initialized and save it with save_pretrained into the bert-base-dv directory, which writes out three files.
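A sketch of the wrapping and saving step; the directory name bert-base-dv is an assumption, so use whatever path suits you:

```python
from transformers import PreTrainedTokenizerFast

full_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,  # the trained tokenizers.Tokenizer from above
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
full_tokenizer.save_pretrained("bert-base-dv")  # writes the tokenizer files here
```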
From now on we can just load it with from_pretrained, like we normally would. One thing to note: normally when we load with from_pretrained we're probably loading from the Hugging Face models hub, but in this case we're loading from a local directory. If your tokenizer lives somewhere else, write the path that leads to that model directory instead. Just be aware of that.
Then we can test our tokenizer with some Dhivehi text, and we see, okay, cool, this is Dhivehi text. I have no idea what any of it means, but it looks great, and from it we get three tensors. In input_ids, the 2 at the start represents the [CLS] token and the 3 at the end represents our [SEP] token, the two special tokens; everything in between is Dhivehi text or punctuation, like the opening and closing brackets and commas. Then we have token_type_ids: we only have one sentence here, sentence A, so all of that should be zero. And we have the attention_mask tensor as well; in this case there's no padding, so the attention mask should just be all ones.
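A sketch of that quick test, loading from the local directory saved above (swap in a hub ID if you're loading from the Hugging Face hub, and add return_tensors="pt" if you want PyTorch tensors rather than Python lists):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("bert-base-dv")

sample = "ދިވެހި"  # any short Dhivehi (Thaana) string will do
encoded = tokenizer(sample)
print(encoded["input_ids"])       # starts with the [CLS] id, ends with the [SEP] id
print(encoded["token_type_ids"])  # all zeros for a single sentence
print(encoded["attention_mask"])  # all ones, since nothing is padded
```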
That's it, our tokenizer is ready. If you do want to load this tokenizer directly rather than going through all of this, you can load it from the jamescalam namespace on the Hugging Face models hub, which also includes a BERT model, so you can load the BERT model as well as the tokenizer. So that is it for this guide, or walkthrough, to building a BERT WordPiece tokenizer for a low-resource language, a language that doesn't currently have any supported tokenizer. I hope this has been useful. As I said, the tokenizer is really just the first step in what we hope will be a great way to support the Dhivehi language, and particularly the AI community over there in what they are building and doing, by putting together a few BERT models that are fine-tuned or built for different purposes. This is the very first step in that. So I hope it's all been very useful, thank you very much for watching, and I will see you in the next one. Bye!