How to Build a BERT WordPiece Tokenizer in Python and HuggingFace
Chapters
0:00 Intro
3:41 WordPiece Tokenizer
5:54 Download Data Sets
7:26 HuggingFace
10:41 Dataset
19:00 Tokenizer
24:43 Tokenizer Walkthrough
26:22 Tokenizer Code
Hi, welcome to this video. We're going to cover how we can build a tokenizer for BERT from scratch.

Typically, when we're using transformer models, we have three main components: the tokenizer, which is obviously what we're going to cover here; the core model; and a head. The tokenizer is what converts our text into tokens that BERT can read. The core model is BERT itself, the core that we build or train through pre-training. And then there's also a head, which allows us to do specialized tasks, so we have a Q&A head or a classification head and so on.
Now, a lot of the time what we can do is just head over to the Hugging Face website, where we have all these tasks listed. If I want a model for question answering, I can click on that, and typically there's something we can use, but obviously it depends on your use case, your language, and a lot of different things. So if we find that there isn't really a model that suits what we need, or it doesn't perform well on our own data, that's where we would start considering whether we need to build our own transformer.

In that case, at the very least, we're probably going to need to build or train from scratch a core model and a transformer head, so we definitely need those two parts. Sometimes we'll find that we also need a tokenizer, but not always. Say our task is something to do with the English language but the model doesn't perform very well on our specific dataset: that doesn't mean the text hasn't been tokenized properly. The existing tokenizer can still handle that text pretty well, as long as it's standard English; what we'll find is that the model just doesn't quite understand the style of language being used. For example, if BERT is trained on blogs from the internet, it's probably not going to do as well on governmental or financial reports. That's the sort of area where you think, okay, we're probably going to need to retrain the core model so it can better understand the style of language used there. And then the head, like I said before, is where we train for a specific use case, so for Q&A we'd probably want to train our model on a question answering dataset so that it can start answering our questions. In that case, if it's in English, we probably don't need a new tokenizer.

But sometimes we do need a tokenizer, because maybe your use case is in a less common language, and in that case you probably will need to build your own tokenizer for BERT. That's really the use case we'll be looking at in this video. So we'll cover building a WordPiece tokenizer, which is the tokenizer used by BERT, and we'll also have a look at where we can get good multilingual datasets from. Let's move on to what the BERT tokenizer is and what it does.
Okay, so like I said before, the BERT tokenizer is called a WordPiece tokenizer, and it's pretty straightforward: what it does is break your words into chunks or pieces of words, hence WordPiece. For example, the word 'surf' is most likely going to return a single token, 'surf', whereas for 'surfing', the 'ing' at the end is a pretty common part of a word, in English at least, so what we'd find is that this word probably gets broken out into two tokens. Where we see the double-hash prefix, that's the standard prefix used to indicate that this is a piece of a word rather than a word in itself. We see that further down as well: 'surfboarding' gets broken into three tokens. If we compare that to 'snowboarding', snowboarding and surfboarding are obviously kind of similar because they're both boarding sports, the difference being one is on surf and the other is on snow. Before we even feed these tokens into BERT, we're making it very easy for BERT to identify where the similarities are between those two words, because BERT knows that one of them is surf, one of them is snow, but both of them are boarding. So this is helping BERT from the start, which I think is pretty cool.
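Just to sketch what that splitting looks like in code, here's a minimal example using the pretrained 'bert-base-uncased' tokenizer from the transformers library. The exact splits you get depend entirely on the learned vocabulary, so treat the output in the comments as indicative only:

```python
from transformers import BertTokenizer

# Pretrained WordPiece tokenizer (assumes the model files are downloadable or cached)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["surf", "surfing", "surfboarding", "snowboarding"]:
    print(word, "->", tokenizer.tokenize(word))

# Indicative output only; the exact pieces depend on the vocabulary, e.g.
# surfboarding -> ['surf', '##board', '##ing']
# snowboarding -> ['snow', '##board', '##ing']
```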
Now, when we're training a tokenizer, we need a lot of text data. When I say a lot, let's say two million paragraphs is probably a good starting point, although ideally you want as much as you can get. What we'll use for training is something called the OSCAR dataset, or OSCAR corpus. OSCAR is a huge multilingual dataset that contains just an insane amount of unstructured text, so it's very, very good, and we can access it through Hugging Face, which is super useful.

Over in our code, if we want to download datasets we need to pip install something called datasets, i.e. pip install datasets. I already have it installed, so I'm just going to do from datasets import load_dataset, and we can also use the datasets.list_datasets method, so let me import datasets as well.
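As a sketch, that setup looks like this. Note that list_datasets has been deprecated in more recent releases of the datasets library in favour of huggingface_hub.list_datasets, so this mirrors the older API used in the video:

```python
# pip install datasets
from datasets import load_dataset
import datasets

# List every dataset currently hosted on the Hugging Face hub
# (older datasets API; newer releases point you to huggingface_hub.list_datasets instead)
all_datasets = datasets.list_datasets()
print(len(all_datasets))  # well over a thousand
```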
Okay, and this gives us a very big list, probably a little bit too big, showing all the datasets that are currently available in the datasets library, which is quite a lot; I think it's a fair bit over a thousand now. How many exactly? Let me run len(datasets.list_datasets()). My internet is very bad at the moment, so it takes forever to download anything, but that's how many datasets there are, and this is one way of viewing them. An easier way, given how many there are across all of Hugging Face and that new ones are being added every day, is to go to Google, type in 'Hugging Face datasets viewer', and click on the Streamlit app. This is a Streamlit app that Hugging Face built that allows you to browse their datasets. So over here we go to the datasets box and I'm going to type in 'oscar', because that's the one we'll be using, and it should pop up on the right. Within OSCAR we have all these different subsets; the first one here is Afrikaans, the language, and then you have all the other languages below. I'm going to use Italian as my example here, and Italian has a lot of data, so if I click on it, it doesn't actually show you anything, which is a little bit annoying, but that's because it's such a huge dataset that it can't show you everything. In fact, that's 101, 102 gigabytes of data. So it's a lot, but that's good for us, because we need a lot of data for training.

If we want to download that dataset, we need to do the following.
We write dataset = load_dataset(...), where 'dataset' is just a variable name, and in there we need to give the dataset name, which is 'oscar', and then we need to specify which part of the dataset we want. Over in the viewer the subset is called unshuffled_deduplicated_it, so it's the deduplicated, unshuffled Italian split. That looks good. The other thing we can do is pass split and specify how much of the data we'd like. Now, when you use this split argument it's still going to download the full dataset to your machine, which is a little bit annoying, but that's how it works; I found that this isn't particularly useful unless you're just loading from your machine and saying, okay, I only want a certain amount of data. If you don't want to download all 101 gigabytes, you can instead pass streaming=True, which is very useful: what this does is create an iterator, and you can iterate through that object and download your data, or samples, one at a time.
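A sketch of those two loading approaches is below. The subset name and the split slicing syntax are as shown in the video; be aware that newer versions of the datasets library may also require trust_remote_code=True for script-based datasets like OSCAR:

```python
from datasets import load_dataset

# Option 1: download the full subset, then slice the train split locally
dataset = load_dataset(
    "oscar",
    "unshuffled_deduplicated_it",
    split="train[:50000]",  # slice syntax; the full 100+ GB subset is still downloaded first
)

# Option 2: stream samples one at a time instead of downloading everything
streamed_dataset = load_dataset(
    "oscar",
    "unshuffled_deduplicated_it",
    split="train",
    streaming=True,  # returns an iterable you can loop over sample by sample
)
```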
Now, because I already have the data downloaded onto my machine, I'm going to use the split approach and take the first, let's say, 500,000 items. Obviously you'd want to be using more samples than this, but I'm going to use this many because otherwise the loading times on all of this get pretty long, and I don't want to be waiting too long. We also need to specify which split we're using: typically we have train, validation and test splits in our datasets; I think we always have the train split in there, and then we can have validation and test splits as well. So we'll load that.
Then what I'm going to do is create a new directory where I'll store all of these text files, because when we're training the tokenizer it expects plain-text files where each sample is separated by a newline. So I'm going to go ahead and create that: I'll make a directory and call it oscar. Then I'm going to loop through our data and convert it into the file format that we need. The first thing I want to do is from tqdm.auto import tqdm; I'm using this so that we have a progress bar and can see where we are in the process, because it can take a while. I'm going to create a text_data list, which we'll populate with all of our text, and a file_count variable set to zero. Then we loop through and create all of our text files. So: for sample in tqdm(dataset), and for now I'm just going to pass. Okay, let's run that.
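As a sketch, the setup at this point looks something like the following (directory and variable names as used in the video; I'm assuming the directory is created with os.mkdir):

```python
import os
from tqdm.auto import tqdm  # progress bar for the loop below

# Directory that will hold the plain-text training files
os.mkdir("oscar")

text_data = []   # buffer of cleaned samples, flushed to a file every 5,000 samples
file_count = 0   # used to number the output files

for sample in tqdm(dataset):
    pass  # loop body filled in below
```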
We see this tqdm progress bar, and even though we're not doing anything in the loop yet, it's already taking a long time to process the data, so I'm actually going to go down to 50,000 samples so I'm not waiting too long. Let me modify that to 50,000, and that should be a bit quicker.
Okay, it's much better now. The first thing I want to do inside the loop: we're going to be splitting each sample on the newline character later, so I first want to remove any newline characters that are already within each sample; otherwise we'd end up splitting our samples midway through a sentence. If I show you a sample, you'll see it has an 'id' field and a 'text' field; we want the text, obviously, so we take sample['text'] and replace any newline characters, hopefully there aren't many, with a space. Then we just append that to our text data, so text_data.append(sample).

Now, we could put all of this into one single file, but that would leave us with a single file which is huge. For 50,000 samples it's not really a problem, but typically we'd be using more like 5 million or 50 million samples, so what I like to do is split the data into multiple text files. So I say: if the length of text_data is equal to, let's say, 5,000, at that point save the text data and then start populating a new file. With open(...), we open a file and save it into the oscar directory we made before, and I'm just going to call it file_{file_count}.txt, converting this into an f-string (I'm not sure why it's highlighting everything here). We're writing that file, so we want fp.write, and we write our text data; but text_data is a list, and what we want is to join every item in that list separated by a newline character, so we write '\n'.join(text_data). That creates our files.

At this point we've created a file, but text_data still has 5,000 items in it, and we're going to keep looping and populating it with even more items, so what we need to do now is re-initialize, or empty, the text_data variable so it can start counting from zero all the way up to five thousand again. Also, at this point we're saving our file, which will initially be file_0.txt, but if we loop through and save again, the count is still going to be zero, so we need to make sure we increase file_count so that we're not just overwriting the same file over and over again. What you can also do, if you want, is add another with open(...) block after the loop, just in case there are any leftover items at the end that haven't been saved into a neat 5,000-sample chunk; I'm not going to do that now, but you can add it if you want to. So it looks pretty good. The only other thing we need here is to make sure the encoding is utf-8, otherwise I think we'll get an error.
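Putting the whole loop together, here's a sketch of the file-building step. It assumes the dataset, text_data, file_count and the oscar directory from the earlier snippet, and the 5,000-samples-per-file figure is just the chunk size chosen in the video:

```python
for sample in tqdm(dataset):
    # strip newlines inside each sample so they don't break our one-sample-per-line format
    sample = sample["text"].replace("\n", " ")
    text_data.append(sample)

    if len(text_data) == 5_000:
        # flush the buffer to a numbered plain-text file, one sample per line
        with open(f"oscar/file_{file_count}.txt", "w", encoding="utf-8") as fp:
            fp.write("\n".join(text_data))
        text_data = []
        file_count += 1

# optional: write any leftover samples that didn't fill a full 5,000-sample chunk
if text_data:
    with open(f"oscar/file_{file_count}.txt", "w", encoding="utf-8") as fp:
        fp.write("\n".join(text_data))
```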
Okay, so that will, or should, create all of our data, so let me open the directory here on the left. We have this empty oscar directory; I'm going to run this, and we should see it get populated. It's pretty quick, there we go. We're building all these plain-text files, and if we open one, we see that we get all of these rows, where each row is a new sample, and as you can see it's all Italian. So that's our data ready, and we can move on to actually training the tokenizer.

The first thing we need to do is get a list of all those files that we can pass to our tokenizer. To do that we'll use the pathlib library, so from pathlib import Path, and we write [str(x) for x in Path('oscar').glob('*.txt')]. Here we need to specify the directory where our files are found, which is just 'oscar', and at the end we add this glob pattern. In this case we don't strictly need to filter: if we just globbed for everything it would select all of the files in that directory, and in our case that would be fine because there's nothing in there other than the text files, but it's good practice, just in case there is anything else in there, to say: within this directory, select only the text files. Then let's have a look: in paths we should have all of our files, and let's see how many of those we have. We have 10, so that's 50,000 samples in total, since there are 5,000 in each file.
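A minimal sketch of that file listing:

```python
from pathlib import Path

# Collect every plain-text training file we just wrote, as a list of path strings
paths = [str(x) for x in Path("oscar").glob("*.txt")]
print(len(paths))  # 10 files of 5,000 samples each in this run
```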
Okay, so now let's initialize our plain tokenizer. We want to do from tokenizers import BertWordPieceTokenizer; if you don't have tokenizers installed it's super easy, all you have to do is pip install tokenizers. Again, this is another Hugging Face library, like transformers or datasets, which we used before. So we load that, and then we initialize our tokenizer with BertWordPieceTokenizer, and in here we have a few different arguments which are useful to understand.

The first one is clean_text: this removes obvious characters that we don't want and converts all whitespace into spaces, so we set that to True. Then we have handle_chinese_chars; I'll leave it as False, but what this does is, if it sees a Chinese character in your training data, it adds spaces around that character, which as far as I know allows those characters to be better represented when we tokenize them. At least I assume so, but obviously I don't know Chinese and I've never trained anything in Chinese, so I can't say for sure; that's what the option does, though. Then strip_accents, which is a pretty relevant one for us: this would convert, say, an 'è' into a plain 'e'. Obviously for Romance languages like Italian those accents are pretty important, so we don't want to strip them (and it's strip, not string). The final one is lowercase: if we want 'A' to be treated as equal to 'a', we set lowercase=True, and in this case I'm happy to have capital characters treated as equal to lowercase characters, that's completely fine. So that initializes our tokenizer.
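A sketch of the initialization with the arguments just discussed (these are the argument names exposed by the tokenizers library's BertWordPieceTokenizer):

```python
# pip install tokenizers
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    clean_text=True,             # strip control characters, normalize whitespace to spaces
    handle_chinese_chars=False,  # don't add spaces around CJK characters
    strip_accents=False,         # keep accents; they matter for Italian
    lowercase=True,              # treat upper- and lower-case characters as equal
)
```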
Now we train it: tokenizer.train(...). In here we first need to pass our files, which is the paths list we built up there, so we train on those. We want to set vocab_size, which is the number of tokens we can have within our tokenizer; it can be fairly small for us because we don't have that much data. I also want to set min_frequency, which I initially thought must mean the minimum number of times a token has to appear in the data for it to be added to the vocabulary, but it's not: it's actually the minimum number of times the trainer must see two tokens or characters together in order to consider merging them into a token of their own. Typically I think people use two for that, which is fine. Then special_tokens: these are the special tokens used by BERT, so for that we have the padding token, the unknown token, the classifier token (which we put at the start of every sequence; let me put this on a new line), the separator token, which we put at the end of a sequence, and the mask token, which is pretty important if we're training that core model. We also have limit_alphabet, which is the number of different single characters we can have within our vocab, so we'll go with 1,000. And wordpieces_prefix: this is what we saw before in the example, the two hashes, which, like I said, just indicate a piece of a word rather than a full word. And that should be it; I don't think there's really anything else that's important for us here.
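A sketch of the training call with those arguments. The vocab_size of 30,000 is an assumption on my part since the video only says it can be fairly small; pick whatever suits your data:

```python
tokenizer.train(
    files=paths,                 # the plain-text files we listed with pathlib
    vocab_size=30_000,           # assumed value; adjust to your data
    min_frequency=2,             # a pair must co-occur at least twice to be merged
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    limit_alphabet=1000,         # cap on the number of distinct single characters
    wordpieces_prefix="##",      # prefix marking word pieces, as seen earlier
)
```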
So we'll train that, and hopefully it will work. Again, this can take a little bit of time, even with our smaller dataset. Let's see what it's showing us; I don't know why, but I think I need to install something here, because I just get this blank output where I think there's supposed to be a loading bar. Then what we want to do at this point is save our new tokenizer, so I'm going to save it into a directory called new_tokenizer: I just write tokenizer.save_model('new_tokenizer'), and that will save a vocab.txt file.
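A sketch of the save step. As far as I'm aware save_model expects the target directory to already exist, so it's created first here; the directory name new_tokenizer is the one used in the video:

```python
import os

os.makedirs("new_tokenizer", exist_ok=True)  # save_model needs an existing directory
tokenizer.save_model("new_tokenizer")        # writes new_tokenizer/vocab.txt
```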
If we have a quick look at what's inside, over here we have new_tokenizer/vocab.txt, and in there we can see all of our tokens. You can actually see how we used that limit_alphabet setting: there are 1,000 single-character tokens, so that block stops at row 1005, and if we scroll all the way up it begins at row six. Those are the 1,000 alphabet characters, the single characters allowed within our tokenizer. Now, the OSCAR dataset is just pulled from the internet, so you do get a lot of random stuff in there; we have a lot of Chinese characters even though we're dealing with Italian. But if we come down further, we start to see some of the Italian words, and these are the tokens. Our tokenizer is going to read our text and split it out into tokens like these, things like 'abb', 'op', 'fin', and then the next step is to convert those tokens into token IDs, which are represented by the row numbers of those tokens. So if it sees 'fin' in the text, it would replace that with the 'fin' token, and then it would replace the 'fin' token with its row number, 2201. Now let's see how it works.
The first thing is: how do we load that tokenizer? We do it as we normally would, so from transformers import BertTokenizer, and we'll say tokenizer = BertTokenizer.from_pretrained(...), and all we do is point that to where we saved it, which is 'new_tokenizer', and that should load. First I'm going to tokenize 'Ciao! Come va?', which is just 'hi, how are you?'.
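A sketch of loading and calling the saved tokenizer; the IDs mentioned in the comment are the ones discussed next, and will differ with a different vocabulary:

```python
from transformers import BertTokenizer

# Load our newly trained WordPiece vocab as a standard BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("new_tokenizer")

encoded = tokenizer("Ciao! Come va?")
print(encoded["input_ids"])
# begins with 2 ([CLS]) and ends with 3 ([SEP]) for this vocab;
# the IDs in between depend entirely on the trained vocabulary
```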
We see that we get these tokens here: we have the number two, which, if you remember, in our vocab text file at the top where we had our special tokens, row number two was the [CLS] token, the classifier token, which we always put at the start of a sequence; at the end we also have this three, which is the separator token. Now, if we go and open that vocab file we built, new_tokenizer/vocab.txt, and read it in, we write vocab = fp.read() and split on newline characters, because every token is separated by a newline. Then we can have a look at the special tokens: we have padding, unknown, [CLS] at position number two and [SEP] at position number three, so if I index in with number two we get [CLS], which lines up with what we have here.
So what we can do is take all of these input ID values and use them to look tokens up in this vocab. We could do this with tokenizer.decode as well, by the way, but I'm going to do it by indexing into the vocab file we built, just to show that this is what the vocab file actually is and how it's used. So if I write that out: I take the encoding, access input_ids, and then I say, for i in that list, print vocab[i], and at the end we'll just add a space.
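As a sketch, that lookup looks something like this (again, tokenizer.decode would do the same job in one call):

```python
# Read the vocab file back in; row number == token ID
with open("new_tokenizer/vocab.txt", "r", encoding="utf-8") as fp:
    vocab = fp.read().split("\n")

print(vocab[2], vocab[3])  # [CLS] [SEP] with this vocabulary

# Map each input ID back to its token by indexing into the vocab list
ids = tokenizer("Ciao! Come va?")["input_ids"]
for i in ids:
    print(vocab[i], end=" ")
```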
Okay, and then we get this: [CLS], the starting classifier token, 'ciao', an exclamation mark, 'come va', a question mark, and the separator token, which marks the end of our sentence, which I think is pretty cool. Now let's try that with something else. One thing I say a lot when in Italy is 'okay capito niente', which means 'I understood nothing', which is very useful. If I print that out, we see that we get [CLS], 'okay', 'capito', 'niente', [SEP]. What we're seeing here is full words; we're not seeing any word pieces, so if I can find one, I think 'responsabilità' will hopefully return a few word pieces. Yes, there we go.
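A sketch of those two checks. The exact pieces that 'responsabilità' splits into depend on the trained vocabulary, so the comments are indicative only:

```python
ids = tokenizer("okay capito niente")["input_ids"]
print(" ".join(vocab[i] for i in ids))
# [CLS] okay capito niente [SEP]  -- all whole words with this vocab

print(tokenizer.tokenize("responsabilità"))
# a longer word like this splits into several '##'-prefixed word pieces
```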
Okay, so we see 'responsabilità' split apart, and these are our different word pieces: it gets separated into not just a single token but four tokens, which is pretty cool.

Now, I think that's it for this video. I don't think there's anything else we need to cover; that's pretty much everything we really need to know for building a WordPiece tokenizer for use with BERT. So, thank you very much for watching, and I will see you in the next one.