How to Build a BERT WordPiece Tokenizer in Python and HuggingFace


0:00 Intro
3:41 WordPiece Tokenizer
5:54 Download Data Sets
7:26 HuggingFace
10:41 Dataset
19:00 Tokenizer
24:43 Tokenizer Walkthrough
26:22 Tokenizer Code

Hi, welcome to this video. We're going to cover how we can build a tokenizer for BERT from scratch.
Typically, when we're using transformer models, we have three main components: the tokenizer,
which is what we're going to cover here, the core model, and a head. The tokenizer is what
converts our text into tokens that BERT can read. The core model is BERT itself, the core that
we build or train through pre-training. And then there's a head, which allows us to do
specialized tasks, so we have a Q&A head, a classification head, and so on.
Now, a lot of the time, what we can do is just head over to the Hugging Face website, and we
can say, okay, we have all these tasks over here. If I want a model for question answering, I can
click on here, and there's usually something we can use. But obviously it depends on your use
case, your language, and a lot of different things. So if we find that there isn't really a model
that suits what we need, or it doesn't perform well on our own data, that's where we would start
considering: okay, do we need to build our own transformer? In that case, at the very least,
we're probably going to need to build or train from scratch the core model and the transformer
head. So we definitely need those two parts, and sometimes we'll find that we also need a
tokenizer, but not always. Say our task is something in the English language, but the model
doesn't perform very well on our specific dataset. That doesn't mean the text hasn't been
tokenized properly. An existing tokenizer can still tokenize all that text pretty well, as long
as it's standard English. What we'll find is that the model just doesn't quite understand the
style of language being used.
So, for example, if BERT is trained on blogs from the internet, it's probably not going to do as
well on governmental or financial reports. That's the sort of area where you think, okay, we're
probably going to need to retrain the core model so it can better understand the style of
language used in there. And then the head, like I said before, is where we're training it for a
specific use case. So for Q&A, we'd want to train our model on a question answering dataset so
that it can start answering our questions. Now in that case, if it's in English, we probably
don't need a new tokenizer. But sometimes we do, because maybe your use case is in a less common
language, and in that case you probably will need to build your own tokenizer for BERT. That's
really the sort of use case we're looking at in this video. So we'll cover building a WordPiece
tokenizer, which is the tokenizer used by BERT, and we'll also have a look at where we can get
good multilingual datasets from. So let's move on to what the BERT tokenizer is and what it does.
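As a concrete picture of what a WordPiece tokenizer does with words like surf, surfing and surfboarding, here is a toy sketch of greedy longest-match-first splitting. The function name `wordpiece_split` and the four-entry vocabulary are hypothetical, chosen only to reproduce the examples in this part of the video; a trained tokenizer learns its vocabulary from data, so its real splits may differ.

```python
def wordpiece_split(word, vocab, prefix="##", unk="[UNK]"):
    """Split one word into word pieces, always taking the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece in the vocabulary fits, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any trained tokenizer.
toy_vocab = {"surf", "snow", "##board", "##ing"}

for word in ["surf", "surfing", "surfboarding", "snowboarding"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

With this toy vocabulary, surfboarding and snowboarding share the pieces ##board and ##ing, which is the similarity the video points out.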
Okay, so like I said before, the BERT tokenizer is called a WordPiece tokenizer, so this text up
here, WordPiece. It's pretty straightforward: what it does is break your words into chunks, or
pieces of words, hence WordPiece. So, for example, the word surf is most likely going to return
a single token, which would be surf. Whereas with the word surfing, the ing at the end of surf
is a pretty common part of a word, in English at least. So what we would find is that this word
would probably get broken out into two tokens, surf and ##ing. Where we see this prefix, the
double hash, that's the standard prefix used to indicate that this is a piece of a word rather
than a word itself. And we see that further down as well: surfboarding gets broken into three
tokens. If we compare that to snowboarding, snowboarding and surfboarding are obviously kind of
similar, because they are both boarding sports, the difference being that one is on surf and the
other is on snow. So before we even feed these tokens into BERT, we're making it very easy for
BERT to identify where the similarities are between those two words, because BERT knows that one
of them is surf and one of them is snow, but both of them are boarding. So this is helping BERT
from the start, which I think is pretty cool. Now, when we're training a tokenizer,
we need a lot of text data. When I say a lot, let's say two million paragraphs is probably a
good starting point, although ideally you want as much as you can get. What we'll use for
training our tokenizer is something called the OSCAR dataset, or OSCAR corpus. OSCAR is a huge
multilingual dataset that contains an insane amount of unstructured text, so it's very, very
good, and we can access it through Hugging Face, which is super useful. So over in our code, if
we want to download datasets, first we need to pip install a library called datasets, so pip
install datasets. I already have it installed, so I'm just going to write from datasets import
load_dataset. And then we can use the datasets.list_datasets method. Sorry, let me import
datasets as well: import datasets.
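The install and listing steps just described might look like the sketch below. The helper name `count_hub_datasets` is hypothetical. `datasets.list_datasets` is the call named in the video; the fallback to `huggingface_hub.list_datasets` is an assumption for library releases where the older function is no longer present. Imports sit inside the function so that nothing touches the network until it is called.

```python
# Install first (shell): pip install datasets


def count_hub_datasets():
    """Return how many datasets the Hugging Face Hub lists (needs internet)."""
    import datasets

    if hasattr(datasets, "list_datasets"):
        # The call used in the video, available in older `datasets` releases.
        return len(datasets.list_datasets())

    # Assumption: on newer releases the listing lives in `huggingface_hub`.
    from huggingface_hub import list_datasets

    return sum(1 for _ in list_datasets())
```

Calling `count_hub_datasets()` can be slow, since it has to page through the whole Hub listing.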
Okay, and this will give us a very big list, probably a little bit too big, showing us all the
datasets that are currently available in the datasets library, which is quite a lot. I think
it's a fair bit over a thousand now. How many? Let me take the length of
datasets.list_datasets(). My internet is very bad at the moment, so it takes forever to download
anything. So this is how many datasets we have in all of Hugging Face, and there are new ones
being added pretty much every day. This is one way of viewing those datasets, but an easier way
is to go to Google, type in datasets viewer Hugging Face, and click on the Streamlit result.
This is a Streamlit app that Hugging Face have built that allows you to go through their
datasets. So over here we go to datasets, and I'm going to type in OSCAR, because that's the one
we'll be using, and then on the right it should pop up. Within OSCAR we have all these different
subsets. The first one here is Afrikaans, the language, and then you have all these other ones
down here. I'm going to be using Italian as my example. Italian has a lot of data, so if I click
on it, it doesn't actually show you anything, which is a little bit annoying, but that's because
it's just a huge dataset and it can't show you everything. In fact, that is 101 or 102 gigabytes
of data, so it's a lot, but that's good for us because we need a lot of data for training. If we
want to download that dataset, we write dataset, which is just a variable name, equals
load_dataset, and then in here we need to write the dataset name, so it's oscar, and then we
need to specify which part of the dataset it is.
Over here it's the subset, unshuffled_deduplicated_it. I can't select it to copy, so never mind,
I'll type it: unshuffled_deduplicated_it. Right, that looks good. The other thing we can do is
write split, and specify how much of the data we'd like. Now, when you use split like this, it's
still going to download the full dataset to your machine, which is a little bit annoying, but
this is how it works. So I found that this isn't particularly useful unless you're loading from
your own machine and you're saying, okay, I only want a certain amount of data. This is 101
gigabytes, which is a lot. If you don't want to download all of that, you can write
streaming=True, and this is very useful. What it does is create an iterator, and you can iterate
through this object and download your samples one at a time. Now, because I already have my data
downloaded onto my machine, I'm going to use split. I'm going to take the first, say, 500,000
items. Obviously you want to be using more samples than this, but I'm just going to use this
many because otherwise the loading times on all this are pretty long, and I don't want to be
waiting for too long.
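A minimal sketch of the two loading options just described, a sliced split versus streaming. The helper names `split_spec`, `load_oscar_italian` and `take` are hypothetical. The dataset name `oscar`, the subset `unshuffled_deduplicated_it`, the `split` slice and `streaming=True` come from the video; newer releases of the datasets library may load this dataset differently or need extra arguments, so treat the call as it stood at recording time. The import sits inside the function so that nothing is downloaded until it is called.

```python
from itertools import islice


def split_spec(n_samples, split="train"):
    """Build a slice string such as 'train[:50000]' for the `split` argument."""
    return f"{split}[:{n_samples}]"


def load_oscar_italian(n_samples=50_000, streaming=False):
    """Load Italian OSCAR as a sliced split or as a streaming iterator."""
    from datasets import load_dataset  # needs `pip install datasets`

    if streaming:
        # Yields samples on demand instead of downloading the full subset.
        return load_dataset(
            "oscar", "unshuffled_deduplicated_it", split="train", streaming=True
        )
    # Slicing still downloads the whole subset the first time it is used.
    return load_dataset(
        "oscar", "unshuffled_deduplicated_it", split=split_spec(n_samples)
    )


def take(iterable, n):
    """Return the first n items of any iterator, such as a streaming dataset."""
    return list(islice(iterable, n))
```

For example, `take(load_oscar_italian(streaming=True), 5)` would fetch only five samples, while `load_oscar_italian(500_000)` matches the sliced load in the video.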
We also need to specify which split of the dataset we're using here. Typically we have train,
validation, and test splits in our datasets. I think we always have the train set in there, and
then we can have validation and test sets as well. So we'll load that. Then what I'm going to do
is create a new directory where I'm going to store all of these text files. When we're training
the tokenizer, it expects plain text files where each sample is separated by a new line, so I'm
going to go ahead and create that for us. I'm going to make a directory, which I'll call oscar,
and then loop through our data and convert it into the file format we need. The first thing I
want to do is from tqdm.auto import tqdm. I'm using this so that we have a progress bar and can
see where we are in the process, because this can take a while. I'm going to create this
text_data list, which we'll populate with all of our text, and this file_count, which starts at
zero. We're going to loop through and create all our text files using these. So what I want to
do is for sample in tqdm(dataset), and for now I'm just going to pass. Okay, let's run that, and
we see that we get this tqdm bar. We're not even doing anything at the moment and it's already
taking a long time to process the data, so I'm going to go down to 50,000 so I'm not waiting too
long. Let me modify that to 50,000, and that should be a little
bit quicker. Okay, it's much better. Now, the first thing I want to do: we're going to be
separating each sample with a newline character, so I want to first remove any newline
characters that are already within each sample. Otherwise we'd be splitting our samples midway
through a sentence. So sample equals sample, and in here, if I show you a sample, we have id and
we also have text. We want the text, obviously, so we write text, and we're going to replace
newline characters, if there are any (hopefully there aren't), with a space. Then what we want
to do is append that to our text data, so text_data.append(sample). Now, we could put all of
this in a single file, but that leaves us with one single file which is huge. For 50,000 samples
that's not really a problem, but typically we're not going to be using that many samples; it's
going to be more like 5 million, 50 million, and so on. So what I like to do is split the data
into multiple text files. I say: if the length of text_data is equal to, let's say, 5,000, at
that point I want you to save the text data and then start populating a new file. So with open:
we need to save the file into this oscar directory that we built before, so oscar, and I'm just
going to call it file_, then the file count, .txt. We convert this into an f-string. I'm not
sure why it's highlighting everything here. We're writing to that file, so we want to do
fp.write, and then we write our text data. But text_data is a list, and what we want to do is
join every item in that list, separated by a newline character, so we write this.
And that creates our files. Now, at this point we've created a file, and our text_data still has
5,000 items in it, and we're going to keep looping through and populating it with even more
items. So what we need to do now is reinitialize, or empty, our text_data variable, so that it's
empty again and can start counting from zero all the way up to 5,000 again. At this point we're
saving our file, so this will initially be file_0.txt, but if we loop through and do it again,
it's still going to be zero. So we need to make sure we're increasing that file count so that
it's not remaining the same and just overwriting the same file over and over again. What you can
also do, if you want, is add another with open down here, just in case there are any leftover
items at the end that haven't been saved into a neat 5,000-item chunk. I'm not going to do that
now, but you can add that in if you want to. Okay, so it looks pretty good. The only thing we do
need here is to make sure the encoding is utf-8, otherwise I think we'll get an error. So that
should create all of our data. Let me open that directory here on the left. We have this empty
oscar directory. I'm going to run this, and we should see it get populated. It's pretty quick.
There we go, we're building all these plain text files here. If we open one, ignore that, we see
that we get all of these, and each row here is a new sample. And as you can see, it's all
Italian.
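The file-writing loop described above can be sketched as follows. The function name `write_text_chunks` is hypothetical; the `text` key, the `file_<n>.txt` naming, the 5,000-sample chunks and the utf-8 encoding follow the narration. The progress bar is left out to keep the sketch dependency-free, and the optional "leftover items" write mentioned in the video is included.

```python
import os


def write_text_chunks(samples, out_dir="oscar", chunk_size=5_000):
    """Write samples to out_dir/file_<n>.txt, one sample per line."""
    os.makedirs(out_dir, exist_ok=True)
    text_data = []
    file_count = 0

    def flush():
        # Save the current chunk, then reset the buffer and bump the counter
        # so the next chunk does not overwrite this file.
        nonlocal text_data, file_count
        path = os.path.join(out_dir, f"file_{file_count}.txt")
        with open(path, "w", encoding="utf-8") as fp:
            fp.write("\n".join(text_data))
        text_data = []
        file_count += 1

    for sample in samples:
        # Remove newlines inside a sample so each sample stays on one line.
        text_data.append(sample["text"].replace("\n", " "))
        if len(text_data) == chunk_size:
            flush()

    if text_data:
        # Leftover samples that did not fill a whole chunk.
        flush()
    return file_count
```

To get the progress bar from the video, the `samples` argument could be wrapped as `tqdm(dataset)` after `from tqdm.auto import tqdm`.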
That's our data, it's ready, and what we can do is move on to actually training the tokenizer.
The first thing we need to do is get a list of all those files that we can pass on to our
tokenizer. To do that we'll use the pathlib library, so from pathlib import Path, and we just
write str(x) for x in Path, and here we need to specify the directory where our files will be
found, which is just oscar. At the end we add this glob. In this case we don't strictly need to
filter by file type, because there are no files in that directory other than the text files. But
it's good practice, just in case there's anything else in there, to use the glob pattern to say:
within this directory, just select all text files. Okay, then let's have a look. In paths we
should have all of our files, and let's see how many of those we have.
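The path-collecting step might look like this minimal sketch. The function name `list_text_files` is hypothetical, and the exact glob pattern is not read out in the video, so `**/*.txt` is an assumption that matches "just select all text files".

```python
from pathlib import Path


def list_text_files(directory="oscar"):
    """Return every .txt file under `directory` as a list of path strings."""
    # Assumed pattern: any .txt file in the directory or its subdirectories.
    return [str(x) for x in Path(directory).glob("**/*.txt")]
```

With the files written earlier, `paths = list_text_files("oscar")` followed by `len(paths)` gives the file count checked in the video.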
Okay, so we have 10 of those, so 50,000 samples in total, because you have 5,000 in each file.
Now let's initialize our plain tokenizer. We want to do from tokenizers, and if you don't have
tokenizers installed, it's super easy: all you have to do is pip install tokenizers. Again, this
is another Hugging Face library, like transformers or datasets, which we used before. From
tokenizers we want to import the BertWordPieceTokenizer, which is shown there. So we load that,
and then we initialize our tokenizer with BertWordPieceTokenizer. In here we have a few
different parameters which are useful to understand. The first one is clean_text. This just
removes characters that we don't want, such as control characters, and converts all whitespace
into spaces, so we can set that to True. We have handle_chinese_chars. I'll leave this as False,
but what it does is, if it sees a Chinese character in your training data, it adds spaces around
that character, which as far as I know allows those characters to be better represented when
we're tokenizing them, I assume. But I don't know Chinese and I have never trained anything in
Chinese, so I don't know.
But that's what it does. strip_accents: this is a pretty relevant one for us. Say we have an
accented e like this; it will convert it into a plain e. For Romance languages like Italian,
those accents are pretty important, so we don't want to strip those. (It's also strip, not
string.) And then the final one is lowercase. If we want a capital letter to be treated as equal
to its lowercase letter, we would set lowercase equal to True. In this case, for me, I'm happy
to have those capital characters be equal to lowercase characters; that's completely fine.
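To see what the strip_accents and lowercase options mean for Italian text, here is a standard-library approximation. It is an illustration of the effect only, not the tokenizers library's own code, and the function names `strip_accents` and `normalize` are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Drop combining accent marks, so an accented e becomes a plain e."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks produced by the decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize(text, strip=False, lowercase=True):
    """Apply the two options roughly as chosen in the video by default."""
    if lowercase:
        text = text.lower()
    if strip:
        text = strip_accents(text)
    return text


# Accents kept (the choice made for Italian) versus accents stripped.
print(normalize("Responsabilit\u00e0"))
print(normalize("Responsabilit\u00e0", strip=True))
```

Keeping accents preserves distinctions such as è ("is") versus e ("and") in Italian, which is why stripping them is turned off here.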
And handle_chinese_chars should be written like this, sorry. So that initializes our tokenizer.
Now we train it, so tokenizer.train. In here we first need to pass our files, so that's the
paths list we built up here. Training with those, we want to set the vocab_size, which is the
number of tokens that we can have within our tokenizer. It can be fairly small for us because we
don't have that much data. I want to set min_frequency, which initially I thought must mean the
minimum number of times a token must be found in the data for it to be added to the vocabulary.
But it's not. It's actually the minimum number of times it must see two different tokens or
characters together in order to consider them a token by themselves, so merged together.
Typically I think people use two for that, which is fine. Then special_tokens: these are the
special tokens used by BERT. For that we have the padding token, the unknown token, the
classifier token, which we put at the start of every sequence (let me put this on a new line),
the separator token, which we put at the end of a sequence, and the mask token, which is pretty
important if we are training the core model. We also have limit_alphabet, which is the number of
different single characters that we can have within our vocab, so we'll go with 1,000. And
wordpieces_prefix: this is what we saw before in the example, where we had the two hashes, and
like I said it just indicates a piece of a word rather than a full word. That should be it; I
don't think there's anything else that is important for us. So we'll train that, and hopefully
it will work. Again, this will take a little bit of time, even with our smaller dataset.
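Putting the initialization and training arguments together gives a sketch like the one below. The wrapper name `train_wordpiece` is hypothetical. The `vocab_size` default of 30,000 is an assumption, since the video only says it can be small, and `lowercase=True` is my reading of the narration. The import sits inside the function so that defining it does not require the tokenizers library.

```python
# The order matters: it fixes [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def train_wordpiece(paths, vocab_size=30_000):
    """Train a BERT WordPiece tokenizer on a list of plain text files."""
    from tokenizers import BertWordPieceTokenizer  # pip install tokenizers

    tokenizer = BertWordPieceTokenizer(
        clean_text=True,             # drop unwanted characters, normalize whitespace
        handle_chinese_chars=False,  # left off, as in the video
        strip_accents=False,         # keep accents, they matter in Italian
        lowercase=True,              # treat capitals and lowercase as equal
    )
    tokenizer.train(
        files=paths,
        vocab_size=vocab_size,       # assumption: no exact number is given
        min_frequency=2,             # minimum pair count before a merge
        special_tokens=SPECIAL_TOKENS,
        limit_alphabet=1000,         # cap on distinct single characters
        wordpieces_prefix="##",      # marks a piece that continues a word
    )
    return tokenizer
```

With the file list from earlier, this would be called as `tokenizer = train_wordpiece(paths)`.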
So let's see what it's showing us. I think I need to install something here, because I just get
this blank output; I think it's supposed to be a loading bar. What we do at this point is save
our new tokenizer. I'm going to save it as new_tokenizer, so I just write tokenizer.save_model,
and that is going to go to the new_tokenizer directory. That will save this vocab.txt file, and
if we have a quick look at what that has inside, so new_tokenizer, vocab.txt, we can see all of
our tokens. You can actually see how the limit_alphabet was used. We can see that there are
1,000 single-character tokens: this stops at row 1,005, and going all the way up here, it begins
at row six. So those are the 1,000 alphabet characters, the single characters that are allowed
within our tokenizer. Now, the OSCAR dataset is just pulled from the internet, so you do get a
lot of random stuff in there; we have a lot of Chinese characters even though we're dealing with
Italian. But if we come down here we start to see some of those Italian words, and these are the
tokens. Our tokenizer is going to read our text and split it out into these tokens, so like abb,
op, fin. The next step is to convert those into token IDs, which are represented by the row
numbers of those tokens (counting from zero). So if it saw fin in the text, it would replace
that with the fin token, and then it would replace the fin token with this 2201. Now let's see
how it works.
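Saving the tokenizer and reading token IDs back out of vocab.txt can be sketched as below. The helper names `save_tokenizer` and `load_vocab` are hypothetical; `save_model` and the `new_tokenizer` directory come from the video. Creating the directory first is an assumption about what `save_model` needs. An ID here is the zero-based line index, so it is one less than a 1-based row number shown in an editor.

```python
import os


def save_tokenizer(tokenizer, directory="new_tokenizer"):
    """Save the trained tokenizer's vocab.txt into `directory`."""
    # Assumption: the target directory should exist before save_model runs.
    os.makedirs(directory, exist_ok=True)
    return tokenizer.save_model(directory)


def load_vocab(vocab_path):
    """Map each token in vocab.txt to its ID, the zero-based line index."""
    with open(vocab_path, "r", encoding="utf-8") as fp:
        lines = fp.read().split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty string after a trailing newline
    return {token: idx for idx, token in enumerate(lines)}
```

After saving, `load_vocab("new_tokenizer/vocab.txt")["[CLS]"]` should come back as 2, given the special-token order used in training.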
The first thing is, how do we load that tokenizer? We do it as we normally would: from
transformers import, and we're using a BERT tokenizer, so import BertTokenizer. Then we say
tokenizer equals BertTokenizer.from_pretrained, and all we do is point that to where we saved
it, so new_tokenizer. That should load. Okay, so first I'm going to tokenize "ciao! come va?",
which is just "hi, how are you?".
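The loading step just described, as a minimal sketch. The wrapper name `encode_with_new_tokenizer` is hypothetical; `BertTokenizer.from_pretrained` pointed at the saved directory is what the video uses. The import sits inside the function so that defining it does not require transformers to be installed.

```python
def encode_with_new_tokenizer(text, directory="new_tokenizer"):
    """Load the saved tokenizer with transformers and encode one string."""
    from transformers import BertTokenizer  # pip install transformers

    # Note: BertTokenizer has its own do_lower_case and strip_accents
    # settings. Passing them explicitly here is an option if they should
    # match what was used when training; the video relies on the defaults.
    tokenizer = BertTokenizer.from_pretrained(directory)
    return tokenizer(text)
```

For example, `encode_with_new_tokenizer("ciao! come va?")["input_ids"]` would give the list of token IDs for the greeting.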
We see that we get these token IDs. We have number 2 here at the start, which, if you remember,
in our vocab file at the top we had our special tokens, and at row number two we had the CLS
token, the classifier token, which we always put at the start of a sequence. At the end we also
have this 3, which is the separator token. Now, if we open that vocab file that we built, so
new_tokenizer/vocab.txt, and read that in, let's call it vocab, we want to split by newline
characters, like so, because every token is separated by a new line. Let's have a look at the
special tokens: we have padding, unknown, CLS at position number two, and the separator at
position number three. So if I index number two, we get CLS, which aligns with what we have
here. What we can do is take all of these ID values and use them to look up tokens in this
vocab. We could do that using tokenizer.decode as well, by the way, but I'm going to do it by
indexing into the vocab file that we built, just to show that that's what the vocab file
actually is and how it's used. So we take this, and I want to access input_ids. Then what I'm
going to do is say, for i in that list, print vocab[i], and at the end we'll just add a space.
And then we get this: CLS, the starting classifier token, then ciao, exclamation mark, come, va,
question mark, and SEP, the separator token, which marks the end of our sentence, which I think
is pretty cool. Now let's try that with something
else. We'll take this again. Something I say a lot when I'm in Italy is "ho capito niente",
which means "I understood nothing", which is very useful. If I print that out, we see that we
get CLS, ho, capito, niente, and the separator. Now, what we're seeing here is full words; we're
not seeing any word pieces. So, if I can find it, I think this will hopefully return a word
piece: responsabilità, like this. Let me try. This will hopefully return a few word pieces. Yes,
there we go.
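The ID-to-token lookup used in these last few examples can be sketched as follows. The function name `ids_to_tokens`, the miniature vocabulary and the ID list are hypothetical stand-ins; real IDs depend on the vocab.txt produced by training.

```python
def ids_to_tokens(input_ids, vocab):
    """Look up each token ID by indexing into the vocab list."""
    return [vocab[i] for i in input_ids]


# Hypothetical miniature vocab: special tokens first, then a few entries.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
             "!", "?", "ciao", "come", "va"]

# Hypothetical IDs for "ciao! come va?", framed by [CLS]=2 and [SEP]=3.
toy_ids = [2, 7, 5, 8, 9, 6, 3]

print(" ".join(ids_to_tokens(toy_ids, toy_vocab)))
```

With the real tokenizer, the vocab list would come from reading new_tokenizer/vocab.txt and splitting on newlines, and the IDs from the `input_ids` of the encoded text.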
Okay, so we see respon, sa, bili, tà, and these are our different word pieces. So this gets
separated into not just a single token but four tokens, which is pretty cool. I think that's it
for this video. That's pretty much everything we really need to know for building a WordPiece
tokenizer for use with BERT. So thank you very much for watching, and I will see you in the next
one. Bye.