
How to Build a BERT WordPiece Tokenizer in Python and HuggingFace


Chapters

0:00 Intro
3:41 WordPiece Tokenizer
5:54 Download Data Sets
7:26 HuggingFace
10:41 Dataset
19:00 Tokenizer
24:43 Tokenizer Walkthrough
26:22 Tokenizer Code

Whisper Transcript

00:00:00.000 | Hi, welcome to this video. We're going to cover how we can build a tokenizer for BERT from scratch
00:00:06.960 | So typically when we're using transformer models we have three main components: we have the
00:00:19.440 | tokenizer, which is obviously what we're going to cover here, we have the core model,
00:00:25.520 | and we also have a head. So the tokenizer is obviously what converts our text into
00:00:33.360 | tokens that BERT can read, the core model is BERT itself, so BERT has kind of like a core
00:00:40.240 | which we build or train through pre-training and then there's also a head which allows us to do
00:00:46.400 | specialized tasks so we have like a Q&A head or a classification head and so on
00:00:54.400 | Now a lot of the time what we can do is just head over to the Hugging Face website over here and we
00:01:03.440 | can say okay, we have all these tasks over here, if I want a model for question answering I can click
00:01:09.440 | on here and typically there's usually something we can use but obviously it depends on your use
00:01:15.760 | case your language and a lot of different things so if we find that there isn't really a model that
00:01:22.240 | suits what we need or it doesn't perform well on our own data that's where we would start
00:01:27.520 | considering okay do we need to build our own transformer so in that case at the very least
00:01:34.880 | we're probably going to need to build or train from scratch a core model and the
00:01:42.480 | transformer head, so we definitely need those two parts, and sometimes we'll find that we also need a
00:01:50.080 | tokenizer, but not always. Because you think, okay, we already have tokenizers. Say our task is
00:01:55.840 | something to do with the English language, but the model doesn't perform very well on our specific
00:02:00.960 | data set, that doesn't mean that it hasn't been tokenized properly, it can still
00:02:05.680 | tokenize all that text probably pretty well as long as it's standard English, but what we'll
00:02:12.320 | find is that the model just doesn't quite understand the style of language being used
00:02:17.840 | so for example if Bert is trained on blogs on the internet it's probably not going to do as well on
00:02:26.560 | governmental or financial reports so that's the sort of area where you think okay we're probably
00:02:33.200 | going to need to retrain the core model so it can better understand the style of language
00:02:38.320 | used in there and then the head is like I said before that's where we are training it specifically
00:02:44.960 | for a specific use case so for Q&A we'd probably want to train our model on a specific question
00:02:52.320 | answering data set, yeah, so that it can start answering our questions. Now in that case,
00:03:00.480 | and it's in English we probably don't need the tokenizer but sometimes we do need a tokenizer
00:03:05.200 | because maybe your use case is in a less common language and in that case you probably will need
00:03:14.800 | to build your own tokenizer for Bert and that's really the sort of use case that we would be
00:03:21.040 | looking at in this video. So we'll cover that, building a WordPiece tokenizer, which is the
00:03:27.280 | tokenizer used by BERT, and we'll also have a look at how we can get, or where we can get, good
00:03:33.200 | multilingual data sets from as well. So let's move on to what the BERT tokenizer is and what it does.
00:03:40.320 | Okay, so like I said before, the BERT tokenizer, it's called a WordPiece tokenizer, so this
00:03:47.600 | text up here, WordPiece, and it's pretty straightforward: what it does is it breaks
00:03:54.880 | your words into chunks or pieces of words, hence WordPiece. So for example the word surf is
00:04:02.480 | probably, most likely, going to return a single token, which would be surf. Whereas for the word surfing,
00:04:08.160 | the ing at the end of surf is a pretty common part of a word, in English at least, so what we
00:04:14.480 | would find is this word here would probably get broken out into these two tokens. Now where we see
00:04:20.320 | this prefix the double hashtag that's the standard prefix used to indicate that this is a piece of a
00:04:27.200 | word rather than a word itself and then we see that further down as well so surfboarding gets
00:04:32.480 | broken into three tokens. And then if we, for example, compare that to snowboarding: snowboarding and
00:04:38.800 | surfboarding are obviously kind of similar because they are both boarding sports, the difference being
00:04:44.480 | one is on surf the other one is on snow and before we even feed these tokens into Bert we're making
00:04:52.560 | that very easy for Bert to identify where the similarities are between those two objects because
00:04:59.280 | Bert knows that okay one of them is surf one of them is snow but both of them are boarding so this
00:05:05.200 | is helping BERT from the start, which is, I think, pretty cool.
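To make the WordPiece idea concrete, here is a minimal sketch using the pretrained bert-base-uncased tokenizer from transformers (an assumption picked just for illustration; the tokenizer we train later in the video behaves the same way, and the exact splits always depend on the learned vocabulary):

    from transformers import BertTokenizer

    # any WordPiece vocab will do; bert-base-uncased is just a convenient pretrained example
    tok = BertTokenizer.from_pretrained('bert-base-uncased')

    print(tok.tokenize('surf'))          # e.g. ['surf']
    print(tok.tokenize('surfing'))       # e.g. ['surf', '##ing']
    print(tok.tokenize('surfboarding'))  # e.g. ['surf', '##board', '##ing']
    print(tok.tokenize('snowboarding'))  # e.g. ['snow', '##board', '##ing']

    # the '##' prefix marks a piece that continues the previous token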
00:05:13.680 | Now, when we're training a tokenizer, we need a lot of text data. Well, when I say a lot, let's say two million paragraphs is probably
00:05:23.520 | a good starting point, although ideally you want as much as you can get. So what we will use for our
00:05:32.560 | training data is something called the OSCAR data set, or OSCAR corpus. Now OSCAR is just a huge multilingual
00:05:42.640 | data set that contains just an insane amount of unstructured text, so it's very, very good, and
00:05:49.440 | we can access it through Hugging Face, which is super useful. So over in our code, first, if we
00:05:58.720 | want to download data sets we need to pip install something called datasets, so pip
00:06:05.280 | install datasets. I already have it installed, so I'm just going to go from datasets
00:06:11.760 | import load_dataset. Okay, and then through that we can use the datasets.list_datasets() method.
00:06:25.920 | Let me, sorry, let me import datasets as well, import datasets.
00:06:33.680 | okay and this will give us a very big list probably a little bit too big
00:06:39.040 | showing us all the data sets that are currently available in the data sets library which is quite
00:06:45.840 | a lot, I think it's like a thousand, just a fair bit over a thousand now. Okay, so we have all of these,
00:06:54.400 | which is a lot. How many? Let me do len(datasets.list_datasets()). My internet is very bad at
00:07:07.280 | the moment, so it takes forever to download anything, but there are our data sets, and this
00:07:13.840 | is one way of viewing those data sets. That's how many we have in all of Hugging Face, and there are new ones being added like every day.
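For reference, the couple of lines being typed here look roughly like this (a sketch; list_datasets has moved around between datasets versions, so check the version you have installed):

    import datasets

    # list every dataset currently available on the Hugging Face hub
    all_datasets = datasets.list_datasets()
    print(len(all_datasets))   # well over a thousand at the time of recording
    print(all_datasets[:5])    # peek at the first few dataset names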
00:07:19.200 | But an easier way of doing this is
00:07:25.120 | to go to Google, type in Hugging Face datasets viewer, and just click on the
00:07:32.240 | Streamlit Hugging Face result. So this is a Streamlit app Hugging Face have built that allows you to go through
00:07:39.120 | their data sets. So you see, over here we go over to data sets and I'm going to type in Oscar, because
00:07:46.080 | that's the one we'll be using, Oscar. Okay, I type Oscar, and then on the right it should
00:07:52.960 | pop up. So within Oscar we have all these different data sets, so the first one here is Afrikaans,
00:07:58.720 | the language, and then you have all these other ones down here. I'm going to use Italian as my
00:08:06.320 | example here but Italian has a lot of data so if I click on here it doesn't actually show you
00:08:14.000 | anything which is a little bit annoying but it's because it's just a huge data set it can't show
00:08:18.400 | you everything. So, in fact, that is 101, 102 gigabytes of data there, so it's a lot, but
00:08:27.840 | that's good for us because we need a lot of data for training so if we want to download that data
00:08:33.200 | set, we need to do this. So we write dataset, and it's just a variable name, and we
00:08:39.920 | want to write load_dataset, and then in here we need to write the data set name, so it's Oscar,
00:08:46.880 | and then we need to specify which part of the data set it is
00:08:51.200 | so over here it's a subset, it's unshuffled_deduplicated_it,
00:08:58.080 | if I can, can't select it, so never mind,
00:09:07.840 | so it's deduplicated, and it's also unshuffled, so unshuffled_deduplicated_it. Right, that looks
00:09:16.800 | good. And then the other thing that we can do is, we can write split, and we can specify how much of
00:09:23.120 | the data we'd like. Now, when you use this split it's still going to download the full data
00:09:29.760 | set to your machine, which is a little bit annoying, but this is how it works. So I found
00:09:36.480 | that this isn't particularly useful unless you're just loading it from your machine and you're
00:09:40.640 | saying, okay, I only want a certain amount of data. What you can do, so this is 101 gigabytes,
00:09:48.480 | it's a lot, if you don't want to download all that, you can write streaming equals true, and this is
00:09:54.560 | very useful. What this will do is create an iterator, and you can iterate through this object
00:09:59.360 | and download your data or samples one at a time.
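Put together, the two ways of loading the Italian OSCAR subset look something like this (a sketch; the subset name is the one shown in the dataset viewer, and the 50,000-sample slice is just the number used later in this video):

    from datasets import load_dataset

    # option 1: stream samples one at a time, nothing is downloaded up front
    dataset = load_dataset('oscar', 'unshuffled_deduplicated_it',
                           split='train', streaming=True)
    for sample in dataset:
        print(sample['text'][:80])   # each sample is a dict with 'id' and 'text'
        break

    # option 2: if the data is already cached locally, take a slice with the split syntax
    dataset = load_dataset('oscar', 'unshuffled_deduplicated_it', split='train[:50000]')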
00:10:06.880 | Now, because I already have my data downloaded onto my machine, I'm going to use the split method, so I am going to take the
00:10:17.200 | first I'm going to say 500,000 items simply because I mean obviously you want to be using
00:10:27.200 | more samples than this but I'm just going to use this many because otherwise the loading
00:10:32.880 | times on all this are pretty long, and I don't want to be waiting for too long.
00:10:37.680 | And we also need to specify which data set split we're using here, so typically we have our
00:10:50.320 | train, validation or test splits in our data sets. I think we always have the train set in there,
00:10:57.840 | and then we can have validation test sets as well so we'll load that and then what I'm going to do
00:11:05.280 | is I'm going to create a new directory where I'm going to store all of these text files so
00:11:13.120 | when we're training the tokenizer, it expects plain text files where each sample is
00:11:19.680 | separated by a new line. So I'm going to go ahead and create that data set for us, so I'm going to
00:11:28.240 | make a directory, I'm going to call this oscar, and then what I'm going to do is loop through
00:11:35.280 | our data here and convert them into the file format that we need so first thing I want to do
00:11:43.520 | is from tqdm.auto import tqdm, and I'm using this so that we have a progress bar, so you
00:11:53.440 | can see where we are in that process because this can take a while so I'm going to create this text
00:12:00.880 | data list so populate this with all of our text and I'm going to use this file count so that's
00:12:07.760 | zero so this is just going to loop through and we're going to create all our text files using
00:12:13.840 | this here. So what I want to do is, for sample in tqdm(dataset), yes,
00:12:24.400 | for now I'm just going to pass okay and let's run that and we see that we get
00:12:31.840 | this bar, this tqdm bar, you see we're not even doing anything at the moment and it's already
00:12:39.920 | taking a long time to process the data. So I'm actually going to, let's go down to
00:12:46.800 | 50,000 so I'm not waiting too long. So let me modify that, 50,000, and that should be a little
00:12:57.280 | bit quicker. Okay, it's much better. Now, the first thing I want to do is, we're going to be splitting each
00:13:05.120 | sample by a newline character so I want to first remove any newline characters that are already
00:13:10.640 | within each sample otherwise we're going to be splitting our samples like midway through a
00:13:16.560 | sentence. So, sample equals sample, and in here, if I can show you, I can show you a sample,
00:13:26.400 | yeah, we have id and then we also have the text. We want the text, obviously, so we just write text,
00:13:32.080 | and we're going to replace newline characters, if there are any, hopefully there aren't any,
00:13:37.360 | with a space, and then what we want to do is just append that to our text data, so text_data
00:13:44.800 | dot append sample. Now, we could put all of this in a single file,
00:13:53.440 | but then that leaves us with one single file which is huge so I mean for 50 000 samples it's not
00:13:59.920 | really a problem but we're not going to typically be using that many samples it's going to be more
00:14:06.080 | like 5 million 50 million or so on so what I like to do is just split the data into multiple text
00:14:14.160 | files so what I do is I say if the length of the text data is equal to let's say 5 000 at that point
00:14:25.120 | I want you to save the text data and then restart again and start populating a new file. So let's
00:14:33.280 | say with open so we need to open the file we need to save it into this oscar directory that we built
00:14:40.160 | before, so oscar, and I'm just going to call it file_{file_count}.txt, so we convert this into
00:14:52.960 | an f-string. I'm not sure why it's highlighting everything here. And we are writing
00:15:02.560 | that file, so with that we want to do fp.write, and then we just write our text data,
00:15:12.560 | but this is a list, and what we want to do is, we want to join
00:15:19.120 | every item in that list separated by a new line character so we write this
00:15:25.920 | and that creates our files now at this point we've created a file and our text data still has
00:15:34.880 | 5 000 items in it and we're going to start looping through and populating it with even more items
00:15:40.480 | so what we need to do now is reinitialize, or empty, our text data variable so that it's empty again,
00:15:47.280 | it can start counting from zero all the way up to five thousand again okay and so at this point
00:15:52.800 | we're saving our file, so this will initially be file_0.txt, but if we loop
00:16:01.360 | through again and do it again, it's still going to be zero, so we need to make sure we are
00:16:05.120 | increasing that file count so that it's not remaining the same just overwriting the same
00:16:11.120 | file over and over again. Okay, and what you can also do, if you want, is you can add another
00:16:20.800 | block like this down here, with open, just in case there are any leftover items at the end
00:16:28.960 | that haven't been saved into a neat 5,000 chunk. I'm not going to do that now, you can add
00:16:35.680 | that in if you want to (it's included in the sketch below). Okay, so it looks pretty good. The only thing that we do need here is to
00:16:40.480 | actually make sure the encoding is utf-8, otherwise I think we'll get an error if we
00:16:48.400 | miss that. Okay, so that will, or should, create all of our data.
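Collected into one place, the loop built up over the last few minutes looks something like this (the final with-open block is the optional leftover flush mentioned above, which the video skips):

    import os
    from tqdm.auto import tqdm

    os.makedirs('oscar', exist_ok=True)

    text_data = []
    file_count = 0

    for sample in tqdm(dataset):
        # remove newlines inside a sample so one line == one sample in the output file
        sample = sample['text'].replace('\n', ' ')
        text_data.append(sample)
        if len(text_data) == 5_000:
            # write a 5,000-sample chunk, one sample per line
            with open(f'oscar/file_{file_count}.txt', 'w', encoding='utf-8') as fp:
                fp.write('\n'.join(text_data))
            text_data = []
            file_count += 1

    # optional: flush whatever is left over after the loop
    if text_data:
        with open(f'oscar/file_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))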
00:16:57.440 | So let me open that directory here on the left. We have this empty oscar directory, I'm going to run this, and we should
00:17:06.240 | see it get populated so it's pretty quick there we go so we're building all these plain text files
00:17:12.880 | here and if we open that we ignore that and we see that we get all of these so each row here
00:17:23.680 | is a new sample. Okay, and as you can see, it's all Italian, so
00:17:30.640 | that's our data, it's ready, and what we can do is move on to actually training the tokenizer. So
00:17:40.640 | the first thing we actually need to do is get a list of all those files and that we can pass
00:17:46.000 | on to our tokenizer so to do that we'll use the pathlib library so from pathlib import path
00:17:54.160 | and we just go str(x) for x in Path, so here we need to specify the directory where our
00:18:10.160 | files will be found, so that is just oscar, and at the end here we just add this glob. And here we
00:18:17.360 | don't, in this case we don't need to do this, because if we just use Path here it will
00:18:24.480 | just select all of the files in that directory, and in our case we can actually do that, because
00:18:29.760 | there are no files other than the text files, but it's good practice, just in case there is
00:18:35.440 | anything else in there, we can use this function here to say, within this directory,
00:18:43.200 | just select all text files. Okay, and then let's have a look, so in paths we have,
00:18:50.480 | we should have all of our files and let's see how many of those we have
00:18:59.280 | okay, so we have 10 of those, so in total, yep, 50,000 samples there, because we have 5,000 in each file.
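The file-listing step being described here is just a couple of lines:

    from pathlib import Path

    # collect every plain-text file we just wrote into the oscar/ directory
    paths = [str(x) for x in Path('oscar').glob('*.txt')]
    print(len(paths))   # 10 files of 5,000 samples each = 50,000 samples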
00:19:06.480 | Okay, so now let's initialize our plain tokenizer, so we want to do from tokenizers,
00:19:15.840 | so if you don't have tokenizers installed super easy all you have to do is do pip install tokenizers
00:19:23.200 | again, this is another Hugging Face library, like transformers or datasets, which we used before,
00:19:29.360 | and from tokenizers we want to import the BertWordPieceTokenizer, which is shown there.
00:19:36.000 | So we load that, and then our tokenizer, we initialize it with BertWordPieceTokenizer
00:19:42.640 | again. And then in here we have a few different variables which are useful to
00:19:49.120 | understand so first one is clean text so this just removes obvious characters that we don't want
00:19:57.200 | and converts all white space into spaces so we can say that's true
00:20:03.600 | We have handle_chinese_chars. Now this, you can say, I'll leave it as false, but what this does
00:20:15.280 | is, if it sees a Chinese character in your training data, what it's going to do is just add spaces
00:20:21.440 | around that character, which, as far as I know, at least when we're tokenizing those
00:20:28.240 | Chinese characters, allows them to be better represented, I assume, but obviously I don't
00:20:35.760 | know chinese and i have never trained anything in chinese so i don't know
00:20:42.400 | but that's what it does strip accents so this is a pretty relevant one for us so this is say if we
00:20:49.440 | have like an e like this it will convert it into this obviously for romance languages like italian
00:20:58.800 | those accents are pretty important, so we don't want to strip those. It's also
00:21:06.160 | strip, not string. And then the final one, lowercase. So this is, if we want to view this
00:21:15.200 | as equal to this, we would set lowercase equal to true. In this case, you know, for me, I'm happy
00:21:23.360 | to have those capital characters as being equal to lowercase characters, that's completely fine.
00:21:29.600 | So that initializes, handle chinese, sorry, handle_chinese_chars, like this. So that initializes
00:21:39.680 | our tokenizer. Now we train it, so tokenizer.train. In here we need to first pass our
00:21:46.960 | files, so, is it paths that we used up here? Yeah, paths. So, training it with those, we want to set the
00:21:54.960 | vocab size so this is the number of tokens that we can have within our tokenizer it can be very
00:22:02.400 | small for us because we don't have that much data in there. I want to set the min_frequency,
00:22:08.640 | which initially I thought, oh, that must mean, you know, the minimum number of times a token must
00:22:15.040 | be found in the data for it to be added to the vocabulary, but it's not, it's actually the
00:22:20.720 | minimum number of times that it must see two different tokens or characters together
00:22:28.800 | in order for it to consider these as actually a token by themselves, so, merged together.
00:22:34.320 | so typically i think people use two for that which is fine special tokens so these are the
00:22:43.360 | special tokens used by BERT, special_tokens, and for that we will have padding, so the
00:22:51.280 | padding token, the unknown token, the classifier token, which we put at the start of every
00:22:58.720 | sequence, let me put this on a new line, we have the separator token, which we put at the end
00:23:07.200 | of a sequence, and then we also have the mask token, which is pretty important if we are
00:23:13.200 | training that core model. We also have limit_alphabet, so this is the number of different
00:23:20.560 | characters that we can have within our vocab, so limit_alphabet, we'll go with 1000,
00:23:28.400 | and wordpieces_prefix, so this is what we saw before in the example where we had the two
00:23:37.120 | hashes, and this, like I said, just indicates a piece of a word rather than a full
00:23:45.040 | word and that should be it actually so i don't think there's really anything else that is
00:23:50.480 | important for us. So we'll train that, and hopefully it will work. Again, this can take a
00:23:58.080 | little bit, well, this will take a little bit of time, even with our smaller data set.
00:24:05.680 | So let's see what it's showing us. It's not, I don't know why, I think I need to
00:24:11.600 | install something here, because I just get this blank output, I think it's supposed to be a loading
00:24:17.680 | bar. And then what we do at this point is, we probably want to save our new tokenizer, so I'm
00:24:26.240 | going to save it as new_tokenizer, and I just write tokenizer.save_model, and that is going to go
00:24:37.520 | to the new_tokenizer directory, so new_tokenizer. Okay, and that will save this vocab.txt file.
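Pulling the initialization, training and saving together, the whole thing looks roughly like this (the vocab_size of 30,000 is just a placeholder, since the video only says it can be fairly small, and the directory creation is an extra step because save_model expects the folder to exist):

    import os
    from tokenizers import BertWordPieceTokenizer

    tokenizer = BertWordPieceTokenizer(
        clean_text=True,             # strip odd control characters, normalize whitespace
        handle_chinese_chars=False,  # don't add spaces around Chinese characters
        strip_accents=False,         # keep accents, they matter for Italian
        lowercase=True               # treat upper- and lower-case as the same
    )

    tokenizer.train(
        files=paths,
        vocab_size=30_000,           # placeholder value, not stated in the video
        min_frequency=2,             # a pair must be seen twice before it can merge
        special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
        limit_alphabet=1000,         # at most 1,000 single-character tokens
        wordpieces_prefix='##'       # prefix marking a continuation piece
    )

    os.makedirs('new_tokenizer', exist_ok=True)
    tokenizer.save_model('new_tokenizer')   # writes new_tokenizer/vocab.txt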
00:24:46.880 | And if we just have a quick look at what that has inside, so come over here, we have new_tokenizer,
00:24:55.840 | vocab.txt, and then in here we can see all of our tokens. Okay, so the way that
00:25:04.000 | this works, yeah, so you can actually see, you know how we used that alphabet, the limit
00:25:10.640 | alphabet, we can see that there are 1000 of those tokens, so this stops at row 1005, and then if we go all the way up here,
00:25:19.120 | it begins at row six, so that's the 1000 alphabet characters, so single characters
00:25:26.160 | that are within, or allowed within, our tokenizer. Now, the OSCAR data set is just pulled from the
00:25:32.800 | internet, so you do get a lot of random stuff in there. So we have, well, we have a lot of Chinese
00:25:39.600 | characters when we're dealing with Italian, but if we come down here we start to see some of those
00:25:44.240 | Italian words, and these are the tokens. So our tokenizer is going to read our text,
00:25:52.080 | it's going to split it out into these tokens, so like the abb, op, fin, it's going to split it out into
00:26:01.040 | those, and then the next step is to convert those into token ids, which are represented by the row
00:26:06.640 | numbers of those tokens. So if it saw fin in the text, it would replace that with the fin
00:26:14.720 | token, and then it would replace the fin token with this 2201.
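In other words, the token id is just the row index of the token in vocab.txt, something like this sketch (2201 is simply the row that fin happened to land on in the vocab trained in this video):

    # build a token -> id mapping straight from the saved vocab file
    with open('new_tokenizer/vocab.txt', encoding='utf-8') as fp:
        vocab = {token: idx for idx, token in enumerate(fp.read().split('\n'))}

    print(vocab.get('fin'))   # e.g. 2201 for the vocab trained in this video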
00:26:23.760 | Now, let's see how it works. So, first thing is, well, how do we load that tokenizer? We do it as we normally would,
00:26:30.720 | so from transformers import, we're using a BertTokenizer, so import BertTokenizer,
00:26:37.280 | and we'll say tokenizer equals BertTokenizer
00:26:44.880 | dot from_pretrained, and all we do is we point that to where we saved it, so it's new_tokenizer,
00:26:54.560 | and that should load. Okay, so first I want to say tokenizer, and I'm going to tokenize
00:27:01.680 | ciao, come va, and this is just, hi, how are you,
00:27:07.280 | and we see that we get these tokens here. So we have number two here, which, if you probably
00:27:17.520 | don't remember, in our text file at the top we had our special tokens, row number two we had the
00:27:23.840 | CLS token, so the classifier token, which we always put at the start of a sequence, and at the end we
00:27:30.720 | also have this three, which is the separator token. Now, if we just go and open that vocab
00:27:39.680 | file that we built, so it's new_tokenizer, new_tokenizer/vocab.txt, if we read that in,
00:27:48.960 | so let's write vocab, fp.read, and we want to split by newline characters, so split,
00:28:02.560 | like so, because every token is separated by a new line. We can see, let's have a look at the
00:28:10.400 | special tokens, so we have padding, unknown, CLS at position number two, and separator at position number
00:28:17.120 | three. So if I were to go, number two, we'd get CLS, which aligns to what we have here. So what we
00:28:25.040 | can do is, we could take all of these values and we could use them to identify the tokens from this vocab, and
00:28:33.600 | we can do it using tokenizer.decode, by the way, as well, but I'm going to do it by indexing into that
00:28:40.240 | vocab file that we built, just to, you know, show that that's what the vocab file actually is, that's
00:28:46.800 | how it's used. So if I write that out, so we have, take this, I want to access input_ids,
00:29:04.480 | okay, and what I'm going to do is say, for i in that list, can you print
00:29:14.400 | vocab[i], and at the end we'll just add a space. Okay, and then we get this, so CLS, the starting
00:29:26.000 | classifier token, ciao, exclamation mark, come va, question mark, separator token, which marks
00:29:32.320 | the end of our sentence, which I think is pretty cool. Now let's try that with something
00:29:39.120 | else. So we'll take this again, do I want, okay, so yeah, we'll just do this. So what I say a lot
00:29:48.320 | when in Italy is, ok, ho capito niente, and that means, ok, I understood nothing, which is very
00:30:00.320 | useful. So if I print that out, we see that we get CLS, ok, ho capito niente, separator. Now what we're seeing
00:30:11.520 | here is, you know, full words, we're not seeing any word pieces. So, if I can find it, I think
00:30:22.240 | this will hopefully return a word piece, so responsabilità, like this,
00:30:29.120 | let me try
00:30:32.320 | this will hopefully return a few word pieces yes there we go
00:30:39.600 | okay, so we see responsabilità split out into its word pieces, and these are our different word pieces, so this gets
00:30:50.240 | separated into not just a single token but four tokens, which is, I think, pretty cool.
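To close the loop, here is the whole load-and-inspect sequence from the last few minutes in one runnable snippet (the printed splits are examples only; the exact pieces and ids depend on the vocabulary you trained):

    from transformers import BertTokenizer

    # load the freshly trained vocab as a standard BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained('new_tokenizer')

    ids = tokenizer('ciao, come va?')['input_ids']
    print(ids)   # starts with 2 ([CLS]) and ends with 3 ([SEP]) given the special-token order above

    # map the ids back to tokens by indexing into vocab.txt directly
    with open('new_tokenizer/vocab.txt', 'r', encoding='utf-8') as fp:
        vocab = fp.read().split('\n')
    print(' '.join(vocab[i] for i in ids))       # e.g. [CLS] ciao , come va ? [SEP]

    # longer words come back as word pieces
    print(tokenizer.tokenize('responsabilità'))  # e.g. ['responsabilit', '##à'] or similar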
00:30:58.560 | Now, I think, I mean, that's it for this video. I don't think, yeah, I mean, there's nothing else I think
00:31:07.120 | we need to cover, that's pretty much everything we really need to know for building a WordPiece
00:31:11.680 | tokenizer for using with BERT. So, yeah, thank you very much for watching, and I will
00:31:18.400 | see you in the next one. Bye.