
How to Build a Bert WordPiece Tokenizer in Python and HuggingFace


Chapters

0:00 Intro
3:41 WordPiece Tokenizer
5:54 Download Data Sets
7:26 HuggingFace
10:41 Dataset
19:00 Tokenizer
24:43 Tokenizer Walkthrough
26:22 Tokenizer Code

Transcript

Hi, welcome to this video. We're going to cover how we can build a tokenizer for BERT from scratch. Typically, when we're using transformer models, we have three main components: the tokenizer, which is what we're going to cover here, the core model, and a head. The tokenizer is what converts our text into tokens that BERT can read, the core model is BERT itself, which we build or train through pre-training, and the head is what allows us to do specialized tasks, so a Q&A head, a classification head and so on.

Now, a lot of the time what we can do is just head over to the Hugging Face website, look at all the tasks listed there, and if I want a model for question answering I can click through and typically there's something we can use. But it obviously depends on your use case, your language and a lot of other things, so if we find there isn't really a model that suits what we need, or it doesn't perform well on our own data, that's where we'd start considering whether we need to build our own transformer. In that case, at the very least we're probably going to need to train from scratch the core model and the transformer head, so we definitely need those two parts, and sometimes we'll find we also need a tokenizer, but not always. Say our task is something in the English language but the model doesn't perform very well on our specific dataset; that doesn't mean the text hasn't been tokenized properly. An existing tokenizer can still tokenize that text pretty well as long as it's standard English; the problem is that the model just doesn't quite understand the style of language being used. For example, if BERT is trained on blogs from the internet, it's probably not going to do as well on governmental or financial reports, so that's the sort of situation where we'd need to retrain the core model so it can better understand the style of language used there. The head, like I said before, is where we train for a specific use case, so for Q&A we'd train the model on a question-answering dataset so that it can start answering our questions. In that case, if everything is in English, we probably don't need a new tokenizer. But sometimes we do, because maybe your use case is in a less common language, and then you probably will need to build your own tokenizer for BERT. That's really the use case we're looking at in this video: we'll cover building a WordPiece tokenizer, which is the tokenizer used by BERT, and we'll also look at where we can get good multilingual datasets. So let's move on to what the BERT tokenizer is and what it does.

Okay, so like I said before, the BERT tokenizer is called a WordPiece tokenizer, and it's pretty straightforward: it breaks your words into chunks, or pieces, of words, hence WordPiece. For example, the word "surf" is most likely going to come back as a single token, "surf", whereas for "surfing" the "ing" at the end is a pretty common word ending, in English at least, so that word would probably get broken out into two tokens, "surf" and "##ing".
Where we see that double-hash prefix, that's the standard prefix used to indicate a piece of a word rather than a whole word, and we see it further down as well, where "surfboarding" gets broken into three tokens. If we compare that to "snowboarding": snowboarding and surfboarding are obviously kind of similar because they're both boarding sports, the difference being that one is on surf and the other is on snow. Before we even feed these tokens into BERT, we're making it very easy for BERT to identify the similarities between those two objects, because BERT sees that one of them is surf, one of them is snow, but both of them are boarding. So this is helping BERT from the start, which I think is pretty cool.

Now, when we're training a tokenizer we need a lot of text data. When I say a lot, let's say two million paragraphs is probably a good starting point, although ideally you want as much as you can get. What we'll use for training is something called the OSCAR dataset, or OSCAR corpus. OSCAR is a huge multilingual dataset that contains an insane amount of unstructured text, so it's very, very good, and we can access it through Hugging Face, which is super useful.

So over in our code, if we want to download datasets we first need to pip install a library called datasets. I already have it installed, so I'm just going to write from datasets import load_dataset, and then we can use the datasets.list_datasets method (let me import datasets as well). This gives us a very big list, probably a little bit too big, showing all the datasets currently available in the datasets library, which is quite a lot; I think it's a fair bit over a thousand now, with new ones being added every day. My internet is very bad at the moment, so it takes forever to download anything, but this is one way of viewing those datasets. An easier way is to go to Google, type in "Hugging Face datasets viewer", and click on the Streamlit app. This is a Streamlit app that Hugging Face built which lets you browse their datasets. I go over to the datasets box, type in "oscar" because that's the one we'll be using, and it pops up on the right. Within OSCAR we have all these different subsets; the first one is Afrikaans, the language, and then all these other ones below. I'm going to use Italian as my example here, and Italian has a lot of data, so if I click on it, it doesn't actually show you anything, which is a little bit annoying, but that's because the dataset is so huge it can't show you everything; in fact, that's about 101 or 102 gigabytes of data. It's a lot, but that's good for us, because we need a lot of data for training.
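For reference, here's a minimal Python sketch of that dataset-listing step. It assumes the version of the datasets library around at the time of the video; list_datasets has since been deprecated in favour of the huggingface_hub library, so the exact call may differ for you.

import datasets

# List the name of every dataset available on the Hugging Face Hub.
# (Deprecated in newer `datasets` releases; huggingface_hub.list_datasets()
# is the current equivalent.)
all_datasets = datasets.list_datasets()

print(len(all_datasets))   # well over a thousand at the time of recording
print(all_datasets[:5])    # peek at the first few names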
So if we want to download that dataset, we write dataset (it's just a variable name) equals load_dataset, and in here we need the dataset name, which is oscar, and then we need to specify which subset it is, which is unshuffled_deduplicated_it: unshuffled, deduplicated, Italian. The other thing we can do is pass split and specify how much of the data we'd like. Now, when you use this split argument it's still going to download the full dataset to your machine, which is a little bit annoying, but that's how it works, so I've found it isn't particularly useful unless you're just loading from your machine and saying, okay, I only want a certain amount of data. This is 101 gigabytes, which is a lot, so if you don't want to download all of that you can pass streaming=True instead, which is very useful: it creates an iterator, and you can iterate through the object and download your samples one at a time. Because I already have the data downloaded onto my machine, I'm going to use the split approach. I was initially going to take the first 500,000 items, but obviously you want to keep the loading times reasonable for a walkthrough like this, so I'm going to go down to 50,000; in practice you'd want to be using far more samples than that. We also need to specify which split we're using; typically a dataset has train, validation and test splits, and I think there's always a train split, so we'll load that.
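Putting that together, a rough sketch of the loading step (not the exact notebook code) looks like this; the 50,000-sample split matches the number settled on above, and the streaming variant is shown commented out as an alternative if you don't want to download the full corpus.

from datasets import load_dataset

# Italian subset of OSCAR, first 50,000 training samples.
# Note: even with a partial split, load_dataset downloads the whole subset first.
dataset = load_dataset(
    'oscar',
    'unshuffled_deduplicated_it',
    split='train[:50000]'
)

# Alternative: iterate over samples one at a time instead of downloading ~100 GB.
# dataset = load_dataset('oscar', 'unshuffled_deduplicated_it',
#                        split='train', streaming=True)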
Next I'm going to create a new directory where I'll store all of the text files, because when we're training the tokenizer it expects plain-text files where each sample is separated by a newline. So I make a directory called oscar, and then loop through our data and convert it into the file format we need. First I import tqdm (from tqdm.auto import tqdm) so that we get a progress bar and can see where we are in the process, because this can take a while. I create a text_data list, which we'll populate with all of our text, and a file_count variable set to zero, and then we loop through the dataset with for sample in tqdm(dataset). The first thing inside the loop: because we'll be separating samples with newline characters, I want to remove any newline characters that are already within each sample, otherwise we'd end up splitting samples midway through a sentence. Each sample has an id and a text field, and we obviously want the text, so we take sample['text'], replace any newline characters with a space, and append that to text_data.

We could put everything into a single file, but that leaves us with one file which is huge. For 50,000 samples it's not really a problem, but typically we'd be using more like 5 million or 50 million samples, so what I like to do is split the data into multiple text files. So I say: if the length of text_data reaches 5,000, save it and start populating a new file. We open a file inside the oscar directory we built before, named with the current file count as an f-string, so file_0.txt, file_1.txt and so on, and write the text data to it; text_data is a list, so we join every item in the list separated by a newline character. At this point we've written a file, but text_data still has 5,000 items in it and we're going to keep looping and adding more, so we need to reset text_data to an empty list so it can fill up to 5,000 again. We also need to increment file_count, otherwise we'd just keep overwriting the same file over and over. If you want, you can add another with open block after the loop to catch any leftover items at the end that didn't fill a neat 5,000-item chunk; I'm not going to do that here, but you can add it if you want to. The one other thing we do need is to make sure the encoding is utf-8, otherwise I think we'll get an error.

So that should create all of our data. Let me open the directory here on the left: we have this empty oscar directory, I run the loop, and we should see it get populated. It's pretty quick, and there we go, we're building all these plain-text files, and if we open one we see that each row is a new sample, and as you can see it's all Italian. So our data is ready, and we can move on to actually training the tokenizer.
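As a rough sketch (not the exact notebook code), that preprocessing loop looks something like this, assuming the dataset object from the previous step and the file_N.txt naming described above:

import os
from tqdm.auto import tqdm

os.makedirs('oscar', exist_ok=True)

text_data = []
file_count = 0

for sample in tqdm(dataset):
    # one sample per line in the output files, so strip internal newlines first
    text_data.append(sample['text'].replace('\n', ' '))
    if len(text_data) == 5_000:
        # write the current 5,000-sample chunk to its own plain-text file
        with open(f'oscar/file_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# optional: flush any leftover samples that didn't fill a full 5,000 chunk
if text_data:
    with open(f'oscar/file_{file_count}.txt', 'w', encoding='utf-8') as fp:
        fp.write('\n'.join(text_data))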
The first thing we need for training is a list of all those files, which we can pass to our tokenizer. To do that we'll use the pathlib library: from pathlib import Path, and then [str(x) for x in Path('oscar').glob('*.txt')]. Here we specify the directory where our files are found, which is just oscar, and add the glob pattern at the end. In our case we could skip the pattern, because there are no other files in that directory besides the text files, but it's good practice to say, within this directory, select only the text files, just in case there is anything else in there. Let's have a look: in paths we should have all of our files, and if we check how many there are, we have 10 of them, so 50,000 samples in total, because there are 5,000 in each file.

Okay, so now let's initialize our plain tokenizer. We want from tokenizers import BertWordPieceTokenizer; if you don't have tokenizers installed, it's super easy, all you have to do is pip install tokenizers. This is another Hugging Face library, like transformers or datasets which we used before. We initialize the tokenizer with BertWordPieceTokenizer, and in here we have a few different parameters which are useful to understand. The first one is clean_text: this removes obvious characters that we don't want and converts all whitespace into spaces, so we'll set that to True. Then we have handle_chinese_chars; I'll leave it as False, but what it does is, if it sees a Chinese character in your training data, it adds spaces around that character, which, as far as I know, allows those characters to be better represented when tokenizing Chinese text; I don't know Chinese and I've never trained anything in Chinese, but that's what it does. Then strip_accents, which is a pretty relevant one for us: this would convert an accented character like è into a plain e. Obviously, for Romance languages like Italian those accents are pretty important, so we don't want to strip them, so that's False. And the final one is lowercase: setting it to True means a capitalized word and its lowercase version are treated as equal, and for me I'm happy to have capital characters treated the same as lowercase characters, so that's completely fine.

That initializes our tokenizer, and now we train it with tokenizer.train. In here we first pass our files, which is the paths list we built above. We want to set vocab_size, which is the number of tokens we can have within our tokenizer; it can be fairly small for us because we don't have that much data in there. Then min_frequency, which I initially thought meant the minimum number of times a token must be found in the data for it to be added to the vocabulary; it's actually the minimum number of times the trainer must see two tokens or characters together before it considers merging them into a single token, and typically people use 2 for that, which is fine. Then special_tokens, which are the special tokens used by BERT: the padding token, the unknown token, the classifier token which goes at the start of every sequence, the separator token which goes at the end of a sequence, and the mask token, which is pretty important if we're training the core model. We also have limit_alphabet, which is the number of different single characters allowed within our vocab; we'll go with 1,000. And finally wordpieces_prefix, which is what we saw before in the example: the two hashes that, like I said, just indicate a piece of a word rather than a full word. That should be it; I don't think there's anything else that's important for us, so we train it. This will take a little bit of time even with our smaller dataset. I'm just getting a blank output here where I think there's supposed to be a loading bar, so I probably need to install something for it to display.
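Roughly, the initialization and training call look like this; the vocab_size of 30,000 is an assumed example value (no specific number is settled on above), while the other parameters mirror the discussion:

from pathlib import Path
from tokenizers import BertWordPieceTokenizer

# gather the plain-text training files written earlier
paths = [str(x) for x in Path('oscar').glob('*.txt')]

tokenizer = BertWordPieceTokenizer(
    clean_text=True,             # drop unwanted characters, normalise whitespace
    handle_chinese_chars=False,  # don't add spaces around Chinese characters
    strip_accents=False,         # keep accents -- they matter in Italian
    lowercase=True               # treat upper- and lowercase as equivalent
)

tokenizer.train(
    files=paths,
    vocab_size=30_000,           # assumed example value
    min_frequency=2,             # times two pieces must co-occur before merging
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
    limit_alphabet=1000,         # max number of distinct single characters
    wordpieces_prefix='##'       # prefix marking mid-word pieces
)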
At this point we probably want to save our new tokenizer, so I'm going to save it into a directory called new_tokenizer with tokenizer.save_model. That saves a vocab.txt file, and if we have a quick look at what's inside new_tokenizer/vocab.txt, we can see all of our tokens. You can actually see how that limit_alphabet value was used: there are 1,000 single-character tokens, beginning at row six (after the special tokens) and stopping at around row 1,005; those are the single characters allowed within our tokenizer. Now, the OSCAR dataset is pulled from the internet, so you do get a lot of random stuff in there; we have a lot of Chinese characters even though we're dealing with Italian. But if we come further down we start to see Italian words, and these are the tokens. Our tokenizer is going to read our text and split it into these tokens, and the next step is to convert those into token IDs, which are represented by the row numbers of those tokens. So if "fin" appears in the text, the tokenizer replaces it with the "fin" token, and then replaces the "fin" token with its row number, say 2201.

Now let's see how it works. First, how do we load that tokenizer? We do it as we normally would: from transformers import BertTokenizer, then tokenizer = BertTokenizer.from_pretrained pointed at the directory where we saved it, new_tokenizer, and that should load. First I'm going to tokenize "ciao! come va?", which is just "hi! how are you?", and we see that we get these token IDs. You probably don't remember, but at the top of our vocab file we had our special tokens: the number 2 here is the [CLS] token, the classifier token which always goes at the start of a sequence, and at the end we have 3, which is the separator token. If we open the vocab file we built, new_tokenizer/vocab.txt, read it in with fp.read() and split it on newline characters, because every token is separated by a newline, we can look at the special tokens: we have padding, unknown, [CLS] at position 2 and the separator at position 3, so vocab[2] gives us [CLS], which lines up with what we have here. So we could take all of these ID values and use them to look tokens up in this vocab. We could do it with tokenizer.decode as well, by the way, but I'm going to do it by indexing into the vocab file we built, just to show that that's what the vocab file actually is and how it's used.
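A quick sketch of the save, reload and lookup steps, assuming the trained tokenizer from above (the exact token IDs and word pieces will depend on your own training data):

import os
from transformers import BertTokenizer

# save the trained vocabulary -- this writes new_tokenizer/vocab.txt
os.makedirs('new_tokenizer', exist_ok=True)
tokenizer.save_model('new_tokenizer')

# reload it with the standard transformers class
bert_tokenizer = BertTokenizer.from_pretrained('new_tokenizer')

encoding = bert_tokenizer('ciao! come va?')
print(encoding['input_ids'])   # starts with 2 ([CLS]) and ends with 3 ([SEP])

# the token IDs are just row numbers in vocab.txt
with open('new_tokenizer/vocab.txt', encoding='utf-8') as fp:
    vocab = fp.read().split('\n')

print(' '.join(vocab[i] for i in encoding['input_ids']))
# e.g. "[CLS] ciao ! come va ? [SEP]"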
So if I take that output, access the input_ids, and say for each i in that list, print vocab[i] followed by a space, we get [CLS], the starting classifier token, then ciao, the exclamation mark, come, va, the question mark, and the separator token which marks the end of our sentence, which I think is pretty cool. Now let's try that with something else. Something I say a lot when I'm in Italy is "ok, capito niente", which means "I understood nothing", which is very useful. If I print that out, we see [CLS], ok, capito, niente, [SEP]. What we're seeing here are full words; we're not seeing any word pieces. So let me try something that will hopefully return a few word pieces: "responsabilità". There we go: we can see it split into pieces prefixed with the double hash, so this one word gets separated into not just a single token but four tokens, which is pretty cool.

Now, I think that's it for this video; I don't think there's anything else we need to cover. That's pretty much everything we really need to know for building a WordPiece tokenizer to use with BERT. So thank you very much for watching, and I'll see you in the next one.

Bye.