Back to Index

How-to Use HuggingFace's Datasets - Transformers From Scratch #1


Chapters

0:00 Intro
1:28 Getting Data
7:25 Training Data

Transcript

Hi, welcome to this video. This is the first video in what will be kind of like a mini-series on how we can train a transformer from scratch. So a lot of you have been asking for this in one way or another, so we're just going to run through kind of everything I think you need to know to actually just build your own transformer.

So there are a few different parts to this and that's why I'm doing it in a series because otherwise it would be a super long video. So we're just going to break it down into a few different parts. So the first video is going to be on getting our data and that's what you're watching now.

So we're going to learn how to use the Hugging Face datasets library, which I think is very good actually. So we'll take a look at that. In the next video, we'll have a look at actually training the tokenizer with that data. And then in these bits here, so these three parts, they might just be one video.

I'm not sure; I'm going to see how it flows. So there will probably be maybe one or two videos. Let's see. So we'll move on to getting our data over in Python. Now when it comes to getting data to train a transformer model, we are pretty much spoiled for choice.

All we really need is unstructured text data, which of course there's a lot of that on the internet. Now there's one dataset in particular that I've noticed that is very good called Oscar and that is a multilingual dataset. It has like hundreds of languages, I think, like really loads.

So we're going to go ahead and use that. Now to download the Oscar dataset, we will be using a Hugging Face library called datasets. So if you do need to, you can pip install that. So you just do pip install datasets. And once we have that, we just want to import datasets.

Now, what we can do if we want to just view all of the datasets that are available to us within our notebook here is we can write datasets.list_datasets(). And let's just print out how many we have, because there's quite a few. So at the moment, there's 965 of those.
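As a rough sketch, that looks something like this in a notebook (note that list_datasets has since been deprecated in newer releases of the library in favour of browsing the Hugging Face Hub, but it works in the version used here):

```python
# pip install datasets

import datasets

# list_datasets() returns the names of every dataset available on the hub
ds = datasets.list_datasets()
print(len(ds))  # 965 at the time of recording
print(ds[:5])   # first five names, alphabetically sorted
```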

So I'm not going to print all of them out, but let's just have a look at a few of those so we can see what we actually have here. So here's just five. So it's alphabetically sorted. So we have five datasets all beginning with A. Now, one of these is called Oscar.

So we can write 'oscar' in ds, and we see True, but we can't really get that much information from within Python.
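That check is just an ordinary Python membership test on the list from above:

```python
# ds is the list returned by datasets.list_datasets() earlier
print('oscar' in ds)  # True
```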

So what we can do is head on over to this website here, which is the Hugging Face datasets viewer, which is really cool. So what we do is we go over here on the left, and we can search for a dataset based on tags. Or what we'll be using is the actual dataset name. So we can just go through and see all these datasets. There's loads of them. Now, if I scroll down quite a lot (I think you can also type it in at the top), you will find, or I should find, Oscar here.

Now we search for Oscar, and then we also get this subset here. This is another important thing. So Oscar has all these different subsets of languages within its dataset, and these are all of those. Now, if you want to know what each of those are, because we just get a letter here, which is not really that useful, we can go over here.

So this is oscar-corpus.com. And I believe we click here. Okay. So we scroll down to here, and we have a big list of all of the languages here. So we have the language, and then we have the AF here, which is the first one. So the first one we know is Afrikaans.

Now, if we scroll down, I'm going to go, I only know English, but my girlfriend is Italian and she's going to come along and hopefully tell us that the model works at some point. So we're going to be using Italian because that's literally the only choice I have other than English, and that's kind of boring.

So we're going to go with this one. So we need to search for the one with IT at the end. We come here, and we can just type it in, I think, or maybe we can't. There we go. So we click that, and we are not actually allowed to view it because it's too big.

I didn't realize that. Let's go with, I think, Latin, which you can view. Yeah. So obviously this isn't Italian, this is Latin, but you can see here we have the ID, and we have text, and this is the format of the dataset that we're going to be using. So let's copy this, and we'll go back over to our code.

Now, what we do is we're going to be loading or downloading that dataset into this dataset variable. We want to write datasets.load_dataset, and here we want to write oscar, which is the dataset name, and then we also want to include the subset.

So our subset is, hmm, not pasting. That's fine. It's unshuffled_deduplicated_it. So here we go. So I already have it downloaded, I think, so it's not going to download it again, but you should get a loading bar if this is your first time downloading this dataset, and that might take a little bit of time.
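So the call ends up looking roughly like this (the first run downloads the data, later runs read it from the local cache):

```python
# download (or load from cache) the Italian subset of OSCAR
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_it')
```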

Okay. So that has loaded for me now, so I can do this, and we can see that we have this DatasetDict object, and inside there we have this one item. So sometimes you have training data and testing data. We just have training data here. So we have train, and then inside that we have our data.

So we have this many samples, which is 28.5 million samples. I'm not going to use all of them because it will just take a very long time, and we have the two features that we saw before. So we have the ID, and then we also have the text, which is what we care about.

Now, if we wanted to just have a look at one of those, we could write dataset['train'][0] like this, and we see our data. Okay. Now, that's good, but when we're training our tokenizer, we're going to want to read these in from file rather than keeping them in memory.
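Put together, inspecting the object looks something like this (the row count and the sample text shown are just what you see for this particular subset):

```python
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'text'],
#         num_rows: 28522082   # ~28.5M samples
#     })
# })

print(dataset['train'][0])
# {'id': 0, 'text': '...'}  # one Italian text sample
```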

So what we're going to do is, first I'm going to import tqdm because this can take a little bit of time. So I want to have a loading bar so that we can actually see what is happening, and from there we want to initialize a list, which is going to contain our text data.

And what I'm going to do is, so I'm going to loop through all of this data and format it in the way we need to save it to file for the tokenizer. So it essentially needs every sample to be on a new line, and I'm just going to take, I think, 10,000 of those samples, put them into a file, save it, and move on to the next file.

So this is what the file count is for. I'm just going to write something like, I don't know, Italian data zero, Italian data one, Italian data two, and so on. So we're going to loop through all of our samples, so for sample in, and here I'm going to wrap it in tqdm so that we can see the loading bar or the progress bar.

And here we're just going to go dataset['train'], so that will go through all of our samples. Now we're going to be separating each sample with a newline character, which also means we need to remove any other newline characters from our data, otherwise we're going to be splitting each sample into multiple samples, which we don't really want.

So we write sample = sample['text'], because remember, in here we have both the ID and the text, so we want to access the text specifically, and we're going to replace the newline characters in there with just spaces, I think, yeah. Then what we're going to do is text_data.append(sample), so that is going to add one sample to our text data list up here.

Now what we want to do is say if the length of that text data list hits 10K, at that point we want to save it to file, so I'm going to write with open, and I'm just going to call it Italian text, or just it_, and I'll put in the file count, so it_{file_count}.txt.

We're going to be writing that, and then we just write fp.write. We're using newline characters here, so we're just going to join everything within our text data list like that, and we also want to include the encoding here, so utf-8. Now once we've written that data, we don't want to keep all of the current data within the text data variable or text data list, because we've already saved those 10,000; we want to sort of reinitialize that list so that we start again, and then we save the next 10,000 after that.

So we want to write text_data equals an empty list, and then obviously if we just keep saving with the current file count, we're just going to keep overwriting ourselves, so we need to add one to the file count. Now that will save most of our data, but on the final batch, so we have, what do we have?

So we have this many samples, and if we take that, we are left with 2,082 samples at the end there, so that means they will not save because it will not reach the 10k on that final loop. So at the very end here, all we're going to do is, I think we just copy this, so we will copy that, and yeah, that should be fine, so that will save that final 2k that we have there.
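Putting all of that together, the loop we've just walked through looks roughly like this (the it_0.txt, it_1.txt, ... naming is just my own convention, nothing the tokenizer requires):

```python
from tqdm.auto import tqdm  # progress bar

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    # remove newlines so that each sample sits on a single line in the file
    sample = sample['text'].replace('\n', ' ')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # write 10k samples per file, one sample per line
        with open(f'it_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# whatever is left over (the final ~2k samples) gets saved here
with open(f'it_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
```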

So I'm going to run that, I think it can take a while, well let me see, I think it does take a while, so what's that, 20, that's kind of weird, yeah it's going to take maybe 30 minutes, so that's fine. So after we have done that, we move on to what I'm going to cover in the next video, which is actually training our tokenizer, so I'll see you there.