How-to Use HuggingFace's Datasets - Transformers From Scratch #1
Chapters
0:00 Intro
1:28 Getting Data
7:25 Training Data
Hi, welcome to this video. This is the first video in what will be a mini-series on how we can train a transformer from scratch. A lot of you have been asking for this in one way or another, so we're going to run through everything I think you need to know to actually build your own transformer. There are a few different parts to this, and that's why I'm doing it as a series; otherwise it would be a super long video. The first video is on getting our data, and that's what you're watching now. We're going to learn how to use the Hugging Face datasets library, which I think is very good. In the next video, we'll have a look at actually training the tokenizer with that data. The remaining parts after that might just be one video; I'm not sure, I'm going to see how it flows, so there will probably be one or two more. Let's see.
So let's move on to getting our data over in Python. Now, when it comes to getting data to train a transformer model, we are pretty much spoiled for choice. All we really need is unstructured text data, and of course there's a lot of that on the internet. There's one dataset in particular that I've noticed is very good, called OSCAR, and it's a multilingual dataset; it covers hundreds of languages, really loads of them. So we're going to go ahead and use that.
Now, to download the OSCAR dataset we'll be using the Hugging Face library called datasets. If you need to, you can install it with pip install datasets. Once we have that, we just import datasets. Then, if we want to view all of the datasets that are available to us from within our notebook, we can call datasets.list_datasets().
Let's print out how many we have, because there are quite a few: at the moment, there are 965 of them. I'm not going to print all of them, but let's have a look at the first five so we can see what we actually have here. The list is sorted alphabetically, so we see five datasets all beginning with A. Now, one of the datasets in that list is called oscar, so we can check 'oscar' in ds and we see True, but we can't really get much more information from within Python.
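Here's a minimal sketch of that exploration step, assuming pip install datasets has been run. Note that datasets.list_datasets() was the library's API at the time of recording; newer versions of the library have deprecated it in favor of listing datasets through the huggingface_hub package.

```python
import datasets

# List every dataset name the library knows about (strings, sorted alphabetically).
ds = datasets.list_datasets()

print(len(ds))        # 965 at the time of recording
print(ds[:5])         # the first five, all beginning with 'a'
print('oscar' in ds)  # True
```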
from within Python. So what we can do is head on over 00:03:24.160 |
to this website here, which is the Hugging Face datasets viewer, which is really cool. 00:03:34.320 |
So what we do is we go over here on the left, and we can search for a dataset based on tags. 00:03:44.960 |
Or what we'll be using is the actual dataset name. So we can just 00:03:48.960 |
go through and see all these datasets. There's loads of them. Now, if I scroll down quite a lot, 00:03:55.520 |
I think you can also type in at the top, you will find, or I should find Oscar here. 00:04:09.600 |
Now we search for Oscar, and then we also get this subset here. This is another important thing. So 00:04:15.040 |
Oscar has all these different subset of languages within its dataset, 00:04:21.040 |
and these are all of those. Now, if you want to know what each of those are, 00:04:31.040 |
because we just get a letter here, which is not really that useful, 00:04:36.640 |
we can go over here. So this is oscarcorpus.com. 00:04:41.280 |
I believe we click here, scroll down, and we get a big list of all of the languages: the language name alongside its code. The first one is AF, which we can see is Afrikaans. Now, I only know English, but my girlfriend is Italian, and she's hopefully going to come along at some point and tell us whether the model works. So we're going to be using Italian, because that's literally the only choice I have other than English, and English is kind of boring. So we need to search for the subset ending in IT. We come back to the viewer and type it in, click that, and it turns out we're not actually allowed to view it because it's too big; I didn't realize that. Still, the data is obviously in Italian, in Latin script, and you can see here that we have the ID and we have the text, and this is the dataset that we're going to be using.
So let's copy this subset name and go back over to our code. Now we're going to load, or download, the dataset into this dataset variable. We write datasets.load_dataset, passing 'oscar', which is the dataset name, and then we also want to include the subset, which is unshuffled_deduplicated_it. I already have it downloaded, I think, so it's not going to download it again for me, but you should get a loading bar if this is your first time downloading this dataset, and that might take a little bit of time.
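As a sketch, that download step looks like the following; the dataset name and subset are the ones shown in the viewer above, and the first run downloads the data while later runs reuse the local cache.

```python
import datasets

# Download the Italian subset of OSCAR (roughly 28.5 million records),
# or load it from the local cache if it has been downloaded before.
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_it')
```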
Okay, so that has loaded for me now, so I can print it out, and we can see that we have this dataset dictionary object, and inside there we have just this one item. Sometimes you have training data and testing data; here we just have training data. So we have train, and inside that we have our data: this many samples, which is 28.5 million. I'm not going to use all of them, because that would just take a very long time. And we have the two features that we saw before: the ID, and then the text, which is what we care about. Now, if we wanted to just have a look at one of those, we could write dataset['train'][0], like this.
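For reference, that inspection looks roughly like this; the exact row count is my reconstruction from the figures mentioned in this video (28.5 million samples, with 2,082 left over after batches of 10,000):

```python
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'text'],
#         num_rows: 28522082
#     })
# })

# Look at a single sample: a dict with 'id' and 'text' fields.
print(dataset['train'][0])
```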
That's good, but when we're training our tokenizer, we're going to want to read these in from file rather than keeping them in memory. So what we're going to do is, first, import tqdm, because this can take a little bit of time and I want a progress bar so that we can actually see what is happening. From there, we want to initialize the list which is going to contain our text data. Then I'm going to loop through all of this data and format it in the way the tokenizer will expect when reading from file, which essentially means every sample needs to be on a new line. I'm going to take 10,000 of those samples at a time, put them into a file, save it, and move on to the next file. That's what the file count is for: the files will be named something like Italian data zero, Italian data one, Italian data two, and so on. So, we're going to loop through all of our samples.
We write for sample in, and wrap dataset['train'] in tqdm so that we can see the progress bar; that will go through all of our samples. Now, we're going to be separating each sample with a newline character, which also means we need to remove any other newline characters from our data, otherwise we're going to be splitting single samples into multiple samples, which we don't really want. So we write sample = sample['text'] (remember, we have ID and text in here, and we want to access the text specifically) and replace the newline characters in there with just spaces. Then we write text_data.append(sample), which adds one sample to our text data list up here.
Now, what we want to do is say: if the length of that text data list hits 10,000, at that point we want to save it to file. So I'm going to write with open, and I'm just going to call the file Italian text, or just it, plus the file count, so it_{file_count}.txt. We open that for writing, and we also want to include the encoding, utf-8. Then we just write fp.write, joining everything within our text data list with newline characters. Once we've written that data, we don't want to keep all of the current data within the text data list, because we've already saved those 10,000; we want to reinitialize the list so that we start again and save the next 10,000 after that. So we write text_data = [], an empty list. And if we just keep saving with the current file count, we're going to keep overwriting ourselves, so we also need to add one to the file count.
Now, that will save most of our data, but not the final batch. We have 28.5 million samples, and after dividing them into batches of 10,000 we are left with 2,082 samples at the end, which will never save, because the list will not reach 10,000 on that final loop. So at the very end, all we're going to do is copy that same save block one more time; that should be fine, and it will save the final 2,000 or so that we have there.
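Putting that all together, here's a sketch of the full export loop; the it_{file_count}.txt naming is my shorthand for the "Italian text" filenames described above.

```python
from tqdm.auto import tqdm

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    # Strip newlines so that each sample sits on exactly one line in the file.
    sample = sample['text'].replace('\n', ' ')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # Save the current batch of 10k samples, one sample per line.
        with open(f'it_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# The final <10k samples never trigger the save above, so write them out too.
with open(f'it_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
```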
So I'm going to run that, and it can take a while; let me see. Yeah, it's going to take maybe 30 minutes, and that's fine. After we have done that, we move on to what I'm going to cover in the next video, which is actually training our tokenizer. So I'll see you there.