
How-to Use HuggingFace's Datasets - Transformers From Scratch #1


Chapters

0:00 Intro
1:28 Getting Data
7:25 Training Data

Whisper Transcript

00:00:00.000 | Hi, welcome to this video. This is the first video in what will be
00:00:04.800 | kind of like a mini-series on how we can train a transformer from scratch.
00:00:10.080 | So a lot of you have been asking for this in one way or another, so we're just going to run through
00:00:17.760 | kind of everything I think you need to know to actually just build your own transformer.
00:00:25.040 | So there are a few different parts to this and that's why I'm doing it in a series because
00:00:31.040 | otherwise it would be a super long video. So we're just going to break it down into
00:00:39.440 | a few different parts. So the first video is going to be on getting our data and that's what
00:00:45.520 | you're watching now. So we're going to learn how to use the Hugging Face datasets library,
00:00:52.560 | which I think is very good actually. So we'll take a look at that. In the next video,
00:00:59.440 | we'll have a look at actually training the tokenizer with that data. And then in these
00:01:06.400 | bits here, so these three parts, they might just be one video. I'm not sure; I'm going to see how it
00:01:15.440 | flows. So there will probably be maybe one or two videos. Let's see. So we'll move on to
00:01:25.440 | getting our data over in Python. Now when it comes to getting data to train a transformer model,
00:01:32.160 | we are pretty much spoiled for choice. All we really need is unstructured text data,
00:01:38.880 | which of course there's a lot of that on the internet. Now there's one dataset in particular
00:01:45.680 | that I've noticed that is very good called Oscar and that is a multilingual dataset.
00:01:52.400 | It has like hundreds of languages, I think, like really loads. So we're going to go ahead and use
00:02:00.320 | that. Now to download the dataset, or the Oscar dataset, we will be using a Hugging Face library
00:02:11.280 | called datasets. So if you do need to, you can pip install that. So you just do pip install
00:02:16.000 | datasets. And once we have that, we just want to import datasets. Now,
00:02:25.520 | what we can do if we want to just view all of the datasets that are available to us
00:02:29.680 | within our notebook here is we can write datasets.list_datasets().
00:02:38.480 | And let's just print out how many we have, because there's quite a few. So at the moment,
00:02:47.440 | there's 965 of those. So I'm not going to print all of them out, but let's just have a look at
00:02:52.480 | a few of those so we can see what we actually have here. So here's just five. So it's
00:02:58.320 | alphabetically sorted. So we have five datasets all beginning with A. Now, one of these is called
00:03:08.640 | Oscar. So we can write 'oscar' in ds. We see True, but we can't really get that much information
00:03:19.040 | from within Python. So what we can do is head on over
00:03:24.160 | to this website here, which is the Hugging Face datasets viewer, which is really cool.
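Before moving to the viewer, here is a rough sketch of those Python steps. Note that list_datasets was the helper the datasets library exposed at the time of recording; newer releases deprecate it in favor of the huggingface_hub API, so this assumes an older version:

```python
# pip install datasets
import datasets

# Pull the full list of dataset names available on the Hugging Face hub.
ds = datasets.list_datasets()

print(len(ds))        # 965 at the time of recording
print(ds[:5])         # the first five names, alphabetically sorted
print('oscar' in ds)  # True - the OSCAR corpus is in there
```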
00:03:34.320 | So what we do is we go over here on the left, and we can search for a dataset based on tags.
00:03:44.960 | Or what we'll be using is the actual dataset name. So we can just
00:03:48.960 | go through and see all these datasets. There's loads of them. Now, if I scroll down quite a lot,
00:03:55.520 | I think you can also type it in at the top, you will find, or I should find, Oscar here.
00:04:09.600 | Now we search for Oscar, and then we also get this subset here. This is another important thing. So
00:04:15.040 | Oscar has all these different subsets of languages within its dataset,
00:04:21.040 | and these are all of those. Now, if you want to know what each of those are,
00:04:31.040 | because we just get a letter here, which is not really that useful,
00:04:36.640 | we can go over here. So this is oscar-corpus.com.
00:04:41.280 | And I believe we click here. Okay. So we scroll down to here, and we have a big list of
00:04:52.960 | all of the languages here. So we have the language, and then we have the AF here,
00:04:59.040 | which is the first one. So the first one we know is Afrikaans. Now, if we scroll down, I'm going to
00:05:07.680 | go, I only know English, but my girlfriend is Italian and she's going to come along and hopefully
00:05:16.560 | tell us that the model works at some point. So we're going to be using Italian because that's
00:05:23.760 | literally the only choice I have other than English, and that's kind of boring. So we're
00:05:27.680 | going to go with this one. So we need to search for the one with IT at the end. We come here,
00:05:36.960 | and we can just type it in, I think, or maybe we can't. There we go. So we click that, and
00:05:48.400 | we are not actually allowed to view it because it's too big. I didn't realize that.
00:05:56.080 | Let's go with, I think, Latin you can view.
00:05:58.320 | Yeah. So obviously this isn't Italian, this is Latin, but you can see here we have the ID,
00:06:07.760 | and we have text, and this is the data set that we're going to be using. So
00:06:11.200 | let's go back over to, let's copy this, and we'll go back over to our code.
00:06:20.880 | Now, what we do is we're going to be loading that dataset, or downloading the
00:06:26.320 | dataset, into this dataset variable. We want to write datasets.
00:06:29.600 | load_dataset, and here we want to write 'oscar', which is the dataset name,
00:06:39.280 | and then we also want to include the subset. So our subset is,
00:06:59.680 | hmm, not pasting. That's fine. It's 'unshuffled_deduplicated_it'.
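So the call being typed out here is just this one line; the first run downloads and caches the full corpus, which is large:

```python
# Download the Italian subset of OSCAR. Expect a progress bar and a long
# wait the first time; afterwards it loads from the local cache.
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_it')
```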
00:06:59.680 | So here we go. So I already have it downloaded, I think, so that's not going to download it again,
00:07:11.040 | but you should get a loading bar if this is your first time downloading this data set,
00:07:14.240 | and that might take a little bit of time. Okay. So that has loaded for me now, so I can
00:07:21.680 | do this, and we can see that we have this DatasetDict object, and inside there we have this
00:07:29.360 | one item. So sometimes you have training data and testing data. We just have training data here.
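Printing the object shows the structure being described; the exact row count below is the figure this split had at the time and may differ between OSCAR versions:

```python
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'text'],
#         num_rows: 28522082
#     })
# })
```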
00:07:35.840 | So we have train, and then inside that we have our data. So we have this many samples, which is
00:07:44.240 | 28.5 million samples. I'm not going to use all of them because it will just take a very long time,
00:07:51.760 | and we have the two features that we saw before. So we have the ID,
00:07:58.800 | and then we also have the text, which is what we care about. Now, if we wanted to
00:08:05.200 | just have a look at one of those, we could write dataset['train'][0] like this,
00:08:12.560 | and we see our data. Okay.
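Pulling a single record looks like this (the text shown will of course depend on the sample):

```python
# Index into the train split to inspect one sample.
print(dataset['train'][0])
# {'id': 0, 'text': '...'}  # one Italian document per record
```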
00:08:19.760 | Now, that's good, but when we're training our tokenizer, we're going to want to read these
00:08:29.680 | in from file rather than keeping them in memory. So what we're going to do is,
00:08:38.080 | first I'm going to import tqdm because this can take a little bit of time. So I want to
00:08:47.920 | have a loading bar so that we can actually see what is happening,
00:08:57.520 | and from there we want to initialize the list, which is going to contain our text data.
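As a sketch, the setup described here is just an import and two variables; tqdm.auto is one common way to get the progress bar, and the filename pattern follows the "Italian text, or just IT, plus the file count" naming described below:

```python
from tqdm.auto import tqdm  # progress bar, since this loop takes a while

text_data = []   # buffer for samples until we have 10,000 to write out
file_count = 0   # numbers the output files: it_0.txt, it_1.txt, ...
```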
00:09:07.520 | And what I'm going to do is, so I'm going to loop through all of this data, format it in a way that
00:09:16.640 | we can then save it to file that we would expect for the tokenizer. So it essentially needs every
00:09:23.360 | sample to be on a new line, and I'm just going to take, I think, 10,000 of those samples,
00:09:31.600 | put them into a file, and then save it and move on to the next file. So this is what the
00:09:36.000 | file count is for. I'm just going to write something like, I don't know, Italian data zero,
00:09:41.440 | Italian data one, Italian data two, and so on. So we're going to loop through all of our samples,
00:09:48.960 | so for sample in, and here I'm going to wrap it in tqdm so that we can see the
00:09:54.160 | loading bar or the progress bar. And here we're just going to go data set train,
00:10:02.400 | so that will go through all of our samples. Now we're going to be splitting each sample with a
00:10:09.440 | new line character, which also means we need to remove any other new line characters from our
00:10:14.400 | data, otherwise we're going to be splitting each sample into multiple samples, which we
00:10:20.400 | don't really want. So we write sample equals sample, and in here, remember, we have ID
00:10:29.680 | and text here, so we want to access the text specifically, and we're going to replace the
00:10:39.040 | new line characters in there with just spaces, I think, yeah. Then what we're going to do is
00:10:50.320 | text data append sample, so that is going to add one sample to our text data list up here.
00:10:58.000 | Now what we want to do is say if the length of that text data list
00:11:06.320 | hits 10k, at that point, we want to save it to file, so I'm going to write with open,
00:11:20.240 | and I'm just going to call it Italian text, or just IT, and I'll put in the file count,
00:11:33.040 | so file count dot txt. We're going to be writing that, and then we just write fp dot write.
00:11:45.760 | We're using new line characters here, so we're just going to join everything within
00:11:50.560 | our text data list like that, and we also want to include the encoding here, so utf-8.
00:12:02.720 | Now once we've written that data, we don't want to keep all of the current data within
00:12:09.920 | the text data variable or text data list, because we have 10,000, we want to sort of
00:12:16.000 | reinitialize that list so that we start again, and then we print the next, or we save the next
00:12:20.560 | 10,000 after that. So we want to write text data equals, and it's going to equal an empty list,
00:12:30.560 | and then obviously, if we just keep saving with the current file count, we're just going to keep
00:12:35.200 | overwriting ourselves, so we need to add one to the file count. Now that will save most of our data,
00:12:43.440 | but on the final batch, so we have, what do we have?
00:12:52.320 | So we have this many samples, and if we take the remainder after dividing by 10,000, we are left with 2,082 samples at the end
00:13:10.080 | there, so that means they will not save because it will not reach the 10k on that final loop.
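A quick sanity check on that remainder, assuming the 28,522,082-row split size shown in the dataset printout above:

```python
print(28_522_082 % 10_000)  # 2082 samples left over after the full 10k batches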
00:13:16.960 | So at the very end here, all we're going to do is, I think we just copy this,
00:13:22.560 | so we will copy that, and
00:13:28.320 | yeah, that should be fine, so that will save that final 2k that we have there.
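Putting the whole loop together, here is a self-contained sketch of what's described above; filenames like it_0.txt are my reading of the naming mentioned earlier:

```python
from tqdm.auto import tqdm

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    # Samples are separated by newlines in the output files, so replace any
    # newline characters inside the text itself with spaces first.
    sample = sample['text'].replace('\n', ' ')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # Write the current 10k samples, one per line, then move to a new file.
        with open(f'it_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []   # reset the buffer for the next batch
        file_count += 1

# The final batch (2,082 samples here) never hits 10k inside the loop,
# so write whatever is left to one last file.
with open(f'it_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
```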
00:13:37.280 | So I'm going to run that, I think it can take a while, well let me see, I think it does take a
00:13:45.200 | while, so what's that, 20 minutes? That's kind of weird. Yeah, it's going to take maybe 30 minutes, so
00:14:02.240 | that's fine. So after we have done that, we move on to what I'm going to cover in the next video,
00:14:13.360 | which is actually training our tokenizer, so I'll see you there.