
How-to Use HuggingFace's Datasets - Transformers From Scratch #1


Chapters

0:00 Intro
1:28 Getting Data
7:25 Training Data

Whisper Transcript

00:00:00.000 | Hi, welcome to this video. This is the first video in what will be
00:00:04.800 | kind of like a mini-series on how we can train a transformer from scratch.
00:00:10.080 | So a lot of you have been asking for this in one way or another, so we're just going to run through
00:00:17.760 | kind of everything I think you need to know to actually just build your own transformer.
00:00:25.040 | So there are a few different parts to this and that's why I'm doing it in a series because
00:00:31.040 | otherwise it would be a super long video. So we're just going to break it down into
00:00:39.440 | a few different parts. So the first video is going to be on getting our data and that's what
00:00:45.520 | you're watching now. So we're going to learn how to use the Hugging Face datasets library,
00:00:52.560 | which I think is very good actually. So we'll take a look at that. In the next video,
00:00:59.440 | we'll have a look at actually training the tokenizer with that data. And then in these
00:01:06.400 | bits here, so these three parts, they might just be one video. I'm not sure; I'm going to see how it
00:01:15.440 | flows. So there will probably be maybe one or two videos. Let's see. So we'll move on to
00:01:25.440 | getting our data over in Python. Now when it comes to getting data to train a transformer model,
00:01:32.160 | we are pretty much spoiled for choice. All we really need is unstructured text data,
00:01:38.880 | which of course there's a lot of that on the internet. Now there's one dataset in particular
00:01:45.680 | that I've noticed that is very good called Oscar and that is a multilingual dataset.
00:01:52.400 | It has like hundreds of languages, I think, like really loads. So we're going to go ahead and use
00:02:00.320 | that. Now to download the dataset, or the Oscar dataset, we will be using a Hugging Face library
00:02:11.280 | called datasets. So if you do need to, you can pip install that. So you just do pip install
00:02:16.000 | datasets. And once we have that, we just want to import datasets. Now,
00:02:25.520 | what we can do if we want to just view all of the datasets that are available to us
00:02:29.680 | within our notebook here is we can write datasets.list_datasets().
00:02:38.480 | And let's just print out how many we have, because there's quite a few. So at the moment,
00:02:47.440 | there's 965 of those. So I'm not going to print all of them out, but let's just have a look at
00:02:52.480 | a few of those so we can see what we actually have here. So here's just five. So it's
00:02:58.320 | alphabetically sorted. So we have five datasets all beginning with A. Now, one of these is called
00:03:08.640 | Oscar. So we can write 'oscar' in ds. We see True, but we can't really get that much information
00:03:19.040 | from within Python. So what we can do is head on over
00:03:24.160 | to this website here, which is the Hugging Face datasets viewer, which is really cool.
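Before moving to the viewer, here is a rough sketch of those Python steps. Note that list_datasets was the helper the datasets library exposed at the time of recording; newer releases deprecate it in favor of the huggingface_hub API, so this assumes an older version:

```python
# pip install datasets
import datasets

# Pull the full list of dataset names available on the Hugging Face hub.
ds = datasets.list_datasets()

print(len(ds))        # 965 at the time of recording
print(ds[:5])         # the first five names, alphabetically sorted
print('oscar' in ds)  # True - the OSCAR corpus is in there
```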
00:03:34.320 | So what we do is we go over here on the left, and we can search for a dataset based on tags.
00:03:44.960 | Or what we'll be using is the actual dataset name. So we can just
00:03:48.960 | go through and see all these datasets. There's loads of them. Now, if I scroll down quite a lot,
00:03:55.520 | I think you can also type it in at the top, you will find, or I should find, Oscar here.
00:04:09.600 | Now we search for Oscar, and then we also get this subset here. This is another important thing. So
00:04:15.040 | Oscar has all these different subsets of languages within its dataset,
00:04:21.040 | and these are all of those. Now, if you want to know what each of those are,
00:04:31.040 | because we just get a letter here, which is not really that useful,
00:04:36.640 | we can go over here. So this is oscar-corpus.com.
00:04:41.280 | And I believe we click here. Okay. So we scroll down to here, and we have a big list of
00:04:52.960 | all of the languages here. So we have the language, and then we have the AF here,
00:04:59.040 | which is the first one. So the first one we know is Afrikaans. Now, if we scroll down, I'm going to
00:05:07.680 | go, I only know English, but my girlfriend is Italian and she's going to come along and hopefully
00:05:16.560 | tell us that the model works at some point. So we're going to be using Italian because that's
00:05:23.760 | literally the only choice I have other than English, and that's kind of boring. So we're
00:05:27.680 | going to go with this one. So we need to search for the one with IT at the end. We come here,
00:05:36.960 | and we can just type it in, I think, or maybe we can't. There we go. So we click that, and
00:05:48.400 | we are not actually allowed to view it because it's too big. I didn't realize that.
00:05:56.080 | Let's go with, I think, Latin you can view.
00:05:58.320 | Yeah. So obviously this isn't Italian, this is Latin, but you can see here we have the ID,
00:06:07.760 | and we have text, and this is the data set that we're going to be using. So
00:06:11.200 | let's go back over to, let's copy this, and we'll go back over to our code.
00:06:20.880 | Now, what we do is we're going to be loading that dataset, or downloading the
00:06:26.320 | dataset, into this dataset variable. We want to write datasets.
00:06:29.600 | load_dataset, and here we want to write 'oscar', which is the dataset name,
00:06:39.280 | and then we also want to include the subset. So our subset is,
00:06:59.680 | hmm, not pasting. That's fine. It's 'unshuffled_deduplicated_it'.
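So the call being typed out here is just this one line; the first run downloads and caches the full corpus, which is large:

```python
# Download the Italian subset of OSCAR. Expect a progress bar and a long
# wait the first time; afterwards it loads from the local cache.
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_it')
```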
00:06:59.680 | So here we go. So I already have it downloaded, I think, so that's not going to download it again,
00:07:11.040 | but you should get a loading bar if this is your first time downloading this data set,
00:07:14.240 | and that might take a little bit of time. Okay. So that has loaded for me now, so I can
00:07:21.680 | do this, and we can see that we have this DatasetDict object, and inside there we have this
00:07:29.360 | one item. So sometimes you have training data and testing data. We just have training data here.
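Printing the object shows the structure being described; the exact row count below is the figure this split had at the time and may differ between OSCAR versions:

```python
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'text'],
#         num_rows: 28522082
#     })
# })
```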
00:07:35.840 | So we have train, and then inside that we have our data. So we have this many samples, which is
00:07:44.240 | 28.5 million samples. I'm not going to use all of them because it will just take a very long time,
00:07:51.760 | and we have the two features that we saw before. So we have the ID,
00:07:58.800 | and then we also have the text, which is what we care about. Now, if we wanted to
00:08:05.200 | just have a look at one of those, we could write dataset['train'][0] like this,
00:08:12.560 | and we see our data. Okay.
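Pulling a single record looks like this (the text shown will of course depend on the sample):

```python
# Index into the train split to inspect one sample.
print(dataset['train'][0])
# {'id': 0, 'text': '...'}  # one Italian document per record
```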
00:08:19.760 | Now, that's good, but when we're training our tokenizer, we're going to want to read these
00:08:29.680 | in from file rather than keeping them in memory. So what we're going to do is,
00:08:38.080 | first I'm going to import tqdm because this can take a little bit of time. So I want to
00:08:47.920 | have a loading bar so that we can actually see what is happening,
00:08:57.520 | and from there we want to initialize the list, which is going to contain our text data.
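As a sketch, the setup described here is just an import and two variables; tqdm.auto is one common way to get the progress bar, and the filename pattern follows the "Italian text, or just IT, plus the file count" naming described below:

```python
from tqdm.auto import tqdm  # progress bar, since this loop takes a while

text_data = []   # buffer for samples until we have 10,000 to write out
file_count = 0   # numbers the output files: it_0.txt, it_1.txt, ...
```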
00:09:07.520 | And what I'm going to do is, so I'm going to loop through all of this data, format it in a way that
00:09:16.640 | we can then save it to file that we would expect for the tokenizer. So it essentially needs every
00:09:23.360 | sample to be on a new line, and I'm just going to take, I think, 10,000 of those samples,
00:09:31.600 | put them into a file, and then save it and move on to the next file. So this is what the
00:09:36.000 | file count is for. I'm just going to write something like, I don't know, Italian data zero,
00:09:41.440 | Italian data one, Italian data two, and so on. So we're going to loop through all of our samples,
00:09:48.960 | so for sample in, and here I'm going to wrap it in tqdm so that we can see the
00:09:54.160 | loading bar or the progress bar. And here we're just going to go data set train,
00:10:02.400 | so that will go through all of our samples. Now we're going to be splitting each sample with a
00:10:09.440 | new line character, which also means we need to remove any other new line characters from our
00:10:14.400 | data, otherwise we're going to be splitting each sample into multiple samples, which we
00:10:20.400 | don't really want. So we write sample equals sample, and in here, remember, we have ID
00:10:29.680 | and text here, so we want to access the text specifically, and we're going to replace the
00:10:39.040 | new line characters in there with just spaces, I think, yeah. Then what we're going to do is
00:10:50.320 | text data append sample, so that is going to add one sample to our text data list up here.
00:10:58.000 | Now what we want to do is say if the length of that text data list
00:11:06.320 | hits 10k, at that point, we want to save it to file, so I'm going to write with open,
00:11:20.240 | and I'm just going to call it Italian text, or just IT, and I'll put in the file count,
00:11:33.040 | so file count dot txt. We're going to be writing that, and then we just write fp dot write.
00:11:45.760 | We're using new line characters here, so we're just going to join everything within
00:11:50.560 | our text data list like that, and we also want to include the encoding here, so utf-8.
00:12:02.720 | Now once we've written that data, we don't want to keep all of the current data within
00:12:09.920 | the text data variable or text data list, because we have 10,000, we want to sort of
00:12:16.000 | reinitialize that list so that we start again, and then we print the next, or we save the next
00:12:20.560 | 10,000 after that. So we want to write text data equals, and it's going to equal an empty list,
00:12:30.560 | and then obviously, if we just keep saving with the current file count, we're just going to keep
00:12:35.200 | overwriting ourselves, so we need to add one to the file count. Now that will save most of our data,
00:12:43.440 | but on the final batch, so we have, what do we have?
00:12:52.320 | So we have this many samples, and if we take the remainder after dividing by 10,000, we are left with 2,082 samples at the end
00:13:10.080 | there, so that means they will not save because it will not reach the 10k on that final loop.
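A quick sanity check on that remainder, assuming the 28,522,082-row split size shown in the dataset printout above:

```python
print(28_522_082 % 10_000)  # 2082 samples left over after the full 10k batches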
00:13:16.960 | So at the very end here, all we're going to do is, I think we just copy this,
00:13:22.560 | so we will copy that, and
00:13:28.320 | yeah, that should be fine, so that will save that final 2k that we have there.
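Putting the whole loop together, here is a self-contained sketch of what's described above; filenames like it_0.txt are my reading of the naming mentioned earlier:

```python
from tqdm.auto import tqdm

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    # Samples are separated by newlines in the output files, so replace any
    # newline characters inside the text itself with spaces first.
    sample = sample['text'].replace('\n', ' ')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # Write the current 10k samples, one per line, then move to a new file.
        with open(f'it_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []   # reset the buffer for the next batch
        file_count += 1

# The final batch (2,082 samples here) never hits 10k inside the loop,
# so write whatever is left to one last file.
with open(f'it_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
```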
00:13:37.280 | So I'm going to run that, I think it can take a while, well let me see, I think it does take a
00:13:45.200 | while, so what's that, 20 minutes? That's kind of weird. Yeah, it's going to take maybe 30 minutes, so
00:14:02.240 | that's fine. So after we have done that, we move on to what I'm going to cover in the next video,
00:14:13.360 | which is actually training our tokenizer, so I'll see you there.