Build NLP Pipelines with HuggingFace Datasets
Chapters
0:00 Intro
0:28 Importing Datasets
4:13 Loading Datasets
6:05 Selecting Datasets
7:15 Writing Datasets
8:42 Dataset Features
9:25 Dataset Example
11:14 Modifying Dataset Features
16:49 Troubleshooting
23:09 Batching
23:44 Tokenization
29:49 Filtering
00:00:00.000 |
Welcome to this video. We're going to have a look at Hugging Face's Datasets library. We're going to have a look at 00:00:06.980 |
some of what I think are the most useful datasets, and 00:00:10.980 |
we're going to look at how we can use the library to build 00:00:15.260 |
what I think are very good data input pipelines for NLP. So let's get started. 00:00:33.520 |
First we need to install Datasets, so we'll go pip install datasets, and that will install the library for you. 00:00:39.560 |
After this we'll want to go ahead and import datasets. 00:00:44.640 |
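As a quick sketch, the install-and-import step looks like this:

```python
# install the library first (run in your shell):
#   pip install datasets

import datasets
```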
Then we can start having a look at which datasets are available to us. 00:00:52.140 |
There are two ways that you can have a look at all of the datasets. The first one is 00:00:58.860 |
using the Hugging Face datasets viewer, which you can find on Google — just type in "datasets viewer" and 00:01:07.300 |
you'll find an app which allows you to go through and have a look at the different datasets. 00:01:12.640 |
Now, I'm not going to cover that here — I've already spoken about it a lot before, and it's super easy to use. 00:01:20.540 |
We're just going to have a look at how we can view everything in Python, which is the second option. 00:01:26.060 |
So first we can do this: we just list all of our datasets. 00:01:38.980 |
From this we will get, I think, something like 00:01:42.920 |
1,400 datasets, so it's quite a lot. If we take the len, 00:02:01.420 |
we get about 1,400, which is obviously a lot, and some of these are massive as well. 00:02:08.900 |
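A minimal sketch of that listing step (the exact count will differ depending on when you run it):

```python
import datasets

# list every dataset name available on the Hugging Face Hub
all_datasets = datasets.list_datasets()

print(len(all_datasets))   # ~1,400 at the time of recording
print(all_datasets[:5])    # peek at the first few names
```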
For example, if we were to look at the OSCAR dataset — 00:02:32.580 |
these are just dataset names, okay — we have 00:02:43.980 |
one that I imagine is probably Portuguese, and then we have all these other ones as well. But these are just 00:02:54.100 |
user-uploaded OSCAR datasets; this is the actual OSCAR dataset that's hosted by Hugging Face, and it's huge. 00:03:01.020 |
It contains, I think, more than a hundred and sixty 00:03:04.340 |
languages, and for some of them — English, for example, is one of the biggest — 00:03:15.060 |
there's a lot of data in there, but it's just unstructured 00:03:19.700 |
text. What I want to have a look at is the SQuAD datasets, 00:03:25.720 |
and we're just going to use the original SQuAD in our examples. 00:03:35.340 |
You can see that we have a few different ones here: Italian, Spanish, 00:03:40.700 |
Korean, you have the Thai QA SQuAD here, and then also French at the bottom. 00:03:49.740 |
Now, obviously you kind of need to know which dataset you're looking for. I know I'm looking for a SQuAD dataset, 00:03:55.500 |
so I've searched for "squad". There are other ones as well — actually, if I change this to lowercase, 00:04:04.180 |
okay, we have this one here and this one, although this one doesn't seem to work. 00:04:12.420 |
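That search can be done with a simple comprehension over the names returned earlier — a rough sketch:

```python
# find every dataset whose name mentions SQuAD (case-insensitive)
squad_datasets = [name for name in all_datasets if "squad" in name.lower()]
print(squad_datasets)
```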
Now, to load one of those datasets — obviously we're going to be using SQuAD — we write 00:04:27.320 |
load_dataset, and the name is 'squad'. Now, there are two ways to 00:04:33.260 |
download your data. If we do this, this is the default method: 00:04:39.420 |
we are going to download and cache the whole dataset locally, which for SQuAD is fine — 00:04:44.940 |
SQuAD is not a huge dataset, so it's not really a problem. 00:04:48.540 |
But when you think, okay, we wanted the English OSCAR dataset — 00:04:52.980 |
that's massive, that's 1.2 terabytes — in those cases you probably don't want to download it all onto your machine. 00:05:05.340 |
So what you can do instead is set streaming equal to True, and 00:05:10.180 |
when streaming is equal to True you do need to make some 00:05:14.300 |
changes to your code, which I'll show you, and there are also some things, 00:05:20.540 |
particularly filtering, which we will cover later on, that we can't do with streaming. 00:05:26.500 |
For now we're going to use streaming, and we'll switch over to not streaming later on. 00:05:43.880 |
With streaming, when we request a specific record within that dataset, it is only going to 00:05:48.220 |
download or store that single record, or multiple records, in our memory at once. 00:05:55.700 |
So we're not downloading the whole dataset; we're just processing it as we get it, which I think is very useful. 00:06:07.800 |
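Roughly, the two loading modes look like this:

```python
from datasets import load_dataset

# default: download and cache the full dataset
dataset = load_dataset("squad")

# streaming: nothing is downloaded up front; records are pulled lazily as we iterate
dataset = load_dataset("squad", streaming=True)
```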
Our data contains actual subsets (splits), and if we want to select a specific subset, all we have to do is write 00:06:15.140 |
load_dataset again. So let me copy this. 00:06:19.340 |
We copy that, and if we just want a subset we add split and 00:06:36.900 |
execute that. I'm not going to store that in our dataset variable here, because I don't want to overwrite it with 00:06:47.880 |
just a single IterableDataset object; we're only pulling in this single part of it, or single subset. 00:06:54.940 |
We can also view the splits — here we can see we have train and validation — and if you want to 00:07:01.260 |
see it in a clearer way, 00:07:10.660 |
you can use dictionary syntax for most of this, so we have train and validation. 00:07:18.860 |
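Continuing from the load above, that looks roughly like this:

```python
from datasets import load_dataset

# pulling just the train split returns a single IterableDataset
# (shown here without overwriting the dataset variable, as in the video)
load_dataset("squad", split="train", streaming=True)

# the full load keeps both splits, accessible with dictionary syntax
print(dataset)           # IterableDatasetDict with 'train' and 'validation'
print(dataset["train"])  # just the train split
```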
At the moment we have our dataset, but we don't really know anything about it. 00:07:23.640 |
We have this train subset, and let's say I want to understand what is in there. 00:07:29.420 |
What I can do to start is write dataset train 00:07:34.060 |
and I can check, for example, the dataset size — how big is it? 00:07:46.780 |
I don't know what I was doing there. Let's see what we get — it's 00:07:57.540 |
reasonably big, but it's nothing huge, nothing crazy. 00:08:00.220 |
We can also — so we have that — we can also get the description if I copy this. 00:08:13.980 |
Let me see what the data is. So, SQuAD — I didn't even mention it already, but 00:08:25.980 |
SQuAD is the Stanford Question Answering Dataset; it's generally used for training or testing Q&A models. 00:08:35.980 |
You can pause and read that if you want to. 00:08:42.300 |
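As a sketch, the size and description metadata can be read from the split's info object (on a regular, non-streaming Dataset the same values are also exposed directly as dataset_size and description):

```python
# dataset metadata for the train split
info = dataset["train"].info
print(info.dataset_size)   # size in bytes
print(info.description)    # "Stanford Question Answering Dataset (SQuAD) ..."
```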
Another thing that is pretty important is what features we have inside here. 00:08:49.100 |
We could also just print out one of the samples, 00:08:54.020 |
but this is useful to know, I think, and it also gives you the data types, which is kind of useful. 00:08:58.660 |
So we have id, title, context, question and answers. 00:09:07.700 |
Answers is actually a sequence here — within answers, we can view it as a dictionary — 00:09:15.180 |
and we have a text attribute and also an answer_start attribute, so that's pretty useful to know, I think. 00:09:27.820 |
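Roughly what that looks like:

```python
# inspect the schema: feature names and their data types
print(dataset["train"].features)
# e.g. {'id': string, 'title': string, 'context': string, 'question': string,
#       'answers': Sequence({'text': string, 'answer_start': int32})}
```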
Now let's look at one of our samples. So yeah, we have all the features here, 00:09:33.140 |
but let's say we just want to see what a record actually looks like. We could write dataset, train, and an index — 00:09:40.420 |
when we have streaming set to False we can write this, but because we have streaming set to True 00:09:48.060 |
we can't do this. So instead, what we have to do is 00:09:52.300 |
actually just iterate through the dataset: we write for sample in the dataset, 00:10:00.300 |
and we just want to print a single sample — I don't want to print any more than that, 00:10:07.900 |
so I'm going to write break after it. So we just print one of those samples. 00:10:12.300 |
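A minimal version of that loop:

```python
# streaming datasets are iterable, not indexable, so pull one record like this
for sample in dataset["train"]:
    print(sample)
    break  # stop after the first sample
```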
Then we see, okay, we have the id and we have the title. 00:10:18.340 |
Each of these samples is being pulled from a different Wikipedia page in this case, 00:10:23.820 |
and the title is the title of that page — so this one is from the University of Notre Dame. 00:10:30.300 |
We have the answers; further down we're going to ask a question, and these answers here — 00:10:36.660 |
we have the text, which is the text answer, and then we have the position, 00:10:41.780 |
the character position where the answer starts within the context, which is what you can see here. 00:10:50.220 |
Then there's the question we're asking, which the Q&A model is going to answer from the context. 00:11:01.940 |
We're not going to be training a model in this video or anything like that — 00:11:06.140 |
we're just experimenting with the Datasets library — so we don't need to worry so much about that. 00:11:13.660 |
So the first thing I want to do is have a look at how we can modify some of the features in our data. When 00:11:25.220 |
training a model, one of the first things we would do is take our answer_start and the text, and 00:11:32.260 |
we would use those to get the answer end position as well. 00:11:37.180 |
So let's go ahead and do that. First I just want to have a look at the 00:11:47.540 |
train split: I'm going to print out a few of the answer features. So we have sample, 00:11:53.760 |
then answer — or answers, sorry — and I just want to print that. 00:12:04.820 |
I want to enumerate this so I can count how many times we're going through it — 00:12:12.220 |
we're just viewing the data so we can actually see what we have in there — 00:12:23.380 |
and then break, to stop printing answers for us. And then we have a few of these: we have text, 00:12:30.420 |
we have answer_start, and we want to add an answer_end. The way that we do that is pretty straightforward: 00:12:35.660 |
we just take the answer_start and add the length of our text to that to get the answer end. 00:12:41.940 |
Nothing complicated there. So what we're going to do here is modify the answers feature, and 00:12:51.300 |
the best way — or at least the most common way — of 00:12:55.220 |
modifying features, or adding new features as well, is to use the map method. 00:13:06.420 |
The map method outputs a new dataset, so we write dataset train equals dataset train dot map, 00:13:29.460 |
and in here we're building a lambda function. 00:13:33.860 |
What we need to do — and this is one of the things that changes depending on whether you're using streaming 00:13:39.780 |
or not — is that with streaming equals True, in here we need to specify every feature; 00:13:56.000 |
when streaming is False, we would just write answers and 00:14:03.580 |
the modification to that feature. In this case we are taking the current answers dictionary, 00:14:15.460 |
and we would be merging that with a new dictionary item, which is going to be answer_end. 00:14:39.700 |
Here what we have to do is go x, answers — it's a little bit messy, but 00:14:44.740 |
it's just how it is — so we're within answers, and we want to take the answer start and the text. 00:15:13.900 |
So all we're doing there is taking answer_start and adding the 00:15:20.380 |
length of the answer text to that, to get our answer end. Now, 00:15:25.540 |
this is all we would have to write if we were using streaming equals False, but we're not: 00:15:33.220 |
streaming equals True, so we need to add every other feature in there as well. I'm not sure why it is, 00:15:42.180 |
but it is, so we need to just add those in as well. 00:15:46.900 |
All they are is a direct mapping from the old version to the new dataset: 00:15:55.780 |
we just need to add id, which maps straight to x id, and do the same for the other features as well — 00:16:07.620 |
we have answers already done, of course, and question, which is going to be x question. 00:16:37.380 |
Okay, and with that we should be ready to go, so let's map that. 00:16:49.420 |
What we'll find is that when we're using the streaming keyword equals True, 00:16:58.220 |
the map, or the transformation that we just built, is lazily evaluated. 00:17:01.780 |
So we haven't actually done anything yet; all we've done is pass this instruction to transform 00:17:07.700 |
the dataset in this way, but it hasn't actually transformed anything yet. 00:17:13.020 |
It only performs this transformation when we call the dataset — 00:17:21.660 |
iterating over it, for example, would call the dataset and force the code to run this transformation. 00:17:33.020 |
And you see we actually do get an error here. Why is that? Let me come down — 00:17:49.580 |
we're taking the length of answers; what's wrong with that? Ah — 00:17:58.940 |
text and answer_start are lists, so we actually need to access the items within the list. 00:18:09.540 |
When we first executed this code nothing happened, and it only actually came across the error 00:18:16.420 |
when we called the dataset, because that's when this transformation is actually performed. 00:18:21.900 |
Now, because we've already added this instruction to our dataset's 00:18:27.940 |
transformation, or building, process, we actually need to reinitialize our dataset. So we'll come back up here — 00:18:36.620 |
where are you... yes, load_dataset, not that one, this one — and we need to load that again to 00:18:46.800 |
reinitialize all of the instructions that we've added in there. 00:18:50.120 |
Then we can go ahead, rerun this, and now it should work, hopefully. 00:18:57.200 |
There we go. Now, if we have a look at this — and this is something I probably should have done but 00:19:03.000 |
completely forgot to — I should have added this as maybe a list rather than just a 00:19:08.920 |
number. But it's fine; if you come across this and need to do it properly, you may want to add that in. 00:19:16.320 |
We're not doing anything more than playing around with the Datasets 00:19:21.180 |
library here, so it's not really a problem, but you can see that we have added answer_end in there now. 00:19:38.160 |
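For reference, the whole streaming map call ends up looking roughly like this — the list indexing ([0]) is the fix just described, and every feature is passed through because streaming is set to True:

```python
# streaming=True: every feature must be returned by the lambda, or it gets dropped
dataset["train"] = dataset["train"].map(lambda x: {
    "id": x["id"],
    "title": x["title"],
    "context": x["context"],
    "question": x["question"],
    "answers": {
        **x["answers"],  # keep the existing text and answer_start lists
        # answer_end = answer_start + length of the answer text (first answer only;
        # a list of ends, one per answer, would be the more complete version)
        "answer_end": x["answers"]["answer_start"][0] + len(x["answers"]["text"][0]),
    },
})
```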
We'll also notice that we do still have all of our dataset's features. So if I 00:19:43.640 |
go here — I don't really need to remove that, it's fine, I'll just break straight away — 00:19:58.640 |
we see that we still have the id, we have the text, we have the context; we have everything in there. 00:20:27.000 |
That's fine, but then if I run this version — 00:20:31.140 |
so before, this had all of the features, 00:20:34.520 |
but now we only have the single feature that we specified in this formula, the answers. 00:20:40.200 |
So that's why, when streaming is set to True, you need to 00:20:45.920 |
add every single feature in there; otherwise it's just going to remove them when you perform the map operation. But that's only the case 00:20:53.600 |
when shuffle — why do I keep saying shuffle? — when streaming is set to True. 00:21:10.800 |
We're going to need to reload our dataset now anyway, because we just removed all the features from it. 00:21:24.040 |
What I'm going to do now is set streaming back to False, and I'm going to run this same code, where we 00:21:31.080 |
still don't have our ids or anything like that in there, and 00:21:35.440 |
we'll see what happens. We'll also notice we get a loading bar here, and 00:21:39.580 |
it's going to take a little bit of time to process this — although with this dataset it's probably going to be pretty fast. 00:21:48.400 |
Okay, you see it's taking a little bit of time, so now it's going through the whole dataset. 00:21:54.040 |
We haven't called the dataset, but we have used this map function, and when streaming is set to False 00:22:02.440 |
the dataset isn't lazily loaded, so the map operation is performed as soon as you call it. 00:22:09.140 |
So that's a slightly different behaviour, and the other behaviour which is different is the fact that 00:22:15.080 |
we've only needed to specify the answers feature here. 00:22:18.840 |
When we have streaming set to False, we don't need to include every feature within the map operation; 00:22:25.880 |
we only need to include the feature that we are modifying or creating. 00:22:32.600 |
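So with streaming off, the same transformation can be written as just the answers update — a rough sketch:

```python
from datasets import load_dataset

# reload without streaming, then map only the feature being modified;
# the remaining columns are carried over automatically when streaming=False
dataset = load_dataset("squad")

dataset["train"] = dataset["train"].map(lambda x: {
    "answers": {
        **x["answers"],
        "answer_end": x["answers"]["answer_start"][0] + len(x["answers"]["text"][0]),
    },
})
```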
You know, it's weird — I don't know why there's a behaviour difference when streaming is True or False, but there is. 00:22:45.760 |
If we run that, we see that we now have all of our features again, 00:22:54.840 |
whereas if I had run this code with streaming it would have only included our answers, not the id, title, context and question. 00:23:11.880 |
So it's a weird behaviour, but it's 00:23:16.960 |
how it is, and we obviously just need to deal with it. 00:23:20.980 |
Now the next thing I want to show you is how we can tokenize our text. 00:23:32.720 |
For pretty much any NLP task I can think of, we're going to want to tokenize our data, 00:23:43.680 |
so we're going to go ahead and do that for Q&A. 00:23:46.640 |
We would import transformers — or, from transformers, import a BERT tokenizer, let's say — 00:23:56.040 |
and initialize that. This is what we typically do: tokenizer equals BertTokenizer from_pretrained. 00:24:21.840 |
Then what I want to do is tokenize my 00:24:25.880 |
question and context in the format that SQuAD, or Q&A in general, would usually expect, 00:24:36.000 |
and I want to do that using the map function — you can do this in both streaming and non-streaming mode. 00:24:48.960 |
It's the same as before: dataset train equals dataset train dot map, with a lambda inside. 00:25:10.400 |
Usually when you write this you would include a dictionary here, but 00:25:16.400 |
the output from the tokenizer is already in dictionary format, 00:25:21.100 |
so I don't need to do it in this case — 00:25:23.920 |
basically, what we have here is still a dictionary. 00:25:31.360 |
For Q&A, in your tokenizer you pass two text inputs: you pass your question and your context, 00:25:53.320 |
and we would set padding equal to the max length. 00:26:01.360 |
Okay, so it's a very typical tokenization process; there's nothing different going on here. 00:26:07.920 |
This is what we normally do when we tokenize text for a 00:26:14.080 |
transformer model. Then we want to say batched equals True — 00:26:19.000 |
this allows us to perform this operation in batches — 00:26:23.000 |
and then we can also specify our batch size, so batch size equals 32. 00:26:36.520 |
The map function here is going to tokenize our question and context in batches of 32. 00:26:45.680 |
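A hedged sketch of that tokenization step — the checkpoint name and max length here are assumptions, not taken from the video:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

dataset["train"] = dataset["train"].map(
    lambda x: tokenizer(
        x["question"], x["context"],           # Q&A convention: question + context pair
        max_length=512, padding="max_length",  # pad everything to one fixed length
        truncation=True,
    ),
    batched=True,    # pass lists of samples to the lambda
    batch_size=32,   # tokenize 32 question/context pairs at a time
)
```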
Okay, and you can see that processing there. That's really all we need to 00:26:53.400 |
do with that, so I think that's probably it for tokenization. 00:27:02.480 |
We'll continue with a few of the other methods that I think are quite useful as well, 00:27:13.440 |
so we can go ahead and have a look at what we've actually produced. 00:27:26.440 |
Now we have answers like we did before, but we also have attention_mask, 00:27:31.360 |
we have input_ids, and we also have token_type_ids — 00:27:35.760 |
the three tensors that we usually get out of the tokenizer. 00:27:42.240 |
So we now have those in there as well. We can also have a look at 00:27:45.120 |
another thing: rather than looping through our dataset — because we're not using streaming 00:27:53.180 |
equals True, we're using streaming equals False — we can now index it directly, 00:27:59.400 |
and we can see, okay, we have the attention mask. It's not going to show me everything because it's quite large, 00:28:04.960 |
so I'll just delete that, but you can see that we have the attention mask in there. 00:28:13.440 |
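Something like this, now that indexing works:

```python
# with streaming=False the dataset supports direct indexing
first = dataset["train"][0]
print(first.keys())
print(first["attention_mask"][:10])  # just a slice; the full mask is padded to max length
```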
Say I want to be quite pedantic and I don't like the fact that there is a 'title' feature — 00:28:28.840 |
I think it should be 'topic', because it's the topic of the context and the question. 00:28:33.360 |
If I want to be really pedantic and modify that, I could say dataset train, rename_column. 00:28:42.140 |
To be honest, you can use it for this, of course, 00:28:45.360 |
but you're probably going to use it more when you need to rename a column to match the 00:28:54.580 |
expected inputs for a transformer model, for example. 00:28:58.960 |
That's where you would usually use it, but I'm just using this example, so I'm going to rename the column 'title'. 00:29:13.900 |
Down here we have title; in a moment we're going to have topic. 00:29:22.700 |
So that's rename_column. Like I said, it's not that useful in this case, but generally it can be quite useful. 00:29:33.700 |
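The call itself is a one-liner:

```python
# rename the 'title' column to 'topic' (returns a new dataset, so reassign it)
dataset["train"] = dataset["train"].rename_column("title", "topic")
```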
What I may want to do as well is remove certain 00:29:38.220 |
records from this dataset. So far we've been 00:29:41.740 |
printing out this field, which is now topic, and we have University of Notre Dame. 00:29:49.640 |
Maybe, for whatever reason, we don't want to include those records. 00:29:56.460 |
Very similar to before, we write dataset train, 00:30:05.860 |
and this time I'm going to use filter, so we're going to filter out the records 00:30:09.900 |
I don't want. Again, it's very similar to the syntax you use for the map function, which is a lambda, and 00:30:17.980 |
in here we just need to specify the condition for the samples that we do want to keep. 00:30:25.820 |
In this case, we want to say: keep the record wherever the topic is not University of Notre Dame. 00:30:36.780 |
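Roughly (the exact title string is an assumption about how SQuAD formats it):

```python
# keep only the records whose topic is not the Notre Dame article
dataset["train"] = dataset["train"].filter(
    lambda x: x["topic"] != "University_of_Notre_Dame"
)
```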
Okay, so we'll run this and have a look at what we produce — dataset train. 00:30:48.120 |
At the moment we have the number of rows here, which is just over 87,000, 00:30:55.860 |
and we should get a lower number now. This will also go through the whole thing — 00:31:01.220 |
remember we have streaming — why do I keep calling it shuffle? — we have streaming set to False, 00:31:11.820 |
so it's going to run through the whole dataset and then perform this filtering operation. 00:31:19.660 |
I'll just fast forward to where this finishes in a moment. 00:31:26.260 |
Okay, so now that's finished. Before, the first records all had University of Notre Dame as the 00:31:43.900 |
topic; I want to see, let's say, the first five of those now. 00:31:47.940 |
Okay, now they're all Beyoncé, rather than before, where it was the University of Notre Dame. 00:32:02.520 |
Say, for example, we're performing Q&A inference with a transformer model. 00:32:10.380 |
We don't really need all of the features that we have here — 00:32:15.840 |
we would only need the attention mask, the input IDs and the token type IDs. 00:32:23.940 |
So what we can do now is remove some of those columns. 00:32:36.080 |
We want to remove those columns, so we use remove_columns, 00:32:47.880 |
and we'll just remove all of them other than the ones that we want. 00:33:03.800 |
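A sketch of that clean-up step:

```python
# drop everything except the tensors the transformer model expects
keep = ["attention_mask", "input_ids", "token_type_ids"]
to_remove = [col for col in dataset["train"].column_names if col not in keep]
dataset["train"] = dataset["train"].remove_columns(to_remove)
```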
Okay, and then let's have a look at what we have left. 00:33:11.400 |
Okay, and that's it — we have those final features, and these are the ones that we would input into a 00:33:20.920 |
transformer model for training. There's nothing else I really want to cover; 00:33:26.080 |
I think that is pretty much all you need to know about 00:33:29.160 |
Hugging Face's Datasets to get started and start building what I think are pretty good input pipelines with 00:33:39.000 |
the datasets that are available. So we'll leave it there. 00:33:43.260 |
Thank you very much for watching, and I will see you again in the next one. Bye.