
Build NLP Pipelines with HuggingFace Datasets


Chapters

0:00 Intro
0:28 Importing Datasets
4:13 Loading Datasets
6:05 Selecting Datasets
7:15 Writing Datasets
8:42 Dataset Features
9:25 Dataset Example
11:14 Modifying Dataset Features
16:49 Troubleshooting
23:09 Batching
23:44 Tokenization
29:49 Filtering

Transcript

Welcome to this video. We're going to have a look at Hugging Face's Datasets library: some of what I think are the most useful datasets, and how we can use the library to build what I think are very good data input pipelines for NLP.

So let's get started. The first thing we want to do is install the library, so we run pip install datasets, and that will install it for us. After that we import datasets, and then we can start having a look at which datasets are available to us. There are two ways to browse them. The first is the Hugging Face datasets viewer, which you can find by typing "datasets viewer" into Google; it's an interactive app that lets you click through the different datasets. I've already spoken about that a lot before and it's super easy to use, so we're not going to go through it here. Instead we're going to look at the second option: viewing everything in Python. First we list all of the datasets, and I'm going to call that list ds_list. From this we get something like 1,400 datasets, so it's quite a lot; if we check the length, yes, it's about 1,400, which is obviously a lot, and some of these are massive as well. For example, to look at the OSCAR datasets we can filter the list: keep each name in ds_list if "oscar" is in it. These are just dataset names.
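Roughly, that search looks like this in code (a sketch of the steps described above; list_datasets has since been deprecated in favour of huggingface_hub's listing functions, and the counts are far larger today):

```python
import datasets

ds_list = datasets.list_datasets()   # all dataset names on the Hub
print(len(ds_list))                  # roughly 1,400 at the time of recording

# keep only the names that mention OSCAR
oscar_names = [name for name in ds_list if 'oscar' in name]
print(oscar_names[:5])
```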

Okay, so we have OSCAR. I think "pt" is, what is "pt"? Right, I imagine it's probably Portuguese. Then we have all these other ones as well, but those are user-uploaded OSCAR datasets; this one is the actual OSCAR dataset hosted by Hugging Face, and it's huge. It contains, I think, more than 160 languages, and some of them are enormous: English, one of the biggest, contains 1.2 terabytes of data. So there's a lot of data in there, but it's just unstructured text. What I want to have a look at is the SQuAD datasets. We're only going to use the original SQuAD in this video, but you can see that we have a few different ones here.

So Italian, Spanish, Korean, you have Thai (the Thai QA SQuAD here) and then also French at the bottom, so you have plenty of choice. Now, obviously you need to know roughly what dataset you're looking for; I know I'm looking for a SQuAD dataset, so I've searched for "squad".

There are other ones as well. Actually, if I lowercase the names before checking, we'll see those pop up too; so we have this one here, and this one (this one doesn't seem to work, but that's fine). Now, to load one of those datasets, and obviously we're going to be using SQuAD, we write dataset = datasets.load_dataset() and pass in the dataset name, "squad". Now, there are two ways to download your data.
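The lowercased search is just a tweak to the list comprehension from before, something like:

```python
# match SQuAD-style datasets regardless of how their names are capitalised
squad_names = [name for name in ds_list if 'squad' in name.lower()]
print(squad_names)
```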

If we call it like this, the default method, we download and cache the whole dataset locally, which for SQuAD is fine; it's not a huge dataset, so it's not really a problem. But when you think, okay, we want the English OSCAR data, that's massive.

That's 1.2 terabytes, so in those cases you probably don't want to download it all onto your machine. What you can do instead is set streaming equal to True. When streaming is True you do need to make some changes to your code, which I'll show you, and there are also some things, particularly filtering, which we'll cover later on, that we can't do with streaming. For now we're going to use streaming, and we'll switch over to not streaming later on. This creates an iterable dataset object, and it means that whenever we ask for a specific record from the dataset, only that record (or batch of records) is downloaded and held in memory at once. So we're not downloading the whole dataset; we're just processing it as we get it, which I think is very useful.
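Both loading modes look something like this (a sketch; the streaming flag was added to load_dataset in the 1.x releases of the library):

```python
import datasets

# default: download the full dataset and cache it locally
dataset = datasets.load_dataset('squad')

# streaming: nothing is downloaded up front, records are fetched lazily as you iterate
dataset = datasets.load_dataset('squad', streaming=True)
```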

Now, you can see here that we have two subsets within our data. If we want to select a specific subset, all we have to do is rewrite the load_dataset call, so let me copy it, and if we just want one subset we add split, which in this case would be "train" or "validation". If I just execute that (I'm not going to store it in our dataset variable here, because I don't want to keep only the train set), we get a single iterable dataset object.

So we're just pulling in that single subset. We can also see that we have train and validation, and if you want to see it in a clearer way you can use dictionary syntax, so dataset.keys(); you can use dictionary syntax for most of this, and we see we have train and validation.
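Continuing from the load above, selecting a split and checking which splits exist is roughly:

```python
# pulling just the train split returns a single iterable dataset
# (not stored into `dataset` here, since we want to keep both splits)
datasets.load_dataset('squad', split='train', streaming=True)

# the split-keyed object supports dictionary syntax
print(dataset.keys())   # dict_keys(['train', 'validation'])
```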

Now, at the moment we have our dataset but we don't really know anything about it. We have this train subset, and let's say I want to understand what's in there. What I can do to start is take the train set and look at, for example, the dataset size. So, how big is it?

Right, it's dataset_size, not data size; I don't know what I was doing there. Let's see what we get: it's about 90 megabytes, so reasonably big, but nothing huge, nothing crazy. We can also get a description, so let's see what the data actually is. I didn't even mention it already, but SQuAD is the Stanford Question Answering Dataset; it's generally used for training or testing Q&A models, and you can pause and read the description if you want to. Then another thing that's pretty important is the features we have inside here. We could also just print out one of the samples, but I think this is useful to know, and it also gives you the data types, which is kind of useful. So we have id, title, context, question, and answers, and all of them are strings, except answers, which is a nested feature.
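The metadata shown on screen comes from attributes along these lines (a sketch; these values are proxied from the dataset's info record, and exact attribute availability can vary a little between library versions and streaming mode):

```python
train = dataset['train']

print(train.dataset_size)   # roughly 90 MB for the SQuAD train split
print(train.description)    # "Stanford Question Answering Dataset (SQuAD) ..."
print(train.features)       # id, title, context, question, answers (a nested feature)
```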

Within answers we have a sequence, which we can view as a dictionary, with a text attribute and an answer_start attribute, so that's pretty useful to know. And to view one of our samples (we have all the features here, but let's say we just want to see what a record actually looks like): with streaming set to False we could just index into the train set, but because we have streaming set to True we can't do this.

So instead what we have to do is iterate through the dataset: for sample in the dataset, print a single sample, and then, because I don't want to print any more, I write break after that.
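With streaming, that peek looks like this:

```python
# a streaming dataset is not indexable, so iterate and stop after the first record
for sample in dataset['train']:
    print(sample)
    break
```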

So we just print one of those samples, and we see, okay, we have the id and we have the title. Each of these samples is pulled from a different Wikipedia page, and the title is the title of that page, so this one is from the University of Notre Dame Wikipedia page. We have the answers further down.

Further down we're going to ask a question, and in these answers we have the text, which is the text answer, and then we have the position: the character position where the answer starts within the context, which is what you can see here. We have the question that we're asking, and the Q&A model is going to extract the answer from our context there. Okay, so we're not going to be training a model in this video or anything like that; we're just experimenting with the datasets library.

So we don't need to worry so much about that. The first thing I want to do is have a look at how we can modify some of the features in our data. With SQuAD, when we're training a model, one of the first things we would do is take the answer start and the answer text and use them to get the answer end position as well, so let's go ahead and do that.

First I just want to have a look: for sample in the train set, I'm going to print out a few of the answers features, so sample['answers'], and I'll enumerate the loop so I can count how many times we go through it. Here I'm just viewing the data so we can see what we actually have in there, and once we've printed a few I just break: stop printing answers for us.

And then we have a few of these: we have text, we have answer_start, and we want to add answer_end. The way we do that is pretty straightforward: we just take the answer start and add the length of our text to it to get the answer end. Nothing complicated there.
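A quick sketch of that inspection loop and the answer_end arithmetic (field names follow the SQuAD schema shown above):

```python
for i, sample in enumerate(dataset['train']):
    answers = sample['answers']
    # answer_end = answer_start + length of the answer text
    end = answers['answer_start'][0] + len(answers['text'][0])
    print(answers, '->', end)
    if i == 4:   # only look at the first few records
        break
```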

So what we're going to do here is modify the answers feature, and the most common way of modifying features, or adding new ones, is to use the map method. It's going to output a new dataset.

So we write dataset train equals dataset train, and we're going to use the map method. With map we use a lambda, so we write lambda x; in here we're building a lambda function. What we need to do next is one of the things that changes depending on whether you're using streaming or not.

With streaming equal to True we need to specify every single feature in here. To show what I mean, let me write it as if streaming were False first. When streaming is False, we would just write answers and the modification to that feature: we take the current answers, so that's x['answers'], and merge it with a new dictionary item, which is going to be answer_end. So answer_end is equal to... and here we have to go into x['answers'].

This is a little bit messy, but it's just how it is. We're inside answers and we want to take the answer start position, answer_start, and (let me start a new line here) we want to add the length of the answers text. So all we're doing is taking answer_start and adding the length of the answer text to it to get our answer_end. This is all we would have to write if we were using streaming equals False, but we're not; we have streaming equals True.

With streaming we need to add every other feature in there as well. I'm not sure why this is the case, but it is, so we just add those in too. All they are is a direct mapping from the old dataset to the new one, so we don't really need to do anything there: we just map id straight across to x['id'] and do the same for the other features. We also have context, which is x['context']; answers is already done, of course; and question, which is x['question']. So that's id, context, question, answers. Is there anything else I'm missing?

Oh, title, of course, so we add title in there as well. And with that we should be ready to go, so let's map that. What we'll find is that when we're using streaming equals True, the transformation we just built is lazily loaded. We haven't actually done anything yet; all we've said is "here is the instruction for transforming the dataset", but it hasn't transformed anything. It only performs the transformation when we call the dataset, so if we iterate over it again, that calls the dataset and forces this instruction, this transformation, to run. So let's run that... and you see we actually get an error here.

And why is that? Let me come down. So what am I doing? answer_start plus the length of answers; what's wrong with that? Ah, okay: if we look up here, these items are inside a list, so we actually need to access the first item. But that's good, because it shows that when we first executed this code nothing happened, and the error only appeared when we called the dataset, because that's when the transformation is actually performed. Now, because we've already added this instruction to our dataset's transformation process, we need to reinitialize the dataset.

So we come back up here (where are you... yes, this load_dataset call, not that one) and we need to run it again to reset all of the instructions we've added. Then we can rerun the map, and now it should work.
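Put together, the streaming version of the map ends up looking something like this (a sketch of the version built above; in the library release used here a streaming map replaced each record with whatever the lambda returned, hence every field being passed through, while newer releases merge the output into the existing record instead):

```python
dataset_train = dataset['train'].map(lambda x: {
    'id': x['id'],
    'title': x['title'],
    'context': x['context'],
    'question': x['question'],
    'answers': {
        **x['answers'],
        # answer_start and text are lists, hence the [0] that caused the earlier error;
        # ideally answer_end would also be stored as a list, to match the other fields
        'answer_end': x['answers']['answer_start'][0] + len(x['answers']['text'][0]),
    },
})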

Hopefully... there we go. Now, if we have a look at this (and this is something I probably should have done but completely forgot to), I should have added answer_end as a list rather than just a number, to match the other answers fields. It's fine here, because we're just playing around with the datasets library, but if you come across this and need it, you may want to add that in. You can see that we have now added answer_end in there, which is what we wanted. Also important: if I copy this and bring it down here, we'll notice that we do still have all of our dataset; I'll just break straight away, that's fine.

I'll just break straight away. That's fine So sample sorry, yeah so you see the whole thing and We see that we still have the ID we have the text we have the context we have everything in there now I'm just going to show you you know Why this breaks?

Why this breaks or why what happens if I? remove these Okay, so let me rerun that and this as well, so Yeah, so this should look the same Do we have yet? That's fine, but then if I run this So before this had the all day all the features But now we only have the the single feature that we specified in this formula so the answers So that's why you need to when shuffle is set to true.

That's why, when streaming is set to True (why do I keep saying shuffle? streaming), you need to add every single feature in there; otherwise they're just going to be removed when you perform the map operation. But that's only the case when streaming is actually set to True. So let me bring this down here, and let me also copy our initial loading code.

We're going to need to reload our dataset now anyway, because we just removed all of the features from it. What I'm going to do now is set streaming to False, and I'm going to run the same map code, where we still don't include the id or any of the other features, and we'll see what happens.
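The non-streaming version of the same step, roughly:

```python
# reload without streaming, then map only the feature being modified
dataset = datasets.load_dataset('squad')

dataset['train'] = dataset['train'].map(lambda x: {
    'answers': {
        **x['answers'],
        'answer_end': x['answers']['answer_start'][0] + len(x['answers']['text'][0]),
    },
})
```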

We'll also notice we get a loading bar here, and it's going to take a little bit of time to process, although with SQuAD it's probably going to be pretty fast. Okay, it is taking a little bit of time, so it's going through the whole dataset: we haven't called the dataset, but we have used the map function, and when streaming is set to False the dataset isn't lazily loaded.

So the map operation is performed as soon as you call it; that's one difference in behavior. The other difference is that we've only needed to specify the answers feature here: when we have streaming set to False, we don't need to include every feature in the map operation, only the feature that we are modifying or creating.

Which, you know, is weird. I don't know why there's a behavior difference between streaming True and False, but it is there. So if I now take this again, come down here, and run that, we see that we have all of our features again. Before, when streaming was True, running this code would have left only the answers (the id, title, context, and question would all have been removed), but now, with streaming equal to False, they're still there. So it's a weird behavior, but it's how it is and we just need to deal with it. Now, the next thing I want to show you is how we can also add batching to our mapping process. For pretty much any NLP task I can think of, we're going to want to tokenize our text.

So we're going to go ahead and do that for Q&A. We would import from transformers the BertTokenizer, say, and initialize it the way we typically do: tokenizer equals BertTokenizer.from_pretrained, and let's say bert-base-uncased. Okay, I'll initialize that. Then what I want to do is tokenize the question and context together, in the format that SQuAD models usually expect for Q&A, and I want to do that using the map function; you can do this with both streaming and non-streaming, by the way. So we write dataset train, same as before, dot map, using a lambda function of x, and in here we just call the tokenizer. Usually when you write this you would build a dictionary here, but the output from the tokenizer is already in dictionary format, so we don't need to in this case. With Q&A you pass the tokenizer two text inputs: your question, and then your context. As usual we set max_length, usually 512, padding equal to max_length, and truncation as well. So a very typical tokenization process; there's nothing different going on here, it's what we normally do when we tokenize text going into a transformer model. Then we want to say batched equals True, which lets us perform this operation in batches.

There's nothing different going on here this is what we normally do when we tokenize our text going into a Transform model and then we want to say okay batched equals true So this allows us to do everything or perform this operation in batches And then we can also specify our batch size.

So batch size equals Let's say 32. So now when we run this Where is it gone? You see it? now when we run this The map function here is going to tokenize our question and context in batches of 32 So let's go ahead and do that Okay, and then you can you can see that processing there so I mean that's that's all we really need to Do with that.

So I think that's probably it for the map method and we'll well, I'll fast forward and We'll continue with I think a few of the methods I think quite useful as well Okay, so that's just finishing up now so we can go ahead and have a look at what we've actually produced so Come to here and say Dataset train.

So what do we have? Now we have we have answers like we did before but now we also have attention mask We have input IDs and we also have token type IDs We should it the three tensors that we usually output from from the tokenizer when we do that So we now have those in there as well.

We can also have a look Another thing as well. We can we can now rather than looping through our data set because we're not using a we're not using streaming It's true. We're using streaming equals false. We can now Do this? And we can see okay, we have a tangent mask and it's not going to show me everything because it's quite large So I'm just delete that but you can see that we have detention mask in there So one I want to do is Say I want to be quite pedantic and I don't like the fact that there is the Remove that That we have one feature called title Maybe I want to say okay It should be topic because it's the topic of the the context and the question If I want to be really pedantic and modify that I could say data set train rename column and To be honest you you can use it for this, of course but you're probably not going to you're probably going to use it more for when you need to rename a column to make sure it aligns to whatever the Expected inputs are for a transformer model.

For example, so That that's where you would use it, but I'm just using this example. So I'm going to rename the column title to topic And Let's print out and data set train again So down here we have title in a moment. We're going to have topic Okay, so now we have topic So just rename column.

Like I said come useful not in this case, but generally this is usually useful now What I may want to do as well is remove certain Records from this data set. So so far we've been Printing out the here we have this which is now topic. We have University of Notre Dame Maybe for whatever reason we don't want to include those those topics so we can say Very similar to before we write dates that train equals dataset train again This I'm going to filter so we're going to filter out records I don't want and again, it's very similar to the syntax you use for the map function, which is the lambda and in here, we just need to specify the condition for the samples that we do want to include or we do want to keep and In this case, we want to say okay, wherever the topic is not equal to University of Notre Dame Okay, so we'll run this and we'll have a look at what what we produce so they set to train So somehow like we have number of rows here, which is just over most 88,000 And we should get a lower number now now this will also go through so this Remember we have shuffle set to shuffle.

Why I keep calling it shuffle we have streaming set to false this time So it's going to run through the whole data set and then perform this filtering operation Now whilst I'm waiting for that Now I'll just fast forward again to to where this finishes in a moment Okay.

So now it's finished, and where before we had about 88,000 rows, now we have about 87,300. Let me take the topic column of the train set and look at, say, the first five of those. Okay, they're all Beyoncé now, rather than University of Notre Dame as before, so we have those. What we may want to do now is say, for example, we're performing inference for Q&A with a transformer model; we don't really need all of the features that we have here.

We only need the attention_mask, the input_ids, and the token_type_ids, so what we can do now is remove some of those columns. We do dataset train again, and we want to remove those columns, so remove_columns, and we'll remove all of them other than the ones we want: answers, context, id, question, and topic. Then let's have a look at what we have left. Okay, and that's it: we have those final features, and these are the ones we would feed into a transformer model for training.
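And the column removal, roughly:

```python
# drop everything except the tensors a transformer forward pass needs
dataset['train'] = dataset['train'].remove_columns(
    ['answers', 'context', 'id', 'question', 'topic']
)
print(dataset['train'].column_names)   # ['input_ids', 'token_type_ids', 'attention_mask']
```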

There's nothing else I really want to cover. I think that's pretty much all you need to know about Hugging Face Datasets to get started, to build what I think are pretty good input pipelines, and to use some of the datasets that are available. So we'll leave it there.

Thank you very much for watching, and I will see you again in the next one. Bye!