Build NLP Pipelines with HuggingFace Datasets
Chapters
0:00 Intro
0:28 Importing Datasets
4:13 Loading Datasets
6:05 Selecting Datasets
7:15 Writing Datasets
8:42 Dataset Features
9:25 Dataset Example
11:14 Modifying Dataset Features
16:49 Troubleshooting
23:09 Batching
23:44 Tokenization
29:49 Filtering
00:00:00.000 |
Welcome to this video. We're going to have a look at Hugging Face's Datasets library. We're going to have a look at 00:00:06.980 |
some of what I think are the most useful datasets, and 00:00:10.980 |
we're going to look at how we can use the library to build 00:00:15.260 |
what I think are very good data input pipelines for NLP. So let's get started. 00:00:33.520 |
First we need to install Datasets, so we'll go pip install datasets, and that will install the library for you. 00:00:39.560 |
After this we'll want to go ahead and import datasets. 00:00:44.640 |
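As a quick sketch, the install-and-import step looks like this:

```python
# install the library first (run in your shell):
#   pip install datasets

import datasets
```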
Then we can start having a look at which datasets are available to us. 00:00:52.140 |
There are two ways that you can have a look at all of the datasets. The first one is 00:00:58.860 |
using the Hugging Face datasets viewer, which you can find on Google — just type in "datasets viewer" and 00:01:07.300 |
you'll find an app which allows you to go through and have a look at the different datasets. 00:01:12.640 |
Now, I'm not going to cover that here — I've already spoken about it a lot before, and it's super easy to use. 00:01:20.540 |
We're just going to have a look at how we can view everything in Python, which is the second option. 00:01:26.060 |
So first we can do this: we just list all of our datasets. 00:01:38.980 |
From this we will get, I think, something like 00:01:42.920 |
1,400 datasets, so it's quite a lot. If we take the len, 00:02:01.420 |
we get about 1,400, which is obviously a lot, and some of these are massive as well. 00:02:08.900 |
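A minimal sketch of that listing step (the exact count will differ depending on when you run it):

```python
import datasets

# list every dataset name available on the Hugging Face Hub
all_datasets = datasets.list_datasets()

print(len(all_datasets))   # ~1,400 at the time of recording
print(all_datasets[:5])    # peek at the first few names
```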
For example, if we were to look at the OSCAR dataset — 00:02:32.580 |
these are just dataset names, okay — we have 00:02:43.980 |
one that I imagine is probably Portuguese, and then we have all these other ones as well. But these are just 00:02:54.100 |
user-uploaded OSCAR datasets; this is the actual OSCAR dataset that's hosted by Hugging Face, and it's huge. 00:03:01.020 |
It contains, I think, more than a hundred and sixty 00:03:04.340 |
languages, and for some of them — English, for example, is one of the biggest — 00:03:15.060 |
there's a lot of data in there, but it's just unstructured 00:03:19.700 |
text. What I want to have a look at is the SQuAD datasets, 00:03:25.720 |
and we're just going to use the original SQuAD in our examples. 00:03:35.340 |
You can see that we have a few different ones here: Italian, Spanish, 00:03:40.700 |
Korean, you have the Thai QA SQuAD here, and then also French at the bottom. 00:03:49.740 |
Now, obviously you kind of need to know which dataset you're looking for. I know I'm looking for a SQuAD dataset, 00:03:55.500 |
so I've searched for "squad". There are other ones as well — actually, if I change this to lowercase, 00:04:04.180 |
okay, we have this one here and this one, although this one doesn't seem to work. 00:04:12.420 |
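That search can be done with a simple comprehension over the names returned earlier — a rough sketch:

```python
# find every dataset whose name mentions SQuAD (case-insensitive)
squad_datasets = [name for name in all_datasets if "squad" in name.lower()]
print(squad_datasets)
```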
Now, to load one of those datasets — obviously we're going to be using SQuAD — we write 00:04:27.320 |
load_dataset, and the name is 'squad'. Now, there are two ways to 00:04:33.260 |
download your data. If we do this, this is the default method: 00:04:39.420 |
we are going to download and cache the whole dataset locally, which for SQuAD is fine — 00:04:44.940 |
SQuAD is not a huge dataset, so it's not really a problem. 00:04:48.540 |
But when you think, okay, we wanted the English OSCAR dataset — 00:04:52.980 |
that's massive, that's 1.2 terabytes — in those cases you probably don't want to download it all onto your machine. 00:05:05.340 |
So what you can do instead is set streaming equal to True, and 00:05:10.180 |
when streaming is equal to True you do need to make some 00:05:14.300 |
changes to your code, which I'll show you, and there are also some things, 00:05:20.540 |
particularly filtering, which we will cover later on, that we can't do with streaming. 00:05:26.500 |
For now we're going to use streaming, and we'll switch over to not streaming later on. 00:05:43.880 |
With streaming, when we request a specific record within that dataset, it is only going to 00:05:48.220 |
download or store that single record, or multiple records, in our memory at once. 00:05:55.700 |
So we're not downloading the whole dataset; we're just processing it as we get it, which I think is very useful. 00:06:07.800 |
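Roughly, the two loading modes look like this:

```python
from datasets import load_dataset

# default: download and cache the full dataset
dataset = load_dataset("squad")

# streaming: nothing is downloaded up front; records are pulled lazily as we iterate
dataset = load_dataset("squad", streaming=True)
```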
Our data contains actual subsets (splits), and if we want to select a specific subset, all we have to do is write 00:06:15.140 |
load_dataset again. So let me copy this. 00:06:19.340 |
We copy that, and if we just want a subset we add split and 00:06:36.900 |
execute that. I'm not going to store that in our dataset variable here, because I don't want to overwrite it with 00:06:47.880 |
just a single IterableDataset object; we're only pulling in this single part of it, or single subset. 00:06:54.940 |
We can also view the splits — here we can see we have train and validation — and if you want to 00:07:01.260 |
see it in a clearer way, 00:07:10.660 |
you can use dictionary syntax for most of this, so we have train and validation. 00:07:18.860 |
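Continuing from the load above, that looks roughly like this:

```python
from datasets import load_dataset

# pulling just the train split returns a single IterableDataset
# (shown here without overwriting the dataset variable, as in the video)
load_dataset("squad", split="train", streaming=True)

# the full load keeps both splits, accessible with dictionary syntax
print(dataset)           # IterableDatasetDict with 'train' and 'validation'
print(dataset["train"])  # just the train split
```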
At the moment we have our dataset, but we don't really know anything about it. 00:07:23.640 |
We have this train subset, and let's say I want to understand what is in there. 00:07:29.420 |
What I can do to start is write dataset train 00:07:34.060 |
and I can check, for example, the dataset size — how big is it? 00:07:46.780 |
I don't know what I was doing there. Let's see what we get — it's 00:07:57.540 |
reasonably big, but it's nothing huge, nothing crazy. 00:08:00.220 |
We can also — so we have that — we can also get the description if I copy this. 00:08:13.980 |
Let me see what the data is. So, SQuAD — I didn't even mention it already, but 00:08:25.980 |
SQuAD is the Stanford Question Answering Dataset; it's generally used for training or testing Q&A models. 00:08:35.980 |
You can pause and read that if you want to. 00:08:42.300 |
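As a sketch, the size and description metadata can be read from the split's info object (on a regular, non-streaming Dataset the same values are also exposed directly as dataset_size and description):

```python
# dataset metadata for the train split
info = dataset["train"].info
print(info.dataset_size)   # size in bytes
print(info.description)    # "Stanford Question Answering Dataset (SQuAD) ..."
```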
Another thing that is pretty important is what features we have inside here. 00:08:49.100 |
We could also just print out one of the samples, 00:08:54.020 |
but this is useful to know, I think, and it also gives you the data types, which is kind of useful. 00:08:58.660 |
So we have id, title, context, question and answers. 00:09:07.700 |
Answers is actually a sequence here — within answers, we can view it as a dictionary — 00:09:15.180 |
and we have a text attribute and also an answer_start attribute, so that's pretty useful to know, I think. 00:09:27.820 |
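Roughly what that looks like:

```python
# inspect the schema: feature names and their data types
print(dataset["train"].features)
# e.g. {'id': string, 'title': string, 'context': string, 'question': string,
#       'answers': Sequence({'text': string, 'answer_start': int32})}
```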
Now let's look at one of our samples. So yeah, we have all the features here, 00:09:33.140 |
but let's say we just want to see what a record actually looks like. We could write dataset, train, and an index — 00:09:40.420 |
when we have streaming set to False we can write this, but because we have streaming set to True 00:09:48.060 |
we can't do this. So instead, what we have to do is 00:09:52.300 |
actually just iterate through the dataset: we write for sample in the dataset, 00:10:00.300 |
and we just want to print a single sample — I don't want to print any more than that, 00:10:07.900 |
so I'm going to write break after it. So we just print one of those samples. 00:10:12.300 |
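A minimal version of that loop:

```python
# streaming datasets are iterable, not indexable, so pull one record like this
for sample in dataset["train"]:
    print(sample)
    break  # stop after the first sample
```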
Then we see, okay, we have the id and we have the title. 00:10:18.340 |
Each of these samples is being pulled from a different Wikipedia page in this case, 00:10:23.820 |
and the title is the title of that page — so this one is from the University of Notre Dame. 00:10:30.300 |
We have the answers; further down we're going to ask a question, and these answers here — 00:10:36.660 |
we have the text, which is the text answer, and then we have the position, 00:10:41.780 |
the character position where the answer starts within the context, which is what you can see here. 00:10:50.220 |
Then there's the question we're asking, which the Q&A model is going to answer from the context. 00:11:01.940 |
We're not going to be training a model in this video or anything like that — 00:11:06.140 |
we're just experimenting with the Datasets library — so we don't need to worry so much about that. 00:11:13.660 |
So the first thing I want to do is have a look at how we can modify some of the features in our data. When 00:11:25.220 |
training a model, one of the first things we would do is take our answer_start and the text, and 00:11:32.260 |
we would use those to get the answer end position as well. 00:11:37.180 |
So let's go ahead and do that. First I just want to have a look at the 00:11:47.540 |
train split: I'm going to print out a few of the answer features. So we have sample, 00:11:53.760 |
then answer — or answers, sorry — and I just want to print that. 00:12:04.820 |
I want to enumerate this so I can count how many times we're going through it — 00:12:12.220 |
we're just viewing the data so we can actually see what we have in there — 00:12:23.380 |
and then break, to stop printing answers for us. And then we have a few of these: we have text, 00:12:30.420 |
we have answer_start, and we want to add an answer_end. The way that we do that is pretty straightforward: 00:12:35.660 |
we just take the answer_start and add the length of our text to that to get the answer end. 00:12:41.940 |
Nothing complicated there. So what we're going to do here is modify the answers feature, and 00:12:51.300 |
the best way — or at least the most common way — of 00:12:55.220 |
modifying features, or adding new features as well, is to use the map method. 00:13:06.420 |
The map method outputs a new dataset, so we write dataset train equals dataset train dot map, 00:13:29.460 |
and in here we're building a lambda function. 00:13:33.860 |
What we need to do — and this is one of the things that changes depending on whether you're using streaming 00:13:39.780 |
or not — is that with streaming equals True, in here we need to specify every feature; 00:13:56.000 |
when streaming is False, we would just write answers and 00:14:03.580 |
the modification to that feature. In this case we are taking the current answers dictionary, 00:14:15.460 |
and we would be merging that with a new dictionary item, which is going to be answer_end. 00:14:39.700 |
Here what we have to do is go x, answers — it's a little bit messy, but 00:14:44.740 |
it's just how it is — so we're within answers, and we want to take the answer start and the text. 00:15:13.900 |
So all we're doing there is taking answer_start and adding the 00:15:20.380 |
length of the answer text to that, to get our answer end. Now, 00:15:25.540 |
this is all we would have to write if we were using streaming equals False, but we're not: 00:15:33.220 |
streaming equals True, so we need to add every other feature in there as well. I'm not sure why it is, 00:15:42.180 |
but it is, so we need to just add those in as well. 00:15:46.900 |
All they are is a direct mapping from the old version to the new dataset: 00:15:55.780 |
we just need to add id, which maps straight to x id, and do the same for the other features as well — 00:16:07.620 |
we have answers already done, of course, and question, which is going to be x question. 00:16:37.380 |
Okay, and with that we should be ready to go, so let's map that. 00:16:49.420 |
What we'll find is that when we're using the streaming keyword equals True, 00:16:58.220 |
the map, or the transformation that we just built, is lazily evaluated. 00:17:01.780 |
So we haven't actually done anything yet; all we've done is pass this instruction to transform 00:17:07.700 |
the dataset in this way, but it hasn't actually transformed anything yet. 00:17:13.020 |
It only performs this transformation when we call the dataset — 00:17:21.660 |
iterating over it, for example, would call the dataset and force the code to run this transformation. 00:17:33.020 |
And you see we actually do get an error here. Why is that? Let me come down — 00:17:49.580 |
we're taking the length of answers; what's wrong with that? Ah — 00:17:58.940 |
text and answer_start are lists, so we actually need to access the items within the list. 00:18:09.540 |
When we first executed this code nothing happened, and it only actually came across the error 00:18:16.420 |
when we called the dataset, because that's when this transformation is actually performed. 00:18:21.900 |
Now, because we've already added this instruction to our dataset's 00:18:27.940 |
transformation, or building, process, we actually need to reinitialize our dataset. So we'll come back up here — 00:18:36.620 |
where are you... yes, load_dataset, not that one, this one — and we need to load that again to 00:18:46.800 |
reinitialize all of the instructions that we've added in there. 00:18:50.120 |
Then we can go ahead, rerun this, and now it should work, hopefully. 00:18:57.200 |
There we go. Now, if we have a look at this — and this is something I probably should have done but 00:19:03.000 |
completely forgot to — I should have added this as maybe a list rather than just a 00:19:08.920 |
number. But it's fine; if you come across this and need to do it properly, you may want to add that in. 00:19:16.320 |
We're not doing anything more than playing around with the Datasets 00:19:21.180 |
library here, so it's not really a problem, but you can see that we have added answer_end in there now. 00:19:38.160 |
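For reference, the whole streaming map call ends up looking roughly like this — the list indexing ([0]) is the fix just described, and every feature is passed through because streaming is set to True:

```python
# streaming=True: every feature must be returned by the lambda, or it gets dropped
dataset["train"] = dataset["train"].map(lambda x: {
    "id": x["id"],
    "title": x["title"],
    "context": x["context"],
    "question": x["question"],
    "answers": {
        **x["answers"],  # keep the existing text and answer_start lists
        # answer_end = answer_start + length of the answer text (first answer only;
        # a list of ends, one per answer, would be the more complete version)
        "answer_end": x["answers"]["answer_start"][0] + len(x["answers"]["text"][0]),
    },
})
```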
We'll also notice that we do still have all of our dataset's features. So if I 00:19:43.640 |
go here — I don't really need to remove that, it's fine, I'll just break straight away — 00:19:58.640 |
we see that we still have the id, we have the text, we have the context; we have everything in there. 00:20:27.000 |
That's fine, but then if I run this version — 00:20:31.140 |
so before, this had all of the features, 00:20:34.520 |
but now we only have the single feature that we specified in this formula, the answers. 00:20:40.200 |
So that's why, when streaming is set to True, you need to 00:20:45.920 |
add every single feature in there; otherwise it's just going to remove them when you perform the map operation. But that's only the case 00:20:53.600 |
when shuffle — why do I keep saying shuffle? — when streaming is set to True. 00:21:10.800 |
We're going to need to reload our dataset now anyway, because we just removed all the features from it. 00:21:24.040 |
What I'm going to do now is set streaming back to False, and I'm going to run this same code, where we 00:21:31.080 |
still don't have our ids or anything like that in there, and 00:21:35.440 |
we'll see what happens. We'll also notice we get a loading bar here, and 00:21:39.580 |
it's going to take a little bit of time to process this — although with this dataset it's probably going to be pretty fast. 00:21:48.400 |
Okay, you see it's taking a little bit of time, so now it's going through the whole dataset. 00:21:54.040 |
We haven't called the dataset, but we have used this map function, and when streaming is set to False 00:22:02.440 |
the dataset isn't lazily loaded, so the map operation is performed as soon as you call it. 00:22:09.140 |
So that's a slightly different behaviour, and the other behaviour which is different is the fact that 00:22:15.080 |
we've only needed to specify the answers feature here. 00:22:18.840 |
When we have streaming set to False, we don't need to include every feature within the map operation; 00:22:25.880 |
we only need to include the feature that we are modifying or creating. 00:22:32.600 |
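So with streaming off, the same transformation can be written as just the answers update — a rough sketch:

```python
from datasets import load_dataset

# reload without streaming, then map only the feature being modified;
# the remaining columns are carried over automatically when streaming=False
dataset = load_dataset("squad")

dataset["train"] = dataset["train"].map(lambda x: {
    "answers": {
        **x["answers"],
        "answer_end": x["answers"]["answer_start"][0] + len(x["answers"]["text"][0]),
    },
})
```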
You know, it's weird — I don't know why there's a behaviour difference when streaming is True or False, but there is. 00:22:45.760 |
If we run that, we see that we now have all of our features again, 00:22:54.840 |
whereas if I had run this code with streaming it would have only included our answers, not the id, title, context and question. 00:23:11.880 |
So it's a weird behaviour, but it's 00:23:16.960 |
how it is, and we obviously just need to deal with it. 00:23:20.980 |
Now the next thing I want to show you is how we can tokenize our text. 00:23:32.720 |
For pretty much any NLP task I can think of, we're going to want to tokenize our data, 00:23:43.680 |
so we're going to go ahead and do that for Q&A. 00:23:46.640 |
We would import transformers — or, from transformers, import a BERT tokenizer, let's say — 00:23:56.040 |
and initialize that. This is what we typically do: tokenizer equals BertTokenizer from_pretrained. 00:24:21.840 |
Then what I want to do is tokenize my 00:24:25.880 |
question and context in the format that SQuAD, or Q&A in general, would usually expect, 00:24:36.000 |
and I want to do that using the map function — you can do this in both streaming and non-streaming mode. 00:24:48.960 |
It's the same as before: dataset train equals dataset train dot map, with a lambda inside. 00:25:10.400 |
Usually when you write this you would include a dictionary here, but 00:25:16.400 |
the output from the tokenizer is already in dictionary format, 00:25:21.100 |
so I don't need to do it in this case — 00:25:23.920 |
basically, what we have here is still a dictionary. 00:25:31.360 |
For Q&A, in your tokenizer you pass two text inputs: you pass your question and your context, 00:25:53.320 |
and we would set padding equal to the max length. 00:26:01.360 |
Okay, so it's a very typical tokenization process; there's nothing different going on here. 00:26:07.920 |
This is what we normally do when we tokenize text for a 00:26:14.080 |
transformer model. Then we want to say batched equals True — 00:26:19.000 |
this allows us to perform this operation in batches — 00:26:23.000 |
and then we can also specify our batch size, so batch size equals 32. 00:26:36.520 |
The map function here is going to tokenize our question and context in batches of 32. 00:26:45.680 |
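A hedged sketch of that tokenization step — the checkpoint name and max length here are assumptions, not taken from the video:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

dataset["train"] = dataset["train"].map(
    lambda x: tokenizer(
        x["question"], x["context"],           # Q&A convention: question + context pair
        max_length=512, padding="max_length",  # pad everything to one fixed length
        truncation=True,
    ),
    batched=True,    # pass lists of samples to the lambda
    batch_size=32,   # tokenize 32 question/context pairs at a time
)
```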
Okay, and you can see that processing there. That's really all we need to 00:26:53.400 |
do with that, so I think that's probably it for tokenization. 00:27:02.480 |
We'll continue with a few of the other methods that I think are quite useful as well, 00:27:13.440 |
so we can go ahead and have a look at what we've actually produced. 00:27:26.440 |
Now we have answers like we did before, but we also have attention_mask, 00:27:31.360 |
we have input_ids, and we also have token_type_ids — 00:27:35.760 |
the three tensors that we usually get out of the tokenizer. 00:27:42.240 |
So we now have those in there as well. We can also have a look at 00:27:45.120 |
another thing: rather than looping through our dataset — because we're not using streaming 00:27:53.180 |
equals True, we're using streaming equals False — we can now index it directly, 00:27:59.400 |
and we can see, okay, we have the attention mask. It's not going to show me everything because it's quite large, 00:28:04.960 |
so I'll just delete that, but you can see that we have the attention mask in there. 00:28:13.440 |
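Something like this, now that indexing works:

```python
# with streaming=False the dataset supports direct indexing
first = dataset["train"][0]
print(first.keys())
print(first["attention_mask"][:10])  # just a slice; the full mask is padded to max length
```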
Say I want to be quite pedantic and I don't like the fact that there is a 'title' feature — 00:28:28.840 |
I think it should be 'topic', because it's the topic of the context and the question. 00:28:33.360 |
If I want to be really pedantic and modify that, I could say dataset train, rename_column. 00:28:42.140 |
To be honest, you can use it for this, of course, 00:28:45.360 |
but you're probably going to use it more when you need to rename a column to match the 00:28:54.580 |
expected inputs for a transformer model, for example. 00:28:58.960 |
That's where you would usually use it, but I'm just using this example, so I'm going to rename the column 'title'. 00:29:13.900 |
Down here we have title; in a moment we're going to have topic. 00:29:22.700 |
So that's rename_column. Like I said, it's not that useful in this case, but generally it can be quite useful. 00:29:33.700 |
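The call itself is a one-liner:

```python
# rename the 'title' column to 'topic' (returns a new dataset, so reassign it)
dataset["train"] = dataset["train"].rename_column("title", "topic")
```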
What I may want to do as well is remove certain 00:29:38.220 |
records from this dataset. So far we've been 00:29:41.740 |
printing out this field, which is now topic, and we have University of Notre Dame. 00:29:49.640 |
Maybe, for whatever reason, we don't want to include those records. 00:29:56.460 |
Very similar to before, we write dataset train, 00:30:05.860 |
and this time I'm going to use filter, so we're going to filter out the records 00:30:09.900 |
I don't want. Again, it's very similar to the syntax you use for the map function, which is a lambda, and 00:30:17.980 |
in here we just need to specify the condition for the samples that we do want to keep. 00:30:25.820 |
In this case, we want to say: keep the record wherever the topic is not University of Notre Dame. 00:30:36.780 |
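Roughly (the exact title string is an assumption about how SQuAD formats it):

```python
# keep only the records whose topic is not the Notre Dame article
dataset["train"] = dataset["train"].filter(
    lambda x: x["topic"] != "University_of_Notre_Dame"
)
```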
Okay, so we'll run this and have a look at what we produce — dataset train. 00:30:48.120 |
At the moment we have the number of rows here, which is just over 87,000, 00:30:55.860 |
and we should get a lower number now. This will also go through the whole thing — 00:31:01.220 |
remember we have streaming — why do I keep calling it shuffle? — we have streaming set to False, 00:31:11.820 |
so it's going to run through the whole dataset and then perform this filtering operation. 00:31:19.660 |
I'll just fast forward to where this finishes in a moment. 00:31:26.260 |
Okay, so now that's finished. Before, the first records all had University of Notre Dame as the 00:31:43.900 |
topic; I want to see, let's say, the first five of those now. 00:31:47.940 |
Okay, now they're all Beyoncé, rather than before, where it was the University of Notre Dame. 00:32:02.520 |
Say, for example, we're performing Q&A inference with a transformer model. 00:32:10.380 |
We don't really need all of the features that we have here — 00:32:15.840 |
we would only need the attention mask, the input IDs and the token type IDs. 00:32:23.940 |
So what we can do now is remove some of those columns. 00:32:36.080 |
We want to remove those columns, so we use remove_columns, 00:32:47.880 |
and we'll just remove all of them other than the ones that we want. 00:33:03.800 |
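A sketch of that clean-up step:

```python
# drop everything except the tensors the transformer model expects
keep = ["attention_mask", "input_ids", "token_type_ids"]
to_remove = [col for col in dataset["train"].column_names if col not in keep]
dataset["train"] = dataset["train"].remove_columns(to_remove)
```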
Okay, and then let's have a look at what we have left. 00:33:11.400 |
Okay, and that's it — we have those final features, and these are the ones that we would input into a 00:33:20.920 |
transformer model for training. There's nothing else I really want to cover; 00:33:26.080 |
I think that is pretty much all you need to know about 00:33:29.160 |
Hugging Face's Datasets to get started and start building what I think are pretty good input pipelines with 00:33:39.000 |
the datasets that are available. So we'll leave it there. 00:33:43.260 |
Thank you very much for watching, and I will see you again in the next one. Bye.