
Build NLP Pipelines with HuggingFace Datasets


Chapters

0:00 Intro
0:28 Importing Datasets
4:13 Loading Datasets
6:05 Selecting Datasets
7:15 Writing Datasets
8:42 Dataset Features
9:25 Dataset Example
11:14 Modifying Dataset Features
16:49 Troubleshooting
23:09 Batching
23:44 Tokenization
29:49 Filtering

Whisper Transcript

00:00:00.000 | Welcome to this video. We're going to have a look at Hugging Face's Datasets library. We're going to have a look at
00:00:06.980 | some of what I think are the most useful datasets and
00:00:10.980 | We're going to look at how we can use the library to build
00:00:15.260 | What I think are very good pipelines or data input pipelines for NLP. So let's get started
00:00:28.180 | So the first thing we want to do is actually
00:00:30.700 | well install
00:00:33.520 | Datasets, so we'll go pip install datasets and that will install the library for you
00:00:39.560 | after this we'll want to go ahead and import datasets and
00:00:44.640 | Then we can start having a look at which datasets are available to us now
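A minimal sketch of that setup, assuming a recent version of the library:

    # installed from the command line, then imported in Python
    # pip install datasets
    import datasets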
00:00:52.140 | There's two ways that you can have a look at all of the datasets. The first one is
00:00:58.860 | using the Hugging Face datasets viewer, which you can find on Google, just type in datasets viewer, and
00:01:04.940 | It's just an interactive
00:01:07.300 | App which allows you to go through and have a look at the different data sets
00:01:12.640 | Now I'm not going to I've already spoken about that a lot before and it's super easy to use
00:01:18.460 | So we're not going to go through it instead
00:01:20.540 | We're just going to have a look at how we can view everything in Python, which is the second option
00:01:26.060 | So first we can do this, so we just list all of our datasets
00:01:31.180 | Now I'm going to just write
00:01:33.620 | ds_list here
00:01:38.980 | From this we will just get I think it's something like
00:01:42.920 | 1,400 datasets now, so it's quite a lot, so if we go len
00:01:49.320 | Yes, it's ds, or ds_list
00:01:56.820 | Yes, it's
00:02:01.420 | 1,400 datasets, which is obviously a lot, and some of these are massive as well
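Something like the following reproduces that step; ds_list is just the variable name used in the video, and list_datasets has since been deprecated in favour of the huggingface_hub client, but in the version shown it works like this:

    import datasets

    # list every dataset name available on the Hugging Face hub
    ds_list = datasets.list_datasets()
    print(len(ds_list))  # roughly 1,400 at the time of recording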
00:02:06.340 | so if we
00:02:08.900 | For example if we were to look at the Oscar data set
00:02:12.780 | so in
00:02:15.500 | DS list we could go
00:02:19.300 | So, dataset for dataset in ds_list
00:02:24.980 | if oscar
00:02:29.260 | is in the dataset
00:02:32.580 | So these are just dataset names. Okay, and we have
00:02:36.140 | So we have Oscar. I think PT is
00:02:39.540 | What is PT?
00:02:43.980 | Right, I imagine it's probably Portuguese, and then we have all these other ones as well, but these are just
00:02:51.580 | user
00:02:54.100 | uploaded Oscar datasets. This is the actual Oscar dataset that's been put together by Hugging Face, and it's huge
00:03:01.020 | It contains, I think, more than a hundred and sixty
00:03:04.340 | languages, and some of them, for example English, also English is one of the biggest ones,
00:03:11.540 | that contains 1.2 terabytes of data, so
00:03:15.060 | There's a lot of data in there, but that's just unstructured
00:03:19.700 | text. What I want to have a look at is the SQuAD datasets
00:03:25.720 | so we're just going to use the original SQuAD
00:03:32.220 | in this video, but
00:03:35.340 | You can see that we have a few different ones here. So Italian, Spanish,
00:03:40.700 | Korean, you have the Thai QA SQuAD here and then also French as well at the bottom. So
00:03:46.900 | You have plenty of choice
00:03:49.740 | Now obviously you kind of need to know what dataset you're looking for. I know I'm looking for a SQuAD dataset
00:03:55.500 | So I've searched for squad. There are other ones as well. Actually, if I change this to lowercase
00:04:00.880 | We'll see those also pop up
00:04:04.180 | Okay, so we have like this one here, and this one, this one doesn't seem to work
00:04:11.020 | It's fine
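That search is just a list comprehension over the names, roughly like this sketch:

    # find the Oscar datasets, including the user-uploaded variants
    print([ds for ds in ds_list if "oscar" in ds.lower()])

    # the same search for SQuAD datasets
    print([ds for ds in ds_list if "squad" in ds.lower()])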
00:04:12.420 | Now, to load one of those datasets, obviously we're gonna be using SQuAD, we write
00:04:18.220 | dataset
00:04:20.780 | equals datasets dot load_dataset and
00:04:24.180 | Then in here, we just write
00:04:27.320 | the dataset name, so squad. Now, there's two ways to
00:04:33.260 | Two ways to download your data. So if we do this, this is the default method
00:04:39.420 | We are going to download and cache the whole dataset in memory, which for SQuAD is fine
00:04:44.940 | I think SQuAD, it's not a huge dataset, so it's not really a problem
00:04:48.540 | But when you think, okay, we wanted the English Oscar dataset
00:04:52.980 | That's massive. That's 1.2 terabytes. So in those cases, you probably don't want to download it all onto your
00:05:01.640 | onto your machine
00:05:05.340 | so what you can do instead is you set streaming equal to true and
00:05:10.180 | When streaming is equal to true you do need to make some
00:05:14.300 | changes to your code, which I'll show you, and there are also some things,
00:05:20.540 | particularly filtering, which we will cover later on, which we can't do with streaming, but
00:05:26.500 | we will just go ahead, and for now we're going to use streaming; we'll switch over to not streaming later on
00:05:34.060 | and this creates
00:05:36.060 | like an iterable dataset object, and
00:05:39.700 | it means that whenever we are calling a
00:05:43.880 | specific record within that dataset, it is only going to
00:05:48.220 | download or store that single record, or multiple records, in our memory at once
00:05:55.700 | So we're not downloading the full dataset, we're just processing it as we get it, which is I think very useful
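As a sketch, the two loading modes look like this:

    import datasets

    # default: download and cache the full dataset locally
    dataset = datasets.load_dataset("squad")

    # streaming: records are pulled lazily as they are needed
    dataset = datasets.load_dataset("squad", streaming=True)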
00:06:05.420 | You can see here, we have two
00:06:07.800 | actual subsets within our data. If we want to select a specific subset, all we have to do is rewrite
00:06:15.140 | load_dataset again. So let me actually copy this
00:06:19.340 | So we copy that and if we just want a subset we write split and
00:06:31.940 | in this case, it would be train or
00:06:33.940 | validation and if I just
00:06:36.900 | call, execute that, so I'm not going to store that in our dataset variable here because I don't want to use
00:06:43.420 | just train
00:06:45.880 | We have this single
00:06:47.880 | iterable dataset object. So we're just pulling in this single part of it or single subset
00:06:54.940 | and we can also view so here we can see we have train and validation if you want to
00:07:01.260 | see it in a clearer way, you can use dictionary
00:07:07.140 | syntax, so, sorry, dataset.keys()
00:07:10.660 | You can use dictionary syntax for most of this so we have train and validation
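Roughly, both of those look like this; the train_split name here is just illustrative:

    # pull a single split; with streaming=True this is an iterable dataset
    train_split = datasets.load_dataset("squad", split="train", streaming=True)

    # or check which splits the full dataset object holds
    print(dataset.keys())  # dict_keys(['train', 'validation'])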
00:07:16.180 | now there's also
00:07:18.860 | So at the moment we have our dataset, but we don't really know anything about it
00:07:23.640 | So we have this train subset and let's say I want to, you know, understand what is in there
00:07:29.420 | So what I can do to start is I write dataset train
00:07:34.060 | And I can write, for example, the dataset size. So how big is it?
00:07:38.960 | Right, it's dataset size
00:07:42.660 | dataset size, not data size
00:07:46.780 | Don't know what I was doing there. Let's see what we get, so it's like
00:07:51.580 | so about 90 megabytes there, so
00:07:57.540 | reasonably big, but it's not anything huge, nothing crazy
00:08:00.220 | We can also, so we have that, we can also get, if I copy this
00:08:09.200 | You can also get a description
00:08:13.980 | Let me see what the dataset is. So SQuAD, I didn't even mention it already, but
00:08:25.980 | SQuAD is the Stanford Question Answering Dataset; it's used generally for training Q&A models or testing Q&A models
00:08:35.980 | You can pause and read that if you want to
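A rough sketch of pulling those two pieces of metadata; depending on the library version (and whether you're streaming) they may sit directly on the split or under its .info, and dataset_size can be None for some datasets:

    train = dataset["train"]

    print(train.info.dataset_size)   # size in bytes, roughly 90 MB for SQuAD train
    print(train.info.description)    # the Stanford Question Answering Dataset summary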
00:08:42.300 | Then another thing that is pretty important is what are the features that we have inside here now
00:08:49.100 | We can also just print out one of the samples
00:08:54.020 | But it's useful to know, I think, and this also gives you data types, which is kind of useful
00:08:58.660 | So we have ID title context question and answers
00:09:02.560 | all of them are
00:09:05.500 | strings
00:09:07.700 | Answers is actually, so within answers we have, say, a sequence here. We can view it as a dictionary
00:09:15.180 | But we have a text attribute and also an answer start attribute, so that's pretty useful to know I think
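Inspecting them is one attribute access, roughly:

    # column names and types; answers is a nested sequence of text and answer_start
    print(dataset["train"].features)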
00:09:25.580 | to view one of our
00:09:27.820 | samples. So yeah, we have all the features here
00:09:33.140 | But let's say we just want to see what it actually looks like we can write data set and we go train and
00:09:40.420 | When we have streaming set to false, we can write this, but because we have streaming set to true
00:09:48.060 | We can't do this. So instead what we have to do is
00:09:52.300 | we actually just iterate through the dataset. So we just go for sample in dataset
00:10:00.300 | And we just want to print a single sample and then I don't want to print any more
00:10:07.900 | So I'm gonna write break after that. So we just print one of those samples
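In other words, a quick sketch like this:

    # streaming datasets can't be indexed, so iterate and stop after the first record
    for sample in dataset["train"]:
        print(sample)
        break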
00:10:12.300 | And then we see okay. We have the ID we have title so
00:10:18.340 | Each of these samples is being pulled from a different Wikipedia page in this case
00:10:23.820 | The title is the title of that page. So this one is from the University of Notre Dame
00:10:27.780 | Wikipedia page
00:10:30.300 | We have answers, so further down we're going to ask a question, and this answers here
00:10:36.660 | So we have the text which is the text answer and then we have the position
00:10:41.780 | So the character position where the answer starts within the context, which is what you can see here
00:10:48.500 | We have a question
00:10:50.220 | here, which we're asking, and then the Q&A model is going to
00:10:54.520 | extract the answer from
00:10:57.180 | from our context there
00:11:01.940 | So we're not going to be training a model in this video or anything like that
00:11:06.140 | We're just experimenting with the datasets library. We don't need to worry so much about that
00:11:13.660 | So the first thing I want to do is have a look at how we can modify some of the features in our data
00:11:20.340 | so with
00:11:22.260 | squad
00:11:23.340 | when we are
00:11:25.220 | training a model one of the first things we would do is we take our answer start and the text and
00:11:32.260 | We would use that to get the answer end position as well
00:11:37.180 | So let's go ahead and do that. So I first I want to just have a look
00:11:42.620 | Okay for sample in
00:11:44.780 | the data set
00:11:47.540 | Train, I'm just going to print out a few of the answer features. So we have sample
00:11:53.760 | Answer or answers, sorry, and I just want to print that
00:12:00.660 | So print it and I want to say, okay
00:12:04.820 | I want to enumerate this so I can count how many times we're going through it
00:12:09.780 | so here I'm just
00:12:12.220 | Viewing the data so we can actually see what we have in there
00:12:15.540 | So I want to say
00:12:21.380 | if i is bigger than four
00:12:23.380 | Just break, just stop printing answers for us. And then we have a few of these, so we have text
00:12:30.420 | We have answer start; we want to add an answer end, and the way that we do that is pretty straightforward
00:12:35.660 | We just need to take the answer start and we add the length of our text to that to get the answer end
00:12:41.940 | Nothing, nothing complicated there. So what we're going to do here is modify the
00:12:48.740 | answers feature and
00:12:51.300 | the best way, or I think at least the most common way, of
00:12:55.220 | Modifying features or adding new features as well is to use the map method. So we go
00:13:04.420 | dataset, so it's going to
00:13:06.420 | output a new dataset. So we write dataset train
00:13:11.720 | equals dataset train
00:13:17.860 | we're going to use the map method and
00:13:19.940 | With map we use lambda so we write
00:13:25.980 | Lambda X
00:13:29.460 | so in here, we're building a lambda function and
00:13:33.860 | What we need to do so this is one of the things that changes depending on whether you're using streaming
00:13:39.780 | Or not. So with streaming equals true in here. We need to specify
00:13:44.740 | every single feature so
00:13:48.260 | what I mean by that is
00:13:51.220 | Let me do it for stream faults initially
00:13:56.000 | So when streaming is false, we will just write answers
00:13:59.820 | And we would write
00:14:03.580 | The modification to that feature. So in this case, we are taking the current answers, so it would be
00:14:11.700 | X answers and
00:14:15.460 | We would be merging that with a new dictionary item which is going to be
00:14:23.420 | answers end, so
00:14:26.820 | answer end, not answer start, so
00:14:32.340 | answer end is
00:14:34.340 | equal to
00:14:39.700 | Here what we have to do is we go x answers. So this is a little bit messy, you know
00:14:44.740 | It's just how it is. So we're within answers and we want to take the
00:14:50.420 | answer start position
00:14:53.020 | so answer
00:14:54.860 | start
00:14:58.980 | We want to add
00:15:00.980 | Let me start a new line here
00:15:02.980 | And we want to add
00:15:05.940 | the length of
00:15:08.380 | Answers
00:15:13.900 | Okay, so all we're doing there is we're taking answer start and we're adding answer
00:15:20.380 | Text or the length of answer text to that to get our answer end now
00:15:25.540 | This is all we would have to write if we were using streaming equals false, but we're not
00:15:33.220 | we're using streaming equals true. We need to add every other feature in there as well. I'm not sure why it is
00:15:39.500 | why this is the case
00:15:42.180 | But it is so we need to just add those in as well
00:15:46.900 | So all they are is a direct mapping from the old version to the new data set
00:15:53.860 | So we don't need to really do anything there
00:15:55.780 | We just need to add id, which just maps back to id, and do that for the other features as well
00:16:01.340 | So we have also have context
00:16:04.300 | Which is X context
00:16:07.620 | We have answers already done, of course. Question, which is going to be x question
00:16:17.460 | So ID context question answers
00:16:23.420 | Is there anything else I'm missing?
00:16:25.420 | ID Oh title, of course title
00:16:30.540 | Just title
00:16:33.940 | Yeah, so also add title in there as well
00:16:37.380 | Okay, and with that we should be ready to go, so let's map that, and
00:16:49.420 | What we'll find is, when we're using the streaming keyword equals true,
00:16:56.220 | the actual process is
00:16:58.220 | Or the transformation that we just built is lazily loaded
00:17:01.780 | So we haven't actually done anything yet; all we've said is we've passed this instruction to transform
00:17:07.700 | The data set in this way, but it hasn't actually transformed anything yet
00:17:13.020 | It only performs this transformation when we call the data set
00:17:17.300 | so if we
00:17:21.660 | do this again
00:17:21.660 | This would call the data set and it would force the code to run this
00:17:26.260 | instruction or this transformation
00:17:29.100 | So, let's run that
00:17:33.020 | And you see we actually do get an error here. And why is that? So let me come down
00:17:39.740 | We have
00:17:44.500 | So what am I doing? I'm
00:17:49.580 | answer start plus
00:17:49.580 | the length of answers what's wrong with that? Ah
00:17:52.700 | Okay, so if we look up here
00:17:55.980 | we have
00:17:58.940 | These items here are within a list. So we actually need to access
00:18:04.340 | that first item
00:18:07.260 | But that's good because we saw that
00:18:09.540 | When we first execute this code nothing happened and it only actually came across the error
00:18:16.420 | when we called a data set because that's when this transformation is actually performed and
00:18:21.900 | Now what we have to do is because we've already added this instruction to our data set
00:18:27.940 | Transformation or building process we actually need to reinitialize our data set. So we will come back up here
00:18:36.620 | So, where are you, yes, not that one, this one, so we need to load that again to
00:18:46.800 | reinitialize all of the instructions that we've added in there, and
00:18:50.120 | Then we can go ahead, rerun this, and now it should work. Hopefully. Let's see
00:18:57.200 | There we go. So now if we have a look at this and this is something I probably should have done, but I
00:19:03.000 | completely forgot to so I should have added this as maybe a list rather than just the
00:19:08.920 | Number, but it's fine because you know, if you come across and you need to do this you may want to add that in
00:19:16.320 | But we're not doing anything other than playing around with the Datasets
00:19:21.180 | library, so it's not really a problem, but you can see that we have added answer end into there now
00:19:27.480 | which is what we wanted to do, and
00:19:29.480 | Also importantly is if I let me copy this
00:19:35.040 | Bring down here
00:19:38.160 | we'll notice that we do still have all of our dataset features, so if I
00:19:43.640 | Go here, I don't really need to remove that's fine. I'll just break straight away. That's fine
00:19:49.120 | So sample sorry, yeah
00:19:56.080 | so you see the whole thing and
00:19:58.640 | We see that we still have the ID we have the text we have the context we have everything in there now
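Putting that together, a sketch of the working call for the streaming case; in the library version shown here the streaming map replaces the whole record, so every other column is passed through explicitly, and note the [0] indexing into the answers lists, which is what the error above was about:

    dataset["train"] = dataset["train"].map(lambda x: {
        # pass every existing column straight through
        "id": x["id"],
        "title": x["title"],
        "context": x["context"],
        "question": x["question"],
        "answers": {
            **x["answers"],
            # answer end = answer start + length of the (first) answer text
            "answer_end": x["answers"]["answer_start"][0] + len(x["answers"]["text"][0]),
        },
    })

    # the transformation is lazy; it only runs once the dataset is consumed
    for sample in dataset["train"]:
        print(sample["answers"])
        break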
00:20:05.800 | I'm just going to show you you know
00:20:08.160 | Why this breaks?
00:20:11.000 | why this breaks, or what happens if I
00:20:15.080 | remove these
00:20:17.600 | Okay, so let me rerun that
00:20:22.040 | this as well, so
00:20:24.040 | Yeah, so this should look the same
00:20:27.000 | Do we have yet? That's fine, but then if I run this
00:20:31.140 | So before, this had all the features
00:20:34.520 | But now we only have the single feature that we specified in this mapping, so the answers
00:20:40.200 | So that's why you need to, when shuffle is set to true, that's why you need to
00:20:45.920 | add every single feature in there. Otherwise, it's just going to remove them when you perform the map operation, but that's only the case
00:20:53.600 | when shuffle is actually set to true. Shuffle? Why am I saying shuffle? When streaming is set to true
00:21:01.240 | so let me bring this down here and
00:21:04.080 | Let me also copy our
00:21:07.720 | Initial loading code. So yeah
00:21:10.800 | Because we're going to need to reload our data set now anyway, because we just removed all the features from it
00:21:18.000 | Okay, and
00:21:24.040 | What I'm going to do now is just set streaming to false, and I'm going to run this same code where we
00:21:31.080 | still don't have our IDs or anything like that in there and
00:21:35.440 | We'll see what happens as well. We'll also notice we'll get a loading bar here and
00:21:39.580 | It's going to take a little bit of time to process this. Although actually with this it's probably gonna be super fast. So
00:21:46.080 | Probably ignore that
00:21:48.400 | But, you see, okay, it's taking a little bit of time. So now it's going through the whole dataset
00:21:54.040 | We haven't called the dataset, but we have used this map function. When streaming is set to false,
00:22:02.440 | the dataset isn't lazily loaded, and so the map operation is performed as soon as you call it
00:22:09.140 | so it's a slightly different behavior and the other behavior which is different is the fact that
00:22:15.080 | We've only needed to specify the answers feature here
00:22:18.840 | So when we have streaming set to false, we don't need to include every feature within the map operation
00:22:25.880 | We only need to include the feature that we are modifying or creating
00:22:31.360 | which
00:22:32.600 | You know, it's weird. I don't know why there's a behavior difference when streaming is true or false
00:22:37.700 | But it is there. So if I now take this again
00:22:42.800 | come down here and
00:22:45.760 | Run that we see now that we have all of our features again
00:22:50.440 | Right. So before when streaming was true
00:22:54.840 | If I run this code, it would have only included our answers the ID title context question
00:23:01.680 | They all would have been removed
00:23:03.560 | but now, with streaming equal to
00:23:05.560 | false, they're still there
00:23:08.700 | so, it's weird
00:23:11.880 | So it's a weird feature or a weird behavior, but it's
00:23:16.960 | How it is and we obviously just need to deal with it
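For reference, with streaming off the same transformation only needs to return the modified column, roughly:

    # reload without streaming; map now runs eagerly over the whole split
    dataset = datasets.load_dataset("squad")

    # only the column being modified has to be returned
    dataset["train"] = dataset["train"].map(lambda x: {
        "answers": {
            **x["answers"],
            "answer_end": x["answers"]["answer_start"][0] + len(x["answers"]["text"][0]),
        },
    })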
00:23:20.980 | Now the next thing I want to show you is how we can
00:23:24.800 | also add batching to our mapping process, so
00:23:29.740 | typically with
00:23:32.720 | Well, pretty much every, or as far as I can think of, any NLP task, we're going to want to
00:23:40.200 | tokenize our text
00:23:43.680 | So we're gonna go ahead and do that for the Q&A
00:23:46.640 | So we would import transformers, or from transformers import a BERT tokenizer, let's say
00:23:54.040 | And I
00:23:56.040 | would initialize that, so this is, you know, what we typically do: tokenizer equals BertTokenizer
00:24:02.880 | from_pretrained and
00:24:06.480 | let's say bert-base-uncased
00:24:11.240 | Okay, I'll initialize that
00:24:21.840 | And then what I want to do is I'm going to tokenize my
00:24:25.880 | context or question and context in the format that squad would usually expect when you're doing Q&A or
00:24:33.640 | making a model and
00:24:36.000 | I want to do that using the map function so you can do this in both streaming and
00:24:41.800 | non streaming by the way
00:24:46.600 | We just write date set
00:24:48.960 | was train, so same as before, dataset was train, or dataset train
00:24:55.360 | Dot map we are using a lambda function
00:24:59.040 | X and
00:25:04.600 | In here, we just want to say tokenizer
00:25:07.320 | so I'm not doing the
00:25:10.400 | Usually when you write this you would include a dictionary here
00:25:16.400 | The tokenizer the output from the tokenizer is already in dictionary format
00:25:21.100 | So we don't need to I don't need to do it in this case
00:25:23.920 | but basically what we have here is it's still a dictionary and
00:25:28.360 | What I want to do is so with
00:25:31.360 | Q&A, in your tokenizer you pass two text inputs: you pass your question and
00:25:38.240 | You'd also then pass your context
00:25:45.040 | As usual, we would set max length
00:25:48.680 | so usually
00:25:51.680 | 512 I
00:25:53.320 | would set padding equal to the max length and
00:25:57.200 | Also do truncation as well
00:26:01.360 | Okay, so very typical tokenization process nothing. There's nothing different going on here
00:26:07.920 | this is what we normally do when we tokenize our
00:26:10.800 | text going into a
00:26:14.080 | Transform model and then we want to say okay batched equals true
00:26:19.000 | So this allows us to do everything or perform this operation in batches
00:26:23.000 | And then we can also specify our batch size. So batch size equals
00:26:27.880 | Let's say 32. So now when we run this
00:26:31.560 | Where is it gone? You see it?
00:26:34.360 | now when we run this
00:26:36.520 | The map function here is going to tokenize our question and context in batches of 32
00:26:43.680 | So let's go ahead and do that
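A sketch of that batched tokenization step, using bert-base-uncased as in the video:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # tokenize question + context pairs in batches of 32
    dataset["train"] = dataset["train"].map(
        lambda x: tokenizer(
            x["question"], x["context"],
            max_length=512, padding="max_length", truncation=True,
        ),
        batched=True,
        batch_size=32,
    )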
00:26:45.680 | Okay, and then you can see that processing there, so I mean that's all we really need to
00:26:53.400 | Do with that. So I think that's probably it for
00:26:56.600 | the map method and we'll
00:26:59.800 | well, I'll fast forward and
00:27:02.480 | We'll continue with I think a few of the methods I think quite useful as well
00:27:09.200 | Okay, so that's just finishing up now
00:27:13.440 | so we can go ahead and have a look at what we've actually produced so
00:27:18.740 | Come to here and say
00:27:23.840 | Dataset train. So what do we have?
00:27:26.440 | Now we have answers like we did before, but now we also have attention mask
00:27:31.360 | We have input IDs and we also have token type IDs
00:27:35.760 | These are the three tensors that we usually output from the tokenizer when we do that
00:27:42.240 | So we now have those in there as well. We can also have a look
00:27:45.120 | Another thing as well: we can now, rather than looping through our dataset, because we're not using streaming
00:27:53.180 | equals true, we're using streaming equals false, we can now
00:27:56.420 | do this
00:27:59.400 | And we can see, okay, we have attention mask, and it's not going to show me everything because it's quite large
00:28:04.960 | So I'll just delete that, but you can see that we have the attention mask in there
00:28:11.160 | So one I want to do is
00:28:13.440 | Say I want to be quite pedantic and I don't like the fact that there is the
00:28:21.600 | Remove that
00:28:24.280 | That we have one feature called title
00:28:26.780 | Maybe I want to say okay
00:28:28.840 | It should be topic because it's the topic of the the context and the question
00:28:33.360 | If I want to be really pedantic and modify that I could say data set train
00:28:40.140 | rename column and
00:28:42.140 | To be honest, you can use it for this, of course
00:28:45.360 | but you're probably not going to you're probably going to use it more for when you need to rename a column to make sure it
00:28:51.300 | aligns to
00:28:52.580 | whatever the
00:28:54.580 | Expected inputs are for a transformer model. For example, so
00:28:58.960 | That that's where you would use it, but I'm just using this example. So I'm going to rename the column title
00:29:05.380 | to topic
00:29:10.740 | Let's print out our dataset train again
00:29:13.900 | So down here we have title in a moment. We're going to have topic
00:29:19.040 | Okay, so now we have topic
00:29:22.700 | So just rename column. Like I said, it comes in useful, not in this case, but generally this is
00:29:29.540 | usually useful
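The rename itself is one call, roughly:

    # rename the 'title' column to 'topic'
    dataset["train"] = dataset["train"].rename_column("title", "topic")
    print(dataset["train"])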
00:29:33.700 | What I may want to do as well is remove certain
00:29:38.220 | records from this dataset. So far we've been
00:29:41.740 | printing out, here we have this, which is now topic, we have University of Notre Dame
00:29:49.640 | Maybe for whatever reason we don't want to include those
00:29:53.060 | those topics so we can say
00:29:56.460 | Very similar to before, we write dataset train
00:30:01.100 | equals
00:30:03.860 | dataset train again
00:30:05.860 | This time I'm going to filter, so we're going to filter out records
00:30:09.900 | I don't want and again, it's very similar to the syntax you use for the map function, which is the lambda and
00:30:17.980 | in here, we just need to specify the condition for the samples that we do want to include or we do want to keep and
00:30:25.820 | In this case, we want to say okay, wherever the topic is
00:30:29.420 | not equal to
00:30:33.180 | University of Notre Dame
00:30:36.780 | Okay, so we'll run this and we'll have a look at what we produce, so dataset train
00:30:48.120 | So here we have number of rows, which is just over
00:30:53.100 | 88,000
00:30:55.860 | And we should get a lower number now. Now this will also go through, so this
00:31:01.220 | Remember we have shuffle set to, shuffle, why do I keep calling it shuffle, we have
00:31:06.740 | streaming set to
00:31:09.800 | false this time
00:31:11.820 | So it's going to run through the whole data set and then perform this filtering operation
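A sketch of the filter; note that the stored topic strings in SQuAD use underscores, so the value compared against is assumed to be 'University_of_Notre_Dame':

    # keep only the records whose topic is not the Notre Dame article
    dataset["train"] = dataset["train"].filter(
        lambda x: x["topic"] != "University_of_Notre_Dame"
    )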
00:31:16.320 | Now whilst I'm waiting for that
00:31:19.660 | Now I'll just fast forward again to where this finishes in a moment
00:31:26.260 | Okay. So now that's finished, and before we had
00:31:31.420 | 88,000 rows, now we have
00:31:34.780 | 87,300, and
00:31:36.620 | We should see, so let me take the dataset
00:31:40.560 | train
00:31:43.900 | Topic and I want to see let's say the first five of those
00:31:47.940 | Okay, now they're all Beyoncé, rather than before where it was the University of Notre Dame
00:31:56.720 | so we have those and
00:31:58.720 | What we may want to do now is
00:32:02.520 | Say, for example, we're performing inference for Q&A with a transformer model
00:32:10.380 | We don't really need all of the features that we have here. So
00:32:15.840 | We would only need the attention mask, the input IDs, and also the token type IDs
00:32:23.940 | So what we can do now is we can remove some of those columns. So
00:32:29.120 | We'll do a dataset train as always
00:32:32.960 | equals dataset train again
00:32:36.080 | And we want to remove those columns so remove columns
00:32:47.880 | We'll just remove so what all of them other than the ones that we want so
00:32:53.740 | Do answers
00:32:55.740 | Context
00:32:59.480 | ID question and topic
00:33:03.800 | Okay, and then let's have a look at what we have left
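As a sketch:

    # drop everything except the tensors the model itself consumes
    dataset["train"] = dataset["train"].remove_columns(
        ["answers", "context", "id", "question", "topic"]
    )
    print(dataset["train"])  # left with attention_mask, input_ids, token_type_ids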
00:33:11.400 | Okay, and then that's it, so we have those final features, and these are the ones that we would input into a
00:33:20.920 | transformer model for training. Now, I mean, there's nothing else I really want to cover
00:33:26.080 | I think that is pretty much all you need to know on
00:33:29.160 | Hugging Face Datasets to get started and start building pretty, I think, good
00:33:35.360 | input pipelines and using some of the
00:33:39.000 | datasets that are available. So we'll leave it there
00:33:43.260 | Thank you very much for watching and I will see you again in the next one. Bye