
Hugging Face Datasets #2 | Dataset Builder Scripts (for Beginners)


Chapters

0:00 Intro
0:49 Creating Compressed Files
2:41 Creating Dataset Build Script
4:49 Download Manager
8:59 Finishing Split Generator
10:13 Generate Examples Method
14:47 Add Dataset to Hugging Face
17:49 Apache Arrow Features
22:52 What's Next?

Whisper Transcript

00:00:00.000 | Today we're going to continue with the Hugging Face Datasets series, and we're going to have a look at
00:00:07.200 | how to use the builder scripts. With builder scripts we can do a few things: we
00:00:12.800 | can include data pre-processing within the data loading pipeline, and we can stream from a
00:00:21.840 | remote data source, which is pretty useful if you are using a dataset where the owners of
00:00:29.520 | that dataset want the data to be streamed from their server, which happens quite a lot, or if
00:00:36.160 | maybe you have your dataset split into multiple files, or you have images in your dataset,
00:00:41.840 | or something along those lines. In those cases you always need to use one of these dataset
00:00:47.680 | builder scripts. So what I'm first going to do very quickly is show you how I created a compressed
00:00:55.040 | file for this demo. Let me show you: we're going to go over here, into
00:01:02.320 | this jamescalam HF datasets repo on GitHub, into the 01 builder script directory, and you will
00:01:09.920 | see this file here, this dataset.tar.gz file. This is a compressed file, and we're
00:01:18.000 | actually going to stream our data from this exact location. So if we go on here, you see we
00:01:26.080 | have this download button; we're just going to copy that link address and
00:01:30.960 | use it to stream our data into the dataset builder script. So very quickly,
00:01:39.760 | how did I build that? You can actually have a look at this file here. All I'm doing is taking
00:01:47.360 | the Reddit topics dataset that I've built already, very similar to the dataset we used in
00:01:52.640 | the last video, just a little bit bigger; it's not massive, it's 3.7, 3.8 thousand rows.
00:02:01.360 | It converts the pandas DataFrame to a dictionary using the records orientation,
00:02:09.440 | and then saves that as a JSONL, or JSON Lines, file. Then we compress it using this. If
00:02:18.080 | you have your own dataset and you want to compress it and follow the same steps
00:02:21.760 | we're doing here, this is what you will need: you add your dataset file to the
00:02:29.520 | compressed file here, and you just use tarfile, which I believe is actually installed
00:02:37.600 | by default with Python, so you won't have to pip install it.
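
For reference, here's a minimal sketch of that compression step, assuming the Reddit topics data is already in a pandas DataFrame; the file names here are illustrative, not necessarily the exact ones used in the repo.

```python
import json
import tarfile
import pandas as pd

# Illustrative input; the video uses a pre-built Reddit topics dataset (~3.7-3.8K rows).
df = pd.read_csv("reddit_topics.csv")

# Convert to a list of dicts using the "records" orientation, then write
# one JSON object per line (the JSON Lines format).
with open("dataset.jsonl", "w", encoding="utf-8") as fp:
    for record in df.to_dict(orient="records"):
        fp.write(json.dumps(record) + "\n")

# tarfile ships with Python, so there is nothing to pip install.
with tarfile.open("dataset.tar.gz", "w:gz") as tar:
    tar.add("dataset.jsonl")
```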
00:02:43.200 | With all of that, we can go ahead and actually have a look at how we build our dataset builder script. We start with a
00:02:49.360 | template first, so come over to Hugging Face here, go to Datasets, and let's go for SQuAD.
00:03:00.640 | SQuAD is just a very popular dataset, and I think in the tutorials they use it as
00:03:09.280 | a template for building your own scripts, and that's probably where I got this from, but
00:03:18.320 | by default I go to this dataset and use its script as my template if I'm building a new
00:03:23.760 | dataset loading script. So come over here: within the builder script directory I have a few things.
00:03:30.560 | There's a jsonlines file, just so you can see what was in there; I can actually delete that, I don't need it anymore,
00:03:36.400 | so let's remove that. What I want to do is create a Python file, and I'm going to name it
00:03:47.600 | the same as my dataset, so I'm going to call this reddit-topics-tar-gz;
00:03:56.800 | that's what I'm going to call this dataset.
00:04:04.560 | We're going to modify a lot of this, but for now I'm not going to
00:04:14.320 | touch too much; let's focus on the essential things that we need here. First
00:04:19.920 | thing, we don't need this; it's added complexity that isn't necessary.
00:04:24.240 | For the class, we'll call it RedditTarGz, or I suppose RedditTopicsTarGz is fine.
00:04:37.360 | The builder configs here don't matter; this does matter, but we will mess around with that later,
00:04:46.400 | not now. Let's focus on what actually matters right now. Here we have this
00:04:53.520 | download manager, and we're going to look at the download manager a bit more in the next video,
00:04:58.160 | but for now, the DownloadManager is essentially a Hugging Face Datasets utility that,
00:05:06.480 | given a particular file, either local or on the internet, lets us download it and extract its
00:05:14.320 | contents. This is why I formatted the dataset file as a tar.gz file: because I want
00:05:25.200 | to use this download_and_extract method. So what do we need to do? I'm
00:05:32.000 | going to change this to a single URL, so I'm going to come up here where we define the URLs, remove that,
00:05:39.520 | and replace it with the location that I copied earlier.
00:05:45.840 | Actually, that won't work, so I need to copy it again: I go to
00:05:53.280 | the repo again, go to the builder script directory, the location of the compressed file,
00:06:01.760 | and where it says download, I copy that link and put it in here. So
00:06:08.480 | with that, for the description we can just call it a demo; we'll change the other things later.
00:06:18.000 | With that, this here will almost work; there's just one thing.
00:06:27.680 | We're downloading this one URL, but with SQuAD there were two URLs. If I go
00:06:35.680 | back a little bit, you see there are these two URLs, one for the training set, one for the
00:06:40.000 | development set. We only have one, so we need to modify this a little bit to deal with just
00:06:48.080 | one file, not two. Here we need to return the split generators; we just remove this one,
00:06:55.520 | the validation split, because we only have a training split. And the downloaded files value is
00:07:02.560 | actually not going to be this; it's basically going to give us a path to a particular
00:07:08.960 | location. Let me show you exactly what it's doing. So we're going to do from, uh, transformers, no,
00:07:15.600 | datasets, sorry: from datasets.utils import DownloadManager. It might not be there, maybe it's
00:07:30.880 | here, let's see. Okay, it was there. So DownloadManager, and I'm just going to initialize it. This is kind of
00:07:41.600 | happening in the background of our builder script, so we don't actually do this in the builder script,
00:07:46.800 | it just kind of happens. We do that, and then let's just copy what we have elsewhere, so we have
00:07:55.840 | the URL, and it is this. That is the URL, and let's just see what this outputs:
00:08:04.000 | dl_manager.download_and_extract(url). Let's see what we get; we'll call this out.
00:08:14.080 | Okay, so we see we get a file path from that. Interesting. So let's have a look at
00:08:26.960 | what is in that file path: os.listdir(out). Now we can see we actually have that JSON Lines
00:08:36.720 | file that we put inside our compressed tar file. So what does that mean for us?
00:08:46.560 | It means we can just load the file from here, based on what the download manager is giving us;
00:08:52.640 | this is like a cached location for our particular dataset.
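
As a sketch, the notebook experiment looks roughly like this; the URL is a placeholder for the download link copied from GitHub, and the exact import path may differ between datasets versions.

```python
import os
from datasets import DownloadManager

# The builder normally constructs this behind the scenes;
# here we create it manually just to inspect what it returns.
dl_manager = DownloadManager()

# Placeholder: use the link copied from the dataset.tar.gz download button.
url = "https://github.com/.../dataset.tar.gz"

out = dl_manager.download_and_extract(url)
print(out)              # a path into the local datasets cache
print(os.listdir(out))  # the extracted contents, e.g. the JSON Lines file
```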
00:08:56.880 | So, returning to the builder script, we have the downloaded files variable. I don't really like the name, so I'm just
00:09:06.880 | going to call it path, and I'll say path here as well, and remove the train key. If we just have a look
00:09:16.320 | at the path, it's just the directory that contains our dataset.jsonl
00:09:26.880 | file. So actually what we need to do is something like out plus "dataset.jsonl";
00:09:37.360 | this will give us a full path to our file, so that is what we're going to do here.
00:09:46.000 | So where are we? Path... I mean, it's a bit easier to read if we come here and zoom out a little bit.
00:09:54.800 | Okay, so it will be path,
00:10:01.200 | and then here we have "dataset.jsonl". And yeah, that's our _split_generators
00:10:14.640 | method here.
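
Putting those edits together, the split generator ends up looking something like this sketch; the class name follows the walkthrough, the URL is a placeholder, and os.path.join stands in for the plain string concatenation used in the video (_generate_examples and the features come later):

```python
import os
import datasets

_URL = "https://github.com/.../dataset.tar.gz"  # placeholder download link

class RedditTopicsTarGz(datasets.GeneratorBasedBuilder):
    def _split_generators(self, dl_manager):
        # download_and_extract caches and unpacks the archive,
        # returning the path of the directory it extracted into.
        path = dl_manager.download_and_extract(_URL)
        # Only a train split; the validation SplitGenerator from the
        # SQuAD template has been removed.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": os.path.join(path, "dataset.jsonl")},
            )
        ]
```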
00:10:21.200 | What that will do, you see, is pass this file path along to the _generate_examples method, and it's this method that is
00:10:29.200 | going to output the rows of the dataset to us. So what we need to do is just
00:10:39.200 | use this to read our dataset. Now, doing that from scratch,
00:10:44.320 | without seeing what's happening, is kind of hard, so let's return to the notebook file and see
00:10:49.440 | how we can do that. So we're here; let's call this file_path now, because this is
00:10:57.440 | what we created in the other file, it's a file path.
00:11:00.080 | First we need to import json, because it's a JSON Lines
00:11:08.880 | file, so we're going to have to read that. Then we open the file as fp,
00:11:16.400 | and I don't think we need to set the encoding there, but we'll put it in to be safe. What we're going
00:11:24.720 | to do is go through it with for line in fp, because it's a JSON Lines file, so
00:11:30.960 | there are just lines of data, and each one of those lines represents a JSON object.
00:11:39.360 | We can just print them for now, and let's put a count on this, so we'll print out
00:11:48.080 | a few items but not too many: if count is five, break. Let's see what we get. Okay, cool, so
00:12:00.000 | we can see that we get a few items here. So we've just opened the file for reading,
00:12:07.120 | and we're looping through and printing the lines.
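
The notebook loop, as a sketch; the extracted file name is assumed to be dataset.jsonl:

```python
import json

# `out` is the cache directory returned by download_and_extract above.
file_path = out + "/dataset.jsonl"

with open(file_path, "r", encoding="utf-8") as fp:
    count = 0
    for line in fp:
        # Each line of a JSON Lines file is one standalone JSON object.
        print(json.loads(line))
        count += 1
        if count == 5:  # print a few items, but not too many
            break
```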
00:12:14.800 | We can do the same over in the other file, the builder script, so let's come here and copy this in. Now, of all of this we see here,
00:12:26.960 | some we will need, but not all of it, so let's go ahead and just remove what we don't need.
00:12:34.080 | This yield is all we need, because this method is a generator function.
00:12:43.120 | So let's come here and remove parts of this. The line, or we should call it a record,
00:12:55.040 | is equal to json.loads(line); maybe we call it object. Within that object we have a few
00:13:07.840 | different key-value pairs. What are those? We can have a look at the make-tar-file
00:13:15.840 | script, and we have all of these: sub, title, selftext, upvote_ratio, id, and
00:13:24.640 | created_utc. Now, we can actually just pass all of these directly on, so we can yield all
00:13:31.520 | of them. Let me show you what I mean by that. We come here, and you see that we're yielding,
00:13:37.680 | and for SQuAD it's this dictionary-type structure that's being yielded. For us, we already have
00:13:42.560 | that dictionary-type structure, because we used a JSON Lines file; this is one of the reasons I like
00:13:46.720 | using them. So we can actually just do yield key, object, like that. Now, what is key?
00:13:54.720 | Key is actually the index value, or ID value if you want, but it's an index value, so I'm going to
00:14:04.720 | rename it index, because that makes more sense to me than key. And here we go: we have
00:14:14.240 | set everything up here. We're going to open the file, read the
00:14:22.160 | lines, load each JSON object, and just yield them. So what does that do? When we
00:14:35.520 | are loading the dataset over in Hugging Face Datasets,
00:14:42.960 | this is going to be the thing that generates all of those items.
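
Put together, the generator method described here looks roughly like this sketch, continuing the builder class from above:

```python
import json
import datasets

class RedditTopicsTarGz(datasets.GeneratorBasedBuilder):
    # ... _info and _split_generators as shown elsewhere ...

    def _generate_examples(self, filepath):
        """Yield (index, example) pairs, one per line of the JSON Lines file."""
        with open(filepath, "r", encoding="utf-8") as fp:
            for index, line in enumerate(fp):
                obj = json.loads(line)
                # Each record is already a dict keyed by feature name
                # (sub, title, selftext, upvote_ratio, id, created_utc),
                # so it can be yielded as-is.
                yield index, obj
```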
00:14:50.000 | So what we should do now is test it and see what happens. It won't work straight away, we'll see,
00:14:57.280 | but let's try. What I'm going to do is copy all of this, then I'm going to come
00:15:03.360 | over to Hugging Face, click on my little icon right over here, and click New dataset.
00:15:11.520 | I'm going to call it reddit-topics-tar-gz, create that, come to Files, and I'm going to
00:15:20.720 | add a file, create a new file, and this is just going to be reddit-topics-tar-gz, the exact same file name
00:15:28.880 | we created before, and I'm just going to paste all that code in there. So you see we have all
00:15:34.560 | this code. Let's just remove this; is that important? It's not SQuAD anymore, so let's just
00:15:42.160 | call it the reddit-topics-tar-gz demo dataset. One thing we do need is to import json, so
00:15:52.000 | it's good that's already there. We don't need this anymore, but let's keep it in there for now,
00:15:57.840 | before we start removing everything and creating more errors. So let's commit that, and then let's just try it
00:16:05.920 | and see what happens. I'm going to create a new file to test it, a test dataset script,
00:16:13.680 | and what we're going to do is from datasets import load_dataset,
00:16:25.200 | and the dataset, we'll just call it data, equals load_dataset,
00:16:28.480 | and we can find the dataset name over here, so I'm just going to click here, copy that, and
00:16:37.120 | there is just one split in this dataset, so split equals train. Let's see what happens.
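
The test script is just a couple of lines; the repo ID below is a placeholder for whatever name the dataset was created under:

```python
from datasets import load_dataset

# Placeholder repo ID; copy the real name from the dataset page on Hugging Face.
data = load_dataset("jamescalam/reddit-topics-tar-gz", split="train")
```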
00:16:42.880 | Okay, we download the builder script, so far so good, we download the data, and then we get this. What
00:16:51.440 | is this OSError? Come down here: cannot find data file. If we have a look at this, without
00:16:59.200 | this dot here, we can see that the data file is there. So we have our first error, which was
00:17:06.080 | not on purpose, but that's fine. The reason we have it is that here I put a dot in the path; I'm not
00:17:13.760 | sure why I did that. So let's save that, and actually let's just edit it in the web editor here
00:17:19.680 | as well. Let's remove that, commit the changes, and then try again. Let's come up here,
00:17:30.240 | clear everything, restart, and go again. Now we get this KeyError. What does that
00:17:36.640 | mean? KeyError: 'context'. I don't remember putting context anywhere, so let's have a look at the
00:17:43.520 | builder script. Okay, here we have this; we haven't modified
00:17:49.360 | this yet. What is this telling us? It's basically telling the dataset builder which
00:17:57.520 | features to expect in the dataset. Down here we're feeding in these
00:18:04.080 | different features; we're feeding in these records, and each record is a set of key-value pairs, where the
00:18:08.720 | keys are the feature names and the values are, obviously, values which have a particular data type.
00:18:13.840 | Here we have the feature names, the keys, but they are not aligned to our actual
00:18:23.200 | dataset; these are the SQuAD dataset's key-value pairs. So we need to come over to this file,
00:18:29.920 | and we can get the features specific to our dataset from there. So let's take these; I'm going to
00:18:36.320 | copy them across, and all I'm going to do is just write those out here. So we have
00:18:46.080 | sub, title, selftext,
00:18:48.880 | upvote_ratio;
00:18:52.000 | we have id, and we also have another one, so let's create another.
00:19:00.000 | Well, actually, let's make this one more normal first, so id
00:19:03.760 | is this, and then we have one more, which is created_utc.
00:19:11.840 | Okay, now we can try this. It's not going to work again, but let's try.
00:19:18.160 | Let's rerun this and see what happens. So actually, it does work, but it's not working
00:19:32.320 | in the way that we might expect. If we have a look at data[0], we have
00:19:41.440 | sub, title, selftext, and then we come down here, there's a lot in this selftext, but just
00:19:48.960 | look at this: the upvote_ratio, which is a floating point number, is now a string;
00:19:56.320 | the id, that's fine, we should expect that; and the created_utc, which is also a floating point number,
00:20:02.720 | is now a string as well. So there's a bit of an issue here. Basically, if we go back to our
00:20:09.440 | script, when we are feeding the features through this feature specification, it's seeing
00:20:20.000 | that we're saying everything should be a string, and it's converting everything into a string.
00:20:23.840 | We don't actually want everything to be a string, so what we need to do here is use the specific
00:20:30.240 | Apache Arrow data type identifiers for different things, for example the floats that we have here.
00:20:36.880 | So let's go ahead and have a look at what that might be. To find it, I'm just
00:20:42.720 | going to search for something like Apache Arrow data types, so Apache Arrow data types and schemas,
00:20:51.360 | maybe. We come here, and we can see a load of these: we have integer
00:20:58.480 | values, unsigned integers, and then we have floats. So I'm going to say, okay, single-precision floating
00:21:05.120 | point type is perfect. I'm just going to copy that, float32, and I'm going to put that for
00:21:11.440 | created_utc and also the upvote_ratio. I'm going to save that, and change a few
00:21:18.160 | things that we don't actually need. I'm going to remove this task template, because
00:21:21.680 | we can't do question answering with this dataset; at least not extractive question
00:21:27.920 | answering, or we can't train with it. For the homepage, let's put this, I suppose.
00:21:38.240 | supervised_keys is None, and what else do we have here? The description:
00:21:46.000 | it's a demo, we know that. Okay, let's save this and try again.
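
With the float fields corrected, the _info method ends up along these lines; the feature names are the ones read out in the video, while the description and homepage strings are placeholders:

```python
import datasets

class RedditTopicsTarGz(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="Demo dataset of Reddit topics.",
            features=datasets.Features(
                {
                    "sub": datasets.Value("string"),
                    "title": datasets.Value("string"),
                    "selftext": datasets.Value("string"),
                    # float32 is Arrow's single-precision floating point type.
                    "upvote_ratio": datasets.Value("float32"),
                    "id": datasets.Value("string"),
                    "created_utc": datasets.Value("float32"),
                }
            ),
            supervised_keys=None,  # no input/label pairing for this dataset
            homepage="...",        # placeholder
        )
```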
00:21:54.880 | So I'm going to copy this over into Hugging Face: come here, not here, here, edit, come here, select all, paste,
00:22:04.880 | and I am going to commit those changes. Now let's have a look at what happens if we load the
00:22:12.800 | dataset. Come back over here to the test dataset script, run it, and see what happens. Okay,
00:22:21.360 | it loaded correctly, that's a good sign. Come down here, and now we can see that these are no
00:22:27.200 | longer strings; they're actually floating point numbers.
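
A quick way to confirm the types, as a sketch:

```python
# Inspect the first record and the schema; upvote_ratio and created_utc
# should now come back as float32 values rather than strings.
print(data[0])
print(data.features)
```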
00:22:35.600 | So that's everything; there are maybe a few aesthetic things to change here, like the citation.
00:22:41.680 | We'll change that up here, and I can change this as well, but we're not going to go through that in
00:22:46.960 | this video; I don't think you want to watch me change citations. So yeah, that's everything
00:22:54.640 | for this video. In the next video, we're going to take a look at taking this a little bit
00:23:00.080 | further and adding more advanced data types, like images, into our datasets. So until then, I hope
00:23:08.320 | this has been useful. Thank you very much for watching, and I will see you again in the next one.