Hugging Face Datasets #2 | Dataset Builder Scripts (for Beginners)
Chapters
0:00 Intro
0:49 Creating Compressed Files
2:41 Creating Dataset Build Script
4:49 Download Manager
8:59 Finishing Split Generator
10:13 Generate Examples Method
14:47 Add Dataset to Hugging Face
17:49 Apache Arrow Features
22:52 What's Next?
Today we're going to continue with the Hugging Face Datasets series and have a look at how to use dataset builder scripts. With builder scripts we can do a few things: we can include data preprocessing within the data loading pipeline, and we can stream from a remote data source, which is pretty useful if you're using a dataset whose owners want the data streamed from their server (that happens quite a lot), or if your dataset is split into multiple files, or you have images in your dataset, or something along those lines. In all of those cases you need one of these dataset builder scripts.

First, I'll quickly show you how I created a compressed file for this demo. If we go over to the jamescalam HF datasets repo on GitHub, into the 01 builder script directory, you'll see a dataset.tar.gz file. This is a compressed archive, and we're actually going to stream our data from this exact location: on the file's page there's a download button, and we just copy that link address and use it to stream the data into our dataset builder script.

Very quickly, how did I build that archive? You can look at the make-tar-file script in the same directory. All I'm doing is taking the Reddit topics dataset I've built already, very similar to the dataset we used in the last video, just a little bigger; it's not massive, around 3.7-3.8 thousand rows. The script converts the pandas DataFrame to dictionaries using the records orientation, saves that as a JSONL (JSON Lines) file, and then compresses it. If you have your own dataset and want to follow the same steps, this is all you need: add your dataset file to the compressed archive using tarfile, which I believe is installed by default with Python, so you won't have to pip install it.
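For reference, here is a minimal sketch of that build step. The source filename is an assumption; the real script lives in the repo as the make-tar-file script:

```python
import json
import tarfile

import pandas as pd

# hypothetical source file; in the video this is an existing Reddit topics dataset
df = pd.read_csv("reddit_topics.csv")

# one JSON object per line, via the "records" orientation
with open("dataset.jsonl", "w", encoding="utf-8") as fp:
    for record in df.to_dict(orient="records"):
        fp.write(json.dumps(record) + "\n")

# tarfile ships with Python's standard library, so no pip install needed
with tarfile.open("dataset.tar.gz", "w:gz") as tar:
    tar.add("dataset.jsonl")
```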
With all of that, we can go ahead and look at how we build our dataset builder script. We start from a template: come over to Hugging Face, go to Datasets, and open SQuAD. SQuAD is just a very popular dataset, and I think the tutorials use it as a template for building your own scripts, which is probably where I got this from; by default I go to this dataset and use its loading script as my template whenever I'm building a new one.

Back in the repo, within the builder script directory, there's a leftover JSON Lines file so you could see what was in the archive, which I can actually delete now since I don't need it anymore. What I want to do is create a Python file named the same as my dataset, so I'm going to call it reddit_topics_tar_gz, and paste the template in. We're going to modify a lot of this, but for now I'm not going to touch too much; let's focus on the essential things. First, we don't need the custom config class, it's added complexity that isn't necessary here. The main class we'll rename to something like RedditTopicsTarGz. BUILDER_CONFIGS doesn't matter for us, and the _info method does matter, but we'll mess around with that later, not now.
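After trimming, the template boils down to roughly this skeleton (a sketch, not the exact script; we fill in each method as we go):

```python
import json

import datasets


class RedditTopicsTarGz(datasets.GeneratorBasedBuilder):
    """Demo dataset streamed from a compressed tar.gz archive."""

    def _info(self):
        # declares the expected features; we fix this properly later
        ...

    def _split_generators(self, dl_manager):
        # downloads/extracts the archive and defines the splits
        ...

    def _generate_examples(self, filepath):
        # yields one (index, record) pair per row of the dataset
        ...
```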
So let's focus on what actually matters right now: the download manager. We're going to look at the download manager in more detail in the next video, but for now, it's essentially a Hugging Face Datasets utility that, given a particular file, either local or on the internet, downloads it and extracts its contents. This is why I formatted the dataset as a tar.gz file: I want to use the download_and_extract method. So I'm going to change the template's _URLS dictionary to a single _URL, remove the old values, and replace them with the location I copied earlier. Let me copy it again: go back to the repo, into the 01 builder script directory, to the compressed file, and where it says download, copy that link and paste it in. As for the description, we can just call it a demo for now; we'll change the other details later.
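So at the top of the script we end up with a single constant, something like this (the path shown is a placeholder; use the link address behind the file's download button):

```python
# placeholder - copy the real link address from the file's download button
_URL = "https://github.com/<user>/<repo>/raw/main/dataset.tar.gz"
```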
With that, the script will almost work, but there's just one thing left. We're downloading a single URL, whereas SQuAD had two: if you go back a little you'll see two URLs, one for the training set and one for the development set. We only have one, so we need to modify this to deal with just one dataset, not two. In _split_generators we return the split generators, but we remove the validation split, because we just have a training split. And the downloaded files value isn't going to be a dictionary of splits anymore; it's basically going to be a path to a particular location.

Let me show you exactly what it's doing. In a notebook we import DownloadManager from datasets (it has moved around between versions, so it may not be exactly where you first look) and initialize it. This is what happens in the background of our builder script; we don't actually do this in the script, it just happens. Then we copy in the URL from before and see what download_and_extract outputs: we get a file path. Interesting. Let's look at what's in that path with os.listdir: there's the JSON Lines file that we put inside our compressed tar file. What does that mean for us? It means we can load the data directly from the location the download manager gives us; it's like a cached location for our particular dataset.
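As a sketch, the notebook experiment looks something like this:

```python
import os

from datasets import DownloadManager

# this normally runs behind the scenes inside the builder script
dl_manager = DownloadManager()

out = dl_manager.download_and_extract(_URL)  # _URL is the tar.gz link from before
print(out)              # a local directory in the Hugging Face cache
print(os.listdir(out))  # ['dataset.jsonl'] - the file we packed into the archive
```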
Returning to the builder script: the template calls this downloaded_files, and I don't really like that name, so I'm just going to call it path, and use path below as well, removing the train key. If we look at the path, it's just the directory that contains our dataset.jsonl file, so what we actually need is the full path to the file: the output directory plus the dataset.jsonl filename. That goes into the split generator, and that's our _split_generators method done. What it does is set this filepath, which gets passed along to the _generate_examples method, and it's that method that outputs the rows of the dataset to us.
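Putting that together, the finished method looks roughly like this (the dataset.jsonl filename is the one we packed into the archive; assumes `import os` at the top of the script):

```python
def _split_generators(self, dl_manager):
    # download_and_extract returns the path of the extracted archive
    path = dl_manager.download_and_extract(_URL)
    # unlike SQuAD, we only have a train split
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filepath": os.path.join(path, "dataset.jsonl")},
        )
    ]
```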
What we need to do now is use that filepath to read our dataset. Doing that from scratch without seeing what's happening is kind of hard, so let's return to the notebook and see how we can do it there. I'll rename the variable filepath, because that's what we created in the other cell: a file path. First we need to import json, because it's a JSON Lines file and we'll have to parse it. We open the file (I don't think we strictly need to pass the encoding, but we'll set it to be safe) and loop through it with for line in fp: because it's a JSON Lines file, each line represents a single JSON object. For now we can just print each one, with a counter so we print a few items but not too many, breaking after a handful. Running that, we can see we get a few records: we're just looping through the file we read and printing them. We can do the same over in the builder script.
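In the notebook, that test is just a few lines:

```python
import json

with open(filepath, "r", encoding="utf-8") as fp:
    for count, line in enumerate(fp):
        print(json.loads(line))  # each line is one JSON object
        if count == 5:           # print a few items, but not too many
            break
```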
So let's copy this into the builder script. Of the template's _generate_examples we'll need some but not all, so let's remove what we don't need; really all we need is the yield, because this is a generator function. Each line, or record, is json.loads(line); maybe we call it object. Within that object we have a few different key-value pairs. What are those? We can look back at the make-tar-file script: we have sub, title, selftext, upvote_ratio, id, and created_utc. We can pass all of these directly on and yield them as they are. Let me show you what I mean: in the SQuAD template, the method yields a key followed by a dictionary-type structure that it builds by hand. For us, we already have that dictionary structure because we used a JSON Lines file, which is one of the reasons I like using them, so we can just do yield key, object. Now, what is key? It's really an index value (or an ID value if you want, but it's an index), so I'm going to rename it index, because that makes more sense to me.

So we have everything set up here: we open the file at the given filepath, read the lines, load each JSON object, and yield them. When we load the dataset over in Hugging Face Datasets, this is the thing that generates all of those items.
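The whole method ends up as short as this sketch (field names as listed in the make-tar-file script):

```python
def _generate_examples(self, filepath):
    # each line of the JSON Lines file is already a complete record
    # (sub, title, selftext, upvote_ratio, id, created_utc), so there
    # is nothing to restructure - just parse and yield with an index
    with open(filepath, encoding="utf-8") as fp:
        for index, line in enumerate(fp):
            obj = json.loads(line)
            yield index, obj
```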
What we should do now is test it and see what happens. It won't work straight away, as we'll see, but let's try. I'm going to copy all of this code, come over to Hugging Face, click on my profile icon, and click New dataset. I'm going to call it reddit-topics-tar-gz. Then I come to Files, add a file, create a new file named exactly like the script we created before, and paste all the code in. Let's tidy one thing: it's not SQuAD anymore, so the description becomes something like "Reddit topics tar.gz demo dataset". One thing we do need is the json import, and it's good that it's already there. There are parts we no longer need, but let's keep them for now rather than removing everything and creating more errors. So let's commit that and try it out.

I'm going to create a new file to test it. We do from datasets import load_dataset, load the dataset into a data variable, and use the dataset name, which you can copy straight from the hub page. There's just one split in this dataset, so split="train". Let's see what happens.
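The test file is only a couple of lines (the repo id shown is illustrative; copy your own from the hub page):

```python
from datasets import load_dataset

# <username>/reddit-topics-tar-gz is a placeholder for your own repo id
data = load_dataset("<username>/reddit-topics-tar-gz", split="train")
```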
It downloads the builder script, so far so good, downloads the data, and then we get an OSError: cannot find data file. If we look at the path in the error, there's a stray dot in the filename, and without that dot the data file is there. So we have our first error, which was not on purpose, but that's fine: the reason is that in the script I put a dot in the filename, and I'm not sure why I did that. Let's fix it locally and also edit it in the web editor on the hub: remove the dot, commit the changes, then clear everything, restart, and try again.

Now we get a KeyError: 'context'. What does that mean? I don't remember putting 'context' anywhere, so let's look at the builder script. Here it is: the features definition, which we haven't modified yet. What is it telling us? It's basically telling the dataset builder which features to expect in the dataset. Down in _generate_examples we're feeding in records, and each record is a set of key-value pairs: the keys are the feature names, and the values obviously have particular data types. The feature names defined here are not aligned to our actual dataset; they're still the SQuAD keys, 'context' among them. So we need to go back to the make-tar-file script, take the features specific to our dataset, and write those out here instead: sub, title, selftext, upvote_ratio, then id, and one more, created_utc. Now we can try this; it's not going to work properly yet, but let's try.
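At this point the features block looks something like this, with everything declared as a string (exact field names per the make-tar-file script):

```python
features=datasets.Features(
    {
        "sub": datasets.Value("string"),
        "title": datasets.Value("string"),
        "selftext": datasets.Value("string"),
        "upvote_ratio": datasets.Value("string"),  # wrong type - fixed below
        "id": datasets.Value("string"),
        "created_utc": datasets.Value("string"),   # wrong type - fixed below
    }
),
```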
Rerunning, it does actually work, but not in the way we might expect. If we look at data[0], we have sub, title, selftext (there's a lot in this selftext), but look at this: upvote_ratio, which is a floating-point number, is now a string. The id is fine, we should expect that, but created_utc, which is also a floating-point number, is now a string as well. So there's a bit of an issue here. Going back to our script: in the feature specification we're saying everything should be a string, so the builder converts everything into a string. We don't actually want that, so we need to use the specific Apache Arrow data type identifiers for the different fields, for example a float type.

To find those, I'm just going to search for something like "Apache Arrow data types and schemas". On that page we can see a load of types: integer values, unsigned integers, and then floats. "Single precision floating-point type" is perfect, so I'll copy that identifier, float32, and use it for created_utc and also upvote_ratio. While I'm here, I'll change a few things we don't actually need: I'm going to remove the task_templates, because we can't do question answering with this dataset, at least not extractive question answering, and we can't train with that. For the homepage, let's put the GitHub repo, I suppose. supervised_keys is None, and for the description, it's a demo; we know that.
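After those changes, _info looks roughly like this (description and homepage here are placeholders):

```python
def _info(self):
    return datasets.DatasetInfo(
        description="Reddit topics tar.gz demo dataset",
        features=datasets.Features(
            {
                "sub": datasets.Value("string"),
                "title": datasets.Value("string"),
                "selftext": datasets.Value("string"),
                "upvote_ratio": datasets.Value("float32"),
                "id": datasets.Value("string"),
                "created_utc": datasets.Value("float32"),
            }
        ),
        supervised_keys=None,
        homepage="https://github.com/<user>/<repo>",  # placeholder
    )
```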
Let's save this and try again. I'm going to copy the script over to Hugging Face, edit the file on the hub, select all, paste, and commit those changes. Now let's look at what happens when we load the dataset: come back over to the test file and run it. It loaded correctly, which is a good sign, and coming down to the records, those fields are no longer strings but actual floating-point numbers. So that's everything working. There are maybe a few aesthetic things to change here, like the citation, and the description up top, but we're not going to go through that in this video; I don't think you want to watch me change citations.

So that's everything for this video. In the next video we're going to take this a little bit further and add more advanced data types, like images, to our datasets. Until then, I hope this has been useful. Thank you very much for watching, and I will see you again in the next one.