
Hugging Face Datasets #2 | Dataset Builder Scripts (for Beginners)


Chapters

0:00 Intro
0:49 Creating Compressed Files
2:41 Creating Dataset Build Script
4:49 Download Manager
8:59 Finishing Split Generator
10:13 Generate Examples Method
14:47 Add Dataset to Hugging Face
17:49 Apache Arrow Features
22:52 What's Next?

Transcript

Today we're going to continue with the Hugging Face Datasets series and have a look at how to use dataset builder scripts. With builder scripts we can do a few things: we can include data preprocessing within the data loading pipeline; we can stream from a remote data source, which is pretty useful when the owners of a dataset want the data streamed from their server (that happens quite a lot); and we can handle datasets that are split into multiple files or that contain images, among other things. In all of those cases you need one of these dataset builder scripts.

First, I'm going to very quickly show you how I created a compressed file for this demo. If we go over to the jamescalam HF-datasets repo on GitHub and into the 01 builder script directory, you'll see this dataset.tar.gz file. It's a compressed file, and we're going to stream our data from this exact location: if we open the file on GitHub, there's a download button; we just copy that link address and use it to stream our data into the dataset builder script.

Very quickly, how did I build that archive? You can have a look at the make-tar-file script in the repo. All I'm doing is taking the Reddit topics dataset that I built already, very similar to the dataset we used in the last video, just a little bigger; it's not massive, around 3.7-3.8 thousand rows. I convert the pandas DataFrame to a dictionary using the records orientation, save that as a JSON lines (jsonl) file, and then compress it. If you have your own dataset and want to follow the same steps, this is all you need: you add your dataset file to the compressed archive using tarfile, which I believe is installed by default with Python, so you won't have to pip install it.
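As a rough sketch of those steps, assuming a hypothetical source file and the dataset.jsonl / dataset.tar.gz file names used in this demo, it might look something like this:

```python
import json
import tarfile

import pandas as pd

# hypothetical source: the Reddit topics data built in the previous video
df = pd.read_json("reddit_topics.jsonl", lines=True)

# convert the DataFrame to a list of dicts, one per row
records = df.to_dict(orient="records")

# write a JSON lines file: one JSON object per line
with open("dataset.jsonl", "w", encoding="utf-8") as fp:
    for record in records:
        fp.write(json.dumps(record) + "\n")

# compress with the stdlib tarfile module (no pip install needed)
with tarfile.open("dataset.tar.gz", "w:gz") as tar:
    tar.add("dataset.jsonl")
```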
With all of that, we can go ahead and look at how we build our dataset builder script. We start with a template: come over to Hugging Face, go to Datasets, and let's go for SQuAD. SQuAD is just a very popular dataset, and I think the tutorials use it as a template for building your own scripts, which is probably where I got this from; by default I go to this dataset and use its loading script as my template whenever I'm building a new one.

In the builder script directory I had a jsonlines file just so you could see what was in the archive; I can delete that now, I don't need it anymore. What I want to do is create a Python file and name it the same as my dataset, so I'm going to call it reddit_topics_tar_gz.py. We're going to modify a lot of this template, but for now let's focus on the essential things. First, we don't need this extra class here; it's added complexity that isn't necessary. The builder class itself we'll call RedditTopicsTarGz, I suppose that's fine. The builder configs don't matter for now; the features do matter, but we'll mess around with those later, not now.

So let's focus on what actually matters right now: the download manager. We're going to look at the download manager a bit more in the next video, but for now, the download manager is essentially a Hugging Face Datasets utility that, given a particular file, either local or on the internet, downloads it and extracts its contents. This is exactly why I formatted the dataset as a tar.gz file: because I want to use this download_and_extract method. So I'm going to change this to a single URL: up at the top of the script where the URLs are defined, I remove the SQuAD ones and replace them with the location I copied earlier (back in the repo, in the 01 builder script directory, go to the compressed file and copy the download link). For the description we can just call it a demo; we'll change the other fields later.

With that, the script will almost work, but there's one thing: we're downloading this one URL, whereas with SQuAD there were two URLs, one for the training set and one for the development set. We only have one, so we need to modify this a little to deal with just one file, not two. In _split_generators we return the split generators, and we can simply remove the validation split, because we just have a training split. The downloaded files variable is also not going to be SQuAD's dictionary: it's basically going to be a path to a particular location.

Let me show you exactly what it's doing. In a notebook, we import the download manager from datasets (I wasn't sure it would be importable there, but it was) and initialize it; this initialization happens in the background of our builder script, so we don't actually do it ourselves, it just happens. Then we copy in the URL and call download_and_extract on it, and we'll call the output out. We see we get a file path back. So what is in that path? If we run os.listdir(out), we can see we have the JSON lines file that we put inside our compressed tar file. What does that mean for us? It means we can just load the file from there, based on what the download manager gives us: this is a cached location for our particular dataset.
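To make that concrete, a quick notebook sketch of the experiment (the URL is a placeholder for the download link copied from the repo, and note that this initialization is normally done for us behind the scenes):

```python
import os

from datasets import DownloadManager

# hypothetical placeholder; use the download link copied from the GitHub repo
url = "https://github.com/.../dataset.tar.gz"

# normally initialized behind the scenes by the builder script
dl_manager = DownloadManager()

# downloads the archive, extracts it into the local cache, and
# returns the path of the extracted directory
out = dl_manager.download_and_extract(url)
print(out)

# the extracted directory contains the jsonl file we packed earlier
print(os.listdir(out))
```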
Returning to the builder script: I don't really like the name downloaded_files, so I'm just going to call it path, use path in the split generator as well, and remove the train key since there's only one file. If we have a look at path, it's just the directory that contains our dataset's jsonl file, so what we actually need is to join that directory with the file name, something like path plus dataset.jsonl, to get the full path to our file. That's the split generators function done, and what it does is pass this file path along, via gen_kwargs, to the _generate_examples method; it's this method that outputs the rows of the dataset to us. So what we need to do now is use that file path to read our dataset.

Doing that from scratch without seeing what's happening is hard, so let's return to the notebook and work it out there. Let's call the variable filepath now, because that's what we created in the other file: a path to a file. First we need to import json, because it's a JSON lines file and we'll have to parse it. We open the file as fp (I don't think we need to pass the encoding, but we'll set it to be safe) and loop over it with for line in fp; because it's a JSON lines file, each of those lines represents a JSON object. For now we can just print the lines, keeping a count so we print a few items but not too many, breaking when the count hits five. And we can see we get a few items back: we're just reading the file, looping through, and printing the lines.

We can do the same in _generate_examples in the builder script. If we copy the notebook code in, some of it we'll need and some we won't, so let's remove what we don't need; what we do need is the yield, because _generate_examples is a generator function. Each line, or record, call it an object, is json.loads(line), and within that object we have a few key-value pairs. What are those? We can look back at the make-tar-file script: we have sub, title, selftext, upvote_ratio, id, and created_utc. We can pass all of these directly on, yielding them as they are. Let me show you what I mean: with SQuAD, the script yields a key plus a dictionary-type structure that it builds by hand; for us, we already have that dictionary structure because we used a JSON lines file, which is one of the reasons I like using them. So we can just do yield key, object. Now, what is key? It's actually an index value, or an ID value if you want, but it's an index, so I'm going to rename it idx because that makes more sense to me than key. And there we go, everything is set up here: we open the file at the given file path, read the lines, load each JSON object, and yield them. When we load the dataset over in Hugging Face Datasets, this is the method that generates all of those items.
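Putting the two methods together, a minimal sketch of the core of the builder script at this point, assuming the dataset.jsonl file name and a placeholder for the download link:

```python
import json
import os

import datasets

# hypothetical placeholder; use the download link copied from the GitHub repo
_URL = "https://github.com/.../dataset.tar.gz"


class RedditTopicsTarGz(datasets.GeneratorBasedBuilder):
    """Demo builder that downloads a tar.gz archive and yields jsonl records."""

    # _info() is also required; see the features sketch further down

    def _split_generators(self, dl_manager):
        # download and extract the archive; `path` is the cached directory
        path = dl_manager.download_and_extract(_URL)
        # a single train split; the extracted directory holds dataset.jsonl
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": os.path.join(path, "dataset.jsonl")},
            )
        ]

    def _generate_examples(self, filepath):
        # each line of the jsonl file is one record; yield (index, record)
        with open(filepath, encoding="utf-8") as fp:
            for idx, line in enumerate(fp):
                yield idx, json.loads(line)
```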
Now let's test it and see what happens; it won't work straight away, as we'll see, but let's try. I'm going to copy all of the script, come over to Hugging Face, click on my profile icon, click New Dataset, call it reddit_topics_tar_gz, and create that. Then under Files I add a new file with the same name as before, and paste all the code in. Let's tidy a couple of things while we're here: it's not SQuAD anymore, so we'll describe it as the Reddit topics tar.gz demo dataset. One thing we do need is import json, and it's good that that's already there. There are other bits we don't need anymore, but let's keep them in for now before we start removing everything and creating more errors, and commit that.

To test it, I create a new file, and all we do is from datasets import load_dataset and load the dataset, with the dataset name copied from the repo page; there's just one split in this dataset, so split="train". Let's see what happens. It downloads the build script, so far so good, downloads the data, and then we get an OSError: cannot find data file. If we look at the path in the error, there's a stray dot in it; without that dot, the data file is there. So we have our first error, which was not on purpose, but that's fine. The reason we have it is that I put a dot in the path in the script, and I'm not sure why I did that. Let's save the fix locally, edit the file in the web editor as well, remove the dot, commit the changes, then clear everything, restart, and go again.

Now we get a KeyError: 'context'. What does that mean? I don't remember putting 'context' anywhere, so let's look at the builder script. Here we have the features definition, which we haven't modified yet. What is it telling us? It's basically telling the dataset builder which features to expect in the dataset. Down in _generate_examples we're feeding in records, and each record is a set of key-value pairs: the keys are the feature names, and the values obviously have particular data types. The feature names defined in the template, though, are still the SQuAD ones (hence 'context'); they're not aligned to our actual dataset. So we go over to the make-tar-file script, take the features specific to our dataset, and write those here instead: sub, title, selftext, upvote_ratio, id, and created_utc.

Now we can try this again; it's not going to work quite right, but let's try. Rerunning it, it actually does load, but not in the way we might expect. If we look at data[0], we have the sub, title, and selftext (there's a lot in this selftext), but look at this: upvote_ratio, which is a floating-point number, is now a string; the id is fine, we should expect a string there; but created_utc, which is also a floating-point number, is now a string as well. So there's a bit of an issue here. If we go back to our script, when we feed the features through this feature specification, we're saying everything should be a string, so it converts everything into a string. We don't want that, so we need to use the specific Apache Arrow data type identifiers for the different types, for example a float type for those two fields.

To find the right one, I just search for Apache Arrow data types; on the data types and schemas page we can see a load of them: integer values, unsigned integers, and then floats. The single-precision floating-point type, float32, is perfect, so I copy that and use it for created_utc and also upvote_ratio. While I'm here, I'll change a few things we don't actually need: I remove the task template, because we can't do question answering with this dataset, at least not extractive question answering, and we can't train that with it; for the homepage we'll put the repo link, I suppose; supervised_keys is None; and the description, we know it's a demo. Okay, let's save this and try again.
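A sketch of what the corrected _info might look like, using datasets.Value with Arrow type identifiers (the feature names are taken from the records above; the homepage is a placeholder):

```python
import datasets


class RedditTopicsTarGz(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="Reddit topics tar.gz demo dataset",
            features=datasets.Features(
                {
                    "sub": datasets.Value("string"),
                    "title": datasets.Value("string"),
                    "selftext": datasets.Value("string"),
                    # single-precision float, not a string
                    "upvote_ratio": datasets.Value("float32"),
                    "id": datasets.Value("string"),
                    # Unix timestamp, also stored as float32
                    "created_utc": datasets.Value("float32"),
                }
            ),
            supervised_keys=None,
            homepage="https://github.com/jamescalam/...",  # placeholder
        )
```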
I'm going to copy this over into Hugging Face: come to the file, hit edit, select all, paste, and commit those changes. Now let's have a look at what happens when we load the dataset: come back over to the test script and run it. It loaded, and it loaded correctly, which is a good sign. Coming down to the records, we can see that those fields are no longer strings; they're actually floating-point numbers.

So that's everything. There are maybe a few aesthetic things left to change here, like the citation and the metadata up at the top, but we're not going to go through that in this video; I don't think you want to watch me change citations. That's everything for this video. In the next one we're going to take this a little further and add more advanced data types, like images, to our datasets. Until then, I hope this has been useful. Thank you very much for watching, and I will see you again in the next one. Bye!
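For completeness, the end-to-end check looks like this (with a hypothetical repo id; substitute your own username and dataset name):

```python
from datasets import load_dataset

# hypothetical repo id
data = load_dataset("your-username/reddit_topics_tar_gz", split="train")

print(data[0])        # one record from the train split
print(data.features)  # upvote_ratio and created_utc are now float32
```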