Hugging Face Datasets #3 | Adding Images


Chapters

0:00 Intro
2:05 Creating Tar Files for Images
5:11 Compressing Images in Tar Files
6:26 Adding Dataset Builder Script
9:07 Iterable Download Manager with iter_archive
9:56 _generate_examples Function Definition
12:52 Adding to Hugging Face Datasets Hub
13:34 Fixing Errors
14:23 Using Your New Dataset
14:53 Dealing with Larger Image Datasets

Transcript

Okay, so today we're going to take a look at the third video in this series on using Hugging Face Datasets. Today we're going to have a look at how we can include images in our datasets. So it will look kind of like this dataset here. So this is jamescalam/image-text-demo.

This is what we're going to recreate. So we're going to have the images here, and you see that within the dataset preview you get this nice little widget where you actually see the image. Then we're also going to have some text here; that's not so important. But you can scroll through and there are all these different images that are loaded, and we're going to learn how to do the same thing.

So we'll come over to a notebook, and the first thing I'm going to do is actually get the images from that same dataset. Okay, so from datasets import load_dataset, and we're just going to load that dataset. So data equals load_dataset, and the name will be the same as what you have up here, so I can copy this, and it will be the training split.

Okay, it might take a little while to download if you haven't downloaded it before. So you have text and image. From here, what we can do is go into the first item, so we go to row zero, and look at that image. Okay, and then we actually get the image from that dataset.

Now, to do that we have to do something slightly different from just including the data within a JSON lines file, because obviously you can't include an image in a JSON lines file unless you use something like the image bytes, which will not load in this way. You would have to do some extra processing steps in order to actually view an image if you did that.

So how do we do it in this way, where we actually just get this nice image? Well, the first thing we're going to do is create a tar file, where we take all of our images and put them into this compressed file. That file needs to be hosted somewhere; you can host it in different places, but we're going to host it on Hugging Face in this example.

So let's get started by creating that: we'll download each one of these files, and we're going to create a tar file from them. So the first thing we're going to do is import os. Right now I'm just preparing the data to actually create this dataset from. So I'm going to do os.mkdir; I'm just going to create a new directory called images.

Only if it doesn't already exist, so we'll say if os.path.exists, and we want ./images in there, then create it. But actually I want to make this if not. Okay, and then after that, what I want to do is iterate through each one of these images and just save them to file.
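The directory check described here can be sketched as follows. The helper name ensure_dir is made up for illustration, and the demo runs in a temporary directory so nothing is left behind.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` only if it does not exist yet, then return it."""
    if not os.path.exists(path):
        os.mkdir(path)
    return path


# Demo in a throwaway location; in the video the target is ./images.
with tempfile.TemporaryDirectory() as tmp:
    images_dir = ensure_dir(os.path.join(tmp, "images"))
    ensure_dir(images_dir)  # calling it again is harmless
    created = os.path.isdir(images_dir)
```

The standard library also offers os.makedirs(path, exist_ok=True), which does the same check in one call.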

So let's create that first and see how we can do that. So we will take the image of row zero, so data[0]["image"], and let's have a look at what that image is. So it's going to be a PIL image object, and what I want to do is actually save that.

So we just do, if I remember correctly, image.save, and just show me, and yeah, I'm going to do this. Okay, now if I have a look in my file explorer over here, we see that we have images, and then here we have 0.jpeg. Open this and we have that image.

So what we're going to do now is just repeat that logic for every item in that dataset. So for i in range(len(data)), we are going to do exactly what it says here. It might take a little bit of time because it needs to load every image, and each image is pretty big, so I'm going to do from tqdm.auto import tqdm.

So I'm not just blindly waiting. tqdm is just a progress bar so we can see what is actually happening. Oh, and that needs to be from, not import. Okay, it doesn't take too long. Great, so now within that directory we should see a ton of images. So let's have a look; we'll just do os.listdir.
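A minimal sketch of the save loop, assuming each row's image is a PIL image (which has a save method) and that files are named by row index as in the video. The function name save_images is made up for illustration, and the fallback for a missing tqdm is an addition so the sketch runs either way.

```python
import os

try:
    from tqdm.auto import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: no progress bar.
    def tqdm(iterable, **kwargs):
        return iterable


def save_images(data, out_dir="images"):
    """Save data[i]["image"] to out_dir/<i>.jpeg for every row."""
    os.makedirs(out_dir, exist_ok=True)
    for i in tqdm(range(len(data))):
        # Each image is expected to be a PIL image, which provides .save(path).
        data[i]["image"].save(os.path.join(out_dir, f"{i}.jpeg"))
    # Return the directory listing so the result can be inspected.
    return sorted(os.listdir(out_dir))
```

Calling save_images(data) with the dataset loaded earlier should leave one file per row in the images directory.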

Okay, and we can see that we have all of these. I think it's 21 in total: it goes up to 20 and starts at zero, so we have 21. So that is all of our images downloaded, and now what I want to do is go ahead and compress them all into a tar file.

So how do we do that? Well, I think we had a look at this already in the previous video. So come here and we can kind of see this. So we come down, let's have a look. Pretty much this right here. Now, I'm not sure if that will work for a directory, but let's try.

So with tarfile.open, and we want images. Add images. Let's see. Okay, so I think that has worked. Let's try and open this. So double click, and let's see what we have in there. Yeah, we have everything we need. Great, so that has compressed correctly. So that's all we need to actually build our dataset.
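The archive step can be sketched with the standard library as below. Adding a directory does work: tarfile adds its contents recursively by default. The function name make_archive and the arcname handling are choices made for this sketch, not necessarily what the video's notebook does.

```python
import os
import tarfile


def make_archive(src_dir="images", archive_path="images.tar.gz"):
    """Pack src_dir into a gzip-compressed tar archive."""
    # "w:gz" writes a gzip-compressed archive; plain "w" would bundle
    # the files without compressing them.
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory name inside the archive,
        # not the full path it was added from.
        arcname = os.path.basename(os.path.normpath(src_dir))
        tar.add(src_dir, arcname=arcname)
    return archive_path
```

With the defaults, make_archive() turns the images directory into images.tar.gz in the current working directory.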

So now what I want to do is add this to what will be our Hugging Face datasets directory, and we'll also add a dataset loading script. Let's go ahead and copy that from the previous video, and then we'll just modify it from there. So we have this.

I'm going to copy it and paste it in here. Okay, and come down to here. Cool, so we have all these features here, which are the expected features within our dataset. Now this is going to consist of two items. We have the text, which is a kind of text description of the image, and then we also have the image itself.

So let's modify this a little bit. We're going to have text, and then we're going to have image. Okay, we'll delete the rest because we don't need those. Then for the value of this feature here, we're not going to use a string, obviously. We're going to use a special one, which is just called Image.

So we use that, and then we can modify this and say the home page is, well, it's not this. So I'll just use the previous location, basically huggingface.co/datasets/jamescalam, and this would be image-text-demo. Now let's come down here, and we will need to modify this as well.

So we will need to download and extract that tar file. To get started, let's just go ahead and upload it to Hugging Face so we can see the actual URL for that file. So we'll come over here, and I'm going to create a new dataset. I'm just going to call it image demo for now.

I'll keep it public briefly before I remove it. What I'm going to do is go to Files. I'm going to add file, and we are going to use the tar file that we just created. So images.tar. Drag this in here, so images.tar.gz. Add that, and I'm going to commit those changes.

Okay, so in here we now have this file. Let's click on it, and on this download button here just right click and copy link address. That's going to be our URL. So do we have, okay, URL here. I'm going to go ahead and paste it into this.

Okay, so huggingface.co, image demo, resolve, and then we have images.tar.gz. Okay, so we have an iterable object that will go through the compressed file and iteratively extract items from it. So what we will do is, let's call it image_iters, and this will be dl_manager.iter_archive(path).
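The copied link follows the Hub's resolve URL pattern for dataset repositories. The sketch below rebuilds it from its parts; the helper name hub_file_url and the repo id jamescalam/image-demo are assumptions for illustration, since the video only shows the URL on screen.

```python
def hub_file_url(repo_id, filename, revision="main"):
    """Build the direct-download URL of a file in a Hub dataset repo."""
    return (
        f"https://huggingface.co/datasets/{repo_id}"
        f"/resolve/{revision}/{filename}"
    )


# Assumed repo id; replace it with your own <user>/<dataset> name.
url = hub_file_url("jamescalam/image-demo", "images.tar.gz")
```

Copying the link from the download button, as done in the video, gives the same kind of URL without any code.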

Like that. Okay, that's perfect. And then in here, what we're returning is still the list; we still have the split generator, we still have the train split. The only thing that will change is this. So instead of filepath here, we're going to call it images, which just means here we're going to change this to images, and that will be equal to the image_iters item.

Okay, or iterable object. So those are our images, and then the last thing to do is rewrite this method, _generate_examples. Now, here what it was going to do is open the file, so with open; this is going to be different, so let's change it. We're going to iterate through the images.

Iterate through images. So for image in images, and what this is going to give us is actually both the file path and the image itself from this iterable object. So for filepath, image in images. What we also need to do is extract the text for each item. One way of doing this is saving the text within the file name; another way is storing a separate mapping file, which will map from each row to a particular description.
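To make the shape of that iterable concrete, here is a standard-library stand-in that yields the same kind of (path, file object) pairs. It is not the library's implementation of iter_archive, only an approximation, and the name iter_tar is made up for this sketch.

```python
import tarfile


def iter_tar(archive_path):
    """Yield (path_in_archive, file_object) for each regular file."""
    # "r:*" lets tarfile detect the compression (gzip here) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            # Skip directories and other non-file entries.
            if member.isfile():
                # The file object must be read before moving to the next
                # item, because the archive is closed once iteration ends.
                yield member.name, tar.extractfile(member)
```

Looping with for filepath, image in iter_tar("images.tar.gz") then mirrors the loop written in the loading script, with image.read() returning the raw bytes of each file.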

So what we can do for this is go back into this Jupyter notebook, where we have all the descriptions already. So I'm just going to grab them. It's going to be data["text"], and this is just a list of all the descriptions.

I'm just going to use this. Obviously you're not going to do this for a big dataset, but it's okay for this example. I think it will let me scroll down all the way. Okay, so I'm just going to copy this, and I'm going to copy it straight into the code.

Okay, and I'll put it up here. I'm going to call them descriptions. Okay, we have all those, and down here we will need to yield. We have a generator here, so we're using yield, and we need to yield the index value.

So idx, and then we also want to yield the items in there. So the image object needs to include the image. Well, I'll explain that in a minute. So it's going to be a dictionary where you have the file path, which goes to the file path that we just extracted, and then also the image.

So the image is going to be what we had there. So image, then we need to read it, like this, and then the text is going to be descriptions indexed by the index value. Then after that we just want to do idx += 1, and that will just iterate through the whole thing.
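The generator logic on its own can be sketched as a plain function. Two things differ from what is typed at this point in the video: enumerate replaces the manual idx += 1, and the image dictionary uses the keys path and bytes, which is what the fix later in the video arrives at. The input data below is a tiny in-memory placeholder, not real image content.

```python
import io


def generate_examples(images, descriptions):
    """Yield (index, example) from an iterable of (path, file object) pairs."""
    for idx, (filepath, image) in enumerate(images):
        yield idx, {
            # The image feature accepts a dict holding the path and raw bytes.
            "image": {"path": filepath, "bytes": image.read()},
            # Descriptions are matched to images purely by position.
            "text": descriptions[idx],
        }


# Placeholder stand-ins for the archive contents and the description list.
fake_images = [("images/0.jpeg", io.BytesIO(b"\xff\xd8"))]
examples = list(generate_examples(fake_images, ["a placeholder description"]))
```

Matching by position only works if the archive order lines up with the description list, which is one reason the mapping-file approach mentioned earlier is safer for larger datasets.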

So that, I believe, should be pretty much everything. I'm going to rename this to, what do we have, images demo I think, and I'm going to rename the file as well. So let me open that there, and we'll call this images demo as well. Now we're going to head on over to here, go to image demo, add file, upload files, and I'm just going to drag that images demo Python file into there and commit those changes. Then we'll just test it and see if we've covered everything.

There's probably going to be something missing. So let's go back to our notebook, images dataset, and let's just try and see if that works. So just copy this again, come to here, and this one is called, I think, images demo. Let's try that. Okay, what's this?

Okay, so I think I've entered the wrong dataset name: image demo, without the s. Okay, it's working so far. Okay, no, there's a problem somewhere. So here there's an error: that should read path, not filepath, so let me modify that quickly. In fact, we can actually do it in here, so image demo, we're going to edit. Let's come down here. So this should be path, and this here, image, should be bytes.

Okay, let's commit those changes and try again. Come up to here, and let's go. Okay, that looks pretty good. We have the dataset description here. Let's try data and zero and see what we have: we have the text, and then we have the image object. Let's go again with image, and okay, there we go.

So we've built our image-enabled Hugging Face dataset. It's, I think, relatively straightforward. Obviously, when you have a lot of image files you're going to need to find somewhere to store them, so rather than creating a single tar file you will need to create multiple tar files and store your images across those. But other than that, the logic is pretty much the same as what you've seen here.
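One way to split a large image directory across several archives is sketched below. The shard size, the file naming scheme, and the function name make_shards are all assumptions; the video does not show this step.

```python
import os
import tarfile


def make_shards(src_dir, out_dir, files_per_shard=1000):
    """Split the files in src_dir across several .tar.gz archives."""
    os.makedirs(out_dir, exist_ok=True)
    names = sorted(os.listdir(src_dir))
    shard_paths = []
    for start in range(0, len(names), files_per_shard):
        # Shards are numbered in creation order: images-00000.tar.gz, ...
        shard_path = os.path.join(
            out_dir, f"images-{len(shard_paths):05d}.tar.gz"
        )
        with tarfile.open(shard_path, "w:gz") as tar:
            for name in names[start:start + files_per_shard]:
                # arcname stores just the file name inside the archive.
                tar.add(os.path.join(src_dir, name), arcname=name)
        shard_paths.append(shard_path)
    return shard_paths
```

In the loading script, each shard would then be downloaded and passed through iter_archive in turn, with the examples from all shards yielded under one running index.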

So I hope this has been interesting and useful. Thank you very much for watching, and I will see you again in the next one. Bye.