Hugging Face Datasets #3 | Adding Images
Chapters
0:00 Intro
2:05 Creating Tar Files for Images
5:11 Compressing Images in Tar Files
6:26 Adding Dataset Builder Script
9:07 Iterable Download Manager with iter_archive
9:56 _generate_examples Function Definition
12:52 Adding to Hugging Face Datasets Hub
13:34 Fixing Errors
14:23 Using Your New Dataset
14:53 Dealing with Larger Image Datasets
Okay, so today we're going to take a look at the third video in this series on using Hugging Face Datasets. This time we're looking at how we can include images in our datasets, so the end result will look like the jamescalam/image-text-demo dataset — that's what we're going to recreate. It has the images, and in the dataset preview on the Hub you get this nice little widget where you can actually see each image, plus some accompanying text. The text isn't so important here, but you can scroll through and see all the different images that get loaded, and we're going to learn how to do the same thing.

So we'll come over to a notebook, and the first thing I'm going to do is get the images from that same dataset. So, from datasets import load_dataset, and we just load that dataset — the same one as above, taking the training split. It might take a little while to download if you haven't downloaded it before. Each row has the text and the image, so from here we can go into the first item, row zero, and look at its image — and that gives us the image itself.
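A minimal sketch of those first steps — assuming the source dataset ID is jamescalam/image-text-demo, which is how the repo referenced here appears on the Hub:

```python
from datasets import load_dataset

# load the existing image + text dataset we want to recreate
data = load_dataset("jamescalam/image-text-demo", split="train")

# each row has a "text" field and an "image" field;
# the image comes back as a PIL Image object
data[0]["image"]
```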
Now, to do that ourselves we have to do something slightly different from just including the data in a JSON Lines file, because obviously you can't put an image in a JSON Lines file — unless you store something like the raw image bytes, which won't load in this nice way; you'd have to do extra processing steps just to view an image. So how do we do it so that we just get this nice image? Well, the first thing we're going to do is create a tar file where we take all of our images and put them into one compressed archive, and that file will be hosted somewhere — you can host it in different places, but in this example we're going to host it on Hugging Face. So let's get started: we'll download each one of these images and create a tar file from them.
First we're going to import os — right now I'm just preparing the data we'll build this dataset from. I'm going to create a new directory called images, but only if it doesn't already exist, so: if not os.path.exists("images"), then create it. After that, what I want to do is iterate through each one of these images and save them to file. Let's work that out with a single example first: data[0]["image"] is a PIL Image object, and to save it we just call image.save with a file path. Now if I have a look in my file explorer, we have the images directory, and inside it 0.jpg, which opens to that first image.

So what we're going to do now is just repeat that logic for every item in the dataset: for i in range(len(data)), do exactly the same thing. It might take a little bit of time, because it needs to load every image and each image is pretty big, so I'm also going to do from tqdm.auto import tqdm — tqdm is just a progress bar, so we can see what's actually happening rather than waiting blindly. It doesn't take too long. Great, so now within that directory we should see all of the images; let's check with os.listdir. We can see all of them there — 21 in total, since the filenames start at 0 and go up to 20. That's all of our images downloaded.
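Roughly, that preparation step looks like this (the {i}.jpg filename pattern is an assumption — any unique name per image works):

```python
import os
from tqdm.auto import tqdm  # progress bar, so we aren't waiting blindly

# create the output directory only if it doesn't already exist
if not os.path.exists("images"):
    os.mkdir("images")

# save every PIL image in the dataset to its own JPEG file
for i in tqdm(range(len(data))):
    image = data[i]["image"]
    image.save(f"images/{i}.jpg")

os.listdir("images")  # should list 21 files, 0.jpg through 20.jpg
```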
Now what I want to do is compress them all into a tar file. How do we do that? Well, we already had a look at this in the previous video, so it's pretty much the same thing here. I'm not sure it works quite the same on a whole directory, but let's try: tarfile.open to create the archive, then add the images directory. Okay, I think that has worked. Let's open the archive and see what we have in there — yeah, we have everything we need, so that has compressed correctly. That's all we need to actually build our dataset.
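As a sketch, the compression step is just:

```python
import tarfile

# pack the whole images/ directory into a single gzipped tar archive
with tarfile.open("images.tar.gz", "w:gz") as tar:
    tar.add("images")
```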
So now what I want to do is add this to what will be our Hugging Face dataset directory, and we'll also add a dataset loading script. Let's go ahead and copy the loading script from the previous video and modify it from there. I'll paste it in and come down to the features, which describe the expected fields in our dataset. This dataset is going to consist of two items: the text, which is a short description of the image, and the image itself. So let's modify the features a little: we'll have text and image, and delete the rest because we don't need those. For the image feature we're obviously not going to use a string value — we're going to use a special feature type that is just called Image.
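Inside the loading script (which already does import datasets at the top), the features block ends up looking something like this sketch:

```python
features = datasets.Features({
    "text": datasets.Value("string"),  # the image's description
    "image": datasets.Image(),         # special feature type for images
})
```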
With that in place we can also modify the homepage — it shouldn't be the previous one, so following the same pattern as before it becomes huggingface.co/datasets/jamescalam/image-text-demo.
Now let's come down further, because we'll need to modify the download logic as well: we need to download and extract that tar file. To get started, let's go ahead and actually upload the archive to Hugging Face so we can see its real URL. Over on the Hub I'm going to create a new dataset repository — I'll just call it image_demo for now, and keep it public briefly before I remove it. Then I go to Files, Add file, and drag in the tar file we just created, images.tar.gz, and commit those changes. So now the file is in the repo; if we click on it, there's a download button, and right-clicking that lets us copy the link address — that's going to be our URL. Back in the loading script I paste it in as the URL, which looks like huggingface.co/datasets/…/image_demo/resolve/…/images.tar.gz.
What we want from that is an iterable object that will go through the compressed file and iteratively extract items from it. So let's call it image_iters, and it will be dl_manager.iter_archive(path). That's perfect. Then, in what we return, we still have the list with the split generator and the train split — the only thing that changes is the gen_kwargs: instead of a filepath we're going to call it images, and it will be equal to that image_iters iterable. So those are our images.
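Putting that together, _split_generators ends up roughly like the sketch below; the user and repo in the URL are placeholders, so substitute your own:

```python
# placeholder URL for the archive uploaded above
_URL = "https://huggingface.co/datasets/<user>/image_demo/resolve/main/images.tar.gz"

# method of the GeneratorBasedBuilder subclass in the loading script
def _split_generators(self, dl_manager):
    # download the tar archive and get an iterator over its contents
    path = dl_manager.download(_URL)
    image_iters = dl_manager.iter_archive(path)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"images": image_iters},
        ),
    ]
```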
Then the last thing to do is actually rewrite _generate_examples. Here the old with open(...) logic is going to be different, so let's change it: we're going to iterate through the images — for each item in images — and each item from that iterable actually includes both the file path and the file object itself. What we still need is the text for each item. One way of doing this would be to save the text within the file name; another is to store a separate mapping file that maps each row to a particular description.
What we can do for this example is go back into the Jupyter notebook, where we already have all the descriptions: data["text"] is just a list of them all. Obviously you're not going to do this for a big dataset, but it's fine for this example — I'm just going to copy that list straight into the loading script, put it near the top, and call it descriptions.
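In the notebook that's just:

```python
# grab every caption so the list can be pasted into the loading script
descriptions = data["text"]  # a plain Python list of strings, one per image
```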
With all of those in place, down in _generate_examples we need to yield — it's a generator, so we're using yield — the index value, idx, together with a dictionary for the example. I'll explain the structure in a minute, but it contains the file path that we just extracted, the image itself — which we get by reading the file object — and the text, which is descriptions indexed by idx. After that we just do idx += 1, and that will iterate through the whole archive. That, I believe, should be pretty much everything.
I'm going to rename the dataset in the script — let's call it images demo — and rename the file to match. Now head over to the repo on the Hub, go to Add file, Upload files, drag that images demo Python file in, and commit the changes. Then we'll just test it and see if we've covered everything — there's probably going to be something missing. So let's go back to our notebook, copy the load_dataset call again, and point it at the new dataset, which I think is called images demo. Let's try that.
Okay, what's this? I think I've entered the wrong dataset name — it's image demo, without the s. With that fixed it's working so far... okay, now there's a problem somewhere else. Looking at the error, that key should read path, not file path, and the image data should go under bytes. Let me modify that quickly — in fact we can edit the script directly in the image_demo repo: this should be path, and this here, for the image, should be bytes. Commit those changes, come back to the notebook, and try again — let's go. This time it loads.
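With those fixes applied, _generate_examples ends up looking roughly like this sketch (descriptions is the list pasted into the script earlier):

```python
# method of the GeneratorBasedBuilder subclass in the loading script
def _generate_examples(self, images):
    idx = 0
    # iter_archive yields (file_path, file_object) pairs from inside the tar
    for file_path, file_obj in images:
        yield idx, {
            "text": descriptions[idx],
            # the Image feature accepts a dict of path + raw bytes
            "image": {"path": file_path, "bytes": file_obj.read()},
        }
        idx += 1
```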
That looks pretty good: we have the dataset description, and if we try data[0] we get the text and then the image object. Let's display the image as well — okay, there we go.
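Using the new dataset is then the same as for any other Hub dataset (again, the repo path here is a placeholder):

```python
from datasets import load_dataset

data = load_dataset("<user>/image_demo", split="train")

data[0]["text"]   # the description string
data[0]["image"]  # the decoded PIL image
```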
So we've built our image-enabled Hugging Face dataset. I think it's relatively straightforward. Obviously, when you have a lot of image files you're going to need to find somewhere to store them, so rather than creating a single tar file you'll want to create multiple tar files and spread your images across those — but other than that, the logic is pretty much the same as what you've seen here.
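As a rough sketch of how that multi-archive setup could look in the loading script (the file names, count, and gen_kwargs naming here are illustrative assumptions, not something shown in the video):

```python
# several smaller archives instead of one big images.tar.gz
_URLS = [
    f"https://huggingface.co/datasets/<user>/image_demo/resolve/main/images_{i}.tar.gz"
    for i in range(4)
]

def _split_generators(self, dl_manager):
    paths = dl_manager.download(_URLS)  # one local path per URL
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"archives": [dl_manager.iter_archive(p) for p in paths]},
        ),
    ]

def _generate_examples(self, archives):
    idx = 0
    # walk each archive in turn, yielding examples exactly as before
    for archive in archives:
        for file_path, file_obj in archive:
            yield idx, {
                "text": descriptions[idx],
                "image": {"path": file_path, "bytes": file_obj.read()},
            }
            idx += 1
```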
I hope this has been interesting and useful. Thank you very much for watching, and I will see you again in the next one. Bye.