back to index

Hugging Face Datasets #3 | Adding Images


Chapters

0:00 Intro
2:05 Creating Tar Files for Images
5:11 Compressing Images in Tar Files
6:26 Adding Dataset Builder Script
9:07 Iterable Download Manager with iter_archive
9:56 _generate_examples Function Definition
12:52 Adding to Hugging Face Datasets Hub
13:34 Fixing Errors
14:23 Using Your New Dataset
14:53 Dealing with Larger Image Datasets

Whisper Transcript | Transcript Only Page

00:00:00.000 | Okay so today we're going to take a look at the third video in this series on
00:00:05.520 | using Hugging Face datasets. Today we're going to have a look at how we can
00:00:10.800 | include images in our datasets. So it will look kind of like this dataset here.
00:00:16.560 | So this is James Callum image text demo. This is what we're going to recreate. So
00:00:22.400 | we're going to have the images here and you see that you're within like the
00:00:26.200 | dataset preview you get this nice little widget where you actually see the image
00:00:30.480 | and then we're also gonna have some text here, that's not so important, but
00:00:35.000 | you can scroll through and there's all these different images that are loaded
00:00:38.160 | and we're going to learn how to do the same thing. So we'll come over to a
00:00:42.360 | notebook and first thing I'm going to do is actually get the images from that
00:00:47.520 | same dataset. Okay, so from datasets import load_dataset, and we're just gonna
00:00:57.760 | load that dataset. So data equals load_dataset, and that will be the same as what
00:01:05.680 | you have up here so I can copy this and it will be the training split. Okay it
00:01:16.040 | might take a little while to download if you haven't downloaded it before. So you
00:01:20.720 | have text and the image so from here what we can do is actually go into the
00:01:26.440 | first item so we go row zero and look at that image. Okay and then we actually get
00:01:34.200 | the image from that dataset.
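
A minimal sketch of the loading step above, assuming the source repo ID shown on screen is jamescalam/image-text-demo:

```python
from datasets import load_dataset

# download the existing image-text dataset we want to recreate
data = load_dataset("jamescalam/image-text-demo", split="train")

print(data)        # two features: 'text' and 'image'
data[0]["image"]   # the first row's image, a PIL image object
```
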
00:01:40.040 | Now to do that, we have to do something slightly different from just including the data within a JSON Lines file, because obviously an image
00:01:46.360 | can't be included in a JSON Lines file unless you use something like the raw image bytes,
00:01:50.200 | which will not load in this way; you would have to do some extra processing
00:01:56.280 | steps in order to actually view an image if you did that. So how do we do it in
00:02:01.680 | this way, where we actually just get this nice image? Well, the first thing
00:02:06.320 | we're going to do is actually create a tar file where we will take all of our
00:02:10.620 | images and put them into this compressed file and that file will be hosted you
00:02:15.760 | can host it in different places but we're going to host it on Hugging Face
00:02:20.600 | in this example. So let's get started; we'll download each
00:02:25.920 | one of these files and we're going to create a tar file from them. So the first
00:02:31.040 | thing we're going to do is import os. So right now I'm just preparing the data to
00:02:36.380 | actually create this dataset from. So I'm going to do os.makedirs; I'm just
00:02:42.480 | going to create a new directory called images, if it doesn't already exist. So
00:02:48.280 | we'll say if os.path.exists, and we want './images' in there, then create it, but
00:03:02.100 | yeah, I actually want to make this 'if not'. Okay, and then after that what I want to do is
00:03:09.800 | iterate through each one of these images and just save them to file. So let's
00:03:15.000 | create that first and let's see how we can do that. So we will go, so, zero image,
00:03:21.000 | so data zero image, and let's have a look at what that image is. So it's going to
00:03:28.860 | be a PIL image object, and what I want to do is actually save that. So we
00:03:35.920 | just do, if I remember correctly, image.save, and yeah I'm
00:03:44.400 | going to do this. Okay now if I have a look in my file explorer over here we
00:03:53.920 | see that we have images and then here we have zero dot JPEG open this and we have
00:03:59.960 | that image. So what we're going to do now is just repeat that logic for every
00:04:05.080 | item in that dataset. So for i in range len of that dataset, we are going to do
00:04:14.720 | exactly what it says here. So yeah, we can add it; it might take a little bit of time
00:04:20.880 | because it needs to load every image and each image is pretty big, so I'm going to
00:04:25.240 | do from tqdm.auto import tqdm, so I'm not just blindly waiting. tqdm, this is
00:04:32.760 | just a progress bar so we can see what is actually happening. Oh, and we need to
00:04:38.280 | use from, not import. Okay, it doesn't take too long.
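
A rough sketch of the directory setup and saving loop described above; the './images' folder and the 0.jpg, 1.jpg, ... file names follow what's shown in the notebook:

```python
import os
from tqdm.auto import tqdm  # progress bar so we're not blindly waiting

# create the images directory if it doesn't already exist
if not os.path.exists("./images"):
    os.makedirs("./images")

# save every image in the dataset to file
for i in tqdm(range(len(data))):
    image = data[i]["image"]        # a PIL image object
    image.save(f"./images/{i}.jpg")
```
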
00:04:45.560 | Great, so now within that directory we should see a ton of images. So let's have a look, we'll just do
00:04:51.760 | os.listdir there. Okay, and we can see that we have all of these; I think it's 21 in
00:05:00.560 | total, it goes up to 20 and starts at zero, so we have 21. So that is all of our images
00:05:09.160 | that we downloaded, and now what I want to do is go ahead and compress them all
00:05:14.280 | into a tar file. So how do we do that? Well I think we had a look at this
00:05:19.600 | already in the previous video. So come here we can kind of see this. So we come
00:05:29.000 | down let's have a look. Pretty much this right here. Now I'm not sure if that will
00:05:37.040 | work for a directory, but let's try. So with tarfile.open and we want
00:05:44.880 | images.tar.gz; add images. Let's see. Okay, so I think that has worked okay. Let's try and
00:05:57.680 | open this. So double click and okay let's see what we have in there. Yeah we have
00:06:05.160 | everything we need. Great, so that has compressed correctly.
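
Roughly, the compression step, assuming a gzip-compressed tar of the whole images directory as in the previous video:

```python
import tarfile

# compress the whole images directory into a single gzipped tar file
with tarfile.open("images.tar.gz", "w:gz") as f:
    f.add("./images")
```
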
00:06:10.600 | So that's all we need to actually build our dataset. So now what I want to do is add this to what
00:06:18.900 | will be our Hugging Face datasets directory and we'll also add a dataset
00:06:25.440 | loading script. So dataset loading script. Let's go ahead and copy that from the
00:06:31.000 | previous video and then we'll just modify it from there. So we have this.
00:06:39.520 | I'm gonna copy and I'm gonna paste it in here. Okay and come down to here. Cool so
00:06:53.000 | we have all these features here which are the expected features within our
00:06:59.140 | dataset. Now this is going to consist of two items. We have the text which is a
00:07:04.840 | kind of like text description of the image and then we also have the image
00:07:08.780 | itself. So let's modify this a little bit. We're gonna have text and then we're
00:07:12.760 | gonna have image. Okay we'll delete the rest because we don't need those and
00:07:16.760 | then for the value of this feature here we're not going to use a string
00:07:22.360 | obviously. We're going to use a special one which is just called image. So we
00:07:27.080 | use that and then what we can do is we can modify this and say the home page is
00:07:33.920 | it's not this. So if I just use the previous location, basically huggingface.co
00:07:41.320 | /datasets/jamescalam, this would be image-text-demo.
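
A sketch of what the modified features in the loading script's `_info` might look like; only the 'text' and 'image' fields discussed in the video are shown:

```python
import datasets

features = datasets.Features({
    "text": datasets.Value("string"),  # the text description
    "image": datasets.Image(),         # special Image feature rather than a string Value
})
```
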
00:07:50.800 | Now let's come down here and we will need to modify this as well. So we will need to download and extract
00:07:57.240 | that tar file. So to get started let's just go ahead and actually upload that
00:08:02.800 | to HuggingFace so we can see the actual URL for that file. So we'll come over
00:08:07.520 | here I'm going to create a new dataset. I'm just gonna call it image demo for
00:08:13.040 | now. I'll keep it public briefly before I remove it. What I'm gonna do is
00:08:19.960 | go to files. I'm going to add file and we are going to use the tar file that we
00:08:28.060 | just created. So images.tar. So drag this in here. So images.tar.gz. Add that
00:08:34.920 | and I'm gonna commit those changes. Okay so in here we now have this file. Let's
00:08:40.420 | click on here and what I want to do is this download button here just right
00:08:45.880 | click and we're gonna copy link address. That's gonna be our URL. So do we have
00:08:52.560 | okay URL here. I'm gonna go ahead and paste it into this. Okay so HuggingFace.co
00:09:01.240 | image demo resolve and then we have images.tar.gz. Okay so we have an
00:09:08.160 | iterable object that will go through the compressed file and iteratively extract
00:09:13.760 | items from it. So what we will do is say, let's call it image_iters, and this will
00:09:21.840 | be dl_manager.iter_archive(path). Like that. Okay, that's perfect, and then in
00:09:29.360 | here we're going to be returning. We're gonna be having the list. We still have the
00:09:33.400 | split generator. We still have the split train. The only thing that will change is
00:09:38.340 | this. So instead of file path here, we're gonna call it images, which
00:09:42.680 | just means here we're going to change this to images and that will be equal to
00:09:48.080 | the image_iters iterable object. Okay, so those are our images.
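
A rough sketch of the modified `_split_generators`; the `_URL` value is a placeholder standing in for the images.tar.gz link copied above:

```python
_URL = "https://huggingface.co/datasets/<user>/image-demo/resolve/main/images.tar.gz"

def _split_generators(self, dl_manager):
    # download the tar archive and get an iterator over its members
    path = dl_manager.download(_URL)
    image_iters = dl_manager.iter_archive(path)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"images": image_iters},
        ),
    ]
```
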
00:09:58.280 | And then the last thing to do is actually rewrite this object, _generate_examples.
00:10:03.860 | Now here, what I want to do is open the file. So this 'with open' is going
00:10:10.480 | to be different. So let's change it. So we're going to iterate through the
00:10:15.000 | images. Iterate through images. So for image in images, and what this is going
00:10:23.400 | to do is actually include both the file path and the image itself. Okay from this
00:10:29.200 | iterable object. So for file_path, image in images. What we're going to do is extract the
00:10:35.240 | text from each item. So the text maybe like one way of doing this is saving the
00:10:42.120 | text within the file name or another way is just storing another like mapping
00:10:46.760 | file which will map from each row to a particular description. So what we can do
00:10:54.840 | for this is actually to go back into here. So into this Jupyter
00:11:00.920 | notebook and what we can do is we have all the descriptions already. So I'm just
00:11:06.040 | going to grab them. So it's going to be data['text'] and this is just a list of
00:11:12.920 | all the descriptions. I'm just going to use this. Okay, so obviously
00:11:16.280 | you're probably not going to do this for a big dataset, but this is okay for
00:11:20.720 | this example. I think it will let me scroll down all the way. Okay so I'm
00:11:26.600 | just going to copy this. I'm going to copy it straight into the code. Okay and I'll
00:11:31.440 | put it up here. So I'm going to call them, like, descriptions.
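
In the notebook this is roughly the following; `descriptions` is just the name the video gives the pasted list:

```python
# grab all of the text descriptions as a plain Python list,
# then paste the printed list into the loading script as `descriptions`
descriptions = data["text"]
print(descriptions)
```
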
00:11:38.000 | Okay, and we have all those, and so down here what we will have is that we will need to yield. So we have
00:11:45.120 | generator objects here so we're using yield and we need to yield the index
00:11:50.960 | value, so idx, and then we also want to yield the items in there. So the
00:11:56.880 | image object needs to include the image.
00:12:00.960 | Well I'll explain that in a minute. So it's going to be a dictionary where you
00:12:05.520 | have the file path, which goes to the file path that we just extracted, and
00:12:11.040 | then also the image. So the image is going to be what we had there. So image
00:12:16.600 | then we need to read it like this and then the text is going to be descriptions
00:12:21.560 | indexed by the index value. So then after that we just want to do idx plus equals
00:12:27.160 | one and that will just iterate through the whole thing. So that I believe should
00:12:34.320 | be pretty much everything. I'm going to rename this to what do we have like
00:12:39.560 | images demo, I think. I'm going to rename the file as well. So let me open
00:12:46.440 | that there and we'll call this images demo as well, and now we're going to
00:12:54.120 | head on over to here go to images demo we're going to add file upload files and
00:13:00.600 | I'm just going to drag that images demo Python file into there and commit those
00:13:07.080 | changes, and we'll just test it, see if we've covered everything there. There's
00:13:11.880 | probably going to be something missing. So let's go back to our notebook,
00:13:16.280 | images dataset, and let's just try and see if that works. So just copy
00:13:23.680 | this again come to here and this one is called I think images demo let's try
00:13:33.560 | that. Okay what's this? Okay so I think I've entered the wrong dataset name.
00:13:43.400 | Image demo, without the 's'. Okay, it's working so far. Okay, there's a problem
00:13:52.800 | somewhere. Okay, so here there's an error: it should read 'path', not 'file_path', so
00:13:58.620 | let me modify that quickly, and in fact we can actually do it in here. So in image
00:14:03.320 | demo we're going to edit; let's come down here, so this should be 'path', and this
00:14:12.400 | here shouldn't be 'image', it should be 'bytes'.
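
Putting that together, with the 'path'/'bytes' fix applied, the `_generate_examples` function might look roughly like this, where `descriptions` is the list pasted in earlier:

```python
def _generate_examples(self, images):
    # iterate through the tar archive; each item is a (file_path, file_object) pair
    idx = 0
    for file_path, image in images:
        yield idx, {
            "text": descriptions[idx],
            "image": {"path": file_path, "bytes": image.read()},
        }
        idx += 1
```
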
00:14:19.000 | Okay, let's commit those changes, try again. Okay, come up to here, let's try again, let's go. Okay, that looks
00:14:25.560 | pretty good; we have the dataset description here. Let's try data at zero,
00:14:32.440 | see what we have: we have the text and then we have the image object, and let's
00:14:39.960 | go again, image, okay, there we go.
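
So loading the finished dataset is just the following; "<user>/image-demo" is a placeholder for the repo created above under your own account:

```python
from datasets import load_dataset

data = load_dataset("<user>/image-demo", split="train")

data[0]["text"]    # the description
data[0]["image"]   # the PIL image, rendered inline in a notebook
```
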
00:14:49.000 | So we've built our image-enabled Hugging Face dataset. It's, I think, relatively straightforward; obviously
00:14:53.720 | when you have a lot of image files you're going to need to find somewhere
00:14:58.240 | to store them so what you will want to do rather than creating a single tar
00:15:03.040 | file you will need to create multiple tar files and store your images across
00:15:07.240 | those but other than that the logic is pretty much the same as what you've seen
00:15:11.240 | here. So I hope this has been interesting and useful thank you very much for
00:15:17.320 | watching and I will see you again in the next one, bye.
00:15:28.180 | (music fades)