How to Build TensorFlow Pipelines with tf.data.Dataset
Chapters
0:00
0:27 Use the Pipeline Dataset Object
4:40 Reading It Directly from File
7:53 Field Delimiter
8:46 Select Columns
11:41 Shuffle and Batch Methods
16:05 Map Method
22:19 Define a Model
00:00:00.520 |
All right, we're going to go through the TensorFlow data sets. 00:00:20.600 |
But we are just going to go and dive right into it. 00:00:31.760 |
we need to actually import TensorFlow, of course. 00:00:38.640 |
And we're also going to be using Pandas and NumPy 00:00:42.400 |
for a few examples here, so we'll go ahead and import those as well. 00:00:49.180 |
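For reference, those imports are just:

    import tensorflow as tf
    import pandas as pd
    import numpy as np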
Okay, so there are a few different ways that we can read data into a dataset; 00:01:12.300 |
if you have seen this before, you will have seen the first of them. 00:01:15.040 |
We'll go ahead and put that together quickly. 00:01:20.880 |
So we can take a couple of Python lists and put them together, 00:01:28.120 |
so we'll just put some together really quickly here. 00:01:32.160 |
Okay, so we just have an input and an output; they're both lists. 00:01:38.560 |
And then to create our dataset object from these two, we pass them in as a single tuple, 00:02:10.300 |
because this only accepts one input parameter. 00:02:16.040 |
And that input parameter is basically all of the data 00:02:18.540 |
that we are going to be feeding into our model later on. 00:02:28.080 |
The format it expects when it's feeding into a model is simply one input tensor 00:02:33.820 |
and one output tensor or target label tensor. 00:02:41.420 |
Here, once it loads, we will have built our first data set. 00:02:59.200 |
So we're just going to see what it looks like. 00:03:08.840 |
This is the tuple format that we created here. 00:03:18.380 |
The input value comes through as a tensor, or NumPy integer, which is zero, which matches up to this. 00:03:18.380 |
And then next to that, we have the output value. 00:03:23.820 |
And then it's the same for the following three rows in there. 00:03:32.980 |
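As a rough sketch of what's being run here (the list values are placeholders I've made up, not the ones from the video):

    # two plain Python lists: inputs and their target labels (example values)
    inputs = [0, 1, 2, 3]
    labels = [0, 1, 1, 0]

    # from_tensor_slices takes a single argument, so both lists go in as one tuple
    dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

    # each element is an (input, label) tuple of scalar tensors
    for x, y in dataset:
        print(x.numpy(), y.numpy())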
So we can also do the same with NumPy arrays. 00:03:38.560 |
Which is literally pretty much exactly the same. 00:03:57.020 |
This time, though, rather than passing separate inputs and outputs, 00:04:09.620 |
we'll create a slightly different format here. 00:04:14.460 |
And with this, we would reformat or restructure the dataset here 00:04:21.100 |
before feeding into our model for it to read everything correctly. 00:04:26.440 |
But for now, we're just going to leave it like that. 00:04:29.000 |
And then we'll go over the mapping and everything pretty soon. 00:04:34.540 |
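A minimal sketch of the NumPy version, assuming a single 2-D array where each row holds the input and the label together, which is one way to end up with a format that needs restructuring later:

    # one 2-D array: column 0 is the input, column 1 is the label (made-up values)
    arr = np.array([[0, 0],
                    [1, 1],
                    [2, 1],
                    [3, 0]])

    # each element is now a single length-2 tensor rather than an (input, label) pair,
    # so it would need restructuring (e.g. with map) before feeding into a model
    dataset = tf.data.Dataset.from_tensor_slices(arr)

    for row in dataset:
        print(row.numpy())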
So the other option we have for reading the data 00:04:38.540 |
into our dataset is actually reading it directly from file. 00:04:42.440 |
So from file, the benefit we get from doing that 00:04:47.120 |
is that we are reading in data from an out-of-memory source. 00:04:51.660 |
And because we're reading from an out-of-memory source, 00:05:01.440 |
we only pull in what we need at any one time, rather than pulling in the entire data source. 00:05:07.340 |
So if we have a big dataset, then this is pretty useful 00:05:13.180 |
because in a lot of cases, we're working with a big dataset 00:05:16.740 |
and we can't actually bring everything into our memory all at once. 00:05:36.400 |
And this is my attempt at demonstrating the difference 00:05:42.240 |
or demonstrating what the from file version of this does. 00:05:55.880 |
And at any one time, we feed in a single batch. 00:06:03.920 |
And then once we're done with that, we go on to the next batch, 00:06:07.420 |
and then we feed it into our dataset transformations. 00:06:22.960 |
And that file is actually from the sentiment analysis dataset. 00:06:31.820 |
You can see the link; I'll put it in the description. 00:06:35.300 |
And we're going to read from this, read from it directly. 00:06:59.060 |
So I know we are actually using tab-separated values here, 00:07:06.560 |
so all we're doing here is changing the field delimiter to a tab. 00:07:21.340 |
And then also in here, we actually define our batch size. 00:07:26.720 |
So we're just going to do something really small for now. 00:07:29.420 |
But obviously, when you are using this for your actual models, 00:07:35.620 |
you would probably be doing something like a batch size of 64. 00:07:44.040 |
Here we keep it small so we can really easily visualize everything. 00:07:48.440 |
So this is where we tell it it's actually a tab-delimited file. 00:08:41.200 |
And then actually another really useful argument here is select_columns. 00:08:49.240 |
So with this, we just pass a list of the columns that we want to keep; 00:09:01.480 |
it depends on what you're doing, obviously, here. 00:09:07.760 |
And then we're also going to keep the input and target data. 00:09:33.660 |
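Putting those arguments together, a hedged sketch of the call; the file name and column names are assumptions based on the sentiment dataset being described, not taken verbatim from the video:

    # stream batches straight from the TSV on disk instead of loading it into memory;
    # shuffling is on by default, which is why the rows come out shuffled
    dataset = tf.data.experimental.make_csv_dataset(
        'train.tsv',                                         # assumed file name
        batch_size=8,                                        # small so it's easy to inspect
        field_delim='\t',                                    # tell it the file is tab-delimited
        select_columns=['PhraseId', 'Phrase', 'Sentiment'],  # assumed column names
    )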
And then let's just have a quick look at what we have. 00:09:41.460 |
So we use this take method to just take the first batch. 00:09:49.700 |
If we say take 20, then it will take the first 20 batches. 00:09:59.580 |
Here we just take one so we can see the actual format within the dataset. 00:10:09.380 |
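For example, continuing from the sketch above:

    # pull out just the first batch and inspect it; each batch maps column names
    # to tensors of length batch_size
    for batch in dataset.take(1):
        print(batch)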
And this is why I wanted to keep the phrase ID in. 00:10:12.420 |
I mean, we don't actually need it for training the model, 00:10:16.780 |
but I wanted to show you that it actually shuffles the data; 00:10:32.540 |
it automatically shuffles everything by default. 00:10:52.800 |
And so that's everything for reading data into our dataset, 00:11:04.000 |
and we'll move on to performing a few operations on it, 00:11:21.960 |
but we're going to load it into our memory first. 00:11:34.760 |
So I'm going to do this because I want to show you how to shuffle and batch it ourselves. 00:11:43.860 |
And obviously, if it's already shuffled and batched, we don't need to do that again. 00:11:49.580 |
But this is useful to know if you're reading things from in-memory. 00:11:53.840 |
Obviously, if you are reading things from your disk, the reader handles that for you. 00:12:01.480 |
And then we're in TSV, so we need to keep the separator as a tab. 00:12:11.120 |
Okay, and let's just make sure we've read it in correctly. 00:12:24.680 |
Then we need to read it into our dataset. 00:12:53.700 |
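A rough sketch of this step; again, the file and column names are assumptions rather than lifted from the video:

    # load the whole TSV into memory with pandas
    df = pd.read_csv('train.tsv', sep='\t')
    print(df.head())  # check it read in correctly

    # build the dataset from the in-memory columns: phrases as inputs, sentiment as labels
    dataset = tf.data.Dataset.from_tensor_slices(
        (df['Phrase'].values, df['Sentiment'].values)
    )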
Okay, so this is because we have these phrases in here; 00:13:20.920 |
obviously, for machine learning, you're going to tokenize them. 00:13:28.780 |
But we're not going to go all the way through to actually training the model. 00:13:34.860 |
We're just going to have a look at the pipelining. 00:14:12.640 |
So we shuffle the dataset, and we just add in a large buffer size here 00:14:17.640 |
to make sure it shuffles everything thoroughly. 00:14:27.240 |
And the sort of standard number here is actually 10,000. 00:14:38.660 |
I've seen people use 10K, so I'm going to stick with it. 00:15:09.680 |
So now we should see quite a few values, and we can see they're shuffled, okay, cool, 00:15:17.860 |
because these were one, two, three, four before. 00:15:23.500 |
So that's cool, we shuffled and batched it incredibly easily. 00:15:27.560 |
So, you know, that's one of the benefits of doing it this way: 00:15:35.840 |
one, it's incredibly easy and simple to remember, 00:15:43.180 |
and it's easy to read what is happening in dataset.shuffle(10000).batch(8). 00:15:48.280 |
I mean, maybe some people might get a bit confused 00:15:50.460 |
by this number here, but otherwise it's super easy to read. 00:16:03.340 |
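In code, that line is roughly:

    # shuffle with a large buffer, then group the records into batches of 8
    dataset = dataset.shuffle(10000).batch(8)

    # peek at the first batch to confirm the order is no longer 1, 2, 3, 4
    for batch in dataset.take(1):
        print(batch)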
The next thing I want to show you is the map method. 00:16:08.140 |
So this is for any more complex data transformations. 00:16:23.160 |
We can maybe add to or multiply everything in the labels by two. 00:16:28.820 |
I mean, obviously we wouldn't do this in reality, but it shows the idea. 00:16:38.180 |
We're also going to build it as if we have two input fields. 00:16:45.240 |
So for example, when you're working with transformers, 00:16:49.580 |
a lot of the time you have an input ID field or layer, 00:16:54.800 |
and you also have an attention mask field or layer. 00:16:57.920 |
And don't worry if that doesn't really make sense; 00:17:01.420 |
essentially we just have two input layers and fields. 00:17:04.860 |
So we'll format this dataset in the correct way for that, 00:17:11.960 |
and we'll also change the number of the output. 00:17:18.900 |
So generally the best way of doing this is to create a function. 00:17:23.940 |
So I'm just going to call it map_func and pass x. 00:17:28.080 |
So this is just going to pass every single record through the function, 00:17:41.780 |
and it's best to do this before batching, because otherwise we have to consider the batching inside the function. 00:17:45.460 |
So let's move the shuffle and batch calls to after the map, and let's write this. 00:17:53.900 |
So when we are working with multiple inputs or outputs, 00:18:10.520 |
the approach is to give the input layers or output layers a name 00:18:15.280 |
when you're defining and building the model, 00:18:19.520 |
and match that to the names that we give to this dictionary here. 00:18:36.040 |
So our input IDs value for this is going to be this value here; 00:18:56.600 |
we're just going to take the first value of x. 00:19:04.800 |
And then the mask, we're going to write it like this. 00:19:18.820 |
We only have one label, one output, one target, 00:19:26.820 |
and we're just going to perform a really basic operation on it. 00:19:45.340 |
Then all we need to do, again, like it's incredibly easy, is map this function over the dataset. 00:20:11.540 |
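A hedged sketch of the idea; in the video a single x is indexed into, but here I unpack each record as a (text, label) pair to match the earlier in-memory sketch, and the key names input_ids and mask are just the ones discussed above:

    def map_func(text, label):
        # assumed record structure: a (text, label) pair; build a dictionary of
        # named inputs so Keras can route each value to the layer with the same
        # name, and double the label as the stand-in "really basic operation"
        return {'input_ids': text, 'mask': text}, label * 2  # mask is just a placeholder copy

    # starting again from the unbatched dataset: map over every record first,
    # then shuffle and batch afterwards
    dataset = dataset.map(map_func).shuffle(10000).batch(8)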
Okay, so let's have a look at what we have now. 00:20:23.340 |
Okay, it's kind of hard to read, but inside we have a tuple containing a dictionary, 00:20:48.340 |
which has input IDs and then also has the mask. 00:21:18.240 |
When we feed this into a model, it will read the keys of the dictionary. 00:21:31.740 |
So essentially, you would have to have a layer 00:21:35.740 |
called input_ids, and it would pass this value to that layer, 00:21:40.900 |
and then it would also pass this value to the layer called mask. 00:21:52.820 |
We wouldn't necessarily need to mark this one out though. 00:22:00.960 |
Then we shuffle and batch it like we did before, and then we can just view what we have here. 00:22:09.100 |
Okay, so that's pretty good, that's what we want. 00:22:31.860 |
Okay, and so I'm not actually going to define all of this. 00:22:46.500 |
I'm just going to show you how it would work. 00:22:48.040 |
So you define your input layer, and you define your shape here. 00:22:55.740 |
And then it's the name that would have to match up 00:22:57.920 |
with the dictionary that we have fed in previously. 00:23:02.320 |
So we go like input_ids, or was it input_id? 00:23:07.500 |
Okay, so it's input_ids; right, they would have to match. 00:23:26.960 |
And obviously you'd call this one mask, or anything else you want; 00:23:31.860 |
we could call them input one and input two, it doesn't matter, as long as they match. 00:23:37.100 |
Okay, but later on, when we actually fit the model, 00:23:43.740 |
obviously we would have an output here as well. 00:23:52.480 |
We'd have the two inputs, and then the output, 00:23:57.180 |
the two inputs, something in the middle, and the output, 00:24:03.040 |
and with that model architecture, it's fit with the dataset like this: model.fit(dataset). 00:24:07.680 |
And then you'd have however many epochs you're training for. 00:24:18.040 |
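A hedged sketch of the kind of model definition being gestured at here; the layer sizes and hidden layer are made up, and it assumes the phrases have already been tokenized into fixed-length integer ID sequences. Only the input names really matter, since they have to match the dictionary keys from map_func:

    SEQ_LEN = 50  # assumed fixed sequence length after tokenization

    # two named inputs; the names must match the dictionary keys from map_func
    input_ids = tf.keras.layers.Input(shape=(SEQ_LEN,), name='input_ids')
    mask = tf.keras.layers.Input(shape=(SEQ_LEN,), name='mask')

    # something in the middle (entirely made up), then a single output
    x = tf.keras.layers.Concatenate()([input_ids, mask])
    x = tf.keras.layers.Dense(32, activation='relu')(x)
    output = tf.keras.layers.Dense(1, name='output')(x)

    model = tf.keras.Model(inputs=[input_ids, mask], outputs=output)
    model.compile(optimizer='adam', loss='mse')

    # the dataset already yields ({'input_ids': ..., 'mask': ...}, label) batches,
    # so it can be passed straight to fit with however many epochs you want
    model.fit(dataset, epochs=3)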
So actually, there's one other thing that I wanted to mention as well. 00:24:27.780 |
Normally you're going to want to split your data for training and validation. 00:25:07.540 |
If we call take(10) on the dataset, it will take the first 10 batches of the dataset and nothing else; 00:25:15.120 |
if we call skip(10), it will skip the first 10 batches of the dataset and keep everything after them. 00:25:19.620 |
Now, to work out how many batches to take, this is not the most efficient way of doing it. 00:25:38.580 |
I say this isn't efficient because, to get the length, we take a list of the dataset, 00:25:44.680 |
which means we're loading everything in, or generating it, 00:25:49.820 |
and we're putting everything into memory as a list 00:25:56.700 |
just to find out how many batches we have; it would be better to know that from the start. 00:26:04.400 |
I mean, for this, it's fine because we don't have a lot of data, 00:26:08.240 |
but normally it would be better not to do that. 00:26:12.100 |
So let's see, I mean, you can see, even though this dataset is pretty small, it still takes a little while. 00:26:26.820 |
We'll take 70% for the training data and 30% for the validation, and probably test data as well. 00:27:04.880 |
Okay, and then we're going to have to round this to a whole number of batches, 00:27:11.660 |
because we can't take 10.2 batches or something like that. 00:27:17.200 |
And then let's just see what we have for the training size. 00:27:23.600 |
Okay, so we have 1,707 batches for the training data. 00:27:33.120 |
Okay, so we want to take that number of batches: 00:27:36.920 |
we take train_size batches for the training data, and we create another dataset by skipping them. 00:28:07.300 |
So let's take the length of those so we can see again. 00:28:18.980 |
Okay, so yeah, we get 1,707 and then our 30% value. 00:28:39.000 |
Okay, so we already have the batch size here, 00:28:49.680 |
so we've already considered the batches in here, 00:28:52.940 |
and we didn't need to consider it here as well. 00:28:58.520 |
Okay, so it should actually be that way: 00:29:11.500 |
the remaining batches will feed into our validation data, 00:29:23.180 |
which is just under 6,000 values. 00:29:27.360 |
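A sketch of the split being described; the 70/30 ratio and variable names follow the discussion above, and getting the length via list() is the inefficient part flagged earlier:

    # inefficient but simple: materialise the batched dataset as a list to count batches
    num_batches = len(list(dataset))

    # 70% of the batches for training, rounded to a whole number of batches
    train_size = round(num_batches * 0.7)

    # first train_size batches for training, the remaining batches for validation
    train_ds = dataset.take(train_size)
    val_ds = dataset.skip(train_size)

    print(len(list(train_ds)), len(list(val_ds)))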
But that's everything I think that I wanted to go through. 00:29:41.540 |
We covered how we can load in-memory data and/or read into datasets from file, 00:29:48.680 |
and how to batch and shuffle the ones that we read from in-memory sources. 00:30:01.460 |
One thing to note is that if we just have a single input and output, you don't need the dictionary format; 00:30:18.140 |
you would just do model.fit(dataset) like that. 00:30:24.080 |
You don't have to name the layers or anything. 00:30:29.960 |
And then after that, we went through the splits.