
How to Build TensorFlow Pipelines with tf.data.Dataset


Chapters

0:00
0:27 Use the Pipeline Dataset Object
4:40 Reading It Directly from File
7:53 Field Delimiter
8:46 Select Columns
11:41 Shuffle and Batch Methods
16:05 Map Method
22:19 Define a Model

Transcript

All right, we're going to go through TensorFlow datasets. Essentially, these are a more efficient, built-in way to build our input pipelines. You can see the documentation here; if you'd like to go through it, I'll leave a link in the description. But we're just going to dive right into it.

So to use the pipeline dataset object, we need to actually import TensorFlow, of course, as tf. We're also going to be using Pandas and NumPy for a few examples here, so we'll go ahead and import those as well. Okay, so there are a few different ways that we can read data into our datasets.
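For reference, those imports are just:

```python
import tensorflow as tf
import pandas as pd
import numpy as np
```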

The first of those is from in-memory, which is probably the way that most of you, if you've seen this before, will have seen it. We'll put that together quickly, starting with lists: we can take a couple of Python lists, put them together, and build our dataset using that.

So we'll just put some together really quickly here. Okay, so we just have input and output; they're both lists. And then to create our dataset object from these two, we use tf.data.Dataset, with a capital D, and its from_tensor_slices method. One thing that we're doing here is putting both of these into a tuple, because from_tensor_slices only accepts one input parameter.

And that input parameter is basically all of the data that we are going to be feeding into our model later on. The default format for the dataset object when it's feeding into a model is simply one input tensor and one output tensor, or target label tensor, whatever you'd like to call it.

So we're just going to put that in here, and once it runs, we will have built our first dataset. Then, with a for item in dataset loop, we can see what it looks like.

It's a series of tensor pairs, in the tuple format that we created here. So this is the first item we have, and the first tensor object holds a NumPy integer, which is zero, matching the first input value. And then next to that, we have the output value, which is one here.
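Here's a minimal sketch of that; the list values are made up:

```python
# Two plain Python lists: inputs and their target labels.
inputs = [0, 1, 2, 3]
labels = [1, 0, 1, 0]

# from_tensor_slices accepts a single structure, so both lists go into a tuple.
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

# Each item is an (input, label) pair of scalar tensors.
for item in dataset:
    print(item)
```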

And then it's the same for the following three rows in there. We can also do the same with NumPy arrays, in exactly the same format, and this will produce the exact same thing. And then there's the DataFrame, which I assume a lot of you will want to use.

Before, we were passing inputs and outputs; this time we just pass the DataFrame. Okay, so this creates a slightly different format. With this, we would restructure the dataset before feeding it into our model, so that everything is read correctly.
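A sketch of both variants; note that dict(df) is the usual idiom for slicing a DataFrame, and it yields a dictionary per row, which is the slightly different format being described:

```python
# NumPy arrays work in exactly the same way as the lists.
inputs = np.array([0, 1, 2, 3])
labels = np.array([1, 0, 1, 0])
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

# With a DataFrame, slice a dict of its columns; each item is then a
# dict mapping column names to scalar tensors.
df = pd.DataFrame({'input': [0, 1, 2, 3], 'output': [1, 0, 1, 0]})
dataset = tf.data.Dataset.from_tensor_slices(dict(df))
```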

But for now, we're just going to leave it like that, and we'll go over the mapping and everything pretty soon. The other option we have for reading data into our dataset is actually reading it directly from file. The benefit we get from doing that is that we are reading data from an out-of-memory source.

And because we're reading from an out-of-memory source, TensorFlow will read data batch by batch rather than pulling in the entire data source or the entire dataset all at once. So if we have a big dataset, then this is pretty useful because in a lot of cases, we're working with a big dataset and we can't actually bring everything into our memory all at once.

So this allows us to get around that, and it does it in an efficient way. It's just super easy as well. I have also put together a diagram; this is my attempt at demonstrating what the from-file version of this does.

So this is our full dataset here, and we've split it into three batches; obviously, you'd have way more than this. At any one time, we feed in a single batch, apply our dataset transformations, and feed it into the model for training. And then once we're done with that, we go on to the next batch.

And then we apply the dataset transformations to that batch and train the model with it. So let's do that quickly. I've got this train.tsv here, and that file is actually from the Sentiment Analysis on Movie Reviews competition on Kaggle, so you can download it there; I'll put the link in the description.

And we're going to read from this directly. It's slightly different: we use tf.data.experimental.make_csv_dataset. Now, I know we are actually using tab-separated values here rather than comma-separated values, so all we're going to do is change the field delimiter to a tab character instead.

So it's train.tsv. And then also in here, we actually define our batch size. So we're just going to do something really small for now. But obviously, when you are using this for your actual models, you would probably be doing something like a batch size of 64 or 128 or whatever it is you're using.

But we're just going to go for eight now so we can easily visualize everything. Next, we set the field delimiter; this is where we tell it that it's actually a tab-delimited file rather than comma-delimited.

We also need to set the label for our dataset, which, if we look here, is this sentiment field. And then another really useful argument here is select_columns. With this, we just pass a list of the columns that we want to keep, and it will drop all the other columns.

For now, and obviously it depends on what you're doing, we're just going to keep the ID plus the input and target data. So the ID is the phrase ID, the input data is the phrase, and the sentiment is the label.
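Putting that together, the call looks roughly like this; the column names (PhraseId, Phrase, Sentiment) are taken from the Kaggle train.tsv:

```python
dataset = tf.data.experimental.make_csv_dataset(
    'train.tsv',
    batch_size=8,        # tiny batch size so the output is easy to inspect
    field_delim='\t',    # tab-separated file rather than comma-separated
    label_name='Sentiment',                              # target column
    select_columns=['PhraseId', 'Phrase', 'Sentiment'],  # drop the rest
)
```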

Let's execute that, and then let's have a quick look at what we have. We use take to grab just the first batch within our dataset; if we say take(20), it will take the first 20 batches and nothing else. But we just want to see the first one, so we can see the actual format within the dataset.
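Something like:

```python
# take(1) yields just the first batch: a (features, labels) pair, where
# features maps each selected column name to a tensor of eight values.
for features, labels in dataset.take(1):
    print(features['PhraseId'])
    print(features['Phrase'])
    print(labels)
```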

So you can see here we have the phrase ID, and this is why I wanted to keep it in. We don't actually need it for training the model, but I wanted to show you that this actually shuffles the data. In the file itself, the phrase IDs are 1, 2, 3, 4, 5.

It's in order. But then when we read data in with this, it actually automatically shuffles everything, which is a pretty cool feature. So yeah, it's pretty useful. And then here we have phrases, which would be our input data. And then here we have the sentiment ratings, which would be our target data.

So that's everything for reading into our dataset, and we'll move on to performing a few operations on it. I'm going to go back and assume that we're not reading from file. Actually, let's use the same data, but we're going to load it into memory first.

I'm doing this because I want to show you the shuffle and batch methods, and obviously, if it's already shuffled and batched, there's no point in showing you. This is useful to know if you're reading things from in-memory sources; if you are reading from disk, there's no point in doing this part.

And since it's a TSV, we need to set the separator as a tab. Okay, let's just make sure we've read it in correctly. Cool. Now, before anything else, we need to read it into our dataset, so we're going to use the same from_tensor_slices as before.

Then let's do for item in dataset.take(1) and print the item. Okay, so we have these string phrases in here, and we don't really need them, so let's drop them and just keep the numeric columns to make things a bit easier for now. If you were using the strings for machine learning, you would obviously tokenize them first.
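Putting those steps together, again assuming the Kaggle column names:

```python
# Load the TSV into memory with pandas; the separator is a tab.
df = pd.read_csv('train.tsv', sep='\t')

# Drop the raw text column; in practice you would tokenize it first.
df = df.drop(columns=['Phrase'])

# Build the dataset from the remaining numeric columns.
dataset = tf.data.Dataset.from_tensor_slices(dict(df))

# Inspect the first record.
for item in dataset.take(1):
    print(item)
```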

But we're not going to go all the way through to actually training the model; we're just going to look at the pipelining. Okay, cool. So our first row is 1, 1, 1, which is what we expect. The first thing is to actually do the shuffling and the batching, like I said before.

We'll stick with a batch of eight just to make things a bit more readable. What we do is super easy: we call dataset.shuffle with a large buffer number to make sure every sample gets shuffled as far away from its neighboring samples as possible.

The standard number here is actually 10,000. I don't know exactly why, but almost every time I've seen shuffling, I've seen people use 10K, so I'm going to stick with it. Then for the batch, I'm so used to putting 64 or 128, but we'll just use a batch of eight here.
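So:

```python
# Shuffle with a buffer of 10,000, then batch into groups of eight.
dataset = dataset.shuffle(10000).batch(8)
```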

Now that we've batched it, if we take one item we should actually see more than one record, because take works at the highest level of the dataset, which is now a batch. So we should see quite a few, and we can see, okay, cool, it's definitely mixed up the phrase IDs, because these were one, two, three, four before.

And now they're all mixed up. So that's cool; we shuffled and batched it incredibly easily, and that's one of the benefits of doing it this way. As well, this code is incredibly easy and simple to remember.

It's very obvious when you're reading it what is happening in dataset.shuffle(10000).batch(8). Maybe some people might get a bit confused by the buffer number, but otherwise it's super easy to read, and it's really quick and efficient. The next thing I want to show you is the map method.

For any more complex data transformations, this is probably what you'd use; it's really, really useful. So what can we do? We can multiply everything in the labels by two. Obviously we wouldn't do this in reality, but it's just an example. And we'll also reformat the data.

We're going to build it as if we have two input fields. For example, when you're working with transformers, a lot of the time you have an input IDs field or layer, and you also have an attention mask field or layer. Don't worry if that doesn't really make sense.

Essentially, we just have two input fields and layers, so we'll format this in the correct way to have two inputs and then one output. We'll also change the number in the output just so we can see how it works. Generally, the best way of doing this is to create a function.

I'm just going to call it map_func, and it takes x, which will be every single record within our dataset. One thing I actually just realized is that we should batch afterwards, because otherwise the function has to deal with the batch dimension. So let's move the shuffle and batch lines after this, and let's write the function.

When we are working with multiple inputs, or even multiple outputs, the best way to let TensorFlow know where each input is supposed to go is to give the input or output layers a name when you're defining and building the model, and match that to the keys we give to this dictionary here.

With the transformer example, I think most people just use input_ids. I'm just going to make this up: our input ID for this is going to be the phrase ID value, and we're going to put it into a list, because typically you'd have an array or list of numbers coming in here.

So we're just going to take that one value from x. Then for the mask, we're going to write it the same way and just put a one. On the outside of this dictionary, we only have one label, one output, one target, whatever you want to call it, and we're just going to perform a really basic operation on it just to show that we can.

We multiply by two, nothing special, and again put it in a list. To apply this mapping function, all we need to do, and again it's incredibly easy, is call dataset.map and pass in map_func. Now, we did batch the dataset before, so let's just rerun that earlier bit of code so that we have it all unbatched again.
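The function and the map call might look like this, assuming the PhraseId and Sentiment columns from earlier; the float casts are my addition, so the toy model later on can consume the values directly:

```python
def map_func(x):
    # x is a single record: a dict mapping column names to scalar tensors.
    return (
        {
            'input_ids': [tf.cast(x['PhraseId'], tf.float32)],  # in a list, since
            'mask': [1.0],                     # normally an array would go here
        },
        [tf.cast(x['Sentiment'], tf.float32) * 2],  # trivial label transform
    )

# Apply the function to every record; batching comes afterwards, so
# map_func never has to deal with a batch dimension.
dataset = dataset.map(map_func)
```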

Run this and this. Okay, let's have a look at what we have now. You can see here we have this format; it's kind of hard to read, but everything is inside a tuple. This is index one of the tuple, and this is index zero.

All right, so inside index zero, we have this dictionary that we defined, which has the input IDs and also the mask. This tuple format, with the input and the output, is what TensorFlow will be expecting when we fit to the model. If it sees a dictionary in either the input or the output, it will read the keys of the dictionary, and the values matching those keys will be provided to the layer with the corresponding name.

So essentially, you would have a layer called input_ids, and TensorFlow would pass this value to that layer; it would also pass this value to the mask layer. Then we would have the outputs being passed to our output layer, though we wouldn't necessarily need to name that one, since there's only one output.

Okay, now we want to batch it like we did before, and then we can just view what we have here. So we have the dictionary, and then we have everything else as well. That's pretty good; that's what we want. Next, let's really quickly define a model.

It's going to have an input_ids input layer. I'm not actually going to define all of the model; I'm just going to show you how it would work. So you define your shape here, and then you would also give it a name.

And it's this name that has to match up with the dictionary keys that we fed in previously. So it's input_ids; they have to match exactly. Then we would have a mask input as well, because we have two inputs, remember?

And this one would be called mask. You could call it mask or anything else you want, input_one, input_two, it doesn't matter, as long as the names match the dictionary keys. Later on, when we actually fit the model, we do it like this. Obviously we would have an output here as well; I'm not defining the rest of it.

But we'd have the two inputs, something in the middle, and then the output, and all we'd have to do with that model architecture is call model.fit(dataset) like this, with however many epochs you're training for. And that's everything for that.
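As a sketch, with placeholder layers in the middle, since the full architecture isn't defined in the video:

```python
# Two named inputs; the names must match the dictionary keys from map_func.
input_ids = tf.keras.layers.Input(shape=(1,), name='input_ids')
mask = tf.keras.layers.Input(shape=(1,), name='mask')

# Placeholder middle: just concatenate the two inputs.
x = tf.keras.layers.Concatenate()([input_ids, mask])

# A single output layer; with only one output we don't need to name it.
outputs = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(inputs=[input_ids, mask], outputs=outputs)
model.compile(optimizer='adam', loss='mse')

# The dataset already yields ({'input_ids': ..., 'mask': ...}, label)
# batches, so it can be fed straight to fit.
model.fit(dataset, epochs=3)
```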

Actually, there's one other thing that I wanted to mention as well. A lot of the time, you're going to want to split your data into a training and a validation set, and to do that is actually super easy. Before, we mentioned the take method, dataset.take, and we can also use dataset.skip.

These are equal and opposite: if we do dataset.take(10), it will take the first 10 batches of the dataset and nothing else, and if we do dataset.skip(10), it will skip the first 10 batches and keep everything else. What follows is not the most efficient way of doing it, but let's just do it like this for now.
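For example:

```python
first_10_batches = dataset.take(10)   # the first 10 batches, nothing else
remaining_batches = dataset.skip(10)  # everything except the first 10 batches
```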

First, we get the length of the dataset. I say this isn't efficient because, to take the length, we convert the dataset into a list; the dataset is a generator, so this generates everything, puts it into memory as a list, and only then takes the length of it.

That's a lot of work just to find out how many batches you have, and it's bad if it's a big dataset. For this, it's fine because we don't have a lot of data, but normally it would be better not to do that. You can see that even though this dataset is pretty small, it still takes quite a long time.

Then, say we want a 70% split, so 70/30: 70% for the training data, 30% for the validation, and probably test data as well, but you'd just split that off afterwards. So the training size would be 0.7 times the length. And remember, take is counting batches, so I'll divide the length by the batch size, which is eight.

Then we're going to have to round this to the nearest integer, because we can't take 10.2 batches or something like that. So we just round it here, and let's see what we have for the training size. Okay, so we get 1,707 batches for the training data.

So we want to take that number of batches with take(train_size) and create another dataset, which is the train dataset. Then, for the validation dataset, we just skip those same 1,707 batches. Super simple. Let's take the lengths of those so we can check.

I know it's not efficient, but it's the easiest way to do it quickly. Okay, so we get 1,707, and then our 30% value, when it finally loads... hold on. The length already accounts for the batch size here, so dividing by it again was a mistake.

The batching is already considered in the length, so we didn't need to consider it here as well; that's also why it was taking so long. Let's fix that and rerun. So now we have a training size of around 13.6K batches, and the remaining batches will feed into our validation data.
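So the corrected split looks something like this; the variable names are mine:

```python
# The dataset is already batched, so this counts batches. It loads every
# batch into memory just to count them, which is fine for a small dataset
# but wasteful for a big one.
ds_len = len(list(dataset))

# 70% of the batches for training, rounded to a whole number of batches.
train_size = round(ds_len * 0.7)

# Take the first 70% for training, skip the same batches for validation.
train_ds = dataset.take(train_size)
val_ds = dataset.skip(train_size)
```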

That comes to just under 6,000 batches. But that's everything I think I wanted to go through. So we've covered all of the essentials of the TensorFlow dataset object: how to load in-memory data or read into datasets from file, how to batch and shuffle the ones we read from in-memory sources, how to transform datasets with map, and how we can feed them into models.

One thing to note: if you just have one input and one output, like you probably will for most models, you don't need to do anything special. You just keep the tuple format of inputs and outputs, and then you would just call model.fit(dataset) like that.

You don't have to name the layers or anything; you just feed it straight in. And after that, we went through the splits, which we've just done here. So that's everything. I hope it's been a useful video, and thanks for watching.