
Input Data Pipelines - TensorFlow Essentials #4


Chapters

0:00
1:50 pass a tuple containing the inputs and labels
2:55 convert our initial list into an array
3:41 using a pandas data frame
4:19 convert these into a numpy array
5:57 pass a csv data set
6:52 read our data into our model 16 rows at a time
8:11 take a certain number of columns or batches
9:43 create our data set using the original method
10:15 create an array with three columns

Transcript

Hi and welcome to this video on TensorFlow Datasets. So, the TensorFlow Dataset is an object type that TensorFlow supplies that essentially makes our job a lot easier when building an input data pipeline. First, when we are actually putting our data into our model, a Dataset object makes it much, much easier: we simply feed the Dataset into model training and, as long as it's set up correctly, everything runs very smoothly.

Second, we can batch and shuffle our Dataset with a single line of code, which is honestly incredible. And finally, we can also adjust our Dataset very easily without taking up too much disk space. And alongside all of those, the Dataset object is incredibly well optimized. So, it's definitely well worth learning how to use this.

So, we'll just open up a new Jupyter Notebook here. And we want to import TensorFlow, Pandas, and NumPy. So, there are a few different ways of reading data into our Dataset. The first of those is from in-memory data. So, if we, for example, have our inputs and labels, which are our outputs, we can very quickly create a Dataset with this, like so.

So, we just pass a tuple containing the inputs and labels. It's incredibly easy. So, now we can loop through the Dataset to see what we have inside. Okay, so we can see that we have these Tensor objects, which are the data types or the array types that TensorFlow will read.
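As a sketch of that step, using the toy values mentioned in the video, passing a tuple of the inputs and labels to from_tensor_slices might look like this:

```python
import tensorflow as tf

# Toy in-memory data: inputs and their labels.
inputs = [0, 1, 2, 3]
labels = [1, 0, 1, 0]

# Pass a tuple containing the inputs and labels.
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

# Loop through the Dataset to see the Tensor pairs inside.
for x, y in dataset:
    print(x.numpy(), y.numpy())
```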

And we see, okay, it's a NumPy int32 here, and the value is 0. And we have 1, 2, 3. Okay, so these are obviously our inputs. Now, on the other side, we have our labels. We have NumPy 1, 0, 1, and 0. So, incredibly easy to get that set up.

But, obviously, we're not always going to read in as a list. We can also read in as a NumPy array as well. Which, again, is incredibly easy. In fact, it's essentially the exact same piece of code. So, we're just going to convert our initial list into an array. And also the labels as well.

Let's see what we have here. Okay, so now we have an array rather than a list. And all we do is the exact same thing again. There we go. We have the exact same output again. So, lists and arrays, we can deal with them in the exact same way.
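A minimal sketch of that, with the same toy values the video uses — the Dataset call is identical, only the conversion step is new:

```python
import numpy as np
import tensorflow as tf

# Convert the initial lists into NumPy arrays.
inputs = np.array([0, 1, 2, 3])
labels = np.array([1, 0, 1, 0])

# Exactly the same call works for arrays as for lists.
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
```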

There's no need to do anything differently for them both. Now, in the case of us using a Pandas DataFrame, we may want to do something a little bit different. So, we have our inputs. And our labels. Okay, so now we have a Pandas DataFrame rather than an array or a list.

The best way to deal with this, and the way that we will be dealing with this throughout this course, is to simply convert these into a NumPy array, which we can do really easily by just adding .values onto the end. And that creates our array. So, if we wanted to, again, create a dataset from that, we just take this.

Split each into a column. Take the values. And we get our dataset. Incredibly easy. So, the only difference here is that we are getting a 64-bit integer (int64) rather than a 32-bit integer. And this is because int64 is the default integer type for Pandas. But for us, I mean, that doesn't really make much difference.
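As a sketch of the DataFrame route (the column names here are hypothetical, not from the video), adding .values onto each column gives the NumPy arrays to slice:

```python
import pandas as pd
import tensorflow as tf

# Hypothetical column names; the video's toy inputs and labels.
inputs = pd.DataFrame({'input': [0, 1, 2, 3]})
labels = pd.DataFrame({'label': [1, 0, 1, 0]})

# .values converts each column into its underlying NumPy array.
dataset = tf.data.Dataset.from_tensor_slices(
    (inputs['input'].values, labels['label'].values)
)

# Note the dtype is int64 here, Pandas' default integer type.
print(dataset.element_spec)
```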

Now, we can also read data from file. And to do that, we again use tf.data, but we use something slightly different here. So, this is a newer feature, so it's not within the Dataset attribute; it's within experimental. We use make_csv_dataset. So, what we would do here is pass a CSV file into here.

So, if we quickly get some. Okay. So, we've got this train.csv here. And this is just an extract from the Kaggle Titanic dataset. So, you can find that by searching Titanic Kaggle. And it will come up. And now we have all of these rows which each list a passenger on the Titanic.

So, all we do is train.csv. And then here, we can actually specify our batch sizes, what columns we want, the field delimiter, and all of these different things, which is incredibly useful. So, first, we'll do a batch size. And let's go with a batch size of 16. So, this simply means that we will read our data into our model 16 rows at a time.

Our field delimiter. Because we're using a comma separated file, we don't really need to specify this. But it's there just in case you are using something else, like a pipe delimited file. Or a tab delimited file. And then we can also specify specific columns as well. So, let's select a few.

So, let's go with passenger ID, survived, and P class. And then we also add in our label name, which is the target or output column. Now, one thing to note here is that our label name must also be within our selected column list. There we go. So, now we can view what we have.

So, we can use this .take method to just take a certain number of batches. Let's print item. Okay, so here we can see we have passenger ID, and we have a few values there. P class, a few values there. And then at the end we have our labels, which are not assigned a name, but they are here.
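Putting those pieces together as a sketch — here a tiny stand-in file is generated in place of the real Kaggle train.csv, so the row values are hypothetical, but the call itself follows the steps above:

```python
import tensorflow as tf

# Generate a small stand-in for the Kaggle Titanic train.csv
# (hypothetical rows; the real file has many more columns).
with open('train.csv', 'w') as f:
    f.write('PassengerId,Survived,Pclass\n')
    for i in range(1, 33):
        f.write(f'{i},{i % 2},{1 + i % 3}\n')

dataset = tf.data.experimental.make_csv_dataset(
    'train.csv',
    batch_size=16,          # read 16 rows at a time
    field_delim=',',        # the default; shown for completeness
    select_columns=['PassengerId', 'Survived', 'Pclass'],
    label_name='Survived',  # must also appear in select_columns
)

# .take(1) pulls a single batch; rows arrive already shuffled.
for features, labels in dataset.take(1):
    print(features['PassengerId'].numpy())
    print(labels.numpy())
```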

Now, if we take a look at the passenger ID values here, we have 889, 826, and 273. If we look at the top here, these are obviously not the same. So, the reason that is, is because when we use this experimental make CSV dataset method, we actually automatically shuffle data, which is a good thing.

Realistically, we always want to try and shuffle data before feeding it into our model. We want every batch that we have to essentially be a representative sample of the full dataset. However, we're not always necessarily going to be reading from file. So, if we're not doing that, we do something else.

So, let's read in our train.csv. We're going to read it in with Pandas. And then we want to create our dataset using the original method that we learned, which is from_tensor_slices. So, this time, let's take a few different inputs and create an array here. So, we might want pclass, age, and parch.

So, this will create an array with three columns and the number of rows equal to the number of samples that we have within our dataset. And then for the output label, which is always the second part of the tuple, we want to add survived. So, here we just need to change df to data to match up to our data frame up here.
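A sketch of that DataFrame route, again with a generated stand-in train.csv (hypothetical values) in place of the real Titanic file:

```python
import pandas as pd
import tensorflow as tf

# Stand-in for the Kaggle Titanic train.csv (hypothetical values).
pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass':   [3, 1, 3, 2],
    'Age':      [22.0, 38.0, 26.0, 35.0],
    'Parch':    [0, 0, 0, 1],
}).to_csv('train.csv', index=False)

data = pd.read_csv('train.csv')

# Three input columns -> array of shape (num_samples, 3).
inputs = data[['Pclass', 'Age', 'Parch']].values
# The output label is always the second part of the tuple.
labels = data['Survived'].values

dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

# Not yet shuffled or batched: take(1) yields a single sample.
for x, y in dataset.take(1):
    print(x.numpy(), y.numpy())
```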

And now we have our dataset. So, the only issue is now, none of this is shuffled and none of it is batched. So, if we do for item in dataset, take one, we see that we just have one sample. We want everything to be shuffled and batched. And to do that, all we need to do is write a single line of code with dataset.shuffle.

And then here we just add a large number. The larger the dataset and the less shuffled it already is, the larger this number needs to be. So, if we have a very unrepresentative sample or batch, we need to increase this number. And then we set our batch size. And let's go with 16 again.

And there is also another argument that we can add here. So, obviously 16 probably won't fit perfectly into the number of samples that we have. So, say if we had batches of two and we had a total dataset size of nine, four of those twos would fit perfectly into eight, but then we'd have one sample left over at the end.

So, what we can do if that is the case, is we can just drop the remainder to avoid any shape problems coming up later on. So, just drop_remainder=True. And there we have it. We have shuffled and batched our dataset. So, that is everything on the TensorFlow Dataset object.
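Those last steps could be sketched like this — buffer and batch sizes are the ones from the video, and the 100-element range is just a stand-in dataset:

```python
import tensorflow as tf

# Stand-in dataset of 100 samples.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(100))

# A single line shuffles (buffer of 1000) and batches (16 rows at
# a time); drop_remainder discards the final, smaller batch so
# every batch has the same shape.
dataset = dataset.shuffle(1000).batch(16, drop_remainder=True)

for batch in dataset.take(1):
    print(batch.shape)  # (16,)
```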

This is incredibly useful. So, definitely get familiar with it and we will also be using it a lot in this course as well. So, I hope that has been useful. As always, thank you for watching and I will see you in the next one.