Input Data Pipelines - TensorFlow Essentials #4

00:00:00.000 | Hi and welcome to this video on TensorFlow Datasets.

00:00:04.000 | So, TensorFlow Dataset is an object type that TensorFlow supplies

00:00:09.000 | that essentially makes our job a lot easier when building an input data pipeline.

00:00:15.000 | First, when we are actually putting our data into our model,

00:00:19.000 | a Dataset object makes it much, much easier and we literally just feed in

00:00:23.000 | Dataset to our model training and everything, as long as it's set up correctly,

00:00:27.000 | we run very smoothly. Second, we can batch and shuffle our Dataset

00:00:32.000 | with a single line of code, which is honestly incredible.

00:00:36.000 | And finally, we can also adjust our Dataset very easily

00:00:40.000 | without taking up too much disk space. And alongside all of those,

00:00:44.000 | the Dataset object is incredibly well optimized.

00:00:49.000 | So, it's definitely well worth learning how to use this.

00:00:54.000 | So, we'll just open up a new Jupyter Notebook here.

00:00:58.000 | And we want to import TensorFlow,

00:01:02.000 | Pandas,

00:01:07.000 | and NumPy.

00:01:11.000 | So, there are a few different ways of reading in data into our Dataset.

00:01:16.000 | The first of those is from in-memory.

00:01:20.000 | So, if we, for example, have our inputs

00:01:24.000 | and labels,

00:01:28.000 | which are our outputs,

00:01:32.000 | we can

00:01:36.000 | very quickly create a Dataset with this,

00:01:40.000 | like so.

00:01:44.000 | [typing]

00:01:48.000 | So, we just pass a tuple

00:01:52.000 | containing the inputs and labels.

00:01:56.000 | It's incredibly easy. So, now we can loop through

00:02:00.000 | the Dataset to see what we have

00:02:04.000 | inside.

00:02:08.000 | Okay, so we can see that we have these Tensor objects,

00:02:12.000 | which are the data types or the array types that TensorFlow

00:02:16.000 | will read. And we see, okay, it's a NumPy integer,

00:02:20.000 | integer 32 here, and the value is

00:02:24.000 | 0. And we have 1, 2, 3. Okay, so these are obviously

00:02:28.000 | our inputs. Now, on the other side, we have our labels.

00:02:32.000 | We have NumPy 1, 0, 1, and 0.

00:02:36.000 | So, incredibly easy to get that set up.

00:02:40.000 | But, obviously, we're not always going to read in as a list.

00:02:44.000 | We can also read in as a NumPy array as well.

00:02:48.000 | Which, again, is incredibly easy. In fact, it's essentially

00:02:52.000 | the exact same piece of code.

00:02:56.000 | So, we're just going to convert our initial list into an array.

00:03:00.000 | [typing]

00:03:04.000 | And also the labels as well.

00:03:08.000 | [typing]

00:03:12.000 | Let's see what we have here. Okay, so now we have an array rather than a list.

00:03:16.000 | And all we do is

00:03:20.000 | the exact same thing again.

00:03:24.000 | [typing]

00:03:28.000 | There we go. We have the exact same output again.

00:03:32.000 | So, lists and arrays, we can deal with them in the exact same way. There's no

00:03:36.000 | need to do anything differently for them both.

00:03:40.000 | Now, in the case of us using a Pandas DataFrame,

00:03:44.000 | we may want to do something a little bit different.

00:03:48.000 | [typing]

00:03:52.000 | So, we have our inputs.

00:03:56.000 | [typing]

00:04:00.000 | And our labels.

00:04:04.000 | [typing]

00:04:08.000 | Okay, so now we have a Pandas DataFrame

00:04:12.000 | rather than an array or a list. The best way to deal with this

00:04:16.000 | and the way that we will be dealing with this throughout this course is to simply

00:04:20.000 | convert these into a NumPy array. Which we can do

00:04:24.000 | really easily like this.

00:04:28.000 | [typing]

00:04:32.000 | And just add values onto the end. And that creates our array.

00:04:36.000 | So, if we wanted to, again,

00:04:40.000 | create a dataset from that, we just take this.

00:04:44.000 | [typing]

00:04:48.000 | Split each into a column. Take the values.

00:04:52.000 | [typing]

00:04:56.000 | [typing]

00:05:00.000 | And we get our dataset.

00:05:04.000 | [typing]

00:05:08.000 | Incredibly easy. So, the only difference here is that we are

00:05:12.000 | creating a 64-bit integer rather than a 32-bit integer.

00:05:16.000 | And this is because this is the preferred data type for Pandas.

00:05:20.000 | But for us, I mean, that doesn't really make much difference.

00:05:24.000 | Now, we can also

00:05:28.000 | read data from file. And to do that, we

00:05:32.000 | again, use TensorFlow

00:05:36.000 | data. But we use something slightly different here.

00:05:40.000 | So, this is a newer feature. So, it's not within the dataset

00:05:44.000 | attribute. It's within experimental.

00:05:48.000 | We make a CSV dataset.

00:05:52.000 | [typing]

00:05:56.000 | So, what we would do here is pass a CSV dataset

00:06:00.000 | into here. So, if we quickly get

00:06:04.000 | some.

00:06:08.000 | Okay. So, we've got this train.csv here. And this is just a

00:06:12.000 | extract from the Kaggle Titanic

00:06:16.000 | dataset. So, you can find that by searching Titanic

00:06:20.000 | Kaggle. And it will come up. And now we have all of these rows

00:06:24.000 | which each list a passenger on the Titanic.

00:06:28.000 | So, all we do is train.csv.

00:06:32.000 | And then here, we can actually specify our batch

00:06:36.000 | sizes, what columns we want, the field delimiter, and

00:06:40.000 | all of these different things, which is incredibly useful. So,

00:06:44.000 | first, we'll do a batch size. And let's go with a batch size of

00:06:48.000 | 16. So, this simply means that we

00:06:52.000 | will read our data into our model 16 rows at a time.

00:06:56.000 | Our field delimiter.

00:07:00.000 | Because we're using a comma separated file, we don't

00:07:04.000 | really need to specify this. But it's there just in case you are using

00:07:08.000 | something else, like a pipe delimited file.

00:07:12.000 | Or a tab delimited file.

00:07:16.000 | And then we can also

00:07:20.000 | specify specific columns as well. So, let's

00:07:24.000 | select a few.

00:07:28.000 | So, let's go with passenger

00:07:32.000 | ID, survived, and P class.

00:07:36.000 | And then we also add in our label name, which is

00:07:48.000 | the target or output column.

00:07:52.000 | Now, one thing to note here is that our label name must also be

00:07:56.000 | within our selected column list.

00:08:00.000 | There we go. So, now we

00:08:04.000 | can view what we have.

00:08:08.000 | So, we can use this .take method.

00:08:12.000 | Just take a certain number of columns, or batches, sorry.

00:08:16.000 | Let's print item.

00:08:20.000 | Okay, so here we can see we have

00:08:24.000 | passenger ID, and we have a few values there.

00:08:28.000 | P class, a few values there. And then at the

00:08:32.000 | end we have our labels, which are not assigned a

00:08:36.000 | name, but they are here.

00:08:40.000 | Now, if we take a look at the passenger ID values here, we have

00:08:44.000 | 889, 826, and 273.

00:08:48.000 | If we look at the top here, these are obviously not the same.

00:08:52.000 | So, the reason that is, is because when we use this

00:08:56.000 | experimental make CSV dataset method,

00:09:00.000 | we actually automatically shuffle data, which is a good

00:09:04.000 | thing. Realistically, we always want to try and shuffle data before

00:09:08.000 | feeding it into our model. We want every batch that we have to essentially be

00:09:12.000 | a representative sample of the full dataset. However,

00:09:16.000 | we're not always necessarily going to be reading from file.

00:09:20.000 | So, if we're not doing that, we do something else.

00:09:24.000 | So, let's read in our train.csv.

00:09:28.000 | We're going to read it in with pandas.

00:09:36.000 | And then we want to create our dataset using the

00:09:46.000 | original method that we learned, which is

00:09:50.000 | from tensorsizers.

00:09:54.000 | So, this time, let's take

00:09:58.000 | a few different inputs and create an array here.

00:10:02.000 | So, we might want

00:10:06.000 | pclass, age,

00:10:10.000 | and parch.

00:10:14.000 | So, this will create an array with

00:10:18.000 | three columns and the number of rows equal to the number of samples

00:10:22.000 | that we have within our dataset. And then for the

00:10:26.000 | output label, which is always the second part of the tuple, we want

00:10:30.000 | to add survived.

00:10:34.000 | So, here we just need to change df to

00:10:38.000 | data to match

00:10:42.000 | up to our data frame up here. And now we have our

00:10:46.000 | dataset. So, the only issue is now, none of this is shuffled

00:10:50.000 | and none of it is batched. So, if we do for

00:10:54.000 | item in dataset, take

00:10:58.000 | one,

00:11:02.000 | we see that we just have one sample. We want everything to be shuffled

00:11:06.000 | and batched. And to do that, all we

00:11:10.000 | need to do is write a single line of code with

00:11:14.000 | dataset.shuffle. And then here we just add a large

00:11:18.000 | number. The larger the dataset and the less shuffled it is, the

00:11:22.000 | larger we need to add in this number. So, if

00:11:26.000 | we have a very unrepresentative sample

00:11:30.000 | or batch, we need to increase this number.

00:11:34.000 | And then we set our batch. And let's go with

00:11:38.000 | 16 again. And there is also another argument that we can add here. So, obviously

00:11:42.000 | 16 probably won't fit perfectly into the number of

00:11:46.000 | samples that we have. So, say if we had batches of two

00:11:50.000 | and we had a total dataset size of nine,

00:11:54.000 | four of those twos would fit perfectly into eight, but then we'd have one

00:11:58.000 | sample left over at the end. So, what we can do

00:12:02.000 | if that is the case, is we can just drop the remainder to avoid any

00:12:06.000 | shape problems coming up later on. So, just drop remainder

00:12:10.000 | equals true. And there we have it. We have shuffled and batched our dataset.

00:12:14.000 | So, that is everything on the TensorFlow datasets

00:12:18.000 | object. This is incredibly useful. So, definitely get familiar with it

00:12:22.000 | and we will also be using it a lot in this course as well.

00:12:26.000 | So, I hope that has been useful. As always, thank you for watching and I will see you in

Input Data Pipelines - TensorFlow Essentials #4

Chapters