
Input Data Pipelines - TensorFlow Essentials #4


Chapters

0:00 introduction to TensorFlow Datasets
1:50 pass a tuple containing the inputs and labels
2:55 convert our initial list into an array
3:41 using a pandas data frame
4:19 convert these into a numpy array
5:57 pass a csv data set
6:52 read our data into our model 16 rows at a time
8:11 take a certain number of batches
9:43 create our data set using the original method
10:15 create an array with three columns

Whisper Transcript

00:00:00.000 | Hi and welcome to this video on TensorFlow Datasets.
00:00:04.000 | So, TensorFlow Dataset is an object type that TensorFlow supplies
00:00:09.000 | that essentially makes our job a lot easier when building an input data pipeline.
00:00:15.000 | First, when we are actually feeding our data into our model,
00:00:19.000 | a Dataset object makes it much, much easier: we literally just feed the
00:00:23.000 | Dataset into our model training and, as long as it's set up correctly,
00:00:27.000 | everything runs very smoothly. Second, we can batch and shuffle our Dataset
00:00:32.000 | with a single line of code, which is honestly incredible.
00:00:36.000 | And finally, we can also adjust our Dataset very easily
00:00:40.000 | without taking up too much disk space. And alongside all of those,
00:00:44.000 | the Dataset object is incredibly well optimized.
00:00:49.000 | So, it's definitely well worth learning how to use this.
00:00:54.000 | So, we'll just open up a new Jupyter Notebook here.
00:00:58.000 | And we want to import TensorFlow,
00:01:02.000 | Pandas,
00:01:07.000 | and NumPy.
00:01:11.000 | So, there are a few different ways of reading in data into our Dataset.
00:01:16.000 | The first of those is from in-memory.
00:01:20.000 | So, if we, for example, have our inputs
00:01:24.000 | and labels,
00:01:28.000 | which are our outputs,
00:01:32.000 | we can
00:01:36.000 | very quickly create a Dataset with this,
00:01:40.000 | like so.
00:01:44.000 | [typing]
00:01:48.000 | So, we just pass a tuple
00:01:52.000 | containing the inputs and labels.
00:01:56.000 | It's incredibly easy. So, now we can loop through
00:02:00.000 | the Dataset to see what we have
00:02:04.000 | inside.
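A rough sketch of what is typed on screen; the example values for the inputs and labels are inferred from the output read out below:

```python
# In-memory data (values inferred from the output described below).
inputs = [0, 1, 2, 3]
labels = [1, 0, 1, 0]

# Pass a tuple of (inputs, labels) to create the Dataset.
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

# Loop through the Dataset to see what we have inside.
for item in dataset:
    print(item)
```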
00:02:08.000 | Okay, so we can see that we have these Tensor objects,
00:02:12.000 | which are the data types or the array types that TensorFlow
00:02:16.000 | will read. And we see, okay, it's a NumPy integer,
00:02:20.000 | an int32 here, and the value is
00:02:24.000 | 0. And we have 1, 2, 3. Okay, so these are obviously
00:02:28.000 | our inputs. Now, on the other side, we have our labels.
00:02:32.000 | We have NumPy 1, 0, 1, and 0.
00:02:36.000 | So, incredibly easy to get that set up.
00:02:40.000 | But, obviously, we're not always going to read in as a list.
00:02:44.000 | We can also read in as a NumPy array as well.
00:02:48.000 | Which, again, is incredibly easy. In fact, it's essentially
00:02:52.000 | the exact same piece of code.
00:02:56.000 | So, we're just going to convert our initial list into an array.
00:03:00.000 | [typing]
00:03:04.000 | And also the labels as well.
00:03:08.000 | [typing]
00:03:12.000 | Let's see what we have here. Okay, so now we have an array rather than a list.
00:03:16.000 | And all we do is
00:03:20.000 | the exact same thing again.
00:03:24.000 | [typing]
00:03:28.000 | There we go. We have the exact same output again.
00:03:32.000 | So, lists and arrays, we can deal with them in the exact same way. There's no
00:03:36.000 | need to do anything differently for them both.
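A sketch of the NumPy version; note the Dataset line is identical, and the variable names here are assumptions:

```python
# Convert the initial lists into NumPy arrays.
inputs_array = np.array(inputs)
labels_array = np.array(labels)

# Exactly the same Dataset code as before.
dataset = tf.data.Dataset.from_tensor_slices((inputs_array, labels_array))

for item in dataset:
    print(item)  # same Tensor output as with the lists
```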
00:03:40.000 | Now, in the case of us using a Pandas DataFrame,
00:03:44.000 | we may want to do something a little bit different.
00:03:48.000 | [typing]
00:03:52.000 | So, we have our inputs.
00:03:56.000 | [typing]
00:04:00.000 | And our labels.
00:04:04.000 | [typing]
00:04:08.000 | Okay, so now we have a Pandas DataFrame
00:04:12.000 | rather than an array or a list. The best way to deal with this
00:04:16.000 | and the way that we will be dealing with this throughout this course is to simply
00:04:20.000 | convert these into a NumPy array. Which we can do
00:04:24.000 | really easily like this.
00:04:28.000 | [typing]
00:04:32.000 | And just add values onto the end. And that creates our array.
00:04:36.000 | So, if we wanted to, again,
00:04:40.000 | create a dataset from that, we just take this.
00:04:44.000 | [typing]
00:04:48.000 | Split each into a column. Take the values.
00:04:52.000 | [typing]
00:04:56.000 | [typing]
00:05:00.000 | And we get our dataset.
00:05:04.000 | [typing]
00:05:08.000 | Incredibly easy. So, the only difference here is that we are
00:05:12.000 | creating a 64-bit integer rather than a 32-bit integer.
00:05:16.000 | And this is because this is the preferred data type for Pandas.
00:05:20.000 | But for us, I mean, that doesn't really make much difference.
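The pandas path sketched out; the DataFrame column names are assumptions for illustration:

```python
# Build a DataFrame from the same data.
df = pd.DataFrame({'inputs': [0, 1, 2, 3], 'labels': [1, 0, 1, 0]})

# .values returns the underlying NumPy array of each column.
dataset = tf.data.Dataset.from_tensor_slices(
    (df['inputs'].values, df['labels'].values)
)

for item in dataset:
    print(item)  # dtype is now int64, pandas' preferred integer type
```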
00:05:24.000 | Now, we can also
00:05:28.000 | read data from file. And to do that, we
00:05:32.000 | again, use TensorFlow
00:05:36.000 | data. But we use something slightly different here.
00:05:40.000 | So, this is a newer feature, so it's not within the Dataset
00:05:44.000 | attribute; it's within experimental.
00:05:48.000 | We use make_csv_dataset.
00:05:52.000 | [typing]
00:05:56.000 | So, what we would do here is pass a CSV file
00:06:00.000 | into here. So, if we quickly get
00:06:04.000 | some.
00:06:08.000 | Okay. So, we've got this train.csv here. And this is just an
00:06:12.000 | extract from the Kaggle Titanic
00:06:16.000 | dataset. So, you can find that by searching Titanic
00:06:20.000 | Kaggle. And it will come up. And now we have all of these rows,
00:06:24.000 | each of which lists a passenger on the Titanic.
00:06:28.000 | So, all we do is train.csv.
00:06:32.000 | And then here, we can actually specify our batch
00:06:36.000 | sizes, what columns we want, the field delimiter, and
00:06:40.000 | all of these different things, which is incredibly useful. So,
00:06:44.000 | first, we'll do a batch size. And let's go with a batch size of
00:06:48.000 | 16. So, this simply means that we
00:06:52.000 | will read our data into our model 16 rows at a time.
00:06:56.000 | Our field delimiter.
00:07:00.000 | Because we're using a comma-separated file, we don't
00:07:04.000 | really need to specify this. But it's there just in case you are using
00:07:08.000 | something else, like a pipe-delimited file.
00:07:12.000 | Or a tab-delimited file.
00:07:16.000 | And then we can also
00:07:20.000 | specify specific columns as well. So, let's
00:07:24.000 | select a few.
00:07:28.000 | So, let's go with
00:07:32.000 | PassengerId, Survived, and Pclass.
00:07:36.000 | And then we also add in our label name, which is
00:07:48.000 | the target or output column.
00:07:52.000 | Now, one thing to note here is that our label name must also be
00:07:56.000 | within our selected column list.
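Putting those arguments together, the call looks roughly like this; only the options discussed here are shown:

```python
# Read train.csv 16 rows at a time; note this lives under
# tf.data.experimental rather than tf.data.Dataset.
dataset = tf.data.experimental.make_csv_dataset(
    'train.csv',
    batch_size=16,
    field_delim=',',  # the default; change for pipe- or tab-delimited files
    select_columns=['PassengerId', 'Survived', 'Pclass'],
    label_name='Survived',  # must also appear in select_columns
)
```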
00:08:00.000 | There we go. So, now we
00:08:04.000 | can view what we have.
00:08:08.000 | So, we can use this .take method.
00:08:12.000 | Just take a certain number of columns, or batches, sorry.
00:08:16.000 | Let's print item.
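A minimal sketch of that loop:

```python
# .take(1) pulls a single batch from the dataset.
for item in dataset.take(1):
    print(item)
```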
00:08:20.000 | Okay, so here we can see we have
00:08:24.000 | passenger ID, and we have a few values there.
00:08:28.000 | Pclass, a few values there. And then at the
00:08:32.000 | end we have our labels, which are not assigned a
00:08:36.000 | name, but they are here.
00:08:40.000 | Now, if we take a look at the passenger ID values here, we have
00:08:44.000 | 889, 826, and 273.
00:08:48.000 | If we look at the top here, these are obviously not the same.
00:08:52.000 | So, the reason that is, is because when we use this
00:08:56.000 | experimental make_csv_dataset method,
00:09:00.000 | it actually automatically shuffles the data, which is a good
00:09:04.000 | thing. Realistically, we always want to try and shuffle data before
00:09:08.000 | feeding it into our model. We want every batch that we have to essentially be
00:09:12.000 | a representative sample of the full dataset. However,
00:09:16.000 | we're not always necessarily going to be reading from file.
00:09:20.000 | So, if we're not doing that, we do something else.
00:09:24.000 | So, let's read in our train.csv.
00:09:28.000 | We're going to read it in with pandas.
00:09:36.000 | And then we want to create our dataset using the
00:09:46.000 | original method that we learned, which is
00:09:50.000 | from_tensor_slices.
00:09:54.000 | So, this time, let's take
00:09:58.000 | a few different inputs and create an array here.
00:10:02.000 | So, we might want
00:10:06.000 | Pclass, Age,
00:10:10.000 | and Parch.
00:10:14.000 | So, this will create an array with
00:10:18.000 | three columns and the number of rows equal to the number of samples
00:10:22.000 | that we have within our dataset. And then for the
00:10:26.000 | output label, which is always the second part of the tuple, we want
00:10:30.000 | to add survived.
00:10:34.000 | So, here we just need to change df to
00:10:38.000 | data to match
00:10:42.000 | up to our DataFrame up here. And now we have our
00:10:46.000 | dataset.
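A sketch of this step, reading the file with pandas and slicing out the three input columns and the label:

```python
# Read the full CSV with pandas this time.
data = pd.read_csv('train.csv')

# Three input columns give an array with three columns and one row per
# sample; 'Survived' is the label, the second part of the tuple.
dataset = tf.data.Dataset.from_tensor_slices(
    (data[['Pclass', 'Age', 'Parch']].values, data['Survived'].values)
)
```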
00:10:50.000 | So, the only issue now is that none of this is shuffled
00:10:54.000 | and none of it is batched. So, if we do for item in dataset.take(1),
00:11:02.000 | we see that we just have one sample. We want everything to be shuffled
00:11:06.000 | and batched. And to do that, all we
00:11:10.000 | need to do is write a single line of code with
00:11:14.000 | dataset.shuffle. And then here we just add a large
00:11:18.000 | number. The larger the dataset and the less shuffled it already is, the
00:11:22.000 | larger this number needs to be. So, if
00:11:26.000 | we have a very unrepresentative sample
00:11:30.000 | or batch, we need to increase this number.
00:11:34.000 | And then we set our batch. And let's go with
00:11:38.000 | 16 again. And there is also another argument that we can add here. So, obviously
00:11:42.000 | 16 probably won't fit perfectly into the number of
00:11:46.000 | samples that we have. So, say if we had batches of two
00:11:50.000 | and we had a total dataset size of nine,
00:11:54.000 | four of those twos would fit perfectly into eight, but then we'd have one
00:11:58.000 | sample left over at the end. So, what we can do
00:12:02.000 | if that is the case, is we can just drop the remainder to avoid any
00:12:06.000 | shape problems coming up later on. So, just drop_remainder
00:12:10.000 | equals True. And there we have it. We have shuffled and batched our dataset.
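That single line, sketched out; the shuffle buffer size of 10,000 here is just an arbitrary "large number" in the sense described above:

```python
# Shuffle with a large buffer, then batch 16 at a time; drop_remainder
# discards the final partial batch to avoid shape problems later on.
dataset = dataset.shuffle(10_000).batch(16, drop_remainder=True)
```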
00:12:14.000 | So, that is everything on the TensorFlow Dataset
00:12:18.000 | object. This is incredibly useful. So, definitely get familiar with it
00:12:22.000 | and we will also be using it a lot in this course as well.
00:12:26.000 | So, I hope that has been useful. As always, thank you for watching and I will see you in the next one.