Input Data Pipelines - TensorFlow Essentials #4
Chapters
1:50 pass a tuple containing the inputs and labels
2:55 convert our initial list into an array
3:41 using a pandas data frame
4:19 convert these into a numpy array
5:57 pass a csv data set
6:52 read our data into our model 16 rows at a time
8:11 take a certain number of columns or batches
9:43 create our data set using the original method
10:15 create an array with three columns
00:00:00.000 |
Hi and welcome to this video on TensorFlow Datasets. 00:00:04.000 |
So, a TensorFlow Dataset is an object type that TensorFlow supplies 00:00:09.000 |
that essentially makes our job a lot easier when building an input data pipeline. 00:00:15.000 |
First, when we are actually putting our data into our model, 00:00:19.000 |
a Dataset object makes it much, much easier: we literally just feed the 00:00:23.000 |
Dataset into our model training and everything, as long as it's set up correctly, 00:00:27.000 |
will run very smoothly. Second, we can batch and shuffle our Dataset 00:00:32.000 |
with a single line of code, which is honestly incredible. 00:00:36.000 |
And finally, we can also adjust our Dataset very easily 00:00:40.000 |
without taking up too much disk space. And alongside all of those, 00:00:44.000 |
the Dataset object is incredibly well optimized. 00:00:49.000 |
So, it's definitely well worth learning how to use this. 00:00:54.000 |
So, we'll just open up a new Jupyter Notebook here. 00:01:11.000 |
So, there are a few different ways of reading data into our Dataset. 00:01:56.000 |
It's incredibly easy. So, now we can loop through 00:02:08.000 |
Okay, so we can see that we have these Tensor objects, 00:02:12.000 |
which are the data types or the array types that TensorFlow 00:02:16.000 |
will read. And we see, okay, it's a NumPy integer, 00:02:24.000 |
0. And we have 1, 2, 3. Okay, so these are obviously 00:02:28.000 |
our inputs. Now, on the other side, we have our labels. 00:02:40.000 |
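As a minimal sketch of the step just described, a Dataset can be built from a tuple of inputs and labels and then looped over. The inputs 0 to 3 match the values read out above; the label values are an assumption, since the ones typed in the notebook are not captured in this transcript.

```python
import tensorflow as tf

# Inputs as read out in the video; the labels are made-up placeholders.
inputs = [0, 1, 2, 3]
labels = [1, 0, 1, 0]

# Passing a tuple (inputs, labels) pairs each input with its label.
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

# Each element is a pair of scalar Tensor objects.
for x, y in dataset:
    print(x, y)

# Plain Python values, handy for checking what the Dataset contains.
pairs = [(int(x.numpy()), int(y.numpy())) for x, y in dataset]
```

Python integers are converted to 32-bit integer tensors here, which is why the printed tensors report an int32 dtype.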
But, obviously, we're not always going to read data in as a list. 00:02:44.000 |
We can also read data in as a NumPy array. 00:02:48.000 |
Which, again, is incredibly easy. In fact, it's essentially 00:02:56.000 |
So, we're just going to convert our initial list into an array. 00:03:12.000 |
Let's see what we have here. Okay, so now we have an array rather than a list. 00:03:28.000 |
There we go. We have the exact same output again. 00:03:32.000 |
So, lists and arrays, we can deal with them in the exact same way. There's no 00:03:36.000 |
need to do anything differently for either. 00:03:40.000 |
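A rough sketch of the same step with NumPy arrays, again with placeholder label values that are not taken from the video. The call is identical to the list version; only the type of the inputs changes.

```python
import numpy as np
import tensorflow as tf

# Convert the initial lists into arrays (label values are assumed).
inputs = np.array([0, 1, 2, 3])
labels = np.array([1, 0, 1, 0])

# Same call as with lists: a tuple of (inputs, labels).
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

for x, y in dataset:
    print(x, y)

# The tensor dtype follows the array dtype, which for integer arrays
# depends on the platform and NumPy version.
values = [int(x.numpy()) for x, _ in dataset]
```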
Now, in the case of us using a Pandas DataFrame, 00:03:44.000 |
we may want to do something a little bit different. 00:04:12.000 |
rather than an array or a list. The best way to deal with this 00:04:16.000 |
and the way that we will be dealing with this throughout this course is to simply 00:04:20.000 |
convert these into a NumPy array. Which we can do 00:04:32.000 |
And just add .values onto the end. And that creates our array. 00:04:40.000 |
create a dataset from that, we just take this. 00:05:08.000 |
Incredibly easy. So, the only difference here is that we are 00:05:12.000 |
creating a 64-bit integer rather than a 32-bit integer. 00:05:16.000 |
And this is because 64-bit is the default integer type in Pandas. 00:05:20.000 |
But for us, I mean, that doesn't really make much difference. 00:05:36.000 |
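The DataFrame route might look like the sketch below. The column names and values are hypothetical stand-ins for whatever DataFrame was built in the notebook; the point is only that `.values` yields NumPy arrays, which then go into the Dataset as before.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical DataFrame; the real column names are not in the transcript.
df = pd.DataFrame({"inputs": [0, 1, 2, 3], "labels": [1, 0, 1, 0]})

# .values converts each column into a NumPy array.
inputs = df["inputs"].values
labels = df["labels"].values

dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

# Pandas stores these integers as 64-bit, so the tensors are int64.
print(dataset.element_spec)
```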
data. But we use something slightly different here. 00:05:40.000 |
So, this is a newer feature. So, it's not within the dataset 00:05:56.000 |
So, what we would do here is pass a CSV dataset 00:06:08.000 |
Okay. So, we've got this train.csv here. And this is just a 00:06:16.000 |
dataset. So, you can find that by searching Titanic 00:06:20.000 |
Kaggle. And it will come up. And now we have all of these rows 00:06:32.000 |
And then here, we can actually specify our batch 00:06:36.000 |
sizes, what columns we want, the field delimiter, and 00:06:40.000 |
all of these different things, which is incredibly useful. So, 00:06:44.000 |
first, we'll do a batch size. And let's go with a batch size of 00:06:52.000 |
will read our data into our model 16 rows at a time. 00:07:00.000 |
Because we're using a comma-separated file, we don't 00:07:04.000 |
really need to specify this. But it's there just in case you are using 00:07:36.000 |
And then we also add in our label name, which is 00:07:52.000 |
Now, one thing to note here is that our label name must also be 00:08:12.000 |
Just take a certain number of columns, or batches, sorry. 00:08:24.000 |
passenger ID, and we have a few values there. 00:08:32.000 |
end we have our labels, which are not assigned a 00:08:40.000 |
Now, if we take a look at the passenger ID values here, we have 00:08:48.000 |
If we look at the top here, these are obviously not the same. 00:08:52.000 |
So, the reason for that is that when we use this 00:09:00.000 |
we actually automatically shuffle data, which is a good 00:09:04.000 |
thing. Realistically, we always want to try and shuffle data before 00:09:08.000 |
feeding it into our model. We want every batch that we have to essentially be 00:09:12.000 |
a representative sample of the full dataset. However, 00:09:16.000 |
we're not always necessarily going to be reading from file. 00:09:20.000 |
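The CSV reading described above can be sketched with `tf.data.experimental.make_csv_dataset`. To keep the example self-contained it writes a tiny made-up file instead of the Kaggle Titanic train.csv, and the selected columns and the `Survived` label are assumptions, since the transcript does not capture which ones were typed in.

```python
import os
import tempfile

import tensorflow as tf

# Stand-in for train.csv: 32 made-up rows with Titanic-style column names.
rows = [f"{i},{i % 2},{1 + i % 3},{20.0 + i}" for i in range(1, 33)]
csv_text = "PassengerId,Survived,Pclass,Age\n" + "\n".join(rows) + "\n"
path = os.path.join(tempfile.mkdtemp(), "train.csv")
with open(path, "w") as f:
    f.write(csv_text)

dataset = tf.data.experimental.make_csv_dataset(
    path,
    batch_size=16,          # read 16 rows at a time
    field_delim=",",        # the default; shown only for completeness
    # The label column has to be among the selected columns.
    select_columns=["PassengerId", "Survived", "Pclass", "Age"],
    label_name="Survived",  # assumed label column
    num_epochs=1,           # one pass over the file, so the dataset is finite
)

# Each element is (dict of feature batches, label batch).
for features, label in dataset.take(1):
    print(features["PassengerId"].numpy())
    print(label.numpy())
```

Because shuffling is on by default, the passenger IDs in the printed batch will generally not be in file order.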
So, if we're not doing that, we do something else. 00:09:36.000 |
And then we want to create our dataset using the 00:09:58.000 |
a few different inputs and create an array here. 00:10:18.000 |
three columns and the number of rows equal to the number of samples 00:10:22.000 |
that we have within our dataset. And then for the 00:10:26.000 |
output label, which is always the second part of the tuple, we want 00:10:42.000 |
up to our data frame up here. And now we have our 00:10:46.000 |
dataset. So, the only issue is now, none of this is shuffled 00:11:02.000 |
we see that we just have one sample. We want everything to be shuffled 00:11:10.000 |
need to do is write a single line of code with 00:11:14.000 |
dataset.shuffle. And then here we just add a large 00:11:18.000 |
number. The larger the dataset and the less shuffled it is, the 00:11:38.000 |
16 again. And there is also another argument that we can add here. So, obviously 00:11:42.000 |
16 probably won't fit perfectly into the number of 00:11:46.000 |
samples that we have. So, say if we had batches of two 00:11:54.000 |
four of those twos would fit perfectly into eight, but then we'd have one 00:11:58.000 |
sample left over at the end. So, what we can do 00:12:02.000 |
if that is the case, is we can just drop the remainder to avoid any 00:12:06.000 |
shape problems coming up later on. So, just drop_remainder 00:12:10.000 |
equals True. And there we have it. We have shuffled and batched our dataset. 00:12:14.000 |
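A small sketch of the shuffle and batch step, using nine made-up samples with three feature columns and a batch size of two so the leftover sample is easy to see. The buffer size of 10000 is just a placeholder for "a large number"; the video uses a batch size of 16 on its own data.

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: nine samples, three feature columns each.
features = np.arange(27, dtype=np.float32).reshape(9, 3)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=np.int64)

# Inputs first, labels second in the tuple.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Shuffle with a large buffer, then batch; drop_remainder=True discards
# the final incomplete batch so every batch has the same shape.
dataset = dataset.shuffle(10000).batch(2, drop_remainder=True)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in dataset]
print(len(batch_shapes))  # four full batches; the ninth sample is dropped
print(batch_shapes[0])
```

Without `drop_remainder=True`, a fifth batch containing a single sample would be produced, which is the kind of shape mismatch mentioned above.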
So, that is everything on the TensorFlow Dataset 00:12:18.000 |
object. This is incredibly useful. So, definitely get familiar with it 00:12:22.000 |
and we will be using it a lot in this course as well. 00:12:26.000 |
So, I hope that has been useful. As always, thank you for watching and I will see you in