Lesson 12: Cutting Edge Deep Learning for Coders
Chapters
0:00
6:51 find initial centroids
12:48 use standard python loops
31:02 partition the data into some number of partitions
35:10 using the five dimensional clustering
48:07 look at a couple of examples of list comprehensions
53:40 start out by creating two arrays of zeros
1:00:38 combine the matrices
00:00:00.000 |
I'm just going to start by going back to clustering. 00:00:07.740 |
We're going to talk about clustering again in the next lesson or two in terms of an application 00:00:15.800 |
But specifically what I wanted to do was show you k-means clustering in TensorFlow. 00:00:29.720 |
There are some things which are easier to do in TensorFlow than PyTorch, mainly because 00:00:40.040 |
So there are some things that are just like, oh, there's already a method that does that, 00:00:47.360 |
And some things are just a bit neater in TensorFlow than in PyTorch. 00:00:53.020 |
And I actually found k-means quite easy to do. 00:00:55.720 |
But what I want to do is I'm going to try and show you a way to write custom TensorFlow 00:01:00.560 |
code in a kind of a really PyTorch-y way, in a kind of an interactive-y way, and we're 00:01:06.840 |
going to try and avoid all of the fancy, session-y, graph-y, scope-y business as much as possible. 00:01:18.040 |
So to remind you, the way we initially came at clustering was to say, Hey, what if we 00:01:24.760 |
were doing lung cancer detection in CT scans, and these were like 512x512x200 volumetric 00:01:35.760 |
things, which are too big to really run a whole CNN over conveniently. 00:01:42.560 |
So one of the thoughts to fix that was to run some kind of heuristic that found all 00:01:48.120 |
of the things that looked like they could vaguely be nodules, and then create a new 00:01:53.280 |
data set where you basically zoomed into each of those maybe nodules and created a small 00:01:57.720 |
little 20x20x20 cube or something, and you could then use a 3D CNN on that, or try Planar 00:02:07.520 |
And this general concept I wanted to remind you about because I feel like it's something 00:02:17.360 |
I've kind of kept on showing you ways of doing this. 00:02:19.360 |
Going back to lesson 7 with the fish, I showed you the bounding boxes, and I showed you the 00:02:24.760 |
The reason for all of that was basically to show you how to zoom into things and then 00:02:28.640 |
create new models based on those zoomed in things. 00:02:33.000 |
So in the fisheries case, we could really just use a lower-res CNN to find the fish 00:02:44.600 |
In the CT scan case, maybe we can't even do that, so maybe we need to use this kind of clustering heuristic instead. 00:02:53.480 |
It would be interesting to see what the winners use, but certainly particularly if you don't 00:02:57.880 |
have lots of time or you have a lot of data, heuristics become more and more interesting. 00:03:04.520 |
The reason a heuristic is interesting is you can do something quickly and approximately, something that 00:03:11.200 |
could have lots and lots of false positives, and that doesn't really matter. 00:03:16.800 |
Those false positives just mean extra data that you're feeding to your real model. 00:03:24.200 |
So you can always tune it as to how much time have I got to train my real model, and then 00:03:29.280 |
I can decide how many false positives I can handle. 00:03:33.040 |
So as long as your preprocessing model is better than nothing, you can use it to get 00:03:39.900 |
rid of some of the stuff that is clearly not a nodule. 00:03:44.900 |
For example, anything that is in the middle of the lung wall is not a nodule, anything 00:03:50.120 |
that is all white space is not a nodule, and so forth. 00:03:57.600 |
So we talked about mean shift clustering and how the big benefit of it is that it allows 00:04:06.360 |
us to build clusters without knowing ahead of time how many clusters there are. 00:04:13.680 |
Also without any special extra work, it allows us to find clusters which aren't kind of Gaussian in shape. 00:04:24.560 |
That's really important for something like a CT scan, where a cluster will often be like 00:04:27.600 |
a vessel, which is this really skinny long thing. 00:04:33.240 |
So k-means on the other hand is faster, I think it's n^2 rather than n^3 time. 00:04:42.340 |
We have talked particularly on the forum about dramatically speeding up mean shift clustering 00:04:46.840 |
using approximate nearest neighbors, which is something which we started making some 00:04:50.600 |
progress on today, so hopefully we'll have results from that maybe by next week. 00:04:55.800 |
But the basic naive algorithm certainly should be a lot faster for k-means, so that's one reason to use it. 00:05:03.720 |
So as per usual, we can start with some data, and we're going to try and figure out where the cluster centers are. 00:05:15.680 |
So one quick way to avoid hassles in TensorFlow is to create an interactive session. 00:05:24.280 |
So an interactive session basically means that you can call .run() on a computation graph 00:05:32.340 |
which doesn't return something, or .eval() on a computation graph that does return something, 00:05:39.000 |
and you don't have to worry about creating a graph or a session, or having a session within a with block. 00:05:49.140 |
So that's basically what happens when you call tf.InteractiveSession. 00:05:56.220 |
So by creating an interactive session, we can then do things one step at a time. 00:06:08.200 |
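To make the idea concrete, here is a minimal sketch of the interactive-session pattern being described, using the TensorFlow 1.x-era API this lesson is based on (names like tf.InteractiveSession may differ in later TensorFlow versions):

```python
import numpy as np
import tensorflow as tf  # assumes a TensorFlow 1.x-era API, as in this lesson

# An interactive session installs itself as the default session, so we can
# call .eval() on tensors (things that return a value) and .run() on ops
# (things that don't) without any `with tf.Session()` boilerplate.
sess = tf.InteractiveSession()

a = tf.constant(np.arange(6, dtype=np.float32))
b = a * 2

print(b.eval())   # runs the graph and returns a NumPy array: [ 0. 2. 4. 6. 8. 10.]
```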
So in this case, the first step in k-means is to pick some initial centroids. 00:06:15.780 |
So you basically start out and say, okay, if we're going to create however many clusters 00:06:19.960 |
-- in this case, n_clusters is 6 -- then start out by saying, okay, where might those 6 clusters be? 00:06:30.440 |
And for a long time with k-means, people picked them randomly. 00:06:35.960 |
But most practitioners realized that was a dumb idea soon enough, and a lot of people came up with their own heuristics. 00:06:42.560 |
In 2007, finally, a paper was published actually suggesting a heuristic. 00:06:46.480 |
I tend to use a very simple heuristic, which is what I use here in find_initial_centroids. 00:06:53.720 |
So to describe this heuristic, I will show you the code. 00:07:04.960 |
Basically -- and I'm going to run through it quickly, and then I'll run through it slowly. 00:07:08.560 |
Basically the idea is we first of all pick a single data point index, and then we select that data point. 00:07:19.280 |
So we have one randomly selected data point, and then we find what is the distance from 00:07:25.560 |
that randomly selected data point to every other data point. 00:07:30.720 |
And then we say, okay, what is the data point that is the furthest away from that randomly 00:07:38.520 |
selected data point, the index of it and the point itself. 00:07:43.320 |
And then we say, okay, we're going to append that to the initial centroids. 00:07:47.960 |
So say I picked at random this point as my random initial point, the furthest point away 00:07:58.080 |
So that would be the first centroid we picked. 00:08:03.000 |
Okay we're now inside a loop, and we now go back and we repeat the process. 00:08:07.600 |
So we now replace our random point with the actual first centroid, and we go through the same process again. 00:08:16.800 |
So if we had our first centroid here, our second one now might be somewhere over here. 00:08:27.200 |
The next time through the loop, therefore, this is slightly more interesting. 00:08:30.480 |
In all_distances, we're now going to have the distance between every one of our initial centroids and every data point. 00:08:41.160 |
So we've got a matrix, in this case it's going to be 2 by the number of data points. 00:08:47.120 |
So then we say, okay, for every data point, find the closest cluster. 00:08:55.120 |
So what's its distance to the closest initial centroid? 00:09:00.360 |
And then tell me which data point is the furthest away from its closest initial centroid. 00:09:07.240 |
So in other words, which data point is the furthest away from any centroid? 00:09:18.560 |
So let's look and see how we actually do that in TensorFlow. 00:09:23.200 |
So it looks a lot like NumPy, except in places you would expect to see np, we see tf, and 00:09:31.200 |
then we see the API is a little different, but not too different. 00:09:37.160 |
So to get a random number, we can just use random uniform. 00:09:43.680 |
We can tell it what type of random number we want, so we want a random int because we're 00:09:46.960 |
trying to get a random index, which is a random data point. 00:09:51.240 |
That's going to be between 0 and the number of data points we have. 00:10:06.200 |
Now you'll notice I've created something called vdata. 00:10:10.960 |
When we set up this k-means in the first place, the data was sent in as a NumPy array, and we wrapped it in a TensorFlow variable. 00:10:19.840 |
Now this is the critical thing that lets us make TensorFlow feel more like PyTorch. 00:10:26.920 |
Once I do this, the data is basically copied to the GPU, and so when I'm calling something on vdata, I'm working with data that's already on the GPU. 00:10:41.280 |
Now there's one problematic thing to be aware of, which is that the copying does not actually happen until you run the variable initializer. 00:10:56.440 |
So any time you call tf.Variable, if you then try to run something using that variable, 00:11:04.800 |
you'll get back an uninitialized variable error unless you call the global variables initializer in the meantime. 00:11:09.800 |
This is kind of a performance thing in TensorFlow: the idea is that you can set up lots 00:11:17.840 |
of variables at once and then call the initializer once, and it will do them all at once for you. 00:11:25.440 |
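As a sketch of that variable pattern (continuing with the interactive session from above; vdata is the attribute name used in the lecture):

```python
data = np.random.randn(1000, 2).astype(np.float32)   # some sample points on the CPU

# Wrapping the NumPy array in a Variable stages a copy onto the GPU...
vdata = tf.Variable(data)

# ...but nothing actually happens until the initializer runs; without this,
# using vdata raises an "uninitialized variable" error.
tf.global_variables_initializer().run()

print(vdata.eval().shape)   # (1000, 2) -- evaluated and copied back as NumPy
```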
So earlier on we created this k-means object. 00:11:33.900 |
We know that in Python when you create an object, it calls __init__ (that's just how 00:11:39.200 |
Python works), inside that we copied the data to the GPU by using tf.variable, and then 00:11:45.980 |
inside find_initial_centroids, we can now access that in order to basically do calculations with it on the GPU. 00:11:58.200 |
In TensorFlow, pretty much everything takes and returns a tensor. 00:12:02.480 |
So when you call random_uniform, it's giving us a tensor, an array of random numbers. 00:12:08.160 |
In this case, we just wanted one of them, so we have to use tf.squeeze to take that 00:12:13.200 |
tensor and turn it into a scalar, because we're then just using it as an index to grab a single data point. 00:12:22.080 |
So now that we've got that single item back, we then expand it back again into a tensor 00:12:29.160 |
because inside our loop, remember, this is going to be a list of initial centroids. 00:12:34.360 |
It's just that this list happens to be of length 1 at the moment. 00:12:40.980 |
So one of the tricks in making TensorFlow feel more like PyTorch is to use standard Python loops. 00:12:51.920 |
So in a lot of TensorFlow code where it's more serious performance intensive stuff, 00:12:57.440 |
you'll see people use TensorFlow-specific loops like tf.while_loop or tf.scan or tf.map_fn and so forth. 00:13:05.200 |
The challenge with using those kind of loops is it's basically creating a computation graph 00:13:10.480 |
You can't step through it, you can't use it in the normal Pythonic kind of ways. 00:13:16.760 |
So we can just use normal Python loops if we're careful about how we do it. 00:13:21.960 |
So inside our normal Python loop, we can use normal Python functions. 00:13:27.720 |
So here's a function I created which calculates the distance between everything in this tensor and everything in that tensor. 00:13:35.760 |
So all_distances, it looks very familiar because it looks a lot like the PyTorch code we had. 00:13:45.160 |
So for the first array, for the first tensor, we add an additional axis to axis 0, and for 00:13:54.560 |
the second we add an additional axis to axis 1. 00:13:58.720 |
So the reason this works is because of broadcasting. 00:14:03.320 |
So A, when it starts out, is a vector, and B is a vector. So is A a row or a column? 00:14:21.800 |
Well the answer is it's both and it's neither. 00:14:24.360 |
It's one-dimensional, so it has no concept of what direction it's looking. 00:14:30.040 |
So then what we do is we call expand_dims on axis 0, so that's rows. 00:14:37.400 |
So that basically says to A, you are now definitely a row vector. 00:14:42.960 |
You now have one row and however many columns, same as before. 00:14:48.960 |
And then whereas with B, we add an axis at the end. 00:14:54.360 |
So B is now definitely a column vector, it now has one column and however many rows we had before. 00:15:03.960 |
So with broadcasting, what happens is that each one gets broadcast along the other's new axis. 00:15:17.520 |
So we end up with a matrix containing the difference between every one of these items and every one of those items. 00:15:27.320 |
So that's like this simple but powerful concept of how we can do very fast GPU-accelerated 00:15:36.040 |
loops in less code than it would have taken to actually write the loop. 00:15:40.800 |
And we don't have to worry about out-of-bounds conditions or anything like that, it's all handled for us. 00:15:48.760 |
And once we've got that matrix, because in TensorFlow everything is a tensor, we can 00:15:54.440 |
call squared_difference rather than just a regular difference, and it gives us the squares of 00:15:59.920 |
those differences, and then we can sum over the last axis. 00:16:04.140 |
So the last axis is the dimensions, so we're just creating a Euclidean distance here. 00:16:15.680 |
So this gives us every distance between every element of A and every element of B. 00:16:28.420 |
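Here is a sketch of that broadcasting trick as a function (the name all_distances and the argument order follow the description above; this returns squared Euclidean distances, which is all we need for comparing points):

```python
def all_distances(a, b):
    # a: (n, d) points, b: (m, d) points (1-D vectors also work).
    # Adding an axis to each makes broadcasting produce every pairwise difference:
    #   (1, n, d) vs (m, 1, d)  ->  (m, n, d)
    diff = tf.squared_difference(tf.expand_dims(a, 0), tf.expand_dims(b, 1))
    # Sum over the coordinate axis to get an (m, n) matrix of squared distances.
    return tf.reduce_sum(diff, -1)
```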
So then let's say we've gone through a couple of loops. 00:16:32.760 |
So after a couple of loops, R is going to contain a few initial centroids. 00:16:38.680 |
So we now want to basically find out for every point how far away it is from its nearest initial centroid. 00:16:50.360 |
So when we go reduce_min with axis=0, then we know that that's going over the centroid axis 00:17:00.680 |
because that's the order we put things into our all_distances function. 00:17:03.960 |
So it's going to go through, well actually it's reducing across that axis, collapsing over the centroids. 00:17:13.160 |
So at the end of this, it says, alright, for every piece of our data, this is how far it is from its closest centroid. 00:17:26.200 |
And that returns the actual distance, because we said do the actual min. 00:17:31.440 |
So there's a difference between min and the arg version, argmin or argmax. 00:17:35.880 |
So argmax then says, okay, now go through all of the points. 00:17:41.680 |
We now know how far away they are from their closest centroid, and tell me the index of the one that is furthest away. 00:17:53.700 |
We used it quite a bit in part 1 of the course, but it's well worth making sure we understand how it works. 00:18:01.040 |
I think in TensorFlow, I think they're getting rid of these reduce_ prefixes. 00:18:10.700 |
So in some version you may find this is called min rather than reduce_min. 00:18:18.520 |
For those of you who don't have such a computer science background, a reduction basically 00:18:22.880 |
means taking something in a higher dimension and squishing it down into something that's 00:18:26.400 |
a lower dimension, for example, summing a vector and turning it into a scalar is called a reduction. 00:18:33.000 |
So this is a very TensorFlow-ish API, assuming that everybody is a computer scientist and 00:18:37.640 |
that you wouldn't look for min, you would look for reduce_min. 00:18:46.880 |
So generally speaking, you have to be a bit careful of data types. 00:18:54.600 |
I generally don't really think about data type problems until I get the error, but if 00:18:58.240 |
you get an error that says you passed an int64 into something that expected an int32, you know you just need a cast. 00:19:07.000 |
So we need to index something with an int32, so we just have to cast it. 00:19:11.120 |
And so this returns the actual point, which we append, and then this is very similar to NumPy: stacking 00:19:18.460 |
together the initial centroids to create a tensor of them. 00:19:25.680 |
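Putting those pieces together, a sketch of the heuristic as described (the method and attribute names here, find_initial_centroids, self.v_data and self.n_data, echo the lecture but are assumptions, not the exact notebook code):

```python
def find_initial_centroids(self, k):
    # Start from one randomly chosen data point.
    idx = tf.squeeze(tf.random_uniform([1], 0, self.n_data, dtype=tf.int64))
    r = tf.expand_dims(tf.gather(self.v_data, idx), 0)      # shape (1, d)

    initial_centroids = []
    for _ in range(k):
        dist = all_distances(self.v_data, r)        # (centroids so far, n points)
        closest = tf.reduce_min(dist, 0)            # each point's distance to its nearest centroid
        farthest = tf.argmax(closest, 0)            # the point farthest from every centroid so far
        # argmax returns int64; cast before using it as an index, as mentioned above.
        initial_centroids.append(tf.gather(self.v_data, tf.to_int32(farthest)))
        r = tf.stack(initial_centroids)             # replace r with the centroids picked so far
    return r
```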
So the code doesn't look at all weird or different, but it's important to remember that when we 00:19:34.000 |
run this code, nothing happens, other than that it creates a computation graph. 00:19:40.580 |
So when we call k.find_initial_centroids, nothing happens. 00:19:49.120 |
But because we're in an interactive session, we can now call .eval, and that actually runs that computation graph. 00:19:56.360 |
And it runs it, and it actually takes the data that's returned from that and copies 00:20:01.520 |
it off the GPU and puts it back in the CPU as a NumPy array. 00:20:08.120 |
So it's important to remember that if you call eval, we now have an actual genuine regular NumPy array. 00:20:16.180 |
And this is the thing that makes us be able to write code that looks a lot like PyTorch 00:20:23.200 |
code, because we now know that we can take something that's a NumPy array and turn it 00:20:26.960 |
into a GPU tensor like that, and we can take something that's a GPU tensor and turn it back into a NumPy array. 00:20:39.000 |
I suspect this might make TensorFlow developers shake at how horrible this is. 00:20:50.120 |
It's not really quite the way you're meant to do things I think, but it's super easy 00:21:00.080 |
This approach where we're calling .eval, you do need to be a bit careful. 00:21:06.000 |
If we were calling eval inside a loop and copying a really big 00:21:11.240 |
chunk of data between the GPU and the CPU again and again and again, that would be a performance problem. 00:21:17.560 |
So you do need to think about what's going on as you do it. 00:21:22.040 |
So we'll look inside the inner loop in a moment and just check. 00:21:29.720 |
As you can see, this little hacky heuristic does a great job. 00:21:37.000 |
It's a hacky heuristic I've been using for decades now. 00:21:44.480 |
In this case, a similar hacky heuristic did actually appear in a paper in 2007. 00:21:52.760 |
But it's always worth thinking about how can I pre-process my data to get it close to where 00:22:00.920 |
And often these kind of approaches are useful. 00:22:05.740 |
There's actually -- I don't know if we'll have time to maybe talk about it someday -- but 00:22:08.440 |
there's an approach to doing PCA, Principal Component Analysis, which has a similar flavor, 00:22:13.840 |
basically finding random numbers and finding the furthest numbers away from them. 00:22:26.560 |
Well, what we do next is we're going to be doing more computation in TensorFlow with 00:22:29.760 |
them, so we want to copy them back to the GPU. 00:22:34.800 |
And so because we copied them to the GPU, before we do an eval or anything later on, 00:22:41.240 |
we're going to have to make sure we call global_variables_initializer().run(). 00:22:46.080 |
Can you explain what happens if you don't create an interactive session? 00:22:52.760 |
So what the TensorFlow authors decided to do in their wisdom was to generate their own 00:22:58.560 |
whole concept of namespaces and variables and whatever else. 00:23:04.200 |
So rather than using Python's, there's TensorFlow. 00:23:10.040 |
And so a session is basically kind of like a namespace that holds the computation graphs 00:23:21.360 |
You can, and then there's this concept of a context manager, which is basically where 00:23:26.680 |
you have a with clause in Python and you say with this session, now you're going to do 00:23:36.120 |
You can have multiple computation graphs, so you can say with this graph, create these 00:23:42.920 |
Where it comes in very handy is if you want to say run this graph on this GPU, or stick 00:23:56.160 |
So without an interactive session, you basically have to create that session, you have to say 00:24:01.640 |
which session to use using a with clause, and then there's many layers of that. 00:24:06.480 |
So within that you can then create name scopes and variable scopes and blah blah blah. 00:24:11.200 |
So the annoying thing is the vast majority of tutorial code out there uses all of these concepts. 00:24:18.360 |
It's as if all of Python's OO and variables and modules didn't exist, and you use TensorFlow's versions of everything instead. 00:24:27.040 |
So I wanted to show you that you don't have to use any of these concepts, pretty much. 00:24:34.520 |
I haven't quite finished thinking through this, but have you tried? 00:24:51.200 |
If you had initially said I have seven clusters, or eight clusters, what you would find after 00:24:56.600 |
you hit your six is you'd all of a sudden start getting centroids that were very close to each other. 00:25:04.000 |
So it seems like you could somehow intelligently define a width of a cluster, or kind of look 00:25:12.960 |
for a jump in things dropping down and how far apart they are from some other cluster, 00:25:17.940 |
and programmatically come up with a way to decide the number of clusters. 00:25:26.680 |
Maybe then you're using TK means, I don't know. 00:25:36.760 |
There are certainly papers about figuring out the number of clusters in K-means. 00:25:44.720 |
So maybe during the week you could check one out, port it to TensorFlow, that would be really interesting. 00:25:51.520 |
And I just wanted to follow up what you said about sessions to emphasize that with a lot 00:25:57.400 |
of tutorials, you could make the code simpler by using an interactive session in a Jupyter notebook. 00:26:04.200 |
I remember when Rachel was going through a TensorFlow course a while ago and she kept 00:26:08.440 |
on banging her head against a desk with sessions and variable scopes and whatever else. 00:26:14.400 |
That was part of what led us to think, "Okay, let's simplify all that." 00:26:19.680 |
So step one was to take our initial centroids and copy them onto the GPU. 00:26:27.960 |
So the next step in the K-means algorithm is to take every point and assign them to 00:26:34.280 |
a cluster, which is basically to say for every point which of the centroids is the closest. 00:26:47.840 |
We'll get to that in a moment, but let's pretend we've done it. 00:26:50.460 |
This will now be a list of which centroid is the closest for every data point. 00:26:58.620 |
So then we need one more TensorFlow concept, which is how we update an existing variable. 00:27:15.760 |
And so we can actually call update_centroids to basically do that updating, and I'll show you how that works in a moment. 00:27:24.760 |
So basically the idea is we're going to loop through doing this again and again and again. 00:27:33.480 |
But when we just do it once, you can actually see it's nearly perfect already. 00:27:39.520 |
So it's a pretty powerful idea as long as your initial cluster centers are good. 00:27:47.120 |
So let's see how this works, assign_to_nearest. 00:27:59.960 |
The reason there's a single line of code is we already have the code to find the distance 00:28:04.740 |
between every piece of data and its centroid. 00:28:10.880 |
Now rather than calling tf.reduce_min, which returned the distance to its nearest centroid, 00:28:16.880 |
we call tf.argmin to get the index of its nearest centroid. 00:28:23.280 |
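A sketch of that step (again, the names are assumptions following the lecture):

```python
def assign_to_nearest(self, centroids):
    # all_distances gives a (n_clusters, n_points) matrix; argmin over axis 0
    # returns, for each point, the index of its nearest centroid.
    return tf.argmin(all_distances(self.v_data, centroids), 0)
```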
So generally speaking, the hard bit of doing this kind of highly vectorized code is figuring 00:28:28.840 |
out this number, which is what axis we're working with. 00:28:33.600 |
And so it's a good idea to actually write down on a piece of paper, for each of your tensors, what the axes are. 00:28:40.220 |
It's like it's time by batch, by row, by column, or whatever. 00:28:43.800 |
Make sure you know what every axis represents. 00:28:48.880 |
When I'm creating these algorithms, I'm constantly printing out the shape of things. 00:28:57.480 |
Another really simple trick, but a lot of people don't do this, is make sure that your 00:29:01.720 |
different dimensions actually have different sizes. 00:29:05.540 |
So when you're playing around with testing things, don't have a batch size of 10 and 10 of everything else. 00:29:13.320 |
I find it much easier to think in real numbers, so have a batch size of 8 and an n of 10 and so on. 00:29:19.560 |
Because every time you print out the shape, you'll find out exactly what everything is. 00:29:24.960 |
So this is going to return the nearest indices. 00:29:32.040 |
So then we can go ahead and update the centroids. 00:29:48.320 |
And if you know the computer science term for the thing you're trying to do, it's generally easy to find the function you need. 00:30:00.640 |
The only other way to find it is just to do lots and lots of searching through the documentation. 00:30:04.840 |
So in general, taking a set of data and sticking it into multiple chunks of data according 00:30:15.780 |
to some kind of criteria is called partitioning in computer science. 00:30:20.920 |
So I got a bit lucky when I first looked for this. 00:30:22.800 |
I googled for TensorFlow partition and bang, this thing popped up. 00:30:35.100 |
And this is where reading about GPU programming in general is very helpful. 00:30:41.520 |
Because in GPU programming there's this kind of smallish subset of things which everything 00:30:46.760 |
else is built on, and one of them is partitioning. 00:31:04.800 |
dynamic_partition partitions the data into some number of partitions using some indices. 00:31:10.600 |
And generally speaking, it's easiest to just look at some code. 00:31:16.480 |
We're going to create two partitions, we're calling them clusters, and it's going to split this data between them. 00:31:28.000 |
So 10 will go to the 0th partition, 20 will go to the 0th partition, 30 will go to the 1st partition, and so on. 00:31:38.280 |
So this is the nice thing is that there's a lot of these, you can see all this stuff. 00:31:42.720 |
There's so many functions available, often there's the exact function you need. 00:31:48.980 |
So we just take our list of indices, convert it to a list of int32s, pass in our data, 00:31:55.360 |
the indices and the number of clusters, and we're done. 00:32:00.040 |
This is now a separate array, basically a separate tensor for each of our clusters. 00:32:10.600 |
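The toy example being described, written out as a runnable snippet (the exact numbers in the notebook may differ):

```python
data    = tf.constant([10., 20., 30., 40.])
indices = tf.constant([0, 0, 1, 1], dtype=tf.int32)

# Split data into 2 partitions according to the index given for each element.
parts = tf.dynamic_partition(data, indices, 2)
print([p.eval() for p in parts])   # [array([10., 20.]), array([30., 40.])] (float32)
```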
So now that we've done that, we can then figure out what is the mean of each of those clusters. 00:32:19.440 |
So the mean of each of those clusters is our new centroid. 00:32:22.760 |
So what we're doing is we're saying which points are the closest to this one, and these 00:32:31.200 |
points are the closest to this one, okay, what's the average of those points? 00:32:44.280 |
And then we can basically say, okay, that's our new clusters. 00:32:52.000 |
So then just join them all together, concatenate them together. 00:33:04.380 |
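As a sketch, the whole update step might look like this (names assumed as before; tf.concat uses the TF 1.x argument order):

```python
def update_centroids(self, nearest_indices):
    # Group the points by the cluster they were assigned to...
    parts = tf.dynamic_partition(self.v_data,
                                 tf.to_int32(nearest_indices), self.n_clusters)
    # ...take the mean of each group, and join the means back into a (k, d) tensor.
    return tf.concat([tf.expand_dims(tf.reduce_mean(p, 0), 0) for p in parts], 0)
```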
Except for that dynamic_partition, well in fact including that dynamic_partition, that 00:33:08.960 |
was incredibly simple, but it was incredibly simple because we had a function that did exactly what we needed. 00:33:13.720 |
So because we created a variable up here, we have to call initializer.run. 00:33:20.120 |
And then of course before we can do anything with this tensor, we have to call .eval to 00:33:26.520 |
actually run the computation graph and copy the result back to our CPU. 00:33:38.000 |
So then we want to replace the contents of current centroids with the contents of updated 00:33:47.680 |
And so to do that, we can't just say equals, everything is different in TensorFlow, you have to call assign. 00:33:57.080 |
So this is the same as basically saying current centroids equals updated centroids, but it's 00:34:01.840 |
creating a computation graph that basically does that assignment on the GPU. 00:34:07.880 |
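So, assuming current_centroids is a tf.Variable and updated_centroids is a tensor of the same shape (names echoing the lecture), the sketch is just:

```python
# The TensorFlow way of saying "current_centroids = updated_centroids":
update_op = current_centroids.assign(updated_centroids)
update_op.eval()   # running it actually performs the assignment on the GPU
```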
How can we extrapolate this to other non-numeric data types such as words, images? 00:34:17.800 |
Well they're all numeric data types really, so an image is absolutely a numeric data type. 00:34:26.600 |
So it's just a bunch of pixels, you just have to decide what distance measure you want, 00:34:34.240 |
which generally just means deciding you're probably using Euclidean distance, but are 00:34:37.680 |
you doing it in pixel space, or are you picking one of the activation layers in a neural net? 00:34:44.000 |
For words, you would create a word vector for your words. 00:34:52.800 |
There's nothing specifically two-dimensional about this. 00:35:00.240 |
That's really the whole point, and I'm hoping that maybe during the week, some people will 00:35:03.360 |
start to play around with some higher dimensional data sets to get a feel for how this works. 00:35:08.560 |
Particularly if you can get it working on CT scans, that would be fascinating using 00:35:12.280 |
the five-dimensional clustering we talked about. 00:35:17.600 |
So here's what it looks like in total if we weren't using an interactive session. 00:35:23.960 |
You basically say with tf.session, that creates a session, but as default, that sets it to 00:35:30.960 |
the current session, and then within the with block, we can now run things. 00:35:36.520 |
And then k.run does all the stuff we just saw, so if we go to k.run, here it is. 00:35:48.600 |
So this is how you can create a complete computation graph in TensorFlow using a notebook. 00:35:56.800 |
Once you've got it working, you put it all together. 00:35:59.280 |
So you can see find_initial_centroids.eval, put it back into a variable again, assigned 00:36:13.140 |
Because we created a variable in the process there, we then have to rerun global_variable_initializer. 00:36:21.120 |
We could have avoided this I guess by not calling eval and just treating this as a variable 00:36:28.560 |
the whole time, but it doesn't matter, this works fine. 00:36:34.420 |
And then we just loop through a bunch of times, calling centroids.assign with the updated centroids. 00:36:50.440 |
What we should be doing after that is then calling update_centroids each time. 00:37:01.360 |
Then the nice thing is, because I've used a normal Python for loop here and I'm calling 00:37:08.520 |
dot eval each time, it means I can check, "Oh, have any of the cluster centroids moved?" 00:37:20.080 |
And if they haven't, then I will stop working. 00:37:24.040 |
So it makes it very easy to create dynamic for loops, which could be quite tricky sometimes with TensorFlow's own loop constructs. 00:37:35.960 |
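A sketch of what that outer loop might look like, with a plain Python for loop and an eval-based convergence check (this is a reconstruction of the idea, not the notebook's actual k.run; max_iters and the tolerance are made up):

```python
def run(self, max_iters=50, tol=1e-6):
    # assumes self.v_data was created and initialized in __init__
    centroids = tf.Variable(self.find_initial_centroids(self.n_clusters).eval())
    tf.global_variables_initializer().run()     # we created a new variable, so re-initialize
    prev = centroids.eval()
    for _ in range(max_iters):
        nearest = self.assign_to_nearest(centroids)
        centroids.assign(self.update_centroids(nearest)).eval()
        current = centroids.eval()
        # Because this is a plain Python loop, we can just check whether anything moved.
        if np.allclose(current, prev, atol=tol):
            break
        prev = current
    return centroids.eval()
```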
So that is the TensorFlow algorithm from end to end. 00:37:48.000 |
Rachel, do you want to pick out an AMA question? 00:37:53.820 |
So actually I kind of am helping start a company, I don't know if you've seen my talk on ted.com, 00:38:04.280 |
but I kind of show this demo of this interactive labelling tool, a friend of mine said that 00:38:12.360 |
he wanted to start a company to actually make that and commercialize it. 00:38:15.240 |
So I guess my short answer is I'm helping somebody do that because I think that's pretty 00:38:20.480 |
More generally, I've mentioned before I think that the best thing to do is always to scratch 00:38:26.680 |
an itch, so pick whatever you've been passionate about or something that's just driven you 00:38:36.840 |
If you have the benefit of being able to take enough time to do absolutely anything you 00:38:40.580 |
want, I felt like the three most important areas for applying deep learning when I last 00:38:48.880 |
looked, which was two or three years ago, were medicine, robotics, and satellite imagery. 00:38:55.440 |
Because at that time, computer vision was the only area that was remotely mature for 00:39:02.280 |
machine learning, deep learning, and those three areas all were areas that very heavily 00:39:09.160 |
used computer vision or could heavily use computer vision and were potentially very 00:39:17.040 |
Medicine is probably the largest industry in the world, I think it's $3 trillion in 00:39:20.480 |
America alone, robotics isn't currently that large, but at some point it probably will 00:39:26.760 |
become the largest industry in the world if everything we do manually is replaced with 00:39:34.720 |
And satellite imagery is massively used by military intelligence, so we have some of 00:39:39.880 |
the biggest budgets in the world, so I guess those three areas. 00:39:44.840 |
I'm going to take a break soon, but before I do, I might just introduce what we're going to do next. 00:40:05.440 |
So we're going to start on our NLP, and specifically translation deep dive, and we're going to 00:40:14.680 |
be really following on from the end-to-end memory networks from last week. 00:40:23.600 |
One of the things that I find kind of most interesting and most challenging in setting 00:40:28.080 |
up this course is coming up with good problem sets which are hard enough to be interesting 00:40:40.440 |
And often other people have already done that, so I was lucky enough that somebody else had 00:40:44.360 |
already shown an example of using sequence-to-sequence learning for what they called Spelling Bee. 00:40:53.120 |
Basically we start with this thing called the CMU Pronouncing Dictionary, which has 00:40:58.000 |
things that look like this, Zwicky, followed by a phonetic description of how to say Zwicky. 00:41:10.440 |
So the way these work, this is actually specifically an American pronunciation dictionary. 00:41:21.600 |
The vowel sounds have a number at the end showing how much stress is on each one. 00:41:28.440 |
So in this case you can see that the middle one is where most of the stress is, so it's 00:41:35.340 |
So here is the letter A, and it is pronounced A. 00:41:42.200 |
So the goal that we're going to be looking at after the break is to do the other direction, 00:41:48.680 |
which is to start with how do you say it and turn it into how do you spell it. 00:41:56.000 |
This is quite a difficult question because English is really weird to spell. 00:42:01.960 |
And the number of phonemes doesn't necessarily match the number of letters. 00:42:10.640 |
So this is going to be where we're going to start. 00:42:12.720 |
And then we're going to try and solve this puzzle, and then we'll use a solution from 00:42:15.840 |
this puzzle to try and learn to translate French into English using the same basic idea. 00:42:22.800 |
So let's have a 10 minute break and we'll come back at 7.40. 00:42:29.040 |
So just to clarify, I just want to make sure everybody understands the problem we're solving 00:42:34.320 |
So the problem we're solving is we're going to be told here is how to pronounce something, 00:42:39.280 |
and then we have to say here is how to spell it. 00:42:42.000 |
So this is going to be our input, and this is going to be our target. 00:42:46.160 |
So this is like a translation problem, but it's a bit simpler. 00:42:52.140 |
So we don't have pre-trained phoneme vectors or pre-trained letter vectors. 00:42:58.960 |
So we're going to have to do this by building a model, and we're going to have to create our own embeddings. 00:43:05.200 |
So in general, the first steps necessary to create an NLP model tend to look very, very similar. 00:43:14.600 |
I feel like I've done them in a thousand different ways now, and at some point I really need 00:43:19.560 |
to abstract this all out into a simple set of functions that we use again and again. 00:43:26.560 |
But let's go through it, and if you've got any questions about any of the code or steps 00:43:36.200 |
So the basic pronunciation dictionary is just a text file, and I'm going to just grab the 00:43:45.360 |
lines which are actual words so they have to start with a letter. 00:43:52.360 |
So we're going to go through every line in the text file. 00:44:06.200 |
Here's a handy thing that a lot of people don't realize you can do in Python. 00:44:09.960 |
When you call open, that returns a generator which lists all of the lines. 00:44:14.860 |
So if you just go for l in openblah, that's now looping through every line in that file. 00:44:23.120 |
So I can then say filter those which start with a lowercase letter, and then strip off 00:44:44.120 |
So that's basically the steps necessary to separate out the word from the pronunciation. 00:44:52.200 |
And then the pronunciation is just white space delimited, so we can then split that. 00:44:59.040 |
And that's the steps necessary to get the word and the pronunciation as a set of phonemes. 00:45:05.480 |
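A sketch of that parsing step (the file path and encoding are assumptions about the raw CMU dictionary file):

```python
import string

PHONEME_FILE = 'cmudict.txt'    # hypothetical local path to the CMU pronouncing dictionary

pairs = []
for l in open(PHONEME_FILE, encoding='latin1'):    # the file isn't pure ASCII
    if l[:1].lower() in string.ascii_lowercase:    # keep only lines that start with a letter
        parts = l.strip().split()                  # the pronunciation is whitespace-delimited
        pairs.append((parts[0].lower(), parts[1:]))   # (word, list of phonemes)
```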
So as we tend to pretty much always do with these language models, we next need to get 00:45:09.980 |
a list of what are all of the vocabulary items. 00:45:13.200 |
So in this case, the vocabulary items are all the possible phonemes. 00:45:17.360 |
So we can create a set of every possible phoneme, and then we can sort it. 00:45:27.160 |
And what we always like to do is get an extra character or an extra object in position 0 to use for padding. 00:45:40.120 |
So that's why I'm going to use underscore as our special padding letter here. 00:45:50.560 |
This is our special padding one, which is going to be index 0. 00:45:54.080 |
And then there's r, r and r with 3 different levels of stress, and so forth. 00:46:00.520 |
Now the next thing that we tend to do anytime we've got a list of vocabulary items is to create the mapping going the other way. 00:46:08.560 |
So we go from phoneme to index, which is just a dictionary where we enumerate through all 00:46:15.680 |
of our phonemes and put it in the opposite order. 00:46:21.840 |
I know we've used this approach a thousand times before, but I just want to make sure 00:46:29.400 |
When you use enumerate in Python, it doesn't just return each phoneme, but it returns a 00:46:34.400 |
tuple that contains the index of the phoneme and then the phoneme itself. 00:46:41.700 |
So then if we go value,key, that's now the phoneme followed by the index. 00:46:47.120 |
So if we turn that into a dictionary, we now have a dictionary which you can give a phoneme and it will give you back the index. 00:46:56.720 |
Again with our special underscore at the front. 00:47:00.760 |
We've got one extra thing we'll talk about later, which is an asterisk. 00:47:07.120 |
And so again, to go from letter to letter index, we just create a dictionary which reverses it in the same way. 00:47:17.280 |
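A sketch of those vocabularies and index mappings (variable names are illustrative; pairs comes from the parsing sketch above):

```python
import string

# '_' is the special padding item at index 0; phonemes are sorted for reproducibility.
phonemes = ['_'] + sorted(set(p for w, ps in pairs for p in ps))
p2i = {v: k for k, v in enumerate(phonemes)}     # phoneme -> index

# padding, the 26 letters, plus the extra '*' token mentioned above: 28 in total.
letters = '_' + string.ascii_lowercase + '*'
l2i = {v: k for k, v in enumerate(letters)}      # letter -> index
```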
So now that we've got our phoneme to index and letter to index, we can use that to convert 00:47:25.280 |
this data into numeric data, which is what we always do with these language models. 00:47:35.120 |
We can pick some maximum length word, so I'm just going to say 15. 00:47:41.240 |
So we're going to create a dictionary which maps from each word to a list of phonemes, 00:47:54.880 |
So this dictionary comprehension is a little bit awkward, so I thought this would be a good 00:48:00.520 |
opportunity to talk about dictionary comprehensions and list comprehensions for a moment. 00:48:06.160 |
So we're going to pause this in a moment, but first of all, let's look at a couple of examples of list comprehensions. 00:48:12.880 |
So the first thing to note is when you go something like this, a string x, y, z or this 00:48:17.520 |
string here, Python is perfectly happy to consider that a list of letters. 00:48:22.600 |
So Python considers this the same as being a list of x, y, z. 00:48:26.920 |
So you can think of this as two lists, a list of x, y, z and a list of a, b, c. 00:48:32.040 |
So here is the simplest possible list comprehension. 00:48:37.640 |
So go through every element of a and put that into a list. 00:48:42.600 |
So if I call that, that returns exactly what I started with. 00:48:50.080 |
What if now we replaced 'o' with another list comprehension? 00:48:59.680 |
So what that's going to do is it's now going to return a list for each list. 00:49:08.240 |
So this is one way of pulling things out of sub-lists, is to basically take the thing 00:49:12.760 |
that was here and replace it with a new list comprehension, and that's going to give you 00:49:20.520 |
Now the reason I wanted to talk about this is because it's quite confusing. 00:49:23.920 |
In Python you can also write this, which is different. 00:49:28.680 |
So in this case, I'm going for each object in our a list, and then for each object within that object. 00:49:39.760 |
I don't have square brackets, it's just all laid out next to each other. 00:49:44.340 |
So I find this really confusing, but the idea is you're meant to think of this as just being nested for loops. 00:49:52.800 |
And so what this does is it goes through x, y, z and then a, b, c, and then in x, y, z 00:49:58.720 |
it goes through each of x and y and z, but because there's no embedded set of square 00:50:02.840 |
brackets, that actually ends up flattening the list. 00:50:08.020 |
So we're about to see an example of the square bracket version, and pretty soon we'll be seeing 00:50:26.280 |
It's very useful to be able to flatten a list, it's very useful to be able to do things with 00:50:32.160 |
And then just to be aware that any time you have any kind of expression like this, you 00:50:38.080 |
can replace the thing here with any expression you like. 00:50:45.000 |
So we could say, for example, we could say o.upper(), so you can basically map different 00:51:05.160 |
computations to each element of a list, and then the second thing you can do is put an if at the end to filter the list. 00:51:38.720 |
You can create any list comprehension you like by putting computations here, filters 00:51:43.040 |
here and optionally multiple lists of lists here. 00:51:48.920 |
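The toy example being walked through here, written out:

```python
a = ['xyz', 'abc']                 # Python happily treats each string as a list of letters

[o for o in a]                     # ['xyz', 'abc']
[[p for p in o] for o in a]        # [['x','y','z'], ['a','b','c']]   nested
[p for o in a for p in o]          # ['x','y','z','a','b','c']        flattened
[o.upper() for o in a]             # ['XYZ', 'ABC']                   map a computation
[o for o in a if o[0] == 'x']      # ['xyz']                          filter
```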
The other thing you can do is replace the square brackets with curly brackets, in which 00:51:53.720 |
case you need to put something before a colon and something after a colon, the thing before 00:51:58.440 |
is your key and the thing after is your value. 00:52:01.520 |
So here we're going for, oh, and then there's another thing you can do, which is if the 00:52:05.920 |
thing you're looping through is a bunch of lists or tuples or anything like that, you can unpack them right there in the for clause. 00:52:14.880 |
So this is the word, and this is the list of phonemes. 00:52:18.840 |
So the lowercase word will be our key in our dictionary, and the 00:52:26.200 |
values will be lists, so we're doing it just like we did down here. 00:52:32.280 |
And the list will be, let's go through each phoneme and go phoneme to index. 00:52:40.280 |
So now we have something that maps from every word to its list of phoneme indexes. 00:52:53.360 |
We can find out what the maximum length of anything is in terms of how many phonemes it has. 00:53:02.000 |
We can just go through every one of those dictionary items, calling len on each one, and take the max. 00:53:12.800 |
So you can see combining list comprehensions with other functions is also powerful. 00:53:22.400 |
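A sketch of that dictionary comprehension and the max-length computation (p2i and pairs come from the earlier sketches; the length cutoff of 15 follows the lecture, while the alphabetic-word filter is an assumption to keep the sketch self-contained):

```python
maxlen = 15

# word -> list of phoneme indices, keeping only plain words that fit our maximum spelling length
pronounce_dict = {w: [p2i[p] for p in ps]
                  for w, ps in pairs if w.isalpha() and len(w) <= maxlen}

# the longest pronunciation, measured in phonemes
maxlen_p = max(len(v) for v in pronounce_dict.values())
```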
So finally we're going to create our nice square arrays. 00:53:31.280 |
Normally we do this with Keras's pad_sequences, but just for a change, we're going to do this manually. 00:53:40.440 |
So the key is that we start out by creating two arrays of zeros, because all the padding is zero. 00:53:48.560 |
So if we start off with all zeros, then we can just fill in the non-zeros. 00:53:56.360 |
This is going to be our actual spelling, that's our target labels. 00:54:00.220 |
So then we go through all of our, and we've permuted them randomly, so randomly ordered 00:54:04.880 |
things in the pronunciation dictionary, and we put into input all of the phoneme indices from 00:54:15.480 |
that pronunciation dictionary, and into labels we put the letter-to-index values. 00:54:22.800 |
So we now have one thing called input, one thing called labels that contains nice rectangular 00:54:28.360 |
arrays padded with zeros containing exactly what we want. 00:54:36.480 |
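A sketch of that padding step (shapes and names follow the description above; l2i, pronounce_dict, maxlen and maxlen_p come from the earlier sketches):

```python
n = len(pronounce_dict)
input_ = np.zeros((n, maxlen_p), dtype='int32')   # phoneme indices, zero-padded
labels = np.zeros((n, maxlen),   dtype='int32')   # letter indices, zero-padded

items = list(pronounce_dict.items())
for i, j in enumerate(np.random.permutation(n)):  # randomly ordered, as described
    word, phons = items[j]
    input_[i, :len(phons)] = phons                    # fill in the non-zero phonemes
    labels[i, :len(word)]  = [l2i[c] for c in word]   # fill in the non-zero letters
```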
I'm not going to worry about this line yet because we're not going to use it for the 00:54:42.560 |
So anywhere you see DEC something, just ignore that for now, we'll get back to that later. 00:54:49.080 |
train_test_split is a very handy function from sklearn that takes all of these lists 00:54:55.800 |
and splits them all in the same way with this proportion in the test set. 00:55:01.680 |
And so input becomes input_train and input_test, labels becomes labels_train and labels_test. 00:55:11.120 |
We've often written that manually, but this is a nice quick way to do it when you've got several arrays to split. 00:55:20.280 |
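For example (test_size here is an arbitrary choice; the module path is for recent scikit-learn versions):

```python
from sklearn.model_selection import train_test_split

(input_train, input_test,
 labels_train, labels_test) = train_test_split(input_, labels, test_size=0.1)
```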
So just to have a look at how many phonemes we have in our vocabulary, there are 70, how 00:55:25.720 |
many letters in our vocabulary, there's 28, that's because we've got that underscore and the asterisk as well as the 26 letters. 00:55:55.620 |
So the embedding is going to take every one of our phonemes. 00:56:00.240 |
Max_len_p is the maximum number of phonemes we have in any pronunciation. 00:56:07.200 |
And each one of those phonemes is going to go into an embedding. 00:56:13.400 |
And the lookup for that embedding is the vocab size for phonemes, which I think was 70. 00:56:21.880 |
And then the output is whatever we decide, what dimensionality we want. 00:56:26.600 |
And in experimentation, I found 120 seems to work pretty well. 00:56:30.840 |
I was surprised by how high that number is, but there you go, it is. 00:56:37.600 |
We started out with a list of phonemes, and then after we go through this embedding, we have a list of embedding vectors, one per phoneme. 00:57:02.640 |
So the basic idea is to take this big thing, which is all of our phonemes embedded, and 00:57:11.320 |
we want to turn it into a single distributed representation which contains all of the richness of that pronunciation. 00:57:22.440 |
Later on we're going to be doing the same thing with an English sentence. 00:57:27.000 |
And so we know that when you have a sequence and you want to turn it into a representation, an RNN is a good way to do that. 00:57:41.920 |
Because an RNN we know is good at dealing with things like state and memory. 00:57:47.320 |
So when we're looking at translation, we really want something which can remember where we are in the sequence. 00:57:55.280 |
So let's say we were doing this simple phonetic translation, the idea of have we just had a C really matters. 00:58:04.000 |
Because if we just had a C, then the H is going to make a totally different sound to 00:58:13.360 |
So an RNN we think is a good way to do this kind of thing. 00:58:17.640 |
And in general, this whole class of models remember is called seq2seq, sequence-to-sequence 00:58:25.360 |
models, which is where we start with some arbitrary length sequence and we produce some other arbitrary length sequence. 00:58:32.600 |
And so the general idea here is taking that arbitrary length sequence and turning it into 00:58:37.280 |
a fixed size representation using an RNN is probably a good first step. 00:58:53.560 |
So looking ahead, I'm actually going to be using quite a few layers of RNN. 00:59:03.040 |
So to make that easier, we've created a getRNN function. 00:59:15.280 |
You can put anything you like here, GRU or LSTM or whatever. 00:59:24.880 |
The kind of dropout that you use in an RNN is slightly different to normal dropout. 00:59:29.760 |
It turns out that it's best to drop out the same things at every time step. 00:59:41.240 |
There's a really good paper that explains why this is the case and shows that this is the right way to do it. 00:59:46.560 |
So this is why there's a special dropout parameter inside the RNN in Keras, because it does this for you. 00:59:56.960 |
So I put in a tiny bit of dropout here, and if it turns out that we overfit, we can always increase it later. 01:00:16.840 |
I don't know if you remember, but when we looked at doing RNNs from scratch last year, 01:00:36.840 |
we learned that you could actually combine the matrices together and do a single matrix multiplication. 01:00:42.960 |
If you do that, it's going to use up more memory, but it allows the GPU to be more highly parallel. 01:00:49.600 |
So if you look at the Keras documentation, it will tell you the different things you 01:00:53.560 |
can use, but since we're using a GPU, you probably always want to say consume_less='gpu'. 01:01:03.200 |
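A sketch of such a helper, written against the Keras 1-era API used in this lesson (the unit count and the 0.1 dropout are illustrative guesses, not the notebook's exact values):

```python
from keras.layers import LSTM

def get_rnn(return_sequences=True):
    # Keras 1-style arguments:
    #   dropout_W / dropout_U   - dropout on the input vs the recurrent connections
    #   consume_less='gpu'      - combine the gate matrices into one big matrix multiply
    return LSTM(240, dropout_W=0.1, dropout_U=0.1,
                consume_less='gpu', return_sequences=return_sequences)
```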
The other thing that we learned about last year is bidirectional RNNs. 01:01:09.760 |
And maybe the best way to come at this is actually to go all the way back and remind ourselves how RNNs work. 01:01:19.320 |
We haven't done much revision, but it's been a while since we've looked at RNNs in much detail. 01:01:24.680 |
So just to remind you, this is kind of our drawing of a totally basic neural net. 01:01:31.520 |
Square is input, circle is intermediate activations, hidden, and triangle is output, and arrows 01:01:40.000 |
represent affine transformations with non-linearities. 01:01:47.880 |
We can then have multiple copies of those to create deeper convolutions, for example. 01:01:57.160 |
And so the other thing we can do is actually have inputs going in at different places. 01:02:03.740 |
So in this case, if we were trying to predict the third character from first two characters, 01:02:08.040 |
we can use a totally standard neural network and actually have input coming in at two different places. 01:02:17.920 |
And then we realized that we could kind of make this arbitrarily large, but what we should 01:02:22.560 |
probably do then is make everything where an input is going to a hidden state be the same weight matrix. 01:02:28.240 |
So this color coding, remember, represents the same weight matrix. 01:02:33.040 |
So hidden to hidden would be the same weight matrix and hidden to output, and it's a separate one for each of those. 01:02:40.760 |
So then to remind you, we realized that we could draw that more simply like this. 01:02:46.160 |
So RNNs, when they're unrolled, just look like a normal neural network in which some of the weight matrices are tied together. 01:02:55.940 |
And if this is not ringing a bell, go back to lesson 5 where we actually build these 01:03:03.160 |
weight matrices from scratch and tie them together manually, so that will hopefully remind you how it works. 01:03:12.840 |
Now importantly, we can then take one of those RNNs and have the output go to the input of a second RNN. 01:03:26.360 |
And stacked RNNs basically give us richer computations in our recurrent neural nets. 01:03:35.280 |
And this is what it looks like when we unroll it. 01:03:39.160 |
So you can see here that we've got multiple inputs coming in, going through multiple layers of RNN, and creating multiple outputs. 01:03:47.020 |
But of course we don't have to create multiple outputs. 01:03:51.680 |
You could also get rid of these two triangles here and have just one output. 01:04:01.440 |
And remember in Keras, the difference is whether or not we say return_sequences=true or return_sequences=false. 01:04:08.760 |
This one you're seeing here is return_sequences=true. 01:04:19.680 |
So what we've got is input_train has 97,000 words. 01:04:36.600 |
It's 15 characters long plus the padding, and then labels is 15 because we chose earlier 01:05:00.520 |
on that our max length would be a 15 long spelling. 01:05:10.200 |
So after the embedding, so if we take one of those tens of thousands of words, remember it's a list of phoneme indices. 01:05:33.720 |
And then we're putting it into an embedding matrix which is 70 by 120. 01:05:45.720 |
And the reason it's 70 is that each of these phonemes is a number between 0 and 69. 01:05:57.360 |
So basically we go through and we get each one of these indexes and we look up the corresponding row of the embedding matrix. 01:06:02.720 |
So this is 5 here, then we find number 5 here. 01:06:15.720 |
"Are we then taking a sequence of these phonemes represented as 120 dimensional floating point 01:06:26.080 |
vectors and using an RNN to create a sequence of word2vec embeddings which we will then 01:06:33.960 |
So we're not going to use word2vec here, right? 01:06:36.320 |
Word2vec is a particular set of pre-trained embeddings. 01:06:39.760 |
We're not using pre-trained embeddings, we have to create our own embeddings. 01:06:47.960 |
So if somebody else later on wanted to do something else with phonemes, and we saved 01:06:53.360 |
the result of this, we could provide phoneme2vec. 01:06:56.960 |
And you could download them and use the fast.ai pre-trained phoneme2vec embeddings. 01:07:03.000 |
This is how embeddings basically get created. 01:07:05.760 |
It's people build models starting with random embeddings and then save those embeddings 01:07:11.560 |
and make them available for other people to use. 01:07:13.720 |
"I may be misinterpreting it, but I thought the question was getting at the second set 01:07:20.800 |
of embeddings when you want to get back to your words." 01:07:25.600 |
So let's wait until we get there, because we're going to create letters, not words, 01:07:31.720 |
and then we'll just join the letters together. 01:07:35.560 |
So we've got, as far as creating our embeddings, we've then got an RNN which is going to take 01:07:44.860 |
our embeddings and attempt to turn them into a single vector. 01:07:55.040 |
So we've got here return sequences by default is true, so this first RNN returns something at every time step. 01:08:08.800 |
And so if you want to stack RNNs on top of each other, every one of them needs return_sequences=true except the last. 01:08:18.440 |
So at the end of this one, this gives us a single vector which is the final state. 01:08:28.980 |
And bidirectional, you can totally do this manually yourself. 01:08:32.440 |
You take your input and feed it into an RNN, and then you reverse your input and feed it 01:08:39.760 |
into a different RNN, and then just concatenate the two together. 01:08:44.480 |
So Keras has something which does that for you, which is called bidirectional. 01:08:50.420 |
Bidirectional actually requires you to pass it an RNN, so it takes an RNN and returns two 01:08:58.000 |
copies of that RNN stacked on top of each other, one of which reverses its input. 01:09:05.480 |
That's interesting because often in language, what happens later influences what comes before. 01:09:12.660 |
For example, in French, the gender of your definite article depends on the noun that it refers to. 01:09:22.320 |
So you need to be able to look backwards or forwards in both directions to figure out 01:09:28.880 |
Or in any language with tense, what verb do you use depends on the tense and often also 01:09:37.080 |
depends on the details about the subject and the object. 01:09:40.280 |
So we want to be able to both look forwards and look backwards. 01:09:44.440 |
So that's why we want two copies of the RNN, one which goes from left to right and one 01:09:51.680 |
And indeed, we could assume that when you spell things, I'm not exactly sure how this 01:09:55.720 |
would work, but when you spell things, depending on what the later stresses might be or the 01:10:00.440 |
later details of the phonetics might be, that might change how you pronounce things earlier on. 01:10:07.120 |
Does the bidirectional RNN concat two RNNs or does it stack them? 01:10:24.760 |
It concatenates them, so you end up with the same shape of output as before, but it basically doubles the number of features. 01:10:34.840 |
So in this case, we have 240, so it just doubles those, and I think we had one question here. 01:10:47.720 |
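A tiny self-contained example of that doubling (sizes here are arbitrary; whether the 240 mentioned above is the per-direction size or the concatenated size isn't clear from the recording):

```python
from keras.layers import Input, LSTM, Bidirectional
from keras.models import Model

inp = Input(shape=(16, 120))                                # 16 timesteps of 120-dim embeddings
out = Bidirectional(LSTM(120, return_sequences=True))(inp)
Model(inp, out).summary()                                   # output shape: (None, 16, 240)
```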
Okay, so let's simplify this down a little bit. 01:11:17.040 |
Basically, we started out with a set of embeddings, and we've gone through two layers, we've gone 01:11:30.240 |
through a bidirectional RNN, and then we feed that to a second RNN to create a representation of the whole sequence. 01:11:46.820 |
So x at this point is a vector, because return sequence is equals false. 01:11:53.400 |
That vector, once we've trained this thing, the idea is it represents everything important 01:12:00.800 |
there is to know about this ordered list of phonemes, everything that we could possibly need to spell it. 01:12:08.520 |
So the idea is we could now take that vector and feed it into a new RNN, or even a few 01:12:16.800 |
layers of RNN, and that RNN could basically go through and, with return_sequences=true, produce an output at every time step. 01:12:27.360 |
It could spit out at every time step what it thinks the next letter in this spelling 01:12:34.000 |
And so this is how a sequence-to-sequence works. 01:12:36.200 |
One part, which is called the encoder, takes our initial sequence and turns it into a distributed 01:12:48.160 |
representation into a vector using, generally speaking, some stacked RNNs. 01:12:56.080 |
Then the second piece, called the decoder, takes the output of the encoder and passes 01:13:04.720 |
that into a separate stack of RNNs with return sequence equals true. 01:13:10.400 |
And those RNNs are taught to then generate the labels, in this case the spellings, or 01:13:22.080 |
Now in Keras, it's not convenient to create an RNN by handing it some initial state, some initial hidden state. 01:13:35.720 |
That's not really how Keras likes to do things. 01:13:47.320 |
Problem number two, if you do hand it to an RNN just at the start, it's quite hard for 01:13:53.120 |
the RNN to remember the whole time what is this word I'm meant to be translating. 01:14:00.400 |
It has to keep track of two things: one is what's the word I'm meant to be spelling, and the second is what's the letter I'm trying to spell right now. 01:14:07.780 |
So what we do with Keras is we actually take this whole state and we copy it. 01:14:18.840 |
So in this case we're trying to create a word that could be up to 15 letters long. 01:14:33.400 |
So we take this and we actually make 15 copies of it. 01:14:46.040 |
And those 15 copies of our final encoder state becomes the input to our decoder RNN. 01:14:54.760 |
So it seems kind of clunky, but it's actually not difficult to do. 01:15:01.480 |
We take the output from our encoder and we repeat it 15 times. 01:15:11.840 |
So we literally have 15 identical copies of the same vector. 01:15:16.320 |
And so that's how Keras expects to see things, and it also turns out that you get better 01:15:22.080 |
results when you pass into the RNN the state that it needs again and again at every time step. 01:15:28.000 |
So we're basically passing in something saying we're trying to spell this word, we're trying 01:15:32.280 |
to spell this word, we're trying to spell this word, we're trying to spell this word. 01:15:35.740 |
And then as the RNN goes along, it's generating its own internal state, figuring out what 01:15:42.760 |
have we spelt so far and what are we going to have to spell next. 01:15:45.440 |
Why can't we have return_sequences=true for the second bidirectional LSTM? 01:15:55.400 |
Not bidirectional for the second LSTM, we only have one bidirectional LSTM. 01:16:02.000 |
We don't want return_sequences=true here because we're trying to create a representation of the whole word. 01:16:11.880 |
So there's no point having something saying here's a representation of the first phoneme, 01:16:19.080 |
of the first 2, of the first 3, of the first 4, of the first 5, because we don't really 01:16:24.200 |
know exactly which letter of the output is going to correspond to which phoneme of the input. 01:16:30.360 |
And particularly when we get to translation, it can get much harder, like some languages 01:16:33.440 |
totally reverse the subject and object order or put the verb somewhere else. 01:16:37.640 |
So that's why we try to package up the whole thing into a single piece of state which has 01:16:43.760 |
all of the information necessary to build our target sequence. 01:16:49.160 |
So remember, these sequence-to-sequence models are also used for things like image captioning. 01:16:57.760 |
So with image captioning, you wouldn't want to have something that created a whole sequence of representations. 01:17:08.240 |
You want a single representation, which is like this is something that somehow contains 01:17:14.440 |
all of the information about what this is a picture of. 01:17:18.000 |
Or if you're doing neural language translation, here's my English sentence, I've turned it 01:17:24.560 |
into a representation of everything that it means so that I can generate my French sentence. 01:17:29.140 |
We're going to be seeing later how we can use return_sequences=True when we look at attentional models. 01:17:37.040 |
But for now, we're just going to keep things simple. 01:18:08.720 |
But if you remember back to Lesson 5, we talked about some of the challenges with that. 01:18:18.600 |
So if you're trying to create something which can parse some kind of markup block like this, 01:18:27.280 |
it has to both remember that you've just opened up a piece of markup, you're in the middle 01:18:31.960 |
of it, and then in here you have to remember that you're actually inside a comment block as well. 01:18:38.860 |
This kind of long-term dependency and memory and stateful representation becomes increasingly 01:18:44.760 |
difficult to do with CNNs as they get longer. 01:18:49.320 |
It's not impossible by any means, but RNNs are one good way of doing this. 01:18:57.800 |
But it is critical that we start with an embedding, because whereas with images we're already 01:19:03.680 |
given float-valued numbers that really represent the image, that's not true with text. 01:19:10.660 |
So with text we have to use embeddings to turn it into these nice numeric representations. 01:19:17.680 |
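To make the embedding step concrete, here is a tiny sketch; the vocabulary of 70 phonemes, the 120-dimensional vectors and the sequence length of 16 are illustrative assumptions.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# 70 possible phonemes, each mapped to a learned 120-d float vector
emb = Sequential([Embedding(70, 120, input_length=16)])
phoneme_ids = np.array([[3, 17, 42, 5] + [0] * 12])   # one padded example of integer indexes
print(emb.predict(phoneme_ids).shape)                 # (1, 16, 120)
```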
>> RNN is kind of a generic term here, so the specific network we use is an LSTM, but there are others? 01:19:42.540 |
>> Yeah, with all the ones we did in the last part of the course. 01:19:44.540 |
>> So is that like the LSTM would be the best for that task? 01:19:50.920 |
The GRUs and LSTMs are pretty similar, so it's not worth thinking about too much. 01:19:58.480 |
So at this point here, we now have 15 copies of x. 01:20:21.300 |
And so we now pass that into two more layers of RNN. 01:20:27.040 |
So this here is our encoder, and this here is our decoder. 01:20:33.300 |
There's nothing we did particularly in Keras to say this is an encoder, this is a decoder. 01:20:37.760 |
The important thing is the return_sequences=False here, and the RepeatVector here. 01:20:46.320 |
Somehow it has to take this single summary and work it through some layers of RNNs, until then 01:20:54.880 |
at the end we say, okay, here's a dense layer, and it's time distributed. 01:21:01.480 |
So remember that means that we actually have 15 dense layers. 01:21:08.320 |
And so each of these dense layers now has a softmax activation, which means that we 01:21:16.160 |
basically can then do an argmax on that to create our final list of letters. 01:21:22.080 |
So this is kind of our reverse embedding, if you like. 01:21:29.400 |
So the model is very little code, and once we've built it we can just fit it. And again, if things like 01:21:38.080 |
this are mysterious to you, go back and re-watch lessons 4, 5 and 6 to remind yourself how these 01:21:47.560 |
embeddings work and how this kind of time distributed dense works to give us, effectively, one dense layer per output time step. 01:21:58.480 |
It starts with our phoneme input, ends with our time distributed dense output. 01:22:07.560 |
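Putting those pieces together, a minimal sketch of the kind of model being described might look like this in Keras; the layer sizes, vocabulary sizes and sequence lengths are assumptions for illustration, not the exact values from the lesson notebook.

```python
from keras.models import Sequential
from keras.layers import (Embedding, Bidirectional, LSTM, RepeatVector,
                          TimeDistributed, Dense)

n_phonemes, n_letters = 70, 28          # assumed vocabulary sizes
maxlen_phonemes, maxlen_word = 16, 15   # input and output sequence lengths

model = Sequential([
    # encoder: phoneme indexes -> embeddings -> a single summary vector
    Embedding(n_phonemes, 120, input_length=maxlen_phonemes),
    Bidirectional(LSTM(240, return_sequences=True)),
    LSTM(240, return_sequences=False),       # return_sequences=False: one vector out
    # decoder: repeat that vector once per output letter, then two more RNN layers
    RepeatVector(maxlen_word),
    LSTM(240, return_sequences=True),
    LSTM(240, return_sequences=True),
    # effectively 15 copies of a dense softmax layer, one per output time step
    TimeDistributed(Dense(n_letters, activation='softmax')),
])
```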
Our targets are just indexes, remember we turn them into indexes. 01:22:11.480 |
So we use this handy sparse categorical cross entropy, just the same as our normal categorical 01:22:18.160 |
cross entropy, but rather than one hot encoding, we just skip the whole one hot encoding step and use the indexes directly. 01:22:26.360 |
We can go ahead and fit passing in our training data. 01:22:30.800 |
So that was our rectangular data of the phoneme indexes, our labels, and then we can use some 01:22:40.760 |
validation data that we set aside as well. 01:22:48.040 |
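A hedged sketch of the compile-and-fit step; the optimizer, batch size and the variable names x_train, y_train, x_valid, y_valid are assumptions (and the epochs keyword assumes Keras 2; older versions used nb_epoch).

```python
# sparse targets: y_train holds letter indexes directly (no one-hot encoding);
# depending on the Keras version they may need shape (n_words, 15) or (n_words, 15, 1)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=128, epochs=3,
          validation_data=(x_valid, y_valid))
```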
I found that the first three epochs, the loss went down like this. 01:22:52.280 |
The second three epochs it went down like this. 01:22:54.880 |
It seemed to be flattening out, so that's where I stopped it. 01:23:05.680 |
Now what I wanted to do was not just say what percentage of letters are correct, because 01:23:11.560 |
that doesn't really give you the right sense at all. 01:23:14.480 |
What I really want to know is what percentage of words are correct. 01:23:18.640 |
So that's what this little eval_keras function does. 01:23:22.860 |
It takes the thing that I'm trying to evaluate, calls .predict on it, and then does the argmax 01:23:31.960 |
as per usual to take that softmax output and turn it into a specific number, i.e. which character is predicted. 01:23:39.440 |
And then I want to check whether, for all of the characters, the real character equals the predicted one. 01:23:49.380 |
So this is going to return true only if every single item in the word is correct. 01:23:56.600 |
And so then taking the mean of that is going to tell us what percentage of the words did 01:24:03.520 |
And unfortunately the answer is not very many, 26%. 01:24:11.800 |
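Here is a minimal re-creation of that word-level accuracy check; the name eval_keras matches the lesson, but the argument names, target shapes and exact implementation are assumptions.

```python
import numpy as np

def eval_keras(model, x, y):
    preds = model.predict(x)             # (n_words, 15, n_letters) softmax outputs
    best = np.argmax(preds, axis=-1)     # most likely letter at each time step
    # a word only counts as correct if every single letter matches
    word_correct = np.all(best == np.squeeze(y), axis=-1)
    return word_correct.mean()

# e.g. eval_keras(model, x_valid, y_valid) came out around 0.26 in the lesson
```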
So we can go through 20 words at random, and we can print out all of the phonemes. 01:24:23.260 |
We can print out the actual word, and we can print out our prediction. 01:24:30.920 |
So here is a whole bunch of words that I don't really recognize, perturbations. 01:24:38.040 |
It should be spelled like that, slightly wrong. 01:24:46.960 |
So you can see some of the time the mistakes it makes are pretty clear. 01:24:51.220 |
So "larrow" could be spelled like that, but this seems perfectly reasonable. 01:25:02.840 |
And interestingly, what you find is that most of the time when it's way off, it's 01:25:15.040 |
on the longer words. And the reason for that is that this one, where it's terrible, has 01:25:23.520 |
by far the most phonemes: 11 of them. 01:25:31.080 |
So we had to somehow create a single representation that contained all of the information of all 01:25:45.380 |
11 of those phonemes, and then that single vector got passed to the decoder, and that was everything it had to go on. 01:25:58.400 |
So that's the problem with this basic encoder-decoder method. 01:26:06.280 |
And indeed, here is a graph from the paper which originally introduced attentional models. 01:26:24.960 |
What it showed is that as the sentence length got bigger, the standard approach which we're 01:26:36.080 |
talking about, the standard encoder-decoder approach, the accuracy absolutely died. 01:26:43.880 |
So what these researchers did was that they built a new kind of RNN model, called an attentional model. 01:26:51.960 |
And with the attentional model, the accuracy actually stayed pretty good. 01:26:59.080 |
So goal number 1 for the next couple of lessons is for me not to have a cold anymore. 01:27:08.720 |
So basically we're going to finish our deep dive into neural translation, and then we're going to look at time series. 01:27:16.440 |
Although we're not specifically looking at time series, it turns out that the best way 01:27:20.080 |
that I found for time series is not specific to time series at all. 01:27:30.440 |
Reinforcement learning was something I was planning to cover, but I just haven't found 01:27:38.040 |
almost any good examples of it actually being used in practice to solve important real problems. 01:27:44.720 |
And indeed, when you look at the -- have you guys seen the paper in the last week or two 01:27:50.720 |
about using evolutionary strategies for reinforcement learning? 01:27:54.600 |
Basically it turns out that using random search is better than reinforcement learning. 01:28:03.120 |
This paper, by the way, is ridiculously overhyped. 01:28:05.960 |
Evolutionary strategies are something that I was working on over 20 years ago, and 01:28:12.760 |
in those days, genetic algorithms, as we called them, used much more sophisticated 01:28:18.600 |
methods than DeepMind's brand new evolutionary strategies. 01:28:22.680 |
So people are rediscovering these randomized metaheuristics, which is great, but they're 01:28:28.800 |
still far behind where they were 20 years ago, though far ahead of reinforcement learning. 01:28:37.280 |
So given I try to teach things which I think are actually going to stand the test of time, 01:28:42.680 |
I'm not at all convinced that any current technique for reinforcement learning is going 01:28:45.840 |
to stand the test of time, so I don't think we're going to touch that. 01:28:50.880 |
Part 3, yeah, I think before that we might have a part 0 where we do practical machine 01:29:01.880 |
learning for coders, talk about decision tree ensembles and training test splits and stuff like that. 01:29:14.000 |
I'm sure Rachel and I are not going to stop doing this in a hurry, it's really fun and 01:29:20.160 |
interesting and we're really interested in your ideas about how to keep this going. 01:29:27.360 |
By the end of part 2, you guys have put in hundreds of hours, on average maybe 140 hours, 01:29:38.320 |
put together your own box, written blog posts, done hackathons; you're seriously into this now. 01:29:46.200 |
And in fact, I've got to say, this week's kind of been special for me. 01:29:51.920 |
This week's been the week where again and again I've spoken to various folks of you 01:29:56.240 |
guys and heard about how many of you have implemented projects at your workplace that 01:30:04.480 |
have worked and are now running and making your business money, or that you've achieved 01:30:10.000 |
the career thing that you've been aiming for, or that you've won yet another GPU at a hackathon, 01:30:17.600 |
or of course the social impact thing, where it's all these transformative and inspirational projects. 01:30:27.120 |
When Rachel and I started this, we had no idea if it was possible to teach people with no 01:30:36.120 |
specific required math background other than high school math, deep learning to the point that they could use it for real work. 01:30:44.260 |
We thought we probably could because I don't have that background and I've been able to. 01:30:50.960 |
But I've been kind of playing around with similar things for a couple of decades. 01:30:55.240 |
So it was a bit of an experiment, and this week's been the week that, for me, has shown it's worked. 01:31:02.560 |
So I don't know what part 3 is going to look like. 01:31:06.280 |
I think it will be a bit different because it'll be more of a meeting of minds amongst 01:31:11.840 |
a group of people who are kind of at the same level and thinking about the same kinds of things. 01:31:16.920 |
So maybe it's more of an ongoing keeping-our-knowledge-up-to-date kind of thing, it might be more 01:31:26.440 |
of us teaching each other, I'm not sure, I'm certainly interested to hear your ideas. 01:31:36.960 |
We don't normally have two breaks, but I think I need one today, and we're covering a lot. 01:31:41.040 |
So let's have a short break and we'll go for the last 20 minutes or so. 01:32:13.440 |
So, attention models. I actually really like these, I think they're great. 01:32:20.520 |
And really the paper that introduced these was quite an extraordinary paper; it introduced 01:32:28.080 |
both GRUs and attention models at the same time. 01:32:31.320 |
I think it might even have been before the author had finished his PhD, if I remember correctly. 01:32:43.700 |
And the basic idea of an attention model is actually pretty simple. 01:32:52.120 |
You'll see here, here is our decoder, and here's our embedding. 01:33:13.480 |
And notice here, remember that in my get_rnn, return_sequences=True is the default, so the decoder now is getting a whole sequence of states rather than a single vector. 01:33:30.200 |
Now the length of that sequence is equal to the number of phonemes. 01:33:38.160 |
And we know that there isn't a mapping 1 to 1 of phonemes to letters. 01:33:43.160 |
So this is kind of interesting to think about how we're going to deal with this. 01:33:53.600 |
And the states, because we started with a bidirectional layer, state 1 represents a combination of 01:34:00.880 |
everything that's come before the first phoneme and everything that's come after. 01:34:04.400 |
State 2 is everything that's come before the second phoneme and everything that's come after it. 01:34:09.840 |
So the states in a sense are all representing something very similar, but they've got a 01:34:17.080 |
Each one of these states represents everything that comes before and everything that comes 01:34:26.700 |
after that point, but with a focus on that phoneme. 01:34:33.160 |
So what we want to do now is create an RNN where the number of time steps needs 01:34:47.360 |
to be 15, not 16, because remember the length of the word we're creating is 15. 01:34:54.600 |
So we're going to have 15 output time steps, and at each point we wanted to have the opportunity 01:35:07.800 |
to look at all of these 16 output states, but we're going to go in with the assumption 01:35:15.760 |
that only some of those 16 are going to be relevant, but we don't know which. 01:35:24.560 |
So what we want to do is basically take each of these 16 states and do a weighted sum, 01:35:35.600 |
sum of weights times encoded states, where these weights somehow represent how important 01:35:51.300 |
is each one of those 16 inputs for calculating this output, and how important are each of 01:35:58.000 |
those 16 inputs for calculating this output, and so forth. 01:36:03.180 |
If we could somehow come up with a set of weights for every single one of those time 01:36:09.080 |
steps, then we can replace the length 16 thing with a single thing. 01:36:17.000 |
If it turns out that output number 1 only really depends on input number 1 and nothing 01:36:24.360 |
else, then basically that input, those weights are going to be 1, 0, 0, 0, 0, 0, 0, 0, right? 01:36:33.840 |
But if it turns out that output number 1 actually depends on phonemes 1 and 2 equally, then the weights would be 0.5, 0.5, 0, 0, and so on. 01:36:52.360 |
So in other words, we want some function, w_i equals some function, that returns the right 01:37:03.000 |
set of weights to tell us which bits of the encoded input to look at. 01:37:11.520 |
So it so happens we actually know a good way of learning functions. 01:37:28.600 |
So here's the paper, "Neural Machine Translation by Jointly Learning to Align and Translate." 01:37:46.560 |
It's not the clearest paper, in my opinion, in terms of understandability, but let me describe it. 01:38:09.280 |
Okay, let's describe how to read this equation. 01:38:16.260 |
When you see a probability like this, you can very often think of it as a loss function. 01:38:22.840 |
The idea of SGD, basically most of the time when we're using it, is to come up with a 01:38:29.720 |
model where the probabilities that the model creates are as high as possible for the true 01:38:36.080 |
data and as low as possible for the other data. 01:38:39.360 |
That's just another way of talking about a loss function. 01:38:43.440 |
So very often, where we would write a loss function, a paper will write a probability. 01:38:50.720 |
What this here says: earlier on they say that y is basically our outputs, which is very common notation. 01:38:57.880 |
And what this is saying is that the probability of the output at time step i, so at some particular 01:39:04.560 |
time step, depends on, so this bar here means depends on all of the previous outputs. 01:39:12.960 |
In other words, in our spelling thing, when we're looking at the fourth letter that we're 01:39:20.920 |
spelling, it depends on the three letters that we've spelled so far. 01:39:25.680 |
You can't have it depend on the later letters, that's cheating. 01:39:30.600 |
So this is basically a description of the problem, is that we're building something 01:39:34.280 |
which is time-dependent and where the i-th thing that we're creating can only be allowed 01:39:40.360 |
to depend on the previous i-1 things (the comma basically means 'and'), and it's also allowed 01:39:50.000 |
to depend on x; anything in bold is a vector, and this is the vector of inputs. 01:39:56.880 |
And so this here is our list of phonemes, and this here is our list of all of the letters we've generated so far. 01:40:08.360 |
So that whole thing, that whole probability, we're going to calculate using some function. 01:40:21.680 |
And because this is a neural net paper, you can be pretty sure that it's going to turn out to be a neural network. 01:40:27.000 |
And what are the things that we're allowed to calculate with? 01:40:30.440 |
Well we're allowed to calculate with the previous letter that we just translated. 01:40:41.200 |
The RNN hidden state that we've built up so far, and what's this? 01:40:54.080 |
The context vector is a weighted sum of annotations h. 01:41:03.920 |
So these are the hidden states that come out of our encoder, and these are some weights. 01:41:12.680 |
So I'm trying to give you enough information to try and parse this paper over the week. 01:41:22.920 |
The nice thing is that hopefully you guys have now read enough papers that you can look 01:41:29.960 |
at something like this and skip over it, and go, oh, that's just softmax. 01:41:36.840 |
Over time, your pattern recognition starts getting good. 01:41:42.160 |
You start seeing something like this, and you go, oh, that's a weighted sum. 01:41:45.000 |
And you say something like this, and you go, oh, that's softmax. 01:41:48.600 |
People who read papers don't actually read every symbol. 01:41:52.840 |
Their eye looks at it and goes, softmax, weighted sum, logistic function, got it. 01:41:59.560 |
As if it was like pieces of code, only this is really annoying code that you can't look 01:42:04.480 |
up in a dictionary and you can't run and you can't check it and you can't debug it. 01:42:12.120 |
So the alphas are things that came out of a softmax of 01:42:18.860 |
something called e. The other annoying thing about math notation is often you introduce something before you define it. 01:42:26.080 |
So here we are: later on they define e. What's e equal to? 01:42:35.720 |
Some function of the previous hidden state and the encoder state. 01:42:57.420 |
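For reference, here are those pieces written out together as they appear in the Bahdanau et al. paper: s_i is the decoder hidden state, h_j the encoder annotations, and T_x the number of input phonemes.

```latex
\begin{align*}
p(y_i \mid y_1,\ldots,y_{i-1},\mathbf{x}) &= g(y_{i-1}, s_i, c_i) \\
c_i &= \sum_{j=1}^{T_x} \alpha_{ij} h_j \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
e_{ij} &= a(s_{i-1}, h_j)
\end{align*}
```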
Now the important piece here is jointly trained. 01:43:02.440 |
Jointly trained means it's not like a GAN where we train a bit of discriminator and 01:43:08.560 |
It's not like one of these manual attentional models where we first of all figure out the 01:43:14.040 |
nodules are here and then we zoom into them and find them there. 01:43:18.120 |
Jointly trained means we create a single model, a single computation graph if you like, where 01:43:22.920 |
the gradients are going to flow through everything. 01:43:26.220 |
So we have to try and come up with a way basically where we're going to build a standard regular 01:43:31.520 |
RNN, but the RNN is going to use as the input at each time step this. 01:43:43.800 |
So we're going to have to come up with a way of actually making this mini neural net. 01:43:52.040 |
This is just a single one-hidden-layer standard neural net that's going to sit inside every time step of our RNN. 01:44:07.640 |
This whole thing is summarized in another paper. 01:44:16.200 |
This is actually a really cool paper, Grammar as a Foreign Language. 01:44:22.960 |
Lots of names you probably recognize here: Geoffrey Hinton, who's kind of a father of 01:44:29.440 |
deep learning; Ilya Sutskever, who's now I think chief scientist or something at OpenAI; and so on. 01:44:43.080 |
It basically says what if you didn't know anything about grammar and you attempted to 01:44:49.760 |
build a neural net which assigned grammar to sentences, it turns out you actually end 01:44:57.400 |
up with something more accurate than any rule-based grammar system that's been built. 01:45:04.040 |
One of the nice things they do is to summarize all the bits. 01:45:08.920 |
And again, this is where, if you were reading a paper for the first time and didn't 01:45:13.440 |
know what an LSTM was and went, oh, an LSTM is all these things, that's not going to mean much to you. 01:45:19.760 |
You have to recognize how people write stuff in papers: there's no point re-deriving the LSTM equations 01:45:27.320 |
in a paper, so basically you're going to have to go and find the LSTM paper or find a tutorial, 01:45:32.640 |
learn about LSTMs, come back, and in the same way here they summarize the attention model. 01:45:41.080 |
So they say we've adapted the attention model from [2]; if you go and have a look at [2], that's the attention paper we just saw. 01:45:51.000 |
But the nice thing is that because this came a little later, they've done a pretty good 01:45:55.360 |
job of trying to summarize it into a single page. 01:45:59.080 |
So during the week if you want to try and get the hang of attention, you might find 01:46:03.160 |
it good to have a look at this paper and look at their summary. 01:46:08.000 |
And you'll see that basically the idea is that it's a standard sequence to sequence 01:46:14.880 |
model; a standard sequence to sequence model means an encoder producing hidden states, with the final state handed off to a decoder. 01:46:27.240 |
And so we have two separate LSTMs, an encoder and a decoder, and now be careful of the notation: 01:46:36.520 |
the encoder states are going to be called h1 through h_TA, and the decoder 01:46:45.920 |
states d, which we're also going to call h_TA+1 through h_TA+TB. 01:46:54.280 |
So the inputs are 1 through TA, and here you can see it's defining a single-layer neural net. 01:47:06.640 |
So we've got our encoder states and our current decoder state; put them through 01:47:19.120 |
an affine transformation and a nonlinearity, put that through another affine transformation, stick it through a softmax, and use the result as the weights. 01:47:29.660 |
So there's like all of it in one little snapshot. 01:47:34.400 |
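To ground that one-page summary, here is a rough numpy sketch of the scoring step it describes: score each encoder state against the current decoder state with a one-hidden-layer net, softmax the scores, and take the weighted sum as the context vector. All shapes and the parameter names (W1, W2, v) are illustrative, not the paper's exact dimensions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(H, d_t, W1, W2, v):
    # H:   (T_A, n_hidden) encoder states h_1..h_TA
    # d_t: (n_hidden,)     the current decoder state
    scores = np.tanh(H @ W1 + d_t @ W2) @ v   # one score per encoder state, shape (T_A,)
    alpha = softmax(scores)                   # attention weights, sum to 1
    return alpha @ H                          # weighted sum of the encoder states

# random numbers, just to show the shapes working through
T_A, n_hidden, n_att = 16, 240, 64
H, d_t = np.random.randn(T_A, n_hidden), np.random.randn(n_hidden)
W1, W2 = np.random.randn(n_hidden, n_att), np.random.randn(n_hidden, n_att)
v = np.random.randn(n_att)
print(attention_context(H, d_t, W1, W2, v).shape)   # (240,)
```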
So don't expect this to make perfect sense the first time you see it necessarily, but 01:47:40.840 |
hopefully you can kind of see that these bits are all stuff you've seen lots of times before. 01:47:47.360 |
So next week we're going to come back and work through creating this code and seeing how it goes. 01:47:57.320 |
We have two questions. One is: won't the weightings be heavily impacted by the padding done to the input sequences? 01:48:07.920 |
The model learns to handle that; specifically, those weights will learn that the padding is always weighted zero. 01:48:12.800 |
It's not going to take very long to learn to create that pattern. 01:48:17.360 |
And is A shared among all ij pairs, or do we train a separate alignment for each pair? 01:48:23.240 |
No, a is not separately trained; the alphas are the output of a softmax. 01:48:34.560 |
And note that capital letters are matrices, right? 01:48:37.320 |
So we just have to learn a single W1 and a single W2. 01:48:41.960 |
But note that they're being applied to all of the encoded states and the current decoder state. 01:48:58.760 |
And in fact, it's easier to just abstract this all the way back and say it is some function. 01:49:11.800 |
This is the best way to think of it, it's some function of what? 01:49:15.240 |
Some function of the current decoder hidden state and all of the encoder states. 01:49:26.960 |
So those are the inputs to the function, and we just have to learn a set of weights that compute it. 01:49:44.560 |
So I don't feel like I want to introduce something new, so let's take one final AMA before we finish. 01:49:51.720 |
On dealing with unbalanced datasets: there's not really that much clever you can do about it. 01:50:10.160 |
Basically, if you've got, well, a great example would be one of the impact talks, which talked about 01:50:19.040 |
breast cancer detection from mammography scans, and this thing called the Dream Challenge, 01:50:26.880 |
where less than 0.3% of the scans actually had cancer. 01:50:40.200 |
I think the first thing you try to do with such an unbalanced dataset is to ignore the imbalance and see how it goes. 01:50:47.800 |
The reason that often it doesn't go well is that the initial gradients will tend to point 01:50:53.760 |
to say they never have cancer, because that's going to give you a very accurate model. 01:51:00.200 |
So one thing you can try and do is to come up with some kind of initial model which is 01:51:06.760 |
like maybe some kind of heuristic which is not terrible and gets it to the point where 01:51:11.400 |
the gradients don't always point to saying they never have cancer. 01:51:19.020 |
But the really obvious thing to do is to adjust your thing which is creating the mini batches 01:51:24.960 |
so that on every mini batch you grab like half of it as being people with cancer and half without. 01:51:34.560 |
So that way you can still go through lots and lots of epochs, the challenge is that 01:51:42.600 |
the people that do have cancer, you're going to see lots and lots and lots of times, so you risk overfitting to them. 01:51:52.440 |
And then basically there's kind of things between those two extremes. 01:51:57.240 |
So I think what you really need to do is figure out what's the smallest number of people with cancer you can get away with. 01:52:05.560 |
What's the smallest number where the gradients don't point to zero? 01:52:10.440 |
And then create a model where in every mini batch, you make, say, 10% of it people with cancer. 01:52:22.680 |
The good news is once it's working pretty well, you can then decrease the proportion that 01:52:31.080 |
has cancer, because you're already at a point where your model is not pointing off to "never has cancer". 01:52:36.680 |
So you can gradually start to change the sample to have less and less. 01:52:45.080 |
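As a concrete version of that mini-batch adjustment, here is a hedged sketch of a generator that forces a chosen fraction of positives into every batch; the function name, fraction and batch size are all assumptions, not code from the lesson.

```python
import numpy as np

def balanced_batches(x, y, batch_size=64, pos_frac=0.5):
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    n_pos = int(batch_size * pos_frac)
    while True:
        # the rare positives get reused (oversampled); negatives are plentiful
        p = np.random.choice(pos_idx, n_pos, replace=True)
        n = np.random.choice(neg_idx, batch_size - n_pos, replace=False)
        idx = np.random.permutation(np.concatenate([p, n]))
        yield x[idx], y[idx]

# e.g. model.fit_generator(balanced_batches(x_train, y_train), ...) in older Keras,
# lowering pos_frac gradually once training is stable
```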
>> So in this example where you're repeating the positive results over and over again, you're effectively oversampling them. 01:53:06.800 |
Could you get the same results by just throwing away a bunch of the negative examples? 01:53:07.800 |
You could do that, and that's the really quick way to do it, but that way you're not using 01:53:13.320 |
all of your data; the negative examples still carry useful information.