Lesson 12: Cutting Edge Deep Learning for Coders
Chapters
0:00
6:51 find initial centroids
12:48 use standard python loops
31:02 partition the data into some number of partitions
35:10 using the five dimensional clustering
48:07 look at a couple of examples of list comprehensions
53:40 start out by creating two arrays of zeros
1:00:38 combine the matrices
00:00:00.000 |
I'm just going to start by going back to clustering. 00:00:07.740 |
We're going to talk about clustering again in the next lesson or two in terms of an application 00:00:15.800 |
But specifically what I wanted to do was show you k-means clustering in TensorFlow. 00:00:29.720 |
There are some things which are easier to do in TensorFlow than PyTorch, mainly because 00:00:40.040 |
So there are some things that are just like, oh, there's already a method that does that, 00:00:47.360 |
And some things are just a bit neater in TensorFlow than in PyTorch. 00:00:53.020 |
And I actually found k-means quite easy to do. 00:00:55.720 |
But what I want to do is I'm going to try and show you a way to write custom TensorFlow 00:01:00.560 |
code in a kind of a really PyTorch-y way, in a kind of an interactive-y way, and we're 00:01:06.840 |
going to try and avoid all of the fancy, session-y, graph-y, scope-y business as much as possible. 00:01:18.040 |
So to remind you, the way we initially came at clustering was to say, Hey, what if we 00:01:24.760 |
were doing lung cancer detection in CT scans, and these were like 512x512x200 volumetric 00:01:35.760 |
things, which are too big to really run a whole CNN over conveniently. 00:01:42.560 |
So one of the thoughts to fix that was to run some kind of heuristic that found all 00:01:48.120 |
of the things that looked like they could vaguely be nodules, and then create a new 00:01:53.280 |
data set where you basically zoomed into each of those maybe nodules and created a small 00:01:57.720 |
little 20x20x20 cube or something, and you could then use a 3D CNN on that, or try Planar 00:02:07.520 |
And this general concept I wanted to remind you about because I feel like it's something 00:02:17.360 |
I've kind of kept on showing you ways of doing this. 00:02:19.360 |
Going back to lesson 7 with the fish, I showed you the bounding boxes, and I showed you the 00:02:24.760 |
The reason for all of that was basically to show you how to zoom into things and then 00:02:28.640 |
create new models based on those zoomed in things. 00:02:33.000 |
So in the fisheries case, we could really just use a lower-res CNN to find the fish 00:02:44.600 |
In the CT scan case, maybe we can't even do that, so maybe we need to use this kind of clustering heuristic instead. 00:02:53.480 |
It would be interesting to see what the winners use, but certainly particularly if you don't 00:02:57.880 |
have lots of time or you have a lot of data, heuristics become more and more interesting. 00:03:04.520 |
The reason a heuristic is interesting is you can do something quickly and approximately, something that 00:03:11.200 |
could have lots and lots of false positives, and that doesn't really matter. 00:03:16.800 |
Those false positives just mean extra data that you're feeding to your real model. 00:03:24.200 |
So you can always tune it as to how much time have I got to train my real model, and then 00:03:29.280 |
I can decide how many false positives I can handle. 00:03:33.040 |
So as long as your preprocessing model is better than nothing, you can use it to get 00:03:39.900 |
rid of some of the stuff that is clearly not a nodule. 00:03:44.900 |
For example, anything that is in the middle of the lung wall is not a nodule, anything 00:03:50.120 |
that is all white space is not a nodule, and so forth. 00:03:57.600 |
So we talked about mean shift clustering and how the big benefit of it is that it allows 00:04:06.360 |
us to build clusters without knowing ahead of time how many clusters there are. 00:04:13.680 |
Also without any special extra work, it allows us to find clusters which aren't kind of Gaussian in shape. 00:04:24.560 |
That's really important for something like a CT scan, where a cluster will often be like 00:04:27.600 |
a vessel, which is this really skinny long thing. 00:04:33.240 |
So k-means on the other hand is faster, I think it's n^2 rather than n^3 time. 00:04:42.340 |
We have talked particularly on the forum about dramatically speeding up mean shift clustering 00:04:46.840 |
using approximate nearest neighbors, which is something which we started making some 00:04:50.600 |
progress on today, so hopefully we'll have results from that maybe by next week. 00:04:55.800 |
But the basic naive algorithm certainly should be a lot faster for k-means, so that's one reason to use it. 00:05:03.720 |
So as per usual, we can start with some data, and we're going to try and figure out where the cluster centers are. 00:05:15.680 |
So one quick way to avoid hassles in TensorFlow is to create an interactive session. 00:05:24.280 |
So an interactive session basically means that you can call .run() on a computation graph 00:05:32.340 |
which doesn't return something, or .eval() on a computation graph that does return something, 00:05:39.000 |
and you don't have to worry about creating a graph or a session, or having a session within a with block. 00:05:49.140 |
So that's basically what happens when you call tf.InteractiveSession. 00:05:56.220 |
So by creating an interactive session, we can then do things one step at a time. 00:06:08.200 |
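To make the idea concrete, here is a minimal sketch of the interactive-session pattern being described, using the TensorFlow 1.x-era API this lesson is based on (names like tf.InteractiveSession may differ in later TensorFlow versions):

```python
import numpy as np
import tensorflow as tf  # assumes a TensorFlow 1.x-era API, as in this lesson

# An interactive session installs itself as the default session, so we can
# call .eval() on tensors (things that return a value) and .run() on ops
# (things that don't) without any `with tf.Session()` boilerplate.
sess = tf.InteractiveSession()

a = tf.constant(np.arange(6, dtype=np.float32))
b = a * 2

print(b.eval())   # runs the graph and returns a NumPy array: [ 0. 2. 4. 6. 8. 10.]
```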
So in this case, the first step in k-means is to pick some initial centroids. 00:06:15.780 |
So you basically start out and say, okay, if we're going to create however many clusters 00:06:19.960 |
-- in this case, n_clusters is 6 -- then start out by saying, okay, where might those 6 clusters be? 00:06:30.440 |
And for a long time with k-means, people picked them randomly. 00:06:35.960 |
But most practitioners realized that was a dumb idea soon enough, and a lot of people came up with their own heuristics. 00:06:42.560 |
In 2007, finally, a paper was published actually suggesting a heuristic. 00:06:46.480 |
I tend to use a very simple heuristic, which is what I use here in find_initial_centroids. 00:06:53.720 |
So to describe this heuristic, I will show you the code. 00:07:04.960 |
Basically -- and I'm going to run through it quickly, and then I'll run through it slowly. 00:07:08.560 |
Basically the idea is we first of all pick a single data point index, and then we select that data point. 00:07:19.280 |
So we have one randomly selected data point, and then we find what is the distance from 00:07:25.560 |
that randomly selected data point to every other data point. 00:07:30.720 |
And then we say, okay, what is the data point that is the furthest away from that randomly 00:07:38.520 |
selected data point, the index of it and the point itself. 00:07:43.320 |
And then we say, okay, we're going to append that to the initial centroids. 00:07:47.960 |
So say I picked at random this point as my random initial point, the furthest point away 00:07:58.080 |
So that would be the first centroid we picked. 00:08:03.000 |
Okay we're now inside a loop, and we now go back and we repeat the process. 00:08:07.600 |
So we now replace our random point with the actual first centroid, and we go through the same process again. 00:08:16.800 |
So if we had our first centroid here, our second one now might be somewhere over here. 00:08:27.200 |
The next time through the loop, therefore, this is slightly more interesting. 00:08:30.480 |
In all_distances, we're now going to have the distance between every one of our initial centroids and every data point. 00:08:41.160 |
So we've got a matrix, in this case it's going to be 2 by the number of data points. 00:08:47.120 |
So then we say, okay, for every data point, find the closest cluster. 00:08:55.120 |
So what's its distance to the closest initial centroid? 00:09:00.360 |
And then tell me which data point is the furthest away from its closest initial centroid. 00:09:07.240 |
So in other words, which data point is the furthest away from any centroid? 00:09:18.560 |
So let's look and see how we actually do that in TensorFlow. 00:09:23.200 |
So it looks a lot like NumPy, except in places you would expect to see np, we see tf, and 00:09:31.200 |
then we see the API is a little different, but not too different. 00:09:37.160 |
So to get a random number, we can just use random uniform. 00:09:43.680 |
We can tell it what type of random number we want, so we want a random int because we're 00:09:46.960 |
trying to get a random index, which is a random data point. 00:09:51.240 |
That's going to be between 0 and the number of data points we have. 00:10:06.200 |
Now you'll notice I've created something called vdata. 00:10:10.960 |
When we set up this k-means in the first place, the data was sent in as a NumPy array, and we wrapped it in a TensorFlow variable. 00:10:19.840 |
Now this is the critical thing that lets us make TensorFlow feel more like PyTorch. 00:10:26.920 |
Once I do this, the data is basically copied to the GPU, and so when I'm calling something on vdata, I'm working with data that's already on the GPU. 00:10:41.280 |
Now there's one problematic thing to be aware of, which is that the copying does not actually happen until you run the variable initializer. 00:10:56.440 |
So any time you call tf.Variable, if you then try to run something using that variable, 00:11:04.800 |
you'll get back an uninitialized variable error unless you call the global variables initializer in the meantime. 00:11:09.800 |
This is kind of a performance thing in TensorFlow: the idea is that you can set up lots 00:11:17.840 |
of variables at once and then call the initializer once, and it will do them all at once for you. 00:11:25.440 |
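As a sketch of that variable pattern (continuing with the interactive session from above; vdata is the attribute name used in the lecture):

```python
data = np.random.randn(1000, 2).astype(np.float32)   # some sample points on the CPU

# Wrapping the NumPy array in a Variable stages a copy onto the GPU...
vdata = tf.Variable(data)

# ...but nothing actually happens until the initializer runs; without this,
# using vdata raises an "uninitialized variable" error.
tf.global_variables_initializer().run()

print(vdata.eval().shape)   # (1000, 2) -- evaluated and copied back as NumPy
```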
So earlier on we created this k-means object. 00:11:33.900 |
We know that in Python when you create an object, it calls __init__ (that's just how 00:11:39.200 |
Python works), inside that we copied the data to the GPU by using tf.variable, and then 00:11:45.980 |
inside find_initial_centroids, we can now access that in order to basically do calculations with it on the GPU. 00:11:58.200 |
In TensorFlow, pretty much everything takes and returns a tensor. 00:12:02.480 |
So when you call random_uniform, it's giving us a tensor, an array of random numbers. 00:12:08.160 |
In this case, we just wanted one of them, so we have to use tf.squeeze to take that 00:12:13.200 |
tensor and turn it into a scalar, because we're then just using it as an index to grab a single data point. 00:12:22.080 |
So now that we've got that single item back, we then expand it back again into a tensor 00:12:29.160 |
because inside our loop, remember, this is going to be a list of initial centroids. 00:12:34.360 |
It's just that this list happens to be of length 1 at the moment. 00:12:40.980 |
So one of the tricks in making TensorFlow feel more like PyTorch is to use standard Python loops. 00:12:51.920 |
So in a lot of TensorFlow code where it's more serious performance intensive stuff, 00:12:57.440 |
you'll see people use TensorFlow-specific loops like tf.while_loop or tf.scan or tf.map_fn and so forth. 00:13:05.200 |
The challenge with using those kind of loops is it's basically creating a computation graph 00:13:10.480 |
You can't step through it, you can't use it in the normal Pythonic kind of ways. 00:13:16.760 |
So we can just use normal Python loops if we're careful about how we do it. 00:13:21.960 |
So inside our normal Python loop, we can use normal Python functions. 00:13:27.720 |
So here's a function I created which calculates the distance between everything in this tensor and everything in that tensor. 00:13:35.760 |
So all_distances, it looks very familiar because it looks a lot like the PyTorch code we had. 00:13:45.160 |
So for the first array, for the first tensor, we add an additional axis to axis 0, and for 00:13:54.560 |
the second we add an additional axis to axis 1. 00:13:58.720 |
So the reason this works is because of broadcasting. 00:14:03.320 |
So A, when it starts out, is a vector, and B is a vector. So is A a row or a column? 00:14:21.800 |
Well the answer is it's both and it's neither. 00:14:24.360 |
It's one-dimensional, so it has no concept of what direction it's looking. 00:14:30.040 |
So then what we do is we call expand_dims on axis 0, so that's rows. 00:14:37.400 |
So that basically says to A, you are now definitely a row vector. 00:14:42.960 |
You now have one row and however many columns, same as before. 00:14:48.960 |
And then whereas with B, we add an axis at the end. 00:14:54.360 |
So B is now definitely a column vector, it now has one column and however many rows we had before. 00:15:03.960 |
So with broadcasting, what happens is that each one gets broadcast along the other's new axis. 00:15:17.520 |
So we end up with a matrix containing the difference between every one of these items and every one of those items. 00:15:27.320 |
So that's like this simple but powerful concept of how we can do very fast GPU-accelerated 00:15:36.040 |
loops in less code than it would have taken to actually write the loop. 00:15:40.800 |
And we don't have to worry about out-of-bounds conditions or anything like that, it's all handled for us. 00:15:48.760 |
And once we've got that matrix, because in TensorFlow everything is a tensor, we can 00:15:54.440 |
call squared_difference rather than just a regular difference, and it gives us the squares of 00:15:59.920 |
those differences, and then we can sum over the last axis. 00:16:04.140 |
So the last axis is the dimensions, so we're just creating a Euclidean distance here. 00:16:15.680 |
So this gives us every distance between every element of A and every element of B. 00:16:28.420 |
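Here is a sketch of that broadcasting trick as a function (the name all_distances and the argument order follow the description above; this returns squared Euclidean distances, which is all we need for comparing points):

```python
def all_distances(a, b):
    # a: (n, d) points, b: (m, d) points (1-D vectors also work).
    # Adding an axis to each makes broadcasting produce every pairwise difference:
    #   (1, n, d) vs (m, 1, d)  ->  (m, n, d)
    diff = tf.squared_difference(tf.expand_dims(a, 0), tf.expand_dims(b, 1))
    # Sum over the coordinate axis to get an (m, n) matrix of squared distances.
    return tf.reduce_sum(diff, -1)
```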
So then let's say we've gone through a couple of loops. 00:16:32.760 |
So after a couple of loops, R is going to contain a few initial centroids. 00:16:38.680 |
So we now want to basically find out for every point how far away it is from its nearest initial centroid. 00:16:50.360 |
So when we go reduce_min with axis=0, then we know that that's going over the centroid axis 00:17:00.680 |
because that's the order we put things into our all_distances function. 00:17:03.960 |
So it's going to go through, well actually it's reducing across that axis, collapsing over the centroids. 00:17:13.160 |
So at the end of this, it says, alright, for every piece of our data, this is how far it is from its closest centroid. 00:17:26.200 |
And that returns the actual distance, because we said do the actual min. 00:17:31.440 |
So there's a difference between min and the arg version, argmin or argmax. 00:17:35.880 |
So argmax then says, okay, now go through all of the points. 00:17:41.680 |
We now know how far away they are from their closest centroid, and tell me the index of the one that is furthest away. 00:17:53.700 |
We used it quite a bit in part 1 of the course, but it's well worth making sure we understand how it works. 00:18:01.040 |
I think in TensorFlow, I think they're getting rid of these reduce_ prefixes. 00:18:10.700 |
So in some version you may find this is called min rather than reduce_min. 00:18:18.520 |
For those of you who don't have such a computer science background, a reduction basically 00:18:22.880 |
means taking something in a higher dimension and squishing it down into something that's 00:18:26.400 |
a lower dimension, for example, summing a vector and turning it into a scalar is called a reduction. 00:18:33.000 |
So this is a very TensorFlow-ish API, assuming that everybody is a computer scientist and 00:18:37.640 |
that you wouldn't look for min, you would look for reduce_min. 00:18:46.880 |
So generally speaking, you have to be a bit careful of data types. 00:18:54.600 |
I generally don't really think about data type problems until I get the error, but if 00:18:58.240 |
you get an error that says you passed an int64 into something that expected an int32, you know you just need a cast. 00:19:07.000 |
So we need to index something with an int32, so we just have to cast it. 00:19:11.120 |
And so this returns the actual point, which we append, and then this is very similar to NumPy: stacking 00:19:18.460 |
together the initial centroids to create a tensor of them. 00:19:25.680 |
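Putting those pieces together, a sketch of the heuristic as described (the method and attribute names here, find_initial_centroids, self.v_data and self.n_data, echo the lecture but are assumptions, not the exact notebook code):

```python
def find_initial_centroids(self, k):
    # Start from one randomly chosen data point.
    idx = tf.squeeze(tf.random_uniform([1], 0, self.n_data, dtype=tf.int64))
    r = tf.expand_dims(tf.gather(self.v_data, idx), 0)      # shape (1, d)

    initial_centroids = []
    for _ in range(k):
        dist = all_distances(self.v_data, r)        # (centroids so far, n points)
        closest = tf.reduce_min(dist, 0)            # each point's distance to its nearest centroid
        farthest = tf.argmax(closest, 0)            # the point farthest from every centroid so far
        # argmax returns int64; cast before using it as an index, as mentioned above.
        initial_centroids.append(tf.gather(self.v_data, tf.to_int32(farthest)))
        r = tf.stack(initial_centroids)             # replace r with the centroids picked so far
    return r
```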
So the code doesn't look at all weird or different, but it's important to remember that when we 00:19:34.000 |
run this code, nothing happens, other than that it creates a computation graph. 00:19:40.580 |
So when we call k.find_initial_centroids, nothing happens. 00:19:49.120 |
But because we're in an interactive session, we can now call .eval, and that actually runs that computation graph. 00:19:56.360 |
And it runs it, and it actually takes the data that's returned from that and copies 00:20:01.520 |
it off the GPU and puts it back in the CPU as a NumPy array. 00:20:08.120 |
So it's important to remember that if you call eval, we now have an actual genuine regular NumPy array. 00:20:16.180 |
And this is the thing that makes us be able to write code that looks a lot like PyTorch 00:20:23.200 |
code, because we now know that we can take something that's a NumPy array and turn it 00:20:26.960 |
into a GPU tensor like that, and we can take something that's a GPU tensor and turn it back into a NumPy array. 00:20:39.000 |
I suspect this might make TensorFlow developers shake at how horrible this is. 00:20:50.120 |
It's not really quite the way you're meant to do things I think, but it's super easy 00:21:00.080 |
This approach where we're calling .eval, you do need to be a bit careful. 00:21:06.000 |
If we were calling eval inside a loop and copying a really big 00:21:11.240 |
chunk of data between the GPU and the CPU again and again and again, that would be a performance problem. 00:21:17.560 |
So you do need to think about what's going on as you do it. 00:21:22.040 |
So we'll look inside the inner loop in a moment and just check. 00:21:29.720 |
As you can see, this little hacky heuristic does a great job. 00:21:37.000 |
It's a hacky heuristic I've been using for decades now. 00:21:44.480 |
In this case, a similar hacky heuristic did actually appear in a paper in 2007. 00:21:52.760 |
But it's always worth thinking about how can I pre-process my data to get it close to where 00:22:00.920 |
And often these kind of approaches are useful. 00:22:05.740 |
There's actually -- I don't know if we'll have time to maybe talk about it someday -- but 00:22:08.440 |
there's an approach to doing PCA, Principal Component Analysis, which has a similar flavor, 00:22:13.840 |
basically finding random numbers and finding the furthest numbers away from them. 00:22:26.560 |
Well, what we do next is we're going to be doing more computation in TensorFlow with 00:22:29.760 |
them, so we want to copy them back to the GPU. 00:22:34.800 |
And so because we copied them to the GPU, before we do an eval or anything later on, 00:22:41.240 |
we're going to have to make sure we call global_variables_initializer().run(). 00:22:46.080 |
Can you explain what happens if you don't create an interactive session? 00:22:52.760 |
So what the TensorFlow authors decided to do in their wisdom was to generate their own 00:22:58.560 |
whole concept of namespaces and variables and whatever else. 00:23:04.200 |
So rather than using Python's, there's TensorFlow. 00:23:10.040 |
And so a session is basically kind of like a namespace that holds the computation graphs 00:23:21.360 |
You can, and then there's this concept of a context manager, which is basically where 00:23:26.680 |
you have a with clause in Python and you say with this session, now you're going to do 00:23:36.120 |
You can have multiple computation graphs, so you can say with this graph, create these 00:23:42.920 |
Where it comes in very handy is if you want to say run this graph on this GPU, or stick 00:23:56.160 |
So without an interactive session, you basically have to create that session, you have to say 00:24:01.640 |
which session to use using a with clause, and then there's many layers of that. 00:24:06.480 |
So within that you can then create name scopes and variable scopes and blah blah blah. 00:24:11.200 |
So the annoying thing is the vast majority of tutorial code out there uses all of these concepts. 00:24:18.360 |
It's as if all of Python's OO and variables and modules didn't exist, and you use TensorFlow's versions of everything instead. 00:24:27.040 |
So I wanted to show you that you don't have to use any of these concepts, pretty much. 00:24:34.520 |
I haven't quite finished thinking through this, but have you tried? 00:24:51.200 |
If you had initially said I have seven clusters, or eight clusters, what you would find after 00:24:56.600 |
you hit your six is you'd all of a sudden start getting centroids that were very close to each other. 00:25:04.000 |
So it seems like you could somehow intelligently define a width of a cluster, or kind of look 00:25:12.960 |
for a jump in things dropping down and how far apart they are from some other cluster, 00:25:17.940 |
and programmatically come up with a way to decide the number of clusters. 00:25:26.680 |
Maybe then you're using TK means, I don't know. 00:25:36.760 |
There are certainly papers about figuring out the number of clusters in K-means. 00:25:44.720 |
So maybe during the week you could check one out, port it to TensorFlow, that would be really interesting. 00:25:51.520 |
And I just wanted to follow up what you said about sessions to emphasize that with a lot 00:25:57.400 |
of tutorials, you could make the code simpler by using an interactive session in a Jupyter notebook. 00:26:04.200 |
I remember when Rachel was going through a TensorFlow course a while ago and she kept 00:26:08.440 |
on banging her head against a desk with sessions and variable scopes and whatever else. 00:26:14.400 |
That was part of what led us to think, "Okay, let's simplify all that." 00:26:19.680 |
So step one was to take our initial centroids and copy them onto the GPU. 00:26:27.960 |
So the next step in the K-means algorithm is to take every point and assign them to 00:26:34.280 |
a cluster, which is basically to say for every point which of the centroids is the closest. 00:26:47.840 |
We'll get to that in a moment, but let's pretend we've done it. 00:26:50.460 |
This will now be a list of which centroid is the closest for every data point. 00:26:58.620 |
So then we need one more TensorFlow concept, which is how we update an existing variable. 00:27:15.760 |
And so we can actually call update_centroids to basically do that updating, and I'll show you how that works in a moment. 00:27:24.760 |
So basically the idea is we're going to loop through doing this again and again and again. 00:27:33.480 |
But when we just do it once, you can actually see it's nearly perfect already. 00:27:39.520 |
So it's a pretty powerful idea as long as your initial cluster centers are good. 00:27:47.120 |
So let's see how this works, assign_to_nearest. 00:27:59.960 |
The reason there's a single line of code is we already have the code to find the distance 00:28:04.740 |
between every piece of data and its centroid. 00:28:10.880 |
Now rather than calling tf.reduce_min, which returned the distance to its nearest centroid, 00:28:16.880 |
we call tf.argmin to get the index of its nearest centroid. 00:28:23.280 |
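A sketch of that step (again, the names are assumptions following the lecture):

```python
def assign_to_nearest(self, centroids):
    # all_distances gives a (n_clusters, n_points) matrix; argmin over axis 0
    # returns, for each point, the index of its nearest centroid.
    return tf.argmin(all_distances(self.v_data, centroids), 0)
```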
So generally speaking, the hard bit of doing this kind of highly vectorized code is figuring 00:28:28.840 |
out this number, which is what axis we're working with. 00:28:33.600 |
And so it's a good idea to actually write down on a piece of paper, for each of your tensors, what the axes are. 00:28:40.220 |
It's like it's time by batch, by row, by column, or whatever. 00:28:43.800 |
Make sure you know what every axis represents. 00:28:48.880 |
When I'm creating these algorithms, I'm constantly printing out the shape of things. 00:28:57.480 |
Another really simple trick, but a lot of people don't do this, is make sure that your 00:29:01.720 |
different dimensions actually have different sizes. 00:29:05.540 |
So when you're playing around with testing things, don't have a batch size of 10 and 10 of everything else. 00:29:13.320 |
I find it much easier to think in real numbers, so have a batch size of 8 and an n of 10 and so on. 00:29:19.560 |
Because every time you print out the shape, you'll find out exactly what everything is. 00:29:24.960 |
So this is going to return the nearest indices. 00:29:32.040 |
So then we can go ahead and update the centroids. 00:29:48.320 |
And if you know the computer science term for the thing you're trying to do, it's generally easy to find the function you need. 00:30:00.640 |
The only other way to find it is just to do lots and lots of searching through the documentation. 00:30:04.840 |
So in general, taking a set of data and sticking it into multiple chunks of data according 00:30:15.780 |
to some kind of criteria is called partitioning in computer science. 00:30:20.920 |
So I got a bit lucky when I first looked for this. 00:30:22.800 |
I googled for TensorFlow partition and bang, this thing popped up. 00:30:35.100 |
And this is where reading about GPU programming in general is very helpful. 00:30:41.520 |
Because in GPU programming there's this kind of smallish subset of things which everything 00:30:46.760 |
else is built on, and one of them is partitioning. 00:31:04.800 |
dynamic_partition partitions the data into some number of partitions using some indices. 00:31:10.600 |
And generally speaking, it's easiest to just look at some code. 00:31:16.480 |
We're going to create two partitions, we're calling them clusters, and it's going to split this data between them. 00:31:28.000 |
So 10 will go to the 0th partition, 20 will go to the 0th partition, 30 will go to the 1st partition, and so on. 00:31:38.280 |
So this is the nice thing is that there's a lot of these, you can see all this stuff. 00:31:42.720 |
There's so many functions available, often there's the exact function you need. 00:31:48.980 |
So we just take our list of indices, convert it to a list of int32s, pass in our data, 00:31:55.360 |
the indices and the number of clusters, and we're done. 00:32:00.040 |
This is now a separate array, basically a separate tensor for each of our clusters. 00:32:10.600 |
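The toy example being described, written out as a runnable snippet (the exact numbers in the notebook may differ):

```python
data    = tf.constant([10., 20., 30., 40.])
indices = tf.constant([0, 0, 1, 1], dtype=tf.int32)

# Split data into 2 partitions according to the index given for each element.
parts = tf.dynamic_partition(data, indices, 2)
print([p.eval() for p in parts])   # [array([10., 20.]), array([30., 40.])] (float32)
```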
So now that we've done that, we can then figure out what is the mean of each of those clusters. 00:32:19.440 |
So the mean of each of those clusters is our new centroid. 00:32:22.760 |
So what we're doing is we're saying which points are the closest to this one, and these 00:32:31.200 |
points are the closest to this one, okay, what's the average of those points? 00:32:44.280 |
And then we can basically say, okay, that's our new clusters. 00:32:52.000 |
So then just join them all together, concatenate them together. 00:33:04.380 |
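As a sketch, the whole update step might look like this (names assumed as before; tf.concat uses the TF 1.x argument order):

```python
def update_centroids(self, nearest_indices):
    # Group the points by the cluster they were assigned to...
    parts = tf.dynamic_partition(self.v_data,
                                 tf.to_int32(nearest_indices), self.n_clusters)
    # ...take the mean of each group, and join the means back into a (k, d) tensor.
    return tf.concat([tf.expand_dims(tf.reduce_mean(p, 0), 0) for p in parts], 0)
```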
Except for that dynamic_partition, well in fact including that dynamic_partition, that 00:33:08.960 |
was incredibly simple, but it was incredibly simple because we had a function that did exactly what we needed. 00:33:13.720 |
So because we created a variable up here, we have to call initializer.run. 00:33:20.120 |
And then of course before we can do anything with this tensor, we have to call .eval to 00:33:26.520 |
actually run the computation graph and copy the result back to our CPU. 00:33:38.000 |
So then we want to replace the contents of current centroids with the contents of updated 00:33:47.680 |
And so to do that, we can't just say equals, everything is different in TensorFlow, you have to call assign. 00:33:57.080 |
So this is the same as basically saying current centroids equals updated centroids, but it's 00:34:01.840 |
creating a computation graph that basically does that assignment on the GPU. 00:34:07.880 |
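So, assuming current_centroids is a tf.Variable and updated_centroids is a tensor of the same shape (names echoing the lecture), the sketch is just:

```python
# The TensorFlow way of saying "current_centroids = updated_centroids":
update_op = current_centroids.assign(updated_centroids)
update_op.eval()   # running it actually performs the assignment on the GPU
```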
How can we extrapolate this to other non-numeric data types such as words, images? 00:34:17.800 |
Well they're all numeric data types really, so an image is absolutely a numeric data type. 00:34:26.600 |
So it's just a bunch of pixels, you just have to decide what distance measure you want, 00:34:34.240 |
which generally just means deciding you're probably using Euclidean distance, but are 00:34:37.680 |
you doing it in pixel space, or are you picking one of the activation layers in a neural net? 00:34:44.000 |
For words, you would create a word vector for your words. 00:34:52.800 |
There's nothing specifically two-dimensional about this. 00:35:00.240 |
That's really the whole point, and I'm hoping that maybe during the week, some people will 00:35:03.360 |
start to play around with some higher dimensional data sets to get a feel for how this works. 00:35:08.560 |
Particularly if you can get it working on CT scans, that would be fascinating using 00:35:12.280 |
the five-dimensional clustering we talked about. 00:35:17.600 |
So here's what it looks like in total if we weren't using an interactive session. 00:35:23.960 |
You basically say with tf.session, that creates a session, but as default, that sets it to 00:35:30.960 |
the current session, and then within the with block, we can now run things. 00:35:36.520 |
And then k.run does all the stuff we just saw, so if we go to k.run, here it is. 00:35:48.600 |
So this is how you can create a complete computation graph in TensorFlow using a notebook. 00:35:56.800 |
Once you've got it working, you put it all together. 00:35:59.280 |
So you can see find_initial_centroids.eval, put it back into a variable again, assigned 00:36:13.140 |
Because we created a variable in the process there, we then have to rerun global_variable_initializer. 00:36:21.120 |
We could have avoided this I guess by not calling eval and just treating this as a variable 00:36:28.560 |
the whole time, but it doesn't matter, this works fine. 00:36:34.420 |
And then we just loop through a bunch of times, calling centroids.assign with the updated centroids. 00:36:50.440 |
What we should be doing after that is then calling update_centroids each time. 00:37:01.360 |
Then the nice thing is, because I've used a normal Python for loop here and I'm calling 00:37:08.520 |
dot eval each time, it means I can check, "Oh, have any of the cluster centroids moved?" 00:37:20.080 |
And if they haven't, then I will stop working. 00:37:24.040 |
So it makes it very easy to create dynamic for loops, which could be quite tricky sometimes with TensorFlow's own loop constructs. 00:37:35.960 |
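A sketch of what that outer loop might look like, with a plain Python for loop and an eval-based convergence check (this is a reconstruction of the idea, not the notebook's actual k.run; max_iters and the tolerance are made up):

```python
def run(self, max_iters=50, tol=1e-6):
    # assumes self.v_data was created and initialized in __init__
    centroids = tf.Variable(self.find_initial_centroids(self.n_clusters).eval())
    tf.global_variables_initializer().run()     # we created a new variable, so re-initialize
    prev = centroids.eval()
    for _ in range(max_iters):
        nearest = self.assign_to_nearest(centroids)
        centroids.assign(self.update_centroids(nearest)).eval()
        current = centroids.eval()
        # Because this is a plain Python loop, we can just check whether anything moved.
        if np.allclose(current, prev, atol=tol):
            break
        prev = current
    return centroids.eval()
```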
So that is the TensorFlow algorithm from end to end. 00:37:48.000 |
Rachel, do you want to pick out an AMA question? 00:37:53.820 |
So actually I kind of am helping start a company, I don't know if you've seen my talk on ted.com, 00:38:04.280 |
but I kind of show this demo of this interactive labelling tool, a friend of mine said that 00:38:12.360 |
he wanted to start a company to actually make that and commercialize it. 00:38:15.240 |
So I guess my short answer is I'm helping somebody do that because I think that's pretty 00:38:20.480 |
More generally, I've mentioned before I think that the best thing to do is always to scratch 00:38:26.680 |
an itch, so pick whatever you've been passionate about or something that's just driven you 00:38:36.840 |
If you have the benefit of being able to take enough time to do absolutely anything you 00:38:40.580 |
want, I felt like the three most important areas for applying deep learning when I last 00:38:48.880 |
looked, which was two or three years ago, were medicine, robotics, and satellite imagery. 00:38:55.440 |
Because at that time, computer vision was the only area that was remotely mature for 00:39:02.280 |
machine learning, deep learning, and those three areas all were areas that very heavily 00:39:09.160 |
used computer vision or could heavily use computer vision and were potentially very 00:39:17.040 |
Medicine is probably the largest industry in the world, I think it's $3 trillion in 00:39:20.480 |
America alone, robotics isn't currently that large, but at some point it probably will 00:39:26.760 |
become the largest industry in the world if everything we do manually is replaced with 00:39:34.720 |
And satellite imagery is massively used by military intelligence, so we have some of 00:39:39.880 |
the biggest budgets in the world, so I guess those three areas. 00:39:44.840 |
I'm going to take a break soon, but before I do, I might just introduce what we're going to do next. 00:40:05.440 |
So we're going to start on our NLP, and specifically translation deep dive, and we're going to 00:40:14.680 |
be really following on from the end-to-end memory networks from last week. 00:40:23.600 |
One of the things that I find kind of most interesting and most challenging in setting 00:40:28.080 |
up this course is coming up with good problem sets which are hard enough to be interesting 00:40:40.440 |
And often other people have already done that, so I was lucky enough that somebody else had 00:40:44.360 |
already shown an example of using sequence-to-sequence learning for what they called Spelling Bee. 00:40:53.120 |
Basically we start with this thing called the CMU Pronouncing Dictionary, which has 00:40:58.000 |
things that look like this, Zwicky, followed by a phonetic description of how to say Zwicky. 00:41:10.440 |
So the way these work, this is actually specifically an American pronunciation dictionary. 00:41:21.600 |
The vowel sounds have a number at the end showing how much stress is on each one. 00:41:28.440 |
So in this case you can see that the middle one is where most of the stress is, so it's 00:41:35.340 |
So here is the letter A, and it is pronounced A. 00:41:42.200 |
So the goal that we're going to be looking at after the break is to do the other direction, 00:41:48.680 |
which is to start with how do you say it and turn it into how do you spell it. 00:41:56.000 |
This is quite a difficult question because English is really weird to spell. 00:42:01.960 |
And the number of phonemes doesn't necessarily match the number of letters. 00:42:10.640 |
So this is going to be where we're going to start. 00:42:12.720 |
And then we're going to try and solve this puzzle, and then we'll use a solution from 00:42:15.840 |
this puzzle to try and learn to translate French into English using the same basic idea. 00:42:22.800 |
So let's have a 10 minute break and we'll come back at 7.40. 00:42:29.040 |
So just to clarify, I just want to make sure everybody understands the problem we're solving 00:42:34.320 |
So the problem we're solving is we're going to be told here is how to pronounce something, 00:42:39.280 |
and then we have to say here is how to spell it. 00:42:42.000 |
So this is going to be our input, and this is going to be our target. 00:42:46.160 |
So this is like a translation problem, but it's a bit simpler. 00:42:52.140 |
So we don't have pre-trained phoneme vectors or pre-trained letter vectors. 00:42:58.960 |
So we're going to have to do this by building a model, and we're going to have to create our own embeddings. 00:43:05.200 |
So in general, the first steps necessary to create an NLP model tend to look very, very similar. 00:43:14.600 |
I feel like I've done them in a thousand different ways now, and at some point I really need 00:43:19.560 |
to abstract this all out into a simple set of functions that we use again and again. 00:43:26.560 |
But let's go through it, and if you've got any questions about any of the code or steps 00:43:36.200 |
So the basic pronunciation dictionary is just a text file, and I'm going to just grab the 00:43:45.360 |
lines which are actual words so they have to start with a letter. 00:43:52.360 |
So we're going to go through every line in the text file. 00:44:06.200 |
Here's a handy thing that a lot of people don't realize you can do in Python. 00:44:09.960 |
When you call open, that returns a generator which lists all of the lines. 00:44:14.860 |
So if you just go for l in openblah, that's now looping through every line in that file. 00:44:23.120 |
So I can then say filter those which start with a lowercase letter, and then strip off 00:44:44.120 |
So that's basically the steps necessary to separate out the word from the pronunciation. 00:44:52.200 |
And then the pronunciation is just white space delimited, so we can then split that. 00:44:59.040 |
And that's the steps necessary to get the word and the pronunciation as a set of phonemes. 00:45:05.480 |
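A sketch of that parsing step (the file path and encoding are assumptions about the raw CMU dictionary file):

```python
import string

PHONEME_FILE = 'cmudict.txt'    # hypothetical local path to the CMU pronouncing dictionary

pairs = []
for l in open(PHONEME_FILE, encoding='latin1'):    # the file isn't pure ASCII
    if l[:1].lower() in string.ascii_lowercase:    # keep only lines that start with a letter
        parts = l.strip().split()                  # the pronunciation is whitespace-delimited
        pairs.append((parts[0].lower(), parts[1:]))   # (word, list of phonemes)
```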
So as we tend to pretty much always do with these language models, we next need to get 00:45:09.980 |
a list of what are all of the vocabulary items. 00:45:13.200 |
So in this case, the vocabulary items are all the possible phonemes. 00:45:17.360 |
So we can create a set of every possible phoneme, and then we can sort it. 00:45:27.160 |
And what we always like to do is get an extra character or an extra object in position 0 to use for padding. 00:45:40.120 |
So that's why I'm going to use underscore as our special padding letter here. 00:45:50.560 |
This is our special padding one, which is going to be index 0. 00:45:54.080 |
And then there's r, r and r with 3 different levels of stress, and so forth. 00:46:00.520 |
Now the next thing that we tend to do anytime we've got a list of vocabulary items is to create the mapping going the other way. 00:46:08.560 |
So we go from phoneme to index, which is just a dictionary where we enumerate through all 00:46:15.680 |
of our phonemes and put it in the opposite order. 00:46:21.840 |
I know we've used this approach a thousand times before, but I just want to make sure 00:46:29.400 |
When you use enumerate in Python, it doesn't just return each phoneme, but it returns a 00:46:34.400 |
tuple that contains the index of the phoneme and then the phoneme itself. 00:46:41.700 |
So then if we go value,key, that's now the phoneme followed by the index. 00:46:47.120 |
So if we turn that into a dictionary, we now have a dictionary which you can give a phoneme and it will give you back the index. 00:46:56.720 |
Again with our special underscore at the front. 00:47:00.760 |
We've got one extra thing we'll talk about later, which is an asterisk. 00:47:07.120 |
And so again, to go from letter to letter index, we just create a dictionary which reverses it in the same way. 00:47:17.280 |
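A sketch of those vocabularies and index mappings (variable names are illustrative; pairs comes from the parsing sketch above):

```python
import string

# '_' is the special padding item at index 0; phonemes are sorted for reproducibility.
phonemes = ['_'] + sorted(set(p for w, ps in pairs for p in ps))
p2i = {v: k for k, v in enumerate(phonemes)}     # phoneme -> index

# padding, the 26 letters, plus the extra '*' token mentioned above: 28 in total.
letters = '_' + string.ascii_lowercase + '*'
l2i = {v: k for k, v in enumerate(letters)}      # letter -> index
```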
So now that we've got our phoneme to index and letter to index, we can use that to convert 00:47:25.280 |
this data into numeric data, which is what we always do with these language models. 00:47:35.120 |
We can pick some maximum length word, so I'm just going to say 15. 00:47:41.240 |
So we're going to create a dictionary which maps from each word to a list of phonemes, 00:47:54.880 |
So this dictionary comprehension is a little bit awkward, so I thought this would be a good 00:48:00.520 |
opportunity to talk about dictionary comprehensions and list comprehensions for a moment. 00:48:06.160 |
So we're going to pause this in a moment, but first of all, let's look at a couple of examples of list comprehensions. 00:48:12.880 |
So the first thing to note is when you go something like this, a string x, y, z or this 00:48:17.520 |
string here, Python is perfectly happy to consider that a list of letters. 00:48:22.600 |
So Python considers this the same as being a list of x, y, z. 00:48:26.920 |
So you can think of this as two lists, a list of x, y, z and a list of a, b, c. 00:48:32.040 |
So here is the simplest possible list comprehension. 00:48:37.640 |
So go through every element of a and put that into a list. 00:48:42.600 |
So if I call that, that returns exactly what I started with. 00:48:50.080 |
What if now we replaced 'o' with another list comprehension? 00:48:59.680 |
So what that's going to do is it's now going to return a list for each list. 00:49:08.240 |
So this is one way of pulling things out of sub-lists, is to basically take the thing 00:49:12.760 |
that was here and replace it with a new list comprehension, and that's going to give you 00:49:20.520 |
Now the reason I wanted to talk about this is because it's quite confusing. 00:49:23.920 |
In Python you can also write this, which is different. 00:49:28.680 |
So in this case, I'm going for each object in our a list, and then for each object within that object. 00:49:39.760 |
I don't have square brackets, it's just all laid out next to each other. 00:49:44.340 |
So I find this really confusing, but the idea is you're meant to think of this as just being nested for loops. 00:49:52.800 |
And so what this does is it goes through x, y, z and then a, b, c, and then in x, y, z 00:49:58.720 |
it goes through each of x and y and z, but because there's no embedded set of square 00:50:02.840 |
brackets, that actually ends up flattening the list. 00:50:08.020 |
So we're about to see an example of the square bracket version, and pretty soon we'll be seeing 00:50:26.280 |
It's very useful to be able to flatten a list, it's very useful to be able to do things with 00:50:32.160 |
And then just to be aware that any time you have any kind of expression like this, you 00:50:38.080 |
can replace the thing here with any expression you like. 00:50:45.000 |
So we could say, for example, we could say o.upper(), so you can basically map different 00:51:05.160 |
computations to each element of a list, and then the second thing you can do is put an if at the end to filter the list. 00:51:38.720 |
You can create any list comprehension you like by putting computations here, filters 00:51:43.040 |
here and optionally multiple lists of lists here. 00:51:48.920 |
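The toy example being walked through here, written out:

```python
a = ['xyz', 'abc']                 # Python happily treats each string as a list of letters

[o for o in a]                     # ['xyz', 'abc']
[[p for p in o] for o in a]        # [['x','y','z'], ['a','b','c']]   nested
[p for o in a for p in o]          # ['x','y','z','a','b','c']        flattened
[o.upper() for o in a]             # ['XYZ', 'ABC']                   map a computation
[o for o in a if o[0] == 'x']      # ['xyz']                          filter
```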
The other thing you can do is replace the square brackets with curly brackets, in which 00:51:53.720 |
case you need to put something before a colon and something after a colon, the thing before 00:51:58.440 |
is your key and the thing after is your value. 00:52:01.520 |
So here we're going for, oh, and then there's another thing you can do, which is if the 00:52:05.920 |
thing you're looping through is a bunch of lists or tuples or anything like that, you can unpack them right there in the for clause. 00:52:14.880 |
So this is the word, and this is the list of phonemes. 00:52:18.840 |
So the lowercase word will be our key in our dictionary, and the 00:52:26.200 |
values will be lists, so we're doing it just like we did down here. 00:52:32.280 |
And the list will be, let's go through each phoneme and go phoneme to index. 00:52:40.280 |
So now we have something that maps from every word to its list of phoneme indexes. 00:52:53.360 |
We can find out what the maximum length of anything is in terms of how many phonemes it has. 00:53:02.000 |
We can just go through every one of those dictionary items, calling len on each one, and take the max. 00:53:12.800 |
So you can see combining list comprehensions with other functions is also powerful. 00:53:22.400 |
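A sketch of that dictionary comprehension and the max-length computation (p2i and pairs come from the earlier sketches; the length cutoff of 15 follows the lecture, while the alphabetic-word filter is an assumption to keep the sketch self-contained):

```python
maxlen = 15

# word -> list of phoneme indices, keeping only plain words that fit our maximum spelling length
pronounce_dict = {w: [p2i[p] for p in ps]
                  for w, ps in pairs if w.isalpha() and len(w) <= maxlen}

# the longest pronunciation, measured in phonemes
maxlen_p = max(len(v) for v in pronounce_dict.values())
```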
So finally we're going to create our nice square arrays. 00:53:31.280 |
Normally we do this with Keras's pad_sequences, but just for a change, we're going to do this manually. 00:53:40.440 |
So the key is that we start out by creating two arrays of zeros, because all the padding is zero. 00:53:48.560 |
So if we start off with all zeros, then we can just fill in the non-zeros. 00:53:56.360 |
This is going to be our actual spelling, that's our target labels. 00:54:00.220 |
So then we go through all of our, and we've permuted them randomly, so randomly ordered 00:54:04.880 |
things in the pronunciation dictionary, and we put into input all of the phoneme indices from 00:54:15.480 |
that pronunciation dictionary, and into labels we put the letter-to-index values. 00:54:22.800 |
So we now have one thing called input, one thing called labels that contains nice rectangular 00:54:28.360 |
arrays padded with zeros containing exactly what we want. 00:54:36.480 |
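A sketch of that padding step (shapes and names follow the description above; l2i, pronounce_dict, maxlen and maxlen_p come from the earlier sketches):

```python
n = len(pronounce_dict)
input_ = np.zeros((n, maxlen_p), dtype='int32')   # phoneme indices, zero-padded
labels = np.zeros((n, maxlen),   dtype='int32')   # letter indices, zero-padded

items = list(pronounce_dict.items())
for i, j in enumerate(np.random.permutation(n)):  # randomly ordered, as described
    word, phons = items[j]
    input_[i, :len(phons)] = phons                    # fill in the non-zero phonemes
    labels[i, :len(word)]  = [l2i[c] for c in word]   # fill in the non-zero letters
```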
I'm not going to worry about this line yet because we're not going to use it for the 00:54:42.560 |
So anywhere you see DEC something, just ignore that for now, we'll get back to that later. 00:54:49.080 |
train_test_split is a very handy function from sklearn that takes all of these lists 00:54:55.800 |
and splits them all in the same way with this proportion in the test set. 00:55:01.680 |
And so input becomes input_train and input_test, labels becomes labels_train and labels_test. 00:55:11.120 |
We've often written that manually, but this is a nice quick way to do it when you've got several arrays to split. 00:55:20.280 |
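For example (test_size here is an arbitrary choice; the module path is for recent scikit-learn versions):

```python
from sklearn.model_selection import train_test_split

(input_train, input_test,
 labels_train, labels_test) = train_test_split(input_, labels, test_size=0.1)
```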
So just to have a look at how many phonemes we have in our vocabulary, there are 70, how 00:55:25.720 |
many letters in our vocabulary, there's 28, that's because we've got that underscore and the asterisk as well as the 26 letters. 00:55:55.620 |
So the embedding is going to take every one of our phonemes. 00:56:00.240 |
Max_len_p is the maximum number of phonemes we have in any pronunciation. 00:56:07.200 |
And each one of those phonemes is going to go into an embedding. 00:56:13.400 |
And the lookup for that embedding is the vocab size for phonemes, which I think was 70. 00:56:21.880 |
And then the output is whatever we decide, what dimensionality we want. 00:56:26.600 |
And in experimentation, I found 120 seems to work pretty well. 00:56:30.840 |
I was surprised by how high that number is, but there you go, it is. 00:56:37.600 |
We started out with a list of phonemes, and then after we go through this embedding, we have a list of embedding vectors, one per phoneme. 00:57:02.640 |
So the basic idea is to take this big thing, which is all of our phonemes embedded, and 00:57:11.320 |
we want to turn it into a single distributed representation which contains all of the richness of that pronunciation. 00:57:22.440 |
Later on we're going to be doing the same thing with an English sentence. 00:57:27.000 |
And so we know that when you have a sequence and you want to turn it into a representation, an RNN is a good way to do that. 00:57:41.920 |
Because an RNN we know is good at dealing with things like state and memory. 00:57:47.320 |
So when we're looking at translation, we really want something which can remember where we are in the sequence. 00:57:55.280 |
So let's say we were doing this simple phonetic translation, the idea of have we just had a C really matters. 00:58:04.000 |
Because if we just had a C, then the H is going to make a totally different sound to 00:58:13.360 |
So an RNN we think is a good way to do this kind of thing. 00:58:17.640 |
And in general, this whole class of models remember is called seq2seq, sequence-to-sequence 00:58:25.360 |
models, which is where we start with some arbitrary length sequence and we produce some other arbitrary length sequence. 00:58:32.600 |
And so the general idea here is taking that arbitrary length sequence and turning it into 00:58:37.280 |
a fixed size representation using an RNN is probably a good first step. 00:58:53.560 |
So looking ahead, I'm actually going to be using quite a few layers of RNN. 00:59:03.040 |
So to make that easier, we've created a getRNN function. 00:59:15.280 |
You can put anything you like here, GRU or LSTM or whatever. 00:59:24.880 |
The kind of dropout that you use in an RNN is slightly different to normal dropout. 00:59:29.760 |
It turns out that it's best to drop out the same things at every time step. 00:59:41.240 |
There's a really good paper that explains why this is the case and shows that this is the right way to do it. 00:59:46.560 |
So this is why there's a special dropout parameter inside the RNN in Keras, because it does this for you. 00:59:56.960 |
So I put in a tiny bit of dropout here, and if it turns out that we overfit, we can always increase it later. 01:00:16.840 |
I don't know if you remember, but when we looked at doing RNNs from scratch last year, 01:00:36.840 |
we learned that you could actually combine the matrices together and do a single matrix multiplication. 01:00:42.960 |
If you do that, it's going to use up more memory, but it allows the GPU to be more highly parallel. 01:00:49.600 |
So if you look at the Keras documentation, it will tell you the different things you 01:00:53.560 |
can use, but since we're using a GPU, you probably always want to say consume_less='gpu'. 01:01:03.200 |
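A sketch of such a helper, written against the Keras 1-era API used in this lesson (the unit count and the 0.1 dropout are illustrative guesses, not the notebook's exact values):

```python
from keras.layers import LSTM

def get_rnn(return_sequences=True):
    # Keras 1-style arguments:
    #   dropout_W / dropout_U   - dropout on the input vs the recurrent connections
    #   consume_less='gpu'      - combine the gate matrices into one big matrix multiply
    return LSTM(240, dropout_W=0.1, dropout_U=0.1,
                consume_less='gpu', return_sequences=return_sequences)
```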
The other thing that we learned about last year is bidirectional RNNs. 01:01:09.760 |
And maybe the best way to come at this is actually to go all the way back and remind ourselves how RNNs work. 01:01:19.320 |
We haven't done much revision, but it's been a while since we've looked at RNNs in much detail. 01:01:24.680 |
So just to remind you, this is kind of our drawing of a totally basic neural net. 01:01:31.520 |
Square is input, circle is intermediate activations, hidden, and triangle is output, and arrows 01:01:40.000 |
represent affine transformations with non-linearities. 01:01:47.880 |
We can then have multiple copies of those to create deeper convolutions, for example. 01:01:57.160 |
And so the other thing we can do is actually have inputs going in at different places. 01:02:03.740 |
So in this case, if we were trying to predict the third character from first two characters, 01:02:08.040 |
we can use a totally standard neural network and actually have input coming in at two different places. 01:02:17.920 |
And then we realized that we could kind of make this arbitrarily large, but what we should 01:02:22.560 |
probably do then is make everything where an input is going to a hidden state be the same weight matrix. 01:02:28.240 |
So this color coding, remember, represents the same weight matrix. 01:02:33.040 |
So hidden to hidden would be the same weight matrix and hidden to output, and it's a separate one for each of those. 01:02:40.760 |
So then to remind you, we realized that we could draw that more simply like this. 01:02:46.160 |
So RNNs, when they're unrolled, just look like a normal neural network in which some of the weight matrices are tied together. 01:02:55.940 |
And if this is not ringing a bell, go back to lesson 5 where we actually build these 01:03:03.160 |
weight matrices from scratch and tie them together manually, so that will hopefully remind you how it works. 01:03:12.840 |
Now importantly, we can then take one of those RNNs and have the output go to the input of a second RNN. 01:03:26.360 |
And stacked RNNs basically give us richer computations in our recurrent neural nets. 01:03:35.280 |
And this is what it looks like when we unroll it. 01:03:39.160 |
So you can see here that we've got multiple inputs coming in, going through multiple layers of RNN, and creating multiple outputs. 01:03:47.020 |
But of course we don't have to create multiple outputs. 01:03:51.680 |
You could also get rid of these two triangles here and have just one output. 01:04:01.440 |
And remember in Keras, the difference is whether or not we say return_sequences=true or return_sequences=false. 01:04:08.760 |
This one you're seeing here is return_sequences=true. 01:04:19.680 |
So what we've got is input_train has 97,000 words. 01:04:36.600 |
It's 15 characters long plus the padding, and then labels is 15 because we chose earlier 01:05:00.520 |
on that our max length would be a 15 long spelling. 01:05:10.200 |
So after the embedding, so if we take one of those tens of thousands of words, remember it's a list of phoneme indices. 01:05:33.720 |
And then we're putting it into an embedding matrix which is 70 by 120. 01:05:45.720 |
And the reason it's 70 is that each of these phonemes is a number between 0 and 69. 01:05:57.360 |
So basically we go through and we get each one of these indexes and we look up the corresponding row of the embedding matrix. 01:06:02.720 |
So this is 5 here, then we find number 5 here. 01:06:15.720 |
"Are we then taking a sequence of these phonemes represented as 120 dimensional floating point 01:06:26.080 |
vectors and using an RNN to create a sequence of word2vec embeddings which we will then 01:06:33.960 |
So we're not going to use word2vec here, right? 01:06:36.320 |
Word2vec is a particular set of pre-trained embeddings. 01:06:39.760 |
We're not using pre-trained embeddings, we have to create our own embeddings. 01:06:47.960 |
So if somebody else later on wanted to do something else with phonemes, and we saved 01:06:53.360 |
the result of this, we could provide phoneme2vec. 01:06:56.960 |
And you could download them and use the fast.ai pre-trained phoneme2vec embeddings. 01:07:03.000 |
This is how embeddings basically get created. 01:07:05.760 |
It's people build models starting with random embeddings and then save those embeddings 01:07:11.560 |
and make them available for other people to use. 01:07:13.720 |
"I may be misinterpreting it, but I thought the question was getting at the second set 01:07:20.800 |
of embeddings when you want to get back to your words." 01:07:25.600 |
So let's wait until we get there, because we're going to create letters, not words, 01:07:31.720 |
and then we'll just join the letters together. 01:07:35.560 |
So we've got, as far as creating our embeddings, we've then got an RNN which is going to take 01:07:44.860 |
our embeddings and attempt to turn them into a single vector. 01:07:55.040 |
So we've got here return sequences by default is true, so this first RNN returns something at every time step. 01:08:08.800 |
And so if you want to stack RNNs on top of each other, every one of them needs return_sequences=true except the last. 01:08:18.440 |
So at the end of this one, this gives us a single vector which is the final state. 01:08:28.980 |
And bidirectional, you can totally do this manually yourself. 01:08:32.440 |
You take your input and feed it into an RNN, and then you reverse your input and feed it 01:08:39.760 |
into a different RNN, and then just concatenate the two together. 01:08:44.480 |
So Keras has something which does that for you, which is called bidirectional. 01:08:50.420 |
Bidirectional actually requires you to pass it an RNN, so it takes an RNN and returns two 01:08:58.000 |
copies of that RNN stacked on top of each other, one of which reverses its input. 01:09:05.480 |
That's interesting because often in language, what happens later influences what comes before. 01:09:12.660 |
For example, in French, the gender of your definite article depends on the noun that it refers to. 01:09:22.320 |
So you need to be able to look backwards or forwards in both directions to figure out 01:09:28.880 |
Or in any language with tense, what verb do you use depends on the tense and often also 01:09:37.080 |
depends on the details about the subject and the object. 01:09:40.280 |
So we want to be able to both look forwards and look backwards. 01:09:44.440 |
So that's why we want two copies of the RNN, one which goes from left to right and one 01:09:51.680 |
And indeed, we could assume that when you spell things, I'm not exactly sure how this 01:09:55.720 |
would work, but when you spell things, depending on what the later stresses might be or the 01:10:00.440 |
later details of the phonetics might be, that might change how you pronounce things earlier on. 01:10:07.120 |
Does the bidirectional RNN concat two RNNs or does it stack them? 01:10:24.760 |
It concatenates them, so you end up with the same shape of output as before, but it basically doubles the number of features. 01:10:34.840 |
So in this case, we have 240, so it just doubles those, and I think we had one question here. 01:10:47.720 |
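A tiny self-contained example of that doubling (sizes here are arbitrary; whether the 240 mentioned above is the per-direction size or the concatenated size isn't clear from the recording):

```python
from keras.layers import Input, LSTM, Bidirectional
from keras.models import Model

inp = Input(shape=(16, 120))                                # 16 timesteps of 120-dim embeddings
out = Bidirectional(LSTM(120, return_sequences=True))(inp)
Model(inp, out).summary()                                   # output shape: (None, 16, 240)
```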
Okay, so let's simplify this down a little bit. 01:11:17.040 |
Basically, we started out with a set of embeddings, and we've gone through two layers, we've gone 01:11:30.240 |
through a bidirectional RNN, and then we feed that to a second RNN to create a representation of the whole sequence. 01:11:46.820 |
So x at this point is a vector, because return sequence is equals false. 01:11:53.400 |
That vector, once we've trained this thing, the idea is it represents everything important 01:12:00.800 |
there is to know about this ordered list of phonemes, everything that we could possibly need to spell it. 01:12:08.520 |
So the idea is we could now take that vector and feed it into a new RNN, or even a few 01:12:16.800 |
layers of RNN, and that RNN could basically go through and, with return_sequences=true, produce an output at every time step. 01:12:27.360 |
It could spit out at every time step what it thinks the next letter in this spelling 01:12:34.000 |
And so this is how a sequence-to-sequence works. 01:12:36.200 |
One part, which is called the encoder, takes our initial sequence and turns it into a distributed 01:12:48.160 |
representation into a vector using, generally speaking, some stacked RNNs. 01:12:56.080 |
Then the second piece, called the decoder, takes the output of the encoder and passes 01:13:04.720 |
that into a separate stack of RNNs with return sequence equals true. 01:13:10.400 |
And those RNNs are taught to then generate the labels, in this case the spellings, or 01:13:22.080 |
Now in Keras, it's not convenient to create an RNN by handing it some initial state, some initial hidden state. 01:13:35.720 |
That's not really how Keras likes to do things. 01:13:47.320 |
Problem number two, if you do hand it to an RNN just at the start, it's quite hard for 01:13:53.120 |
the RNN to remember the whole time what is this word I'm meant to be translating. 01:14:00.400 |
It has to keep track of two things: one is what's the word I'm meant to be spelling, and the second is what's the letter I'm trying to spell right now. 01:14:07.780 |
So what we do with Keras is we actually take this whole state and we copy it. 01:14:18.840 |
So in this case we're trying to create a word that could be up to 15 letters long. 01:14:33.400 |
So we take this and we actually make 15 copies of it. 01:14:46.040 |
And those 15 copies of our final encoder state becomes the input to our decoder RNN. 01:14:54.760 |
So it seems kind of clunky, but it's actually not difficult to do. 01:15:01.480 |
We take the output from our encoder and we repeat it 15 times. 01:15:11.840 |
So we literally have 15 identical copies of the same vector. 01:15:16.320 |
And so that's how Keras expects to see things, and it also turns out that you get better 01:15:22.080 |
results when you pass into the RNN the state that it needs again and again at every time step. 01:15:28.000 |
So we're basically passing in something saying we're trying to spell this word, we're trying 01:15:32.280 |
to spell this word, we're trying to spell this word, we're trying to spell this word. 01:15:35.740 |
And then as the RNN goes along, it's generating its own internal state, figuring out what 01:15:42.760 |
have we spelt so far and what are we going to have to spell next. 01:15:45.440 |
Why can't we have return_sequences=true for the second bidirectional LSTM? 01:15:55.400 |
Not bidirectional for the second LSTM, we only have one bidirectional LSTM. 01:16:02.000 |
We don't want return_sequences=true here because we're trying to create a representation of the whole word. 01:16:11.880 |
So there's no point having something saying here's a representation of the first phoneme, 01:16:19.080 |
of the first 2, of the first 3, of the first 4, of the first 5, because we don't really 01:16:24.200 |
know exactly which letter of the output is going to correspond to which phoneme of the input. 01:16:30.360 |
And particularly when we get to translation, it can get much harder, like some languages 01:16:33.440 |
totally reverse the subject and object order or put the verb somewhere else. 01:16:37.640 |
So that's why we try to package up the whole thing into a single piece of state which has 01:16:43.760 |
all of the information necessary to build our target sequence. 01:16:49.160 |
So remember, these sequence-to-sequence models are also used for things like image captioning. 01:16:57.760 |
So with image captioning, you wouldn't want to have something that created a whole sequence of representations. 01:17:08.240 |
You want a single representation, which is like this is something that somehow contains 01:17:14.440 |
all of the information about what this is a picture of. 01:17:18.000 |
Or if you're doing neural language translation, here's my English sentence, I've turned it 01:17:24.560 |
into a representation of everything that it means so that I can generate my French sentence. 01:17:29.140 |
We're going to be seeing later how we can use return_sequences=True when we look at attentional models. 01:17:37.040 |
But for now, we're just going to keep things simple. 01:18:08.720 |
But if you remember back to Lesson 5, we talked about some of the challenges with that. 01:18:18.600 |
So if you're trying to create something which can parse some kind of markup block like this, 01:18:27.280 |
it has to both remember that you've just opened up a piece of markup, you're in the middle 01:18:31.960 |
of it, and then in here you have to remember that you're actually inside a comment block as well. 01:18:38.860 |
This kind of long-term dependency and memory and stateful representation becomes increasingly 01:18:44.760 |
difficult to do with CNNs as they get longer. 01:18:49.320 |
It's not impossible by any means, but RNNs are one good way of doing this. 01:18:57.800 |
But it is critical that we start with an embedding, because whereas with images we're already 01:19:03.680 |
given float-valued numbers that really represent the image, that's not true with text. 01:19:10.660 |
So with text we have to use embeddings to turn it into these nice numeric representations. 01:19:17.680 |
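To make the embedding step concrete, here is a tiny sketch; the vocabulary of 70 phonemes, the 120-dimensional vectors and the sequence length of 16 are illustrative assumptions.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# 70 possible phonemes, each mapped to a learned 120-d float vector
emb = Sequential([Embedding(70, 120, input_length=16)])
phoneme_ids = np.array([[3, 17, 42, 5] + [0] * 12])   # one padded example of integer indexes
print(emb.predict(phoneme_ids).shape)                 # (1, 16, 120)
```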
>> RNN is kind of a generic term here, so the specific network we use is an LSTM, but there are others? 01:19:42.540 |
>> Yeah, with all the ones we did in the last part of the course. 01:19:44.540 |
>> So is that like the LSTM would be the best for that task? 01:19:50.920 |
The GRUs and LSTMs are pretty similar, so it's not worth thinking about too much. 01:19:58.480 |
So at this point here, we now have 15 copies of x. 01:20:21.300 |
And so we now pass that into two more layers of RNN. 01:20:27.040 |
So this here is our encoder, and this here is our decoder. 01:20:33.300 |
There's nothing we did particularly in Keras to say this is an encoder, this is a decoder. 01:20:37.760 |
The important thing is the return_sequences=False here, and the RepeatVector here. 01:20:46.320 |
Somehow it has to take this single summary and work it through some layers of RNNs, until then 01:20:54.880 |
at the end we say, okay, here's a dense layer, and it's time distributed. 01:21:01.480 |
So remember that means that we actually have 15 dense layers. 01:21:08.320 |
And so each of these dense layers now has a softmax activation, which means that we 01:21:16.160 |
basically can then do an argmax on that to create our final list of letters. 01:21:22.080 |
So this is kind of our reverse embedding, if you like. 01:21:29.400 |
So the model is very little code, and once we've built it we can just fit it. And again, if things like 01:21:38.080 |
this are mysterious to you, go back and re-watch lessons 4, 5 and 6 to remind yourself how these 01:21:47.560 |
embeddings work and how this kind of time distributed dense works to give us, effectively, one dense layer per output time step. 01:21:58.480 |
It starts with our phoneme input, ends with our time distributed dense output. 01:22:07.560 |
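Putting those pieces together, a minimal sketch of the kind of model being described might look like this in Keras; the layer sizes, vocabulary sizes and sequence lengths are assumptions for illustration, not the exact values from the lesson notebook.

```python
from keras.models import Sequential
from keras.layers import (Embedding, Bidirectional, LSTM, RepeatVector,
                          TimeDistributed, Dense)

n_phonemes, n_letters = 70, 28          # assumed vocabulary sizes
maxlen_phonemes, maxlen_word = 16, 15   # input and output sequence lengths

model = Sequential([
    # encoder: phoneme indexes -> embeddings -> a single summary vector
    Embedding(n_phonemes, 120, input_length=maxlen_phonemes),
    Bidirectional(LSTM(240, return_sequences=True)),
    LSTM(240, return_sequences=False),       # return_sequences=False: one vector out
    # decoder: repeat that vector once per output letter, then two more RNN layers
    RepeatVector(maxlen_word),
    LSTM(240, return_sequences=True),
    LSTM(240, return_sequences=True),
    # effectively 15 copies of a dense softmax layer, one per output time step
    TimeDistributed(Dense(n_letters, activation='softmax')),
])
```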
Our targets are just indexes, remember we turn them into indexes. 01:22:11.480 |
So we use this handy sparse categorical cross entropy, just the same as our normal categorical 01:22:18.160 |
cross entropy, but rather than one hot encoding, we just skip the whole one hot encoding step and use the indexes directly. 01:22:26.360 |
We can go ahead and fit passing in our training data. 01:22:30.800 |
So that was our rectangular data of the phoneme indexes, our labels, and then we can use some 01:22:40.760 |
validation data that we set aside as well. 01:22:48.040 |
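A hedged sketch of the compile-and-fit step; the optimizer, batch size and the variable names x_train, y_train, x_valid, y_valid are assumptions (and the epochs keyword assumes Keras 2; older versions used nb_epoch).

```python
# sparse targets: y_train holds letter indexes directly (no one-hot encoding);
# depending on the Keras version they may need shape (n_words, 15) or (n_words, 15, 1)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=128, epochs=3,
          validation_data=(x_valid, y_valid))
```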
I found that the first three epochs, the loss went down like this. 01:22:52.280 |
The second three epochs it went down like this. 01:22:54.880 |
It seemed to be flattening out, so that's where I stopped it. 01:23:05.680 |
Now what I wanted to do was not just say what percentage of letters are correct, because 01:23:11.560 |
that doesn't really give you the right sense at all. 01:23:14.480 |
What I really want to know is what percentage of words are correct. 01:23:18.640 |
So that's what this little eval_keras function does. 01:23:22.860 |
It takes the thing that I'm trying to evaluate, calls .predict on it, and then does the argmax 01:23:31.960 |
as per usual to take that softmax output and turn it into a specific number, i.e. which character is predicted. 01:23:39.440 |
And then I want to check whether, for all of the characters, the real character equals the predicted one. 01:23:49.380 |
So this is going to return true only if every single item in the word is correct. 01:23:56.600 |
And so then taking the mean of that is going to tell us what percentage of the words did 01:24:03.520 |
And unfortunately the answer is not very many, 26%. 01:24:11.800 |
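Here is a minimal re-creation of that word-level accuracy check; the name eval_keras matches the lesson, but the argument names, target shapes and exact implementation are assumptions.

```python
import numpy as np

def eval_keras(model, x, y):
    preds = model.predict(x)             # (n_words, 15, n_letters) softmax outputs
    best = np.argmax(preds, axis=-1)     # most likely letter at each time step
    # a word only counts as correct if every single letter matches
    word_correct = np.all(best == np.squeeze(y), axis=-1)
    return word_correct.mean()

# e.g. eval_keras(model, x_valid, y_valid) came out around 0.26 in the lesson
```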
So we can go through 20 words at random, and we can print out all of the phonemes. 01:24:23.260 |
We can print out the actual word, and we can print out our prediction. 01:24:30.920 |
So here is a whole bunch of words that I don't really recognize, perturbations. 01:24:38.040 |
It should be spelled like that, slightly wrong. 01:24:46.960 |
So you can see some of the time the mistakes it makes are pretty clear. 01:24:51.220 |
So "larrow" could be spelled like that, but this seems perfectly reasonable. 01:25:02.840 |
And interestingly, what you find is that most of the time when it's way off, it's 01:25:15.040 |
on the longer words. And the reason for that is that this one, where it's terrible, has 01:25:23.520 |
by far the most phonemes: 11 of them. 01:25:31.080 |
So we had to somehow create a single representation that contained all of the information of all 01:25:45.380 |
11 of those phonemes, and then that single vector got passed to the decoder, and that was everything it had to go on. 01:25:58.400 |
So that's the problem with this basic encoder-decoder method. 01:26:06.280 |
And indeed, here is a graph from the paper which originally introduced attentional models. 01:26:24.960 |
What it showed is that as the sentence length got bigger, the standard approach which we're 01:26:36.080 |
talking about, the standard encoder-decoder approach, the accuracy absolutely died. 01:26:43.880 |
So what these researchers did was that they built a new kind of RNN model, called an attentional model. 01:26:51.960 |
And with the attentional model, the accuracy actually stayed pretty good. 01:26:59.080 |
So goal number 1 for the next couple of lessons is for me not to have a cold anymore. 01:27:08.720 |
So basically we're going to finish our deep dive into neural translation, and then we're going to look at time series. 01:27:16.440 |
Although we're not specifically looking at time series, it turns out that the best way 01:27:20.080 |
that I found for time series is not specific to time series at all. 01:27:30.440 |
Reinforcement learning was something I was planning to cover, but I just haven't found 01:27:38.040 |
almost any good examples of it actually being used in practice to solve important real problems. 01:27:44.720 |
And indeed, when you look at the -- have you guys seen the paper in the last week or two 01:27:50.720 |
about using evolutionary strategies for reinforcement learning? 01:27:54.600 |
Basically it turns out that using random search is better than reinforcement learning. 01:28:03.120 |
This paper, by the way, is ridiculously overhyped. 01:28:05.960 |
Evolutionary strategies are something that I was working on over 20 years ago, and 01:28:12.760 |
in those days, genetic algorithms, as we called them, used much more sophisticated 01:28:18.600 |
methods than DeepMind's brand new evolutionary strategies. 01:28:22.680 |
So people are rediscovering these randomized metaheuristics, which is great, but they're 01:28:28.800 |
still far behind where they were 20 years ago, though far ahead of reinforcement learning. 01:28:37.280 |
So given I try to teach things which I think are actually going to stand the test of time, 01:28:42.680 |
I'm not at all convinced that any current technique for reinforcement learning is going 01:28:45.840 |
to stand the test of time, so I don't think we're going to touch that. 01:28:50.880 |
Part 3, yeah, I think before that we might have a part 0 where we do practical machine 01:29:01.880 |
learning for coders, talk about decision tree ensembles and training test splits and stuff like that. 01:29:14.000 |
I'm sure Rachel and I are not going to stop doing this in a hurry, it's really fun and 01:29:20.160 |
interesting and we're really interested in your ideas about how to keep this going. 01:29:27.360 |
By the end of part 2, you guys have put in hundreds of hours, on average maybe 140 hours, 01:29:38.320 |
put together your own box, written blog posts, done hackathons; you're seriously into this now. 01:29:46.200 |
And in fact, I've got to say, this week's kind of been special for me. 01:29:51.920 |
This week's been the week where again and again I've spoken to various folks of you 01:29:56.240 |
guys and heard about how many of you have implemented projects at your workplace that 01:30:04.480 |
have worked and are now running and making your business money, or that you've achieved 01:30:10.000 |
the career thing that you've been aiming for, or that you've won yet another GPU at a hackathon, 01:30:17.600 |
or of course the social impact thing, where it's all these transformative and inspirational projects. 01:30:27.120 |
When Rachel and I started this, we had no idea if it was possible to teach people with no 01:30:36.120 |
specific required math background other than high school math, deep learning to the point that they could use it for real work. 01:30:44.260 |
We thought we probably could because I don't have that background and I've been able to. 01:30:50.960 |
But I've been kind of playing around with similar things for a couple of decades. 01:30:55.240 |
So it was a bit of an experiment, and this week's been the week that, for me, has shown it's worked. 01:31:02.560 |
So I don't know what part 3 is going to look like. 01:31:06.280 |
I think it will be a bit different because it'll be more of a meeting of minds amongst 01:31:11.840 |
a group of people who are kind of at the same level and thinking about the same kinds of things. 01:31:16.920 |
So maybe it's more of an ongoing keeping-our-knowledge-up-to-date kind of thing, it might be more 01:31:26.440 |
of us teaching each other, I'm not sure, I'm certainly interested to hear your ideas. 01:31:36.960 |
We don't normally have two breaks, but I think I need one today, and we're covering a lot. 01:31:41.040 |
So let's have a short break and we'll go for the last 20 minutes or so. 01:32:13.440 |
So, attention models. I actually really like these, I think they're great. 01:32:20.520 |
And really the paper that introduced these was quite an extraordinary paper; it introduced 01:32:28.080 |
both GRUs and attention models at the same time. 01:32:31.320 |
I think it might even have been before the author had finished his PhD, if I remember correctly. 01:32:43.700 |
And the basic idea of an attention model is actually pretty simple. 01:32:52.120 |
You'll see here, here is our decoder, and here's our embedding. 01:33:13.480 |
And notice here, remember that in my get_rnn, return_sequences=True is the default, so the decoder now is getting a whole sequence of states rather than a single vector. 01:33:30.200 |
Now the length of that sequence is equal to the number of phonemes. 01:33:38.160 |
And we know that there isn't a mapping 1 to 1 of phonemes to letters. 01:33:43.160 |
So this is kind of interesting to think about how we're going to deal with this. 01:33:53.600 |
And the states, because we started with a bidirectional layer, state 1 represents a combination of 01:34:00.880 |
everything that's come before the first phoneme and everything that's come after. 01:34:04.400 |
State 2 is everything that's come before the second phoneme and everything that's come after it. 01:34:09.840 |
So the states in a sense are all representing something very similar, but they've got a 01:34:17.080 |
Each one of these states represents everything that comes before and everything that comes 01:34:26.700 |
after that point, but with a focus on that phoneme. 01:34:33.160 |
So what we want to do now is create an RNN where the number of time steps needs 01:34:47.360 |
to be 15, not 16, because remember the length of the word we're creating is 15. 01:34:54.600 |
So we're going to have 15 output time steps, and at each point we wanted to have the opportunity 01:35:07.800 |
to look at all of these 16 output states, but we're going to go in with the assumption 01:35:15.760 |
that only some of those 16 are going to be relevant, but we don't know which. 01:35:24.560 |
So what we want to do is basically take each of these 16 states and do a weighted sum, 01:35:35.600 |
sum of weights times encoded states, where these weights somehow represent how important 01:35:51.300 |
is each one of those 16 inputs for calculating this output, and how important are each of 01:35:58.000 |
those 16 inputs for calculating this output, and so forth. 01:36:03.180 |
If we could somehow come up with a set of weights for every single one of those time 01:36:09.080 |
steps, then we can replace the length 16 thing with a single thing. 01:36:17.000 |
If it turns out that output number 1 only really depends on input number 1 and nothing 01:36:24.360 |
else, then basically that input, those weights are going to be 1, 0, 0, 0, 0, 0, 0, 0, right? 01:36:33.840 |
But if it turns out that output number 1 actually depends on phonemes 1 and 2 equally, then the weights would be 0.5, 0.5, 0, 0, and so on. 01:36:52.360 |
So in other words, we want some function, w_i equals some function, that returns the right 01:37:03.000 |
set of weights to tell us which bits of the encoded input to look at. 01:37:11.520 |
So it so happens we actually know a good way of learning functions. 01:37:28.600 |
So here's the paper, "Neural Machine Translation by Jointly Learning to Align and Translate." 01:37:46.560 |
It's not the clearest paper, in my opinion, in terms of understandability, but let me describe it. 01:38:09.280 |
Okay, let's describe how to read this equation. 01:38:16.260 |
When you see a probability like this, you can very often think of it as a loss function. 01:38:22.840 |
The idea of SGD, basically most of the time when we're using it, is to come up with a 01:38:29.720 |
model where the probabilities that the model creates are as high as possible for the true 01:38:36.080 |
data and as low as possible for the other data. 01:38:39.360 |
That's just another way of talking about a loss function. 01:38:43.440 |
So very often, where we would write a loss function, a paper will write a probability. 01:38:50.720 |
What this here says: earlier on they say that y is basically our outputs, which is very common notation. 01:38:57.880 |
And what this is saying is that the probability of the output at time step i, so at some particular 01:39:04.560 |
time step, depends on, so this bar here means depends on all of the previous outputs. 01:39:12.960 |
In other words, in our spelling thing, when we're looking at the fourth letter that we're 01:39:20.920 |
spelling, it depends on the three letters that we've spelled so far. 01:39:25.680 |
You can't have it depend on the later letters, that's cheating. 01:39:30.600 |
So this is basically a description of the problem, is that we're building something 01:39:34.280 |
which is time-dependent and where the i-th thing that we're creating can only be allowed 01:39:40.360 |
to depend on the previous i-1 things (the comma basically means 'and'), and it's also allowed 01:39:50.000 |
to depend on x; anything in bold is a vector, and this is the vector of inputs. 01:39:56.880 |
And so this here is our list of phonemes, and this here is our list of all of the letters we've generated so far. 01:40:08.360 |
So that whole thing, that whole probability, we're going to calculate using some function. 01:40:21.680 |
And because this is a neural net paper, you can be pretty sure that it's going to turn out to be a neural network. 01:40:27.000 |
And what are the things that we're allowed to calculate with? 01:40:30.440 |
Well we're allowed to calculate with the previous letter that we just translated. 01:40:41.200 |
The RNN hidden state that we've built up so far, and what's this? 01:40:54.080 |
The context vector is a weighted sum of annotations h. 01:41:03.920 |
So these are the hidden states that come out of our encoder, and these are some weights. 01:41:12.680 |
So I'm trying to give you enough information to try and parse this paper over the week. 01:41:22.920 |
The nice thing is that hopefully you guys have now read enough papers that you can look 01:41:29.960 |
at something like this and skip over it, and go, oh, that's just softmax. 01:41:36.840 |
Over time, your pattern recognition starts getting good. 01:41:42.160 |
You start seeing something like this, and you go, oh, that's a weighted sum. 01:41:45.000 |
And you say something like this, and you go, oh, that's softmax. 01:41:48.600 |
People who read papers don't actually read every symbol. 01:41:52.840 |
Their eye looks at it and goes, softmax, weighted sum, logistic function, got it. 01:41:59.560 |
As if it was like pieces of code, only this is really annoying code that you can't look 01:42:04.480 |
up in a dictionary and you can't run and you can't check it and you can't debug it. 01:42:12.120 |
So the alphas are things that came out of a softmax of 01:42:18.860 |
something called e. The other annoying thing about math notation is often you introduce something before you define it. 01:42:26.080 |
So here we are: later on they define e. What's e equal to? 01:42:35.720 |
Some function of the previous hidden state and the encoder state. 01:42:57.420 |
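For reference, here are those pieces written out together as they appear in the Bahdanau et al. paper: s_i is the decoder hidden state, h_j the encoder annotations, and T_x the number of input phonemes.

```latex
\begin{align*}
p(y_i \mid y_1,\ldots,y_{i-1},\mathbf{x}) &= g(y_{i-1}, s_i, c_i) \\
c_i &= \sum_{j=1}^{T_x} \alpha_{ij} h_j \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
e_{ij} &= a(s_{i-1}, h_j)
\end{align*}
```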
Now the important piece here is jointly trained. 01:43:02.440 |
Jointly trained means it's not like a GAN where we train a bit of discriminator and 01:43:08.560 |
It's not like one of these manual attentional models where we first of all figure out the 01:43:14.040 |
nodules are here and then we zoom into them and find them there. 01:43:18.120 |
Jointly trained means we create a single model, a single computation graph if you like, where 01:43:22.920 |
the gradients are going to flow through everything. 01:43:26.220 |
So we have to try and come up with a way basically where we're going to build a standard regular 01:43:31.520 |
RNN, but the RNN is going to use as the input at each time step this. 01:43:43.800 |
So we're going to have to come up with a way of actually making this mini neural net. 01:43:52.040 |
This is just a single one-hidden-layer standard neural net that's going to sit inside every time step of our RNN. 01:44:07.640 |
This whole thing is summarized in another paper. 01:44:16.200 |
This is actually a really cool paper, Grammar as a Foreign Language. 01:44:22.960 |
Lots of names you probably recognize here: Geoffrey Hinton, who's kind of a father of 01:44:29.440 |
deep learning; Ilya Sutskever, who's now I think chief scientist or something at OpenAI; and so on. 01:44:43.080 |
It basically says what if you didn't know anything about grammar and you attempted to 01:44:49.760 |
build a neural net which assigned grammar to sentences, it turns out you actually end 01:44:57.400 |
up with something more accurate than any rule-based grammar system that's been built. 01:45:04.040 |
One of the nice things they do is to summarize all the bits. 01:45:08.920 |
And again, this is where, if you were reading a paper for the first time and didn't 01:45:13.440 |
know what an LSTM was and went, oh, an LSTM is all these things, that's not going to mean much to you. 01:45:19.760 |
You have to recognize how people write stuff in papers: there's no point re-deriving the LSTM equations 01:45:27.320 |
in a paper, so basically you're going to have to go and find the LSTM paper or find a tutorial, 01:45:32.640 |
learn about LSTMs, come back, and in the same way here they summarize the attention model. 01:45:41.080 |
So they say we've adapted the attention model from [2]; if you go and have a look at [2], that's the attention paper we just saw. 01:45:51.000 |
But the nice thing is that because this came a little later, they've done a pretty good 01:45:55.360 |
job of trying to summarize it into a single page. 01:45:59.080 |
So during the week if you want to try and get the hang of attention, you might find 01:46:03.160 |
it good to have a look at this paper and look at their summary. 01:46:08.000 |
And you'll see that basically the idea is that it's a standard sequence to sequence 01:46:14.880 |
model; a standard sequence to sequence model means an encoder producing hidden states, with the final state handed off to a decoder. 01:46:27.240 |
And so we have two separate LSTMs, an encoder and a decoder, and now be careful of the notation: 01:46:36.520 |
the encoder states are going to be called h1 through h_TA, and the decoder 01:46:45.920 |
states d, which we're also going to call h_TA+1 through h_TA+TB. 01:46:54.280 |
So the inputs are 1 through TA, and here you can see it's defining a single-layer neural net. 01:47:06.640 |
So we've got our encoder states and our current decoder state; put them through 01:47:19.120 |
an affine transformation and a nonlinearity, put that through another affine transformation, stick it through a softmax, and use the result as the weights. 01:47:29.660 |
So there's like all of it in one little snapshot. 01:47:34.400 |
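To ground that one-page summary, here is a rough numpy sketch of the scoring step it describes: score each encoder state against the current decoder state with a one-hidden-layer net, softmax the scores, and take the weighted sum as the context vector. All shapes and the parameter names (W1, W2, v) are illustrative, not the paper's exact dimensions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(H, d_t, W1, W2, v):
    # H:   (T_A, n_hidden) encoder states h_1..h_TA
    # d_t: (n_hidden,)     the current decoder state
    scores = np.tanh(H @ W1 + d_t @ W2) @ v   # one score per encoder state, shape (T_A,)
    alpha = softmax(scores)                   # attention weights, sum to 1
    return alpha @ H                          # weighted sum of the encoder states

# random numbers, just to show the shapes working through
T_A, n_hidden, n_att = 16, 240, 64
H, d_t = np.random.randn(T_A, n_hidden), np.random.randn(n_hidden)
W1, W2 = np.random.randn(n_hidden, n_att), np.random.randn(n_hidden, n_att)
v = np.random.randn(n_att)
print(attention_context(H, d_t, W1, W2, v).shape)   # (240,)
```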
So don't expect this to make perfect sense the first time you see it necessarily, but 01:47:40.840 |
hopefully you can kind of see that these bits are all stuff you've seen lots of times before. 01:47:47.360 |
So next week we're going to come back and work through creating this code and seeing how it goes. 01:47:57.320 |
We have two questions. One is: won't the weightings be heavily impacted by the padding done to the input sequences? 01:48:07.920 |
The model learns to handle that; specifically, those weights will learn that the padding is always weighted zero. 01:48:12.800 |
It's not going to take very long to learn to create that pattern. 01:48:17.360 |
And is A shared among all ij pairs, or do we train a separate alignment for each pair? 01:48:23.240 |
No, a is not separately trained; the alphas are the output of a softmax. 01:48:34.560 |
And note that capital letters are matrices, right? 01:48:37.320 |
So we just have to learn a single W1 and a single W2. 01:48:41.960 |
But note that they're being applied to all of the encoded states and the current decoder state. 01:48:58.760 |
And in fact, it's easier to just abstract this all the way back and say it is some function. 01:49:11.800 |
This is the best way to think of it, it's some function of what? 01:49:15.240 |
Some function of the current decoder hidden state and all of the encoder states. 01:49:26.960 |
So those are the inputs to the function, and we just have to learn a set of weights that compute it. 01:49:44.560 |
So I don't feel like I want to introduce something new, so let's take one final AMA before we finish. 01:49:51.720 |
On dealing with unbalanced datasets: there's not really that much clever you can do about it. 01:50:10.160 |
Basically, if you've got, well, a great example would be one of the impact talks, which talked about 01:50:19.040 |
breast cancer detection from mammography scans, and this thing called the Dream Challenge, 01:50:26.880 |
where less than 0.3% of the scans actually had cancer. 01:50:40.200 |
I think the first thing you try to do with such an unbalanced dataset is to ignore the imbalance and see how it goes. 01:50:47.800 |
The reason that often it doesn't go well is that the initial gradients will tend to point 01:50:53.760 |
to say they never have cancer, because that's going to give you a very accurate model. 01:51:00.200 |
So one thing you can try and do is to come up with some kind of initial model which is 01:51:06.760 |
like maybe some kind of heuristic which is not terrible and gets it to the point where 01:51:11.400 |
the gradients don't always point to saying they never have cancer. 01:51:19.020 |
But the really obvious thing to do is to adjust your thing which is creating the mini batches 01:51:24.960 |
so that on every mini batch you grab like half of it as being people with cancer and half without. 01:51:34.560 |
So that way you can still go through lots and lots of epochs, the challenge is that 01:51:42.600 |
the people that do have cancer, you're going to see lots and lots and lots of times, so you risk overfitting to them. 01:51:52.440 |
And then basically there's kind of things between those two extremes. 01:51:57.240 |
So I think what you really need to do is figure out what's the smallest number of people with cancer you can get away with. 01:52:05.560 |
What's the smallest number where the gradients don't point to zero? 01:52:10.440 |
And then create a model where in every mini batch, you make, say, 10% of it people with cancer. 01:52:22.680 |
The good news is once it's working pretty well, you can then decrease the proportion that 01:52:31.080 |
has cancer, because you're already at a point where your model is not pointing off to "never has cancer". 01:52:36.680 |
So you can gradually start to change the sample to have less and less. 01:52:45.080 |
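As a concrete version of that mini-batch adjustment, here is a hedged sketch of a generator that forces a chosen fraction of positives into every batch; the function name, fraction and batch size are all assumptions, not code from the lesson.

```python
import numpy as np

def balanced_batches(x, y, batch_size=64, pos_frac=0.5):
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    n_pos = int(batch_size * pos_frac)
    while True:
        # the rare positives get reused (oversampled); negatives are plentiful
        p = np.random.choice(pos_idx, n_pos, replace=True)
        n = np.random.choice(neg_idx, batch_size - n_pos, replace=False)
        idx = np.random.permutation(np.concatenate([p, n]))
        yield x[idx], y[idx]

# e.g. model.fit_generator(balanced_batches(x_train, y_train), ...) in older Keras,
# lowering pos_frac gradually once training is stable
```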
>> So in this example where you're repeating the positive results over and over again, you're effectively oversampling them. 01:53:06.800 |
Could you get the same results by just throwing away a bunch of the negative examples? 01:53:07.800 |
You could do that, and that's the really quick way to do it, but that way you're not using 01:53:13.320 |
all of your data; the negative examples still carry useful information.