Back to Index

Lesson 12: Cutting Edge Deep Learning for Coders


Chapters

0:00
6:51 find initial centroids
12:48 use standard python loops
31:02 partition the data into some number of partitions
35:10 using the five dimensional clustering
48:07 look at a couple of examples of list comprehensions
53:40 start out by creating two arrays of zeros
60:38 combine the matrices

Transcript

I'm just going to start by going back to clustering. We're going to talk about clustering again in the next lesson or two in terms of an application of it. But specifically what I wanted to do was show you k-means clustering in TensorFlow. There are some things which are easier to do in TensorFlow than PyTorch, mainly because TensorFlow has a more complete API so far.

So there are some things that are just like, oh, there's already a method that does that, but there isn't one in PyTorch. And some things are just a bit neater in TensorFlow than in PyTorch. And I actually found k-means quite easy to do. But what I want to do is I'm going to try and show you a way to write custom TensorFlow code in a kind of a really PyTorch-y way, in a kind of an interactive-y way, and we're going to try and avoid all of the fancy, session-y, graph-y, scope-y business as much as possible.

So to remind you, the way we initially came at clustering was to say: hey, what if we were doing lung cancer detection in CT scans, and these were like 512x512x200 volumetric things, which is too big to conveniently run a whole CNN over. So one of the thoughts to fix that was to run some kind of heuristic that found all of the things that looked like they could vaguely be nodules, and then create a new data set where you basically zoomed into each of those maybe-nodules and created a small little 20x20x20 cube or something, and you could then use a 3D CNN on that, or try a planar CNN.

And this general concept I wanted to remind you about because I feel like it's something which maybe I haven't stressed enough. I've kind of kept on showing you ways of doing this. Going back to lesson 7 with the fish, I showed you the bounding boxes, and I showed you the heat maps.

The reason for all of that was basically to show you how to zoom into things and then create new models based on those zoomed in things. So in the fisheries case, we could really just use a lower-res CNN to find the fish and then zoom into those. In the CT scan case, maybe we can't even do that, so maybe we need to use this kind of mean shift clustering approach.

I'm not saying we necessarily do. It would be interesting to see what the winners use, but certainly, particularly if you don't have lots of time or you have a lot of data, heuristics become more and more interesting. The reason a heuristic is interesting is that you can do something quickly and approximately; it could have lots and lots of false positives, and that doesn't really matter.

Those false positives just mean extra data that you're feeding to your real model. So you can always tune it based on how much time you have to train your real model, and then decide how many false positives you can handle. So as long as your preprocessing model is better than nothing, you can use it to get rid of some of the stuff that is clearly not a nodule, for example.

For example, anything that is in the middle of the lung wall is not a nodule, anything that is all white space is not a nodule, and so forth. So we talked about mean shift clustering and how the big benefit of it is that it allows us to build clusters without knowing ahead of time how many clusters there are.

Also without any special extra work, it allows us to find clusters which aren't kind of Gaussian or spherical if you like in shape. That's really important for something like a CT scan, where a cluster will often be like a vessel, which is this really skinny long thing. So k-means on the other hand is faster, I think it's n^2 rather than n^3 time.

We have talked particularly on the forum about dramatically speeding up mean shift clustering using approximate nearest neighbors, which is something which we started making some progress on today, so hopefully we'll have results from that maybe by next week. But the basic naive algorithm certainly should be a lot faster for k-means, so there's one good reason to use it.

So as per usual, we can start with some data, and we're going to try and figure out where the cluster centers are. So one quick way to avoid hassles in TensorFlow is to create an interactive session. So an interactive session basically means that you can call .run() on a computation graph which doesn't return something, or .eval() on a computation graph that does return something, and you don't have to worry about creating a graph or a session or having a session with clause or anything like that, it just works.
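For instance, a minimal sketch of that pattern (TF 1.x-era API; the toy tensor here is mine, not the notebook's):

```python
import tensorflow as tf

# An interactive session installs itself as the default session, so we can
# evaluate tensors with .eval() and run ops with .run() one step at a time,
# without an explicit `with tf.Session() as sess:` block.
sess = tf.InteractiveSession()

x = tf.random_uniform([6], 0, 10)   # a tiny computation graph
print(x.eval())                     # runs the graph and returns a NumPy array
```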

So that's basically what happens when you call tf.InteractiveSession(). So by creating an interactive session, we can then do things one step at a time. So in this case, the first step in k-means is to pick some initial centroids. So you basically start out and say, okay, if we're going to create however many clusters -- in this case, n_clusters is 6 -- then start out by saying, okay, where might those 6 clusters be?

And for a long time with k-means, people picked them randomly. But most practitioners realized that was a dumb idea soon enough, and a lot of people had various heuristics for picking them. In 2007, finally, a paper was published actually suggesting a heuristic. I tend to use a very simple heuristic, which is what I use here in find_initial_centroids.

So to describe this heuristic, I will show you the code. So find_initial_centroids looks like this. Basically -- and I'm going to run through it quickly, and then I'll run through it slowly. Basically the idea is we first of all pick a single data point index, and then we select that single data point.

So we have one randomly selected data point, and then we find what is the distance from that randomly selected data point to every other data point. And then we say, okay, what is the data point that is the furthest away from that randomly selected data point, the index of it and the point itself.

And then we say, okay, we're going to append that to the initial centroids. So say I picked at random this point as my random initial point, the furthest point away from that is probably somewhere around here. So that would be the first centroid we picked. Okay we're now inside a loop, and we now go back and we repeat the process.

So we now replace our random point with the actual first centroid, and we go through the loop once more. So if we had our first centroid here, our second one now might be somewhere over here. Okay so we now have two centroids. The next time through the loop, therefore, this is slightly more interesting.

All distances, we're now going to have the distance between every one of our initial centroids and every other data point. So we've got a matrix, in this case it's going to be 2 by the number of data points. So then we say, okay, for every data point, find the closest cluster.

So what's its distance to the closest initial centroid? And then tell me which data point is the furthest away from its closest initial centroid. So in other words, which data point is the furthest away from any centroid? So that's the basic algorithm. So let's look and see how we actually do that in TensorFlow.

So it looks a lot like NumPy, except in places you would expect to see np, we see tf, and then we see the API is a little different, but not too different. So to get a random number, we can just use random uniform. We can tell it what type of random number we want, so we want a random int because we're trying to get a random index, which is a random data point.

That's going to be between 0 and the number of data points we have. So that gives us some random index. We can now go ahead and index into our data. Now you'll notice I've created something called vdata. So what is vdata? When we set up this k-means in the first place, the data was sent in as a NumPy array, and then I called tf.Variable on it.

Now this is the critical thing that lets us make TensorFlow feel more like PyTorch. Once I do this, data is now basically copied to the GPU, and so when I'm calling something using vdata, I'm calling this GPU object. Now there's one thing problematic to be aware of, which is that the copying does not actually occur when you call tf.Variable.

The copying occurs when you run the variable initializer. So any time you call tf.Variable, if you then try to run something using that variable, you'll get back an uninitialized variable error unless you call the initializer in the meantime. This is kind of like performance stuff in TensorFlow, where they try to say you can set up lots of variables at once and then call the initializer and it will do it all at once for you.

So earlier on we created this k-means object. We know that in Python when you create an object, it calls __init__ (that's just how Python works); inside that we copied the data to the GPU by using tf.Variable, and then inside find_initial_centroids, we can now access that in order to basically do calculations involving data on the GPU.

In TensorFlow, pretty much everything takes and returns a tensor. So when you call random_uniform, it's giving us a tensor, an array of random numbers. In this case, we just wanted one of them, so we have to use tf.squeeze to take that tensor and turn it into a scalar, because then we're just indexing it into here to get a single item back.

So now that we've got that single item back, we then expand it back again into a tensor because inside our loop, remember, this is going to be a list of initial centroids. It's just that this list happens to be of length 1 at the moment. So one of these tricks in making TensorFlow feel more like PyTorch is to use standard Python loops.

So in a lot of TensorFlow code where it's more serious, performance-intensive stuff, you'll see people use TensorFlow-specific loops like tf.while_loop or tf.scan or tf.map_fn and so forth. The challenge with using those kinds of loops is that it's basically creating a computation graph of that loop. You can't step through it, you can't use it in the normal Pythonic kind of ways.

So we can just use normal Python loops if we're careful about how we do it. So inside our normal Python loop, we can use normal Python functions. So here's a function I created which calculates the distance between everything in this tensor compared to everything in this tensor. So all distances, it looks very familiar because it looks a lot like the PyTorch code we had.

So for the first array, for the first tensor, we add an additional axis to axis 0, and for the second we add an additional axis to axis 1. So the reason this works is because of broadcasting. So A, when it starts out, is a vector, and B is a vector.

Now is A a column or is A a row? What's the orientation of it? Well the answer is it's both and it's neither. It's one-dimensional, so it has no concept of what direction it's facing. So then what we do is we call expand_dims on axis 0, so that's rows.

So that basically says to A, you are now definitely a row vector. You now have one row and however many columns, same as before. And then where else with B, we add an axis at the end. So B is now definitely a column vector, it now has one column and however many rows we had before.

So with broadcasting, what happens is that this one gets broadcast to this length, and this one gets broadcast to this length. So we end up with a matrix containing the difference between every one of these items and every one of these items. So that's like this simple but powerful concept of how we can do very fast GPU-accelerated loops and less code than it would have taken to actually write the loop.

And we don't have to worry about out-of-bounds conditions or anything like that, it's all done for us. So that's the trick here. And once we've got that matrix, because in TensorFlow everything is a tensor, we can call squared_difference rather than just a regular difference, and it gives us the squares of those differences, and then we can sum over the last axis.

So the last axis is the dimensions, so we're just creating a Euclidean distance here. And so that's all this code does. So this gives us every distance between every element of A and every element of B. So that's how we get to this point. So then let's say we've gone through a couple of loops.
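As a quick aside, here's roughly what that distance function looks like, following the broadcasting recipe just described (a sketch, not necessarily the exact notebook code):

```python
def all_distances(a, b):
    # a: (n, d) data points, b: (k, d) points (e.g. centroids).
    # expand_dims turns a into shape (1, n, d) and b into (k, 1, d), so
    # broadcasting gives every pairwise squared difference: (k, n, d).
    diff = tf.squared_difference(tf.expand_dims(a, 0), tf.expand_dims(b, 1))
    # Sum over the last axis (the dimensions) to get squared Euclidean
    # distances: one row per b-point, one column per a-point, shape (k, n).
    return tf.reduce_sum(diff, 2)
```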

So after a couple of loops, R is going to contain a few initial centroids. So we now want to basically find out for every point how far away is it from its nearest initial centroid. So when we go reduce min with axis=0, then we know that that's going over the axis here because that's what we put into our all-distances function.

So it's going to go through, well actually it's reducing across into that axis, so it's actually reducing across our centroids. So at the end of this, it says, alright, this is for every piece of our data how far it is away from its nearest centroid. And that returns the actual distance, because we said do the actual min.

So there's a difference between min and the arg version. So argmax then says, okay, now go through all of the points. We now know how far away they are from their closest centroid, and tell me the index of the one which is furthest away. So argmax is a super handy function.

We used it quite a bit in part 1 of the course, but it's well worth making sure we understand how it works. I think in TensorFlow they're getting rid of these reduce_ prefixes. I'm not sure, I think I read that somewhere. So in some version you may find this is called min rather than reduce_min.

I certainly hope they are. For those of you who don't have such a computer science background, a reduction basically means taking something in a higher dimension and squishing it down into something that's a lower dimension, for example, summing a vector and turning it into a scalar is called a reduction.

So this is a very TensorFlow API, assuming that everybody is a computer scientist and that you wouldn't look for min, you would look for reduce_min. So that's how we got that index. So generally speaking, you have to be a bit careful of data types. I generally don't notice data type problems until I get the error, but if you get an error that says you passed an int64 into something that expected an int32, you can always just cast things like this.

So we need to index something with an int32, so we just have to cast it. And so this returns the actual point, which we append, and then, very similar to NumPy, we stack together the initial centroids to create a tensor of them. So the code doesn't look at all weird or different, but it's important to remember that when we run this code, nothing happens, other than that it creates a computation graph.
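Putting those pieces together, the whole heuristic might look roughly like this (a sketch; self.v_data and self.n_data stand in for whatever the k-means class stored in __init__, and the exact notebook code may differ):

```python
def find_initial_centroids(self, k):
    # Start from one randomly chosen data point.
    r_index = tf.random_uniform([1], 0, self.n_data, dtype=tf.int64)
    r = tf.expand_dims(self.v_data[tf.squeeze(r_index)], 0)       # shape (1, d)

    initial_centroids = []
    for i in range(k):
        # Distances from every centroid chosen so far to every data point: (n_centroids, n).
        dist = all_distances(self.v_data, r)
        # For each data point, how far away is its *closest* centroid ...
        nearest = tf.reduce_min(dist, 0)
        # ... and which data point is furthest from any centroid?
        farthest_index = tf.to_int32(tf.argmax(nearest, 0))        # cast int64 -> int32 for indexing
        initial_centroids.append(self.v_data[farthest_index])
        r = tf.stack(initial_centroids)                            # the centroids picked so far
    return r
```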

So when we call k.find_initial_centroids, nothing happens. But because we're in an interactive session, we can now call .eval, and that actually runs it. And it runs it, and it actually takes the data that's returned from that and copies it off the GPU and puts it back in the CPU as a NumPy array.

So it's important to remember that if you call eval, we now have an actual genuine regular NumPy array here. And this is the thing that makes us be able to write code that looks a lot like PyTorch code, because we now know that we can take something that's a NumPy array and turn it into a GPU tensor like that, and we can take something that's a GPU tensor and turn it into a NumPy array like that.

I suspect this might make TensorFlow developers shake at how horrible this is. It's not really quite the way you're meant to do things I think, but it's super easy and it seems to work pretty well. This approach where we're calling .eval, you do need to be a bit careful.

If this was inside a loop where we were calling eval and we were copying a really big chunk of data back and forth between the GPU and the CPU again and again and again, that would be a performance nightmare. So you do need to think about what's going on as you do it.

So we'll look inside the inner loop in a moment and just check. Anyway, the result's pretty fantastic. As you can see, this little hacky heuristic does a great job. It's a hacky heuristic I've been using for decades now, and it's the kind of thing which often doesn't appear in papers. In this case, a similar hacky heuristic did actually appear in a paper in 2007, and an even better one appeared just last year.

But it's always worth thinking about how you can pre-process your data to get it close to where you might want it to be. And often these kinds of approaches are useful. There's actually -- I don't know if we'll have time to maybe talk about it someday -- an approach to doing PCA, Principal Component Analysis, which has a similar flavor, basically picking random points and finding the furthest points away from them.

So it's a good general technique actually. So we've got our initial centroids. What do we do next? Well, what we do next is we're going to be doing more computation in TensorFlow with them, so we want to copy them back to the GPU. And so because we copied them to the GPU, before we do an eval or anything later on, we're going to have to make sure we call tf.global_variables_initializer().run().
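So the round trip looks something like this (variable names are illustrative; n_clusters is assumed to be whatever the k-means object was given):

```python
# Run the graph and pull the result back to the CPU as a NumPy array ...
initial_centroids = k.find_initial_centroids(n_clusters).eval()

# ... then copy it back onto the GPU as a variable we can keep updating.
curr_centroids = tf.Variable(initial_centroids)

# A new variable means the initializer has to run again before the next .eval()/.run().
tf.global_variables_initializer().run()
```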

Question: Can you explain what happens if you don't create an interactive session? So what the TensorFlow authors decided to do in their wisdom was to generate their own whole concept of namespaces and variables and whatever else. So rather than using Python's, there's TensorFlow's own. And so a session is basically kind of like a namespace that holds the computation graphs and the variables and so forth.

You can, and then there's this concept of a context manager, which is basically where you have a with clause in Python and you say with this session, now you're going to do this bunch of stuff in this namespace. And then there's a concept of a graph. You can have multiple computation graphs, so you can say with this graph, create these various computations.

Where it comes in very handy is if you want to say run this graph on this GPU, or stick this variable on that GPU. So without an interactive session, you basically have to create that session, you have to say which session to use using a with clause, and then there's many layers of that.

So within that you can then create name scopes and variable scopes and blah blah blah. So the annoying thing is the vast majority of tutorial code out there uses all of these concepts. It's as if all of Python's OO and variables and modules don't exist, and you use TensorFlow for everything.

So I wanted to show you that you don't have to use any of these concepts, pretty much. Thank you for the question. Question: I haven't quite finished thinking through this, but have you tried -- if you had initially said I have seven clusters, or eight clusters, what you would find after you hit your six is you'd all of a sudden start getting centroids that were very close to various existing centroids.

So it seems like you could somehow intelligently define a width of a cluster, or kind of look for a jump in things dropping down and how far apart they are from some other cluster, and programmatically come up with a way to decide the number of clusters. Yeah, I think you could.

Maybe then you're using TK-means, I don't know. I think it's a fascinating question. I haven't seen that done. There are certainly papers about figuring out the number of clusters in k-means. So maybe during the week you could check one out, port it to TensorFlow; that would be really interesting.

And I just wanted to follow up what you said about sessions to emphasize that with a lot of tutorials, you could make the code simpler by using an interactive session in a Jupyter notebook instead. I remember when Rachel was going through a TensorFlow course a while ago and she kept on banging her head against a desk with sessions and variable scopes and whatever else.

That was part of what led us to think, "Okay, let's simplify all that." So step one was to take our initial centroids and copy them onto the GPU. So we now have a symbol representing those. So the next step in the K-means algorithm is to take every point and assign them to a cluster, which is basically to say for every point which of the centroids is the closest.

So that's what assign_to_nearest does. We'll get to that in a moment, but let's pretend we've done it. This will now be a list of which centroid is the closest for every data point. So then we need one more piece of TensorFlow concepts, which is that we want to update an existing variable with some new data.

And so we can actually call update_centroids to basically do that updating, and I'll show you how that works as well. So basically the idea is we're going to loop through doing this again and again and again. But when we just do it once, you can actually see it's nearly perfect already.

So it's a pretty powerful idea as long as your initial cluster centers are good. So let's see how this works: assign_to_nearest. There's a single line of code. The reason there's a single line of code is we already have the code to find the distance between every piece of data and its centroid.

Now rather than calling tf.reduce_min, which returned the distance to its nearest centroid, we call tf.argmin to get the index of its nearest centroid. So generally speaking, the hard bit of doing this kind of highly vectorized code is figuring out this number, which is what axis we're working with. And so it's a good idea to actually write down on a piece of paper what each of your tensors looks like.

It's like it's time by batch, by row, by column, or whatever. Make sure you know what every axis represents. When I'm creating these algorithms, I'm constantly printing out the shape of things. Another really simple trick, but a lot of people don't do this, is make sure that your different dimensions actually have different sizes.

So when you're playing around with testing things, don't have a batch size of 10 and an n of 10 and a number of dimensions of 10. I find it much easier to think of real numbers, so have a batch size of 8 and an n of 10 and a dimensionality of 4.

Because every time you print out the shape, you'll find out exactly what everything is. So this is going to return the nearest indices. So then we can go ahead and update the centroids. So here is update_centroids. And suddenly we have some crazy function. And this is where TensorFlow is super handy.

It's full of crazy functions. And if you know the computer science term for the thing you're trying to do, it's generally called that. The only other way to find it is just to do lots and lots of searching through the documentation. So in general, taking a set of data and sticking it into multiple chunks of data according to some kind of criteria is called partitioning in computer science.

So I got a bit lucky when I first looked for this. I googled for TensorFlow partition and bang, this thing popped up. So let's take a look at it. And this is where reading about GPU programming in general is very helpful. Because in GPU programming there's this kind of smallish subset of things which everything else is built on, and one of them is partitioning.

So here we have tf.dynamic_partition. It partitions the data into some number of partitions using some indices. And generally speaking, it's easiest to just look at some code. So here's our data. We're going to create two partitions, we're calling them clusters, and it's going to go like this: 0th partition, 0th, 1st, 1st, 0th.

So 10 will go to the 0th partition, 20 will go to the 0th partition, 30 will go to the 1st partition. Okay, this is exactly what we want. So the nice thing is that there are a lot of these; there are so many functions available that often there's the exact function you need.

And here it is. So we just take our list of indices, convert it to a list of int32s, pass in our data, the indices and the number of clusters, and we're done. This is now a separate array, basically a separate tensor, for each of our clusters. So now that we've done that, we can then figure out what is the mean of each of those clusters.
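In code, the toy example looks something like this (values as in the lecture, run under the interactive session):

```python
data = tf.constant([10., 20., 30., 40., 50.])
indices = tf.constant([0, 0, 1, 1, 0])                # which partition each element goes to
parts = tf.dynamic_partition(data, indices, num_partitions=2)
print([p.eval() for p in parts])                      # [array([10., 20., 50.]), array([30., 40.])]
```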

So the mean of each of those clusters is our new centroid. So what we're doing is we're saying which points are the closest to this one, and these points are the closest to this one, okay, what's the average of those points? That's all that happened from here to here.

So that's taking the mean of those points. And then we can basically say, okay, those are our new clusters. So then just join them all together, concatenate them together. Except for that dynamic_partition -- well, in fact, including that dynamic_partition -- that was incredibly simple, but it was incredibly simple because we had a function that did exactly what we wanted.
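A rough sketch of those two steps, reusing the all_distances helper from before (again, names and details are mine rather than the exact notebook code):

```python
def assign_to_nearest(self, centroids):
    # For every data point, the index of its closest centroid: shape (n,).
    return tf.argmin(all_distances(self.v_data, centroids), 0)

def update_centroids(self, nearest_indices):
    # Split the data into one tensor per cluster ...
    partitions = tf.dynamic_partition(self.v_data, tf.to_int32(nearest_indices),
                                      self.n_clusters)
    # ... and the mean of each partition becomes the new centroid.
    return tf.concat([tf.expand_dims(tf.reduce_mean(p, 0), 0) for p in partitions], 0)
```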

So because we assigned a variable up here, we have to call initializer.run. And then of course before we can do anything with this tensor, we have to call .eval to actually run the computation graph and copy the result back to our CPU. So that's all those steps. So then we want to replace the contents of current centroids with the contents of updated centroids.

And so to do that, we can't just say equals; everything is different in TensorFlow, you have to call .assign. So this is the same as basically saying current centroids equals updated centroids, but it's creating a computation graph that basically does that assignment on the GPU. Question: How can we extrapolate this to other non-numeric data types such as words or images?

Well they're all numeric data types really, so an image is absolutely a numeric data type. So it's just a bunch of pixels, you just have to decide what distance measure you want, which generally just means deciding you're probably using Euclidean distance, but are you doing it in pixel space, or are you picking one of the activation layers in a neural net?

For words, you would create a word vector for your words. There's nothing specifically two-dimensional about this. This works in as many dimensions as we like. That's really the whole point, and I'm hoping that maybe during the week, some people will start to play around with some higher dimensional data sets to get a feel for how this works.

Particularly if you can get it working on CT scans, that would be fascinating, using the five-dimensional clustering we talked about. So here's what it looks like in total if we weren't using an interactive session. You basically say with tf.Session(): that creates a session, and as_default sets it to the current session, and then within the with block, we can now run things.

And then k.run does all the stuff we just saw, so if we go to k.run, here it is. So k.run does all of those steps. So this is how you can create a complete computation graph in TensorFlow using a notebook. You do each one, one step at a time.

Once you've got it working, you put it all together. So you can see find_initial_centroids.eval, put it back into a variable again, assign_to_nearest, update_centroids. Because we created a variable in the process there, we then have to rerun global_variables_initializer. We could have avoided this I guess by not calling eval and just treating this as a variable the whole time, but it doesn't matter, this works fine.

And then we just loop through a bunch of times, calling centroids.assign(updated_centroids). What we should be doing after that is then calling update_centroids each time. Then the nice thing is, because I've used a normal Python for loop here and I'm calling .eval each time, it means I can check, "Oh, have any of the cluster centroids moved?" And if they haven't, then I can stop.
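Putting it together, the run method sketched across these steps might look like this (max_iters, the attribute names, and the np.allclose convergence check are my additions for illustration):

```python
import numpy as np

def run(self, max_iters=10):
    # Build the graph one step at a time, exactly as above.
    initial_centroids = self.find_initial_centroids(self.n_clusters).eval()
    curr_centroids = tf.Variable(initial_centroids)
    nearest_indices = self.assign_to_nearest(curr_centroids)
    updated_centroids = self.update_centroids(nearest_indices)
    tf.global_variables_initializer().run()

    assign_op = curr_centroids.assign(updated_centroids)

    # A plain Python loop with .eval() each time, so we can stop as soon as
    # the centroids stop moving.
    for i in range(max_iters):
        prev = curr_centroids.eval()
        new = assign_op.eval()
        if np.allclose(prev, new):
            break
    return new
```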

So it makes it very easy to create dynamic for loops, which could be quite tricky sometimes with TensorFlow otherwise. So that is the TensorFlow algorithm from end to end. Rachel, do you want to pick out an AMA question? So actually I kind of am helping start a company, I don't know if you've seen my talk on ted.com, but I kind of show this demo of this interactive labelling tool, a friend of mine said that he wanted to start a company to actually make that and commercialize it.

So I guess my short answer is I'm helping somebody do that because I think that's pretty cool. More generally, I've mentioned before I think that the best thing to do is always to scratch an itch, so pick whatever you've been passionate about or something that's just driven you crazy and fix it.

If you have the benefit of being able to take enough time to do absolutely anything you want, I felt like the three most important areas for applying deep learning when I last looked, which was two or three years ago, were medicine, robotics, and satellite imagery. Because at that time, computer vision was the only area that was remotely mature for machine learning, deep learning, and those three areas all were areas that very heavily used computer vision or could heavily use computer vision and were potentially very large markets.

Medicine is probably the largest industry in the world, I think it's $3 trillion in America alone, robotics isn't currently that large, but at some point it probably will become the largest industry in the world if everything we do manually is replaced with automated approaches. And satellite imagery is massively used by military intelligence, so we have some of the biggest budgets in the world, so I guess those three areas.

I'm going to take a break soon, before I do, I might just introduce what we're going to be looking at next. So we're going to start on our NLP, and specifically translation deep dive, and we're going to be really following on from the end-to-end memory networks from last week.

One of the things that I find kind of most interesting and most challenging in setting up this course is coming up with good problem sets which are hard enough to be interesting and easy enough to be possible. And often other people have already done that, so I was lucky enough that somebody else had already shown an example of using sequence-to-sequence learning for what they called Spelling Bee.

Basically we start with this thing called the CMU Pronouncing Dictionary, which has entries that look like this: Zwicky, followed by a phonetic description of how to read Zwicky. So the way these work, this is actually specifically an American pronunciation dictionary. The consonants are pretty straightforward. The vowel sounds have a number at the end showing how much stress is on each one.

So 0, 1, or 2. So in this case you can see that the middle one is where most of the stress is, so it's Zwicky. So here is the letter A, and it is pronounced A. So the goal that we're going to be looking at after the break is to do the other direction, which is to start with how do you say it and turn it into how do you spell it.

This is quite a difficult question because English is really weird to spell. And the number of phonemes doesn't necessarily match the number of letters. So this is going to be where we're going to start. And then we're going to try and solve this puzzle, and then we'll use a solution from this puzzle to try and learn to translate French into English using the same basic idea.

So let's have a 10-minute break and we'll come back at 7:40. So just to clarify, I just want to make sure everybody understands the problem we're solving here. So the problem we're solving is we're going to be told here is how to pronounce something, and then we have to say here is how to spell it.

So this is going to be our input, and this is going to be our target. So this is like a translation problem, but it's a bit simpler. So we don't have pre-trained phoneme vectors or pre-trained letter vectors. So we're going to have to do this by building a model, and we're going to have to create some embeddings of our own.

So in general, the first steps necessary to create an NLP model tends to look very, very similar. I feel like I've done them in a thousand different ways now, and at some point I really need to abstract this all out into a simple set of functions that we use again and again and again.

But let's go through it, and if you've got any questions about any of the code or steps or anything, let me know. So the basic pronunciation dictionary is just a text file, and I'm going to just grab the lines which are actual words so they have to start with a letter.

Now, we're going to go through every line in the text file. Here's a handy thing that a lot of people don't realize you can do in Python: when you call open, that returns an iterator over all of the lines. So if you just go for l in open(...), that's now looping through every line in that file.

So I can then say filter those which start with a lowercase letter, and then strip off any white space and split it on white space. So that's basically the steps necessary to separate out the word from the pronunciation. And then the pronunciation is just white space delimited, so we can then split that.

And that's the steps necessary to get the word and the pronunciation as a set of phonemes. So as we tend to pretty much always do with these language models, we next need to get a list of what are all of the vocabulary items. So in this case, the vocabulary items are all the possible phonemes.

So we can create a set of every possible phoneme, and then we can sort it. And what we always like to do is get an extra character or an extra object in position 0, because remember we use 0 for padding. So that's why I'm going to use underscore as our special padding letter here.

So I'll stick an underscore at the front. So here are the first 5 phonemes. This is our special padding one, which is going to be index 0. And then there's the same vowel sound with 3 different levels of stress, and so forth. Now the next thing that we tend to do anytime we've got a list of vocabulary items is to create a mapping in the opposite direction.

So we go from phoneme to index, which is just a dictionary where we enumerate through all of our phonemes and put it in the opposite order. So from phoneme to index. I know we've used this approach a thousand times before, but I just want to make sure everybody understands it.

When you use enumerate in Python, it doesn't just return each phoneme, but it returns a tuple that contains the index of the phoneme and then the phoneme itself. So that's the key and the value. So then if we go value,key, that's now the phoneme followed by the index. So if we turn that into a dictionary, we now have a dictionary which you can give it a phoneme and return it an index.

Here's all the letters of English. Again with our special underscore at the front. We've got one extra thing we'll talk about later, which is an asterisk. So that's a list of letters. And so again, to go from letter to letter index, we just create a dictionary which reverses it again.
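A hedged reconstruction of those preprocessing steps (the file name, the exact filter, and the variable names are my guesses rather than the notebook's):

```python
import re

# Keep only lines that begin with a letter (skipping comment/punctuation entries),
# then split each into the word and its whitespace-delimited phoneme string.
lines = [l.strip().split(None, 1) for l in open('cmudict-0.7b', encoding='latin1')
         if re.match('[a-zA-Z]', l)]
pronounce_raw = {w.lower(): ps.split() for w, ps in lines}

# Phoneme vocabulary, with '_' reserved at index 0 for padding.
phonemes = ['_'] + sorted(set(p for ps in pronounce_raw.values() for p in ps))
p2i = {p: i for i, p in enumerate(phonemes)}

# Letter vocabulary: padding underscore, the alphabet, and the asterisk we'll use later.
letters = '_abcdefghijklmnopqrstuvwxyz*'
l2i = {l: i for i, l in enumerate(letters)}
```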

So now that we've got our phoneme to index and letter to index, we can use that to convert this data into numeric data, which is what we always do with these language models. We end up with just a list of indices. We can pick some maximum length word, so I'm just going to say 15.

So we're going to create a dictionary which maps from each word to a list of phonemes, and we're going to get the indexes for them. Yes Rachel. So this dictionary comprehension is a little bit awkward, so I thought this would be a good opportunity to talk about dictionary comprehensions and list comprehensions for a moment.

So we're going to pause this in a moment, but first of all, let's look at a couple of examples of list comprehensions. So the first thing to note is when you go something like this, a string x, y, z or this string here, Python is perfectly happy to consider that a list of letters.

So Python considers this the same as being a list of x, y, z. So you can think of this as two lists, a list of x, y, z and a list of a, b, c. So here is the simplest possible list comprehension. So go through every element of a and put that into a list.

So if I call that, that returns exactly what I started with. So that's not very interesting. What if now we replaced 'o' with another list comprehension? So what that's going to do is it's now going to return a list for each list. So this is one way of pulling things out of sub-lists, is to basically take the thing that was here and replace it with a new list comprehension, and that's going to give you a list of lists.

Now the reason I wanted to talk about this is because it's quite confusing. In Python you can also write this, which is different. So in this case, I'm going for each object in our a list, and then for each object in that sub-list. And do you see what's different here?

I don't have square brackets, it's just all laid out next to each other. So I find this really confusing, but the idea is you're meant to think of this as just being like a normal for loop inside a for loop. And so what this does is it goes through x, y, z and then a, b, c, and then in x, y, z it goes through each of x and y and z, but because there's no embedded set of square brackets, that actually ends up flattening the list.

So we're about to see an example of the square bracket version, and pretty soon we'll be seeing an example of this version as well. These are both useful, right? It's very useful to be able to flatten a list, it's very useful to be able to do things with sub-lists.

And then just be aware that any time you have any kind of expression like this, you can replace the thing here with any expression you like. So we could say, for example, o.upper(), so you can basically map different computations to each element of a list, and then the second thing you can do is put an if here to filter it, like if o[0]=='x'.

So that's basically the idea. You can create any list comprehension you like by putting computations here, filters here and optionally multiple lists of lists here. The other thing you can do is replace the square brackets with curly brackets, in which case you need to put something before a colon and something after a colon, the thing before is your key and the thing after is your value.

So here we're going for, oh, and then there's another thing you can do, which is if the thing you're looping through is a bunch of lists or tuples or anything like that, you can pull them out into two pieces, like so. So this is the word, and this is the list of phonemes.

So we're going to have the lower case word will be our keys in our dictionary, and the values will be lists, so we're doing it just like we did down here. And the list will be, let's go through each phoneme and go phoneme to index. So now we have something that maps from every word to its list of phoneme indexes.
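To make those variants concrete with toy values (and the p2i mapping sketched earlier):

```python
a = ['xyz', 'abc']

[o for o in a]                       # ['xyz', 'abc']                  - simplest case
[[p for p in o] for o in a]          # [['x','y','z'], ['a','b','c']]  - list of lists
[p for o in a for p in o]            # ['x','y','z','a','b','c']       - flattened
[o.upper() for o in a]               # ['XYZ', 'ABC']                  - map a computation
[o for o in a if o[0] == 'x']        # ['xyz']                         - filter

# Dict comprehension with tuple unpacking, as in the pronunciation dictionary:
items = [('zwicky', ['Z', 'W', 'IH1', 'K', 'IY0'])]   # example (word, phonemes) pairs
{w: [p2i[p] for p in ps] for w, ps in items}          # word -> list of phoneme indexes
```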

We can find out what the maximum length of anything is in terms of how many phonemes there are, and we can do that by again. We can just go through every one of those dictionary items, calling length on each one and then doing a max on that. So there is the maximum length.

So you can see combining list comprehensions with other functions is also powerful. So finally we're going to create our nice square arrays. Normally we do this with Keras's pad_sequences; just for a change, we're going to do this manually this time. So the key is that we start out by creating two arrays of zeros, because all the padding is going to be zero.

So if we start off with all zeros, then we can just fill in the non-zeros. So this is going to be all of our phonemes. This is going to be our actual spelling, that's our target labels. So then we go through all of the things in the pronunciation dictionary -- we've permuted them randomly, so they're in random order -- and we put into input all of the items from that pronunciation dictionary, and into labels we put letter-to-index.

So we now have one thing called input, one thing called labels that contains nice rectangular arrays padded with zeros containing exactly what we want. I'm not going to worry about this line yet because we're not going to use it for the starting point. So anywhere you see DEC something, just ignore that for now, we'll get back to that later.

train_test_split is a very handy function from sklearn that takes all of these lists and splits them all in the same way, with this proportion in the test set. And so input becomes input_train and input_test, labels becomes labels_train and labels_test. So that's pretty handy. We've often written that manually, but this is a nice quick way to do it when you've got lots of lists to split.
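Sketching those last steps (the 15 follows the lecture; pronounce_dict is assumed to be the word-to-phoneme-index mapping from the comprehension above, already filtered to words of at most 15 letters, and the test-set proportion is a guess):

```python
import numpy as np
from sklearn.model_selection import train_test_split

maxlen = 15                                                  # longest spelling we will handle
maxlen_p = max(len(ps) for ps in pronounce_dict.values())    # longest phoneme sequence
words = list(pronounce_dict.keys())

# Start with all zeros so the padding is already in place, then fill in the rest.
input_ = np.zeros((len(words), maxlen_p), dtype=np.int64)    # phoneme indexes
labels = np.zeros((len(words), maxlen), dtype=np.int64)      # letter indexes
for i, w in enumerate(words):
    for j, p in enumerate(pronounce_dict[w]): input_[i, j] = p
    for j, c in enumerate(w):                 labels[i, j] = l2i[c]

(input_train, input_test,
 labels_train, labels_test) = train_test_split(input_, labels, test_size=0.1)
```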

So just to have a look at how many phonemes we have in our vocabulary, there are 70, how many letters in our vocabulary, there's 28, that's because we've got that underscore and the star as well. So let's go ahead and create the model. Here's the basic idea. The model has three parts.

The first is an embedding. So the embedding is going to take every one of our phonemes. Max_len_p is the maximum number of phonemes we have in any pronunciation. And each one of those phonemes is going to go into an embedding. And the lookup for that embedding is the vocab size for phonemes, which I think was 70.

And then the output is whatever we decide, what dimensionality we want. And in experimentation, I found 120 seems to work pretty well. I was surprised by how high that number is, but there you go, it is. We started out with a list of phonemes, and then after we go through this embedding, we now have a list of embeddings.

So this is like 70, and this is like 120. So the basic idea is to take this big thing, which is all of our phonemes embedded, and we want to turn it into a single distributed representation which contains all of the richness of what this pronunciation says. Later on we're going to be doing the same thing with an English sentence.

And so we know that when you have a sequence and you want to turn it into a representation, one great way of doing that is with an RNN. Now why an RNN? Because an RNN we know is good at dealing with things like state and memory. So when we're looking at translation, we really want something which can remember where are we.

So let's say we were doing this simple phonetic translation, the idea of have we just had a C? Because if we just had a C, then the H is going to make a totally different sound to if we haven't just had a C. So an RNN we think is a good way to do this kind of thing.

And in general, this whole class of models, remember, is called seq2seq, sequence-to-sequence models, which is where we start with some arbitrary length sequence and we produce some arbitrary length sequence. And so the general idea here is that taking that arbitrary length sequence and turning it into a fixed size representation using an RNN is probably a good first step.

So looking ahead, I'm actually going to be using quite a few layers of RNN. So to make that easier, we've created a getRNN function. You can put anything you like here, GRU or LSTM or whatever. And yes indeed, I am using dropout. The kind of dropout that you use in an RNN is slightly different to normal dropout.

It turns out that it's best to drop out the same things at every time step in an RNN. There's a really good paper that explains why this is the case and shows that it works. So this is why there's a special dropout parameter inside the RNN in Keras, because it does this proper RNN-style dropout.

So I put in a tiny bit of dropout here, and if it turns out that we overfit, we can always increase it. If we don't, we can always turn it to zero. So what we're going to do is -- yes, Rachel? I don't know if you remember, but when we looked at doing RNNs from scratch last year, we learned that you could actually combine the matrices together and do a single matrix computation.

If you do that, it's going to use up more memory, but it allows the GPU to be more highly parallel. So if you look at the Keras documentation, it will tell you the different things you can use, but since we're using a GPU, you probably always want to say consume_less='gpu'.
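The helper might be as simple as this (Keras-1-era arguments matching the lecture; the 240 units and the 0.1 dropout are my guesses at the "tiny bit of dropout" mentioned):

```python
from keras.layers import LSTM

def get_rnn(return_sequences=True):
    # dropout_U / dropout_W drop the same units at every time step (proper RNN dropout);
    # consume_less='gpu' combines the matrices for a more parallel GPU computation.
    return LSTM(240, dropout_U=0.1, dropout_W=0.1,
                consume_less='gpu', return_sequences=return_sequences)
```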

The other thing that we learned about last year is bidirectional RNNs. And maybe the best way to come at this is actually to go all the way back and remind you how RNNs work. We haven't done much revision, but it's been a while since we've looked at RNNs in much detail.

So just to remind you, this is kind of our drawing of a totally basic neural net. Square is input, circle is intermediate activations (hidden), and triangle is output, and arrows represent affine transformations followed by non-linearities. We can then have multiple copies of those to create deeper networks, for example. And so the other thing we can do is actually have inputs going in at different places.

So in this case, if we were trying to predict the third character from first two characters, we can use a totally standard neural network and actually have input coming in at two different places. And then we realized that we could kind of make this arbitrarily large, but what we should probably do then is make everything where an input is going to a hidden state be the same matrix.

So this color coding, remember, represents the same weight matrix. So hidden to hidden would be the same weight matrix and hidden to output, and it's a separate weight matrix. So then to remind you, we realized that we could draw that more simply like this. So RNNs, when they're unrolled, just look like a normal neural network in which some of the weight matrices are tied together.

And if this is not ringing a bell, go back to lesson 5 where we actually build these weight matrices from scratch and tie them together manually so that will hopefully remind you of what's going on. Now importantly, we can then take one of those RNNs and have the output go to the input of another RNN.

And these are stacked RNNs. And stacked RNNs basically give us richer computations in our recurrent neural nets. And this is what it looks like when we unroll it. So you can see here that we've got multiple inputs coming in, going through multiple layers and creating multiple outputs. But of course we don't have to create multiple outputs.

You could also get rid of these two triangles here and have just one output. And remember in Keras, the difference is whether or not we say return_sequences=true or return_sequences=false. This one you're seeing here is return_sequences=true. This one here is return_sequences=false. So what we've got is input_train has 97,000 words.

Each one is of length 16. It's 15 characters long plus the padding, and then labels is 15 because we chose earlier on that our max length would be a 15-long spelling. So phonemes don't match to letters exactly. So after the embedding -- so if we take one of those tens of thousands of words, remember it was a list of phonemes, of length 16.

And then we're putting it into an embedding matrix which is 70 by 120. And the reason it's 70 is that each of these phonemes contains a number between 0 to 69. So basically we go through and we get each one of these indexes and we look up to find it.

So this is 5 here, then we find number 5 here. And so we end up with 16 by 120. "Are we then taking a sequence of these phonemes represented as 120 dimensional floating point vectors and using an RNN to create a sequence of word2vec embeddings which we will then reverse to actual words?" So we're not going to use word2vec here, right?

Word2vec is a particular set of pre-trained embeddings. We're not using pre-trained embeddings, we have to create our own embeddings. We're creating phoneme embeddings. So if somebody else later on wanted to do something else with phonemes, and we saved the result of this, we could provide phoneme2vec. And you could download them and use the fast.ai pre-trained phoneme2vec embeddings.

This is how embeddings basically get created. It's people build models starting with random embeddings and then save those embeddings and make them available for other people to use. "I may be misinterpreting it, but I thought the question was getting at the second set of embeddings when you want to get back to your words." So let's wait until we get there, because we're going to create letters, not words, and then we'll just join the letters together.

So there won't be any word2vec here. So we've got, as far as creating our embeddings, we've then got an RNN which is going to take our embeddings and attempt to turn it into a single vector. That's kind of what an RNN does. So we've got here return sequences by default is true, so this first RNN returns something which is just as long as we started with.

And so if you want to stack RNNs on top of each other, every one of them is return sequences equals true until the last one isn't. So that's why we have false here. So at the end of this one, this gives us a single vector which is the final state.

The other important piece is bidirectional. And bidirectional, you can totally do this manually yourself. You take your input and feed it into an RNN, and then you reverse your input and feed it into a different RNN, and then just concatenate the two together. So Keras has something which does that for you, which is called bidirectional.

Bidirectional actually requires you to pass it an RNN, so it takes an RNN and returns two copies of that RNN stacked on top of each other, one of which reverses its input. And so why is that interesting? That's interesting because often in language, what happens later influences what comes before.

For example, in French, the gender of your definite article depends on the noun that it refers to. So you need to be able to look backwards or forwards in both directions to figure out how to match those two together. Or in any language with tense, what verb do you use depends on the tense and often also depends on the details about the subject and the object.

So we want to be able to both look forwards and look backwards. So that's why we want two copies of the RNN, one which goes from left to right and one which goes from right to left. And indeed, we could imagine that when you spell things -- I'm not exactly sure how this would work -- the later stresses, or the later details of the phonetics, might change how you spell things earlier on.

Does the bidirectional RNN concat the two RNNs or does it stack them? You end up with the same number of dimensions that you had before, but it basically doubles the number of features that you have. So in this case, we have 240, so it just doubles those, and I think we had one question here.

Okay, so let's simplify this down a little bit. Basically, we started out with a set of embeddings, and we've gone through two layers, we've gone through a bidirectional RNN, and then we feed that to a second RNN to create a representation of this ordered list of phonemes. And specifically, this is a vector.

So x at this point is a vector, because return sequence is equals false. That vector, once we've trained this thing, the idea is it represents everything important there is to know about this ordered list of phonemes, everything that we could possibly need to know in order to spell it.

So the idea is we could now take that vector and feed it into a new RNN, or even a few layers of RNN, and that RNN could basically go through and with return sequence equals true this time. It could spit out at every time step what it thinks the next letter in this spelling is.

And so this is how a sequence-to-sequence works. One part, which is called the encoder, takes our initial sequence and turns it into a distributed representation into a vector using, generally speaking, some stacked RNNs. Then the second piece, called the decoder, takes the output of the encoder and passes that into a separate stack of RNNs with return sequence equals true.

And those RNNs are taught to then generate the labels, in this case the spellings, or in our later case the English sentences. Now in Keras, it's not convenient to create an RNN by handing it some initial state, some initial hidden state. That's not really how Keras likes to do things.

Keras expects to be handed a list of inputs. Problem number one. Problem number two, if you do hand it to an RNN just at the start, it's quite hard for the RNN to remember the whole time what is this word I'm meant to be translating. It has to keep two things in its head.

One is what's the word I'm meant to be spelling, and the second is what's the letter I'm trying to spell right now. So what we do with Keras is we actually take this whole state and we copy it. So in this case we're trying to create a word that could be up to 15 letters long, so in other words 15 time steps.

So we take this and we actually make 15 copies of it. And those 15 copies of our final encoder state becomes the input to our decoder RNN. So it seems kind of clunky, but it's actually not difficult to do. In Keras we just go like this. We take the output from our encoder and we repeat it 15 times.

So we literally have 15 identical copies of the same vector. And so that's how Keras expects to see things, and it also turns out that you get better results when you pass into the RNN the state that it needs again and again at every time step. So we're basically passing in something saying we're trying to spell this word, we're trying to spell this word, we're trying to spell this word, we're trying to spell this word.

And then as the RNN goes along, it's generating its own internal state, figuring out what have we spelt so far and what are we going to have to spell next. Why can't we have return_sequences=true for the second bidirectional LSTM? Not bidirectional for the second LSTM, we only have one bidirectional LSTM.

We don't want return_sequences=true here because we're trying to create a representation of the whole word we're trying to spell. So there's no point having something saying here's a representation of the first phoneme, of the first 2, of the first 3, of the first 4, of the first 5, because we don't really know exactly which letter of the output is going to correspond to which phoneme of the input.

And particularly when we get to translation, it can get much harder, like some languages totally reverse the subject and object order or put the verb somewhere else. So that's why we try to package up the whole thing into a single piece of state which has all of the information necessary to build our target sequence.

So remember, these sequence-to-sequence models are also used for things like image captioning. So with image captioning, you wouldn't want to have something that created a representation separately for every pixel. You want a single representation, which is like this is something that somehow contains all of the information about what this is a picture of.

Or if you're doing neural language translation, here's my English sentence, I've turned it into a representation of everything that it means so that I can generate my French sentence. We're going to be seeing later how we can use return sequences equals true when we look at attention models. But for now, we're just going to keep things simple.

Well, it does, it absolutely does. And indeed we can use convolutional models. But if you remember back to Lesson 5, we talked about some of the challenges with that. So if you're trying to create something which can parse some kind of markup block like this, it has to both remember that you've just opened up a piece of markup, you're in the middle of it, and then in here you have to remember that you're actually inside a comment block so that at the end you're able to close it.

This kind of long-term dependency and memory and stateful representation becomes increasingly difficult to do with CNNs as they get longer. It's not impossible by any means, but RNNs are one good way of doing this. But it is critical that we start with an embedding, because whereas with an image we're already given float-valued numbers that really represent the image, that's not true with text.

So with text we have to use embeddings to turn it into these nice numeric representations. RNN is a kind of generic term here, so a specific network we use is LSTM, but there are other types we can use. >> GRU, remember? >> Yeah. >> Simple RNN? >> So Keras supports a couple of things.

>> Yeah, with all the ones we did in the last part of the course. So we looked in Simple RNN, GRU, and LSTM. >> So is that like the LSTM would be the best for that task? >> No, not at all. The GRUs and LSTMs are pretty similar, so it's not worth thinking about too much.

So at this point here, we now have 15 copies of x. And so we now pass that into two more layers of RNN. So this here is our encoder, and this here is our decoder. There's nothing we did particularly in Keras to say this is an encoder, this is a decoder.

The important thing is the return_sequences=False here, and the RepeatVector here. So what does the decoder have to do? Well, somehow it has to take this single summary and run it through some layers of RNNs, until at the end we say, okay, here's a dense layer, and it's time distributed.

So remember that means that we actually have 15 dense layers. And so each of these dense layers now has a softmax activation, which means that we basically can then do an argmax on that to create our final list of letters. So this is kind of our reverse embedding, if you like.

So the model is very little code. And again, if things like this are mysterious to you, go back and re-watch lessons 4, 5 and 6 to remind yourself how these embeddings work and how this kind of TimeDistributed Dense works to give us effectively a kind of reverse embedding.

So that's our model. It starts with our phoneme input and ends with our time distributed dense output. We can then compile that. Our targets are just indexes, remember, we turned them into indexes. So we use this handy sparse categorical cross entropy, which is just the same as our normal categorical cross entropy, except that rather than one hot encoding the targets, we skip that entirely and just leave them as indexes.
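Putting the pieces together, here is a minimal sketch of the kind of model being described, written against the Keras API. The vocabulary sizes, the 120/240 dimensions, and the variable names are illustrative assumptions rather than the exact values from the lesson notebook, and depending on your Keras version the sparse integer targets may need a trailing dimension of 1.

from keras.models import Model
from keras.layers import (Input, Embedding, Bidirectional, LSTM,
                          RepeatVector, TimeDistributed, Dense)

n_phonemes, n_letters = 70, 30   # input and output vocabulary sizes (assumed)
maxlen_in, maxlen_out = 16, 15   # phoneme steps in, letter steps out

inp = Input(shape=(maxlen_in,))                        # phoneme indexes
x = Embedding(n_phonemes, 120)(inp)                    # phoneme embedding
x = Bidirectional(LSTM(240))(x)                        # encoder: one summary vector per word
x = RepeatVector(maxlen_out)(x)                        # 15 copies of that summary
x = LSTM(240, return_sequences=True)(x)                # decoder layer 1
x = LSTM(240, return_sequences=True)(x)                # decoder layer 2
out = TimeDistributed(Dense(n_letters, activation='softmax'))(x)  # one softmax per output step

model = Model(inp, out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit(phoneme_idxs_train, letter_idxs_train[..., None], ...)  # targets stay as plain indexes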

We can go ahead and fit, passing in our training data. So that was our rectangular data of the phoneme indexes, our labels, and then we can use some validation data that we set aside as well. So we fit that for a while. I found that the first three epochs, the loss went down like this.

The second three epochs it went down like this. It seemed to be flattening out, so that's as far as I stopped it. So we can now see how well that worked. Now what I wanted to do was not just say what percentage of letters are correct, because that doesn't really give you the right sense at all.

What I really want to know is what percentage of words are correct. So that's what this little eval_keras function does. It takes the thing that I'm trying to evaluate, calls .predict on it, and then does the argmax as per usual to take that softmax output and turn it into a specific number: which character is this.

And then I want to check whether it's true for all of the characters that the real character equals the predicted character. So this is going to return true only if every single item in the word is correct. And so then taking the mean of that is going to tell us what percentage of the words did it get totally right.
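Here is a rough sketch of what that evaluation looks like in code; the function and argument names here are my own, not necessarily those of the eval_keras function in the notebook.

import numpy as np

def word_accuracy(model, phoneme_idxs, letter_idxs):
    preds = model.predict(phoneme_idxs)       # (n_words, 15, n_letters) softmax outputs
    pred_idxs = np.argmax(preds, axis=-1)     # (n_words, 15) predicted letter indexes
    all_correct = np.all(pred_idxs == letter_idxs, axis=-1)  # True only if every letter matches
    return all_correct.mean()                 # fraction of words spelled perfectly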

And unfortunately the answer is not very many, 26%. So let's look at some examples. So we can go through 20 words at random, and we can print out all of the phonemes with dashes between. So here's an example of some phonemes. We can print out the actual word, and we can print out our prediction.

So here is a whole bunch of words that I don't really recognize, like perturbations. It should be spelled like that, and ours is slightly wrong. So you can see that some of the time the mistakes it makes are pretty understandable. So "larrow" could be spelled like that, but this seems perfectly reasonable. Sometimes, on the other hand, it's way off. And interestingly, what you find is that most of the time when it's way off, it tends to be with the longer words.

And the reason for that is the length of the word: this one where it's terrible has by far the most phonemes, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 phonemes. So we had to somehow create a single representation that contained all of the information of all of those 11 phonemes in a single vector.

And that's hard to do, right? And then that single vector got passed to the decoder, and that was everything it had to try to create this output. So that's the problem with this basic encoder-decoder method. And indeed, here is a graph from the paper which originally introduced attentional models.

And this is for neural translation. What it showed is that as the sentence length got bigger, with the standard encoder-decoder approach we've been talking about, the accuracy absolutely died. So what these researchers did was build a new kind of RNN model, called an attentional model.

And with the attentional model, the accuracy actually stayed pretty good. So goal number 1 for the next couple of lessons is for me not to have a cold anymore. So basically we're going to finish our deep dive into neural translation, and then we're going to look at time series.

Although we're not specifically looking at time series, it turns out that the best way that I found for time series is not specific to time series at all. Reinforcement learning was something I was planning to cover, but I just haven't found almost any good examples of it actually being used in practice to solve important real problems.

And indeed, when you look at the -- have you guys seen the paper in the last week or two about using evolutionary strategies for reinforcement learning? Basically it turns out that using random search is better than reinforcement learning. This paper, by the way, is ridiculously overhyped. Evolutionary strategies are something I was working on over 20 years ago, and in those days these genetic algorithms, as we called them, used much more sophisticated methods than DeepMind's brand new evolutionary strategies.

So people are like rediscovering these randomized metaheuristics, which is great, but they're still far behind where they were 20 years ago, but far ahead of reinforcement learning approaches. So given I try to teach things which I think are actually going to stand the test of time, I'm not at all convinced that any current technique for reinforcement learning is going to stand the test of time, so I don't think we're going to touch that.

Part 3, yeah, I think before that we might have a part 0 where we do practical machine learning for coders, talk about decision tree ensembles and training test splits and stuff like that. And then we'll see where we are. I'm sure Rachel and I are not going to stop doing this in a hurry, it's really fun and interesting and we're really interested in your ideas about how to keep this going.

By the end of part 2, you guys have put in hundreds of hours, on average maybe 140 hours, put together your own box, written blog posts, done hackathons; you're seriously in this now. And in fact, I've got to say, this week's kind of been special for me. This week's been the week where again and again I've spoken to various folks of you guys and heard how many of you have implemented projects at your workplace that have worked and are now running and making your business money, or that you've achieved the career thing that you've been aiming for, or that you've won yet another GPU at a hackathon, or of course the social impact projects; all these transformative and inspirational things.

When Rachel and I started this, we had no idea if it was possible to teach deep learning to people with no specific math background beyond high school math, to the point that you could use it to build cool things. We thought we probably could because I don't have that background and I've been able to.

But I've been kind of playing around with similar things for a couple of decades. So it was a bit of an experiment, and this week's been the week that for me it's been clear that the experiments worked. So I don't know what part 3 is going to look like.

I think it will be a bit different because it'll be more of a meeting of minds amongst a group of people who are kind of at the same level and thinking about the same kinds of things. So maybe it's more of an ongoing keep-our-knowledge-up-to-date kind of thing, it might be more of us teaching each other, I'm not sure, I'm certainly interested to hear your ideas.

We don't normally have two breaks, but I think I need one today and we're covering a lot of territory. Let's have a short break and we'll go for the last 20 minutes. Let's come back at 8:40. Okay, thank you. So, attention models.

So I actually really like these, I think they're great. And really the paper that introduced these was quite an extraordinary paper that introduced both GRUs and attention models at the same time. I think it might even have been before the guy had his PhD, if I remember correctly; it was just a wonderful paper, very successful.

And the basic idea of an attention model is actually pretty simple. You'll see here, here is our encoder, and here's our embedding. And notice here, remember that return_sequences=True is the default in my get_rnn here, so the encoder is now actually spitting out a sequence of states. Now the length of that sequence is equal to the number of phonemes.

And we know that there isn't a one-to-one mapping of phonemes to letters. So this is kind of interesting to think about how we're going to deal with this. How are we going to deal with 16 states? And the states, because we started with a bidirectional layer, state 1 represents a combination of everything that's come before the first phoneme and everything that's come after.

State 2 is everything that's come before the second phoneme and everything that's come after and so forth. So the states in a sense are all representing something very similar, but they've got a different focus. Each one of these states represents everything that comes before and everything that comes after that point, but with a focus on that phoneme.

So what we want to do now is create an RNN where the number of inputs to the RNN needs to be 15, not 16, because remember the length of the word we're creating is 15. So we're going to have 15 output time steps, and at each point we wanted to have the opportunity to look at all of these 16 output states, but we're going to go in with the assumption that only some of those 16 are going to be relevant, but we don't know which.

So what we want to do is basically take each of these 16 states and do a weighted sum, sum of weights times encoded states, where these weights somehow represent how important is each one of those 16 inputs for calculating this output, and how important are each of those 16 inputs for calculating this output, and so forth.

If we could somehow come up with a set of weights for every single one of those time steps, then we can replace the length 16 thing with a single thing. If it turns out that output number 1 only really depends on input number 1 and nothing else, then basically that input, those weights are going to be 1, 0, 0, 0, 0, 0, 0, 0, right?

It can learn to do that. But if it turns out that output number 1 actually depends on phonemes 1 and 2 equally, then it can learn the weights 0.5, 0.5, 0, 0, and so on. So in other words, we want some function, w_i equals some function, that returns the right set of weights to tell us which bit of the encoded input to look at.
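As a minimal numpy illustration of that weighted sum: the sizes are hand-picked for the example, and in the real model the weights come out of a learned function rather than being set by hand.

import numpy as np

encoder_states = np.random.randn(16, 240)   # 16 encoder states, 240-dim each (assumed size)
weights = np.zeros(16)
weights[0] = weights[1] = 0.5               # "output 1 depends on phonemes 1 and 2 equally"
context = weights @ encoder_states          # weighted sum of states -> a single 240-dim vector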

So it so happens we actually know a good way of learning functions. What if we made the function a neural net? And what if we learned it using SGD? Why not? So here's the paper, "Neural Machine Translation by Jointly Learning to Align and Translate." And it's a great paper.

It's not the clearest in my opinion in terms of understandability, but let me describe some of the main pieces. So here's the starting point. Okay, let's describe how to read this equation. When you see a probability like this, you can very often think of it as a loss function.

The idea of SGD, basically most of the time when we're using it, is to come up with a model where the probabilities that the model creates are as high as possible for the true data and as low as possible for the other data. That's just another way of talking about a loss function.

So very often when you read the papers where we would write a loss function, a paper will say a probability. What this here says, earlier on they say that y is basically our outputs, very common for y to be an output. And what this is saying is that the probability of the output at time step i, so at some particular time step, depends on, so this bar here means depends on all of the previous outputs.

In other words, in our spelling thing, when we're looking at the fourth letter that we're spelling, it depends on the three letters that we've spelled so far. You can't have it depend on the later letters, that's cheating. So this is basically a description of the problem: we're building something which is time-dependent, and where the i-th thing that we're creating is only allowed to depend on the previous i-1 things, comma (that basically means "and"), and it's also allowed to depend on (anything in bold is a vector) a vector of inputs.

And so this here is our list of phonemes, and this here is our list of all of the letters we've spelled so far. So that whole thing, that whole probability, we're going to calculate using some function. And because this is a neural net paper, you can be pretty sure that it's going to turn out to be a neural net.

And what are the things that we're allowed to calculate with? Well we're allowed to calculate with the previous letter that we just translated. What's this? The RNN hidden state that we've built up so far, and what's this? A context vector. What is a context vector? The context vector is a weighted sum of annotations h.

So these are the hidden states that come out of our encoder, and these are some weights. So I'm trying to give you enough information to try and parse this paper over the week. So that's everything I've described so far. The nice thing is that hopefully you guys have now read enough papers that you can look at something like this and skip over it, and go, oh, that's just softmax.

Over time, your pattern recognition starts getting good. You start seeing something like this, and you go, oh, that's a weighted sum. And you see something like this, and you go, oh, that's softmax. People who read papers don't actually read every symbol. Their eye looks at it and goes, softmax, weighted sum, logistic function, got it.

As if it was like pieces of code, only this is really annoying code that you can't look up in a dictionary and you can't run and you can't check it and you can't debug it. But apart from that, it's just like code. So the alphas are things that came out of a softmax.

What goes into the softmax? Something called E. The other annoying thing about math notation is often you introduce something and define it later. So here we are, later we define E. What's E equal to? E is equal to some function of what? Some function of the previous hidden state and the encoder state.
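To collect the pieces we've just walked through in one place, this is roughly the chain of definitions from the paper (my summary, not a quote): s_i is the decoder's hidden state, x the input sequence of phonemes, h_j the encoder states, and a is the little scoring neural net.

p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)
c_i = \sum_j \alpha_{ij} h_j
\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik})
e_{ij} = a(s_{i-1}, h_j)

So the only genuinely new learned piece is a, the function that scores how relevant encoder state h_j is when producing output i; everything else is a softmax and a weighted sum.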

And what's that function? That function is again a neural network. Now the important piece here is jointly trained. Jointly trained means it's not like a GAN where we train a bit of discriminator and a bit of generator. It's not like one of these manual attentional models where we first of all figure out the nodules are here and then we zoom into them and find them there.

Jointly trained means we create a single model, a single computation graph if you like, where the gradients are going to flow through everything. So we have to try and come up with a way basically where we're going to build a standard regular RNN, but the RNN is going to use as the input at each time step this.

So we're going to have to come up with a way of actually making this mini neural net. This is just a single one-hidden-layer standard neural net that's going to be inside every time step in our RNN. This whole thing is summarized in another paper. This is actually a really cool paper, Grammar as a Foreign Language.

Lots of names you probably recognize here: Geoffrey Hinton, who's kind of a father of deep learning; Ilya, who's now I think chief scientist or something at OpenAI. This paper is kind of neat and fun anyway. It basically says: what if you didn't know anything about grammar and you attempted to build a neural net which assigned grammar to sentences? It turns out you actually end up with something more accurate than any rule-based grammar system that's been built.

One of the nice things they do is to summarize all the bits. And again, this is where like if you were reading a paper the first time and didn't know what an LSTM was and went oh, an LSTM is all these things, that's not going to mean anything to you.

You have to recognize how people write stuff in papers: there's no point writing out the LSTM equations again in a paper, so basically you're going to have to go and find the LSTM paper or find a tutorial, learn about LSTMs, and when you're finished, come back. And it's the same with the way they summarize attention.

So they say we've adapted the attention model from [2], and if you go and have a look at [2], that's the paper we just looked at. But the nice thing is that because this came a little later, they've done a pretty good job of trying to summarize it into a single page.

So during the week if you want to try and get the hang of attention, you might find it good to have a look at this paper and look at their summary. And you'll see that basically the idea is that it's a standard sequence to sequence model, so a standard sequence to sequence model means encoder, hidden states, the final hidden state decoder, plus adding attention.

And so we have two separate LSTMs, an encoder and a decoder, and now be careful of the notation: the encoder states are going to be called h, h_1 through h_TA, and the decoder states d, which we're also going to call h_(TA+1) through h_(TA+TB). So the inputs are 1 through TA, and here you can see it's defining a single-layer neural net.

So we've got our encoder states and our current decoder state; we put them through a nonlinearity, put that through another affine transformation, stick it through a softmax, and use that to create a weighted sum. So there it all is in one little snapshot. So don't expect this to make perfect sense the first time you see it necessarily, but hopefully you can kind of see that these bits are all stuff you've seen lots of times before.
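As a rough numpy sketch of that one-page summary, for a single decoder step: every name and size here is an illustrative assumption, and the matrices would of course be learned rather than random.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

TA, dim = 16, 240                      # number of encoder steps and state size (assumed)
h = np.random.randn(TA, dim)           # encoder states h_1 .. h_TA
d_t = np.random.randn(dim)             # current decoder state
W1 = np.random.randn(dim, dim) * 0.01  # the learned attention parameters
W2 = np.random.randn(dim, dim) * 0.01
v = np.random.randn(dim) * 0.01

u = np.tanh(h @ W1 + d_t @ W2) @ v     # one score per encoder state, shape (TA,)
a = softmax(u)                         # attention weights, summing to 1
context = a @ h                        # weighted sum of encoder states, shape (dim,)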

So next week we're going to come back and work through creating this code and seeing how it works. Did you have something Rachel? We have two questions, one is, won't the weightings be heavily impacted by the padding done to the input set? Sure, absolutely. And specifically those weights will say the padding is always weighted zero.

It's not going to take very long to learn to create that pattern. And is a shared among all ij pairs, or do we train a separate alignment for each pair? No, a is not trained; a is the output of a softmax. What's trained is W1 and W2. And note that capital letters are matrices, right?

So we just have to learn a single W1 and a single W2. But note that they're being applied to all of the encoded states and the current state of the decoder. And in fact, it's easier to just abstract this all the way back and say it is some function.

This is the best way to think of it: it's some function of what? Some function of the current hidden state of the decoder and all of the encoder states. So those are the inputs to the function, and we just have to learn a set of weights that will spit out the inputs to our softmax.

Did you say you had another question? Okay, great. So I don't feel like I want to introduce something new, so let's take one final AMA before we go home. There's not really that much clever you can do about an unbalanced dataset. Basically, a great example would be one of the impact talks, which talked about breast cancer detection from mammography scans: in this thing called the Dream Challenge, less than 0.3% of the scans actually had cancer.

So that's very unbalanced. I think the first thing to try with such an unbalanced dataset is to ignore it, try it, and see how it goes. The reason that often doesn't go well is that the initial gradients will tend to point towards saying they never have cancer, because that's going to give you a very accurate model.

So one thing you can try and do is to come up with some kind of initial model which is like maybe some kind of heuristic which is not terrible and gets it to the point where the gradients don't always point to saying they never have cancer. But the really obvious thing to do is to adjust your thing which is creating the mini batches so that on every mini batch you grab like half of it as being people with cancer and half of it being without cancer.

So that way you can still go through lots and lots of epochs, the challenge is that the people that do have cancer, you're going to see lots and lots and lots of times so you have to be very careful of overfitting. And then basically there's kind of things between those two extremes.

So I think what you really need to do is figure out what's the smallest number of people with cancer that you can get away with. What's the smallest number where the gradients don't just point to predicting zero? And then create a model where in every mini-batch you make 10% of it people with cancer and 90% people without.

Train that for a while. The good news is that once it's working pretty well, you can then decrease the proportion that has cancer, because you're already at a point where your model isn't collapsing towards predicting zero. So you can gradually change the sampling to include less and less.
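Here is a rough sketch of that kind of resampling, as a Python batch generator: each mini-batch draws a fixed fraction of positive cases, and you can lower pos_frac as training stabilises. The names and the default fraction are just illustrative assumptions.

import numpy as np

def balanced_batches(x, y, batch_size=64, pos_frac=0.1):
    pos_idx = np.where(y == 1)[0]               # the rare class, e.g. scans with cancer
    neg_idx = np.where(y == 0)[0]
    n_pos = int(batch_size * pos_frac)
    while True:
        batch = np.concatenate([
            np.random.choice(pos_idx, n_pos),               # oversample the positives
            np.random.choice(neg_idx, batch_size - n_pos),  # fill the rest with negatives
        ])
        np.random.shuffle(batch)
        yield x[batch], y[batch]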

I think that's the basic technique. So in this example where you're repeating the positive results over and over again, you're essentially just weighting them more. Could you get the same results by just throwing away a bunch of the negative examples? You could do that, and that's the really quick way to do it, but then you're not using all the information; the negative examples you threw away still contain information.

So yeah. OK. Thanks everybody. Have a good week. (audience applauds)