
Lesson 11: Cutting Edge Deep Learning for Coders


Chapters

0:00 Reminders
7:35 Linear Algebra Cheat Sheet
8:35 Zero-Shot Learning
9:55 Computer Vision
11:40 Activation Functions
13:32 Colour transformations
14:18 Batch norm
14:58 Is there any advantage
17:35 Removing Data
19:56 Noisy Labels
21:22 Accuracy vs Size
23:23 Design Patterns
24:20 Cyclical Learning Rates
27:20 Data Science Bowl

Transcript

So this week there's obviously quite a bit of setup needed to get results, in terms of needing all of ImageNet and that kind of thing and getting it all working. So I know that a lot of you are still working through that. I did want to mention a couple of reminders based on things I've noticed.

One is that in general, we have that thing on the wiki about how to use the notebooks, and we really strongly advise that you don't open up the notebook we give you and click shift enter through it again and again. You're not really going to learn much from that.

But go back to that wiki page. It's like the first thing mentioned in the first paragraph of the home page of the wiki: how to use the notebooks. Basically the idea is to start with a fresh notebook, think about what you need to do first, and try to do that thing. If you have no idea, then you can go to the existing notebook, take a peek, close it again, and try to re-implement what you just saw.

As much as possible, really don't just shift-enter through the notebooks. I know some of you are doing it because there are threads on the forum saying, "I was shift-entering through the notebook and this thing didn't work." And somebody is like, "Well, that's because that thing's not defined yet." So consider yourself busted.

The other thing to remind you about is that the goal of part 2 is to get you to a point where you can read papers, and the reason for that is because you kind of know the best practices now, so anytime you want to do something beyond what we've learned, you're going to be implementing things from papers or probably going beyond that and implementing new things.

Reading a new paper in an area that you haven't looked at before is, at least to me, somewhat terrifying. On the other hand, reading a paper for the thing that we already studied last week hopefully isn't terrifying at all because you already know what the paper says. So I always have that in the assignments each week.

Read the paper for the thing you just learned about and go back over it, and please ask on the forums if there's a bit of notation or anything that you don't understand, or if there's something we covered in class that you can't see in the paper, or, which is particularly interesting, if you see something in the paper that you don't think we mentioned in class.

So that's the reason that I really encourage you to read the papers for the topics we studied in class. I think for those of you like me who don't have a technical academic background, it's really a great way to familiarize yourself with notation. And I'm really looking forward to some of you asking about notation on the forums, so I can explain some of it to you.

There's a few key things that keep coming up in notation, like probability distributions and stuff like that. So please feel free, and if you're watching this later in the MOOC, again, feel free to ask on the forum anything that's not clear. I was kind of interested in following up on some of last week's experiments myself.

And the thing that I think we were all a bit shocked about was putting this guy into the DeVISE model and getting out more pictures of similar-looking fish in nets. And I was kind of curious about how that was working and how well that was working, and I then completely broke things by training it for a few more epochs.

And after doing that, I then did an image similarity search again and I got these three guys who were no longer in nets. So I'm not quite sure what's going on here. And the other thing I mentioned is when I trained it where my starting point was what we looked at in class, which was just before the final bottleneck layer.

I didn't get very good results from this thing, but when I trained it from the starting point of just after the bottleneck layer, I got the good results that you saw. And again, I don't know why that is, and I don't think this has been studied as far as I'm aware.

So there's lots of open questions here. But I'll show you something I did then. I thought, well, that's interesting. I think what's happened here is that when you train it for longer, it knows that the important thing is the fish and not the net. And it seems to be now focusing on giving us the same kind of fish.

These are clearly the exact same type of fish, I guess. So I started wondering how we could force it to combine concepts. So I tried the most obvious possible thing: I wanted to get more fish in nets. So I took the word2vec vector for tench, that's a kind of fish, plus the word2vec vector for net, divided by 2 to get the average of the two word vectors, and asked for the nearest neighbor.

And that's what I got. And then just to prove it wasn't a fluke, I tried the same on tench plus rod, and there's my nearest neighbor. Now do you know what's really freaky about this? If you Google for ImageNet categories, you'll get a list of 1000 ImageNet categories. If you search through them, neither net nor rod appears at all.

I can't begin to imagine why this works, but it does. So this DeVISE model is clearly doing some pretty deep magic in terms of its understanding of these objects and their relationships. Not only are we able to combine things like this, but we're able to combine them with categories that it's literally never seen before.

It's never seen a rod, we've never told it what a rod looks like, and ditto for a net. And I tried quite a few of these combinations and they just kept working. Another one I tried, and this one I do understand why it works, was searching for boat.

Now boat doesn't appear in ImageNet, but there's lots of kinds of boats that appear in ImageNet. So not surprisingly, it figures out, generally speaking, how to find boats. I expected that. And then I tried boat plus engine, and I got back pictures of powerboats, and then I tried boat plus paddle, and I got back pictures of rowing boats.
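If you want to play with this yourself, the recipe is roughly: average the word vectors for the two concepts, then do a nearest-neighbor search over the image embeddings. Here's a minimal sketch, assuming you already have a dict of word vectors and a matrix of DeVISE-style image embeddings; the names below are hypothetical, not the lesson notebook's variables.

    import numpy as np

    # Assumed inputs (hypothetical names): `wordvecs` maps words to vectors,
    # `img_feats` is an (n_images, dims) array of image embeddings in the same
    # space, `img_files` the corresponding filenames.
    def nearest_images(query_vec, img_feats, img_files, n=3):
        q = query_vec / np.linalg.norm(query_vec)
        feats = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
        sims = feats @ q                       # cosine similarity to every image
        return [img_files[i] for i in np.argsort(-sims)[:n]]

    # usage (given the assumed inputs above):
    #   query = (wordvecs['tench'] + wordvecs['net']) / 2
    #   nearest_images(query, img_feats, img_files)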

So there's a whole lot going on here, and I think there are lots of opportunities for you to explore and experiment based on the explorations and experiments that I've done. And more to the point, perhaps to create some interesting and valuable tools. I would have thought a tool to do an image search where you can say, show me all the images that contain these kinds of objects, would be one example.

Or better still, maybe you could start training with things that aren't just nouns but also adjectives. So you could start to search for pictures of crying babies or flaming houses or whatever. I think there's all kinds of stuff you could do with this, which would be really interesting whether it be in a narrow organizational setting or create some new startup or a new open source project or whatever.

So anyway, lots of things to try. More stuff from this week: I actually missed this when it came out, but I was thrilled to see that one of our students has written this fantastic Medium post, a linear algebra cheat sheet. I think I missed it because it was posted not to the part 2 forum, but maybe to the main forum.

But this is really cool. Brendan has gone through and really explained all the stuff that I would have wanted to know about linear algebra before I got started, and particularly I appreciate that he's taken a code-first approach: how do you actually do this in NumPy, talking about broadcasting and so on.

So you guys will all be very familiar with this already, but for your friends who are wondering how to get started in deep learning and what the minimal things they need to know are, it's probably the chain rule and some linear algebra. I think this covers a lot of the linear algebra you need pretty effectively.

So thank you Brendan. Other things from last week: Andrea Frome, who wrote that DeVISE paper, I actually emailed her and asked her what else she thought I should look at. And she suggested this paper, "Zero-Shot Learning by Convex Combination of Semantic Embeddings," which she's only a later author on, but she says it's in some ways a more powerful version of DeVISE.

It's actually quite different, and I haven't implemented it myself, but it solves some similar problems, and anybody who's interested in exploring this multimodal images and text space might be interested in this. And we'll put this on the lesson wiki of course. And then one more involving the same author in a similar area a little bit later was looking at attention for fine-grained categorization.

So a lot of these things, at least the way I think Andrea Frome was casting it, were about fine-grained categorization, which is how do we build something that can find very specific kinds of birds or very specific kinds of dogs. I think these kinds of models have very, very wide applicability.

So I mentioned we'd kind of wrap up some final topics around computer vision stuff this week before we started looking at some more NLP-related stuff. One of the things I wanted to zip through was a paper which I think some of you might enjoy, "Systematic Evaluation of CNN Advances on the ImageNet Data Set." And I've pulled out what I thought were some of the key insights for some of these things we haven't really looked at before.

One key insight, which is very much the kind of thing I appreciate, is that they compared the difference between the original CaffeNet/AlexNet vs. GoogLeNet vs. VGGNet at two different image sizes, training on the original 227 or on 128. And what this chart shows is that the relative difference between these different architectures is almost exactly the same regardless of what size image you're looking at.

And this really reminds me of Part 1, when we looked at data augmentation and said, hey, you can figure out which types of data augmentation to use, and how much, on a small sample of the data rather than on the whole data set. What this paper is saying is something similar, which is that you can compare different architectures on small images rather than full-sized images.

And so they then used this insight to do all of their experiments using a smaller 128x128 ImageNet model, which they said was 10 times faster. So I thought that was the kind of thing which not enough academic papers do, which is like what are the hacky shortcuts we can get away with?

So they tried lots of different activation functions. It does look like maxout is way better, so this is the gain compared to ReLU, but maxout actually has twice the complexity, so it's not quite a fair comparison. What it really highlights is something we haven't looked at before, which is ELU, which as you can see is very simple: if x is greater than or equal to 0, it's y = x, otherwise it's y = e^x - 1.
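Just to make that concrete, here's the ELU formula in NumPy next to ReLU, a sketch with the scale on the negative part fixed to 1 (the paper has an alpha parameter there):

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def elu(x, alpha=1.0):
        # x for x >= 0, alpha*(exp(x) - 1) for x < 0: same as ReLU on the right,
        # but curving smoothly down to -alpha instead of a hard zero on the left
        return np.where(x >= 0, x, alpha * (np.exp(x) - 1))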

So ELU is basically just like ReLU, except it's smooth. Where ReLU has a sharp corner at zero, ELU looks exactly the same on the positive side and then curves away smoothly on the negative side. So it's kind of a nice smooth version, and that's one thing you might want to try using. Another thing they tried which was interesting was using ELU for the convolutional layers.

Maxout for the fully connected layers. I guess nowadays we don't use fully connected layers very much, so maybe that's not as interesting. The main interesting thing here I think is the ELU activation function; two percentage points is quite a big difference. They also looked at different learning rate annealing approaches.

You can use Keras to automatically do learning rate annealing, and what they showed is that linear annealing seems to work the best. They tried something else, which was like what about different color transformations. They found that amongst the normal approaches to thinking about color, RGB actually seems to work the best.

But then they tried something I haven't seen before, which is they added two 1x1 convolutions at the very start of the network. So each of those 1x1 convolutions is basically doing some kind of linear combination of the channels, with a nonlinearity in between. And they found that that actually gave them quite a big improvement, and it should be pretty much zero cost.
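Here's a rough sketch of what that might look like at the front of a Keras model; the number of intermediate channels is my guess rather than the paper's exact setting:

    from keras.layers import Input, Conv2D

    inp = Input(shape=(224, 224, 3))
    x = Conv2D(10, (1, 1), activation='relu')(inp)  # learned recombination of the RGB channels
    x = Conv2D(3, (1, 1))(x)                        # back down to 3 "learned colour" channels
    # ...then continue with whatever architecture you were going to use, built on top of x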

So that's another thing which I haven't seen written about elsewhere, but it's a good trick. They also looked at the impact of batch norm. So here is the impact of batch norm, positive or negative. Actually, adding batch norm to GoogLeNet didn't help; it made it worse. So it seems that with these really complex, carefully tuned architectures you've got to be pretty careful, whereas on a simpler network it helps a lot.

And the amount it helps also depends somewhat on which activation function you use. So batch norm, I think we kind of know that now: be careful when you use it. Sometimes it's fantastically helpful, sometimes it's slightly unhelpful. Question: is there any advantage in using fully connected layers at all?

Yeah, I think there is, although they're terribly out of fashion. For transfer learning, they still seem to be the best: the fully connected layers are super fast to train, and you seem to get a lot of flexibility there. So I don't think we know one way or another yet, but I do think that VGG still has a lot to give us; it's the last of the carefully tuned architectures with fully connected layers, and that really seems to be great for transfer learning.

And then there was a comment saying that ELU's advantage is not just that it's smooth, but that it goes a little below zero, which helps keep units from ending up unused. Yeah, that's a great point, thank you for adding that. Anytime you hear me say something slightly stupid, please feel free to jump in, otherwise it's on the video forever.

So on the other hand, it does give you an improvement in accuracy if you remove the final max pooling layer, replace all the fully connected layers with convolutional layers, and stick an average pooling at the end, which is basically what this is doing. So it does seem there's definitely an upside to fully convolutional networks in terms of accuracy, but there may be a downside in terms of flexibility around transfer learning.

I thought this was an interesting picture I haven't quite seen before, let me explain the picture. What this shows is these are different batch sizes along the bottom, and then we've got accuracy. And what it's showing is with a learning rate of 0.01, this is what happens to accuracy.

So as you go above a batch size of 256, accuracy plummets. On the other hand, if you use a learning rate of 0.01 times batch size over 256, it's pretty flat. So what this suggests to me is that any time you change the batch size, you should change the learning rate by a proportional amount, which I think a lot of us have realized through experiments, but I don't think I've seen it explicitly mentioned before.
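In code, that rule of thumb is just a one-liner, with 0.01 and 256 being the reference values from the chart:

    def scaled_lr(batch_size, base_lr=0.01, base_batch_size=256):
        # scale the learning rate linearly with the batch size
        return base_lr * batch_size / base_batch_size

    # scaled_lr(512) -> 0.02, scaled_lr(64) -> 0.0025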

Something else I think is very helpful to understand is that removing data has a nonlinear effect on accuracy. This green line here is what happens when you remove images. So with ImageNet, down to about half the size of ImageNet, there isn't a huge impact on accuracy.

So maybe if you want to really speed things up, you could go 128x128 sized images and use just 600,000 of them, or even maybe 400,000, but then beneath that it starts to plummet. So I think that's an interesting insight. Another interesting insight, although I'm going to add something to this in a moment, is that rather than removing images, if you instead flip the labels to make them incorrect, that has a worse effect than not having the data at all.

But there are things we can do to try to improve things there, and specifically I want to bring your attention to this paper, "Training Deep Neural Networks on Noisy Labels with Bootstrapping". And what they show is a very simple approach, a very simple tweak you can add to any training method which dramatically improves their ability to handle noisy labels.

This here is showing what happens if you add noise to MNIST, from 0.3 up to 0.5, so up to half the labels. The baseline of doing nothing at all really collapses in accuracy. But if you use their bootstrapping approach, you can go up to nearly half the images having their labels intentionally changed, and it still works nearly as well.

I think this is a really important paper to mention, and an area most of you will find important and useful, because most real-world datasets have noise in them. So maybe this is something you should consider adding to everything you train, whether it be Kaggle datasets, or your own datasets, or whatever, particularly because you don't necessarily know how noisy the labels are.
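To give a rough idea of what the technique looks like, here's a sketch of the "soft bootstrapping" target as I understand the paper: you blend the given, possibly wrong, label with the model's own current prediction, so the model can partly overrule labels it strongly disagrees with. Treat the details and the beta value here as my assumptions rather than a faithful reimplementation.

    import numpy as np

    def soft_bootstrap_target(y_noisy, y_pred, beta=0.95):
        # y_noisy: one-hot (possibly incorrect) labels; y_pred: the model's softmax output.
        # beta=1 recovers ordinary training on the noisy labels.
        return beta * y_noisy + (1 - beta) * y_pred

    def bootstrap_loss(y_noisy, y_pred, beta=0.95, eps=1e-7):
        # cross-entropy against the blended target instead of the raw label
        target = soft_bootstrap_target(y_noisy, y_pred, beta)
        return -np.sum(target * np.log(y_pred + eps), axis=-1).mean()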

"Noisy labels means incorrect" Yeah, noisy just means incorrect. "But bootstrapping is some sort of technique that" Yeah, this is this particular paper that's grabbed a particular technique which you can read during the week if you're interested. So interestingly, they find that if you take VGG and then add all of these things together and do them all at once, you can actually get a pretty big performance hike.

It looks in fact like VGG becomes more accurate than GoogLeNet if you make all these changes. So that's an interesting point, although VGG is very, very slow and big. There's lots of stuff that I noticed they didn't look at. They didn't look at data augmentation, different approaches to zooming and cropping, adding skip connections like in ResNet or DenseNet or Highway Networks, different initialization methods, or different amounts of depth.

And to me, the most important is the impact on transfer learning. So these to me are all open questions as far as I know, and so maybe one of you would like to create the successor to this, more observations on training CNNs. There's another interesting paper, although the main interesting thing about this paper is this particular picture, so feel free to check it out, it's pretty short and simple.

This paper is looking at the accuracy versus the size and the speed of different networks. So the size of a bubble is how big is the network, how many parameters does it have. So you can see VGG 16 and VGG 19 are by far the biggest of any of these networks.

Interestingly, the second biggest is the very old, basic AlexNet. Newer networks tend to have a lot fewer parameters, which is a good sign. Then on this axis we have basically how long it takes to train. So again, VGG is big and slow, and without at least some tweaks, not terribly accurate.

So again, there's definitely reasons not to use VGG even if it seems easier for transfer learning or we don't necessarily know how to do a great job of transfer learning on ResNet or Inception. But as you can see, the more recent ResNet and Inception-based approaches are significantly more accurate and faster and smaller.

So this is why I was looking last week at trying to do transfer learning on top of ResNet and there's really good reasons to want to do that. I think this is a great picture. These two papers really show us that academic papers are not always just some highly theoretical wacky result.

From time to time people write these great analysis of best practices and everything that's going on. There's some really great stuff out there. One other paper to mention in this kind of broad ideas about things that you might find helpful is a paper by somebody named Leslie Smith who I think is going to be just about the most overlooked researcher.

Leslie Smith does a lot of really great papers which I really like. This particular paper came up with a list of 14 design patterns which seem to be generally associated with better CNNs. This is a great paper to read, it's a really easy read. You guys won't have any trouble with it at all, I don't think.

It's very short. But I looked through all these and I just thought these all make a lot of sense. If you're doing something a bit different and a bit new and you have to design a new architecture, this would be a great list of patterns to look through. One more Leslie Smith paper to mention, and it's crazy that this is not more well known, something incredibly simple, which is a different approach to learning rates.

Rather than just having your learning rate gradually decrease, I'm sure a lot of you have noticed that sometimes if you suddenly increase the learning rate for a bit and then suddenly decrease it again for a bit, it kind of goes into a better little area. What this paper suggests doing is try actually continually increasing your learning rate and then decreasing it, increasing it, decreasing it, increasing it, decreasing it, something that they call cyclical learning rates.

And check out the impact, compared to non-cyclical approaches, it is way, way faster and at every point it's much better. And this is something which you could easily add. I haven't seen this added to any library. If you created the cyclical learning rate annealing class for Keras, many people would thank you.

Actually, many people would have no idea what you're talking about, so you'd also have to write a blog post to explain why it's good and show them this picture, and then they would thank you. I just wanted to quickly add that Keras has lots of callbacks that you can actually play with.

Yeah, exactly, it's a training loop with a bunch of callbacks. And if I was doing this in Keras, what I would do is start with the existing learning rate annealing code that's there and make small changes until it starts working. There's already code that does pretty much everything you want.
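If you do want to have a go, here's a minimal sketch of what such a callback could look like in Keras, using a simple triangular wave between two bounds; the bounds and step size are placeholders, and this is not a polished library implementation.

    from keras.callbacks import Callback
    from keras import backend as K

    class CyclicalLR(Callback):
        def __init__(self, min_lr=1e-4, max_lr=1e-2, step_size=2000):
            super(CyclicalLR, self).__init__()
            self.min_lr, self.max_lr, self.step_size = min_lr, max_lr, step_size
            self.iteration = 0

        def on_batch_begin(self, batch, logs=None):
            # triangular wave: climb from min_lr to max_lr over step_size iterations, then back down
            pos = (self.iteration % (2 * self.step_size)) / float(self.step_size)
            scale = pos if pos <= 1 else 2 - pos
            K.set_value(self.model.optimizer.lr, self.min_lr + (self.max_lr - self.min_lr) * scale)
            self.iteration += 1

    # usage: model.fit(X, y, callbacks=[CyclicalLR(min_lr=1e-4, max_lr=1e-2)])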

The other cool thing about this paper is that they suggest a fairly automated approach to picking what the minimum and maximum bounds should be. And again, this idea of roughly what should our learning rate be is something which we tend to use a lot of trial and error for.

So check out this paper for a suggestion about how to do it somewhat automatically. So there's a whole bunch of things that I've zipped over. Normally I would have dug into each of those and explained it and shown examples in notebooks and stuff. So you guys hopefully now have enough knowledge to take this information and play with it.

And what I'm hoping is that different people will play with different parts and come back and tell us what you find, and hopefully we'll get some good new contributions to Keras or PyTorch, or some blog posts, or some papers, or so forth, or maybe something with that DeVISE stuff, or even some new applications.

So the next thing I wanted to look at, again somewhat briefly, is the Data Science Bowl. And there are a couple of reasons I particularly wanted to dig into it. One of them, well, there are a million reasons: it's a million dollar prize, and there are 23 days to go.

The second is, it's an extension for everything that you guys have learned so far about computer vision. It uses all the techniques you've learned, but then some. So rather than 2D images, they're going to be 3D volumes. Rather than being 300x300 or 500x500, they're going to be 512x512x200, so a couple of hundred times bigger than stuff you've dealt with before.

The stuff we learned in lesson 7 about where are the fish, you're going to be needing to use a lot of that. I think it's a really interesting problem to solve. And then I personally care a lot about this because my previous startup, Enlitic, was the first organization to use deep learning to tackle this exact problem, which is trying to find lung cancer in CT scans.

The reason I made that Enlitic's first problem was mainly because I learned that if you can find lung cancer earlier, the probability of survival is 10 times higher. So here is something where you can have a real impact by doing this well, which is not to say that a million dollars isn't a big impact as well.

So let me tell you a little bit about this problem. Here is a lung. It's in DICOM format, which contains two main things. One is a stack of images and another is some metadata. The metadata will be things like what radiation dose was used, how far from the chest the machine was, what brand of machine it was, and so on and so forth.

Most DICOM viewers just let you use your scroll wheel to zip through them, so all this is doing is going from top to bottom or from bottom to top, so you can kind of see what's going on. What I might do, which I think is more interesting, is to focus on the bit that's going to matter to you: the inside of the lung is this dark area here, and these little white dots are what's called the vasculature, the little vessels and stuff going through the lungs.

And as I scroll through, have a look at this little dot. You'll see that it seems to move, see how it's moving. The reason it's moving is because it's not a dot, it's actually a vessel going through space so it actually looks like this. And so if you take a slice through that, it looks like lots of dots.

And so as you go through those slices, it looks like that. And then eventually we get to the top of the lung, and that's why you see eventually the whole thing goes to white; that's basically the edge of the organ. So you can see there are edges on each side, and then there's also bone.

So some of you have been looking at this already over the last few weeks and have often asked me about how to deal with multiple images, and what I've said each time is don't think of it as multiple images. Think of it in the way your DICOM viewer can if you have a 3D button like this one does.

That's actually what we were just looking at. So it's not a bunch of flat images, it's a 3D volume. It just so happens that the default way that most DICOM viewers show things is by a bunch of flat images. But it's really important that you think of it as a 3D volume, because you're looking in this space.

Now what are you looking for in this space? What you're looking for is you're looking for somebody who has lung cancer. And what somebody who has lung cancer looks like is that somewhere in this space there is a blob, it could be roughly a spherical blob, it could be pretty small, around 5 millimeters is where people start to get particularly concerned about a blob.

And so what that means is that for a radiologist, as they flick through a scan like this, is that they're looking for a dot which doesn't move, but which appears, gets bigger and then disappears. That's what a blob looks like. So you can see why radiologists very, very, very often miss nodules in lungs.

Because in all this area, you've got to have extraordinary vision to be able to see every little blob appear and then disappear again. And remember, the sooner you catch it, you get a 10x improved chance of survival. And generally speaking, when a radiologist looks at one of these scans, they're not looking for nodules, they're looking for something else.

Because lung cancer, at least in the earlier stages, is asymptomatic, it doesn't cause you to feel different. So it's like something that every radiologist has to be thinking about when they're looking for pneumonia or whatever else. So that's the basic idea is that we're going to try and come up with in the next half hour or so some idea about how would you find these blobs, how would you find these nodules.

So each of these things generally is about 512x512 by a couple of hundred. And the equivalent of a pixel in 3D space is called a voxel. So a voxel simply means a pixel in 3D space. So this here is rendering a bunch of voxels. Each voxel in a CT scan is a 12-bit integer, if memory serves me correctly.

And a computer screen can only show 8 bits of grayscale, and furthermore your eyes can't necessarily distinguish between all those grayscale perfectly anyway. So what every DICOM viewer provides is something called a windowing adjustment. So a windowing adjustment, here is the default window, which is designed to basically map some subset of that 12-bit space to the screen so that it highlights certain things.

And so the units CT scans use are called Hounsfield units, and certain ranges of Hounsfield units tell you that something is some particular part of the body. And so you can see here that the bone is being lit up. So we've selected an image window which is designed to allow us to see the bone clearly.

So what I did when I opened this was I switched it to the CT chest window, where some kind person has already figured out what's the best window to see the nodules and vasculature in lungs. Now for you working with deep learning, you don't have to care about that, because of course the deep learning algorithm can see all 12 bits perfectly well.

So nothing really to worry about. One of the challenges with dealing with this Data Science Bowl data is that there's a lot of preprocessing to do, but the good news is that there are a couple of fantastic tutorials. So hopefully you've found out by now that on Kaggle, if you click on the Kernels button, you basically get to see people's IPython notebooks where they show you how to do certain things.

In this case, this guy has got a full preprocessing tutorial showing how to load DICOM, convert the values to Hounsfield units, and so forth. I'll show you some of these pieces. So DICOM you will load with some library, probably pydicom. Pydicom is a library that's a bit like Pillow, or PIL; instead of an Image.open it's more like a DICOM open, and you end up with the 3D data and of course the metadata.

You can see here it's using the metadata: image position, slice location. So the metadata comes through as just attributes of the Python object. This person has very kindly provided a list of the Hounsfield units for each of the different substances. So he shows how to translate stuff into that range.
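The core of that preprocessing looks something like this sketch: load the slices with pydicom, sort them into a volume, and rescale the raw values to Hounsfield units using the slope and intercept from the metadata. Paths and edge cases are simplified here; the kernel itself is more careful.

    import os
    import numpy as np
    import pydicom   # the kernel uses the older `import dicom` / dicom.read_file API

    def load_scan(folder):
        slices = [pydicom.dcmread(os.path.join(folder, f)) for f in os.listdir(folder)]
        slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))   # order slices along the body
        return slices

    def to_hounsfield(slices):
        volume = np.stack([s.pixel_array for s in slices]).astype(np.int16)
        # raw scanner values -> Hounsfield units via the linear rescale in the metadata
        slope = float(slices[0].RescaleSlope)
        intercept = float(slices[0].RescaleIntercept)
        return (volume * slope + intercept).astype(np.int16)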

And so it's great to draw lots of pictures. So here is a histogram for this particular picture. So you can see that most of it is air, and then you get some bone and some lung as the actual slice. So then the next thing to think about is really voxel spacing, which is as you move across one bit of x-axis or one bit of y-axis or from slice to slice, how far in the real world are you moving?

And one of the annoying things about medical imaging is that different kinds of scanners have different distances between those slices, called the slice thickness, and different meanings of the x and y axes. Luckily that stuff is all in the DICOM metadata. So the resampling process means taking those lists of slices and turning them into something where every step in the x-direction or the y-direction or the z-direction equals 1mm in the real world.

And so it would be very annoying for your deep learning network if your different lung images were squished by different amounts, especially if you didn't give it the metadata about how much this was being squished. So that's what resampling does, and as you can see it's using the slice thickness and the pixel spacing to make everything nice and even.
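As a sketch, the resampling step amounts to reading the slice thickness and pixel spacing from the metadata and interpolating the volume so every voxel is 1mm on each side:

    import numpy as np
    import scipy.ndimage

    def resample_to_1mm(volume, slices):
        # current spacing in mm along (z, y, x)
        spacing = np.array([float(slices[0].SliceThickness)] +
                           [float(s) for s in slices[0].PixelSpacing])
        zoom_factors = spacing / 1.0              # stretch each axis so one step = 1mm
        return scipy.ndimage.zoom(volume, zoom_factors, order=1)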

So there are various ways to do 3D plots, and it's always a good idea to do that. And then something else that people tend to do is segmentation. Depending on time, you may or may not get around to looking more at segmentation in this part of the course, but effectively segmentation is just another generative model.

It's a generative model where hopefully somebody has given you some things saying this is lung, this is air, and then you build a model that tries to predict for something else what's lung and what's air. Unfortunately for lung CT scans, we don't generally have the ground truth of which bit is lung and which bit is air.

So generally speaking, in medical imaging, people use a whole lot of heuristic approaches, so kind of hacky rule-based approaches, and in particular applications of region-growing and morphological operations. I find this kind of the boring part of medical imaging because it's so clearly a dumb way to do things, but deep learning is far too new in this area yet to kind of develop the data sets that we need to do this properly.

But the good news is that there's a button which I don't think many people notice exists called 'Tutorial' on the main Data Science Bowl page, where these folks from Booz Allen Hamilton actually show you a complete segmentation approach. It's interesting that they picked U-Net segmentation. This is definitely the thing about segmentation I would be teaching you guys if we have time.

U-Net is one of these things that, outside of the Kaggle world, I don't think that many people are familiar with, but inside the Kaggle world we know that any time segmentation crops up, U-Net wins; it's the best. More recently there's actually been something called DenseNet for segmentation, which takes U-Net even a little bit further, and maybe that would be the new winner for newer Kaggle competitions when they happen.

But the basic idea of things like U-Net and DenseNet is this: when we do generative models, like when we were doing style transfer, we generally start with a large image, do some downsampling operations to make it a smaller image, do some computation, and then make it bigger again with upsampling operations.

What happens in U-Net is that there are additional neural network connections made directly from here to here, and directly from here to here, and here to here, and here to here. Those connections basically allow it to do almost a kind of residual learning approach: it can figure out the key semantic pieces at really low resolution, but then as it upscales it can learn what was special about the difference between the downsampled image and the original image here.

It can kind of learn to add that additional detail at each point. So U-Net and DenseNet for segmentation are really interesting and I hope we find some time to get back to them in this part of the course, but if we don't, you can get started by looking at this tutorial in which these folks basically show you from scratch.

What they try to do in this tutorial is something very specific, which is the detection part. So what happens in this kind of, like think about the fisheries competition. We pretty much decided that in the fisheries competition, if you wanted to do really well, you would first of all find the fish and then you would zoom into the fish and then you would figure out what kind of fish it is.

Certainly in the right whale competition earlier, that was how it was won. For this competition, this is even more clearly going to be the approach, because these images are just far too big to put through a normal convolutional neural network. So we need one step that's going to find the nodule, and then a second step that's going to zoom into a possible nodule and figure out whether it is a malignant tumor or something else, a false positive.

The bad news is that the Data Science Bowl data set does not give you any information at all in the training set about where the cancerous nodules are. I actually wrote a post on the Kaggle forums about this; I just think this is a terrible idea. That information actually exists: the dataset they got this from is something called the National Lung Screening Trial, which actually has that information or something pretty close to it.

So the fact they didn't provide it, I just think is horrible for a competition which could save lives, and I can't begin to imagine why. The good news though is that there is a data set which does have this information. The original data set was called LIDC-IDRI, but interestingly that data set was recently used for another competition, a non-Kaggle competition called Luna, and that competition is now finished.

And one of the tracks in that competition was actually specifically a false positive reduction track, and the other track was basically a find-the-nodule track. So you can actually go back and look at the papers written by the winners. They're generally ridiculously short. Many of them are a single sentence saying that due to a commercial confidentiality agreement we can't say anything.

But some of them, including the winner of the false positive track, they actually provide it. Surprisingly, they all use deep learning. And so what you could do, in fact I think what you have to do to do well in this competition is download the Luna data set, use that to build a nodule detection algorithm.

So the Luna data set includes files saying this lung has nodules here, here, here, here. So do nodule detection based on that, and then run that nodule detection algorithm on the Kaggle data set, find the nodules, and then use that to do some classification. There are some tricky things with that.

The biggest tricky thing is that most of the CT scans in the Luna data set are what's called contrast studies. A contrast scan means that the patient had a radioactive dye injected into them, so that the things that they're looking for are easier to see. For the National Lung Screening Trial, which is what they use in the Kaggle data set, none of them use contrast.

And the reason why is that what we really want to be able to do is to take anybody who's over 65 and has been smoking more than a pack a day for more than 20 years and give them all a CT scan and find out which ones have cancer, but in the process we don't want to be shooting them up with radioactive dye and giving them cancer.

So that's why we try to make sure that when we're doing these kind of asymptomatic scans that they're as low radiation dose as possible. So that means that you're going to have to think about transfer learning issues, that the contrast in your image is going to be different between the thing you build on the Luna data set, the nodule protection, and the Kaggle competition data set.

When I looked at it, I didn't find that that was a terribly difficult problem. I'm sure you won't find it impossible by any means. So to finalize this discussion, I wanted to refer to this paper, which I'm guessing not that many people have read yet. It's a medical imaging paper.

And what it is, is a non-deep learning approach to trying to find nodules. So that's where they use nodule segmentation. Yes, Rachel? I have a correction from our radiologist saying that the dye is not radioactive; it's just dense, an iodine-based contrast agent. Okay, but there's a reason we don't inject people with contrast dye.

The issues are things like contrast-induced nephropathy or allergic reactions. Yeah, that's what I meant. I do know though that the NLST scans use a lower radiation dose than I think the Luna ones do, so that's another difference. So this is an interesting idea of how you can find nodules using more of a heuristic approach.

And the heuristic approach they suggest here is to do clustering, and we haven't really done any clustering in class yet, so we're going to dig into this in some detail. Because I think this is a great idea for the kind of heuristics you can add on top of deep learning to make deep learning work in different areas.

The basic idea here is, as you can see, what they call a five-dimensional mean shift. They're going to try to find groups of voxels which are similar and cluster them together, and hopefully in particular it will cluster together things that look like nodules. So the idea is that at the end of this segmentation there will be one cluster for the whole lung boundary, one cluster for the whole vasculature, and then one cluster for every nodule.

So the five dimensions are x, y and z, which is straightforward, intensity, so the number of Hounsfield units, and then the fifth one is volumetric shape index, and this is the one tricky one. The basic idea here is that it's going to be a combination of the different curvatures of a voxel based on the Gaussian and mean curvatures.

Now what the paper goes on to explain is that you can use for these the first and second derivatives of the image. Now all that basically means is you subtract one voxel from its neighbor, and then you take that whole thing and subtract one voxel's version of that from its neighbor.

You get the first and second derivatives, so it kind of tells you the direction of the change of image intensity at that point. So you take these first and second derivatives of the image, put them into this formula, and it comes out with something which basically tells you how sphere-like a construct this voxel seems to be part of.

So that's great. If we can basically take all the voxels and combine the ones that are nearby, have a similar number of Hounsfield units and seem to be of similar kinds of shapes, we're going to get what we want. So I'm not going to worry about this bit here because it's very specific to medical imaging.

Anybody who's interested in doing this, feel free to talk on the forum about what this looks like in Python. But what I did want to talk about was the mean shift clustering, which is a particular approach to clustering that they use. "Clustering" is something which for a long time I've been kind of an anti-fan of.

It belongs to this group of unsupervised learning algorithms which always seem to be kind of looking for a problem to solve. But I've realized recently there are some specific problems that can be solved well with them. I'm going to be showing you a couple, one today and one in Lesson 14.

Clustering algorithms are perhaps easiest to describe by generating some data and showing what they do. Here's some generated data. I'm going to create 6 clusters, and for each cluster I'll create 250 samples. So I'm going to basically say let's create a bunch of centroids by creating some random numbers.

So 6 pairs of random numbers for my centroids, and then I'll grab a bunch of random numbers around each of those centroids and combine them all together and then plot them. And so here you can see each of these X's represents a centroid, so a centroid is just like the average point for a cluster of data.

And each color represents one cluster. So imagine if this was showing you clusterings of different kinds of lung tissue, ideally you'd have some voxels that were colored one thing for nodule, and a bunch of different color for vasculature and so forth. We can only show this easily in 2 dimensions, but there's no reason to not be able to imagine doing this in certainly 5 dimensions.
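Roughly, generating that toy data looks like this; the exact ranges and spread are placeholders rather than the lesson notebook's values:

    import numpy as np
    import matplotlib.pyplot as plt

    n_clusters, n_samples = 6, 250
    centroids = np.random.uniform(-35, 35, (n_clusters, 2))            # the X's
    X = np.concatenate([np.random.normal(c, 5, (n_samples, 2)) for c in centroids])

    plt.scatter(X[:, 0], X[:, 1], s=3)
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', c='k', s=100)
    plt.show()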

So the goal of clustering will be to undo this. Given the data, but not the X's, how can you figure out where the X's were? And then it's pretty straightforward once you know where the X's are to then find the closest points to that to assign every data point to a cluster.

The most popular approach to clustering is called K-means. K-means is an approach where you have to decide up front how many clusters there are. And what it basically does is there are two steps. The first one is to guess where those clusters might be. And a really simple way to do that is just to randomly pick a point, and then start picking points which are as far away as possible from all the previous ones you've picked.

Let me throw away the first one. So if I started here, then probably the furthest away point would be down here. So this would be our starting point for cluster 1, and say what point is the furthest away from that? That's probably this one here, so we have a starting point for cluster 2.

What's the furthest point away from both of these? Probably this one over here, and so forth, so you keep doing that to get your initial points. And then you iterate: you say, let's assume these are the cluster centers, figure out which cluster every point belongs to, then move each center to the mean of its points, and repeat that a bunch of times, roughly as sketched below.
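Here's a minimal sketch of that procedure in NumPy, with farthest-point-style initialization followed by the usual assign-then-update loop; there's no handling of empty clusters or convergence checks.

    import numpy as np

    def kmeans(X, k, n_iter=10):
        # initialization: pick a random point, then repeatedly take the point
        # farthest from everything chosen so far
        centroids = [X[np.random.randint(len(X))]]
        for _ in range(k - 1):
            dists = np.min([((X - c) ** 2).sum(1) for c in centroids], axis=0)
            centroids.append(X[dists.argmax()])
        centroids = np.array(centroids)

        for _ in range(n_iter):
            assign = ((X[:, None] - centroids[None]) ** 2).sum(2).argmin(1)   # nearest centroid
            centroids = np.array([X[assign == j].mean(0) for j in range(k)])  # move to cluster means
        return centroids, assign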

Now K-means, it's a shame it's so popular because it kind of sucks, right? Sucky thing number 1 is that you have to decide how many clusters there are, and the whole point is we don't know how many nodules there are. And sucky thing number 2 is that, without some changes to do something called kernel K-means, it only works when the clusters are all the same kind of shape, all nicely Gaussian shaped.

So we're going to talk about something way cooler, which I only came across somewhat recently and which is much less well known, called mean shift clustering. Now mean shift clustering is one of these things which seems to spend all of its time in serious mathematician land. Whenever I tried to look up something about mean shift clustering, I kind of started seeing this kind of thing.

This is like the first tutorial I could find that isn't a PDF. So that's one way to think about mean shift clustering; another way is a code-first approach, which is that this is the entire algorithm. So let's talk about what's going on here. What are we doing? At a high level, we're going to do a bunch of loops.

So we're going to do 5 steps. It would be better if, rather than doing a fixed 5 steps, I kept doing this until it was stable, but for now I'm just going to do 5 steps. And in each step I'm going to go through our data, so our data is X, and I'm going to enumerate through it.

So small x is the current data point I'm looking at. Now what I want to do is find out how far away is this data point from every other data point. So I'm going to create a vector of distances. And I'm going to do that with the magic of broadcasting.

So small x is a vector of size 2, this is 2 coordinates, and big X is a matrix of size n by 2, where n is the number of points. And thanks to what we've now learned about broadcasting, we know that we can subtract a matrix from a vector, and that vector will be broadcast across the axis of the matrix.

And so this is going to subtract every element of big X from little x. And so if we then go ahead and square that, and then sum it up, and then take the square root, this is going to return a vector of distances of small x to every element of big X.

And the sum here is just summing up the two coordinates. So that's step 1. So we now know for this particular data point, how far away is it from all of the other data points. Now the next thing we want to do is to -- let's go to the final step.

The final step will be to take a weighted average. In the final step, we're going to say what cluster do you belong to. Let's draw this. So we've got a whole bunch of data points, and we're currently looking at this one. What we've done is we've now got a list of how far it is away from all of the other data points.

And the basic idea is now what we want to do is take the weighted average of all of those data points, weighted by the inverse of that distance. So the things that are a long way away, we want to weight very small. And the things that are very close, we want to weight very big.

So I think this is probably the closest, and this is about the second-closest, and this is about the third-closest. So assuming these have most of the weight, the average is going to be somewhere about here. And so by doing that at every point, we're going to move every point closer to where its friends are, closer to where the nearby things are.

And so if we keep doing this again and again, everything is going to move until it's right next to its friends. So how do we take something which initially is a distance and make it so that the larger distances have smaller weights? And the answer is we probably want a shape something like that.

In other words, Gaussian. This is by no means the only shape you could choose. It would be equally valid to choose this shape, which is a triangle, at least half of one. In general though, note that if we're going to multiply every point by one of these things and add them all together, it would be nice if all of our weights added to 1, because then we're going to end up with something that's of the same scale that we start with.

So when you create one of these curves where it all adds up to 1, generally speaking we call that a kernel. And I mention this because you will see kernels everywhere. If you haven't already, now that you've seen it, you'll see them everywhere. In fact, kernel methods is a whole area of machine learning that in the late 90s basically took over because it was so theoretically pure.

And if you want to get published in conference proceedings, it's much more important to be theoretically pure than actually accurate. So for a long time, kernel methods won out, and neural networks in particular disappeared. Eventually people realized that accuracy was important as well, and in more recent times kernel methods have largely disappeared.

But you still see the idea of a kernel coming up very often, because they're super useful tools to have. They're basically something that lets you take a number, like in this case a distance, and turn it into some other number where you can weight everything by that other number and add them together to get a nice little weighted average.

So in our case, we're going to use a Gaussian kernel. The particular formula for a Gaussian doesn't matter. I remember learning this formula in grade 10, and it was by far the most terrifying mathematical formula I've ever seen, but it doesn't really matter. For those of you that remember or have seen the Gaussian formula, you'll recognize it.

For those of you that haven't, it doesn't matter. But this is the function that draws that curve. So if we take every one of our distances and put it through the Gaussian, we will then get back a bunch of weights that add to 1. So then in the final step, we can multiply every one of our data points by that weight, add them up, and divide by the sum of the weights.

In other words, take a weighted average. You'll notice that I had to be a bit careful about broadcasting here, because I needed to add a unit axis at the end of my dimensions, not at the start, so by default it adds unit axes to the beginning when you do broadcasting.

That's why I had to do an expand_dims. If you're not clear on why this is, then that's a sign you definitely need to do some more playing around with broadcasting. So have a fiddle with that during the week. Feel free to ask if you're not clear after you've experimented.

But this is just a weighted sum. So this is just doing sum of weights times x divided by sum of weights. Importantly there's a nice little thing that we can pass to a Gaussian, which is the thing that decides does it look like the thing I just drew, or does it look like this, or does it look like this.

All of those things add up to one. They all have the same area underneath, but they're very different shapes. If we make it look like this, then what that's going to do is create a lot more clusters, because things that are really close to a point are going to have really high weights, and everything else is going to have a tiny, basically meaningless weight.

If we use something like this instead, we're going to have far fewer clusters, because even stuff that's further away is going to get a reasonable weight in the sum. The kernel width, which goes by lots of different names, is a choice you have to make.

Here I've used bw, for bandwidth. There are actually some cool ways to choose it. One simple way is to find the bandwidth that covers a third of the data in your dataset. I think that's the approach scikit-learn uses. So there are ways to automatically figure out a bandwidth, which is just one of the very nice things about mean shift.
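Putting those pieces together, the whole thing is only a few lines of NumPy. This is my reconstruction of the code being described, with a fixed number of steps and a Gaussian kernel of bandwidth bw:

    import math
    import numpy as np

    def gaussian(d, bw):
        # Gaussian kernel: turn distances into weights, closer points weigh more
        return np.exp(-0.5 * (d / bw) ** 2) / (bw * math.sqrt(2 * math.pi))

    def meanshift(data, bw=2.5, n_steps=5):
        X = np.array(data, dtype=float)
        for _ in range(n_steps):
            for i, x in enumerate(X):
                dist = np.sqrt(((x - X) ** 2).sum(1))       # distance from x to every point
                weight = gaussian(dist, bw)
                X[i] = (np.expand_dims(weight, 1) * X).sum(0) / weight.sum()   # weighted average
        return X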

So we just go through a bunch of times, five times, and each time we replace every point with its weighted average weighted by this Gaussian kernel. So when we run this 5 times, it takes a second, and here's the results. I've offset everything by 1 just so that we can see it, otherwise it would be right on top of the x.

So you can see that for nearly all of them, it's in exactly the right spot, whereas for this cluster, let's just remind ourselves what that cluster looked like, these two clusters, this particular bandwidth, it decided to create one cluster for them rather than two. So this is kind of an example, whereas if we decreased our bandwidth, it would create two clusters.

There's no one right answer, that should be one or two. So one challenge with this is that it's kind of slow. So I thought let's try and accelerate it for the GPU. Because mean shift's not very cool, nobody seems to have implemented it for the GPU yet, or maybe it's just not a good idea, so I thought I'd use PyTorch.

And the reason I used PyTorch is because writing PyTorch really just feels like writing NumPy; everything happens straight away. So I really hoped that I could take my original code and make it almost the same. And indeed, here is the entirety of mean shift in PyTorch.

So that's pretty cool. You can see anywhere I used to have np, it now says torch; np.array is now torch.FloatTensor, np.sqrt is torch.sqrt, and everything else is almost the same. One issue is that Torch doesn't support broadcasting. We'll talk more about this in a couple of weeks, but basically I decided that's not okay, so I wrote my own broadcasting library for PyTorch.

So rather than saying little x minus big X, I used sub for subtract. That's the subtract from my broadcasting library. If you're curious, check out TorchUtils and you can see my broadcasting operations there. But basically if you use those, with only small modifications it will do all the broadcasting for you.

So as you can see, this looks basically identical to the previous code, but it takes longer. So that's not ideal. One problem here is that I'm not using CUDA. I could easily fix that by adding .cuda() to my x, but that made it slower still. The reason why is that all the work is being done in this for loop, and PyTorch doesn't accelerate for loops.

Each run through a for loop in PyTorch is basically calling a new CUDA kernel each time you're going through. It takes a certain amount of time to even launch a CUDA kernel. When I'm saying CUDA kernel, this is a different usage of the word kernel. In CUDA, kernel refers to a little piece of code that runs on the GPU.

So it's launching a little GPU process every time through the for loop. It takes quite a bit of time, and it's also having to copy data all over the place. So what I then tried to do was to make it faster. The trick is to do it by minibatch.

So each time through the loop we don't want to do just one piece of data, but a minibatch of data. So here are the changes I made. The main one was that my for_i now jumps through one batch size at a time. So I'm going to go 0.123, but 0.1632.

So I now need to create a slice which is from i to i plus batch size, unless we've gone past the end of the data, in which case it just goes as far as the end. So this is going to refer to the slice of data that we're interested in. So what we can now do is say X with that slice to grab all of the data in this minibatch.

And so then I had to create a special version of the distance function. I can't just say subtract anymore; I need to think carefully about the broadcasting operations here. It's going to return a matrix: let's say batch size is 32, so I'm going to have 32 rows, and let's say n is 1000, so it will have 1000 columns.

That shows me how far away each thing in my batch is from every piece of data. So when we do things a batch at a time, we're basically adding another axis to all of your tensors. Suddenly now you have a batch axis all the time. And when we've been doing deep learning, that's been something I think we've got pretty used to.

The first axis in all of our tensors has always been a batch axis. So now we're writing our own GPU-accelerated algorithm. Can you believe how crazy this is? Two years ago, if you Google for K-means, CUDA, or K-means GPU, you get back research studies where people write papers about how to put these algorithms in GPUs, because it was hard.

And here's a page of code that does it. So it's crazy that this is possible, but here we are. We have built a batch-by-batch GPU-accelerated mean shift algorithm. The basic distance formula is exactly the same; I just have to be careful about where I add unsqueeze, which is the same as expand_dims in NumPy.

So I just have to be careful about where I add my unit axes: add it to the first axis of one bit and the second axis of the other bit. So that's going to subtract every one of these from every one of these and return a matrix. Again, this is a really good time to look at this and think about why this broadcasting works, because this is getting more and more complex broadcasting.

And hopefully you can now see the value of broadcasting. Not only did I get to avoid writing a pair of nested for loops here, but I also got to do this all on the GPU in a single operation, so I've made this thousands of times faster. So here is a single operation which does that entire matrix subtraction.
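
As a concrete sketch of that single operation — assuming a PyTorch version with broadcasting; the function name is mine, not the notebook's:

```python
def dist_b(a, b):
    # a: [bs, d] minibatch of points, b: [n, d] all points.
    # unsqueeze (PyTorch's expand_dims) adds the unit axes so the subtraction
    # broadcasts to [bs, n, d]; reducing over the last axis gives a [bs, n]
    # matrix of Euclidean distances.
    diff = a.unsqueeze(1) - b.unsqueeze(0)   # [bs, 1, d] - [1, n, d] -> [bs, n, d]
    return diff.pow(2).sum(2).sqrt()         # [bs, n]
```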

Yes, Rachel? I was just going to suggest that we take a break soon — it's ten to eight. So that's our batchwise distance function. We then chuck that into a Gaussian, and because this is just element-wise, the Gaussian function hasn't changed at all, so that's nice. And then I've got my weighted sum, and then divide that by the sum of weights.
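
Putting those pieces together, the per-batch update might look roughly like this — a sketch assuming the dist_b helper above, a hard-coded bandwidth, and an X_new buffer for the updated points:

```python
import math
import torch

def gaussian(d, bw):
    # elementwise Gaussian kernel; math.pi and math.sqrt are plain Python
    # constants, so there's nothing for the GPU to do there
    return torch.exp(-0.5 * (d / bw) ** 2) / (bw * math.sqrt(2 * math.pi))

bw = 2.5                                                # assumed bandwidth
weights = gaussian(dist_b(x_batch, X), bw)              # [bs, n]
num = (weights.unsqueeze(2) * X.unsqueeze(0)).sum(1)    # weighted sum, [bs, d]
X_new[s] = num / weights.sum(1, keepdim=True)           # divide by the sum of weights
```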

So that's basically the algorithm. So previously my NumPy version took a second; now it's 48ms, so we've just sped that up by about 20 times. Yes, Rachel? Question - I get how batching helps with locality and cache, but I do not quite follow how it helps otherwise, especially with respect to accelerating the for loop.

So in PyTorch, the for loop is not run on the GPU. The for loop is run on your CPU, and your CPU goes through each step of the for loop and calls the GPU to say do this thing, do this thing, do this thing. This is not to say you can't accelerate this in TensorFlow in a similar way.

In TensorFlow, there's tf.while_loop and stuff like that where you can actually do GPU-based loops. Even so, if you do it entirely in a loop in Python, it's going to be pretty difficult to get this performance. But particularly in PyTorch, it's important to remember that your loops are not optimized.

It's what you do inside each loop that's optimized. We have another question. Some of the math functions are coming from Torch and others are coming from the Python math library. What is the difference when you use the Python math library? Does that mean the GPU is not being used? You'll see that I use math for things like math.pi, which is a constant, and math.sqrt(2 * pi), which is a constant.

You don't need to use the GPU to calculate a constant, obviously. We only use Torch for things that are running on a vector or a matrix or a tensor of data. So let's have a break. We'll come back in 10 minutes, so that would be 2 past 8, and we'll talk about some ideas I have for improving mean shift, which maybe you guys will want to try during the week.

The idea here is that we figure there are two steps. Step number one is to find the things that may be kind of nodule-ish in something like this, if there are any, zoom into them and create a little cropped version. Step two, which is where your learning particularly comes in, is to figure out whether that's cancerous or not.

Once you've found a nodule-ish thing, by far the biggest driver of whether or not it's a malignant cancer is how big it is — it's actually pretty straightforward. The other particularly important thing is how kind of spidery it looks. If it looks like it's kind of evilly going out to capture more territory, that's probably a bad sign as well.

So the size and the shape are the two things that you're going to want to try and find, and obviously that's a pretty good thing for a neural net to be able to do — you probably don't even need that many examples of it. When you get to that point, there was obviously a question about how to deal with the 3D aspect here.

You can just create a 3D convolutional neural net. So if you had like a 10x10x10 space, that's obviously not going to be too big; if it's 20x20x20, you might be okay — so think about how big a volume you can create. There are plenty of papers around on 3D convolutions, although I'm not sure you even need one, because it's just a convolution in 3D.
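
For instance, a 3D convolutional classifier over a small cubic crop is no harder to write than a 2D one. Here's a minimal PyTorch sketch — my own illustration, not the course's code; the 20x20x20 crop size and channel counts are just placeholders:

```python
import torch
from torch import nn

# input: [batch, 1, 20, 20, 20] single-channel volumetric crops
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(2),                      # -> [batch, 16, 10, 10, 10]
    nn.Conv3d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),              # -> [batch, 32, 1, 1, 1]
    nn.Flatten(),
    nn.Linear(32, 2),                     # e.g. nodule-ish vs not, or malignant vs benign
)
out = model(torch.randn(4, 1, 20, 20, 20))   # -> [4, 2]
```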

The other approach that you might find interesting to think about is something called triplanar. What triplanar means is that you take a slice through the x axis, the y axis and the z axis, so you basically end up with three images — one slice through each of x, y and z — and then you can kind of treat those as different channels if you like.
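
A minimal sketch of that triplanar idea — assuming a cubic NumPy volume indexed [z, y, x], so the three slices come out the same size:

```python
import numpy as np

def triplanar(volume, z, y, x):
    # Take the axial, coronal and sagittal slices through a candidate
    # location and stack them as 3 channels, so an ordinary 2D,
    # three-channel network can consume them.
    axial    = volume[z, :, :]
    coronal  = volume[:, y, :]
    sagittal = volume[:, :, x]
    return np.stack([axial, coronal, sagittal], axis=0)   # shape [3, H, W]
```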

You can probably use pretty standard neural net libraries that expect three channels. So there's a couple of ideas for how you can deal with the 3D aspect of it. I think using the Luna dataset as much as possible is going to be a good idea, because you really want something that's pretty good at detecting nodules before you start putting it onto the Kaggle dataset — the other problem with the Kaggle dataset is that it's ridiculously small.

And again, there's no reason for it — there are far more cases in NLST than they've provided to Kaggle, so I can't begin to imagine why they went to all this trouble and a million dollars of prize money for something which has not been set up to succeed. Anyway, that's not our problem, and it makes it all a more interesting thing to play with.

But after the competition's finished, if you get interested in it, you'll probably want to go and download the whole NLST dataset or as much as possible and do it properly. Actually, there are two questions that I wanted to read. One is just for the audio stream, there are occasional max volume pops that are really hard on the ears for remote listeners.

This might not be solvable right now, but something to look into. And then last class you mentioned that you would explain when and why to use Keras versus PyTorch. If you only had brain space for one in the same way, some only have brain space for VI or Emacs, which would you pick?

So I just reduced the volume a little bit, so let us know if that helps. I would pick PyTorch — it feels like it kind of does everything Keras does, but gives you the flexibility to really play around a lot more. I'm sure you've got brain space for both. So, question: you mentioned there are other datasets of cancerous images that have labels and proper annotations.

Can you train the thing on that dataset? That was my suggestion, and that's what the tutorial shows how to do. There's a whole kernel on Kaggle called candidate generation and LUNA16, which shows how to use Luna to build a nodule finder, and this is one of the highest rated Kaggle kernels.

We've now used kernel in three totally different ways in this lesson — Kaggle kernels, CUDA kernels and kernel methods — see if we can come up with a fourth. So this looks very familiar, doesn't it? So here's a Keras approach to finding lung nodules based on Luna. So I mentioned an opportunity to improve this mean shift algorithm, and the opportunity for improvement, when you think about it, is pretty obvious.

The actual amount of data is huge. You've got data points all over the place. The ones that are a long way away, like the weight is going to be so close to zero that we may as well just ignore them. The question is, how do we quickly find the ones which are a long way away?

We know the answer to that, we learned it. It's approximate nearest neighbors. So what if we added an extra step here, which rather than using x to get the distance to every data point, instead using approximate nearest neighbors to grab the closest ones, the ones that are actually going to matter.
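
As a sketch of how that might plug in — using Annoy here purely as an example of an off-the-shelf approximate nearest neighbours index, not the LSH or spill tree implementation the lecture is actually asking for, and with illustrative variable names:

```python
import numpy as np
from annoy import AnnoyIndex

# X_np: the data as an [n, d] numpy array (assumed); x: one query point, shape [d]
index = AnnoyIndex(X_np.shape[1], 'euclidean')
for i, row in enumerate(X_np):
    index.add_item(i, row)
index.build(10)                          # 10 trees; more trees = better recall

k = 100
idxs = index.get_nns_by_vector(x, k)     # approximate k nearest neighbours of x
neighbours = X_np[idxs]                  # only these points get non-negligible weights
```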

So that would basically turn this linear-time piece into a logarithmic-time piece, which would be pretty fantastic. So we learned very briefly about a particular approach, which is locality-sensitive hashing. I think I also mentioned there's another approach which I'm really fond of, called spill trees. I really want us as a team to take this algorithm and add approximate nearest neighbors to it and release it to the community as the first ever superfast GPU-accelerated, approximate-nearest-neighbor-accelerated mean shift clustering algorithm.

I think that would be a really big deal. If anybody's interested in doing that, I believe you're going to have to implement something like LSH or SpillTrees in PyTorch, and once you've done that, it should be totally trivial to add the step that then uses that here. So if you do that, then if you're interested, I would invite you to team up with me in that we would then release this piece of software together and author a paper or a post together.

So that's my hope is that a group of you will make that happen. That would be super exciting because I think this would be great. We'll be showing people something pretty cool about the idea of writing GPU algorithms today. In fact, I found just during the break, here's a whole paper about how to write k-means with CUDA.

It used to be so much work. This is without even including any kind of approximate nearest neighbor's piece or whatever. So I think this would be great. Hopefully that will happen. And look, it gives the right answer. I guess to do it properly, we should also be replacing the Gaussian kernel bandwidth with something that we figure out dynamically rather than have it hard coded.

So, change of pace: we're going to learn about chatbots. We're going to start here with Slate: Facebook thinks it has found the secret to making bots less dumb. This talks about a new thing called memory networks, which was demonstrated by Facebook. You can feed it sentences that convey key plot points in Lord of the Rings and then ask it various questions.

They published a new paper on arXiv that generalizes the approach. There was another long article about this in Popular Science, in which they described it as early progress towards a truly intelligent AI. LeCun is excited about working on memory networks, giving the ability to retain information: you can tell the network a story and have it answer questions.

And so it even has this little gif. In the article, they've got this little example showing reading a story of Lord of the Rings and then asking various questions about Lord of the Rings, and it all looks pretty impressive. So we're going to implement this paper. And the paper is called End-to-End Memory Networks.

The paper was actually not demonstrated on Lord of the Rings, but on something called bAbI — I'm never quite sure whether it's pronounced 'babby' or 'baby'. It's a paper describing a synthetic dataset: 'Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks'.

I saw a cute tweet last week explaining the meaning of various different types of paper titles, and it basically said 'towards' means we've actually made no progress whatsoever. So we'll take this with a grain of salt. So this paper introduces the bAbI tasks, and the bAbI tasks are probably best described by showing an example.

Here's an example. So each task is basically a story. A story contains a list of sentences, a sentence contains a list of words. At the end of the story is a query to which there is an answer. So the sentences are ordered in time. So where is Daniel? We'll have to go backwards.

This one says where John is. This one says where Daniel is — Daniel went to the bathroom, so Daniel is in the bathroom. So this is what the bAbI tasks look like. There's a number of different structures. This is called a one-supporting-fact structure, which is to say you only have to go back and find one sentence in the story to figure out the answer.

We're also going to look at two supporting fact stories, which is ones where you're going to have to look twice. So reading in these data sets is not remotely interesting, they're just a text file. We can parse them out. There's various different text files for the various different tasks.

If you're interested in the various different tasks, you can check out the paper. We're going to be looking at a single supporting fact and two supporting facts. They have some with 10,000 examples and some with 1,000 examples. The goal is to be able to solve every one of their challenges with just 1,000 examples.

This paper is not successful at that goal, but it makes some movement towards it. So basically, we're going to put that into a bunch of different lists of stories along with their queries. We can start off by having a look at some statistics about them. The first is, for each story, what's the maximum number of sentences in a story?

And the answer is 10. So Lord of the Rings, it ain't. In fact, if you go back and you look at the gif, when it says read story, Lord of the Rings, that's the whole Lord of the Rings. The total number of different words in this thing is 32.

The maximum length of any sentence in a story is 8. The maximum number of words in any query is 4. So we're immediately thinking, what the hell? Because this was presented by the press as being the secret to making bots less dumb, and showed us that they took a story and summarized Lord of the Rings, made plot points and asked various questions, and clearly that's not entirely true.

What they did, if you look at even the stories: the first word is always somebody's name, the second word is always some synonym for 'move', there's then a bunch of prepositions, and then the last word is always a place. So these toy tasks are very, very, very toy.

So immediately we're kind of thinking maybe this is not a step to making bots less dumb or whatever they said here, a truly intelligent AI. Maybe it's towards a truly intelligent AI. So to get this into Keras, we need to turn it into a tensor in which everything is the same size, so we use pad sequences for that, like we did in the last part of the course, which will add zeroes to make sure that everything is the same size.

So the other thing we'll do is we will create a dictionary from words to integers to turn every word into an index, so we're going to turn every word into an index and then pad them so that they're all the same length. And then that's going to give us inputs_train, 10,000 stories, each one of 10 sentences, each one of 8 words.
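
A rough sketch of that preprocessing — the variable names and the exact structure of the parsed stories are assumptions, but the word_idx dictionary plus pad_sequences pattern is the standard Keras approach:

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

# train: list of (story, query, answer), where a story is a list of tokenised sentences
vocab = sorted({w for story, q, a in train
                  for sent in story + [q] for w in sent})
word_idx = {w: i + 1 for i, w in enumerate(vocab)}   # 0 is reserved for padding

def vectorize_story(story, sent_len=8, story_len=10):
    seqs = [[word_idx[w] for w in sent] for sent in story]
    seqs = pad_sequences(seqs, maxlen=sent_len)                  # pad each sentence to 8 words
    pad = np.zeros((story_len - len(seqs), sent_len), dtype='int32')
    return np.concatenate([pad, seqs])                           # pad the story to 10 sentences

inputs_train = np.stack([vectorize_story(s) for s, q, a in train])   # [10000, 10, 8]
```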

Anything that's not 10 sentences long is going to get padded with sentences of just zeroes, and any sentence not 8 words long will get some zeroes — we'll get into that. And the same for the test set, except we've just got 1,000. So how do we do this? Not surprisingly, we're going to use embeddings.

Now we've never done this before. We have to turn a sentence into an embedding, not just a word into an embedding. So there's lots of interesting ways of turning a sentence into an embedding, but when you're just doing towards intelligent AI, you don't do any of them. You instead just add the embeddings up, and that's what happened in this paper.

And if you look at the way it was set up, you can see why you can just add the embeddings up. Mary, John and Sandra only ever appear in one place — they're always the subject. The verb is always the same kind of thing, the prepositions are always meaningless, and the last word is always a place.

So to figure out what a whole sentence says, you can just add up the word concepts. The order of them doesn't make any difference, there are no 'not's, there's nothing that makes language remotely complicated or interesting. So what we're going to do is create an input for our stories with the number of sentences and the length of each one.

We're going to take each word and put it through an embedding — that's what TimeDistributed is doing here. It's putting each word through a separate embedding, and then we do a Lambda layer to add them up. So here is our very sophisticated approach to creating sentence embeddings. So we do that for our story.

So we end up with something which rather than being 10 by 8, 10 sentences by 8 words, it's now 10 by 20, that is 10 sentences by length 20 embedding. So each one of our 10 sentences has been turned into a length 20 embedding, and we're just starting with a random embedding.

We're not going to use word2vec or anything, because with a vocabulary this tiny we don't need that complexity. We're going to do exactly the same thing for the query. We don't need to use TimeDistributed this time — we can just take the query, because this time we have just one query.

So we can do the embedding, sum it up, and then we use reshape to add a unit axis to the front so that it's now the same basic rank. We now have one question of embedding to length 20. So we have 10 sentences of the story and one query.
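
In Keras terms, those two embedding paths look roughly like this. It's a sketch of the description above, not the notebook verbatim; vocab_size here is an assumption (32 words plus padding, before the time-order tokens mentioned later):

```python
from keras.layers import Input, Embedding, TimeDistributed, Lambda, Reshape
import keras.backend as K

emb_dim = 20
vocab_size = 32 + 1            # assumed: 32 words plus the padding index

# story: 10 sentences of 8 words -> 10 sentence embeddings of length 20
inp_story = Input((10, 8))
emb_story = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)   # (10, 8, 20)
emb_story = Lambda(lambda x: K.sum(x, axis=2))(emb_story)                # (10, 20): just add the word embeddings

# query: one sentence of up to 4 words -> a single length-20 embedding
inp_q = Input((4,))
emb_q = Embedding(vocab_size, emb_dim)(inp_q)                            # (4, 20)
emb_q = Lambda(lambda x: K.sum(x, axis=1))(emb_q)                        # (20,)
emb_q = Reshape((1, emb_dim))(emb_q)                                     # (1, 20): unit axis so ranks match
```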

So what is the memory network, or more specifically the more advanced end-to-end memory network? And the answer is, it is this. As per usual, when you get down to it, it's less than a page of code to do these things. Let's draw this before we look at the code.

So we have a bunch of sentences. Let's just use 4 sentences for now. So each sentence contained a bunch of words. We took each word and we turned them into an embedding. And then we summed all of those embeddings up to get an embedding for that sentence. So each sentence was turned into an embedding, and they were length 20, that's what it was.

And then we took the query, so this is my query, same kind of idea, a bunch of words which we got embeddings for, and we added them up to get an embedding for our question. Okay, so to do a memory network, what we're going to do is we're going to take each of these embeddings and we're going to combine it, each one, with a question or a query.

And we're just going to take a dot product — so the way to draw this is dot product, dot product, dot product — so we're going to end up with 4 dot products, one for each sentence of the story times the query. So what does the dot product do? It basically says how similar two things are: when one thing is big where the other thing is big, and small where the other thing is small, those things both make the dot product bigger.

So these are basically going to be 4 scores describing how similar each of our 4 sentences is to the query. So that's step 1. Step 2 is to stick them through a softmax. Remember the dot product just returns a scalar, so we now have 4 scalars. And they add up to 1.

And they each are basically related to how similar is the query to each of the 4 sentences. We're now going to create a totally separate embedding of each of the sentences in our story by creating a totally separate embedding for each word. So we're basically just going to create a new random embedding matrix for each word to start with, sum them all together, and that's going to give us a new embedding, this one they call C I believe.

And all we're going to do is multiply each one of these C embeddings by the equivalent softmax output as a weighting, and then add them all together. So we're going to have C1 times S1 plus C2 times S2 plus C3 times S3 plus C4 times S4, divided by S1 plus S2 plus S3 plus S4, and that's going to be our final result, which is going to be of length 20.

So this thing is a vector of length 20, and then we're going to take that and put it through a single dense layer, and we're going to get back the answer. And that whole thing is the memory network. It's incredibly simple — there's nothing deep in terms of deep learning, there are almost no non-linearities — so it doesn't seem like it's likely to be able to do very much, but then I guess we haven't given it very much to do.

So let's take a look at the code version. Yes. >> So in that last step you said the answer, was that really the embedding of the answer, and then it has to get the reverse lookup? >> Yeah, it's the softmax of the answer, and then you have to do an argmax.

So here it is, we've got the story times the embedding of the story times the embedding of the query, the dot product. We do a softmax. Softmax works in the last dimension, so I just have to reshape to get rid of the unit axis, and then I reshape again to put the unit axis back on again.

The reshapes aren't doing anything interesting, so it's just a dot product followed by a softmax, and that gives us the weights. So now we're going to take each weight and multiply it by the second set of embeddings, here's our second set of embeddings, embedding C, and in order to do this, I just used the dot product again, but because of the fact that you've got a unit axis there, this is actually just doing a very simple weighted average.

And again, I've reshaped to get rid of the unit axis so that we can stick it through a dense layer with a softmax, and that gives us our final result. So what this is effectively doing is basically saying: okay, how similar is the query to each one of the sentences in the story?

Use that to create a bunch of weights, and then these things here are basically the answers. It's like: if sentence number 1 is where the answer is, then we use this one, and the same for sentences 2, 3 and 4. Because there's a single linear layer at the very end, it doesn't really get to do much computation.

It basically has to learn what the answer represented by each story is. And again, this is lucky because the original data set, the answer to every question is the last word of the sentence. Where is Frodo's ring? So that's why we just can have this incredibly simple final piece.

So this is an interesting use of Keras, right? We've created a model which is in no possible way deep learning, but it's a bunch of tensors and layers that are stuck together. And so it has some inputs, it has an output, so we can call it a model. We can compile it with an optimizer and a loss, and then we can fit it.
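
Putting those steps together, here's a sketch of how such a model might be assembled in Keras, building on the embedding sketch above — emb_c is the second, independently embedded copy of the story, and all names are illustrative rather than the notebook's:

```python
from keras.layers import dot, Activation, Reshape, Dense, Embedding, TimeDistributed, Lambda
from keras.models import Model
import keras.backend as K

# second, independent embedding of the story (the "C" embedding)
emb_c = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
emb_c = Lambda(lambda x: K.sum(x, axis=2))(emb_c)                 # (10, 20)

x = dot([emb_story, emb_q], axes=2)          # (10, 1): similarity of each sentence to the query
x = Reshape((10,))(x)                        # drop the unit axis so softmax runs over the sentences
x = Activation('softmax')(x)
weights = Reshape((10, 1))(x)                # put the unit axis back

x = dot([weights, emb_c], axes=1)            # (1, 20): weighted sum of the C embeddings
x = Reshape((emb_dim,))(x)
answer = Dense(vocab_size, activation='softmax')(x)   # single dense layer -> answer word

model = Model([inp_story, inp_q], answer)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```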

So it's kind of interesting how you can use Keras for things which don't really use any of the normal layers in any normal way. And as you can see, it works, for what it's worth. We solved this problem. And the particular problem we solved here is the one-supporting-fact problem.

And in fact, it worked in less than 1 epoch. More interesting is two supporting facts. Actually before I do that, I'll just point out something interesting, which is we could create another model, now that this is already trained, which is to return not the final answer, but the value of the weights.

And so we can now go back and say, for a particular story, what are the weights? So let's get the weights rather than the answer. For this particular story, the weights are here, and you can see that the weight for sentence number 2 is 0.98. So we can actually look inside the model and find out which sentences it's using to answer this question.
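
One way to do that, sketched against the model above (the one_story and one_query inputs are hypothetical):

```python
from keras.models import Model

# Build a second model that shares the trained layers but outputs the
# softmax weights instead of the answer.
inspect = Model([inp_story, inp_q], weights)
w = inspect.predict([one_story[None], one_query[None]])   # e.g. sentence 2 gets weight ~0.98
```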

Question - would it not make more sense to concatenate the embeddings rather than sum them? Not for this particular problem, because of the way the vocabulary and the sentences are structured. You would also have to deal with the variable length of the sentences — well, we've used padding to make them the same length.

If you wanted to use this in real life, you would need to come up with a better sentence embedding, which presumably might be an RNN or something like that, because you need to deal with things like 'not' and the location of subject and object and so forth. One thing to point out is that the order of the sentences matters.

And so what I actually did when I preprocessed it was I added a 0 colon, 1 colon, whatever to the start of each sentence, so that it would actually be able to learn the time order of sentences. So this is like another token that I added. So in case you were wondering what that was, that was something that I added in the preprocessing.
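
That preprocessing trick might look something like this — a hypothetical sketch of the idea, not the actual parsing code:

```python
def add_time_tokens(story):
    # Prefix each sentence with a token recording its position, so the
    # bag-of-words sentence embedding still knows the order of sentences.
    return [['%d:' % i] + sentence for i, sentence in enumerate(story)]

# ['Mary', 'moved', 'to', 'the', 'bathroom'] becomes ['0:', 'Mary', 'moved', ...]
```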

So one nice thing with memory networks is we can kind of look and see if they're not working, in particular why they're not working. So multi-hop, so let's now look at an example of a two supporting facts story. It's mildly more interesting. We still only have one type of verb with various synonyms and a small number of subjects and a small number of objects, so it's basically the same.

But now, to answer a question, we have to go through two hops. So where is the milk? Let's find the milk. Daniel left the milk there. Where is Daniel? Daniel traveled to the hallway. Where is the milk? Hallway. Alright. So that's what we have to be able to do this time.

And so what we're going to do is exactly the same thing as we did before, but we're going to take our whole little model — the embedding, reshape, dot, reshape, softmax, reshape, dot, reshape, dense layer, sum — and we're going to chuck it into a function and call it one hop.

So this whole picture is going to become one hop. And what we're going to do is we're going to take this and go back and replace the query with our new output. So at each step, each hop, we're going to replace the query with the result of our memory network.

And so that way, the memory network can learn to recognize: the first thing I need is the milk, so search back and find the milk. I now have the milk; now update the query to where is Daniel. Now go back and find Daniel. So the memory network in multi-hop mode basically does this whole thing again and again, replacing the query each time.

So that's why I just took the whole set of steps and chucked them into a single function. And so then I just go: response, story goes through one hop; then response, story goes through one hop on that; and you can keep repeating that again and again. And then at the end, get our output — that's our model — compile, fit.
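
A sketch of that refactoring, reusing the layers-as-a-function idea from the earlier sketches (the per-hop story embeddings are illustrative, not the notebook's exact arrangement):

```python
from keras.layers import dot, Activation, Reshape, Dense

def one_hop(query_emb, story_emb_a, story_emb_c):
    # One memory-network hop: dot with the query, softmax over sentences,
    # weighted sum of the second embedding; the result replaces the query.
    x = dot([story_emb_a, query_emb], axes=2)
    x = Activation('softmax')(Reshape((10,))(x))
    w = Reshape((10, 1))(x)
    return dot([w, story_emb_c], axes=1)       # (1, 20)

h = one_hop(emb_q, emb_story, emb_c)           # first hop
h = one_hop(h, emb_story2, emb_c2)             # second hop (new story embeddings, assumed)
answer = Dense(vocab_size, activation='softmax')(Reshape((emb_dim,))(h))
```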

I had real trouble getting this to fit nicely, I had to play around a lot with learning rates and batch sizes and whatever else, but I did eventually get it up to 0.999 accuracy. So this is kind of an unusual class for me to be teaching, because particularly compared to Part 1 where it was like best practices, clearly this is anything but.

I'm kind of showing you something which was maybe the most popular request — teach us about chatbots. But let's be honest, who has ever used a chatbot that's not terrible? And the reason no one's used a chatbot that's not terrible is that the current state of the art is terrible. So chatbots have their place, and indeed one of the students in the class has written a really interesting kind of analysis of this, which hopefully she'll share on the forum.

But that place is really kind of lots of heuristics and carefully set up vocabularies and selecting from small sets of answers and so forth. It's not kind of general purpose, here's a story, ask anything you like about it, here are some answers. It's not to say we won't get there — I sure hope we will — but the kind of incredible hype we had around neural Turing machines and memory networks and end-to-end memory networks is, as you can see even when you just look at the dataset they worked on, kind of crazy.

So that is not quite the final conclusion of this though, because yesterday a paper came out which showed how to identify buffer overruns in computer source code using memory networks. And so it kind of spoilt my whole narrative that somebody seems to have actually used this technology for something effectively.

And I guess when you think about it, it makes some sense. So in case you don't know what a buffer overrun is, that's like if you're writing in an unsafe language, you allocate some memory, it's going to store some result or some input, and you try to put into that memory something bigger than the amount that you allocated, it basically spills out the end.

And in the best case, it crashes. In the worst case, somebody figures out how to get exactly the right code to spill out into exactly the right place and ends up taking over your machine. So buffer overruns are horrible things. And the idea of being able to find them, I can actually see it does look a lot like this memory network.

You kind of have to see where was that variable kind of set, and then where was the thing that was set from set, and where was the original thing allocated. It's kind of like just going back through the source code. The vocabulary is pretty straightforward, it's just the variables that have been defined.

So that's kind of interesting. I haven't had a chance to really study the paper yet, but it's no chat bot, but maybe there is a room for memory networks already after all. Is there a way to visualize what the neural network has learned for the text? There is no neural network.

If you mean the embeddings, you can look at the embeddings easily enough. The whole thing is so simple, it's very easy to look at every embedding. As I mentioned, we looked at visualizing the weights that came out of the softmax. We don't even need to look at it in order to figure out what it looked like, based on the fact that this is just a small number of simple linear steps.

We know that it basically has to learn what each sentence's answer can be — you know, sentence number 3's answer will always be milk, or will always be hallway, or whatever. And that's what the C embeddings are going to have to be. And then the embeddings that produce the weights are going to have to basically learn how to come up with what's going to be probably a similar embedding to the query.

In fact, I think you can even make them the same embedding, so that these dot products basically give you something that gives you similarity scores. So this is really a very simple, largely linear model, so it doesn't require too much visualizing it. So having said all that, none of this is to say that memory networks are useless, right?

I mean, they're created by very smart people with an impressive pedigree in deep learning. This is very early, and this tends to happen in the popular press — they kind of get overexcited about things. Although in this case, I don't think we can blame the press; I think we have to blame Facebook for creating a ridiculous demo like this.

I mean, this was clearly created to give people the wrong idea, which I find very surprising from people like Yann LeCun, who normally do the opposite of that kind of thing. So this is not really the press's fault in this case. But this may well turn out to be a critical component in chatbots and Q&A systems and whatever else.

But we're not there yet. I had a good chat with Stephen Merity the other day, who's a researcher I respect a lot, and also somebody I like. I asked him what he thought was the most exciting research in this direction at the moment, and he mentioned something that I was also very excited about, which is called Recurrent Entity Networks.

And the Recurrent Entity Network paper is the first to solve all of the bAbI tasks with 100% accuracy. Now take of that what you will — I don't know how much that means; they're synthetic tasks. One of the things that Stephen Merity actually pointed out in a blog post is that even the basic coding of how they're created is pretty bad.

They have lots of replicas and the whole thing is a bit of a mess. But anyway, nonetheless this is an interesting approach. So if you're interested in memory networks, this is certainly something you can look at. And I do think this is likely to be an important direction. Having said all that, one of the key reasons I wanted to look at these memory networks is not only because it was the largest request from the forums for this part of the course, but also because it introduces something that's going to be critical for the next couple of lessons, which is the concept of attention.

Attentional models are models where we have to do exactly what we just looked at, which is basically find out at each time step which part of a story to look at next, or which part of an image to look at next, or which part of a sentence to look at next.

And so the task that we're going to be trying to get at over the next lesson or two is going to be to translate French into English. So this is clearly not a toy task. This is a very challenging task. And one of the challenges is that in a particular French sentence which has got some bunch of words, it's likely to turn into an English sentence with some different bunch of words.

And maybe these particular words here might be this translation here, and this one might be this one, and this one might be this one. And so as you go through, you need some way of saying which word do I look at next. So that's going to be the attentional model.

And so what we're going to do is we're going to be trying to come up with a proper RNN like an LSTM, or a GRU, or whatever, where we're going to change it so that inside the RNN it's going to actually have some way of figuring out which part of the input to look at next.

So that's the basic idea of attentional models. And so interestingly, during this time that memory networks and neural Turing machines and stuff were getting all this huge amount of press attention very quietly in the background at exactly the same time, attentional models were appearing as well. And it's the attentional models for language that have really turned out to be critical.

So you've probably seen all of the press about Google's new neural translation system, and that really is everything that it's claimed to be. It really is basically one giant neural network that can translate any pair of languages. The accuracy of those translations is far beyond anything that's happened before.

And the basic structure of that neural net, as we're going to learn, is not that different to what we've already learned. We're just going to have this one extra step, which is attention. And depending on how interested you guys are in the details of this neural translation system, it turns out that there are also lots of little tweaks.

The tweaks are kind of around like, OK, you've got a really big vocabulary, some of the words appear very rarely, how do you build a system that can understand how to translate those really rare words, for example, and also just kind of things like how do you deal with the memory issues around having huge embedding matrices of 160,000 words and stuff like that.

So there's lots of details, and the nice thing is that because Google has ended up putting this thing in production, all of these little details have answers now, and those answers are all really interesting. There aren't really on the whole great examples of all of those things put together.

So one of the things interesting here will be that you'll have opportunities to do that. Generally speaking, the blog posts about these neural translation systems tend to be kind of at a pretty high level. They describe roughly how these kind of approaches work, but Google's complete neural translation system is not out there, you can't download it and see the code.

So we'll see how we go, but we'll kind of do it piece by piece. I guess one other thing to mention about the memory network is that Keras actually comes with an end-to-end memory network example in the Keras GitHub, which, weirdly enough, when I actually looked at it, turns out not to implement this at all.

And so even on the single supporting fact thing, it takes many, many generations and doesn't get to 100% accuracy. And I found this quite surprising to discover that once you start getting to some of these more recent advances or not just a standard CNN or whatever, it's just less and less common that you actually find code that's correct and that works.

And so this memory network example was one of them. So if you actually go into the Keras GitHub and look at examples and go and have a look and download the memory network, you'll find that you don't get results anything like this. If you look at the code, you'll see that it really doesn't do this at all.

So I just wanted to mention that as a bit of a warning: you're kind of at the point now where you might want to take with a grain of salt the blog posts you read, or even some of the papers you read. It's well worth experimenting with them, and you should start with the assumption that you can do it better.

And maybe even start with the assumption that you can't necessarily trust all of the conclusions that you've read because the vast majority of the time, in my experience putting together this part of the course, the vast majority of the time, the stuff out there is just wrong. Even in cases like I deeply respect the Keras authors and the Keras source code, but even in that case this is wrong.

I think that's an important point to be aware of. I think we're done, so I think we're going to finish five minutes early for a change. I think that's never happened before. So thanks everybody, and so this week hopefully we can have a look at the Data Science Bowl, make a million dollars, create a new PyTorch Approximate Nearest Neighbors algorithm, and then when you're done, maybe figure out the next stage for memory networks.

Thanks everybody. (audience applauds)