Lesson 11: Cutting Edge Deep Learning for Coders
Chapters
0:0 Reminders
7:35 Linear Algebra Cheat Sheet
8:35 Zero-Shot Learning
9:55 Computer Vision
11:40 Activation Functions
13:32 Colour transformations
14:18 Batch norm
14:58 Is there any advantage
17:35 Removing Data
19:56 Noisy Labels
21:22 Accuracy vs Size
23:23 Design Patterns
24:20 Cyclical Learning Rates
27:20 Data Science Bowl
So this week, obviously quite a bit just to get set up to get results from this week in 00:00:14.840 |
terms of needing all of ImageNet and that kind of thing and getting all that working. 00:00:19.780 |
So I know that a lot of you are still working through that. 00:00:25.180 |
I did want to mention a couple of reminders just that I've noticed. 00:00:30.860 |
One is that in general, we have that thing on the wiki about how to use the notebooks, 00:00:37.500 |
and we really strongly advise that you don't open up the notebook we give you and click 00:00:45.120 |
shift-enter all the way through it. You're not really going to learn much from that. 00:00:50.340 |
It's like the first thing that's mentioned in the first paragraph of the home page of the wiki. 00:00:55.560 |
Basically the idea is try to start with a fresh notebook, think about what you think 00:01:00.360 |
you need to do first, try and do that thing if you have no idea, then you can go to the 00:01:04.680 |
existing notebook, take a peek, close it again, try and re-implement what you just saw. 00:01:10.760 |
As much as possible, really not just shift enter through the notebooks. 00:01:16.080 |
I know some of you are doing it because there are threads on the forum saying, "I was shift- 00:01:20.080 |
entering through the notebook and this thing didn't work." 00:01:23.240 |
And somebody is like, "Well, that's because that thing's not defined yet." 00:01:32.560 |
The other thing to remind you about is that the goal of part 2 is to get you to a point 00:01:38.960 |
where you can read papers, and the reason for that is because you kind of know the best 00:01:44.480 |
practices now, so anytime you want to do something beyond what we've learned, you're going to 00:01:49.640 |
be implementing things from papers or probably going beyond that and implementing new things. 00:01:56.880 |
Reading a new paper in an area that you haven't looked at before is, at least to me, somewhat terrifying. 00:02:04.640 |
On the other hand, reading a paper for the thing that we already studied last week hopefully 00:02:10.960 |
isn't terrifying at all because you already know what the paper says. 00:02:14.680 |
So I always have that in the assignments each week. 00:02:18.320 |
Read the paper for the thing you just learned about, and go back over it and please ask 00:02:22.560 |
on the forums if there's a bit of notation or anything that you don't understand, or 00:02:28.000 |
if there's something we heard in class that you can't see in the paper, or if it's particularly 00:02:31.960 |
interesting if you see something in the paper that you don't think we mentioned in class. 00:02:37.320 |
So that's the reason that I really encourage you to read the papers for the topics we studied 00:02:46.960 |
I think for those of you like me who don't have a technical academic background, it's 00:02:52.680 |
really a great way to familiarize yourself with notation. 00:02:58.120 |
And I'm really looking forward to some of you asking about notation on the forums, so please do. 00:03:05.320 |
There's a few key things that keep coming up in notation, like probability distributions. 00:03:12.720 |
So please feel free, and if you're watching this later in the MOOC, again, feel free to ask. 00:03:23.040 |
I was kind of interested in following up on some of last week's experiments myself. 00:03:28.800 |
And the thing that I think we all were a bit shocked about was putting this guy into the 00:03:34.920 |
DeViSE model and getting out more pictures of similar-looking fish in nets. 00:03:40.960 |
And I was kind of curious about how that was working and how well that was working, and 00:03:46.840 |
I then completely broke things by training it for a few more epochs. 00:03:52.000 |
And after doing that, I then did an image similarity search again and I got these three 00:04:03.720 |
And the other thing I mentioned is when I trained it where my starting point was what 00:04:10.080 |
we looked at in class, which was just before the final bottleneck layer. 00:04:14.600 |
I didn't get very good results from this thing, but when I trained it from the starting point 00:04:20.240 |
of just after the bottleneck layer, I got the good results that you saw. 00:04:27.360 |
And again, I don't know why that is, and I don't think this has been studied as far as I know. 00:04:33.360 |
But I'll show you something I did then do was I thought, well, that's interesting. 00:04:38.360 |
I think what's happened here is that when you train it for longer, it knows that the 00:04:46.160 |
And it seems to be now focusing on giving us the same kind of fish. 00:04:48.840 |
These are clearly the exact same type of fish, I guess. 00:04:55.100 |
So I started wondering how could we force it to combine. 00:04:58.160 |
So I tried the most obvious possible thing, I wanted to get more fish in nets. 00:05:04.720 |
And I typed word2vec of tench, that's a kind of fish, plus word2vec of net, divided by 2, to 00:05:11.760 |
get the average of the 2 word vectors, and give me the nearest neighbor. 00:05:18.480 |
And then just to prove it wasn't a fluke, I tried the same on tench plus rod, and there's the result. 00:05:24.520 |
Now do you know what's really freaky about this? 00:05:27.680 |
If you Google for ImageNet categories, you'll get a list of 1000 ImageNet categories. 00:05:32.880 |
If you search through them, neither net nor rod appear at all. 00:05:37.560 |
I can't begin to imagine why this works, but it does. 00:05:43.940 |
So this DeViSE model is clearly doing some pretty deep magic in terms of the understanding it has built. 00:05:52.560 |
Not only are we able to combine things like this, but we're able to combine it with categories it's never seen. 00:06:00.440 |
It's never seen a rod, we've never told it what a rod looks like, and the same goes for a net. 00:06:05.880 |
And I tried quite a few of these combinations and they just kept working. 00:06:09.040 |
Like another one I tried was, I understand why this works, which is I tried searching for boat. 00:06:14.880 |
Now boat doesn't appear in ImageNet, but there's lots of kinds of boats that appear in ImageNet. 00:06:20.160 |
So not surprisingly, it figures out, generally speaking, how to find boats. 00:06:26.480 |
And then I tried boat plus engine, and I got back pictures of powerboats, and then I tried 00:06:31.940 |
boat plus paddle, and I got back pictures of rowing boats. 00:06:36.640 |
So there's a whole lot going on here, and I think there's lots of opportunities for 00:06:40.160 |
you to explore and experiment based on the explorations and experiments that I've done. 00:06:46.000 |
And more to the point, perhaps to create some interesting and valuable tools. 00:06:53.880 |
I would have thought a tool to do an image search to say, show me all the images that 00:07:00.720 |
Or better still, maybe you could start training with things that aren't just nouns but also adjectives and verbs. 00:07:06.480 |
So you could start to search for pictures of crying babies or flaming houses or whatever. 00:07:18.360 |
I think there's all kinds of stuff you could do with this, which would be really interesting 00:07:20.880 |
whether it be in a narrow organizational setting or create some new startup or a new open source project. 00:07:37.840 |
I actually missed this lesson this week, but I was thrilled to see that one of our students 00:07:43.520 |
has written this fantastic Medium post, linear algebra cheat sheet. 00:07:47.640 |
I think I missed it because it was posted not to the part 2 forum, but maybe to a different one. 00:07:54.400 |
But this is really cool, Brendan has gone through and really explained all the stuff 00:08:01.040 |
that I would have wanted to have known about linear algebra before I got started, and particularly 00:08:06.000 |
I really appreciate that he's taking a code-first approach. 00:08:10.320 |
So how do you actually do this in NumPy and talking about broadcasting? 00:08:14.840 |
So you guys will all be very familiar with this already, but your friends who are wondering 00:08:19.280 |
how to get started in deep learning, what's the minimal things you need to know, it's 00:08:24.680 |
probably the chain rule and some linear algebra. 00:08:27.160 |
I think this covers a lot of the linear algebra you need pretty effectively. 00:08:39.160 |
Andrea Frome, who wrote that DeViSE paper, I actually emailed her and asked her what she would suggest reading next. 00:08:47.000 |
And she suggested this paper, "Zero-shot learning by convex combination of semantic embeddings," 00:08:52.680 |
which she's only a later author on, but she says it's kind of in some ways a more powerful approach. 00:09:02.840 |
It's actually quite different, and I haven't implemented it myself, but it solves some 00:09:08.680 |
similar problems, and anybody who's interested in exploring this multimodal images-and-text area should check it out. 00:09:16.200 |
And we'll put this on the lesson wiki of course. 00:09:19.920 |
And then one more involving the same author in a similar area a little bit later was looking 00:09:27.840 |
at attention for fine-grained categorization. 00:09:31.520 |
So a lot of these things, at least the way I think Andrea Frome was casting it, was 00:09:36.840 |
about fine-grained categorization, which is how do we build something that can find very 00:09:43.400 |
specific kinds of birds or very specific kinds of dogs. 00:09:46.720 |
I think these kinds of models have very, very wide applicability. 00:09:53.240 |
So I mentioned we'd kind of wrap up some final topics around computer vision stuff this week 00:10:06.840 |
before we started looking at some more NLP-related stuff. 00:10:11.520 |
One of the things I wanted to zip through was a paper which I think some of you might 00:10:15.440 |
enjoy, "Systematic Evaluation of CNN Advances on the ImageNet Data Set." 00:10:22.360 |
And I've pulled out what I thought were some of the key insights for some of these things 00:10:29.340 |
One key insight which is very much the kind of thing I appreciate is that they compared 00:10:36.040 |
what's the difference between the original CaffeNet/AlexNet vs. GoogleNet vs. VGGNet on 00:10:43.680 |
two different sized images training on the original 227 or 128. 00:10:49.800 |
And what this chart shows is that the relative difference between these different architectures 00:10:55.700 |
is almost exactly the same regardless of what size image you're looking at. 00:11:00.280 |
And this really reminds me of like in Part 1 when we looked at data augmentation and 00:11:04.440 |
we said hey, you couldn't understand which types of data augmentation to use and how 00:11:08.200 |
much on a small sample of the data rather than on the whole data set. 00:11:13.160 |
What this paper is saying is something similar, which is you can look at different architectures 00:11:17.920 |
on small sized images rather than full sized images. 00:11:22.360 |
And so they then used this insight to do all of their experiments using a smaller 128x128 00:11:28.840 |
ImageNet model, which they said was 10 times faster. 00:11:31.800 |
So I thought that was the kind of thing which not enough academic papers do, which is like 00:11:37.400 |
what are the hacky shortcuts we can get away with? 00:11:41.680 |
So they tried lots of different activation functions. 00:11:47.780 |
It does look like max_pooling is way better, so this is the gain compared to ReLU. 00:11:55.920 |
But this one actually has twice the complexity, so it doesn't quite say that. 00:12:02.560 |
What it really says is that something we haven't looked at, which is ELU, which as you can see 00:12:07.560 |
is very simple: if x is greater than or equal to 0, then y equals x; otherwise it's e^x - 1. 00:12:15.920 |
So ELU basically is just like ReLU, except it's smooth. 00:12:25.080 |
Whereas ReLU looks like that, ELU looks exactly the same here, then here it goes like that. 00:12:45.400 |
So that's one thing you might want to try using. 00:12:48.360 |
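(As a reference, here is a minimal NumPy sketch of ELU next to ReLU; the alpha parameter is the usual one from the ELU paper and is assumed to be 1 here.)

    import numpy as np

    def relu(x):
        # zero for negative inputs, identity for positive inputs
        return np.maximum(0, x)

    def elu(x, alpha=1.0):
        # identity for x >= 0, smooth exponential approach to -alpha below 0
        return np.where(x >= 0, x, alpha * (np.exp(x) - 1))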
Another thing they tried which was interesting was using ELU for the convolutional layers. 00:13:03.600 |
I guess nowadays we don't use fully connected layers very much, so maybe that's not as interesting. 00:13:10.240 |
Main interesting thing here I think is the ELU activation function. 00:13:13.600 |
Two percentage points is quite a big difference. 00:13:19.080 |
They looked at different learning rate annealing approaches. 00:13:23.840 |
You can use Keras to automatically do learning rate annealing, and what they showed is that 00:13:31.520 |
They tried something else, which was like what about different color transformations. 00:13:38.320 |
They found that amongst the normal approaches to thinking about color, RGB actually seems to work best. 00:13:43.480 |
But then they tried something I haven't seen before, which is they added two 1x1 convolutions at the start of the network. 00:13:51.720 |
So each of those 1x1 convolutions is basically doing some kind of linear combination with 00:13:59.680 |
the channels, with a nonlinearity then in between. 00:14:04.840 |
And they found that that actually gave them quite a big improvement, and that should be easy enough to try. 00:14:12.520 |
So there's another thing which I haven't seen written about elsewhere, but that's a trick worth knowing about. 00:14:22.280 |
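(A rough sketch of that learned colour-transformation idea in Keras, not the paper's exact configuration; the number of filters in the first 1x1 layer and the input size are illustrative assumptions.)

    from keras.models import Sequential
    from keras.layers import Conv2D

    model = Sequential()
    # two 1x1 convolutions that only mix the colour channels,
    # i.e. a small learned, nonlinear colour-space transformation
    model.add(Conv2D(10, (1, 1), activation='relu', input_shape=(224, 224, 3)))
    model.add(Conv2D(3, (1, 1), activation='relu'))
    # ...the usual convolutional architecture would follow from here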
So here is the impact of batch norm, positive or negative. 00:14:28.520 |
Actually adding batch norm to GoogleNet didn't help, it actually made it worse. 00:14:33.240 |
So it seems these are really complex, carefully tuned architectures, so you've got to be pretty 00:14:37.680 |
careful adding batch norm to them, whereas on a simple network it helps a lot. 00:14:43.040 |
And the amount it helps also depends on somewhat which activation function you use. 00:14:48.560 |
So batch norm, I think we kind of know that now, be careful when you use it. 00:14:54.440 |
Sometimes it's fantastically helpful, sometimes it's slightly unhelpful. 00:14:59.680 |
Question: is there any advantage in using fully connected layers at all? 00:15:08.680 |
Yeah, I think there is, although they're terribly out of fashion. 00:15:15.280 |
But I think for transfer learning, they still seem to be the best: the fully 00:15:24.800 |
connected layers are super fast to train, and you seem to get a lot of flexibility there. 00:15:31.160 |
So I don't think we know one way or another yet, but I do think that VGG still has a lot 00:15:38.040 |
to give us as the last carefully tuned architecture with fully connected layers, and 00:15:45.360 |
that really seems to be great for transfer learning. 00:15:49.200 |
And then there was a comment saying that ELU's advantage is not just that it's smooth, but 00:15:55.920 |
that it goes a little below zero, and that this turns out to really matter. 00:16:03.240 |
Anytime you hear me say something slightly stupid, please feel free to jump in, otherwise 00:16:15.240 |
So on the other hand, it does give you an improvement in accuracy if you remove the 00:16:22.120 |
final max pooling layer, replace all the fully connected layers with convolutional layers, 00:16:28.840 |
and stick an average pooling at the end, which is basically what this is doing. 00:16:33.140 |
So it does seem there's definitely an upside to fully convolutional networks in terms of 00:16:38.440 |
accuracy, but there may be a downside in terms of flexibility around transfer learning. 00:16:45.960 |
I thought this was an interesting picture I haven't quite seen before, let me explain 00:16:53.720 |
What this shows is these are different batch sizes along the bottom, and then we've got accuracy. 00:17:01.400 |
And what it's showing is with a learning rate of 0.01, this is what happens to accuracy. 00:17:13.080 |
On the other hand, if you use a learning rate of 0.01 times batch size over 256, it's pretty stable. 00:17:20.800 |
So what this suggests to me is that any time you change the batch size, this basically 00:17:24.560 |
is telling you to change the learning rate by a proportional amount, which I think a 00:17:29.360 |
lot of us have realized through experiments, but I don't think I've seen it explicitly written down before. 00:17:37.720 |
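(In code the heuristic is just a one-liner; 0.01 and 256 are the base values from the chart.)

    base_lr, base_batch_size = 0.01, 256

    def scaled_lr(batch_size):
        # scale the learning rate linearly with the batch size
        return base_lr * batch_size / base_batch_size

    print(scaled_lr(64))   # 0.0025
    print(scaled_lr(512))  # 0.02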
Something I think is very helpful to understand as well is that removing data has a nonlinear effect on accuracy. 00:17:46.200 |
So here's this green line here is what happens when you remove images. 00:17:49.660 |
So with ImageNet, down to about half the size of ImageNet, there isn't a huge impact on accuracy. 00:17:57.320 |
So maybe if you want to really speed things up, you could go 128x128 sized images and 00:18:02.080 |
use just 600,000 of them, or even maybe 400,000, but then beneath that it starts to plummet. 00:18:14.120 |
Another interesting insight, although I'm going to add something to this in a moment, 00:18:17.480 |
is that rather than removing images, if you instead flip the labels to make them incorrect, 00:18:25.040 |
that has a worse effect than not having the data at all. 00:18:31.120 |
But there are things we can do to try to improve things there, and specifically I want to bring 00:18:36.040 |
your attention to this paper, "Training Deep Neural Networks on Noisy Labels with Bootstrapping". 00:18:45.000 |
And what they show is a very simple approach, a very simple tweak you can add to any training 00:18:50.840 |
method which dramatically improves their ability to handle noisy labels. 00:18:57.920 |
This here is showing that if you add noise to MNIST, from 0.3 up to 0.5, so up to half the labels, 00:19:08.080 |
then with the baseline of doing nothing at all, it really collapses the accuracy. 00:19:16.520 |
But if you use their approach to bootstrapping, you can actually go up to nearly half the 00:19:22.600 |
images having their labels intentionally changed, and it still works nearly as well. 00:19:30.120 |
I think this is a really important paper to mention, and an area most of you will 00:19:33.800 |
find important and useful, because most real-world datasets have noise in them. 00:19:39.520 |
So maybe this is something you should consider adding to everything that you've trained, 00:19:44.960 |
whether it be Kaggle datasets, or your own datasets, or whatever, particularly because 00:19:51.200 |
you don't necessarily know how noisy the labels are. 00:19:57.000 |
"Noisy labels means incorrect" Yeah, noisy just means incorrect. 00:20:05.240 |
"But bootstrapping is some sort of technique that" Yeah, this is this particular paper 00:20:09.160 |
that's grabbed a particular technique which you can read during the week if you're interested. 00:20:14.560 |
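(For anyone who wants to try it, the "soft" bootstrapping idea from that paper amounts to blending the given label with the model's own prediction inside the cross-entropy. This is a minimal Keras-style sketch under that reading of the paper; beta is the blending weight and would need tuning.)

    import keras.backend as K

    def soft_bootstrap_loss(beta=0.95):
        # blend the (possibly noisy) target with the model's own prediction,
        # then apply ordinary categorical cross-entropy
        def loss(y_true, y_pred):
            y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
            blended = beta * y_true + (1.0 - beta) * y_pred
            return -K.sum(blended * K.log(y_pred), axis=-1)
        return loss

    # model.compile(optimizer='adam', loss=soft_bootstrap_loss(0.95))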
So interestingly, they find that if you take VGG and then add all of these things together 00:20:24.640 |
and do them all at once, you can actually get a pretty big performance hike. 00:20:29.600 |
It looks in fact like VGG becomes more accurate than GoogleNet if you make all these changes. 00:20:37.160 |
So that's an interesting point, although VGG is very, very slow and big. 00:20:46.760 |
There's lots of stuff that I noticed they didn't look at. 00:20:48.600 |
They didn't look at data augmentation, different approaches to zooming and cropping, adding 00:20:52.880 |
skip connections like in ResNet or DenseNet or Highway Networks, different initialization methods, 00:21:04.600 |
And to me, the most important is the impact on transfer learning. 00:21:08.600 |
So these to me are all open questions as far as I know, and so maybe one of you would like 00:21:14.540 |
to create the successor to this, more observations on training CNNs. 00:21:25.000 |
There's another interesting paper, although the main interesting thing about this paper 00:21:28.120 |
is this particular picture, so feel free to check it out, it's pretty short and simple. 00:21:34.080 |
This paper is looking at the accuracy versus the size and the speed of different networks. 00:21:45.720 |
So the size of a bubble is how big is the network, how many parameters does it have. 00:21:51.360 |
So you can see VGG 16 and VGG 19 are by far the biggest of any of these networks. 00:21:58.880 |
Interestingly the second biggest is the very old, basic AlexNet. 00:22:04.200 |
Interestingly newer networks tend to have a lot less parameters, which is a good sign. 00:22:08.360 |
Then on this axis we have basically how long does it take to train. 00:22:14.560 |
So again, VGG is big and slow, and without at least some tweaks, not terribly accurate. 00:22:24.920 |
So again, there's definitely reasons not to use VGG even if it seems easier for transfer 00:22:30.240 |
learning or we don't necessarily know how to do a great job of transfer learning on them yet. 00:22:37.680 |
But as you can see, the more recent ResNet and Inception-based approaches are significantly more accurate. 00:22:48.480 |
So this is why I was looking last week at trying to do transfer learning on top of ResNet 00:22:54.880 |
and there's really good reasons to want to do that. 00:23:03.400 |
These two papers really show us that academic papers are not always just some highly theoretical exercise. 00:23:12.000 |
From time to time people write these great analyses of best practices and everything. 00:23:26.320 |
One other paper to mention in this kind of broad collection of things that you might find 00:23:32.400 |
helpful is a paper by somebody named Leslie Smith, who is going to come up again in just a moment. 00:23:44.000 |
Leslie Smith does a lot of really great papers which I really like. 00:23:48.480 |
This particular paper came up with a list of 14 design patterns which seem to be generally applicable. 00:24:00.160 |
This is a great paper to read, it's a really easy read. 00:24:03.920 |
You guys won't have any trouble with it at all, I don't think. 00:24:07.160 |
But I looked through all these and I just thought these all make a lot of sense. 00:24:12.960 |
If you're doing something a bit different and a bit new and you have to design a new 00:24:15.640 |
architecture, this would be a great list of patterns to look through. 00:24:22.120 |
One more Leslie Smith paper to mention, and it's crazy that this is not more well known, 00:24:28.440 |
something incredibly simple, which is a different approach to learning rates. 00:24:32.920 |
Rather than just having your learning rate gradually decrease, I'm sure a lot of you 00:24:37.360 |
have noticed that sometimes if you suddenly increase the learning rate for a bit and then 00:24:42.200 |
suddenly decrease it again for a bit, it kind of goes into a better little area. 00:24:48.280 |
What this paper suggests doing is try actually continually increasing your learning rate 00:24:52.880 |
and then decreasing it, increasing it, decreasing it, increasing it, decreasing it, something like that. 00:25:01.400 |
And check out the impact: compared to non-cyclical approaches, it is way, way faster and at every point at least as good. 00:25:18.200 |
And this is something which you could easily add. 00:25:26.320 |
If you created the cyclical learning rate annealing class for Keras, many people would be very grateful. 00:25:33.660 |
Actually many people would have no idea what you're talking about, so you'd also have to 00:25:36.680 |
write the blog post to explain why it's good and show them this picture, and then they would get it. 00:25:41.640 |
I just wanted to quickly add that Keras has lots of callbacks that I actually play with 00:25:53.520 |
And if I was doing this in Keras, what I would do would be I would start with the existing 00:25:59.640 |
learning rate annealing code that's there and make small changes until it starts working. 00:26:06.320 |
There's already code that does pretty much everything you want. 00:26:12.040 |
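(A sketch of what such a callback might look like, using a simple triangular cycle; the bounds and step size are placeholder values, not recommendations, and this is not an existing Keras class.)

    import numpy as np
    import keras.backend as K
    from keras.callbacks import Callback

    class CyclicalLR(Callback):
        # triangular cyclical learning rate: ramp up, then back down, each cycle
        def __init__(self, min_lr=1e-4, max_lr=1e-2, step_size=2000):
            super(CyclicalLR, self).__init__()
            self.min_lr, self.max_lr, self.step_size = min_lr, max_lr, step_size
            self.iteration = 0

        def on_batch_begin(self, batch, logs=None):
            cycle = np.floor(1 + self.iteration / (2 * self.step_size))
            x = np.abs(self.iteration / self.step_size - 2 * cycle + 1)
            lr = self.min_lr + (self.max_lr - self.min_lr) * max(0.0, 1 - x)
            K.set_value(self.model.optimizer.lr, lr)
            self.iteration += 1

    # model.fit(X, y, callbacks=[CyclicalLR(1e-4, 1e-2, step_size=2000)])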
The other cool thing about this paper is that they suggest a fairly automated approach to 00:26:18.520 |
picking what the minimum and maximum bounds should be. 00:26:22.560 |
And again, this idea of roughly what should our learning rate be is something which we have to pick every time we train a model. 00:26:29.000 |
So check out this paper for a suggestion about how to do it somewhat automatically. 00:26:37.040 |
So there's a whole bunch of things that I've zipped over. 00:26:40.240 |
Normally I would have dug into each of those and explained it and shown examples in notebooks 00:26:45.420 |
So you guys hopefully now have enough knowledge to take this information and play with it. 00:26:54.280 |
And what I'm hoping is that different people will play with different parts and come back 00:26:57.200 |
and tell us what you find and hopefully we'll get some good new contributions to Keras or 00:27:03.640 |
PyTorch or some blog posts or some papers or so forth, or maybe something interesting with that DeViSE stuff. 00:27:13.640 |
So the next thing I wanted to look at, again somewhat briefly, is the data science bowl. 00:27:25.960 |
And the reason I particularly wanted to dig into the data science bowl is there's a couple 00:27:31.800 |
One of them, well, there are a million reasons: it's a million dollar prize, and there are 23 days left. 00:27:40.480 |
The second is, it's an extension of everything that you guys have learned so far about computer vision. 00:27:46.720 |
It uses all the techniques you've learned, but then some. 00:27:50.680 |
So rather than 2D images, they're going to be 3D volumes. 00:27:56.400 |
Rather than being 300x300 or 500x500, they're going to be 512x512x200, so a couple of hundred 00:28:05.960 |
times bigger than stuff you've dealt with before. 00:28:09.800 |
The stuff we learned in lesson 7 about where are the fish, you're going to be needing to use that kind of approach here as well. 00:28:17.320 |
I think it's a really interesting problem to solve. 00:28:20.280 |
And then I personally care a lot about this because my previous startup, Enlitic, was the 00:28:25.800 |
first organization to use deep learning to tackle this exact problem, which is trying to find lung cancer early. 00:28:36.760 |
The reason I made that Enlitic's first problem was mainly because I learned that if you can 00:28:42.760 |
find lung cancer earlier, the probability of survival is 10 times higher. 00:28:48.040 |
So here is something where you can have a real impact by doing this well, which is not 00:28:55.880 |
to say that a million dollars isn't a big impact as well. 00:29:00.120 |
So let me tell you a little bit about this problem. 00:29:30.460 |
The data comes as DICOM files; DICOM is a format which contains two main things. 00:29:35.620 |
One is a stack of images and another is some metadata. 00:29:40.620 |
Metadata will be things like how much radiation the patient was exposed to, how far away from the chest 00:29:45.580 |
the machine was, and what brand of machine it was, and so on and so forth. 00:29:53.300 |
Most DICOM viewers just use your scroll wheel to zip through them, so all this is doing 00:29:58.540 |
is going from top to bottom or from bottom to top, so you can kind of see what's going on. 00:30:07.540 |
What I might do, I think is more interesting, is to focus on the bit that's going to matter 00:30:23.380 |
to you, which is the inside of the lung is this dark area here, and these little white 00:30:30.340 |
dots are what's called the vasculature, so the little vessels and stuff going through the lung. 00:30:35.860 |
And as I scroll through, have a look at this little dot. 00:30:38.820 |
You'll see that it seems to move, see how it's moving. 00:30:43.420 |
The reason it's moving is because it's not a dot, it's actually a vessel going through the volume. 00:30:55.820 |
And so if you take a slice through that, it looks like lots of dots. 00:31:01.620 |
And so as you go through those slices, it looks like that. 00:31:08.180 |
And then eventually we get to the top of the lung, and that's why you see eventually the 00:31:13.460 |
whole thing goes to white, so that's the edge basically of the organ. 00:31:19.100 |
So you can see there are edges on each side, and then there's also bone. 00:31:23.900 |
So some of you have been looking at this already over the last few weeks and have often asked 00:31:29.620 |
me about how to deal with multiple images, and what I've said each time is don't think of them as separate images. 00:31:39.460 |
Think of it in the way your DICOM viewer can if you have a 3D button like this one does. 00:31:47.620 |
That's actually what we were just looking at. 00:31:51.320 |
So it's not a bunch of flat images, it's a 3D volume. 00:31:56.780 |
It just so happens that the default way that most DICOM viewers show things is by a bunch of 2D slices. 00:32:04.820 |
But it's really important that you think of it as a 3D volume, because you're looking for something in that 3D space. 00:32:14.940 |
What you're looking for is you're looking for somebody who has lung cancer. 00:32:19.700 |
And what somebody who has lung cancer looks like is that somewhere in this space there 00:32:23.700 |
is a blob, it could be roughly a spherical blob, it could be pretty small, around 5 millimeters 00:32:33.220 |
is where people start to get particularly concerned about a blob. 00:32:37.820 |
And so what that means is that for a radiologist, as they flick through a scan like this, is 00:32:43.420 |
that they're looking for a dot which doesn't move, but which appears, gets bigger and then shrinks and disappears again. 00:32:54.860 |
So you can see why radiologists very, very, very often miss nodules in lungs. 00:33:03.700 |
Because in all this area, you've got to have extraordinary vision to be able to see every one of those little blobs. 00:33:11.940 |
And remember, the sooner you catch it, you get a 10x improved chance of survival. 00:33:20.300 |
And generally speaking, when a radiologist looks at one of these scans, they're not looking 00:33:26.380 |
for nodules, they're looking for something else. 00:33:29.220 |
Because lung cancer, at least in the earlier stages, is asymptomatic, it doesn't cause any symptoms. 00:33:36.580 |
So it's like something that every radiologist has to be thinking about when they're looking at a scan. 00:33:43.820 |
So that's the basic idea is that we're going to try and come up with in the next half hour 00:33:49.220 |
or so some idea about how would you find these blobs, how would you find these nodules. 00:33:55.660 |
So each of these things generally is about 512x512 by a couple of hundred. 00:34:08.260 |
And the equivalent of a pixel in 3D space is called a voxel. 00:34:25.920 |
Each voxel in a CT scan is a 12-bit integer, if memory serves me correctly. 00:34:35.700 |
And a computer screen can only show 8 bits of grayscale, and furthermore your eyes can't 00:34:42.660 |
necessarily distinguish between all those grayscale perfectly anyway. 00:34:46.780 |
So what every DICOM viewer provides is something called a windowing adjustment. 00:34:53.820 |
So a windowing adjustment, here is the default window, which is designed to basically map 00:35:01.880 |
some subset of that 12-bit space to the screen so that it highlights certain things. 00:35:08.780 |
And so the units CT scans use are called Hounsfield units, and certain ranges of Hounsfield units 00:35:19.380 |
tell you that something is some particular part of the body. 00:35:23.580 |
And so you can see here that the bone is being lit up. 00:35:26.860 |
So we've selected an image window which is designed to allow us to see the bone clearly. 00:35:33.020 |
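(For intuition, a window is usually described by a centre, or level, and a width, and applying it is just clipping and rescaling. A rough sketch; the level and width here are typical lung-window values, not anything specific to this dataset.)

    import numpy as np

    def apply_window(hu_image, level=-600, width=1500):
        # clip Hounsfield units to [level - width/2, level + width/2],
        # then rescale to 0-255 for display
        lo, hi = level - width / 2, level + width / 2
        clipped = np.clip(hu_image, lo, hi)
        return ((clipped - lo) / (hi - lo) * 255).astype(np.uint8)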
So what I did when I opened this was I switched it to the CT chest preset, where some kind person 00:35:41.040 |
has already figured out the best window to use for looking at lungs. 00:36:02.460 |
Now for you working with deep learning, you don't have to care about that, because of 00:36:07.860 |
course the deep learning algorithm can see 12 bits perfectly well. 00:36:17.020 |
So one of the challenges with dealing with this Data Science Bowl data is that there's 00:36:24.700 |
a lot of preprocessing to do, but the good news is that there's a couple of fantastic kernels. 00:36:34.540 |
So hopefully you've found out by now that on Kaggle, if you click on the kernels button, 00:36:40.260 |
you basically get to see people's IPython notebooks where they show you how to do certain things. 00:36:47.120 |
In this case, this guy has got a full preprocessing tutorial showing how to load DICOM, convert 00:36:54.320 |
the values to Hounsfield units, and so forth. 00:37:01.100 |
So DICOM you will load with some library, probably with PyDICOM. 00:37:09.500 |
So pydicom is a library that's a bit like Pillow, or PIL: where PIL has Image.open, this is more 00:37:09.500 |
like opening a DICOM file instead, and you end up with the 3D data, and of course the metadata. 00:37:15.660 |
You can see here it's using the metadata, like the image position and slice location. 00:37:24.300 |
So the metadata comes through as just attributes of the Python object. 00:37:31.180 |
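(A minimal sketch of what loading one scan tends to look like; older pydicom releases import as dicom with read_file, newer ones as pydicom with dcmread, and the folder path here is hypothetical.)

    import os
    import numpy as np
    import dicom  # the pydicom package, as it was imported at the time

    def load_scan(path):
        # read every slice in the folder and sort by position along the z-axis
        slices = [dicom.read_file(os.path.join(path, f)) for f in os.listdir(path)]
        slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))
        return slices

    slices = load_scan('input/sample_images/some_patient_id')   # hypothetical path
    volume = np.stack([s.pixel_array for s in slices])          # e.g. (n_slices, 512, 512)
    print(slices[0].SliceThickness, slices[0].PixelSpacing)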
This person has very kindly provided a list of the Hounsfield unit ranges for each of the tissue types. 00:37:52.380 |
So he shows how to translate stuff into that range. 00:38:02.580 |
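(The conversion itself is just a linear rescale using two tags stored in each slice; this is a simplified sketch of the same idea, reusing the slices loaded above and skipping the kernel's handling of pixels outside the scanner's field of view.)

    import numpy as np

    def to_hounsfield(slices):
        # stack the raw pixel arrays into a 3D volume
        image = np.stack([s.pixel_array for s in slices]).astype(np.float32)
        for i, s in enumerate(slices):
            # HU = raw_value * RescaleSlope + RescaleIntercept (both stored per slice)
            image[i] = image[i] * float(s.RescaleSlope) + float(s.RescaleIntercept)
        return image.astype(np.int16)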
So here is a histogram for this particular picture. 00:38:06.260 |
So you can see that most of it is air, and then you get some bone and some lung as the 00:38:20.580 |
So then the next thing to think about is really voxel spacing, which is as you move across 00:38:30.540 |
one bit of x-axis or one bit of y-axis or from slice to slice, how far in the real world have you moved. 00:38:39.820 |
And one of the annoying things about medical imaging is that different kinds of scanners 00:38:43.980 |
have different distances between those slices, called the slice thickness, and different pixel spacings. 00:38:54.060 |
Luckily that stuff is all in the DICOM metadata. 00:38:56.560 |
So the resampling process means taking those lists of slices and turning it into something 00:39:04.620 |
where every step in the x-direction or the y-direction or the z-direction equals 1mm in the real world. 00:39:12.380 |
And so it would be very annoying for your deep learning network if your different lung 00:39:16.780 |
images were squished by different amounts, especially if you didn't give it the metadata to tell it so. 00:39:25.620 |
So that's what resampling does, and as you can see it's using the slice thickness and 00:39:29.740 |
the pixel spacing to make everything nice and even. 00:39:35.900 |
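(A sketch of that resampling step using scipy, following the same idea as the kernel; the target of 1mm per voxel in each direction is the convention described above, and it reuses the slices loaded earlier.)

    import numpy as np
    import scipy.ndimage

    def resample_to_1mm(volume, slices):
        # current real-world spacing: slice thickness in z, pixel spacing in y and x
        spacing = np.array([float(slices[0].SliceThickness)] +
                           [float(v) for v in slices[0].PixelSpacing])
        zoom_factors = spacing / 1.0  # target spacing of 1mm per voxel
        return scipy.ndimage.zoom(volume, zoom_factors, order=1)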
So there are various ways to do 3D plots, and it's always a good idea to do that. 00:39:45.760 |
And then something else that people tend to do is segmentation. 00:39:51.300 |
Depending on time, you may or may not get around to looking more at segmentation in 00:39:55.100 |
this part of the course, but effectively segmentation is just another generative model. 00:39:59.340 |
It's a generative model where hopefully somebody has given you some things saying this is lung, 00:40:05.420 |
this is air, and then you build a model that tries to predict for something else what's what. 00:40:13.300 |
Unfortunately for lung CT scans, we don't generally have the ground truth of which bit is which. 00:40:25.980 |
So generally speaking, in medical imaging, people use a whole lot of heuristic approaches, 00:40:31.140 |
so kind of hacky rule-based approaches, and in particular applications of region-growing algorithms. 00:40:41.820 |
I find this kind of the boring part of medical imaging because it's so clearly a dumb way 00:40:46.540 |
to do things, but deep learning is far too new in this area yet to have developed the standard approaches. 00:40:57.620 |
But the good news is that there's a button which I don't think many people notice exists 00:41:03.940 |
called 'Tutorial' on the main Data Science Bowl page where these folks from Booz Allen Hamilton 00:41:11.220 |
actually show you a complete segmentation approach. 00:41:14.540 |
Now it's interesting that they picked U-Net segmentation. 00:41:19.180 |
This is definitely the thing about segmentation I would be teaching you guys if we have time. 00:41:25.660 |
U-Net is one of these things that outside of the Kaggle world, I don't think that many 00:41:28.780 |
people are familiar with, but inside the Kaggle world we know that any time segmentation crops up, U-Net tends to win. 00:41:39.140 |
More recently there's actually been something called 'DenseNet' for segmentation which takes 00:41:45.500 |
U-Net even a little bit further, and maybe that would be the new winner for newer Kaggle competitions. 00:41:52.420 |
But the basic idea here of things like U-Net and DenseNet is that we have a model where 00:42:09.980 |
when we do generative models, when I think about doing style transfer, we generally start 00:42:16.020 |
with this kind of large image and then we do some downsampling operations to make it 00:42:20.780 |
a smaller image and then we do some computation and then we make it bigger again with these upsampling operations. 00:42:32.940 |
What happens in Unet is that there are additional neural network connections made directly from 00:42:41.820 |
here to here, and directly from here to here, and here to here, and here to here. 00:42:52.140 |
Those connections basically allow it to almost do like a kind of residual learning approach, 00:43:00.340 |
like it can figure out the key semantic pieces at really low resolution, but then as it upscales 00:43:07.660 |
it can learn what was special about the difference between the downsampled image and the original image. 00:43:13.860 |
It can kind of learn to add that additional detail at each point. 00:43:20.020 |
So U-Net and DenseNet for segmentation are really interesting and I hope we find some 00:43:34.380 |
time to get back to them in this part of the course, but if we don't, you can get started 00:43:40.980 |
by looking at this tutorial in which these folks basically show you from scratch. 00:43:47.620 |
What they try to do in this tutorial is something very specific, which is the detection part. 00:43:54.540 |
So what happens in this kind of, like think about the fisheries competition. 00:44:02.220 |
We pretty much decided that in the fisheries competition, if you wanted to do really well, 00:44:05.980 |
you would first of all find the fish and then you would zoom into the fish and then you would classify it. 00:44:12.860 |
Certainly in the right whale competition earlier, that was how it was won. 00:44:17.420 |
For this competition, this is even more clearly going to be the approach, because these images 00:44:21.420 |
are just far too big to do a normal convolutional neural network. 00:44:24.620 |
So we need one step that's going to find the nodule, and then a second step that's going 00:44:29.100 |
to zoom into a possible nodule and figure out is this a malignant tumor or something harmless. 00:44:42.260 |
The bad news is that the Data Science Bowl data set does not give you any information about where the nodules are. 00:44:55.940 |
I actually wrote a post in the Kaggle forums about this; I just think this is a real shame. 00:45:03.740 |
That information actually exists, the dataset they got this from is something called the 00:45:07.900 |
National Lung Screening Trial, which actually has that information or something pretty close to it. 00:45:12.860 |
So the fact they didn't provide it, I just think it's horrible for a competition which matters this much. 00:45:23.260 |
The good news though is that there is a data set which does have this information. 00:45:29.740 |
The original data set was called LIDC-IDRI, but interestingly that data set was recently 00:45:37.820 |
used for another competition, a non-Kaggle competition called LUNA. 00:45:46.140 |
And one of the tracks in that competition was actually specifically a false positive 00:45:50.780 |
detection track, and then the other track was a find the nodule track basically. 00:45:57.380 |
So you can actually go back and look at the papers written by the winners. 00:46:04.540 |
Many of them are a single sentence saying that due to a commercial confidentiality agreement they can't say. 00:46:10.420 |
But some of them, including the winner of the false positive track, actually provide the details of their approach. 00:46:21.100 |
And so what you could do, in fact I think what you have to do to do well in this competition 00:46:24.700 |
is download the Luna data set, use that to build a nodule detection algorithm. 00:46:30.660 |
So the Luna data set includes files saying this lung has nodules here, here, here, here. 00:46:37.660 |
So do nodule detection based on that, and then run that nodule detection algorithm on 00:46:43.300 |
the Kaggle data set, find the nodules, and then use that to do some classification. 00:46:55.380 |
The biggest tricky thing is that most of the CT scans in the Luna data set are what's called contrast scans. 00:47:07.700 |
A contrast scan means that the patient had a radioactive dye injected into them, so that 00:47:15.300 |
the things that they're looking for are easier to see. 00:47:21.540 |
For the National Lung Screening Trial, which is what they use in the Kaggle data set, none of them are. 00:47:27.420 |
And the reason why is that what we really want to be able to do is to take anybody who's 00:47:31.860 |
over 65 and has been smoking more than a pack a day for more than 20 years and give them 00:47:36.540 |
all a CT scan and find out which ones have cancer, but in the process we don't want to 00:47:40.940 |
be shooting them up with radioactive dye and giving them cancer. 00:47:44.940 |
So that's why we try to make sure that when we're doing these kind of asymptomatic scans 00:47:52.580 |
that they're as low radiation dose as possible. 00:47:57.100 |
So that means that you're going to have to think about transfer learning issues, that 00:48:02.820 |
the contrast in your image is going to be different between the thing you build on the 00:48:07.300 |
Luna data set, the nodule detection, and the Kaggle competition data set. 00:48:18.220 |
When I looked at it, I didn't find that that was a terribly difficult problem. 00:48:22.300 |
I'm sure you won't find it impossible by any means. 00:48:32.620 |
So to finalize this discussion, I wanted to refer to this paper, which I'm guessing not many of you have seen. 00:48:46.160 |
And what it is, is a non-deep learning approach to trying to find nodules. 00:48:52.980 |
So that's where they use nodule segmentation. 00:49:03.940 |
I have a correction from our radiologist saying that dye is not radioactive. 00:49:13.440 |
Okay, but there's a reason we don't inject people with a contrast dye. 00:49:17.580 |
The issues are things like contrast-induced nephropathy or urticaria. 00:49:28.020 |
I do know though that the NLST studies use a lower radiation dose than I think 00:49:40.300 |
the Luna ones do, so that's another difference. 00:49:46.300 |
So this is an interesting idea of how can you find nodules using more of a heuristic approach. 00:49:57.060 |
And the heuristic approach they suggest here is to do clustering, and we haven't really 00:50:03.340 |
done any clustering in class yet, so we're going to dig into this in some detail. 00:50:08.060 |
Because I think this is a great idea for the kind of heuristics you can add on top of deep 00:50:12.380 |
learning to make deep learning work in different areas. 00:50:16.180 |
The basic idea here is, as you can see, what they call a five-dimensional mean shift. 00:50:23.100 |
They're going to try and find groups of voxels which are similar, and they're going to cluster them together. 00:50:29.460 |
And hopefully, in particular, it will cluster together things that look like nodules. 00:50:34.300 |
So the idea is at the end of this segmentation there will be one cluster for the whole lung 00:50:40.540 |
boundary, one cluster for the whole vasculature, and then one cluster for every nodule. 00:50:45.860 |
So the five dimensions are x, y and z, which is straightforward, intensity, so the number 00:50:52.460 |
of Hounsfield units, and then the fifth one is volumetric shape index, and this is the interesting one. 00:51:01.380 |
The basic idea here is that it's going to be a combination of the different curvatures 00:51:05.820 |
of a voxel based on the Gaussian and mean curvatures. 00:51:11.540 |
Now what the paper goes on to explain is that you can use for these the first and second derivatives of the image. 00:51:20.660 |
Now all that basically means is you subtract one voxel from its neighbor, and then you 00:51:26.780 |
take that whole thing and subtract one voxel's version of that from its neighbor. 00:51:30.700 |
You get the first and second derivatives, so it kind of tells you the direction of the slope and how quickly it's changing. 00:51:44.180 |
So by getting these first and second derivatives of the image and then you put it into this 00:51:47.940 |
formula, it comes out with something which basically tells you how sphere-like a construct 00:51:56.900 |
this voxel seems to be part of. 00:52:03.380 |
If we can basically take all the voxels and combine the ones that are nearby, have a similar 00:52:09.260 |
number of Hounsfield units and seem to be of similar kinds of shapes, we're going to end up with the clusters we want. 00:52:16.380 |
So I'm not going to worry about this bit here because it's very specific to medical imaging. 00:52:21.380 |
Anybody who's interested in doing this, feel free to talk on the forum about what this is doing. 00:52:30.020 |
But what I did want to talk about was the meanshift clustering, which is a particular 00:52:35.300 |
approach to clustering which they talk about. 00:52:37.780 |
"Clustering" is something which for a long time I've been kind of an anti-fan of. 00:53:07.440 |
It belongs to this group of unsupervised learning algorithms which always seem to be kind of a solution looking for a problem. 00:53:15.980 |
But I've realized recently there are some specific problems that can be solved well with clustering. 00:53:20.860 |
I'm going to be showing you a couple, one today and one in Lesson 14. 00:53:28.340 |
Clustering algorithms are perhaps easiest to describe by showing what they do on some generated data. 00:53:36.100 |
I'm going to create 6 clusters, and for each cluster I'll create 250 samples. 00:53:42.820 |
So I'm going to basically say let's create a bunch of centroids by creating some random numbers. 00:53:49.300 |
So 6 pairs of random numbers for my centroids, and then I'll grab a bunch of random numbers 00:54:00.100 |
around each of those centroids and combine them all together and then plot them. 00:54:05.860 |
And so here you can see each of these X's represents a centroid, so a centroid is just 00:54:11.680 |
like the average point for a cluster of data. 00:54:20.740 |
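(Roughly what that data generation looks like; the numbers match what's described above, 6 clusters of 250 points each, and the spread of 5 around each centroid is an arbitrary choice for illustration.)

    import numpy as np
    import matplotlib.pyplot as plt

    n_clusters, n_samples = 6, 250

    # random 2D centroids, then Gaussian samples scattered around each one
    centroids = np.random.uniform(-35, 35, (n_clusters, 2))
    data = np.concatenate([np.random.normal(c, 5, (n_samples, 2)) for c in centroids])

    plt.scatter(data[:, 0], data[:, 1], s=3)
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', c='black', s=80)
    plt.show()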
So imagine if this was showing you clusterings of different kinds of lung tissue, ideally 00:54:32.420 |
you'd have some voxels that were colored one thing for nodule, and a bunch of different colors for the other tissue types. 00:54:43.540 |
We can only show this easily in 2 dimensions, but there's no reason to not be able to imagine the same thing in 5 dimensions. 00:54:52.100 |
So the goal of clustering will be to undo this. 00:54:57.220 |
Given the data, but not the X's, how can you figure out where the X's were? 00:55:04.380 |
And then it's pretty straightforward once you know where the X's are to then find the 00:55:08.260 |
closest points to that to assign every data point to a cluster. 00:55:15.700 |
The most popular approach to clustering is called K-means. 00:55:24.020 |
K-means is an approach where you have to decide up front how many clusters are there. 00:55:32.140 |
And what it basically does is there's two steps. 00:55:36.620 |
The first one is to guess as to where those clusters might be. 00:55:42.020 |
And the really simple way to do that is just to randomly pick a point, and then start 00:55:53.860 |
randomly picking points which are as far away as possible from all the previous ones I've picked. 00:56:03.860 |
So if I started here, then probably the furthest away point would be down here. 00:56:08.500 |
So this would be our starting point for cluster 1, and say what point is the furthest away 00:56:14.540 |
That's probably this one here, so we have a starting point for cluster 2. 00:56:18.460 |
What's the furthest point away from both of these? 00:56:20.460 |
Probably this one over here, and so forth, so you keep doing that to get your initial centroids. 00:56:26.940 |
And then you just iterate: you basically say, let's assume these are the clusters, 00:56:34.700 |
figure out which cluster every point belongs to, and then move each cluster centre to the 00:56:39.980 |
mean of its points, and repeat that a bunch of times. 00:56:46.260 |
Now K means, it's a shame it's so popular because it kind of sucks, right? 00:56:54.200 |
Sucky thing number 1 is that you have to decide how many clusters there are, and the whole 00:56:59.900 |
point is we don't know how many nodules there are. 00:57:04.140 |
And then sucky thing number 2 is without some changes, to do something called kernel K means, 00:57:09.260 |
it only works when the clusters are all roughly the same kind of shape, all nicely Gaussian shaped. 00:57:14.420 |
So we're going to talk about something way cooler, which I only came across somewhat recently, much 00:57:22.320 |
less well known, which is called mean-shift clustering. 00:57:25.900 |
Now mean-shift clustering is one of these things which seems to spend all of its time buried in academic papers. 00:57:43.580 |
Whenever I tried to look up something about mean-shift clustering, I kind of just kept finding dense PDFs. 00:57:50.540 |
This is like the first tutorial not in a PDF that I could find. 00:57:55.780 |
So this is one way to think about mean-shift clustering; another way is a code-first approach, which is what I'll do here. 00:58:11.820 |
At a high level, we're going to do a bunch of loops. 00:58:19.580 |
It would be better if I didn't do a fixed 5 steps, but instead kept doing this until it was stable, but this will do. 00:58:26.460 |
And each step I'm going to go through, so our data is X, and I'm going to go through and enumerate the data points. 00:58:33.540 |
So small x is the current data point I'm looking at. 00:58:42.260 |
Now what I want to do is find out how far away is this data point from every other data 00:58:47.820 |
So I'm going to create a vector of distances. 00:58:50.900 |
And I'm going to do that with the magic of broadcasting. 00:58:55.020 |
So small x is a vector of size 2, this is 2 coordinates, and big X is a matrix of size N by 2, all of the data points. 00:59:08.940 |
And thanks to what we've now learned about broadcasting, we know that we can subtract 00:59:12.920 |
a matrix from a vector, and that vector will be broadcast across the axis of the matrix. 00:59:19.420 |
And so this is going to subtract every element of big X from little x. 00:59:25.300 |
And so if we then go ahead and square that, and then sum it up, and then take the square 00:59:31.780 |
root, this is going to return a vector of distances of small x to every element of big X. 00:59:44.460 |
And the sum here is just summing up the two coordinates. 00:59:52.260 |
So we now know for this particular data point, how far away is it from all of the other data points. 00:59:58.220 |
Now the next thing we want to do is to -- let's go to the final step. 01:00:04.340 |
The final step will be to take a weighted average. 01:00:08.740 |
In the final step, we're going to say what cluster do you belong to. 01:00:19.220 |
So we've got a whole bunch of data points, and we're currently looking at this one. 01:00:34.260 |
What we've done is we've now got a list of how far it is away from all of the other data points. 01:00:43.820 |
And the basic idea is now what we want to do is take the weighted average of all of 01:00:49.420 |
those data points, weighted by the inverse of that distance. 01:00:53.940 |
So the things that are a long way away, we want to weight very small. 01:00:58.580 |
And the things that are very close, we want to weight very big. 01:01:02.820 |
So I think this is probably the closest, and this is about the second-closest, and this one is further away. 01:01:11.420 |
So assuming these have most of the weight, the average is going to be somewhere about here. 01:01:18.740 |
And so by doing that at every point, we're going to move every point closer to where 01:01:25.540 |
its friends are, closer to where the nearby things are. 01:01:29.180 |
And so if we keep doing this again and again, everything is going to move until it's right on top of its cluster. 01:01:36.340 |
So how do we take something which initially is a distance and make it so that the larger 01:01:47.300 |
the distance, the smaller the weight? And the answer is we probably want a shape something like that. 01:02:00.220 |
This is by no means the only shape you could choose. 01:02:02.540 |
It would be equally valid to choose this shape, which is a triangle, at least half of one. 01:02:17.940 |
In general though, note that if we're going to multiply every point by one of these things 01:02:25.620 |
and add them all together, it would be nice if all of our weights added to 1, because 01:02:31.500 |
then we're going to end up with something that's of the same scale that we start with. 01:02:35.940 |
So when you create one of these curves where it all adds up to 1, generally speaking we call it a kernel. 01:02:49.460 |
And I mention this because you will see kernels everywhere. 01:02:53.860 |
If you haven't already, now that you've seen it, you'll see them everywhere. 01:02:58.100 |
In fact, kernel methods is a whole area of machine learning that in the late 90s basically 01:03:04.740 |
took over because it was so theoretically pure. 01:03:08.980 |
And if you want to get published in conference proceedings, it's much more important to be theoretically pure. 01:03:16.460 |
So for a long time, kernel methods won out, and neural networks in particular disappeared. 01:03:23.980 |
Eventually people realized that accuracy was important as well, and in more recent times neural networks have come back. 01:03:30.940 |
But you still see the idea of a kernel coming up very often, because they're super useful 01:03:38.660 |
They're basically something that lets you take a number, like in this case a distance, 01:03:43.100 |
and turn it into some other number where you can weight everything by that other number 01:03:48.660 |
and add them together to get a nice little weighted average. 01:03:52.540 |
So in our case, we're going to use a Gaussian kernel. 01:03:57.500 |
The particular formula for a Gaussian doesn't matter. 01:04:01.420 |
I remember learning this formula in grade 10, and it was by far the most terrifying 01:04:05.840 |
mathematical formula I've ever seen, but it doesn't really matter. 01:04:09.540 |
For those of you that remember or have seen the Gaussian formula, you'll recognize it. 01:04:14.580 |
For those of you that haven't, it doesn't matter. 01:04:17.320 |
But this is the function that draws that curve. 01:04:23.460 |
So if we take every one of our distances and put it through the Gaussian, we will then have a weight for every data point. 01:04:35.260 |
So then in the final step, we can multiply every one of our data points by that weight, 01:04:44.380 |
add them up, and divide by the sum of the weights. 01:04:50.060 |
You'll notice that I had to be a bit careful about broadcasting here, because I needed 01:04:56.860 |
to add a unit axis at the end of my dimensions, not at the start, so by default it adds unit 01:05:07.300 |
axes to the beginning when you do broadcasting. 01:05:13.220 |
If you're not clear on why this is, then that's a sign you definitely need to do some more experimenting with broadcasting. 01:05:24.180 |
You're free to ask if you're not clear after you've experimented. 01:05:29.000 |
So this is just doing sum of weights times x divided by sum of weights. 01:05:42.900 |
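(Putting those pieces together, a compact NumPy sketch of the whole mean shift update as just described; bw is the bandwidth discussed next, and 2.5 is an arbitrary starting value.)

    import numpy as np

    def gaussian(d, bw):
        # Gaussian kernel: turns a distance into a weight that decays with distance
        return np.exp(-0.5 * (d / bw) ** 2) / (bw * np.sqrt(2 * np.pi))

    def meanshift_step(X, bw=2.5):
        new_X = np.copy(X)
        for i, x in enumerate(X):
            dist = np.sqrt(((x - X) ** 2).sum(axis=1))    # distance to every point, via broadcasting
            weight = gaussian(dist, bw)                   # closer points get larger weights
            # weighted average of all points; note the unit axis added at the end of weight
            new_X[i] = (weight[:, None] * X).sum(axis=0) / weight.sum()
        return new_X

    # for it in range(5): data = meanshift_step(data)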
Importantly there's a nice little thing that we can pass to a Gaussian, which is the thing 01:05:48.300 |
that decides does it look like the thing I just drew, or does it look like this, or does it look like that. 01:06:00.820 |
They all have the same area underneath, but they're very different shapes. 01:06:04.900 |
If we make it look like this, then what that's going to do is create a lot more clusters, 01:06:10.780 |
because things that are really close to it are going to have really high weights, and 01:06:15.420 |
everything else is going to have a tiny weight verging on meaningless. 01:06:18.660 |
So if we use something like this, we're going to have much fewer clusters, because even 01:06:23.180 |
stuff that's further away is going to have a reasonable weight in the weighted sum. 01:06:31.140 |
The choice that you use for the kernel width has lots of different names, bandwidth being a common one. 01:06:44.540 |
There's actually some cool ways to choose it. 01:06:46.940 |
One simple way to choose it is to find out which size of bandwidth covers 1/3 of the data. 01:06:58.020 |
I think that's the approach that Scikit-learn uses. 01:07:01.780 |
So there are some ways that you can automatically figure out a bandwidth, and that's just one of them. 01:07:12.340 |
So we just go through a bunch of times, five times, and each time we replace every point 01:07:17.580 |
with its weighted average weighted by this Gaussian kernel. 01:07:25.900 |
So when we run this 5 times, it takes a second, and here's the results. 01:07:33.060 |
I've offset everything by 1 just so that we can see it, otherwise it would be right on top of the data. 01:07:38.220 |
So you can see that for nearly all of them, it's in exactly the right spot, whereas for 01:07:43.180 |
this cluster, let's just remind ourselves what that cluster looked like: these were two clusters, 01:07:49.340 |
and with this particular bandwidth, it decided to create one cluster for them rather than two. 01:07:56.180 |
So this is kind of an example where, if we decreased our bandwidth, it would create two clusters there instead. 01:08:01.780 |
There's no one right answer as to whether that should be one cluster or two. 01:08:08.500 |
So one challenge with this is that it's kind of slow. 01:08:14.220 |
So I thought let's try and accelerate it for the GPU. 01:08:20.940 |
Because mean shift's not very cool, nobody seems to have implemented it for the GPU yet, 01:08:26.540 |
or maybe it's just not a good idea, so I thought I'd use PyTorch. 01:08:30.260 |
And the reason I used PyTorch is because writing PyTorch really just 01:08:34.780 |
feels like writing NumPy; everything happens straight away. 01:08:37.700 |
So I really hoped that I could take my original code and make it almost the same. 01:08:45.300 |
And indeed, here is the entirety of mean shift in PyTorch. 01:08:55.380 |
You can see anywhere I used to have np, it now says torch: np.array is now torch.FloatTensor, 01:09:05.460 |
np.sqrt is now torch.sqrt, and everything else is almost the same. 01:09:11.660 |
One issue is that torch doesn't support broadcasting. 01:09:18.000 |
So we'll talk more about this shortly in a couple of weeks, but basically I decided that's 01:09:23.020 |
not okay, so I wrote my own broadcasting library for PyTorch. 01:09:26.820 |
So rather than writing little x minus big X, I used sub for subtract. 01:09:31.660 |
That's the subtract from my broadcasting library. 01:09:34.900 |
If you're curious, check out TorchUtils and you can see my broadcasting operations there. 01:09:40.300 |
But basically if you use those, you can see that, save for that modification, it will do all the broadcasting for you. 01:09:51.340 |
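The torch_utils code itself isn't reproduced here, but a minimal sketch of the idea behind such a sub helper might look like this; recent PyTorch versions broadcast like NumPy so this is only needed on old versions, and the real library handles more general cases:

    import torch

    def sub(a, b):
        # subtract with manual broadcasting: expand the lower-rank tensor to
        # the shape of the higher-rank one, then subtract elementwise
        if a.dim() < b.dim(): a = a.unsqueeze(0).expand_as(b)
        if b.dim() < a.dim(): b = b.unsqueeze(0).expand_as(a)
        return a - b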
So as you can see, this looks basically identical to the previous code, but it takes longer. 01:10:08.080 |
So I could easily fix that by adding .cuda() to my x, but that made it slower still. 01:10:15.660 |
The reason why is that all the work is being done in this for loop, and PyTorch doesn't accelerate the loop itself. 01:10:24.760 |
Each run through a for loop in PyTorch is basically launching a new CUDA kernel each time through. 01:10:33.980 |
It takes a certain amount of time to even launch a CUDA kernel. 01:10:38.060 |
When I'm saying CUDA kernel, this is a different usage of the word kernel. 01:10:44.780 |
In CUDA, kernel refers to a little piece of code that runs on the GPU. 01:10:50.780 |
So it's launching a little GPU process every time through the for loop. 01:10:54.980 |
It takes quite a bit of time, and it's also having to copy data all over the place. 01:10:59.820 |
So what I then tried to do was to make it faster. 01:11:18.580 |
So each time through the loop we don't want to do just one piece of data, but a minibatch. 01:11:31.180 |
The main change was that my for loop over i now jumps through one batch size at a time. 01:11:47.260 |
So I now need to create a slice which is from i to i plus batch size, unless we've gone 01:11:58.380 |
past the end of the data, in which case it's just as far as the end. 01:12:02.540 |
So this is going to refer to the slice of data that we're interested in. 01:12:08.100 |
So what we can now do is say x with that slice to grab back all of the data in this minibatch. 01:12:15.800 |
And so then I had to create a special version of the distance calculation; I can't just say subtract anymore, I need to 01:12:22.020 |
think carefully about the broadcasting operations here. 01:12:24.620 |
I'm going to return a matrix, let's say batch size is 32, I'm going to have 32 rows, and 01:12:31.620 |
then let's say n is 1000, it will be 1000 columns. 01:12:35.100 |
That shows me how far away each thing in my batch is from every piece of data. 01:12:40.860 |
So when we do things a batch at a time, we're basically adding another axis to all of your 01:12:47.780 |
Suddenly now you have a batch axis all the time. 01:12:51.180 |
And when we've been doing deep learning, that's been something I think we've got pretty used 01:12:55.780 |
The first axis in all of our tensors has always been a batch axis. 01:13:00.380 |
So now we're writing our own GPU-accelerated algorithm. 01:13:06.180 |
Two years ago, if you Googled for k-means CUDA or k-means GPU, you got back research 01:13:15.180 |
papers where people wrote about how to put these algorithms on GPUs, because it was considered a real research contribution. 01:13:27.460 |
So it's kind of crazy that this is possible, but here we are. 01:13:30.300 |
We have built a batch-by-batch GPU-accelerated mean shift algorithm. 01:13:38.100 |
The basic distance formula is exactly the same. 01:13:40.820 |
I just have to be careful about where I add unsqueeze, which is the same as expand_dims in NumPy. 01:13:48.420 |
So I just have to be careful about where I add my unit axes, add it to the first axis 01:13:52.780 |
of one bit and the second axis of the other bit. 01:13:55.340 |
So that's going to subtract every one of these from every one of these and return a matrix. 01:14:01.700 |
Again, this is a really good time to look at this and think why does this broadcasting 01:14:09.020 |
work, because this is getting more and more complex broadcasting. 01:14:15.540 |
And hopefully you can now see the value of broadcasting. 01:14:19.860 |
Not only did I get to avoid writing a pair of nested for loops here, but I also got to 01:14:26.580 |
do this all on the GPU in a single operation, so I've made this thousands of times faster. 01:14:35.600 |
So here is a single operation which does that entire matrix subtraction. 01:14:40.780 |
I was just going to suggest that we take a break soon; it's ten to eight. 01:14:54.580 |
We then chuck that into a Gaussian, and because this is just element-wise, the Gaussian function doesn't need any changes. 01:15:04.900 |
And then I've got my weighted sum, and then divide that by the sum of weights. 01:15:15.500 |
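Putting those pieces together, a sketch of the batched step in current PyTorch (which now broadcasts like NumPy) might look like the following; the batch size and names are illustrative, and X is assumed to be an (n, 2) FloatTensor, optionally already moved to the GPU with .cuda():

    import math, torch

    def gaussian(d, bw):
        return torch.exp(-0.5 * (d / bw)**2) / (bw * math.sqrt(2 * math.pi))

    def meanshift_step_batched(X, bw=2.5, bs=500):
        new_X = X.clone()
        n = X.shape[0]
        for i in range(0, n, bs):
            s = slice(i, min(i + bs, n))                  # this minibatch
            # (bs, 1, 2) - (1, n, 2) -> (bs, n, 2): every batch point vs every point
            diff = X[s].unsqueeze(1) - X.unsqueeze(0)
            dist = (diff**2).sum(dim=2).sqrt()            # (bs, n) distance matrix
            weight = gaussian(dist, bw)                   # (bs, n) weights
            # weighted average of all points, one row per batch point
            new_X[s] = (weight @ X) / weight.sum(dim=1, keepdim=True)
        return new_X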
So previously for my NumPy version it took a second, now it's 48ms, so we've just sped it up by about 20 times. 01:15:29.060 |
Question - I get how batching helps with locality and cache, but I do not quite follow how it 01:15:34.620 |
helps otherwise, especially with respect to accelerating the for loop. 01:15:40.020 |
So in PyTorch, the for loop is not run on the GPU. 01:15:47.620 |
The for loop is run on your CPU, and your CPU goes through each step of the for loop 01:15:52.940 |
and calls the GPU to say do this thing, do this thing, do this thing. 01:15:58.140 |
So this is not to say you can't accelerate this in TensorFlow in a similar way. 01:16:05.740 |
In TensorFlow there's tf.while_loop and things like that where you can actually do the looping inside the graph. 01:16:13.020 |
Even so, if you do it entirely in a Python loop, it's going to be pretty difficult to make fast. 01:16:18.700 |
But particularly in PyTorch, it's important to remember that your loops are not accelerated. 01:16:24.860 |
It's what you do inside each loop that's optimized. 01:16:30.500 |
Some of the math functions are coming from Torch and others are coming from the Python math library. 01:16:35.700 |
What is the difference when you use the Python math library? 01:16:45.460 |
You'll see that I use math.pi, which is a constant, and math.sqrt(2 * math.pi), which is also just a constant. 01:16:52.220 |
You don't need to use the GPU to calculate a constant, obviously. 01:16:56.940 |
We only use Torch for things that are running on a vector or a matrix or a tensor of data. 01:17:11.260 |
We'll come back in 10 minutes, so that would be 2 past 8, and we'll talk about some ideas 01:17:16.340 |
I have for improving mean shift, which maybe you guys will want to try during the week. 01:17:24.840 |
The idea here is we figure there are two steps to working out whether a scan shows lung cancer. 01:17:41.500 |
Step number one is to find the things that may be kind of nodule-ish and zoom into them. 01:17:50.020 |
Step two would be where the deep learning particularly comes in, which is to figure out: is that cancerous or not? 01:18:00.060 |
Once you've found a nodule-ish thing, by far the biggest 01:18:07.740 |
driver of whether or not it's a malignant cancer is how big it is. 01:18:17.620 |
The other thing particularly important is how kind of spidery it looks. 01:18:24.860 |
If it looks like it's kind of evilly going out to capture more territory, that's probably a bad sign. 01:18:32.740 |
So the size and the shape are the two things that you're going to be wanting to try and 01:18:36.380 |
find, and obviously that's a pretty good thing for a neural net to be able to do. 01:18:41.740 |
You probably don't need that many examples of it. 01:18:47.140 |
When you get to that point, there was obviously a question about how to deal with the 3D aspect 01:18:52.620 |
You can just create a 3D convolutional neural net. 01:18:56.860 |
So if you had, like, a 10x10x10 volume, that's obviously not going to be too big; 01:19:05.380 |
at 20x20x20 you might be okay, so think about how big a volume you can create. 01:19:10.940 |
There's plenty of papers around on 3D convolutions, although I'm not sure if you even need one 01:19:21.500 |
The other approach that you might find interesting to think about is something called triplanar. 01:19:27.580 |
What triplanar means is that you take a slice through the x, the y and the z axes, 01:19:38.900 |
one slice through each, and then you can kind of treat those three slices as different channels. 01:19:46.020 |
You can probably use pretty standard neural net libraries that expect three channels. 01:19:53.740 |
So there's a couple of ideas for how you can deal with the 3D aspect of it. 01:20:03.060 |
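As a concrete sketch of the triplanar idea (not from the lecture notebooks): assuming vol is a 3D NumPy array of CT data and (x, y, z) is the centre of a candidate nodule, and ignoring edge handling:

    import numpy as np

    def triplanar_patch(vol, x, y, z, sz=32):
        # one 2D slice through each axis, centred on the candidate, stacked
        # as three channels so an ordinary 2D conv net can consume it
        h = sz // 2
        ax = vol[x, y - h:y + h, z - h:z + h]   # slice perpendicular to x
        ay = vol[x - h:x + h, y, z - h:z + h]   # slice perpendicular to y
        az = vol[x - h:x + h, y - h:y + h, z]   # slice perpendicular to z
        return np.stack([ax, ay, az], axis=-1)  # shape (sz, sz, 3)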
I think using the Luna dataset as much as possible is going to be a good idea because 01:20:09.100 |
you really want something that's pretty good at detecting nodules before you start putting 01:20:13.580 |
it onto the Kaggle dataset, because the other problem with the Kaggle dataset is it's ridiculously small. 01:20:18.660 |
And again, there's no reason for it, there are far more cases in NLST than they've provided 01:20:25.460 |
to Kaggle, so I can't begin to imagine why they went to all this trouble and a million 01:20:29.700 |
dollars of prize money for something which has not been set up to succeed. 01:20:34.900 |
Anyway, that's not our problem, it makes it all a more interesting thing to play with. 01:20:43.220 |
But after the competition's finished, if you get interested in it, you'll probably want 01:20:47.540 |
to go and download the whole NLST dataset or as much as possible and do it properly. 01:20:53.820 |
Actually, there are two questions that I wanted to read. 01:21:01.780 |
One is just for the audio stream: there are occasional max-volume pops that are really unpleasant. 01:21:08.860 |
This might not be solvable right now, but something to look into. 01:21:16.700 |
And then last class you mentioned that you would explain when and why to use Keras versus 01:21:23.260 |
If you only had brain space for one, in the same way some people only have brain space for vi or Emacs, which would you pick? 01:21:35.340 |
So I just reduced the volume a little bit, so let us know if that helps. 01:21:45.500 |
I would pick PyTorch, it feels like it kind of does everything Keras does, but gives 01:21:52.500 |
you the flexibility to really play around a lot more. 01:22:04.460 |
So, question: you mentioned there are other datasets of cancerous images that have labels? 01:22:13.620 |
That was my suggestion, and that's what the tutorial shows how to do. 01:22:25.300 |
There's a whole kernel on Kaggle called Candidate Generation and LUNA16, which shows 01:22:33.620 |
how to use Luna to build a nodule finder, and it's one of the highest-rated Kaggle kernels. 01:22:44.780 |
We've now used kernel in three totally different ways in this lesson. 01:22:48.860 |
See if we can come up with a fourth: Kaggle kernels, CUDA kernels, and kernel methods. 01:23:02.620 |
So here's a Keras approach to finding lung nodules based on Luna. 01:23:17.540 |
So I mentioned an opportunity to improve this mean shift algorithm, and the opportunity 01:23:45.420 |
for improvement, when you think about it, is pretty obvious. 01:23:57.540 |
The points that are a long way away are going to have weights so close to zero that they contribute almost nothing. 01:24:07.740 |
The question is, how do we quickly find the ones which are a long way away? 01:24:22.280 |
So what if we added an extra step here which, rather than using x to get the distance to 01:24:33.780 |
every data point, instead used approximate nearest neighbours to grab just the closest ones? 01:24:46.260 |
So that would basically turn this linear-time piece into a roughly logarithmic-time piece, which would make the whole thing much faster. 01:25:00.180 |
So we learned very briefly about a particular approach, which is locality-sensitive hashing. 01:25:07.700 |
I think I mentioned also there's another approach which I'm really fond of, called SpillTrees. 01:25:17.460 |
I really want us as a team to take this algorithm and add approximate nearest neighbors to it 01:25:25.980 |
and release it to the community as the first ever superfast GPU-accelerated, approximate 01:25:37.100 |
nearest-neighbor-accelerated mean shift clustering algorithm. 01:25:43.660 |
If anybody's interested in doing that, I believe you're going to have to implement something 01:25:50.100 |
like LSH or SpillTrees in PyTorch, and once you've done that, it should be totally trivial to add it to this algorithm. 01:26:00.780 |
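To give a flavour of what's involved, here's a minimal sketch of one LSH variant, random hyperplanes, in PyTorch; this is just the bucketing idea, not a finished approximate nearest neighbour implementation (a real one needs multiple tables, bucket lookups and so on):

    import torch

    def lsh_buckets(X, n_bits=8):
        # points on the same side of every random hyperplane share a bucket,
        # so candidate neighbours can be found with a cheap bucket lookup
        planes = torch.randn(X.shape[1], n_bits)          # random hyperplanes
        bits = (X @ planes > 0).long()                    # (n, n_bits) of 0/1
        powers = 2 ** torch.arange(n_bits)
        return (bits * powers).sum(dim=1)                 # one bucket id per point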
So if you do that, then if you're interested, I would invite you to team up with me in that 01:26:07.540 |
we would then release this piece of software together and author a paper or a post together. 01:26:14.780 |
So that's my hope is that a group of you will make that happen. 01:26:20.180 |
That would be super exciting because I think this would be great. 01:26:23.860 |
We'll be showing people something pretty cool about the idea of writing GPU algorithms today. 01:26:31.980 |
In fact, I found just during the break, here's a whole paper about how to write k-means with CUDA. 01:26:47.140 |
This is without even including any kind of approximate nearest neighbor's piece or whatever. 01:27:00.380 |
I guess to do it properly, we should also be replacing the Gaussian kernel bandwidth 01:27:07.620 |
with something that we figure out dynamically rather than have it hard coded. 01:27:16.420 |
So, change of pace: we're going to learn about chatbots. 01:27:27.620 |
Facebook thinks it has found the secret to making bots less dumb. 01:27:35.820 |
So this talks about a new thing called memory networks, which was demonstrated by Facebook. 01:27:42.340 |
You can feed it sentences that convey key plot points in Lord of the Rings and then ask it questions about them. 01:27:49.700 |
They published a new paper on arXiv that generalizes the approach. 01:27:56.940 |
There was another long article about this in Popular Science, in which they described 01:28:00.740 |
it as early progress towards a truly intelligent AI. 01:28:05.780 |
Yann LeCun is excited about working on memory networks, which give the ability to retain information. 01:28:11.300 |
You can tell the network a story and have it answer questions. 01:28:21.940 |
In the article, they've got this little example showing reading a story of Lord of the Rings 01:28:29.580 |
and then asking various questions about Lord of the Rings, and it all looks pretty impressive. 01:28:38.100 |
And the paper is called End-to-End Memory Networks. 01:28:43.900 |
The paper was actually not shown on Lord of the Rings, but was actually shown on something 01:28:48.860 |
called bAbI, I don't know, 'babby' or 'baby', I'm never quite sure which one it is. 01:28:54.980 |
It's the paper describing a synthetic dataset: 'Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks'. 01:29:04.300 |
I saw a cute tweet last week explaining the meaning of various different types of titles 01:29:10.660 |
of papers, and it's basically saying 'towards' means we've actually made no progress whatsoever. 01:29:21.440 |
So these introduce the bAbI tasks, and the bAbI tasks are probably best described by example. 01:29:39.260 |
A story contains a list of sentences, a sentence contains a list of words. 01:29:46.700 |
At the end of the story is a query to which there is an answer. 01:30:00.420 |
The query is 'Where is Daniel?'; the story says Daniel went to the bathroom, so the answer is that Daniel is in the bathroom. 01:30:12.140 |
This is called a one-supporting fact structure, which is to say you only have to go back and 01:30:17.660 |
find one sentence in the story to figure out the answer. 01:30:21.620 |
We're also going to look at two-supporting-fact stories, which are ones where you're going to need two sentences from the story to answer the question. 01:30:31.020 |
So reading in these data sets is not remotely interesting, they're just a text file. 01:30:43.780 |
There's various different text files for the various different tasks. 01:30:46.500 |
If you're interested in the various different tasks, you can check out the paper. 01:30:51.580 |
We're going to be looking at a single supporting fact and two supporting facts. 01:30:54.800 |
They have some with 10,000 examples and some with 1,000 examples. 01:31:01.540 |
The goal is to be able to solve every one of their challenges with just 1,000 examples. 01:31:09.260 |
This paper is not successful at that goal, but it makes some movement towards it. 01:31:17.300 |
So basically, we're going to put that into a bunch of different lists of stories along with their queries and answers. 01:31:30.100 |
We can start off by having a look at some statistics about them. 01:31:34.340 |
The first is, for each story, what's the maximum number of sentences in a story? It's 10. 01:31:42.860 |
In fact, if you go back and look at the gif where it says 'read story: Lord of the Rings', the actual stories are nothing like that; they're tiny. 01:32:00.780 |
The total number of different words in this thing is 32. 01:32:07.940 |
The maximum length of any sentence in a story is 8. 01:32:13.220 |
The maximum number of words in any query is 4. 01:32:17.860 |
So we're immediately thinking, what the hell? 01:32:22.740 |
Because this was presented by the press as being the secret to making bots less dumb, 01:32:28.380 |
and showed us that they took Lord of the Rings, summarized it into key plot points and asked 01:32:33.660 |
various questions, and clearly that's not entirely true. 01:32:39.780 |
What they did, if you look at even the stories, the first word is always somebody's name. 01:32:46.940 |
The second word is always 'went' or some synonym for moving. 01:32:51.580 |
There's then a preposition or two, and then the last word is always a place. 01:33:00.580 |
So immediately we're kind of thinking maybe this is not a step to making bots less dumb 01:33:08.620 |
or whatever they said here, a truly intelligent AI. 01:33:19.700 |
So to get this into Keras, we need to turn it into a tensor in which everything is the 01:33:26.940 |
same size, so we use pad sequences for that, like we did in the last part of the course, 01:33:35.180 |
which will add zeroes to make sure that everything is the same size. 01:33:39.780 |
So the other thing we'll do is we will create a dictionary from words to integers to turn 01:33:47.180 |
every word into an index, so we're going to turn every word into an index and then pad 01:33:56.900 |
And then that's going to give us inputs_train, 10,000 stories, each one of 10 sentences, 01:34:09.060 |
Any story that's not 10 sentences long is going to get padded with sentences of just zeroes, and any sentences 01:34:14.780 |
not 8 words long will get some zeroes as well. 01:34:18.500 |
And the same for the test set, except we've just got 1,000. 01:34:28.340 |
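A rough sketch of that padding step, assuming stories is a list of tokenised stories (each a list of sentences, each a list of words) and word_idx, story_maxsents and sent_maxlen are the dictionary and sizes just described; this is illustrative rather than the notebook's exact code:

    import numpy as np
    from keras.preprocessing.sequence import pad_sequences

    def vectorise(stories, word_idx, story_maxsents, sent_maxlen):
        X = []
        for story in stories:
            # each sentence -> word indices, padded to sent_maxlen words
            sents = [[word_idx[w] for w in sent] for sent in story]
            sents = pad_sequences(sents, maxlen=sent_maxlen)
            # each story -> padded to story_maxsents sentences of zeroes
            pad = np.zeros((story_maxsents - len(sents), sent_maxlen), dtype=int)
            X.append(np.concatenate([pad, sents]))
        return np.stack(X)       # (n_stories, story_maxsents, sent_maxlen)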
Not surprisingly, we're going to use embeddings. 01:34:37.500 |
We have to turn a sentence into an embedding, not just a word into an embedding. 01:34:44.580 |
So there's lots of interesting ways of turning a sentence into an embedding, but when you're 01:34:50.780 |
just working 'towards' intelligent AI, you don't need any of them. 01:34:54.740 |
You instead just add the embeddings up, and that's what happened in this paper. 01:34:59.260 |
And if you look at the way it was set up, you can see why, you can just add the embeddings 01:35:05.780 |
Mary, John and Sandra only ever appear in one place: they're always the first word of the sentence. 01:35:11.060 |
The verb is always the same kind of thing, the prepositions are always meaningless, and the last word is always a place. 01:35:16.260 |
So to figure out what a whole sentence says, you can just add up the word embeddings. 01:35:22.420 |
The order of them doesn't make any difference, there are no 'not's, there's nothing that makes 01:35:25.940 |
language remotely complicated or interesting. 01:35:28.460 |
So what we're going to do is we're going to create an input for our stories, which is the number of sentences by the number of words per sentence. 01:35:37.660 |
We're going to take each sentence and put its words through an embedding; that's what TimeDistributed is doing here. 01:35:43.460 |
It's applying the embedding to each sentence separately, and then we do a Lambda layer to sum the word embeddings into a sentence embedding. 01:35:52.620 |
So here is our very sophisticated approach to creating sentence embeddings. 01:36:00.620 |
So we end up with something which rather than being 10 by 8, 10 sentences by 8 words, it's 01:36:08.100 |
now 10 by 20, that is 10 sentences by length 20 embedding. 01:36:15.060 |
So each one of our 10 sentences has been turned into a length 20 embedding, and we're just 01:36:20.060 |
We're not going to use Word2vec or anything because we don't need the complexity of that 01:36:27.700 |
We're going to do exactly the same thing for the query. 01:36:32.220 |
We don't need to use TimeDistributed this time, we can just take the query, because this is just a single sentence. 01:36:44.460 |
So we can do the embedding, sum it up, and then we use reshape to add a unit axis to 01:36:51.460 |
the front so that it's now the same basic rank. 01:36:56.300 |
We now have one query embedding of length 20. 01:37:02.140 |
So we have 10 sentences of the story and one query. 01:37:09.700 |
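In Keras functional-API terms, the two branches just described might look roughly like this; the sizes follow the numbers above (10 sentences, 8 words, 4 query words, length-20 embeddings, 32 words plus padding), and everything else is an illustrative assumption rather than the notebook's code:

    from keras.layers import Input, Embedding, TimeDistributed, Lambda, Reshape
    import keras.backend as K

    emb_dim = 20
    vocab_size = 32 + 1                   # 32 words plus the padding index 0
    story_maxsents, sent_maxlen, query_maxlen = 10, 8, 4

    # story: 10 sentences of 8 word indices -> 10 sentence embeddings of length 20
    inp_story = Input((story_maxsents, sent_maxlen))
    emb_story = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
    emb_story = Lambda(lambda x: K.sum(x, axis=2))(emb_story)     # (10, 20)

    # query: 4 word indices -> one length-20 embedding with a unit axis in front
    inp_q = Input((query_maxlen,))
    emb_q = Embedding(vocab_size, emb_dim)(inp_q)
    emb_q = Lambda(lambda x: K.sum(x, axis=1))(emb_q)
    emb_q = Reshape((1, emb_dim))(emb_q)                          # (1, 20)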
So what is the memory network, or more specifically the more advanced end-to-end memory network? 01:37:19.300 |
As per usual, when you get down to it, it's less than a page of code to do these things. 01:37:50.300 |
We took each word and we turned them into an embedding. 01:37:58.300 |
And then we summed all of those embeddings up to get an embedding for that sentence. 01:38:06.900 |
So each sentence was turned into an embedding, and they were length 20, that's what it was. 01:38:19.980 |
And then we took the query, so this is my query, same kind of idea, a bunch of words 01:38:40.260 |
which we got embeddings for, and we added them up to get an embedding for our question. 01:38:49.060 |
Okay, so to do a memory network, what we're going to do is we're going to take each of 01:38:58.900 |
these embeddings and we're going to combine it, each one, with a question or a query. 01:39:13.060 |
And we're just going to take a dot product, so the way to draw this is: dot product, dot product, dot product, dot product, okay, 01:39:40.940 |
so we're going to end up with 4 dot products, one from each sentence of the story times the query. 01:39:50.020 |
A dot product basically says how similar two things are: when one thing is big where the other thing 01:39:54.100 |
is big, and small where the other thing is small, those both make the dot product bigger. 01:40:00.620 |
So these are basically going to be 4 numbers describing how similar each of our 4 sentences is to the query. 01:40:29.340 |
So remember the dot product just returns a scalar, so we now have 4 scalars. 01:40:40.780 |
And they each are basically related to how similar is the query to each of the 4 sentences. 01:40:50.660 |
We're now going to create a totally separate embedding of each of the sentences in our 01:40:57.980 |
story by creating a totally separate embedding for each word. 01:41:03.800 |
So we're basically just going to create a new random embedding matrix for each word 01:41:09.380 |
to start with, sum them all together, and that's going to give us a new embedding for each sentence. 01:41:23.860 |
And all we're going to do is we're going to multiply each one of these embeddings by the 01:41:41.700 |
corresponding softmax output as a weighting, and then just add them all together. 01:41:46.980 |
So we're going to have S1 times C1 plus S2 times C2 and so on, divided by 01:42:03.140 |
S1 plus S2 plus S3 plus S4, and that's going to be our final result, which is a vector of length 20. 01:42:21.380 |
So this thing is a vector of length 20, and then we're going to take that and put it through 01:42:27.620 |
a single dense layer, and we're going to get back the answer. 01:42:43.140 |
It's incredibly simple, there's nothing deep in terms of deep learning, there's almost 01:42:53.180 |
no non-linearities, so it doesn't seem like it's likely to be able to do very much, but it does. 01:43:08.100 |
>> So in that last step you said the answer, was that really the embedding of the answer, or the answer itself? 01:43:18.860 |
>> Yeah, it's the softmax of the answer, and then you have to do an argmax. 01:43:22.660 |
So here it is in the code: we've got the embedding of the story dot-producted with the embedding of the query. 01:43:43.700 |
Softmax works in the last dimension, so I just have to reshape to get rid of the unit 01:43:46.980 |
axis, and then I reshape again to put the unit axis back on again. 01:43:51.540 |
The reshapes aren't doing anything interesting, so it's just a dot product followed by a softmax, 01:44:01.300 |
So now we're going to take each weight and multiply it by the second set of embeddings, 01:44:08.460 |
here's our second set of embeddings, embedding C, and in order to do this, I just used the 01:44:15.100 |
dot product again, but because of the fact that you've got a unit axis there, this is 01:44:20.780 |
actually just doing a very simple weighted average. 01:44:27.600 |
And again, I've reshaped to get rid of the unit axis so that we can stick it through 01:44:31.400 |
a dense layer and a softmax, and that gives us our final result. 01:44:37.100 |
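Continuing the sketch above (same names, same caveats), the whole memory-network wiring described here is roughly:

    from keras.layers import dot, Activation, Dense
    from keras.models import Model

    # how similar is the query to each sentence? dot product, then softmax
    x = dot([emb_story, emb_q], axes=2)                  # (10, 1)
    x = Reshape((story_maxsents,))(x)
    x = Activation('softmax')(x)
    weights = Reshape((story_maxsents, 1))(x)            # attention over sentences

    # a second, totally separate embedding of the story (the "C" embeddings)
    emb_c = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
    emb_c = Lambda(lambda x: K.sum(x, axis=2))(emb_c)    # (10, 20)

    # weighted sum of the C embeddings, then a single dense layer -> answer word
    x = dot([weights, emb_c], axes=1)                    # (1, 20)
    x = Reshape((emb_dim,))(x)
    out = Dense(vocab_size, activation='softmax')(x)

    model = Model([inp_story, inp_q], out)
    # answers are single word indices, hence sparse categorical crossentropy
    model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])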
So what this is effectively doing is it's basically saying, okay, how similar is the 01:44:43.640 |
query to each one of the sentences in the story? 01:44:48.140 |
Use that to create a bunch of weights, and then these things here are basically the answers. 01:44:53.140 |
This is like: if sentence number 1 was where the answer was, then we're going to use this 01:44:58.860 |
embedding, and likewise for sentences 2, 3, and 4; because there's a single linear layer at the very end, 01:45:06.300 |
it doesn't really get to do much computation. 01:45:08.940 |
It basically has to learn what answer each sentence represents. 01:45:13.900 |
And again, this is lucky, because in the original dataset the answer to every question is just a single word. 01:45:40.740 |
So that's why we can just have this incredibly simple final piece. 01:45:47.340 |
So this is an interesting use of Keras, right? 01:45:51.780 |
We've created a model which is in no possible way deep learning, but it's a bunch of tensor operations wired together. 01:46:02.780 |
And so it has some inputs, it has an output, so we can call it a model. 01:46:07.300 |
We can compile it with an optimizer and a loss, and then we can fit it. 01:46:13.580 |
So it's kind of interesting how you can use Keras for things which don't really use any deep learning at all. 01:46:23.180 |
And as you can see, it works for what it's worth. 01:46:27.380 |
And the particular problem we solved here is the one-supporting-fact problem. 01:46:41.620 |
Actually before I do that, I'll just point out something interesting, which is we could 01:46:45.360 |
create another model, now that this is already trained, which returns not the final answer but the output of that softmax, i.e. the weights. 01:46:54.900 |
And so we can now go back and say, for a particular story, what are the weights? 01:47:06.740 |
For this particular story, the weights are here, and you can see that nearly all of the weight is on one sentence. 01:47:28.380 |
So we can actually look inside the model and find out what sentences it's using to answer each question. 01:47:36.780 |
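Using the names from the sketch above, pulling those weights out might look like this; inputs_train is the array from earlier, and queries_train is an assumed name for the padded queries:

    # reuse the already-trained graph, but output the attention weights instead
    inspect = Model([inp_story, inp_q], weights)
    attn = inspect.predict([inputs_train[:1], queries_train[:1]])
    print(attn.reshape(-1))   # one weight per sentence; the biggest one is the
                              # sentence the network is relying on for its answer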
Question - would it not make more sense to concat the embeddings rather than sum them? 01:47:46.680 |
Not for this particular problem, because of the way the vocabulary is structured, as we just saw. 01:47:53.460 |
It would also have to deal with the variable length of the sentence. 01:47:58.780 |
Well, we've used padding to make them the same length. 01:48:05.020 |
If you wanted to use this in real life, you would need to come up with a better sentence 01:48:10.500 |
embedding, which presumably might be an RNN or something like that, because you need to 01:48:17.300 |
deal with things like 'not' and the location of subject and object and so forth. 01:48:23.820 |
One thing to point out is that the order of the sentences matters. 01:48:28.020 |
And so what I actually did when I preprocessed it was I added '0:', '1:', and so on 01:48:33.980 |
to the start of each sentence, so that it would actually be able to learn the time order of the sentences. 01:48:43.780 |
So in case you were wondering what that was, that was something that I added in the preprocessing. 01:48:49.620 |
So one nice thing with memory networks is we can kind of look and see if they're not 01:48:53.380 |
working, in particular why they're not working. 01:48:57.660 |
So multi-hop, so let's now look at an example of a two supporting facts story. 01:49:08.440 |
We still only have one type of verb with various synonyms and a small number of subjects and 01:49:12.540 |
a small number of objects, so it's basically the same. 01:49:17.820 |
But now, to answer a question, we have to go through two hops. 01:49:35.780 |
So that's what we have to be able to do this time. 01:49:40.420 |
And so what we're going to do is exactly the same thing as we did before, but we're going 01:49:47.260 |
to take our whole little model, that is the embedding, reshape, dot, reshape, softmax, 01:49:57.980 |
reshape, dot, reshape, dense layer, sum, and wrap it all up in a single function that we'll call one hop. 01:50:12.180 |
So this whole picture is going to become one hop. 01:50:17.720 |
And what we're going to do is we're going to take this and go back and replace the query with its output. 01:50:33.220 |
So at each step, each hop, we're going to replace the query with the result of our memory network. 01:50:41.440 |
And so that way, the memory network can learn to recognize that the first thing it needs to find is who has the milk; 01:50:55.140 |
having found that Daniel has the milk, it then needs to update the query to 'where is Daniel?'. 01:51:03.820 |
So the memory network in multi-hop mode basically does this whole thing again and again and again. 01:51:15.760 |
So that's why I just took the whole set of steps and chucked it into a single function. 01:51:22.340 |
And so then I just go: OK, response, story equals one hop; then response, story equals one hop on 01:51:30.100 |
that, and you can keep repeating that again and again and again. 01:51:34.820 |
And then at the end, get our output, that's our model, compile, fit. 01:51:45.020 |
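Again building on the earlier sketch (same names and caveats, and not a line-for-line copy of the notebook), the multi-hop version might be wired up roughly like this:

    def one_hop(query, story_emb):
        # the whole block above: dot with the query, softmax, weighted sum of a
        # fresh "C" embedding, then a dense layer to produce the next query
        x = dot([story_emb, query], axes=2)
        x = Activation('softmax')(Reshape((story_maxsents,))(x))
        w = Reshape((story_maxsents, 1))(x)
        c = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
        c = Lambda(lambda x: K.sum(x, axis=2))(c)
        r = dot([w, c], axes=1)                          # (1, 20) weighted sum
        r = Dense(emb_dim)(Reshape((emb_dim,))(r))       # becomes the new query
        return Reshape((1, emb_dim))(r)

    h1 = one_hop(emb_q, emb_story)        # first hop: original query vs story
    h2 = one_hop(h1, emb_story)           # second hop: updated query vs story
    out = Dense(vocab_size, activation='softmax')(Reshape((emb_dim,))(h2))
    model = Model([inp_story, inp_q], out)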
I had real trouble getting this to fit nicely, I had to play around a lot with learning rates 01:51:53.380 |
and batch sizes and whatever else, but I did eventually get it up to 0.999 accuracy. 01:52:05.220 |
So this is kind of an unusual class for me to be teaching, because particularly compared 01:52:11.580 |
to Part 1 where it was like best practices, clearly this is anything but. 01:52:16.720 |
I'm kind of showing you something which was maybe the most popular request, which was 'teach us chatbots'. 01:52:24.380 |
But let's be honest, who has ever used a chatbot that's not terrible? 01:52:29.980 |
And the reason no one's used a chatbot that's not terrible is that the current state of the art just isn't good enough. 01:52:36.380 |
So chatbots have their place and indeed one of the students of class has written a really 01:52:43.820 |
interesting kind of analysis of this, which hopefully she'll share on the forum. 01:52:50.120 |
But that place is really kind of lots of heuristics and carefully set-up vocabularies and selecting responses. 01:53:03.220 |
It's not kind of general purpose: here's a story, ask anything you like about it, here's the answer. 01:53:11.380 |
It's not to say we won't get there, I sure hope we will, but the kind of incredible hype 01:53:18.940 |
we had around neural Turing machines and memory networks and end-to-end memory networks is, 01:53:25.340 |
as you can see even just from looking at the dataset they worked on, kind of overblown. 01:53:32.340 |
So that is not quite the final conclusion of this though, because yesterday a paper 01:53:41.660 |
came out which showed how to identify buffer overruns in computer source code using memory networks. 01:53:55.100 |
And so it kind of spoilt my whole narrative that somebody seems to have actually used one of these for something practical. 01:54:06.220 |
And I guess when you think about it, it makes some sense. 01:54:08.700 |
So in case you don't know what a buffer overrun is, that's like if you're writing in an unsafe 01:54:13.340 |
language, you allocate some memory, it's going to store some result or some input, and you 01:54:21.060 |
try to put into that memory something bigger than the amount that you allocated, and it basically spills out over whatever is stored next to it. 01:54:31.840 |
In the worst case, somebody figures out how to get exactly the right code to spill out 01:54:36.580 |
into exactly the right place and ends up taking over your machine. 01:54:45.100 |
And the idea of being able to find them, I can actually see, does look a lot like this kind of task. 01:54:50.340 |
You kind of have to see where that variable was set, and then where the thing 01:54:55.780 |
it was set from was set, and where the original thing was allocated. 01:54:58.820 |
It's kind of like just going back through the source code. 01:55:02.820 |
The vocabulary is pretty straightforward, it's just the variables that have been defined. 01:55:11.660 |
I haven't had a chance to really study the paper yet, but it's no chat bot, but maybe 01:55:18.500 |
there is room for memory networks already after all. 01:55:22.740 |
Is there a way to visualize what the neural network has learned for the text? 01:55:28.600 |
If you mean the embeddings, you can look at the embeddings easily enough. 01:55:36.420 |
The whole thing is so simple, it's very easy to look at every embedding. 01:55:40.300 |
As I mentioned, we looked at visualizing the weights that came out of the softmax. 01:55:48.140 |
We don't even need to look at it in order to figure out what it looked like, based on 01:55:52.620 |
the fact that this is just a small number of simple linear steps. 01:55:57.060 |
We know that it basically has to learn what each sentence's answer can be, you know, sentence 01:56:07.500 |
number 3's answer will always be milk, or always be some particular place, or whatever. 01:56:15.020 |
And then, so that's what the C embeddings are going to have to be. 01:56:21.860 |
And then the embeddings of the weights are going to have to basically learn how to come 01:56:26.300 |
up with what's going to be probably a similar embedding to the query. 01:56:29.500 |
In fact, I think you can even make them the same embedding, so that these dot products 01:56:34.260 |
basically give you something that gives you similarity scores. 01:56:38.820 |
So this is really a very simple, largely linear model, so it doesn't require too much visualizing to understand what it has learned. 01:56:48.940 |
So having said all that, none of this is to say that memory networks are useless, right? 01:56:54.740 |
I mean, they're created by very smart people with an impressive pedigree in deep learning. 01:57:01.020 |
This is very early, and this tends to happen in popular press, they kind of get overexcited 01:57:10.260 |
Although in this case, I don't think we can blame the press, I think we have to blame 01:57:15.860 |
I mean, this has clearly created to give people the wrong idea, which I find very surprising 01:57:21.340 |
from people like Yann LeCun, who normally does the opposite of that kind of thing. 01:57:27.380 |
So this is not really the press' fault in this case. 01:57:32.620 |
But this may well turn out to be a critical component in chatbots and Q&A systems and whatever comes next. 01:57:45.140 |
I had a good chat to Stephen Merity the other day, who's a researcher I respect a lot, and 01:57:54.820 |
I asked him what he thought was the most exciting research in this direction at the moment, 01:58:00.460 |
and he mentioned something that I was also very excited about, which is called Recurrent Entity Networks. 01:58:08.900 |
And the Recurrent Entity Network paper is the first to solve all of the bAbI tasks. 01:58:19.740 |
Now take of that what you will, I don't know how much that means, they're synthetic tasks. 01:58:26.860 |
One of the things that Stephen Merity actually pointed out in the blog post is that even 01:58:32.580 |
the basic kind of coding of how they're created is pretty bad. 01:58:35.140 |
They have lots of replicas and the whole thing is a bit of a mess. 01:58:40.220 |
But anyway, nonetheless this is an interesting approach. 01:58:44.260 |
So if you're interested in memory networks, this is certainly something you can look at. 01:58:49.340 |
And I do think this is likely to be an important direction. 01:58:54.380 |
Having said all that, one of the key reasons I wanted to look at these memory networks 01:58:58.940 |
is not only because it was the largest request from the forums for this part of the course, 01:59:03.980 |
but also because it introduces something that's going to be critical for the next couple of lessons, which is attention. 01:59:17.860 |
Attentional models are models where we have to do exactly what we just looked at, which 01:59:31.340 |
is basically find out at each time which part of a story to look at next, or which part 01:59:41.780 |
of an image to look at next, or which part of a sentence to look at next. 01:59:47.620 |
And so the task that we're going to be trying to get at over the next lesson or two is going to be translation. 02:00:07.540 |
And one of the challenges is that in a particular French sentence which has got some bunch of 02:00:12.860 |
words, it's likely to turn into an English sentence with some different bunch of words. 02:00:18.100 |
And maybe these particular words here might be this translation here, and this one might be that translation there. 02:00:25.160 |
And so as you go through, you need some way of saying which word do I look at next. 02:00:36.140 |
And so what we're going to do is we're going to be trying to come up with a proper RNN 02:00:43.820 |
like an LSTM, or a GRU, or whatever, where we're going to change it so that inside the 02:00:51.660 |
RNN it's going to actually have some way of figuring out which part of the input to look at next. 02:01:03.060 |
So that's the basic idea of attentional models. 02:01:06.780 |
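As a taste of what's coming, here is a tiny PyTorch sketch of the core attention computation: score each encoder state against the current decoder state, softmax the scores, and take a weighted sum. This is just the general idea, not Google's system or the exact code we'll build later:

    import torch
    import torch.nn.functional as F

    def attend(dec_state, enc_states):
        # dec_state:  (batch, d)      current decoder hidden state
        # enc_states: (batch, seq, d) one vector per input word
        scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)  # (batch, seq)
        weights = F.softmax(scores, dim=1)        # how much to look at each word
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # (batch, d)
        return context, weights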
And so interestingly, during this time that memory networks and neural Turing machines 02:01:13.820 |
and stuff were getting all this huge amount of press attention very quietly in the background 02:01:20.900 |
at exactly the same time, attentional models were appearing as well. 02:01:27.500 |
And it's the attentional models for language that have really turned out to be critical. 02:01:34.800 |
So you've probably seen all of the press about Google's new neural translation system, and 02:01:42.700 |
that really is everything that it's claimed to be. 02:01:46.500 |
It really is basically one giant neural network that can translate any pair of languages. 02:01:54.860 |
The accuracy of those translations is far beyond anything that's happened before. 02:02:00.580 |
And the basic structure of that neural net, as we're going to learn, is not that different 02:02:10.780 |
We're just going to have this one extra step, which is attention. 02:02:17.780 |
And depending on how interested you guys are in the details of this neural translation 02:02:21.980 |
system, it turns out that there are also lots of little tweaks. 02:02:25.180 |
The tweaks are kind of around like, OK, you've got a really big vocabulary, some of the words 02:02:34.500 |
appear very rarely, how do you build a system that can understand how to translate those 02:02:40.060 |
really rare words, for example, and also just kind of things like how do you deal with the 02:02:47.460 |
memory issues around having huge embedding matrices of 160,000 words and stuff like that. 02:02:54.740 |
So there's lots of details, and the nice thing is that because Google has ended up putting 02:03:01.940 |
this thing in production, all of these little details have answers now, and those answers 02:03:11.300 |
On the whole, there aren't really great examples out there of all of those things put together. 02:03:19.340 |
So one of the things interesting here will be that you'll have opportunities to do that. 02:03:25.860 |
Generally speaking, the blog posts about these neural translation systems tend to be kind of high level. 02:03:31.500 |
They describe roughly how these kind of approaches work, but Google's complete neural translation 02:03:38.100 |
system is not out there, you can't download it and see the code. 02:03:44.180 |
So we'll see how we go, but we'll kind of do it piece by piece. 02:04:01.060 |
I guess one other thing to mention about the memory network is that Keras actually comes 02:04:09.060 |
with an end-to-end memory network example in the Keras GitHub, which, weirdly enough, when 02:04:18.900 |
I actually looked at it, it turns out doesn't implement this at all. 02:04:24.740 |
And so even on the single-supporting-fact task, it takes many, many epochs and still doesn't get results like these. 02:04:34.100 |
And I found this quite surprising to discover that once you start getting to some of these 02:04:38.940 |
more recent advances or not just a standard CNN or whatever, it's just less and less common 02:04:49.860 |
that you actually find code that's correct and that works. 02:04:53.660 |
And so this memory network example was one of them. 02:04:56.340 |
So if you actually go into the Keras GitHub and look at examples and go and have a look 02:05:01.420 |
and download the memory network, you'll find that you don't get results anything like this. 02:05:06.740 |
If you look at the code, you'll see that it really doesn't do this at all. 02:05:11.700 |
So I just wanted to mention that as a bit of a warning that you're kind of at the point 02:05:18.780 |
now where you might want to take with a grain of salt the blog posts you read, or even some papers 02:05:24.700 |
that you read; it's well worth experimenting with them yourself, 02:05:33.420 |
and maybe even starting with the assumption that you can't necessarily trust all of the conclusions 02:05:40.480 |
that you've read because the vast majority of the time, in my experience putting together 02:05:46.340 |
this part of the course, the vast majority of the time, the stuff out there is just wrong. 02:05:52.500 |
Even in cases like this: I deeply respect the Keras authors and the Keras source code, but even there this example doesn't do what it claims. 02:06:02.360 |
I think that's an important point to be aware of. 02:06:08.760 |
I think we're done, so I think we're going to finish five minutes early for a change. 02:06:14.020 |
So thanks everybody, and so this week hopefully we can have a look at the Data Science Bowl, 02:06:22.780 |
make a million dollars, create a new PyTorch Approximate Nearest Neighbors algorithm, 02:06:27.840 |
and then when you're done, maybe figure out the next stage for memory networks.