
Lesson 11: Cutting Edge Deep Learning for Coders


Chapters

0:0 Reminders
7:35 Linear Algebra Cheat Sheet
8:35 Zero-Shot Learning
9:55 Computer Vision
11:40 Activation Functions
13:32 Colour transformations
14:18 Batch norm
14:58 Is there any advantage
17:35 Removing Data
19:56 Noisy Labels
21:22 Accuracy vs Size
23:23 Design patterns
24:20 Cyclical Learning Rates
27:20 Data Science Bowl

Whisper Transcript

00:00:00.000 | So this week, obviously quite a bit just to get set up to get results from this week in
00:00:14.840 | terms of needing all of ImageNet and that kind of thing and getting all that working.
00:00:19.780 | So I know that a lot of you are still working through that.
00:00:25.180 | I did want to mention a couple of reminders just that I've noticed.
00:00:30.860 | One is that in general, we have that thing on the wiki about how to use the notebooks,
00:00:37.500 | and we really strongly advise that you don't open up the notebook we give you and click
00:00:42.720 | shift enter through it again and again.
00:00:45.120 | You're not really going to learn much from that.
00:00:48.940 | But go back to that wiki page.
00:00:50.340 | It's like the first thing that's mentioned in the first paragraph of the home page of
00:00:53.320 | the wiki is how to use the notebooks.
00:00:55.560 | Basically the idea is try to start with a fresh notebook, think about what you think
00:01:00.360 | you need to do first, try and do that thing if you have no idea, then you can go to the
00:01:04.680 | existing notebook, take a peek, close it again, try and re-implement what you just saw.
00:01:10.760 | As much as possible, really not just shift enter through the notebooks.
00:01:16.080 | I know some of you are doing it because there are threads on the forum saying, "I was shift-
00:01:20.080 | entering through the notebook and this thing didn't work."
00:01:23.240 | And somebody is like, "Well, that's because that thing's not defined yet."
00:01:26.200 | So consider yourself busted.
00:01:32.560 | The other thing to remind you about is that the goal of part 2 is to get you to a point
00:01:38.960 | where you can read papers, and the reason for that is because you kind of know the best
00:01:44.480 | practices now, so anytime you want to do something beyond what we've learned, you're going to
00:01:49.640 | be implementing things from papers or probably going beyond that and implementing new things.
00:01:56.880 | Reading a new paper in an area that you haven't looked at before is, at least to me, somewhat
00:02:02.200 | terrifying.
00:02:04.640 | On the other hand, reading a paper for the thing that we already studied last week hopefully
00:02:10.960 | isn't terrifying at all because you already know what the paper says.
00:02:14.680 | So I always have that in the assignments each week.
00:02:18.320 | Read the paper for the thing you just learned about, and go back over it and please ask
00:02:22.560 | on the forums if there's a bit of notation or anything that you don't understand, or
00:02:28.000 | if there's something we heard in class that you can't see in the paper, or if it's particularly
00:02:31.960 | interesting if you see something in the paper that you don't think we mentioned in class.
00:02:37.320 | So that's the reason that I really encourage you to read the papers for the topics we studied
00:02:45.240 | in class.
00:02:46.960 | I think for those of you like me who don't have a technical academic background, it's
00:02:52.680 | really a great way to familiarize yourself with notation.
00:02:58.120 | And I'm really looking forward to some of you asking about notation on the forums, so
00:03:03.320 | I can explain some of it to you.
00:03:05.320 | There's a few key things that keep coming up in notation, like probability distributions
00:03:11.720 | and stuff like that.
00:03:12.720 | So please feel free, and if you're watching this later in the MOOC, again, feel free to
00:03:19.320 | ask on the forum anything that's not clear.
00:03:23.040 | I was kind of interested in following up on some of last week's experiments myself.
00:03:28.800 | And the thing that I think we all were a bit shocked about was putting this guy into the
00:03:34.920 | DeViSE model and getting out more pictures of similar looking fish in nets.
00:03:40.960 | And I was kind of curious about how that was working and how well that was working, and
00:03:46.840 | I then completely broke things by training it for a few more epochs.
00:03:52.000 | And after doing that, I then did an image similarity search again and I got these three
00:03:55.680 | guys who were no longer in nets.
00:04:00.880 | So I'm not quite sure what's going on here.
00:04:03.720 | And the other thing I mentioned is when I trained it where my starting point was what
00:04:10.080 | we looked at in class, which was just before the final bottleneck layer.
00:04:14.600 | I didn't get very good results from this thing, but when I trained it from the starting point
00:04:20.240 | of just after the bottleneck layer, I got the good results that you saw.
00:04:27.360 | And again, I don't know why that is, and I don't think this has been studied as far as
00:04:31.360 | I'm aware.
00:04:32.360 | So there's lots of open questions here.
00:04:33.360 | But I'll show you something I then did: I thought, well, that's interesting.
00:04:38.360 | I think what's happened here is that when you train it for longer, it knows that the
00:04:43.320 | important thing is the fish and not the net.
00:04:46.160 | And it seems to be now focusing on giving us the same kind of fish.
00:04:48.840 | These are clearly the exact same type of fish, I guess.
00:04:55.100 | So I started wondering how could we force it to combine.
00:04:58.160 | So I tried the most obvious possible thing, I wanted to get more fish in nets.
00:05:04.720 | And I typed the word2vec vector of tench, that's a kind of fish, plus the word2vec vector of net, divided by 2,
00:05:11.760 | get the average of the 2 word vectors, and give me the nearest neighbor.
00:05:15.800 | And that's what I got.
00:05:18.480 | And then just to prove it wasn't a fluke, I tried the same on tench plus rod, and there's
00:05:23.520 | my nearest neighbor.
00:05:24.520 | Now do you know what's really freaky about this?
00:05:27.680 | If you Google for ImageNet categories, you'll get a list of 1000 ImageNet categories.
00:05:32.880 | If you search through them, neither net nor rod appear at all.
00:05:37.560 | I can't begin to imagine why this works, but it does.
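For anyone who wants to experiment with this, here is a minimal sketch of the kind of query described above. The names `word_vectors` (a loaded word2vec lookup) and `img_embeddings` (an array of DeViSE-style image embeddings) are hypothetical stand-ins, not anything from the course notebooks:

```python
import numpy as np

# Hypothetical inputs: `word_vectors` maps a word to its word2vec vector, and
# `img_embeddings` is an (n_images, dim) array of DeViSE-style image embeddings,
# L2-normalised so a dot product gives cosine similarity.
def nearest_images(query_vec, img_embeddings, k=5):
    query_vec = query_vec / np.linalg.norm(query_vec)  # normalise the query vector
    sims = img_embeddings @ query_vec                  # cosine similarity to every image
    return np.argsort(-sims)[:k]                       # indices of the k most similar images

# Average the two word vectors, exactly as described above
query = (word_vectors['tench'] + word_vectors['net']) / 2
top_matches = nearest_images(query, img_embeddings)
```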
00:05:43.940 | So this DeViSE model is clearly doing some pretty deep magic in terms of the understanding
00:05:50.040 | of these objects and their relationships.
00:05:52.560 | Not only are we able to combine things like this, but we're able to combine it with categories
00:05:57.520 | that it's literally never seen before.
00:06:00.440 | It's never seen a rod, we've never told it what a rod looks like, and ditto for a net.
00:06:05.880 | And I tried quite a few of these combinations and they just kept working.
00:06:09.040 | Like another one I tried, where I understand why it works, is that I tried searching
00:06:13.880 | for boat.
00:06:14.880 | Now boat doesn't appear in ImageNet, but there's lots of kinds of boats that appear in ImageNet.
00:06:20.160 | So not surprisingly, it figures out, generally speaking, how to find boats.
00:06:24.880 | I expected that.
00:06:26.480 | And then I tried boat plus engine, and I got back pictures of powerboats, and then I tried
00:06:31.940 | boat plus paddle, and I got back pictures of rowing boats.
00:06:36.640 | So there's a whole lot going on here, and I think there's lots of opportunities for
00:06:40.160 | you to explore and experiment based on the explorations and experiments that I've done.
00:06:46.000 | And more to the point, perhaps to create some interesting and valuable tools.
00:06:53.880 | I would have thought a tool to do an image search to say, show me all the images that
00:06:58.280 | contain these kinds of objects.
00:07:00.720 | Or better still, maybe you could start training with things that aren't just nouns but also
00:07:05.480 | adjectives.
00:07:06.480 | So you could start to search for pictures of crying babies or flaming houses or whatever.
00:07:18.360 | I think there's all kinds of stuff you could do with this, which would be really interesting
00:07:20.880 | whether it be in a narrow organizational setting or create some new startup or a new open source
00:07:28.960 | project or whatever.
00:07:30.800 | So anyway, lots of things to try.
00:07:35.760 | More stuff this week.
00:07:37.840 | I actually missed this during the week, but I was thrilled to see that one of our students
00:07:43.520 | has written this fantastic Medium post, linear algebra cheat sheet.
00:07:47.640 | I think I missed it because it was posted not to the part 2 forum, but maybe to the
00:07:53.400 | main forum.
00:07:54.400 | But this is really cool, Brendan has gone through and really explained all the stuff
00:08:01.040 | that I would have wanted to have known about linear algebra before I got started, and particularly
00:08:06.000 | I really appreciate that he's taking a code-first approach.
00:08:10.320 | So how do you actually do this in NumPy and talking about broadcasting?
00:08:14.840 | So you guys will all be very familiar with this already, but your friends who are wondering
00:08:19.280 | how to get started in deep learning, what's the minimal things you need to know, it's
00:08:24.680 | probably the chain rule and some linear algebra.
00:08:27.160 | I think this covers a lot of the necessary linear algebra pretty effectively.
00:08:31.320 | So thank you Brendan.
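As a tiny taste of the code-first, broadcasting-centred style of that post, here is a hedged NumPy example (the arrays are made up for illustration):

```python
import numpy as np

# Broadcasting: a length-3 vector is stretched across every row of a 4x3 matrix
A = np.arange(12).reshape(4, 3).astype(float)
v = np.array([1., 2., 3.])
print(A - v)   # subtracts v from each row of A
print(A @ v)   # matrix-vector product, shape (4,)
```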
00:08:35.920 | Other things from last week.
00:08:39.160 | Andrea Frome, who wrote the DeViSE paper, I actually emailed her and asked her what she
00:08:44.520 | thought else I should look at.
00:08:47.000 | And she suggested this paper, "Zero-shot learning by convex combination of semantic embeddings,"
00:08:52.680 | which she's only a later author on, but she says it's kind of in some ways a more powerful
00:09:01.280 | version of DeViSE.
00:09:02.840 | It's actually quite different, and I haven't implemented it myself, but it solves some
00:09:08.680 | similar problems, and anybody who's interested in exploring this multimodal images and text
00:09:14.680 | space might be interested in this.
00:09:16.200 | And we'll put this on the lesson wiki of course.
00:09:19.920 | And then one more involving the same author in a similar area a little bit later was looking
00:09:27.840 | at attention for fine-grained categorization.
00:09:31.520 | So a lot of these things, at least the way I think Andrea Frome was casting it, was
00:09:36.840 | about fine-grained categorization, which is how do we build something that can find very
00:09:43.400 | specific kinds of birds or very specific kinds of dogs.
00:09:46.720 | I think these kinds of models have very, very wide applicability.
00:09:53.240 | So I mentioned we'd kind of wrap up some final topics around computer vision stuff this week
00:10:06.840 | before we started looking at some more NLP-related stuff.
00:10:11.520 | One of the things I wanted to zip through was a paper which I think some of you might
00:10:15.440 | enjoy, "Systematic Evaluation of CNN Advances on the ImageNet Data Set."
00:10:22.360 | And I've pulled out what I thought were some of the key insights for some of these things
00:10:26.240 | we haven't really looked at before.
00:10:29.340 | One key insight which is very much the kind of thing I appreciate is that they compared
00:10:36.040 | what's the difference between the original CaffeNet/AlexNet vs. GoogLeNet vs. VGGNet on
00:10:43.680 | two different sized images training on the original 227 or 128.
00:10:49.800 | And what this chart shows is that the relative difference between these different architectures
00:10:55.700 | is almost exactly the same regardless of what size image you're looking at.
00:11:00.280 | And this really reminds me of like in Part 1 when we looked at data augmentation and
00:11:04.440 | we said hey, you could figure out which types of data augmentation to use and how
00:11:08.200 | much on a small sample of the data rather than on the whole data set.
00:11:13.160 | What this paper is saying is something similar, which is you can look at different architectures
00:11:17.920 | on small sized images rather than full sized images.
00:11:22.360 | And so they then used this insight to do all of their experiments using a smaller 128x128
00:11:28.840 | ImageNet model, which they said was 10 times faster.
00:11:31.800 | So I thought that was the kind of thing which not enough academic papers do, which is like
00:11:37.400 | what are the hacky shortcuts we can get away with?
00:11:41.680 | So they tried lots of different activation functions.
00:11:47.780 | It does look like maxout is way better, so this is the gain compared to ReLU.
00:11:55.920 | But that one actually has twice the complexity, so it doesn't quite say that.
00:12:02.560 | What it really says is that something we haven't looked at, which is ELU, which as you can see
00:12:07.560 | is very simple: if x is greater than or equal to 0, it's y = x, otherwise it's y = e^x - 1.
00:12:15.920 | So ELU basically is just like ReLU, except it's smooth.
00:12:25.080 | Whereas ReLU looks like that, ELU looks exactly the same here, then here it goes like that.
00:12:43.120 | So it's kind of a nice smooth version.
00:12:45.400 | So that's one thing you might want to try using.
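If you want to plot or play with it yourself, here is a minimal NumPy sketch of the two activations; alpha=1 matches the formula quoted above:

```python
import numpy as np

def relu(x):
    # Piecewise linear, with a kink at zero
    return np.maximum(0., x)

def elu(x, alpha=1.0):
    # Smooth: identity for x >= 0, alpha * (e^x - 1) below zero
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))
```

Keras also provides ELU out of the box, as an advanced activation layer (and as the 'elu' activation string in more recent versions).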
00:12:48.360 | Another thing they tried which was interesting was using ELU for the convolutional layers,
00:13:00.960 | and maxout for the fully connected layers.
00:13:03.600 | I guess nowadays we don't use fully connected layers very much, so maybe that's not as interesting.
00:13:10.240 | Main interesting thing here I think is the ELU activation function.
00:13:13.600 | Two percentage points is quite a big difference.
00:13:19.080 | They looked at different learning rate annealing approaches.
00:13:23.840 | You can use Keras to automatically do learning rate annealing, and what they showed is that
00:13:27.360 | linear annealing seems to work the best.
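A minimal sketch of what linear annealing might look like with Keras's LearningRateScheduler callback; the start and end rates and epoch count here are purely illustrative:

```python
from keras.callbacks import LearningRateScheduler

def linear_anneal(lr_start=0.01, lr_end=0.0001, epochs=30):
    # Decay the learning rate linearly from lr_start to lr_end over `epochs` epochs
    def schedule(epoch):
        frac = epoch / float(max(1, epochs - 1))
        return lr_start + frac * (lr_end - lr_start)
    return LearningRateScheduler(schedule)

# model.fit(X, y, callbacks=[linear_anneal()])
```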
00:13:31.520 | They tried something else, which was like what about different color transformations.
00:13:38.320 | They found that amongst the normal approaches to thinking about color, RGB actually seems
00:13:42.480 | to work the best.
00:13:43.480 | But then they tried something I haven't seen before, which is they added two 1x1 convolutions
00:13:49.720 | at the very start of the network.
00:13:51.720 | So each of those 1x1 convolutions is basically doing some kind of linear combination with
00:13:59.680 | the channels, with a nonlinearity then in between.
00:14:04.840 | And they found that that actually gave them quite a big improvement, and that should be
00:14:10.360 | pretty much zero cost.
00:14:12.520 | So there's another thing which I haven't already seen written about elsewhere, but that's a
00:14:15.800 | good trick.
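A hedged Keras sketch (Keras 1-style API) of that trick; the filter counts, choice of nonlinearity, and input shape are illustrative, not taken from the paper:

```python
from keras.models import Sequential
from keras.layers import Convolution2D  # called Conv2D in Keras 2

# Two 1x1 convolutions at the very start let the network learn its own colour
# transform (a per-pixel linear mix of channels, with a nonlinearity in between).
model = Sequential([
    Convolution2D(10, 1, 1, activation='elu', input_shape=(3, 224, 224)),
    Convolution2D(3, 1, 1, activation='elu'),
    # ...followed by whatever convolutional architecture you were already using
])
```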
00:14:19.920 | They looked at the impact of batch norms.
00:14:22.280 | So here is the impact of batch norm, positive or negative.
00:14:28.520 | Actually adding batch norm to GoogLeNet didn't help, it actually made it worse.
00:14:33.240 | So it seems with these really complex, carefully tuned architectures, you've got to be pretty
00:14:37.680 | careful, whereas on a simpler network it helps a lot.
00:14:43.040 | And the amount it helps also depends somewhat on which activation function you use.
00:14:48.560 | So batch norm, I think we kind of know that now, be careful when you use it.
00:14:54.440 | Sometimes it's fantastically helpful, sometimes it's slightly unhelpful.
00:14:59.680 | Question: is there any advantage in using fully connected layers?
00:15:08.680 | Yeah, I think there is, they're terribly out of fashion.
00:15:15.280 | But I think for transfer learning, they still seem to be the best; the fully
00:15:24.800 | connected layers are super fast to train, and you seem to get a lot of flexibility there.
00:15:31.160 | So I don't think we know one way or another yet, but I do think that VGG still has a lot
00:15:38.040 | to give us in terms of being the last carefully tuned architecture with fully connected layers, and
00:15:45.360 | that really seems to be great for transfer learning.
00:15:49.200 | And then there was a comment saying that ELU's advantage is not just that it's smooth, but
00:15:55.920 | that it goes a little below zero, and that this helps push the mean activation
00:15:59.200 | closer to zero.
00:16:00.200 | Yeah, that's a great point.
00:16:01.360 | Thank you for adding that.
00:16:03.240 | Anytime you hear me say something slightly stupid, please feel free to jump in, otherwise
00:16:09.440 | it's on the video forever.
00:16:15.240 | So on the other hand, it does give you an improvement in accuracy if you remove the
00:16:22.120 | final max pooling layer, replace all the fully connected layers with convolutional layers,
00:16:28.840 | and stick an average pooling at the end, which is basically what this is doing.
00:16:33.140 | So it does seem there's definitely an upside to fully convolutional networks in terms of
00:16:38.440 | accuracy, but there may be a downside in terms of flexibility around transfer learning.
00:16:45.960 | I thought this was an interesting picture I haven't quite seen before, let me explain
00:16:52.720 | the picture.
00:16:53.720 | What this shows is these are different batch sizes along the bottom, and then we've got
00:17:00.400 | accuracy.
00:17:01.400 | And what it's showing is with a learning rate of 0.01, this is what happens to accuracy.
00:17:07.520 | So as you go above a batch size of 256, accuracy plummets.
00:17:13.080 | On the other hand, if you use a learning rate of 0.01 times batch size over 256, it's pretty
00:17:19.800 | flat.
00:17:20.800 | So what this suggests to me is that any time you change the batch size, this basically
00:17:24.560 | is telling you to change the learning rate by a proportional amount, which I think a
00:17:29.360 | lot of us have realized through experiments, but I don't think I've seen it explicitly
00:17:32.600 | mentioned before.
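In code the rule of thumb is just a one-liner, shown here with illustrative numbers:

```python
base_lr, base_batch_size = 0.01, 256
batch_size = 512
lr = base_lr * batch_size / base_batch_size   # 0.02 when the batch size doubles
```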
00:17:37.720 | Something I think is very helpful to understand as well is that removing data has a nonlinear
00:17:44.640 | effect on accuracy.
00:17:46.200 | So this green line here is what happens when you remove images.
00:17:49.660 | So with ImageNet, down to about half the size of ImageNet, there isn't a huge impact on
00:17:56.320 | accuracy.
00:17:57.320 | So maybe if you want to really speed things up, you could go 128x128 sized images and
00:18:02.080 | use just 600,000 of them, or even maybe 400,000, but then beneath that it starts to plummet.
00:18:10.040 | So I think that's an interesting insight.
00:18:14.120 | Another interesting insight, although I'm going to add something to this in a moment,
00:18:17.480 | is that rather than removing images, if you instead flip the labels to make them incorrect,
00:18:25.040 | that has a worse effect than not having the data at all.
00:18:31.120 | But there are things we can do to try to improve things there, and specifically I want to bring
00:18:36.040 | your attention to this paper, "Training Deep Neural Networks on Noisy Labels with Bootstrapping".
00:18:45.000 | And what they show is a very simple approach, a very simple tweak you can add to any training
00:18:50.840 | method which dramatically improves their ability to handle noisy labels.
00:18:57.920 | This here is showing if you add noise from 0.3 up to 0.5 to MNIST, up to half of it,
00:19:08.080 | the baseline of doing nothing at all really collapses the accuracy.
00:19:16.520 | But if you use their approach to bootstrapping, you can actually go up to nearly half the
00:19:22.600 | images intentionally having their labels changed, and it still works nearly as well.
00:19:30.120 | I think this is a really important paper to mention, in an area that most of you will
00:19:33.800 | find important and useful, because most real-world datasets have noise in them.
00:19:39.520 | So maybe this is something you should consider adding to everything that you've trained,
00:19:44.960 | whether it be Kaggle datasets, or your own datasets, or whatever, particularly because
00:19:51.200 | you don't necessarily know how noisy the labels are.
00:19:57.000 | "Noisy labels means incorrect" Yeah, noisy just means incorrect.
00:20:05.240 | "But bootstrapping is some sort of technique that" Yeah, this is this particular paper
00:20:09.160 | that's grabbed a particular technique which you can read during the week if you're interested.
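For anyone who wants to try it, here is a hedged sketch of the paper's "soft bootstrapping" idea written as a custom Keras loss; the beta value of 0.95 comes from the paper, but the function name and the way it is wired in are illustrative:

```python
from keras import backend as K

def soft_bootstrap_loss(beta=0.95):
    # Mix the given (possibly noisy) labels with the model's own predictions,
    # so the network can partially ignore labels it strongly disagrees with.
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1. - K.epsilon())
        target = beta * y_true + (1. - beta) * y_pred
        return -K.sum(target * K.log(y_pred), axis=-1)
    return loss

# model.compile(optimizer='adam', loss=soft_bootstrap_loss(), metrics=['accuracy'])
```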
00:20:14.560 | So interestingly, they find that if you take VGG and then add all of these things together
00:20:24.640 | and do them all at once, you can actually get a pretty big performance hike.
00:20:29.600 | It looks in fact like VGG becomes more accurate than GoogLeNet if you make all these changes.
00:20:37.160 | So that's an interesting point, although VGG is very, very slow and big.
00:20:46.760 | There's lots of stuff that I noticed they didn't look at.
00:20:48.600 | They didn't look at data augmentation, different approaches to zooming and cropping, adding
00:20:52.880 | skip connections like in ResNet or DenseNet or highway networks, different initialization methods,
00:21:01.200 | different amounts of depth.
00:21:04.600 | And to me, the most important is the impact on transfer learning.
00:21:08.600 | So these to me are all open questions as far as I know, and so maybe one of you would like
00:21:14.540 | to create the successor to this, more observations on training CNNs.
00:21:25.000 | There's another interesting paper, although the main interesting thing about this paper
00:21:28.120 | is this particular picture, so feel free to check it out, it's pretty short and simple.
00:21:34.080 | This paper is looking at the accuracy versus the size and the speed of different networks.
00:21:45.720 | So the size of a bubble is how big is the network, how many parameters does it have.
00:21:51.360 | So you can see VGG 16 and VGG 19 are by far the biggest of any of these networks.
00:21:58.880 | Interestingly the second biggest are the very old basic AlexNet.
00:22:04.200 | Interestingly newer networks tend to have a lot less parameters, which is a good sign.
00:22:08.360 | Then on this axis we have basically how long does it take to train.
00:22:14.560 | So again, VGG is big and slow, and without at least some tweaks, not terribly accurate.
00:22:24.920 | So again, there's definitely reasons not to use VGG even if it seems easier for transfer
00:22:30.240 | learning or we don't necessarily know how to do a great job of transfer learning on
00:22:35.640 | ResNet or Inception.
00:22:37.680 | But as you can see, the more recent ResNet and Inception-based approaches are significantly
00:22:44.240 | more accurate and faster and smaller.
00:22:48.480 | So this is why I was looking last week at trying to do transfer learning on top of ResNet
00:22:54.880 | and there's really good reasons to want to do that.
00:22:59.640 | I think this is a great picture.
00:23:03.400 | These two papers really show us that academic papers are not always just some highly theoretical
00:23:11.000 | wacky result.
00:23:12.000 | From time to time people write these great analysis of best practices and everything
00:23:18.200 | that's going on.
00:23:19.200 | There's some really great stuff out there.
00:23:26.320 | One other paper to mention in this kind of broad ideas about things that you might find
00:23:32.400 | helpful is a paper by somebody named Leslie Smith who I think is going to be just about
00:23:37.280 | the most overlooked researcher.
00:23:44.000 | Leslie Smith does a lot of really great papers which I really like.
00:23:48.480 | This particular paper came up with a list of 14 design patterns which seem to be generally
00:23:56.880 | associated with better CNNs.
00:24:00.160 | This is a great paper to read, it's a really easy read.
00:24:03.920 | You guys won't have any trouble with it at all, I don't think.
00:24:06.160 | It's very short.
00:24:07.160 | But I looked through all these and I just thought these all make a lot of sense.
00:24:12.960 | If you're doing something a bit different and a bit new and you have to design a new
00:24:15.640 | architecture, this would be a great list of patterns to look through.
00:24:22.120 | One more Leslie Smith paper to mention, and it's crazy that this is not more well known,
00:24:28.440 | something incredibly simple, which is a different approach to learning rates.
00:24:32.920 | Rather than just having your learning rate gradually decrease, I'm sure a lot of you
00:24:37.360 | have noticed that sometimes if you suddenly increase the learning rate for a bit and then
00:24:42.200 | suddenly decrease it again for a bit, it kind of goes into a better little area.
00:24:48.280 | What this paper suggests doing is try actually continually increasing your learning rate
00:24:52.880 | and then decreasing it, increasing it, decreasing it, increasing it, decreasing it, something
00:24:57.040 | that they call cyclical learning rates.
00:25:01.400 | And check out the impact, compared to non-cyclical approaches, it is way, way faster and at every
00:25:15.360 | point it's much better.
00:25:18.200 | And this is something which you could easily add.
00:25:20.800 | I haven't seen this added to any library.
00:25:26.320 | If you created the cyclical learning rate annealing class for Keras, many people would
00:25:32.660 | thank you.
00:25:33.660 | Actually many people would have no idea what you're talking about, so you'd also have to
00:25:36.680 | write the blog post to explain why it's good and show them this picture, and then they would
00:25:40.640 | thank you.
00:25:41.640 | I just wanted to quickly add that Keras has lots of callbacks, some of which I've actually
00:25:49.640 | played with.
00:25:50.640 | Yeah, exactly.
00:25:51.640 | It's a great loop with a bunch of callbacks.
00:25:53.520 | And if I was doing this in Keras, what I would do would be I would start with the existing
00:25:59.640 | learning rate annealing code that's there and make small changes until it starts working.
00:26:06.320 | There's already code that does pretty much everything you want.
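A minimal sketch of what such a callback might look like, using the triangular schedule from the paper; every hyperparameter value here is illustrative:

```python
import numpy as np
from keras import backend as K
from keras.callbacks import Callback

class CyclicalLR(Callback):
    # Ramp the learning rate linearly from base_lr up to max_lr and back down,
    # once every 2 * step_size batches.
    def __init__(self, base_lr=0.001, max_lr=0.006, step_size=2000):
        super(CyclicalLR, self).__init__()
        self.base_lr, self.max_lr, self.step_size = base_lr, max_lr, step_size
        self.iteration = 0

    def on_batch_begin(self, batch, logs=None):
        it = float(self.iteration)
        cycle = np.floor(1 + it / (2 * self.step_size))
        x = np.abs(it / self.step_size - 2 * cycle + 1)
        lr = self.base_lr + (self.max_lr - self.base_lr) * max(0., 1. - x)
        K.set_value(self.model.optimizer.lr, lr)
        self.iteration += 1

# model.fit(X, y, callbacks=[CyclicalLR(base_lr=1e-3, max_lr=6e-3, step_size=2000)])
```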
00:26:12.040 | The other cool thing about this paper is that they suggest a fairly automated approach to
00:26:18.520 | picking what the minimum and maximum bounds should be.
00:26:22.560 | And again, this idea of roughly what should our learning rate be is something which we
00:26:26.560 | tend to use a lot of trial and error for.
00:26:29.000 | So check out this paper for a suggestion about how to do it somewhat automatically.
00:26:37.040 | So there's a whole bunch of things that I've zipped over.
00:26:40.240 | Normally I would have dug into each of those and explained it and shown examples in notebooks
00:26:44.420 | and stuff.
00:26:45.420 | So you guys hopefully now have enough knowledge to take this information and play with it.
00:26:54.280 | And what I'm hoping is that different people will play with different parts and come back
00:26:57.200 | and tell us what you find and hopefully we'll get some good new contributions to Keras or
00:27:03.640 | PyTorch or some blog posts or some papers or so forth, or maybe with that DeViSE stuff
00:27:09.960 | or even some new applications.
00:27:13.640 | So the next thing I wanted to look at, again somewhat briefly, is the data science bowl.
00:27:25.960 | And the reason I particularly wanted to dig into the data science bowl is there's a couple
00:27:30.800 | of reasons.
00:27:31.800 | One of them, there's a million reasons, it's a million dollar prize, and there are 23 days
00:27:38.680 | to go.
00:27:40.480 | The second is, it's an extension for everything that you guys have learned so far about computer
00:27:45.720 | vision.
00:27:46.720 | It uses all the techniques you've learned, but then some.
00:27:50.680 | So rather than 2D images, they're going to be 3D volumes.
00:27:56.400 | Rather than being 300x300 or 500x500, they're going to be 512x512x200, so a couple of hundred
00:28:05.960 | times bigger than stuff you've dealt with before.
00:28:09.800 | The stuff we learned in lesson 7 about where are the fish, you're going to be needing to
00:28:14.440 | use a lot of that.
00:28:17.320 | I think it's a really interesting problem to solve.
00:28:20.280 | And then I personally care a lot about this because my previous startup, Enlitic, was the
00:28:25.800 | first organization to use deep learning to tackle this exact problem, which is trying
00:28:32.060 | to find lung cancer in CT scans.
00:28:36.760 | The reason I made that Enlitic's first problem was mainly because I learned that if you can
00:28:42.760 | find lung cancer earlier, the probability of survival is 10 times higher.
00:28:48.040 | So here is something where you can have a real impact by doing this well, which is not
00:28:55.880 | to say that a million dollars isn't a big impact as well.
00:29:00.120 | So let me tell you a little bit about this problem.
00:29:08.100 | Here is a lung.
00:29:30.460 | This is a DICOM file, which is a format that contains two main things.
00:29:35.620 | One is a stack of images and another is some metadata.
00:29:40.620 | Metadata will be things like how much radiation was used, how far away from the chest
00:29:45.580 | was the machine, what brand of machine was it, and so on and so forth.
00:29:53.300 | Most icon viewers just use your scroll wheel to zip through them, so all this is doing
00:29:58.540 | is going from top to bottom or from bottom to top, so you can kind of see what's going
00:30:07.540 | What I might do, I think is more interesting, is to focus on the bit that's going to matter
00:30:23.380 | to you, which is the inside of the lung is this dark area here, and these little white
00:30:30.340 | dots are what's called the vasculature, so the little vessels and stuff going through
00:30:34.860 | the lungs.
00:30:35.860 | And as I scroll through, have a look at this little dot.
00:30:38.820 | You'll see that it seems to move, see how it's moving.
00:30:43.420 | The reason it's moving is because it's not a dot, it's actually a vessel going through
00:30:49.260 | space so it actually looks like this.
00:30:55.820 | And so if you take a slice through that, it looks like lots of dots.
00:31:01.620 | And so as you go through those slices, it looks like that.
00:31:08.180 | And then eventually we get to the top of the lung, and that's why you see eventually the
00:31:13.460 | whole thing goes to white, so that's the edge basically of the organ.
00:31:19.100 | So you can see there are edges on each side, and then there's also bone.
00:31:23.900 | So some of you have been looking at this already over the last few weeks and have often asked
00:31:29.620 | me about how to deal with multiple images, and what I've said each time is don't think
00:31:36.980 | of it as multiple images.
00:31:39.460 | Think of it in the way your DICOM viewer can if you have a 3D button like this one does.
00:31:47.620 | That's actually what we were just looking at.
00:31:51.320 | So it's not a bunch of flat images, it's a 3D volume.
00:31:56.780 | It just so happens that the default way that most DICOM viewers show things is by a bunch
00:32:01.100 | of flat images.
00:32:04.820 | But it's really important that you think of it as a 3D volume, because you're looking in
00:32:10.180 | this space.
00:32:12.100 | Now what are you looking for in this space?
00:32:14.940 | What you're looking for is you're looking for somebody who has lung cancer.
00:32:19.700 | And what somebody who has lung cancer looks like is that somewhere in this space there
00:32:23.700 | is a blob, it could be roughly a spherical blob, it could be pretty small, around 5 millimeters
00:32:33.220 | is where people start to get particularly concerned about a blob.
00:32:37.820 | And so what that means is that for a radiologist, as they flick through a scan like this, is
00:32:43.420 | that they're looking for a dot which doesn't move, but which appears, gets bigger and then
00:32:51.340 | disappears.
00:32:52.340 | That's what a blob looks like.
00:32:54.860 | So you can see why radiologists very, very, very often miss nodules in lungs.
00:33:03.700 | Because in all this area, you've got to have extraordinary vision to be able to see every
00:33:09.060 | little blob appear and then disappear again.
00:33:11.940 | And remember, the sooner you catch it, you get a 10x improved chance of survival.
00:33:20.300 | And generally speaking, when a radiologist looks at one of these scans, they're not looking
00:33:26.380 | for nodules, they're looking for something else.
00:33:29.220 | Because lung cancer, at least in the earlier stages, is asymptomatic, it doesn't cause
00:33:34.300 | you to feel different.
00:33:36.580 | So it's like something that every radiologist has to be thinking about when they're looking
00:33:39.620 | for pneumonia or whatever else.
00:33:43.820 | So that's the basic idea is that we're going to try and come up with in the next half hour
00:33:49.220 | or so some idea about how would you find these blobs, how would you find these nodules.
00:33:55.660 | So each of these things generally is about 512x512 by a couple of hundred.
00:34:08.260 | And the equivalent of a pixel in 3D space is called a voxel.
00:34:12.340 | So a voxel simply means a pixel in 3D space.
00:34:16.740 | So this here is rendering a bunch of voxels.
00:34:25.920 | Each voxel in a CT scan is a 12-bit integer, if memory serves me correctly.
00:34:35.700 | And a computer screen can only show 8 bits of grayscale, and furthermore your eyes can't
00:34:42.660 | necessarily distinguish between all those grayscale perfectly anyway.
00:34:46.780 | So what every DICOM viewer provides is something called a windowing adjustment.
00:34:53.820 | So a windowing adjustment, here is the default window, which is designed to basically map
00:35:01.880 | some subset of that 12-bit space to the screen so that it highlights certain things.
00:35:08.780 | And so the units CT scans use are called Hounsfield units, and certain ranges of Hounsfield units
00:35:19.380 | tell you that something is some particular part of the body.
00:35:23.580 | And so you can see here that the bone is being lit up.
00:35:26.860 | So we've selected an image window which is designed to allow us to see the bone clearly.
00:35:33.020 | So what I did when I opened this was I switched it to the CT chest preset, where some kind person
00:35:41.040 | has already figured out what's the best window to
00:35:59.460 | see the nodules and vasculature in lungs.
00:36:02.460 | Now for you working with deep learning, you don't have to care about that, because of
00:36:07.860 | course the deep learning algorithm can see 12 bits perfectly well.
00:36:12.460 | So nothing really to worry about.
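For completeness, here is a sketch of what that windowing operation amounts to in NumPy; the level/width values are a typical lung window, not anything taken from the competition data:

```python
import numpy as np

def apply_window(hu_image, level=-600, width=1500):
    # Map a chosen band of Hounsfield units onto the 0-255 range a screen can show
    lo, hi = level - width / 2., level + width / 2.
    img = np.clip(hu_image, lo, hi)
    return ((img - lo) / (hi - lo) * 255).astype(np.uint8)
```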
00:36:17.020 | So one of the challenges with dealing with this Data Science Bowl data is that there's
00:36:24.700 | a lot of preprocessing to do, but the good news is that there's a couple of fantastic
00:36:33.540 | tutorials.
00:36:34.540 | So hopefully you've found out by now that on Kaggle, if you click on the kernels button,
00:36:40.260 | you basically get to see people's IPython notebooks where they show you how to do certain
00:36:46.120 | things.
00:36:47.120 | In this case, this guy has got a full preprocessing tutorial showing how to load DICOM, convert
00:36:54.320 | the values to Hounsfield units, and so forth.
00:36:57.580 | I'll show you some of these pieces.
00:37:01.100 | So DICOM you will load with some library, probably with PyDICOM.
00:37:09.500 | So PyDICOM is a library that's a bit like Pillow or PIL; instead of an Image.open, this is more
00:37:15.660 | like a dicom.open, and you end up with a 3D file, and of course the metadata.
00:37:24.300 | You can see here using the metadata, image_position, lice_location.
00:37:31.180 | So the metadata comes through with just attributes of the Python object.
00:37:37.560 | This person is very kindly provided to you a list of the Hounsfield units for each of
00:37:48.060 | the different substances.
00:37:52.380 | So he shows how to translate stuff into that range.
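A condensed sketch of that preprocessing, roughly in the spirit of the Kaggle kernel; the function and variable names here are illustrative rather than the kernel's own:

```python
import os
import numpy as np
import dicom  # the pydicom package; in newer versions: import pydicom and use pydicom.dcmread

def load_scan_hu(folder):
    # Load every slice of one scan and sort them into order along the z-axis
    slices = [dicom.read_file(os.path.join(folder, f)) for f in os.listdir(folder)]
    slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))
    volume = np.stack([s.pixel_array for s in slices]).astype(np.int16)
    # Rescale raw pixel values to Hounsfield units: HU = pixel * slope + intercept
    slope = float(slices[0].RescaleSlope)
    intercept = float(slices[0].RescaleIntercept)
    return volume * slope + intercept, slices
```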
00:38:00.100 | And so it's great to draw lots of pictures.
00:38:02.580 | So here is a histogram for this particular picture.
00:38:06.260 | So you can see that most of it is air, and then you get some bone and some lung as the
00:38:13.860 | actual slice.
00:38:20.580 | So then the next thing to think about is really voxel spacing, which is as you move across
00:38:30.540 | one bit of x-axis or one bit of y-axis or from slice to slice, how far in the real world
00:38:37.840 | are you moving?
00:38:39.820 | And one of the annoying things about medical imaging is that different kinds of scanners
00:38:43.980 | have different distances between those slices, called the slice thickness and different meanings
00:38:51.100 | of the x and y-axis.
00:38:54.060 | Luckily that stuff is all in the DICOM metadata.
00:38:56.560 | So the resampling process means taking those lists of slices and turning it into something
00:39:04.620 | where every step in the x-direction or the y-direction or the z-direction equals 1mm
00:39:10.460 | in the real world.
00:39:12.380 | And so it would be very annoying for your deep learning network if your different lung
00:39:16.780 | images were squished by different amounts, especially if you didn't give it the metadata
00:39:21.820 | about how much this was being squished.
00:39:25.620 | So that's what resampling does, and as you can see it's using the slice thickness and
00:39:29.740 | the pixel spacing to make everything nice and even.
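Continuing the sketch above, the resampling step might look something like this, again with illustrative names and building on the hypothetical load_scan_hu from before:

```python
import numpy as np
import scipy.ndimage

def resample_to_1mm(volume, slices):
    # mm per voxel in (z, y, x) order, read from the DICOM metadata
    spacing = np.array([float(slices[0].SliceThickness)] +
                       [float(x) for x in slices[0].PixelSpacing])
    zoom_factors = spacing / 1.0   # stretch so that one voxel step == 1mm
    return scipy.ndimage.zoom(volume, zoom_factors, order=1)
```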
00:39:35.900 | So there are various ways to do 3D plots, and it's always a good idea to do that.
00:39:45.760 | And then something else that people tend to do is segmentation.
00:39:51.300 | Depending on time, you may or may not get around to looking more at segmentation in
00:39:55.100 | this part of the course, but effectively segmentation is just another generative model.
00:39:59.340 | It's a generative model where hopefully somebody has given you some things saying this is lung,
00:40:05.420 | this is air, and then you build a model that tries to predict for something else what's
00:40:10.260 | lung and what's air.
00:40:13.300 | Unfortunately for lung CT scans, we don't generally have the ground truth of which bit
00:40:24.500 | is lung and which bit is air.
00:40:25.980 | So generally speaking, in medical imaging, people use a whole lot of heuristic approaches,
00:40:31.140 | so kind of hacky rule-based approaches, and in particular applications of region-growing
00:40:38.380 | and morphological operations.
00:40:41.820 | I find this kind of the boring part of medical imaging because it's so clearly a dumb way
00:40:46.540 | to do things, but deep learning is far too new in this area yet to kind of develop the
00:40:53.780 | data sets that we need to do this properly.
00:40:57.620 | But the good news is that there's a button which I don't think many people notice exists
00:41:03.940 | called 'Tutorial' on the main Data Science Bowl page where these folks from Booz Allen Hamilton
00:41:11.220 | actually show you a complete segmentation approach.
00:41:14.540 | Now it's interesting that they picked U-Net segmentation.
00:41:19.180 | This is definitely the thing about segmentation I would be teaching you guys if we have time.
00:41:25.660 | U-Net is one of these things that outside of the Kaggle world, I don't think that many
00:41:28.780 | people are familiar with, but inside the Kaggle world we know that any time segmentation crops
00:41:34.660 | up, U-Net wins, and it's the best.
00:41:39.140 | More recently there's actually been something called 'DenseNet' for segmentation which takes
00:41:45.500 | U-Net even a little bit further and maybe that would be the new winner for newer Kaggle competitions
00:41:50.300 | when they happen.
00:41:52.420 | But the basic idea here of things like U-Net and DenseNet is that we have a model where
00:42:09.980 | when we do generative models, when I think about doing style transfer, we generally start
00:42:16.020 | with this kind of large image and then we do some downsampling operations to make it
00:42:20.780 | a smaller image and then we do some computation and then we make it bigger again with these
00:42:28.540 | upsampling operations.
00:42:32.940 | What happens in Unet is that there are additional neural network connections made directly from
00:42:41.820 | here to here, and directly from here to here, and here to here, and here to here.
00:42:52.140 | Those connections basically allow it to almost do like a kind of residual learning approach,
00:43:00.340 | like it can figure out the key semantic pieces at really low resolution, but then as it upscales
00:43:07.660 | it can learn what was special about the difference between the downsampled image and the original
00:43:12.860 | image here.
00:43:13.860 | It can kind of learn to add that additional detail at each point.
00:43:20.020 | So U-Net and DenseNet for segmentation are really interesting and I hope we find some
00:43:34.380 | time to get back to them in this part of the course, but if we don't, you can get started
00:43:40.980 | by looking at this tutorial in which these folks basically show you from scratch.
00:43:47.620 | What they try to do in this tutorial is something very specific, which is the detection part.
00:43:54.540 | So what happens in this kind of, like think about the fisheries competition.
00:44:02.220 | We pretty much decided that in the fisheries competition, if you wanted to do really well,
00:44:05.980 | you would first of all find the fish and then you would zoom into the fish and then you
00:44:09.900 | would figure out what kind of fish it is.
00:44:12.860 | Certainly in the right whale competition earlier, that was how that was won.
00:44:17.420 | For this competition, this is even more clearly going to be the approach, because these images
00:44:21.420 | are just far too big to do a normal convolutional neural network.
00:44:24.620 | So we need one step that's going to find the nodule, and then a second step that's going
00:44:29.100 | to zoom into a possible nodule and figure out is this a malignant tumor or something
00:44:37.180 | else, a false positive.
00:44:42.260 | The bad news is that the data science bowl data set does not give you any information
00:44:48.220 | at all for the training set about
00:44:53.100 | where the cancerous nodules are.
00:44:55.940 | I actually wrote a post in the Kaggle forums about this; I just think this is a
00:44:58.940 | terrible idea.
00:45:03.740 | That information actually exists, the dataset they got this from is something called the
00:45:07.900 | National Lung Screening Trial, which actually has that information or something pretty close
00:45:11.860 | to it.
00:45:12.860 | So the fact they didn't provide it, I just think it's horrible for a competition which
00:45:18.700 | can save lives and I can't begin to imagine.
00:45:23.260 | The good news though is that there is a data set which does have this information.
00:45:29.740 | The original data set was called LIDC-IDRI, but interestingly that data set was recently
00:45:37.820 | used for another competition, a non-Kaggle competition called Luna, that competition
00:45:44.100 | is now finished.
00:45:46.140 | And one of the tracks in that competition was actually specifically a false positive
00:45:50.780 | detection track, and then the other track was a find the nodule track basically.
00:45:57.380 | So you can actually go back and look at the papers written by the winners.
00:46:02.300 | They're generally ridiculously short.
00:46:04.540 | Many of them are a single sentence saying that due to a commercial confidentiality agreement
00:46:08.780 | we can't say anything.
00:46:10.420 | But some of them, including the winner of the false positive track, actually provide details.
00:46:17.420 | Surprisingly, they all use deep learning.
00:46:21.100 | And so what you could do, in fact I think what you have to do to do well in this competition
00:46:24.700 | is download the Luna data set, use that to build a nodule detection algorithm.
00:46:30.660 | So the Luna data set includes files saying this lung has nodules here, here, here, here.
00:46:37.660 | So do nodule detection based on that, and then run that nodule detection algorithm on
00:46:43.300 | the Kaggle data set, find the nodules, and then use that to do some classification.
00:46:52.620 | There are some tricky things with that.
00:46:55.380 | The biggest tricky thing is that most of the CT scans in the Luna data set are what's called
00:47:04.780 | contrast studies.
00:47:07.700 | A contrast scan means that the patient had a radioactive dye injected into them, so that
00:47:15.300 | the things that they're looking for are easier to see.
00:47:21.540 | For the National Lung Screening Trial, which is what they use in the Kaggle data set, none
00:47:25.220 | of them use contrast.
00:47:27.420 | And the reason why is that what we really want to be able to do is to take anybody who's
00:47:31.860 | over 65 and has been smoking more than a pack a day for more than 20 years and give them
00:47:36.540 | all a CT scan and find out which ones have cancer, but in the process we don't want to
00:47:40.940 | be shooting them up with radioactive dye and giving them cancer.
00:47:44.940 | So that's why we try to make sure that when we're doing these kind of asymptomatic scans
00:47:52.580 | that they're as low radiation dose as possible.
00:47:57.100 | So that means that you're going to have to think about transfer learning issues, that
00:48:02.820 | the contrast in your image is going to be different between the thing you build on the
00:48:07.300 | Luna data set, the nodule detection, and the Kaggle competition data set.
00:48:18.220 | When I looked at it, I didn't find that that was a terribly difficult problem.
00:48:22.300 | I'm sure you won't find it impossible by any means.
00:48:32.620 | So to finalize this discussion, I wanted to refer to this paper, which I'm guessing not
00:48:41.140 | that many people have read yet.
00:48:43.780 | It's a medical imaging paper.
00:48:46.160 | And what it is, is a non-deep learning approach to trying to find nodules.
00:48:52.980 | So that's where they use nodule segmentation.
00:49:02.940 | Yes, Rachael.
00:49:03.940 | I have a correction from our radiologist saying that dye is not radioactive.
00:49:09.760 | It's just dense, like Isovue-370.
00:49:13.440 | Okay, but there's a reason we don't inject people with a contrast dye.
00:49:17.580 | The issues are contrast-induced nephropathy or allergic reactions.
00:49:22.300 | Yeah, that's what I meant.
00:49:28.020 | I do know though that the NLST studies use a lower amount of radiation than I think
00:49:40.300 | the Luna ones do, so that's another difference.
00:49:46.300 | So this is an interesting idea of how can you find nodules using more of a heuristic
00:49:55.820 | approach.
00:49:57.060 | And the heuristic approach they suggest here is to do clustering, and we haven't really
00:50:03.340 | done any clustering in class yet, so we're going to dig into this in some detail.
00:50:08.060 | Because I think this is a great idea for the kind of heuristics you can add on top of deep
00:50:12.380 | learning to make deep learning work in different areas.
00:50:16.180 | The basic idea here is, as you can see, what they call a five-dimensional mean shift.
00:50:23.100 | They're going to try and find groups of voxels which are similar, and they're going to cluster
00:50:28.100 | them together.
00:50:29.460 | And hopefully it will particularly cluster together things that look like nodules.
00:50:34.300 | So the idea is at the end of this segmentation there will be one cluster for the whole lung
00:50:40.540 | boundary, one cluster for the whole vasculature, and then one cluster for every nodule.
00:50:45.860 | So the five dimensions are x, y and z, which is straightforward, intensity, so the number
00:50:52.460 | of Hounsfield units, and then the fifth one is volumetric shape index, and this is the
00:50:58.620 | one tricky one.
00:51:01.380 | The basic idea here is that it's going to be a combination of the different curvatures
00:51:05.820 | of a voxel based on the Gaussian and mean curvatures.
00:51:11.540 | Now what the paper goes on to explain is that you can use for these the first and second
00:51:18.820 | derivatives of the image.
00:51:20.660 | Now all that basically means is you subtract one voxel from its neighbor, and then you
00:51:26.780 | take that whole thing and subtract one voxel's version of that from its neighbor.
00:51:30.700 | You get the first and second derivatives, so it kind of tells you the direction of the
00:51:38.860 | change of image intensity at that point.
00:51:44.180 | So by getting these first and second derivatives of the image and then you put it into this
00:51:47.940 | formula, it comes out with something which basically tells you how sphere-like the structure
00:51:56.900 | that this voxel seems to be part of is.
00:52:01.240 | So that's great.
00:52:03.380 | If we can basically take all the voxels and combine the ones that are nearby, have a similar
00:52:09.260 | number of Hounsfield units and seem to be of similar kinds of shapes, we're going to
00:52:13.860 | get what we want.
00:52:16.380 | So I'm not going to worry about this bit here because it's very specific to medical imaging.
00:52:21.380 | Anybody who's interested in doing this, feel free to talk on the forum about what this
00:52:26.100 | looks like in Python.
00:52:30.020 | But what I did want to talk about was the meanshift clustering, which is a particular
00:52:35.300 | approach to clustering which they talk about.
00:52:37.780 | "Clustering" is something which for a long time I've been kind of an anti-fan of.
00:53:07.440 | It belongs to this group of unsupervised learning algorithms which always seem to be kind of
00:53:13.980 | looking for a problem to solve.
00:53:15.980 | But I've realized recently there are some specific problems that can be solved well
00:53:19.860 | with them.
00:53:20.860 | I'm going to be showing you a couple, one today and one in Lesson 14.
00:53:28.340 | Clustering algorithms are perhaps easiest to describe by generating
00:53:32.460 | some data and showing what they do.
00:53:35.100 | Here's some generated data.
00:53:36.100 | I'm going to create 6 clusters, and for each cluster I'll create 250 samples.
00:53:42.820 | So I'm going to basically say let's create a bunch of centroids by creating some random
00:53:48.300 | numbers.
00:53:49.300 | So 6 pairs of random numbers for my centroids, and then I'll grab a bunch of random numbers
00:54:00.100 | around each of those centroids and combine them all together and then plot them.
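A sketch of that data generation in NumPy; the exact ranges and spread are illustrative:

```python
import numpy as np

n_clusters, n_samples = 6, 250
centroids = np.random.uniform(-35, 35, (n_clusters, 2))         # 6 pairs of random numbers
data = np.concatenate([np.random.normal(loc=c, scale=5, size=(n_samples, 2))
                       for c in centroids])                      # 250 points around each centroid
```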
00:54:05.860 | And so here you can see each of these X's represents a centroid, so a centroid is just
00:54:11.680 | like the average point for a cluster of data.
00:54:16.740 | And each color represents one cluster.
00:54:20.740 | So imagine if this was showing you clusterings of different kinds of lung tissue, ideally
00:54:32.420 | you'd have some voxels that were colored one thing for nodule, and a bunch of different
00:54:38.900 | color for vasculature and so forth.
00:54:43.540 | We can only show this easily in 2 dimensions, but there's no reason to not be able to imagine
00:54:48.980 | doing this in certainly 5 dimensions.
00:54:52.100 | So the goal of clustering will be to undo this.
00:54:57.220 | Given the data, but not the X's, how can you figure out where the X's were?
00:55:04.380 | And then it's pretty straightforward once you know where the X's are to then find the
00:55:08.260 | closest points to that to assign every data point to a cluster.
00:55:15.700 | The most popular approach to clustering is called K-means.
00:55:24.020 | K-means is an approach where you have to decide up front how many clusters there are.
00:55:32.140 | And what it basically does is there's two steps.
00:55:36.620 | The first one is to guess as to where those clusters might be.
00:55:42.020 | And the really simple way to do that is just to randomly pick a point, and then start
00:55:53.860 | randomly picking points which are as far away as possible from all the previous ones I've
00:56:00.660 | picked.
00:56:01.660 | Let me throw away the first one.
00:56:03.860 | So if I started here, then probably the furthest away point would be down here.
00:56:08.500 | So this would be our starting point for cluster 1, and say what point is the furthest away
00:56:13.540 | from that?
00:56:14.540 | That's probably this one here, so we have a starting point for cluster 2.
00:56:18.460 | What's the furthest point away from both of these?
00:56:20.460 | Probably this one over here, and so forth, so you keep doing that to get your initial
00:56:25.180 | points.
00:56:26.940 | And then you just iteratively move every point, so you basically then say these are the clusters,
00:56:34.700 | let's assume these are the clusters, which cluster does every point belong to, and then
00:56:39.980 | you just iteratively move the points to different clusters a bunch of times.
00:56:46.260 | Now K means, it's a shame it's so popular because it kind of sucks, right?
00:56:54.200 | Sucky thing number 1 is that you have to decide how many clusters there are, and the whole
00:56:59.900 | point is we don't know how many nodules there are.
00:57:04.140 | And then sucky thing number 2 is that without some changes to do something called kernel K-means,
00:57:09.260 | it only works with clusters that are all the same shape, all nicely Gaussian shaped.
00:57:14.420 | So we're going to talk about something way cooler, which I only came across somewhat recently, much
00:57:22.320 | less well known, which is called mean shift clustering.
00:57:25.900 | Now mean-shift clustering is one of these things which seems to spend all of its time
00:57:36.140 | in serious mathematician land.
00:57:43.580 | Whenever I tried to look up something about mean-shift clustering, I kind of started seeing
00:57:48.780 | this kind of thing.
00:57:50.540 | This is like the first tutorial not in the PDF that I could find.
00:57:55.780 | So this is one way to think about mean-shift clustering, another way is a code-first approach,
00:58:03.980 | which is that this is the entire algorithm.
00:58:08.060 | So let's talk about what's going on here.
00:58:10.820 | What are we doing?
00:58:11.820 | At a high level, we're going to do a bunch of loops.
00:58:17.420 | So we're going to do 5 steps.
00:58:19.580 | It would be better if, rather than a fixed 5 steps, I kept doing this until it was stable,
00:58:23.560 | but for now I'm just going to do 5 steps.
00:58:26.460 | And each step I'm going to go through, so our data is x, I'm going to go through, enumerate
00:58:32.540 | through our data.
00:58:33.540 | So small x is the current data point I'm looking at.
00:58:42.260 | Now what I want to do is find out how far away is this data point from every other data
00:58:46.820 | point.
00:58:47.820 | So I'm going to create a vector of distances.
00:58:50.900 | And I'm going to do that with the magic of broadcasting.
00:58:55.020 | So small x is a vector of size 2, this is 2 coordinates, and big X is a matrix of size
00:59:05.520 | n by 2, where n is the number of points.
00:59:08.940 | And thanks to what we've now learned about broadcasting, we know that we can subtract
00:59:12.920 | a matrix from a vector, and that vector will be broadcast across the axis of the matrix.
00:59:19.420 | And so this is going to subtract every element of big X from little x.
00:59:25.300 | And so if we then go ahead and square that, and then sum it up, and then take the square
00:59:31.780 | root, this is going to return a vector of distances of small x to every element of big X.
00:59:44.460 | And the sum here is just summing up the two coordinates.
00:59:51.260 | So that's step 1.
00:59:52.260 | So we now know for this particular data point, how far away is it from all of the other data
00:59:56.820 | points.
00:59:58.220 | Now the next thing we want to do is to -- let's go to the final step.
01:00:04.340 | The final step will be to take a weighted average.
01:00:08.740 | In the final step, we're going to say what cluster do you belong to.
01:00:15.340 | Let's draw this.
01:00:19.220 | So we've got a whole bunch of data points, and we're currently looking at this one.
01:00:34.260 | What we've done is we've now got a list of how far it is away from all of the other data
01:00:39.100 | points.
01:00:43.820 | And the basic idea is now what we want to do is take the weighted average of all of
01:00:49.420 | those data points, weighted by the inverse of that distance.
01:00:53.940 | So the things that are a long way away, we want to weight very small.
01:00:58.580 | And the things that are very close, we want to weight very big.
01:01:02.820 | So I think this is probably the closest, and this is about the second-closest, and this
01:01:10.420 | is about the third-closest.
01:01:11.420 | So assuming these have most of the weight, the average is going to be somewhere about
01:01:15.620 | here.
01:01:18.740 | And so by doing that at every point, we're going to move every point closer to where
01:01:25.540 | its friends are, closer to where the nearby things are.
01:01:29.180 | And so if we keep doing this again and again, everything is going to move until it's right
01:01:33.980 | next to its friends.
01:01:36.340 | So how do we take something which initially is a distance and make it so that the larger
01:01:44.620 | distances have smaller weights?
01:01:47.300 | And the answer is we probably want a shape something like that.
01:01:53.700 | In other words, a Gaussian.
01:02:00.220 | This is by no means the only shape you could choose.
01:02:02.540 | It would be equally valid to choose this shape, which is a triangle, at least half of one.
01:02:17.940 | In general though, note that if we're going to multiply every point by one of these things
01:02:25.620 | and add them all together, it would be nice if all of our weights added to 1, because
01:02:31.500 | then we're going to end up with something that's of the same scale that we start with.
01:02:35.940 | So when you create one of these curves where it all adds up to 1, generally speaking we
01:02:45.540 | call that a kernel.
01:02:49.460 | And I mention this because you will see kernels everywhere.
01:02:53.860 | If you haven't already, now that you've seen it, you'll see them everywhere.
01:02:58.100 | In fact, kernel methods is a whole area of machine learning that in the late 90s basically
01:03:04.740 | took over because it was so theoretically pure.
01:03:08.980 | And if you want to get published in conference proceedings, it's much more important to be
01:03:12.220 | theoretically pure than actually accurate.
01:03:16.460 | So for a long time, kernel methods won out, and neural networks in particular disappeared.
01:03:23.980 | Eventually people realized that accuracy was important as well, and in more recent times
01:03:28.660 | kernel methods are largely disappearing.
01:03:30.940 | But you still see the idea of a kernel coming up very often, because they're super useful
01:03:37.660 | tools to have.
01:03:38.660 | They're basically something that lets you take a number, like in this case a distance,
01:03:43.100 | and turn it into some other number where you can weight everything by that other number
01:03:48.660 | and add them together to get a nice little weighted average.
01:03:52.540 | So in our case, we're going to use a Gaussian kernel.
01:03:57.500 | The particular formula for a Gaussian doesn't matter.
01:04:01.420 | I remember learning this formula in grade 10, and it was by far the most terrifying
01:04:05.840 | mathematical formula I've ever seen, but it doesn't really matter.
01:04:09.540 | For those of you that remember or have seen the Gaussian formula, you'll recognize it.
01:04:14.580 | For those of you that haven't, it doesn't matter.
01:04:17.320 | But this is the function that draws that curve.
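A minimal sketch of such a Gaussian kernel function (the normalising constant is the standard one for a 1D Gaussian; the exact constant matters less here because we divide by the sum of the weights anyway):

```python
import numpy as np

def gaussian(d, bw):
    """Turn a distance d into a weight, using a Gaussian with bandwidth bw."""
    return np.exp(-0.5 * (d / bw) ** 2) / (bw * np.sqrt(2 * np.pi))
```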
01:04:23.460 | So if we take every one of our distances and put it through the Gaussian, we will then
01:04:31.100 | get back a bunch of weights that add to 1.
01:04:35.260 | So then in the final step, we can multiply every one of our data points by that weight,
01:04:44.380 | add them up, and divide by the sum of the weights.
01:04:46.860 | In other words, take a weighted average.
01:04:50.060 | You'll notice that I had to be a bit careful about broadcasting here, because I needed
01:04:56.860 | to add a unit axis at the end of my dimensions, not at the start - by default, broadcasting
01:05:07.300 | adds unit axes at the beginning.
01:05:10.200 | That's why I had to do an expand_dims.
01:05:13.220 | If you're not clear on why this is, then that's a sign you definitely need to do some more
01:05:18.500 | playing around with broadcasting.
01:05:21.040 | So have a fiddle with that during the week.
01:05:24.180 | You're free to ask if you're not clear after you've experimented.
01:05:27.380 | But this is just a weighted sum.
01:05:29.000 | So this is just doing sum of weights times x divided by sum of weights.
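Putting the pieces together, one mean-shift update for a single point might look something like this sketch (gaussian is the kernel above, and the bandwidth value is just a placeholder):

```python
import numpy as np

def gaussian(d, bw):
    return np.exp(-0.5 * (d / bw) ** 2) / (bw * np.sqrt(2 * np.pi))

def shift_one_point(x, X, bw=2.5):
    """Move a single point x towards the weighted mean of all points X."""
    dist = np.sqrt(((X - x) ** 2).sum(axis=1))   # distance to every point
    weight = gaussian(dist, bw)                  # closer points get bigger weights
    # Weighted average: sum(w_i * X_i) / sum(w_i). expand_dims gives the
    # (n,) weights a trailing unit axis so they broadcast against the (n, 2) points.
    return (np.expand_dims(weight, 1) * X).sum(0) / weight.sum()
```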
01:05:42.900 | Importantly there's a nice little thing that we can pass to a Gaussian, which is the thing
01:05:48.300 | that decides does it look like the thing I just drew, or does it look like this, or does
01:05:56.500 | it look like this.
01:05:59.820 | All of those things add up to one.
01:06:00.820 | They all have the same area underneath, but they're very different shapes.
01:06:04.900 | If we make it look like this, then what that's going to do is create a lot more clusters,
01:06:10.780 | because things that are really close to it are going to have really high weights, and
01:06:15.420 | everything else is going to have a tiny weight, to the point of being meaningless.
01:06:18.660 | Whereas if we use something like this, we're going to have many fewer clusters, because even
01:06:23.180 | stuff that's further away is going to get a meaningful weight in the sum.
01:06:31.140 | The choice that you use for the kernel width - that's got lots of different names you can
01:06:38.580 | use. Here I've used bw, meaning bandwidth.
01:06:44.540 | There's actually some cool ways to choose it.
01:06:46.940 | One simple way to choose it is to find out which size of bandwidth covers 1/3 of the
01:06:55.740 | data in your dataset.
01:06:58.020 | I think that's the approach that Scikit-learn uses.
01:07:01.780 | So there are some ways that you can automatically figure out a bandwidth, and that's just one of the
01:07:09.460 | very nice things about mean shift.
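If you want to try an automatic choice, scikit-learn has a helper for this; roughly, the quantile argument says what fraction of the data each point's neighbourhood should cover (the data here is a placeholder):

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth

X = np.random.randn(1000, 2)              # placeholder data
bw = estimate_bandwidth(X, quantile=0.3)  # roughly 30% of the data per neighbourhood
```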
01:07:12.340 | So we just go through a bunch of times, five times, and each time we replace every point
01:07:17.580 | with its weighted average weighted by this Gaussian kernel.
01:07:25.900 | So when we run this 5 times, it takes a second, and here's the results.
01:07:33.060 | I've offset everything by 1 just so that we can see it, otherwise it would be right on
01:07:37.220 | top of the x.
01:07:38.220 | So you can see that for nearly all of them, it's in exactly the right spot, whereas for
01:07:43.180 | this cluster, let's just remind ourselves what that cluster looked like, these two clusters,
01:07:49.340 | this particular bandwidth, it decided to create one cluster for them rather than two.
01:07:56.180 | So this is kind of an example, whereas if we decreased our bandwidth, it would create
01:08:00.780 | two clusters.
01:08:01.780 | There's no one right answer as to whether that should be one cluster or two.
01:08:08.500 | So one challenge with this is that it's kind of slow.
01:08:14.220 | So I thought let's try and accelerate it for the GPU.
01:08:20.940 | Because mean shift's not very cool, nobody seems to have implemented it for the GPU yet,
01:08:26.540 | or maybe it's just not a good idea, so I thought I'd use PyTorch.
01:08:30.260 | And the reason I use PyTorch is because writing PyTorch really just
01:08:34.780 | feels like writing NumPy - everything happens straight away.
01:08:37.700 | So I really hoped that I could take my original code and make it almost the same.
01:08:45.300 | And indeed, here is the entirety of mean shift in PyTorch.
01:08:53.500 | So that's pretty cool.
01:08:55.380 | You can see anywhere I used to have np, it now says torch: np.array is now torch.FloatTensor,
01:09:05.460 | np.sqrt is now torch.sqrt, everything else is almost the same.
01:09:11.660 | One issue is that torch doesn't support broadcasting.
01:09:18.000 | So we'll talk more about this shortly in a couple of weeks, but basically I decided that's
01:09:23.020 | not okay, so I wrote my own broadcasting library for PyTorch.
01:09:26.820 | So rather than saying little x minus big X, I used sub for subtract.
01:09:31.660 | That's the subtract from my broadcasting library.
01:09:34.900 | If you're curious, check out TorchUtils and you can see my broadcasting operations there.
01:09:40.300 | But basically if you use those, you can see, with barely any modification, it will do all the
01:09:45.300 | broadcasting for you.
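For reference, broadcasting was added to PyTorch itself not long after this, so today a direct port is a sketch like the following (no separate broadcasting library needed; the bandwidth value is arbitrary):

```python
import math
import torch

def gaussian(d, bw):
    return torch.exp(-0.5 * (d / bw) ** 2) / (bw * math.sqrt(2 * math.pi))

def meanshift_step(X, bw=2.5):
    """One pass over the data, one point at a time (slow: a Python for loop)."""
    for i, x in enumerate(X):
        dist = torch.sqrt(((X - x) ** 2).sum(1))   # (n, 2) - (2,) broadcasts
        weight = gaussian(dist, bw)
        X[i] = (weight.unsqueeze(1) * X).sum(0) / weight.sum()
    return X

X = torch.randn(1000, 2)
for _ in range(5):
    X = meanshift_step(X)
```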
01:09:51.340 | So as you can see, this looks basically identical to the previous code, but it takes longer.
01:10:02.140 | So that's not ideal.
01:10:05.300 | One problem here is that I'm not using CUDA.
01:10:08.080 | So I could easily fix that by adding .cuda() to my x, but that made it slower still.
01:10:15.660 | The reason why is that all the work is being done in this for loop, and PyTorch doesn't
01:10:22.180 | accelerate for loops.
01:10:24.760 | Each run through a for loop in PyTorch is basically calling a new CUDA kernel each time
01:10:32.980 | you're going through.
01:10:33.980 | It takes a certain amount of time to even launch a CUDA kernel.
01:10:38.060 | When I'm saying CUDA kernel, this is a different usage of the word kernel.
01:10:44.780 | In CUDA, kernel refers to a little piece of code that runs on the GPU.
01:10:50.780 | So it's launching a little GPU process every time through the for loop.
01:10:54.980 | It takes quite a bit of time, and it's also having to copy data all over the place.
01:10:59.820 | So what I then tried to do was to make it faster.
01:11:14.140 | The trick is to do it by minibatch.
01:11:18.580 | So each time through the loop we don't want to do just one piece of data, but a minibatch
01:11:26.500 | of data.
01:11:27.980 | So here are the changes I made.
01:11:31.180 | The main one was that my for_i now jumps through one batch size at a time.
01:11:39.580 | So I'm not going to go 0, 1, 2, 3, but 0, 16, 32, and so on.
01:11:47.260 | So I now need to create a slice which is from i to i plus batch size, unless we've gone
01:11:58.380 | past the end of the data, in which case it's just as far as the end.
01:12:02.540 | So this is going to refer to the slice of data that we're interested in.
01:12:08.100 | So what we can now do is say x with that slice to grab back all of the data in this minibatch.
01:12:15.800 | And so then I had to create a special version of, I can't say subtract anymore, I need to
01:12:22.020 | think carefully about the broadcasting operations here.
01:12:24.620 | I'm going to return a matrix, let's say batch size is 32, I'm going to have 32 rows, and
01:12:31.620 | then let's say n is 1000, it will be 1000 columns.
01:12:35.100 | That shows me how far away each thing in my batch is from every piece of data.
01:12:40.860 | So when we do things a batch at a time, we're basically adding another axis to all of your
01:12:46.780 | tensors.
01:12:47.780 | Suddenly now you have a batch axis all the time.
01:12:51.180 | And when we've been doing deep learning, that's been something I think we've got pretty used to.
01:12:55.780 | The first axis in all of our tensors has always been a batch axis.
01:13:00.380 | So now we're writing our own GPU-accelerated algorithm.
01:13:02.980 | Can you believe how crazy this is?
01:13:06.180 | Two years ago, if you Googled for "k-means CUDA" or "k-means GPU", you got back research
01:13:15.180 | studies where people wrote papers about how to put these algorithms on GPUs, because it
01:13:21.180 | was hard.
01:13:23.980 | And here's a page of code that does it.
01:13:27.460 | It's crazy that this is possible, but here we are.
01:13:30.300 | We have built a batch-by-batch GPU-accelerated mean shift algorithm.
01:13:38.100 | The basic distance formula is exactly the same.
01:13:40.820 | I just have to be careful about where I add unsqueeze, which is the same as expand_dims in NumPy.
01:13:48.420 | So I just have to be careful about where I add my unit axes, add it to the first axis
01:13:52.780 | of one bit and the second axis of the other bit.
01:13:55.340 | So that's going to subtract every one of these from every one of these and return a matrix.
01:14:01.700 | Again, this is a really good time to look at this and think why does this broadcasting
01:14:09.020 | work, because this is getting more and more complex broadcasting.
01:14:15.540 | And hopefully you can now see the value of broadcasting.
01:14:19.860 | Not only did I get to avoid writing a pair of nested for loops here, but I also got to
01:14:26.580 | do this all on the GPU in a single operation, so I've made this thousands of times faster.
01:14:35.600 | So here is a single operation which does that entire matrix subtraction.
01:14:39.780 | Yes, Rachel?
01:14:40.780 | I was just going to suggest that we take a break soon, it's ten till eight.
01:14:51.140 | So that's our batchwise distance function.
01:14:54.580 | We then chuck that into a Gaussian, and because this is just element-wise, the Gaussian function
01:15:00.240 | hasn't changed at all, so that's nice.
01:15:04.900 | And then I've got my weighted sum, and then divide that by the sum of weights.
01:15:13.460 | So that's basically the algorithm.
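Here is a sketch of that batched version in current PyTorch (the function and variable names are mine, and the batch size and bandwidth are arbitrary); the key point is that all the heavy lifting inside the loop is a handful of big tensor operations, so moving X to the GPU actually pays off:

```python
import math
import torch

def gaussian(d, bw):
    return torch.exp(-0.5 * (d / bw) ** 2) / (bw * math.sqrt(2 * math.pi))

def dist_b(a, b):
    """Pairwise distances: a is (m, 2), b is (n, 2) -> (m, n)."""
    # unsqueeze on different axes: (m, 1, 2) - (1, n, 2) broadcasts to (m, n, 2)
    return torch.sqrt(((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(2))

def meanshift_batched(X, bw=2.5, bs=500, steps=5):
    X = X.clone()
    n = X.shape[0]
    for _ in range(steps):
        for i in range(0, n, bs):
            s = slice(i, min(i + bs, n))            # current minibatch
            weight = gaussian(dist_b(X[s], X), bw)  # (bs, n) weights
            # Weighted average of all points for each point in the batch:
            # (bs, n) @ (n, 2) -> (bs, 2), then divide by each row's weight sum.
            X[s] = weight @ X / weight.sum(1, keepdim=True)
    return X

X = torch.randn(1000, 2)       # X = X.cuda() to run it on the GPU
X = meanshift_batched(X)
```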
01:15:15.500 | So previously for my NumPy version, it took a second, now it's 48ms, so we've just sped
01:15:25.060 | that up by about 20 times.
01:15:28.060 | Yes, Rachel?
01:15:29.060 | Question - I get how batching helps with locality and cache, but I do not quite follow how it
01:15:34.620 | helps otherwise, especially with respect to accelerating the for loop.
01:15:40.020 | So in PyTorch, the for loop is not run on the GPU.
01:15:47.620 | The for loop is run on your CPU, and your CPU goes through each step of the for loop
01:15:52.940 | and calls the GPU to say do this thing, do this thing, do this thing.
01:15:58.140 | So this is not to say you can't accelerate this in TensorFlow in a similar way.
01:16:05.740 | In TensorFlow, there's tf.while_loop and stuff like that where you can actually do
01:16:10.580 | GPU-based loops.
01:16:13.020 | Even still, if you do it entirely in a loop in Python, it's going to be pretty difficult
01:16:17.700 | to get this performance.
01:16:18.700 | But particularly in PyTorch, it's important to remember in PyTorch, your loops are not
01:16:23.860 | optimized.
01:16:24.860 | It's what you do inside each loop that's optimized.
01:16:27.700 | We have another question.
01:16:30.500 | Some of the math functions are coming from Torch and others are coming from the Python
01:16:34.500 | library.
01:16:35.700 | What is the difference when you use the Python math library?
01:16:38.460 | Does that mean the GPU is not being used?
01:16:45.460 | You'll see that I use math there - math.pi is a constant, and then math.sqrt of 2 times pi is
01:16:51.220 | a constant.
01:16:56.940 | You don't need to use the GPU to calculate a constant, obviously.
01:16:56.940 | We only use Torch for things that are running on a vector or a matrix or a tensor of data.
01:17:10.020 | So let's have a break.
01:17:11.260 | We'll come back in 10 minutes, so that would be 2 past 8, and we'll talk about some ideas
01:17:16.340 | I have for improving mean shift, which maybe you guys will want to try during the week.
01:17:24.840 | The idea here is that we figure there are two steps to figuring out where the
01:17:36.980 | nodules are, if any, in something like this.
01:17:41.500 | Step number one is to find the things that may be kind of nodule-ish, zoom into them
01:17:46.900 | and create a little cropped version.
01:17:50.020 | Step two would be where your learning particularly comes in, which is to figure out is that cancerous
01:17:56.540 | or not.
01:18:00.060 | Once you've found a nodule-ish thing, by far the biggest
01:18:07.740 | driver of whether or not something is a malignant cancer is how big it is.
01:18:15.160 | It's actually pretty straightforward.
01:18:17.620 | The other thing particularly important is how kind of spidery it looks.
01:18:24.860 | If it looks like it's kind of evilly going out to capture more territory, that's probably
01:18:29.620 | a bad sign as well.
01:18:32.740 | So the size and the shape are the two things that you're going to be wanting to try and
01:18:36.380 | find, and obviously that's a pretty good thing for a neural net to be able to do.
01:18:41.740 | You probably don't need that in the examples of it.
01:18:47.140 | When you get to that point, there was obviously a question about how to deal with the 3D aspect
01:18:51.620 | here.
01:18:52.620 | You can just create a 3D convolutional neural net.
01:18:56.860 | So if you had like a 10x10x10 space, that's obviously not going to be too big; at
01:19:05.380 | 20x20x20, you might be okay - so kind of think about how big a volume you can create.
01:19:10.940 | There's plenty of papers around on 3D convolutions, although I'm not sure if you even need one
01:19:16.340 | because it's just a convolution in 3D.
01:19:21.500 | The other approach that you might find interesting to think about is something called triplanar.
01:19:27.580 | What triplanar means is that you take a slice through the x and the y and the z axis, and
01:19:36.140 | so you basically end up with three images.
01:19:38.900 | One is a slice through x, one through y, and one through z, and then you can kind of treat those as different channels
01:19:45.020 | if you like.
01:19:46.020 | You can probably use pretty standard neural net libraries that expect three channels.
01:19:53.740 | So there's a couple of ideas for how you can deal with the 3D aspect of it.
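For the 3D convolution idea, a minimal Keras sketch might look like this (the 20x20x20 crop size, filter counts and layer layout are all just assumptions for illustration; a triplanar variant would instead feed three orthogonal 2D slices as three channels into an ordinary 2D convnet):

```python
from keras.models import Sequential
from keras.layers import Conv3D, MaxPooling3D, Flatten, Dense

# Hypothetical input: a 20x20x20 crop around a candidate nodule, one channel.
model = Sequential([
    Conv3D(16, (3, 3, 3), activation='relu', input_shape=(20, 20, 20, 1)),
    MaxPooling3D((2, 2, 2)),
    Conv3D(32, (3, 3, 3), activation='relu'),
    MaxPooling3D((2, 2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),   # malignant or not
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```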
01:20:03.060 | I think using the Luna dataset as much as possible is going to be a good idea because
01:20:09.100 | you really want something that's pretty good at detecting nodules before you start putting
01:20:13.580 | it onto the Kaggle dataset because the other problem with the Kaggle dataset is it's ridiculously
01:20:17.660 | small.
01:20:18.660 | And again, there's no reason for it, there are far more cases in NLST than they've provided
01:20:25.460 | to Kaggle, so I can't begin to imagine why they went to all this trouble and a million
01:20:29.700 | dollars of money for something which has not been set up to succeed.
01:20:34.900 | Anyway, that's not our problem, it makes it all a more interesting thing to play with.
01:20:43.220 | But after the competition's finished, if you get interested in it, you'll probably want
01:20:47.540 | to go and download the whole NLST dataset or as much as possible and do it properly.
01:20:53.820 | Actually, there are two questions that I wanted to read.
01:21:01.780 | One is just for the audio stream, there are occasional max volume pops that are really
01:21:06.660 | hard on the ears for remote listeners.
01:21:08.860 | This might not be solvable right now, but something to look into.
01:21:16.700 | And then last class you mentioned that you would explain when and why to use Keras versus
01:21:22.260 | PyTorch.
01:21:23.260 | If you only had brain space for one - in the same way some only have brain space for vi
01:21:30.060 | or Emacs - which would you pick?
01:21:35.340 | So I just reduced the volume a little bit, so let us know if that helps.
01:21:45.500 | I would pick PyTorch, it feels like it kind of does everything Keras does, but gives
01:21:52.500 | you the flexibility to really play around a lot more.
01:22:00.140 | I'm sure you've got brain space for both.
01:22:04.460 | So question: you mentioned there are other datasets of cancerous images that have labels
01:22:09.540 | and proper marks.
01:22:10.940 | Can you train the thing on that dataset?
01:22:13.620 | That was my suggestion, and that's what the tutorial shows how to do.
01:22:25.300 | There's a whole thing, a kernel on Kaggle called candidate generation and Luna16, which shows
01:22:33.620 | how to use Luna to build a nodule finder, and this is one of the highest rated Kaggle
01:22:43.780 | kernels.
01:22:44.780 | We've now used "kernel" in three totally different ways in this lesson - Kaggle kernels, CUDA kernels and kernel methods.
01:22:48.860 | See if we can come up with a fourth.
01:22:58.220 | So this looks very familiar, doesn't it?
01:23:02.620 | So here's a Keras approach to finding lung nodules based on Luna.
01:23:17.540 | So I mentioned an opportunity to improve this mean shift algorithm, and the opportunity
01:23:45.420 | for improvement, when you think about it, it's pretty obvious.
01:23:48.820 | The actual amount of data is huge.
01:23:51.820 | You've got data points all over the place.
01:23:57.540 | The ones that are a long way away, like the weight is going to be so close to zero that
01:24:02.280 | we may as well just ignore them.
01:24:07.740 | The question is, how do we quickly find the ones which are a long way away?
01:24:15.660 | We know the answer to that, we learned it.
01:24:19.060 | It's approximate nearest neighbors.
01:24:22.280 | So what if we added an extra step here, which rather than using x to get the distance to
01:24:33.780 | every data point, instead using approximate nearest neighbors to grab the closest ones,
01:24:41.980 | the ones that are actually going to matter.
01:24:46.260 | So that would basically turn this linear-time piece into a logarithmic-time piece, which would
01:24:58.820 | be pretty fantastic.
01:25:00.180 | So we learned very briefly about a particular approach, which is locality-sensitive hashing.
01:25:07.700 | I think I mentioned also there's another approach which I'm really fond of, called SpillTrees.
01:25:17.460 | I really want us as a team to take this algorithm and add approximate nearest neighbors to it
01:25:25.980 | and release it to the community as the first ever superfast GPU-accelerated, approximate
01:25:37.100 | nearest neighbor-accelerated mean shift clustering algorithm.
01:25:40.300 | I think that would be a really big deal.
01:25:43.660 | If anybody's interested in doing that, I believe you're going to have to implement something
01:25:50.100 | like LSH or SpillTrees in PyTorch, and once you've done that, it should be totally trivial
01:25:57.700 | to add the step that then uses that here.
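As a CPU prototype of that idea, you could get the candidate set from an exact k-nearest-neighbours lookup (scikit-learn here, standing in for a real approximate method like LSH or spill trees); the function name and the choice of k are mine:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def gaussian(d, bw):
    return np.exp(-0.5 * (d / bw) ** 2) / (bw * np.sqrt(2 * np.pi))

def meanshift_knn_step(X, bw=2.5, k=100):
    """One mean-shift pass that only looks at each point's k nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    dist, idx = nn.kneighbors(X)                 # (n, k) distances and indices
    weight = gaussian(dist, bw)                  # (n, k)
    num = (weight[..., None] * X[idx]).sum(1)    # weighted sum of each point's neighbours
    return num / weight.sum(1, keepdims=True)
```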
01:26:00.780 | So if you do that, then if you're interested, I would invite you to team up with me in that
01:26:07.540 | we would then release this piece of software together and author a paper or a post together.
01:26:14.780 | So that's my hope is that a group of you will make that happen.
01:26:20.180 | That would be super exciting because I think this would be great.
01:26:23.860 | We'll be showing people something pretty cool about the idea of writing GPU algorithms today.
01:26:31.980 | In fact, I found just during the break, here's a whole paper about how to write k-means with
01:26:40.740 | CUDA.
01:26:41.740 | It used to be so much work.
01:26:47.140 | This is without even including any kind of approximate nearest neighbor's piece or whatever.
01:26:51.900 | So I think this would be great.
01:26:54.700 | Hopefully that will happen.
01:26:57.900 | And look, it gives the right answer.
01:27:00.380 | I guess to do it properly, we should also be replacing the Gaussian kernel bandwidth
01:27:07.620 | with something that we figure out dynamically rather than have it hard coded.
01:27:16.420 | So, to change tack, we're going to learn about chatbots.
01:27:23.940 | So we're going to start here with Slate.
01:27:27.620 | Facebook thinks it has found the secret to making bots less dumb.
01:27:35.820 | So this talks about a new thing called memory networks, which was demonstrated by Facebook.
01:27:42.340 | You can feed it sentences that convey key plot points in Lord of the Rings and then
01:27:46.580 | ask it various questions.
01:27:49.700 | They published a new paper on arXiv that generalizes the approach.
01:27:56.940 | There was another long article about this in Popular Science, in which they described
01:28:00.740 | its early progress towards a truly intelligent AI.
01:28:05.780 | LeCun is excited about working on memory networks, giving the network the ability to retain information.
01:28:11.300 | You can tell the network a story and have it answer questions.
01:28:15.380 | And so it even has this little gif.
01:28:21.940 | In the article, they've got this little example showing reading a story of Lord of the Rings
01:28:29.580 | and then asking various questions about Lord of the Rings, and it all looks pretty impressive.
01:28:34.220 | So we're going to implement this paper.
01:28:38.100 | And the paper is called End-to-End Memory Networks.
01:28:43.900 | The results in the paper were actually not shown on Lord of the Rings, but on something
01:28:48.860 | called bAbI - I don't know, I'm never quite sure whether it's pronounced "babby" or "baby".
01:28:54.980 | It's a synthetic dataset from a paper called "Towards AI-Complete Question Answering: A
01:29:02.760 | Set of Prerequisite Toy Tasks".
01:29:04.300 | I saw a cute tweet last week explaining the meaning of various different types of titles
01:29:10.660 | of papers, and it's basically saying 'towards' means we've actually made no progress whatsoever.
01:29:17.780 | So we'll take this with a grain of salt.
01:29:21.440 | So these introduce the bAbI tasks, and the bAbI tasks are probably best described by
01:29:27.780 | showing an example.
01:29:29.680 | Here's an example.
01:29:34.100 | So each task is basically a story.
01:29:39.260 | A story contains a list of sentences, a sentence contains a list of words.
01:29:46.700 | At the end of the story is a query to which there is an answer.
01:29:52.620 | So the sentences are ordered in time.
01:29:55.600 | So where is Daniel?
01:29:56.600 | We'll have to go backwards.
01:29:58.660 | This says where John is.
01:30:00.420 | This is where Daniel is - Daniel went to the bathroom - so Daniel is in the bathroom.
01:30:05.900 | So this is what the bAbI tasks look like.
01:30:08.100 | There's a number of different structures.
01:30:12.140 | This is called a one-supporting fact structure, which is to say you only have to go back and
01:30:17.660 | find one sentence in the story to figure out the answer.
01:30:21.620 | We're also going to look at two-supporting-fact stories, which are ones where you're going
01:30:25.500 | to have to look twice.
01:30:31.020 | So reading in these data sets is not remotely interesting, they're just a text file.
01:30:39.940 | We can parse them out.
01:30:43.780 | There's various different text files for the various different tasks.
01:30:46.500 | If you're interested in the various different tasks, you can check out the paper.
01:30:51.580 | We're going to be looking at a single supporting fact and two supporting facts.
01:30:54.800 | They have some with 10,000 examples and some with 1,000 examples.
01:31:01.540 | The goal is to be able to solve every one of their challenges with just 1,000 examples.
01:31:09.260 | This paper is not successful at that goal, but it makes some movement towards it.
01:31:17.300 | So basically, we're going to put that into a bunch of different lists of stories along
01:31:27.180 | with their queries.
01:31:30.100 | We can start off by having a look at some statistics about them.
01:31:34.340 | The first is, for each story, what's the maximum number of sentences in a story?
01:31:38.740 | And the answer is 10.
01:31:39.740 | So Lord of the Rings, it ain't.
01:31:42.860 | In fact, if you go back and you look at the gif, when it says read story, Lord of the
01:31:48.940 | Rings, that's the whole Lord of the Rings.
01:32:00.780 | The total number of different words in this thing is 32.
01:32:07.940 | The maximum length of any sentence in a story is 8.
01:32:13.220 | The maximum number of words in any query is 4.
01:32:17.860 | So we're immediately thinking, what the hell?
01:32:22.740 | Because this was presented by the press as being the secret to making bots less dumb,
01:32:28.380 | and showed us that they took a story and summarized Lord of the Rings, made plot points and asked
01:32:33.660 | various questions, and clearly that's not entirely true.
01:32:39.780 | What they did, if you look at even the stories, the first word is always somebody's name.
01:32:46.940 | The second word is always "moved" or some synonym for move.
01:32:51.580 | There's then a bunch of prepositions, and then the last word is always a place.
01:32:57.700 | So these toy tasks are very, very, very toy.
01:33:00.580 | So immediately we're kind of thinking maybe this is not a step to making bots less dumb
01:33:08.620 | or whatever they said here, a truly intelligent AI.
01:33:13.420 | Maybe it's towards a truly intelligent AI.
01:33:19.700 | So to get this into Keras, we need to turn it into a tensor in which everything is the
01:33:26.940 | same size, so we use pad sequences for that, like we did in the last part of the course,
01:33:35.180 | which will add zeroes to make sure that everything is the same size.
01:33:39.780 | So the other thing we'll do is we will create a dictionary from words to integers to turn
01:33:47.180 | every word into an index, so we're going to turn every word into an index and then pad
01:33:52.420 | them so that they're all the same length.
01:33:56.900 | And then that's going to give us inputs_train, 10,000 stories, each one of 10 sentences,
01:34:06.780 | each one of 8 words.
01:34:09.060 | Anything that's not 10 sentences long is going to get sentences of just zeroes, any sentences
01:34:14.780 | not 8 words long will get some zeroes, we'll get into that.
01:34:18.500 | And the same for the test set, except we've just got 1,000.
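The padding itself is just Keras's pad_sequences plus some zero-filled sentences; a sketch with a couple of made-up, already-tokenised stories (it assumes no story has more than story_maxsents sentences):

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Made-up parsed data: each story is a list of sentences,
# each sentence a list of word indices from the word -> index dictionary.
stories = [[[1, 4, 9, 2], [3, 4, 7, 5]], [[6, 4, 9, 8]]]

story_maxsents, story_maxlen = 10, 8

def pad_story(story):
    # Pad each sentence to story_maxlen words, then pad the story itself
    # with all-zero sentences up to story_maxsents sentences.
    sents = pad_sequences(story, maxlen=story_maxlen)
    padding = np.zeros((story_maxsents - len(sents), story_maxlen), dtype=int)
    return np.concatenate([padding, sents])

inputs_train = np.stack([pad_story(s) for s in stories])
print(inputs_train.shape)   # (number of stories, 10, 8)
```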
01:34:23.860 | So how do we do this?
01:34:28.340 | Not surprisingly, we're going to use embeddings.
01:34:33.580 | Now we've never done this before.
01:34:37.500 | We have to turn a sentence into an embedding, not just a word into an embedding.
01:34:44.580 | So there's lots of interesting ways of turning a sentence into an embedding, but when you're
01:34:50.780 | just doing towards intelligent AI, you don't do any of them.
01:34:54.740 | You instead just add the embeddings up, and that's what happened in this paper.
01:34:59.260 | And if you look at the way it was set up, you can see why you can just add the embeddings up.
01:35:05.780 | Mary, John and Sandra only ever appear in one place - they're always the subject of
01:35:10.060 | the sentence.
01:35:11.060 | The verb is always the same thing, the prepositions are always meaningless, and the last word
01:35:14.820 | is always a place.
01:35:16.260 | So to figure out what a whole sentence says, you can just add up the word concepts.
01:36:22.420 | The order of them doesn't make any difference, there are no "not"s, there's nothing that makes
01:35:25.940 | language remotely complicated or interesting.
01:35:28.460 | So what we're going to do is we're going to create an input for our stories with the number
01:35:34.300 | of sentences and the length of each one.
01:35:37.660 | We're going to take each word and put it through an embedding, so that's what time-distributed
01:35:42.220 | is doing here.
01:35:43.460 | It's putting each word through a separate embedding, and then we do a lambda layer to
01:35:48.980 | add them up.
01:35:52.620 | So here is our very sophisticated approach to creating sentence embeddings.
01:35:57.900 | So we do that for our story.
01:36:00.620 | So we end up with something which rather than being 10 by 8, 10 sentences by 8 words, it's
01:36:08.100 | now 10 by 20, that is 10 sentences by length 20 embedding.
01:36:15.060 | So each one of our 10 sentences has been turned into a length 20 embedding, and we're just
01:36:18.940 | starting with a random embedding.
01:36:20.060 | We're not going to use Word2vec or anything because we don't need the complexity of that
01:36:26.460 | vocabulary model.
01:36:27.700 | We're going to do exactly the same thing for the query.
01:36:32.220 | We don't need to use time-distributed this time, we can just take the query because this
01:36:40.580 | time we have just one query.
01:36:44.460 | So we can do the embedding, sum it up, and then we use reshape to add a unit axis to
01:36:51.460 | the front so that it's now the same basic rank.
01:36:56.300 | We now have one question embedding of length 20.
01:37:02.140 | So we have 10 sentences of the story and one query.
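A sketch of those sentence and query embeddings in Keras (the sizes follow what we just saw - 10 sentences of 8 words, a 32-word vocabulary, length-20 embeddings - but the exact layer layout is just one way to write it):

```python
import keras.backend as K
from keras.layers import Input, Embedding, TimeDistributed, Lambda, Reshape

vocab_size, emb_dim = 32, 20
story_maxsents, story_maxlen, query_maxlen = 10, 8, 4

# Story: 10 sentences of 8 word indices -> 10 sentence embeddings of length 20.
inp_story = Input((story_maxsents, story_maxlen))
emb_story = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
emb_story = Lambda(lambda t: K.sum(t, axis=2))(emb_story)   # (10, 20)

# Query: 4 word indices -> one length-20 embedding, reshaped to (1, 20).
inp_q = Input((query_maxlen,))
emb_q = Embedding(vocab_size, emb_dim)(inp_q)
emb_q = Lambda(lambda t: K.sum(t, axis=1))(emb_q)
emb_q = Reshape((1, emb_dim))(emb_q)                        # (1, 20)
```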
01:37:09.700 | So what is the memory network, or more specifically the more advanced end-to-end memory network?
01:37:15.500 | And the answer is, it is this.
01:37:19.300 | As per usual, when you get down to it, it's less than a page of code to do these things.
01:37:25.620 | Let's draw this before we look at the code.
01:37:35.780 | So we have a bunch of sentences.
01:37:39.720 | Let's just use 4 sentences for now.
01:37:45.780 | So each sentence contained a bunch of words.
01:37:50.300 | We took each word and we turned them into an embedding.
01:37:58.300 | And then we summed all of those embeddings up to get an embedding for that sentence.
01:38:06.900 | So each sentence was turned into an embedding, and they were length 20, that's what it was.
01:38:19.980 | And then we took the query, so this is my query, same kind of idea, a bunch of words
01:38:40.260 | which we got embeddings for, and we added them up to get an embedding for our question.
01:38:49.060 | Okay, so to do a memory network, what we're going to do is we're going to take each of
01:38:58.900 | these embeddings and we're going to combine it, each one, with a question or a query.
01:39:13.060 | And we're just going to take a dot product - so this is the way to draw this: dot product, dot product, okay -
01:39:40.940 | so we're going to end up with 4 dot products, one from each sentence of the story times the
01:39:46.020 | query.
01:39:47.520 | So what does the dot product do?
01:39:50.020 | It basically says how similar two things are: when one thing is big where the other thing
01:39:54.100 | is big, and small where the other thing is small, both of those make the
01:39:58.100 | dot product bigger.
01:40:00.620 | So these are basically going to be 4 scores describing how similar each of our 4 sentences
01:40:06.780 | is to the query.
01:40:11.140 | So that's step 1.
01:40:13.240 | Step 2 is to stick them through a softmax.
01:40:29.340 | So remember the dot product just returns a scalar, so we now have 4 scalars.
01:40:35.900 | And they add up to 1.
01:40:40.780 | And they each are basically related to how similar is the query to each of the 4 sentences.
01:40:50.660 | We're now going to create a totally separate embedding of each of the sentences in our
01:40:57.980 | story by creating a totally separate embedding for each word.
01:41:03.800 | So we're basically just going to create a new random embedding matrix for each word
01:41:09.380 | to start with, sum them all together, and that's going to give us a new embedding, this
01:41:19.660 | one they call C I believe.
01:41:23.860 | And all we're going to do is we're going to multiply each one of these embeddings by the
01:41:41.700 | corresponding softmax output as a weighting, and then just add them all together.
01:41:46.980 | So we're just going to have C1 times S1 plus C2 times S2 and so on, divided by
01:42:03.140 | S1 plus S2 plus S3 plus S4, and that's going to be our final result, which is going to be of length 20.
01:42:21.380 | So this thing is a vector of length 20, and then we're going to take that and put it through
01:42:27.620 | a single dense layer, and we're going to get back the answer.
01:42:39.540 | And that whole thing is the memory network.
01:42:43.140 | It's incredibly simple, there's nothing deep in terms of deep learning, there's almost
01:42:53.180 | no non-linearities, so it doesn't seem like it's likely to be able to do very much, but
01:43:01.020 | I guess we haven't given it very much to do.
01:43:03.900 | So let's take a look at the code version.
01:43:08.100 | >> So in that last step you said the answer, was that really the embedding of the answer,
01:43:16.660 | and then it has to get the reverse lookup?
01:43:18.860 | >> Yeah, it's the softmax of the answer, and then you have to do an argmax.
01:43:22.660 | So here it is: we've got the embedding of the story times the embedding
01:43:36.820 | of the query - the dot product.
01:43:40.980 | We do a softmax.
01:43:43.700 | Softmax works in the last dimension, so I just have to reshape to get rid of the unit
01:43:46.980 | axis, and then I reshape again to put the unit axis back on again.
01:43:51.540 | The reshapes aren't doing anything interesting, so it's just a dot product followed by a softmax,
01:43:57.100 | and that gives us the weights.
01:44:01.300 | So now we're going to take each weight and multiply it by the second set of embeddings,
01:44:08.460 | here's our second set of embeddings, embedding C, and in order to do this, I just used the
01:44:15.100 | dot product again, but because of the fact that you've got a unit axis there, this is
01:44:20.780 | actually just doing a very simple weighted average.
01:44:27.600 | And again, I've reshaped to get rid of the unit axis so that we can stick it through
01:44:31.400 | a dense layer and a softmax, and that gives us our final result.
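Continuing the embedding sketch from above (it reuses inp_story, inp_q, emb_story, emb_q and the size constants, plus those imports), the whole readout is roughly this; the wiring follows the description here, but treat it as a sketch rather than the exact notebook code, and it assumes the answers are integer word indices:

```python
from keras.layers import Activation, Dense, dot
from keras.models import Model

# Attention weights: dot each sentence embedding (10, 20) with the query (1, 20).
x = dot([emb_story, emb_q], axes=2)            # (10, 1)
x = Reshape((story_maxsents,))(x)
x = Activation('softmax')(x)
weights = Reshape((story_maxsents, 1))(x)      # (10, 1), sums to 1

# A second, independent embedding of the story (the "C" embeddings).
emb_c = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
emb_c = Lambda(lambda t: K.sum(t, axis=2))(emb_c)   # (10, 20)

# Weighted average of the C embeddings, then one dense layer to pick the answer word.
x = dot([weights, emb_c], axes=1)              # (1, 20)
x = Reshape((emb_dim,))(x)
out = Dense(vocab_size, activation='softmax')(x)

model = Model([inp_story, inp_q], out)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```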
01:44:37.100 | So what this is effectively doing is it's basically saying, okay, how similar is the
01:44:43.640 | query to each one of the sentences in the story?
01:44:48.140 | Use that to create a bunch of weights, and then these things here are basically the answers.
01:44:53.140 | This is like: if sentence number 1 was where the answer was, then we're going to use this
01:44:58.860 | one, and the same for sentences 2, 3, and 4. Because there's a single linear layer at the very end,
01:45:06.300 | it doesn't really get to do much computation.
01:45:08.940 | It basically has to learn what answer each sentence represents.
01:45:13.900 | And again, this is lucky because in the original dataset, the answer to every question is
01:45:32.500 | the last word of the sentence.
01:45:37.420 | Where is Frodo's ring?
01:45:40.740 | So that's why we just can have this incredibly simple final piece.
01:45:47.340 | So this is an interesting use of Keras, right?
01:45:51.780 | We've created a model which is in no possible way deep learning, but it's a bunch of tensors
01:46:00.620 | and layers that are stuck together.
01:46:02.780 | And so it has some inputs, it has an output, so we can call it a model.
01:46:07.300 | We can compile it with an optimizer and a loss, and then we can fit it.
01:46:13.580 | So it's kind of interesting how you can use Keras for things which don't really use any
01:46:19.740 | of the normal layers in any normal way.
01:46:23.180 | And as you can see, it works for what it's worth.
01:46:26.340 | We solved this problem.
01:46:27.380 | And the particular problem we solved here is the one-supporting-fact problem.
01:46:31.580 | And in fact, it worked in less than 1 epoch.
01:46:38.540 | More interesting is two supporting facts.
01:46:41.620 | Actually before I do that, I'll just point out something interesting, which is we could
01:46:45.360 | create another model, now that this is already trained, which is to return not the final
01:46:50.340 | answer, but the value of the weights.
01:46:54.900 | And so we can now go back and say, for a particular story, what are the weights?
01:47:02.700 | So let's do 'f' rather than the answer.
01:47:06.740 | For this story, for this particular story, the weights are here, and you can see that
01:47:22.900 | the weight for sentence number 2 is 0.98.
01:47:28.380 | So we can actually look inside the model and find out what sentences it's using to answer
01:47:35.780 | this question.
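In the Keras sketch above, that second model is just the same graph with the softmax weights as its output (queries_train here is a hypothetical array of padded queries):

```python
# Shares the trained layers, but returns the attention weights instead of the answer.
f = Model([inp_story, inp_q], weights)
attn = f.predict([inputs_train[:1], queries_train[:1]])
print(attn)   # one weight per sentence; the big one marks the supporting fact
```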
01:47:36.780 | Question - would it not make more sense to concat the embeddings rather than sum them?
01:47:46.680 | Not for this particular problem, because of the way the vocabulary is structured and
01:47:51.340 | the sentences are structured.
01:47:53.460 | It would also have to deal with the variable length of the sentence.
01:47:58.780 | Well, we've used padding to make them the same length.
01:48:05.020 | If you wanted to use this in real life, you would need to come up with a better sentence
01:48:10.500 | embedding, which presumably might be an RNN or something like that, because you need to
01:48:17.300 | deal with things like 'not' and the location of subject and object and so forth.
01:48:23.820 | One thing to point out is that the order of the sentences matters.
01:48:28.020 | And so what I actually did when I preprocessed it was I added a 0 colon, 1 colon, whatever
01:48:33.980 | to the start of each sentence, so that it would actually be able to learn the time order
01:48:40.260 | of sentences.
01:48:41.260 | So this is like another token that I added.
01:48:43.780 | So in case you were wondering what that was, that was something that I added in the preprocessing.
01:48:49.620 | So one nice thing with memory networks is we can kind of look and see if they're not
01:48:53.380 | working, in particular why they're not working.
01:48:57.660 | So multi-hop, so let's now look at an example of a two supporting facts story.
01:49:06.620 | It's mildly more interesting.
01:49:08.440 | We still only have one type of verb with various synonyms and a small number of subjects and
01:49:12.540 | a small number of objects, so it's basically the same.
01:49:17.820 | But now, to answer a question, we have to go through two hops.
01:49:22.340 | So where is the milk?
01:49:25.120 | Let's find the milk.
01:49:27.020 | Daniel left the milk there.
01:49:29.500 | Where is Daniel?
01:49:30.900 | Daniel traveled to the hallway.
01:49:32.780 | Where is the milk?
01:49:33.780 | Hallway.
01:49:34.780 | Alright.
01:49:35.780 | So that's what we have to be able to do this time.
01:49:40.420 | And so what we're going to do is exactly the same thing as we did before, but we're going
01:49:47.260 | to take our whole little model, so do the embedding, reshape, dot, reshape, softmax,
01:49:57.980 | reshape, dot, reshape, dense layer, sum - and we're going to chuck it into a function, and we're
01:50:10.100 | going to call this one hop.
01:50:12.180 | So this whole picture is going to become one hop.
01:50:17.720 | And what we're going to do is we're going to take this and go back and replace the query
01:50:30.460 | with our new output.
01:50:33.220 | So at each step, each hop, we're going to replace the query with the result of our memory network.
01:50:41.440 | And so that way, the memory network can learn to recognize that the first thing I need is
01:50:50.340 | the milk, search back, find milk.
01:50:55.140 | I now have the milk, now you need to update the query to where is Daniel.
01:51:01.140 | Now go back, I'm Daniel.
01:51:03.820 | So the memory network in multi-hop mode basically does this whole thing again and again and
01:51:11.380 | again, replacing the query each time.
01:51:15.760 | So that's why I just took the whole set of steps and chucked it into a single function.
01:51:22.340 | And so then I just go, OK, the response is one_hop of the query and story, then the response is one_hop on
01:51:30.100 | that, and you can keep repeating that again and again and again.
01:51:34.820 | And then at the end, get our output, that's our model, compile, fit.
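A sketch of that multi-hop wiring, again continuing the Keras sketches above; one_hop takes the current query-like state u and the story sentence embeddings A, and returns the new state (details such as the trailing Dense layer are my own choices, not necessarily the notebook's):

```python
def one_hop(u, A):
    # u: current query state, shape (1, 20); A: story sentence embeddings, (10, 20).
    x = dot([A, u], axes=2)
    x = Reshape((story_maxsents,))(x)
    x = Activation('softmax')(x)
    w = Reshape((story_maxsents, 1))(x)

    # Fresh "C" embeddings of the story for this hop.
    C = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
    C = Lambda(lambda t: K.sum(t, axis=2))(C)

    o = dot([w, C], axes=1)          # (1, 20) weighted average
    return Dense(emb_dim)(o)         # becomes the query for the next hop

u = one_hop(emb_q, emb_story)
u = one_hop(u, emb_story)            # second hop for two supporting facts
out = Dense(vocab_size, activation='softmax')(Reshape((emb_dim,))(u))
model = Model([inp_story, inp_q], out)
```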
01:51:45.020 | I had real trouble getting this to fit nicely, I had to play around a lot with learning rates
01:51:53.380 | and batch sizes and whatever else, but I did eventually get it up to 0.999 accuracy.
01:52:05.220 | So this is kind of an unusual class for me to be teaching, because particularly compared
01:52:11.580 | to Part 1 where it was like best practices, clearly this is anything but.
01:52:16.720 | I'm kind of showing you something which was maybe the most popular request, was like teachers
01:52:23.380 | about chatbots.
01:52:24.380 | But let's be honest, who has ever used a chatbot that's not terrible?
01:52:29.980 | And the reason no one's used a chatbot that's not terrible is that the current state-of-the-art
01:52:33.980 | is terrible.
01:52:36.380 | So chatbots have their place, and indeed one of the students of the class has written a really
01:52:43.820 | interesting kind of analysis of this, which hopefully she'll share on the forum.
01:52:50.120 | But that place is really kind of lots of heuristics and carefully set up vocabularies and selecting
01:53:00.460 | from small sets of answers and so forth.
01:53:03.220 | It's not kind of general purpose, here's a story, ask anything you like about it, here
01:53:10.380 | are some answers.
01:53:11.380 | It's not to say we won't get there, I sure hope we will, but the kind of incredible hype
01:53:18.940 | we had around neural Turing machines and memory networks and end-to-end memory networks is kind of,
01:53:25.340 | as you can see, even when you just look at the dataset, what they worked on, it's kind
01:53:29.280 | of crazy.
01:53:32.340 | So that is not quite the final conclusion of this though, because yesterday a paper
01:53:41.660 | came out which showed how to identify buffer overruns in computer source code using memory
01:53:53.220 | networks.
01:53:55.100 | And so it kind of spoilt my whole narrative that somebody seems to have actually used
01:54:02.420 | this technology for something effectively.
01:54:06.220 | And I guess when you think about it, it makes some sense.
01:54:08.700 | So in case you don't know what a buffer overrun is, that's like if you're writing in an unsafe
01:54:13.340 | language, you allocate some memory, it's going to store some result or some input, and you
01:54:21.060 | try to put into that memory something bigger than the amount that you allocated, it basically
01:54:25.900 | spills out the end.
01:54:27.820 | And in the best case, it crashes.
01:54:31.840 | In the worst case, somebody figures out how to get exactly the right code to spill out
01:54:36.580 | into exactly the right place and ends up taking over your machine.
01:54:40.660 | So buffer overruns are horrible things.
01:54:45.100 | And the idea of being able to find them, I can actually see it does look a lot like this
01:54:49.340 | memory network.
01:54:50.340 | You kind of have to see where was that variable kind of set, and then where was the thing
01:54:55.780 | that was set from set, and where was the original thing allocated.
01:54:58.820 | It's kind of like just going back through the source code.
01:55:02.820 | The vocabulary is pretty straightforward, it's just the variables that have been defined.
01:55:09.520 | So that's kind of interesting.
01:55:11.660 | I haven't had a chance to really study the paper yet, but it's no chat bot, but maybe
01:55:18.500 | there is a room for memory networks already after all.
01:55:22.740 | Is there a way to visualize what the neural network has learned for the text?
01:55:27.140 | There is no neural network.
01:55:28.600 | If you mean the embeddings, you can look at the embeddings easily enough.
01:55:36.420 | The whole thing is so simple, it's very easy to look at every embedding.
01:55:40.300 | As I mentioned, we looked at visualizing the weights that came out of the softmax.
01:55:48.140 | We don't even need to look at it in order to figure out what it looked like, based on
01:55:52.620 | the fact that this is just a small number of simple linear steps.
01:55:57.060 | We know that it basically has to learn what each sentence answer can be, you know, sentence
01:56:07.500 | number 3, its answer will always be milk, or its answer will always be hallway, or whatever.
01:56:15.020 | And then, so that's what the C embeddings are going to have to be.
01:56:21.860 | And then the embeddings of the weights are going to have to basically learn how to come
01:56:26.300 | up with what's going to be probably a similar embedding to the query.
01:56:29.500 | In fact, I think you can even make them the same embedding, so that these dot products
01:56:34.260 | basically give you something that gives you similarity scores.
01:56:38.820 | So this is really a very simple, largely linear model, so it doesn't require too much visualizing.
01:56:48.940 | So having said all that, none of this is to say that memory networks are useless, right?
01:56:54.740 | I mean, they're created by very smart people with an impressive pedigree in deep learning.
01:57:01.020 | This is very early, and this tends to happen in popular press, they kind of get overexcited
01:57:09.260 | about things.
01:57:10.260 | Although in this case, I don't think we can blame the press - I think we have to blame
01:57:13.460 | Facebook for creating a ridiculous demo like this.
01:57:15.860 | I mean, this was clearly created to give people the wrong idea, which I find very surprising
01:57:21.340 | from people like Yann LeCun, who normally do the opposite of that kind of thing.
01:57:27.380 | So this is not really the press' fault in this case.
01:57:32.620 | But this may well turn out to be a critical component in chatbots and Q&A systems and whatever
01:57:39.180 | else.
01:57:40.180 | But we're not there yet.
01:57:45.140 | I had a good chat to Stephen Merity the other day, who's a researcher I respect a lot, and
01:57:52.660 | also somebody I like.
01:57:54.820 | I asked him what he thought was the most exciting research in this direction at the moment,
01:58:00.460 | and he mentioned something that I was also very excited about, which is called Recurrent
01:58:06.100 | Entity Networks.
01:58:08.900 | And the Recurrent Entity Network paper is the first to solve all of the bAbI tasks with
01:58:16.740 | 100% accuracy.
01:58:19.740 | Now take of that what you will, I don't know how much that means, they're synthetic tasks.
01:58:26.860 | One of the things that Stephen Merity actually pointed out in the blog post is that even
01:58:32.580 | the basic kind of coding of how they're created is pretty bad.
01:58:35.140 | They have lots of replicas and the whole thing is a bit of a mess.
01:58:40.220 | But anyway, nonetheless this is an interesting approach.
01:58:44.260 | So if you're interested in memory networks, this is certainly something you can look at.
01:58:49.340 | And I do think this is likely to be an important direction.
01:58:54.380 | Having said all that, one of the key reasons I wanted to look at these memory networks
01:58:58.940 | is not only because it was the largest request from the forums for this part of the course,
01:59:03.980 | but also because it introduces something that's going to be critical for the next couple of
01:59:10.020 | lessons, which is the concept of attention.
01:59:17.860 | Attentional models are models where we have to do exactly what we just looked at, which
01:59:31.340 | is basically find out at each time which part of a story to look at next, or which part
01:59:41.780 | of an image to look at next, or which part of a sentence to look at next.
01:59:47.620 | And so the task that we're going to be trying to get at over the next lesson or two is going
01:59:53.860 | to be to translate French into English.
02:00:02.140 | So this is clearly not a toy task.
02:00:04.260 | This is a very challenging task.
02:00:07.540 | And one of the challenges is that in a particular French sentence which has got some bunch of
02:00:12.860 | words, it's likely to turn into an English sentence with some different bunch of words.
02:00:18.100 | And maybe these particular words here might be this translation here, and this one might
02:00:23.020 | be this one, and this one might be this one.
02:00:25.160 | And so as you go through, you need some way of saying which word do I look at next.
02:00:31.900 | So that's going to be the attentional model.
02:00:36.140 | And so what we're going to do is we're going to be trying to come up with a proper RNN
02:00:43.820 | like an LSTM, or a GRU, or whatever, where we're going to change it so that inside the
02:00:51.660 | RNN it's going to actually have some way of figuring out which part of the input to look
02:01:01.020 | at next.
02:01:03.060 | So that's the basic idea of attentional models.
02:01:06.780 | And so interestingly, during this time that memory networks and neural Turing machines
02:01:13.820 | and stuff were getting all this huge amount of press attention very quietly in the background
02:01:20.900 | at exactly the same time, attentional models were appearing as well.
02:01:27.500 | And it's the attentional models for language that have really turned out to be critical.
02:01:34.800 | So you've probably seen all of the press about Google's new neural translation system, and
02:01:42.700 | that really is everything that it's claimed to be.
02:01:46.500 | It really is basically one giant neural network that can translate any pair of languages.
02:01:54.860 | The accuracy of those translations is far beyond anything that's happened before.
02:02:00.580 | And the basic structure of that neural net, as we're going to learn, is not that different
02:02:09.780 | to what we've already learned.
02:02:10.780 | We're just going to have this one extra step, which is attention.
02:02:17.780 | And depending on how interested you guys are in the details of this neural translation
02:02:21.980 | system, it turns out that there are also lots of little tweaks.
02:02:25.180 | The tweaks are kind of around like, OK, you've got a really big vocabulary, some of the words
02:02:34.500 | appear very rarely, how do you build a system that can understand how to translate those
02:02:40.060 | really rare words, for example, and also just kind of things like how do you deal with the
02:02:47.460 | memory issues around having huge embedding matrices of 160,000 words and stuff like that.
02:02:54.740 | So there's lots of details, and the nice thing is that because Google has ended up putting
02:03:01.940 | this thing in production, all of these little details have answers now, and those answers
02:03:08.980 | are all really interesting.
02:03:11.300 | There aren't really on the whole great examples of all of those things put together.
02:03:19.340 | So one of the things interesting here will be that you'll have opportunities to do that.
02:03:25.860 | Generally speaking, the blog posts about these neural translation systems tend to be kind
02:03:30.500 | of at a pretty high level.
02:03:31.500 | They describe roughly how these kind of approaches work, but Google's complete neural translation
02:03:38.100 | system is not out there, you can't download it and see the code.
02:03:44.180 | So we'll see how we go, but we'll kind of do it piece by piece.
02:04:01.060 | I guess one other thing to mention about the memory network is that Keras actually comes
02:04:09.060 | with an end-to-end memory network example in the Keras GitHub, which weirdly enough, when
02:04:18.900 | I actually looked at it, it turns out doesn't implement this at all.
02:04:24.740 | And so even on the single supporting fact thing, it takes many, many epochs and
02:04:30.620 | doesn't get to 100% accuracy.
02:04:34.100 | And I found this quite surprising to discover that once you start getting to some of these
02:04:38.940 | more recent advances or not just a standard CNN or whatever, it's just less and less common
02:04:49.860 | that you actually find code that's correct and that works.
02:04:53.660 | And so this memory network example was one of them.
02:04:56.340 | So if you actually go into the Keras GitHub and look at examples and go and have a look
02:05:01.420 | and download the memory network, you'll find that you don't get results anything like this.
02:05:06.740 | If you look at the code, you'll see that it really doesn't do this at all.
02:05:11.700 | So I just wanted to mention that as a bit of a warning that you're kind of at the point
02:05:18.780 | now where you might want to take with a grain of salt the blog posts you read, or even some papers
02:05:24.700 | that you read. It's well worth experimenting with them, and you should start with the
02:05:30.420 | assumption that you can do it better.
02:05:33.420 | And maybe even start with the assumption that you can't necessarily trust all of the conclusions
02:05:40.480 | that you've read because the vast majority of the time, in my experience putting together
02:05:46.340 | this part of the course, the vast majority of the time, the stuff out there is just wrong.
02:05:52.500 | Even in cases like I deeply respect the Keras authors and the Keras source code, but even
02:05:58.520 | in that case this is wrong.
02:06:02.360 | I think that's an important point to be aware of.
02:06:08.760 | I think we're done, so I think we're going to finish five minutes early for a change.
02:06:12.100 | I think that's never happened before.
02:06:14.020 | So thanks everybody, and so this week hopefully we can have a look at the Data Science Bowl,
02:06:22.780 | make a million dollars, create a new PyTorch Approximate Nearest Neighbors algorithm,
02:06:27.840 | and then when you're done, maybe figure out the next stage for memory networks.
02:06:31.820 | Thanks everybody.
02:06:33.420 | (audience applauds)