Lesson 9: Cutting Edge Deep Learning for Coders
Chapters
0:00 Intro
0:27 Wiki
3:25 Style Transfer
10:25 Reading the Paper
15:00 Notation
21:15 Citations
30:35 Super Resolution
33:05 Paper
39:40 bcolz arrays
40:45 The final network
43:10 Practical considerations
52:08 Deconvolution
So, welcome back everybody. Thanks for coming and I hope you had a good week and had a fun 00:00:05.260 |
time playing around with artistic style. I know I did. I thought I'd show you. So I tried 00:00:15.000 |
a couple of things myself over the week with this artistic style stuff. I've just tried 00:00:21.060 |
a couple of simple little changes which I thought you might be interested in. 00:00:26.000 |
One thing before I talk about the artistic style is I just wanted to point out some of 00:00:34.040 |
the really cool stuff that people have been contributing over the week. If you haven't 00:00:40.080 |
come across it yet, be sure to check out the wiki. There's a nice thing in Discourse where 00:00:47.640 |
you can basically set any post as being a wiki, which means that anybody can edit it. 00:00:52.440 |
So I created this wiki post early on, and by the end of the week we now have all kinds 00:00:57.080 |
of stuff with links to the stuff from the class, a summary of the paper, examples, a 00:01:05.480 |
list of all the links, both snippets, a handy list of steps that are necessary when you're 00:01:15.080 |
doing style transfer, lots of stuff about the TensorFlow Dev Summit, and so forth. Lots 00:01:26.360 |
of other threads. One I saw just this afternoon popped up, in which Greg and XinXin talked 00:01:36.640 |
about trying to summarize what they've learned from lots of other threads across the forum. 00:01:44.880 |
This is a great thing that we can all do is when you look at lots of different things 00:01:50.760 |
and take some notes, if you put them on the forum for everybody else, this is super handy. 00:01:55.960 |
So if you haven't quite caught up on all the stuff going on in the forum, looking at this 00:02:00.240 |
curating lesson 8 experiments thread would be probably a good place to start. 00:02:07.200 |
So a couple of little changes I made in my experiments. I tried thinking about how depending 00:02:16.360 |
on what your starting point is for your optimizer, you get to a very different place. And so 00:02:22.240 |
clearly our convex optimization is not necessarily finding a local minimum, but at least saddle 00:02:28.920 |
points it's not getting out of. So I tried something which was to take the random image 00:02:35.560 |
and just add a Gaussian blur to it. So that makes a random image into this kind of thing. 00:02:43.280 |
And I just found that even the plain style looked a lot smoother, so that was one change 00:02:49.140 |
that I made which I thought worked quite well. Another change that I made just to play around 00:02:54.040 |
with it was that I added a different weight to each of the style layers. And so my zip 00:03:01.920 |
now has a third thing in it, which is the weights, and I just multiply by the weight. So I thought 00:03:07.000 |
that those two things made my little bird look significantly better than my little bird 00:03:11.440 |
looked before, so I was happy with that. You could do a similar thing for content loss. 00:03:16.320 |
You could also maybe add more different layers of content loss and give them different weights 00:03:21.200 |
as well. I'm not sure if anybody's tried that yet. 00:03:32.360 |
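In case it's useful, here is a rough sketch of those two tweaks. It assumes the lesson 8 notebook is already set up, so names like style_layers (the chosen layer outputs), style_targs (their target values) and style_loss (the per-layer loss) are stand-ins for whatever you called them; only the blurred starting image and the per-layer weights are new, and the particular weight values are just made up for illustration.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    # Tweak 1: start the optimisation from a blurred random image rather than raw noise.
    rand_img = np.random.uniform(-2.5, 2.5, (1, 288, 288, 3))   # hypothetical image shape
    init_img = gaussian_filter(rand_img, sigma=[0, 2, 2, 0])    # blur height and width only

    # Tweak 2: give each style layer its own weight; the zip just grows a third element.
    style_wgts = [0.05, 0.2, 0.2, 0.25, 0.3]                    # made-up weights, one per layer
    loss = sum(style_loss(actual, targ) * w
               for actual, targ, w in zip(style_layers, style_targs, style_wgts))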
Yes, Rachel. I have a question in regards to style transfer for cartoons. With cartoons, when we think 00:03:37.560 |
of transferring the style, what we really mean is transferring the contours of the cartoon 00:03:42.440 |
to redraw the content in that style. This is not what style transferring is doing here. 00:03:47.720 |
How might I implement this? I don't know that anybody has quite figured that out, but I'll 00:03:53.080 |
show you a couple of directions that may be useful. 00:03:56.600 |
I've tried selecting activations that correspond with edges, as indicated by one of 00:04:04.040 |
the conv visualization papers, and comparing outputs from specifically those activations. 00:04:10.760 |
So I'll show you some things you could try. I haven't seen anybody do a great job of this 00:04:15.520 |
yet, but here's one example from the forum. Somebody pointed out that this cartoon approach 00:04:21.840 |
didn't work very well with Dr. Seuss, but then when they changed their initial image 00:04:26.760 |
not to be random but to be the picture of the dog, it actually looked quite a lot better. 00:04:30.800 |
So there's one thing you could try. There's some very helpful diagrams that somebody posted 00:04:36.000 |
which is fantastic. I like this summary of what happens if you add versus remove each 00:04:45.400 |
layer. So this is what happens if you remove block 0, block 1, block 2, block 3, and block 00:04:53.360 |
4 to get a sense of how they impact things. You can see for the style that the last layer 00:04:58.560 |
is really important to making it look good, at least for this image. 00:05:06.880 |
One of you had some particularly nice examples. It seems like there's a certain taste involved 00:05:13.200 |
in figuring out what photos go with what style images. I thought this Einstein was terrific. 00:05:17.680 |
I thought this was terrific as well. Brad came up with this really interesting insight 00:05:25.320 |
that starting with this picture and adding a style to it creates this extraordinary shape 00:05:32.160 |
here where, as he points out, you can tell it's a man sitting in the corner, but there's 00:05:36.360 |
less than 10 brush strokes. Sometimes this style transfer does things which are surprisingly 00:05:42.660 |
fantastic. I have no idea what this is even in the photos, so I don't know what it is in 00:05:47.740 |
the painting either. I guess I don't listen to that kind of music enough. 00:05:53.360 |
So there's lots of interesting ideas you can try, and I've got a link here, and you might 00:05:58.480 |
have seen it in the PowerPoint, to a Keras implementation that has a whole list of things 00:06:03.120 |
that you can try. Here are some particular examples. All of these examples you can get 00:06:12.640 |
the details from this link. There's something called chain blurring. For some things, this 00:06:19.160 |
might work well for cartoons. Notice how the matrix doesn't do a good job with the cat 00:06:25.440 |
when you use the approach from the classic paper. But if you use this chain blurring approach, 00:06:31.720 |
it does a fantastic job. So I wonder if that might be one secret to the cartoons. 00:06:38.720 |
Some of you I saw in the forum have already tried this, which is using color preservation 00:06:42.420 |
and luminance matching, which basically means you're still taking the style but you're not 00:06:47.560 |
taking the color. And I think in these particular examples, this is really great results. I think 00:06:53.080 |
it depends a lot on what things you tried with. 00:06:57.720 |
You can go a lot further. For example, you can add a mask and then say just do color 00:07:03.880 |
preservation for one part of the photo. So here the top part of the photo has got color 00:07:08.080 |
preservation and the bottom hasn't. They even show in that code how you can use a mask to 00:07:17.920 |
say one part of my image should not be stylized. This is really crazy. Use masks to decide 00:07:29.240 |
which one of two style images to use and then you can really generate some creative stuff. 00:07:34.320 |
So there's a lot of stuff that you can play with, and you can go beyond this. 00:07:40.740 |
Now, some of the best stuff: you're going to learn a bit more today about how to do some 00:07:44.560 |
of these things better. But just to give an idea, if you go to likemo.net, you can literally 00:07:50.360 |
draw something using four colors and then choose a style image and it will turn your 00:07:57.520 |
drawing into an image. Basically the idea is blue is going to be water and green is going 00:08:02.760 |
to be foliage and I guess red is going to be foreground. There's a lot of good examples 00:08:10.920 |
of this kind of neural doodle they call it online. 00:08:16.600 |
Something else we'll learn more about how to do better today is if you go to affinelayer.com, 00:08:21.240 |
there's a very recent paper called Pix2Pix. We're going to be learning quite a bit in 00:08:26.320 |
this class about how to do segmentation, which is where you take a photo and turn it into 00:08:31.560 |
a colored image, basically saying the horse is here, the bicycle is here, the person is 00:08:35.640 |
here. This is basically doing the opposite. You start by drawing something, saying I want 00:08:40.440 |
you to create something that has a window here and a window sill here and a door here 00:08:44.880 |
and a column there, and it generates a photo, which is fairly remarkable. 00:08:51.840 |
So the stuff we've learned so far won't quite get you to do these two things, but by the 00:08:55.800 |
end of today we should be able to. This is a nice example that I think some folks at 00:09:02.760 |
Adobe built showing that you could basically draw something and it would try and generate 00:09:09.080 |
an image that was close to your drawing where you just needed a small number of lines. Again 00:09:14.000 |
we'll link to this paper from the resources. This actually shows it to you in real time. 00:09:21.400 |
You can see that there's some new way of doing art that's starting to appear where you don't 00:09:27.680 |
necessarily need a whole lot of technique. I'm not promising it's going to turn you into 00:09:32.240 |
a Van Gogh, but you can at least generate images that maybe are in your head in some 00:09:36.000 |
style that's somewhat similar to somebody else's. I think it's really interesting. 00:09:45.880 |
One thing I was thrilled to see is that at least two of you have already written blog 00:09:51.080 |
posts on Medium. That was fantastic to see. So I hope more of you might try to do that 00:09:56.720 |
this week. It definitely doesn't need to be something that takes a long time. I know some 00:10:03.880 |
of you are also planning on turning your forum posts into blog posts, so hopefully we'll 00:10:09.400 |
see a lot more blog posts this week popping up. I know the people who have done that have found it really worthwhile. 00:10:19.240 |
One of the things that I suggested doing pretty high on the list of priorities for this week's 00:10:24.420 |
assignment was to go through the paper knowing what it's going to say. I think it's really 00:10:32.000 |
helpful, when you already know how to do something, to go back over that paper, and 00:10:37.120 |
it's a great way to learn how to read papers, because you already know what it's telling you. This 00:10:41.440 |
is totally the way I learnt to read papers myself. 00:10:47.800 |
So I've gone through and I've highlighted a few key things which as I went through I 00:10:52.720 |
thought were kind of important. In the abstract of the paper, let me ask, how many people 00:10:59.480 |
kind of went back and relooked at this paper again? Quite a few of you, that's great. In 00:11:08.520 |
the abstract, they basically say what is it that they're introducing. It's a system based 00:11:12.040 |
on a deep neural network that creates artistic images of higher perceptual quality. So we're 00:11:16.440 |
going to read this paper and hopefully at the end of it we'll know how to do that. 00:11:20.360 |
Then in the first section, they tell us about the basic ideas. When CNNs are trained on 00:11:27.960 |
object recognition, they developed a representation of an image. Along the processing hierarchy 00:11:34.000 |
of the network, it's transformed into representations that increasingly care about the actual content 00:11:38.480 |
compared to the pixel values. So it describes the basic idea of content loss. Then they 00:11:46.360 |
describe the basic idea of style loss, which is looking at the correlations between the 00:11:52.120 |
different filter responses over the spatial extent of the feature maps. This is one of 00:11:56.640 |
these sentences that read on its own doesn't mean very much, but now that you know how 00:12:01.560 |
to do it, you can read it and you can see what that means, and then when you get to the equations later they'll make sense too. 00:12:09.080 |
So the idea here is that by including the feature correlations, and this answers one 00:12:13.160 |
of the questions that one of you had on the forum, by including feature correlations of 00:12:16.640 |
multiple layers, we obtain a multi-scale representation of the input image. This idea of a multi-scale 00:12:23.440 |
representation is something we're going to be coming across a lot because a lot of this, 00:12:28.040 |
as we discussed last week, a lot of this class is about generative models. One of the tricky 00:12:33.600 |
things with generative models is both to get the general idea of the thing you're trying 00:12:40.000 |
to generate correct, but also get all the details correct. So the details generally 00:12:44.760 |
require you to zoom into a small scale, and getting the big picture correct is about zooming out. 00:12:51.680 |
So one of the key things that they did in this paper was to show you how to create 00:12:55.600 |
a style representation that included multiple resolutions. We now know that where they did 00:13:01.360 |
that was to use multiple style layers, and as we go through the layers of VGG, they gradually 00:13:07.560 |
become lower and lower resolution, larger and larger receptive fields. 00:13:14.720 |
It's always great to look at the figures, and I was thrilled to see that some 00:13:18.240 |
of you were trying to recreate these figures, which actually turned out to be slightly non-trivial. 00:13:25.360 |
So we can see exactly what that figure is, and if you haven't tried it for yourself yet, 00:13:31.080 |
you might want to try it, see if you can recreate this figure. 00:13:35.280 |
It's good to try and find in a paper the key thing that they're showing. In this case, 00:13:47.760 |
they found that representations of content and style in a CNN are separable, and you 00:13:52.400 |
can manipulate both to create new images. So again, hopefully now you can look at that claim and understand exactly what it means. 00:14:08.600 |
You can see that with papers, certainly with this paper, there's often quite a lot of introduction 00:14:13.680 |
that often says the same thing a bunch of different ways. The first time you read it, 00:14:20.360 |
one paragraph might not make sense, but later on they say it a different way and it starts 00:14:24.360 |
to make more sense. So it's worth looking through the introductory remarks, maybe two 00:14:29.960 |
or three times. You can certainly see that here again, where they talk about the different layers. 00:14:40.280 |
Again, showing the results of some experiments. Again, you can see if you can recreate these 00:14:49.520 |
experiments, make sure you understand how to do it. And then there's a whole lot of stuff 00:14:56.320 |
I didn't find that interesting until we get to the section called methods. 00:15:00.400 |
So the methods section is the section that hopefully you'll learn the most from about reading 00:15:03.880 |
papers, now that you've already implemented the thing it describes. I want 00:15:07.920 |
to show you a few little tricks of notation. You do need to be careful of little details 00:15:14.760 |
that fly by. Like here, they used average pooling. That's a sentence which if you weren't 00:15:19.880 |
reading carefully, you could skip over it. We need to use average pooling, not max pooling. 00:15:28.120 |
So they will often have a section which explicitly says, Now I'm going to introduce the notation. 00:15:34.320 |
This paper doesn't. This paper just introduces the notation as part of the discussion. But 00:15:39.800 |
at some point, you'll start getting Greek letters or things with subscripts or whatever. 00:15:46.760 |
Notation starts appearing. And so at this point, you need to start looking very carefully. 00:15:51.620 |
And at least for me, I find I have to go back and read something many times to remember 00:15:56.400 |
what's L, what's M, what's N. This is the annoying thing with math notation, is they're single 00:16:02.720 |
letters. They generally don't have any kind of mnemonic. Often though you'll find that 00:16:07.440 |
across papers in a particular field, they'll tend to reuse the same kind of English and 00:16:13.000 |
Greek letters for the same kinds of things. So M will generally be the number of rows, 00:16:18.720 |
capital M. Capital N will often be the number of columns. K will often be an index that you sum over. 00:16:29.240 |
So here, the first thing which is introduced is x with an arrow on top. So x with an arrow 00:16:35.520 |
on top means it's a vector. It's actually an input image, but they're going to treat it as a vector. 00:16:46.320 |
So our image is called x. And then the CNN has a whole bunch of layers, and every time 00:16:51.800 |
you see something with a subscript or a superscript like this, you need to look at both of the 00:16:55.440 |
two bits because they've both got a meaning. The big thing is like the main object. So 00:17:01.600 |
in this case, capital N refers to the number of filters. And then the subscript or superscript is like an index into an 00:17:08.160 |
array or a tensor. In Python, it's like the thing in square brackets. 00:17:13.560 |
So it has a letter l, which tells you which layer it's for. And so often 00:17:21.320 |
as I read a paper, I'll actually try to write code as I go and put little comments so that 00:17:27.760 |
I'll write layer, square bracket, layer number, close square bracket, and then I have a comment 00:17:33.280 |
after it, say M_l, just to remind myself. So I'm creating the code and mapping it to the letters. 00:17:40.960 |
So there are nl filters. We know from a CNN that each filter creates a feature map, so 00:17:46.600 |
that's why there are nl feature maps. So remember, any time you see the same letter, it means 00:17:51.240 |
the same thing within a paper. Each feature map is of size M_l, and as I mentioned before, 00:18:00.400 |
M and N tend to be numbers of rows and columns. So here it says M_l is the height times the 00:18:07.720 |
width of the feature map. So here we can see they've gone .flat, basically, to make it 00:18:13.720 |
all 1 row. Now this is another piece of notation you'll see all the time. A layer l can be stored 00:18:22.200 |
in a matrix called F, and now the l has gone to the top as a superscript. Same basic idea, just an index. 00:18:29.800 |
So the matrix F is going to contain our activations. And this thing here where it says R with a 00:18:35.920 |
little superscript has a very special meaning. It's referring to basically what the shape 00:18:42.560 |
of this is. So when you see this, the R means that the values are real numbers (floats), and 00:18:49.280 |
the superscript tells you it's a matrix. You can see the x, so it means it's rows by columns. 00:18:53.760 |
So there are N_l rows and M_l columns in this matrix, and there's one matrix 00:19:00.680 |
for each layer, and there's a different number of rows and different number of columns for 00:19:04.480 |
each layer. So you can basically go through and map it to the code that you've already 00:19:09.920 |
written. So I'm not going to read through the whole thing, but there's not very much here, 00:19:16.120 |
and it would be good to make sure that you understand all of it, perhaps with the exception 00:19:21.240 |
of the derivative, because we don't care about derivatives because they get done for us thanks 00:19:25.880 |
to Theano and TensorFlow. So you can always skip the bits about derivatives. 00:19:36.000 |
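As a concrete example of mapping the notation onto code as you read, here's a hypothetical snippet; acts is just a stand-in for the activations of one layer, not a name from the notebook.

    import numpy as np

    # Assume acts holds the activations of layer l, shape (height, width, N_l), channels last.
    acts = np.random.randn(72, 72, 64)   # stand-in data purely for illustration

    height, width, n_l = acts.shape      # N_l: the number of filters in layer l
    m_l = height * width                 # M_l: height times width of each feature map
    F_l = acts.reshape(m_l, n_l).T       # F^l, an N_l x M_l matrix of flattened feature maps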
So then they do the same thing basically describing the Gram matrix. So they show here that the 00:19:43.140 |
basic idea of the Gram matrix is that they create an inner product between the vectorized 00:19:49.320 |
feature map i and j. So vectorized here means turned into a vector, so the way you turn 00:19:55.480 |
a matrix into a vector is to flatten it. So this means the inner product between the flattened feature maps i and j. 00:20:07.640 |
So hopefully you'll find this helpful. You'll see there will be small little differences. 00:20:14.040 |
So rather than taking the mean, they use here the sum, and then they divide back out the 00:20:24.720 |
number of rows and columns to create the mean this way. In our code, we actually put the 00:20:29.980 |
division inside the sum, so you'll see these little differences of how we implement things. 00:20:35.280 |
And sometimes you may see actual meaningful differences, and that's often a suggestion 00:20:41.880 |
of something you can try. So that describes the notation and the method, and that's it. 00:20:56.120 |
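For example, here's a rough sketch of that sum-versus-mean difference, continuing with the hypothetical F_l matrix from the snippet above; the exact normalisation constants are illustrative rather than the paper's.

    def gram_sum(F_l):
        # Closer to the paper: plain inner products (sums) of the flattened feature maps,
        # with the division by the number of elements applied later, inside the loss.
        return F_l @ F_l.T

    def gram_mean(F_l):
        # The notebook-style variant: fold the division in, so each entry is a mean.
        # The two differ only by a constant factor, which just rescales the style weight.
        n_l, m_l = F_l.shape
        return (F_l @ F_l.T) / (n_l * m_l)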
But then very importantly, any time you come across some concept which you're not familiar 00:21:03.080 |
with, it will pretty much always have a reference, a citation. So you'll see there's little numbers 00:21:15.480 |
all over the place. There's lots of different ways of doing these references. But anytime 00:21:20.920 |
you come across something which has a citation, like a new piece of notation or a new concept, 00:21:27.360 |
you don't know what it is. Generally the first time I see it in a paper, I ignore it. But 00:21:32.640 |
if I keep reading and it turns out to be something that actually is important and I can't understand 00:21:37.720 |
the basic idea at all, I generally then put this paper aside, I put it in my to-read file, 00:21:44.440 |
and make the new paper I'm reading the thing that it's citing. Because very often a paper 00:21:49.480 |
is entirely meaningless until you've read one or two of the key papers it's based on. 00:21:56.400 |
Sometimes this can be like reading the dictionary when you don't know the language. It can be layer 00:21:56.400 |
upon layer of citations, and at some point you have to stop. I think you should find 00:22:08.800 |
that the basic set of papers that things refer to is pretty much all stuff you guys know 00:22:15.440 |
at this point. So I don't think you're going to get stuck in an infinite loop. But if you 00:22:19.360 |
ever do, let us know in the forum and we'll try to help you get unstuck. Or if there's 00:22:27.200 |
any notation you don't understand, let us know. One of the horrible things 00:22:27.200 |
about math is that it's very hard to search for. It's not like code, where you can take a function name 00:22:30.840 |
and search for Python plus that function name; instead you've got some weird squiggly shape. So again, 00:22:35.920 |
feel free to ask if you're not sure about that. There is a great Wikipedia page, 00:22:47.540 |
I think it's just called something like 'list of mathematical symbols', which lists pretty much every 00:22:54.440 |
piece of notation. There are various places you can look up notation as well. 00:23:01.720 |
So that's the paper. Let's move to the next step. So I think what I might do is kind of 00:23:17.160 |
try and draw the basic idea of what we did before so that I can draw the idea of what 00:23:22.200 |
we're going to do differently this time. So previously, and now this thing is actually 00:23:25.920 |
calibrated, we had a random image and we had a loss function. It doesn't matter what the 00:23:38.560 |
loss function was. We know that it happened to be a combination of style_loss plus content_loss. 00:23:46.400 |
What we did was we took our image, our random image, and we put it through this loss function 00:23:55.400 |
and we got out of it two things. One was the loss and the other was the gradients. And 00:24:02.880 |
then we used the gradients with respect to the original pixels to change the original 00:24:08.800 |
pixels. So we basically repeated that loop again and again, and the pixels gradually 00:24:15.400 |
changed to make the loss go down. So that's the basic approach that we just used. It's 00:24:25.480 |
a perfectly fine approach for what it is. And in fact, if you are wanting to do lots 00:24:31.240 |
of different photos with lots of different styles, like if you created a web app where 00:24:37.320 |
you said please upload any style image and any content image, here's your artistic style 00:24:43.320 |
version, this is probably still the best, particularly with some of those tweaks I talked about. 00:24:50.400 |
But what if you wanted to create a web app that was a Van Gogh irises generator? Upload 00:24:57.520 |
any image and I will give you that image in the style of Van Gogh's irises. You can do 00:25:03.720 |
better than this approach, and the reason you can do better is that we can do something 00:25:08.520 |
where you don't have to do a whole optimization run in order to create that output. Instead, 00:25:16.500 |
we can train a CNN to learn to output photos in the style of Van Gogh's irises. The basic 00:25:24.760 |
idea is very similar. What we're going to do this time is we're going to have lots of 00:25:32.040 |
images. We're going to take each image and feed it into the exact same loss function 00:25:41.980 |
that we used before, with the style loss plus the content loss. For the style loss, we're 00:25:50.440 |
going to use Van Gogh's irises, and for the content loss, we're going to use the image 00:25:59.960 |
that we're currently looking at. What we do, rather than changing the pixels of the 00:26:09.800 |
original photo, is train a CNN. Let me move this 00:26:20.320 |
out of the way and put a CNN in the middle. These are the layers of the CNN. We're going 00:26:43.960 |
to try and get that CNN to spit out a new image. There's an input image and an output 00:26:55.920 |
image. This new CNN we've created is going to spit out an output image that when you 00:27:02.400 |
put it through this loss function, hopefully it's going to give a small number. If it gives 00:27:10.500 |
a small number, it means that the content of this photo still looks like the original 00:27:17.520 |
photo's content, and the style of this new image looks like the style of Van Gogh's irises. 00:27:24.520 |
So if you think about it, when you have a CNN, you can really pick any loss function 00:27:31.720 |
you like. We've tended to use pretty simple loss functions so far like mean squared error 00:27:37.160 |
or cross entropy. In this case, we're going to use a very different loss function which 00:27:43.000 |
is going to be style plus content loss using the same approach that we used just before. 00:27:51.000 |
And because that was generated by a neural net, we know it's differentiable. And you 00:27:56.520 |
can optimize any loss function as long as the loss function is differentiable. So if 00:28:03.520 |
we now basically take the gradients of this output, not with respect to the input image, 00:28:10.320 |
but with respect to the CNN weights, then we can take those gradients and use them to 00:28:18.600 |
update the weights of the CNN so that the next iteration through the CNN will be slightly 00:28:23.320 |
better at turning that image into a picture that has a good style match with Van Gogh's 00:28:28.880 |
irises. Does that make sense? So at the end of this, we run this through lots of images. 00:28:36.560 |
We're just training a regular CNN, and the only thing we've done differently is to replace 00:28:40.720 |
the loss function with the style_loss plus content_loss that we just used. And so at 00:28:47.080 |
the end of it, we're going to have a CNN that has learnt to take any photo and will spit 00:28:52.720 |
out that photo in the style of Van Gogh's irises. And so this is a win, because it means now 00:28:58.920 |
in your web app, which is your Van Gogh's irises generator, you now don't have to run 00:29:03.880 |
an optimization path on the new photo, you just do a single forward pass to a CNN, which 00:29:09.120 |
is instant. Is this going to limit the filters you can use, say if you have Photoshop and 00:29:09.120 |
you want to change between multiple styles? Yeah, this is going to do just one type of style. 00:29:35.320 |
Is there a way of combining multiple styles, or is it just going to be a combination of 00:29:41.480 |
all of them? You can combine multiple styles by just having 00:29:45.960 |
multiple bits of style loss for multiple images, but you're still going to have the problems 00:29:50.320 |
that that network has only learned to create one kind of image. It may be possible to train 00:30:01.040 |
it so it takes both a style image and a content image, but I don't think I've seen that done 00:30:06.360 |
yet. Having said that, there is something simpler 00:30:21.820 |
and in my opinion more useful we can do, which is rather than doing style loss plus content 00:30:29.040 |
loss. Let's think of another interesting problem to solve, which is called super resolution. 00:30:37.320 |
Super resolution is something which, honestly, when Rachel and I started playing around with 00:30:42.200 |
it a while ago, nobody was that interested in it. But in the last year or so it's become 00:30:47.560 |
really hot. So we were kind of playing around with it quite a lot, we thought it was really 00:30:56.120 |
interesting but suddenly it's got hot. The basic idea of super resolution is you start 00:31:00.900 |
off with a low-res photo. The reason I started getting interested in this was I wanted to 00:31:05.600 |
help my mom take her family photos that were often pretty low quality and blow them up 00:31:12.320 |
into something that was big and high quality that she could print out. 00:31:16.520 |
So that's what you do. You try to take something which starts with a small low-res photo and 00:31:23.080 |
turns it into a big high-res photo. Now perhaps you can see that we can use a very similar 00:31:36.720 |
technique for this. What we could do is between the low-res photo and the high-res photo, 00:31:55.800 |
we could introduce a CNN. That CNN could look a lot like the CNN from our last idea, but 00:32:00.720 |
it's taking in as input a low-res image, and then it's sticking it into a loss function, 00:32:11.240 |
and the loss function is only going to calculate content loss. The content loss it will calculate 00:32:18.520 |
is between the input that it's got from the low-res after going through the CNN compared 00:32:24.960 |
to the activations from the high-res. So in other words, has this CNN successfully created 00:32:33.400 |
a bigger photo that has the same activations as the high-res photo does? 00:32:38.960 |
And so if we pick the right layer for the high-res photo, then that ought to mean that the upsampled image has the same content as the high-res one. 00:32:54.960 |
This is one of the things I wanted to talk about today. In fact, I think it's at the 00:33:00.600 |
start of the next paper we're going to look at is they even talk about this. 00:33:06.760 |
This is the paper we're going to look at today, Perceptual Losses for Real-Time Style Transfer 00:33:10.760 |
and Super-Resolution. This is from 2016. So it took about a year or so to go from the 00:33:15.560 |
thing we just saw to this next stage. What they point out in the abstract here is that 00:33:28.520 |
people had done super resolution with CNNs before, but previously the loss function they 00:33:33.680 |
used was simply the mean-squared error between the pixel outputs of the upscaling network and the pixels of the high-res target. 00:33:44.020 |
The problem is that it turns out that that tends to create blurry images. It tends to 00:33:49.360 |
create blurry images because the CNN has no reason not to create blurry images. Blurry 00:33:57.600 |
images actually tend to look pretty good in the loss function because as long as you get 00:34:01.440 |
the general, oh this is probably somebody's face, I'll put a face color here, then it's 00:34:07.120 |
going to be fine. Whereas if you take the second or third conv block of VGG, then it needs 00:34:13.800 |
to know that this is an eyeball or it's not going to look good. So if you do it not with 00:34:20.920 |
pixel loss, but with the content loss we just learned about, you're probably going to get much better results. 00:34:29.520 |
Like many papers in deep learning, this paper introduces its own language. In the language 00:34:36.120 |
of this paper, perceptual loss is what they call the mean-squared errors between the activations 00:34:44.040 |
of a network given two images. So the thing we've been calling content loss, they call perceptual loss. 00:34:53.280 |
So one of the nice things they do at the start of this, and I really like it when papers 00:34:57.760 |
do this, is to say why is this paper important? Well this paper is important because many 00:35:02.460 |
problems can be framed as image transformation tasks, where a system receives some input 00:35:08.280 |
and chucks out some other output. For example, denoising. Learn to take an input image that's 00:35:15.360 |
full of noise and spit out a beautifully clean image. Super resolution, take an input image 00:35:21.360 |
which is low-res and spit out a high-res. Colorization, take an input image which is black and white 00:35:27.320 |
and spit out something which is color. Now one of the interesting things here is that 00:35:32.720 |
all of these examples, you can generate as much input data as you like by taking lots 00:35:39.580 |
of images, which are either from your camera or you download off the internet or from ImageNet, 00:35:44.320 |
and you can make them lower-res. You can add noise. You can make them black and white. 00:35:50.160 |
So you can generate as much labeled data as you like. That's one of the really cool things about these image transformation tasks. 00:36:09.060 |
With that example, going to lower-res imagery, it's algorithmically done. Is the neural net 00:36:19.680 |
only going to learn how to reverse something that's algorithmically done, versus an actual low-res photo? 00:36:30.500 |
So one thing I'll just mention is the way you would create your labeled data is not 00:36:34.560 |
to do that low-res on the camera. You would grab the images that you've already taken 00:36:39.440 |
and make them low-res just by doing filtering in OpenCV or whatever. That is algorithmic, 00:36:50.840 |
and it may not be perfect, but there's lots of ways of generating that low-res image. 00:36:58.480 |
So there's lots of ways of creating a low-res image. Part of it is about how do you do that 00:37:02.320 |
creation of the low-res image and how well do you match the real low-res data you're 00:37:07.720 |
going to be getting. But in the end, in this case, things like low-resolution images or 00:37:14.480 |
black and white images, it's so hard to start with something which could be like -- I've 00:37:20.160 |
seen versions with just an 8x8 picture and turning it into a photo. It's so hard to do 00:37:28.240 |
that regardless of how that 8x8 thing was created that often the details of how the 00:37:34.640 |
low-res image was created don't really matter too much. 00:37:39.600 |
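For instance, here's one hypothetical way you might build low-res/high-res training pairs from any photo collection; the sizes and the bicubic filter are illustrative choices, not the lecture's exact recipe.

    from PIL import Image

    def make_pair(path, hi=288, lo=72):
        img = Image.open(path).convert('RGB')
        hi_res = img.resize((hi, hi), Image.BICUBIC)     # the target: a full-size image
        lo_res = hi_res.resize((lo, lo), Image.BICUBIC)  # the algorithmically downsampled input
        return lo_res, hi_res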
There are some other examples they mention which is turning an image into an image which 00:37:43.720 |
includes segmentation. We'll learn more about this in coming lessons, but segmentation refers 00:37:51.040 |
to taking a photo of something and creating a new image that basically has a different 00:37:56.000 |
color for each object. Horses are green, cars are blue, buildings are red, that kind of 00:38:01.960 |
thing. That's called segmentation. As you know from things like the fisheries competition, 00:38:07.360 |
segmentation can be really important as a part of solving other bigger problems. 00:38:12.640 |
Another example they mention here is depth estimation. There are lots of important reasons 00:38:16.520 |
you would want to use depth estimation. For example, maybe you want to create some fancy 00:38:21.800 |
video effects where you start with a flat photo and you want to create some cool new 00:38:28.800 |
Apple TV thing that moves around the photo with a parallax effect as if it was 3D. If 00:38:36.160 |
you were able to use a CNN to figure out how far away every object was automatically, then 00:38:42.520 |
you could turn a 2D photo into a 3D image automatically. 00:38:49.080 |
Taking an image in and sticking an image out is kind of the idea in computer vision at 00:38:54.680 |
least of generative networks or generative models. This is why I wanted to talk a lot 00:38:59.480 |
about generative models during this class. It's not just about artistic style. Artistic 00:39:05.500 |
style was just my sneaky way of introducing you to the world of generative models. 00:39:15.120 |
Let's look at how to create this super resolution idea. Part of your homework this week will 00:39:23.640 |
be to create the new approach to style transfer. I'm going to build the super resolution version, 00:39:31.520 |
which is a slightly simpler version, and then you're going to try and build on top of that 00:39:35.480 |
to create the style transfer version. Make sure you let me know if you're not sure at any point. 00:39:44.680 |
I've already created a sample of 20,000 ImageNet images, and I've created two sizes. One is 00:39:54.560 |
288x288, and one is 72x72, and they're available as bcolz arrays. I actually posted the link 00:40:08.600 |
to these last week, and it's on platform.fast.ai. So we'll open up those bcolz arrays. One trick 00:40:15.320 |
you might have hopefully learned in part 1 is that you can turn a bcolz array into a 00:40:20.160 |
numpy array by slicing it with everything. Any time you slice a bcolz array, you get 00:40:27.560 |
back a numpy array. So if you slice everything, then this turns it into a numpy array. This 00:40:33.720 |
is just a convenient way of sharing numpy arrays in this case. So we've now got an array of 00:40:39.480 |
low resolution images and an array of high resolution images. 00:40:45.360 |
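A minimal sketch of that trick; the filenames here are made up, so use whatever paths the downloaded arrays actually have.

    import bcolz

    # Opening a bcolz array is lazy and on disk; slicing with [:] reads it into memory as numpy.
    arr_hr = bcolz.open('trn_resized_288.bc')[:]   # high-res images, e.g. (N, 288, 288, 3)
    arr_lr = bcolz.open('trn_resized_72.bc')[:]    # low-res images,  e.g. (N, 72, 72, 3)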
So let me start maybe by showing you the final network. Okay, this is the final network. 00:40:58.040 |
So we start off by taking in a batch of low-res images. The very first thing we do is stick 00:41:06.600 |
them through a convolutional block with a stride of 1. This is not going to change its 00:41:11.960 |
size at all. This convolutional block has a filter size of 9, and it generates 64 filters. 00:41:21.300 |
So this is a very large filter size. Particularly nowadays, filter sizes tend to be 3. Actually, 00:41:30.240 |
in a lot of modern networks the very first layer very often has a large filter size, 00:41:37.840 |
just that one first layer. And the reason is that it basically allows us to immediately 00:41:43.760 |
increase the receptive field of all of the layers from now on. So we get that by having 9x9, and 00:41:53.040 |
we don't lose any information because we've gone from 3 channels to 64 filters. So each 00:42:00.480 |
of these 9x9 convolutions can actually have quite a lot of information because you've 00:42:05.120 |
got 64 filters. So you'll be seeing this quite a lot in modern CNN architectures, just a 00:42:11.720 |
single large filter conv layer. So this won't be unusual in the future. 00:42:18.520 |
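Here's a rough sketch of that first block in current Keras; the lecture's notebook uses the older Keras 1 layer names (Convolution2D and friends), and the conv_block helper name is just illustrative.

    from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation

    def conv_block(x, filters, size, stride=(1, 1), act=True):
        # convolution -> batchnorm -> optional relu, as described a little later on
        x = Conv2D(filters, size, strides=stride, padding='same')(x)
        x = BatchNormalization()(x)
        return Activation('relu')(x) if act else x

    inp = Input((72, 72, 3))            # a batch of low-res images
    x = conv_block(inp, 64, (9, 9))     # stride 1, 9x9 filters, 64 channels: no change in size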
Now the next thing, I'm going to give the green box behind you. Oh, just a moment, sorry. 00:42:29.040 |
The stride 1 is also pretty popular, I think. Well the stride 1 is important for this first 00:42:36.600 |
layer because you don't want to throw away any information yet. So in the very first 00:42:40.720 |
layer, we want to keep the full image size. So with the stride 1, it doesn't change the size at all. 00:42:47.680 |
But there's also a lot of duplication, right? Like a filter size of 9 with a stride of 1? 00:42:53.560 |
They overlap a lot, absolutely. But that's okay. A good implementation of a convolution 00:43:02.080 |
is going to hopefully memoize some of that, or at least keep it in cache. So hopefully it's not too expensive in practice. 00:43:11.040 |
One of the discussions I was just having during the break was how practical are the things 00:43:24.440 |
that we're learning at the moment compared to part 1 where everything was just designed 00:43:28.840 |
entirely to be the most practical things which we have best practices for. And the answer 00:43:34.600 |
is a lot of the stuff we're going to be learning, no one quite knows how practical it is because 00:43:39.600 |
a lot of it just hasn't really been around that long and isn't really that well understood 00:43:44.840 |
and maybe there aren't really great libraries for it yet. 00:43:47.520 |
So one of the things I'm actually hoping from this part 2 is by learning the edge of research 00:43:54.800 |
stuff or beyond amongst a diverse group is that some of you will look at it and think 00:44:01.120 |
about whatever you do 9 to 5 or 8 to 6 or whatever and think, oh, I wonder if I could 00:44:08.920 |
use that for this. If that ever pops into your head, please tell us. Please talk about 00:44:16.320 |
it on the forum because that's what we're most interested in. It's like, oh, you could 00:44:20.920 |
use super-resolution for blah or depth-finding for this or generative models in general for 00:44:28.000 |
this thing I do in pathology or architecture or satellite imagery or whatever. 00:44:35.360 |
So it's going to require some imagination sometimes on your part. So often that's why 00:44:42.440 |
I do want to spend some time looking at stuff like this where it's like, okay, what are 00:44:49.320 |
the kinds of things this can be done for? I'm sure you know in your own field, one of 00:44:55.280 |
the differences between expert and beginner is the way an expert can look at something 00:44:59.840 |
from first principles and say, okay, I could use that for this totally different thing 00:45:05.640 |
which has got nothing to do with the example that was originally given to me because I 00:45:08.820 |
know the basic steps are the same. That's what I'm hoping you guys will be able to do 00:45:16.080 |
is not just say, okay, now I know how to do artistic style. Are there things in your field 00:45:33.680 |
which have some similarities? We were going to talk about the super-resolution network. 00:45:48.280 |
We talked about the idea of the initial conv block. After the initial conv block, we have 00:45:57.720 |
the computation. In any kind of generative network, there's the key work it has to do, 00:46:06.640 |
which in this case is starting with a low-res image, figure out what might that black dot 00:46:13.720 |
be. Is it an eyeball or is it a wheel? Basically if you want to do really good upscaling, you 00:46:22.360 |
actually have to figure out what the objects are so you know what to draw. That's kind 00:46:28.920 |
of like the key computation this CNN is going to have to learn to do. In generative models, 00:46:35.760 |
we generally like to do that computation at a low resolution. There's a couple of reasons 00:46:41.400 |
why. The first is that at a low resolution there's less work to do so the computation 00:46:46.040 |
is faster. But more importantly, at higher resolutions it generally means we have a smaller 00:46:53.280 |
receptive field. It generally means we have less ability to capture large amounts of the 00:46:59.480 |
image at once. If you want to do really great computations where you recognize that this 00:47:09.560 |
blob here is a face and therefore the dot inside it is an eyeball, then you're going 00:47:14.400 |
to need enough of a receptive field to cover that whole area. 00:47:18.760 |
Now I noticed a couple of you asked for information about receptive fields on the forum thread. 00:47:24.880 |
There's quite a lot of information about this online, so Google is your friend here. But 00:47:30.960 |
the basic idea is if you have a single convolutional filter of 3x3, the receptive field is 3x3. 00:47:39.840 |
So it's how much space can that convolutional filter impact. 00:47:55.000 |
On the other hand, what if you had a 3x3 filter which had a 3x3 filter as its input? So that 00:48:05.600 |
means that the center one took all of this. But what did this one take? Well this one 00:48:12.000 |
would have taken, depending on the stride, probably these ones here, and this one over here would have taken these. 00:48:25.920 |
So in other words, in the second layer, assuming a stride of 1, the receptive field is now 00:48:31.880 |
5x5, not 3x3. So the receptive field depends on two things. One is how many layers deep 00:48:41.600 |
are you, and the second is how much did the previous layers either have a nonunit stride 00:48:48.680 |
or maybe they had max pooling. So in some way they were becoming down-sampled; those down-sampling steps increase the receptive field too. 00:48:57.320 |
And so the reason it's great to be doing layer computations on a large receptive field is 00:49:04.040 |
that it then allows you to look at the big picture and look at the context, not just tiny local details. 00:49:15.760 |
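To make that concrete, here's a small helper of my own (not from the notebook) that computes the receptive field of a stack of conv or pooling layers from their kernel sizes and strides.

    def receptive_field(layers):
        """layers: a list of (kernel_size, stride) pairs, first layer first."""
        rf, jump = 1, 1
        for k, s in layers:
            rf += (k - 1) * jump   # each layer widens the field by (k - 1) input-pixel steps
            jump *= s              # stride multiplies the spacing between successive outputs
        return rf

    print(receptive_field([(3, 1), (3, 1)]))   # 5: the 5x5 example above
    print(receptive_field([(3, 2), (3, 1)]))   # 7: a stride-2 layer grows it faster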
So in this case, we have four blocks of computation where each block is a ResNet block. So for 00:49:23.760 |
those of you that don't recall how ResNet works, it would be a good idea to go back 00:49:28.520 |
to Part 1 and review. But to remind ourselves, let's look at the code. Here's a ResNet block. 00:49:35.880 |
So all a ResNet block does is it takes some input and it does two convolutional blocks 00:49:43.360 |
on that input, and then it adds the result of those convolutions back to the original input. 00:49:50.240 |
So you might remember from Part 1 we actually drew it. We said there's some input and it 00:49:54.600 |
goes through two convolutional blocks and then it goes back and is added to the original. 00:50:03.200 |
And if you remember, we basically said in that case we've got y equals x plus some function 00:50:10.520 |
of x, which means that the function equals y minus x and this thing here is a residual. 00:50:24.640 |
So a whole stack of residual blocks, ResNet blocks on top of each other can learn to gradually 00:50:30.200 |
home in on whatever it's trying to do. In this case, what it's trying to do is get 00:50:35.800 |
the information it's going to need to upscale this in a smart way. 00:50:41.400 |
So we're going to be using a lot more of this idea of taking blocks that we know work well 00:50:49.160 |
for something and just reusing them. So then what's a conv block? All a conv block is in 00:50:55.480 |
this case is it's a convolution followed by a batch norm, optionally followed by an activation. 00:51:04.360 |
And one of the things we now know about ResNet blocks is that we generally don't want an activation 00:51:12.920 |
at the end. That's one of the things that a more recent paper discovered. So you can 00:51:17.400 |
see that for my second conv block I have no activation. 00:51:22.640 |
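Putting those two pieces together, here's a rough res_block sketch, reusing the hypothetical conv_block helper from the earlier snippet; merging by addition is just the add layer in current Keras.

    from tensorflow.keras.layers import add

    def res_block(x, filters=64):
        # two conv blocks, the second with no activation, then add back the original input
        shortcut = x
        x = conv_block(x, filters, (3, 3))
        x = conv_block(x, filters, (3, 3), act=False)
        return add([shortcut, x])

    # four blocks of low-resolution "computation", as in the network described above
    for _ in range(4):
        x = res_block(x)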
I'm sure you've noticed throughout this course that I refactor my network architectures a 00:51:28.360 |
lot. My network architectures don't generally list every single layer, but they're generally 00:51:33.320 |
functions which have a bunch of layers. A lot of people don't do this. A lot of the architectures 00:51:42.120 |
you find online are like hundreds of lines of layer definitions. I think that's crazy. 00:51:47.480 |
It's so easy to make mistakes when you do it that way, and so hard to really see what's 00:51:51.680 |
going on. In general, I would strongly recommend that you try to refactor your architectures 00:51:58.200 |
so that by the time you write the final thing, it's half a page. You'll see plenty of examples of that as we go. 00:52:09.160 |
So we've increased the receptive field, we've done a bunch of computation, but we still 00:52:14.600 |
haven't actually changed the size of the image, which is not very helpful. So the next thing 00:52:19.480 |
we do is we're going to change the size of the image. And the first thing we're going 00:52:23.440 |
to learn is to do that with something that goes by many names. One is deconvolution, 00:52:36.880 |
another is transposed convolutions, and it's also known as fractionally strided convolutions. 00:52:54.920 |
In Keras they call them deconvolutions. And the basic idea is something which I've actually 00:53:04.320 |
got a spreadsheet to show you. The basic idea is that you've got some kind of image, so here's 00:53:16.040 |
a 4x4 image, and you put it through a 3x3 filter, a convolutional filter, and if you're 00:53:25.560 |
doing valid convolutions, that's going to leave you with a 2x2 output, because here's 00:53:34.040 |
one 3x3, another 3x3, and four of them. So each one is grabbing the whole filter and 00:53:45.080 |
the appropriate part of the data. So it's just a standard 2D convolution. 00:53:49.480 |
So we've done that. Now let's say we want to undo that. We want something which can take 00:53:55.680 |
this result and recreate this input. How would you do that? So one way to do that would be 00:54:05.120 |
to take this result and put back that implicit padding. So let's surround it with all these 00:54:15.400 |
zeros such that now if we use some convolutional filter, and we're going to put it through this 00:54:32.300 |
entire matrix, a bunch of zeros with our result matrix in the middle, and then we can calculate 00:54:40.880 |
our result in exactly the same way, just a normal convolutional filter. 00:54:45.560 |
So if we now use gradient descent, we can look and see, what is the error? So how much 00:54:52.920 |
does this pixel differ from this pixel? And how much does this pixel differ from this 00:55:01.000 |
pixel? And then we add them all together to get our mean squared error. So we can now 00:55:07.000 |
use gradient descent, which hopefully you remember from Part 1 in Excel is called solver. 00:55:14.520 |
And we can say, set this cell to a minimum by changing these cells. So this is basically 00:55:23.920 |
like the simplest possible optimization. Solve that, and here's what it's come up with. 00:55:33.520 |
So it's come up with a convolutional filter. You'll see that the result is not exactly 00:55:41.000 |
the same as the original data, and of course, how could it be? We don't have enough information, 00:55:46.240 |
we only have 4 things to try and regenerate 16 things. But it's not terrible. And in general, 00:55:53.760 |
this is the challenge with upscaling. When you've got something that's blurred and down-sampled, 00:56:01.200 |
you've thrown away information. So the only way you can get information back is to guess 00:56:05.760 |
what was there. But the important thing is that by using a convolution like this, we 00:56:12.040 |
can learn those filters. So we can learn how to up-sample it in a way that gives us the 00:56:18.920 |
loss that we want. So this is what a deconvolution is. It's just a convolution on a padded input. 00:56:28.320 |
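Here's a tiny numpy/scipy sketch of exactly that idea, a convolution slid over a zero-padded input; the 2x2 values and the 3x3 filter are just made-up numbers.

    import numpy as np
    from scipy.signal import correlate2d

    out  = np.array([[1., 2.],
                     [3., 4.]])          # the 2x2 result of the earlier valid convolution
    filt = np.random.randn(3, 3)         # a 3x3 filter we would learn with gradient descent

    # 'full' mode zero-pads the input on every side before sliding the filter across it,
    # which is the "convolution on a padded input" picture from the spreadsheet.
    upsampled = correlate2d(out, filt, mode='full')
    print(upsampled.shape)               # (4, 4): back to the original input size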
Now in this case, I've assumed that my convolutions had a unit stride. There was just 1 pixel 00:56:35.320 |
between each convolution. If your convolutions are of stride 2, then it looks like this 00:56:43.440 |
picture. And so you can see that as well as putting the 2 pixels around the outside, we've 00:56:50.040 |
also put a 0 pixel in the middle. So these 4 cells are now our data cells, and you can 00:56:58.280 |
then see it calculating the convolution through here. I strongly suggest looking at this link, 00:57:05.000 |
which is where this picture comes from. And in turn, this link comes from a fantastic 00:57:11.920 |
paper called the Convolution Arithmetic Guide, which is a really great paper. And so if you 00:57:17.800 |
want to know more about both convolutions and deconvolutions, you can look at this page 00:57:23.120 |
and it's got lots of beautiful animations, including animations on transposed convolutions. 00:57:31.200 |
So you can see, this is the one I just showed you. So that's the one we just saw in Excel. 00:57:53.160 |
So that's what we're going to do first, is we're going to do deconvolutions. So in Keras, 00:58:01.560 |
a deconvolution is exactly the same as convolution, except with DE on the front. You've got all 00:58:06.920 |
the same stuff. How many filters do you want? What's the size of your filter? What's your 00:58:11.840 |
stride or subsample, as they call it? Border mode, and so on. 00:58:17.040 |
We have a question. If TensorFlow is the backend, shouldn't the batch normalization axis equals 00:58:25.040 |
negative 1? And then there was a link to a GitHub conversation where Francois said that axis should be -1 with the TensorFlow backend. 00:58:43.520 |
No, it should be. And in fact, axis minus 1 is the default. So, yes. Thank you. Well 00:58:53.200 |
spotted. Thanks, David Gutmann. He is also responsible for some of the beautiful pictures we saw 00:58:58.960 |
earlier. So let's remove axis. That will make things look better. And go faster as well. 00:59:16.040 |
So just in case you weren't clear on that, you might remember from part 1 that the reason 00:59:20.040 |
we had that axis equals 1 is because in Theano that was the channel axis. So we basically 00:59:24.920 |
wanted not to throw away the x-y information, but to batch-normalize across channels. In TensorFlow, 00:59:31.360 |
the channel is now the last axis. And since minus 1 is the default, we actually don't need that. 00:59:43.800 |
So that's our deconvolution blocks. So we're using a stride of 2,2. So that means that 00:59:54.000 |
each time we go through this deconvolution, it's going to be doubling the size of the image. 00:59:58.560 |
For some reason I don't fully understand and haven't really looked into, in Keras 01:00:03.000 |
you actually have to tell it the shape of the output. So you can see here, you can actually 01:00:09.280 |
see it's gone from 72x72 to 144x144 to 288x288. So because these are convolutional filters, 01:00:18.160 |
it's learning to upscale. But it's not upscaling with just three channels, it's upscaling with 01:00:24.400 |
64 filters. So that's how it's able to do more sophisticated stuff. 01:00:31.280 |
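A hedged sketch of those upsampling blocks, continuing the earlier snippets but using today's Keras names; the lecture's notebook uses Keras 1's Deconvolution2D, where you pass the output shape explicitly, whereas the modern Conv2DTranspose infers it.

    from tensorflow.keras.layers import Conv2DTranspose, BatchNormalization, Activation

    def deconv_block(x, filters=64, size=(3, 3)):
        # stride 2 doubles the spatial size each time we apply it
        x = Conv2DTranspose(filters, size, strides=(2, 2), padding='same')(x)
        x = BatchNormalization()(x)
        return Activation('relu')(x)

    x = deconv_block(x)   # 72x72   -> 144x144
    x = deconv_block(x)   # 144x144 -> 288x288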
And then finally, we're kind of reversing things here. We have another 9x9 convolution in order 01:00:40.760 |
to get back our three channels. So the idea is we previously had something with 64 channels, 01:00:48.600 |
and so we now want to turn it into something with just three channels, the three colors, 01:00:52.680 |
and to do that we want to use quite a bit of context. So we have a single 9x9 filter 01:00:58.040 |
at the end to get our three channels out. So at the end we have a 288x288x3 tensor, in other words a full-size image. 01:01:08.680 |
So if we go ahead now and train this, then it's going to do basically what we want, but 01:01:13.920 |
the thing we're going to have to do is create our loss function. And creating our loss function 01:01:21.880 |
is a little bit messy, but I'll take you through it slowly and hopefully it'll all make sense. 01:01:33.120 |
So let's remember some of the symbols here. Input, inp, is the original low-resolution 01:01:42.340 |
input tensor. And then the output of this is called outp, and so let's call this whole 01:01:50.300 |
network here, let's call it the upsampling network. So this is the thing that's actually 01:01:55.480 |
responsible for doing the upsampling. So we're going to take the upsampling network 01:02:00.120 |
and we're going to attach it to VGG. And VGG is going to be used only as a loss function 01:02:11.360 |
So before we can take this output and stick it into VGG, we need to stick it through our 01:02:17.200 |
standard mean subtraction pre-processing. So this is just the same thing that we did 01:02:21.880 |
over and over again in Part 1. So let's now define this output as being this lambda function 01:02:33.440 |
applied to the output of our upsampling network. So that's what this is. This is just our pre-processed output. 01:02:47.520 |
So we can now create a VGG network, and let's go through every layer and make it not trainable. 01:02:57.400 |
You can't ever make your loss function be trainable. The loss function is the fixed 01:03:02.800 |
in stone thing that tells you how well you're doing. So clearly you have to make sure VGG 01:03:07.440 |
is not trainable. Which bit of the VGG network do we want? We're going to try a few things. 01:03:15.320 |
I'm using block2_conv2. So relatively early, and the reason for that is that if you remember 01:03:21.920 |
when we did the content reconstruction last week, the very first thing we did, we found 01:03:27.640 |
that you could basically totally reconstruct the original image from early layer activations, 01:03:34.280 |
whereas by the time we got to block4 we got pretty horrendous things. So we're going 01:03:40.900 |
to use a somewhat early block as our content loss, or as the paper calls it, the perceptual 01:03:49.200 |
loss. You can play around with this and see how it goes. 01:03:56.400 |
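A rough sketch of that step, again with current Keras names; vgg_inp and the exact layer choice are assumptions you can vary, and the ImageNet mean subtraction is left out here.

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import Input
    from tensorflow.keras.models import Model

    vgg_inp = Input((288, 288, 3))
    vgg = VGG16(include_top=False, input_tensor=vgg_inp)
    for layer in vgg.layers:
        layer.trainable = False          # the loss network must never be trained

    # take a relatively early layer as the content / perceptual loss features
    vgg_content = Model(vgg_inp, vgg.get_layer('block2_conv2').output)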
So now we're going to create two versions of this VGG output. This is something which 01:04:04.120 |
is I think very poorly understood or appreciated with the Keras dysfunctional API, which is 01:04:12.360 |
any kind of layer, and a model is a layer as far as Keras is concerned, can be treated 01:04:18.080 |
as if it was a function. So we can take this model and pretend it's a function, and we 01:04:24.760 |
can pass it any tensor we like. And what that does is it creates a new model where those 01:04:31.160 |
two pieces are joined together. So VGG2 is now equal to this model on the top and this 01:04:44.920 |
model on the bottom. Remember this model was the result of our upsampling network followed 01:04:53.480 |
In the upsampling network, is the lambda function to normalize the output image? 01:05:00.840 |
Yeah, that's a good point. So we use a tanh activation which can go from -1 to 1. So if 01:05:09.640 |
you then go that plus 1 times 127.5, that gives you something that's between 0 and 255, which 01:05:15.720 |
is the range that we want. Interestingly, this was suggested in the original paper and 01:05:21.160 |
supplementary materials. More recently, on Reddit I think it was, the author said that 01:05:26.720 |
they tried it without the tanh activation and therefore without the final deprocessing and 01:05:34.800 |
it worked just as well. You can try doing that. If you wanted to try it, you would just 01:05:39.440 |
remove the activation and you would just remove this last thing entirely. But obviously if 01:05:51.520 |
This is actually something I've been playing with with a lot of different models. Any time 01:05:55.760 |
I have some particular range that I want, one way to enforce that is by having a tanh 01:06:02.240 |
or sigmoid followed by something that turns that into the range you want. It's not just 01:06:07.080 |
images. So we've got two versions of our VGG layer output. One which is based on the output 01:06:19.800 |
of the upscaling network, and the other which is based on just an input. And this just an 01:06:28.560 |
input is using the high-resolution shape as its input. So that makes sense because this 01:06:36.840 |
VGG network is something that we're going to be using at the high-resolution scale. We're 01:06:43.760 |
going to be taking the high-resolution target image and the high-resolution up-sampling 01:06:48.800 |
result and comparing them. Now that we've done all that, we're nearly there. We've now got 01:06:58.840 |
high-res perceptual activations and we've got the low-res up-sampled perceptual activations. 01:07:07.520 |
We now just need to take the mean sum of squares between them, and here it is here. In Keras, 01:07:15.520 |
anytime you put something into a network, it has to be a layer. So if you want to take 01:07:20.400 |
just a plain old function and turn it into a layer, you just chuck it inside a capital 01:07:26.080 |
L lambda. So our final model is going to take our low-res input and our high-res input as 01:07:39.120 |
our two inputs and return this loss function as an output. 01:07:45.560 |
One last trick. When you fit things in Keras, it assumes that you're trying to take some 01:07:52.920 |
output and make it close to some target. In this case, our loss is the actual loss function 01:08:00.160 |
we want. It's not that there's some target. We want to make it as low as possible. Since 01:08:05.480 |
it's the sum of mean squared errors, it can't go beneath 0. So what we can do is we can basically 01:08:14.560 |
trick Keras and say that our target for the loss is 0. And you can't just use the scalar 01:08:21.200 |
0, remember every time we have a target set of labels in Keras, you need 1 for every row. 01:08:29.160 |
So we're going to create an array of zeros. That's just so that we can fit it into what 01:08:37.080 |
Keras expects. And I kind of find that increasingly as I start to move away from the well-trodden 01:08:47.440 |
path of deep learning, more and more, particularly if you want to use Keras, you kind of have 01:08:52.720 |
to do weird little hacks like this. So there's a weird little pattern. There's probably more 01:09:02.200 |
So we've got our loss function that we're trying to get every row as close to 0 as possible. 01:09:07.720 |
We have a question. If we're only using up to block2_conv2, could we pop off all the 01:09:14.960 |
layers afterwards to save some computation? Sure. It wouldn't be a bad idea at all. 01:09:22.960 |
So we compile it, we fit it. One thing you'll notice I've started doing is using this callback 01:09:37.080 |
called TQDM notebook callback. TQDM is a really terrific library. Basically it does something 01:09:46.520 |
very simple, which is to add a progress meter to your loops. You can use it in a console, 01:09:56.280 |
as you can see. Basically where you've got a loop, you can add TQDM around it. That loop 01:10:03.520 |
does just what it used to do, but it shows its progress. It even guesses how much time 01:10:09.440 |
it has left and so forth. You can also use it inside a Jupyter notebook and it creates a 01:10:15.920 |
neat little graph that gradually goes up and shows you how long it has left and so forth. 01:10:25.760 |
So this is just a nice little trick. Use some learning rate annealing. At the end of training 01:10:40.720 |
The model we're interested in is just the upsampling model. We're going to be feeding 01:10:44.760 |
the upsampling model low-res inputs and getting out the high-res outputs. We don't actually 01:10:50.400 |
care about the value of the loss. I'll now define a model which takes the low-res input 01:10:57.480 |
and spits out this output, our high-res output. With that model, we can now call predict. 01:11:08.120 |
Here is our original low-resolution mashed potato, and here is our high-resolution mashed 01:11:16.560 |
potato. It's amazing what it's done. You can see in the original, the shadow of the leaf 01:11:25.080 |
was very unclear, the bits in the mashed potato were just kind of big blobs. In this version 01:11:32.200 |
we have clear shadows, hard edges, and so forth. 01:11:39.920 |
Question. Can you explain the size of the target? It's the first dimension of the high-res 01:11:57.600 |
Obviously it's this number. This is basically the number of images that we have. Then it's 01:12:13.080 |
128 because that layer has 128 filters, so this ends up giving you the mean squared error 01:12:38.520 |
Question. Would popping the unused layers really save anything? Aren't you only getting 01:12:44.320 |
the layers you want when you do the vgg.get_layer('block2_conv2')? 01:12:49.960 |
I'm not sure. I can't quite think quickly enough. You can try it. It might not help. 01:13:00.640 |
Intuitively, what features is this model learning? What it's learning is it's looking at 20,000 01:13:09.720 |
images, very, very low-resolution images like this. When there's a kind of a soft gray bit 01:13:20.480 |
next to a hard bit in certain situations, that's probably a shadow, and when there's 01:13:25.240 |
a shadow, this is what a shadow looks like, for example. It's learning that when there's 01:13:31.520 |
a curve, it isn't actually meant to look like a jagged edge, but it's actually meant 01:13:34.960 |
to look like something smooth. It's really learning what the world looks like. Then when 01:13:43.880 |
you take that world and blur it and make it small, what does it then look like? It's just 01:13:50.640 |
like when you look at a picture like this, particularly if you blur your eyes and de-focus 01:13:57.780 |
your eyes, you can often see what it originally looked like because your brain basically is 01:14:03.640 |
doing the same thing. It's like when you read a really blurry text. You can still read it 01:14:08.560 |
because your brain is thinking like it knows. That must have been an E, that must have been 01:14:17.880 |
an E. So are you suggesting there is a similar universality on the other way around? You 01:14:24.760 |
know when VGG is saying the first layer is learning a line and then a square and a nose 01:14:30.540 |
or an eye? Are you saying the same thing is true in this case? Yeah. Yeah, absolutely. 01:14:38.120 |
It has to be. There's no way to up-sample. There's an infinite number of ways you can 01:14:45.400 |
up-sample. There's lost information. So in order to do it in a way that decreases this 01:14:50.480 |
loss function, it actually has to figure out what's probably there based on this context. 01:14:58.240 |
But don't you agree, just intuitively thinking about it, like example of the, you say, suggesting 01:15:04.440 |
like the album pictures for your mom. Would you think it would be a bit easier if we're 01:15:09.560 |
just feeding you pictures of humans because it's like the interaction of the circle of 01:15:13.840 |
the eye and the nose is going to be a lot better. In the most extreme versions of super 01:15:20.320 |
resolution networks, where they take 8 by 8 images, you'll see that all of them pretty 01:15:25.440 |
much use the same dataset, which is something called CelebA. CelebA is a dataset of 01:15:30.440 |
pictures of celebrity faces. And all celebrity faces are pretty similar. And so they show 01:15:36.320 |
these fantastic (and they are fantastic, amazing) results, but they take an 8 by 8, 01:15:41.360 |
and it looks pretty close. And that's because they're taking advantage of this. In our case, 01:15:48.600 |
we've got 20,000 images from 1,000 categories. It's not going to do nearly as well. If we 01:15:55.280 |
wanted to do as well as the Celeb A versions, we would need hundreds of millions of images 01:16:02.760 |
Yeah, it's just hard for me to imagine mashed potatoes and a face in the same category. That's 01:16:11.480 |
The key thing to realize is there's nothing qualitatively different between what mashed 01:16:16.000 |
potato looks like in one photo or another. So something can learn to recognize the unique 01:16:23.480 |
features of mashed potatoes. And a big enough network, given enough examples, can learn 01:16:29.720 |
not just mashed potatoes, but writing and anger pictures and whatever. 01:16:35.680 |
So for your examples, you're most likely to be doing stuff which is more domain-specific. 01:16:41.880 |
And so you should use more domain-specific data taking advantage of exactly these transformations. 01:16:51.280 |
One thing I mentioned here is I haven't used a test set, so another piece of the homework 01:17:08.280 |
is to add in a test set and tell us, is this mashed potato overfit? Is this actually just 01:17:19.340 |
matching the particular training set version of this mashed potato or not? And if it is 01:17:25.020 |
overfitting, can you create something that doesn't overfit? So there's another piece 01:17:33.800 |
So it's very simple now to take this and turn it into our fast style transfer. So the fast 01:17:43.280 |
style transfer is going to do exactly the same thing, but rather than turning something 01:17:49.080 |
low-res into something high-res, it's going to take something that's a photo and turn 01:18:01.740 |
So we're going to do that in just the same way. Rather than go from low-res through a 01:18:09.200 |
CNN to find the content loss against high-res, we're going to take a photo through a CNN 01:18:18.880 |
and do both style loss and content loss against a single fixed style image. 01:18:27.320 |
I've given you links here, so I have not implemented this for you, this is for you to implement, 01:18:32.160 |
but I have given you links to the original paper, and very importantly also to the supplementary 01:18:36.560 |
material, which is a little hard to find because there's two different versions, and only one 01:18:40.880 |
of them is correct. And of course I don't tell you which one is correct. 01:18:46.080 |
So the supplementary material goes through all of the exact details of what was their 01:18:53.900 |
loss function, what was their processing, what was their exact architecture, and so 01:19:00.960 |
So while I wait for that to load: like we did the doodle re-creation using a model photographer's 01:19:10.460 |
weights, could we take a regular image and see how you would look if you were a model? 01:19:17.440 |
I don't know. If you could come up with a loss function, which is how much does somebody 01:19:27.480 |
look like a model? You could. So you'd have to come up with a loss function. And it would 01:19:33.960 |
have to be something where you can generate labeled data. 01:19:40.000 |
One of the things they mentioned in the paper is that they found it very important to add 01:19:46.480 |
quite a lot of padding, and specifically they didn't add zero padding, but they add reflection 01:19:53.400 |
padding. So reflection padding literally means take the edge and reflect it to your padding. 01:20:01.880 |
I've written that for you because there isn't one, but you may find it interesting to look 01:20:06.600 |
at this because this is one of the simplest examples of a custom layer. So we're going 01:20:12.320 |
to be using custom layers more and more, and so I don't want you to be afraid of them. 01:20:17.200 |
So a custom layer in Keras is a Python class. If you haven't done OO programming in Python, 01:20:27.320 |
now's a good time to go and look at some tutorials because we're going to be doing quite a lot 01:20:31.240 |
of it, particularly for PyTorch. PyTorch absolutely relies on it. So we're going to create a class. 01:20:37.400 |
It has to inherit from layer. In Python, this is how you can create a constructor. Python's 01:20:44.820 |
OO syntax is really gross. You have to use a special weird custom name thing, which happens 01:20:51.060 |
to be the constructor. Every single damn thing inside a class, you have to manually type out 01:20:56.520 |
self, comma, as the first parameter. If you forget, you'll get stupid errors. In the constructor 01:21:05.280 |
for a layer, this is basically a way you just save away any of the information you were 01:21:09.920 |
given. In this case, you've said that I want this much padding, so you just have to save 01:21:14.680 |
that somewhere, and that's all the constructor does here. 01:21:17.740 |
And then you need to do two things in every Keras custom layer. One is you have to define 01:21:23.320 |
something called get_output_shape_for. Keras is going to pass in the shape of an input, 01:21:31.800 |
and you have to return what is the shape of the output that that would create. So in this 01:21:36.680 |
case, if s is the shape of the input, then the output is going to be the same batch size 01:21:43.280 |
and the same number of channels. And then we're going to add in twice the amount of padding 01:21:49.200 |
for both the rows and columns. This is going to tell it, because remember one of the cool 01:21:54.240 |
things about Keras is you just chuck the layers on top of each other, and it magically knows 01:21:59.640 |
how big all the intermediate things are. It magically knows because every layer has this 01:22:08.600 |
The second thing you have to define is something called call. Call is the thing which will 01:22:13.000 |
get your layer data and you have to return whatever your layer does. In our case, we 01:22:20.820 |
want to cause it to add reflection padding. In this case, it happens that TensorFlow has 01:22:31.240 |
Obviously, generally it's nice to create Keras layers that would work with both Theano and 01:22:38.560 |
TensorFlow backends by using that capital K dot notation. But in this case, Theano didn't 01:22:45.200 |
have anything obvious that did this easily, and since it was just for our class, I just 01:22:53.520 |
So here is a complete layer. I can now use that layer in a network definition like this. 01:23:01.680 |
I can call dot predict, which will take an input and turn it into, you can see that the 01:23:08.280 |
bird now has the left and right sides here being reflected. 01:23:15.960 |
So that is there for you to use because in the supplementary material for the paper, 01:23:25.000 |
they add spatial reflection padding at the beginning of the network. And they add a lot 01:23:30.240 |
40x40. And the reason they add a lot is because they mention in the supplementary material 01:23:37.720 |
that they don't want to use same convolutions, they want to use valid convolutions in their 01:23:46.920 |
computation because if you add any black borders during those computation steps, it creates 01:23:56.200 |
So you'll see that through this computation of all their residual blocks, the size gets 01:24:01.240 |
smaller by 4 each time. And that's because these are valid convolutions. So that's why 01:24:06.880 |
they have to add padding to the start so that these steps don't cause the image to become 01:24:16.240 |
So this section here should look very familiar because it's the same as our upsampling 01:24:22.680 |
network. A bunch of residual blocks, two deconvolutions, and one 9x9 convolution. So this is identical. 01:24:32.720 |
So you can copy it. This is the new bit. We've already talked about why we have this 9x9 01:24:44.120 |
conv. But why do we have these downsampling convolutions to start with? We start with 01:24:50.180 |
an image up here of 336x336, and we halve its size, and then we halve its size again. 01:24:58.940 |
Why do we do that? The reason we do that is that, as I mentioned earlier, we want to do 01:25:07.360 |
our computation at a lower resolution because it allows us to have a larger receptive field 01:25:14.760 |
and it allows us to do less computation. So this pattern, where it's reflective, the last 01:25:25.040 |
thing is the same as the top thing, the second last thing is the same as the second thing. 01:25:29.880 |
You can see it's like a reflection symmetric. It's really, really common in generative models. 01:25:35.480 |
It's first of all to take your object, down-sample it, increasing the number of channels at the 01:25:41.240 |
same time. So increasing the receptive field, you're creating more and more complex representations. 01:25:47.040 |
You then do a bunch of computation on those representations and then at the end you up-sample 01:25:52.080 |
again. So you're going to see this pattern all the time. So that's why I wanted you guys 01:26:02.080 |
So there's that, the last major piece of your homework. 01:26:13.080 |
That's exactly the same as deconvolution stride 2. So remember, I mentioned earlier that another 01:26:20.600 |
name for deconvolution is fractionally strided convolution. So you can remember that little 01:26:29.080 |
picture we saw, this idea that you put little columns and rows of zeros in between each 01:26:34.960 |
row and column. So you kind of think of it as doing like a half-stride at a time. 01:26:45.200 |
So that's why this is exactly what we already have. I don't think you need to change it at 01:26:50.760 |
all except you'll need to change my same convolutions to valid convolutions. But this is well worth 01:27:00.000 |
reading the whole supplementary material because it really has the details. It's so great when 01:27:07.640 |
a paper has supplementary material like this, you'll often find the majority of papers don't 01:27:13.060 |
actually tell you the details of how to do what they did, and many don't even have code. 01:27:19.560 |
These guys both have code and supplementary material, which makes this an absolute A+ paper. 01:27:19.560 |
So that is super-resolution, perceptual losses, and so on and so forth. So I'm glad we got 01:27:40.240 |
there. Let's make sure I don't have any more slides. There's one other thing I'm going to 01:27:53.000 |
show you, which is these deconvolutions can create some very ugly artifacts. I can show 01:28:02.720 |
you some very ugly artifacts because I have some right here. You see it on the screen 01:28:10.040 |
Rachel, this checkerboard? This is called a checkerboard pattern. The checkerboard pattern 01:28:20.480 |
happens for a very specific reason. I've provided a link to this paper. It's an online paper. 01:28:29.560 |
You guys might remember Chris Olah. He had a lot of the best learning materials we looked 01:28:29.560 |
at in Part 1. He's now got this cool thing called distill.pub, done with some of his 01:28:40.280 |
colleagues at Google. He wrote this thing, discovering why is it that everybody gets 01:28:47.980 |
these goddamn checkerboard patterns. What he shows is that it happens because you have 01:28:57.160 |
stride 2 size 3 convolutions, which means that every pair of convolutions sees one pixel 01:29:04.700 |
twice. So it's like a checkerboard is just a natural thing that's going to come out. 01:29:11.720 |
So they talk about this in some detail and all the kind of things you can do. But in 01:29:17.320 |
the end, they point out two things. The first is that you can avoid this by making it that 01:29:26.920 |
your stride divides nicely into your size. So if I change size to 4, they're gone. So 01:29:39.920 |
one thing you could try if you're getting checkerboard patterns, which you will, is 01:29:44.840 |
make your size 3 convolutions into size 4 convolutions. 01:29:49.960 |
The second thing that he suggests doing is not to use deconvolutions. Instead of using 01:29:56.760 |
a deconvolution, he suggests first of all doing an upsampling. What happens when you 01:30:01.960 |
do an upsampling is it's basically the opposite of max pooling. You take every pixel and you 01:30:08.440 |
turn it into a 2x2 grid of that exact pixel. That's called upsampling. If you do an upsampling 01:30:17.720 |
followed by a regular convolution, that also gets rid of the checkerboard pattern. 01:30:23.400 |
And as it happens, Keras has something to do that, which is called UpSampling2D. So all 01:30:40.240 |
this does is the opposite of max pooling. It's going to double the size of your image, at 01:30:44.800 |
which point you can use a standard normal unit-strided convolution and avoid the artifacts. 01:30:51.680 |
So extra credit after you get your network working is to change it to an upsampling and 01:30:58.280 |
unit-strided convolution network and see if the checkerboard artifacts go away. 01:31:07.480 |
So that is that. At the very end here I've got some suggestions for some more things 01:31:14.960 |
that you can look at. Let's move on. I want to talk about going big. Going big can mean 01:31:36.600 |
two things. Of course it does mean we get to say big data, which is important. I'm very 01:31:46.760 |
proud that even during the big data thing I never said big data without saying rude 01:31:53.360 |
things about the stupid idea of big data. Who cares about how big it is? But in deep 01:31:58.560 |
learning, sometimes we do need to use either large objects, like if you're doing diabetic 01:32:05.120 |
retinopathy you all have like 4,000 by 4,000 pictures of eyeballs, or maybe you've got 01:32:10.360 |
lots of objects, like if you're working with ImageNet. To handle this data that doesn't 01:32:19.920 |
fit in RAM, we need some tricks. I thought we would try some interesting project that 01:32:26.400 |
involves looking at the whole ImageNet competition dataset. The ImageNet competition dataset 01:32:32.040 |
is about 1.5 million images in 1,000 categories. As I mentioned in the last class, if you try 01:32:41.800 |
to download it, it will give you a little form saying you have to use it for research 01:32:45.560 |
purposes and that they're going to check it and blah blah blah. In practice, if you fill 01:32:49.520 |
out the form you'll get back an answer seconds later. So anybody who's got a terabyte of space, 01:32:56.760 |
and since you're building your own boxes, you now have a terabyte of space, you can 01:33:00.680 |
go ahead and download ImageNet, and then you can start working through this project. 01:33:06.740 |
This project is about implementing a paper called DeViSE. And DeViSE is a really 01:33:15.320 |
interesting paper. I actually just chatted to the author about it quite recently. An amazing 01:33:23.920 |
lady named Andrea Frome, who's now at Clarifai, which is a computer vision start-up. What she 01:33:30.560 |
did in DeViSE was create a really interesting multimodal architecture. So multimodal means 01:33:38.160 |
that we're going to be combining different types of objects. In her case, she was combining 01:33:45.360 |
language with images. It's quite an early paper to look at this idea. She did something 01:33:51.800 |
which was really interesting. She said normally when we do an ImageNet network, our final 01:34:00.280 |
layer is a one-hot encoding of a category. So that means that a pug and a golden retriever 01:34:10.280 |
are no more similar or different in terms of that encoding than a pug and a jumbo jet. 01:34:17.960 |
And that seems kind of weird. If you had an encoding where similar things were similar 01:34:24.840 |
in the encoding, you could do some pretty cool stuff. In particular, one of the key 01:34:31.000 |
things she was trying to do is to create something which went beyond the 1000 ImageNet categories 01:34:37.160 |
so that you could work with types of images that were not in ImageNet at all. So the way 01:34:44.440 |
she did that was to say, alright, let's throw away the one-hot encoded category and let's 01:34:50.240 |
replace it with a word embedding of the thing. So pug is no longer 0 0 0 1 0 0 0, but it's 01:34:59.840 |
now the word-to-vec vector of the pug. And that's it. That's the entirety of the thing. 01:35:06.880 |
Train that and see what happens. I'll provide a link to the paper. One of the things I love 01:35:12.720 |
about the paper is that what she does is to show quite an interesting range of the kinds 01:35:20.240 |
of cool results and cool things you can do when you replace a one-hot encoded output with 01:35:34.640 |
Let's say this is an image of a pug. It's a type of dog. So pug gets turned into, let's 01:36:03.320 |
say pug is the 300th class in ImageNet. It's going to get turned into a 1000-long vector 01:36:13.560 |
with a 1 in position 300 and zeros everywhere else. That's normally what we use as our target 01:36:23.040 |
when we're doing image classification. We're going to throw that 1000-long thing away and 01:36:29.320 |
replace it with a 300-long thing. The 300-long thing will be the word vector for pug that 01:36:38.400 |
we downloaded from Word2Vec. Normally our input image comes in, it goes through some 01:36:54.800 |
kind of computation in our CNN and it has to predict something. Normally the thing it 01:37:04.120 |
has to predict is a whole bunch of zeros and a 1 here. So the way we do that is that the 01:37:10.760 |
last layer is a softmax layer which encourages one of the things to be much higher than the 01:37:20.400 |
So what we do is we throw that away and we replace it with the word vector for that thing, 01:37:32.160 |
fox or pug or jumbo-jet. Since the word vector, so generally that might be 300 dimensions, 01:37:40.800 |
and that's dense, that's not lots of zeros, so we can't use a softmax layer at the end 01:37:46.280 |
anymore. We probably now just use a regular linear layer. 01:37:54.320 |
So the hard part about doing this really is processing image data. There's nothing weird 01:38:03.080 |
or interesting or tricky about the architecture. All we do is replace the last layer. So we're 01:38:08.280 |
going to leverage bcolz quite a lot. So we start off by importing our usual stuff 01:38:14.760 |
and don't forget with TensorFlow to call this limit_mem thing I created so that you don't 01:38:18.920 |
use up all of your memory. One thing which can be very helpful is to define actually 01:38:26.320 |
two parts. Once you've got your own box, you've got a bunch of spinning hard disks that are 01:38:33.560 |
big and slow and cheap, and maybe a couple of fast, expensive, small SSDs or NVMe drives. 01:38:44.680 |
So I generally think it's a good idea to define a path for both. This actually happens to 01:38:51.920 |
be a mount point that has my big, slow, cheap spinning disks, and this path happens to live 01:39:00.640 |
somewhere, which is my fast SSDs. And that way, when I'm doing my code, any time I've 01:39:08.760 |
got something I'm going to be accessing a lot, particularly if it's in a random order, 01:39:13.000 |
I want to make sure that that thing, as long as it's not too big, sits in this path, and 01:39:18.880 |
anytime I'm accessing something which I'm accessing generally sequentially, which is 01:39:23.880 |
really big, I can put it in this path. This is one of the good reasons, another good reason 01:39:28.960 |
to have your own box is that you get this kind of flexibility. 01:39:35.920 |
So the first thing we need is some word vectors. The paper builds their own Wikipedia word 01:39:49.120 |
vectors. I actually think that the Word2vec vectors you can download from Google are maybe 01:39:56.800 |
a better choice here, so I've just gone ahead and shown how you can load that in. One of 01:40:03.680 |
the very nice things about Google's Word2vec word vectors is that in part 1, when we used 01:40:13.920 |
word vectors, we tended to use GloVe. GloVe would not have a word vector for GoldenRetriever, 01:40:21.480 |
they would have a word vector for Golden. They don't have phrase things, whereas Google's 01:40:28.640 |
word vectors have phrases like GoldenRetriever. So for our thing, we really need to use Google's 01:40:36.920 |
Word2vec vectors, plus anything like that which has multi-part concepts as things that 01:40:46.040 |
So you can download Word2vec, I will make them available on our platform.ai site because 01:40:55.240 |
the only way to get them otherwise is to get them from the author's Google Drive directory, 01:41:01.120 |
and trying to get to a Google Drive directory from Linux is an absolute nightmare. So I 01:41:05.760 |
will save them for you so that you don't have to get it. 01:41:12.520 |
So once you've got them, you can load them in, and they're in a weird proprietary binary. 01:41:20.040 |
If you're going to share data, why put it in a weird proprietary binary format in a Google 01:41:25.160 |
Drive thing that you can't access from Linux? Anyway, this guy did, so I then save it as 01:41:34.880 |
The word vectors themselves are in a very simple format, they're just the word followed 01:41:39.840 |
by a space, followed by the vector, space separated. I'm going to save them in a simple dictionary 01:41:51.160 |
format, so what I'm going to share with you guys will be the dictionary. So it's a dictionary 01:42:03.920 |
I'm not sure I've used this idea of zip-star before, so I should talk about this a little 01:42:07.840 |
bit. So if I've got a dictionary which maps from word to vector, how do I get out of that 01:42:17.600 |
a list of the words and a list of the vectors? The short answer is like this. But let's think 01:42:27.420 |
So I don't know, we've used zip quite a bit. So normally with zip, you go like zip, list1, 01:42:33.840 |
list2, whatever. And what that returns is an iterator which first of all gives you element1 01:42:43.800 |
of list1, element1 of list2, element1 of list3, and then element2 of list1, so forth. That's 01:42:53.840 |
There's a nice idea in Python that you can put a star before any argument. And if that 01:43:02.800 |
argument is an iterator, something that you can go through, it acts as if you had taken 01:43:09.680 |
that whole list and actually put it inside those brackets. 01:43:16.160 |
So let's say that the w2v list contained, like, fox, colon, array, pug, colon, array, and so 01:43:34.000 |
forth. When you go zip-star that, it's the same as actually taking the contents of that 01:43:49.780 |
You would want star-star if it was a dictionary, and star for a list? 01:43:54.520 |
Not quite. Star just means you're treating it as an iterator. In this case we are using 01:44:05.840 |
a list, so let's talk about star-star another time. But you're right, in this case we have 01:44:13.920 |
a list which is actually just in this form, fox, comma, array, pug, comma, array, and then 01:44:26.160 |
So what this is going to do is when we zip this, it's going to basically take all of 01:44:35.160 |
these things and create one list for those. So this idea of zip-star is something we're 01:44:52.360 |
going to use quite a lot. Honestly I don't normally think about what it's doing, I just 01:44:58.720 |
know that any time I've got like a list of tuples and I want to turn it into a tuple 01:45:06.160 |
of lists, you just do zip-star. So that's all that is, it's just a little Python thing. 01:45:14.520 |
So this gives us a list of words and a list of vectors. So any time I start looking at 01:45:20.040 |
some new data, I always want to test it, and so I wanted to make sure this worked the way 01:45:26.280 |
I thought it ought to work. So one thing I thought was, okay, let's look at the correlation 01:45:30.600 |
coefficient between small j Jeremy and big j Jeremy, and indeed there is some correlation 01:45:36.200 |
which you would expect; whereas the correlation between Jeremy and banana, I hate bananas, 01:45:43.220 |
so I was hoping this would be massively negative. Unfortunately it's not, but it is at least 01:45:47.240 |
lower than the correlation between Jeremy and big Jeremy. 01:45:50.200 |
So it's not always easy to exactly test data, but try and come up with things that ought 01:45:56.800 |
to be true and make sure they are true, so in this case this has given me some comfort 01:46:01.600 |
that these word vectors behave the way I expect them to. Now I don't really care about capitalization, 01:46:09.160 |
so I'll just go ahead and create a lower-cased word2vec dictionary, where I just do the lower-cased 01:46:15.040 |
version of everything. One trick here is I go through in reverse, because word2vec is 01:46:23.920 |
ordered where the most common words are first, so by going in reverse it means if there is 01:46:29.800 |
both a capital J Jeremy and a small j Jeremy, the one that's going to end up in my dictionary 01:46:34.620 |
will be the more common one. So what I want for DeViSE is to now get this word vector 01:46:47.180 |
for each one of our 1000 categories in ImageNet. 01:46:53.640 |
And then I'm going to go even further than that, because I want to go beyond ImageNet. 01:46:58.880 |
So I actually went and downloaded the original WordNet categories, and I filtered it down 01:47:05.440 |
to find all the nouns, and I discovered that there are actually 82,000 nouns in WordNet, 01:47:11.400 |
which is quite a few, it's quite fun looking through them. 01:47:14.920 |
So I'm going to create a map of word vectors for every ImageNet category, one set, and 01:47:21.520 |
every WordNet noun, another set. And so my goal in this project will be to try and create 01:47:27.080 |
something that can do useful things with the full set of WordNet nouns. We're going to 01:47:32.320 |
go beyond ImageNet. We've already got the 1000 ImageNet categories, we've used that 01:47:36.880 |
many times before, so I'll grab those, load them in, and then I'll do the same thing for 01:47:47.480 |
the full set of WordNet IDs, which I will share with you. 01:47:54.840 |
And so now I can go ahead and create a dictionary which goes through every one of my ImageNet 01:48:05.120 |
1000 categories and converts it into the word vector. Notice I have a filter here, and that's 01:48:14.400 |
because some of the ImageNet categories won't be in Word2vec, and that's because sometimes 01:48:24.280 |
the ImageNet categories will say things like pug, bracket, dog. They won't be exactly in the 01:48:31.040 |
same format. If you wanted to, you could probably get a better match than this, but I found 01:48:36.720 |
even with this simple approach I managed to match 51,600 out of the 82,000 WordNet nouns, 01:48:48.800 |
So what I did then was I created a list of all of the categories which didn't match, 01:48:55.880 |
and this commented out bit, as you can see, is something which literally just moved those 01:49:00.800 |
folders out of the way so that they're not in my ImageNet path anymore. 01:49:08.800 |
So the details aren't very important, but hopefully you can see at the end of this process 01:49:13.240 |
I've got something that maps every ImageNet category to a word vector, at least if I could 01:49:19.040 |
find it, and every WordNet noun to a vector, at least if I could find it, and that I've 01:49:25.520 |
modified my ImageNet data so that the categories I couldn't find moved those folders out of 01:49:35.520 |
the way. Nothing particularly interesting there. And that's because WordNet's not that big. 01:49:46.520 |
It's in RAM, so that's pretty straightforward. The images are a bit harder because we've 01:49:50.520 |
got a million or so images. So we're going to try everything we can to make this run 01:49:57.840 |
as quickly as possible. To start with, even the very process of getting 01:50:10.120 |
a list of the file names of everything in ImageNet takes a non-trivial amount of time. 01:50:16.120 |
So everything that takes a non-trivial amount of time is going to save its output. So the 01:50:20.960 |
first thing I do is I use glob, I can't remember if we used glob in Part 1, I think we did, 01:50:27.480 |
it's just a thing that's like ls star dot star. So we use glob to grab all of the ImageNet 01:50:36.280 |
training set, and then I just go ahead and pickle.dump that. For various reasons we'll 01:50:48.640 |
see shortly, it's actually a very good idea though at this point to randomize that list 01:50:53.600 |
of file names, put them in a random order. The basic idea is later on if we use chunks 01:51:02.040 |
of file names that are next to each other, they're not all going to be the same type 01:51:06.920 |
of thing. So by randomizing the file names now it's going to save us a bit of time, so 01:51:12.080 |
then I can go ahead and save that randomized list. I've given it a different name, so I 01:51:20.000 |
So I want to resize all of my images to a constant size. I'm being a bit lazy here, 01:51:25.480 |
I'm going to resize them to 224x224, that's the input size for a lot of models obviously, 01:51:32.640 |
including the one that we're going to use. That would probably be better if we resize 01:51:39.000 |
to something bigger and then we randomly zoom and crop. Maybe if we have time we'll try 01:51:47.240 |
that later, but for now we're just going to resize everything to 224x224. 01:51:56.000 |
So we have nearly a million images it turns out to resize to 224x224. That could be pretty 01:52:03.640 |
slow. So I've got some handy tricks to make it much faster. Generally speaking there are 01:52:12.880 |
three ways to make an algorithm significantly faster. 01:52:22.480 |
The first is memory locality, the second is SIMD, also known as vectorization, and the third 01:52:51.800 |
is parallel processing. Rachel is very familiar with these because she's currently creating 01:53:01.920 |
a course for the master students here on numerical linear algebra, which is very heavily about 01:53:09.480 |
So these are the three ways you can make data processing faster. Memory locality simply 01:53:15.160 |
means in your computer you have lots of different kinds of memory. For example, level 1 cache, 01:53:24.160 |
level 2 cache, RAM, solid state disk, regular hard drives, whatever. The difference in speed 01:53:38.960 |
as you go up from one to the other is generally like 10 times or 100 times or 1000 times slower. 01:53:47.360 |
You really, really, really don't want to go to the next level of the memory hierarchy 01:53:51.120 |
if you can avoid it. Unfortunately level 1 cache might be more like 16k, level 2 cache 01:53:58.840 |
might be a few meg, RAM is going to be a few gig, solid state drives is probably going 01:54:05.600 |
to be a few hundreds of gig, and your hard drives are probably going to be a few terabytes. 01:54:10.640 |
So in reality you've got to be careful about how you manage these things. You want to try 01:54:15.880 |
and make sure that you're putting stuff in the right place, that you're not filling up 01:54:21.600 |
the resources unnecessarily, and that if you're going to use a piece of data multiple times, 01:54:29.360 |
try to use it each time, immediately use it again so that it's already in your cache. 01:54:38.000 |
The second thing, which is what we're about to look at, is SIMD, which stands for Single 01:54:43.140 |
Instruction Multiple Data. Something that a shockingly large number of people even who 01:54:49.800 |
claim to be professional computer programmers don't know is that every modern CPU is capable 01:54:57.840 |
of, in a single operation, in a single thread, calculating multiple things at the same time. 01:55:06.200 |
And the way that it does it is that you basically create a little vector, generally about 8 things, 01:55:13.200 |
and you put all the things you want to calculate. Let's say you want to take the square root 01:55:16.200 |
of something. You put 8 things into this little vector, and then you call a particular CPU 01:55:23.920 |
instruction which takes the square root of 8 floating point numbers that is in this register. 01:55:32.360 |
And it does it in a single clock cycle. So when we say clock cycle, your CPU might be 01:55:38.000 |
2 or 3 GHz, so it's doing 2 or 3 billion things per second. Well it's not, it's doing 2 or 01:55:45.940 |
3 billion times 8 things per second if you're using SIMD. 01:55:52.680 |
Because so few people are aware of SIMD, and because a lot of programming environments 01:55:57.040 |
don't make it easy to use SIMD, a lot of stuff is not written to take advantage of SIMD, 01:56:03.000 |
including, for example, pretty much all of the image processing in Python. 01:56:08.600 |
However, you can do this. You can go pip install pillow-simd, and that will replace your pillow, 01:56:18.880 |
and remember pillow is like the main Python imaging library, with a new version that does 01:56:29.440 |
Because SIMD only works on certain CPUs, any vaguely recent CPU works, but because it's 01:56:36.560 |
only some, you have to add some special directives to the compiler to tell it, I have this kind 01:56:43.040 |
of CPU, so please do use these kinds of instructions. And what pillow-simd does, it actually literally 01:56:50.360 |
replaces your existing pillow, so that's why you have to say force-reinstall, because it's 01:56:56.000 |
going to be like, oh you already have a pillow, but this is like, no, I want the SIMD pillow. 01:57:01.160 |
So if you try this, you'll find that the speed of your resize literally goes up by 600%, 01:57:08.280 |
you don't have to change any code. I'm a huge fan of SIMD in general. It's one of the reasons 01:57:16.960 |
I'm not particularly fond of Python, because it doesn't make it at all easy to use SIMD, but 01:57:24.600 |
luckily some people have written stuff in C which does use SIMD and then provided these 01:57:31.440 |
Okay, so this is something to remember to try to get working when you go home. Before 01:57:37.260 |
you do it, write a little benchmark that resizes 1000 images and times it, and then run this 01:57:43.840 |
command and make sure that it gets 600% faster, that way you know it's actually working. 01:57:51.200 |
We have two questions, I don't know if you want to finish the three ways to do things 01:57:55.760 |
faster first. One is, how could you get the relation between a pug and a dog and the photo 01:58:04.600 |
of a pug and its relation to the bigger category of dog? 01:58:14.560 |
Okay, now there, why do we want to randomize the file names, can't we use shuffle equals 01:58:26.960 |
The short answer is kind of to do with locality. If you say shuffle equals true, you're jumping 01:58:32.480 |
from here on the hard disk to here on the hard disk to here on the hard disk, and hard 01:58:36.640 |
disks hate that. Remember there's a spinning disk with a little needle, and the thing's 01:58:41.840 |
moving all over the place. So you want to be reading things that are all in a row. That's 01:58:46.000 |
basically the reason. As you'll see, this is going to basically work for the concept of 01:58:53.760 |
dog versus pug, because the word vector for dog is very similar to word vector for pug, 01:59:00.080 |
so at the end we'll try it. We'll see if we can find dogs and see if it works. I'm sure 01:59:08.360 |
Finally, parallel processing refers to the fact that any modern CPU has multiple cores, 01:59:18.720 |
which literally means multiple CPUs in your CPU. Often boxes that you buy for home might 01:59:27.320 |
even have multiple CPUs in them. Again, Python is not great for parallel processing. Python 01:59:35.120 |
3 is certainly a lot better. But a lot of stuff in Python doesn't use parallel processing 01:59:39.800 |
very effectively. But a lot of modern CPUs have 10 cores or more, even for consumer CPU. 01:59:49.320 |
So if you're not using parallel processing, you're missing out on a 10x speedup. If you're 01:59:54.640 |
not using SIMD, you're missing out on a 6-8x speedup. So if you can do both of these things, 02:00:02.280 |
you can get 50x+. I mean you will, you'll get a 50x+ speedup, assuming your CPU has enough cores. 02:00:10.080 |
So we're going to do both. To get SIMD, we're just going to install it. To get parallel 02:00:14.560 |
processing, we're probably not going to see all of it today, but we're going to be using 02:00:21.720 |
I define a few things to do my resizing. One thing is I've actually recently changed how 02:00:31.680 |
I do resizing. As I'm sure you guys have noticed, in the past when I've resized things to square, 02:00:36.760 |
I've tended to add a black border to the bottom or a black border to the right, because that's 02:00:42.680 |
what Keras did. Now that I've looked into it, no best practice papers or Kaggle results do that. 02:00:52.600 |
And it makes perfect sense because CNN is going to have to learn to deal with the black 02:00:57.600 |
border. You're throwing away all that information. What pretty much all the best practice approaches 02:01:04.760 |
is to rather than rescale the longest side to be the size of your square and then fill 02:01:12.240 |
it in with black, instead take the smallest side and make that the size of your square. 02:01:18.960 |
The other side's now too big, so just chop off the top and bottom, or chop off the right 02:01:23.680 |
and left. That's called center cropping. So resizing and center cropping. What I've 02:01:31.080 |
done here is I've got something which resizes to the size of the shortest side, and then 02:01:40.760 |
over here I've got something which does the center cropping. You can look at the details 02:01:51.040 |
when you get home if you like, it's not particularly exciting, so I've got something that does 02:01:55.600 |
the resizing. This is something you can improve. Currently I'm making sure that it's a three 02:02:02.080 |
channel image, so I'm not doing a black and white or something with an alpha channel. 02:02:12.600 |
So before I finish up, next time when we start is we're going to learn about parallel processing. 02:02:22.560 |
So anybody who's interested in pre-reading, feel free to start reading and playing around 02:02:28.040 |
with Python parallel processing. Thanks everybody, see you next week. I hope your assignments 02:02:34.320 |
go really well, and let me know if I can help you out in the forum.