
Lesson 9: Cutting Edge Deep Learning for Coders


Chapters

0:00 Intro
0:27 Wiki
3:25 Style Transfer
10:25 Reading the Paper
15:00 Notation
21:15 Citations
30:35 Super Resolution
33:05 Paper
39:40 bcolz arrays
40:45 The final network
43:10 Practical considerations
52:08 Deconvolution

Whisper Transcript

00:00:00.000 | So, welcome back everybody. Thanks for coming and I hope you had a good week and had a fun
00:00:05.260 | time playing around with artistic style. I know I did. I thought I'd show you. So I tried
00:00:15.000 | a couple of things myself over the week with this artistic style stuff. I've just tried
00:00:21.060 | a couple of simple little changes which I thought you might be interested in.
00:00:26.000 | One thing before I talk about the artistic style is I just wanted to point out some of
00:00:34.040 | the really cool stuff that people have been contributing over the week. If you haven't
00:00:40.080 | come across it yet, be sure to check out the wiki. There's a nice thing in Discourse where
00:00:47.640 | you can basically set any post as being a wiki, which means that anybody can edit it.
00:00:52.440 | So I created this wiki post early on, and by the end of the week we now have all kinds
00:00:57.080 | of stuff with links to the stuff from the class, a summary of the paper, examples, a
00:01:05.480 | list of all the links, code snippets, a handy list of steps that are necessary when you're
00:01:15.080 | doing style transfer, lots of stuff about the TensorFlow Dev Summit, and so forth. Lots
00:01:26.360 | of other threads. One I saw just this afternoon popped up, which was Greg from XinXin talking
00:01:36.640 | about trying to summarize what they've learned from lots of other threads across the forum.
00:01:44.880 | This is a great thing that we can all do is when you look at lots of different things
00:01:50.760 | and take some notes, if you put them on the forum for everybody else, this is super handy.
00:01:55.960 | So if you haven't quite caught up on all the stuff going on in the forum, looking at this
00:02:00.240 | curating lesson 8 experiments thread would be probably a good place to start.
00:02:07.200 | So a couple of little changes I made in my experiments. I tried thinking about how depending
00:02:16.360 | on what your starting point is for your optimizer, you get to a very different place. And so
00:02:22.240 | clearly our optimization is not necessarily finding a local minimum, but at least saddle
00:02:28.920 | points it's not getting out of. So I tried something which was to take the random image
00:02:35.560 | and just add a Gaussian blur to it. So that makes a random image into this kind of thing.
00:02:43.280 | And I just found that even the plain style looked a lot smoother, so that was one change
00:02:49.140 | that I made which I thought worked quite well. Another change that I made just to play around
00:02:54.040 | with it was that I added a different weight to each of the style layers. And so my zip
00:03:01.920 | now has a third thing in it, which is the weights, and I just multiply by the weight. So I thought
00:03:07.000 | that those two things made my little bird look significantly better than my little bird
00:03:11.440 | looked before, so I was happy with that. You could do a similar thing for content loss.
00:03:16.320 | You could also maybe add more different layers of content loss and give them different weights
00:03:21.200 | as well. I'm not sure if anybody's tried that yet. Yes, Rachel.
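To make these tweaks concrete, here is a minimal sketch, not the exact notebook code: a Gaussian-blurred random starting image, and a per-layer weight multiplied in inside the zip over style layers. The style_loss helper and the two lists of style-layer activations are assumed to already exist, as in last week's notebook.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Tweak 1: start the optimizer from a blurred random image rather than raw noise
rand_img = np.random.uniform(-2.5, 2.5, (1, 288, 288, 3))
x0 = gaussian_filter(rand_img, sigma=[0, 2, 2, 0])   # blur only the spatial axes

# Tweak 2: give each style layer its own weight (these values are just a starting guess)
style_wgts = [0.05, 0.2, 0.2, 0.25, 0.3]
loss = sum(style_loss(gen, targ) * w
           for gen, targ, w in zip(gen_style_layers, targ_style_layers, style_wgts))
```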
00:03:32.360 | I have a question in regards to style transfer for cartoons. With cartoons, when we think
00:03:37.560 | of transferring the style, what we really mean is transferring the contours of the cartoon
00:03:42.440 | to redraw the content in that style. This is not what style transferring is doing here.
00:03:47.720 | How might I implement this? I don't know that anybody has quite figured that out, but I'll
00:03:53.080 | show you a couple of directions that may be useful.
00:03:56.600 | I've tried selecting activations that correspond with edges, as is indicated by one of
00:04:04.040 | the conv visualization papers, and comparing outputs from specifically those activations.
00:04:10.760 | So I'll show you some things you could try. I haven't seen anybody do a great job of this
00:04:15.520 | yet, but here's one example from the forum. Somebody pointed out that this cartoon approach
00:04:21.840 | didn't work very well with Dr. Seuss, but then when they changed their initial image
00:04:26.760 | not to be random but to be the picture of the dog, it actually looked quite a lot better.
00:04:30.800 | So there's one thing you could try. There's some very helpful diagrams that somebody posted
00:04:36.000 | which is fantastic. I like this summary of what happens if you add versus remove each
00:04:45.400 | layer. So this is what happens if you remove block 0, block 1, block 2, block 3, and block
00:04:53.360 | 4 to get a sense of how they impact things. You can see for the style that the last layer
00:04:58.560 | is really important to making it look good, at least for this image.
00:05:06.880 | One of you had some particularly nice examples. It seems like there's a certain taste. They're
00:05:13.200 | kind of figuring out what photos go with what images. I thought this Einstein was terrific.
00:05:17.680 | I thought this was terrific as well. Brad came up with this really interesting insight
00:05:25.320 | that starting with this picture and adding a style to it creates this extraordinary shape
00:05:32.160 | here where, as he points out, you can tell it's a man sitting in the corner, but there's
00:05:36.360 | less than 10 brush strokes. Sometimes this style transfer does things which are surprisingly
00:05:42.660 | fantastic. I have no idea what this is even in the photos, so I don't know what it is in
00:05:47.740 | the painting either. I guess I don't watch that kind of music enough.
00:05:53.360 | So there's lots of interesting ideas you can try, and I've got a link here, and you might
00:05:58.480 | have seen it in the PowerPoint, to a Keras implementation that has a whole list of things
00:06:03.120 | that you can try. Here are some particular examples. All of these examples you can get
00:06:12.640 | the details from this link. There's something called chain blurring. For some things, this
00:06:19.160 | might work well for cartoons. Notice how the matrix doesn't do a good job with the cat
00:06:25.440 | when you use the classic approach from the original paper. But if you use this chain blurring approach,
00:06:31.720 | it does a fantastic job. So I wonder if that might be one secret to the cartoons.
00:06:38.720 | Some of you I saw in the forum have already tried this, which is using color preservation
00:06:42.420 | and luminance matching, which basically means you're still taking the style but you're not
00:06:47.560 | taking the color. And I think in these particular examples, this is really great results. I think
00:06:53.080 | it depends a lot on what things you tried with.
00:06:57.720 | You can go a lot further. For example, you can add a mask and then say just do color
00:07:03.880 | preservation for one part of the photo. So here the top part of the photo has got color
00:07:08.080 | preservation and the bottom hasn't. They even show in that code how you can use a mask to
00:07:17.920 | say one part of my image should not be stylized. This is really crazy. Use masks to decide
00:07:29.240 | which one of two style images to use and then you can really generate some creative stuff.
00:07:34.320 | So there's a lot of stuff that you can play with and you can go beyond this to coming
00:07:38.120 | up with your own ideas.
00:07:40.740 | Now some of the best stuff, you're going to learn a bit more today about how to do some
00:07:44.560 | of these things better. But just to give an idea, if you go to likemo.net, you can literally
00:07:50.360 | draw something using four colors and then choose a style image and it will turn your
00:07:57.520 | drawing into an image. Basically the idea is blue is going to be water and green is going
00:08:02.760 | to be foliage and I guess red is going to be foreground. There's a lot of good examples
00:08:10.920 | of this kind of neural doodle they call it online.
00:08:16.600 | Something else we'll learn more about how to do better today is if you go to affinelayer.com,
00:08:21.240 | there's a very recent paper called Pix2Pix. We're going to be learning quite a bit in
00:08:26.320 | this class about how to do segmentation, which is where you take a photo and turn it into
00:08:31.560 | a colored image, basically saying the horse is here, the bicycle is here, the person is
00:08:35.640 | here. This is basically doing the opposite. You start by drawing something, saying I want
00:08:40.440 | you to create something that has a window here and a window sill here and a door here
00:08:44.880 | and a column there, and it generates a photo, which is fairly remarkable.
00:08:51.840 | So the stuff we've learned so far won't quite get you to do these two things, but by the
00:08:55.800 | end of today we should be able to. This is a nice example that I think some folks at
00:09:02.760 | Adobe built showing that you could basically draw something and it would try and generate
00:09:09.080 | an image that was close to your drawing where you just needed a small number of lines. Again
00:09:14.000 | we'll link to this paper from the resources. This actually shows it to you in real time.
00:09:21.400 | You can see that there's some new way of doing art that's starting to appear where you don't
00:09:27.680 | necessarily need a whole lot of technique. I'm not promising it's going to turn you into
00:09:32.240 | a Van Gogh, but you can at least generate images that maybe are in your head in some
00:09:36.000 | style that's somewhat similar to somebody else's. I think it's really interesting.
00:09:45.880 | One thing I was thrilled to see is that at least two of you have already written blog
00:09:51.080 | posts on Medium. That was fantastic to see. So I hope more of you might try to do that
00:09:56.720 | this week. It definitely doesn't need to be something that takes a long time. I know some
00:10:03.880 | of you are also planning on turning your forum posts into blog posts, so hopefully we'll
00:10:09.400 | see a lot more blog posts this week popping up. I know the people who have done that have
00:10:14.320 | found that a useful experience as well.
00:10:19.240 | One of the things that I suggested doing pretty high on the list of priorities for this week's
00:10:24.420 | assignment was to go through the paper knowing what it's going to say. I think this is really
00:10:32.000 | helpful: when you already know how to do something, go back over the paper, because
00:10:37.120 | this is a great way to learn how to read papers. You already know what it's telling you. This
00:10:41.440 | is totally the way I learnt to read papers myself.
00:10:47.800 | So I've gone through and I've highlighted a few key things which as I went through I
00:10:52.720 | thought were kind of important. In the abstract of the paper, let me ask, how many people
00:10:59.480 | kind of went back and relooked at this paper again? Quite a few of you, that's great. In
00:11:08.520 | the abstract, they basically say what is it that they're introducing. It's a system based
00:11:12.040 | on a deep neural network that creates artistic images of higher perceptual quality. So we're
00:11:16.440 | going to read this paper and hopefully at the end of it we'll know how to do that.
00:11:20.360 | Then in the first section, they tell us about the basic ideas. When CNNs are trained on
00:11:27.960 | object recognition, they developed a representation of an image. Along the processing hierarchy
00:11:34.000 | of the network, it's transformed into representations that increasingly care about the actual content
00:11:38.480 | compared to the pixel values. So it describes the basic idea of content loss. Then they
00:11:46.360 | describe the basic idea of style loss, which is looking at the correlations between the
00:11:52.120 | different filter responses over the spatial extent of the feature maps. This is one of
00:11:56.640 | these sentences that read on its own doesn't mean very much, but now that you know how
00:12:01.560 | to do it, you can read it and you can see what that means, and then when you get to
00:12:05.960 | the methods section, we learn more.
00:12:09.080 | So the idea here is that by including the feature correlations, and this answers one
00:12:13.160 | of the questions that one of you had on the forum, by including feature correlations of
00:12:16.640 | multiple layers, we obtain a multi-scale representation of the input image. This idea of a multi-scale
00:12:23.440 | representation is something we're going to be coming across a lot because a lot of this,
00:12:28.040 | as we discussed last week, a lot of this class is about generative models. One of the tricky
00:12:33.600 | things with generative models is both to get the general idea of the thing you're trying
00:12:40.000 | to generate correct, but also get all the details correct. So the details generally
00:12:44.760 | require you to zoom into a small scale, and the big picture correct is about zooming out
00:12:49.840 | to a large scale.
00:12:51.680 | So this was one of the key things that they did in this paper was show you how to create
00:12:55.600 | a style representation that included multiple resolutions. We now know that where they did
00:13:01.360 | that was to use multiple style layers, and as we go through the layers of VGG, they gradually
00:13:07.560 | become lower and lower resolution, larger and larger receptive fields.
00:13:14.720 | It's always great to look at the figures, and I was thrilled to see that some
00:13:18.240 | of you were trying to recreate these figures, which actually turned out to be slightly non-trivial.
00:13:25.360 | So we can see exactly what that figure is, and if you haven't tried it for yourself yet,
00:13:31.080 | you might want to try it, see if you can recreate this figure.
00:13:35.280 | It's good to try and find in a paper the key thing that they're showing. In this case,
00:13:47.760 | they found that representations of content and style in a CNN are separable, and you
00:13:52.400 | can manipulate both to create new images. So again, hopefully now you can look at that
00:13:58.200 | and say, Oh yeah, that makes sense.
00:14:08.600 | You can see that with papers, certainly with this paper, there's often quite a lot of introduction
00:14:13.680 | that often says the same thing a bunch of different ways. The first time you read it,
00:14:20.360 | one paragraph might not make sense, but later on they say it a different way and it starts
00:14:24.360 | to make more sense. So it's worth looking through the introductory remarks, maybe two
00:14:29.960 | or three times. You can certainly see that here again, talking about the different layers
00:14:36.840 | and how they behave.
00:14:40.280 | Again, showing the results of some experiments. Again, you can see if you can recreate these
00:14:49.520 | experiments, make sure you understand how to do it. And then there's a whole lot of stuff
00:14:56.320 | I didn't find that interesting until we get to the section called methods.
00:15:00.400 | So the method section is the section that hopefully you'll learn the most about reading
00:15:03.880 | papers after you've implemented something by reading the section called methods. I want
00:15:07.920 | to show you a few little tricks of notation. You do need to be careful of little details
00:15:14.760 | that fly by. Like here, they used average pooling. That's a sentence which if you weren't
00:15:19.880 | reading carefully, you could skip over it. We need to use average pooling, not max pooling.
00:15:28.120 | So they will often have a section which explicitly says, Now I'm going to introduce the notation.
00:15:34.320 | This paper doesn't. This paper just introduces the notation as part of the discussion. But
00:15:39.800 | at some point, you'll start getting Greek letters or things with subscripts or whatever.
00:15:46.760 | Notation starts appearing. And so at this point, you need to start looking very carefully.
00:15:51.620 | And at least for me, I find I have to go back and read something many times to remember
00:15:56.400 | what's L, what's M, what's N. This is the annoying thing with math notation, is they're single
00:16:02.720 | letters. They generally don't have any kind of mnemonic. Often though you'll find that
00:16:07.440 | across papers in a particular field, they'll tend to reuse the same kind of English and
00:16:13.000 | Greek letters for the same kinds of things. So M will generally be the number of rows,
00:16:18.720 | capital M. Capital N will often be the number of columns. K will often be the index that
00:16:25.280 | you're summing over, so on and so forth.
00:16:29.240 | So here, the first thing which is introduced is x with an arrow on top. So x with an arrow
00:16:35.520 | on top means it's a vector. It's actually an input image, but they're going to turn it
00:16:41.900 | into a vector by flattening it out.
00:16:46.320 | So our image is called x. And then the CNN has a whole bunch of layers, and every time
00:16:51.800 | you see something with a subscript or a superscript like this, you need to look at both of the
00:16:55.440 | two bits because they've both got a meaning. The big thing is like the main object. So
00:17:01.600 | in this case, capital N is the number of filters. And then the subscript or superscript is like in an
00:17:08.160 | array or a tensor. In Python, it's like the thing in square brackets.
00:17:13.560 | So it has a subscript l, which is like which number layer it is. And so often
00:17:21.320 | as I read a paper, I'll actually try to write code as I go and put little comments so that
00:17:27.760 | I'll write layer, square bracket, layer number, close square bracket, and then I have a comment
00:17:33.280 | after, say, ml, just to remind myself. So I'm creating the code and mapping it to the letters.
00:17:40.960 | So there are nl filters. We know from a CNN that each filter creates a feature map, so
00:17:46.600 | that's why there are nl feature maps. So remember, any time you see the same letter, it means
00:17:51.240 | the same thing within a paper. Each feature map is of size m, and as I mentioned before,
00:18:00.400 | m tends to be rows and n tends to be columns. So here it says m is the height times the
00:18:07.720 | width of the feature map. So here we can see they've gone .flat, basically, to make it
00:18:13.720 | all 1 row. Now this is another piece of notation you'll see all the time. A layer l can be stored
00:18:22.200 | in a matrix called f, and now the l has gone to the top. Same basic idea, just an index.
00:18:29.800 | So the matrix f is going to contain our activations. And this thing here where it says r with a
00:18:35.920 | little superscript has a very special meaning. It's referring to basically what is the shape
00:18:42.560 | of this. So when you see this shape, the R means that they're real numbers (floats), and
00:18:49.280 | this thing here means it's a matrix. You can see the x, so it means it's rows by columns.
00:18:53.760 | So there are n rows and m columns in this matrix, and every matrix, there's one matrix
00:19:00.680 | for each layer, and there's a different number of rows and different number of columns for
00:19:04.480 | each layer. So you can basically go through and map it to the code that you've already
00:19:09.920 | written. So I'm not going to read through the whole thing, but there's not very much here,
00:19:16.120 | and it would be good to make sure that you understand all of it, perhaps with the exception
00:19:21.240 | of the derivative, because we don't care about derivatives because they get done for us thanks
00:19:25.880 | to Theano and TensorFlow. So you can always skip the bits about derivatives.
00:19:36.000 | So then they do the same thing basically describing the Gram matrix. So they show here that the
00:19:43.140 | basic idea of the Gram matrix is that they create an inner product between the vectorized
00:19:49.320 | feature map i and j. So vectorized here means turned into a vector, so the way you turn
00:19:55.480 | a matrix into a vector is flattened. This means the inner product between the flattened feature
00:20:01.600 | maps, so those matrices we saw.
00:20:07.640 | So hopefully you'll find this helpful. You'll see there will be small little differences.
00:20:14.040 | So rather than taking the mean, they use here the sum, and then they divide back out the
00:20:24.720 | number of rows and columns to create the mean this way. In our code, we actually put the
00:20:29.980 | division inside the sum, so you'll see these little differences of how we implement things.
00:20:35.280 | And sometimes you may see actual meaningful differences, and that's often a suggestion
00:20:41.880 | of something you can try. So that describes the notation and the method, and that's it.
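As one way of mapping that notation back to code, here is a hedged sketch of a Gram matrix written with the Keras backend: flatten each of the N_l feature maps into a row of length M_l, take the inner products, then divide the sum back out by the number of elements. This follows the spirit of the lesson notebook; the exact normalisation is one of those small differences you can experiment with.

```python
import keras.backend as K

def gram_matrix(x):
    # x: one layer's activations, shape (height, width, n_filters), the paper's F^l
    # rows of `feat` are the N_l flattened feature maps, each of length M_l = height*width
    feat = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    gram = K.dot(feat, K.transpose(feat))   # G_ij = sum over positions of F_i * F_j
    # divide the sum back out, rather than taking the mean inside; same idea either way
    return gram / K.cast(K.prod(K.shape(x)), K.floatx())
```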
00:20:56.120 | But then very importantly, any time you come across some concept which you're not familiar
00:21:03.080 | with, it will pretty much always have a reference, a citation. So you'll see there's little numbers
00:21:15.480 | all over the place. There's lots of different ways of doing these references. But anytime
00:21:20.920 | you come across something which has a citation, like a new piece of notation or a new concept,
00:21:27.360 | you don't know what it is. Generally the first time I see it in a paper, I ignore it. But
00:21:32.640 | if I keep reading and it turns out to be something that actually is important and I can't understand
00:21:37.720 | the basic idea at all, I generally then put this paper aside, I put it in my to-read file,
00:21:44.440 | and make the new paper I'm reading the thing that it's citing. Because very often a paper
00:21:49.480 | is entirely meaningless until you've read one or two of the key papers it's based on.
00:21:56.400 | Sometimes this can be like reading the dictionary when you don't know English. It can be layer
00:22:00.440 | upon layer of citations, and at some point you have to stop. I think you should find
00:22:08.800 | that the basic set of papers that things refer to is pretty much all stuff you guys know
00:22:15.440 | at this point. So I don't think you're going to get stuck in an infinite loop. But if you
00:22:19.360 | ever do, let us know in the forum and we'll try to help you get unstuck. Or if there's
00:22:27.200 | any notation you don't understand, let us know. One of the horrible things
00:22:30.840 | about math is it's very hard to search for. It's not like code, where you can take a function name
00:22:35.920 | and search for Python plus the function name; with math you've just got some weird squiggly shape. So again,
00:22:41.080 | feel free to ask if you're not sure about that. There is a great Wikipedia page,
00:22:47.540 | I think it's just called math notation or something, which lists pretty much every
00:22:54.440 | piece of notation. There are various places you can look up notation as well.
00:23:01.720 | So that's the paper. Let's move to the next step. So I think what I might do is kind of
00:23:17.160 | try and draw the basic idea of what we did before so that I can draw the idea of what
00:23:22.200 | we're going to do differently this time. So previously, and now this thing is actually
00:23:25.920 | calibrated, we had a random image and we had a loss function. It doesn't matter what the
00:23:38.560 | loss function was. We know that it happened to be a combination of style_loss plus content_loss.
00:23:46.400 | What we did was we took our image, our random image, and we put it through this loss function
00:23:55.400 | and we got out of it two things. One was the loss and the other was the gradients. And
00:24:02.880 | then we used the gradients with respect to the original pixels to change the original
00:24:08.800 | pixels. So we basically repeated that loop again and again, and the pixels gradually
00:24:15.400 | changed to make the loss go down. So that's the basic approach that we just used. It's
00:24:25.480 | a perfectly fine approach for what it is. And in fact, if you are wanting to do lots
00:24:31.240 | of different photos with lots of different styles, like if you created a web app where
00:24:37.320 | you said please upload any style image and any content image, here's your artistic style
00:24:43.320 | version, this is probably still the best, particularly with some of those tweaks I talked about.
00:24:50.400 | But what if you wanted to create a web app that was a Van Gogh irises generator? Upload
00:24:57.520 | any image and I will give you that image in the style of Van Gogh's irises. You can do
00:25:03.720 | better than this approach, and the reason you can do better is that we can do something
00:25:08.520 | where you don't have to do a whole optimization run in order to create that output. Instead,
00:25:16.500 | we can train a CNN to learn to output photos in the style of Van Gogh's irises. The basic
00:25:24.760 | idea is very similar. What we're going to do this time is we're going to have lots of
00:25:32.040 | images. We're going to take each image and feed it into the exact same loss function
00:25:41.980 | that we used before, with the style loss plus the content loss. For the style loss, we're
00:25:50.440 | going to use Van Gogh's irises, and for the content loss, we're going to use the image
00:25:59.960 | that we're currently looking at. What we do is rather than changing the pixels of the
00:26:09.800 | original photo, instead what we're going to do is we're going to train a CNN. Let me take this
00:26:20.320 | out of the way. Let's put a CNN in the middle. These are the layers of the CNN. We're going
00:26:43.960 | to try and get that CNN to spit out a new image. There's an input image and an output
00:26:55.920 | image. This new CNN we've created is going to spit out an output image that when you
00:27:02.400 | put it through this loss function, hopefully it's going to give a small number. If it gives
00:27:10.500 | a small number, it means that the content of this photo still looks like the original
00:27:17.520 | photo's content, and the style of this new image looks like the style of Van Gogh's irises.
00:27:24.520 | So if you think about it, when you have a CNN, you can really pick any loss function
00:27:31.720 | you like. We've tended to use pretty simple loss functions so far like mean squared error
00:27:37.160 | or cross entropy. In this case, we're going to use a very different loss function which
00:27:43.000 | is going to be style plus content loss using the same approach that we used just before.
00:27:51.000 | And because that was generated by a neural net, we know it's differentiable. And you
00:27:56.520 | can optimize any loss function as long as the loss function is differentiable. So if
00:28:03.520 | we now basically take the gradients of this output, not with respect to the input image,
00:28:10.320 | but with respect to the CNN weights, then we can take those gradients and use them to
00:28:18.600 | update the weights of the CNN so that the next iteration through the CNN will be slightly
00:28:23.320 | better at turning that image into a picture that has a good style match with Van Gogh's
00:28:28.880 | irises. Does that make sense? So at the end of this, we run this through lots of images.
00:28:36.560 | We're just training a regular CNN, and the only thing we've done differently is to replace
00:28:40.720 | the loss function with the style_loss plus content_loss that we just used. And so at
00:28:47.080 | the end of it, we're going to have a CNN that has learnt to take any photo and will spit
00:28:52.720 | out that photo in the style of Van Gogh's irises. And so this is a win, because it means now
00:28:58.920 | in your web app, which is your Van Gogh's irises generator, you now don't have to run
00:29:03.880 | an optimization path on the new photo, you just do a single forward pass to a CNN, which
00:29:09.120 | is instant. Isn't this going to limit the filters you can use? Let's say you have Photoshop and
00:29:26.000 | you want to change between multiple styles. Yeah, this is going to do just one type of style.
00:29:35.320 | Is there a way of combining multiple styles, or is it just going to be a combination of
00:29:41.480 | all of them? You can combine multiple styles by just having
00:29:45.960 | multiple bits of style loss for multiple images, but you're still going to have the problems
00:29:50.320 | that that network has only learned to create one kind of image. It may be possible to train
00:30:01.040 | it so it takes both a style image and a content image, but I don't think I've seen that done
00:30:06.360 | yet. Having said that, there is something simpler
00:30:21.820 | and in my opinion more useful we can do, which is rather than doing style loss plus content
00:30:29.040 | loss. Let's think of another interesting problem to solve, which is called super resolution.
00:30:37.320 | Super resolution is something which, honestly, when Rachel and I started playing around with
00:30:42.200 | it a while ago, nobody was that interested in it. But in the last year or so it's become
00:30:47.560 | really hot. So we were kind of playing around with it quite a lot, we thought it was really
00:30:56.120 | interesting but suddenly it's got hot. The basic idea of super resolution is you start
00:31:00.900 | off with a low-res photo. The reason I started getting interested in this was I wanted to
00:31:05.600 | help my mom take her family photos that were often pretty low quality and blow them up
00:31:12.320 | into something that was big and high quality that she could print out.
00:31:16.520 | So that's what you do. You try to take something which starts with a small low-res photo and
00:31:23.080 | turns it into a big high-res photo. Now perhaps you can see that we can use a very similar
00:31:36.720 | technique for this. What we could do is between the low-res photo and the high-res photo,
00:31:55.800 | we could introduce a CNN. That CNN could look a lot like the CNN from our last idea, but
00:32:00.720 | it's taking in as input a low-res image, and then it's sticking it into a loss function,
00:32:11.240 | and the loss function is only going to calculate content loss. The content loss it will calculate
00:32:18.520 | is between the input that it's got from the low-res after going through the CNN compared
00:32:24.960 | to the activations from the high-res. So in other words, has this CNN successfully created
00:32:33.400 | a bigger photo that has the same activations as the high-res photo does?
00:32:38.960 | And so if we pick the right layer for the high-res photo, then that ought to mean that
00:32:54.960 | we've constructed a new image that matches the content of the high-res photo.
00:32:54.960 | This is one of the things I wanted to talk about today. In fact, I think it's at the
00:33:00.600 | start of the next paper we're going to look at is they even talk about this.
00:33:06.760 | This is the paper we're going to look at today, Perceptual Losses for Real-Time Style Transfer
00:33:10.760 | and Super-Resolution. This is from 2016. So it took about a year or so to go from the
00:33:15.560 | thing we just saw to this next stage. What they point out in the abstract here is that
00:33:28.520 | people had done super resolution with CNNs before, but previously the loss function they
00:33:33.680 | used was simply the mean-squared error between the pixel outputs of the upscaling network
00:33:40.240 | and the actual high-res image.
00:33:44.020 | The problem is that it turns out that that tends to create blurry images. It tends to
00:33:49.360 | create blurry images because the CNN has no reason not to create blurry images. Blurry
00:33:57.600 | images actually tend to look pretty good in the loss function because as long as you get
00:34:01.440 | the general, oh this is probably somebody's face, I'll put a face color here, then it's
00:34:07.120 | going to be fine. Whereas if you take the second or third conv block of VGG, then it needs
00:34:13.800 | to know that this is an eyeball or it's not going to look good. So if you do it not with
00:34:20.920 | pixel loss, but with the content loss we just learned about, you're probably going to get
00:34:25.880 | better results.
00:34:29.520 | Like many papers in deep learning, this paper introduces its own language. In the language
00:34:36.120 | of this paper, perceptual loss is what they call the mean-squared errors between the activations
00:34:44.040 | of a network with two images. So the thing we've been calling content loss, they call
00:34:50.120 | perceptual loss.
00:34:53.280 | So one of the nice things they do at the start of this, and I really like it when papers
00:34:57.760 | do this, is to say why is this paper important? Well this paper is important because many
00:35:02.460 | problems can be framed as image transformation tasks, where a system receives some input
00:35:08.280 | and chucks out some other output. For example, denoising. Learn to take an input image that's
00:35:15.360 | full of noise and spit out a beautifully clean image. Super resolution, take an input image
00:35:21.360 | which is low-res and spit out a high-res. Colorization, take an input image which is black and white
00:35:27.320 | and spit out something which is color. Now one of the interesting things here is that
00:35:32.720 | all of these examples, you can generate as much input data as you like by taking lots
00:35:39.580 | of images, which are either from your camera or you download off the internet or from ImageNet,
00:35:44.320 | and you can make them lower-res. You can add noise. You can make them black and white.
00:35:50.160 | So you can generate as much labeled data as you like. That's one of the really cool things
00:35:53.880 | about this whole topic of generators.
00:36:09.060 | With that example, going to lower-res imagery, it's algorithmically done. Is the neural net
00:36:19.680 | only going to learn how to transform something that's algorithmically downsampled, versus actual
00:36:25.600 | low-res imagery that doesn't have, like --
00:36:30.500 | So one thing I'll just mention is the way you would create your labeled data is not
00:36:34.560 | to do that low-res on the camera. You would grab the images that you've already taken
00:36:39.440 | and make them low-res just by doing filtering in OpenCV or whatever. That is algorithmic,
00:36:50.840 | and it may not be perfect, but there's lots of ways of generating that low-res image.
00:36:58.480 | So there's lots of ways of creating a low-res image. Part of it is about how do you do that
00:37:02.320 | creation of the low-res image and how well do you match the real low-res data you're
00:37:07.720 | going to be getting. But in the end, in this case, things like low-resolution images or
00:37:14.480 | black and white images, it's so hard to start with something which could be like -- I've
00:37:20.160 | seen versions with just an 8x8 picture and turning it into a photo. It's so hard to do
00:37:28.240 | that regardless of how that 8x8 thing was created that often the details of how the
00:37:34.640 | low-res image was created don't really matter too much.
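For instance, here is a rough sketch of how you might build such (low-res, high-res) training pairs yourself with PIL; the sizes match the ones used later in the lesson, but the function and filename handling are my own, not the course's.

```python
from PIL import Image
import numpy as np

def make_pair(fname, hi=288, lo=72):
    # any photo you already have becomes a labelled training example
    img = Image.open(fname).convert('RGB').resize((hi, hi), Image.BICUBIC)
    small = img.resize((lo, lo), Image.BICUBIC)   # algorithmic downsampling is the "input"
    return np.array(small), np.array(img)         # (72,72,3) input, (288,288,3) target
```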
00:37:39.600 | There are some other examples they mention which is turning an image into an image which
00:37:43.720 | includes segmentation. We'll learn more about this in coming lessons, but segmentation refers
00:37:51.040 | to taking a photo of something and creating a new image that basically has a different
00:37:56.000 | color for each object. Horses are green, cars are blue, buildings are red, that kind of
00:38:01.960 | thing. That's called segmentation. As you know from things like the fisheries competition,
00:38:07.360 | segmentation can be really important as a part of solving other bigger problems.
00:38:12.640 | Another example they mention here is depth estimation. There are lots of important reasons
00:38:16.520 | you would want to use depth estimation. For example, maybe you want to create some fancy
00:38:21.800 | video effects where you start with a flat photo and you want to create some cool new
00:38:28.800 | Apple TV thing that moves around the photo with a parallax effect as if it was 3D. If
00:38:36.160 | you were able to use a CNN to figure out how far away every object was automatically, then
00:38:42.520 | you could turn a 2D photo into a 3D image automatically.
00:38:49.080 | Taking an image in and sticking an image out is kind of the idea in computer vision at
00:38:54.680 | least of generative networks or generative models. This is why I wanted to talk a lot
00:38:59.480 | about generative models during this class. It's not just about artistic style. Artistic
00:39:05.500 | style was just my sneaky way of introducing you to the world of generative models.
00:39:15.120 | Let's look at how to create this super resolution idea. Part of your homework this week will
00:39:23.640 | be to create the new approach to style transfer. I'm going to build the super resolution version,
00:39:31.520 | which is a slightly simpler version, and then you're going to try and build on top of that
00:39:35.480 | to create the style transfer version. Make sure you let me know if you're not sure at
00:39:43.680 | any point.
00:39:44.680 | I've already created a sample of 20,000 ImageNet images, and I've created two sizes. One is
00:39:54.560 | 288x288, and one is 72x72, and they're available as bcolz arrays. I actually posted the link
00:40:08.600 | to these last week, and it's on platform.fast.ai. So we'll open up those bcolz arrays. One trick
00:40:15.320 | you might have hopefully learned in part 1 is that you can turn a bcolz array into a
00:40:20.160 | numpy array by slicing it with everything. Any time you slice a bcolz array, you get
00:40:27.560 | back a numpy array. So if you slice everything, then this turns it into a numpy array. This
00:40:33.720 | is just a convenient way of sharing numpy arrays in this case. So we've now got an array of
00:40:39.480 | low resolution images and an array of high resolution images.
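In code, that trick looks something like this (the filenames here are placeholders, not necessarily the ones from the lesson's download):

```python
import bcolz

arr_lr = bcolz.open('trn_resized_72.bc')    # low-res bcolz array on disk (placeholder name)
arr_hr = bcolz.open('trn_resized_288.bc')   # high-res bcolz array on disk (placeholder name)

# slicing with [:] pulls the whole carray into memory as a plain numpy array
arr_lr, arr_hr = arr_lr[:], arr_hr[:]
print(arr_lr.shape, arr_hr.shape)           # e.g. (20000, 72, 72, 3) (20000, 288, 288, 3)
```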
00:40:45.360 | So let me start maybe by showing you the final network. Okay, this is the final network.
00:40:58.040 | So we start off by taking in a batch of low-res images. The very first thing we do is stick
00:41:06.600 | them through a convolutional block with a stride of 1. This is not going to change its
00:41:11.960 | size at all. This convolutional block has a filter size of 9, and it generates 64 filters.
00:41:21.300 | So this is a very large filter size. Particularly nowadays, filter sizes tend to be 3. Actually
00:41:30.240 | in a lot of modern networks, the very first layer is very often a large filter size, just
00:41:37.840 | the one, just one very first layer. And the reason is that it basically allows us to immediately
00:41:43.760 | increase the receptive field of all of the layers from now on. So by having 9x9, and
00:41:53.040 | we don't lose any information because we've gone from 3 channels to 64 filters. So each
00:42:00.480 | of these 9x9 convolutions can actually have quite a lot of information because you've
00:42:05.120 | got 64 filters. So you'll be seeing this quite a lot in modern CNN architectures, just a
00:42:11.720 | single large filter conv layer. So this won't be unusual in the future.
00:42:18.520 | Now the next thing, I'm going to give the green box behind you. Oh, just a moment, sorry.
00:42:29.040 | The stride 1 is also pretty popular, I think. Well the stride 1 is important for this first
00:42:36.600 | layer because you don't want to throw away any information yet. So in the very first
00:42:40.720 | layer, we want to keep the full image size. So with the stride 1, it doesn't change, it
00:42:46.160 | doesn't downsample at all.
00:42:47.680 | But there's also a lot of duplication, right? Like a filter size of 9 with a stride of 1?
00:42:53.560 | They overlap a lot, absolutely. But that's okay. A good implementation of a convolution
00:43:02.080 | is going to hopefully memoize some of that, or at least keep it in cache. So it hopefully
00:43:08.200 | won't slow it down too much.
00:43:11.040 | One of the discussions I was just having during the break was how practical are the things
00:43:24.440 | that we're learning at the moment compared to part 1 where everything was just designed
00:43:28.840 | entirely to be the most practical things which we have best practices for. And the answer
00:43:34.600 | is a lot of the stuff we're going to be learning, no one quite knows how practical it is because
00:43:39.600 | a lot of it just hasn't really been around that long and isn't really that well understood
00:43:44.840 | and maybe there aren't really great libraries for it yet.
00:43:47.520 | So one of the things I'm actually hoping from this part 2 is by learning the edge of research
00:43:54.800 | stuff or beyond amongst a diverse group is that some of you will look at it and think
00:44:01.120 | about whatever you do 9 to 5 or 8 to 6 or whatever and think, oh, I wonder if I could
00:44:08.920 | use that for this. If that ever pops into your head, please tell us. Please talk about
00:44:16.320 | it on the forum because that's what we're most interested in. It's like, oh, you could
00:44:20.920 | use super-resolution for blah or depth-finding for this or generative models in general for
00:44:28.000 | this thing I do in pathology or architecture or satellite engineering or whatever.
00:44:35.360 | So it's going to require some imagination sometimes on your part. So often that's why
00:44:42.440 | I do want to spend some time looking at stuff like this where it's like, okay, what are
00:44:49.320 | the kinds of things this can be done for? I'm sure you know in your own field, one of
00:44:55.280 | the differences between expert and beginner is the way an expert can look at something
00:44:59.840 | from first principles and say, okay, I could use that for this totally different thing
00:45:05.640 | which has got nothing to do with the example that was originally given to me because I
00:45:08.820 | know the basic steps are the same. That's what I'm hoping you guys will be able to do
00:45:16.080 | is not just say, okay, now I know how to do artistic style. Are there things in your field
00:45:33.680 | which have some similarities? We were going to talk about the super-resolution network.
00:45:48.280 | We talked about the idea of the initial conv block. After the initial conv block, we have
00:45:57.720 | the computation. In any kind of generative network, there's the key work it has to do,
00:46:06.640 | which in this case is starting with a low-res image, figure out what might that black dot
00:46:13.720 | be. Is it an eyeball or is it a wheel? Basically if you want to do really good upscaling, you
00:46:22.360 | actually have to figure out what the objects are so you know what to draw. That's kind
00:46:28.920 | of like the key computation this CNN is going to have to learn to do. In generative models,
00:46:35.760 | we generally like to do that computation at a low resolution. There's a couple of reasons
00:46:41.400 | why. The first is that at a low resolution there's less work to do so the computation
00:46:46.040 | is faster. But more importantly, at higher resolutions it generally means we have a smaller
00:46:53.280 | receptive field. It generally means we have less ability to capture large amounts of the
00:46:59.480 | image at once. If you want to do really great computations where you recognize that this
00:47:09.560 | blob here is a face and therefore the dot inside it is an eyeball, then you're going
00:47:14.400 | to need enough of a receptive field to cover that whole area.
00:47:18.760 | Now I noticed a couple of you asked for information about receptive fields on the forum thread.
00:47:24.880 | There's quite a lot of information about this online, so Google is your friend here. But
00:47:30.960 | the basic idea is if you have a single convolutional filter of 3x3, the receptive field is 3x3.
00:47:39.840 | So it's how much space can that convolutional filter impact.
00:47:55.000 | On the other hand, what if you had a 3x3 filter which had a 3x3 filter as its input? So that
00:48:05.600 | means that the center one took all of this. But what did this one take? Well this one
00:48:12.000 | would have taken, depending on the stride, probably these ones here. And this one over
00:48:19.960 | here would have taken these ones here.
00:48:25.920 | So in other words, in the second layer, assuming a stride of 1, the receptive field is now
00:48:31.880 | 5x5, not 3x3. So the receptive field depends on two things. One is how many layers deep
00:48:41.600 | are you, and the second is how much did the previous layers either have a nonunit stride
00:48:48.680 | or maybe they had max pooling. So in some way they were becoming down sampled. Those
00:48:53.960 | two things increased the receptive field.
00:48:57.320 | And so the reason it's great to be doing these computations on a large receptive field is
00:49:04.040 | that it then allows you to look at the big picture and look at the context. It's not
00:49:09.400 | just edges anymore, but eyeballs and noses.
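If you want to play with that arithmetic, here is a tiny sketch (mine, not from the lesson) that works out the receptive field of a stack of layers from their filter sizes and strides:

```python
def receptive_field(layers):
    # layers: list of (filter_size, stride) tuples, shallowest layer first
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * current sample spacing
        jump *= s              # a non-unit stride spreads out where later layers sample from
    return rf

print(receptive_field([(3, 1), (3, 1)]))   # 5 -> two stride-1 3x3 convs see a 5x5 patch
print(receptive_field([(3, 2), (3, 1)]))   # 7 -> downsampling grows the field faster
```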
00:49:15.760 | So in this case, we have four blocks of computation where each block is a ResNet block. So for
00:49:23.760 | those of you that don't recall how ResNet works, it would be a good idea to go back
00:49:28.520 | to Part 1 and review. But to remind ourselves, let's look at the code. Here's a ResNet block.
00:49:35.880 | So all a ResNet block does is it takes some input and it does two convolutional blocks
00:49:43.360 | on that input, and then it adds the result of those convolutions back to the original
00:49:49.240 | input.
00:49:50.240 | So you might remember from Part 1 we actually drew it. We said there's some input and it
00:49:54.600 | goes through two convolutional blocks and then it goes back and is added to the original.
00:50:03.200 | And if you remember, we basically said in that case we've got y equals x plus some function
00:50:10.520 | of x, which means that the function equals y minus x and this thing here is a residual.
00:50:24.640 | So a whole stack of residual blocks, ResNet blocks on top of each other can learn to gradually
00:50:30.200 | home in on whatever it's trying to do. In this case, what it's trying to do is get
00:50:35.800 | the information it's going to need to upscale this in a smart way.
00:50:41.400 | So we're going to be using a lot more of this idea of taking blocks that we know work well
00:50:49.160 | for something and just reusing them. So then what's a conv block? All a conv block is in
00:50:55.480 | this case is it's a convolution followed by a batch norm, optionally followed by an activation.
00:51:04.360 | And one of the things we now know about ResNet blocks is that we generally don't want an activation
00:51:12.920 | at the end. That's one of the things that a more recent paper discovered. So you can
00:51:17.400 | see that for my second conv block I have no activation.
00:51:22.640 | I'm sure you've noticed throughout this course that I refactor my network architectures a
00:51:28.360 | lot. My network architectures don't generally list every single layer, but they're generally
00:51:33.320 | functions which have a bunch of layers. A lot of people don't do this. A lot of the architectures
00:51:42.120 | you find online are like hundreds of lines of layer definitions. I think that's crazy.
00:51:47.480 | It's so easy to make mistakes when you do it that way, and so hard to really see what's
00:51:51.680 | going on. In general, I would strongly recommend that you try to refactor your architectures
00:51:58.200 | so that by the time you write the final thing, it's half a page. You'll see plenty of examples
00:52:05.280 | of that, so hopefully that will be helpful.
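Here is roughly what those refactored helpers look like, written in the Keras 1-style functional API used in this course; the exact arguments may differ a little from the lesson notebook.

```python
from keras.layers import Convolution2D, BatchNormalization, Activation, merge

def conv_block(x, filters, size, stride=(1, 1), act=True):
    # convolution, then batch norm, then (optionally) a ReLU
    x = Convolution2D(filters, size, size, subsample=stride, border_mode='same')(x)
    x = BatchNormalization()(x)
    return Activation('relu')(x) if act else x

def res_block(ip, filters=64):
    x = conv_block(ip, filters, 3)
    x = conv_block(x, filters, 3, act=False)   # no activation on the second conv block
    return merge([x, ip], mode='sum')          # add the residual back onto the input
```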
00:52:09.160 | So we've increased the receptive field, we've done a bunch of computation, but we still
00:52:14.600 | haven't actually changed the size of the image, which is not very helpful. So the next thing
00:52:19.480 | we do is we're going to change the size of the image. And the first thing we're going
00:52:23.440 | to learn is to do that with something that goes by many names. One is deconvolution,
00:52:36.880 | another is transposed convolutions, and it's also known as fractionally strided convolutions.
00:52:54.920 | In Keras they call them deconvolutions. And the basic idea is something which I've actually
00:53:04.320 | got a spreadsheet to show you. The basic idea is that you've got some kind of image, so here's
00:53:16.040 | a 4x4 image, and you put it through a 3x3 filter, a convolutional filter, and if you're
00:53:25.560 | doing valid convolutions, that's going to leave you with a 2x2 output, because here's
00:53:34.040 | one 3x3, another 3x3, and four of them. So each one is grabbing the whole filter and
00:53:45.080 | the appropriate part of the data. So it's just a standard 2D convolution.
00:53:49.480 | So we've done that. Now let's say we want to undo that. We want something which can take
00:53:55.680 | this result and recreate this input. How would you do that? So one way to do that would be
00:54:05.120 | to take this result and put back that implicit padding. So let's surround it with all these
00:54:15.400 | zeros such that now if we use some convolutional filter, and we're going to put it through this
00:54:32.300 | entire matrix, a bunch of zeros with our result matrix in the middle, and then we can calculate
00:54:40.880 | our result in exactly the same way, just a normal convolutional filter.
00:54:45.560 | So if we now use gradient descent, we can look and see, what is the error? So how much
00:54:52.920 | does this pixel differ from this pixel? And how much does this pixel differ from this
00:55:01.000 | pixel? And then we add them all together to get our mean squared error. So we can now
00:55:07.000 | use gradient descent, which hopefully you remember from Part 1 in Excel is called solver.
00:55:14.520 | And we can say, set this cell to a minimum by changing these cells. So this is basically
00:55:23.920 | like the simplest possible optimization. Solve that, and here's what it's come up with.
00:55:33.520 | So it's come up with a convolutional filter. You'll see that the result is not exactly
00:55:41.000 | the same as the original data, and of course, how could it be? We don't have enough information,
00:55:46.240 | we only have 4 things to try and regenerate 16 things. But it's not terrible. And in general,
00:55:53.760 | this is the challenge with upscaling. When you've got something that's blurred and down-sampled,
00:56:01.200 | you've thrown away information. So the only way you can get information back is to guess
00:56:05.760 | what was there. But the important thing is that by using a convolution like this, we
00:56:12.040 | can learn those filters. So we can learn how to up-sample it in a way that gives us the
00:56:18.920 | loss that we want. So this is what a deconvolution is. It's just a convolution on a padded input.
00:56:28.320 | Now in this case, I've assumed that my convolutions had a unit stride. There was just 1 pixel
00:56:35.320 | between each convolution. If your convolutions are of stride 2, then it looks like this
00:56:43.440 | picture. And so you can see that as well as putting the 2 pixels around the outside, we've
00:56:50.040 | also put a 0 pixel in the middle. So these 4 cells are now our data cells, and you can
00:56:58.280 | then see it calculating the convolution through here. I strongly suggest looking at this link,
00:57:05.000 | which is where this picture comes from. And in turn, this link comes from a fantastic
00:57:11.920 | paper called A Guide to Convolution Arithmetic for Deep Learning, which is a really great paper. And so if you
00:57:17.800 | want to know more about both convolutions and deconvolutions, you can look at this page
00:57:23.120 | and it's got lots of beautiful animations, including animations on transposed convolutions.
00:57:31.200 | So you can see, this is the one I just showed you. So that's the one we just saw in Excel.
00:57:44.680 | So that's a really great site.
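To convince yourself of that picture numerically, here is a small sketch (mine, not the spreadsheet) that upsamples a 2x2 input the way the stride-2 animation shows: insert a zero pixel between the data cells, pad with zeros around the outside, then run an ordinary convolution.

```python
import numpy as np
from scipy.signal import convolve2d

data = np.arange(1, 5, dtype=float).reshape(2, 2)   # the small 2x2 "result" to be upsampled

upsampled = np.zeros((3, 3))
upsampled[::2, ::2] = data                        # a 0 pixel between each data cell (stride 2)
padded = np.pad(upsampled, 2, mode='constant')    # the implicit zero padding around the outside

filt = np.random.randn(3, 3)                      # in a real network this filter would be learned
out = convolve2d(padded, filt, mode='valid')
print(out.shape)                                  # (5, 5): bigger than the 2x2 we started with
```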
00:57:53.160 | So that's what we're going to do first, is we're going to do deconvolutions. So in Keras,
00:58:01.560 | a deconvolution is exactly the same as convolution, except with DE on the front. You've got all
00:58:06.920 | the same stuff. How many filters do you want? What's the size of your filter? What's your
00:58:11.840 | stride or subsample, as they call it? Border mode, so close.
00:58:17.040 | We have a question. If TensorFlow is the backend, shouldn't the batch normalization axis equals
00:58:25.040 | negative 1? And then there was a link to a GitHub conversation where Francois said that
00:58:41.480 | for Theano, axis is 1.
00:58:43.520 | No, it should be. And in fact, axis minus 1 is the default. So, yes. Thank you. Well
00:58:53.200 | spotted. Thank David Gutmann. He is also responsible for some of our beautiful pictures we saw
00:58:58.960 | earlier. So let's remove axis. That will make things look better. And go faster as well.
00:59:16.040 | So just in case you weren't clear on that, you might remember from part 1 that the reason
00:59:20.040 | we had that axis equals 1 is because in Theano that was the channel axis. So we basically
00:59:24.920 | wanted not to throw away the x-y information, but to batch norm across channels. In TensorFlow,
00:59:31.360 | channel is now the last axis. And since minus 1 is the default, we actually don't need that.
00:59:43.800 | So that's our deconvolution blocks. So we're using a stride of 2,2. So that means that
00:59:54.000 | each time we go through this deconvolution, it's going to be doubling the size of the
00:59:57.560 | image.
00:59:58.560 | For some reason I don't fully understand it and haven't really looked into it. In Keras,
01:00:03.000 | you actually have to tell it the shape of the output. So you can see here, you can actually
01:00:09.280 | see it's gone from 72x72 to 144x144 to 288x288. So because these are convolutional filters,
01:00:18.160 | it's learning to upscale. But it's not upscaling with just three channels, it's upscaling with
01:00:24.400 | 64 filters. So that's how it's able to do more sophisticated stuff.
01:00:31.280 | And then finally, we're kind of reversing things here. We have another 9x9 convolution in order
01:00:40.760 | to get back our three channels. So the idea is we previously had something with 64 channels,
01:00:48.600 | and so we now want to turn it into something with just three channels, the three colors,
01:00:52.680 | and to do that we want to use quite a bit of context. So we have a single 9x9 filter
01:00:58.040 | at the end to get our three channels out. So at the end we have a 288x288x3 tensor, in
01:01:05.720 | other words, an image.
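Putting those pieces together, here is a hedged sketch of the whole up-sampling network in Keras 1-style code, reusing the conv_block and res_block helpers sketched earlier; layer counts and shapes follow the description above, but the exact arguments may not match the notebook line for line.

```python
from keras.layers import Input, Deconvolution2D, BatchNormalization, Activation, Convolution2D
from keras.models import Model

def deconv_block(x, filters, size, shape, stride=(2, 2)):
    # fractionally strided convolution: doubles the spatial size each time
    x = Deconvolution2D(filters, size, size, output_shape=(None,) + shape,
                        subsample=stride, border_mode='same')(x)
    x = BatchNormalization()(x)
    return Activation('relu')(x)

inp = Input((72, 72, 3))
x = conv_block(inp, 64, 9)                     # single large 9x9 filter layer, stride 1
for _ in range(4):
    x = res_block(x)                           # the low-resolution computation
x = deconv_block(x, 64, 3, (144, 144, 64))     # 72 -> 144
x = deconv_block(x, 64, 3, (288, 288, 64))     # 144 -> 288
outp = Convolution2D(3, 9, 9, border_mode='same')(x)   # 9x9 conv back down to 3 channels
upsample_model = Model(inp, outp)              # spits out a 288x288x3 image
```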
01:01:08.680 | So if we go ahead now and train this, then it's going to do basically what we want, but
01:01:13.920 | the thing we're going to have to do is create our loss function. And creating our loss function
01:01:21.880 | is a little bit messy, but I'll take you through it slowly and hopefully it'll all make sense.
01:01:33.120 | So let's remember some of the symbols here. Input, inp, is the original low-resolution
01:01:42.340 | input tensor. And then the output of this is called outp, and so let's call this whole
01:01:50.300 | network here, let's call it the upsampling network. So this is the thing that's actually
01:01:55.480 | responsible for doing the upsampling network. So we're going to take the upsampling network
01:02:00.120 | and we're going to attach it to VGG. And VGG is going to be used only as a loss function
01:02:07.120 | to get the content lost.
01:02:11.360 | So before we can take this output and stick it into VGG, we need to stick it through our
01:02:17.200 | standard mean subtraction pre-processing. So this is just the same thing that we did
01:02:21.880 | over and over again in Part 1. So let's now define this output as being this lambda function
01:02:33.440 | applied to the output of our upsampling network. So that's what this is. This is just our pre-processed
01:02:42.080 | upsampling network output.
01:02:47.520 | So we can now create a VGG network, and let's go through every layer and make it not trainable.
01:02:57.400 | You can't ever make your loss function be trainable. The loss function is the fixed
01:03:02.800 | in stone thing that tells you how well you're doing. So clearly you have to make sure VGG
01:03:07.440 | is not trainable. Which bit of the VGG network do we want? We're going to try a few things.
01:03:15.320 | I'm using block2_conv2. So relatively early, and the reason for that is that if you remember
01:03:21.920 | when we did the content reconstruction last week, the very first thing we did, we found
01:03:27.640 | that you could basically totally reconstruct the original image from early layer activations,
01:03:34.280 | whereas by the time we got to block4 it was pretty horrendous. So we're going
01:03:40.900 | to use a somewhat early block as our content loss, or as the paper calls it, the perceptual
01:03:49.200 | loss. You can play around with this and see how it goes.
01:03:56.400 | So now we're going to create two versions of this VGG output. This is something which
01:04:04.120 | is I think very poorly understood or appreciated with the Keras functional API, which is
01:04:12.360 | any kind of layer, and a model is a layer as far as Keras is concerned, can be treated
01:04:18.080 | as if it was a function. So we can take this model and pretend it's a function, and we
01:04:24.760 | can pass it any tensor we like. And what that does is it creates a new model where those
01:04:31.160 | two pieces are joined together. So VGG2 is now equal to this model on the top and this
01:04:44.920 | model on the bottom. Remember this model was the result of our upsampling network followed
01:04:52.480 | by pre-processing.
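Roughly, the construction looks like this; a sketch rather than the notebook verbatim, reusing outp_p and preproc from the pre-processing sketch above: build VGG on a high-resolution input, freeze it, cut it at block2_conv2, and then call the resulting model like a function on two different tensors.

    from keras.applications.vgg16 import VGG16
    from keras.layers import Input, Lambda
    from keras.models import Model

    hr_shape = (288, 288, 3)

    vgg_in = Input(hr_shape)
    vgg = VGG16(include_top=False, input_tensor=vgg_in)
    for layer in vgg.layers:
        layer.trainable = False                  # the loss network must stay fixed

    # a model that maps an image to its block2_conv2 activations
    vgg_content = Model(vgg_in, vgg.get_layer('block2_conv2').output)

    vgg1 = vgg_content(outp_p)                   # activations of our upsampled output
    hres_inp = Input(hr_shape)                   # placeholder for the real high-res image
    vgg2 = vgg_content(Lambda(preproc)(hres_inp))  # same pre-processing for the target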
01:04:53.480 | In the upsampling network, is the lambda function to normalize the output image?
01:05:00.840 | Yeah, that's a good point. So we use a tanh activation which can go from -1 to 1. So if
01:05:09.640 | you then go that plus 1 times 127.5, that gives you something that's between 0 and 255, which
01:05:15.720 | is the range that we want. Interestingly, this was suggested in the original paper and
01:05:21.160 | supplementary materials. More recently, on Reddit I think it was, the author said that
01:05:26.720 | they tried it without the tanh activation and therefore without the final deprocessing and
01:05:34.800 | it worked just as well. You can try doing that. If you wanted to try it, you would just
01:05:39.440 | remove the activation and you would just remove this last thing entirely. But obviously if
01:05:46.640 | you do have a tanh, then you do need the deprocessing step.
01:05:51.520 | This is actually something I've been playing with with a lot of different models. Any time
01:05:55.760 | I have some particular range that I want, one way to enforce that is by having a tanh
01:06:02.240 | or sigmoid followed by something that turns that into the range you want. It's not just
01:06:07.080 | images. So we've got two versions of our VGG layer output. One which is based on the output
01:06:19.800 | of the upscaling network, and the other which is based on just an input. And this just an
01:06:28.560 | input is using the high-resolution shape as its input. So that makes sense because this
01:06:36.840 | VGG network is something that we're going to be using at the high-resolution scale. We're
01:06:43.760 | going to be taking the high-resolution target image and the high-resolution up-sampling
01:06:48.800 | result and comparing them. Now that we've done all that, we're nearly there. We've now got
01:06:58.840 | high-res perceptual activations and we've got the low-res up-sampled perceptual activations.
01:07:07.520 | We now just need to take the mean sum of squares between them, and here it is here. In Keras,
01:07:15.520 | anytime you put something into a network, it has to be a layer. So if you want to take
01:07:20.400 | just a plain old function and turn it into a layer, you just chuck it inside a capital
01:07:26.080 | L lambda. So our final model is going to take our low-res input and our high-res input as
01:07:39.120 | our two inputs and return this loss function as an output.
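A sketch of that loss-as-a-layer trick, assuming the tensors from the sketches above (inp is the low-res input, hres_inp the high-res input, vgg1 and vgg2 the two block2_conv2 activation tensors):

    from keras import backend as K
    from keras.layers import Lambda
    from keras.models import Model

    # mean squared difference over the spatial axes only, leaving one value per filter
    content_loss = Lambda(lambda x: K.mean(K.square(x[0] - x[1]), axis=[1, 2]))([vgg1, vgg2])

    m_final = Model([inp, hres_inp], content_loss)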
01:07:45.560 | One last trick. When you fit things in Keras, it assumes that you're trying to take some
01:07:52.920 | output and make it close to some target. In this case, our loss is the actual loss function
01:08:00.160 | we want. It's not that there's some target. We want to make it as low as possible. Since
01:08:05.480 | it's the sum of mean squared errors, it can't go beneath 0. So what we can do is we can basically
01:08:14.560 | trick Keras and say that our target for the loss is 0. And you can't just use the scalar
01:08:21.200 | 0, remember every time we have a target set of labels in Keras, you need 1 for every row.
01:08:29.160 | So we're going to create an array of zeros. That's just so that we can fit it into what
01:08:37.080 | Keras expects. And I kind of find that increasingly, as I start to move away from the well-trodden
01:08:47.440 | path of deep learning, more and more, particularly if you want to use Keras, you kind of have
01:08:52.720 | to do weird little hacks like this. So there's a weird little pattern. There's probably more
01:08:59.080 | elegant ways of doing this, but this works.
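Putting that together, a hedged sketch (arr_lr and arr_hr stand for the stacked low-res and high-res image arrays, which aren't shown here; nb_epoch is the Keras 1 spelling, epochs in Keras 2):

    import numpy as np

    m_final.compile(optimizer='adam', loss='mse')

    # Keras wants one target row per sample. block2_conv2 has 128 filters and the
    # loss keeps one value per filter, so the dummy target is zeros of shape (n, 128).
    targ = np.zeros((arr_lr.shape[0], 128))

    m_final.fit([arr_lr, arr_hr], targ, batch_size=8, nb_epoch=2)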
01:09:02.200 | So we've got our loss function that we're trying to get every row as close to 0 as possible.
01:09:07.720 | We have a question. If we're only using up to block2_conv2, could we pop off all the
01:09:14.960 | layers afterwards to save some computation? Sure. It wouldn't be a bad idea at all.
01:09:22.960 | So we compile it, we fit it. One thing you'll notice I've started doing is using this callback
01:09:37.080 | called TQDMNotebookCallback. tqdm is a really terrific library. Basically it does something
01:09:46.520 | very simple, which is to add a progress meter to your loops. You can use it in a console,
01:09:56.280 | as you can see. Basically where you've got a loop, you can wrap tqdm around it. That loop
01:10:03.520 | does just what it used to do, but it gets a progress bar. It even guesses how much time
01:10:09.440 | is left and so forth. You can also use it inside a Jupyter notebook and it creates a
01:10:15.920 | neat little graph that gradually goes up and shows you how much time is left and so forth.
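For example (tqdm and the separate keras-tqdm package are both pip-installable; these calls are illustrative rather than copied from the notebook, and reuse m_final, arr_lr, arr_hr, and targ from the sketch above):

    from tqdm import tqdm
    from keras_tqdm import TQDMNotebookCallback   # pip install keras-tqdm

    # console: wrap any iterable and you get a progress bar with an ETA
    total = 0
    for n in tqdm(range(1000000)):
        total += n

    # notebook: pass the callback into fit, with verbose=0 so the two don't both print
    m_final.fit([arr_lr, arr_hr], targ, batch_size=8, nb_epoch=2,
                verbose=0, callbacks=[TQDMNotebookCallback()])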
01:10:25.760 | So this is just a nice little trick. I use some learning rate annealing. After training
01:10:35.600 | it for a few epochs, we can try out the model.
01:10:40.720 | The model we're interested in is just the upsampling model. We're going to be feeding
01:10:44.760 | the upsampling model low-res inputs and getting out the high-res outputs. We don't actually
01:10:50.400 | care about the value of the loss. I'll now define a model which takes the low-res input
01:10:57.480 | and spits out this output, our high-res output. With that model, we can call .predict.
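Something along these lines (inp and outp are the upsampling network's input and output tensors from the earlier sketch; the index 10 is just an arbitrary example image):

    import matplotlib.pyplot as plt
    from keras.models import Model

    top_model = Model(inp, outp)                 # low-res image in, big image out

    p = top_model.predict(arr_lr[10:11])         # run one low-res image through it
    plt.imshow(arr_lr[10].astype('uint8'))       # the low-resolution original
    plt.imshow(p[0].astype('uint8'))             # the super-resolved version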
01:11:08.120 | Here is our original low-resolution mashed potato, and here is our high-resolution mashed
01:11:16.560 | potato. It's amazing what it's done. You can see in the original, the shadow of the leaf
01:11:25.080 | was very unclear, the bits in the mashed potato were just kind of big blobs. In this version
01:11:32.200 | we have clear shadows, hard edges, and so forth.
01:11:39.920 | Question. Can you explain the size of the target? It's the first dimension of the high-res
01:11:44.520 | times 128. Why?
01:11:57.600 | Obviously it's this number. This is basically the number of images that we have. Then it's
01:12:13.080 | 128 because that layer has 128 filters, so this ends up giving you the mean squared error
01:12:29.200 | of 128 filter losses.
01:12:38.520 | Question. Would popping the unused layers really save anything? Aren't you only getting
01:12:44.320 | the layers you want when you do the vgg.get_layer('block2_conv2')?
01:12:49.960 | I'm not sure. I can't quite think quickly enough. You can try it. It might not help.
01:13:00.640 | Intuitively, what features is this model learning? What it's learning is it's looking at 20,000
01:13:09.720 | images, very, very low-resolution images like this. When there's a kind of a soft gray bit
01:13:20.480 | next to a hard bit in certain situations, that's probably a shadow, and when there's
01:13:25.240 | a shadow, this is what a shadow looks like, for example. It's learning that when there's
01:13:31.520 | a curve, it isn't actually meant to look like a jagged edge, but it's actually meant
01:13:34.960 | to look like something smooth. It's really learning what the world looks like. Then when
01:13:43.880 | you take that world and blur it and make it small, what does it then look like? It's just
01:13:50.640 | like when you look at a picture like this, particularly if you blur your eyes and de-focus
01:13:57.780 | your eyes, you can often see what it originally looked like because your brain basically is
01:14:03.640 | doing the same thing. It's like when you read a really blurry text. You can still read it
01:14:08.560 | because your brain is thinking like it knows. That must have been an E, that must have been
01:14:17.880 | an E. So are you suggesting there is a similar universality on the other way around? You
01:14:24.760 | know when VGG is saying the first layer is learning a line and then a square and a nose
01:14:30.540 | or an eye? Are you saying the same thing is true in this case? Yeah. Yeah, absolutely.
01:14:38.120 | It has to be. There's no single right way to up-sample. There's an infinite number of ways you can
01:14:45.400 | up-sample. There's lost information. So in order to do it in a way that decreases this
01:14:50.480 | loss function, it actually has to figure out what's probably there based on the context.
01:14:58.240 | But don't you agree, just intuitively thinking about it, like your earlier example of
01:15:04.440 | the album pictures for your mom. Wouldn't it be a bit easier if we were
01:15:09.560 | just feeding it pictures of humans, because the interaction of the circle of
01:15:13.840 | the eye and the nose is going to be learned a lot better? In the most extreme versions of super
01:15:20.320 | resolution networks, where they take 8 by 8 images, you'll see that all of them pretty
01:15:25.440 | much use the same dataset, which is something called CelebA. CelebA is a dataset of
01:15:30.440 | pictures of celebrity faces. And all celebrity faces are pretty similar. And so they show
01:15:36.320 | these fantastic, and they are fantastic and amazing, results, but they take an 8 by 8,
01:15:41.360 | and it looks pretty close. And that's because they're taking advantage of this. In our case,
01:15:48.600 | we've got 20,000 images from 1,000 categories. It's not going to do nearly as well. If we
01:15:55.280 | wanted to do as well as the CelebA versions, we would need hundreds of millions of images
01:16:00.960 | of 1,000 categories.
01:16:02.760 | Yeah, it's just hard for me to imagine mashed potatoes and a face in the same category. That's
01:16:08.960 | my biggest thing here.
01:16:11.480 | The key thing to realize is there's nothing qualitatively different between what mashed
01:16:16.000 | potato looks like in one photo or another. So something can learn to recognize the unique
01:16:23.480 | features of mashed potatoes. And a big enough network, given enough examples, can learn
01:16:29.720 | not just mashed potatoes, but writing and other kinds of pictures and whatever.
01:16:35.680 | So for your examples, you're most likely to be doing stuff which is more domain-specific.
01:16:41.880 | And so you should use more domain-specific data taking advantage of exactly these transformations.
01:16:51.280 | One thing I mentioned here is I haven't used a test set, so another piece of the homework
01:17:08.280 | is to add in a test set and tell us, is this mashed potato overfit? Is this actually just
01:17:19.340 | matching the particular training set version of this mashed potato or not? And if it is
01:17:25.020 | overfitting, can you create something that doesn't overfit? So there's another piece
01:17:31.700 | of homework.
01:17:33.800 | So it's very simple now to take this and turn it into our fast style transfer. So the fast
01:17:43.280 | style transfer is going to do exactly the same thing, but rather than turning something
01:17:49.080 | low-res into something high-res, it's going to take something that's a photo and turn
01:17:54.840 | it into Van Gogh's irises.
01:18:01.740 | So we're going to do that in just the same way. Rather than go from low-res through a
01:18:09.200 | CNN to find the content loss against high-res, we're going to take a photo through a CNN
01:18:18.880 | and do both style loss and content loss against a single fixed style image.
01:18:27.320 | I've given you links here, so I have not implemented this for you, this is for you to implement,
01:18:32.160 | but I have given you links to the original paper, and very importantly also to the supplementary
01:18:36.560 | material, which is a little hard to find because there's two different versions, and only one
01:18:40.880 | of them is correct. And of course I don't tell you which one is correct.
01:18:46.080 | So the supplementary material goes through all of the exact details of what was their
01:18:53.900 | loss function, what was their processing, what was their exact architecture, and so
01:18:59.080 | on and so forth.
01:19:00.960 | So while I wait for that to load, a question: like we did the doodle regeneration using the model
01:19:10.460 | photographs' weights, could we create a regular image to see how you would look if you were
01:19:16.440 | a model?
01:19:17.440 | I don't know. If you could come up with a loss function, which is how much does somebody
01:19:27.480 | look like a model? You could. So you'd have to come up with a loss function. And it would
01:19:33.960 | have to be something where you can generate labeled data.
01:19:40.000 | One of the things they mentioned in the paper is that they found it very important to add
01:19:46.480 | quite a lot of padding, and specifically they didn't add zero padding, but they add reflection
01:19:53.400 | padding. So reflection padding literally means take the edge and reflect it to your padding.
01:20:01.880 | I've written that for you because there isn't one built into Keras, but you may find it interesting to look
01:20:06.600 | at this because this is one of the simplest examples of a custom layer. So we're going
01:20:12.320 | to be using custom layers more and more, and so I don't want you to be afraid of them.
01:20:17.200 | So a custom layer in Keras is a Python class. If you haven't done OO programming in Python,
01:20:27.320 | now's a good time to go and look at some tutorials because we're going to be doing quite a lot
01:20:31.240 | of it, particularly for PyTorch. PyTorch absolutely relies on it. So we're going to create a class.
01:20:37.400 | It has to inherit from layer. In Python, this is how you can create a constructor. Python's
01:20:44.820 | OO syntax is really gross. You have to use a special weird custom name thing, which happens
01:20:51.060 | to be the constructor. Every single damn thing inside a class, you have to manually type out
01:20:56.520 | self, comma, as the first parameter. If you forget, you'll get stupid errors. In the constructor
01:21:05.280 | for a layer, this is basically a way you just save away any of the information you were
01:21:09.920 | given. In this case, you've said that I want this much padding, so you just have to save
01:21:14.680 | that somewhere, and that's all the constructor does.
01:21:17.740 | And then you need to do two things in every Keras custom layer. One is you have to define
01:21:23.320 | something called get_output_shape_for. That is going to be passed the shape of an input,
01:21:31.800 | and you have to return what is the shape of the output that that would create. So in this
01:21:36.680 | case, if s is the shape of the input, then the output is going to be the same batch size
01:21:43.280 | and the same number of channels. And then we're going to add in twice the amount of padding
01:21:49.200 | for both the rows and columns. This is going to tell it, because remember one of the cool
01:21:54.240 | things about Keras is you just chuck the layers on top of each other, and it magically knows
01:21:59.640 | how big all the intermediate things are. It magically knows because every layer has this
01:22:04.200 | thing defined. That's how it works.
01:22:08.600 | The second thing you have to define is something called call. Call is the thing which will
01:22:13.000 | get your layer's input data, and you have to return whatever your layer does to it. In our case, we
01:22:20.820 | want it to add reflection padding. In this case, it happens that TensorFlow has
01:22:28.920 | something built in for that called tf.pad.
01:22:31.240 | Obviously, generally it's nice to create Keras layers that would work with both Theano and
01:22:38.560 | TensorFlow backends by using that capital K dot notation. But in this case, Theano didn't
01:22:45.200 | have anything obvious that did this easily, and since it was just for our class, I just
01:22:50.000 | decided just to make it TensorFlow.
01:22:53.520 | So here is a complete layer. I can now use that layer in a network definition like this.
01:23:01.680 | I can call .predict, which will take an input and turn it into an output; you can see that the
01:23:08.280 | bird now has the left and right sides here being reflected.
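The layer looks roughly like this (very close to the course notebook's version; note that in Keras 2 the shape method is renamed compute_output_shape and call drops the mask argument):

    import tensorflow as tf
    from keras.engine.topology import Layer

    class ReflectionPadding2D(Layer):
        def __init__(self, padding=(1, 1), **kwargs):
            self.padding = tuple(padding)        # just save away the amount of padding
            super(ReflectionPadding2D, self).__init__(**kwargs)

        def get_output_shape_for(self, s):
            # same batch size and channels; rows and cols grow by twice the padding
            return (s[0], s[1] + 2 * self.padding[0], s[2] + 2 * self.padding[1], s[3])

        def call(self, x, mask=None):
            w_pad, h_pad = self.padding
            # TensorFlow has reflection padding built in
            return tf.pad(x, [[0, 0], [h_pad, h_pad], [w_pad, w_pad], [0, 0]], 'REFLECT')

It then gets used like any other layer, for example ReflectionPadding2D((40, 40))(inp) at the start of the transformer network, as described below.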
01:23:15.960 | So that is there for you to use because in the supplementary material for the paper,
01:23:25.000 | they add spatial reflection padding at the beginning of the network. And they add a lot
01:23:30.240 | of it, 40x40. And the reason they add a lot is because they mention in the supplementary material
01:23:37.720 | that they don't want to use same convolutions, they want to use valid convolutions in their
01:23:46.920 | computation because if you add any black borders during those computation steps, it creates
01:23:53.040 | weird artifacts on the edges of the images.
01:23:56.200 | So you'll see that through this computation of all their residual blocks, the size gets
01:24:01.240 | smaller by 4 each time. And that's because these are valid convolutions. So that's why
01:24:06.880 | they have to add padding to the start so that these steps don't cause the image to become
01:24:13.640 | too small.
01:24:16.240 | So this section here should look very familiar because it's the same as our upsampling
01:24:22.680 | network. A bunch of residual blocks, two deconvolutions, and one 9x9 convolution. So this is identical.
01:24:32.720 | So you can copy it. This is the new bit. We've already talked about why we have this 9x9
01:24:44.120 | conv. But why do we have these downsampling convolutions to start with? We start with
01:24:50.180 | an image up here of 336x336, and we halve its size, and then we halve its size again.
01:24:58.940 | Why do we do that? The reason we do that is that, as I mentioned earlier, we want to do
01:25:07.360 | our computation at a lower resolution because it allows us to have a larger receptive field
01:25:14.760 | and it allows us to do less computation. So this pattern, where it's reflective, the last
01:25:25.040 | thing is the same as the top thing, the second last thing is the same as the second thing.
01:25:29.880 | You can see it's like reflection symmetric. It's really, really common in generative models.
01:25:35.480 | The idea is first of all to take your object, down-sample it, increasing the number of channels at the
01:25:41.240 | same time. So increasing the receptive field, you're creating more and more complex representations.
01:25:47.040 | You then do a bunch of computation on those representations and then at the end you up-sample
01:25:52.080 | again. So you're going to see this pattern all the time. So that's why I wanted you guys
01:25:56.840 | to implement this yourself this week.
01:26:02.080 | So there's that, the last major piece of your homework.
01:26:13.080 | That's exactly the same as a deconvolution with stride 2. So remember I mentioned earlier that another
01:26:20.600 | name for deconvolution is fractionally strided convolution. So you can remember that little
01:26:29.080 | picture we saw, this idea that you put little columns and rows of zeros in between each
01:26:34.960 | row and column. So you kind of think of it as doing like a half-stride at a time.
01:26:45.200 | So that's why this is exactly what we already have. I don't think you need to change it at
01:26:50.760 | all except you'll need to change my same convolutions to valid convolutions. But this is well worth
01:27:00.000 | reading the whole supplementary material because it really has the details. It's so great when
01:27:07.640 | a paper has supplementary material like this, you'll often find the majority of papers don't
01:27:13.060 | actually tell you the details of how to do what they did, and many don't even have code.
01:27:19.560 | These guys both have code and supplementary material, which makes this absolute A+ paper.
01:27:25.640 | Plus it works great.
01:27:30.600 | So that is super-resolution, perceptual losses, and so on and so forth. So I'm glad we got
01:27:40.240 | there. Let's make sure I don't have any more slides. There's one other thing I'm going to
01:27:53.000 | show you, which is these deconvolutions can create some very ugly artifacts. I can show
01:28:02.720 | you some very ugly artifacts because I have some right here. You see it on the screen
01:28:10.040 | Rachel, this checkerboard? This is called a checkerboard pattern. The checkerboard pattern
01:28:20.480 | happens for a very specific reason. I've provided a link to this paper. It's an online paper.
01:28:29.560 | You guys might remember Chris Olah. He had a lot of the best learning materials we looked
01:28:35.760 | at in Part 1. He's now got this cool thing called distill.pub, done with some of his
01:28:40.280 | colleagues at Google. He wrote this thing, discovering why is it that everybody gets
01:28:47.980 | these goddamn checkerboard patterns. What he shows is that it happens because you have
01:28:57.160 | stride 2 size 3 convolutions, which means that every pair of convolutions sees one pixel
01:29:04.700 | twice. So it's like a checkerboard is just a natural thing that's going to come out.
01:29:11.720 | So they talk about this in some detail and all the kind of things you can do. But in
01:29:17.320 | the end, they point out two things. The first is that you can avoid this by making it that
01:29:26.920 | your stride divides nicely into your size. So if I change size to 4, they're gone. So
01:29:39.920 | one thing you could try if you're getting checkerboard patterns, which you will, is
01:29:44.840 | make your size 3 convolutions into size 4 convolutions.
01:29:49.960 | The second thing that he suggests doing is not to use deconvolutions. Instead of using
01:29:56.760 | a deconvolution, he suggests first of all doing an upsampling. What happens when you
01:30:01.960 | do an upsampling is it's basically the opposite of max pooling. You take every pixel and you
01:30:08.440 | turn it into a 2x2 grid of that exact pixel. That's called upsampling. If you do an upsampling
01:30:17.720 | followed by a regular convolution, that also gets rid of the checkerboard pattern.
01:30:23.400 | And as it happens, Keras has something to do that, which is called UpSampling2D. So all
01:30:40.240 | this does is the opposite of max pooling. It's going to double the size of your image, at
01:30:44.800 | which point you can use a standard stride-1 convolution and avoid the artifacts.
01:30:51.680 | So extra credit after you get your network working is to change it to an upsampling and
01:30:58.280 | unit-strided convolution network and see if the checkerboard artifacts go away.
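So the swap would look something like this (a sketch, assuming the same Keras 1-era layer names as before):

    from keras.layers import UpSampling2D, Convolution2D, BatchNormalization, Activation

    def up_block(x, filters=64, size=3):
        x = UpSampling2D()(x)                    # repeat each pixel into a 2x2 block
        x = Convolution2D(filters, size, size, border_mode='same')(x)   # stride 1 conv
        x = BatchNormalization()(x)
        return Activation('relu')(x)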
01:31:07.480 | So that is that. At the very end here I've got some suggestions for some more things
01:31:14.960 | that you can look at. Let's move on. I want to talk about going big. Going big can mean
01:31:36.600 | two things. Of course it does mean we get to say big data, which is important. I'm very
01:31:46.760 | proud that even during the big data thing I never said big data without saying rude
01:31:53.360 | things about the stupid idea of big data. Who cares about how big it is? But in deep
01:31:58.560 | learning, sometimes we do need to use either large objects, like if you're doing diabetic
01:32:05.120 | retinopathy you all have like 4,000 by 4,000 pictures of eyeballs, or maybe you've got
01:32:10.360 | lots of objects, like if you're working with ImageNet. To handle this data that doesn't
01:32:19.920 | fit in RAM, we need some tricks. I thought we would try some interesting project that
01:32:26.400 | involves looking at the whole ImageNet competition dataset. The ImageNet competition dataset
01:32:32.040 | is about 1.5 million images in 1,000 categories. As I mentioned in the last class, if you try
01:32:41.800 | to download it, it will give you a little form saying you have to use it for research
01:32:45.560 | purposes and that they're going to check it and blah blah blah. In practice, if you fill
01:32:49.520 | out the form you'll get back an answer seconds later. So anybody who's got a terabyte of space,
01:32:56.760 | and since you're building your own boxes, you now have a terabyte of space, you can
01:33:00.680 | go ahead and download ImageNet, and then you can start working through this project.
01:33:06.740 | This project is about implementing a paper called DeVISE. And DeVISE is a really
01:33:15.320 | interesting paper. I actually just chatted to the author about it quite recently. An amazing
01:33:23.920 | lady named Andrea Frome, who's now at Clarifai, which is a computer vision start-up. What she
01:33:30.560 | did with DeVISE was she created a really interesting multimodal architecture. So multimodal means
01:33:38.160 | that we're going to be combining different types of objects. In her case, she was combining
01:33:45.360 | language with images. It's quite an early paper to look at this idea. She did something
01:33:51.800 | which was really interesting. She said normally when we do an ImageNet network, our final
01:34:00.280 | layer is a one-hot encoding of a category. So that means that a pug and a golden retriever
01:34:10.280 | are no more similar or different in terms of that encoding than a pug and a jumbo jet.
01:34:17.960 | And that seems kind of weird. If you had an encoding where similar things were similar
01:34:24.840 | in the encoding, you could do some pretty cool stuff. In particular, one of the key
01:34:31.000 | things she was trying to do is to create something which went beyond the 1000 ImageNet categories
01:34:37.160 | so that you could work with types of images that were not in ImageNet at all. So the way
01:34:44.440 | she did that was to say, alright, let's throw away the one-hot encoded category and let's
01:34:50.240 | replace it with a word embedding of the thing. So pug is no longer 0 0 0 1 0 0 0, but it's
01:34:59.840 | now the word2vec vector of pug. And that's it. That's the entirety of the thing.
01:35:06.880 | Train that and see what happens. I'll provide a link to the paper. One of the things I love
01:35:12.720 | about the paper is that what she does is to show quite an interesting range of the kinds
01:35:20.240 | of cool results and cool things you can do when you replace a one-hot encoded output with
01:35:26.400 | a vector output embedding.
01:35:34.640 | Let's say this is an image of a pug. It's a type of dog. So pug gets turned into, let's
01:36:03.320 | say pug is the 300th class in ImageNet. It's going to get turned into a 1000-long vector
01:36:13.560 | with a thousand zeros and a 1 in position 300. That's normally what we use as our target
01:36:23.040 | when we're doing image classification. We're going to throw that 1000-long thing away and
01:36:29.320 | replace it with a 300-long thing. The 300-long thing will be the word vector for pug that
01:36:38.400 | we downloaded from Word2Vec. Normally our input image comes in, it goes through some
01:36:54.800 | kind of computation in our CNN and it has to predict something. Normally the thing it
01:37:04.120 | has to predict is a whole bunch of zeros and a 1 here. So the way we do that is that the
01:37:10.760 | last layer is a softmax layer which encourages one of the things to be much higher than the
01:37:19.400 | others.
01:37:20.400 | So what we do is we throw that away and we replace it with the word vector for that thing,
01:37:32.160 | fox or pug or jumbo-jet. Since the word vector, so generally that might be 300 dimensions,
01:37:40.800 | and that's dense, that's not lots of zeros, so we can't use a softmax layer at the end
01:37:46.280 | anymore. We probably now just use a regular linear layer.
01:37:54.320 | So the hard part about doing this really is processing image data. There's nothing weird
01:38:03.080 | or interesting or tricky about the architecture. All we do is replace the last layer. So we're
01:38:08.280 | going to leverage big holes quite a lot. So we start off by inputting our usual stuff
01:38:14.760 | and don't forget with TensorFlow to call this limit_mem thing I created so that you don't
01:38:18.920 | use up all of your memory. One thing which can be very helpful is to define actually
01:38:26.320 | two parts. Once you've got your own box, you've got a bunch of spinning hard disks that are
01:38:33.560 | big and slow and cheap, and maybe a couple of fast, expensive, small SSDs or NVMe drives.
01:38:44.680 | So I generally think it's a good idea to define a path for both. This actually happens to
01:38:51.920 | be a mount point that has my big, slow, cheap spinning disks, and this path happens to live
01:39:00.640 | somewhere, which is my fast SSDs. And that way, when I'm doing my code, any time I've
01:39:08.760 | got something I'm going to be accessing a lot, particularly if it's in a random order,
01:39:13.000 | I want to make sure that that thing, as long as it's not too big, sits in this path, and
01:39:18.880 | anytime I'm accessing something which I'm accessing generally sequentially, which is
01:39:23.880 | really big, I can put it in this path. This is one of the good reasons, another good reason
01:39:28.960 | to have your own box is that you get this kind of flexibility.
01:39:35.920 | So the first thing we need is some word vectors. The paper builds their own Wikipedia word
01:39:49.120 | vectors. I actually think that the Word2vec vectors you can download from Google are maybe
01:39:56.800 | a better choice here, so I've just gone ahead and shown how you can load that in. One of
01:40:03.680 | the very nice things about Google's Word2vec word vectors is that in part 1, when we used
01:40:13.920 | word vectors, we tended to use GloVe. GloVe would not have a word vector for GoldenRetriever,
01:40:21.480 | they would have a word vector for Golden. They don't have phrase things, whereas Google's
01:40:28.640 | word vectors have phrases like GoldenRetriever. So for our thing, we really need to use Google's
01:40:36.920 | Word2vec vectors, plus anything like that which has multi-part concepts as things that
01:40:43.720 | we can look at.
01:40:46.040 | So you can download Word2vec, I will make them available on our platform.ai site because
01:40:55.240 | the only way to get them otherwise is to get them from the author's Google Drive directory,
01:41:01.120 | and trying to get to a Google Drive directory from Linux is an absolute nightmare. So I
01:41:05.760 | will save them for you so that you don't have to get it.
01:41:12.520 | So once you've got them, you can load them in, and they're in a weird proprietary binary.
01:41:20.040 | If you're going to share data, why put it in a weird proprietary binary format in a Google
01:41:25.160 | Drive thing that you can't access from Linux? Anyway, this guy did, so I then save it as
01:41:30.560 | text to make it a bit easier to work with.
01:41:34.880 | The word vectors themselves are in a very simple format, they're just the word followed
01:41:39.840 | by a space, followed by the vector, space separated. I'm going to save them in a simple dictionary
01:41:51.160 | format, so what I'm going to share with you guys will be the dictionary. So it's a dictionary
01:41:55.360 | from a word or a phrase to a NumPy array.
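Once it's in that text format, a few lines of plain Python are enough to build the dictionary (the filenames here are hypothetical):

    import pickle
    import numpy as np

    def load_vectors(fname):
        """Parse the simple format: word, a space, then the space-separated vector."""
        w2v = {}
        with open(fname) as f:
            for line in f:
                parts = line.rstrip().split(' ')
                w2v[parts[0]] = np.array(parts[1:], dtype=np.float32)
        return w2v

    w2v = load_vectors('googlenews_w2v.txt')    # hypothetical filename
    pickle.dump(w2v, open('w2v.pkl', 'wb'))     # what gets shared: a plain dictionary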
01:42:03.920 | I'm not sure I've used this idea of zip-star before, so I should talk about this a little
01:42:07.840 | bit. So if I've got a dictionary which maps from word to vector, how do I get out of that
01:42:17.600 | a list of the words and a list of the vectors? The short answer is like this. But let's think
01:42:26.420 | about what that's doing.
01:42:27.420 | So I don't know, we've used zip quite a bit. So normally with zip, you go like zip, list1,
01:42:33.840 | list2, whatever. And what that returns is an iterator which first of all gives you element1
01:42:43.800 | of list1, element1 of list2, element1 of list3, and then element2 of list1, so forth. That's
01:42:49.720 | what zip normally does.
01:42:53.840 | There's a nice idea in Python that you can put a star before any argument. And if that
01:43:02.800 | argument is an iterator, something that you can go through, it acts as if you had taken
01:43:09.680 | that whole list and actually put it inside those brackets.
01:43:16.160 | So let's say that the w2v list contained, like, fox: array, pug: array, and so
01:43:34.000 | forth. When you go zip-star that, it's the same as actually taking the contents of that
01:43:41.840 | list and putting them inside there.
01:43:49.780 | You would want star-star if it was a dictionary, and star for a list?
01:43:54.520 | Not quite. Star just means you're treating it as an iterator. In this case we are using
01:44:05.840 | a list, so let's talk about star-star another time. But you're right, in this case we have
01:44:13.920 | a list which is actually just in this form: (fox, array), (pug, array), and then
01:44:24.160 | lots more.
01:44:26.160 | So what this is going to do is when we zip this, it's going to basically take all of
01:44:35.160 | these things and create one list of the words and one list of the arrays. So this idea of zip-star is something we're
01:44:52.360 | going to use quite a lot. Honestly I don't normally think about what it's doing, I just
01:44:58.720 | know that any time I've got like a list of tuples and I want to turn it into a tuple
01:45:06.160 | of lists, you just do zip-star. So that's all that is, it's just a little Python thing.
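A tiny concrete example of the zip-star idiom:

    import numpy as np

    w2v_list = [('fox', np.array([0.1, 0.2])),
                ('pug', np.array([0.3, 0.4]))]

    words, vectors = zip(*w2v_list)   # ('fox', 'pug') and a tuple of the two arrays
    w2v = dict(zip(words, vectors))   # and zip puts them straight back into a dict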
01:45:14.520 | So this gives us a list of words and a list of vectors. So any time I start looking at
01:45:20.040 | some new data, I always want to test it, and so I wanted to make sure this worked the way
01:45:26.280 | I thought it ought to work. So one thing I thought was, okay, let's look at the correlation
01:45:30.600 | coefficient between small j Jeremy and big j Jeremy, and indeed there is some correlation
01:45:36.200 | which you would expect, or else the correlation between Jeremy and banana, I hate bananas,
01:45:43.220 | so I was hoping this would be massively negative. Unfortunately it's not, but it is at least
01:45:47.240 | lower than the correlation between Jeremy and big Jeremy.
01:45:50.200 | So it's not always easy to exactly test data, but try and come up with things that ought
01:45:56.800 | to be true and make sure they are true, so in this case this has given me some comfort
01:46:01.600 | that these word vectors behave the way I expect them to. Now I don't really care about capitalization,
01:46:09.160 | so I'll just go ahead and create a lower-cased word2vec dictionary, where I just do the lower-cased
01:46:15.040 | version of everything. One trick here is I go through in reverse, because word2vec is
01:46:23.920 | ordered where the most common words are first, so by going in reverse it means if there is
01:46:29.800 | both a capital J Jeremy and a small j Jeremy, the one that's going to end up in my dictionary
01:46:34.620 | will be the more common one. So what I want for DeVISE is to now get this word vector
01:46:47.180 | for each one of our 1000 categories in ImageNet.
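Backing up a step, the lower-casing trick just mentioned is a one-liner; a tiny sketch with made-up vectors:

    import numpy as np

    # the word2vec file lists more common words first
    w2v_list = [('Jeremy', np.array([1.0, 0.0])), ('jeremy', np.array([0.0, 1.0]))]

    # iterating in reverse means that on a clash the earlier (more common) entry wins
    lc_w2v = {w.lower(): v for w, v in reversed(w2v_list)}
    assert (lc_w2v['jeremy'] == np.array([1.0, 0.0])).all()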
01:46:53.640 | And then I'm going to go even further than that, because I want to go beyond ImageNet.
01:46:58.880 | So I actually went and downloaded the original WordNet categories, and I filtered it down
01:47:05.440 | to find all the nouns, and I discovered that there are actually 82,000 nouns in WordNet,
01:47:11.400 | which is quite a few, it's quite fun looking through them.
01:47:14.920 | So I'm going to create a map of word vectors for every ImageNet category, one set, and
01:47:21.520 | every WordNet noun, another set. And so my goal in this project will be to try and create
01:47:27.080 | something that can do useful things with the full set of WordNet nouns. We're going to
01:47:32.320 | go beyond ImageNet. We've already got the 1000 ImageNet categories, we've used that
01:47:36.880 | many times before, so I'll grab those, load them in, and then I'll do the same thing for
01:47:47.480 | the full set of WordNet IDs, which I will share with you.
01:47:54.840 | And so now I can go ahead and create a dictionary which goes through every one of my ImageNet
01:48:05.120 | 1000 categories and converts it into the word vector. Notice I have a filter here, and that's
01:48:14.400 | because some of the ImageNet categories won't be in Word2vec, and that's because sometimes
01:48:24.280 | the ImageNet categories will say things like "pug (dog)". They won't be exactly in the
01:48:31.040 | same format. If you wanted to, you could probably get a better match than this, but I found
01:48:36.720 | even with this simple approach I managed to match 51,600 out of the 82,000 WordNet nouns,
01:48:44.960 | which I thought was pretty good.
01:48:48.800 | So what I did then was I created a list of all of the categories which didn't match,
01:48:55.880 | and this commented out bit, as you can see, is something which literally just moved those
01:49:00.800 | folders out of the way so that they're not in my ImageNet path anymore.
01:49:08.800 | So the details aren't very important, but hopefully you can see at the end of this process
01:49:13.240 | I've got something that maps every ImageNet category to a word vector, at least if I could
01:49:19.040 | find it, and every WordNet noun to a vector, at least if I could find it, and that I've
01:49:25.520 | modified my ImageNet data by moving the folders for the categories I couldn't find out of
01:49:35.520 | the way. Nothing particularly interesting there. And that's all quick because WordNet's not that big;
01:49:46.520 | it fits in RAM, so that's pretty straightforward. The images are a bit harder because we've
01:49:50.520 | got a million or so images. So we're going to try everything we can to make this run
01:49:57.840 | as quickly as possible. To start with, even the very process of getting
01:50:10.120 | a list of the file names of everything in ImageNet takes a non-trivial amount of time.
01:50:16.120 | So for everything that takes a non-trivial amount of time, I'm going to save its output. So the
01:50:20.960 | first thing I do is I use glob, I can't remember if we used glob in Part 1, I think we did,
01:50:27.480 | it's just a thing that's like ls *.*. So we use glob to grab all of the ImageNet
01:50:36.280 | training set, and then I just go ahead and pickle.dump that. For various reasons we'll
01:50:48.640 | see shortly, it's actually a very good idea though at this point to randomize that list
01:50:53.600 | of file names, put them in a random order. The basic idea is later on if we use chunks
01:51:02.040 | of file names that are next to each other, they're not all going to be the same type
01:51:06.920 | of thing. So by randomizing the file names now it's going to save us a bit of time, so
01:51:12.080 | then I can go ahead and save that randomized list. I've given it a different name, so I
01:51:17.040 | can always come back to the original.
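Roughly (the paths are placeholders for wherever your ImageNet training set and fast SSD live):

    import glob
    import pickle
    import random

    path = '/data/imagenet/train/'          # big, slow spinning disk (hypothetical path)
    dpath = '/ssd/imagenet/'                # fast SSD (hypothetical path)

    fnames = list(glob.iglob(path + '*/*.JPEG'))
    pickle.dump(fnames, open(dpath + 'fnames.pkl', 'wb'))

    random.shuffle(fnames)                  # randomize once, up front
    pickle.dump(fnames, open(dpath + 'fnames_r.pkl', 'wb'))   # keep both orderings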
01:51:20.000 | So I want to resize all of my images to a constant size. I'm being a bit lazy here,
01:51:25.480 | I'm going to resize them to 224x224, that's the input size for a lot of models obviously,
01:51:32.640 | including the one that we're going to use. That would probably be better if we resize
01:51:39.000 | to something bigger and then we randomly zoom and crop. Maybe if we have time we'll try
01:51:47.240 | that later, but for now we're just going to resize everything to 224x224.
01:51:56.000 | So we have nearly a million images it turns out to resize to 224x224. That could be pretty
01:52:03.640 | slow. So I've got some handy tricks to make it much faster. Generally speaking there are
01:52:12.880 | three ways to make an algorithm significantly faster.
01:52:22.480 | The first is memory locality, the second is SIMD, also known as vectorization, and the third
01:52:51.800 | is parallel processing. Rachel is very familiar with these because she's currently creating
01:53:01.920 | a course for the master's students here on numerical linear algebra, which is very heavily about
01:53:08.480 | these things.
01:53:09.480 | So these are the three ways you can make data processing faster. Memory locality simply
01:53:15.160 | means in your computer you have lots of different kinds of memory. For example, level 1 cache,
01:53:24.160 | level 2 cache, RAM, solid state disk, regular hard drives, whatever. The difference in speed
01:53:38.960 | as you go up from one to the other is generally like 10 times or 100 times or 1000 times slower.
01:53:47.360 | You really, really, really don't want to go to the next level of the memory hierarchy
01:53:51.120 | if you can avoid it. Unfortunately level 1 cache might be more like 16k, level 2 cache
01:53:58.840 | might be a few meg, RAM is going to be a few gig, solid state drives is probably going
01:54:05.600 | to be a few hundreds of gig, and your hard drives are probably going to be a few terabytes.
01:54:10.640 | So in reality you've got to be careful about how you manage these things. You want to try
01:54:15.880 | and make sure that you're putting stuff in the right place, that you're not filling up
01:54:21.600 | the resources unnecessarily, and that if you're going to use a piece of data multiple times,
01:54:29.360 | try to use it each time, immediately use it again so that it's already in your cache.
01:54:38.000 | The second thing, which is what we're about to look at, is SIMD, which stands for Single
01:54:43.140 | Instruction Multiple Data. Something that a shockingly large number of people even who
01:54:49.800 | claim to be professional computer programmers don't know is that every modern CPU is capable
01:54:57.840 | of, in a single operation, in a single thread, calculating multiple things at the same time.
01:55:06.200 | And the way that it does it is that you basically create a little vector, generally about 8 things,
01:55:13.200 | and you put all the things you want to calculate. Let's say you want to take the square root
01:55:16.200 | of something. You put 8 things into this little vector, and then you call a particular CPU
01:55:23.920 | instruction which takes the square root of 8 floating point numbers that is in this register.
01:55:32.360 | And it does it in a single clock cycle. So when we say clock cycle, your CPU might be
01:55:38.000 | 2 or 3 GHz, so it's doing 2 or 3 billion things per second. Well it's not, it's doing 2 or
01:55:45.940 | 3 billion times 8 things per second if you're using SIMD.
01:55:52.680 | Because so few people are aware of SIMD, and because a lot of programming environments
01:55:57.040 | don't make it easy to use SIMD, a lot of stuff is not written to take advantage of SIMD,
01:56:03.000 | including, for example, pretty much all of the image processing in Python.
01:56:08.600 | However, you can do this. You can go pip install pillow-simd, and that will replace your Pillow,
01:56:18.880 | and remember pillow is like the main Python imaging library, with a new version that does
01:56:23.800 | use SIMD for at least some of its things.
01:56:29.440 | Because SIMD only works on certain CPUs, any vaguely recent CPU works, but because it's
01:56:36.560 | only some, you have to add some special directives to the compiler to tell it, I have this kind
01:56:43.040 | of CPU, so please do use these kinds of instructions. And what pillow SIMD does, it actually literally
01:56:50.360 | replaces your existing Pillow, so that's why you have to say force reinstall, because it's
01:56:56.000 | going to be like, oh, you already have a Pillow, but this is like, no, I want the Pillow with SIMD.
01:57:01.160 | So if you try this, you'll find that the speed of your resize literally goes up by 600%,
01:57:08.280 | you don't have to change any code. I'm a huge fan of SIMD in general. It's one of the reasons
01:57:16.960 | I'm not particularly fond of Python, because it doesn't make it all easy to use SIMD, but
01:57:24.600 | luckily some people have written stuff in C which does use SIMD and then provided these
01:57:29.840 | Python interfaces.
01:57:31.440 | Okay, so this is something to remember to try to get working when you go home. Before
01:57:37.260 | you do it, write a little benchmark that resizes 1000 images and times it, and then run this
01:57:43.840 | command and make sure that it gets 600% faster, that way you know it's actually working.
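A minimal benchmark along those lines (fnames is the shuffled list of ImageNet files from earlier; run it before and after installing pillow-simd):

    import time
    from PIL import Image

    def bench(fnames, n=1000):
        """Time how long opening and resizing n images takes."""
        start = time.time()
        for f in fnames[:n]:
            Image.open(f).convert('RGB').resize((224, 224), Image.BILINEAR)
        return time.time() - start

    print(bench(fnames))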
01:57:51.200 | We have two questions, I don't know if you want to finish the three ways to do things
01:57:55.760 | faster first. One is, how could you get the relation between a pug and a dog and the photo
01:58:04.600 | of a pug and its relation to the bigger category of dog?
01:58:09.840 | Yes, sure, we'll think about that.
01:58:14.560 | Okay, now there, why do we want to randomize the file names, can't we use shuffle equals
01:58:21.920 | true on the Keras flow_from_directory?
01:58:26.960 | The short answer is kind of to do with locality. If you say shuffle equals true, you're jumping
01:58:32.480 | from here on the hard disk to here on the hard disk to here on the hard disk, and hard
01:58:36.640 | disks hate that. Remember there's a spinning disk with a little needle, and the thing's
01:58:41.840 | moving all over the place. So you want to be getting things that are all in a row. That's
01:58:46.000 | basically the reason. As you'll see, this is going to basically work for the concept of
01:58:53.760 | dog versus pug, because the word vector for dog is very similar to word vector for pug,
01:59:00.080 | so at the end we'll try it. We'll see if we can find dogs and see if it works. I'm sure
01:59:07.360 | it will.
01:59:08.360 | Finally, parallel processing refers to the fact that any modern CPU has multiple cores,
01:59:18.720 | which literally means multiple CPUs in your CPU. Often boxes that you buy for home might
01:59:27.320 | even have multiple CPUs in them. Again, Python is not great for parallel processing. Python
01:59:35.120 | 3 is certainly a lot better. But a lot of stuff in Python doesn't use parallel processing
01:59:39.800 | very effectively. But a lot of modern CPUs have 10 cores or more, even consumer CPUs.
01:59:49.320 | So if you're not using parallel processing, you're missing out on a 10x speedup. If you're
01:59:54.640 | not using SIMD, you're missing out on a 6-8x speedup. So if you can do both of these things,
02:00:02.280 | you can get 50x plus. I mean you will, you'll get a 50x+ speedup, assuming your CPU has enough cores.
02:00:10.080 | So we're going to do both. To get SIMD, we're just going to install it. To get parallel
02:00:14.560 | processing, we're probably not going to see all of it today, but we're going to be using
02:00:18.160 | parallel processing.
02:00:21.720 | I define a few things to do my resizing. One thing is I've actually recently changed how
02:00:31.680 | I do resizing. As I'm sure you guys have noticed, in the past when I've resized things to square,
02:00:36.760 | I've tended to add a black border to the bottom or a black border to the right, because that's
02:00:42.680 | what Keras did. Now that I've looked into it, none of the best-practice papers or top results
02:00:51.240 | do it that way.
02:00:52.600 | And it makes perfect sense because the CNN is going to have to learn to deal with the black
02:00:57.600 | border. You're throwing away all that information. What pretty much all the best-practice approaches do
02:01:04.760 | is, rather than rescale the longest side to be the size of your square and then fill
02:01:12.240 | it in with black, instead take the smallest side and make that the size of your square.
02:01:18.960 | The other side's now too big, so just chop off the top and bottom, or chop off the left
02:01:23.680 | and right. That's called center cropping. So: resizing and center cropping. What I've
02:01:31.080 | done here is I've got something which resizes to the size of the shortest side, and then
02:01:40.760 | over here I've got something which does the center cropping. You can look at the details
02:01:51.040 | when you get home if you like, it's not particularly exciting, so I've got something that does
02:01:55.600 | the resizing. This is something you can improve. Currently I'm making sure that it's a three
02:02:02.080 | channel image, so I'm not doing a black and white or something with an alpha channel.
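A sketch of that resize-then-center-crop logic using PIL (not the notebook's exact code):

    from PIL import Image

    def center_crop_resize(fname, size=224):
        """Resize the shortest side to `size`, then crop the centre of the longer side."""
        im = Image.open(fname).convert('RGB')           # force 3 channels
        w, h = im.size
        scale = size / float(min(w, h))
        im = im.resize((int(w * scale + 0.5), int(h * scale + 0.5)), Image.BILINEAR)
        w, h = im.size
        left, top = (w - size) // 2, (h - size) // 2
        return im.crop((left, top, left + size, top + size))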
02:02:12.600 | So before I finish up, next time when we start is we're going to learn about parallel processing.
02:02:22.560 | So anybody who's interested in pre-reading, feel free to start reading and playing around
02:02:28.040 | with Python parallel processing. Thanks everybody, see you next week. I hope your assignments
02:02:34.320 | go really well, and let me know if I can help you out in the forum.
02:02:36.800 | [applause]