Welcome to Lesson 13, where we're going to be talking about image enhancement. Image enhancement would cover things like this painting that you might be familiar with. However, you might not have noticed before that this painting actually has a picture of an eagle in it. The reason you may not have noticed that before is that this painting actually didn't use to have an eagle in it.
By the same token, actually, on that first page, this painting did not use to have Captain America's shield on it either. This painting did not use to have a clock in it either. This is a cool new paper that just came out a couple of days ago called "Deep Painterly Harmonization." It uses almost exactly the technique we're going to learn in this lesson with some minor tweaks.
But you can see the basic idea is to take one picture, paste it on top of another picture, and then use some kind of approach to combine the two. And the basic approach is something called style transfer. Before we talk about that, though, I wanted to mention this really cool contribution by William Horton, who added this stochastic weight averaging technique to the FastAI library that is now all merged and ready to go.
And he's written a whole post about that which I strongly recommend you check out, not just because stochastic weight averaging actually lets you get higher performance from your existing neural networks with basically no extra work. It's as simple as adding these two parameters to your fit function. But also he's described his process of building this and how he tested it and how he contributed to the library.
So I think it's interesting if you're interested in doing something like this, because I think William had not built this kind of library before, so he describes how he did it. Another very cool contribution to the FastAI library is a new train phase API. And I'm going to do something I've never done before, which I'm actually going to present somebody else's notebook.
And the reason I haven't done it before is because I haven't liked any notebooks enough to think they're worth presenting, but Sylvain has done a fantastic job here of not just creating this new API, but also creating a beautiful notebook describing what it is and how it works and so forth.
And the background here is, as you guys know, we've been trying to train networks faster, partly as part of this DawnBench competition, and also for a reason that you'll learn about next week. And I mentioned on the forums last week that it would be really handy for our experiments if we had an easier way to try out different learning rate schedules and so on, and I basically laid out an API that I had in mind.
I said it would be really cool if somebody could write this, because I'm going to bed now and I kind of need it by tomorrow. And Sylvain replied on the forum, "Well, that sounds like a good challenge." And by 24 hours later, it was done. And it's been super cool.
I want to take you through it because it's going to allow you to do research into things that nobody's tried before. So it's called the train phase API, and the easiest way to show it is to show an example of what it does, which is here. Here is an iteration against learning rate chart, as you're familiar with seeing.
And this is one where we train for a while at a learning rate of 0.01, and then we train for a while at a learning rate of 0.001. I actually wanted to create something very much like that learning rate chart because most people that train ImageNet use this stepwise approach, and it's actually not something that's built into fast AI because it's not generally something we recommend.
But in order to replicate existing papers, I wanted to do it the same way. And so rather than writing a number of fit calls with different learning rates, it would be nice to be able to basically say train for n epochs at this learning rate and then m epochs at that learning rate.
And so here's how you do that. A phase is a period of training with particular optimizer parameters, and a schedule consists of a number of TrainingPhase objects. A TrainingPhase object says how many epochs to train for, what optimization function to use, and what learning rate, amongst other things that we'll see.
And so here you'll see the two training phases that you just saw on that graph. So now, rather than calling learn.fit, you call learn.fit_opt_sched with these phases. And then from there, most of the things you pass in can just get sent across to the fit function as per usual, so most of the usual parameters will work fine.
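Concretely, the two-phase schedule from that chart might look something like this. This is a rough sketch: `learn` is an existing fastai learner, and the exact keyword names (epochs, opt_fn, lr) are my assumptions about Sylvain's API rather than something to copy verbatim.

```python
from fastai.conv_learner import *        # fastai 0.7-era import

phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2),   # train a while at 0.01
          TrainingPhase(epochs=2, opt_fn=optim.SGD, lr=1e-3)]   # then a while at 0.001
learn.fit_opt_sched(phases)              # instead of several separate learn.fit calls
learn.sched.plot_lr()                    # shows the learning rate (and momentum) schedule
```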
But in this case, generally speaking, we can just use these training phases, and you'll see it fits in the usual way. And then when you call plot_lr, there it is. Not only does it plot the learning rate, it also plots momentum, and for each phase it tells you what optimizer it used.
You can turn off the printing of the optimizers, you can turn off the printing of momentums, and you can do other little things, like a training phase can have an lr_decay parameter. So here's a fixed learning rate, then a linear decay learning rate, then a fixed learning rate, which gives us that picture.
And this might be quite a good way to train actually, because we know at high learning rates you get to explore better, and at low learning rates you get to fine-tune better, and it's probably better to gradually slide between the two. So this actually isn't a bad approach, I suspect.
You can use other decay types, not just linear: cosine, which probably makes even more sense as a genuinely useful learning rate annealing shape; exponential, which is a super popular approach; and polynomial, which isn't terribly popular, but in the literature actually works better than just about anything else yet seems to have been largely ignored, so polynomial is good to be aware of.
And what Sylvain's done is he's given us the formula for each of these curves. And so with a polynomial you get to pick what power to use, so here it is with a different power. And I believe a p of 0.9 is the one that I've seen really good results for, FYI.
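For reference, one common way of writing polynomial decay over a phase (Sylvain's notebook may parameterize it a little differently) is:

$$ \mathrm{lr}(t) = \mathrm{lr}_{\text{end}} + \left(\mathrm{lr}_{\text{start}} - \mathrm{lr}_{\text{end}}\right)\left(1 - \frac{t}{T}\right)^{p} $$

where $t$ is the current iteration, $T$ is the number of iterations in the phase, and $p$ is the power (0.9, say).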
If you don't give a tuple of learning rates when there's an LR decay, then it will decay all the way down to zero. And as you can see, you can happily start the next cycle at a different point. So the cool thing is now we can replicate all of our existing schedules using nothing but these training phases.
So here's a function called phases_sgdr, which does SGDR using the new training phase API. And so you can see, if he runs this schedule, then here's what it looks like. It even does the little trick I have where you train at a really low learning rate just for a little bit, and then pop up and do a few cycles, and the cycles are increasing in length, and that's all done in a single function.
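As a sketch of what such a function could look like (this is my guess at the structure, not Sylvain's actual phases_sgdr code; DecayType.COSINE and the warm-up fraction are assumptions):

```python
def phases_sgdr(lr, opt_fn, num_cycles, cycle_len, cycle_mult):
    phases = [TrainingPhase(epochs=cycle_len / 20, opt_fn=opt_fn, lr=lr / 100)]   # brief low-lr warm-up
    for i in range(num_cycles):
        phases.append(TrainingPhase(epochs=cycle_len * cycle_mult ** i, opt_fn=opt_fn,
                                    lr=(lr, lr / 100), lr_decay=DecayType.COSINE))  # cosine-annealed cycle, growing length
    return phases

learn.fit_opt_sched(phases_sgdr(1e-2, optim.SGD, num_cycles=3, cycle_len=1, cycle_mult=2))
```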
So the new one cycle we can now implement with, again, a single little function. And so if we fit with that, we get this triangle followed by a little flatter bit, and the momentum has a momentum decay. And then here we've got a fixed momentum at the end. So it's doing the momentum and the learning rate at the same time.
So something that I haven't tried yet that I think would be really interesting is to use differential learning rates (we've changed the name now to discriminative learning rates). A combination of discriminative learning rates and one cycle is something no one's tried yet, so that would be really interesting. The only paper I've come across which uses discriminative learning rates is something called LARS (L-A-R-S). It was used to train ImageNet with very, very large batch sizes, by basically looking at the ratio between the gradient and the weights at each layer and using that to change the learning rate of each layer automatically, and they found that they could use much larger batch sizes.
That's the only other place I've seen this kind of approach used, but there are lots of interesting things you could try by combining discriminative learning rates with different interesting schedules. You can now also write your own LR finder of different types, specifically because there's now this stop_div parameter, which basically means that it'll use whatever schedule you asked for, but when the loss gets too bad, it'll stop training.
So here's one with learning rate versus loss, and you can see it stops itself automatically. One useful thing that's been added is the linear parameter to the plot function. If you use a linear schedule rather than an exponential schedule in your learning rate finder, which is a good idea if you've already fine-tuned into roughly the right area, then you can use linear to find exactly the right area, and then you probably want to plot it with a linear scale.
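A custom LR finder along those lines might be sketched like this (whether stop_div belongs on fit_opt_sched or on the phase itself is an assumption on my part):

```python
phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=(1e-5, 10), lr_decay=DecayType.LINEAR)]
learn.fit_opt_sched(phases, stop_div=True)   # sweep the lr linearly; stop once the loss blows up
learn.sched.plot(linear=True)                # plot with a linear scale, per the new `linear` argument
```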
So that's why you can also pass linear to plot now as well. You can also change the optimizer for each phase, and that's more important than you might imagine, because the current state of the art for training ImageNet really quickly with really large batch sizes actually starts with RMSprop for the first bit and then switches to SGD for the second bit.
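In terms of the phase API, that might look like this (the epoch counts and learning rates here are made up):

```python
phases = [TrainingPhase(epochs=1, opt_fn=optim.RMSprop, lr=1e-2),   # RMSprop for the first chunk
          TrainingPhase(epochs=2, opt_fn=optim.SGD,     lr=1e-2)]   # then switch to SGD
learn.fit_opt_sched(phases)
```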
And so that could be something interesting to experiment more with because at least one paper has now shown that that can work well. And again it's something that isn't well appreciated as yet. And then the bit I find most interesting is you can change your data. Why would we want to change our data?
Because you remember from lessons 1 and 2 that you could use smaller images at the start and bigger images later. And the theory is that you could use that to train the first bit more quickly with smaller images. And remember, if you have half the height and half the width, you've got a quarter of the activations in basically every layer, so it can be a lot faster.
And it might even generalize better. So you can now create a couple of different datasets; for example, in this case he's got 28 and then 32 sized images. This is just CIFAR-10, so there's only so much you can do. And then if you pass in an array of data in this data_list parameter, when you call fit_opt_sched, it'll use a different dataset for each phase.
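A sketch of that size annealing on CIFAR-10 might look like this (get_data is a hypothetical helper that builds a ModelData object at a given image size; the keyword names are my guesses):

```python
data_small, data_big = get_data(sz=28, bs=512), get_data(sz=32, bs=512)
phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2),
          TrainingPhase(epochs=2, opt_fn=optim.SGD, lr=1e-3)]
learn.fit_opt_sched(phases, data_list=[data_small, data_big])   # one dataset per phase
```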
So that's really cool, because we could use that in our DawnBench entries and see what happens when we actually increase the size with very little code. So what happens when we do that? Well, the answer is here in the DawnBench results for training on ImageNet.
And you can see here that Google won this with half an hour on a cluster of TPUs. The best non-TPU-cluster result is fast.ai plus students, under three hours, beating out Intel running on 128 computers, whereas we ran on a single computer. We also beat Google running on a TPU.
So using this approach we've shown the fastest GPU result, the fastest single machine result, the fastest publicly available infrastructure result, these TPU pods you can't use unless you're Google. And the cost is tiny, like this Intel one cost them $1200 worth of compute, they haven't even written it here.
That's what you get if you use 128 computers in parallel, each one with 36 cores, each one with 140 GB compared to our single AWS instance. So this is a kind of a breakthrough in what we can do, the idea that we can train ImageNet on a single publicly available machine.
And this $72, by the way, was actually $25, because we used a spot instance. One of our students, Andrew Shaw, built this whole system to allow us to throw a whole bunch of spot instance experiments up and run them simultaneously, pretty much automatically. But DawnBench doesn't quote the actual number we used, so it's actually $25, not $72.
So this data_list idea is super important and helpful. And our CIFAR-10 results are also now up there officially. You might remember the previous best was a bit over an hour, and the trick here was using one cycle, basically. So all this stuff in Sylvain's training phase API is really all the stuff we used to get these top results.
And really cool, another fast.ai student, who goes by the name bkj here, has taken that and done his own version. He took ResNet18, added the concat pooling that you might remember we learned about on top, and used Leslie Smith's one cycle, and so he's on the leaderboard.
So the top three are all fast.ai students, which is wonderful. And the same for cost, the top three. And you can see Paperspace: Brett ran this on Paperspace and got the cheapest result, just ahead of bkj (Ben, his name is, I believe). Okay. So I think you can see that a lot of the interesting opportunities at the moment for training stuff more quickly and cheaply are all about learning rate annealing and size annealing, training with different parameters at different times, and I still think everybody's just scratching the surface.
I think we can go a lot faster and a lot cheaper. And that's really helpful for people in resource-constrained environments, which is basically everybody except Google, maybe Facebook. Architecture is interesting as well, though. And one of the things we looked at last week was creating a simpler architecture which is basically state of the art, that really basic Darknet-style architecture.
But there's a piece of architecture we haven't talked about, which is necessary to understand the inception network. And the inception network is actually pretty interesting because they use some tricks to actually make things more efficient, and we're not currently using these tricks, and I kind of feel like maybe we should try it.
And so the most interesting, most successful Inception network is their Inception-ResNet-v2 network, and most of the blocks in that look something like this. It looks a lot like a standard ResNet block: there's an identity connection here, and then there's a conv path here, and then we add them together.
But it's not quite that, right? The first difference is that this path is a 1 by 1 conv, not just any old conv. And so it's worth thinking about what a 1 by 1 conv actually is. A 1 by 1 conv is simply saying, for each grid cell in your input, you've got basically a vector; a 1 by 1 by number-of-filters tensor is basically a vector.
So for each grid cell in your input, you're just doing a dot product with that tensor. And then of course it's going to be one of those vectors for each of the 192 activations we're creating. So you basically do 192 dot products with grid cell 1, 1, and then 192 with grid cell 1, 2 and 1, 3 and so forth, and so you'll end up with something which has got the same grid size as the input and 192 channels in the output.
So that's a really good way to either reduce the dimensionality or increase the dimensionality of an input without changing the grid size. That's normally what we use 1 by 1 convs for. So here we've got a 1 by 1 conv and then we've got another 1 by 1 conv and then they're added together.
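As a tiny PyTorch illustration of that channel-mixing behaviour (the shapes here are just for demonstration):

```python
import torch
import torch.nn as nn

# A 1x1 conv is just a per-grid-cell dot product across channels: it changes
# the number of channels (here 64 -> 192) without touching the grid size.
x = torch.randn(1, 64, 28, 28)               # batch x channels x height x width
conv1x1 = nn.Conv2d(64, 192, kernel_size=1)
y = conv1x1(x)
print(y.shape)                               # torch.Size([1, 192, 28, 28])
```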
And then there's a third path, and this third path is not added. It's not explicitly shown, but it's concatenated. And there is actually a form of ResNet which is basically identical to ResNet, but where we don't do plus, we do concat. And that's called a DenseNet.
So it's just a ResNet where we do concat instead of plus. And that's an interesting approach, because then the identity path is literally being copied, so you get that flow all the way through, and as we'll see next week, that tends to be good for things like segmentation, where you really want to keep the original pixels, and the first layer of pixels, and the second layer of pixels untouched.
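Here's a stripped-down sketch of the difference, with batchnorm and ReLU left out (real blocks are a bit more involved):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """ResNet-style block: the conv path is added to the identity path."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv(x)                     # plus: channel count stays the same

class DenseBlock(nn.Module):
    """DenseNet-style block: the conv path is concatenated instead, so the
    input channels are carried through untouched and the channel count grows."""
    def __init__(self, ch, growth):
        super().__init__()
        self.conv = nn.Conv2d(ch, growth, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)  # concat: ch + growth channels out
```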
So concatenating rather than adding branches is a very useful thing to do. And so here we're concatenating this branch and this branch is doing something interesting which is it's doing first of all the 1 by 1 conv and then a 1 by 7 and then a 7 by 1.
So what's going on there? So what's going on there is basically what we really want to do is do a 7 by 7 conv. The reason we want to do a 7 by 7 conv is that if you've got multiple paths, each of which has different kernel sizes, then it's able to look at different amounts of the image.
And so like the original inception network had like a 1 by 1, a 3 by 3, a 5 by 5, 7 by 7 kind of getting concatenated in together, something like that. And so if we can have a 7 by 7 filter then we get to kind of look at a lot of the image at once and create a really rich representation.
And so actually the stem of the inception network, that is the first few layers of the inception network actually also use this kind of 7 by 7 conv because you start out with 224 by 224 by 3 and you want to turn it into something that's like 112 by 112 by 64.
And so by using a 7 by 7 conv you can get a lot of information in each one of those outputs to get those 64 filters. But the problem is that 7 by 7 conv is a lot of work. We've got 49 kernel values to multiply by 49 inputs for every input pixel across every channel.
So the compute is crazy, you know. You can kind of get away with it maybe for the very first layer and in fact the very first layer, the very first conv of ResNet is a 7 by 7 conv. But not so for Inception, for Inception they don't do a 7 by 7 conv.
Instead they do a 1 by 7 followed by a 7 by 1. And so to explain, the basic idea of the inception networks, all of the different versions of it, is that you have a number of separate paths which have different convolution widths. In this case conceptually the idea is this is a 1 by 1 convolution width and this is going to be a 7 convolution width.
And so they're looking at different amounts of data and then we combine them together. But we don't want to have a 7 by 7 conv throughout the network because it's just too computationally expensive. But if you think about it, if we've got some input coming in and we have some big filter that we want and it's too big to deal with, what could we do?
To make it a little easier to draw, let's do 5 by 5. What we can do is create two filters, one which is 1 by 5 and one which is 5 by 1 (or 7, or 9, or whatever). So we take our activations from the previous layer, and we put them through the 1 by 5. We take the activations out of that and put them through the 5 by 1, and something comes out the other end. What comes out the other end? Well, rather than thinking of it as two separate steps, taking the activations and putting them through the 1 by 5 and then through the 5 by 1, what if instead we think of these two operations together and ask: what do a 1 by 5 dot product and a 5 by 1 dot product do together? Effectively, you could take a 1 by 5 and a 5 by 1, and the outer product of those is going to give you a 5 by 5.
You can't create any possible 5 by 5 matrix by taking that product, but there's a lot of 5 by 5 matrices that you can create. And so the basic idea here is when you think about the order of operations, and I'm not going to go into the detail of this, if you're interested in more of the theory here, you should check out Rachel's Numerical Linear Algebra course, which is basically a whole course about this stuff.
But conceptually the idea is that very often the computation you want to do is actually more simple than an entire 5 by 5 convolution. Very often the term we use in linear algebra is that there's some lower rank approximation. In other words, the 1 by 5 and 5 by 1 combined together, that 5 by 5 matrix is nearly as good as the 5 by 5 matrix you ideally would have computed if you were able to.
And so this is very often the case in practice, just because the nature of the real world is that the real world tends to have more structure than randomness. So the cool thing is, if we replace our 7 by 7 conv with a 1 by 7 and a 7 by 1, then this has 14 dot products to do, whereas this one has 49 to do.
So it's just going to be a lot faster, and we have to hope that it's going to be nearly as good. It's certainly capturing as much width of information by definition. So if you're interested in learning more about this specifically in the deep learning area, you can Google for factored convolutions.
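A minimal PyTorch sketch of a factored 7 by 7 (real Inception blocks interleave batchnorm and ReLU between the two convs):

```python
import torch.nn as nn

# A factored "7x7": a 1x7 followed by a 7x1 covers the same 7x7 receptive field
# with 7 + 7 = 14 weights per position instead of 49.
factored_7x7 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0)),
)
```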
The idea came up three or four years ago now; it's probably been around for longer, but that was when I first saw it. It turned out to work really well, and the Inception network uses it quite widely. They actually use it in their stem. It's interesting actually; we've talked before about how we tend to say there's this main backbone, like when we have ResNet34 for example, we say there's this main backbone which is all of the convolutions.
And then we've talked about how we can add on to it a custom head. And that tends to be like a max pooling layer and a fully connected layer. It's actually kind of better to talk about the backbone as containing kind of two pieces. One is the stem, and then the other is kind of the main backbone.
And the reason is that the thing that's coming in, remember, has only got three channels, and so we want some sequence of operations that's going to expand that out into something richer, generally something like 64 channels. And so in ResNet, the stem is just super simple: it's a 7x7 stride-2 conv, followed by a stride-2 max pool. I think that's it, if memory serves correctly. In Inception they have a much more complex stem, with multiple paths getting combined and concatenated, including factored convs, 1x7 and 7x1. What would happen if you stuck a standard ResNet on top of an Inception stem, for instance? I think that would be a really interesting thing to try, because an Inception stem is quite a carefully engineered thing.
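For comparison, the ResNet stem just described is only a couple of lines (batchnorm and ReLU omitted):

```python
import torch.nn as nn

# The ResNet stem as described: a 7x7 stride-2 conv followed by a stride-2
# max pool, taking 3x224x224 down to 64x56x56.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
```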
And this question of how you take your three-channel input and turn it into something richer seems really important. And all of that work seems to have been thrown away for ResNet. We like ResNet; it works really well. But what if we put a DenseNet backbone on top of an Inception stem?
Or what if we replaced the 7x7 conv with a 1x7 7x1 factored conv in a standard ResNet? I don't know. We could try it. I think it would be really interesting. So there's some more thoughts about potential research directions. So that was kind of my little bunch of random stuff section.
Moving a little bit closer to the actual main topic of this, which is -- what was the word I used? Image enhancement. I'm going to talk about a new paper briefly because it really connects what I just discussed with what we're going to discuss next. And the new paper -- well, it's not that new, maybe it's a year old.
It's a paper on progressive GANs, which came from NVIDIA. And the progressive GANs paper is really neat. It basically -- sorry, Rachel, yes. We have a question. One-by-one conv is usually called a network within a network in the literature. What is the intuition of such a name? No. Network in network is more than just a one-by-one conv.
It's part of a whole architecture, NIN, and I don't think there's any particular reason to look at that that I'm aware of. Okay. So the progressive GAN basically takes this idea of gradually increasing the image size. It's the only other place I'm aware of where people have gradually increased the image size.
And it kind of surprises me because this paper is actually very popular and very well-known and very well-liked. And yet people haven't taken the basic idea of gradually increasing the image size and use it anywhere else, which shows you the general level of creativity you can expect to find in the deep learning research community, perhaps.
So they really go back to the start: a 4x4 GAN, literally trying to replicate 4x4 pixels, and then 8x8. And so here's the 8x8 pixels. This is the CelebA dataset, so we're trying to recreate pictures of celebrities. And then they go to 16x16, then 32, then 64, then 128, then 256.
And one of the really nifty things they do is that as they increase size, they also add more layers to the network, which kind of makes sense, because if you're doing a more of a resnetty type thing, then you're spitting out something which hopefully makes sense in each grid cell size, and so you should be able to layer stuff on top.
And they do another nifty thing where they add a skip connection when they do that, and they gradually change a linear interpolation parameter that moves it more and more away from the old 4x4 network and towards the new 8x8 network. And then once they've totally moved it across, they throw away that extra connection.
So the details don't matter too much, but it uses the basic ideas we've talked about: gradually increasing the image size, skip connections, and so forth. But it's a great paper to study, because it's one of these rare things where good engineers actually built something that just works in a really sensible way.
It's not surprising, this actually comes from Nvidia themselves. So Nvidia don't do a lot of papers, but it's interesting that when they do, they build something that's so thoroughly practical and sensible. And so I think it's a great paper to study if you want to put together lots of the different things we've learned.
And there aren't many re-implementations of this, so it's an interesting project, and maybe you could build on it and find something else. So here's what happens next. We eventually go up to 1024x1024, and you'll see that the images are not only getting higher resolution, they're getting better.
And so at 1024x1024, I'm going to see if you can guess which one on the next page is fake. They're all fake. That's the next stage. You go up, up, up, up, up, up, up, and then boom. So GANs and stuff are getting crazy, and some of you may have seen this during the week.
Yeah, so this video just came out, and it's a speech by Barack Obama; let's check it out. It's like Jordan Peele speaking as Obama: "This is a dangerous time. Moving forward, we need to be more vigilant with what we trust from the internet. It's a time when we need to rely on trusted news sources."
It may sound basic, but how do we move forward? So as you can see, they've used this kind of technology to literally move Obama's face in the way that Jordan Peele's face was moving. You basically have all the techniques you need now to do that. So is that a good idea?
So this is the bit where we talk about what's most important, which is: now that we can do all this stuff, what should we be doing, and how do we think about that? And the TL;DR version is, I actually don't know. Actually, a lot of you saw the spaCy and Prodigy folks, the founders of Explosion AI; I did a talk with Matthew, and I went to dinner with them afterwards, and we basically spent the entire evening talking, debating, and arguing about what it means that companies like ours are building tools that are democratizing access to things that can be used in harmful ways.
They're incredibly thoughtful people, and I wouldn't say we didn't agree, we just couldn't come to a conclusion ourselves. So I'm just going to lay out some of the questions and point to some of the research. And when I say research, most of the actual literature review and putting this together was done by Rachel.
We start by saying the models we build are often pretty shitty in ways which are not immediately apparent, and you won't know how shitty they are unless the people that are building them with you are a range of people, and the people that are using them with you are a range of people.
So for example, a couple of wonderful researchers: Timnit Gebru, at Stanford, and Joy Buolamwini, who is from MIT. Joy and Timnit did this really interesting research where they looked at some basically off-the-shelf face recognizers, one from Face++, which is a huge Chinese company, plus IBM's and Microsoft's, and they looked at a range of different face types.
And generally speaking, the Microsoft one in particular was incredibly accurate, unless the face type happened to be dark-skinned, when suddenly it got 25 times worse, getting it wrong nearly half the time. And for a big company like this to release a product that, for a very large percentage of the world, basically doesn't work is more than a technical failure, right?
It's a really deep failure of understanding what kind of team needs to be involved in creating and testing such a technology, or even an understanding of who your customers are. Yeah, some of your customers have dark skin. Yes, Rachel? I was also going to add that the classifiers all did worse on women than on men.
Shocking. Yeah. It's funny, actually, Rachel tweeted about something like this the other day, and some guy was like, "What's this all about? What are you saying? Don't you know people have made cars for a long time, are you saying you don't need women to make cars too?" And Rachel pointed out, well, actually, yes: for most of the history of car safety, women in cars have been far, far more at risk of death than men in cars, because the men created male-sized crash test dummies.
And so car safety was literally not tested on women-sized bodies. Shitty product management, with a total failure of diversity and understanding, is not new to our field. And I was just going to say, that was comparing impacts of similar strength between men and women. Yeah, I don't know why.
Whenever you say something on Twitter, Rachel has to add this, because any time you say something like this on Twitter, there are like 10 people who'll be like, "Oh, you have to compare all these other things," as if we didn't know that. Here's another thing our very best, most famous systems do, like Microsoft's face recognizer or Google's language translator: you turn "she is a doctor, he is a nurse" into Turkish, and quite correctly both the pronouns become "o", because there are no gendered pronouns in Turkish.
So go the other direction: take the Turkish for "o is a doctor" (I don't know how to say that properly) and the equivalent with the word for nurse, and what does it get turned into? "He is a doctor, she is a nurse." So we've got these kinds of biases built into tools that we're all using every day.
And again, people are like, "Oh, it's just showing us what's in the world," and, okay, there are lots of problems with that basic assertion, but as you know, machine learning algorithms love to generalize. And so, because they love to generalize (this is one of the cool things about you guys knowing the technical details now), when something like two-thirds of the people cooking in the pictures they used to build this model are women, and then you actually run the model on a separate set of pictures, 84% of the people it labels as cooking are women, rather than the correct 67%. Which is a really understandable thing for an algorithm to do: it took a biased input and created a more biased output, because for this particular loss function, that's where it ended up.
And this is a really common kind of bias amplification. So this stuff matters. It matters in ways that go beyond awkward translations, or black people's photos not being classified correctly. Or maybe there are some wins too, like horrifying surveillance everywhere maybe won't work on black people, I don't know.
Or it'll be even worse, because it's horrifying surveillance and it's flat-out racist and wrong. But let's go deeper, right? For all we say about human failings, there's a long history of civilizations and societies creating layers of human judgment which hopefully avoid the most horrible things happening. And sometimes companies which love technology think, "Let's throw away the humans and replace them with technology," like Facebook did. Two or three years ago, Facebook literally got rid of their human editors (this was in the news at the time), and they were replaced with algorithms.
And so now it's algorithms that put all the stuff in your newsfeed, and human editors are out of the loop. What happened next? Many things happened next. One of which was a massive, horrifying genocide in Myanmar: babies getting torn out of their mothers' arms and thrown onto fires, mass rape, murder, and an entire people exiled from their homeland.
I'm not going to say that was because Facebook did this, but what I will say is that when the leaders of this horrifying project are interviewed, they regularly talk about how everything they learned about the disgusting animal behaviors of Rohingya that need to be thrown off the earth, they learned from Facebook.
Because the algorithms just want to feed you more stuff that gets you clicking. And so if you get told these people that don't look like you and you don't know are bad people and here's lots of stories about the bad people, and then you start clicking on them and then they feed you more of those things, the next thing you know you have this extraordinary cycle.
And people have been studying this. So for example, we've been told a few times that people click on our fast.ai videos, and then the next thing recommended to them is conspiracy theory videos from Alex Jones, and then that continues from there. Because humans click on things that shock us and surprise us and horrify us.
And so at so many levels, this decision has had extraordinary consequences which we're only beginning to understand. And again, this is not to say this particular consequence is because of this one thing, but to say it's entirely unrelated would be clearly ignoring all of the evidence and information that we have.
So this is really the key takeaway: think about what you're building and how it could be used. Lots and lots of effort is now being put into face detection, including in our course; we've been spending a lot of time thinking about how to recognize stuff and where it is.
And there's lots of good reasons to want to be good at that, for improving crop yields in agriculture, for improving diagnostic and treatment planning in medicine, for improving your Lego sorting robot system, whatever. But it's also being widely used in surveillance and propaganda and disinformation, and again, the question is what do I do about that?
I don't exactly know, but it's definitely at least important to be thinking about it, talking about it, and sometimes you can do really good things. For example, meetup.com did something which I would put in the category of really good thing, which is they recognized early a potential problem, which is that more men were tending to go to their meetups.
And that was causing their collaborative filtering systems, which you're all familiar with building now, to recommend more technical content to men. And that was causing more men to go to more technical content, which is causing the recommendation systems to suggest more technical content to men. And this kind of runaway feedback loop is extremely common when we interface the algorithm and the human together.
So what did meetup do? They intentionally made the decision to recommend more technical content to women, not because of some highfalutin idea about how the world should be, but just because that makes sense. The runaway feedback loop was a bug. There are women that want to go to tech meetups, but when you turn up to a tech meetup and it's all men, then you don't go and it recommends more men, and so on and so forth.
So Meetup made a really strong product management decision here, which was to not do what the algorithm said to do. Unfortunately, this is rare. Take most of these runaway feedback loops, for example in predictive policing, where algorithms tell police where to go, which very often is more black neighborhoods, which end up crawling with more police, which leads to more arrests, which causes the systems to tell more police to go to more black neighborhoods, and so forth.
So this problem of algorithmic bias is now very widespread, and as algorithms become more and more widely used for specific policy decisions, judicial decisions, and day-to-day decisions about who to give what offer to, it just keeps becoming a bigger problem. And some of them are things that the people involved in the product management decision should have seen at the very start didn't make sense and were unreasonable under any definition of the term.
For example, this stuff that I pointed out before: these were questions that were used to decide (Rachel, is this sentencing guidelines?) ... this software is used for both pretrial decisions, so who is required to post bail (these are people that haven't even been convicted), as well as for sentencing and for who gets parole. And this was upheld by the Wisconsin Supreme Court last year, despite all the flaws that had been pointed out.
So whether you have to stay in jail because you can't pay the bail and how long your sentence is for and how long you stay in jail for depends on what your father did, whether your parents stayed married, who your friends are, and where you live. Now it turns out these algorithms are actually terribly, terribly bad, so some recent analysis showed that they're basically worse than chance, but even if the companies building them were confident and these were statistically accurate correlations, does anybody imagine there's a world where it makes sense to decide what happens to you based on what your dad did?
So a lot of this stuff at the basic level is obviously unreasonable, and a lot of it just fails in these ways, but you can see empirically that these runaway feedback loops must have happened, and these overgeneralizations must have happened. For example, these are the kind of cross tabs that anybody working in these fields, in any field that's using algorithms, should be preparing.
So for prediction of likelihood of reoffending, for black versus white defendants, we can just calculate this very simply. Of the people that were labeled high risk but didn't reoffend, 23.5% were white, but about twice that proportion were African American, whereas of those that were labeled lower risk but did reoffend, it was about half of the white people and only 20% of the African Americans.
So this is the kind of stuff where, at least if you're taking the technologies we've been talking about and putting them into production in any way, or building an API for other people, or providing training for people, or whatever, then at least make sure that what you're doing can be tracked in a way that people know what's going on, so at least they're informed.
I think it's a mistake, in my opinion, to assume that people are evil and trying to break society. I prefer to start with the assumption that if people are doing dumb stuff, it's because they don't know better, so at least make sure they have this information. And I find very few ML practitioners thinking about what information they should be presenting in their interface.
And then often I'll talk to data scientists who will say, "Oh, the stuff I'm working on doesn't have a societal impact." It's like, really? Like a number of people who think that what they're doing is entirely pointless? Come on! People are paying you to do it for a reason, it's going to impact people in some way.
So think about what that is. The other thing I know is that a lot of people involved here are hiring people. And so if you're hiring people, I guess you're all very familiar with the basic premise of the fast.ai philosophy now. And I think it comes back to this idea that I don't think people on the whole are evil; I think they need to be informed and to have tools.
So we're trying to give as many people the tools as possible that they need, and particularly we're trying to put those tools in the hands of a more diverse range of people. So if you're involved in hiring decisions, perhaps you can keep this kind of philosophy in mind as well.
If you're not just hiring a wider range of people, but also promoting a wider range of people and providing really appropriate career management for a wider range of people, apart from anything else, your company will do better. It actually turns out that more diverse teams are more creative and tend to solve problems more quickly and better than less diverse teams.
But also you might avoid these awful screw-ups which at one level are bad for the world, and at another level if you ever get found out they can also destroy your company. Also they can destroy you, or at least make you look pretty bad in history. A couple of examples.
One goes right back to the Second World War: IBM basically provided all of the infrastructure necessary to track the Holocaust. These were the forms they used, and they had different codes: Jews were 8, Gypsies were 12, death in the gas chambers was 6, and it all went on these punch cards.
You can go and look at these punch cards in museums now. This has actually been reviewed by a Swiss judge who said that IBM's technical assistance facilitated the task of the Nazis in the commission of the crimes against humanity. It's interesting to read back the history from these times to see what was going through the minds of people at IBM at that time.
What was clearly going through the minds was the opportunity to show technical superiority, the opportunity to test out their new systems, and of course the extraordinary amount of money that they were making. When you do something which at some point down the line turns out to be a problem, even if you were told to do it, that can turn out to be a problem for you personally.
For example, you'll remember the diesel emissions scandal in VW, who was the one guy that went to jail? It was the engineer, just doing his job. So if all of this stuff about actually not fucking up the world isn't enough to convince you, it can fuck up your life too.
So if you do something that turns out to cause problems, even though somebody told you to do it, you can absolutely be held criminally responsible. And certainly look at Kogan: I think a lot of people now know the name Alexander Kogan; he was the guy that handed over the Cambridge Analytica data.
He's a Cambridge academic, now a very famous Cambridge academic the world over for doing his part to destroy the foundations of democracy. So this is probably not how we want to go down in history. So let's have a break, before we do, Rachel. In one of your tweets, you said dropout is patented.
I think this is about WaveNet patent from Google. What does it mean? Can you please share more insight on this subject? Does it mean that we'll have to pay to use dropout in the future? Okay, good question. Let's talk about that after the break. So let's come back at 7.40.
The question before the break was about patents. What does it mean? So I guess the reason it's coming up is because I wrote a tweet this week, which I think was like three words: dropout is patented. One of the patent holders is Geoffrey Hinton. So what? Isn't that great?
Invention is all about patents, blah blah blah, right? My answer is no. Patents have gone wildly crazy. The number of things that are patentable that we talk about every week would be dozens. It's so easy to come up with a little tweak, and then if you turn that into a patent, you stop everybody from using that little tweak for the next 14 years. And you end up with the situation we have now, where everything is patented in 50 different ways, and then you get these patent trolls who have made a very, very good business out of basically buying lots of shitty little patents and then suing anybody who, it turns out, accidentally did that thing, like putting rounded corners on buttons.
So what does it mean for us that a lot of stuff is patented in deep learning? I don't know. One of the main people doing this is Google, and people from Google who reply about this tend to assume that Google is doing it defensively, so that if somebody sues them, they can say, don't sue us, we'll sue you back, because we have all these patents.
The problem is that as far as I know they haven't signed what's called a defensive patent pledge. So basically you can sign a legally binding document that says our patent portfolio will only be used in defense and not offense, and even if you believe all the management of Google would never turn into a patent troll, you've got to remember that management changes.
To give a specific example, I know the somewhat recent CFO of Google has a much more aggressive stance towards the P&L and I don't know, maybe she might decide that they should start monetizing their patents, or maybe the group that made that patent might get spun off and then sold to another company that might end up in private equity hands and decide to monetize the patents.
I think it's a problem. There has been a big shift legally recently away from software patents actually having any legal standing, so it's possible that these all end up thrown out of court, but the reality is that anything but a big company is unlikely to have the financial ability to defend themselves against one of these huge patent trolls.
So I think it's a problem. You can't avoid using patented stuff if you write code. I wouldn't be surprised if most lines of code you write have patents on them. So actually, funnily enough, the best thing to do is not to study the patents, because if you do and you infringe knowingly, the penalties are worse.
The best thing to do is to put your hands over your ears, sing a song, and get back to work. So that thing I said about dropout being patented? Forget I said that; you skipped that bit. Okay, this next bit is super fun: artistic style. We're going to go a bit retro here, because this is actually the original artistic style paper.
There's been a lot of updates to it, a lot of different approaches. And I actually think, in many ways, the original is the best. We're going to look at some of the newer approaches as well, but I actually think the original is a terrific way to do it, even with everything that's gone since.
Let's just jump to the code. This is the style transfer notebook. So the idea here is that we want to take a photo of this bird, and we want to create a painting that looks like Van Gogh painted the picture of the bird. Quite a bit of the stuff that I'm doing, by the way, uses ImageNet.
You don't have to download the whole of ImageNet for any of the things I'm doing. There's an ImageNet sample on files.fast.ai/data, which has a couple of gigs, and it should be plenty good enough for everything we're doing. If you want to get really great results, you can grab ImageNet.
You can download it from Kaggle. On Kaggle, the localization competition actually contains all of the classification data as well. So if you've got room, it's good to have a copy of ImageNet because it comes in handy all the time. So I just grabbed a bird out of my ImageNet folder, and there is my bird.
What I'm going to do is I'm going to start with this picture, and I'm going to try and make it more and more like a picture of this bird painted by Van Gogh. And the way I do that is actually very simple. You're all familiar with it. We will create a loss function, which we'll call f, and the loss function is going to take as input a picture, and spit out as output a value, and the value will be lower if the image looks more like the bird photo painted by Van Gogh.
Having written that loss function, we will then use PyTorch to get the gradients and an optimizer to take the gradient times the learning rate; but we're not going to update any weights, we're going to update the pixels of the input image to make it a little bit more like a picture which would be a bird painted by Van Gogh.
And we'll stick it through the loss function again to get more gradients, and do it again and again. And that's it. It's identical to how we solve every problem. You know I'm a one-trick pony, right? This is my only trick: create a loss function, use it to get some gradients, multiply by the learning rate to update something. Before now, we've always updated weights in a model, but today we're not going to do that.
We're going to update the pixels of the input, but it's no different at all. We're just taking the gradient with respect to the input, rather than with respect to the weights. That's it. So we're nearly done. Let's do a couple more things. Let's mention here that there's going to be two more inputs to our loss function.
One is the picture of the bird, birds look like this. And the second is an artwork by Van Gogh, they look like this. By having those as inputs as well, that means we'll be able to re-run the function later to make it look like a bird painted by Monet or a jumbo jet painted by Van Gogh or whatever.
So those are going to be the three inputs. And so initially, as we discussed, our input here, this is going to be the first time I've ever found the rainbow pen useful. So we start with some random noise, use the loss function, get the gradients, make it a little bit more like a bird painted by Van Gogh and so forth.
So the only outstanding question which I guess we can talk about briefly is how we calculate how much our image looks like a bird, this bird, painted by Van Gogh. So let's split it into two parts. Let's put it into a part called the content_loss, and that's going to return a value that's lower if it looks more like the bird.
Not just any bird, the specific bird that we had coming in. And then let's also create something called the style_loss, and that's going to be a lower number if the image is more like Van Gogh's style. So there's one way to do the content_loss which is very simple. We could look at the pixels of the output, compare them to the pixels of the bird, and do a mean squared error, add them up.
So if we did that and ran it for a while, eventually our image would turn into an image of the bird. You should try it. You should try this as an exercise: try to use an optimizer in PyTorch to start with a random image and turn it into another image by using a mean squared error pixel loss.
Not terribly exciting, but that would be step 1. The problem is, even if we already had a style_loss function working beautifully, presumably what we're going to do is add these two together, with one of them multiplied by some lambda, some number we'll pick to adjust how much style versus how much content.
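Written out, the combined objective is just:

$$ \mathcal{L}(x) = \mathcal{L}_{\text{content}}(x) + \lambda\, \mathcal{L}_{\text{style}}(x) $$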
So assuming we had a style_loss, and we had picked some sensible lambda: if we used a pixel-wise content_loss, then anything that makes it look more like Van Gogh and less like the exact photo, the exact background, the exact contrast, the exact lighting, everything, will increase the content loss, which is not what we want.
We want it to look like the bird, but not in the same way. It's still going to have the same two eyes in the same place, and be the same kind of shape and so forth, but not the same representation. So what we're going to do is, and this is going to shock you, we're going to use a neural network.
I totally meant that to be black and it came out green. It's always a black box. And we're going to use the VGG neural network, because that's what I used last year and I didn't have time to see if other things worked, so you can try that yourself during the week.
And the VGG network is something which takes in an input and sticks it through a number of layers. And I'm just going to treat these as just the convolutional layers. There's obviously ReLU there, and if it's a VGG with batch norm, which most are today, then it's also got batch norm.
And there's max pooling and so forth, but that's fine. What we could do is we could take one of these convolutional activations, and then rather than comparing the pixels of this bird, we could instead compare the VGG layer 5 activations of this to the VGG layer 5 activations of our original bird, or layer 6, layer 7 or whatever.
So why might that be more interesting? Well for one thing, it wouldn't be the same bird. It wouldn't be exactly the same, because we're not checking the pixels, we're checking some later set of activations. And so what do those later sets of activations contain? Well assuming that after some max pooling they contain a smaller grid, so it's less specific about where things are, and rather than containing pixel color values, they're more like semantic things like, is this kind of like an eyeball, or is this kind of furry, or is this kind of bright, or is this kind of reflective, or is this laying flat, or whatever.
So we would hope that there's some level of semantic features through those layers, where if we get a picture that matches those activations, then any picture that matches those activations looks like the bird, but it's not the same representation of the bird. So that's what we're going to do.
That's what our content loss is going to be. People generally call this a perceptual loss, because it's really important in deep learning that you always create a new name for every obvious thing you do. So if you compare two activations together, you're doing a perceptual loss. So that's it.
Our content loss is going to be a perceptual loss, and then we'll do the style loss later. So let's start by trying to create a bird that initially is random noise, and we're going to use perceptual loss to create something that is bird-like, but it's not this bird. So let's start by saying we're going to do 288 by 288.
Because we're only going to do one bird, there's going to be no GPU memory problems. So I was actually disappointed that I realized that I picked a rather small input image. It would be fun to try this with something much bigger to create a really grand scale piece. The other thing to remember is if you were productionizing this, you could do a whole batch at a time.
People sometimes complain about this approach (Gatys is the lead author, so it's the Gatys style transfer approach) as being slow. I don't agree it's slow: it takes a few seconds, and you can do a whole batch in a few seconds. So we're going to stick it through some transforms as per usual, transforms for a VGG16 model.
And so remember, the transform class has a dunder call method, so we can treat it as if it's a function. So if you pass an image into that, then we get the transformed image. Try not to treat the fastai and PyTorch infrastructure as a black box, because it's all designed to be really easy to use in a decoupled way.
So this idea that transforms are just callables, i.e. things that you can call with parentheses, comes from PyTorch, and we totally plagiarized the idea. So with torchvision or with fastai, your transforms are just callables, and the whole pipeline of transforms is just a callable. So now we have something of shape 3x288x288, because PyTorch likes the channels to be first, and as you can see it's been turned into a square for us and normalized, all that normal stuff.
Now we're creating a random image. And here's something I discovered. Trying to turn this into a picture of anything is actually really hard. I found it very difficult to actually get an optimizer to get reasonable gradients that went anywhere. And just as I thought I was going to run out of time for this class and really embarrass myself, I realized the key issue is that pictures don't look like this, they have more smoothness.
So I turned this into this by just blurring it a little bit. I used a median filter, basically it's like a median pooling effectively. As soon as I changed it from this to this, it immediately started training really well. So it's like a number of little tweaks you have to do to get these things to work is kind of insane, but here's a little tweak.
So we start with a random image which is at least somewhat smooth. I found that my bird image had a standard deviation of pixels that was about half of this mean, so I divided it by 2, just trying to make it a little bit easier for it to match.
I don't know if it matters. Turn that into a variable because this image, remember, we're going to be modifying those pixels with an optimization algorithm. So anything that's involved in the loss function needs to be a variable, and specifically it requires a gradient because we're actually updating the image.
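Something along these lines would do it (the filter size and the old Variable-style PyTorch API are assumptions based on the description above):

```python
import numpy as np
import torch
from scipy.ndimage import median_filter
from torch.autograd import Variable

# Random starting image, smoothed with a median filter so it isn't pure noise,
# scaled down, then wrapped as a Variable that requires gradients, because the
# pixels themselves are what we'll optimize.
sz = 288
rand_img = np.random.uniform(0, 1, size=(sz, sz, 3)).astype(np.float32)
rand_img = median_filter(rand_img, size=(8, 8, 1)) / 2               # smooth each channel, halve the spread
rand_img = np.ascontiguousarray(rand_img.transpose(2, 0, 1))[None]   # -> 1 x 3 x 288 x 288
opt_img_v = Variable(torch.from_numpy(rand_img), requires_grad=True)
```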
So we now have a mini-batch of 1, 3 channels, 288 by 288, random noise. We're going to use, for no particular reason, the 37th layer of VGG. If you print out the VGG network, you can just type in m_vgg and print it out, and you'll see that this is a mid-to-late stage layer.
So we can just grab the first 37 layers and turn it into a sequential model, so now we've got a subset of VGG that will spit out some mid-layer activations. And so that's what the model's going to be. So we can take our actual bird image, and we want to create a mini-batch of 1.
So remember if you slice in numpy with none, also known as np.newaxis, it introduces a new unit axis in that point. So here I want to create an axis of size 1 to say this is a mini-batch of size 1, so slicing with none, just like I did here, sliced with none to get this 1 unit axis at the front.
So then we turn that into a variable. And this one doesn't need to be updated, so we use VV to say you don't need gradients for this guy. And so that's going to give us our target activations. So we've basically taken our bird image, turned it into a variable, stuck it through our model to grab the 37th layer activations, and that's our target: we want our content loss to compare against this set of activations.
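Here is a minimal sketch of that step in plain PyTorch with torchvision's vgg16; the exact layer index depends on how the model is defined, so the 30 below is only illustrative, and img_tensor stands in for the transformed bird image.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

m_vgg = vgg16(pretrained=True).features.eval()   # older torchvision API for pretrained weights
for p in m_vgg.parameters():
    p.requires_grad_(False)                      # we never update VGG itself

# keep everything up to some mid-to-late layer so the forward pass
# spits out mid-layer activations
m_vgg_cut = nn.Sequential(*list(m_vgg.children())[:30])

img_tensor = torch.rand(3, 288, 288)             # stand-in for the transformed bird image
targ = m_vgg_cut(img_tensor[None]).detach()      # [None] adds the batch axis; detach: no gradients
```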
So now we're going to create an optimizer. We'll go back to the details of this in a moment, but we're going to create an optimizer, and we're going to step a bunch of times: zero the gradients, call some loss function, call backward, blah blah blah. So that's the high-level version, and I'm going to come back to the details in a moment.
But the key thing is that we're passing into that loss function the randomly generated image — the optimization image, or actually the variable of it. So we pass that to our loss function, and it's going to update this using the loss function, and the loss function is the mean squared error loss comparing our current optimization image, passed through our VGG to get the intermediate activations, against our target activations.
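As a minimal sketch, building on the m_vgg_cut and targ names assumed above (not the notebook's exact names), the content loss is just this:

```python
import torch.nn.functional as F

def actn_loss(x):
    # MSE between the current image's mid-layer activations and the target's;
    # the *1000 keeps the numbers big enough not to get lost in fp32 (see below)
    return F.mse_loss(m_vgg_cut(x), targ) * 1000
```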
Just like we discussed. And we'll run that a bunch of times, and we'll print it out, and we have our bird — or at least the network's representation of the bird — so there it is. So a couple of new details here. One is a weird optimizer, L-BFGS. Anybody who's done certain parts of math and computer science courses comes into deep learning, discovers we use all this stuff like Adam and SGD, assumes that nobody in the field knows the first thing about computer science, and immediately says, "Oh, have any of you guys tried using BFGS?" There's basically a long history of a totally different kind of algorithm for optimization that we don't use to train neural networks.
And of course the answer is actually the people who have spent decades studying neural networks do know a thing or two about computer science, and it turns out these techniques don't work very well. But it's actually going to work well for this, and it's a good opportunity to talk about an interesting algorithm for those of you that haven't studied this type of optimization algorithm at school.
So BFGS is Broyden–Fletcher–Goldfarb–Shanno — the initials of four different people — and the L stands for limited memory, so L-BFGS is just limited-memory BFGS. And it's an optimizer. So as an optimizer, it means that there's some loss function, and it's going to use some gradients — not all optimizers use gradients, but all the ones we use do — to find a direction to go and try to make the loss function go lower and lower by adjusting some parameters.
It's just an optimizer. But it's an interesting kind of optimizer because it does a bit more work than the ones we're used to on each step. And so specifically -- okay. So the way it works is it starts the same way that we're used to, which is we just kind of pick somewhere to get started, and in this case we pick a random image, as you saw.
And as per usual, we calculate the gradient. But we don't just take a step, but what we actually do is as well as find in the gradient, we also try to find the second derivative. So the second derivative says how fast does the gradient change? So the gradient is how fast does the function change, the second derivative is how fast does the gradient change?
In other words, how curvy is it? And the basic idea is that if you know that it's not very curvy, then you can probably jump further. But if it is very curvy, then you probably don't want to jump as far. And so in higher dimensions, the gradient's called the Jacobian, and the second derivative's called the Hessian.
You'll see those words all the time, but that's all they mean. Again, mathematicians have to invent new words for everything. They're just like deep learning researchers, except maybe a bit more snooty. So with BFGS, we're going to try and calculate the second derivative, and then we're going to use that to figure out what direction to go and how far to go.
So it's less of a wild jump into the unknown. Now the problem is that actually calculating the Hessian, the second derivative, is almost certainly not a good idea, because in each possible direction that you can head, for each direction that you're measuring the gradient in, you also have to calculate the Hessian in every direction.
It gets ridiculously big. So rather than actually calculating it, we take a few steps and we basically look at how much the gradient's changing as we do each step, and we approximate the Hessian using that little function. And again, this seems like a really obvious thing to do, but nobody thought of it until somewhat surprisingly long time later.
Keeping track of every single step you take takes a lot of memory. So don't keep track of every step you take, just keep the last 10 or 20. And that second bit is the L in L-BFGS. So limited-memory BFGS means keep the last 10 or 20 gradients, use them to approximate the amount of curvature, and then use the curvature and gradient to estimate what direction to travel and how far.
And so that's normally not a good idea in deep learning, for a number of reasons. It's obviously more work to do than an Adam or SGD update, and obviously more memory — and memory is even more of an issue when you've got to store it on a GPU alongside hundreds of millions of weights.
But more importantly, the mini-batches are super bumpy. So figuring out curvature to decide exactly how far to travel is kind of polishing turds as we say. Is that an American expression or just an Australian thing? I bet English say it too. Do we have to say it? Polishing turds.
You get the idea. But also, interestingly, using the second derivative information, it turns out it's like a magnet for saddle points. So there's some interesting theoretical results that basically say it actually sends you towards nasty flat areas of the function if you use second derivative information. So normally not a good idea.
But in this case, we're not optimizing weights. We're optimizing pixels, so all the rules change, and actually it turns out L-BFGS does make sense. And because it does more work each time, it's a different kind of optimizer, so the API is a little bit different in PyTorch. As you can see here, when you say optimizer.step, you actually pass in the loss function.
And so what I do is call step with a particular loss function, which is my activation loss. And as you can see, inside the loop you don't say step, step, step; rather, it looks like this. So it's a little bit different. And you're welcome to try and rewrite this to use SGD — it'll still work, it'll just take a bit longer.
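To make that loop shape concrete, here's a minimal sketch of the PyTorch L-BFGS pattern, restructured with my own names rather than the notebook's exact code; the key point is that optimizer.step takes a closure that re-computes the loss and calls backward.

```python
import torch

# opt_img and actn_loss are assumed from the sketches above
optimizer = torch.optim.LBFGS([opt_img], lr=0.5)

def step(loss_fn):
    def closure():
        optimizer.zero_grad()
        loss = loss_fn(opt_img)
        loss.backward()
        return loss
    return optimizer.step(closure)       # L-BFGS may evaluate the closure several times

for i in range(10):
    loss = step(actn_loss)
    if i % 2 == 0:
        print(i, loss.item())
```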
I haven't tried it with SGD, I'd be interested to know how much longer it takes. So you can see the loss function going down, the mean squared error between the activations at layer 37 of our VGG model for our optimized image versus the target activations, and remember the target activations were the VGG applied to our bird.
Does that make sense? So we've now got a content loss. Now one thing I'll say about this content loss is we don't know which layer is going to work best, so it would be nice if we were able to experiment a little bit more, and the way it is here is annoying.
Maybe we even want to use multiple layers. So rather than like lopping off all of the layers after the one we want, wouldn't it be nice if we could somehow grab the activations of a few layers as it calculates? Now we already know one way to do that. Back when we did SSD, we actually wrote our own network which had a number of outputs.
Do you remember? For the different convolutional layers, we spat out a different out-conv output. But I don't really want to go and add that to the torchvision ResNet model, especially not if later on I want to try the torchvision VGG model, and then I want to try a NASNet-A model.
I don't want to go into all of them and change their outputs; besides which, I'd like to easily be able to turn certain activations on and off on demand. So we've briefly touched before on this idea that PyTorch has these fantastic things called hooks. You can have forward hooks that let you plug anything you like into the forward pass of a calculation, or backward hooks that let you plug anything you like into the backward pass.
So we're going to create the world's simplest forward hook. This is one of these things that almost nobody knows about, so like almost any code you find on the internet that implements style transfer will have all kinds of horrible hacks rather than using forward hooks. But with forward hooks, it's really easy.
So to create a forward hook, you just create a class, and the class has to have something called a hook function. Your hook function is going to receive the module that you've hooked, it's going to receive the input of the forward pass, and it's going to receive the output, and then you do whatever the hell you like.
So what I'm going to do is I'm just going to store the output of this module in some attribute. That's it. So this can actually be called anything you like, but hook function seems to be the standard. You can see what happens here in the constructor is I store inside some attribute the result of -- this is going to be the layer that I'm going to hook -- you go module.register_forward_hook and pass in the function that you want to be called when this module, when its forward method is called.
So when its forward method is called, it will call self.hook_fn, which will store the output in an attribute called features. So now what we can do is we can create our VGG as before, and set it to be not trainable so we don't waste time and memory calculating gradients for it.
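Putting that description together, the hook class is essentially this (a minimal sketch; the attribute and method names follow the description above):

```python
class SaveFeatures():
    def __init__(self, module):
        # register_forward_hook calls hook_fn(module, input, output)
        # every time this module's forward method runs
        self.hook = module.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, input, output):
        self.features = output           # just stash the activations
    def close(self):
        self.hook.remove()               # detach the hook so it stops holding on to memory
```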
And let's go through and find all of the MaxPool layers. So let's go through all of the children of this module, and if it's a MaxPool layer, let's spit out its index minus 1. So that's going to give me the layer before the MaxPool. And in general, the layer before a MaxPool, or the layer before a stride-2 conv, is a very interesting layer, because it's the most complete representation we have at that grid cell size.
Because the very next layer is changing the grid. So that seems to me like a good place to grab the content loss from: the richest, most semantically interesting content we have at that grid size. So that's why I'm going to pick those indexes. So here they are.
Those are the indexes of the last layer before each MaxPool in VGG. So I'm going to grab one of them for no particular reason, just to try something else: block_ends[3], which is 32. So children(m_vgg)[block_ends[3]] will give me the 32nd layer of VGG as a module.
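Here's a sketch of that in plain PyTorch (the notebook used fastai's children() helper to the same effect; the exact index values depend on which VGG variant you load):

```python
import torch.nn as nn

block_ends = [i - 1 for i, layer in enumerate(m_vgg.children())
              if isinstance(layer, nn.MaxPool2d)]
print(block_ends)   # e.g. [3, 8, 15, 22, 29] for torchvision's plain vgg16 features
```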
And then if I call the SaveFeatures constructor, it's going to go self.hook = that 32nd layer of VGG, .register_forward_hook, with our hook function. So now every time I do a forward pass on this VGG model, it's going to store the 32nd layer's output inside sf.features. So we can now say — see here I'm calling my VGG network, but I'm not storing the result anywhere.
I'm not saying activations equals VGG of my image. I'm calling it, throwing away the answer, and then grabbing the features that we stored in sf, our SaveFeatures object. And that call — just using the model as a callable — is how you do a forward pass in PyTorch.
You don't say .forward, you just use it as a callable. And using it as a callable on an nn.module automatically calls forward. That's how PyTorch modules work. So we call it as a callable, that ends up calling our forward hook. That forward hook stores the activations in sf.features. And so now we have our target variable, just like before, but in a much more flexible way.
These are the same four lines of code we had earlier, I've just stuck them into a function. And so it's just giving me my random image to optimize, and an optimizer to optimize that image. This is exactly the same code as before, so that gives me these. And so now I can go ahead and do exactly the same thing.
But now I'm going to use a different loss function, activation_loss_2, which doesn't store the result of m_vgg anywhere. Again, it calls m_vgg to do a forward pass, throws away the result, and grabs sf.features. And so that's now my 32nd-layer activations, which I can then do my MSE loss on. You might have noticed the last loss function and this one are both multiplied by a thousand.
Why are they multiplied by a thousand? Again, this was one of those things that was stopping this lesson from working correctly. I didn't originally have the thousand, and it wasn't training. Right up until lunchtime today nothing was working, after days of trying to get this thing to work.
And finally, I just randomly noticed that the loss function numbers were really low, like 1e-7. And I just thought, what if they weren't so low? So I multiplied them by a thousand and it started working. So why did it not work? Because we're doing single-precision floating point, and single-precision floating point ain't that precise.
And particularly once you're getting gradients that are kind of small, and then you're multiplying by the learning rate, which can also be kind of small, you end up with a really small number. And if it's small enough, it can get rounded to zero — and that's what was happening, and my model wasn't training.
So I'm sure there are more elegant ways than multiplying by a thousand, but whatever, it works fine. It doesn't matter what you multiply a loss function by, because all you care about is its direction and its relative size. Interestingly, something similar came up when we were training ImageNet: we were using half-precision floating point because the Volta tensor cores require that.
And it's actually a standard practice if you want to get the half-precision floating point to train, you actually have to multiply the loss function by a scaling factor. We were using 1024 or 512. And I think FastAI is now the first library that has all of the tricks necessary to train in half-precision floating point built-in.
So if you have a Volta, or you can pay for a P3, and you've got a learner object, you can just say learn.half() and it'll now just magically train correctly in half-precision floating point. It's built into the model data objects as well, it's all automatic, and I'm pretty sure no other library does that.
So this is just doing the same thing on a slightly earlier layer. And you can see that the later layer doesn't look very bird-like at all, but you can kind of tell it's a bird, slightly earlier layer, more bird-like. And hopefully that makes sense to you that earlier layers are getting closer to the pixels.
At the earlier layer there's a larger grid — more grid cells, each cell smaller — so a smaller receptive field and less complex semantic features. So the earlier we get, the more it's going to look like a bird. And in fact, the paper has a nice picture of that showing various different layers, kind of zooming into this house; they're trying to make this house look like this picture.
And you can see that later on it's pretty messy, and earlier on it looks like this. So this is just doing what we just did. And I will say, one of the things I've noticed in our study group is that any time I answer a question by saying "read the paper — there's a thing in the paper that tells you the answer to that question", there's always this shocked look.
Read the paper? Me? The paper? But seriously, the papers have done these experiments and drawn the pictures; there's all this stuff in the papers. It doesn't mean you have to read every part of the paper, but at least look at the pictures. So check out the Gatys paper, it's got nice pictures.
So they've done the experiment for us, they basically did this experiment, but it looks like they didn't go as deep, they just got some earlier ones. The next thing we need to do is to create style loss. So we've already got the loss, which is how much like the bird is it.
Now we need how much like this painting style is it. And we're going to do nearly the same thing. We're going to grab the activations of some layer. Now the problem is that the activations of some layer, let's say it was a 5x5 layer. Of course there are no 5x5 layers at 224x224, but we'll pretend 5x5 by 19, totally unrealistic sizes, but never mind.
So here's some activations, and we could get these activations both for the image we're optimizing and for our Van Gogh painting. And let's look at our Van Gogh painting. There it is, very nice. I downloaded this from Wikipedia, and I was wondering why it was taking so long to load.
It turns out that the Wikipedia version I downloaded was 30,000 by 30,000 pixels. It's pretty cool — they've got this serious gallery-quality archive stuff there, I didn't know it existed — but don't try to run a neural net on that. It totally killed my Jupyter notebook. So we can do that for our Van Gogh image and we can do that for our optimized image.
And if we just compared the two the same way, we would end up creating an image whose content looks like the painting — but that's not what we want. We want something with the same style, not something that has the painting's content. So we actually want to throw away all of the spatial information.
We're not trying to create something that has a moon here and stars here and a church here or whatever. We don't want any of that. So how do we throw away all the spatial information? What we do is grab — in this case there are like 19 channels on this, like 19 slices.
So let's grab this top slice, so that's going to be a 5x5 matrix. And now let's flatten it. So now we've got a 25 long vector. Now in one stroke, we've thrown away the bulk of the spatial information by flattening it. Now let's grab a second slice, another channel, and do the same thing.
So here's channel 1, flattened, here's channel 2, flattened, and they've both got 25 elements. And now let's take the dot product, which we can do with @, and so the dot product's going to give us one number. What's that number? What is it telling us? Well, assuming this is somewhere around the middle layer of the VGG network, we might expect some of these activations to be like how textured is the brush stroke, and some of them to be like how bright is this area, and some of them to be like is this part of a house or part of a circular thing, or other parts to be how dark is this part of the painting.
And so a dot product, remember, is basically a correlation. If this element and this element are both highly positive or both highly negative, it gives us a big result, whereas if they're the opposite, it gives us a small result. If they're both close to zero, it gives no result.
So it's basically a dot product as a measure of how similar these two things are. And so if the activations of channel 1 and channel 2 are similar, let's give an example. Let's say this first one was like how textured are the brush strokes, and this one here was like how diagonally oriented are the brush strokes.
And if both of these were high together, and both of these were high together, then it's basically saying anywhere that there are more textured brush strokes, they tend to be diagonal. Another interesting one is: what would be the dot product of C1 with C1? That would basically be the squared 2-norm, the sum of the squares of that channel.
Which in other words is basically just — let's go back, I screwed this up. Channel 1 might be texture, and channel 2 might be diagonal, and this one here would be cell (1,1), and this cell here would be cell (4,2). What I should have been saying is: if these are both high at the same time, and these are both high at the same time, then it's saying grid cells that have texture tend to also have diagonal strokes.
Sorry, I drew that all wrong. The idea was right, I just drew it all wrong. So this number is going to be high when grid cells that have texture also have diagonal strokes, and when they don't, it won't be. So that's C1 dot product C2. Whereas C1 dot product C1 is effectively the squared 2-norm: the sum of the squares of C1, the sum over i of C1_i squared.
And this is basically saying how in how many grid cells is the textured channel active, and how active is it? So in other words, C1 dot product C1 tells us how much textured painting is going on, and C2 dot product C2 tells us how much diagonal paint strokes is going on.
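Just to make the dot-product idea concrete, here's a toy example with made-up 5x5 activations and 19 channels (nothing to do with real VGG features):

```python
import torch

acts = torch.randn(19, 5, 5)   # pretend layer: 19 channels, each 5x5
c1 = acts[0].flatten()         # channel 1 flattened to a 25-long vector
c2 = acts[1].flatten()         # channel 2 flattened the same way

print(c1 @ c2)    # big when the two channels tend to fire in the same grid cells
print(c1 @ c1)    # sum of squares of channel 1: how active that channel is overall
```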
Maybe C3 is bright colors, so C3 dot product C3 would be how often we have bright-colored cells. So what we could do then is create a matrix containing every one of these: channel 1, channel 2, channel 3, and so on down one side, and the same along the other. And — sorry, it's been a long day — there are 19 channels, so it's 19 by 19, not 25 by 25.
So it's 19 by 19: channel 1 through channel 19 down one side, channel 1 through channel 19 along the other. And so this cell would be the dot product of channel 1 with channel 1, this one the dot product of channel 2 with channel 2, and so forth, after flattening. And like we've discussed, mathematicians have to give everything a name.
So this particular matrix where you flatten something out and then do all the dot products is called a Gram Matrix. And I'll tell you a secret, most deep learning practitioners either don't know or don't remember all these things, like what is a Gram Matrix if they ever did study at university, they probably forgot it because they had a big night afterwards.
And the way it works in practice is like you realize, oh, I could create a kind of non-spatial representation of how the channels correlate with each other, and then when I write up the paper I have to go and ask around and say, does this thing have a name?
And somebody would be like, isn't it a Gram Matrix? And you go and look it up, and it is. So don't think you have to go and study all of math first. You use your intuition and common sense and then you worry about what the math is called later, normally.
Sometimes it works the other way, not with me, because I can't do math. So this is called the Gram Matrix, and of course if you're a real mathematician it's very important that you say this as if you always knew it was a Gram Matrix and you kind of just go, oh yes, we just calculate the Gram Matrix, that's really important.
So the Gram Matrix then is this kind of map of -- the diagonal is perhaps the most interesting. The diagonal is like which channels are the most active, and then the off-diagonal is like which channels tend to appear together. And overall, if two pictures have the same style, then we're expecting that some layer of activations, they will have similar Gram Matrices.
Because if we found the level of activations that capture a lot of stuff about paint strokes and colors and stuff, the diagonal alone might even be enough. And that's another interesting homework assignment if somebody wants to take it: try doing Gatys-style transfer, not using the Gram matrix, but just using the diagonal of the Gram matrix.
And that would be like a single line of code to change, but I haven't seen it tried. I don't know if it would work at all, but it might work fine. Christine -- I'll pass this to Christine. Okay yes, Christine, you've tried it. I was going to say I have tried that, and it works most of the time except when you have funny pictures where you need two styles to appear in the same spot.
So if you have grass in one half and a crowd in one half, and you need the two styles. You still want to do your homework, but Christine says she'll do it for you. So let's do that. So here's our painting. I've tried to resize the painting so it's the same size as my bird picture.
It doesn't matter too much which bit I use, as long as it's got a nice style in it. I grab my optimizer and my random image just like before. And this time I call SaveFeatures for all of my block_ends, and that's going to give me an array of SaveFeatures objects, one for each module that appears just before a max pool.
Because this time I want to play around with different activation layer styles, or more specifically I want to let you play around with it. So now I've got a whole array of them. So now I call my VGG module on my image again. I'm not going to use that yet.
Ignore that line. Style image is my Van Gogh painting. So I take my style image, put it through my transformations to create my transform style image. I turn that into a variable, put it through the forward pass of my VGG module, and now I can go through all of my save features objects and grab each set of features.
And notice I call clone, because later on if I call my VGG object again, it's going to replace those contents. I haven't quite thought about whether this is necessary. If you take it away, it's fine, but I was just being careful. So here's now an array of the activations at every block and layer.
So here you can see all of those shapes. And being able to whip up a list comprehension really quickly is really important in your Jupyter fiddling around, because you really want to be able to immediately see the grid size halving, as we would expect, since all of these appear just before a max pool.
So to do a gram MSE loss, it's going to be the MSE loss on the gram matrix of the input versus the gram matrix of the target. And the gram matrix is just the matrix multiply of x with x transpose, where x is simply equal to my input, where I've flattened the batch and channel axes all down together.
And I've already got one image, so you can kind of ignore the batch part, basically channel, and then everything else, which in this case is the height and width, is the other dimension. So this is now going to be channel by height and width, and then as we discussed we can then just do the matrix multiply of that by its transpose.
And just to normalize it, we'll divide that by the number of elements. It would actually be more elegant if I had said divided by input.numel(); that would be the same thing. And then again, this kind of gave me tiny numbers, so I multiply it by a big number to make it something more sensible.
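Putting that together, the Gram matrix and its MSE loss look essentially like this (a sketch following the description above: flatten batch and channels together, multiply by the transpose, normalize, then scale up):

```python
import torch
import torch.nn.functional as F

def gram(input):
    b, c, h, w = input.size()
    x = input.view(b * c, -1)                        # (batch*channels) x (h*w)
    return torch.mm(x, x.t()) / input.numel() * 1e6  # scale up to avoid tiny fp32 numbers

def gram_mse_loss(input, target):
    return F.mse_loss(gram(input), gram(target))
```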
So that's basically my loss. So now my style loss is to take my image to optimize, throw it through the VGG forward pass, grab an array of the features from all of the SaveFeatures objects, and then call my gram_mse_loss on every one of those layers. And that's going to give me an array.
And then I just add them up. Now you could add them up with different weightings, you could add up a subset, whatever, in this case I'm just grabbing all of them, pass that into my optimizer as before, and here we have a random image in the style of Van Gogh, which I think is kind of cool.
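A minimal sketch of that style loss, using assumed names from the earlier sketches (sfs for the list of SaveFeatures objects, targ_styles for the cloned style activations; these are not the notebook's exact names):

```python
def style_loss(x):
    m_vgg(x)                                     # forward pass purely to fire the hooks
    outs = [sf.features for sf in sfs]
    return sum(gram_mse_loss(o, t) for o, t in zip(outs, targ_styles))
```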
And again, Gatys has done it for us. Here are different layers of a random image optimized towards the style of Van Gogh. And the first one, as you can see — the activations are simple geometric things, not very interesting at all. The later layers are much more interesting. So we have a suspicion that we probably want to use later layers, largely, for our style loss if we want it to look good.
I added this save_features.close, which just calls — remember I stored the hook here — hook.remove to get rid of it, and it's a good idea to get rid of it because otherwise it can just keep using up memory. So at the end I go through each of my SaveFeatures objects and close it.
So style transfer is just adding the two together with some weighting. So there's not much to show: grab my optimizer, grab my image, and now my combined loss is the MSE loss at one particular layer, plus my style loss at all of my layers — sum up the style losses, add them to the content loss, and the content loss I'm scaling too.
Actually, the style loss I scaled already by 1e6, and this one is scaled by 1, 2, 3, 4, 5, 6 zeros as well, so they're both scaled exactly the same. Add them together, and again you could try weighting the different style losses, or you could remove some of them, whatever. So this is the simplest possible version.
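Here's a sketch of that combined loss, again with assumed names (block_idx for the content layer's position in the hooked list, targ_content for the bird's activations at that layer); the relative scaling between the two terms is exactly the thing you'd want to play with.

```python
def comb_loss(x):
    m_vgg(x)                                     # one forward pass fills every hook
    outs = [sf.features for sf in sfs]
    content = F.mse_loss(outs[block_idx], targ_content) * 1e6
    style   = sum(gram_mse_loss(o, t) for o, t in zip(outs, targ_styles))
    return content + style
```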
Train that, and holy shit, it actually looks good. So I think that's pretty awesome. The main takeaway here is if you want to solve something with a neural network, all you've got to do is set up a loss function and then optimize something. The loss function is something which a lower number is something that you're happier with.
Because then when you optimize it, it's going to make that number as low as you can, and that will do what you wanted it to do. So here we came up with a loss function that does a good job of being a smaller number when it looks like the thing we want it to look like, and it looks like the style of the thing we want it to be in the style of.
That's all we had to do. When it actually comes to it, apart from implementing gram_mse_loss, which was like 6 lines of code, that's our loss function; pass it to our optimizer, wait about 5 seconds, and we're done. And remember, we could do a batch of these at a time.
So we could wait 5 seconds and 64 of these will be done. So I think that's really interesting. Once this paper came out, it's really inspired a lot of interesting work. To me though, most of the interesting work hasn't happened yet, because to me the interesting work is the work where you combine human creativity with these kinds of tools.
I haven't seen much in the way of tools that you can download or use where the artist is in control and can do things interactively. It's interesting, talking to the guys at the Google Magenta project, which is their Creative AI project, all of the stuff they're doing with music is specifically about this.
It's building tools that musicians can use to perform in real time. And so you'll see much more of that on the music space thanks to Magenta. If you go to their website, there's all kinds of things where you can press the buttons to change the drum beats or melodies or keys or whatever.
You can definitely see Adobe and Nvidia starting to release little prototypes that have started to do this. This kind of creative AI explosion hasn't happened yet. I think we have pretty much all the technology we need, but no one's put it together into a thing and said look at the thing I built and look at the stuff that people built with my thing.
That's just a huge area of opportunity. The paper that I mentioned at the start of class in passing, the one where we can add Captain America's shield to arbitrary paintings, basically used this technique. The trick was some minor tweaks to make the pasted Captain America shield blend in nicely.
That paper's only a couple of days old, so that would be an interesting project to try. You can use all this code, it really does leverage this approach. You could start by making the content image be like the painting with the shield, and then the style image could be the painting without the shield.
That would be a good start, and then you could kind of see what specific problems they're trying to solve in this paper to make it better. You could have a start on it right now. Let's make a quick start on the next bit, which is, yes, Rachel. I'll say two questions.
Earlier there were a number of people that expressed interest in your thoughts on Pyro and probabilistic programming. So TensorFlow's now got this TensorFlow Probability thing, and there's a bunch of probabilistic programming frameworks out there. I think they're intriguing, but as yet unproven, in the sense that I haven't seen anything done with any probabilistic programming system that hasn't been done better without them.
The basic premise is that it allows you to create more of a model of how you think the world works and then plug in the parameters. Back when I used to work in management consulting 20 years ago, we used to do a lot of stuff where we would use a spreadsheet and then we would have these Monte Carlo simulation plugins.
There was one called @RISK and one called Crystal Ball — I don't know if they still exist decades later — but basically they would let you change a spreadsheet cell to say this is not a specific value, it actually represents a distribution of values with this mean and standard deviation, or it's got this distribution.
And then you would hit a button and the spreadsheet would recalculate a thousand times pulling random numbers from the distributions and show you the distribution of your outcome that might be some profit or market share or whatever, and we used them all the time back then. I partly think that a spreadsheet is a more obvious place to do that kind of work because you can see it all much more naturally, but at this stage I hope it turns out to be useful because I find it very appealing and it kind of appeals to, as I say, the kind of work I used to do a lot of.
There are actually whole practices around this stuff, what used to be called system dynamics, which really was built on top of this kind of thing, but I don't know, it's not quite gone anywhere. Then there was a question about pre-training for generic style transfer. I don't think you can pre-train for a generic style, but you can pre-train for a generic photo for a particular style, which is where we're going to get to, although it may end up being homework — I haven't decided — but I'm going to do all the pieces.
One more question is, "Please ask him to talk about multi-GPU." Oh yeah, I even have a slide about that; I'm about to get to it. Before we do, just another interesting picture from the Gatys paper. They've got a few more that didn't fit on my slide here, but: different convolutional layers for the style, different style-to-content ratios, and here are the different images.
Obviously this isn't Van Gogh anymore, this is a different combination. You can see that if you do all style, you don't see any image; if you do lots of content but use a low enough convolutional layer, it looks okay but the background's kind of dumb; so you kind of want somewhere around here or here.
You can play around with an experiment, but also use the paper to help guide you. I think I might work on the math now, and we'll talk about multi-GPU and super-resolution next week. I think this is from the paper, and one of the things I really do want you to do after we talk about a paper is to read the paper and then ask questions on the forum, anything that's not clear.
But there's kind of a key part of this paper which I wanted to talk about and discuss how to interpret it. So the paper says we're going to be given an input image, x, and this little thing means it's a vector, but this one's a matrix, I guess it could mean either.
So normally a bold lowercase letter means a vector, or a lowercase letter with a little arrow on top means a vector — they can both mean vector — and normally a capital letter means a matrix. In this case, our image is a matrix. We are going to basically treat it as a vector, so maybe we're just getting ahead of ourselves.
So we've got an input image, x, and it can be encoded in a particular layer of the CNN by the filter responses. So the activations, filter responses are activations. So hopefully that's something you all understand, that's basically what a CNN does, is it produces layers of activations. A layer has a bunch of filters which produce a number of channels, and so this here says that layer number L has capital NL filters, and again this capital does not mean matrix.
So I don't know, math notation is so inconsistent. So capital NL distinct filters at layer L, which means it has also that many feature maps. So make sure you can see that this letter is the same as this letter. So you've got to be very careful to read the letters and recognize it's like snap, that's the same letter as that letter.
So obviously N_l filters create N_l feature maps, or channels, and each feature map is of size M_l — okay, so I can see this is where the unrolling is happening. Each map is of size M little l, which is like M[l] in numpy notation.
So m for the lth layer. And the size is height times width, so we flattened it out. So the responses of that layer l can be stored in a matrix F, and now the l goes at the top for some reason. So this is not F to the power of l, this is just another indexing, we're just moving it around for fun.
And this thing here where we say it's an element of R, this is a special R meaning the real numbers n times m, this is saying that the dimensions of this is n by m. So this is really important, you don't move on, it's just like with PyTorch, making sure that you understand the rank and size of your dimensions first.
Same with math, these are the bits where you stop and think, why is it n by m? So n is the number of filters, m is height by width, so do you remember that thing where we did view batch times channel comma minus 1? Here that is. So try to map the code to the math.
So the paper's F corresponds to our x. If I was nicer to you, I would have used the same letters as the paper, but I was too busy getting this damn thing working to do that carefully. So you can go back and rename it as capital F. This is why we moved the l to the top: because we're now going to have some more indexing.
So like where else in NumPy or PyTorch we index things by square brackets and then lots of things with commas between, the approach in math is to surround your letter by little letters all around it, and just throw them up there everywhere. So here fl is the lth layer of f, and then ij is the activation of the i-th filter at position j of layer l.
So position j is up to size m, which is up to size height by width. This is the kind of thing that would be easy to get confused. Like often you'd see an ij and assume that's like indexing into a position of an image like height by width, but it's totally not, is it?
It's indexing into channel by flattened image, and it even tells you it's the i-th filter, the i-th channel in the jth position in the flattened out image in layer l. So you're not going to be able to get any further in the paper unless you understand what f is.
So that's why these are the bits where you stop and make sure you're comfortable. Now the content loss I'm not going to spend much time on, but basically we're going to compare the values of the activations for the generated image against those for the original image, and take the squared differences. So there's our content loss, and the style loss will be much the same thing, but using the Gram matrix G.
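From the Gatys et al. paper, that content loss is written as the squared difference between the generated image's activations F^l and the original image's activations P^l:

```latex
% Content loss from the Gatys et al. style-transfer paper
\mathcal{L}_{\text{content}}(\vec{p}, \vec{x}, l)
    = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}
```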
And I really wanted to show you this one, because sometimes I really like things you can do in math notation — they're things you can also generally do in J and APL — which is this kind of implicit loop going on here. What this is saying is there's a whole bunch of values of i and a whole bunch of values of j, and I've got to define G for all of them.
And there's a whole bunch of values of l as well, and I've got to define g for all of those as well. And so for all of my g at every l, at every i, at every j, it's going to be equal to something. And you can see that something has an i and a j and an l, so matching these, and it also has a k, and that's part of the sum.
So what's going on here? Well it's saying that my Gram matrix in layer l for the i-th channel, well these aren't channels anymore, in the i-th position in one axis, in the j-th position in another axis, is equal to my f matrix, so my flattened out matrix, for the i-th channel in that layer versus the j-th channel in the same layer.
And then I'm going to sum over, see this k and this k, they're the same letter. So we're going to take the k-th position and multiply them together and then add them all up. So that's exactly what we just did before when we calculated our Gram matrix. So there's a lot going on because of some very neat notation, which is there are three implicit loops all going on at the same time, plus one explicit loop in the sum, and then they all work together to create this Gram matrix for every layer.
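Written out, that Gram matrix definition from the paper is just:

```latex
% Gram matrix at layer l: correlations between flattened feature maps i and j
G^{l}_{ij} = \sum_{k} F^{l}_{ik} \, F^{l}_{jk}
```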
So let's go back and see if you can match this. So all that's kind of happening all at once, which I think is pretty great. So that's it. So next week we're going to be looking at a very similar approach, basically doing style transfer all over again, but in a way where we're actually going to train a neural network to do it for us rather than having to do the optimization.
We'll also see that you can do the same thing to do super resolution, and we're also going to go back and revisit some of that SSD stuff as well as doing some segmentation. So if you've forgotten SSD, it might be worth doing a little bit of revision this week.
Thanks everybody. See you next week.