Back to Index

Lesson 8: Cutting Edge Deep Learning for Coders


Chapters

0:00 Intro
3:54 Architecture
5:09 Key Insights
8:05 Technology Foundation Changes
10:28 TensorFlow
13:54 Productionization
16:49 TensorFlow Toolkit
20:11 XLA
26:05 PyTorch vs TensorFlow
31:39 Building a box
36:21 Reading papers
39:29 Writing
41:19 Part 2 Outline
44:41 Part 2 Code
52:11 Part 2 Notebook
54:11 Mendeley Desktop
56:26 Arxiv
57:16 Arxiv Sanity Preserver

Transcript

Some of you have finished Part 1 in the last few days, some of you finished Part 1 in December. I did ask those of you who took it in person to revise the material and make sure it was up to date, but let's do a quick summary of the key things we learned.

So I came up with these five things. I've been interested to hear if anybody has other key insights that they feel they came away with. So the five things are these. Stacks of nonlinear functions with lots of -- well, stacks of differentiable nonlinear functions with lots of parameters solve nearly any predictive modeling problem.

So when we say neural network, a lot of people are suggesting we should use the phrase differentiable network. If you think about things like the collaborative filtering we did, it was really a couple of embeddings and a dot product, and that got us quite a long way; there's nothing very neural looking about that.

But we know that when we stack certain kinds of nonlinear functions on top of each other, the universal approximation theorem tells us they can approximate any computable function to arbitrary precision, and we know that if it's differentiable we can use SGD to find the parameters which match that function. So this to me is kind of like the key insight.

But some stacks of functions are better than others for some kinds of data and some kinds of problems. One way to make life very easy, we learned, is transfer learning. I think nearly every network we created in the last course, we used transfer learning. I think particularly in vision and in text, so pretty much everything.

So transfer learning generally was throw away the last layer, replace it with a new one that has the right number of outputs, pre-compute the penultimate layer's output, then very quickly create a linear model that goes from that to your preferred answer. You now have something that works pretty well, and then you can fine-tune more and more layers backwards as necessary.
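Here is a minimal Keras 2-style sketch of that recipe; the data is made up, and in practice you'd run the usual VGG preprocessing on the images first:

```python
import numpy as np
from keras.models import Model, Sequential
from keras.layers import Dense
from keras.applications.vgg16 import VGG16

# Hypothetical data: a batch of 224x224 RGB images and one-hot labels for 10 classes.
x_train = np.random.rand(32, 224, 224, 3)
y_train = np.eye(10)[np.random.randint(0, 10, 32)]

# Pre-trained network; chop off the final classification layer.
base = VGG16(weights='imagenet', include_top=True)
penultimate = Model(base.input, base.layers[-2].output)

# Pre-compute the penultimate layer's activations once...
features = penultimate.predict(x_train, batch_size=16)

# ...then very quickly fit a linear model from those features to the new labels.
head = Sequential([Dense(10, activation='softmax', input_shape=(features.shape[1],))])
head.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
head.fit(features, y_train, epochs=3)
```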

And we learned that when fine-tuning those additional layers, generally the best way to do it was to pre-compute the output of the last layer which you are not fine-tuning, so that you only have to calculate the weights of the remaining ones, and that saved us lots and lots of time. And remember the tradeoff here:

Convolutional layers are slower, dense layers are bigger, and there's an interesting question I've added here, which is, remember in the last lesson, we kind of looked at ResNets and InceptionNets and in general more modern nets tend not to have any dense layers. So what's the best way to do transfer learning?

I'm going to leave that as an open question for now. We're going to look into it a bit during this class, but it's not a question that anybody has answered to my satisfaction. So I'll suggest some ideas, but no one's even written a paper that attempts to address it as far as I'm aware.

Given we have transfer learning to get us a long way, the next thing we have to get us a long way is to try and create an architecture which suits our problem, both our data and our loss function. So for example, if we have autocorrelated inputs, so in other words, each input is related to the previous input, so each pixel is similar to the next-door pixel, or in a sound wave, each sample is similar to the previous sample, something like that, that kind of data we tend to like to use CNNs for as long as it's of a fixed size, it's a sequence we like to use an RNN for, if it's a categorical output we like to use a softmax for.

So there are ways we learned of tuning our architecture, not so that it makes it possible to solve a problem, because any standard dense network can solve any problem, but it just makes it a lot faster and a lot easier to train if you've made sure that your activation functions and your architecture suit the problem.

So that was another key thing I think we learned. And something I hope that everybody can narrate is the five steps to avoiding overfitting. If you've forgotten them, they're both here and discussed in more detail in lesson 3. Get more data, fake more data using data augmentation, use more generalizable architectures.

Architectures that generalize well, particularly when we look at batch normalization; use regularization techniques, as few as we can, because by definition they destroy some data, but we looked particularly at using dropout. And then finally, if we have to, we can look at reducing the complexity of the architecture. The general approach we learned, and this was absolutely key, is that with a new problem you first of all start with a network that's too big and not regularized, so it can't help but solve the problem, even if it has to overfit terribly.

If you can't do that, there's no point starting to regularize yet. So we start out by trying to overfit terribly. Once we've got to the point that we're getting 100% accuracy and our validation set's terrible because it's overfitting, then we start going through these steps until we get a nice balance.

So that's kind of the process that we learned. And then finally we learned about embeddings as a technique to allow us to use categorical data, and specifically the idea of using words, or the idea of using latent variables. So in this case, this was the movie lens dataset for collaborative filtering.

So those are the five main insights I thought of. Did anybody have any other kind of key takeaways that they think people revising should think about or remember, or things they found interesting? No? Okay, that's good. If you come up with something, let me know. I have one question. How does having duplicates in the training data affect the model created?

And if you're using data augmentation, do you end up with duplicate data? Duplicates in the input data, I mean it's not a big deal, because we shuffle the batch and then you select things randomly, effectively, you're weighting that data point higher than its neighbors. So in a big dataset, it's going to make very little difference.

If you've got one thing repeated 1,000 times and then there's only another 100 data points, that's going to be a big problem because you're weighting one data point 1,000 times higher. So as you will have seen, we've got a couple of big technology foundation changes. The first one is we're moving from Python 2 to Python 3.

Python 2, I thought, was a good place to start, given that a lot of the folks in Part 1 had never coded in Python before and many of them had never written very substantial pieces of software before. And a lot of the tutorials out there, like for example one of our preferred starting points, Learn Python the Hard Way, use Python 2, and a lot of the existing code out there is in Python 2, so we thought Python 2 was a good place to start.

Two more questions. One is, are you going to post the slides after this? I will post the slides, yes. And the other is, could you go through steps for underfitting at some point, how to deal with underfitting? Yeah, let's do that in a forum thread. So why don't you create a forum thread asking about underfitting, but you don't need to do that in the Part 2 forum, you can do that in the main forum because lots of people would be interested in hearing about that.

If you want to revise that, lesson 3 started by talking about underfitting, so that seems like a good place to start. I don't think we should keep using Python 2 though, for a number of reasons. One is that since then the IPython folks have come out and said that the next version won't be compatible with Python 2, so that's a problem.

And from 2020 onwards, Python 2 will be end of life, which means there won't be patches for it. So that's a problem. Also, we're going to be doing more stuff with concurrency and parallel programming this time around, and the features in Python 3 are a lot better. And then Python 3.6 was just released, which has some very nice features, in particular the new string formatting, which for some people is no big deal, but to me it saves a lot of time and makes life a lot easier.
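For example, the 3.6 f-string syntax (a trivial illustration):

```python
name, loss = 'vgg16', 0.123456

# Old-style formatting
print('model %s: loss %.4f' % (name, loss))

# Python 3.6 f-strings: the expression goes right inside the string
print(f'model {name}: loss {loss:.4f}')
```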

So we're going to move across to Python 3, and hopefully you've all gone through the process already. And there are some tips on the forum about how to have both run at the same time, although I agree with the suggestion I had read from somebody which was go ahead, suck it up and do the translation once now so you don't have to worry about it.

Much more interesting and much bigger is the move from Theano to TensorFlow. So Theano, we thought, was a better starting point because it has a much simpler API. There are very few new concepts to learn to understand Theano. And it doesn't have a whole new ecosystem. You see, TensorFlow lives within Google's whole ecosystem.

It has its own build system called Bazel. It's got its own file serialization system called protobuf. It's got its own profiling method based on Chrome. It's got all this stuff to learn. But if you've come this far, then you're already investing the time, and we think it's worth investing the time in TensorFlow, because there's a lot of stuff which, just in the last few weeks, it's become able to do that's pretty amazing.

So Rachel wrote this post about how much TensorFlow sucks, for which we got invited to the TensorFlow Dev Summit and got to meet the TensorFlow core team. And having already been looking at moving from Theano to TensorFlow, we were pretty amazed at all the stuff that's literally just been added.

So TensorFlow 1.0 just came out. And here are some of the things. If you google for TensorFlow Dev Summit videos, you can watch the videos about all this. The most exciting thing for us is that they are really investing in a simplified API. So if you look at this code, you can create a deep neural network regressor on a mixture of categorical and real variables using an almost R-like syntax and fit it in two lines of code.

You'll see that those lines of code at the bottom, the two lines to fit it, look very much like Keras. The Keras author has been a wonderful influence on Google, and in fact everywhere we looked at the Dev Summit we saw Keras API influences. So TensorFlow and Keras are kind of becoming more and more one, which is terrific.
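Roughly what that estimator-style code looked like around TensorFlow 1.0, simplified here to a single made-up real-valued column (the names are from tf.contrib.learn as it existed at the time):

```python
import numpy as np
import tensorflow as tf

# Made-up data: one real-valued input feature.
x = np.random.rand(100, 1).astype(np.float32)
y = (x * 3 + 1).ravel()

feature_cols = [tf.contrib.layers.real_valued_column("", dimension=1)]
est = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols, hidden_units=[10, 10])

# The two Keras-like lines that fit and evaluate the model.
est.fit(x=x, y=y, steps=200)
print(est.evaluate(x=x, y=y, steps=1))
```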

So one is that they're really investing in the API. The second is that some of the tooling is looking pretty good. So TensorBoard has come a long way. Things like these graphs showing you how your different layers are distributed and how that's changed over time can really help to debug what's going on.

So if you get some kind of gradient saturation in a layer, you can dig through these graphs and very quickly find out where. This was one of my favorite talks, actually. This guy, if I remember correctly, his name was Daffodil and his signature was an emoji of a daffodil, very Google.

If you watch this video, he kind of walks through showing some of the functionality that's there and how to use it, and I thought that was pretty helpful. One of the most important ones to me is that TensorFlow has a great story about productionization. For Part 1, I didn't much care about productionization.

It was really about playing around, what can we learn. At this point, I think we might be starting to think about how do I get my stuff online in front of my customers. These points are talking about something in particular which is called TensorFlow Serving. And TensorFlow Serving is a system that can take your trained TensorFlow model and create an API for it which does some pretty cool things.

For example, think about how hard it would be without the help of some library to productionize your system. You've got one request coming in at a time. You've got n GPUs. How do you make sure that you don't saturate all those GPUs, that you send the request to one that's free, that you don't use up all of your memory.

Better still, how do you grab a few requests, put them into a batch, put them all into the GPU at once, get the individual results back out of the batch, and send them back to the people that requested them, all that stuff. So Serving does that for you. It's very early days for this software, a lot of things don't work yet, but you can download an early version and start playing with it, and I think that's pretty interesting.

With the high-level API in TensorFlow, what's going to be the difference between the Keras API and the TensorFlow API? Yeah, that's a great question. In fact, tf.keras will become a namespace. So Keras will become the official top-level API for TensorFlow, and in fact Rachel was the person who announced that.

I was just going to add that TensorFlow is kind of introducing a few different libraries at different layers, different levels of abstraction. There's this concept of an evaluation API that appears everywhere and basically is the Keras API. I think there's a layers API below the Keras API. So it's being mixed in lots of places.

So all the stuff you've learned about Keras is going to be very helpful, not just in using Keras on TensorFlow, but in using TensorFlow directly. Another interesting thing about TensorFlow is that they've built a lot of cool integrations with various cluster managers and distributed storage systems and stuff like that.

So it will kind of fit into your production systems more neatly, use the data in whatever place it already is more neatly, so if your data is in S3 or something like that, you can generally throw it straight into TensorFlow. Something I found very interesting is that they announced a couple of weeks ago a machine learning toolkit which brings really high-quality implementations of a wide variety of non-deep learning algorithms in TensorFlow.

So all these are GPU-accelerated, parallelized, and supported by Google. And a lot of these have a lot of tech behind them. For example, the random forest, there's a paper, they actually call it the Tensor Forest, which explains all of the interesting things they did to create a fast GPU-accelerated random forest.

Two more questions. Will you give an example of how to solve gradient saturation with the TensorFlow tools? I'm not sure that I will. We'll see how we go, because I think the video from the Dev Summit, which is available online, kind of already shows you that. So I would say look at that first and see if you still have questions.

All the videos from the Dev Summit are online. Is there an idea for using deep learning on AWS Lambda? Not that I've heard of. In fact, Google has a hosted service version of TensorFlow Serving called Google Cloud ML, where you can pay them a few cents a transaction and they'll host your model for you.

There isn't really something like that from Amazon as far as I'm aware. And then finally in terms of TensorFlow, I had an interesting and infuriating few weeks trying to prepare for this class and trying to get something working that would translate French into English. And every single example I found online had major problems.

Even the official TensorFlow tutorial missed out a key thing, which is that the lowest level of a language model really should be a bi-directional RNN, as this one shows, and theirs wasn't. I tried to figure out how to make it work and it was horrible; I tried to get it to work in Keras, and nothing worked properly.

Finally, basically the issue is this: modern RNN systems like a full neural translation system involve a lot of tweaking and mucking around with the innards of the RNN using things that we'll learn about. And there just hasn't been an API that really lets that happen. So I finally got it working by switching to PyTorch, which we'll learn about soon. I was actually going to start with neural translation as the first lesson, but I've pushed it back because TensorFlow has just released a new system for RNNs which looks like it's going to make all this a lot easier.

So the exciting idea is that there's an API that allows us to create some pretty powerful RNN implementations, and we're going to be absolutely needing that when we learn to create translations. Oh yes, there is one more. Again, early days, but there is something called XLA, which is the Accelerated Linear Algebra compiler, I think, which is a system that takes TensorFlow code and compiles it.

And so for those of you that know something about compilers, you know that a compiler can do a lot of clever stuff in terms of identifying dead code or unrolling loops or fusing operations or whatever. XLA tries to do all that. At this stage, it takes your TensorFlow code and turns it into machine code.

One of the cool things that lets you do is run it on a mobile phone with almost no supporting libraries using native machine instructions on that phone, much less memory. But one of the really interesting discussions I had at the summit was with Scott Gray, who some of you may have heard of.

He was the guy that wrote the blazingly fast neural network kernels when he was at Nervana. He had kernels that were two or three times faster than Nvidia's kernels. I don't know of anybody else in the world who knows more about neural network performance than him. He told me that he thinks that XLA is the key to creating performant, concise, expressive neural network code.

And I really like that idea. The idea is currently, if you look in the TensorFlow code, it's thousands and thousands of lines of C++, all custom written. The idea is you throw all that away and replace it with a small number of lines of TensorFlow code that get compiled through XLA.

So that's something that's actually got me pretty excited. So TensorFlow is pretty interesting. Having said that, it's kind of hideous. The API is full of not-invented-here syndrome. It's clearly written by a bunch of engineers who have not necessarily spent that much time learning about the user interface of APIs.

It's full of these Googleisms in terms of having to fit into their ecosystem. But most importantly, like Theano, you have to set up the whole computation graph and then you kind of go run, which means that if you want to do stuff in your computation graph that involves like conditionals, if-then statements, if this happens, you do this other part of the loop, it's basically impossible.

It turns out that there's a very different way of programming neural nets, which is dynamic computation, otherwise known as define-by-run. There are a number of libraries that do this: Torch, PyTorch, Chainer, DyNet are the ones that come to mind. And we're going to be looking at one whose early version was put out about a month ago, called PyTorch, which I've started rewriting a lot of stuff in, and a lot of the more complex stuff just suddenly becomes so much easier.

And because it becomes easier to do more complex things, I often find I can create faster and more concise code by using this approach. So even though PyTorch is very, very, very new, it is coming out of the same people that built Torch, which really all of Facebook's systems build on top of.

I suspect that Facebook are in the process of moving across from Torch to PyTorch. It's already full of incredibly useful stuff, as you'll see. So we will be using increasingly more and more PyTorch during this course. There was a question, "Does precompiling mean that we'll write TensorFlow code and test it, and then when we train a big model, we precompile the code and train our model?" Yeah, so if we're talking about XLA, XLA can be used a number of ways.

One is that you come up with some different kind of kernels, a different kind of factorization, something like that. You write it in TensorFlow, you compile it with XLA, and then you make it available to anybody so when they use your layer, they're getting this compiled optimized code. It could mean that when you use TensorFlow serving, TensorFlow serving might compile your code using XLA and be serving up an accelerated version of it.

One example which came up was for RNNs. As you'll learn, RNNs nowadays often involve some kind of complex customization: a bidirectional layer, then some stacked layers, an attention layer, and then a separate stacked decoder. You can fuse all of that together into a single layer called bidirectional attention sequence-to-sequence, which indeed Google have actually built that kind of thing.

There's various ways in which neural network compilation can be very helpful. What is the relationship between TensorFlow and PyTorch? There's no relationship, so TensorFlow is Google's thing, PyTorch is I guess it's kind of Facebook's thing, but it's also very much a community thing. TensorFlow is a huge complex beast of a system which uses all kinds of advanced software engineering methods all over the place.

In theory, that ought to make it terribly fast. In practice, a recent benchmark actually showed it to be about the slowest, and I think the reason is because it's so big and complex, it's so hard to get everything to work together. In theory, PyTorch ought to be the slowest because this defined by run system means it's way less optimization that the systems can do, but it turned out to be amongst the fastest because it's so easy to write code, it's so much easier to write good code.

It's interesting, I think they're such different approaches. I think it's going to be great to know both, because there are going to be some things that are fantastic in TensorFlow and some things that are fantastic in PyTorch. They couldn't be more different, which is why I think they're two good things to learn.

So wrapping up this introductory part, I wanted to kind of change your expectations about how you've learned so far to how you're going to learn in the future. Part 1 to me was about showing you best practices. So generally it's like, here's a library, here's a problem, you use this library in these steps to solve this problem, and you do it this way, and lo and behold we've got into the top ten of this Kaggle competition.

I tried to select things that had best practices. So you now know everything I know about best practices. I don't really have anything else to tell you. So we're now up to stuff I haven't quite figured out yet, nor is anybody else, but you probably need to know. So some of it, for example, like neural translation, that's an example of something that is solved.

Google solved it, but they haven't released the way they solved it. So the rest of us are trying to put everything together and figure out how to make something work as well as Google made that work. More often it's going to be, here's a sequence of things you can do that can get some pretty good results here, but there's a thousand things you could do to make it better that no one's tried yet, so that's interesting.

Or thirdly, here's a sequence of things that solves this pretty well, but gosh we wrote a lot of custom code there, didn't we? I'm sure this could be abstracted really nicely, but no one's done that yet. So they're kind of the three main categories. So generally at the end of each class it won't be like, okay, that's it, that's how you do this thing.

It'll be more like, here are the things you can explore. And so the homework will be pick one of these interesting things and dig into it, and generally speaking that homework will get you to a point that probably no one's done before, or at least probably no one's written down before.

I found as I built this, I think nearly every single piece of code I'm presenting, I was unable to find anything online which did that thing correctly. There was often example code that claimed to be something like that, but again and again I found it was missing huge pieces.

And we'll talk about some of the things that it was missing as we go, but one very common one was it would only work on a single item at a time, it wouldn't work with a batch. Therefore the GPU is basically totally wasted. Or it failed to get anywhere near the performance that was claimed in the paper that it was going to be based on.

So generally speaking there's going to be lots of opportunities if you're interested to write a little blog post about the things you tried and what worked and what didn't, and you'll generally find that there's no other post like that out there. Particularly if you pick a dataset that's in your domain area, it's very unlikely that somebody's written it.

Going back, can we use TensorFlow and Torch together? Don't say torch, say PyTorch. Torch is very similar, but it's written in Lua, which is a very small embedded language. Very good for what it is, but not very good for what we want to do. So PyTorch is kind of a port of Torch into Python, which is pretty cool.

So can you use them together? Yeah, sure, we'll kind of see a bit of that. In general, you can do a few steps with TensorFlow to get to a certain point, and then a few more steps with PyTorch. You can't integrate them into the same network, because they're very different approaches, but you can certainly solve a problem with the two of them together.

So for those of you who have some money left over, I would strongly suggest building a box. And the reason I suggest building a box is because you're paying 90 cents an hour for a P2. I know a lot of you are spending a couple of hundred bucks a month on AWS bills.

Here is a box that costs $550 and will be about twice as fast as a P2. So it's just not good value to use a P2, and it's way slower than it needs to be. And also building a box, it's one of the many things that's just good to learn, is understanding how everything fits together.

So I've got some suggestions here about what box to build for various different budgets. You certainly don't have to, but this is my recommendation. A couple of points from me. More RAM helps more than I think people who discuss this stuff online quite appreciate. 12GB of RAM means twice as big batch sizes, which means half as many steps necessary to go through an epoch.

That means more stable gradients, which means you can use higher learning rates. So more RAM I think is often under-appreciated. The Titan X is the card that has 12GB RAM. It is a lot more expensive, but you can get the previous generation's version secondhand, it's called the Maxwell. So there's a Titan X Pascal, which is the current one, or the Titan X Maxwell, which is the previous generation one.

The previous generation one is not a big step back at all, it still has 12GB RAM. If you can get one used that would be a great option. The GTX 1080 and 1070 are absolutely fantastic as well. They're nearly as good as the Titan X, but they just have 8GB rather than 12GB.

Going back to a GTX 980, which is the previous generation's top-end consumer card, you halve the RAM again. So of all the places you're going to spend money on a box, put nearly all of it into the GPU. Every one of these steps, the 1070, the 1080, the Titan X Pascal, is a big step up.

And as you will have seen from Part 1, if you've got more RAM, it really helps because you can pre-compute more stuff and keep it in RAM. Having said that, there's a new kind of hard drive, an NVMe drive, non-volatile memory. NVMe drives are quite extraordinary. They're not that far away from RAM-like speeds, but they're hard drives.

They're persistent. You have to get a special kind of motherboard, but if you can afford it, it's going to be like $400 or $500 to get an NVMe drive. That's going to really allow you to put all of your currently used data on that drive and access it very, very quickly.

So that's my other tip. Doesn't the batch size also depend heavily on the video RAM? That's what I was referring to with the 12GB; I'm talking about the RAM that's on the GPU. Does upgrading RAM allow bigger batch sizes? Upgrading the video card's RAM would, but you can't upgrade the RAM on the card.

You buy a card that has X amount of RAM, so Titan X has 12, GTX 1080, 8, GTX 980, 4, so that's on the card. Upgrading the amount of RAM that's in your computer doesn't change your batch size, it just changes the amount you can pre-compute unless you use an NVMe drive, in which case RAM is much less important.

You don't have to plug everything in yourself. You can go to Central Computers, which is a San Francisco computer shop, for example, and they'll put it all together for you. There's a fantastic thread on the forums; Brendan, one of the participants in the course, has a great Medium post explaining his whole journey to getting something built and set up.

So there's lots of stuff there to help you. Alright, it's time to build your box, and while you wait for things to install, it's time to start reading papers. Papers, if you're a philosophy graduate like me, are terrifying. They look like the Theorem 4.1 and Corollary 4.2 on the left, but that is an extract from the Adam paper, and you all know how to do Adam in Microsoft Excel.

It's amazing how most papers manage to make simple things incredibly complex. And a lot of that is because academics need to show other academics how worthy they are of a conference spot, which means showing off all their fancy math skills. So if you really need a proof of the convergence of your optimizer rather than just running it and see if it works, you can study Theorem 4.1 and Corollary 4.2 and blah blah blah.

In general though, the way philosophy graduates read papers is to read the abstract, find out what problem they're solving, read the introduction to learn more about that problem and how previous people have tackled it, jump to the bit at the end called Experiments to see how well the thing works.

If it works really well, jump back to the bit which has the pseudocode in it and try to get that to work, ideally having found in the meantime that somebody else has written a blog post in simple English, like this example with Adam. So don't be disheartened when you start reading deep learning papers. Even with a math background they can still be terrifying, believe it or not; Rachel's a PhD in math and she still finds them terrifying.

Yeah, they still feel disheartened frequently. Rachel was complaining about a paper just today in fact. You will learn to read the papers. The other thing I'll say is that you'll even see now, there will be a bit that's like, and then we use a softmax layer and there will be the equation for a softmax layer.

You'll look at the equation like, what the hell, and then it's like, oh, I already know what a softmax layer is. And then we'll use an LSTM. Literally still in every paper, they write the damn LSTM equations as if that's any help to anybody. But okay, it adds more Greek symbols, so be it.

Talking of Greek symbols, it's very hard to read and remember things that you can't pronounce, so if you don't know how to read the Greek letters, google the Greek alphabet and learn how to say them. It's just so much easier when you can look at an equation and, rather than going squiggle something, squiggle something, you can say alpha something and beta something.

I know it's a small little thing, but it does make a big difference. So we are all there to help each other read papers. The reason we need to read papers is because as of now, a lot of the things we're doing only exist in very recent paper form.

Okay, so I really think writing is a good idea. In fact, all of your projects I hope will end up in at least one blog. If you don't have a blog, medium.com is a great place to write. We would love to feature your work on fast.ai, so tell us about what you create.

We're very keen for more people to get into the deep learning community. When you write this stuff, say hey, this is some stuff based on this course I'm doing, and here's what I've learned, and here's what I've tried, and here's what I found out. Put the code on GitHub, it's amazing.

Like even us putting our little AWS setup scripts on GitHub for the MOOC, Rachel had a dozen pull requests within a week with all kinds of little tidbits of like, oh, if you're on this version of Mac, this helps this bit, or I've abstracted this out to make it work in Ireland as well as in America, and so on, so there's lots of stuff that you can do.

I think the most important tip here is don't wait to be perfect before you start writing. What was that tip you told me, Rachel? You should think of your target audience as the person who's one step behind you, so maybe your target audience is someone that's just working through the part one MOOC right now.

So your target audience is not Geoffrey Hinton. Exactly, it's you six months ago. So write the thing that you would love to have seen, because there will be far more people in that target audience than in the Geoffrey Hinton target audience. How are we going for time, Rachel? 7:45, so this might be a good time for a break.

Let's just get through this and then we can get on to the interesting stuff. I've tried to lay out what I think we'll study in part two. As I say, what I was planning until quite recently to present today was neural translation, and then two things happened. Google suddenly came up with a much better RNN and sequence-to-sequence API, and then also two or three weeks ago a new paper came out for generative models which totally changed everything.

So that's why we've redone things and we're starting with CNN generative models today. We have a question, where to find the current research papers? Okay, we'll get to that for sure. Assuming that things go as planned, the general topic areas in part two will be CNNs and NLP beyond classification.

If you think about it, pretty much everything we did in part one was classification or a little bit of regression. We're going to now be talking more about generative models. It's a little hard to exactly define what I mean by generative models, but we're talking about creating an image, or creating a sentence, we're creating bigger outputs.

So CNNs beyond classification: generative models for CNNs means the thing we produce could be a picture showing this is where the bicycle is, this is where the person is, this is where the grass is, that's called segmentation; or it could be taking a black and white image and turning it into a colour image, or taking a low-res image and turning it into a high-res image, or taking a photo and turning it into a Van Gogh, or taking a photo and turning it into a sentence describing it.

NLP beyond classification can be taking an English sentence and turning it into French, or taking an English story and a question and turning it into an answer of that question about that story, that's chatbots in Q&A. We'll be talking about how to deal with larger datasets, so that both means datasets with more things in it, and datasets where the things are bigger.

And then finally, something I'm pretty excited about is that I've done a lot of work recently finding some interesting stuff about using deep learning for structured data and for time series. For example, we heard about fraud: fraud is both of those things, it combines time series, like transaction and click histories, and structured data, like customer information.

Traditionally that's not been tackled with deep learning, but I've actually found some state-of-the-art, world-class approaches to solving those with deep learning, so I'm really looking forward to sharing that with you. So let's take an 8-minute break, come back at 5 to 8, thanks very much. So we're going to learn about this idea of artistic style, or neural style transfer.

The idea is that we're going to take a photo and make it look like it was painted in the style of some painter. Our inputs are a photo and a style image. And these two things are going to be combined together to create an image which is going to hopefully have the content of the photo and the style of the style image.

The way we're going to do this is we're going to assume that there is some function where the inputs to this function are the photo, the style image, and some generated image that I've created. And that will return some number where this function will be higher if the generated image really looks like this photo in this style and lower if it doesn't.

So if we can create this loss function that basically says, here's my generated image, and it returns back a number saying, oh yes, that generated image does look like that photo in that style, then we could use SGD. And we would use SGD not to optimize the weights of a network, we would use SGD to optimize the pixel values of the generated image.

So we would be using it to try to optimize the value of this argument. So we haven't quite done that before, but conceptually it's identical. Conceptually we can just find the derivative of this function with respect to this input. And then we can try and optimize that input, which is just a set of pixel values, to try and maximize the function.

So all we need to do is come up with a function which will tell us how much does some generated image look like this photo in this style. And the way we're going to do that, step 1, is going to be very simple. We're going to turn it into two functions, f-content, which will take the photo and the generated image, and that will tell us a bigger number if the generated image looks more like the photo, if the content looks the same.

And then there will be a second function, which takes the style image and the generated image, and that will tell us a higher number if this generated image looks like it was painted in the same style as the style image. So we can just turn it into two pieces and add them together.
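Here is a minimal sketch of that mechanic using the Keras backend, with toy stand-ins for f-content and f-style (the real versions, developed below, compare VGG activations rather than raw pixels; every name and shape here is a placeholder):

```python
import numpy as np
from keras import backend as K

shape = (1, 32, 32, 3)                      # tiny made-up image size
photo = K.variable(np.random.rand(*shape))  # the content photo
style = K.variable(np.random.rand(*shape))  # the style image
gen   = K.placeholder(shape)                # the generated image we will optimize

# Toy stand-ins: higher = "more like the photo" / "more like the style".
f_content = -K.mean(K.square(gen - photo))
f_style   = -K.square(K.mean(gen) - K.mean(style))
score = f_content + f_style

# The gradient is taken with respect to the generated image's pixels, not any network weights.
grads = K.gradients(score, gen)
step = K.function([gen], [score] + grads)

x = np.random.rand(*shape).astype('float32')  # start from random noise
for i in range(100):
    s, g = step([x])
    x += 0.1 * g                              # simple gradient ascent on the pixels
```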

So now we need to come up with these two parts. Now the first part is very easy. What's a way that we could create a function that returns a higher number if the generated image is more similar to some photo? When you come up with a loss function, the really obvious one is the values of the pixels.

The values of the pixels in the generated image, the mean squared error between them and the photo, that mean squared error loss function would be one way of doing this part. The problem with that though is that as I start to turn it into a Van Gogh, those pixel values are going to change.

They're going to change color, because the Van Gogh might have been a very blue-looking Van Gogh. They'll change their relationships to each other, so what used to be a straight line might become a curve. So really the pixel-wise mean squared error is not going to give us much freedom in trying to create something that still looks like the photo.

So here's an idea: instead, let's look not at the pixels, but let's take those pixels and stick them through a pre-trained CNN like VGG, and let's look at the 4th or 5th or 8th convolutional layer's activations. Remember back to those Zeiler and Fergus visualizations where we saw that the later layers kind of said how much does an eyeball look like here, or how much does this look like a star, or how much does this look like the fur of a dog.

The later layers were dealing with bigger objects and more semantic concepts. So if we were to use a later layer's activations as our loss function, then we could really change the style and the color and all kinds of stuff and really would be saying does the eye still look like an eye, does the beak still look like a beak, does the rock still look like a rock.

And if the answer is yes, then OK, that's good, this is something that matches in terms of the meaning of the content even though the pixels look very different. And so that's exactly what we're going to do. So for f-content, we're going to say that's just the VGG activations of some convolutional layer.

Which one? We can try some. So that's actually enough for us to get started. Let's try and build something that optimizes pixels using a loss function based on some convolutional layer of the VGG network. So this is the neural style notebook. And much of what we're going to look at is going to look very similar.

The first thing you'll see which doesn't look familiar is this thing called limit_mem. Remember, you can always see the source code for something by putting two question marks after it. limit_mem is just these three lines of code, which I notice somebody has already pasted in the forum.
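For reference, it's something very close to this (shown here with the imports it needs):

```python
import tensorflow as tf
from keras import backend as K

def limit_mem():
    # Tell TensorFlow to allocate GPU memory as needed instead of grabbing it all up front.
    cfg = tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=cfg))

limit_mem()
```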

One of the many things I dislike about TensorFlow for our kind of work is that all of the defaults are production defaults. So one of the defaults is it will use up all of your memory on all of your graphics cards. So I'm currently running this on a server with four graphics cards, which I'm meant to be sharing with my colleagues at the university here.

If every time I run a notebook, nobody else can use any of the graphics cards, they're going to be really pissed. And this nice little gig I have of running these little classes is going to disappear very quickly. So I need to make sure I run limit mem very soon as soon as I start running a notebook.

Honestly I think this is a poor choice by the TensorFlow authors because somebody putting something in production is going to be taking time to optimize things. I don't give a shit about the defaults. Somebody who's hacking something together to quickly see if they can get something working very much wants nice defaults.

So this is like one of the many places where TensorFlow makes some odd little annoying decisions. But anyway, every time I create a new notebook, I copy this line in and make sure I run it and so this does not use up all of your memory. So I've got a link to the paper that we're looking at, and indeed we can open it.

And now is a good time to talk about how helpful it is to use some kind of paper-reading system. I really like this one, it's free, it's called Mendeley Desktop. Mendeley lets you, as you find papers, save them into a folder on your computer. Mendeley will automatically watch that folder; any PDF that appears there gets added to your library. And it's really quite cool, because what it then does is find the arXiv ID, and then you can click this little button here and it will go to arXiv and grab all of the information, such as the abstract and so forth, and fill it out for you.

And so this is really great, because now any time I want to find out what I've read that has anything to do with style, I can type style and up come all of the papers. Believe me, after a long time of reading papers without something like this, it basically goes in one ear and out the other; literally I've re-read papers a year later and at the end realized I'd read them before, not remembering anything else about them except that I'd read them before. Whereas this way I really find that my knowledge builds.

As I find references, I'm immediately there looking at the references. The other thing you can do is that as you start reading the paper, as you can see, my notes and highlights are saved, and they're also duplicated on my mobile devices and my other computers and they're all synced up, it's really cool.

So talking about arXiv, this is a great time to answer a question we had earlier about how you find papers. The vast, vast, vast majority of deep learning papers get put up on arxiv.org for a long, long time before they're in any journal or conference. So if you wait until they're in conference proceedings, you're many, many months or maybe even a year behind.

So pretty much everybody uses arXiv. You can go to the AI section of arXiv and see what's there, but that's not really what anybody does. What everybody does instead is Arxiv Sanity, the Arxiv Sanity Preserver. This is something that the wonderful Andrej Karpathy built, and what it lets you do is create a library of articles that somebody tells you to read, or that you're interested in, or that you come across; and as you build that library by clicking this little save button, it then recommends more papers like it.

Or even once you start reading a paper, you go Show Similar, and it will then show you other papers that are similar to this paper and it seems to do a pretty damn good job of it. So you can really explore and get lost in that whole area. So that's one great way to do it.

And then as you do that, you'll find that if you go to arXiv, one of the buttons it has is a bookmark-in-Mendeley button. So even from the abstract here, bang, straight into your library, and the next time you load up Mendeley, it's all there. And then you can put things into folders, so for the different parts of the course I've created folders, and I kind of keep track of what I'm reading that way.

A good little trick to know about arxiv.org is that you often want to know when a paper is from, and if you go to the first page, on the left-hand side you can see the date. Another handy tip is that in the file name, the first four digits are the year and month for that file. So there's a couple of handy little tips.

As well as Arxiv Sanity, another really great place for finding papers is Twitter. Now if you haven't really used Twitter before, or haven't really used Twitter for this purpose before, it's hard to know where to start. So I try to make things easy for people by favoriting lots of the interesting deep learning papers that I come across.

So if you go to Jeremy P. Howard's page and click on Likes, you'll find that there is a thousand links to papers here, and as you can see, there's generally a few every day. That's useful for a number of reasons. One is to get some ideas and papers to read, but perhaps more importantly is to see who's posting these cool links.

And then you can follow them as well. Rachel, can you throw the box to that gentleman? It's not a question, it's just information about arXiv: someone has built a skill for Amazon Alexa where you can ask Alexa for the most recent papers from arXiv, and she reads the abstracts to you, and you can filter the papers.

The other place which I find extremely helpful is Reddit's machine learning subreddit. Again, there's a lot less that goes through Reddit than goes through Twitter, but generally the really interesting things tend to turn up there, and you can often see the discussions of them. For example, there was a great discussion of PyTorch versus TensorFlow in the last day or two. So there's a couple of good places to get started.

Anything I missed, Rachel? I think that's good. I have two questions on the image stuff, when you go back to style. Okay, I'm ready. One of them was whether the app Prisma is using something like this. Yes, Prisma is using exactly this. And the other is, is it better to calculate f-content from a higher layer of VGG and use a lower layer for f-style, since the more abstract concepts are captured in the higher layers and the lower layers capture textures?

Probably. Let's try it, shall we? We haven't learned about f-style yet, so we're just going to look at f-content first. Okay, so I've got some more links to some things you can look at here in the notebook. The data I've linked to in the lesson thread on the forum; I've just grabbed a random sample of about 20,000 ImageNet images, and I've also put them into bcolz arrays.

So you can set up your paths appropriately. I haven't given you this pickle. You can figure out how to get the file names easily enough, so I'm not going to do everything for you. I've grabbed a little one of those pictures. Thank you to the person who's posting all the other stuff to pip install, that's very helpful.

So this is going to be our content image. Given that we're using VGG, as per usual we're going to have to subtract the mean pixel value from ImageNet and reverse the channel order, because of course that's what the original VGG authors did. So we're going to create an array from the image by just running it through that pre-processing function.

Later on, we're going to be running things through a network and generating images. Those generated images we're going to have to add back on that mean and undo that reordering, so this is what this de-processing function is going to be for. Now I've kind of hand-waved over these functions before and how they work, but I'm going to stop hand-waving for a moment because it's actually quite interesting.
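Roughly what those two functions look like (the mean values are the standard ImageNet channel means used by the VGG authors; the helper names are just illustrative):

```python
import numpy as np

# ImageNet channel means (RGB order) used by the original VGG authors.
rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)

# x has shape (batch, height, width, channels): subtract the mean pixel
# (broadcasting does the work) and reverse the channel order RGB -> BGR.
preproc = lambda x: (x - rn_mean)[:, :, :, ::-1]

# Undo it for generated images: reverse the channels back, add the mean, clip to valid pixel values.
deproc = lambda x: np.clip(x[:, :, :, ::-1] + rn_mean, 0, 255)
```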

Have you ever thought about how it is that we're able to take x, which is a 4-dimensional tensor, batch size by height by width by channels (notice this is not the same as Theano; Theano was batch size by channels by height by width, and we're not doing that anymore), and subtract from it a vector?

How are we doing that? How is it making that work? And the way it's making that work is because it's doing something called broadcasting. Broadcasting refers to any kind of operation where you have arrays or tensors of different dimensions and you do element-wise operations on two tensors of different dimensions.

And how does that work? This idea actually goes back to the early 1960s to an amazing programming language called APL. APL stands for A Programming Language. APL was written by an extraordinary person called Kenneth Iverson. Originally APL was a paper describing a new mathematical notation, and this new mathematical notation was designed to be more flexible and far more precise than traditional mathematical notation.

And he then went on to create a programming language that implemented this mathematical notation. APL refers to the notation, which he described as notation as a tool for thought. He really, unlike the TensorFlow authors, understood the importance of a good API. He recognized that the mathematical notation can change how you think about math, and so he created a notation which is incredibly expressive.

His son now has gone on to carry the torch and he now continues to support a direct descendant of APL, which is called J. So if you ever want to find, I think, the most elegant programming language in the world, you can go to Jsoftware.com and check this out.

Now, how many of you here have used regular expressions? How many of you, the first time you looked at a complex regular expression thought, that is totally intuitive? You will feel the same way about J. The first time that you look at a piece of J, you'll go, what the bloody hell?

Because it's an even more expressive and a much older language than regular expressions. Here's an example of a line of J. But what's going on here is that this is a language which at its heart almost never requires you to write a single loop because it does everything with multidimensional tensors and broadcasting.

So everything we're going to learn about today with broadcasting is a very diluted, simplified version of what APL created in the early 60s, which is not to say anything rude about Python's implementation; it's one of the best. J and APL totally blow it away. If you want to really expand your brain and have fun, check out J.

In the meantime, what does Keras/Theano/TensorFlow broadcasting look like? Let's look at some examples. Here is a vector, a one-dimensional tensor, minus a scalar. That makes perfect sense that you can subtract a scalar from a one-dimensional tensor. But what is it actually doing? What it's actually doing is it's taking this 2 and it's replicating it 3 times.

So this is actually element-wise, 1, 2, 3, minus 2, 2, 2. It has broadcasted the scalar across the 3-element vector 1, 2, 3. So there's our first example of broadcasting. In general, broadcasting has a very specific set of rules, which is this. You can take two tensors and you first of all take the shorter tensor, the tensor of less dimensions, and prepend unit axes to the front.

What do I mean when I say prepend unit axes? Here's an example of prepending unit axes. Take the vector 2, 3 and prepend 3 unit axes on the front. It is now a four-dimensional tensor of shape 1, 1, 1, 2. So if you turn a row into a column, you're adding one unit axis.

If you're then turning it into a single slice, you're adding another unit axis. So you can always make something into a higher dimensionality by adding unit axes. So when you broadcast, it takes the thing with fewer dimensions and prepends unit axes to the front. And then what it does is, let's take this first example: it has taken this thing which has no axes, it's a scalar, and turned it into a vector of length 1.

And then what it does is it finds anything which is of length 1 and duplicates it enough times so that it matches the other thing. So here we have something which is a four-dimensional tensor of size 5, 1, 3, 2. So it's got 2 columns, 3 rows, 1 slice and 5 tubes.

And then we're going to subtract from it a vector of length 2. So remember from our definition, it's then going to automatically reshape this by prepending unit axes until it's the same length. And then it's going to copy this thing 3 times, this thing 1 time and this thing 5 times.

So the shape is 5, 1, 3, 2, and it's going to subtract this vector from every row, every slice, every cube. You can play around with these little broadcasting examples and try to get a real feel for how to make broadcasting work for you. So in this case, we were able to take a four-dimensional tensor and subtract from it a three-element vector, knowing that it is going to copy that vector of channel values to every row, every column, every batch.
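A couple of those examples in NumPy (the arrays are made up, but the shapes match the ones discussed above):

```python
import numpy as np

v = np.array([1, 2, 3])
print(v - 2)                          # [-1  0  1]: the scalar 2 is broadcast across the vector

x = np.ones((5, 1, 3, 2))             # a 4-dimensional tensor
w = np.array([10, 20])                # shape (2,), treated as (1, 1, 1, 2)
print((x - w).shape)                  # (5, 1, 3, 2): w is copied across every row, slice and "tube"

# Prepending unit axes by hand, which is what broadcasting does for you implicitly:
print(w[None, None, None, :].shape)   # (1, 1, 1, 2)
```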

So in the end, it's just done what we mean: it subtracted the mean average of the channels from all of the images the way we wanted it to. And it's been amazing how often I've taken code that I've downloaded off the internet and made it 10 or 20 times smaller in terms of lines of code just by using lots of broadcasting.

And the reason I'm talking about this now is because we're going to be using this a lot. So play with it. And as I say, if you really want to have fun, play with it in J. So that was a diversion, but it's one that's going to be important throughout this.

So we've now basically got the data that we want. So next thing we need is a VGG model. Here's the thing though. When we're doing generative models, we want to be very careful of throwing away information. And one of the main ways to throw away information is to use max pooling.

When you use max pooling, you're throwing away 3/4 of the previous layer and just keeping the highest value. In generative models, when you use something like max pooling, you make it very hard to undo that and get back the original data. So if we were to use max pooling with this idea of our f-content, and we ask what the fourth layer of activations looks like, if we've used max pooling then we don't really know what 3/4 of the data looked like.

Slightly better is to use average pooling instead of max pooling. Because at least with average pooling, we're using all of the data to create an average. We've still kind of thrown away 3/4 of it, but at least it's all been incorporated into calculating that average. So the only thing I did to turn VGG16 into VGG16 average was to do a search and replace in that file from max pooling to average pooling.
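If you don't want to edit the file, here is one rough way to get a similar effect by rebuilding the graph and swapping each max-pool for an average-pool; this is a sketch only, reusing the pre-trained conv layers from keras.applications, not the course's actual modified VGG file:

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Input, MaxPooling2D, AveragePooling2D
from keras.models import Model

vgg = VGG16(include_top=False, weights='imagenet')

x = inp = Input((None, None, 3))
for layer in vgg.layers[1:]:
    if isinstance(layer, MaxPooling2D):
        # Swap each 2x2 max-pool for a 2x2 average-pool.
        x = AveragePooling2D((2, 2), strides=(2, 2))(x)
    else:
        # Reuse the pre-trained convolutional layer (and its weights) unchanged.
        x = layer(x)

vgg_avg = Model(inp, x)
```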

And it's just going to give us some slightly smoother, slightly nicer results. And you're going to see this a lot with generative models. We do little tweaks just to try to lose as little information as possible. You can just think of this as VGG16. Shouldn't we use something like ResNet instead of VGG, since the residual blocks carry more context?

We'll look at using ResNet over the coming weeks. It's a lot harder to use ResNet for anything beyond kind of basic classification, for a number of reasons. One is that just the structure of ResNet blocks is much more complex. So if you're not careful, you're going to end up picking something that's on one of those little arms of the ResNet rather than one of the additive mergers of the ResNet.

And it's not going to give you any meaningful information. You also have to be careful because the ResNet blocks most of the time are just slightly fine-tuning their previous block, like adding the residuals. It's not really adding new types of information. Honestly, the truth is I haven't seen any good research at all about where to use ResNet or Inception architectures for things like generative models or for transfer learning or anything like that.

So we're going to be trying to look at some of that stuff in this course, but it's far from straightforward. Two more questions. Should we put in batch normalization? In Part 1 of the course, I never actually added batch norm to the convolutional part of the model. So that's kind of irrelevant because we're not using any of the fully connected layers.

More generally, is batch norm helpful for generative models? I'm not sure that we have a great answer to that. Try it. Will the pre-trained weights change if we're using average pooling instead of max pooling? Yeah, that's a great question. Clearly the optimal weights would change, but having said that, it's still going to do a reasonable job without tweaking the weights, because the relationships between the activations aren't going to change.

So again, this would be an interesting thing to try if you want to download ImageNet and try fine-tuning it with average pooling, see if you can actually see a difference in the outputs that come out or not. It's not something I've tried. So here is the output tensor of one of the late layers of VGG-16.

So if you remember, there are different blocks of VGG where there's a number of 3x3 convs in a row, and then there's a pooling layer, and then there's another block of 3x3 convs, and then a pooling layer. This is the last block of the conv layers, and this is the first conv of that block.

I think this is maybe the third last layer of the convolutional section of VGG. This is kind of like large, receptive field, very complex concepts being captured at this late stage. So what we're going to do is we need to create our target. So for our bird, when we put that bird through VGG, what is the value of that layer's activations?

So one of the things I suggested you revise was the stuff from the Keras FAQ about how to get layer outputs. One simple way to do that is to create a new model, which takes our model's input as input, and instead of using the final output as output, we can use this layer as output.

So this is now a model which, when we call .predict, will return this set of activations. Now we're going to be using this inside the GPU, we're going to be using this as a target. So to give us something which is going to (a) live on the GPU and (b) be something we can use symbolically in a computation graph, we wrap it with K.variable.

So to remind you, whenever the Keras docs use the keras.backend module, they always call it capital K, I don't know why. So K refers to the API that Keras provides, which gives us a way of talking to either Theano or TensorFlow with the same API. So both Theano and TensorFlow have a concept of variables and placeholders and dot functions and subtraction functions and softmax activations and so forth.

And so this K module is where all of those functions live. This is just a way of creating a variable, which if we're using Theano, it would create a Theano variable. If we're using TensorFlow, it creates a TensorFlow variable. And where possible, I'm trying to use this rather than TensorFlow directly, but I could absolutely have said tf.Variable, and it would work just as well, because we're using the TensorFlow backend.
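
Put together, creating the target might look something like this sketch; the layer name and the preprocessed bird array `img_arr` are my assumptions, not the exact notebook code:

```python
from keras.models import Model
from keras import backend as K

# A model that takes the same input but outputs the block5_conv1 activations
layer = model.get_layer('block5_conv1').output
layer_model = Model(model.input, layer)

# Run the bird image through once, then wrap the result so it lives on the
# GPU and can be used symbolically in the computation graph
targ = K.variable(layer_model.predict(img_arr))
```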

So this has now created a symbolic variable that contains the activations of block 5.1. So what we now want to do is to generate an image which we're going to use SGD to gradually make the activations of that image look more and more like this variable. So how are we going to do that?

Let's skip over 202 for a moment and think about some pieces. So we're going to need to define a loss function. And the loss function is just the mean squared error between two things. One thing is, of course, that target, that thing we just created, which is the value of our layer using the bird image.

We use the bird image array. So that's our target. And then what do we want to get close to that? Well, what we want to get close to that is whatever the value is of that layer at the moment. So what does layer equal? So layer is just a symbolic object at this stage.

There's nothing in it, so we're going to have to feed it with data later. So remember, this is kind of the interesting way you define computation graphs with TensorFlow and Theano. It's like you define it with these symbolic things now and you feed it with data later. So you've got this symbolic thing called layer, and we can't actually calculate this yet.

So at this stage this is just a computation graph we're building. Now of course any time we have a computation graph, we can get its gradients. So now that we have a computation graph that calculates the loss function we're interested in, so this is f content, if we're going to try to optimize our generated image, we're going to need to know the gradients.

So here we can get the gradients, and again we use k dot gradients rather than TensorFlow gradients or Theano gradients just so that we can use it with any back-end we like. The function we're trying to get gradients of is the loss function, which we just calculated. And then we want it with respect to not some weights, but with respect to the input of the model.

So the thing that we want to change is the input to the model, so as to minimize our loss. So they're the gradients. So now that we've done that, we can go ahead and create our function. And so the input to the function is just model.input, and the outputs of the function will be the loss and the gradients.
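
In code, that computation graph is only a few lines; roughly like this sketch, where I've written the MSE out explicitly with K.mean and K.square:

```python
# Content loss: MSE between the symbolic layer output and the stored target
loss = K.mean(K.square(layer - targ))

# Gradients of the loss with respect to the *input image*, not the weights
grads = K.gradients(loss, model.input)

# A callable mapping an input image to [loss, gradient]
fn = K.function([model.input], [loss] + grads)
```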

So that's nearly everything we need. The last step we need to do is to actually run an optimizer. Now normally when we run an optimizer we use some kind of SGD. Now the s in SGD is for stochastic. In this case, there's nothing stochastic. We're not creating lots of random batches and getting different gradients every time.

So why use stochastic gradient descent when we don't have a stochastic problem to solve? So in fact, there's a much longer history of optimization methods which are deterministic, going back to Newton's method, which many of you will be familiar with. The basic idea of these much faster deterministic optimization methods is that rather than saying OK, where's the gradient, which direction does it go, let's just go a small little step in that direction.

Learning rate times gradient, small little step, small little step, because I have no idea how far to go. And it's stochastic, so it's going to keep changing. So next time I look it will be a totally different direction. With a deterministic optimization, we find out which direction to go, and then we find out what is the optimum distance to go in that direction.

And so if you know this is the direction I want to go, and it looks like this, then the way we find the optimum is we go a small distance. Then we go twice as far as that, twice as far as that, twice as far as that, and we keep going until the slope changes sign.

And once the slope changes sign, we know we've bracketed the minimum of that function; that's called bracketing. And then we can use bisection to find the minimum. So now we've bracketed it, we find halfway between the two. Is it on the left or the right of that? Halfway between the two of those, is it on the left or the right of that?

So we use bracketing and bisection to find the optimum in that direction. Let's call it a line search. All of these optimization techniques rely on the basic idea of a line search. Once you've done the line search, you've found the optimal value in that direction, in our downhill direction.
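
Just to illustrate the idea, here's a toy version of that bracketing-and-bisection line search in plain numpy; it's purely for intuition and is not what SciPy actually does internally:

```python
import numpy as np

def line_search(grad, x, direction, step=1e-4, iters=20):
    # Directional slope at distance t along the (descent) direction
    slope = lambda t: np.dot(grad(x + t * direction), direction)
    # Keep doubling the step until the slope changes sign: we've bracketed the minimum
    lo, hi = 0.0, step
    while slope(hi) < 0:
        lo, hi = hi, hi * 2
    # Bisect on the sign of the slope to home in on the minimum
    for _ in range(iters):
        mid = (lo + hi) / 2
        if slope(mid) < 0:
            lo = mid
        else:
            hi = mid
    return x + (lo + hi) / 2 * direction
```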

That doesn't necessarily mean we've found the optimal value across our entire space. So what we then do is we repeat the process, find out what's the downhill direction now, use line search to find the optimum in that direction. The problem with that is that at a saddle point, you will still often find yourself going backwards and forwards in a rather unfortunate way.

The faster optimization approaches when they're going to go in a new direction, they don't just say which direction is down, they say which direction is the most downhill but also the most different to the previous directions I've gone. That's called finding a conjugate direction. So the good news is you don't need to really know any of those details.

All you need to know is that there is a module called scipy.optimize. And in scipy.optimize are lots of handy deterministic optimizers. The two most commonly used are conjugate gradient, or CG, and BFGS. They differ in the detail of how they decide which direction to go next, which direction is both the most downhill and also the most different to the previous directions we've gone.

And the particular version we're going to use is a limited-memory BFGS. So the important thing is not how it works, the important thing for us is how do we use it. So there's the question about loss plus grads. This is a list containing a single thing, which is the loss.

Grads is already an array, or a list I should say, which is a list of the gradients of the loss with respect to all of the inputs. So plus in Python on two lists simply joins the two lists together. So this is a list containing the loss and all of the gradients.

Someone asked if ant colony optimization is something that can be used? Ant colony optimization lives in a class known as metaheuristics, like genetic algorithms or simulated annealing. There's a wide range of optimization algorithms that are designed for very difficult to optimize functions, functions which are extremely bumpy. And so these techniques all use a lot of randomization in order to kind of avoid the bumps.

In our case, we're using mean squared error, which is a nice smooth objective. So we can use the much faster deterministic optimization. And that was the next question, is this a non-convex problem or a convex optimization? Okay, great. So how do we use one of these optimizers? Basically you call the optimizer, which in this case is fmin_l_bfgs_b, minimize something using L-BFGS, and you have to pass it three things.

A function which will return the loss value at the current point, a starting point, and a function which will return the gradients at the current point. Now unfortunately we have a function which returns the loss and the gradients together, which is not what this wants. So a minor little detail is that we create a simple little class, and all this class does, and again the details really aren't important, but all this class does is that when loss is called, it calls that function that we created, passing in the current value of the data; it gets back the loss and the gradients, stores the gradients, and returns the loss.

Later on when the optimizer asks for the gradients, it returns those gradients that I stored back here. So what this is doing is it's a little class which allows us to basically turn a Keras function that returns the loss and the gradients together into two functions. One which returns the loss, one which returns the gradients.
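
A sketch of that little wrapper class, mirroring the idea just described (not necessarily the exact class from the notebook):

```python
import numpy as np

class Evaluator:
    """Split a Keras function returning [loss, grads] into the two separate
    callables that scipy.optimize expects."""
    def __init__(self, fn, shp):
        self.fn, self.shp = fn, shp

    def loss(self, x):
        # Call the Keras function once, keep the gradients for later
        loss_, self.grad_values = self.fn([x.reshape(self.shp)])
        return float(loss_)

    def grads(self, x):
        # Return the gradients stored by the preceding loss() call
        return self.grad_values.flatten().astype(np.float64)
```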

So it's a pretty minor detail, but it's a handy thing to have in your toolbox because it means you now have something that can use deterministic optimizers on Keras functions. So all we do is we loop through a small number of times, calling that optimizer each time, and we need to pass in some starting point.

So the starting point is just a random image. So we just create a random image, and here is what a random image looks like. So let's go ahead and run that so we can see the results, I haven't actually run this yet. Oh, there it comes. Good. Okay. Run, run.
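
And the loop that drives the optimizer from that random starting image might look roughly like this; the iteration count, the clipping range and the image shape are assumptions on my part:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

shp = (1, 224, 224, 3)                          # a batch of one image
x = np.random.uniform(-2.5, 2.5, shp) / 100     # the random starting image

evaluator = Evaluator(fn, shp)
for i in range(10):
    x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                     fprime=evaluator.grads, maxfun=20)
    x = np.clip(x, -127, 127)                   # keep pixel values in a sensible range
    print('Iteration %d, loss %.2f' % (i, min_val))
```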

So you can see it going along and solving here. Here's one I prepared earlier. And here at the end of the 10th iteration is the result. So remember what we did was we started with this image, we called an optimizer which took that image and attempted to optimize this loss function where the target was the value of this layer for our bird image, and the thing it was comparing it to was the layer for the generated image.

So we started with this, we ran that optimizer a bunch of times, calculating the gradient of that loss with respect to the input to the model, the very pixels themselves. And after 10 iterations it turned this random image into this thing. So this is the thing which optimizes the block 5.1 layer.

And you can see it still looks like a bird, but by this point it really doesn't care what the background looks like, it cares a lot what the eye looks like and the beak looks like and the feathers look like, because these things all matter to ImageNet to make sure it correctly sees that it's a bird.

If we look at an earlier layer, let's look at block 4.1, you can see it's getting the details more correct. So when we do our artistic style, we can choose which layer will be our f-content. And if we choose an earlier one, it's going to give it less degrees of freedom to look like a different kind of bird, but it's going to look more like our original bird.

And so then here's a video showing how that happens, so there are the 10 steps. And it's often helpful to be able to visualize the iterations of your generators at work. So feel free to borrow this very simple code, you can just use matplotlib. We actually used this in the last class, remember the little linear optimizer we animated.

You just have to define a function that gets called at each step of the animation, and then you can just call animation.FuncAnimation passing in that function, and that's a nice way that you can animate your own generators. Question: we're using Keras and TensorFlow to extract the VGG features, and these are used by SciPy for BFGS; does the BFGS also run on the GPU?
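
For reference, the matplotlib pattern looks roughly like this, assuming you've stashed the intermediate images in a list called `imgs`:

```python
from matplotlib import animation, pyplot as plt

fig, ax = plt.subplots()

def animate(i):
    # Draw the i-th intermediate result of the generator
    ax.imshow(imgs[i])

anim = animation.FuncAnimation(fig, animate, frames=len(imgs), interval=200)
# In a notebook you can then display it with HTML(anim.to_html5_video())
```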

No, there's really very little for the BFGS to do. For an optimizer like this, essentially all of the work is in calling the loss function and the gradients. The actual work of doing the bisection and doing the bracketing is so trivial that we just don't care about that. It doesn't take any time.

There's a question about the checkerboard artifact, the geometric pattern that's appearing. Yes. This is actually not a checkerboard artifact exactly, checkerboard artifacts we will look at later. They look a little bit different. That was my interpretation mistake, not the questioner's mistake. I'm not exactly sure why this particular kind of noise has appeared, honestly.

It's an interesting question. How would batching work? It doesn't. So there's no batching to do. We have a single image which is being optimized, so there's really no batching to do here. We'll look at a version which uses a very different approach and has batching shortly. Has anyone tried something like this by averaging or combining the activations of multiple bird images to create some kind of prototypical or novel bird?

Generative adversarial networks do something like that, but probably not quite. I'm not sure. Maybe not quite. Where can people get the pickle file? They don't. You have to get a list of file names yourself from the list of files that you've downloaded. And then just to make sure I understand this, someone says in this example we started with a random image, but if we started with the actual image as the initial condition, we would get the original image back, right?

I would assume so. Yeah, I mean, I can't see why it wouldn't. Basically the gradients would all be zero. They're interested to find out where we initialize for the artistic styling problem. Sorry? That was just a follow-up, we're going to get there. Oh, there's one more. Would it be useful to use a tool like Quiver to figure out which VGG layer to use?

It's so easy just to try a few and see what works. So we're nearly out of time. We haven't got through as much as I hoped, but we're going to finish off this piece. We're now going to do F style. F style is nearly identical. All of the code is nearly identical.

The only thing different is that we're not going to feed in a photo, we're going to feed in a painting. And here's a few styles we could choose from. We could do Van Gogh, we could do this little drawing, or we could do the Simpsons. So we pick one of those and we create the style array in the same way as before.

Chuck it through VGG. This time, though, we're going to use multiple layers. So I've created a dictionary from the name of the layer to its output, and we're going to use that to create an array of a number of the outputs. We're going to grab the first, second and third block outputs.
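
That dictionary-and-slice step might look like this sketch; the layer names assume the standard Keras VGG16 naming, and `style_arr` is the preprocessed painting:

```python
# Map each layer's name to its symbolic output
outputs = {l.name: l.output for l in model.layers}

# Grab the first conv of the first three blocks as our style layers
style_layers = [outputs['block%d_conv1' % i] for i in range(1, 4)]

# Same sub-model trick as before: run the painting through once and store the targets
style_model = Model(model.input, style_layers)
style_targs = [K.variable(o) for o in style_model.predict(style_arr)]
```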

So we're going to create our target as before, but we're going to use a different loss function. The loss function is called style_loss, and just like before, it's going to use the MSE. But rather than just the MSE on the activations, it's the MSE on something called the Gram matrix of the activations.

What is a Gram matrix? A Gram matrix is very simply the dot product of a matrix with its own transpose. So here it is here, the dot product of some matrix with its own transpose. And I've just got to divide it by the number of elements to create an average. So what is this matrix that we're taking the dot product of with its own transpose?

Well what it is, is that we start with our image, and remember the image is height by width by channels, and we change the order of dimensions, so it's channels by height by width. And then we do a batch flatten. What batch flatten does is it takes everything except the first dimension and flattens it out into a vector.

This is now going to be a matrix where the rows are the channels and the columns are a flattened version of the height by width. So if the input is C by H by W, the result of this will be C rows and H times W columns. So when you take the dot product of something with a transpose of itself, what you're basically doing is creating something a lot like a correlation matrix.
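
Written out with the Keras backend, the Gram matrix and the style loss might look like this sketch; it assumes the TensorFlow backend (so the static shape is known) and that the batch dimension has already been dropped from the tensors passed in:

```python
def gram_matrix(x):
    # Reorder to channels-first, then flatten each channel: a (C, H*W) matrix
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    # Dot product with its own transpose: how similar each channel is to every
    # other channel, divided by the number of elements to make it an average
    return K.dot(features, K.transpose(features)) / x.get_shape().num_elements()

def style_loss(x, targ):
    # MSE between the Gram "fingerprints" rather than the raw activations
    return K.mean(K.square(gram_matrix(x) - gram_matrix(targ)))
```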

You're saying how much is each row similar to each other row? You can think of it a number of ways. You can think about it like a cosine, a cosine is basically just a dot product. You can think of it as a correlation matrix, it's basically a normalized version of this.

So maybe if it's not clear to you, write it down on a piece of paper on the way home tonight. Just think about taking the rows of a matrix, and then flipping it around, and you're basically then turning them into columns, and then you're multiplying the rows by the columns, it's basically the same as taking each row and comparing it to each other row.

So that's what this Gram matrix is, it's basically saying for every channel, how similar are its values to each other channel? So if channel number 1 in most parts of the image is very similar to channel 3 in most parts of the image, then 1,3 of this result will be a higher number.

So it's kind of a weird matrix, it basically tells us, it's like a fingerprint of how the channels relate to each other in this particular image, or how the filters relate to each other in a particular layer of this particular image. I think the most important thing to recognize is that there is no geometry left here at all.

The x and the y coordinates are totally thrown away, they're actually flattened out. So this loss function can by definition in no way at all contain anything about the content of the image, because it's thrown away all of the x and y information, and all that's left is some kind of fingerprint of how the channels relate to each other, how the filters relate to each other.

So this style loss then says for two different images, how do these fingerprints differ? How similar are these fingerprints? So it turns out that if you now do the exact same steps as before using that as our loss function and you run it through a few iterations, it looks like that.

It looks a lot like the original Van Gogh, but without any of the content. So the question is, why? The answer is, nobody the fuck knows. So a paper just came out two weeks ago called Demystifying Neural Style Transfer with a mathematical treatment where they claim to have an answer to this question.

But from the point at which this was created, a year and a half ago, until now, no one has really known why that happens. But the important thing that the authors of this paper realized is, if we could create a function that gives you content loss and a function that gives you style loss, and you add the two together and optimize them, you can do neural style.

So all I can assume is that they tried a few different things. They knew that they had to throw away all of the geometry, so they probably tried a few things to throw away the geometry, and at some point they looked at this and they went, "Oh shit! That's it!" So now that we have this magical thing, there's the Simpsons, all we have to do is add the two together.

So here's our bird, which I'll call source. We've got our style layers, I'm actually going to take the top five now. Here's our content layer, I'm going to take block 4, conv 2. As promised, for our loss function, I'm just going to add the two together. Style loss for all of the style layers, plus the content loss.

And I'm going to divide the content loss by 10. This is something you can play with, and in the paper you'll see they play with it. How much style loss versus how much content loss? Get the gradients, evaluator, solve it, and there it is. Other than the fact that we don't really know why the style loss works (but it does), everything else kind of fits together.
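
Pulled together, the combined objective is just the weighted sum described above; roughly like this, reusing the names from the earlier sketches, where `content_layer` and `content_targ` are assumed to be built with the same sub-model trick for block4_conv2:

```python
# Style loss summed over all style layers, dropping the batch dimension
loss = sum(style_loss(l[0], t[0]) for l, t in zip(style_layers, style_targs))
# Plus the content loss, down-weighted by a factor of 10
loss = loss + K.mean(K.square(content_layer - content_targ)) / 10.

grads = K.gradients(loss, model.input)
fn = K.function([model.input], [loss] + grads)

# Then solve exactly as before: Evaluator + fmin_l_bfgs_b from a random image
evaluator = Evaluator(fn, shp)
```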

So there's the bird as Van Gogh, there's the bird as the Simpsons, and there's the bird in the style of a bird picture. There's a question, "Since the publication of that paper, has anyone used any other loss functions for f_style that achieve similar results?" Yeah, so as I mentioned, just a couple of weeks ago there was a paper, I'll put it on the forum, that tries to generalize this loss function.

It turns out actually that this particular loss function seems to be about the best that they could come up with, but there you go. So it's 9 o'clock, so we have run out of time. So we're going to move some of this lesson to the next lesson, but to give you a sense of where we're going to head, what we're going to do is we're going to take this thing where you have to optimize every single image separately, and we're going to train a CNN, which will learn how to turn a picture into a Van Gogh version of that picture.

So that's basically going to be what we're going to learn next time, and we're also going to learn about adversarial networks, which is where we're going to create two networks. One will be designed to generate pictures like this, and the other will be designed to try and classify whether this is a real Simpsons picture or a fake Simpsons picture.

And then you'll do one, generate, the other, discriminate, generate, discriminate. And by doing that, we can take any generative model and make it better by basically having something else learn to pick the difference between the real and the fake. And then finally we're going to learn about a particular thing that came out three weeks ago called the Wasserstein GAN, which is the reason I actually decided to move all of this forwards.

Generative adversarial networks basically didn't work very well at all until about three weeks ago. Now that they do work, suddenly there's a shitload of stuff that nobody's done yet, which you can do for the first time. So we're going to look at that next week.