Lesson 8: Cutting Edge Deep Learning for Coders
Chapters
0:00 Intro
3:54 Architecture
5:09 Key Insights
8:05 Technology Foundation Changes
10:28 TensorFlow
13:54 Productionization
16:49 TensorFlow Toolkit
20:11 XLA
26:05 PyTorch vs TensorFlow
31:39 Building a box
36:21 Reading papers
39:29 Writing
41:19 Part 2 Outline
44:41 Part 2 Code
52:11 Part 2 Notebook
54:11 Mendeley Desktop
56:26 Arxiv
57:16 Arxiv Sanity Preserver
00:00:00.000 |
Some of you have finished Part 1 in the last few days, some of you finished Part 1 in December. 00:00:07.000 |
I did ask those of you who took it in person to revise the material and make sure it was 00:00:13.880 |
up to date, but let's do a quick summary of the key things we learned. 00:00:22.880 |
I've been interested to hear if anybody has other key insights that they feel they came away with. 00:00:36.160 |
Stacks of nonlinear functions with lots of -- well, stacks of differentiable nonlinear 00:00:43.000 |
functions with lots of parameters solve nearly any predictive modeling problem. 00:00:47.640 |
So when we say neural network, a lot of people are suggesting we should use the phrase differentiable 00:00:54.640 |
If you think about things like the collaborative filtering we did, it was really a couple of 00:01:00.640 |
embeddings and a dot product and that gave us quite a long way, there's nothing very 00:01:09.560 |
But we know that when we stack certain kinds of nonlinear functions on top of each other, 00:01:15.440 |
the universal approximation theorem tells us that can approximate any computable function 00:01:21.520 |
to arbitrary precision, we know that if it's differentiable we can use SGD to find the 00:01:31.240 |
So this to me is kind of like the key insight. 00:01:36.640 |
But some stacks of functions are better than others for some kinds of data and some kinds 00:01:46.840 |
One way to make life very easy, we learned, is transfer learning. 00:01:51.920 |
I think nearly every network we created in the last course, we used transfer learning. 00:01:59.360 |
I think particularly in vision and in text, so pretty much everything. 00:02:04.880 |
So transfer learning generally was throw away the last layer, replace it with a new one 00:02:10.880 |
that has the right number of outputs, pre-compute the penultimate layer's output, then very 00:02:17.440 |
quickly create a linear model that goes from that to your preferred answer. 00:02:23.560 |
You now have something that works pretty well, and then you can fine-tune more and more layers 00:02:30.600 |
And we learned that when fine-tuning those additional layers, generally the best way to do that 00:02:35.200 |
was to pre-compute the output of the last of the layers which we are not fine-tuning, and so then you 00:02:40.280 |
only had to calculate the weights of the remaining ones, and that saved us lots and lots of time. 00:02:48.040 |
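For concreteness, here's a minimal Keras-2-style sketch of that recipe; `vgg`, the data arrays, `num_classes` and the choice of cut point are placeholders rather than the course's exact code.

```python
from keras.models import Model, Sequential
from keras.layers import Dense

# Drop the old top layer by building a model that ends at the penultimate layer.
penultimate = Model(vgg.input, vgg.layers[-2].output)

# 1. Pre-compute the penultimate activations once: the frozen layers never
#    change, which is what makes the next step so fast.
trn_feats = penultimate.predict(trn_data, batch_size=64)
val_feats = penultimate.predict(val_data, batch_size=64)

# 2. Fit a small linear model from those features to the new labels.
head = Sequential([Dense(num_classes, activation='softmax',
                         input_shape=trn_feats.shape[1:])])
head.compile(optimizer='adam', loss='categorical_crossentropy',
             metrics=['accuracy'])
head.fit(trn_feats, trn_labels, validation_data=(val_feats, val_labels))

# 3. Later, unfreeze a few more layers of `vgg` and fine-tune with a low
#    learning rate, pre-computing the output of whatever stays frozen.
```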
And remember that convolutional layers are slower, so let's fix up the previous one as 00:03:13.080 |
Convolutional layers are slower, dense layers are bigger, and there's an interesting question 00:03:19.520 |
I've added here, which is, remember in the last lesson, we kind of looked at ResNets 00:03:25.560 |
and InceptionNets and in general more modern nets tend not to have any dense layers. 00:03:31.560 |
So what's the best way to do transfer learning? 00:03:35.280 |
I'm going to leave that as an open question for now. 00:03:37.840 |
We're going to look into it a bit during this class, but it's not a question that anybody 00:03:44.680 |
So I'll suggest some ideas, but no one's even written a paper that attempts to address it 00:03:57.680 |
Given we have transfer learning to get us a long way, the next thing we have to get 00:04:01.080 |
us a long way is to try and create an architecture which suits our problem, both our data and 00:04:09.680 |
So for example, if we have autocorrelated inputs, so in other words, each input is related 00:04:17.680 |
to the previous input, so each pixel is similar to the next-door pixel, or in a sound wave, 00:04:23.320 |
each sample is similar to the previous sample, something like that, that kind of data we 00:04:28.880 |
tend to like to use CNNs for, as long as it's of a fixed size; if it's a sequence, we like to 00:04:35.320 |
use an RNN for it; and if it's a categorical output, we like to use a softmax for it. 00:04:40.680 |
So there are ways we learned of tuning our architecture, not so that it makes it possible 00:04:45.800 |
to solve a problem, because any standard dense network can solve any problem, but it just 00:04:52.760 |
makes it a lot faster and a lot easier to train if you've made sure that your activation 00:05:00.360 |
functions and your architecture suit the problem. 00:05:04.360 |
So that was another key thing I think we learned. 00:05:11.800 |
And something I hope that everybody can narrate is the five steps to avoiding overfitting. 00:05:18.320 |
If you've forgotten them, they're both here and discussed in more detail in lesson 3. 00:05:23.880 |
Get more data; fake getting more data using data augmentation; use more generalizable architectures, 00:05:32.360 |
architectures that generalize well, particularly when we looked at batch normalization; use regularization 00:05:38.240 |
techniques, as few of them as we can, because by definition they destroy some information, but we looked particularly 00:05:49.320 |
And then finally if we have to, we can look at reducing the complexity of the architecture. 00:05:54.360 |
The general approach we learned, this was absolutely key, is first of all with a new 00:06:00.760 |
problem, start with a network that's too big, it's not regularized, it can't help but solve 00:06:07.760 |
the problem, even if it has to overfit terribly. 00:06:12.680 |
If you can't do that, there's no point starting to regularize yet. 00:06:16.760 |
So we start out by trying to overfit terribly. 00:06:19.720 |
Once we've got to the point that we're getting 100% accuracy and our validation set's terrible 00:06:23.720 |
because it's overfitting, then we start going through these steps until we get a nice balance. 00:06:31.120 |
So that's kind of the process that we learned. 00:06:34.360 |
And then finally we learned about embeddings as a technique to allow us to use categorical 00:06:40.840 |
data, and specifically the idea of using words, or the idea of using latent variables. 00:06:49.760 |
So in this case, this was the movie lens dataset for collaborative filtering. 00:07:02.960 |
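As a reminder of how small that model really was, here's a stripped-down Keras-2-style sketch of the embedding dot-product; the sizes and names are illustrative, not the MovieLens notebook's exact code.

```python
from keras.layers import Input, Embedding, Flatten, dot
from keras.models import Model

n_users, n_movies, n_factors = 1000, 1700, 50   # example sizes

user_in  = Input(shape=(1,), dtype='int64')
movie_in = Input(shape=(1,), dtype='int64')
u = Flatten()(Embedding(n_users,  n_factors)(user_in))    # latent user factors
m = Flatten()(Embedding(n_movies, n_factors)(movie_in))   # latent movie factors
rating = dot([u, m], axes=1)                              # dot product -> predicted rating

model = Model([user_in, movie_in], rating)
model.compile(optimizer='adam', loss='mse')
# model.fit([user_ids, movie_ids], ratings, epochs=3)
```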
Did anybody have any other kind of key takeaways that they think people revising should think 00:07:10.680 |
about or remember, or things they found interesting? 00:07:22.920 |
How does having duplicates and training data affect the model created? 00:07:26.760 |
And if you're using data augmentation, do you end up with duplicate data? 00:07:35.440 |
Duplicates in the input data, I mean it's not a big deal, because we shuffle the batch 00:07:41.800 |
and then you select things randomly, effectively, you're weighting that data point higher than 00:07:49.440 |
So in a big dataset, it's going to make very little difference. 00:07:54.000 |
If you've got one thing repeated 1,000 times and then there's only another 100 data points, 00:07:57.640 |
that's going to be a big problem because you're weighting one data point 1,000 times higher. 00:08:05.880 |
So as you will have seen, we've got a couple of big technology foundation changes. 00:08:12.080 |
The first one is we're moving from Python 2 to Python 3. 00:08:16.640 |
Python 2 I think was a good place to start, given that a lot of the folks in Part 1 had 00:08:23.200 |
never coded in Python before and many of them had never written very substantial pieces 00:08:31.300 |
And a lot of the tutorials out there, like for example one of our preferred starting 00:08:35.800 |
points, which is Learn Python the Hard Way, is in Python 2, and a lot of the existing code 00:08:40.840 |
out there is in Python 2, so we thought Python 2 was a good place to start. 00:08:45.480 |
One is, are you going to post the slides after this? 00:08:50.360 |
And the other is, could you go through steps for underfitting at some point, how to deal 00:09:00.200 |
So why don't you create a forum thread asking about underfitting, but you don't need to 00:09:04.160 |
do that in the Part 2 forum, you can do that in the main forum because lots of people would 00:09:11.000 |
If you want to revise that, lesson 3 starts out by talking about underfitting. 00:09:24.400 |
I don't think we should keep using Python 2 though for a number of reasons. 00:09:28.240 |
One is that since then the IPython folks have come out and said that the next version won't 00:09:32.760 |
be compatible with Python 2, so that's a problem. 00:09:37.040 |
Also, from 2020 onwards, Python 2 will be end of life, which means there won't be patches 00:09:44.960 |
Also, we're going to be doing more stuff with concurrency and parallel programming this 00:09:49.000 |
time around, and the features in Python 3 are a lot better. 00:09:54.160 |
And then Python 3.6 was just released, which has some very nice features in particular, 00:09:58.520 |
some string formatting, which for some people it's no big deal, but to me it saves a lot 00:10:04.480 |
So we're going to move across to Python 3, and hopefully you've all gone through the 00:10:10.240 |
And there are some tips on the forum about how to have both run at the same time, although 00:10:16.520 |
I agree with the suggestion I had read from somebody which was go ahead, suck it up and 00:10:22.440 |
do the translation once now so you don't have to worry about it. 00:10:30.640 |
Much more interesting and much bigger is the move from Theano to TensorFlow. 00:10:35.640 |
So Theano, we thought, was a better starting point because it has a much simpler API. 00:10:41.800 |
There's very few new concepts to learn to understand Theano. 00:10:47.920 |
You see, TensorFlow lives within Google's whole ecosystem. 00:10:55.120 |
It's got its own file serialization system called Protobuf. 00:10:58.880 |
It's got its own profiler method based on Chrome. 00:11:04.720 |
But if you've come this far, then you're already investing the time. 00:11:09.400 |
We think it's worth investing the time in TensorFlow because there's a lot of stuff which just 00:11:15.200 |
in the last few weeks, it's been able to do that's pretty amazing. 00:11:18.640 |
So Rachel wrote this post about how much TensorFlow sucks, for which we got invited to the TensorFlow 00:11:31.000 |
Dev Summit and got to meet all the TensorFlow core team. 00:11:40.320 |
So looking at moving from Theano to TensorFlow, we got invited to the TensorFlow Dev Summit 00:11:48.880 |
and we were pretty amazed at all the stuff that's literally just been added. 00:12:01.840 |
If you google for TensorFlow Dev Summit videos, you can watch the videos about all this. 00:12:06.520 |
That's the most exciting thing for us, is that they are really investing in a simplified 00:12:12.640 |
So if you look at this code, you can create a deep neural network regressor on a mixture 00:12:21.080 |
of categorical and real variables using an almost R-like syntax and fit it in two lines 00:12:30.560 |
You'll see that those lines of code at the bottom, the two lines to fit it, look very much like Keras. 00:12:36.680 |
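The slide itself isn't reproduced here, but the flavour of that high-level API is roughly the following estimator/feature-column style; the module paths have moved between releases (it was tf.contrib.learn at the time), and the column names and toy data are made up.

```python
import tensorflow as tf

# Mix of a real-valued and a categorical feature, declared as feature columns.
real_col = tf.feature_column.numeric_column('sq_footage')
cat_col  = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        'city', ['sf', 'nyc', 'chicago']))

est = tf.estimator.DNNRegressor(feature_columns=[real_col, cat_col],
                                hidden_units=[64, 32])

def input_fn():
    # Returns (features dict, labels); a toy two-row batch.
    features = {'sq_footage': tf.constant([[1200.], [800.]]),
                'city':       tf.constant([['sf'], ['nyc']])}
    labels = tf.constant([2.1, 1.3])
    return features, labels

est.train(input_fn, steps=100)   # the "fit it in two lines" part
```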
The Keras author has been a wonderful influence on Google, and in fact everywhere we saw at 00:12:48.440 |
So TensorFlow and Keras are kind of becoming more and more one, which is terrific. 00:12:54.040 |
So one is that they're really investing in the API. 00:12:58.400 |
The second is that some of the tooling is looking pretty good. 00:13:04.600 |
Things like these graphs showing you how your different layers are distributed and how that's 00:13:09.760 |
changed over time can really help to debug what's going on. 00:13:14.220 |
So if you get some kind of gradient saturation in a layer, you can dig through these graphs 00:13:27.200 |
This guy, if I remember correctly, his name was Dandelion and his signature was an emoji of 00:13:37.680 |
If you watch this video, he kind of walks you through showing some of the functionality 00:13:50.280 |
that's there and how to use it, and I thought that was pretty helpful. 00:13:56.960 |
One of the most important ones to me is that TensorFlow has a great story about productionization. 00:14:03.560 |
For part one, I didn't much care about productionization. 00:14:06.240 |
It was really about playing around, what can we learn. 00:14:10.640 |
At this point, I think we might be starting to think about how do I get my stuff online 00:14:19.000 |
These points are talking about something in particular which is called TensorFlow Serving. 00:14:22.840 |
And TensorFlow Serving is a system that can take your trained TensorFlow model and create 00:14:28.960 |
an API for it which does some pretty cool things. 00:14:32.720 |
For example, think about how hard it would be without the help of some library to productionize 00:14:47.360 |
How do you make sure that you don't saturate all those GPUs, that you send the request to 00:14:51.320 |
one that's free, that you don't use up all of your memory. 00:14:54.640 |
Better still, how do you grab a few requests, put them into a batch, put them in all to 00:15:01.080 |
the GPU at once, get the bits out of the batch, put them back to the people that requested 00:15:09.200 |
It's very early days for this software, a lot of things don't work yet, but you can download 00:15:15.120 |
an early version and start playing with it, and I think that's pretty interesting. 00:15:19.240 |
With the high-level API in TensorFlow, what's going to be the difference between the Keras 00:15:30.000 |
In fact, Keras will become a namespace within TensorFlow, tf.keras. 00:15:39.480 |
So Keras will become the official top-level API for TensorFlow, and in fact Rachel was 00:15:49.400 |
I was just going to add that TensorFlow is kind of introducing a few different libraries 00:15:53.680 |
at different layers, different levels of abstraction. 00:15:58.360 |
There's this concept of an Estimator API that appears everywhere and basically is the 00:16:06.400 |
I think there's a layers API below the Keras API. 00:16:13.920 |
So all the stuff you've learned about Keras is going to be very helpful, not just in using 00:16:18.520 |
Keras on TensorFlow, but in using TensorFlow directly. 00:16:25.840 |
Another interesting thing about TensorFlow is that they've built a lot of cool integrations 00:16:30.740 |
with various cluster managers and distributed storage systems and stuff like that. 00:16:34.680 |
So it will kind of fit into your production systems more neatly, use the data in whatever 00:16:39.640 |
place it already is more neatly, so if your data is in S3 or something like that, you 00:16:45.680 |
can generally throw it straight into TensorFlow. 00:16:50.480 |
Something I found very interesting is that they announced a couple of weeks ago a machine 00:16:56.560 |
learning toolkit which brings really high-quality implementations of a wide variety of non-deep 00:17:05.440 |
So all these are GPU-accelerated, parallelized, and supported by Google. 00:17:15.360 |
And a lot of these have a lot of tech behind them. 00:17:18.440 |
For example, the random forest, there's a paper, they actually call it the Tensor Forest, 00:17:23.720 |
which explains all of the interesting things they did to create a fast GPU-accelerated random 00:17:33.960 |
Will you give an example of how to solve gradient saturation with TensorFlow tools? 00:17:41.680 |
We'll see how we go, because I think the video from the Dev Summit, which is available online, 00:17:48.680 |
So I would say look at that first and see if you still have questions. 00:17:53.160 |
All the videos from the Dev Summit are online. 00:18:01.440 |
Is there an idea for using deep learning on AWS Lambda? 00:18:06.000 |
Not that I've heard of, and in fact in general, Google has a service version of TensorFlow 00:18:17.480 |
serving called Google Cloud ML where you can pay them a few cents a transaction and 00:18:26.320 |
There isn't really something like that through Amazon, as far as I'm aware. 00:18:32.880 |
And then finally in terms of TensorFlow, I had an interesting and infuriating few weeks 00:18:40.360 |
trying to prepare for this class and trying to get something working that would translate 00:18:46.400 |
And every single example I found online had major problems. 00:18:52.480 |
Even the official TensorFlow tutorial missed out a key thing which is that the lowest level 00:18:59.600 |
of a language model really should be bi-directional, as this one shows, bi-directional RNN and 00:19:06.180 |
trying to figure out how to make it work was horrible, trying to get it to work in Keras. 00:19:12.800 |
Finally, basically the issue is this, modern RNN systems like a full neural translation 00:19:22.600 |
system involve a lot of tweaking and mucking around with the innards of the RNN using things 00:19:31.680 |
And there just hasn't been an API that really lets that happen. 00:19:38.120 |
So I finally got it working by switching to PyTorch, which we'll learn about soon, but 00:19:43.880 |
I was actually going to start, the first lesson was going to be about neural translation and 00:19:48.320 |
I've put it back because TensorFlow has just released a new system for RNNs which looks 00:19:55.880 |
like it's going to make all this a lot easier. 00:19:58.120 |
So this is an exciting idea is that there's an API that allows us to create some pretty 00:20:04.600 |
powerful RNN implementations and we're going to be absolutely needing that when we learn 00:20:13.840 |
Again, early days, but there is something called XLA, which is the Accelerated Linear 00:20:20.160 |
Algebra compiler, I think, which is a system which takes TensorFlow code and compiles it. 00:20:33.280 |
And so for those of you that know something about compiling, you know that a compilation 00:20:37.460 |
can do a lot of clever stuff in terms of identifying dead code or unrolling loops or fusing operations 00:20:48.840 |
Now at this stage, it takes your TensorFlow code and turns it into machine code. 00:20:53.360 |
One of the cool things that lets you do is run it on a mobile phone with almost no supporting 00:20:58.520 |
libraries using native machine instructions on that phone, much less memory. 00:21:06.080 |
But one of the really interesting discussions I had at the summit was with Scott Gray, who 00:21:13.320 |
He was the guy that massively accelerated neural network kernels when he was at Nervana. 00:21:20.800 |
He had kernels that were two or three times faster than Nvidia's kernels. 00:21:25.880 |
I don't know of anybody else in the world who knows more about neural network performance 00:21:32.420 |
He told me that he thinks that XLA is the key to creating performant, concise, expressive 00:21:47.600 |
The idea is currently, if you look in the TensorFlow code, it's thousands and thousands 00:21:55.800 |
The idea is you throw all that away and replace it with a small number of lines of TensorFlow 00:22:03.040 |
So that's something that's actually got me pretty excited. 00:22:20.400 |
The API is full of not-invented-here syndrome. 00:22:25.320 |
It's clearly written by a bunch of engineers who have not necessarily spent that much time 00:22:35.600 |
It's full of these Googleisms in terms of having to fit into their ecosystem. 00:22:43.000 |
But most importantly, like Theano, you have to set up the whole computation graph and 00:22:51.880 |
then you kind of go run, which means that if you want to do stuff in your computation 00:22:57.000 |
graph that involves like conditionals, if-then statements, if this happens, you do this other 00:23:10.560 |
It turns out that there's a very different way of programming neural nets, which is dynamic 00:23:19.800 |
computation, otherwise known as define through run. 00:23:24.060 |
There's a number of libraries that do this, Torch, PyTorch, Chainer, DyNet, they're the ones 00:23:35.640 |
And we're going to be looking at one that was released, but an early version was put 00:23:41.920 |
out about a month ago called PyTorch, which I've started rewriting a lot of stuff in, 00:23:49.120 |
and a lot of the more complex stuff just becomes suddenly so much easier. 00:23:53.760 |
And because it becomes easier to do more complex things, I often find I can create faster and 00:24:05.960 |
So even although PyTorch is very, very, very new, it is coming out of the same people that 00:24:12.660 |
built Torch, which really all of Facebook's systems build on top of. 00:24:18.080 |
I suspect that Facebook are in the process of moving across from Torch to PyTorch. 00:24:23.280 |
It's already full of incredibly cool stuff, as you'll see. 00:24:29.000 |
So we will be using increasingly more and more PyTorch during this course. 00:24:35.540 |
There was a question, "Does precompiling mean that we'll write TensorFlow code and test it 00:24:45.560 |
and then when we train a big model, then we precompile the code and train our model?" 00:24:50.400 |
Yeah, so if we're talking about XLA, XLA can be used a number of ways. 00:24:56.420 |
One is that you come up with some different kind of kernels, a different kind of factorization, 00:25:05.120 |
You write it in TensorFlow, you compile it with XLA, and then you make it available to 00:25:10.680 |
anybody so when they use your layer, they're getting this compiled optimized code. 00:25:17.460 |
It could mean that when you use TensorFlow serving, TensorFlow serving might compile 00:25:23.400 |
your code using XLA and be serving up an accelerated version of it. 00:25:31.640 |
RNNs often involve nowadays, as you'll learn, some kind of complex customizations of a bidirectional 00:25:38.680 |
layer and then some stacked layers and an attention layer, and then that's fed into a separate stacked 00:25:43.640 |
decoder; you can fuse that together into a single layer called bidirectional attention 00:25:53.360 |
sequence-to-sequence, and indeed Google have actually built that kind of thing. 00:25:59.240 |
There's various ways in which neural network compilation can be very helpful. 00:26:09.960 |
What is the relationship between TensorFlow and PyTorch? 00:26:14.240 |
There's no relationship, so TensorFlow is Google's thing, PyTorch is I guess it's kind 00:26:22.120 |
of Facebook's thing, but it's also very much a community thing. 00:26:27.000 |
TensorFlow is a huge complex beast of a system which uses all kinds of advanced software 00:26:39.280 |
In theory, that ought to make it terribly fast. 00:26:41.640 |
In practice, a recent benchmark actually showed it to be about the slowest, and I think the 00:26:45.520 |
reason is because it's so big and complex, it's so hard to get everything to work together. 00:26:50.520 |
In theory, PyTorch ought to be the slowest because this defined by run system means it's 00:26:56.280 |
way less optimization that the systems can do, but it turned out to be amongst the fastest 00:27:01.760 |
because it's so easy to write code, it's so much easier to write good code. 00:27:07.080 |
It's interesting, I think there's such different approaches, I think it's going to be great 00:27:13.760 |
to know both because there are going to be some things that are going to be fantastic 00:27:17.400 |
in TensorFlow and some things that are going to be fantastic in PyTorch. 00:27:21.640 |
They couldn't be more different, which is why I think there are two good things to learn. 00:27:28.240 |
So wrapping up this introductory part, I wanted to kind of change your expectations about 00:27:37.760 |
how you've learned so far to how you're going to learn in the future. 00:27:41.240 |
Part 1 to me was about showing you best practices. 00:27:44.280 |
So generally it's like, here's a library, here's a problem, you use this library in these steps 00:27:50.720 |
to solve this problem, and you do it this way, and lo and behold we've gotten the top 00:28:00.000 |
I tried to select things that had best practices. 00:28:05.680 |
So you now know everything I know about best practices. 00:28:08.960 |
I don't really have anything else to tell you. 00:28:12.080 |
So we're now up to stuff I haven't quite figured out yet, nor is anybody else, but you probably 00:28:22.320 |
So some of it, for example, like neural translation, that's an example of something that is solved. 00:28:30.600 |
Google solved it, but they haven't released the way they solved it. 00:28:35.320 |
So the rest of us are trying to put everything together and figure out how to make something 00:28:43.760 |
More often it's going to be, here's a sequence of things you can do that can get some pretty 00:28:50.600 |
good results here, but there's a thousand things you could do to make it better that 00:28:59.600 |
Or thirdly, here's a sequence of things that solves this pretty well, but gosh we wrote 00:29:09.640 |
I'm sure this could be abstracted really nicely, but no one's done that yet. 00:29:13.000 |
So they're kind of the three main categories. 00:29:15.320 |
So generally at the end of each class it won't be like, okay, that's it, that's how you do 00:29:21.600 |
It'll be more like, here are the things you can explore. 00:29:23.960 |
And so the homework will be pick one of these interesting things and dig into it, and generally 00:29:31.800 |
speaking that homework will get you to a point that probably no one's done before, or at 00:29:40.240 |
I found as I built this, I think nearly every single piece of code I'm presenting, I was 00:29:49.360 |
unable to find anything online which did that thing correctly. 00:29:54.280 |
There was often example code that claimed to be something like that, but again and again 00:30:02.040 |
And we'll talk about some of the things that it was missing as we go, but one very common 00:30:06.680 |
one was it would only work on a single item at a time, it wouldn't work with a batch. 00:30:12.480 |
Therefore the GPU is basically totally wasted. 00:30:16.200 |
Or it failed to get anywhere near the performance that was claimed in the paper that it was 00:30:24.040 |
So generally speaking there's going to be lots of opportunities if you're interested 00:30:28.480 |
to write a little blog post about the things you tried and what worked and what didn't, 00:30:33.320 |
and you'll generally find that there's no other post like that out there. 00:30:39.040 |
Particularly if you pick a dataset that's in your domain area, it's very unlikely that 00:30:46.480 |
Going back, can we use TensorFlow and Torch together? 00:30:55.800 |
Torch is very similar, but it's written in Lua, which is a very small embedded language. 00:31:04.720 |
Very good for what it is, but not very good for what we want to do. 00:31:10.240 |
So PyTorch is kind of a port of Torch into Python, which is pretty cool. 00:31:23.640 |
In general, you can do a few steps with TensorFlow to get to a certain point, and then a few 00:31:30.400 |
You can't integrate them into the same network, because they're very different approaches, 00:31:34.800 |
but you can certainly solve a problem with the two of them together. 00:31:42.240 |
So for those of you who have some money left over, I would strongly suggest building a 00:31:48.440 |
box. And the reason I suggest building a box is because you're paying 90 cents an hour 00:31:56.720 |
I know a lot of you are spending a couple of hundred bucks a month on AWS bills. 00:32:04.040 |
Here is a box that costs $550 and will be about twice as fast as a P2. 00:32:12.600 |
So it's just not good value to use a P2, and it's way slower than it needs to be. 00:32:23.440 |
And also building a box, it's one of the many things that's just good to learn, is understanding 00:32:31.560 |
So I've got some suggestions here about what box to build for various different budgets. 00:32:39.320 |
You certainly don't have to, but this is my recommendation. 00:32:48.420 |
More RAM helps more than I think people who discuss this stuff online quite appreciate. 00:32:54.960 |
12GB of RAM means twice as big of batch sizes, which means half as many steps necessary to 00:33:02.920 |
That means more stable gradients, which means you can use higher learning rates. 00:33:07.700 |
So more RAM I think is often under-appreciated. 00:33:16.000 |
It is a lot more expensive, but you can get the previous generation's version secondhand, 00:33:22.920 |
So there's a Titan X Pascal, which is the current one, or the Titan X Maxwell, which 00:33:30.000 |
The previous generation one is not a big step back at all, it still has 12GB RAM. 00:33:34.720 |
If you can get one used that would be a great option. 00:33:41.880 |
The GTX 1080 and 1070 are absolutely fantastic as well. 00:33:48.560 |
They're nearly as good as the Titan X, but they just have 8GB rather than 12GB. 00:33:54.680 |
Going back to a GTX 980, which is the kind of previous generation consumer top-end card, 00:34:03.460 |
So of all the places you're going to spend money on a box, put nearly all of it into 00:34:11.080 |
Every one of these steps, the 1070, the Titan X Pascal, they're big steps up. 00:34:19.200 |
And as you will have seen from part 1, if you've got more RAM, it really helps because 00:34:26.280 |
you can pre-compute more stuff and keep it in RAM. 00:34:29.160 |
Having said that, there's a new kind of hard drive, an NVMe drive, non-volatile memory. 00:34:39.840 |
They're not that far away from RAM like speeds, but they're hard drives. 00:34:47.400 |
You have to get a special kind of motherboard, but if you can afford it, it's going to be 00:34:57.920 |
That's going to really allow you to put all of your currently used data on that drive 00:35:06.880 |
Question: Doesn't the batch size also depend heavily on the video RAM? 00:35:12.120 |
Answer: That's what I was referring to, the 12GB, I'm talking about the RAM that's on 00:35:16.760 |
Question: Does upgrading RAM allow bigger batch sizes? 00:35:20.800 |
Answer: Upgrading the card, the video card's RAM. 00:35:27.480 |
You buy a card that has X amount of RAM, so Titan X has 12, GTX 1080, 8, GTX 980, 4, so 00:35:38.280 |
Upgrading the amount of RAM that's in your computer doesn't change your batch size, it 00:35:42.800 |
just changes the amount you can pre-compute unless you use an NVMe drive, in which case 00:35:56.760 |
You can go to Central Computers, which is a San Francisco computer shop, for example, 00:36:03.920 |
There's a fantastic thread on the forums; Brendan, one of the participants in the course, 00:36:10.080 |
has a great Medium post, he went there, explaining his whole journey to getting something built. 00:36:20.960 |
Alright, it's time to build your box and while you wait for things to install, it's time 00:36:29.240 |
So papers are, if you're a philosophy graduate like me, terrifying. 00:36:35.560 |
They look like Theorem 4.1 and Corollary 4.2 on the left, but that is an extract from 00:36:44.720 |
the Adam paper, and you all know how to do Adam in Microsoft Excel. 00:36:52.160 |
It's amazing how most papers manage to make simple things incredibly complex. 00:36:59.080 |
And a lot of that is because academics need to show other academics how worthy they are 00:37:05.160 |
of a conference spot, which means showing off all their fancy math skills. 00:37:11.080 |
So if you really need a proof of the convergence of your optimizer rather than just running 00:37:18.520 |
it and seeing if it works, you can study Theorem 4.1 and Corollary 4.2 and blah blah blah. 00:37:24.600 |
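For contrast, the update rule all that math is about fits in a few lines; here's a minimal numpy sketch of one Adam step, with the usual hyperparameter names (the surrounding training loop and the gradient `g` are up to you).

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at step t (t >= 1)."""
    m = b1 * m + (1 - b1) * g          # running mean of gradients
    v = b2 * v + (1 - b2) * g ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```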
In general though, the way philosophy graduates read papers is to read the abstract, find out 00:37:33.440 |
what problem they're solving, read the introduction to learn more about that problem and how previous 00:37:40.120 |
people have tackled it, jump to the bit at the end called Experiments to see how well 00:37:45.240 |
If it works really well, jump back to the bit which has the pseudocode in and try to 00:37:51.880 |
Ideally, hopefully in the meantime, finding that somebody else has written a blog post 00:37:56.000 |
in simple English like this example with Adam. 00:38:00.520 |
So don't be disheartened when you start reading deep learning papers, and even if you have a 00:38:07.880 |
math background, believe it or not, Rachel's a PhD in math and they're still terrifying. 00:38:13.080 |
Yeah, she still feels disheartened frequently. 00:38:15.680 |
Rachel was complaining about a paper just today in fact. 00:38:24.100 |
The other thing I'll say is that you'll even see now, there will be a bit that's like, 00:38:28.760 |
and then we use a softmax layer and there will be the equation for a softmax layer. 00:38:32.720 |
You'll look at the equation like, what the hell, and then it's like, oh, I already know 00:38:42.160 |
Literally still in every paper, they write the damn LSTM equations as if that's any help 00:38:47.880 |
But okay, it adds more Greek symbols, so be it. 00:38:54.620 |
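(For reference, the softmax equation those papers keep restating is nothing more than

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

which is exactly the thing you already know.)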
It's very hard to read and remember things that you can't pronounce, so if you don't 00:38:59.280 |
know how to read the Greek letters, Google the Greek alphabet and learn how to say them. 00:39:05.160 |
It's just so much easier when you can look at an equation and, rather than going squiggle something, 00:39:09.160 |
squiggle something, you can say alpha something and beta something. 00:39:11.760 |
I know it's a small little thing, but it does make a big difference. 00:39:16.100 |
So we are all there to help each other read papers. 00:39:19.640 |
The reason we need to read papers is because as of now, a lot of the things we're doing 00:39:30.640 |
Okay, so I really think writing is a good idea. 00:39:38.040 |
In fact, all of your projects I hope will end up in at least one blog. 00:39:42.880 |
If you don't have a blog, medium.com is a great place to write. 00:39:47.960 |
We would love to feature your work on fast.ai, so tell us about what you create. 00:39:56.040 |
We're very keen for more people to get into the deep learning community. 00:40:02.480 |
When you write this stuff, say hey, this is some stuff based on this course I'm doing, 00:40:07.260 |
and here's what I've learned, and here's what I've tried, and here's what I found out. 00:40:14.240 |
Like even us putting our little AWS setup scripts on GitHub for the MOOC, Rachel had 00:40:23.200 |
a dozen pull requests within a week with all kinds of little tidbits of like, oh, if you're 00:40:31.100 |
on this version of Mac, this helps this bit, or I've abstracted this out to make it work 00:40:35.360 |
in Ireland as well as in America, and so on, so there's lots of stuff that you can do. 00:40:44.200 |
I think the most important tip here is don't wait to be perfect before you start writing. 00:40:54.520 |
You should think of your target audience as the person who's one step behind you, so maybe 00:40:58.720 |
your target audience is someone that's just working through the part one MOOC right now. 00:41:09.960 |
So write the thing that you would love to have seen, because there will be far more 00:41:14.000 |
people in that target audience than in the Geoffrey Hinton target audience. 00:41:24.240 |
It's 7:45, so this might be a good time for a break. 00:41:27.600 |
Let's just get through this and then we can get on to the interesting stuff. 00:41:33.520 |
I've tried to lay out what I think we'll study in part two. 00:41:36.720 |
As I say, what I was planning until quite recently to present today was neural translation, 00:41:45.080 |
and then two things happened. Google suddenly came up with a much better RNN and sequence-to-sequence 00:41:51.280 |
API, and then also two or three weeks ago a new paper came out for generative models which 00:42:01.200 |
So that's why we've redone things and we're starting with CNN generative models today. 00:42:06.240 |
We have a question, where to find the current research papers? 00:42:16.040 |
Assuming that things go as planned, the general topic areas in part two will be CNNs and NLP 00:42:29.080 |
If you think about it, pretty much everything we did in part one was classification or a 00:42:37.480 |
We're going to now be talking more about generative models. 00:42:41.720 |
It's a little hard to exactly define what I mean by generative models, but we're talking 00:42:45.760 |
about creating an image, or creating a sentence, we're creating bigger outputs. 00:42:55.320 |
So CNNs beyond classification, so generative models for CNNs means the thing that we could 00:43:00.620 |
produce could be a picture showing this is where the bicycle is, this is where the person 00:43:05.880 |
is, this is where the grass is, that's called segmentation, or it could be taking a black 00:43:10.240 |
and white image and turning it into a colour image, or taking a low-res image and turning 00:43:14.200 |
it into a high-res image, or taking a photo and turning it into a Van Gogh, or taking a 00:43:19.360 |
photo and turning it into a sentence describing it. 00:43:24.520 |
NLP beyond classification can be taking an English sentence and turning it into French, 00:43:31.720 |
or taking an English story and a question and turning it into an answer of that question 00:43:42.880 |
We'll be talking about how to deal with larger datasets, so that both means datasets with 00:43:47.040 |
more things in it, and datasets where the things are bigger. 00:43:52.120 |
And then finally, something I'm pretty excited about is I've done a lot of work recently 00:43:56.600 |
finding some interesting stuff about using deep learning for structured data and for 00:44:02.320 |
For example, we heard about fraud, so fraud is both of those things, it combines time 00:44:07.800 |
series, transaction histories and click histories, and structured data, customer information. 00:44:14.080 |
Traditionally that's not been tackled with deep learning, but I've actually found some 00:44:19.080 |
state-of-the-art, world-class approaches to solving those with deep learning, so I'm really 00:44:29.200 |
So let's take an 8-minute break, come back at 5 to 8, thanks very much. 00:44:42.480 |
So we're going to learn about this idea of artistic style or neural style transfer. 00:44:47.380 |
The idea is that we're going to take a photo and make it look like it was painted in the 00:44:55.880 |
Our inputs are a photo, and I'm going to call it, oh, that's way off, and style. 00:45:22.560 |
And so these two things are going to be combined together to create an image which is going 00:45:35.280 |
to hopefully have the content of the photo and the style of the image. 00:45:57.040 |
The way we're going to do this is we're going to assume that there is some function where 00:46:06.000 |
the inputs to this function are the photo, the style image, and some generated image 00:46:27.840 |
And that will return some number where this function will be higher if the generated image 00:46:38.920 |
really looks like this photo in this style and lower if it doesn't. 00:46:44.800 |
So if we can create this loss function that basically says, here's my generated image, 00:46:51.720 |
and it returns back a number saying, oh yes, that generated image does look like that photo 00:47:00.240 |
And we would use SGD not to optimize the weights of a network, we would use SGD to optimize 00:47:13.720 |
So we would be using it to try to optimize the value of this argument. 00:47:21.240 |
So we haven't quite done that before, but conceptually it's identical. 00:47:27.560 |
Conceptually we can just find the derivative of this function with respect to this input. 00:47:37.120 |
And then we can try and optimize that input, which is just a set of pixel values, to try 00:47:44.760 |
So all we need to do is come up with a function which will tell us how much does some generated 00:48:00.440 |
And the way we're going to do that, step 1, is going to be very simple. 00:48:03.520 |
We're going to turn it into two functions, f-content, which will take the photo and the 00:48:12.920 |
generated image, and that will tell us a bigger number if the generated image looks more like 00:48:24.680 |
And then there will be a second function, which takes the style image and the generated 00:48:32.000 |
image, and that will tell us a higher number if this generated image looks like it was 00:48:39.680 |
painted in the same style as the style image. 00:48:42.880 |
So we can just turn it into two pieces and add them together. 00:48:46.560 |
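Written out, the decomposition just described is simply:

$$f(\text{generated}) \;=\; f_{\text{content}}(\text{photo},\ \text{generated}) \;+\; f_{\text{style}}(\text{style},\ \text{generated})$$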
So now we need to come up with these two parts. 00:49:00.560 |
What's a way that we could create a function that returns a higher number if the generated 00:49:12.880 |
When you come up with a loss function, the really obvious one is the values of the pixels. 00:49:24.360 |
The values of the pixels in the generated image, the mean squared error between them 00:49:30.240 |
and the photo, that mean squared error loss function would be one way of doing this part. 00:49:39.800 |
The problem with that though is that as I start to turn it into a Van Gogh, those pixel 00:49:48.660 |
They're going to change color because the Van Gogh might have been a very blue-looking 00:49:53.360 |
They'll change their relationships to each other, so it might become a curve where it used to be 00:50:00.960 |
So really the pixel-wise mean squared error is not going to give us much freedom in trying 00:50:09.840 |
to create something that still looks like a photo. 00:50:13.000 |
So here's an idea, instead let's look at not the pixels, but let's take those pixels and 00:50:22.160 |
stick them through a pre-trained CNN like VGG. 00:50:29.680 |
And let's look at the 4th or 5th or 8th convolutional layer's activations. 00:50:35.960 |
Remember back to those Matt Zeiler visualizations where we saw that the later layers kind of 00:50:43.840 |
said how much does an eyeball look like here, or how much does this look like a star, or 00:50:52.840 |
how much does this look like the fur of a dog. 00:50:53.840 |
The later layers were dealing with bigger objects and more semantic concepts. 00:51:00.580 |
So if we were to use a later layer's activations as our loss function, then we could really 00:51:07.520 |
change the style and the color and all kinds of stuff and really would be saying does the 00:51:12.560 |
eye still look like an eye, does the beak still look like a beak, does the rock still 00:51:19.400 |
And if the answer is yes, then OK, that's good, this is something that matches in terms 00:51:25.160 |
of the meaning of the content even though the pixels look very different. 00:51:30.400 |
And so that's exactly what we're going to do. 00:51:32.040 |
So for f-content, we're going to say that's just the VGG activations of some convolutional 00:51:49.920 |
So that's actually enough for us to get started. 00:51:54.000 |
Let's try and build something that optimizes pixels using a loss function of the VGG network 00:52:18.520 |
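A rough sketch of that plan, assuming Keras 2 with scipy's L-BFGS as the optimizer (the layer name, shapes and `photo_arr` are illustrative choices, not the notebook's exact code): take one of VGG's convolutional layers, use the mean squared error between its activations for the photo and for the generated image as the loss, and optimize the generated pixels rather than any weights.

```python
import numpy as np
import keras.backend as K
from keras.models import Model
from keras.applications.vgg16 import VGG16
from scipy.optimize import fmin_l_bfgs_b

vgg = VGG16(include_top=False, input_shape=photo_arr.shape[1:])
layer = vgg.get_layer('block4_conv2').output            # a mid-to-late conv layer
layer_model = Model(vgg.input, layer)
target = K.variable(layer_model.predict(photo_arr))     # fixed photo activations

loss  = K.mean(K.square(layer - target))                # f_content
grads = K.gradients(loss, vgg.input)
fn    = K.function([vgg.input], [loss] + grads)

def eval_loss_and_grads(x):
    l, g = fn([x.reshape(photo_arr.shape)])
    return np.float64(l), g.flatten().astype(np.float64)

x = np.random.uniform(-2.5, 2.5, photo_arr.shape).flatten()  # start from noise
for i in range(10):                                          # a few L-BFGS iterations
    x, min_val, _ = fmin_l_bfgs_b(eval_loss_and_grads, x, maxfun=20)
```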
And much of what we're going to look at is going to look very similar. 00:52:24.520 |
The first thing you'll see which doesn't look similar to before is I've got this thing called 00:52:31.240 |
Limit mem, remember you can always see the source code for something by putting two question 00:52:43.280 |
Limit mem is just these three lines of code which I notice somebody currently has already 00:52:51.480 |
One of the many things I dislike about TensorFlow for our kind of work is that all of the defaults 00:52:58.920 |
So one of the defaults is it will use up all of your memory on all of your graphics cards. 00:53:04.000 |
So I'm currently running this on a server with four graphics cards, which I'm meant 00:53:07.920 |
to be sharing with my colleagues at the university here. 00:53:12.040 |
If every time I run a notebook, nobody else can use any of the graphics cards, they're 00:53:17.600 |
And this nice little gig I have of running these little classes is going to disappear 00:53:23.160 |
So I need to make sure I run limit mem very soon as soon as I start running a notebook. 00:53:29.760 |
Honestly I think this is a poor choice by the TensorFlow authors because somebody putting 00:53:37.160 |
something in production is going to be taking time to optimize things. 00:53:42.720 |
Somebody who's hacking something together to quickly see if they can get something working 00:53:48.400 |
So this is like one of the many places where TensorFlow makes some odd little annoying 00:53:55.160 |
But anyway, every time I create a new notebook, I copy this line in and make sure I run it 00:54:01.800 |
and so this does not use up all of your memory. 00:54:07.960 |
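For reference, the gist of that utility is roughly these few lines (reconstructed from memory, so treat it as a sketch rather than the course's exact utils file):

```python
import tensorflow as tf
import keras.backend as K

def limit_mem():
    # Tell TensorFlow to grow GPU memory as needed instead of grabbing it all.
    cfg = tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=cfg))
```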
So I've got a link to the paper that we're looking at, and indeed we can open it. 00:54:16.000 |
And now is a good time to talk about how helpful it is to use some kind of paper reading system. 00:54:24.000 |
I really like this one, it's free, it's called Mendeley Desktop. 00:54:30.040 |
Mendeley lets you, as you find papers, save them into a folder on your computer. 00:54:37.440 |
Mendeley will automatically watch that folder, any PDF that appears there gets added to your 00:54:41.920 |
library, and it's really quite cool because what it then does is it finds the arXiv 00:54:52.560 |
ID and then you can click this little button here and it will go to arXiv and grab all 00:55:04.400 |
of the information such as the abstract and so forth and fill it out for you. 00:55:10.460 |
And so this is really great because now any time I want to find anything I've read which 00:55:16.040 |
has got anything to do with style, I can type style and up come all of the papers. 00:55:23.600 |
Believe me, after a long time of reading papers without something like this, it basically 00:55:30.240 |
goes in one ear and out the other, and literally I've read papers a year later and at the end 00:55:35.320 |
of it I've realized I've read that before, I don't remember anything else about it but 00:55:40.560 |
I know I've read it before, whereas this way I really find that my knowledge builds. 00:55:46.680 |
As I find references, I'm immediately there looking at the references. 00:55:51.400 |
The other thing you can do is that as you start reading the paper, as you can see, my 00:56:00.920 |
notes and highlights are saved, and they're also duplicated on my mobile devices and my 00:56:08.240 |
other computers and they're all synced up, it's really cool. 00:56:12.640 |
So talking about arXiv is a great time to answer a question we had earlier about how 00:56:24.300 |
So the vast vast vast majority of deep learning papers get put up on arxiv.org for a long 00:56:34.720 |
long long time before they're in any journal or conference. 00:56:39.200 |
So if you wait until they're in a conference proceedings, you're many many months or maybe 00:56:51.080 |
You can go to the AI section of arXiv and see what's there, but that's not really what 00:57:01.660 |
What everybody uses instead is Arxiv Sanity, the Arxiv Sanity Preserver. 00:57:15.380 |
This is something that the wonderful Andrej Karpathy built, and what it lets you do is 00:57:20.660 |
to create a library of articles that somebody tells you to read or that you're interested 00:57:26.240 |
in or you come across, and as you create that library by clicking this little save button, 00:57:34.820 |
Or even once you start reading a paper, you go Show Similar, and it will then show you 00:57:42.140 |
other papers that are similar to this paper and it seems to do a pretty damn good job 00:57:47.920 |
So you can really explore and get lost in that whole area. 00:57:54.720 |
And then as you do that, you'll find that if you go to arXiv, one of the buttons that 00:58:08.840 |
So like even from the abstract here, bang, straight into your library and the next time 00:58:15.080 |
And then you can put things into folders, so the different parts of the course, I've 00:58:23.960 |
created folders for them and kind of keep track of what I'm reading that way. 00:58:30.400 |
A good little trick to know about arxiv.org is that you often want to know when it's 00:58:39.440 |
from, and if you go to the first page on the left-hand side, you can see the date here. 00:58:44.800 |
And another cool tip is that the file name, the first four digits are the year and month 00:58:51.600 |
for that file, so there's a couple of handy little tips. 00:58:58.040 |
As well as Arxiv Sanity, another really great place for finding papers is Twitter. 00:59:06.560 |
Now if you haven't really used Twitter before or haven't really used Twitter for this purpose 00:59:14.640 |
So I try to make things easy for people by favoriting lots of the interesting deep learning 00:59:23.160 |
So if you go to Jeremy P. Howard's page and click on Likes, you'll find that there is 00:59:40.680 |
a thousand links to papers here, and as you can see, there's generally a few every day. 00:59:50.120 |
One is to get some ideas and papers to read, but perhaps more importantly is to see who's 00:59:58.200 |
Rachel, can you throw that box to that gentleman? 01:00:06.580 |
It's not a question, it's just information about archive. 01:00:11.000 |
There is someone who has built a skill on Amazon Alexa, and you can actually ask Alexa 01:00:18.040 |
to give the most recent papers from arXiv, and she actually reads the abstract for you, and 01:00:35.160 |
The other place which I find extremely helpful is Reddit machine learning. 01:00:42.360 |
Again, there's a lot less that goes through Reddit than goes through Twitter, but generally 01:00:49.800 |
like the really interesting things tend to turn up here, and you can often see the discussions 01:00:58.640 |
For example, there was a great discussion of PyTorch versus TensorFlow in the last day 01:01:04.560 |
or two, and so there's a couple of good places to get started. 01:01:14.880 |
I have two questions on the image stuff when you go back to style. 01:01:21.200 |
One of them was if the app Prisma is using something like this. 01:01:28.760 |
And the other is, is it better to calculate f-content from a higher layer of VGG and use 01:01:34.600 |
a lower layer for f-style, since the higher-level abstractions are captured in the higher 01:01:44.720 |
We haven't learned about f-style yet, so we're just going to look at f-content first. 01:01:48.560 |
Okay, so I've got some more links to some things you can look at here in the notebook. 01:01:57.640 |
So the data I've linked to in the lesson thread on the forum, I've just grabbed a random sample 01:02:05.960 |
of about 20,000 ImageNet images, and I've also put them into bcolz arrays. 01:02:18.340 |
You can figure out how to get the file names easily enough, so I'm not going to do everything 01:02:28.880 |
Thank you for the person who's showing all the other stuff at Pippin's store, that's 01:02:37.320 |
Given that we're using VGG, as per usual, we're going to have to subtract out the mean 01:02:45.200 |
pixel value from ImageNet and reverse the channel order, because of course that's what 01:02:54.840 |
So we're going to create an array from the image by just running it through that pre-processing 01:03:02.280 |
Later on, we're going to be running things through a network and generating images. 01:03:06.260 |
Those generated images we're going to have to add back on that mean and undo that reordering, 01:03:12.320 |
so this is what this de-processing function is going to be for. 01:03:18.320 |
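Roughly, those two helpers look like this; the mean values are the standard ImageNet RGB means that VGG was trained with, and the array layout is batch x height x width x channels.

```python
import numpy as np

rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)   # RGB means

def preproc(img):
    # Subtract the ImageNet mean and flip RGB -> BGR, as VGG expects.
    return (img - rn_mean)[:, :, :, ::-1]

def deproc(img, shape):
    # Undo the above so a generated array can be viewed as a normal image.
    return np.clip(img.reshape(shape)[:, :, :, ::-1] + rn_mean, 0, 255)
```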
Now I've kind of hand-waved over these functions before and how they work, but I'm going to 01:03:27.400 |
stop hand-waving for a moment because it's actually quite interesting. 01:03:30.560 |
Have you ever thought about how is it that we're able to take X, which is a 4-dimensional 01:03:35.800 |
tensor, batch size by height by width by channels (notice this is not the same as Theano, 01:03:43.480 |
so Theano was batch size by channels by height by width, we're not doing that anymore), batch 01:03:51.760 |
size by height by width by channels, taking a 4-dimensional tensor and we're subtracting 01:04:04.640 |
And the way it's making that work is because it's doing something called broadcasting. 01:04:11.260 |
Broadcasting refers to any kind of operation where you have arrays or tensors of different 01:04:18.360 |
dimensions and you do element-wise operations on two tensors of different dimensions. 01:04:30.280 |
This idea actually goes back to the early 1960s to an amazing programming language called APL. 01:04:41.680 |
APL was written by an extraordinary person called Kenneth Iverson. 01:04:47.240 |
Originally APL was a paper describing a new mathematical notation, and this new mathematical 01:04:55.120 |
notation was designed to be more flexible and far more precise than traditional mathematical 01:05:02.360 |
And he then went on to create a programming language that implemented this mathematical 01:05:07.520 |
APL refers to the notation, which he described as notation as a tool for thought. 01:05:15.440 |
He really, unlike the TensorFlow authors, understood the importance of a good API. 01:05:21.080 |
He recognized that the mathematical notation can change how you think about math, and so 01:05:26.440 |
he created a notation which is incredibly expressive. 01:05:34.840 |
His son has now gone on to carry the torch and continues to support a direct descendant of APL called J. 01:05:47.160 |
So if you ever want to find, I think, the most elegant programming language in the world, 01:05:53.920 |
you can go to Jsoftware.com and check this out. 01:05:57.360 |
Now, how many of you here have used regular expressions? 01:06:03.440 |
How many of you, the first time you looked at a complex regular expression thought, that 01:06:14.160 |
The first time that you look at a piece of J, you'll go, what the bloody hell? 01:06:22.200 |
Because it's an even more expressive and a much older language than regular expressions. 01:06:38.520 |
But what's going on here is that this is a language which at its heart almost never requires 01:06:45.480 |
you to write a single loop because it does everything with multidimensional tensors and 01:06:52.080 |
So everything we're going to learn about today with broadcasting is a very diluted, simplified, 01:06:58.720 |
graphified version of what APL created in the early 60s, which is not to say anything 01:07:04.120 |
rude about Python's implementation, it's one of the best. 01:07:13.880 |
If you want to really expand your brain and have fun, check out J. 01:07:18.280 |
In the meantime, what does Keras/Theano/TensorFlow broadcasting look like? 01:07:30.760 |
Here is a vector, a one-dimensional tensor, minus a scalar. 01:07:46.520 |
That makes perfect sense that you can subtract a scalar from a one-dimensional tensor. 01:07:53.280 |
What it's actually doing is it's taking this 2 and it's replicating it 3 times. 01:07:58.360 |
So this is actually element-wise, 1, 2, 3, minus 2, 2, 2. 01:08:05.000 |
It has broadcasted the scalar across the 3-element vector 1, 2, 3. 01:08:15.000 |
So there's our first example of broadcasting. 01:08:21.760 |
In general, broadcasting has a very specific set of rules, which is this. 01:08:29.840 |
You can take two tensors and you first of all take the shorter tensor, the tensor of 01:08:37.080 |
less dimensions, and prepend unit axes to the front. 01:08:47.240 |
Take the vector 2, 3 and prepend 3 unit axes on the front. 01:08:52.640 |
It is now a four-dimensional tensor of shape 1, 1, 1, 2. 01:08:58.720 |
So if you turn a row into a column, you're adding one unit axis. 01:09:05.000 |
If you're then turning it into a single slice, you're adding another unit axis. 01:09:11.560 |
So you can always make something into a higher dimensionality by adding unit axes. 01:09:17.960 |
So when you broadcast, it takes the thing with less dimensions and adds prepends unit 01:09:27.600 |
And then what it does is it says, so let's take this first example, it's taken this thing 01:09:33.120 |
which has no axes, it's a scalar, and turns it into a vector of length 1. 01:09:40.200 |
And then what it does is it finds anything which is of length 1 and duplicates it enough 01:09:49.960 |
So here we have something which is a four-dimensional tensor of size 5, 1, 3, 2. 01:09:57.880 |
So it's got 2 columns, 3 rows, 1 slice and 5 tubes. 01:10:09.280 |
And then we're going to subtract from it a vector of length 2. 01:10:13.280 |
So remember from our definition, it's then going to automatically reshape this by prepending 01:10:28.000 |
And then it's going to copy this thing 3 times, this thing 1 time and this thing 5 times. 01:10:44.880 |
So it's going to subtract this vector from every row, every slice, every cube. 01:10:55.440 |
So you can play around with these little broadcasting examples and try to get a real feel for how 01:11:05.680 |
So in this case, we were able to take a four-dimensional tensor and subtract from it a three-element 01:11:13.320 |
vector, knowing that it is going to copy that three-element vector of channel values to every 01:11:27.280 |
It subtracted the mean average of the channels from all of the images the way we wanted it 01:11:35.800 |
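You can check those rules directly in numpy; this little example mirrors both the 5, 1, 3, 2 case above and the channel-mean subtraction.

```python
import numpy as np

a = np.arange(30, dtype=np.float32).reshape(5, 1, 3, 2)   # shape (5, 1, 3, 2)
v = np.array([10., 20.])                                  # shape (2,)

# v is treated as shape (1, 1, 1, 2), then repeated along the other axes.
print((a - v).shape)                                      # -> (5, 1, 3, 2)

# Same trick as the image pre-processing: subtract per-channel means from a
# whole batch of images without writing any loops.
imgs = np.random.rand(16, 224, 224, 3)
means = imgs.reshape(-1, 3).mean(axis=0)                  # shape (3,)
centred = imgs - means                                    # broadcasts over (16, 224, 224, 3)
```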
But it's been amazing how often I've taken code that I've downloaded off the internet 01:11:40.920 |
and made it often 10 or 20 times shorter in terms of lines of code just by using lots of broadcasting. 01:11:49.680 |
And the reason I'm talking about this now is because we're going to be using this a lot. 01:11:56.200 |
And as I say, if you really want to have fun, play with it in J. 01:12:02.640 |
So that was a diversion, but it's one that's going to be important throughout this. 01:12:07.400 |
So we've now basically got the data that we want. 01:12:23.860 |
When we're doing generative models, we want to be very careful of throwing away information. 01:12:30.560 |
And one of the main ways to throw away information is to use max pooling. 01:12:35.320 |
When you use max pooling, you're throwing away 3/4 of the previous layer and just keeping the maximum of each 2x2 block. 01:12:46.520 |
In generative models, when you use something like max pooling, you make it very hard to recover the information you have thrown away. 01:12:56.000 |
So if we were to use max pooling with this idea of our f-content, and we ask what does 01:13:03.160 |
the fourth layer of activations look like, if we've used max pooling, then we don't really know what most of the original image looked like any more. 01:13:14.680 |
Slightly better is to use average pooling instead of max pooling. 01:13:19.120 |
Because at least with average pooling, we're using all of the data to create an average. 01:13:23.840 |
We've still kind of thrown away 3/4 of it, but at least it's all been incorporated into the result. 01:13:31.780 |
So the only thing I did to turn VGG16 into VGG16 average was to do a search and replace 01:13:38.600 |
in that file from max pooling to average pooling. 01:13:41.560 |
And it's just going to give us some slightly smoother, slightly nicer results. 01:13:47.040 |
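If you'd rather do that swap in code than by editing the model definition file, a rough (entirely hypothetical) sketch of rebuilding a sequential VGG with average pooling could look like this; the lesson itself just does the search and replace:

```python
from keras.layers import AveragePooling2D, MaxPooling2D
from keras.models import Model

def to_avg_pool(model):
    # rebuild an already-loaded, purely sequential VGG16, swapping each max-pooling
    # layer for an average-pooling layer with the same pool size and strides
    x = model.input
    for layer in model.layers[1:]:                 # skip the input layer
        if isinstance(layer, MaxPooling2D):
            x = AveragePooling2D(layer.pool_size, strides=layer.strides)(x)
        else:
            x = layer(x)                           # reuse the existing pre-trained layer
    return Model(model.input, x)
```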
And you're going to see this a lot with generative models. 01:13:49.140 |
We do little tweaks just to try to lose as little information as possible. 01:13:58.960 |
Shouldn't we use something like ResNet instead of VGG, since the residual blocks carry more information? 01:14:07.220 |
We'll look at using ResNet over the coming weeks. 01:14:14.380 |
It's a lot harder to use ResNet for anything beyond kind of basic classification, for a few reasons. 01:14:27.520 |
One is that just the structure of ResNet blocks is much more complex. 01:14:31.360 |
So if you're not careful, you're going to end up picking something that's on one of 01:14:35.560 |
those little arms of the ResNet rather than one of the additive mergers of the ResNet. 01:14:41.640 |
And it's not going to give you any meaningful information. 01:14:45.720 |
You also have to be careful because the ResNet blocks most of the time are just slightly 01:14:51.120 |
fine-tuning their previous block, like adding the residuals. 01:14:56.200 |
It's not really adding new types of information. 01:15:01.320 |
Honestly, the truth is I haven't seen any good research at all about where to use ResNet 01:15:10.920 |
or Inception architectures for things like generative models or for transfer learning. 01:15:18.120 |
So we're going to be trying to look at some of that stuff in this course, but it's far from a settled question. 01:15:30.680 |
In Part 1 of the course, I never actually added batch norm to the convolutional part of VGG. 01:15:38.000 |
So that's kind of irrelevant because we're not using any of the fully connected layers. 01:15:43.680 |
More generally, is batch norm helpful for generative models? 01:15:47.360 |
I'm not sure that we have a great answer to that. 01:15:52.280 |
Will the pre-trained weights change if we're using average pooling instead of max pooling? 01:16:01.360 |
The pre-trained weights, clearly the optimal weights would change, but having said that 01:16:08.920 |
it's still going to do a reasonable job without tweaking the weights because the relationships 01:16:15.180 |
between the activations aren't going to change. 01:16:19.280 |
So again, this would be an interesting thing to try if you want to download ImageNet and 01:16:25.040 |
try fine-tuning it with average pooling, see if you can actually see a difference in the results. 01:16:36.200 |
So here is the output tensor of one of the late layers of VGG-16. 01:16:44.640 |
So if you remember, there are different blocks of VGG where there's a number of 3x3 convs 01:16:52.560 |
in a row, and then there's a pooling layer, and then there's another block of 3x3 convs, and so on. 01:16:58.320 |
This is the last block of the conv layers, and this is the first conv of that block. 01:17:04.040 |
I think this is maybe the third last layer of the convolutional section of VGG. 01:17:09.680 |
This is kind of like a large receptive field, with very complex concepts being captured at this point. 01:17:20.880 |
So what we're going to do is we need to create our target. 01:17:27.560 |
So for our bird, when we put that bird through VGG, what is the value of that layer's activations? 01:17:40.320 |
So one of the things I suggested you revise was the stuff from the Keras FAQ about how to grab the output of an intermediate layer. 01:17:49.600 |
One simple way to do that is to create a new model, which takes our model's input as input, 01:17:55.440 |
and instead of using the final output as output, we can use this layer as output. 01:18:00.360 |
So this is now a model, which when we call .predict, it will return this set of activations. 01:18:13.200 |
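In code, that step is only a couple of lines; here's a sketch, where model is the VGG network built earlier, img_arr stands for the preprocessed bird image, and block5_conv1 is an assumed layer name (check model.summary() for yours):

```python
from keras.models import Model

layer = model.get_layer('block5_conv1').output     # symbolic output of the chosen layer
layer_model = Model(model.input, layer)            # same input, but this layer as the output
activations = layer_model.predict(img_arr)         # concrete activations for the bird image
```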
Now we're going to be using this on the GPU, we're going to be using this as a target. 01:18:13.200 |
So to give us something which is (a) going to live on the GPU and (b) something we can use symbolically 01:18:21.360 |
in a computation graph, we wrap it with K.variable. 01:18:27.680 |
So to remind you, wherever the Keras docs use the keras.backend module, they always call it capital K. 01:18:33.080 |
So k refers to the API that Keras provides, which provides a way of talking to either 01:19:00.880 |
So both Theano and TensorFlow have a concept of variables and placeholders and dot functions 01:19:08.040 |
and subtraction functions and softmax activations and so forth. 01:19:11.560 |
And so this k.module is where all of those functions live. 01:19:17.440 |
This is just a way of creating a variable, which if we're using Theano, it would create a Theano variable. 01:19:22.200 |
If we're using TensorFlow, it creates a TensorFlow variable. 01:19:25.680 |
And where possible, I'm trying to use this rather than TensorFlow directly, but I could 01:19:25.680 |
absolutely have said tf.Variable, and it would work just as well, because we're using the TensorFlow backend here. 01:19:41.080 |
So this has now created a symbolic variable that contains the activations of block 5.1. 01:19:49.340 |
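So that wrapping step is just one line, continuing the sketch above (layer_model and img_arr are the assumed names from before):

```python
from keras import backend as K

# wrap the concrete activations so they live on the GPU and can be used symbolically
targ = K.variable(layer_model.predict(img_arr))
```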
So what we now want to do is to generate an image which we're going to use SGD to gradually 01:19:57.940 |
make the activations of that image look more and more like this variable. 01:20:05.680 |
Let's skip over 202 for a moment and think about some pieces. 01:20:09.520 |
So we're going to need to define a loss function. 01:20:13.320 |
And the loss function is just the mean squared error between two things. 01:20:18.500 |
One thing is, of course, that target, that thing we just created, which is the value of that layer's activations for our bird image. 01:20:38.960 |
And then what do we want to get close to it? 01:20:40.400 |
Well, what we want to get close to it is whatever the value of that layer is for the image we feed in. 01:20:51.080 |
So layer is just a symbolic object at this stage. 01:20:56.800 |
There's nothing in it, so we're going to have to feed it with data later. 01:21:03.840 |
So remember, this is kind of the interesting way you define computation graphs with TensorFlow 01:21:09.920 |
It's like you define it with these symbolic things now and you feed it with data later. 01:21:15.120 |
So you've got this symbolic thing called layer, and we can't actually calculate this yet. 01:21:20.680 |
So at this stage this is just a computation graph we're building. 01:21:24.880 |
Now of course any time we have a computation graph, we can get its gradients. 01:21:28.680 |
So now that we have a computation graph that calculates the loss function we're interested 01:21:49.560 |
in, so this is f-content, if we're going to try to optimize our generated image, we're going to need the gradients. 01:21:58.720 |
So here we can get the gradients, and again we use k dot gradients rather than TensorFlow 01:22:03.280 |
gradients or Theano gradients just so that we can use it with any back-end we like. 01:22:11.400 |
The function we're trying to get gradients of is the loss function, which we just calculated. 01:22:16.960 |
And then we want it with respect to not some weights, but with respect to the input of 01:22:23.600 |
So this is the thing that we want to change is the input to the model so as to minimize 01:22:35.520 |
So now that we've done that, we can go ahead and create our function. 01:22:39.800 |
And so the input to the function is just model.input, and the outputs of the function are the loss and the gradients. 01:22:52.560 |
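Putting those pieces together, here's a sketch with the same assumed names (layer is the symbolic layer output, targ the wrapped target):

```python
from keras import backend as K

loss = K.mean(K.square(layer - targ))            # f_content: MSE between the two sets of activations
grads = K.gradients(loss, model.input)           # gradients w.r.t. the input pixels themselves
fn = K.function([model.input], [loss] + grads)   # feed in an image, get back the loss and the gradients
```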
The last step we need to do is to actually run an optimizer. 01:22:57.380 |
Now normally when we run an optimizer we use some kind of SGD. 01:23:10.540 |
But here there's nothing stochastic about the problem: we're not creating lots of random batches and getting different gradients every time. 01:23:15.560 |
So why use stochastic gradient descent when we don't have a stochastic problem to solve? 01:23:23.440 |
So in fact, there's a much longer history of optimization methods which are deterministic, 01:23:29.640 |
going back to Newton's method, which many of you will be familiar with. 01:23:40.880 |
The basic idea of these much faster deterministic optimization methods is that rather than saying 01:23:50.220 |
OK, where's the gradient, which direction does it go, let's just go a small little step in that direction. 01:23:57.320 |
Learning rate times gradient, small little step, small little step, because I have no idea how far I should go. 01:24:03.200 |
And it's stochastic, so it's going to keep changing. 01:24:05.200 |
So next time I look it will be a totally different direction. 01:24:09.560 |
With a deterministic optimization, we find out which direction to go, and then we find 01:24:16.160 |
out what is the optimum distance to go in that direction. 01:24:19.880 |
And so if you know this is the direction I want to go, and it looks like this, then the 01:24:24.800 |
way we find the optimum is we go a small distance. 01:24:27.520 |
Then we go twice as far as that, twice as far as that, twice as far as that, and we keep going until the slope changes sign. 01:24:36.800 |
Once the slope changes sign, we know we've gone past the minimum; this is called bracketing. 01:24:40.120 |
We've bracketed the minimum of that function. 01:24:43.360 |
And then we can use bisection to find the minimum. 01:24:45.840 |
So now we've bracketed it, we look halfway between the two ends: is the minimum to the left or the right of that point? 01:24:51.160 |
Then we look halfway again within whichever half it is, and so on. 01:24:54.100 |
So we use bracketing and bisection to find the optimum in that direction. 01:25:04.680 |
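If it helps, here's a toy sketch of that bracketing-plus-bisection idea for a 1-D function f(t) along the search direction; this is just to illustrate the idea, not what SciPy actually does internally:

```python
def line_search(f, step=1e-4, tol=1e-8):
    # bracketing: keep doubling the step while we're still going downhill
    a, b = 0.0, step
    while f(b) < f(a):
        a, b = b, 2 * b              # once f(b) >= f(a), the minimum is bracketed in [0, b]

    def slope(t, eps=1e-6):
        return f(t + eps) - f(t - eps)

    # bisection: is the minimum to the left or the right of the midpoint?
    lo, hi = 0.0, b
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if slope(mid) > 0:
            hi = mid                 # already going uphill, so the minimum is to the left
        else:
            lo = mid
    return (lo + hi) / 2

print(line_search(lambda t: (t - 0.3) ** 2))   # approximately 0.3
```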
All of these optimization techniques rely on the basic idea of a line search. 01:25:11.080 |
Once you've done the line search, you've found the optimal value in that direction, in our case the downhill direction. 01:25:18.440 |
That doesn't necessarily mean we've found the optimal value across our entire space. 01:25:23.840 |
So what we then do is we repeat the process, find out what's the downhill direction now, 01:25:30.000 |
use line search to find the optimum in that direction. 01:25:34.600 |
So the problem with that is that in a saddle point, you will still often find yourself 01:25:42.200 |
going backwards and forwards in a rather unfortunate way. 01:25:51.160 |
The faster optimization approaches when they're going to go in a new direction, they don't 01:25:56.000 |
just say which direction is down, they say which direction is the most downhill but also 01:26:01.480 |
the most different to the previous directions I've gone. 01:26:08.440 |
So the good news is you don't need to really know any of those details. 01:26:12.040 |
All you need to know is that there is a module called scipy.optimize. 01:26:23.560 |
And in scipy.optimize are lots of handy deterministic optimizers. 01:26:29.280 |
The two most commonly used are conjugate gradient, or CG, and BFGS. 01:26:40.080 |
They differ in the detail of how do they decide what direction to go next, which direction 01:26:45.440 |
is both the most downhill and also the most different to the previous directions we've already gone in. 01:26:52.480 |
And the particular version we're going to use is a limited memory BFGS. 01:26:57.600 |
So the important thing is not how it works, the important thing for us is how do we use it. 01:27:07.480 |
So there's the question about loss plus grads. 01:27:17.440 |
So this is an array containing a single thing, which is loss. 01:27:24.800 |
Grads is already an array, or a list I should say, which is a list of all of the gradients of the loss with respect to the input. 01:27:32.920 |
So plus in Python on two lists simply joins the two lists together. 01:27:38.240 |
So this is a list containing the loss and all of the gradients. 01:27:42.760 |
Someone asked if ant colony optimization is something that can be used? 01:27:48.920 |
Ant colony optimization lives in a class known as metaheuristics, like genetic algorithms 01:27:59.320 |
There's a wide range of optimization algorithms that are designed for very difficult to optimize 01:28:05.720 |
functions, functions which are extremely bumpy. 01:28:09.200 |
And so these techniques all use a lot of randomization in order to kind of avoid the bumps. 01:28:17.680 |
In our case, we're using mean-squared error, which is a nice smooth objective. 01:28:23.320 |
So we can use the much faster convex optimization. 01:28:27.760 |
And then that was the next question, is this a non-convex problem or a convex optimization? 01:28:46.280 |
Basically you provide the optimizer function, which in this case is the limited-memory BFGS minimizer from scipy.optimize, and you give it: 01:28:58.040 |
A function which will return the loss value at the current point, a starting point, and 01:29:09.440 |
a function which will return the gradients at the current point. 01:29:14.680 |
Now unfortunately we have a function which returns the loss and the gradients together, whereas SciPy wants them as two separate functions. 01:29:25.280 |
So a minor little detail is that we create a simple little class, and all this class 01:29:34.680 |
does, and again the details really aren't important, but all this class does is that when loss 01:29:40.520 |
is called, it calls that function that we created, passing in the current value of the 01:29:47.320 |
data, it gets back the loss and the gradients, and it returns the loss. 01:29:56.720 |
Later on when the optimizer asks for the gradients, it returns those gradients that it stored earlier. 01:29:56.720 |
So what this is doing is it's a little class which allows us to basically turn a Keras 01:30:10.560 |
function that returns the loss and the gradients together into two functions. 01:30:15.720 |
One which returns the loss, one which returns the gradients. 01:30:18.840 |
So it's a pretty minor detail, but it's a handy thing to have in your toolbox because 01:30:22.720 |
it means you now have something that can use deterministic optimizers on Keras functions. 01:30:31.200 |
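Here's a sketch of what such a class can look like, assuming fn is the Keras function from before and shp is the shape of the input image array:

```python
import numpy as np

class Evaluator:
    """Split a Keras function returning [loss, grads] into the two callables SciPy wants."""
    def __init__(self, fn, shp):
        self.fn, self.shp = fn, shp

    def loss(self, x):
        # call the Keras function once, return the loss and stash the gradients for later
        loss_, grad_ = self.fn([x.reshape(self.shp)])
        self.grad_values = grad_.flatten().astype(np.float64)
        return loss_.astype(np.float64)

    def grads(self, x):
        return self.grad_values       # the gradients stored by the last call to loss()
```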
So all we do is we loop through a small number of times, calling that optimizer each time. 01:30:43.000 |
So the starting point is just a random image. 01:30:49.200 |
So we just create a random image, and here is what a random image looks like. 01:30:55.800 |
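And the loop itself, sketched with SciPy's L-BFGS routine (the random starting range and the clipping values are just plausible guesses for mean-subtracted pixels, and evaluator, fn and shp are the assumed names from the snippets above):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

evaluator = Evaluator(fn, shp)

def solve_image(evaluator, shp, niter=10):
    x = np.random.uniform(-2.5, 2.5, shp)              # start from a random image
    for i in range(niter):
        x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                         fprime=evaluator.grads, maxfun=20)
        x = np.clip(x, -127, 127)                      # keep pixel values in a sensible range
        print('Iteration %d, loss %.2f' % (i, min_val))
    return x.reshape(shp)
```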
So let's go ahead and run that so we can see the results; I haven't actually run this yet. 01:31:16.040 |
So you can see it going along and solving here. 01:31:37.040 |
And here at the end of the 10th iteration is the result. 01:31:42.360 |
So remember what we did was we started with this image, we called an optimizer which took 01:31:51.040 |
that image and attempted to optimize this loss function where the target was the value 01:32:01.520 |
of this layer for our bird image, and the thing it was comparing it to was the layer 01:32:16.600 |
So we started with this, we ran that optimizer a bunch of times, calculating the gradient 01:32:22.120 |
of that loss with respect to the input to the model, the very pixels themselves. 01:32:28.480 |
And after 10 iterations it turned this random image into this thing. 01:32:34.400 |
So this is the thing which optimizes the block 5.1 layer. 01:32:45.160 |
And you can see it still looks like a bird, but by this point it really doesn't care what 01:32:49.820 |
the background looks like, it cares a lot what the eye looks like and the beak looks 01:32:53.880 |
like and the feathers look like, because these things all matter to ImageNet to make sure it classifies it as the right kind of bird. 01:33:00.440 |
If we look at an earlier layer, let's look at block 4.1, you can see it stays closer to the original image. 01:33:11.800 |
So when we do our artistic style, we can choose which layer will be our f-content. 01:33:17.320 |
And if we choose an earlier one, it's going to give it less degrees of freedom to look 01:33:23.400 |
like a different kind of bird, but it's going to look more like our original bird. 01:33:28.720 |
And so then here's a video showing how that happens, so there are the 10 steps. 01:33:41.800 |
And it's often helpful to be able to visualize the iterations of your generators at work. 01:33:52.920 |
So feel free to borrow this very simple code, you can just use matplotlib. 01:33:58.520 |
We actually used this in the last class, remember, when we animated the little linear optimizer. 01:34:08.000 |
You just have to define a function that gets called at each step of the animation, and 01:34:13.200 |
then you can just call animation.FuncAnimation passing in that function, and that's a nice 01:34:17.840 |
way that you can animate your own generators. 01:34:23.480 |
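A minimal sketch of that pattern, where frames is assumed to be the list of images you saved at each iteration:

```python
import matplotlib.pyplot as plt
from matplotlib import animation

fig, ax = plt.subplots()
im = ax.imshow(frames[0])

def animate(i):
    im.set_data(frames[i])        # called once per step of the animation
    return [im]

anim = animation.FuncAnimation(fig, animate, frames=len(frames), interval=200)
# in a notebook you can then display anim, e.g. with HTML(anim.to_html5_video())
```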
Question, we're using Keras and TensorFlow to extract the VGG features, these are used 01:34:29.960 |
by SciPy for BFGS, does the BFGS also run on the GPU? 01:34:35.840 |
No, there's really very little for the BFGS to do. 01:34:42.720 |
Especially for an optimizer like this, all of the work is in calling the loss function and the gradients. 01:34:48.560 |
The actual work of doing the bisection and doing the bracketing is so trivial that it really doesn't matter where it runs. 01:35:00.200 |
There's a question about the checkerboard artifact, the geometric pattern that's appearing. 01:35:07.240 |
This is actually not a checkerboard artifact exactly, checkerboard artifacts we will look 01:35:13.920 |
That was my interpretation mistake, not the questioner's mistake. 01:35:22.960 |
I'm not exactly sure why this particular kind of noise has appeared, honestly. 01:35:49.020 |
We have a single image which is being optimized, so there's really no batching to do here. 01:35:59.560 |
We'll look at a version which uses a very different approach and has batching shortly. 01:36:04.160 |
Has anyone tried something like this by averaging or combining the activations of multiple bird 01:36:10.360 |
images to create some kind of prototypical or novel bird? 01:36:19.960 |
Generative adversarial networks do something like that, but probably not quite. 01:36:33.320 |
You have to get a list of file names yourself from the list of files that you've downloaded. 01:36:44.360 |
And then just to make sure I understand this, someone says in this example we started with 01:36:49.540 |
a random image, but what if we started with the actual image as the initial condition? 01:37:05.800 |
They're interested to find out where we initialize for the artistic styling problem. 01:37:11.920 |
That was just a follow-up, we're going to get there. 01:37:20.600 |
Would it be useful to use a tool like Quiver to figure out which VGG layer to use for this? 01:37:27.200 |
It's so easy just to try a few and see what works. 01:37:34.440 |
We haven't got through as much as I hoped, but we're going to finish off this piece. 01:37:49.840 |
The only thing different is, A, we're not going to feed in a photo, we're going to feed in a style image. 01:37:56.080 |
And here's a few styles we could choose from. 01:37:57.680 |
We could do Van Gogh, we could do this little drawing, or we could do the Simpsons. 01:38:05.080 |
So we pick one of those and we create the style array in the same way as before. 01:38:14.300 |
This time, though, we're going to use multiple layers. 01:38:17.360 |
So I've created a dictionary from the name of the layer to its output, and so we're going 01:38:21.800 |
to use that to create an array of a number of the outputs. 01:38:30.320 |
We're going to grab the first, second and third block outputs. 01:38:38.800 |
So we're going to create our target as before, but we're going to use a different loss function. 01:38:48.920 |
The loss function is called style_loss, and just like before, it's going to use the MSE. 01:38:54.920 |
But rather than just the MSE on the activations, it's the MSE on something called the Gram matrix of the activations. 01:39:04.360 |
A Gram matrix is very simply the dot product of a matrix with its own transpose. 01:39:13.520 |
So here it is here, dot product of some matrix with its own transpose. 01:39:20.160 |
And I just divide it by the number of elements to turn it into an average. 01:39:25.920 |
So what is this matrix that we're taking the dot product of it as transpose? 01:39:31.200 |
Well what it is, is that we start with our image, and remember the image is height by 01:39:35.760 |
width by channels, and we change the order of dimensions, so it's channels by height by width. 01:39:46.640 |
What batch flatten does is it takes everything except the first dimension and flattens it out. 01:39:53.160 |
This is now going to be a matrix where the rows are the channels and the columns are the flattened x,y locations. 01:40:03.920 |
This is C by H by W, the result of this will be C rows and H times W columns. 01:40:12.840 |
So when you take the dot product of something with a transpose of itself, what you're basically 01:40:19.840 |
doing is creating something a lot like a correlation matrix. 01:40:23.520 |
You're saying how similar is each row to each other row? 01:40:36.480 |
You can think about it like a cosine, a cosine is basically just a dot product. 01:40:42.400 |
You can think of it as a correlation matrix, it's basically a normalized version of this. 01:40:51.200 |
So maybe if it's not clear to you, write it down on a piece of paper on the way home tonight. 01:40:55.200 |
Just think about taking the rows of a matrix, and then flipping it around, and you're basically 01:41:02.600 |
then turning them into columns, and then you're multiplying the rows by the columns, it's 01:41:07.400 |
basically the same as taking each row and comparing it to each other row. 01:41:13.200 |
So that's what this Gram matrix is, it's basically saying for every channel, how similar are its activations to every other channel's. 01:41:24.500 |
So if channel number 1 in most parts of the image is very similar to channel 3 in most 01:41:33.120 |
parts of the image, then 1,3 of this result will be a higher number. 01:41:40.320 |
So it's kind of a weird matrix, it basically tells us, it's like a fingerprint of how the 01:41:46.920 |
channels relate to each other in this particular image, or how the filters relate to each other 01:41:52.320 |
in a particular layer of this particular image. 01:41:54.880 |
I think the most important thing to recognize is that there is no geometry left here at all. 01:42:02.000 |
The x and the y coordinates are totally thrown away, they're actually flattened out. 01:42:08.560 |
So this loss function can by definition in no way at all contain anything about the content 01:42:16.200 |
of the image, because it's thrown away all of the x and y information, and all that's 01:42:21.400 |
left is some kind of fingerprint of how the channels relate to each other, how the filters 01:42:29.720 |
So this style loss then says for two different images, how do these fingerprints differ? 01:42:38.920 |
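Here's a sketch of those two functions in Keras backend code; x is a single image's activations in channels-last order with the batch dimension already indexed away, and the normalisation assumes the TensorFlow backend with a fixed input shape:

```python
from keras import backend as K

def gram_matrix(x):
    # rows = channels, columns = flattened x,y locations
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    # dot product with its own transpose: how much each filter co-occurs with each other,
    # divided by the number of elements to make it an average
    return K.dot(features, K.transpose(features)) / x.get_shape().num_elements()

def style_loss(x, targ):
    # MSE between the two "fingerprints"
    return K.mean(K.square(gram_matrix(x) - gram_matrix(targ)))
```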
So it turns out that if you now do the exact same steps as before using that as our loss 01:42:44.600 |
function and you run it through a few iterations, it looks like that. 01:42:53.680 |
It looks a lot like the original Van Gogh, but without any of the content. 01:43:14.680 |
So a paper just came out two weeks ago called Demystifying Neural Style Transfer with a 01:43:20.360 |
mathematical treatment where they claim to have an answer to this question. 01:43:24.760 |
But from the point at which this was created, a year and a half ago, until now, no one really knew why it works. 01:43:36.080 |
But the important thing that the authors of this paper realized is, if we could create 01:43:40.480 |
a function that gives you content loss and a function that gives you style loss, and 01:43:45.320 |
you add the two together and optimize them, you can do neural style. 01:43:49.720 |
So all I can assume is that they tried a few different things. 01:43:55.360 |
They knew that they had to throw away all of the geometry, so they probably tried a 01:43:59.120 |
few things that throw away the geometry, and at some point they looked at this one and saw that it worked. 01:44:06.200 |
So now that we have this magical thing, there's the Simpsons, all we have to do is add the content loss and the style loss together. 01:44:19.360 |
We've got our style layers, I'm actually going to take the top five now. 01:44:26.160 |
Here's our content layer, I'm going to take block 4, conv 2. 01:44:30.060 |
As promised, for our loss function, I'm just going to add the two together. 01:44:35.280 |
Style loss for all of the style layers, plus the content loss. 01:44:41.920 |
And I'm going to divide the content loss by 10. 01:44:43.880 |
This is something you can play with, and in the paper you'll see they play with it. 01:44:48.440 |
How much style loss versus how much content loss? 01:44:50.880 |
Get the gradients, create the evaluator, solve it, and there it is. 01:44:59.020 |
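Spelled out as a sketch, with the same assumed names as before (style_layers/style_targs and content_layer/content_targ being the symbolic layer outputs and their wrapped targets):

```python
from keras import backend as K

# combined objective: style loss summed over the style layers, plus a down-weighted content loss
loss = sum(style_loss(l[0], t[0]) for l, t in zip(style_layers, style_targs)) \
       + K.mean(K.square(content_layer - content_targ)) / 10.
grads = K.gradients(loss, model.input)
fn = K.function([model.input], [loss] + grads)
```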
Other than the fact that we don't really know why the style loss works (but it does), everything here is stuff we've already seen. 01:45:07.320 |
So there's the bird as Van Gogh, there's the bird as the Simpsons, and there's the bird 01:45:14.920 |
There's a question, "Since the publication of that paper, has anyone used any other loss 01:45:21.400 |
functions for f_style that achieve similar results?" 01:45:24.560 |
Yeah, so as I mentioned, just a couple of weeks ago there was a paper, I'll put it on 01:45:28.360 |
the forum, that tries to generalize this loss function. 01:45:32.080 |
It turns out actually that this particular loss function seems to be about the best that anyone has found. 01:45:39.200 |
So it's 9 o'clock, so we have run out of time. 01:45:42.680 |
So we're going to move some of this lesson to the next lesson, but to give you a sense 01:45:46.880 |
of where we're going to head, what we're going to do is we're going to take this thing where 01:45:52.740 |
you have to optimize every single image separately, and we're going to train a CNN, which will 01:46:00.840 |
learn how to turn a picture into a Van Gogh version of that picture. 01:46:06.340 |
So that's basically going to be what we're going to learn next time, and we're also going 01:46:09.680 |
to learn about adversarial networks, which is where we're going to create two networks. 01:46:14.680 |
One will be designed to generate pictures like this, and the other will be designed 01:46:19.800 |
to try and classify whether this is a real Simpsons picture or a fake Simpsons picture. 01:46:26.600 |
And then you'll do one, generate, the other, discriminate, generate, discriminate. 01:46:31.680 |
And by doing that, we can take any generative model and make it better by basically having 01:46:39.280 |
something else learn to pick the difference between the real and the fake. 01:46:45.120 |
And then finally we're going to learn about a particular thing that came out three weeks 01:46:47.960 |
ago called the Wasserstein GAN, which is the reason I actually decided to move all of this 01:46:54.960 |
Generative adversarial networks basically didn't work very well at all until about three weeks ago. 01:47:00.000 |
Now that they do work, suddenly there's a shitload of stuff that nobody's done yet,