Lesson 8: Cutting Edge Deep Learning for Coders
Chapters
0:00 Intro
3:54 Architecture
5:09 Key Insights
8:05 Technology Foundation Changes
10:28 TensorFlow
13:54 Productionization
16:49 TensorFlow Toolkit
20:11 XLA
26:05 PyTorch vs TensorFlow
31:39 Building a box
36:21 Reading papers
39:29 Writing
41:19 Part 2 Outline
44:41 Part 2 Code
52:11 Part 2 Notebook
54:11 Mendeley Desktop
56:26 Arxiv
57:16 Arxiv Sanity Preserver
00:00:00.000 |
Some of you have finished Part 1 in the last few days, some of you finished Part 1 in December. 00:00:07.000 |
I did ask those of you who took it in person to revise the material and make sure it was 00:00:13.880 |
up to date, but let's do a quick summary of the key things we learned. 00:00:22.880 |
I've been interested to hear if anybody has other key insights that they feel they came away with. 00:00:36.160 |
Stacks of nonlinear functions with lots of -- well, stacks of differentiable nonlinear 00:00:43.000 |
functions with lots of parameters solve nearly any predictive modeling problem. 00:00:47.640 |
So when we say neural network, a lot of people are suggesting we should use the phrase differentiable 00:00:54.640 |
If you think about things like the collaborative filtering we did, it was really a couple of 00:01:00.640 |
embeddings and a dot product and that gave us quite a long way, there's nothing very 00:01:09.560 |
But we know that when we stack certain kinds of nonlinear functions on top of each other, 00:01:15.440 |
the universal approximation theorem tells us that can approximate any computable function 00:01:21.520 |
to arbitrary precision, we know that if it's differentiable we can use SGD to find the 00:01:31.240 |
So this to me is kind of like the key insight. 00:01:36.640 |
But some stacks of functions are better than others for some kinds of data and some kinds 00:01:46.840 |
One way to make life very easy, we learned, is transfer learning. 00:01:51.920 |
I think nearly every network we created in the last course, we used transfer learning. 00:01:59.360 |
I think particularly in vision and in text, so pretty much everything. 00:02:04.880 |
So transfer learning generally was throw away the last layer, replace it with a new one 00:02:10.880 |
that has the right number of outputs, pre-compute the penultimate layer's output, then very 00:02:17.440 |
quickly create a linear model that goes from that to your preferred answer. 00:02:23.560 |
You now have something that works pretty well, and then you can fine-tune more and more layers 00:02:30.600 |
And we learned that when fine-tuning those additional layers, generally the best way to do that 00:02:35.200 |
was to pre-compute the output of the last of the layers which we are not fine-tuning, and so then you 00:02:40.280 |
only had to calculate the weights of the remaining ones, and that saved us lots and lots of time. 00:02:48.040 |
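For concreteness, here's a minimal Keras-2-style sketch of that recipe; `vgg`, the data arrays, `num_classes` and the choice of cut point are placeholders rather than the course's exact code.

```python
from keras.models import Model, Sequential
from keras.layers import Dense

# Drop the old top layer by building a model that ends at the penultimate layer.
penultimate = Model(vgg.input, vgg.layers[-2].output)

# 1. Pre-compute the penultimate activations once: the frozen layers never
#    change, which is what makes the next step so fast.
trn_feats = penultimate.predict(trn_data, batch_size=64)
val_feats = penultimate.predict(val_data, batch_size=64)

# 2. Fit a small linear model from those features to the new labels.
head = Sequential([Dense(num_classes, activation='softmax',
                         input_shape=trn_feats.shape[1:])])
head.compile(optimizer='adam', loss='categorical_crossentropy',
             metrics=['accuracy'])
head.fit(trn_feats, trn_labels, validation_data=(val_feats, val_labels))

# 3. Later, unfreeze a few more layers of `vgg` and fine-tune with a low
#    learning rate, pre-computing the output of whatever stays frozen.
```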
And remember that convolutional layers are slower, so let's fix up the previous one as 00:03:13.080 |
Convolutional layers are slower, dense layers are bigger, and there's an interesting question 00:03:19.520 |
I've added here, which is, remember in the last lesson, we kind of looked at ResNets 00:03:25.560 |
and InceptionNets and in general more modern nets tend not to have any dense layers. 00:03:31.560 |
So what's the best way to do transfer learning? 00:03:35.280 |
I'm going to leave that as an open question for now. 00:03:37.840 |
We're going to look into it a bit during this class, but it's not a question that anybody 00:03:44.680 |
So I'll suggest some ideas, but no one's even written a paper that attempts to address it 00:03:57.680 |
Given we have transfer learning to get us a long way, the next thing we have to get 00:04:01.080 |
us a long way is to try and create an architecture which suits our problem, both our data and 00:04:09.680 |
So for example, if we have autocorrelated inputs, so in other words, each input is related 00:04:17.680 |
to the previous input, so each pixel is similar to the next-door pixel, or in a sound wave, 00:04:23.320 |
each sample is similar to the previous sample, something like that, that kind of data we 00:04:28.880 |
tend to like to use CNNs for, as long as it's of a fixed size; if it's a sequence, we like to 00:04:35.320 |
use an RNN for it; and if it's a categorical output, we like to use a softmax for it. 00:04:40.680 |
So there are ways we learned of tuning our architecture, not so that it makes it possible 00:04:45.800 |
to solve a problem, because any standard dense network can solve any problem, but it just 00:04:52.760 |
makes it a lot faster and a lot easier to train if you've made sure that your activation 00:05:00.360 |
functions and your architecture suit the problem. 00:05:04.360 |
So that was another key thing I think we learned. 00:05:11.800 |
And something I hope that everybody can narrate is the five steps to avoiding overfitting. 00:05:18.320 |
If you've forgotten them, they're both here and discussed in more detail in lesson 3. 00:05:23.880 |
Get more data; fake getting more data using data augmentation; use more generalizable architectures, 00:05:32.360 |
architectures that generalize well, particularly when we looked at batch normalization; use regularization 00:05:38.240 |
techniques, as few of them as we can, because by definition they destroy some information, but we looked particularly 00:05:49.320 |
And then finally if we have to, we can look at reducing the complexity of the architecture. 00:05:54.360 |
The general approach we learned, this was absolutely key, is first of all with a new 00:06:00.760 |
problem, start with a network that's too big, it's not regularized, it can't help but solve 00:06:07.760 |
the problem, even if it has to overfit terribly. 00:06:12.680 |
If you can't do that, there's no point starting to regularize yet. 00:06:16.760 |
So we start out by trying to overfit terribly. 00:06:19.720 |
Once we've got to the point that we're getting 100% accuracy and our validation set's terrible 00:06:23.720 |
because it's overfitting, then we start going through these steps until we get a nice balance. 00:06:31.120 |
So that's kind of the process that we learned. 00:06:34.360 |
And then finally we learned about embeddings as a technique to allow us to use categorical 00:06:40.840 |
data, and specifically the idea of using words, or the idea of using latent variables. 00:06:49.760 |
So in this case, this was the movie lens dataset for collaborative filtering. 00:07:02.960 |
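As a reminder of how small that model really was, here's a stripped-down Keras-2-style sketch of the embedding dot-product; the sizes and names are illustrative, not the MovieLens notebook's exact code.

```python
from keras.layers import Input, Embedding, Flatten, dot
from keras.models import Model

n_users, n_movies, n_factors = 1000, 1700, 50   # example sizes

user_in  = Input(shape=(1,), dtype='int64')
movie_in = Input(shape=(1,), dtype='int64')
u = Flatten()(Embedding(n_users,  n_factors)(user_in))    # latent user factors
m = Flatten()(Embedding(n_movies, n_factors)(movie_in))   # latent movie factors
rating = dot([u, m], axes=1)                              # dot product -> predicted rating

model = Model([user_in, movie_in], rating)
model.compile(optimizer='adam', loss='mse')
# model.fit([user_ids, movie_ids], ratings, epochs=3)
```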
Did anybody have any other kind of key takeaways that they think people revising should think 00:07:10.680 |
about or remember, or things they found interesting? 00:07:22.920 |
How does having duplicates and training data affect the model created? 00:07:26.760 |
And if you're using data augmentation, do you end up with duplicate data? 00:07:35.440 |
Duplicates in the input data, I mean it's not a big deal, because we shuffle the batch 00:07:41.800 |
and then you select things randomly, effectively, you're weighting that data point higher than 00:07:49.440 |
So in a big dataset, it's going to make very little difference. 00:07:54.000 |
If you've got one thing repeated 1,000 times and then there's only another 100 data points, 00:07:57.640 |
that's going to be a big problem because you're weighting one data point 1,000 times higher. 00:08:05.880 |
So as you will have seen, we've got a couple of big technology foundation changes. 00:08:12.080 |
The first one is we're moving from Python 2 to Python 3. 00:08:16.640 |
Python 2 I think was a good place to start, given that a lot of the folks in Part 1 had 00:08:23.200 |
never coded in Python before and many of them had never written very substantial pieces 00:08:31.300 |
And a lot of the tutorials out there, like for example one of our preferred starting 00:08:35.800 |
points, which is Learn Python the Hard Way, is in Python 2, and a lot of the existing code 00:08:40.840 |
out there is in Python 2, so we thought Python 2 was a good place to start. 00:08:45.480 |
One is, are you going to post the slides after this? 00:08:50.360 |
And the other is, could you go through steps for underfitting at some point, how to deal 00:09:00.200 |
So why don't you create a forum thread asking about underfitting, but you don't need to 00:09:04.160 |
do that in the Part 2 forum, you can do that in the main forum because lots of people would 00:09:11.000 |
If you want to revise that, lesson 3 starts out by talking about underfitting. 00:09:24.400 |
I don't think we should keep using Python 2 though for a number of reasons. 00:09:28.240 |
One is that since then the IPython folks have come out and said that the next version won't 00:09:32.760 |
be compatible with Python 2, so that's a problem. 00:09:37.040 |
Also, from 2020 onwards, Python 2 will be end of life, which means there won't be patches 00:09:44.960 |
Also, we're going to be doing more stuff with concurrency and parallel programming this 00:09:49.000 |
time around, and the features in Python 3 are a lot better. 00:09:54.160 |
And then Python 3.6 was just released, which has some very nice features in particular, 00:09:58.520 |
some string formatting, which for some people it's no big deal, but to me it saves a lot 00:10:04.480 |
So we're going to move across to Python 3, and hopefully you've all gone through the 00:10:10.240 |
And there are some tips on the forum about how to have both run at the same time, although 00:10:16.520 |
I agree with the suggestion I had read from somebody which was go ahead, suck it up and 00:10:22.440 |
do the translation once now so you don't have to worry about it. 00:10:30.640 |
Much more interesting and much bigger is the move from Theano to TensorFlow. 00:10:35.640 |
So Theano, we thought, was a better starting point because it has a much simpler API. 00:10:41.800 |
There's very few new concepts to learn to understand Theano. 00:10:47.920 |
You see, TensorFlow lives within Google's whole ecosystem. 00:10:55.120 |
It's got its own file serialization system called Protobuf. 00:10:58.880 |
It's got its own profiler method based on Chrome. 00:11:04.720 |
But if you've come this far, then you're already investing the time. 00:11:09.400 |
We think it's worth investing the time in TensorFlow because there's a lot of stuff which just 00:11:15.200 |
in the last few weeks, it's been able to do that's pretty amazing. 00:11:18.640 |
So Rachel wrote this post about how much TensorFlow sucks, for which we got invited to the TensorFlow 00:11:31.000 |
Dev Summit and got to meet all the TensorFlow core team. 00:11:40.320 |
So looking at moving from Theano to TensorFlow, we got invited to the TensorFlow Dev Summit 00:11:48.880 |
and we were pretty amazed at all the stuff that's literally just been added. 00:12:01.840 |
If you google for TensorFlow Dev Summit videos, you can watch the videos about all this. 00:12:06.520 |
That's the most exciting thing for us, is that they are really investing in a simplified 00:12:12.640 |
So if you look at this code, you can create a deep neural network regressor on a mixture 00:12:21.080 |
of categorical and real variables using an almost R-like syntax and fit it in two lines 00:12:30.560 |
You'll see that those lines of code at the bottom, the two lines to fit it, look very much like Keras. 00:12:36.680 |
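The slide itself isn't reproduced here, but the flavour of that high-level API is roughly the following estimator/feature-column style; the module paths have moved between releases (it was tf.contrib.learn at the time), and the column names and toy data are made up.

```python
import tensorflow as tf

# Mix of a real-valued and a categorical feature, declared as feature columns.
real_col = tf.feature_column.numeric_column('sq_footage')
cat_col  = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        'city', ['sf', 'nyc', 'chicago']))

est = tf.estimator.DNNRegressor(feature_columns=[real_col, cat_col],
                                hidden_units=[64, 32])

def input_fn():
    # Returns (features dict, labels); a toy two-row batch.
    features = {'sq_footage': tf.constant([[1200.], [800.]]),
                'city':       tf.constant([['sf'], ['nyc']])}
    labels = tf.constant([2.1, 1.3])
    return features, labels

est.train(input_fn, steps=100)   # the "fit it in two lines" part
```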
The Keras author has been a wonderful influence on Google, and in fact everywhere we saw at 00:12:48.440 |
So TensorFlow and Keras are kind of becoming more and more one, which is terrific. 00:12:54.040 |
So one is that they're really investing in the API. 00:12:58.400 |
The second is that some of the tooling is looking pretty good. 00:13:04.600 |
Things like these graphs showing you how your different layers are distributed and how that's 00:13:09.760 |
changed over time can really help to debug what's going on. 00:13:14.220 |
So if you get some kind of gradient saturation in a layer, you can dig through these graphs 00:13:27.200 |
This guy, if I remember correctly, his name was Dandelion and his signature was an emoji of 00:13:37.680 |
If you watch this video, he kind of walks you through showing some of the functionality 00:13:50.280 |
that's there and how to use it, and I thought that was pretty helpful. 00:13:56.960 |
One of the most important ones to me is that TensorFlow has a great story about productionization. 00:14:03.560 |
For part one, I didn't much care about productionization. 00:14:06.240 |
It was really about playing around, what can we learn. 00:14:10.640 |
At this point, I think we might be starting to think about how do I get my stuff online 00:14:19.000 |
These points are talking about something in particular which is called TensorFlow Serving. 00:14:22.840 |
And TensorFlow Serving is a system that can take your trained TensorFlow model and create 00:14:28.960 |
an API for it which does some pretty cool things. 00:14:32.720 |
For example, think about how hard it would be without the help of some library to productionize 00:14:47.360 |
How do you make sure that you don't saturate all those GPUs, that you send the request to 00:14:51.320 |
one that's free, that you don't use up all of your memory. 00:14:54.640 |
Better still, how do you grab a few requests, put them into a batch, put them in all to 00:15:01.080 |
the GPU at once, get the bits out of the batch, put them back to the people that requested 00:15:09.200 |
It's very early days for this software, a lot of things don't work yet, but you can download 00:15:15.120 |
an early version and start playing with it, and I think that's pretty interesting. 00:15:19.240 |
With the high-level API in TensorFlow, what's going to be the difference between the Keras 00:15:30.000 |
In fact, Keras will become a namespace within TensorFlow, tf.keras. 00:15:39.480 |
So Keras will become the official top-level API for TensorFlow, and in fact Rachel was 00:15:49.400 |
I was just going to add that TensorFlow is kind of introducing a few different libraries 00:15:53.680 |
at different layers, different levels of abstraction. 00:15:58.360 |
There's this concept of an Estimator API that appears everywhere and basically is the 00:16:06.400 |
I think there's a layers API below the Keras API. 00:16:13.920 |
So all the stuff you've learned about Keras is going to be very helpful, not just in using 00:16:18.520 |
Keras on TensorFlow, but in using TensorFlow directly. 00:16:25.840 |
Another interesting thing about TensorFlow is that they've built a lot of cool integrations 00:16:30.740 |
with various cluster managers and distributed storage systems and stuff like that. 00:16:34.680 |
So it will kind of fit into your production systems more neatly, use the data in whatever 00:16:39.640 |
place it already is more neatly, so if your data is in S3 or something like that, you 00:16:45.680 |
can generally throw it straight into TensorFlow. 00:16:50.480 |
Something I found very interesting is that they announced a couple of weeks ago a machine 00:16:56.560 |
learning toolkit which brings really high-quality implementations of a wide variety of non-deep 00:17:05.440 |
So all these are GPU-accelerated, parallelized, and supported by Google. 00:17:15.360 |
And a lot of these have a lot of tech behind them. 00:17:18.440 |
For example, the random forest, there's a paper, they actually call it the Tensor Forest, 00:17:23.720 |
which explains all of the interesting things they did to create a fast GPU-accelerated random 00:17:33.960 |
Will you give an example of how to solve gradient saturation with TensorFlow tools? 00:17:41.680 |
We'll see how we go, because I think the video from the Dev Summit, which is available online, 00:17:48.680 |
So I would say look at that first and see if you still have questions. 00:17:53.160 |
All the videos from the Dev Summit are online. 00:18:01.440 |
Is there an idea for using deep learning on AWS Lambda? 00:18:06.000 |
Not that I've heard of, and in fact in general, Google has a service version of TensorFlow 00:18:17.480 |
serving called Google Cloud ML where you can pay them a few cents a transaction and 00:18:26.320 |
There isn't really something like that through Amazon, as far as I'm aware. 00:18:32.880 |
And then finally in terms of TensorFlow, I had an interesting and infuriating few weeks 00:18:40.360 |
trying to prepare for this class and trying to get something working that would translate 00:18:46.400 |
And every single example I found online had major problems. 00:18:52.480 |
Even the official TensorFlow tutorial missed out a key thing which is that the lowest level 00:18:59.600 |
of a language model really should be bi-directional, as this one shows, bi-directional RNN and 00:19:06.180 |
trying to figure out how to make it work was horrible, trying to get it to work in Keras. 00:19:12.800 |
Finally, basically the issue is this, modern RNN systems like a full neural translation 00:19:22.600 |
system involve a lot of tweaking and mucking around with the innards of the RNN using things 00:19:31.680 |
And there just hasn't been an API that really lets that happen. 00:19:38.120 |
So I finally got it working by switching to PyTorch, which we'll learn about soon, but 00:19:43.880 |
I was actually going to start, the first lesson was going to be about neural translation and 00:19:48.320 |
I've put it back because TensorFlow has just released a new system for RNNs which looks 00:19:55.880 |
like it's going to make all this a lot easier. 00:19:58.120 |
So this is an exciting idea is that there's an API that allows us to create some pretty 00:20:04.600 |
powerful RNN implementations and we're going to be absolutely needing that when we learn 00:20:13.840 |
Again, early days, but there is something called XLA, which is the Accelerated Linear 00:20:20.160 |
Algebra compiler, I think, which is a system which takes TensorFlow code and compiles it. 00:20:33.280 |
And so for those of you that know something about compiling, you know that a compilation 00:20:37.460 |
can do a lot of clever stuff in terms of identifying dead code or unrolling loops or fusing operations 00:20:48.840 |
Now at this stage, it takes your TensorFlow code and turns it into machine code. 00:20:53.360 |
One of the cool things that lets you do is run it on a mobile phone with almost no supporting 00:20:58.520 |
libraries using native machine instructions on that phone, much less memory. 00:21:06.080 |
But one of the really interesting discussions I had at the summit was with Scott Gray, who 00:21:13.320 |
He was the guy that massively accelerated neural network kernels when he was at Nervana. 00:21:20.800 |
He had kernels that were two or three times faster than Nvidia's kernels. 00:21:25.880 |
I don't know of anybody else in the world who knows more about neural network performance 00:21:32.420 |
He told me that he thinks that XLA is the key to creating performant, concise, expressive 00:21:47.600 |
The idea is currently, if you look in the TensorFlow code, it's thousands and thousands 00:21:55.800 |
The idea is you throw all that away and replace it with a small number of lines of TensorFlow 00:22:03.040 |
So that's something that's actually got me pretty excited. 00:22:20.400 |
The API is full of not-invented-here syndrome. 00:22:25.320 |
It's clearly written by a bunch of engineers who have not necessarily spent that much time 00:22:35.600 |
It's full of these Googleisms in terms of having to fit into their ecosystem. 00:22:43.000 |
But most importantly, like Theano, you have to set up the whole computation graph and 00:22:51.880 |
then you kind of go run, which means that if you want to do stuff in your computation 00:22:57.000 |
graph that involves like conditionals, if-then statements, if this happens, you do this other 00:23:10.560 |
It turns out that there's a very different way of programming neural nets, which is dynamic 00:23:19.800 |
computation, otherwise known as define through run. 00:23:24.060 |
There's a number of libraries that do this, Torch, PyTorch, Chainer, DyNet, they're the ones 00:23:35.640 |
And we're going to be looking at one that was released, but an early version was put 00:23:41.920 |
out about a month ago called PyTorch, which I've started rewriting a lot of stuff in, 00:23:49.120 |
and a lot of the more complex stuff just becomes suddenly so much easier. 00:23:53.760 |
And because it becomes easier to do more complex things, I often find I can create faster and 00:24:05.960 |
So even although PyTorch is very, very, very new, it is coming out of the same people that 00:24:12.660 |
built Torch, which really all of Facebook's systems build on top of. 00:24:18.080 |
I suspect that Facebook are in the process of moving across from Torch to PyTorch. 00:24:23.280 |
It's already full of incredibly cool stuff, as you'll see. 00:24:29.000 |
So we will be using increasingly more and more PyTorch during this course. 00:24:35.540 |
There was a question, "Does precompiling mean that we'll write TensorFlow code and test it 00:24:45.560 |
and then when we train a big model, then we precompile the code and train our model?" 00:24:50.400 |
Yeah, so if we're talking about XLA, XLA can be used a number of ways. 00:24:56.420 |
One is that you come up with some different kind of kernels, a different kind of factorization, 00:25:05.120 |
You write it in TensorFlow, you compile it with XLA, and then you make it available to 00:25:10.680 |
anybody so when they use your layer, they're getting this compiled optimized code. 00:25:17.460 |
It could mean that when you use TensorFlow serving, TensorFlow serving might compile 00:25:23.400 |
your code using XLA and be serving up an accelerated version of it. 00:25:31.640 |
RNNs often involve nowadays, as you'll learn, some kind of complex customizations of a bidirectional 00:25:38.680 |
layer and then some stacked layers and an attention layer, and then that's fed into a separate stacked 00:25:43.640 |
decoder; you can fuse that together into a single layer called bidirectional attention 00:25:53.360 |
sequence-to-sequence, and indeed Google have actually built that kind of thing. 00:25:59.240 |
There's various ways in which neural network compilation can be very helpful. 00:26:09.960 |
What is the relationship between TensorFlow and PyTorch? 00:26:14.240 |
There's no relationship, so TensorFlow is Google's thing, PyTorch is I guess it's kind 00:26:22.120 |
of Facebook's thing, but it's also very much a community thing. 00:26:27.000 |
TensorFlow is a huge complex beast of a system which uses all kinds of advanced software 00:26:39.280 |
In theory, that ought to make it terribly fast. 00:26:41.640 |
In practice, a recent benchmark actually showed it to be about the slowest, and I think the 00:26:45.520 |
reason is because it's so big and complex, it's so hard to get everything to work together. 00:26:50.520 |
In theory, PyTorch ought to be the slowest because this defined by run system means it's 00:26:56.280 |
way less optimization that the systems can do, but it turned out to be amongst the fastest 00:27:01.760 |
because it's so easy to write code, it's so much easier to write good code. 00:27:07.080 |
It's interesting, I think there's such different approaches, I think it's going to be great 00:27:13.760 |
to know both because there are going to be some things that are going to be fantastic 00:27:17.400 |
in TensorFlow and some things that are going to be fantastic in PyTorch. 00:27:21.640 |
They couldn't be more different, which is why I think there are two good things to learn. 00:27:28.240 |
So wrapping up this introductory part, I wanted to kind of change your expectations about 00:27:37.760 |
how you've learned so far to how you're going to learn in the future. 00:27:41.240 |
Part 1 to me was about showing you best practices. 00:27:44.280 |
So generally it's like, here's a library, here's a problem, you use this library in these steps 00:27:50.720 |
to solve this problem, and you do it this way, and lo and behold we've gotten the top 00:28:00.000 |
I tried to select things that had best practices. 00:28:05.680 |
So you now know everything I know about best practices. 00:28:08.960 |
I don't really have anything else to tell you. 00:28:12.080 |
So we're now up to stuff I haven't quite figured out yet, nor is anybody else, but you probably 00:28:22.320 |
So some of it, for example, like neural translation, that's an example of something that is solved. 00:28:30.600 |
Google solved it, but they haven't released the way they solved it. 00:28:35.320 |
So the rest of us are trying to put everything together and figure out how to make something 00:28:43.760 |
More often it's going to be, here's a sequence of things you can do that can get some pretty 00:28:50.600 |
good results here, but there's a thousand things you could do to make it better that 00:28:59.600 |
Or thirdly, here's a sequence of things that solves this pretty well, but gosh we wrote 00:29:09.640 |
I'm sure this could be abstracted really nicely, but no one's done that yet. 00:29:13.000 |
So they're kind of the three main categories. 00:29:15.320 |
So generally at the end of each class it won't be like, okay, that's it, that's how you do 00:29:21.600 |
It'll be more like, here are the things you can explore. 00:29:23.960 |
And so the homework will be pick one of these interesting things and dig into it, and generally 00:29:31.800 |
speaking that homework will get you to a point that probably no one's done before, or at 00:29:40.240 |
I found as I built this, I think nearly every single piece of code I'm presenting, I was 00:29:49.360 |
unable to find anything online which did that thing correctly. 00:29:54.280 |
There was often example code that claimed to be something like that, but again and again 00:30:02.040 |
And we'll talk about some of the things that it was missing as we go, but one very common 00:30:06.680 |
one was it would only work on a single item at a time, it wouldn't work with a batch. 00:30:12.480 |
Therefore the GPU is basically totally wasted. 00:30:16.200 |
Or it failed to get anywhere near the performance that was claimed in the paper that it was 00:30:24.040 |
So generally speaking there's going to be lots of opportunities if you're interested 00:30:28.480 |
to write a little blog post about the things you tried and what worked and what didn't, 00:30:33.320 |
and you'll generally find that there's no other post like that out there. 00:30:39.040 |
Particularly if you pick a dataset that's in your domain area, it's very unlikely that 00:30:46.480 |
Going back, can we use TensorFlow and Torch together? 00:30:55.800 |
Torch is very similar, but it's written in Lua, which is a very small embedded language. 00:31:04.720 |
Very good for what it is, but not very good for what we want to do. 00:31:10.240 |
So PyTorch is kind of a port of Torch into Python, which is pretty cool. 00:31:23.640 |
In general, you can do a few steps with TensorFlow to get to a certain point, and then a few 00:31:30.400 |
You can't integrate them into the same network, because they're very different approaches, 00:31:34.800 |
but you can certainly solve a problem with the two of them together. 00:31:42.240 |
So for those of you who have some money left over, I would strongly suggest building a 00:31:48.440 |
box. And the reason I suggest building a box is because you're paying 90 cents an hour 00:31:56.720 |
I know a lot of you are spending a couple of hundred bucks a month on AWS bills. 00:32:04.040 |
Here is a box that costs $550 and will be about twice as fast as a P2. 00:32:12.600 |
So it's just not good value to use a P2, and it's way slower than it needs to be. 00:32:23.440 |
And also building a box, it's one of the many things that's just good to learn, is understanding 00:32:31.560 |
So I've got some suggestions here about what box to build for various different budgets. 00:32:39.320 |
You certainly don't have to, but this is my recommendation. 00:32:48.420 |
More RAM helps more than I think people who discuss this stuff online quite appreciate. 00:32:54.960 |
12GB of RAM means twice as big of batch sizes, which means half as many steps necessary to 00:33:02.920 |
That means more stable gradients, which means you can use higher learning rates. 00:33:07.700 |
So more RAM I think is often under-appreciated. 00:33:16.000 |
It is a lot more expensive, but you can get the previous generation's version secondhand, 00:33:22.920 |
So there's a Titan X Pascal, which is the current one, or the Titan X Maxwell, which 00:33:30.000 |
The previous generation one is not a big step back at all, it still has 12GB RAM. 00:33:34.720 |
If you can get one used that would be a great option. 00:33:41.880 |
The GTX 1080 and 1070 are absolutely fantastic as well. 00:33:48.560 |
They're nearly as good as the Titan X, but they just have 8GB rather than 12GB. 00:33:54.680 |
Going back to a GTX 980, which is the kind of previous generation consumer top-end card, 00:34:03.460 |
So of all the places you're going to spend money on a box, put nearly all of it into 00:34:11.080 |
Every one of these steps, the 1070, the Titan X Pascal, they're big steps up. 00:34:19.200 |
And as you will have seen from part 1, if you've got more RAM, it really helps because 00:34:26.280 |
you can pre-compute more stuff and keep it in RAM. 00:34:29.160 |
Having said that, there's a new kind of hard drive, an NVMe drive, non-volatile memory. 00:34:39.840 |
They're not that far away from RAM like speeds, but they're hard drives. 00:34:47.400 |
You have to get a special kind of motherboard, but if you can afford it, it's going to be 00:34:57.920 |
That's going to really allow you to put all of your currently used data on that drive 00:35:06.880 |
Question: Doesn't the batch size also depend heavily on the video RAM? 00:35:12.120 |
Answer: That's what I was referring to, the 12GB, I'm talking about the RAM that's on 00:35:16.760 |
Question: Does upgrading RAM allow bigger batch sizes? 00:35:20.800 |
Answer: Upgrading the card, the video card's RAM. 00:35:27.480 |
You buy a card that has X amount of RAM, so Titan X has 12, GTX 1080, 8, GTX 980, 4, so 00:35:38.280 |
Upgrading the amount of RAM that's in your computer doesn't change your batch size, it 00:35:42.800 |
just changes the amount you can pre-compute unless you use an NVMe drive, in which case 00:35:56.760 |
You can go to Central Computers, which is a San Francisco computer shop, for example, 00:36:03.920 |
There's a fantastic thread on the forums; Brendan, one of the participants in the course, 00:36:10.080 |
has a great Medium post, he went there, explaining his whole journey to getting something built. 00:36:20.960 |
Alright, it's time to build your box and while you wait for things to install, it's time 00:36:29.240 |
So papers are, if you're a philosophy graduate like me, terrifying. 00:36:35.560 |
They look like Theorem 4.1 and Corollary 4.2 on the left, but that is an extract from 00:36:44.720 |
the Adam paper, and you all know how to do Adam in Microsoft Excel. 00:36:52.160 |
It's amazing how most papers manage to make simple things incredibly complex. 00:36:59.080 |
And a lot of that is because academics need to show other academics how worthy they are 00:37:05.160 |
of a conference spot, which means showing off all their fancy math skills. 00:37:11.080 |
So if you really need a proof of the convergence of your optimizer rather than just running 00:37:18.520 |
it and seeing if it works, you can study Theorem 4.1 and Corollary 4.2 and blah blah blah. 00:37:24.600 |
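For contrast, the update rule all that math is about fits in a few lines; here's a minimal numpy sketch of one Adam step, with the usual hyperparameter names (the surrounding training loop and the gradient `g` are up to you).

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at step t (t >= 1)."""
    m = b1 * m + (1 - b1) * g          # running mean of gradients
    v = b2 * v + (1 - b2) * g ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```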
In general though, the way philosophy graduates read papers is to read the abstract, find out 00:37:33.440 |
what problem they're solving, read the introduction to learn more about that problem and how previous 00:37:40.120 |
people have tackled it, jump to the bit at the end called Experiments to see how well 00:37:45.240 |
If it works really well, jump back to the bit which has the pseudocode in and try to 00:37:51.880 |
Ideally, hopefully in the meantime, finding that somebody else has written a blog post 00:37:56.000 |
in simple English like this example with Adam. 00:38:00.520 |
So don't be disheartened when you start reading deep learning papers, and even if you have a 00:38:07.880 |
math background, believe it or not, Rachel's a PhD in math and they're still terrifying. 00:38:13.080 |
Yeah, she still feels disheartened frequently. 00:38:15.680 |
Rachel was complaining about a paper just today in fact. 00:38:24.100 |
The other thing I'll say is that you'll even see now, there will be a bit that's like, 00:38:28.760 |
and then we use a softmax layer and there will be the equation for a softmax layer. 00:38:32.720 |
You'll look at the equation like, what the hell, and then it's like, oh, I already know 00:38:42.160 |
Literally still in every paper, they write the damn LSTM equations as if that's any help 00:38:47.880 |
But okay, it adds more Greek symbols, so be it. 00:38:54.620 |
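(For reference, the softmax equation those papers keep restating is nothing more than

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

which is exactly the thing you already know.)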
It's very hard to read and remember things that you can't pronounce, so if you don't 00:38:59.280 |
know how to read the Greek letters, Google the Greek alphabet and learn how to say them. 00:39:05.160 |
It's just so much easier when you can look at an equation and, rather than going squiggle something, 00:39:09.160 |
squiggle something, you can say alpha something and beta something. 00:39:11.760 |
I know it's a small little thing, but it does make a big difference. 00:39:16.100 |
So we are all there to help each other read papers. 00:39:19.640 |
The reason we need to read papers is because as of now, a lot of the things we're doing 00:39:30.640 |
Okay, so I really think writing is a good idea. 00:39:38.040 |
In fact, all of your projects I hope will end up in at least one blog. 00:39:42.880 |
If you don't have a blog, medium.com is a great place to write. 00:39:47.960 |
We would love to feature your work on fast.ai, so tell us about what you create. 00:39:56.040 |
We're very keen for more people to get into the deep learning community. 00:40:02.480 |
When you write this stuff, say hey, this is some stuff based on this course I'm doing, 00:40:07.260 |
and here's what I've learned, and here's what I've tried, and here's what I found out. 00:40:14.240 |
Like even us putting our little AWS setup scripts on GitHub for the MOOC, Rachel had 00:40:23.200 |
a dozen pull requests within a week with all kinds of little tidbits of like, oh, if you're 00:40:31.100 |
on this version of Mac, this helps this bit, or I've abstracted this out to make it work 00:40:35.360 |
in Ireland as well as in America, and so on, so there's lots of stuff that you can do. 00:40:44.200 |
I think the most important tip here is don't wait to be perfect before you start writing. 00:40:54.520 |
You should think of your target audience as the person who's one step behind you, so maybe 00:40:58.720 |
your target audience is someone that's just working through the part one MOOC right now. 00:41:09.960 |
So write the thing that you would love to have seen, because there will be far more 00:41:14.000 |
people in that target audience than in the Geoffrey Hinton target audience. 00:41:24.240 |
It's 7:45, so this might be a good time for a break. 00:41:27.600 |
Let's just get through this and then we can get on to the interesting stuff. 00:41:33.520 |
I've tried to lay out what I think we'll study in part two. 00:41:36.720 |
As I say, what I was planning until quite recently to present today was neural translation, 00:41:45.080 |
and then two things happened. Google suddenly came up with a much better RNN and sequence-to-sequence 00:41:51.280 |
API, and then also two or three weeks ago a new paper came out for generative models which 00:42:01.200 |
So that's why we've redone things and we're starting with CNN generative models today. 00:42:06.240 |
We have a question, where to find the current research papers? 00:42:16.040 |
Assuming that things go as planned, the general topic areas in part two will be CNNs and NLP 00:42:29.080 |
If you think about it, pretty much everything we did in part one was classification or a 00:42:37.480 |
We're going to now be talking more about generative models. 00:42:41.720 |
It's a little hard to exactly define what I mean by generative models, but we're talking 00:42:45.760 |
about creating an image, or creating a sentence, we're creating bigger outputs. 00:42:55.320 |
So CNNs beyond classification, so generative models for CNNs means the thing that we could 00:43:00.620 |
produce could be a picture showing this is where the bicycle is, this is where the person 00:43:05.880 |
is, this is where the grass is, that's called segmentation, or it could be taking a black 00:43:10.240 |
and white image and turning it into a colour image, or taking a low-res image and turning 00:43:14.200 |
it into a high-res image, or taking a photo and turning it into a Van Gogh, or taking a 00:43:19.360 |
photo and turning it into a sentence describing it. 00:43:24.520 |
NLP beyond classification can be taking an English sentence and turning it into French, 00:43:31.720 |
or taking an English story and a question and turning it into an answer of that question 00:43:42.880 |
We'll be talking about how to deal with larger datasets, so that both means datasets with 00:43:47.040 |
more things in it, and datasets where the things are bigger. 00:43:52.120 |
And then finally, something I'm pretty excited about is I've done a lot of work recently 00:43:56.600 |
finding some interesting stuff about using deep learning for structured data and for 00:44:02.320 |
For example, we heard about fraud, so fraud is both of those things, it combines time 00:44:07.800 |
series, transaction histories and click histories, and structured data, customer information. 00:44:14.080 |
Traditionally that's not been tackled with deep learning, but I've actually found some 00:44:19.080 |
state-of-the-art, world-class approaches to solving those with deep learning, so I'm really 00:44:29.200 |
So let's take an 8-minute break, come back at 5 to 8, thanks very much. 00:44:42.480 |
So we're going to learn about this idea of artistic style or neural style transfer. 00:44:47.380 |
The idea is that we're going to take a photo and make it look like it was painted in the 00:44:55.880 |
Our inputs are a photo, and I'm going to call it, oh, that's way off, and style. 00:45:22.560 |
And so these two things are going to be combined together to create an image which is going 00:45:35.280 |
to hopefully have the content of the photo and the style of the image. 00:45:57.040 |
The way we're going to do this is we're going to assume that there is some function where 00:46:06.000 |
the inputs to this function are the photo, the style image, and some generated image 00:46:27.840 |
And that will return some number where this function will be higher if the generated image 00:46:38.920 |
really looks like this photo in this style and lower if it doesn't. 00:46:44.800 |
So if we can create this loss function that basically says, here's my generated image, 00:46:51.720 |
and it returns back a number saying, oh yes, that generated image does look like that photo 00:47:00.240 |
And we would use SGD not to optimize the weights of a network, we would use SGD to optimize 00:47:13.720 |
So we would be using it to try to optimize the value of this argument. 00:47:21.240 |
So we haven't quite done that before, but conceptually it's identical. 00:47:27.560 |
Conceptually we can just find the derivative of this function with respect to this input. 00:47:37.120 |
And then we can try and optimize that input, which is just a set of pixel values, to try 00:47:44.760 |
So all we need to do is come up with a function which will tell us how much does some generated 00:48:00.440 |
And the way we're going to do that, step 1, is going to be very simple. 00:48:03.520 |
We're going to turn it into two functions, f-content, which will take the photo and the 00:48:12.920 |
generated image, and that will tell us a bigger number if the generated image looks more like 00:48:24.680 |
And then there will be a second function, which takes the style image and the generated 00:48:32.000 |
image, and that will tell us a higher number if this generated image looks like it was 00:48:39.680 |
painted in the same style as the style image. 00:48:42.880 |
So we can just turn it into two pieces and add them together. 00:48:46.560 |
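Written out, the decomposition just described is simply:

$$f(\text{generated}) \;=\; f_{\text{content}}(\text{photo},\ \text{generated}) \;+\; f_{\text{style}}(\text{style},\ \text{generated})$$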
So now we need to come up with these two parts. 00:49:00.560 |
What's a way that we could create a function that returns a higher number if the generated 00:49:12.880 |
When you come up with a loss function, the really obvious one is the values of the pixels. 00:49:24.360 |
The values of the pixels in the generated image, the mean squared error between them 00:49:30.240 |
and the photo, that mean squared error loss function would be one way of doing this part. 00:49:39.800 |
The problem with that though is that as I start to turn it into a Van Gogh, those pixel 00:49:48.660 |
They're going to change color because the Van Gogh might have been a very blue-looking 00:49:53.360 |
They'll change their relationships to each other, so it might become a curve where it used to be 00:50:00.960 |
So really the pixel-wise mean squared error is not going to give us much freedom in trying 00:50:09.840 |
to create something that still looks like a photo. 00:50:13.000 |
So here's an idea, instead let's look at not the pixels, but let's take those pixels and 00:50:22.160 |
stick them through a pre-trained CNN like VGG. 00:50:29.680 |
And let's look at the 4th or 5th or 8th convolutional layer's activations. 00:50:35.960 |
Remember back to those Matt Zeiler visualizations where we saw that the later layers kind of 00:50:43.840 |
said how much does an eyeball look like here, or how much does this look like a star, or 00:50:52.840 |
how much does this look like the fur of a dog. 00:50:53.840 |
The later layers were dealing with bigger objects and more semantic concepts. 00:51:00.580 |
So if we were to use a later layer's activations as our loss function, then we could really 00:51:07.520 |
change the style and the color and all kinds of stuff and really would be saying does the 00:51:12.560 |
eye still look like an eye, does the beak still look like a beak, does the rock still 00:51:19.400 |
And if the answer is yes, then OK, that's good, this is something that matches in terms 00:51:25.160 |
of the meaning of the content even though the pixels look very different. 00:51:30.400 |
And so that's exactly what we're going to do. 00:51:32.040 |
So for f-content, we're going to say that's just the VGG activations of some convolutional 00:51:49.920 |
So that's actually enough for us to get started. 00:51:54.000 |
Let's try and build something that optimizes pixels using a loss function of the VGG network 00:52:18.520 |
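A rough sketch of that plan, assuming Keras 2 with scipy's L-BFGS as the optimizer (the layer name, shapes and `photo_arr` are illustrative choices, not the notebook's exact code): take one of VGG's convolutional layers, use the mean squared error between its activations for the photo and for the generated image as the loss, and optimize the generated pixels rather than any weights.

```python
import numpy as np
import keras.backend as K
from keras.models import Model
from keras.applications.vgg16 import VGG16
from scipy.optimize import fmin_l_bfgs_b

vgg = VGG16(include_top=False, input_shape=photo_arr.shape[1:])
layer = vgg.get_layer('block4_conv2').output            # a mid-to-late conv layer
layer_model = Model(vgg.input, layer)
target = K.variable(layer_model.predict(photo_arr))     # fixed photo activations

loss  = K.mean(K.square(layer - target))                # f_content
grads = K.gradients(loss, vgg.input)
fn    = K.function([vgg.input], [loss] + grads)

def eval_loss_and_grads(x):
    l, g = fn([x.reshape(photo_arr.shape)])
    return np.float64(l), g.flatten().astype(np.float64)

x = np.random.uniform(-2.5, 2.5, photo_arr.shape).flatten()  # start from noise
for i in range(10):                                          # a few L-BFGS iterations
    x, min_val, _ = fmin_l_bfgs_b(eval_loss_and_grads, x, maxfun=20)
```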
And much of what we're going to look at is going to look very similar. 00:52:24.520 |
The first thing you'll see which doesn't look similar to before is I've got this thing called 00:52:31.240 |
Limit mem, remember you can always see the source code for something by putting two question 00:52:43.280 |
Limit mem is just these three lines of code which I notice somebody currently has already 00:52:51.480 |
One of the many things I dislike about TensorFlow for our kind of work is that all of the defaults 00:52:58.920 |
So one of the defaults is it will use up all of your memory on all of your graphics cards. 00:53:04.000 |
So I'm currently running this on a server with four graphics cards, which I'm meant 00:53:07.920 |
to be sharing with my colleagues at the university here. 00:53:12.040 |
If every time I run a notebook, nobody else can use any of the graphics cards, they're 00:53:17.600 |
And this nice little gig I have of running these little classes is going to disappear 00:53:23.160 |
So I need to make sure I run limit mem very soon as soon as I start running a notebook. 00:53:29.760 |
Honestly I think this is a poor choice by the TensorFlow authors because somebody putting 00:53:37.160 |
something in production is going to be taking time to optimize things. 00:53:42.720 |
Somebody who's hacking something together to quickly see if they can get something working 00:53:48.400 |
So this is like one of the many places where TensorFlow makes some odd little annoying 00:53:55.160 |
But anyway, every time I create a new notebook, I copy this line in and make sure I run it 00:54:01.800 |
and so this does not use up all of your memory. 00:54:07.960 |
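For reference, the gist of that utility is roughly these few lines (reconstructed from memory, so treat it as a sketch rather than the course's exact utils file):

```python
import tensorflow as tf
import keras.backend as K

def limit_mem():
    # Tell TensorFlow to grow GPU memory as needed instead of grabbing it all.
    cfg = tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=cfg))
```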
So I've got a link to the paper that we're looking at, and indeed we can open it. 00:54:16.000 |
And now is a good time to talk about how helpful it is to use some kind of paper reading system. 00:54:24.000 |
I really like this one, it's free, it's called Mendeley Desktop. 00:54:30.040 |
Mendeley lets you, as you find papers, save them into a folder on your computer. 00:54:37.440 |
Mendeley will automatically watch that folder, any PDF that appears there gets added to your 00:54:41.920 |
library, and it's really quite cool because what it then does is it finds the arXiv 00:54:52.560 |
ID and then you can click this little button here and it will go to arXiv and grab all 00:55:04.400 |
of the information such as the abstract and so forth and fill it out for you. 00:55:10.460 |
And so this is really great because now any time I want to find anything I've read which 00:55:16.040 |
has got anything to do with style, I can type style and up come all of the papers. 00:55:23.600 |
Believe me, after a long time of reading papers without something like this, it basically 00:55:30.240 |
goes in one ear and out the other, and literally I've read papers a year later and at the end 00:55:35.320 |
of it I've realized I've read that before, I don't remember anything else about it but 00:55:40.560 |
I know I've read it before, whereas this way I really find that my knowledge builds. 00:55:46.680 |
As I find references, I'm immediately there looking at the references. 00:55:51.400 |
The other thing you can do is that as you start reading the paper, as you can see, my 00:56:00.920 |
notes and highlights are saved, and they're also duplicated on my mobile devices and my 00:56:08.240 |
other computers and they're all synced up, it's really cool. 00:56:12.640 |
So talking about arXiv is a great time to answer a question we had earlier about how 00:56:24.300 |
So the vast vast vast majority of deep learning papers get put up on arxiv.org for a long 00:56:34.720 |
long long time before they're in any journal or conference. 00:56:39.200 |
So if you wait until they're in a conference proceedings, you're many many months or maybe 00:56:51.080 |
You can go to the AI section of arXiv and see what's there, but that's not really what 00:57:01.660 |
What everybody uses instead is Arxiv Sanity, the Arxiv Sanity Preserver. 00:57:15.380 |
This is something that the wonderful Andrej Karpathy built, and what it lets you do is 00:57:20.660 |
to create a library of articles that somebody tells you to read or that you're interested 00:57:26.240 |
in or you come across, and as you create that library by clicking this little save button, 00:57:34.820 |
Or even once you start reading a paper, you go Show Similar, and it will then show you 00:57:42.140 |
other papers that are similar to this paper and it seems to do a pretty damn good job 00:57:47.920 |
So you can really explore and get lost in that whole area. 00:57:54.720 |
And then as you do that, you'll find that if you go to arXiv, one of the buttons that 00:58:08.840 |
So like even from the abstract here, bang, straight into your library and the next time 00:58:15.080 |
And then you can put things into folders, so the different parts of the course, I've 00:58:23.960 |
created folders for them and kind of keep track of what I'm reading that way. 00:58:30.400 |
A good little trick to know about arxiv.org is that you often want to know when it's 00:58:39.440 |
from, and if you go to the first page on the left-hand side, you can see the date here. 00:58:44.800 |
And another cool tip is that the file name, the first four digits are the year and month 00:58:51.600 |
for that file, so there's a couple of handy little tips. 00:58:58.040 |
As well as Arxiv Sanity, another really great place for finding papers is Twitter. 00:59:06.560 |
Now if you haven't really used Twitter before or haven't really used Twitter for this purpose 00:59:14.640 |
So I try to make things easy for people by favoriting lots of the interesting deep learning 00:59:23.160 |
So if you go to Jeremy P. Howard's page and click on Likes, you'll find that there is 00:59:40.680 |
a thousand links to papers here, and as you can see, there's generally a few every day. 00:59:50.120 |
One is to get some ideas and papers to read, but perhaps more importantly is to see who's 00:59:58.200 |
Rachel, can you throw that box to that gentleman? 01:00:06.580 |
It's not a question, it's just information about archive. 01:00:11.000 |
There is someone who has built a skill on Amazon Alexa, and you can actually ask Alexa 01:00:18.040 |
to give the most recent papers from arXiv, and she actually reads the abstract for you, and 01:00:35.160 |
The other place which I find extremely helpful is Reddit machine learning. 01:00:42.360 |
Again, there's a lot less that goes through Reddit than goes through Twitter, but generally 01:00:49.800 |
like the really interesting things tend to turn up here, and you can often see the discussions 01:00:58.640 |
For example, there was a great discussion of PyTorch versus TensorFlow in the last day 01:01:04.560 |
or two, and so there's a couple of good places to get started. 01:01:14.880 |
I have two questions on the image stuff when you go back to style. 01:01:21.200 |
One of them was if the app Prisma is using something like this. 01:01:28.760 |
And the other is, is it better to calculate f-content from a higher layer of VGG and use 01:01:34.600 |
a lower layer for f-style, since the higher-level abstractions are captured in the higher 01:01:44.720 |
We haven't learned about f-style yet, so we're just going to look at f-content first. 01:01:48.560 |
Okay, so I've got some more links to some things you can look at here in the notebook. 01:01:57.640 |
So the data I've linked to in the lesson thread on the forum, I've just grabbed a random sample 01:02:05.960 |
of about 20,000 ImageNet images, and I've also put them into bcolz arrays. 01:02:18.340 |
You can figure out how to get the file names easily enough, so I'm not going to do everything 01:02:28.880 |
Thank you for the person who's showing all the other stuff at Pippin's store, that's 01:02:37.320 |
Given that we're using VGG, as per usual, we're going to have to subtract out the mean 01:02:45.200 |
pixel value from ImageNet and reverse the channel order, because of course that's what 01:02:54.840 |
So we're going to create an array from the image by just running it through that pre-processing 01:03:02.280 |
Later on, we're going to be running things through a network and generating images. 01:03:06.260 |
Those generated images we're going to have to add back on that mean and undo that reordering, 01:03:12.320 |
so this is what this de-processing function is going to be for. 01:03:18.320 |
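Roughly, those two helpers look like this; the mean values are the standard ImageNet RGB means that VGG was trained with, and the array layout is batch x height x width x channels.

```python
import numpy as np

rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)   # RGB means

def preproc(img):
    # Subtract the ImageNet mean and flip RGB -> BGR, as VGG expects.
    return (img - rn_mean)[:, :, :, ::-1]

def deproc(img, shape):
    # Undo the above so a generated array can be viewed as a normal image.
    return np.clip(img.reshape(shape)[:, :, :, ::-1] + rn_mean, 0, 255)
```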
Now I've kind of hand-waved over these functions before and how they work, but I'm going to 01:03:27.400 |
stop hand-waving for a moment because it's actually quite interesting. 01:03:30.560 |
Have you ever thought about how is it that we're able to take X, which is a 4-dimensional 01:03:35.800 |
tensor, batch size by height by width by channels (notice this is not the same as Theano, 01:03:43.480 |
so Theano was batch size by channels by height by width, we're not doing that anymore), batch 01:03:51.760 |
size by height by width by channels, taking a 4-dimensional tensor and we're subtracting 01:04:04.640 |
And the way it's making that work is because it's doing something called broadcasting. 01:04:11.260 |
Broadcasting refers to any kind of operation where you have arrays or tensors of different 01:04:18.360 |
dimensions and you do element-wise operations on two tensors of different dimensions. 01:04:30.280 |
This idea actually goes back to the early 1960s to an amazing programming language called APL. 01:04:41.680 |
APL was written by an extraordinary person called Kenneth Iverson. 01:04:47.240 |
Originally APL was a paper describing a new mathematical notation, and this new mathematical 01:04:55.120 |
notation was designed to be more flexible and far more precise than traditional mathematical 01:05:02.360 |
And he then went on to create a programming language that implemented this mathematical 01:05:07.520 |
APL refers to the notation, which he described as notation as a tool for thought. 01:05:15.440 |
He really, unlike the TensorFlow authors, understood the importance of a good API. 01:05:21.080 |
He recognized that the mathematical notation can change how you think about math, and so 01:05:26.440 |
he created a notation which is incredibly expressive. 01:05:34.840 |
His son has now gone on to carry the torch and continues to support a direct descendant of APL called J. 01:05:47.160 |
So if you ever want to find, I think, the most elegant programming language in the world, 01:05:53.920 |
you can go to Jsoftware.com and check this out. 01:05:57.360 |
Now, how many of you here have used regular expressions? 01:06:03.440 |
How many of you, the first time you looked at a complex regular expression thought, that 01:06:14.160 |
The first time that you look at a piece of J, you'll go, what the bloody hell? 01:06:22.200 |
Because it's an even more expressive and a much older language than regular expressions. 01:06:38.520 |
But what's going on here is that this is a language which at its heart almost never requires 01:06:45.480 |
you to write a single loop because it does everything with multidimensional tensors and 01:06:52.080 |
So everything we're going to learn about today with broadcasting is a very diluted, simplified, 01:06:58.720 |
graphified version of what APL created in the early 60s, which is not to say anything 01:07:04.120 |
rude about Python's implementation, it's one of the best. 01:07:13.880 |
If you want to really expand your brain and have fun, check out J. 01:07:18.280 |
In the meantime, what does Keras/Theano/TensorFlow broadcasting look like? 01:07:30.760 |
Here is a vector, a one-dimensional tensor, minus a scalar. 01:07:46.520 |
That makes perfect sense that you can subtract a scalar from a one-dimensional tensor. 01:07:53.280 |
What it's actually doing is it's taking this 2 and it's replicating it 3 times. 01:07:58.360 |
So this is actually element-wise, 1, 2, 3, minus 2, 2, 2. 01:08:05.000 |
It has broadcasted the scalar across the 3-element vector 1, 2, 3. 01:08:15.000 |
So there's our first example of broadcasting. 01:08:21.760 |
In general, broadcasting has a very specific set of rules, which is this. 01:08:29.840 |
You can take two tensors and you first of all take the shorter tensor, the tensor of 01:08:37.080 |
less dimensions, and prepend unit axes to the front. 01:08:47.240 |
Take the vector 2, 3 and prepend 3 unit axes on the front. 01:08:52.640 |
It is now a four-dimensional tensor of shape 1, 1, 1, 2. 01:08:58.720 |
So if you turn a row into a column, you're adding one unit axis. 01:09:05.000 |
If you're then turning it into a single slice, you're adding another unit axis. 01:09:11.560 |
So you can always make something into a higher dimensionality by adding unit axes. 01:09:17.960 |
So when you broadcast, it takes the thing with less dimensions and adds prepends unit 01:09:27.600 |
And then what it does is it says, so let's take this first example, it's taken this thing 01:09:33.120 |
which has no axes, it's a scalar, and turns it into a vector of length 1. 01:09:40.200 |
And then what it does is it finds anything which is of length 1 and duplicates it enough 01:09:49.960 |
So here we have something which is a four-dimensional tensor of size 5, 1, 3, 2. 01:09:57.880 |
So it's got 2 columns, 3 rows, 1 slice and 5 tubes. 01:10:09.280 |
And then we're going to subtract from it a vector of length 2. 01:10:13.280 |
So remember from our definition, it's then going to automatically reshape this by prepending 01:10:28.000 |
And then it's going to copy this thing 3 times, this thing 1 time and this thing 5 times. 01:10:44.880 |
So it's going to subtract this vector from every row, every slice, every cube. 01:10:55.440 |
So you can play around with these little broadcasting examples and try to get a real feel for how 01:11:05.680 |
So in this case, we were able to take a four-dimensional tensor and subtract from it a three-element 01:11:13.320 |
vector, knowing that it is going to copy that three-element vector of channel values to every 01:11:27.280 |
It subtracted the mean average of the channels from all of the images the way we wanted it 01:11:35.800 |
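You can check those rules directly in numpy; this little example mirrors both the 5, 1, 3, 2 case above and the channel-mean subtraction.

```python
import numpy as np

a = np.arange(30, dtype=np.float32).reshape(5, 1, 3, 2)   # shape (5, 1, 3, 2)
v = np.array([10., 20.])                                  # shape (2,)

# v is treated as shape (1, 1, 1, 2), then repeated along the other axes.
print((a - v).shape)                                      # -> (5, 1, 3, 2)

# Same trick as the image pre-processing: subtract per-channel means from a
# whole batch of images without writing any loops.
imgs = np.random.rand(16, 224, 224, 3)
means = imgs.reshape(-1, 3).mean(axis=0)                  # shape (3,)
centred = imgs - means                                    # broadcasts over (16, 224, 224, 3)
```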
But it's been amazing how often I've taken code that I've downloaded off the internet 01:11:40.920 |
and made it often 10 or 20 times shorter in terms of lines of code just by using lots of broadcasting. 01:11:49.680 |
And the reason I'm talking about this now is because we're going to be using this a lot. 01:11:56.200 |
And as I say, if you really want to have fun, play with it in J. 01:12:02.640 |
So that was a diversion, but it's one that's going to be important throughout this. 01:12:07.400 |
So we've now basically got the data that we want. 01:12:23.860 |
When we're doing generative models, we want to be very careful of throwing away information. 01:12:30.560 |
And one of the main ways to throw away information is to use max pooling. 01:12:35.320 |
When you use max pooling, you're throwing away 3/4 of the previous layer and just keeping the maximum of each 2x2 block. 01:12:46.520 |
In generative models, when you use something like max pooling, you make it very hard to recover the information you have thrown away. 01:12:56.000 |
So if we were to use max pooling with this idea of our f-content, and we ask what does 01:13:03.160 |
the fourth layer of activations look like, if we've used max pooling, then we don't really know what most of the original image looked like any more. 01:13:14.680 |
Slightly better is to use average pooling instead of max pooling. 01:13:19.120 |
Because at least with average pooling, we're using all of the data to create an average. 01:13:23.840 |
We've still kind of thrown away 3/4 of it, but at least it's all been incorporated into the result. 01:13:31.780 |
So the only thing I did to turn VGG16 into VGG16 average was to do a search and replace 01:13:38.600 |
in that file from max pooling to average pooling. 01:13:41.560 |
And it's just going to give us some slightly smoother, slightly nicer results. 01:13:47.040 |
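If you'd rather do that swap in code than by editing the model definition file, a rough (entirely hypothetical) sketch of rebuilding a sequential VGG with average pooling could look like this; the lesson itself just does the search and replace:

```python
from keras.layers import AveragePooling2D, MaxPooling2D
from keras.models import Model

def to_avg_pool(model):
    # rebuild an already-loaded, purely sequential VGG16, swapping each max-pooling
    # layer for an average-pooling layer with the same pool size and strides
    x = model.input
    for layer in model.layers[1:]:                 # skip the input layer
        if isinstance(layer, MaxPooling2D):
            x = AveragePooling2D(layer.pool_size, strides=layer.strides)(x)
        else:
            x = layer(x)                           # reuse the existing pre-trained layer
    return Model(model.input, x)
```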
And you're going to see this a lot with generative models. 01:13:49.140 |
We do little tweaks just to try to lose as little information as possible. 01:13:58.960 |
Shouldn't we use something like ResNet instead of VGG, since the residual blocks carry more information? 01:14:07.220 |
We'll look at using ResNet over the coming weeks. 01:14:14.380 |
It's a lot harder to use ResNet for anything beyond kind of basic classification, for a few reasons. 01:14:27.520 |
One is that just the structure of ResNet blocks is much more complex. 01:14:31.360 |
So if you're not careful, you're going to end up picking something that's on one of 01:14:35.560 |
those little arms of the ResNet rather than one of the additive mergers of the ResNet. 01:14:41.640 |
And it's not going to give you any meaningful information. 01:14:45.720 |
You also have to be careful because the ResNet blocks most of the time are just slightly 01:14:51.120 |
fine-tuning their previous block, like adding the residuals. 01:14:56.200 |
It's not really adding new types of information. 01:15:01.320 |
Honestly, the truth is I haven't seen any good research at all about where to use ResNet 01:15:10.920 |
or Inception architectures for things like generative models or for transfer learning. 01:15:18.120 |
So we're going to be trying to look at some of that stuff in this course, but it's far from a settled question. 01:15:30.680 |
In Part 1 of the course, I never actually added batch norm to the convolutional part of VGG. 01:15:38.000 |
So that's kind of irrelevant because we're not using any of the fully connected layers. 01:15:43.680 |
More generally, is batch norm helpful for generative models? 01:15:47.360 |
I'm not sure that we have a great answer to that. 01:15:52.280 |
Will the pre-trained weights change if we're using average pooling instead of max pooling? 01:16:01.360 |
The pre-trained weights, clearly the optimal weights would change, but having said that 01:16:08.920 |
it's still going to do a reasonable job without tweaking the weights because the relationships 01:16:15.180 |
between the activations aren't going to change. 01:16:19.280 |
So again, this would be an interesting thing to try if you want to download ImageNet and 01:16:25.040 |
try fine-tuning it with average pooling, see if you can actually see a difference in the results. 01:16:36.200 |
So here is the output tensor of one of the late layers of VGG-16. 01:16:44.640 |
So if you remember, there are different blocks of VGG where there's a number of 3x3 convs 01:16:52.560 |
in a row, and then there's a pooling layer, and then there's another block of 3x3 convs, and so on. 01:16:58.320 |
This is the last block of the conv layers, and this is the first conv of that block. 01:17:04.040 |
I think this is maybe the third last layer of the convolutional section of VGG. 01:17:09.680 |
This is kind of like a large receptive field, with very complex concepts being captured at this point. 01:17:20.880 |
So what we're going to do is we need to create our target. 01:17:27.560 |
So for our bird, when we put that bird through VGG, what is the value of that layer's activations? 01:17:40.320 |
So one of the things I suggested you revise was the stuff from the Keras FAQ about how to grab the output of an intermediate layer. 01:17:49.600 |
One simple way to do that is to create a new model, which takes our model's input as input, 01:17:55.440 |
and instead of using the final output as output, we can use this layer as output. 01:18:00.360 |
So this is now a model, which when we call .predict, it will return this set of activations. 01:18:13.200 |
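In code, that step is only a couple of lines; here's a sketch, where model is the VGG network built earlier, img_arr stands for the preprocessed bird image, and block5_conv1 is an assumed layer name (check model.summary() for yours):

```python
from keras.models import Model

layer = model.get_layer('block5_conv1').output     # symbolic output of the chosen layer
layer_model = Model(model.input, layer)            # same input, but this layer as the output
activations = layer_model.predict(img_arr)         # concrete activations for the bird image
```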
Now we're going to be using this on the GPU, we're going to be using this as a target. 01:18:13.200 |
So to give us something which is (a) going to live on the GPU and (b) something we can use symbolically 01:18:21.360 |
in a computation graph, we wrap it with K.variable. 01:18:27.680 |
So to remind you, wherever the Keras docs use the keras.backend module, they always call it capital K. 01:18:33.080 |
So k refers to the API that Keras provides, which provides a way of talking to either 01:19:00.880 |
So both Theano and TensorFlow have a concept of variables and placeholders and dot functions 01:19:08.040 |
and subtraction functions and softmax activations and so forth. 01:19:11.560 |
And so this k.module is where all of those functions live. 01:19:17.440 |
This is just a way of creating a variable, which if we're using Theano, it would create a Theano variable. 01:19:22.200 |
If we're using TensorFlow, it creates a TensorFlow variable. 01:19:25.680 |
And where possible, I'm trying to use this rather than TensorFlow directly, but I could 01:19:25.680 |
absolutely have said tf.Variable, and it would work just as well, because we're using the TensorFlow backend here. 01:19:41.080 |
So this has now created a symbolic variable that contains the activations of block 5.1. 01:19:49.340 |
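So that wrapping step is just one line, continuing the sketch above (layer_model and img_arr are the assumed names from before):

```python
from keras import backend as K

# wrap the concrete activations so they live on the GPU and can be used symbolically
targ = K.variable(layer_model.predict(img_arr))
```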
So what we now want to do is to generate an image which we're going to use SGD to gradually 01:19:57.940 |
make the activations of that image look more and more like this variable. 01:20:05.680 |
Let's skip over 202 for a moment and think about some pieces. 01:20:09.520 |
So we're going to need to define a loss function. 01:20:13.320 |
And the loss function is just the mean squared error between two things. 01:20:18.500 |
One thing is, of course, that target, that thing we just created, which is the value of that layer's activations for our bird image. 01:20:38.960 |
And then what do we want to get close to it? 01:20:40.400 |
Well, what we want to get close to it is whatever the value of that layer is for the image we feed in. 01:20:51.080 |
So layer is just a symbolic object at this stage. 01:20:56.800 |
There's nothing in it, so we're going to have to feed it with data later. 01:21:03.840 |
So remember, this is kind of the interesting way you define computation graphs with TensorFlow 01:21:09.920 |
It's like you define it with these symbolic things now and you feed it with data later. 01:21:15.120 |
So you've got this symbolic thing called layer, and we can't actually calculate this yet. 01:21:20.680 |
So at this stage this is just a computation graph we're building. 01:21:24.880 |
Now of course any time we have a computation graph, we can get its gradients. 01:21:28.680 |
So now that we have a computation graph that calculates the loss function we're interested 01:21:49.560 |
in, so this is f-content, if we're going to try to optimize our generated image, we're going to need the gradients. 01:21:58.720 |
So here we can get the gradients, and again we use k dot gradients rather than TensorFlow 01:22:03.280 |
gradients or Theano gradients just so that we can use it with any back-end we like. 01:22:11.400 |
The function we're trying to get gradients of is the loss function, which we just calculated. 01:22:16.960 |
And then we want it with respect to not some weights, but with respect to the input of 01:22:23.600 |
So this is the thing that we want to change is the input to the model so as to minimize 01:22:35.520 |
So now that we've done that, we can go ahead and create our function. 01:22:39.800 |
And so the input to the function is just model.input, and the outputs of the function are the loss and the gradients. 01:22:52.560 |
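Putting those pieces together, here's a sketch with the same assumed names (layer is the symbolic layer output, targ the wrapped target):

```python
from keras import backend as K

loss = K.mean(K.square(layer - targ))            # f_content: MSE between the two sets of activations
grads = K.gradients(loss, model.input)           # gradients w.r.t. the input pixels themselves
fn = K.function([model.input], [loss] + grads)   # feed in an image, get back the loss and the gradients
```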
The last step we need to do is to actually run an optimizer. 01:22:57.380 |
Now normally when we run an optimizer we use some kind of SGD. 01:23:10.540 |
But here there's nothing stochastic about the problem: we're not creating lots of random batches and getting different gradients every time. 01:23:15.560 |
So why use stochastic gradient descent when we don't have a stochastic problem to solve? 01:23:23.440 |
So in fact, there's a much longer history of optimization methods which are deterministic, 01:23:29.640 |
going back to Newton's method, which many of you will be familiar with. 01:23:40.880 |
The basic idea of these much faster deterministic optimization methods is that rather than saying 01:23:50.220 |
OK, where's the gradient, which direction does it go, let's just go a small little step in that direction. 01:23:57.320 |
Learning rate times gradient, small little step, small little step, because I have no idea how far I should go. 01:24:03.200 |
And it's stochastic, so it's going to keep changing. 01:24:05.200 |
So next time I look it will be a totally different direction. 01:24:09.560 |
With a deterministic optimization, we find out which direction to go, and then we find 01:24:16.160 |
out what is the optimum distance to go in that direction. 01:24:19.880 |
And so if you know this is the direction I want to go, and it looks like this, then the 01:24:24.800 |
way we find the optimum is we go a small distance. 01:24:27.520 |
Then we go twice as far as that, twice as far as that, twice as far as that, and we keep going until the slope changes sign. 01:24:36.800 |
Once the slope changes sign, we know we've gone past the minimum; this is called bracketing. 01:24:40.120 |
We've bracketed the minimum of that function. 01:24:43.360 |
And then we can use bisection to find the minimum. 01:24:45.840 |
So now we've bracketed it, we look halfway between the two ends: is the minimum to the left or the right of that point? 01:24:51.160 |
Then we look halfway again within whichever half it is, and so on. 01:24:54.100 |
So we use bracketing and bisection to find the optimum in that direction. 01:25:04.680 |
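If it helps, here's a toy sketch of that bracketing-plus-bisection idea for a 1-D function f(t) along the search direction; this is just to illustrate the idea, not what SciPy actually does internally:

```python
def line_search(f, step=1e-4, tol=1e-8):
    # bracketing: keep doubling the step while we're still going downhill
    a, b = 0.0, step
    while f(b) < f(a):
        a, b = b, 2 * b              # once f(b) >= f(a), the minimum is bracketed in [0, b]

    def slope(t, eps=1e-6):
        return f(t + eps) - f(t - eps)

    # bisection: is the minimum to the left or the right of the midpoint?
    lo, hi = 0.0, b
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if slope(mid) > 0:
            hi = mid                 # already going uphill, so the minimum is to the left
        else:
            lo = mid
    return (lo + hi) / 2

print(line_search(lambda t: (t - 0.3) ** 2))   # approximately 0.3
```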
All of these optimization techniques rely on the basic idea of a line search. 01:25:11.080 |
Once you've done the line search, you've found the optimal value in that direction, in our case the downhill direction. 01:25:18.440 |
That doesn't necessarily mean we've found the optimal value across our entire space. 01:25:23.840 |
So what we then do is we repeat the process, find out what's the downhill direction now, 01:25:30.000 |
use line search to find the optimum in that direction. 01:25:34.600 |
So the problem with that is that in a saddle point, you will still often find yourself 01:25:42.200 |
going backwards and forwards in a rather unfortunate way. 01:25:51.160 |
The faster optimization approaches when they're going to go in a new direction, they don't 01:25:56.000 |
just say which direction is down, they say which direction is the most downhill but also 01:26:01.480 |
the most different to the previous directions I've gone. 01:26:08.440 |
So the good news is you don't need to really know any of those details. 01:26:12.040 |
All you need to know is that there is a module called scipy.optimize. 01:26:23.560 |
And in scipy.optimize are lots of handy deterministic optimizers. 01:26:29.280 |
The two most commonly used are conjugate gradient, or CG, and BFGS. 01:26:40.080 |
They differ in the detail of how do they decide what direction to go next, which direction 01:26:45.440 |
is both the most downhill and also the most different to the previous directions we've already gone in. 01:26:52.480 |
And the particular version we're going to use is a limited memory BFGS. 01:26:57.600 |
So the important thing is not how it works, the important thing for us is how do we use it. 01:27:07.480 |
So there's the question about loss plus grads. 01:27:17.440 |
So this is an array containing a single thing, which is loss. 01:27:24.800 |
Grads is already an array, or a list I should say, which is a list of all of the gradients of the loss with respect to the input. 01:27:32.920 |
So plus in Python on two lists simply joins the two lists together. 01:27:38.240 |
So this is a list containing the loss and all of the gradients. 01:27:42.760 |
Someone asked if ant colony optimization is something that can be used? 01:27:48.920 |
Ant colony optimization lives in a class known as metaheuristics, like genetic algorithms 01:27:59.320 |
There's a wide range of optimization algorithms that are designed for very difficult to optimize 01:28:05.720 |
functions, functions which are extremely bumpy. 01:28:09.200 |
And so these techniques all use a lot of randomization in order to kind of avoid the bumps. 01:28:17.680 |
In our case, we're using mean-squared error, which is a nice smooth objective. 01:28:23.320 |
So we can use the much faster convex optimization. 01:28:27.760 |
And then that was the next question, is this a non-convex problem or a convex optimization? 01:28:46.280 |
Basically you provide the optimizer function, which in this case is the limited-memory BFGS minimizer from scipy.optimize, and you give it: 01:28:58.040 |
A function which will return the loss value at the current point, a starting point, and 01:29:09.440 |
a function which will return the gradients at the current point. 01:29:14.680 |
Now unfortunately we have a function which returns the loss and the gradients together, whereas SciPy wants them as two separate functions. 01:29:25.280 |
So a minor little detail is that we create a simple little class, and all this class 01:29:34.680 |
does, and again the details really aren't important, but all this class does is that when loss 01:29:40.520 |
is called, it calls that function that we created, passing in the current value of the 01:29:47.320 |
data, it gets back the loss and the gradients, and it returns the loss. 01:29:56.720 |
Later on when the optimizer asks for the gradients, it returns those gradients that it stored earlier. 01:29:56.720 |
So what this is doing is it's a little class which allows us to basically turn a Keras 01:30:10.560 |
function that returns the loss and the gradients together into two functions. 01:30:15.720 |
One which returns the loss, one which returns the gradients. 01:30:18.840 |
So it's a pretty minor detail, but it's a handy thing to have in your toolbox because 01:30:22.720 |
it means you now have something that can use deterministic optimizers on Keras functions. 01:30:31.200 |
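Here's a sketch of what such a class can look like, assuming fn is the Keras function from before and shp is the shape of the input image array:

```python
import numpy as np

class Evaluator:
    """Split a Keras function returning [loss, grads] into the two callables SciPy wants."""
    def __init__(self, fn, shp):
        self.fn, self.shp = fn, shp

    def loss(self, x):
        # call the Keras function once, return the loss and stash the gradients for later
        loss_, grad_ = self.fn([x.reshape(self.shp)])
        self.grad_values = grad_.flatten().astype(np.float64)
        return loss_.astype(np.float64)

    def grads(self, x):
        return self.grad_values       # the gradients stored by the last call to loss()
```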
So all we do is we loop through a small number of times, calling that optimizer each time. 01:30:43.000 |
So the starting point is just a random image. 01:30:49.200 |
So we just create a random image, and here is what a random image looks like. 01:30:55.800 |
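And the loop itself, sketched with SciPy's L-BFGS routine (the random starting range and the clipping values are just plausible guesses for mean-subtracted pixels, and evaluator, fn and shp are the assumed names from the snippets above):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

evaluator = Evaluator(fn, shp)

def solve_image(evaluator, shp, niter=10):
    x = np.random.uniform(-2.5, 2.5, shp)              # start from a random image
    for i in range(niter):
        x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                         fprime=evaluator.grads, maxfun=20)
        x = np.clip(x, -127, 127)                      # keep pixel values in a sensible range
        print('Iteration %d, loss %.2f' % (i, min_val))
    return x.reshape(shp)
```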
So let's go ahead and run that so we can see the results; I haven't actually run this yet. 01:31:16.040 |
So you can see it going along and solving here. 01:31:37.040 |
And here at the end of the 10th iteration is the result. 01:31:42.360 |
So remember what we did was we started with this image, we called an optimizer which took 01:31:51.040 |
that image and attempted to optimize this loss function where the target was the value 01:32:01.520 |
of this layer for our bird image, and the thing it was comparing it to was the layer 01:32:16.600 |
So we started with this, we ran that optimizer a bunch of times, calculating the gradient 01:32:22.120 |
of that loss with respect to the input to the model, the very pixels themselves. 01:32:28.480 |
And after 10 iterations it turned this random image into this thing. 01:32:34.400 |
So this is the thing which optimizes the block 5.1 layer. 01:32:45.160 |
And you can see it still looks like a bird, but by this point it really doesn't care what 01:32:49.820 |
the background looks like, it cares a lot what the eye looks like and the beak looks 01:32:53.880 |
like and the feathers look like, because these things all matter to ImageNet to make sure it classifies it as the right kind of bird. 01:33:00.440 |
If we look at an earlier layer, let's look at block 4.1, you can see it stays closer to the original image. 01:33:11.800 |
So when we do our artistic style, we can choose which layer will be our f-content. 01:33:17.320 |
And if we choose an earlier one, it's going to give it less degrees of freedom to look 01:33:23.400 |
like a different kind of bird, but it's going to look more like our original bird. 01:33:28.720 |
And so then here's a video showing how that happens, so there are the 10 steps. 01:33:41.800 |
And it's often helpful to be able to visualize the iterations of your generators at work. 01:33:52.920 |
So feel free to borrow this very simple code, you can just use matplotlib. 01:33:58.520 |
We actually used this in the last class, remember, when we animated the little linear optimizer. 01:34:08.000 |
You just have to define a function that gets called at each step of the animation, and 01:34:13.200 |
then you can just call animation.FuncAnimation passing in that function, and that's a nice 01:34:17.840 |
way that you can animate your own generators. 01:34:23.480 |
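A minimal sketch of that pattern, where frames is assumed to be the list of images you saved at each iteration:

```python
import matplotlib.pyplot as plt
from matplotlib import animation

fig, ax = plt.subplots()
im = ax.imshow(frames[0])

def animate(i):
    im.set_data(frames[i])        # called once per step of the animation
    return [im]

anim = animation.FuncAnimation(fig, animate, frames=len(frames), interval=200)
# in a notebook you can then display anim, e.g. with HTML(anim.to_html5_video())
```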
Question, we're using Keras and TensorFlow to extract the VGG features, these are used 01:34:29.960 |
by SciPy for BFGS, does the BFGS also run on the GPU? 01:34:35.840 |
No, there's really very little for the BFGS to do. 01:34:42.720 |
Especially for an optimizer like this, all of the work is in calling the loss function and the gradients. 01:34:48.560 |
The actual work of doing the bisection and doing the bracketing is so trivial that it really doesn't matter where it runs. 01:35:00.200 |
There's a question about the checkerboard artifact, the geometric pattern that's appearing. 01:35:07.240 |
This is actually not a checkerboard artifact exactly, checkerboard artifacts we will look 01:35:13.920 |
That was my interpretation mistake, not the questioner's mistake. 01:35:22.960 |
I'm not exactly sure why this particular kind of noise has appeared, honestly. 01:35:49.020 |
We have a single image which is being optimized, so there's really no batching to do here. 01:35:59.560 |
We'll look at a version which uses a very different approach and has batching shortly. 01:36:04.160 |
Has anyone tried something like this by averaging or combining the activations of multiple bird 01:36:10.360 |
images to create some kind of prototypical or novel bird? 01:36:19.960 |
Generative adversarial networks do something like that, but probably not quite. 01:36:33.320 |
You have to get a list of file names yourself from the list of files that you've downloaded. 01:36:44.360 |
And then just to make sure I understand this, someone says in this example we started with 01:36:49.540 |
a random image, but what if we started with the actual image as the initial condition? 01:37:05.800 |
They're interested to find out where we initialize for the artistic styling problem. 01:37:11.920 |
That was just a follow-up, we're going to get there. 01:37:20.600 |
Would it be useful to use a tool like Quiver to figure out which VGG layer to use for this? 01:37:27.200 |
It's so easy just to try a few and see what works. 01:37:34.440 |
We haven't got through as much as I hoped, but we're going to finish off this piece. 01:37:49.840 |
The only thing different is, A, we're not going to feed in a photo, we're going to feed in a style image. 01:37:56.080 |
And here's a few styles we could choose from. 01:37:57.680 |
We could do Van Gogh, we could do this little drawing, or we could do the Simpsons. 01:38:05.080 |
So we pick one of those and we create the style array in the same way as before. 01:38:14.300 |
This time, though, we're going to use multiple layers. 01:38:17.360 |
So I've created a dictionary from the name of the layer to its output, and so we're going 01:38:21.800 |
to use that to create an array of a number of the outputs. 01:38:30.320 |
We're going to grab the first, second and third block outputs. 01:38:38.800 |
So we're going to create our target as before, but we're going to use a different loss function. 01:38:48.920 |
The loss function is called style_loss, and just like before, it's going to use the MSE. 01:38:54.920 |
But rather than just the MSE on the activations, it's the MSE on something called the Gram matrix of the activations. 01:39:04.360 |
A Gram matrix is very simply the dot product of a matrix with its own transpose. 01:39:13.520 |
So here it is here, dot product of some matrix with its own transpose. 01:39:20.160 |
And I just divide it by the number of elements to turn it into an average. 01:39:25.920 |
So what is this matrix that we're taking the dot product of it as transpose? 01:39:31.200 |
Well what it is, is that we start with our image, and remember the image is height by 01:39:35.760 |
width by channels, and we change the order of dimensions, so it's channels by height by width. 01:39:46.640 |
What batch flatten does is it takes everything except the first dimension and flattens it out. 01:39:53.160 |
This is now going to be a matrix where the rows are the channels and the columns are the flattened x,y locations. 01:40:03.920 |
This is C by H by W, the result of this will be C rows and H times W columns. 01:40:12.840 |
So when you take the dot product of something with a transpose of itself, what you're basically 01:40:19.840 |
doing is creating something a lot like a correlation matrix. 01:40:23.520 |
You're saying how similar is each row to each other row? 01:40:36.480 |
You can think about it like a cosine, a cosine is basically just a dot product. 01:40:42.400 |
You can think of it as a correlation matrix, it's basically a normalized version of this. 01:40:51.200 |
So maybe if it's not clear to you, write it down on a piece of paper on the way home tonight. 01:40:55.200 |
Just think about taking the rows of a matrix, and then flipping it around, and you're basically 01:41:02.600 |
then turning them into columns, and then you're multiplying the rows by the columns, it's 01:41:07.400 |
basically the same as taking each row and comparing it to each other row. 01:41:13.200 |
So that's what this Gram matrix is, it's basically saying for every channel, how similar are its activations to every other channel's. 01:41:24.500 |
So if channel number 1 in most parts of the image is very similar to channel 3 in most 01:41:33.120 |
parts of the image, then 1,3 of this result will be a higher number. 01:41:40.320 |
So it's kind of a weird matrix, it basically tells us, it's like a fingerprint of how the 01:41:46.920 |
channels relate to each other in this particular image, or how the filters relate to each other 01:41:52.320 |
in a particular layer of this particular image. 01:41:54.880 |
I think the most important thing to recognize is that there is no geometry left here at all. 01:42:02.000 |
The x and the y coordinates are totally thrown away, they're actually flattened out. 01:42:08.560 |
So this loss function can by definition in no way at all contain anything about the content 01:42:16.200 |
of the image, because it's thrown away all of the x and y information, and all that's 01:42:21.400 |
left is some kind of fingerprint of how the channels relate to each other, how the filters 01:42:29.720 |
So this style loss then says for two different images, how do these fingerprints differ? 01:42:38.920 |
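Here's a sketch of those two functions in Keras backend code; x is a single image's activations in channels-last order with the batch dimension already indexed away, and the normalisation assumes the TensorFlow backend with a fixed input shape:

```python
from keras import backend as K

def gram_matrix(x):
    # rows = channels, columns = flattened x,y locations
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    # dot product with its own transpose: how much each filter co-occurs with each other,
    # divided by the number of elements to make it an average
    return K.dot(features, K.transpose(features)) / x.get_shape().num_elements()

def style_loss(x, targ):
    # MSE between the two "fingerprints"
    return K.mean(K.square(gram_matrix(x) - gram_matrix(targ)))
```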
So it turns out that if you now do the exact same steps as before using that as our loss 01:42:44.600 |
function and you run it through a few iterations, it looks like that. 01:42:53.680 |
It looks a lot like the original Van Gogh, but without any of the content. 01:43:14.680 |
So a paper just came out two weeks ago called Demystifying Neural Style Transfer with a 01:43:20.360 |
mathematical treatment where they claim to have an answer to this question. 01:43:24.760 |
But from the point at which this was created, a year and a half ago, until now, no one really knew why it works. 01:43:36.080 |
But the important thing that the authors of this paper realized is, if we could create 01:43:40.480 |
a function that gives you content loss and a function that gives you style loss, and 01:43:45.320 |
you add the two together and optimize them, you can do neural style. 01:43:49.720 |
So all I can assume is that they tried a few different things. 01:43:55.360 |
They knew that they had to throw away all of the geometry, so they probably tried a 01:43:59.120 |
few things that throw away the geometry, and at some point they looked at this one and saw that it worked. 01:44:06.200 |
So now that we have this magical thing, there's the Simpsons, all we have to do is add the content loss and the style loss together. 01:44:19.360 |
We've got our style layers, I'm actually going to take the top five now. 01:44:26.160 |
Here's our content layer, I'm going to take block 4, conv 2. 01:44:30.060 |
As promised, for our loss function, I'm just going to add the two together. 01:44:35.280 |
Style loss for all of the style layers, plus the content loss. 01:44:41.920 |
And I'm going to divide the content loss by 10. 01:44:43.880 |
This is something you can play with, and in the paper you'll see they play with it. 01:44:48.440 |
How much style loss versus how much content loss? 01:44:50.880 |
Get the gradients, create the evaluator, solve it, and there it is. 01:44:59.020 |
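Spelled out as a sketch, with the same assumed names as before (style_layers/style_targs and content_layer/content_targ being the symbolic layer outputs and their wrapped targets):

```python
from keras import backend as K

# combined objective: style loss summed over the style layers, plus a down-weighted content loss
loss = sum(style_loss(l[0], t[0]) for l, t in zip(style_layers, style_targs)) \
       + K.mean(K.square(content_layer - content_targ)) / 10.
grads = K.gradients(loss, model.input)
fn = K.function([model.input], [loss] + grads)
```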
Other than the fact that we don't really know why the style loss works (but it does), everything here is stuff we've already seen. 01:45:07.320 |
So there's the bird as Van Gogh, there's the bird as the Simpsons, and there's the bird 01:45:14.920 |
There's a question, "Since the publication of that paper, has anyone used any other loss 01:45:21.400 |
functions for f_style that achieve similar results?" 01:45:24.560 |
Yeah, so as I mentioned, just a couple of weeks ago there was a paper, I'll put it on 01:45:28.360 |
the forum, that tries to generalize this loss function. 01:45:32.080 |
It turns out actually that this particular loss function seems to be about the best that anyone has found. 01:45:39.200 |
So it's 9 o'clock, so we have run out of time. 01:45:42.680 |
So we're going to move some of this lesson to the next lesson, but to give you a sense 01:45:46.880 |
of where we're going to head, what we're going to do is we're going to take this thing where 01:45:52.740 |
you have to optimize every single image separately, and we're going to train a CNN, which will 01:46:00.840 |
learn how to turn a picture into a Van Gogh version of that picture. 01:46:06.340 |
So that's basically going to be what we're going to learn next time, and we're also going 01:46:09.680 |
to learn about adversarial networks, which is where we're going to create two networks. 01:46:14.680 |
One will be designed to generate pictures like this, and the other will be designed 01:46:19.800 |
to try and classify whether this is a real Simpsons picture or a fake Simpsons picture. 01:46:26.600 |
And then you'll do one, generate, the other, discriminate, generate, discriminate. 01:46:31.680 |
And by doing that, we can take any generative model and make it better by basically having 01:46:39.280 |
something else learn to pick the difference between the real and the fake. 01:46:45.120 |
And then finally we're going to learn about a particular thing that came out three weeks 01:46:47.960 |
ago called the Wasserstein GAN, which is the reason I actually decided to move all of this 01:46:54.960 |
Generative adversarial networks basically didn't work very well at all until about three weeks ago. 01:47:00.000 |
Now that they do work, suddenly there's a shitload of stuff that nobody's done yet,