Lesson 10: Cutting Edge Deep Learning for Coders
Chapters
0:00 Introduction
1:00 Slav Ivanov's Optimizer Post
4:35 Overshoot
7:40 Study Groups
9:15 Last Week Recap
10:40 Resize Images
12:43 Center Cropping
15:02 Parallel Processing
17:39 General Approach
19:03 Append Image
20:10 Threading Local
22:54 Results
26:38 Preprocessing
27:40 Finetuning
38:39 Bcolz Arrays
43:47 Linear Model
00:00:00.000 |
Some really fun stuff appeared on the forums this week, and one of the really great projects 00:00:12.600 |
was created by, I believe, our sole Bulgarian participant in the course, Slav Ivanov, who wrote 00:00:19.080 |
a great post about picking an optimizer for style transfer. 00:00:24.380 |
This post came from a forum discussion in which I made an off-hand remark about how 00:00:31.800 |
I know that in theory BFGS is a deterministic optimizer, it uses a line search, it approximates 00:00:39.360 |
the Hessian, it ought to work on this kind of deterministic problem better, but I hadn't 00:00:44.600 |
tried it myself and I hadn't seen anybody try it and so maybe somebody should try it. 00:00:49.480 |
I don't know if you've noticed, but pretty much every week I say something like that 00:00:52.480 |
a number of times, and every time I do I'm always hoping that somebody might go, "Oh, I wonder..." 00:00:57.000 |
And so Slav did wonder, and he posted a really interesting blog post about that exact question. 00:01:06.720 |
I was thrilled to see that the blog post got a lot of pick-up on the machine learning Reddit. 00:01:16.080 |
It got 55 upvotes, which for that subreddit put it in second place on the front page. 00:01:25.240 |
It also got picked up by the WildML mailing list's weekly summary of interesting things. 00:01:43.360 |
For those of you that have looked at it and kind of wondered what is it about this post 00:01:47.160 |
that causes it to get noticed whereas other ones don't, I'm not sure I know the secret, 00:01:54.160 |
but as soon as I read it I kind of thought, "Okay, I think a lot of people are going to 00:01:57.840 |
It gives some background, it assumes an intelligent reader, but it assumes an intelligent reader 00:02:03.600 |
who doesn't necessarily know all about this, something like you guys six months ago. 00:02:10.760 |
And so it describes this is what it is and this is where this kind of thing is used and 00:02:16.120 |
gives some examples and then goes ahead and sets up the question of different optimization 00:02:25.320 |
algorithms and then shows lots of examples of both learning curves as well as pictures 00:02:33.140 |
that come out of these different experiments. 00:02:36.560 |
And I think hopefully it's been a great experience for Slav as well because in the Reddit thread 00:02:42.120 |
there's all kinds of folks pointing out other things that he could try, questions that weren't 00:02:49.760 |
quite clear, and so now there's, actually summarized in that thread, 00:02:55.640 |
a whole list of things that perhaps could be done next, opening up a whole set of interesting questions. 00:03:03.480 |
There's another post which I'm not even sure is officially posted yet, but I got an early preview. 00:03:14.200 |
Here is Kanye drawn using a brush of Captain Jean-Luc Picard. 00:03:19.000 |
In case you're wondering, is that really him, I will show you his zoomed in version. 00:03:30.640 |
And this is a really interesting idea because he points out that generally speaking when 00:03:36.600 |
you try to use a non-artwork as your style image, it doesn't actually give very good results. 00:03:46.200 |
It's another example of a non-artwork, it doesn't give good results. 00:03:52.800 |
It's kind of interesting, but it's not quite what he was looking for. 00:03:56.200 |
But if you tile it, you totally get it, so here's Kanye using a Nintendo game controller 00:04:07.880 |
So then he tried out this Jean-Luc Picard and got okay results and kind of realized that 00:04:16.760 |
actually the size of the texture is pretty critical. 00:04:20.480 |
And I've never seen anybody do this before, so I think when this image gets shared on 00:04:28.100 |
Twitter it's going to go everywhere because it's just the freakiest thing. 00:04:36.520 |
So I think I warned you guys about your projects when I first mentioned them as being something 00:04:42.500 |
that's very easy to overshoot a little bit and spend weeks and weeks talking about what to do. 00:04:55.040 |
Really it would have been nice to have something done by now rather than spending a couple of weeks deciding. 00:05:00.180 |
So if your team is being a bit slow agreeing on something, just start working on something 00:05:04.840 |
Or as a team, just pick something that you can do by next Monday and write up something 00:05:14.520 |
So for example, if you're thinking, okay, we might do the $1 million Data Science Bowl. 00:05:21.280 |
You're not going to finish it by Monday, but maybe by Monday you could have written a blog 00:05:24.740 |
post introducing what you can learn in a week about medical imaging. 00:05:30.280 |
Oh, it turns out it uses something called DICOM. 00:05:32.360 |
Here are the Python DICOM libraries, and we tried to use them, and these were the things 00:05:36.240 |
that got us kind of confused, and these are the ways that we solved them. 00:05:40.240 |
And here's a Python notebook which shows you some of the main ways you can look at these images. 00:05:49.960 |
It's like when you enter a Kaggle competition, I always tell people submit every single day 00:05:58.160 |
and try and put in at least half an hour a day to make it slightly better than yesterday. 00:06:02.720 |
So how do you put in the first day's submission? 00:06:06.080 |
What I always do on the first day is to submit the benchmark script, which is generally like 00:06:12.520 |
And then the next day I try to improve it, so I'll put in all 0.5s, and the next day 00:06:18.720 |
I'll be like, "Okay, what's the average for cats? 00:06:22.240 |
And if you do that every day for 90 days, you'll be amazed at how much you can achieve. 00:06:28.320 |
Whereas if you wait two months and spend all that time reading papers and theorizing and 00:06:33.480 |
thinking about the best possible approach, you'll discover that you don't get any submissions 00:06:38.240 |
Or you finally get your perfect submission in and it goes terribly and now you don't 00:06:46.120 |
I think those tips are equally useful for Kaggle competitions as well as for making 00:06:51.120 |
sure that at the end of this part of the course you have something that you're proud of, something 00:06:58.320 |
that you feel you did a good job in a small amount of time. 00:07:04.680 |
If you try and publish something every week on the same topic, you'll be able to keep 00:07:11.120 |
I don't know what Slav's plans are, but maybe next week he'll follow up on some of the interesting 00:07:18.200 |
research angles that came up on Reddit, or maybe Brad will follow up on some of his additional 00:07:26.600 |
There's a lesson 10 wiki up already which has the notebooks, and just do a git pull 00:07:35.620 |
on the GitHub repo to get the most up-to-date Python notebooks. 00:07:40.600 |
Another thing that I wanted to point out is that in study groups, so we've been having 00:07:47.520 |
study groups each Friday here, and I know some of you have had study groups elsewhere 00:07:51.880 |
around the Bay Area. At one of them, somebody said: I don't understand this Gram matrix stuff. I don't get what's going 00:08:00.520 |
on. I understand the symbols, I understand the math, but what's going on? 00:08:05.680 |
I said maybe if you had a spreadsheet, it would all make sense. Maybe I'll create a 00:08:19.240 |
spreadsheet. Yes, do that! And 20 minutes later I turned to him and I said, "So how do you 00:08:24.600 |
feel about Gram matrices now?" And he goes, "I totally understand them." And I looked 00:08:28.440 |
over and he created a spreadsheet. This was the spreadsheet he created. It's a very simple 00:08:34.200 |
spreadsheet where it's like here's an image where the pixels are just 1, -1, and 0. It 00:08:39.480 |
has two filters, either 1 or -1. He has the flattened convolutions next to each other, 00:08:48.520 |
and then he's created the little dot product matrix. 00:08:54.160 |
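In code, that spreadsheet corresponds to something like the following minimal NumPy sketch of a Gram matrix (the toy pixel values and filter count are made up purely for illustration):

```python
import numpy as np

# Two toy "filter activation" channels, each 3x3, with values of just 1, -1 and 0
acts = np.array([[[ 1, -1,  0],
                  [ 0,  1, -1],
                  [-1,  0,  1]],
                 [[ 1,  1, -1],
                  [-1,  0,  0],
                  [ 0, -1,  1]]], dtype=np.float32)

# Flatten each channel into a row, like laying the flattened convolutions side by side
flat = acts.reshape(acts.shape[0], -1)   # shape (2, 9)

# The Gram matrix is just the dot product of those flattened channels with each other
gram = flat @ flat.T                     # shape (2, 2)
print(gram)
```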
I haven't been doing so much Excel stuff myself, but I think you'll learn a lot more by trying 00:09:01.120 |
it yourself. Particularly if you try it yourself and can't figure out how to do it in Excel, 00:09:06.520 |
then we have programs. I love Excel, so if you ask me questions about Excel, I will have 00:09:29.440 |
So last week we talked about the idea of learning with larger datasets. Our goal was to try 00:09:46.320 |
and replicate the DeViSE paper. To remind you, the DeViSE paper is the one where we do 00:09:53.700 |
a regular CNN, but the thing that we're trying to predict is not a one-hot encoding of the 00:10:01.520 |
category, but it's the word vector of the category. 00:10:07.480 |
So it's an interesting problem, but one of the things interesting about it is we have 00:10:13.280 |
to use all of ImageNet, which has its own challenges. So last week we got to the point 00:10:21.120 |
where we had created the word vectors. And remember, we then had 00:10:26.640 |
to map the word vectors to ImageNet categories. There are 1000 ImageNet categories, so we had to 00:10:31.320 |
create the word vector for each one. We didn't quite get all of them to match, but something 00:10:35.720 |
like 2/3 of them matched, so we're working on 2/3 of ImageNet. We've got as far as reading 00:10:43.080 |
all the file names for ImageNet, and then we're going to resize our images to 224x224. 00:10:58.640 |
I think it's a good idea to do some of this pre-processing upfront. Something that TensorFlow 00:11:06.720 |
and PyTorch both do and Keras recently started doing is that if you use a generator, it actually 00:11:13.040 |
does the image pre-processing in a number of separate threads in parallel behind the 00:11:19.700 |
scenes. So some of this is a little less important than it was 6 months ago when Keras didn't 00:11:25.520 |
do that. It used to be that we had to spend a long time waiting for our data to get processed 00:11:32.720 |
before it could get into the CNN. Having said that, particularly image resizing, when you've 00:11:42.780 |
got large JPEGs, just reading them off the hard disk and resizing them can take quite 00:11:49.200 |
a long time. So I always like to do all that resizing upfront and end up with 00:11:54.700 |
something in a nice convenient bcolz array. 00:11:59.500 |
Amongst other things, it means that unless you have enough money to have a huge NVMe or 00:12:06.600 |
SSD drive, which you can put the entirety of ImageNet on, you probably have your big 00:12:12.480 |
data sets on some kind of pretty slow spinning disk or slow RAID array. One of the nice things 00:12:19.360 |
about doing the resizing first is that it makes it a lot smaller, and you probably can 00:12:22.960 |
then fit it on your SSD. There's lots of reasons that I think this is good. I'm going to resize 00:12:29.480 |
all of the ImageNet images, put them in a bcolz array on my SSD. So here's the path, 00:12:37.840 |
and dpath is the path to my fast SSD mount point. We talked briefly about the resizing, 00:12:50.760 |
and we're going to do a different kind of resizing. In the past, we've done the same 00:12:54.160 |
kind of resizing that Keras does, which is to add a black border. If you start with something 00:12:59.200 |
that's not square, and you make it square, you resize the largest axis to be the size 00:13:06.120 |
of your square, which means you're left with a black border. I was concerned that any model 00:13:14.280 |
where you have that is going to have to learn to model the black border, a) and b) that 00:13:20.880 |
you're kind of throwing away information. You're not using the full size of the image. 00:13:25.400 |
And indeed, pretty much every other library or paper I've seen uses a different approach, 00:13:32.560 |
which is to resize the smaller side of the image to the square. The larger side is 00:13:40.440 |
now too big for your square, so you crop off the top and bottom, or crop off the left and 00:13:45.760 |
right. So this is called a center-cropping approach. 00:13:59.760 |
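A minimal sketch of that center-cropping resize, assuming PIL/Pillow and NumPy (the function name and default argument are my own choices, not the lesson notebook's):

```python
import numpy as np
from PIL import Image

def center_crop_resize(path, size=224):
    """Resize the shorter side to `size`, then crop the longer side to a square."""
    im = Image.open(path).convert('RGB')
    w, h = im.size
    scale = size / min(w, h)                                   # shorter side becomes `size`
    im = im.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    arr = np.asarray(im)
    top = (arr.shape[0] - size) // 2                           # crop rows equally top/bottom
    left = (arr.shape[1] - size) // 2                          # crop columns equally left/right
    return arr[top:top + size, left:left + size]
```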
Okay, that's true. What you're doing is you're throwing away compute. Like with the one 00:14:15.720 |
where you do center-crop, you have a complete 224 thing full of meaningful pixels. Whereas 00:14:20.440 |
with a black border, you have a 180 by 224 bit with meaningful pixels and a whole bunch 00:14:25.060 |
of black pixels. Yeah, that can be a problem. It works well for ImageNet because ImageNet 00:14:37.720 |
things are generally somewhat centered. You may need to do some kind of initial step to 00:14:44.800 |
do a heat map or something like we did in lesson 7 to figure out roughly where the thing 00:14:48.440 |
is before you decide where to center the crop. So these things are all compromises. But I 00:14:53.520 |
got to say, since I switched to using this approach, I feel like my models have trained 00:14:57.680 |
a lot faster and given better results, certainly the super resolution. 00:15:03.320 |
I said last week that we were going to start looking at parallel processing. If you're 00:15:09.040 |
wondering about last week's homework, we're going to get there, but some of the techniques 00:15:12.680 |
we're about to learn, we're going to use to do last week's homework even better. So what 00:15:19.880 |
I want to do is I've got a CPU with something like 10 cores on it, and then each of those 00:15:28.800 |
cores have hyperthreading, so each of those cores can do kind of two things at once. So 00:15:33.680 |
I really want to be able to have a couple of dozen processes going on, each one resizing an image at the same time. 00:15:44.440 |
Just to remind you, this is as opposed to vectorization, or SIMD, which is where a single 00:15:49.560 |
thread operates on a bunch of things at a time. So we learned that to get SIMD working, 00:15:54.480 |
you just have to install Pillow-SIMD, and it just happens: 600% speedup. I tried it, 00:16:02.160 |
it works. Now we're going to, as well as the 600% speedup, also get another 10 or 20x speedup from parallel processing. 00:16:12.280 |
The basic approach to parallel processing in Python 3 is to set up something called either 00:16:19.240 |
a process pool or a thread pool. So the idea here is that we've got a number of little 00:16:25.340 |
programs running, threads or processes, and when we set up that pool, we say how many 00:16:31.120 |
of those little programs do we want to fire up. And then what we do is we say, okay, now 00:16:37.840 |
I want you to use workers. I want you to use all of those workers to do some thing. And 00:16:46.240 |
the easiest way to do a thing in Python 3 is to use Map. How many of you have used Map before? 00:16:53.800 |
So for those of you who haven't, Map is a very common functional programming construct 00:16:58.480 |
that's found its way into lots of other languages, which simply says, loop through a collection 00:17:03.680 |
and call a function on everything in that collection and return a new collection, which 00:17:09.360 |
is the result of calling that function on that thing. In our case, the function is resize, 00:17:16.440 |
In fact, the collection is a bunch of numbers, 0, 1, 2, 3, 4, and so forth, and what the 00:17:25.240 |
resize image is going to do is it's going to open that image off disk. So it's turning 00:17:32.560 |
the number 3 into the third image resized, 224x224, and we'll return that. 00:17:39.880 |
So the general approach here, this is basically what it looks like to do parallel processing 00:17:46.440 |
in Python. It may look a bit weird. We're going result equals exec.map. This is a function 00:17:54.960 |
I want, this is the thing to map over, and then I'm saying for each thing in that list, 00:18:00.040 |
do something. Now this might make you think, well wait, does that mean this list has to 00:18:05.480 |
have enough memory for every single resized image? And the answer is no, no it doesn't. 00:18:12.520 |
One of the things that Python 3 uses a lot more is using these things they call generators, 00:18:19.800 |
which is basically, it's something that looks like a list, but it's lazy. It only creates 00:18:24.800 |
that thing when you ask for it. So as I append each image, it's going to give me that image. 00:18:30.920 |
And if this mapping is not yet finished creating it, it will wait. So this approach looks like 00:18:37.080 |
it's going to use heaps of memory, but it doesn't. It uses only the minimum amount of 00:18:42.480 |
memory necessary and it does everything in parallel. 00:18:47.280 |
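Here is a rough sketch of that pattern with Python 3's concurrent.futures; `fnames`, `center_crop_resize` and `append_image` stand in for the pieces described elsewhere in this lesson rather than the notebook's exact names:

```python
from concurrent.futures import ThreadPoolExecutor  # or ProcessPoolExecutor

def resize_image(i):
    # turn the number i into the i-th image, read off disk and center-crop resized
    return center_crop_resize(fnames[i])

with ThreadPoolExecutor(max_workers=16) as exec:
    # map returns a lazy iterator over the results; we consume each image as it's ready
    results = exec.map(resize_image, range(len(fnames)))
    for img in results:
        append_image(img)   # e.g. append it to a bcolz array
```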
So resizeImage is something which is going to open up the image, it's going to turn it 00:18:56.040 |
into a NumPy array, and then it's going to resize it. And so then the resize does the 00:19:01.520 |
center cropping we just mentioned, and then after it's resized it's going to get appended. 00:19:07.520 |
What does appendImage do? So this is a bit weird. What's going on here? What it does 00:19:15.600 |
is it's going to actually stick it into what we call a pre-allocated array. We're learning 00:19:23.120 |
a lot of computer science concepts here. Anybody that's done computer science before will be 00:19:26.800 |
familiar with all of this already. If you haven't, you probably won't. But it's important 00:19:32.040 |
to know that the slowest thing in your computer, generally speaking, is allocating memory. 00:19:39.660 |
It's finding some memory, it's reading stuff from that memory, it's writing to that memory, 00:19:44.040 |
unless of course it's like cache or something. And generally speaking, if you create lots 00:19:49.800 |
and lots of arrays and then throw them away again, that's likely to be really, really 00:19:54.280 |
slow. So what I wanted to do was create a single 224x224 array which is going to contain 00:20:01.000 |
my resized image, and then I'm going to append that to my bcolz array. 00:20:08.320 |
So the way you do that in Python, it's wonderfully easy. You can create a variable from this thing 00:20:20.960 |
called threading.local. It's basically something that looks a bit like a dictionary, but it's 00:20:28.800 |
a very special kind of dictionary. It's going to create a separate copy of it for every 00:20:33.280 |
thread or process. Normally when you've got lots of things happening at once, it's going 00:20:39.360 |
to be a real pain because if two things try to use it at the same time, you get bad results 00:20:45.520 |
or even crashes. But if you allocate a variable like this, it automatically creates a separate 00:20:52.200 |
copy in every thread. You don't have to worry about locks, you don't have to worry about 00:20:56.320 |
race conditions, whatever. Once I've created this special threading.local variable, I then 00:21:04.120 |
create a placeholder inside it which is just an array of zeros of size 224x224x3. 00:21:12.000 |
So then later on, I create my bcolz array, which is where I'm going to put everything 00:21:17.040 |
eventually, and to append the image, I grab the bit of the image that I want and I put 00:21:23.600 |
it into that preallocated thread local variable, and then I append that to my bcolz array. 00:21:34.200 |
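As a sketch, the thread-local placeholder plus the bcolz append could look roughly like this (the path, dtype and the lazy-initialisation check are my own choices, not necessarily the notebook's):

```python
import threading
import numpy as np
import bcolz

loc = threading.local()   # each thread sees its own attributes on this object

# The on-disk bcolz array that everything gets appended to
arr = bcolz.carray(np.empty((0, 224, 224, 3), dtype='uint8'),
                   chunklen=32, mode='w', rootdir='results/trn_resized_224.bc')

def append_image(img):
    # lazily create one preallocated 224x224x3 buffer per thread, so we never allocate
    # a fresh array per image and never share a buffer between threads
    buf = getattr(loc, 'buf', None)
    if buf is None:
        buf = loc.buf = np.zeros((224, 224, 3), dtype='uint8')
    buf[:] = img
    arr.append(buf[None])   # append as a batch of one
```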
So there's lots of detail here in terms of using parallel processing effectively. I wanted 00:21:43.200 |
to briefly mention it not because I think somebody who hasn't studied computer science 00:21:47.480 |
is now going to go, "Okay, I totally understood all that," but to give you some of the things 00:21:51.240 |
to like search for and learn about over the next week if you haven't done any parallel 00:21:56.200 |
programming before. You're going to need to understand thread local storage and race conditions. 00:22:03.880 |
In Python, there's something called the global interpreter lock, which is one of the many 00:22:22.160 |
awful things about Python, which is that in theory two things can't happen at the same 00:22:28.760 |
time because Python wasn't really written in a thread-safe way. The good news is that 00:22:37.040 |
lots of libraries are written in a thread-safe way. So if you're using a library where most 00:22:42.960 |
of its work is being done in C, as is the case with Pillow-SIMD, actually you don't have 00:22:48.960 |
to worry about that. And I can prove it to you even because I drew a little picture. 00:22:55.080 |
Where is the result of serial versus parallel? The serial without SIMD version is 6 times 00:23:04.320 |
slower, so the default Python code you would have written maybe before today's 00:23:10.600 |
course would have taken 120 seconds to process 2000 images. With SIMD, it's 25 seconds. With 00:23:21.320 |
a process pool, it's 8 seconds for 3 workers and 5 seconds for 6 workers. The thread pool 00:23:28.800 |
is even better: 3.6 seconds for 12 workers, 3.2 seconds for 16 workers. 00:23:36.160 |
Your mileage will vary depending on what CPU you have. Given that probably quite a lot of 00:23:42.200 |
you are using the P2 still, unless you've got your deep learning box up and running, 00:23:46.040 |
you'll have the same performance as other people using the P2. You should try something 00:23:50.120 |
like this, which is to try different numbers of workers and see what's the optimal for 00:23:56.040 |
that particular CPU. Now once you've done that, you know. Once I went beyond 16, I didn't 00:24:01.360 |
really get improvements. So I know that on that computer, a thread pool of size 16 is 00:24:07.520 |
a pretty good choice. As you can see, once you get into the right general vicinity, it 00:24:12.480 |
doesn't vary too much. So as long as you're roughly okay, just behind you, Rachel. 00:24:23.040 |
So that's the general approach here, is run through something in parallel, each time append 00:24:27.400 |
it to my bcolz array. And at the end of that, I've got a bcolz array which I can 00:24:32.680 |
use again and again. So I don't re-run that code very often anymore. I've got all of the 00:24:36.880 |
ImageNet resized to each of 72x72, 224x224, and 288x288, and I give them different names. 00:24:49.000 |
In fact, I think that's what Keras does now. I think it squishes. Okay, so here's one of 00:25:10.440 |
these things. I'm not quite sure. My guess was that I don't think it's a good idea because 00:25:16.720 |
you're now going to have dogs of various different squish levels and your neural net is going 00:25:21.520 |
to have to learn that thing. It's got another type of symmetry to learn about, level of 00:25:31.360 |
squishiness. Whereas if we keep everything of the same aspect ratio, I think it's going 00:25:40.120 |
to be easier to learn so we'll get better results with less epochs of training. 00:25:45.960 |
That's my theory and I'd be fascinated for somebody to do a really in-depth analysis 00:25:49.160 |
of black borders versus center cropping versus squishing with image net. 00:25:57.620 |
So for now we can just open the bcolz array and there we go. So we're now ready 00:26:03.260 |
to create our model. I'll run through this pretty quickly because most of it's pretty 00:26:07.320 |
boring. The basic idea here is that we need to create an array of labels, called 00:26:13.000 |
vecs, which for every image in my bcolz array contains the target word vector. 00:26:26.200 |
Just to remind you, last week we randomly ordered the file names, so this bcolz 00:26:35.200 |
array is in random order. We've got our labels, which is the word vectors for every image. 00:26:44.360 |
We need to do our normal pre-processing. This is a handy way to pre-process in the new version 00:26:53.520 |
of Keras. We're using the normal Keras ResNet model, the one that comes in keras.applications. 00:27:02.000 |
It doesn't do the pre-processing for you, but if you create a lambda layer that does 00:27:08.200 |
the pre-processing then you can use that lambda layer as the input tensor. So this whole thing 00:27:16.080 |
now will do the pre-processing automatically without you having to worry about it. So that's 00:27:21.760 |
a good little trick. I'm not sure it's quite as neat as what we did in part 1 where we 00:27:26.800 |
put it in the model itself, but at least this way we don't have to maintain a whole separate 00:27:32.680 |
version of all of the models. So that's kind of what I'm doing nowadays. 00:27:44.840 |
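A minimal sketch of that Lambda-layer preprocessing trick (the mean values are the standard ImageNet channel means; exact import paths and argument names depend on your Keras version):

```python
import numpy as np
from keras.layers import Input, Lambda
from keras.applications.resnet50 import ResNet50

# Subtract the ImageNet channel means and flip RGB -> BGR, which is what the
# stock Keras ResNet50 weights expect
rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)
preproc = Lambda(lambda x: (x - rn_mean)[:, :, :, ::-1])

inp = Input((224, 224, 3))
resnet = ResNet50(include_top=False, input_tensor=preproc(inp))
```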
When you're working on really big datasets, you don't want to process things any more 00:27:50.560 |
than necessary and any more times than necessary. I know ahead of time that I'm going to want 00:27:55.640 |
to do some fine-tuning. What I decided to do was I decided this is the particular layer 00:28:04.200 |
where I'm going to do my fine-tuning. So I decided to first of all create a model which 00:28:09.480 |
started at the input and went as far as this layer. So my first step was to create that 00:28:18.680 |
model and save the results of that. The next step will be to take that intermediate step 00:28:27.160 |
and take it to the next stage I want to fine-tune to and save that. So it's a little shortcut. 00:28:33.920 |
There's a couple of really important intricacies to be aware of here though. The first one 00:28:39.480 |
is you'll notice that ResNet and Inception are not used very often for transfer learning. 00:28:50.680 |
This is something which I've not seen studied, and I actually think this is a really important 00:28:54.360 |
thing to study. Which of these things work best for transfer learning? But I think one 00:28:59.280 |
of the difficulties is that ResNet and Inception are harder. The reason they're harder is that 00:29:05.240 |
if you look at ResNet, you've got lots and lots of layers which make no sense on their 00:29:11.260 |
own. Ditto for Inception. They keep on splitting into 2 bits and then merging again. So what 00:29:21.160 |
I did was I looked at the Keras source code to find out how each block is named. What 00:29:29.720 |
I wanted to do was to say we've got a ResNet block, we've just had a merge, and then it 00:29:37.320 |
goes out and it does a couple of convolutions, and then it comes back and does an addition. 00:29:44.840 |
Basically I want to get one of these. Unfortunately for some reason Keras does not name these 00:29:53.960 |
merge layers. So what I had to do was get the next layer and then go back by 1. So it kind 00:30:04.640 |
of shows you how little people have been working with ResNet with transfer learning. Literally 00:30:09.480 |
the only bits of it that make sense to transfer learn from are nameless in one of the most 00:30:15.240 |
popular things for transfer learning, Keras. There's a second complexity when working with 00:30:26.000 |
ResNet. We haven't discussed this much, but ResNet actually has two kinds of ResNet blocks. 00:30:33.080 |
One is this kind, which is an identity block, and the second kind is a ResNet convolution 00:30:43.280 |
block, which they also call a bottleneck block. What this is is it's pretty similar. One thing 00:30:56.120 |
that's going up through a couple of convolutions and then goes and gets added together, but 00:31:00.160 |
the other side is not an identity. The other side is a single convolution. In ResNet they 00:31:09.360 |
throw in one of these every half a dozen blocks or so. Why is that? The reason is that if you 00:31:18.200 |
only have identity blocks, then all it can really do is to continually fine-tune where 00:31:26.000 |
it's at so far. We've learned quite a few times now that these identity blocks map to 00:31:32.760 |
the residual, so they keep trying to fine-tune the types of features that we have so far. 00:31:39.440 |
Whereas these bottleneck blocks actually force it from time to time to create a whole different 00:31:45.240 |
type of features because there is no identity path through here. The shortest path still 00:31:51.240 |
goes through a single convolution. When you think about transfer learning from ResNet, 00:31:57.360 |
you kind of need to think about should I transfer learn from an identity block before or after 00:32:03.480 |
or from a bottleneck block before or after. Again, I don't think anybody has studied this 00:32:10.080 |
or at least I haven't seen anybody write it down. I've played around with it a bit and 00:32:14.840 |
I'm not sure I have a totally decisive suggestion for you. But my guess is that the best 00:32:28.440 |
point to grab in ResNet is the end of the block immediately before a bottleneck block. 00:32:36.480 |
And the reason for that is that at that level of receptive field, obviously because each 00:32:42.760 |
bottleneck block is changing the receptive field, and at that level of semantic complexity, 00:32:50.440 |
this is the most sophisticated version of it because it's been through a whole bunch 00:32:54.160 |
of identity blocks to get there. So my belief is that just before that bottleneck 00:33:06.840 |
is the best place to transfer learn from. So that's what this is. This is the spot just 00:33:16.920 |
before the last bottleneck layer in ResNet. So it's pretty late, and so as we know very 00:33:26.120 |
well from part 1 with transfer learning, when you're doing something which is not too different, 00:33:31.520 |
and in this case we're switching from one-hot encoding to word vectors, which is not too 00:33:35.360 |
different. You probably don't want to transfer learn from too early, so that's why I picked 00:33:41.920 |
this fairly late stage, which is just before the final bottleneck block. 00:33:50.880 |
So the second complexity here is that this bottleneck block has these dimensions. The 00:33:57.880 |
output is 14x14x1024. So we have about a million images, so a million by 14x14x1024 is more 00:34:08.760 |
than I wanted to deal with. So I did something very simple, which was I popped in one more 00:34:17.600 |
layer after this, which is an average pooling layer, 7x7. So that's going to take my 14x14 down to 2x2. 00:34:31.520 |
So let's say one of those activations was looking for bird's eyeballs, then it's saying 00:34:37.880 |
in each of the 14x14 spots, how likely is it that this is a bird's eyeball? And so after 00:34:44.000 |
this it's now saying in each of these 4 spots, on average, how much were those cells looking 00:34:50.980 |
like bird's eyeballs? This is losing information. If I had a bigger SSD and more time, I wouldn't 00:35:03.120 |
have done this. But it's a good trick when you're working with these fully convolutional 00:35:07.160 |
architectures. You can pop an average pooling layer anywhere and decrease the resolution 00:35:13.320 |
to something that you feel like you can deal with. So in this case, my decision was to pool down to 2x2. 00:35:22.000 |
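A rough sketch of cutting the model at that intermediate layer and shrinking the grid (the layer index below is purely illustrative; you would inspect resnet.layers yourself to find the output of the block just before the last bottleneck block):

```python
from keras.models import Model
from keras.layers import AveragePooling2D

mid_layer = resnet.layers[-30]                          # hypothetical: the block output before the last bottleneck
mid_out = AveragePooling2D((7, 7))(mid_layer.output)    # 14x14x1024 -> 2x2x1024
mid_model = Model(resnet.input, mid_out)
```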
We had a question. I was going to ask, have we talked about why we do the merge operation 00:35:31.960 |
We have quite a few times, which is basically the merge was the thing which does the plus 00:35:37.680 |
here. That's the trick to making it into a ResNet block, is having the addition of the 00:35:45.240 |
identity with the result of the convolutions. 00:35:55.520 |
So recently I was trying to go from many filters. So you kind of just talked about downsizing 00:36:01.600 |
the size of the geometry. Is there a good best practice on going from, let's say, like 00:36:07.360 |
512 filters down to less? Or is it just as simple as doing convolution with less filters? 00:36:14.880 |
Yeah, there's not exactly a best practice for that. But in a sense, every single successful 00:36:27.800 |
architecture gives you some insights about that. Because every one of them eventually 00:36:31.720 |
has to end up with 1,000 categories if it's ResNet or three channels of 0-255 continuous values 00:36:41.440 |
if it's generative. So the best thing you can really do is, well, there's two things 00:36:46.440 |
one is to kind of look at the successful architectures. Another thing is, although this week is kind 00:36:52.120 |
of the last week where we're mainly going to be looking at images, I am going to briefly 00:36:56.200 |
next week open with a quick run through some of the things that you can look at to learn 00:37:01.200 |
more. And one of them is going to be a paper. In fact, two different papers which have like 00:37:06.360 |
best practices, you know, really nice kind of descriptions of these hundred different 00:37:11.960 |
things, these hundred different results. But all this stuff, it's still pretty artisanal. 00:37:19.520 |
Good question. So we initially resized images to 224, right? And it ended up being a 00:37:32.840 |
bcolz array already, right? Yes. So it's like 50 gig or something. Yes. And that's 00:37:44.080 |
compressed; uncompressed it's like a couple of hundred gig. But, well, if you load it 00:37:50.280 |
into memory... I'm not going to load it into memory, you'll see. So what you do is kind 00:37:54.360 |
of load it lazily. We're getting there. Yeah. So that's exactly the right segue I was looking for. 00:38:03.440 |
So what we're going to do now is we want to run this model we just built, just call basically 00:38:09.640 |
dot predict on it and save the predictions. The problem is that the size of those predictions 00:38:15.960 |
is going to be bigger than the amount of RAM I have, so I need to do it a batch at a time 00:38:21.360 |
and save it a batch at a time. They've got a million things, each one with this many 00:38:26.480 |
activations. And this is going to happen quite often, right? You're either working on a smaller 00:38:32.000 |
computer or you're working with a bigger dataset, or you're working with a dataset where you're 00:38:40.480 |
This is actually very easy to handle. You just create your bcols array where you're 00:38:45.720 |
going to store it. And then all I do is I go from 0 to the length of my array, my source 00:38:55.480 |
array, a batch at a time. So this is creating the numbers 0, 0 plus 128, 128 plus 128, and 00:39:04.360 |
so on and so forth. And then I take the slice of my source array from originally 0 to 128, 00:39:12.000 |
then from 128 to 256 and so forth. So this is now going to contain a slice of my source 00:39:19.760 |
bcols array. This is going to create a generator which is going to have all of those slices, 00:39:29.520 |
and of course being a generator it's going to be lazy. So I can then enumerate through 00:39:33.800 |
each of those slices, and I can append to my bcols array the result of predicting just 00:39:45.240 |
So you've seen like predict and evaluate and fit and so forth, and the generator versions. 00:39:55.280 |
Also in Keras there's generally an on-batch version, so there's a train on-batch and a 00:40:01.280 |
predict on-batch. What these do is they basically have no smarts to them at all. This is like 00:40:07.480 |
the most basic thing. So this is just going to take whatever you give it and call predict 00:40:11.720 |
on this thing. It won't shuffle it, it won't batch it, it's just going to throw it directly 00:40:18.280 |
So I'm just going to take a model, it's going to call predict on just this batch of data. 00:40:25.000 |
And then from time to time I print out how far I've gone just so that I know how I'm 00:40:29.480 |
going. Also from time to time I call .flush, that's the thing in bcols that actually writes 00:40:36.240 |
it to disk. So this thing doesn't actually take very long to run. And one of the nice 00:40:44.080 |
things I can do here is I can do some data augmentation as well. So I've added a direction 00:40:49.800 |
parameter, and what I'm going to do is I'm going to have a second copy of all of my images 00:40:55.200 |
which is flipped horizontally. So to flip things horizontally, that's interesting, I think I 00:41:04.040 |
screwed this up. To flip things horizontally, you've got batch, height, and then this is 00:41:20.120 |
columns. So if we pass in a -1 here, then it's going to flip it horizontally. That explains 00:41:31.600 |
why some of my results haven't been quite as good as I hoped. 00:41:36.280 |
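Putting those pieces together, here is a sketch of the batch-at-a-time prediction loop with the horizontal-flip pass (array names and paths are illustrative; `mid_model` stands for the cut-down ResNet described above):

```python
import numpy as np
import bcolz

def save_features(src_arr, model, fname, batch_size=128, direction=1):
    out = bcolz.carray(np.empty((0,) + model.output_shape[1:], dtype=np.float32),
                       chunklen=32, mode='w', rootdir=fname)
    for i in range(0, len(src_arr), batch_size):
        batch = src_arr[i:i + batch_size]       # a slice of the bcolz source array
        if direction == -1:
            batch = batch[:, :, ::-1]           # reverse the column axis: horizontal flip
        out.append(model.predict_on_batch(batch))
        if i % (batch_size * 100) == 0:
            print(i)                            # occasional progress report
            out.flush()                         # write what we have so far to disk
    out.flush()
    return out

feat_fwd = save_features(src_arr, mid_model, 'results/feat_fwd.bc', direction=1)
feat_bwd = save_features(src_arr, mid_model, 'results/feat_bwd.bc', direction=-1)
```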
So when you run this, we're going to end up with a big bcolz array that's going to 00:41:41.560 |
contain, for two copies of every resized ImageNet image, the activations at the layer that we have, 00:41:50.640 |
one layer before this. So I call it once with direction forwards and once with direction 00:41:57.960 |
backwards. So at the end of that, I've now got nearly 2 million activations of 2x2x1024. 00:42:08.280 |
So that's pretty close to the end of ResNet. I've 00:42:19.120 |
then just copied and pasted from the Keras code the last few steps of ResNet. So this 00:42:24.720 |
is the last few blocks. I added in one extra identity block just because I had a feeling 00:42:31.160 |
that might help things along a little bit. Again, people have not really studied this 00:42:35.120 |
yet, so I haven't had a chance to properly experiment, but it seemed to work quite well. 00:42:42.080 |
This is basically copied and pasted from Keras's code. I then need to copy the weights from 00:42:48.600 |
Keras for those last few layers of ResNet. So now I'm going to repeat the same process 00:42:54.240 |
again, which is to predict on these last few layers. The input will be the output from 00:43:01.280 |
the previous one. So we went like 2/3 of the way into ResNet and got those activations 00:43:07.880 |
and put those activations into the last few stages of ResNet to get those activations. 00:43:14.000 |
Now the outputs from this are actually just a vector of length 2048, which does fit in 00:43:23.280 |
my RAM, so I didn't bother with calling predict on batch, I can just call .predict. If you 00:43:29.880 |
try this at home and don't have enough memory, you can use the predict on batch trick again. 00:43:36.280 |
Any time you ran out of memory when calling predict, you can always just use this pattern. 00:43:47.560 |
So at the end of all that, I've now got the activations from the penultimate layer of 00:43:54.400 |
ResNet, and so I can do a usual transfer learning trick of creating a linear model. My linear 00:44:04.160 |
model is now going to try to use the number of dimensions in my word vectors as its output, 00:44:12.520 |
and you'll see it doesn't have any activation function. That's because I'm not doing one 00:44:18.480 |
hot encoding, my word vectors could be any size numbers, so I just leave it as linear. 00:44:25.920 |
And then I compile it, and then I fit it, and so this linear model is now my very first 00:44:31.640 |
-- this is almost the same as what we did in Lesson 1, dogs vs. cats. We're fine 00:44:38.520 |
tuning a model to a slightly different target to what it was originally trained with. It's 00:44:48.040 |
just that we're doing it with a lot more data, so we have to be a bit more thoughtful. 00:44:53.480 |
There's one other difference here, which is I'm using a custom loss function. And the 00:44:58.560 |
loss function I'm using is cosine distance. You can look that up at home if you're not 00:45:04.240 |
familiar with it, but basically cosine distance says for these two points in space, what's 00:45:09.600 |
the angle between them, rather than how far away are they? The reason we're doing that 00:45:14.560 |
is because we're about to start using k nearest neighbors. So k nearest neighbors, we're going 00:45:19.440 |
to basically say here's the word vector we predicted, which is the word vector which 00:45:24.600 |
is closest to it. It turns out that in really really high dimensional space, the concept 00:45:30.520 |
of how far away something is, is nearly meaningless. And the reason why is that in really really 00:45:36.440 |
high dimensional space, everything sits on the edge of that space. Basically because 00:45:43.960 |
you can imagine as you add each additional dimension, the probability that something 00:45:48.480 |
is on the edge in that dimension, let's say the probability that it's right on the edge 00:45:53.280 |
is like 1/10. Then if you've only got one dimension, you've got a probability of 1/10 00:45:58.000 |
that it's on the edge in that dimension. If you've got two dimensions, the probability that it's 00:46:03.800 |
not on the edge in either shrinks multiplicatively. So in a few hundred dimensional spaces, everything 00:46:09.440 |
is on the edge. And when everything's on the edge, everything is kind of an equal distance 00:46:14.480 |
away from each other, more or less. And so distances aren't very helpful. But the angle 00:46:19.560 |
between things varies. So when you're doing anything with trying to find nearest neighbors, 00:46:28.360 |
it's a really good idea to train things using cosine distance. And this is the formula for 00:46:34.920 |
cosine distance. Again, this is one of these things where I'm skipping over something that 00:46:40.460 |
you'd probably spend a week in undergrad studying. There's heaps of information about cosine distance 00:46:46.660 |
on the web. So for those of you already familiar with it, I won't waste your time. For those 00:46:50.640 |
of you not, it's a very very good idea to become familiar with this. And feel free to 00:46:56.360 |
ask on the forums if you can't find any material that makes sense. 00:47:02.000 |
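A minimal sketch of a cosine-distance loss and the linear head described here (the feature size of 2048 and word-vector size of 300 are assumptions; use whatever your activations and word vectors actually are):

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

def cos_distance(y_true, y_pred):
    # 1 - cosine similarity: small when the vectors point the same way, whatever their length
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return K.mean(1 - K.sum(y_true * y_pred, axis=-1))

n_wv = 300   # dimensionality of the word vectors
lin_model = Sequential([Dense(n_wv, input_shape=(2048,))])   # linear: no activation
lin_model.compile(optimizer='adam', loss=cos_distance)
```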
So we've fitted our linear model. As per usual, we save our weights. And we can see how we're 00:47:07.940 |
going. So what we've got now is something where we can feed in an image, and it will 00:47:14.720 |
spit out a word vector. But it's something that looks like a word vector. It has the 00:47:20.600 |
same dimensionality as a word vector. But it's very unlikely that it's going to be the 00:47:25.120 |
exact same vector as one of our thousand target word vectors. So if the word vector for a 00:47:32.800 |
pug is this list of 200 floats, even if we have a perfectly puggy pug, we're not going 00:47:40.680 |
to get that exact list of 200 floats. We'll have something that is similar. And when we 00:47:46.120 |
say similar, we probably mean that the cosine distance between the perfect platonic pug 00:47:51.560 |
and our pug is pretty small. So that's why after we get our predictions, we then have 00:48:05.800 |
to use nearest neighbors as a second step to basically say, for each of those predictions, 00:48:11.520 |
what are the three word vectors that are the closest to that prediction? 00:48:18.440 |
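A sketch of that nearest-neighbour step with scikit-learn; `wordvecs` is the 1000 x n_wv matrix of target word vectors and `preds` the model's predictions. (LSHForest was later deprecated in scikit-learn; on a recent version you would use NearestNeighbors with metric='cosine' instead.)

```python
from sklearn.neighbors import LSHForest

nn = LSHForest(n_estimators=20, n_neighbors=3)
nn.fit(wordvecs)                        # index the target word vectors
dists, idxs = nn.kneighbors(preds)      # the 3 closest word vectors for each prediction
```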
So we can now take those nearest neighbors and find out for a bunch of our images what 00:48:25.100 |
are the three things it thinks it might be. For example, for this image here, its best 00:48:32.080 |
guess was trombone, next was flute, and third was cello. This gives us some hope that this 00:48:39.520 |
approach seems to be working okay. It's not great yet, but it's recognized these things 00:48:44.560 |
are musical instruments, and its third guess was in fact the correct musical instrument. 00:48:49.320 |
So we know what to do next. What we do next is to fine-tune more layers. And because we 00:48:55.240 |
have already saved the intermediate results from an earlier layer, that fine-tuning is going to be fast. 00:49:02.960 |
Two more things I briefly mentioned. One is that there's a couple of different ways to 00:49:08.920 |
do nearest neighbors. One is what's called the brute force approach, which is literally 00:49:13.200 |
to go through everyone and see how far away it is. There's another approach which is approximate 00:49:23.960 |
nearest neighbors. And when you've got lots and lots of things, you're trying to look 00:49:28.640 |
for nearest neighbors, the brute force approach is going to be n^2 time. It's going to be super 00:49:35.000 |
slow. Approximate nearest neighbors are generally n log n time. So orders of magnitude faster 00:49:48.360 |
The particular approach I'm using here is something called locality-sensitive hashing. 00:49:53.120 |
It's a fascinating and wonderful algorithm. Anybody who's interested in algorithms, I 00:49:57.840 |
strongly recommend you go read about it. Let me know if you need a hand with it. My favorite 00:50:04.900 |
kind of algorithms are these approximate algorithms. In data science, you almost never need to 00:50:13.280 |
know something exactly, yet nearly every algorithm that people learn at university and certainly 00:50:18.640 |
at high school are exact. We learn exact nearest neighbor algorithms and exact indexing algorithms 00:50:24.520 |
and exact median algorithms. Pretty much for every algorithm out there, there's an approximate 00:50:29.800 |
version that runs on the order of n, or n over log n, times faster. One of the cool things is that once 00:50:29.800 |
you start realizing that, you suddenly discover that all of the libraries you've been using 00:50:42.400 |
for ages were written by people who didn't know this. And then you realize that every 00:50:47.520 |
sub-algorithm they've written, they could have used an approximate version. The next 00:50:51.160 |
thing you've got to know, you've got something that runs a thousand times faster. 00:50:55.040 |
The other cool thing about approximate algorithms is that they're generally written to provably 00:51:00.440 |
be accurate to within so close. And it can tell you with your parameters how close is 00:51:05.760 |
so close, which means that if you want to make it more accurate, you run it more times 00:51:12.400 |
with different random seeds. This thing called LSH forest is a locality-sensitive hashing 00:51:18.680 |
forest which means it creates a bunch of these locality-sensitive hashes. And the amazingly 00:51:24.000 |
great thing about approximate algorithms is that each time you create another version 00:51:28.360 |
of it, you're exponentially increasing the accuracy, or multiplicatively increasing the 00:51:34.200 |
accuracy, but only linearly increasing the time. So if the error on one call of LSH was 00:51:43.480 |
e, then the error on two calls is e^2 and on 3 calls it's e^3, so the accuracy is 1 - e^2 and then 1 - e^3. And the time you're 00:51:53.600 |
taking is now 2n and 3n. So when you've got something where you can make it as accurate 00:52:01.160 |
as you like with only linear increasing time, this is incredibly powerful. This is a great 00:52:08.440 |
approximation algorithm. I wish we had more time, so I'd love to tell you all about it. 00:52:16.860 |
So I generally use LSH forest when I'm doing nearest neighbors because it's arbitrarily 00:52:21.400 |
close and much faster when you've got lots of word vectors. The time that becomes important 00:52:29.640 |
is when I move beyond ImageNet, which I'm going to do now. 00:52:34.440 |
So let's say I've got a picture, and I don't just want to say which one of the thousand 00:52:40.620 |
ImageNet categories is it. Which one of the 100,000 WordNet nouns is it? That's a much 00:52:48.240 |
harder thing to do. And that's something that no previous model could do. When you trained 00:52:54.880 |
an ImageNet model, the only thing you could do is recognize pictures of things that were 00:52:58.720 |
in ImageNet. But now we've got a word vector model, and so we can put in an image that 00:53:06.520 |
spits out a word vector, and that word vector could be closer to things that are not in 00:53:11.480 |
ImageNet at all. Or it could be some higher level of the hierarchy, so we could look for 00:53:16.840 |
a dog rather than a pug, or a plane rather than a 747. 00:53:24.220 |
So here we bring in the entire set of word vectors. I'll have to remember to share these 00:53:30.600 |
with you because these are actually quite hard to create. And this is where I definitely 00:53:36.040 |
want LSHForest because this is going to be pretty slow. And we can now do the same thing. 00:53:43.280 |
And not surprisingly, it's got worse. The thing that was actually cello, now cello is not 00:53:47.680 |
even in the top 3. So this is a harder problem. So let's try fine-tuning. So fine-tuning is 00:53:56.360 |
the final trick I'm going to show you. Just behind you, Rachel. 00:54:00.960 |
You might remember last week we looked at creating our word vectors, and what we did 00:54:17.800 |
was actually I created a list. I went to WordNet and I downloaded the whole of WordNet, and 00:54:32.680 |
then I figured out which things were nouns, and then I used a regex to parse out those, 00:54:37.480 |
and then I saved that. So we actually have the entirety of WordNet nouns. 00:54:45.320 |
Because it's not a good enough model yet. So now that there's 80,000 nouns, there's a lot 00:55:00.040 |
more ways to be wrong. So when it only has to say which of these thousand things is it, 00:55:05.600 |
that's pretty easy. Which of these 80,000 things is it? It's pretty hard. To fine-tune it, it 00:55:19.680 |
looks very similar to our usual way of fine-tuning things, which is that we take our two models 00:55:27.360 |
and stick them back to back, and we're now going to train the whole thing rather than 00:55:37.320 |
The problem is that the input to this model is too big to fit in RAM. So how are we going 00:55:46.880 |
to call fit or fit generator when we have an array that's too big to fit in RAM? Well, 00:55:54.840 |
one obvious thing to do would be to pass in the bcolz array. Because to most things in 00:56:00.280 |
Python, a bcolz array looks just like a regular array. It doesn't really look any different. 00:56:07.500 |
A bcolz array is actually stored in a directory, as I'm sure 00:56:21.720 |
you've noticed. And in that directory, it's got something called chunk length, which I set 00:56:33.960 |
to 32 when I created these bcolz arrays. What it does is it takes every 32 images and it 00:56:42.280 |
puts them into a separate file. Each one of these has 32 images in it, or 32 of the leading 00:56:58.400 |
Now if you then try to take this whole array and pass it to .fit in Keras with shuffle, 00:57:08.600 |
it's going to try and grab one thing from here and one thing from here and one thing 00:57:12.720 |
from here. Here's the bad news. For bcolz to get one thing out of a chunk, it has to read 00:57:20.320 |
and decompress the whole thing. It has to read and decompress 32 images in order to give 00:57:25.800 |
you the one image you asked for. That would be a disaster. That would be ridiculously 00:57:30.640 |
horribly slow. We didn't have to worry about that when we called predict on batch. We were 00:57:39.960 |
going not shuffling, but we were going in order. So it was just grabbing one. It was 00:57:49.520 |
never grabbing a single image out of a chunk. But now that we want to shuffle, it would. 00:57:58.240 |
So what we've done is somebody very helpfully actually on a Kaggle forum provided something 00:58:07.080 |
called a bcolz array iterator. The bcolz array iterator, which was kindly discovered on the 00:58:19.000 |
forums by somebody named MP Janssen, originally written by this fellow, provides a Keras-compatible 00:58:34.120 |
generator which grabs an entire chunk at a time. So it's a little bit less random, but 00:58:43.680 |
given that if this has got 2 million images in and the chunk length is 32, then it's going 00:58:49.480 |
to basically create a batch of chunks rather than a batch of images. And so that means 00:58:56.700 |
we have none of the performance problems, and particularly because we randomly shuffled 00:59:03.120 |
our files. So this whole thing is randomly shuffled anyway. So this is a good trick. 00:59:10.140 |
So you'll find the bcolz array iterator on GitHub. Feel free to take a look at the code. 00:59:17.000 |
It's pretty straightforward. There were a few issues with the original version, so MP 00:59:25.560 |
Janssen and I have tried to fix it up and I've written some tests for it and he's written 00:59:29.320 |
some documentation for it. But if you just want to use it, then it's as simple as writing 00:59:36.520 |
this. Blah equals bcolz array iterator, this is your data, these are your labels, shuffle 00:59:45.240 |
equals true, batch size equals whatever, and then you can just call fit generator as per 00:59:49.640 |
usual passing in that iterator and that iterator's number of items. 00:59:58.200 |
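In other words, something roughly like this (the BcolzArrayIterator lives in the course repo; the attribute holding the number of items and the fit_generator argument names differ between Keras 1 and 2, e.g. samples_per_epoch vs steps_per_epoch):

```python
from bcolz_array_iterator import BcolzArrayIterator

it = BcolzArrayIterator(features_arr, labels_arr, shuffle=True, batch_size=128)
# Keras 1-style call; it.N is assumed to hold the iterator's number of items
model.fit_generator(it, samples_per_epoch=it.N, nb_epoch=1)
```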
So to all of you guys who have been asking how to deal with data that's bigger than memory, 01:00:06.240 |
this is how you do it. So hopefully that will make life easier for a lot of people. 01:00:14.000 |
So we fine-tune it for a while, we do some learning rate annealing for a while, and this basically 01:00:21.880 |
runs overnight for me. It takes about 6 hours to run. And so I come back the next morning 01:00:29.280 |
and I just copy and paste my k nearest neighbors, so I get my predicted word vectors. For each 01:00:38.040 |
word vector, I then pass it into nearest neighbors. This is with just my 1000 categories. And lo and 01:00:45.280 |
behold, we now have cello in the top spot as we hoped. 01:00:50.880 |
How did it go in the harder problem of looking at the 100,000 or so nouns in English? Pretty 01:00:56.520 |
good. I've got this one right. And just to pick another one at random, let's pick the 01:01:01.680 |
first one. It said throne. That sure looks like a throne. So looking pretty good. 01:01:09.120 |
So here's something interesting. Now that we have brought images and words into the 01:01:14.440 |
same space, let's play with it some more. So why don't we use nearest neighbors with 01:01:22.400 |
those predictions? These are the word vectors which Google created, but the subset of those which 01:01:47.200 |
are nouns according to WordNet, mapped to their synset IDs. 01:02:00.200 |
The word vectors are just the word2vec vectors that we can download off the internet. They 01:02:05.640 |
were pre-trained by Google. We're saying here is this image spits out a vector from a thing 01:02:30.280 |
we just trained. We have 100,000 word2vec vectors for all the nouns in English. Which 01:02:38.480 |
one of those is the closest to the thing that came out of our model? And the answer was 01:02:45.520 |
Hold that thought. We'll be doing language translation starting next week. No, we don't 01:02:59.780 |
quite do it that way, but you can think of it like that. 01:03:02.800 |
So let's do something interesting. Let's create a nearest neighbors not for all of the word2vec 01:03:11.120 |
vectors, but for all of our image-predicted vectors. And now we can do the opposite. Let's 01:03:17.360 |
take a word, we pick it random. Let's look it up in our word2vec dictionary, and let's 01:03:23.120 |
find the nearest neighbors for that in our images. There it is. So this is pretty interesting. 01:03:36.200 |
You can now find the images that are the most like whatever word you come up with. Okay, 01:03:46.560 |
that's crazy, but we can do crazier. Here is a random thing I picked. Now notice I picked 01:03:52.520 |
it from the validation set of ImageNet, so we've never seen this image before. And honestly 01:03:57.100 |
when I opened it up, my heart sank because I don't know what it is. So this is a problem. 01:04:02.800 |
What is that? So what we can do is we can call .predict on that image, and we can then 01:04:15.720 |
do a nearest neighbors of all of our other images. There's the first, there's the second, 01:04:23.880 |
and the third one is even somebody putting their hand on it, which is slightly crazy, 01:04:30.440 |
but that was what the original one looked like. In fact, if I can find it, I ran it 01:04:40.600 |
again on a different image. I actually looked around for something weird. This is pretty 01:04:54.200 |
weird, right? Is this a net or is it a fish? So when we then ask for nearest neighbors, 01:04:59.840 |
we get fish in nets. So it's like, I don't know, sometimes deep learning is so magic 01:05:07.040 |
you just kind of go "wow". Just behind you, Rachel. 01:05:14.240 |
Only a little bit, and maybe in a future course we might look at Dask. I think maybe even 01:05:33.000 |
in your numerical and algebra course you might be looking at Dask. I don't think we'll cover 01:05:37.360 |
it this course. But do look at Dask, D-A-S-K, it's super cool. 01:05:48.400 |
No, not at all. So these were actually labeled as this particular kind of fish. In fact that's 01:06:01.960 |
the other thing is it's not only found fish in nets, but it's actually found more or less 01:06:05.960 |
the same breed of fish in the nets. But when we called dot predict on those, it created 01:06:17.520 |
a word vector which was probably like halfway between that kind of fish and a net because 01:06:27.080 |
it doesn't know what to do, right? So sometimes when it sees things like that, it would have 01:06:30.800 |
been marked in imageNet as a net, and sometimes it would have been a fish. So the best way 01:06:35.200 |
to minimize the loss function would have been to kind of hedge. So it hedged and as a result 01:06:40.760 |
the images that were closest were the ones which actually were halfway between the two 01:06:45.080 |
themselves. So it's kind of a convenient accident. 01:06:48.840 |
You absolutely can and I have, but really for nearest neighbors I haven't found anything 01:07:18.680 |
nearly as good as cosine and that's true in all of the things I looked up as well. By 01:07:25.640 |
the way, I should mention when you use locality-sensitive hashing in Python, by default it uses something 01:07:32.280 |
that's equivalent to the cosine metric, so that's why the nearest neighbors work. 01:07:36.520 |
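As a rough sketch of what that lookup can look like (the names `img_vecs`, `word_vecs` and `word_id` are placeholders, not the notebook's actual variables):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index the image-predicted vectors with a brute-force cosine metric.
nn = NearestNeighbors(n_neighbors=3, metric='cosine', algorithm='brute').fit(img_vecs)

# Query with a single word2vec vector to get the indices of the closest images.
dists, idxs = nn.kneighbors(word_vecs[word_id].reshape(1, -1))
```

Swapping which set is indexed and which is used as the query gives the reverse direction, image to word.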
So starting next week we're going to be learning about sequence-to-sequence models and memory 01:07:50.880 |
and attention methods. They're going to show us how we can take an input such as a sentence 01:07:56.960 |
in English and an output such as a sentence in French, which is the particular case study 01:08:02.280 |
we're going to be spending 2 or 3 weeks on. When you combine that with this, you get image 01:08:07.680 |
captioning. I'm not sure if we're going to have time to do it ourselves, but it will 01:08:12.480 |
literally be trivial for you guys to take the two things and combine them and do image 01:08:18.680 |
captioning. It's just those two techniques together. 01:08:25.940 |
So we're now going to switch to -- actually before we take a break, I want to show you 01:08:35.760 |
the homework. Hopefully you guys noticed I gave you some tips because it was a really 01:08:42.440 |
challenging one. Even though in a sense it was kind of straightforward, which was take 01:08:48.240 |
everything that we've already learned about super-resolution and slightly change the loss 01:08:51.960 |
function so that it does perceptual losses for style transfer instead, the details were fiddly. 01:08:57.840 |
I'm going to quickly show you two things. First of all, I'm going to show you how I 01:09:01.600 |
did the homework because I actually hadn't done it last week. Luckily I have enough RAM 01:09:07.640 |
that I could read the two things all into memory, so don't forget you can just do that 01:09:11.680 |
with a bcolz array: slicing it with [:] returns it as a NumPy array in memory. 01:09:17.520 |
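That looks something like this (the file paths are placeholders, not the notebook's actual names):

```python
import bcolz

# Open the on-disk bcolz arrays and slice with [:] to pull them into RAM
# as ordinary NumPy arrays.
arr_hr = bcolz.open('data/trn_hi_res.bc')[:]
arr_lr = bcolz.open('data/trn_lo_res.bc')[:]
```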
So one thing I did was I created my up-sampling block to get rid of the checkerboard patterns. 01:09:22.920 |
That was literally as simple as saying UpSampling2D and then a 1x1 conv, and that got rid of my checkerboard patterns. 01:09:29.040 |
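A minimal sketch of that kind of block (Keras 2 layer names; the batch norm and ReLU are common additions and may differ from the notebook):

```python
from keras.layers import UpSampling2D, Conv2D, BatchNormalization, Activation

def up_block(x, filters):
    # Nearest-neighbour upsample, then a convolution: this avoids the
    # checkerboard artifacts that Deconvolution2D tends to produce.
    x = UpSampling2D()(x)
    x = Conv2D(filters, (1, 1), padding='same')(x)  # the lecture mentions a 1x1 conv here
    x = BatchNormalization()(x)
    return Activation('relu')(x)
```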
The next thing I did was I changed my loss function, and I decided 01:09:38.420 |
before I tried to do style transfer with perceptual losses, let's try and do super-resolution with 01:09:46.400 |
multiple content-loss layers. That's one thing I'm going to have to do for style transfer 01:09:52.560 |
is be able to use multiple layers. So I always like to start with something that works and 01:09:57.720 |
make small little changes so it keeps working at every point. 01:10:01.640 |
So in this case, I thought, let's first of all slightly change the loss function for 01:10:08.760 |
super-resolution so that it uses multiple layers. So here's how I did that. I changed 01:10:14.720 |
my get_output layer. Sorry, I changed my VGG content function so it created a list of outputs: conv1 01:10:24.920 |
from each of the first, second and third blocks. And then I changed my loss function so it 01:10:32.800 |
went through and added the mean squared difference for each of those three layers. I also decided 01:10:40.000 |
to add a weight just for fun. So I decided to go 0.1, 0.8, 0.1 because this is the layer 01:10:46.280 |
that they used in the paper. But let's have a little bit more precise super-resolution 01:10:53.320 |
and a little bit more semantic super-resolution and see how it goes. I created this function 01:10:59.880 |
to do a more general mean squared error. And that was basically it. Other than that line 01:11:07.960 |
everything else was the same, so that gave me super-resolution working on multiple layers. 01:11:15.960 |
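In outline, the weighted multi-layer content loss looks something like this (a sketch with placeholder names; `outs` and `targs` stand for the lists of VGG activations from the network output and the hi-res target):

```python
from keras import backend as K

w = [0.1, 0.8, 0.1]  # block1, block2 (the layer the paper used), block3

def multi_content_loss(outs, targs):
    # Weighted sum of per-layer mean squared differences, one term per VGG block.
    loss = 0
    for o, t, wi in zip(outs, targs, w):
        loss += wi * K.mean(K.square(o - t), axis=[1, 2, 3])
    return loss
```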
One of the things I found fascinating is that this is the original low-res, and it's done 01:11:22.560 |
a good job of upscaling it, but it's also fixed up the weird white balance, which really 01:11:27.880 |
surprised me. It's taken this obviously over-yellow shot, and this is what ceramic should look 01:11:34.880 |
like, it should be white. And somehow it's adjusted everything, so the serviette or whatever 01:11:40.040 |
it is in the background has gone from a yellowy-brown to a nice white, as with these cups here. 01:11:45.560 |
It's figured out that these slightly pixelated things are actually meant to be upside-down 01:11:49.080 |
handles. This is on only 20,000 images. I'm very surprised that it's fixing the color 01:11:59.200 |
because we never asked it to, but I guess it knows what a cup is meant to look like, 01:12:06.800 |
and so this is what it's decided to do: make a cup the way it thinks it's meant to look. 01:12:18.640 |
So then to go from there to style-transfer was pretty straightforward. I had to read 01:12:23.040 |
in my style as before. This is the code to do this special kind of resnet block where 01:12:30.560 |
we use valid convolutions, which means we lose two pixels each time, and so therefore 01:12:36.920 |
we have to do a center crop. So don't forget, lambda layers are great for this kind of thing. 01:12:43.080 |
Whatever code you can write, chuck it in a lambda layer, and suddenly it's a Keras layer. 01:12:47.560 |
So do my center crop. This is now a resnet block which does valid convs. This is basically 01:12:54.920 |
all exactly the same. We have to do a few downsamplings, and then the computation, and 01:13:00.720 |
our upsampling, just like in the paper's supplementary material. 01:13:05.420 |
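Roughly, the valid-convolution residual block with the centre crop looks like this (a sketch assuming channels-last tensors and that the block input already has `filters` channels; names are placeholders):

```python
from keras.layers import Conv2D, BatchNormalization, Activation, Lambda, add

def conv_block(x, filters, act=True):
    x = Conv2D(filters, (3, 3), padding='valid')(x)  # 'valid' shaves 1 pixel off each edge
    x = BatchNormalization()(x)
    return Activation('relu')(x) if act else x

def res_crop_block(ip, filters):
    x = conv_block(ip, filters)
    x = conv_block(x, filters, act=False)
    # Two valid 3x3 convs have removed 2 pixels from each edge, so centre-crop
    # the input to match before adding; the Lambda makes the crop a Keras layer.
    ip_crop = Lambda(lambda t: t[:, 2:-2, 2:-2, :])(ip)
    return add([x, ip_crop])
```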
So the loss function looks a lot like the loss function did before, but we've got two 01:13:10.760 |
extra things. One is the Gram matrix. So here is a version of the Gram matrix which works 01:13:17.160 |
a batch at a time. If any of you tried to do this a single image at a time, you would have 01:13:21.640 |
gone crazy with how slow it was. I saw a few of you trying to do that. So here's the batch-at-a-time version. 01:13:30.040 |
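A sketch of a batchwise Gram matrix in Keras backend code (channels-last assumed; the normalisation constant is just one common choice):

```python
from keras import backend as K

def gram_matrix_b(x):
    # x: (batch, height, width, channels). Move channels first, flatten the
    # spatial dims, then take all channel-by-channel dot products per sample
    # in a single batch_dot call instead of looping over images.
    x = K.permute_dimensions(x, (0, 3, 1, 2))
    s = K.shape(x)
    feat = K.reshape(x, (s[0], s[1], -1))
    gram = K.batch_dot(feat, K.permute_dimensions(feat, (0, 2, 1)))
    return gram / K.cast(s[1] * s[2] * s[3], K.floatx())
```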
And then the second thing I needed to do was somehow feed in my style target. Another thing 01:13:35.280 |
I saw some of you do was feed that style target array into 01:13:45.320 |
your loss function. You can obviously calculate your style target by just calling .predict 01:13:53.640 |
with the thing which gives you all your different style target layers, but the problem is this 01:13:58.880 |
thing here returns a NumPy array. It's a pretty big NumPy array, which means that then when 01:14:04.600 |
you want to use it as a style target in training, it has to copy that back to the GPU. And copying 01:14:11.680 |
to the GPU is very, very slow. And this is a really big thing to copy to the GPU. So 01:14:16.760 |
any of you who tried this, and I saw some of you try it, it took forever. 01:14:21.760 |
So here's the trick. Call .variable on it. Turning something into a variable puts it 01:14:29.440 |
on the GPU for you. So once you've done that, you can now treat this as a list of symbolic 01:14:37.640 |
entities which are the GPU versions of this. So I can now use this inside my GPU code. 01:14:46.600 |
So here are my style targets I can use inside my loss function, and it doesn't have to do 01:14:57.800 |
any copying backwards and forwards. So there's a subtlety, but if you don't get that subtlety 01:15:03.520 |
right, you're going to be waiting for a week or so for your code to finish. 01:15:10.000 |
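The trick itself is essentially one line. A sketch, where `style_layers_model` is assumed to be a model mapping an image to the chosen VGG style layers and `style_img` is the preprocessed style image:

```python
from keras import backend as K

# .predict returns NumPy arrays; wrapping each one in K.variable puts the values
# on the GPU once, so the loss function can use them symbolically without
# copying them over again on every batch.
style_targs = [K.variable(o) for o in style_layers_model.predict(style_img)]
```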
So those were the little subtleties which were necessary to get this to work. And once 01:15:15.400 |
you get it to work, it does exactly the same thing basically as before. 01:15:20.200 |
So where this gets combined with big datasets is that I wanted to try something interesting, which 01:15:26.920 |
is in the original Perceptual Losses paper, they trained it on the COCO dataset which 01:15:34.480 |
has 80,000 images, which didn't seem like many. I wanted to know what would happen if 01:15:39.680 |
we trained it on all of ImageNet. So I did. So I decided to train a super-resolution network 01:15:49.120 |
on all of ImageNet. And the code's all identical, so I'm not going to explain it. Other than, 01:15:57.320 |
you'll notice we don't have the [:] here anymore because 01:16:02.480 |
we don't want to try and read in the entirety of ImageNet into RAM. So these are still 01:16:06.600 |
bcolz arrays. All the other code is identical until we get to here. So I use a bcolz array 01:16:19.760 |
iterator. I can't just call .fit because .fit or .fit_generator assumes that your iterator 01:16:33.160 |
is returning your data and your labels. In our case, we don't have data and labels. We 01:16:40.920 |
have two things that both get fed in as two inputs, and our labels are just a list of zeros. 01:16:47.720 |
So here's a good trick. This answers your earlier question about how do you do multi-input 01:16:56.600 |
models on large datasets. The answer is create your own training loop which loops through 01:17:05.200 |
a bunch of iterations, and then you can grab as many batches of data from as many different 01:17:11.160 |
iterators as you like, and then call train_on_batch. So in my case, my bcolz array iterator 01:17:18.720 |
is going to return my high resolution and low resolution batch of images. So I go through 01:17:24.000 |
a bunch of iterations, grab one batch of high res and low res images, and pass them in as my two inputs. 01:17:34.920 |
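A compact sketch of that loop (the iterator, the input order and the batch size here are assumptions, not the notebook's exact code; the zeros are the dummy targets for a model that outputs its own loss):

```python
import numpy as np

def train(model, bc_it, n_iter, batch_size=16):
    targ = np.zeros((batch_size, 1))
    for i in range(n_iter):
        hr, lr = next(bc_it)                  # one batch from the bcolz iterator
        model.train_on_batch([lr, hr], targ)  # two inputs, dummy zero targets
```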
So this is the only code I changed, other than changing .fit_generator to actually calling 01:17:44.600 |
train. So as you can see, this took me 4.5 hours to train and I then decreased the learning 01:17:55.440 |
rate and I trained for another 4.5 hours. Actually, I did it overnight last night and I only had 01:18:00.080 |
enough time to do about half of ImageNet, so this isn't even the whole thing. But check this out. 01:18:06.040 |
So take that model and we're going to call .predict. This is the original high res image. 01:18:14.920 |
Here's the low res version. And here's the version that we've created. And as you can 01:18:22.360 |
see, it's done a pretty extraordinarily good job. When you look at the original ball, there 01:18:29.800 |
was this kind of vague yellow thing here. It's kind of turned it into a nice little 01:18:34.000 |
version. You can see that her eyes were like two grey blobs. It's kind of turned them into 01:18:40.520 |
some eyes. You could just tell that that's an A, maybe if you look carefully. Now it's 01:18:48.520 |
very clearly an A. So you can see it does an amazing job of upscaling this. 01:18:58.120 |
Better still, this is a fully convolutional net and therefore is not specific to any particular 01:19:03.680 |
input resolution. So what I can do is I can create another version of the model using 01:19:11.280 |
our high res as the input. So now we're going to call .predict with the high res input, 01:19:21.520 |
and that's what we get back. So look at that, we can now see all of this detail on the basketball, 01:19:32.440 |
which simply, none of that really existed here. It was there, but pretty hard to see 01:19:39.160 |
what it was. And look at her hair, this kind of grey blob here. Here you can see it knows 01:19:51.440 |
it's like little bits of pulled back hair. So we can take any sized image and make it 01:19:59.080 |
bigger. This to me is one of the most amazing results I've seen in deep learning. That something we 01:20:08.280 |
trained on nearly all of ImageNet, for just a single epoch, so there's definitely 01:20:13.000 |
no overfitting, is able to recognize what hair is meant to look like when pulled 01:20:18.480 |
back into a bun is a pretty extraordinary result, I think. Something else which I only 01:20:25.960 |
realized later is that it's all a bit fuzzy, right? And there's this arm in the background 01:20:35.440 |
that's a bit fuzzy. The model knows that that is meant to stay fuzzy. It knows what out-of-focus 01:20:44.680 |
things look like. Equally cool is not just how that A is now incredibly precise and accurate, 01:20:55.320 |
but the fact that it knows that blurry things need to stay blurry. I don't know if you're 01:21:01.800 |
as amazed as this as I am, but I thought this was a pretty cool result. We could run this 01:21:08.240 |
over a 24-hour period on maybe two epochs of all of ImageNet, and presumably it would 01:21:15.160 |
get even better still. Okay, so let's take a 7-minute break and see you back here at 01:21:19.240 |
5 past 8. Okay, thanks everybody. That was fun. So we're going to do something else fun. 01:21:41.840 |
And that is to look at -- oh, before I continue, I did want to mention one thing in the homework 01:21:51.600 |
that I changed, which is I realized in my manually created loss function, I was already 01:22:04.400 |
doing a mean squared error in the loss function. But then when I told Keras to make that thing 01:22:15.400 |
as close to 0 as possible, I had to also give it a loss function, and I was giving it MSE. 01:22:21.440 |
And effectively that was kind of squaring my squared errors, which seemed wrong. So I've 01:22:25.640 |
changed it to M-A-E, mean absolute error. So when you look back over the notebooks, that's 01:22:31.840 |
why, because this is just to say, hey, get the loss as close to 0 as possible. I didn't 01:22:39.240 |
really want to re-square it. That didn't make any sense. So that's why you'll see that minor 01:22:46.280 |
change. The other thing to mention is I did notice that when I retrained my super resolution 01:22:55.040 |
on my new images that didn't have the black border, it gave good results much, much faster. 01:23:01.800 |
And so I really think that thing of learning to put the black border back in seemed to 01:23:05.720 |
take quite a lot of effort for it. So again, hopefully some of you are going to look into that. 01:23:16.200 |
So we're going to learn about generative adversarial networks. This will kind of close off our 01:23:22.600 |
deep dive into generative models as applied to images. And just to remind you, the purpose 01:23:30.880 |
of this has been to learn about generative models, not to specifically learn about super 01:23:35.880 |
resolution or artistic style. But remember, these things can be used to create all kinds 01:23:42.600 |
of images. So one of the groups is interested in taking a 2D photo and trying to turn it 01:23:48.320 |
into something that you can rotate in 3D, or at least show a different angle of that 01:23:52.720 |
2D photo. And that's a great example of something that this should totally work for. It's just 01:23:59.240 |
a mapping from one image to some different image, which is like: what would this image look like from a different angle? 01:24:07.000 |
So keep in mind the purpose of this is just like in Part 1, we learned about classification, 01:24:14.320 |
which you can use for 1000 things. Now we're learning about generative models that you can use for another 1000 things. 01:24:20.800 |
Now any generative model you build, you can make it better by adding on top of it 01:24:29.040 |
a generative adversarial network. And this is something I don't really feel like has 01:24:33.880 |
been fully appreciated. People I've seen generally treat GANs as a different way of creating 01:24:39.480 |
a generative model. But I think of this more as like, why not create a generative model 01:24:45.820 |
using the kind of techniques we've been talking about. But then think of it this way. Think 01:24:50.800 |
of all the artistic style stuff we were doing in my terrible attempt at a Simpsons cartoon 01:25:01.000 |
version of a picture. It looked nothing like a Simpsons. So what would be one way to improve that? 01:25:10.120 |
One way to improve that would be to create two networks. There would be one network that 01:25:18.200 |
takes our picture, which is actually not the Simpsons, and takes another picture that actually 01:25:25.280 |
is the Simpsons. And maybe we can train a neural network that takes those two images 01:25:32.720 |
and spits out something saying, Is that a real Simpsons image or not? And this thing 01:25:41.760 |
we'll call the discriminator. So we could easily train a discriminator right now. It's 01:25:52.560 |
just a classification network. Just use the same techniques we used in Part 1. We feed 01:25:57.720 |
it the two images, and it's going to spit out a 1 if it's a real Simpsons cartoon, and 01:26:04.040 |
a 0 if it's Jeremy's crappy generative model of Simpsons. That's easy, right? We know how to do that. 01:26:14.360 |
Now, go and build another model. There's two images as inputs. So you would feed it one 01:26:37.520 |
thing that's a Simpsons and one thing that's a generative output. It's up to you to feed 01:26:43.560 |
it one of each. Or alternatively, you could feed it one thing. In fact, probably easier 01:26:52.120 |
is to just feed it one thing and it spits out, Is it the Simpsons or isn't it the Simpsons? 01:26:57.360 |
And you could just mix them and match them. Actually, it's the latter that we're going 01:27:00.720 |
to do, so that's probably easier. We're going to have one thing which is either not a Simpsons 01:27:11.720 |
or it is a Simpsons, and we're going to have a mix of 50/50 of those two, and we're going 01:27:18.360 |
to have something come out saying, "What do you think? Is it real or not?" So this thing, 01:27:25.480 |
this discriminator, from now on we'll probably generally be calling it D. So there's a thing 01:27:30.400 |
called D. And we can think of that as a function. D is a function that takes some input, x, which 01:27:38.760 |
is an image, and spits out a 1 or a 0, or maybe a probability. 01:27:48.360 |
So what we could now do is create another neural network. And what this neural network 01:27:55.200 |
is going to do is it's going to take as input some random noise, just like all of our generators 01:28:03.440 |
have so far. And it's going to spit out an image. And the loss function is going to be 01:28:14.120 |
if you take that image and stick it through D, did you manage to fool it? So could you 01:28:24.440 |
create something where in fact it would say, "Oh yeah, totally, that's a real Simpsons." 01:28:32.160 |
So if that was our loss function, we're going to call the generator, we'll call it G. It's 01:28:37.440 |
just something exactly like our perceptual losses style transfer model. It could be exactly 01:28:42.960 |
the same model. But the loss function is now going to be take the output of that and stick 01:28:48.840 |
it through D, the discriminator, and try to trick it. So the generator is doing well if it manages to fool the discriminator. 01:28:59.320 |
So one way to do this would be to take our discriminator and train it as best as we can 01:29:05.960 |
to recognize the difference between our crappy Simpsons and real Simpsons, and then get a 01:29:11.680 |
generator and train it to trick that discriminator. But now at the end of that, it's probably 01:29:17.560 |
still not very good because you realize that actually the discriminator didn't have to 01:29:21.320 |
be very good before because my Simpsons generators were so bad. So I could now go back and retrain 01:29:27.200 |
the discriminator based on my better generated images, and then I could go back and retrain the generator again. 01:29:37.560 |
And that is the general approach of a GAN, is to keep going back between two things, 01:29:43.440 |
which is training a discriminator and training a generator using a discriminator as a loss 01:29:50.400 |
function. So we've got one thing which is discriminator on some image, and another thing 01:29:59.200 |
which is a discriminator on a generator on some noise. 01:30:17.480 |
In practice, these things are going to spit out probabilities. So that's the general idea. 01:30:29.960 |
In practice, they found it very difficult to do this: train the discriminator as 01:30:36.560 |
best as we can, stop; train the generator as best as we can, stop; and so on and so forth. 01:30:43.000 |
So instead, the original GAN paper is called Generative Adversarial Nets. And here you 01:30:56.240 |
can see they've actually specified this loss function. So here it is in notation. They 01:31:03.240 |
call it minimizing the generator whilst maximizing the discriminator. This is what the min-max refers to. 01:31:14.040 |
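For reference, the objective from the paper is:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

The discriminator D is pushed to score real images high and generated ones low, while the generator G is pushed the other way.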
What they do in practice is they do it a batch at a time. So they have a loop: go 01:31:18.480 |
through the loop, do a single batch through the discriminator, then stick a batch 01:31:22.760 |
through the generator, and so on, a batch at a time. 01:31:28.000 |
So let's look at that. So here's the original GAN from that paper, and we're going to do 01:31:35.040 |
it on MNIST. And what we're going to do is we're going to see if we can start from scratch 01:31:39.880 |
to create something which can create images which the discriminator cannot tell whether 01:31:49.120 |
they're real or fake. And it's a discriminator that has learned to be good at discriminating real from fake. 01:31:58.800 |
So we've loaded in MNIST, and the first thing they do in the paper is just use a standard 01:32:04.120 |
multilayer perceptron. So I'm just going to skip over that and let's get to the perceptron. 01:32:12.800 |
So here's our generator. It's just a standard multilayer perceptron. And here's our discriminator, 01:32:19.840 |
which is also a standard multilayer perceptron. The generator has a sigmoid activation, so 01:32:26.400 |
in other words, we're going to spit out an image where all of the pixels are between 01:32:30.680 |
0 and 1. So if you want to print it out, we'll just multiply it by 255, I guess. 01:32:36.240 |
So there's our generator, there's our discriminator. So there's then the combination of the two. 01:32:42.540 |
So take the generator and stick it into the discriminator. We can just use sequential 01:32:46.360 |
for that. And this is actually therefore the loss function that I want on my generator. 01:32:52.720 |
Generate something and then see if you can fool the discriminator. 01:32:56.520 |
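A minimal sketch of those three pieces in Keras (layer sizes are placeholders; the convention matches the lecture, with the discriminator outputting 1 for fake and 0 for real):

```python
from keras.models import Sequential
from keras.layers import Dense

G = Sequential([Dense(256, activation='relu', input_shape=(100,)),
                Dense(28 * 28, activation='sigmoid')])          # pixels in [0, 1]

D = Sequential([Dense(256, activation='relu', input_shape=(28 * 28,)),
                Dense(1, activation='sigmoid')])                # P(fake)
D.compile(optimizer='adam', loss='binary_crossentropy')

GD = Sequential([G, D])  # generator followed by discriminator
GD.compile(optimizer='adam', loss='binary_crossentropy')
```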
So there's all my architectures set up. So the next thing I need to do is set up this 01:33:02.980 |
thing called train, which is going to do this adversarial training. Let's go back and have 01:33:08.040 |
a look at train. So what train is going to do is go through a bunch of epochs. And notice 01:33:15.260 |
here I wrap it in this TQDM. This is the thing that creates a nice little progress bar. Doesn't 01:33:20.360 |
do anything else, it just creates a little progress bar. We learned about that last week. 01:33:25.140 |
So the first thing I need to do is to generate some data to feed the discriminator. So I've 01:33:31.480 |
created a little function for that. And here's my little function. So it's going to create 01:33:36.380 |
a little bit of data that's real and a little bit of data that's fake. 01:33:40.780 |
So my real data is okay, let's go into my actual training set and grab some randomly 01:33:47.080 |
selected MNIST digits. So that's my real bit. And then let's create some fake. So noise 01:33:57.460 |
is a function that I've just created up here, which creates 100 random numbers. So let's 01:34:02.900 |
create some noise and call g.predict on it. And then I'll concatenate the two together. So now I've 01:34:09.920 |
got some real data and some fake data. And so this is going to try and predict whether 01:34:16.700 |
or not something is fake. So 1 means fake, 0 means real. So I'm going to return my data 01:34:26.320 |
and my labels, which is a bunch of 0s to say they're all real and a bunch of 1s to say 01:34:30.900 |
they're all fake. So that's my discriminator's data. 01:34:34.620 |
So go ahead and create a set of data for the discriminator, and then do one batch of training. 01:34:45.500 |
Now I'm going to do the same thing for the generator. But when I train the generator, 01:34:50.220 |
I don't want to change the discriminator's weights. So make_trainable simply goes through 01:34:56.100 |
each layer and says it's not trainable. So make my discriminator non-trainable and do 01:35:01.640 |
one batch of training where I'm taking noise as my inputs. And my goal is to get the discriminator 01:35:10.740 |
to think that they are actually real. So that's why I'm passing in a bunch of 0s, because 01:35:17.260 |
remember 0 means real. And that's it. And then make discriminator trainable again. 01:35:23.120 |
So keep looking through this. Train the discriminator on a batch of half real, half fake. And then 01:35:30.860 |
train the generator to try and trick the discriminator using all fake. Repeat. 01:35:39.340 |
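That loop, as a sketch built on the G, D and GD models above (X_train is assumed flattened to (n, 784) and scaled to [0, 1]; depending on your Keras version you may need to re-compile after toggling trainable):

```python
import numpy as np

def noise(bs):
    return np.random.rand(bs, 100)

def make_trainable(net, val):
    net.trainable = val
    for l in net.layers:
        l.trainable = val

def train(X_train, n_iter=2000, bs=128):
    for i in range(n_iter):
        # Discriminator step: half real, half fake (0 = real, 1 = fake).
        real = X_train[np.random.randint(0, len(X_train), bs)]
        fake = G.predict(noise(bs))
        D.train_on_batch(np.concatenate([real, fake]),
                         np.concatenate([np.zeros((bs, 1)), np.ones((bs, 1))]))
        # Generator step: freeze D and train G (via GD) to make D say "real" (0).
        make_trainable(D, False)
        GD.train_on_batch(noise(bs), np.zeros((bs, 1)))
        make_trainable(D, True)
```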
So that's the training loop. That's a basic GAN. Because we use TQDM, we get a nice little 01:35:45.220 |
progress bar. I kept track of the loss at each step, so there's our loss for the discriminator, 01:35:54.820 |
and there's our loss for the generator. So our question is, what do these loss curves 01:35:59.180 |
mean? Are they good or bad? How do we know? And the answer is, for this kind of GAN, they 01:36:06.780 |
mean nothing at all. The generator could get fantastic, but it could be because the discriminator 01:36:12.580 |
is terrible. And you don't really know whether each one is good or not, so even the order 01:36:18.300 |
of magnitude of both of them is meaningless. So these curves mean nothing. The direction 01:36:23.340 |
of the curves mean nothing. And this is one of the real difficulties with training GANs. 01:36:29.980 |
And here's what happens when I plot 12 randomly selected random noise vectors stuck through 01:36:36.940 |
there. And we have not got things that look terribly like MNIST digits and they also don't 01:36:41.380 |
look terribly much like they have a lot of variety. This is called mode collapse. Very common 01:36:52.920 |
problem when training GANs. And what it means is that the generator and the discriminator 01:36:59.060 |
have kind of reached a stalemate where neither of them basically knows how to go from here. 01:37:06.440 |
And in terms of optimization, we've basically found a local minimum. So okay, that was not great. 01:37:15.500 |
So the next major paper that came along was this one. Let's go to the top so you can see 01:37:25.580 |
it. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 01:37:31.780 |
So this created something that they called DCGANs. And the main page that you want to 01:37:39.020 |
look at here is page 3 where they say, "Core to our approach is doing these three things." 01:37:46.900 |
And basically what they do is they just do exactly the same thing as GANs, but they do 01:37:50.860 |
three things. One is to use the kinds of -- well, in fact all of them are the tricks 01:37:56.980 |
that we've been learning for generative models. Use an all-convolutional net, get rid of max 01:38:03.060 |
pooling and use strided convolutions instead, get rid of fully connected layers and use 01:38:08.260 |
lots of convolutional features instead, and add in batch norm. And then use a CNN rather 01:38:13.580 |
than MLP. So here is that. This will look very familiar, it looks just like last lesson stuff. 01:38:24.260 |
So the generator is going to take in a random grid of inputs. It's going to do a batch norm, 01:38:34.500 |
up sample -- you'll notice that I'm doing even newer than this paper, I'm doing the 01:38:38.460 |
up sampling approach because we know that's better. Up sample, 1x1 conv, batch norm, up 01:38:43.960 |
sample, 1x1 conv, batch norm, and then a final conv layer. The discriminator basically does 01:38:51.860 |
the opposite, which is some 2x2 sub-samplings, so down sampling in the discriminator. 01:39:00.660 |
Another trick that I think is mentioned in the paper is, before you do the back and 01:39:06.820 |
forth of a batch for the discriminator and a batch for the generator, to train the discriminator 01:39:13.460 |
for a fraction of an epoch, like do a few batches through the discriminator. So at least 01:39:17.500 |
it knows how to recognize the difference between a random image and a real image a little bit. 01:39:23.820 |
So you can see here I actually just start by calling discriminator.fit with just a very 01:39:29.140 |
small amount of data. So this is kind of like bootstrapping the discriminator. And then 01:39:36.300 |
I just go ahead and call the same train as we had before with my better architectures. 01:39:44.340 |
And again, these curves are totally meaningless. But we have something which if you squint, 01:39:51.260 |
you could almost convince yourself that that's a five. 01:39:55.660 |
So until a week or two before this course started, this was kind of about as good as we had. 01:40:06.260 |
People were much better at the artisanal details of this than I was, and indeed there's a whole 01:40:11.700 |
page called GANhacks, which had lots of tips. But then, a couple of weeks before this class 01:40:22.060 |
started, as I mentioned in the first class, along came the Wasserstein GAN. And the Wasserstein 01:40:29.220 |
GAN got rid of all of these problems. And here is the Wasserstein GAN paper. And this 01:40:40.820 |
paper is quite an extraordinary paper. And it's particularly extraordinary because -- and 01:40:48.820 |
I think I mentioned this in the first class of this part -- most papers tend to either 01:40:55.060 |
be math theory that goes nowhere, or kind of nice experiments in engineering where the 01:41:01.540 |
theory bit is kind of hacked on at the end and kind of meaningless. 01:41:06.220 |
This paper is entirely driven by theory, and then the theory goes on to show this is what 01:41:14.380 |
the theory means, this is what we do, and suddenly all the problems go away. The loss 01:41:18.340 |
curves are going to actually mean something, and we're going to be able to do what I said 01:41:22.220 |
we wanted to do right at the start of this GAN section, which is to train the discriminator 01:41:29.500 |
a whole bunch of steps and then do a generator, and then discriminator a whole bunch of steps 01:41:33.820 |
and do the generator. And all that is going to suddenly start working. 01:41:38.780 |
How do we get it to work? In fact, despite the fact that this paper is both long and 01:41:47.260 |
full of equations and theorems and proofs, and there's a whole bunch of appendices at 01:41:53.300 |
the back with more theorems and proofs, there's actually only two things we need to do. One 01:41:58.180 |
is remove the log from the loss function. So rather than using cross-entropy loss, we're 01:42:05.220 |
just going to use mean squared error. That's one change. 01:42:08.180 |
The second change is we're going to constrain the weights so that they lie between -0.01 01:42:16.180 |
and +0.01. So we're going to constrain the weights to make them small. 01:42:20.820 |
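In Keras terms, the two changes amount to something like this (a sketch re-using the D and GD names from the earlier sketch; 'mse' stands in for removing the log, as the lecture describes it):

```python
import numpy as np

# Change 1: swap the cross-entropy loss for mean squared error.
D.compile(optimizer='rmsprop', loss='mse')
GD.compile(optimizer='rmsprop', loss='mse')

# Change 2: after each discriminator update, clip its weights to a small box.
def clip_weights(net, c=0.01):
    for layer in net.layers:
        layer.set_weights([np.clip(w, -c, c) for w in layer.get_weights()])
```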
Now in the process of saying that's all we're going to do is to kind of massively not give 01:42:26.980 |
credit to this paper, because what this paper is is they figured out that that's what we 01:42:30.620 |
need to do. On the forums, some of you have been reading through this paper and I've already 01:42:36.820 |
given you some tips as to a really great walkthrough, which I'll put on our wiki, that 01:42:44.420 |
explains all the math from scratch. But basically what the math says is this: the loss function 01:42:52.820 |
for a GAN is not really the loss function you put into Keras. We thought we were just 01:42:58.300 |
putting in a cross-entropy loss function, but in fact what we really care about is the difference 01:43:04.380 |
between two distributions: the distribution of the real data and the distribution the generator produces. 01:43:09.460 |
And a difference between two distributions has a very different shape from a loss function 01:43:15.300 |
on its own. So it turns out that this difference, for the two cross-entropy loss functions, 01:43:22.460 |
is something called the Jensen-Shannon distance. And this paper shows that that loss function 01:43:32.620 |
is hideous. It is not differentiable, and it does not have a nice smooth shape at all. 01:43:44.340 |
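For reference, the Jensen-Shannon divergence between the real distribution $p_r$ and the generated distribution $p_g$ is

$$\mathrm{JS}(p_r, p_g) = \tfrac{1}{2}\,\mathrm{KL}\!\Big(p_r \,\Big\|\, \tfrac{p_r + p_g}{2}\Big) + \tfrac{1}{2}\,\mathrm{KL}\!\Big(p_g \,\Big\|\, \tfrac{p_r + p_g}{2}\Big)$$

and the Wasserstein GAN paper's argument is essentially that this quantity behaves badly as a training signal when the two distributions barely overlap.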
So it kind of explains why it is that we kept getting this mode collapse and failing to 01:43:49.620 |
find nice minimums. Mathematically, this loss function does not behave the way a good loss 01:43:55.980 |
function should. And previously we've not come across anything like this because we've 01:44:02.500 |
been training a single function at a time. We really understand those loss functions, 01:44:08.980 |
mean squared error, cross-entropy. Even though we haven't always derived the math 01:44:14.160 |
in detail, plenty of people have. We know that they're kind of nice and smooth and that 01:44:18.620 |
they have pretty nice shapes and they do what we want them to do. In this case, by training 01:44:23.620 |
two things kind of adversarially to each other, we're actually doing something quite different. 01:44:29.540 |
This paper just absolutely fantastically shows, with both examples and with theory, why that's a problem. 01:44:49.620 |
So the cosine distance is the difference between two things, whereas these distances that we're 01:45:02.020 |
talking about here are the distances between two distributions, which is a much more tricky 01:45:07.180 |
problem to deal with. The cosine distance, actually if you look at the notebook during 01:45:12.940 |
the week, you'll see it's basically the same as the Euclidean distance, but you normalize 01:45:20.020 |
the data first. So it has all the same nice properties that the Euclidean distance did. 01:45:29.460 |
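Concretely, once the vectors are normalised to unit length the two are monotonically related:

$$\|u - v\|^2 = 2\,\big(1 - \cos(u, v)\big) \quad \text{for } \|u\| = \|v\| = 1$$

so nearest neighbours under cosine distance and under Euclidean distance on the normalised data come out the same.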
The authors of this paper released their code in PyTorch. Luckily, PyTorch, the first kind 01:45:39.820 |
of pre-release came out in mid-January. You won't be surprised to hear that one of the 01:45:45.300 |
authors of the paper is the main author of PyTorch. So he was writing this before he 01:45:51.580 |
even released the code. There's lots of reasons we want to learn PyTorch anyway, so here's a good excuse to do so. 01:45:58.820 |
So let's look at the Wasserstein GAN in PyTorch. Most of the code, in fact other than this 01:46:05.140 |
pretty much all the code I'm showing you in this part of the course, is very loosely based 01:46:11.700 |
on lots of bits of other code, which I had to massively rewrite because all of it was 01:46:15.500 |
wrong and hideous. This code actually I only did some minor refactoring to simplify things, 01:46:21.620 |
so this is actually very close to their code. So it was a very nice paper with very nice 01:46:30.460 |
So before we look at the Wasserstein GAN in PyTorch, let's look briefly at PyTorch. Basically 01:46:42.460 |
what you're going to see is that PyTorch looks a lot like NumPy, which is nice. We don't 01:46:49.100 |
have to create a computational graph using variables and placeholders and later on run 01:46:56.120 |
in a session. I'm sure you've seen by now Keras with TensorFlow, you try to print something 01:47:04.540 |
out with some intermediate output, it just prints out like Tensor and tells you how many 01:47:08.540 |
dimensions it has. And that's because all that thing is is a symbolic part of a computational 01:47:14.180 |
graph. PyTorch doesn't work that way. PyTorch is what's called a defined-by-run framework. 01:47:21.780 |
It's basically designed to be so fast to take your code and compile it that you don't have 01:47:31.100 |
to create that graph in advance. Every time you run a piece of code, it puts it on the 01:47:36.580 |
GPU, runs it, sends it back all in one go. So it makes things look very simple. So this 01:47:43.260 |
is a slightly cut-down version of the PyTorch tutorial that PyTorch provides on their website. 01:47:49.620 |
So you can grab that from there. So rather than creating np.array, you create torch.tensor. 01:47:57.780 |
But other than that, it's identical. So here's a random torch.tensor. APIs are all a little 01:48:10.940 |
bit different. Rather than dot shape, it's dot size. But you can see it looks very similar. 01:48:18.500 |
And so unlike in TensorFlow or Theano, we can just say x + y, and there it is. We don't 01:48:25.260 |
have to say z = x + y, f = function, x and y as inputs, z as output, and function dot 01:48:33.660 |
eval. No, you just go x + y, and there it is. So you can see why it's called defined-by-run. 01:48:40.780 |
We just provide the code and it just runs it. Generally speaking, most operations in 01:48:46.440 |
Torch, as well as having this infix version, also have a prefix version, so this is exactly 01:48:53.180 |
the same thing. You can often in fact nearly always add an out equals, and that puts the 01:48:59.900 |
result in this preallocated memory. We've already talked about why it's really important 01:49:04.340 |
to preallocate memory. It's particularly important on GPUs. So if you write your own algorithms 01:49:10.580 |
in PyTorch, you'll need to be very careful of this. Perhaps the best trick is that you 01:49:15.740 |
can stick an underscore on the end of most things, and it causes it to operate in place. This 01:49:20.140 |
is basically y += x. That's what this underscore at the end means. 01:49:26.380 |
So there's some good little tricks. You can do slicing just like numpy. You can turn Torch 01:49:33.300 |
tensors into numpy stuff and vice versa by simply going .numpy(). One thing to be very 01:49:41.820 |
aware of is that A and B are now referring to the same thing. So if I now do an in-place 01:49:52.660 |
A += 1 (add_), it also changes B. Vice versa, you can turn numpy into Torch by calling 01:50:02.860 |
torch.from_numpy. And again, same thing. If you change the numpy, it changes the Torch. All of that 01:50:11.800 |
so far has been running on the CPU. To turn anything into something that runs on the GPU, 01:50:16.900 |
you chuck dot CUDA at the end of it. So this x + y just ran on the GPU. 01:50:25.280 |
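The flavour of all of that, in a few lines (current PyTorch API; older releases spell a few of these slightly differently):

```python
import torch

x = torch.rand(5, 3)              # like np.random.rand(5, 3)
y = torch.rand(5, 3)
print(x.size())                   # .size() rather than .shape
print(x + y)                      # runs immediately; no session, no graph

res = torch.zeros(5, 3)
torch.add(x, y, out=res)          # prefix form, writing into preallocated memory
y.add_(x)                         # trailing underscore = in place, i.e. y += x

a = x.numpy()                     # shares memory with x
b = torch.from_numpy(a)           # and back again; changing one changes the other

if torch.cuda.is_available():
    print(x.cuda() + y.cuda())    # .cuda() puts a tensor on the GPU
```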
So where things get cool is that something like this knows not just how to do that piece 01:50:32.220 |
of arithmetic, but it also knows how to take the gradient of that. To make anything into 01:50:37.700 |
something which calculates gradients, you just take your Torch tensor, wrap it in variable, 01:50:44.900 |
and add this parameter to it. From now on, anything I do to x, it's going to remember 01:50:49.860 |
what I did so that it can take the gradient of it. For example, x + 2: I get threes, just like 01:50:58.140 |
a normal tensor. So a variable and a tensor have the same API except that I can keep doing 01:51:04.740 |
things to it. Square times 3, dot mean. Later on, I can go dot backward and dot grad and 01:51:14.620 |
I can get the gradient. So that's the critical difference between a tensor and a variable. 01:51:21.780 |
They have exactly the same API except variable also has dot backward and that gets you the 01:51:28.740 |
gradient. When I say .grad, the reason that this is d(out)/dx is because I typed out 01:51:35.720 |
.backward(). So out is the thing we take the derivative of, with respect to x. 01:51:44.900 |
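The same example in current PyTorch, where the old Variable wrapper is folded into Tensor and requires_grad plays its role:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # remember every operation applied to x
y = x + 2
z = y * y * 3
out = z.mean()

out.backward()                            # compute d(out)/dx
print(x.grad)                             # 4.5 in every position for this example
```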
So this is kind of crazy. You can do things like while loops and get the gradients of 01:51:50.260 |
them. It's this kind of thing pretty tricky to do with TensorFlow or Theano, these kind 01:51:56.580 |
of computation graph approaches. So it gives you a whole lot of flexibility to define things 01:52:03.260 |
in much more natural ways. So you can really write PyTorch just like you're writing regular 01:52:09.660 |
old NumPy stuff. It has plenty of libraries, so if you want to create a neural network, 01:52:19.700 |
here's how you do a CNN. I warned you early on that if you don't know about OO in Python, 01:52:26.340 |
you need to learn it. So here's why. Because in PyTorch, everything's kind of done using 01:52:32.100 |
OO. I really like this. In TensorFlow, they kind of invent their own weird way of programming 01:52:43.740 |
rather than use Python OO. Whereas PyTorch just goes, "Oh, we already have these features 01:52:48.980 |
in the language. Let's just use them." So it's way easier, in my opinion. 01:52:54.980 |
So to create a neural net, you create a new class, you derive from module, and then in 01:53:01.380 |
the constructor, you create all of the things that have weights. So conv1 is now something 01:53:11.980 |
that has some weights. It's a 2D conv. Conv2 is something with some weights. FullyConnected1 01:53:16.580 |
is something with some weights. So there's all of your layers, and then you get to say 01:53:23.820 |
exactly what happens in your forward pass. Because MaxPool2D doesn't have any weights, 01:53:30.900 |
and ReLU doesn't have any weights, there's no need to define them in the initializer. 01:53:36.900 |
You can just call them as functions. But these things have weights, so they need to be kind 01:53:42.500 |
of stateful and persistent. So in my forward pass, you literally just define what are the 01:53:49.380 |
things that happen. .view is the same as reshape. The whole API has different names for everything, 01:54:00.460 |
which is mildly annoying for the first week, but you kind of get used to it: .reshape is called .view. 01:54:04.180 |
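A small sketch in the same spirit as the PyTorch tutorial network (the sizes assume 28x28 single-channel inputs):

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Things with weights live in the constructor...
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 10)

    def forward(self, x):
        # ...while weight-free ops (pooling, ReLU) are just called as functions.
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.size(0), -1)   # .view is PyTorch's reshape
        return self.fc1(x)
```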
During the week, if you try to use PyTorch and you're like, "How do you say 01:54:09.380 |
blah in PyTorch?" and you can't find it, feel free to post on the forum. Having said that, 01:54:15.780 |
PyTorch has its own discourse-based forums. And as you can see, it is just as busy and 01:54:23.980 |
friendly as our forums. People are posting on these all the time. So I find it a really 01:54:31.060 |
great, helpful community. So feel free to ask over there or over here. 01:54:45.360 |
You can then put all of that computation onto the GPU by calling .cuda(). You can then take 01:54:54.740 |
some input, put that on the GPU with .cuda(). You can then calculate your derivatives, calculate 01:55:04.140 |
your loss, and then later on you can optimize it. This is just one step of the optimizer, 01:55:16.240 |
so we have to kind of put that in a loop. So there's the basic pieces. At the end here 01:55:21.180 |
there's a complete process, but I think more fun will be to see the process in the Wasserstein GAN. 01:55:27.740 |
So here it is. I've kind of got this TorchUtils thing which you'll find in GitHub which has 01:55:35.100 |
the basic stuff you'll want for Torch all there, so you can just import that. So we set 01:55:45.980 |
up the batch size, the size of each image, the size of our noise vector. And look how 01:55:52.700 |
cool it is. I really like this. This is how you import datasets. It has a datasets module 01:55:59.000 |
already in the TorchVision library. Here's the CIFAR-10 dataset. It will automatically 01:56:06.940 |
download it to this path for you if you say download equals true. And rather than having 01:56:11.660 |
to figure out how to do the preprocessing, you can create a list of transforms. 01:56:20.260 |
So I think this is a really lovely API. The reason that this is so new yet has such a 01:56:25.540 |
nice API is because this comes from a Lua library called Torch that's been around for 01:56:30.220 |
many years, and so these guys are basically started off by copying what they already had 01:56:35.500 |
and what already works well. So I think this is very elegant. So I've got two different 01:56:43.620 |
things you can look at here. They're both in the paper. One is CIFAR-10, which are these 01:56:47.740 |
tiny little images. Another is something we haven't seen before, which is called LSUN, 01:56:53.700 |
which is a really nice dataset. It's a huge dataset with millions of images, 3 million 01:57:03.340 |
bedroom images, for example. We can use either one. This is pretty cool. We can then create 01:57:14.200 |
a data loader, say how many workers to use. We already know what workers are from earlier. 01:57:21.700 |
Now that you know how many workers your CPU likes to use, you can just go ahead and put 01:57:26.420 |
that number in here. Use your CPU to load in this data in parallel in the background. 01:57:34.620 |
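Putting those pieces together looks roughly like this (current torchvision names; older versions call Resize "Scale", and the exact transforms here are just an example):

```python
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

ds = datasets.CIFAR10('data/', download=True, transform=tfm)   # fetched automatically
dl = torch.utils.data.DataLoader(ds, batch_size=64, shuffle=True, num_workers=4)
```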
We're going to start with CIFAR-10. We've got 47,000 of those images. We'll skip over very 01:57:52.660 |
quickly because it's really straightforward. Here's a conv block that consists of a conv2D, 01:57:59.340 |
a batchnorm2D, and a leakyrelu. In my initializer, I can go ahead and say, "Okay, we'll start 01:58:07.020 |
with a conv block. Optionally have a few extra conv blocks." This is really nice. Here's 01:58:16.260 |
a while loop that says keep adding more down sampling blocks until you've got as many as 01:58:27.980 |
you need. That's a really nice kind of use of a while loop to simplify creating our architecture. 01:58:36.660 |
And then a final conv block at the end to actually create the thing we want. 01:58:42.340 |
And then this is pretty nifty. If you pass in n GPU greater than 1, then it will call 01:58:51.780 |
parallel.data_parallel, passing in those GPU IDs, and it will do automatic multi-GPU training. 01:59:00.140 |
This is by far the easiest multi-GPU training I've ever seen. That's it. That's the forward 01:59:08.540 |
pass behind here. We'll learn more about this over the next couple of weeks. In fact, given 01:59:26.260 |
we're a little short of time, let's discuss that next week and let me know if you don't 01:59:31.820 |
think we cover it. Here's the generator. It looks very, very similar. Again, there's a 01:59:37.740 |
while loop to make sure we've gone through the right number of decom blocks. This is 01:59:45.740 |
actually interesting. This would probably be better off with an up-sampling block followed 01:59:50.020 |
by a one-by-one convolution. Maybe at home you could try this and see if you get better 01:59:54.420 |
results because this has probably got the checkerboard pattern problem. 01:59:58.300 |
This is our generator and our discriminator. It's only 75 lines of code, nice and easy. 02:00:10.020 |
Everything's a little bit different in PyTorch. If we want to say what initializer to use, 02:00:14.360 |
again it's a little bit more decoupled. Maybe at first it's a little more 02:00:21.740 |
complex but there's less things you have to learn. In this case we can call something 02:00:26.300 |
called apply, which takes some function and passes it to everything in our architecture. 02:00:34.060 |
This function is something that says, "Is this a conv2D or a convtranspose2D? If so, 02:00:40.740 |
use this initialization function." Or if it's a batch norm, use this initialization function. 02:00:46.180 |
Everything's a little bit different. There isn't a separate initializer parameter. This 02:00:52.980 |
is, in my opinion, much more flexible. I really like it. 02:01:03.300 |
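A sketch of that pattern (netG and netD are placeholder names for whatever your generator and discriminator are called; the 0.02 standard deviation is the DCGAN paper's choice):

```python
import torch.nn as nn

def weights_init(m):
    # net.apply(weights_init) calls this on every submodule.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, 0.0, 0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, 1.0, 0.02)
        nn.init.constant_(m.bias, 0.0)

netG.apply(weights_init)
netD.apply(weights_init)
```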
As before, we need something that creates some noise. Let's go ahead and create some 02:01:10.580 |
fixed noise. We're going to have an optimizer for the discriminator. We've got an optimizer 02:01:14.580 |
for the generator. Here is something that does one step of the discriminator. We're 02:01:20.060 |
going to call the forward pass, then we call the backward pass, then we return the error. 02:01:26.420 |
Just like before, we've got something called make_trainable. This is how we make something 02:01:32.060 |
trainable or not trainable in PyTorch. Just like before, we have a train loop. The train 02:01:38.540 |
loop has got a little bit more going on, partly because of the Wasserstein GAN, partly because 02:01:45.140 |
of PyTorch. But the basic idea is the same. For each epoch, for each batch, make the discriminator 02:01:58.260 |
trainable, and then this is the number of iterations to train the discriminator for. 02:02:07.340 |
Remember I told you one of the nice things about the Wasserstein GAN is that we don't have 02:02:12.100 |
to do one batch discriminator, one batch generator, one batch discriminator, one batch generator, 02:02:16.100 |
but we can actually train the discriminator properly for a bunch of batches. In the paper, 02:02:22.620 |
they suggest using 5 batches of discriminator training each time through the loop, unless 02:02:35.420 |
you're still in the first 25 iterations. They say if you're in the first 25 iterations, 02:02:42.580 |
do 100 batches. And then they also say from time to time, do 100 batches. So it's kind 02:02:49.580 |
of nice by having the flexibility here to really change things, we can do exactly what 02:02:56.580 |
So basically at first we're going to train the discriminator carefully, and also from 02:03:03.340 |
time to time, train the discriminator very carefully. Otherwise we'll just do 5 batches. 02:03:09.860 |
So this is where we go ahead and train the discriminator. And you'll see here, we clamp 02:03:16.700 |
-- this is the same as clip -- the weights in the discriminator to fall in this range. 02:03:24.540 |
And if you're interested in reading the paper, the paper explains that basically the reason 02:03:28.780 |
for this is that their assumptions are only true in this kind of small area. So that's 02:03:39.060 |
why we have to make sure that the weights stay in this small area. 02:03:43.900 |
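In PyTorch that clipping is a one-liner over the discriminator's parameters (netD is a placeholder name):

```python
# clamp_ is the in-place clamp; keep every weight inside [-0.01, 0.01].
for p in netD.parameters():
    p.data.clamp_(-0.01, 0.01)
```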
So then we go ahead and do a single step with the discriminator. Then we create some noise 02:03:50.460 |
and run it through the generator to get our fake data for the discriminator. Then 02:04:01.580 |
we can subtract the fake from the real to get our error for the discriminator. So there's 02:04:06.380 |
one step with the discriminator. We do that either 5 or 100 times. Make our discriminator 02:04:17.900 |
not trainable, and then do one step of the generator. You can see here, we call the generator 02:04:24.500 |
with some noise, and then pass it into the discriminator to see if we tricked it or not. 02:04:30.140 |
During the week, you can look at these two different versions and you're going to see 02:04:35.740 |
basically the PyTorch and the Keras version of basically the same thing. The only differences 02:04:41.720 |
are two things. One is the presence of this clamping, and the second is that the 02:04:48.940 |
loss function is mean squared error rather than cross-entropy. 02:04:55.620 |
So let's see what happens. Here are some examples from CIFAR-10. They're certainly a lot better 02:05:09.540 |
than our crappy DC GAN MNIST examples, but they're not great. Why are they not great? 02:05:20.740 |
So probably the reason they're not great is because CIFAR-10 has quite a few different 02:05:27.860 |
kinds of categories of different kinds of things. So it doesn't really know what it's 02:05:32.340 |
meant to be drawing a picture of. Sometimes I guess it kind of figures it out. This must 02:05:36.660 |
be a plane, I think. But a lot of the time it hedges and kind of draws a picture of something 02:05:43.580 |
that looks like it might be a reasonable picture, but it's not a picture of anything in particular. 02:05:48.180 |
On the other hand, the LSUN dataset has 3 million bedrooms. So we would hope that when 02:05:57.540 |
we train the Wasserstein GAN on LSUN bedrooms, we might get better results. Here's the real bedrooms. 02:06:11.700 |
Here are our fake bedrooms, and they are pretty freaking awesome. So literally they started 02:06:21.420 |
out as random noise, and every one has been turned into something like that. It's definitely a bedroom. 02:06:29.220 |
They're all definitely bedrooms. And then here is the real bedrooms to compare. 02:06:36.860 |
You can kind of see here that imagine if you took this and stuck it on the end of any kind 02:06:46.860 |
of generator. I think you could really use this to make your generator much more believable. 02:06:58.020 |
Any time you kind of look at it and you say, "Oh, that doesn't look like the real X," maybe 02:07:01.740 |
you could try using a WGAN to try to make it look more like a real X. 02:07:12.700 |
So this paper is so important. Here's the other thing. The loss function for these actually 02:07:27.440 |
makes sense. The discriminator and the generator loss functions actually decrease as they get 02:07:33.300 |
better. So you can actually tell if your thing is training properly. You can't exactly compare 02:07:40.580 |
two different architectures to each other still, but you can certainly see that the training is actually making progress. 02:07:48.100 |
So now that we have, in my opinion, a GAN that actually really works reliably for the 02:07:56.720 |
first time ever, I feel like this changes the whole equation for what generators can 02:08:05.060 |
and can't do. And this has not been applied to anything yet. So you can take any old paper 02:08:14.400 |
that produces 3D outputs or segmentations or depth outputs or colorization or whatever 02:08:22.380 |
and add this. And it would be great to see what happens, because none of that has been 02:08:28.580 |
done before. It's not been done before because we haven't had a good way to train GANs before. 02:08:34.900 |
So this is kind of, I think, something where anybody who's interested in a project, yeah, 02:08:47.220 |
this would be a great project and something that maybe you can do reasonably quickly. 02:08:53.940 |
Another thing you could do as a project is to convert this into Keras. So you can take 02:09:01.020 |
the Keras DCGAN notebook that we've already got, change the loss function, add the weight 02:09:07.300 |
clipping, try training on this LSUN bedroom dataset, and you should get the same results. 02:09:14.800 |
And then you can add this on top of any of your Keras stuff. 02:09:19.460 |
So there's so much you could do this week. I don't feel like I want to give you an assignment 02:09:27.220 |
per se, because there's a thousand assignments you could do. I think as per usual, you should 02:09:33.100 |
go back and look at the papers. The original GAN paper is a fairly easy read. There's a 02:09:41.900 |
section called Theoretical Results, which is kind of like the pointless math bit. Here's 02:09:49.900 |
some theoretical stuff. It's actually interesting to read this now because you go back and you 02:09:54.860 |
look at this stuff where they prove various nice things about their GAN. So they're talking 02:10:01.860 |
about how the generative model perfectly replicates the data generating process. It's interesting 02:10:06.460 |
to go back and look and say, okay, so they've proved these things, but it turned out to 02:10:13.740 |
be totally pointless. It still didn't work. It didn't really work. So it's kind of interesting 02:10:20.460 |
to look back and say, which is not to say this isn't a good paper, it is a good paper, 02:10:25.900 |
but it is interesting to see when is the theoretical stuff useful and when not. Then you look at 02:10:31.900 |
the Wasserstein GAN theoretical sections, and it spends a lot of time talking about 02:10:38.780 |
why their theory actually matters. So they have this really cool example where they say, 02:10:45.340 |
let's create something really simple. What if you want to learn just parallel lines, 02:10:51.140 |
and they show why it is that the old way of doing GANs can't learn parallel lines, and 02:10:58.060 |
then they show how their different objective function can learn parallel lines. So I think 02:11:04.300 |
anybody who's interested in getting into the theory a little bit, it's very interesting 02:11:11.380 |
to look at why the proof of convergence there showed something that didn't really turn out 02:11:20.180 |
to matter, whereas in this paper the theory turned out to be 02:11:25.120 |
super important and basically created something that allowed GANs to work for the first time. 02:11:30.980 |
So there's lots of stuff you can get out of these papers if you're interested. In terms 02:11:37.060 |
of the notation, we might look at some of the notation a little bit more next week. 02:11:45.020 |
But if we look, for example, at the algorithm sections, I think in general the bit I find 02:12:02.580 |
the most useful is the bit where they actually write the pseudocode. Even that, it's useful 02:12:09.060 |
to learn some kind of nomenclature. For each iteration, for each step, what does this mean? 02:12:21.000 |
Noise samples from noise prior. There's a lot of probability nomenclature which you 02:12:28.460 |
can very quickly translate. A prior simply means np.random.something. In this case, we're 02:12:40.380 |
probably looking at np.random.normal. So this just means some random number generator that you use. 02:12:49.500 |
This one here, sample from a data generating distribution, that means randomly picks some 02:12:55.180 |
stuff from your array. So these are the two steps. Generate some random numbers, and then 02:13:01.940 |
randomly select some things from your array. The bit where it talks about the gradient 02:13:10.140 |
you can kind of largely ignore, except the bit in the middle is your loss function. You 02:13:15.580 |
can see here, these things here are your noise, that's your noise. So: noise, generator on the noise, discriminator on that. 02:13:26.220 |
So there's the bit where we're trying to fool the discriminator, and we're trying to make 02:13:30.660 |
that trick it, so that's why we do the 1-minus. And then here's getting the discriminator 02:13:35.620 |
to be accurate, because these x's are the real data. So that's the math version of what we just built. 02:13:45.180 |
The Wasserstein GAN paper also has an algorithm section, so it's kind of interesting to compare 02:13:54.540 |
the two. So here we go with the Wasserstein GAN: here's the algorithm, and basically this says 02:14:01.740 |
exactly the same thing as the last one said, but I actually find this one a bit clearer. 02:14:08.220 |
Sample from the real data, sample from your priors. So hopefully that's enough to get 02:14:15.340 |
going, and I look forward to talking on the forums and seeing how everybody gets along. Thanks everybody.