Lesson 10: Deep Learning Foundations to Stable Diffusion, 2022
Chapters
0:00 Introduction
0:35 Showing students' work over the past week
6:04 Recap of Lesson 9
12:55 Explaining “Progressive Distillation for Fast Sampling of Diffusion Models” & “On Distillation of Guided Diffusion Models”
26:53 Explaining “Imagic: Text-Based Real Image Editing with Diffusion Models”
33:53 Stable Diffusion pipeline code walkthrough
41:19 Scaling random noise to ensure variance
50:21 Recommended homework for the week
53:42 What are the foundations of Stable Diffusion? Notebook deep dive
66:30 NumPy arrays and PyTorch tensors from scratch
88:28 History of tensor programming
97:00 Random numbers from scratch
102:41 Important tip on random numbers via process forking
00:00:00.000 |
Hi, everybody, and welcome back. This is lesson 10 of practical deep learning for coders. 00:00:09.920 |
It's the second lesson in part two, which is where we're going from deep learning foundations 00:00:14.880 |
to stable diffusion. So before we dive back into our notebook, I think first of all, let's 00:00:20.700 |
take a look at some of the interesting work that students in the course have done over 00:00:24.640 |
the last week. I'm just going to show a small sample of what's on the forum. So check out 00:00:29.760 |
the Share Your Work Here thread on the forum for many, many, many more examples. So Puro 00:00:38.400 |
did something interesting, which is to create a bunch of images by doing a linear 00:00:46.500 |
interpolation. Well, technically a spherical linear interpolation, but it doesn't matter. 00:00:51.160 |
Doing a linear interpolation between two different latent noise 00:00:56.280 |
starting points for an auto picture, and then showing all the intermediate results, which came 00:01:03.720 |
out pretty nice. And they did something similar, starting with an old car prompt and going 00:01:08.640 |
to a modern Ferrari prompt. I can't remember exactly what the prompts were, but you can 00:01:12.720 |
see how, as it kind of goes through that latent space, it actually is changing the image that's 00:01:20.380 |
coming out. I think that's really cool. 00:01:26.000 |
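In code, a spherical linear interpolation between two latent noise tensors might look something like this. This is a minimal sketch, not the student's actual code; the `slerp` helper and the `l1`/`l2` latents are illustrative names.

```python
import torch

def slerp(t, v1, v2):
    # Spherical linear interpolation between two noise tensors.
    # t is the mix factor in [0, 1]: 0 returns v1, 1 returns v2.
    v1n, v2n = v1 / v1.norm(), v2 / v2.norm()
    theta = (v1n * v2n).sum().clamp(-1, 1).acos()  # angle between the flattened vectors
    return (torch.sin((1 - t) * theta) * v1 + torch.sin(t * theta) * v2) / torch.sin(theta)

# e.g. ten intermediate latents between two starting points l1 and l2
l1, l2 = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
steps = [slerp(i / 9, l1, l2) for i in range(10)]
```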
And then I love the way Namrata took that and took it to another level, in a way, starting with a dinosaur and turning it into 00:01:30.240 |
a bird. And this is a very cool intermediate picture of one of the steps along the way. 00:01:37.480 |
The dino bird. I love it. Dino chick. Fantastic. So much creativity on the forums. I loved 00:01:48.480 |
this. John Richmond took his daughter's dog and turned it gradually into a unicorn. And 00:01:57.640 |
I thought this one along the way actually came out very, very nicely. I think this is 00:02:02.240 |
adorable. And I suspect that John has won the dad-of-the-year, or maybe dad-of-the-week, 00:02:08.080 |
award this week for this fantastic project. And Maureen did something very interesting, 00:02:18.920 |
which is she took Jono's parrot image from his lesson and tried bringing it across to 00:02:28.840 |
various different painter's styles. And so her question was, anyone want to guess the 00:02:33.040 |
artists in the prompts? So I'm just going to let you pause it before I move on. If you 00:02:40.000 |
want to try to guess. And there they are. Most of them pretty obvious, I guess. I think 00:02:51.320 |
it's so funny that Frida Kahlo appears in all of her paintings. So the parrot actually 00:02:56.840 |
turned into Frida Kahlo. All right, not all of her paintings, but all of her famous ones. 00:03:01.960 |
So the very idea of a Frida Kahlo painting without her in it is so unheard of that the parrot 00:03:05.960 |
turned into Frida Kahlo. And I like this Jackson Pollock. It's still got the parrot going on 00:03:11.340 |
there. So that's a really lovely one, Maureen. Thank you. And this is a good reminder to 00:03:19.760 |
make sure that you check out the other two lesson videos. So she was working with 00:03:27.280 |
Jono's stable diffusion lesson. So be sure to check that out if you haven't yet. It is 00:03:35.200 |
available on the course web page and on the forums and has lots of cool stuff that you 00:03:41.000 |
can work with, including this parrot. And then the other one to remind you about is 00:03:49.120 |
the video that Waseem and Tanishq did on the math of diffusion. And I do want to read out 00:03:56.880 |
what Alex said about this because I'm sure a number of you feel the same way. My first 00:04:02.080 |
reaction on seeing something with the title math of diffusion was to assume that, oh, 00:04:05.960 |
that's just something for all the smart people who have PhDs in maths on the course. And 00:04:09.640 |
it'll probably be completely incomprehensible. But of course, it's not that at all. So be 00:04:16.400 |
sure to check this out. Even if you don't think of yourself as a math person, I think 00:04:20.840 |
it's some nice background that you may find useful. It's certainly not necessary. But you 00:04:28.480 |
might. Yeah, I think it's kind of useful to start to dig in some of the math at this point. 00:04:39.280 |
One particularly interesting project that's been happening during the week is from Jason 00:04:43.040 |
Antic, who is a bit of a legend around here. Many of you will remember him as being the 00:04:49.480 |
guy that created DeOldify, and he actually worked closely with us on our research, which together 00:04:57.440 |
turned into NoGAN and DeCrappify and other things, and created lots of papers. And Jason has kindly 00:05:07.680 |
joined our little research team working on the stuff for these lessons and for developing 00:05:13.680 |
a kind of fast.ai approach to stable diffusion. And he took the idea that I proposed last 00:05:21.080 |
week, which is maybe we should be using classic optimizers rather than differential equation 00:05:28.000 |
solvers. And he actually made it work incredibly well already within a week. These faces were 00:05:33.600 |
generated on a single GPU in a few hours from scratch by using classic deep learning optimizers, 00:05:45.120 |
which is like an unheard of speed to get this quality of image. And we think that this research 00:05:52.160 |
direction is looking extremely promising. So really great news there. And thank you, 00:05:58.640 |
Jason, for this fantastic progress. Yeah, so maybe we'll do a quick reminder of what we 00:06:09.640 |
looked at last week. So last week, I used a bit of a mega OneNote hand-drawn thing. 00:06:17.000 |
I thought this week I might just turn it into some slides that we can use. So the basic 00:06:24.760 |
idea, if you remember, is that we started with, if we're doing handwritten digits, for example, 00:06:30.400 |
we'd start with a number seven. This would be one of the ones with a stroke through it 00:06:35.040 |
that some countries use. And then we add to it some noise. And the seven plus the noise 00:06:43.120 |
together would equal this noisy seven. And so what we then do is we present this noisy 00:06:51.440 |
seven as an input to a U-Net. And we have it try to predict which pixels are noise, basically, 00:07:02.320 |
or predict the noise. And so the U-Net tries to predict the noise from the number. It then 00:07:09.800 |
compares its prediction to the actual noise. And it's going to then get a loss, which it 00:07:18.680 |
can use to update the weights in the U-Net. And that's basically how stable diffusion, 00:07:26.240 |
the main bit, if you like, the U-Net, is created. To make it easier for the U-Net, we can also 00:07:34.220 |
pass in an embedding of the actual digit, the actual number seven. So for example, a 00:07:39.600 |
one hot encoded vector, which goes through an embedding layer. And the nice thing about 00:07:45.720 |
that to remind you is that if we do this, then we also have the benefit that then later 00:07:50.080 |
on we can actually generate specific digits by saying I want a number seven or I want 00:07:54.960 |
a number five, and it knows what they look like. I've skipped over here the VAE latents 00:08:01.080 |
piece, which we talked about last week. And to remind you, that's just a computational 00:08:06.640 |
shortcut. It makes it faster. And so we don't need to include that in this 00:08:14.600 |
picture, because it's just a computational shortcut: we can pre-process things into 00:08:19.920 |
that latent space with the VAE first, if we wish. So that's what the U-Net does. 00:08:28.120 |
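To make that concrete, here's a minimal sketch of one such training step, with toy stand-ins for the U-Net and the data; real diffusion training also scales the image and the noise by schedule-dependent amounts, which this deliberately leaves out:

```python
import torch
import torch.nn.functional as F

unet = torch.nn.Conv2d(1, 1, 3, padding=1)   # toy stand-in for the real U-Net
x0 = torch.rand(16, 1, 28, 28)               # a batch of "clean" digit images

noise = torch.randn_like(x0)                 # the noise we add
noisy = x0 + noise                           # (real schedules scale both by t-dependent amounts)
pred = unet(noisy)                           # the U-Net's guess at the noise
loss = F.mse_loss(pred, noise)               # compare the prediction to the actual noise
loss.backward()                              # gradients then update the U-Net's weights
```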
Now then, to remind you, we want to handle things that are more interesting than just 00:08:31.680 |
the number seven. We want to actually handle things where we can say, for example, a graceful 00:08:40.720 |
swan or a scene from Hitchcock. And the way we do that is we turn these sentences into 00:08:48.120 |
embeddings as well. And we turn them into embeddings by trying to create embeddings 00:08:53.120 |
of these sentences, which are as similar as possible to embeddings of the photos or images 00:08:58.280 |
that they are connected with. And to remind you, the way we did that, or the way that was done 00:09:02.760 |
originally as part of this thing called CLIP, was to basically download from the internet 00:09:10.120 |
lots of examples of lots of images, find their alt tags. And then for each one, we then have 00:09:17.800 |
their image and its alt tag. So here's the graceful swan and its alt tag. And then we 00:09:24.240 |
build two models, an image encoder that turns each image into some feature vector. And then 00:09:33.320 |
we have a text encoder that turns each piece of text into a bunch of features. And then 00:09:38.680 |
we create a loss function that says that the features for a graceful swan, the text, should 00:09:44.960 |
be as close as possible to the features for the picture of a graceful swan. And specifically, 00:09:51.000 |
we take the dot product and then we add up all the green ones because these are the ones 00:09:55.320 |
that we want to match and we subtract all the red ones because those are the ones we 00:09:58.880 |
don't want to match. Those are where the text doesn't match the image. And so that's the 00:10:04.160 |
contrastive loss, which gives us the CL in CLIP. 00:10:11.600 |
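As a rough sketch in code, that idea might look like this; in practice CLIP uses a softmax (cross-entropy) version of the add-the-greens, subtract-the-reds idea, and everything below is a toy illustration with made-up features:

```python
import torch
import torch.nn.functional as F

# toy stand-ins: row i of each matrix is a matched image/text pair
img_feats = F.normalize(torch.randn(8, 512), dim=1)
txt_feats = F.normalize(torch.randn(8, 512), dim=1)

sims = img_feats @ txt_feats.T           # all pairwise dot products
labels = torch.arange(len(sims))         # the "green" diagonal entries are the matches
# push matched (diagonal) similarities up and mismatched (off-diagonal) ones down
loss = (F.cross_entropy(sims, labels) + F.cross_entropy(sims.T, labels)) / 2
```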
So that's a review of some stuff we did last week. And so with this, we now have a text encoder, to which we can 00:10:17.760 |
say "a graceful swan", and it will spit out some embeddings. And those are the embeddings 00:10:23.160 |
that we can feed into our U-Net during training. Now, we haven't been doing any of that 00:10:34.320 |
training ourselves, except for some fine tuning, because it takes a very long time on a lot 00:10:38.120 |
of computers. But instead, we take pre-trained models and do inference. And the way we do 00:10:44.120 |
inference is we put in an example of the thing that we want, that we have an embedding for. 00:10:49.480 |
So let's say we're doing handwritten digits, and we put some random noise into the U-Net. 00:10:56.280 |
And then it spits out a prediction of which bits of noise you could remove to leave behind 00:11:02.000 |
a picture of the number three. Initially, it's going to do quite a bad job of that. 00:11:07.040 |
So we subtract just a little bit of that noise from the image to make it a little bit less 00:11:11.760 |
noisy, and we do it again, and we do it a bunch of times. So here's what that looks like, creating 00:11:23.200 |
a... I think somebody here did a smiling picture of Jeremy Howard or something, if I remember 00:11:27.320 |
correctly. And if we print out the noise at kind of step zero, and at step six, and at 00:11:36.240 |
step 12, you can see the first signs of a face starting to appear. Definitely a face 00:11:41.640 |
appearing here, 18, 24. By step 30, it's looking much more like a face. By 42, it's getting 00:11:50.800 |
there. It's just got a few little blemishes to fix up. And here we are. I think I've slightly 00:11:55.940 |
messed up my indexes here because it should finish at 60, not 54, but such is life. So 00:12:03.140 |
rather rosy red lips, too, I would have to say. So remember, in the early days, this 00:12:12.200 |
took 1,000 steps, and now there are some shortcuts to make it take 60 steps. And this is what 00:12:20.720 |
the process looks like. And the reason this doesn't look like normal noise is because 00:12:24.200 |
now we are actually doing the VAE latents thing. And so noisy latents don't look like 00:12:32.600 |
a Gaussian noise. They look like, well, they look like this. This is what happens when 00:12:36.520 |
you decode those noisy Latents. Now, you might remember last week I complained that things 00:12:44.200 |
are moving too quickly. And there were a couple of papers that had come out the day before 00:12:49.480 |
and made everything entirely out of date. So John and I and the team have actually had 00:12:58.120 |
time to read those papers. And I thought now would be a good time to start going through 00:13:08.240 |
some papers for the first time. So what we're actually going to do is show how these papers 00:13:16.680 |
have taken the required number of steps to go through this process down from 60 steps 00:13:23.160 |
to 4 steps, which is pretty amazing. So let's talk about that. And the paper specifically 00:13:33.440 |
is this one: "Progressive Distillation for Fast Sampling of Diffusion Models". So it's only 00:13:45.680 |
been a week, so I haven't had much of a chance to try to explain this before. So apologies 00:13:49.040 |
in advance if this is awkward, but hopefully it's going to make some sense. 00:13:55.160 |
We're going to start with this process, which is gradually denoising 00:14:07.680 |
images. And actually, I wonder if we can copy it. Okay, so how are we going to get this down 00:14:17.760 |
from 60 steps to 4 steps? The basic idea is that we're going 00:14:32.520 |
to do a process called distillation, which I have no idea how to spell, but hopefully 00:14:38.560 |
that's close enough that you get the idea. Distillation is a process which is pretty 00:14:43.640 |
common in deep learning. And the basic idea of distillation is that you take something 00:14:48.760 |
called a teacher network, which is some neural network that already knows how to do something, 00:14:55.720 |
but it might be slow and big. And the teacher network is then used by a student network, 00:15:03.400 |
which tries to learn how to do the same thing, but faster or with less memory. And in this 00:15:10.640 |
case, we want ours to be faster. We want to do less steps. And the way we can do this 00:15:17.000 |
conceptually, it's actually, in my opinion, reasonably straightforward. Like, when 00:15:29.840 |
I look at this, I think, wow, you know, neural nets are really amazing. So given 00:15:34.920 |
that they're really amazing, why is it taking like 18 steps to go from there to there? 00:15:47.640 |
That seems like something that you should be able to do in one step. The fact that it's 00:15:53.440 |
taking 18 steps (and originally, of course, that was hundreds and hundreds of steps) is 00:16:00.480 |
because that's just kind of a side effect of the math of how this thing 00:16:08.000 |
was originally developed, you know, this idea of this diffusion process. But the idea in 00:16:15.600 |
this paper is something that actually we've, I think I might have even mentioned in the 00:16:20.840 |
last lesson, it's something we were thinking of doing ourselves before this paper beat 00:16:24.240 |
us to it, which is to say, well, what if we train a new model where the model takes as 00:16:32.760 |
input this image, right, and puts it through some other U-Net, U-Net B. Okay. And then that 00:16:52.360 |
spits out some result. And what we do is we take that result and we compare it to this 00:17:03.200 |
image, the thing we actually want. Because the nice thing we have now, which we've never really 00:17:08.980 |
had before, is that for each intermediate output, we have the desired goal that we're 00:17:13.720 |
trying to get to. And so we could compare those two just using, you know, whatever, mean 00:17:20.080 |
squared error. (Keep on forgetting to change my pen.) Mean squared error. And so then if 00:17:29.280 |
we keep doing this for lots and lots of images and lots and lots of pairs in exactly this 00:17:33.240 |
way, this U-Net is going to hopefully learn to take these incomplete images and turn them 00:17:41.580 |
into complete images. And that is exactly what this paper does. It just says, okay, 00:17:50.960 |
now that we've got all these examples of showing what step 36 should turn into at step 54, 00:17:58.720 |
let's just feed those examples into a model. And that works. And you'd kind of expect it 00:18:05.540 |
to work because you can see that like a human would be able to look at this. And if they 00:18:09.520 |
were a competent artist, they could turn that into a, you know, a well finished product. 00:18:15.220 |
So you would expect that a computer could as well. There are some little tweaks around 00:18:21.240 |
how it makes this work, which I will briefly describe because we need to be able to go 00:18:26.140 |
from kind of step one through to step 10 through to step 20 and so forth. And so the way that 00:18:39.540 |
it does this, it's actually quite clever. What they do is they initially, so they take their 00:18:45.020 |
teacher model. So remember the teacher model is one that has already been trained. Okay. 00:18:50.100 |
So the teacher model already is a complete stable diffusion model. That's finished. We 00:18:54.580 |
take that as a given and we put in our image. Well, actually it's noise. We put in our noise, 00:19:03.420 |
right? And we put it through two time steps. Okay. And then we train our U-Net B, or whatever 00:19:13.780 |
you want to call it to try to go directly from the noise to time step number two. And 00:19:20.940 |
it's pretty easy for it to do. And so then what they do is they take this. Okay. And 00:19:26.660 |
so this thing here, remember is called the student model. They then say, okay, let's 00:19:33.100 |
now take that student model and treat that as the new teacher. So they now take their 00:19:40.940 |
noise and they run it through the student model twice, once and twice, and they get 00:19:49.900 |
out something at the end. And so then they try to create a new student, which is a copy 00:19:57.700 |
of the previous student, and it learns to go directly from the noise to two goes of 00:20:02.940 |
the student model. And you won't be surprised to hear they now take that new student model 00:20:08.060 |
and use that to go two goes. And then they learn, they use that. Then they copy that 00:20:15.940 |
to become the next student model. And so they're doing it again and again and again. And each 00:20:21.100 |
time they're basically doubling the amount of work. So it goes one to two, effectively 00:20:25.380 |
it's then going two to four and then four to eight. And that's basically what they're 00:20:31.260 |
what they're doing. And they're doing it for multiple different time steps. So the single 00:20:35.700 |
student model is learning to both do these initial steps, trying to jump multiple steps 00:20:47.220 |
at a time. And it's also learning to do these later steps, multiple steps at a time. And 00:20:56.620 |
that's it, believe it or not. So this is this neat paper that came out last week, and that's how it works. 00:21:04.140 |
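As a toy sketch of that training loop (this is not the paper's code; it ignores timestep conditioning and the noise schedule entirely, and uses a linear layer as a stand-in model, just to show the student/teacher doubling):

```python
import copy
import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(16, 16)          # pretend this is a trained diffusion model
student = copy.deepcopy(teacher)
opt = torch.optim.Adam(student.parameters())

for round_ in range(3):                    # each round halves the number of steps needed
    for _ in range(100):
        noise = torch.randn(8, 16)
        with torch.no_grad():              # the teacher takes TWO denoising steps...
            target = teacher(teacher(noise))
        pred = student(noise)              # ...and the student learns to match them in ONE
        loss = F.mse_loss(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()
    teacher = copy.deepcopy(student)       # the student becomes the next round's teacher
    opt = torch.optim.Adam(student.parameters())
```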
Now, I mentioned that there were actually two papers. The second one is called 00:21:13.720 |
On Distillation of Guided Diffusion Models. And the trick now is this second paper, these 00:21:25.440 |
came out at basically the same time, if I remember correctly, even though they build 00:21:28.940 |
on each other from the same teams, is that they say, okay, this is all very well, but 00:21:37.100 |
we don't just want to create random pictures. We want to be able to do guidance, right? 00:21:46.100 |
And you might remember, I hope you remember from last week that we used something called 00:21:49.540 |
classifier-free guided diffusion models, which, because I'm lazy, we will just refer to by an acronym, 00:21:58.340 |
CFGD. And with this one, you may recall, we take, let's say 00:22:07.540 |
we want a cute puppy. We put in the prompt cute puppy into our clip text encoder, and 00:22:16.620 |
that spits out an embedding. And we put that (let's ignore the VAE latents business) 00:22:29.020 |
into our U-Net, but we also put the empty prompt into our CLIP text encoder. We 00:22:43.100 |
concatenate these two things together so that then, out the other side, we get back two things. 00:22:49.820 |
We get back the image of the cute puppy, and we get back the image of some arbitrary thing. 00:22:58.420 |
Could be anything. And then we effectively do something very much like taking the weighted 00:23:03.020 |
average of these two things together, combine them. And then we use that for the next stage 00:23:11.340 |
of our diffusion process. Now, what this paper does is it says: this is all pretty awkward. 00:23:18.260 |
We end up having to compute two images instead of one. And for different levels 00:23:23.740 |
of guidance, we have to do it multiple different times. It's all pretty annoying. 00:23:28.620 |
How do we skip it? And based on the description of how we did it before, you may be able to 00:23:34.260 |
guess. What we do is we do exactly the same student-teacher distillation we did before, 00:23:44.020 |
but this time we pass in, in addition, the guidance. And so again, we've got the entire 00:23:56.300 |
stable diffusion model, the teacher model available for us. And we are doing actual CFGD, 00:24:07.980 |
Classifier Free Guided Diffusion, to create our guided diffusion cute puppy pictures. 00:24:13.780 |
And we're doing it for a range of different guidance scales. So you might be doing 2 00:24:17.940 |
and 7.5 and 12 and whatever, right? And those now become inputs to our student model. 00:24:29.140 |
So the student model now has additional inputs. It's getting the noise as always. It's getting 00:24:36.980 |
the caption or the prompt, I guess I should say, as always, but it's now also getting 00:24:43.500 |
the guidance scale. And so it's learning to find out how all of these things are handled 00:24:53.820 |
by the teacher model. Like what does it do after a few steps each time? So it's exactly 00:25:00.420 |
the same thing as before, but now it's learning to use the Classifier Free Guided Diffusion 00:25:06.860 |
as well. Okay. So that's got quite a lot going on there. And if it's a bit confusing, that's 00:25:17.060 |
okay. It is a bit confusing. And what I would recommend is you check out the extra information 00:25:27.700 |
from Jono who has a whole video on this. And one of the cool things actually about this 00:25:31.620 |
video is it's actually a paper walkthrough. And so part of this course is hopefully we're 00:25:36.620 |
going to start reading papers together. Reading papers is extremely intimidating and overwhelming 00:25:43.820 |
for all of us, all of the time, at least for me, it never gets any better. There's a lot 00:25:48.540 |
of math. And by watching somebody like Jono, who's an expert at this stuff, read through 00:25:54.940 |
a paper, you'll kind of get a sense of how he is skipping over lots of the math, right? 00:26:01.340 |
To focus on, in this case, the really important thing, which is the actual algorithm. And 00:26:06.180 |
when you actually look at the algorithm, you start to realize it's basically all stuff, 00:26:11.420 |
nearly all stuff, maybe all stuff that you did in primary school or secondary school. 00:26:15.940 |
So we've got division, okay, sampling from a normal distribution, so high school, subtraction, 00:26:23.420 |
division, division, multiplication, right? Oh, okay, we've got a log there. But basically, 00:26:30.020 |
you know, there's not too much going on. And then when you look at the code, you'll find 00:26:36.060 |
that once you turn this into code, of course, it becomes even more understandable if you're 00:26:40.260 |
somebody who's more familiar with code, like me. So yeah, definitely check out Jono's video 00:26:49.980 |
on this. So another paper came out about three hours ago. And I just had to show it to 00:27:02.900 |
you because I think it's amazing. And so this is definitely the first video about this paper, 00:27:12.100 |
because, yeah, it only came out a few hours ago. But check this out. This is a paper called 00:27:16.180 |
Imagic. And with this algorithm, you can pass in an input image. This is just a, you know, 00:27:23.100 |
a photo you've taken or downloaded off the internet. And then you pass in some text saying 00:27:28.340 |
a bird spreading wings. And what it's going to try to do is it's going to try to take 00:27:32.060 |
this exact bird in this exact pose and leave everything as similar as possible, but adjust 00:27:36.880 |
it just enough so that the prompt is now matched. So here we take this, this little guy here, 00:27:45.100 |
and we say, oh, this is actually what we want this to be a person giving the thumbs up. 00:27:49.460 |
And this is what it produces. And you can see everything else is very, very similar 00:27:52.460 |
to the previous picture. So this dog is not sitting. But if we put in the prompt, a sitting 00:27:59.980 |
dog, it turns it into a sitting dog, leaving everything else as similar as possible. So 00:28:08.180 |
here's an example of a waterfall. And then you say it's a children's drawing of a waterfall 00:28:11.740 |
and now it's become a children's drawing. So lots of people in the YouTube chat going, 00:28:16.340 |
oh my God, this is amazing, which it absolutely is. And that's why we're going to show you 00:28:19.580 |
how it works. And one of the really amazing things is you're going to realize that you 00:28:23.540 |
understand how it works already. Just to show you some more examples, here's the dog image. 00:28:29.940 |
Here's the sitting dog, the jumping dog, dog playing with the toy, jumping dog holding 00:28:36.880 |
a frisbee. Okay. And here's this guy again, giving the thumbs up, crossed arms in a greeting 00:28:45.020 |
pose to namaste hands, holding a cup. So that's pretty amazing. So I had to show you how this 00:28:56.380 |
works. And I'm not going to go into too much detail, but I think we can get the idea actually 00:29:06.640 |
pretty well. So what we do is again, we start with a fully pre-trained, ready to go generative 00:29:20.860 |
model like a stable diffusion model. And this is what this is talking about here, pre-trained 00:29:27.380 |
diffusion model. In the paper, they actually use a model called Imagen, but none of the 00:29:31.500 |
details as far as I can see in any way depend on what the model is. It should work just 00:29:35.900 |
fine for stable diffusion. And we take a photo of a bird spreading wings. Okay. So that's 00:29:41.420 |
our, that's our target. And we create an embedding from that using, for example, our clip encoder 00:29:52.220 |
as usual. And we then pass it through our pre-trained diffusion model. And we then see 00:30:05.220 |
what it creates. And it doesn't create something that's actually like our bird. So then what 00:30:18.020 |
they do is they fine-tune this embedding. So this is kind of like textual inversion. 00:30:23.820 |
They fine-tune the embedding to try to make the diffusion model output something that's 00:30:30.420 |
as similar as possible to the, to the input image. And so you can see here, they're saying, 00:30:37.260 |
oh, we're moving our embedding a little bit. They don't do this for very long. They just 00:30:41.820 |
want to move it a little bit in the right direction. And then now they lock that in 00:30:46.900 |
place and they say, okay, now let's fine-tune the entire diffusion model end to end, including 00:30:53.140 |
the VAE (or actually, with Imagen, they have a super-resolution model, but same idea). So 00:30:58.900 |
we fine-tune the entire model end to end. And now the embedding, this optimized embedding 00:31:04.020 |
we created, we keep in place. We don't change that at all. That's now frozen. 00:31:11.580 |
And we try to make it so that the diffusion model now spits out our bird as close as possible. 00:31:22.140 |
So you fine-tune that for a few epochs. And so you've now got something that takes this, 00:31:26.300 |
embedding that we fine-tuned, goes through a fine-tuned model and spits out our bird. 00:31:30.460 |
And then finally, the original target embedding we actually wanted is a photo of a bird spreading 00:31:37.060 |
its wings. We ended up with this slightly different embedding and we take the weighted 00:31:42.620 |
average of the two. That's called the interpolate step, the weighted average of the two. And we 00:31:47.820 |
pass that through this fine-tuned diffusion model, and we're done. 00:31:56.700 |
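Summarized as hedged pseudo-Python, the three stages might look something like this; every helper name here is hypothetical, and this is a sketch of the paper's recipe rather than any real implementation:

```python
def imagic_edit(diffusion, text_encoder, input_image, prompt, eta=0.7):
    """Sketch of the three Imagic stages; every helper here is hypothetical."""
    # Stage 1: nudge the prompt embedding toward reconstructing the input image
    e_tgt = text_encoder(prompt)
    e_opt = e_tgt.clone().requires_grad_()
    for _ in range(100):                     # only a little, staying close to e_tgt
        optimize_embedding(e_opt, diffusion, input_image)      # hypothetical helper
    # Stage 2: freeze the embedding, fine-tune the whole model end to end
    for _ in range(10):
        finetune_model(diffusion, e_opt.detach(), input_image)  # hypothetical helper
    # Stage 3: interpolate the two embeddings ("the interpolate step") and generate
    e = eta * e_tgt + (1 - eta) * e_opt      # the weighted average of the two
    return generate(diffusion, e)            # hypothetical helper
```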
And so that's pretty amazing. This would not take, I don't think, a particularly long time, or require any 00:32:03.020 |
particular special hardware. It's the kind of thing I expect people will be doing, yeah, 00:32:09.740 |
in the coming days and weeks. But it's very interesting because, yeah, I mean, the ability 00:32:14.860 |
to take any photo of a person or whatever and change it, like literally change what the person's 00:32:26.380 |
doing is, you know, societally very important and really means that anybody I guess now can generate 00:32:40.380 |
believable photos that never actually existed. I see Jono in the chat saying it took 00:32:47.020 |
about eight minutes to do it for Imagen on TPUs. Although Imagen is quite a slow, 00:32:54.620 |
big model, the TPUs they used were the very latest TPUs. So it might be, you know, 00:33:03.260 |
maybe it's an hour or something for stable diffusion on GPUs. 00:33:16.220 |
All right. So with that, let's go back to our notebook where we left it last time. We had kind 00:33:30.700 |
of looked at some applications that we can play with in this diffusion-nbs repo, in the stable 00:33:36.140 |
diffusion notebook. And what we've got now, and to remind you, when I say we, it's mainly actually 00:33:45.020 |
Pedro, Patrick and Suraj, with just a little bit of help from me. So, Hugging Face folks. What we, or rather they, 00:33:52.700 |
have done is they now dig into the pipeline to pull it all apart step by step. So you can see 00:33:59.740 |
exactly what happens. The first thing I was just going to mention is this is how you can create 00:34:05.740 |
those gradual denoising pictures. And this is thanks to something called the callback. 00:34:13.980 |
So you can say here, when you go through the pipeline, every 12 steps, call this function. 00:34:21.980 |
And as you can see, it's going to call it with I and T and the latents. 00:34:29.180 |
And so then we can just make an image and stick it on the end of an array. And that's all that's 00:34:37.260 |
happening here. So this is how you can start to interact with a pipeline without rewriting it yourself from scratch. 00:34:45.420 |
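For reference, using that callback hook looks roughly like this; a sketch using the callback API from late-2022 diffusers (which later versions replaced), with the model ID being the standard public checkpoint:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4",
                                               torch_dtype=torch.float16).to("cuda")
saved = []

def save_latents(i, t, latents):
    # called during sampling with the step index, the timestep, and the current latents
    if i % 12 == 0:
        saved.append(latents.detach().cpu().clone())

images = pipe("a photograph of an astronaut riding a horse",
              callback=save_latents, callback_steps=1).images
```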
But now what we're going to do is we're actually going to write it, we're 00:34:49.580 |
going to build it from scratch. So then you won't actually have to use a callback, because you'll 00:34:54.620 |
be able to change it yourself. So let's take a look. So looking inside the pipeline, what exactly is 00:35:03.500 |
going on? What's going to be going on in the pipeline is all of the steps that we saw 00:35:13.900 |
in last week's OneNote notes that I drew, but now as code. We're not going to 00:35:20.860 |
show the code of how each model is implemented. So for example, the CLIP text model we talked about, 00:35:27.340 |
the thing that takes as input a prompt and creates an embedding, we just take as a given. So we 00:35:34.860 |
download it: OpenAI has trained one called clip-vit-large-patch14. So we just say from_pretrained, 00:35:41.980 |
and Hugging Face Transformers will download and create that model for us. 00:35:49.100 |
Ditto for the tokenizer, ditto for the autoencoder, and ditto for the U-Net. 00:35:56.460 |
So there they all are, we can just grab them. So we just take that all as a given. 00:36:01.500 |
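The loading step looks roughly like this; a sketch using the Hugging Face APIs the lesson relies on, with the standard public model IDs assumed:

```python
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

# the CLIP text encoder and its tokenizer
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# the VAE and the U-Net from the Stable Diffusion checkpoint
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
```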
These are the three models that we've talked about: the text encoder (the CLIP encoder), 00:36:07.020 |
the VAE, and the U-Net. So there they are. So given that we now have those, 00:36:16.540 |
the next thing we need is that thing that converts time steps into the amount of noise. Remember that 00:36:22.540 |
graph we drew. And so we can basically, again, use something that Hugging Face, 00:36:31.340 |
well, actually, in this case, Katherine Crowson, has already provided, which is a scheduler. 00:36:37.580 |
It's basically something that shows us that connection. So we've got that. So we use that 00:36:44.700 |
scheduler, and we say how much noise we're using. And we have to make sure that that 00:36:51.580 |
matches, so we just use these numbers that were given. 00:36:59.260 |
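That setup, with "the numbers that were given", looks roughly like this; a sketch, where the beta values are the widely published ones the Stable Diffusion checkpoint was trained with:

```python
from diffusers import LMSDiscreteScheduler

# the noise schedule must match the one the model was trained with
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
                                 beta_schedule="scaled_linear",
                                 num_train_timesteps=1000)
```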
Okay, so now, to create our photograph of an astronaut riding a horse again, in 70 steps, with a 7.5 guidance scale, batch size of one, 00:37:08.380 |
step number one is to take our prompt and tokenize it. Okay, so we looked at that in part one of the 00:37:16.060 |
course. So check that out if you can't remember what tokenizing does, but it's 00:37:20.620 |
basically splitting it into words, or subword units if they're long and unusual words. So here 00:37:26.860 |
they are: this will be the start-of-sentence token, and this will be "a photograph 00:37:32.620 |
of an astronaut", etc. And then you can see the same token is repeated again and again at the end. 00:37:38.380 |
That's just the padding to say we're all done. And the reason for that is that GPUs and TPUs 00:37:45.340 |
really like to do lots of things at once. So we kind of have everything be the same length by 00:37:51.820 |
padding them. That may sound like a lot of wasted work, which it kind of is. But a GPU would rather 00:37:58.620 |
do lots of things at the same time on exactly the same sized input. So this is why we have all this 00:38:03.660 |
padding. So you can see here, if we decode that number, it's the end-of-text marker, just 00:38:11.340 |
padding really in this case. As well as getting the input IDs, which are just lookups into a 00:38:18.620 |
vocabulary, there's also a mask, which is just telling it which ones are actual words as opposed 00:38:25.500 |
to padding, which is not very interesting. 00:38:35.180 |
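In code, the tokenizing step is roughly this, continuing from the loading sketch above:

```python
prompt = ["a photograph of an astronaut riding a horse"]

text_input = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,   # 77 for this model
                       truncation=True, return_tensors="pt")
# text_input.input_ids: the token IDs, padded out with end-of-text tokens
# text_input.attention_mask: 1 for actual tokens, 0 for the padding
```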
So we can now take those input IDs, put them on the GPU, and run them through the CLIP encoder. And so for a batch size of one, so you've 00:38:43.180 |
got one image, that gives us back a 77 by 768, because we've got 77 here, and each one of those 00:38:54.620 |
creates a 768-long vector. So we've got a 77 by 768 tensor. So these are the embeddings 00:39:01.820 |
for a photograph of an astronaut riding a horse that come from clip. So remember, 00:39:08.540 |
everything's pre-trained. So that's all done for us. We're just doing inference. And so remember, 00:39:15.340 |
for the classifier free guidance, we also need the embeddings for the empty string. 00:39:36.540 |
So now we just concatenate those two together, because this is just a trick to get the GPU to 00:39:42.380 |
do both at the same time, because we like the GPU to do as many things at once as possible. 00:39:46.220 |
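Continuing the sketch, the encoding, the empty-string embeddings, and the concatenation look roughly like this:

```python
import torch

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to("cuda"))[0]   # (1, 77, 768)

# embeddings for the empty prompt, for classifier-free guidance
uncond_input = tokenizer([""], padding="max_length",
                         max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]

# concatenated, so the GPU handles both in a single forward pass
emb = torch.cat([uncond_embeddings, text_embeddings])
```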
And so now we create our noise. And because we're doing it with a VAE, we can call it 00:39:56.380 |
latents, but it's just noise, really. I wonder if you'd still call it that without the VAE. 00:40:03.340 |
Maybe you would; I'd have to think about that. So that's just random numbers, normally 00:40:09.180 |
generated, normally distributed random numbers, of size one, that's our batch size. 00:40:14.140 |
And the reason that we've got this divided by eight here is because that's what the VAE does. 00:40:20.220 |
It allows us to create things that are eight times smaller by height and width. 00:40:24.700 |
And then it's going to expand it up again for us later. That's why this is so much faster. 00:40:30.700 |
You'll see a lot of this after we put it on the GPU, you'll see a lot of this dot half. 00:40:34.620 |
This is converting things into what's called half precision, or FP16. 00:40:39.580 |
Details don't matter too much. It's just making it half as big in memory by using less precision. 00:40:45.740 |
Modern GPUs are much, much, much, much faster if we do that. So you'll see that a lot. 00:40:51.500 |
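Putting those two bits together, the latent creation looks roughly like this; a sketch, assuming the `unet` from the loading sketch above and a 512 by 512 output:

```python
import torch

latents = torch.randn((1, unet.config.in_channels, 512 // 8, 512 // 8))  # batch of 1
latents = latents.to("cuda").half()   # fp16, to halve memory and speed things up
```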
If you use something like fastai, you don't have to worry about it, because all this stuff is done for 00:40:56.780 |
you. And we'll see that later as we rebuild this with much, much less code later in the course. 00:41:02.060 |
So we'll be building our own kind of framework from scratch, 00:41:07.740 |
which you'll then be able to maintain and work with yourself. 00:41:12.780 |
Okay, so we have to say we want to do 70 steps. 00:41:16.060 |
Something that's very important, we won't worry too much about the details right now, 00:41:23.740 |
but what you see here is that we take our random noise and we scale it. And that's because 00:41:29.740 |
depending on what stage you're up to, you need to make sure that you have the right amount of 00:41:36.860 |
variance, basically. Otherwise you're going to get activations and gradients that go out of control. 00:41:42.940 |
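In code, those two bits are roughly this; a sketch, where the sigma scaling shown is what the notebook's k-LMS scheduler expects:

```python
scheduler.set_timesteps(70)                # 70 inference steps out of 1000 training steps
latents = latents * scheduler.sigmas[0]    # scale the initial noise to the expected variance
```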
This is something we're going to be talking about a huge amount 00:41:45.580 |
during this course, and we'll show you lots of tricks to handle that kind of thing automatically. 00:41:50.860 |
Unfortunately, at the moment, in the Stable Diffusion world, this is all done in, 00:41:56.940 |
in my opinion, ways that are too tied to the details of the model. I think we will be able 00:42:04.620 |
to improve it as the course goes on, but for now we'll stick with how everybody else is doing it. 00:42:10.220 |
This is how they do it. So we're going to be jumping through. So normally it would take a 00:42:15.740 |
thousand time steps, but because we're using a fancy scheduler, we get to skip from 999 to 00:42:21.660 |
984, 984 to 970, and so forth. So we're going down about 14 time steps. And remember, this is a very, 00:42:28.460 |
very, very unfortunate word. They're not time steps at all. In fact, they're not even integers. 00:42:33.420 |
It's just a measure of how much noise are we adding at each time, and you find out how much noise by 00:42:41.500 |
looking it up on this graph. Okay. That's all time step means. It's not a step of time, and it's a 00:42:47.740 |
real shame that that word is used because it's incredibly confusing. This is much more helpful. 00:42:53.660 |
This is the actual amount of noise at each one of those iterations. And so here you can see the 00:43:02.140 |
amount of noise for each of those time steps, and we're going to be going backwards. As you can see, 00:43:11.100 |
we start at 999, so we'll start with lots of noise, and then we'll be using less and less and less and 00:43:15.900 |
less noise. So we go through the 70 time steps in a for loop, concatenating our two noise bits 00:43:30.380 |
together, because we've got the classifier free and the prompt versions, do our scaling, 00:43:39.500 |
calculate our predictions from the unit, and notice here we're passing in the time step, 00:43:44.460 |
as well as our prompt. That's going to return two things, the unconditional prediction, 00:43:51.980 |
so that's the one for the empty string. Remember, we passed in one of the two things we passed in 00:43:57.180 |
was the empty string. So we concatenated them together, and so after they come out of the unit, 00:44:04.220 |
we can pull them apart again. So dot chunk just means pull them apart into two separate variables, 00:44:09.580 |
and then we can do the guidance scale bit that we talked about last week. 00:44:14.140 |
And so now we can do that update where we take a little bit of the noise 00:44:23.660 |
and remove it to give us our new latents. So that's the loop. 00:44:33.740 |
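Assembled, the loop looks roughly like this; a sketch following the notebook's k-LMS version, with `emb` and `latents` from the sketches above. Exact signatures (in particular what `scheduler.step` wants as its timestep argument) have varied across diffusers versions:

```python
from tqdm.auto import tqdm

g = 7.5   # guidance scale
for i, t in enumerate(tqdm(scheduler.timesteps)):
    inp = torch.cat([latents] * 2)                        # uncond + prompt versions together
    inp = inp / ((scheduler.sigmas[i] ** 2 + 1) ** 0.5)   # per-step input scaling for k-LMS

    with torch.no_grad():
        pred = unet(inp, t, encoder_hidden_states=emb).sample

    u, c = pred.chunk(2)        # split back into unconditional and conditional predictions
    pred = u + g * (c - u)      # classifier-free guidance

    latents = scheduler.step(pred, t, latents).prev_sample   # remove a little of the noise
```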
And at the end of all that, we decode it in the VAE. The paper that created this VAE tells us that we have to divide by 00:44:41.180 |
this number to scale it correctly. And once we've done that, that gives us a number which is between 00:44:48.620 |
negative one and one. Python imaging library expects something between zero and one, 00:44:56.860 |
so that's what we do here to make it between zero and one and enforce that to be true. 00:45:02.460 |
Put that back on the CPU, make sure the order of the dimensions is the same as what the Python 00:45:08.780 |
Imaging Library expects, and then finally convert it up to between zero and 255 as an int, 00:45:16.220 |
which is actually what PIL really wants. And there's our picture. So there's all the steps. 00:45:26.620 |
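The decode-and-display step, roughly; a sketch, where 0.18215 is the scaling factor the notebook uses for this VAE:

```python
from PIL import Image

with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample   # undo the VAE's latent scaling, then decode

image = (image / 2 + 0.5).clamp(0, 1)              # from [-1, 1] to [0, 1]
image = image[0].float().cpu().permute(1, 2, 0).numpy()   # channels last, on the CPU
Image.fromarray((image * 255).round().astype("uint8"))    # 0-255 ints, what PIL wants
```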
So what I then did, this is kind of like, so the way I normally build code, I use notebooks for 00:45:34.860 |
everything, is I kind of do things step by step by step, and then I tend to kind of copy them, 00:45:40.460 |
and I use Shift-M. I don't know if you've seen that, but what Shift-M does is it takes two cells 00:45:46.460 |
and combines them like that. And so I basically combined some of the cells together, and I removed 00:45:53.820 |
a bunch of the prose, so you can see the entire thing on one screen. And what I was trying to do 00:46:04.620 |
here is I'd like to get to the point where I've got something which I can very quickly do experiments 00:46:09.420 |
with. So maybe I want to try some different approach to classifier-free guidance, 00:46:13.820 |
maybe I want to add some callbacks, so on and so forth. So I kind of like to have everything, 00:46:22.540 |
you know, I like to have all of my important code be able to fit into my screen at once. 00:46:28.300 |
And so you can see now I do, I've got the whole thing on my screen, so I can keep it all in my 00:46:32.380 |
head. One thing I was playing around with was trying to understand the actual classifier-free guidance 00:46:41.340 |
equation, in terms of like, how does it work. Computer scientists tend to write things 00:46:50.220 |
(and software engineers) with kind of long words as variable names. Mathematicians tend to 00:46:55.820 |
just use short letters, normally. For me, when I want to play around with stuff like that, I turn stuff 00:47:01.580 |
back into letters. And that's because I actually kind of pulled out OneNote and I started jotting 00:47:06.540 |
down this equation and playing around with it to understand how it behaves. So this is just like, 00:47:13.900 |
it's not better or worse, it's just depending on what you're doing. So actually here I said, okay, 00:47:18.860 |
g is guidance scale. And then rather than having the unconditional and text embeddings, 00:47:24.940 |
I just call them u and t. And now I've got this all down into an equation which I can 00:47:29.420 |
write down in a notebook and play with and understand exactly how it works. 00:47:32.860 |
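With those renamings (u for the unconditional prediction, t for the text prediction, g for the guidance scale), the whole update collapses to one line:

```python
pred = u + g * (t - u)   # equivalently: u + g*t - g*u, i.e. push further in the t direction
```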
So that's something I find really helpful for working with this kind of code is to, yeah, 00:47:40.700 |
turn it into a form that I can manipulate algebraically more easily. I also try to make 00:47:46.140 |
it look as much like the paper that I'm implementing as possible. Anyways, that's that code. So then I 00:47:52.860 |
copied all this again and I basically, oh, I actually did it for two prompts this time. 00:47:59.420 |
I thought this was fun: "An oil painting of an astronaut riding a horse in the style of 00:48:03.340 |
Grant Wood". Just to remind you, Grant Wood looks like this. Not obviously astronaut material, 00:48:15.500 |
which I thought would make it actually kind of particularly interesting. Although he does have 00:48:19.500 |
horses. I can't see one here. Some of his pictures have horses. So because I did two 00:48:27.820 |
prompts, I got back two pictures I could do. So here's the Grant Wood one. I don't know what's 00:48:33.340 |
going on in his back here, but I think it's quite nice. So yeah, I then copied that whole thing again 00:48:39.980 |
and merged them all together and then just put it into a function. So I took the little bit which 00:48:48.540 |
creates an image and put that into a function. I took the bit which does the tokenizing and 00:48:55.020 |
text encoding and put that into a function. And so now all of the code necessary to do 00:49:01.340 |
the whole thing from top to bottom fits in these two cells, which makes it for me much easier to 00:49:10.300 |
see exactly what's going on. So you can see I've got the text embeddings. I've got the 00:49:16.860 |
unconditional embeddings. I've got the embeddings which concatenate the two together, the 00:49:20.220 |
optional random seed, my latents, and then the loop itself. And you'll also see something I do 00:49:32.700 |
which is a bit different to a lot of software engineering is I often create things which are 00:49:36.780 |
kind of like longer lines because I try to have each line be kind of like mathematically one thing 00:49:44.620 |
that I want to be able to think about as a whole. So yeah, these are some differences between kind of 00:49:50.300 |
the way I find numerical programming works well compared to the way I would write a more 00:49:56.060 |
traditional software engineering approach. And again, this is partly a personal preference, 00:50:01.100 |
but it's something I find works well for me. So we're now at a point where we've got, yeah, 00:50:06.300 |
two, er, three functions that easily fit on the screen and do everything. So I can now just say 00:50:11.500 |
make samples and display each image. And so this is something for you to experiment with. 00:50:21.660 |
And what I specifically suggest as homework is to try picking one of the 00:50:32.860 |
extra tricks we learned about, like image-to-image or negative prompts. Negative prompts 00:50:39.340 |
would be a nice easy one: see if you can implement negative prompts in your version of this. 00:50:48.780 |
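(If you want a nudge: one common approach, offered here only as a hint and not as the definitive answer, is to encode the negative prompt where the empty string currently goes, so the guidance pushes away from it:)

```python
# instead of tokenizing "" for the unconditional half, tokenize a negative prompt
uncond_input = tokenizer(["blurry, low quality"], padding="max_length",
                         max_length=tokenizer.model_max_length, return_tensors="pt")
```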
Or yeah, try doing image to image. That wouldn't be too hard either. Another one you can add is 00:51:00.220 |
try adding callbacks. And the nice thing is then, you know, you've got code which you fully understand 00:51:07.900 |
because you know what all the lines do. And you then don't need to wait for the diffusers folks 00:51:16.460 |
to update the library. For example, the callbacks were only added like a week ago, 00:51:22.140 |
so until then you couldn't do callbacks. Well, now you don't have to wait for the diffusers team 00:51:26.140 |
to add something. The code's all here for you to play with. So that's my recommendation as a bit 00:51:32.380 |
of homework for this week. Okay, so that brings us to the end of our rapid overview of stable 00:51:47.420 |
diffusion and some very recent papers that very significantly developed stable diffusion. I hope 00:51:53.180 |
that's given you a good sense of the kind of very high level slightly hand wavy version of all this 00:52:00.940 |
and you can actually get started playing with some fun code. What we're going to be doing next 00:52:06.540 |
is going right back to the start, learning how to multiply two matrices together effectively 00:52:14.860 |
and then gradually building from there until we've got to the point that we've rebuilt all 00:52:19.180 |
this from scratch and we understand why things work the way they do, understand how to debug 00:52:24.300 |
problems, improve performance and implement new research papers as well. So that's going to be 00:52:32.380 |
very exciting. And so we're going to have a break and I will see you back here in 10 minutes. 00:52:45.660 |
Okay, welcome back everybody. I'm really excited about the next part of this. It's going to require 00:52:53.100 |
some serious tenacity and a certain amount of patience. But I think you're going to learn a lot. 00:53:02.620 |
A lot of folks I've spoken to have said that previous iterations of this part of the course 00:53:10.060 |
is like the best course they've ever done. And this one's going to be dramatically better than 00:53:14.620 |
any previous version we've done of this. So hopefully you'll find that the hard work and patience 00:53:22.620 |
pays off. We're working now through the course22p2 repo. So, 2022 course, part two. 00:53:33.500 |
And notebooks are ordered. So we'll start with notebook number one. 00:53:40.940 |
And okay, so the goal is to get to stable diffusion from the foundations, which means we 00:53:49.020 |
have to define what are the foundations. So I've decided to define them as follows. We're allowed 00:53:56.060 |
to use Python. We're allowed to use the Python standard library. So that's all the stuff that 00:54:01.500 |
comes with Python by default. We're allowed to use matplotlib because I couldn't be bothered 00:54:07.100 |
creating my own plotting library. And we're allowed to use Jupyter notebooks and NBDev, 00:54:13.420 |
which is something that creates modules from notebooks. So basically what we're going to 00:54:19.340 |
try to do is to rebuild everything starting from this foundation. Now, to be clear, 00:54:27.980 |
what we are allowed to use are the libraries once we have reimplemented them correctly. 00:54:36.060 |
And so if we reimplement something from NumPy or PyTorch or whatever, we're then allowed to use 00:54:43.020 |
the NumPy or PyTorch or whatever version. Sometimes we'll be creating things that haven't 00:54:50.700 |
been created before. And that's then going to be becoming our own library. And we're going to be 00:54:56.860 |
calling that library mini AI. So we're going to be building our own little framework as we go. 00:55:03.740 |
So, for example, here are some imports. And these imports all come 00:55:08.540 |
from the Python standard library, except for these two. 00:55:12.300 |
Now, to be clear, one challenge we have is that the models we use in stable diffusion were trained 00:55:24.300 |
on millions of dollars worth of equipment for months, which we don't have the time or money. 00:55:31.740 |
So another trick we're going to do is we're going to create smaller identical but smaller versions 00:55:38.860 |
of them. And so once we've got them working, we'll then be allowed to use the big pre-trained 00:55:44.540 |
versions. So that's the basic idea. So we're going to have to end up with our own VAE, 00:55:50.380 |
our own U-Net, our own CLIP encoder, and so forth. 00:55:59.020 |
To some degree, I am assuming that you've completed part one of the course. To some degree, 00:56:05.500 |
I will cover everything at least briefly. But if I cover something about deep learning too fast for 00:56:13.100 |
you to know what's going on and you get lost, go back and watch part one, or go and Google for 00:56:21.740 |
that term. For stuff that we haven't covered in part one, I will go over it very thoroughly and 00:56:27.340 |
carefully. All right. So I'm going to assume that you know the basic idea, which is that we're going 00:56:37.420 |
to need to be doing some matrix multiplication. So we're going to try to take a deep dive into 00:56:43.820 |
matrix multiplication today. And we're going to need some input data. And I quite like working 00:56:49.900 |
with MNIST data. MNIST is handwritten digits. It's a classic data set. They're 28 by 28 pixel 00:57:04.380 |
grayscale images. And so we can download them from this URL. 00:57:15.980 |
So we use pathlib's Path object a lot. It's part of Python. And it basically takes a string 00:57:23.100 |
and turns it into something that you can treat as a path. For example, you can use slash to mean 00:57:28.460 |
this file inside this subdirectory. So this is how we create a path object. 00:57:34.140 |
Path objects have, for example, a make directory method. 00:57:43.340 |
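For instance, roughly (a sketch; the directory and file names here are assumed):

```python
from pathlib import Path

path_data = Path("data")
path_gz = path_data / "mnist.pkl.gz"   # slash means "this file inside this subdirectory"
path_data.mkdir(exist_ok=True)         # safe to rerun: no error if it already exists
```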
So I like to get everything set up. But I want to be able to rerun this cell lots of times 00:57:48.940 |
and not give me errors if I run it more than once. If I run it a second time, it still works. 00:57:55.100 |
And in that case, that's because I put this exist_ok=True, 00:58:00.460 |
because otherwise, it would try to make the directory, it would already exist, and it would give an 00:58:03.820 |
error. How do I know what parameters I can pass to mkdir? I just press Shift-Tab. And so when I 00:58:10.540 |
hit shift tab, it tells me what options there are. If I press it a few times, it'll actually 00:58:18.460 |
pop it down to the bottom of the screen to remind me. I can press escape to get rid of it. 00:58:25.020 |
Or else you can just hit Tab inside, and it'll list all the things you could 00:58:37.900 |
type here, as you can see. All right. So we need to grab this URL. And so Python comes 00:58:52.620 |
with something for doing that, which is the urllib library that's part of Python, 00:58:59.340 |
which has something called urlretrieve. And something which I'm always a bit surprised 00:59:05.020 |
is not more widely done is people reading the Python documentation. So you should do that a lot. 00:59:11.580 |
So if I click on that, here is the documentation 00:59:19.340 |
for URL retrieve. And so I can find exactly what it can take. And I can learn about exactly what 00:59:33.980 |
it does. And so I read the documentation from the Python docs for every single method I use. 00:59:48.460 |
And I look at every single option that it takes. And then I practice with it. 00:59:53.420 |
And to practice with it, I practice inside Jupyter. So if I want this import on its own, 01:00:07.660 |
I can hit Control-Shift-Minus and it's going to split it into two cells. And then I'll hit Alt-Enter 01:00:16.300 |
or Option-Enter so I can create a cell underneath. And I can type urlretrieve, Shift-Tab. 01:00:22.700 |
And so there it all is. If I'm like way down somewhere in the notebook and I have no idea 01:00:32.540 |
where urlretrieve comes from, I can just hit Shift-Enter. And it actually tells me exactly 01:00:38.140 |
where it comes from. And if I want to know more about it, I can just hit Question Mark, 01:00:43.340 |
Shift-Enter, and it's going to give me the documentation. And coolest of all, 01:00:49.900 |
a second question mark, and it gives me the full source code. And you can see it's not a lot. 01:00:58.460 |
You know, reading the source code of Python standard library stuff is often quite revealing. 01:01:03.260 |
And you can see exactly how they do it. And that's a great way to learn 01:01:12.940 |
more about this. So in this case, I'm just going to use a very simple bit of functionality, 01:01:24.300 |
which is I'm going to say the URL to retrieve and the file name to save it as. 01:01:28.540 |
And again, I made it so I can run this multiple times. So it's only going to do the URL retrieve 01:01:36.460 |
if the path doesn't exist. If I've already downloaded it, I don't want to download it again. 01:01:41.580 |
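Roughly (a sketch; `MNIST_URL` is a placeholder for the lesson's actual URL, which isn't shown here, and `path_gz` is from the pathlib sketch above):

```python
from urllib.request import urlretrieve

MNIST_URL = "https://example.com/mnist.pkl.gz"   # placeholder; use the URL from the lesson
if not path_gz.exists():
    urlretrieve(MNIST_URL, path_gz)              # only download if we don't already have it
```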
So I run that cell. And notice that I can put exclamation mark, followed by a line of bash. 01:01:54.220 |
If you're using Windows, this won't work. And I would very, very strongly suggest if you're using 01:02:06.780 |
Windows, use WSL. And if you use WSL, all of these notebooks will work perfectly. So yeah, do that. 01:02:14.860 |
Or run it on Paperspace or Lambda Labs or something like that, Colab, et cetera. 01:02:19.980 |
Okay, so this is a gzip file. So thankfully, Python comes with a gzip module. Python comes 01:02:31.580 |
with quite a lot, actually. And so we can open a gzip file using gzip.open. And we can pass in 01:02:37.980 |
the path. And we'll say we're going to read it as binary as opposed to text. 01:02:43.580 |
Okay, so this is called a context manager. It's a with clause. And what it's going to do is it's 01:02:51.340 |
going to open up this gzip file. The gzip object will be called f. And then it runs everything 01:02:59.100 |
inside the block. And when it's done, it will close the file. So with blocks can do all kinds 01:03:06.540 |
of different things. But in general, with blocks that involve files are going to close the file 01:03:11.660 |
automatically for you. So we can now do that. And so you can see it's opened up the gzip file. 01:03:18.860 |
And the gzip file contains what are called pickled objects. Pickled objects are basically 01:03:26.060 |
Python objects that have been saved to disk. It's the main way that people in pure Python 01:03:32.220 |
save stuff. And it's part of the standard library. So this is how we load in from that file. 01:03:39.260 |
Now, the file contains a tuple of tuples. So when you put a tuple on the left hand side of an equal 01:03:46.460 |
sign, it's quite neat. It allows us to put the first tuple into two variables called x_train 01:03:51.660 |
and y_train and the second into x_valid and y_valid. This trick here where you put stuff like this on 01:03:59.180 |
the left is called destructuring. And it's a super handy way to make your code kind of 01:04:07.020 |
clear and concise. And lots of languages support that, including Python. 01:04:15.340 |
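Putting those last few steps together, the loading cell is roughly this, assuming the standard MNIST pickle layout (which also carries a test set, ignored here with an underscore):

    import gzip, pickle
    from pathlib import Path

    path_gz = Path('data')/'mnist.pkl.gz'   # the file we downloaded earlier
    # the with-block closes the file automatically when it finishes
    with gzip.open(path_gz, 'rb') as f:
        # latin-1 encoding: this particular pickle was created under Python 2
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')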
Okay. So we've now got some data. And so we can have a look at it. Now, it's a bit tricky 01:04:22.380 |
because we're not allowed to use NumPy according to our rules. But unfortunately, this actually 01:04:26.540 |
comes as NumPy. So I've turned it into a list. All right. So I've taken the first 01:04:34.140 |
image and I've turned it into a list. And so we can look at a few examples of some values in that list. 01:04:44.380 |
And here they are. So it looks like they're numbers between zero and one. 01:04:47.420 |
And this is what I do when I learn about a new dataset. So when I started writing this notebook, 01:04:56.620 |
what you see here, other than the pros here, is what I actually did when I was working 01:05:04.540 |
with this data. I wanted to know what it was. So I just grab a little bit of it and look at it. 01:05:14.140 |
So I kind of got a sense now of what it is. Now, interestingly, 01:05:20.060 |
it's 784. This image is a 784-long list. Oh, dear. People freaking out in the comments. No NumPy. 01:05:39.980 |
Yeah. No NumPy. Do you see NumPy? No NumPy. Why 784? What is that? Well, that's because these are 01:05:47.500 |
28 by 28 images. So it's just a flat, 784-long list here. So how do I turn this 784-long thing 01:05:57.340 |
into 28 by 28? So I want a list of 28 lists of 28, basically, because we don't have matrices. 01:06:05.820 |
So how do we do that? And so we're going to be learning a lot of cool stuff in Python here. 01:06:11.100 |
Sorry, I can't stop laughing at all the stuff in our chat. 01:06:15.580 |
Oh, dear. People are quite reasonably freaking out. That's okay. We'll get there. 01:06:24.300 |
I promise. I hope. Otherwise, I'll embarrass myself. All right. So how do I convert a 784 01:06:33.580 |
long list into a 28-long list of 28-long lists? I'm going to use something called 01:06:43.740 |
chunks. And first of all, I'll show you what this thing does. And then I'll show you how it works. 01:06:47.340 |
So vals is currently a list of 10 things. Now, if I take vals and I pass it to chunks 01:06:56.780 |
with five, it creates two lists of five. Here's list number one of five elements. And here's list 01:07:03.740 |
number two of five elements. Hopefully, you can see what it's doing. It's chunkifying this list. 01:07:12.860 |
And this is the length of each chunk. Now, how did it do that? The way I did it is using a very, 01:07:20.620 |
very useful thing in Python that far too many people don't know about, which is called yield. 01:07:26.060 |
And what yield does is, you can see here, with a for loop it's going to go through from zero 01:07:33.020 |
up to the length of my list. And it's going to jump by five at a time. It's going to go, 01:07:40.060 |
in this case, zero comma five. And then, thinking of this as being like return for now, 01:07:47.420 |
it's going to return the list from zero up to five. So it returns the first bit of the list. 01:07:55.180 |
But yield doesn't just return. It kind of like returns a bit and then it continues. 01:08:02.780 |
And it returns a bit more. And so specifically, what yield does is it creates an iterator. 01:08:13.420 |
An iterator is basically something that you can call next on 01:08:21.660 |
a bunch of times. So let's try it. So we can say iterator equals. 01:08:29.260 |
Okay. Oh, got to run it. So what is iterator? Well, iterator is something that I can basically, 01:08:38.940 |
I can call next on. And next basically says yield the next thing. So this should yield 01:08:45.660 |
vals zero colon five. There it is. It did, right? There's vals zero colon five. 01:08:56.060 |
Now, if I run that again, it's going to give me a different answer 01:09:00.220 |
because it's now up to the second part of this loop. Now it returns the last five. Okay. So 01:09:10.540 |
this is what an iterator does. Now, if you pass an iterator to Python's list, 01:09:21.260 |
it runs through the entire iterator until it's finished and creates a list of the results. 01:09:27.020 |
And what does finished looks like? This is what finished looks like. If you call next 01:09:31.580 |
and get stop iteration, that means you've run out. And that makes sense, because my loop, 01:09:38.380 |
there's nothing left in it. So all of that is to say, we now have a way of taking a list and 01:09:46.540 |
chunkifying it. So what if I now take my full image, image number one, chunkify it into chunks 01:09:56.620 |
of 28 long and turn that into a list and plot it. We have successfully created an image. 01:10:10.380 |
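Here's a minimal sketch of chunks and the experiments just described, assuming img is that first training image as a plain 784-long list:

    import matplotlib.pyplot as plt

    def chunks(x, sz):
        # step through x, sz items at a time, yielding each slice
        for i in range(0, len(x), sz): yield x[i:i+sz]

    vals = list(range(10))
    list(chunks(vals, 5))   # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

    it = chunks(vals, 5)
    next(it)                # [0, 1, 2, 3, 4]
    next(it)                # [5, 6, 7, 8, 9]
    # calling next(it) once more would raise StopIteration

    img = list(x_train[0])              # first image, as a plain Python list
    plt.imshow(list(chunks(img, 28)))   # 784-long list -> 28 lists of 28 -> an image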
Now, we are done. But there are other ways to create this iterator. And because iterators 01:10:29.660 |
and generators, which are closely related, are so important, I wanted to show you more about how 01:10:37.100 |
to do them in Python. It's one of these things that if you understand this, you'll often find 01:10:44.700 |
that you can throw away huge pieces of enterprise software and basically replace it with an iterator. 01:10:52.700 |
It lets you stream things one bit at a time. It doesn't store it all in memory. 01:10:58.300 |
It's this really powerful thing that once I show it to people, they suddenly go like, oh, wow, 01:11:05.980 |
we've been using all this third party software and we could have just created a Python iterator. 01:11:12.300 |
Python comes with a whole standard library module called itertools just to make it easier to work 01:11:19.900 |
with iterators. I'll show you one example of something from itertools, which is islice. 01:11:26.540 |
So let's grab our values again, these 10 values. 01:11:32.540 |
Okay. So let's take these 10 values and we can take any list and turn it into an iterator 01:11:45.100 |
by passing it to iter, the result of which I should call it, so I don't override this Python builtin. 01:11:55.340 |
iter's not a keyword, but it's a thing I don't want to override. 01:11:57.500 |
So this is now basically something that I can call. Actually, let's do this. 01:12:03.180 |
I'll show you that I can call next on it. So if I now go next(it), 01:12:08.620 |
you can see it's giving me each item one at a time. 01:12:18.300 |
Okay. So that's what converting it into an iterator does. 01:12:21.580 |
islice converts it into a different kind of iterator. Let's maybe call this the islice iterator. 01:12:42.940 |
And so you can see here, what it did was it jumped ahead. 01:12:53.340 |
Stop here. It would have been better to 01:13:03.180 |
create the iterator first and then call next a few times. Sorry, this is what I meant to do. 01:13:11.340 |
It's now only returning the first five before it raises stop 01:13:17.100 |
iteration. So what islice does is it grabs the first N things from an iterable, 01:13:24.620 |
something that you can iterate. Why is that interesting? 01:13:33.340 |
Because I can pass it to list for example. Right. And now if I pass it to list again, 01:13:42.540 |
this iterator has now grabbed the first five things. So it's now up to thing number six. 01:13:48.620 |
So if I call it again, it's the next five things. And if I call it again, then there's nothing left. 01:13:59.180 |
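As a sketch, those islice experiments look something like this:

    from itertools import islice

    vals = list(range(10))
    it = iter(vals)          # turn any list into an iterator
    list(islice(it, 5))      # [0, 1, 2, 3, 4] -- grabs the first five
    list(islice(it, 5))      # [5, 6, 7, 8, 9] -- the iterator remembers where it's up to
    list(islice(it, 5))      # [] -- nothing left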
And maybe you can see we've actually already got chunks defined to do this, but we can also do it with islice. 01:14:06.940 |
And here's how we can do it. It's actually pretty tricky. 01:14:10.460 |
iter in Python, you can pass it something like a list to create an iterator, 01:14:19.180 |
or you can pass it, and now this is a really important word, a callable. What's a callable? 01:14:25.260 |
A callable is generally speaking, it's a function. It's something that you can put parentheses after. 01:14:31.660 |
Could even be a class. Anything you can put parentheses after, 01:14:38.780 |
you can just think of it for now as a function. So we're going to pass it a function. 01:14:42.300 |
And in the second form, it's going to be called until the function returns 01:14:50.620 |
this value here, which in this case is empty list. And we just saw that I slice will return empty 01:14:55.500 |
list when it's done. So this here is going to keep calling this function again and again and again. 01:15:07.260 |
And we've seen exactly what happens because we've called it ourselves before. 01:15:12.300 |
There it is. Until it gets an empty list. So if we do it with 28, then we're going to get 01:15:22.940 |
our image again. So we've now got two different ways of creating exactly the same thing. 01:15:34.460 |
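So the islice version of the image-building cell is roughly this, again assuming img is our flat 784-long list:

    from itertools import islice
    import matplotlib.pyplot as plt

    it = iter(img)
    # call the lambda repeatedly until it returns the sentinel value, an empty list
    img28 = list(iter(lambda: list(islice(it, 28)), []))
    plt.imshow(img28)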
And if you've never used iterators before, now's a good time to pause the video and play with them, 01:15:43.340 |
right? So for example, you could take this here, right? And if you've not seen lambdas before, 01:15:49.500 |
they're exactly the same as functions, but you can define them in line. So let's replace that 01:15:53.900 |
with a function. Okay, so now I've turned it into a function and then you can experiment with it. 01:16:07.900 |
and call f on it. Well, not on it, call f. And you can see there's the first 28. 01:16:19.900 |
And each time I do it, I'm getting another 28. Now the first two rows are all empty, but finally, 01:16:27.180 |
look, now I've got some values. Call it again. See how each time I'm getting something else. 01:16:32.380 |
Just calling it again and again. And that is the values in our iterator. So that gives you a sense 01:16:39.420 |
of like how you can use Jupyter to experiment. So what you should do is as soon as you hit 01:16:48.220 |
something in my code that doesn't look familiar to you, I recommend pausing the video and experimenting 01:16:57.100 |
with that in Jupyter. And for example, itter, most people probably have not used itter at all, 01:17:05.660 |
and certainly very few people have used this two-argument form. So hit Shift-Tab a few times, 01:17:10.540 |
and now you've got at the bottom, there's a description of what it is. Or find out more. 01:17:18.380 |
Python iter. Here we are, go to the docs. Well, that's not the right bit of the docs. 01:17:31.020 |
The C API? Wow, crazy. That's terrible. Let's try searching here. There we go. That's more like it. 01:17:52.060 |
So now you've got links. So if it's like, okay, it returns an iterator object, what's that? 01:17:56.620 |
Well, click on it. Find out. Now this is really important to know. And here's that stop exception 01:18:01.020 |
that we saw. So stop iteration exception. We saw next already. We can find out what iterable is. 01:18:07.900 |
And here's an example. And as you can see, it's using exactly the same approach that we did, 01:18:15.740 |
but here it's being used to read from a file. This is really cool. Here's how to read from a file. 01:18:22.060 |
64 bytes at a time until you get nothing processing it, right? So the docs of Python are 01:18:29.420 |
quite fantastic. As long as you use them, if you don't use them, they're not very useful at all. 01:18:37.500 |
And I see Seifer in the comments, our local Haskell programmer, 01:18:47.900 |
appreciating this Haskellness in Python. That's good. It's not quite Haskell, I'm afraid, 01:18:53.660 |
but it's the closest we're going to come. All right. How are we going for time? Pretty good. 01:19:01.980 |
Okay. So now that we've got image, which is a list of lists and each list is 28 long, 01:19:15.500 |
we can index into it. So we can say image 20. Well, let's do it. Image 20. 01:19:22.380 |
Okay. Is a list of 28 numbers. And then we could index into that. 01:19:32.860 |
Okay. So we can index into it. Now, normally, we don't like to do that for matrices. We would 01:19:43.340 |
normally rather write it like this. Okay. So that means we're going to have to create our own class 01:19:51.100 |
to make that work. So to create a class in Python, you write class, and then you write the name of it. 01:19:59.900 |
And then you write some really weird things. The weird things you write have two underscores, 01:20:08.380 |
a special word, and then two underscores. These things with two underscores on each side are 01:20:13.980 |
called dunder methods, and they're all the special magically named methods which have particular 01:20:20.700 |
meanings to Python. And you're just going to learn them, but they're all documented in the Python 01:20:27.340 |
object model. Let's search: Python object model. Yay, finally. Okay. So it's called the data model, not the object model. 01:20:42.140 |
And so this is basically where all the documentation is about absolutely everything, 01:20:48.620 |
and I can click on dunder init, and it tells you basically this is the thing that constructs 01:20:53.580 |
objects. So any time you want to create a class that you want to construct, it's going to store 01:21:02.300 |
some stuff. So in this case, it's going to store our image. You have to define dunder init. 01:21:07.900 |
Python's slightly weird in that every method, you have to put self here. For reasons we probably 01:21:17.660 |
don't really need to get into right now. And then any parameters. So we're going to be creating an 01:21:22.860 |
image, passing in the thing to store, the xs. And so here we're 01:21:29.580 |
just going to store it inside the self. So once I've got this line of code, I've now got something 01:21:35.340 |
that knows how to store stuff, the Xs inside itself. So now I want to be able to call square bracket 01:21:43.420 |
20 comma 15. So how do we do that? Well, basically part of the data model is that there's a special 01:21:53.660 |
thing called dunder get item. And when you call square brackets on your object, that's what Python 01:21:59.820 |
uses. And it's going to pass across the 20 comma 15 here as indices. So we're now basically just 01:22:11.340 |
going to return this. So the self.Xs with the first index and the second index. So let's create that 01:22:20.620 |
matrix class and run that. And you can now see m[20, 15] is the same. 01:22:30.060 |
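Here's what that Matrix class looks like, more or less, using the 28 by 28 list of lists we built a moment ago:

    class Matrix:
        def __init__(self, xs): self.xs = xs   # store the data when the object is constructed
        def __getitem__(self, idxs):
            # m[i, j] arrives here as the tuple idxs == (i, j)
            return self.xs[idxs[0]][idxs[1]]

    m = Matrix(img28)
    m[20, 15]   # same value as img28[20][15]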
Oh, a quick note on ways in which my code is different to everybody else's, which it is. It's somewhat unusual to put 01:22:37.100 |
definitions of methods on the same line as the signature like this. I do it quite a lot for 01:22:47.580 |
one-liners. As I kind of mentioned before, I find it really helps me to be able to see all the code 01:22:53.580 |
I'm working with on the screen at once. A lot of the world's best programmers actually have 01:23:00.060 |
had that approach as well. It seems to work quite well for some people that are extremely productive. 01:23:05.180 |
It's not common in Python. Some people are quite against it. So if you're at work, and your 01:23:13.260 |
colleagues don't write Python this way, you probably shouldn't either. But if you can get away with it, 01:23:18.940 |
I think it works quite well. Anyway, okay, so now that we've created something that lets us index 01:23:23.420 |
into things like this, we're allowed to use PyTorch because we're allowed to use this one feature in 01:23:29.660 |
PyTorch. Okay, so we can now do that. And so now to create a tensor, which is basically a lot like 01:23:42.700 |
our matrix, we can now pass a list into tensor to get back a tensor version of that list, or perhaps 01:23:53.180 |
more interestingly, we could pass in a list of lists. Maybe let's give this a name. 01:24:04.540 |
Whoopsie dozy. That needs to be a list of lists, just like we had before for our image. 01:24:18.380 |
In fact, let's do it for our image. Let's just pass in our image. There we go. And so now we 01:24:27.980 |
should be able to say tens[20, 15]. And there we go. Okay, so we've successfully reinvented that. 01:24:43.020 |
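A sketch of those tensor cells:

    from torch import tensor

    tensor([1, 2, 3])      # a rank-1 tensor from a plain list
    tens = tensor(img28)   # or from a list of lists, like our image
    tens[20, 15]           # indexes just like our Matrix class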
All right. So now we can convert all of our lists into tensors. There's a convenient 01:24:55.980 |
way to do this, which is to use the map function in the Python standard library. 01:25:02.220 |
So it takes a function and then some iterables. In this case, one iterable. And it's going to 01:25:14.060 |
apply this function to each of these four things and return those four things. And so then I can 01:25:19.900 |
put four things on the left to receive those four things. So this is going to call tensor x_train 01:25:26.140 |
and put it in x_train. Tensor y_train, put it in y_train, and so forth. So this is converting 01:25:30.780 |
all of these lists to tensors and storing them back under the same names. 01:25:37.580 |
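That cell is roughly:

    # apply tensor to each of the four lists, then destructure back into the same names
    x_train, y_train, x_valid, y_valid = map(tensor, (x_train, y_train, x_valid, y_valid))
    x_train.shape    # torch.Size([50000, 784])
    x_train.type()   # 'torch.FloatTensor'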
So you can see that x_train now is a tensor. That means it has a shape property. It has 50,000 images in it, 01:25:46.860 |
which are each 784 long. And you can find out what kind of stuff it contains by calling its .type. 01:25:57.180 |
So it contains floats. So this is the tensor class. We'll be using a lot of it. 01:26:02.620 |
So of course, you should read its documentation. 01:26:05.420 |
I don't love the PyTorch documentation. Some of it's good. Some of it's not good. 01:26:12.860 |
It's a bit all over the place. So here's tensor. But it's well worth scrolling through to get a 01:26:18.060 |
sense of like, this is actually not bad, right? It tells you how you can construct it. This is how I 01:26:21.820 |
constructed one before, passing it lists of lists. You can also pass it NumPy arrays. 01:26:27.660 |
You can change types. So on and so forth. So, you know, it's well worth reading through. And like, 01:26:39.980 |
you're not going to look at every single method it takes, but you're kind of, if you browse through 01:26:44.060 |
it, you'll get a general sense, right? That tensors do just about everything you couldn't think of 01:26:51.340 |
for a numeric programming. At some point, you will want to know every single one of these, 01:26:57.820 |
or at least be aware roughly what exists. So you know what to search for in the docs. 01:27:03.980 |
Otherwise you will end up recreating stuff from scratch, which is much, much slower 01:27:09.180 |
than simply reading the documentation to find out it's there. 01:27:14.300 |
All right. So instead of calling chunks or islice, the thing that is roughly 01:27:21.340 |
equivalent in a tensor is the reshape method. So reshape, so to reshape our 50,000 by 784 thing, 01:27:30.380 |
we can simply, we want to turn it into 50,000 28 by 28 tensors. So I could write here, 01:27:40.620 |
reshape to 50,000 by 28 by 28. But I kind of don't need to, because I could just put minus one here 01:27:51.260 |
and it can figure out that that must be 50,000, because it knows that I have 50,000 01:27:58.620 |
by 784 items. So minus one means: just fill this in with all the rest. 01:28:15.900 |
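As a sketch:

    x_imgs = x_train.reshape((-1, 28, 28))   # -1 means: infer this dimension (50,000) from the total size
    x_imgs.shape                             # torch.Size([50000, 28, 28])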
So there's some very interesting history here. 01:28:21.020 |
And I'll try not to get too far into it because I'm a bit over enthusiastic about this stuff, 01:28:27.740 |
I must admit. I'm very, very interested in the history of tensor programming and array programming. 01:28:33.980 |
And it basically goes back to a language called APL. APL is basically originally a mathematical 01:28:43.340 |
notation that was developed in the mid to late 50s, 1950s. And at first it was used as a notation 01:28:53.260 |
for defining how certain new IBM systems would work. So it was all written out in this notation. 01:29:02.140 |
It's kind of like a replacement for mathematical notation that was designed to be more consistent 01:29:08.860 |
and kind of more expressive. The guy who created it was named 01:29:18.060 |
Ken Iverson. In the early 60s, some implementations that actually allow this notation 01:29:24.620 |
to be executed on a computer appeared. Both the notation and the executable implementations 01:29:31.020 |
slightly confusingly are both called APL. APL's been in constant development ever since that time. 01:29:37.740 |
And today is one of the world's most powerful programming languages. And you can try it by 01:29:42.700 |
going to tryapl.org. And why am I mentioning it here? Because one of the things Ken Iverson did, well, 01:29:51.420 |
he studied an area of physics called tensor analysis. And as he developed APL, he basically 01:30:00.860 |
said like, oh, what if we took these ideas from tensor analysis and put them into a programming 01:30:05.180 |
language? So in APL, you can, and have been able to for some time, basically define a 01:30:16.220 |
variable. And rather than saying equals, which is a terrible way to define things really 01:30:24.140 |
mathematically, because that has a very different meaning most of the time in math. Instead, 01:30:28.860 |
we use arrow to define things. We can say, okay, that's going to be a tensor 01:30:35.260 |
like so. And then we can look at the contents of A and we can do things like, oh, what if we do A 01:30:44.460 |
times three or A minus two and so forth. And as you can see, what it's doing is it's taking 01:30:58.860 |
all the contents of this tensor and it's multiplying them all by three or subtracting 01:31:04.300 |
two from all of them. Or perhaps more fun, we could put into B a different tensor. 01:31:11.580 |
And we can now do things like A divided by B. And you can see it's taking each 01:31:22.860 |
of A and dividing by each of B. Now, this is very interesting because now we don't have to write 01:31:34.780 |
loops anymore. We can just express things directly. We can multiply things by scalars, 01:31:41.260 |
even if they're vectors. This is called a rank one tensor; that is to say, it's basically what 01:31:47.500 |
in math we'd call a vector. We can take two vectors and divide one by the other and so forth. 01:31:53.900 |
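PyTorch inherits exactly this style, so here's a quick sketch of the same idea:

    from torch import tensor

    a = tensor([1., 2., 3.])
    b = tensor([6., 6., 12.])
    a * 3   # tensor([3., 6., 9.])  -- the scalar is applied to every element
    a / b   # tensor([0.1667, 0.3333, 0.2500])  -- elementwise, with no explicit loop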
It's a really powerful idea. Funnily enough, APL didn't call them tensors, even though 01:32:01.020 |
Ken Iverson said he got this idea from tensor analysis, APL calls them arrays. 01:32:09.340 |
NumPy, which was heavily influenced by APL, also calls them arrays. For some reason, PyTorch, 01:32:18.140 |
which is very heavily influenced by APL, sorry, by NumPy, doesn't call them arrays, 01:32:23.820 |
it calls them tensors. They're all the same thing. They are rectangular blocks of numbers. 01:32:37.020 |
They can be one-dimensional, like a vector. They can be two-dimensional, like a matrix. 01:32:41.900 |
They can be three-dimensional, which is like a bunch of stacked matrices, 01:32:46.380 |
like a batch of matrices and so forth. If you are interested in APL, which I hope you are, 01:33:01.420 |
we have a whole APL and array programming section on our forums, and also we've prepared 01:33:08.220 |
a whole set of notes on every single glyph in APL, which also covers all kinds of interesting 01:33:23.420 |
mathematical concepts, like complex direction and magnitude, and all kinds of fun stuff like that. 01:33:36.780 |
That's all totally optional, but a lot of people who do APL say that they feel like they've become 01:33:42.540 |
a much better programmer in the process, and also you'll find here at the forums a set of 17 study 01:33:51.180 |
sessions of an hour or two each, covering the entirety of the language, every single glyph. 01:33:57.020 |
That's all where this stuff comes from. This batch of 50,000 images, 50,000 28 by 28 images, 01:34:10.540 |
is what we call a rank 3 tensor in PyTorch. In NumPy, we would call it an array with three 01:34:20.220 |
dimensions. Those are the same thing. What is the rank? The rank is just the number of dimensions. 01:34:30.220 |
It's 50,000 images of 28 high by 28 wide, so there are three dimensions that is the rank 01:34:38.220 |
of the tensor. If we then pick out a particular image, then we look at its shape, 01:34:46.380 |
we could call this a matrix. It's a 28 by 28 tensor, or we could call it a rank 2 tensor. 01:34:56.860 |
A vector is a rank 1 tensor. In APL, a scalar is a rank 0 tensor, and that's the way it should be. 01:35:06.140 |
A lot of languages and libraries don't, unfortunately, think of it that way. 01:35:11.420 |
So what is a scalar? It's a bit dependent on the language. Okay, so we can index into 01:35:17.180 |
the zeroth image, 20th row, 15th column to get back this same number. 01:35:33.660 |
Okay, so we can take x_train.shape, which is 50,000 by 784, 01:35:44.380 |
and you can destructure it into n, which is the number of images, and c, which is the 01:35:54.620 |
number of columns, for example. And we can also, well this is actually part of the standard 01:36:03.980 |
library, so we're allowed to use min, so we can find out in y_train what's the smallest number, 01:36:09.180 |
and what's the maximum number, so they go from 0 to 9. So you see here, it's not just the number 0, 01:36:18.780 |
it's a scalar tensor, 0. They act almost the same, most of the time. So here's some example of a bit 01:36:27.580 |
of the y_train, so you can see these are basically, this is going to be the labels, 01:36:33.500 |
right, these are our digits, and this is its shape, so there's just 50,000 of these labels. 01:36:48.380 |
Okay, and so since we're allowed to use this in the standard library, well it also exists in PyTorch, 01:36:53.500 |
so that means we're also allowed to use the .min and .max methods. 01:36:57.740 |
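In code, roughly:

    n, c = x_train.shape            # n = 50,000 images; c = 784 columns
    y_train.min(), y_train.max()    # (tensor(0), tensor(9)) -- scalar, i.e. rank-0, tensors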
All right, so before we wrap up, we're going to do one more thing, which I guess 01:37:06.140 |
we would call kind of anti-cheating, but according to our rules, we're allowed to use random numbers 01:37:14.460 |
because there is a random number generator in the Python standard library, but we're going to do 01:37:20.860 |
random numbers from scratch ourselves, and the reason we're going to do that is even though 01:37:26.700 |
according to the rules we could be allowed to use the standard library one, it's actually extremely 01:37:31.420 |
instructive to build our own random number generator from scratch, well at least I think so. 01:37:38.060 |
Let's see what you think. So there is no way normally in software to create a random number, 01:37:52.540 |
unfortunately. Computers, you know, add, subtract, times, logic gates, stuff like that. 01:38:04.940 |
So how does one create random numbers? Well you could go to the Australian National University 01:38:09.580 |
Quantum Random Number Generator, and this looks at the quantum fluctuations of the vacuum 01:38:17.500 |
and provides an API which will actually hook you in and return quantum random fluctuations of 01:38:29.500 |
the vacuum. So that's about, that's the most random thing I'm aware of, so that would be one way to get 01:38:35.180 |
random numbers. And there's actually an API for that, so there's a bit of fun. You could do what 01:38:44.700 |
Cloudflare does. Cloudflare has a huge wall full of lava lamps, and it uses the pixels of a camera 01:39:04.140 |
looking at those lava lamps to generate random numbers. Intel nowadays actually has something 01:39:13.020 |
in its chips, which you can call rdrand, which will return random numbers on certain 01:39:25.420 |
Intel chips from 2012. All of these things are kind of slow, they can kind of get you one random 01:39:34.700 |
number from time to time. We want some way of getting lots and lots of random numbers, 01:39:40.460 |
and so what we do is we use something called a pseudorandom number generator. 01:39:46.140 |
A pseudorandom number generator is a mathematical function that you can call lots of times, 01:39:54.220 |
and each time you call it, it will give you a number that looks random. 01:40:02.540 |
To show you what I mean by that, I'm going to run some code. 01:40:07.340 |
I've created a function which we'll look at in a moment called rand, and if I call rand 50 times 01:40:17.900 |
and plot it, there's no obvious relationship between one call and the next. That's one thing 01:40:25.420 |
that I would expect to see from my random numbers. I would expect that each time I call rand, 01:40:31.500 |
the numbers would look quite different to each other. The second thing is, rand is meant to 01:40:36.940 |
be returning uniformly distributed random numbers, and therefore if I call it lots and lots and lots 01:40:43.180 |
of times and plot its histogram, I would expect to see exactly this, which is each from 0 to 0.1, 01:40:51.980 |
there's a few, from 0.1 to 0.2, there's a few, from 0.2 to 0.3, there's a few. It's a fairly evenly 01:40:58.140 |
spread thing. These are the two key things I would expect to see, an even distribution of random 01:41:03.660 |
numbers and that there's no correlation or no obvious correlation from one to the other. 01:41:08.620 |
We're going to try and create a function that has these properties. We're not going to derive it 01:41:15.580 |
from scratch. I'm just going to tell you that we have a function here implementing the Wichmann-Hill 01:41:19.660 |
algorithm. This is actually what Python used to use back before Python 2.3, and the key reason 01:41:26.140 |
we need to know about this is to understand really well the idea of random state. Random state is a 01:41:33.580 |
global variable. It's something which is, or at least it can be, most of the time when we use it, 01:41:39.740 |
we use it as a global variable, and it's just basically one or more numbers. So we're going 01:41:44.780 |
to start with no random state at all, and we're going to create a function called seed that we're 01:41:50.220 |
going to pass something to, and I just mashed the keyboard to create this number. Okay, so this is 01:41:55.500 |
my random number. You could get this from the ANU quantum vacuum generator or from Cloudflare's lava 01:42:02.700 |
lamps or from your Intel chip's RDRAND, or you know in Python land we'd pretty much always use 01:42:08.380 |
the number 42. Any of those are fine. So you pass in some number or you can pass in the current tick 01:42:14.300 |
count in nanoseconds. There's various ways of getting some random starting point, and if we 01:42:19.900 |
pass it into seed it's going to do a bunch of modular divisions and create a tuple of three things, 01:42:31.580 |
and it's going to store them in this global state. So rand state now contains three numbers. 01:42:39.740 |
Okay, so why did we do that? The reason we did that is because now this function, 01:42:47.660 |
which takes our random state, unpacks it into three things, and does again a bunch of 01:42:54.780 |
multiplications and modulos, and then sticks them together with various kinds of weights, modulo one, 01:43:02.220 |
so this is how you can pull out the decimal part. This returns random numbers, but the key thing I 01:43:09.980 |
want you to understand is that we pull out the random state at the start. We do some math thingies 01:43:17.020 |
to it, and then we store new random state, and so that means that each time I call this I'm going to 01:43:26.460 |
get a different number. Okay, so this is a random number generator, and this is really important 01:43:34.860 |
because lots of people in the deep learning world screw this up, including me sometimes, 01:43:41.100 |
which is to remember that random number generators rely on this state. 01:43:49.180 |
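Here's a sketch of those two functions, using the Wichmann-Hill constants: seed packs a starting number into three small integers of global state, and rand updates that state and combines it into a float in [0, 1):

    rnd_state = None

    def seed(a):
        global rnd_state
        # a bunch of modular divisions to split the seed into three small integers
        a, x = divmod(a, 30268)
        a, y = divmod(a, 30306)
        a, z = divmod(a, 30322)
        rnd_state = int(x) + 1, int(y) + 1, int(z) + 1

    def rand():
        global rnd_state
        x, y, z = rnd_state               # pull out the random state at the start
        x = (171 * x) % 30269
        y = (172 * y) % 30307
        z = (170 * z) % 30323
        rnd_state = x, y, z               # store the new random state for next time
        return (x/30269 + y/30307 + z/30323) % 1.0   # modulo one keeps just the decimal part

    seed(457428938475)   # mash the keyboard, or use any starting number
    rand(), rand()       # two different-looking numbers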
So let me show you where that will get you if you're not careful. 01:43:54.060 |
If we use this special thing called fork, that creates a whole separate copy of this Python 01:44:02.060 |
process. In one copy, os.fork returns true (strictly, the child's process ID). In the other copy, it returns false (zero), 01:44:11.180 |
roughly speaking. So this copy here is this, and if I say this version here, the true version, 01:44:19.500 |
is the original non-copied one, it's called the parent. So this first branch will 01:44:25.180 |
only be called by the parent, and my else here will only be called by the copy, which is called the child. 01:44:29.980 |
And in each one I'm calling rand. These are two different random numbers, right? 01:44:34.540 |
Wrong. They're the same number. Now, why is that? That's because this process here and this process 01:44:46.380 |
here are copies of each other, and therefore they each contain the same numbers in random state. 01:44:55.900 |
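The fork demo is roughly this (os.fork is Unix-only, so it works under Linux, macOS, or WSL):

    import os

    if os.fork():
        # parent: fork returned the child's process ID, which is truthy
        print(f'In parent: {rand()}')
    else:
        # child: an exact copy of the process, including rnd_state
        print(f'In child: {rand()}')
        os._exit(os.EX_OK)   # exit the child so it doesn't keep running the notebook

    # both lines print the same number, because the state was copied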
So this is something that comes up in deep learning all the time, because in deep learning, 01:45:03.340 |
we often do parallel processing, for example, to generate lots of augmented images at the same time 01:45:12.620 |
using multiple processes. Fast AI used to have a bug, in fact, where we failed to correctly 01:45:19.900 |
initialize the random number generator separately in each process. And in fact, to this day, 01:45:26.940 |
at least as of October 2022, torch.rand itself, by default, fails to reinitialize the random number 01:45:37.340 |
generator in each forked process. See: that's the same number. Okay, so you've got to be careful. Now, I have a feeling 01:45:52.140 |
NumPy might handle it. Is that how you do it? I don't quite remember. We'll try. 01:46:02.220 |
Nope. Okay, NumPy also doesn't reinitialize. How interesting. What about Python? 01:46:32.380 |
So Python does actually remember to reinitialize the random stream in each fork. 01:46:41.820 |
So, you know, this is something that, like, even if you've experimented in Python and you think 01:46:46.060 |
everything's working well in your data loader or whatever, and then you switch to PyTorch or NumPy 01:46:50.700 |
and now suddenly everything's broken. So this is why we've spent some time re-implementing 01:46:56.860 |
the random number generator from scratch, partly because it's fun and interesting and partly 01:47:03.020 |
because it's important that you now understand that when you're calling rand or any random number 01:47:08.380 |
generator, kind of the default versions in NumPy and PyTorch, this global state is going to be 01:47:14.940 |
copied. So you've got to be a bit careful. Now, I will mention our random number generator. Okay, 01:47:26.060 |
so this, this is cool: %timeit. Anything with a percent is a special Jupyter or IPython function. 01:47:34.700 |
And %timeit runs a piece of Python code a number of times, here 10 times per loop. 01:47:41.740 |
Well, actually, it'll do seven runs of that loop, and it'll take the mean and 01:47:46.060 |
standard deviation. So here I am going to generate random numbers 7,840 times and put them into 10 01:48:01.180 |
long chunks. And if I run that, it takes me three milliseconds per loop. If I run it 01:48:10.700 |
using PyTorch, this is the exact same thing in PyTorch. It's going to take me 73 microseconds 01:48:20.860 |
per loop. So as you can see, although we could use our version, we're not going to because the 01:48:27.180 |
PyTorch version is much, much faster. This is how we can create a 784 by 10 matrix of random numbers. And why would we want 01:48:27.180 |
this? That's because this is our final layer of our neural net or if we're doing a linear classifier, 01:48:39.260 |
our linear weights. It needs to be 784 because that's 28 by 28, by 10 because that's the number 01:48:39.260 |
of possible outputs, the number of possible digits. 01:48:58.940 |
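A sketch of those last cells, reusing the chunks and rand we built above (the timings are what I got; yours will vary):

    import torch

    # pure-Python version: around 3 ms per loop on my machine
    %timeit -n 10 list(chunks([rand() for _ in range(7840)], 10))

    # PyTorch equivalent: around 73 us per loop
    %timeit -n 10 torch.randn(784, 10)

    weights = torch.randn(784, 10)   # 784 = 28*28 inputs, 10 possible digits out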
All right. That is it. Quite the intense lesson, I think we can all agree. It should keep you busy for a week. And thanks very much 01:49:06.060 |
for joining. And see you next time. Bye, everybody.