Lesson 11 2022: Deep Learning Foundations to Stable Diffusion
Chapters
0:00 Introduction
0:20 Showing student’s work
13:30 Workflow on reading an academic paper
16:20 Read DiffEdit paper
26:27 Understanding the equations in the “Background” section
46:10 3 steps of DiffEdit
51:42 Homework
59:15 Matrix multiplication from scratch
68:47 Speed improvement with Numba library
79:25 Frobenius norm
85:54 Broadcasting with scalars and matrices
99:22 Broadcasting rules
102:10 Matrix multiplication with broadcasting
00:00:04.680 |
This is the third lesson in part two, depending on how you count things. 00:00:08.960 |
There's been a lesson A and lesson B, it's kind of the fifth lesson in part two, I don't 00:00:13.680 |
So we'll just stick to calling it lesson 11 and avoid getting too confused. 00:00:20.120 |
My goodness, I've got so much stuff to show you. 00:00:21.920 |
I'm only going to show you a tiny fraction of the cool stuff that's been happening on 00:00:31.000 |
I'm going to start by sharing this beautiful video from John Robinson, and I've never 00:00:37.280 |
seen anything like this before. 00:00:42.540 |
As you can see, it's very stable and it's really showing this beautiful movement between 00:00:52.320 |
So what I did on the forum was I said to folks, "Hey, you should try interpolating between 00:00:59.840 |
And I also said, "You should try using the last image of the previous prompt interpolation 00:01:12.800 |
And anyway, here it is, it came out beautifully, John was the first to get that working, so 00:01:24.560 |
And the second one I wanted to show you is this really amazing work from Sebastian Derhi, 00:01:36.880 |
who did something that I'd been thinking about as well. 00:01:42.480 |
I'm really thrilled that he also thought about this, which was he noticed that this update 00:01:47.800 |
we do, unconditional embeddings plus guidance times text embeddings minus unconditional 00:01:57.360 |
embeddings, has a bit of a problem, which is that it gets big. 00:02:06.680 |
To show you what I mean by it gets big is like, imagine that we've got a couple of vectors 00:02:26.660 |
And so we've got, let's see, so we've got, that's just, okay, so we've got the original 00:02:31.720 |
unconditional piece here, so we've got U. So let's say this is U. And then we add to that 00:02:39.420 |
some amount of T minus U. So if we've got like T, let's say it's huge, right? 00:02:56.140 |
Then the difference between those is the vector which goes here, right? 00:03:05.080 |
Now you can see here that if there's a big difference between T and U, then the eventual 00:03:11.440 |
update which actually happens is, oopsie daisy, I thought that was going to be an arrow. 00:03:23.200 |
The eventual update which happens is far bigger than the original update. 00:03:35.780 |
So this idea is basically to say, well, let's make it so that the update is no longer than 00:03:43.120 |
the original unconditioned update would have been. 00:03:48.120 |
And we're going to be talking more about norms later, but basically we scale it by the ratio 00:03:56.000 |
And what happens is we start with this astronaut and we move to this astronaut. 00:04:07.540 |
And it's kind of, it's a subtle change, but you can see there's a lot more before, after, 00:04:12.960 |
before, after, a lot more texture in the background. 00:04:18.360 |
And like on the Earth, there's a lot more detail before, after, you see that? 00:04:24.780 |
And even little things like before, the bridle, kind of reins, whatever they were, were pretty flimsy. 00:04:32.100 |
So it's made quite a big difference just to kind of get this scaling correct. 00:04:39.660 |
So there's a couple of other things that Sebastian tried, which I'll explain in a moment, but 00:04:45.880 |
you can see how they, some of them actually resulted in changing the image. 00:04:52.420 |
And this one's actually important because the poor horse used to be missing a leg and 00:05:00.240 |
And so here's the detailed one with its extra leg. 00:05:04.840 |
Well, so what he did was he started with this unconditioned prompt plus the guidance times 00:05:11.240 |
the difference between the conditional and unconditioned. 00:05:15.000 |
And then as we discussed, the next version, well, actually the next version we then saw 00:05:23.880 |
is to basically just take that prediction and scale it according to the difference in 00:05:32.400 |
So the norms is basically the length of the vectors. 00:05:35.640 |
And so this is the second one I did in lesson nine, you'll see it's gone from here. 00:05:39.520 |
So when we go from 1a to 1b, you can see here, it's got, look at this, this boot's gone from 00:05:48.880 |
This whatever the hell this thing is, suddenly he's got texture. 00:05:52.320 |
And look, we've now got proper stars in the sky. 00:05:57.560 |
And then the second change is not just to rescale the whole prediction, but to rescale 00:06:06.960 |
When we rescale the update, it actually not surprisingly changes the image entirely because 00:06:15.080 |
And so I don't know, is this better than this? 00:06:17.680 |
I mean, maybe, maybe not, but I think so, particularly because this was the difference 00:06:23.240 |
that added the correct fourth leg to the horse before. 00:06:28.480 |
We can rescale the difference and then rescale the result. 00:06:31.800 |
And then we get the best of both worlds, as you can see, big difference. 00:06:37.560 |
This weird thing on his back's actually become an arm. 00:06:46.840 |
So these little details make a big difference, as you can see. 00:06:53.640 |
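The variants just described can be sketched in a few lines. This is a rough sketch of the idea, not Sebastian's actual implementation; the function names and exactly where the norms are measured are my assumptions:

```python
import torch

def cfg(u, t, g=7.5):
    # standard classifier-free guidance: unconditional + guidance * (text - unconditional)
    return u + g * (t - u)

def cfg_rescale_pred(u, t, g=7.5):
    # variant 1: rescale the whole guided prediction back to the norm
    # of the unconditional prediction
    pred = cfg(u, t, g)
    return pred * u.norm() / pred.norm()

def cfg_rescale_diff(u, t, g=7.5):
    # variant 2: rescale the update (the difference) itself so it is no longer
    # than the unconditional prediction, then rescale the result as well
    diff = (t - u) * u.norm() / (t - u).norm()
    pred = u + g * diff
    return pred * u.norm() / pred.norm()
```

Either way, the guided prediction ends up with exactly the length of the unconditional one, which is the point of the fix.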
So this is a really cool, or two really cool new things. 00:07:04.320 |
Problem number one is after I shared on Twitter Sebastian's approach, Ben Poole, who's at Google 00:07:13.280 |
Brain, I think, if I remember correctly, pointed out that this already exists. 00:07:17.480 |
He thinks it's the same as what's shown in this paper, which is a diffusion model for 00:07:23.440 |
I haven't read the paper yet to check whether it's got all the different options or whether 00:07:29.520 |
So maybe this is reinventing something that already existed and putting it into a new 00:07:36.400 |
Anyway, so hopefully, folks, on the forum, you can help figure out whether this paper 00:07:45.800 |
And then the other interesting thing was John Robinson got back in touch on the forum and 00:07:50.360 |
said, "Oh, actually, that tree video doesn't actually do what we think it does at all. 00:07:57.880 |
There's a bug in his code, and despite the bug, it accidentally worked really well." 00:08:02.680 |
So now we're in this interesting question of trying to figure out, "Oh, how did he create 00:08:10.080 |
And okay, so reverse engineering exactly what the bug did, and then figuring out how to 00:08:17.760 |
It's really good to have a lot of people working on something, and the bugs often, yeah, they 00:08:30.360 |
So watch this space where we find out what John actually did and how it worked so well. 00:08:38.600 |
And then something that I just saw like two hours ago in the forum, which I had never 00:08:43.000 |
thought of before, but I thought of something a little bit similar. 00:08:47.640 |
Rakhil Prashanth said like, "Well, what if we took this?" 00:08:50.720 |
So as you can see, all the students are really bouncing ideas off each other. 00:08:54.760 |
We're doing different things with a guidance scale. 00:08:57.880 |
What if we take the guidance scale, and rather than keeping it at 7.5 all the time, let's 00:09:04.200 |
And this is a little bit similar to something I suggested to John a few weeks ago where 00:09:08.680 |
I said he was doing some stuff with like modifying gradients based on additional loss 00:09:15.500 |
And I said to him, "Maybe you should just use them like occasionally at the start." 00:09:18.980 |
Because I think the key thing is once the model kind of knows roughly what image it's 00:09:22.880 |
trying to draw, even if it's noisy, you can let it do its thing. 00:09:28.280 |
And this is exactly what's happening here is Rakhil's idea is to say, "Well, let's decrease 00:09:37.040 |
And so once it's kind of going in the right direction, we let it do its thing. 00:09:41.520 |
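A minimal sketch of that idea: start the guidance scale high and decay it over the sampling steps. The linear decay and the start/end values here are illustrative assumptions, not the exact schedule used:

```python
def guidance_scale(step, n_steps, start=7.5, end=1.0):
    # linearly decay guidance from `start` at the first step to `end` at the last
    return start + (end - start) * step / (n_steps - 1)

# each sampling step i would then use guidance_scale(i, n_steps)
# in place of a fixed 7.5
```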
So this little doggy is with the normal 7.5 guidance scale. 00:09:52.960 |
And if I go to the next one, as you can see now, actually look at the eye. 00:09:57.540 |
That's a proper eye before, totally glassy, black, now proper eye. 00:10:03.320 |
Or like look at all this fur, very textured, previously very out of focus. 00:10:22.160 |
You folks are trying things out, and some things are working, and some things not working, 00:10:30.760 |
I kind of feel like you're going to have to slow down because I'm having trouble keeping up. 00:10:40.280 |
I also wanted to mention on a different theme to check out Alex's notes on the lesson because 00:10:51.680 |
I thought he's done a fantastic job of showing like how to study a lesson. 00:10:58.080 |
And so what Alex did, for example, was he made a list in his notes of all the different 00:11:03.120 |
steps we did as we started from the foundations. 00:11:12.880 |
And I know that Alex's background actually is history, not computer science. 00:11:20.520 |
And so for somebody moving into a different field like this, this is a great idea, particularly 00:11:25.160 |
to be able to look at like, OK, what are all the things that I'm going to have to learn 00:11:32.400 |
And then he did something which we always recommend, which is to try the lesson on 00:11:38.720 |
And he very sensibly picked out the fashion MNIST data set, which is something we'll 00:11:42.200 |
be using a lot in this course because it's a lot like MNIST. 00:11:46.920 |
And it's just different enough to be interesting. 00:11:49.920 |
And so he described in his post or his notes how he went about doing that. 00:11:55.760 |
And then something else I thought was interesting in his notes at the very end was he just jotted 00:12:02.680 |
It's very easy when I throw a tip out there to think, oh, that's interesting. 00:12:11.040 |
So here's a good way to make sure you don't forget about all the little tricks. 00:12:17.240 |
And I think I've put those notes in the forum wiki so you can check them out if you'd like 00:12:30.000 |
OK, so during the week, Jono taught us about a new paper that had just come out called 00:12:46.040 |
DiffEdit and he told us he thought this was an interesting paper. 00:12:52.680 |
And it came out during the week and I thought it might be good practice for us to try reading 00:13:06.600 |
And you'll find that probably the majority of papers that you come across in deep learning 00:13:19.720 |
So these are papers that have not been peer reviewed. 00:13:26.060 |
I would say in our field we don't generally or I certainly don't generally care about 00:13:32.440 |
that at all because we have code, we can try it, we can see things whether it works or not. 00:13:38.920 |
You know, we tend to be very, you know, most papers are very transparent about here's what 00:13:42.680 |
we did and how we did it and you can replicate it. 00:13:46.000 |
And it gets a huge amount of peer review on Twitter. 00:13:50.020 |
So if there's a problem generally within 24 hours, somebody has pointed it out. 00:13:55.060 |
Now we use arXiv a lot and if you wait until it's been peer reviewed, you know, you'll 00:13:59.260 |
be way out of date because this field is moving so quickly. 00:14:03.480 |
So here it is on arXiv and we can read it by clicking on the PDF button. 00:14:08.360 |
I don't do that, instead I click on this little button up here, which is the Save to Zotero 00:14:16.160 |
So I figured I'd show you like my preferred workflows. 00:14:19.360 |
You don't have to do the same thing, there are different workflows, but here's one that 00:14:22.440 |
I find works very well, which is Zotero is a piece of free software that you can download 00:14:28.780 |
for Mac, Windows, Linux and install a Chrome connector. 00:14:32.700 |
Oh, Tanishk is saying the button is covered. 00:14:36.020 |
All right, so in my taskbar, I have a button that I can click that says Save to Zotero. 00:14:44.300 |
And when I click it, I'll show you what happens. 00:14:46.180 |
So after I've downloaded this, the paper will automatically appear here in this software, 00:15:04.540 |
And you can see it's told us, it's got here the abstract, the authors, where it came from. 00:15:12.660 |
And so later on, I can go and like, if I want to check some detail, I can go back and see 00:15:20.260 |
And so in this case, what I'm going to do is I'm going to double click on it. 00:15:26.980 |
Now, the reason I like to read my papers in Zotero is that I can, you know, annotate them, 00:15:35.780 |
edit them, tag them, put them in folders and so forth, and also add them to my kind of 00:15:45.180 |
So as you can see, you know, I've started this Fast Diffusion folder, which is actually 00:15:50.760 |
a group library, which I share with the other folks working on this Fast Diffusion project 00:15:58.020 |
And so we can all see the same paper library. 00:16:03.080 |
So Maribou on YouTube chat is asking, is this better than Mendeley? 00:16:07.780 |
Yeah, I used to use Mendeley and it's kind of gone downhill. 00:16:11.380 |
I think Zotero is far, far better, but they're both very similar. 00:16:15.580 |
Okay, so when you double click on it, it opens up and here is a paper. 00:16:23.220 |
So reading a paper is always extremely intimidating. 00:16:32.980 |
And so you just have to do it anyway and you have to realize that your goal is not to understand 00:16:39.980 |
Your goal is to understand the basic idea well enough that, for example, when you look 00:16:46.500 |
at the code, hopefully it comes with code, most things do, that you'll be able to kind 00:16:51.100 |
of see how the code matches to it and that you could try writing your own code to implement 00:16:58.340 |
So over on the left, you can open up the sidebar here. 00:17:01.500 |
So I generally open up the table of contents and get a bit of a sense of, okay, so there's 00:17:07.300 |
some experimental results, there's some theoretical results, introduction, related work, okay, 00:17:16.100 |
tells us about this new diff edit thing, some experiments, okay. 00:17:20.200 |
So that's a pretty standard approach that you would see in papers. 00:17:26.500 |
So I would always start with the abstract, okay. 00:17:32.260 |
So generally it's going to be some background sentence or two about how interesting this 00:17:36.860 |
It's just saying, wow, image generation is cool, which is fine. 00:17:39.300 |
And then they're going to tell us what they're going to do, which is they're going to create 00:17:50.860 |
It's going to use text condition diffusion models. 00:17:55.980 |
That's where we type in some text and get back an image that matches the text. 00:18:06.580 |
So let's put that aside and think, okay, let's make sure we understand that later. 00:18:10.880 |
The goal is to edit an image based on a text query. 00:18:14.880 |
So we're going to edit an image based on text. 00:18:17.940 |
Ah, they're going to tell us right away what this is. 00:18:22.140 |
It's an extension of image generation with an additional constraint, which is the generated 00:18:27.580 |
image should be as similar as possible to the given input. 00:18:30.540 |
And so generally, as they've done here, there's going to be a picture that shows us what's 00:18:37.340 |
And so in this picture, you can see here an example, here's an input image. 00:18:42.380 |
And originally it was attached to a caption, a bowl of fruits. 00:18:50.380 |
So we type a bowl of pears and it generates, oh, a bowl of pears, or we could change it 00:18:59.580 |
from a bowl of fruits to a basket of fruits and oh, it's become a basket of fruits. 00:19:09.180 |
What it's saying is that we can edit an image by typing what we want that image to represent. 00:19:15.900 |
So this actually looks a lot like the paper that we looked at last week. 00:19:26.460 |
So the abstract says that currently, so I guess there are current ways of doing this, 00:19:32.820 |
That means you have to basically draw the area you're replacing. 00:19:36.420 |
So that sounds really annoying, but our main contribution. 00:19:39.500 |
So what this paper does is we automatically generate the mask. 00:19:43.060 |
So they simply just type in the new query and get the new image. 00:19:48.460 |
So if you read the abstract and you think, um, I don't care about doing that, then you 00:19:54.100 |
can skip the paper, you know, um, or, or look at the results. 00:19:59.900 |
And if the results don't look impressive, then just skip the paper. 00:20:03.540 |
So that's, that's kind of your first point where we can be like, okay, we're, we're done. 00:20:09.300 |
The results look amazing. So I think we should keep going. 00:20:14.220 |
They achieve state-of-the-art editing performance, of course. 00:20:21.060 |
So the introduction to a paper, um, is going to try to give you a sense of, you know, what 00:20:30.380 |
And so this first paragraph here is just repeating what we've already read in the abstract and 00:20:37.560 |
So it's saying that we can take a text query, like a basket of fruits, see the examples. 00:20:46.340 |
So the key thing about academic papers is that they are full of citations. 00:20:53.140 |
Um, you should not expect to read all of them because if you do, then to read each of those 00:21:02.020 |
citations, that's full of citations and then they're full of citations. 00:21:04.620 |
And before you know it, you've read the entire academic literature, which has taken you 5,000 00:21:10.500 |
Um, so, uh, for now, let's just recognize that it says text conditional image generation 00:21:29.020 |
So generally there's this like, okay, our area that we're working on is important in 00:21:35.980 |
Um, they say vast amounts of data are used. 00:21:46.300 |
Yes, we know that they denoise starting from Gaussian noise. 00:21:52.420 |
Once you're kind of in the field, you can skip over pretty quickly. 00:22:07.860 |
So there's a new technique that we haven't done, but I think it makes a lot of intuitive 00:22:12.980 |
Um, that is during that diffusion process, if there are some pixels, you don't want to 00:22:17.580 |
change such as all the ones that aren't orange here, you can just paste them from the original 00:22:27.260 |
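A sketch of that pixel-pasting trick as I understand it; the function name and the exact way the original gets noised are assumptions, so see the cited inpainting paper for the real details:

```python
import torch

def paste_unmasked(x_step, orig_noised, mask):
    # keep the generated pixels where mask == 1, and paste back the
    # (appropriately noised) original image everywhere else
    return mask * x_step + (1 - mask) * orig_noised
```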
If I want to know more about that, I could always look at this paper, but I don't think 00:22:33.220 |
And again, it's just repeating something they've already told us that they require us to provide 00:22:46.460 |
It also says that when you mask out an area, that's a problem because if you're trying 00:22:53.140 |
to, for example, change a dog into a cat, you want to keep the animal's color and pose. 00:22:58.860 |
So this is a new technique, which is not deleting the original, not deleting a section and replacing 00:23:04.380 |
it with something else, but it's actually going to take advantage of knowledge about 00:23:07.860 |
what that thing looked like so that this is two cool new things. 00:23:12.800 |
So hopefully at this point, we know what they're trying to achieve. 00:23:17.060 |
If you don't know what they're trying to achieve when you're reading a paper, the paper won't 00:23:21.860 |
Um, so again, that's a point where you should stop. 00:23:25.020 |
Maybe this is not the right time to be reading this paper. 00:23:27.420 |
Maybe you need to read some of the references. 00:23:29.900 |
Maybe you need to look more at the examples so you can always skip straight to the experiments. 00:23:36.300 |
In this case, I don't need to because they've put enough experiments on the very first page 00:23:43.740 |
So yeah, don't always read it from top to bottom. 00:23:53.180 |
So they've got some examples of conditioning a diffusion model on an input without a mask. 00:24:00.180 |
For example, you can use a noised version of the input as a starting point. 00:24:04.460 |
So as you can see, we've already covered a lot of the techniques that they're referring 00:24:11.620 |
Something we haven't done, but makes a lot of sense is that we can look at the distance 00:24:17.180 |
Okay, that makes sense to me and there's some references here. 00:24:21.260 |
All right, so we're going to create this new thing called diffedit. 00:24:32.420 |
Hopefully you found that useful to understand what we're trying to do. 00:24:38.260 |
The next section is generally called related work as it is here. 00:24:42.020 |
And that's going to tell us about other approaches. 00:24:47.340 |
So if you're doing a deep dive, this is a good thing to study carefully. 00:24:52.380 |
I don't think we're going to do a deep dive right now. 00:24:59.420 |
We could kind of do a quick glance of like, oh, image editing, including colorization, 00:25:09.140 |
Definitely getting more excited about this idea of image editing. 00:25:19.500 |
You can use clip guidance, okay, they can be computationally expensive. 00:25:33.980 |
So there's a lot of repetition in these papers as well, which is nice because we can skip 00:25:42.260 |
Okay, so they're saying this is going to be not so computationally expensive. 00:25:52.060 |
And often the very end of the related work is most interesting as it is here where they've 00:25:55.860 |
talked about how somebody else has done concurrent work. 00:25:58.820 |
Somebody else is working at exactly the same time. 00:26:02.420 |
And they've looked at some different approach. 00:26:07.980 |
Okay, so not sure we learned too much from the related work, but if you were trying to 00:26:14.180 |
really do the very, very best possible thing, you could study the related work and get the 00:26:41.740 |
And this is often the scariest bit, the background. 00:26:43.780 |
This is basically saying like, mathematically, here's how the problem that we're trying to 00:26:52.380 |
And so we're going to start by looking at denoising, diffusion, probabilistic models, 00:26:57.660 |
Now, if you've watched lesson 9b with Wasim and Tanishk, then you've already seen some 00:27:10.020 |
And the important thing to recognize is that basically no one in the world pretty much 00:27:16.580 |
is going to look at these paragraphs of text and these equations and go, oh, I get it. 00:27:28.060 |
To understand DDPM, you would have to read and study the original paper, and then you 00:27:33.820 |
would have to read and study the papers it's based on and talk to lots of people and watch 00:27:47.500 |
And then you'll be able to look at this section and say, oh, okay, I see, they're just talking 00:27:54.900 |
So this is meant to be a reminder of something that you already know. 00:27:58.740 |
It's not something you should expect to learn from scratch. 00:28:02.060 |
So let me take you through these equations somewhat briefly because Wasim and Tanishk 00:28:11.100 |
have kind of done them already because every diffusion paper pretty much is going to have 00:28:17.380 |
So, oh, and I'm just going to read something that Jono has pointed out in the chat. 00:28:22.980 |
He says it's worth remembering the background is often written last and tries to look smart 00:28:33.660 |
I think the main reason to read it is to find out what the different letters mean, what 00:28:40.700 |
the different symbols mean, because they'll probably refer to them later. 00:28:44.700 |
But in this case, I want to actually take this as a way to learn how to read math. 00:28:51.380 |
So let's start with this very first equation, which how on earth do you even read this? 00:28:58.660 |
So the first thing I'll say is that this is not an E, right? 00:29:05.180 |
It's a weird looking E. And the reason it's a weird looking E is because it's a Greek 00:29:09.660 |
And so something I always recommend to students is that you learn the Greek alphabet because 00:29:16.020 |
it's much easier to be able to actually read this to yourself. 00:29:22.620 |
If you don't know that's called theta, I guess you have to read it as like circle with line 00:29:28.820 |
It's just going to get confusing trying to read an equation where you just can't actually 00:29:35.220 |
So what I suggest is that you learn that learn the Greek alphabet and let me find the right 00:29:52.140 |
So it's very easy to look it up just on Wikipedia is the Greek alphabet. 00:30:01.940 |
And if we go down here, you'll see they've all got names and we can go and try and find 00:30:06.580 |
our one curvy E. Okay, here it is, epsilon and oh, circle with a line through it, theta. 00:30:15.820 |
All right, so practice and you will get used to recognizing these. So you've got epsilon theta. 00:30:15.820 |
This is just a weird curly L. So that's used for the loss function. 00:30:26.220 |
Okay, so how do we find out what this symbol means and what this symbol means? 00:30:38.260 |
Well, what we can do is there's a few ways to do it. 00:30:44.660 |
One way, which is kind of cool, is we can use a program called MathPix. 00:30:44.660 |
And what it does is you basically select anything on your screen. 00:31:19.100 |
So that's one way you can do this is you can select on the screen. 00:31:23.020 |
And the reason it's good to turn it into LaTeX is because LaTeX is written as actual stuff 00:31:35.300 |
Technique number two is you can download the other formats of the paper and that will have 00:31:48.820 |
And if we say download source, then what we'll be able to do is we'll be able to actually 00:32:02.980 |
So we'll wait for that to download while it's happening. 00:32:21.580 |
We could try looking for two bars, maybe math notation. 00:32:35.620 |
Oh, and here there's a glossary of mathematical symbols. 00:32:46.820 |
Okay, so it definitely doesn't look like this. 00:32:49.660 |
It's not between two sets of letters, but it is around something that looks hopeful. 00:33:01.340 |
Okay, so then you can start looking for these things up. 00:33:09.260 |
And so once you can actually find the term, then we kind of know what to look for. 00:33:16.260 |
Okay, so in our case, we've got this surrounding all this stuff, and then there's twos here 00:33:33.020 |
All right, if we scroll through, oh, this is pretty close actually. 00:33:40.020 |
So, okay, so two bars can mean a matrix norm, otherwise a single for a vector norm. 00:33:52.100 |
So it looks like we don't have to worry too much about whether it's one or two bars. 00:34:01.220 |
All right, so it's equal to root sum of squares. 00:34:06.780 |
So this norm thing means a root sum of squares. 00:34:14.300 |
Ah, so this is a root sum of squares squared. 00:34:18.620 |
Well, the square of a square root is just the thing itself. 00:34:22.180 |
Ah, so actually this whole thing is just the sum of squares. 00:34:26.180 |
It's a bit of a weird way to write it, in a sense. 00:34:30.220 |
We could perfectly well have just written it as, you know, like a sum of squares. 00:32:54.540 |
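In code, with a small example vector:

```python
import torch

v = torch.tensor([3., 4.])
# the norm (the two bars) is the square root of the sum of squares
norm = v.pow(2).sum().sqrt()   # 5.0
# so the norm squared, as in the loss, is just the sum of squares
sum_of_squares = norm ** 2     # 25.0
```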
So how would you find out what the weird E thing is? 00:34:57.660 |
Okay, so our LaTeX has finally finished downloading. 00:35:14.860 |
And if we open it up, we can find there's a .tex file in here. 00:35:26.460 |
And it's not the most, you know, amazingly smooth process, but, you know, what we could 00:35:32.340 |
just do is we could say, okay, it's just after it says minimizing the denoising objective. 00:35:36.780 |
Okay, so let's search for minimizing the, oh, here it is, minimizing the denoising objective. 00:35:44.620 |
So the LaTeX here, let's get it back from the screen at the same time. 00:35:50.060 |
Okay, so here it is, mathcal L equals mathbb E, x naught, t, epsilon, okay. 00:36:02.260 |
And here's that vertical bar thing, epsilon minus epsilon theta XT, and then the bar thing 00:36:08.580 |
All right, so the thing that we've got new is mathbb E, okay, so finally we've got something 00:36:13.940 |
we can search for, mathbb E, ah, fantastic, what does mathbb E mean? 00:36:28.380 |
That's the expected value operator, aha, fantastic, all right. 00:36:33.960 |
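Putting those symbols together, the equation being decoded reads, in the paper's notation (reconstructed from the source just read, with the subscripts as listed inside the expectation):

```latex
\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon}
  \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2 \right]
```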
So it takes a bit of fussing around, but once you've got either MathPix working or actually 00:36:40.420 |
another thing you could try, because MathPix is ridiculously expensive, in my opinion, 00:36:44.820 |
is there is a free version called pix2tex that actually is a Python thing, 00:36:59.780 |
and you could actually even have fun playing with this because the whole thing is just 00:37:03.820 |
a PyTorch Python script, and it even describes, you know, how it uses a transformers model, 00:37:12.500 |
and you can train it yourself in Colab and so forth, but basically as you can see, yeah, 00:37:17.380 |
you can snip and convert to LaTeX, which is pretty awesome. 00:37:23.820 |
So you could use this instead of paying the MathPix guys, anyway, so we are on the right 00:37:33.460 |
track now, I think, so expected value, and then we can start reading about what expected 00:37:40.580 |
value is, and you might actually remember that because we did a bit of it in high school, 00:37:45.260 |
at least in Australia we did, it's basically like, let's maybe jump over here, so expected 00:37:58.260 |
value of something is saying what's the likely value of that thing, so for example, let's 00:38:06.140 |
say you toss a coin, which could be heads or it could be tails, and you want to know 00:38:11.460 |
how often it's heads, and so maybe we'll call heads one tail zero, so you toss it and you 00:38:17.340 |
get a one zero zero one one zero one zero one, okay, and so forth, right, and then you 00:38:23.180 |
can calculate the mean of that, right, so if that's x you can calculate x bar, the mean, 00:38:31.380 |
which would be the sum of all that, divided by the count of all that, so it'd be one two 00:38:39.860 |
three four five, five divided by one two three four five six seven eight nine, okay, so that 00:38:47.980 |
would be the mean, but the expected value is like, well, what do you expect to happen, 00:38:53.780 |
and we can calculate that by adding up for all of the possibilities for each, I don't 00:39:00.380 |
know, I'll just call them x, for each possibility x, how likely is x, and what score do you 00:39:06.420 |
get if you get x, so in this example of heads and tails, our two possibilities is that we 00:39:11.540 |
either get heads or we get tails, so if for the version where x is heads we get probability 00:39:19.420 |
is zero point five, and the score if x is heads is going 00:39:28.740 |
to be one, and then what about tails, for tails the probability is zero point five, and the 00:39:36.260 |
score if you get tails is zero, and so overall the expected is point five times one plus 00:39:42.220 |
zero is point five, so our expected score if we're tossing a coin is point five, if getting 00:39:49.460 |
heads is a win. Let me give you another example, another example is let's say that we're rolling 00:39:56.940 |
a die, and we want to know what the expected score is if we roll a die, so again we could 00:40:04.500 |
roll it a bunch of times, and see what happens, okay, and so we could sum all that up, let's 00:40:15.460 |
like before, and divide it by the count, and that'll tell us the mean for this particular 00:40:21.540 |
example, but what's the expected value more generally, well again it's the sum of all 00:40:27.580 |
the possibilities of the probability of each possibility times that score, so the possibilities 00:40:34.420 |
for rolling a die is that you can get a one, a two, a three, a four, a five, or a six, the 00:40:41.240 |
probability of each one is a sixth, okay, and the score that you get is, well it's this, 00:40:54.500 |
this is the score, and so then you can multiply all these together and sum them up, which 00:41:00.260 |
would be 1/6 plus 2/6 plus 3/6 plus 4/6, oops, plus 5/6 plus 6/6, and that would give you 00:41:17.580 |
the expected value of that particular thing, which is rolling die, rolling a die, so that's 00:41:28.020 |
what expected value means, all right, so that's a really important concept that's going to 00:41:34.460 |
come up a lot as we read papers, and so in particular this is telling us what are all 00:41:43.780 |
the things that we're averaging it over, that with the expectations over, and so there's 00:41:48.820 |
a whole lot of letters here, you're not expected to just know what they are, in fact in every 00:41:52.860 |
paper they could mean totally different things, so you have to look immediately underneath 00:41:56.500 |
where they'll be defined, so x0 is an image, it's an input image, epsilon is the noise, 00:42:06.400 |
and the noise has a mean of zero and a covariance of I, the identity matrix, which if you watch the lesson 00:42:11.860 |
9b you'll know it's like a standard deviation of 1 when you're doing multiple normal variables, 00:42:20.820 |
okay, and then this is kind of confusing, epsilon just on its own is a normally distributed 00:42:28.180 |
random variable, so it's just grabbing random numbers, but epsilon 00:42:35.060 |
theta is a noise estimator, that means it's a function, you can tell it's a function kind 00:42:41.880 |
of because it's got these parentheses and stuff right next to it, so that's a function, 00:42:47.620 |
so presumably most functions like this in these papers are neural networks, okay, so 00:42:53.180 |
we're finally at a point where this actually is going to make perfect sense, we've got 00:42:56.460 |
the noise, we've got the prediction of that noise, we subtract one from the other, we 00:43:04.100 |
square it, and we take the expected value, so in other words this is mean squared error, 00:43:11.820 |
so wow, that's a lot of fiddling around to find out that we've, this whole thing here 00:43:16.500 |
means mean squared error, so the loss function is the mean squared error, and unfortunately 00:43:22.820 |
I don't think the paper ever says that, it says minimising the denoising objective L 00:43:26.580 |
blah de blah de blah de, but anyway we got there eventually, fine, we also, as well as 00:43:36.820 |
learning about x0, we also learn here about xt, and so xt is the original unnoised image 00:43:46.060 |
times some number plus some noise times one minus that number, okay, and so hopefully 00:43:55.300 |
you'll recognise this from lesson 9b, this is the thing where we reduce the value of 00:43:59.580 |
each pixel and we add noise to each pixel, so that's that, alright, so I'm not going 00:44:07.740 |
to keep going through it, but you can kind of basically get the idea here is that once 00:44:11.640 |
you know what you're looking for, the equations do actually make sense, right, but all this 00:44:20.420 |
is doing is remember this is background, right, this is telling you what already exists, so 00:44:25.600 |
this is telling you this is what a DDPM is, and then it tells you what a DDIM is, DDIM 00:44:34.660 |
is, just think of it as a more recent version of DDPM, it's some very minor changes to the 00:44:43.260 |
way it's set up which allows us to go faster, okay, so the thing is though, once we keep 00:44:51.420 |
reading what you'll find is none of this background actually matters, but you know I thought we'd 00:44:58.620 |
kind of go through it just to get a sense of like what's in a paper, okay, so for the 00:45:05.700 |
purpose of our background it's enough to know that DDPM and DDIM are kind of the foundational 00:45:11.980 |
papers on which diffusion models today are based. Okay, so the encoding process which 00:45:32.180 |
encodes an image onto a latent variable, okay, and then this is basically adding noise, this 00:45:41.220 |
is called DDIM encoding, and the thing that goes from the input image to the noised image, 00:45:48.860 |
they're going to call capital E subscript r, where r is the encoding ratio, so that's 00:45:54.420 |
like how much noise are we adding, if you use small steps then decoding that, so going 00:46:02.420 |
backwards gives you back the original image, okay, so that's what the stuff that we've 00:46:05.460 |
learned about, that's what diffusion models are. All right, so this looks like a very 00:46:12.980 |
useful picture, so maybe let's take a look and see what this says, so what is DiffEdit? 00:46:19.540 |
DiffEdit has three steps. Step one, we add noise to the input image, that sounds pretty 00:46:25.500 |
normal, here's our input image x0, okay, and we add noise to it, fine, and then we denoise 00:46:34.460 |
it, okay, fine. Ah, but we denoise it twice. One time we denoise it using the reference 00:46:46.820 |
text R, horse, or this special symbol here means nothing at all, so either unconditional 00:46:54.700 |
or horse. All right, so we do it once using the word horse, so we take this and we decode 00:47:03.860 |
it, estimate the noise, and then we can remove that noise on the assumption that it's a horse. 00:47:11.220 |
Then we do it again, but the second time, when we calculate the noise, 00:47:20.020 |
we pass in our query Q, which is zebra. Wow, those are going to be very different noises. 00:47:27.860 |
The noise for horse is just going to be literally these Gaussian pixels, these are all dots, 00:47:33.300 |
right, because it is a horse, but if the claim is no, no, this is actually a zebra, then 00:47:39.140 |
all of these pixels here are all wrong, they're all the wrong color. So the noise that's calculated 00:47:46.180 |
if we say this is our query is going to be totally different to the noise if we say this 00:47:52.020 |
is our query, and so then we just take one minus the other, and here it is here, right, 00:47:58.940 |
so we derive a mask based on the difference in the denoising results, and then you take 00:48:05.420 |
that and binarize it, so basically turn that into ones and zeros. So that's actually the 00:48:10.780 |
key idea, that's a really cool idea, which is that once you have a diffusion model that's 00:48:17.220 |
trained, you can do inference on it where you tell it the truth about what the thing 00:48:22.100 |
is, and then you can do it again but lie about what the thing is, and in your lying version 00:48:28.500 |
it's going to say okay, all the stuff that doesn't match zebra must be noise. And so 00:48:34.180 |
the difference between the noise prediction when you say hey it's a zebra versus the noise 00:48:38.420 |
prediction when you say hey it's a horse will be all the pixels that it says no, these pixels 00:48:44.580 |
are not zebra. The rest of it, it's fine, there's nothing particularly about the background 00:48:50.140 |
that wouldn't work with a zebra. Okay, so that's step one. So then step two is we take the 00:49:03.060 |
horse and we add noise to it. Okay, that's this XR thing that we learned about before. 00:49:14.740 |
And then step three, we do decoding conditioned on the text query using the mask to replace 00:49:23.700 |
the background with pixel values. So this is like the idea that we heard about before, 00:49:29.660 |
which is that during the inference time as you do diffusion from this fuzzy horse, what 00:49:37.100 |
happens is that we do a step of diffusion inference and then all these black pixels 00:49:44.920 |
we replace with the noised version of the original. And so we do that multiple times 00:49:49.980 |
and so that means that the original pixels in this black area won't get changed. And 00:49:57.820 |
that's why you can see in this picture here and this picture here, the backgrounds all 00:50:01.820 |
the same. And the only thing that's changed is that the horse has been turned into a zebra. 00:50:09.540 |
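The mask idea in step one can be sketched in code. This is just a toy illustration of the concept, not the paper's implementation: `noise_ref` and `noise_query` stand in for the noise predictions conditioned on the reference and query text, and the function name and threshold value are made up here.

```python
import numpy as np

def diffedit_mask(noise_ref, noise_query, threshold=0.5):
    # The mask comes from the difference between the two noise
    # predictions (reference text vs. query text).
    diff = np.abs(noise_query - noise_ref)
    # Normalise to [0, 1] so the threshold is scale-independent.
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    # Binarise: 1 where the predictions disagree a lot (the object
    # to edit), 0 elsewhere (the background to keep).
    return (diff > threshold).astype(np.float32)

# Toy example: the two predictions agree everywhere except a 2x2 patch.
ref = np.zeros((4, 4))
query = np.zeros((4, 4))
query[1:3, 1:3] = 1.0
mask = diffedit_mask(ref, query)
```

In practice the paper averages over several noise samples before binarising, but the core of the trick is just this difference-then-threshold step.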
So this paragraph describes it and then you can see here it gives you a lot more detail. 00:50:17.340 |
And the detail often has all kinds of like little tips about things they tried and things 00:50:22.220 |
they found, which is pretty cool. So I won't read through all that because it says the 00:50:31.780 |
same as what I've already just said. One of the interesting little things they note here 00:50:37.100 |
actually is that this binarized mask, so this difference between the R decoding and the 00:50:44.260 |
Q decoding tends to be a bit bigger than the actual area where the horse is, which you 00:50:50.260 |
can kind of see with these legs, for example. And their point is that they actually say 00:50:53.980 |
that's a good thing because actually often you want to slightly change some of the details 00:50:58.820 |
around the object. So this is actually fine. All right. So we have a description of what 00:51:07.340 |
the thing is, lots of details there. And then here's the bit that I totally skip, the bit 00:51:12.740 |
called theoretical analysis, where this is the stuff that people really generally just 00:51:17.620 |
add to try to get their papers past review. You have to have fancy math. And so they're 00:51:22.660 |
basically proving, you can see what it says here, insight into why this component yields 00:51:28.980 |
better editing results than other approaches. I'm not sure we particularly care because 00:51:34.420 |
like it makes perfect sense what they're doing. It's intuitive and we can see it works. I 00:51:39.260 |
don't feel like I need it proven to me, so I skip over that. So then they'll show us 00:51:44.300 |
their experiments to tell us what datasets they did the experiments on. And so then, 00:51:51.300 |
you know, they have metrics with names like LPIPS and CSFID. You'll come across FID 00:51:59.460 |
a lot. This is just a version of that. But basically they're trying to score how good 00:52:04.780 |
their generated images are. We don't normally care about that either. They care because 00:52:10.580 |
they need to be able to say, you should publish our paper because it has a higher number than 00:52:14.260 |
the other people that have worked on this area. In our case, we can just say, you know, 00:52:21.140 |
it looks good. I like it. So excellent question in the chat from Mikolaj, which is, so would 00:52:29.540 |
this only work on things that are relatively similar? And I think this is a great point. 00:52:34.700 |
This is where understanding this helps to know what its limitations are going to be. 00:52:40.100 |
And that's exactly right. If you can't come up with a mask for the change you want, this 00:52:47.580 |
isn't going to work very well on the whole. Yeah, because the masked areas, the pixels 00:52:53.620 |
going to be copied. So, for example, if you wanted to change it from, you know, a bowl 00:52:58.980 |
of fruits to a bowl of fruits with a bokeh background or like a bowl of fruits with, 00:53:07.820 |
you know, a purple tinged photo of a bowl of fruit, if you want the whole color to change, 00:53:14.060 |
that's not going to work, right? Because you're not masking off an area. Yeah. So by understanding 00:53:18.660 |
the detail here, Mikolaj has correctly recognized a limitation or like, what's this for? This 00:53:25.860 |
is for things where you can just say, just change this bit and leave everything else 00:53:29.900 |
the same. All right. So there's lots of experiments. So, yeah. For some things, you care about 00:53:38.700 |
the experiments a lot. If it's something like classification, for generation, the main thing 00:53:43.180 |
you probably want to look at is the actual results. And so, and often, for whatever reason, 00:53:50.380 |
I guess, because this is, most people read these electronically, the results often you 00:53:53.820 |
have to zoom into a lot to be able to see whether they're really good. So here's the 00:53:57.420 |
input image. They want to turn this into an English Foxhound. So here's the thing they're 00:54:03.300 |
comparing themselves to, SDEdit, and it changed the composition quite a lot. And their version, 00:54:09.940 |
it hasn't changed it at all. It's only changed the dog. And ditto here, semi-trailer truck. 00:54:14.860 |
SDEdit's totally changed it. DiffEdit hasn't. So you can kind of get a sense of like, you 00:54:20.860 |
know, the authors showing off what they're good at here. This is, this is what this technique 00:54:25.420 |
is effective at doing, changing animals and vehicles and so forth. It does a very good 00:54:32.900 |
job of it. All right. So then there's going to be a conclusion at the end, which I find 00:54:43.580 |
almost never adds anything on top of what we've already read. And as you can see, it's 00:54:48.460 |
very short anyway. Now, quite often the appendices are really interesting. So don't skip over 00:54:59.820 |
them. Often you'll find like more examples of pictures. They might show some examples 00:55:05.300 |
of pictures that didn't work very well, stuff like that. So it's often well worth looking 00:55:10.380 |
at the appendices. Often some of the most interesting examples are there. And that's 00:55:16.900 |
it. All right. So that is, I guess, our first full on paper walkthrough. And it's important 00:55:24.060 |
to remember, this is not like a carefully chosen paper that we've picked specifically 00:55:30.560 |
because you can handle it. Like this is the most interesting paper that came out this 00:55:34.020 |
week. And so, you know, it gives you a sense of what it's really like. And for those of 00:55:43.720 |
you who are, you know, ready to try something that's going to stretch you, see if you can 00:55:49.740 |
implement any of this paper. So there are three steps. The first step is kind of the 00:55:56.540 |
most interesting one, which is to generate, automatically generate a mask. And the information 00:56:02.660 |
that you have and the code that's in the lesson nine notebook actually contains everything 00:56:07.340 |
you need to do it. So maybe give it a go. See if you can mask out the area of a horse that 00:56:14.540 |
does not look like a zebra. And that's actually, you know, that's actually useful of itself. 00:56:19.480 |
Like that's, that's allows you to create segmentation masks automatically. So that's pretty cool. 00:56:26.340 |
And then if you get that working, then you can go and try and do step two. If you get 00:56:30.980 |
that working, you can try and do step three. And this only came out this week. So I haven't 00:56:36.220 |
really seen, yeah, examples of easy to use interfaces to this. So here's an example of 00:56:43.940 |
a paper that you could be the first person to create a cool interface to it. So there's 00:56:48.040 |
some, yeah, there's a fun little project. And even if you're watching this a long time after 00:56:53.660 |
this was released and everybody's been doing this for years, still good homework, I think, 00:56:58.540 |
so practice if you can. All right. I think now's a good time to have a 10 minute break. 00:57:12.300 |
So I'll see you all back here in 10 minutes. Okay. Welcome back. One thing during the break 00:57:23.900 |
that Diego reminded us about, which I normally describe and I totally forgot about this time 00:57:30.060 |
is Detexify, which is another really great way to find symbols you don't know about. 00:57:35.580 |
So let's try it for that expectation. So if you go to Detexify and you draw the 00:57:45.100 |
thing, it doesn't always work fantastically well, but sometimes it works very nicely. 00:57:53.980 |
Yeah, in this case, not quite. What about the double line thing? It's good to know all the 00:58:03.500 |
techniques, I guess. I think it could do this one. I guess part of the problem is there's 00:58:15.340 |
so many options that actually, you know, okay, in this case, it wasn't particularly helpful. 00:58:20.700 |
And normally it's more helpful than that. I mean, if we use a simple one like Epsilon, 00:58:26.380 |
I think it should be fine. There's a lot of room to improve this app, actually, if anybody's 00:58:30.860 |
interested in a project, I think you could make it, you know, more successful. Okay, that's, 00:58:37.500 |
there you go. Signo sum, that's cool. Anyway, so it's another useful thing to know about, 00:58:41.500 |
just Google for Detexify. Okay. So let's move on with our from-the-foundations work now. 00:58:52.140 |
And so we were working on trying to at least get the start of a forward pass of a linear model or 00:59:02.220 |
a simple multi-layer perceptron for MNIST going. And we had successfully created a basic tensor. 00:59:11.260 |
We've got some random numbers going. So what we now need to do is we now need to be able to 00:59:19.660 |
multiply these things together, matrix multiplication. So matrix multiplication 00:59:26.380 |
to remind you, in this case, so we're doing MNIST, right? So we've got, 00:59:38.460 |
I think we're going to use a subset. Let's see. Yeah. Okay. So we're going to create a matrix 00:59:46.700 |
called M1, which is just the first five digits. So M1 will be the first five digits. So five rows 00:59:57.340 |
and dot, dot, dot, dot, dot, dot, dot. And then 780, what was it again? 784 columns, 01:00:09.500 |
784 columns, because it's 28 by 28 pixels. And we flattened it out. So this is our 01:00:17.100 |
first matrix and our matrix multiplication. And then we're going to multiply that by some 01:00:22.780 |
weights. So the weights are going to be 784 by 10 random numbers. So for every one of these 01:00:37.820 |
784 pixels, each one is going to have a weight. So 784 down here, 01:00:45.020 |
784 by 10. So this first column, for example, is going to tell us all the weights in order to 01:00:58.700 |
figure out if something's a zero. And the second column will have all the weights in deciding the 01:01:03.180 |
probability of something's a one and so forth, assuming we're just doing a linear model. And so 01:01:07.420 |
then we're going to multiply these two matrices together. So when we multiply matrices together, 01:01:12.940 |
we take row one of matrix one and we take column one of matrix two and we take each 01:01:24.220 |
one in turn. So we take this one and we take this one, we multiply them together. 01:01:28.220 |
And then we take this one and this one and we multiply them together. 01:01:36.300 |
And we do that for every element wise pair and then we add them all up. And that would give us 01:01:46.540 |
the value for the very first cell that would go in here. That's what matrix multiplication is. 01:02:01.660 |
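As a tiny concrete check of that row-times-column rule, here is one cell of a product computed by hand (the numbers are made up for illustration):

```python
a = [[1, 2, 3],
     [4, 5, 6]]          # 2x3 matrix
b = [[7, 8],
     [9, 10],
     [11, 12]]           # 3x2 matrix
# First cell of the result: row 0 of a times column 0 of b,
# element-wise, then summed.
cell00 = sum(a[0][k] * b[k][0] for k in range(3))  # 1*7 + 2*9 + 3*11 = 58
```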
Okay, so let's go ahead then and create our random numbers for the weights, 01:02:10.780 |
since we're allowed to use random number generators now. And for the bias, we'll just use a bunch of 01:02:15.180 |
zeros to start with. So the bias is just what we're going to add to each one. And so for our 01:02:23.340 |
matrix multiplication, we're going to be doing a little mini batch here. We're going to be doing 01:02:26.620 |
five rows, as we discussed, so five images flattened out. 01:02:35.020 |
And then multiply by this weights matrix. So here are the shapes, m1 is 5 by 784, 01:02:47.340 |
as we saw, m2 is 784 by 10. Okay, so keep those in mind. So here's a handy thing, m1.shape 01:02:57.180 |
contains two numbers and I want to pull them out. I want to call the, I'm going to think of that as, 01:03:04.940 |
I'm going to actually think of this as like a and b rather than m1 and m2. So this is like a and b. 01:03:09.420 |
So the number of rows in a and the number of columns in a, if I say equals m1.shape, 01:03:17.500 |
that will put five in ar and 784 in ac. So you'll probably notice this, I do this a lot, 01:03:23.980 |
this de-structuring, we talked about it last week too. So we can do the same for m2.shape, 01:03:28.300 |
put that into b rows and b columns. And so now if I write out arac and brbc, you can again see 01:03:36.220 |
the same things from the sizes. So that's a good way to kind of give us the stuff we have to loop 01:03:40.780 |
through. So here's our result. So our resultant tensor, while we're multiplying, we're multiplying 01:03:49.180 |
together all of these 784 things and adding them up. So the resultant tensor is going to 01:03:53.900 |
be 5 by 10. And then each thing in here is the result of multiplying and adding 784 pairs. 01:04:02.620 |
So the result here is going to start with zeros, and this is the result. 01:04:11.500 |
And it's going to contain ar rows, five rows, and bc columns, 10 columns, 5 comma 10. Okay. 01:04:20.780 |
So now we have to fill that in. And so to do a matrix multiplication, 01:04:24.140 |
we have to first, we have to go through each row, one at a time. And here we have that, 01:04:35.260 |
go through each row, one at a time. And then go through each column, one at a time. 01:04:42.620 |
And then we have to go through each pair in that row column, one at a time. So there's going to be 01:04:49.500 |
a loop, in a loop, in a loop. So here we're going to loop over each row. And here we're going to 01:04:59.420 |
loop over each column. And then here we're going to loop, so each column is C. And then here we're 01:05:04.380 |
going to loop over each column of A, which is going to be the same as the number of rows of B, 01:05:12.140 |
which we can see here, ac, 784, br, 784, they're the same. So it wouldn't matter whether we said 01:05:19.100 |
ac or br. So then our result for that row and that column, we have to add onto it the product of 01:05:32.940 |
ik in the first matrix by kj in the second matrix. So k is going up through those 784. And so we're 01:05:42.540 |
going to go across the columns and down, sorry, across the rows and down the columns. It's going 01:05:47.100 |
to go across the row whilst it goes down this column. So here is the world's most naive, slow, 01:05:56.220 |
uninteresting matrix multiplication. And if we run it, okay, it's done something. We have successfully, 01:06:07.580 |
apparently, hopefully successfully, multiplied the matrices M1 and M2. It's a little hard to 01:06:12.860 |
read this, I find, because punch cards used to be 80 columns wide. We still assume screens are 80 01:06:21.900 |
columns wide. Everything defaults to 80 wide, which is ridiculous. But you can easily change it. So if 01:06:28.620 |
you say set print options, you can choose your own line width. Oh, as you can see, well, we know it's 01:06:36.380 |
five by 10. We did it before. So if we change the line width, okay, that's much easier to read now. 01:06:41.260 |
We can see here are the five rows and here are the 10 columns for that matrix multiplication. 01:06:48.780 |
I tend to always put this at the top of my notebooks and you can do the same thing for NumPy as well. 01:06:53.980 |
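For reference, the setting looks something like this; the exact width is just a preference, and `torch.set_printoptions(linewidth=140)` is the PyTorch equivalent:

```python
import numpy as np

# Widen the default 80-column print width so wide matrices
# fit on one line in notebook output.
np.set_printoptions(linewidth=140)
```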
So what I like to do, this is really important, is when I'm working on code, particularly numeric 01:07:08.460 |
code, I like to do it all step by step in Jupyter. And then what I do is, once I've got it working, 01:07:15.820 |
is I copy all the cells that have implemented that and I paste them and then I select them 01:07:23.340 |
all and I hit shift M to merge. Get rid of anything that prints out stuff I don't need. 01:07:28.220 |
And then I put a header on the top, give it a function name, and then I select the whole lot 01:07:36.700 |
and I hit control or apple right square bracket and I've turned it into a function. But I still 01:07:43.020 |
keep the stuff above it so I can see all the step by step stuff for learning about it later. 01:07:48.220 |
And so that's what I've done here to create this function. 01:07:52.780 |
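The merged function he describes is, in essence, the naive triple loop. Here is a sketch recreated from the description, so details may differ from the notebook; NumPy arrays are used here for brevity where the lesson uses PyTorch tensors:

```python
import numpy as np

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br                      # inner dimensions must agree
    t = np.zeros((ar, bc))
    for i in range(ar):                  # each row of a
        for j in range(bc):              # each column of b
            for k in range(ac):          # walk along the row/column pair
                t[i, j] += a[i, k] * b[k, j]
    return t

m1 = np.random.randn(5, 784)   # five flattened MNIST-sized images
m2 = np.random.randn(784, 10)  # one weight column per digit class
res = matmul(m1, m2)
```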
And so this function does exactly the same things we just did. 01:07:57.500 |
And we can see how long it takes to run by using percent time. 01:08:01.900 |
And it took about half a second, which gosh, that's a long time to generate such a small 01:08:10.780 |
matrix. This is just to do five MNIST digits. So that's not going to be great. 01:08:18.780 |
We're going to have to speed that up. I'm actually quite surprised at how slow that is 01:08:24.300 |
because there's only 39,200. So we're, you know, if you look at how we've got a loop within a loop 01:08:30.540 |
within a loop, what's going wrong? A loop within a loop within a loop, it's doing 39,200 of these. 01:08:37.420 |
So Python, yeah, Python, when you're just doing Python, it is it is slow. 01:08:42.780 |
So we can't we can't do that. That's why we can't just write Python. 01:08:46.700 |
But there is something that kind of lets us write Python. 01:08:51.180 |
We could instead use Numba. Numba is a system that takes Python and turns it into 01:09:05.980 |
basically into machine code. And it's amazingly easy to do. You can basically take a function 01:09:13.260 |
and write @njit on top. And what it's going to do is it's going to 01:09:20.140 |
look the first time you call this function, it's going to compile it down to machine code 01:09:25.820 |
and it will run much more quickly. So what I've done here is I've taken the innermost loop. 01:09:34.700 |
So just looping through and adding up all these. 01:09:40.780 |
So start at zero, go through and add up all those just for two vectors and return it. 01:09:48.700 |
This is called a dot product in linear algebra. So we'll call it dot. 01:09:53.260 |
And so Numba only works with NumPy. It doesn't work with PyTorch. So we're just going to use 01:09:59.740 |
arrays instead of tensors for a moment. Now, have a look at this. If I try to 01:10:04.540 |
do a dot product of one, two, three and two, three, four, that's pretty easy to do. 01:10:09.820 |
It took a fifth of a second, which sounds terrible. But the reason it took a fifth of a second is 01:10:17.260 |
because that's actually how long it took to compile this and run it. Now that it's compiled, 01:10:21.900 |
the second time it just has to call it, it's now 21 microseconds. And so that's actually very fast. 01:10:31.820 |
So with Numba, we can basically make Python run at C speed. 01:10:38.060 |
So now the important thing to recognize is if I replace this loop in Python with a call to dot, 01:10:49.980 |
which is running in machine code, then we now have one, two loops running in Python, not three. 01:11:03.980 |
Well, first of all, let's make sure if I run it, 01:11:08.220 |
run that matmul, it should be close to my T1. T1 is what we got before, remember? 01:11:20.380 |
So when I'm refactoring or performance improving or whatever, I always like to put every step 01:11:26.780 |
in the notebook and then test. So this test close comes from fastcore.test. And it just checks that 01:11:33.420 |
two things are very similar. They might not be exactly the same because of little floating 01:11:37.340 |
point differences, which is fine. OK, so our matmul is working correctly, or at least it's doing the 01:11:42.220 |
same thing it did before. So if we now run it, it's taking 268 microseconds versus 448 milliseconds. 01:11:53.820 |
So it's running, you know, about 2000 times faster just by changing the innermost loop. 01:12:03.580 |
So really, all we've done is we've added @njit to make it 2000 times faster. 01:12:08.940 |
So Numba is well worth knowing about. It can make your Python code very, very fast. 01:12:16.540 |
OK, let's keep making it faster. So we're going to use stuff again, which kind of goes back to APL. 01:12:26.540 |
And a lot of people say that learning APL is a thing that's taught them more about 01:12:33.020 |
programming than anything else. So it's probably worth considering learning APL. 01:12:41.100 |
And let's just look at these various things. We've got a is 10, 6, 4. So remember at APL, 01:12:47.740 |
we don't say equals. Equals actually means equals, funnily enough. To say set to, 01:12:52.700 |
we use this arrow. And this is a list of 10, 6, 4. OK, and then b is 2, 8, 7. 01:13:06.620 |
OK, and we're going to add them up, a plus b. So what's going on here? 01:13:15.980 |
So it's really important that you can think of 01:13:19.500 |
a symbol like a as representing a tensor or an array. APL calls them arrays. 01:13:31.100 |
PyTorch calls them tensors. NumPy calls them arrays. They're the same thing. 01:13:35.580 |
So this is a single thing that contains a bunch of numbers. This is a single thing that contains 01:13:39.660 |
a bunch of numbers. This is an operation that applies to arrays or tensors. And what it does 01:13:45.420 |
is it works what's called element-wise. It takes each pair, 10 and 2, and adds them together. 01:13:50.460 |
Each pair, 6 and 8, add them together. This is element-wise addition. And Fred's asking in the 01:13:57.100 |
chat, how do you put in these symbols? If you just mouse over any of them, it will show you 01:14:04.060 |
how to write it. And the one you want is the one at the very 01:14:07.100 |
bottom, which is the one where it says prefix. Now, the prefix is the backtick character. 01:14:14.140 |
So here it's saying prefix hyphen gives us times. So type a backtick dash b is a times b, 01:14:25.580 |
for example. So yeah, they all have shortcut keys, which you learn pretty quickly, I find. 01:14:33.420 |
And there's a fairly consistent kind of system for those shortcut keys, too. 01:14:36.940 |
All right. So we can do the same thing in PyTorch. It's a little bit more verbose in PyTorch, 01:14:43.740 |
which is one reason I often like to do my mathematical fiddling around in APL. I can 01:14:48.540 |
often do it with less boilerplate, which means I can spend more time thinking. 01:14:54.460 |
You know, I can see everything on the screen at once. I don't have to spend as much time trying 01:14:58.300 |
to ignore the tensor, round brackets, square bracket dot comma, blah, blah, blah. 01:15:03.100 |
It's all cognitive load, which I'd rather ignore. But anyway, it does the same thing. 01:15:07.660 |
So I can say a plus b and it works exactly like APL. 01:15:10.940 |
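The same element-wise addition, shown here with NumPy arrays (PyTorch tensors behave identically):

```python
import numpy as np

a = np.array([10, 6, 4])
b = np.array([2, 8, 7])

# Element-wise addition: each pair is added independently,
# 10+2, 6+8, 4+7.
c = a + b
```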
So here's an interesting example. I can go a less than b dot float dot mean. 01:15:18.780 |
So let's try that one over here. A less than b. So this is a really important idea, 01:15:25.020 |
which I think was invented by Ken Iverson, the APL guy, which is the true and false 01:15:30.460 |
represented by zero and one. And because they're represented by zero and one, we can 01:15:37.100 |
do things to them. We can add them up and subtract them and so forth. It's a really important idea. 01:15:43.020 |
So in this case, I want to take the mean of them. And I'm going to tell you something amazing, 01:15:51.740 |
which is that in APL, there is no function called mean. Why not? That's because we can write 01:15:59.580 |
the mean function, which is, so that's four letters, mean, M-E-A-N. We can write the mean 01:16:05.820 |
function from scratch with four characters. I'll show you. Here's the whole mean function. 01:16:12.700 |
We're going to create a function called mean and the mean is equal to the sum of a list 01:16:20.940 |
divided by the count of a list. So this here is sum divided by count. 01:16:27.980 |
And so I've now defined a new function called mean, which calculates the mean. 01:16:33.980 |
Mean of a less than b. There we go. And so, you know, in practice, I'm not sure people would 01:16:41.020 |
even bother defining a function called mean because it's just as easy to actually write 01:16:45.100 |
its implementation in APL, in NumPy or whatever Python, it's going to take a lot more than four 01:16:52.620 |
letters to implement mean. So anyway, you know, it's a math notation. And so being a math notation, 01:16:58.060 |
we can do a lot with little, which I find helpful because I can see everything going on at once. 01:17:03.740 |
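The sum-divided-by-count idea is just as short outside APL. A minimal sketch with NumPy, using the same a and b as above:

```python
import numpy as np

a = np.array([10, 6, 4])
b = np.array([2, 8, 7])

# True/False become 1/0, so the mean of the comparison is the
# fraction of positions where a is less than b (2 out of 3 here).
frac = (a < b).mean()
```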
Anywho, OK, so that's how we do the same thing in PyTorch. And again, you can see that the less than 01:17:10.460 |
in both cases are operating element wise. OK, so a is less than b is saying ten is less than two, 01:17:15.900 |
six is less than eight, four is less than seven and gives us back each of those trues and falses 01:17:20.620 |
as zeros and ones. And according to the emoji on our YouTube chat, see if his head just exploded 01:17:26.460 |
as it should. This is why APL is, yeah, life changing. OK, let's now go up to higher ranks. 01:17:36.860 |
So this here is a rank one tensor. So a rank one tensor means it's a it's a list of things. 01:17:44.460 |
It's a vector. Whereas a rank two tensor is like a list of lists. They all have to be the 01:17:51.900 |
same length lists or it's like a rectangular bunch of numbers. And we call it in math, we call it a 01:17:56.540 |
matrix. So this is how we can create a tensor containing one, two, three, four, five, six, seven, 01:18:01.740 |
eight, nine. And you can see often what I like to do is I want to print out the thing I just created 01:18:09.660 |
after I created it. So two ways to do it. You can say, put an enter and then write M and that's 01:18:15.900 |
going to do that. Or if you want to put it all on the same line, that works too. You just use a 01:18:19.180 |
semicolon. Neither one's better than the other. They're just different. So we could do the same 01:18:26.140 |
thing in APL. Of course, in APL, it's going to be much easier. So we're going to define a matrix 01:18:32.540 |
called M, which is going to be a three by three tensor containing the numbers from one to nine. 01:18:43.580 |
Okay. And there we go. That's done it in APL. A three by three tensor containing the numbers 01:18:54.540 |
from one to nine. A lot of these ideas from APL you'll find have made their way into other 01:18:59.500 |
programming languages. For example, if you use Go, you might recognize this. This is the iota 01:19:04.700 |
character and Go uses the word iota. So they spell it out in a somewhat similar way. 01:19:11.180 |
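A minimal sketch of the two ways to build that matrix in PyTorch; `torch.arange` plays much the same role as APL's iota here:

```python
import torch

# Write out the 3x3 matrix of 1..9 by hand...
m = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

# ...or generate it, iota-style, and reshape to 3x3
m2 = torch.arange(1., 10.).reshape(3, 3)
```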
A lot of these ideas from APL have found themselves into math notation and other languages. 01:19:19.340 |
It's been around since the late 50s. Okay. So here's a bit of fun. 01:19:24.060 |
We're going to learn about a new thing that looks kind of crazy called Frobenius norm. 01:19:30.380 |
And we'll use that from time to time as we're doing generative modeling. 01:19:35.580 |
And here's the definition of a Frobenius norm. It's the sum over all of the rows and columns 01:19:43.900 |
of a matrix. And we're going to take each one and square it. We're going to add them up and 01:19:51.980 |
then we take the square root. And so to implement that in PyTorch is as simple as going 01:20:00.140 |
m times m dot sum dot square root. So this looks like a pretty complicated thing when you kind of 01:20:10.380 |
look at it at first. It looks like a lot of squiggly business. Or if you said this thing here, 01:20:14.540 |
you might be like, what on earth is that? Well, now, you know, it's just square sum square root. 01:20:25.420 |
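The square-sum-square-root recipe as a short PyTorch sketch, using the same 1..9 matrix:

```python
import torch

m = torch.arange(1., 10.).reshape(3, 3)

# Frobenius norm: square every element, sum them all, take the square root
fro = (m * m).sum().sqrt()
```

As a sanity check, this matches PyTorch's built-in `torch.norm`, which defaults to the Frobenius norm for matrices.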
So let's do, so in APL, we want the, okay, so we're going to create something called SF. 01:20:36.940 |
Now, it's interesting, APL does this a little bit differently. So dot sum by default in PyTorch sums 01:20:44.220 |
over everything. And if you want to sum over just one dimension, you have to pass in a dimension 01:20:48.540 |
keyword. For very good reasons, APL is the opposite. It just sums across rows or just down columns. 01:20:55.340 |
So actually, we have to say sum up the flattened out version of the matrix. And to say flattened 01:21:02.620 |
out, use comma. So here's sum up the flattened out version of the matrix. Okay, so that's our SF. 01:21:12.380 |
Oh, sorry. And the matrix is meant to be m times m. There we go. So there's the same thing. Sum up 01:21:24.220 |
the flattened out m by m matrix. And another interesting thing about APL is it always is 01:21:29.100 |
read right to left. There's no such thing as operator precedence, which makes life a lot easier. 01:21:34.780 |
Okay, and then we take the square root of that. There isn't a square root function. 01:21:42.780 |
So we have to do to the power of 0.5. And there we go. Same thing. All right, you get the idea. 01:21:51.020 |
Yes, a very interesting question here from Marabou. Are the bars for norm or absolute value? 01:21:58.620 |
And I like Siva's answer, which is the norm is the same as the absolute value for a scalar. 01:22:05.340 |
So in this case, you can think of it as absolute value. And it's kind of not needed because it's 01:22:10.380 |
being squared anyway. But yes, in this case, the norm, well, in every case for a scalar, 01:22:17.340 |
the norm is the absolute value, which is kind of a cute discovery when you realize it. 01:22:21.260 |
So thank you for pointing that out, Siva. All right. So this is just fiddling around a little 01:22:28.380 |
bit to kind of get a sense of how these things work. So really importantly, you can index into 01:22:36.780 |
a matrix and you'll say rows first and then columns. And if you say colon, it means all the 01:22:43.500 |
columns. So if I say row two, here it is: row two, all the columns. Sorry, this is row two 01:22:51.420 |
counting from zero (APL starts at one), all the columns, and that's going to be seven, eight, nine. 01:22:57.420 |
And you can see I often use comma to print out multiple things. And I don't have to say print 01:23:01.980 |
in Jupyter, it's kind of assumed. And so this is just a quick way of printing out the second row. 01:23:07.980 |
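The indexing being described, including the column indexing that comes next, can be sketched like so:

```python
import torch

m = torch.arange(1., 10.).reshape(3, 3)

row2  = m[2, :]   # row index 2 (third row, counting from zero) -> 7, 8, 9
row2b = m[2]      # trailing colon optional: exactly the same thing
col2  = m[:, 2]   # every row, column 2 -> 3, 6, 9
```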
And then here, every row, column two. So here is every row of column two. And here you can see three, 01:23:16.300 |
six, nine. So one thing very useful to recognize is that for tensors of higher rank than one, 01:23:30.460 |
such as a matrix, any trailing colons are optional. So you see this here, M2, that's the 01:23:37.900 |
same as M2 comma colon. It's really important to remember. Okay, so M2, you can see the result is 01:23:44.540 |
the same. So that means row two, every column. Okay, so now with all that in place, we've got 01:23:55.340 |
quite an easy way. We don't need Numba anymore. We can multiply, so we can get rid of that 01:24:03.260 |
innermost loop. So we're going to get rid of this loop, because this is just multiplying together all 01:24:09.660 |
the corresponding elements of a row of A with all 01:24:16.380 |
the corresponding elements of a column of B. And so we can just use an element-wise operation for that. 01:24:22.300 |
So here is the ith row of A, and here is the jth column of B. And so those are both, as we've seen, 01:24:35.980 |
just vectors, and therefore we can do an element-wise multiplication of them, 01:24:40.300 |
and then sum them up. And that's the same as a dot product. So that's handy. 01:24:46.860 |
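A sketch of this version of matmul, with the innermost loop replaced by an element-wise multiply and sum (the function name and shapes are assumptions, reconstructed from the narration):

```python
import torch

def matmul(a, b):
    # a[i, :] and b[:, j] are both vectors, so the element-wise product
    # summed up is exactly their dot product.
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c
```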
And so again, we'll do test close. Okay, it's the same. Great. And again, you'll see we kind of did 01:24:55.180 |
all of our experimenting first, right, to make sure that we understood how it all worked, 01:24:59.580 |
and then put it together. And then if we time it, 661 microseconds. Okay, so it's interesting. It's 01:25:06.940 |
actually slower than the Numba version, which really shows you how good Numba is, but it's certainly a hell of a lot 01:25:12.220 |
better than our 450 milliseconds. But we're using something that's kind of a lot more general now. 01:25:18.620 |
This is exactly the same as dot, as we've discussed. So we could just use torch dot, 01:25:27.260 |
torch dot dot, I suppose I should say. And if we run that, okay, a little faster. It's still, 01:25:34.300 |
interestingly, it's still slower than Numba, which is quite amazing, actually. 01:25:39.660 |
All right, so that one was not exactly a speed up, but it's kind of a bit more general, which is nice. 01:25:47.660 |
Now we're going to get something into something really fun, 01:25:54.780 |
which is broadcasting. And broadcasting is about what if you have arrays with different shapes. 01:26:00.540 |
So what's a shape? The shape is the number of rows, or the number of rows and columns, 01:26:06.140 |
or the number of, what would you say, faces, rows and columns, and so forth. So for example, 01:26:13.420 |
the shape of M is 3 by 3. So what happens if you multiply, or add, or do operations to tensors of 01:26:22.220 |
different shapes? Well, there's one very simple one, which is if you've got a rank one tensor, 01:26:29.180 |
the vector, then you can use any operation with a scalar, and it broadcasts that scalar 01:26:39.740 |
across the tensor. So a is greater than zero is exactly the same as saying a is greater than tensor 01:26:47.420 |
zero comma zero comma zero. So it's basically copying that across three times. Now it's not 01:26:58.860 |
literally making a copy in memory, but it's acting as if we had said that. And this is the most 01:27:03.820 |
simple version of broadcasting. Okay, it's broadcasting the zero across the ten, and the 01:27:09.900 |
six, and the negative four. And APL does exactly the same thing. a is less than five, so zero, zero, 01:27:22.620 |
one. So same idea. Okay. So we can do plus with a scalar, and we can do exactly the same thing with 01:27:41.580 |
higher ranks than one. So with two times a matrix, the two is just going to be broadcast 01:27:47.420 |
across all the rows and all the columns. Okay, now it gets interesting. So broadcasting dates back to 01:27:59.180 |
APL. But a really interesting idea is that we can broadcast not just scalars, but we can broadcast 01:28:05.820 |
vectors across matrices or broadcast any kind of lower ranked tensor across higher ranked tensors, 01:28:13.740 |
or even broadcast together two tensors of the same rank, but different shapes in a really 01:28:20.460 |
powerful way. And as I was exploring this (I love doing this kind of computer 01:28:27.340 |
archaeology), I was trying to find out where the hell this comes from. And it actually turns out 01:28:31.660 |
from this email message in 1995, that the idea actually comes from a language that I'd never 01:28:41.100 |
heard of called Yorick, which still apparently exists. Here's Yorick. And so Yorick talks about 01:28:51.740 |
broadcasting and conformability. So what happened is this very obscure language 01:29:01.740 |
has this very powerful idea. And NumPy has happily stolen the idea from Yorick that allows us to 01:29:11.820 |
broadcast together tensors that don't appear to match. So let me give an example. Here's a tensor 01:29:20.060 |
called C that's a vector. It's a rank one tensor, 10, 20, 30. And here's a tensor called M, which is 01:29:26.540 |
a matrix. We've seen this one before. And one of them is shape three, comma, three. The other is 01:29:32.860 |
shape three. And yet we can add them together. Now what's happened when we added it together? 01:29:41.420 |
Well, what's happened is 10, 20, 30 got added to one, two, three. And then 10, 20, 30 got added to 01:29:50.700 |
four, five, six. And then 10, 20, 30 got added to seven, eight, nine. And hopefully you can see 01:29:58.780 |
this looks quite familiar. Instead of broadcasting a scalar over a higher rank tensor, 01:30:05.340 |
this is broadcasting a vector across every row of a matrix. 01:30:15.260 |
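The vector-over-matrix broadcast just described, as a small runnable sketch:

```python
import torch

m = torch.arange(1., 10.).reshape(3, 3)
c = torch.tensor([10., 20., 30.])

# c is broadcast across every row of m: 10,20,30 added to 1,2,3,
# then to 4,5,6, then to 7,8,9
res = m + c
```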
And it works both ways. So we can say C plus M gives us exactly the same thing. And so let me 01:30:21.180 |
explain what's actually happening here. The trick is to know about this somewhat obscure method called 01:30:26.780 |
expand as. And what expand as does is create a new thing called T, which contains exactly the 01:30:33.420 |
same thing as C, but expanded or kind of copied over. So it has the same shape as M. So here's 01:30:40.700 |
what T looks like. Now T contains exactly the same thing as C does, but it's got three copies of it 01:30:47.500 |
now. And you can see we can definitely add T to M because they match shapes. Right? So we can say 01:30:56.380 |
M plus T. We know we can do M plus T because we've already learned that you can do element-wise 01:31:01.900 |
operations on two things that have matching shapes. Now, by the way, this thing T didn't actually 01:31:09.740 |
create three copies. Check this out. If we call T dot storage, it tells us what's actually in memory. 01:31:15.580 |
It actually just contains the numbers 10, 20, 30. But it does a really clever trick. It has a stride 01:31:23.020 |
of zero across the rows and a size of three comma three. And so what that means is that it acts as 01:31:30.700 |
if it's a three by three matrix. And each time it goes to the next row, it actually stays exactly 01:31:36.540 |
where it is. And this idea of strides is the trick which NumPy and PyTorch and so forth use 01:31:45.180 |
for all kinds of things where you basically can create very efficient ways to do things like 01:31:52.860 |
expanding or to kind of jump over things and stuff like that, you know, switch between columns and 01:31:58.540 |
rows, stuff like that. Anyway, the important thing here for us to recognize is that we didn't 01:32:03.100 |
actually make a copy. This is totally efficient and it's all going to be run in C code very fast. 01:32:07.740 |
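The `expand_as` trick can be checked directly: the expanded tensor behaves like a 3x3 matrix, but its row stride is zero, so no copy is made:

```python
import torch

m = torch.arange(1., 10.).reshape(3, 3)
c = torch.tensor([10., 20., 30.])

t = c.expand_as(m)   # acts like a 3x3 matrix of three 10,20,30 rows...
                     # ...but the row stride is 0: advancing to the "next
                     # row" stays on the same three numbers in memory
```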
So remember, this expand as is critical. This is the thing that will teach you to understand 01:32:14.140 |
how broadcasting works, which is really important for implementing deep learning algorithms or any 01:32:19.980 |
kind of linear algebra on any Python system, because the NumPy rules are used exactly the same 01:32:28.460 |
in JAX, in TensorFlow, in PyTorch and so forth. Now I'll show you a little trick, 01:32:36.540 |
which is going to be very important in a moment. If we take C, which remember is a vector containing 01:32:44.060 |
10, 20, 30, and we say dot unsqueeze zero, then it changes the shape from three to one comma three. 01:32:55.980 |
So it changes it from a vector of length three to a matrix of one row by three columns. This will 01:33:02.860 |
turn out to be very important in a moment. And you can see how it's printed. It's printed out with 01:33:06.540 |
two square brackets. Now I never use unsqueeze because I much prefer doing something more 01:33:12.220 |
flexible, which is if you index into an axis with a special value none, also known as np.newaxis. 01:33:20.300 |
It does exactly the same thing. It inserts a new axis here. So here we'll get exactly the same thing, 01:33:28.540 |
one row by all the columns, three columns. So this is exactly the same as saying unsqueeze. 01:33:35.820 |
So this inserts a new unit axis. This is a unit axis, a single row 01:33:45.100 |
in this dimension. And this does the same thing. So these are the same. So we could do the same 01:33:52.460 |
thing and say unsqueeze one, which means now we're going to unsqueeze into the first dimension. 01:33:59.820 |
So that means we now have three rows and one column. See the shape here? The shape is inserting 01:34:08.620 |
a unit axis in position one, three rows and one column. And so we can do exactly the same thing 01:34:16.940 |
here. Give us every row and a new unit axis in position one. Same thing. Okay. So those two are 01:34:25.020 |
exactly the same. So this is how we create a matrix with one row. This is how we create a 01:34:35.180 |
matrix with one column. None comma colon versus colon comma none or unsqueeze. 01:34:41.900 |
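Both ways of inserting a unit axis, side by side as a sketch:

```python
import torch

c = torch.tensor([10., 20., 30.])

row = c[None, :]   # same as c.unsqueeze(0): shape (1, 3), one row
col = c[:, None]   # same as c.unsqueeze(1): shape (3, 1), one column
```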
We don't have to say, as we've learned before, none comma colon, because do you remember? 01:34:51.740 |
Trailing colons are optional. So therefore just C None is also going to give you a row matrix, 01:34:59.660 |
one row matrix. This is a little trick here. If you say dot, dot, dot, that means all of the 01:35:06.940 |
dimensions. And so dot, dot, dot comma none will always insert a unit axis at the end, regardless 01:35:13.500 |
of what rank a tensor is. So, yeah, so none and NP new axis mean exactly the same thing. 01:35:21.020 |
NP new axis is actually a synonym for none. If you've ever used that, I always use none 01:35:29.260 |
because why not? It's short and simple. So here's something interesting. If we go C colon 01:35:34.860 |
comma None, so let's go and check out what C colon comma None looks like. C colon comma None 01:35:42.540 |
is a column. And if we say expand as M, which is three by three, then it's going to take that 01:35:51.580 |
10, 20, 30 column and replicate it 10, 20, 30, 10, 20, 30, 10, 20, 30. So we could add. So remember, 01:36:00.860 |
like, remember, as I explained, when you say matrix plus C colon comma None, 01:36:10.620 |
it's basically going to do this dot expand as for you. So if I want to add this matrix here to M, 01:36:20.380 |
I don't need to say dot expand as; I just write this: M plus C colon comma None. 01:36:27.020 |
And so this is just like doing M plus C. But now rather than adding the vector to each row, 01:36:36.060 |
it's adding the vector to each column: 10, 20, 30; 10, 20, 30; 10, 20, 30. 01:36:45.500 |
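The row versus column broadcast, as a sketch:

```python
import torch

m = torch.arange(1., 10.).reshape(3, 3)
c = torch.tensor([10., 20., 30.])

by_row = m + c            # 10, 20, 30 added to every row
by_col = m + c[:, None]   # 10, 20, 30 added down every column
```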
So that's a really simple way that we now get kind of for free thanks to this really nifty notation, 01:36:51.420 |
this nifty approach that came from Yorick. So here you can see M plus C None 01:36:58.460 |
comma colon is adding 10, 20, 30 to each row. And M plus C colon comma None is adding 10, 01:37:06.060 |
20, 30 to each column. All right, so that's the basic like hand wavy version. So let's 01:37:15.180 |
look at like what are the rules? How does it work? Okay, so C None comma colon is one by three. 01:37:23.500 |
C colon comma None is three by one. What happens if we multiply C None comma colon 01:37:31.980 |
by C colon comma None? Well, what's it going to do? Think about it, 01:37:37.820 |
which you definitely should because thinking is very helpful. 01:37:46.860 |
Okay, so what happens if we go C None comma colon times C colon comma None? So what it's going 01:37:55.020 |
to have to do is it's going to have to take this 10, 20, 30 column vector or three by one matrix 01:38:04.460 |
and it's going to have to make it work across each of these rows. So what it does is expands it to be 01:38:12.300 |
10, 20, 30, 10, 20, 30, 10, 20, 30. So it's going to do it just like this. And then it's going to 01:38:18.940 |
do the same thing for C None comma colon. So that's going to become three rows of 10, 20, 30. So 01:38:26.460 |
we're going to end up with three rows of 10, 20, 30 times three columns of 10, 20, 30, 01:38:33.660 |
which gives us our answer. And so this is going to do an outer product. So it's very nifty that 01:38:43.020 |
you can actually do an outer product without any special, you know, functions or anything, 01:38:50.620 |
just using broadcasting. And it's not just outer products, you can do outer Boolean operations. 01:38:56.060 |
And this kind of stuff comes up all the time, right? Now, remember, you don't need the comma 01:39:00.940 |
colon, so get rid of it. So this is showing us all the places where one is greater than the other. It's kind of 01:39:09.820 |
an outer Boolean, if you want to call it that. So this is super nifty and you can do all 01:39:16.540 |
kinds of tricks with this because it runs very, very fast. So this is going to be accelerated in C. 01:39:21.340 |
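The outer product and the "outer Boolean" both fall out of broadcasting a row against a column, as in this sketch:

```python
import torch

c = torch.tensor([10., 20., 30.])

outer = c[None, :] * c[:, None]   # (1,3) * (3,1) broadcasts to (3,3)
gt = c[None] > c[:, None]         # "outer Boolean"; trailing colon dropped
```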
So here are the rules. Okay. When you operate on two arrays or tensors, NumPy and PyTorch will 01:39:29.180 |
compare their shapes. Okay. So remember the shape, this is a shape. You can tell it's a shape because 01:39:34.620 |
we said shape and it goes from right to left. So that's the trailing dimensions. And it checks 01:39:42.300 |
whether the dimensions are compatible. Now they're compatible if they're equal, right? So for example, 01:39:48.460 |
if we say M times M, then those two shapes are compatible because in each case, it's just going 01:40:04.220 |
to be three, right? So they're going to be equal. So if the shape in that dimension is equal, 01:40:11.580 |
they're compatible, or if one of them is one and if one of them is one, then that dimension is 01:40:18.540 |
broadcast to make it the same size as the other. So that's why the outer product worked. We had 01:40:28.540 |
a one by three times a three by one. And so this one got copied three times to make it this long. 01:40:37.740 |
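The compatibility rule just stated can be written out directly; the helper function here is my own sketch, not from the lecture, and it also covers the per-channel image-scaling case the lecture turns to next (missing dimensions act like 1):

```python
import torch

def broadcastable(s1, s2):
    # Compare shapes right to left; a missing dimension acts like 1.
    # Two dimensions are compatible if they're equal or one of them is 1.
    for d1, d2 in zip(reversed(tuple(s1)), reversed(tuple(s2))):
        if d1 != d2 and d1 != 1 and d2 != 1:
            return False
    return True

# e.g. per-channel scaling of an RGB image: (256, 256, 3) against (3,)
img = torch.rand(256, 256, 3)
scale = torch.tensor([0.5, 1.0, 2.0])
scaled = img * scale   # the 3 values broadcast over all 256x256 pixels
```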
And this one got copied three times to make it this long. Okay. So those are the rules. So the 01:40:46.860 |
arrays don't have to have the same number of dimensions. So this is an example that comes up 01:40:51.900 |
all the time. Let's say you've got a 256 by 256 by three array or tensor of RGB values. So you've got 01:40:57.500 |
an image, in other words, a color image. And you want to normalize it. So you want to scale each 01:41:03.740 |
color in the image by a different value. So this is how we normalize colors. So one way is you could 01:41:14.220 |
multiply or divide or whatever, multiply the image by a one-dimensional array with three values. 01:41:20.060 |
So you've got a 1D array. So that's just three. Okay. And then the image is 256 by 256 by three. 01:41:30.940 |
And we go right to left and we check, are they the same? And we say, yes, they are. 01:41:35.180 |
And then we keep going left and we say, are they the same? And if it's missing, we act as if it's 01:41:41.660 |
one. And as we keep going, if a dimension is missing, we act as if it's one. So this is going to be the 01:41:46.940 |
same as doing one by one by three. And so this is going to be broadcast. This three, three 01:41:52.940 |
elements will be broadcast over all 256 by 256 pixels. So this is a super fast 01:42:00.540 |
and convenient and nice way of normalizing image data with a single expression. And this is exactly 01:42:06.620 |
how we do it in the fast.ai library. In fact, so we can use this to dramatically speed up our 01:42:15.020 |
matrix multiplication. Let's just grab a single digit just for simplicity. And I really like 01:42:21.340 |
doing this in Jupyter notebooks. And if you, if you build Jupyter notebooks to explain stuff that 01:42:26.780 |
you've learned in this course or ways that you can apply it, consider doing this for your readers, 01:42:31.020 |
but add a lot more prose. I haven't added prose here because I want to use my voice. 01:42:36.460 |
If I was, for example, in our book that we published, it's all written in notebooks and 01:42:42.700 |
there's a lot more prose, obviously. But like really, I like to show every example all along 01:42:47.740 |
the way using simple as possible. So let's just grab a single digit. So here's the first digit. 01:42:54.060 |
So its shape is, it's a 784 long vector. And remember that our weight matrix is 784 by 10. 01:43:02.060 |
So if we say digit colon comma None dot shape, then that is a 784 by 1 matrix, a single column. So there's 01:43:18.460 |
our matrix. And so if we then take that 784 by 1 and expand as M2, it's going to be the same 01:43:30.060 |
shape as our weight matrix. So it's copied our image data for that digit across all of the 10 01:43:42.060 |
vectors representing the 10 linear projections we're doing for our linear model. And so that 01:43:50.620 |
means that we can take the digit colon comma None, so 784 by 1, and multiply it by the weights. 01:43:57.020 |
And so that's going to get us back 784 by 10. And so what it's doing, remember, is it's basically 01:44:03.820 |
looping through each of these 10 784 long vectors. And for each one of them, it's multiplying it by 01:44:13.900 |
this digit. So that's exactly what we want to do in our matrix multiplication. So originally, 01:44:23.340 |
we had, well not originally, most recently I should say, we had this dot product where we were 01:44:31.740 |
actually looping over j, which was the columns of b. So we don't have to do that anymore, 01:44:40.540 |
because we can do it all at once by doing exactly what we just did. So we can take the i-th row 01:44:48.780 |
and all the columns and add an axis to the end. And then just like we did here, 01:45:01.100 |
multiply it by b. And then dot sum. And so that is, again, exactly the same thing. That is another 01:45:10.780 |
matrix multiplication, doing it using broadcasting. Now this is like 01:45:15.660 |
tricky to get your head around. And so if you haven't done this kind of broadcasting before, 01:45:24.220 |
it's a really good time to pause the video and look carefully at each of these four cells before 01:45:31.820 |
and understand, what did I do there? Why did I do it? What am I showing you? And then experiment 01:45:39.260 |
with trying it yourself. And remember that we started with M1 0, right? So just like we have here a i, 01:45:48.220 |
okay? So that's why we've got i comma colon comma None, because this digit is actually M1 0. 01:45:55.740 |
This is like M1 0 colon comma None. So this line is doing exactly the same thing as this here, 01:46:04.700 |
plus the sum. So let's check if this matmul is the same as it used to be, yet it's still working. 01:46:12.780 |
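A sketch of the broadcast version of matmul being described; the function name and shapes are assumptions reconstructed from the narration:

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # a[i, :, None] has shape (ac, 1); broadcasting it against b's
        # (br, bc) multiplies row i of a down every column of b at once,
        # and summing over dim 0 gives the whole of row i of the result.
        c[i] = (a[i, :, None] * b).sum(dim=0)
    return c
```

This removes the loop over j entirely, which is where the big speed-up over the previous version comes from.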
And the speed of it, okay, not bad. So 137 microseconds. So we've now gone from a time 01:46:22.460 |
from 500 milliseconds to about 0.1 milliseconds. Funnily enough on my, oh, actually, now I think 01:46:28.620 |
about it. My MacBook Air is an M2, whereas this Mac Mini is an M1. So that's a little bit slower. 01:46:33.500 |
So my Air was a bit faster than 0.1 milliseconds. So overall, we've got about a 5,000 times 01:46:40.860 |
speed improvement. So that is pretty exciting. And since it's so fast now, there's no need to 01:46:48.700 |
use a mini batch anymore. If you remember, we used a mini batch of, where is it? Of five images. 01:47:00.540 |
But now we can actually use the whole data set because it's so fast. So now we can do the whole 01:47:05.020 |
data set. There it is. We've now got 50,000 by 10, which is what we want. And so it's taking us only 01:47:15.580 |
656 milliseconds now to do the whole data set. So this is actually getting to a point now where we 01:47:20.860 |
could start to create and train some simple models in a reasonable amount of time. So that's good 01:47:26.220 |
news. All right. I think that's probably a good time to take a break. We don't have too much more 01:47:35.100 |
of this to go, but I don't want to keep you guys up too late. So hopefully you learned something 01:47:42.460 |
interesting about broadcasting today. I cannot overemphasize how widely useful this is in 01:47:51.900 |
all deep learning and machine learning code. It comes up all the time. It's basically our 01:47:57.100 |
number one, most critical kind of foundational operation. So yeah, take your time practicing 01:48:05.820 |
it and also good luck with your diffusion homework from the first half of the lesson. 01:48:11.980 |
Thanks for joining us, and I'll see you next time.