
Lesson 11 2022: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Introduction
0:20 Showing student’s work
13:03 Workflow on reading an academic paper
16:20 Read DiffEdit paper
26:27 Understanding the equations in the “Background” section
46:10 3 steps of DiffEdit
51:42 Homework
59:15 Matrix multiplication from scratch
68:47 Speed improvement with Numba library
79:25 Frobenius norm
85:54 Broadcasting with scalars and matrices
99:22 Broadcasting rules
102:10 Matrix multiplication with broadcasting

Whisper Transcript

00:00:00.000 | Hi everybody, welcome to lesson 11.
00:00:04.680 | This is the third lesson in part two, depending on how you count things.
00:00:08.960 | There's been a lesson A and lesson B, it's kind of the fifth lesson in part two, I don't
00:00:12.680 | know what it is.
00:00:13.680 | So we'll just stick to calling it lesson 11 and avoid getting too confused.
00:00:17.080 | I'm already confused.
00:00:20.120 | My goodness, I've got so much stuff to show you.
00:00:21.920 | I'm only going to show you a tiny fraction of the cool stuff that's been happening on
00:00:25.040 | the forum this week, but it's been amazing.
00:00:31.000 | I'm going to start by sharing this beautiful video from John Robinson, Robinson, I should
00:00:37.280 | say, and I've never seen anything like this before.
00:00:42.540 | As you can see, it's very stable and it's really showing this beautiful movement between
00:00:49.120 | seasons.
00:00:52.320 | So what I did on the forum was I said to folks, "Hey, you should try interpolating between
00:00:57.960 | prompts," which is what John did.
00:00:59.840 | And I also said, "You should try using the last image of the previous prompt interpolation
00:01:09.600 | as the initial image for the next prompt."
00:01:12.800 | And anyway, here it is, it came out beautifully, John was the first to get that working, so
00:01:17.960 | I was very excited about that.
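For readers who want to try the same experiment, here is a minimal sketch of interpolating between two prompt embeddings, one per video frame. The function name and the choice of plain linear interpolation are assumptions of mine; John's actual code isn't shown here.

```python
import torch

def interpolate_prompts(emb_a: torch.Tensor, emb_b: torch.Tensor, n_frames: int):
    """Blend two prompt embeddings linearly, producing one embedding per frame."""
    weights = torch.linspace(0.0, 1.0, n_frames)
    # Each frame's embedding is a weighted mix of the two prompts
    return [torch.lerp(emb_a, emb_b, w.item()) for w in weights]
```

Each returned embedding would then be fed to the diffusion sampler to render one frame, optionally seeding each segment with the last image of the previous one as described above.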
00:01:24.560 | And the second one I wanted to show you is this really amazing work from Seb Derhi, who,
00:01:36.880 | Sebastian, who did something that I'd been thinking about as well.
00:01:42.480 | I'm really thrilled that he also thought about this, which was he noticed that this update
00:01:47.800 | we do, unconditional embeddings plus guidance times text embeddings minus unconditional
00:01:57.360 | embeddings, has a bit of a problem, which is that it gets big.
00:02:06.680 | To show you what I mean by it gets big is like, imagine that we've got a couple of vectors
00:02:17.000 | on this chart here.
00:02:26.660 | And so we've got, let's see, so we've got, that's just, okay, so we've got the original
00:02:31.720 | unconditional piece here, so we've got U. So let's say this is U. And then we add to that
00:02:39.420 | some amount of T minus U. So if we've got like T, let's say it's huge, right?
00:02:52.440 | And we've got U again.
00:02:56.140 | Then the difference between those is the vector which goes here, right?
00:03:05.080 | Now you can see here that if there's a big difference between T and U, then the eventual
00:03:11.440 | update which actually happens is, oopsie daisy, I thought that was going to be an arrow.
00:03:21.160 | Let's try that again.
00:03:23.200 | The eventual update which happens is far bigger than the original update.
00:03:29.720 | And so it jumps too far.
00:03:35.780 | So this idea is basically to say, well, let's make it so that the update is no longer than
00:03:43.120 | the original unconditioned update would have been.
00:03:48.120 | And we're going to be talking more about norms later, but basically we scale it by the ratio
00:03:53.920 | of the norms.
00:03:56.000 | And what happens is we start with this astronaut and we move to this astronaut.
00:04:07.540 | And it's kind of, it's a subtle change, but you can see there's a lot more before, after,
00:04:12.960 | before, after, a lot more texture in the background.
00:04:18.360 | And like on the Earth, there's a lot more detail before, after, you see that?
00:04:24.780 | And even little things like before, the bridle, kind of the reins, whatever, were pretty flimsy.
00:04:29.800 | Now they look quite proper.
00:04:32.100 | So it's made quite a big difference just to kind of get this scaling correct.
00:04:39.660 | So there's a couple of other things that Sebastian tried, which I'll explain in a moment, but
00:04:45.880 | you can see how they, some of them actually resulted in changing the image.
00:04:52.420 | And this one's actually important because the poor horse used to be missing a leg and
00:04:56.720 | now it's not missing a leg, so that's good.
00:05:00.240 | And so here's the detailed one with its extra leg.
00:05:03.120 | So how did he do this?
00:05:04.840 | Well, so what he did was he started with this unconditioned prompt plus the guidance times
00:05:11.240 | the difference between the conditional and unconditioned.
00:05:15.000 | And then as we discussed, the next version, well, actually the next version we then saw
00:05:23.880 | is to basically just take that prediction and scale it according to the difference in
00:05:31.400 | the lengths.
00:05:32.400 | So the norms is basically the length of the vectors.
00:05:35.640 | And so this is the second one I did in lesson nine, you'll see it's gone from here.
00:05:39.520 | So when we go from 1a to 1b, you can see here, it's got, look at this, this boot's gone from
00:05:46.520 | nothing to having texture.
00:05:48.880 | This whatever the hell this thing is, suddenly he's got texture.
00:05:52.320 | And look, we've now got proper stars in the sky.
00:05:55.700 | It's made a really big difference.
00:05:57.560 | And then the second change is not just to rescale the whole prediction, but to rescale
00:06:03.920 | the update.
00:06:06.960 | When we rescale the update, it actually not surprisingly changes the image entirely because
00:06:11.800 | we're now changing the direction it goes.
00:06:15.080 | And so I don't know, is this better than this?
00:06:17.680 | I mean, maybe, maybe not, but I think so, particularly because this was the difference
00:06:23.240 | that added the correct fourth leg to the horse before.
00:06:27.300 | And then we can do both.
00:06:28.480 | We can rescale the difference and then rescale the result.
00:06:31.800 | And then we get the best of both worlds, as you can see, big difference.
00:06:35.720 | We get a nice background.
00:06:37.560 | This weird thing on his back's actually become an arm.
00:06:42.600 | That's not what a foot looks like.
00:06:44.620 | That is what a foot looks like.
00:06:46.840 | So these little details make a big difference, as you can see.
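The three variants described above — plain classifier-free guidance, rescaling the whole prediction, and rescaling the update itself — can be sketched as follows. This is a reconstruction from the description, not Sebastian's actual code, and the argument names are my own.

```python
import torch

def cfg_pred(u, t, g=7.5, rescale_diff=False, rescale_pred=False):
    """Classifier-free guidance: u is the unconditional noise prediction,
    t the text-conditional one, g the guidance scale."""
    diff = t - u
    if rescale_diff:
        # Rescale the update so it is no longer than the unconditional prediction
        diff = diff * (torch.norm(u) / torch.norm(diff))
    pred = u + g * diff
    if rescale_pred:
        # Rescale the whole prediction back to the unconditional prediction's length
        pred = pred * (torch.norm(u) / torch.norm(pred))
    return pred
```

Passing `rescale_diff=True, rescale_pred=True` corresponds to the "best of both worlds" version that does both rescalings.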
00:06:53.640 | So this is a really cool, or two really cool new things.
00:07:01.360 | New things tend to have wrinkles, though.
00:07:04.320 | Problem number one is after I shared Sebastian's approach on Twitter, Ben Poole, who's at Google
00:07:13.280 | Brain, I think, if I remember correctly, pointed out that this already exists.
00:07:17.480 | He thinks it's the same as what's shown in this paper, which is a diffusion model for
00:07:22.180 | text-to-speech.
00:07:23.440 | I haven't read the paper yet to check whether it's got all the different options or whether
00:07:27.080 | it's checked them all out like this.
00:07:29.520 | So maybe this is reinventing something that already existed and putting it into a new
00:07:34.640 | field, which would still be interesting.
00:07:36.400 | Anyway, so hopefully, folks, on the forum, you can help figure out whether this paper
00:07:41.320 | is actually showing the same thing or not.
00:07:45.800 | And then the other interesting thing was John Robinson got back in touch on the forum and
00:07:50.360 | said, "Oh, actually, that tree video doesn't actually do what we think it does at all.
00:07:57.880 | There's a bug in his code, and despite the bug, it accidentally worked really well."
00:08:02.680 | So now we're in this interesting question of trying to figure out, "Oh, how did he create
00:08:07.200 | such a beautiful video by mistake?"
00:08:10.080 | And okay, so reverse engineering exactly what the bug did, and then figuring out how to
00:08:14.640 | do that more intentionally.
00:08:16.560 | And this is great, right?
00:08:17.760 | It's really good to have a lot of people working on something, and the bugs often, yeah, they
00:08:25.800 | tell us about new ideas.
00:08:28.240 | So that's very interesting.
00:08:30.360 | So watch this space where we find out what John actually did and how it worked so well.
00:08:38.600 | And then something that I just saw like two hours ago in the forum, which I had never
00:08:43.000 | thought of before, but I thought of something a little bit similar.
00:08:47.640 | Rakhil Prashanth said like, "Well, what if we took this?"
00:08:50.720 | So as you can see, all the students are really bouncing ideas off each other.
00:08:53.400 | It's like, "Oh, it's interesting.
00:08:54.760 | We're doing different things with a guidance scale.
00:08:57.880 | What if we take the guidance scale, and rather than keeping it at 7.5 all the time, let's
00:09:03.120 | reduce it."
00:09:04.200 | And this is a little bit similar to something I suggested to John a few weeks ago, where
00:09:08.680 | I said he was doing some stuff with like modifying gradients based on additional loss
00:09:14.500 | functions.
00:09:15.500 | And I said to him, "Maybe you should just use them like occasionally at the start."
00:09:18.980 | Because I think the key thing is once the model kind of knows roughly what image it's
00:09:22.880 | trying to draw, even if it's noisy, you can let it do its thing.
00:09:28.280 | And this is exactly what's happening here is Rakhil's idea is to say, "Well, let's decrease
00:09:34.120 | the guidance scale."
00:09:35.120 | So at the end, it's basically zero.
00:09:37.040 | And so once it's kind of going in the right direction, we let it do its thing.
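One simple way to implement a decreasing guidance scale is a linear decay across the sampling steps, so it ends at roughly zero. The linear shape is an assumption of mine; the exact schedule Rakhil used isn't shown in the lesson.

```python
import torch

def decaying_guidance(n_steps, g_start=7.5, g_end=0.0):
    """Per-step guidance scale, falling linearly from g_start to g_end."""
    return torch.linspace(g_start, g_end, n_steps)

# In the sampling loop one would use g[i] in place of a fixed 7.5:
#   pred = u + g[i] * (t - u)
```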
00:09:41.520 | So this little doggy is with the normal 7.5 guidance scale.
00:09:47.200 | Now have a look, for example, its eye here.
00:09:49.720 | It's pretty uninteresting, pretty flat.
00:09:52.960 | And if I go to the next one, as you can see now, actually look at the eye.
00:09:57.540 | That's a proper eye before, totally glassy, black, now proper eye.
00:10:03.320 | Or like look at all this fur, very textured, previously very out of focus.
00:10:11.520 | So this is again a new technique.
00:10:17.960 | So I love this.
00:10:22.160 | You folks are trying things out, and some things are working, and some things not working,
00:10:28.480 | and that's all good.
00:10:30.760 | I kind of feel like you're going to have to slow down because I'm having trouble keeping
00:10:33.480 | up with you all.
00:10:34.480 | But apart from that, this is great.
00:10:38.440 | Good work.
00:10:40.280 | I also wanted to mention on a different theme to check out Alex's notes on the lesson because
00:10:51.680 | I thought he's done a fantastic job of showing like how to study a lesson.
00:10:58.080 | And so what Alex did, for example, was he made a list in his notes of all the different
00:11:03.120 | steps we did as we started from the foundations.
00:11:07.760 | What is the library that it comes from?
00:11:10.480 | Links to the documentation.
00:11:12.880 | And I know that Alex's background actually is history, not computer science.
00:11:20.520 | And so for somebody moving into a different field like this, this is a great idea, particularly
00:11:25.160 | to be able to look at like, OK, what are all the things that I'm going to have to learn
00:11:29.280 | and read about?
00:11:32.400 | And then he did something which we always recommend, which is to try the lesson on
00:11:37.160 | a new data set.
00:11:38.720 | And he very sensibly picked out the fashion MNIST data set, which is something we'll
00:11:42.200 | be using a lot in this course because it's a lot like MNIST.
00:11:46.920 | And it's just different enough to be interesting.
00:11:49.920 | And so he described in his post or his notes how he went about doing that.
00:11:55.760 | And then something else I thought was interesting in his notes at the very end was he just jotted
00:11:59.840 | down my tips.
00:12:02.680 | It's very easy when I throw a tip out there to think, oh, that's interesting.
00:12:07.320 | That's good to know.
00:12:08.440 | And then it can disappear.
00:12:11.040 | So here's a good way to make sure you don't forget about all the little tricks.
00:12:17.240 | And I think I've put those notes in the forum wiki so you can check them out if you'd like
00:12:24.240 | to learn from them as well.
00:12:25.240 | So I think this is a great role model.
00:12:27.040 | Good job, Alex.
00:12:30.000 | OK, so during the week, Jono taught us about a new paper that had just come out called
00:12:46.040 | DiffEdit, and he told us he thought this was an interesting paper.
00:12:52.680 | And it came out during the week and I thought it might be good practice for us to try reading
00:12:58.980 | this paper together.
00:13:01.740 | So let's do that.
00:13:03.580 | So here's the paper, DiffEdit.
00:13:06.600 | And you'll find that probably the majority of papers that you come across in deep learning
00:13:14.320 | will take you to archive.
00:13:17.400 | Archive is a preprint server.
00:13:19.720 | So these are models, these are papers that have not been peer reviewed.
00:13:26.060 | I would say in our field we don't generally or I certainly don't generally care about
00:13:32.440 | that at all because we have code, we can try it, we can see things whether it works or not.
00:13:38.920 | You know, we tend to be very, you know, most papers are very transparent about here's what
00:13:42.680 | we did and how we did it and you can replicate it.
00:13:46.000 | And it gets a huge amount of peer review on Twitter.
00:13:50.020 | So if there's a problem generally within 24 hours, somebody has pointed it out.
00:13:55.060 | Now we use archive a lot and if you wait until it's been peer reviewed, you know, you'll
00:13:59.260 | be way out of date because this field is moving so quickly.
00:14:03.480 | So here it is on archive and we can read it by clicking on the PDF button.
00:14:08.360 | I don't do that, instead I click on this little button up here, which is the Save to Zotero
00:14:15.160 | button.
00:14:16.160 | So I figured I'd show you like my preferred workflows.
00:14:19.360 | You don't have to do the same thing, there are different workflows, but here's one that
00:14:22.440 | I find works very well, which is Zotero is a piece of free software that you can download
00:14:28.780 | for Mac, Windows, Linux and install a Chrome connector.
00:14:32.700 | Oh, Tanishka is saying the button is covered.
00:14:36.020 | All right, so in my taskbar, I have a button that I can click that says Save to Zotero.
00:14:40.900 | Sorry, not taskbar, Chrome menu bar.
00:14:44.300 | And when I click it, I'll show you what happens.
00:14:46.180 | So after I've downloaded this, the paper will automatically appear here in this software,
00:14:56.220 | which is Zotero.
00:14:58.460 | And so here it is, DiffEdit.
00:15:04.540 | And you can see it's told us, it's got here the abstract, the authors, where it came from.
00:15:12.660 | And so later on, I can go and like, if I want to check some detail, I can go back and see
00:15:16.460 | the URL, I can click on it, pops up.
00:15:20.260 | And so in this case, what I'm going to do is I'm going to double click on it.
00:15:24.620 | And that brings up the paper.
00:15:26.980 | Now, the reason I like to read my papers in Zotero is that I can, you know, annotate them,
00:15:35.780 | edit them, tag them, put them in folders and so forth, and also add them to my kind of
00:15:42.140 | reading list directly from my web browser.
00:15:45.180 | So as you can see, you know, I've started this Fast Diffusion folder, which is actually
00:15:50.760 | a group library, which I share with the other folks working on this Fast Diffusion project
00:15:56.460 | that we're all doing together.
00:15:58.020 | And so we can all see the same paper library.
00:16:03.080 | So Maribou on YouTube chat is asking, is this better than Mendeley?
00:16:07.780 | Yeah, I used to use Mendeley and it's kind of gone downhill.
00:16:11.380 | I think Zotero is far, far better, but they're both very similar.
00:16:15.580 | Okay, so when you double click on it, it opens up and here is a paper.
00:16:23.220 | So reading a paper is always extremely intimidating.
00:16:32.980 | And so you just have to do it anyway and you have to realize that your goal is not to understand
00:16:38.740 | every word.
00:16:39.980 | Your goal is to understand the basic idea well enough that, for example, when you look
00:16:46.500 | at the code, hopefully it comes with code, most things do, that you'll be able to kind
00:16:51.100 | of see how the code matches to it and that you could try writing your own code to implement
00:16:56.060 | parts of it yourself.
00:16:58.340 | So over on the left, you can open up the sidebar here.
00:17:01.500 | So I generally open up the table of contents and get a bit of a sense of, okay, so there's
00:17:07.300 | some experimental results, there's some theoretical results, introduction, related work, okay,
00:17:16.100 | tells us about this new diff edit thing, some experiments, okay.
00:17:20.200 | So that's a pretty standard approach that you would see in papers.
00:17:26.500 | So I would always start with the abstract, okay.
00:17:28.940 | So what's it saying this does?
00:17:32.260 | So generally it's going to be some background sentence or two about how interesting this
00:17:35.860 | field is.
00:17:36.860 | It's just saying, wow, image generation is cool, which is fine.
00:17:39.300 | And then they're going to tell us what they're going to do, which is they're going to create
00:17:41.500 | something called diff edit.
00:17:48.940 | And so this is a, what is it for?
00:17:50.860 | It's going to use text condition diffusion models.
00:17:53.640 | So we know what those are now.
00:17:54.820 | That's what we've been using.
00:17:55.980 | That's where we type in some text and get back an image of that that matches the text.
00:18:01.300 | But this is going to be different.
00:18:02.380 | It's the task of semantic image editing.
00:18:04.580 | Okay.
00:18:05.580 | We don't know what that is yet.
00:18:06.580 | So let's put that aside and think, okay, let's make sure we understand that later.
00:18:10.880 | The goal is to edit an image based on a text query.
00:18:13.580 | Oh, okay.
00:18:14.880 | So we're going to edit an image based on text.
00:18:16.580 | How on earth would you do that?
00:18:17.940 | Ah, they're going to tell us right away what this is.
00:18:21.100 | Semantic image editing.
00:18:22.140 | It's an extension of image generation with an additional constraint, which is the generated
00:18:27.580 | image should be as similar as possible to the given input.
00:18:30.540 | And so generally, as they've done here, there's going to be a picture that shows us what's
00:18:36.100 | going on.
00:18:37.340 | And so in this picture, you can see here an example, here's an input image.
00:18:42.380 | And originally it was attached to a caption, a bowl of fruits.
00:18:46.820 | Okay.
00:18:47.820 | We want to change this into a bowl of pairs.
00:18:50.380 | So we type a bowl of pairs and it generates, oh, a bowl of pairs, or we could change it
00:18:59.580 | from a bowl of fruit to a basket of fruits and oh, it's become a basket of fruits.
00:19:04.700 | Okay.
00:19:05.700 | So I think I get the idea, right?
00:19:09.180 | What it's saying is that we can edit an image by typing what we want that image to represent.
00:19:15.900 | So this actually looks a lot like the paper that we looked at last week.
00:19:21.780 | So that's cool.
00:19:26.460 | So the abstract says that currently, so I guess there are current ways of doing this,
00:19:30.820 | but they require you to provide a mask.
00:19:32.820 | That means you have to basically draw the area you're replacing.
00:19:35.420 | Okay.
00:19:36.420 | So that sounds really annoying, but our main contribution.
00:19:39.500 | So what this paper does is we automatically generate the mask.
00:19:43.060 | So they simply just type in the new query and get the new image.
00:19:46.260 | So that sounds actually really impressive.
00:19:48.460 | So if you read the abstract and you think, um, I don't care about doing that, then you
00:19:54.100 | can skip the paper, you know, um, or, or look at the results.
00:19:59.900 | And if the results don't look impressive, then just skip the paper.
00:20:03.540 | So that's, that's kind of your first point where we can be like, okay, we're, we're done.
00:20:07.220 | But in this case, this sounds great.
00:20:09.300 | The results look amazing. So I think we should keep going.
00:20:13.220 | Um, okay.
00:20:14.220 | They achieve state-of-the-art editing performance, of course.
00:20:16.820 | Fine.
00:20:17.820 | And we try some, right, whatever.
00:20:20.060 | Okay.
00:20:21.060 | So the introduction to a paper, um, is going to try to give you a sense of, you know, what
00:20:28.500 | they're trying to do.
00:20:30.380 | And so this first paragraph here is just repeating what we've already read in the abstract and
00:20:35.740 | repeating what we see in figure one.
00:20:37.560 | So it's saying that we can take a text query, like a basket of fruits, see the examples.
00:20:43.540 | All right, fine.
00:20:44.540 | We'll skip through there.
00:20:46.340 | So the key thing about academic papers is that they are full of citations.
00:20:53.140 | Um, you should not expect to read all of them because if you do, then to read each of those
00:21:02.020 | citations, that's full of citations and then they're full of citations.
00:21:04.620 | And before you know it, you've read the entire academic literature, which has taken you 5,000
00:21:09.500 | years.
00:21:10.500 | Um, so, uh, for now, let's just recognize that it says text conditional image generation
00:21:15.740 | is undergoing a revolution.
00:21:16.740 | Here's some examples.
00:21:17.740 | Well, fine.
00:21:18.740 | We actually already know that.
00:21:19.740 | Okay.
00:21:20.740 | DALL-E is cool. Latent diffusion.
00:21:21.900 | That's what we've been using.
00:21:23.020 | That's cool.
00:21:24.020 | Imagen.
00:21:25.020 | Apparently that's cool.
00:21:26.020 | Um, so cool.
00:21:27.020 | All right.
00:21:28.020 | So we kind of know that.
00:21:29.020 | So generally there's this like, okay, our area that we're working on is important in
00:21:32.660 | this case.
00:21:33.660 | It's important.
00:21:34.660 | So we can skip through it pretty quickly.
00:21:35.980 | Um, they've asked a vast amounts of data are used.
00:21:40.100 | Yes, we know.
00:21:42.180 | Um, okay.
00:21:43.740 | So diffusion models are interesting.
00:21:46.300 | Yes, we know that they denoise starting from Gaussian noise.
00:21:50.020 | We know that.
00:21:51.020 | So you can see like, there's a lot of stuff.
00:21:52.420 | Once you kind of in the field, you can skip over pretty quickly.
00:21:55.900 | You can guide it using clip guidance.
00:21:57.420 | Yeah, that's what we've been doing.
00:21:58.700 | We know about that.
00:21:59.700 | Oh, wait, this is new: or by inpainting,
00:22:03.380 | i.e. copy-pasting pixel values outside a mask.
00:22:06.580 | All right.
00:22:07.860 | So there's a new technique that we haven't done, but I think it makes a lot of intuitive
00:22:11.980 | sense.
00:22:12.980 | Um, that is during that diffusion process, if there are some pixels, you don't want to
00:22:17.580 | change such as all the ones that aren't orange here, you can just paste them from the original
00:22:23.660 | after each stage of the diffusion.
00:22:25.260 | All right.
00:22:26.260 | That makes perfect sense.
00:22:27.260 | If I want to know more about that, I could always look at this paper, but I don't think
00:22:30.620 | I do for now.
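The paste-back step described here can be sketched in one line. The variable names are mine, and the original image outside the mask is assumed to have been noised to the same timestep as the current sample.

```python
import torch

def paste_outside_mask(x_t, orig_t, mask):
    """Keep diffused pixels where mask == 1; elsewhere paste back the
    original image (noised to the current timestep) after each step."""
    return mask * x_t + (1 - mask) * orig_t
```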
00:22:32.220 | Okay.
00:22:33.220 | And again, it's just repeating something they've already told us that they require us to provide
00:22:39.060 | a mask.
00:22:40.060 | So that's a bit of a problem.
00:22:44.820 | And then, you know, this is interesting.
00:22:46.460 | It's also says that when you mask out an area, that's a problem because if you're trying
00:22:53.140 | to, for example, change a dog into a cat, you want to keep the animal's color and pose.
00:22:58.860 | So this is a new technique, which is not deleting the original, not deleting a section and replacing
00:23:04.380 | it with something else, but it's actually going to take advantage of knowledge about
00:23:07.860 | what that thing looked like so that this is two cool new things.
00:23:12.800 | So hopefully at this point, we know what they're trying to achieve.
00:23:17.060 | If you don't know what they're trying to achieve when you're reading a paper, the paper won't
00:23:20.860 | make any sense.
00:23:21.860 | Um, so again, that's a point where you should stop.
00:23:25.020 | Maybe this is not the right time to be reading this paper.
00:23:27.420 | Maybe you need to read some of the references.
00:23:29.900 | Maybe you need to look more at the examples so you can always skip straight to the experiments.
00:23:34.620 | So I often skip straight to the experiments.
00:23:36.300 | In this case, I don't need to because they've put enough experiments on the very first page
00:23:41.780 | for me to see what it's doing.
00:23:43.740 | So yeah, don't always read it from top to bottom.
00:23:47.020 | Um, okay.
00:23:49.060 | So all right.
00:23:53.180 | So they've got some examples of conditioning a diffusion model on an input without a mask.
00:23:58.780 | Okay.
00:24:00.180 | For example, you can use a noised version of the input as a starting point.
00:24:03.220 | Hey, we've done that too.
00:24:04.460 | So as you can see, we've already covered a lot of the techniques that they're referring
00:24:09.380 | to here.
00:24:11.620 | Something we haven't done, but makes a lot of sense is that we can look at the distance
00:24:14.820 | to the input image as a loss function.
00:24:17.180 | Okay, that makes sense to me and there's some references here.
00:24:21.260 | All right, so we're going to create this new thing called diffedit.
00:24:25.940 | It's going to be amazing.
00:24:26.940 | Wait till you check it out.
00:24:29.420 | Okay, fine.
00:24:30.420 | Okay.
00:24:31.420 | So that's the introduction.
00:24:32.420 | Hopefully you found that useful to understand what we're trying to do.
00:24:38.260 | The next section is generally called related work as it is here.
00:24:42.020 | And that's going to tell us about other approaches.
00:24:47.340 | So if you're doing a deep dive, this is a good thing to study carefully.
00:24:52.380 | I don't think we're going to do a deep dive right now.
00:24:57.060 | So I think we can happily skip over it.
00:24:59.420 | We could kind of do a quick glance of like, oh, image editing includes colorization,
00:25:05.140 | retouching, style transfer.
00:25:06.580 | Okay, cool.
00:25:08.140 | Lots of interesting topics.
00:25:09.140 | Definitely getting more excited about this idea of image editing.
00:25:14.940 | And there's some different techniques.
00:25:19.500 | You can use clip guidance, okay, they can be computationally expensive.
00:25:26.580 | We can use diffusion for image editing.
00:25:29.660 | Okay, fine.
00:25:32.780 | We can use clip to help us.
00:25:33.980 | So there's a lot of repetition in these papers as well, which is nice because we can skip
00:25:37.740 | over it pretty quickly.
00:25:40.380 | More about the high computational costs.
00:25:42.260 | Okay, so they're saying this is going to be not so computationally expensive.
00:25:46.180 | That sounds hopeful.
00:25:52.060 | And often the very end of the related work is most interesting as it is here where they've
00:25:55.860 | talked about how somebody else has done concurrent hours.
00:25:58.820 | Somebody else is working at exactly the same time.
00:26:02.420 | And they've looked at some different approach.
00:26:07.980 | Okay, so not sure we learned too much from the related work, but if you were trying to
00:26:14.180 | really do the very, very best possible thing, you could study the related work and get the
00:26:19.540 | best ideas from each.
00:26:22.540 | Okay, now, background.
00:26:28.980 | So this is where it starts to look scary.
00:26:34.020 | I think we could all agree.
00:26:41.740 | And this is often the scariest bit, the background.
00:26:43.780 | This is basically saying like, mathematically, here's how the problem that we're trying to
00:26:50.060 | solve is set up.
00:26:52.380 | And so we're going to start by looking at denoising, diffusion, probabilistic models,
00:26:56.660 | DDPM.
00:26:57.660 | Now, if you've watched lesson 9b with Wasim and Tanishk, then you've already seen some
00:27:07.020 | of the math of DDPM.
00:27:10.020 | And the important thing to recognize is that basically no one in the world pretty much
00:27:16.580 | is going to look at these paragraphs of text and these equations and go, oh, I get it.
00:27:22.920 | That's what DDPM is.
00:27:24.620 | That's not how it works, right?
00:27:28.060 | To understand DDPM, you would have to read and study the original paper, and then you
00:27:33.820 | would have to read and study the papers it's based on and talk to lots of people and watch
00:27:40.740 | videos and go to classes just like this one.
00:27:44.580 | And after a while, you'll understand DDPM.
00:27:47.500 | And then you'll be able to look at this section and say, oh, okay, I see, they're just talking
00:27:53.380 | about this thing I'm already familiar with.
00:27:54.900 | So this is meant to be a reminder of something that you already know.
00:27:58.740 | It's not something you should expect to learn from scratch.
00:28:02.060 | So let me take you through these equations somewhat briefly because Wasim and Tanishk
00:28:11.100 | have kind of done them already because every diffusion paper pretty much is going to have
00:28:15.380 | these equations.
00:28:16.380 | Okay.
00:28:17.380 | So, oh, and I'm just going to read something that Jono has pointed out in the chat.
00:28:22.980 | He says it's worth remembering the background is often written last and tries to look smart
00:28:27.420 | for the reviewers, which is correct.
00:28:29.940 | So feel free to read it last too.
00:28:32.060 | Yeah, absolutely.
00:28:33.660 | I think the main reason to read it is to find out what the different letters mean, what
00:28:40.700 | the different symbols mean, because they'll probably refer to them later.
00:28:44.700 | But in this case, I want to actually take this as a way to learn how to read math.
00:28:51.380 | So let's start with this very first equation, which how on earth do you even read this?
00:28:58.660 | So the first thing I'll say is that this is not an E, right?
00:29:05.180 | It's a weird looking E. And the reason it's a weird looking E is because it's a Greek
00:29:08.580 | letter.
00:29:09.660 | And so something I always recommend to students is that you learn the Greek alphabet because
00:29:16.020 | it's much easier to be able to actually read this to yourself.
00:29:20.900 | So here's another one, right?
00:29:22.620 | If you don't know that's called theta, I guess you have to read it as like circle with line
00:29:27.820 | through it.
00:29:28.820 | It's just going to get confusing trying to read an equation where you just can't actually
00:29:33.180 | say it out loud.
00:29:35.220 | So what I suggest is that you learn that learn the Greek alphabet and let me find the right
00:29:51.140 | place.
00:29:52.140 | So it's very easy to look it up: just on Wikipedia there's the Greek alphabet.
00:30:01.940 | And if we go down here, you'll see they've all got names and we can go and try and find
00:30:06.580 | our one curvy E. Okay, here it is, epsilon and oh, circle with a line through it, theta.
00:30:15.820 | All right, so practice and you will get used to recognizing these. So you've got epsilon theta.
00:30:26.220 | This is just a weird curly L. So that's this is used for the loss function.
00:30:31.780 | Okay, so how do we find out what this symbol means and what this symbol means?
00:30:38.260 | Well, what we can do is there's a few ways to do it.
00:30:44.660 | One way, which is kind of cool is we can use a program called MathPix, which is MathPix.
00:31:05.120 | And what it does is you basically select anything on your screen.
00:31:14.000 | And it will turn it into LaTeX.
00:31:19.100 | So that's one way you can do this is you can select on the screen.
00:31:22.020 | It turns it into LaTeX.
00:31:23.020 | And the reason it's good to turn it into LaTeX is because LaTeX is written as actual stuff
00:31:28.460 | that you can search for on Google.
00:31:32.900 | So that's technique number one.
00:31:35.300 | Technique number two is you can download the other formats of the paper and that will have
00:31:46.100 | a download source.
00:31:48.820 | And if we say download source, then what we'll be able to do is we'll be able to actually
00:31:58.420 | open up that LaTeX and have a look at it.
00:32:02.980 | So we'll wait for that to download while it's happening.
00:32:05.580 | Let's keep moving along here.
00:32:09.460 | So in this case, we've got these two bars.
00:32:15.560 | So can we find out what that means?
00:32:17.780 | So we could try a few things.
00:32:21.580 | We could try looking for two bars, maybe math notation.
00:32:32.620 | Oh, here we are.
00:32:33.620 | Looks hopeful.
00:32:34.620 | What does this mean in mathematics?
00:32:35.620 | Oh, and here there's a glossary of mathematical symbols.
00:32:39.480 | Here there's a meaning of this in math.
00:32:45.160 | So that looks hopeful.
00:32:46.820 | Okay, so it definitely doesn't look like this.
00:32:49.660 | It's not between two sets of letters, but it is around something that looks hopeful.
00:32:58.340 | So it looks like we found it.
00:33:00.040 | It's a vector norm.
00:33:01.340 | Okay, so then you can start looking for these things up.
00:33:04.620 | So we can say norm or maybe vector norm.
00:33:09.260 | And so once you can actually find the term, then we kind of know what to look for.
00:33:16.260 | Okay, so in our case, we've got this surrounding all this stuff, and then there's twos here
00:33:31.020 | and here.
00:33:32.020 | What's going on here?
00:33:33.020 | All right, if we scroll through, oh, this is pretty close actually.
00:33:40.020 | So, okay, so two bars can mean a matrix norm, versus a single bar for a vector norm.
00:33:51.100 | That's just here in particular.
00:33:52.100 | So it looks like we don't have to worry too much about whether it's one or two bars.
00:33:55.540 | Oh, and here's the definition.
00:33:57.780 | Oh, that's handy.
00:33:59.700 | So we've got the two one.
00:34:01.220 | All right, so it's equal to root sum of squares.
00:34:05.540 | So that's good to know.
00:34:06.780 | So this norm thing means a root sum of squares.
00:34:11.660 | But then we've got a two up here.
00:34:12.660 | Well, that just means squared.
00:34:14.300 | Ah, so this is a root sum of squares squared.
00:34:18.620 | Well, the square of a square root is just the thing itself.
00:34:22.180 | Ah, so actually this whole thing is just the sum of squares.
00:34:26.180 | It's a bit of a weird way to write it, in a sense.
00:34:30.220 | We could perfectly well have just written it as, you know, like sum of, you know,
00:34:40.820 | whatever it is, squared.
00:34:43.180 | Fine.
00:34:44.180 | But there we go.
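The observation above is easy to check numerically. Here is a small editorial sketch (NumPy, not from the lesson notebook) showing that the squared vector norm collapses to a plain sum of squares:

```python
import numpy as np

x = np.array([3.0, 4.0])

# The two-bar vector norm: square root of the sum of squares.
norm = np.sqrt((x ** 2).sum())

# Squaring the norm undoes the square root, leaving just the sum of squares.
norm_squared = norm ** 2
sum_of_squares = (x ** 2).sum()
```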
00:34:48.380 | Okay, and then what about this thing here?
00:34:53.540 | Weird E thing.
00:34:54.540 | So how would you find out what the weird E thing is?
00:34:57.660 | Okay, so our, our LaTeX has finally finished downloading.
00:35:14.860 | And if we open it up, we can find there's a .tech file in here.
00:35:18.420 | Here we are, main.tech.
00:35:21.020 | So we'll open it.
00:35:26.460 | And it's not the most, you know, amazingly smooth process, but, you know, what we could
00:35:32.340 | just do is we could say, okay, it's just after it says minimizing the denoising objective.
00:35:36.780 | Okay, so let's search for minimizing the, oh, here it is, minimizing the denoising objective.
00:35:44.620 | So the LaTeX here, let's get it back from the screen at the same time.
00:35:50.060 | Okay, so here it is, L, mathcal L equals mathbb E, x naught, t, epsilon, okay.
00:36:02.260 | And here's that vertical bar thing, epsilon minus epsilon theta x t, and then the bar thing.
00:36:08.580 | All right, so the thing that we've got new is mathbb E, okay, so finally we've got something
00:36:13.940 | we can search for, mathbb E, ah, fantastic, what does mathbb E mean?
00:36:28.380 | That's the expected value operator, aha, fantastic, all right.
00:36:33.960 | So it takes a bit of fussing around, but once you've got either MathPix working or actually
00:36:40.420 | another thing you could try, because MathPix is ridiculously expensive, in my opinion,
00:36:44.820 | is there is, there is a free version called pix2tex that actually is a Python thing,
00:36:59.780 | and you could actually even have fun playing with this because the whole thing is just
00:37:03.820 | a PyTorch Python script, and it even describes, you know, how it uses a transformers model,
00:37:12.500 | and you can train it yourself in Colab and so forth, but basically as you can see, yeah,
00:37:17.380 | you can snip and convert to LaTeX, which is pretty awesome.
00:37:23.820 | So you could use this instead of paying the MathPix guys, anyway, so we are on the right
00:37:33.460 | track now, I think, so expected value, and then we can start reading about what expected
00:37:45.260 | value is, and you might actually remember that because we did a bit of it in high school,
00:37:45.260 | at least in Australia we did, it's basically like, let's maybe jump over here, so expected
00:37:58.260 | value of something is saying what's the likely value of that thing, so for example, let's
00:38:06.140 | say you toss a coin, which could be heads or it could be tails, and you want to know
00:38:11.460 | how often it's heads, and so maybe we'll call heads one tail zero, so you toss it and you
00:38:17.340 | get a one zero zero one one zero one zero one, okay, and so forth, right, and then you
00:38:23.180 | can calculate the mean of that, right, so if that's x you can calculate x bar, the mean,
00:38:31.380 | which would be the sum of all that, divided by the count of all that, so it'd be one two
00:38:39.860 | three four five, five divided by one two three four five six seven eight nine, okay, so that
00:38:47.980 | would be the mean, but the expected value is like, well, what do you expect to happen,
00:38:53.780 | and we can calculate that by adding up for all of the possibilities for each, I don't
00:39:00.380 | know, I'll just call them x, for each possibility x, how likely is x, and what score do you
00:39:06.420 | get if you get x, so in this example of heads and tails, our two possibilities is that we
00:39:11.540 | either get heads or we get tails, so if for the version where x is heads we get probability
00:39:19.420 | is zero point five, and the score if it's heads is going
00:39:28.740 | to be one, and then what about tails, for tails the probability is zero point five, and the
00:39:36.260 | score if you get tails is zero, and so overall the expected is point five times one plus
00:39:42.220 | zero is point five, so our expected score if we're tossing a coin is point five, if getting
00:39:49.460 | heads is a win. Let me give you another example, another example is let's say that we're rolling
00:39:56.940 | a die, and we want to know what the expected score is if we roll a die, so again we could
00:40:04.500 | roll it a bunch of times, and see what happens, okay, and so we could sum all that up, let's
00:40:15.460 | like before, and divide it by the count, and that'll tell us the mean for this particular
00:40:21.540 | example, but what's the expected value more generally, well again it's the sum of all
00:40:27.580 | the possibilities of the probability of each possibility times that score, so the possibilities
00:40:34.420 | for rolling a die is that you can get a one, a two, a three, a four, a five, or a six, the
00:40:41.240 | probability of each one is a sixth, okay, and the score that you get is, well it's this,
00:40:54.500 | this is the score, and so then you can multiply all these together and sum them up, which
00:41:00.260 | would be 1/6 plus 2/6 plus 3/6 plus 4/6, oops, plus 5/6 plus 6/6, and that would give you
01:00:41.240 | the expected value of that particular thing, which is rolling a die, so that's
00:41:28.020 | what expected value means, all right, so that's a really important concept that's going to
00:41:34.460 | come up a lot as we read papers, and so in particular this is telling us what are all
00:41:43.780 | the things that we're averaging it over, that with the expectations over, and so there's
00:41:48.820 | a whole lot of letters here, you're not expected to just know what they are, in fact in every
00:41:52.860 | paper they could mean totally different things, so you have to look immediately underneath
00:41:56.500 | where they'll be defined, so x0 is an image, it's an input image, epsilon is the noise,
00:42:06.400 | and the noise has a mean of zero and a covariance of I, which if you watch the lesson
00:42:11.860 | 9b you'll know is like a standard deviation of 1 when you're doing multiple normal variables,
00:42:20.820 | okay, and then this is kind of confusing, epsilon just on its own is a normally distributed
00:42:28.180 | random variable, so it's just grabbing random numbers, but epsilon
00:42:35.060 | theta is a noise estimator, that means it's a function, you can tell it's a function kind
00:42:41.880 | of because it's got these parentheses and stuff right next to it, so that's a function,
00:42:47.620 | so presumably most functions like this in these papers are neural networks, okay, so
00:42:53.180 | we're finally at a point where this actually is going to make perfect sense, we've got
00:42:56.460 | the noise, we've got the prediction of that noise, we subtract one from the other, we
00:43:04.100 | square it, and we take the expected value, so in other words this is mean squared error,
00:43:11.820 | so wow, that's a lot of fiddling around to find out that we've, this whole thing here
00:43:16.500 | means mean squared error, so the loss function is the mean squared error, and unfortunately
00:43:22.820 | I don't think the paper ever says that, it says minimising the denoising objective L
00:43:26.580 | blah de blah de blah de, but anyway we got there eventually, fine, we also, as well as
00:43:36.820 | learning about x0, we also learn here about xt, and so xt is the original unnoised image
00:43:46.060 | times some number plus some noise times one minus that number, okay, and so hopefully
00:43:55.300 | you'll recognise this from lesson 9b, this is the thing where we reduce the value of
00:43:59.580 | each pixel and we add noise to each pixel, so that's that, alright, so I'm not going
00:44:07.740 | to keep going through it, but you can kind of basically get the idea here is that once
00:44:11.640 | you know what you're looking for, the equations do actually make sense, right, but all this
00:44:20.420 | is doing is remember this is background, right, this is telling you what already exists, so
00:44:25.600 | this is telling you this is what a DDPM is, and then it tells you what a DDIM is, DDIM
00:44:34.660 | is, just think of it as a more recent version of DDPM, it's some very minor changes to the
00:44:43.260 | way it's set up which allows us to go faster, okay, so the thing is though, once we keep
00:44:51.420 | reading what you'll find is none of this background actually matters, but you know I thought we'd
00:44:58.620 | kind of go through it just to get a sense of like what's in a paper, okay, so for the
00:45:05.700 | purpose of our background it's enough to know that DDPM and DDIM are kind of the foundational
00:45:11.980 | papers on which diffusion models today are based. Okay, so the encoding process which
00:45:32.180 | encodes an image onto a latent variable, okay, and then this is basically adding noise, this
00:45:41.220 | is called DDIM encoding, and the thing that goes from the input image to the noised image,
00:45:48.860 | they're going to call capital E r, and r is the encoding ratio, that's going to be something
00:45:54.420 | like how much noise are we adding, if you use small steps then decoding that, so going
00:46:02.420 | backwards gives you back the original image, okay, so that's what the stuff that we've
00:46:05.460 | learned about, that's what diffusion models are. All right, so this looks like a very
00:46:12.980 | useful picture, so maybe let's take a look and see what this says, so what is DiffEdit?
00:46:19.540 | DiffEdit has three steps. Step one, we add noise to the input image, that sounds pretty
00:46:25.500 | normal, here's our input image x0, okay, and we add noise to it, fine, and then we denoise
00:46:34.460 | it, okay, fine. Ah, but we denoise it twice. One time we denoise it using the reference
00:46:46.820 | text R, horse, or this special symbol here means nothing at all, so either unconditional
00:46:54.700 | or horse. All right, so we do it once using the word horse, so we take this and we decode
00:47:03.860 | it, estimate the noise, and then we can remove that noise on the assumption that it's a horse.
00:47:11.220 | Then we do it again, but the second time we do that noise, when we calculate the noise,
00:47:20.020 | we pass in our query Q, which is zebra. Wow, those are going to be very different noises.
00:47:27.860 | The noise for horse is just going to be literally these Gaussian pixels, these are all dots,
00:47:33.300 | right, because it is a horse, but if the claim is no, no, this is actually a zebra, then
00:47:39.140 | all of these pixels here are all wrong, they're all the wrong color. So the noise that's calculated
00:47:46.180 | if we say this is our query is going to be totally different to the noise if we say this
00:47:52.020 | is our query, and so then we just take one minus the other, and here it is here, right,
00:47:58.940 | so we derive a mask based on the difference in the denoising results, and then you take
00:48:05.420 | that and binarize it, so basically turn that into ones and zeros. So that's actually the
00:48:10.780 | key idea, that's a really cool idea, which is that once you have a diffusion model that's
00:48:17.220 | trained, you can do inference on it where you tell it the truth about what the thing
00:48:22.100 | is, and then you can do it again but lie about what the thing is, and in your lying version
00:48:28.500 | it's going to say okay, all the stuff that doesn't match zebra must be noise. And so
00:48:34.180 | the difference between the noise prediction when you say hey it's a zebra versus the noise
00:48:38.420 | prediction when you say hey it's a horse will be all the pixels that it says no, these pixels
00:48:44.580 | are not zebra. The rest of it, it's fine, there's nothing particularly about the background
00:48:50.140 | that wouldn't work with a zebra. Okay, so that's step one. So then step two is we take the
00:49:03.060 | horse and we add noise to it. Okay, that's this XR thing that we learned about before.
00:49:14.740 | And then step three, we do decoding conditioned on the text query using the mask to replace
00:49:23.700 | the background with pixel values. So this is like the idea that we heard about before,
00:49:29.660 | which is that during the inference time as you do diffusion from this fuzzy horse, what
00:49:37.100 | happens is that we do a step of diffusion inference and then all these black pixels
00:49:44.920 | we replace with the noised version of the original. And so we do that multiple times
00:49:49.980 | and so that means that the original pixels in this black area won't get changed. And
00:49:57.820 | that's why you can see in this picture here and this picture here, the backgrounds all
00:50:01.820 | the same. And the only thing that's changed is that the horse has been turned into a zebra.
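As a rough sketch of the mask-generation idea in step one, not the paper's exact procedure (the authors average over several noise samples and smooth the result), here two fake noise estimates stand in for the model's "horse" and "zebra" predictions:

```python
import numpy as np

rng = np.random.default_rng(1)

def diffedit_mask(noise_ref, noise_query, threshold=0.5):
    """The mask is where the two noise estimates disagree: average the
    absolute difference over the channel axis, rescale to [0, 1], binarize."""
    diff = np.abs(noise_ref - noise_query).mean(axis=0)  # (H, W)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return (diff > threshold).astype(np.uint8)

# Stand-ins for the two denoiser outputs; a real run would call the
# diffusion model twice, conditioned on "horse" and on "zebra".
noise_horse = rng.standard_normal((3, 64, 64))
noise_zebra = noise_horse.copy()
noise_zebra[:, 20:40, 20:40] += 2.0   # pretend the estimates differ here
mask = diffedit_mask(noise_horse, noise_zebra)
```

The masked square comes out as ones and the untouched background as zeros, which is the binarized mask the paper then uses to protect the background during decoding.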
00:50:09.540 | So this paragraph describes it and then you can see here it gives you a lot more detail.
00:50:17.340 | And the detail often has all kinds of like little tips about things they tried and things
00:50:22.220 | they found, which is pretty cool. So I won't read through all that because it says the
00:50:31.780 | same as what I've already just said. One of the interesting little things they note here
00:50:37.100 | actually is that this binarized mask, so this difference between the R decoding and the
00:50:44.260 | Q decoding tends to be a bit bigger than the actual area where the horse is, which you
00:50:50.260 | can kind of see with these legs, for example. And their point is that they actually say
00:50:53.980 | that's a good thing because actually often you want to slightly change some of the details
00:50:58.820 | around the object. So this is actually fine. All right. So we have a description of what
00:51:07.340 | the thing is, lots of details there. And then here's the bit that I totally skip, the bit
00:51:12.740 | called theoretical analysis, where this is the stuff that people really generally just
00:51:17.620 | add to try to get their papers past review. You have to have fancy math. And so they're
00:51:22.660 | basically proving, you can see what it says here, insight into why this component yields
00:51:28.980 | better editing results than other approaches. I'm not sure we particularly care because
00:51:34.420 | like it makes perfect sense what they're doing. It's intuitive and we can see it works. I
00:51:39.260 | don't feel like I need it proven to me, so I skip over that. So then they'll show us
00:51:44.300 | their experiments to tell us what datasets they did the experiments on. And so then,
00:51:51.300 | you know, they have metrics with names like LPIPS and CSFID. You'll come across FID
00:51:59.460 | a lot. This is just a version of that. But basically they're trying to score how good
00:52:04.780 | their generated images. We don't normally care about that either. They care because
00:52:10.580 | they need to be able to say, you should publish our paper because it has a higher number than
00:52:14.260 | the other people that have worked on this area. In our case, we can just say, you know,
00:52:21.140 | it looks good. I like it. So excellent question in the chat from Mikolaj, which is, so would
00:52:29.540 | this only work on things that are relatively similar? And I think this is a great point.
00:52:34.700 | This is where understanding this helps to know what its limitations are going to be.
00:52:40.100 | And that's exactly right. If you can't come up with a mask for the change you want, this
00:52:47.580 | isn't going to work very well on the whole. Yeah, because in the masked areas, the pixels are
00:52:53.620 | going to be copied. So, for example, if you wanted to change it from, you know, a bowl
00:52:58.980 | of fruits to a bowl of fruits with a bokeh background or like a bowl of fruits with,
00:53:07.820 | you know, a purple tinged photo of a bowl of fruit, if you want the whole color to change,
00:53:14.060 | that's not going to work, right? Because you're not masking off an area. Yeah. So by understanding
00:53:18.660 | the detail here, Mikolaj has correctly recognized a limitation or like, what's this for? This
00:53:25.860 | is for things where you can just say, just change this bit and leave everything else
00:53:29.900 | the same. All right. So there's lots of experiments. So, yeah. For some things, you care about
00:53:38.700 | the experiments a lot. If it's something like classification, for generation, the main thing
00:53:43.180 | you probably want to look at is the actual results. And so, and often, for whatever reason,
00:53:50.380 | I guess, because this is, most people read these electronically, the results often you
00:53:53.820 | have to zoom into a lot to be able to see whether they're really good. So here's the
00:53:57.420 | input image. They want to turn this into an English Foxhound. So here's the thing they're
00:54:03.300 | comparing themselves to, SDEdit, and it changed the composition quite a lot. And their version,
00:54:09.940 | it hasn't changed it at all. It's only changed the dog. And ditto here, semi-trailer truck.
00:54:14.860 | SDEdit's totally changed it. DiffEdit hasn't. So you can kind of get a sense of like, you
00:54:20.860 | know, the authors showing off what they're good at here. This is, this is what this technique
00:54:25.420 | is effective at doing, changing animals and vehicles and so forth. It does a very good
00:54:32.900 | job of it. All right. So then there's going to be a conclusion at the end, which I find
00:54:43.580 | almost never adds anything on top of what we've already read. And as you can see, it's
00:54:48.460 | very short anyway. Now, quite often the appendices are really interesting. So don't skip over
00:54:59.820 | them. Often you'll find like more examples of pictures. They might show some examples
00:55:05.300 | of pictures that didn't work very well, stuff like that. So it's often well worth looking
00:55:10.380 | at the appendices. Often some of the most interesting examples are there. And that's
00:55:16.900 | it. All right. So that is, I guess, our first full on paper walkthrough. And it's important
00:55:24.060 | to remember, this is not like a carefully chosen paper that we've picked specifically
00:55:30.560 | because you can handle it. Like this is the most interesting paper that came out this
00:55:34.020 | week. And so, you know, it gives you a sense of what it's really like. And for those of
00:55:43.720 | you who are, you know, ready to try something that's going to stretch you, see if you can
00:55:49.740 | implement any of this paper. So there are three steps. The first step is kind of the
00:55:56.540 | most interesting one, which is to generate, automatically generate a mask. And the information
00:56:02.660 | that you have and the code that's in the lesson nine notebook actually contains everything
00:56:07.340 | you need to do it. So maybe give it a go. See if you can mask out the area of a horse that
00:56:14.540 | does not look like a zebra. And that's actually, you know, that's actually useful of itself.
00:56:19.480 | Like that's, that's allows you to create segmentation masks automatically. So that's pretty cool.
00:56:26.340 | And then if you get that working, then you can go and try and do step two. If you get
00:56:30.980 | that working, you can try and do step three. And this only came out this week. So I haven't
00:56:36.220 | really seen, yeah, examples of easy to use interfaces to this. So here's an example of
00:56:43.940 | a paper that you could be the first person to create a cool interface to it. So there's
00:56:48.040 | some, yeah, there's a fun little project. And even if you're watching this a long time after
00:56:53.660 | this was released and everybody's been doing this for years, still good homework, I think,
00:56:58.540 | so practice if you can. All right. I think now's a good time to have a 10 minute break.
00:57:12.300 | So I'll see you all back here in 10 minutes. Okay. Welcome back. One thing during the break
00:57:23.900 | that Diego reminded us about, which I normally describe and I totally forgot about this time
00:57:30.060 | is detectify, which is another really great way to find symbols you don't know about.
00:57:35.580 | So let's try it for that expectation. So if you're going to detectify and you draw the
00:57:45.100 | thing, it doesn't always work fantastically well, but sometimes it works very nicely.
00:57:53.980 | Yeah, in this case, not quite. What about the double line thing? It's good to know all the
00:58:03.500 | techniques, I guess. I think it could do this one. I guess part of the problem is there's
00:58:15.340 | so many options that actually, you know, okay, in this case, it wasn't particularly helpful.
00:58:20.700 | And normally it's more helpful than that. I mean, if we use a simple one like Epsilon,
00:58:26.380 | I think it should be fine. There's a lot of room to improve this app, actually, if anybody's
00:58:30.860 | interested in a project, I think you could make it, you know, more successful. Okay, that's,
00:58:37.500 | there you go. Sigma, sum, that's cool. Anyway, so it's another useful thing to know about,
00:58:41.500 | just Google for Detexify. Okay. So let's move on with our from the foundations now.
00:58:52.140 | And so we were working on trying to at least get the start of a forward pass of a linear model or
00:59:02.220 | a simple multi-layer perceptron for MNIST going. And we had successfully created a basic tensor.
00:59:11.260 | We've got some random numbers going. So what we now need to do is we now need to be able to
00:59:19.660 | multiply these things together, matrix multiplication. So matrix multiplication
00:59:26.380 | to remind you, in this case, so we're doing MNIST, right? So we've got,
00:59:38.460 | I think we're going to use a subset. Let's see. Yeah. Okay. So we're going to create a matrix
00:59:46.700 | called M1, which is just the first five digits. So M1 will be the first five digits. So five rows
00:59:57.340 | and dot, dot, dot, dot, dot, dot, dot. And then 780, what was it again? 784 columns,
01:00:09.500 | 784 columns, because it's 28 by 28 pixels. And we flattened it out. So this is our
01:00:17.100 | first matrix and our matrix multiplication. And then we're going to multiply that by some
01:00:22.780 | weights. So the weights are going to be 784 by 10 random numbers. So for every one of these
01:00:37.820 | 784 pixels, each one is going to have a weight. So 784 down here,
01:00:45.020 | 784 by 10. So this first column, for example, is going to tell us all the weights in order to
01:00:58.700 | figure out if something's a zero. And the second column will have all the weights in deciding the
01:01:03.180 | probability of something's a one and so forth, assuming we're just doing a linear model. And so
01:01:07.420 | then we're going to multiply these two matrices together. So when we multiply matrices together,
01:01:12.940 | we take row one of matrix one and we take column one of matrix two and we take each
01:01:24.220 | one in turn. So we take this one and we take this one, we multiply them together.
01:01:28.220 | And then we take this one and this one and we multiply them together.
01:01:36.300 | And we do that for every element wise pair and then we add them all up. And that would give us
01:01:46.540 | the value for the very first cell that would go in here. That's what matrix multiplication is.
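A quick numerical check of that description, with random stand-ins for the two matrices: the first cell of the product is the element-wise product of a's first row and b's first column, summed.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((5, 784))   # five flattened 28x28 images
b = rng.random((784, 10))  # one weight column per digit class

# Row 0 of a times column 0 of b, element-wise, then summed.
first_cell = (a[0, :] * b[:, 0]).sum()

# The same number appears at position (0, 0) of the full product.
full = a @ b
```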
01:02:01.660 | Okay, so let's go ahead then and create our random numbers for the weights,
01:02:10.780 | since we're allowed to use random number generators now. And for the bias, we'll just use a bunch of
01:02:15.180 | zeros to start with. So the bias is just what we're going to add to each one. And so for our
01:02:23.340 | matrix multiplication, we're going to be doing a little mini batch here. We're going to be doing
01:02:26.620 | five rows of, as we discussed, five rows of, so five images flattened out.
01:02:35.020 | And then multiply by this weights matrix. So here are the shapes, m1 is 5 by 784,
01:02:47.340 | as we saw, m2 is 784 by 10. Okay, so keep those in mind. So here's a handy thing, m1.shape
01:02:57.180 | contains two numbers and I want to pull them out. I want to call the, I'm going to think of that as,
01:03:04.940 | I'm going to actually think of this as like a and b rather than m1 and m2. So this is like a and b.
01:03:09.420 | So the number of rows in a and the number of columns in a, if I say equals m1.shape,
01:03:17.500 | that will put five in ar and 784 in ac. So you'll probably notice this, I do this a lot,
01:03:23.980 | this de-structuring, we talked about it last week too. So we can do the same for m2.shape,
01:03:28.300 | put that into b rows and b columns. And so now if I write out arac and brbc, you can again see
01:03:36.220 | the same things from the sizes. So that's a good way to kind of give us the stuff we have to loop
01:03:40.780 | through. So here's our result. So our resultant tensor, while we're multiplying, we're multiplying
01:03:49.180 | together all of these 784 things and adding them up. So the resultant tensor is going to
01:03:53.900 | be 5 by 10. And then each thing in here is the result of multiplying and adding 784 pairs.
01:04:02.620 | So the result here is going to start with zeros and there is, this is the result.
01:04:11.500 | And it's going to contain ar rows, five rows, and bc columns, 10 columns, 5 comma 10. Okay.
01:04:20.780 | So now we have to fill that in. And so to do a matrix multiplication,
01:04:24.140 | we have to first, we have to go through each row, one at a time. And here we have that,
01:04:35.260 | go through each row, one at a time. And then go through each column, one at a time.
01:04:42.620 | And then we have to go through each pair in that row column, one at a time. So there's going to be
01:04:49.500 | a loop, in a loop, in a loop. So here we're going to loop over each row. And here we're going to
01:04:59.420 | loop over each column. And then here we're going to loop, so each column is C. And then here we're
01:05:04.380 | going to loop over each column of A, which is going to be the same as the number of rows of B,
01:05:12.140 | which we can see here, ac, 784, br, 784, they're the same. So it wouldn't matter whether we said
01:05:19.100 | ac or br. So then our result for that row and that column, we have to add onto it the product of
01:05:32.940 | ik in the first matrix by kj in the second matrix. So k is going up through those 784. And so we're
01:05:42.540 | going to go across the columns and down, sorry, across the rows and down the columns. It's going
01:05:47.100 | to go across the row whilst it goes down this column. So here is the world's most naive, slow,
01:05:56.220 | uninteresting matrix multiplication. And if we run it, okay, it's done something. We have successfully,
01:06:07.580 | apparently, hopefully successfully, multiplied the matrices M1 and M2. It's a little hard to
01:06:12.860 | read this, I find, because punch cards used to be 80 columns wide. We still assume screens are 80
01:06:21.900 | columns wide. Everything defaults to 80 wide, which is ridiculous. But you can easily change it. So if
01:06:28.620 | you say set print options, you can choose your own line width. Oh, as you can see, well, we know it's
01:06:36.380 | five by 10. We did it before. So if we change the line width, okay, that's much easier to read now.
01:06:41.260 | We can see here are the five rows and here are the 10 columns for that matrix multiplication.
01:06:48.780 | I tend to always put this at the top of my notebooks and you can do the same thing for NumPy as well.
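The kind of header cell described here might look something like this (the exact precision and width are a matter of taste; `set_printoptions` is the real name in both PyTorch and NumPy):

```python
import numpy as np
import torch

# Widen the default 80-character line so wide tensors print without wrapping.
torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)
np.set_printoptions(precision=2, linewidth=140)
```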
01:06:53.980 | So what I like to do, this is really important, is when I'm working on code, particularly numeric
01:07:08.460 | code, I like to do it all step by step in Jupyter. And then what I do is, once I've got it working,
01:07:15.820 | is I copy all the cells that have implemented that and I paste them and then I select them
01:07:23.340 | all and I hit shift M to merge. Get rid of anything that prints out stuff I don't need.
01:07:28.220 | And then I put a header on the top, give it a function name, and then I select the whole lot
01:07:36.700 | and I hit control or apple right square bracket and I've turned it into a function. But I still
01:07:43.020 | keep the stuff above it so I can see all the step by step stuff for learning about it later.
01:07:48.220 | And so that's what I've done here to create this function.
01:07:52.780 | And so this function does exactly the same things we just did.
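A minimal sketch of what that merged-and-named function looks like, assuming matrices called a and b as in the lecture's notebook (the tiny example matrices are made up for illustration):

```python
import torch

def matmul(a, b):
    # World's most naive matrix multiply: three nested Python loops.
    # c[i, j] accumulates a[i, k] * b[k, j] over the shared dimension k.
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br, "inner dimensions must match"
    c = torch.zeros(ar, bc)
    for i in range(ar):          # down the rows of a
        for j in range(bc):      # across the columns of b
            for k in range(ac):  # along the shared dimension
                c[i, j] += a[i, k] * b[k, j]
    return c

# Tiny example (the lecture times this on 5 x 784 @ 784 x 10)
a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
print(matmul(a, b))
```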
01:07:57.500 | And we can see how long it takes to run by using percent time.
01:08:01.900 | And it took about half a second, which gosh, that's a long time to generate such a small
01:08:10.780 | matrix. This is just to do five MNIST digits. So that's not going to be great.
01:08:18.780 | We're going to have to speed that up. I'm actually quite surprised at how slow that is
01:08:24.300 | because there's only 39,200. So we're, you know, if you look at how we've got a loop within a loop
01:08:30.540 | within a loop, what's going wrong? A loop within a loop within a loop, it's doing 39,200 of these.
01:08:37.420 | So Python, yeah, Python, when you're just doing Python, it is it is slow.
01:08:42.780 | So we can't we can't do that. That's why we can't just write Python.
01:08:46.700 | But there is something that kind of lets us write Python.
01:08:51.180 | We could instead use Numba. Numba is a system that takes Python and turns it into
01:09:05.980 | basically into machine code. And it's amazingly easy to do. You can basically take a function
01:09:13.260 | and write @njit on top of it. And what it's going to do is it's going to
01:09:20.140 | look the first time you call this function, it's going to compile it down to machine code
01:09:25.820 | and it will run much more quickly. So what I've done here is I've taken the innermost loop.
01:09:34.700 | So just looping through and adding up all these.
01:09:40.780 | So start at zero, go through and add up all those just for two vectors and return it.
01:09:48.700 | This is called a dot product in linear algebra. So we'll call it dot.
01:09:53.260 | And so Numba only works with NumPy. It doesn't work with PyTorch. So we're just going to use
01:09:59.740 | arrays instead of tensors for a moment. Now, have a look at this. If I try to
01:10:04.540 | do a dot product of one, two, three and two, three, four, that's pretty easy to do.
01:10:09.820 | It took a fifth of a second, which sounds terrible. But the reason it took a fifth of a second is
01:10:17.260 | because that's actually how long it took to compile this and run it. Now that it's compiled,
01:10:21.900 | the second time it just has to call it, it's now 21 microseconds. And so that's actually very fast.
01:10:31.820 | So with Numba, we can basically make Python run at C speed.
01:10:38.060 | So now the important thing to recognize is if I replace this loop in Python with a call to dot,
01:10:49.980 | which is running in machine code, then we now have one, two loops running in Python, not three.
01:11:00.060 | So our 448 milliseconds.
01:11:03.980 | Well, first of all, let's make sure if I run it,
01:11:08.220 | run that matmul, it should be close to my T1. T1 is what we got before, remember?
01:11:20.380 | So when I'm refactoring or performance improving or whatever, I always like to put every step
01:11:26.780 | in the notebook and then test. So this test close comes from fastcore.test. And it just checks that
01:11:33.420 | two things are very similar. They might not be exactly the same because of little floating
01:11:37.340 | point differences, which is fine. OK, so our matmul is working correctly, or at least it's doing the
01:11:42.220 | same thing it did before. So if we now run it, it's taking 268 microseconds versus 448 milliseconds.
01:11:53.820 | So it's, you know, about 2,000 times faster just by changing the innermost loop.
01:12:03.580 | So really, all we've done is we've added @njit to make it 2,000 times faster.
01:12:08.940 | So Numba is well worth knowing about. It can make your Python code very, very fast.
01:12:16.540 | OK, let's keep making it faster. So we're going to use stuff again, which kind of goes back to APL.
01:12:26.540 | And a lot of people say that learning APL is a thing that's taught them more about
01:12:33.020 | programming than anything else. So it's probably worth considering learning APL.
01:12:41.100 | And let's just look at these various things. We've got a is 10, 6, negative 4. So remember, in APL,
01:12:47.740 | we don't say equals. Equals actually means equals, funnily enough. To say set to,
01:12:52.700 | we use this arrow. And this is a list of 10, 6, negative 4. OK, and then b is 2, 8, 7.
01:13:06.620 | OK, and we're going to add them up, a plus b. So what's going on here?
01:13:15.980 | So it's really important that you can think of
01:13:19.500 | a symbol like a as representing a tensor or an array. APL calls them arrays.
01:13:31.100 | PyTorch calls them tensors. NumPy calls them arrays. They're the same thing.
01:13:35.580 | So this is a single thing that contains a bunch of numbers. This is a single thing that contains
01:13:39.660 | a bunch of numbers. This is an operation that applies to arrays or tensors. And what it does
01:13:45.420 | is it works what's called element-wise. It takes each pair, 10 and 2, and adds them together.
01:13:50.460 | Each pair, 6 and 8, add them together. This is element-wise addition. And Fred's asking in the
01:13:57.100 | chat, how do you put in these symbols? If you just mouse over any of them, it will show you
01:14:04.060 | how to write it. And the one you want is the one at the very
01:14:07.100 | bottom, which is the one where it says prefix. Now, the prefix is the backtick character.
01:14:14.140 | So here it's saying prefix hyphen gives us times. So type a backtick dash b is a times b,
01:14:25.580 | for example. So yeah, they all have shortcut keys, which you learn pretty quickly, I find.
01:14:33.420 | And there's a fairly consistent kind of system for those shortcut keys, too.
01:14:36.940 | All right. So we can do the same thing in PyTorch. It's a little bit more verbose in PyTorch,
01:14:43.740 | which is one reason I often like to do my mathematical fiddling around in APL. I can
01:14:48.540 | often do it with less boilerplate, which means I can spend more time thinking.
01:14:54.460 | You know, I can see everything on the screen at once. I don't have to spend as much time trying
01:14:58.300 | to ignore the tensor, round brackets, square bracket dot comma, blah, blah, blah.
01:15:03.100 | It's all cognitive load, which I'd rather ignore. But anyway, it does the same thing.
01:15:07.660 | So I can say a plus b and it works exactly like APL.
01:15:10.940 | So here's an interesting example. I can go a less than b dot float dot mean.
01:15:18.780 | So let's try that one over here. A less than b. So this is a really important idea,
01:15:25.020 | which I think was invented by Ken Iverson, the APL guy, which is the true and false
01:15:30.460 | represented by zero and one. And because they're represented by zero and one, we can
01:15:37.100 | do things to them. We can add them up and subtract them and so forth. It's a really important idea.
01:15:43.020 | So in this case, I want to take the mean of them. And I'm going to tell you something amazing,
01:15:51.740 | which is that in APL, there is no function called mean. Why not? That's because we can write
01:15:59.580 | the mean function, which is, so that's four letters, mean, M-E-A-N. We can write the mean
01:16:05.820 | function from scratch with four characters. I'll show you. Here's the whole mean function.
01:16:12.700 | We're going to create a function called mean and the mean is equal to the sum of a list
01:16:20.940 | divided by the count of a list. So this here is sum divided by count.
01:16:27.980 | And so I've now defined a new function called mean, which calculates the mean.
01:16:33.980 | Mean of a less than b. There we go. And so, you know, in practice, I'm not sure people would
01:16:41.020 | even bother defining a function called mean because it's just as easy to actually write
01:16:45.100 | its implementation in APL. In NumPy, or whatever, in Python, it's going to take a lot more than four
01:16:52.620 | letters to implement mean. So anyway, you know, it's a math notation. And so being a math notation,
01:16:58.060 | we can do a lot with little, which I find helpful because I can see everything going on at once.
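In PyTorch, that comparison-then-mean looks like this, using the same a and b as defined in APL above:

```python
import torch

a = torch.tensor([10., 6., -4.])
b = torch.tensor([2., 8., 7.])

# True and False are just one and zero, so we can do arithmetic on them
print(a < b)                   # tensor([False,  True,  True])
print((a < b).float().mean())  # the fraction where a < b: 2/3
```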
01:17:03.740 | Anywho, OK, so that's how we do the same thing in PyTorch. And again, you can see that the less than
01:17:10.460 | in both cases are operating element wise. OK, so a is less than b is saying ten is less than two,
01:17:15.900 | six is less than eight, four is less than seven and gives us back each of those trues and falses
01:17:20.620 | as zeros and ones. And according to the emoji on our YouTube chat, Siva's head just exploded
01:17:26.460 | as it should. This is why APL is, yeah, life changing. OK, let's now go up to higher ranks.
01:17:36.860 | So this here is a rank one tensor. So a rank one tensor means it's a list of things.
01:17:44.460 | It's a vector. Whereas a rank two tensor is like a list of lists. They all have to be the
01:17:51.900 | same length lists or it's like a rectangular bunch of numbers. And we call it in math, we call it a
01:17:56.540 | matrix. So this is how we can create a tensor containing one, two, three, four, five, six, seven,
01:18:01.740 | eight, nine. And you can see often what I like to do is I want to print out the thing I just created
01:18:09.660 | after I created it. So two ways to do it. You can say, put an enter and then write M and that's
01:18:15.900 | going to do that. Or if you want to put it all on the same line, that works too. You just use a
01:18:19.180 | semicolon. Neither one's better than the other. They're just different. So we could do the same
01:18:26.140 | thing in APL. Of course, in APL, it's going to be much easier. So we're going to define a matrix
01:18:32.540 | called M, which is going to be a three by three tensor containing the numbers from one to nine.
01:18:43.580 | Okay. And there we go. That's done it in APL. A three by three tensor containing the numbers
01:18:54.540 | from one to nine. A lot of these ideas from APL you'll find have made their way into other
01:18:59.500 | programming languages. For example, if you use Go, you might recognize this. This is the iota
01:19:04.700 | character and Go uses the word iota. So they spell it out in a somewhat similar way.
01:19:11.180 | A lot of these ideas from APL have found themselves into math notation and other languages.
01:19:19.340 | It's been around since the late 50s. Okay. So here's a bit of fun.
01:19:24.060 | We're going to learn about a new thing that looks kind of crazy called Frobenius norm.
01:19:30.380 | And we'll use that from time to time as we're doing generative modeling.
01:19:35.580 | And here's the definition of a Frobenius norm. It's the sum over all of the rows and columns
01:19:43.900 | of a matrix. And we're going to take each one and square it. We're going to add them up and
01:19:51.980 | they're going to take the square root. And so to implement that in PyTorch is as simple as going
01:20:00.140 | m times m dot sum dot square root. So this looks like a pretty complicated thing when you kind of
01:20:10.380 | look at it at first. It looks like a lot of squiggly business. Or if you said this thing here,
01:20:14.540 | you might be like, what on earth is that? Well, now, you know, it's just square sum square root.
01:20:21.420 | So again, we could do the same thing in APL.
01:20:25.420 | So let's do, so in APL, we want the, okay, so we're going to create something called SF.
01:20:36.940 | Now, it's interesting, APL does this a little bit differently. So dot sum by default in PyTorch sums
01:20:44.220 | over everything. And if you want to sum over just one dimension, you have to pass in a dimension
01:20:48.540 | keyword. For very good reasons, APL is the opposite. It just sums across rows or just down columns.
01:20:55.340 | So actually, we have to say sum up the flattened out version of the matrix. And to say flattened
01:21:02.620 | out, use comma. So here's sum up the flattened out version of the matrix. Okay, so that's our SF.
01:21:12.380 | Oh, sorry. And the matrix is meant to be m times m. There we go. So there's the same thing. Sum up
01:21:24.220 | the flattened out m by m matrix. And another interesting thing about APL is it always is
01:21:29.100 | read right to left. There's no such thing as operator precedence, which makes life a lot easier.
01:21:34.780 | Okay, and then we take the square root of that. There isn't a square root function.
01:21:42.780 | So we have to do to the power of 0.5. And there we go. Same thing. All right, you get the idea.
01:21:51.020 | Yes, a very interesting question here from Marabou. Are the bars for norm or absolute value?
01:21:58.620 | And I like Siva's answer, which is the norm is the same as the absolute value for a scalar.
01:22:05.340 | So in this case, you can think of it as absolute value. And it's kind of not needed because it's
01:22:10.380 | being squared anyway. But yes, in this case, the norm, well, in every case for a scalar,
01:22:17.340 | the norm is the absolute value, which is kind of a cute discovery when you realize it.
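The PyTorch one-liner from above, spelled out (using the same three-by-three matrix of one to nine):

```python
import torch

m = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

# Frobenius norm: square every element, sum them all up, take the square root
sf = (m * m).sum().sqrt()
print(sf)  # sqrt of 285
```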
01:22:21.260 | So thank you for pointing that out, Siva. All right. So this is just fiddling around a little
01:22:28.380 | bit to kind of get a sense of how these things work. So really importantly, you can index into
01:22:36.780 | a matrix and you'll say rows first and then columns. And if you say colon, it means all the
01:22:43.500 | columns. So if I say row two, here it is, row two, all the columns, sorry, this is row two,
01:22:51.420 | which is zero-indexed (APL starts at one), all the columns, that's going to be seven, eight, nine.
01:22:57.420 | And you can see I often use comma to print out multiple things. And I don't have to say print
01:23:01.980 | in Jupiter, it's kind of assumed. And so this is just a quick way of printing out the second row.
01:23:07.980 | And then here, every row, column two. So here is every row of column two. And here you can see three,
01:23:16.300 | six, nine. So one thing very useful to recognize is that for tensors of higher rank than one,
01:23:30.460 | such as a matrix, any trailing colons are optional. So you see this here, M2, that's the
01:23:37.900 | same as M2 comma colon. It's really important to remember. Okay, so M2, you can see the result is
01:23:44.540 | the same. So that means row two, every column. Okay, so now with all that in place, we've got
01:23:55.340 | quite an easy way. We don't need a number anymore. We can multiply, so we can get rid of that inner
01:24:03.260 | most loop. So we're going to get rid of this loop, because this is just multiplying together all of
01:24:09.660 | the corresponding rows of A with the, sorry, all the corresponding columns of a row of A with all
01:24:16.380 | the corresponding rows of a column of B. And so we can just use an element-wise operation for that.
01:24:22.300 | So here is the ith row of A, and here is the jth column of B. And so those are both, as we've seen,
01:24:35.980 | just vectors, and therefore we can do an element-wise multiplication of them,
01:24:40.300 | and then sum them up. And that's the same as a dot product. So that's handy.
01:24:46.860 | And so again, we'll do test close. Okay, it's the same. Great. And again, you'll see we kind of did
01:24:55.180 | all of our experimenting first, right, to make sure that we understood how it all worked,
01:24:59.580 | and then put it together. And then if we time it, 661 microseconds. Okay, so it's interesting. It's
01:25:06.940 | actually slower than the Numba version, which really shows you how good Numba is, but it's certainly a hell of a lot
01:25:12.220 | better than our 450 milliseconds. But we're using something that's kind of a lot more general now.
01:25:18.620 | This is exactly the same as dot, as we've discussed. So we could just use torch dot,
01:25:27.260 | torch.dot, I suppose I should say. And if we run that, okay, a little faster. It's still,
01:25:34.300 | interestingly, it's still slower than the Numba version, which is quite amazing, actually.
01:25:39.660 | All right, so that one was not exactly a speed up, but it's kind of a bit more general, which is nice.
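The two-loop version just described can be sketched like this (tiny matrices made up for illustration):

```python
import torch

def matmul(a, b):
    # The innermost loop is gone: row i of a times column j of b element-wise,
    # then summed, is exactly a dot product
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] = (a[i, :] * b[:, j]).sum()
            # equivalently: c[i, j] = torch.dot(a[i, :], b[:, j])
    return c

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
print(matmul(a, b))
```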
01:25:47.660 | Now we're going to get into something really fun,
01:25:54.780 | which is broadcasting. And broadcasting is about what if you have arrays with different shapes.
01:26:00.540 | So what's a shape? The shape is the number of rows, or the number of rows and columns,
01:26:06.140 | or the number of, what would you say, faces, rows and columns, and so forth. So for example,
01:26:13.420 | the shape of M is 3 by 3. So what happens if you multiply, or add, or do operations to tensors of
01:26:22.220 | different shapes? Well, there's one very simple one, which is if you've got a rank one tensor,
01:26:29.180 | the vector, then you can use any operation with a scalar, and it broadcasts that scalar
01:26:39.740 | across the tensor. So a is greater than zero is exactly the same as saying a is greater than tensor
01:26:47.420 | zero comma zero comma zero. So it's basically copying that across three times. Now it's not
01:26:58.860 | literally making a copy in memory, but it's acting as if we had said that. And this is the most
01:27:03.820 | simple version of broadcasting. Okay, it's broadcasting the zero across the ten, and the
01:27:09.900 | six, and the negative four. And APL does exactly the same thing. a is less than five, so zero, zero,
01:27:22.620 | one. So same idea. Okay. So we can do plus with a scalar, and we can do exactly the same thing with
01:27:41.580 | tensors of higher rank than one. So for two times a matrix, the two is just going to be broadcast
01:27:47.420 | across all the rows and all the columns. Okay, now it gets interesting. So broadcasting dates back to
01:27:59.180 | APL. But a really interesting idea is that we can broadcast not just scalars, but we can broadcast
01:28:05.820 | vectors across matrices or broadcast any kind of lower ranked tensor across higher ranked tensors,
01:28:13.740 | or even broadcast together two tensors of the same rank, but different shapes in a really
01:28:20.460 | powerful way. And as I was exploring this (I love doing this kind of computer
01:28:27.340 | archaeology), I was trying to find out where the hell this comes from. And it actually turns out
01:28:31.660 | from this email message in 1995, that the idea actually comes from a language that I'd never
01:28:41.100 | heard of called Yorick, which still apparently exists. Here's Yorick. And so Yorick talks about
01:28:51.740 | broadcasting and conformability. So what happened is this very obscure language
01:29:01.740 | has this very powerful idea. And NumPy has happily stolen the idea from Yorick that allows us to
01:29:11.820 | broadcast together tensors that don't appear to match. So let me give an example. Here's a tensor
01:29:20.060 | called C that's a vector. It's a rank one tensor, 10, 20, 30. And here's a tensor called M, which is
01:29:26.540 | a matrix. We've seen this one before. And one of them is shape three, comma, three. The other is
01:29:32.860 | shape three. And yet we can add them together. Now what's happened when we added it together?
01:29:41.420 | Well, what's happened is 10, 20, 30 got added to one, two, three. And then 10, 20, 30 got added to
01:29:50.700 | four, five, six. And then 10, 20, 30 got added to seven, eight, nine. And hopefully you can see
01:29:58.780 | this looks quite familiar. Instead of broadcasting a scalar over a higher rank tensor,
01:30:05.340 | this is broadcasting a vector across every row of a matrix.
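Spelled out with the same c and m as above:

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

# c (shape 3) is broadcast across every row of m (shape 3 x 3)
print(m + c)
# tensor([[11., 22., 33.],
#         [14., 25., 36.],
#         [17., 28., 39.]])
```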
01:30:15.260 | And it works both ways. So we can say C plus M gives us exactly the same thing. And so let me
01:30:21.180 | explain what's actually happening here. The trick is to know about this somewhat obscure method called
01:30:26.780 | expand as. And what expand as does is this creates a new thing called T, which contains exactly the
01:30:33.420 | same thing as C, but expanded or kind of copied over. So it has the same shape as M. So here's
01:30:40.700 | what T looks like. Now T contains exactly the same thing as C does, but it's got three copies of it
01:30:47.500 | now. And you can see we can definitely add T to M because they match shapes. Right? So we can say
01:30:56.380 | M plus T. We know we can do M plus T because we've already learned that you can do element-wise
01:31:01.900 | operations on two things that have matching shapes. Now, by the way, this thing T didn't actually
01:31:09.740 | create three copies. Check this out. If we call T dot storage, it tells us what's actually in memory.
01:31:15.580 | It actually just contains the numbers 10, 20, 30. But it does a really clever trick. It has a stride
01:31:23.020 | of zero across the rows and a size of three comma three. And so what that means is that it acts as
01:31:30.700 | if it's a three by three matrix. And each time it goes to the next row, it actually stays exactly
01:31:36.540 | where it is. And this idea of strides is the trick which NumPy and PyTorch and so forth use
01:31:45.180 | for all kinds of things where you basically can create very efficient ways to do things like
01:31:52.860 | expanding or to kind of jump over things and stuff like that, you know, switch between columns and
01:31:58.540 | rows, stuff like that. Anyway, the important thing here for us to recognize is that we didn't
01:32:03.100 | actually make a copy. This is totally efficient and it's all going to be run in C code very fast.
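A sketch of that expand-without-copy behaviour (stride inspection assumed from PyTorch's standard tensor API):

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

t = c.expand_as(m)
print(t)           # looks like three stacked copies of c
print(t.stride())  # (0, 1): moving down a row advances 0 elements, so no copy is made
print(torch.equal(m + c, m + t))  # broadcasting does this expand for you
```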
01:32:07.740 | So remember, this expand as is critical. This is the thing that will teach you to understand
01:32:14.140 | how broadcasting works, which is really important for implementing deep learning algorithms or any
01:32:19.980 | kind of linear algebra on any Python system, because the NumPy rules are used exactly the same
01:32:28.460 | in JAX, in TensorFlow, in PyTorch and so forth. Now I'll show you a little trick,
01:32:36.540 | which is going to be very important in a moment. If we take C, which remember is a vector containing
01:32:44.060 | 10 20 30, and we say dot unsqueeze zero, then it changes the shape from three to one comma three.
01:32:55.980 | So it changes it from a vector of length three to a matrix of one row by three columns. This will
01:33:02.860 | turn out to be very important in a moment. And you can see how it's printed. It's printed out with
01:33:06.540 | two square brackets. Now I never use unsqueeze because I much prefer doing something more
01:33:12.220 | flexible, which is if you index into an axis with a special value none, also known as np.newaxis.
01:33:20.300 | It does exactly the same thing. It inserts a new axis here. So here we'll get exactly the same thing,
01:33:28.540 | one row by all the columns, three columns. So this is exactly the same as saying unsqueezed.
01:33:35.820 | So this inserts a new unit axis. This is a unit axis, a single row
01:33:45.100 | in this dimension. And this does the same thing. So these are the same. So we could do the same
01:33:52.460 | thing and say unsqueeze one, which means now we're going to unsqueeze into the first dimension.
01:33:59.820 | So that means we now have three rows and one column. See the shape here? The shape is inserting
01:34:08.620 | a unit axis in position one, three rows and one column. And so we can do exactly the same thing
01:34:16.940 | here. Give us every row and a new unit axis in position one. Same thing. Okay. So those two are
01:34:25.020 | exactly the same. So this is how we create a matrix with one row. This is how we create a
01:34:35.180 | matrix with one column. None comma colon versus colon comma none or unsqueeze.
01:34:41.900 | We don't have to say, as we've learned before, none comma colon, because do you remember?
01:34:51.740 | Trailing colons are optional. So therefore just C none is also going to give you a row matrix,
01:34:59.660 | one row matrix. This is a little trick here. If you say dot, dot, dot, that means all of the
01:35:06.940 | dimensions. And so dot, dot, dot comma none will always insert a unit axis at the end, regardless
01:35:13.500 | of what rank a tensor is. So, yeah, so none and NP new axis mean exactly the same thing.
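All of those unit-axis tricks side by side, using the same vector c:

```python
import torch

c = torch.tensor([10., 20., 30.])

print(c.unsqueeze(0).shape)  # torch.Size([1, 3]): one row
print(c[None, :].shape)      # same thing: new unit axis in position 0
print(c.unsqueeze(1).shape)  # torch.Size([3, 1]): one column
print(c[:, None].shape)      # same thing: new unit axis in position 1
print(c[None].shape)         # trailing colons optional, still [1, 3]
print(c[..., None].shape)    # ... means every existing dimension: [3, 1]
```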
01:35:21.020 | NP new axis is actually a synonym for none. If you've ever used that, I always use none
01:35:29.260 | because why not? It's short and simple. So here's something interesting. If we go C colon,
01:35:34.860 | comma none, so let's go and check out what C colon comma none looks like. C colon comma none
01:35:42.540 | is a column. And if we say expand as M, which is three by three, then it's going to take that
01:35:51.580 | 10, 20, 30 column and replicate it 10, 20, 30, 10, 20, 30, 10, 20, 30. So we could add. So remember,
01:36:00.860 | like, remember, I'll explain: when you say matrix plus C colon comma none,
01:36:10.620 | it's basically going to do this dot expand as for you. So if I want to add this matrix here to M,
01:36:20.380 | I don't need to say dot expand as, I just write this. I just write M plus C colon comma none.
01:36:27.020 | And so this is exactly the same as doing M plus C. But now rather than adding the vector to each row,
01:36:36.060 | it's adding the vector to each column: 10, 20, 30, 10, 20, 30, 10, 20, 30.
01:36:45.500 | So that's a really simple way that we now get kind of for free thanks to this really nifty notation,
01:36:51.420 | this nifty approach that came from Yorick. So here you can see M plus C none comma
01:36:58.460 | colon is adding 10, 20, 30 to each row. And M plus C colon comma none is adding 10,
01:37:06.060 | 20, 30 to each column. All right, so that's the basic like hand wavy version. So let's
01:37:15.180 | look at like what are the rules? How does it work? Okay, so C none, colon is one by three.
01:37:23.500 | C colon comma none is three by one. What happens if we multiply C none comma colon
01:37:31.980 | by C colon comma none? Well, it's going to do, if you think about it,
01:37:37.820 | which you definitely should because thinking is very helpful.
01:37:44.620 | What is going on here? Oh, it took forever.
01:37:46.860 | Okay, so what happens if we go C none comma colon times C colon comma none? So what it's going
01:37:55.020 | to have to do is it's going to have to take this 10, 20, 30 column vector or three by one matrix
01:38:04.460 | and it's going to have to make it work across each of these rows. So what it does is expands it to be
01:38:12.300 | 10, 20, 30, 10, 20, 30, 10, 20, 30. So it's going to do it just like this. And then it's going to
01:38:18.940 | do the same thing for C none, colon. So that's going to become three rows of 10, 20, 30. So
01:38:26.460 | we're going to end up with three rows of 10, 20, 30 times three columns of 10, 20, 30,
01:38:33.660 | which gives us our answer. And so this is going to do an outer product. So it's very nifty that
01:38:43.020 | you can actually do an outer product without any special, you know, functions or anything,
01:38:50.620 | just using broadcasting. And it's not just outer products, you can do outer Boolean operations.
01:38:56.060 | And this kind of stuff comes up all the time, right? Now, remember, you don't need the comma
01:39:00.940 | colon, so get rid of it. So this is showing us all the places where it's greater than. It's kind of an
01:39:09.820 | outer, an outer Boolean, if you want to call it that. So this is super nifty and you can do all
01:39:16.540 | kinds of tricks with this because it runs very, very fast. So this is going to be accelerated in C.
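The outer product and outer Boolean from above, as a sketch with the same c:

```python
import torch

c = torch.tensor([10., 20., 30.])

# (1, 3) times (3, 1): both sides broadcast to (3, 3), giving the outer product
print(c[None, :] * c[:, None])
# tensor([[100., 200., 300.],
#         [200., 400., 600.],
#         [300., 600., 900.]])

# Outer Boolean comparison works the same way (trailing colon dropped)
print(c[None] > c[:, None])
```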
01:39:21.340 | So here are the rules. Okay. When you operate on two arrays or tensors, NumPy and PyTorch will
01:39:29.180 | compare their shapes. Okay. So remember the shape, this is a shape. You can tell it's a shape because
01:39:34.620 | we said shape and it goes from right to left. So that's the trailing dimensions. And it checks
01:39:42.300 | whether the dimensions are compatible. Now they're compatible if they're equal, right? So for example,
01:39:48.460 | if we say M times M, then those two shapes are compatible because in each case, it's just going
01:40:04.220 | to be three, right? So they're going to be equal. So if the shape in that dimension is equal,
01:40:11.580 | they're compatible, or if one of them is one and if one of them is one, then that dimension is
01:40:18.540 | broadcast to make it the same size as the other. So that's why the outer product worked. We had
01:40:28.540 | a one by three times a three by one. And so this one got copied three times to make it this long.
01:40:37.740 | And this one got copied three times to make it this long. Okay. So those are the rules. So the
01:40:46.860 | arrays don't have to have the same number of dimensions. So this is an example that comes up
01:40:51.900 | all the time. Let's say you've got a 256 by 256 by three array or tensor of RGB values. So you've got
01:40:57.500 | an image, in other words, a color image. And you want to normalize it. So you want to scale each
01:41:03.740 | color in the image by a different value. So this is how we normalize colors. So one way is you could
01:41:14.220 | multiply or divide or whatever, multiply the image by a one-dimensional array with three values.
01:41:20.060 | So you've got a 1D array. So that's just three. Okay. And then the image is 256 by 256 by three.
01:41:30.940 | And we go right to left and we check, are they the same? And we say, yes, they are.
01:41:35.180 | And then we keep going left and we say, are they the same? And if it's missing, we act as if it's
01:41:41.660 | one. And if we go, keep going, if it's missing, we act as if it's one. So this is going to be the
01:41:46.940 | same as doing one by one by three. And so this is going to be broadcast. These three
01:41:52.940 | elements will be broadcast over all 256 by 256 pixels. So this is a super fast
01:42:00.540 | and convenient and nice way of normalizing image data with a single expression. And this is exactly
01:42:06.620 | how we do it in the fast.ai library. In fact, so we can use this to dramatically speed up our
01:42:15.020 | matrix multiplication. Let's just grab a single digit just for simplicity. And I really like
01:42:21.340 | doing this in Jupyter notebooks. And if you, if you build Jupyter notebooks to explain stuff that
01:42:26.780 | you've learned in this course or ways that you can apply it, consider doing this for your readers,
01:42:31.020 | but add a lot more prose. I haven't added prose here because I want to use my voice.
01:42:36.460 | If I was, for example, in our book that we published, it's all written in notebooks and
01:42:42.700 | there's a lot more prose, obviously. But like really, I like to show every example all along
01:42:47.740 | the way using simple as possible. So let's just grab a single digit. So here's the first digit.
01:42:54.060 | So its shape is, it's a 784 long vector. And remember that our weight matrix is 784 by 10.
01:43:02.060 | So if we say digit colon comma none dot shape, then that is a 784 by 1 matrix, one column. So there's
01:43:18.460 | our matrix. And so if we then take that 784 by 1 and expand as M2, it's going to be the same
01:43:30.060 | shape as our weight matrix. So it's copied our image data for that digit across all of the 10
01:43:42.060 | vectors representing the 10 linear projections we're doing for our linear model. And so that
01:43:50.620 | means that we can take the digit colon comma none, so 784 by 1, and multiply it by the weights.
01:43:57.020 | And so that's going to get us back 784 by 10. And so what it's doing, remember, is it's basically
01:44:03.820 | looping through each of these 10 784 long vectors. And for each one of them, it's multiplying it by
01:44:13.900 | this digit. So that's exactly what we want to do in our matrix multiplication. So originally,
01:44:23.340 | we had, well not originally, most recently I should say, we had this dot product where we were
01:44:31.740 | actually looping over j, which was the columns of b. So we don't have to do that anymore,
01:44:40.540 | because we can do it all at once by doing exactly what we just did. So we can take the i-th row
01:44:48.780 | and all the columns and add an axis to the end. And then just like we did here,
01:45:01.100 | multiply it by b, and then .sum(). And so that is, again, exactly the same thing. That is another
01:45:10.780 | matrix multiplication, doing it using broadcasting. Now this is like
01:45:15.660 | tricky to get your head around. And so if you haven't done this kind of broadcasting before,
01:45:24.220 | it's a really good time to pause the video and look carefully at each of these four cells
01:45:31.820 | and understand: what did I do there? Why did I do it? What am I showing you? And then experiment
01:45:39.260 | with trying it yourself. And so remember that we started with m1[0], right? So just like we have a[i] here,
01:45:48.220 | okay? So that's why we've got a[i, :, None], because this digit is actually m1[0].
01:45:55.740 | This is like m1[0][:, None]. So this line is doing exactly the same thing as this here,
01:46:04.700 | plus the sum. So let's check that this matmul gives the same result as it used to, that it's still working.
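Putting it together, the broadcast version of matmul looks something like this. A sketch in NumPy rather than the PyTorch used in the lesson; the array names `a` and `b` follow the lesson's loop, but the sizes here are illustrative:

```python
import numpy as np

def matmul(a, b):
    """Matrix multiply, replacing the two inner loops with broadcasting."""
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br, "inner dimensions must match"
    c = np.zeros((ar, bc))
    for i in range(ar):
        # a[i, :, None] has shape (ac, 1); it broadcasts against b's (br, bc),
        # and summing over axis 0 produces row i of the result.
        c[i] = (a[i, :, None] * b).sum(axis=0)
    return c

a = np.random.rand(5, 784)   # stand-in for a mini-batch of 5 flattened digits
b = np.random.rand(784, 10)  # stand-in for the weight matrix
print(np.allclose(matmul(a, b), a @ b))  # → True
```

Only the loop over rows remains at the Python level, which is why this version is so much faster than the fully nested loops.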
01:46:12.780 | And the speed of it, okay, not bad: 137 microseconds. So we've now gone from a time
01:46:22.460 | of about 500 milliseconds to about 0.1 milliseconds. Funnily enough, now that I think
01:46:28.620 | about it, my MacBook Air is an M2, whereas this Mac Mini is an M1, so it's a little bit slower;
01:46:33.500 | my Air was a bit faster than 0.1 milliseconds. So overall, we've got about a 5,000 times
01:46:40.860 | speed improvement. So that is pretty exciting. And since it's so fast now, there's no need to
01:46:48.700 | use a mini batch anymore. If you remember, we used a mini batch of, where is it? Of five images.
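Because the broadcast version leaves only the loop over rows in Python, running it over a dataset-sized matrix becomes practical. Here is a sketch with random data standing in for the 50,000 flattened MNIST images (NumPy here; the lesson uses PyTorch):

```python
import numpy as np

x = np.random.rand(50_000, 784).astype(np.float32)  # stand-in for the full training set
w = np.random.rand(784, 10).astype(np.float32)      # stand-in for the weight matrix

# One Python-level loop over rows; broadcasting handles everything else.
out = np.stack([(row[:, None] * w).sum(axis=0) for row in x])
print(out.shape)  # → (50000, 10)
```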
01:47:00.540 | But now we can actually use the whole data set because it's so fast. So now we can do the whole
01:47:05.020 | data set. There it is. We've now got 50,000 by 10, which is what we want. And so it's taking us only
01:47:15.580 | 656 milliseconds now to do the whole data set. So this is actually getting to a point now where we
01:47:20.860 | could start to create and train some simple models in a reasonable amount of time. So that's good
01:47:26.220 | news. All right. I think that's probably a good time to take a break. We don't have too much more
01:47:35.100 | of this to go, but I don't want to keep you guys up too late. So hopefully you learned something
01:47:42.460 | interesting about broadcasting today. I cannot overemphasize how widely useful this is in
01:47:51.900 | all deep learning and machine learning code. It comes up all the time. It's basically our
01:47:57.100 | number one, most critical kind of foundational operation. So yeah, take your time practicing
01:48:05.820 | it and also good luck with your diffusion homework from the first half of the lesson.
01:48:11.980 | Thanks for joining us, and I'll see you next time.