
Lesson 11 2022: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Introduction
0:20 Showing student’s work
13:03 Workflow on reading an academic paper
16:20 Read DiffEdit paper
26:27 Understanding the equations in the “Background” section
46:10 3 steps of DiffEdit
51:42 Homework
59:15 Matrix multiplication from scratch
68:47 Speed improvement with Numba library
79:25 Frobenius norm
85:54 Broadcasting with scalars and matrices
99:22 Broadcasting rules
102:10 Matrix multiplication with broadcasting

Transcript

Hi everybody, welcome to lesson 11. This is the third lesson in part two, depending on how you count things. There's been a lesson A and lesson B, it's kind of the fifth lesson in part two, I don't know what it is. So we'll just stick to calling it lesson 11 and avoid getting too confused.

I'm already confused. My goodness, I've got so much stuff to show you. I'm only going to show you a tiny fraction of the cool stuff that's been happening on the forum this week, but it's been amazing. I'm going to start by sharing this beautiful video from John Robinson, and I've never seen anything like this before.

As you can see, it's very stable and it's really showing this beautiful movement between seasons. So what I did on the forum was I said to folks, "Hey, you should try interpolating between prompts," which is what John did. And I also said, "You should try using the last image of the previous prompt interpolation as the initial image for the next prompt." And anyway, here it is, it came out beautifully, John was the first to get that working, so I was very excited about that.

And the second one I wanted to show you is this really amazing work from Sebastian (Seb Derhi), who did something that I'd been thinking about as well. I'm really thrilled that he also thought about this. He noticed that this update we do, unconditional embeddings plus guidance times (text embeddings minus unconditional embeddings), has a bit of a problem, which is that it gets big.

To show you what I mean by "it gets big", imagine that we've got a couple of vectors on this chart here. So we've got the original unconditional piece here; let's say this is U.

And then we add to that some amount of T minus U. So if we've got like T, let's say it's huge, right? And we've got U again. Then the difference between those is the vector which goes here, right? Now you can see here that if there's a big difference between T and U, then the eventual update which actually happens is, oopsie daisy, I thought that was going to be an arrow.

Let's try that again. The eventual update which happens is far bigger than the original update. And so it jumps too far. So this idea is basically to say, well, let's make it so that the update is no longer than the original unconditioned update would have been. And we're going to be talking more about norms later, but basically we scale it by the ratio of the norms.
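To pin that down, here's a rough sketch in PyTorch of the update rules as I've just described them. This isn't Sebastian's actual code; the names u (unconditional prediction), t (text-conditioned prediction), g (guidance scale), and the mode flags are just mine for illustration.

```python
import torch

def guided_noise(u, t, g=7.5, mode="rescale_prediction"):
    # u: unconditional noise prediction, t: text-conditioned noise prediction
    diff = t - u
    if mode in ("rescale_update", "both"):
        # scale the update so it's no longer than the unconditional prediction
        diff = diff * (torch.norm(u) / torch.norm(diff))
    pred = u + g * diff          # standard classifier-free guidance update
    if mode in ("rescale_prediction", "both"):
        # scale the whole prediction back to the unconditional prediction's norm
        pred = pred * (torch.norm(u) / torch.norm(pred))
    return pred
```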

And what happens is we start with this astronaut and we move to this astronaut. And it's kind of, it's a subtle change, but you can see there's a lot more before, after, before, after, a lot more texture in the background. And like on the Earth, there's a lot more detail before, after, you see that?

And even little things like, before, the bridle, the reins or whatever, were pretty flimsy. Now they look quite proper. So it's made quite a big difference just to get this scaling correct. So there's a couple of other things that Sebastian tried, which I'll explain in a moment, but you can see how some of them actually resulted in changing the image.

And this one's actually important because the poor horse used to be missing a leg and now it's not missing a leg, so that's good. And so here's the detailed one with its extra leg. So how did he do this? Well, so what he did was he started with this unconditioned prompt plus the guidance times the difference between the conditional and unconditioned.

And then, as we discussed, the next version we saw is to basically just take that prediction and scale it according to the difference in the lengths. The norm is basically the length of the vector. And so this is the second one I did in lesson nine, you'll see it's gone from here.

So when we go from 1a to 1b, you can see here, it's got, look at this, this boot's gone from nothing to having texture. This whatever the hell this thing is, suddenly he's got texture. And look, we've now got proper stars in the sky. It's made a really big difference.

And then the second change is not just to rescale the whole prediction, but to rescale the update. When we rescale the update, it actually not surprisingly changes the image entirely because we're now changing the direction it goes. And so I don't know, is this better than this? I mean, maybe, maybe not, but I think so, particularly because this was the difference that added the correct fourth leg to the horse before.

And then we can do both. We can rescale the difference and then rescale the result. And then we get the best of both worlds, as you can see, big difference. We get a nice background. This weird thing on his back's actually become an arm. That's not what a foot looks like.

That is what a foot looks like. So these little details make a big difference, as you can see. So this is a really cool new thing, or two really cool new things. New things tend to have wrinkles, though. Problem number one is that after I shared Sebastian's approach on Twitter, Ben Poole, who's at Google Brain, I think, if I remember correctly, pointed out that this already exists.

He thinks it's the same as what's shown in this paper, which is a diffusion model for text-to-speech. I haven't read the paper yet to check whether it's got all the different options or whether it's checked them all out like this. So maybe this is reinventing something that already existed and putting it into a new field, which would still be interesting.

Anyway, so hopefully, folks, on the forum, you can help figure out whether this paper is actually showing the same thing or not. And then the other interesting thing was John Robinson got back in touch on the forum and said, "Oh, actually, that tree video doesn't actually do what we think it does at all.

There's a bug in his code, and despite the bug, it accidentally worked really well." So now we're in this interesting question of trying to figure out, "Oh, how did he create such a beautiful video by mistake?" And okay, so reverse engineering exactly what the bug did, and then figuring out how to do that more intentionally.

And this is great, right? It's really good to have a lot of people working on something, and the bugs often, yeah, they tell us about new ideas. So that's very interesting. So watch this space where we find out what John actually did and how it worked so well. And then something that I just saw like two hours ago in the forum, which I had never thought of before, but I thought of something a little bit similar.

Rakhil Prashanth said, "Well, what if we took this?" So as you can see, all the students are really bouncing ideas off each other. It's like, "Oh, it's interesting. We're doing different things with the guidance scale. What if we take the guidance scale and, rather than keeping it at 7.5 all the time, reduce it?" And this is a little bit similar to something I suggested to John a few weeks ago; he was doing some stuff with modifying gradients based on additional loss functions.

And I said to him, "Maybe you should just use them like occasionally at the start." Because I think the key thing is once the model kind of knows roughly what image it's trying to draw, even if it's noisy, you can let it do its thing. And this is exactly what's happening here is Rakhil's idea is to say, "Well, let's decrease the guidance scale." So at the end, it's basically zero.

And so once it's kind of going in the right direction, we let it do its thing. So this little doggy is with the normal 7.5 guidance scale. Now have a look, for example, at its eye here. It's pretty uninteresting, pretty flat. And if I go to the next one, as you can see now, actually look at the eye.

That's a proper eye before, totally glassy, black, now proper eye. Or like look at all this fur, very textured, previously very out of focus. So this is again a new technique. So I love this. You folks are trying things out, and some things are working, and some things not working, and that's all good.
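Just to make the guidance-decay idea concrete, a schedule like this is all it takes; this is a minimal sketch, and the exact start value, end value, and decay shape are things you'd want to experiment with:

```python
import numpy as np

def guidance_schedule(num_steps=50, start=7.5, end=0.0):
    # Linearly decay the guidance scale so that by the last steps the model
    # is essentially running unconditionally and can "do its thing"
    return np.linspace(start, end, num_steps)

# Inside your sampling loop you'd use g = schedule[i] when combining the
# conditional and unconditional noise predictions:
#   noise = noise_uncond + g * (noise_text - noise_uncond)
schedule = guidance_schedule()
print(schedule[:3], schedule[-3:])
```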

I kind of feel like you're going to have to slow down because I'm having trouble keeping up with you all. But apart from that, this is great. Good work. I also wanted to mention on a different theme to check out Alex's notes on the lesson because I thought he's done a fantastic job of showing like how to study a lesson.

And so what Alex did, for example, was he made a list in his notes of all the different steps we did as we started the from-the-foundations work: what library each thing comes from, links to the documentation. And I know that Alex's background actually is history, not computer science.

And so for somebody moving into a different field like this, this is a great idea, particularly to be able to look at like, OK, what are all the things that I'm going to have to learn and read about? And then he did something which we always recommend, which is to try the lesson on a new data set.

And he very sensibly picked out the fashion MNIST data set, which is something we'll be using a lot in this course because it's a lot like MNIST. And it's just different enough to be interesting. And so he described in his post or his notes how he went about doing that.

And then something else I thought was interesting in his notes at the very end was he just jotted down my tips. It's very easy when I throw a tip out there to think, oh, that's interesting. That's good to know. And then it can disappear. So here's a good way to make sure you don't forget about all the little tricks.

And I think I've put those notes in the forum wiki so you can check them out if you'd like to learn from them as well. So I think this is a great role model. Good job, Alex. OK, so during the week, Jono taught us about a new paper that had just come out called DiffEdit, and he told us he thought this was an interesting paper.

And it came out during the week and I thought it might be good practice for us to try reading this paper together. So let's do that. So here's the paper, DiffEdit. And you'll find that probably the majority of papers that you come across in deep learning will take you to arXiv.

arXiv is a preprint server, so these are papers that have not been peer reviewed. I would say in our field we don't generally, or I certainly don't generally, care about that at all, because we have code, we can try it, we can see whether it works or not.

You know, we tend to be very, you know, most papers are very transparent about here's what we did and how we did it and you can replicate it. And it gets a huge amount of peer review on Twitter. So if there's a problem generally within 24 hours, somebody has pointed it out.

Now, we use arXiv a lot, and if you wait until something's been peer reviewed, you'll be way out of date, because this field is moving so quickly. So here it is on arXiv, and we can read it by clicking on the PDF button. I don't do that; instead I click on this little button up here, which is the Save to Zotero button.

So I figured I'd show you like my preferred workflows. You don't have to do the same thing, there are different workflows, but here's one that I find works very well, which is Zotero is a piece of free software that you can download for Mac, Windows, Linux and install a Chrome connector.

Oh, Tanishk is saying the button is covered. All right, so in my taskbar, I have a button that I can click that says Save to Zotero. Sorry, not taskbar, Chrome menu bar. And when I click it, I'll show you what happens. So after I've downloaded this, the paper will automatically appear here in this software, which is Zotero.

And so here it is, DiffEdit. And you can see it's told us, it's got here the abstract, the authors, where it came from. And so later on, I can go and like, if I want to check some detail, I can go back and see the URL, I can click on it, pops up.

And so in this case, what I'm going to do is I'm going to double click on it. And that brings up the paper. Now, the reason I like to read my papers in Zotero is that I can, you know, annotate them, edit them, tag them, put them in folders and so forth, and also add them to my kind of reading list directly from my web browser.

So as you can see, you know, I've started this Fast Diffusion folder, which is actually a group library, which I share with the other folks working on this Fast Diffusion project that we're all doing together. And so we can all see the same paper library. So Maribou on YouTube chat is asking, is this better than Mendeley?

Yeah, I used to use Mendeley and it's kind of gone downhill. I think Zotero is far, far better, but they're both very similar. Okay, so when you double click on it, it opens up and here is a paper. So reading a paper is always extremely intimidating. And so you just have to do it anyway and you have to realize that your goal is not to understand every word.

Your goal is to understand the basic idea well enough that, for example, when you look at the code, hopefully it comes with code, most things do, that you'll be able to kind of see how the code matches to it and that you could try writing your own code to implement parts of it yourself.

So over on the left, you can open up the sidebar here. So I generally open up the table of contents and get a bit of a sense of, okay, so there's some experimental results, there's some theoretical results, introduction, related work, okay, tells us about this new diff edit thing, some experiments, okay.

So that's a pretty standard approach that you would see in papers. So I would always start with the abstract, okay. So what's it saying this does? So generally it's going to be some background sentence or two about how interesting this field is. It's just saying, wow, image generation is cool, which is fine.

And then they're going to tell us what they're going to do, which is they're going to create something called DiffEdit. So what is it for? It's going to use text-conditional diffusion models. So we know what those are now. That's what we've been using.

That's where we type in some text and get back an image that matches the text. But this is going to be different. It's the task of semantic image editing. Okay. We don't know what that is yet. So let's put that aside and think, okay, let's make sure we understand that later.

The goal is to edit an image based on a text query. Oh, okay. So we're going to edit an image based on text. How on earth would you do that? Ah, they're going to tell us right away what this is. Semantic image editing. It's an extension of image generation with an additional constraint, which is the generated image should be as similar as possible to the given input.

And so generally, as they've done here, there's going to be a picture that shows us what's going on. And so in this picture, you can see here an example, here's an input image. And originally it was attached to a caption, "a bowl of fruits". Okay. We want to change this into a bowl of pears.

So we type "a bowl of pears" and it generates, oh, a bowl of pears. Or we could change it from a bowl of fruits to a basket of fruits, and oh, it's become a basket of fruits. Okay. So I think I get the idea, right? What it's saying is that we can edit an image by typing what we want that image to represent.

So this actually looks a lot like the paper that we looked at last week. So that's cool. So the abstract says that currently, so I guess there are current ways of doing this, but they require you to provide a mask. That means you have to basically draw the area you're replacing.

Okay. So that sounds really annoying, but "our main contribution": what this paper does is automatically generate the mask. So you simply just type in the new query and get the new image. That sounds actually really impressive. So if you read the abstract and you think, I don't care about doing that, then you can skip the paper, or look at the results.

And if the results don't look impressive, then just skip the paper. So that's kind of your first point where we can be like, okay, we're done. But in this case, this sounds great. The results look amazing. So I think we should keep going. Okay, they achieve state-of-the-art editing performance, of course.

Fine. And they try some things out, right, whatever. Okay. So the introduction to a paper is going to try to give you a sense of what they're trying to do. And so this first paragraph here is just repeating what we've already read in the abstract and repeating what we see in figure one.

So it's saying that we can take a text query, like a basket of fruits, see the examples. All right, fine. We'll skip through there. So the key thing about academic papers is that they are full of citations. You should not expect to read all of them, because if you do, each of those citations is itself full of citations, and then they're full of citations.

And before you know it, you've read the entire academic literature, which has taken you 5,000 years. So for now, let's just recognize that it says text-conditional image generation is undergoing a revolution. Here's some examples. Well, fine. We actually already know that. Okay, DALL-E is cool. Latent diffusion.

That's what we've been using. That's cool. Imagen, apparently that's cool. All right. So we kind of know that. So generally there's this, like, "the area we're working on is important"; in this case, it's important. So we can skip through it pretty quickly. They say vast amounts of data are used.

Yes, we know. Okay. So diffusion models are interesting. Yes, we know that they denoise starting from Gaussian noise. We know that. So you can see there's a lot of stuff that, once you're kind of in the field, you can skip over pretty quickly. You can guide it using CLIP guidance.

Yeah, that's what we've been doing. We know about that. Oh, wait, this is new: "or by inpainting, copy-pasting pixel values outside a mask". All right. So there's a new technique that we haven't done, but I think it makes a lot of intuitive sense. That is, during the diffusion process, if there are some pixels you don't want to change, such as all the ones that aren't orange here, you can just paste them back in from the original after each stage of the diffusion.

All right. That makes perfect sense. If I want to know more about that, I could always look at this paper, but I don't think I do for now. Okay. And again, it's just repeating something they've already told us that they require us to provide a mask. So that's a bit of a problem.

And then, you know, this is interesting. It also says that when you mask out an area, that's a problem, because if you're trying to, for example, change a dog into a cat, you want to keep the animal's color and pose. So this is a new technique: not deleting a section and replacing it with something else, but actually taking advantage of knowledge about what that thing looked like. So there are two cool new things here.

So hopefully at this point, we know what they're trying to achieve. If you don't know what they're trying to achieve when you're reading a paper, the paper won't make any sense. Um, so again, that's a point where you should stop. Maybe this is not the right time to be reading this paper.

Maybe you need to read some of the references. Maybe you need to look more at the examples so you can always skip straight to the experiments. So I often skip straight to the experiments. In this case, I don't need to because they've put enough experiments on the very first page for me to see what it's doing.

So yeah, don't always read it from top to bottom. Um, okay. So all right. So they've got some examples of conditioning a diffusion model on an input without a mask. Okay. For example, you can use a noised version of the input as a starting point. Hey, we've done that too.

So as you can see, we've already covered a lot of the techniques that they're referring to here. Something we haven't done, but makes a lot of sense is that we can look at the distance to the input image as a loss function. Okay, that makes sense to me and there's some references here.

All right, so we're going to create this new thing called diffedit. It's going to be amazing. Wait till you check it out. Okay, fine. Okay. So that's the introduction. Hopefully you found that useful to understand what we're trying to do. The next section is generally called related work as it is here.

And that's going to tell us about other approaches. So if you're doing a deep dive, this is a good thing to study carefully. I don't think we're going to do a deep dive right now. So I think we can happily skip over it. We could kind of do a quick glance: oh, image editing includes colorization, retouching, style transfer.

Okay, cool. Lots of interesting topics. Definitely getting more excited about this idea of image editing. And there are some different techniques. You can use CLIP guidance, okay, but that can be computationally expensive. We can use diffusion for image editing. Okay, fine. We can use CLIP to help us. So there's a lot of repetition in these papers as well, which is nice because we can skip over it pretty quickly.

More about the high computational costs. Okay, so they're saying this is going to be not so computationally expensive. That sounds hopeful. And often the very end of the related work is most interesting, as it is here, where they've talked about how somebody else has done concurrent work. Somebody else was working on the same thing at exactly the same time.

And they've looked at some different approach. Okay, so not sure we learned too much from the related work, but if you were trying to really do the very, very best possible thing, you could study the related work and get the best ideas from each. Okay, now, background. So this is where it starts to look scary.

I think we could all agree. And this is often the scariest bit, the background. This is basically saying like, mathematically, here's how the problem that we're trying to solve is set up. And so we're going to start by looking at denoising, diffusion, probabilistic models, DDPM. Now, if you've watched lesson 9b with Wasim and Tanishk, then you've already seen some of the math of DDPM.

And the important thing to recognize is that basically no one in the world pretty much is going to look at these paragraphs of text and these equations and go, oh, I get it. That's what DDPM is. That's not how it works, right? To understand DDPM, you would have to read and study the original paper, and then you would have to read and study the papers it's based on and talk to lots of people and watch videos and go to classes just like this one.

And after a while, you'll understand DDPM. And then you'll be able to look at this section and say, oh, okay, I see, they're just talking about this thing I'm already familiar with. So this is meant to be a reminder of something that you already know. It's not something you should expect to learn from scratch.

So let me take you through these equations somewhat briefly because Wasim and Tanishk have kind of done them already because every diffusion paper pretty much is going to have these equations. Okay. So, oh, and I'm just going to read something that Jono has pointed out in the chat. He says it's worth remembering the background is often written last and tries to look smart for the reviewers, which is correct.

So feel free to read it last too. Yeah, absolutely. I think the main reason to read it is to find out what the different letters mean, what the different symbols mean, because they'll probably refer to them later. But in this case, I want to actually take this as a way to learn how to read math.

So let's start with this very first equation, which how on earth do you even read this? So the first thing I'll say is that this is not an E, right? It's a weird looking E. And the reason it's a weird looking E is because it's a Greek letter. And so something I always recommend to students is that you learn the Greek alphabet because it's much easier to be able to actually read this to yourself.

So here's another one, right? If you don't know that's called theta, I guess you have to read it as like circle with line through it. It's just going to get confusing trying to read an equation where you just can't actually say it out loud. So what I suggest is that you learn that learn the Greek alphabet and let me find the right place.

It's very easy to look up; just on Wikipedia there's the Greek alphabet. And if we go down here, you'll see they've all got names, and we can go and try and find our one, the curvy E. Okay, here it is, epsilon. And oh, circle with a line through it: theta.

All right, so practice and you will get used to recognizing these. So you've got epsilon theta. This is just a weird curly L; that's used for the loss function. Okay, so how do we find out what this symbol means and what this symbol means? Well, there are a few ways to do it.

One way, which is kind of cool, is we can use a program called MathPix. What it does is you basically select anything on your screen, and it will turn it into LaTeX. So that's one way you can do this: you can select it on the screen.

It turns it into LaTeX. And the reason it's good to turn it into LaTeX is because LaTeX is written as actual stuff that you can search for on Google. So that's technique number one. Technique number two is you can download the other formats of the paper and that will have a download source.

And if we say download source, then what we'll be able to do is we'll be able to actually open up that LaTeX and have a look at it. So we'll wait for that to download while it's happening. Let's keep moving along here. So in this case, we've got these two bars.

So can we find out what that means? So we could try a few things. We could try looking for two bars, maybe math notation. Oh, here we are. Looks hopeful. What does this mean in mathematics? Oh, and here there's a glossary of mathematical symbols. Here there's a meaning of this in math.

So that looks hopeful. Okay, so it definitely doesn't look like this. It's not between two sets of letters, but it is around something that looks hopeful. So it looks like we found it. It's a vector norm. Okay, so then you can start looking for these things up. So we can say norm or maybe vector norm.

And so once you can actually find the term, then we kind of know what to look for. Okay, so in our case, we've got this surrounding all this stuff, and then there's twos here and here. What's going on here? All right, if we scroll through, oh, this is pretty close actually.

So, okay, two bars can mean a matrix norm, versus a single bar for a vector norm; that's just here in particular. So it looks like we don't have to worry too much about whether it's one or two bars. Oh, and here's the definition. Oh, that's handy. So we've got the subscript-2 one.

All right, so it's equal to root sum of squares. So that's good to know. So this norm thing means a root sum of squares. But then we've got a two up here. Well, that just means squared. Ah, so this is a root sum of squares squared. Well, the square of a square root is just the thing itself.

Ah, so actually this whole thing is just the sum of squares. It's a bit of a weird way to write it, in a sense. We could perfectly well have just written it as, you know, like sum of, you know, whatever it is, squared. Fine. But there we go. Okay, and then what about this thing here?

Weird E thing. So how would you find out what the weird E thing is? Okay, so our LaTeX has finally finished downloading. And if we open it up, we can find there's a .tex file in here. Here we are, main.tex. So we'll open it. And it's not the most amazingly smooth process, but what we can do is say, okay, it's just after where it says "minimizing the denoising objective".

Okay, so let's search for "minimizing the", oh, here it is, "minimizing the denoising objective". So here's the LaTeX; let's get the paper back on the screen at the same time. Okay, so here it is: \mathcal{L} equals \mathbb{E} with subscripts x-naught, t, epsilon, okay. And here's that vertical bar thing: epsilon minus epsilon theta of x t, and then the bar thing with the 2 and the 2.

All right, so the new thing we've got is \mathbb{E}. Okay, so finally we've got something we can search for: mathbb E. Ah, fantastic: what does \mathbb{E} mean? That's the expected value operator. Aha, fantastic, all right. So it takes a bit of fussing around, but once you've got MathPix working, or actually another thing you could try, because MathPix is ridiculously expensive in my opinion, is a free version called pix2tex, which is actually a Python thing. You could even have fun playing with it, because the whole thing is just a PyTorch Python script. It even describes how it uses a transformer model, and you can train it yourself in Colab and so forth. But basically, as you can see, you can snip and convert to LaTeX, which is pretty awesome.

So you could use this instead of paying the MathPix folks. Anyway, we are on the right track now, I think: expected value. And then we can start reading about what expected value is, and you might actually remember it, because we did a bit of it in high school, at least in Australia we did.

It's basically like this; let's maybe jump over here. The expected value of something is saying: what's the likely value of that thing? So for example, let's say you toss a coin, which could be heads or it could be tails, and you want to know how often it's heads. So maybe we'll call heads one and tails zero. So you toss it and you get a 1, 0, 0, 1, 1, 0, 1, 0, 1, and so forth. And then you can calculate the mean of that: if that's x, you can calculate x bar, the mean, which would be the sum of all that divided by the count of all that. So it'd be one, two, three, four, five (five in total), divided by one, two, three, four, five, six, seven, eight, nine. So that would be the mean.

But the expected value is like: well, what do you expect to happen? And we can calculate that by adding up, for each possibility x, how likely is x, and what score do you get if you get x. So in this example of heads and tails, our two possibilities are that we either get heads or we get tails. For the version where x is heads, the probability is 0.5 and the score is 1. And then what about tails? For tails, the probability is 0.5 and the score is 0. And so overall the expected value is 0.5 times 1, plus 0, which is 0.5. So our expected score if we're tossing a coin is 0.5, if getting heads is a win.
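Written out as a formula, that's just the probability-weighted sum of the scores:

```latex
\mathbb{E}[X] = \sum_x p(x)\,x
% Coin toss, scoring heads as 1 and tails as 0:
%   \mathbb{E}[X] = 0.5 \cdot 1 + 0.5 \cdot 0 = 0.5
```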

Let me give you another example. Let's say that we're rolling a die, and we want to know what the expected score is if we roll it. So again we could roll it a bunch of times and see what happens, and we could sum all that up, like before, and divide it by the count, and that'll tell us the mean for this particular example. But what's the expected value more generally? Well, again, it's the sum over all the possibilities of the probability of each possibility times its score. The possibilities for rolling a die are that you can get a one, a two, a three, a four, a five, or a six. The probability of each one is a sixth, and the score you get is just the number itself. So then you can multiply all of these together and sum them up, which would be 1/6 plus 2/6 plus 3/6 plus 4/6 plus 5/6 plus 6/6, and that gives you the expected value of that particular thing, rolling a die. So that's what expected value means. That's a really important concept that's going to come up a lot as we read papers. In particular, it's telling us what all the things are that we're averaging over, what the expectation is over.

And so there's a whole lot of letters here. You're not expected to just know what they are; in fact, in every paper they could mean totally different things, so you have to look immediately underneath, where they'll be defined. So x0 is an image, an input image. Epsilon is the noise, and the noise has a mean of zero and a standard deviation of I, which, if you watched lesson 9b, you'll know is like a standard deviation of 1 when you're doing multiple normal variables. And then this is kind of confusing: epsilon just on its own is a normally distributed random variable, so it's just grabbing random numbers, but epsilon theta is a noise estimator. That means it's a function; you can tell it's a function because it's got these parentheses right next to it. And presumably most functions like this in these papers are neural networks.

So we're finally at a point where this actually is going to make perfect sense. We've got the noise, we've got the prediction of that noise, we subtract one from the other, we square it, and we take the expected value. In other words, this is mean squared error. So wow, that's a lot of fiddling around to find out that this whole thing here means mean squared error. The loss function is the mean squared error, and unfortunately I don't think the paper ever says that; it says "minimising the denoising objective L blah-de-blah", but anyway, we got there eventually.

Fine. As well as learning about x0, we also learn here about xt. And xt is the original un-noised image times some number, plus some noise times one minus that number. Hopefully you'll recognise this from lesson 9b: this is the thing where we reduce the value of each pixel and we add noise to each pixel. So that's that.

All right, I'm not going to keep going through it, but you can basically get the idea: once you know what you're looking for, the equations do actually make sense. But remember, all this is doing is background; it's telling you what already exists. So this is telling you what a DDPM is, and then it tells you what a DDIM is. Just think of DDIM as a more recent version of DDPM; it's some very minor changes to the way it's set up which allow us to go faster. The thing is, though, once we keep reading, what you'll find is that none of this background actually matters. But I thought we'd go through it just to get a sense of what's in a paper. So for our purposes it's enough to know that DDPM and DDIM are kind of the foundational papers on which diffusion models today are based.
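For reference, here's roughly how those two background equations look written out cleanly. This is my reconstruction from the discussion above, so the exact subscripts and scaling factors in the paper may differ slightly:

```latex
% Denoising objective: mean squared error between the true and predicted noise
\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\,
  \big\lVert \epsilon - \epsilon_\theta(x_t, t) \big\rVert_2^2

% Noised image at step t: scale the pixels down and add scaled noise
x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```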

Okay, so the encoding process encodes an image onto a latent variable, and this is basically adding noise; this is called DDIM encoding. The thing that goes from the input image to the noised image they're going to call capital E with a subscript r, where r is the encoding ratio; that's basically how much noise we're adding. If you use small steps, then decoding that, so going backwards, gives you back the original image. Okay, so that's the stuff we've learned about; that's what diffusion models are.

All right, so this looks like a very useful picture, so maybe let's take a look and see what this says, so what is DiffEdit? DiffEdit has three steps. Step one, we add noise to the input image, that sounds pretty normal, here's our input image x0, okay, and we add noise to it, fine, and then we denoise it, okay, fine.

Ah, but we denoise it twice. One time we denoise it using the reference text R, horse, or this special symbol here means nothing at all, so either unconditional or horse. All right, so we do it once using the word horse, so we take this and we decode it, estimate the noise, and then we can remove that noise on the assumption that it's a horse.

Then we do it again, but the second time we do that noise, when we calculate the noise, we pass in our query Q, which is zebra. Wow, those are going to be very different noises. The noise for horse is just going to be literally these Gaussian pixels, these are all dots, right, because it is a horse, but if the claim is no, no, this is actually a zebra, then all of these pixels here are all wrong, they're all the wrong color.

So the noise that's calculated if we say this is our query is going to be totally different to the noise if we say this is our query, and so then we just take one minus the other, and here it is here, right, so we derive a mask based on the difference in the denoising results, and then you take that and binarize it, so basically turn that into ones and zeros.

So that's actually the key idea, that's a really cool idea, which is that once you have a diffusion model that's trained, you can do inference on it where you tell it the truth about what the thing is, and then you can do it again but lie about what the thing is, and in your lying version it's going to say okay, all the stuff that doesn't match zebra must be noise.

And so the difference between the noise prediction when you say hey it's a zebra versus the noise prediction when you say hey it's a horse will be all the pixels that it says no, these pixels are not zebra. The rest of it, it's fine, there's nothing particularly about the background that wouldn't work with a zebra.
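If you want to have a go at the homework later, here's a very rough sketch of that mask step, assuming a diffusers-style unet and scheduler like the ones from the lesson 9 notebook. The helper names and the threshold are my own choices, not the paper's:

```python
import torch

@torch.no_grad()
def diffedit_mask(latents, t, unet, scheduler, ref_emb, query_emb,
                  n=10, thresh=0.5):
    # Where do the noise estimates for the reference text ("horse") and the
    # query text ("zebra") disagree? Average over a few noise samples for
    # stability, normalise, then binarise into a mask.
    diffs = []
    for _ in range(n):
        noise = torch.randn_like(latents)
        noised = scheduler.add_noise(latents, noise, t)
        eps_ref = unet(noised, t, encoder_hidden_states=ref_emb).sample
        eps_query = unet(noised, t, encoder_hidden_states=query_emb).sample
        diffs.append((eps_query - eps_ref).abs().mean(dim=1, keepdim=True))
    diff = torch.stack(diffs).mean(dim=0)
    diff = (diff - diff.min()) / (diff.max() - diff.min())  # rescale to [0, 1]
    return (diff > thresh).float()                          # binary mask
```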

Okay, so that's step one. So then step two is we take the horse and we add noise to it. Okay, that's this XR thing that we learned about before. And then step three, we do decoding conditioned on the text query using the mask to replace the background with pixel values.

So this is like the idea that we heard about before, which is that during the inference time as you do diffusion from this fuzzy horse, what happens is that we do a step of diffusion inference and then all these black pixels we replace with the noised version of the original.

And so we do that multiple times and so that means that the original pixels in this black area won't get changed. And that's why you can see in this picture here and this picture here, the backgrounds all the same. And the only thing that's changed is that the horse has been turned into a zebra.
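And here's a sketch of steps two and three, again assuming a diffusers-style unet and scheduler: noise the input, then denoise conditioned on the query, pasting the noised original back in outside the mask after every step so the background can't drift:

```python
import torch

@torch.no_grad()
def diffedit_edit(latents, mask, unet, scheduler, query_emb, timesteps):
    # mask: broadcastable over the latents, e.g. shape (1, 1, h, w)
    noise = torch.randn_like(latents)
    x = scheduler.add_noise(latents, noise, timesteps[0])   # step 2: noise the input
    for t in timesteps:
        # step 3: denoise conditioned on the query text ("zebra")
        eps = unet(x, t, encoder_hidden_states=query_emb).sample
        x = scheduler.step(eps, t, x).prev_sample
        # outside the mask, copy back the original, noised to this step
        x_orig = scheduler.add_noise(latents, noise, t)
        x = mask * x + (1 - mask) * x_orig
    return x
```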

So this paragraph describes it and then you can see here it gives you a lot more detail. And the detail often has all kinds of like little tips about things they tried and things they found, which is pretty cool. So I won't read through all that because it says the same as what I've already just said.

One of the interesting little things they note here actually is that this binarized mask, so this difference between the R decoding and the Q decoding tends to be a bit bigger than the actual area where the horse is, which you can kind of see with these legs, for example.

And their point is that they actually say that's a good thing because actually often you want to slightly change some of the details around the object. So this is actually fine. All right. So we have a description of what the thing is, lots of details there. And then here's the bit that I totally skip, the bit called theoretical analysis, where this is the stuff that people really generally just add to try to get their papers past review.

You have to have fancy math. And so they're basically proving, you can see what it says here, insight into why this component yields better editing results than other approaches. I'm not sure we particularly care because like it makes perfect sense what they're doing. It's intuitive and we can see it works.

I don't feel like I need it proven to me, so I skip over that. So then they'll show us their experiments and tell us what datasets they did the experiments on. And then, you know, they have metrics with names like LPIPS and CSFID. You'll come across FID a lot.

This is just a version of that. But basically they're trying to score how good their generated images are. We don't normally care about that either. They care because they need to be able to say, you should publish our paper because it has a higher number than the other people that have worked on this area.

In our case, we can just say, you know, it looks good. I like it. So excellent question in the chat from Mikolaj, which is, so would this only work on things that are relatively similar? And I think this is a great point. This is where understanding this helps to know what its limitations are going to be.

And that's exactly right. If you can't come up with a mask for the change you want, this isn't going to work very well on the whole. Yeah, because the masked areas, the pixels going to be copied. So, for example, if you wanted to change it from, you know, a bowl of fruits to a bowl of fruits with a bokeh background or like a bowl of fruits with, you know, a purple tinged photo of a bowl of fruit, if you want the whole color to change, that's not going to work, right?

Because you're not masking off an area. Yeah. So by understanding the detail here, Mikolaj has correctly recognized a limitation or like, what's this for? This is for things where you can just say, just change this bit and leave everything else the same. All right. So there's lots of experiments.

So, yeah. For some things, you care about the experiments a lot. If it's something like classification, for generation, the main thing you probably want to look at is the actual results. And so, and often, for whatever reason, I guess, because this is, most people read these electronically, the results often you have to zoom into a lot to be able to see whether they're really good.

So here's the input image. They want to turn this into an English Foxhound. So here's the thing they're comparing themselves to, SDEdit, and it changed the composition quite a lot. And their version hasn't changed it at all. It's only changed the dog. And ditto here, semi-trailer truck. SDEdit's totally changed it.

DiffEdit hasn't. So you can kind of get a sense of, you know, the authors showing off what they're good at here. This is what this technique is effective at doing, changing animals and vehicles and so forth. It does a very good job of it. All right.

So then there's going to be a conclusion at the end, which I find almost never adds anything on top of what we've already read. And as you can see, it's very short anyway. Now, quite often the appendices are really interesting. So don't skip over them. Often you'll find like more examples of pictures.

They might show some examples of pictures that didn't work very well, stuff like that. So it's often well worth looking at the appendices. Often some of the most interesting examples are there. And that's it. All right. So that is, I guess, our first full on paper walkthrough. And it's important to remember, this is not like a carefully chosen paper that we've picked specifically because you can handle it.

Like this is the most interesting paper that came out this week. And so, you know, it gives you a sense of what it's really like. And for those of you who are, you know, ready to try something that's going to stretch you, see if you can implement any of this paper.

So there are three steps. The first step is kind of the most interesting one, which is to generate, automatically generate a mask. And the information that you have and the code that's in the lesson nine notebook actually contains everything you need to do it. So maybe give it a go.

See if you can mask out the area of a horse that does not look like a zebra. And that's actually useful in itself; that allows you to create segmentation masks automatically. So that's pretty cool. And then if you get that working, you can go and try and do step two.

If you get that working, you can try and do step three. And this only came out this week, so I haven't really seen examples of easy-to-use interfaces to this. So here's an example of a paper where you could be the first person to create a cool interface for it.

So there's some, yeah, there's a fun little project. And even if you're watching this a long time after this was released and everybody's been doing this for years, still good homework, I think, so practice if you can. All right. I think now's a good time to have a 10 minute break.

So I'll see you all back here in 10 minutes. Okay. Welcome back. One thing during the break that Diego reminded us about, which I normally describe and totally forgot about this time, is Detexify, which is another really great way to find symbols you don't know about. So let's try it for that expectation.

So if you go to Detexify and you draw the thing, it doesn't always work fantastically well, but sometimes it works very nicely. Yeah, in this case, not quite. What about the double line thing? It's good to know all the techniques, I guess. I think it could do this one.

I guess part of the problem is there's so many options that actually, you know, okay, in this case, it wasn't particularly helpful. And normally it's more helpful than that. I mean, if we use a simple one like Epsilon, I think it should be fine. There's a lot of room to improve this app, actually, if anybody's interested in a project, I think you could make it, you know, more successful.

Okay, there you go. Signo sum, that's cool. Anyway, it's another useful thing to know about; just Google for Detexify. Okay. So let's move on with our from-the-foundations work now. We were working on trying to at least get the start of a forward pass of a linear model, or a simple multi-layer perceptron, for MNIST going.

And we had successfully created a basic tensor. We've got some random numbers going. So what we now need to do is we now need to be able to multiply these things together, matrix multiplication. So matrix multiplication to remind you, in this case, so we're doing MNIST, right? So we've got, I think we're going to use a subset.

Let's see. Yeah. Okay. So we're going to create a matrix called m1, which is just the first five digits. So m1 will be the first five digits: five rows, and then, what was it again, 784 columns, because each image is 28 by 28 pixels.

And we flattened it out. So this is our first matrix and our matrix multiplication. And then we're going to multiply that by some weights. So the weights are going to be 784 by 10 random numbers. So for every one of these 784 pixels, each one is going to have a weight.

So 784 down here, 784 by 10. So this first column, for example, is going to tell us all the weights in order to figure out if something's a zero. And the second column will have all the weights in deciding the probability of something's a one and so forth, assuming we're just doing a linear model.

And so then we're going to multiply these two matrices together. So when we multiply matrices together, we take row one of matrix one and we take column one of matrix two and we take each one in turn. So we take this one and we take this one, we multiply them together.

And then we take this one and this one and we multiply them together. And we do that for every element wise pair and then we add them all up. And that would give us the value for the very first cell that would go in here. That's what matrix multiplication is.
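To make that concrete, here's a minimal sketch of the naive version we're about to build up step by step, with random numbers standing in for the MNIST subset (5 flattened 28 by 28 images, and a 784 by 10 weight matrix):

```python
import torch

m1 = torch.randn(5, 784)   # 5 flattened 28x28 images (stand-in for MNIST)
m2 = torch.randn(784, 10)  # weights: one column of 784 weights per digit

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    t = torch.zeros(ar, bc)
    for i in range(ar):          # each row of a
        for j in range(bc):      # each column of b
            for k in range(ac):  # walk along the row/column pair
                t[i, j] += a[i, k] * b[k, j]
    return t

print(matmul(m1, m2).shape)  # torch.Size([5, 10])
```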

Okay, so let's go ahead then and create our random numbers for the weights, since we're allowed to use random number generators now. And for the bias, we'll just use a bunch of zeros to start with. So the bias is just what we're going to add to each one. And so for our matrix multiplication, we're going to be doing a little mini batch here.

We're going to be doing five rows of, as we discussed, five rows of, so five images flattened out. And then multiply by this weights matrix. So here are the shapes, m1 is 5 by 784, as we saw, m2 is 784 by 10. Okay, so keep those in mind. So here's a handy thing, m1.shape contains two numbers and I want to pull them out.

I'm going to actually think of these as a and b rather than m1 and m2. So this is like a and b. So the number of rows in a and the number of columns in a: if I say ar, ac = m1.shape, that will put five in ar and 784 in ac.

You'll probably notice I do this a lot, this destructuring; we talked about it last week too. So we can do the same for m2.shape, put that into b rows and b columns. And so now if I write out ar, ac and br, bc, you can again see the same things, the same sizes.

So that's a good way to kind of give us the stuff we have to loop through. So here's our result. So our resultant tensor, while we're multiplying, we're multiplying together all of these 784 things and adding them up. So the resultant tensor is going to be 5 by 10.

And then each thing in here is the result of multiplying and adding 784 pairs. So the result here is going to start with zeros and there is, this is the result. And it's going to contain ar rows, five rows, and bc columns, 10 columns, 5 comma 10. Okay. So now we have to fill that in.

And so to do a matrix multiplication, we have to first, we have to go through each row, one at a time. And here we have that, go through each row, one at a time. And then go through each column, one at a time. And then we have to go through each pair in that row column, one at a time.

So there's going to be a loop, in a loop, in a loop. So here we're going to loop over each row. And here we're going to loop over each column. And then here we're going to loop, so each column is C. And then here we're going to loop over each column of A, which is going to be the same as the number of rows of B, which we can see here, ac, 784, br, 784, they're the same.

So it wouldn't matter whether we said ac or br. So then our result for that row and that column, we have to add onto it the product of ik in the first matrix by kj in the second matrix. So k is going up through those 784. And so we're going to go across the columns and down, sorry, across the rows and down the columns.

It's going to go across the row whilst it goes down this column. So here is the world's most naive, slow, uninteresting matrix multiplication. And if we run it, okay, it's done something. We have successfully, apparently, hopefully successfully, multiplied the matrices M1 and M2. It's a little hard to read this, I find, because punch cards used to be 80 columns wide.

We still assume screens are 80 columns wide. Everything defaults to 80 wide, which is ridiculous. But you can easily change it. So if you say set print options, you can choose your own line width. Oh, as you can see, well, we know it's five by 10. We did it before.

So if we change the line width, okay, that's much easier to read now. We can see here are the five rows and here are the 10 columns for that matrix multiplication. I tend to always put this at the top of my notebooks and you can do the same thing for NumPy as well.
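In case it's useful, the settings being referred to are just something like this (the exact precision and line width are a matter of taste):

```python
import torch
import numpy as np

torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)
np.set_printoptions(precision=2, linewidth=140)
```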

So what I like to do, this is really important, is when I'm working on code, particularly numeric code, I like to do it all step by step in Jupyter. And then what I do is, once I've got it working, is I copy all the cells that have implemented that and I paste them and then I select them all and I hit shift M to merge.

Get rid of anything that prints out stuff I don't need. And then I put a header on the top, give it a function name, and then I select the whole lot and I hit control or apple right square bracket and I've turned it into a function. But I still keep the stuff above it so I can see all the step by step stuff for learning about it later.

And so that's what I've done here to create this function. And so this function does exactly the same things we just did. And we can see how long it takes to run by using percent time. And it took about half a second, which gosh, that's a long time to generate such a small matrix.

This is just to do five MNIST digits. So that's not going to be great. We're going to have to speed that up. I'm actually quite surprised at how slow that is because there's only 39,200. So we're, you know, if you look at how we've got a loop within a loop within a loop, what's going wrong?

A loop within a loop within a loop, it's doing 39,200 of these. So Python, when you're just doing pure Python, is slow. So we can't do that. That's why we can't just write Python. But there is something that kind of lets us write Python.

We could instead use Numba. Numba is a system that takes Python and turns it basically into machine code. And it's amazingly easy to do: you can basically take a function and write @njit on top. And what it's going to do is, the first time you call this function, it's going to compile it down to machine code and it will run much more quickly.

So what I've done here is I've taken the innermost loop. So just looping through and adding up all these. So start at zero, go through and add up all those just for two vectors and return it. This is called a dot product in linear algebra. So we'll call it dot.

And so Numba only works with NumPy; it doesn't work with PyTorch. So we're just going to use arrays instead of tensors for a moment. Now, have a look at this. If I try to do a dot product of one, two, three and two, three, four, that's pretty easy to do.

It took a fifth of a second, which sounds terrible. But the reason it took a fifth of a second is because that's actually how long it took to compile this and run it. Now that it's compiled, the second time it just has to call it, it's now 21 microseconds.

And so that's actually very fast. So with Numba, we can basically make Python run at C speed. So now the important thing to recognize is: if I replace this innermost loop in Python with a call to dot, which is running in machine code, then we now have two loops running in Python, not three.
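Here's roughly what that looks like: the Numba-compiled dot product plus a matmul with only two Python loops. It's a sketch using NumPy arrays, since Numba doesn't work on PyTorch tensors:

```python
import numpy as np
from numba import njit

@njit
def dot(a, b):
    # compiled to machine code the first time it's called
    res = 0.
    for i in range(len(a)):
        res += a[i] * b[i]
    return res

def matmul_numba(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    t = np.zeros((ar, bc))
    for i in range(ar):
        for j in range(bc):
            t[i, j] = dot(a[i, :], b[:, j])  # innermost loop now runs at C speed
    return t

m1 = np.random.randn(5, 784)
m2 = np.random.randn(784, 10)
res = matmul_numba(m1, m2)
```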

So, our 448 milliseconds. Well, first of all, let's make sure: if I run that matmul, it should be close to my t1. t1 is what we got before, remember? So when I'm refactoring or performance improving or whatever, I always like to put every step in the notebook and then test.

So this test close comes from fastcore.test. And it just checks that two things are very similar. They might not be exactly the same because of little floating point differences, which is fine. OK, so our matmul is working correctly, or at least it's doing the same thing it did before.
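The check is a one-liner. In the lesson it compares against the t1 saved from the slow version; as a self-contained stand-in, continuing from the Numba sketch above, you could compare against NumPy's built-in matmul:

```python
from fastcore.test import test_close

# test_close(a, b, eps) raises an error if a and b differ by more than eps,
# which allows for small floating point differences
test_close(matmul_numba(m1, m2), m1 @ m2, eps=1e-4)
```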

So if we now run it, it's taking 268 microseconds versus 448 milliseconds. So it's, you know, about 2,000 times faster just by changing the innermost loop. So really, all we've done is we've added @njit to make it 2,000 times faster. So Numba is well worth knowing about.

It can make your Python code very, very fast. OK, let's keep making it faster. So we're going to use stuff again, which kind of goes back to APL. And a lot of people say that learning APL is a thing that's taught them more about programming than anything else. So it's probably worth considering learning APL.

And let's just look at these various things. We've got a is 10, 6, negative 4. So remember, in APL we don't say equals; equals actually means equals, funnily enough. To say "set to", we use this arrow. And this is a list of 10, 6, negative 4. OK, and then b is 2, 8, 7.

OK, and we're going to add them up, a plus b. So what's going on here? So it's really important that you can think of a symbol like a as representing a tensor or an array. APL calls them arrays. PyTorch calls them tensors. NumPy calls them arrays. They're the same thing.

So this is a single thing that contains a bunch of numbers. This is a single thing that contains a bunch of numbers. This is an operation that applies to arrays or tensors. And what it does is it works what's called element-wise. It takes each pair, 10 and 2, and adds them together.

Each pair, 6 and 8, add them together. This is element-wise addition. And Fred's asking in the chat, how do you put in these symbols? If you just mouse over any of them, it will show you how to write it. And the one you want is the one at the very bottom, which is the one where it says prefix.

Now, the prefix is the backtick character. So here it's saying prefix hyphen gives us times. So typing a, backtick, dash, b gives a times b, for example. So yeah, they all have shortcut keys, which you learn pretty quickly, I find. And there's a fairly consistent kind of system for those shortcut keys, too.

All right. So we can do the same thing in PyTorch. It's a little bit more verbose in PyTorch, which is one reason I often like to do my mathematical fiddling around in APL. I can often do it with less boilerplate, which means I can spend more time thinking. You know, I can see everything on the screen at once.

I don't have to spend as much time trying to ignore the tensor, round brackets, square bracket dot comma, blah, blah, blah. It's all cognitive load, which I'd rather ignore. But anyway, it does the same thing. So I can say a plus b and it works exactly like APL. So here's an interesting example.

I can go a less than b, dot float, dot mean. So let's try that one over here. A less than b. So this is a really important idea, which I think was invented by Ken Iverson, the APL guy, which is that true and false are represented by zero and one.

And because they're represented by zero and one, we can do things to them. We can add them up and subtract them and so forth. It's a really important idea. So in this case, I want to take the mean of them. And I'm going to tell you something amazing, which is that in APL, there is no function called mean.

Why not? That's because mean is four letters, M-E-A-N, and we can write the mean function from scratch with four characters. I'll show you. Here's the whole mean function. We're going to create a function called mean, and mean is equal to the sum of a list divided by the count of a list.

So this here is sum divided by count. And so I've now defined a new function called mean, which calculates the mean. Mean of a less than b. There we go. And so, you know, in practice, I'm not sure people would even bother defining a function called mean, because it's just as easy to actually write its implementation in APL. In NumPy or Python or whatever, it's going to take a lot more than four characters to implement mean.
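For comparison, here's roughly what the same thing looks like in PyTorch, using the a and b values from above:

```python
import torch

a = torch.tensor([10., 6., -4.])
b = torch.tensor([ 2., 8.,  7.])

(a < b)                  # tensor([False,  True,  True]) -- trues and falses we can compute with
(a < b).float().mean()   # tensor(0.6667)

# The APL-style definition: mean is just "sum divided by count"
def mean(x): return x.sum() / x.numel()
mean((a < b).float())    # same result
```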

So anyway, you know, it's a math notation. And so being a math notation, we can do a lot with little, which I find helpful because I can see everything going on at once. Anywho, OK, so that's how we do the same thing in PyTorch. And again, you can see that the less than in both cases are operating element wise.

OK, so a is less than b is saying ten is less than two, six is less than eight, negative four is less than seven, and gives us back each of those trues and falses as zeros and ones. And according to the emoji on our YouTube chat, Siva's head just exploded, as it should.

This is why APL is, yeah, life changing. OK, let's now go up to higher ranks. So this here is a rank one tensor. A rank one tensor means it's a list of things; it's a vector. Whereas a rank two tensor is like a list of lists.

The lists all have to be the same length, so it's like a rectangular bunch of numbers, and in math we call it a matrix. So this is how we can create a tensor containing one, two, three, four, five, six, seven, eight, nine. And you can see, often what I like to do is print out the thing I just created after I created it.

So two ways to do it. You can say, put an enter and then write M and that's going to do that. Or if you want to put it all on the same line, that works too. You just use a semicolon. Neither one's better than the other. They're just different.

So we could do the same thing in APL. Of course, in APL, it's going to be much easier. So we're going to define a matrix called M, which is going to be a three by three tensor containing the numbers from one to nine. Okay. And there we go. That's done it in APL.

A three by three tensor containing the numbers from one to nine. A lot of these ideas from APL you'll find have made their way into other programming languages. For example, if you use Go, you might recognize this. This is the iota character and Go uses the word iota. So they spell it out in a somewhat similar way.

A lot of these ideas from APL have found their way into math notation and other languages. It's been around since the late 50s. Okay. So here's a bit of fun. We're going to learn about a new thing that looks kind of crazy called the Frobenius norm. And we'll use that from time to time as we're doing generative modeling.

And here's the definition of the Frobenius norm. We go over all of the rows and columns of a matrix, take each element and square it, add them all up, and then take the square root. And so to implement that in PyTorch is as simple as going m times m, dot sum, dot sqrt.
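A quick sketch of that in PyTorch, using a small 3 by 3 matrix; the built-in matrix norm defaults to Frobenius, so it should agree:

```python
import torch

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

# Square every element, sum them all, take the square root
(m * m).sum().sqrt()     # tensor(16.8819)

torch.linalg.norm(m)     # same value via the built-in, for comparison
```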

So this looks like a pretty complicated thing when you kind of look at it at first. It looks like a lot of squiggly business. Or if you said this thing here, you might be like, what on earth is that? Well, now, you know, it's just square sum square root.

So again, we could do the same thing in APL. So let's do, so in APL, we want the, okay, so we're going to create something called SF. Now, it's interesting, APL does this a little bit differently. So dot sum by default in PyTorch sums over everything. And if you want to sum over just one dimension, you have to pass in a dimension keyword.

For very good reasons, APL is the opposite. It just sums across rows or just down columns. So actually, we have to say sum up the flattened out version of the matrix. And to say flattened out, use comma. So here's sum up the flattened out version of the matrix. Okay, so that's our SF.

Oh, sorry. And the matrix is meant to be m times m. There we go. So there's the same thing. Sum up the flattened out m by m matrix. And another interesting thing about APL is it always is read right to left. There's no such thing as operator precedence, which makes life a lot easier.

Okay, and then we take the square root of that. There isn't a square root function. So we have to do to the power of 0.5. And there we go. Same thing. All right, you get the idea. Yes, a very interesting question here from Marabou. Are the bars for norm or absolute value?

And I like Siva's answer, which is the norm is the same as the absolute value for a scalar. So in this case, you can think of it as absolute value. And it's kind of not needed because it's being squared anyway. But yes, in this case, the norm, well, in every case for a scalar, the norm is the absolute value, which is kind of a cute discovery when you realize it.

So thank you for pointing that out, Siva. All right. So this is just fiddling around a little bit to kind of get a sense of how these things work. So really importantly, you can index into a matrix and you'll say rows first and then columns. And if you say colon, it means all the columns.

So if I say row two, here it is, row two, all the columns. Sorry, this is row two counting from zero; APL starts at one. All the columns, so that's going to be seven, eight, nine. And you can see I often use comma to print out multiple things. And I don't have to say print in Jupyter, it's kind of assumed.

And so this is just a quick way of printing out the second row. And then here, every row, column two. So here is every row of column two. And here you can see three, six, nine. So one thing very useful to recognize is that for tensors of higher rank than one, such as a matrix, any trailing colons are optional.
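As a sketch, using the 3 by 3 matrix m from above, this indexing looks like:

```python
m[2, :]      # row 2 (counting from zero), every column: tensor([7., 8., 9.])
m[:, 2]      # every row, column 2:                       tensor([3., 6., 9.])
m[2]         # trailing colon is optional, so this is the same as m[2, :]
```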

So you see this here, M2, that's the same as M2 comma colon. It's really important to remember. Okay, so M2, you can see the result is the same. So that means row two, every column. Okay, so now with all that in place, we've got quite an easy way. We don't need Numba anymore.

We can multiply, so we can get rid of that innermost loop. So we're going to get rid of this loop, because it is just multiplying together all the corresponding columns of a row of A with all the corresponding rows of a column of B.

And so we can just use an element-wise operation for that. So here is the ith row of A, and here is the jth column of B. And so those are both, as we've seen, just vectors, and therefore we can do an element-wise multiplication of them, and then sum them up.

And that's the same as a dot product. So that's handy. And so again, we'll do test close. Okay, it's the same. Great. And again, you'll see we kind of did all of our experimenting first, right, to make sure that we understood how it all worked, and then put it together.
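A rough sketch of this version, with the inner loop replaced by an element-wise multiply and sum; m1, m2 and t1 are again assumed from earlier in the notebook:

```python
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # element-wise multiply row i of a with column j of b, then sum: a dot product
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c

test_close(matmul(m1, m2), t1)   # same reference check as before
```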

And then if we time it, 661 microseconds. Okay, so it's interesting. It's actually slower than the Numba version, which really shows you how good Numba is, but it's certainly a hell of a lot better than our 450 milliseconds. But we're using something that's kind of a lot more general now. This is exactly the same as dot, as we've discussed.

So we could just use torch dot, torch dot dot, I suppose I should say. And if we run that, okay, a little faster. Interestingly, it's still slower than Numba, which is quite amazing, actually. All right, so that one was not exactly a speed up, but it's kind of a bit more general, which is nice.

Now we're going to get something into something really fun, which is broadcasting. And broadcasting is about what if you have arrays with different shapes. So what's a shape? The shape is the number of rows, or the number of rows and columns, or the number of, what would you say, faces, rows and columns, and so forth.

So for example, the shape of M is 3 by 3. So what happens if you multiply, or add, or do operations to tensors of different shapes? Well, there's one very simple one, which is if you've got a rank one tensor, the vector, then you can use any operation with a scalar, and it broadcasts that scalar across the tensor.

So a is greater than zero is exactly the same as saying a is greater than tensor zero comma zero comma zero. So it's basically copying that across three times. Now it's not literally making a copy in memory, but it's acting as if we had said that. And this is the most simple version of broadcasting.

Okay, it's broadcasting the zero across the ten, and the six, and the negative four. And APL does exactly the same thing. a is less than five, so zero, zero, one. So same idea. Okay. So we can do plus with a scalar, and we can do exactly the same thing with higher than rank one.

So two times a matrix is just going to broadcast the two across all the rows and all the columns. Okay, now it gets interesting. So broadcasting dates back to APL. But a really interesting idea is that we can broadcast not just scalars, but we can broadcast vectors across matrices, or broadcast any kind of lower ranked tensor across higher ranked tensors, or even broadcast together two tensors of the same rank but different shapes, in a really powerful way.
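Here's what those scalar cases look like as a quick sketch, before we get to the more interesting vector-and-matrix cases, with a and m as defined above:

```python
a = torch.tensor([10., 6., -4.])
m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

a > 0       # the scalar 0 is broadcast across the vector: tensor([ True,  True, False])
a + 1       # tensor([11.,  7., -3.])
2 * m       # the scalar 2 is broadcast across every element of the matrix
```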

And as I was exploring this, I was doing the kind of computer archaeology I love, trying to find out where the hell this comes from. And it actually turns out, from this email message in 1995, that the idea actually comes from a language that I'd never heard of called Yorick, which still apparently exists.

Here's Yorick. And so Yorick talks about broadcasting and conformability. So what happened is this very obscure language has this very powerful idea. And NumPy has happily stolen the idea from Yorick that allows us to broadcast together tensors that don't appear to match. So let me give an example.

Here's a tensor called C that's a vector. It's a rank one tensor, 10, 20, 30. And here's a tensor called M, which is a matrix. We've seen this one before. And one of them is shape three, comma, three. The other is shape three. And yet we can add them together.

Now what's happened when we added it together? Well, what's happened is 10, 20, 30 got added to one, two, three. And then 10, 20, 30 got added to four, five, six. And then 10, 20, 30 got added to seven, eight, nine. And hopefully you can see this looks quite familiar.
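As a sketch, with m as above and c being the 10, 20, 30 vector:

```python
c = torch.tensor([10., 20., 30.])

m + c
# tensor([[11., 22., 33.],
#         [14., 25., 36.],
#         [17., 28., 39.]])
```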

Instead of broadcasting a scalar over a higher rank tensor, this is broadcasting a vector across every row of a matrix. And it works both ways. So we can say C plus M gives us exactly the same thing. And so let me explain what's actually happening here. The trick is to know about this somewhat obscure method called expand as.

And what expand as does is this creates a new thing called T, which contains exactly the same thing as C, but expanded or kind of copied over. So it has the same shape as M. So here's what T looks like. Now T contains exactly the same thing as C does, but it's got three copies of it now.

And you can see we can definitely add T to M because they match shapes. Right? So we can say M plus T. We know we can do M plus T because we've already learned that you can do element-wise operations on two things that have matching shapes. Now, by the way, this thing T didn't actually create three copies.

Check this out. If we call T dot storage, it tells us what's actually in memory. It actually just contains the numbers 10, 20, 30. But it does a really clever trick. It has a stride of zero across the rows and a size of three comma three. And so what that means is that it acts as if it's a three by three matrix.

And each time it goes to the next row, it actually stays exactly where it is. And this idea of strides is the trick which NumPy and PyTorch and so forth use for all kinds of things where you basically can create very efficient ways to do things like expanding or to kind of jump over things and stuff like that, you know, switch between columns and rows, stuff like that.

Anyway, the important thing here for us to recognize is that we didn't actually make a copy. This is totally efficient and it's all going to be run in C code very fast. So remember, this expand as is critical. This is the thing that will teach you to understand how broadcasting works, which is really important for implementing deep learning algorithms or any kind of linear algebra on any Python system, because the NumPy rules are used exactly the same in JAX, in TensorFlow, in PyTorch and so forth.

Now I'll show you a little trick, which is going to be very important in a moment. If we take C, which remember is a vector containing 10 20 30, and we say dot unsqueeze zero, then it changes the shape from three to one comma three. So it changes it from a vector of length three to a matrix of one row by three columns.

This will turn out to be very important in a moment. And you can see how it's printed. It's printed out with two square brackets. Now I never use unsqueeze because I much prefer doing something more flexible, which is if you index into an axis with the special value none, also known as np.newaxis.

It does exactly the same thing. It inserts a new axis here. So here we'll get exactly the same thing, one row by all the columns, three columns. So this is exactly the same as saying unsqueezed. So this inserts a new unit axis. This is a unit axis, a single row in this dimension.

And this does the same thing. So these are the same. So we could do the same thing and say unsqueeze one, which means now we're going to unsqueeze into the first dimension. So that means we now have three rows and one column. See the shape here? The shape is inserting a unit axis in position one, three rows and one column.

And so we can do exactly the same thing here. Give us every row and a new unit axis in position one. Same thing. Okay. So those two are exactly the same. So this is how we create a matrix with one row. This is how we create a matrix with one column.

None comma colon versus colon comma none, or unsqueeze. We don't have to say, as we've learned before, none comma colon, because, do you remember? Trailing colons are optional. So therefore just C none is also going to give you a row matrix, a one-row matrix. This is a little trick here.

If you say dot, dot, dot, that means all of the dimensions. And so dot, dot, dot comma none will always insert a unit axis at the end, regardless of what rank a tensor is. So, yeah, so none and np.newaxis mean exactly the same thing. np.newaxis is actually a synonym for none, if you've ever used that. I always use none because why not? It's short and simple.
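All of those spellings side by side, as a quick sketch with the same c as above:

```python
c.shape                 # torch.Size([3])
c.unsqueeze(0).shape    # torch.Size([1, 3]) -- one row, three columns
c[None, :].shape        # torch.Size([1, 3]) -- same thing via None indexing
c.unsqueeze(1).shape    # torch.Size([3, 1]) -- three rows, one column
c[:, None].shape        # torch.Size([3, 1])
c[None].shape           # torch.Size([1, 3]) -- trailing colon is optional
c[..., None].shape      # torch.Size([3, 1]) -- ... means "all existing dimensions"
```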

If you've ever used that, I always use none because why not? It's short and simple. So here's something interesting. If we go C colon, common none, so let's go and check out what C colon, common none looks like. C colon, common none is a column. And if we say expand as M, which is three by three, then it's going to take that 10, 20, 30 column and replicate it 10, 20, 30, 10, 20, 30, 10, 20, 30.

So we could add. So remember, as I explained, when you say matrix plus C colon comma none, it's basically going to do this dot expand as for you. So if I want to add this matrix here to M, I don't need to say dot expand as, I just write this.

I just write M plus C colon comma none. And so this is just like doing M plus C, but now rather than adding the vector to each row, it's adding the vector to each column: 10, 20, 30, 10, 20, 30, 10, 20, 30. So that's a really simple thing we now get kind of for free thanks to this really nifty notation, this nifty approach that came from Yorick.
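As a sketch, with the same m and c:

```python
m + c[:, None]    # c is broadcast down each column
# tensor([[11., 12., 13.],
#         [24., 25., 26.],
#         [37., 38., 39.]])
```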

So here you can see M plus C none comma colon is adding 10, 20, 30 to each row, and M plus C colon comma none is adding 10, 20, 30 to each column. All right, so that's the basic, hand-wavy version. So let's look at what the rules are.

How does it work? Okay, so C none comma colon is one by three. C colon comma none is three by one. What happens if we multiply C none comma colon by C colon comma none? Well, think about it, which you definitely should, because thinking is very helpful.

What is going on here? Oh, it took forever. Okay, so what happens if we go C none comma colon times C colon comma none? So what it's going to have to do is take this 10, 20, 30 column vector, or three by one matrix, and make it work across each of these rows.

So what it does is expands it to be 10, 20, 30, 10, 20, 30, 10, 20, 30. So it's going to do it just like this. And then it's going to do the same thing for C none, colon. So that's going to become three rows of 10, 20, 30.

So we're going to end up with three rows of 10, 20, 30 times three columns of 10, 20, 30, which gives us our answer. And so this is going to do an outer product. So it's very nifty that you can actually do an outer product without any special, you know, functions or anything, just using broadcasting.

And it's not just outer products, you can do outer Boolean operations. And this kind of stuff comes up all the time, right? Now, remember, you don't need the comma colon, so get rid of it. So this is showing us all the places where it's greater than it's kind of an outer, an outer Boolean, if you want to call it that.
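As a sketch, the outer product and an outer comparison with the same c:

```python
c[None, :] * c[:, None]    # outer product via broadcasting
# tensor([[100., 200., 300.],
#         [200., 400., 600.],
#         [300., 600., 900.]])

c[:, None] > c[None]       # an "outer Boolean" -- note the trailing colon dropped
# tensor([[False, False, False],
#         [ True, False, False],
#         [ True,  True, False]])
```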

So this is super nifty and you can do all kinds of tricks with this because it runs very, very fast. So this is going to be accelerated in C. So here are the rules. Okay. When you operate on two arrays or tensors, NumPy and PyTorch will compare their shapes.

Okay. So remember the shape, this is a shape. You can tell it's a shape because we said shape. And it compares them from right to left, so starting with the trailing dimensions, and it checks whether the dimensions are compatible. Now they're compatible if they're equal, right? So for example, if we say M times M, then those two shapes are compatible because in each case, it's just going to be three, right?

So they're going to be equal. So if the shape in that dimension is equal, they're compatible; or if one of them is one, then that dimension is broadcast to make it the same size as the other. So that's why the outer product worked.

We had a one by three times a three by one. And so this one got copied three times to make it this long. And this one got copied three times to make it this long. Okay. So those are the rules. So the arrays don't have to have the same number of dimensions.

So this is an example that comes up all the time. Let's say you've got a 256 by 256 by three array or tensor of RGB values. So you've got an image, in other words, a color image. And you want to normalize it. So you want to scale each color in the image by a different value.

So this is how we normalize colors. So one way is you could multiply or divide or whatever, multiply the image by a one-dimensional array with three values. So you've got a 1D array. So that's just three. Okay. And then the image is 256 by 256 by three. And we go right to left and we check, are they the same?

And we say, yes, they are. And then we keep going left and we check again: if a dimension is missing, we act as if it's one, and the same for the next one. So this is going to be the same as doing one by one by three.

And so this is going to be broadcast. This three, three elements will be broadcast over all 256 by 256 pixels. So this is a super fast and convenient and nice way of normalizing image data with a single expression. And this is exactly how we do it in the fast.ai library.
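A small sketch of that idea with made-up values; the image and the per-channel scaling numbers here are just placeholders:

```python
img = torch.rand(256, 256, 3)            # a made-up RGB image, pixel values in [0, 1]
scale = torch.tensor([0.5, 0.4, 0.3])    # made-up per-channel values

# Shapes compared right to left: 3 matches 3, and the missing dims of `scale`
# are treated as 1, so it broadcasts over all 256x256 pixel positions
(img / scale).shape                      # torch.Size([256, 256, 3])
```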

In fact, we can use this to dramatically speed up our matrix multiplication. Let's just grab a single digit just for simplicity. And I really like doing this in Jupyter notebooks. And if you build Jupyter notebooks to explain stuff that you've learned in this course or ways that you can apply it, consider doing this for your readers, but add a lot more prose.

I haven't added prose here because I want to use my voice. If I was, for example, in our book that we published, it's all written in notebooks and there's a lot more prose, obviously. But like really, I like to show every example all along the way using simple as possible.

So let's just grab a single digit. So here's the first digit. So its shape is, it's a 784 long vector. And remember that our weight matrix is 784 by 10. So if we say digit colon comma none dot shape, then that is a 784 by 1 matrix, 784 rows and one column. So there's our matrix.

And so if we then take that 784 by 1 and expand as M2, it's going to be the same shape as our weight matrix. So it's copied our image data for that digit across all of the 10 vectors representing the 10 linear projections we're doing for our linear model.

And so that means that we can take the digit colon comma none, so 784 by 1, and multiply it by the weights. And so that's going to get us back 784 by 10. And so what it's doing, remember, is it's basically looping through each of these 10 784-long vectors.

And for each one of them, it's multiplying it by this digit. So that's exactly what we want to do in our matrix multiplication. So originally, we had, well not originally, most recently I should say, we had this dot product where we were actually looping over j, which was the columns of b.
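A sketch of that step, assuming the names m1 and m2 from earlier (m2 being the 784 by 10 weight matrix):

```python
digit = m1[0]                            # one flattened 28x28 digit
digit.shape                              # torch.Size([784])
digit[:, None].shape                     # torch.Size([784, 1])
digit[:, None].expand_as(m2).shape       # torch.Size([784, 10])

(digit[:, None] * m2).shape              # torch.Size([784, 10])
(digit[:, None] * m2).sum(dim=0)         # the 10 outputs of the linear model for this digit
```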

So we don't have to do that anymore, because we can do it all at once by doing exactly what we just did. So we can take the i-th row and all the columns and add an axis to the end. And then just like we did here, multiply it by b.

And then dot sum. And so that is, again, exactly the same thing. That is another matrix multiplication, doing it using broadcasting. Now this is tricky to get your head around. And so if you haven't done this kind of broadcasting before, it's a really good time to pause the video and look carefully at each of these four cells and understand: what did I do there?

Why did I do it? What am I showing you? And then experiment with it. And remember that we started with M1 0, right? So just like we have here, a i, okay? So that's why we've got i comma colon comma none, because this digit is actually M1 0.

This is like M1 0 colon comma none. So this line is doing exactly the same thing as this here, plus the sum. So let's check that this matmul is the same as it used to be, yep, it's still working. And the speed of it, okay, not bad. So 137 microseconds.
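Putting it together, a sketch of the broadcasting version of matmul; m1, m2 and t1 are again assumed from earlier:

```python
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # broadcast row i of a (as a column) against all of b, then sum down the rows
        c[i] = (a[i, :, None] * b).sum(dim=0)
    return c

test_close(matmul(m1, m2), t1)   # still matches the reference result
```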

So we've now gone from about 500 milliseconds to about 0.1 milliseconds. Funnily enough, on my... oh, actually, now I think about it, my MacBook Air is an M2, whereas this Mac Mini is an M1, so that's a little bit slower. So my Air was a bit faster than 0.1 milliseconds.

So overall, we've got about a 5,000 times speed improvement. So that is pretty exciting. And since it's so fast now, there's no need to use a mini batch anymore. If you remember, we used a mini batch of, where is it? Of five images. But now we can actually use the whole data set because it's so fast.

So now we can do the whole data set. There it is. We've now got 50,000 by 10, which is what we want. And so it's taking us only 656 milliseconds now to do the whole data set. So this is actually getting to a point now where we could start to create and train some simple models in a reasonable amount of time.
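As a sketch, with a hypothetical name x_train for the full 50,000 by 784 training set and m2 for the 784 by 10 weights:

```python
res = matmul(x_train, m2)   # the whole training set in one call
res.shape                   # torch.Size([50000, 10])
```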

So that's good news. All right. I think that's probably a good time to take a break. We don't have too much more of this to go, but I don't want to keep you guys up too late. So hopefully you learned something interesting about broadcasting today. I cannot overemphasize how widely useful this is in all deep learning and machine learning code.

It comes up all the time. It's basically our number one, most critical kind of foundational operation. So yeah, take your time practicing it and also good luck with your diffusion homework from the first half of the lesson. Thanks for joining us, and I'll see you next time.