
fast.ai Live - Lesson 11


Transcript

Hi everybody, nice to see you all here. Can you guys all hear me okay? Great. I don't have too much logistic stuff to mention, other than that ... well, we'll see what happens. I have a feeling this course is going to go a lot longer than I expected, so just putting that out there to warn you right now, it could be more of a marathon than we originally thought, which may require having some breaks in the middle or something.

Anyway, we've got a lot of stuff to cover, and I don't want to hurry, I want to do it all carefully and properly, so I decided rather than hurrying, we'll just do what it takes. All right, so I think we are ready to get into it, a never-ending course, exactly, Sam, never-ending story.

Hi everybody, welcome to lesson 11. This is the third lesson in part two. Depending on how you count things, there's been a lesson A and a lesson B, it's kind of the fifth lesson in part two, I don't know what it is, so we'll just stick to calling it lesson 11 and avoid getting too confused, I'm already confused.

My goodness, I've got so much stuff to show you, I'm only going to show you a tiny fraction of the cool stuff that's been happening on the forum this week, but it's been amazing. I'm going to start by sharing this beautiful video from John Robinson, and I've never seen anything like this before, as you can see it's very stable, and it's really showing this beautiful movement between seasons.

So what I did on the forum was I said to folks, "Hey, you should try interpolating between prompts," which is what John did, and I also said you should try using the last image of the previous prompt interpolation as the initial image for the next prompt. And anyway, here it is, it came out beautifully. John was the first to get that working, so I was very excited about that.
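
To make the idea concrete, here's a minimal sketch of those two suggestions, not John's actual code. It assumes you already have, from the lesson notebooks, some `text_enc(prompt)` that returns text embeddings and a `generate(emb, init_image=...)` that runs the sampler; both names are placeholders for whatever you've called them yourself.

```python
import torch

prompts = ["a tree in spring", "a tree in summer", "a tree in autumn", "a tree in winter"]
embs = [text_enc(p) for p in prompts]          # placeholder: your text encoder from the lesson notebook

frames, init = [], None
for e1, e2 in zip(embs, embs[1:]):
    for t in torch.linspace(0, 1, 30):         # 30 interpolation steps between each pair of prompts
        emb = torch.lerp(e1, e2, t)            # linear interpolation between the two embeddings
        img = generate(emb, init_image=init)   # placeholder: your sampling function
        frames.append(img)
    init = frames[-1]                          # last frame of this segment seeds the next prompt's frames
```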

And the second one I wanted to show you is this really amazing work from Sebastian Derhy, who did something that I've been thinking about as well; I'm really thrilled that he also thought about this. He noticed that this update we do, the unconditional embeddings plus the guidance scale times the text embeddings minus the unconditional embeddings, u + g·(t − u), has a bit of a problem, which is that it gets big.

To show you what I mean by "it gets big": imagine we've got a couple of vectors on this chart here. We've got the original unconditional piece, so let's say this is u, and then we add to that some amount of t minus u. So if t is huge, and here's u again, then the difference between those is the vector which goes here.

Now you can see here that if there's a big difference between t and u, then the eventual update which actually happens is far bigger than the original update, and so it jumps too far.

So this idea is basically to say, well, let's make it so that the update is no longer than the original unconditioned update would have been. We're going to be talking more about norms later, but basically we scale it by the ratio of the norms. And what happens is we start with this astronaut, and we move to this astronaut, and it's a subtle change, but you can see there's a lot more texture in the background (before, after), and on the earth there's a lot more detail (before, after), you see that?

And even little things: before, the bridle and reins, or whatever they are, were pretty flimsy, and now they look quite proper. So it's made quite a big difference just to get this scaling correct. Another example: there are a couple of other things that Sebastian tried, which I'll explain in a moment, but you can see how some of them actually resulted in changing the image, and this one's actually important, because the poor horse used to be missing a leg, and now it's not missing a leg, so that's good.

And so here's the detailed one with its extra leg. So how did he do this? Well, what he did was he started with the usual version: the unconditioned prediction plus the guidance scale times the difference between the conditional and unconditioned. And then, as we discussed, the next version we saw is to basically just take that prediction and scale it according to the difference in the lengths, so the norms; a norm is basically the length of a vector.

And so this, this is the second one I did in lesson 9, you'll see it's gone from here, so when we go from 1a to 1b, you can see here it's got, look at this, this boot's gone from nothing to having texture, this, I don't know, whatever the hell this thing is, suddenly he's got texture, and look, we've now got proper stars in the sky.

It's made a really big difference. And then the second change is not just to rescale the whole prediction, but to rescale the update, and when we rescale the update it actually, not surprisingly, changes the image entirely, because we're now changing the direction it goes in. And so, I don't know, is this better than this? Maybe, maybe not, but I think so, particularly because this was the difference that added the correct fourth leg to the horse before. And then we can do both: we can rescale the difference and then rescale the result, and then we get the best of both worlds. As you can see, big difference: we get a nice background, this weird thing on his back has actually become an arm, that's not what a foot looks like, that is what a foot looks like. So these little details make a big difference, as you can see.

So these are two really cool new things. New things tend to have wrinkles, though. Wrinkle number one: after I shared Sebastian's approach on Twitter, Ben Poole, who's at Google Brain if I remember correctly, pointed out that this already exists; he thinks it's the same as what's shown in this paper, which is a diffusion model for text-to-speech. I haven't read the paper yet to check whether it's got all the different options, or whether it's checked them all out like this, so maybe this is reinventing something that already existed and putting it into a new field, which would still be interesting. So hopefully folks on the forum can help figure out whether that paper is actually showing the same thing or not.

And then the other interesting thing was that John Robinson got back in touch on the forum and said, actually, that tree video doesn't do what we think it does at all: there's a bug in his code, and despite the bug it accidentally worked really well. So now we're in this interesting position of trying to figure out how he created such a beautiful video by mistake, reverse engineering exactly what the bug did, and then figuring out how to do it more intentionally. And this is great, right? It's really good to have a lot of people working on something, and the bugs often tell us about new ideas. So watch this space, where we find out what John actually did and how come it worked so well.

And then there's something I just saw, like two hours ago, on the forum, which I'd never thought of before, though I'd thought of something a little bit similar. As you can see, all the students are really bouncing ideas off each other: it's like, oh, interesting, we're doing different things with the guidance scale. Rekil Prashanth's idea is: what if we take the guidance scale and, rather than keeping it at 7.5 all the time, we reduce it? This is a little bit similar to something I suggested to John a few weeks ago, when he was doing some stuff with modifying gradients based on additional loss functions, and I said to him, maybe you should just use them occasionally at the start, because I think the key thing is that once the model kind of knows roughly what image it's trying to draw, even if it's noisy, you can let it do its thing. And that's exactly what's happening here: the idea is to decrease the guidance scale so that at the end it's basically zero, and so once it's going in the right direction we let it do its thing.

So this little doggy is with the normal 7.5 guidance scale. Now have a look, for example, at its eye here: it's pretty uninteresting, pretty flat. And if I go to the next one, as you can see, now actually look at the eye: that's a proper eye. Before, totally glassy black; now, a proper eye. Or look at all this fur: very textured, where previously it was very out of focus. So this is again a new technique. I love this: you folks are trying things out, and some things are working and some things aren't, and that's all good. I kind of feel like you're going to have to slow down, because I'm having trouble keeping up with you all, but apart from that, this is great. Good work.
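
Here's a minimal sketch of that decreasing-guidance-scale idea. The linear schedule is just one arbitrary choice, and `u`, `t` and the sampling loop stand in for whatever you have in your own notebook.

```python
def guidance_scale(i, num_steps, g_start=7.5):
    # one simple choice: decay linearly from g_start down to (nearly) zero over the sampling steps
    return g_start * (1 - i / num_steps)

# Inside the sampling loop, instead of a fixed 7.5:
#   g = guidance_scale(i, num_steps)
#   noise_pred = u + g * (t - u)   # u, t: unconditional / text-conditioned noise predictions
```
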
I also wanted to mention, on a different theme: check out Alex's notes on the lesson, because I thought he's done a fantastic job of showing how to study a lesson. What Alex did, for example, was make a list in his notes of all the different steps we did as we started the "from the foundations" work, which library each thing comes from, and links to the documentation. And I know that Alex's background is actually history, not computer science, and for somebody moving into a different field like this, this is a great idea, particularly to be able to look at, okay, what are all the things I'm going to have to learn and read about.

And then he did something which we always recommend, which is to try the lesson on a new dataset, and he very sensibly picked out the Fashion MNIST dataset, which is something we'll be using a lot in this course, because it's a lot like MNIST and it's just different enough to be interesting. And he's described in his notes how he went about doing that. And then something else I thought was interesting, at the very end of his notes, was that he just jotted down my tips. It's very easy, when I throw a tip out there, to think "oh, that's interesting, that's good to know" and then it disappears, so here's a good way to make sure you don't forget about all the little tricks. I think I've put those notes in the forum wiki, so you can check them out if you'd like to learn from them as well. So I think this is a great role model. Good job, Alex.

Okay, so during the week Jono taught us about a new paper that had just come out, called DiffEdit, and he told us he thought this was an interesting paper, and I thought it might be good practice for us to try reading this paper together. So let's do that. Here's the paper, DiffEdit, and you'll find that probably the majority of papers that you come across in deep learning will take you to arXiv. arXiv is a preprint server, so these are papers that have not been peer-reviewed. I would say in our field we don't generally, or I certainly don't generally, care about that at all, because we have code, we can try it, we can see whether it works or not. Most papers are very transparent about "here's what we did and how we did it" so you can replicate it, and it gets a huge amount of peer review on Twitter, so if there's a problem, generally within 24 hours somebody has pointed it out. So we use arXiv a lot, and if you wait until something's been peer-reviewed, you'll be way out of date, because this field is moving so quickly.

So here it is on arXiv, and we can read it by clicking on the PDF button. I don't do that; instead I click on this little button up here, which is the Save to Zotero button. I figured I'd show you my preferred workflows. You don't have to do the same thing, there are different workflows, but here's one that I find works very well. Zotero is a piece of free software that you can download for Mac, Windows and Linux, and install a Chrome connector. Oh, Tanishk is saying the button's covered. All right, so in my Chrome menu bar (sorry, not taskbar) I have a button that I can click that says Save to Zotero, and when I click it, I'll show you what happens: after it's downloaded, the paper automatically appears here in this software, which is Zotero. And so here it is, DiffEdit, and you can see it's got the abstract, the authors, where it came from, and so later on, if I want to check some detail, I can go back and see the URL, I can click on it, it pops up. And so in this case what I'm going to do is double-click on it.
That brings up the paper. Now, the reason I like to read my papers in Zotero is that I can annotate them, edit them, tag them, put them in folders and so forth, and also add them to my reading list directly from my web browser. So as you can see, I've started this fast diffusion folder, which is actually a group library that I share with the other folks working on this fast diffusion project we're all doing together, so we can all see the same paper library. Maribou on the YouTube chat is asking, is this better than Mendeley? Yeah, I used to use Mendeley and it's kind of gone downhill; I think Zotero is far, far better, but they're both very similar.

Okay, so we double-click on it, it opens up, and here is a paper. Reading a paper is always extremely intimidating, so you just have to do it anyway, and you have to realize that your goal is not to understand every word. Your goal is to understand the basic idea well enough that, for example, when you look at the code (hopefully it comes with code; most things do), you'll be able to see how the code matches up to it, and that you could try writing your own code to implement parts of it yourself. Over on the left you can open up the sidebar, so I generally open up the table of contents and get a bit of a sense of, okay: there are some experimental results, there are some theoretical results, an introduction, related work, it tells us about this new DiffEdit thing, some experiments. That's a pretty standard structure you'll see in papers.

I would always start with the abstract. So what's it saying this does? Generally there's going to be a background sentence or two about how interesting the field is, something like "wow, image generation", which is fine, and then they're going to tell us what they're going to do, which is to create something called DiffEdit. And what is it for? It's going to use text-conditioned diffusion models, so we know what those are now; that's what we've been using, where we type in some text and get back an image that matches the text. But this is going to be different: it's for the task of semantic image editing. Okay, we don't know what that is yet, so let's put that aside and make sure we understand it later. The goal is to edit an image based on a text query. Oh okay, so we're going to edit an image based on text; how on earth would you do that? They tell us right away what this is: semantic image editing is an extension of image generation with an additional constraint, which is that the generated image should be as similar as possible to the given input.

And generally, as they've done here, there's going to be a picture that shows us what's going on. In this picture you can see an example: here's an input image, and originally it was attached to a caption, "a bowl of fruits". We want to change this into a bowl of pears, so we type "a bowl of pears" and it generates, oh, a bowl of pears. Or we could change it from a bowl of fruits to a basket of fruits, and oh, it's become a basket of fruits. Okay, so I think I get the idea: what it's saying is that we can edit an image by typing what we want that image to represent. This actually looks a lot like the paper that we looked at last week, so that's cool.

So the abstract says that currently (so I guess there are current ways of doing this) they require you to provide a mask, which means you have to basically draw the area you're replacing, and that sounds really annoying.
But the main contribution, what this paper does, is that they automatically generate the mask, so you simply type in the new query and get the new image. That sounds actually really impressive. So if you read the abstract and you think "I don't care about doing that", then you can skip the paper, or look at the results, and if the results don't look impressive, just skip the paper. That's your first point where we can say, okay, we're done. But in this case this sounds great and the results look amazing, so I think we should keep going. Okay, it achieves state-of-the-art editing performance, of course, fine, and so on, whatever.

Okay, so the introduction to a paper is going to try to give you a sense of what they're trying to do, and this first paragraph here is just repeating what we've already read in the abstract and what we see in figure one, saying that we can take a text query like "a basket of fruits"; see the examples. All right, fine, we'll skip through that. The key thing about academic papers is that they are full of citations. You should not expect to read all of them, because if you do, then each of those citations is full of citations, and they're full of citations, and before you know it you've read the entire academic literature, which has taken you 5,000 years. So for now let's just recognize that it says text-conditional image generation is undergoing a revolution, here are some examples. Well, fine, we actually already know that: DALL-E, so-called latent diffusion (that's what we've been using), something called Imagen apparently; cool. So we kind of know that. Generally there's this "the area we're working on is important" bit; in this case we already agree it's important, so we can skip through it pretty quickly. Vast amounts of data are used: yes, we know. Diffusion models are interesting: yes, we know that. They denoise starting from Gaussian noise: we know that. So you can see there's a lot of stuff that, once you're in the field, you can skip over pretty quickly. You can guide it using CLIP guidance; yeah, that's what we've been doing, we know about that.

Oh wait, this is new: "or by inpainting, by copy-pasting pixel values outside a mask". All right, so there's a technique we haven't done, but I think it makes a lot of intuitive sense, which is: during that diffusion process, if there are some pixels you don't want to change, such as all the ones that aren't orange here, you can just paste them in from the original after each stage of the diffusion. All right, that makes perfect sense. If I want to know more about that I could always look at this paper, but I don't think I need to for now. And again it's just repeating something they've already told us, that these approaches require us to provide a mask, which is a bit of a problem. And then, this is interesting, it also says that when you mask out an area, that's a problem, because if you're trying to, for example, change a dog into a cat, you want to keep the animal's color and pose. So this is a new technique which is not deleting a section and replacing it with something else, but is actually going to take advantage of knowledge about what that thing looked like. So that's what they're claiming is new.

So hopefully at this point we know what they're trying to achieve. If you don't know what they're trying to achieve when you're reading a paper, the paper won't make any sense, so again, that's a point where you should stop.
Maybe this is not the right time to be reading this paper; maybe you need to read some of the references; maybe you need to look more at the examples. You can always skip straight to the experiments, and I often do. In this case I don't need to, because they've put enough experiments on the very first page for me to see what it's doing. So yeah, don't always read a paper from top to bottom.

Okay, so they've got some examples of conditioning a diffusion model on an input without a mask. For example, you can use a noised version of the input as a starting point; hey, we've done that too. So as you can see, we've already covered a lot of the techniques that they're referring to here. Something we haven't done, but which makes a lot of sense, is that we can look at the distance to the input image as a loss function. Okay, that makes sense to me, and there are some references here. All right, so we're going to create this new thing called DiffEdit, it's going to be amazing, wait till you check it out. Okay, fine. So that's the introduction; hopefully you found that useful for understanding what we're trying to do.

The next section is generally called related work, as it is here, and that's going to tell us about other approaches. If you're doing a deep dive, this is a good thing to study carefully. I don't think we're going to do a deep dive right now, so I think we can happily skip over it, with a quick glance: oh, image editing can include colorization, retouching, style transfer; okay, cool, lots of interesting topics, and I'm definitely getting more excited about this idea of image editing. And there are different techniques: you can use CLIP guidance, but that can be computationally expensive; we can use diffusion for image editing, okay, fine; we can use CLIP to help us. So there's a lot of repetition in these papers as well, which is nice, because we can skip over it pretty quickly. More about the high computational costs; okay, so they're saying this is going to be not so computationally expensive, which sounds hopeful. And often the very end of the related work is the most interesting, as it is here, where they talk about how somebody else has done something "concurrent to ours": somebody working at exactly the same time has looked at a different approach. So I'm not sure we learned too much from the related work, but if you were trying to do the very, very best possible thing, you could study the related work and get the best ideas from each.

Okay, now: background. This is where it starts to look scary, I think we could all agree, and this is often the scariest bit, the background. This is basically saying, mathematically, here's how the problem we're trying to solve is set up. And so we're going to start by looking at denoising diffusion probabilistic models, DDPM. Now, if you've watched lesson 9B with Wasim and Tanishk, then you've already seen some of the math of DDPM. And the important thing to recognize is that basically no one in the world, pretty much, is going to look at these paragraphs of text and these equations and go "oh, I get it, that's what DDPM is". That's not how it works. To understand DDPM you would have to read and study the original paper, and then read and study the papers it's based on, and talk to lots of people, and watch videos, and go to classes just like this one, and after a while you'll understand DDPM. And then you'll be able to look at this section and say, "oh okay, I see, they're just talking about this thing I'm already familiar with".
So this section is meant to be a reminder of something that you already know; it's not something you should expect to learn from scratch. So let me take you through these equations somewhat briefly, because Wasim and Tanishk have kind of done them already, and because pretty much every diffusion paper is going to have these equations. Oh, and I'm just going to read something that Johno pointed out in the chat: he says it's worth remembering the background is often written last and tries to look smart for the reviewers, which is correct, so feel free to read it last too. Yeah, absolutely. I think the main reason to read it is to find out what the different letters and symbols mean, because they'll probably refer to them later. But in this case I want to actually take this as a way to learn how to read math.

So let's start with this very first equation: how on earth do you even read this? The first thing I'll say is that this is not an E; it's a weird-looking E, and the reason it's a weird-looking E is because it's a Greek letter. Something I always recommend to students is that you learn the Greek alphabet, because it's much easier to be able to actually read this to yourself. Here's another one: if you don't know that it's called theta, I guess you have to read it as "circle with a line through it", and it's just going to get confusing trying to read an equation you can't actually say out loud. So what I suggest is that you learn the Greek alphabet. It's very easy to look up; on Wikipedia there's the Greek alphabet, and if we go down here you'll see they've all got names, and we can try to find our curvy e. Okay, here it is: epsilon. And oh, circle with a line through it: theta. All right, so practice and you will get used to recognizing these. So we've got epsilon, theta, and this is just a weird curly L, which is used for the loss function.

Okay, so how do we find out what this symbol means, and what this symbol means? Well, there are a few ways to do it. One way, which is kind of cool, is a program called Mathpix: you basically select anything on your screen and it will turn it into LaTeX. And the reason it's good to turn it into LaTeX is because LaTeX is written as actual text that you can search for on Google. So that's technique number one. Technique number two is that you can download the other formats of the paper, which will have a "download source" option, and if we say download source, then we'll be able to actually open up that LaTeX and have a look at it. So we'll wait for that to download; while that's happening, let's keep moving along.

So in this case we've got these two bars; can we find out what that means? We could try a few things: we could try searching for "two bars maybe math notation". Oh, here we are, this looks hopeful: "what does this mean in mathematics", and here there's a glossary of mathematical symbols, and there's a meaning of this in math, so that looks hopeful. Okay, so it definitely doesn't look like this one, it's not between two sets of letters; ah, but this one is around something, that looks hopeful. So it looks like we found it: it's a vector norm. So then you can start looking these things up; we can search for "norm" or maybe "vector norm", and once you can actually find the term, then we know what to look for.
So in our case, we've got this norm surrounding all this stuff, and then there are 2s here and here; what's going on there? All right, if we scroll through... oh, this is pretty close actually. Okay, so two bars can mean a matrix norm, otherwise a single bar is for a vector norm, so it looks like we don't have to worry too much about whether it's one or two bars. Oh, and here's the definition; that's handy. So we've got the 2 down here, and it's equal to the root of the sum of squares. That's good to know: this norm thing means a root sum of squares. But then we've got a 2 up here as well, which just means squared. Ah, so this is a root sum of squares, squared, and the square of a square root is just the thing itself. So actually this whole thing is just the sum of squares. It's a bit of a weird way to write it, in a sense; we could perfectly well have just written it as the sum of whatever-it-is squared, but there we go.

Okay, and then what about this thing here, the fancy E thing? How would you find out what that is? My goodness, this is still downloading, that's crazy, 20k per second; I wonder why that's taking so long. All right, maybe if we search for it... copy... and now it's just searching for a plain letter. Okay, try "fancy E", maybe "fancy E math symbol", "weird E letter"... no. Oh, it finished, great. So our LaTeX has finally finished downloading, and if we open it up we can find there's a .tex file in here; here we are, main.tex, so we'll open it. It's not the most amazingly smooth process, but what we can do is say, okay, it's just after it says "minimizing the denoising objective", so let's search for "minimizing the d"... oh, here it is, "minimizing the denoising objective". So here's the LaTeX; let's get the paper back on the screen at the same time. Okay, so here it is: L equals \mathbb{E}, x naught, t, epsilon, and here's that vertical bar thing, epsilon minus epsilon theta of x t, and then the bar thing, 2, 2. All right, so the new thing is \mathbb{E}, so finally we've got something we can search for: "mathbb E". Ah, fantastic: what does \mathbb{E} mean? That's the expected value operator. Aha, fantastic.

All right, so it takes a bit of fussing around. And another thing you could try, because Mathpix is ridiculously expensive in my opinion, is a free version called pix2tex, which is actually a Python thing. You could even have fun playing with it, because the whole thing is just a PyTorch Python script, and it even describes how it uses a transformers model and how you can train it yourself in Colab and so forth. But basically, as you can see, you can snip and convert to LaTeX, which is pretty awesome, so you could use this instead of paying the Mathpix guys.

Anyway, we're on the right track now, I think: expected value. And then we can start reading about what expected value is, and you might actually remember it, because we did a bit of it in high school; at least in Australia we did. Let's maybe jump over here. The expected value of something is saying: what's the likely value of that thing? So for example, let's say you toss a coin, which could be heads or tails, and you want to know how often it's heads, and so maybe we score heads as 1 and tails as 0. So you toss it and you get 1, 0, 0, 1, 1, 0, 1, 0, 1, and so forth.
And then you can calculate the mean of that: if that's x, you can calculate x bar, the mean, which would be the sum of all that divided by the count of all that. So it'd be 5 (one, two, three, four, five ones) divided by 9 (one, two, three, four, five, six, seven, eight, nine tosses). Okay, so that would be the mean. But the expected value is, well, what do you expect to happen? And we can calculate that by adding up, for each possibility x, how likely x is and what score you get if you get x. So in this example of heads and tails, our two possibilities are that we either get heads or we get tails. For the version where x is heads, the probability is 0.5 and the score is going to be 1. And then what about tails? For tails the probability is 0.5 and the score is 0. And so overall the expected value is 0.5 times 1, plus 0, which is 0.5. So our expected score if we're tossing a coin is 0.5, if getting heads is a win.

Let me give you another example. Let's say that we're rolling a die and we want to know what the expected score is. So again, we could roll it a bunch of times and see what happens, and we could sum all that up, like before, and divide it by the count, and that'll tell us the mean for this particular example. But what's the expected value more generally? Well, again, it's the sum over all the possibilities of the probability of each possibility times its score. So the possibilities for rolling a die are that you can get a one, a two, a three, a four, a five or a six; the probability of each one is a sixth; and the score that you get is just the number itself. And so then you can multiply all these together and sum them up, which would be 1/6 plus 2/6 plus 3/6 plus 4/6 plus 5/6 plus 6/6, and that would give you the expected value of rolling a die. So that's what expected value means.
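
Just to make the definition concrete, here are those two examples as code, computing the sum of probability times score for each possible outcome:

```python
# Expected value: sum over each possible outcome of (probability of outcome) * (score for outcome)

# Coin toss, scoring heads as 1 and tails as 0:
coin = 0.5 * 1 + 0.5 * 0                     # = 0.5

# Rolling a die, each face 1..6 with probability 1/6:
die = sum((1 / 6) * x for x in range(1, 7))  # = 3.5
print(coin, die)
```
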
All right, so that's a really important concept that's going to come up a lot as we read papers. And in particular, this subscript is telling us what all the things are that we're averaging over, what the expectation is over. There are a whole lot of letters here, and you're not expected to just know what they are; in fact, in every paper they could mean totally different things, so you have to look immediately underneath, where they'll be defined. So x0 is an image, an input image. Epsilon is the noise, and the noise has a mean of zero and a standard deviation of I, which, if you watched lesson 9B, you'll know is like a standard deviation of one when you're dealing with multiple normal variables. And then, this is kind of confusing: epsilon just on its own is a normally distributed random variable, so it's just grabbing random numbers, but epsilon theta is a noise estimator. That means it's a function; you can tell it's a function because it's got these parentheses and things right next to it. And presumably most functions like this, in these papers, are neural networks.

Okay, so we're finally at a point where this actually makes perfect sense: we've got the noise, we've got the prediction of that noise, we subtract one from the other, we square it, and we take the expected value. In other words, this is mean squared error. So that was a lot of fiddling around to find out that this whole thing just means mean squared error: the loss function is the mean squared error. Unfortunately, I don't think the paper ever says that; it just says "minimizing the denoising objective L" blah blah blah, but anyway, we got there eventually. Fine.

As well as learning about x naught, we also learn here about x t. x t is the original un-noised image times some number, plus some noise times one minus that number, and hopefully you'll recognize this from lesson 9B: this is the thing where we reduce the value of each pixel and we add noise to each pixel. So that's that.
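
For reference, here are those two equations written out in the standard DDPM notation; this is the usual formulation, and the paper's exact subscripts and alpha notation may differ slightly:

```latex
L = \mathbb{E}_{x_0,\, \epsilon,\, t} \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert_2^2 \right]
\qquad \text{with} \qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```
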
All right, so I'm not going to keep going through it, but you basically get the idea: once you know what you're looking for, the equations do actually make sense. But all this is doing, remember, is background: it's telling you what already exists, it's telling you what a DDPM is. And then it tells you what a DDIM is; think of DDIM as a more recent version of DDPM, with some very minor changes to the way it's set up, which allow us to go faster. The thing is, though, once we keep reading, what you'll find is that none of this background actually matters, but I thought we'd go through it just to get a sense of what's in a paper. For our purposes it's enough to know that DDPM and DDIM are kind of the foundational papers that today's diffusion models are based on. So the encoding process, which encodes an image into a latent variable, is basically adding noise; this is called DDIM encoding, and the thing that goes from the input image to the noised image they're going to call capital E with an r, where r is the encoding ratio: roughly, how much noise we're adding. If you use small steps, then decoding, so going backwards, gives you back the original image. Okay, so that's the stuff we've already learned about; that's what diffusion models are.

All right, so this looks like a very useful picture, so let's take a look and see what it says. What is DiffEdit? DiffEdit has three steps. Step one: we add noise to the input image. That sounds pretty normal: here's our input image x naught, and we add noise to it, fine. And then we denoise it, fine. Ah, but we denoise it twice. One time we denoise it using the reference text R, "horse" (or this special symbol here, which means nothing at all, so either unconditional or "horse"). So we do it once using the word "horse": we take this, estimate the noise, and then we can remove that noise on the assumption that it's a horse. Then we do it again, but the second time, when we calculate the noise, we pass in our query Q, which is "zebra". Wow, those are going to be very different noises. The noise for "horse" is just going to be literally these Gaussian pixels, these little dots, because it is a horse. But if the claim is "no, no, this is actually a zebra", then all of these pixels here are wrong, they're all the wrong color. So the noise that's calculated if we say "zebra" is our query is going to be totally different to the noise if we say "horse" is our query. And so then we just take one minus the other, and here it is: we derive a mask based on the difference in the denoising results, and then we binarize it, so basically turn it into ones and zeros.

So that's actually the key idea, and it's a really cool idea: once you have a diffusion model that's trained, you can do inference on it where you tell it the truth about what the thing is, and then you can do it again but lie about what the thing is. And in your lying version it's going to say, okay, all the stuff that doesn't match "zebra" must be noise. So the difference between the noise prediction when you say "hey, it's a zebra" versus the noise prediction when you say "hey, it's a horse" will be all the pixels where it says "no, these pixels are not zebra"; the rest of it is fine, because there's nothing about the background that wouldn't work with a zebra. Okay, so that's step one.

Then step two is that we take the horse and we add noise to it; that's this x r thing we learned about before. And then step three, we do the decoding conditioned on the text query, using the mask to replace the background with the original pixel values. So this is like the idea we heard about before: during inference, as you do diffusion from this fuzzy horse, we do a step of diffusion inference, and then all these black pixels we replace with the noised version of the original. And we do that multiple times, so the original pixels in this black area won't get changed, and that's why you can see, in this picture here and this picture here, the backgrounds are all the same, and the only thing that's changed is that the horse has been turned into a zebra.

So this paragraph describes it, and then it gives a lot more detail, and the detail often has all kinds of little tips about things they tried and things they found, which is pretty cool. I won't read through all of it, because it says the same as what I've just said. One of the interesting little things they note here is that this binarized mask, this difference between the R decoding and the Q decoding, tends to be a bit bigger than the actual area where the horse is, which you can kind of see with these legs, for example. And their point is that that's actually a good thing, because often you want to slightly change some of the details around the object, so this is fine.

All right, so we have a description of what the thing is, lots of details there, and then here's the bit that I totally skip: the bit called theoretical analysis. This is the stuff that people generally just add to try to get their papers past review; you have to have fancy math. They're basically proving, as it says here, "insight into why this component yields better editing results than other approaches". I'm not sure we particularly care, because what they're doing makes perfect sense, it's intuitive, and we can see it works; I don't feel like I need it proven to me. So I skip over that.

Then they show us their experiments, telling us which datasets they did the experiments on, and they have metrics with names like LPIPS and CSFID. You'll come across FID a lot; this is just a version of that, where basically they're trying to score how good their generated images are. We don't normally care too much about that either; they care because they need to be able to say "you should publish our paper because it has a higher number than the other people that have worked on this area". In our case we can just say, yeah, it looks good, I like it.

So, an excellent question in the chat from Michelage: will this only work on things that are relatively similar? I think this is a great point; this is where understanding the detail helps you to know what the limitations are going to be, and that's exactly right. If you can't come up with a mask for the change you want, this isn't going to work very well on the whole, because the pixels in the masked areas are just going to be copied.
So for example, if you wanted to change it from a bowl of fruits to a bowl of fruits with a bokeh background, or a purple-tinged photo of a bowl of fruits, where you want the whole color to change, that's not going to work, because you're not masking off an area. So by understanding the detail here, Michelage has correctly recognized a limitation, or, put another way, what this is for: the cases where you can say "just change this bit and leave everything else the same".

All right, so there are lots of experiments. For some things you care about the experiments a lot, if it's something like classification; for generation, the main thing you probably want to look at is the actual results. And often, I guess because most people read these electronically, you have to zoom in a lot to see whether the results are any good. So here's the input image; they want to turn this into an English Foxhound. Here's the thing they're comparing themselves to, SDEdit, which changed the composition quite a lot, whereas their version hasn't changed it at all; it's only changed the dog. And ditto here: semi-trailer truck; SDEdit totally changed it, DiffEdit hasn't. So you can get a sense that the authors are showing off what they're good at here: this is what the technique is effective at doing, changing animals and vehicles and so forth, and it does a very good job of it.

Then there's going to be a conclusion at the end, which I find almost never adds anything on top of what we've already read, and as you can see it's very short anyway. Now, quite often the appendices are really interesting, so don't skip over them: often you'll find more example pictures, they might show some examples that didn't work very well, stuff like that. So it's often well worth looking at the appendices; some of the most interesting examples are there. And that's it.

All right, so that is, I guess, our first full-on paper walkthrough, and it's important to remember this is not a carefully chosen paper that we've picked specifically because you can handle it; it's simply the most interesting paper that came out this week, so it gives you a sense of what it's really like. And for those of you who are ready to try something that's going to stretch you: see if you can implement any of this paper. There are three steps, and the first step is kind of the most interesting one, which is to automatically generate a mask. The information and the code in the lesson 9 notebook actually contain everything you need to do it, so maybe give it a go: see if you can mask out the area of a horse that does not look like a zebra. That's actually useful in itself, because it allows you to create segmentation masks automatically, which is pretty cool. Then, if you get that working, you can go and try step two, and if you get that working, you can try step three. This only came out this week, so I haven't really seen examples of easy-to-use interfaces to it, so here's an example of a paper where you could be the first person to create a cool interface to it. There's a fun little project. And even if you're watching this a long time after it was released, and everybody's been doing this for years, it's still good homework, I think, to practice if you can.
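
If you want a starting point for that homework, here is one rough sketch of step 1, written against a diffusers-style `unet` and `scheduler` like the ones used in the lesson 9 notebook. The averaging over a few noise samples, the normalisation, and the 0.5 threshold are my guesses at sensible defaults, not the paper's exact recipe, and the variable names are placeholders.

```python
import torch

def diffedit_mask(latents, ref_emb, query_emb, unet, scheduler, n=10, strength=0.5, thresh=0.5):
    # Step 1 of DiffEdit as described above: noise the image, predict the noise under the
    # reference text ("horse") and under the query ("zebra"), and use the difference as a mask.
    # Assumes scheduler.set_timesteps(...) has already been called.
    t = scheduler.timesteps[int(len(scheduler.timesteps) * (1 - strength))]
    diffs = []
    for _ in range(n):                                   # average over several noise samples to stabilise the estimate
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)
        pred_ref   = unet(noisy, t, encoder_hidden_states=ref_emb).sample
        pred_query = unet(noisy, t, encoder_hidden_states=query_emb).sample
        diffs.append((pred_query - pred_ref).abs().mean(dim=1))   # average over latent channels
    mask = torch.stack(diffs).mean(0)
    mask = (mask - mask.min()) / (mask.max() - mask.min())        # normalise to [0, 1]
    return (mask > thresh).float()                                # binarise into ones and zeros
```
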
All right, I think now's a good time to have a 10-minute break, so I'll see you all back here in 10 minutes.

And pop questions into the forum topic if you've got questions that haven't been answered yet.

Okay, welcome back.

One thing during the break that Diego reminded us about, which I normally describe and totally forgot about this time, is Detexify, which is another really great way to find symbols you don't know. So let's try it for that expectation symbol: you go to Detexify and you draw the thing.

It doesn't always work fantastically well, but sometimes it works very nicely. Yeah, in this case not quite. What about the double-line thing? It's good to know all the techniques, I guess. You'd think you could do this one. I guess part of the problem is there are so many options; okay, in this case it wasn't particularly helpful.

Normally it's more helpful than that. I mean, if we use a simple one like epsilon, I think it should be fine. There's a lot of room to improve this app actually. If anybody's interested in a project, I think you could make it, you know, more successful. Okay, there you go.

Sigma, sum: that's cool. Anyway, it's another useful thing to know about; just Google for Detexify. Okay, so let's move on with our "from the foundations" work now. We were working on trying to at least get the start of a forward pass of a linear model, or a simple multi-layer perceptron, for MNIST going.

And we had successfully created a basic tensor. We've got some random numbers going. So what we now need to do is we now need to be able to multiply these things together, matrix multiplication. So matrix multiplication to remind you in this case, so we're doing MNIST, right? So we've got, well I think we're going to use a subset.

Let's see. Yeah, okay. So we're going to create a matrix called M1, which is just the first five digits. So M1 will be the first five digits: five rows, and then, what was it again, 784 columns. Because it's 28 by 28 pixels.

And we flattened it out. So this is our first matrix and our matrix multiplication. And then we're going to multiply that by some weights. So the weights are going to be 784 by 10 random numbers. So for every one of these 784 pixels, each one is going to have a weight.

So 784 down here. 784 by 10. So this first column, for example, is going to tell us all the weights in order to figure out if something's a zero. And the second column will have all the weights in deciding if the probability of something's a one and so forth, assuming we're just doing a linear model.

And so then we're going to multiply these two matrices together. So when we multiply matrices together, we take row one of matrix one and we take column one of matrix two and we take each one in turns. We take this one and we take this one and we multiply them together.

And then we take this one and this one and we multiply them together. And we do that for every element-wise pair and then we add them all up. And that would give us the value for the very first cell. That would go in here. That's what matrix multiplication is.

So let's go ahead then and create our random numbers for the weights since we're allowed to use random number generators now. And for the bias we'll just use a bunch of zeros to start with. So the bias is just what we're going to add to each one. And so for our matrix multiplication we're going to be doing a little mini-batch here.
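
Here's roughly what those two lines look like; the shapes follow the description above, and the seed is just so the sketch is reproducible.

```python
import torch

torch.manual_seed(42)            # any seed; just for reproducibility
weights = torch.randn(784, 10)   # one random weight per pixel, per digit class
bias = torch.zeros(10)           # one bias per digit class, starting at zero
```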

We're going to be doing five rows of, as we discussed, five rows of, so five images flattened out and then multiplied by this weights matrix. So here are the shapes. M1 is 5 by 784, as we saw. M2 is 784 by 10. So keep those in mind. So here's a handy thing.

M1 dot shape contains two numbers and I want to pull them out. I'm going to actually think of these as A and B rather than M1 and M2. So, the number of rows in A and the number of columns in A: if I say AR, AC equals M1 dot shape, that will put 5 in AR and 784 in AC.

So you'll probably notice I do this a lot, this destructuring, and we talked about it last week too. We can do the same for M2 dot shape, putting that into B rows and B columns. And so now if I write out AR, AC and BR, BC, you can again see the same numbers as the sizes.

So that's a good way to kind of give us the stuff we have to look through. So here's our result. So our resultant tensor, well we're multiplying together all of these 784 things and adding them up. So the resultant tensor is going to be 5 by 10. And then each thing in here is the result of multiplying and adding 784 pairs.

So the result here is going to start with zeros and this is the result and it's going to contain AR rows, 5 rows, and BC columns, 10 columns, 5 comma 10. So now we have to fill that in. And so to do a matrix multiplication, first we have to go through each row one at a time.

And here we have that. Go through each row one at a time. And then go through each column one at a time. And then we have to go through each pair in that row column one at a time. So there's going to be a loop, in a loop, in a loop.

So here we're going to loop over each row, and here we're going to loop over each column of C, and then here we're going to loop over each column of A, which is going to be the same as the number of rows of B; we can see here that AC is 784 and BR is 784, they're the same.

So it wouldn't matter whether we said AC or BR. So then, for our result at that row and that column, we have to add on the product of element (i, k) of the first matrix and element (k, j) of the second matrix. So k is going up through those 784, which means we're going across a row of the first matrix whilst going down a column of the second.

So it's going to go across the row whilst it goes down this column. So here is the world's most naive, slow, uninteresting matrix multiplication. And if we run it, okay, it's done something: we have successfully (apparently, hopefully successfully) multiplied the matrices M1 and M2.
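
Here's what that triple loop looks like as code; this is a sketch of the function being described, so the notebook's own version may differ in variable names.

```python
import torch

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape     # destructure the shapes, as above
    assert ac == br                           # inner dimensions must match
    c = torch.zeros(ar, bc)                   # result: one row per image, one column per class
    for i in range(ar):                       # each row of a
        for j in range(bc):                   # each column of b (and of the result c)
            for k in range(ac):               # across the row of a, down the column of b
                c[i, j] += a[i, k] * b[k, j]
    return c

t1 = matmul(m1, m2)                           # m1: 5x784 mini-batch, m2: 784x10 weights
```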

It's a little hard to read the output, I find, because punch cards used to be 80 columns wide, and we still assume screens are 80 columns wide. Everything defaults to 80 wide, which is ridiculous, but you can easily change it: if you call set_printoptions you can choose your own line width. We know the result is 5 by 10, we saw that before. So if we change the line width, okay, that's much easier to read now.
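
This is the setting being referred to; 140 is just an example width.

```python
import torch
torch.set_printoptions(linewidth=140)   # default display width is 80; wider is much easier to read

import numpy as np
np.set_printoptions(linewidth=140)      # the same idea for NumPy arrays
```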

We can see here the five rows and here are the ten columns for that matrix multiplication. I tend to always put this at the top of my notebooks and you can do the same thing for NumPy as well. So what I like to do, this is really important, is when I'm working on code, particularly numeric code, I like to do it all step by step in Jupyter.

And then what I do is once I've got it working is I copy all the cells that have implemented that and I paste them and then I select them all and I hit shift M to merge. Get rid of anything that prints out stuff I don't need. And then I put a header on the top, give it a function name.

And then I select the whole lot and hit Ctrl (or Cmd) and right square bracket to indent it, and I've turned it into a function. But I still keep the stuff above it, so I can see all the step-by-step stuff for learning about it later. And so that's what I've done here to create this function.

And so this function does exactly the same things we just did, and we can see how long it takes to run by using %time. And it took about half a second, which, gosh, is a long time to generate such a small matrix. This is just to do five MNIST digits.

So that's not going to be great; we're going to have to speed that up. I'm actually quite surprised at how slow that is, because if you look at it, we've got a loop within a loop within a loop, and it's only doing 39,200 of these innermost operations (5 × 10 × 784). Python, yeah.

Python, when you're just doing pure Python, is slow. So we can't do that; that's why we can't just write plain Python. But there is something that kind of lets us write Python anyway: we could instead use Numba. Numba is a system that takes Python and turns it basically into machine code.

And it's amazingly easy to do. You can basically take a function and write @njit on top. What it's going to do is, the first time you call this function, compile it down to machine code, and then it will run much more quickly.

So what I've done here is I've taken the innermost loop. So just looping through and adding up all these. So start at zero, go through and add up all those. Just for two vectors and return it. This is called a dot product in linear algebra. So we'll call it dot.
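
Here's a sketch of that njit-compiled dot product, working on NumPy arrays:

```python
import numpy as np
from numba import njit

@njit
def dot(a, b):
    res = 0.0
    for i in range(len(a)):    # the innermost loop, compiled to machine code by Numba
        res += a[i] * b[i]
    return res

dot(np.array([1., 2., 3.]), np.array([2., 3., 4.]))   # first call compiles; later calls are fast
```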

And so Numba only works with NumPy; it doesn't work with PyTorch. So we're just going to use arrays instead of tensors for a moment. Now have a look at this: if I do a dot product of (1, 2, 3) and (2, 3, 4), it's pretty easy to do.

It took a fifth of a second, which sounds terrible. But the reason it took a fifth of a second is because that's actually how long it took to compile this and run it. Now that it's compiled the second time, it just has to call it. It's now 21 microseconds.

And so that's actually very fast. So with Numba, we can basically make Python run at C speed. Now, the important thing to recognize is that if I replace the innermost Python loop with a call to dot, which is running in machine code, then we now have just two loops running in Python.

And then the third, the dot, is running in machine code. So, compared to our 448 milliseconds... well, first of all, let's make sure: if I run that matmul, it should be close to my t1. t1 is what we got before, remember? So when I'm refactoring, or performance-improving, or whatever, I always like to put every step in the notebook and then test.

So this test close comes from fastcore.test and it just checks that two things are very similar. They might not be exactly the same because of little floating point differences which is fine. Okay, so our matmul is working correctly, or at least it's doing the same thing it did before.
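
If you want to follow along, here's roughly what the refactored version and the check might look like; `m1a` and `m2a` stand for NumPy copies of M1 and M2 (Numba wants arrays, not tensors) and are placeholder names, and here I check against NumPy's own matrix multiply just to keep the sketch self-contained.

```python
import numpy as np
from fastcore.test import test_close

def matmul_dot(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = np.zeros((ar, bc))
    for i in range(ar):
        for j in range(bc):
            c[i, j] = dot(a[i, :], b[:, j])   # innermost loop now runs as compiled code
    return c

test_close(matmul_dot(m1a, m2a), m1a @ m2a)   # same answer, within floating-point tolerance
```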

So if we now run it, it's taking 268 microseconds versus 448 milliseconds, so it's roughly 1,700 times faster just by changing the one innermost loop. Really, all we've done is add @njit, and it's well over a thousand times faster. So Numba is well worth knowing about; it can make your Python code very, very fast.

Okay, let's keep making it faster. So we're going to use stuff again which kind of goes back to APL. And a lot of people say that learning APL has taught them more about programming than anything else, so it's probably worth considering learning APL. And let's just look at these various things.

We've got a, which is 10, 6, minus 4. So remember, in APL we don't say equals; equals actually means equals, funnily enough. To say "set to" we use this arrow. And this is a list of 10, 6, minus 4. Okay, and then b is 2, 8, 7. Okay, and we're going to add them up.

A plus B. So what's going on here? So it's really important that you can think of a symbol like A as representing a tensor or an array. APL calls them arrays. PyTorch calls them tensors. NumPy calls them arrays. They're the same thing. So this is a single thing that contains a bunch of numbers.

This is a single thing that contains a bunch of numbers. This is an operation that applies to arrays or tensors. And what it does is it works what's called element-wise. It takes each pair, 10 and 2, and adds them together. Each pair, 6 and 8, add them together. This is element-wise addition.

And Fred's asking in the chat how do you put in these symbols. If you just mouse over any of them it will show you how to write it. And the one you want is the one at the very bottom, which is the one where it says prefix. Now the prefix is the backtick character.

So here it's saying prefix hyphen gives us times, so A backtick-hyphen B is A times B, for example. So yeah, they all have shortcut keys, which you learn pretty quickly, I find. And there's a fairly consistent kind of system for those shortcut keys too. Alright, so we can do the same thing in PyTorch.

It's a little bit more verbose in PyTorch, which is one reason I often like to do my mathematical fiddling around in APL. I can often do it with less boilerplate, which means I can spend more time thinking. You know, I can see everything on the screen at once. I don't have to spend as much time trying to like ignore the tensor, round bracket, square bracket, dot comma, blah blah blah.

It's all cognitive load, which I'd rather ignore. But anyway, it does the same thing. So I can say A plus B, and it works exactly like APL. So here's an interesting example. I can go A less than B dot float dot mean. So let's try that one over here.

A less than B. So this is a really important idea, which I think was invented by Ken Iverson, the APL guy, which is that true and false are represented by 1 and 0. And because they're represented by 1 and 0, we can do things to them. We can add them up and subtract them and so forth.

It's a really important idea. So in this case I want to take the mean of them. And I'm going to tell you something amazing, which is that in APL there is no function called mean. Why not? Because "mean" is four letters, M-E-A-N, and we can write the mean function from scratch with just four characters.

I'll show you. Here is the whole mean function. We're going to create a function called mean. And the mean is equal to the sum of a list divided by the count of a list. So this here is sum divided by count. And so I have now defined a new function called mean, which calculates the mean.

The mean of a less than b. There we go. And so in practice I'm not sure people would even bother defining a function called mean, because it's just as easy to write the implementation directly in APL. In NumPy or plain Python, on the other hand, it's going to take a lot more than four characters to implement mean.

So anyway, it's a math notation. And so being a math notation we can do a lot with a little, which I find helpful because I can see everything going on at once. So that's how we do the same thing in PyTorch. And again, you can see that the less than in both cases are operating element-wise.

So a less than b is saying: ten is less than two, six is less than eight, minus four is less than seven, and gives us back each of those trues and falses as zeros and ones. And according to the emoji on our YouTube chat, Siva's head just exploded, as it should.
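If you'd like to try the same thing yourself, the PyTorch side of it looks something like this, using the numbers from the lesson:

```python
import torch

a = torch.tensor([10., 6, -4])
b = torch.tensor([2., 8, 7])

a + b                    # tensor([12., 14.,  3.])  element-wise addition
a < b                    # tensor([False,  True,  True])
(a < b).float().mean()   # tensor(0.6667)  the fraction of places where a < b
```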

This is why APL is life-changing. Okay, let's now go up to higher ranks. So this here is a rank one tensor. A rank one tensor means it's a list of things; it's a vector. Whereas a rank two tensor is like a list of lists, where all the lists have to be the same length.

Or it's like a rectangular bunch of numbers. And we call it, in math we call it a matrix. So this is how we can create a tensor containing one, two, three, four, five, six, seven, eight, nine. And you can see often what I like to do is I want to print out the thing I just created after I created it.

So there are two ways to do it. You can press enter and write m on the next line, and that's going to do that. Or if you want to put it all on the same line, that works too: you just use a semicolon. Neither one's better than the other. They're just different.

So we could do the same thing in APL. Of course in APL it's going to be much easier. So we're going to define a matrix called m which is going to be a three by three tensor containing the numbers from one to nine. Okay and there we go. That's done it in APL.

A three by three tensor containing the numbers from one to nine. A lot of these ideas from APL you'll find have made their way into other programming languages. For example if you use go you might recognize this. This is the iota character and go uses the word iota so they spell it out in a somewhat similar way.

A lot of these ideas from APL have found their way into math notation and other languages. It's been around since the late fifties. Okay, so here's a bit of fun. We're going to learn about a new thing that looks kind of crazy called the Frobenius norm. And we'll use that from time to time as we're doing generative modeling.

And here's the definition of the Frobenius norm: we go over all of the rows and columns of a matrix, square each element, add them all up, and take the square root of the total. And so to implement that in PyTorch is as simple as going m times m, dot sum, dot sqrt.

So this looks like a pretty complicated thing. When you look at it at first, it looks like a lot of squiggly business. Or if you saw this thing here you might be like, what on earth is that? Well, now you know it's just square, sum, square root.
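In symbols, the Frobenius norm of A is the square root of the sum of the squares of every element a_ij. In PyTorch it comes out to something like this (torch.norm is just used here as a sanity check, since it defaults to the Frobenius norm for a matrix):

```python
import torch

m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

# square every element, add them all up, take the square root
(m * m).sum().sqrt()   # tensor(16.8819)

# PyTorch's built-in norm should give the same value
torch.norm(m)          # tensor(16.8819)
```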

So again we could do the same thing in APL. So let's do, so in APL we want the, okay so we're going to create something called sf. Now it's interesting, APL does this a little bit differently. So dot sum by default in PyTorch sums over everything. And if you want to sum over just one dimension you have to pass in a dimension keyword.

For very good reasons, APL is the opposite. It just sums across rows or just down columns. So actually we have to say sum up the flattened out version of the matrix. And to say flattened out you use comma. So here's sum up the flattened out version of the matrix.

Okay, so that's our sf. Oh sorry, and the matrix is meant to be m times m. There we go. So there's the same thing: sum up the flattened-out m times m matrix. And another interesting thing about APL is that it's always read right to left. There's no such thing as operator precedence, which makes life a lot easier.

Okay and then we take the square root of that. There isn't a square root function so we have to do to the power of 0.5. And there we go, same thing. Alright, you get the idea. Yes, a very interesting question here from Marabou. Are the bars for norm or absolute value?

And I like Siva's answer which is the norm is the same as the absolute value for a scalar. So in this case you can think of it as absolute value and it's kind of not needed because it's being squared anyway. But yes, in this case the norm, well in every case for a scalar the norm is the absolute value which is kind of a cute discovery when you realize it.

So thank you for pointing that out Siva. Alright, so this is just fiddling around a little bit to kind of get a sense of how these things work. So really importantly you can index into a matrix and you'll say rows first and then columns. And if you say colon it means all the columns.

So if I say row 2, here it is: row 2, all the columns. Sorry, this is row 2 counting from 0; APL starts at 1. Row 2, all the columns, that's going to be 7, 8, 9. And you can see I often use a comma to print out multiple things, and I don't have to say print in Jupyter, it's kind of assumed.

And so this is just a quick way of printing out the second row and then here every row column 2. So here is every row of column 2 and here you can see 3, 6, 9. So one thing very useful to recognize is that for tensors of higher rank than 1, such as a matrix, any trailing colons are optional.

So you see this here, m 2, that's the same as m 2 comma colon. It's really important to remember. So m 2, you can see the result is the same. So that means row 2, every column. So now, with all that in place, we've got quite an easy way forward: we don't need Numba anymore.
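Spelled out in code, those indexing experiments look like this:

```python
import torch

m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

m[2]       # tensor([7., 8., 9.])  row 2 (counting from 0), all columns
m[2, :]    # same thing: the trailing colon is optional
m[:, 2]    # tensor([3., 6., 9.])  every row, column 2
```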

We can multiply, so we can get rid of that innermost loop. So we're going to get rid of this loop, because this is just multiplying together all of the corresponding rows of a, sorry, all the corresponding columns of a row of a with all the corresponding rows of a column of b.

And so we can just use an element-wise operation for that. So here is the i-th row of a, and here is the j-th column of b. And so those are both, as we've seen, just vectors, and therefore we can do an element-wise multiplication of them, and then sum them up.

And that's the same as a dot product. So that's handy. And so again, we'll do test close. Okay, it's the same. Great. And again, you'll see we kind of did all of our experimenting first, right, to make sure we understood how it all worked, and then put it together, and then if we time it, 661 microseconds.
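So the matmul at this stage looks something like the sketch below, with the innermost loop replaced by an element-wise multiply and a sum:

```python
import torch

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # row i of a times column j of b, element-wise, then summed:
            # that's exactly a dot product
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c
```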

Okay, so it's interesting. It's actually slower than the Numba version, which really shows you how good Numba is, but it's certainly a hell of a lot better than our 450 milliseconds. But we're using something that's kind of a lot more general now. This is exactly the same as dot, as we've discussed.

So we could just use torch dot, torch.dot I suppose I should say. And if we run that, okay, a little faster. Interestingly, it's still slower than the Numba version, which is quite amazing, actually. Alright, so that one was not exactly a speedup, but it's kind of more general, which is nice.

Now we're going to get something into something really fun, which is broadcasting. And broadcasting is about what if you have arrays with different shapes. So what's a shape? The shape is the number of rows, or the number of rows and columns, or the number of what would you say, faces, rows and columns, and so forth.

So for example, the shape of M is three by three. So what happens if you multiply, or add, or do operations to tensors of different shapes? Well there's one very simple one, which is if you've got a rank one tensor, the vector, then you can use any operation with a scalar and it broadcasts that scalar across the tensor.

So a is greater than zero is exactly the same as saying a is greater than tensor, zero comma zero comma zero. So it's basically copying that across three times. Now it's not literally making a copy in memory, but it's acting as if we'd said that. And this is the most simple version of broadcasting.

It's broadcasting the zero across the ten, and the six, and the negative four. And APL does exactly the same thing. A is less than five: zero, zero, one. Same idea. Okay. So we can do plus with a scalar, and we can do exactly the same thing with tensors of rank higher than one, so two times a matrix is just going to be broadcast across all the rows and all the columns.
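For example, scalar broadcasting in PyTorch looks like this:

```python
import torch

a = torch.tensor([10., 6, -4])
m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

a > 0    # tensor([ True,  True, False])  the 0 is broadcast across all of a
a + 1    # tensor([11.,  7., -3.])
2 * m    # the scalar 2 is broadcast across every row and column of m
```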

Okay, now it gets interesting. So broadcasting dates back to APL, but a really interesting idea is that we can broadcast not just scalars, but we can broadcast vectors across matrices, or broadcast any kind of lower ranked tensor across higher ranked tensors, or even broadcast together two tensors of the same rank, but different shapes in a really powerful way.

And as I was exploring this, I loved doing this kind of computer archaeology, I was trying to find out where the hell this comes from, and it actually turns out from this email message in 1995 that the idea actually comes from a language that I'd never heard of called Yorick, which still apparently exists.

Here's Yorick. And so Yorick's documentation talks about broadcasting and conformability. So what happened is this very obscure language has this very powerful idea, and NumPy has happily stolen the idea from Yorick, which allows us to broadcast together tensors that don't appear to match. So let me give an example.

Here's a tensor called C that's a vector, a rank 1 tensor: 10, 20, 30. And here's a tensor called M, which is a matrix; we've seen this one before. And one of them is shape 3, the other is shape 3 by 3. And yet we can add them together. Now what's happened when we added them together?

Well what's happened is 10, 20, 30 got added to 1, 2, 3. And then 10, 20, 30 got added to 4, 5, 6. And then 10, 20, 30 got added to 7, 8, 9. And hopefully you can see this looks quite familiar. Instead of broadcasting a scalar over a higher rank tensor, this is broadcasting a vector across every row of a matrix.

And it works both ways, so we can say C plus M gives us exactly the same thing. And so let me explain what's actually happening here. The trick is to know about this somewhat obscure method called expand_as. And what expand_as does here is create a new thing called T, which contains exactly the same thing as C, but expanded, or kind of copied over, so it has the same shape as M.

So here's what T looks like. Now T contains exactly the same thing as C does, but it's got three copies of it now. And you can see we can definitely add T to M because they match shapes. We can say M plus T, we know we can say M plus T because we've already learned that you can do element-wise operations on two things that have matching shapes.

Now by the way, this thing T didn't actually create three copies. Check this out. If we call T.storage it tells us what's actually in memory. It actually just contains the numbers 10, 20, 30. But it does a really clever trick. It has a stride of zero across the rows, and a size of 3,3.

And so what that means is that it acts as if it's a 3x3 matrix, and each time it goes to the next row, it actually stays exactly where it is. And this idea of strides is the trick which NumPy and PyTorch and so forth use for all kinds of things where you basically can create very efficient ways to do things like expanding, or to kind of jump over things, and stuff like that.
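Here's roughly what that experiment looks like in code; the stride values shown are what you'd expect, though the exact printout depends on your PyTorch version:

```python
import torch

c = torch.tensor([10., 20, 30])
m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

t = c.expand_as(m)
t              # looks like three rows of 10., 20., 30.
t.storage()    # but only 10., 20., 30. are actually in memory
t.stride()     # (0, 1): moving to the next row doesn't move through memory at all
m + t          # element-wise addition works because the shapes now match
```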

You know, switch between columns and rows, stuff like that. Anyway, the important thing here for us to recognize is that we didn't actually make a copy. This is totally efficient, and it's all going to be run in C code, very fast. So remember, this expand_as is critical. This is the thing that will teach you to understand how broadcasting works, which is really important for implementing deep learning algorithms, or any kind of linear algebra on any Python system.

Because the NumPy rules are used exactly the same in JAX, in TensorFlow, in PyTorch, and so forth. Now I'll show you a little trick, which is going to be very important in a moment. If we take C, which remember is a vector containing 10, 20, 30, and we say dot unsqueeze zero, then it changes the shape from three to one comma three.

So it changes it from a vector of length three to a matrix of one row by three columns. This will turn out to be very important in a moment. And you can see how it's printed: it's printed out with two square brackets. Now, I never use unsqueeze, because I much prefer doing something more flexible, which is: if you index into an axis with the special value None, also known as np.newaxis, it does exactly the same thing.

It inserts a new axis here. So here we'll get exactly the same thing: one row by all the columns, three columns. So this is exactly the same as saying unsqueeze. So this inserts a new unit axis. This is a unit axis, a single row, in this dimension. And this does the same thing.

So these are the same. So we could do the same thing and say unsqueezed one, which means now we're going to unsqueeze into the first dimension. So that means we now have three rows and one column. See the shape here? The shape is inserting a unit axis in position one.

Three rows and one column. And so we can do exactly the same thing here. Give us every row and a new unit axis in position one. Same thing. So those two are exactly the same. So this is how we create a matrix with one row. This is how we create a matrix with one column.

None comma colon versus colon comma None. Or unsqueeze. We don't have to say None comma colon, as we've learned before, because, you remember, trailing colons are optional. So therefore just C None is also going to give you a one-row matrix. This is a little trick here: if you say dot, dot, dot, that means all of the dimensions.

And so dot, dot, dot, None will always insert a unit axis at the end, regardless of what rank a tensor is. So, yeah, None and np.newaxis mean exactly the same thing; np.newaxis is actually a synonym for None, if you've ever used that. I always use None. Because why not? It's short and simple.
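In code, the unsqueeze / None equivalences look like this:

```python
import torch

c = torch.tensor([10., 20, 30])

c.unsqueeze(0).shape   # torch.Size([1, 3])  one row, three columns
c[None, :].shape       # torch.Size([1, 3])  same thing
c.unsqueeze(1).shape   # torch.Size([3, 1])  three rows, one column
c[:, None].shape       # torch.Size([3, 1])  same thing
c[None].shape          # torch.Size([1, 3])  trailing colons are optional
c[..., None].shape     # torch.Size([3, 1])  ... means "all existing dimensions"
```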

Because why not? It's short and simple. So here's something interesting. If we go C colon comma None, so let's go and check out what C colon comma None looked like. C colon comma None is a column. And if we say expand_as M, which is 3 by 3, then it's going to take that 10/20/30 column and replicate it.

10/20/30, 10/20/30, 10/20/30. So we could add it, and I'll explain how. When you say matrix plus C colon comma None, it's basically going to do this dot expand_as for you. So if I want to add this matrix here to M, I don't need to say dot expand_as; I just write M plus C colon comma None.

And so this is just like doing M plus C, except that rather than adding the vector to each row, it's adding the vector to each column. See: plus 10/20/30, 10/20/30, 10/20/30. So that's a really simple thing that we now get kind of for free, thanks to this really nifty notation, this approach that came from Yorick.

So here you can see M plus C, none, colon, is adding 10/20/30 to each row, and M plus C colon, none is adding 10/20/30 to each column. All right, so that's the basic, like, hand-wavy version, so let's look at, like, what are the rules, and how does it work?

Okay, so C None comma colon is 1 by 3. C colon comma None is 3 by 1. What happens if we multiply C None comma colon by C colon comma None? Well, if you think about it, which you definitely should, because thinking is very helpful, what it's going to have to do is expand them. Let's see if this works, actually.

I'm not quite sure if expand_as will do this. C None comma colon, expand_as C colon comma None... no. No, that doesn't work. What is going on here? Okay, so what happens if we go C None comma colon times C colon comma None? What it's going to have to do is take this 10/20/30 column vector, or 3 by 1 matrix, and make it work across each of these rows.

So what it does is expand it to be 10/20/30, 10/20/30, 10/20/30, three columns side by side. So it's going to do it just like this. And then it's going to do the same thing for C None comma colon: that's going to become 3 rows of 10/20/30. So we're going to end up with 3 rows of 10/20/30 times 3 columns of 10/20/30, which gives us our answer.

And so this is going to do an outer product. So it's very nifty that you can actually do an outer product without any special functions or anything, just using broadcasting. And it's not just outer products, you can do outer boolean operations. And this kind of stuff comes up all the time.
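Here's the outer product, and an outer comparison, written out:

```python
import torch

c = torch.tensor([10., 20, 30])

c[None, :] * c[:, None]
# tensor([[100., 200., 300.],
#         [200., 400., 600.],
#         [300., 600., 900.]])

c[None] > c[:, None]    # an "outer boolean": every pairwise comparison at once
```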

Now remember, you don't need the comma, colon, so get rid of it. So this is showing us all the places where it's kind of an outer boolean, if you want to call it that. So this is super nifty, and you can do all kinds of tricks with this because it runs very, very fast.

So this is going to be accelerated in C. So here are the rules. When you operate on 2 arrays or tensors, NumPy and PyTorch will compare their shapes. So remember this is a shape. You can tell it's a shape because we said shape. And it goes from right to left, starts with the trailing dimensions.

And it checks whether the dimensions are compatible. Now, they're compatible if they're equal, right? So for example, if we say m times m, then those two shapes are compatible because in each case it's just going to be 3, right? So they're going to be equal. So if the shape in that dimension is equal, they're compatible.

Or if one of them is 1, then that dimension is broadcast to make it the same size as the other. So that's why the outer product worked. We had a 1 by 3 times a 3 by 1. And so this 1 got copied 3 times to make it this long.

And this 1 got copied 3 times to make it this long. Okay, so those are the rules. So the arrays don't have to have the same number of dimensions. So this is an example that comes up all the time. Let's say you've got a 256 by 256 by 3 array or tensor of RGB values.

So you've got an image, in other words: a color image. And you want to normalize it, so you want to scale each color in the image by a different value. This is how we normalize colors. So one way is you could multiply (or divide, or whatever) the image by a one-dimensional array with 3 values.

So you've got a 1D array, so that's just 3. And then the image is 256 by 256 by 3. And we go right to left, and we check, "Are they the same?" And we say, "Yes, they are." And then we keep going left, and we say, "Are they the same?" And if it's missing, we act as if it's 1.

And if we keep going, if it's missing, we act as if it's 1. So this is going to be the same as doing 1 by 1 by 3. And so this is going to be broadcast, this three elements will be broadcast over all 256 by 256 pixels. So this is a super fast and convenient and nice way of normalizing image data with a single expression.
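As a sketch (the image and the per-channel values here are made up, not from the lesson), normalizing an RGB image with broadcasting looks like this:

```python
import torch

img   = torch.rand(256, 256, 3)          # a fake 256 x 256 RGB image
means = torch.tensor([0.5, 0.4, 0.3])    # hypothetical per-channel values

# shapes compared right to left: (256, 256, 3) vs (3,)
# 3 matches 3, and the missing leading dims are treated as 1,
# so the three values broadcast over every pixel
normed = img / means
normed.shape    # torch.Size([256, 256, 3])
```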

And this is exactly how we do it in the Fast.ai library, in fact. So we can use this to dramatically speed up our matrix multiplication. Let's just grab a single digit, just for simplicity. And I really like doing this in Jupyter Notebooks. And if you build Jupyter Notebooks to explain stuff that you've learned in this course or ways that you can apply it, consider doing this for your readers, but add a lot more prose.

I haven't added prose here because I want to use my voice. For example, in our book that we published, it's all written in notebooks, and there's a lot more prose, obviously. But really, I like to show every example all along the way, using simple as possible. So let's just grab a single digit.

So here's the first digit. Its shape is a 784-long vector. And remember that our weight matrix is 784 by 10. So if we say digit colon comma None dot shape, then that is 784 by 1, a one-column matrix. So there's our matrix. And so if we then take that, 784 by 1, and expand_as m2, it's going to be the same shape as our weight matrix.

So it's copied our image data for that digit across all of the 10 vectors representing the 10 linear projections we're doing for our linear model. And so that means that we can take the digit colon comma None, so 784 by 1, and multiply it by the weights. And so that's going to get us back 784 by 10.

And so what it's doing, remember, is it's basically looping through each of these 10 784 long vectors, and for each one of them, it's multiplying it by this digit. So that's exactly what we want to do in our matrix multiplication. So originally we had, well not originally, most recently I should say, we had this dot product where we were actually looping over J, which was the columns of B.

So we don't have to do that anymore, because we can do it all at once by doing exactly what we just did. So we can take the ith row and all the columns and add an axis to the end and then just like we did here, multiply it by B, and then dot sum.

And so that is, again, exactly the same thing. That is another matrix multiplication, doing it using broadcasting. Now, this is tricky to get your head around, and so if you haven't done this kind of broadcasting before, it's a really good time to pause the video and look carefully at each of these four cells before continuing.

Understand: what did I do there? Why did I do it? What am I showing you? And then experiment. And remember that we started with m1 0, just like we have here a i. So that's why we've got i comma colon comma None, because this digit is actually m1 0, so this is like m1 0 colon comma None.

So this line is doing exactly the same thing as this here, plus the sum. So let's check that this matmul is the same as it used to be, and yes, it's still working, and look at the speed of it. Okay, not bad: 137 microseconds. So we've now gone from about 500 milliseconds to about 0.1 milliseconds.
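So the broadcast version of matmul ends up looking something like this sketch:

```python
import torch

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # a[i, :, None] is (ac, 1); multiplied by b it broadcasts across
        # every column, and summing down dim 0 gives the whole of row i at once
        c[i] = (a[i, :, None] * b).sum(dim=0)
    return c
```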

Funnily enough, on my... oh, actually, now I think about it, my MacBook Air is an M2, whereas this Mac Mini is an M1, so that's not quite a fair comparison. My Air was a bit faster than 0.1 milliseconds. So overall we've got about a 5,000 times speed improvement. So that is pretty exciting.

And since it's so fast now, there's no need to use a mini-batch anymore. If you remember, we used a mini-batch of, where is it, of five images. But now we can actually use the whole data set because it's so fast. So now we can do the whole data set.

There it is. We've now got 50,000 by 10, which is what we want. And so it's taking us only 656 milliseconds now to do the whole data set. So this is actually getting to a point now where we could start to create and train some simple models in a reasonable amount of time.

So that's good news. All right. I think that's a probably good time to take a break. We don't have too much more of this to go, but I don't want to keep you guys up too late. So hopefully you learned something interesting about broadcasting today. I cannot overemphasize how widely useful this is in all deep learning and machine learning code.

It comes up all the time. It's basically our number one most critical kind of foundational operation. So yeah, take your time practicing it and also good luck with your diffusion homework from the first half of the lesson. Thanks for joining us and I'll see you next time.