
Lesson 9: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Introduction
6:38 This course vs DALL-E 2
10:38 How to take full advantage of this course
12:14 Cloud computing options
14:58 Getting started (Github, notebooks to play with, resources)
20:48 Diffusion notebook from Hugging Face
26:59 How stable diffusion works
30:06 Diffusion notebook (guidance scale, negative prompts, init image, textual inversion, Dreambooth)
45:00 Stable diffusion explained
53:04 Math notation correction
74:37 Creating a neural network to predict noise in an image
87:46 Working with images and compressing the data with autoencoders
100:12 Explaining latents that will be input into the unet
103:54 Adding text as one hot encoded input to the noise and drawing (aka guidance)
107:06 How to represent numbers vs text embeddings in our model with CLIP encoders
113:13 CLIP encoder loss function
120:55 Caveat regarding "time steps"
127:04 Why don't we do this all in one step?

Transcript

Hi everybody, and welcome to Deep Learning Foundations to Stable Diffusion. Hopefully it's not too confusing that this is described here as lesson 9. That's because, strictly speaking, we treat this as part two of the Practical Deep Learning for Coders series. Part one had eight lessons, so this is lesson 9, but don't worry, you didn't miss anything.

It's the first lesson of part two, which is called Deep Learning Foundations to Stable Diffusion. Maybe rather than calling it practical deep learning for coders, we should call this impractical deep learning for coders, in the sense that we're certainly not going to be spending all of our time seeing exactly how to do important things with deep learning.

We'll be doing a whole lot of fun things, generative modelling fun things, and also a whole lot of understanding of details, which you won't necessarily need to know to use this stuff. But if you want to become a researcher, or if you want to put something into production that has some kind of complex customization requirements, stuff like that, then it's going to be very helpful to learn the details we'll be talking about.

So here in lesson 9 there are going to be two parts. One is a quick-ish run-through of using Stable Diffusion, because we're all dying to play with it. And then the other thing I'll be doing is describing in some detail what's going on and how it's working.

There'll be a whole lot of hand-waving either way, because it's going to take us a few lessons to describe everything from scratch. But hopefully you'll come away from this lesson with at least a reasonable intuitive understanding of how this is all working.

Assumptions: I'm going to try to explain everything. So if you haven't done deep learning before, this is going to be very hard, but I will at least be trying to say roughly what's going on and where you can find out more.

Having said that, I would strongly suggest doing part one before doing this course, unless you really want to throw yourself in the deep end and give yourself quite the test. If you haven't done part one of Practical Deep Learning for Coders but you're reasonably comfortable with deep learning basics, so you could write a basic SGD loop in Python, you know how to use PyTorch ideally (though TensorFlow is probably okay as well), and you know the basic ideas, like what an embedding is and how to create one from scratch, you'll probably be fine.

Generally speaking, for these courses I find most people tend to watch the videos a few times, and often the second time through folks will pause, look up the things they don't know, and check things out. Generally speaking, we expect people to spend about 10 hours of work on each video.

Having said that, some people spend a hell of a lot more and go very deep. Some people will spend a whole year, a sabbatical, studying Practical Deep Learning for Coders in order to really fully understand everything. So really it's up to you how deep you go.

Okay, so with that said, let's jump into it. As I said, in the first part we're going to be playing around with Stable Diffusion, and I tried to prepare this as late as possible so it wouldn't be out of date. Unfortunately, as of 12 hours ago, it is now out of date.

And this is one of the big issues with the bit I'm about to describe, which is how to play with Stable Diffusion and exactly how the details work: it's moving so quickly that all the details I'm going to describe today, and all the software I'm going to show you today, will have changed by the time you watch this.

It's the 11th of October as I record this. So if you're watching this in, say, December of 2022, or in 2023, the details will have changed. So what's happened in the last 24 hours is that two papers have come out. For example, I was going to tell you today that for a Stable Diffusion generative model, the number of steps required has gone down from a thousand to about 40 or 50.

But then, as of last night, a paper has just come out saying it's now down to four steps, and 256 times faster. And another paper has come out with a separate, I think orthogonal, approach which makes it another 10 to 20 times faster. So things are very exciting, and things are moving very quickly. Now, having said that, don't worry, because after this lesson we're going to be going from the foundations, which means we're going to be learning how all of these things are built up, and those don't change much at all.

In fact, a lot of what we'll be seeing is extremely similar to another course we did in 2019, because the foundations don't change. And once you know the foundations, when you see the kinds of details you'll find in these papers, you'll be like: oh, I see, they did all these things the same way as usual, and they made this little change. That's why we do things from the foundations: so that you can keep up with the research, and do your own research, by taking advantage of this foundational knowledge which all of these papers are building on top of.

So anyway, I guess I should apologize that even as I record this, the notebook is now one day out of date. In part one, you might remember, we saw this stuff from DALL-E 2, the illustrations of Twitter bios, which were really pretty cool.

The cool thing is that we're now at a point where we can build this stuff ourselves and run it ourselves. We won't actually be doing it using this particular model, DALL-E 2; we'll be using a different model, Stable Diffusion, which has very similar kinds of outputs. But we can go even further now.

One of our wonderful alumni, Alon, recently started a new company called Strmr, where you can use something we'll be learning about today called DreamBooth to put any object, person, whatever, into an image. He was kind enough to do a quick DreamBooth run for me and generate these various pictures of me using his service.

So here's a fun service you can try. One crazy one he tried was me as a dwarf, which I've got to say actually worked pretty well: this half looks like me, I reckon, and the bottom bit is the dwarf version. So thank you, Alon, and congratulations on your great progress since completing the fast.ai course.

Yeah, I love it. So, something that's a bit different about this compared to a lot of previous courses is that it's no longer just a me thing. Because this is moving so quickly, I've needed a lot of help to even vaguely get up to date and stay up to date. So everything I'll be showing you today is very heavily influenced by extremely high levels of input from these amazing folks, all of whom are fast.ai alumni.

Jonathan Whitaker, who I saw in our chat, was basically the first person to create detailed educational material about Stable Diffusion, and has been in the generative model space for a long time, by Stable Diffusion standards I guess. Wasim has been an extraordinary contributor to all things fast.ai. Pedro came to San Francisco the last time we did a part two course, in 2019, took what he learned there, made his amazing Camera+ software dramatically better, and had it highlighted by Apple for the extraordinary machine learning features he added.

He's now at Hugging Face, working on Diffusers, the software we'll be using a lot. And Tanishq, who the fast.ai community on the forum probably already knows, is now at Stability AI working on Stable Diffusion models; his expertise is particularly in medical applications.

So really, folks from pretty much all the key groups around Stable Diffusion are working on this together. You'll also find that some of these folks have recorded additional videos going into more detail about some of these areas, which you will find on the course website.

So make sure you go to course.fast.ai to get all the information about all the materials you need to take full advantage of this. Every lesson has links to notebooks, details, and so forth. If you want to go even deeper, head over to forums.fast.ai, into the Part 2 2022 category, and hit the 'About the course' button.

You'll find that every lesson has a chat topic with even more stuff. So look at it carefully to see all the things that the community and I have provided to help you understand this video, and also check out the questions and answers underneath to see what people have talked about.

Those topics can get a bit overwhelming, so once they get big enough you'll see there's a summarize button you can click to see just the most-liked parts, which can be very helpful. Okay, so those are all important resources, I think, to get the most out of this course.

Now, compute. Completing part two requires quite a bit more compute than part one, and compute options are changing rapidly. To be honest, the main reason for that is the huge popularity of Stable Diffusion: everybody's taken to using Colab for Stable Diffusion, and Colab's response has been to start charging by the hour for most usage.

So if you're a Colab user (we still love Colab), you may well find that they stop giving you decent GPUs, and if you want to then upgrade, they limit quite a lot how many hours you can use.

So at the moment, still try Colab; they're pretty good, and for free you get some decent stuff. But I would strongly suggest also trying out Paperspace Gradient. You can pay about $9 a month to get some pretty good GPUs there at the moment, or pay them a bit more to get even better ones.

Again, though, this is all going to change a lot. Maybe the demand will make Paperspace Gradient change their pricing too; I don't know. So check course.fast.ai to find out what our current recommendations are. Lambda Labs and Jarvislabs are also both good options.

Jarvislabs was created by an alum of the course and has some really fantastic options at a very reasonable price; a lot of fast.ai students use them and love them. And also check out Lambda Labs, who are the most recent provider on that page.

They are rapidly adding new features, but the reason I particularly wanted to mention them is that, at least as I say this in early October 2022, they're the cheapest provider of the kind of big GPUs you might want to use to run serious models.

So they're absolutely well worth checking out. But as I say, this could all have changed by the time you watch this, so go and check out course.fast.ai. Also, at the moment, in late 2022, GPU prices have come down a lot, and you may well want to consider buying your own machine at this point.

Okay. So what we're now going to do is jump into the notebooks. There's a repo that we've linked to called diffusion-nbs. It isn't the main course notebooks, the 'from the foundations' notebooks; it's just a couple of notebooks that you might want to play with, a bit of fun stuff to try out.

One of the interesting things here is that Jonathan Whitaker, who I tend to call Jono (so if I say Jono, that's who I'm referring to), has this really interesting file called suggested_tools.md, which hopefully he'll keep up to date. So even if you come here later, this should still be current, because he knows so much about this area and has been able to pull out some of the best stuff out there for just starting to play.

And I think it's actually important to play, because that way you can really understand what the capabilities are and what the constraints are. Then you can think about what you could do with that, and also what kind of research opportunities there might be.

So I'd strongly suggest trying out these things. The community on the whole has moved towards making things available as Colab notebooks. If I click, for example, on this one, Deforum, you'll see they often have this kind of hacker aesthetic around them, which is kind of fun.

What happens is that they add lots and lots of features, and you can basically just fill in this stuff to try things, and they often have a few examples. You can go to the Runtime menu and say 'Change runtime type' to make sure it says GPU, choose what kind of GPU, and start running things.

Now, a lot of the folks who use this stuff honestly have no idea what any of these things mean. By the end of the course, you'll know what pretty much all of these things mean, and that will help you to make great outputs from stuff like this. But you can also create great outputs just using more of an artisanal approach.

There's lots of information online about what kinds of things you could try. So anyway, check out this stuff from Jono. He also links to this fantastic resource from pharmapsychotic, which is a rather overwhelming list of things to play with.

Now again, maybe by the time you watch this, this will have all changed, but I just wanted you to know these kinds of things are out there, and they're basically ready-to-go applications that you can start playing with. So play a lot. What you'll find is that most of them, at least at the moment, expect you to input some text saying what you want to create a picture of.

It turns out, as we'll learn in detail later, that it's not very easy to know what text to write, and that gives kind of interesting results. At the moment, it's quite an artisanal thing to understand what to write; that text is called the prompt, and the best way to learn about prompts is to look at other people's prompts and their outputs. Perhaps the best way to do that right now is Lexica, which has lots and lots of really interesting AI artworks, and you can click on one and see what prompt was used.

You'll see here that generally you start with what you want to make a picture of and what the style is. Then the trick is to add a bunch of artists' names, or places where people put art, so that the algorithm will tend to create a piece matching art that tends to have these kinds of words in its captions.

So there's a really useful trick to kind of get good at this. You can even search for things; I don't know if they have teddy bears, let's try. There we go. Not that one... that's a pretty good teddy bear image.

So you can get some sense of how to create nice teddy bear images. That's so cute; I know what I'm going to be showing my daughter tomorrow. And you can see they often tend to have similar kinds of stuff in the prompts to try to encourage the algorithm to give good outputs.

Okay. So by the end of this course, you'll understand why this is happening, why these kinds of prompts create these kinds of outputs, and also how you can go beyond just creating prompts to actually building really innovative new things with new data types.

Okay, so let's take a look at the diffusion-nbs repo. The first thing we'll look at is the stable_diffusion notebook. There are a couple of options here: you can clone this repo, which is linked from both course.fast.ai and the forum, and run it on, say, Paperspace Gradient or your own machine or whatever; or you can head over to Colab, select GitHub, and paste in the link to the notebook directly from GitHub. I'm running it on my own machine, and this notebook has largely been built thanks to the wonderful folks at Hugging Face, who have a library called Diffusers.

Any of you who have done part one of the course will be very familiar with Hugging Face; we used a lot of their libraries in part one. Diffusers is their library for doing Stable Diffusion and things like Stable Diffusion. These things are changing a lot, but at the moment this is our recommended library for doing this stuff, and it's what we'll be using in this course. Maybe by the time you watch this there'll be lots of other options, so again, keep an eye on course.fast.ai. In general, Hugging Face have done a really good job of being at, and staying at, the head of the pack around models for deep learning, so it would not be surprising if they continue to be the best option for quite a while. But the basic idea of any library is going to look pretty similar. To get started playing with this, you will need to log in to Hugging Face.

If you don't have a Hugging Face account, you can create a username and password there and then log in. Once you've done it once, it'll be saved on your computer, so you won't have to log in again. The thing we're going to be working with is pipelines, in particular the Stable Diffusion pipeline. Again, they might be using different pipelines by the time you watch this, but the basic idea of a pipeline is quite similar to what we call a Learner in fastai: it's got a whole bunch of things in it, a bunch of processing and models and inference, all happening automatically.

Just like you can save a Learner in fastai, you can save a pipeline in Diffusers. Something you can do in pretty much all Hugging Face libraries that you can't do in fastai is then save that pipeline (or whatever else) up into the cloud, onto what they call the Hub. So when we call from_pretrained, it's a lot like how we create pre-trained Learners in fastai, but the thing you pass in, if it's not a local path, is a Hugging Face repo. If we search Hugging Face for this, you can see what it's going to download, and you can save your own pipelines up to the Hub for other people to use.
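
As a rough sketch of what those first cells look like in code (the model repo and arguments here are the ones commonly used with Diffusers around October 2022, so treat them as assumptions that may have changed):

```python
import torch
from huggingface_hub import notebook_login
from diffusers import StableDiffusionPipeline

# Log in once so the model weights can be downloaded; the token is cached locally afterwards.
notebook_login()

# Download the pipeline from the Hub (several gigabytes the first time) and move it to the GPU.
# "CompVis/stable-diffusion-v1-4" was the usual repo at the time; newer checkpoints exist now.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,   # half precision, to fit comfortably on a consumer GPU
).to("cuda")
```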

So I think this is a very nice feature that helps the community build stuff. The first time you run this, it's going to download many gigabytes of data from the internet. That's one of the slight challenges with using this on Colab: every time you use Colab, everything gets thrown away and you start from scratch, so it'll all have to be downloaded every time. If you use something like Paperspace, or particularly Lambda Labs, it will all be saved for you. Once you've downloaded all this, it saves a whole bunch of stuff into .cache in your home directory; that's where Hugging Face puts things.

So now that we have a pipeline called pipe, we can treat it as if it's a function, which is pretty common for PyTorch and fastai stuff; you should be very familiar with this, hopefully. And you can pass it a prompt.

This is just some text, and the pipeline is going to return some images. Since we're only passing one prompt, it's going to return one image, so we'll just index into .images. When we run it, it takes maybe 30 seconds or so and returns a photograph of an astronaut riding a horse.

Every time you call a pipeline using the same random seed, you'll get the same image. You can set the random seed manually, so you could send it to somebody else and say, 'Oh, this is a really cool astronaut riding a horse I found; try manual seed 1024,' and they'll get back this particular astronaut riding a horse. So that's the most basic way to get started creating images, running on Colab or on your own machine.
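
In code, the basic call and the seeded version look roughly like this (a minimal sketch, assuming the pipe created above; the prompt and seed are the ones from the lesson):

```python
prompt = "a photograph of an astronaut riding a horse"

# The pipeline returns an object with a list of images; we passed one prompt,
# so we just take the first (and only) image.
image = pipe(prompt).images[0]

# Reproducible version: pass a generator with a fixed seed, so anyone running
# the same pipeline with the same seed gets back the same astronaut.
generator = torch.Generator("cuda").manual_seed(1024)
image = pipe(prompt, generator=generator).images[0]
```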

As I said, it takes 30 seconds or so, and in this case it used 51 steps. What it's doing is very different from what we're used to with inference in fastai, where it's one step to classify something, for example. What it's doing in these 51 steps... here's an example that we're going to create ourselves later, generating handwritten digits; this is actually an image from a later notebook we'll be building.

It basically starts with random noise, and at each step it tries to make the image slightly less noisy and slightly more like the thing we want. Going down here shows all the steps to create the first four, for example, or here to create the first one. If you look closely, you can kind of see that in this noise there is something that looks a bit like a one, so it kind of decides to focus on that. And that's how these diffusion models basically work.

Remember, if you're having any trouble finding the materials we're looking at, go to course.fast.ai or the forum topic to see all the links; this repo is called diffusion-nbs,

and the notebook, as you can see at the top, is called stable_diffusion. Now, a question might be: why don't we just do it in one go? We can do it in one go, but if we try to, it doesn't do a very good job.

These models aren't, as I speak in October 2022, smart enough to do it in one go. And as I mentioned at the start, the fact that I'm doing it in 51 steps here is hopelessly out of date, because as of yesterday, apparently we can now do it in three or four steps.

I'm not sure if that code's available yet, so by the time you see this, this might all be dramatically faster. But as I'll be describing, understanding this basic concept is, I'm pretty confident, going to be very important, like, forever. So we'll talk about that. If we do 16 steps instead of 51 steps, it looks a bit more like it, but it's still not amazing.
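
The number of steps is just a keyword argument on the pipeline call; a sketch, using the step counts mentioned in the lesson:

```python
# Fewer denoising steps is faster but rougher; around 50 was the usual default in late 2022.
image = pipe(prompt, num_inference_steps=16).images[0]
```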

Okay, so that's how you can get started, and I'll show you a few things that you can tune. I should remind you that most of the stuff I'm showing you here was built by Pedro Cuenca and the other folks at Hugging Face, so huge thanks to them.

There's no way I could have been as up to speed with all this detail without their help. They built this library, Diffusers, and they've done a fantastic job of helping show what you can do with it. So let's look at an example of what you can do with it.

We're just going to quickly define a little function here to create a grid of images; the details don't matter. What we do want to show here is that you can take your prompt, which was 'an astronaut riding a horse', and just create four copies of it.

Multiplying a list by a number simply repeats the list that many times, so here's a list of the exact same prompt four times. Then we're going to pass the prompts to the pipeline, and we're going to use a new parameter called guidance scale.

We'll be learning about guidance scale in detail later in the course, but basically what it does is say to what degree we should be focusing on the specific caption versus just creating an image. We're going to try a few different guidance scales: one, three, seven, fourteen. Generally 7.5 is the default at this stage, I believe; that might have changed by the time you watch this. Each row here is a different guidance scale. You can see that in the first row it hasn't really listened to us much at all: these are very weird-looking things, and none of them really look like astronauts riding a horse. At a guidance scale of three, they look more like things riding horses that might be astronaut-ish. At around seven, on the whole they certainly look like astronauts riding a horse. And at 14 they certainly look like that, but they're getting a little bit too abstract sometimes.
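
A sketch of that experiment, assuming the pipe and prompt from before (image_grid is the little helper defined in the notebook; its exact signature here is an assumption):

```python
num_images = 4
prompts = [prompt] * num_images   # the same prompt repeated four times

images = []
for g in (1, 3, 7, 14):
    # One row per guidance scale. In Diffusers, a guidance scale of 1 or less effectively
    # turns classifier-free guidance off, which is why the first row ignores the prompt.
    images += pipe(prompts, guidance_scale=g).images

image_grid(images, rows=4, cols=num_images)
```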

I have a pretty strong feeling there are some slight problems with how this is actually coded, or how the algorithm works, which I'll be looking at during this course, so maybe by the time you see this, some of these will be looking a bit better.

I think basically what's happening here is that it's over-jumping a bit too far at these high guidance scales. Anyway, the basic idea of what guidance is doing is that for every single prompt, it's actually creating two versions of the image: one version with the prompt, 'an astronaut riding a horse', and one version with no prompt at all, just some random thing. Then it basically takes the average of those two things. That's what guidance scale does; you can think of the guidance scale as being a bit like a number that's used to weight the average.
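
Slightly more precisely, the usual classifier-free guidance step combines the two noise predictions something like this (this is the standard formula, up to implementation details), with g the guidance scale:

$$\hat\epsilon \;=\; \epsilon_{\text{uncond}} + g\,\bigl(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}\bigr)$$

So g = 1 just gives back the prompt-conditioned prediction, and larger g pushes further in the direction "towards the prompt, away from no prompt".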

There's something very similar you can do where, again, you get the model to create two images, but rather than taking the average, you ask it to effectively subtract one from the other. So here's something Pedro did using the prompt 'a Labrador in the style of Vermeer'.

He then said: what if we subtract something which is just the model's output for the caption 'blue'? You can pass in this thing called a negative prompt to Diffusers. What that will do is take the prompt, which in this case is 'Labrador in the style of Vermeer', effectively create a second image responding just to the prompt 'blue', and effectively subtract one from the other.

The details are slightly different from that, but that's the basic idea, and that way we get a non-blue Labrador in the style of Vermeer. So that's the basic idea of how to use a negative prompt, and you can play with that; it's good fun.
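
A sketch of that call, using the prompts from Pedro's example:

```python
# The negative prompt takes the place of the empty "no prompt" image in the guidance step,
# so the result is pushed towards the Labrador and away from "blue".
image = pipe(
    "Labrador in the style of Vermeer",
    negative_prompt="blue",
).images[0]
```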

Here's something else you can play with: you don't have to pass in just text, you can actually pass in images too. For this, you'll need a different pipeline, an image-to-image pipeline. With the image-to-image pipeline, you can grab a rather sketchy-looking sketch and pass it to the image-to-image pipeline as the initial image to start with. Basically, rather than starting the diffusion process with random noise, it's going to start it with a noisy version of this drawing.

Then it's going to try to create something that matches the caption and also follows this kind of guiding starting point. As a result, you get things that look quite a lot better than the original drawing, but you can see that the composition is the same.

So using this approach, you can construct things that match the particular kind of composition you're looking for, which I think is quite a nifty approach. The parameter strength here says to what degree you want to create something that really looks like the initial image, versus letting the model try out different things a bit.
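
A rough sketch of the image-to-image version, reusing the model repo from before (the sketch filename and prompt are placeholders standing in for the notebook's example, and depending on your Diffusers version the initial-image argument is called image or init_image):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

i2i_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# The starting sketch: diffusion begins from a noised version of this, not from pure noise.
init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))

# strength controls how far the result is allowed to wander from the starting image:
# closer to 1.0 gives the model more freedom, closer to 0 keeps the original composition.
result = i2i_pipe(
    prompt="Wolf howling at the moon, photorealistic 4K",
    image=init_image,            # older Diffusers releases call this init_image
    strength=0.8,
).images[0]
```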

Now here's where things get interesting, and this is the kind of stuff you're not going to be able to do at the moment with just the basic GUIs, but you can if you really know what you're doing. What we could do now is take these output images and say: oh, this one's nice.

Let's make this the initial image, and now we'll say 'an oil painting of ... by Van Gogh', pass in the same thing here with a strength of 1. And that pretty much worked, which I think is absolutely fascinating, because this is something I haven't seen before, which Pedro put together this week, and it's just combining simple Python code. So you can play with that.

Something else you can do (this example actually came from the folks at Lambda Labs, and we won't go into it in detail right now, because it's basically exactly what we've done a thousand times in fastai) is take the models in that pipeline and fine-tune them by passing in your own images and your own captions.

So what these folks did, and I think this was Justin, if I remember correctly: Justin at Lambda created a really cool dataset by grabbing a Pokemon dataset with almost a thousand images of Pokemon. Then, and this is really neat, he used an image captioning model to automatically generate captions for each of those images, and then he fine-tuned the Stable Diffusion model using those image and caption pairs.

Here's an example of one of the captions and one of the images. He then took that fine-tuned model and passed it prompts like 'Girl with a Pearl Earring', 'Cute Obama creature', and 'Totoro', and got back these super nifty images, which now reflect the fine-tuning dataset he used while also responding to those prompts. Fine-tuning like that can take quite a bit of data and quite a bit of time, but you can actually do some special kinds of fine-tuning.

One of them is called textual inversion, which is where we fine-tune just a single embedding. For example, we can create a new embedding where we're trying to make things that look like these example images. What we do is give this concept a name; here we're going to call it 'watercolor portrait', and that's the embedding name we're going to use.

We then basically add that token to the text model, and we train the embedding for it so that it matches the example pictures we've provided. This is going to be much faster, because we're just training a single token's embedding, on just four pictures in this case.

When we've done that, we can then say, for example, 'woman reading in the style of', followed by that token we just trained, and as you see, we get back a kind of novel image, which I think is pretty interesting.
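
For a sense of what "adding a token and giving it an embedding" looks like, here's a sketch following the pattern Hugging Face's textual-inversion notebooks used at the time. The file name and placeholder token are illustrative assumptions, and this only covers loading an already-trained embedding into the pipeline, not the training itself:

```python
import torch

# A trained textual-inversion run produces a tiny file mapping the new token to its embedding.
learned = torch.load("learned_embeds.bin")            # e.g. {"<watercolor-portrait>": tensor([...])}
placeholder_token, embed = next(iter(learned.items()))

# Add the new token to the tokenizer and make room for it in the text encoder's embedding table.
pipe.tokenizer.add_tokens(placeholder_token)
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))

# Copy the learned embedding into the new token's slot.
token_id = pipe.tokenizer.convert_tokens_to_ids(placeholder_token)
pipe.text_encoder.get_input_embeddings().weight.data[token_id] = embed.to(pipe.text_encoder.dtype)

image = pipe(f"Woman reading in the style of {placeholder_token}").images[0]
```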

Another thing, very similar to textual inversion, is something called DreamBooth. As mentioned here, what it does is take an existing token, but one that isn't used much, like 'sks' (almost nothing uses 'sks'), and fine-tune a model to bring that token, as it says here, close to the images we provide.

So what Pedro did here was grab some pictures of me and fine-tune that token to mean Jeremy Howard, then ask for paintings of 'sks' in the style of Paul Signac, and there they are. The example I showed earlier, of the dwarf Jeremy Howard, is from that service that's actually using this DreamBooth approach.

So here's how you can try that yourself. Okay, so that is part one of this lesson: how to get started playing around with Stable Diffusion. In part two, we're going to talk about what's actually going on here from a machine learning point of view.

So we'll come back in about seven minutes to talk about that. All right, see you in about seven minutes.

Okay, welcome back, folks. I just thought I'd share with you one more example of textual inversion training. This is my daughter's teddy, Tiny, who, as you can see, is grossly misnamed.

Pedro and I tried to create a textual inversion version of Tiny, and I was trying to get Tiny riding a horse. It's interesting that when I tried to do that... this top row here is actually Pedro's example when he ran it, showing the steps as he was training, using the caption 'tiny riding a horse'.

As you can see, it never actually ended up generating Tiny riding a horse; instead, it ended up generating a horse that looks a little bit like Tiny. Then we tried to get Tiny sitting on a pink rug, and after a while it actually did make some progress there, though it doesn't quite look like Tiny.

One thing Pedro did differently from me was that he started with the embedding of a person; in mine, I actually started with the embedding for 'teddy', and it worked a bit better. But as you can see, there are problems, and we'll understand where those problems come from as we talk more about how this is trained in the rest of this lesson.

Okay. So I'm going to be relying on some understanding of the basic idea of how machine learning models are trained here. If you start getting a bit lost at any point, you might want to go back to part one, and then come back to this once you're un-lost.

So I need to explain how we're going to start. The way Stable Diffusion is normally explained is focused very much on a particular mathematical derivation. We've been developing a totally new way of thinking about Stable Diffusion, and I'm going to be teaching you that. It's mathematically equivalent to the approach you'll see in other places, but what you'll realize and discover is that it's actually conceptually much simpler.

And later in this course, we'll be showing you some really innovative directions this can take you when you think of it in this new way. All of which is to say: when you listen to this and then go and look at some blog posts and it looks like I'm saying something different, keep in mind that I'm not saying something different; I'm expressing it in a different way, but it's equally mathematically valid.

What I'm going to do is start by imagining that we were trying to get something to generate something much simpler: handwritten digits.

So it's like Stable Diffusion for handwritten digits. We're going to start by assuming there's some API, some web service, out there; who knows how it was made. What it does is something pretty nifty: you can take an image of a handwritten digit and pass it over to this web API, this REST endpoint or whatever. It's just a black box as far as we're concerned, and it's going to spit out the probability that the thing you passed in is a handwritten digit.

So for this one, let's say this image is called X1; the probability that X1 is a handwritten digit, it might say, is 0.98. Then you pass something else into this magic API endpoint, which looks like this; it looks a little bit like an eight, I guess, but it might not be. You pass it in and see what happens: this is X2, and it says the probability that X2 is a digit is 0.4.

Now we pass our image X3 into our magic API, and it returns the probability that X3 is a handwritten digit: pretty small. Okay, so why is this interesting? Well, it turns out that if you have a function (let's not call it an API; let's call it F, some function, sitting behind some web API, REST endpoint, whatever),

if you have this function, we can actually use it to generate handwritten digits. So that's something pretty magical, and we're going to see how on earth you would do that. If you have this function, which can take an image and tell you the probability that it's a handwritten digit, how could you use it to generate new images?

Well, imagine you wanted to turn this mess into something that did look like an image. Here's something you could do. Let's say it's a 28 by 28 image, which is 28 times 28, that is 784 pixels. We could pick one of these pixels and say: what if I made this pixel a little bit darker? Then we could pass that image through F and see what happens to the probability that it's a handwritten digit.

For a specific example: handwritten digits don't normally have any dark pixels in the very bottom corners. So if we took this pixel here and said, what would happen if we made it a little bit lighter, and then passed that exact image through F, the probability would probably go up a tiny bit, for example.

So now we've got an image which is slightly more like a handwritten digit than before. Also, digits generally have straight lines, so it probably makes sense for this pixel here to be darker. If we made a slightly darker version of this pixel and sent it through F, that would also increase the probability a little bit.

And we could do that for every single pixel of the 28 by 28, one at a time, finding out which ones make it more like a handwritten digit if we make them a little bit lighter, and which ones make it more like a handwritten digit if we make them a little bit darker. What we've just done is calculate the gradient of the probability that X3 is a handwritten digit with respect to the pixels of X3.
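
A minimal sketch of that idea (f here stands in for the magic black-box scoring function; everything else is plain PyTorch):

```python
import torch

def finite_difference_grad(f, x, eps=1e-3):
    """Estimate how f(x) changes as each pixel of x is nudged, one pixel at a time."""
    grad = torch.zeros_like(x)
    base = f(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            bumped = x.clone()
            bumped[i, j] += eps                    # make just this one pixel a tiny bit darker
            grad[i, j] = (f(bumped) - base) / eps  # how much the "digit-ness" score moved
    return grad

# For a 28x28 image x3, grad = finite_difference_grad(f, x3) holds 784 values telling us
# which pixels to lighten or darken, at the cost of 784 separate calls to f.
```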

Now notice that I didn't write dP(X3)/dX3, the d notation you might be familiar with from high school. The reason is that we've calculated this for every single pixel, and when you do it for lots of different inputs, you have to turn the d into this thing called the del, or nabla.

It just means there are lots of values here. So this thing here contains lots of values: how much does the probability that X3 is a digit increase as we increase this pixel value, and this one, and this one? For a 28 by 28 input there are 784 pixels, which means this thing here has 784 values.

Okay, I totally messed up the notation there. I did think about going back and re-recording it, but then I thought, maybe instead, as penance for my failure to get the notation right, I should record a little section describing the notation in more detail, both for myself, so I don't make the mistake again, and for the rest of you,

so you understand exactly what's going on. I think it's actually pretty worthwhile, because this notation does come up a lot, and I've been regularly butchering it in talks and notes for years now, so it's about time I got it right. I should mention, I have absolutely no excuse for butchering the notation like I have, given that my friend Terence and I wrote a 30- or 40-page tutorial on matrix calculus, and in that paper he actually described everything I'm going to show you here. Having said that, you certainly don't need to read that rather lengthy tutorial; I'm going to explain the key stuff that I think is worth knowing, and then you'll understand the mistake I made during the lesson.

Maybe let's start with some reminders of stuff that hopefully you did at high school. Let's create a 2D version here, and maybe a 3D version as well. So let's say, for example, we've got a quadratic that looks something like this, and we can say this is an equation such as y = x².

We might endeavor to identify the slope at some exact moment, like here: the slope at this exact moment. This line is called the tangent, and the slope of the tangent is the derivative of the function. So the derivative... let me just try to make this look a bit more like a y than an x; maybe I write it like this.

The derivative of a function can be written in a few ways, but one way is dy/dx. There are some rules we can use to calculate it analytically, and for x squared, the rule is that it's 2x: you basically move the index out to the front to calculate the derivative. Another way of writing y = x² is f(x) = x², and there's a corresponding way of writing the derivative in that notation too. So this is all stuff that hopefully you at least vaguely remember from high school.

Now, functions are not necessarily of just one variable. Here we've got x and y, but a function could take two variables, say x and y. So we could have functions which, for example, look like a 3D parabola with this kind of curvature.

You can still find the derivative, but if you think about it, one way to get a derivative would be exactly what we did before: as we change x, how does z change? That would be this slope again. But you could also ask: as we change y, how would z change? That would be like rotating this whole thing around by 90 degrees and then doing the same thing.

So it's a little bit trickier now, because we've got a function of x and y, and we could calculate the derivative of that with respect to just one of these things, or both of them, or whatever. What we do here is write this little curly ∂. So we can say: this is how our output z changes as we change one thing at a time, in this case just x. There's another value, which is how it changes as we change just y, one thing at a time.
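
Written out, for a surface z = f(x, y) those two quantities are:

$$\frac{\partial z}{\partial x} \qquad \text{and} \qquad \frac{\partial z}{\partial y}$$

each one is the slope you get when only that one input is allowed to move.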

So those are two separate numbers we could calculate at some particular point on this surface, and these things are called partial derivatives. What do partials look like in our case? In our case, we've got a 28 by 28 pixel image, which might be, for example, the number 7 made of 28 by 28 pixels.

So the pixels would be something like this, and then down here. In our case, we've got a situation where we're saying we've got some loss, and our loss is calculated as some function of both some weights in a neural network and some pixel values, such as the pixels in this number 7. And actually, the way it would work is that these would be shaded; this is more like what MNIST looks like. All right, so those are my pixels, aren't they terrible?

Okay, so the loss would be calculated, more specifically, as the mean squared error between the actual answer (which digit it should be, so let's call that y) and the predicted y, which is some neural network applied to some weights and our pixels. That's delving in one layer more deeply, but none of these details really matter too much: what the loss function is, or the neural network, or whatever.
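
For a single example, that's roughly:

$$\mathcal{L} = \bigl(y - f(w, x)\bigr)^2$$

where y is the target (which digit it should be), w is the network's weights, and x is the pixels.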

It's just some function that's calculating a loss, so let's get rid of all that. We can now ask: what happens to the loss as we change x? Because we want to change x in a way that makes the loss go down. But there isn't just one x. I've drawn this as 7 by 7; it's actually meant to be 28 by 28, but let's just do a simple 7 by 7 version here.

We've got seven pixels by seven pixels; this is a super low resolution, 49 pixel image. So there are 49 different things we could change: we could make each of these pixels darker or lighter.

So let's take, for example, a pixel; maybe we can write it like this, pixel one comma one. We could then ask: what happens to the loss as we change pixel (1,1)? So we can calculate a derivative,

and it's going to be a partial derivative: how does the loss change as we change pixel (1,1)? That would be a very useful thing to know, because it would tell us whether we need to make pixel (1,1) a bit brighter or a bit darker in order to improve the loss.

We could also calculate that for pixel (1,2), pixel (1,3), and so forth, for all 49 pixels in this super low resolution digit. So that's the first thing we can calculate. The second one I mentioned would be the partial of the loss with respect to pixel (1,2), which is the slope as we change pixel (1,2). I'm not going to write all of them out, but there will be 49 of these, one for each pixel of the 7 by 7.

Rather than writing out all 49 of them, it's nice to write them all at once, and you can do that like so: you write the upside-down triangle with a subscript x in front of the loss. What that means is a vector of all of these derivatives. This upside-down triangle is called the del, or the nabla, and it's just a convenient notational shortcut to avoid writing them all out. The x here is telling you the thing that we're basically putting on the bottom,
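
Written out in full, that's:

$$\nabla_{x}\,\mathcal{L} \;=\; \Bigl(\frac{\partial \mathcal{L}}{\partial x_{1,1}},\; \frac{\partial \mathcal{L}}{\partial x_{1,2}},\; \ldots,\; \frac{\partial \mathcal{L}}{\partial x_{7,7}}\Bigr)$$

a vector with one partial derivative per pixel: 49 entries for this 7 by 7 example, or 784 for a real 28 by 28 digit.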

that is, what the derivatives are with respect to, the direction we're interested in moving. So that's what I actually should have written in the notes I was making during the lesson. I wrote something else, which was basically the equivalent of putting a nabla on the bottom as well, and that's not a thing at all.

So in my notes, when you see me write that, I actually mean this. Why does my brain get confused, and why do I write this weird thing that doesn't even exist, as far as I know? Well, the reason is that this thing does exist if you turn the triangles the other way up, and hence my brain always gets confused.

If I turn the triangles the other way up, they're now totally different triangles: these ones mean a small change. So this is a small change in the loss divided by a small change in, for example, one particular pixel, and that's a totally valid thing to write. In fact, if you make that small change small enough, then you end up with the derivative; that's what the derivative is. The derivative is just our classic rise over run, the slope we did in, what is it, grade eight or grade nine: rise over run,

once we make our step in x small enough. We can then do that while changing just one variable of a multivariable function. When I say variable, in this case I mean, for example, changing one pixel value in an image to see how that impacts our loss, or we could do the same thing for the weights: we could change one weight.

If you change just one thing at a time and calculate the derivative of the loss against that one thing, you get these things called partials. And if you then do it for all the different things you could change, such as every pixel, you get this whole gradient vector, which we use the upside-down triangle, the nabla, to represent.

And then finally, the right-way-up triangle, the capital delta, simply refers to a small change. So this would be a small change in the loss caused by changing pixel (1,1) by a small amount; and if you use a small enough amount, an infinitesimally small one, we call that the derivative.

Okay, so with all that said, the net result is: every time you see me write that thing in the notes (well, I think I only did it once), please throw it away in your head and replace it with this. And that is the moral of the story.

Okay, so thank you very much for bearing with me as I do my penance and actually get this notation correct this time. I will endeavor not to make the same mistake again during this course, but no promises. All right, back to the lesson.

Okay, so those 784 values tell us how we can change X3 to make it look more like a digit. What we can then do is change the pixels according to this gradient. We can do something a lot like what we do when we train neural networks, except that instead of changing the weights in a model, we're changing the inputs to the model.

So we're going to take every pixel and modify it: we'll subtract a little bit times its gradient. That is, we'll multiply the gradient by some constant, let's call it c, and then subtract that, to get a new image. With the new image, it's probably going to get rid of some of these bits at the bottom,

and it's probably going to add a few more bits in between some of these here. So we've now got something that looks slightly more like a handwritten digit than before, and this is the basic idea. We can now do that again: we take this new image and run it through F.

So we've now got something, let's call it X3 prime, for example. For this new version, X3 prime, the probability that it's a handwritten digit is quite a bit higher; I'd say it's probably around 0.2, maybe. And we can now do the same thing again: we can say, for every pixel, if I increase its value a little bit or decrease its value a little bit, how does that change the probability that this new X3 prime prime is a digit?

So we'll get a new gradient here, 784 values, and we can use it to change every pixel to make the image look a little bit more like a handwritten digit. So as you can see, if we have this magic function, we can use it to turn any arbitrary noisy input into something that looks like a valid input, something that gets a high probability from that function, by using this derivative.

So a key thing to remember here is that we're asking: as I change the input pixels, how does the probability that this is a digit change? And that tells me which pixels to make darker and which to make lighter. Now, those of you who remember your high school calculus may recall that when you do this by changing each pixel one at a time to calculate a derivative, it's called the finite differencing method of calculating derivatives.

And it's very slow, because we'd have to call this function 784 times, once for every single pixel. But we don't have to use finite differencing. Assuming the folks running this magic API endpoint use Python, we can just call .backward(), and then we can get x3.grad, and that will tell us the same thing in one go, using analytic derivatives.

We'll learn exactly what this .backward() does; we'll write everything from scratch, including our own calculus pieces, later. But for now, just as we did in part one of the course, we're going to assume these things exist. So maybe the nice folks that provide this endpoint could actually provide a new endpoint that calls .backward() for us and gives us .grad.

Then we don't really have to use F at all; we can instead just directly call this endpoint that gives us the gradient directly. We'll multiply it by this small constant c, subtract it from the pixels, and do that a few times, making the input get larger and larger probabilities of actually being a digit.
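
A rough PyTorch sketch of that loop (f again stands in for the black-box scorer, and the constant c and step count are arbitrary illustration values). Because f returns a probability we want to increase, stepping along the gradient of f is the same as subtracting the gradient of a loss such as 1 - f(x):

```python
import torch

def make_more_digit_like(f, x, c=0.05, n_steps=10):
    """Nudge the pixels of x so that f(x), the 'is this a digit?' score, goes up."""
    x = x.clone()
    for _ in range(n_steps):
        x.requires_grad_(True)
        p = f(x)                  # probability that x is a handwritten digit
        p.backward()              # analytic gradient in one go, instead of 784 calls to f
        with torch.no_grad():
            x = x + c * x.grad    # step every pixel in the direction that raises p
    return x

# Starting from pure noise and repeating gradually produces something more digit-like.
noisy = torch.rand(28, 28)
```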

So we don't particularly need the thing that calculates the probabilities at all; we only need the thing that tells us which pixels we should change to make the image more like a digit. Okay, so that's great. The problem is, nobody has provided this for us, so we're going to have to write it.

So how are we going to do that? Well, no problem: generally speaking, in this course, when there's some magic black box that we want to exist and it doesn't exist, we create a neural net and we train it. So we want to train a neural net that tells us which pixels to change to make an image look more like a handwritten digit.

Okay, so here's how we can do that. We could create some training data and use that training data to get the information we want. We could pass in something that looks a lot like a handwritten digit, something that looks a bit like a handwritten digit, something that doesn't look very much like a handwritten digit, and something that doesn't really look like a handwritten digit at all.

Now, you'll notice it was very easy for me to create these: I took real handwritten digits and just chucked random noise on top of them. But it's a little bit awkward for us to come up with an exact score saying how much each of these is like a handwritten digit;

it seems a bit arbitrary. So let's not do that; let's use something which is kind of like the opposite. Instead, let's say: why don't we predict how much noise was added?

Because this number 7 is actually equal to this clean number 7 plus this noise, and this number 3 is equal to this clean 3 plus this noise, and this 6 is this clean 6 plus this noise (and that one's got a lot), and of course the very first one is equal to this number 9 plus this noise.

So why don't we generate this data, and then, rather than trying to come up with some arbitrary number for how digit-like something is, let the amount of noise tell us how much like a digit it is? Something with no noise is very much like a digit, and something with lots of noise isn't much like a digit at all.

So let's feed in. Let's create a neural net. Who cares what the architecture is? Right? It's just a neural net of some kind. And this is critical to your understanding of of this course at this point. We're going to go beyond the idea of like worrying all the time about architectures and details.

We're going to get to all those details. But the important thing for using this stuff well is to think about neural nets as being something that has some inputs, some outputs, and some loss function which takes those two.

And then the derivative is used to update the weights. Right? That's really what we care about: those four things. Now the inputs to our model are these noisy digits. The outputs of our model are a measure of how much noise there is. So maybe we could just say: the noise is basically normally distributed random variables, in this case with a mean of zero and a variance of zero, and in this case with a mean of zero and a variance of like 0.1.

This one is normally distributed random pixels, I guess, with a mean of zero and a variance of like 0.3. This one's super noisy. So that's the mean and the variance for each one. So, why don't we, as the output, use the variance? So: predict how much noise.

Or better still, why don't we predict the actual noise itself? So why don't we actually use that? Now we're not just predicting how much noise there is, we're predicting the actual noise. That's our outputs. Now, if we do that, our loss is going to be very simple. It's going to be:

We took the input, we passed it through our neural net, and we tried to predict what the noise was. So the prediction of the noise is N hat and the actual noise is N. And so we can do something we've done a thousand times: take the difference, square it, sum all that up, and divide by the count.

And this here is the mean squared error, which we use all the time. So the mean squared error means that we've now got inputs, which are noisy digits, and outputs, which are the noise. And so this neural network is trying to predict this noise. So we're basically jumping straight to the step

That we had here. Remember, this is what we really wanted. We wanted some ability to know how much do we have to change a pixel by to make it more digit like. Well, to turn this number seven into this number seven. That's our goal. We have to remove all of that.

So if we can predict the noise, then we've got exactly what we want, which is this. We can then do this process: we can take the predicted noise, multiply it by a constant and subtract it from our input. And so if you subtract this noise from this input, you get this handwritten digit.

So we're doing exactly what we wanted. Well, that seems easy enough. We already know from part one how to do this. So we just have any old neural network, some kind of ConvNet or something, that takes as input digits where we've randomly added different amounts of noise: lots of noise to some, not much noise to others.

It predicts what the noise was that we added. We take the loss between the predicted output and the actual noise, the mean squared error, and we use that to update the weights. And so if we train this for a while, then if we pass this into our model, it will return that.
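
Putting those pieces together, a training loop along these lines might look like the sketch below. The tiny fully-connected model and the random "digits" are placeholders purely for illustration; the real thing would be a ConvNet (or the U-Net we'll meet in a moment) trained on actual MNIST.

```python
import torch, torch.nn as nn, torch.nn.functional as F

# Any old model will do for the sketch; the real one would be a ConvNet / U-Net.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 784),
    nn.Unflatten(1, (28, 28)),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):                         # in practice: loop over a DataLoader
    clean = torch.rand(64, 28, 28)              # stand-in for a batch of real digits
    sigmas = torch.rand(64, 1, 1)               # a random noise level per image
    noise = torch.randn_like(clean) * sigmas    # the actual noise we add (the target)
    noisy = clean + noise                       # what the model sees (the input)
    pred = model(noisy)                         # N hat: the predicted noise
    loss = F.mse_loss(pred, noise)              # mean squared error against N
    loss.backward()
    opt.step()
    opt.zero_grad()
```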

And we're done. We now have something that can generate images. How? Because now we can take this trained neural network. So I'm going to copy it down here. And we can pass it something very, very, very noisy, which is pure noise. We pass it to the neural net and it's going to spit out information saying which parts of that does it think are noise.

And it's going to leave behind the bits that look the most like a digit, just like we did back here. So it might say, oh, you know what? If you left behind just that bit, that bit, that bit, that bit, that bit, that bit, that bit, that bit and that bit, it's going to look a little bit more like a digit.

And then maybe you could increase the values of that bit, that bit, that bit, that bit, that bit, that bit and that bit, and everything else is noise. So we subtract those noise bits, times some constant, and we're now going to have something that looks more like a digit, which is what we hoped for.
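
As a rough sketch of that repeated subtract-a-bit-of-the-noise process: the model below is an untrained placeholder (pretend it's the noise predictor we just trained), and the constant c = 0.1 is an arbitrary choice; a real sampler is much more careful about how much it subtracts at each step.

```python
import torch, torch.nn as nn

# Placeholder for an already-trained noise predictor.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 784), nn.Unflatten(1, (28, 28)))

x = torch.randn(1, 28, 28)        # start from pure noise
c = 0.1                           # like a "learning rate", but for pixels
for step in range(50):
    with torch.no_grad():
        pred_noise = model(x)     # which bits does the model think are noise?
    x = x - c * pred_noise        # remove a little of it, keep it "somewhat noisy"
```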

And so then we can just do it again. And you can see now why we are doing this multiple times. Somebody on the chat is saying they don't see me drawing. Oh, you can see. Thanks, Jimmy. Don't know, Michelangelo, what's happening for you. Okay.

And to answer your earlier question about how am I drawing, I'm using a graphics tablet, which I'm not very expert at, because on Windows you can just draw directly on the screen, which is why this is particularly messy. All right. In practice, at the moment, this might change by the time you've watched this, we use a particular type of neural net for this.

The particular kind of neural net we use is something that was developed for medical imaging called the U-Net. If you've done previous versions of the course, you will have seen this, and don't worry, in this course we'll see exactly how a U-Net works and we'll build them ourselves from scratch.

And this is the first component of stable diffusion: the U-Net. Okay, so there are going to be a few pieces. And the details of why they're called these things don't matter too much just yet. Just take my word for it, these are their names, and the thing that you do need to know for each one is: what's the input and what's the output?

So what does the U-Net do? The input to the U-Net is a somewhat noisy image. And when I say somewhat, it could be not noisy at all, or it could be all noise. That's the input. And the output is the noise, such that if we subtract the output from the input, we end up with the unnoisy image, or at least an approximation of it.

So that's the U-Net. Now here's our problem. We have, oh, why do I keep forgetting this? We have 28 times 28 = 784, I should write that down. We have 784 pixels in these things. And that's quite a lot. And it gets worse, because in practice we don't want to draw handwritten digits.

The thing we would be passing in here is beautiful high definition photos, or images of great paintings. And at the moment, the thing we tend to use for that is a 512 by 512 by three channel RGB image. Nice big image: 512 by 512 by 3, red, green, blue.

These are the pixels. So that is 512 by 512 by 3 = 786,432. So we've got 786,432 pixels in here. And so this is, I don't know, some beautiful picture. This is my amazing portrait, Van Gogh style, in a dainty little hat. There we go. So this is the beautiful painting, or an image of it.

That's a lot of pixels. And so training this model where we put noisy versions of millions of these beautiful images is going to take us an awful long time. And, you know, if you're Google with a huge cloud of TPUs or something, maybe that's okay. But for the rest of us, we would like to do this as efficiently as possible.

How could we do this more efficiently? Well, when you think about it, in this beautiful picture I drew, storing the exact value of every single pixel is probably not the most efficient way to store it. You know, what if instead we said, oh, let's say this is like green rushes or something.

It might say like, oh, over here is green and everything kind of underneath. It's pretty much the same. Or, you know, maybe I'm wearing a blue top in this beautiful portrait and it could kind of say like, oh, all the pixels in here are blue. You know, you don't really have to do every one individually.

There are faster, more concise ways of storing what an image is. We know this is true because, for example, a JPEG picture is far fewer bytes than the number of bytes you would get if you multiplied its height by its width by its channels. So we know that it's possible to compress pictures.

So let me show you a really interesting way to compress pictures. Let's take this image and let's put it through a convolutional layer of Stride 2. Now, if we put it through a convolutional layer of Stride 2 with six features, with six channels, we would get back a 256 by 256.

Gosh, that was a terrible attempt at drawing a square, wasn't it? 256 by 256 by, okay, let's double the number of channels to 6, so 256 by 256 by 6. And then let's put it through another stride-2 convolution. And remember, we're going to be seeing exactly how to do all these things and building them all from scratch.

So don't worry if you're not sure what a stride-2 convolution exactly is. We just do it again to get 128 by 128, and again let's double the number of channels, to 12. And then let's do it again, another stride-2 convolution. So we're just building a neural network here. So now we're down to 64 by 64 by 24.

Okay, and now let's put that through a few ResNet blocks to squish down the number of channels as much as we can, so it'll now be down to, let's say, 64 by 64 by 4. Okay, so here's a neural network. And so the number of pixels in this version is now 64 times 64 times 4 = 16,384.

So there are 16,384 pixels here. Okay, so we've compressed it from 786,432 down to 16,384, which is a 48 times decrease. Now that's no use if we've lost our image. So can we get the image back again? Sure, why not?
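
Here's a sketch of just the shape bookkeeping for that encoder. The kernel sizes and the final 1x1 convolution are arbitrary choices for illustration, and a real VAE encoder also has ResNet blocks, nonlinearities, attention and so on; this only shows how the stride-2 convolutions shrink the image while the channels double.

```python
import torch, torch.nn as nn

# Each stride-2 convolution halves height and width while we double the channels,
# then a final 1x1 convolution squishes the channels down to 4.
encoder = nn.Sequential(
    nn.Conv2d(3,  6,  kernel_size=3, stride=2, padding=1),  # 512x512x3 -> 256x256x6
    nn.Conv2d(6,  12, kernel_size=3, stride=2, padding=1),  # -> 128x128x12
    nn.Conv2d(12, 24, kernel_size=3, stride=2, padding=1),  # -> 64x64x24
    nn.Conv2d(24, 4,  kernel_size=1),                       # -> 64x64x4
)
x = torch.rand(1, 3, 512, 512)
print(encoder(x).shape)   # torch.Size([1, 4, 64, 64]); 64 * 64 * 4 = 16,384 values
```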

What if we now create a kind of inverse convolution which does the exact opposite? So actually let's put it over here. We're going to take our 64 by 64 by 4 image and put it through an inverse convolution. So let's keep moving this over further.

Back to 128 by 128 by 12. And put it through another inverse convolution. These are all just neural network layers. 256 by 256 by 6. And then finally bring it all the way back to 512 by 512 by 3. Okay, we could put this whole thing inside a neural net.

Here's our single neural network. And what we could do is we can start feeding in images. Each one goes all the way through this neural network and out of the other end comes back... well, initially it's random. So initially what comes out of this is random noise, 512 by 512. I guess I'll draw it inside here.

So inside here, initially, it's going to give us random noise. And so now we need a loss function, right? So the loss function we could create could be to say: let's take this output and this input, compare them, and do an MSE, mean squared error, directly on those two pieces.

So what would that do if we train this model? This model is going to try to put an image through and going to try to make it so that what comes out the other end is the exact same thing that went into it. That's what it's going to try to do.
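A minimal version of that whole idea might look like the sketch below. Again, the layer choices are placeholders (no ResNet blocks, no nonlinearities) and the random "images" just stand in for a real training batch, but it shows the pattern: the image goes in, the same image should come out, and the loss is the MSE between the two.

```python
import torch, torch.nn as nn, torch.nn.functional as F

encoder = nn.Sequential(
    nn.Conv2d(3, 6, 3, stride=2, padding=1),    # 512 -> 256
    nn.Conv2d(6, 12, 3, stride=2, padding=1),   # 256 -> 128
    nn.Conv2d(12, 24, 3, stride=2, padding=1),  # 128 -> 64
    nn.Conv2d(24, 4, 1),                        # -> 64x64x4
)
decoder = nn.Sequential(                        # mirror image, with "inverse" convolutions
    nn.Conv2d(4, 24, 1),
    nn.ConvTranspose2d(24, 12, 4, stride=2, padding=1),  # 64 -> 128
    nn.ConvTranspose2d(12, 6, 4, stride=2, padding=1),   # 128 -> 256
    nn.ConvTranspose2d(6, 3, 4, stride=2, padding=1),    # 256 -> 512
)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

img = torch.rand(8, 3, 512, 512)        # stand-in for a batch of real images
recon = decoder(encoder(img))           # try to give back exactly what came in
loss = F.mse_loss(recon, img)           # zero loss means a perfect reconstruction
loss.backward()
opt.step()
opt.zero_grad()
```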

Because if it does that successfully then the mean squared error would be zero. So I see some people in the chat saying that this is a U-Net. This is not a U-Net. Okay, we'll get to that later. There are no cross connections. It's just a bunch of convolutions that decrease the size followed by a bunch of convolutions that increase the size.

And so we're going to try to train this model to spit out exactly what it received in. And that seems really boring. What's the point of a model that only learns to give you back exactly what came in? Well, this is actually extremely interesting. This kind of model is called an autoencoder.

It's something that gives you back what you gave it. And the reason an autoencoder is interesting is because we can split it in half. Let's grab just this bit. Okay, let's cut it up, grab just that bit, and then we'll get a second half. Okay, they're not quite halves, but you know what I mean, which is just this bit.

And so let's say I take this image and I put it through just this first half, this green half, which is called the encoder. Okay, I can take this thing that comes out of it and I could save it. And the thing that I'm going to save is going to be 16,384 bytes.

I started with something that was 48 times bigger than that, 786,432 bytes. And I've turned it into something that's 16,384 bytes. I could now attach that to an email, say or whatever. And I've now got something that's 48 times smaller than my original picture. So what's going to happen?

The person who receives these 16,384 bytes? Well, as long as they have a copy of the decoder on their computer, they can feed those bytes into the decoder and get back the original image. So what we've just done is we've created a compression algorithm. That's pretty amazing, isn't it?

And in fact, these compression algorithms work extremely, extremely well. Notice that we didn't train this on just this one image. We've trained it on say millions and millions of images, right? And then so you and I both need to have a copy of these two neural nets. But now we can share thousands of pictures that we send each other by sending just the 16,384 byte version.

So we've created a very powerful compression algorithm. And so maybe you can see where this is going. If this here is something which contains all of the interesting and useful information of the image in 16,384 bytes, why on earth would we train our UNET with 786,432 pixels of information?

And the answer is we wouldn't. That would be stupid. Instead, we're going to do this entire thing using our encoded version of each picture. So if we want to train this U-Net on 10 million pictures, we put all 10 million pictures through the autoencoder's encoder. So we've now got 10 million of these smaller things, and then we feed them into the U-Net hundreds or thousands of times to train our U-Net.

And so what will that U-Net now do? Something slightly different to what we described. It no longer takes a somewhat noisy image; instead it takes a somewhat noisy one of these. So it probably helps to give this thing a name, and the name we give it is latents.

These are called the latents. Okay, so now the input is somewhat noisy latents. The output is still the noise. And so we can now subtract the noise from the somewhat noisy latents, and that gives us the actual latents. And so we can then take those denoised latents and pass them into our autoencoder's decoder.

Because that's something which takes latents and turns them into a picture. So the input to this is a small latents tensor, and the output is a large image. Okay. Now this thing here is not just going to be called an autoencoder; it's going to have the name the VAE.

And we'll learn about why later. Those details aren't too important, but let's put its correct name here: the VAE's decoder. So you're only going to need the encoder of the VAE if you're training a U-Net. If you want to just do inference, like we did today, you're only going to need the decoder of the VAE.
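
So the inference-time flow, once everything lives in latent space, looks roughly like the sketch below. Here unet and vae_decoder are placeholder lambdas standing in for the real trained components (for example, the ones inside the Hugging Face pipeline we played with earlier), and the single constant c is a simplification of what the sampler actually does.

```python
import torch

# Placeholders for the trained stable-diffusion components.
unet = lambda latents: torch.zeros_like(latents)           # input: noisy latents, output: predicted noise
vae_decoder = lambda latents: torch.rand(1, 3, 512, 512)   # input: latents, output: 512x512x3 image

latents = torch.randn(1, 4, 64, 64)     # start from pure noise, but in latent space
c = 0.1
for step in range(50):
    noise_pred = unet(latents)          # which bits of the latents look like noise?
    latents = latents - c * noise_pred  # somewhat-less-noisy latents
image = vae_decoder(latents)            # small latents tensor in, big image out
```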

So this whole thing of latents is entirely optional, right? This thing we described before works fine. But you know, generally speaking, we would rather not use more compute than necessary. So unless you're trying to sell the world a roomful of TPUs, you would probably rather everybody was doing stuff on the thing that's 48 times smaller.

So the VAE is optional, but it saves us a whole lot of time and a whole lot of money. So that's good. Okay, what's next? Well, there's something else, which is that in the first half of today's lesson we weren't just saying, "produce me an image." We were saying, "produce me an image of Tiny the teddy bear riding a horse." So how does that bit work?

So the way that bit works is actually on the whole pretty straightforward. Let's think about how we could do exactly that for our MNIST example. How could we get this so that rather than just feeding in noise and getting back some digit, how do we get it to give us a particular digit?

What if we wanted to pass in the literal number 3 plus some noise and have it attempt to generate a handwritten 3 for us? How would we do that? Well, what we could do, way back here at the input to this model, is in addition to passing in the noisy input, also pass in a one-hot-encoded version of what digit it is.

So we're now passing two things into this model. Previously this neural net took as input just the pixels, but now it's going to take in the pixels and which digit it is, as a one-hot-encoded vector. So now it's going to learn how to predict what the noise is, and it's going to have some extra information: it's going to know what the original image was.
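
To make that concrete, here's a sketch of what "pass in the pixels and the one-hot digit" could look like. The throwaway MLP and the names are purely illustrative; the only point is the two inputs being concatenated before the model predicts the noise.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class ConditionedNoisePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # 784 noisy pixels plus a 10-element one-hot vector for the digit.
        self.net = nn.Sequential(nn.Linear(784 + 10, 256), nn.ReLU(), nn.Linear(256, 784))

    def forward(self, noisy, digit):
        one_hot = F.one_hot(digit, num_classes=10).float()    # e.g. digit 3 -> position 3 is 1
        inp = torch.cat([noisy.flatten(1), one_hot], dim=1)   # pixels and label, concatenated
        return self.net(inp).view(-1, 28, 28)                 # predicted noise

model = ConditionedNoisePredictor()
noisy = torch.randn(16, 28, 28)
digits = torch.randint(0, 10, (16,))
pred_noise = model(noisy, digits)
```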

So we would expect this model to be better at predicting noise than the previous one because we're giving it more information. This was a 3, this was a 6, this was a 7. So this neural net is going to learn to estimate noise better by taking advantage of the fact that it knows what the actual input was.

And why is that useful? Well, the reason that's useful is because now, when we feed in the number 3, the actual digit 3 as a one-hot-encoded vector, plus noise, after this has been trained, then our model is going to say the noise is everything that doesn't represent the number 3, because that's what it's learned to do.

Right? So that's a pretty straightforward way to give it, the word we use is guidance about what it is that we're actually trying to remove the noise from. And so then we can use that guidance to guide it as to what image we're trying to create. So that's the basic idea.

Now the problem is if we want to create a picture of a cute teddy bear, we've got a problem. It was easy enough to pass the digit 8, the literal number 8 into our neural net, because we can just create a one hot encoded vector in which position number 8 is a 1 and everything else is a 0.

But how do we do that for a cute teddy? We can't. We can't create every possible sentence that can be uttered in the whole world and then create a one-hot-encoded version of every sentence in the world, because that's going to take a vector that is too long, to say the least.

So we have to do something else to turn this into an embedding, something other than grabbing a one-hot-encoded version of it. So what do we do? What we're going to do is try to create a model that can take a sentence like "a cute teddy" and return a vector of numbers that in some way represents what cute teddies look like.

And the way we're going to do that is we're first going to surf the internet and download images. So here are four examples of images that I found on the internet. And so for each of these images they had an image tag next to them, right? And if people are being good, then they also added an alt tag to help with accessibility and maybe for SEO purposes and they probably said things like a graceful swan.

And the alt tag for this might have been a scene from Hitchcock's The Birds. And the alt tag for this might have been Jeremy Howard. And the alt tag for this might have been fast.ai's logo. And we could do that for millions and millions and millions of images that we find on the internet.

So what we can now do with these is we can create two models. One model which is a text encoder and one model which is an image encoder. Okay, so again, these are neural nets. We don't care about what their architectures are or whatever. We know that they're just black boxes which contain weights, which means they need inputs and outputs and a loss function and then they'll do something.

Once we've defined inputs and outputs and a loss function, the neural nets will then do something. So here's a really interesting idea. What if we take this image and what if we then also take the text a graceful swan. Okay, and we're going to feed these into their respective models, which initially they of course have random weights.

And that means that they're going to spit out random features, a vector of stuff, random crap, because we haven't trained them yet, okay? And we can do the same thing with the scene from Hitchcock. We pass the scene from Hitchcock in, we pass in the words "a scene from Hitchcock", and they'll give us two other vectors.

Right, and so we can do something really interesting now. We can line these up. Guess we'll just move them. We can line these up. Okay, here's all of our images. Okay, and then we can have, oopsie dozy, okay, and then we can have our text. So we've got graceful swan.

We've got Hitchcock. We've got Jeremy Howard and we've got the fastai logo. Now ideally, when we pass the graceful swan through our image model, what we'd like is that it creates a set of embeddings that are a good match for the text "graceful swan". When we pass the scene from Hitchcock through our image model, we would like it to return embeddings which are similar to the embeddings for the text "a scene from Hitchcock".

And ditto for the picture of Jeremy Howard versus the name Jeremy Howard, and ditto for the fastai logo image and the words "fastai logo". So in other words, for this particular combination here, we would like this one's features and this one's features to be similar.

So how do we tell if two sets of things, two vectors, are similar? Well, what we can do is simply multiply them together, element-wise, and add them up. And this thing is called the dot product. And so we could take the features from the image model for this one and the features from the text model for the words "graceful swan", and take their dot product.

And we want that number to be nice and big. And the scene-from-Hitchcock image's features should be very similar to the features for the text "a scene from Hitchcock", so we want their dot product to be nice and big, and ditto for everything on this diagonal. Now, on the other hand, a graceful swan picture should not have embeddings that are similar to the text "a scene from Hitchcock".

So that should be nice and small, and ditto for everything else off the diagonal. And so perhaps you can see where this is going: if we add up all of these along the diagonal, and then subtract all of these off-diagonal ones, we have a loss function. And so if we want this loss function to be good, then we're going to need the weights of our text encoder to spit out embeddings that are very similar to the features of the images they're paired with.
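
In code, that diagonal-big, off-diagonal-small idea might look something like this. Real CLIP normalises the features, uses a learned temperature and a symmetric cross-entropy loss, but the core quantity is the same grid of dot products; the random feature vectors below just stand in for the outputs of the two encoders.

```python
import torch

# One row per image (swan, Hitchcock, ...) and one row per caption, in the same order.
img_feats = torch.randn(4, 512)
txt_feats = torch.randn(4, 512)

sims = img_feats @ txt_feats.T     # 4x4 grid of dot products
diagonal = sims.diag().sum()       # matching image/text pairs: we want these big
off_diag = sims.sum() - diagonal   # mismatched pairs: we want these small
loss = off_diag - diagonal         # minimising this pushes the grid the right way
```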

And we need them to spit out features for things that they are not paired with which are not similar. And so if we can do that, then we're going to end up with a text encoder where we can feed in things like "a graceful swan", "some beautiful swan", "such a lovely swan", and these should all give very similar embeddings, because these would all represent very similar pictures.

And so what we've now done is we've successfully created two models that together put text and images into the same space. So we've got this multimodal set of models, which is exactly what we wanted. So now we can take our "cute teddy bear", feed it in here, get out some features, and that is what we will use instead of those one-hot-encoded vectors when we train our photo-or-painting U-Net.

And then we can do exactly the same thing with guidance. We can now pass in the text encoder's feature vector for "a cute teddy" and it is going to turn the noise into something that is similar to things that it's previously seen that are cute teddies. So the model that's used here, or the pair of models, is called CLIP.

This thing where we want these to be bigger and these to be smaller is called a contrastive loss. And now you know where the CL comes from. So here we have a CLIP text encoder. Its input is some text. Its output is what we call an embedding. It's just some features.

Oops, an embedding, where text with similar meanings will give us similar embeddings. Okay, because we need a bit more space... we're nearly done. Okay, so we've got the U-Net that can denoise latents into un-noisy latents, including starting from pure noise. We've got a decoder that can take latents and create an image.

We've got a text encoder which allows us to train a U-Net which is guided by captions. So the last thing we need is the question of how exactly we do this inference process here. So how exactly, once we've got something that gives us the gradients we want, and by the way these gradients are often called the score function, just in case you come across that.

That's all that's referring to. So how exactly do we go about this process? Unfortunately the language used around this is weird and confusing, and so ideally you will learn to ignore the fact that it's weird and confusing. In particular, the language you'll see a lot talks about time steps.

And you'll notice that during our training process we never used any concept of time steps. This is basically an overhang from the particular way in which the math was formulated in the first papers. There are lots of other ways we can formulate it and during the course on the whole we will avoid using the term time steps.

But we can, to see what time steps are, even though it's got nothing to do with time in real life, consider the fact that we used varying levels of noise. Some things were very noisy, some things were only a little bit noisy, some things had no noise, and some, which I haven't drawn here, would have been pure noise.

You could basically create a kind of noising schedule, where along here you put, say, the numbers from 1 to 1,000, and we'll call this t. Maybe we randomly pick a number from 1 to 1,000 and then we look it up on this noise schedule, which will be some monotonically decreasing function. So let's say we happen to randomly pick the number 4: we'd look up here to find where that is, look over here, and this would return to us some sigma, which is the amount of noise to use if you happen to get a 4.

So if you happen to get a 1 you're going to get a whole lot of noise, and if you happen to get a 1,000 you're going to have hardly any noise. So this is one way of picking; remember, when we were training we were going to pick a random amount of noise for every image.

So this would be one way to do that: pick a random number from 1 to 1,000, look it up on this function, and that tells us how much noise to use. So this t is what people refer to as the time step. Nowadays you don't really have to do that much, and a lot of people are starting to get rid of this idea altogether; some people instead will simply say how much noise was there.

Normally we would think of using sigma for standard deviations of Gaussians or normal distributions, but actually much more common is to use the Greek letter beta. And so if you see something talking about beta, they're just saying oh for that particular image when it was being trained what standard deviation of noise was being used basically?

Slightly hand-wavy, but close enough. And so each time you're going to create a mini batch to pass into your model, you randomly pick an image from your training set, and you randomly pick either an amount of noise or, for some models, a t which you then look up to get an amount of noise. Then you use that amount of noise for each one, you pass that mini batch into your model to train it, and that updates the weights in your model so it can learn to predict noise.
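
For example, a noise schedule and the training-time lookup could be sketched like this. The particular straight-line schedule below is made up purely for illustration; real schedules use carefully chosen curves, but the idea of "pick a random t, look up its noise level" is the same.

```python
import torch

# A noise schedule: t runs from 1 to 1,000 and maps, through some monotonically
# decreasing function, to a noise level sigma. t=1 means lots of noise,
# t=1,000 means almost none.
T = 1000
t = torch.arange(1, T + 1)
sigma = 1.0 - (t - 1) / (T - 1)

# Training-time sampling: pick a random t for each image in the mini batch,
# then look up how much noise to add to that image.
batch_t = torch.randint(1, T + 1, (64,))
batch_sigma = sigma[batch_t - 1]
```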

And so then when you come to inference time, so inference is when you're generating a picture from pure noise, you want your model, basically your model is now starting here, right, which is as much noise as possible. And so you want it to learn to remove noise, but what it does in practice, as we saw in our notebook, is it actually creates some hideous and rather random kind of thing.

So in fact, let's remind ourselves what that looked like. This is what it created when we tried to do it in one step. So remember what we then do is we say, okay, what's the prediction of the noise, and then we multiply that prediction of the noise, as I said, by some constant.

It's kind of like a learning rate, but we're not updating weights now, we're updating pixels, and we subtract it from the pixels. So it didn't actually predict the image; what it actually did was predict what the noise is, so that we could then subtract that from the noisy image to give us the denoised image.

And so what we do is we don't actually subtract all of it, we multiply that by a constant. And we get a somewhat noisy image. The reason we don't jump all the way to the best image we can find is because things that look like this never appeared in our training set.

And so since it never appeared in our training set, our model has no idea what to do with it. Our model only knows how to deal with things that look like somewhat noisy latents. So that's why we subtract just a bit of the noise so that we still have a somewhat noisy latent.

So this process repeats a bunch of times, and questions like: what do we use for c? And how do we go from the prediction of the noise to the thing that we subtract? These are the kinds of things that you decide in the actual sampler.

And that's used both to think about like how do I add the noise, and how do I subtract the noise? And there's a few things that might be jumping into your head at this point if you're anything like me. And one is that gosh, this looks an awful lot like deep learning optimizers.

So in a deep learning optimizer this constant is called the learning rate. And we have some neat tricks where we say for example, oh if you change the same parameters by a similar amount multiple times in multiple steps, maybe you should increase the amount you change them. This concept is something we call momentum.

And we'll be doing all this from scratch during the course, don't worry. And in fact, we even got better ways of doing that, where we kind of say well what about what happens if the variance changes? Maybe we can look at that as well, and that gives us something called Adam.

And these are types of optimizer. And so maybe you might be wondering could we use these kinds of tricks? And the answer based on our very early research is yes. Yes, we can. The whole world of like where stable diffusion and all these diffusion based models came from, came from a very different world of maths, which is the world of differential equations.

And there's a whole lot of very parallel concepts in the world of differential equations, which is really all about taking these little steps, little steps, little steps, and trying to figure out how to take bigger steps. And so differential equation solvers use a lot of the same kinds of ideas, if you squint, as optimizers.

One thing that differential equation solvers do, which is kind of interesting though, is that they tend to take t as an input. And in fact, pretty much all diffusion models, I've actually lied, pretty much all diffusion models don't just take the input pixels and the digit or the caption or the prompt, they also take t.

And the idea is that the model will be better at removing the noise if you tell it how much noise there is. And remember, this is related to how much noise there is. I very strongly suspect that this premise is incorrect. Because if you think about it, for a complicated fancy neural net, figuring out how noisy something is, is very, very straightforward.

So I very much doubt we actually need to pass in T. And as soon as you stop doing that, things stop looking like differential equations and they start looking more like optimizers. And so actually, Jono started playing with this and experimenting a bit. And early results suggest that, yeah, actually, when we re-think about the whole thing as being about learning rates and optimizers, maybe it actually works a bit better.

In fact, there's all kinds of things we could do. Once we stop thinking about them as differential equations, and don't worry about the math so much, about Gaussians and whatever, we can really switch things around. So, for example, we decided for no particular obvious reason to use MSE.

Well, the truth is, in statistics and machine learning, almost every time you see somebody use MSE, it's because the math worked out better that way. Not as in it's a better thing to do, but as in, you know, it was kind of easier. Now, MSE does fall out quite nicely as being a good thing to do under some particular premises.

You know, like it's not like totally arbitrary, but what if we instead used more sophisticated loss functions where we actually said, well, you know, after we subtract the outputs, how good is this really? Does it look like a digit or does it have the similar qualities to a digit?

So we'll learn about this stuff, but there's things called, for example, perceptual loss. Or another question is, do we really need to do this thing where we actually put noise back at all? Could we instead use this directly? These are all things that suddenly become possible when we start thinking of this as an optimization problem rather than a differential equation solving problem.

So for those of you who are interested in kind of doing novel research, this is some of the kind of stuff that we are starting to research at the moment. And the early results are extremely positive, both in terms of how quickly we can do things and what kind of outputs we seem to be getting.

OK, so I think that's probably a good place to stop. So what we're going to do in the next lesson is finish our journey into this notebook and see some of the code behind the scenes of what's in a pipeline. So we'll look inside the pipeline and see exactly what's going on behind the scenes a bit more, in terms of the code.

And then we're going to do a huge rewind, back to the foundations, and we're going to build up from some very tricky ground rules. Our ground rules will be: we're only allowed to use pure Python, the Python standard library and nothing else, and we build up from there until we have recreated all of this, and possibly some new research directions at the same time.

So that's our goal. And so strap in and see you all next time. See ya.