
🤗 Hugging Face just released *Diffusers* - for models like DALL-E 2 and Imagen!


Chapters

0:00 What are Diffusers?
1:55 Getting started
4:20 Prompt engineering
9:34 Testing other diffusers

Whisper Transcript

00:00:00.000 | Today we're going to have a look at a new library from Hugging Face called Diffusers
00:00:05.280 | and it covers what are called diffuser models. Now, diffuser models, if you don't know what they
00:00:11.600 | are, you've probably heard of a few of them already. OpenAI's DALL-E 2 is a diffuser model,
00:00:18.880 | as are Google's Imagen and Midjourney's image generation model. There are a lot of these diffuser
00:00:28.240 | models and they are producing pretty insane things. If you know what GANs are you can kind
00:00:35.600 | of think of them as doing that same thing of generating something. In the case of the three
00:00:46.240 | models I just mentioned it's generating images but they can also generate other things as well.
00:00:51.440 | They can generate audio, video and I imagine probably a lot more. So they are pretty cool
00:00:58.320 | and also pretty new. But already pretty much everyone knows about DALL-E 2 and the other diffuser
00:01:11.280 | models as well. They're already making a big impact. So Hugging Face's decision to create
00:01:18.400 | a library to support this in a more open source way, in contrast to the OpenAI, Google and Midjourney
00:01:27.360 | approach where they have everything behind closed doors, which, fair enough, they're probably massive
00:01:31.520 | models and most of us can't run them anyway. This decision by Hugging Face I think is a pretty good
00:01:39.040 | one. It means that normal people like me and you can actually start playing around with these
00:01:46.320 | models which is really cool. So let's have a look at what we can do with the first version
00:01:52.960 | of this library. So to install the library we just pip install diffusers. You may also need to
00:02:03.120 | add transformers onto the end there. I'm not sure, I already had Transformers installed so it might
00:02:08.880 | not be necessary. Let's just have a look at this first example. So I've taken this example
00:02:16.000 | from the GitHub repo, modified it a little bit and played around with the prompts.
00:02:21.760 | But all we're doing is taking this CompVis text-to-image model, so this is a diffusion model,
00:02:30.560 | a diffuser, and we're just creating this diffusion pipeline using this model ID.
00:02:36.640 | And the example they gave was something like a painting of a squirrel eating maybe a mango or
00:02:44.720 | something along those lines, I don't remember. So I just swapped in a banana to see what happens.
00:02:50.960 | And we get this pretty cartoony image here of a, I mean I suppose it's not necessarily eating the
00:02:58.240 | banana but it's definitely a squirrel with a banana. So with just a few lines of code here,
00:03:04.240 | we got that which is pretty cool. So yeah, already something to be pretty impressed with.
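For reference, a minimal sketch of that first example, assuming the CompVis latent-diffusion checkpoint used in the repo's example and noting that the exact return format of the pipeline call has changed between diffusers releases:

```python
# pip install diffusers transformers
# Minimal text-to-image sketch based on the example above.
# Assumptions: the "CompVis/ldm-text2im-large-256" model ID and the
# .images return attribute; very early diffusers releases returned a dict instead.
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256")

prompt = "a painting of a squirrel eating a banana"
result = pipe(prompt, num_inference_steps=50)

image = result.images[0]  # a PIL.Image
image.save("squirrel_banana.png")
```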
00:03:14.240 | So that's cool, but I would say I wanted to play around with these prompts and just see what it can
00:03:23.760 | do. Now just to be very clear, this is a very small and basic model, so we're not going to get
00:03:32.640 | DALL-E 2 quality images from this or Imagen quality images from this, which is to be expected.
00:03:42.080 | But nonetheless, we can play around with it. I ran all this
00:03:47.520 | on my MacBook Pro. So yeah, it didn't even take long. It may be a minute or a couple of minutes
00:03:54.720 | for each image at most. So with this one, it's the same prompt again. All I did was change
00:04:02.000 | the number of inference steps. So I played around with this a little bit. I'm not familiar with
00:04:07.520 | diffuser models, so I don't know the best approach to dealing with different parameters here,
00:04:15.040 | but number of inference steps, you can increase or decrease that. And then what I was probably
00:04:21.600 | more interested in here was the prompt engineering side of things. So how can I modify the prompt to
00:04:28.160 | make it more like what I want it to be? And I've seen a lot of people doing things that, well, I'll show you in a
00:04:35.040 | moment. But I started off with this photorealistic image of a squirrel eating a banana. And yeah,
00:04:40.800 | you can see straight away, like there's a bit more detail in this image. The banana and the
00:04:45.520 | squirrel kind of like melded together here. It's a bit weird, but it's a kind of more realistic
00:04:53.360 | image. And I noticed there's also this reflection down here from the banana, which is kind of cool.
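The calls behind these experiments boil down to something like the following sketch, reusing the pipe from the first example; num_inference_steps is the diffusers parameter being changed here, and the specific values are just illustrative:

```python
# Same pipeline as before; only the prompt wording and the number of
# denoising steps change between runs. Step values here are illustrative.
prompt = "photorealistic image of a squirrel eating a banana"

# Fewer steps is faster but noisier; more steps usually adds detail.
for steps in (10, 50, 100):
    image = pipe(prompt, num_inference_steps=steps).images[0]
    image.save(f"squirrel_{steps}_steps.png")
```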
00:05:02.320 | And yeah, with this prompt engineering stuff, I see people adding things in 4K to try and make
00:05:07.440 | something higher quality. Now this is with DALL-E 2, so it's not necessarily the same with this model,
00:05:17.680 | but I thought I'd give it a go anyway. We get this. And this is like, it seems to be trying
00:05:22.800 | to pull in more detail around here. I wouldn't say it necessarily works, but it's kind of cool.
00:05:31.520 | It's kind of weird with these two eyes staring at us here, but nonetheless, interesting.
00:05:38.400 | So this, I thought I'd go away from the squirrel examples and go for an Italian person eating pizza
00:05:46.000 | on top of the Colosseum in Rome. And we get this, which is pretty good actually. It's not on top of
00:05:52.720 | the Colosseum, but I thought this was pretty cool anyway. It looks a lot like the Colosseum
00:06:01.920 | and he looks relatively Italian. Some interesting sunglasses here as well. And the pizza is a bit
00:06:07.600 | strange, but yeah, pretty cool. Here, photorealistic image of a squirrel eating a banana
00:06:15.840 | rendered in Unity. This is taking inspiration from what I've seen people do with OpenAI's
00:06:21.200 | DALL-E here. And yeah, we get this. I assume they must've trained on a lot of stock images because
00:06:27.520 | I got this a few times. You can kind of see the overlay, like the watermark from the stock image
00:06:34.320 | here. And then down here, you usually have some information
00:06:40.320 | about the stock image at the bottom or the company that owns the stock image. So I thought that was
00:06:45.440 | kind of funny that that comes through. It came through on a fair few images.
00:06:50.800 | Not a lot, but a few. I thought I'd try Unreal Engine as well. Yeah, similar sort of thing.
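The prompt engineering experiments above amount to appending different style suffixes to the same base prompt, roughly like this (a sketch, again reusing the pipe from the first example; the suffixes are the ones tried in the video):

```python
# Trying the same subject with different "style" suffixes, as discussed above.
base = "photorealistic image of a squirrel eating a banana"
suffixes = ["", " in 4K", " rendered in Unity", " rendered in Unreal Engine"]

for i, suffix in enumerate(suffixes):
    image = pipe(base + suffix, num_inference_steps=50).images[0]
    image.save(f"squirrel_variant_{i}.png")
```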
00:06:58.400 | So one thing that I have noticed with this is how it handles prompts it can't have seen before.
00:07:05.440 | Here I put a giant squirrel destroying a city, which is a pretty abstract thing. There's
00:07:12.160 | probably not any pictures of that that it has seen in the past. And I think when there are no
00:07:17.920 | pictures of something that it's seen in the past, it struggles to put the two concepts together,
00:07:22.880 | at least this model does. So yeah, I kind of got this. I think this is the best image and I tried a lot
00:07:30.640 | of prompts with this. So we've got the squirrel here and he's on some kind of blocks. It looks
00:07:35.280 | a bit like a city. That one, okay, it's not so bad. And then there was this one as well. It kind
00:07:41.600 | of looks like a city in the background. I thought that was pretty cool. But then the remaining,
00:07:47.680 | the remaining ones here are just like a squirrel or a squirrel thing in a natural environment.
00:07:54.880 | Or this one, so here I'm playing around with the inference steps. I just turned it right down
00:08:01.760 | and you just kind of get this interesting result. Again, I don't know anything about diffusion models,
00:08:08.560 | so I don't know why it's doing that. I assume it's just, well, I know that it's generating the
00:08:16.880 | image from noise. So I assume this needs a few inference steps in order to take that noise
00:08:26.080 | and create an image out of it. So at this point, it's still kind of stuck halfway in between the
00:08:32.720 | noise and an actual image, I assume. Yeah, no, we kept going through these, just a squirrel,
00:08:40.720 | squirrel again. And then I eventually got this image of a squirrel thing inside what seems to
00:08:49.360 | be a city. Yeah, and then I got a picture of an actual city. So it seems, yeah, it knows what a
00:08:57.280 | squirrel is, it knows what a city is, but putting those two things together, particularly a giant
00:09:02.720 | squirrel, doesn't seem to do so well with that. Here I modified it a little bit, so a landscape
00:09:10.720 | image. And then the same prompt, slightly different number of inference steps. No, no,
00:09:18.720 | actually the same. So I just re-ran this same prompt twice, same parameters, and I just got
00:09:23.680 | like a weird squirrel again. Kind of cool with this grass detail in front here, though. Anyway,
00:09:32.080 | yeah, that's that one example.
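One way to make runs like this repeatable, so that the same prompt and parameters give the same image, is to pass a seeded generator. This is a hedged sketch, assuming the pipeline accepts the generator argument that most diffusers pipelines take:

```python
import torch

# Without a fixed seed, the same prompt and parameters start from different
# random noise each time, which is why two identical runs gave different squirrels.
generator = torch.Generator().manual_seed(42)

prompt = "landscape image of a giant squirrel destroying a city"
image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
image.save("giant_squirrel_city.png")
```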
00:09:39.120 | I wanted to have another go with some other models as well. These are less impressive, I'll be honest. But they released
00:09:49.040 | this library literally like a week ago, maybe, at the time of me recording this. So it's really new. And
00:09:56.000 | there's all these different pipelines, by the way. As far as I understand, this pipeline uses
00:10:01.600 | a different thing called a scheduler, which is like the algorithm that denoises
00:10:09.840 | the input noise that the image is generated from. Don't quote me on that. I think that is
00:10:17.680 | what these different pipelines are doing, but I'm really not sure. It's something along those lines.
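To make the scheduler idea concrete, here's a hedged sketch of swapping the scheduler on the pipeline from earlier; the from_config pattern is the usual diffusers convention, though not every checkpoint works well with every scheduler:

```python
from diffusers import DDIMScheduler

# A pipeline bundles a trained model with a scheduler (the denoising algorithm).
# Swapping the scheduler changes how the noise is removed, not the model weights.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe("a painting of a squirrel eating a banana", num_inference_steps=50).images[0]
```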
00:10:26.560 | Okay. So with this one, this is like a Pokemon model trained by this guy here. I think he's called
00:10:36.640 | Manuel or something. He has a load of models that he's built through Hugging Face. And this is
00:10:45.840 | supposed to be a Pokemon image generation model. So, yeah, I mean, you can kind of see it definitely
00:10:55.680 | has that Pokemon style to it. But I wouldn't say it's a good Pokemon. It's
00:11:06.560 | kind of messed up. And then I thought I'd try the other pipeline. So I read on the Hugging Face
00:11:16.240 | Diffusers GitHub repo that, and I don't know if this is completely accurate or not,
00:11:22.880 | but this DDIM algorithm is supposed to be faster, while the DDPM pipeline is slower but more accurate, or
00:11:40.000 | can produce better quality images. But that's not always true.
00:11:47.920 | And in a lot of cases, I imagine these different models have been trained with one of these
00:11:55.760 | algorithms in mind and you can't always, for some models you can, but you can't always switch
00:12:03.120 | between the two algorithms and get them to work with a particular model. That's how I've understood
00:12:08.960 | it. Again, could be wrong. I don't know. But DDIM is definitely faster. The DDPM run took me like two minutes
00:12:14.480 | and 30 seconds; that one runs for a thousand inference steps by default. I'll show you, we can change
00:12:21.120 | that in a moment. The same here with DDIM was seven seconds, but then we just got noise. So I thought, okay,
00:12:27.040 | maybe it just needs to be run for more steps. The DDIM runs, by the way, were 50 inference
00:12:35.200 | steps by default, not a thousand. And the same for this one here. And then here I turned that up
00:12:42.160 | to a thousand just to see if it would produce anything, because I expected that maybe if I
00:12:48.400 | ran it for more inference steps, it might do something. But no, unfortunately not. So
00:12:54.880 | clearly we can't use different algorithms, or denoising algorithms, maybe that's
00:13:03.040 | what they're called, I don't know, with that model. So then I thought, okay, let's try a
00:13:09.520 | different model. This is from Google. Google have put out a load of these diffusion models in the
00:13:15.040 | last couple of days on Hugging Face, which is pretty cool. And obviously with this one, you need to
00:13:19.920 | use a DDPM pipeline; you can see it in the model name or model ID. And yeah, this one takes ages to run,
00:13:29.120 | at least on my M1 Mac without MPS. It was about 25 minutes, really long. And then I got this. So
00:13:38.560 | it was a bit disappointing, but you can kind of see there's a sort of a cat face in there.
00:13:43.520 | There's definitely the feeling of a cat in this image. I wouldn't say that is actually a cat in
00:13:50.000 | this image though. And the same, this one, you can kind of see a cat over here. Yeah. Interesting.
00:13:59.120 | And there's, again, the feeling of a cat in this image, but there's definitely not a cat.
00:14:04.320 | Yeah. Maybe there's a cat, but it doesn't look very healthy.
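A rough sketch of this unconditional generation, assuming one of Google's DDPM checkpoints such as google/ddpm-cat-256 (the exact model ID isn't stated in the video, so treat it as an assumption):

```python
from diffusers import DDPMPipeline

# Unconditional generation: no text prompt, the model just samples an image
# of whatever it was trained on (cats, in this case).
model_id = "google/ddpm-cat-256"  # assumed checkpoint; any DDPM model ID works here
ddpm = DDPMPipeline.from_pretrained(model_id)

# DDPM defaults to a large number of denoising steps, which is why it is slow on CPU.
image = ddpm(num_inference_steps=1000).images[0]
image.save("ddpm_cat.png")
```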
00:14:09.920 | And then the last thing I just had a look at here is this config. So the models or pipelines here
00:14:19.200 | are set up using this configuration dictionary, and you can modify different parts
00:14:27.760 | of your pipeline by changing this config and loading your pipeline with a different config.
00:14:33.600 | It's similar to the configs in the Transformers library, where you have something like a BERT config and you load
00:14:40.080 | that into your BERT model. I assume they're going for a similar sort of thing here. So I thought I'd
00:14:46.240 | just point that out.
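As a hedged illustration of that config idea, reusing the pipeline from earlier (the config attribute and from_config method follow the usual diffusers pattern, but check the current docs for specifics):

```python
# Pipelines and schedulers expose their configuration as a dictionary-like object.
print(pipe.scheduler.config)  # e.g. number of train timesteps, beta schedule, ...

# You can rebuild a component from a (possibly modified) config,
# much like loading a BertConfig into a BERT model in Transformers.
from diffusers import DDPMScheduler
scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
```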
00:14:53.520 | But yeah, it's literally very early days with this library. I haven't really been through anything in depth; I've just taken a very high-level look at the library and played around
00:15:01.360 | with a few image generation pieces. So I hope this has been interesting to see. I'm pretty excited
00:15:08.240 | for this library. I definitely want to play with it a lot more in the future. Very soon,
00:15:13.920 | I'm also going to be having a look at some other diffuser models. So hopefully I will
00:15:18.080 | understand them better in the very near future. So thank you very much for watching. I hope this
00:15:28.000 | video has been interesting and I will see you again in the next one. Bye.