🤗 Hugging Face just released *Diffusers* - for models like DALL-E 2 and Imagen!
Chapters
0:00 What are Diffusers?
1:55 Getting started
4:20 Prompt engineering
9:34 Testing other diffusers
00:00:00.000 |
Today we're going to have a look at a new library from Hugging Face called Diffusers 00:00:05.280 |
and it covers what are called Diffuser Models. Now Diffuser Models if you don't know what they 00:00:11.600 |
are, you've probably heard of a few of them already. OpenAI's DALL-E 2 is a Diffuser Model, 00:00:18.880 |
and so are Google's Imagen and Midjourney's image generation model. There are a lot of these Diffuser 00:00:28.240 |
Models and they are producing pretty insane things. If you know what GANs are you can kind 00:00:35.600 |
of think of them as doing that same thing of generating something. In the case of the three 00:00:46.240 |
models I just mentioned it's generating images but they can also generate other things as well. 00:00:51.440 |
They can generate audio, video and I imagine probably a lot more. So they are pretty cool 00:00:58.320 |
and also pretty new. But already pretty much everyone knows about DALL-E 2 and the other Diffuser 00:01:11.280 |
Models as well. They're already making a big impact. So Hugging Face decided to create 00:01:18.400 |
a library to support this in a more open-source way, in contrast to the OpenAI, Google and Midjourney 00:01:27.360 |
approach where everything is kept behind closed doors (which, fair enough, they're probably massive 00:01:31.520 |
models and most of us couldn't run them anyway). This decision by Hugging Face is, I think, a pretty good 00:01:39.040 |
one. It means that normal people like me and you can actually start playing around with these 00:01:46.320 |
models which is really cool. So let's have a look at what we can do with the first version 00:01:52.960 |
of this library. So to install the library we just pip install diffusers. You may also need to 00:02:03.120 |
add Transformers onto the end there. I'm not sure, I already had Transformers installed so it might 00:02:08.880 |
not be necessary. Let's just have a look at this first example. So I've taken this example 00:02:16.000 |
from the GitHub repo, modified it a little bit and played around with the prompts. 00:02:21.760 |
But all we're doing is taking this CompVis text-to-image model, so this is a diffusion model, 00:02:30.560 |
Diffuser, and we're just creating this Diffusion pipeline using this model ID. 00:02:36.640 |
And the example they gave was something like a painting of a squirrel eating maybe a mango or 00:02:44.720 |
something along those lines, I don't remember. So I just went with a banana to see what happens. 00:02:50.960 |
And we get this pretty cartoony image here of a, I mean I suppose it's not necessarily eating the 00:02:58.240 |
banana but it's definitely a squirrel with a banana. So with just a few lines of code here, 00:03:04.240 |
we got that which is pretty cool. So yeah, already something to be pretty impressed with. 00:03:14.240 |
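For reference, those few lines of code look roughly like this. It's a sketch based on the repo's example rather than an exact copy: the model ID is the CompVis latent diffusion text-to-image checkpoint, and the way you pull the image out of the pipeline output has changed a bit between Diffusers versions.

```python
# pip install diffusers transformers
from diffusers import DiffusionPipeline

# CompVis latent-diffusion text-to-image checkpoint (the model used in the repo example)
model_id = "CompVis/ldm-text2im-large-256"
pipe = DiffusionPipeline.from_pretrained(model_id)

prompt = "A painting of a squirrel eating a banana"
# Recent versions expose .images on the output; very early releases returned a dict instead
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("squirrel_banana.png")
```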
So that's cool, but I would say I wanted to play around with these prompts and just see what it can 00:03:23.760 |
do. Now just to be very clear, this is a very small and basic model, so we're not going to get 00:03:32.640 |
DALL-E 2 standard images from this or Imagen standard images from this, which is to be expected. 00:03:42.080 |
But nonetheless, we can play around with it. I ran all this 00:03:47.520 |
on my MacBook Pro. So yeah, it didn't even take long. It may be a minute or a couple of minutes 00:03:54.720 |
for each image at most. So with this one, it's the same prompt again. All I did was change 00:04:02.000 |
the number of inference steps. So I played around with this a little bit. I'm not familiar with 00:04:07.520 |
diffuser models, so I don't know the best approach to dealing with different parameters here, 00:04:15.040 |
but number of inference steps, you can increase or decrease that. And then what I was probably 00:04:21.600 |
more interested in here was the prompt engineering side of things. So how can I modify the prompt to 00:04:28.160 |
make it more like what I want it to be? And I see a lot of people saying, well, I'll show you in a 00:04:35.040 |
moment. But I started off with this photorealistic image of a squirrel eating a banana. And yeah, 00:04:40.800 |
you can see straight away, like there's a bit more detail in this image. The banana and the 00:04:45.520 |
squirrel kind of like melded together here. It's a bit weird, but it's a kind of more realistic 00:04:53.360 |
image. And I noticed there's also this reflection down here from the banana, which is kind of cool. 00:05:02.320 |
And yeah, with this prompt engineering stuff, I see people adding things like "in 4K" to try and make 00:05:07.440 |
something higher quality. Now this is with DALL-E 2, so it's not necessarily the same with this model, 00:05:17.680 |
but I thought I'd give it a go anyway. We get this. And this is like, it seems to be trying 00:05:22.800 |
to pull in more detail around here. I wouldn't say it necessarily works, but it's kind of cool. 00:05:31.520 |
It's kind of weird with these two eyes staring at us here, but nonetheless, interesting. 00:05:38.400 |
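The prompt and parameter tweaking described here amounts to something like the loop below, with the same pipeline as before and only the wording and the number of inference steps varying. The exact prompts and step counts are just illustrative.

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256")

# Keep the subject fixed, vary the phrasing and the number of denoising steps
prompts = [
    "A photorealistic image of a squirrel eating a banana",
    "A photorealistic image of a squirrel eating a banana, 4K",
]
for steps in (20, 50):
    for i, prompt in enumerate(prompts):
        image = pipe(prompt, num_inference_steps=steps).images[0]
        image.save(f"squirrel_prompt{i}_steps{steps}.png")
```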
For this one, I thought I'd move away from the squirrel examples and go for an Italian person eating pizza 00:05:46.000 |
on top of the Colosseum in Rome. And we get this, which is pretty good actually. It's not on top of 00:05:52.720 |
the Colosseum, but I thought this was pretty cool anyway. It looks a lot like the Colosseum 00:06:01.920 |
and he looks relatively Italian. Some interesting sunglasses here as well. And the pizza is a bit 00:06:07.600 |
strange, but yeah, pretty cool. Here, photorealistic image of a squirrel eating a banana 00:06:15.840 |
rendered in Unity. This is taking inspiration from what I've seen people do with OpenAI's 00:06:21.200 |
DALL-E here. And yeah, we get this. I assume they must have trained on a lot of stock images because 00:06:27.520 |
I got this a few times. You can kind of see the overlay, like the watermark from the stock image 00:06:34.320 |
here. And then down here, you usually have some information 00:06:40.320 |
about the stock image at the bottom, or the company that owns the stock image. So I thought that was 00:06:45.440 |
kind of funny that that comes through. It came through on a fair few images. 00:06:50.800 |
Not a lot, but a few. I thought I'd try Unreal Engine as well. Yeah, similar sort of thing. 00:06:58.400 |
So one thing that I have noticed with this: 00:07:05.440 |
here I put a giant squirrel destroying a city, which is a pretty abstract thing. There are 00:07:12.160 |
probably not any pictures of that that it has seen in the past. And I think when there are no 00:07:17.920 |
pictures of something that it's seen in the past, it struggles to put two concepts together, 00:07:22.880 |
at least this model. So yeah, I kind of got this. I think this is the best image and I tried a lot 00:07:30.640 |
of prompts with this. So we've got the squirrel here and he's on some kind of blocks. It looks 00:07:35.280 |
a bit like a city. That one, okay, it's not so bad. And then there was this one as well. It kind 00:07:41.600 |
of looks like a city in the background. I thought that was pretty cool. But then the remaining, 00:07:47.680 |
the remaining ones here are just like a squirrel or a squirrel thing in a natural environment. 00:07:54.880 |
Or this one, so here I'm playing around with the inference steps. I just turned it right down 00:08:01.760 |
and you just kind of get this interesting result. Again, I don't know anything about diffusion models, 00:08:08.560 |
so I don't know why it's doing that. I assume it's just, well, I know that it's generating the 00:08:16.880 |
image from noise. So I assume this needs a few inference steps in order to take that noise 00:08:26.080 |
and create an image out of it. So at this point, it's still kind of stuck halfway in between the 00:08:32.720 |
noise and an actual image, I assume. Yeah, no, we kept going through these, just a squirrel, 00:08:40.720 |
squirrel again. And then I eventually got this image of a squirrel thing inside what seems to 00:08:49.360 |
be a city. Yeah, and then I got a picture of an actual city. So it seems, yeah, it knows what a 00:08:57.280 |
squirrel is, it knows what a city is, but putting those two things together, particularly a giant 00:09:02.720 |
squirrel, doesn't seem to do so well with that. Here I modified it a little bit, so a landscape 00:09:10.720 |
image. And then the same prompt, slightly different number of inference steps. No, no, 00:09:18.720 |
actually the same. So I just re-ran this same prompt twice, same parameters, and I just got 00:09:23.680 |
like a weird squirrel again. Kind of cool with this grass detail in front here, though. Anyway, 00:09:32.080 |
yeah, that's that one example. I wanted to have another go with some other models as well. 00:09:39.120 |
These are less impressive, I'll be honest. But literally, I feel like they released 00:09:49.040 |
this library like a week ago, maybe, at the time of me recording this. So it's really new. And 00:09:56.000 |
there's all these different pipelines, by the way. As far as I understand, each pipeline uses 00:10:01.600 |
a different scheduler, which is like the algorithm that denoises 00:10:09.840 |
the input noise that the image is generated from. Don't quote me on that. I think that is 00:10:17.680 |
what these different pipelines are doing, but I'm really not sure. It's something along those lines. 00:10:26.560 |
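To make that concrete, here is a rough sketch of what a pipeline bundles and how you might swap the scheduler (the denoising algorithm). The checkpoint is one of the Google DDPM models that comes up later in this video, used purely as an illustration, and whether a swapped-in scheduler actually works well depends on the model.

```python
from diffusers import DDPMPipeline, DDIMPipeline, DDIMScheduler

# A pipeline bundles a trained UNet plus a scheduler (the denoising algorithm)
ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256")
print(ddpm.scheduler)  # shows which scheduler this pipeline was loaded with

# Re-use the same trained UNet with a DDIM scheduler, which can sample in fewer steps
ddim = DDIMPipeline(
    unet=ddpm.unet,
    scheduler=DDIMScheduler.from_config(ddpm.scheduler.config),
)
image = ddim(num_inference_steps=50).images[0]
image.save("ddim_sample.png")
```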
Okay. So with this one, this is a Pokemon model trained by this guy here. I think he's called 00:10:36.640 |
Manuel or something. He has a load of models that he's built through Hugging Face. And this is 00:10:45.840 |
supposed to be a Pokemon image generation model. So, yeah, I mean, you can kind of see it definitely 00:10:55.680 |
has that Pokemon style to it. But I wouldn't say it's a good Pokemon. It's 00:11:06.560 |
kind of messed up. And then I thought I'd try the other pipeline. So I read on the Hugging Face 00:11:16.240 |
Diffusers GitHub repo that, and I don't know if this is completely accurate or not, 00:11:22.880 |
but this DDIM algorithm is supposed to be slower. Yes, slower, but more accurate, or 00:11:40.000 |
can produce better quality images than the DDPM pipeline. But that's not always true. 00:11:47.920 |
And in a lot of cases, I imagine these different models have been trained with one of these 00:11:55.760 |
algorithms in mind and you can't always, for some models you can, but you can't always switch 00:12:03.120 |
between the two algorithms and get them to work with a particular model. That's how I've understood 00:12:08.960 |
it. Again, could be wrong. I don't know. But it's definitely faster. This took me like two minutes 00:12:14.480 |
and 30 seconds. This is running for a thousand inference steps by default. I'll show you, we can change 00:12:21.120 |
that in a moment. The same here was seven seconds, but then we just got noise. So I thought, okay, 00:12:27.040 |
maybe it just needs to be run for more steps. Actually, sorry, this was run for 50 inference 00:12:35.200 |
steps by default, not a thousand. And the same for this one here. And then here I turn that up 00:12:42.160 |
to a thousand just to see if it would produce anything, because I expected that maybe if I 00:12:48.400 |
run it for more inference steps, it might do something. But no, unfortunately not. So 00:12:54.880 |
clearly we can't use a different algorithm, or denoising algorithm, maybe that's 00:13:03.040 |
what they're called, with that model. So then I thought, okay, let's try a 00:13:09.520 |
different model. This is from Google. Google have put out a load of these diffusion models in the 00:13:15.040 |
last couple of days on Hugging Face, which is pretty cool. And obviously for this one, you need to 00:13:19.920 |
use a DDPM pipeline; you can see it in the model name, or model ID. And yeah, this one takes ages to run, 00:13:29.120 |
at least on my M1 Mac without MPS. It was about 25 minutes, really long. And then I got this. So 00:13:38.560 |
it was a bit disappointing, but you can kind of see there's a sort of a cat face in there. 00:13:43.520 |
There's definitely the feeling of a cat in this image. I wouldn't say that is actually a cat in 00:13:50.000 |
this image though. And the same with this one, you can kind of see a cat over here. Yeah. Interesting. 00:13:59.120 |
And there's, again, the feeling of a cat in this image, but there's definitely not a cat. 00:14:04.320 |
Yeah. Maybe there's a cat, but it doesn't look very healthy. 00:14:09.920 |
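For completeness, loading that Google cat model looks something like this. The 25 minutes or so mentioned above was on CPU; moving the pipeline to Apple's MPS backend may help, but support depends on your PyTorch and Diffusers versions, so treat this as a sketch.

```python
import torch
from diffusers import DDPMPipeline

# Unconditional DDPM model: no text prompt, it just generates cat-like images
pipe = DDPMPipeline.from_pretrained("google/ddpm-cat-256")

# Optionally move to Apple's MPS backend if available (otherwise this runs on CPU)
if torch.backends.mps.is_available():
    pipe = pipe.to("mps")

image = pipe(num_inference_steps=1000).images[0]  # DDPM defaults to 1000 steps
image.save("ddpm_cat.png")
```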
And then the last thing I had a look at here is this config. So the models or pipelines here, 00:14:19.200 |
they are set up using this configuration dictionary and you can modify different parts 00:14:27.760 |
of your pipeline by changing this config and loading your pipeline with a different config. 00:14:33.600 |
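A quick way to see what that config holds is just to print it. The exact keys depend on the pipeline class and library version, so again this is only a sketch.

```python
from diffusers import DDPMPipeline

pipe = DDPMPipeline.from_pretrained("google/ddpm-cat-256")

# The pipeline's config lists the components it was assembled from,
# and each component (for example the scheduler) carries its own config too
print(pipe.config)
print(pipe.scheduler.config)
```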
Similar to the configs in the Transformers library where you have like a BERT config and you load 00:14:40.080 |
that into your BERT model. I assume they're going for a similar sort of thing here. So I thought I'd 00:14:46.240 |
just point that out. But yeah, it's literally very early days with this library. I haven't really 00:14:53.520 |
been through anything. I've just taken a very high-level look at the library and played around 00:15:01.360 |
with a few image generation pieces. So I hope this was interesting to see. I'm pretty excited 00:15:08.240 |
for this library. I definitely want to play with it a lot more in the future. Very soon, 00:15:13.920 |
I'm also going to be having a look at some other diffuser models. So hopefully I will 00:15:18.080 |
understand them better in the very near future. So thank you very much for watching. I hope this 00:15:28.000 |
video has been interesting and I will see you again in the next one. Bye.