
🤗 Hugging Face just released *Diffusers* - for models like DALL-E 2 and Imagen!


Chapters

0:00 What are Diffusers?
1:55 Getting started
4:20 Prompt engineering
9:34 Testing other diffusers

Transcript

Today we're going to have a look at a new library from Hugging Face called Diffusers, which covers what are called diffusion models. Now, if you don't know what diffusion models are, you've probably heard of a few of them already: OpenAI's DALL-E 2 is a diffusion model, as are Google's Imagen and Midjourney's image generation model.

There are a lot of these Diffuser Models and they are producing pretty insane things. If you know what GANs are you can kind of think of them as doing that same thing of generating something. In the case of the three models I just mentioned it's generating images but they can also generate other things as well.

They can generate audio, video and I imagine probably a lot more. So they are pretty cool and also pretty new, but already pretty much everyone knows about DALL-E 2 and the other diffusion models; they're already making a big impact. Hugging Face has decided to create a library to support this in a more open source way, in contrast to the OpenAI, Google and Midjourney approach where everything is behind closed doors - which, fair enough, they're probably massive models and most of us couldn't run them anyway.

I think this decision by Hugging Face is a pretty good one. It means that normal people like me and you can actually start playing around with these models, which is really cool. So let's have a look at what we can do with the first version of this library. To install it, we just pip install diffusers.

You may also need to install transformers alongside it - I'm not sure, I already had transformers installed, so it might not be necessary. Let's have a look at this first example. I've taken it from the GitHub repo, modified it a little bit and played around with the prompts.

All we're doing is taking this CompVis text-to-image model - so this is a diffusion model - and creating a diffusion pipeline from its model ID. The example they gave was something like a painting of a squirrel eating maybe a mango or something along those lines, I don't remember.

So I just went with a banana to see what happens, and we get this pretty cartoony image of - well, I suppose it's not necessarily eating the banana, but it's definitely a squirrel with a banana. So with just a few lines of code, we got that, which is pretty cool.
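The core of that example looks roughly like the sketch below. The model ID and the guidance_scale value are assumptions based on the example in the Diffusers repo at the time, and the output format can differ a little between library versions, so treat this as a sketch rather than the exact code.

```python
# pip install diffusers transformers

from diffusers import DiffusionPipeline

# model ID assumed from the Diffusers README example at the time;
# double-check the exact name on the Hugging Face Hub
model_id = "CompVis/ldm-text2im-large-256"
pipe = DiffusionPipeline.from_pretrained(model_id)

prompt = "a painting of a squirrel eating a banana"

# run the denoising loop and save the generated image
out = pipe([prompt], num_inference_steps=50, guidance_scale=6)
image = out.images[0]  # very early versions returned a dict under "sample"
image.save("squirrel-banana.png")
```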

So yeah, already something to be pretty impressed with. That's cool, but I wanted to play around with these prompts and just see what it can do. Now, just to be very clear, this is a very small and basic model, so we're not going to get DALL-E 2 or Imagen standard images from it, which is to be expected.

But nonetheless, we can play around with it and run it ourselves - I ran all of this on my MacBook Pro, and it didn't even take long, maybe a minute or a couple of minutes for each image at most. So with this one, it's the same prompt again.

All I did was change the number of inference steps. I played around with this a little bit - I'm not that familiar with diffusion models, so I don't know the best approach to these different parameters, but the number of inference steps is one you can increase or decrease. What I was probably more interested in here, though, was the prompt engineering side of things.

So how can I modify the prompt to make it more like what I want it to be? I see a lot of people doing this sort of thing - I'll show you what I mean in a moment. I started off with a photorealistic image of a squirrel eating a banana, and you can see straight away that there's a bit more detail in this image.

The banana and the squirrel kind of like melded together here. It's a bit weird, but it's a kind of more realistic image. And I noticed there's also this reflection down here from the banana, which is kind of cool. And yeah, with this prompt engineering stuff, I see people adding things in 4K to try and make something higher quality.

Now that's usually with DALL-E 2, so it's not necessarily the same with this model, but I thought I'd give it a go anyway, and we get this. It seems to be trying to pull in more detail around here. I wouldn't say it necessarily works, but it's kind of cool.

It's kind of weird with these two eyes staring at us here, but nonetheless, interesting. Next I thought I'd go away from the squirrel examples and go for an Italian person eating pizza on top of the Colosseum in Rome. And we get this, which is pretty good actually. It's not on top of the Colosseum, but I thought this was pretty cool anyway.

It looks a lot like the Colosseum and he looks relatively Italian - some interesting sunglasses here as well. The pizza is a bit strange, but yeah, pretty cool. Here: photorealistic image of a squirrel eating a banana rendered in Unity. This is taking inspiration from what I've seen people do with OpenAI's DALL-E.

And yeah, we get this. I assume they must have trained on a lot of stock images, because I got this a few times - you can kind of see the overlay, like the watermark from a stock image, here. And down here, where you'd usually have some information about the stock image or the company that owns it, that comes through as well.

I thought it was kind of funny that that comes through. It came through on a fair few images - not a lot, but a few. I tried Unreal Engine as well, and got a similar sort of thing. One thing that I have noticed is what happens with a prompt like a giant squirrel destroying a city, which is a pretty abstract thing.

There are probably no pictures of that that it has seen in the past, and when it hasn't seen pictures of something, it seems to struggle to put the two concepts together - at least this model does. So I kind of got this. I think this is the best image, and I tried a lot of prompts with this.

So we've got the squirrel here and he's on some kind of blocks. It looks a bit like a city. That one, okay, it's not so bad. And then there was this one as well. It kind of looks like a city in the background. I thought that was pretty cool.

But the remaining ones here are just a squirrel, or a squirrel-like thing, in a natural environment. Or this one - here I'm playing around with the inference steps. I turned them right down and you just get this interesting result. Again, I don't know much about diffusion models, so I don't know exactly why it's doing that.

Well, I do know that it's generating the image from noise, so I assume it needs a certain number of inference steps to take that noise and create an image out of it. At this point, it's still stuck somewhere halfway between the noise and an actual image, I assume.
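For reference, this is easy to see by sweeping the step count on the same prompt; a minimal sketch, assuming the same pipe object from earlier:

```python
prompt = "a giant squirrel destroying a city"

# very low step counts leave the output close to the starting noise;
# higher step counts give the denoising loop more time to form an image
for steps in [5, 25, 50, 100]:
    image = pipe([prompt], num_inference_steps=steps, guidance_scale=6).images[0]
    image.save(f"squirrel-city-{steps}-steps.png")
```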

Yeah, we kept going through these - just a squirrel, then a squirrel again - and eventually I got this image of a squirrel-like thing inside what seems to be a city. And then I got a picture of an actual city. So it knows what a squirrel is and it knows what a city is, but putting those two things together, particularly a giant squirrel, is something it doesn't seem to do so well.

Here I modified it a little bit to a landscape image, with the same prompt and a slightly different number of inference steps - no, actually the same. So I just re-ran the same prompt twice with the same parameters, and I just got a weird squirrel again. The grass detail in front here is kind of cool, though.
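For reference, all of those prompt experiments are just different strings passed to the same pipeline; a minimal sketch of looping over them, again assuming the pipe object from earlier:

```python
prompts = [
    "a painting of a squirrel eating a banana",
    "photorealistic image of a squirrel eating a banana",
    "photorealistic image of a squirrel eating a banana in 4K",
    "photorealistic image of a squirrel eating a banana rendered in Unity",
    "an Italian person eating pizza on top of the Colosseum in Rome",
    "a giant squirrel destroying a city",
]

# one image per prompt with the same settings so the results are comparable
for i, prompt in enumerate(prompts):
    image = pipe([prompt], num_inference_steps=50, guidance_scale=6).images[0]
    image.save(f"prompt-{i}.png")
```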

Anyway, that's that one example. I wanted to have a go with some other models as well. These are less impressive, I'll be honest, but they literally released this library about a week ago at the time of recording, so it's really new.

And there are all these different pipelines, by the way. As far as I understand, each pipeline uses something called a scheduler, which is like the algorithm that denoises the input noise the image is generated from. Don't quote me on that - I think that's what these different pipelines are doing, but I'm really not sure.

It's something along those lines anyway. Okay, so this one is a Pokémon model trained by this guy here - I think he's called Manuel or something - who has a load of models that he's built and shared through Hugging Face. It's supposed to be a Pokémon image generation model, and you can kind of see it definitely has that Pokémon style to it.

But I wouldn't say it's a good Pokémon - it's kind of messed up. Then I thought I'd try the other pipeline. I read on the Hugging Face Diffusers GitHub repo - and I don't know if this is completely accurate or not - that this DDIM algorithm is supposed to be slower.

Yes, slower, but more accurate - or at least able to produce better quality images than the DDPM pipeline. But that's not always true, and in a lot of cases I imagine these models have been trained with one of these algorithms in mind, so you can't always switch between the two and get them to work with a particular model. For some models you can, but not always.

That's how I've understood it - again, I could be wrong, I don't know. But it's definitely faster here. This one took me about two and a half minutes, running for a thousand inference steps by default - I'll show you how we can change that in a moment. The same thing here took seven seconds, but then we just got noise.

So I thought, okay, maybe it just needs to be run for more steps. Actually, sorry, this one was run for 50 inference steps by default, not a thousand, and the same for this one here. So here I turned it up to a thousand just to see if it would produce anything, because I expected that running more inference steps might do something.
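For reference, that experiment looks roughly like the sketch below. I'm using the unconditional DDPMPipeline and DDIMPipeline classes from Diffusers and a placeholder model ID for the Pokémon checkpoint, since I don't have the exact name here - so treat the IDs and numbers as assumptions.

```python
from diffusers import DDPMPipeline, DDIMPipeline

# placeholder standing in for the Pokémon checkpoint used above
model_id = "<pokemon-diffusion-checkpoint>"

# DDPM pipeline - presumably the denoising algorithm the model was trained with
ddpm = DDPMPipeline.from_pretrained(model_id)
pokemon_ddpm = ddpm(num_inference_steps=50).images[0]

# DDIM pipeline - same checkpoint, different denoising algorithm,
# with the step count turned right up to see if it produces anything
ddim = DDIMPipeline.from_pretrained(model_id)
pokemon_ddim = ddim(num_inference_steps=1000).images[0]
```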

But no, unfortunately not. So clearly we can't use a different denoising algorithm - if that's even what they're called, I don't know - with that model. So then I thought, okay, let's try a different model. This one is from Google; they've put out a load of these diffusion models on Hugging Face in the last couple of days, which is pretty cool.
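Loading one of those looks roughly like this. I'm assuming google/ddpm-cat-256 as the model ID, which I believe is one of the unconditional DDPM checkpoints Google pushed, but double-check the exact name on the Hub:

```python
from diffusers import DDPMPipeline

# assumed model ID for one of Google's unconditional DDPM checkpoints
ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256")

# unconditional generation: no prompt, the model just samples an image
cat = ddpm(num_inference_steps=1000).images[0]
cat.save("ddpm-cat.png")
```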

You need to use the DDPM pipeline for this one - you can see that in the model name, or model ID. And it takes ages to run, at least on my M1 Mac without MPS: about 25 minutes, really long. Then I got this. It was a bit disappointing, but you can kind of see there's a sort of a cat face in there.

There's definitely the feeling of a cat in this image, but I wouldn't say there's actually a cat in it. And it's the same with this one - you can kind of see a cat over here. Interesting. Again, there's the feeling of a cat in this image, but definitely not a cat.

Well, maybe there's a cat, but it doesn't look very healthy. The last thing I had a look at is this config. The pipelines here are set up using a configuration dictionary, and you can modify different parts of your pipeline by changing this config and loading your pipeline with a different config.

It's similar to the configs in the Transformers library, where you have something like a BERT config that you load into your BERT model - I assume they're going for a similar sort of thing here, so I thought I'd just point it out. But yeah, it's literally very early days with this library.
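As a rough sketch of what I mean: pipelines and schedulers both carry a config you can inspect, and in more recent Diffusers versions you can rebuild a component from a config - treat the from_config swap below as an assumption for this early release.

```python
# inspect the configuration a pipeline and its scheduler were built from
print(pipe.config)
print(pipe.scheduler.config)

# in newer Diffusers versions you can rebuild a component from a config,
# e.g. swapping in a DDIM scheduler while keeping the existing settings
from diffusers import DDIMScheduler
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
```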

I haven't really been through anything. I've just taken a very high-level look at the library and played around with a few image generation pieces. So I hope this is interesting to see. I'm pretty excited for this library. I definitely want to play with it a lot more in the future.

Very soon, I'm also going to be having a look at some other diffuser models. So hopefully I will understand them better in the very near future. So thank you very much for watching. I hope this video has been interesting and I will see you again in the next one.

Bye.