AI Engineering for Art - with comfyanonymous

Chapters
0:00 Introduction of hosts and anonymous guest
0:35 Origins of ComfyUI and early Stable Diffusion landscape
2:58 Comfy's background and development of high-res fix
5:37 Area conditioning and compositing in image generation
7:20 Discussion on different AI image models (SD, Flux, etc.)
11:10 Closed source model APIs and community discussions on SD versions
14:41 LoRAs and textual inversion in image generation
18:43 Evaluation methods in the Comfy community
20:05 CLIP models and text encoders in image generation
23:05 Prompt weighting and negative prompting
26:22 ComfyUI's unique features and design choices
31:00 Memory management in ComfyUI
33:50 GPU market share and compatibility issues
35:40 Node design and parameter settings in ComfyUI
38:44 Custom nodes and community contributions
41:40 Video generation models and capabilities
44:47 ComfyUI's development timeline and rise to popularity
48:13 Current state of ComfyUI team and future plans
50:11 Discussion on other Comfy startups and potential text generation support
- Hey everyone, welcome to the Latent Space Podcast. 00:00:06.880 |
This is Alessio, partner and CTO at Decibel Partners, 00:00:09.500 |
and I'm joined by my co-host swyx, founder of Smol.ai. 00:00:12.440 |
Hey everyone, we are in the Chroma Studio again, 00:00:24.320 |
- Yeah, well a lot of people just call me Comfy, 00:00:26.880 |
even though, even when they know my real name. 00:00:37.760 |
That people know you by, and then you have a legal name. 00:00:44.200 |
know that Comfy is like the tool for image generation 00:00:52.080 |
the star of the show was Automatic1111, right? 00:00:55.520 |
And I actually looked back at my notes from 2022-ish, 00:00:59.520 |
like Comfy was already getting started back then, 00:01:02.760 |
and like your main feature was the flowchart. 00:01:04.520 |
Can you just kind of rewind to that moment that year 00:01:07.720 |
and like, you know, how you looked at the landscape there 00:01:11.160 |
- Yeah, I discovered Stable Diffusion in 2022, 00:01:17.200 |
And well, I kind of started playing around with it. 00:01:32.240 |
I had no idea like how diffusion models work, 00:01:36.960 |
- Oh yeah, what was your prior background as an engineer? 00:01:44.320 |
- But like any image stuff, any orchestration, 00:01:50.040 |
- No, I was doing basically nothing interesting. 00:02:08.640 |
- Yeah, but like already some interest in automations, 00:02:26.280 |
like what's the best way to represent the diffusion process 00:02:48.440 |
I hadn't written a line of PyTorch before that. 00:03:09.880 |
the high-res fix is just, since the diffusion models 00:03:14.120 |
back then could only generate at low resolution. 00:03:30.640 |
I really liked generating like higher-resolution images. 00:03:40.920 |
Okay, what happens if I use different samplers 00:03:49.920 |
So what happens if I use a different sampler? 00:03:58.920 |
'Cause back then the high-res fix was very basic. 00:04:08.880 |
- Yeah, I think they added a bunch of options 00:04:22.800 |
if I use a different model for the second pass? 00:04:32.320 |
like it would have been harder to implement that 00:04:36.160 |
in the auto interface than to create my own interface. 00:04:51.200 |
'Cause it was just me experimenting with stuff. 00:04:58.120 |
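In concept, the high-res fix being described is a two-pass pipeline. A minimal sketch, with a hypothetical `sample` helper standing in for a full diffusion sampling loop (not ComfyUI's actual API):

```python
import torch.nn.functional as F

def highres_fix(model_a, model_b, prompt, latent, seed=0):
    # First pass: full denoise at the low resolution the model was trained on.
    low = sample(model_a, prompt, latent, steps=20, denoise=1.0, seed=seed)

    # Upscale the latent to the target resolution.
    big = F.interpolate(low, scale_factor=2, mode="nearest")

    # Second pass: partial denoise at high resolution. A different sampler
    # or even a different model can be swapped in here, as long as it
    # shares the same latent space as the first pass.
    return sample(model_b, prompt, big, steps=20, denoise=0.5, seed=seed)
```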
So I started writing the code January 1, 2023. 00:05:28.040 |
- Is there a particular segment of the community 00:05:34.760 |
compared to the automatic crowd or, you know. 00:05:37.680 |
- This was my way of like experimenting with new things. 00:05:53.240 |
I think the first times it got a bit of popularity 00:05:56.720 |
was when I started experimenting with different, 00:06:00.360 |
like applying prompts to different areas of the image. 00:06:07.880 |
posted it on Reddit and it got a bunch of upvotes. 00:06:11.640 |
So I think that's when people first learned of ComfyUI. 00:06:31.080 |
I want a mountain here and I want like a fox here. 00:06:41.840 |
It was just like, oh, when you run the diffusion process, 00:06:48.720 |
you do pass, one pass through the diffusion model, 00:06:57.080 |
this place of the image with the other prompt 00:07:00.040 |
and then the entire image with another prompt 00:07:03.400 |
and then just average everything together, every step. 00:07:07.320 |
And that was area composition, which I call it. 00:07:13.760 |
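A minimal sketch of that blending step, with an illustrative `model(x, t, cond)` signature rather than any particular library's API; masks are 0/1 tensors marking where each prompt applies:

```python
import torch

def area_composition_step(model, x_t, t, conds):
    # conds: list of (prompt_embedding, mask) pairs, e.g. a box for the
    # fox, a box for the mountain, and an all-ones mask for the full-image
    # prompt so every pixel stays covered.
    pred_sum = torch.zeros_like(x_t)
    weight_sum = torch.zeros_like(x_t)
    for emb, mask in conds:
        pred_sum += model(x_t, t, emb) * mask   # one pass per prompt
        weight_sum += mask
    # Average the predictions wherever regions overlap, every step.
    return pred_sum / weight_sum.clamp(min=1e-8)
```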
there was a paper that came out called MultiDiffusion, 00:07:19.120 |
- Could you do area composition with different models 00:07:29.200 |
I hadn't implemented it for different models, 00:07:32.200 |
but you can do it with different models if you want, 00:07:36.720 |
as long as the models share the same latent space. 00:07:52.480 |
but like, yeah, like SD 1.5 models, different ones, 00:08:16.480 |
Because there used to be latent diffusion models 00:08:25.840 |
Have you ever tried to talk to like Stability, 00:08:42.520 |
- Ah, that's the part of the story I didn't know about. 00:08:55.720 |
it was a base model and then a refiner model. 00:09:05.680 |
- Comfy, oh, this, we can use this to do that. 00:09:23.040 |
and then they wanted to publish both of them. 00:09:48.680 |
But like right now, I don't think many people 00:09:52.840 |
even though it is actually a full diffusion model. 00:10:07.680 |
So Stable Diffusion, obviously, is the most known. 00:10:12.200 |
Are there any underrated models that people should use more 00:10:17.440 |
- Well, the latest state of the art at least, 00:10:21.040 |
yeah, for images, there's, yeah, there's Flux. 00:10:29.400 |
There's a small one, 2.5B and there's the bigger one, 8B. 00:10:45.640 |
People should give SD 3.5 a try 'cause it's different. 00:10:52.400 |
Well, it's better for some like specific use cases. 00:10:55.800 |
If you want to make something more like creative, 00:11:00.760 |
If you want to make something more consistent 00:11:11.000 |
- Well, we do support them with custom nodes. 00:11:23.600 |
- Yeah, it's just not, I'm not the person that handles that. 00:11:39.120 |
very loyal to the previous generations of SDs? 00:11:58.840 |
- Okay, so SD 1.5, SD 3, Flux and whatever else. 00:12:18.920 |
What was it like inside of Stability, actually? 00:12:29.960 |
actually that model was ready like three months before, 00:12:40.920 |
or was supposed to be released by the authors, 00:12:44.400 |
then it would probably have gotten very popular 00:12:54.680 |
so people kind of didn't develop anything on top of it, 00:13:04.600 |
Completely, mostly ignored for some reason, like. 00:13:07.880 |
- It seemed, I think the naming as well matters. 00:13:11.040 |
It seemed like a branch off of the main tree of development. 00:13:15.760 |
- Yeah, well, it was different researchers that did it. 00:13:25.080 |
I don't know if I'm pronouncing it correctly. 00:13:32.880 |
and they left right after the Cascade release. 00:13:45.960 |
This, I think I'm pronouncing his name correctly. 00:14:20.120 |
- Like if there's only like incremental improvement, 00:14:24.280 |
which is what most of these models are going to have, 00:14:28.000 |
especially if you stay the same parameter count. 00:14:33.280 |
- Like you're not going to get a massive improvement 00:14:36.480 |
into like, unless there's something big that changes, so. 00:14:41.920 |
And how are they evaluating these improvements? 00:14:43.360 |
Like, because it's a whole chain of, you know, 00:14:52.560 |
- Are you talking on the model side specifically? 00:15:16.440 |
are kind of going to be compatible between different models. 00:15:19.840 |
It's just like, you might have to completely change 00:15:25.000 |
then maybe the question is really about evals. 00:15:26.520 |
Like what does the Comfy community do for evals? 00:15:34.680 |
it's more like, oh, I think this image is nice. 00:15:41.040 |
and just see like, you know, what Fulfur is doing. 00:15:43.520 |
- Yeah, they just, they just generate like it, 00:15:58.720 |
- Yeah, it's not like the more like scientific, 00:16:03.640 |
like checking that's more on specifically on like model side. 00:16:23.520 |
'Cause most of the images on the internet are ugly. 00:16:34.960 |
I created on all the, like I've trained on just 00:16:43.920 |
- They're gonna be very consistent, but yeah. 00:16:49.280 |
that people are gonna be expecting from a model. 00:17:00.480 |
Before, I actually, I'm kind of curious how LoRAs 00:17:13.440 |
- Yeah, I can't even explain the difference between that. 00:17:16.200 |
Textual inversions, that's basically what you're doing 00:17:22.560 |
Stable diffusion, you have the diffusion model, 00:17:26.720 |
So basically what you're doing is training a vector 00:17:46.000 |
basically you have, you take your words of your prompt, 00:17:51.000 |
you convert those into tokens with the tokenizer 00:17:57.720 |
Basically, yeah, each token represents a different vector. 00:18:04.960 |
depending on your words, that's the list of vectors 00:18:09.440 |
which is just, yeah, just a stack of attention. 00:18:14.440 |
Like basically it's very close to LLM architecture. 00:18:27.520 |
and I want to know which word does that represent 00:18:32.520 |
and it's gonna get, like you train this vector 00:18:37.840 |
it hopefully generates like something similar 00:18:42.920 |
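A hedged sketch of that training loop: everything is frozen except one new embedding vector. `unet` and `add_noise` are hypothetical stand-ins for the frozen diffusion model and the standard noising step:

```python
import torch
import torch.nn.functional as F

emb_dim = 768                                  # CLIP-L width in SD 1.5
new_vec = torch.randn(emb_dim, requires_grad=True)
opt = torch.optim.AdamW([new_vec], lr=5e-4)

def train_step(latents, prompt_embs, t, noise):
    # Splice the learnable vector into the otherwise-fixed prompt
    # embeddings (placement of the placeholder token is illustrative).
    embs = torch.cat([prompt_embs, new_vec[None, None]], dim=1)
    pred = unet(add_noise(latents, noise, t), t, embs)
    loss = F.mse_loss(pred, noise)             # plain diffusion objective
    loss.backward()                            # gradients reach only new_vec
    opt.step()
    opt.zero_grad()
```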
- Yeah, I would say it's like surprisingly sample efficient 00:18:48.560 |
- Yeah, well, people have kind of stopped doing that, 00:18:55.080 |
we actually did train internally some textual inversions 00:19:03.640 |
but for some reason, yeah, people don't use them. 00:19:11.880 |
that's just something you'd probably have to test, 00:19:23.120 |
'Cause same thing with like the textual inversions 00:19:36.920 |
and one of them is the same as the SD 1.5 CLIP-L. 00:19:36.920 |
So those, they actually, they don't work as strongly 00:19:45.000 |
'cause they're only applied to one of the text encoders, 00:20:05.960 |
- Do people experiment a lot on, just on the CLIP side, 00:20:14.280 |
- Yeah, 'cause they're trained together, right? 00:20:18.800 |
what I've seen people experimenting with is LongCLIP. 00:20:28.360 |
- Oh, it's kind of like long context fine-tuning. 00:21:00.040 |
well, you split, you split it up in chunks of 77, 00:21:19.240 |
and then just concat everything together at the end. 00:22:02.480 |
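A minimal sketch of that chunking trick, assuming a hypothetical `clip_encode` that turns up to 77 token ids into an embedding tensor of shape (1, n, dim); real implementations also pad the final chunk and handle the special BOS/EOS tokens:

```python
import torch

def encode_long_prompt(clip_encode, token_ids, chunk=77):
    parts = [clip_encode(token_ids[i:i + chunk])
             for i in range(0, len(token_ids), chunk)]
    # Concatenate along the sequence axis: (1, n_chunks * 77, emb_dim).
    return torch.cat(parts, dim=1)
```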
all sort of similar part of this layer of the stack. 00:22:05.720 |
- Yeah, the hack for that, which works on CLIP, 00:22:13.440 |
well, for SD 1.5, the prompt weighting works well 00:22:16.800 |
because CLIP-L is a, it's not a very deep model. 00:22:33.160 |
They're very, the concepts are very close, closely linked. 00:22:37.480 |
So that means if you interpolate the vector from what, 00:23:11.000 |
but this stops working the deeper your text encoder is. 00:23:22.240 |
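A minimal sketch of that interpolation, assuming the two embeddings are already computed; weight > 1 exaggerates the concept, weight < 1 mutes it:

```python
def weight_token(token_emb, empty_emb, weight):
    # Move along the line from the empty-prompt embedding toward the
    # token's embedding; works tolerably on a shallow encoder like CLIP-L.
    return empty_emb + (token_emb - empty_emb) * weight
```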
I mean, 'cause I'm used to just moving up numbers. 00:23:40.000 |
'Cause I guess people can sort of get around it 00:23:49.160 |
now the popular way to customize models is LoRAs. 00:23:49.160 |
- There's a bunch of, 'cause what the LoRA is essentially, 00:23:56.360 |
is instead of like, okay, you have your model, 00:24:15.000 |
So to speed things up and make things less heavy, 00:24:18.800 |
what you can do is just fine tune some smaller weights. 00:24:26.800 |
when you multiply like two low rank matrices, 00:24:35.680 |
between trained weights and your base weights. 00:24:54.800 |
you can inference with them pretty efficiently, 00:25:04.200 |
so that there's only a small delay at the base, 00:25:07.440 |
like before the sampling to when it applies the weights, 00:25:19.360 |
and then you have, so basically all the LoRA types, 00:25:24.760 |
that's just different ways of representing that. 00:25:28.720 |
Like, basically you can call it kind of like compression, 00:25:35.520 |
It's just different ways of representing, like, just okay. 00:25:42.880 |
What's the best way to represent that difference? 00:25:48.240 |
oh, let's multiply these two matrices together. 00:25:58.200 |
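A minimal sketch of what a LoRA patch amounts to for a plain linear layer; the file essentially stores the two small matrices, whose product approximates the difference between fine-tuned and base weights:

```python
import torch

def apply_lora(base_weight, lora_A, lora_B, strength=1.0):
    # base_weight: (out, in); lora_B: (out, r); lora_A: (r, in), r small.
    delta = lora_B @ lora_A        # low-rank estimate of W_tuned - W_base
    return base_weight + strength * delta

W = torch.randn(768, 768)
A, B = torch.randn(8, 768), torch.randn(768, 8)     # rank 8
W_patched = apply_lora(W, A, B, strength=0.8)       # done once, pre-sampling
```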
- So let's talk about what ComfyUI actually is. 00:26:05.280 |
I think fewer people have built very complex workflows. 00:26:08.040 |
So when you started, automatic was like the super simple way. 00:26:17.760 |
that stands out as like, this was like a unique take 00:26:24.160 |
everyone was trying to make like easy to use interface. 00:26:42.520 |
Everyone else doing it, so let me try something. 00:26:45.840 |
Like, let me try to make a powerful interface 00:26:52.200 |
- So like, yeah, there's a sort of node execution engine. 00:26:55.880 |
Your readme actually lists, has this really good list 00:27:03.720 |
like only re-executing the parts of the workflow that changed, 00:27:06.760 |
asynchronous queue system, smart memory management, 00:27:10.040 |
like all this seems like a lot of engineering that. 00:27:17.080 |
'Cause I was always focused on making things work locally 00:27:21.080 |
very well, 'cause that's, 'cause I was using it locally. 00:27:24.640 |
So everything, so there's a lot of thought and work 00:27:29.640 |
and like getting everything to run as well as possible. 00:27:34.600 |
So yeah, ComfyUI is actually more of a backend, 00:27:46.840 |
I was pretty much only focused on the backend. 00:27:54.240 |
- Yeah, before there was no versioning, so yeah. 00:27:58.200 |
- And so what was the big rewrite for the 0.1 00:28:05.480 |
'Cause before that, it was just like the UI, what, 00:28:09.960 |
'cause when I first wrote it, I just, I said, okay, 00:28:13.800 |
how can I make, like, I can do web development, 00:28:28.600 |
- Usually people will go for like React Flow, 00:28:40.720 |
well, oh, LiteGraph, this has the whole node interface. 00:28:44.840 |
Well, okay, let me just plug that into my back end then. 00:28:49.440 |
- I feel like if Streamlit or Gradio offered something, 00:28:51.640 |
you would have used Streamlit or Gradio 'cause it's Python. 00:29:16.080 |
and your back end logic and just sticks them together. 00:29:20.760 |
- Well, it's supposed to be easy for you guys, 00:29:22.720 |
if you're a Python main, you know, I'm a JS main, right? 00:29:25.760 |
- If you're a Python main, it's supposed to be easy. 00:29:26.760 |
- Yeah, it's easy, but it makes your whole software 00:29:31.760 |
So you're mixing concerns instead of separating concerns? 00:29:36.880 |
- Front end and back end should be well separated 00:29:52.960 |
And also it's, there's a lot of issues with Gradio. 00:30:00.200 |
is just get, like, slap a quick interface on your, 00:30:04.000 |
like, to show off your, like, your ML project. 00:30:21.080 |
But if you want to make something that's like a real, 00:30:24.280 |
like real software that will last a long time 00:30:28.320 |
and will be easy to maintain, then I would avoid it. 00:30:33.480 |
So your criticism is Streamlit and Gradio are the same. 00:30:52.040 |
Going back to like the core tech, like asynchronous queues, 00:30:55.080 |
partial re-execution, smart memory management, you know, 00:31:00.720 |
- Yeah, the thing that's the biggest pain in the ass 00:31:05.840 |
- Yeah, were you just paging models in and out or? 00:31:08.360 |
- Yeah, before it was just, okay, load the model, 00:31:16.200 |
Then, okay, that works well when your models are small, 00:31:19.800 |
but if your models are big and it takes like, 00:31:29.760 |
that can take a few seconds to like load and unload, 00:31:33.200 |
so you want to try to keep things like in memory, 00:31:48.040 |
It's going to take probably this amount of memory. 00:31:51.200 |
Let's remove the models, like this amount of memory 00:31:56.200 |
that's been loaded on the GPU and then just execute it. 00:32:06.400 |
'cause try to remove the least amount of models 00:32:18.080 |
And another problem is the NVIDIA driver on Windows 00:32:26.680 |
But by default, it, like, if you start loading, 00:32:34.280 |
and then it's, the driver's going to automatically 00:32:51.240 |
So it's basically, you have to just try to get, 00:32:55.480 |
use as much memory as possible, but not too much, 00:33:03.760 |
And then just find, try to find that line where, 00:33:07.680 |
like the driver on Windows starts paging and stuff. 00:33:12.840 |
- And yeah, the problem with PyTorch is it's, 00:33:15.520 |
it's high level, so you don't have that much fine-grained control 00:33:22.240 |
So you kind of have to leave like the memory freeing 00:33:26.280 |
to Python and PyTorch, which can be annoying sometimes. 00:33:31.280 |
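A hedged sketch of that balancing act (not ComfyUI's actual implementation): before running a model, offload the least recently used loaded models until the GPU has enough free memory, plus headroom so the Windows driver doesn't start paging to system RAM:

```python
import torch

loaded = []  # (model, size_bytes) pairs, least recently used first

def ensure_free(required_bytes, headroom=1 << 30):
    free, _total = torch.cuda.mem_get_info()
    while free < required_bytes + headroom and loaded:
        model, _size = loaded.pop(0)   # evict the least recently used
        model.to("cpu")                # offload rather than destroy
        torch.cuda.empty_cache()       # hand memory back to the driver
        free, _total = torch.cuda.mem_get_info()
```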
- So, you know, I think one thing as a maintainer 00:33:50.160 |
- First of all, is there a market share estimate? 00:33:55.040 |
and then like miscellaneous on Apple Silicon or whatever? 00:34:08.600 |
- 'Cause AMD, the problem, like AMD works horribly 00:34:15.280 |
It's slower than the price equivalent NVIDIA GPU, 00:34:19.600 |
but it works, like you can use it, generate images, 00:34:25.120 |
On Linux. On Windows, you might have a hard time. 00:34:29.720 |
And most people, I think most people who bought AMD 00:34:42.760 |
So until AMD actually like ports their like ROCm 00:34:47.760 |
to Windows properly, and then there's actually PyTorch. 00:34:56.080 |
but until they get a good like PyTorch ROCm build 00:35:08.200 |
- Yeah, well, he's trying to get Lisa Su to do it. 00:35:10.920 |
But let's talk a bit about like the node design. 00:35:19.280 |
So you have like a separate node for like CLIP encode. 00:35:22.120 |
You have a separate node for like the KSampler. 00:35:25.880 |
Going back to like making it easy versus making it hard. 00:35:31.960 |
You know, kind of like, how do you guide people to like, 00:35:33.680 |
hey, this is actually gonna be very impactful 00:35:49.240 |
At least for the, but for things like, for example, 00:36:11.800 |
custom advanced node, that one you can actually, 00:36:21.760 |
- What are like the most impactful parameters that you use? 00:36:27.480 |
but like, which ones like really make a difference? 00:36:31.800 |
They all have their own like, they all like, for example, 00:36:41.800 |
but you want, if you're optimizing your workflow, 00:36:47.600 |
until like the image has started deteriorating too much. 00:36:52.440 |
'Cause that, yeah, that's the number of steps 00:36:57.320 |
So if you want things to be faster, lower is better. 00:37:05.040 |
you can kind of see that as the contrast of the image. 00:37:48.920 |
steps, CFG, sampler name, scheduler, denoise. 00:37:56.000 |
it's something you should like try out yourself. 00:37:59.480 |
I don't know, you don't necessarily need to know 00:38:11.160 |
- So the only thing you need to know about CFG is if it's 1.0, 00:38:14.240 |
then that means the negative prompt isn't applying. 00:38:17.360 |
And also maybe sampling is two times faster, but. 00:38:27.960 |
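Both points fall out of the classifier-free guidance formula; a minimal sketch with an illustrative `model` signature:

```python
def cfg_step(model, x_t, t, cond, uncond, cfg):
    pred_cond = model(x_t, t, cond)
    if cfg == 1.0:
        # The formula collapses to the positive prediction alone: the
        # negative prompt has no effect and its model pass can be skipped
        # entirely, which is why sampling runs roughly twice as fast.
        return pred_cond
    pred_uncond = model(x_t, t, uncond)
    return pred_uncond + cfg * (pred_cond - pred_uncond)
```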
and you'll probably get a more intuitive understanding 00:38:33.960 |
- Any other notes or things you want to shout out? 00:38:40.360 |
those are like some of the most popular ones. 00:38:53.160 |
as their backend, like there's a plugin for Krita 00:39:15.960 |
- What's the craziest node that people have built, 00:39:23.720 |
I know some people have made like video games in Comfy, 00:39:56.440 |
there's a lot of crazy things people do, so yeah. 00:40:04.760 |
- Yeah, well, there's always been the ComfyUI Manager, 00:40:08.360 |
but we're trying to make this more like official, 00:40:20.920 |
like okay, how did your custom node get into ComfyUI Manager? 00:40:24.560 |
That's the guy running it who like every day, 00:40:29.880 |
and added them manually to his custom node manager. 00:40:34.520 |
So we're trying to make it less effort for him, basically. 00:40:41.840 |
I mean, there's like a YouTube download node. 00:40:44.640 |
There's like, this is almost like a data pipeline, 00:40:48.080 |
more than like an image generation thing at this point. 00:40:54.800 |
- Yeah, you can do a lot of different things. 00:40:59.200 |
what I did is I made it easy to make custom nodes. 00:41:04.960 |
So I think that that helped a lot for the ecosystem, 00:41:20.960 |
of custom node packs, which share similar nodes. 00:41:25.520 |
But well, that's, yeah, something we're trying to solve 00:41:30.320 |
by maybe bringing some of the functionality into core. 00:41:40.440 |
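To give a sense of how little a custom node takes, a minimal made-up example following the public node format (a class plus a mapping dict in a file under custom_nodes/; check the current docs for details):

```python
class InvertLatent:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"samples": ("LATENT",)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "invert"
    CATEGORY = "latent"

    def invert(self, samples):
        out = samples.copy()
        out["samples"] = -samples["samples"]   # flip the latent's sign
        return (out,)

NODE_CLASS_MAPPINGS = {"InvertLatent": InvertLatent}
```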
- Yeah, video, that's, well, the first video model 00:41:47.080 |
which was last, yeah, exactly last year, I think. 00:41:50.760 |
Like one year ago, but that wasn't a true video model. 00:42:04.320 |
It's basically what they did is they took SD2 00:42:08.040 |
and then they added some temporal attention to it 00:42:27.440 |
Like a true video model, like Mochi, for example, 00:42:33.360 |
- Which means you can like move through the space, 00:42:45.880 |
that compresses on like the temporal direction also. 00:42:51.000 |
So that's something you don't have with like, 00:42:52.880 |
yeah, animate diff and stable video diffusion. 00:42:56.080 |
They only like compress spatially, not temporally. 00:43:02.040 |
that's why I call them like true video models. 00:43:07.200 |
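Roughly, the difference shows up in the latent shapes; the numbers below are illustrative (8x spatial compression is standard, and Mochi's VAE is reported to also compress about 6x along time):

```python
frames, h, w = 48, 480, 848

# Image VAE applied per frame: spatial compression only.
image_vae_latent = (frames, 4, h // 8, w // 8)

# "True" video VAE: the temporal axis is compressed as well.
video_vae_latent = (frames // 6, 12, h // 8, w // 8)
```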
but the one I've implemented in Comfy is Mochi 00:43:15.960 |
- We had AJ come and speak at the Stable Diffusion meetup. 00:43:19.200 |
Other open one I think I've seen is CogVideoX. 00:43:22.880 |
Yeah, that one's, yeah, it also seems decent, but yeah. 00:43:31.840 |
yeah, it's just that there's a, it's not the only one. 00:43:39.760 |
- Yeah, the closed source, there's a bunch of them, 00:43:57.480 |
as SD3.5, same day, which is why I don't remember the name. 00:44:04.280 |
so we don't conflict on each of these things. 00:44:06.080 |
- Yeah, I think SD3.5 and Mochi released on the same day. 00:44:14.480 |
So for some reason, lots of people picked that day 00:44:21.520 |
Yeah, which is, well, shame for those, I think, guys. 00:44:32.080 |
So you are Comfy, and then there's like Comfy.org. 00:44:35.560 |
I know we do a lot of things with like Nous Research, 00:44:37.520 |
and those guys also have kind of like a more open source 00:44:44.520 |
you must have worked on like the core piece of it, 00:44:55.880 |
- Yeah, maybe I should explain the rest, so yeah. 00:45:05.720 |
that's when Comfy was first released to the public. 00:45:12.880 |
about the area composition thing somewhere in, 00:45:21.440 |
And then somewhere, a YouTuber made a video about it, 00:45:26.440 |
like Olivio, he made a video about Comfy in March, 2023. 00:45:31.840 |
I think that's when it was real burst of attention. 00:45:36.280 |
And by that time I had continued developing it 00:45:45.400 |
which unfortunately meant that I had first written it 00:46:00.480 |
people were actually starting to use it then, 00:46:09.360 |
Yeah, and then I got hired by Stability, June, 2023. 00:46:26.880 |
Actually, how the SDXL release worked 00:46:34.800 |
but they didn't release the model checkpoint. 00:46:39.800 |
And then, well, since the research was with the code, 00:46:45.400 |
And then the checkpoints were basically early access. 00:46:50.840 |
and they only allowed, like, people from edu emails. 00:46:58.160 |
like they gave you access basically to the SDXL 0.9. 00:47:09.520 |
because of course it's gonna leak if you do that. 00:47:13.240 |
Well, the only way people could easily use it 00:47:19.320 |
And then I fixed a few of the issues that people had. 00:47:26.800 |
And well, ComfyUI was the only way a lot of people 00:47:26.800 |
'Cause it just like automatic was so like inefficient 00:47:58.500 |
Like everywhere, like people who didn't have the 4090, 00:48:09.480 |
And so today, what's, is there like a core Comfy team or? 00:48:14.240 |
- Yeah, well, right now, yeah, we are hiring actually. 00:48:19.240 |
So right now core, like the core core itself, it's me. 00:48:31.480 |
'cause that's the thing that's been neglected 00:48:36.200 |
So most of the focus right now is all on the front end, 00:48:41.200 |
but we are, yeah, we will soon get more people 00:48:46.040 |
to like help me with the actual backend stuff. 00:48:57.240 |
with the nice interface and easy to install on Windows, 00:49:04.160 |
Yeah, once we have that, we're going to have to, 00:49:21.800 |
Yeah, like I don't want to promise a release date. 00:49:26.240 |
Yeah, we do have a release date we're targeting, 00:49:47.840 |
and like it's going to be the best way to run models locally, 00:49:47.840 |
like cloud inference or like that type of thing. 00:50:02.480 |
So, and maybe some, like some things for some enterprises. 00:50:09.840 |
How do you feel about the other Comfy startups? 00:50:22.280 |
I don't like, like, yeah, we're going to try not to, 00:50:32.800 |
it's better that people use Comfy than something else. 00:50:39.600 |
it's, I think it helps, it helps the ecosystem. 00:51:07.520 |
It's something I've wanted to eventually add to core, 00:51:17.880 |
for like prompt enhancement and like other things like that. 00:51:28.000 |
Yeah, unless some text diffusion model comes out. 00:51:31.000 |
- Yeah, David Holz is investing a lot in text diffusion. 00:51:42.880 |
- Yeah, well, yeah, if an open one comes out.