(upbeat music) - Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol.ai. Hey everyone, we are in the Chroma Studio again, but with our first ever anonymous guest, Comfy Anonymous, welcome. I feel like that's your full name.
You just go by Comfy, right? - Yeah, well a lot of people just call me Comfy, even though, even when they know my real name. Hey, Comfy. - swyx is the same. Not a lot of people call you Sean. - Yeah, you have a professional name, right? That people know you by, and then you have a legal name.
Yeah, that's fine. How do I phrase this? I think people who are in the know, know that Comfy is like the tool for image generation and now other multimodality stuff. I would say that when I first got started with stable diffusion, the star of the show was Automatic1111, right?
And I actually looked back at my notes from 2022-ish, like Comfy was already getting started back then, but it was kind of like the up and comer and like your main feature was the flowchart. Can you just kind of rewind to that moment that year and like, you know, how you looked at the landscape there and decided to start Comfy?
- Yeah, I discovered stable diffusion in October 2022. And well, I kind of started playing around with it. And back then I was using Automatic, which was what everyone was using back then. So I started with that, 'cause when I started, I had no idea how diffusion models work, how any of this works.
- Oh yeah, what was your prior background as an engineer? - Just a software engineer. Yeah, boring software engineer. - But like any image stuff, any orchestration, distributed systems, GPUs? - No, I was doing basically nothing interesting. (laughs) - CRUD, web development? - Yeah, a lot of web development.
Just, yeah, some basic, maybe some basic like automation stuff and whatever. Just, yeah, no big companies or anything. - Yeah, but like already some interest in automations, probably a lot of Python. - Yeah, yeah, of course, Python. But like, I wasn't actually used to like the node graph interface.
Before I started ComfyUI, it was just, I just thought it was like, oh, like what's the best way to represent the diffusion process in a user interface? And then like, oh, well, like naturally, oh, this is the best way I found. And this was like with the node interface.
So how I got started was, yeah. So basically October 2022, just like, I hadn't written a line of PyTorch before that. So it was completely new. What happened was I kind of got addicted to generating images. - As we all did. - Yeah, and then I started experimenting with like the highres fix in auto, which was, for those that don't know, the highres fix exists because the diffusion models back then could only generate at low resolution.
So what you would do, you would generate low-resolution image, then upscale, then refine it again. And that was kind of the hack to generate high-resolution images. I really liked generating like higher-resolution images. So I was experimenting with that. And so I modified the code a bit. Okay, what happens if I use different samplers on the second pass?
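To make that two-pass hack concrete, here's a rough sketch using the diffusers library rather than auto's or Comfy's actual code; the checkpoint name and settings are just illustrative.

```python
# Pass 1: generate small, where the model is comfortable. Pass 2: upscale,
# then refine the upscaled image with img2img at partial denoise strength.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"          # illustrative SD 1.5 checkpoint
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
img2img = StableDiffusionImg2ImgPipeline(**txt2img.components)  # reuse the same weights

prompt = "a misty mountain landscape at sunrise"

low = txt2img(prompt, width=512, height=512, num_inference_steps=25).images[0]
big = low.resize((1024, 1024))   # simple upscale; real workflows often use a dedicated upscale model
final = img2img(prompt, image=big, strength=0.5, num_inference_steps=25).images[0]
final.save("highres_fix.png")
```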
So I edited the code of auto. So what happens if I use a different sampler? What happens if I use, like, different settings, different number of steps? 'Cause back then the highres fix was very basic. - Yeah, now there's a whole library of just the upsamplers.
- Yeah, I think they added a bunch of options to the highres fix since then. But before that, it was just so basic. So I wanted to go further. I wanted to try, okay, what happens if I use a different model for the second pass? And then, well, the auto code base wasn't good enough for that, like it would have been harder to implement that in the auto interface than to create my own interface.
So that's when I decided to create my own. - And you were doing that mostly on your own when you started, or did you already have kind of like a subgroup of people? - No, it was on my own. 'Cause it was just me experimenting with stuff. So yeah, that was it.
So I started writing the code January 1, 2023. And then I released the first version on GitHub January 16, 2023. That's how things got started. - And was the name Comfy UI right away? - Yeah, yeah, Comfy UI. The reason my name is Comfy is people thought my pictures were comfy.
So I just named it, it's my Comfy UI. So yeah, that's. - Is there a particular segment of the community that you targeted as users? Like more intensive workflow artists, compared to the automatic crowd or, you know. - This was my way of like experimenting with new things. Like the highres fix thing I mentioned, which was like in Comfy, the first thing you could easily do was just chain different models together.
And then one of the first things, I think the first times it got a bit of popularity was when I started experimenting with different, like applying prompts to different areas of the image. Yeah, I called it area conditioning, posted it on Reddit and it got a bunch of upvotes.
So I think that's when people first learned of Comfy UI. - Is that mostly like fixing hands? - No, no, no, that was just like, let's say, well, it was very, well, it still is kind of difficult to like, let's say you want a mountain, you have an image and then you're like, I want a mountain here and I want like a fox here.
- Yeah, so compositing the image. - Yeah, my way was very easy. It was just like, oh, when you run the diffusion process, every step you do one pass through the diffusion model: this part of the image with this prompt, this part of the image with the other prompt, and then the entire image with another prompt, and then just average everything together, every step.
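A toy sketch of the averaging he describes, not ComfyUI's actual implementation; the model call and region format here are assumptions for illustration.

```python
import torch

def area_conditioned_step(model, x, t, regions, global_cond):
    """One denoising pass with per-region prompts, roughly as described:
    each area gets its own prompt's prediction, the whole image gets the
    global prompt's prediction, and everything is averaged together.

    regions: list of (mask, cond) pairs, where mask is 1.0 inside the area
    and 0.0 outside, broadcastable to the latent x."""
    pred = model(x, t, global_cond)              # entire image with the global prompt
    weight = torch.ones_like(x)
    for mask, cond in regions:
        pred = pred + mask * model(x, t, cond)   # this area with its own prompt
        weight = weight + mask
    return pred / weight                         # average everything, every step
```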
And that was area composition, which is what I called it. And then a month later, there was a paper that came out called MultiDiffusion, which was the same thing, but yeah. - Could you do area composition with different models or because you're averaging out, you kind of need the same model?
- Could do it with, but yeah, I hadn't implemented it for different models, but you can do it with different models if you want, as long as the models share the same latent space. - We're supposed to ring a bell every time someone says latent space. - Yeah, like for example, you couldn't use like SDXL and SD 1.5 'cause those have a different latent space, but like, yeah, like SD 1.5 models, different ones, you could do that.
- There's some models that try to work in pixel space, right? - Yeah, they're very slow. - Of course. - That's the problem. That's the reason why stable diffusion actually became like popular, like was because of the latent space. - Smaller, yeah. Because there used to be latent diffusion models and then they scaled it up.
- Yeah, 'cause the pixel diffusion models are just too slow, so. - Yeah. Have you ever tried to talk to like Stability, the latent diffusion guys, like, you know, Robin Rombach, that crew? - Yeah, well, I used to work at Stability. - Oh, I actually didn't know. - Yeah, I used to work at Stability.
I got hired in June, 2023. - Ah, that's the part of the story I didn't know about. Okay. - So the reason I was hired is because they were doing SDXL at the time. And they were basically, SDXL, I don't know if you remember, it was a base model and then a refiner model.
Basically, they wanted to experiment like chaining them together. And then they saw, oh. - Right. - Comfy, oh, we can use this to do that. Well, let's hire that guy. - But they didn't pursue it for like SD3. - What do you mean? - Like the SDXL approach.
- Yeah, the reason for that approach was because basically they had two models and then they wanted to publish both of them. So they trained one on lower time steps, which was the refiner model. And then the first one was trained normally. And then during their tests, they realized, oh, like if we string these models together, the quality increases.
So let's publish that. - It worked. - Yeah. But like right now, I don't think many people actually use the refiner anymore, even though it is actually a full diffusion model. Like you can use it on its own and it's gonna generate images. I don't think anyone, people have mostly forgotten about it, but.
- Can we talk about models a little bit? So stable diffusion, obviously is the most known. I know Flux has gotten a lot of traction. Are there any underrated models that people should use more or what's the state of the union? - Well, the latest state of the art at least, yeah, for images, there's, yeah, there's Flux.
There's also SD 3.5. SD 3.5 is two models. There's a small one, 2.5B and there's the bigger one, 8B. So it's smaller than Flux. So, and it's more creative in a way. But Flux, yeah, Flux is the best. People should give SD 3.5 a try 'cause it's different. I won't say it's better.
Well, it's better for some specific use cases. If you want to make something more creative, maybe SD 3.5. If you want to make something more consistent, Flux is probably better. - Do you ever consider supporting the closed source model APIs? - Well, we do support them with custom nodes.
We actually have some official custom nodes from different-- - Ideogram. - Yeah. - I guess DALL-E would have one. - Yeah, it's just not, I'm not the person that handles that. - Sure, sure. Quick question on SD. There's a lot of community discussion about the transition from SD 1.5 to SD 2 and then SD 2 to SD 3.
People still like, you know, very loyal to the previous generations of SDs? - Yeah, SD 1.5 still has a lot of users. - The last base model. - Yeah, then SD 2 was mostly ignored 'cause it wasn't a big enough improvement over the previous one. - Okay, so SD 1.5, SD 3, Flux and whatever else.
- Yeah, SD XL. - SD XL. - SD XL, that's the main one. - Stable Cascade? - Stable Cascade, that was a good model, but the problem with that one is it got, like SD 3 was announced one week after. - Yeah, it was like a weird release. What was it like inside of Stability, actually?
I mean, statute of limitations expired, you know, management has moved, so it's easier to talk about now. - Yeah, and inside Stability, actually that model was ready like three months before, but it got stuck in red teaming. So basically, if that model had been released when it was supposed to be released by the authors, then it would probably have gotten very popular since it's a step up from SDXL, but it got all of its momentum stolen by the SD 3 announcement, so people kind of didn't develop anything on top of it, even though it's, yeah.
It was a good model, at least. Completely, mostly ignored for some reason, like. - It seemed, I think the naming as well matters. It seemed like a branch off of the main tree of development. - Yeah, well, it was different researchers that did it. - Different, yeah. - Very, like a good model, like it's the Würstchen authors, I don't know if I'm pronouncing it correctly.
- Würstchen, yeah. - Würstchen, yeah, yeah. - I actually met them in Vienna. Yeah, they worked at Stability for a bit and they left right after the Cascade release. - This is Dustin, right? - No. - Dustin's SD 3. - No, Dustin is SD 3, SDXL. That's Pablo and Dominic.
This, I think I'm pronouncing his name correctly. Yeah, that's very good. - It seems like the community is very, they move very quickly. - Yeah. - Like when there's a new model out, they just drop whatever the current one is and they just all move wholesale over, like they don't really stay to explore the full capabilities.
Like if the Stable Cascade was that good, they would have maybe tested a bit more. Instead, they're like, "Okay, SD 3 is out, let's go." You know? - Well, I find the opposite actually. The community doesn't, like they only jump on a new model when there's a significant improvement.
- I see. - Like if there's only like incremental improvement, which is what most of these models are going to have, especially if you stay at the same parameter count. - Yeah. - Like you're not going to get a massive improvement unless there's something big that changes, so.
- Yeah. And how are they evaluating these improvements? Like, because it's a whole chain of, you know, comfy workflows. - Yeah. - How does one part of the chain actually affect the whole process? - Are you talking on the model side specific? - Model specific, right? But like, once you have your whole workflow based on a model, it's very hard to move.
- Ah, not, well, not really. - Yeah, maybe not. - It depends on your, depends on the specifics and the workflow. - Yeah. Like, so I do a lot of like text and image. - Yeah. When you do change, like most workflows are kind of going to be compatible between different models.
It's just like, you might have to completely change your prompt, completely change. - Okay, well, I mean, then maybe the question is really about evals. Like what does the Comfy community do for evals? Just, you know. - Well, they don't really do the, it's more like, oh, I think this image is nice.
- Yeah. - So that's. - They just subscribe to Fulfur AI and just see like, you know, what Fulfur is doing. - Yeah, they just, they just generate like it, like, I don't see anyone really doing it. At least on the Comfy side, Comfy users, it's more like, oh, generate images and see, oh, this one's nice.
- Yeah. - It's like. - Yeah, vibes. - Yeah, it's not like the more like scientific, like checking that's more on specifically on like model side. Yeah. But there is a lot of vibes also, 'cause it is like artistic. You can create a very good model that doesn't generate nice images.
'Cause most of the images on the internet are ugly. So if you're like, oh, I have the best model that can generate, it's super smart, I trained it on all the images on the internet. The images are not gonna look good.
So. - Yeah, yeah. - They're gonna be very consistent, but yeah. Like, it's not gonna be like the look that people are gonna be expecting from a model. So, yeah. - Can we talk about LoRAs? 'Cause we talked about models, then like the next step is probably LoRAs. Before, I actually, I'm kind of curious how LoRAs entered the tool set of the image community because the LoRA paper was 2021.
And then like, there was like other methods like textual inversion that was popular at the early SD stage. - Yeah, I can't even explain the difference between that. Textual inversions, that's basically what you're doing is you're training a, 'cause well, yeah. Stable diffusion, you have the diffusion model, you have the text encoder.
So basically what you're doing is training a vector that you're gonna pass to the text encoder. It's basically you're training a new word. - Yeah, it's a little bit like representation engineering now. - Yeah, yeah. Basically, yeah. You're just, so yeah. If you know how like the text encoder works, basically you have, you take the words of your prompt, you convert those into tokens with the tokenizer and those are converted into vectors.
Basically, yeah, each token represents a different vector. So each word represents a vector and those, depending on your words, that's the list of vectors that get passed to the text encoder, which is just, yeah, just a stack of attention. Like basically it's very close to LLM architecture. Yeah, yeah, so basically what you're doing is just training a new vector.
You're saying, well, I have all these images and I want to know which word does that represent, and it's gonna get, like you train this vector and then when you use this vector, it hopefully generates like something similar to your images. - Yeah, I would say it's like surprisingly sample efficient in picking up the concept that you're trying to train it on.
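A simplified sketch of what that training loop boils down to: everything stays frozen except the one new vector. The `frozen_denoiser` callable and the other arguments are stand-ins for illustration, not any library's real API.

```python
import torch

# Everything stays frozen except one new embedding vector -- effectively a new
# "word". `frozen_denoiser` is a stand-in for the frozen text encoder + UNet,
# taking the token-embedding sequence directly.
embed_dim = 768                                    # CLIP-L hidden size for SD 1.5
new_vector = torch.nn.Parameter(torch.randn(embed_dim) * 0.01)
optimizer = torch.optim.AdamW([new_vector], lr=5e-4)

def train_step(frozen_denoiser, prompt_embeds, placeholder_pos, latents, timesteps, add_noise):
    # Splice the learned vector into the prompt at the placeholder position.
    embeds = prompt_embeds.clone()
    embeds[:, placeholder_pos] = new_vector

    noise = torch.randn_like(latents)
    noisy = add_noise(latents, noise, timesteps)       # standard forward-diffusion noising

    pred = frozen_denoiser(noisy, timesteps, embeds)   # predicts the added noise
    loss = torch.nn.functional.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()        # gradients only reach new_vector; the model is frozen
    optimizer.step()
    return loss.item()
```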
- Yeah, well, people have kind of stopped doing that, even though back when I was at Stability, we actually did train internally some textual inversions on like T5XXL, actually worked pretty well, but for some reason, yeah, people don't use them. And also they might also work like, yeah, that's just something you'd probably have to test, but maybe if you train a textual inversion like on T5XXL, it might also work with all the other models that use T5XXL.
'Cause same thing with like the textual inversions that were trained for SD 1.5, they also kind of work on SDXL because SDXL has two text encoders and one of them is the same as the SD 1.5 CLIP-L. So those, they actually, they don't work as strongly 'cause they're only applied to one of the text encoders, but, and the same thing for SD 3.
SD 3 has three text encoders, so it works. It's still, you can still use your textual inversion from SD 1.5 on SD 3, but it's just a lot weaker because now there's three text encoders, so it gets even more diluted, yeah. - Do people experiment a lot on, just on the CLIP side, there's like SigLIP, there's BLIP, like do people experiment a lot on those?
- You can't really replace. - Yeah, 'cause they're trained together, right? - Yeah, they're trained together. So you can't, like, well, what I've seen people experimenting with is long CLIP. So basically someone fine-tuned a CLIP model to accept longer prompts. - Oh, it's kind of like long context fine-tuning.
- Yeah, so, so like it's, it's actually supported in core Comfy. - How long is long? - Regular CLIP is 77 tokens. Long CLIP is 256. - Okay. - But the hack that, like, if you use stable diffusion 1.5, you've probably noticed, oh, it still works if I use long prompts, prompts longer than 77 tokens.
Well, that's because the hack is to just, well, you split, you split it up in chunks of 77, your whole big prompt, let's say you give it like the massive text, like the Bible or something, and it would split it up in chunks of 77 and then just pass each one through the clip and then just cut everything together at the end.
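A rough sketch of that chunking hack, written against Hugging Face-style CLIP tokenizer and text encoder objects rather than the actual auto or Comfy code; the padding details are simplified.

```python
import torch

def encode_long_prompt(tokenizer, text_encoder, prompt, chunk_len=77):
    """Split the tokenized prompt into 77-token windows, run CLIP on each
    window separately, then concatenate the outputs into one conditioning."""
    ids = tokenizer(prompt, truncation=False, add_special_tokens=False).input_ids
    bos, eos = tokenizer.bos_token_id, tokenizer.eos_token_id
    chunks = []
    for start in range(0, len(ids), chunk_len - 2):       # leave room for BOS/EOS
        window = ids[start:start + chunk_len - 2]
        window = [bos] + window + [eos] * (chunk_len - 1 - len(window))
        with torch.no_grad():
            out = text_encoder(torch.tensor([window])).last_hidden_state
        chunks.append(out)
    return torch.cat(chunks, dim=1)                       # joined along the token axis
```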
It's not ideal, but it actually works. - Like the positioning of the words really, really matters then, right? Like this is why order matters in prompts. - Yeah. Yeah, like it, it works, but it's, it's not ideal, but it's what people expect. Like if someone gives a huge prompt, they expect at least some of the concepts at the end to be like present in the image.
But usually when they give long prompts, they don't, they like, they don't expect like detail, I think. So that's why it works very well. - And while we're on this topic, prompt weighting, negative prompting, all sort of similar part of this layer of the stack. - Yeah, the hack for that, which works on clip, like basically it's just for SD 1.5, well, for SD 1.5, the prompt weighting works well because clip L is a, it's not a very deep model.
So you have a very high correlation between you have the input token, the index of the input token vector and the output token. They're very, the concepts are very close, closely linked. So that means if you interpolate the vector from what, well, the way Comfy UI does it, is it has, okay, you have the vector, you have an empty prompt.
So you have a channel, like a clip output for the empty prompt, and then you have the one for your prompt. And then it interpolates from that, depending on your prompt weight, the weight of your tokens. So if you, yeah. So that's how it does prompt weighting, but this stops working the deeper your text encoder is.
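A minimal sketch of the interpolation just described, not ComfyUI's exact implementation: each token's encoded vector is moved between the empty-prompt encoding and the prompt encoding according to its weight.

```python
import torch

def weight_prompt(prompt_embeds, empty_embeds, token_weights):
    """Interpolate each token's encoding between the empty-prompt encoding
    (weight 0) and the full prompt encoding (weight 1); weights above 1
    extrapolate past the prompt. Tensors are [batch, tokens, dim],
    token_weights is [tokens]."""
    w = token_weights.view(1, -1, 1)
    return empty_embeds + w * (prompt_embeds - empty_embeds)
```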
So on T5 XXL, it doesn't work at all, so. - Wow, is that a problem for people? I mean, 'cause I'm used to just moving up numbers. - Probably not, is, well. - So you just use words to describe, right? 'Cause it's a bigger language model. - Yeah, yeah. So it might be good, but I haven't seen many complaints on Flux, that it's not working, so.
'Cause I guess people can sort of get around it with language, so. - Yeah. - Yeah. - And then coming back to LoRAs, now the popular way to customize models is LoRAs. And I saw you also support LoCon and LoHa, which I've never heard of before. - There's a bunch of, 'cause what a LoRA is essentially, is instead of like, okay, you have your model, and then you wanna fine tune it.
So instead of, like, what you could do is you could fine tune the entire thing. - Yeah, full fine tune, yeah. - But that's a bit heavy. So to speed things up and make things less heavy, what you can do is just fine tune some smaller weights. Like basically two low rank matrices: when you multiply them together, the result represents the difference between the trained weights and your base weights.
So by training those two smaller matrices, that's a lot less heavy. - And they're portable. So you're gonna share them. - Yeah. - It's like easier. - And also smaller, yeah. That's how LoRAs work. So basically, so when inferencing, you can inference with them pretty efficiently, like how ComfyUI does it.
It just, when you use a LoRA, it just applies it straight on the weights so that there's only a small delay at the start, like before the sampling, when it applies the weights, and then it's just the same speed as before. So for inference, it's not that bad, but, and then you have, so basically all the LoRA types, like LoHa, LoCon, and everything, that's just different ways of representing that.
Like, basically you can call it kind of like compression, even though it's not really compression. It's just different ways of representing, like, just okay. I want to train a difference on the weights. What's the best way to represent that difference? There's the basic LoRA, which is just, oh, let's multiply these two matrices together.
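A small sketch of that basic LoRA case, merging the low-rank delta into the base weight once before sampling, roughly in the spirit of what's described above; the shapes and scaling convention here are illustrative.

```python
import torch

def apply_lora(base_weight, lora_down, lora_up, alpha=8.0, strength=1.0):
    """Multiply the two low-rank matrices to get a delta over the trained
    weights, then add it onto the base weight once before sampling, so
    inference afterwards runs at normal speed."""
    rank = lora_down.shape[0]
    delta = lora_up @ lora_down                  # [out, rank] @ [rank, in] -> [out, in]
    return base_weight + strength * (alpha / rank) * delta

# Tiny usage example with made-up shapes:
base = torch.randn(320, 768)     # e.g. a cross-attention projection weight
down = torch.randn(8, 768)       # rank-8 "down" matrix
up = torch.zeros(320, 8)         # "up" matrix; zeros means no change initially
patched = apply_lora(base, down, up)
```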
And then there's all the other ones, which are all different algorithms. So, yeah. - So let's talk about what ComfyUI actually is. I think most people have heard of it. Some people might've seen screenshots. I think fewer people have built very complex workflows. So when you started, automatic was like the super simple way.
What were some of the choices that you made? So the node workflow, is there anything else that stands out as like, this was like a unique take on how to do image generation workflows? - Well, I feel like, yeah, back then, everyone was trying to make like easy to use interface.
So I'm like, well, everyone's trying to make an easy to use interface. - Let's make a hard to use interface. (all laughing) - Like, so like, I don't need to do that. (all laughing) Everyone else doing it, so let me try something. Like, let me try to make a powerful interface that's not easy to use, so.
- So like, yeah, there's a sort of node execution engine. Your readme actually has this really good list of features of things you prioritize, right? Like, let me see, like only re-executing the parts of the workflow that changed, asynchronous queue system, smart memory management, like all this seems like a lot of engineering that.
- Yeah, there's a lot of engineering in the backend to make things. 'Cause I was always focused on making things work locally very well, 'cause that's, 'cause I was using it locally. So everything, so there's a lot of thought and work and like getting everything to run as well as possible.
So yeah, ComfyUI is actually more of a backend, at least, well, now the front end's getting a lot more development, but before it was, I was pretty much only focused on the backend. - Yeah, so v0.1 was only August this year. - Yeah, before there was no versioning, so yeah.
- And so what was the big rewrite for the 0.1 and then the 1.0? - Well, that's more on the front end side. 'Cause before that, it was just like the UI, what, 'cause when I first wrote it, I just, I said, okay, how can I make, like, I can do web development, but I don't like doing it.
Like, what's the easiest way I can slap a node interface on this? And then I found this library, LiteGraph, like a JavaScript library. - LiteGraph? - LiteGraph. - Usually people will go for like React Flow, for like a flow builder. - Yeah, but that seems like too complicated. - 'Cause of React.
(all laughing) - So I didn't really want to spend time like developing the front end, so I'm like, well, oh, LiteGraph, this has the whole node interface. Well, okay, let me just plug that into my back end then. - I feel like if Streamlit or Gradio offered something, you would have used Streamlit or Gradio 'cause it's Python.
- Yeah, Streamlit and Gradio. Gradio, I don't like Gradio. - Why? - It's bad. Like, that's one of the reasons why, like, automatic was very bad. 'Cause the problem with Gradio, it forces you to, well, not forces you, but it kind of takes your interface logic and your back end logic and just sticks them together.
- Well, it's supposed to be easy for you guys, if you're a Python main, you know, I'm a JS main, right? - Okay. - If you're a Python main, it's supposed to be easy. - Yeah, it's easy, but it makes your whole software a huge mess. - I see, I see.
So you're mixing concerns instead of separating concerns? - Well, it's 'cause- - Like front end and back end. - Front end and back end should be well separated with a well-defined API. Like, that's how you're supposed to do it. - People, smart people disagree, but yeah. - It just sticks everything together.
It makes it easy to, like, make a huge mess. And also it's, there's a lot of issues with Gradio. Like, it's very good if all you want to do is just get, like, slap a quick interface on your, like, to show off your, like, your ML project. Like, that's what it's made for.
- Yeah, yeah. - Like, there's no problem using it, like, oh, I have my, I have my code. I just want a quick interface on it. That's perfect. Like, use Gradio. But if you want to make something that's like a real, like real software that will last a long time and will be easy to maintain, then I would avoid it.
- Yeah, yeah. So your criticism is Streamlit and Gradio are the same. I mean, those are the same criticisms. - Yeah, Streamlit, I haven't. - Haven't used as much. - Yeah, it's just looked a bit. - Similar philosophy. - Yeah, it's similar. It's just, it just seems to me like, okay, for quick, like, AI demos, it's perfect.
- Yeah. Going back to like the core tech, like asynchronous queues, partial re-execution, smart memory management, you know, anything that you were very proud of or was very hard to figure out? - Yeah, the thing that's the biggest pain in the ass is probably the memory management. - Yeah, were you just paging models in and out or?
- Yeah, before it was just, okay, load the model, completely unload it, load the new model, completely unload it. Then, okay, that works well when your models are small, but if your models are big and it takes like, let's say someone has a 4090 and the model size is 10 gigabytes, that can take a few seconds to load and unload, so you want to try to keep things like in memory, in the GPU memory, as much as possible.
What Comfy UI does right now is that, it tries to like estimate, okay, like, okay, you're going to sample this model. It's going to take probably this amount of memory. Let's remove the models, like this amount of memory that's been loaded on the GPU and then just execute it.
But there's a fine line, 'cause you try to remove the least amount of models that are already loaded. And another problem is the NVIDIA driver on Windows: by default, if you start loading and you overflow your GPU memory, the driver's going to automatically start paging to RAM. There's an option to disable that feature, but it's on by default.
But the problem with that is it, it makes everything extremely slow. So when you see people complaining, oh, this model, it works, but oh shit, it starts slowing down a lot. That's probably what's happening. So it's basically, you have to just try to get, use as much memory as possible, but not too much, or else things start slowing down, or people get out of memory.
And then just find, try to find that line where, like the driver on Windows starts paging and stuff. - Yeah. - And yeah, the problem with PyTorch is it's high level, so you don't have that much fine grained control over like specific memory stuff. So you kind of have to leave like the memory freeing to Python and PyTorch, which can be annoying sometimes.
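A toy sketch of that kind of policy, not ComfyUI's actual memory manager: estimate what the next run needs and evict the least recently used models until it fits.

```python
import torch

def free_until_it_fits(loaded_models, bytes_needed, device="cuda"):
    """Keep models resident between runs; before sampling, unload the least
    recently used ones (assumed to be at the front of the list) only until
    the estimated memory for the next run fits on the GPU."""
    free, _total = torch.cuda.mem_get_info(device)
    while free < bytes_needed and loaded_models:
        model = loaded_models.pop(0)       # evict the least recently used model
        model.to("cpu")                    # move its weights back to system RAM
        torch.cuda.empty_cache()
        free, _total = torch.cuda.mem_get_info(device)
```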
- So, you know, I think one thing as a maintainer of this project, like you're designing for a very wide surface area of compute. Like you even support CPUs. - Yeah, well that's, that's just for fun. PyTorch supports CPU, so yeah, that's not, that's not hard to support.
- First of all, is there a market share estimate? Like, is it like 70% NVIDIA, like 30% AMD, and then like miscellaneous on Apple Silicon or whatever? - For Comfy? - Yeah. - Yeah, I don't know the market share. - Can you guess? - I think it's mostly NVIDIA.
- Yeah. - 'Cause AMD, the problem, like AMD works horribly on Windows. Like on Linux, it works fine. It's slower than the price equivalent NVIDIA GPU, but it works, like you can use it, generate images, everything works. But on Windows, you might have a hard time. So that's the problem.
And most people, I think most people who bought AMD probably use Windows. (laughing) They probably aren't gonna switch to Linux. (laughing) So until AMD actually like ports their ROCm to Windows properly, and then there's actually PyTorch. I think they're doing that. They're in the process of doing that, but until they get a good like PyTorch ROCm build that works on Windows, it's like, they're gonna have a hard time.
- Yeah. - We gotta get George on it. - Yeah, well, he's trying to get Lisa Su to do it. But let's talk a bit about like the node design. So unlike all the other text to image, you have a very like deep interface. So you have like a separate node for like CLIP encode.
You have a separate node for like the KSampler. You have like all these nodes. Going back to like making it easy versus making it hard. But like, how much do people actually play with all the settings? You know, kind of like, how do you guide people to like, hey, this is actually gonna be very impactful versus this is maybe like less impactful, but we still wanna expose it to you.
- Well, I try to expose, like, I try to expose everything, or that's, yeah, at least for the, but for things like, for example, for the samplers, there's like, yeah, four different sampler nodes, which go from easiest to most advanced. So yeah, if you go like the easy node, the regular sampler node, that's, you have just the basic settings.
But if you use like the sampler advanced, custom advanced node, that one you can actually, you'll see you have like different nodes. - I'm looking it up now. - Yeah. - What are like the most impactful parameters that you use? So it's like, you know, you can have more, but like, which ones like really make a difference?
- CFG. - Yeah, they all do. They all have their own like, they all like, for example, yeah, steps, usually you want steps, you want them to be as low as possible, but you want, if you're optimizing your workflow, you want to, you lower the steps until like the image has started deteriorating too much.
'Cause that, yeah, that's the number of steps you're running the diffusion process. So if you want things to be faster, lower is better. But yeah, CFG, that's more, you can kind of see that as the contrast of the image. Like if your image looks too burnt out, then you lower the CFG.
So yeah, CFG, that's how, yeah, that's how strongly the, like the negative versus positive prompt. 'Cause when you sample a diffusion model with a negative prompt, it's just, yeah, positive prediction minus negative prediction. - Contrastive loss. - Yeah, so it's positive minus negative, and the CFG, that's the multiplier.
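In code form, that's the standard classifier-free guidance combination:

```python
def cfg_combine(pred_positive, pred_negative, cfg_scale):
    # Negative (unconditional) prediction plus CFG times the difference;
    # at cfg_scale == 1.0 this collapses to pred_positive, so the negative
    # prompt has no effect.
    return pred_negative + cfg_scale * (pred_positive - pred_negative)
```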
- Yeah. - Yeah, so. - What are like good resources to understand what the parameters do? I think most people start with automatic, and then they move over and it's like, step, CFG, sampler name, scheduler, denoise. - Reddit. - Honestly, well, it's more, it's something you should like try out yourself.
I don't know, you don't necessarily need to know how it works to know what it does. 'Cause even if you know, like, CFG, it's like positive minus negative prompt. - Yeah. - So the only thing you know with CFG is if it's 1.0, then that means the negative prompt isn't applying.
And also maybe sampling is two times faster, but. - Yeah, yeah. - Yeah, but other than that, it's more like you should really just see what it does to the images yourself, and you'll probably get a more intuitive understanding of what these things do. - Any other notes or things you want to shout out?
Like I know AnimateDiff, IP-Adapter, those are like some of the most popular ones. - Yeah, what else comes to mind? - I don't have notes, but there's, like what I like is when some people, sometimes they make things that use ComfyUI as their backend, like there's a plugin for Krita that uses ComfyUI as its backend.
So you can use like all the models that work in Comfy in Krita. And I think I've tried it once, but I know a lot of people use it, and find it really nice, so. - What's the craziest node that people have built, like the most complicated? - Craziest node, like yeah, I know some people have made like video games in Comfy, with like stuff like that.
So like someone, like I remember, like yeah, I think it was last year, someone made like Wolfenstein 2 in Comfy, and then one of the inputs was, oh, you can generate a texture, and then it changes the texture in the game. So you could plug it into like the workflow.
And there's a lot of, if you look there, there's a lot of crazy things people do, so yeah. - And now there's like a node registry that people can use to like download nodes. - Yeah, well, there's always been the ComfyUI Manager, but we're trying to make this more like official, like with the node registry, 'cause before the node registry, like okay, how did your custom node get into ComfyUI Manager?
It's the guy running it who, like every day, searched GitHub for new custom nodes, and added them manually to his custom node manager. So we're trying to make it less effort for him, basically. - Yeah, but I was looking, I mean, there's like a YouTube download node. There's like, this is almost like a data pipeline, more than like an image generation thing at this point.
It's like you can get data in, you can like apply filters to it, you can generate data out. - Yeah, you can do a lot of different things. - Yeah, something I think, what I did is I made it easy to make custom nodes. So I think that that helped a lot for the ecosystem, 'cause it is very easy to just make a node.
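For reference, a minimal custom node is roughly this shape, as far as I understand the convention; the node itself is a made-up example.

```python
# A made-up node that upper-cases a string, just to show the shape of things.
class UppercaseText:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"text": ("STRING", {"default": "", "multiline": True})}}

    RETURN_TYPES = ("STRING",)
    FUNCTION = "run"
    CATEGORY = "examples"

    def run(self, text):
        return (text.upper(),)

# ComfyUI picks custom nodes up from this mapping in the package's __init__.py.
NODE_CLASS_MAPPINGS = {"UppercaseText": UppercaseText}
NODE_DISPLAY_NAME_MAPPINGS = {"UppercaseText": "Uppercase Text"}
```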
So yeah, a bit too easy sometimes. (laughing) Then we have the issue where there's a lot of custom node packs, which share similar nodes. But well, that's, yeah, something we're trying to solve by maybe bringing some of the functionality into core. - Yeah, yeah, yeah. - Yeah. - And then there's like video, people can do video generation.
- Yeah, video, that's, well, the first video model was like stable video diffusion, which was last, yeah, exactly last year, I think. Like one year ago, but that wasn't a true video model. So it was-- - It was like moving images. - Yeah, it generated video. What I mean by that is it's like, it's still 2D latents.
It's basically what they did is they took SD2 and then they added some temporal attention to it and then trained it on videos. So it's kind of like AnimateDiff, like same idea, basically. Why I say it's not a true video model is that you still have like the 2D latents.
Like a true video model, like Mochi, for example, would have 3D latents. So-- - Which means you can like move through the space, basically, it's the difference. You're not just kind of like reorienting. - Yeah, and it's also, well, it's also because you have a temporal VAE. Also, like Mochi has a temporal VAE that compresses on like the temporal direction also.
So that's something you don't have with like, yeah, AnimateDiff and stable video diffusion. They only like compress spatially, not temporally. So yeah, so these models, that's why I call them like true video models. There's actually a few of them, but the one I've implemented in Comfy is Mochi 'cause that seems to be the best one so far.
- We had AJ come and speak at the stable diffusion meetup. Other open one I think I've seen is CogVideo. - Yeah, CogVideo. Yeah, that one's, yeah, it also seems decent, but yeah. - We're Chinese, so we don't use it. (all laughing) - No, it's fine, it's just, yeah, I could, yeah, it's just that there's a, it's not the only one.
There's also a few others, which I-- - The rest are like closed source, right? Like Kling-- - Yeah, the closed source, there's a bunch of them, but I mean, open, I've seen a few of them, like, yeah, I can't remember their names, but there's CogVideo, the big one, then there's also a few of them that released at the same time.
There's one that released at the same time as SD3.5, same day, which is why I don't remember the name. (all laughing) - We should have a release schedule so we don't conflict on each of these things. - Yeah, I think SD3.5 and Mochi released on the same day. So everything else was kind of drowned, completely drowned out.
So for some reason, lots of people picked that day to release their stuff. (all laughing) Yeah, which is, well, a shame for those, I think, guys. I think OmniGen also released the same day, which also seems interesting, but yeah. - Yeah, what's Comfy? So you are Comfy, and then there's like Comfy.org.
I know we do a lot of things for like Nous Research, and those guys also have kind of like a more open source and anon thing going on. How do you work? Like you mentioned, you must have worked on like the core piece of it, and then what? - Maybe I should fill in because, yeah, I feel like maybe, yeah, I only explained part of the story.
- Right. - Yeah, maybe I should explain the rest, so yeah. So yeah, basically January, that's when the first, January 2023, January 16th, 2023, that's when Comfy was first released to the public. Then, yeah, did a Reddit post about the area composition thing somewhere in, I don't remember exactly, maybe end of January, beginning of February.
And then somewhere, a YouTuber made a video about it, like Olivio, he made a video about Comfy in March 2023. I think that's when it got a real burst of attention. And by that time I had continued developing it and it was getting, people were starting to use it more, which unfortunately meant that I had first written it to do like experiments, but then my time to do experiments started going down, yeah, 'cause yeah, people were actually starting to use it then, like I had to.
And I said, well, yeah, time to add all these features and stuff. Yeah, and then I got hired by Stability, June 2023. Then I made the, basically, yeah, they hired me 'cause they wanted SDXL. So I got SDXL working very well in ComfyUI because they were experimenting with it.
Actually, how the SDXL release worked is, for some reason, they released the code first, but they didn't release the model checkpoint. Oh yeah, so they released the code. And then, well, since the code was out, I added support for it in Comfy too. And then the checkpoints were basically early access.
People had to sign up and they mostly allowed people with edu emails. Like if you had an edu email, they gave you access basically to SDXL 0.9. And well, that leaked, of course, because of course it's gonna leak if you do that. Well, the only way people could easily use it was with Comfy.
So yeah, people started using it. And then I fixed a few of the issues that people had. So then the big 1.0 release happened. And well, ComfyUI was the only way a lot of people could actually run it on their computers. 'Cause, like, automatic was so inefficient and bad that for most people it just wouldn't work.
Like, 'cause he did a quick implementation. So people were forced to use ComfyUI. And that's how it became popular, because people had no choice. (all laughing) - The growth hack. - Yeah. - Yeah. Like everywhere, like people who didn't have the 4090, who had just regular GPUs.
- Yeah, yeah. - They didn't have a choice, so. - Yeah, I got a 4070, so think of me. And so today, what's, is there like a core Comfy team or? - Yeah, well, right now, yeah, we are hiring actually. So right now core, like the core core itself, it's me.
Yeah, but because the reason we're focused, like all the focus has been mostly on the front end right now, 'cause that's the thing that's been neglected for a long time. So most of the focus right now is all on the front end, but we are, yeah, we will soon get more people to like help me with the actual backend stuff.
Because that's, once we have our V1 release, which is, 'cause it'll be the packaged ComfyUI with the nice interface and easy to install on Windows, and hopefully Mac. Yeah, once we have that, we're going to have lots of stuff to do on the backend side and also the front end side, but yeah.
- What's the release date? I'm on the wait list. What's the timing? - Soon, soon. Yeah, like I don't want to promise a release date. Yeah, we do have a release date we're targeting, but I'm not sure if it's public. - Yeah. - Yeah, and how we're going to, like we're still going to continue like doing the open source, like making ComfyUI the best way to run like stable diffusion models, like at least on the open source side, and like it's going to be the best way to run models locally, but we will have a few, like a few things to make money from it, like cloud inference or like that type of thing.
So, and maybe some, like some things for some enterprises. - I mean, a few questions on that. How do you feel about the other Comfy startups? - I mean, I think it's great. - They're using your name, you know. - Yeah, well, it's better to use Comfy than to use something else.
- Yeah, that's true. - Yeah, like it's fine. I don't like, like, yeah, we're going to try not to, we don't want to, like we want them to, people to use Comfy, because like I said, it's better that people use Comfy than something else. So as long as they use Comfy, it's, I think it helps, it helps the ecosystem.
And so, because more people, even if they don't contribute directly, the fact that they are using Comfy means that like people are more likely to like join the ecosystem. So, yeah. - And then would you ever do text? - Yeah, well, you can already do text with some custom nodes.
So yeah, it's something where we like, yeah. It's something I've wanted to eventually add to core, it's more like, not a very high priority, but because a lot of people use text for like prompt enhancement and like other things like that. So it's, yeah, it's just that my focus has always been like diffusion models.
Yeah, unless some text diffusion model comes out. - Yeah, David Holz is investing a lot in text diffusion. - Well, if a good one comes out, then well, I'll probably implement it since it fits with the whole. - Yeah, I mean, I imagine it's gonna be closed source, like Midjourney, so.
- Yeah, well, yeah, if an open one comes out. (laughing) Yeah, then yeah, I'll probably, yeah. Yeah, I'll probably implement it. - Cool, Comfy. Thanks so much for coming on. This was fun. (upbeat music) (upbeat music) (gentle music) you