
AI Engineering for Art - with comfyanonymous


Chapters

0:00 Introduction of hosts and anonymous guest
0:35 Origins of Comfy UI and early Stable Diffusion landscape
2:58 Comfy's background and development of high-res fix
5:37 Area conditioning and compositing in image generation
7:20 Discussion on different AI image models (SD, Flux, etc.)
11:10 Closed source model APIs and community discussions on SD versions
14:41 LoRAs and textual inversion in image generation
18:43 Evaluation methods in the Comfy community
20:05 CLIP models and text encoders in image generation
23:05 Prompt weighting and negative prompting
26:22 Comfy UI's unique features and design choices
31:00 Memory management in Comfy UI
33:50 GPU market share and compatibility issues
35:40 Node design and parameter settings in Comfy UI
38:44 Custom nodes and community contributions
41:40 Video generation models and capabilities
44:47 Comfy UI's development timeline and rise to popularity
48:13 Current state of Comfy UI team and future plans
50:11 Discussion on other Comfy startups and potential text generation support

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone, welcome to the Latent Space Podcast.
00:00:06.880 | This is Alessio, partner and CTO at Decibel Partners,
00:00:09.500 | and I'm joined by my co-host swyx, founder of Smol.ai.
00:00:12.440 | Hey everyone, we are in the Chroma Studio again,
00:00:15.680 | but with our first ever anonymous guest,
00:00:18.400 | Comfy Anonymous, welcome.
00:00:19.680 | I feel like that's your full name.
00:00:22.620 | You just go by Comfy, right?
00:00:24.320 | - Yeah, well a lot of people just call me Comfy,
00:00:26.880 | even though, even when they know my real name.
00:00:30.640 | Hey, Comfy.
00:00:32.600 | - swyx is the same.
00:00:34.520 | Not a lot of people call you Sean.
00:00:36.160 | - Yeah, you have a professional name, right?
00:00:37.760 | That people know you by, and then you have a legal name.
00:00:40.520 | Yeah, that's fine.
00:00:41.440 | How do I phrase this?
00:00:42.520 | I think people who are in the know,
00:00:44.200 | know that Comfy is like the tool for image generation
00:00:47.280 | and now other multimodality stuff.
00:00:49.480 | I would say that when I first got started
00:00:51.080 | with stable diffusion,
00:00:52.080 | the star of the show was Automatic1111, right?
00:00:55.520 | And I actually looked back at my notes from 2022-ish,
00:00:59.520 | like Comfy was already getting started back then,
00:01:01.400 | but it was kind of like the up and comer
00:01:02.760 | and like your main feature was the flowchart.
00:01:04.520 | Can you just kind of rewind to that moment that year
00:01:07.720 | and like, you know, how you looked at the landscape there
00:01:09.640 | and decided to start Comfy?
00:01:11.160 | - Yeah, I discovered stable diffusion in 2022,
00:01:14.800 | in October, 2022.
00:01:17.200 | And well, I kind of started playing around with it.
00:01:20.720 | Yes, I, and back then I was using Automatic,
00:01:24.120 | which was what everyone was using back then.
00:01:26.880 | And I, so I started with that.
00:01:30.120 | 'Cause I had the, 'cause when I started,
00:01:32.240 | I had no idea like how diffusion models work,
00:01:34.880 | how any of this works.
00:01:36.960 | - Oh yeah, what was your prior background as an engineer?
00:01:39.880 | - Just a software engineer.
00:01:42.120 | Yeah, boring software engineer.
00:01:44.320 | - But like any image stuff, any orchestration,
00:01:47.920 | distributed systems, GPUs?
00:01:50.040 | - No, I was doing basically nothing interesting.
00:01:54.320 | (laughs)
00:01:55.160 | - Crud, web development?
00:01:56.840 | - Yeah, a lot of web development.
00:01:58.720 | Just, yeah, some basic, maybe some basic
00:02:00.880 | like automation stuff and whatever.
00:02:03.640 | Just, yeah, no big companies or anything.
00:02:08.640 | - Yeah, but like already some interest in automations,
00:02:11.320 | probably a lot of Python.
00:02:12.720 | - Yeah, yeah, of course, Python.
00:02:14.760 | But like, I wasn't actually used to like
00:02:18.440 | the node graph interface.
00:02:20.840 | Before I started ComfyUI, it was just,
00:02:24.120 | I just thought it was like, oh,
00:02:26.280 | like what's the best way to represent the diffusion process
00:02:30.840 | in a user interface?
00:02:32.080 | And then like, oh, well, like naturally,
00:02:34.480 | oh, this is the best way I found.
00:02:37.080 | And this was like with the node interface.
00:02:40.840 | So how I got started was, yeah.
00:02:44.360 | So basically October, 2022, just like,
00:02:48.440 | I hadn't written a line of PyTorch before that.
00:02:51.520 | So it's completely new.
00:02:54.320 | What happened was I kind of got addicted
00:02:56.560 | to generating images.
00:02:58.400 | - As we all did.
00:03:00.200 | - Yeah, and then I started experimenting
00:03:03.880 | with like the high-res fix in Auto,
00:03:07.600 | which was, for those that don't know,
00:03:09.880 | the high-res fix is just, since the diffusion models
00:03:14.120 | back then could only generate at low resolution.
00:03:17.280 | So what you would do, you would generate
00:03:19.120 | low-resolution image, then upscale,
00:03:21.520 | then refine it again.
00:03:25.560 | And that was kind of the hack
00:03:27.280 | to generate high-resolution images.
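The trick he describes — a low-resolution pass, an upscale, then a second refining pass — can be sketched like this. The `sample` and `upscale` helpers below are illustrative stand-ins, not real diffusion or ComfyUI code:

```python
import numpy as np

def sample(latent, steps, denoise=1.0, seed=0):
    """Stand-in for a diffusion sampling pass (illustrative only)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(latent.shape)
    # A real sampler would iterate `steps` times; here we just perturb.
    return latent + denoise * 0.1 * noise

def upscale(latent, factor=2):
    """Nearest-neighbor upscale of a (C, H, W) latent."""
    return latent.repeat(factor, axis=-2).repeat(factor, axis=-1)

# Pass 1: generate at low resolution (a 64x64 latent ~ a 512px image)
low = sample(np.zeros((4, 64, 64)), steps=20)
# Pass 2 (the "high-res fix"): upscale, then refine with partial denoise,
# possibly with a different sampler, step count, or even a different model
high = sample(upscale(low), steps=20, denoise=0.5, seed=1)
print(high.shape)  # (4, 128, 128)
```

Swapping the sampler, step count, or model on the second pass is exactly the experimentation described here.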
00:03:30.640 | I really liked generating like higher-resolution images.
00:03:34.440 | So I was experimenting with that.
00:03:38.040 | And so I modified the code a bit.
00:03:40.920 | Okay, what happens if I use different samplers
00:03:45.120 | on the second pass?
00:03:46.520 | So I had to edit the code of Auto.
00:03:49.920 | So what happens if I use a different sampler?
00:03:52.240 | What happens if I use a different,
00:03:55.040 | like a different settings,
00:03:56.880 | different number of steps?
00:03:58.920 | 'Cause back then the high-res fix was very basic.
00:04:03.840 | - Yeah, now there's a whole library
00:04:06.920 | of just the upsamplers.
00:04:08.880 | - Yeah, I think they added a bunch of options
00:04:13.200 | to the high-res fix since then.
00:04:16.840 | But before that, it was just so basic.
00:04:18.920 | So I wanted to go further.
00:04:21.160 | I wanted to try, okay, what happens
00:04:22.800 | if I use a different model for the second pass?
00:04:26.960 | And then, well, then the auto code base
00:04:30.040 | wasn't good enough for,
00:04:32.320 | like it would have been harder to implement that
00:04:36.160 | in the auto interface than to create my own interface.
00:04:40.720 | So that's when I decided to create my own.
00:04:44.640 | - And you were doing that mostly on your own
00:04:46.480 | when you started, or did you already have
00:04:47.920 | kind of like a subgroup of people?
00:04:49.160 | - No, it was on my own.
00:04:51.200 | 'Cause it was just me experimenting with stuff.
00:04:55.240 | So yeah, that was it.
00:04:58.120 | So I started writing the code January 1, 2023.
00:05:03.280 | And then I released the first version
00:05:06.160 | on GitHub January 16, 2023.
00:05:09.320 | That's how things got started.
00:05:11.840 | - And was the name Comfy UI right away?
00:05:14.040 | - Yeah, yeah, Comfy UI.
00:05:15.680 | The reason my name is Comfy
00:05:17.800 | is people thought my pictures were comfy.
00:05:20.680 | So I just named it, it's my Comfy UI.
00:05:25.680 | So yeah, that's.
00:05:28.040 | - Is there a particular segment of the community
00:05:29.840 | that you targeted as users?
00:05:31.640 | Like more intensive workflow artists,
00:05:34.760 | compared to the automatic crowd or, you know.
00:05:37.680 | - This was my way of like experimenting with new things.
00:05:42.680 | Like the high-res fix thing I mentioned,
00:05:45.440 | which was like in Comfy,
00:05:46.840 | the first thing you could easily do
00:05:48.680 | was just chain different models together.
00:05:51.840 | And then one of the first things,
00:05:53.240 | I think the first times it got a bit of popularity
00:05:56.720 | was when I started experimenting with different,
00:06:00.360 | like applying prompts to different areas of the image.
00:06:04.840 | Yeah, I called it area conditioning,
00:06:07.880 | posted it on Reddit and it got a bunch of upvotes.
00:06:11.640 | So I think that's when people first learned of Comfy UI.
00:06:16.640 | - Is that mostly like fixing hands?
00:06:19.800 | - No, no, no, that was just like, let's say,
00:06:22.880 | well, it was very,
00:06:24.400 | well, it still is kind of difficult to like,
00:06:26.800 | let's say you want a mountain,
00:06:29.640 | you have an image and then you're like,
00:06:31.080 | I want a mountain here and I want like a fox here.
00:06:36.080 | - Yeah, so compositing the image.
00:06:39.920 | - Yeah, my way was very easy.
00:06:41.840 | It was just like, oh, when you run the diffusion process,
00:06:46.160 | you kind of generate, okay,
00:06:48.720 | you do pass, one pass through the diffusion model,
00:06:52.040 | every step you do one pass, okay.
00:06:54.480 | This place of the image with this prompt,
00:06:57.080 | this place of the image with the other prompt
00:07:00.040 | and then the entire image with another prompt
00:07:03.400 | and then just average everything together, every step.
00:07:07.320 | And that was area composition, which I call it.
00:07:11.440 | And then a month later,
00:07:13.760 | there was a paper that came out called multi-diffusion,
00:07:16.600 | which was the same thing, but yeah.
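The per-region averaging described above can be sketched with a stub in place of the real noise-prediction model (all names and shapes here are illustrative, not ComfyUI internals):

```python
import numpy as np

def predict_noise(latent, prompt_id):
    """Stub for one conditioned model pass (illustrative only)."""
    rng = np.random.default_rng(prompt_id)
    return rng.standard_normal(latent.shape)

latent = np.zeros((4, 64, 64))

# Region masks: a "mountain" prompt, a "fox" prompt, one for the whole image
masks = [np.zeros((64, 64)), np.zeros((64, 64)), np.ones((64, 64))]
masks[0][:32, :] = 1.0      # mountain: top half
masks[1][32:, 16:48] = 1.0  # fox: a patch in the bottom half

# One denoising step: one pass per prompt, then average per pixel
total = np.zeros_like(latent)
count = np.zeros((64, 64))
for pid, mask in enumerate(masks):
    total += predict_noise(latent, pid) * mask
    count += mask
avg_pred = total / count  # every pixel is covered by the whole-image mask
print(avg_pred.shape)  # (4, 64, 64)
```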
00:07:19.120 | - Could you do area composition with different models
00:07:24.320 | or because you're averaging out,
00:07:25.840 | you kind of need the same model?
00:07:27.120 | - Could do it with, but yeah,
00:07:29.200 | I hadn't implemented it for different models,
00:07:32.200 | but you can do it with different models if you want,
00:07:36.720 | as long as the models share the same latent space.
00:07:40.040 | - We're supposed to ring a bell
00:07:42.720 | every time someone says latent space.
00:07:45.240 | - Yeah, like for example,
00:07:46.640 | you couldn't use like SDXL and SD 1.5
00:07:50.240 | 'cause those have a different latent space,
00:07:52.480 | but like, yeah, like SD 1.5 models, different ones,
00:07:57.440 | you could do that.
00:08:00.080 | - There's some models that try to work
00:08:01.920 | in pixel space, right?
00:08:03.240 | - Yeah, they're very slow.
00:08:05.680 | - Of course.
00:08:06.520 | - That's the problem.
00:08:07.360 | That's the reason why stable diffusion
00:08:09.360 | actually became like popular,
00:08:11.920 | like was because of the latent space.
00:08:15.200 | - Smaller, yeah.
00:08:16.480 | Because there used to be latent diffusion models
00:08:18.200 | and then they trained it up.
00:08:19.600 | - Yeah, 'cause the pixel diffusion models
00:08:22.640 | are just too slow, so.
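The speed gap is easy to see with rough numbers: Stable Diffusion's VAE downsamples by 8x per spatial dimension, so the denoiser sees far fewer values per step than a pixel-space model would:

```python
# Rough cost comparison per denoising step:
pixel_elems = 512 * 512 * 3   # pixel-space diffusion on a 512x512 RGB image
latent_elems = 64 * 64 * 4    # SD's 8x-downsampled, 4-channel latent
print(pixel_elems // latent_elems)  # 48 -> ~48x fewer values to denoise
```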
00:08:25.000 | - Yeah.
00:08:25.840 | Have you ever tried to talk to like Stability,
00:08:28.720 | the latent diffusion guys,
00:08:30.040 | like, you know, Robin Rhombach, that crew?
00:08:32.320 | - Yeah, well, I used to work at Stability.
00:08:34.960 | - Oh, I actually didn't know.
00:08:35.840 | - Yeah, I used to work at Stability.
00:08:37.520 | I got hired in June, 2023.
00:08:42.520 | - Ah, that's the part of the story I didn't know about.
00:08:45.240 | Okay.
00:08:46.680 | - So the reason I was hired
00:08:48.200 | is because they were doing SDXL at the time.
00:08:51.680 | And they were basically, SDXL,
00:08:54.720 | I don't know if you remember,
00:08:55.720 | it was a base model and then a refiner model.
00:08:58.800 | Basically, they wanted to experiment
00:09:00.680 | like chaining them together.
00:09:02.640 | And then they saw, oh.
00:09:04.840 | - Right.
00:09:05.680 | - Comfy, oh, this, we can use this to do that.
00:09:08.040 | Well, let's hire that guy.
00:09:10.680 | - But they didn't pursue it for like SD3.
00:09:13.280 | - What do you mean?
00:09:14.120 | - Like the SDXL approach.
00:09:15.880 | - Yeah, the reason for that approach
00:09:18.400 | was because basically they had two models
00:09:23.040 | and then they wanted to publish both of them.
00:09:26.160 | So they trained one on lower time steps,
00:09:29.400 | which was the refiner model.
00:09:31.840 | And then the first one was trained normally.
00:09:35.760 | And then during their test, they realized,
00:09:38.040 | oh, like if we string these models together
00:09:41.560 | are like quality increases.
00:09:43.640 | So let's publish that.
00:09:45.120 | - It worked.
00:09:47.840 | - Yeah.
00:09:48.680 | But like right now, I don't think many people
00:09:51.000 | actually use the refiner anymore,
00:09:52.840 | even though it is actually a full diffusion model.
00:09:55.920 | Like you can use it on its own
00:09:57.880 | and it's gonna generate images.
00:10:00.040 | I don't think anyone,
00:10:01.280 | people have mostly forgotten about it, but.
00:10:05.680 | - Can we talk about models a little bit?
00:10:07.680 | So stable diffusion, obviously is the most known.
00:10:09.520 | I know Flux has gotten a lot of traction.
00:10:12.200 | Are there any underrated models that people should use more
00:10:15.320 | or what's the state of the union?
00:10:17.440 | - Well, the latest state of the art at least,
00:10:21.040 | yeah, for images, there's, yeah, there's Flux.
00:10:24.920 | There's also SD 3.5.
00:10:27.520 | SD 3.5 is two models.
00:10:29.400 | There's a small one, 2.5B and there's the bigger one, 8B.
00:10:34.400 | So it's smaller than Flux.
00:10:37.000 | So, and it's more creative in a way.
00:10:42.640 | But Flux, yeah, Flux is the best.
00:10:45.640 | People should give SD 3.5 a try 'cause it's different.
00:10:50.560 | I won't say it's better.
00:10:52.400 | Well, it's better for some like specific use cases.
00:10:55.800 | If you want to make something more like creative,
00:10:59.320 | maybe SD 3.5.
00:11:00.760 | If you want to make something more consistent,
00:11:03.400 | Flux is probably better.
00:11:06.640 | - Do you ever consider supporting
00:11:08.200 | the closed source model APIs?
00:11:11.000 | - Well, we do support them with custom nodes.
00:11:14.720 | We actually have some official custom nodes
00:11:18.440 | from different--
00:11:19.960 | - Ideogram.
00:11:20.800 | - Yeah.
00:11:21.640 | - I guess DALL-E would have one.
00:11:23.600 | - Yeah, it's just not, I'm not the person that handles that.
00:11:28.440 | - Sure, sure.
00:11:29.360 | Quick question on SD.
00:11:31.000 | There's a lot of community discussion
00:11:32.560 | about the transition from SD 1.5 to SD 2
00:11:36.200 | and then SD 2 to SD 3.
00:11:37.760 | People still like, you know,
00:11:39.120 | very loyal to the previous generations of SDs?
00:11:42.600 | - Yeah, SD 1.5 still has a lot of users.
00:11:47.160 | - The last base model.
00:11:48.360 | - Yeah, then SD 2 was mostly ignored
00:11:53.120 | 'cause it wasn't a big enough improvement
00:11:57.200 | over the previous one.
00:11:58.840 | - Okay, so SD 1.5, SD 3, Flux and whatever else.
00:12:02.440 | - Yeah, SD XL.
00:12:03.280 | - SD XL.
00:12:04.120 | - SD XL, that's the main one.
00:12:04.960 | - Stable Cascade?
00:12:06.040 | - Stable Cascade, that was a good model,
00:12:08.480 | but the problem with that one is it got,
00:12:13.160 | like SD 3 was announced one week after.
00:12:16.320 | - Yeah, it was like a weird release.
00:12:18.920 | What was it like inside of Stability, actually?
00:12:21.440 | I mean, statute of limitations expired,
00:12:23.160 | you know, management has moved,
00:12:24.560 | so it's easier to talk about now.
00:12:27.200 | - Yeah, and inside Stability,
00:12:29.960 | actually that model was ready like three months before,
00:12:34.000 | but it got stuck in red teaming.
00:12:37.960 | So basically, if that model had released
00:12:40.920 | or was supposed to be released by the authors,
00:12:44.400 | then it would probably have gotten very popular
00:12:47.240 | since it's a step up from SD XL,
00:12:50.440 | but it got all of its momentum stolen
00:12:52.840 | by the SD 3 announcement,
00:12:54.680 | so people kind of didn't develop anything on top of it,
00:12:58.840 | even though it's, yeah.
00:13:01.320 | It was a good model, at least.
00:13:04.600 | Completely, mostly ignored for some reason, like.
00:13:07.880 | - It seemed, I think the naming as well matters.
00:13:11.040 | It seemed like a branch off of the main tree of development.
00:13:15.760 | - Yeah, well, it was different researchers that did it.
00:13:18.520 | - Different, yeah.
00:13:20.000 | - Very, like a good model,
00:13:22.920 | like it's the Würstchen authors,
00:13:25.080 | I don't know if I'm pronouncing it correctly.
00:13:26.680 | - Würstchen, yeah.
00:13:27.520 | - Würstchen, yeah, yeah.
00:13:29.080 | - I actually met them in Vienna.
00:13:30.720 | Yeah, they worked at Stability for a bit
00:13:32.880 | and they left right after the Cascade release.
00:13:35.920 | - This is Dustin, right?
00:13:36.960 | - No.
00:13:37.800 | - Dustin's SD 3.
00:13:38.640 | - No, Dustin is SD 3, SD XL.
00:13:42.080 | That's Pablo and Domi.
00:13:45.960 | This, I think I'm pronouncing his name correctly.
00:13:49.240 | Yeah, that's very good.
00:13:51.960 | - It seems like the community is very,
00:13:54.200 | they move very quickly.
00:13:55.320 | - Yeah.
00:13:56.160 | - Like when there's a new model out,
00:13:57.000 | they just drop whatever the current one is
00:13:59.440 | and they just all move wholesale over,
00:14:01.200 | like they don't really stay
00:14:02.040 | to explore the full capabilities.
00:14:04.120 | Like if the Stable Cascade was that good,
00:14:06.520 | they would have maybe tested a bit more.
00:14:08.320 | Instead, they're like,
00:14:09.160 | "Okay, SD 3 is out, let's go."
00:14:10.800 | You know?
00:14:11.840 | - Well, I find the opposite actually.
00:14:14.200 | The community doesn't,
00:14:15.480 | like they only jump on a new model
00:14:17.360 | when there's a significant improvement.
00:14:19.280 | - I see.
00:14:20.120 | - Like if there's only like incremental improvement,
00:14:24.280 | which is what most of these models are going to have,
00:14:28.000 | especially if you stay the same parameter count.
00:14:32.440 | - Yeah.
00:14:33.280 | - Like you're not going to get a massive improvement
00:14:36.480 | into like, unless there's something big that changes, so.
00:14:41.080 | - Yeah.
00:14:41.920 | And how are they evaluating these improvements?
00:14:43.360 | Like, because it's a whole chain of, you know,
00:14:46.840 | comfy workflows.
00:14:47.840 | - Yeah.
00:14:48.680 | - How does one part of the chain
00:14:50.120 | actually affect the whole process?
00:14:52.560 | - Are you talking on the model side specific?
00:14:55.120 | - Model specific, right?
00:14:56.280 | But like, once you have your whole workflow
00:14:58.680 | based on a model, it's very hard to move.
00:15:02.040 | - Ah, not, well, not really.
00:15:04.480 | - Yeah, maybe not.
00:15:05.320 | - It depends on your,
00:15:06.560 | depends on the specifics and the workflow.
00:15:09.000 | - Yeah.
00:15:09.840 | Like, so I do a lot of like text and image.
00:15:12.000 | - Yeah.
00:15:13.000 | When you do change, like most workflows
00:15:16.440 | are kind of going to be compatible between different models.
00:15:19.840 | It's just like, you might have to completely change
00:15:21.920 | your prompt, completely change.
00:15:24.160 | - Okay, well, I mean,
00:15:25.000 | then maybe the question is really about evals.
00:15:26.520 | Like what does the Comfy community do for evals?
00:15:30.840 | Just, you know.
00:15:32.040 | - Well, they don't really do the,
00:15:34.680 | it's more like, oh, I think this image is nice.
00:15:37.400 | - Yeah.
00:15:38.240 | - So that's.
00:15:39.080 | - They just subscribe to fofrAI
00:15:41.040 | and just see like, you know, what fofr is doing.
00:15:43.520 | - Yeah, they just, they just generate like it,
00:15:46.640 | like, I don't see anyone really doing it.
00:15:49.120 | At least on the Comfy side, Comfy users,
00:15:52.360 | it's more like, oh, generate images
00:15:54.600 | and see, oh, this one's nice.
00:15:56.200 | - Yeah.
00:15:57.040 | - It's like.
00:15:57.880 | - Yeah, vibes.
00:15:58.720 | - Yeah, it's not like the more like scientific,
00:16:03.640 | like checking, that's more specifically on the model side.
00:16:08.640 | Yeah.
00:16:12.000 | But there is a lot of vibes also,
00:16:14.560 | 'cause it is like artistic.
00:16:17.800 | You can create a very good model
00:16:19.680 | that doesn't generate nice images.
00:16:23.520 | 'Cause most of the images on the internet are ugly.
00:16:26.480 | So if you, if that's like, if you just,
00:16:30.040 | oh, I have the best model that can generate.
00:16:33.160 | It's super smart.
00:16:34.960 | I trained it on all the, like I've trained on just
00:16:38.760 | all the images on the internet.
00:16:40.320 | The images are not gonna look good.
00:16:43.080 | - Yeah, yeah.
00:16:43.920 | - They're gonna be very consistent, but yeah.
00:16:46.800 | Like, it's not gonna be like the look
00:16:49.280 | that people are gonna be expecting from a model.
00:16:53.320 | So, yeah.
00:16:55.160 | - Can we talk about Loras?
00:16:56.560 | 'Cause we talk about models,
00:16:57.880 | then like the next step is probably Loras.
00:17:00.480 | Before, I actually, I'm kind of curious how Loras
00:17:02.600 | entered the tool set of the image community
00:17:05.480 | because the Lora paper was 2021.
00:17:08.240 | And then like, there was like other methods
00:17:09.760 | like textual inversion that was popular
00:17:11.920 | at the early SD stage.
00:17:13.440 | - Yeah, I can't even explain the difference between that.
00:17:16.200 | Textual inversions, that's basically what you're doing
00:17:19.720 | is you're training a, 'cause well, yeah.
00:17:22.560 | Stable diffusion, you have the diffusion model,
00:17:24.560 | you have the text encoder.
00:17:26.720 | So basically what you're doing is training a vector
00:17:31.720 | that you're gonna pass to the text encoder.
00:17:34.680 | It's basically you're training a new word.
00:17:37.360 | - Yeah, it's a little bit
00:17:38.200 | like representation engineering now.
00:17:40.040 | - Yeah, yeah.
00:17:40.880 | Basically, yeah.
00:17:41.720 | You're just, so yeah.
00:17:43.080 | If you know how like the text encoder works,
00:17:46.000 | basically you have, you take the words of your prompt,
00:17:51.000 | you convert those into tokens with the tokenizer
00:17:54.680 | and those are converted into vectors.
00:17:57.720 | Basically, yeah, each token represents a different vector.
00:18:01.200 | So each word represents a vector and those,
00:18:04.960 | depending on your words, that's the list of vectors
00:18:07.320 | that get passed to the text encoder,
00:18:09.440 | which is just, yeah, just a stack of attention.
00:18:14.440 | Like basically it's very close to LLM architecture.
00:18:19.720 | Yeah, yeah, so basically what you're doing
00:18:22.440 | is just training a new vector.
00:18:24.760 | We're saying, well, I have all these images
00:18:27.520 | and I want to know which word does that represent
00:18:32.520 | and it's gonna get, like you train this vector
00:18:34.960 | and then when you use this vector,
00:18:37.840 | it hopefully generates like something similar
00:18:42.000 | to your images.
00:18:42.920 | - Yeah, I would say it's like surprisingly sample efficient
00:18:46.000 | in picking up the concept
00:18:47.560 | that you're trying to train it on.
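A toy sketch of that training loop — everything frozen except one new embedding vector. The quadratic loss here is a stand-in for the real diffusion loss, and the `target` vector fakes "whatever direction best reproduces the training images":

```python
import numpy as np

# Everything stays frozen; only one new embedding vector is trained.
rng = np.random.default_rng(0)
embed_dim = 8
target = rng.standard_normal(embed_dim)  # stand-in for the ideal embedding

new_token = np.zeros(embed_dim)  # the single trainable "new word"
lr = 0.1
for _ in range(200):
    grad = 2 * (new_token - target)  # gradient of ||v - target||^2
    new_token -= lr * grad           # update only this one vector

print(np.allclose(new_token, target, atol=1e-3))  # True
```

The learned vector then gets used like any other token embedding at prompt time.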
00:18:48.560 | - Yeah, well, people have kind of stopped doing that,
00:18:52.120 | even though back when I was at Stability,
00:18:55.080 | we actually did train internally some textual inversions
00:18:59.440 | on like T5XXL, actually worked pretty well,
00:19:03.640 | but for some reason, yeah, people don't use them.
00:19:07.840 | And also they might also work like, yeah,
00:19:11.880 | that's just something you'd probably have to test,
00:19:14.080 | but maybe if you train a textual inversion
00:19:17.120 | like on T5XXL, it might also work
00:19:19.600 | with all the other models that use T5XXL.
00:19:23.120 | 'Cause same thing with like the textual inversions
00:19:27.400 | that were trained for SD 1.5,
00:19:30.840 | they also kind of work on SDXL
00:19:33.280 | because SDXL has two text encoders
00:19:36.920 | and one of them is the same as the SD 1.5 CLIP-L.
00:19:41.680 | So those, they actually, they don't work as strongly
00:19:45.000 | 'cause they're only applied to one of the text encoders,
00:19:47.600 | but, and the same thing for SD 3.
00:19:50.280 | SD 3 has three text encoders, so it works.
00:19:53.880 | It's still, you can still use
00:19:55.480 | your textual inversion SD 1.5 on SD 3,
00:19:58.720 | but it's just a lot weaker
00:20:00.760 | because now there's three text encoders,
00:20:02.640 | so it gets even more diluted, yeah.
00:20:05.960 | - Do people experiment a lot on, just on the clip side,
00:20:08.680 | there's like Siglip, there's Blip,
00:20:10.160 | like do people experiment a lot on those?
00:20:12.840 | - You can't really replace.
00:20:14.280 | - Yeah, 'cause they're trained together, right?
00:20:15.840 | - Yeah, they're trained together.
00:20:16.840 | So you can't, like, well,
00:20:18.800 | what I've seen people experimenting with is a long clip.
00:20:22.960 | So basically someone fine-tuned a clip model
00:20:26.280 | to accept longer prompts.
00:20:28.360 | - Oh, it's kind of like long context fine-tuning.
00:20:31.400 | - Yeah, so, so like it's,
00:20:33.520 | it's actually supported in core Comfy.
00:20:35.800 | - How long is long?
00:20:36.920 | - Regular clip is 77 tokens.
00:20:40.120 | Long clip is 256.
00:20:43.240 | - Okay.
00:20:44.080 | - But the hack that, like,
00:20:47.080 | if you use stable diffusion 1.5,
00:20:49.560 | you've probably noticed,
00:20:50.560 | oh, it still works if I use long prompts,
00:20:54.560 | prompts longer than 77 tokens.
00:20:56.600 | Well, that's because the hack is to just,
00:21:00.040 | well, you split, you split it up in chunks of 77,
00:21:04.200 | your whole big prompt,
00:21:06.040 | let's say you give it like the massive text,
00:21:09.280 | like the Bible or something,
00:21:12.160 | and it would split it up in chunks of 77
00:21:15.600 | and then just pass each one through the clip
00:21:19.240 | and then just concat everything together at the end.
00:21:23.600 | It's not ideal, but it actually works.
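That chunking hack can be sketched as follows, with a stub standing in for the real CLIP text encoder (the 768-dim output and the encoding itself are illustrative):

```python
import numpy as np

CHUNK = 77  # CLIP's context length

def encode_chunk(tokens):
    """Stub for one CLIP text-encoder pass (illustrative only)."""
    out = np.zeros((CHUNK, 768))  # one 768-dim vector per token position
    out[: len(tokens)] = np.asarray(tokens, dtype=float)[:, None] * 0.01
    return out

def encode_long(tokens):
    # Split into 77-token chunks, encode each, concatenate the outputs
    chunks = [tokens[i : i + CHUNK] for i in range(0, len(tokens), CHUNK)]
    return np.concatenate([encode_chunk(c) for c in chunks], axis=0)

cond = encode_long(list(range(200)))  # a 200-token "prompt"
print(cond.shape)  # (231, 768): three 77-token chunks stacked together
```

Because each chunk is encoded independently, tokens in different chunks never attend to each other — which is why it works, but is "not ideal."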
00:21:26.600 | - Like the positioning of the words
00:21:28.520 | really, really matters then, right?
00:21:30.120 | Like this is why order matters in prompts.
00:21:32.360 | - Yeah.
00:21:33.680 | Yeah, like it, it works, but it's,
00:21:36.400 | it's not ideal, but it's what people expect.
00:21:39.840 | Like if someone gives a huge prompt,
00:21:42.400 | they expect at least some of the concepts
00:21:45.080 | at the end to be like present in the image.
00:21:48.680 | But usually when they give long prompts,
00:21:50.640 | they don't, they like,
00:21:52.400 | they don't expect like detail, I think.
00:21:56.720 | So that's why it works very well.
00:21:59.160 | - And while we're on this topic,
00:22:00.640 | prompt weighting, negative prompting,
00:22:02.480 | all sort of similar part of this layer of the stack.
00:22:05.720 | - Yeah, the hack for that, which works on clip,
00:22:09.040 | like basically it's just for SD 1.5,
00:22:13.440 | well, for SD 1.5, the prompt weighting works well
00:22:16.800 | because clip L is a, it's not a very deep model.
00:22:21.800 | So you have a very high correlation
00:22:25.440 | between you have the input token,
00:22:28.720 | the index of the input token vector
00:22:31.640 | and the output token.
00:22:33.160 | They're very, the concepts are very close, closely linked.
00:22:37.480 | So that means if you interpolate the vector from what,
00:22:42.480 | well, the way Comfy UI does it,
00:22:45.280 | is it has, okay, you have the vector,
00:22:48.760 | you have an empty prompt.
00:22:51.440 | So you have a channel, like a clip output
00:22:54.480 | for the empty prompt,
00:22:55.560 | and then you have the one for your prompt.
00:22:57.920 | And then it interpolates from that,
00:23:00.040 | depending on your prompt weight,
00:23:02.600 | the weight of your tokens.
00:23:04.760 | So if you, yeah.
00:23:07.640 | So that's how it does prompt weighting,
00:23:11.000 | but this stops working the deeper your text encoder is.
00:23:16.000 | So on T5XXL, it doesn't work at all, so.
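The interpolation described here, in sketch form — a simplification of the idea, not ComfyUI's actual code:

```python
import numpy as np

def weight_tokens(prompt_out, empty_out, weights):
    """Move each token's output along the line from the empty-prompt
    encoding toward the prompt encoding, scaled by its weight."""
    w = np.asarray(weights, dtype=float)[:, None]
    return empty_out + w * (prompt_out - empty_out)

tokens, dim = 4, 768
prompt_out = np.ones((tokens, dim))  # CLIP output for the prompt
empty_out = np.zeros((tokens, dim))  # CLIP output for the empty prompt ""
weighted = weight_tokens(prompt_out, empty_out, [1.0, 1.2, 0.5, 1.0])
print(weighted[1, 0], weighted[2, 0])  # 1.2 0.5
```

This relies on output positions staying closely tied to input tokens, which holds for a shallow encoder like CLIP-L but breaks down in deeper ones.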
00:23:19.880 | - Wow, is that a problem for people?
00:23:22.240 | I mean, 'cause I'm used to just moving up numbers.
00:23:24.440 | - Probably not, is, well.
00:23:26.480 | - So you just use words to describe, right?
00:23:28.760 | 'Cause it's a bigger language model.
00:23:29.880 | - Yeah, yeah.
00:23:31.240 | So it might be good,
00:23:33.480 | but I haven't seen many complaints on Flux,
00:23:36.160 | that it's not working, so.
00:23:40.000 | 'Cause I guess people can sort of get around it
00:23:43.920 | with language, so.
00:23:46.480 | - Yeah. - Yeah.
00:23:47.520 | - And then coming back to Loras,
00:23:49.160 | now the popular way to customize models is Loras.
00:23:52.680 | And I saw you also support LoCon and LoHa,
00:23:55.320 | which I've never heard of before.
00:23:56.360 | - There's a bunch of, 'cause what the Lora is essentially,
00:23:59.800 | is instead of like, okay, you have your model,
00:24:04.800 | and then you wanna fine tune it.
00:24:06.520 | So instead of, like, what you could do
00:24:08.880 | is you could fine tune the entire thing.
00:24:10.720 | - Yeah, full fine tune, yeah.
00:24:12.000 | - But that's a bit heavy.
00:24:15.000 | So to speed things up and make things less heavy,
00:24:18.800 | what you can do is just fine tune some smaller weights.
00:24:23.080 | Like basically two matrices,
00:24:26.800 | like two low-rank matrices
00:24:30.320 | that, when you multiply them together,
00:24:32.640 | represent the difference
00:24:35.680 | between the trained weights and your base weights.
00:24:39.280 | So by training those two smaller matrices,
00:24:43.800 | that's a lot less heavy.
00:24:45.720 | - And they're portable.
00:24:47.000 | So you're gonna share them.
00:24:48.120 | - Yeah. - It's like easier.
00:24:48.960 | - And also smaller, yeah.
00:24:50.160 | That's how Loras work.
00:24:51.800 | So basically, so when inferencing,
00:24:54.800 | you can inference with them pretty efficiently,
00:24:57.600 | like how Compute-Wide does it.
00:24:59.360 | It just, when you use a Lora,
00:25:01.440 | it just applies it straight on the weights
00:25:04.200 | so that there's only a small delay at the base,
00:25:07.440 | like before the sampling to when it applies the weights,
00:25:10.600 | and then it just, same speed as before.
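Both halves of that — the low-rank delta and merging it into the weights before sampling — fit in a few lines (a toy sketch with made-up sizes, not ComfyUI's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((d_out, rank)) * 0.1  # the two trained
B = rng.standard_normal((rank, d_in)) * 0.1   # low-rank factors

# Their product is the learned difference on the weights:
# 64*4 + 4*64 = 512 trained values instead of 64*64 = 4096.
delta = A @ B

# Merged once before sampling, so per-step inference speed is unchanged
alpha = 0.8                                   # LoRA strength
W_merged = W + alpha * delta

x = rng.standard_normal(d_in)
print(np.allclose(W_merged @ x, W @ x + alpha * (A @ (B @ x))))  # True
```

The other formats (LoCon, LoHa, and so on) just pick different factorizations for that same weight difference.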
00:25:14.680 | So for inference, it's not that bad, but,
00:25:19.360 | and then you have, so basically all the Lora types,
00:25:22.880 | like LoHa, LoCon, and everything,
00:25:24.760 | that's just different ways of representing that.
00:25:28.720 | Like, basically you can call it kind of like compression,
00:25:32.920 | even though it's not really compression.
00:25:35.520 | It's just different ways of representing, like, just okay.
00:25:39.480 | I want to train a difference on the weights.
00:25:42.880 | What's the best way to represent that difference?
00:25:46.600 | There's the basic Lora, which is just,
00:25:48.240 | oh, let's multiply these two matrices together.
00:25:50.840 | And then there's all the other ones,
00:25:52.800 | which are all different algorithms.
00:25:55.720 | So, yeah.
00:25:58.200 | - So let's talk about what ComfyUI actually is.
00:26:00.960 | I think most people have heard of it.
00:26:02.760 | Some people might've seen screenshots.
00:26:05.280 | I think fewer people have built very complex workflows.
00:26:08.040 | So when you started, automatic was like the super simple way.
00:26:12.600 | What were some of the choices that you made?
00:26:15.000 | So the node workflow, is there anything else
00:26:17.760 | that stands out as like, this was like a unique take
00:26:20.360 | on how to do image generation workflows?
00:26:22.480 | - Well, I feel like, yeah, back then,
00:26:24.160 | everyone was trying to make like easy to use interface.
00:26:29.160 | So I'm like, well, everyone's trying to make
00:26:31.400 | an easy to use interface.
00:26:32.800 | - Let's make a hard to use interface.
00:26:34.320 | (all laughing)
00:26:37.280 | - Like, so like, I don't need to do that.
00:26:40.480 | (all laughing)
00:26:42.520 | Everyone else doing it, so let me try something.
00:26:45.840 | Like, let me try to make a powerful interface
00:26:49.440 | that's not easy to use, so.
00:26:52.200 | - So like, yeah, there's a sort of node execution engine.
00:26:55.880 | Your readme actually lists, has this really good list
00:26:58.400 | of features of things you prioritize, right?
00:27:00.600 | Like, let me see, like sort of re-executing
00:27:03.720 | from many parts of this workflow that was changed,
00:27:06.760 | asynchronous queue system, smart memory management,
00:27:10.040 | like all this seems like a lot of engineering that.
00:27:12.240 | - Yeah, there's a lot of engineering
00:27:14.160 | in the backend to make things.
00:27:17.080 | 'Cause I was always focused on making things work locally
00:27:21.080 | very well, 'cause that's, 'cause I was using it locally.
00:27:24.640 | So everything, so there's a lot of thought and work
00:27:29.640 | and like getting everything to run as well as possible.
00:27:34.600 | So yeah, ComfyUI is actually more of a backend,
00:27:39.200 | at least, well, now the front end's getting
00:27:42.280 | a lot more development, but before it was,
00:27:46.840 | I was pretty much only focused on the backend.
00:27:49.920 | - Yeah, so v0.1 was only August this year.
00:27:54.240 | - Yeah, before there was no versioning, so yeah.
00:27:58.200 | - And so what was the big rewrite for the 0.1
00:28:00.520 | and then the 1.0?
00:28:02.280 | - Well, that's more on the front end side.
00:28:05.480 | 'Cause before that, it was just like the UI, what,
00:28:09.960 | 'cause when I first wrote it, I just, I said, okay,
00:28:13.800 | how can I make, like, I can do web development,
00:28:17.360 | but I don't like doing it.
00:28:19.080 | Like, what's the easiest way I can slap
00:28:21.160 | a node interface on this?
00:28:22.800 | And then I found this library, litegraph.js,
00:28:25.440 | like JavaScript library.
00:28:27.040 | - LiteGraph? - LiteGraph.
00:28:28.600 | - Usually people will go for like React flow,
00:28:30.400 | for like a flow builder.
00:28:31.240 | - Yeah, but that seems like too complicated.
00:28:33.880 | - 'Cause of React. (all laughing)
00:28:35.560 | - So I didn't really want to spend time
00:28:38.000 | like developing the front end, so I'm like,
00:28:40.720 | well, oh, LiteGraph, this has the whole node interface.
00:28:44.840 | Well, okay, let me just plug that into my back end then.
00:28:49.440 | - I feel like if Streamlit or Gradio offered something,
00:28:51.640 | you would have used Streamlit or Gradio 'cause it's Python.
00:28:54.040 | - Yeah, Streamlit and Gradio.
00:28:55.680 | Gradio, I don't like Gradio.
00:28:57.360 | - Why? - It's bad.
00:28:58.400 | Like, that's one of the reasons why,
00:29:02.040 | like, automatic was very bad.
00:29:05.040 | It's Gradio, 'cause the problem with Gradio,
00:29:08.320 | it forces you to, well, not forces you,
00:29:11.080 | but it kind of makes your interface logic
00:29:16.080 | and your back end logic and just sticks them together.
00:29:20.760 | - Well, it's supposed to be easy for you guys,
00:29:22.720 | if you're a Python main, you know, I'm a JS main, right?
00:29:24.920 | - Okay.
00:29:25.760 | - If you're a Python main, it's supposed to be easy.
00:29:26.760 | - Yeah, it's easy, but it makes your whole software
00:29:29.960 | a huge mess.
00:29:30.920 | - I see, I see.
00:29:31.760 | So you're mixing concerns instead of separating concerns?
00:29:34.280 | - Well, it's 'cause-
00:29:35.880 | - Like front end and back end.
00:29:36.880 | - Front end and back end should be well separated
00:29:39.560 | with a fine API.
00:29:41.160 | Like, that's how you're supposed to do it.
00:29:43.760 | - People, smart people disagree, but yeah.
00:29:46.680 | - It just sticks everything together.
00:29:49.120 | It makes it easy to, like, make a huge mess.
00:29:52.960 | And also it's, there's a lot of issues with Gradio.
00:29:57.960 | Like, it's very good if all you want to do
00:30:00.200 | is just get, like, slap a quick interface on your,
00:30:04.000 | like, to show off your, like, your ML project.
00:30:08.360 | Like, that's what it's made for.
00:30:10.280 | - Yeah, yeah.
00:30:11.120 | - Like, there's no problem using it, like,
00:30:13.520 | oh, I have my, I have my code.
00:30:16.160 | I just want a quick interface on it.
00:30:18.600 | That's perfect.
00:30:20.080 | Like, use Gradio.
00:30:21.080 | But if you want to make something that's like a real,
00:30:24.280 | like real software that will last a long time
00:30:28.320 | and will be easy to maintain, then I would avoid it.
00:30:32.520 | - Yeah, yeah.
00:30:33.480 | So your criticism is Streamlit and Gradio are the same.
00:30:36.080 | I mean, those are the same criticisms.
00:30:38.120 | - Yeah, Streamlit, I haven't.
00:30:40.800 | - Haven't used as much.
00:30:41.640 | - Yeah, it's just looked a bit.
00:30:43.800 | - Similar philosophy.
00:30:44.640 | - Yeah, it's similar.
00:30:45.640 | It's just, it just seems to me like, okay,
00:30:48.000 | for quick, like, AI demos, it's perfect.
00:30:51.200 | - Yeah.
00:30:52.040 | Going back to like the core tech, like asynchronous queues,
00:30:55.080 | slow re-execution, smart memory management, you know,
00:30:57.320 | anything that you were very proud of
00:30:59.160 | or was very hard to figure out?
00:31:00.720 | - Yeah, the thing that's the biggest pain in the ass
00:31:03.640 | is probably the memory management.
00:31:05.840 | - Yeah, were you just paging models in and out or?
00:31:08.360 | - Yeah, before it was just, okay, load the model,
00:31:11.840 | completely unload it, load the new model,
00:31:14.560 | completely unload it.
00:31:16.200 | Then, okay, that works well when your model are small,
00:31:19.800 | but if your models are big and it takes like,
00:31:22.960 | let's say someone has a, like a 4090
00:31:26.720 | and the model size is 10 gigabytes,
00:31:29.760 | that can take a few seconds to like load and load,
00:31:33.200 | so you want to try to keep things like in memory,
00:31:36.840 | in the GPU memory, as much as possible.
00:31:39.520 | What Comfy UI does right now is that,
00:31:43.120 | it tries to like estimate, okay, like,
00:31:45.800 | okay, you're going to sample this model.
00:31:48.040 | It's going to take probably this amount of memory.
00:31:51.200 | Let's remove the models, like this amount of memory
00:31:56.200 | that's been loaded on the GPU and then just execute it.
00:32:01.840 | But, so there's a fine line between just,
00:32:06.400 | 'cause you try to remove the least amount of models
00:32:10.840 | that are already loaded,
00:32:13.080 | 'cause it adds like Windows driver.
00:32:18.080 | And another problem is the NVIDIA driver on Windows
00:32:22.160 | by default, because there's a way to,
00:32:24.640 | there's an option to disable that feature.
00:32:26.680 | But by default, it, like, if you start loading,
00:32:30.840 | you can overflow your GPU memory,
00:32:34.280 | and then it's, the driver's going to automatically
00:32:36.360 | start paging to RAM.
00:32:38.120 | But the problem with that is it,
00:32:39.720 | it makes everything extremely slow.
00:32:42.320 | So when you see people complaining,
00:32:44.000 | oh, this model, it works, but oh shit,
00:32:47.200 | it starts slowing down a lot.
00:32:49.160 | That's probably what's happening.
00:32:51.240 | So it's basically, you have to just try to get,
00:32:55.480 | use as much memory as possible, but not too much,
00:32:59.360 | or else things start slowing down,
00:33:01.520 | or people get out of memory.
00:33:03.760 | And then just find, try to find that line where,
00:33:07.680 | like the driver on Windows starts paging and stuff.
00:33:12.000 | - Yeah.
00:33:12.840 | - And yeah, the problem with PyTorch is it's,
00:33:15.520 | it's high level, you don't have that much fine-grained control
00:33:19.040 | over like specific memory stuff.
00:33:22.240 | So kind of have to leave like the memory freeing
00:33:26.280 | to Python and PyTorch, which is, can be annoying sometimes.
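[Editor's note] The bookkeeping he describes can be sketched roughly like this. All names here are hypothetical, and this is nothing like ComfyUI's real memory manager: the point is just that you estimate what the next job needs and evict already-loaded models only until the estimate fits, instead of flushing everything on every run.

```python
# Hypothetical sketch of partial model unloading: free only as much
# already-loaded model memory as the next job is estimated to need,
# rather than unloading everything. Sizes are in MB for readability.

class GPUCache:
    def __init__(self, total_mem):
        self.total_mem = total_mem
        self.loaded = []  # list of (name, size), oldest first

    def used(self):
        return sum(size for _, size in self.loaded)

    def free_for(self, needed):
        """Evict oldest-loaded models until `needed` MB of headroom exists.
        Stopping as soon as it fits is what keeps reloads rare; freeing
        too little risks OOM or, on Windows, silent driver paging to RAM."""
        while self.loaded and self.total_mem - self.used() < needed:
            name, size = self.loaded.pop(0)
            print(f"unloading {name} ({size} MB)")

    def load(self, name, size):
        self.free_for(size)
        self.loaded.append((name, size))

cache = GPUCache(total_mem=24_000)  # e.g. a 24 GB card
cache.load("unet", 10_000)
cache.load("text_encoder", 5_000)   # both fit, nothing is evicted
cache.load("video_model", 16_000)   # forces the oldest model (unet) out
print([name for name, _ in cache.loaded])  # ['text_encoder', 'video_model']
```

The hard part in practice, as he says, is that the "size" estimates are fuzzy and the real freeing is left to Python and PyTorch.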
00:33:31.280 | - So, you know, I think one thing as a maintainer
00:33:35.240 | of this project, like you're designing
00:33:37.040 | for a very wide surface area of compute.
00:33:40.600 | Like you even support CPUs.
00:33:42.320 | - Yeah, well that's, that's just for fun.
00:33:44.720 | PyTorch CPUs, so yeah, it's just,
00:33:48.240 | that's not, that's not hard to support.
00:33:50.160 | - First of all, is there a market share estimate?
00:33:51.720 | Like, is it like 70% NVIDIA, like 30% AMD,
00:33:55.040 | and then like miscellaneous on Apple Silicon or whatever?
00:33:58.440 | - For Comfy?
00:34:00.640 | - Yeah.
00:34:01.480 | - Yeah, I don't know the market share.
00:34:04.080 | - Can you guess?
00:34:05.480 | - I think it's mostly NVIDIA.
00:34:07.320 | - Yeah.
00:34:08.600 | - 'Cause AMD, the problem, like AMD works horribly
00:34:11.720 | on Windows.
00:34:12.920 | Like on Linux, it works fine.
00:34:15.280 | It's slower than the price equivalent NVIDIA GPU,
00:34:19.600 | but it works, like you can use it, generate images,
00:34:23.680 | everything works.
00:34:25.120 | On Linux, on Windows, you might have a hard time.
00:34:28.280 | So that's the problem.
00:34:29.720 | And most people, I think most people who bought AMD
00:34:34.720 | probably use Windows.
00:34:36.960 | (laughing)
00:34:38.360 | They probably aren't gonna switch to Linux.
00:34:41.200 | (laughing)
00:34:42.760 | So until AMD actually like ports their ROCm
00:34:47.760 | to Windows properly, and then there's actually PyTorch.
00:34:52.280 | I think they're doing that.
00:34:54.280 | They're in the process of doing that,
00:34:56.080 | but until they get a good like PyTorch ROCm build
00:35:01.080 | that works on Windows, it's like,
00:35:04.400 | they're gonna have a hard time.
00:35:06.160 | - Yeah.
00:35:07.000 | - We gotta get George on it.
00:35:08.200 | - Yeah, well, he's trying to get Lisa Su to do it.
00:35:10.920 | But let's talk a bit about like the node design.
00:35:14.920 | So unlike all the other text to image,
00:35:17.240 | you have a very like deep.
00:35:19.280 | So you have like a separate node for like CLIP encode.
00:35:22.120 | You have a separate node for like the KSampler.
00:35:24.440 | You have like all these nodes.
00:35:25.880 | Going back to like making it easy versus making it hard.
00:35:28.320 | But like, how much do people actually play
00:35:30.640 | with all the settings?
00:35:31.960 | You know, kind of like, how do you guide people to like,
00:35:33.680 | hey, this is actually gonna be very impactful
00:35:35.840 | versus this is maybe like less impactful,
00:35:38.280 | but we still wanna expose it to you.
00:35:40.080 | - Well, I try to expose like,
00:35:44.520 | I try to expose everything or that's, yeah.
00:35:49.240 | At least for the, but for things like, for example,
00:35:51.760 | for the samplers, like there's like, yeah,
00:35:55.240 | four different sampler nodes,
00:35:57.400 | which go in easiest to most advanced.
00:36:01.720 | So yeah, if you go like the easy node,
00:36:04.600 | the regular sampler node,
00:36:06.120 | that's you have just the basic settings.
00:36:09.160 | But if you use like the sampler advanced,
00:36:11.800 | custom advanced node, that one you can actually,
00:36:15.280 | you'll see you have like different nodes.
00:36:19.520 | - I'm looking it up now.
00:36:20.920 | - Yeah.
00:36:21.760 | - What are like the most impactful parameters that you use?
00:36:25.880 | So it's like, you know, you can have more,
00:36:27.480 | but like, which ones like really make a difference?
00:36:30.040 | - CFG.
00:36:30.880 | - Yeah, they all do.
00:36:31.800 | They all have their own like, they all like, for example,
00:36:35.680 | yeah, steps, usually you want steps,
00:36:38.760 | you want them to be as low as possible,
00:36:41.800 | but you want, if you're optimizing your workflow,
00:36:45.760 | you want to, you lower the steps
00:36:47.600 | until like the image has started deteriorating too much.
00:36:52.440 | 'Cause that, yeah, that's the number of steps
00:36:55.200 | you're running the diffusion process.
00:36:57.320 | So if you want things to be faster, lower is better.
00:37:02.280 | But yeah, CFG, that's more,
00:37:05.040 | you can kind of see that as the contrast of the image.
00:37:09.400 | Like if your image looks too burnt out,
00:37:12.000 | then you lower the CFG.
00:37:13.920 | So yeah, CFG, that's how, yeah,
00:37:16.440 | that's how strongly the, like the negative
00:37:20.240 | versus positive prompt.
00:37:22.760 | 'Cause when you sample a diffusion model,
00:37:24.760 | it's basically a negative prompt.
00:37:27.320 | It's just, yeah, positive prediction
00:37:30.320 | minus negative prediction.
00:37:33.120 | - Contrastive loss.
00:37:34.000 | - Yeah, so it's positive minus negative,
00:37:36.880 | and the CFG, that's the multiplier.
00:37:39.040 | - Yeah.
00:37:40.240 | - Yeah, so.
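[Editor's note] The "positive minus negative, times a multiplier" description is the standard classifier-free guidance formula. A minimal sketch on toy scalars (a real sampler applies this to full noise-prediction tensors, one model pass per prompt):

```python
# Classifier-free guidance on toy numbers. In a real sampler cond_pred
# and uncond_pred are the model's noise predictions for the positive
# and negative prompt respectively.

def cfg(cond_pred, uncond_pred, scale):
    """uncond + scale * (cond - uncond); scale is the CFG value.
    At scale == 1.0 this reduces to cond_pred alone, which is why the
    negative prompt has no effect there (and that pass can be skipped,
    making sampling roughly twice as fast)."""
    return uncond_pred + scale * (cond_pred - uncond_pred)

print(cfg(0.75, 0.25, 1.0))  # 0.75 -> negative prompt does nothing
print(cfg(0.75, 0.25, 7.0))  # 3.75 -> the difference is amplified 7x
```

Higher scales push the prediction harder away from the negative prompt, which is why a too-high CFG reads visually as a burnt-out, over-contrasted image.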
00:37:41.080 | - What are like good resources
00:37:42.640 | to understand what the parameters do?
00:37:45.080 | I think most people start with automatic,
00:37:47.240 | and then they move over and it's like,
00:37:48.920 | step, CFG, sampler, name, scheduler, denoise.
00:37:51.920 | - Read it.
00:37:52.760 | - Honestly, well, it's more,
00:37:56.000 | it's something you should like try out yourself.
00:37:59.480 | I don't know, you don't necessarily need to know
00:38:02.360 | how it works to like what it does.
00:38:05.240 | 'Cause even if you know, like, CFG,
00:38:07.600 | it's like positive minus negative prompt.
00:38:10.320 | - Yeah.
00:38:11.160 | - So the only thing you know at CFG is if it's 1.0,
00:38:14.240 | then that means the negative prompt isn't applying.
00:38:17.360 | And also maybe sampling is two times faster, but.
00:38:20.600 | - Yeah, yeah.
00:38:21.480 | - Yeah, but other than that,
00:38:22.760 | it's more like you should really just see
00:38:26.080 | what it does to the images yourself,
00:38:27.960 | and you'll probably get a more intuitive understanding
00:38:32.280 | of what these things do.
00:38:33.960 | - Any other notes or things you want to shout out?
00:38:37.920 | Like I know the AnimateDiff IP adapter,
00:38:40.360 | those are like some of the most popular ones.
00:38:42.920 | - Yeah, what else comes to mind?
00:38:44.840 | - I don't have notes, but there's,
00:38:47.240 | like what I like is when some people,
00:38:49.560 | sometimes they make things that use Comfy UI
00:38:53.160 | as their backend, like there's a plugin for Krita
00:38:58.160 | that uses Comfy UI as its backend.
00:39:02.760 | So you can use like all the models
00:39:05.600 | that work in Comfy in Krita.
00:39:07.640 | And I think I've tried it once,
00:39:10.880 | but I know a lot of people use it,
00:39:13.440 | and find it really nice, so.
00:39:15.960 | - What's the craziest node that people have built,
00:39:19.080 | like the most complicated?
00:39:21.320 | - Craziest node, like yeah,
00:39:23.720 | I know some people have made like video games in Comfy,
00:39:28.400 | with like stuff like that.
00:39:31.760 | So like someone, like I remember,
00:39:35.160 | like yeah, I think it was last year,
00:39:38.240 | someone made like Wolfenstein 2 in Confi,
00:39:43.240 | and then one of the inputs was,
00:39:45.920 | oh, you can generate a texture,
00:39:47.720 | and then it changes the texture in the game.
00:39:51.400 | So I could plug it to like the workflow.
00:39:54.320 | And there's a lot of, if you look there,
00:39:56.440 | there's a lot of crazy things people do, so yeah.
00:40:00.320 | - And now there's like a node register
00:40:02.120 | that people can use to like download nodes.
00:40:04.760 | - Yeah, well, there's always been the Comfy UI Manager,
00:40:08.360 | but we're trying to make this more like official,
00:40:12.840 | like with the node registry,
00:40:17.360 | 'cause before the node registry,
00:40:20.920 | like okay, how did your custom node get into Comfy UI Manager?
00:40:24.560 | That's the guy running it who like every day,
00:40:27.720 | he searched GitHub for new custom nodes,
00:40:29.880 | and added them manually to his custom node manager.
00:40:34.520 | So we're trying to make it less effort for him, basically.
00:40:39.520 | - Yeah, but I was looking,
00:40:41.840 | I mean, there's like a YouTube download node.
00:40:44.640 | There's like, this is almost like a data pipeline,
00:40:48.080 | more than like an image generation thing at this point.
00:40:50.120 | It's like you can get data in,
00:40:51.400 | you can like apply filters to it,
00:40:52.880 | you can generate data out.
00:40:54.800 | - Yeah, you can do a lot of different things.
00:40:57.760 | - Yeah, something I think,
00:40:59.200 | what I did is I made it easy to make custom nodes.
00:41:04.960 | So I think that that helped a lot for the ecosystem,
00:41:09.760 | 'cause it is very easy to just make a node.
00:41:12.960 | So yeah, a bit too easy sometimes.
00:41:16.880 | (laughing)
00:41:18.040 | Then we have the issue where there's a lot
00:41:20.960 | of custom node packs, which share similar nodes.
00:41:25.520 | But well, that's, yeah, something we're trying to solve
00:41:30.320 | by maybe bringing some of the functionality into core.
00:41:34.720 | - Yeah, yeah, yeah.
00:41:35.840 | - Yeah.
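[Editor's note] "Easy to make custom nodes" here means a custom node is just a small Python class dropped into `custom_nodes/`. The example below follows the conventions ComfyUI custom nodes use (an `INPUT_TYPES` classmethod, `RETURN_TYPES`, `FUNCTION`, and a module-level `NODE_CLASS_MAPPINGS`); the node itself is a made-up illustration, so check the current docs for exact details.

```python
# Rough shape of a ComfyUI custom node: a plain Python class plus a
# module-level mapping. Dropped into custom_nodes/, ComfyUI discovers
# it via NODE_CLASS_MAPPINGS. The node here is a toy example.

class ConcatPrompts:
    @classmethod
    def INPUT_TYPES(cls):
        # Declares the input sockets/widgets the node shows in the graph.
        return {
            "required": {
                "text_a": ("STRING", {"multiline": True, "default": ""}),
                "text_b": ("STRING", {"multiline": True, "default": ""}),
            }
        }

    RETURN_TYPES = ("STRING",)   # one output socket of type STRING
    FUNCTION = "run"             # method called when the node executes
    CATEGORY = "utils"           # where it appears in the node menu

    def run(self, text_a, text_b):
        # Outputs are always returned as a tuple, one per RETURN_TYPE.
        return (text_a + ", " + text_b,)

NODE_CLASS_MAPPINGS = {"ConcatPrompts": ConcatPrompts}

# Since the class is plain Python, it can be exercised without ComfyUI:
print(ConcatPrompts().run("a castle", "watercolor"))  # ('a castle, watercolor',)
```

That low bar is the double-edged sword he mentions: trivial to write, which is why many node packs ship near-duplicate nodes.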
00:41:36.680 | - And then there's like video,
00:41:38.640 | people can do video generation.
00:41:40.440 | - Yeah, video, that's, well, the first video model
00:41:44.480 | was like stable video diffusion,
00:41:47.080 | which was last, yeah, exactly last year, I think.
00:41:50.760 | Like one year ago, but that wasn't a true video model.
00:41:54.080 | So it was--
00:41:55.920 | - It was like moving images.
00:41:57.560 | - Yeah, it generated video.
00:41:58.920 | What I mean by that is it's like,
00:42:01.080 | it's still 2D latents.
00:42:04.320 | It's basically what they did is they took SD2
00:42:08.040 | and then they added some temporal attention to it
00:42:11.840 | and then trained it on videos.
00:42:14.360 | So it's kind of like animate diff,
00:42:18.280 | like same idea, basically.
00:42:21.960 | Why I say it's not a true video model
00:42:24.440 | is that you still have like the 2D latents.
00:42:27.440 | Like a true video model, like Mochi, for example,
00:42:30.680 | would have 3D latents.
00:42:33.360 | - Which means you can like move through the space,
00:42:35.040 | basically, it's the difference.
00:42:36.680 | You're not just kind of like reorienting.
00:42:38.880 | - Yeah, and it's also, well,
00:42:40.240 | it's also because you have a temporal VAE.
00:42:42.960 | Also, like Mochi has a temporal VAE
00:42:45.880 | that compresses on like the temporal direction also.
00:42:51.000 | So that's something you don't have with like,
00:42:52.880 | yeah, animate diff and stable video diffusion.
00:42:56.080 | They only like compress spatially, not temporally.
00:43:00.240 | So yeah, so these models,
00:43:02.040 | that's why I call them like true video models.
00:43:04.920 | There's actually a few of them,
00:43:07.200 | but the one I've implemented in Comfy is Mochi
00:43:11.680 | 'cause that seems to be the best one so far.
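[Editor's note] The spatial-only vs temporal distinction is just latent-shape arithmetic. A toy illustration: the 8x-per-side spatial factor matches SD-style VAEs, but the 6x temporal factor is an illustrative assumption here, not any specific model's exact number.

```python
# Latent sizes for a 48-frame, 512x512 video. A spatial-only VAE
# (SVD / AnimateDiff style) compresses each frame 8x per side but
# keeps every frame; a "true" video model's temporal VAE compresses
# along time as well (6x here is an illustrative figure).

def latent_shape(frames, h, w, spatial=8, temporal=1):
    return (max(1, frames // temporal), h // spatial, w // spatial)

frames, h, w = 48, 512, 512
print(latent_shape(frames, h, w))              # (48, 64, 64): one 2D latent per frame
print(latent_shape(frames, h, w, temporal=6))  # (8, 64, 64): time is compressed too
```

The temporal axis shrinking 6x is what makes long clips tractable for the diffusion model, and it is what a frame-by-frame 2D latent setup cannot give you.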
00:43:15.960 | - We had AJ come and speak at the stable diffusion meetup.
00:43:19.200 | Other open one I think I've seen is CogVideo.
00:43:21.640 | - Yeah, CogVideo.
00:43:22.880 | Yeah, that one's, yeah, it also seems decent, but yeah.
00:43:27.360 | - We're Chinese, so we don't use it.
00:43:29.040 | (all laughing)
00:43:29.880 | - No, it's fine, it's just, yeah, I could,
00:43:31.840 | yeah, it's just that there's a, it's not the only one.
00:43:34.880 | There's also a few others, which I--
00:43:37.160 | - The rest are like closed source, right?
00:43:38.920 | Like Cling--
00:43:39.760 | - Yeah, the closed source, there's a bunch of them,
00:43:41.600 | but I mean, open, I've seen a few of them,
00:43:45.560 | like, yeah, I can't remember their names,
00:43:47.680 | but there's CogVideo, the big one,
00:43:50.640 | then there's also a few of them
00:43:52.520 | that released at the same time.
00:43:55.080 | There's one that released at the same time
00:43:57.480 | as SD3.5, same day, which is why I don't remember the name.
00:44:01.800 | (all laughing)
00:44:03.120 | - We should have a release schedule
00:44:04.280 | so we don't conflict on each of these things.
00:44:06.080 | - Yeah, I think SD3.5 and Mochi released on the same day.
00:44:10.160 | So everything else was kind of drowned,
00:44:13.240 | completely drowned out.
00:44:14.480 | So for some reason, lots of people picked that day
00:44:17.920 | to release their stuff.
00:44:19.400 | (all laughing)
00:44:21.520 | Yeah, which is, well, shame for those, I think, guys.
00:44:25.320 | I think OmniGen also released the same day,
00:44:27.640 | which also seems interesting, but yeah.
00:44:30.680 | - Yeah, what's Comfy?
00:44:32.080 | So you are Comfy, and then there's like Comfy.org.
00:44:35.560 | I know we do a lot of things for like Nous Research,
00:44:37.520 | and those guys also have kind of like a more open source
00:44:40.400 | and on thing going on.
00:44:42.120 | How do you work?
00:44:43.680 | Like you mentioned,
00:44:44.520 | you must have worked on like the core piece of it,
00:44:47.240 | and then what?
00:44:48.080 | - Maybe I should fit in because, yeah,
00:44:50.640 | I feel like maybe, yeah,
00:44:52.520 | I only explained part of the story.
00:44:54.880 | - Right.
00:44:55.880 | - Yeah, maybe I should explain the rest, so yeah.
00:44:58.240 | So yeah, basically January,
00:45:01.000 | that's when the first, January 2023,
00:45:04.320 | January 16th, 2023,
00:45:05.720 | that's when Comfy was first released to the public.
00:45:10.400 | Then, yeah, did a Reddit post
00:45:12.880 | about the area composition thing somewhere in,
00:45:15.680 | I don't remember exactly,
00:45:18.360 | maybe end of January, beginning of February.
00:45:21.440 | And then somewhere, a YouTuber made a video about it,
00:45:26.440 | like Olivio, he made a video about Comfy in March, 2023.
00:45:31.840 | I think that's when it was real burst of attention.
00:45:36.280 | And by that time I was continued to developing it
00:45:40.480 | and it was getting,
00:45:42.720 | people were starting to use it more,
00:45:45.400 | which unfortunately meant that I had first written it
00:45:50.400 | to do like experiments,
00:45:52.400 | but then my time to do experiments
00:45:55.720 | when it started going down,
00:45:57.120 | yeah, 'cause yeah,
00:46:00.480 | people were actually starting to use it then,
00:46:02.680 | like I had to.
00:46:04.320 | And I said, well, yeah,
00:46:05.800 | time to add all these features and stuff.
00:46:09.360 | Yeah, and then I got hired by Stability, June, 2023.
00:46:13.760 | Then I made the, basically, yeah,
00:46:16.080 | they hired me 'cause they wanted the SDXL.
00:46:19.440 | So I got SDXL working very well in Comfy UI
00:46:23.680 | because they were experimenting with it.
00:46:26.880 | Actually, how the SDXL release worked
00:46:29.840 | is they released for some reason,
00:46:32.080 | they released the code first,
00:46:34.800 | but they didn't release the model checkpoint.
00:46:37.360 | Oh yeah, so they released the code.
00:46:39.800 | And then, well, since the research was with the code,
00:46:42.880 | I released the code in Comfy too.
00:46:45.400 | And then the checkpoints were basically early access.
00:46:49.360 | People had to sign up
00:46:50.840 | and they only allowed a lot of people from edu emails.
00:46:55.840 | Like if you had the edu email,
00:46:58.160 | like they gave you access basically to the SDXL 0.9.
00:47:03.160 | And well, that leaked, of course,
00:47:09.520 | because of course it's gonna leak if you do that.
00:47:13.240 | Well, the only way people could easily use it
00:47:15.640 | was with Comfy.
00:47:16.920 | So yeah, people started using it.
00:47:19.320 | And then I fixed a few of the issues that people had.
00:47:22.240 | So then the big 1.0 release happened.
00:47:26.800 | And well, Comfy UI was the only way a lot of people
00:47:31.280 | could actually run it on their computers.
00:47:33.880 | 'Cause it just like automatic was so like inefficient
00:47:38.680 | and bad that most people couldn't actually,
00:47:41.880 | like it just wouldn't work.
00:47:44.920 | Like, 'cause he did a quick implementation.
00:47:47.640 | So people were forced to use Comfy UI.
00:47:50.480 | And that's how it became popular
00:47:52.320 | because people had no choice.
00:47:54.440 | (all laughing)
00:47:56.000 | - The growth hack.
00:47:56.840 | - Yeah.
00:47:57.680 | - Yeah.
00:47:58.500 | Like everywhere, like people who didn't have the 4090,
00:48:01.000 | they had like, who had just regular GPUs.
00:48:03.520 | - Yeah, yeah.
00:48:04.360 | - They didn't have a choice, so.
00:48:06.480 | - Yeah, I got a 4070, so think of me.
00:48:09.480 | And so today, what's, is there like a core Comfy team or?
00:48:14.240 | - Yeah, well, right now, yeah, we are hiring actually.
00:48:19.240 | So right now core, like the core core itself, it's me.
00:48:25.120 | Yeah, but because the reason we're focused,
00:48:27.640 | like all the focus has been mostly
00:48:30.120 | on the front end right now,
00:48:31.480 | 'cause that's the thing that's been neglected
00:48:34.400 | for a long time.
00:48:36.200 | So most of the focus right now is all on the front end,
00:48:41.200 | but we are, yeah, we will soon get more people
00:48:46.040 | to like help me with the actual backend stuff.
00:48:49.720 | Because that's, once we have our V1 release,
00:48:54.120 | which is, 'cause it'd be the packaged ComfyUI
00:48:57.240 | with the nice interface and easy to install on Windows,
00:49:02.240 | and hopefully Mac.
00:49:04.160 | Yeah, once we have that, we're going to have to,
00:49:08.800 | lots of stuff to do on the backend side
00:49:11.400 | and also the front end side, but yeah.
00:49:14.960 | - What's the release date?
00:49:15.960 | I'm on the wait list.
00:49:17.000 | What's the timing?
00:49:18.560 | - Soon, soon.
00:49:21.800 | Yeah, like I don't want to promise a release date.
00:49:26.240 | Yeah, we do have a release date we're targeting,
00:49:29.640 | but I'm not sure if it's public.
00:49:32.880 | - Yeah.
00:49:33.720 | - Yeah, and how we're going to,
00:49:35.440 | like we're still going to continue
00:49:37.640 | like doing the open source,
00:49:40.280 | like making ComfyUI the best way to run
00:49:43.640 | like Stable Diffusion models,
00:49:45.840 | like at least the open source side,
00:49:47.840 | and like it's going to be best way to run models locally,
00:49:52.840 | but we will have a few,
00:49:54.840 | like a few things to make money from it,
00:49:57.480 | like cloud inference or like that type of thing.
00:50:02.480 | So, and maybe some, like some things for some enterprises.
00:50:08.120 | - I mean, a few questions on that.
00:50:09.840 | How do you feel about the other Comfy startups?
00:50:12.240 | - I mean, I think it's great.
00:50:14.640 | - They're using your name, you know.
00:50:15.640 | - Yeah, well, it's better to use Comfy
00:50:17.240 | than to use something else.
00:50:18.720 | - Yeah, that's true.
00:50:20.400 | - Yeah, like it's fine.
00:50:22.280 | I don't like, like, yeah, we're going to try not to,
00:50:27.280 | we don't want to,
00:50:29.000 | like we want them to, people to use Comfy,
00:50:31.760 | because like I said,
00:50:32.800 | it's better that people use Comfy than something else.
00:50:37.160 | So as long as they use Comfy,
00:50:39.600 | it's, I think it helps, it helps the ecosystem.
00:50:43.280 | And so, because more people,
00:50:45.000 | even if they don't contribute directly,
00:50:49.120 | the fact that they are using Comfy
00:50:51.280 | means that like people are more likely
00:50:54.040 | to like join the ecosystem.
00:50:56.200 | So, yeah.
00:50:58.240 | - And then would you ever do text?
00:51:00.040 | - Yeah, well, you can already do text
00:51:02.240 | with some custom nodes.
00:51:03.600 | So yeah, it's something where we like, yeah.
00:51:07.520 | It's something I've wanted to eventually add to core,
00:51:10.880 | it's more like, not a very high priority,
00:51:15.440 | but because a lot of people use text
00:51:17.880 | for like prompt enhancement and like other things like that.
00:51:21.280 | So it's, yeah, it's just that my focus
00:51:24.760 | has always been like diffusion models.
00:51:28.000 | Yeah, unless some text diffusion model comes out.
00:51:31.000 | - Yeah, David Holz is investing a lot in text diffusion.
00:51:34.280 | - Well, if a good one comes out,
00:51:35.920 | then well, I'll probably implement it
00:51:37.720 | since it fits with the whole.
00:51:39.120 | - Yeah, I mean, I imagine it's gonna be
00:51:41.400 | closed source at Midjourney, so.
00:51:42.880 | - Yeah, well, yeah, if an open one comes out.
00:51:46.080 | (laughing)
00:51:47.680 | Yeah, then yeah, I'll probably, yeah.
00:51:51.920 | Yeah, I'll probably implement it.
00:51:53.560 | - Cool, Comfy.
00:51:55.800 | Thanks so much for coming on.
00:51:57.480 | This was fun.
00:51:58.320 | (upbeat music)
00:52:01.920 | (upbeat music)
00:52:04.520 | (gentle music)