AI Engineering for Art - with comfyanonymous

00:00:00.000 | (upbeat music)

00:00:02.580 | - Hey everyone, welcome to the Latent Space Podcast.

00:00:06.880 | This is Alessio, partner and CTO at Decibel Partners,

00:00:09.500 | and I'm joined by my co-host Swix, founder of Small.ai.

00:00:12.440 | Hey everyone, we are in the Chroma Studio again,

00:00:15.680 | but with our first ever anonymous guest,

00:00:18.400 | Comfy Anonymous, welcome.

00:00:19.680 | I feel like that's your full name.

00:00:22.620 | You just go by Comfy, right?

00:00:24.320 | - Yeah, well a lot of people just call me Comfy,

00:00:26.880 | even though, even when they know my real name.

00:00:30.640 | Hey, Comfy.

00:00:32.600 | - Swix is the same.

00:00:34.520 | Not a lot of people call you Sean.

00:00:36.160 | - Yeah, you have a professional name, right?

00:00:37.760 | That people know you by, and then you have a legal name.

00:00:40.520 | Yeah, that's fine.

00:00:41.440 | How do I phrase this?

00:00:42.520 | I think people who are in the know,

00:00:44.200 | know that Comfy is like the tool for image generation

00:00:47.280 | and now other multimodality stuff.

00:00:49.480 | I would say that when I first got started

00:00:51.080 | with stable diffusion,

00:00:52.080 | the star of the show was Automatic 111, right?

00:00:55.520 | And I actually looked back at my notes from 2022-ish,

00:00:59.520 | like Comfy was already getting started back then,

00:01:01.400 | but it was kind of like the up and comer

00:01:02.760 | and like your main feature was the flowchart.

00:01:04.520 | Can you just kind of rewind to that moment that year

00:01:07.720 | and like, you know, how you looked at the landscape there

00:01:09.640 | and decided to start Comfy?

00:01:11.160 | - Yeah, I discovered stable diffusion in 2022,

00:01:14.800 | in October, 2022.

00:01:17.200 | And well, I kind of started playing around with it.

00:01:20.720 | Yes, I, and back then I was using Automatic,

00:01:24.120 | which was what everyone was using back then.

00:01:26.880 | And I, so I started with that.

00:01:30.120 | 'Cause I had the, 'cause when I started,

00:01:32.240 | I had no idea like how diffusion models work,

00:01:34.880 | how any of this works.

00:01:36.960 | - Oh yeah, what was your prior background as an engineer?

00:01:39.880 | - Just a software engineer.

00:01:42.120 | Yeah, boring software engineer.

00:01:44.320 | - But like any image stuff, any orchestration,

00:01:47.920 | distributed systems, GPUs?

00:01:50.040 | - No, I was doing basically nothing interesting.

00:01:54.320 | (laughs)

00:01:55.160 | - Crud, web development?

00:01:56.840 | - Yeah, a lot of web development.

00:01:58.720 | Just, yeah, some basic, maybe some basic

00:02:00.880 | like automation stuff and whatever.

00:02:03.640 | Just, yeah, no big companies or anything.

00:02:08.640 | - Yeah, but like already some interest in automations,

00:02:11.320 | probably a lot of Python.

00:02:12.720 | - Yeah, yeah, of course, Python.

00:02:14.760 | But like, I wasn't actually used to like

00:02:18.440 | the node graph interface.

00:02:20.840 | Before I started Confi UI, it was just,

00:02:24.120 | I just thought it was like, oh,

00:02:26.280 | like what's the best way to represent the diffusion process

00:02:30.840 | in a user interface?

00:02:32.080 | And then like, oh, well, like naturally,

00:02:34.480 | oh, this is the best way I found.

00:02:37.080 | And this was like with the node interface.

00:02:40.840 | So how I got started was, yeah.

00:02:44.360 | So basically October, 2022, just like,

00:02:48.440 | I hadn't written a line of PyTorch before that.

00:02:51.520 | So it's completely new.

00:02:54.320 | What happened was I kind of got addicted

00:02:56.560 | to generating images.

00:02:58.400 | - As we all did.

00:03:00.200 | - Yeah, and then I started experimenting

00:03:03.880 | with like the high-risk fixed in auto,

00:03:07.600 | which was, for those that don't know,

00:03:09.880 | the high-risk fix is just since the diffusion models

00:03:14.120 | back then could only generate at low resolution.

00:03:17.280 | So what you would do, you would generate

00:03:19.120 | low-resolution image, then upscale,

00:03:21.520 | then refine it again.

00:03:25.560 | And that was kind of the hack

00:03:27.280 | to generate high-resolution images.

00:03:30.640 | I really liked generating like higher-resolution images.

00:03:34.440 | So I was experimenting with that.

00:03:38.040 | And so I modified the code a bit.

00:03:40.920 | Okay, what happens if I use different samplers

00:03:45.120 | on the second pass?

00:03:46.520 | I must edit the code of auto.

00:03:49.920 | So what happens if I use a different sampler?

00:03:52.240 | What happens if I use a different,

00:03:55.040 | like a different settings,

00:03:56.880 | different number of steps?

00:03:58.920 | 'Cause back then the high-risk fix was very basic.

00:04:03.840 | - Yeah, now there's a whole library

00:04:06.920 | of just the upsamplers.

00:04:08.880 | - Yeah, I think they added a bunch of options

00:04:13.200 | to the high-risk fix since then.

00:04:16.840 | But before that, it was just so basic.

00:04:18.920 | So I wanted to go further.

00:04:21.160 | I wanted to try, okay, what happens

00:04:22.800 | if I use a different model for the second pass?

00:04:26.960 | And then, well, then the auto code base

00:04:30.040 | wasn't good enough for,

00:04:32.320 | like it would have been harder to implement that

00:04:36.160 | in the auto interface than to create my own interface.

00:04:40.720 | So that's when I decided to create my own.

00:04:44.640 | - And you were doing that mostly on your own

00:04:46.480 | when you started, or did you already have

00:04:47.920 | kind of like a subgroup of people?

00:04:49.160 | - No, it was on my own.

00:04:51.200 | 'Cause it was just me experimenting with stuff.

00:04:55.240 | So yeah, that was it.

00:04:58.120 | So I started writing the code January 1, 2023.

00:05:03.280 | And then I released the first version

00:05:06.160 | on GitHub January 16, 2023.

00:05:09.320 | That's how things got started.

00:05:11.840 | - And was the name Comfy UI right away?

00:05:14.040 | - Yeah, yeah, Comfy UI.

00:05:15.680 | The reason my name is Comfy

00:05:17.800 | is people thought my pictures were comfy.

00:05:20.680 | So I just named it, it's my Comfy UI.

00:05:25.680 | So yeah, that's.

00:05:28.040 | - Is there a particular segment of the community

00:05:29.840 | that you targeted as users?

00:05:31.640 | Like more intensive workflow artists,

00:05:34.760 | compared to the automatic crowd or, you know.

00:05:37.680 | - This was my way of like experimenting with new things.

00:05:42.680 | I like the high risk fix thing I mentioned,

00:05:45.440 | which was like in Comfy,

00:05:46.840 | the first thing you could easily do

00:05:48.680 | was just chain different models together.

00:05:51.840 | And then one of the first things,

00:05:53.240 | I think the first times it got a bit of popularity

00:05:56.720 | was when I started experimenting with different,

00:06:00.360 | like applying prompts to different areas of the image.

00:06:04.840 | Yeah, I called it area conditioning,

00:06:07.880 | posted it on Reddit and it got a bunch of upvotes.

00:06:11.640 | So I think that's when people first learned of Comfy UI.

00:06:16.640 | - Is that mostly like fixing hands?

00:06:19.800 | - No, no, no, that was just like, let's say,

00:06:22.880 | well, it was very,

00:06:24.400 | well, it still is kind of difficult to like,

00:06:26.800 | let's say you want a mountain,

00:06:29.640 | you have an image and then you're like,

00:06:31.080 | I want a mountain here and I want like a fox here.

00:06:36.080 | - Yeah, so compositing the image.

00:06:39.920 | - Yeah, my way was very easy.

00:06:41.840 | It was just like, oh, when you run the diffusion process,

00:06:46.160 | you kind of generate, okay,

00:06:48.720 | you do pass, one pass through the diffusion model,

00:06:52.040 | every step you do one pass, okay.

00:06:54.480 | This place of the image with this prompt,

00:06:57.080 | this place of the image with the other prompt

00:07:00.040 | and then the entire image with another prompt

00:07:03.400 | and then just average everything together, every step.

00:07:07.320 | And that was area composition, which I call it.

00:07:11.440 | And then a month later,

00:07:13.760 | there was a paper that came out called multi-diffusion,

00:07:16.600 | which was the same thing, but yeah.

00:07:19.120 | - Could you do area composition with different models

00:07:24.320 | or because you're averaging out,

00:07:25.840 | you kind of need the same model?

00:07:27.120 | - Could do it with, but yeah,

00:07:29.200 | I hadn't implemented it for different models,

00:07:32.200 | but you can do it with different models if you want,

00:07:36.720 | as long as the models share the same latent space.

00:07:40.040 | - We're supposed to ring a bell

00:07:42.720 | every time someone says latent space.

00:07:45.240 | - Yeah, like for example,

00:07:46.640 | you couldn't use like Excel and SD 1.5

00:07:50.240 | 'cause those have a different latent space,

00:07:52.480 | but like, yeah, like SD 1.5 models, different ones,

00:07:57.440 | you could do that.

00:08:00.080 | - There's some models that try to work

00:08:01.920 | in pixel space, right?

00:08:03.240 | - Yeah, they're very slow.

00:08:05.680 | - Of course.

00:08:06.520 | - That's the problem.

00:08:07.360 | That's the reason why stable diffusion

00:08:09.360 | actually became like popular,

00:08:11.920 | like was because of the latent space.

00:08:15.200 | - Small in, yeah.

00:08:16.480 | Because there used to be latent diffusion models

00:08:18.200 | and then they trained it up.

00:08:19.600 | - Yeah, 'cause the pixel diffusion models

00:08:22.640 | are just too slow, so.

00:08:25.000 | - Yeah.

00:08:25.840 | Have you ever tried to talk to like Stability,

00:08:28.720 | the latent diffusion guys,

00:08:30.040 | like, you know, Robin Rhombach, that crew?

00:08:32.320 | - Yeah, well, I used to work at Stability.

00:08:34.960 | - Oh, I actually didn't know.

00:08:35.840 | - Yeah, I used to work at Stability.

00:08:37.520 | I got hired in June, 2023.

00:08:42.520 | - Ah, that's the part of the story I didn't know about.

00:08:45.240 | Okay.

00:08:46.680 | - So the reason I was hired

00:08:48.200 | is because they were doing SDXL at the time.

00:08:51.680 | And they were basically, SDXL,

00:08:54.720 | I don't know if you remember,

00:08:55.720 | it was a base model and then a refiner model.

00:08:58.800 | Basically, they wanted to experiment

00:09:00.680 | like chaining them together.

00:09:02.640 | And then they saw, oh.

00:09:04.840 | - Right.

00:09:05.680 | - Come, oh, this, we can use this to do that.

00:09:08.040 | Well, let's hire that guy.

00:09:10.680 | - But they didn't pursue it for like SD3.

00:09:13.280 | - What do you mean?

00:09:14.120 | - Like the SDXL approach.

00:09:15.880 | - Yeah, the reason for that approach

00:09:18.400 | was because basically they had two models

00:09:23.040 | and then they wanted to publish both of them.

00:09:26.160 | So they trained one on lower time steps,

00:09:29.400 | which was the refiner model.

00:09:31.840 | And then the first one was trained normally.

00:09:35.760 | And then during their test, they realized,

00:09:38.040 | oh, like if we string these models together

00:09:41.560 | are like quality increases.

00:09:43.640 | So let's publish that.

00:09:45.120 | - It worked.

00:09:47.840 | - Yeah.

00:09:48.680 | But like right now, I don't think many people

00:09:51.000 | actually use the refiner anymore,

00:09:52.840 | even though it is actually a full diffusion model.

00:09:55.920 | Like you can use it on its own

00:09:57.880 | and it's gonna generate images.

00:10:00.040 | I don't think anyone,

00:10:01.280 | people have mostly forgotten about it, but.

00:10:05.680 | - Can we talk about models a little bit?

00:10:07.680 | So stable diffusion, obviously is the most known.

00:10:09.520 | I know Flux has gotten a lot of traction.

00:10:12.200 | Are there any underrated models that people should use more

00:10:15.320 | or what's the state of the union?

00:10:17.440 | - Well, the latest state of the art at least,

00:10:21.040 | yeah, for images, there's, yeah, there's Flux.

00:10:24.920 | There's also SD 3.5.

00:10:27.520 | SD 3.5 is two models.

00:10:29.400 | There's a small one, 2.5B and there's the bigger one, 8B.

00:10:34.400 | So it's smaller than Flux.

00:10:37.000 | So, and it's more creative in a way.

00:10:42.640 | But Flux, yeah, Flux is the best.

00:10:45.640 | People should give SD 3.5 a try 'cause it's different.

00:10:50.560 | I won't say it's better.

00:10:52.400 | Well, it's better for some like specific use cases.

00:10:55.800 | If you want some to make something more like creative,

00:10:59.320 | maybe SD 3.5.

00:11:00.760 | If you want to make something more consistent

00:11:03.400 | and Flux is probably better.

00:11:06.640 | - Do you ever consider supporting

00:11:08.200 | the closed source model APIs?

00:11:11.000 | - Well, we do support them with custom nodes.

00:11:14.720 | We actually have some official custom nodes

00:11:18.440 | from different--

00:11:19.960 | - Ideogram.

00:11:20.800 | - Yeah.

00:11:21.640 | - I guess Dolly would have one.

00:11:23.600 | - Yeah, it's just not, I'm not the person that handles that.

00:11:28.440 | - Sure, sure.

00:11:29.360 | Quick question on SD.

00:11:31.000 | There's a lot of community discussion

00:11:32.560 | about the transition from SD 1.5 to SD 2

00:11:36.200 | and then SD 2 to SD 3.

00:11:37.760 | People still like, you know,

00:11:39.120 | very loyal to the previous generations of SDs?

00:11:42.600 | - Yeah, SD 1.5 and still has a lot of users.

00:11:47.160 | - The last based model.

00:11:48.360 | - Yeah, then SD 2 was mostly ignored

00:11:53.120 | 'cause it wasn't a big enough improvement

00:11:57.200 | over the previous one.

00:11:58.840 | - Okay, so SD 1.5, SD 3, Flux and whatever else.

00:12:02.440 | - Yeah, SD XL.

00:12:03.280 | - SD XL.

00:12:04.120 | - SD XL, that's the main one.

00:12:04.960 | - Stable Cascade?

00:12:06.040 | - Stable Cascade, that was a good model,

00:12:08.480 | but the problem with that one is it got,

00:12:13.160 | like SD 3 was announced one week after.

00:12:16.320 | - Yeah, it was like a weird release.

00:12:18.920 | What was it like inside of Stability, actually?

00:12:21.440 | I mean, Statue of Limitations expired,

00:12:23.160 | you know, management has moved,

00:12:24.560 | so it's easier to talk about now.

00:12:27.200 | - Yeah, and inside Stability,

00:12:29.960 | actually that model was ready like three months before,

00:12:34.000 | but it got stuck in red teaming.

00:12:37.960 | So basically, if that model had released

00:12:40.920 | or was supposed to be released by the authors,

00:12:44.400 | then it would probably have gotten very popular

00:12:47.240 | since it's a step up from SD XL,

00:12:50.440 | but it got all of its momentum stolen

00:12:52.840 | by the SD 3 announcement,

00:12:54.680 | so people kind of didn't develop anything on top of it,

00:12:58.840 | even though it's, yeah.

00:13:01.320 | It was a good model, at least.

00:13:04.600 | Completely, mostly ignores for some reason, like.

00:13:07.880 | - It seemed, I think the naming as well matters.

00:13:11.040 | It seemed like a branch off of the main tree of development.

00:13:15.760 | - Yeah, well, it was different researchers that did it.

00:13:18.520 | - Different, yeah.

00:13:20.000 | - Very, like a good model,

00:13:22.920 | like it's the Worcestershire authors,

00:13:25.080 | I don't know if I'm pronouncing it correctly.

00:13:26.680 | - Worshen, yeah.

00:13:27.520 | - Worshen, yeah, yeah.

00:13:29.080 | - I actually met them in Vienna.

00:13:30.720 | Yeah, they worked at Stability for a bit

00:13:32.880 | and they left right after the Cascade release.

00:13:35.920 | - This is Dustin, right?

00:13:36.960 | - No.

00:13:37.800 | - Dustin's SD 3.

00:13:38.640 | - No, Dustin is SD 3, SD XL.

00:13:42.080 | That's Pablo and Domi.

00:13:45.960 | This, I think I'm pronouncing his name correctly.

00:13:49.240 | Yeah, that's very good.

00:13:51.960 | - It seems like the community is very,

00:13:54.200 | they move very quickly.

00:13:55.320 | - Yeah.

00:13:56.160 | - Like when there's a new model out,

00:13:57.000 | they just drop whatever the current one is

00:13:59.440 | and they just all move wholesale over,

00:14:01.200 | like they don't really stay

00:14:02.040 | to explore the full capabilities.

00:14:04.120 | Like if the Stable Cascade was that good,

00:14:06.520 | they would have maybe tested a bit more.

00:14:08.320 | Instead, they're like,

00:14:09.160 | "Okay, SD 3 is out, let's go."

00:14:10.800 | You know?

00:14:11.840 | - Well, I find the opposite actually.

00:14:14.200 | The community doesn't,

00:14:15.480 | like they only jump on a new model

00:14:17.360 | when there's a significant improvement.

00:14:19.280 | - I see.

00:14:20.120 | - Like if there's only like incremental improvement,

00:14:24.280 | which is what most of these models are going to have,

00:14:28.000 | especially if you stay the same parameter count.

00:14:32.440 | - Yeah.

00:14:33.280 | - Like you're not going to get a massive improvement

00:14:36.480 | into like, unless there's something big that changes, so.

00:14:41.080 | - Yeah.

00:14:41.920 | And how are they evaluating these improvements?

00:14:43.360 | Like, because it's a whole chain of, you know,

00:14:46.840 | comfy workflows.

00:14:47.840 | - Yeah.

00:14:48.680 | - How does one part of the chain

00:14:50.120 | actually affect the whole process?

00:14:52.560 | - Are you talking on the model side specific?

00:14:55.120 | - Model specific, right?

00:14:56.280 | But like, once you have your whole workflow

00:14:58.680 | based on a model, it's very hard to move.

00:15:02.040 | - Ah, not, well, not really.

00:15:04.480 | - Yeah, maybe not.

00:15:05.320 | - It depends on your,

00:15:06.560 | depends on the specifics and the workflow.

00:15:09.000 | - Yeah.

00:15:09.840 | Like, so I do a lot of like text and image.

00:15:12.000 | - Yeah.

00:15:13.000 | When you do change, like most workflows

00:15:16.440 | are kind of going to be compatible between different models.

00:15:19.840 | It's just like, you might have to completely change

00:15:21.920 | your prompt, completely change.

00:15:24.160 | - Okay, well, I mean,

00:15:25.000 | then maybe the question is really about evals.

00:15:26.520 | Like what does the Comfy community do for evals?

00:15:30.840 | Just, you know.

00:15:32.040 | - Well, they don't really do the,

00:15:34.680 | it's more like, oh, I think this image is nice.

00:15:37.400 | - Yeah.

00:15:38.240 | - So that's.

00:15:39.080 | - They just subscribe to Fulfur AI

00:15:41.040 | and just see like, you know, what Fulfur is doing.

00:15:43.520 | - Yeah, they just, they just generate like it,

00:15:46.640 | like, I don't see anyone really doing it.

00:15:49.120 | At least on the Comfy side, Comfy users,

00:15:52.360 | it's more like, oh, generate images

00:15:54.600 | and see, oh, this one's nice.

00:15:56.200 | - Yeah.

00:15:57.040 | - It's like.

00:15:57.880 | - Yeah, vibes.

00:15:58.720 | - Yeah, it's not like the more like scientific,

00:16:03.640 | like checking that's more on specifically on like model side.

00:16:08.640 | Yeah.

00:16:12.000 | But there is a lot of vibes also,

00:16:14.560 | 'cause it is like artistic.

00:16:17.800 | You can create a very good model

00:16:19.680 | that doesn't generate nice images.

00:16:23.520 | 'Cause most of the images on the internet are ugly.

00:16:26.480 | So if you, if that's like, if you just,

00:16:30.040 | oh, I have the best model that can generate.

00:16:33.160 | It's super smart.

00:16:34.960 | I created on all the, like I've trained on just

00:16:38.760 | all the images on the internet.

00:16:40.320 | The images are not gonna look good.

00:16:42.240 | So.

00:16:43.080 | - Yeah, yeah.

00:16:43.920 | - They're gonna be very consistent, but yeah.

00:16:46.800 | Like, it's not gonna be like the look

00:16:49.280 | that people are gonna be expecting from a model.

00:16:53.320 | So, yeah.

00:16:55.160 | - Can we talk about Loras?

00:16:56.560 | 'Cause we talk about models,

00:16:57.880 | then like the next step is probably Loras.

00:17:00.480 | Before, I actually, I'm kind of curious how Loras

00:17:02.600 | entered the tool set of the image community

00:17:05.480 | because the Lora paper was 2021.

00:17:08.240 | And then like, there was like other methods

00:17:09.760 | like textual inversion that was popular

00:17:11.920 | at the early SD stage.

00:17:13.440 | - Yeah, I can't even explain the difference between that.

00:17:16.200 | Textual inversions, that's basically what you're doing

00:17:19.720 | is you're training a, 'cause well, yeah.

00:17:22.560 | Stable diffusion, you have the diffusion model,

00:17:24.560 | you have the text encoder.

00:17:26.720 | So basically what you're doing is training a vector

00:17:31.720 | that you're gonna pass to the text decoder.

00:17:34.680 | It's basically you're training a new word.

00:17:37.360 | - Yeah, it's a little bit

00:17:38.200 | like representation engineering now.

00:17:40.040 | - Yeah, yeah.

00:17:40.880 | Basically, yeah.

00:17:41.720 | You're just, so yeah.

00:17:43.080 | If you know how like the text encoder works,

00:17:46.000 | basically you have, you take your words of your product,

00:17:51.000 | you convert those into tokens with the tokenizer

00:17:54.680 | and those are converted into vectors.

00:17:57.720 | Basically, yeah, each token represents a different vector.

00:18:01.200 | So each word presents a vector and those,

00:18:04.960 | depending on your words, that's the list of vectors

00:18:07.320 | that get passed to the text encoder,

00:18:09.440 | which is just, yeah, just a stack of attention.

00:18:14.440 | Like basically it's very close to LLM architecture.

00:18:19.720 | Yeah, yeah, so basically what you're doing

00:18:22.440 | is just training a new vector.

00:18:24.760 | We're saying, well, I have all these images

00:18:27.520 | and I want to know which word does that represent

00:18:32.520 | and it's gonna get, like you train this vector

00:18:34.960 | and then when you use this vector,

00:18:37.840 | it hopefully generates like something similar

00:18:42.000 | to your images.

00:18:42.920 | - Yeah, I would say it's like surprisingly sample efficient

00:18:46.000 | in picking up the concept

00:18:47.560 | that you're trying to train it on.

00:18:48.560 | - Yeah, well, people have kind of stopped doing that,

00:18:52.120 | even though back when I was at Stability,

00:18:55.080 | we actually did train internally some textual inversions

00:18:59.440 | on like T5XXL, actually worked pretty well,

00:19:03.640 | but for some reason, yeah, people don't use them.

00:19:07.840 | And also they might also work like, yeah,

00:19:11.880 | that's just something you'd probably have to test,

00:19:14.080 | but maybe if you train a textual inversion

00:19:17.120 | like on T5XXL, it might also work

00:19:19.600 | with all the other models that use T5XXL.

00:19:23.120 | 'Cause same thing with like the textual inversions

00:19:27.400 | that were trained for SD 1.5,

00:19:30.840 | they also kind of work on SDXL

00:19:33.280 | because SDXL has two text encoders

00:19:36.920 | and one of them is the same as the SD 1.5 clip out.

00:19:41.680 | So those, they actually, they don't work as strongly

00:19:45.000 | 'cause they're only applied to one of the text encoders,

00:19:47.600 | but, and the same thing for SD 3.3.

00:19:50.280 | SD 3.3 has three text encoders, so it works.

00:19:53.880 | It's still, you can still use

00:19:55.480 | your textual inversion SD 1.5 on SD 3,

00:19:58.720 | but it's just a lot weaker

00:20:00.760 | because now there's three text encoders,

00:20:02.640 | so it gets even more diluted, yeah.

00:20:05.960 | - Do people experiment a lot on, just on the clip side,

00:20:08.680 | there's like Siglip, there's Blip,

00:20:10.160 | like do people experiment a lot on those?

00:20:12.840 | - You can't really replace.

00:20:14.280 | - Yeah, 'cause they're trained together, right?

00:20:15.840 | - Yeah, they're trained together.

00:20:16.840 | So you can't, like, well,

00:20:18.800 | what I've seen people experimenting with is a long clip.

00:20:22.960 | So basically someone fine-tuned a clip model

00:20:26.280 | to accept longer prompts.

00:20:28.360 | - Oh, it's kind of like long context fine-tuning.

00:20:31.400 | - Yeah, so, so like it's,

00:20:33.520 | it's actually supported in core-comfy.

00:20:35.800 | - How long is long?

00:20:36.920 | - Regular clip is 77 tokens.

00:20:40.120 | Long clip is 256.

00:20:43.240 | - Okay.

00:20:44.080 | - But the hack that, like,

00:20:47.080 | if you use stable diffusion 1.5,

00:20:49.560 | you've probably noticed,

00:20:50.560 | oh, it still works if I use long prompts,

00:20:54.560 | prompts longer than 77 words.

00:20:56.600 | Well, that's because the hack is to just,

00:21:00.040 | well, you split, you split it up in chunks of 77,

00:21:04.200 | your whole big prompt,

00:21:06.040 | let's say you give it like the massive text,

00:21:09.280 | like the Bible or something,

00:21:12.160 | and it would split it up in chunks of 77

00:21:15.600 | and then just pass each one through the clip

00:21:19.240 | and then just cut everything together at the end.

00:21:23.600 | It's not ideal, but it actually works.

00:21:26.600 | - Like the positioning of the words

00:21:28.520 | really, really matters then, right?

00:21:30.120 | Like this is why order matters in prompts.

00:21:32.360 | - Yeah.

00:21:33.680 | Yeah, like it, it works, but it's,

00:21:36.400 | it's not ideal, but it's what people expect.

00:21:39.840 | Like if someone gives a huge prompt,

00:21:42.400 | they expect at least some of the concepts

00:21:45.080 | at the end to be like present in the image.

00:21:48.680 | But usually when they give long prompts,

00:21:50.640 | they don't, they like,

00:21:52.400 | they don't expect like detail, I think.

00:21:56.720 | So that's why it works very well.

00:21:59.160 | - And while we're on this topic,

00:22:00.640 | prompt weighting, negative prompting,

00:22:02.480 | all sort of similar part of this layer of the stack.

00:22:05.720 | - Yeah, the hack for that, which works on clip,

00:22:09.040 | like basically it's just for SD 1.5,

00:22:13.440 | well, for SD 1.5, the prompt weighting works well

00:22:16.800 | because clip L is a, it's not a very deep model.

00:22:21.800 | So you have a very high correlation

00:22:25.440 | between you have the input token,

00:22:28.720 | the index of the input token vector

00:22:31.640 | and the output token.

00:22:33.160 | They're very, the concepts are very close, closely linked.

00:22:37.480 | So that means if you interpolate the vector from what,

00:22:42.480 | well, the way Comfy UI does it,

00:22:45.280 | is it has, okay, you have the vector,

00:22:48.760 | you have an empty prompt.

00:22:51.440 | So you have a channel, like a clip output

00:22:54.480 | for the empty prompt,

00:22:55.560 | and then you have the one for your prompt.

00:22:57.920 | And then it interpolates from that,

00:23:00.040 | depending on your prompt weight,

00:23:02.600 | the weight of your tokens.

00:23:04.760 | So if you, yeah.

00:23:07.640 | So that's how it does prompt weighting,

00:23:11.000 | but this stops working the deeper your text encoder is.

00:23:16.000 | So on T5X, it doesn't work at all, so.

00:23:19.880 | - Wow, is that a problem for people?

00:23:22.240 | I mean, 'cause I'm used to just moving up numbers.

00:23:24.440 | - Probably not, is, well.

00:23:26.480 | - So you just use words to describe, right?

00:23:28.760 | 'Cause it's a bigger language model.

00:23:29.880 | - Yeah, yeah.

00:23:31.240 | So it might be good,

00:23:33.480 | but I haven't seen many complaints on Flux,

00:23:36.160 | that it's not working, so.

00:23:40.000 | 'Cause I guess people can sort of get around it

00:23:43.920 | with language, so.

00:23:46.480 | - Yeah. - Yeah.

00:23:47.520 | - And then coming back to Loras,

00:23:49.160 | now the popular way to customize models is Loras.

00:23:52.680 | And I saw you also support Lokon and Loha,

00:23:55.320 | which I've never heard of before.

00:23:56.360 | - There's a bunch of, 'cause what the Lora is essentially,

00:23:59.800 | is instead of like, okay, you have your model,

00:24:04.800 | and then you wanna fine tune it.

00:24:06.520 | So instead of, like, what you could do

00:24:08.880 | is you could fine tune the entire thing.

00:24:10.720 | - Yeah, full fine tune, yeah.

00:24:12.000 | - But that's a bit heavy.

00:24:15.000 | So to speed things up and make things less heavy,

00:24:18.800 | what you can do is just fine tune some smaller weights.

00:24:23.080 | Like basically two matrices,

00:24:26.800 | when you multiply like two low rank matrices,

00:24:30.320 | then when you multiply them together,

00:24:32.640 | gives a, represents a difference

00:24:35.680 | between trained weights and your base weights.

00:24:39.280 | So by training those two smaller matrices,

00:24:43.800 | that's a lot less heavy.

00:24:45.720 | - And they're portable.

00:24:47.000 | So you're gonna share them.

00:24:48.120 | - Yeah. - It's like easier.

00:24:48.960 | - And also smaller, yeah.

00:24:50.160 | That's how Loras work.

00:24:51.800 | So basically, so when inferencing,

00:24:54.800 | you can inference with them pretty efficiently,

00:24:57.600 | like how Compute-Wide does it.

00:24:59.360 | It just, when you use a Lora,

00:25:01.440 | it just applies it straight on the weights

00:25:04.200 | so that there's only a small delay at the base,

00:25:07.440 | like before the sampling to when it applies the weights,

00:25:10.600 | and then it just, same speed as before.

00:25:14.680 | So for inference, it's not that bad, but,

00:25:19.360 | and then you have, so basically all the Lora types,

00:25:22.880 | like LOHA, LOCA, and everything,

00:25:24.760 | that's just different ways of representing that.

00:25:28.720 | Like, basically you can call it kind of like compression,

00:25:32.920 | even though it's not really compression.

00:25:35.520 | It's just different ways of representing, like, just okay.

00:25:39.480 | I want to train a difference on the weights.

00:25:42.880 | What's the best way to represent that difference?

00:25:46.600 | There's the basic Lora, which is just,

00:25:48.240 | oh, let's multiply these two matrices together.

00:25:50.840 | And then there's all the other ones,

00:25:52.800 | which are all different algorithms.

00:25:55.720 | So, yeah.

00:25:58.200 | - So let's talk about what Confi UI actually is.

00:26:00.960 | I think most people have heard of it.

00:26:02.760 | Some people might've seen screenshots.

00:26:05.280 | I think fewer people have built very complex workflows.

00:26:08.040 | So when you started, automatic was like the super simple way.

00:26:12.600 | What were some of the choices that you made?

00:26:15.000 | So the node workflow, is there anything else

00:26:17.760 | that stands out as like, this was like a unique take

00:26:20.360 | on how to do image generation workflows?

00:26:22.480 | - Well, I feel like, yeah, back then,

00:26:24.160 | everyone was trying to make like easy to use interface.

00:26:29.160 | So I'm like, well, everyone's trying to make

00:26:31.400 | an easy to use interface.

00:26:32.800 | - Let's make a hard to use interface.

00:26:34.320 | (all laughing)

00:26:37.280 | - Like, so like, I don't need to do that.

00:26:40.480 | (all laughing)

00:26:42.520 | Everyone else doing it, so let me try something.

00:26:45.840 | Like, let me try to make a powerful interface

00:26:49.440 | that's not easy to use, so.

00:26:52.200 | - So like, yeah, there's a sort of node execution engine.

00:26:55.880 | Your readme actually lists, has this really good list

00:26:58.400 | of features of things you prioritize, right?

00:27:00.600 | Like, let me see, like sort of re-executing

00:27:03.720 | from many parts of this workflow that was changed,

00:27:06.760 | asynchronous queue system, smart memory management,

00:27:10.040 | like all this seems like a lot of engineering that.

00:27:12.240 | - Yeah, there's a lot of engineering

00:27:14.160 | in the backend to make things.

00:27:17.080 | 'Cause I was always focused on making things work locally

00:27:21.080 | very well, 'cause that's, 'cause I was using it locally.

00:27:24.640 | So everything, so there's a lot of thought and work

00:27:29.640 | and like getting everything to run as well as possible.

00:27:34.600 | So yeah, Confio is actually more of a backend,

00:27:39.200 | at least, well, now the front end's getting

00:27:42.280 | a lot more development, but before it was,

00:27:46.840 | I was pretty much only focused on the backend.

00:27:49.920 | - Yeah, so v0.1 was only August this year.

00:27:54.240 | - Yeah, before there was no versioning, so yeah.

00:27:58.200 | - And so what was the big rewrite for the 0.1

00:28:00.520 | and then the 1.0?

00:28:02.280 | - Well, that's more on the front end side.

00:28:05.480 | 'Cause before that, it was just like the UI, what,

00:28:09.960 | 'cause when I first wrote it, I just, I said, okay,

00:28:13.800 | how can I make, like, I can do web development,

00:28:17.360 | but I don't like doing it.

00:28:19.080 | Like, what's the easiest way I can slap

00:28:21.160 | a node interface on this?

00:28:22.800 | And then I found this library, LightGraph,

00:28:25.440 | like JavaScript library.

00:28:27.040 | - LightGraph? - LightGraph.

00:28:28.600 | - Usually people will go for like React flow,

00:28:30.400 | for like a flow builder.

00:28:31.240 | - Yeah, but that seems like too complicated.

00:28:33.880 | - 'Cause of React. (all laughing)

00:28:35.560 | - So I didn't really want to spend time

00:28:38.000 | like developing the front end, so I'm like,

00:28:40.720 | well, oh, LightGraph, this has the whole node interface.

00:28:44.840 | Well, okay, let me just plug that into my back end then.

00:28:49.440 | - I feel like if Streamlit or Gradio offered something,

00:28:51.640 | you would have used Streamlit or Gradio 'cause it's Python.

00:28:54.040 | - Yeah, Streamlit and Gradio.

00:28:55.680 | Gradio, I don't like Gradio.

00:28:57.360 | - Why? - It's bad.

00:28:58.400 | Like, that's one of the reasons why,

00:29:02.040 | like, automatic was very bad.

00:29:05.040 | It's great, 'cause the problem with Gradio,

00:29:08.320 | it forces you to, well, not forces you,

00:29:11.080 | but it kind of makes your interface logic

00:29:16.080 | and your back end logic and just sticks them together.

00:29:20.760 | - Well, it's supposed to be easy for you guys,

00:29:22.720 | if you're a Python main, you know, I'm a JS main, right?

00:29:24.920 | - Okay.

00:29:25.760 | - If you're a Python main, it's supposed to be easy.

00:29:26.760 | - Yeah, it's easy, but it makes your whole software

00:29:29.960 | a huge mess.

00:29:30.920 | - I see, I see.

00:29:31.760 | So you're mixing concerns instead of separating concerns?

00:29:34.280 | - Well, it's 'cause-

00:29:35.880 | - Like front end and back end.

00:29:36.880 | - Front end and back end should be well separated

00:29:39.560 | with a fine API.

00:29:41.160 | Like, that's how you're supposed to do it.

00:29:43.760 | - People, smart people disagree, but yeah.

00:29:46.680 | - It just sticks everything together.

00:29:49.120 | It makes it easy to, like, make a huge mess.

00:29:52.960 | And also it's, there's a lot of issues with Gradio.

00:29:57.960 | Like, it's very good if all you want to do

00:30:00.200 | is just get, like, slap a quick interface on your,

00:30:04.000 | like, to show off your, like, your ML project.

00:30:08.360 | Like, that's what it's made for.

00:30:10.280 | - Yeah, yeah.

00:30:11.120 | - Like, there's no problem using it, like,

00:30:13.520 | oh, I have my, I have my code.

00:30:16.160 | I just want a quick interface on it.

00:30:18.600 | That's perfect.

00:30:20.080 | Like, use Gradio.

00:30:21.080 | But if you want to make something that's like a real,

00:30:24.280 | like real software that will last a long time

00:30:28.320 | and will be easy to maintain, then I wouldn't avoid it.

00:30:32.520 | - Yeah, yeah.

00:30:33.480 | So your criticism is Streamlit and Gradio are the same.

00:30:36.080 | I mean, those are the same criticisms.

00:30:38.120 | - Yeah, Streamlit, I haven't.

00:30:40.800 | - Haven't used as much.

00:30:41.640 | - Yeah, it's just looked a bit.

00:30:43.800 | - Similar philosophy.

00:30:44.640 | - Yeah, it's similar.

00:30:45.640 | It's just, it just seems to me like, okay,

00:30:48.000 | for quick, like, AI demos, it's perfect.

00:30:51.200 | - Yeah.

00:30:52.040 | Going back to like the core tech, like asynchronous queues,

00:30:55.080 | slow re-execution, smart memory management, you know,

00:30:57.320 | anything that you were very proud of

00:30:59.160 | or was very hard to figure out?

00:31:00.720 | - Yeah, the thing that's the biggest pain in the ass

00:31:03.640 | is probably the memory management.

00:31:05.840 | - Yeah, were you just paging models in and out or?

00:31:08.360 | - Yeah, before it was just, okay, load the model,

00:31:11.840 | completely unload it, load the new model,

00:31:14.560 | completely unload it.

00:31:16.200 | Then, okay, that works well when your model are small,

00:31:19.800 | but if your models are big and it takes like,

00:31:22.960 | let's say someone has a, like a 4090

00:31:26.720 | and the model size is 10 gigabytes,

00:31:29.760 | that can take a few seconds to like load and load,

00:31:33.200 | so you want to try to keep things like in memory,

00:31:36.840 | in the GPU memory, as much as possible.

00:31:39.520 | What Comfy UI does right now is that,

00:31:43.120 | it tries to like estimate, okay, like,

00:31:45.800 | okay, you're going to sample this model.

00:31:48.040 | It's going to take probably this amount of memory.

00:31:51.200 | Let's remove the models, like this amount of memory

00:31:56.200 | that's been loaded on the GPU and then just execute it.

00:32:01.840 | But, so there's a fine line between just,

00:32:06.400 | 'cause try to remove the least amount of modelings

00:32:10.840 | that are already loaded,

00:32:13.080 | 'cause it adds like Windows driver.

00:32:18.080 | And another problem is the NVIDIA driver on Windows

00:32:22.160 | by default, because there's a way to,

00:32:24.640 | there's an option to disable that feature.

00:32:26.680 | But by default, it, like, if you start loading,

00:32:30.840 | you can overflow your GPU memory,

00:32:34.280 | and then it's, the driver's going to automatically

00:32:36.360 | start paging to RAM.

00:32:38.120 | But the problem with that is it,

00:32:39.720 | it makes everything extremely slow.

00:32:42.320 | So when you see people complaining,

00:32:44.000 | oh, this model, it works, but oh shit,

00:32:47.200 | it starts slowing down a lot.

00:32:49.160 | That's probably what's happening.

00:32:51.240 | So it's basically, you have to just try to get,

00:32:55.480 | use as much memory as possible, but not too much,

00:32:59.360 | or else things start slowing down,

00:33:01.520 | or people get out of memory.

00:33:03.760 | And then just find, try to find that line where,

00:33:07.680 | like the driver on Windows starts paging and stuff.

00:33:12.000 | - Yeah.

00:33:12.840 | - And yeah, the problem with PyTorch is it's,

00:33:15.520 | it's high levels, don't have that much fine grained control

00:33:19.040 | over like specific memory stuff.

00:33:22.240 | So kind of have to leave like the memory freeing

00:33:26.280 | to Python and PyTorch, which is, can be annoying sometimes.

00:33:31.280 | - So, you know, I think one thing as a maintainer

00:33:35.240 | of this project, like you're designing

00:33:37.040 | for a very wide surface area of compute.

00:33:40.600 | Like you even support CPUs.

00:33:42.320 | - Yeah, well that's, that's just for fun.

00:33:44.720 | PyTorch CPUs, so yeah, it's just,

00:33:48.240 | that's not, that's not hard to support.

00:33:50.160 | - First of all, is there a market share estimate?

00:33:51.720 | Like, is it like 70% NVIDIA, like 30% AMD,

00:33:55.040 | and then like miscellaneous on Apple Silicon or whatever?

00:33:58.440 | - For Comfy?

00:34:00.640 | - Yeah.

00:34:01.480 | - Yeah, I don't know the market share.

00:34:04.080 | - Can you guess?

00:34:05.480 | - I think it's mostly NVIDIA.

00:34:07.320 | - Yeah.

00:34:08.600 | - 'Cause AMD, the problem, like AMD works horribly

00:34:11.720 | on Windows.

00:34:12.920 | Like on Linux, it works fine.

00:34:15.280 | It's slower than the price equivalent NVIDIA GPU,

00:34:19.600 | but it works, like you can use it, generate images,

00:34:23.680 | everything works.

00:34:25.120 | On Linux, on Windows, you might have a hard time.

00:34:28.280 | So that's the problem.

00:34:29.720 | And most people, I think most people who bought AMD

00:34:34.720 | probably use Windows.

00:34:36.960 | (laughing)

00:34:38.360 | They probably aren't gonna switch to Linux.

00:34:41.200 | (laughing)

00:34:42.760 | So until AMD actually like ports their like raw cam

00:34:47.760 | to Windows properly, and then there's actually PyTorch.

00:34:52.280 | I think they're doing that.

00:34:54.280 | They're in the process of doing that,

00:34:56.080 | but until they get a good like PyTorch raw cam build

00:35:01.080 | that works on Windows, it's like,

00:35:04.400 | they're gonna have a hard time.

00:35:06.160 | - Yeah.

00:35:07.000 | - We gotta get George on it.

00:35:08.200 | - Yeah, well, he's trying to get Lisa Sui to do it.

00:35:10.920 | But let's talk a bit about like the node design.

00:35:14.920 | So unlike all the other text to image,

00:35:17.240 | you have a very like deep.

00:35:19.280 | So you have like a separate node for like clip and code.

00:35:22.120 | You have a separate node for like the case sampler.

00:35:24.440 | You have like all these nodes.

00:35:25.880 | Going back to like making it easy versus making it hard.

00:35:28.320 | But like, how much do people actually play

00:35:30.640 | with all the settings?

00:35:31.960 | You know, kind of like, how do you guide people to like,

00:35:33.680 | hey, this is actually gonna be very impactful

00:35:35.840 | versus this is maybe like less impactful,

00:35:38.280 | but we still wanna expose it to you.

00:35:40.080 | - Well, I try to expose like,

00:35:44.520 | I try to expose everything or that's, yeah.

00:35:49.240 | At least for the, but for things like, for example,

00:35:51.760 | for the samplers, like there's like, yeah,

00:35:55.240 | four different sampler nodes,

00:35:57.400 | which go in easiest to most advanced.

00:36:01.720 | So yeah, if you go like the easy node,

00:36:04.600 | the regular sampler node,

00:36:06.120 | that's you have just the basic settings.

00:36:09.160 | But if you use like the sampler advanced,

00:36:11.800 | custom advanced node, that one you can actually,

00:36:15.280 | you'll see you have like different nodes.

00:36:19.520 | - I'm looking it up now.

00:36:20.920 | - Yeah.

00:36:21.760 | - What are like the most impactful parameters that you use?

00:36:25.880 | So it's like, you know, you can have more,

00:36:27.480 | but like, which ones like really make a difference?

00:36:30.040 | - CFG.

00:36:30.880 | - Yeah, they all do.

00:36:31.800 | They all have their own like, they all like, for example,

00:36:35.680 | yeah, steps, usually you want steps,

00:36:38.760 | you want them to be as low as possible,

00:36:41.800 | but you want, if you're optimizing your workflow,

00:36:45.760 | you want to, you lower the steps

00:36:47.600 | until like the image has started deteriorating too much.

00:36:52.440 | 'Cause that, yeah, that's the number of steps

00:36:55.200 | you're running the diffusion process.

00:36:57.320 | So if you want things to be faster, lower is better.

00:37:02.280 | But yeah, CFG, that's more,

00:37:05.040 | you can kind of see that as the contrast of the image.

00:37:09.400 | Like if your image looks too burnt out,

00:37:12.000 | then you lower the CFG.

00:37:13.920 | So yeah, CFG, that's how, yeah,

00:37:16.440 | that's how strongly the, like the negative

00:37:20.240 | versus positive problem.

00:37:22.760 | 'Cause when you sample a diffusion model,

00:37:24.760 | it's basically a negative prompt.

00:37:27.320 | It's just, yeah, positive prediction

00:37:30.320 | minus negative prediction.

00:37:33.120 | - Contrastive loss.

00:37:34.000 | - Yeah, so it's positive minus negative,

00:37:36.880 | and the CFG, that's the multiplier.

00:37:39.040 | - Yeah.

00:37:40.240 | - Yeah, so.

00:37:41.080 | - What are like good resources

00:37:42.640 | to understand what the parameters do?

00:37:45.080 | I think most people start with automatic,

00:37:47.240 | and then they move over and it's like,

00:37:48.920 | step, CFG, sampler, name, scheduler, denoise.

00:37:51.920 | - Read it.

00:37:52.760 | - Honestly, well, it's more,

00:37:56.000 | it's something you should like try out yourself.

00:37:59.480 | I don't know, you don't necessarily need to know

00:38:02.360 | how it works to like what it does.

00:38:05.240 | 'Cause even if you know like CFGO,

00:38:07.600 | it's like positive minus negative problem.

00:38:10.320 | - Yeah.

00:38:11.160 | - So the only thing you know at CFG is if it's 1.0,

00:38:14.240 | then that means the negative prompt isn't applying.

00:38:17.360 | And also maybe sampling is two times faster, but.

00:38:20.600 | - Yeah, yeah.

00:38:21.480 | - Yeah, but other than that,

00:38:22.760 | it's more like you should really just see

00:38:26.080 | what it does to the images yourself,

00:38:27.960 | and you'll probably get a more intuitive understanding

00:38:32.280 | of what these things do.

00:38:33.960 | - Any other notes or things you want to shout out?

00:38:37.920 | Like I know the AnimateDiff IP adapter,

00:38:40.360 | those are like some of the most popular ones.

00:38:42.920 | - Yeah, what else comes to mind?

00:38:44.840 | - I don't have notes, but there's,

00:38:47.240 | like what I like is when some people,

00:38:49.560 | sometimes they make things that use Confi UI

00:38:53.160 | as their backend, like there's a plugin for Krita

00:38:58.160 | that uses Confi UI as its backend.

00:39:02.760 | So you can use like all the models

00:39:05.600 | that work in Confi in Krita.

00:39:07.640 | And I think I've tried it once,

00:39:10.880 | but I know a lot of people use it,

00:39:13.440 | and find it really nice, so.

00:39:15.960 | - What's the craziest node that people have built,

00:39:19.080 | like the most complicated?

00:39:21.320 | - Craziest node, like yeah,

00:39:23.720 | I know some people have made like video games in Confi,

00:39:28.400 | with like stuff like that.

00:39:31.760 | So like someone, like I remember,

00:39:35.160 | like yeah, I think it was last year,

00:39:38.240 | someone made like Wolfenstein 2 in Confi,

00:39:43.240 | and then one of the inputs was,

00:39:45.920 | oh, you can generate a texture,

00:39:47.720 | and then it changes the texture in the game.

00:39:51.400 | So I could plug it to like the workflow.

00:39:54.320 | And there's a lot of, if you look there,

00:39:56.440 | there's a lot of crazy things people do, so yeah.

00:40:00.320 | - And now there's like a node register

00:40:02.120 | that people can use to like download nodes.

00:40:04.760 | - Yeah, well, there's always been the Confi UI manager,

00:40:08.360 | but we're trying to make this more like official,

00:40:12.840 | like with the node registry,

00:40:17.360 | 'cause before the node registry,

00:40:20.920 | like okay, how did your custom node get into Confi UI manager?

00:40:24.560 | That's the guy running it who like every day,

00:40:27.720 | he searched GitHub for new custom nodes,

00:40:29.880 | and added them manually to his custom node manager.

00:40:34.520 | So we're trying to make it less effort for him, basically.

00:40:39.520 | - Yeah, but I was looking,

00:40:41.840 | I mean, there's like a YouTube download node.

00:40:44.640 | There's like, this is almost like a data pipeline,

00:40:48.080 | more than like an image generation thing at this point.

00:40:50.120 | It's like you can get data in,

00:40:51.400 | you can like apply filters to it,

00:40:52.880 | you can generate data out.

00:40:54.800 | - Yeah, you can do a lot of different things.

00:40:57.760 | - Yeah, something I think,

00:40:59.200 | what I did is I made it easy to make custom nodes.

00:41:04.960 | So I think that that helped a lot for the ecosystem,

00:41:09.760 | 'cause it is very easy to just make a node.

00:41:12.960 | So yeah, a bit too easy sometimes.

00:41:16.880 | (laughing)

00:41:18.040 | Then we have the issue where there's a lot

00:41:20.960 | of custom node packs, which share similar nodes.

00:41:25.520 | But well, that's, yeah, something we're trying to solve

00:41:30.320 | by maybe bringing some of the functionality into core.

00:41:34.720 | - Yeah, yeah, yeah.

00:41:35.840 | - Yeah.

00:41:36.680 | - And then there's like video,

00:41:38.640 | people can do video generation.

00:41:40.440 | - Yeah, video, that's, well, the first video model

00:41:44.480 | was like stable video diffusion,

00:41:47.080 | which was last, yeah, exactly last year, I think.

00:41:50.760 | Like one year ago, but that wasn't a true video model.

00:41:54.080 | So it was--

00:41:55.920 | - It was like moving images.

00:41:57.560 | - Yeah, it generated video.

00:41:58.920 | What I mean by that is it's like,

00:42:01.080 | it's still 2D latency.

00:42:04.320 | It's basically what they did is they took SD2

00:42:08.040 | and then they added some temporal attention to it

00:42:11.840 | and then trained it on videos.

00:42:14.360 | So it's kind of like animate diff,

00:42:18.280 | like same idea, basically.

00:42:21.960 | Why I say it's not a true video model

00:42:24.440 | is that you still have like the 2D latency.

00:42:27.440 | Like a true video model, like Mochi, for example,

00:42:30.680 | would have 3D latency.

00:42:32.520 | So--

00:42:33.360 | - Which means you can like move through the space,

00:42:35.040 | basically, it's the difference.

00:42:36.680 | You're not just kind of like reorienting.

00:42:38.880 | - Yeah, and it's also, well,

00:42:40.240 | it's also because you have a temporal VAE.

00:42:42.960 | Also, like Mochi has a temporal VAE

00:42:45.880 | that compresses on like the temporal direction also.

00:42:51.000 | So that's something you don't have with like,

00:42:52.880 | yeah, animate diff and stable video diffusion.

00:42:56.080 | They only like compress spatially, not temporally.

00:43:00.240 | So yeah, so these models,

00:43:02.040 | that's why I call them like true video models.

00:43:04.920 | There's actually a few of them,

00:43:07.200 | but the one I've implemented in Comfy is Mochi

00:43:11.680 | 'cause that seems to be the best one so far.

00:43:15.960 | - We had AJ come and speak at the stable diffusion meetup.

00:43:19.200 | Other open one I think I've seen is Kog Video.

00:43:21.640 | - Yeah, Kog Video.

00:43:22.880 | Yeah, that one's, yeah, it also seems decent, but yeah.

00:43:27.360 | - We're Chinese, so we don't use it.

00:43:29.040 | (all laughing)

00:43:29.880 | - No, it's fine, it's just, yeah, I could,

00:43:31.840 | yeah, it's just that there's a, it's not the only one.

00:43:34.880 | There's also a few others, which I--

00:43:37.160 | - The rest are like closed source, right?

00:43:38.920 | Like Cling--

00:43:39.760 | - Yeah, the closed source, there's a bunch of them,

00:43:41.600 | but I mean, open, I've seen a few of them,

00:43:45.560 | like, yeah, I can't remember their names,

00:43:47.680 | but there's Kog Videos, the big one,

00:43:50.640 | then there's also a few of them

00:43:52.520 | that released at the same time.

00:43:55.080 | There's one that released at the same time

00:43:57.480 | as SD3.5, same day, which is why I don't remember the name.

00:44:01.800 | (all laughing)

00:44:03.120 | - We should have a release schedule

00:44:04.280 | so we don't conflict on each of these things.

00:44:06.080 | - Yeah, I think SD3.5 and Mochi released on the same day.

00:44:10.160 | So everything else was kind of drowned,

00:44:13.240 | completely drowned out.

00:44:14.480 | So for some reason, lots of people picked that day

00:44:17.920 | to release their stuff.

00:44:19.400 | (all laughing)

00:44:21.520 | Yeah, which is, well, shame for those, I think, guys.

00:44:25.320 | I think Omnijet also released the same day,

00:44:27.640 | which also seems interesting, but yeah.

00:44:30.680 | - Yeah, what's Comfy?

00:44:32.080 | So you are Comfy, and then there's like Comfy.org.

00:44:35.560 | I know we do a lot of things for like news research,

00:44:37.520 | and those guys also have kind of like a more open source

00:44:40.400 | and on thing going on.

00:44:42.120 | How do you work?

00:44:43.680 | Like you mentioned,

00:44:44.520 | you must have worked on like the core piece of it,

00:44:47.240 | and then what?

00:44:48.080 | - Maybe I should fit in because, yeah,

00:44:50.640 | I feel like maybe, yeah,

00:44:52.520 | I only explained part of the story.

00:44:54.880 | - Right.

00:44:55.880 | - Yeah, maybe I should explain the rest, so yeah.

00:44:58.240 | So yeah, basically January,

00:45:01.000 | that's when the first, January 2023,

00:45:04.320 | January 16th, 2023,

00:45:05.720 | that's when Comfy was first released to the public.

00:45:10.400 | Then, yeah, did a Reddit post

00:45:12.880 | about the area composition thing somewhere in,

00:45:15.680 | I don't remember exactly,

00:45:18.360 | maybe end of January, beginning of February.

00:45:21.440 | And then somewhere, a YouTuber made a video about it,

00:45:26.440 | like Olivia, he made a video about Comfy in March, 2023.

00:45:31.840 | I think that's when it was real burst of attention.

00:45:36.280 | And by that time I was continued to developing it

00:45:40.480 | and it was getting,

00:45:42.720 | people were starting to use it more,

00:45:45.400 | which unfortunately meant that I had first written it

00:45:50.400 | to do like experiments,

00:45:52.400 | but then my time to do experiments

00:45:55.720 | when it started going down,

00:45:57.120 | yeah, 'cause yeah,

00:46:00.480 | people were actually starting to use it then,

00:46:02.680 | like I had to.

00:46:04.320 | And I said, well, yeah,

00:46:05.800 | time to add all these features and stuff.

00:46:09.360 | Yeah, and then I got hired by Stability, June, 2023.

00:46:13.760 | Then I made the, basically, yeah,

00:46:16.080 | they hired me 'cause they wanted the SDXL.

00:46:19.440 | So I got SDXL working very well in Comfy UI

00:46:23.680 | because they were experimenting with it.

00:46:26.880 | Actually the SDX, how the SDXL released worked

00:46:29.840 | is they released for some reason,

00:46:32.080 | they released the code first,

00:46:34.800 | but they didn't release the model checkpoint.

00:46:37.360 | Oh yeah, so they released the code.

00:46:39.800 | And then, well, since the research was with the code,

00:46:42.880 | I released the code in Comfy too.

00:46:45.400 | And then the checkpoints were basically early access.

00:46:49.360 | People had to sign up

00:46:50.840 | and they only allowed a lot of people from edu emails.

00:46:55.840 | Like if you had the edu email,

00:46:58.160 | like they gave you access basically to the zero SDXL 0.9.

00:47:03.160 | And well, that leaked, of course,

00:47:09.520 | because of course it's gonna leak if you do that.

00:47:13.240 | Well, the only way people could easily use it

00:47:15.640 | was with Comfy.

00:47:16.920 | So yeah, people started using it.

00:47:19.320 | And then I fixed a few of the issues that people had.

00:47:22.240 | So then the big 1.0 release happened.

00:47:26.800 | And well, Comfy UI was the only way a lot of people

00:47:31.280 | could actually run it on their computers.

00:47:33.880 | 'Cause it just like automatic was so like inefficient

00:47:38.680 | and bad that most people couldn't act

00:47:41.880 | like it just wouldn't work.

00:47:44.920 | Like, 'cause he did a quick implementation.

00:47:47.640 | So people were forced to use Comfy UI.

00:47:50.480 | And that's how it became popular

00:47:52.320 | because people had no choice.

00:47:54.440 | (all laughing)

00:47:56.000 | - The Grove hack.

00:47:56.840 | - Yeah.

00:47:57.680 | - Yeah.

00:47:58.500 | Like everywhere, like people who didn't have the 4090,

00:48:01.000 | they had like, who had just regular GPUs.

00:48:03.520 | - Yeah, yeah.

00:48:04.360 | - They didn't have a choice, so.

00:48:06.480 | - Yeah, I got a 4070, so think of me.

00:48:09.480 | And so today, what's, is there like a core Comfy team or?

00:48:14.240 | - Yeah, well, right now, yeah, we are hiring actually.

00:48:19.240 | So right now core, like the core core itself, it's me.

00:48:25.120 | Yeah, but because the reason we're focused,

00:48:27.640 | like all the focus has been mostly

00:48:30.120 | on the front end right now,

00:48:31.480 | 'cause that's the thing that's been neglected

00:48:34.400 | for a long time.

00:48:36.200 | So most of the focus right now is all on the front end,

00:48:41.200 | but we are, yeah, we will soon get more people

00:48:46.040 | to like help me with the actual backend stuff.

00:48:49.720 | Because that's, once we have our V1 release,

00:48:54.120 | which is, 'cause it'd be the package Comfy Y

00:48:57.240 | with the nice interface and easy to install on Windows,

00:49:02.240 | and hopefully Mac.

00:49:04.160 | Yeah, once we have that, we're going to have to,

00:49:08.800 | lots of stuff to do on the backend side

00:49:11.400 | and also the front end side, but yeah.

00:49:14.960 | - What's the release date?

00:49:15.960 | I'm on the wait list.

00:49:17.000 | What's the timing?

00:49:18.560 | - Soon, soon.

00:49:21.800 | Yeah, like I don't want to promise a release date.

00:49:26.240 | Yeah, we do have a release date we're targeting,

00:49:29.640 | but I'm not sure if it's public.

00:49:32.880 | - Yeah.

00:49:33.720 | - Yeah, and how we're going to,

00:49:35.440 | like we're still going to continue

00:49:37.640 | like doing the open source,

00:49:40.280 | like making Comfy Y the best way to run

00:49:43.640 | like stable infusion models,

00:49:45.840 | like at least the open source side,

00:49:47.840 | and like it's going to be best way to run models locally,

00:49:52.840 | but we will have a few,

00:49:54.840 | like a few things to make money from it,

00:49:57.480 | like cloud inference or like that type of thing.

00:50:02.480 | So, and maybe some, like some things for some enterprises.

00:50:08.120 | - I mean, a few questions on that.

00:50:09.840 | How do you feel about the other Comfy startups?

00:50:12.240 | - I mean, I think it's great.

00:50:14.640 | - They're using your name, you know.

00:50:15.640 | - Yeah, well, it's better to use Comfy

00:50:17.240 | than to use something else.

00:50:18.720 | - Yeah, that's true.

00:50:20.400 | - Yeah, like it's fine.

00:50:22.280 | I don't like, like, yeah, we're going to try not to,

00:50:27.280 | we don't want to,

00:50:29.000 | like we want them to, people to use Comfy,

00:50:31.760 | because like I said,

00:50:32.800 | it's better that people use Comfy than something else.

00:50:37.160 | So as long as they use Comfy,

00:50:39.600 | it's, I think it helps, it helps the ecosystem.

00:50:43.280 | And so, because more people,

00:50:45.000 | even if they don't contribute directly,

00:50:49.120 | the fact that they are using Comfy

00:50:51.280 | means that like people are more likely

00:50:54.040 | to like join the ecosystem.

00:50:56.200 | So, yeah.

00:50:58.240 | - And then would you ever do text?

00:51:00.040 | - Yeah, well, you can already do text

00:51:02.240 | with some custom nodes.

00:51:03.600 | So yeah, it's something where we like, yeah.

00:51:07.520 | It's something I've wanted to eventually add to core,

00:51:10.880 | it's more like, not a very high priority,

00:51:15.440 | but because a lot of people use text

00:51:17.880 | for like prompt enhancement and like other things like that.

00:51:21.280 | So it's, yeah, it's just that my focus

00:51:24.760 | has always been like diffusion models.

00:51:28.000 | Yeah, unless some text diffusion model comes out.

00:51:31.000 | - Yeah, David Holtz is investing a lot in text diffusion.

00:51:34.280 | - Well, if a good one comes out,

00:51:35.920 | then well, I'll probably implement it

00:51:37.720 | since it fits with the whole.

00:51:39.120 | - Yeah, I mean, I imagine it's gonna be

00:51:41.400 | close source to my journey, so.

00:51:42.880 | - Yeah, well, yeah, if an open one comes out.

00:51:46.080 | (laughing)

00:51:47.680 | Yeah, then yeah, I'll probably, yeah.

00:51:51.920 | Yeah, I'll probably implement it.

00:51:53.560 | - Cool, Comfy.

00:51:55.800 | Thanks so much for coming on.

00:51:57.480 | This was fun.

00:51:58.320 | (upbeat music)

00:52:01.920 | (upbeat music)

00:52:04.520 | (gentle music)

00:52:07.100 | you

AI Engineering for Art - with comfyanonymous

Chapters