The AI-First Graphics Editor - with Suhail Doshi of Playground AI
Chapters
0:00 Introductions
0:56 Suhail's background (Mixpanel, Mighty)
9:00 Transition from Mighty to exploring AI and generative models
10:24 The viral moment for generative AI with DALL-E 2 and Stable Diffusion
17:58 Training Playground v2 from scratch
27:52 The MJHQ-30K benchmark for evaluating a model's aesthetic quality
30:59 Discussion on styles in AI-generated images and the categorization of styles
43:18 Tackling edge cases from UIs to NSFW
49:47 The user experience and interface design for AI image generation tools
54:50 Running their own infrastructure for GPU DevOps
57:13 The ecosystem of LoRAs
62:07 The goals and challenges of building a graphics editor with AI integration
64:44 Lightning Round
And I'm joined by my co-host Swyx, founder of Smol.ai. 00:00:09.960 |
- Hey, and today in the studio we have Suhail Doshi. 00:00:14.320 |
- Among many things, you're a CEO and co-founder 00:00:22.640 |
- And more recently, I think about a year ago, 00:00:31.160 |
I'd just like to start, touch on Mixpanel a little bit, 00:00:40.640 |
And I'm curious if you had any sort of reflections 00:00:52.120 |
Like, I don't know if there's still a part of you 00:00:58.560 |
the short version is that maybe back in like 2015 or '16, 00:01:05.800 |
'cause it was a while ago, we had an ML team at Mixpanel. 00:01:08.900 |
And I think this is like when maybe deep learning 00:01:22.460 |
So we built, you know, two or three different features. 00:01:24.480 |
I think we built a feature where we could predict 00:01:41.160 |
maybe a spike in traffic in a particular region. 00:01:46.360 |
'Cause it's really hard to like know everything 00:01:49.080 |
Could we tell you something surprising about your data? 00:01:53.700 |
Most of it boiled down to just like, you know, 00:01:58.480 |
And it never quite seemed very groundbreaking in the end. 00:02:03.280 |
And so I think, you know, we had a four or five person ML team 00:02:16.640 |
- Yeah, that was the first time I did fast AI. 00:02:18.320 |
Yeah, I think I've done it now three times maybe. 00:02:23.160 |
- No, no, just me reviewing it is maybe three times, 00:02:29.160 |
but honestly like it's also just about the feedback, right? 00:02:34.900 |
I think it's useful for anyone building AI applications. 00:02:40.560 |
- Yeah, I think I haven't spent a lot of time 00:02:42.760 |
thinking about Mixpanel 'cause it's been a long time, 00:02:44.440 |
but yeah, I wonder now, given everything that's happened, 00:02:51.920 |
And then I kind of like move on to whatever I'm working on, 00:02:59.560 |
And then maybe we'll touch on Mighty a little bit. 00:03:03.400 |
It was basically, well, my framing of it was, 00:03:10.820 |
I have too many tabs open and it's slowing down my machine 00:03:15.480 |
- Yeah, we were first trying to make a browser 00:03:17.960 |
that we would stream from a data center to your computer 00:03:22.680 |
But the real objective wasn't trying to make a browser 00:03:30.760 |
And the thought was just that, like, you know, 00:03:37.840 |
or they don't have enough RAM or not enough disk, 00:03:39.640 |
or, you know, there's some limitation with our computers, 00:03:46.360 |
Could we, you know, why do I need to think about 00:03:50.760 |
And so, you know, we just had to kind of observe that, 00:03:52.800 |
like, well, actually, it seems like a lot of applications 00:03:57.600 |
You know, it's like how many real desktop applications 00:04:00.800 |
do we use relative to the number of applications 00:04:03.800 |
So it was just this realization that actually, like, 00:04:05.920 |
you know, the browser was effectively becoming 00:04:10.840 |
And so then that's why we kind of decided to go, 00:04:18.040 |
but the objective is to try to make a true new computer. 00:04:26.960 |
- I think the last, or one of the last in-person ones, 00:04:36.080 |
when everybody wants to put some of these models 00:04:40.960 |
Do you think there's maybe another wave of the same problem 00:04:46.200 |
and now it's like models too slow to run on device? 00:04:52.520 |
but a lot of what I somewhat believed at Mighty 00:04:58.760 |
Maybe why I'm so excited about AI and what's happening. 00:05:03.440 |
was like moving compute somewhere else, right? 00:05:07.920 |
they get limited quantities of memory, disk, networking, 00:05:14.920 |
You know, what if these applications could somehow, 00:05:18.440 |
and then these applications have vastly more compute 00:05:22.120 |
Right now it's just like client backend services, 00:05:24.920 |
but you know, what if we could change the shape 00:05:27.280 |
of how applications could interact with things? 00:05:33.400 |
In some ways, AI has like a bit of a continuation 00:05:38.120 |
perhaps we can really shift compute somewhere else. 00:05:43.120 |
was that JavaScript is single-threaded in the browser. 00:05:53.760 |
We could have made some kind of enterprise business, 00:05:56.080 |
probably could have made maybe a lot of money, 00:05:59.000 |
but it wasn't going to be what I hoped it was going to be. 00:06:01.520 |
And so once I realized that most of a web app 00:06:05.440 |
is just going to be single-threaded JavaScript, 00:06:20.480 |
two of which sell, you know, big ones, you know, 00:06:23.120 |
AMD, Intel, and then of course, like Apple made the M1. 00:06:26.440 |
And it's not like single-threaded CPU core performance. 00:06:30.240 |
Single core performance was like increasing very fast. 00:06:38.560 |
sort of with the continuation of Moore's law. 00:06:40.640 |
But what happened in AI was that you got like, 00:06:43.480 |
like if you think of the AI model as like a computer program 00:07:01.120 |
I can make computation happen really rapidly 00:07:09.200 |
really amazing models that can like do anything. 00:07:19.000 |
into these like really amazing AI models in reality. 00:07:23.440 |
Like I think Andrej Karpathy has always been, 00:07:25.760 |
has been making a lot of analogies with the LLM OS. 00:07:38.680 |
- Yeah, I think, I think there still will be, 00:07:45.760 |
Yeah, I think it just depends on kind of like 00:07:48.440 |
like any, like any engineer would probably care about. 00:07:54.400 |
like if the models continue to kind of keep getting bigger, 00:07:59.760 |
whether you should use the big thing or the small, 00:08:08.320 |
Maybe that would be hard to do, you know, over a network. 00:08:12.240 |
- Yeah, you tackle the much harder problem latency-wise, 00:08:16.520 |
you know, than the AI models actually require. 00:08:22.360 |
You know, you definitely did 30 FPS video streaming, 00:08:30.720 |
on the kinds of things you can do with networking. 00:08:33.960 |
Maybe someday you'll come back to that at some point. 00:08:41.200 |
Very good to follow you just to learn your insights. 00:08:43.840 |
And you actually published a postmortem on Mighty 00:08:45.800 |
that people can read up on if they're willing to. 00:08:50.760 |
You started exploring the AI stuff in June, 2022, 00:09:05.480 |
for the team at Mighty to finish up something. 00:09:11.240 |
"I guess I will make some kind of like address bar predictor 00:09:18.560 |
And I was like, "You know, one thing that's kind of lame 00:09:22.420 |
"is that like this browser should be like a lot better 00:09:24.600 |
"at predicting what I might do, where I might wanna go." 00:09:34.680 |
And for a company like Google, you'd think there's a lot, 00:09:37.320 |
but it's actually just like the code is actually just very, 00:09:41.200 |
you know, it's just a bunch of if then statements 00:09:47.600 |
And that's also where a lot of people interact 00:10:12.200 |
And I think that was the first like truly big viral moment 00:10:19.680 |
- Because of the avocado chair and yeah, exactly. 00:10:26.040 |
- It wasn't as big for me as "Stable Diffusion." 00:10:34.460 |
but I never really, it didn't really register me as-- 00:10:51.620 |
for developers to walk in back when it wasn't as, 00:10:56.100 |
I guess, much of a security issue as it is today. 00:11:08.580 |
but there could be any number of AI companies 00:11:13.140 |
and businesses that you could start in the widest one, right? 00:11:17.340 |
- So there must be an idea maze from June to September. 00:11:32.300 |
But back then, I think they were more than happy. 00:11:36.180 |
They had a lot more bandwidth to help anybody. 00:11:38.900 |
And so we had been talking with the team there 00:11:43.820 |
really fast, low-latency address bar prediction 00:11:59.140 |
and kind of being involved gave me a bird's-eye view 00:12:01.660 |
into a bunch of things that started to happen. 00:12:12.060 |
And I remember just kind of sitting up one night thinking, 00:12:20.740 |
One thing that I observed is that I find a lot of great, 00:12:29.620 |
Like for Mixpanel, I was an intern at a company, 00:12:34.660 |
And so I thought, "Hmm, I wonder if I could make a product, 00:12:38.500 |
And in this case, the same thing kind of occurred. 00:12:46.640 |
they put a model up, and then you can use their API, 00:12:49.500 |
like Replicate is a really good example of that. 00:12:52.620 |
There are a bunch of companies that are helping you 00:12:54.620 |
with training, model optimization, Mosaic at the time, 00:12:59.620 |
and probably still was doing stuff like that. 00:13:03.180 |
So I just started listing out every category of everything, 00:13:06.340 |
of every company that was doing something interesting. 00:13:15.440 |
"I might be really good at competing with that company." 00:13:17.940 |
Because of Mixpanel, 'cause it's so much of analysis. 00:13:21.380 |
I was like, "No, I don't want to do anything related to that. 00:13:23.780 |
"I think that would be too boring now at this point." 00:13:26.480 |
But, so I started to list out all these ideas, 00:13:35.620 |
And all it was was just a text box, more or less. 00:13:38.060 |
And then there were some settings on the right, 00:13:44.940 |
I mean, that was like their product before ChatGPT. 00:13:54.460 |
where the interface kind of was getting more and more, 00:13:59.420 |
generate something in the middle of a sentence, 00:14:07.460 |
"and you generate something, and that's about it." 00:14:15.820 |
And so I had this kind of thing where I wrote prompt dash, 00:14:20.460 |
And I didn't know what was like the product for that, 00:14:40.260 |
And so then of course, then you thought about, 00:14:44.420 |
How would you build a UI for each kind of modality? 00:14:54.300 |
because it seemed like the most obvious place 00:14:57.760 |
where you could build a really powerful, complex UI 00:15:11.300 |
So yeah, I think that just that progression kind of happened 00:15:17.360 |
and it just seemed like there was a lot of effort 00:15:49.100 |
and I will stop and I'll go do something else. 00:15:51.500 |
But if you're not gonna do anything, I'll just do it. 00:15:59.620 |
that they were gonna focus on language primarily. 00:16:28.540 |
- Well, not so much, because I think that right now 00:16:32.620 |
I would say graphics is in this very nascent phase. 00:16:34.740 |
Like most of the customers are just like hobbyists, right? 00:16:40.140 |
as opposed to being this like very high utility thing. 00:16:47.260 |
then probably the next customers will end up being B2B. 00:16:55.500 |
If your quest is to kind of make like a super, 00:17:00.220 |
something that surpasses human ability for graphics, 00:17:03.660 |
like ultimately it will end up being used for business. 00:17:16.420 |
I think it will be a very similar progression. 00:17:19.540 |
- But yeah, I mean, the reason why I was excited about it 00:17:26.100 |
It's like something that I know I could stay up 00:17:30.400 |
Those are kind of like very simple bars for me. 00:17:38.780 |
You just had Playground V2 come out two days ago? 00:17:43.760 |
So this is a model you train completely from scratch. 00:17:49.480 |
You open source everything, including the weights. 00:17:59.380 |
- Yeah, what made you want to come up with V2 00:18:06.180 |
- Yeah, so I think that we continue to feel like graphics 00:18:12.100 |
and these foundation models for anything really related 00:18:21.060 |
It feels a little like graphics is in this GPT-2 moment, 00:18:29.320 |
But it was like, what are you gonna use this for? 00:18:30.980 |
You know, yeah, we'll do some text classification 00:18:34.740 |
and maybe it'll sometimes make a summary of something 00:18:42.960 |
And in images, we're kind of stuck in the same place. 00:18:46.500 |
We're kind of like, okay, I write this thing in a box 00:18:54.500 |
Maybe I'll use it for a blog post, that kind of thing. 00:19:02.320 |
at stable diffusion and we definitely use that model 00:19:04.740 |
in our product and our users like it and use it 00:19:12.420 |
So we were kind of faced with the choice of, you know, 00:19:21.180 |
to just decide to go train these things from scratch. 00:19:24.380 |
And I think the community has given us so much. 00:19:28.740 |
is one of the most vibrant communities on the internet. 00:19:33.360 |
It feels like, I hope this is what Homebrew Club felt like 00:19:36.680 |
when computers showed up because it's like amazing 00:19:42.060 |
I've never seen anything in my life where so far, 00:19:44.540 |
and heard other people's stories around this, 00:19:46.540 |
where a research, an academic research paper comes out 00:19:50.200 |
and then like two days later, someone has sample code for it 00:19:55.180 |
and then two days later, it's like in nine products. 00:20:10.020 |
So I think we wanted to give back to the community 00:20:21.540 |
But we definitely felt like there needs to be 00:20:24.220 |
some kind of progress in these open source models. 00:20:31.900 |
but there hasn't been anything really since, right? 00:20:36.380 |
- Well, SDXL Turbo is like this distilled model, right? 00:20:40.780 |
You have to decide what your trade-off is there. 00:20:46.100 |
- It's not, I don't think it's a consistency model. 00:20:58.340 |
- Yeah, I think it's, I've read something about that. 00:21:04.020 |
But yeah, there hasn't been quite enough progress 00:21:06.820 |
in terms of, you know, there's no multitask image model. 00:21:09.780 |
You know, the closest thing would be something called 00:21:16.140 |
So we did that and we also gave out pre-trained weights, 00:21:28.260 |
there's like a 256 pixel pre-trained stage and a 512. 00:21:35.020 |
we come across people all the time in academia 00:21:37.620 |
they have access to like one A100 or eight at best. 00:21:42.060 |
And so if we can give them kind of like a 512 00:21:47.740 |
our hope is that there'll be interesting novel research 00:21:57.900 |
tend to be things like character consistency. 00:22:08.620 |
one image and then you want it to be like in another. 00:22:26.820 |
You know, there are two things like InstructPix2Pix 00:22:28.860 |
and then the Emu Edit paper that are maybe very interesting, 00:22:33.140 |
but we certainly are not pushing the fold on that 00:22:37.340 |
It just, all kinds of things like around that rotation, 00:22:43.220 |
you know, being able to keep coherence across images, 00:22:52.100 |
what's going on in an image, that kind of thing. 00:22:54.820 |
Things are still very, very underpowered, very nascent. 00:22:57.820 |
So therefore the utility is very, very limited. 00:23:02.780 |
you are 2.5x preferred over Stable Diffusion XL. 00:23:15.660 |
- I think they're still very early on in the recipe, 00:23:18.140 |
but I think it's a lot of like little things. 00:23:28.020 |
So we spend a lot of time thinking about that. 00:23:37.020 |
Everything from captions to the data that you align with 00:23:40.980 |
after pre-train to how you're picking your data sets, 00:23:46.920 |
There's a lot, I feel like there's a lot of work in AI 00:23:52.060 |
It just really feels like just data set filtering 00:23:56.260 |
And just like, you know, and the recipe is all there, 00:23:58.460 |
but it's like a lot of extra work to do that. 00:24:01.580 |
So I think these models, I think whatever version, 00:24:08.220 |
maybe either by the end of the year or early next year. 00:24:10.940 |
And we're just like watching what the community does 00:24:14.300 |
And then we're just gonna take a lot of the things 00:24:16.060 |
that they're unhappy about and just like fix them. 00:24:19.520 |
You know, so for example, like maybe the eyes of people 00:24:29.600 |
That's something that we already know we wanna fix. 00:24:31.320 |
So I think in that case, it's gonna be about data quality. 00:24:34.600 |
Or maybe you wanna improve the kind of the dynamic range 00:24:38.440 |
You know, we wanna make sure that that's like got a good 00:24:43.000 |
There's different things like offset noise, pyramid noise, 00:24:47.120 |
Like there are all these various interesting things 00:24:49.920 |
So I think it's like a lot of just like tricks. 00:24:57.220 |
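For context, offset noise (one of the tricks mentioned above) is a small training-time tweak: instead of pure Gaussian noise, a per-channel constant shift is added so the model can learn images whose overall brightness sits far from the mean, which helps with the dynamic range point raised earlier. A minimal sketch of the idea; the scale value is illustrative, not Playground's recipe:

```python
import torch

def offset_noise(latents: torch.Tensor, offset: float = 0.1) -> torch.Tensor:
    """Gaussian training noise plus a small per-sample, per-channel constant
    shift -- the 'offset noise' trick that lets diffusion models reach very
    dark or very bright images. The 0.1 scale is illustrative."""
    noise = torch.randn_like(latents)
    noise = noise + offset * torch.randn(
        latents.shape[0], latents.shape[1], 1, 1,
        device=latents.device, dtype=latents.dtype,
    )
    return noise
```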
- Specifically for faces, it's very common to use a pipeline 00:25:05.400 |
Do you have a strong belief either way on like, 00:25:08.440 |
oh, they should be separated out to different stages 00:25:10.720 |
for like improving the eyes, improving the face 00:25:14.440 |
Or do you think like it can all be done in one model? 00:25:29.960 |
Maybe there is something out there that we haven't read. 00:25:32.220 |
There are some bottlenecks, like for example, in the VAE, 00:25:35.800 |
like the VAEs are ultimately like compressing these things. 00:25:38.120 |
And so you don't know, and then you might have 00:25:42.800 |
So maybe you would use a pixel based model, perhaps. 00:25:48.280 |
I think we've talked to people, everyone from like Rombach 00:25:51.300 |
to various people, Rombach trained stable diffusion. 00:25:54.520 |
You know, I think there's like a big question 00:26:07.800 |
that's also seemingly working with diffusion. 00:26:10.240 |
And so, you know, are we going to use vision transformers? 00:26:16.360 |
We don't really, I don't think there have been 00:26:22.680 |
- Yeah, I think it's very computationally expensive 00:26:25.120 |
to do a pipeline model where you're like fixing the eyes 00:26:28.080 |
and you're fixing the mouth and you're fixing the hands. 00:26:28.920 |
- That's what everyone does as far as I understand. 00:26:31.340 |
- Well, I'm not sure, I'm not exactly sure what you mean, 00:26:47.760 |
Now you have to pick all these different things. 00:26:49.280 |
- Yeah, you're just kind of glomming things on together. 00:26:51.120 |
Like when I look at AI artists, like that's what they do. 00:26:59.320 |
control net tiling to do kind of generative upscaling 00:27:04.140 |
Yeah, I mean, to me, these are all just like, 00:27:09.480 |
let's go back to where we were just three years, 00:27:12.240 |
four years ago with where deep learning was at 00:27:18.360 |
well, I'll just train these very narrow models 00:27:21.200 |
to try to do these things and kind of ensemble them 00:27:23.200 |
or pipeline them to try to get to a best-in-class result. 00:27:25.600 |
And here we are with like where the models are gigantic 00:27:29.400 |
and like very capable of solving huge amounts of tasks 00:27:38.520 |
You also released a new benchmark called MJHQ-30K 00:27:42.480 |
for automatic evaluation of a model's aesthetic quality. 00:27:54.120 |
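MJHQ-30K is a curated reference set of high-quality Midjourney images, and the usual way to use it is to compute FID between a model's generations and that reference set. A minimal sketch, assuming the clean-fid package and that both sets of images have already been written to local folders (paths are placeholders):

```python
# pip install clean-fid
from cleanfid import fid

# Placeholder paths: a folder of generated images (e.g. one per benchmark
# prompt) and a folder with the downloaded MJHQ-30K reference images.
score = fid.compute_fid("outputs/my_model", "data/mjhq30k_reference")
print(f"FID against the MJHQ-30K reference set: {score:.2f}")
```

Lower FID means the generated distribution sits closer to the curated aesthetic reference; since the benchmark is organized into categories, per-category scores can be reported alongside the overall number.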
How do you think about the Playground model versus Midjourney? 00:27:59.840 |
a lot of people in research like to come up with, 00:28:09.800 |
it can be helpful to not be a researcher also sometimes. 00:28:15.320 |
I don't have a PhD in anything AI related, for example. 00:28:23.400 |
then the most important thing that you wanna figure out 00:28:32.760 |
We would, I would happily, I'm happy to admit that. 00:28:43.720 |
to try to compare ourselves to the thing that's best, 00:28:45.720 |
even if we lose, even if we're not the best, right? 00:28:55.440 |
then we only have ourselves to compare ourselves to. 00:29:06.320 |
So I think more people should try to do that. 00:29:09.960 |
kind of comparing yourself on like some Google model 00:29:13.480 |
or some old SD, you know, stable diffusion model 00:29:16.680 |
and be like, look, we beat, you know, stable diffusion 1.5. 00:29:19.640 |
I think users ultimately want care, you know, 00:29:24.520 |
that like I also mostly, people mostly agree with. 00:29:32.840 |
this seems like a worthy thing for us to at least try, 00:29:41.000 |
And you kill Stable Diffusion XL and everything. 00:29:41.000 |
it says Playground V2 1024 pixel dash aesthetic. 00:29:57.960 |
- We debated this, maybe we named it wrong or something, 00:30:00.080 |
but we were like, how do we help people realize 00:30:03.400 |
the model that's aligned versus the models that weren't. 00:30:19.980 |
Who wouldn't want the thing that's aesthetic? 00:30:33.000 |
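For anyone who wants to try the released aesthetic checkpoint directly, it loads like any other diffusers pipeline. This is a minimal sketch: the repo id is assumed from the checkpoint name mentioned above, and the sampler settings are reasonable placeholders rather than an official recipe.

```python
import torch
from diffusers import DiffusionPipeline

# Repo id assumed from the name discussed above; check Playground's
# Hugging Face page for the exact identifier and recommended settings.
pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    torch_dtype=torch.float16,
).to("cuda")

# Fixing the seed gives the semi-repeatable generation discussed later:
# the same prompt + seed reproduces the same image on the same setup.
generator = torch.Generator("cuda").manual_seed(1234)
image = pipe(
    "a serene mountain lake at golden hour, dramatic lighting",
    num_inference_steps=50,
    guidance_scale=3.0,
    generator=generator,
).images[0]
image.save("playground_v2_sample.png")
```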
And it seems like the styles are tied to the model. 00:30:41.320 |
Can you maybe give listeners an overview of how that works? 00:30:45.120 |
Because in language, there's not this idea of like style, 00:30:52.640 |
and you cannot get certain styles in different models. 00:30:56.920 |
and how do you categorize them and find them? 00:30:59.160 |
- Yeah, I mean, it's so fun having a community 00:31:03.360 |
Like it's only been two days for Playground V2 00:31:05.880 |
and we actually don't know what the model's capable of 00:31:12.600 |
but we have yet to see what emergent behavior is. 00:31:28.640 |
there's some styles that are very like well-known to us, 00:31:30.560 |
like maybe like pixel art is a well-known style. 00:31:38.200 |
But there are some styles that cannot be easily named. 00:31:47.840 |
And in the end, you end up making up the name 00:31:56.320 |
And so if anyone that's into stable diffusion 00:31:58.920 |
and into building anything with graphics and stuff 00:32:03.240 |
you might've heard of like ProtoVision or DreamShaper, 00:32:09.200 |
But they're just, you know, invented by these authors, 00:32:14.960 |
- Because it like roughly embeds to what you want. 00:32:22.080 |
there's this one of my favorite ones that's fine-tuned. 00:32:30.160 |
It's got really great color contrast and visual elements. 00:32:38.960 |
I think that's like a very big open question with graphics 00:32:47.040 |
I don't know, it's like an evolving situation too, 00:32:52.160 |
It's like listening to the same style of pop song. 00:33:10.680 |
like the EDM genre alone has like sub genres. 00:33:16.160 |
and painting and art and anything that we're doing. 00:33:22.760 |
But I think they are emergent from the community, 00:33:29.680 |
coming back to this, like B2B versus B2C thing. 00:33:32.480 |
B2C, you're gonna have a huge amount of diversity 00:33:35.040 |
and then it's gonna reduce as you get towards 00:33:38.560 |
I'm making this up here, tell me if you disagree. 00:33:55.040 |
overly ambitious and like really scrutinizing 00:33:59.440 |
like what something is in its most nascent phase 00:34:01.440 |
that you miss the most ambitious thing you could have done. 00:34:09.600 |
can like kind of lead you to something amazing. 00:34:24.080 |
and then just kind of was still searching, you know, 00:34:29.760 |
I think that happens a lot with like Nobel prize people. 00:34:31.760 |
I think there's like a term for it that I forget. 00:34:34.200 |
I actually wanted to go after a toy almost intentionally. 00:34:42.040 |
I could imagine that it would lead to something 00:34:47.080 |
And so, yeah, it's a very, like I said, it's very hobbyist, 00:34:58.220 |
even if these hobbyists aren't likely to be the people 00:35:01.220 |
that, you know, have a way to monetize it or whatever, 00:35:04.080 |
even if they're, but they're doing it for fun. 00:35:08.700 |
But I agree with you that, you know, in time, 00:35:12.000 |
we will absolutely focus on more utilitarian things, 00:35:16.400 |
like things that are more related to editing feats 00:35:20.060 |
But, and so I think like a very simple use case is just, 00:35:26.000 |
I don't know if, I don't know if you guys are, 00:35:28.680 |
but it's sure, you know, it seems like very simple 00:35:31.080 |
that like you, if we could give you the ability 00:35:39.000 |
You know, like my wife the other day was set, you know, 00:35:46.840 |
where like we could make my son, his name's Devin, 00:35:53.040 |
You know, just being able to highlight his mouth 00:36:00.600 |
Little things like that, all the way to, you know, 00:36:03.920 |
putting you in completely different scenarios. 00:36:18.480 |
and it'll still like kind of look like crooked 00:36:21.720 |
Part of it's like, you know, the lips on the face are so, 00:36:24.360 |
there's such, there's such little information there. 00:36:26.720 |
It's so small that the models really struggle with it. 00:36:30.520 |
- Make the picture smaller and you won't see it. 00:36:32.360 |
- Wait, I think, I think that's my trick, I don't know. 00:36:37.200 |
and make it really big and then like say it's a mouth 00:36:43.120 |
more than it's doing something that kind of surprises you. 00:36:48.480 |
- It feels like you are very much the internal tastemaker. 00:36:56.200 |
Is it, do you find it hard to like communicate it 00:36:59.520 |
to like your team and, you know, other people? 00:37:10.140 |
Like images have such, like such high bit rate 00:37:15.700 |
And we don't have enough words to describe these things. 00:37:21.740 |
if they don't have good kind of like judgment taste 00:37:30.820 |
So in that realm, I don't worry too much, actually. 00:37:39.980 |
But I also have, you know, my own narrow taste. 00:37:43.220 |
I don't represent the whole population either. 00:38:07.420 |
- And then, so are there any other metrics that you like 00:38:11.980 |
I'm always looking for alternatives to vibes. 00:38:15.500 |
- You know, it might be fun to kind of talk about this 00:38:33.980 |
did the way that we benchmark actually succeed? 00:38:42.340 |
but all these benchmarks are just an approximation 00:38:47.420 |
And I think that's like very fascinating to me. 00:38:56.540 |
And so, you know, one of the benchmarks we did 00:38:58.340 |
was we did a, we kind of curated like a thousand prompts. 00:39:01.300 |
That's what we published in our blog post, you know, 00:39:05.140 |
a lot of them, some of them are curated by our team 00:39:09.340 |
Like my favorite prompt that no model's really capable of 00:39:32.740 |
Just to see if the models will figure it out. 00:39:39.620 |
And just like all these very interesting, weird, 00:39:44.120 |
and then we kind of like evaluate whether the models 00:39:47.140 |
And the reality is that they're all bad at it. 00:39:48.940 |
And so then you're just picking the most aesthetic image. 00:39:53.500 |
we're still at the beginning of building like our, 00:39:56.980 |
that aligns most with just user happiness, I think. 00:40:01.980 |
'Cause we're not, we're not like putting these in papers 00:40:03.780 |
and trying to like win, you know, I don't know, 00:40:05.900 |
awards at ICCV or something if they have awards. 00:40:18.100 |
I think we're still evolving whatever our benchmarks are. 00:40:20.460 |
So the first benchmark was just like very difficult tasks 00:40:28.980 |
And then can we ask the users, like, how do we do? 00:40:31.900 |
And then we wanted to use a benchmark like PartiPrompts 00:40:37.740 |
could measure their models against ours versus others. 00:40:50.880 |
and then you try to see like what users make. 00:40:52.980 |
And I think my sense is that we're gonna take all the things 00:40:55.220 |
that we noticed that the users kind of were failing at 00:40:58.060 |
and try to find like new ways to measure that, 00:41:07.900 |
we have users making millions of images every single day. 00:41:15.900 |
- And they go for like a post-generation feedback. 00:41:21.500 |
We can just say like, how good was the lighting here? 00:41:33.700 |
and then we say, and then maybe randomly you just say, 00:41:52.200 |
as opposed to just like benchmark performance. 00:41:54.140 |
Hopefully next year, I think we will try to publish 00:41:56.940 |
kind of like a benchmark that anyone could use, 00:42:01.640 |
that we evaluate ourselves on and that other people can, 00:42:08.420 |
because we've tried it and done it and noticed that it did. 00:42:38.720 |
something I've been looking out for the entire year 00:42:45.320 |
- Or Ideogram, I think, came out recently, 00:42:48.440 |
which had decent but not perfect text and images. 00:43:06.200 |
'Cause I don't see any of that in like your sample. 00:43:09.200 |
Yeah, the V2 model was mostly focused on image quality 00:43:21.940 |
- Yeah, I'm very excited about text synthesis 00:43:23.520 |
and yeah, I think Ideogram has done a good job 00:43:28.080 |
DALL-E kind of has like a, it has like a hit rate. 00:43:33.520 |
I think where this has to go is it has to be like, 00:43:36.620 |
you could like write little tiny pieces of text 00:43:41.880 |
- That's maybe not even the focal point of a scene. 00:43:56.360 |
So I think text synthesis would be very exciting. 00:43:59.160 |
And then also flag that Max Woolf, minimaxir, 00:44:04.960 |
He's done a lot of stuff about using like logo masks 00:44:21.720 |
the open source community is that you get things 00:44:23.600 |
like control net and then you see all these people 00:44:25.880 |
do these just amazing things with control net 00:44:28.360 |
and then you wonder, I think from our point of view, 00:44:33.400 |
but how do we end up with like a unified model 00:44:42.560 |
- And so they need these kinds of like work around 00:44:45.720 |
work around research ideas to get there, but yeah. 00:44:55.440 |
We kept the Playground v2 exactly the same as SDXL, 00:45:01.720 |
we knew that the community already had tools. 00:45:08.600 |
and then, you know, retrain a control net for it. 00:45:18.080 |
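Because the architecture matches SDXL, community tooling built for SDXL-shaped UNets can at least be wired up against it; whether a particular ControlNet needs retraining for the new weights is a separate question, as noted above. A rough sketch with diffusers, where the repo ids and image path are placeholders:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Placeholder ids: an SDXL canny ControlNet plugged into a Playground v2 base.
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edges = load_image("canny_edges.png")  # precomputed edge map (placeholder path)
image = pipe("a futuristic living room, soft light", image=edges).images[0]
```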
I don't want to DDoS you with topics, but okay. 00:45:21.200 |
I was basically going to go over three more categories. 00:45:34.000 |
I don't know if you care to comment on any of those. 00:45:36.440 |
- The NSFW kind of like safety stuff is really important. 00:45:39.840 |
Part of, I kind of think that one of the biggest risks 00:45:44.360 |
kind of going into maybe the U.S. election year 00:45:53.760 |
I think it's going to be very hard to explain, 00:46:00.880 |
And our world is like sometimes very, you know, 00:46:05.520 |
Some people are like, there's still lots of humanity 00:46:09.320 |
And I think it's going to be very hard to explain, 00:46:16.200 |
I saw President Biden say this thing on a video. 00:46:19.800 |
You know, I can't believe, you know, he said that. 00:46:22.960 |
I think that's going to be a very troubling thing 00:46:25.440 |
going into the world next year, the year after. 00:46:29.720 |
- Oh, I didn't, that's more like a risk thing. 00:46:35.800 |
But there's just, there's a lot of studies on how, 00:46:42.080 |
you don't want to train on not safe for work images, 00:46:44.480 |
except that it makes you really good at bodies. 00:46:51.040 |
we filter out NSFW type of images in our data set 00:46:55.760 |
so that it's, you know, so our safety filter stuff 00:46:59.640 |
- But you've heard this argument that it gets, 00:47:04.160 |
not safe for work images are very good at human anatomy, 00:47:11.280 |
it's not like necessarily a bad thing to train on that data. 00:47:16.160 |
That's why I was kind of talking about safety. 00:47:25.280 |
suddenly like you can kind of imagine, you know, 00:47:27.840 |
now if you can like generate nudes and then there's like, 00:47:30.040 |
you can do very character consistent things with faces, 00:47:37.600 |
Even if you train on, let's say, you know, nude data, 00:47:42.280 |
there's nothing wrong with the human anatomy. 00:47:47.200 |
but then it's kind of like, how does that get used? 00:47:49.440 |
And, you know, I won't bring up all of the very, 00:48:00.320 |
And so we, you know, we just recently did like a big sprint 00:48:05.760 |
and it's very difficult with graphics and art, right? 00:48:08.560 |
Because there is tasteful art that has nudity, right? 00:48:19.920 |
there's the things that are the gray line of that. 00:48:23.960 |
someone might be like, that is completely tasteful, right? 00:48:26.840 |
And then there's things that are way over the line. 00:48:29.880 |
And then there are things that are, you know, 00:48:31.400 |
maybe you or, you know, maybe I would be okay with, 00:48:45.440 |
if a child goes to your site, scrolls down some images, 00:48:48.920 |
you know, classrooms of kids, you know, using our product, 00:49:02.160 |
Another favorite topic of our listeners is UX and AI. 00:49:21.880 |
so they can kind of have semi-repeatable generation. 00:49:25.080 |
You also have, yeah, you can pick how many images, 00:49:28.840 |
and then you leave all of them in the canvas, 00:49:33.720 |
the generation box, and you can even cross between them 00:49:45.360 |
You know, you're like, these are all the tools for you. 00:49:52.600 |
I think we think that we're also trying to like re-imagine 00:50:02.240 |
So, you know, I don't think we're trying to build Photoshop, 00:50:07.960 |
that people are, you know, largely familiar with. 00:50:11.520 |
I think, you know, I don't think you would think 00:50:16.840 |
you wouldn't think, what would Photoshop compare itself 00:50:44.440 |
You have to wait right now for the time being, 00:50:46.920 |
but the wait is worth it often for a lot of people 00:50:50.600 |
because they can't make that with their own skills. 00:50:56.560 |
which was kind of looking at GPT-3's Playground, 00:51:11.440 |
these prompt boxes are like a terminal window, right? 00:51:15.080 |
We're kind of at this weird point where it's just like CLI. 00:51:20.400 |
and I memorized the keywords, like dir, ls, 00:51:27.560 |
- The shirt I'm wearing, you know, it's a bug, 00:51:33.120 |
which weights the word token more in the model or whatever. 00:51:42.880 |
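The parenthesis trick referenced here is a community convention (popularized by UIs like AUTOMATIC1111) where something like "(dramatic lighting:1.3)" tells the frontend to up-weight that span's tokens before conditioning. A toy parser for that syntax, purely to illustrate the convention rather than any particular product's implementation:

```python
import re

def parse_weighted_prompt(prompt: str):
    """Split a prompt into (text, weight) chunks: '(span:1.3)' up-weights that
    span, everything else defaults to 1.0. Illustrative only."""
    pattern = re.compile(r"\(([^:()]+):([0-9.]+)\)")
    chunks, last = [], 0
    for match in pattern.finditer(prompt):
        if match.start() > last:
            chunks.append((prompt[last:match.start()], 1.0))
        chunks.append((match.group(1), float(match.group(2))))
        last = match.end()
    if last < len(prompt):
        chunks.append((prompt[last:], 1.0))
    return chunks

print(parse_weighted_prompt("a portrait, (dramatic lighting:1.3), 35mm film"))
# [('a portrait, ', 1.0), ('dramatic lighting', 1.3), (', 35mm film', 1.0)]
```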
I think a large portion of humanity would agree 00:51:53.840 |
if I wanted to get rid of like the headphones on my head, 00:51:57.640 |
and then say, you know, can you remove the headphones? 00:52:00.640 |
You know, if I want to grow the, expand the image, 00:52:03.240 |
it should, you know, how can we make that feel easier 00:52:06.240 |
without typing lots of words and being really confused? 00:52:11.400 |
I don't even think we've nailed the UI/UX yet. 00:52:24.000 |
And whatever felt like the right UX six months ago 00:52:29.600 |
And so that's a little bit of how we got there, 00:52:34.920 |
is kind of saying, does everything have to be 00:52:47.720 |
which DALL-E 3 just does, it doesn't let you decide 00:52:56.400 |
- Yeah, for that feature, I think we'll probably, 00:53:14.840 |
it's in a Discord bot somewhere with Midjourney, 00:53:19.240 |
One of the differentiators I think we provide 00:53:21.480 |
is at the expense of just lots of users necessarily, 00:53:26.480 |
mainstream consumers, is that we provide as much like power 00:53:29.800 |
and tweakability and configurability as possible. 00:53:35.000 |
because we know that users might want to use it 00:53:39.480 |
There are some really powerful power user hobbyists 00:53:45.940 |
you know, just want something that looks cool, 00:53:50.080 |
And so I think a lot of Playground is more about 00:53:53.040 |
going after that core user base that like knows, 00:54:01.520 |
So they might not use like these users probably, 00:54:08.360 |
And so I think that like, as the models get more powerful, 00:54:20.360 |
just as like powerful and configurable as Photoshop. 00:54:24.400 |
And you might have to master a new kind of tool. 00:54:28.720 |
There's so many things I could bounce off of that. 00:54:40.760 |
Consistency models have been blowing up the past month. 00:54:45.640 |
Is that, like, how do you think about integrating that? 00:54:50.040 |
also trying to beat you to that space as well. 00:54:52.960 |
- I think we were the first company to integrate it. 00:54:58.600 |
that have kind of tried to do like interactive editing 00:55:09.160 |
We have a different feature that's like unique 00:55:11.320 |
in our product that's called preview rendering. 00:55:18.760 |
we're like, what is the most common use case? 00:55:20.120 |
The most common use case is you write a prompt 00:55:23.180 |
But what's the most annoying thing about that? 00:55:31.480 |
So we did something that seemed a lot simpler 00:55:34.320 |
but a lot more relevant to how users already use this 00:55:38.240 |
You toggle it on and it will show you a render of the image. 00:55:40.560 |
And then it's just like, graphics tools already have this. 00:55:44.840 |
Like if you use Cinema 4D or After Effects or something, 00:55:52.280 |
in the real world that has familiarity and say, 00:56:05.300 |
just like pulling down the slot machine lever. 00:56:23.800 |
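The preview-rendering idea leans on few-step samplers like latent consistency models, which trade some fidelity for near-instant drafts. Here is a minimal sketch of that kind of few-step generation using the publicly released LCM-LoRA for SDXL; this is illustrative and not Playground's internal implementation:

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# The LCM-LoRA distills the base model so a handful of steps yields a usable draft.
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

preview = pipe(
    "an isometric diorama of a tiny workshop",
    num_inference_steps=4,   # few steps = fast, rough preview
    guidance_scale=1.0,      # LCM works best with little or no CFG
).images[0]
```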
well the demos I've been seeing it's also, I guess, 00:56:30.000 |
They're almost using it to animate their generations, 00:56:39.400 |
but they're sort of showing like if I move a moon, 00:56:55.400 |
like how about just the general ecosystem of LoRAs, right? 00:57:01.760 |
Civitai is obviously the most popular repository of LoRAs. 00:57:15.280 |
but the person that brought LoRAs to Stable Diffusion 00:57:30.160 |
Obviously fine-tuning all these DreamBooth models 00:57:35.480 |
And giving, and it's obvious in our conversation 00:58:01.340 |
that goes beyond just typing Greg Rutkowski in a prompt. 00:58:08.180 |
It's not like users want to type these real artist names. 00:58:11.200 |
It's that they don't know how else to get an image 00:58:18.040 |
And they provide it in a very nice scalable way. 00:58:21.320 |
I hope that we find something even better than LoRAs 00:58:46.600 |
I think when we have a significantly better model, 00:59:11.920 |
you know, five or six reference images, right? 00:59:24.220 |
but they're actually like, it's like a mood board, right? 00:59:27.560 |
And it takes, you have to be kind of an engineer almost 00:59:33.980 |
It seems like it'd be much better if I could say, 00:59:43.640 |
And you tell the model, like, this is what I want. 00:59:45.680 |
And the model gives you something that's very aligned 00:59:48.320 |
with what your style is, what you're talking about. 00:59:50.320 |
And it's a style you couldn't even communicate, right? 00:59:54.400 |
you know, if you have a Tron image, it's not just Tron, 00:59:59.480 |
- Yeah, even cyberpunk can have its like sub-genre, right? 01:00:03.360 |
But I just think training LoRAs and doing that 01:00:05.680 |
is very heavy, so I hope we can do better than that. 01:00:09.640 |
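For reference, this is roughly what consuming one of those community style LoRAs looks like today with diffusers; the directory and file name are placeholders for whatever gets downloaded from a hub like Civitai:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Placeholder path/filename for an SDXL-format style LoRA downloaded separately.
pipe.load_lora_weights("path/to/loras", weight_name="my_style_lora.safetensors")

image = pipe("a rainy city street at night, neon reflections").images[0]
```

The heaviness being described is on the training side, curating reference images, fine-tuning, and distributing the file, rather than this loading step.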
- We have Sharif from Lexica on the podcast before. 01:00:22.820 |
- Yeah, yeah, is that something you see more and more of 01:00:25.880 |
in terms of like coming up with these styles? 01:00:27.660 |
Is that why you have that as the starting point 01:00:30.540 |
versus a lot of other products, you just go in, 01:00:36.160 |
- Our feed is a little different than their feed. 01:00:41.000 |
So we have kind of like a Reddit thing going on 01:00:43.800 |
where it's a kind of a competition like every day, 01:00:51.460 |
And there's just this wonderful community of people 01:00:55.120 |
and just showing their genuine interest in each other. 01:00:58.440 |
And I think we definitely learn about styles that way. 01:01:06.640 |
they'll sometimes put these polls out and they'll say, 01:01:08.400 |
you know, what do you wish you could like learn more from? 01:01:10.040 |
And like one of the things that people vote the most for 01:01:17.520 |
if you put away your research hat for a minute 01:01:19.560 |
and you just put on like your product hat for a second, 01:01:23.000 |
why do people want to learn how to prompt, right? 01:01:25.400 |
It's because they want to get higher quality images. 01:01:38.300 |
and it gives all of the users a way to learn how to prompt 01:01:43.300 |
because they're just seeing this huge rising tide 01:01:47.300 |
of all these images that are super cool and interesting 01:01:49.980 |
and they can kind of like take each other's prompts 01:01:57.180 |
because I think the complexity of these things 01:02:16.480 |
What have you learned running DevOps for GPUs? 01:02:19.360 |
You had a tweet about like how many A100s you have, 01:02:34.960 |
I find the DevOps for inference to be relatively easy. 01:02:42.360 |
I think we had thousands and thousands of servers 01:02:47.720 |
had such huge quantities of volume that I didn't find it. 01:02:58.660 |
So I think that I find that very difficult at the moment. 01:03:05.820 |
Scaling a training cluster is much, much harder 01:03:12.980 |
- Well, it's just like a very large distributed system 01:03:21.820 |
and then you have to somehow be resilient to that. 01:03:23.560 |
And I would say training infra software is very early. 01:03:29.820 |
I can tell in 10 years, it would be a lot better. 01:03:35.180 |
I think we use very basic tools like Slurm for scheduling 01:03:43.780 |
I think I talked to a friend that's over at XAI. 01:03:45.740 |
They just, they like built their own scheduler 01:03:51.900 |
because the existing open source stuff doesn't work 01:03:54.000 |
and everyone's doing their own bespoke thing, 01:03:55.600 |
you know there's a valuable company to be formed. 01:04:01.360 |
- Well, with Mosaic, yeah, it's tough with Mosaic 01:04:03.680 |
'cause anyway, I won't go into the details why, 01:04:13.160 |
Perhaps it's still, I just think it's nascent 01:04:30.120 |
what's the most interesting unsolved question in AI 01:04:42.580 |
I mean, you're a founder, you're a repeat founder. 01:04:51.620 |
The only thing that I, I don't have an idea per se 01:05:00.820 |
Right now, we sort of think that a lot of the modalities 01:05:04.600 |
just kind of feel like they're vision, language, audio, 01:05:11.880 |
And somehow all this will like turn into something, 01:05:14.420 |
it'll be multimodal and then we'll end up with AGI perhaps. 01:05:18.740 |
And I just think that there are probably far more modalities 01:05:25.540 |
And it just seems hard for us to see it right now 01:05:28.760 |
because it's sort of like we have tunnel vision 01:05:36.660 |
- I think we are lacking imagination as a species 01:05:40.680 |
And I think like, you know, just like, you know, 01:05:43.580 |
it's not, I don't know what company would form 01:05:49.420 |
like just like a true actual, like not a meta world model, 01:05:52.940 |
but an actual world model that truly maps everything 01:05:56.860 |
that's going in terms of like physics and fluids 01:06:04.660 |
like a true physics foundation model of sorts 01:06:09.040 |
And that in of itself seems very difficult, you know, 01:06:13.060 |
but we just think of, but we're kind of stuck on like 01:06:17.020 |
with like, you know, a word or a token, if you will. 01:06:20.820 |
And I went, you know, I had a dinner last night 01:06:22.300 |
where we were kind of debating this philosophically. 01:06:24.580 |
And I think someone, you know, said something 01:06:27.780 |
at the end of the day, it doesn't really matter 01:06:31.180 |
At the end of the day, it's just like some, you know, 01:06:36.100 |
But, you know, I do wonder if there are more, 01:06:42.520 |
And if you could create that, then what would that, 01:06:48.940 |
So I don't know yet, so I don't have a great company for it. 01:06:53.540 |
Maybe you would just inspire somebody to try. 01:06:57.720 |
- My personal response to that is I'm less interested 01:07:04.220 |
Because that is teleportation, that is immortality, 01:07:22.180 |
If I were to take a Bill Gates book trip and had a week, 01:07:32.700 |
You shouldn't take a book, you should just go to YouTube 01:07:35.540 |
and visit Karpathy's class and just do it, do it, 01:07:41.820 |
That's actually the most useful thing for you? 01:07:43.300 |
- I wish it came out when I started back last year. 01:07:49.460 |
at the beginning, but I did do a few of his classes 01:07:53.980 |
I don't think books, every time I buy a programming book, 01:08:02.300 |
- Yeah, so more generally, advice for founders 01:08:04.820 |
who are not PhDs and are effectively self-taught 01:08:07.420 |
like you are, what should they do, what should they avoid? 01:08:11.000 |
- Same thing that I would advise if you're programming. 01:08:14.100 |
Pick a project that seems very exciting to you, 01:08:19.060 |
and build it and learn every detail of it while you do it. 01:08:24.740 |
Or can you go far enough not training, just fine-tuning? 01:08:29.180 |
- It depends, I would just follow your curiosity. 01:08:33.660 |
that requires fundamental understanding of training models, 01:08:37.820 |
You don't have to be a PhD, you don't have to get 01:08:44.700 |
If it's not necessary, then go as far as you need to go, 01:08:46.860 |
but I would learn by picking something that motivates you.