
The AI-First Graphics Editor - with Suhail Doshi of Playground AI


Chapters

0:00 Introductions
0:56 Suhail's background (Mixpanel, Mighty)
9:00 Transition from Mighty to exploring AI and generative models
10:24 The viral moment for generative AI with DALL-E 2 and Stable Diffusion
17:58 Training Playground v2 from scratch
27:52 The MJHQ-30K benchmark for evaluating a model's aesthetic quality
30:59 Discussion on styles in AI-generated images and the categorization of styles
43:18 Tackling edge cases from UIs to NSFW
49:47 The user experience and interface design for AI image generation tools
54:50 Running their own infrastructure for GPU DevOps
57:13 The ecosystem of LoRAs
62:07 The goals and challenges of building a graphics editor with AI integration
64:44 Lightning Round

Whisper Transcript

00:00:00.880 | - Hey, everyone.
00:00:01.720 | Welcome to the Latent Space Podcast.
00:00:03.520 | This is Alessio, partner and CTO
00:00:05.460 | in Residence at Decibel Partners.
00:00:07.040 | And I'm joined by my co-host Swyx, founder of Smol.ai.
00:00:09.960 | - Hey, and today in the studio we have Suhail Doshi.
00:00:12.320 | Welcome.
00:00:13.140 | - Yeah, thanks for having me.
00:00:14.320 | - Among many things, you're a CEO and co-founder
00:00:15.960 | of Mixpanel.
00:00:16.800 | - Yep.
00:00:17.620 | - And I think about three years ago,
00:00:19.800 | you left to start Mighty.
00:00:21.400 | - Mm-hmm.
00:00:22.640 | - And more recently, I think about a year ago,
00:00:25.920 | transitioned into Playground.
00:00:27.680 | And you've just announced your new round.
00:00:31.160 | I'd just like to start, touch on Mixpanel a little bit,
00:00:33.120 | 'cause it's obviously one of the more
00:00:35.200 | sort of successful analytics companies;
00:00:37.920 | we previously had Amplitude on.
00:00:40.640 | And I'm curious if you had any sort of reflections
00:00:43.800 | on just that overall,
00:00:46.480 | the interaction of that amount of data
00:00:49.880 | that people would want to use for AI.
00:00:52.120 | Like, I don't know if there's still a part of you
00:00:54.760 | that stays in touch with that world.
00:00:56.320 | - Yeah, I mean, it's, I mean, you know,
00:00:58.560 | the short version is that maybe back in like 2015 or '16,
00:01:03.560 | I don't really remember exactly,
00:01:05.800 | 'cause it was a while ago, we had an ML team at Mixpanel.
00:01:08.900 | And I think this is like when maybe deep learning
00:01:13.060 | or something like really just started
00:01:14.680 | getting kind of exciting.
00:01:16.160 | And we were thinking that maybe we,
00:01:17.800 | given that we had such vast amounts of data,
00:01:20.200 | perhaps we could predict things.
00:01:22.460 | So we built, you know, two or three different features.
00:01:24.480 | I think we built a feature where we could predict
00:01:26.960 | whether users would churn from your product.
00:01:30.200 | We made a feature that could predict
00:01:32.040 | whether users would convert.
00:01:33.840 | We tried to build a feature
00:01:35.320 | that could do anomaly detection.
00:01:36.840 | Like if something occurred in your product,
00:01:40.080 | that was just very surprising,
00:01:41.160 | maybe a spike in traffic in a particular region.
00:01:43.840 | Could we tell you in advance?
00:01:45.040 | Could we tell you that that happened?
00:01:46.360 | 'Cause it's really hard to like know everything
00:01:47.880 | that's going on with your data.
00:01:49.080 | Could we tell you something surprising about your data?
00:01:51.580 | And we tried all of these various features.
00:01:53.700 | Most of it boiled down to just like, you know,
00:01:55.440 | using logistic regression.
00:01:58.480 | And it never quite seemed very groundbreaking in the end.
00:02:03.280 | And so I think, you know, we had a four or five person ML team
00:02:07.240 | and I think we never expanded it from there.
00:02:10.220 | And I did all these Fast.ai courses
00:02:12.060 | trying to learn about ML.
00:02:13.280 | And that was the, that's the--
00:02:15.200 | - That's the first time you did Fast.ai.
00:02:16.640 | - Yeah, that was the first time I did Fast.ai.
00:02:18.320 | Yeah, I think I've done it now three times maybe.
00:02:20.400 | - Oh, okay.
00:02:21.240 | I didn't know it was a third.
00:02:22.060 | Okay.
00:02:23.160 | - No, no, just me reviewing it is maybe three times,
00:02:25.440 | but yeah.
00:02:26.280 | - Yeah, yeah, yeah.
00:02:27.120 | I mean, I think you mentioned prediction,
00:02:29.160 | but honestly like it's also just about the feedback, right?
00:02:31.760 | The quality of feedback from users,
00:02:34.900 | I think it's useful for anyone building AI applications.
00:02:38.400 | - Yeah.
00:02:39.320 | - Yeah, self-evident.
00:02:40.560 | - Yeah, I think I haven't spent a lot of time
00:02:42.760 | thinking about Mixpanel 'cause it's been a long time,
00:02:44.440 | but yeah, I wonder now, given everything that's happened,
00:02:47.680 | like, you know, sometimes I'm like,
00:02:50.400 | oh, I wonder what we could do now.
00:02:51.920 | And then I kind of like move on to whatever I'm working on,
00:02:54.360 | but things have changed significantly since.
00:02:56.720 | So yeah.
00:02:57.880 | - Yeah.
00:02:58.720 | Awesome.
00:02:59.560 | And then maybe we'll touch on Mighty a little bit.
00:03:01.660 | Mighty was very, very bold.
00:03:03.400 | It was basically, well, my framing of it was,
00:03:06.440 | you will run our browsers for us
00:03:07.920 | because everyone has too many tabs open.
00:03:10.820 | I have too many tabs open and it's slowing down our machines,
00:03:12.680 | so you can do it better for us
00:03:14.160 | in a centralized data center.
00:03:15.480 | - Yeah, we were first trying to make a browser
00:03:17.960 | that we would stream from a data center to your computer
00:03:21.120 | at extremely low latency.
00:03:22.680 | But the real objective wasn't trying to make a browser
00:03:27.080 | or anything like that.
00:03:27.920 | The real objective was to try to make
00:03:29.080 | a new kind of computer.
00:03:30.760 | And the thought was just that, like, you know,
00:03:32.720 | we have these computers in front of us today
00:03:35.080 | and we upgrade them or they run out of RAM
00:03:37.840 | or they don't have enough RAM or not enough disk,
00:03:39.640 | or, you know, there's some limitation with our computers,
00:03:43.240 | perhaps like data locality is a problem.
00:03:46.360 | Could we, you know, why do I need to think about
00:03:48.880 | upgrading my computer ever?
00:03:50.760 | And so, you know, we just had to kind of observe that,
00:03:52.800 | like, well, actually, it seems like a lot of applications
00:03:55.920 | are just now in the browser.
00:03:57.600 | You know, it's like how many real desktop applications
00:04:00.800 | do we use relative to the number of applications
00:04:02.960 | we use in the browser?
00:04:03.800 | So it was just this realization that actually, like,
00:04:05.920 | you know, the browser was effectively becoming
00:04:08.280 | more or less our operating system over time.
00:04:10.840 | And so then that's why we kind of decided to go,
00:04:13.120 | hmm, maybe we can stream the browser.
00:04:14.680 | Fortunately, the idea did not work
00:04:15.760 | for a couple of different reasons,
00:04:18.040 | but the objective is to try to make a true new computer.
00:04:21.200 | - Yeah, very bold, very bold.
00:04:22.480 | - Yeah, and I was there at YC Demo Day
00:04:25.200 | when you first announced it.
00:04:26.040 | - Oh, okay.
00:04:26.960 | - I think the last, or one of the last in-person ones,
00:04:29.880 | or like the one at Pier 34 in Mission Bay.
00:04:32.120 | - Yeah, before COVID.
00:04:34.280 | - How do you think about that now
00:04:36.080 | when everybody wants to put some of these models
00:04:38.240 | in people's machines and some of them
00:04:39.760 | want to stream them in?
00:04:40.960 | Do you think there's maybe another wave of the same problem
00:04:44.320 | before it was like browser apps too slow,
00:04:46.200 | and now it's like models too slow to run on device?
00:04:49.000 | - Yeah, I think, you know,
00:04:51.000 | we obviously pivoted away from Mighty,
00:04:52.520 | but a lot of what I somewhat believed at Mighty
00:04:57.040 | is like somewhat very true.
00:04:58.760 | Maybe why I'm so excited about AI and what's happening.
00:05:02.000 | A lot of what Mighty was about
00:05:03.440 | was like moving compute somewhere else, right?
00:05:06.800 | Right now applications,
00:05:07.920 | they get limited quantities of memory, disk, networking,
00:05:12.280 | whatever your home network has, et cetera.
00:05:14.920 | You know, what if these applications could somehow,
00:05:17.280 | if we could shift compute,
00:05:18.440 | then these applications could have vastly more compute
00:05:20.560 | than they do today.
00:05:22.120 | Right now it's just like client backend services,
00:05:24.920 | but you know, what if we could change the shape
00:05:27.280 | of how applications could interact with things?
00:05:31.040 | And it's changed my thinking.
00:05:33.400 | In some ways, AI is like a bit of a continuation
00:05:36.680 | of my belief that like,
00:05:38.120 | perhaps we can really shift compute somewhere else.
00:05:41.560 | One of the problems with Mighty
00:05:43.120 | was that JavaScript is single-threaded in the browser.
00:05:47.720 | And what we learned, you know,
00:05:49.280 | the reason why we kind of abandoned Mighty
00:05:51.240 | was because I didn't believe
00:05:52.520 | we could make a new kind of computer.
00:05:53.760 | We could have made some kind of enterprise business,
00:05:56.080 | probably could have made maybe a lot of money,
00:05:59.000 | but it wasn't going to be what I hoped it was going to be.
00:06:01.520 | And so once I realized that most of a web app
00:06:05.440 | is just going to be single-threaded JavaScript,
00:06:07.240 | then the only thing you could do largely
00:06:11.480 | notwithstanding changing JavaScript,
00:06:11.480 | which is a fool's errand most likely,
00:06:14.560 | is make a better CPU, right?
00:06:18.080 | And there's like three CPU manufacturers,
00:06:20.480 | two of which sell, you know, big ones, you know,
00:06:23.120 | AMD, Intel, and then of course, like Apple made the M1.
00:06:26.440 | And it's not like single-threaded CPU core performance,
00:06:30.240 | single-core performance, was increasing very fast;
00:06:33.240 | it's plateauing rapidly.
00:06:35.080 | And even these different like companies
00:06:36.800 | were not doing as good of a job, you know,
00:06:38.560 | sort of with the continuation of Moore's law.
00:06:40.640 | But what happened in AI was that you got like,
00:06:43.480 | like if you think of the AI model as like a computer program
00:06:46.920 | which is like a compiled computer program,
00:06:48.560 | it is literally built and designed
00:06:50.120 | to do massive parallel computations.
00:06:53.360 | And so if you could take like
00:06:55.120 | the universal approximation theorem
00:06:56.520 | to its like kind of logical complete point,
00:07:00.280 | you know, you're like, wow,
00:07:01.120 | I can make computation happen really rapidly
00:07:04.360 | and in parallel somewhere else.
00:07:06.040 | You know, so you end up with these like
00:07:09.200 | really amazing models that can like do anything.
00:07:11.920 | It just turned out like perhaps,
00:07:14.080 | perhaps the new kind of computer
00:07:16.320 | would just simply be shifted, you know,
00:07:19.000 | into these like really amazing AI models in reality.
00:07:22.600 | - Yeah.
00:07:23.440 | Like I think Andrej Karpathy has always been,
00:07:25.760 | has been making a lot of analogies with the LLM OS.
00:07:28.640 | - Yeah, I saw his, yeah, I saw his video
00:07:30.480 | and I watched that, you know,
00:07:31.880 | maybe two weeks ago or something like that.
00:07:33.320 | And I was like, oh man, this,
00:07:35.000 | I very much resonate with this like idea.
00:07:37.000 | - Why didn't I see this three years ago?
00:07:38.680 | - Yeah, I think, I think there still will be,
00:07:40.760 | you know, local models
00:07:42.240 | and then there'll be these very large models
00:07:43.760 | that have to be run in data centers.
00:07:45.760 | Yeah, I think it just depends on kind of like
00:07:47.320 | the right tool for the job,
00:07:48.440 | like any, like any engineer would probably care about.
00:07:52.000 | But I think that, you know, by and large,
00:07:54.400 | like if the models continue to kind of keep getting bigger,
00:07:57.520 | you know, it's gonna, it's,
00:07:58.720 | you're always going to be wondering
00:07:59.760 | whether you should use the big thing or the small,
00:08:01.880 | you know, the tiny little model.
00:08:03.880 | And it might just depend on like, you know,
00:08:05.800 | do you need 30 FPS or 60 FPS?
00:08:08.320 | Maybe that would be hard to do, you know, over a network.
00:08:12.240 | - Yeah, you tackled the much harder problem latency-wise,
00:08:16.520 | you know, than the AI models actually require.
00:08:19.080 | - Yeah, yeah, you can do quite well.
00:08:20.880 | You can do quite well.
00:08:22.360 | You know, you definitely did 30 FPS video streaming,
00:08:26.320 | did very crazy things to make that work.
00:08:28.440 | So I'm actually quite bullish
00:08:30.720 | on the kinds of things you can do with networking.
00:08:33.120 | - Yeah, right.
00:08:33.960 | Maybe someday you'll come back to that at some point.
00:08:37.520 | - But so for those that don't know,
00:08:39.360 | you're very transparent on Twitter.
00:08:41.200 | Very good to follow you just to learn your insights.
00:08:43.840 | And you actually published a postmortem on Mighty
00:08:45.800 | that people can read up on if they're willing to.
00:08:48.400 | And so there was a bit of an overlap.
00:08:50.760 | You started exploring the AI stuff in June, 2022,
00:08:56.560 | which is when you started saying like,
00:08:57.600 | "I'm taking Fast.ai again."
00:08:59.480 | Maybe, was there more context around that?
00:09:02.440 | - Yeah, I think I was kind of like waiting
00:09:05.480 | for the team at Mighty to finish up something.
00:09:08.640 | And I was like, "Okay, well, what can I do?
00:09:11.240 | "I guess I will make some kind of like address bar predictor
00:09:15.240 | "in the browser."
00:09:16.080 | So we had forked Chrome and Chromium.
00:09:18.560 | And I was like, "You know, one thing that's kind of lame
00:09:22.420 | "is that like this browser should be like a lot better
00:09:24.600 | "at predicting what I might do, where I might wanna go."
00:09:28.480 | You know, it struck me as really odd
00:09:30.160 | that Chrome had very little AI actually,
00:09:32.720 | or ML inside this browser.
00:09:34.680 | And for a company like Google, you'd think there's a lot,
00:09:37.320 | but it's actually just like the code is actually just very,
00:09:41.200 | you know, it's just a bunch of if then statements
00:09:43.240 | is more or less the address bar.
00:09:45.160 | So it seemed like a pretty big opportunity.
00:09:47.600 | And that's also where a lot of people interact
00:09:50.040 | with the browser.
00:09:50.880 | So in a long story short, I was like,
00:09:52.360 | "Hmm, I wonder what I could build here."
00:09:55.160 | So I started to, yeah, take some AI courses
00:09:57.840 | and try to review the material again
00:10:00.620 | and get back to figuring it out.
00:10:02.520 | But I think that was somewhat serendipitous
00:10:05.240 | because right around April was, I think,
00:10:08.640 | a very big watershed moment in AI
00:10:10.320 | 'cause that's when DALL-E 2 came out.
00:10:12.200 | And I think that was the first like truly big viral moment
00:10:15.560 | for generative AI.
00:10:17.760 | - Because of the avocado chair.
00:10:19.680 | - Because of the avocado chair and yeah, exactly.
00:10:24.680 | Yeah, it was just so novel.
00:10:26.040 | - It wasn't as big for me as "Stable Diffusion."
00:10:28.000 | - Really?
00:10:28.840 | - Yeah, I don't know.
00:10:29.660 | People were like, "All right, that's cool."
00:10:31.220 | I don't know. (laughs)
00:10:32.460 | - Yeah.
00:10:33.300 | - I mean, they had some flashy videos,
00:10:34.460 | but I never really, it didn't really register with me as--
00:10:37.900 | - But just that moment of images
00:10:39.460 | was just such a viral, novel moment.
00:10:41.860 | I think it just blew people's mind.
00:10:44.040 | - Yeah, I mean, it was the first time
00:10:46.540 | I encountered Sam Altman
00:10:50.140 | 'cause they had this DALL-E 2 hackathon.
00:10:50.140 | They opened up the OpenAI office
00:10:51.620 | for developers to walk in back when it wasn't as,
00:10:56.100 | I guess, much of a security issue as it is today.
00:11:00.200 | Maybe take us through the journey
00:11:01.600 | to decide to pivot into this.
00:11:03.940 | And also, choosing images.
00:11:06.060 | Obviously, you were inspired by DALL-E,
00:11:08.580 | but there could be any number of AI companies
00:11:13.140 | and businesses that you could have started. Why this one, right?
00:11:16.500 | - Yeah.
00:11:17.340 | - So there must be an idea maze from June to September.
00:11:20.500 | - Yeah, yeah, there definitely was.
00:11:22.500 | So I think at that time,
00:11:24.300 | during Mighty, OpenAI was not quite as popular
00:11:29.300 | as it is all of a sudden now these days.
00:11:32.300 | But back then, I think they were more than happy.
00:11:36.180 | They had a lot more bandwidth to help anybody.
00:11:38.900 | And so we had been talking with the team there
00:11:42.140 | around trying to see if we could do
00:11:43.820 | really fast, low-latency address bar prediction
00:11:47.140 | with GPT-3 and 3.5 and that kind of thing.
00:11:51.180 | And so we were sort of figuring out
00:11:54.220 | how could we make that low-latency.
00:11:56.020 | I think that just being able to talk to them
00:11:59.140 | and kind of being involved gave me a bird's-eye view
00:12:01.660 | into a bunch of things that started to happen.
00:12:03.960 | Obviously, first was the DALL-E 2 moment,
00:12:07.620 | but then "Stable Diffusion" came out,
00:12:09.280 | and that was a big moment for me as well.
00:12:12.060 | And I remember just kind of sitting up one night thinking,
00:12:16.080 | I was like, "What are the kinds of companies
00:12:18.540 | "one could build?
00:12:19.380 | "What matters right now?"
00:12:20.740 | One thing that I observed is that I find a lot of great,
00:12:23.420 | I find a lot of inspiration when I'm working
00:12:26.260 | in a field in something,
00:12:27.500 | and then I can identify a bunch of problems.
00:12:29.620 | Like for Mixpanel, I was an intern at a company,
00:12:32.340 | and I just noticed that they were doing
00:12:33.620 | all this data analysis.
00:12:34.660 | And so I thought, "Hmm, I wonder if I could make a product,
00:12:37.000 | "and then maybe they would use it."
00:12:38.500 | And in this case, the same thing kind of occurred.
00:12:41.680 | It was like, okay, there are a bunch
00:12:42.660 | of infrastructure companies that are doing,
00:12:46.640 | they put a model up, and then you can use their API,
00:12:49.500 | like Replicate is a really good example of that.
00:12:52.620 | There are a bunch of companies that are helping you
00:12:54.620 | with training, model optimization, Mosaic at the time,
00:12:59.620 | and probably still was doing stuff like that.
00:13:03.180 | So I just started listing out every category of everything,
00:13:06.340 | of every company that was doing something interesting.
00:13:08.100 | Obviously, Weights & Biases.
00:13:09.560 | I was like, "Oh man, Weights & Biases
00:13:12.100 | "is this great company.
00:13:14.000 | "Do I want to compete with that company?
00:13:15.440 | "I might be really good at competing with that company."
00:13:17.940 | Because of Mixpanel, 'cause it's so much of analysis.
00:13:21.380 | I was like, "No, I don't want to do anything related to that.
00:13:23.780 | "I think that would be too boring now at this point."
00:13:26.480 | But, so I started to list out all these ideas,
00:13:30.300 | and one thing I observed was that at OpenAI,
00:13:32.820 | they have a playground for GPT-3, right?
00:13:35.620 | And all it was was just a text box, more or less.
00:13:38.060 | And then there were some settings on the right,
00:13:39.540 | like temperature and whatever.
00:13:41.140 | - Top K, Top N. - Yeah, Top K.
00:13:43.340 | What's your end stop sequence?
00:13:44.940 | I mean, that was like their product before ChatGPT.
00:13:48.140 | You know, really difficult to use,
00:13:49.500 | but fun if you're like an engineer.
00:13:51.400 | And I just noticed that their product
00:13:53.060 | kind of was evolving a little bit,
00:13:54.460 | where the interface kind of was getting more and more,
00:13:56.700 | a little bit more complex.
00:13:58.020 | They had like a way where you could like,
00:13:59.420 | generate something in the middle of a sentence,
00:14:01.340 | and all those kinds of things.
00:14:02.820 | And I just thought to myself, I was like,
00:14:04.140 | "You know, there's not,
00:14:05.200 | "everything is just like this text box,
00:14:07.460 | "and you generate something, and that's about it."
00:14:09.620 | And Stable Diffusion had kind of come out,
00:14:11.220 | and it was all like Hugging Face and code.
00:14:13.540 | Nobody was really building any UI.
00:14:15.820 | And so I had this kind of thing where I wrote prompt dash,
00:14:18.540 | like question mark in my notes.
00:14:20.460 | And I didn't know what was like the product for that,
00:14:23.780 | at the time.
00:14:24.820 | I mean, it seems kind of trite now.
00:14:27.380 | But yeah, I just like wrote prompt.
00:14:29.180 | What's the thing for that?
00:14:30.020 | - Manager, prompt. - Prompt manager.
00:14:32.100 | Do you organize them?
00:14:33.560 | Like, do you like have a UI that can like--
00:14:35.860 | - Library. - Play with them?
00:14:37.280 | Yeah, like a library.
00:14:38.340 | What would you make?
00:14:40.260 | And so then of course, then you thought about,
00:14:41.700 | what would the modalities be, given that?
00:14:44.420 | How would you build a UI for each kind of modality?
00:14:47.260 | And so there were a couple people
00:14:48.620 | working on some pretty cool things.
00:14:51.100 | And I basically chose graphics
00:14:54.300 | because it seemed like the most obvious place
00:14:57.760 | where you could build a really powerful, complex UI
00:15:02.220 | that's not just only typing in a box.
00:15:05.260 | That it would very much evolve beyond that.
00:15:07.820 | Like, what would be the best thing
00:15:08.900 | for something that's visual?
00:15:09.900 | Probably something visual.
00:15:11.300 | So yeah, I think that just that progression kind of happened
00:15:17.360 | and it just seemed like there was a lot of effort
00:15:19.860 | going into language,
00:15:21.220 | but not a lot of effort going into graphics.
00:15:24.220 | And then maybe the very last thing was,
00:15:26.300 | I think I was talking to Aditya Ramesh,
00:15:29.020 | who is the co-creator of DALL-E 2, and Sam.
00:15:32.780 | And I just kind of went to these guys
00:15:34.180 | and I was just like, hey,
00:15:35.860 | are you gonna make like a UI for this thing?
00:15:38.660 | Like a true UI, are you gonna go for this?
00:15:40.780 | Are you gonna make a product?
00:15:42.020 | - For DALL-E, yeah.
00:15:43.100 | - For DALL-E, yeah.
00:15:44.700 | Are you gonna do anything here?
00:15:46.420 | 'Cause if you're not gonna do it,
00:15:47.940 | if you are gonna do it, just let me know
00:15:49.100 | and I will stop and I'll go do something else.
00:15:51.500 | But if you're not gonna do anything, I'll just do it.
00:15:54.460 | And so we had a couple of conversations
00:15:55.780 | around what that would look like.
00:15:58.020 | And then I think ultimately they decided
00:15:59.620 | that they were gonna focus on language primarily.
00:16:03.220 | And yeah, I just felt like
00:16:05.780 | it was gonna be very underinvested in.
00:16:07.860 | - Yes, there's that sort of underinvestment
00:16:11.260 | from OpenAI, which I can see that.
00:16:14.420 | But also it's a different type of customer
00:16:18.100 | than you're used to.
00:16:19.380 | Presumably, you and Mixpanel are very good
00:16:22.100 | at selling to B2B developers.
00:16:24.620 | With Playground, you're not.
00:16:26.180 | - Yeah.
00:16:27.020 | - Was that not a concern?
00:16:28.540 | - Well, not so much, because I think that right now
00:16:32.620 | I would say graphics is in this very nascent phase.
00:16:34.740 | Like most of the customers are just like hobbyists, right?
00:16:37.500 | Like it's a little bit of like a novel toy
00:16:40.140 | as opposed to being this like very high utility thing.
00:16:42.980 | But I think ultimately if you believe
00:16:45.220 | that you could make it very high utility,
00:16:47.260 | then probably the next customers will end up being B2B.
00:16:50.460 | It'll probably not be like consumer.
00:16:52.180 | Like there will certainly be a variation
00:16:53.860 | of this idea that's in consumer.
00:16:55.500 | If your quest is to kind of make like a super,
00:17:00.220 | something that surpasses human ability for graphics,
00:17:03.660 | like ultimately it will end up being used for business.
00:17:06.540 | So I think it's maybe more of a progression.
00:17:08.540 | In fact, for me, it's maybe more like
00:17:09.980 | Mixpanel started out as SMB,
00:17:11.940 | and then very much like ended up
00:17:13.340 | starting to grow up towards enterprise.
00:17:14.940 | So for me, it's a little,
00:17:16.420 | I think it will be a very similar progression.
00:17:18.340 | - Yeah, yeah.
00:17:19.540 | - But yeah, I mean, the reason why I was excited about it
00:17:21.340 | is 'cause it was a creative tool.
00:17:22.860 | I make music and it's AI.
00:17:26.100 | It's like something that I know I could stay up
00:17:28.100 | till three o'clock in the morning doing.
00:17:30.400 | Those are kind of like very simple bars for me.
00:17:33.140 | - Yeah. - Yeah.
00:17:33.980 | It's good decision criteria.
00:17:35.900 | - So you mentioned DALL-E, Stable Diffusion.
00:17:38.780 | You just had Playground V2 come out two days ago?
00:17:42.020 | - Yeah, two days ago, yeah.
00:17:42.920 | - Two days ago.
00:17:43.760 | So this is a model you train completely from scratch.
00:17:46.580 | So it's not a cheap fine tune on something.
00:17:49.480 | You open source everything, including the weights.
00:17:52.740 | Why did you decide to do it?
00:17:54.200 | I know you supported Stable Diffusion XL
00:17:56.560 | in Playground before, right?
00:17:58.220 | - Yep.
00:17:59.380 | - Yeah, what made you want to come up with V2
00:18:02.020 | and maybe some of the interesting,
00:18:04.320 | technical research work you've done?
00:18:06.180 | - Yeah, so I think that we continue to feel like graphics
00:18:12.100 | and these foundation models for anything really related
00:18:16.900 | to pixels, but also definitely images,
00:18:18.980 | continues to be very under-invested.
00:18:21.060 | It feels a little like graphics is in this GPT-2 moment,
00:18:24.460 | right, like even GPT-3.
00:18:27.140 | Even when GPT-3 came out, it was exciting.
00:18:29.320 | But it was like, what are you gonna use this for?
00:18:30.980 | You know, yeah, we'll do some text classification
00:18:33.060 | and some semantic analysis,
00:18:34.740 | and maybe it'll sometimes make a summary of something
00:18:37.460 | and it'll hallucinate.
00:18:38.500 | But no one really had a very significant
00:18:41.120 | business application for GPT-3.
00:18:42.960 | And in images, we're kind of stuck in the same place.
00:18:46.500 | We're kind of like, okay, I write this thing in a box
00:18:49.080 | and I get some cool piece of artwork
00:18:50.860 | and the hands are kind of messed up
00:18:52.180 | and sometimes the eyes are a little weird.
00:18:54.500 | Maybe I'll use it for a blog post, that kind of thing.
00:18:58.280 | The utility feels so limited.
00:18:59.840 | And so, you know, and then you sort of look
00:19:02.320 | at stable diffusion and we definitely use that model
00:19:04.740 | in our product and our users like it and use it
00:19:07.260 | and love it and enjoy it.
00:19:08.540 | But it hasn't gone nearly far enough.
00:19:12.420 | So we were kind of faced with the choice of, you know,
00:19:14.500 | do we wait for progress to occur
00:19:16.100 | or do we make that progress happen?
00:19:18.340 | So, yeah, we kind of embarked on a plan
00:19:21.180 | to just decide to go train these things from scratch.
00:19:24.380 | And I think the community has given us so much.
00:19:27.020 | The community for stable diffusion, I think,
00:19:28.740 | is one of the most vibrant communities on the internet.
00:19:31.900 | It's like amazing.
00:19:33.360 | It feels like, I hope this is what Homebrew Club felt like
00:19:36.680 | when computers showed up because it's like amazing
00:19:39.220 | what that community will do.
00:19:40.460 | And it moves so fast.
00:19:42.060 | I've never seen anything in my life where so far,
00:19:44.540 | and heard other people's stories around this,
00:19:46.540 | where a research, an academic research paper comes out
00:19:50.200 | and then like two days later, someone has sample code for it
00:19:53.660 | and then two days later, there's a model
00:19:55.180 | and then two days later, it's like in nine products.
00:19:57.780 | - Yeah.
00:19:58.620 | - Competing with each other.
00:19:59.540 | - Yeah.
00:20:00.380 | - It's incredible to see like math symbols
00:20:01.780 | on an academic paper go to features,
00:20:04.960 | well-designed features in a product.
00:20:06.980 | So I think the community has done so much.
00:20:10.020 | So I think we wanted to give back to the community
00:20:12.180 | kind of on our way.
00:20:13.020 | We knew that
00:20:17.140 | certainly we would train a better model
00:20:18.500 | than what we gave out on Tuesday.
00:20:21.540 | But we definitely felt like there needs to be
00:20:24.220 | some kind of progress in these open source models.
00:20:27.740 | The last kind of milestone was in July
00:20:30.260 | when Stable Diffusion XL came out,
00:20:31.900 | but there hasn't been anything really since, right?
00:20:36.380 | - And there's SDXL Turbo now.
00:20:38.900 | - Well, SDXL Turbo is like this distilled model, right?
00:20:38.900 | So it's like lower quality, but fast.
00:20:40.780 | You have to decide what your trade-off is there.
00:20:43.460 | - And it's also a consistency model?
00:20:46.100 | - It's not, I don't think it's a consistency model.
00:20:48.140 | It's like, they did like a different thing.
00:20:50.260 | - Yeah.
00:20:51.100 | - Yeah, I think it's like,
00:20:51.940 | I don't want to get quoted for this,
00:20:53.460 | but it's like something called ADD,
00:20:54.900 | like adversarial something or another.
00:20:56.460 | - That's exactly right.
00:20:58.340 | - Yeah, I think it's, I've read something about that.
00:21:00.980 | Maybe it's like closer to GANs or something,
00:21:02.380 | but I didn't really read the full paper.
00:21:04.020 | But yeah, there hasn't been quite enough progress
00:21:06.820 | in terms of, you know, there's no multitask image model.
00:21:09.780 | You know, the closest thing would be something called
00:21:11.180 | like Emu Edit, but there's no model for that.
00:21:13.940 | It's just a paper that's within meta.
00:21:16.140 | So we did that and we also gave out pre-trained weights,
00:21:20.780 | which is very rare.
00:21:22.300 | Usually you just get the aligned model
00:21:24.020 | and then you have to like,
00:21:25.180 | see if you can do anything with it.
00:21:26.840 | We actually gave out,
00:21:28.260 | there's like a 256 pixel pre-trained stage and a 512.
00:21:32.460 | And we did that for academic research,
00:21:34.100 | 'cause there's a whole bunch of,
00:21:35.020 | we come across people all the time in academia
00:21:36.780 | and they have like,
00:21:37.620 | they have access to like one A100 or eight at best.
00:21:42.060 | And so if we can give them kind of like a 512
00:21:45.220 | pre-trained model,
00:21:47.740 | our hope is that there'll be interesting novel research
00:21:50.340 | that occurs from that.
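For readers who want to poke at the released weights, here is a minimal sketch of loading the aligned model with Hugging Face diffusers. The repo id `playgroundai/playground-v2-1024px-aesthetic` and the prompt are assumptions for illustration; adjust dtype and device for your hardware.

```python
# Minimal sketch: load the released Playground v2 aesthetic weights with diffusers.
# The repo id is an assumption for illustration; a CUDA GPU and fp16 are assumed too.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Generate a single 1024x1024 image from a text prompt.
image = pipe(
    prompt="a giraffe underneath a microwave, studio lighting",
    width=1024,
    height=1024,
).images[0]
image.save("playground_v2_sample.png")
```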
00:21:51.660 | - What research do you want to happen?
00:21:53.900 | - I would love to see more research around,
00:21:56.620 | you know, things that users care about
00:21:57.900 | tend to be things like character consistency.
00:22:00.660 | - Between frames?
00:22:02.180 | - More like if you have like a face.
00:22:03.900 | Yeah, yeah, basically between frames,
00:22:05.420 | but more just like, you know,
00:22:06.300 | you have your face and it's in, you know,
00:22:08.620 | one image and then you want it to be like in another.
00:22:10.980 | And users are very particular
00:22:12.940 | and sensitive to faces changing.
00:22:14.260 | 'Cause we know, we know what, you know,
00:22:16.900 | we're trained on faces as humans.
00:22:19.140 | And, you know, that's something where
00:22:21.900 | I'm not seeing
00:22:23.380 | enough innovation around multitask editing.
00:22:26.820 | You know, there are two things, like InstructPix2Pix
00:22:28.860 | and then the Emu Edit paper, that are maybe very interesting,
00:22:33.140 | but we certainly are not pushing the envelope on that
00:22:36.460 | in that regard.
00:22:37.340 | It just, all kinds of things like around that rotation,
00:22:43.220 | you know, being able to keep coherence across images,
00:22:46.740 | style transfer is still very limited.
00:22:48.740 | Just even reasoning around images, you know,
00:22:52.100 | what's going on in an image, that kind of thing.
00:22:54.820 | Things are still very, very underpowered, very nascent.
00:22:57.820 | So therefore the utility is very, very limited.
00:23:01.140 | - On the 1K Prompt Benchmark,
00:23:02.780 | you are preferred 2.5x over Stable Diffusion XL.
00:23:06.740 | How do you get there?
00:23:07.580 | Is it better images in the training corpus?
00:23:10.540 | Is it, yeah, can you maybe talk through
00:23:13.660 | the improvements in the model?
00:23:15.660 | - I think they're still very early on in the recipe,
00:23:18.140 | but I think it's a lot of like little things.
00:23:21.620 | And, you know, every now and then
00:23:22.780 | there are some big important things.
00:23:24.260 | Like certainly your data quality
00:23:26.860 | is really, really important.
00:23:28.020 | So we spend a lot of time thinking about that.
00:23:30.900 | But I would say it's a lot of things
00:23:34.140 | that you kind of clean up along the way
00:23:35.660 | as you train your model.
00:23:37.020 | Everything from captions to the data that you align with
00:23:40.980 | after pre-train to how you're picking your data sets,
00:23:44.380 | how you filter your data sets.
00:23:46.920 | There's a lot, I feel like there's a lot of work in AI
00:23:49.700 | that's like, doesn't really feel like AI.
00:23:52.060 | It just really feels like just data set filtering
00:23:55.220 | and systems engineering.
00:23:56.260 | And just like, you know, and the recipe is all there,
00:23:58.460 | but it's like a lot of extra work to do that.
00:24:01.580 | So I think these models, I think whatever version,
00:24:08.220 | I think we plan to do a Playground V2.1,
00:24:08.220 | maybe either by the end of the year or early next year.
00:24:10.940 | And we're just like watching what the community does
00:24:13.100 | with the model.
00:24:14.300 | And then we're just gonna take a lot of the things
00:24:16.060 | that they're unhappy about and just like fix them.
00:24:19.520 | You know, so for example, like maybe the eyes of people
00:24:23.560 | in an image don't feel right.
00:24:25.840 | They feel like they're a little misshapen
00:24:27.800 | or they're kind of blurry feeling.
00:24:29.600 | That's something that we already know we wanna fix.
00:24:31.320 | So I think in that case, it's gonna be about data quality.
00:24:34.600 | Or maybe you wanna improve the kind of the dynamic range
00:24:37.600 | of color.
00:24:38.440 | You know, we wanna make sure that that's like got a good
00:24:40.280 | range in any image.
00:24:41.300 | So what technique can we use there?
00:24:43.000 | There's different things like offset noise, pyramid noise,
00:24:47.120 | zero terminal SNR.
00:24:47.120 | Like there are all these various interesting things
00:24:49.080 | that you can do.
00:24:49.920 | So I think it's like a lot of just like tricks.
00:24:52.200 | Some are tricks, some are data,
00:24:53.360 | and some is just like cleaning.
00:24:55.880 | Yeah.
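As a hedged illustration of one of the tricks just mentioned, here is a minimal offset-noise sketch for a latent-diffusion training step in PyTorch. The scale value and function name are illustrative assumptions, not Playground's actual recipe.

```python
# Minimal sketch of offset noise (illustrative; not Playground's recipe).
# Idea: add a small per-(batch, channel) constant to the Gaussian noise so the
# model also learns to shift overall brightness, helping dynamic range.
import torch

def sample_offset_noise(latents: torch.Tensor, offset_scale: float = 0.1) -> torch.Tensor:
    noise = torch.randn_like(latents)  # standard Gaussian noise
    # One extra random value per (batch, channel), broadcast over height/width.
    offset = torch.randn(
        latents.shape[0], latents.shape[1], 1, 1,
        device=latents.device, dtype=latents.dtype,
    )
    return noise + offset_scale * offset

# Inside a training step it would replace the plain noise sample, e.g.:
# noise = sample_offset_noise(latents)
# noisy_latents = scheduler.add_noise(latents, noise, timesteps)
```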
00:24:57.220 | - Specifically for faces, it's very common to use a pipeline
00:25:01.320 | rather than just train the base model more.
00:25:05.400 | Do you have a strong belief either way on like,
00:25:08.440 | oh, they should be separated out to different stages
00:25:10.720 | for like improving the eyes, improving the face
00:25:12.640 | or enhance or whatever?
00:25:14.440 | Or do you think like it can all be done in one model?
00:25:17.440 | - I think we will make a unified model.
00:25:19.320 | - Okay.
00:25:20.160 | - Yeah, I think we'll certainly in the end,
00:25:21.680 | ultimately make a unified model.
00:25:23.320 | There's not enough research about this.
00:25:29.960 | Maybe there is something out there that we haven't read.
00:25:32.220 | There are some bottlenecks, like for example, in the VAE,
00:25:35.800 | like the VAEs are ultimately like compressing these things.
00:25:38.120 | And so you don't know, and then you might have
00:25:39.880 | like a big information bottleneck.
00:25:42.800 | So maybe you would use a pixel based model, perhaps.
00:25:45.520 | You know, there's a lot of belief.
00:25:48.280 | I think we've talked to people, everyone from like Rombach
00:25:51.300 | to various people; Rombach trained Stable Diffusion.
00:25:54.520 | You know, I think there's like a big question
00:25:56.760 | around the architecture of these things.
00:25:59.360 | It's still kind of unknown, right?
00:26:01.360 | Like we've got transformers
00:26:03.400 | and we've got like a GPT architecture model,
00:26:06.440 | but then there's this like weird thing
00:26:07.800 | that's also seemingly working with diffusion.
00:26:10.240 | And so, you know, are we going to use vision transformers?
00:26:12.520 | Are we going to move to pixel based models?
00:26:14.340 | Is there a different kind of architecture?
00:26:16.360 | We don't really, I don't think there have been
00:26:17.800 | enough experiments in this regard.
00:26:19.320 | - Still? Oh my God.
00:26:21.200 | - Yeah. - That's surprising.
00:26:22.680 | - Yeah, I think it's very computationally expensive
00:26:25.120 | to do a pipeline model where you're like fixing the eyes
00:26:28.080 | and you're fixing the mouth and you're fixing the hands.
00:26:28.920 | - That's what everyone does as far as I understand.
00:26:31.340 | - Well, I'm not sure, I'm not exactly sure what you mean,
00:26:33.320 | but if you mean like you get an image
00:26:35.260 | and then you will like make another model
00:26:37.280 | specifically to fix a face.
00:26:38.940 | Yeah, I think that's a very computationally,
00:26:40.640 | that's fairly computationally expensive.
00:26:42.200 | And I think it's like not,
00:26:43.320 | probably not the right thing, right way.
00:26:45.320 | - Yeah. - Yeah.
00:26:46.160 | And it doesn't generalize very well.
00:26:47.760 | Now you have to pick all these different things.
00:26:49.280 | - Yeah, you're just kind of glomming things on together.
00:26:51.120 | Like when I look at AI artists, like that's what they do.
00:26:54.380 | - Ah, yeah, yeah, yeah.
00:26:55.640 | They'll do things like, you know,
00:26:57.760 | I think a lot of AI artists will do, you know,
00:26:59.320 | ControlNet tiling to do kind of generative upscaling
00:27:01.920 | of all these different pieces of the image.
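For readers unfamiliar with that workflow, here is a minimal, hedged sketch of tile-ControlNet generative upscaling with diffusers; the checkpoint ids are common community choices assumed for illustration, not anything tied to Playground.

```python
# Minimal sketch of the "ControlNet tiling" upscaling trick (illustrative only;
# checkpoint ids are assumptions, not Playground's stack).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image

# The tile ControlNet conditions generation on a low-detail version of the image,
# so the model re-synthesizes fine detail while keeping the overall layout.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1e_sd15_tile", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

low_res = load_image("low_res.png").resize((1024, 1024))  # naive upscale first
upscaled = pipe(
    prompt="high quality, highly detailed",
    image=low_res,          # img2img input
    control_image=low_res,  # tile conditioning
    strength=0.6,
).images[0]
upscaled.save("upscaled.png")
```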
00:27:04.140 | Yeah, I mean, to me, these are all just like,
00:27:05.920 | they're all hacks, ultimately in the end.
00:27:08.280 | I mean, it just, to me, it's like,
00:27:09.480 | let's go back to where we were just three years,
00:27:12.240 | four years ago with where deep learning was at
00:27:14.920 | and where language was at.
00:27:16.600 | You know, it's the same thing.
00:27:17.440 | It's like, we were like, okay,
00:27:18.360 | well, I'll just train these very narrow models
00:27:21.200 | to try to do these things and kind of ensemble them
00:27:23.200 | or pipeline them to try to get to a best-in-class result.
00:27:25.600 | And here we are with like where the models are gigantic
00:27:29.400 | and like very capable of solving huge amounts of tasks
00:27:33.440 | when given like lots of great data.
00:27:35.200 | So, yeah. - Makes sense.
00:27:38.520 | You also released a new benchmark called MJHQ-30K
00:27:42.480 | for automatic evaluation of a model's aesthetic quality.
00:27:45.960 | I have one question.
00:27:48.680 | The dataset that you use for the benchmark
00:27:51.020 | is from MidJourney. - Yes.
00:27:52.440 | - You have 10 categories.
00:27:54.120 | How do you think about the Playground model versus MidJourney?
00:27:58.720 | - You know, there are a lot of people,
00:27:59.840 | a lot of people in research like to come up with,
00:28:02.500 | they like to compare themselves
00:28:03.640 | to something they know they can beat, right?
00:28:06.760 | But maybe this is the best reason why
00:28:09.800 | it can be helpful to not be a researcher also sometimes.
00:28:12.840 | Like I'm not like trained as a researcher.
00:28:15.320 | I don't have a PhD in anything AI related, for example.
00:28:19.120 | But I think if you care about products
00:28:21.880 | and you care about your users,
00:28:23.400 | then the most important thing that you wanna figure out
00:28:25.720 | is like everyone has to acknowledge
00:28:28.080 | that MidJourney is very good.
00:28:30.080 | You know, they are the best at this thing.
00:28:32.760 | We would, I would happily, I'm happy to admit that.
00:28:34.840 | I have no problem admitting that.
00:28:37.520 | It's just easy.
00:28:38.760 | It's very visual to tell.
00:28:40.680 | So, you know, I think it's incumbent on us
00:28:43.720 | to try to compare ourselves to the thing that's best,
00:28:45.720 | even if we lose, even if we're not the best, right?
00:28:50.040 | And, you know, at some point,
00:28:53.060 | if we are able to surpass MidJourney,
00:28:55.440 | then we only have ourselves to compare ourselves to.
00:28:58.360 | But on first blush, you know,
00:29:00.020 | I think it's worth comparing yourself
00:29:01.360 | to maybe the best thing and try to find
00:29:03.860 | like a really fair way of doing that.
00:29:06.320 | So I think more people should try to do that.
00:29:08.680 | I definitely don't think you should be
00:29:09.960 | kind of comparing yourself on like some Google model
00:29:13.480 | or some old SD, you know, Stable Diffusion model
00:29:16.680 | and be like, look, we beat, you know, Stable Diffusion 1.5.
00:29:19.640 | I think users ultimately care about, you know,
00:29:23.380 | how close are you getting to the thing
00:29:24.520 | that like I also mostly, people mostly agree with.
00:29:28.380 | So we put out that benchmark
00:29:31.280 | for no other reason than to say like,
00:29:32.840 | this seems like a worthy thing for us to at least try,
00:29:35.280 | you know, for people to try to get to.
00:29:37.600 | And then if we surpass it, great,
00:29:38.760 | we'll come up with another one.
00:29:40.080 | - Yeah, no, that's awesome.
00:29:41.000 | And you beat Stable Diffusion XL and everything.
00:29:45.240 | In the benchmark chart,
00:29:47.960 | it says Playground V2 1024 pixel dash aesthetic.
00:29:51.680 | - Yes. - You have kind of like,
00:29:53.720 | yeah, style fine tunes or like,
00:29:55.680 | what's the dash aesthetic for?
00:29:57.960 | - We debated this, maybe we named it wrong or something,
00:30:00.080 | but we were like, how do we help people realize
00:30:03.400 | the model that's aligned versus the models that weren't.
00:30:06.520 | So because we gave out pre-trained models,
00:30:09.120 | we didn't want people to like use those.
00:30:11.920 | So that's why they're called base.
00:30:13.520 | And then the aesthetic model, yeah,
00:30:15.120 | we wanted people to pick up the thing
00:30:16.600 | that we thought would be like the thing
00:30:18.560 | that makes things pretty.
00:30:19.980 | Who wouldn't want the thing that's aesthetic?
00:30:22.680 | But if there's a better name,
00:30:25.040 | we definitely are open to feedback.
00:30:26.840 | - No, no, that's cool.
00:30:28.040 | I was using the product.
00:30:29.000 | You also have the style filter
00:30:31.080 | and you have all these different style.
00:30:33.000 | And it seems like the styles are tied to the model.
00:30:35.920 | So there's some like SDXL styles,
00:30:38.800 | there's some Playground V2 styles.
00:30:41.320 | Can you maybe give listeners an overview of how that works?
00:30:45.120 | Because in language, there's not this idea of like style,
00:30:49.040 | right, versus like in vision model there is,
00:30:52.640 | and you cannot get certain styles in different models.
00:30:55.640 | How do styles emerge
00:30:56.920 | and how do you categorize them and find them?
00:30:59.160 | - Yeah, I mean, it's so fun having a community
00:31:01.560 | where people are just trying a model.
00:31:03.360 | Like it's only been two days for Playground V2
00:31:05.880 | and we actually don't know what the model's capable of
00:31:09.680 | and not capable of.
00:31:10.600 | You know, we certainly see problems with it,
00:31:12.600 | but we have yet to see what emergent behavior is.
00:31:16.520 | I mean, we've just sort of discovered
00:31:17.960 | that it takes about like a week
00:31:19.680 | before you start to see like new things.
00:31:21.880 | But I think like a lot of that style
00:31:24.080 | kind of emerges after that week
00:31:26.400 | where you start to see, you know,
00:31:28.640 | there's some styles that are very like well-known to us,
00:31:30.560 | like maybe like pixel art is a well-known style.
00:31:33.560 | But then there's some style,
00:31:34.720 | photo realism is like another one
00:31:36.160 | that's like well-known to us.
00:31:38.200 | But there are some styles that cannot be easily named.
00:31:41.880 | You know, it's not as simple as like,
00:31:43.880 | okay, that's an anime style.
00:31:45.800 | It's very visual.
00:31:47.840 | And in the end, you end up making up the name
00:31:50.760 | for what that style represents.
00:31:52.040 | And so the community kind of shapes itself
00:31:55.040 | around these different things.
00:31:56.320 | And so if anyone that's into stable diffusion
00:31:58.920 | and into building anything with graphics and stuff
00:32:01.960 | with these models, you know,
00:32:03.240 | you might've heard of like ProtoVision or DreamShaper,
00:32:07.080 | some of these weird names.
00:32:09.200 | But they're just, you know, invented by these authors,
00:32:11.120 | but they have a sort of je ne sais quoi
00:32:13.000 | that, you know, appeals to users.
00:32:14.960 | - Because it like roughly embeds to what you want.
00:32:18.640 | - I guess so.
00:32:21.240 | I mean, it's like, you know,
00:32:22.080 | there's this one of my favorite ones that's fine-tuned.
00:32:24.400 | It's not made by us.
00:32:25.560 | It's called like Starlight XL.
00:32:28.080 | It's just this beautiful model.
00:32:30.160 | It's got really great color contrast and visual elements.
00:32:33.960 | And the users love it.
00:32:35.280 | I love it.
00:32:36.240 | And yeah, it's so hard.
00:32:38.960 | I think that's like a very big open question with graphics
00:32:41.280 | that I'm not totally sure how we'll solve.
00:32:44.040 | Yeah, I think a lot of styles are sort of,
00:32:47.040 | I don't know, it's like an evolving situation too,
00:32:49.560 | 'cause styles get boring, right?
00:32:51.320 | They get fatigued.
00:32:52.160 | It's like listening to the same style of pop song.
00:32:55.400 | I kind of, I try to relate to graphics
00:32:57.920 | a little bit like with music,
00:32:59.240 | because I think it gives you a little bit
00:33:01.400 | of a different shape to things.
00:33:02.600 | Like in music, it's not just,
00:33:04.760 | it's not as if we just have pop music
00:33:06.440 | and, you know, rap music and country music.
00:33:09.040 | Like they're all of these,
00:33:10.680 | like the EDM genre alone has like sub genres.
00:33:14.040 | And I think that's very true in graphics
00:33:16.160 | and painting and art and anything that we're doing.
00:33:19.080 | There's just these sub genres,
00:33:20.400 | even if we can't quite always name them.
00:33:22.760 | But I think they are emergent from the community,
00:33:24.760 | which is why we're so always happy
00:33:26.120 | to work with the community.
00:33:27.160 | - Yeah, that is a struggle, you know,
00:33:29.680 | coming back to this, like B2B versus B2C thing.
00:33:32.480 | B2C, you're gonna have a huge amount of diversity
00:33:35.040 | and then it's gonna reduce as you get towards
00:33:36.920 | more sort of B2B type use cases.
00:33:38.560 | I'm making this up here, tell me if you disagree.
00:33:41.280 | So like you might be optimizing for a thing
00:33:44.040 | that you may eventually not need.
00:33:45.840 | - Yeah, possibly.
00:33:46.960 | Yeah, possibly.
00:33:48.320 | Yeah, I try not to share,
00:33:49.320 | I think like a simple thing with startups
00:33:51.120 | is that I worry sometimes by trying to be
00:33:55.040 | overly ambitious and like really scrutinizing
00:33:59.440 | like what something is in its most nascent phase
00:34:01.440 | that you miss the most ambitious thing you could have done.
00:34:03.960 | Like just having like very basic curiosity
00:34:06.840 | with something very small
00:34:09.600 | can like kind of lead you to something amazing.
00:34:13.040 | Like Einstein definitely did that.
00:34:14.280 | And then when, and then he like, you know,
00:34:16.480 | he basically won all the prizes
00:34:17.880 | and got everything he wanted
00:34:19.080 | and then, you know,
00:34:20.240 | he kind of dismissed quantum
00:34:24.080 | and then just kind of was still searching, you know,
00:34:26.960 | for the unifying theory.
00:34:28.200 | And he like had this quest.
00:34:29.760 | I think that happens a lot with like Nobel prize people.
00:34:31.760 | I think there's like a term for it that I forget.
00:34:34.200 | I actually wanted to go after a toy almost intentionally.
00:34:39.180 | So long as I could see,
00:34:42.040 | I could imagine that it would lead to something
00:34:45.360 | very, very large later.
00:34:47.080 | And so, yeah, it's a very, like I said, it's very hobbyist,
00:34:50.760 | but you need to start somewhere.
00:34:53.200 | You need to start with something
00:34:54.400 | that has a big gravitational pull,
00:34:58.220 | even if these hobbyists aren't likely to be the people
00:35:01.220 | that, you know, have a way to monetize it or whatever,
00:35:04.080 | even if they're, but they're doing it for fun.
00:35:05.460 | So there's something there
00:35:07.160 | that I think is really important.
00:35:08.700 | But I agree with you that, you know, in time,
00:35:11.160 | we're gonna have to focus,
00:35:12.000 | we will absolutely focus on more utilitarian things,
00:35:16.400 | like things that are more related to editing feats
00:35:18.760 | that are much harder.
00:35:20.060 | But, and so I think like a very simple use case is just,
00:35:23.360 | you know, I'm not a graphics designer.
00:35:26.000 | I don't know if, I don't know if you guys are,
00:35:28.680 | but it's sure, you know, it seems like very simple
00:35:31.080 | that like you, if we could give you the ability
00:35:33.080 | to do really complex graphics without skill,
00:35:37.520 | wouldn't you want that?
00:35:41.020 | You know, like my wife the other day, you know,
00:35:41.020 | said, ah, I wish Playground was better
00:35:43.080 | because I wish that, you know, don't you,
00:35:45.560 | when are you guys gonna have a feature
00:35:46.840 | where like we could make my son, his name's Devin,
00:35:48.880 | smile when he was not smiling in the picture
00:35:50.800 | for the holiday card, right?
00:35:53.040 | You know, just being able to highlight his mouth
00:35:55.080 | and just say like, make him smile.
00:35:56.480 | Like, why can't we do that
00:35:58.040 | with like high fidelity and coherence?
00:36:00.600 | Little things like that, all the way to, you know,
00:36:03.920 | putting you in completely different scenarios.
00:36:06.200 | - Is that true?
00:36:07.040 | Can we not do that with inpainting?
00:36:08.760 | - You can do it with inpainting,
00:36:10.200 | but the quality is just so bad.
00:36:12.840 | Yeah, it's just really terrible quality.
00:36:16.240 | You know, it's like, you'll do it five times
00:36:18.480 | and it'll still like kind of look like crooked
00:36:20.440 | or just the artifact.
00:36:21.720 | Part of it's like, you know, the lips on the face are so,
00:36:24.360 | there's such, there's such little information there.
00:36:26.720 | It's so small that the models really struggle with it.
00:36:29.500 | Yeah.
00:36:30.520 | - Make the picture smaller and you won't see it.
00:36:32.360 | - Wait, I think, I think that's my trick, I don't know.
00:36:34.760 | - Yeah, yeah, that's true.
00:36:35.640 | Or, you know, you could take that region
00:36:37.200 | and make it really big and then like say it's a mouth
00:36:39.520 | and then like shrink it.
00:36:40.920 | It feels like you're wrestling with it
00:36:43.120 | more than it's doing something that kind of surprises you.
00:36:47.640 | Yeah.
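For context on what inpainting means here, a minimal sketch with the diffusers Stable Diffusion inpainting pipeline: mask the mouth region and regenerate it from a prompt. The checkpoint, file names, and prompt are illustrative assumptions, and the small-region quality limits being described apply to exactly this kind of setup.

```python
# Minimal inpainting sketch (illustrative): mask a region and regenerate it.
# Checkpoint and file names are assumptions for the example.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("holiday_card.png").convert("RGB").resize((512, 512))
# White pixels mark the region to regenerate (e.g. the mouth), black is kept.
mask_image = Image.open("mouth_mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a young boy with a natural smile",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("holiday_card_smiling.png")
```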
00:36:48.480 | - It feels like you are very much the internal tastemaker.
00:36:50.600 | Like you carry in your head this vision
00:36:53.320 | for what a good art model should look like.
00:36:56.200 | Is it, do you find it hard to like communicate it
00:36:59.520 | to like your team and, you know, other people?
00:37:02.960 | 'Cause obviously it's hard to put into words
00:37:04.840 | like we just said.
00:37:06.100 | - Yeah, it's very hard to explain.
00:37:10.140 | Like images have such, like such high bit rate
00:37:14.360 | compared to just words.
00:37:15.700 | And we don't have enough words to describe these things.
00:37:19.900 | Difficult, I think everyone on the team,
00:37:21.740 | if they don't have good kind of like judgment taste
00:37:25.180 | or like an eye for some of these things,
00:37:27.300 | they're like steadily building it
00:37:28.860 | 'cause they have no choice, right?
00:37:30.820 | So in that realm, I don't worry too much, actually.
00:37:33.820 | Like everyone is kind of like learning
00:37:35.860 | to get the eye is what I would call it.
00:37:39.980 | But I also have, you know, my own narrow taste.
00:37:41.740 | Like I'm at my, you know, I'm not,
00:37:43.220 | I don't represent the whole population either.
00:37:45.220 | - True, true.
00:37:46.060 | - So.
00:37:47.580 | - When you benchmark models, you know,
00:37:49.780 | like this benchmark we're talking about,
00:37:51.060 | we use FID, Fréchet Inception Distance.
00:37:53.720 | Okay, that's one measure,
00:37:56.500 | but like doesn't capture anything
00:37:57.700 | you just said about smiles.
00:37:59.380 | - Yeah, FID is generally a bad metric.
00:38:02.660 | You know, it's good up to a point
00:38:04.460 | and then it kind of like is irrelevant.
00:38:06.580 | - Yeah. - Yeah.
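Since FID keeps coming up, here is a minimal, generic sketch of computing it with torchmetrics; this is not the MJHQ-30K harness, and the dummy tensors stand in for real and generated image batches.

```python
# Minimal FID sketch with torchmetrics (generic; not the MJHQ-30K harness).
# FID compares Inception-feature statistics of a reference set and a generated
# set; lower is better, and it captures nothing about details like a smile.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=64)  # 2048 is the standard feature size

# Dummy stand-ins for real/generated images: uint8 tensors shaped (N, 3, H, W).
real_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))
```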
00:38:07.420 | - And then, so are there any other metrics that you like
00:38:11.060 | apart from vibes?
00:38:11.980 | I'm always looking for alternatives to vibes.
00:38:13.940 | 'Cause vibes don't scale, you know?
00:38:15.500 | - You know, it might be fun to kind of talk about this
00:38:18.300 | because it's actually kind of fresh.
00:38:20.300 | So up till now, we haven't needed to do
00:38:22.860 | a ton of like benchmarking
00:38:24.540 | because we hadn't trained our own model
00:38:26.540 | and then now we have.
00:38:27.380 | So now what?
00:38:28.340 | What does that mean?
00:38:29.180 | How do we evaluate it?
00:38:30.380 | You know, we're kind of like living
00:38:31.460 | with the last 48, 72 hours of going,
00:38:33.980 | did the way that we benchmark actually succeed?
00:38:37.340 | Did it deliver?
00:38:38.180 | Right?
00:38:39.020 | You know, like I think Gemini just came out.
00:38:40.500 | They just put out a bunch of benchmarks,
00:38:42.340 | but all these benchmarks are just an approximation
00:38:45.100 | of how you think it's gonna end up
00:38:46.340 | with real world performance.
00:38:47.420 | And I think that's like very fascinating to me.
00:38:50.260 | So if you fake that benchmark,
00:38:53.360 | you'll still end up in a really bad scenario
00:38:55.500 | at the end of the day.
00:38:56.540 | And so, you know, one of the benchmarks we did
00:38:58.340 | was we did a, we kind of curated like a thousand prompts.
00:39:01.300 | That's what we published in our blog post, you know,
00:39:03.940 | of all these tasks that we,
00:39:05.140 | a lot of them, some of them are curated by our team
00:39:07.100 | where we know the models all suck at it.
00:39:09.340 | Like my favorite prompt that no model's really capable of
00:39:12.900 | is a horse riding an astronaut.
00:39:15.600 | - Yeah.
00:39:16.440 | - The inverse one.
00:39:17.260 | And it's really, really hard to do.
00:39:19.900 | - Not in data.
00:39:20.900 | - You know, another one is like a giraffe
00:39:22.720 | underneath a microwave.
00:39:24.420 | How does that work?
00:39:25.260 | (laughing)
00:39:26.780 | Right?
00:39:27.620 | There's so many of these little funny ones.
00:39:29.620 | We do, we have prompts that are just like
00:39:31.060 | misspellings of things, right?
00:39:32.740 | Just to see if the models will figure it out.
00:39:35.260 | - So that's easy.
00:39:36.100 | That should embed to the same space.
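The intuition that a misspelled prompt "should embed to the same space" is easy to spot-check against an off-the-shelf CLIP text encoder; a small sketch follows, where the model id and prompts are just illustrative and not the ones in Playground's benchmark.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a giraffe underneath a microwave",
           "a girafe undernaeth a microwaev"]  # deliberate misspellings
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    embeds = text_model(**inputs).text_embeds  # (2, 512) projected text embeddings

sim = torch.nn.functional.cosine_similarity(embeds[0], embeds[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")  # high similarity = the misspelling lands nearby
```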
00:39:38.780 | - Yeah.
00:39:39.620 | And just like all these very interesting, weird,
00:39:42.260 | weirdo things.
00:39:43.080 | And so we have so many of these
00:39:44.120 | and then we kind of like evaluate whether the models
00:39:46.300 | are any good at it.
00:39:47.140 | And the reality is that they're all bad at it.
00:39:48.940 | And so then you're just picking the most aesthetic image.
00:39:51.440 | But I think, you know, we're just,
00:39:53.500 | we're still at the beginning of building like our,
00:39:55.420 | like the best benchmark we can
00:39:56.980 | that aligns most with just user happiness, I think.
00:40:01.980 | 'Cause we're not, we're not like putting these in papers
00:40:03.780 | and trying to like win, you know, I don't know,
00:40:05.900 | awards at ICCV or something if they have awards.
00:40:07.980 | Sorry if they don't.
00:40:09.740 | And you could.
00:40:11.340 | - Well, that's absolutely a valid strategy.
00:40:12.980 | - Yeah, you could.
00:40:14.020 | I don't think it could correlate necessarily
00:40:15.860 | with the impact we want to have on humanity.
00:40:18.100 | I think we're still evolving whatever our benchmarks are.
00:40:20.460 | So the first benchmark was just like very difficult tasks
00:40:23.020 | that we know the models are bad at.
00:40:24.300 | Can we come up with a thousand of these?
00:40:26.700 | Whether they're hand-written
00:40:27.540 | and some of them are generated.
00:40:28.980 | And then can we ask the users, like, how do we do?
00:40:31.900 | And then we wanted to use a benchmark like Parti prompts
00:40:34.380 | so that people in academia,
00:40:36.020 | we mostly did that so people in academia
00:40:37.740 | could measure their models against ours versus others.
00:40:40.580 | But yeah, I mean, FID is pretty bad.
00:40:45.080 | And I think, yeah, in terms of vibes,
00:40:49.380 | it's like when you put out the model
00:40:50.880 | and then you try to see like what users make.
00:40:52.980 | And I think my sense is that we're gonna take all the things
00:40:55.220 | that we noticed that the users kind of were failing at
00:40:58.060 | and try to find like new ways to measure that,
00:41:01.020 | whether that's like a smile or, you know,
00:41:03.740 | color contrast or lighting.
00:41:06.260 | One benefit of Playground is that
00:41:07.900 | we have users making millions of images every single day.
00:41:12.900 | And so we can just ask them.
00:41:15.900 | - And they go for like a post-generation feedback.
00:41:20.260 | - Yeah, we can just ask them.
00:41:21.500 | We can just say like, how good was the lighting here?
00:41:23.740 | How was the subject?
00:41:25.620 | How was the background?
00:41:26.740 | - Oh, like a proper form of like.
00:41:30.460 | - It's just like, you make it,
00:41:32.300 | you come to our site, you make an image
00:41:33.700 | and then we say, and then maybe randomly you just say,
00:41:35.660 | hey, you know, like how was the color
00:41:37.540 | and contrast of this image?
00:41:38.700 | And you say, it was not very good.
00:41:40.460 | And then you just tell us.
00:41:41.760 | So I think we can get like tens of thousands
00:41:45.460 | of these evaluations every single day
00:41:49.100 | to truly measure real world performance
00:41:52.200 | as opposed to just like benchmark performance.
00:41:54.140 | Hopefully next year, I think we will try to publish
00:41:56.940 | kind of like a benchmark that anyone could use,
00:42:01.640 | that we evaluate ourselves on and that other people can,
00:42:04.580 | that we think does a good job
00:42:06.660 | of approximating real world performance
00:42:08.420 | because we've tried it and done it and noticed that it did.
00:42:10.940 | Yeah, I think we will do that.
00:42:12.580 | - Yeah.
00:42:14.060 | I think we're going to ask a few more
00:42:15.500 | like sort of product-y questions.
00:42:17.540 | I personally have a few like categories
00:42:20.020 | that I consider special among, you know,
00:42:22.860 | you have like animals, art, fashion, food.
00:42:25.060 | There are some categories which I consider
00:42:28.640 | like a different tier of image.
00:42:30.680 | So the top among them is text in images.
00:42:33.420 | How do you think about that?
00:42:36.600 | So one of the big wild ones for me,
00:42:38.720 | something I've been looking out for the entire year
00:42:40.720 | is just the progress of text in images.
00:42:42.520 | Like, do you, can you write in an image?
00:42:44.480 | - Yeah.
00:42:45.320 | - Or Ideogram, I think, came out recently,
00:42:48.440 | which had decent but not perfect text in images.
00:42:52.280 | DALL-E 3 had improved some
00:42:55.140 | and all they said in their paper was that
00:42:58.500 | they just included more text in the dataset
00:43:00.000 | and it just worked.
00:43:01.200 | I was like, that's just, that's just lazy.
00:43:03.000 | (laughing)
00:43:04.320 | But anyway, do you care about that?
00:43:06.200 | 'Cause I don't see any of that in like your sample.
00:43:08.360 | - Yeah, yeah.
00:43:09.200 | Yeah, the V2 model was mostly focused on image quality
00:43:14.200 | versus like the feature of text synthesis.
00:43:18.120 | 'Cause I, well, as a business user,
00:43:20.280 | I care a lot about that.
00:43:21.120 | - Yeah. - Right.
00:43:21.940 | - Yeah, I'm very excited about text synthesis
00:43:23.520 | and yeah, I think ideogram has done a good job
00:43:26.720 | of maybe the best job.
00:43:28.080 | DALL-E kind of has like a hit rate.
00:43:31.920 | You know, you don't want just text effects.
00:43:33.520 | I think where this has to go is it has to be like,
00:43:36.620 | you could like write little tiny pieces of text
00:43:39.000 | like on like a milk carton.
00:43:41.040 | - Yeah.
00:43:41.880 | - That's maybe not even the focal point of a scene.
00:43:43.600 | - Yeah.
00:43:44.440 | - I think that's like a very hard task
00:43:46.360 | that if you could do something like that,
00:43:48.600 | then there's a lot of other possibilities.
00:43:50.360 | - Well, you don't have to zero shot it.
00:43:51.400 | You can just be like here and focus on this.
00:43:54.080 | - Sure, yeah, yeah, definitely.
00:43:55.520 | Yeah, yeah.
00:43:56.360 | So I think text synthesis would be very exciting.
00:43:58.320 | - Yeah.
00:43:59.160 | And then also to flag Max Woolf, minimaxir,
00:44:02.860 | whose work you must have come across.
00:44:04.960 | He's done a lot of stuff about using like logo masks
00:44:08.700 | that then map onto like food or vegetables
00:44:13.440 | and it looks like text,
00:44:15.720 | which can be pretty fun.
00:44:17.280 | - Yeah, yeah.
00:44:18.280 | I mean, it's very interesting to,
00:44:20.280 | that's the wonderful thing about like
00:44:21.720 | the open source community is that you get things
00:44:23.600 | like control net and then you see all these people
00:44:25.880 | do these just amazing things with control net
00:44:28.360 | and then you wonder, I think from our point of view,
00:44:31.480 | we sort of go, that's really wonderful,
00:44:33.400 | but how do we end up with like a unified model
00:44:35.520 | that can do that?
00:44:36.400 | What are the bottlenecks?
00:44:37.320 | What are the issues?
00:44:39.040 | Because the community ultimately
00:44:40.280 | has very limited resources.
00:44:41.720 | - Yeah.
00:44:42.560 | - And so they need these kinds of like work around
00:44:45.720 | work around research ideas to get there, but yeah.
00:44:50.520 | - Are techniques like control net
00:44:52.480 | portable to your architecture?
00:44:54.240 | - Definitely, yeah.
00:44:55.440 | We kept the Playground v2 exactly the same as SDXL,
00:44:58.520 | not because, not out of laziness,
00:45:00.080 | but just because we wanted,
00:45:01.720 | we knew that the community already had tools.
00:45:03.880 | - Yeah.
00:45:04.720 | - It's, you know, all you have to do
00:45:06.080 | is maybe change a string in your code
00:45:08.600 | and then, you know, retrain a control net for it.
00:45:10.520 | So it was very intentional to do that.
00:45:12.040 | We didn't want to fragment the community
00:45:13.320 | with different architectures.
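Because the architecture matches SDXL, community tooling carries over with essentially a model-id swap. Below is a hedged sketch with diffusers, assuming the publicly released checkpoint id "playgroundai/playground-v2-1024px-aesthetic" and an off-the-shelf SDXL Canny ControlNet (which, as noted above, may still want retraining against Playground's weights for best results); the edge-map path is a placeholder.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

# An SDXL-compatible ControlNet trained by the community.
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)

# Swapping the base-model string is roughly all it takes, since Playground v2
# keeps the SDXL architecture. (Checkpoint id assumed here.)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

canny_edge_map = Image.open("edges.png")  # a precomputed Canny edge map (placeholder path)
image = pipe("a cyberpunk alley at night, cinematic lighting",
             image=canny_edge_map).images[0]
image.save("controlled.png")
```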
00:45:14.520 | - Yeah.
00:45:15.360 | Yeah.
00:45:16.200 | I have more questions about that.
00:45:17.240 | I don't know.
00:45:18.080 | I don't want to DDoS you with topics, but okay.
00:45:21.200 | I was basically going to go over three more categories.
00:45:23.640 | One is UIs, like app UIs, like mock UIs.
00:45:27.720 | Third is not safe for work, obviously.
00:45:32.120 | And then copyrighted stuff.
00:45:34.000 | I don't know if you care to comment on any of those.
00:45:36.440 | - The NSFW kind of like safety stuff is really important.
00:45:39.840 | Part of, I kind of think that one of the biggest risks
00:45:44.360 | kind of going into maybe the U.S. election year
00:45:47.200 | will probably be very interrelated
00:45:49.400 | with like graphics, audio, video.
00:45:53.760 | I think it's going to be very hard to explain,
00:45:56.200 | you know, to a family relative
00:45:58.480 | who's not kind of in our world.
00:46:00.880 | And our world is like sometimes very, you know,
00:46:02.800 | we think it's very big, but it's very tiny
00:46:04.680 | compared to the rest of the world.
00:46:05.520 | Some people are like, there's still lots of humanity
00:46:07.280 | who have no idea what ChatGPT is.
00:46:09.320 | And I think it's going to be very hard to explain,
00:46:12.080 | you know, to your uncle, aunt, whoever,
00:46:14.960 | you know, hey, I saw, you know,
00:46:16.200 | I saw President Biden say this thing on a video.
00:46:19.800 | You know, I can't believe, you know, he said that.
00:46:22.960 | I think that's going to be a very troubling thing
00:46:25.440 | going into the world next year, the year after.
00:46:29.720 | - Oh, I didn't, that's more like a risk thing.
00:46:32.280 | - Yeah. - Or like deep fakes.
00:46:33.840 | Well, faking, political faking.
00:46:35.800 | But there's just, there's a lot of studies on how,
00:46:40.520 | yeah, for most businesses,
00:46:42.080 | you don't want to train on not safe for work images,
00:46:44.480 | except that it makes you really good at bodies.
00:46:47.560 | - Yeah, I mean, yeah, I mean, we personally,
00:46:51.040 | we filter out NSFW type of images in our data set
00:46:55.760 | so that it's, you know, so our safety filter stuff
00:46:58.440 | doesn't have to work as hard.
00:46:59.640 | - But you've heard this argument that it gets,
00:47:01.600 | it makes you worse at, because obviously,
00:47:04.160 | not safe for work images are very good at human anatomy,
00:47:08.200 | which you do want to be good at.
00:47:09.640 | - Yeah, it's not about like,
00:47:11.280 | it's not like necessarily a bad thing to train on that data.
00:47:14.120 | It's more about like how you go and use it.
00:47:16.160 | That's why I was kind of talking about safety.
00:47:18.200 | - Yeah, I see. - You know, in part,
00:47:19.480 | because there are very terrible things
00:47:20.920 | that can happen in the world.
00:47:21.760 | If you have a sufficiently, you know,
00:47:23.480 | extremely powerful graphics model, you know,
00:47:25.280 | suddenly like you can kind of imagine, you know,
00:47:27.840 | now if you can like generate nudes and then there's like,
00:47:30.040 | you can do very character consistent things with faces,
00:47:32.520 | like what does that lead to?
00:47:33.560 | - Yeah. - I think it's like more
00:47:35.480 | what occurs after that, right?
00:47:37.600 | Even if you train on, let's say, you know, nude data,
00:47:40.880 | if it does something to kind of help,
00:47:42.280 | there's nothing wrong with the human anatomy.
00:47:44.760 | It's very valid for a model to learn that,
00:47:47.200 | but then it's kind of like, how does that get used?
00:47:49.440 | And, you know, I won't bring up all of the very,
00:47:52.280 | very unsavory, terrible things that we see
00:47:55.360 | on a daily basis on the site.
00:47:57.640 | I think it's more about what occurs.
00:48:00.320 | And so we, you know, we just recently did like a big sprint
00:48:03.520 | on safety internally around,
00:48:05.760 | and it's very difficult with graphics and art, right?
00:48:08.560 | Because there is tasteful art that has nudity, right?
00:48:12.940 | They're all over in museums, like, you know,
00:48:15.440 | it's very, very valid situations for that.
00:48:18.120 | And then there's, you know,
00:48:19.920 | there's the things that are the gray line of that.
00:48:22.280 | You know, what I might not find tasteful,
00:48:23.960 | someone might be like, that is completely tasteful, right?
00:48:26.840 | And then there's things that are way over the line.
00:48:29.880 | And then there are things that are, you know,
00:48:31.400 | maybe you or, you know, maybe I would be okay with,
00:48:35.600 | but society isn't.
00:48:37.720 | I think it's really hard with art.
00:48:39.600 | I think it's really, really hard.
00:48:41.360 | Sometimes even if you have like,
00:48:43.320 | even if you have things that are not nude,
00:48:45.440 | if a child goes to your site, scrolls down some images,
00:48:48.920 | you know, classrooms of kids, you know, using our product,
00:48:52.040 | it's a really difficult problem.
00:48:53.640 | And it stretches across culture, society,
00:48:57.040 | politics, everything, yeah.
00:48:59.040 | - Okay.
00:49:02.160 | Another favorite topic of our listeners is UX and AI.
00:49:06.880 | And I think you're probably one of the best
00:49:09.800 | all-inclusive editors for these things.
00:49:12.040 | So you don't just have the, you know,
00:49:14.680 | prompt images come out, you pray,
00:49:17.360 | and now you do it again.
00:49:19.240 | First, you let people pick a seed
00:49:21.880 | so they can kind of have semi-repeatable generation.
00:49:25.080 | You also have, yeah, you can pick how many images,
00:49:28.840 | and then you leave all of them in the canvas,
00:49:31.280 | and then you have kind of like this box,
00:49:33.720 | the generation box, and you can even cross between them
00:49:37.080 | and outpaint, there's all these things.
00:49:39.040 | How did you get here?
00:49:41.920 | You know, most people are kind of like,
00:49:43.800 | give me text, I give you image.
00:49:45.360 | You know, you're like, these are all the tools for you.
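The seed picker maps directly onto how these pipelines work underneath: the initial noise is drawn from a seeded random generator, so fixing the seed makes generation semi-repeatable. A minimal sketch with diffusers follows; the public SDXL base model stands in for whatever model actually serves the request.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Same prompt + same seed + same settings -> the same image, so users can
# tweak wording while holding the composition roughly steady.
generator = torch.Generator(device="cuda").manual_seed(1234)
image = pipe("a lighthouse in a storm, oil painting",
             generator=generator, num_inference_steps=30).images[0]
image.save("seed_1234.png")
```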
00:49:47.680 | - Even though we were trying to make
00:49:50.200 | a graphics foundation model,
00:49:52.600 | I think we think that we're also trying to like re-imagine
00:49:57.840 | like what a graphics editor might look like
00:49:59.680 | given the change in technology.
00:50:02.240 | So, you know, I don't think we're trying to build Photoshop,
00:50:06.160 | but it's the only thing that we could say
00:50:07.960 | that people are, you know, largely familiar with.
00:50:10.000 | Oh, okay, there's Photoshop.
00:50:11.520 | I think, you know, I don't think you would think
00:50:14.400 | of Photoshop without like the, you know,
00:50:16.840 | you wouldn't think, what would Photoshop compare itself
00:50:19.640 | to pre-computer, I don't know, right?
00:50:22.040 | It's like, or kind of like a canvas,
00:50:24.440 | but, you know, there's these menu options,
00:50:26.360 | and you can use your mouse, what's a mouse?
00:50:29.520 | So I think that we're trying to make like,
00:50:31.640 | we're trying to re-imagine
00:50:32.600 | what a graphics editor might look like.
00:50:34.120 | Not just for the fun of it,
00:50:35.800 | but because we kind of have no choice.
00:50:37.160 | Like there's this idea in image generation
00:50:39.560 | where you can generate images.
00:50:41.440 | That's like a super weird thing.
00:50:42.760 | What is that in Photoshop, right?
00:50:44.440 | You have to wait right now for the time being,
00:50:46.920 | but the wait is worth it often for a lot of people
00:50:50.600 | because they can't make that with their own skills.
00:50:52.560 | So I think it goes back to, you know,
00:50:54.760 | how we started the company,
00:50:56.560 | which was kind of looking at GPT-3's Playground,
00:51:00.720 | that the reason why we're named Playground
00:51:02.240 | is a homage to that, actually.
00:51:04.120 | And, you know, it's like,
00:51:06.480 | shouldn't these products be more visual?
00:51:09.320 | Shouldn't, you know, shouldn't they,
00:51:11.440 | these prompt boxes are like a terminal window, right?
00:51:15.080 | We're kind of at this weird point where it's just like CLI.
00:51:17.400 | It's like MS-DOS.
00:51:18.400 | I remember my mom using MS-DOS,
00:51:20.400 | and I memorized the keywords, like D-I-R-L-S,
00:51:23.160 | all those things, right?
00:51:24.520 | It feels a little like we're there, right?
00:51:26.080 | Prompt engineering is just like--
00:51:27.560 | - The shirt I'm wearing, you know, it's a bug,
00:51:29.480 | not a feature.
00:51:30.320 | - Yeah, exactly.
00:51:31.160 | Parentheses to say beautiful or whatever,
00:51:33.120 | which weights the word token more in the model or whatever.
00:51:37.160 | Yeah, it's, that's like super strange.
00:51:40.000 | I think that's not, I think everybody,
00:51:42.880 | I think a large portion of humanity would agree
00:51:45.240 | that that's not user-friendly, right?
00:51:47.720 | So how do we think about the products
00:51:49.480 | to be more user-friendly?
00:51:50.520 | Well, sure, you know, sure it would be nice
00:51:52.000 | if I could like, you know,
00:51:53.840 | if I wanted to get rid of like the headphones on my head,
00:51:56.360 | you know, it'd be nice to mask it,
00:51:57.640 | and then say, you know, can you remove the headphones?
00:52:00.640 | You know, if I want to grow the, expand the image,
00:52:03.240 | it should, you know, how can we make that feel easier
00:52:06.240 | without typing lots of words and being really confused?
00:52:09.320 | And by no stretch of the imagination
00:52:11.400 | do I think we've nailed the UI/UX yet.
00:52:14.480 | Part of that is because we don't,
00:52:18.160 | we're still experimenting.
00:52:19.480 | And part of that is because the model
00:52:21.760 | and the technology is going to get better.
00:52:24.000 | And whatever felt like the right UX six months ago
00:52:27.760 | is going to feel very broken now.
00:52:29.600 | And so that's a little bit of how we got there,
00:52:34.920 | is kind of saying, does everything have to be
00:52:37.120 | like a prompt in a box?
00:52:38.280 | Or can we do, can we do things
00:52:39.960 | that make it very intuitive for users?
00:52:42.080 | - How do you decide what to give access to?
00:52:44.960 | So you have things like Expand Prompt,
00:52:47.720 | which DALL-E 3 just does, it doesn't let you decide
00:52:51.280 | whether you should or not.
00:52:52.580 | - As in like, rewrites your prompts for you.
00:52:55.560 | - Yeah.
00:52:56.400 | - Yeah, for that feature, I think we'll probably,
00:52:59.920 | I think once we get it to be cheaper,
00:53:02.720 | we'll probably just give it up,
00:53:03.760 | we'll probably just give it away.
00:53:04.840 | But we also decided something that,
00:53:07.600 | that might be a little bit different.
00:53:08.760 | We noticed that most of image generation
00:53:10.760 | is just like kind of casual.
00:53:12.920 | You know, it's in WhatsApp, it's, you know,
00:53:14.840 | it's in a Discord bot somewhere with Midjourney,
00:53:17.200 | it's in ChatGPT.
00:53:19.240 | One of the differentiators I think we provide
00:53:21.480 | is at the expense of just lots of users necessarily,
00:53:26.480 | mainstream consumers, is that we provide as much like power
00:53:29.800 | and tweakability and configurability as possible.
00:53:33.080 | So the only reason why it's a toggle,
00:53:35.000 | because we know that users might want to use it
00:53:37.560 | and might not want to use it, right?
00:53:39.480 | There are some really powerful power user hobbyists
00:53:42.640 | that know what they're doing.
00:53:44.080 | And then there's a lot of people that,
00:53:45.940 | you know, just want something that looks cool,
00:53:49.120 | but they don't know how to prompt.
00:53:50.080 | And so I think a lot of Playground is more about
00:53:53.040 | going after that core user base that like knows,
00:53:57.040 | has a little bit more savviness
00:53:59.160 | and how to use these tools, yeah.
00:54:01.520 | So they might not use like these users probably,
00:54:03.280 | you know, the average DALL-E user
00:54:04.360 | is probably not going to use ControlNet.
00:54:05.720 | They probably don't even know what that is.
00:54:08.360 | And so I think that like, as the models get more powerful,
00:54:11.040 | as there's more tooling, yeah,
00:54:13.680 | I think you could imagine it,
00:54:15.080 | hopefully you'll imagine a new sort of
00:54:17.040 | AI first graphics editor that's
00:54:20.360 | just as like powerful and configurable as Photoshop.
00:54:24.400 | And you might have to master a new kind of tool.
00:54:27.360 | - Yeah, yeah, well.
00:54:28.720 | There's so many things I could bounce off of that.
00:54:33.640 | One, what you mentioned about waiting.
00:54:35.820 | We have to kind of somewhat address
00:54:39.560 | the elephant in the room.
00:54:40.760 | Consistency models have been blowing up the past month.
00:54:45.640 | Is that, like, how do you think about integrating that?
00:54:48.560 | Obviously there's a lot of other companies
00:54:50.040 | also trying to beat you to that space as well.
00:54:52.960 | - I think we were the first company to integrate it.
00:54:55.320 | Well, we integrated it in a different way.
00:54:57.240 | There are like 10 companies right now
00:54:58.600 | that have kind of tried to do like interactive editing
00:55:00.880 | where you can like draw on the left side
00:55:03.040 | and then you get an image on the right side.
00:55:04.560 | We decided to kind of like wait and see
00:55:06.480 | whether there's like true utility on that.
00:55:09.160 | We have a different feature that's like unique
00:55:11.320 | in our product that's called preview rendering.
00:55:15.520 | And so you go to the product and you say,
00:55:18.760 | we're like, what is the most common use case?
00:55:20.120 | The most common use case is you write a prompt
00:55:22.280 | and then you get an image.
00:55:23.180 | But what's the most annoying thing about that?
00:55:24.960 | The most annoying thing is like,
00:55:26.300 | it feels like a slot machine, right?
00:55:28.160 | You're like, okay, I'm gonna put it in
00:55:29.400 | and maybe I'll get something cool.
00:55:31.480 | So we did something that seemed a lot simpler
00:55:34.320 | but a lot more relevant to how users already use this
00:55:36.960 | product, which is preview rendering.
00:55:38.240 | You toggle it on and it will show you a render of the image.
00:55:40.560 | And then it's just like, graphics tools already have this.
00:55:44.840 | Like if you use Cinema 4D or After Effects or something,
00:55:47.480 | it's called viewport rendering.
00:55:49.600 | And so we try to take something that exists
00:55:52.280 | in the real world that has familiarity and say,
00:55:54.200 | okay, you're gonna get a rough sense
00:55:56.380 | of an early preview of this thing.
00:55:57.780 | And then when you're ready to generate,
00:55:59.640 | we're gonna try to be as coherent
00:56:01.720 | about that image that you saw.
00:56:03.440 | That way you're not spending so much time
00:56:05.300 | just like pulling down the slot machine lever.
00:56:08.900 | So we were actually the first company,
00:56:11.160 | I think we were the first company
00:56:12.080 | to actually ship a quick LCM thing, yeah.
00:56:16.160 | - Okay.
00:56:17.000 | (laughing)
00:56:18.080 | - We were very excited about it.
00:56:19.120 | So we shipped it very quick, yeah.
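For context, the community recipe for this kind of fast preview is an LCM (latent consistency model) scheduler plus distilled LCM-LoRA weights, which gets usable images in a handful of steps. A hedged sketch with diffusers, shown on the public SDXL base model rather than Playground's own stack:

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Swap in the consistency-model scheduler and the distilled LCM-LoRA weights.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# A few steps at low guidance is enough for a rough "preview render".
preview = pipe("isometric illustration of a tiny greenhouse",
               num_inference_steps=4, guidance_scale=1.0).images[0]
preview.save("preview.png")
```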
00:56:20.760 | - Yeah, I think like the other,
00:56:23.800 | well the demos I've been seeing it's also, I guess,
00:56:27.840 | it's not like a preview necessarily.
00:56:30.000 | They're almost using it to animate their generations,
00:56:34.640 | because you can kind of move shapes over.
00:56:36.240 | - Yeah, yeah, they're like doing it.
00:56:37.840 | They're like animating it,
00:56:39.400 | but they're sort of showing like if I move a moon,
00:56:41.520 | you know, can I, yeah.
00:56:42.640 | - Yeah, I don't know.
00:56:43.960 | To me it unlocks video in a way.
00:56:46.560 | - Yeah.
00:56:47.400 | - That--
00:56:48.240 | - But the video models are already
00:56:49.480 | so much better than that.
00:56:50.600 | Yeah, so.
00:56:51.440 | (laughing)
00:56:53.440 | - There's another one which I think is,
00:56:55.400 | like how about the just general ecosystem of Loras, right?
00:57:01.760 | Civitai is obviously the most popular repository of Loras.
00:57:06.200 | How do you think about sort of interacting
00:57:09.680 | with that ecosystem?
00:57:11.500 | - Yeah, I mean, the guy that did Lora,
00:57:14.080 | not the guy that invented Loras,
00:57:15.280 | but the person that brought Loras to Stable Diffusion
00:57:19.120 | actually works with us on some projects.
00:57:23.200 | His name is Simo.
00:57:24.560 | Shout out to Simo.
00:57:26.360 | And I think Loras are wonderful.
00:57:30.160 | Obviously fine tuning all these dream booth models
00:57:33.480 | and such, it's just so heavy.
00:57:35.480 | And giving, and it's obvious in our conversation
00:57:38.240 | around styles and vibes and it's very hard
00:57:42.800 | to evaluate the artistry of these things.
00:57:44.860 | Loras give people this wonderful opportunity
00:57:48.860 | to create sub-genres of art.
00:57:51.860 | And I think they're amazing.
00:57:52.900 | And so any graphics tool, any kind of thing
00:57:54.880 | that's expressing art has to provide
00:57:57.900 | some level of customization to its user base
00:58:01.340 | that goes beyond just typing Greg Rutkowski in a prompt.
00:58:04.980 | Right, we have to give more than that.
00:58:08.180 | It's not like users want to type these real artist names.
00:58:11.200 | It's that they don't know how else to get an image
00:58:12.960 | that looks interesting.
00:58:14.280 | They truly want originality and uniqueness.
00:58:16.720 | And I think Loras provide that.
00:58:18.040 | And they provide it in a very nice scalable way.
00:58:21.320 | I hope that we find something even better than Loras
00:58:24.040 | in the long term.
00:58:26.060 | 'Cause there are still weaknesses to Loras,
00:58:31.000 | but I think they do a good job for now.
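Mechanically, using one of these community LoRAs is lightweight. A hedged diffusers sketch follows; the local directory, filename, and strength are placeholders, and the base model is the public SDXL checkpoint rather than Playground's.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# A community style LoRA downloaded as a .safetensors file (placeholder names),
# blended in at partial strength so it flavors rather than overwhelms the output.
pipe.load_lora_weights("./loras", weight_name="watercolor_style.safetensors")
pipe.fuse_lora(lora_scale=0.8)

image = pipe("a fox in a birch forest, watercolor style").images[0]
image.save("fox_watercolor.png")
```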
00:58:32.560 | - Yeah, and so you don't want to be the,
00:58:34.440 | like you wouldn't ever compete with Civitai.
00:58:36.320 | You would just kind of--
00:58:37.920 | - Civitai's a site where like all these things
00:58:39.320 | get kind of hosted by the community, right?
00:58:41.880 | And so yeah, we'll often pull down
00:58:43.960 | like some of the best things there.
00:58:46.600 | I think when we have a significantly better model,
00:58:51.440 | we will certainly build something.
00:58:53.360 | - I see. - That gets closer to that.
00:58:55.200 | I still, again, I go back to saying just,
00:58:57.080 | I still think this is like very nascent.
00:58:59.120 | Things are very underpowered, right?
00:59:00.920 | Loras are not easy for people to train.
00:59:05.640 | You know, they're easy for an engineer,
00:59:07.800 | but they're not easy, you know,
00:59:10.160 | it sure would be nicer if I could just pick,
00:59:11.920 | you know, five or six reference images, right?
00:59:14.440 | And then say, hey, you know, this is,
00:59:17.480 | and they might even be five or six different
00:59:19.080 | reference images that are not,
00:59:20.400 | they're just very different, actually.
00:59:22.200 | Like they're, they communicate a style,
00:59:24.220 | but they're actually like, it's like a mood board, right?
00:59:27.560 | And it takes, you have to be kind of an engineer almost
00:59:30.480 | to train these Loras or go to some site
00:59:32.120 | and be technically savvy at least.
00:59:33.980 | It seems like it'd be much better if I could say,
00:59:37.280 | I love this style.
00:59:38.500 | I love this style, here are five images.
00:59:43.640 | And you tell the model, like, this is what I want.
00:59:45.680 | And the model gives you something that's very aligned
00:59:48.320 | with what your style is, what you're talking about.
00:59:50.320 | And it's a style you couldn't even communicate, right?
00:59:52.400 | There's no word, you know, this is,
00:59:54.400 | you know, if you have a Tron image, it's not just Tron,
00:59:56.040 | it's like Tron plus like four or five
00:59:57.980 | different weird things. - Cyberpunk, yeah.
00:59:59.480 | - Yeah, even cyberpunk can have its like sub-genre, right?
01:00:03.360 | But I just think training Loras and doing that
01:00:05.680 | is very heavy, so I hope we can do better than that.
01:00:08.800 | - Cool. - Yeah.
01:00:09.640 | - We have Sharif from Lexica on the podcast before.
01:00:13.640 | - Oh, nice.
01:00:14.600 | - Both of you have like a landing page
01:00:17.360 | with just a bunch of images
01:00:19.320 | where you can like explore things.
01:00:20.960 | - Yeah, yeah, we have a feed.
01:00:22.820 | - Yeah, yeah, is that something you see more and more of
01:00:25.880 | in terms of like coming up with these styles?
01:00:27.660 | Is that why you have that as the starting point
01:00:30.540 | versus a lot of other products, you just go in,
01:00:32.680 | you have the generation prompt,
01:00:34.340 | you don't see a lot of examples?
01:00:36.160 | - Our feed is a little different than their feed.
01:00:38.520 | Our feed is more about community.
01:00:41.000 | So we have kind of like a Reddit thing going on
01:00:43.800 | where it's a kind of a competition like every day,
01:00:47.200 | loose competition, mostly fun competition
01:00:49.640 | of like making things.
01:00:51.460 | And there's just this wonderful community of people
01:00:53.760 | where they're liking each other's images
01:00:55.120 | and just showing their genuine interest in each other.
01:00:58.440 | And I think we definitely learn about styles that way.
01:01:01.700 | One of the funniest polls,
01:01:03.400 | if you go to the Mid-Journey polls,
01:01:06.640 | they'll sometimes put these polls out and they'll say,
01:01:08.400 | you know, what do you wish you could like learn more from?
01:01:10.040 | And like one of the things that people vote the most for
01:01:12.760 | is like learning how to prompt, right?
01:01:16.080 | And so I think like, you know,
01:01:17.520 | if you put away your research hat for a minute
01:01:19.560 | and you just put on like your product hat for a second,
01:01:22.160 | you're kind of like, well,
01:01:23.000 | why do people want to learn how to prompt, right?
01:01:25.400 | It's because they want to get higher quality images.
01:01:28.160 | Well, what's higher quality composition,
01:01:29.600 | lighting, aesthetics, so on and so forth.
01:01:32.660 | And I think that the community on our feed,
01:01:35.560 | I think we might have the biggest community
01:01:38.300 | and it gives all of the users a way to learn how to prompt
01:01:43.300 | because they're just seeing this huge rising tide
01:01:47.300 | of all these images that are super cool and interesting
01:01:49.980 | and they can kind of like take each other's prompts
01:01:51.780 | and like kind of learn how to do that.
01:01:53.680 | I think that'll be short-lived
01:01:57.180 | because I think the complexity of these things
01:01:58.540 | is going to get higher,
01:01:59.780 | but that's more about why we have that feed
01:02:03.800 | is to help each other, help teach users
01:02:05.840 | and then also just celebrate people's art.
01:02:08.600 | - You run your own infra.
01:02:09.960 | - We do.
01:02:10.800 | - Yeah, that's unusual.
01:02:12.360 | (laughs)
01:02:14.560 | - It's necessary.
01:02:15.480 | - It's necessary.
01:02:16.480 | What have you learned running DevOps for GPUs?
01:02:19.360 | You had a tweet about like how many A100s you have,
01:02:22.680 | but I feel like it's out of date probably.
01:02:28.020 | - I mean, it just comes down to cost.
01:02:29.200 | These things are very expensive.
01:02:30.400 | So we just want to make it as affordable
01:02:33.160 | for everybody as possible.
01:02:34.960 | I find the DevOps for inference to be relatively easy.
01:02:40.320 | It doesn't feel that different than,
01:02:42.360 | I think we had thousands and thousands of servers
01:02:44.840 | at Mixpanel just for dealing with the API,
01:02:47.720 | which had such huge quantities of volume,
01:02:50.200 | that I don't find it particularly very different.
01:02:53.680 | I do find model optimization performance
01:02:57.700 | is very new to me.
01:02:58.660 | So I think that I find that very difficult at the moment.
01:03:01.140 | So that's very interesting.
01:03:02.620 | But scaling inference is not terrible.
01:03:05.820 | Scaling a training cluster is much, much harder
01:03:08.860 | than I perhaps anticipated.
01:03:11.840 | - Why is that?
01:03:12.980 | - Well, it's just like a very large distributed system
01:03:16.660 | with if you have like a node that goes down
01:03:20.100 | then your training run crashes
01:03:21.820 | and then you have to somehow be resilient to that.
01:03:23.560 | And I would say training infra software is very early.
01:03:28.260 | It feels very broken.
01:03:29.820 | I can tell in 10 years, it would be a lot better.
01:03:32.260 | - Like a mosaic or whatever.
01:03:34.340 | - Yeah, we don't even know.
01:03:35.180 | I think we use very basic tools like Slurm for scheduling
01:03:39.020 | and just normal PyTorch, PyTorch Lightning,
01:03:41.340 | that kind of thing.
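The baseline resilience pattern with that stack is simply to checkpoint frequently and resume after a node failure. A minimal sketch with PyTorch Lightning under Slurm; model and dm stand for a LightningModule and LightningDataModule defined elsewhere, and the cluster shape is arbitrary.

```python
import os
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Checkpoint often; save_last=True keeps a rolling "last.ckpt" to resume from.
ckpt_cb = ModelCheckpoint(dirpath="checkpoints/", save_last=True, every_n_train_steps=500)

trainer = Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=4,        # Slurm provides the node/rank environment for the launcher
    strategy="ddp",
    callbacks=[ckpt_cb],
)

# On a fresh start this trains from scratch; after a crash, relaunching the same
# Slurm job picks the run back up from the last checkpoint.
resume = "checkpoints/last.ckpt" if os.path.exists("checkpoints/last.ckpt") else None
trainer.fit(model, datamodule=dm, ckpt_path=resume)  # model/dm: placeholders
```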
01:03:42.160 | I think our tooling is nascent.
01:03:43.780 | I talked to a friend that's over at xAI.
01:03:45.740 | They just, they like built their own scheduler
01:03:48.540 | and doing things with Kubernetes.
01:03:50.140 | When people are building out tools
01:03:51.900 | because the existing open source stuff doesn't work
01:03:54.000 | and everyone's doing their own bespoke thing,
01:03:55.600 | you know there's a valuable company to be formed.
01:03:58.040 | - Yeah, I think it's Mosaic.
01:03:59.840 | I don't know.
01:04:01.360 | - Well, with Mosaic, yeah, it's tough with Mosaic
01:04:03.680 | 'cause anyway, I won't go into the details why,
01:04:06.240 | but yeah, we found it difficult to do it.
01:04:09.200 | It might be worth like wondering
01:04:10.640 | like why not everyone is going to Mosaic.
01:04:13.160 | Perhaps it's still, I just think it's nascent
01:04:15.720 | and perhaps Mosaic will come through.
01:04:17.520 | - Cool, anything for you?
01:04:18.920 | - No, no, this was great.
01:04:20.880 | And just to wrap, we talked about
01:04:22.940 | some of the pivotal moments in your mind
01:04:25.040 | with like DALL-E and whatnot.
01:04:27.140 | If you were not doing this,
01:04:30.120 | what's the most interesting unsolved question in AI
01:04:33.360 | that you would try and build in?
01:04:34.960 | - Oh man, coming up with startup ideas
01:04:38.160 | is very hard on the spot.
01:04:39.900 | - You shoot, you have to have them.
01:04:42.580 | I mean, you're a founder, you're a repeat founder.
01:04:45.440 | - I'm very picky about my startup ideas.
01:04:49.140 | So I don't have any great ones.
01:04:51.620 | The only thing that I, I don't have an idea per se
01:04:54.900 | as much as a curiosity.
01:04:57.300 | And I suppose I'll pose it to you guys.
01:05:00.820 | Right now, we sort of think that a lot of the modalities
01:05:04.600 | just kind of feel like they're vision, language, audio,
01:05:09.600 | that's roughly it.
01:05:11.880 | And somehow all this will like turn into something,
01:05:14.420 | it'll be multimodal and then we'll end up with AGI perhaps.
01:05:18.740 | And I just think that there are probably far more modalities
01:05:22.580 | than maybe we, than meets the eye.
01:05:25.540 | And it just seems hard for us to see it right now
01:05:28.760 | because it's sort of like we have tunnel vision
01:05:31.260 | on the moment.
01:05:32.100 | - We're just like code, image, audio, video.
01:05:34.700 | - Yeah, I think--
01:05:35.540 | - Very, very broad categories.
01:05:36.660 | - I think we are lacking imagination as a species
01:05:39.840 | in this regard.
01:05:40.680 | And I think like, you know, just like, you know,
01:05:43.580 | it's not, I don't know what company would form
01:05:45.300 | as a result of this, but you know,
01:05:47.220 | like there's some very difficult problems,
01:05:49.420 | like just like a true actual, like not a meta world model,
01:05:52.940 | but an actual world model that truly maps everything
01:05:56.860 | that's going in terms of like physics and fluids
01:06:00.140 | and all these various kinds of interactions.
01:06:02.700 | And what does that kind of model,
01:06:04.660 | like a true physics foundation model of sorts
01:06:07.340 | that represents earth.
01:06:09.040 | And that in of itself seems very difficult, you know,
01:06:13.060 | but we just think of, but we're kind of stuck on like
01:06:15.460 | thinking that we can approximate everything
01:06:17.020 | with like, you know, a word or a token, if you will.
01:06:20.820 | And I went, you know, I had a dinner last night
01:06:22.300 | where we were kind of debating this philosophically.
01:06:24.580 | And I think someone, you know, said something
01:06:26.380 | that I also believe in, which is like,
01:06:27.780 | at the end of the day, it doesn't really matter
01:06:29.260 | that it's like a token or a byte.
01:06:31.180 | At the end of the day, it's just like some, you know,
01:06:33.620 | unit of information that it emits.
01:06:36.100 | But, you know, I do wonder if there are more,
01:06:38.780 | far more modalities than meets the eye.
01:06:42.520 | And if you could create that, then what would that,
01:06:45.300 | what would that company become?
01:06:47.220 | What problems could you solve?
01:06:48.940 | So I don't know yet, so I don't have a great company for it.
01:06:52.700 | - I don't know.
01:06:53.540 | Maybe you would just inspire somebody to try.
01:06:56.180 | - Yeah, hopefully.
01:06:57.720 | - My personal response to that is I'm less interested
01:06:59.860 | in physics and more interested in people.
01:07:01.780 | Like how do I mind upload?
01:07:04.220 | Because that is teleportation, that is immortality,
01:07:07.940 | that is everything.
01:07:08.980 | - Yeah, yeah, can we model our own,
01:07:11.660 | rather than trying to create consciousness,
01:07:13.300 | could we model our own?
01:07:15.040 | Even if it was lossy to some extent, yeah.
01:07:18.500 | - Yeah.
01:07:19.780 | Well, we won't solve that here.
01:07:22.180 | If I were to take a Bill Gates book trip and had a week,
01:07:27.180 | what should I take with me to learn AI?
01:07:29.820 | - Oh man, oh gosh.
01:07:32.700 | You shouldn't take a book, you should just go to YouTube
01:07:35.540 | and visit Karpathy's class and just do it, do it,
01:07:40.540 | grind through it.
01:07:41.820 | That's actually the most useful thing for you?
01:07:43.300 | - I wish it came out when I started back last year.
01:07:46.220 | I'm as bummed that I didn't get to take it
01:07:49.460 | at the beginning, but I did do a few of his classes
01:07:53.140 | regardless.
01:07:53.980 | I don't think books, every time I buy a programming book,
01:07:57.300 | I never read it.
01:07:58.220 | I always find that just writing code
01:08:00.500 | helps cement my internal understanding.
01:08:02.300 | - Yeah, so more generally, advice for founders
01:08:04.820 | who are not PhDs and are effectively self-taught
01:08:07.420 | like you are, what should they do, what should they avoid?
01:08:11.000 | - Same thing that I would advise if you're programming.
01:08:14.100 | Pick a project that seems very exciting to you,
01:08:16.700 | but doesn't have to be too serious,
01:08:19.060 | and build it and learn every detail of it while you do it.
01:08:22.420 | - And it must be, should you train?
01:08:24.740 | Or can you go far enough not training, just fine-tuning?
01:08:29.180 | - It depends, I would just follow your curiosity.
01:08:31.500 | If what you want to do is something
01:08:33.660 | that requires fundamental understanding of training models,
01:08:35.980 | then you should learn it.
01:08:37.820 | You don't have to be a PhD, you don't have to get
01:08:39.300 | to become a five-year, whatever, PhD,
01:08:41.940 | but if that's necessary, I would do it.
01:08:44.700 | If it's not necessary, then go as far as you need to go,
01:08:46.860 | but I would pick something that motivates you.
01:08:48.940 | I think most people tap out on motivation,
01:08:51.420 | but not when they're deeply curious.
01:08:52.780 | - Cool. - Cool.
01:08:55.180 | - Thank you so much for coming out, man.
01:08:56.380 | - Thank you for having me, appreciate it.
01:08:58.980 | (upbeat music)
01:09:19.980 | (gentle music)