The AI-First Graphics Editor - with Suhail Doshi of Playground AI
Chapters
0:00 Introductions
0:56 Suhail's background (Mixpanel, Mighty)
9:00 Transition from Mighty to exploring AI and generative models
10:24 The viral moment for generative AI with DALL-E 2 and Stable Diffusion
17:58 Training Playground v2 from scratch
27:52 The MJHQ-30K benchmark for evaluating a model's aesthetic quality
30:59 Discussion on styles in AI-generated images and the categorization of styles
43:18 Tackling edge cases from UIs to NSFW
49:47 The user experience and interface design for AI image generation tools
54:50 Running their own infrastructure for GPU DevOps
57:13 The ecosystem of LoRAs
62:07 The goals and challenges of building a graphics editor with AI integration
64:44 Lightning Round
And I'm joined by my co-host Swyx, founder of Smol.ai. 00:00:09.960 |
- Hey, and today in the studio we have Suhail Doshi. 00:00:14.320 |
- Among many things, you're a CEO and co-founder 00:00:22.640 |
- And more recently, I think about a year ago, 00:00:31.160 |
I'd just like to start, touch on Mixpanel a little bit, 00:00:40.640 |
And I'm curious if you had any sort of reflections 00:00:52.120 |
Like, I don't know if there's still a part of you 00:00:58.560 |
the short version is that maybe back in like 2015 or '16, 00:01:05.800 |
'cause it was a while ago, we had an ML team at Mixpanel. 00:01:08.900 |
And I think this is like when maybe deep learning 00:01:22.460 |
So we built, you know, two or three different features. 00:01:24.480 |
I think we built a feature where we could predict 00:01:41.160 |
maybe a spike in traffic in a particular region. 00:01:46.360 |
'Cause it's really hard to like know everything 00:01:49.080 |
Could we tell you something surprising about your data? 00:01:53.700 |
Most of it boiled down to just like, you know, 00:01:58.480 |
And it never quite seemed very groundbreaking in the end. 00:02:03.280 |
And so I think, you know, we had a four or five person ML team 00:02:16.640 |
- Yeah, that was the first time I did fast AI. 00:02:18.320 |
Yeah, I think I've done it now three times maybe. 00:02:23.160 |
- No, no, just me reviewing it is maybe three times, 00:02:29.160 |
but honestly like it's also just about the feedback, right? 00:02:34.900 |
I think it's useful for anyone building AI applications. 00:02:40.560 |
- Yeah, I think I haven't spent a lot of time 00:02:42.760 |
thinking about Mixpanel 'cause it's been a long time, 00:02:44.440 |
but yeah, I wonder now, given everything that's happened, 00:02:51.920 |
And then I kind of like move on to whatever I'm working on, 00:02:59.560 |
And then maybe we'll touch on Mighty a little bit. 00:03:03.400 |
It was basically, well, my framing of it was, 00:03:10.820 |
I have too many tabs open and it's slowing down my machine 00:03:15.480 |
- Yeah, we were first trying to make a browser 00:03:17.960 |
that we would stream from a data center to your computer 00:03:22.680 |
But the real objective wasn't trying to make a browser 00:03:30.760 |
And the thought was just that, like, you know, 00:03:37.840 |
or they don't have enough RAM or not enough disk, 00:03:39.640 |
or, you know, there's some limitation with our computers, 00:03:46.360 |
Could we, you know, why do I need to think about 00:03:50.760 |
And so, you know, we just had to kind of observe that, 00:03:52.800 |
like, well, actually, it seems like a lot of applications 00:03:57.600 |
You know, it's like how many real desktop applications 00:04:00.800 |
do we use relative to the number of applications 00:04:03.800 |
So it was just this realization that actually, like, 00:04:05.920 |
you know, the browser was effectively becoming 00:04:10.840 |
And so then that's why we kind of decided to go, 00:04:18.040 |
but the objective is to try to make a true new computer. 00:04:26.960 |
- I think the last, or one of the last in-person ones, 00:04:36.080 |
when everybody wants to put some of these models 00:04:40.960 |
Do you think there's maybe another wave of the same problem 00:04:46.200 |
and now it's like models too slow to run on device? 00:04:52.520 |
but a lot of what I somewhat believed at Mighty 00:04:58.760 |
Maybe why I'm so excited about AI and what's happening. 00:05:03.440 |
was like moving compute somewhere else, right? 00:05:07.920 |
they get limited quantities of memory, disk, networking, 00:05:14.920 |
You know, what if these applications could somehow, 00:05:18.440 |
and then these applications have vastly more compute 00:05:22.120 |
Right now it's just like client backend services, 00:05:24.920 |
but you know, what if we could change the shape 00:05:27.280 |
of how applications could interact with things? 00:05:33.400 |
In some ways, AI has like a bit of a continuation 00:05:38.120 |
perhaps we can really shift compute somewhere else. 00:05:43.120 |
was that JavaScript is single-threaded in the browser. 00:05:53.760 |
We could have made some kind of enterprise business, 00:05:56.080 |
probably could have made maybe a lot of money, 00:05:59.000 |
but it wasn't going to be what I hoped it was going to be. 00:06:01.520 |
And so once I realized that most of a web app 00:06:05.440 |
is just going to be single-threaded JavaScript, 00:06:20.480 |
two of which sell, you know, big ones, you know, 00:06:23.120 |
AMD, Intel, and then of course, like Apple made the M1. 00:06:26.440 |
And it's not like single-threaded CPU core performance. 00:06:30.240 |
Single core performance was like increasing very fast. 00:06:38.560 |
sort of with the continuation of Moore's law. 00:06:40.640 |
But what happened in AI was that you got like, 00:06:43.480 |
like if you think of the AI model as like a computer program 00:07:01.120 |
I can make computation happen really rapidly 00:07:09.200 |
really amazing models that can like do anything. 00:07:19.000 |
into these like really amazing AI models in reality. 00:07:23.440 |
Like I think Andrej Karpathy has always been, 00:07:25.760 |
has been making a lot of analogies with the LLM OS. 00:07:38.680 |
- Yeah, I think, I think there still will be, 00:07:45.760 |
Yeah, I think it just depends on kind of like 00:07:48.440 |
like any, like any engineer would probably care about. 00:07:54.400 |
like if the models continue to kind of keep getting bigger, 00:07:59.760 |
whether you should use the big thing or the small, 00:08:08.320 |
Maybe that would be hard to do, you know, over a network. 00:08:12.240 |
- Yeah, you tackle the much harder problem latency-wise, 00:08:16.520 |
you know, than the AI models actually require. 00:08:22.360 |
You know, you definitely did 30 FPS video streaming, 00:08:30.720 |
on the kinds of things you can do with networking. 00:08:33.960 |
Maybe someday you'll come back to that at some point. 00:08:41.200 |
Very good to follow you just to learn your insights. 00:08:43.840 |
And you actually published a postmortem on Mighty 00:08:45.800 |
that people can read up on if they're willing to. 00:08:50.760 |
You started exploring the AI stuff in June, 2022, 00:09:05.480 |
for the team at Mighty to finish up something. 00:09:11.240 |
"I guess I will make some kind of like address bar predictor 00:09:18.560 |
And I was like, "You know, one thing that's kind of lame 00:09:22.420 |
"is that like this browser should be like a lot better 00:09:24.600 |
"at predicting what I might do, where I might wanna go." 00:09:34.680 |
And for a company like Google, you'd think there's a lot, 00:09:37.320 |
but it's actually just like the code is actually just very, 00:09:41.200 |
you know, it's just a bunch of if then statements 00:09:47.600 |
And that's also where a lot of people interact 00:10:12.200 |
And I think that was the first like truly big viral moment 00:10:19.680 |
- Because of the avocado chair and yeah, exactly. 00:10:26.040 |
- It wasn't as big for me as "Stable Diffusion." 00:10:34.460 |
but I never really, it didn't really register me as-- 00:10:51.620 |
for developers to walk in back when it wasn't as, 00:10:56.100 |
I guess, much of a security issue as it is today. 00:11:08.580 |
but there could be any number of AI companies 00:11:13.140 |
and businesses that you could start in the widest one, right? 00:11:17.340 |
- So there must be an idea maze from June to September. 00:11:32.300 |
But back then, I think they were more than happy. 00:11:36.180 |
They had a lot more bandwidth to help anybody. 00:11:38.900 |
And so we had been talking with the team there 00:11:43.820 |
really fast, low-latency address bar prediction 00:11:59.140 |
and kind of being involved gave me a bird's-eye view 00:12:01.660 |
into a bunch of things that started to happen. 00:12:12.060 |
And I remember just kind of sitting up one night thinking, 00:12:20.740 |
One thing that I observed is that I find a lot of great, 00:12:29.620 |
Like for Mixpanel, I was an intern at a company, 00:12:34.660 |
And so I thought, "Hmm, I wonder if I could make a product, 00:12:38.500 |
And in this case, the same thing kind of occurred. 00:12:46.640 |
they put a model up, and then you can use their API, 00:12:49.500 |
like Replicate is a really good example of that. 00:12:52.620 |
There are a bunch of companies that are helping you 00:12:54.620 |
with training, model optimization, Mosaic at the time, 00:12:59.620 |
and probably still was doing stuff like that. 00:13:03.180 |
So I just started listing out every category of everything, 00:13:06.340 |
of every company that was doing something interesting. 00:13:15.440 |
"I might be really good at competing with that company." 00:13:17.940 |
Because of Mixpanel, 'cause it's so much of analysis. 00:13:21.380 |
I was like, "No, I don't want to do anything related to that. 00:13:23.780 |
"I think that would be too boring now at this point." 00:13:26.480 |
But, so I started to list out all these ideas, 00:13:35.620 |
And all it was was just a text box, more or less. 00:13:38.060 |
And then there were some settings on the right, 00:13:44.940 |
I mean, that was like their product before ChatGPT. 00:13:54.460 |
where the interface kind of was getting more and more, 00:13:59.420 |
generate something in the middle of a sentence, 00:14:07.460 |
"and you generate something, and that's about it." 00:14:15.820 |
And so I had this kind of thing where I wrote prompt dash, 00:14:20.460 |
And I didn't know what was like the product for that, 00:14:40.260 |
And so then of course, then you thought about, 00:14:44.420 |
How would you build a UI for each kind of modality? 00:14:54.300 |
because it seemed like the most obvious place 00:14:57.760 |
where you could build a really powerful, complex UI 00:15:11.300 |
So yeah, I think that just that progression kind of happened 00:15:17.360 |
and it just seemed like there was a lot of effort 00:15:49.100 |
and I will stop and I'll go do something else. 00:15:51.500 |
But if you're not gonna do anything, I'll just do it. 00:15:59.620 |
that they were gonna focus on language primarily. 00:16:28.540 |
- Well, not so much, because I think that right now 00:16:32.620 |
I would say graphics is in this very nascent phase. 00:16:34.740 |
Like most of the customers are just like hobbyists, right? 00:16:40.140 |
as opposed to being this like very high utility thing. 00:16:47.260 |
then probably the next customers will end up being B2B. 00:16:55.500 |
If your quest is to kind of make like a super, 00:17:00.220 |
something that surpasses human ability for graphics, 00:17:03.660 |
like ultimately it will end up being used for business. 00:17:16.420 |
I think it will be a very similar progression. 00:17:19.540 |
- But yeah, I mean, the reason why I was excited about it 00:17:26.100 |
It's like something that I know I could stay up 00:17:30.400 |
Those are kind of like very simple bars for me. 00:17:38.780 |
You just had Playground V2 come out two days ago? 00:17:43.760 |
So this is a model you train completely from scratch. 00:17:49.480 |
You open source everything, including the weights. 00:17:59.380 |
- Yeah, what made you want to come up with V2 00:18:06.180 |
- Yeah, so I think that we continue to feel like graphics 00:18:12.100 |
and these foundation models for anything really related 00:18:21.060 |
It feels a little like graphics is in this GPT-2 moment, 00:18:29.320 |
But it was like, what are you gonna use this for? 00:18:30.980 |
You know, yeah, we'll do some text classification 00:18:34.740 |
and maybe it'll sometimes make a summary of something 00:18:42.960 |
And in images, we're kind of stuck in the same place. 00:18:46.500 |
We're kind of like, okay, I write this thing in a box 00:18:54.500 |
Maybe I'll use it for a blog post, that kind of thing. 00:19:02.320 |
at stable diffusion and we definitely use that model 00:19:04.740 |
in our product and our users like it and use it 00:19:12.420 |
So we were kind of faced with the choice of, you know, 00:19:21.180 |
to just decide to go train these things from scratch. 00:19:24.380 |
And I think the community has given us so much. 00:19:28.740 |
is one of the most vibrant communities on the internet. 00:19:33.360 |
It feels like, I hope this is what Homebrew Club felt like 00:19:36.680 |
when computers showed up because it's like amazing 00:19:42.060 |
I've never seen anything in my life where so far, 00:19:44.540 |
and heard other people's stories around this, 00:19:46.540 |
where a research, an academic research paper comes out 00:19:50.200 |
and then like two days later, someone has sample code for it 00:19:55.180 |
and then two days later, it's like in nine products. 00:20:10.020 |
So I think we wanted to give back to the community 00:20:21.540 |
But we definitely felt like there needs to be 00:20:24.220 |
some kind of progress in these open source models. 00:20:31.900 |
but there hasn't been anything really since, right? 00:20:36.380 |
- Well, SDXL Turbo is like this distilled model, right? 00:20:40.780 |
You have to decide what your trade-off is there. 00:20:46.100 |
- It's not, I don't think it's a consistency model. 00:20:58.340 |
- Yeah, I think it's, I've read something about that. 00:21:04.020 |
But yeah, there hasn't been quite enough progress 00:21:06.820 |
in terms of, you know, there's no multitask image model. 00:21:09.780 |
You know, the closest thing would be something called 00:21:16.140 |
So we did that and we also gave out pre-trained weights, 00:21:28.260 |
there's like a 256 pixel pre-trained stage and a 512. 00:21:35.020 |
we come across people all the time in academia 00:21:37.620 |
they have access to like one A100 or eight at best. 00:21:42.060 |
And so if we can give them kind of like a 512 00:21:47.740 |
our hope is that there'll be interesting novel research 00:21:57.900 |
tend to be things like character consistency. 00:22:08.620 |
one image and then you want it to be like in another. 00:22:26.820 |
You know, there are two things like InstructPix2Pix 00:22:28.860 |
and then the Emu Edit paper that are maybe very interesting, 00:22:33.140 |
but we certainly are not pushing the fold on that 00:22:37.340 |
It just, all kinds of things like around that rotation, 00:22:43.220 |
you know, being able to keep coherence across images, 00:22:52.100 |
what's going on in an image, that kind of thing. 00:22:54.820 |
Things are still very, very underpowered, very nascent. 00:22:57.820 |
So therefore the utility is very, very limited. 00:23:02.780 |
you are 2.5x preferred over Stable Diffusion XL. 00:23:15.660 |
- I think they're still very early on in the recipe, 00:23:18.140 |
but I think it's a lot of like little things. 00:23:28.020 |
So we spend a lot of time thinking about that. 00:23:37.020 |
Everything from captions to the data that you align with 00:23:40.980 |
after pre-train to how you're picking your data sets, 00:23:46.920 |
There's a lot, I feel like there's a lot of work in AI 00:23:52.060 |
It just really feels like just data set filtering 00:23:56.260 |
And just like, you know, and the recipe is all there, 00:23:58.460 |
but it's like a lot of extra work to do that. 00:24:01.580 |
So I think these models, I think whatever version, 00:24:08.220 |
maybe either by the end of the year or early next year. 00:24:10.940 |
And we're just like watching what the community does 00:24:14.300 |
And then we're just gonna take a lot of the things 00:24:16.060 |
that they're unhappy about and just like fix them. 00:24:19.520 |
You know, so for example, like maybe the eyes of people 00:24:29.600 |
That's something that we already know we wanna fix. 00:24:31.320 |
So I think in that case, it's gonna be about data quality. 00:24:34.600 |
Or maybe you wanna improve the kind of the dynamic range 00:24:38.440 |
You know, we wanna make sure that that's like got a good 00:24:43.000 |
There's different things like offset noise, pyramid noise, 00:24:47.120 |
Like there are all these various interesting things 00:24:49.920 |
So I think it's like a lot of just like tricks. 00:24:57.220 |
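For context, offset noise (one of the tricks mentioned above) is a small training-time tweak: instead of pure Gaussian noise, a per-channel constant shift is added so the model can learn images whose overall brightness sits far from the mean, which helps with the dynamic range point raised earlier. A minimal sketch of the idea; the scale value is illustrative, not Playground's recipe:

```python
import torch

def offset_noise(latents: torch.Tensor, offset: float = 0.1) -> torch.Tensor:
    """Gaussian training noise plus a small per-sample, per-channel constant
    shift -- the 'offset noise' trick that lets diffusion models reach very
    dark or very bright images. The 0.1 scale is illustrative."""
    noise = torch.randn_like(latents)
    noise = noise + offset * torch.randn(
        latents.shape[0], latents.shape[1], 1, 1,
        device=latents.device, dtype=latents.dtype,
    )
    return noise
```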
- Specifically for faces, it's very common to use a pipeline 00:25:05.400 |
Do you have a strong belief either way on like, 00:25:08.440 |
oh, they should be separated out to different stages 00:25:10.720 |
for like improving the eyes, improving the face 00:25:14.440 |
Or do you think like it can all be done in one model? 00:25:29.960 |
Maybe there is something out there that we haven't read. 00:25:32.220 |
There are some bottlenecks, like for example, in the VAE, 00:25:35.800 |
like the VAEs are ultimately like compressing these things. 00:25:38.120 |
And so you don't know, and then you might have 00:25:42.800 |
So maybe you would use a pixel based model, perhaps. 00:25:48.280 |
I think we've talked to people, everyone from like Rombach 00:25:51.300 |
to various people, Rombach trained stable diffusion. 00:25:54.520 |
You know, I think there's like a big question 00:26:07.800 |
that's also seemingly working with diffusion. 00:26:10.240 |
And so, you know, are we going to use vision transformers? 00:26:16.360 |
We don't really, I don't think there have been 00:26:22.680 |
- Yeah, I think it's very computationally expensive 00:26:25.120 |
to do a pipeline model where you're like fixing the eyes 00:26:28.080 |
and you're fixing the mouth and you're fixing the hands. 00:26:28.920 |
- That's what everyone does as far as I understand. 00:26:31.340 |
- Well, I'm not sure, I'm not exactly sure what you mean, 00:26:47.760 |
Now you have to pick all these different things. 00:26:49.280 |
- Yeah, you're just kind of glomming things on together. 00:26:51.120 |
Like when I look at AI artists, like that's what they do. 00:26:59.320 |
control net tiling to do kind of generative upscaling 00:27:04.140 |
Yeah, I mean, to me, these are all just like, 00:27:09.480 |
let's go back to where we were just three years, 00:27:12.240 |
four years ago with where deep learning was at 00:27:18.360 |
well, I'll just train these very narrow models 00:27:21.200 |
to try to do these things and kind of ensemble them 00:27:23.200 |
or pipeline them to try to get to a best-in-class result. 00:27:25.600 |
And here we are with like where the models are gigantic 00:27:29.400 |
and like very capable of solving huge amounts of tasks 00:27:38.520 |
You also released a new benchmark called MJHQ-30K 00:27:42.480 |
for automatic evaluation of a model's aesthetic quality. 00:27:54.120 |
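MJHQ-30K is a curated reference set of high-quality Midjourney images, and the usual way to use it is to compute FID between a model's generations and that reference set. A minimal sketch, assuming the clean-fid package and that both sets of images have already been written to local folders (paths are placeholders):

```python
# pip install clean-fid
from cleanfid import fid

# Placeholder paths: a folder of generated images (e.g. one per benchmark
# prompt) and a folder with the downloaded MJHQ-30K reference images.
score = fid.compute_fid("outputs/my_model", "data/mjhq30k_reference")
print(f"FID against the MJHQ-30K reference set: {score:.2f}")
```

Lower FID means the generated distribution sits closer to the curated aesthetic reference; since the benchmark is organized into categories, per-category scores can be reported alongside the overall number.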
How do you think about the Playground model versus Midjourney? 00:27:59.840 |
a lot of people in research like to come up with, 00:28:09.800 |
it can be helpful to not be a researcher also sometimes. 00:28:15.320 |
I don't have a PhD in anything AI related, for example. 00:28:23.400 |
then the most important thing that you wanna figure out 00:28:32.760 |
We would, I would happily, I'm happy to admit that. 00:28:43.720 |
to try to compare ourselves to the thing that's best, 00:28:45.720 |
even if we lose, even if we're not the best, right? 00:28:55.440 |
then we only have ourselves to compare ourselves to. 00:29:06.320 |
So I think more people should try to do that. 00:29:09.960 |
kind of comparing yourself on like some Google model 00:29:13.480 |
or some old SD, you know, stable diffusion model 00:29:16.680 |
and be like, look, we beat, you know, stable diffusion 1.5. 00:29:19.640 |
I think users ultimately want care, you know, 00:29:24.520 |
that like I also mostly, people mostly agree with. 00:29:32.840 |
this seems like a worthy thing for us to at least try, 00:29:41.000 |
And you kill Stable Diffusion XL and everything. 00:29:41.000 |
it says Playground V2 1024 pixel dash aesthetic. 00:29:57.960 |
- We debated this, maybe we named it wrong or something, 00:30:00.080 |
but we were like, how do we help people realize 00:30:03.400 |
the model that's aligned versus the models that weren't. 00:30:19.980 |
Who wouldn't want the thing that's aesthetic? 00:30:33.000 |
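For anyone who wants to try the released aesthetic checkpoint directly, it loads like any other diffusers pipeline. This is a minimal sketch: the repo id is assumed from the checkpoint name mentioned above, and the sampler settings are reasonable placeholders rather than an official recipe.

```python
import torch
from diffusers import DiffusionPipeline

# Repo id assumed from the name discussed above; check Playground's
# Hugging Face page for the exact identifier and recommended settings.
pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    torch_dtype=torch.float16,
).to("cuda")

# Fixing the seed gives the semi-repeatable generation discussed later:
# the same prompt + seed reproduces the same image on the same setup.
generator = torch.Generator("cuda").manual_seed(1234)
image = pipe(
    "a serene mountain lake at golden hour, dramatic lighting",
    num_inference_steps=50,
    guidance_scale=3.0,
    generator=generator,
).images[0]
image.save("playground_v2_sample.png")
```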
And it seems like the styles are tied to the model. 00:30:41.320 |
Can you maybe give listeners an overview of how that works? 00:30:45.120 |
Because in language, there's not this idea of like style, 00:30:52.640 |
and you cannot get certain styles in different models. 00:30:56.920 |
and how do you categorize them and find them? 00:30:59.160 |
- Yeah, I mean, it's so fun having a community 00:31:03.360 |
Like it's only been two days for Playground V2 00:31:05.880 |
and we actually don't know what the model's capable of 00:31:12.600 |
but we have yet to see what emergent behavior is. 00:31:28.640 |
there's some styles that are very like well-known to us, 00:31:30.560 |
like maybe like pixel art is a well-known style. 00:31:38.200 |
But there are some styles that cannot be easily named. 00:31:47.840 |
And in the end, you end up making up the name 00:31:56.320 |
And so if anyone that's into stable diffusion 00:31:58.920 |
and into building anything with graphics and stuff 00:32:03.240 |
you might've heard of like ProtoVision or DreamShaper, 00:32:09.200 |
But they're just, you know, invented by these authors, 00:32:14.960 |
- Because it like roughly embeds to what you want. 00:32:22.080 |
there's this one of my favorite ones that's fine-tuned. 00:32:30.160 |
It's got really great color contrast and visual elements. 00:32:38.960 |
I think that's like a very big open question with graphics 00:32:47.040 |
I don't know, it's like an evolving situation too, 00:32:52.160 |
It's like listening to the same style of pop song. 00:33:10.680 |
like the EDM genre alone has like sub genres. 00:33:16.160 |
and painting and art and anything that we're doing. 00:33:22.760 |
But I think they are emergent from the community, 00:33:29.680 |
coming back to this, like B2B versus B2C thing. 00:33:32.480 |
B2C, you're gonna have a huge amount of diversity 00:33:35.040 |
and then it's gonna reduce as you get towards 00:33:38.560 |
I'm making this up here, tell me if you disagree. 00:33:55.040 |
overly ambitious and like really scrutinizing 00:33:59.440 |
like what something is in its most nascent phase 00:34:01.440 |
that you miss the most ambitious thing you could have done. 00:34:09.600 |
can like kind of lead you to something amazing. 00:34:24.080 |
and then just kind of was still searching, you know, 00:34:29.760 |
I think that happens a lot with like Nobel prize people. 00:34:31.760 |
I think there's like a term for it that I forget. 00:34:34.200 |
I actually wanted to go after a toy almost intentionally. 00:34:42.040 |
I could imagine that it would lead to something 00:34:47.080 |
And so, yeah, it's a very, like I said, it's very hobbyist, 00:34:58.220 |
even if these hobbyists aren't likely to be the people 00:35:01.220 |
that, you know, have a way to monetize it or whatever, 00:35:04.080 |
even if they're, but they're doing it for fun. 00:35:08.700 |
But I agree with you that, you know, in time, 00:35:12.000 |
we will absolutely focus on more utilitarian things, 00:35:16.400 |
like things that are more related to editing feats 00:35:20.060 |
But, and so I think like a very simple use case is just, 00:35:26.000 |
I don't know if, I don't know if you guys are, 00:35:28.680 |
but it's sure, you know, it seems like very simple 00:35:31.080 |
that like you, if we could give you the ability 00:35:39.000 |
You know, like my wife the other day was set, you know, 00:35:46.840 |
where like we could make my son, his name's Devin, 00:35:53.040 |
You know, just being able to highlight his mouth 00:36:00.600 |
Little things like that, all the way to, you know, 00:36:03.920 |
putting you in completely different scenarios. 00:36:18.480 |
and it'll still like kind of look like crooked 00:36:21.720 |
Part of it's like, you know, the lips on the face are so, 00:36:24.360 |
there's such, there's such little information there. 00:36:26.720 |
It's so small that the models really struggle with it. 00:36:30.520 |
- Make the picture smaller and you won't see it. 00:36:32.360 |
- Wait, I think, I think that's my trick, I don't know. 00:36:37.200 |
and make it really big and then like say it's a mouth 00:36:43.120 |
more than it's doing something that kind of surprises you. 00:36:48.480 |
- It feels like you are very much the internal tastemaker. 00:36:56.200 |
Is it, do you find it hard to like communicate it 00:36:59.520 |
to like your team and, you know, other people? 00:37:10.140 |
Like images have such, like such high bit rate 00:37:15.700 |
And we don't have enough words to describe these things. 00:37:21.740 |
if they don't have good kind of like judgment taste 00:37:30.820 |
So in that realm, I don't worry too much, actually. 00:37:39.980 |
But I also have, you know, my own narrow taste. 00:37:43.220 |
I don't represent the whole population either. 00:38:07.420 |
- And then, so are there any other metrics that you like 00:38:11.980 |
I'm always looking for alternatives to vibes. 00:38:15.500 |
- You know, it might be fun to kind of talk about this 00:38:33.980 |
did the way that we benchmark actually succeed? 00:38:42.340 |
but all these benchmarks are just an approximation 00:38:47.420 |
And I think that's like very fascinating to me. 00:38:56.540 |
And so, you know, one of the benchmarks we did 00:38:58.340 |
was we did a, we kind of curated like a thousand prompts. 00:39:01.300 |
That's what we published in our blog post, you know, 00:39:05.140 |
a lot of them, some of them are curated by our team 00:39:09.340 |
Like my favorite prompt that no model's really capable of 00:39:32.740 |
Just to see if the models will figure it out. 00:39:39.620 |
And just like all these very interesting, weird, 00:39:44.120 |
and then we kind of like evaluate whether the models 00:39:47.140 |
And the reality is that they're all bad at it. 00:39:48.940 |
And so then you're just picking the most aesthetic image. 00:39:53.500 |
we're still at the beginning of building like our, 00:39:56.980 |
that aligns most with just user happiness, I think. 00:40:01.980 |
'Cause we're not, we're not like putting these in papers 00:40:03.780 |
and trying to like win, you know, I don't know, 00:40:05.900 |
awards at ICCV or something if they have awards. 00:40:18.100 |
I think we're still evolving whatever our benchmarks are. 00:40:20.460 |
So the first benchmark was just like very difficult tasks 00:40:28.980 |
And then can we ask the users, like, how do we do? 00:40:31.900 |
And then we wanted to use a benchmark like PartiPrompts 00:40:37.740 |
could measure their models against ours versus others. 00:40:50.880 |
and then you try to see like what users make. 00:40:52.980 |
And I think my sense is that we're gonna take all the things 00:40:55.220 |
that we noticed that the users kind of were failing at 00:40:58.060 |
and try to find like new ways to measure that, 00:41:07.900 |
we have users making millions of images every single day. 00:41:15.900 |
- And they go for like a post-generation feedback. 00:41:21.500 |
We can just say like, how good was the lighting here? 00:41:33.700 |
and then we say, and then maybe randomly you just say, 00:41:52.200 |
as opposed to just like benchmark performance. 00:41:54.140 |
Hopefully next year, I think we will try to publish 00:41:56.940 |
kind of like a benchmark that anyone could use, 00:42:01.640 |
that we evaluate ourselves on and that other people can, 00:42:08.420 |
because we've tried it and done it and noticed that it did. 00:42:38.720 |
something I've been looking out for the entire year 00:42:45.320 |
- Or Ideogram, I think, came out recently, 00:42:48.440 |
which had decent but not perfect text and images. 00:43:06.200 |
'Cause I don't see any of that in like your sample. 00:43:09.200 |
Yeah, the V2 model was mostly focused on image quality 00:43:21.940 |
- Yeah, I'm very excited about text synthesis 00:43:23.520 |
and yeah, I think Ideogram has done a good job 00:43:28.080 |
DALL-E kind of has like a, it has like a hit rate. 00:43:33.520 |
I think where this has to go is it has to be like, 00:43:36.620 |
you could like write little tiny pieces of text 00:43:41.880 |
- That's maybe not even the focal point of a scene. 00:43:56.360 |
So I think text synthesis would be very exciting. 00:43:59.160 |
And then also flag that Max Woolf, minimaxir, 00:44:04.960 |
He's done a lot of stuff about using like logo masks 00:44:21.720 |
the open source community is that you get things 00:44:23.600 |
like control net and then you see all these people 00:44:25.880 |
do these just amazing things with control net 00:44:28.360 |
and then you wonder, I think from our point of view, 00:44:33.400 |
but how do we end up with like a unified model 00:44:42.560 |
- And so they need these kinds of like work around 00:44:45.720 |
work around research ideas to get there, but yeah. 00:44:55.440 |
We kept the Playground v2 exactly the same as SDXL, 00:45:01.720 |
we knew that the community already had tools. 00:45:08.600 |
and then, you know, retrain a control net for it. 00:45:18.080 |
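Because the architecture matches SDXL, community tooling built for SDXL-shaped UNets can at least be wired up against it; whether a particular ControlNet needs retraining for the new weights is a separate question, as noted above. A rough sketch with diffusers, where the repo ids and image path are placeholders:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Placeholder ids: an SDXL canny ControlNet plugged into a Playground v2 base.
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edges = load_image("canny_edges.png")  # precomputed edge map (placeholder path)
image = pipe("a futuristic living room, soft light", image=edges).images[0]
```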
I don't want to DDoS you with topics, but okay. 00:45:21.200 |
I was basically going to go over three more categories. 00:45:34.000 |
I don't know if you care to comment on any of those. 00:45:36.440 |
- The NSFW kind of like safety stuff is really important. 00:45:39.840 |
Part of, I kind of think that one of the biggest risks 00:45:44.360 |
kind of going into maybe the U.S. election year 00:45:53.760 |
I think it's going to be very hard to explain, 00:46:00.880 |
And our world is like sometimes very, you know, 00:46:05.520 |
Some people are like, there's still lots of humanity 00:46:09.320 |
And I think it's going to be very hard to explain, 00:46:16.200 |
I saw President Biden say this thing on a video. 00:46:19.800 |
You know, I can't believe, you know, he said that. 00:46:22.960 |
I think that's going to be a very troubling thing 00:46:25.440 |
going into the world next year, the year after. 00:46:29.720 |
- Oh, I didn't, that's more like a risk thing. 00:46:35.800 |
But there's just, there's a lot of studies on how, 00:46:42.080 |
you don't want to train on not safe for work images, 00:46:44.480 |
except that it makes you really good at bodies. 00:46:51.040 |
we filter out NSFW type of images in our data set 00:46:55.760 |
so that it's, you know, so our safety filter stuff 00:46:59.640 |
- But you've heard this argument that it gets, 00:47:04.160 |
not safe for work images are very good at human anatomy, 00:47:11.280 |
it's not like necessarily a bad thing to train on that data. 00:47:16.160 |
That's why I was kind of talking about safety. 00:47:25.280 |
suddenly like you can kind of imagine, you know, 00:47:27.840 |
now if you can like generate nudes and then there's like, 00:47:30.040 |
you can do very character consistent things with faces, 00:47:37.600 |
Even if you train on, let's say, you know, nude data, 00:47:42.280 |
there's nothing wrong with the human anatomy. 00:47:47.200 |
but then it's kind of like, how does that get used? 00:47:49.440 |
And, you know, I won't bring up all of the very, 00:48:00.320 |
And so we, you know, we just recently did like a big sprint 00:48:05.760 |
and it's very difficult with graphics and art, right? 00:48:08.560 |
Because there is tasteful art that has nudity, right? 00:48:19.920 |
there's the things that are the gray line of that. 00:48:23.960 |
someone might be like, that is completely tasteful, right? 00:48:26.840 |
And then there's things that are way over the line. 00:48:29.880 |
And then there are things that are, you know, 00:48:31.400 |
maybe you or, you know, maybe I would be okay with, 00:48:45.440 |
if a child goes to your site, scrolls down some images, 00:48:48.920 |
you know, classrooms of kids, you know, using our product, 00:49:02.160 |
Another favorite topic of our listeners is UX and AI. 00:49:21.880 |
so they can kind of have semi-repeatable generation. 00:49:25.080 |
You also have, yeah, you can pick how many images, 00:49:28.840 |
and then you leave all of them in the canvas, 00:49:33.720 |
the generation box, and you can even cross between them 00:49:45.360 |
You know, you're like, these are all the tools for you. 00:49:52.600 |
I think we think that we're also trying to like re-imagine 00:50:02.240 |
So, you know, I don't think we're trying to build Photoshop, 00:50:07.960 |
that people are, you know, largely familiar with. 00:50:11.520 |
I think, you know, I don't think you would think 00:50:16.840 |
you wouldn't think, what would Photoshop compare itself 00:50:44.440 |
You have to wait right now for the time being, 00:50:46.920 |
but the wait is worth it often for a lot of people 00:50:50.600 |
because they can't make that with their own skills. 00:50:56.560 |
which was kind of looking at GPT-3's Playground, 00:51:11.440 |
these prompt boxes are like a terminal window, right? 00:51:15.080 |
We're kind of at this weird point where it's just like CLI. 00:51:20.400 |
and I memorized the keywords, like dir, ls, 00:51:27.560 |
- The shirt I'm wearing, you know, it's a bug, 00:51:33.120 |
which weights the word token more in the model or whatever. 00:51:42.880 |
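The parenthesis trick referenced here is a community convention (popularized by UIs like AUTOMATIC1111) where something like "(dramatic lighting:1.3)" tells the frontend to up-weight that span's tokens before conditioning. A toy parser for that syntax, purely to illustrate the convention rather than any particular product's implementation:

```python
import re

def parse_weighted_prompt(prompt: str):
    """Split a prompt into (text, weight) chunks: '(span:1.3)' up-weights that
    span, everything else defaults to 1.0. Illustrative only."""
    pattern = re.compile(r"\(([^:()]+):([0-9.]+)\)")
    chunks, last = [], 0
    for match in pattern.finditer(prompt):
        if match.start() > last:
            chunks.append((prompt[last:match.start()], 1.0))
        chunks.append((match.group(1), float(match.group(2))))
        last = match.end()
    if last < len(prompt):
        chunks.append((prompt[last:], 1.0))
    return chunks

print(parse_weighted_prompt("a portrait, (dramatic lighting:1.3), 35mm film"))
# [('a portrait, ', 1.0), ('dramatic lighting', 1.3), (', 35mm film', 1.0)]
```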
I think a large portion of humanity would agree 00:51:53.840 |
if I wanted to get rid of like the headphones on my head, 00:51:57.640 |
and then say, you know, can you remove the headphones? 00:52:00.640 |
You know, if I want to grow the, expand the image, 00:52:03.240 |
it should, you know, how can we make that feel easier 00:52:06.240 |
without typing lots of words and being really confused? 00:52:11.400 |
I don't even think we've nailed the UI/UX yet. 00:52:24.000 |
And whatever felt like the right UX six months ago 00:52:29.600 |
And so that's a little bit of how we got there, 00:52:34.920 |
is kind of saying, does everything have to be 00:52:47.720 |
which DALL-E 3 just does, it doesn't let you decide 00:52:56.400 |
- Yeah, for that feature, I think we'll probably, 00:53:14.840 |
it's in a Discord bot somewhere with Midjourney, 00:53:19.240 |
One of the differentiators I think we provide 00:53:21.480 |
is at the expense of just lots of users necessarily, 00:53:26.480 |
mainstream consumers, is that we provide as much like power 00:53:29.800 |
and tweakability and configurability as possible. 00:53:35.000 |
because we know that users might want to use it 00:53:39.480 |
There are some really powerful power user hobbyists 00:53:45.940 |
you know, just want something that looks cool, 00:53:50.080 |
And so I think a lot of Playground is more about 00:53:53.040 |
going after that core user base that like knows, 00:54:01.520 |
So they might not use like these users probably, 00:54:08.360 |
And so I think that like, as the models get more powerful, 00:54:20.360 |
just as like powerful and configurable as Photoshop. 00:54:24.400 |
And you might have to master a new kind of tool. 00:54:28.720 |
There's so many things I could bounce off of that. 00:54:40.760 |
Consistency models have been blowing up the past month. 00:54:45.640 |
Is that, like, how do you think about integrating that? 00:54:50.040 |
also trying to beat you to that space as well. 00:54:52.960 |
- I think we were the first company to integrate it. 00:54:58.600 |
that have kind of tried to do like interactive editing 00:55:09.160 |
We have a different feature that's like unique 00:55:11.320 |
in our product that's called preview rendering. 00:55:18.760 |
we're like, what is the most common use case? 00:55:20.120 |
The most common use case is you write a prompt 00:55:23.180 |
But what's the most annoying thing about that? 00:55:31.480 |
So we did something that seemed a lot simpler 00:55:34.320 |
but a lot more relevant to how users already use this 00:55:38.240 |
You toggle it on and it will show you a render of the image. 00:55:40.560 |
And then it's just like, graphics tools already have this. 00:55:44.840 |
Like if you use Cinema 4D or After Effects or something, 00:55:52.280 |
in the real world that has familiarity and say, 00:56:05.300 |
just like pulling down the slot machine lever. 00:56:23.800 |
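The preview-rendering idea leans on few-step samplers like latent consistency models, which trade some fidelity for near-instant drafts. Here is a minimal sketch of that kind of few-step generation using the publicly released LCM-LoRA for SDXL; this is illustrative and not Playground's internal implementation:

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# The LCM-LoRA distills the base model so a handful of steps yields a usable draft.
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

preview = pipe(
    "an isometric diorama of a tiny workshop",
    num_inference_steps=4,   # few steps = fast, rough preview
    guidance_scale=1.0,      # LCM works best with little or no CFG
).images[0]
```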
well the demos I've been seeing it's also, I guess, 00:56:30.000 |
They're almost using it to animate their generations, 00:56:39.400 |
but they're sort of showing like if I move a moon, 00:56:55.400 |
like how about just the general ecosystem of LoRAs, right? 00:57:01.760 |
Civitai is obviously the most popular repository of LoRAs. 00:57:15.280 |
but the person that brought LoRAs to Stable Diffusion 00:57:30.160 |
Obviously fine-tuning all these DreamBooth models 00:57:35.480 |
And giving, and it's obvious in our conversation 00:58:01.340 |
that goes beyond just typing Greg Rutkowski in a prompt. 00:58:08.180 |
It's not like users want to type these real artist names. 00:58:11.200 |
It's that they don't know how else to get an image 00:58:18.040 |
And they provide it in a very nice scalable way. 00:58:21.320 |
I hope that we find something even better than LoRAs 00:58:46.600 |
I think when we have a significantly better model, 00:59:11.920 |
you know, five or six reference images, right? 00:59:24.220 |
but they're actually like, it's like a mood board, right? 00:59:27.560 |
And it takes, you have to be kind of an engineer almost 00:59:33.980 |
It seems like it'd be much better if I could say, 00:59:43.640 |
And you tell the model, like, this is what I want. 00:59:45.680 |
And the model gives you something that's very aligned 00:59:48.320 |
with what your style is, what you're talking about. 00:59:50.320 |
And it's a style you couldn't even communicate, right? 00:59:54.400 |
you know, if you have a Tron image, it's not just Tron, 00:59:59.480 |
- Yeah, even cyberpunk can have its like sub-genre, right? 01:00:03.360 |
But I just think training LoRAs and doing that 01:00:05.680 |
is very heavy, so I hope we can do better than that. 01:00:09.640 |
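For reference, this is roughly what consuming one of those community style LoRAs looks like today with diffusers; the directory and file name are placeholders for whatever gets downloaded from a hub like Civitai:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Placeholder path/filename for an SDXL-format style LoRA downloaded separately.
pipe.load_lora_weights("path/to/loras", weight_name="my_style_lora.safetensors")

image = pipe("a rainy city street at night, neon reflections").images[0]
```

The heaviness being described is on the training side, curating reference images, fine-tuning, and distributing the file, rather than this loading step.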
- We have Sharif from Lexica on the podcast before. 01:00:22.820 |
- Yeah, yeah, is that something you see more and more of 01:00:25.880 |
in terms of like coming up with these styles? 01:00:27.660 |
Is that why you have that as the starting point 01:00:30.540 |
versus a lot of other products, you just go in, 01:00:36.160 |
- Our feed is a little different than their feed. 01:00:41.000 |
So we have kind of like a Reddit thing going on 01:00:43.800 |
where it's a kind of a competition like every day, 01:00:51.460 |
And there's just this wonderful community of people 01:00:55.120 |
and just showing their genuine interest in each other. 01:00:58.440 |
And I think we definitely learn about styles that way. 01:01:06.640 |
they'll sometimes put these polls out and they'll say, 01:01:08.400 |
you know, what do you wish you could like learn more from? 01:01:10.040 |
And like one of the things that people vote the most for 01:01:17.520 |
if you put away your research hat for a minute 01:01:19.560 |
and you just put on like your product hat for a second, 01:01:23.000 |
why do people want to learn how to prompt, right? 01:01:25.400 |
It's because they want to get higher quality images. 01:01:38.300 |
and it gives all of the users a way to learn how to prompt 01:01:43.300 |
because they're just seeing this huge rising tide 01:01:47.300 |
of all these images that are super cool and interesting 01:01:49.980 |
and they can kind of like take each other's prompts 01:01:57.180 |
because I think the complexity of these things 01:02:16.480 |
What have you learned running DevOps for GPUs? 01:02:19.360 |
You had a tweet about like how many A100s you have, 01:02:34.960 |
I find the DevOps for inference to be relatively easy. 01:02:42.360 |
I think we had thousands and thousands of servers 01:02:47.720 |
had such huge quantities of volume that I didn't find it. 01:02:58.660 |
So I think that I find that very difficult at the moment. 01:03:05.820 |
Scaling a training cluster is much, much harder 01:03:12.980 |
- Well, it's just like a very large distributed system 01:03:21.820 |
and then you have to somehow be resilient to that. 01:03:23.560 |
And I would say training infra software is very early. 01:03:29.820 |
I can tell in 10 years, it would be a lot better. 01:03:35.180 |
I think we use very basic tools like Slurm for scheduling 01:03:43.780 |
I think I talked to a friend that's over at XAI. 01:03:45.740 |
They just, they like built their own scheduler 01:03:51.900 |
because the existing open source stuff doesn't work 01:03:54.000 |
and everyone's doing their own bespoke thing, 01:03:55.600 |
you know there's a valuable company to be formed. 01:04:01.360 |
- Well, with Mosaic, yeah, it's tough with Mosaic 01:04:03.680 |
'cause anyway, I won't go into the details why, 01:04:13.160 |
Perhaps it's still, I just think it's nascent 01:04:30.120 |
what's the most interesting unsolved question in AI 01:04:42.580 |
I mean, you're a founder, you're a repeat founder. 01:04:51.620 |
The only thing that I, I don't have an idea per se 01:05:00.820 |
Right now, we sort of think that a lot of the modalities 01:05:04.600 |
just kind of feel like they're vision, language, audio, 01:05:11.880 |
And somehow all this will like turn into something, 01:05:14.420 |
it'll be multimodal and then we'll end up with AGI perhaps. 01:05:18.740 |
And I just think that there are probably far more modalities 01:05:25.540 |
And it just seems hard for us to see it right now 01:05:28.760 |
because it's sort of like we have tunnel vision 01:05:36.660 |
- I think we are lacking imagination as a species 01:05:40.680 |
And I think like, you know, just like, you know, 01:05:43.580 |
it's not, I don't know what company would form 01:05:49.420 |
like just like a true actual, like not a meta world model, 01:05:52.940 |
but an actual world model that truly maps everything 01:05:56.860 |
that's going in terms of like physics and fluids 01:06:04.660 |
like a true physics foundation model of sorts 01:06:09.040 |
And that in of itself seems very difficult, you know, 01:06:13.060 |
but we just think of, but we're kind of stuck on like 01:06:17.020 |
with like, you know, a word or a token, if you will. 01:06:20.820 |
And I went, you know, I had a dinner last night 01:06:22.300 |
where we were kind of debating this philosophically. 01:06:24.580 |
And I think someone, you know, said something 01:06:27.780 |
at the end of the day, it doesn't really matter 01:06:31.180 |
At the end of the day, it's just like some, you know, 01:06:36.100 |
But, you know, I do wonder if there are more, 01:06:42.520 |
And if you could create that, then what would that, 01:06:48.940 |
So I don't know yet, so I don't have a great company for it. 01:06:53.540 |
Maybe you would just inspire somebody to try. 01:06:57.720 |
- My personal response to that is I'm less interested 01:07:04.220 |
Because that is teleportation, that is immortality, 01:07:22.180 |
If I were to take a Bill Gates book trip and had a week, 01:07:32.700 |
You shouldn't take a book, you should just go to YouTube 01:07:35.540 |
and visit Karpathy's class and just do it, do it, 01:07:41.820 |
That's actually the most useful thing for you? 01:07:43.300 |
- I wish it came out when I started back last year. 01:07:49.460 |
at the beginning, but I did do a few of his classes 01:07:53.980 |
I don't think books, every time I buy a programming book, 01:08:02.300 |
- Yeah, so more generally, advice for founders 01:08:04.820 |
who are not PhDs and are effectively self-taught 01:08:07.420 |
like you are, what should they do, what should they avoid? 01:08:11.000 |
- Same thing that I would advise if you're programming. 01:08:14.100 |
Pick a project that seems very exciting to you, 01:08:19.060 |
and build it and learn every detail of it while you do it. 01:08:24.740 |
Or can you go far enough not training, just fine-tuning? 01:08:29.180 |
- It depends, I would just follow your curiosity. 01:08:33.660 |
that requires fundamental understanding of training models, 01:08:37.820 |
You don't have to be a PhD, you don't have to get 01:08:44.700 |
If it's not necessary, then go as far as you need to go, 01:08:46.860 |
but I would learn by picking something that motivates you.